Jia Zeng

Suzhou University, Suchow, Anhui Sheng, China

Publications (58) · 82.08 Total Impact

  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: This state-of-the-art book describes important advances in type-2 fuzzy systems that have been made in the past decade for real-world pattern recognition problems, such as speech recognition, handwriting recognition, and topic modeling. The success of type-2 fuzzy sets has been largely attributed to their three-dimensional membership functions, which handle both the randomness and the fuzziness uncertainties in real-world problems. In pattern recognition, both features and models have uncertainties, such as nonstationary babble noise in speech signals, large variations in handwritten Chinese character shapes, the uncertain meaning of words in topic modeling, and uncertain model parameters caused by insufficient and noisy training data. All these uncertainties motivate us to integrate type-2 fuzzy sets with probabilistic graphical models to achieve better overall performance in terms of robustness, generalization ability, and recognition accuracy. For example, we integrate type-2 fuzzy sets with graphical models such as Gaussian mixture models, hidden Markov models, Markov random fields, and latent Dirichlet allocation-based topic models for pattern recognition. The type-2 fuzzy Gaussian mixture models can describe uncertain densities of observations. The type-2 fuzzy hidden Markov models incorporate the first-order Markov chain into the type-2 fuzzy Gaussian mixture models, which is suitable for modeling uncertain speech signals under babble noise. The type-2 fuzzy Markov random fields combine type-2 fuzzy sets with Markov random fields and can handle large variations in structural patterns such as handwritten Chinese characters. The type-2 fuzzy topic models focus on the uncertain mixed membership of words in different topical clusters, which effectively partitions the observed (visual) words into semantically meaningful topical themes. In conclusion, these real-world pattern recognition applications demonstrate the effectiveness of type-2 fuzzy graphical models for handling uncertainties.
    Book · Jan 2015
  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision, and computational biology. We first introduce a novel inference algorithm, called belief propagation (BP), for learning LDA, and then show how to speed up BP for fast topic modeling tasks. Following the “bag-of-words” (BOW) representation for video sequences, this chapter also introduces novel type-2 fuzzy topic models (T2 FTM) to recognize human actions. In traditional topic models (TM) for visual recognition, each video sequence is modeled as a “document” composed of spatial–temporal interest points called visual “words”. Topic models automatically assign a “topic” label to explain the action category of each word, so that each video sequence becomes a mixture of action topics for recognition. The T2 FTM differs from previous TMs in that it uses type-2 fuzzy sets (T2 FS) to encode the semantic uncertainty of each topic. We can use the primary membership function (MF) to measure the degree of uncertainty that a document or a visual word belongs to a specific action topic, and the secondary MF to evaluate the fuzziness of the primary MF itself. In this chapter, we implement two T2 FTMs: (1) the interval T2 FTM (IT2 FTM), with all secondary grades equal to one, and (2) the vertical-slice T2 FTM (VT2 FTM), with unequal secondary grades based on our prior knowledge. To estimate parameters in T2 FTMs, we derive an efficient message-passing algorithm. Experiments on the KTH and Weizmann human action data sets demonstrate that T2 FTMs are better than TMs at encoding visual word uncertainties for human action recognition.
    Chapter · Jan 2015
  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: This chapter introduces probabilistic graphical models as a statistical–structural pattern recognition paradigm. Many pattern recognition problems can be posed as labeling problems to which the solution is a set of linguistic labels assigned to extracted features from speech signals, image pixels, and image regions. Graphical models use Markov properties to measure a local probability on the labels within the neighborhood system. The Bayesian decision theory guarantees the best labeling configuration according to the maximum a posteriori criterion.
    Chapter · Jan 2015
  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: This chapter summarizes the book and envisions future work.
    Chapter · Jan 2015
  • Jia Zeng · Zhi-Qiang Liu · Xiao-Qin Cao
    ABSTRACT: The expectation-maximization (EM) algorithm can compute the maximum-likelihood (ML) or maximum a posteriori (MAP) point estimates of mixture models or latent variable models such as latent Dirichlet allocation (LDA), which has been one of the most popular probabilistic topic modeling methods in the past decade. However, batch EM has high time and space complexities when learning big LDA models from big data streams. In this paper, we present a fast online EM (FOEM) algorithm that infers the topic distribution of previously unseen documents incrementally with constant memory requirements. Within the stochastic approximation framework, we show that FOEM can converge to a local stationary point of the LDA likelihood function. Through dynamic scheduling for fast speed and parameter streaming for low memory usage, FOEM is more efficient for some lifelong topic modeling tasks than the state-of-the-art online LDA algorithms in handling both big data and big models (aka big topic modeling) on just a PC.
    Article · Jan 2015 · IEEE Transactions on Knowledge and Data Engineering
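The constant-memory, incremental update that the abstract describes can be sketched as a generic online EM loop in the stochastic-approximation style. All names here, the step-size schedule, and the inner fixed-point E-step are illustrative assumptions, not the paper's actual FOEM implementation:

```python
import numpy as np

def foem_stream_sketch(batches, K, W, alpha=0.1, beta=0.01, kappa=0.7, tau0=10.0):
    """Online EM sketch for topic models: only (K, W) global statistics persist,
    so memory stays constant no matter how many documents stream past."""
    stats = np.ones((K, W))                           # global topic-word statistics
    for t, batch in enumerate(batches):
        rho = (tau0 + t) ** (-kappa)                  # Robbins-Monro step size
        phi = stats + beta
        phi = phi / phi.sum(axis=1, keepdims=True)    # current topic-word estimate
        batch_stats = np.zeros((K, W))
        for doc in batch:                             # doc: (W,) word-count vector
            theta = np.full(K, 1.0 / K)
            for _ in range(10):                       # inner E-step fixed-point sweeps
                resp = phi * theta[:, None]           # (K, W) unnormalized responsibilities
                resp /= resp.sum(axis=0, keepdims=True) + 1e-12
                theta = (resp * doc).sum(axis=1) + alpha
                theta /= theta.sum()
            batch_stats += resp * doc                 # expected topic-word counts
        # interpolate old and new statistics, then discard the batch
        stats = (1.0 - rho) * stats + rho * batch_stats
    return stats
```

Because each batch is dropped after its statistics are folded in, the loop processes an unbounded stream with a fixed memory budget, which is the property the abstract emphasizes.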
  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: This chapter integrates type-2 fuzzy sets (T2 FSs) with Markov random fields (MRFs), referred to as T2 FMRFs, which may handle both fuzziness and randomness in the structural pattern representation. On the one hand, the T2 membership function (MF) has a three-dimensional structure in which the primary MF describes randomness and the secondary MF evaluates the fuzziness of the primary MF. On the other hand, MRFs can represent patterns statistical-structurally in terms of the neighborhood system ∂i and the clique potentials V_c, and thus have been widely applied to image analysis and computer vision. In the proposed T2 FMRFs, we define the same neighborhood system as in classical MRFs. To describe uncertain structural information in patterns, we derive the fuzzy likelihood clique potentials from T2 fuzzy Gaussian mixture models (T2 FGMMs). The fuzzy prior clique potentials are penalties for mismatched structures based on prior knowledge. Because Chinese characters have hierarchical structures, we use T2 FMRFs to model character structures in a handwritten Chinese character recognition (HCCR) system. The overall recognition rate is 99.07%, which confirms the effectiveness of T2 FMRFs for statistical character structure modeling.
    Chapter · Jan 2015
  • Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: This chapter extends hidden Markov models (HMMs) to type-2 fuzzy HMMs (T2 FHMMs). We derive the T2 fuzzy forward-backward algorithm and Viterbi algorithm using T2 FS operations. To investigate the effectiveness of T2 FHMMs, we apply them to phoneme classification and recognition on the TIMIT speech database. Experimental results show that T2 FHMMs can effectively handle noise and dialect uncertainties in speech signals and achieve better classification performance than classical HMMs. We also find that T2 FHMMs with uncertain mean vectors and a larger footprint of uncertainty (FOU) perform better in classification when the signal-to-noise ratio is lower.
    Chapter · Jan 2015
  • Lei Xie · Jia Zeng · Zhi-Qiang Liu
    ABSTRACT: The past decade has seen rapid development of probabilistic topic models, notably probabilistic latent semantic indexing (PLSI) and latent Dirichlet allocation (LDA). Originally, topic modeling methods were used to find thematic word clusters, called topics, in a collection of documents. Since bag-of-words (BOW) representations have been widely extended to represent both images and videos, topic modeling techniques have found many important applications in the multimedia area. Typical examples include natural scene categorization, human action recognition, multi-label image annotation, part-of-speech annotation, topic identification, and spoken document segmentation. The advantage of topic models lies in their elegant graphical representations and efficient approximate inference algorithms. Meanwhile, many real-world systems use topic modeling methods to automate the feature engineering job. However, different applications require investigating different ...
    Article · Dec 2014 · Soft Computing
  •
    ABSTRACT: Latent Dirichlet allocation (LDA) is a popular topic modeling method which has found many multimedia applications, such as motion analysis and image categorization. Communication cost is one of the main bottlenecks for large-scale parallel learning of LDA. To reduce communication cost, we introduce Zipf's law and propose novel parallel LDA algorithms that communicate only partial important information at each learning iteration. The proposed algorithms are much more efficient than the current state-of-the-art algorithms in both communication and computation costs. Extensive experiments on large-scale data sets demonstrate that our algorithms can greatly reduce communication and computation costs, achieving better scalability.
    Article · Jan 2014 · Soft Computing
  •
    ABSTRACT: To solve the big topic modeling problem, we need to reduce both the time and space complexities of batch latent Dirichlet allocation (LDA) algorithms. Although parallel LDA algorithms on multi-processor architectures have low time and space complexities, their communication costs among processors often scale linearly with the vocabulary size and the number of topics, leading to a serious scalability problem. To reduce the communication complexity among processors for better scalability, we propose a novel communication-efficient parallel topic modeling architecture based on a power law, which consumes orders of magnitude less communication time when the number of topics is large. We combine the proposed communication-efficient parallel architecture with the online belief propagation (OBP) algorithm, referred to as POBP, for big topic modeling tasks. Extensive empirical results confirm that POBP has the following advantages for solving the big topic modeling problem: 1) high accuracy, 2) communication efficiency, 3) fast speed, and 4) constant memory usage when compared with recent state-of-the-art parallel LDA algorithms on multi-processor architectures.
    Article · Nov 2013
  •
    ABSTRACT: For clustering biomedical documents, we can consider three different types of information: the local-content (LC) information from documents, the global-content (GC) information from the whole MEDLINE collection, and the medical subject heading (MeSH)-semantic (MS) information. Previous methods for clustering biomedical documents, which use only one or two types of information, are not necessarily effective at integrating different types of information. Recently, the performance of MEDLINE document clustering has been enhanced by linearly combining the LC and MS information. However, the simple linear combination can be ineffective because of the limitation of the representation space for combining different types of information (similarities) with different reliability. To overcome this limitation, we propose a new semisupervised spectral clustering method, SSNCut, for clustering over the LC similarities with two types of constraints: must-link (ML) constraints on document pairs with high MS (or GC) similarities and cannot-link (CL) constraints on those with low similarities. We empirically demonstrate the performance of SSNCut on MEDLINE document clustering using 100 data sets of MEDLINE records. Experimental results show that SSNCut outperformed a linear combination method and several well-known semisupervised clustering methods, with statistically significant differences. Furthermore, SSNCut with constraints from both MS and GC similarities outperformed SSNCut with constraints from only one type of similarity. Another interesting finding was that ML constraints worked more effectively than CL constraints, since around 10% of the CL constraints were incorrect, whereas this number was only 1% for ML constraints.
    Article · Aug 2013 · IEEE Transactions on Cybernetics
  •
    ABSTRACT: Probabilistic latent semantic analysis (PLSA) is a topic model for text documents that has been widely used in text mining, computer vision, computational biology, and so on. For batch PLSA inference algorithms, the required memory size grows linearly with the data size, and handling massive data streams is very difficult. To process big data streams, we propose an online belief propagation (OBP) algorithm based on an improved factor graph representation for PLSA. The factor graph of PLSA facilitates the classic belief propagation (BP) algorithm. Furthermore, OBP splits the data stream into a set of small segments and uses the estimated parameters of previous segments to calculate the gradient descent of the current segment. Because OBP removes each segment from memory after processing, it is memory-efficient for big data streams. We examine the performance of OBP on four document data sets and demonstrate that OBP is competitive with online expectation-maximization (OEM) for PLSA in both speed and accuracy, and can also give a more accurate topic evolution. Experiments on massive data streams from Baidu further confirm the effectiveness of the OBP algorithm.
    Article · Aug 2013 · Frontiers of Computer Science
  • Jia Zeng · William K. Cheung · Jiming Liu
    ABSTRACT: Latent Dirichlet allocation (LDA) is an important hierarchical Bayesian model for probabilistic topic modeling, which attracts worldwide interest and touches on many important applications in text mining, computer vision, and computational biology. This paper represents collapsed LDA as a factor graph, which enables the classic loopy belief propagation (BP) algorithm to be used for approximate inference and parameter estimation. Although the two commonly used approximate inference methods, variational Bayes (VB) and collapsed Gibbs sampling (GS), have had great success in learning LDA, the proposed BP is competitive in both speed and accuracy, as validated by encouraging experimental results on four large-scale document data sets. Furthermore, the BP algorithm has the potential to become a generic scheme for learning variants of LDA-based topic models in the collapsed space. To this end, we show how to learn two typical variants of LDA-based topic models, the author-topic model (ATM) and the relational topic model (RTM), using BP based on their factor graph representations.
    Article · May 2013 · IEEE Transactions on Software Engineering
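As a rough illustration of the factor-graph view, the sketch below performs one synchronous sweep of message updates for collapsed LDA: each (document, word) pair carries a message over topics, and its update combines document-side and word-side statistics with the pair's own contribution excluded. The dense arrays and the exact update form are simplifying assumptions for readability (real corpora are sparse), not the authors' code:

```python
import numpy as np

def bp_iteration(X, mu, alpha, beta):
    """One synchronous BP sweep for collapsed LDA (dense sketch).
    X: (D, W) word counts; mu: (D, W, K) per-(doc, word) topic messages."""
    D, W = X.shape
    weighted = mu * X[:, :, None]              # messages weighted by word counts
    theta = weighted.sum(axis=1)               # (D, K) document-topic totals
    phi = weighted.sum(axis=0)                 # (W, K) word-topic totals
    phi_norm = phi.sum(axis=0)                 # (K,) per-topic totals
    new_mu = np.empty_like(mu)
    for d in range(D):
        for w in range(W):
            if X[d, w] == 0:
                new_mu[d, w] = mu[d, w]
                continue
            excl = X[d, w] * mu[d, w]          # exclude this pair's own contribution
            t = theta[d] - excl + alpha        # document-side belief
            p = (phi[w] - excl + beta) / (phi_norm - excl + beta * W)  # word-side belief
            m = t * p
            new_mu[d, w] = m / m.sum()         # normalize over topics
    return new_mu
```

Iterating this sweep to a fixed point yields topic proportions (from `theta`) and topic-word distributions (from `phi`), in the spirit of the collapsed-space inference the abstract describes.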
  • Juan Yang · Jia Zeng · William K. Cheung
    ABSTRACT: Multiplex document networks have multiple types of links such as citation and coauthor links between scientific papers. Inferring thematic topics from multiplex document networks requires quantifying and balancing the influence from different types of links. It is therefore a problem of considerable interest and represents significant challenges. To address this problem, we propose a novel multiplex topic model (MTM) that represents the topic influence from different types of links using a factor graph. To estimate parameters in MTM, we also develop an approximate inference algorithm, multiplex belief propagation (MBP), which can estimate the influence weights of multiple links automatically at each learning iteration. Experimental results confirm the superiority of MTM in two applications, document clustering and link prediction, when compared with several state-of-the-art link-based topic models.
    Chapter · Apr 2013
  • Jia Zeng · Sridhar Hannenhalli
    ABSTRACT: Background: Gene duplication, followed by the functional evolution of duplicate genes, is a primary engine of evolutionary innovation. In turn, gene expression evolution is a critical component of the overall functional evolution of paralogs. Inferring the evolutionary history of gene expression among paralogs is therefore a problem of considerable interest. It also represents significant challenges. The standard approaches of evolutionary reconstruction assume that at an internal node of the duplication tree, the two duplicates evolve independently. However, because of various selection pressures, the functional evolution of the two paralogs may be coupled. The coupling of paralog evolution corresponds to three major fates of gene duplicates: subfunctionalization (SF), conserved function (CF), or neofunctionalization (NF). Quantitative analysis of these fates is of great interest and clearly influences the evolutionary inference of expression. These two interrelated problems of inferring gene expression and the evolutionary fates of gene duplicates have not been studied together previously and motivate the present study. Results: Here we propose a novel probabilistic framework and algorithm to simultaneously infer (i) ancestral gene expression and (ii) the likely fate (SF, NF, CF) at each duplication event during the evolution of a gene family. Using tissue-specific gene expression data, we develop a nonparametric belief propagation (NBP) algorithm to predict the ancestral expression level as a proxy for function, and describe a novel probabilistic model that relates the predicted and known expression levels to the possible evolutionary fates. We validate our model using simulation and then apply it to a genome-wide set of gene duplicates in human. Conclusions: Our results suggest that SF tends to be more frequent at the earlier stages of gene family expansion, while NF occurs more frequently later on.
    Article · Jan 2013 · BMC Genomics
  •
    ABSTRACT: This paper reviews important advances that have been made in the past decade in topic modeling of large-scale document network data. Interest in topic modeling is worldwide and touches a number of practical text mining, computer vision, and computational biology systems that are important in text summarization, information retrieval, information recommendation, topic detection and tracking, natural scene understanding, human motion categorization, and microarray gene expression analysis. The main focus of this review is on recent advances in topic modeling techniques for document network data. We introduce the four major characteristics of document network data and the current state-of-the-art topic models, with descriptions of what they are, what has been accomplished, and what remains to be done. Document network data contain dynamic, higher-order, multiplex, and distributed structures. Prior efforts on topic models focus on modeling parts of these structures for topic detection and tracking. To handle all document network structures, we discuss a three-dimensional Markov model that handles dynamic, higher-order, multiplex, and distributed structures within a unified framework. In addition, we discuss the integration of three-dimensional Markov models with type-2 fuzzy logic systems for distributed computing with words. Besides document network structure modeling, we also discuss inference and parameter estimation in terms of energy minimization for three-dimensional Markov models.
    Article · Dec 2012 · Chinese Journal of Computers
  • Jia Zeng · Zhi-Qiang Liu · Xiao-Qin Cao
    ABSTRACT: The batch latent Dirichlet allocation (LDA) algorithms play important roles in probabilistic topic modeling, but they are not suitable for processing big data streams due to their high time and space complexity. Online LDA algorithms can not only extract topics from big data streams with constant memory requirements, but also detect topic shifts as the data stream flows. In this paper, we present a novel and easy-to-implement online belief propagation (OBP) algorithm that infers the topic distribution of previously unseen documents incrementally within the stochastic approximation framework. We discuss the intrinsic relations between OBP and online expectation-maximization (OEM) algorithms, and show that OBP can converge to a local stationary point of the LDA likelihood function. Extensive empirical studies confirm that OBP significantly reduces training time and memory usage while achieving a much lower predictive perplexity when compared with current state-of-the-art online LDA algorithms. Due to its ease of use, fast speed, and low memory usage, OBP is a strong candidate for becoming the standard online LDA algorithm.
    Article · Oct 2012
  •
    ABSTRACT: This paper presents a novel communication-efficient parallel belief propagation (CE-PBP) algorithm for training latent Dirichlet allocation (LDA). Based on the synchronous belief propagation (BP) algorithm, we first develop a parallel belief propagation (PBP) algorithm on a parallel architecture. Because extensive communication delay often causes low efficiency in parallel topic modeling, we further use Zipf's law to reduce the total communication cost in PBP. Extensive experiments on different data sets demonstrate that CE-PBP achieves a higher topic modeling accuracy and reduces communication cost by more than 80% compared with the state-of-the-art parallel Gibbs sampling (PGS) algorithm.
    Article · Jun 2012
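The Zipf's-law idea can be illustrated with a simple top-k truncation of the per-word update vectors before they are exchanged between processors: when entries follow a power-law distribution, a few of them carry most of the mass, so the rest can be dropped with little loss. This is a hypothetical sketch of the principle, not the paper's actual CE-PBP protocol:

```python
import numpy as np

def truncate_for_communication(updates, top_k):
    """Zero out all but the top_k largest entries in each row, so only a
    small fraction of each update vector needs to be communicated."""
    out = np.zeros_like(updates)
    idx = np.argsort(updates, axis=1)[:, -top_k:]     # columns of the largest entries
    rows = np.arange(updates.shape[0])[:, None]
    out[rows, idx] = updates[rows, idx]               # keep only the heavy entries
    return out
```

Sending index-value pairs for the surviving entries turns an O(W x K) exchange into one proportional to `top_k`, which is where the communication savings come from under a power-law assumption.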
  • Jia Zeng · Zhi-Qiang Liu · Xiao-Qin Cao
    ABSTRACT: As one of the simplest probabilistic topic modeling techniques, latent Dirichlet allocation (LDA) has found many important applications in text mining, computer vision, and computational biology. Recent training algorithms for LDA can be interpreted within a unified message passing framework. However, message passing requires storing previous messages, whose memory footprint grows linearly with the number of documents or the number of topics. Therefore, high memory usage is often a major problem for topic modeling of massive corpora containing a large number of topics. To reduce the space complexity, we propose a novel algorithm for training LDA that does not store previous messages: tiny belief propagation (TBP). The basic idea of TBP relates message passing algorithms to non-negative matrix factorization (NMF) algorithms, which absorb the message update into the message passing process and thus avoid storing previous messages. Experimental results on four large data sets confirm that TBP performs comparably well or even better than current state-of-the-art training algorithms for LDA, but with much lower memory consumption. TBP can do topic modeling when massive corpora cannot fit in computer memory, for example, extracting thematic topics from a 7 GB PUBMED corpus on a common desktop computer with 2 GB memory.
    Article · Jun 2012
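The link to NMF that the abstract mentions can be made concrete with the classic multiplicative-update rules, which factorize a word-document count matrix while keeping only the two factor matrices in memory, with no per-token messages. This is the standard NMF algorithm, shown only to illustrate the memory argument; it is not TBP itself:

```python
import numpy as np

def nmf_multiplicative(X, K, iters=100, eps=1e-9):
    """Multiplicative-update NMF sketch: X ~ W @ H with nonnegative factors.
    Only W (n x K) and H (K x m) persist, independent of the token count."""
    n, m = X.shape
    rng = np.random.default_rng(0)                 # fixed seed for reproducibility
    W = rng.random((n, K)) + eps
    H = rng.random((K, m)) + eps
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + eps)       # update topic-document factor
        W *= (X @ H.T) / (W @ H @ H.T + eps)       # update word-topic factor
    return W, H
```

Each update multiplies the current factor elementwise by a ratio of nonnegative matrices, so nonnegativity is preserved and the Frobenius reconstruction error is nonincreasing, which is why the factors can stand in for stored messages.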
  • Jia Zeng · Xiao-Qin Cao · Zhi-Qiang Liu
    ABSTRACT: Fast convergence speed is a desired property for training latent Dirichlet allocation (LDA), especially in online and parallel topic modeling for massive data sets. This paper presents a novel residual belief propagation (RBP) algorithm to accelerate the convergence speed of training LDA. The proposed RBP uses an informed scheduling scheme for asynchronous message passing, which passes fast-convergent messages with a higher priority in order to influence slow-convergent messages at each learning iteration. Extensive empirical studies confirm that RBP significantly reduces the training time until convergence while achieving a much lower predictive perplexity than other state-of-the-art training algorithms for LDA, including variational Bayes (VB), collapsed Gibbs sampling (GS), loopy belief propagation (BP), and residual VB (RVB).
    Article · Apr 2012
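The informed-scheduling idea can be sketched as a residual-driven loop: at each step, compute what every message would become, measure each message's residual (how much it would change), and apply only the largest change first. The function names and the toy update rule in the test are illustrative assumptions, not the paper's LDA-specific schedule:

```python
import numpy as np

def residual_scheduled_updates(mu, update_fn, tol=1e-8, max_updates=10000):
    """Asynchronous updates in residual order: always apply the message
    update with the largest L1 change first (informed scheduling sketch).
    mu: (n, d) array of messages; update_fn(i, mu) -> proposed new mu[i]."""
    n = mu.shape[0]
    for _ in range(max_updates):
        proposals = np.stack([update_fn(i, mu) for i in range(n)])
        residuals = np.abs(proposals - mu).sum(axis=1)   # L1 change per message
        i = int(np.argmax(residuals))
        if residuals[i] < tol:          # no message would change much: converged
            break
        mu[i] = proposals[i]            # apply only the highest-residual update
    return mu
```

Prioritizing high-residual messages propagates the fastest-moving information first, which is the intuition behind the reduced time-to-convergence reported in the abstract; a production version would keep residuals in a priority queue instead of recomputing all proposals each step.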

Publication Stats

559 Citations
82.08 Total Impact Points

Institutions

  • 2013-2014
    • Suzhou University
      Suchow, Anhui Sheng, China
  • 2010-2014
    • Soochow University (PRC)
      • Department of Computer Science and Technology
      Wu-hsien, Jiangsu Sheng, China
  • 2012
    • Fudan University
      Shanghai, Shanghai Shi, China
  • 2008-2010
    • Hong Kong Baptist University
      • Department of Computer Science
      Chiu-lung, Kowloon City, Hong Kong
  • 2006-2008
    • The University of Hong Kong
      • Department of Electrical and Electronic Engineering
      Hong Kong, Hong Kong
  • 2004-2008
    • City University of Hong Kong
      • School of Creative Media
      Chiu-lung, Kowloon City, Hong Kong