Conference Paper

Probabilistic topic modeling for genomic data interpretation

Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
DOI: 10.1109/BIBM.2010.5706554
Conference: 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010, Hong Kong, China, 18-21 December 2010, Proceedings
Source: DBLP


Recently, the concept of a species containing both core and distributed genes, known as the supragenome or pangenome theory, has been introduced. In this paper, we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and to identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences down into sub-reads called N-mers and represents each sequence by its N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene-region information of reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, where the BioCyc database is used to correlate pathway and reaction information with the genes. The examples demonstrate the effectiveness of the proposed method.
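
The N-mer frequency representation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `nmer_frequencies`, the use of overlapping windows, the skipping of windows with ambiguous bases, and the fixed A/C/G/T vocabulary are all assumptions, since the abstract does not specify these details.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(sequence, n=3):
    """Return relative frequencies of overlapping N-mers in a DNA sequence.

    Hypothetical helper: the windowing and filtering choices are assumptions.
    """
    seq = sequence.upper()
    valid = set("ACGT")
    # Slide a window of length n over the sequence, one base at a time,
    # and skip any window containing a base other than A, C, G, or T.
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= valid
    )
    total = sum(counts.values())
    # Use a fixed vocabulary of all 4**n possible N-mers so that every
    # sequence maps to a vector of the same length.
    vocab = ["".join(p) for p in product("ACGT", repeat=n)]
    return {kmer: (counts[kmer] / total if total else 0.0) for kmer in vocab}


freqs = nmer_frequencies("ACGTACGT", n=2)
```

In an LDA setting, each sequence would play the role of a document and each N-mer the role of a word, so the raw counts (rather than the normalized frequencies) would typically be what a topic-model library consumes.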

  • ABSTRACT: Discovering the global structure of microbial communities using large-scale metagenomes is a significant challenge in the post-genomic era. Data-driven methods such as dimension reduction have been shown to be useful when applied to a metagenomic profile matrix, which summarizes the abundance of functional or taxonomic categories in metagenomic samples. Analogously, model-driven methods such as the probabilistic topic model (PTM) have been used to build a generative model that simulates the generation of a microbial community from metagenomic profiles. Data-driven methods are direct and simple; they provide intuitive visualization and understanding of metagenomic profiles. Model-driven methods are often complicated but give a generative mechanism for the microbial community, which is helpful in understanding the generating process of complex microbial ecology. However, results from model-driven methods are usually hard to visualize and are less intuitive to understand. We developed a new computational framework that incorporates the strengths of data-driven methods into model-based methods and applied the framework to discover and interpret enterotypes in the human microbiome.
    Conference Paper · Jan 2012
  • ABSTRACT: Topic Modeling (TM) is a rapidly growing area at the interface of text mining, artificial intelligence, and statistical modeling that is increasingly deployed to address the 'information overload' associated with extensive text repositories. The goal in TM is typically to infer a rich yet intuitive summary model of a large document collection, indicating a specific collection of topics that characterizes the collection (each topic being a probability distribution over words), along with the degree to which each individual document is concerned with each topic. The model then supports segmentation, clustering, profiling, browsing, and many other tasks. Current approaches to TM, dominated by Latent Dirichlet Allocation (LDA), assume a topic-driven document generation process and find a model that maximizes the likelihood of the data with respect to this process. This is clearly sensitive to any mismatch between the 'true' generating process and the statistical model, and it is also clear that the quality of a topic model is multi-faceted and complex. Individual topics should be intuitively meaningful, sensibly distinct, and free of noise. Here we investigate multi-objective approaches to topic modeling, which attempt to infer coherent topic models by navigating the trade-offs between objectives oriented towards coherence as well as coverage of the corpus at hand. Comparisons with LDA show that adopting multi-objective evolutionary algorithm (MOEA) approaches yields significantly more coherent topics than LDA, consequently enhancing the use and interpretability of these models in a range of applications, without any significant degradation in the models' generalization ability.
    Article · Jan 2013
  • ABSTRACT: In this work we present a new Bayesian topic model, latent hierarchical Pitman-Yor process allocation (LHPYA), which uses hierarchical Pitman-Yor process priors for both word and topic distributions and generalizes several existing topic models, including latent Dirichlet allocation (LDA), the bigram topic model, and the hierarchical Pitman-Yor topic model. Using such priors allows n-grams to be integrated with a topic model while smoothing them with a state-of-the-art method. Our model is evaluated by measuring its perplexity on a dataset of musical genre and harmony annotations, the 3 Genre Database (3GDB), and by measuring its ability to predict musical genre from chord sequences. In terms of perplexity, for a 262-chord dictionary we achieve a value of 2.74, compared to 18.05 for trigrams and 7.73 for a unigram topic model. In terms of genre prediction accuracy with 9 genres, the proposed approach performs about 33% better in relative terms than genre-dependent n-grams, achieving 60.4% accuracy.
    Article · Jan 2014 · IEEE Transactions on Audio Speech and Language Processing