Conference Paper

Probabilistic topic modeling for genomic data interpretation.

Coll. of Inf. Sci. & Technol., Drexel Univ., Philadelphia, PA, USA
DOI: 10.1109/BIBM.2010.5706554
Conference: 2010 IEEE International Conference on Bioinformatics and Biomedicine, BIBM 2010, Hong Kong, China, 18-21 December 2010, Proceedings
Source: DBLP

ABSTRACT Recently, the concept of a species containing both core and distributed genes, known as the supra- or pangenome theory, has been introduced. In this paper, we aim to develop a new method that analyzes the genome-level composition of DNA sequences in order to characterize a set of common genomic features shared by the same species and identify their functional roles. To this end, we first apply a composition-based approach that breaks DNA sequences into sub-reads called 'N-mers' and represents each sequence by its N-mer frequencies. We then introduce the Latent Dirichlet Allocation (LDA) model to study the genome-level statistical patterns (a.k.a. latent topics) of the N-mer features. Each estimated latent topic represents a certain component of the whole genome. With the help of the BioJava toolkit, we access the gene region information of reference sequences from the NCBI database. We use our data mining framework to investigate two questions: 1) do strains within a species share similar core and distributed topics? and 2) do genes with similar functional roles contain similar latent topics? After studying the mutual information between latent topics and gene regions, we provide examples of each, where the BioCyc database is used to correlate pathway and reaction information with the genes. The examples demonstrate the effectiveness of the proposed method.
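The N-mer representation described in the abstract can be illustrated with a minimal sketch. The function name `nmer_frequencies`, the choice of overlapping windows, the value of N, and the skipping of windows that contain non-ACGT characters are all assumptions made for illustration; the abstract does not specify how the authors extracted or normalized N-mers.

```python
from collections import Counter
from itertools import product


def nmer_frequencies(sequence, n=3):
    """Return relative frequencies of overlapping N-mers in a DNA sequence.

    Hypothetical helper: overlapping windows and the handling of
    ambiguous bases are assumptions, not details taken from the paper.
    """
    alphabet = "ACGT"
    seq = sequence.upper()
    # Slide a window of length n over the sequence, one base at a time,
    # keeping only windows made entirely of A, C, G, and T.
    counts = Counter(
        seq[i:i + n]
        for i in range(len(seq) - n + 1)
        if set(seq[i:i + n]) <= set(alphabet)
    )
    total = sum(counts.values())
    # Report every possible N-mer so that all sequences share one
    # fixed-length feature vocabulary (4**n entries).
    vocabulary = ["".join(p) for p in product(alphabet, repeat=n)]
    return {
        nmer: (counts[nmer] / total if total else 0.0)
        for nmer in vocabulary
    }


freqs = nmer_frequencies("ACGTACGT", n=2)
```

Frequency vectors of this kind, one per sequence, could then serve as the "documents" over an N-mer "vocabulary" that a topic model such as LDA takes as input.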

  • ABSTRACT: Discovering the global structure of microbial communities from large-scale metagenomes is a significant challenge in the post-genomic era. Data-driven methods such as dimension reduction have been shown to be useful when applied to a metagenomic profile matrix, which summarizes the abundance of functional or taxonomic categories in metagenomic samples. Analogously, model-driven methods such as the probabilistic topic model (PTM) have been used to build a generative model that simulates the generation of a microbial community from metagenomic profiles. Data-driven methods are direct and simple; they provide intuitive visualization and understanding of metagenomic profiles. Model-driven methods are often complicated but provide a generative mechanism for the microbial community, which helps in understanding the generating process of complex microbial ecology. However, results from model-driven methods are usually hard to visualize and less intuitive to understand. We developed a new computational framework that incorporates the strengths of data-driven methods into model-driven methods and applied it to discover and interpret enterotypes in the human microbiome.
    Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on; 01/2012

