Microbial comparative pan-genomics using binomial mixture models.

Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, As, Norway.
BMC Genomics (Impact Factor: 4.04). 09/2009; 10:385. DOI: 10.1186/1471-2164-10-385
Source: PubMed

ABSTRACT The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology.
We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in Buchnera aphidicola to large (around 43000 gene families) in Escherichia coli. Results for Echerichia coli show that as more data become available, a larger diversity is estimated, indicating an extensive pool of rarely occurring genes in the population.
Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find is always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.

  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Background Genomes of bacteria and archaea (collectively, prokaryotes) appear to exist in incessant flux, expanding via horizontal gene transfer (HGT) and gene duplication, and contracting via gene loss. However, the actual rates of genome dynamics and relative contributions of different types of events across the diversity of prokaryotes are largely unknown, as are the sizes of microbial supergenomes, i.e. pools of genes that are accessible to the given microbial species.ResultsWe performed a comprehensive analysis of the genome dynamics in 35 groups (34 bacterial and one archaeal) of closely related microbial genomes using a phylogenetic birth-and-death maximum likelihood model to quantify the rates of gene family gain and loss, as well as expansion and reduction. The results show that loss of gene families dominates the evolution of prokaryotes, occurring at approximately three times the rate of gain. The rates of gene family expansion and reduction are typically 7 and 20 times less than the gain and loss rates, respectively. Thus, the prevailing mode of evolution in bacteria and archaea is genome contraction that is partially compensated by the gain of new gene families via horizontal gene transfer. However, the rates of gene family gain, loss, expansion and reduction vary within wide ranges, with the most stable genomes showing rates about 25 times lower than the most dynamic genomes. For many groups, the supergenome estimated from the fraction of repetitive gene family gains includes about 10 fold more gene families than the typical genome in the group although some groups appear to have vast, ¿open¿ supergenomes.Conclusions Reconstruction of evolution in groups of closely related bacteria and archaea reveals extremely rapid and highly variable flux of genes in evolving microbial genomes, demonstrates that extensive gene loss and horizontal gene transfer leading to innovation are the two dominant evolutionary processes, and yields robust estimates of the supergenome size.
    BMC Biology 08/2014; 12(1):66. DOI:10.1186/PREACCEPT-1328480924130239 · 7.43 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Because prokaryotic genomes experience a rapid flux of genes, selection may act at a higher level than an individual genome. We explore a quantitative model of the distributed genome whereby groups of genomes evolve by acquiring genes from a fixed reservoir which we denote as supergenome. Previous attempts to understand the nature of the supergenome treated genomes as random, independent collections of genes and assumed that the supergenome consists of a small number of homogeneous sub-reservoirs. Here we explore the consequences of relaxing both assumptions. We surveyed several methods for estimating the size and composition of the supergenome. The methods assumed that genomes were either random, independent samples of the supergenome or that they evolved from a common ancestor along a known tree via stochastic sampling from the reservoir. The reservoir was assumed to be either a collection of homogeneous sub-reservoirs or alternatively composed of genes with Gamma distributed gain probabilities. Empirical gene frequencies were used to either compute the likelihood of the data directly or first to reconstruct the history of gene gains and then compute the likelihood of the reconstructed numbers of gains. Supergenome size estimates using the empirical gene frequencies directly are not robust with respect to the choice of the model. By contrast, using the gene frequencies and the phylogenetic tree to reconstruct multiple gene gains produces reliable estimates of the supergenome size and indicates that a homogeneous supergenome is more consistent with the data than a supergenome with Gamma distributed gain probabilities.
    BMC Genomics 10/2014; 15(Suppl 6):S14. DOI:10.1186/1471-2164-15-S6-S14 · 4.04 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: GET_HOMOLOGUES is an open-source software package written in Perl and R to define robust core- and pan-genomes by computing consensus clusters of orthologous gene families from whole-genome sequences using the bidirectional best-hit, COGtriangles, and OrthoMCL clustering algorithms. The granularity of the clusters can be fine-tuned by a user-configurable filtering strategy based on a combination of blastp pairwise alignment parameters, hmmscan-based scanning of Pfam domain composition of the proteins in each cluster, and a partial synteny criterion. We present detailed protocols to fit exponential and binomial mixture models to estimate core- and pan-genome sizes, compute pan-genome trees from the pan-genome matrix using a parsimony criterion, analyze and graphically represent the pan-genome structure, and identify lineage-specific gene families for the 12 complete pIncA/C plasmids currently available in NCBI's RefSeq. The software package, license, and detailed user manual can be downloaded for free for academic use from two mirrors: and .
    Methods in Molecular Biology 01/2015; 1231:203-32. DOI:10.1007/978-1-4939-1720-4_14 · 1.29 Impact Factor

Full-text (2 Sources)

Available from
May 30, 2014