Microbial comparative pan-genomics using binomial mixture models

Biostatistics, Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, Ås, Norway.
BMC Genomics (Impact Factor: 3.99). 09/2009; 10(1):385. DOI: 10.1186/1471-2164-10-385
Source: PubMed


The size of the core- and pan-genome of bacterial species is a topic of increasing interest due to the growing number of sequenced prokaryote genomes, many from the same species. Attempts to estimate these quantities have been made, using regression methods or mixture models. We extend the latter approach by using statistical ideas developed for capture-recapture problems in ecology and epidemiology.
We estimate core- and pan-genome sizes for 16 different bacterial species. The results reveal a complex dependency structure for most species, manifested as heterogeneous detection probabilities. Estimated pan-genome sizes range from small (around 2600 gene families) in Buchnera aphidicola to large (around 43000 gene families) in Escherichia coli. Results for Escherichia coli show that as more data become available, a larger diversity is estimated, indicating an extensive pool of rarely occurring genes in the population.
Analyzing pan-genomics data with binomial mixture models is a way to handle dependencies between genomes, which we find are always present. A bottleneck in the estimation procedure is the annotation of rarely occurring genes.
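
The idea of a binomial mixture with heterogeneous detection probabilities can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the parameters are supplied by hand rather than fitted, and fixing one component at detection probability 1 to represent the core genome is an assumption made here for illustration. Gene families seen in zero genomes are unobserved, so the likelihood is zero-truncated, as in capture-recapture models.

```python
import math


def binom_pmf(g, n_genomes, p):
    """P(a gene family is detected in exactly g of n_genomes genomes)."""
    return math.comb(n_genomes, g) * p**g * (1.0 - p) ** (n_genomes - g)


def mixture_pmf(g, n_genomes, weights, probs):
    """Binomial mixture: component k has weight weights[k] and
    detection probability probs[k]."""
    return sum(w * binom_pmf(g, n_genomes, p) for w, p in zip(weights, probs))


def truncated_loglik(counts, weights, probs):
    """Zero-truncated log-likelihood.

    counts[g - 1] is the number of gene families observed in exactly
    g genomes, for g = 1..G (families seen in 0 genomes are never observed).
    """
    n_genomes = len(counts)
    p0 = mixture_pmf(0, n_genomes, weights, probs)
    loglik = 0.0
    for g, y in enumerate(counts, start=1):
        if y == 0:
            continue  # empty classes contribute nothing
        loglik += y * math.log(mixture_pmf(g, n_genomes, weights, probs) / (1.0 - p0))
    return loglik


def pan_genome_size(counts, weights, probs):
    """Observed families scaled up by the estimated undetected fraction."""
    n_genomes = len(counts)
    p0 = mixture_pmf(0, n_genomes, weights, probs)
    return sum(counts) / (1.0 - p0)


def core_genome_size(counts, weights, probs):
    """Assumption for this sketch: the core is the component with p == 1."""
    core_weight = sum(w for w, p in zip(weights, probs) if p == 1.0)
    return core_weight * pan_genome_size(counts, weights, probs)
```

In practice the weights and detection probabilities would be estimated by maximizing `truncated_loglik` over the observed counts, and the number of components chosen by a model selection criterion; both steps are left out of this sketch.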



Available from: Lars Snipen
    • "Gene families with at least one gene in common were gathered into the core genome. The rest, either unmatched or not qualifying according to the criterion, constitutes the species pan genome [24] [25] [26]. "
    ABSTRACT: Helicobacter pylori is a human gastric pathogen implicated as the major cause of peptic ulcer and the second leading cause of gastric cancer (~70%) around the world. Conversely, an increased resistance to antibiotics and hindrances in the development of vaccines against H. pylori are observed. Pangenome analyses of the global representative H. pylori species consisting of 39 complete genomes are presented in this article. Phylogenomics analyses have revealed close relationships among geographically diverse strains of H. pylori. The close association among these genomes was further analyzed by a pangenomic approach; the conserved gene families (1,193), which constitute ~77% of the average H. pylori genome and 42% of the species pangenome, were then characterized. Reverse vaccinology strategies have been adopted to identify and narrow down the potential immunogenic candidates. 29 non-host homolog proteins were characterized as global vaccine targets based on their functional annotation and protein-protein interaction. Epitope mapping analysis has revealed some of the best antigenic epitopes. Finally, genome plasticity analysis revealed 3 highly conserved and 2 highly variable putative pathogenicity islands in all H. pylori strains. http://www.hindawi.com/journals/bmri/aip/139580/
    BioMed Research International 07/2014; 2015. DOI:10.1155/2015/139580 · 2.71 Impact Factor
    • "The estimated core genome of C. jejuni was 947 genes in 130 genomes (Figure 3A). Our estimates are consistent with previous studies where core genome size of C. jejuni was estimated to range from 847 genes [33] and 1,001 genes [34] to a maximum of 1,295 genes [29]. However, it is interesting to note that the core genome size does not reach a clear plateau, even when about 200 genomes are sampled, which indicates that if more diverse samples were added to this analysis, even fewer genes would be shared, something that has also been shown for Escherichia coli [35]. "
    ABSTRACT: The increasing availability of hundreds of whole bacterial genomes provides opportunities for enhanced understanding of the genes and alleles responsible for clinically important phenotypes and how they evolved. However, it is a significant challenge to develop easy-to-use and scalable methods for characterizing these large and complex data and relating it to disease epidemiology. Existing approaches typically focus on either homologous sequence variation in genes that are shared by all isolates, or non-homologous sequence variation - focusing on genes that are differentially present in the population. Here we present a comparative genomics approach that simultaneously approximates core and accessory genome variation in pathogen populations and apply it to pathogenic species in the genus Campylobacter. A total of 7 published Campylobacter jejuni and Campylobacter coli genomes were selected to represent diversity across these species, and a list of all loci that were present at least once was compiled. After filtering duplicates a 7-isolate reference pan-genome, of 3,933 loci, was defined. A core genome of 1,035 genes was ubiquitous in the sample accounting for 59% of the genes in each isolate (average genome size of 1.68 Mb). The accessory genome contained 2,792 genes. A Campylobacter population sample of 192 genomes was screened for the presence of reference pan-genome loci with gene presence defined as a BLAST match of ≥70% identity over ≥50% of the locus length - aligned using MUSCLE on a gene-by-gene basis. A total of 21 genes were present only in C. coli and 27 only in C. jejuni, providing information about functional differences associated with species and novel epidemiological markers for population genomic analyses. Homologs of these genes were found in several of the genomes used to define the pan-genome and, therefore, would not have been identified using a single reference strain approach.
    PLoS ONE 03/2014; 9(3):e92798. DOI:10.1371/journal.pone.0092798 · 3.23 Impact Factor
    • "Even within a species, comparative genomics has highlighted a diversity that would not have been detected otherwise. The diversity within Escherichia coli was illustrated in a study from 2009, where the number of gene families, in Escherichia coli was estimated to be 43 000 [7]; this number is expected to become larger as more genomes are sequenced. Another example of the power of comparative genomics, this time within low diversity genomes, can be found in a study of two Bacillus species, B. anthracis and B. cereus. "
    ABSTRACT: Today, there are more than a hundred times as many sequenced prokaryotic genomes as were present in the year 2000. The economical sequencing of genomic DNA has facilitated a whole new approach to microbial genomics. The real power of genomics is manifested through comparative genomics that can reveal strain specific characteristics, diversity within species and many other aspects. However, comparative genomics is a field not easily entered into by scientists with few computational skills. The CMG-biotools package is designed for microbiologists with limited knowledge of computational analysis and can be used to perform a number of analyses and comparisons of genomic data. The CMG-biotools system presents a stand-alone interface for comparative microbial genomics. The package is a customized operating system, based on Xubuntu 10.10, available through the open source Ubuntu project. The system can be installed on a virtual computer, allowing the user to run the system alongside any other operating system. Source codes for all programs are provided under GNU license, which makes it possible to transfer the programs to other systems if so desired. We here demonstrate the package by comparing and analyzing the diversity within the class Negativicutes, represented by 31 genomes including 10 genera. The analyses include 16S rRNA phylogeny, basic DNA and codon statistics, proteome comparisons using BLAST and graphical analyses of DNA structures. This paper shows the strength and diverse use of the CMG-biotools system. The system can be installed on a wide range of host operating systems and utilizes as much of the host computer as desired. It allows the user to compare multiple genomes from various sources using standardized data formats and intuitive visualizations of results. The examples presented here clearly show that users with limited computational experience can perform complicated analyses without much training.
    PLoS ONE 04/2013; 8(4):e60120. DOI:10.1371/journal.pone.0060120 · 3.23 Impact Factor