Standard operating procedure for computing pangenome trees. Stand Genomic Sci 2:135-141

Standards in Genomic Sciences (Impact Factor: 3.17). 01/2010; 2(1):135-41. DOI: 10.4056/sigs.38923
Source: PubMed

ABSTRACT We present the pan-genome tree as a tool for visualizing similarities and differences between closely related microbial genomes within a species or genus. Distance between genomes is computed as a weighted relative Manhattan distance based on gene family presence/absence. The weights can be chosen with emphasis on groups of gene families conserved to various degrees inside the pan-genome. The software is available for free as an R-package.
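
As a minimal sketch of the distance described in the abstract, the Python below computes a weighted relative Manhattan distance between genomes from a gene family presence/absence matrix. It is not the authors' R implementation. The function names, the toy matrix, the example weights, and the choice to normalize by the sum of the weights are all assumptions made for illustration; the paper's exact formula may differ.

```python
from typing import List, Optional, Sequence


def weighted_relative_manhattan(
    a: Sequence[int],
    b: Sequence[int],
    weights: Optional[Sequence[float]] = None,
) -> float:
    """Weighted share of gene families whose presence/absence differs.

    a, b: 0/1 presence/absence profiles over the same gene families.
    weights: one non-negative weight per gene family (assumed scheme);
             defaults to equal weights.
    """
    if len(a) != len(b):
        raise ValueError("profiles must cover the same gene families")
    if weights is None:
        weights = [1.0] * len(a)
    if len(weights) != len(a):
        raise ValueError("need exactly one weight per gene family")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Manhattan distance on 0/1 data counts the differing families;
    # dividing by the total weight makes it relative (range 0..1).
    diff = sum(w * abs(x - y) for w, x, y in zip(weights, a, b))
    return diff / total


def distance_matrix(
    pan_matrix: Sequence[Sequence[int]],
    weights: Optional[Sequence[float]] = None,
) -> List[List[float]]:
    """Pairwise distances between all genomes (rows) of a pan-matrix."""
    n = len(pan_matrix)
    return [
        [weighted_relative_manhattan(pan_matrix[i], pan_matrix[j], weights)
         for j in range(n)]
        for i in range(n)
    ]


# Toy pan-matrix: rows are genomes, columns are gene families (hypothetical).
pan_matrix = [
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1],
]

# Hypothetical weights that de-emphasize the last gene family.
example_weights = [1.0, 1.0, 1.0, 1.0, 0.5]

unweighted = distance_matrix(pan_matrix)
weighted = distance_matrix(pan_matrix, example_weights)
```

The resulting distance matrix could then be passed to any hierarchical clustering routine to draw a tree; the clustering method is not specified in the abstract, so it is left out of this sketch.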

Available from: Lars Snipen, Sep 28, 2015
  • Source
    • "The pan genome tree is the phylogenetic tree based on the profile of presence and absence of genes across genomes [2], [26], [27]. For the set of 34 genomes, the tree failed to cluster the outbreak strains into the corresponding groups of six different outbreak sources (Figure 1A). "
    ABSTRACT: Salmonella enterica is a common cause of minor and large foodborne outbreaks. To achieve successful and nearly 'real-time' monitoring and identification of outbreaks, reliable sub-typing is essential. Whole genome sequencing (WGS) shows great promise as a routine epidemiological typing tool. Here we evaluate WGS for typing of S. Typhimurium, including different approaches for analyzing and comparing the data. A collection of 34 S. Typhimurium isolates was sequenced. This consisted of 18 isolates from six outbreaks and 16 epidemiologically unrelated background strains. In addition, 8 S. Enteritidis and 5 S. Derby were also sequenced and used for comparison. A number of different bioinformatics approaches were applied to the data, including pan-genome tree, k-mer tree, nucleotide difference tree and SNP tree. The outcome of each approach was evaluated in relation to the association of the isolates to specific outbreaks. The pan-genome tree clustered 65% of the S. Typhimurium isolates according to the pre-defined epidemiology, the k-mer tree 88%, the nucleotide difference tree 100% and the SNP tree 100% of the strains within S. Typhimurium. The outcomes of the four phylogenetic analyses were also compared to PFGE, revealing that WGS typing achieved greater performance than the traditional method. In conclusion, for S. Typhimurium, SNP analysis and the nucleotide difference approach to WGS data seem to be the superior methods for epidemiological typing compared to other phylogenetic analytic approaches that may be used on WGS. These approaches were also superior to the more classical typing method, PFGE. Our study also indicates that WGS alone is insufficient to determine whether strains are related or unrelated to outbreaks. This still requires the combination of epidemiological data and whole genome sequencing results.
    PLoS ONE 02/2014; 9(2):e87991. DOI:10.1371/journal.pone.0087991 · 3.23 Impact Factor
  • Source
    • "Singletons were ignored. The tree was created with the R package, as previously described by Snipen & Ussery [53]. "
    ABSTRACT: Background: Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determination of molecular clocks and for improved typing techniques. Results: We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree, based on alignment, and a pan-genome tree, based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence and divides the E. coli strains into the observed MLST type clades and also separates defined phylotypes. Conclusion: The results of comparing a large and diverse E. coli dataset support the theory that reliable and good-resolution phylogenies can be inferred from the core-genome. The results further suggest that the resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.
    BMC Genomics 10/2012; 13(1):577. DOI:10.1186/1471-2164-13-577 · 3.99 Impact Factor
  • Source
    • "It is obvious, that the genome similarity and differences are detectable not only by shared genes (core genome) between and among the genomes, but also by the absence of specific genes in specific genome (s). Therefore, the pangenomic tree has been generated based on the presence or absence of specific gene families across the Campylobacter genomes (Lukjancenko et al., 2012; Snipen and Ussery, 2010) (Fig. 3B). Again, Campylobacter species show greater conserved gene families between them and are positioned closely on the pan-genome dendrogram with greater branch strength. "
    ABSTRACT: The genus Campylobacter contains pathogens causing a wide range of diseases, targeting both humans and animals. Among them, the Campylobacter fetus subspecies fetus and venerealis deserve special attention, as they are the etiological agents of human bacterial gastroenteritis and bovine genital campylobacteriosis, respectively. We compare the whole genomes of both subspecies to gain insights into genomic architecture, phylogenetic relationships, genome conservation and core virulence factors. A pan-genomic approach was applied to identify the core- and pan-genome for both C. fetus subspecies and members of the genus. The conserved proteome (76%) of the C. fetus subspecies was then analyzed for subcellular localization and protein functions in biological processes. Furthermore, with pathogenomic strategies, unique candidate regions in the genomes and several potential core-virulence factors were identified. The potential candidate factors identified for attenuation and/or subunit vaccine development against C. fetus subspecies include nucleoside diphosphate kinase (Ndk), type IV secretion systems (T4SS), outer membrane proteins (OMP), substrate binding proteins CjaA and CjaC, surface array proteins, sap gene, and cytolethal distending toxin (CDT). Significantly, many of those genes were found in genomic regions with signals of horizontal gene transfer and were, therefore, predicted as putative pathogenicity islands. We found CRISPR loci and dam genes in an island specific for C. fetus subsp. fetus, and T4SS and sap genes in an island specific for C. fetus subsp. venerealis. The genomic variations and potential core and unique virulence factors characterized in this study would lead to better insight into the species virulence and to more efficient use of the candidates for antibiotic, drug and vaccine development.
    Gene 08/2012; 508(2):145-56. DOI:10.1016/j.gene.2012.07.070 · 2.14 Impact Factor