GrapeTree: Visualization of core genomic relationships among 100,000 bacterial pathogens

Abstract

Current methods struggle to reconstruct and visualize the genomic relationships of large numbers of bacterial genomes. GrapeTree facilitates the analyses of large numbers of allelic profiles by a static "GrapeTree Layout" algorithm that supports interactive visualizations of large trees within a web browser window. GrapeTree also implements a novel minimum spanning tree algorithm (MSTree V2) to reconstruct genetic relationships despite high levels of missing data. GrapeTree is a stand-alone package for investigating phylogenetic trees plus associated metadata, and is also integrated into EnteroBase to facilitate cutting-edge navigation of genomic relationships among bacterial pathogens.
Zhemin Zhou,1 Nabil-Fareed Alikhan,1 Martin J. Sergeant,1 Nina Luhmann,1 Cátia Vaz,2,3 Alexandre P. Francisco,2,4 João André Carriço,5 and Mark Achtman1

1 Warwick Medical School, University of Warwick, Coventry, CV4 7AL, United Kingdom;
2 Instituto de Engenharia de Sistemas e Computadores: Investigação e Desenvolvimento (INESC-ID), 1000-029 Lisboa, Portugal;
3 ADEETC, Instituto Superior de Engenharia de Lisboa, Instituto Politécnico de Lisboa, 1959-007 Lisboa, Portugal;
4 Instituto Superior Técnico, Universidade de Lisboa, 1049-001 Lisboa, Portugal;
5 Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-004 Lisboa, Portugal
[Supplemental material is available for this article.]
Legacy MLST (multilocus sequence typing) based on seven house-
keeping genes was introduced 20 years ago (Maiden et al. 1998)
and is now routinely used for the characterization of numerous
bacterial pathogens (Jolley and Maiden 2014). MLST assigns
distinct integer numbers to each unique sequence (allele) and a
distinct integer number, the sequence type (ST), to each unique
combination of allelic integers. Unrelated STs share few alleles or
none at all. In contrast, STs that share all but one or two alleles
are considered to be strongly related even if the differing alleles
contain multiple SNPs due to recombination. The largest legacy
MLST databases contain data on 60,000 bacterial strains
(https://pubmlst.org/databases.shtml).
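The ST numbering scheme described above can be illustrated with a minimal Python sketch. It assumes that each isolate is represented by a tuple of seven allele integers and, purely for illustration, numbers STs in order of first appearance; real MLST databases assign ST numbers through curation. The function names and the example profiles are hypothetical and are not part of any MLST software.

```python
from collections import Counter


def assign_sequence_types(profiles):
    """Give each unique allelic profile an integer ST (illustrative numbering)."""
    st_of_profile = {}
    sts = []
    for profile in profiles:
        key = tuple(profile)
        if key not in st_of_profile:
            # Assumption: STs are numbered in order of first appearance.
            st_of_profile[key] = len(st_of_profile) + 1
        sts.append(st_of_profile[key])
    return sts


def shared_alleles(a, b):
    """Count the loci at which two profiles carry the same allele."""
    return sum(x == y for x, y in zip(a, b))


# Invented seven-locus profiles for four isolates.
profiles = [
    (1, 1, 1, 1, 1, 1, 1),
    (1, 1, 1, 1, 1, 1, 2),   # differs from the first profile at one locus
    (1, 1, 1, 1, 1, 1, 1),   # identical to the first profile
    (5, 6, 7, 8, 9, 10, 11), # shares no alleles with the others
]
sts = assign_sequence_types(profiles)
isolates_per_st = Counter(sts)
```

In this example the first and third isolates receive the same ST, the second is strongly related to them because it shares six of seven alleles, and the fourth is unrelated.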
In order to support epidemiological tracking of transmission
networks and disease control, the resolution achieved by MLST
was recently expanded to encompass more than seven gene frag-
ments. Expanded MLST schemes can include all 53 genes en-
coding ribosomal proteins (rMLST) (Jolley et al. 2012), thousands
of core genes that are present in most isolates of a species or genus
(core genome MLST, cgMLST) (Mellmann et al. 2011; Maiden et al.
2013; Moura et al. 2016), or even all the genes in the entire genome
(whole genome MLST, wgMLST) (Nadon et al. 2017). We have re-
cently developed EnteroBase (https://enterobase.warwick.ac.uk), a
genotyping website for selected enteric pathogens (Alikhan et al.
2018). EnteroBase automatically assembles Illumina short reads
into contigs and assigns the assembled sequences to MLST alleles
and STs at all levels of resolution from legacy MLST through to
wgMLST. EnteroBase performs these operations for short reads
that are in the public domain or uploaded by users.
In April 2018, EnteroBase contained 130,000 Salmonella genomes
and >65,000 Escherichia genomes, and the number of sets
of Illumina short reads in the public domain continues to grow
rapidly (Alikhan et al. 2018). A driving force behind developing
such large databases is to facilitate our understanding of epidemi-
ological and population genetic phenomena among isolates from
distinct geographical sources and over extended time scales.
Initially, the genetic relationships of legacy STs were represented
by phylograms based on hierarchical clustering methods, an ap-
proach which can be very useful for visualizing deeper branching
structures. Phylograms may, however, be problematic for the pre-
sentation of large numbers of genotypes because each genotype is
represented by a unique branch, even when multiple genotypes
are identical. An example of this problem arises when visualizing
the allelic distances between 99,722 Salmonella spp. strains from
3902 legacy MLST STs. The associations between serovars and ge-
netic clades are somewhat difficult to interpret within the default
presentation of this phylogram by iTOL (Fig. 1A; Letunic and Bork
2016). Dendrograms generated by other programs (FigTree
[v.1.4.3, http://tree.bio.ed.ac.uk/software/figtree/]; Dendroscope
[Huson and Scornavacca 2012]) from large data sets were also dif-
ficult to interpret. Still other graphical user interfaces were unable
to even depict this large number of items, including PHYLOViZ
2.0 (Nascimento et al. 2017), SplitsTree4 (Huson and Bryant
2006), EvolView (He et al. 2016), Microreact (Argimon et al.
2016), TreeDyn (Chevenet et al. 2006), TreeView (Page 1996),
and Phandango (Hadfield et al. 2018). Similarly, handling more
than 5000 genomes presents problems for de novo sequence-based
SNP comparisons (Mazariegos-Canellas et al. 2017), and trees
based on phylogenetic algorithms are difficult to comprehend
when they contain large numbers of nodes (Fig. 1B,D).
An alternative to phylograms is minimum spanning trees, which have less demanding graphical requirements because they map clusters of related nodes in 2D space (Francisco et al. 2012; Nascimento et al. 2017). A commercial software program (BioNumerics, Applied Maths) introduced an improved visualization of a minimum spanning tree to microbiologists in the early 2000s, which reduced complexity by grouping isolates with identical STs within single nodes whose diameter reflected the numbers of isolates. A similar visualization was subsequently offered by the noncommercial PHYLOViZ software (Francisco et al. 2009). We have now extended these approaches with GrapeTree, a software package that supports the efficient visualization of minimum spanning trees and phylograms from character data. An initial indication of its capabilities can be gained by comparing the representations of genetic relationships according to legacy MLST data by iTOL (Fig. 1A,B) and GrapeTree (Fig. 1C,D).

Corresponding authors: zhemin.zhou@warwick.ac.uk, m.achtman@warwick.ac.uk
Article published online before print. Article, supplemental material, and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.232397.117. Freely available online through the Genome Research Open Access option.
© 2018 Zhou et al. This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
Method. Genome Research 28:1395–1404. Published by Cold Spring Harbor Laboratory Press; ISSN 1088-9051/18; www.genome.org. Downloaded from genome.cshlp.org on January 25, 2019.
Calculating minimum spanning trees from legacy MLST is
quick and efficient because legacy MLST is based on only seven
loci, and allelic calls for each of the seven loci are a prerequisite
for calling an ST, i.e., no missing data. As a result, the GrapeTree
visualization in Figure 1C took only 1.5 min. However, the
cgMLST of Salmonella spans 3002 loci (Alikhan et al. 2018), and
STs routinely include low levels of missing data because some
cgMLST genes are occasionally deleted or are not identified due
to various bioinformatics problems in the assembly of genomes
from short reads. As a result, multiple sets of almost identical STs
exist in EnteroBase that only differ due to missing data, but each
of which is, nevertheless, a unique node in a phylogram because
its allelic content differs from those of other STs. As demonstrated
below, missing data are also a problem for the classical minimum
spanning tree approach (henceforth MSTree) implemented by
BioNumerics and goeBURST (Francisco et al. 2009). We have there-
fore implemented MSTree V2, which is an improved algorithm for
generating minimum spanning trees from character sets that con-
tain missing data.
Here we present GrapeTree, a web browser application that ef-
ficiently reconstructs and visualizes intricate minimum spanning
trees together with detailed metadata.
Figure 1. Visualization of 3902 legacy Salmonella MLST STs from 99,722 genomic assemblies in EnteroBase (Alikhan et al. 2018) by a phylogram versus a minimum spanning tree. (A,B) iTOL (Letunic and Bork 2016) visualization of genetic relationships. Nodes at the ends of the terminal edges represent each of the 99,722 genomic assemblies. (C,D) GrapeTree visualization of genetic clusters. Nodes represent each of the 3902 STs, with diameters scaled to the number of assemblies. In C, edges between nodes mark allelic distances of 1–2 of the seven loci. (A,C) Representation of a minimum spanning tree generated by MSTree V2 in Newick format. (B,D) Representation of a neighbor-joining tree generated by RapidNJ (Simonsen et al. 2011) in Newick format. Color codes for the 60 most common serovars are indicated in the central key legend and used to color branches plus wedges in an external circle (A,B) or individual nodes (C,D). Interactive versions of the trees can be found at (A) http://bit.ly/2qH06jp, (B) https://bit.ly/2mDOpbS, (C) http://bit.ly/2H69dkG, and (D) https://bit.ly/2LG62Tl.
Results
Overview of GrapeTree features
GrapeTree is a fully interactive tree visualization program that
supports facile manipulations of both tree layout and metadata.
The visual component of GrapeTree is implemented in HTML/
JavaScript and served through a web server based on the Flask
web framework (Python 2.7). GrapeTree is available as a stand-
alone version (GrapeTree SA), which calculates trees from charac-
ter data, visualizes precalculated trees, and annotates them with in-
formation from metadata (Fig. 2). Calculating trees is handled by
an independent module (CL), which calls Python NumPy as well
as external C++ programs for efficiency. CL can also be run in com-
mand line mode, in which case it terminates after generating the
desired tree in Newick format. GrapeTree has also been integrated
into larger web services through wrapper functions. The wrappers
provide bidirectional communication with database servers con-
taining information from hundreds of thousands of bacterial ge-
nomes and their associated metadata (Supplemental Fig. S1). The
version of GrapeTree provided by EnteroBase (GrapeTree EB)
only displays trees calculated from EnteroBase data because the
module for performing those tree calculations is fully integrated
into EnteroBase. Jolley has written a separate GrapeTree wrapper
specific for the BIGSdb website/database environment (Jolley and
Maiden 2010) and thereby enabled GrapeTree functionality for
all the databases served by PubMLST.
Inputs into GrapeTree SA
GrapeTree SA accepts matrices of character data (MLST allelic
profiles or SNPs), aligned multiple-FASTA files, precalculated tree
files in standard formats (Newick or NEXUS), and comma or tab-
delimited text for metadata (Fig. 2). Such files can be uploaded
into GrapeTree SA by dragging and dropping from a user's local
workstation, by pasting the content into an input box, or by loading
them from online sources. The GrapeTree SA backend module calculates trees from
character data or FASTA files, whereas precalculated tree files are
rendered without further modification. To illustrate this flexibility,
Figure 3 shows a GrapeTree representation of a phylogenetic tree of
1610 Ebola genomes from the 2013–2016 Ebola epidemic in West
Africa (Dudas et al. 2017) which was downloaded together with as-
sociated metadata from Microreact (Argimon et al. 2016). Note
that this is a topologically correct visualization of a real phyloge-
netic tree, including internal hypothetical nodes, some of which
have been collapsed for clarity.
Metadata
GrapeTree implements a high performance spreadsheet based
on JavaScript SlickGrid (https://github.com/6pac/SlickGrid) that
allows users to view and modify metadata that are associated
with the individual entries (Fig. 3, top right). Additional columns
from other experimental data or user-defined fields can be import-
ed from EnteroBase into the metadata table in GrapeTree EB. For
GrapeTree SA, the metadata can be ex-
ported locally, and novel metadata col-
umns can be added to the exported data
using a text editor or Microsoft Excel
and re-imported. Any column can be
used to color and/or label tree nodes.
For example, an attractive presentation
of a temporal gradient was implemented
by reformatting downloaded metadata
in Figure 3. The color codes for metadata
are assigned automatically but can be
changed manually, and the user can
specify the number of colors (right click
on key legend). In the metadata panel,
metadata columns can be sorted and/or
filtered at will to select individual entries.
Selecting genotypes
Entries that are selected in the metadata panel are immediately highlighted by red circles in the tree. Tree nodes can also be selected by pressing the shift key while drawing a selection box over those nodes with a mouse. The ability to select a subset of the displayed nodes facilitates focused attention on individual groups of related genotypes. For example, we re-investigated the global relationships of recently described isolates of Salmonella serovar Typhimurium of legacy MLST ST313 and ST302 from Africa and the UK (yellow polygon in Supplemental Fig. S1A; Ashton et al. 2017). Legacy MLST STs were used as metadata to identify and select these genomes among 19,670 Typhimurium genomes in a GrapeTree based on cgMLST STs. The selected genomes were displayed within a new EnteroBase workspace (click EnteroBase\Load Selected), and used to generate a neighbor-joining (NJ) tree in a second GrapeTree EB window (Supplemental Fig. S1B).

Figure 2. Overview of GrapeTree features. GrapeTree can run within the EnteroBase environment (EB; green), in stand-alone mode (SA; red), or in command line mode (CL), which disables graphic interactions. Options for CL mode are shown by typing "grapetree -h" after installation. All features are common to EB and SA, except where indicated in the figure. A demonstration version of GrapeTree is available for experimentation at https://achtman-lab.github.io/GrapeTree/. (A) Inputs for GrapeTree EB (left) and GrapeTree SA (right). (B) Metadata capabilities. (C) Algorithms used for tree constructions. (D) Static and dynamic layout and branch collapsing. (E) Tree manipulation. (F) Outputs.
Algorithms
GrapeTree implements Kruskal's algorithm (Kruskal 1956) for a
classical MSTree, Edmonds' algorithm (Edmonds 1967) for
MSTree V2, as well as the FastME V2 (Lefort et al. 2015) and
RapidNJ (Simonsen et al. 2011) implementations of neighbor-joining.
Maximum-likelihood trees can be calculated from core SNPs called
against a selected reference genome within EnteroBase using
RAxML (Stamatakis 2014) for SNP projects containing up to
1000 genomes and then visualized by GrapeTree EB.
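As an illustration of the classical approach, the following Python sketch applies Kruskal's algorithm to a small table of pairwise allelic distances. It is not GrapeTree's implementation: the distance values are invented, the function name is hypothetical, and ties between equal distances are broken simply by node index rather than by any biologically motivated heuristic.

```python
def kruskal_mst(n_nodes, distances):
    """Return MST edges (i, j, weight) from a dict {(i, j): weight} with i < j."""
    parent = list(range(n_nodes))

    def find(x):
        # Union-find lookup with path compression.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    mst_edges = []
    # Visit candidate edges from shortest to longest; ties fall back to node index.
    for (i, j), weight in sorted(distances.items(), key=lambda kv: (kv[1], kv[0])):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:          # the edge joins two separate components
            parent[root_i] = root_j
            mst_edges.append((i, j, weight))
    return mst_edges


# Invented allelic distances among four STs (nodes 0..3).
distances = {
    (0, 1): 1, (0, 2): 2, (1, 2): 1,
    (0, 3): 7, (1, 3): 7, (2, 3): 6,
}
mst = kruskal_mst(4, distances)
```

The three closely related STs are chained together by single-allele edges, and the distant ST is attached through its shortest available link.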
MSTree V2 is a novel minimum spanning tree which is better
suited for handling missing data than are classical MSTrees. The
workflow involved in calculating MSTree V2 is summarized in
Supplemental Figure S2. First, a directed minimal spanning arbo-
rescence (dMST) (Edmonds 1967) is calculated from asymmetric
(directional) distances with tie-breaking of coequal branches based
on allelic distances from a harmonic mean. Local branch recrafting
is subsequently performed to eliminate the spurious branches that
can arise within minimum spanning trees. Further details are pro-
vided in Methods.
Layout and tree manipulation
Complex trees are difficult to visualize with clarity. In order to ad-
dress this issue, GrapeTree initially collapses branches if there are
more than 20,000 nodes and then uses a static layout that splits
the tree layout task into a series of sequential node layout tasks
in an attempt to prevent overlapping child nodes (Supplemental
Fig. S3). Our implementation (Supplemental Material) provides a
solution to this task in linear time complexity. The resulting layout
can be further adjusted by a dynamic layout on the entire tree or on
selected subtrees, using the force-directed algorithm (Dwyer 2009)
in the JS D3 library (Supplemental Material). Users can also manu-
ally enforce a preferred layout by rotating selected nodes and
branches.
Figure 3. GrapeTree (SA) interface exemplified with a precalculated Newick tree based on 1610 Ebola genomes from the West African epidemic of 2013–2016. The tree and metadata were retrieved from microreact.org (https://microreact.org/project/west-african-ebola-epidemic), including a column designated "collection_date". A new data column ("year-month", upper right) was added to the metadata panel that contained the year and month information from "collection_date", and this column was used to color-code the visualization as a temporal gradient (key, lower right). Branches spanning <0.22 substitutions per site were collapsed for clarity. The data indicate progressive radiation from a central source, consistent with published findings (http://www.nextstrain.org/ebola) (Dudas et al. 2017). An interactive version of this figure and metadata can be found at http://bit.ly/2EUkEKp.
Many other visual aspects of a GrapeTree can also be custom-
ized by the user (Fig. 2). In particular, complex trees with numer-
ous nodes can be simplified by manually collapsing branches
connecting subsets of related nodes or by setting a global threshold
of differences below which all related branches are collapsed
(Supplemental Material). The relationship between node size and
numbers of entries can be adjusted in absolute terms or by adjust-
ing the kurtosis (Supplemental Material).
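The idea of collapsing all branches below a global threshold can be sketched in Python as follows. This is an illustrative assumption about how such collapsing could work, not the procedure described in the Supplemental Material: the tree is given as a list of (node, node, length) edges, nodes joined by branches shorter than the threshold are merged with a union-find structure, and the function name and example tree are hypothetical.

```python
def collapse_short_branches(edges, threshold):
    """Merge nodes joined by branches shorter than threshold.

    Returns (groups, kept_edges): groups maps a representative node to the
    list of original nodes merged into it; kept_edges are the remaining
    branches, expressed between representatives.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, length in edges:
        root_u, root_v = find(u), find(v)
        if length < threshold and root_u != root_v:
            parent[root_u] = root_v      # collapse this branch

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), []).append(node)
    kept_edges = [(find(u), find(v), length)
                  for u, v, length in edges if length >= threshold]
    return groups, kept_edges


# Invented tree: A-B and C-D are short branches, B-C and D-E are long.
tree_edges = [("A", "B", 1), ("B", "C", 5), ("C", "D", 2), ("D", "E", 9)]
groups, kept = collapse_short_branches(tree_edges, threshold=3)
```

With a threshold of 3, the five original nodes are reduced to three displayed nodes connected by the two long branches.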
Trees can be manipulated manually with a mouse by dragging
branches, clicking on buttons, entering numbers into text boxes,
or choosing settings through sliders. Additional options appear af-
ter right clicking. The metadata columns from the metadata table
that are used for the presentation of text labels or node colors can
be freely chosen from drop-down lists. Right clicking on the key ta-
ble allows changes in presentation, including color codes. Options
under branch length allow branches with lengths above a given
threshold to be cropped or hidden, as in Figure 3. It is also possible
to toggle the display of branch lengths and/or node labels.
Outputs
GrapeTree can export the current state of the browser window as a
JSON file for use in future GrapeTree sessions. The JSON file in-
cludes both the tree layout and all metadata and facilitates sharing
of GrapeTree sessions between collaborators or with the general
public. The screen figure can be independently exported for ma-
nipulation with other software in Scalable Vector Graphics (SVG)
format, and the underlying phylogenetic tree can be exported in
Newick tree format. GrapeTree supports saving local metadata in
tab-delimited text format. GrapeTree EB can also upload modified
trees and metadata to EnteroBase and provide URLs for their public
access via EnteroBase.
Algorithms and performance
Comparative analyses with simulated data
We compared the accuracy of MSTree V2 against that of a classical
MSTree as implemented by goeBURST (Francisco et al. 2009) on
the basis of Kruskal's algorithm. We also compared these results
with the intermediate MSTree (dMST) calculated with Edmonds'
algorithm within GrapeTree prior to local branch recrafting.
These results were compared with the accuracy of NJ trees as a rep-
resentative of phylogenetic approaches. All algorithms were tested
on pairwise distance matrices calculated from simulated MLST
data from 2000 loci of known evolutionary history and spanning
a wide variety of genetic diversity (Methods\Data simulations).
The results were tested for precision/specificity (frequency of
true positives) as well as sensitivity (inverse frequency of false neg-
atives) by comparing the calculated topologies against the known
history of evolutionary changes in the simulated data (Fig. 4A).
Calculated topologies with different branching order were scored
as false positives (Fig. 4B). Similarly, quartets in which more
than two branches descended from a single node (polytomies)
(Fig. 4C) were scored as false negatives because only binary branch
splits had been allowed within the simulations.
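The quartet scoring described above can be sketched in Python under some stated assumptions. Each quartet topology is represented as an unordered pair of unordered pairs (the split), an unresolved quartet (polytomy) is represented by None, and precision and sensitivity are taken to be the standard ratios TP/(TP+FP) and TP/(TP+FN). The representation and function names are hypothetical and may differ from the procedure used for Figure 4.

```python
def split(a, b, c, d):
    """Quartet topology ab|cd as an order-independent value."""
    return frozenset({frozenset({a, b}), frozenset({c, d})})


def score_quartets(true_quartets, inferred_quartets):
    """Return (precision, sensitivity) for paired true and inferred quartets."""
    tp = fp = fn = 0
    for truth, inferred in zip(true_quartets, inferred_quartets):
        if inferred is None:        # polytomy: no binary split was called
            fn += 1
        elif inferred == truth:     # same branching order as the known history
            tp += 1
        else:                       # resolved, but with a different branching order
            fp += 1
    precision = tp / (tp + fp) if (tp + fp) else float("nan")
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    return precision, sensitivity


# Six copies of the same true quartet, scored against invented inferences.
truth = [split(1, 2, 3, 4)] * 6
inferred = [
    split(1, 2, 3, 4), split(2, 1, 4, 3), split(3, 4, 1, 2),  # three correct
    split(1, 3, 2, 4),                                        # one wrong split
    None, None,                                               # two polytomies
]
precision, sensitivity = score_quartets(truth, inferred)
```

In this toy case the polytomies lower sensitivity without affecting precision, which is the pattern that distinguishes a conservative algorithm from an inaccurate one.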
MSTree V2 was associated with very high precision (>0.95), al-
most as high as that manifested by NJ (Fig. 4D). Somewhat lower
levels of precision were measured for goeBURST and dMST, rang-
ing from 0.93 at low allelic distances down to 0.83 at greater allelic
distances. Sensitivity was also very high for NJ (almost 1.0), but
much lower for either MSTree V2 (0.65) or the classical MSTree
algorithms (0.7).
We also compared precision and sensitivity with increasing
proportions of missing data by modifying the input distance ma-
trix calculated by goeBURST. To this end, missing alleles were re-
placed with 0, which forces missing values to be treated as an
additional allele (designated goeBURST[a]) or encoded as “–”,
which excludes the comparison of that locus from pairwise dis-
tances between profiles (designated goeBURST[i]). The results
showed that MSTree V2 and NJ maintained high precision even
up to 50% missing data (Fig. 4E). The precision of goeBURST and
dMST was lower at all levels of missing data, ranging down to 0.5
precision at 50% missing data for goeBURST[a]. Sensitivity was
slightly reduced by missing data for all algorithms, including NJ,
and once again, the lowest levels of sensitivity were observed for
MSTree V2.
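The two treatments of missing alleles compared above can be illustrated with a short Python sketch. It assumes that missing alleles are marked with an ASCII hyphen, that real allele numbers are positive integers, and that distances are simple counts of differing loci; the function names and profiles are hypothetical and are not taken from goeBURST or GrapeTree.

```python
MISSING = "-"   # assumed marker for a missing allele


def as_extra_allele(profile):
    """Recode missing alleles as 0 so they behave like one additional allele."""
    return tuple(0 if allele == MISSING else allele for allele in profile)


def distance_missing_as_allele(p, q):
    """Count differing loci after recoding missing data as allele 0 (the [a] treatment)."""
    p, q = as_extra_allele(p), as_extra_allele(q)
    return sum(a != b for a, b in zip(p, q))


def distance_ignoring_missing(p, q):
    """Count differing loci, skipping any locus missing in either profile (the [i] treatment)."""
    return sum(a != b for a, b in zip(p, q) if MISSING not in (a, b))


# Invented six-locus profiles; 'partial' lacks calls at four loci.
complete = (1, 2, 3, 4, 5, 6)
other = (1, 9, 3, 4, 5, 7)
partial = (1, 2, "-", "-", "-", "-")

d_allele = distance_missing_as_allele(partial, complete)
d_ignore = distance_ignoring_missing(partial, complete)
```

Treating missing data as an extra allele inflates the distance from the incomplete profile to every other profile, whereas ignoring missing loci shrinks those distances so that the incomplete profile appears close to everything.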
Comparison of NJ, MSTree, and MSTree V2 on real data
We also examined the behavior of these algorithms with a relatively uniform group of 219 genomes from related serovars within the S. enterica Para C Lineage, including one ancient Paratyphi C genome which contained large amounts of missing data (Zhou et al. 2018). A maximum-likelihood phylogenetic tree of nonrecombinant SNP data (Fig. 5A) placed the 800-yr-old ancient DNA (red node) on an early side branch predating the most recent common ancestor of modern members of serovar Paratyphi C. However, cgMLST data yielded classical MSTrees of different topologies, possibly due to the extent of missing data in the ancient genome. goeBURST[a] assigned the ancient genome to a spurious long branch extending sideways from Paratyphi C (Fig. 5B), while goeBURST[i] collapsed all branch distances, making it difficult to distinguish the individual serovars, and assigned the ancient genome to the center of the entire tree (Fig. 5C). In our experience, the classical minimum spanning tree algorithm generally draws faulty topologies when confronted with missing data and usually erroneously places nodes with extensive missing data in central positions.

Figure 4. Precision and sensitivity of trees calculated by different algorithms from simulated allelic data. Trees were calculated for 100 replicates from each of 24 simulated phylogenies that differed in substitution rates (0.00001–0.07). (A–C) Cartoon trees demonstrating the true topology (A), low precision due to false positives (B), and low sensitivity due to false negatives (C). (D) Average sensitivity vs. precision in the absence of missing data after quartet analysis of branches calculated by NJ, goeBURST, the dMST intermediate stage prior to local branch recrafting in MSTree V2, and the full MSTree V2 algorithm including local branch recrafting. Values calculated by the quartet analyses were assigned to four bins according to allelic distances as indicated in the key. (E) Average sensitivity vs. precision after quartet analysis of branches calculated with different levels of random missing data for substitution rate 0.00005. goeBURST was forced to treat missing values as additional alleles by encoding them as 0 (goeBURST[a]) or to ignore them by encoding them as “–” (goeBURST[i]; defaults in MSTree). Values calculated by the quartet analyses were assigned to six bins according to the proportion of missing data as indicated in the key.
Although there were subtle differences in branch lengths and
detailed topologies between the MSTree V2 tree of cgMLST data
(Fig. 5D) and the SNP tree (Fig. 5A), the general clustering into dis-
crete groups was more or less concordant, as was the position of the
ancient genome. Phylogenies of cgMLST alleles with NJ (Fig. 5E) or
RapidNJ (Fig. 5F) yielded similar topologies for modern genomes to
that of the SNP tree but incorrectly placed the aDNA Ragna ge-
nome near the base of the branch leading to Paratyphi C, rather
than on a more recent side-branch.
Discussion
Our analyses of simulated data showed higher precision with
MSTree V2 than with dMST, which demonstrates the importance
of local branch recrafting in MSTree V2 for the accuracy of calls
(Fig. 4). Precision was also slightly higher for dMST than for either
of the goeBURST algorithms at intermediate levels of missing data,
which demonstrates that the directed MST approach adopted in
MSTree V2 contributes to improved accuracy. The trade-off is
that this high precision is accompanied by a slightly lower sensitiv-
ity than is true of classical minimum spanning trees.
Figure 5. Comparisons of different topologies produced by six algorithms when extensive missing data are present. Trees were calculated from 20,114 nonrecombinant, core genomic SNPs (A) or 3002 loci in the cgMLST V2 scheme (Alikhan et al. 2018) (B–F) that were found in 218 modern genomes from Salmonella serovars Paratyphi C, Typhisuis, or Choleraesuis (Zhou et al. 2018). The modern genomes were supplemented by one ancient genome (Ragna; red) that had been reconstructed from an 800-yr-old skeleton. The algorithms used were maximum likelihood (A, RAxML), MSTree (B, goeBURST[a]; C, goeBURST[i]), MSTree V2 (D, GrapeTree), NJ (E, FastME), or RapidNJ (F, RapidNJ), and all trees were visualized in GrapeTree SA. Nodes are color-coded by serovar. Due to fragmentation in the ancient Ragna DNA and intermediate levels of genome coverage, cgMLST alleles in Ragna could only be called for 215 (<10%) of the 3002 cgMLST loci, and the remainder of the cgMLST alleles were scored as missing data. Similarly, only 19,245 (96%) of the SNPs could be called in the Ragna genome. The Ragna genome is on a side-branch that diverged prior to the coalescence of the crown branch leading to modern Paratyphi C and differs from that coalescent by 263 SNPs. The correct position and branch length of the Ragna branches are as shown in A. Ragna is on an artificial, long terminal branch in B because all missing data count as different alleles. Ragna is central in part C because it is 215 cgMLST allele differences from all modern genomes and therefore forms an artificial central hub for all the genomes. MSTree V2 (D) maps Ragna to a tiny side branch preceding the Paratyphi C coalescent, similar in topology to how it was mapped on the basis of SNPs (A). However, NJ (E) and RapidNJ (F) mapped Ragna incorrectly near the base of the long, main branch leading to the crown group of Paratyphi C. The mapping of Ragna to the main branch rather than on its own side branch resulted because those algorithms calculated a negative distance from Ragna to the main branch. Interactive versions of each tree are available at (A) http://bit.ly/2vuFIIb, (B) http://bit.ly/2HF5tYt, (C) http://bit.ly/2qDD3GT, (D) http://bit.ly/2JRBvkQ, (E) https://bit.ly/2B6IS7v, and (F) https://bit.ly/2z2LWRb.

Balanced versus unbalanced quartets
Classical minimum spanning trees and MSTree V2 yielded considerably
lower sensitivity (more false negatives) with the simulated
data than did NJ. We suspected that this observation might reflect
the use of inferred hypothetical ancestral nodes for node clustering
by NJ because minimum spanning trees simply join nearest neigh-
bors without calculating a possibly shorter path via hypothetical
ancestral nodes. In that case, the low sensitivity of minimum span-
ning trees might be restricted to particular branching patterns
rather than representing a general phenomenon. To test this hy-
pothesis, we compared the accuracy of minimum spanning tree al-
gorithms (and of NJ) between the balanced and unbalanced
quartets within the simulated data (Supplemental Fig. S4). All algo-
rithms yielded very high precision and sensitivity with balanced
quartets (Supplemental Fig. S4C). NJ also yielded high precision
and sensitivity for unbalanced quartets, unlike goeBURST or
dMST, where sensitivity was always low and precision decreased
with greater allelic diversity. The sensitivity with unbalanced quar-
tets was even (slightly) lower with MSTree V2, but in this case, pre-
cision remained quite high (Supplemental Fig. S4D).
We attribute the observed low sensitivity of minimum span-
ning trees with unbalanced quartets to ambiguities in joining node
4 to the group of nodes 1, 2, and 3 when node 4 is equidistant from
all three other nodes (Supplemental Fig. S4A,B). A classical MSTree
attempts to connect node 4 to the founder node, which is node 1
or 2 in an unbalanced quartet, due to its reliance on the eBURST
heuristic for choosing between equidistant pairs of nodes. At low
levels of genetic divergence, most allelic differences reflect single
nucleotide changes, and the behavior of a classical MSTree is likely
to be correct (higher precision). At higher sequence divergence,
the eBURST heuristic is no longer as appropriate, because allelic
differences may well result from multiple mutations. Multiple mu-
tations result in lessened consistency between allelic distances and
the numbers of mutational events and correspondingly lower
precision.
In contrast, MSTree V2 does not use the eBURST heuristic but
instead breaks branches representing unbalanced splits during the
branch recrafting stage and rejoins them to the centroid nodes in
the subtrees (Supplemental Fig. S5), which removes most inaccu-
rate topologies and improves precision. The slightly lower sensitiv-
ity of MSTree V2 in comparison to classical MSTree algorithms (Fig.
4) likely reflects the fact that, while MSTree V2 removes erroneous
topologies, it simply makes no attempt whatsoever to resolve to-
pologies within unbalanced quartets.
Speed and memory requirements
Our observations show that phylogenetic topologies and branch
lengths are more accurately depicted by NJ trees or by other true
phylogenetic methods than by MSTrees. For GrapeTree users, we
would recommend using the maximum likelihood algorithm on
SNPs when possible. However, EnteroBase limits such analyses to
a maximum of 1000 closely related genomes in order not to ham-
per its performance for multiple users. We therefore recommend
using GrapeTree SA for larger SNP projects. We note, however,
that handling SNP distance matrices from more than 5000 ge-
nomes remains problematical (Mazariegos-Canellas et al. 2017).
GrapeTree can handle large data sets. Its implementations of
MSTree and MSTree V2 have a time complexity of O(n²), and
GrapeTree stores the calculated pairwise genetic differences in a
highly efficient data structure (Python NumPy). We quantified
the time and memory requirements of multiple algorithms by us-
ing GrapeTree SA in command line mode to calculate trees based
on increasing numbers of Salmonella cgMLST STs, each of which
includes 3002 integer values (Fig. 6). An NJ cgMLST tree from
10,000 genomes took over 4 h to calculate (Fig. 6A). In contrast,
calculating a distance matrix for up to 10,000 STs required only
a few minutes and <10 GB of RAM with the MSTree, MSTree V2,
or RapidNJ algorithms (Fig. 6A,B). Laptops running under MacOS
or Windows 10 could readily handle 8000 STs (Fig. 6C,D). For larg-
er data sets, we would recommend using multiple parallel process-
es on a Linux server. With five processes, our server could handle
100,000 STs in less than 700 min and used a maximum of 300
GB of RAM (Fig. 6A,B). We would recommend using MSTree V2
over RapidNJ for interacting with large data sets because it allows
ready visualization of many details which are obscured in phylo-
grams containing large numbers of nodes (Fig. 1). MSTree V2 with-
in GrapeTree EB also provides the ability to rapidly drill down from
very large data sets containing missing data to clusters of scientific
interest (Supplemental Fig. S1), which is not readily possible with
other approaches.
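As a rough illustration of the quadratic scaling described above, the following sketch estimates the size of a single dense pairwise matrix. The function name and the two bytes per entry are assumptions made for this example and are not taken from the GrapeTree implementation; measured memory use (Fig. 6B) depends on further factors that are not modelled here.

```python
def dense_matrix_gib(n_profiles: int, bytes_per_entry: int = 2) -> float:
    """Approximate size in GiB of one dense n x n pairwise matrix.

    bytes_per_entry is a hypothetical storage width chosen for illustration.
    """
    return n_profiles * n_profiles * bytes_per_entry / 2**30


small = dense_matrix_gib(10_000)
large = dense_matrix_gib(100_000)
# A tenfold increase in the number of STs gives a hundredfold larger matrix.
print(round(large / small))
```

The point of the sketch is only the growth rate: whatever the storage width, the number of entries grows with the square of the number of STs.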
Conclusions
Core genome MLST provides a feasible approach for providing
public access to hundreds of thousands of bacterial genotypes at
the genomic level (Alikhan et al. 2018). Access to such databases
will facilitate international collaboration and support the global
surveillance of bacterial pathogens. A major current bottleneck
has been the lack of tools that can handle such data sets for the elu-
cidation of genetic relationships and the visualization of clusters of
related genotypes plus their metadata.
GrapeTree now allows users to explore the fine-grained
population structure and phenotypic properties of large numbers
of genomes in a web browser. GrapeTree SA is a stand-alone pro-
gram that provides bioinformaticians with a tool for rapidly
Figure 6. Time and memory required for different algorithms to calcu-
late genetic relationships from cgMLST STs of Salmonella. Each point rep-
resents the average time and memory (three replicates) for GrapeTree in
command line mode to calculate a tree from an independent random sub-
set of 96,108 cgMLST STs from the EnteroBase Salmonella database.
Exceptionally, the rightmost points in C and D represent only single replicates,
and only samples of 10,000 cgMLST STs were tested with NJ
(FastME V2 implementation). (A,B) Time and memory profiles using five
processes within a Linux machine. (C,D) Time and memory profiles of
the MSTree V2 algorithm using various OS platforms. The Windows work-
station was unable to complete calculations with >8000 cgMLST STs, pos-
sibly due to insufficient RAM. The Windows and MacOS workstations each
contained four cores and 8 GB of RAM, whereas the Linux workstation con-
tained 40 cores running at 2 GHz and 1 TB of RAM.
investigating the relationships of genomes of interest by NJ or
minimal spanning trees of SNPs or MLST data. Customized ver-
sions of GrapeTree provide the same graphical front-end function-
ality to EnteroBase (Alikhan et al. 2018) and BIGSdb, thereby
providing access to cgMLST schemes from most of the major bac-
terial pathogens. GrapeTree supports the input of data from a vari-
ety of sources and export to a variety of formats, thus empowering
the public exploitation and sharing of genomic data by nonbioin-
formaticians.
Methods
Detailed explanation of novel aspects of MSTree V2
Calculation of asymmetric distances
In order to handle missing data correctly, MSTree V2 implements a
directional measure based on normalized, asymmetric Hamming-like
distances, d(u → v), between pairs of STs. This approach assumes
that one of the pair of STs is the ancestor of the other and
treats missing data as deletions from the ancestor to the
descendant.
Given a set of STs S and a profile π(s) for each ST with a set
of loci L, we define d(u → v) between an ordered pair of two STs
(u, v) ∈ S as

$$
d(u \to v) = \frac{\sum_{l \in L} \mathbb{1}\{(\pi_l(u) \neq \pi_l(v)) \wedge (\pi_l(v) \neq 0)\}}{N_v},
$$

with $N_v = \sum_{l \in L} \mathbb{1}\{\pi_l(v) \neq 0\}$ and assuming 0 to be a missing value in all
π. All possible values of these distances for each locus in the
calculation of d(u → v) are illustrated in Supplemental Figure S2B. Note
that these distances do not form a metric, because d(u → v) ≠ d(v → u)
when missing values are present. We can then define a fully connected
graph G(V, E) with V = S and directed edges (u → v) ∈ E
weighted by their distance. By analogy to a minimum spanning
tree for undirected graphs, we compute a directed minimum spanning
tree (dMST, also designated minimal spanning arborescence)
on G in polynomial time with Tarjan’s rapid implementation
(Tarjan 1977) of Edmonds’ algorithm (Edmonds 1967), using the
Edmonds-alg package (http://edmonds-alg.sourceforge.net/).
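The asymmetric distance defined above can be sketched in a few lines of Python. This is a minimal illustration under the definition in the text (allele value 0 means missing); the function name and the toy profiles are assumptions for the example and do not reproduce the GrapeTree source code.

```python
from typing import Sequence

MISSING = 0  # the text treats allele value 0 as a missing value


def asymmetric_distance(u: Sequence[int], v: Sequence[int]) -> float:
    """d(u -> v): fraction of the loci called in v at which u differs from v."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same loci")
    n_v = sum(1 for b in v if b != MISSING)  # N_v, loci present in v
    if n_v == 0:
        raise ValueError("v has no called loci, so d(u -> v) is undefined")
    differing = sum(1 for a, b in zip(u, v) if b != MISSING and a != b)
    return differing / n_v


u = [1, 2, 3, 4]
v = [1, 5, 0, 4]  # third locus is missing in v
print(asymmetric_distance(u, v))  # one difference over the three loci called in v
print(asymmetric_distance(v, u))  # two differences over the four loci called in u
```

The two calls return different values for the same pair of profiles, which is the asymmetry that prevents these distances from forming a metric when missing values are present.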
Harmonic tie-breaking
During the construction of an MSTree, Bionumerics or goeBURST
chooses between multiple co-optimal branches by tie-breaking ac-
cording to the principles of eBURST (Feil et al. 2004) as summa-
rized and extended by Francisco et al. (2009). The eBURST
approach presumes that a clonal complex (lineage) is founded by
a founder genotype and that genetic variants of that founder re-
flect the progressive accumulation of additional variations over
time. A further implicit belief is that the number of variants de-
creases with distance from the founder genotype, such that the
founder is equated with the central genotype with the greatest
number of single locus variants, and edges between nodes are or-
dered based on their allelic distances. In case of a tie for direction-
ality of connections, the founder status is assigned to the node
with the greater number of single locus variants, double locus var-
iants, triple locus variants, and/or number of strains assigned to
that ST.
At cgMLST levels of resolution, the founder genotype may
not be present in a comparison, which renders the eBURST model
inappropriate for tie-breaking. Instead of depending on the pre-
conceived properties of a theoretical founder genotype, MSTree
V2 simply chooses central nodes between multiple co-optimal
branches on the basis of the harmonic mean of allelic distances.
We define a centroid genotype, which is the genotype for any
given population that has the smallest average allelic distance to
all other genotypes in the same population. The harmonic mean
of the allelic distances is used rather than an arithmetic mean in
order to give higher weights to variants with smaller allelic distanc-
es to other STs. In a fully connected graph G(V,E) as defined above,
we define the harmonic mean ht(u) of allelic distances for any
node u ∈ V to other nodes as

$$
\mathrm{ht}(u) = \left( \frac{\sum_{v \in V,\, u \neq v} d(u \to v)^{-1}}{|V| - 1} \right)^{-1}.
$$
All directed edges d(u → v) are ordered in ascending order according
to ht(u), with the frequency of occurrence of u as the final
tie-break. This ordering results in a unique and optimal dMST with
Edmonds’ algorithm. Furthermore, since we have a fully connected
graph and d satisfies the triangular inequality, the length of the
shortest (geodesic) path between any two vertices u and v is given
by d(u → v).
We note that ht(u)⁻¹ is also known as “closeness centrality” in
network science (Newman 2010). Closeness centrality is usually
defined for unweighted graphs as the inverse of the mean distance
between vertices. However, some interesting properties arise when
it is defined in our sense as the inverse of the harmonic mean
distance between vertices: ht(u)⁻¹ gives more weight to vertices that
are close to the vertex of interest than to those far away, and it
can also naturally deal with disconnected components.
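A minimal sketch of the harmonic tie-break follows. It assumes that co-optimal edges are those with equal allelic distance and that, among them, edges leaving the node with the smaller ht(u) come first; the final tie-break on the frequency of u is omitted. The function names, the handling of zero distances, and the toy distances are assumptions for illustration only.

```python
from typing import Dict, Hashable, List, Sequence, Tuple

Edge = Tuple[Hashable, Hashable]


def harmonic_mean_distance(u: Hashable, nodes: Sequence[Hashable],
                           dist: Dict[Edge, float]) -> float:
    """ht(u): harmonic mean of d(u -> v) over all other nodes v."""
    others = [v for v in nodes if v != u]
    if not others:
        raise ValueError("at least two nodes are required")
    inverse_sum = 0.0
    for v in others:
        d = dist[(u, v)]
        if d == 0:
            return 0.0  # assumption: use the limiting value of the harmonic mean
        inverse_sum += 1.0 / d
    return len(others) / inverse_sum


def order_edges(nodes: Sequence[Hashable], dist: Dict[Edge, float]) -> List[Edge]:
    """Sort directed edges by distance, breaking ties by ht of the source node."""
    ht = {u: harmonic_mean_distance(u, nodes, dist) for u in nodes}
    edges = [(u, v) for u in nodes for v in nodes if u != v]
    return sorted(edges, key=lambda e: (dist[e], ht[e[0]]))


# Toy example: A is one step from B and C, which are two steps from each other.
nodes = ["A", "B", "C"]
symmetric = {("A", "B"): 1.0, ("A", "C"): 1.0, ("B", "C"): 2.0}
dist = {}
for (a, b), d in symmetric.items():
    dist[(a, b)] = d
    dist[(b, a)] = d

print(harmonic_mean_distance("A", nodes, dist))
print(order_edges(nodes, dist)[:2])  # edges leaving the central node A come first
```

In this toy case the central node A has the smallest harmonic mean, so among the four edges of length 1 the two that start at A are ranked first.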
Local branch recrafting
Edmonds’ algorithm attempts to minimize the sum of the edge
lengths in the tree. However, the resulting dMST does not neces-
sarily represent true phylogenetic relationships between strains
because allelic distances do not always correlate with divergence
time. We therefore implemented a subsequent branch optimiza-
tion step that accounts for these discrepancies. Algorithm 1 gives
an overview of the local branch recrafting (see also Supplemental
Fig. S5), starting from the already computed dMST(V,E), where
E is a distance matrix sorted in ascending order of allelic distances,
and a forest F where each u ∈ V is a single tree t(u) ∈ F.
Optimizations are applied to both ends of each edge in the
dMST(V,E) iteratively as shown in Supplemental Figure S5D. The
TargetNodes() function picks a subset of the nodes in tree t(u) which
are the centroids and the nodes that are directly connected to u
Algorithm 1 Local branch recrafting
Input: Initial edge (u → v), Tree t(u) ∈ F, harmonic tiebreaker ht
Output: New edge (u′ → v′)
1: Initialize u′ = u
2: for each node w ∈ TargetNodes(t(u′)) do
3:     P(M_A), P(M_B) = ModelSelection(d(u′ → w), d(w → u′), d(u′ → v), d(w → v))
4:     if P(M_A) ≥ P(M_B) ∧ ht(u′) > ht(w) then
5:         u′ = w
6:     else if P(M_A) < P(M_B) ∧ d(u′ → v) > d(w → v) then
7:         u′ = w
8: Repeat (1–7) on v to obtain v′
9: Return (u′ → v′)
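(Supplemental Fig. S5D). The ModelSelection() function compares
the maximum likelihoods of two models M_A and M_B
(Supplemental Fig. S5B,C). Here we describe only the model selection
process for u. Given d(u → w), d(w → u), d(u → v) and d(w → v),
when assuming d(u → v) ≥ d(w → v), the proportions of invariable
sites in branches λ_A, κ_A, λ_B and κ_B satisfy:

$$
\begin{aligned}
&\operatorname*{argmax}_{0 \le \lambda_A \le 1,\; 0 \le \kappa_A \le 1} \log P(M_A \mid \lambda_A, \kappa_A) \\
&= \operatorname*{argmax}_{0 \le \lambda_A \le 1,\; 0 \le \kappa_A \le 1} \log \big[ P(u \to w \mid \lambda_A)\, P(u \to v \mid \lambda_A, \kappa_A)\, P(w \to v \mid \lambda_A, \kappa_A) \big] \\
&= \operatorname*{argmax}_{0 \le \lambda_A \le 1,\; 0 \le \kappa_A \le 1} \Big[ |L|\, d(u \to w) \log(1 - \lambda_A^2) + |L|\,(1 - d(u \to w)) \log(\lambda_A^2) \\
&\qquad + |L|\, d(u \to v) \log(1 - \lambda_A \kappa_A) + |L|\,(1 - d(u \to v)) \log(\lambda_A \kappa_A) \\
&\qquad + |L|\, d(w \to v) \log(1 - \lambda_A \kappa_A) + |L|\,(1 - d(w \to v)) \log(\lambda_A \kappa_A) \Big]
\end{aligned}
\tag{1}
$$

$$
\begin{aligned}
&\operatorname*{argmax}_{0 \le \lambda_B \le 1,\; 0 \le \kappa_B \le 1} \log P(M_B \mid \lambda_B, \kappa_B) \\
&= \operatorname*{argmax}_{0 \le \lambda_B \le 1,\; 0 \le \kappa_B \le 1} \log \big[ P(w \to u \mid \lambda_B)\, P(u \to v \mid \lambda_B, \kappa_B)\, P(w \to v \mid \lambda_B, \kappa_B) \big] \\
&= \operatorname*{argmax}_{0 \le \lambda_B \le 1,\; 0 \le \kappa_B \le 1} \Big[ |L|\, d(w \to u) \log(1 - \lambda_B) + |L|\,(1 - d(w \to u)) \log(\lambda_B) \\
&\qquad + |L|\, d(u \to v) \log(1 - \lambda_B \kappa_B) + |L|\,(1 - d(u \to v)) \log(\lambda_B \kappa_B) \\
&\qquad + |L|\, d(w \to v) \log(1 - \kappa_B) + |L|\,(1 - d(w \to v)) \log(\kappa_B) \Big],
\end{aligned}
\tag{2}
$$

where L is a set of loci in an MLST profile. Note that the direction of
the distances between u and w are different in the two equations.
Model A assumes u as the centroid node and adopts d(u → w) in
Equation 1, whereas model B treats w as the centroid and thus
uses d(w → u). We further denote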
x=1(1 d(wu))(1 d(wv)) +(1 d(uv))
2.
Then, the parameters in Equations 1 and 2 are calculated as
$$
\lambda_A = \sqrt{1 - d(u \to w)}, \qquad
\kappa_A = \frac{1 - \tfrac{1}{2}\,\big(d(u \to v) + d(w \to v)\big)}{\lambda_A}
$$
lB=1+xd(wu)
d(uv)2x
kB=1+xd(wv)
d(uv)2x.
These parameters can then be used to calculate P(M_A) and P(M_B)
using Equations 1 and 2.
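The model comparison can be illustrated with a short Python sketch. It evaluates the log-likelihoods of Equations 1 and 2 and then applies one reading of steps 1 to 7 of Algorithm 1 to a single edge end. Several points are assumptions of this sketch rather than the published method: model A is maximized directly from Equation 1 (λ_A² = 1 − d(u → w), with λ_Aκ_A set to one minus the mean of the other two distances, ignoring the bound κ_A ≤ 1), model B is maximized by a brute-force grid search instead of the closed-form parameters, and all function names are hypothetical.

```python
import math
from typing import Dict, Hashable, Iterable, Tuple

EPS = 1e-9  # keeps every log() argument strictly inside (0, 1)


def _term(d: float, p_same: float) -> float:
    """Per-locus contribution d*log(1 - p) + (1 - d)*log(p)."""
    p = min(max(p_same, EPS), 1.0 - EPS)
    return d * math.log(1.0 - p) + (1.0 - d) * math.log(p)


def loglik_model_a(d_uw: float, d_uv: float, d_wv: float, n_loci: int) -> float:
    """Equation 1 at its maximum (u treated as the centroid)."""
    lam_sq = 1.0 - d_uw                     # lambda_A squared
    lam_kappa = 1.0 - 0.5 * (d_uv + d_wv)   # lambda_A * kappa_A
    return n_loci * (_term(d_uw, lam_sq) + _term(d_uv, lam_kappa)
                     + _term(d_wv, lam_kappa))


def loglik_model_b(d_wu: float, d_uv: float, d_wv: float, n_loci: int,
                   steps: int = 100) -> float:
    """Equation 2 maximized on a grid (w treated as the centroid)."""
    best = -math.inf
    for i in range(1, steps):
        lam = i / steps
        for j in range(1, steps):
            kap = j / steps
            value = _term(d_wu, lam) + _term(d_uv, lam * kap) + _term(d_wv, kap)
            best = max(best, value)
    return n_loci * best


def model_selection(d_uw: float, d_wu: float, d_uv: float, d_wv: float,
                    n_loci: int = 1) -> Tuple[float, float]:
    """Return (log P(M_A), log P(M_B)) at their respective maxima."""
    return (loglik_model_a(d_uw, d_uv, d_wv, n_loci),
            loglik_model_b(d_wu, d_uv, d_wv, n_loci))


def recraft_end(u: Hashable, v: Hashable, targets: Iterable[Hashable],
                d: Dict[Tuple[Hashable, Hashable], float],
                ht: Dict[Hashable, float], n_loci: int) -> Hashable:
    """Choose a possibly better attachment point for the u end of edge (u -> v)."""
    best = u
    for w in targets:
        if w == best or w == v:
            continue
        ll_a, ll_b = model_selection(d[(best, w)], d[(w, best)],
                                     d[(best, v)], d[(w, v)], n_loci)
        if ll_a >= ll_b and ht[best] > ht[w]:
            best = w
        elif ll_a < ll_b and d[(best, v)] > d[(w, v)]:
            best = w
    return best


# Toy distances in which w lies between u and v (0.19 = 1 - 0.9 * 0.9).
d = {("u", "w"): 0.10, ("w", "u"): 0.10, ("u", "v"): 0.19, ("w", "v"): 0.10}
ht = {"u": 0.2, "w": 0.1}
ll_a, ll_b = model_selection(d["u", "w"], d["w", "u"], d["u", "v"], d["w", "v"],
                             n_loci=3002)
print(ll_a < ll_b, recraft_end("u", "v", ["w"], d, ht, n_loci=3002))
```

Because only the ordering of the two likelihoods matters for the decision, the sketch compares log-likelihoods directly. A finer grid or a numerical optimizer could replace the grid search without changing the logic.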
Data simulations
In order to compare various algorithms with MLST data of known
evolutionary history, SimBac (Brown et al. 2016) was used to sim-
ulate the coalescence of 40 genomes of size 2 Mb. One hundred
replicate simulations were performed without homologous re-
combination and assuming a constant population size for each
of 24 different substitution rates ranging from 0.00001 to 0.07.
Simulated MLST data were then obtained by splitting each of the
final 40 genomic sequences into 2000 loci of 900 bp separated
by 100-bp intergenic regions. Each unique locus within the 40 ge-
nomes was assigned a unique allelic integer, and these integers
were used to generate allelic profiles for the simulated genomes
within each replicate. The genetic distances between the 40 allelic
profiles were used to compute classical MSTrees using goeBURST
(Francisco et al. 2009) and a directed MSTree using GrapeTree
MSTree V2. NJ trees were calculated with FastME V2 (Lefort et al.
2015) and RapidNJ (Simonsen et al. 2011). In order to establish
the effects of local branch recrafting, we also extracted the intermediate
result (dMST) from MSTree V2 at the state immediately prior
to recrafting.
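The conversion of simulated genomes into allelic profiles can be sketched as follows. This is an illustrative reconstruction of the procedure described above, not the authors' simulation script (which is in Supplemental Data S3); the function names are assumptions, and allele numbering starts at 1 so that 0 stays free for missing values.

```python
from typing import Dict, List


def split_into_loci(genome: str, locus_len: int = 900, spacer_len: int = 100) -> List[str]:
    """Cut a genome into loci of locus_len separated by spacers of spacer_len."""
    step = locus_len + spacer_len
    return [genome[start:start + locus_len]
            for start in range(0, len(genome) - locus_len + 1, step)]


def allelic_profiles(genomes: List[str], locus_len: int = 900,
                     spacer_len: int = 100) -> List[List[int]]:
    """Assign a unique integer to each distinct sequence observed at each locus."""
    if len({len(g) for g in genomes}) != 1:
        raise ValueError("this sketch assumes genomes of equal length")
    per_genome = [split_into_loci(g, locus_len, spacer_len) for g in genomes]
    n_loci = len(per_genome[0])
    profiles = [[0] * n_loci for _ in genomes]
    for locus in range(n_loci):
        allele_ids: Dict[str, int] = {}
        for g, loci in enumerate(per_genome):
            # The first sequence seen at this locus becomes allele 1, the next new one 2, ...
            profiles[g][locus] = allele_ids.setdefault(loci[locus], len(allele_ids) + 1)
    return profiles


# Toy genomes: two loci of 4 bp separated by a 2-bp spacer.
toy = ["AAAAxxCCCC", "AAAAxxCCCG", "AAATxxCCCC"]
print(allelic_profiles(toy, locus_len=4, spacer_len=2))
```

With the default settings a 2-Mb sequence yields 2000 loci, matching the layout described in the text.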
We also tested the effects of missing data by scoring random
allelic values from simulated data at a substitution rate of 0.00005
as missing values. Ten replicates were performed, spanning the
range from 0 to 40,000 missing values in steps of 8000 of the
80,000 allelic values (0%–50%). These simulated data contained
an average allelic distance of 107 (CI 95%: 6–211). In order to score
the randomly selected values as missing, they were replaced with
0 (goeBURST[a]), which forces missing values to be treated as an
additional allele by goeBURST, or encoded as “–” (goeBURST[i]),
which excludes the comparison of that locus from pairwise dis-
tances between profiles.
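The random scoring of allelic values as missing can be sketched as below. The function name, the fixed seed, and the placeholder profiles are assumptions for illustration; the `missing` argument can be set to 0 or to "-" to mimic the two encodings described above.

```python
import random
from typing import List


def mask_random_alleles(profiles: List[list], n_missing: int,
                        seed: int = 0, missing=0) -> List[list]:
    """Return a copy of profiles with n_missing randomly chosen values set to missing."""
    rng = random.Random(seed)
    n_rows, n_cols = len(profiles), len(profiles[0])
    masked = [list(row) for row in profiles]
    # Sample distinct flat positions so that exactly n_missing cells are replaced.
    for flat in rng.sample(range(n_rows * n_cols), n_missing):
        masked[flat // n_cols][flat % n_cols] = missing
    return masked


# 40 profiles of 2000 loci give 80,000 allelic values (placeholder allele 1 everywhere).
profiles = [[1] * 2000 for _ in range(40)]
for n_missing in range(0, 40_001, 8000):
    masked = mask_random_alleles(profiles, n_missing)
    n_zero = sum(row.count(0) for row in masked)
    print(n_missing, n_zero / 80_000)
```

Sampling without replacement guarantees that each step masks exactly the intended number of values, from 0% to 50% of the data set.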
Data access
Interactive versions of trees in Figure 1: (A) http://bit.ly/2qH06jp;
(B) https://bit.ly/2mDOpbS; (C) http://bit.ly/2H69dkG; (D) https://bit.ly/2LG62Tl.
An interactive version of Figure 3 can be found at http://bit.ly/2EUkEKp.
Trees presented in Figure 5 are available separately: (A) http://bit.ly/2vuFIIb;
(B) http://bit.ly/2HF5tYt; (C) http://bit.ly/2qDD3GT; (D) http://bit.ly/2JRBvkQ;
(E) https://bit.ly/2B6IS7v; and (F) https://bit.ly/2z2LWRb. Trees presented in
Supplemental Figure S1 can be found at http://bit.ly/2vjTn4I and
http://bit.ly/2H8py8F. Figure 1, A and B can be reconstructed by
uploading the trees and metadata files available in Supplemental
Data S1 into iTOL (http://itol.embl.de/). Other interactive figures
can be visualized in GrapeTree SA using the source files in
Supplemental Data S2. Source code and precompiled binaries for
GrapeTree are available as Supplemental Data S3 and also deposited
online at https://github.com/achtman-lab/GrapeTree. Simulation
and evaluation scripts are also available in Supplemental Data
S3 and on GitHub in the folder “simulations.” Online documentation
and a live demo are available at
http://enterobase.readthedocs.io/en/latest/grapetree/grapetree-about.html.
The documentation is also available as Supplemental Data S4.
Acknowledgments
EnteroBase (BBSRC BB/L020319/1) was developed by N-F.A., M.J.S.,
and Z.Z. (equal contributions) under guidance by M.A. Additional
grant support was from the Wellcome Trust (202792/Z/16/Z).
A.P.F., C.V., and J.A.C. were partially funded by BacGenTrack
(TUBITAK/0004/2014; FCT/Scientific and Technological Research
Council of Turkey [Türkiye Bilimsel ve Teknolojik Araştırma
Kurumu, Tübitak]). We thank Philippe Lemey and Joseph Healey
for beta testing and feedback.
Author contributions: Z.Z. and M.J.S. conceived the ideas and
designed methodology and functionality; M.J.S., Z.Z., N-F.A.,
and M.A. designed the look and feel of the GrapeTree GUI;
Z.Z. and N.L. performed simulations and analyzed the data;
A.P.F. and C.V. contributed to the formalization, correctness,
and analysis of algorithms; A.P.F. implemented the command
line version of goeBURST; J.A.C. reviewed concepts and imple-
mentation of MST and associated algorithms. M.A. led the writing
of the manuscript. All authors contributed critically to the drafts
and gave final approval for publication.
References
Alikhan N-F, Zhou Z, Sergeant MJ, Achtman M. 2018. A genomic overview
of the population structure of Salmonella. PLoS Genet 14: e1007261.
Argimon S, Abudahab K, Goater RJ, Fedosejev A, Bhai J, Glasner C, Feil EJ,
Holden MT, Yeats CA, Grundmann H, et al. 2016. Microreact: visualiz-
ing and sharing data for genomic epidemiology and phylogeography.
Microb Genom 2: e000093.
Ashton PM, Owen SV, Kaindama L, Rowe WPM, Lane CR, Larkin L, Nair S,
Jenkins C, de Pinna EM, Feasey NA, et al. 2017. Public health surveil-
lance in the UK revolutionises our understanding of the invasive
Salmonella typhimurium epidemic in Africa. Genome Med 9: 92.
Brown T, Didelot X, Wilson DJ, De MN. 2016. SimBac: simulation of whole
bacterial genomes with homologous recombination. Microb Genom 2.
doi: 10.1099/mgen.0.000044.
Chevenet F, Brun C, Banuls AL, Jacq B, Christen R. 2006. TreeDyn: towards
dynamic graphics and annotations for analyses of trees. BMC
Bioinformatics 7: 439.
Dudas G, Carvalho LM, Bedford T, Tatem AJ, Baele G, Faria NR, Park DJ,
Ladner JT, Arias A, Asogun D, et al. 2017. Virus genomes reveal factors
that spread and sustained the Ebola epidemic. Nature 544: 309–315.
Dwyer T. 2009. Scalable, versatile and simple constrained graph layout.
Eurographics 28. doi: 10.1111/j.1467-8659.2009.01449.x.
Edmonds J. 1967. Optimum branchings. J Res Nat Bur Standards 71B:
233–240.
Feil EJ, Li BC, Aanensen DM, Hanage WP, Spratt BG. 2004. eBURST:
Inferring patterns of evolutionary descent among clusters of related bac-
terial genotypes from Multilocus Sequence Typing data. J Bacteriol 186:
1518–1530.
Francisco AP, Bugalho M, Ramirez M, Carrico JA. 2009. Global optimal
eBURST analysis of multilocus typing data using a graphic matroid ap-
proach. BMC Bioinformatics 10: 152.
Francisco AP, Vaz C, Monteiro PT, Melo-Cristino J, Ramirez M, Carrico JA.
2012. PHYLOViZ: phylogenetic inference and data visualization for se-
quence based typing methods. BMC Bioinformatics 13: 87.
Hadfield J, Croucher NJ, Goater RJ, Abudahab K, Aanensen DM, Harris SR.
2018. Phandango: an interactive viewer for bacterial population
genomics. Bioinformatics 34: 292–293.
He Z, Zhang H, Gao S, Lercher MJ, Chen WH, Hu S. 2016. Evolview v2: an
online visualization and management tool for customized and
annotated phylogenetic trees. Nucleic Acids Res 44: W236–W241.
Huson DH, Bryant D. 2006. Application of phylogenetic networks in
evolutionary studies. Mol Biol Evol 23: 254–267.
Huson DH, Scornavacca C. 2012. Dendroscope 3: an interactive tool for
rooted phylogenetic trees and networks. Syst Biol 61: 1061–1067.
Jolley KA, Maiden MC. 2010. BIGSdb: scalable analysis of bacterial genome
variation at the population level. BMC Bioinformatics 11: 595.
Jolley KA, Maiden MC. 2014. Using multilocus sequence typing to study
bacterial variation: prospects in the genomic era. Future Microbiol 9:
623–630.
Jolley KA, Bliss CM, Bennett JS, Bratcher HB, Brehony C, Colles FM,
Wimalarathna H, Harrison OB, Sheppard SK, Cody AJ, et al. 2012.
Ribosomal multilocus sequence typing: universal characterization of
bacteria from domain to strain. Microbiology 158: 1005–1015.
Kruskal JB. 1956. On the shortest spanning subtree of a graph and the
traveling salesman problem. Proc Am Math Soc 7: 48–50.
Lefort V, Desper R, Gascuel O. 2015. FastME 2.0: a comprehensive, accurate,
and fast distance-based phylogeny inference program. Mol Biol Evol 32:
2798–2800.
Letunic I, Bork P. 2016. Interactive tree of life (iTOL) v3: an online tool for
the display and annotation of phylogenetic and other trees. Nucleic
Acids Res 44: W242–W245.
Maiden MCJ, Bygraves JA, Feil E, Morelli G, Russell JE, Urwin R, Zhang Q,
Zhou J, Zurth K, Caugant DA, et al. 1998. Multilocus sequence typing:
a portable approach to the identification of clones within populations
of pathogenic microorganisms. Proc Natl Acad Sci 95: 3140–3145.
Maiden MC, van Rensburg MJ, Bray JE, Earle SG, Ford SA, Jolley KA,
McCarthy ND. 2013. MLST revisited: the gene-by-gene approach to
bacterial genomics. Nat Rev Microbiol 11: 728–736.
Mazariegos-Canellas O, Do T, Peto T, Eyre DW, Underwood A, Crook D,
Wyllie DH. 2017. BugMat and FindNeighbour: command line and
server applications for investigating bacterial relatedness. BMC
Bioinformatics 18: 477.
Mellmann A, Harmsen D, Cummings CA, Zentz EB, Leopold SR, Rico A,
Prior K, Szczepanowski R, Ji Y, Zhang W, et al. 2011. Prospective geno-
mic characterization of the German enterohemorrhagic Escherichia
coli O104:H4 outbreak by rapid next generation sequencing technology.
PLoS One 6: e22751.
Moura A, Criscuolo A, Pouseele H, Maury MM, Leclercq A, Tarr C, Bjorkman
JT, Dallman T, Reimer A, Enouf V, et al. 2016. Whole genome-based
population biology and epidemiological surveillance of Listeria
monocytogenes. Nat Microbiol 2: 16185.
Nadon C, Van Walle I, Gerner-Smidt P, Campos J, Chinen I, Concepcion-
Acevedo J, Gilpin B, Smith AM, Man KK, Perez E, et al. 2017. PulseNet
International: vision for the implementation of whole genome se-
quencing (WGS) for global food-borne disease surveillance. Euro
Surveill 22: 30544.
Nascimento M, Sousa A, Ramirez M, Francisco AP, Carrico JA, Vaz C. 2017.
PHYLOViZ 2.0: providing scalable data integration and visualization for
multiple phylogenetic inference methods. Bioinformatics 33: 128–129.
Newman MEJ. 2010. Networks: an introduction. Oxford University Press,
Oxford, UK.
Page RD. 1996. TreeView: an application to display phylogenetic trees on
personal computers. Comput Appl Biosci 12: 357–358.
Simonsen M, Mailund T, Pedersen CNS. 2011. Inference of large phyloge-
nies using Neighbour-Joining. In Biomedical Engineering Systems and
Technologies: 3rd International Joint Conference, BIOSTEC 2010.
Communications in Computer and Information Science, Vol. 127, pp.
334–344. Springer Verlag, New York.
Stamatakis A. 2014. RAxML version 8: a tool for phylogenetic analysis and
post-analysis of large phylogenies. Bioinformatics 30: 1312–1313.
Tarjan RE. 1977. Finding optimum branchings. Networks 7: 25–35.
Zhou Z, Lundstrøm I, Tran-Dien A, Duchêne S, Alikhan N-F, Sergeant MJ,
Langridge G, Fokatis AK, Nair S, Stenøien HK, et al. 2018. Pan-genome
analysis of ancient and modern Salmonella enterica demonstrates geno-
mic stability of the invasive para C lineage for millennia. Curr Biol 28:
2420–2428.e10.
Received November 14, 2017; accepted in revised form July 24, 2018.
Zhemin Zhou, Nabil-Fareed Alikhan, Martin J. Sergeant, et al. GrapeTree: visualization of core genomic relationships among 100,000 bacterial pathogens. Genome Res. 2018 28: 1395-1404, originally published online July 26, 2018. Access the most recent version at doi: 10.1101/gr.232397.117
Supplemental Material: http://genome.cshlp.org/content/suppl/2018/08/14/gr.232397.117.DC1
References: This article cites 32 articles, 2 of which can be accessed free at: http://genome.cshlp.org/content/28/9/1395.full.html#ref-list-1
Open Access: Freely available online through the Genome Research Open Access option.
Creative Commons License: This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.
© 2018 Zhou et al.; Published by Cold Spring Harbor Laboratory Press
Cold Spring Harbor Laboratory Press on January 25, 2019 - Published by genome.cshlp.orgDownloaded from
... The loci in the scheme were annotated using the UniprotFinder module. The minimum spanning (MS) tree based on the cgMLST profile of the 26801 meningococcal genomes was reconstructed using the MSTreeV2 algorithm and visualized by GrapeTree software v2.1 [29]. A neighbor-joining (NJ) tree of the CC4821 clade based on the cgMLST profile of 388 isolates was reconstructed using an NJ algorithm and visualized by GrapeTree software v2.1 [29]. ...
... The minimum spanning (MS) tree based on the cgMLST profile of the 26801 meningococcal genomes was reconstructed using the MSTreeV2 algorithm and visualized by GrapeTree software v2.1 [29]. A neighbor-joining (NJ) tree of the CC4821 clade based on the cgMLST profile of 388 isolates was reconstructed using an NJ algorithm and visualized by GrapeTree software v2.1 [29]. ...
Article
Full-text available
Neisseria meningitidis (N. meningitidis) is the causative agent of human invasive meningococcal disease (IMD). Clonal complex (CC) 4821 is a unique genetic cluster of N. meningitidis that emerged two decades ago in Anhui Province, China and became the predominant cluster. However, the evolutionary origin of CC4821 remains unclear. Herein, a distinct CC4821 clade was identified by a comprehensive cgMLST analysis of 26,801 N. meningitidis genomes. The CC4821 clade comprised 388 N. meningitidis isolates, with 364 assigned to CC4821, 1 assigned to CC8, and 23 unassigned (UA), as they could not be assigned to any defined CC. The phylogenetic analysis of the CC4821 clade revealed that six UA isolates, including the UA isolate NmR29026 collected in 1966 from Liaoning Province, China, occupied a basal position compared to all isolates within the CC4821 clade, indicating that CC4821 originated in the 1960s. Eight subclades (clades 1–8) were recognized within the CC4821 clade. Clades 1–4 have been present since the 1970s, while clades 5–8 emerged after the 2000s. Clade 5 represents a hyperinvasive lineage. N. meningitidis isolate HEB85-3, collected in 1985 in Hebei Province, China, exhibited the closest evolutionary relationship to clade 5, suggesting it is related to the origin of this hyperinvasive lineage. Our study reveals that CC4821 has emerged as the predominant cluster of N. meningitidis in China, representing the culmination of at least 60 years of continuous evolution in China, and is not solely attributable to the outbreak two decades ago.
... The minimum spanning tree (MST)-based core-genome multilocus sequence typing (cgMLST) was conducted using GrapeTree 43 . Heatmaps were generated using ImageGP (https://www.bic.ac.cn/BIC/#/). ...
Article
Full-text available
Background Salmonella enterica (S. enterica) causes tens of thousands of cases of diarrheal disease worldwide each year. However, our understanding of the genome and transmission dynamics of S. enterica in Minhang District in Shanghai, China is still insufficient. This study is aimed to better understand the population structure, antibiotic resistance patterns, and evolution dynamics of local strains. Methods We sequenced 458 S. enterica strains from outpatients at Minhang District Central Hospital in Shanghai, China, from 2012 to 2021. Bioinformatics analyses on antibiotic resistance genes, virulence factors, mobile genetic elements, pathogenic islands, and phylogenetic relationships were performed. Results Here we show that two dominant serovars are S. Enteritidis and S. Typhimurium isolated from outpatients in Minhang District in Shanghai, China. A total of 40 serovars and 53 sequence types (STs) are identified, two S. Montevideo strains isolated in 2013 belong to a newly identified ST10844, which is firstly identified in Minhang District in Shanghai, China. More than half of the isolates show resistance to fluoroquinolones and beta-lactams. Notably, 259 (56.6%) of the 458 isolates exhibit a multidrug-resistant pattern. Third-generation cephalosporin resistance gene blaCTX-M-55 is identified in 15 (3.3%) isolates, and fluoroquinolone resistance gene qnrS1 is identified in 42 (9.2%) isolates, both of which are strongly correlated with IS26. Mutations of T57S in ParC and D87Y in GyrA are observed in 149 (32.5%) and 133 (29.0%) isolates, respectively. In addition, phylogenetic analysis confirms the presence of outbreaks caused by S. Enteritidis and S. Typhimurium, respectively. Conclusions These results suggest local expansion and evolution in Salmonella occurred in Shanghai, China, and the underlying emergence of the undefined multidrug-resistant clone. Our findings enlarge the knowledge of local epidemics of Salmonella, especially S. Enteritidis and S. 
Typhimurium in Shanghai, and provide a piece of useful baseline information for future whole-genome sequencing surveillance.
... A minimum spanning tree was generated using Grapetree (v1.5.0, University of Warwick, Coventry, UK) [27], using a 5-SNP cut-off to delineate genomic clusters among the core SNP alignment to identify recent transmission. In silico spoligotyping was conducted using the reads from each isolate with SpoTyping (v2.1, Chinese Academy of Medical Sciences, Beijing, China) [28], and the binary spoligotype code obtained for each isolate was analyzed using the SITVIT2 platform (Pasteur Institute, Paris, France) [29] to identify the family and assign the respective spoligotype international type (SIT). ...
Article
Full-text available
Tuberculosis remains a significant health issue in Mexico, which has one of the highest incidence rates in the Americas. This study aimed to analyze the circulating sublineages, spoligotypes, drug resistance, and transmission patterns of Mycobacterium tuberculosis in Mexico’s Central Western region using whole-genome sequencing. Seventy-seven Mycobacterium tuberculosis strains underwent phenotypic drug susceptibility testing via MGIT. Genotypic resistance was assessed with TB-Profiler and Mykrobe, while phylogenetic relationships were reconstructed using Snippy and RaxML. SpoTyping identified circulating SITs and families, with a 5-SNP threshold defining genomic transmission clusters. The predominant sublineages were 4.1.1.3 (X-type, n = 19) and 4.1.2.1 (LAM, n = 11), with rare sublineages (EAI5, EAI2-Manila, and Beijing) also observed. Resistance to at least one first-line drug was found in 63.3% of strains, with streptomycin mono-resistance (24.5%) being notable. Multidrug-resistant TB was identified in 16.3% (n = 8) of strains. Five genomic clusters, involving 18.7% of strains, were identified. This study highlights the sublineage diversity in Mexico, emphasizing its importance in global databases and resistance research. The findings, such as SIT47 in GC1, underscore the value of localized genomic studies for effective TB control.
Article
Background Necrotizing enterocolitis (NEC), a severe comorbidity of prematurity, is usually sporadic, but occasional outbreaks suggest an infectious cause. Escherichia coli, the most frequent Gram-negative pathogen in preterm infants, historically displays a low in-hospital transmissibility. Aim To report the management of an NEC outbreak in a Belgian neonatal intensive care unit and the molecular characterization of a rare, highly virulent/resistant E. coli strain. Methods Clinical data were extracted from electronic medical records. Surveillance and clinical isolates characterized using standard methods were secondarily analyzed by bacterial whole-genome sequencing using EnteroBase for phylogenetic classification and BioNumerics for resistance and virulence profile determination. Findings A cluster of 6 infants was colonized by a single extended-spectrum beta-lactamase-producing E. coli strain in a 1-month period. Four infants developed severe NEC, resulting in 1 death and 3 short bowel syndromes. Although the index infant and his twin sibling acquired the strain vertically from their mother, transmission occurred horizontally through caregivers in subsequent cases. Enhanced infection prevention and control measures allowed containment of the outbreak. Molecular typing of the strain revealed a single, previously unregistered O6:H1 serotype of extraintestinal pathogenic E. coli, urinary pathogenic E. coli harboring multiple resistance and virulence genes, including extended-spectrum beta-lactamase-encoding blaCTX-M-15 and fimbriae-encoding papA. Conclusion The emergence of high-virulence strains in neonatal intensive care units calls for the implementation of enhanced infection prevention and control strategies. Bacterial genomic sequencing techniques, if implemented in multidrug-resistant organism screening, could represent a valuable addition for early characterization of virulence and resistance profiles, and improve prevention and containment of infectious outbreaks.
The increase in the antimicrobial resistance (AMR) of Staphylococcus aureus (S. aureus) has become a global public health concern. This study globally monitored the large-scale longitudinal trend of AMR in S. aureus and examined the various human and environmental climate factors that influence the occurrence and spread of AMR in S. aureus, which might provide valuable data to support the development of a global surveillance system for S. aureus AMR and provide a theoretical basis for coordinated actions to control the emergence and development of AMR from multiple perspectives. There was a significantly positive correlation between the number of antibiotic resistance genes (ARGs) in S. aureus and the collection year, with a sharp increase in ARGs over time. The number of ARGs in S. aureus genomes significantly increased each decade, with the average number of ARGs per genome rising from 10.37 ± 3.55 before 1990 to 12.75 ± 4.04 after 2010, suggesting a growing problem of S. aureus AMR. The Spearman correlation results indicated that the human development index (HDI), antibiotic consumption, and mobile genetic elements (MGEs) were significantly associated with the AMR of S. aureus, and these factors played a crucial role in the emergence and development of S. aureus AMR. The results of structural equation modeling showed that the HDI significantly promoted an increase in antibiotic consumption, thereby indirectly enhancing the AMR of S. aureus. Antibiotic consumption also indirectly facilitated the progression of AMR in S. aureus through its impact on MGEs. The results of restricted cubic spline and generalized linear models showed that climate change also played a significant role in the progression of S. aureus AMR. In summary, this study provides a theoretical framework for monitoring the longitudinal trend of ARGs in S. aureus isolates and examining the possible influencing variables of ARGs in these isolates.
Shiga-toxin-producing E. coli (STEC) is a leading cause of bacterial foodborne and zoonotic illnesses in the USA. Whole-genome sequencing (WGS) is a powerful tool used in public health and microbiology for the detection, surveillance, and outbreak investigation of STEC. In this study, we applied three WGS-based subtyping methods, high quality single-nucleotide polymorphism (hqSNP) analysis, whole genome multi-locus sequence typing using chromosome-associated loci [wgMLST (chrom)], and core genome multi-locus sequence typing (cgMLST), to isolate sequences from 11 STEC outbreaks. For each outbreak, we evaluated the concordance between subtyping methods using pairwise genomic differences (number of SNPs or alleles), linear regression models, and tanglegrams. Pairwise genomic differences were highly concordant between methods for all but one outbreak, which was associated with international travel. The slopes of the regressions for hqSNP vs. allele differences were 0.432 (cgMLST) and 0.966 [wgMLST (chrom)]; the slope was 1.914 for cgMLST vs. wgMLST (chrom) differences. Tanglegrams composed of outbreak and sporadic sequences showed moderate clustering concordance between methods, where Baker’s Gamma Indices (BGIs) ranged between 0.35 and 0.99 and Cophenetic Correlation Coefficients (CCCs) were ≥0.88 across all outbreaks. The K-means analysis using the Silhouette method showed the clear separation of outbreak groups with average silhouette widths ≥0.87 across all methods. This study validates the use of cgMLST for the national surveillance of STEC illness clusters using the PulseNet 2.0 system and demonstrates that hqSNP or wgMLST can be used for further resolution.
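The regression slopes reported above compare pairwise differences from two subtyping methods. The following sketch fits an ordinary least-squares line to such pairs. The abstract does not state which method sits on which axis or whether an intercept was fitted, so both choices are assumptions, and the numbers are invented toy values rather than data from the study.

```python
def ols_slope_intercept(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length series with at least 2 points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values are all identical; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical pairwise differences for the same isolate pairs:
# SNP counts from one method (x) and allele counts from another (y).
snp_diffs = [0, 2, 4, 6]
allele_diffs = [0, 1, 2, 3]
slope, intercept = ols_slope_intercept(snp_diffs, allele_diffs)
```

With these toy values, each additional SNP corresponds to half an allele difference. Swapping the axes would give the reciprocal slope only when the fit is exact, which is why the axis assignment matters when comparing published slopes.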
Salmonella enterica subsp. enterica serovar Infantis (S. Infantis), a major cause of human salmonellosis, is commonly associated with transmission via contaminated chicken meat. This study, as part of the national Salmonella monitoring program, assessed the prevalence of S. enterica, including S. Infantis, in chicken slaughterhouses across South Korea from 2014–2022. The presence of a megaplasmid, known as plasmid of emerging S. Infantis (pESI), was confirmed. This confirmation was based on multidrug-resistant and third-generation cephalosporin-resistant S. Infantis isolates using whole-genome sequencing. Phenotypic and genotypic characterization involved antimicrobial susceptibility tests, pulsed-field gel electrophoresis, polymerase chain reaction to screen for pESI plasmids, plasmid profiling, and conjugation assays. S. Infantis was identified in 9.3% of Salmonella-positive samples in 2014, undetected from 2015–2020, but re-emerged as the predominant serovar in 2021 (54.7%) and 2022 (75.5%). The isolates in 2014 were antibiotic susceptible, whereas most isolates from 2021 to 2022 exhibited multidrug-resistance, including resistance to third-generation cephalosporins. All isolates were sequence type 32 (ST32), with core genome multilocus sequence typing demonstrating pESI plasmid-based clustering. The pESI⁺ isolates harbored genes, such as aadA1, dfrA14, sul1, and tetA, and three multidrug-resistant pESI⁺ isolates harbored blaCTX-M-65. The plasmids were genetically similar to those observed in S. Infantis from broilers, chicken meat, and human clinical samples across various countries. This study highlights the spread of multidrug-resistant S. Infantis harboring the pESI plasmid with blaCTX-M-65 during early chicken production in South Korea. Continuous monitoring and control of resistant S. Infantis throughout the food chain is essential to inform public health initiatives.
IMPORTANCE This study highlights the critical emergence of multidrug-resistant (MDR) Salmonella enterica serovar Infantis (S. Infantis) in South Korea’s chicken slaughterhouses, driven by the acquisition of the pESI megaplasmid harboring the extended-spectrum beta-lactamase (ESBL) determinant blaCTX-M-65. Using whole-genome sequencing and comprehensive phenotypic-genotypic analyses, the findings reveal that pESI⁺ isolates in South Korea are genetically similar to strains from broilers, chicken meat, and human clinical cases worldwide. This underscores the transboundary nature of S. Infantis and its potential as a global public health threat.
Non-typhoidal Salmonella (NTS) poses a significant global health burden due to its association with gastroenteritis and rising antimicrobial resistance (AMR). This study conducted a genomic analysis of 62 Salmonella isolates from outpatient cases in Jiangsu, China, to monitor the epidemiological characteristics of NTS, including genetic diversity, AMR profiles, and resistance transmission mechanisms. Whole genome sequencing identified 18 serovars and 21 sequence types (STs), with S. Enteritidis (27.42%) and S. Typhimurium (19.35%) predominating. Genotypic AMR screening found 61 resistance genes from ten different antimicrobial categories. β-Lactam resistance genes were present in 90.32% of isolates, indicating a high frequency of extended-spectrum β-lactamases (ESBL). Serovar-dependent resistance patterns were highlighted by the most varied AMR profile (40/61 genes) found in S. Typhimurium. The co-occurrence of genes for aminoglycoside resistance, sul2, and blaTEM indicated clustering driven by mobile genetic elements. A plasmid in a S. Stanley isolate harbored 12 AMR genes, which showed structural changes suggestive of horizontal gene transfer and active recombination. These findings underscore the role of plasmids in disseminating MDR and the urgent need for enhanced antimicrobial stewardship, food safety protocols, and One Health interventions to mitigate the spread of resistant Salmonella clones.
For many decades, Salmonella enterica has been subdivided by serological properties into serovars or further subdivided for epidemiological tracing by a variety of diagnostic tests with higher resolution. Recently, it has been proposed that so-called eBurst groups (eBGs) based on the alleles of seven housekeeping genes (legacy multilocus sequence typing [MLST]) corresponded to natural populations and could replace serotyping. However, this approach lacks the resolution needed for epidemiological tracing and the existence of natural populations had not been validated by independent criteria. Here, we describe EnteroBase, a web-based platform that assembles draft genomes from Illumina short reads in the public domain or that are uploaded by users. EnteroBase implements legacy MLST as well as ribosomal gene MLST (rMLST), core genome MLST (cgMLST), and whole genome MLST (wgMLST) and currently contains over 100,000 assembled genomes from Salmonella. It also provides graphical tools for visual interrogation of these genotypes and those based on core single nucleotide polymorphisms (SNPs). eBGs based on legacy MLST are largely consistent with eBGs based on rMLST, thus demonstrating that these correspond to natural populations. rMLST also facilitated the selection of representative genotypes for SNP analyses of the entire breadth of diversity within Salmonella. In contrast, cgMLST provides the resolution needed for epidemiological investigations. These observations show that genomic genotyping, with the assistance of EnteroBase, can be applied at all levels of diversity within the Salmonella genus.
Background Large scale bacterial sequencing has made the determination of genetic relationships within large sequence collections of bacterial genomes derived from the same microbial species an increasingly common task. Solutions to the problem have application to public health (for example, in the detection of possible disease transmission), and as part of divide-and-conquer strategies selecting groups of similar isolates for computationally intensive methods of phylogenetic inference using (for example) maximum likelihood methods. However, the generation and maintenance of distance matrices is computationally intensive, and rapid methods of doing so are needed to allow translation of microbial genomics into public health actions. Results We developed, tested and deployed three solutions. BugMat is a fast C++ application which generates one-off in-memory distance matrices. FindNeighbour and FindNeighbour2 are server-side applications which build, maintain, and persist either complete (for FindNeighbour) or sparse (for FindNeighbour2) distance matrices given a set of sequences. FindNeighbour and BugMat use a variation model to accelerate computation, while FindNeighbour2 uses reference-based compression. Performance metrics show scalability into tens of thousands of sequences, with options for scaling further. Conclusion Three applications, each with distinct strengths and weaknesses, are available for distance-matrix based analysis of large bacterial collections. Deployed as part of the Public Health England solution for M. tuberculosis genomic processing, they will have wide applicability.
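The idea of a sparse distance matrix that is maintained as sequences arrive can be sketched as follows. This is an illustrative toy, not the implementation of BugMat, FindNeighbour, or FindNeighbour2: the class name, the plain Hamming distance, and the cutoff parameter are all assumptions, and none of the variation-model or reference-based compression optimizations described above are attempted.

```python
class SparseNeighbours:
    """Hypothetical incremental store that keeps only pairwise distances
    at or below a cutoff, rather than the complete distance matrix."""

    def __init__(self, cutoff):
        self.cutoff = cutoff
        self.seqs = {}    # name -> aligned sequence
        self.links = {}   # name -> {neighbour name: distance}

    def add(self, name, seq):
        """Insert one sequence and record its close neighbours."""
        if name in self.seqs:
            raise KeyError(f"{name} is already stored")
        self.links[name] = {}
        for other, other_seq in self.seqs.items():
            # Plain Hamming distance over aligned positions (an assumption).
            dist = sum(1 for a, b in zip(seq, other_seq) if a != b)
            if dist <= self.cutoff:
                self.links[name][other] = dist
                self.links[other][name] = dist
        self.seqs[name] = seq

    def neighbours(self, name):
        """Return a copy of the stored neighbours of one sequence."""
        return dict(self.links[name])


store = SparseNeighbours(cutoff=1)
store.add("a", "AAAA")
store.add("b", "AAAT")
store.add("c", "TTTT")
```

Each insertion still compares the new sequence against every stored one, so the saving in this sketch is in memory and storage, not in comparison count. Distant pairs such as "a" and "c" are simply never recorded.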
Background The ST313 sequence type of Salmonella Typhimurium causes invasive non-typhoidal salmonellosis and was thought to be confined to sub-Saharan Africa. Two distinct phylogenetic lineages of African ST313 have been identified. Methods We analysed the whole genome sequences of S. Typhimurium isolates from UK patients that were generated following the introduction of routine whole-genome sequencing (WGS) of Salmonella enterica by Public Health England in 2014. Results We found that 2.7% (84/3147) of S. Typhimurium from patients in England and Wales were ST313 and were associated with gastrointestinal infection. Phylogenetic analysis revealed novel diversity of ST313 that distinguished UK-linked gastrointestinal isolates from African-associated extra-intestinal isolates. The majority of genome degradation of African ST313 lineage 2 was conserved in the UK-ST313, but the African lineages carried a characteristic prophage and antibiotic resistance gene repertoire. These findings suggest that a strong selection pressure exists for certain horizontally acquired genetic elements in the African setting. One UK-isolated lineage 2 strain that probably originated in Kenya carried a chromosomally located blaCTX-M-15, demonstrating the continual evolution of this sequence type in Africa in response to widespread antibiotic usage. Conclusions The discovery of ST313 isolates responsible for gastroenteritis in the UK reveals new diversity in this important sequence type. This study highlights the power of routine WGS by public health agencies to make epidemiologically significant deductions that would be missed by conventional microbiological methods. We speculate that the niche specialisation of sub-Saharan African ST313 lineages is driven in part by the acquisition of accessory genome elements. Electronic supplementary material The online version of this article (doi:10.1186/s13073-017-0480-7) contains supplementary material, which is available to authorized users.
Fully exploiting the wealth of data in current bacterial population genomics datasets requires synthesising and integrating different types of analysis across millions of base pairs in hundreds or thousands of isolates. Current approaches often use static representations of phylogenetic, epidemiological, statistical and evolutionary analysis results that are difficult to relate to one another. Phandango is an interactive application running in a web browser allowing fast exploration of large-scale population genomics datasets combining the output from multiple genomic analysis methods in an intuitive and interactive manner. Availability: Phandango is a web application freely available for use at www.phandango.net and includes a diverse collection of datasets as examples. Source code together with a detailed wiki page is available on GitHub at https://github.com/jameshadfield/phandango. Contact: jh22@sanger.ac.uk, sh16@sanger.ac.uk.
PulseNet International is a global network dedicated to laboratory-based surveillance for food-borne diseases. The network comprises the national and regional laboratory networks of Africa, Asia Pacific, Canada, Europe, Latin America and the Caribbean, the Middle East, and the United States. The PulseNet International vision is the standardised use of whole genome sequencing (WGS) to identify and subtype food-borne bacterial pathogens worldwide, replacing traditional methods to strengthen preparedness and response, reduce global social and economic disease burden, and save lives. To meet the needs of real-time surveillance, the PulseNet International network will standardise subtyping via WGS using whole genome multilocus sequence typing (wgMLST), which delivers sufficiently high resolution and epidemiological concordance, plus unambiguous nomenclature for the purposes of surveillance. Standardised protocols, validation studies, quality control programmes, database and nomenclature development, and training should support the implementation and decentralisation of WGS. Ideally, WGS data collected for surveillance purposes should be publicly available, in real time where possible, respecting data protection policies. WGS data are suitable for surveillance and outbreak purposes and for answering scientific questions pertaining to source attribution, antimicrobial resistance, transmission patterns, and virulence, which will further enable the protection and improvement of public health with respect to food-borne disease.
Visualization is frequently used to aid our interpretation of complex datasets. Within microbial genomics, visualizing the relationships between multiple genomes as a tree provides a framework onto which associated data (geographical, temporal, phenotypic and epidemiological) are added to generate hypotheses and to explore the dynamics of the system under investigation. Selected static images are then used within publications to highlight the key findings to a wider audience. However, these images are a very inadequate way of exploring and interpreting the richness of the data. There is, therefore, a need for flexible, interactive software that presents the population genomic outputs and associated data in a user-friendly manner for a wide range of end users, from trained bioinformaticians to front-line epidemiologists and health workers. Here, we present Microreact, a web application for the easy visualization of datasets consisting of any combination of trees, geographical, temporal and associated metadata. Data files can be uploaded to Microreact directly via the web browser or by linking to their location (e.g. from Google Drive/Dropbox or via API), and an integrated visualization via trees, maps, timelines and tables provides interactive querying of the data. The visualization can be shared as a permanent web link among collaborators, or embedded within publications to enable readers to explore and download the data. Microreact can act as an end point for any tool or bioinformatic pipeline that ultimately generates a tree, and provides a simple, yet powerful, visualization method that will aid research and discovery and the open sharing of datasets.
The 2013–2016 West African epidemic caused by the Ebola virus was of unprecedented magnitude, duration and impact. Here we reconstruct the dispersal, proliferation and decline of Ebola virus throughout the region by analysing 1,610 Ebola virus genomes, which represent over 5% of the known cases. We test the association of geography, climate and demography with viral movement among administrative regions, inferring a classic ‘gravity’ model, with intense dispersal between larger and closer populations. Despite attenuation of international dispersal after border closures, cross-border transmission had already sown the seeds for an international epidemic, rendering these measures ineffective at curbing the epidemic. We address why the epidemic did not spread into neighbouring countries, showing that these countries were susceptible to substantial outbreaks but at lower risk of introductions. Finally, we reveal that this large epidemic was a heterogeneous and spatially dissociated collection of transmission clusters of varying size, duration and connectivity. These insights will help to inform interventions in future epidemics.
Listeria monocytogenes (Lm) is a major human foodborne pathogen. Numerous Lm outbreaks have been reported worldwide and associated with a high case fatality rate, reinforcing the need for strongly coordinated surveillance and outbreak control. We developed a universally applicable genome-wide strain genotyping approach and investigated the population diversity of Lm using 1,696 isolates from diverse sources and geographical locations. We define, with unprecedented precision, the population structure of Lm, demonstrate the occurrence of international circulation of strains and reveal the extent of heterogeneity in virulence and stress resistance genomic features among clinical and food isolates. Using historical isolates, we show that the evolutionary rate of Lm from lineage I and lineage II is low (∼2.5 × 10⁻⁷ substitutions per site per year, as inferred from the core genome) and that major sublineages (corresponding to so-called ‘epidemic clones’) are estimated to be at least 50–150 years old. This work demonstrates the urgent need to monitor Lm strains at the global level and provides the unified approach needed for global harmonization of Lm genome-based typing and population biology.
Salmonella enterica serovar Paratyphi C causes enteric (paratyphoid) fever in humans. Its presentation can range from asymptomatic infections of the blood stream to gastrointestinal or urinary tract infection or even a fatal septicemia [1]. Paratyphi C is very rare in Europe and North America except for occasional travelers from South and East Asia or Africa, where the disease is more common [2, 3]. However, early 20th-century observations in Eastern Europe [3, 4] suggest that Paratyphi C enteric fever may once have had a wide-ranging impact on human societies. Here, we describe a draft Paratyphi C genome (Ragna) recovered from the 800-year-old skeleton (SK152) of a young woman in Trondheim, Norway. Paratyphi C sequences were recovered from her teeth and bones, suggesting that she died of enteric fever and demonstrating that these bacteria have long caused invasive salmonellosis in Europeans. Comparative analyses against modern Salmonella genome sequences revealed that Paratyphi C is a clade within the Para C lineage, which also includes serovars Choleraesuis, Typhisuis, and Lomita. Although Paratyphi C only infects humans, Choleraesuis causes septicemia in pigs and boar [5] (and occasionally humans), and Typhisuis causes epidemic swine salmonellosis (chronic paratyphoid) in domestic pigs [2, 3]. These different host specificities likely evolved in Europe over the last ∼4,000 years since the time of their most recent common ancestor (tMRCA) and are possibly associated with the differential acquisitions of two genomic islands, SPI-6 and SPI-7. The tMRCAs of these bacterial clades coincide with the timing of pig domestication in Europe [6].