TopHap: Rapid inference of key phylogenetic structures from common haplotypes in large genome collections with limited diversity

Abstract

Motivation: Building reliable phylogenies from very large collections of sequences with a limited number of phylogenetically informative sites is challenging because sequencing errors and recurrent/backward mutations interfere with the phylogenetic signal, confounding true evolutionary relationships. Massive global efforts of sequencing genomes and reconstructing the phylogeny of SARS-CoV-2 strains exemplify these difficulties since there are only hundreds of phylogenetically informative sites and millions of genomes. For such datasets, we set out to develop a method for building the phylogenetic tree of genomic haplotypes consisting of positions harboring common variants to improve the signal-to-noise ratio for more accurate and fast phylogenetic inference of resolvable phylogenetic features. Results: We present the TopHap approach that determines spatiotemporally common haplotypes of common variants and builds their phylogeny at a fraction of the computational time of traditional methods. We develop a bootstrap resampling strategy that resamples genomes spatiotemporally to assess topological robustness. The application of TopHap to build a phylogeny of 68,057 SARS-CoV-2 genomes (68KG) from the first year of the pandemic produced an evolutionary tree of major SARS-CoV-2 haplotypes. This phylogeny is concordant with the mutation tree inferred using the co-occurrence pattern of mutations and recovers key phylogenetic relationships from more traditional analyses. We also evaluated alternative roots of the SARS-CoV-2 phylogeny and found that the earliest sampled genomes in 2019 likely evolved by four mutations of the most recent common ancestor of all SARS-CoV-2 genomes. An application of TopHap to more than 1 million SARS-CoV-2 genomes reconstructed the most comprehensive evolutionary relationships of major variants, which confirmed the 68KG phylogeny and provided evolutionary origins of major variants of concern. 
Availability: TopHap is available at https://github.com/SayakaMiura/TopHap.
Bioinformatics, YYYY, 0–0
doi: 10.1093/bioinformatics/xxxxx
Advance Access Publication Date: DD Month
YYYY
Original Papers
Marcos A. Caraballo-Ortiz1,2,x, Sayaka Miura1,2,x, Maxwell Sanderford1,2, Tenzin Dolker1,2, Qiqing Tao1,2,
Steven Weaver1,2, Sergei L. K. Pond1,2, and Sudhir Kumar1,2,3,*
1Institute for Genomics and Evolutionary Medicine, 2Department of Biology, Temple University,
Philadelphia, PA 19122, USA, 3Center of Excellence in Genomic Medicine Research, King Abdulaziz
University, Saudi Arabia.
*To whom correspondence should be addressed.
x Joint first authors.
Contact: s.kumar@temple.edu
1 Introduction
The global health emergency caused by the SARS-CoV-2 coronavirus has
catalyzed an unprecedented effort to sequence millions of genomes from
all around the world and to analyze them to reveal viral origins and
evolutionary patterns (Andersen et al., 2020; Kumar et al., 2021; Rambaut
et al., 2020). However, applying classical phylogenetic methods to infer
the global SARS-CoV-2 phylogeny has been challenging (Kumar et al.,
2021; Morel et al., 2020). This is partly because phylogenetically
informative sites are relatively rare due to a low mutation rate and a short
evolutionary period of the outbreak. Genome sequences contain random
and systematic sequencing errors, which compete with informative
phylogenetic variation and mislead phylogenetic inference (Kumar et al.,
2021; Morel et al., 2020; Pipes et al., 2020). Consequently, applications of
standard phylogenetic methods to the multiple sequence alignments (MSA)
of SARS-CoV-2 genomes have produced many equally plausible
phylogenies, particularly when reconstructing early mutational history and
the root of the SARS-CoV-2 phylogeny (Nie et al., 2020; Pipes et al., 2020;
van Dorp et al., 2020).
Kumar et al. (2021) reconstructed a mutation tree using shared co-
occurrence patterns of mutations occurring in >1% of isolates, which they
refer to as the mutation order approach (MOA). They applied and advanced
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any
medium, provided the original work is properly cited. For commercial re-use, please contact journals.permissions@oup.com
Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac186/6553661 by guest on 28 March 2022
TopHap: Building big phylogenies by using major haplotypes
a maximum likelihood method (SCITE, Jahn et al., 2016) that models
false-positive and false-negative variant detections in the absence of
recombination (Jahn et al., 2016; Kumar et al., 2021). They reported
success deciphering the earliest phases of SARS-CoV-2 evolution and
recovered the most recent common ancestor (MRCA) genome, using
common variants observed in the early stages of SARS-CoV-2 evolution.
Based on the MOA’s success in building the mutation tree using common
variants, we hypothesized that it should be possible to build a reliable
molecular phylogeny of major SARS-CoV-2 haplotypes by filtering out all
genomic positions at which no minor allele rose to a frequency greater than
1%. Such filtering should effectively reduce the effect of the noise in
making molecular phylogenetic inferences using standard approaches (e.g.,
the maximum likelihood [ML] method). If successful, one would prefer a
traditional phylogenetic approach because it can better handle multiple
substitutions at the same site (homoplasy) and use outgroup sequences
more effectively than the mutation tree approaches.
However, simply excluding alignment sites with only low-frequency
variants and then applying a standard phylogenetic approach to the
remaining sites cannot be recommended. An example in Figure 1
illustrates why. The ancestral genome contains only three polymorphic
positions where derived alleles occur at high frequencies (#1, #2, and #3;
Fig. 1a). In this case, we expect to see at most four correct haplotypes in
the absence of noise: three mutant strains (H1, H2, and H3) and one
ancestral haplotype. The addition of a small number of sequencing errors
and homoplasies generates additional haplotypes (e.g., H4, H5, and H6)
that occur with very low frequency but still mislead an ML analysis (Fig. 1b).
ML phylogenies place two spurious haplotypes (H5 and H6) near the root
of the tree (Fig. 1c), albeit without significant statistical support.
However, this behavior is rectified when one removes rare haplotypes
(Fig. 1d). This observation prompted us to develop a simple filtering
procedure to identify common (top) haplotypes of common variants for
molecular phylogenetic analysis. We first present this filtering process and
then apply it to infer the early evolutionary history of SARS-CoV-2 by
using 68,057 genomes previously analyzed by Kumar et al. (2021) for a
direct comparison of the TopHap phylogeny with the mutation tree
generated by using MOA.
2 Methods
2.1 The TopHap approach
As input, TopHap uses an MSA of genomes (n genomes and m alignment
columns). The first step is the selection of common variants by specifying
a desired minor allele frequency threshold (e.g., maf > 1%) without using
any reference genomes (Fig. 2). All alignment sites containing at least one
allele with a frequency greater than maf and another allele with a frequency
less than 1-maf are retained (k variant positions). Every genome is then
reduced to a haplotype containing k positions. Next, unique haplotype
sequences are identified, and their frequencies tallied. TopHap selects the
top h haplotypes given a desired hf frequency cutoff. Now, the MSA
contains h haplotypes, each k variants long and tagged with its frequency.
Outgroup genomes are added into the MSA by converting them into
haplotypes containing only the k selected positions. TopHap then subjects
the reduced MSA to phylogenetic analysis using the Maximum Parsimony
(MP) method, which produces the TopHap phylogeny of common
haplotypes of common variants.
When information on sampling location and time of haplotypes is
available, TopHap can select variants and haplotypes for each
spatiotemporal slice of the dataset that is regionally (e.g., continent,
country, or city) and temporally (e.g., monthly) partitioned
(Supplementary Fig. S1). The same maf and hf thresholds are applied to
every spatiotemporal slice, and the final set of variants and haplotypes
across all spatiotemporal slices are pooled.
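A possible reading of the spatiotemporal procedure is sketched below. The order of operations (pool variant positions across slices first, then select haplotypes within each slice over the pooled positions) is our assumption, as are the function name and the record layout; the text above specifies only that the same thresholds are applied to every slice and that the results are pooled.

```python
from collections import Counter, defaultdict

def pooled_selection(records, maf=0.05, hf=0.05):
    """records: iterable of (region, month, aligned_sequence).

    Returns (sorted 0-based positions, set of haplotype strings).
    Hypothetical sketch; not the TopHap implementation.
    """
    slices = defaultdict(list)
    for region, month, seq in records:
        slices[(region, month)].append(seq)

    # Pass 1: a position is kept if its minor allele is common in any slice.
    positions = set()
    for seqs in slices.values():
        for col in range(len(seqs[0])):
            ranked = Counter(s[col] for s in seqs).most_common(2)
            if len(ranked) == 2 and ranked[1][1] / len(seqs) > maf:
                positions.add(col)
    positions = sorted(positions)

    # Pass 2: a haplotype is kept if it is common in any slice.
    haplotypes = set()
    for seqs in slices.values():
        counts = Counter("".join(s[p] for p in positions) for s in seqs)
        haplotypes.update(h for h, c in counts.items() if c / len(seqs) > hf)
    return positions, haplotypes

# Toy data (hypothetical): two slices, each with its own local variant.
records = (
    [("EU", "2020-03", "AAA")] * 8 + [("EU", "2020-03", "ATA")] * 2
    + [("AS", "2020-04", "AAA")] * 7 + [("AS", "2020-04", "AAG")] * 3
)
positions, haplotypes = pooled_selection(records)
```

Selecting within slices lets a variant that is common in one region and month enter the analysis even if it is rare in the pooled global data.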
Calculation of the bootstrap support. In the TopHap approach, bootstrap
branch support for the inferred phylogeny of common haplotypes is
calculated by resampling (with replacement) of haplotypes, which is
intended to assess the robustness of the inferred phylogeny to the
inclusion/exclusion of haplotypes likely created by sequencing errors and
Figure 1 Traditional phylogenetic approach versus the new TopHap approach for a dataset that contains many sequences with few variants. (a) The true tree shows three
simulated mutant haplotypes. In this example, three mutations (α, β, and ) occurred sequentially and gave rise to haplotypes H1, H2, and H3. The size of triangles at each tip is
proportional to the number of genomes containing these haplotypes. (b) Phylogenetic approaches use a multiple sequence alignment, simplified here with only three informative
variants. Due to sequencing errors, a few spurious haplotypes may be observed (red letters, H4–H6) with low frequencies (0.3% – 1%). The inclusion of these spurious haplotypes
misguides standard phylogeny methods (e.g., maximum likelihood [ML] and maximum parsimony [MP]) and produces incorrect evolutionary inference. (c) Result based on a typical
ML approach suggests that the spurious haplotypes H6 and H5 were the first to arise. The bootstrap confidence limits for all the branching patterns are low (< 50%) because each
branch is only one mutation long, a situation where the bootstrap method is known to be powerless (see text). (d) The TopHap approach was able to infer the correct tree because it
restricts phylogenetic analysis to haplotypes greater than 1% frequency.
convergent changes that are expected to have relatively low frequencies
spatiotemporally. The bootstrap resampling procedure is applied separately
to each spatiotemporal slice, and the resulting haplotypes are pooled
together. This genome resampling approach differs from Felsenstein’s
bootstrap approach of resampling sites to build bootstrap replicate datasets,
which needs at least three mutations per branch to achieve a 95%
confidence level even without any homoplasy (Felsenstein, 1985). The MP
method is applied to every bootstrap replicate dataset, and haplotypes that
do not appear in all the replicates are pruned from the bootstrap phylogenies.
A bootstrap consensus tree is then generated, which carries the bootstrap
confidence limits for every clade of haplotypes. Alternatively, one may
choose not to prune haplotypes across bootstrap replicates. In this case,
phylogenies can be summarized using software that allows for an unequal
number of tips across phylogenies (Bouckaert, 2010).
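The genome resampling step can be illustrated as follows. This sketch covers only the resampling and the pruning to haplotypes found in every replicate; tree inference and consensus building are omitted. The function name, the seed argument, and the toy counts are our assumptions.

```python
import random
from collections import Counter

def bootstrap_common_haplotypes(genome_haplotypes, hf=0.05, replicates=100, seed=0):
    """Resample genomes (not sites) with replacement and, in each replicate,
    keep haplotypes with frequency > hf.

    Returns (list of per-replicate haplotype sets, haplotypes in all replicates).
    Hypothetical sketch of one spatiotemporal slice.
    """
    rng = random.Random(seed)
    n = len(genome_haplotypes)
    per_replicate = []
    for _ in range(replicates):
        sample = rng.choices(genome_haplotypes, k=n)  # sampling with replacement
        counts = Counter(sample)
        per_replicate.append({h for h, c in counts.items() if c / n > hf})
    # Haplotypes missing from any replicate would be pruned before the consensus.
    shared = set.intersection(*per_replicate)
    return per_replicate, shared

# Toy slice (hypothetical): H3 sits at the cutoff, H4 is far below it.
genomes = ["H1"] * 600 + ["H2"] * 348 + ["H3"] * 50 + ["H4"] * 2
replicate_sets, shared = bootstrap_common_haplotypes(genomes)
```

A haplotype whose frequency sits near the hf cutoff, such as H3 here, enters some replicates and not others, so it tends to be pruned; this is how the procedure probes robustness to borderline haplotypes.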
Placement of additional haplotypes into the phylogeny. To place a new
genome into the TopHap phylogeny, the first step is to transform it into a
haplotype of k positions used to build the TopHap phylogeny. One may use
UShER (Turakhia et al., 2021), which is an MP approach, or RAxML-EPA
(Berger et al., 2011) and pplacer (Matsen et al., 2010), which are ML
approaches. We found RAxML-EPA convenient, so this option is
programmed in our TopHap implementation. When the intent is to place a
genome with variant(s) at a genomic position that was not used to build
the TopHap phylogeny, the TopHap phylogeny needs to be rebuilt while
requiring that the position(s) of interest always be included during the
TopHap analysis. This step is optional and available in the TopHap
analysis.
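The preprocessing for placement amounts to a projection onto the k selected positions, optionally after forcing extra positions into the set. The helpers below are a hypothetical sketch with 0-based coordinates on an already aligned genome; the placement itself (e.g., with RAxML-EPA) is not shown.

```python
def to_haplotype(aligned_genome, positions):
    """Project an aligned genome onto the k positions used for the phylogeny."""
    return "".join(aligned_genome[p] for p in positions)

def positions_with_required(selected, required):
    """Union of the selected positions and positions that must always be kept."""
    return sorted(set(selected) | set(required))

# Toy example (hypothetical data).
selected = [1, 3]
query = to_haplotype("ACGTAC", selected)
rebuilt = positions_with_required(selected, required=[5, 3])
```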
2.2 Genome Data Acquisition and Assembly
We obtained an MSA containing 68,057 genomes (hereafter, 68KG) of the
SARS-CoV-2 coronavirus from human hosts analyzed in Kumar et al.
(2021). These genomes were obtained from the GISAID database
(https://www.gisaid.org) and covered the period from December 24, 2019,
until October 12, 2020. The 68KG alignment was generated after filtering
133,741 SARS-CoV-2 genomes, such that genomes shorter than 28,000
bases and those with many ambiguous bases were removed. Three
outgroup coronavirus genomes were added to the alignment: Rhinolophus
affinis (RaTG13) and R. malayanus (RmYN02) bats, and the Manis
javanica pangolin (MT040335) (Liu et al., 2020; Zhou et al., 2020).
Following the above procedure, we also assembled a bigger dataset
containing 1,106,862 genomes (hereafter 1MG) from the GISAID database
covering the period from December 24, 2019, to September 11, 2021.
Annotations using Nextstrain and PANGO classifications. To compare
TopHap phylogeny with the Nextstrain classification, we annotated all the
TopHap haplotypes using the presence and absence of diagnostic
Nextstrain mutations (https://nextstrain.org/ncov). We also assigned a
PANGO lineage to each genome in the data using the Phylogenetic
Assignment of Named Global Outbreak Lineages (PANGOLIN) software
(Rambaut et al., 2020). TopHap haplotype ID was also assigned to
genomes whose haplotype was identical to the TopHap haplotype. When a
TopHap haplotype matched with multiple PANGO lineages, we paired a
TopHap haplotype with the major PANGO lineage.
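The pairing rule can be sketched as a simple tally. Here "major" is taken to mean the most frequent PANGO lineage among genomes carrying a given TopHap haplotype, which is our interpretation; the function name and the input layout are hypothetical.

```python
from collections import Counter, defaultdict

def pair_with_major_lineage(genomes):
    """genomes: iterable of (tophap_haplotype_id, pango_lineage) per genome.

    Returns {haplotype_id: (major_lineage, fraction_of_genomes)}.
    """
    by_haplotype = defaultdict(Counter)
    for hap, lineage in genomes:
        by_haplotype[hap][lineage] += 1
    paired = {}
    for hap, counts in by_haplotype.items():
        lineage, count = counts.most_common(1)[0]
        paired[hap] = (lineage, count / sum(counts.values()))
    return paired

# Toy annotations (hypothetical data).
annotations = [("H1", "B.1"), ("H1", "B.1"), ("H1", "B.1.1"), ("H2", "A")]
paired = pair_with_major_lineage(annotations)
```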
3 Results
We stratified sequence isolates by month of sampling and country to
select variants and haplotypes in the TopHap analysis of the 68KG
dataset. We used spatial and regional maf and hf cutoffs of 5% to
avoid including problematic variants and haplotypes created by
recurrent/backward mutations and sequencing error, particularly because
of the small number of genomes available for some spatiotemporal slices.
When fewer than 500 genomes were sampled from a country, we manually
pooled them with genomes from adjacent countries on the same continent
that also had fewer than 500 genomes. Also, the numbers of
genomes in December 2019 and October 2020 were <500, so we pooled
them with the January 2020 and September 2020 time slices, respectively.
TopHap’s filtering process (5% threshold for maf and hf) produced an
MSA of common haplotypes that consisted of 83 variable sites and 39
unique haplotypes after pruning haplotypes that were not sampled in all
bootstrap analyses.
We subjected the final haplotype MSA to an MP analysis in MEGA
(Tamura et al., 2021) and an ML analysis in RAxML (Kozlov et al., 2019).
The heuristic search was applied with the default option (Subtree-Pruning-
Regrafting). For the ML analysis, we used the GTR nucleotide substitution
model and GAMMA among-site rate heterogeneity (4 discrete rate
categories) in RAxML (https://raxml-ng.vital-it.ch). We used Lewis'
ascertainment bias correction since the haplotype MSA contains only
variable sites (Lewis, 2001). In the ML phylogeny, many branches
received low bootstrap support (<52%; Supplementary Fig. S2).
Therefore, we disregarded these branches when comparing ML and MP
phylogenies and found that the two phylogenies were identical. This result
prompted us to implement the MP analysis in the TopHap software.
Figure 2 Overview of the TopHap approach. Input to TopHap is an alignment of
genome sequences (n sequences, m bases each). TopHap first identifies high-
frequency variants (> maf) and produces a restricted alignment with n sequences and
k bases. Next, high-frequency haplotypes (>hf) are identified, resulting in a reduced
alignment of h haplotypes each with k bases. These haplotypes are subjected to
standard phylogenetic inference. To compute bootstrap confidence limits, TopHap
resamples n haplotypes with replacement to form a replicate n×k dataset, which is
followed by the identification of high-frequency haplotypes (>hf) and the inference of
their phylogeny. This process is repeated for the desired number of bootstrap
replicates and a consensus phylogeny of haplotypes found in all replicates is produced.
Spatiotemporal information can also be used to construct subsets in which variants
and haplotypes are identified for each spatiotemporal slice separately (see
Supplementary Figure S1).
The TopHap analysis of the 68KG dataset with 100 bootstrap replicates
required less than one hour, and all but three groups received > 95%
bootstrap support (Fig. 3). The remaining groups received >80% bootstrap
support. Here, we used Greek symbols for variants designated with
symbols in Kumar et al. (2021, Supplementary Table 2). In this phylogeny,
many branches were longer than one mutation, indicating that haplotypes
corresponding to intermediate viruses did not rise to high enough
frequency in the data or were not sampled. Also, more than two
evolutionary lineages originated from the same ancestral lineage in many
cases, which is likely to be real because there was no mutational homoplasy
around those branches in the phylogeny (see further discussion below).
Temporal trends in variant frequencies
The TopHap approach does not use temporal information from sample
isolation dates during the reconstruction of the haplotype phylogeny.
Therefore, a TopHap phylogeny can be used to test the concordance
between the temporal order of mutation occurrence and the order of their
frequency predicted by the phylogeny. For this analysis, we first mapped
mutations to every branch in the SARS-CoV-2 phylogeny by
reconstructing the most parsimonious ancestral states. All mutations were
mapped unambiguously (Fig. 3). Frequencies of variants generally
decreased from the root to tip on evolutionary lineages (e.g., Fig. 4a). For
example, the mutant bases mapping to the earliest diverging branches in
the TopHap phylogeny occurred with the highest frequency in the 68KG
dataset. Also, the timing of the first sampling date of variants increased on
lineages from the root to tips (Fig. 4b). These trends are consistent with
the clonal evolution without recombination of SARS-CoV-2 during the
early stage of the pandemic.
Comparing 68KG TopHap phylogeny with the MOA tree
To directly compare the TopHap phylogeny with the MOA mutation tree
reported in Kumar et al. (2021), we also used spatial and regional maf and
hf cutoffs of 1% in analyzing the same 68KG dataset. The inferred TopHap
phylogeny contained a much larger number of haplotypes (302) and
variable sites (570), which included all 83 variants with > 1% global maf
analyzed in Kumar et al. (2021). The order of these mutations in the
TopHap phylogeny was similar to the MOA mutation tree in Kumar et al.
(2021), with a few minor differences noted in Supplementary Fig. S3.
Similarly, TopHap phylogeny agreed well with Nextstrain and PANGO
trees (Fig. 5).
TopHap analysis of >1 million SARS-CoV-2 genomes
Next, we analyzed a recent snapshot of SARS-CoV-2 genome collection
acquired one year after assembling the 68KG dataset. After filtering out
incomplete genome sequences, we constructed an alignment of 1,106,862
genomes (1MG dataset) that is 16 times bigger than the 68KG dataset.
Using TopHap with a 5% threshold for maf and hf, we obtained an MSA
of 150 haplotypes with 675 variable sites. The number of haplotypes
increased only four-fold between the 68KG and 1MG datasets, whereas the
number of variable sites increased eight-fold. This larger increase in the
number of variable sites than in the number of haplotypes is likely due to
episodic mutations in SARS-CoV-2 evolution, where intermediate
haplotypes are no longer found at appreciable frequency. For example,
some multi-mutation branches in the TopHap phylogeny correspond well
Figure 3. The TopHap Phylogeny of 68KG SARS-
CoV-2 major haplotypes. (a) Numbers near nodes
are bootstrap confidence limits derived from bootstrap
resampling of genomes. Mutations mapped are shown
on branches. When the same mutations were included
in Kumar et al. (2021), their mutation IDs (Greek
symbols) are shown. The mutations and their genomic
positions are given on the right side. The Nextstrain
clade ID was annotated based on diagnostic
mutations and is provided at the far right. The PANGO
lineage was annotated for each genome using the
PANGOLIN software (Rambaut et al., 2020). We also
annotated the TopHap haplotype for each genome by
comparing its haplotype with the TopHap haplotypes.
When an observed haplotype did not perfectly match
any of the TopHap haplotypes, we did not assign a
TopHap haplotype to the genome. Using these genome
annotations, we paired each TopHap haplotype with
the major PANGO lineage, and the percentage of
genomes containing it is presented in parentheses.
with the unresolved branching order of mutations in Kumar et al. (2021),
which was suggested to be due to evolutionary bursts (e.g., three
mutations). These bursts are also observed in the 1MG phylogeny (Fig.
6a), which shows high concordance with the 68KG phylogeny. Orders of
the earliest mutations (13, 13, 13, 1, 1, and 12) were the same
in 1MG and 68KG phylogenies. Therefore, inferences about the early
history reported for the 68KG data set are robust to the expanded sampling
of genomes.
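The fold changes quoted in this comparison follow from the counts reported for the 5% threshold analyses. A short check, using only numbers stated in the text (the variable names are ours):

```python
# Counts reported for the 68KG and 1MG analyses at the 5% maf/hf threshold.
genomes_68kg, genomes_1mg = 68_057, 1_106_862
haplotypes_68kg, haplotypes_1mg = 39, 150
sites_68kg, sites_1mg = 83, 675

fold_change = {
    "genomes": genomes_1mg / genomes_68kg,           # about 16-fold
    "haplotypes": haplotypes_1mg / haplotypes_68kg,  # about four-fold
    "variable_sites": sites_1mg / sites_68kg,        # about eight-fold
}
```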
The 1MG TopHap phylogeny shows the evolutionary history of key
WHO-designated variants of concern (VOC). This includes WHO-
ALPHA, WHO-BETA, WHO-DELTA, WHO-ETA, WHO-GAMMA, and
WHO-LAMBDA variants. We used the WHO- prefix to avoid conflict
between Kumar et al. (2021) notations for mutations and WHO’s notation
for multi-mutation strains. Notably, Kumar et al. (2021) mutation
identifiers were proposed earlier than the WHO designations, so we have
retained them.
These VOCs' placements in TopHap are consistent with those in the
Nextstrain taxonomy (Fig. 6b and 6c). For example, Nextstrain and
TopHap infer WHO-ALPHA, WHO-GAMMA, and WHO-LAMBDA to
be sister lineages. Also, the N501Y Spike recurrent mutation (A23063T)
occurred independently in WHO-ALPHA, WHO-GAMMA, and WHO-
BETA lineages, which are placed correctly by TopHap (Fig. 6a). Since the
WHO-OMICRON variant appears to have originated after the last day of
sampling the 1MG dataset, the TopHap phylogeny does not contain it. So,
we used WHO-OMICRON’s diagnostic mutations listed on the Nextstrain
website (https://nextstrain.org/ncov) to place it in the 1MG TopHap
phylogeny. WHO-OMICRON is an offspring of the lineage, as it contains
, , and mutations. This placement agrees with Nextstrains’ inference
(Fig. 6b and 6c).
The TopHap analysis of the 1MG dataset with a 5% threshold for maf
and hf was completed in less than 3 hours, including 100 bootstrap
replicates. In this phylogeny, 57 out of 72 clusters received 100% bootstrap
support, most of which were shallow clusters (close to the tips). This
pattern was consistent with the 68KG data analysis.
We explored the impact of using a larger number of bootstrap replicates
(1,000), which took ten times longer, on the estimates of the bootstrap
support values. Bootstrap support values from 100 and 1,000 replicates
were generally similar (Supplementary Fig. S4). For example, the bootstrap
support for the evolutionary position of WHO-DELTA was 92% and 93% in the two
analyses, respectively. Therefore, the use of 100 bootstrap replicates
appears to be sufficient.
TopHap analysis of 1MG dataset with lower maf and hf thresholds
We also reconstructed the 1MG phylogeny using a 1% cutoff for maf and hf
to select regional variants and haplotypes in TopHap. In this phylogeny,
the numbers of variable sites and haplotypes increased to 1,793 and 594,
respectively (Supplementary Fig. S5). Restricting the comparison to only
haplotypes common in both phylogenies, i.e., 1% and 5% cutoffs, we found
a very high concordance, as there were only nine partition differences for
some recent strain divergences that were likely caused by the presence of
rarer haplotypes containing recurrent/reversal mutations in the 1% cutoff
analysis. The evolutionary placements of WHO variants of concern were
also identical between the phylogenies. The inclusion of these haplotypes
in the 1% cutoff TopHap phylogeny reduced the bootstrap support in
general, except for shallow nodes (close to tips). Therefore, one needs to
use high enough maf and hf to avoid haplotypes disrupting phylogenetic
inference. When the evolutionary relationship of low-frequency haplotypes
Figure 5. The comparison of TopHap phylogeny with the (a) Nextstrain and
(b) PANGO phylogenies. (a) Only clades included in the 68KG data are shown.
(b) Only PANGO lineages that were included in the TopHap phylogeny were
used. Corresponding PANGO IDs are found in Figure 3.
needs to be inferred, we suggest using TopHap’s facility to place low-
frequency haplotypes of interest into a robust and well-supported
phylogeny (Fig. 2).
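A comparison restricted to shared haplotypes, such as the one between the 1% and 5% cutoff phylogenies, can be sketched as follows. The sketch represents rooted trees as nested tuples, prunes both trees to their common tips, and counts clades found in only one tree (a Robinson-Foulds-style symmetric difference). How the partition differences reported above were counted is not stated in the text, so this definition, the function names, and the toy trees are assumptions.

```python
def tips_of(tree):
    """All tip names of a nested-tuple tree whose leaves are strings."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(tips_of(child) for child in tree))

def clades(tree, keep):
    """Clades (as frozensets of tips) after pruning the tree to tips in `keep`."""
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node]) & keep
        tips = frozenset().union(*(walk(child) for child in node))
        if len(tips) > 1:  # ignore empty and single-tip groups
            found.add(tips)
        return tips
    walk(tree)
    return found

def partition_differences(tree_a, tree_b):
    """Number of clades present in exactly one tree, over shared tips only."""
    common = frozenset(tips_of(tree_a) & tips_of(tree_b))
    return len(clades(tree_a, common) ^ clades(tree_b, common))

# Toy trees (hypothetical): H5 occurs only in the second tree and is ignored.
tree_5pct = ((("H1", "H2"), "H3"), "H4")
tree_1pct = ((("H1", "H3"), "H2"), ("H4", "H5"))
n_diff = partition_differences(tree_5pct, tree_1pct)
```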
Rooting the tree of SARS-CoV-2 genomes
We find that the Nextstrain and PANGO phylogenies broadly agree with the
68KG and 1MG TopHap phylogenies, except for the root placement (Fig. 3, 5,
and 6). For example, clade 19A is at the root of the Nextstrain phylogeny,
but TopHap phylogenies (using the bat/pangolin outgroups) suggest that
clade 19A is derived. The bootstrap support was modest (>66%) for the
root of the TopHap phylogeny, but no bootstrap replicates supported the
Nextstrain rooting, and <34% supported the PANGO rooting.
The TopHap rooting is the same as that implied by MOA in Kumar et
al. (2021). The TopHap root is also consistent with one of the two preferred
roots in Bloom (2021), who analyzed 13 additional partial genomes from
the earliest phases of the pandemic in China. Key early mutations analyzed
in Bloom (2021) contained an additional variable site (genomic position
29095), where the minor base occurred with too low a frequency to be
included in the TopHap analysis (0.4% in the 68KG dataset). We,
therefore, added it to the 68KG MSA and referred to this mutation as x (=
29095, U is minor, and C is major).
We also searched for other rare haplotypes to see whether any tend to cluster
at or near the root position in the 68KG TopHap phylogeny. We found 936
additional unique haplotypes that occurred more than once in the 68KG
dataset. We tested their placement one by one in the TopHap phylogeny. Only two were
attached at or near the root. One of them had the same haplotype sequence
as that of MRCA and was present in 17 isolates. This haplotype is the
proCoV2 sequence reported by Kumar et al. (2021); it circulated in early
2020. The other haplotype differed from the proCoV2 sequence in two
genomic positions (29095 [location of x variant] and 18060 [location of 1
variant]). It was attached to the trunk of the phylogeny (Fig. 7a). This
haplotype is the same one that Bloom (2021) suggested to be important in
rooting the SARS-CoV-2 phylogeny. Also, Bloom (2021) reported two
evolutionary scenarios with this mutation x (Fig. 7b and 7c), which led us
to consider five alternative scenarios based on TopHap, MOA, Bloom
(2021), Nextstrain, and PANGO (Fig. 7). All these scenarios involved
eight positions that experienced early mutations (1-3, 1-3, 1-2, and x)
to give rise to seven major haplotypes. Therefore, we inferred phylogenies
containing only 1-3, 1-3, 1-2, and x mutations using MP, i.e., we
attached the haplotype with the x mutation into the phylogenies of TopHap
(Fig. 7a and 7b for two equally parsimonious solutions), MOA (Fig. 7b),
PANGO (Fig. 7d), and Nextstrain (Fig. 7e). Our evaluation of these five
scenarios is the most detailed comparison to date because of the size of the
dataset analyzed and the variants included. For example, 1 and 2 variants
were absent from the Bloom (2021) dataset because it included genomes sampled
only until the end of January 2020, and variant x was missing from the Kumar
et al. (2021) analysis because its global frequency was less than 1% in the
68KG dataset.
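The one-by-one placement of a rare haplotype described above can be illustrated with a small parsimony sketch. It is not the authors' implementation: it assumes that haplotype sequences at the tree nodes are already fixed, and it scores each branch by the extra substitutions needed to attach the query there. The node names, sequences, and the function names `placement_cost` and `best_placements` are hypothetical. Ties in the score correspond to equally parsimonious placements.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]  # (parent node, child node)


def placement_cost(query: str, parent_seq: str, child_seq: str) -> int:
    """Extra substitutions needed to attach `query` on the branch parent-child.

    A site is free if the query matches either end of the branch; otherwise
    it costs one additional substitution on the new pendant branch.
    """
    return sum(q != p and q != c for q, p, c in zip(query, parent_seq, child_seq))


def best_placements(
    query: str, node_seqs: Dict[str, str], edges: List[Edge]
) -> Tuple[int, List[Edge]]:
    """Return the minimum extra cost and every branch that achieves it."""
    costs = {
        (p, c): placement_cost(query, node_seqs[p], node_seqs[c]) for p, c in edges
    }
    best = min(costs.values())
    return best, sorted(edge for edge, cost in costs.items() if cost == best)


# Hypothetical three-site haplotypes on a toy backbone (not the Fig. 7 data).
node_seqs = {"root": "CCC", "n1": "TCC", "n2": "TTC", "n3": "CCT"}
edges = [("root", "n1"), ("n1", "n2"), ("root", "n3")]

print(best_placements("TTT", node_seqs, edges))  # a single best branch
print(best_placements("TCT", node_seqs, edges))  # several tied branches
```

Because the node sequences are held fixed, this is only an approximation of a full MP search, which would re-optimize ancestral states after each attachment.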
We then evaluated these five scenarios (topologies) using MP and ML
optimality criteria (Fig. 7). In the MP analysis, scenarios A, B, and C were
equally parsimonious, and D and E (PANGO and Nextstrain, respectively)
were less parsimonious by 1 and 3 mutations, respectively. Scenarios D and E
were also less likely than A, B, and C when we estimated the log-likelihood
(lnL) of all five scenarios (topologies) using a GTR model of nucleotide
substitutions in MEGA for the haplotypes shown in Fig. 7. While the
log-likelihood of scenario A was the highest, it was only slightly higher
(difference in lnL < 1.7) than those of B and C, which were equally likely.
Among scenarios A, B, and C, variant x was lost in B, while variant 1 was
acquired twice in A and lost once in C.
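Comparing topologies under the MP criterion amounts to counting the minimum number of substitutions each topology requires. The sketch below uses the Fitch algorithm for that count on rooted binary trees written as nested tuples. It is illustrative only: the authors report using MEGA, and the haplotypes, topologies, and function names here are hypothetical rather than those of Fig. 7.

```python
from typing import Dict, Set, Tuple, Union

# A tree is either a leaf name or a pair of subtrees (rooted, binary).
Tree = Union[str, Tuple["Tree", "Tree"]]


def fitch_site(tree: Tree, states: Dict[str, str]) -> Tuple[Set[str], int]:
    """Return (candidate ancestral states, minimum changes) for one site."""
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch_site(tree[0], states)
    right_set, right_cost = fitch_site(tree[1], states)
    common = left_set & right_set
    if common:
        # Children agree on at least one state: no change needed here.
        return common, left_cost + right_cost
    # Children disagree: one substitution is required on this node's branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree: Tree, haplotypes: Dict[str, str]) -> int:
    """Sum the per-site Fitch costs over all haplotype positions."""
    length = len(next(iter(haplotypes.values())))
    return sum(
        fitch_site(tree, {name: seq[i] for name, seq in haplotypes.items()})[1]
        for i in range(length)
    )


# Hypothetical haplotypes and two candidate topologies.
haplotypes = {"h1": "CCT", "h2": "CCC", "h3": "TTC", "h4": "TTT"}
topology_1 = (("h1", "h2"), ("h3", "h4"))
topology_2 = (("h1", "h3"), ("h2", "h4"))

print(parsimony_score(topology_1, haplotypes))  # 4
print(parsimony_score(topology_2, haplotypes))  # 6
```

In this toy case the second topology needs two more substitutions than the first, which is the kind of difference summarized above as "less parsimonious by" a given number of mutations.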
In all three equally parsimonious scenarios (A, B, and C), the
addition of mutation x pushes back the MRCA of SARS-CoV-2 by one
mutation compared to the proCoV2 sequence of Kumar et al. (2021). In
these cases, the number of differences between Wuhan-1 and the MRCA
is four (Fig. 7). With a mutation rate range of 6.64 × 10⁻⁴ to 9.27 × 10⁻⁴
Figure 6. The 1MG TopHap Phylogeny. (a) Red numbers near nodes are bootstrap
confidence limits derived from bootstrap resampling of genomes. Early
mutations that were predicted in Kumar et al. (2021) are shown on branches
using their mutation IDs (Greek symbols). Their mutations and genomic
positions are given in Figure 3. The haplotypes with concerning mutations are
indicated by using WHO IDs, and 20A EU2 and 20E (EU1) are Nextstrain clade
IDs. These haplotypes were identified by annotating PANGO and Nextstrain
lineage for each genome. We also annotated TopHap haplotype for each genome
by comparing its haplotype with TopHap haplotypes. When an observed haplotype
did not perfectly match any of the TopHap haplotypes, we did not assign any
for the genome. Using these genome annotations, we paired each TopHap
haplotype with the major PANGO and Nextstrain lineage, which contained the
WHO annotation. We assigned WHO ID when at least one of the annotations
indicated it. Evolutionary relationship of lineages with concerning mutations
by (b) Nextstrain and (c) TopHap.
Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btac186/6553661 by guest on 28 March 2022
TopHap: Building big phylogenies by using major haplotypes
substitutions per site per year (Pekar et al., 2021), we can estimate that
proCoV2 existed 7.7–10.8 weeks before the December 24, 2019 sampling
date of Wuhan-1. This places the origin of the SARS-CoV-2 progenitor in
mid-September to early October 2019, weeks earlier than the
mid-November 2019 date proposed by Pekar et al. (2021). For their
analysis, Pekar et al. (2021) used the rooting from scenario D in which the
lineage containing 2-3 and 1-3 (PANGO B) is a sister group of the
lineage containing 1 and 1-2 (PANGO A) (Fig. 7d). As noted above,
this scenario receives lower bootstrap support than the alternative in which
PANGO B arose from the ancestor containing 1. In this sense, Pekar et al.
(2021) have likely dated an event that occurred downstream of the MRCA.
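The elapsed-time estimate above follows from dividing the number of mutational differences by the expected number of substitutions per genome per year. The sketch below reproduces that arithmetic under a stated assumption: it uses the 29,903-base length of the Wuhan-1 reference genome, which may differ from the sequence length the authors used. The function name `weeks_for_mutations` is hypothetical.

```python
GENOME_LENGTH = 29_903          # assumed: Wuhan-1 reference genome length (bases)
WEEKS_PER_YEAR = 365.25 / 7     # calendar conversion used in this sketch


def weeks_for_mutations(
    n_mutations: int,
    rate_per_site_per_year: float,
    genome_length: int = GENOME_LENGTH,
) -> float:
    """Expected time (in weeks) to accumulate `n_mutations` across the genome."""
    substitutions_per_year = rate_per_site_per_year * genome_length
    return n_mutations / substitutions_per_year * WEEKS_PER_YEAR


# Four differences between Wuhan-1 and the MRCA; rate range from the text.
fast = weeks_for_mutations(4, 9.27e-4)   # higher rate, shorter time
slow = weeks_for_mutations(4, 6.64e-4)   # lower rate, longer time
print(f"{fast:.1f} to {slow:.1f} weeks")
```

Under these assumptions the sketch gives roughly 7.5 to 10.5 weeks, close to but not identical with the 7.7–10.8 weeks stated in the text, so the authors presumably used a slightly different sequence length or time conversion.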
4 Conclusions
The ongoing global efforts to monitor the evolution of the SARS-CoV-2
coronavirus have motivated hundreds of laboratories worldwide to
generate genome sequences continuously. The number of genomes has
grown quickly, becoming orders of magnitude greater than the genome
size. Rapid growth, low sequence variability, and the presence of
sequencing error have made the direct use of phylogenetic methods on
genome alignments challenging for such data (e.g., Morel et al., 2020).
We have shown that the TopHap phylogeny for common variants and
haplotypes in the 68KG SARS-CoV-2 dataset works well and agrees with
the mutation tree produced using MOA (Kumar et al., 2021). However, the
TopHap approach offers some advantages over MOA. Firstly, MOA
assumes the sequencing error rate to be constant throughout the outbreak,
which is unlikely to hold for pathogenomic datasets acquired in different
laboratories at different times.
Secondly, MOA analysis needs to have mutant bases indicated at the
outset, a limitation addressed by Kumar et al. (2021), but at a large
computational expense. In contrast, TopHap analyses directly use an outgroup
in a standard phylogenetic analysis. TopHap is also more computationally
efficient: the analysis of the 68KG dataset took only a few hours, whereas
MOA took more than a week to compute.
Thirdly, TopHap analysis can use well-established methods to infer
phylogeny and ancestral sequences to identify recurrent and backward
mutations. In contrast, MOA assumes an infinite sites model and, thus, is
not suitable for detecting recurrent and backward mutations. Lastly, rarer
haplotypes can also be attached to the backbone of a TopHap phylogeny by
simply adding the genomic position of interest when constructing the MSA of
haplotypes, as demonstrated above.
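Adding a genomic position of interest to the haplotype MSA can be pictured as extending the list of alignment columns from which haplotypes are read. The sketch below is a hypothetical illustration, not the TopHap code: the toy alignment, the column indices, and the function name `haplotype_counts` are all assumptions, and the selection of common-variant positions is taken as already done.

```python
from collections import Counter
from typing import Dict, List


def haplotype_counts(msa: Dict[str, str], positions: List[int]) -> Counter:
    """Count haplotypes formed by the bases at the given 0-based columns."""
    return Counter("".join(seq[p] for p in positions) for seq in msa.values())


# Hypothetical toy alignment: genome ID -> aligned sequence.
msa = {
    "g1": "ACGTAC",
    "g2": "ACGTAC",
    "g3": "ATGTAC",
    "g4": "ATGTAT",
}
backbone_positions = [1]      # columns holding common variants (assumed chosen)
extended_positions = [1, 5]   # backbone plus one rarer position of interest

print(haplotype_counts(msa, backbone_positions))
print(haplotype_counts(msa, extended_positions))
```

With the extra column, the genome carrying the rarer variant forms its own haplotype, which can then be placed on the backbone phylogeny alongside the common haplotypes.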
In conclusion, TopHap is a simple and effective method to build
haplotype phylogenies and assess their statistical robustness. TopHap can
be applied to any dataset containing a large number of sequences with a
handful of variants, including data from other pathogens and tumor
single-cell sequencing, which is now producing a large number of somatic
cell sequences (Navin, 2015).
Acknowledgments
We thank Sudip Sharma for comments and everyone depositing genome
data on GISAID (list at http://igem.temple.edu/COVID-19).
Author contributions
S.K. and S.M. developed the original method and designed research; S.M.,
M.S., and T.D. implemented the technique; M.A.C.O., S.M., T.D., and
Q.T. performed analyses; S.P. and S.W. assembled sequence alignments;
and S.K., S.M., M.A.C.O., Q.T., and S.P. wrote the paper.
Figure 7. The early history of SARS-CoV-2 variants. Five root positions are explored in
which the haplotype with mutation x has been added to the TopHap phylogeny in Figure 3 (A
and B), the Kumar et al. (2021) mutational history (B), the Bloom (2021) phylogeny (B and C),
the PANGO classification (D), and the Nextstrain classification (E). Haplotypes have eight
positions that contain variants 1-3, 1-3, 1-2, and x. Genomic positions are shown
whenever a mutation occurs (green box for forward and red for backward mutations). Using
the maximum parsimony criterion, we placed the haplotype with the x variant into each phylogeny.
TopHap had two equally parsimonious solutions (A and B), of which the ML placement
favored scenario A. ML log-likelihoods (lnL) and the number of MP substitutions are
shown. WH-1 is the haplotype corresponding to the Wuhan-1 genome. The gray triangle
represents all the other SARS-CoV-2 haplotypes of the ongoing infections in the world.
Funding
This work was supported by grants from the U.S. National Science Foundation to
S.K. and S.M. (DEB-2034228) and S.P. (DBI-2027196) and from the U.S. National
Institutes of Health to S.K. (GM 139504) and S.P. (AI-134384).
References
Andersen, K.G., et al. (2020) The proximal origin of SARS-CoV-2. Nat.
Med., 26(4), 450–452.
Berger, S.A., Krompass, D. and Stamatakis, A. (2011) Performance,
Accuracy, and Web Server for Evolutionary Placement of Short Sequence
Reads under Maximum Likelihood. Syst. Biol., 60(3), 291–302.
Bloom, J.D. (2021) Recovery of Deleted Deep Sequencing Data Sheds
More Light on the Early Wuhan SARS-CoV-2 Epidemic. Mol. Biol. Evol.,
38(12), 5211–5224.
Bouckaert, R.R. (2010) DensiTree: making sense of sets of phylogenetic
trees. Bioinformatics, 26(10), 1372–1373.
Felsenstein, J. (1985) Confidence limits on phylogenies: an approach using
the bootstrap. Evolution, 39(4), 783–791.
Jahn, K., Kuipers, J. and Beerenwinkel, N. (2016) Tree inference for
single-cell data. Genome Biology, 17(1), 86.
Kozlov, A.M., et al. (2019) RAxML-NG: a fast, scalable and user-friendly
tool for maximum likelihood phylogenetic inference. Bioinformatics,
35(21), 4453–4455.
Kumar, S., et al. (2021) An Evolutionary Portrait of the Progenitor SARS-
CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic. Mol. Biol.
Evol., 38(8), 3046–3059.
Lewis, P.O. (2001) A Likelihood Approach to Estimating Phylogeny from
Discrete Morphological Character Data. Syst. Biol., 50(6), 913–925.
Liu, P., et al. (2020) Are pangolins the intermediate host of the 2019 novel
coronavirus (SARS-CoV-2)? PLoS Path., 16(5), e1008421.
Matsen, F.A., Kodner, R.B. and Armbrust, E.V. (2010) pplacer: linear time
maximum-likelihood and Bayesian phylogenetic placement of sequences
onto a fixed reference tree. BMC Bioinformatics, 11(1), 538.
Morel, B., et al. (2020) Phylogenetic Analysis of SARS-CoV-2 Data Is
Difficult. Mol. Biol. Evol., 38(5), 1777–1791.
Navin, N.E. (2015) The first five years of single-cell cancer genomics and
beyond. Genome Res., 25(10), 1499–1507.
Nie, Q., et al. (2020) Phylogenetic and phylodynamic analyses of SARS-
CoV-2. Virus Res., 287, 198098.
Pekar, J., et al. (2021) Evidence Against the Veracity of SARS-CoV-2
Genomes Intermediate between Lineages A and B. Virological.org.
Pipes, L., et al. (2020) Assessing Uncertainty in the Rooting of the SARS-
CoV-2 Phylogeny. Mol. Biol. Evol., 38(4), 1537–1543.
Rambaut, A., et al. (2020) A dynamic nomenclature proposal for SARS-
CoV-2 lineages to assist genomic epidemiology. Nature Microbiology,
5(11), 1403–1407.
Tamura, K., Stecher, G. and Kumar, S. (2021) MEGA11: Molecular
Evolutionary Genetics Analysis Version 11. Mol. Biol. Evol., 38(7), 3022–
3027.
Turakhia, Y., et al. (2021) Ultrafast Sample placement on Existing tRees
(UShER) enables real-time phylogenetics for the SARS-CoV-2 pandemic.
Nat. Genet., 53(6), 809–816.
van Dorp, L., et al. (2020) Emergence of genomic diversity and recurrent
mutations in SARS-CoV-2. Infect., Genet. Evol., 83, 104351.
Zhou, P., et al. (2020) A pneumonia outbreak associated with a new
coronavirus of probable bat origin. Nature, 579(7798), 270–273.
... The consensus sequence represents the most frequent nucleotide at each position following short-read alignment. There have been attempts to align a large number of consensus sequences with each other, thus obtaining frequencies assignable to each position [3]. However, one sequence alone does not necessarily accurately represent the composition of the sample, as it represents only a single sequence. ...
Article
This study proposes a novel approach to studying severe acute respiratory syndrome coronavirus 2 virus mutations through sequencing data comparison. Traditional consensus-based methods, which focus on the most common nucleotide at each position, might overlook or obscure the presence of low-frequency variants. Our method, in contrast, retains all sequenced nucleotides at each position, forming a genomic matrix. Utilizing simulated short reads from genomes with specified mutations, we contrasted our genomic matrix approach with the consensus sequence method. Our matrix methodology, across multiple simulated datasets, accurately reflected the known mutations with an average accuracy improvement of 20% over the consensus method. In real-world tests using data from GISAID and NCBI-SRA, our approach demonstrated an increase in reliability by reducing the error margin by approximately 15%. The genomic matrix approach offers a more accurate representation of the viral genomic diversity, thereby providing superior insights into virus evolution and epidemiology.
... There have been attempts to align a large number of consensus sequences with each other, thus obtaining frequencies assignable to each position. 3 However, one sequence alone does not necessarily accurately represent the composition of the sample, as it represents only a single sequence. Meanwhile, a sample comprises a population of many viruses, with different variants potentially carrying various mutations. ...
Preprint
Full-text available
This study proposes a novel approach to studying SARS-CoV-2 virus mutations through sequencing data comparison. Traditional consensus-based methods, which focus on the most common nucleotide at each position, might overlook or obscure the presence of low-frequency variants. Our method, in contrast, retains all sequenced nucleotides at each position, forming a genomic matrix. Utilizing simulated short reads from genomes with specified mutations, we contrasted our genomic matrix approach with the consensus sequence method. Our matrix methodology accurately reflected the known mutations and true compositions, demonstrating its efficacy in understanding the sample variability and their interconnections. Further tests using real data from GISAID and NCBI-SRA confirmed its reliability and robustness. As we see, the genomic matrix approach offers a more accurate representation of the viral genomic diversity, thereby providing superior insights into virus evolution and epidemiology. Future application recommendations are provided based on our observed results.
... I. In addition to these within-clade rates, I estimated the rate at which the clades themselves accumulated amino acid and synonymous changes by regressing the number of differences of the clade's founder sequence (relative to the putative root in clade 19B (Caraballo-Ortiz et al., 2022)) against the estimated time of origin of the clade. These regressions are shown in thick gray lines in Fig. 3. ...
Article
Full-text available
Continued evolution and adaptation of SARS-CoV-2 has lead to more transmissible and immune-evasive variants with profound impact on the course of the pandemic. Here I analyze the evolution of the virus over 2.5 years since its emergence and estimate rates of evolution for synonymous and non-synonymous changes separately for evolution within clades – well defined mono-phyletic groups with gradual evolution – and for the pandemic overall. The rate of synonymous mutations is found to be around 6 changes per year. Synonymous rates within variants vary little from variant to variant and are compatible with the overall rate of 7 changes per year (or 7.5×1047.5\times 10^{-4} per year and codon). In contrast, the rate at which variants accumulate amino acid changes (non-synonymous mutation) was initially around 12-16 changes per year, but in 2021 and 2022 dropped to 6-9 changes per year. The overall rate of non-synonymous evolution, that is across variants, is estimated to be about 26 amino acid changes per year (or 2.7×1032.7\times 10^{-3} per year and codon). This strong acceleration of the overall rate compared to within clade evolution indicates that the evolutionary process that gave rise to the different variants is qualitatively different from that in typical transmission chains and likely dominated by adaptive evolution. I further quantify the spectrum of mutations and purifying selection in different SARS-CoV-2 proteins and show that the massive global sampling of SARS-CoV-2 is sufficient to estimate site specific fitness costs across the entire genome. Many accessory proteins evolve under limited evolutionary constraint with little short term purifying selection. About half of the mutations in other proteins are strongly deleterious.
... A zoonotic jump from animals has been proposed as a potential origin for SARS-CoV-2 [1]. The virus emerged in Wuhan in late September/early October [2] to as late as mid-October to mid-November 2019 [3] and spread worldwide, leading to over 6 million deaths to date [4]. The Huanan Seafood Market (HSM) was implicated early in the pandemic as a potential source of the virus, ostensibly via zoonosis [5], but several of the earliest reported cases had no link to the market [6], and importantly no animals at the HSM were found to test positive for the virus [7]. ...
Article
Full-text available
Pangolins are the only animals other than bats proposed to have been infected with SARS-CoV-2 related coronaviruses (SARS2r-CoVs) prior to the COVID-19 pandemic. Here, we examine the novel SARS2r-CoV we previously identified in game animal metatranscriptomic datasets sequenced by the Nanjing Agricultural University in 2022, and find that sections of the partial genome phylogenetically group with Guangxi pangolin CoVs (GX PCoVs), while the full RdRp sequence groups with bat-SL-CoVZC45. While the novel SARS2r-CoV is found in 6 pangolin datasets, it is also found in 10 additional NGS datasets from 5 separate mammalian species and is likely related to contamination by a laboratory researched virus. Absence of bat mitochondrial sequences from the datasets, the fragmentary nature of the virus sequence and the presence of a partial sequence of a cloning vector attached to a SARS2r-CoV read suggests that it has been cloned. We find that NGS datasets containing the novel SARS2r-CoV are contaminated with significant Homo sapiens genetic material, and numerous viruses not associated with the host animals sampled. We further identify the dominant human haplogroup of the contaminating H. sapiens genetic material to be F1c1a1, which is of East Asian provenance. The association of this novel SARS2r-CoV with both bat CoV and the GX PCoV clades is an important step towards identifying the origin of the GX PCoVs.
... I. In addition, I estimated the rate at which the variants themselves accumulated amino acid and synonymous changes by regressing the number of changes of the variants (relative to the putative root in clade 19B(Caraballo-Ortiz et al., 2022)) against the esDivergence and evolutionary rates of different Nextstrain clades. Panels A,B&C show the estimated divergence of the founder genotype of each clade (big dot or square) and the subsequent divergence trend for all nucleotide changes, amino acid changes, and synonymous changes, respectively. ...
Preprint
Full-text available
Continued evolution and adaptation of SARS-CoV-2 has lead to more transmissible and immune-evasive variants with profound impact on the course of the pandemic. Here I analyze the evolution of the virus over 2.5 years since its emergence and estimate rates of evolution for synonymous and non-synonymous changes separately for evolution within clades -- well defined mono-phyletic groups with gradual evolution -- and for the pandemic overall. The rate of synonymous mutations is found to be around 6 changes per year. Synonymous rates within variants vary little from variant to variant and are compatible with the overall rate. In contrast, the rate at which variants accumulate amino acid changes (non-synonymous mutation) was initially around 12-16 changes per year, but in 2021 and 2022 dropped to 6-9 changes per year. The overall rate of non-synonymous evolution, that is across variants, is estimated to be about 25 amino acid changes per year. This 2-fold higher rate indicates that the evolutionary process that gave rise to the different variants is qualitatively different from that in typical transmission chains and likely dominated by adaptive evolution. I further quantify the spectrum of mutations and purifying selection in different SARS-CoV-2 proteins. Many accessory proteins evolve under limited evolutionary constraint with little short term purifying selection. About half of the mutations in other proteins are strongly deleterious and rarely observed, not even at low frequency.
Article
The study of tumor evolution is being revolutionalized by single-cell sequencing technologies that survey the somatic variation of cancer cells. In these endeavors, reliable inference of the evolutionary relationship of single cells is a key step. However, single-cell sequences contain many errors and missing bases, which necessitate advancing standard molecular phylogenetics approaches for applications in analyzing these datasets. We have developed a computational approach that integratively applies standard phylogenetic optimality principles and patterns of co-occurrence of sequence variations to produce more expansive and accurate cellular phylogenies from single-cell sequence datasets. We found the new approach to also perform well for CRISPR/Cas9 genome editing datasets, suggesting that it can be useful for various applications. We apply the new approach to some empirical datasets to showcase its use for reconstructing recurrent mutations and mutational reversals as well as for phylodynamics analysis to infer metastatic cell migrations between tumors.
Article
Full-text available
A group of 156 virologists, including American Society of Microbiology journal editors-in-chief, has recently published across three ASM journals a “call for rational discourse” on such important topics as the origin of SARS-CoV-2 and gain of function research (e.g., F. Goodrum et al., mBio 14:e0018823, 2023, https://doi.org/10.1128/mbio.00188-23). Here, I answer the call, arguing that the origin of SARS-CoV-2 is unknown; that continued premature downplaying of a possible laboratory origin, now accompanied by a denial that this was ever so dismissed, undermines public trust in science; and that the benefits from risky gain-of-function research-of-concern are fewer than Goodrum et al. imply.
Article
Full-text available
Understanding the circumstances that lead to pandemics is important for their prevention. Here, we analyze the genomic diversity of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) early in the coronavirus disease 2019 (COVID-19) pandemic. We show that SARS-CoV-2 genomic diversity before February 2020 likely comprised only two distinct viral lineages, denoted A and B. Phylodynamic rooting methods, coupled with epidemic simulations, reveal that these lineages were the result of at least two separate cross-species transmission events into humans. The first zoonotic transmission likely involved lineage B viruses around 18 November 2019 (23 October–8 December), while the separate introduction of lineage A likely occurred within weeks of this event. These findings indicate that it is unlikely that SARS-CoV-2 circulated widely in humans prior to November 2019 and define the narrow window between when SARS-CoV-2 first jumped into humans and when the first cases of COVID-19 were reported. As with other coronaviruses, SARS-CoV-2 emergence likely resulted from multiple zoonotic events.
Preprint
Full-text available
Pangolins are the only animals other than bats proposed to have been infected with SARS-CoV-2 related coronaviruses (SARS2r-CoVs) prior to the COVID-19 pandemic. Here we examine the novel SARS2r-CoV we previously identified in game animal metatranscriptomic datasets sequenced by He et al. (2022) and find that sections of the partial genome phylogenetically group with Guangxi (GX) pangolin CoVs (GX PCoVs) while the full RdRp sequence groups with bat-SL-CoVZC45. While the novel SARS2r-CoV is found in 6 pangolin datasets, the same CoV is also found in 10 additional NGS datasets from 5 separate mammalian species and is likely related to contamination by a laboratory researched virus. We find that NGS datasets containing the novel SARS2r-CoV are contaminated with significant Homo sapiens genetic material, and numerous viruses not associated with the host animals sampled. We further identify the dominant human haplogroup of the contaminating H.sapiens genetic material to be F1c1a1, which is of East Asian provenance. The association of this novel CoV with both bat CoV and the Guangxi pangolin CoV (GX PCoV) clades is an important step towards identifying the origin of the GX PCoVs.
Article
Full-text available
Two years after the start of the COVID-19 pandemic, key questions about the emergence of its aetiological agent (SARS-CoV-2) remain a matter of considerable debate. Identifying when SARS-CoV-2 began spreading among people is one of those questions. Although the current canonically accepted timeline hypothesises viral emergence in Wuhan, China, in November or December 2019, a growing body of diverse studies provides evidence that the virus may have been spreading worldwide weeks, or even months, prior to that time. However, the hypothesis of earlier SARS-CoV-2 circulation is often dismissed with prejudicial scepticism and experimental studies pointing to early origins are frequently and speculatively attributed to false-positive tests. In this paper, we critically review current evidence that SARS-CoV-2 had been circulating prior to December of 2019, and emphasise how, despite some scientific limitations, this hypothesis should no longer be ignored and considered sufficient to warrant further larger-scale studies to determine its veracity.
Article
Full-text available
The origin and early spread of SARS-CoV-2 remains shrouded in mystery. Here I identify a data set containing SARS-CoV-2 sequences from early in the Wuhan epidemic that has been deleted from the NIH’s Sequence Read Archive. I recover the deleted files from the Google Cloud, and reconstruct partial sequences of 13 early epidemic viruses. Phylogenetic analysis of these sequences in the context of carefully annotated existing data further supports the idea that the Huanan Seafood Market sequences are not fully representative of the viruses in Wuhan early in the epidemic. Instead, the progenitor of currently known SARS-CoV-2 sequences likely contained three mutations relative to the market viruses that made it more similar to SARS-CoV-2’s bat coronavirus relatives.
Article
Full-text available
As the SARS-CoV-2 virus spreads through human populations, the unprecedented accumulation of viral genome sequences is ushering in a new era of ‘genomic contact tracing’—that is, using viral genomes to trace local transmission dynamics. However, because the viral phylogeny is already so large—and will undoubtedly grow many fold—placing new sequences onto the tree has emerged as a barrier to real-time genomic contact tracing. Here, we resolve this challenge by building an efficient tree-based data structure encoding the inferred evolutionary history of the virus. We demonstrate that our approach greatly improves the speed of phylogenetic placement of new samples and data visualization, making it possible to complete the placements under the constraints of real-time contact tracing. Thus, our method addresses an important need for maintaining a fully updated reference phylogeny. We make these tools available to the research community through the University of California Santa Cruz SARS-CoV-2 Genome Browser to enable rapid cross-referencing of information in new virus sequences with an ever-expanding array of molecular and structural biology data. The methods described here will empower research and genomic contact tracing for SARS-CoV-2 specifically for laboratories worldwide.
Article
Full-text available
Global sequencing of hundreds of thousands of genomes of Severe acute respiratory syndrome coronavirus 2, SARS-CoV-2, has continued to reveal new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time. Here, we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed through a novel application and advancement of computational methods initially developed to infer the mutational history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human infections. However, multiple coronavirus infections in China and the USA harbored the progenitor genetic fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many dominant coronavirus strains, which have spread episodically over time. Fingerprinting based on common mutations reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There have been multiple replacements of predominant coronavirus strains in Europe and Asia and the continued presence of multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of global evolution and spatiotemporal trends of SARS-CoV-2 spread (http://sars2evo.datamonkey.org/).
Article
Full-text available
The Molecular Evolutionary Genetics Analysis (MEGA) software has matured to contain a large collection of methods and tools of computational molecular evolution. Here, we describe new additions that make MEGA a more comprehensive tool for building timetrees of species, pathogens, and gene families using rapid relaxed-clock methods. Methods for estimating divergence times and confidence intervals are implemented to use probability densities for calibration constraints for node-dating and sequence sampling dates for tip-dating analyses, which will be supported by new options for tagging sequences with spatiotemporal sampling information, an expanded interactive Node Calibrations Editor, and an extended Tree Explorer to display timetrees. We have now added a Bayesian method for estimating neutral evolutionary probabilities of alleles in a species using multispecies sequence alignments and a machine learning method to test for the autocorrelation of evolutionary rates in phylogenies. The computer memory requirements for the maximum likelihood analysis are reduced significantly through reprogramming, and the graphical user interface (GUI) has been made more responsive and interactive for very big datasets. These enhancements will improve the user experience, quality of results, and the pace of biological discovery. Natively compiled GUI and command-line versions of MEGA11 are available for Microsoft Windows, Linux, and macOS from www.megasoftware.net.
Article
Full-text available
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8, 736 out of all 16, 453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into sub-classes using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on nextstrain.org. Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from gisaid.org. We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into sub-classes using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution. Keywords: SARS-CoV-2, phylogenetic inference, phylogeny rooting, outgroups, strain classification. Correspondence: alexandros.stamatakis@h-its.org
The rooting of the SARS-CoV-2 phylogeny is important for understanding the origin and early spread of the virus. Previously published phylogenies have used different rootings that do not always provide consistent results. We investigate several different strategies for rooting the SARS-CoV-2 tree and provide measures of statistical uncertainty for all methods. We show that methods based on the molecular clock tend to place the root in the B clade, while methods based on outgroup rooting tend to place the root in the A clade. The results from the two approaches are statistically incompatible, possibly as a consequence of deviations from a molecular clock or excess back-mutations. We also show that none of the methods provide strong statistical support for the placement of the root in any particular edge of the tree. These results suggest that phylogenetic evidence alone is unlikely to identify the origin of the SARS-CoV-2 virus and we caution against strong inferences regarding the early spread of the virus based solely on such evidence. Keywords: SARS-CoV-2 phylogeny, outgroup rooting, molecular clock rooting
Severe acute respiratory syndrome coronavirus 2, SARS-CoV-2, was quickly identified as the cause of COVID-19 disease soon after its earliest reports. The knowledge of the contemporary evolution of SARS-CoV-2 is urgently needed not only for a retrospective on how, when, and why COVID-19 has emerged and spread, but also for creating remedies through efforts of science, technology, medicine, and public policy. Global sequencing of thousands of genomes has revealed many common genetic variants, which are the key to unraveling the early evolutionary history of SARS-CoV-2 and tracking its global spread over time. However, our knowledge of fundamental events in the evolution and spread of this coronavirus remains grossly incomplete and highly uncertain. Here, we present the heretofore cryptic mutational history, phylogeny, and dynamics of SARS-CoV-2 from an analysis of tens of thousands of high-quality genomes. The reconstructed mutational progression is highly concordant with the timing of coronavirus sampling dates. It predicts the progenitor genome whose earliest offspring without any non-synonymous mutations were still spreading worldwide months after the report of COVID-19. Over time, mutations gave rise to seven major lineages that spread episodically, some of which arose in Europe and North America after the genesis of the ancestral lineages in China. Mutational barcoding establishes that North American coronaviruses harbor very different genome signatures than coronaviruses prevalent in Europe and Asia that have converged over time. These spatiotemporal patterns continue to evolve as the pandemic progresses and can be viewed live online.
The ongoing pandemic spread of a new human coronavirus, SARS-CoV-2, which is associated with severe pneumonia/disease (COVID-19), has resulted in the generation of tens of thousands of virus genome sequences. The rate of genome generation is unprecedented, yet there is currently no coherent nor accepted scheme for naming the expanding phylogenetic diversity of SARS-CoV-2. Here, we present a rational and dynamic virus nomenclature that uses a phylogenetic framework to identify those lineages that contribute most to active spread. Our system is made tractable by constraining the number and depth of hierarchical lineage labels and by flagging and delabelling virus lineages that become unobserved and hence are probably inactive. By focusing on active virus lineages and those spreading to new locations, this nomenclature will assist in tracking and understanding the patterns and determinants of the global spread of SARS-CoV-2.
To investigate the evolutionary and epidemiological dynamics of the current COVID-19 outbreak, a total of 112 genomes of SARS-CoV-2 strains sampled from China and 12 other countries with sampling dates between 24 December 2019 and 9 February 2020 were analyzed. We performed phylogenetic, split network, likelihood-mapping, model comparison, and phylodynamic analyses of the genomes. Based on Bayesian time-scaled phylogenetic analysis with the best-fitting combination models, we estimated the time to the most recent common ancestor (TMRCA) and evolutionary rate of SARS-CoV-2 to be 12 November 2019 (95% BCI: 11 October 2019 and 09 December 2019) and 9.90 × 10⁻⁴ substitutions per site per year (95% BCI: 6.29 × 10⁻⁴–1.35 × 10⁻³), respectively. Notably, the very low Re estimates of SARS-CoV-2 during the recent sampling period may be the result of the successful control of the pandemic in China due to extreme societal lockdown efforts. Our results emphasize the importance of using phylodynamic analyses to provide insights into the roles of various interventions to limit the spread of SARS-CoV-2 in China and beyond.