
An Evolutionary Portrait of the Progenitor SARS-CoV-2 and Its
Dominant Offshoots in COVID-19 Pandemic
Sudhir Kumar,*
Qiqing Tao,
Steven Weaver,
Maxwell Sanderford,
Marcos A. Caraballo-Ortiz,
Sudip Sharma,
Sergei L.K. Pond,*
and Sayaka Miura*
Institute for Genomics and Evolutionary Medicine, Temple University, Philadelphia, PA, USA
Department of Biology, Temple University, Philadelphia, PA, USA
Center for Excellence in Genome Medicine and Research, King Abdulaziz University, Jeddah, Saudi Arabia
*Corresponding authors: E-mails:;;
Associate editor: Meredith Yeager
Global sequencing of genomes of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has continued to reveal
new genetic variants that are the key to unraveling its early evolutionary history and tracking its global spread over time.
Here we present the heretofore cryptic mutational history and spatiotemporal dynamics of SARS-CoV-2 from an analysis
of thousands of high-quality genomes. We report the likely most recent common ancestor of SARS-CoV-2, reconstructed
through a novel application and advancement of computational methods initially developed to infer the mutational
history of tumor cells in a patient. This progenitor genome differs from genomes of the first coronaviruses sampled in
China by three variants, implying that none of the earliest patients represent the index case or gave rise to all the human
infections. However, multiple coronavirus infections in China and the United States harbored the progenitor genetic
fingerprint in January 2020 and later, suggesting that the progenitor was spreading worldwide months before and after
the first reported cases of COVID-19 in China. Mutations of the progenitor and its offshoots have produced many
dominant coronavirus strains that have spread episodically over time. Fingerprinting based on common mutations
reveals that the same coronavirus lineage has dominated North America for most of the pandemic in 2020. There
have been multiple replacements of predominant coronavirus strains in Europe and Asia as well as continued presence of
multiple high-frequency strains in Asia and North America. We have developed a continually updating dashboard of
global evolution and spatiotemporal trends of SARS-CoV-2 spread (
Key words: coronavirus, web tool, phylogeny.
The early evolutionary history of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes COVID-19, remains unclear despite an unprecedented scope of global genome sequencing of SARS-CoV-2 and a multitude of phylogenetic analyses (Forster et al. 2020; Lemey et al. 2020; Rambaut, Holmes, et al. 2020; Tang et al. 2020; Worobey et al. 2020; Chiara et al. 2021; da Silva Filipe et al. 2021; Komissarov et al. 2021; Lemieux et al. 2021; Pekar et al. 2021). Sophisticated investigations have shown that traditional molecular phylogenetic analyses do not produce reliable evolutionary inferences about the early history of SARS-CoV-2 due to low sequence divergence, a limited number of phylogenetically informative sites, and the ubiquity of sequencing errors (De Maio et al. 2020; Mavian et al. 2020; Turakhia et al. 2020). In particular, the root of the SARS-CoV-2 phylogeny remains elusive (Morel et al. 2021; Pipes et al. 2021) because the closely related nonhuman coronaviruses (outgroups) differ from human SARS-CoV-2 genomes by more than 1,100 bases, as compared with fewer than 30 differences between human SARS-CoV-2 genomes sequenced early on (December 2019 and January 2020) (Andersen et al. 2020; Castells et al. 2020; G et al. 2020; Lai et al. 2020; Mavian et al. 2020; Morel et al. 2021; Pipes et al. 2021; Wenzel 2020). Without a reliable root of the SARS-CoV-2 phylogeny, the most recent ancestor sequence cannot be accurately reconstructed, and it is also not possible to assess the genetic diversity of SARS-CoV-2 that existed at the time of its first outbreak. Consequently, we cannot determine if any of the coronaviruses isolated to date carry the genome of the progenitor of all human SARS-CoV-2 infections. Knowing the progenitor genome will also help determine how close the earliest patients sampled in China are to “patient zero,” that is, the first human transmission case.
The orientation and order of early mutations giving rise to common coronavirus variants will also be compromised if the earliest coronavirus isolates are incorrectly used to root the SARS-CoV-2 phylogenies (Dearlove et al. 2020; Fauver et al. 2020; Stefanelli et al. 2020; Tang et al. 2020). Some investigations of COVID-19 patients and their coronaviruses’ genomes already reported the presence of multiple variants (Lu et al. 2020), as genomes of viral samples from December 2019 in China had as many as five differences. These observations require an explicit test to see if one of the early sampled coronavirus genomes was the progenitor of all the strains infecting humans.
Article Fast Track
© The Author(s) 2021. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (
licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is
properly cited. Open Access
Mol. Biol. Evol. 38(8):3046–3059 doi:10.1093/molbev/msab118 Advance Access publication May 4, 2021
Traditionally, ancestral sequences are estimated using a rooted phylogeny (Yang et al. 1995; Nei and Kumar 2002). This ancestral sequence can then be compared with the sequenced genomes to locate the one that is most similar to the inferred progenitor and/or placed closest to the root in the phylogeny. However, as noted above, attempts using ad hoc and traditional methods are fraught with difficulties and have not produced consistent and robust results (Morel et al. 2021; Pipes et al. 2021). Some methods also incorporate sampling times in phylogenetic inference, but they will automatically favor placing the earliest sampled genomes at or near the root of the tree. This fact introduces circularity in testing the hypothesis that the earliest sampled genomes were ancestral because sampling time is used in the inference procedure (Pipes et al. 2021; Pekar et al. 2021).
Results and Discussion
A Mutational Order Approach for SARS-CoV-2
We applied a mutational order approach (MOA) to directly
reconstruct the ancestral sequence and the mutational history of SARS-CoV-2 genomes (Jahn et al. 2016; Ross and Markowetz 2016; Miura et al. 2018). MOA does not infer phylogeny as an
intermediate step. It is often used to build the evolutionary
history of tumor cells that evolve clonally and without re-
combination. This approach is suitable for analyzing SARS-
CoV-2 genomes because of their quasi-species evolutionary
behavior (clonal) and the lack of evidence of significant re-
combination within human outbreaks (Richard et al. 2020),
both of which preserve the collinearity of variants in genomes.
Although reports of recombination among circulating SARS-
CoV-2 genomes have begun to appear (Jackson et al. 2020),
the fraction of circulating recombinant strains is likely very
small and geographically limited and will not affect analyses
conducted on sequences sampled in 2020. This feature per-
mits effective use of shared co-occurrence of variants in
genomes and the frequencies of individual variants for infer-
ring the mutational history, notwithstanding the presence of
sequencing errors and other artifacts (Kim and Simon 2014;
Jahn et al. 2016) (see Materials and Methods).
We advanced MOA to make it applicable for analyzing
SARS-CoV-2 genomes. This was needed because the normal
cell sequence in tumors provides the ancestral (noncancer-
ous) genome sequence to orient the mutational changes, but
such a direct ancestor is not available for coronaviruses in
which the closest outgroup sequences are over 30 times more
different than any two SARS-CoV-2 strains. Also, SARS-CoV-2
genome evolution may not satisfy the perfect phylogeny as-
sumption because some genomic positions have likely expe-
rienced multiple and recurrent mutations (van Dorp et al.
2020; Martin et al. 2021). So, we used shared co-occurrence of
variants among genomes to reduce the impact of the viola-
tion of the perfect phylogeny assumption and select muta-
tion orientations and histories in the maximum likelihood
approach. We also devised a bootstrap procedure to place
confidence limits on the inferred mutation order in which
bootstrap replicate data sets are generated by sampling
genomes with replacement.
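The resampling step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: genomes are represented as sets of variant labels, each replicate draws genomes with replacement, and support for a proposed "parent before child" order is scored with a deliberately simple, hypothetical rule (the parent must be at least as frequent as the child, and most genomes carrying the child must also carry the parent). The function names, the scoring rule, and the 0.5 threshold are assumptions made for illustration only.

```python
import random


def order_supported(genomes, parent, child, min_cooccurrence=0.5):
    """Hypothetical rule: the parent variant precedes the child if it is at
    least as frequent and most child-carrying genomes also carry the parent."""
    n_parent = sum(1 for g in genomes if parent in g)
    n_child = sum(1 for g in genomes if child in g)
    if n_child == 0:
        return False
    n_both = sum(1 for g in genomes if parent in g and child in g)
    return n_parent >= n_child and n_both / n_child >= min_cooccurrence


def bootstrap_support(genomes, parent, child, replicates=200, seed=0):
    """Fraction of bootstrap replicates (genomes resampled with replacement)
    in which the parent -> child order is supported."""
    rng = random.Random(seed)
    n = len(genomes)
    hits = 0
    for _ in range(replicates):
        sample = [genomes[rng.randrange(n)] for _ in range(n)]
        if order_supported(sample, parent, child):
            hits += 1
    return hits / replicates


# Toy data: variant "A" arises first, "B" arises on the "A" background.
toy = [frozenset()] * 5 + [frozenset({"A"})] * 20 + [frozenset({"A", "B"})] * 15
support = bootstrap_support(toy, "A", "B")   # order A -> B
reverse = bootstrap_support(toy, "B", "A")   # order B -> A
print(f"A->B support: {support:.2f}, B->A support: {reverse:.2f}")
```

In this toy data set the A-then-B order is recovered in essentially every replicate, whereas the reverse order is not, which is the kind of contrast a bootstrap confidence level is meant to summarize.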
Here we present results from analyses of two snapshots of
the fast-growing collection of SARS-CoV-2 genomes to make
inferences and assess the robustness of the inferred muta-
tional histories to the growing genome collection that is
expanding at an unprecedented rate. We first present results
from the 29KG data set and then evaluate the concordance of
the mutational history inferred by using an expanded 68KG
data set, which establishes that the conclusions are robust to
the sampling of genomes. The first snapshot was retrieved
from GISAID (Shu and McCauley 2017) on July 7, 2020, and
consisted of 60,332 genomes. Of these, 29,681 were selected
because they were longer than the 28,000-base threshold we
imposed and did not include an excessive number of unresolved
bases in any genomic region (29KG data set). The
second snapshot was acquired on October 12, 2020, from
GISAID and contained 133,741 genomes, of which 68,057
genomes met the inclusion criteria (68KG data set).
We then applied mutational fingerprints inferred using the
68KG data set to an expanded data set of 172,480 genomes
(sampled on December 30, 2020; 172KG) to track global spatiotemporal
dynamics of SARS-CoV-2. We have also set up a live
dashboard showing regularly updated results because the
processes of data analysis, manuscript preparation, and peer
review of scientific articles are much slower than the pace of
expansion of SARS-CoV-2 genome collection. Also, we pro-
2 genome based on key mutations derived by the MOA anal-
ysis (
Mutational History and Progenitor of SARS-CoV-2
We used MOA to reconstruct the history of mutations that gave rise to 49 common single nucleotide variants (SNVs) in the 29KG data set. These variants occur with >1% variant frequency (vf > 1%), a threshold chosen to avoid including variants that may be due to sequencing errors (fig. 1a). To simplify notation, we used the inferred mutation history to denote key groups of mutations by assigning Greek symbols (μ, ν, α, β, γ, δ, and ε) to them. Individual mutations were assigned numbers and letters based on the reconstructed order and their parent-offspring relationships (table 1). We estimated the timing of each mutation based on the timestamp of the viral samples’ genome sequences in which it first appeared (table 1; see Materials and Methods). The inferred mutation order was in excellent agreement with the temporal pattern of the first appearance of variants in the 29KG data set. The timestamp of 47 out of 49 mutations was greater than or equal to the timestamp of the corresponding preceding mutation in the mutational history. The exceptions were two low-frequency β offshoot mutations (see Materials and Methods). This concordance provides independent validation of the reconstructed mutation graph because neither sampling dates nor locations were used in MOA analysis.
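The temporal concordance check described above amounts to comparing, for every mutation, its first-sampling day with that of its predecessor in the inferred history. The sketch below shows one way to count concordant pairs. The mutation labels, the parent map, and the day values are invented toy data, not values from table 1, and the function name is hypothetical.

```python
def temporal_concordance(parent, first_seen_day):
    """Count mutations whose first-sampling day is not earlier than that of
    their predecessor. Returns (number concordant, number of pairs checked)."""
    pairs = [(child, par) for child, par in parent.items() if par is not None]
    concordant = sum(
        1 for child, par in pairs if first_seen_day[child] >= first_seen_day[par]
    )
    return concordant, len(pairs)


# Toy mutation history: m1 is the first mutation; m3 and m4 both descend from m2.
toy_parent = {"m1": None, "m2": "m1", "m3": "m2", "m4": "m2"}
toy_days = {"m1": 0, "m2": 10, "m3": 31, "m4": 5}  # m4 was sampled before m2

n_ok, n_pairs = temporal_concordance(toy_parent, toy_days)
print(f"{n_ok} of {n_pairs} mutations are temporally concordant")
```

Because sampling dates play no role in building the history, a high concordant count is independent evidence for the inferred order, while a discordant pair such as m4 here may simply reflect sparse early sampling.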
New variants occurred in the genomic background of the variants preceding them in the reconstructed mutation history with a very high propensity (co-occurrence index, COI > 84%; fig. 2), except for one low-frequency β offshoot mentioned above (COI = 54%). Overall, these results suggest a strong signal to infer a sequential mutational history, even though a small minority of sequences at a position may have experienced homoplasy or recurrent mutations. Indeed, a bootstrap analysis involving genome resampling to assess the robustness of the mutation history produced high bootstrap confidence levels (BCLs) for key groups of mutations as well as many offshoots (fig. 2; BCL > 95%).
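To make the co-occurrence index concrete, the sketch below computes it under one plausible reading of the text: the percentage of genomes carrying a variant that also carry its predecessor in the mutation history. The exact definition used by the authors is given in their Materials and Methods, so this formula, the function name, and the toy counts are assumptions.

```python
def co_occurrence_index(genomes, variant, predecessor):
    """Percentage of genomes carrying `variant` that also carry `predecessor`
    (an assumed definition of the COI, for illustration)."""
    carriers = [g for g in genomes if variant in g]
    if not carriers:
        raise ValueError(f"no genome carries variant {variant!r}")
    with_predecessor = sum(1 for g in carriers if predecessor in g)
    return 100.0 * with_predecessor / len(carriers)


# Toy data: 50 genomes carry both variants, 4 carry only the later variant
# (e.g., through recurrent mutation or sequencing error), 46 carry only the earlier one.
toy_genomes = (
    [frozenset({"p", "v"})] * 50 + [frozenset({"v"})] * 4 + [frozenset({"p"})] * 46
)
coi = co_occurrence_index(toy_genomes, "v", "p")
print(f"COI = {coi:.1f}%")
```

Under this reading, a COI close to 100% means the later variant is almost never seen without its predecessor, which is what a sequential history predicts.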
Episodic Evolution and Selection
The order of some mutations in the mutational history is not established with high BCLs, for example, the relative order of the three ε mutations. This is because the three ε variants almost always occur together (7,624 genomes), and the intermediate combinations of ε variants were found in only 42 genomes. Similarly, the count of genomes harboring all three β variants (22,739 genomes) far exceeded those with two or fewer β variants (201 genomes). There is a strong temporal tendency of variants to be sampled together (e.g., the ε variants), suggesting an episodic spread of variants (Wald–Wolfowitz runs tests, P < 0.01; see Materials and Methods) that does not allow for determining the precise order of some mutations’ appearance. Episodic variant spread may be caused by founder effects, positive selection, or both (e.g., MacLean et al. 2021). It may also be an artifact of highly uneven regional and temporal genome sequencing, which will produce a biased sample of the actual worldwide population (fig. 1b).
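For readers unfamiliar with the Wald–Wolfowitz runs test mentioned above, the sketch below implements the generic two-sided test on a binary sequence using the normal approximation. How the authors encoded mutations as a binary sequence is described in their Materials and Methods and is not reproduced here; the toy sequence, in which like values cluster together, is a hypothetical stand-in for an episodic pattern.

```python
import math


def runs_test(sequence):
    """Two-sided Wald-Wolfowitz runs test (normal approximation) for a
    sequence of truthy/falsy values. Returns (runs, z, p_value)."""
    values = [bool(x) for x in sequence]
    n1 = sum(values)
    n2 = len(values) - n1
    if n1 == 0 or n2 == 0:
        raise ValueError("both categories must be present")
    # A new run starts wherever two neighbouring values differ.
    runs = 1 + sum(1 for a, b in zip(values, values[1:]) if a != b)
    n = n1 + n2
    mean = 2.0 * n1 * n2 / n + 1.0
    variance = 2.0 * n1 * n2 * (2.0 * n1 * n2 - n) / (n ** 2 * (n - 1))
    if variance <= 0:
        raise ValueError("sequence too short for the normal approximation")
    z = (runs - mean) / math.sqrt(variance)
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return runs, z, p_value


# Clustered toy sequence: far fewer runs than expected under randomness.
clustered = [1] * 10 + [0] * 10
runs, z, p_value = runs_test(clustered)
print(f"runs = {runs}, z = {z:.2f}, p = {p_value:.2g}")
```

A strongly negative z (too few runs) indicates clustering, which is the signature one would expect if groups of variants rise to high frequency together.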
In this mutation history, the ratio of nonsynonymous to
synonymous changes (N/S) is 1.9, which is almost ten times
larger than their ratio of 0.18 for the inferred proCoV2 and
observed Bat CoV proteins. The McDonald–Kreitman test
(McDonald and Kreitman 1991) rejected the similarity of
molecular evolutionary patterns observed within the SARS-
CoV-2 population (29KG data set) and between human
proCoV2 and the bat coronavirus. However, the selective in-
terpretation of such a difference is complicated by the fact
that polymorphisms in SARS-CoV-2 genomes are affected by
molecular mechanisms (e.g., RNA editing) (Giorgio et al. 2020;
Rice et al. 2021), not just selection, and slightly deleterious
alleles can become common when there is a population ex-
pansion (Casals and Bertranpetit 2012). Furthermore, selec-
tion may have played a significant role during the divergence
of human SARS-CoV-2 and bat CoV sequences (MacLean et al.
2021; Martin et al. 2021; Tegally et al. 2021). Nevertheless,
N/S patterns derived from common variants show that mo-
lecular evolutionary patterns observed within SARS-CoV-2
genomes infecting humans differ from those spanning the
divergence between the bat RaTG13 and SARS-CoV-2
genomes, even though positive selection in the early SARS-CoV-2 pandemic history may have been limited (Chiara et al. 2021; MacLean et al. 2021).
FIG. 1. Counts of SNVs and genomes in the 29KG data set. (a) Cumulative count of SNVs present in the 29KG genome data set at different frequencies. (b) The number of genomes in the 29KG collection that were isolated weekly during the pandemic. (c) The number of base differences from proCoV2 (see fig. 2) for genomes sampled in December 2019 and January 2020. The 18 genomes sampled in December 2019 in China (red) have three common SNVs different from proCoV2. In contrast, six genomes sampled in January 2020 in China (Asia, red) and the United States (North America, blue) show no base differences. Multiple genomes (2 and 15) were sampled on two different days. (d) Temporal and spatial distribution of strains identical to proCoV2 at the protein sequence level, that is, they have only μ mutations. The color scheme used to mark sampling locations is shown in panel b.
Axis labels in the figure: SNV frequency; number of SNVs; weeks after December 24, 2019; days after December 24, 2019; number of genomes; number of base differences from proCoV2; sampling date. Marked genomes: earliest sample, NCBI reference, GISAID reference.
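The N/S comparison above can be checked with simple arithmetic. Tallying the 49 rows of table 1 gives 32 variants with an amino acid change and 17 without; the tally is ours, made from the table as printed, and the variable names are illustrative. The between-species ratio of 0.18 is taken from the text.

```python
# Counts tallied from table 1 (assumption: a row without an amino acid
# change is synonymous, as stated in the table footnote).
n_nonsynonymous = 32
n_synonymous = 17

ns_within = n_nonsynonymous / n_synonymous   # within SARS-CoV-2 (29KG variants)
ns_between = 0.18                            # proCoV2 versus bat CoV, from the text
fold_difference = ns_within / ns_between

print(f"N/S within SARS-CoV-2: {ns_within:.2f}")
print(f"fold difference versus between-species N/S: {fold_difference:.1f}")
```

The within-population ratio rounds to the reported 1.9 and is roughly ten times the between-species value, consistent with the statement in the text.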
The Progenitor Genome and the Index Case
The root of the mutation tree is the most recent common ancestor (MRCA) of all the genomes analyzed, which gave rise to two early coronavirus lineages (ν and α; fig. 2). The MRCA genome was the progenitor of all SARS-CoV-2 infections globally, henceforth proCoV2, and was likely carried by the “first detectable” case of human transmission in the COVID-19 pandemic (index case). A comparison of proCoV2 with the Wuhan-1 genome revealed three differences in the 49 positions analyzed, which was also true for other reference genomes (fig. 1c). This suggests that Wuhan-1 (EPI_ISL_402123) and the other earliest sampled genomes are derived coronavirus strains that arose from proCoV2 after the divergence of the ν and
Table 1. SARS-CoV-2 variants in the 29KG data set.
Gene    Genomic position  Base change  Amino acid change*  Day  Frequency  Genomes  First sampled location
ORF1ab  2416              U>C          -                   0    98.1%      0        China, Asia
ORF1ab  19524             U>C          -                   0    98.6%      0        China, Asia
S       23929             U>C          -                   0    98.4%      18       China, Asia
ORF1ab  18060             U>C          -                   0    95.1%      849      China, Asia
N       28657             C>U          -                   63   1.3%       2        France, Europe
ORF1ab  9477              U>A          F>Y                 63   1.2%       3        France, Europe
N       28863             C>U          S>L                 63   1.2%       5        France, Europe
ORF3a   25979             G>U          G>V                 63   1.2%       344      France, Europe
ORF1ab  8782              U>C          -                   0    91.0%      47       China, Asia
ORF8    28144             C>U          S>L                 0    90.8%      1115     China, Asia
ORF1ab  1606              U>C          -                   43   1.7%       501      United Kingdom, Europe
ORF1ab  11083             G>U          L>F                 24   9.2%       376      China, Asia
N       28311             C>U          P>L                 64   1.9%       3        South Korea, Asia
ORF1ab  13730             C>U          A>V                 71   1.8%       3        Taiwan/Malaysia, Asia
ORF1ab  6312              C>A          T>K                 71   1.7%       483      Taiwan/Malaysia, Asia
ORF3a   26144             G>U          G>V                 28   5.1%       121      China, Asia
ORF1ab  14805             C>U          -                   54   6.0%       334      United Kingdom, Europe
ORF1ab  17247             U>C          -                   64   2.0%       580      Switzerland, Europe
ORF1ab  2558              C>U          P>S                 54   1.7%       26       United Kingdom, Europe
ORF1ab  2480              A>G          I>V                 54   1.6%       462      United Kingdom, Europe
ORF1ab  3037              C>U          -                   31   77.0%      11       China, Asia
S       23403             A>G          D>G                 31   77.1%      36       China, Asia
ORF1ab  14408             C>U          P>L                 41   76.9%      3032     Saudi Arabia, Middle East
ORF1ab  20268             A>G          -                   64   5.7%       1213     Italy, Europe
N       28854             C>U          S>L                 29   3.1%       527      China, Asia
ORF1ab  15324             C>U          -                   29   2.3%       678      China, Asia
ORF3a   25429             G>U          V>L                 77   1.7%       485      United Kingdom, Europe
N       28836             C>U          S>L                 74   1.6%       3        Switzerland, Europe
ORF1ab  13862             C>U          T>I                 74   1.6%       50       Switzerland, Europe
ORF1ab  10798             C>A          D>E                 86   1.4%       414      United Kingdom, Europe
ORF3a   25563             G>U          Q>H                 41   29.8%      884      Saudi Arabia, Middle East
ORF1ab  18877             C>U          -                   41   4.0%       757      Saudi Arabia, Middle East
M       26735             C>U          -                   41   1.5%       439      Saudi Arabia, Middle East
ORF1ab  1059              C>U          T>I                 54   23.0%      5157     Singapore, Asia
S       24368             G>U          D>Y                 75   1.3%       389      Sweden, Europe
ORF8    27964             C>U          S>L                 76   2.7%       790      USA, North America
ORF1ab  11916             C>U          S>L                 72   1.6%       166      USA, North America
ORF1ab  18998             C>U          A>V                 72   1.0%       305      USA, North America
N       28881             G>A          R>K                 54   25.7%      2        United Kingdom, Europe
N       28882             G>A          R>K                 54   25.7%      2        United Kingdom, Europe
N       28883             G>C          G>R                 54   25.7%      5365     United Kingdom, Europe
ORF1ab  313               C>U          -                   66   2.1%       608      USA, North America
ORF1ab  19839             U>C          -                   64   1.5%       452      Switzerland, Europe
M       27046             C>U          T>M                 69   1.6%       453      Worldwide
ORF1ab  10097             G>A          G>S                 69   2.5%       5        Denmark, Europe
S       23731             C>U          -                   69   2.5%       403      Denmark, Europe
N       28580             G>U          D>Y                 69   1.2%       353      Chile, South America
ORF1ab  17858             A>G          Y>C                 59   4.7%       32       USA, North America
ORF1ab  17747             C>U          P>L                 59   4.7%       1374     USA, North America
*Amino acid change is shown only for nonsynonymous changes; a dash marks a synonymous change.
FIG. 2. Mutational history graph of SARS-CoV-2 from the 29KG data set. Thick arrows mark the pathway of widespread variants (frequency, vf ≥ 3%), and thin arrows show paths leading to other common mutations (3% > vf > 1%). The pie-chart sizes are proportional to variant frequencies in the 29KG data set, with pie-charts shown for variants with vf > 3% and pie color based on the world’s region where that mutation was first observed. A circle is used for all other variants, with the filled color corresponding to the earliest sampling region. The COI (black font) and the BCL (blue font) of each mutation and its predecessor mutation are shown next to the arrow connecting them. Underlined BCL values mark variant pairs for which BCLs were estimated for groups of variants (see Materials and Methods) because of the episodic nature of variant accumulation within groups resulting in lower BCLs (<80%, dashed arrows). Base changes (n.) are shown for synonymous mutations, and amino acid changes (p.) are shown for nonsynonymous mutations along with the gene/protein names (“ORF” is omitted from gene name abbreviations given in table 1). More details on each mutation are presented in table 1.
Region color key in the figure: Asia, South America, Africa, Europe, North America, Oceania, Middle East.
α lineages (fig. 2). According to the mutational history, the Wuhan-1 strain evolved by three successive α mutations (two synonymous and one nonsynonymous) in proCoV2. This progression is statistically supported (BCL = 100%), which is made possible by the presence of 896 intermediate genomes containing one or two α variants in the 29KG data set. Importantly, three closely related nonhuman coronavirus genomes (bats and pangolin) all have the same base at these positions as does the proCoV2 genome, suggesting that the ancestral genome did not contain α variants. Furthermore, genomes with ν variants of proCoV2 do not contain the other 47 variants, all of which occurred on the genomic background containing α variants. These facts support the inference that coronaviruses lacking α variants were the ancestors of Wuhan-1 and other genomes sampled in December 2019 in China (fig. 1c). Therefore, we conclude that Wuhan-1 was not the direct ancestor of all the early coronavirus infections globally.
A comparison of the proCoV2 genetic fingerprint (49 posi-
tions) in the 29KG collection revealed three matches in China
(Fujian, Guangdong, and Hangzhou) and three in the United
States (Washington) in January 2020 (fig. 1c). One more
match was found in New York in March 2020. The ν mutant
of proCoV2 was first sampled 59 days after the Wuhan-1
strain. This means that the progenitor coronavirus spread
and mutated in the human population for months after
the first reported COVID-19 cases. Furthermore, comparisons
of the protein sequences encoded by the proCoV2 genome
revealed 131 other genomic matches, which contained only
synonymous differences from proCoV2. A majority (89
genomes) of these matches were from coronaviruses sampled
in China and other Asian countries (fig. 1d). The first sequence
was sampled 12 days after Wuhan-1. Multiple matches were
found in all sampled continents and detected as late as April
2020 in Europe. These spatiotemporal patterns suggest that
proCoV2 already possessed the repertoire of protein sequen-
ces needed to infect, spread, and persist in the global human
population (see also MacLean et al. 2021).
Coronavirus Diversity before the First Coronavirus Outbreak
The progenitor of all genomes sequenced from human coro-
from the Wuhan-1 genome, which extends the ancestry before
the late November/early December 2019 date that has been suggested by Pekar et al. (2021). Their inference was based on analyzing SARS-CoV-2 genomes from the first 4 months of coronavirus infections in China with a strict molecular clock in which they placed coronavirus genomes from December 2019 at the root of their phylogeny (Pekar et al. 2021). Their most likely root position is the same as the Wuhan-1 position in our mutation history (figs. 1 and 3), which is not surprising because their data set did not include the more than 1,000 genomes that comprise the early diverging ν lineage (fig. 3). The genomes of the ν lineage, sampled in North America, descended from an earlier ancestor that also gave rise to the genome (α lineage) at the root of the Pekar et al. (2021) phylogeny. Therefore, our analysis has revealed an earlier MRCA than that detectable by considering a smaller subset of sequences from China.
The mutational history from proCoV2 to the Wuhan-1 genome points to the presence of measurable coronavirus diversity before the earliest recognized coronavirus outbreak in December 2019 (fig. 1). The presence of such diversity has been acknowledged and analyzed (e.g., Pekar et al. 2021), but variants present in this diverse population were not identified. Our analyses clearly show that the ancestors of the Wuhan-1 genome gave rise to a diversity of Wuhan-1’s sibling coronavirus lineages (α lineages; figs. 1 and 3). These sibling coronavirus lineages were detected in China in January 2020 and in Asia and Europe in February 2020 (table 1). Thousands of genomes in the 29KG data set belong to siblings and ancestors of Wuhan-1 (table 1 and fig. 3, yellow box). The paucity of genomes sampled in 2019 makes it impossible to establish the date and location of their origin precisely, but some must have originated before the first detection of the outbreak. Notably, the evolution of the last α variant was preceded by the evolution of earlier α lineages, one of which spawned multiple offshoots first detected in Europe in February 2020. The ν lineage, detected in the United States in February 2020, is an even earlier descendant of proCoV2 and is a sibling of the α lineage (table 1 and fig. 3). These lineages may not have been detected earlier because of the lack of sequencing in 2019, and it is likely that some originated early and spread around the world, whereas others evolved from proCoV2 or its early descendants in different parts of the world. Again, thousands of these coronavirus genomes were found throughout the world (fig. 3, yellow box). None of these genomes contained the widely studied spike protein mutant (D614G), a β mutation that occurred in the genomes carrying all three α variants and was first seen in late January 2020. Therefore, proCoV2 (the MRCA) and a large diversity of its early descendants were all able to spread in the global human population.
Estimated Timing of MRCA and the Index Case
Because proCoV2 is three bases different from the Wuhan-1
iants of proCoV2 occurred 5.8–8.1 weeks earlier, based on the
range of estimated mutation rates of coronavirus genomes
(see Materials and Methods). This timeline puts the presence
of proCoV2 in late October 2019, which is consistent with the
report of a fragment of spike protein identical to Wuhan-1 in
early December in Italy, among other evidence (Giovanetti et
al. 2020; Li, Wang, et al. 2020; van Dorp et al. 2020; Amendola
et al. 2021). The sequenced segment of the spike protein is
short (409 bases). It does not span positions in which 49
major early variants were observed, which means that the
Italian spike protein fragment can only confirm the existence
of proCoV2 before the first coronavirus detection in China.
Our timing of the MRCA (tMRCA) is 1 month older than the date for the MRCA of genomes presented by Pekar et al. (2021) because their analysis is restricted to the ancestry of the coronaviruses sampled from China only, which resulted in the exclusion of the ν lineage from their analysis. The potential for not sampling such lineages is well appreciated in Pekar et al. (2021). This exclusion and the use of sampling times in strict clock phylogenetic analyses would naturally lead to estimates leaning closer to the earliest sampling times of SARS-CoV-2 (December 2019). In any case, if we assume that proCoV2 was carried by the index case, then the date of zoonosis (tIndex) would be late October to mid-November 2019. This range overlaps with Pekar et al.’s (2021) index date falling between mid-October and mid-November 2019. However, it is likely that the actual tIndex is much earlier than tMRCA because proCoV2 likely increased in frequency over time before reaching a human host, and it is possible that one of its immediate ancestors first infected a human. Based on an approximately 1-month difference between MRCA and Index dates in Pekar et al. (2021), it is tempting to speculate that tIndex could have been as early as September 2019 for our SARS-CoV-2 phylogeny. This speculation requires more extensive analysis and experimental confirmation in the future.
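The link between the three-base difference and the 5.8–8.1 week range reported above can be illustrated with a constant-rate expectation: the time to accumulate k substitutions is roughly k divided by the product of the per-site rate and the genome length. The mutation rates the authors used are given in their Materials and Methods and are not restated in this section, so the sketch below back-calculates the rates implied by the reported range. The genome length of 29,903 bases (the length of the Wuhan-Hu-1 reference sequence) and the function names are assumptions, and the calculation ignores stochastic variation.

```python
GENOME_LENGTH = 29_903          # assumed: length of the Wuhan-Hu-1 reference genome
WEEKS_PER_YEAR = 365.25 / 7.0


def weeks_to_accumulate(k, rate_per_site_per_year, genome_length=GENOME_LENGTH):
    """Expected weeks needed to accumulate k substitutions at a constant rate."""
    years = k / (rate_per_site_per_year * genome_length)
    return years * WEEKS_PER_YEAR


def implied_rate(k, weeks, genome_length=GENOME_LENGTH):
    """Per-site, per-year rate implied by k substitutions in a given number of weeks."""
    years = weeks / WEEKS_PER_YEAR
    return k / (genome_length * years)


fast_rate = implied_rate(3, 5.8)   # rate implied by the shorter interval
slow_rate = implied_rate(3, 8.1)   # rate implied by the longer interval
print(f"implied rates: {slow_rate:.2e} to {fast_rate:.2e} substitutions/site/year")
```

Under these assumptions, the reported interval corresponds to rates of roughly 6.5e-4 to 9.0e-4 substitutions per site per year.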
Analysis of the 68KG Database Snapshot
Next, we analyzed a later snapshot of SARS-CoV-2 genome
collection acquired 3 months after the 29KG data set. This
data set expanded the collection of coronavirus genomes
from viral isolates collected after July 7, 2020 (16,739
genomes) and added 20,004 genome sequences from viral
isolates dated before July 7, 2020. In the expanded MOA
analysis, we retained 49 variants found with frequency >1%
in the 29KG data set and added variants found with a frequency
>1% in the 68KG data set (84 total variants; see
supplementary table S1, Supplementary Material online).
MOA analysis of the 68KG data set produced the proCoV2
genome identical to that inferred using the 29KG data set (see
Materials and Methods). We found one additional genome
(EPI_ISL_493171) with a proCoV2 fingerprint sampled in
Hubei, China, 4 weeks after the Wuhan-1 strain was reported.
The inferred mutation history from the 68KG data set was
well-supported with high COIs and BCLs and concordant
with the mutation history produced using the 29KG data
set (fig. 4). Therefore, inferences reported above for the
29KG data set were robust to the expanded sampling of
genomes. In the expanded mutation history, two new groups of variants were identified (ζ and η). They originated in mid-March 2020 and were found at a relatively high frequency in the 68KG data set (4.4% and 8.0%, respectively; supplementary table S1, Supplementary Material online). Variants in the ζ and η groups also showed episodic accumulation of mutations; for example, the count of genomes containing three ζ mutations (2,955 genomes) was much larger than those with a subset of these variants (148 genomes). The
episodic nature of mutational spread for 84 variants in the
68KG is statistically significant (P<10
), that is, clusters of
mutations together have become common variants (see
Materials and Methods).
Coronavirus Fingerprints and Spatiotemporal
The mutation history progression directly transforms into a
collection of genetic fingerprints. Each fingerprint represents a
genome type containing all the variants on the path from
that tip node up to the progenitor proCoV2. These finger-
prints can classify genomes and track spatiotemporal pat-
terns of dominant lineages (see Materials and Methods).
We use a shorthand to refer to each fingerprint in which
only the major variant type is used. For example, an a fingerprint
refers to genomes that contain one or more of the a variants and no
other major variants, and an ab fingerprint refers to genomes
[Figure 3 legend. Node symbols bin the share of genomes per node, with bins at 0.1%, 1%, and 5% of genomes. Branches are drawn separately for synonymous and nonsynonymous mutations and for frequency > 3% versus lower frequency. The outgroup is labeled Bat CoV.]
FIG. 3. A waterfall display of genome phylogeny recapitulating the
mutation history in figure 2. The numbers of genomes mapped to
each node are depicted by open circles (very few genomes), open
triangles (few genomes), small gray triangles (many genomes), and
large black triangles (very many genomes). The actual number of
genomes is given in the parenthesis. The tip label is the name of
the mutation on the connecting branch. Green and red branches
are synonymous and nonsynonymous mutations, respectively.
Thick branches mark mutations that occur with a frequency >3%
in the 29KG data set. The yellow background highlights the diversity
of coronavirus lineages that evolved from the genomes leading to
Wuhan-1 coronavirus.
Kumar et al. .doi:10.1093/molbev/msab118 MBE
Downloaded from by Temple University user on 02 August 2021
that contain at least one a variant, at least one b variant, and no
other major variants. This nomenclature is intuitive and pro-
vides a way to glean evolutionary information from the co-
ronavirus lineage’s name. In the 68KG data set (October 12,
2020 GISAID snapshot), global frequencies of major proCoV2
fingerprints were as follows: abe (32.1%), abcd (17.7%), ab
(16.7%), abeg (9.9%), ab (9.8%), abc (6.8%), abf (4.5%), and
Figure 5 shows the evolving spatiotemporal distribution of all major
fingerprints in Asia, Europe, and North America inferred for
an expanded data set of 172,480 genomes (December 30,
2020 snapshot). Spatiotemporal patterns in cities, countries,
and other regions are available online. We observe the spread and replacement of
prevailing strains in Europe (abe with abf) and Asia (a
with abe), the preponderance of the same strain for most
of the pandemic in North America (abcd), and the continued
presence of multiple high-frequency strains in Asia and North
America. Spatiotemporal patterns of strain spread converged
for Europe and Asia by July–August 2020 to abe genetic
fingerprints. These patterns diverged from North America,
where ab along with its mutant (abcd) were common.
After that, Europe saw f variants of ab grow (abf), replacing
abe genomes and its new g offshoot (abeg) (e.g., Hodcroft et
al. 2020). The f mutations were first detected 3 weeks after
the sampling of the first e variants. Remarkably, abcd has
remained the dominant lineage in North America since
April 2020, in contrast to the turnover seen in Europe and
Asia.
More recently, novel fast-spreading variants have been
reported (e.g., Rambaut, Loman, et al. 2020). In particular,
an S protein variant (N501Y) from South Africa and the
United Kingdom has rapidly increased (Rambaut, Loman, et
al. 2020). Coronaviruses with N501Y variant in South Africa
carry the abcd genetic fingerprint, whereas those in the
United Kingdom carry the abe genetic fingerprint. This
means that the N501Y mutation arose independently in
two coronavirus lineages that show convergent patterns of
increased spread. At present, abf dominates the United
Kingdom, and the number of genomes publicly available
from South Africa is too small to make reliable inferences
(see the online dashboard for future
updates). Overall, our mutational fingerprinting and nomenclature
provide a simple way to glean the ancestry of new
variants compared with phylogenetic designations, such as
B.1.351 and B.1.1.7 (Rambaut, Loman, et al. 2020).
Through innovative analyses of two large collections of SARS-
CoV-2 genomes, we have consistently reconstructed the
same progenitor coronavirus genome and identified its pres-
ence worldwide for many months after the pandemic began.
The progenitor genome is a better reference for rooting phy-
logenies, orienting mutations, and estimating sequence diver-
gences. The reconstructed mutational history of SARS-CoV-2
revealed major mutational fingerprints to identify and track
the novel coronavirus’s spatiotemporal evolution, revealing
[Figure 4 legend. Circle colors denote regions: Asia, South America, Africa, Europe, North America, Oceania, and Middle East.]
FIG. 4. The backbone of SARS-CoV-2 mutational history. The mutational history was inferred from (a) 29KG and (b) 68KG data sets. Major variants
and their mutational pathways are shown in black, and minor variants and their mutational pathways are shown in gray. Circle color marks the
region where variants were sampled first. The 68KG data set contains 12 additional variants and more than twice as many genomes as the 29KG
data set.
convergences and divergences of dominant strains among
geographical regions from an analysis of more than 172 thou-
sand genomes.
Furthermore, the approach taken here to reconstruct the
progenitor genome and discover key mutational events will
generally be applicable for analyzing other pathogens during
the early stages of outbreaks. The approach is scalable for
even bigger data sets because it does not require more phy-
logenetically informative variants with an increasing number
of samples. In fact, it benefits from bigger data sets as they
afford more accurate estimates of individual and co-
occurrence frequencies of variants and enable more reliable
detection of lower frequency variants. Its continued applica-
tion to SARS-CoV-2 genomes and other pathogen outbreaks
will produce their ancestral genomes and their spatiotempo-
ral dynamics, improving our understanding of the past, cur-
rent, and future evolution of pathogens and associated
Materials and Methods
Genome Data Acquisition and Processing
A flowchart describing the protocol for data assembly and
processing is shown in supplementary figure S1,
Supplementary Material online. In the first step, we downloaded
60,332 SARS-CoV-2 genomes from the GISAID database (Shu
and Mccauley 2017), along with information on sample collection
dates and locations (until July 7, 2020). Of all the
genomes downloaded, we retained only those that had
>28,000 bases, were marked as originating from human
hosts, and passed the controls detailed below. Similarly, the second
data set, the 68KG data set, was assembled from 133,741
genomes downloaded on October 12, 2020. Again, we
retained only those with >28,000 bases and marked as originating
from human hosts.
Each genome was subjected to codon aware alignment
with the NCBI reference genome (accession number
NC_045512) and then subdivided into ten regions based on
the following coding sequence (CDS) features: ORF1a (includ-
ing nsp10), ORF1b (starting with nsp12), S, ORF3a, E, M,
ORF6, ORF7a, ORF8, N, and ORF10. Gene ORF7b was re-
moved because it was too short for alignment and compar-
isons. For each region, we scanned and discarded sequences
containing too many ambiguous nucleotides to remove data
with too many sequencing errors. Thresholds were 0.5% for
the S gene, 0.1% for ORF1a and ORF1b genes, and 1% for all
other genes. We mapped individual sequences to the NCBI
reference genome (NC_045512) using a codon-aware
[Figure 5 panels: Asia, Europe, and North America. Each panel plots Genomes (%) and Sample count against month, March to November 2020.]
FIG. 5. Spatiotemporal dynamics of 172,480 SARS-CoV-2 genomes (December 2019–2020). Spatiotemporal patterns of genomes mapped to
lineages containing different combinations of major variants in (a) Asia, (b) Europe, and (c) North America. The number of genomes mapped to
major variant lineages contains all of its offshoots, for example, the a lineage contains all the genomes with a variants only.
The stacked graph area is the proportion of genomes mapped to the corresponding lineage. The solid black line shows the count of total genome
samples. Spatiotemporal patterns in cities, countries, and other regions are available online (last accessed on
March 28, 2021).
extension to the Smith–Waterman algorithm implemented
in HyPhy (Pond et al. 2005; Gianella et al. 2011) (https://,
translated the mapped sequences to amino acids, and performed
multiple protein sequence alignment with the auto settings
function of MAFFT (version 7.453; Katoh and Standley 2013).
Codon sequences were next mapped onto the amino acid
genomes was aligned with the sequence of three closest out-
groups, including the coronavirus genomes of the
Rhinolophus affinis bat (RaTG13), R. malayanus bat
(RmYN02), and Manis javanica pangolin (MT121216.1) (Liu
et al. 2020; Zhou et al. 2020). The alignment was visually
inspected and adjusted in Geneious Prime 2020.2.2 (https:// The final alignment contained all genomic
regions except ORF7b and noncoding regions (5′ and 3′
UTRs, and intergenic spacers). After these filtering and alignment
steps, the multiple sequence alignment contained
29,115 sites and 29,681 SARS-CoV-2 genomes for the July 7,
2020 snapshot, which we refer to as the 29KG data set. For the
October 12 snapshot, there were 68,057 sequences, which we
refer to as the 68KG data set. We also conducted a spatio-
temporal analysis on an expanded data set containing
172,480 genomes (172KG) acquired on December 30, 2020.
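The filtering rules described in this section (more than 28,000 bases, human host, and per-region ambiguity thresholds of 0.5% for S, 0.1% for ORF1a and ORF1b, and 1% for all other genes) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, gaps are assumed not to count as ambiguous, and a genome is assumed to be dropped when any one region fails.

```python
# Hypothetical helper names; the thresholds are the ones stated in the text.
AMBIGUITY_THRESHOLDS = {"S": 0.005, "ORF1a": 0.001, "ORF1b": 0.001}
DEFAULT_THRESHOLD = 0.01          # all other genes
UNAMBIGUOUS = set("ACGTU-")       # assumption: gaps are not counted as ambiguous


def ambiguous_fraction(seq):
    """Fraction of characters that are not A, C, G, T/U, or a gap."""
    if not seq:
        return 0.0
    bad = sum(1 for ch in seq.upper() if ch not in UNAMBIGUOUS)
    return bad / len(seq)


def region_passes(region, seq):
    """True if the region's ambiguity does not exceed its threshold."""
    limit = AMBIGUITY_THRESHOLDS.get(region, DEFAULT_THRESHOLD)
    return ambiguous_fraction(seq) <= limit


def genome_passes(length, host, regions):
    """Apply the length, host, and per-region ambiguity filters.

    regions maps a region name (e.g., "S") to its nucleotide string.
    Assumption: one failing region discards the whole genome.
    """
    if length <= 28000 or host.lower() != "human":
        return False
    return all(region_passes(name, seq) for name, seq in regions.items())
```

For instance, genome_passes(29903, "Human", {"S": s_seq, "N": n_seq}) would return True only if both regions stay within their thresholds.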
Reference Genomes and Collection Dates
We used the dates of viral collections provided by the GISAID
database (Shu and Mccauley 2017) in all our analyses if they
were resolved to the day (i.e., we discarded data that only
contained partial dates, for example, April 2020). All genomes
were used in the mutation ordering analyses, but genomes
with incomplete sampling dates were excluded from the spa-
tiotemporal analyses and derived interpretations. We noted
that the earliest sample included in GISAID (ID:
EPI_ISL_402123) was collected on December 24, 2019, al-
though the NCBI website lists its collection date as
December 23, 2019 (GenBank ID: MT019529). Therefore,
we used the GISAID collection date for consistency. Regarding the NCBI reference genome (GenBank ID:
NC_045512; GISAID ID: EPI_ISL_402125) (Wu et al. 2020), this
sample was collected on December 26, 2019 (Chiara et al.
2021). We also used the GISAID reference genome in our
analysis (ID: EPI_ISL_402124) collected on December 30,
2019 (Okada et al. 2020).
Mutation Order Analyses
First, we analyzed the 29KG data set. We used a maximum
likelihood method, SCITE (Jahn et al. 2016), and variant co-
occurrence information for reconstructing the order of muta-
tions corresponding to 49 common variants (frequency >
1%) observed in this data set (see flowchart in supplementary
fig. S2,Supplementary Material online). MOA has demon-
strated high accuracy for analyzing tumor cell genomes
that reproduce clonally, have frequent sequencing errors,
and exhibit limited sequence divergence (Jahn et al. 2016;
Miura et al. 2018). In MOA, higher frequency variants are
expected to have arisen earlier than low-frequency variants
in clonally reproducing populations (Kim and Simon 2014;
Jahn et al. 2016). We used the highest frequency variants to
anchor the analysis and the shared co-occurrence of variants
among genomes to order mutations while allowing probabil-
istically for sequencing errors and pooled sequencing of
genomes (Jahn et al. 2016). MOA is different from traditional
phylogenetic approaches where positions are treated inde-
pendently, that is, the shared co-occurrence of variants is not
directly utilized in the inference procedure. Notably, both
traditional phylogenetic and mutation order analyses are
expected to produce concordant patterns when sequencing
errors and other artifacts are minimized. However, sequenc-
ing errors and limited mutational input during the coronavi-
rus history adversely impact traditional methods, as does the
fact that the closest coronaviruses useable as outgroups have
more than a thousand base differences from SARS-CoV-2
genomes that only differ in a handful of bases from each
other (Mavian et al. 2020;Morel et al. 2021;Pipes et al. 2021).
MOA requires a binary matrix of presence/absence (1/0) of
mutants, which is straightforward in analyzing cell sequences
from tumors because they arise from normal cells that supply
definitive ancestral states. To designate mutation orientations
for applying MOA in SARS-CoV-2 analysis, we devised a sim-
ple approach in which we began by comparing nucleotides at
the 49 genomic positions among three closely related
genomes (bat RaTG13, bat RmYN02, and pangolin
MT121216.1) (Boni et al. 2020). We chose the consensus
base to be the initial reference base, such that SARS-CoV-2
genome bases were coded to be “0” whenever they were the
same as the consensus base at their respective positions. All
other bases were assigned a “1.” There were 39 positions in
which all 3 outgroup genomes were identical to each other
the remaining position (28657), all three outgroups differed,
sequence similar to the human SARS-CoV-2 NCBI reference
genome (NC_045512) because SARS-CoV-2’s ancestor likely
experienced genomic recombination before its zoonotic
transfer into humans (Huang et al. 2020;Li, Giorg, et al.
2020;MacLean et al. 2021). At one position, both major
and minor bases in humans were different from the consen-
sus base in the outgroups, so we assigned the mutant status
to the minority base (U; vf = 29.8%). All missing and ambiguous
bases were coded to be ignored (missing data) in all the
analyses. These initially assigned mutation orientations were
tested in a subsequent investigation of variants’ COI. COI for a
given variant (y) is the number of genomes that contain y and
its directly preceding mutation (x) in the mutation history,
divided by the number of genomes that contain y. When COI
was lower than 70%, we reversed each position’s mutation
orientation individually and selected the mutation orientation
that produced mutation histories with the highest COI
(see below).
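The initial 0/1 coding described above can be illustrated with a short sketch. It assumes that each alignment column is available as a list of single-letter bases and that the reference base is the one shared by at least two of the three outgroups; the function names are hypothetical, and positions where all outgroups differ need a separate rule, as the text explains.

```python
from collections import Counter

MISSING = None                 # ignored in downstream analyses
VALID = set("ACGTU")           # anything else is treated as ambiguous


def consensus_base(outgroup_bases):
    """Return the base shared by at least two outgroups, else None."""
    base, count = Counter(outgroup_bases).most_common(1)[0]
    return base if count >= 2 else None


def encode_column(genome_bases, reference_base):
    """Code each base as 0 (same as reference), 1 (different),
    or MISSING (ambiguous or absent)."""
    coded = []
    for b in genome_bases:
        if b not in VALID:
            coded.append(MISSING)
        else:
            coded.append(0 if b == reference_base else 1)
    return coded


# Toy column: two of three outgroups carry C, so C is the starting reference.
ref = consensus_base(["C", "C", "U"])
column = encode_column(["C", "U", "N", "C"], ref)
```

Stacking one such column per variant position gives the presence/absence matrix that the mutation order analysis takes as input.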
In the SCITE analysis of 49 variants and 29,861 genomes,
we started with default parameter settings of false-negative
rate (FNR = 0.21545) and false-positive rate (FPR =
0.0000604) of mutation detection. We carried out five independent
runs to ensure stability and convergence to obtain
29KG collection-specific estimates of FNR and FPR by
comparing the observed and predicted sequences based on
this mutation graph. The estimated FNR (0.00488) and FPR
(0.00800) were very different from the SCITE default param-
eters, where the estimated FNR was much lower. This differ-
ence in error rates is unsurprising because we used only
common variants (vf >1%), and the 29KG data set was
not obtained from single-cell sequencing in which dropout
during single-cell tumor sequencing elevates FNR, that is, mu-
tant alleles are not sequenced.
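One way to picture the comparison of observed and predicted sequences is the sketch below. It assumes that FNR is the share of predicted mutant calls observed as ancestral and that FPR is the share of predicted ancestral calls observed as mutant; these definitions and the function name are illustrative assumptions and may differ from the exact bookkeeping used with SCITE.

```python
def estimate_error_rates(observed, predicted):
    """Estimate (FNR, FPR) from observed and predicted 0/1 matrices.

    observed, predicted: lists of rows, one row per genome.
    Observed entries may be None (missing) and are then skipped.
    """
    fn = fp = pos = neg = 0
    for obs_row, pred_row in zip(observed, predicted):
        for o, p in zip(obs_row, pred_row):
            if o is None:
                continue
            if p == 1:
                pos += 1
                fn += (o == 0)      # predicted mutant, observed ancestral
            else:
                neg += 1
                fp += (o == 1)      # predicted ancestral, observed mutant
    fnr = fn / pos if pos else 0.0
    fpr = fp / neg if neg else 0.0
    return fnr, fpr


observed = [[1, 0, 1], [0, 0, 1], [1, None, 0]]
predicted = [[1, 0, 1], [1, 0, 1], [1, 1, 0]]
rates = estimate_error_rates(observed, predicted)
```

The estimated rates can then be supplied to a further round of inference, mirroring the iterative use of FNR and FPR described in the text.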
As noted above, the initial mutation orientations were
simply the starting designations for our analysis, which are
subsequently investigated by evaluating the COI of each var-
iant in the reconstructed mutation history. In this process, we
reverse ancestor/mutant coding for variants that received low
COI to examine if a mutation history with higher COI can be
generated. Two positions (3037 and 28854) received low COI
(<70%). At position 3037, the reversed encoding (C → U)
received significantly higher COI (100%) than the starting
encoding (U → C; 60%), so the position was recoded. At
position 28854, the ordering and direction of mutation
remained ambiguous despite extensive analyses, but it did
not impact the predicted MRCA sequence. Therefore, we
only recoded the column for position 3037 and generated a
new 49 × 29,861 (SNVs × genomes) matrix.
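The COI check and the recoding of a column can be sketched in a few lines. The sketch assumes a genome-by-variant matrix of 0/1/None values and skips genomes with a missing call at either position; both function names are hypothetical. In the actual procedure the mutation history is re-inferred after recoding, which this toy example does not attempt.

```python
def coi(matrix, x, y):
    """COI of variant y given its direct predecessor x.

    Returns (# genomes with both x and y) / (# genomes with y).
    Genomes with a missing call at x or y are skipped (an assumption).
    """
    with_y = with_both = 0
    for genome in matrix:
        gx, gy = genome[x], genome[y]
        if gx is None or gy is None:
            continue
        if gy == 1:
            with_y += 1
            with_both += (gx == 1)
    return with_both / with_y if with_y else float("nan")


def flip_column(matrix, col):
    """Return a copy of matrix with ancestor/mutant coding reversed at col."""
    flipped = []
    for genome in matrix:
        row = list(genome)
        if row[col] is not None:
            row[col] = 1 - row[col]
        flipped.append(row)
    return flipped


toy = [[1, 1], [1, 1], [1, 0], [0, 1], [0, 0]]
low_support = coi(toy, 0, 1) < 0.70   # flags the variant for re-examination
```

Here the toy variant in column 1 has a COI of 2/3, which falls below the 70% cutoff and would trigger a test of the reversed orientation.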
At one position (28657), all three outgroup sequences had
different bases, so we initially selected the base found in the
gene with the highest sequence similarity to the human SARS-
CoV-2 NCBI reference genome. We next tested if reversed
encoding produced a better mutation graph. The reversed
encoding produced a mutation graph with a much higher
log-likelihood, that is, −32,355.58 and −30,289.92, for the
initial and reversed encoding, respectively; P < 0.01 using
the Akaike Information Criterion (AIC) protocol in Pupko et
al. (2002). Therefore, we recoded position 28657 and generated
a new 49 × 29,861 (SNVs × genomes) matrix.
It was subjected to SCITE analysis and produced a muta-
tion graph for 49 variants in the 29KG data set. This graph
predicts an FNR of 0.00418 and FPR of 0.00295 per base. Using
these new FNR and FPR, we again performed SCITE analysis
and produced the final mutation history graph. Starting from
the top of a mutation graph, a distinct Greek symbol was
assigned to a group of mutations that occurred sequentially,
and variants with similar frequency were assigned the
same Greek symbol (μ, ν, α, β, γ, δ, and ε). The high-frequency
variants with the same Greek symbol were distinguished by
numbers to represent the sequential relationship. When an
offshoot of a high-frequency mutation had low variant frequency,
we assigned it the same Greek symbol and number to represent
the parent-offspring relationship and further distinguished
descendants by adding a small letter.
Timing of the Progenitor Genome
and a lineages in the mutation graph. MRCA is the progenitor
of all human SARS-CoV-2 infections (proCoV2), which
descended from the parental lineage after its divergence
from its closest relatives, including bats and pangolins. We
estimate that proCoV2 existed 5.8–8.1 weeks before the
December 24, 2019 sampling date of Wuhan-1, by using a
SARS-CoV-2 HPD (Highest Posterior Density) mutation rate
range of 6.64 × 10⁻⁴ to 9.27 × 10⁻⁴ substitutions per site per
year (Pekar et al. 2021).
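The reported interval can be reproduced with simple arithmetic under three assumptions that are ours rather than stated at this point in the text: the rate bounds are 6.64 × 10⁻⁴ and 9.27 × 10⁻⁴ substitutions per site per year, proCoV2 and Wuhan-1 are separated by three mutations, and the 29,115-site alignment length is the relevant genome size. The function name is hypothetical.

```python
SITES = 29_115                 # alignment length reported for the 29KG data set
DIFFERENCES = 3                # assumed: mutations separating proCoV2 from Wuhan-1
WEEKS_PER_YEAR = 365.25 / 7


def weeks_before_sampling(rate_per_site_per_year, n_diff=DIFFERENCES, sites=SITES):
    """Expected time, in weeks, to accumulate n_diff substitutions."""
    years = n_diff / (rate_per_site_per_year * sites)
    return years * WEEKS_PER_YEAR


slow, fast = 6.64e-4, 9.27e-4  # assumed HPD bounds
shortest = round(weeks_before_sampling(fast), 1)
longest = round(weeks_before_sampling(slow), 1)
print(shortest, longest)       # prints 5.8 8.1
```

Under these assumptions the faster rate gives the shorter waiting time and the slower rate the longer one, matching the 5.8–8.1 week range.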
We have made available the proCoV2 genome sequence in
FastA format at,whichis
corresponding to a
mutations at positions 18060, 8782,
and 28144, as discussed in the main text. In this mutation
graph, COI for each variant is shown next to the arrow.
Bootstrap Analysis
We assessed the robustness of the mutation history inference
to genome sampling by bootstrap analysis. We generated 100
bootstrap replicate data sets, each built by randomly selecting
29,861 genomes with replacement. Then, SCITE was used to
infer the mutation graph for each replicate data set. Bootstrap
confidence level, scored for each variant pair, was the number
of replicates in which the given pair of variants were directly
connected in the mutation history in the same way as shown
in figure 2. BCLs were often lower for major variants within
groups (e.g., within the e group) because they occur with very similar
frequencies. This feature adversely affected the BCL values of
mutation orders between groups, for example, b and e. In this
case, we computed BCL to be the proportion of replicates in which pairs of
groups were directly connected in the mutation history in the
same way as shown in figure 2. Groups used were b
. All of these BCL values are shown with an
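The bootstrap procedure of this section can be sketched as follows. Because SCITE is an external program, the sketch substitutes a deliberately simple stand-in, toy_infer_edges, which is not the SCITE algorithm and exists only to make the example runnable; all names are hypothetical, and support is returned as a fraction of replicates rather than a count.

```python
import random
from collections import Counter


def toy_infer_edges(genomes, n_variants):
    """Stand-in for SCITE (NOT the real algorithm): link each variant to the
    more frequent variant it co-occurs with most often, or to 'root'."""
    freq = [sum(g[v] for g in genomes) for v in range(n_variants)]
    edges = set()
    for y in range(n_variants):
        best, best_score = "root", 0.0
        for x in range(n_variants):
            if x == y or freq[x] <= freq[y] or freq[y] == 0:
                continue
            both = sum(1 for g in genomes if g[x] and g[y])
            score = both / freq[y]
            if score > best_score:
                best, best_score = x, score
        edges.add((best, y))
    return edges


def bootstrap_support(genomes, n_variants, reference_edges, replicates=100, seed=0):
    """Fraction of replicates in which each reference edge is recovered."""
    rng = random.Random(seed)
    hits = Counter()
    for _ in range(replicates):
        sample = rng.choices(genomes, k=len(genomes))   # with replacement
        inferred = toy_infer_edges(sample, n_variants)
        hits.update(e for e in reference_edges if e in inferred)
    return {e: hits[e] / replicates for e in reference_edges}


# Toy data: variant 1 only ever occurs on the background of variant 0.
genomes = [[1, 1]] * 30 + [[1, 0]] * 50 + [[0, 0]] * 20
reference = toy_infer_edges(genomes, 2)
support = bootstrap_support(genomes, 2, reference)
```

Replacing toy_infer_edges with a call to the real inference tool would give confidence levels comparable in spirit to the BCLs reported here.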
Temporal Concordance
Because MOA analyses did not use spatial or temporal infor-
mation for genomes or mutations, the inferred mutation
history can be validated by evaluating the concordance of
the inferred order of mutations with the timing of their first
appearance (tf). Using the genomes for which virus sampling
day, month, and year were available, we determined tf for
every variant in the 29KG data set. For mutation i, we compared
its tf(i) with tf(j) such that j is the nearest preceding
mutation in the mutation graph. We found that tf(j) ≤ tf(i)
for all but two pairs, both involving b variants. These two
offshoot mutants of b were sampled 35 days and
12 days earlier than their predecessors, which could be
due to their low frequency or sequencing error. COI of one
variant was low (54%), but the other variant had
very high COI (97%).
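The temporal concordance check amounts to comparing first-appearance dates along each edge of the mutation graph. The following sketch assumes a child-to-parent map and a list of fully resolved sampling dates per variant; the variant labels x, y, and z and both function names are hypothetical.

```python
from datetime import date


def first_appearance(sampling_dates_by_variant):
    """Map each variant to the earliest fully dated sample carrying it."""
    return {v: min(dates) for v, dates in sampling_dates_by_variant.items()}


def discordant_pairs(parent_of, tf):
    """Return (parent, child, days) for every child first sampled
    before its nearest preceding mutation in the graph."""
    out = []
    for child, parent in parent_of.items():
        if parent is None:
            continue
        lead = (tf[parent] - tf[child]).days
        if lead > 0:
            out.append((parent, child, lead))
    return out


dates = {
    "x": [date(2020, 1, 10), date(2019, 12, 24)],
    "y": [date(2020, 2, 1)],
    "z": [date(2019, 12, 12)],
}
parent_of = {"x": None, "y": "x", "z": "x"}
tf = first_appearance(dates)
exceptions = discordant_pairs(parent_of, tf)
```

In this toy graph only z is flagged, because it was first sampled 12 days before its predecessor x.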
Mutational Fingerprints
Each node in the mutational history graph predicts an inter-
mediate (ancestral) or a tip sequence containing all the muta-
tions from that node to the mutation graph’s root. The
mutational fingerprint is then produced directly from the
mutation history graph drawn as a directional graph an-
chored on the root node. We compared our mutational
fingerprints of the genomes in the 29KG data set with a
phylogeny-based classification (Rambaut, Loman, et al.
2020) obtained using the Pangolin service (v2.0.3; https://pan- We assigned each of the 29 K genomes to a
fingerprint based on the highest sequence similarity at posi-
tions containing 49 common variants, allowing mismatches
due to population-level variations and sequencing errors. A
small fraction of genomes (1.8%) could not be assigned un-
ambiguously to one fingerprint, so they were excluded and
will be investigated in the future. The number of genomes
assigned to each fingerprint is shown in table 1. We submitted
genome sequences to the Pangolin website for classification
one-by-one, and a clade designation was received. The results
are summarized in supplementary figure S3,Supplementary
Material online. In this table, all phylogenetic groups with
fewer than 20 genomes were excluded.
Of the 80 phylogenetic groups shown, 74 are defined pri-
marily by a single mutation-based fingerprint, as more than
90% of the genomes in those phylogenetic groups share the
same fingerprint. This includes all small- and medium-sized
phylogenetic groups (up to 488 genomes) and two large
groups (A.1 with 1,377 genomes and B.1.2 with 749 genomes).
One large group, B.1.1, predominately connects with e
(79%, 4,832 genomes), but some of its members belong to e
offshoots because they contain respective diagnostic muta-
tions. For group B.1.1.1, two other e
offshoots are mixed up
almost equally. Three other large differences between muta-
tional fingerprint-based classification and phylogeny-based
grouping are seen for A, B, B1.1, and B.2 groups. These differ-
ences are likely because the location of the root and the
earliest branching order of coronavirus lineages are problem-
atic in phylogeny-based classifications (Mavian et al. 2020;
Morel et al. 2021;Pipes et al. 2021;Wenzel 2020). Overall,
our mutational fingerprints are immediately informative
about the mutational ancestry of genomes.
Analysis of 68KG Data Set
We repeated the above MOA procedure on the 68KG data
set (68,057 genomes). This 68KG data set contained 72 common
variants (>1% frequency). For direct comparison purposes,
we added 12 variants that were common in the 29KG
data set but whose frequency had fallen below 1% in the
68KG data set. Therefore, we used 84 variants in total and constructed
an 84 × 68,057 (SNVs × genomes) matrix for the
SCITE analysis to determine the mutational order. We also
conducted the bootstrap analysis and assigned mutational
fingerprints using the procedure mentioned above. The num-
ber of genomes mapped to each fingerprint is listed in sup-
plementary table S1,Supplementary Material online.
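The selection of variants for the expanded analysis (those above 1% in the later snapshot, plus those that were common only in the earlier one) can be sketched as a frequency filter followed by a set union. The matrix layout, variant names, and function names below are illustrative assumptions.

```python
def variant_frequencies(matrix):
    """Frequency of the mutant state (1) per column, ignoring None."""
    n_cols = len(matrix[0])
    freqs = []
    for c in range(n_cols):
        calls = [row[c] for row in matrix if row[c] is not None]
        freqs.append(sum(calls) / len(calls) if calls else 0.0)
    return freqs


def common_variants(matrix, names, threshold=0.01):
    """Names of variants whose frequency exceeds the threshold."""
    return {n for n, f in zip(names, variant_frequencies(matrix)) if f > threshold}


def variants_to_analyze(earlier_common, later_common):
    """Union of variants common in either snapshot, for direct comparison."""
    return sorted(set(earlier_common) | set(later_common))


# Toy matrix of 200 genomes: p1 at 25%, p2 at 0.5%, p3 at 1.5%.
names = ["p1", "p2", "p3"]
rows = [[1 if i < 50 else 0, 1 if i < 1 else 0, 1 if i < 3 else 0] for i in range(200)]
later = common_variants(rows, names)
combined = variants_to_analyze({"p2"}, later)
```

In the paper's numbers, the same union yields 72 + 12 = 84 variants.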
Sequence Classification for 172KG Data Set
We developed a sequence classification protocol that first
calls variants in a viral genome using proCoV2 as the reference
sequence using a browser-based sequence alignment (bioseq
npm package) based on the codebase of minimap2 (Li 2018).
Then, it assigns the sequence to a path in the mutation graph
with the highest concordance (Jaccard index). It is imple-
mented in a simple browser-based tool, which shows the
example output for ENA accession number MT675945 (sup-
plementary fig. S4,Supplementary Material online; http://, last accessed on March 18, 2021).
The classification is conducted on the client-side such that
the researcher’s data never leave their personal computer.
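The assignment step can be sketched independently of the alignment step. The code below assumes variants have already been called against proCoV2 and that the mutation graph is stored as a child-to-parent map; the node labels and function names are hypothetical, and ties are broken alphabetically for determinism, which is our choice rather than a documented rule.

```python
def fingerprint(node, parent_of):
    """All mutations on the path from node up to the root."""
    path = set()
    while node is not None:
        path.add(node)
        node = parent_of[node]
    return frozenset(path)


def jaccard(a, b):
    """Jaccard index of two sets; defined as 1.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def classify(called_variants, parent_of):
    """Return the node whose fingerprint best matches the called variants."""
    called = set(called_variants)
    return max(sorted(parent_of),
               key=lambda n: jaccard(called, fingerprint(n, parent_of)))


# Toy graph: a1 is nearest the root; c1 and e1 both descend from b1.
parent_of = {"a1": None, "b1": "a1", "c1": "b1", "e1": "b1"}
exact = classify({"a1", "b1", "e1"}, parent_of)
noisy = classify({"a1", "b1", "x9"}, parent_of)
```

The second call shows the tolerance to an unexpected variant: the genome is placed on the b1 path, whose fingerprint has the highest Jaccard index (2/3).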
Testing Episodic Spread of Variants
We performed nonparametric Wald–Wolfowitz run tests
(Wald and Wolfowitz 1940; Mateus and Caeiro 2015) of the
null hypothesis that the first sampling of variants is randomly
distributed over time (i.e., evenly spaced). The null hypothesis
was rejected, suggesting significant temporal clustering in both the 29KG and
68KG data sets. Because many mutations were first sampled
on December 24, 2019, we only included one mutation for
that day to avoid biasing the test.
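For readers unfamiliar with the run test, a minimal sketch using the normal approximation is given below. It assumes the test is applied to the gaps, in days, between successive first-sampling dates, dichotomized about their median; this choice of input and the function name are our assumptions, and the R implementation cited above may differ in detail.

```python
import math
from statistics import median


def runs_test(values):
    """Wald-Wolfowitz runs test on values dichotomized about their median.

    Returns (number of runs, z statistic, two-sided p value).
    """
    m = median(values)
    signs = [v > m for v in values if v != m]   # drop ties with the median
    n1 = sum(signs)
    n2 = len(signs) - n1
    n = n1 + n2
    if n1 == 0 or n2 == 0 or n < 3:
        raise ValueError("need enough observations on both sides of the median")
    runs = 1 + sum(1 for a, b in zip(signs, signs[1:]) if a != b)
    mean = 2 * n1 * n2 / n + 1
    var = 2 * n1 * n2 * (2 * n1 * n2 - n) / (n * n * (n - 1))
    z = (runs - mean) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))
    return runs, z, p


# Toy gaps: a burst of closely spaced first samplings, then a sparse period.
gaps = [1, 1, 1, 1, 1, 9, 9, 9, 9, 9]
runs, z, p = runs_test(gaps)
```

A strongly negative z, as in this toy case with only two runs, indicates fewer runs than expected under randomness, that is, temporal clustering.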
We also tested the null hypothesis of the same molecular
evolutionary patterns within the SARS-CoV-2 population and
between species (i.e., Human SARS-CoV-2 and Bat RaTG13)
by using a McDonald-Kreitman test (McDonald and
Kreitman 1991). The numbers of nonsynonymous and syn-
onymous polymorphisms with a frequency >1% were 32 and
17, compared with the numbers of nonsynonymous and syn-
onymous fixed differences (170 and 958, respectively) be-
tween proCoV2 and bat RaTG13 sequences. The
McDonald–Kreitman test rejected the null overwhelmingly
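The counts given above form a 2 × 2 table that can be examined with a few lines of code. The sketch below computes the neutrality index and a one-degree-of-freedom chi-square statistic; the choice of a chi-square test rather than, for example, Fisher's exact test is our assumption for illustration, and the function name is hypothetical.

```python
import math


def mk_test(pn, ps, dn, ds):
    """Neutrality index and a 1-df chi-square test on the 2x2 MK table.

    pn, ps: nonsynonymous and synonymous polymorphisms.
    dn, ds: nonsynonymous and synonymous fixed differences.
    """
    ni = (pn / ps) / (dn / ds)
    n = pn + ps + dn + ds
    chi2 = n * (pn * ds - ps * dn) ** 2 / (
        (pn + ps) * (dn + ds) * (pn + dn) * (ps + ds)
    )
    p = math.erfc(math.sqrt(chi2 / 2))   # chi-square survival function, 1 df
    return ni, chi2, p


ni, chi2, p = mk_test(pn=32, ps=17, dn=170, ds=958)
```

With these counts the neutrality index is about 10.6 and the chi-square statistic is about 83, consistent with an overwhelming rejection of the null hypothesis.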
Supplementary Material
Supplementary data are available at Molecular Biology and
Evolution online.
We thank all the authors and organizations who have kindly
deposited and shared genome data on GISAID (see http:// for a list of all the authors). We
thank Ananias Escalante, Rob Kulathinal, Li Liu, Jose Barba-
Montoya, Antonia Chroni, Ravi Patel, and Caryn Babaian for
their critical comments. We appreciate the technical support
provided by Jared Knoblauch and Glen Stecher. This research
was supported by grants from the U.S. National Science
Foundation to S.K. (GCR-1934848, DEB-2034228) and S.P.
(DBI-2027196) and from the U.S. National Institutes of
Health to S.K. (GM-139504-01) and S.P. (AI-134384).
Data Availability
Live evolutionary history and spatiotemporal distributions of
common variants can be accessed via All genome sequences and metadata are avail-
able publicly at GISAID (, and the
predicted proCoV2 sequence is available at http://igem.tem- The other relevant information is pro-
vided in the supplementary materials,Supplementary
Material online.
Amendola A, Bianchi S, Gori M, Colzani D, Canuti M, Borghi E, Raviglione
MC, Zuccotti GV, Tanzi E. 2021. Evidence of SARS-CoV-2 RNA in an
Oropharyngeal Swab Specimen, Milan, Italy, early December 2019.
Emerg Infect Dis. 27(2):648–650.
Andersen KG, Rambaut A, Lipkin WI, Holmes EC, Garry RF. 2020. The
proximal origin of SARS-CoV-2. Nat Med. 26(4):450–452.
Boni MF, Lemey P, Jiang X, Lam TTY, Perry BW, Castoe TA, Rambaut A,
Robertson DL. 2020. Evolutionary origins of the SARS-CoV-2 sarbe-
covirus lineage responsible for the COVID-19 pandemic. Nat
Microbiol. 5(11):1408–1417.
Casals F, Bertranpetit J. 2012. Human genetic variation, shared and pri-
vate. Science 337(6090):39–40.
Castells M, Lopez-Tort F, Colina R, Cristina J. 2020. Evidence of increasing
diversification of emerging SARS-CoV-2 strains. J Med Virol.
Chiara M, Horner DS, Gissi C, Pesole G. 2021. Comparative genomics
reveals early emergence and biased spatio-temporal distribution of
SARS-CoV-2. Mol Biol Evol. 38(6):2547–2565.
da Silva Filipe A, Shepherd JG, Williams T, Hughes J, Aranday-Cortes E,
2021. Genomic epidemiology reveals multiple introductions of
SARS-CoV-2 from mainland Europe into Scotland. Nat Microbiol.
Dearlove BL, Lewitus E, Bai H, Li Y, Reeves DB, Joyce MG, Scott P, Amare
M, Vasan S, Michael NL, et al. 2020. A SARS-CoV-2 vaccine candidate
would likely match all currently circulating strains. Proc Natl Acad Sci
USA. 117(38):23652–23662.
Fauver JR, Petrone ME, Hodcroft EB, Shioda K, Ehrlich HY, Watts AG,
Vogels CBF, Brito AF, Alpert T, Muyombwe A, et al. 2020. Coast-to-
coast spread of SARS-CoV-2 during the early epidemic in the United
States. Cell 181(5):990–996.e5.
Forster P, Forster L, Renfrew C, Forster M. 2020. Phylogenetic network
analysis of SARS-CoV-2 genomes. Proc Natl Acad Sci U S A.
Gianella S, Delport W, Pacold ME, Young JA, Choi JY, Little SJ, Richman
DD, Kosakovsky Pond SL, Smith DM. 2011. Detection of minority
resistance during early HIV-1 infection: natural variation and spuri-
ous detection rather than transmission and evolution of multiple
viral variants. J Virol. 85(16):8359–8367.
Giorgio SD, Martignano F, Torcia MG, Mattiuz G, Conticello SG. 2020.
Evidence for host-dependent RNA editing in the transcriptome of
SARS-CoV-2. Sci Adv.6:19.
Giovanetti M, Benvenuto D, Angeletti S, Ciccozzi M. 2020. The first two
cases of 2019-nCoV in Italy: where they come from? J Med Virol.
Gómez-Carballa A, Bello X, Pardo-Seco J, Martinón-Torres F, Salas A.
2020. Mapping genome variation of SARS-CoV-2 worldwide highlights
the impact of COVID-19 super-spreaders. Genome Res.
Hodcroft EB, Zuber M, Nadeau S, Comas I, González Candelas F, Stadler
T, Neher RA. 2020. Emergence and spread of a SARS-CoV-2 variant
through Europe in the summer of 2020. medRxiv. doi:10.1101/
Huang J-M, Jan SS, Wei X, Wan Y, Ouyang S. 2020. Evidence of the
recombinant origin and ongoing mutations in severe acute respira-
tory syndrome coronavirus 2 (SARS-CoV-2). bioRxiv. doi:10.1101/
Jackson B, Rambaut A, Pybus OG, Robertson DL, Connor T, Loman NJ.
2020. Recombinant SARS-CoV-2 genomes involving lineage B.1.1.7
in the UK. Available from:
cov-2-genomes-involving-lineage-b-1-1-7-in-the-uk/658 (last access
March 24, 2021).
Jahn K, Kuipers J, Beerenwinkel N. 2016. Tree inference for single-cell
data. Genome Biol. 17:86.
Katoh K, Standley DM. 2013. MAFFT multiple sequence alignment soft-
ware version 7: improvements in performance and usability. Mol Biol
Evol. 30(4):772–780.
Kim KI, Simon R. 2014. Using single cell sequencing data to model the
evolutionary history of a tumor. BMC Bioinformatics 15:27.
Komissarov AB, Safina KR, Garushyants SK, Fadeev AV, Sergeeva MV,
Ivanova AA, Danilenko DM, Lioznov D, Shneider OV, Shvyrev N, et
al. 2021. Genomic epidemiology of the early stages of the SARS-CoV-
2 outbreak in Russia. Nat Commun. 12(1):1–13.
Lai A, Bergna A, Acciarri C, Galli M, Zehender G. 2020. Early phylogenetic
estimate of the effective reproduction number of SARS-CoV-2. J Med
Virol. 92(6):675–679.
Lemey P, Hong SL, Hill V, Baele G, Poletto C, Colizza V, O’Toole Á,
McCrone JT, Andersen KG, Worobey M, et al. 2020.
Accommodating individual travel history and unsampled diversity
in Bayesian phylogeographic inference of SARS-CoV-2. Nat
Commun. 11(1):1–14.
Lemieux JE, Siddle KJ, Shaw BM, Loreth C, Schaffner SF, Gladden-Young
A, Adams G, Fink T, Tomkins-Tinch CH, Krasilnikova LA, et al. 2021.
Phylogenetic analysis of SARS-CoV-2 in Boston highlights the impact
of superspreading events. Science 371(6529):eabe3261.
Li H. 2018. Minimap2: pairwise alignment for nucleotide sequences.
Bioinformatics 34(18):3094–3100.
Li X, Giorg EE, Marichannegowda MH, Foley B, Xiao C, Kong XP, Chen Y,
Gnanakaran S, Korber B, Gao F. 2020. Emergence of SARS-CoV-2
through recombination and strong purifying selection. Sci Adv.6:112.
Li X, Wang W, Zhao X, Zai J, Zhao Q, Li Y, Chaillon A. 2020. Transmission
dynamics and evolutionary history of 2019-nCoV. J Med Virol.
et al. 2020. Are pangolins the intermediate host of the 2019 novel
coronavirus (SARS-CoV-2)? PLoS Pathog. 16(5):e1008421.
Lu R, Zhao X, Li J, Niu P, Yang B, Wu H, Wang W, Song H, Huang B, Zhu
N, et al. 2020. Genomic characterisation and epidemiology of 2019
novel coronavirus: implications for virus origins and receptor binding.
Lancet 395(10224):565–574.
MacLean OA, Lytras S, Weaver S, Singer JB, Boni MF, Lemey P,
Kosakovsky Pond SL, Robertson DL. 2021. Natural selection in the
evolution of SARS-CoV-2 in bats, not humans, created a highly capable
human pathogen. PLoS Biol. 19(3):e3001115.
De Maio N, Walker C, Borges R, Weilguny L, Slodkowicz G, Goldman N.
2020. Issues with SARS-CoV-2 sequencing data. Available from:
(last access March 24, 2021).
Martin D, Weaver S, Tegally H, San J, Wilkinson E, Giandhari J, Pillay Y,
Singh L, Lessells RJ, Oliveira TD, et al. 2021. The emergence and
ongoing convergent evolution of the N501Y lineages coincided
with a major global shift in the SARS-CoV-2 selective landscape.
medRxiv. doi:10.1101/2021.02.23.21252268.
Mateus A, Caeiro F. 2015. An R implementation of several randomness
tests. In: Simos ZK, Monovasilis T, editors. AIP Conf Proc.
1618(1):531–534. doi: 10.1063/1.4897792.
Mavian C, Pond SK, Marini S, Magalis BR, Vandamme AM, Dellicour S,
Scarpino SV, Houldcroft C, Villabona-Arenas J, Paisie TK, et al. 2020.
Sampling bias and incorrect rooting make phylogenetic network
tracing of SARS-COV-2 infections unreliable. Proc Natl Acad Sci U S A.
117(23):12522–12523.
McDonald JH, Kreitman M. 1991. Adaptive protein evolution at the Adh
locus in Drosophila. Nature 351(6328):652–654.
Miura S, Huuki LA, Buturla T, Vu T, Gomez K, Kumar S. 2018.
Computational enhancement of single-cell sequences for inferring
tumor evolution. Bioinformatics 34(17):i917–i926.
Morel B, Barbera P, Czech L, Bettisworth B, Huebner L, Lutteropp S,
Serdari D, Kostaki E-G, Mamais I, Kozlov A, et al. 2021.
Phylogenetic analysis of SARS-CoV-2 data is difficult. Mol Biol Evol.
Nei M, Kumar S. 2002. Molecular evolution and phylogenetics. New
York: Oxford University Press.
Okada P, Buathong R, Phuygun S, Thanadachakul T, Parnmen S,
Wongboot W, Waicharoen S, Wacharapluesadee S, Uttayamakul
S, Vachiraphan A, et al. 2020. Early transmission patterns of coronavirus
disease 2019 (COVID-19) in travellers from Wuhan to Thailand,
January 2020. Euro Surveill. 25:2000097.
Pekar J, Worobey M, Moshiri N, Scheffler K, Wertheim JO. 2021. Timing
the SARS-CoV-2 index case in Hubei province. Science
Pipes L, Wang H, Huelsenbeck J, Nielsen R. 2021. Assessing uncertainty in
the rooting of the SARS-CoV-2 phylogeny. Mol Biol Evol. 38(4):1537–
Pond SLK, Frost SDW, Muse SV. 2005. HyPhy: hypothesis testing using
phylogenies. Bioinformatics 21(5):676–679.
Pupko T, Huchon D, Cao Y, Okada N, Hasegawa M. 2002. Combining
multiple data sets in a likelihood analysis: which models are the best?
Mol Biol Evol. 19(12):2294–2307.
Rambaut A, Holmes EC, O’Toole Á, Hill V, McCrone JT, Ruis C, du Plessis L,
Pybus OG. 2020. A dynamic nomenclature proposal for SARS-CoV-2
lineages to assist genomic epidemiology. Nat Microbiol.
Rambaut A, Loman N, Pybus O, Barclay W, Barrett J, Carabelli A, Connor
T, Peacock T, Robertson DL, Volz E. 2020. Preliminary genomic
characterisation of an emergent SARS-CoV-2 lineage in the UK defined
by a novel set of spike mutations. Available from: https://virological.
563 (last access March 24, 2021).
Rice AM, Morales AC, Ho AT, Mordstein C, Mühlhausen S, Watson S,
Cano L, Young B, Kudla G, Hurst LD. 2021. Evidence for strong
mutation bias towards, and selection against, U content in
SARS-CoV-2: implications for vaccine design. Mol Biol Evol. 38(1):67–83.
Richard D, Owen CJ, van Dorp L, Balloux F. 2020. No detectable signal for
ongoing genetic recombination in SARS-CoV-2. bioRxiv. doi:10.1101/
Ross EM, Markowetz F. 2016. OncoNEM: inferring tumor evolution from
single-cell sequencing data. Genome Biol. 17(1):1–14.
Shu Y, McCauley J. 2017. GISAID: global initiative on sharing all influenza
data-from vision to reality. Euro Surveill. 22:30494.
Stefanelli P, Faggioni G, Lo Presti A, Fiore S, Marchi A, Benedetti E, Fabiani
C, Anselmo A, Ciammaruconi A, Fortunato A, et al. 2020. Whole
genome and phylogenetic analysis of two SARS-CoV-2 strains
isolated in Italy in January and February 2020: additional clues on
multiple introductions and further circulation in Europe. Euro
Surveill. 25:1–5.
Tang X, Wu C, Li X, Song Y, Yao X, Wu X, Duan Y, Zhang H, Wang Y,
Qian Z, et al. 2020. On the origin and continuing evolution of
SARS-CoV-2. Natl Sci Rev. 7(6):1012–1023.
Tegally H, Wilkinson E, Giovanetti M, Iranzadeh A, Fonseca V, Giandhari
SARS-CoV-2 variant of concern with mutations in spike glycoprotein.
Nature 592(7854):438–443.
Turakhia Y, De Maio N, Thornlow B, Gozashti L, Lanfear R, Walker CR,
Stability of SARS-CoV-2 phylogenies. PLoS Genet. 16(11):e1009175.
van Dorp L, Acman M, Richard D, Shaw LP, Ford CE, Ormond L, Owen
CJ, Pang J, Tan CCS, Boshier FAT, et al. 2020. Emergence of genomic
diversity and recurrent mutations in SARS-CoV-2. Infect Genet Evol.
Wald A, Wolfowitz J. 1940. On a test whether two samples are from the
same population. Ann Math Statist. 11(2):147–162.
Wenzel J. 2020. Origins of SARS-CoV-1 and SARS-CoV-2 are often poorly
explored in leading publications. Cladistics 36(4):374–379.
Worobey M, Pekar J, Larsen BB, Nelson MI, Hill V, Joy JB, Rambaut A,
Suchard MA, Wertheim JO, Lemey P. 2020. The emergence of
SARS-CoV-2 in Europe and the US. Science 370(6516):564–570.
Wu F, Zhao S, Yu B, Chen YM, Wang W, Song ZG, Hu Y, Tao ZW, Tian
JH, Pei YY, et al. 2020. A new coronavirus associated with human
respiratory disease in China. Nature 579(7798):265–269.
Yang Z, Kumar S, Nei M. 1995. A new method of inference of ancestral
nucleotide and amino acid sequences. Genetics
Zhou P, Yang XL, Wang XG, Hu B, Zhang L, Zhang W, Si HR, Zhu Y, Li B,
Huang CL, et al. 2020. A pneumonia outbreak associated with a new
coronavirus of probable bat origin. Nature 579(7798):270–273.
Progenitor SARS-CoV-2 and Its Dominant Offshoots in COVID-19 Pandemic. doi:10.1093/molbev/msab118. MBE
... Molecular evolutionary analyses suggest that SARS-CoV-2 emerged as a capable human pathogen, likely from a bat reservoir, although the mechanism of the original spillover(s) is a subject of ongoing debate. Although the timeline of SARS-CoV-2 emergence has not yet been firmly established, evolutionary analyses predicted that the virus likely circulated in China for some time, even months, before the first recorded December outbreak in Wuhan (Andersen et al., 2020; Leitner and Kumar, 2020; MacLean et al., 2021; Kumar et al., 2021; Xia, 2021; Chiara et al., 2021). ...
... The viral strain that dominated the Lombardy outbreak and that subsequently spread across Europe and beyond differs from the reference genome (Wuhan-Hu-1 strain) by the simultaneous presence of several mutations, including C3037T (synonymous), C14408T (RdRp P323L), and A23403G (Spike D614G) (Hadfield et al., 2018; Alteri et al., 2021; Hodcroft et al., 2021). This strain is now classified as 20A in NextStrain (Hadfield et al., 2018) and B.1 in Pangolin, has an αβ mutational signature (Kumar et al., 2021), and is also referred to as the DG1111 haplotype (Ruan et al., 2021). ...
... Mutations with respect to the reference strain were only recorded if simultaneously present in both the forward and reverse reads. Identified mutations were classified using a recently developed mutation order analysis approach (Kumar et al., 2021), and the time of emergence of the SARS-CoV-2 progenitor (proCoV2) was approximated using a previously calibrated molecular clock (Pekar et al., 2021). ...
As a reference laboratory for measles and rubella surveillance in Lombardy, we evaluated the association between SARS-CoV-2 infection and measles-like syndromes, providing preliminary evidence for undetected early circulation of SARS-CoV-2. Overall, 435 samples from 156 cases were investigated. RNA from oropharyngeal swabs (N = 148) and urine (N = 141) was screened with four hemi-nested PCRs and molecular evidence for SARS-CoV-2 infection was found in 13 subjects. Two of the positive patients were from the pandemic period (2/12, 16.7%, March 2020–March 2021) and 11 were from the pre-pandemic period (11/44, 25%, August 2019–February 2020). Sera (N = 146) were tested for anti-SARS-CoV-2 IgG, IgM, and IgA antibodies. Five of the RNA-positive individuals also had detectable anti-SARS-CoV-2 antibodies. No strong evidence of infection was found in samples collected between August 2018 and July 2019 from 100 patients. The earliest sample with evidence of SARS-CoV-2 RNA was from September 12, 2019, and the positive patient was also positive for anti-SARS-CoV-2 antibodies (IgG and IgM). Mutations typical of B.1 strains previously reported to have emerged in January 2020 (C3037T, C14408T, and A23403G), were identified in samples collected as early as October 2019 in Lombardy. One of these mutations (C14408T) was also identified among sequences downloaded from public databases that were obtained by others from samples collected in Brazil in November 2019. We conclude that a SARS-CoV-2 progenitor capable of producing a measles-like syndrome may have emerged in late June-late July 2019 and that viruses with mutations characterizing B.1 strain may have been spreading globally before the first Wuhan outbreak. Our findings should be complemented by high-throughput sequencing to obtain additional sequence information. 
We highlight the importance of retrospective surveillance studies in understanding the early dynamics of COVID-19 spread and we encourage other groups to perform retrospective investigations to seek confirmatory proofs of early SARS-CoV-2 circulation.
... Then, we replaced NC_045512.2 with another two putative earliest SARS-CoV-2 sequences (ProCoV2, Guangdong/HKU-SZ-002/2020) (16,17) and found no suspected progenitor sequence (Fig. S4). ...
... There are three different putative earliest SARS-CoV-2 sequences: NC_045512.2, the ProCoV2 sequence, and the Guangdong/HKU-SZ-002/2020 sequence (16,17), which were used as E in our analysis, respectively. ...
SARS-CoV-2 has infected more than 600 million people. However, the origin of the virus is still unclear; knowing where the virus came from could help us prevent future zoonotic epidemics. Sequencing data, particularly metagenomic data, can profile the genomes of all species in the sample, including those not recognized at the time, thus allowing for the identification of the progenitor of SARS-CoV-2 in samples collected before the pandemic. We analyzed the data from 5,196 SARS-CoV-2-positive sequencing runs in the NCBI's SRA database with collection dates prior to 2020 or unknown. We found that the mutation patterns obtained from these suspicious SARS-CoV-2 reads did not match the genome characteristics of an unknown progenitor of the virus, suggesting that they may derive from circulating SARS-CoV-2 variants or other coronaviruses. Despite a negative result for tracking the progenitor of SARS-CoV-2, the methods developed in the study could assist in pinpointing the origin of various pathogens in the future. IMPORTANCE Sequences that are homologous to the SARS-CoV-2 genome were found in numerous sequencing runs that were not associated with the SARS-CoV-2 studies in the public database. It is unclear whether they are derived from the possible progenitor of SARS-CoV-2 or contamination of more recent SARS-CoV-2 variants circulated in the population due to the lack of information on the collection, library preparation, and sequencing processes. We have developed a computational framework to infer the evolutionary relationship between sequences based on the comparison of mutations, which enabled us to rule out the possibility that these suspicious sequences originate from unknown progenitors of SARS-CoV-2.
... If the viral tree can be properly rooted, then the parameters µ and T_A can be estimated given a tree of viral strains with sample collection times used for calibrating a molecular clock [6][7][8][9][10][11]. The most recent common ancestor of SARS-CoV-2 was dated in two recent studies to October-November 2019 [12] and mid-August [5], respectively. Using viral genomes from China, Pekar et al. [13] inferred the first cryptic infection of SARS-CoV-2 to span the interval between mid-October and mid-November, 2019. ...
Almost all published rooting and dating studies on SARS-CoV-2 assumed that (1) evolutionary rate does not change over time although different lineages can have different evolutionary rates (uncorrelated relaxed clock), and (2) a zoonotic transmission occurred in Wuhan and the culprit was immediately captured, so that only the SARS-CoV-2 genomes obtained in 2019 and the first few months of 2020 (resulting from the first wave of the global expansion from Wuhan) are sufficient for dating the common ancestor. Empirical data contradict the first assumption. The second assumption is not warranted because mounting evidence suggests the presence of early SARS-CoV-2 lineages cocirculating with the Wuhan strains. Large trees with SARS-CoV-2 genomes beyond the first few months are needed to increase the likelihood of finding SARS-CoV-2 lineages that might have originated at the same time as (or even before) those early Wuhan strains. I extended a previously published rapid rooting method to model evolutionary rate as a linear function instead of a constant. This substantially improves the dating of the common ancestor of sampled SARS-CoV-2 genomes. Based on two large trees with 83,688 and 970,777 high-quality and full-length SARS-CoV-2 genomes that contain complete sample collection dates, the common ancestor was dated to 12 June 2019 and 7 July 2019 with the two trees, respectively. The two data sets would give dramatically different or even absurd estimates if the rate was treated as a constant. The large trees were also crucial for overcoming the high rate-heterogeneity among different viral lineages. The improved method was implemented in the software TRAD.
... The phylogeny of representative sequences was reconstructed using IQ-TREE (Minh et al., 2020). The tree was rooted by the outgroup USA-WA1/2020 (EPI_ISL_404895) that matched the sequence of the putative SARS-CoV-2 progenitor (Bloom, 2021; Kumar et al., 2021). The ancestral sequences at internal tree nodes were reconstructed by TreeTime (Sagulenko et al., 2018). ...
SARS-CoV-2 has adapted in a stepwise manner, with multiple beneficial mutations accumulating in a rapid succession at origins of VOCs, and the reasons for this are unclear. Here, we searched for coordinated evolution of amino acid sites in the spike protein of SARS-CoV-2. Specifically, we searched for concordantly evolving site pairs (CSPs) for which changes at one site were rapidly followed by changes at the other site in the same lineage. We detected 46 sites which formed 45 CSP. Sites in CSP were closer to each other in the protein structure than random pairs, indicating that concordant evolution has a functional basis. Notably, site pairs carrying lineage defining mutations of the four VOCs that circulated before May 2021 are enriched in CSPs. For the Alpha VOC, the enrichment is detected even if Alpha sequences are removed from analysis, indicating that VOC origin could have been facilitated by positive epistasis. Additionally, we detected nine discordantly evolving pairs of sites where mutations at one site unexpectedly rarely occurred on the background of a specific allele at another site, for example on the background of wild-type D at site 614 (four pairs) or derived Y at site 501 (three pairs). Our findings hint that positive epistasis between accumulating mutations could have delayed the assembly of advantageous combinations of mutations comprising at least some of the VOCs.
... Furthermore, several early COVID-19 market cases may have occurred via human to human transmission rather than via a zoonotic jump [8]. In addition, the presence of human infections associated with lineage B at the market [9] makes a market origin less likely, as lineage B is likely a derived lineage, while lineage A is likely ancestral [10] (lineage A and lineage B are the first two major SARS-CoV-2 lineages to emerge in 2019, and are only separated by two single nucleotide variants (SNVs) [11]). An alternative hypothesis that the progenitor of SARS-CoV-2 was present in one of the laboratories in Wuhan conducting bat coronavirus research and accidentally escaped has not been sufficiently investigated [12]. ...
Pangolins are the only animals other than bats proposed to have been infected with SARS-CoV-2 related coronaviruses (SARS2r-CoVs) prior to the COVID-19 pandemic. Here, we examine the novel SARS2r-CoV we previously identified in game animal metatranscriptomic datasets sequenced by the Nanjing Agricultural University in 2022, and find that sections of the partial genome phylogenetically group with Guangxi pangolin CoVs (GX PCoVs), while the full RdRp sequence groups with bat-SL-CoVZC45. While the novel SARS2r-CoV is found in 6 pangolin datasets, it is also found in 10 additional NGS datasets from 5 separate mammalian species and is likely related to contamination by a laboratory researched virus. Absence of bat mitochondrial sequences from the datasets, the fragmentary nature of the virus sequence and the presence of a partial sequence of a cloning vector attached to a SARS2r-CoV read suggests that it has been cloned. We find that NGS datasets containing the novel SARS2r-CoV are contaminated with significant Homo sapiens genetic material, and numerous viruses not associated with the host animals sampled. We further identify the dominant human haplogroup of the contaminating H. sapiens genetic material to be F1c1a1, which is of East Asian provenance. The association of this novel SARS2r-CoV with both bat CoV and the GX PCoV clades is an important step towards identifying the origin of the GX PCoVs.
... We added CHN/LA-67/2020, a T/T intermediate genome on Genbank. The possible progenitor genome 'proCoV2' proposed by (Kumar et al., 2021) was added to the set of genomes extracted from GISAID. Phylogenetic analysis was conducted using Augur v17.1.0 ...
Understanding how SARS-CoV-2 entered the human population, thereby causing the COVID-19 pandemic, is one of the most urgent questions in science today. Two hypotheses are widely acknowledged as being most likely to explain the pandemic’s origin in late 2019: (i) the “natural origin” hypothesis that one or more cross-species transmissions from animals into humans occurred, most likely at the Huanan Seafood Market in Wuhan, China; (ii) the “laboratory origin” hypothesis, that scientific research activities led to the unintentional leak of SARS-CoV-2 from a laboratory into the general population. A recent analysis of SARS-CoV-2 genomes by Pekar et al. [Science 377:960-966 (2022)] claims to establish at least two separate spillover events from animals into humans, thus claiming to provide strong evidence for the natural origin hypothesis. However, here we use outbreak simulations to show that the findings of Pekar et al. are heavily impacted by two methodological artifacts: the dubious exclusion of informative SARS-CoV-2 genomes, and their reliance on unrealistic phylodynamic models of SARS-CoV-2. Absent models that incorporate these effects, one cannot conclude multiple SARS-CoV-2 spillovers into humans. Our results cast doubt on a primary point of evidence in favor of the natural origin hypothesis. Lay Summary: It is not known if SARS-CoV-2 spilled over from animals into humans at the Huanan Seafood Market, or arose as a result of research activities studying bat coronaviruses. Two recent papers had claimed to answer this question, but here we show those papers are both inconclusive as they fail to account for biases in how medical managers became alerted to SARS-CoV-2 and how public health authorities sampled early cases. Additionally, key data points conflicting with the authors’ conclusions were improperly excluded from the analysis. The papers’ methods do not justify their conclusions, and the origin of SARS-CoV-2 remains an urgent, open question for science.
... Unfortunately, this approach is of limited value for coronaviruses because of post-transcriptional genome editing (transitions C => U) that adds to an already high rate of mutations [17]. However, despite these uncertainties, such studies are in favor of an earlier beginning of the pandemic, between July and Fall 2019 [18][19][20][21][22][23][24]. These findings must be interpreted with caution, especially given that the first available complete genome sequences were obtained in January 2020. ...
The emergence and global spread of the Severe Acute Respiratory Syndrome Coronavirus-2 (SARS-CoV-2) is critical to understanding how to prevent or control a future viral pandemic. We review the tools used for this retrospective search, their limits, and results obtained from China, France, Italy and the USA. We examine possible scenarios for the emergence of SARS-CoV-2 in the human population. We consider the Chinese city of Wuhan where the first cases of atypical pneumonia were attributed to SARS-CoV-2 and from where the disease spread worldwide. Possible superspreading events include the Wuhan-based 7th Military World Games on October 18-27, 2019 and the Chinese New Year holidays from January 25 to February 2, 2020. Several clues point to an early regional circulation of SARS-CoV-2 in northern Italy (Lombardy) as soon as September/October 2019 and in France in November/December 2019, if not before. With the goal of preventing future pandemics, we call for additional retrospective studies designed to trace the origin of SARS-CoV-2.
The novel coronavirus disease (COVID-19) outbreak that emerged at the end of 2019 has now swept the world for more than two years, causing immeasurable damage to the lives and economies of the world. It has drawn considerable attention to the question of how the severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) originated and entered the human body. The current argument revolves around two contradictory theories: a laboratory spillover event and human contact with zoonotic diseases. Here, we reviewed the transmission, pathogenesis, possible hosts, as well as the genome and protein structure of SARS-CoV-2, which play key roles in the COVID-19 pandemic. We believe the coronavirus was originally transmitted to humans by animals rather than by a laboratory leak. However, more investigations are still needed to determine the source of the pandemic. Understanding how COVID-19 emerged is vital to developing global strategies for mitigating future outbreaks.
The fifth chapter explains three ways to infer ancestral sequences from sequence data. After explaining a heuristic approach to build the intuition of the readers, the chapter explains maximum likelihood estimation and Bayesian estimation with some simplifications to keep the formulas minimal. To adjust the results for uncertainty not quantified in the models, the previous chapter’s method of uncertainty quantification is adapted. Exercises provided at the end of the chapter give readers practical experience with estimating ancestral sequences and intuition about the uncertainty involved.
Keywords: Maximum likelihood estimation; Estimating an ancestral DNA sequence; Democratic estimation; Bayesian estimation; Bayes’s theorem; Prior probability; Posterior probability; Prior distribution; Posterior distribution; Empirical Bayes
Coronavirus disease 2019 (COVID-19) caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has caused a global pandemic. SARS-CoV-2 carries a unique group of mutations, and the transmission of the virus has led to the emergence of other mutants such as Alpha (B.1.1.7), Beta (B.1.351), Gamma (P.1), Kappa (B.1.617.1), Delta (B.1.617.2) and Omicron (B.1.1.529). The advent of a vaccine has raised hopes of ending the pandemic. However, the variants of SARS-CoV-2 have raised concerns about the effectiveness of vaccines because the data showed that the vaccine was less effective against these variants than against previous ones. Variants could readily mutate the N-segment structure and receptor domain of the spike glycoprotein (S) to escape antibody recognition. Therefore, it is vital to understand the potential immune response and evasion mechanisms of SARS-CoV-2 variants. In this review, immune response and evasion mechanisms of several SARS-CoV-2 variants are described, which could provide some helpful advice for future vaccines.
Following its emergence in late 2019, the spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) [1,2] has been tracked via phylogenetic analysis of viral genome sequences in unprecedented detail [3–5]. While the virus spread globally in early 2020 before borders closed, intercontinental travel has since been greatly reduced. However, within Europe travel resumed in the summer of 2020. Here we report on a novel SARS-CoV-2 variant, 20E (EU1), that emerged in Spain in early summer, and subsequently spread across Europe. We find no evidence of increased transmissibility, but instead demonstrate how rising incidence in Spain, resumption of travel, and lack of effective screening and containment may explain the variant’s success. Despite travel restrictions, we estimate 20E (EU1) was introduced hundreds of times to European countries by summertime travelers, likely undermining local efforts to keep SARS-CoV-2 cases low. Our results demonstrate how a variant can rapidly become dominant even in absence of a substantial transmission advantage in favorable epidemiological settings. Genomic surveillance is critical to understanding how travel can impact SARS-CoV-2 transmission, and thus for informing future containment strategies as travel resumes.
Backtracking a pandemic
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) may have had a history of abortive human infections before a variant established a productive enough infection to create a transmission chain with pandemic potential. Therefore, the Wuhan cluster of infections identified in late December of 2019 may not have represented the initiating event. Pekar et al. used genome data collected from the early cases of the COVID-19 pandemic combined with molecular clock inference and epidemiological simulation to estimate when the most successful variant gained a foothold in humans. This analysis pushes human-to-human transmission back to mid-October to mid-November of 2019 in Hubei Province, China, with a likely short interval before epidemic transmission was initiated. Science, this issue p. 412
Virus host shifts are generally associated with novel adaptations to exploit the cells of the new host species optimally. Surprisingly, Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-CoV-2) has apparently required little to no significant adaptation to humans since the start of the Coronavirus Disease 2019 (COVID-19) pandemic and to October 2020. Here we assess the types of natural selection taking place in Sarbecoviruses in horseshoe bats versus the early SARS-CoV-2 evolution in humans. While there is moderate evidence of diversifying positive selection in SARS-CoV-2 in humans, it is limited to the early phase of the pandemic, and purifying selection is much weaker in SARS-CoV-2 than in related bat Sarbecoviruses. In contrast, our analysis detects evidence for significant positive episodic diversifying selection acting at the base of the bat virus lineage SARS-CoV-2 emerged from, accompanied by an adaptive depletion in CpG composition presumed to be linked to the action of antiviral mechanisms in these ancestral bat hosts. The closest bat virus to SARS-CoV-2, RmYN02 (sharing an ancestor about 1976), is a recombinant with a structure that includes differential CpG content in Spike; clear evidence of coinfection and evolution in bats without involvement of other species. While an undiscovered “facilitating” intermediate species cannot be discounted, collectively, our results support the progenitor of SARS-CoV-2 being capable of efficient human–human transmission as a consequence of its adaptive evolutionary history in bats, not humans, which created a relatively generalist virus.
Continued uncontrolled transmission of the severe acute respiratory syndrome-related coronavirus 2 (SARS-CoV-2) in many parts of the world is creating the conditions for significant virus evolution [1,2]. Here, we describe a new SARS-CoV-2 lineage (501Y.V2) characterised by eight lineage-defining mutations in the spike protein, including three at important residues in the receptor-binding domain (K417N, E484K and N501Y) that may have functional significance [3–5]. This lineage was identified in South Africa after the first epidemic wave in a severely affected metropolitan area, Nelson Mandela Bay, located on the coast of the Eastern Cape Province. This lineage spread rapidly, becoming dominant in the Eastern Cape, Western Cape and KwaZulu-Natal Provinces within weeks. Whilst the full significance of the mutations is yet to be determined, the genomic data, showing the rapid expansion and displacement of other lineages in multiple regions, suggest that this lineage is associated with a selection advantage, most plausibly as a result of increased transmissibility or immune escape [6–8].
The emergence and rapid rise in prevalence of three independent SARS-CoV-2 '501Y lineages', B.1.1.7, B.1.351 and P.1, in the last three months of 2020 has prompted renewed concerns about the evolutionary capacity of SARS-CoV-2 to adapt to both rising population immunity and public health interventions such as vaccines and social distancing. Viruses giving rise to the different 501Y lineages have, presumably under intense natural selection following a shift in host environment, independently acquired multiple unique and convergent mutations. As a consequence all have gained epidemiological and immunological properties that will likely complicate the control of COVID-19. Here, by examining patterns of mutations that arose in SARS-CoV-2 genomes during the pandemic we find evidence of a major change in the selective forces acting on immunologically important SARS-CoV-2 genes (such as N and S) that likely coincided with the emergence of the 501Y lineages. In addition to involving continuing sequence diversification, we find evidence that a significant portion of the ongoing adaptive evolution of the 501Y lineages also involves further convergence between the lineages. Our findings highlight the importance of monitoring how members of these known 501Y lineages, and others still undiscovered, are convergently evolving similar strategies to ensure their persistence in the face of mounting infection and vaccine induced host immune recognition.
Effective systems for the analysis of molecular data are fundamental for monitoring the spread of infectious diseases and studying pathogen evolution. The rapid identification of emerging viral strains, and/or genetic variants potentially associated with novel phenotypic features is one of the most important objectives of genomic surveillance of human pathogens and represents one of the first lines of defense for the control of their spread. During the COVID-19 pandemic, several taxonomic frameworks have been proposed for the classification of SARS-CoV-2 isolates. These systems, which are typically based on phylogenetic approaches, represent essential tools for epidemiological studies as well as contributing to the study of the origin of the outbreak. Here we propose an alternative, reproducible and transparent phenetic method to study changes in SARS-CoV-2 genomic diversity over time. We suggest that our approach can complement other systems and facilitate the identification of biologically relevant variants in the viral genome. To demonstrate the validity of our approach, we present comparative genomic analyses of more than 175,000 genomes. Our method delineates 22 distinct SARS-CoV-2 haplogroups, which, based on the distribution of high frequency genetic variants, fall into 4 major macro haplogroups. We highlight biased spatio-temporal distributions of SARS-CoV-2 genetic profiles and show that 7 of the 22 haplogroups (and of all of the 4 haplogroup clusters) showed a broad geographic distribution within China by the time the outbreak was widely recognised—suggesting early emergence and widespread cryptic circulation of the virus well before its isolation in January 2020.
General patterns of genomic variability are remarkably similar within all major SARS-CoV-2 haplogroups, with UTRs consistently exhibiting the greatest variability, with s2m, a conserved secondary structure element of unknown function in the 3’ UTR of the viral genome showing evidence of a functional shift. While several polymorphic sites that are specific to one or more haplogroups were predicted to be under positive or negative selection, overall our analyses suggest that the emergence of novel types is unlikely to be driven by convergent evolution and independent fixation of advantageous substitutions, or by selection of recombined strains. In the absence of extensive clinical metadata for most available genome sequences, and in the context of extensive geographic and temporal biases in the sampling, many questions regarding the evolution and clinical characteristics of SARS-CoV-2 isolates remain open. However, our data indicate that the approach outlined here can be usefully employed in the identification of candidate SARS-CoV-2 genetic variants of clinical and epidemiological importance.
The ongoing pandemic of SARS-CoV-2 presents novel challenges and opportunities for the use of phylogenetics to understand and control its spread. Here, we analyze the emergence of SARS-CoV-2 in Russia in March and April 2020. Combining phylogeographic analysis with travel history data, we estimate that the sampled viral diversity originated from at least 67 closely timed introductions into Russia, mostly in late February to early March. All but one of these introductions came from countries other than China, suggesting that border closure with China has helped delay establishment of SARS-CoV-2 in Russia. These introductions resulted in at least 9 distinct Russian lineages corresponding to domestic transmission. A notable transmission cluster corresponded to a nosocomial outbreak at the Vreden hospital in Saint Petersburg; phylodynamic analysis of this cluster reveals multiple (2-3) introductions, each giving rise to a large number of cases, with a high initial effective reproduction number of 3.0 [1.9, 4.3].
We identified severe acute respiratory syndrome coronavirus 2 RNA in an oropharyngeal swab specimen collected from a child with suspected measles in early December 2019, ≈3 months before the first identified coronavirus disease case in Italy. This finding expands our knowledge of the timing and mapping of novel coronavirus transmission pathways.
Numerous studies covering some aspects of SARS-CoV-2 data analyses are being published on a daily basis, including a regularly updated phylogeny on Here, we review the difficulties of inferring reliable phylogenies by example of a data snapshot comprising a quality-filtered subset of 8,736 out of all 16,453 virus sequences available on May 5, 2020 from We find that it is difficult to infer a reliable phylogeny on these data due to the large number of sequences in conjunction with the low number of mutations. We further find that rooting the inferred phylogeny with some degree of confidence either via the bat and pangolin outgroups or by applying novel computational methods on the ingroup phylogeny does not appear to be credible. Finally, an automatic classification of the current sequences into sub-classes using the mPTP tool for molecular species delimitation is also, as might be expected, not possible, as the sequences are too closely related. We conclude that, although the application of phylogenetic methods to disentangle the evolution and spread of COVID-19 provides some insight, results of phylogenetic analyses, in particular those conducted under the default settings of current phylogenetic inference tools, as well as downstream analyses on the inferred phylogenies, should be considered and interpreted with extreme caution.