Time series analysis of SARS-CoV-2 genomes and correlations among highly prevalent mutations

Neha Periwal1, Shravan B. Rathod2, Sankritya Sarma3, Gundeep Singh4, Avantika
Jain1,5, Ravi P. Barnwal6, Kinsukh R. Srivastava7, Baljeet Kaur8, Pooja Arora3 and Vikas
Sood1*
Affiliations:
1. Department of Biochemistry, SCLS, Jamia Hamdard, New Delhi, India
2. Department of Chemistry, Smt. S. M. Panchal Science College, Talod, India
3. Department of Zoology, Hansraj College, University of Delhi, New Delhi, India
4. Humber College, Toronto, Canada
5. Delhi Institute of Pharmaceutical Sciences and Research, Delhi
6. Department of Biophysics, Panjab University, Chandigarh, India
7. Division of Medicinal and Process Chemistry, CDRI, Lucknow, India
8. Department of Computer Science, Hansraj College, University of Delhi, New
Delhi, India
Corresponding Author:
Vikas Sood PhD
Asst. Prof.
Department of Biochemistry
Jamia Hamdard
Delhi-62
India
Email: vikas1101@gmil.com, v.sood@jamihiahamdard.ac.in
bioRxiv preprint doi: https://doi.org/10.1101/2022.04.05.487114; this version posted April 5, 2022. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.
Abstract
The efforts of the scientific community to tame the recent SARS-CoV-2 pandemic seem to
have been diluted by the emergence of new viral strains. Therefore, it becomes imperative to
study and understand the effect of mutations on viral evolution, fitness and pathogenesis. In
this regard, we performed a time-series analysis on 59541 SARS-CoV-2 genomic sequences
from around the world. These 59541 genomes were grouped according to the months (January
2020-March 2021) based on the collection date. Meta-analysis of this data led us to identify
highly significant mutations in viral genomes. Correlation and Hierarchical Clustering of the
highly significant mutations led us to the identification of sixteen mutation pairs that were
correlated with each other and were present in >30% of the genomes under study. Among these
mutation pairs, some of the mutations have been shown to contribute towards the viral
replication and fitness suggesting the possible role of other unexplored mutations in viral
evolution and pathogenesis. Additionally, we employed various computational tools to
investigate the effects of T85I, P323L, and Q57H mutations in Non-structural protein 2 (Nsp2),
RNA-dependent RNA polymerase (RdRp) and Open reading frame 3a (ORF3a) respectively.
Results show that the T85I mutation in Nsp2 and the Q57H mutation in ORF3a are deleterious and destabilize
the parent protein, whereas P323L in RdRp is neutral and has a stabilizing effect. The
normalized linear mutual information (nLMI) calculations revealed significant residue
correlation in Nsp2 and ORF3a, in contrast to reduced correlation in the RdRp protein.
Key words: SARS-CoV-2, Pearson Correlation, Hierarchical Clustering, Mutations, Protein
Dynamics, Residue Correlation
Introduction
The novel coronavirus first appeared in Wuhan, China in December 2019 and became a public
health emergency of international concern. Since its emergence, the virus has caused
catastrophe across the globe. This virus known as SARS-CoV-2 has infected nearly 486 million
people and killed more than 6.1 million globally (WHO Coronavirus (COVID-19) Dashboard,
n.d.). Of the seven known human coronaviruses (HCoV-OC43, HCoV-229E, HCoV-HKU1,
HCoV-NL63, SARS-CoV, MERS-CoV, and SARS-CoV-2) [1], SARS-CoV-2 is highly
pathogenic to humans [2]. The virus has a linear, positive-sense, single-stranded RNA genome
of about 30000 nucleotides, which is encapsulated by the Nucleocapsid protein, one of the
four structural proteins, the others being the Spike, Envelope, and Membrane proteins [3].
Once the virus gains entry inside the cell, two viral polyproteins namely ORF1a and ORF1ab
are formed. These polyproteins are then cleaved by the viral proteases into sixteen non-
structural proteins. These proteins initiate the process of viral replication and transcription.
Apart from the viral non-structural proteins, SARS-CoV-2 encodes eleven accessory proteins
that play a key role in viral pathogenesis [4].
Among the non-structural proteins of SARS-CoV, Nsp14 along with Nsp10 and Nsp12 plays
a key role in maintaining the integrity of the viral RNA thereby resulting in a lesser number of
mutations as compared to other RNA viruses [5,6]. Despite the fact that SARS-CoV-2 mutates
at a slower pace, this virus has evolved into numerous variants since the onset of the pandemic
in December 2019 [7]. The continuous evolution of SARS-CoV-2 has already dampened the
efforts of the scientific community to design vaccines and antivirals against this virus [8]. Since
mutations are one of the driving factors for virus evolution, various studies have been done to
identify genomic variants among SARS-CoV-2. These studies have led to the discovery of a
wide number of genetic variations, including missense, synonymous, insertion and deletion in
the genomic sequences of SARS-CoV-2. The most common types of mutations along the viral
genome were reported to be missense and synonymous in nature [9]. Although synonymous
mutations may not have a direct impact on protein function, they do have consequences as they
may alter codon usage, translation efficiency as well as binding kinetics of microRNAs.
Furthermore, it was postulated that mutations in the 5’UTR may alter the virus's
transcription and replication rates, as well as the folding of the genomic ssRNA sequences [10].
SARS-CoV-2 genome analyses have revealed a substantial mutation bias towards U, which
might be caused by improved immunogenicity, selection for greater expression, and better
mRNA stability [11]. Viral transmission rates are increasing rapidly as the virus evolves. For
instance, a single mutation (D614G) in the Spike protein has been shown to increase the
infectivity of SARS-CoV-2 [12]. As the viral genome accumulates mutations, these mutations
may be associated with one another. It has been shown that the co-mutations Y449S and
N501Y in the Spike protein can lead to reduced infectivity and play a major role in disrupting
antibody-mediated virus neutralization [13]. This implies that mutations can have a synergistic
effect resulting in enhanced viral fitness and immune escape. Therefore, immediate steps are
required to identify the correlations and clusters among the mutations in the viral genome.
Several studies have been published in this direction. Zuckerman et al.
analysed 371 Israeli genomic sequences from February to April 2020 and identified
correlations between mutations and known clade-defining mutations [14]. In another study,
Wang et al. analysed pairwise co-mutations of the top 11
missense mutations that were prevalent in the United States [15]; these 11 missense mutations
were obtained by analysing 12754 SARS-CoV-2 genomes from the USA.
In another study, Rahman et al. analysed 324 complete and near-complete
SARS-CoV-2 genomic sequences from Bangladesh that were isolated
between 30 March and 7 September 2020 [16]. They found that 3037C>T was the most frequent
mutation, occurring in 98% of isolates. This synonymous mutation was shown
to co-occur with three other mutations: 241C>T, 14408C>T, and 23403A>G. In another
study, Chen et al. analysed 261323 sequences of SARS-CoV-2 across the globe to identify the
most common concurrent mutations among the top 17 mutations which occurred in more than
10% of the genomes under study [17]. The authors observed that there was a steady increase
in the number of concurrent mutations as the COVID-19 pandemic progressed. The authors
further showed that early M type genotype having two concurrent mutations evolved into WE1
with two additional concurrent mutations. WE1 further evolved into WE1.1 by incorporating
three additional concurrent mutations.
However, some of these studies were performed with viral sequences obtained from a specific
region, and focussed only on a handful of top missense mutations. We asked
whether a similar trend could be observed with the genomic sequences of SARS-CoV-2
obtained from around the world. In order to gain a better understanding of the origin of
mutations among SARS-CoV-2, we analysed viral genomic sequences in a time-series-dependent
manner. Meta-analysis of SARS-CoV-2 time-series data led us to the identification
of highly significant mutations. We applied two widely used statistical methods,
Pearson Correlation and Hierarchical Clustering, to identify co-mutations occurring in viral
genomes. In-silico protein dynamics were then used to characterize some of the
most prominent correlated mutations circulating in SARS-CoV-2.
Material and Methods
SARS-CoV-2 Genomic Sequences
Since the onset of the SARS-CoV-2 pandemic in 2019, the virus has been continuously evolving,
resulting in the emergence of several variants. The availability of SARS-CoV-2 genomic
sequences has been instrumental in understanding viral evolution and pathogenesis. To gain an
in-depth understanding of the mutational landscape of SARS-CoV-2, we sought to analyse
SARS-CoV-2 genomic data in a time-series manner. All the SARS-CoV-2 genomic sequences
were collected in a month-wise manner (based on the sample collection month) from the Virus
Pathogen Resource (ViPR) database [18]. In the current study, SARS-CoV-2 sequences were
collected from January 2020 to March 2021, totalling 59541 samples.
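The month-wise grouping by collection date can be sketched as follows. This is a minimal illustration, assuming a metadata table with hypothetical columns `accession` and `collection_date`; the function name, the column names and the toy records are not taken from the ViPR export used in the study.

```python
import pandas as pd

def group_by_collection_month(meta, start="2020-01-01", end="2021-03-31"):
    """Map 'YYYY-MM' to the list of accessions collected in that month."""
    # Unparseable dates become NaT and are dropped by the window filter below.
    dates = pd.to_datetime(meta["collection_date"], errors="coerce")
    in_window = (dates >= pd.Timestamp(start)) & (dates <= pd.Timestamp(end))
    kept = meta.loc[in_window].assign(month=dates[in_window].dt.strftime("%Y-%m"))
    return kept.groupby("month")["accession"].apply(list).to_dict()

# Toy records (hypothetical accessions, not real database entries).
toy = pd.DataFrame({
    "accession": ["seq1", "seq2", "seq3", "seq4"],
    "collection_date": ["2020-01-15", "2020-03-02", "2020-03-28", "2021-05-01"],
})
groups = group_by_collection_month(toy)
```

Grouping on the collection date rather than the submission date keeps each genome in the month in which the virus was actually sampled.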
Meta-analysis of SARS-CoV-2 genomic sequences
Once the SARS-CoV-2 sequences were obtained and categorized by the collection month, we
performed the meta-analysis of these sequences to identify highly significant mutations among
circulating genomes. The genomes from various months were analysed with respect to the
genomes collected in January 2020. The genomes obtained during the initial phase of infections
tend to be close to the wild type with a few mutations as compared to the genomes collected at
the later stages of the infection. For the identification of highly significant mutations, Meta-
data-driven comparative analysis tool for sequences (META-CATS) was used [19]. All the
analyses were performed with default settings and mutations having p<0.05 were considered
significant. We obtained highly significant mutations for each month. Notably, SARS-CoV-2
genomic sequences collected in December 2020 did not yield any significant mutation.
Therefore, December 2020 genomes were not included for further analysis.
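The comparison of one month against the January 2020 control group can be illustrated for a single alignment column. The sketch below is not the Meta-CATS implementation; it assumes that a chi-square test of independence on base counts is a reasonable stand-in, and the function name and toy base counts are hypothetical.

```python
from collections import Counter
from scipy.stats import chi2_contingency

def position_p_value(control_bases, month_bases):
    """Chi-square test of independence on base counts at one alignment column."""
    bases = sorted(set(control_bases) | set(month_bases))
    if len(bases) < 2:
        return 1.0  # invariant column, nothing to test
    control, month = Counter(control_bases), Counter(month_bases)
    # Rows are the two groups, columns are the bases observed at this position.
    table = [[control[b] for b in bases], [month[b] for b in bases]]
    return chi2_contingency(table)[1]  # index 1 is the p-value

# Toy column: mostly C in the control group, mostly T in a later month.
control = ["C"] * 95 + ["T"] * 5
later = ["C"] * 20 + ["T"] * 80
p_changed = position_p_value(control, later)
p_same = position_p_value(control, control)
```

A position would be flagged for that month when its p-value falls below the 0.05 threshold.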
Correlation Coefficient
Pearson Correlation coefficient
Our analysis identified highly significant mutations some of which were common among
several months. In order to identify the correlation among highly significant mutations in
SARS-CoV-2 circulating genomes, we used the Pearson Correlation. We created an empty
matrix with 30000 columns and 55759 rows using the NumPy module of Python. In this matrix,
the number of columns represents the maximum possible length of the SARS-CoV-2 genomic
sequences that were used in the study, while the number of rows represents the number of
SARS-CoV-2 genomic sequences that we analysed. Once the empty matrix was created, for
each genome of a particular month, we inserted the number “1” at those positions where Meta-
CATS identified a highly significant mutation for that month and “0” was inserted at those
positions where no mutation was identified. This task was performed using an in-house Python
script. Therefore, we obtained a binary 55759 × 30000 matrix where the number “1”
represented a position where a significant mutation was identified whereas a number “0”
represented a position where no mutation was present. Once this matrix was prepared, we
applied Pearson Correlation on this matrix to find out correlation coefficient among the highly
significant mutations. We used the corr method of the pandas library to implement Pearson
correlation.
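The construction of the binary matrix and the Pearson step can be sketched on a toy scale. The genome identifiers and mutation lists below are hypothetical, and the sketch keeps only columns for positions that carry at least one mutation (a constant all-zero column has no defined Pearson correlation), whereas the study allocated 30000 columns.

```python
import numpy as np
import pandas as pd

# Hypothetical input: for each genome, the significant positions at which it is mutated.
genome_mutations = {
    "g1": [241, 3037, 14408, 23403],
    "g2": [241, 3037, 14408, 23403],
    "g3": [241, 3037, 14408, 23403, 1059],
    "g4": [],
    "g5": [1059],
    "g6": [],
}
positions = sorted({p for muts in genome_mutations.values() for p in muts})
column_of = {p: k for k, p in enumerate(positions)}

# Rows are genomes, columns are positions; 0 means "no significant mutation".
matrix = np.zeros((len(genome_mutations), len(positions)), dtype=np.uint8)
for row, muts in enumerate(genome_mutations.values()):
    for p in muts:
        matrix[row, column_of[p]] = 1

df = pd.DataFrame(matrix, index=list(genome_mutations), columns=positions)
corr = df.corr(method="pearson")  # position-by-position correlation matrix
```

In this toy input, 241 and 3037 always occur together, so their coefficient is 1, while 1059 is distributed independently of 241 and gives a coefficient of 0.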
Clustering
Hierarchical Clustering
To further validate the results obtained from the Pearson correlation, we used another statistical
tool, Hierarchical Clustering, which groups similar objects into clusters, thereby indicating
how close the objects are to one another. Hierarchical Clustering is computationally
intensive. Since highly frequent mutations tend to play a critical role in the evolution
of the virus, we used the top 25 mutations that were present in >10% of the genomes used in
this study [17]. Hierarchical Clustering was performed on the binary matrix of the top 25
mutations. We applied the create_dendrogram function from the figure_factory module of plotly, which performs
Hierarchical Clustering on the matrix and creates a dendrogram of the highly
significant mutations that cluster together. To make the visualization more effective,
we combined the dendrogram with a heatmap using the pdist and squareform functions of the scipy
library. Values on the colour bar of the dendrogram with heatmap correspond to distances between
highly significant mutations. All these analyses were performed in Python.
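The clustering step can be sketched with SciPy alone. This illustration swaps the plotly dendrogram for direct calls to `scipy.cluster.hierarchy`, and the Euclidean metric, the complete linkage and the toy matrix are assumptions made for the example rather than settings reported in the study.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

# Toy binary matrix: rows are genomes, columns are mutations (illustrative values only).
mutations = ["241C>T", "3037C>T", "23403A>G", "28881G>A", "28883G>C"]
genomes = np.array([
    [1, 1, 1, 0, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 0, 0],
])

# The mutations are clustered, so each observation is one column of the genome matrix.
profiles = genomes.T
condensed = pdist(profiles, metric="euclidean")
distance_matrix = squareform(condensed)        # square form, suitable for a heatmap
tree = linkage(condensed, method="complete")   # merge tree, suitable for a dendrogram
labels = fcluster(tree, t=2, criterion="maxclust")
clusters = dict(zip(mutations, labels))
```

Mutations that are carried by the same genomes have identical column profiles, sit at distance zero from each other, and therefore merge first in the tree.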
Protein structure modelling and preparation
Our analysis identified nine highly significant and frequent mutations that were correlated with
each other. We then sought to use computational tools to identify the effect of these mutations
on protein structures. Out of these nine mutations, one mutation (241C>T) was present in
5’UTR which does not get translated into the protein. Two mutations (3037C>T and
28882G>A) are synonymous in nature and hence were not included for further analysis. Out
of the remaining six mutations, we performed the detailed analysis of three mutations
(23403A>G, 28881G>A, and 28883G>C) in our recent study [20]. Therefore, in this study, we
targeted the remaining three mutations 1059C>T (T85I), 14408C>T (P323L) and 25563G>T
(Q57H) to probe the impact of these mutations in Non-structural protein 2 (Nsp2), RNA-
dependent RNA polymerase (RdRp) and Open reading frame 3a (ORF3a) respectively. The
crystal structures of these proteins are available on RCSB Protein data bank (PDB)
(https://www.rcsb.org/) but they have missing residues. Hence, we employed a deep learning-
based protein modelling tool, RoseTTAFold [21] available at https://robetta.bakerlab.org/ to
add missing residues in our proteins. Nsp2 (PDB: 7MSW) is 638 amino acids long and, in the
crystal structure, the first three residues at the N-terminus are missing. RdRp (PDB: 7CYQ) is
942 amino acids long, and residues 1-3 at the N-terminus and 930-942 at the C-terminus
are missing. ORF3a (PDB: 6XDC) is only 284 amino acids long, but a large number
of residues at both termini (1-39 at the N-terminus and 239-284 at the C-terminus) and
six internal residues (175-180) are missing in the protein structure. RoseTTAFold modelled
all these missing residues in the three proteins except nine histidine residues at the C-terminus of
RdRp. These three modelled proteins were further analysed for mutation effects.
Functional impacts of mutations
To investigate the effect of each mutation on protein function, we used the popular
PredictSNP web server [22] available at https://loschmidt.chemi.muni.cz/predictsnp/. This web
tool combines six different predictors, PhD-SNP, MAPP, SNAP, PolyPhen-1, SIFT and
PolyPhen-2, to predict whether a mutation is deleterious or neutral. PredictSNP gives a consensus
prediction score using these six predictors, which rely on different
methodologies to predict the nature of mutation. PhD-SNP, MAPP, SNAP, PolyPhen-1, SIFT
and PolyPhen-2 apply a support vector machine, physicochemical characteristics and protein
sequence alignment score, a neural network approach, an expert set of empirical rules, protein
sequence alignment score and naïve Bayes, respectively [22]. The PredictSNP score is calculated
with the following equation:
$$\text{PredictSNP score} = \frac{\sum_{i=1}^{N} \delta_i \cdot S_i}{\sum_{i=1}^{N} S_i} \qquad (1)$$
where δi is the overall prediction of predictor i (neutral: -1, deleterious: +1), Si is its transformed
confidence score and N is the number of predictors. The PredictSNP consensus score lies between
-1 and +1, where -1 to 0 denotes a neutral mutation and 0 to +1 a deleterious mutation.
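A consensus of this kind can be sketched in a few lines. The example assumes that the score is a confidence-weighted average of the ±1 votes, which is consistent with the symbol definitions and the [-1, +1] range given above; the function names and the six votes are hypothetical and are not results from this study.

```python
def consensus_score(predictions):
    """predictions: list of (delta, confidence), delta in {-1, +1}, confidence >= 0."""
    total = sum(conf for _, conf in predictions)
    if total == 0:
        raise ValueError("at least one predictor must report a non-zero confidence")
    # Each vote is weighted by its confidence, then normalized by the total confidence.
    return sum(delta * conf for delta, conf in predictions) / total

def classify(score):
    """Positive consensus means deleterious, otherwise neutral."""
    return "deleterious" if score > 0 else "neutral"

# Hypothetical votes from six predictors: +1 deleterious, -1 neutral.
votes = [(+1, 0.8), (+1, 0.6), (-1, 0.7), (+1, 0.9), (-1, 0.5), (+1, 0.5)]
score = consensus_score(votes)
label = classify(score)
```

Because every vote is ±1 and the weights are non-negative, the result always stays within -1 and +1.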
Effects of mutation on protein dynamics
The Normal mode analysis (NMA) based DynaMut [23] web tool
(http://biosig.unimelb.edu.au/dynamut/) was utilized to probe the effects of a single mutation
in each protein on protein stability and flexibility. The folding free energy change (ΔΔG) was
calculated to predict the stability of each protein under the influence of the mutation. In addition
to its own ΔΔG prediction, DynaMut also predicts ΔΔG using the NMA-based ENCoM (Elastic
network contact model) [24] and other structure-based predictors such as mCSM [25], SDM [26]
and DUET [27]. The free energy change measures the energy difference between wild-type (WT) and mutant (MT)
proteins and gives insight into protein stability. Additionally, DynaMut employs ENCoM
to predict the change in vibrational entropy energy (ΔΔSvib). The values of ΔΔSvib are calculated for WT and
MT by screening their all-atom pair interactions.
We utilized protein sequence-based Single amino acid folding free energy changes-sequence
(SAAFEC-SEQ) [28] to validate the DynaMut predictions for WT and MT proteins. This tool
is available at http://compbio.clemson.edu/SAAFEC-SEQ/ and uses different protocols such
as protein sequence properties, evolutionary details and physicochemical properties to
calculate ΔΔG value.
Linear mutual information (LMI)
To understand the dynamical nature and fluctuations of biomolecules, Dynamical cross-
correlation (DCC) and Linear mutual information (LMI) are employed widely [29–32]. Since
DCC cannot calculate the correlation of concurrently moving atoms in perpendicular directions
[33], we applied normalized LMI (nLMI) to overcome this limitation. To calculate the nLMI
of WT and MT proteins, we employed the Python-based correlationplus 0.2.1 tool [33], using
PDB files as input to the program. During the calculation, the program
used the Anisotropic network model (ANM) to generate 100 models of WT and MT proteins
and correlation was obtained using these models. Also, we calculated the difference in
correlation between WT and MT proteins. To calculate nLMI between residues i and j, the
following equation is used:
$$\mathrm{LMI}_{ij} = \frac{1}{2}\left[\ln(\det C_i) + \ln(\det C_j) - \ln(\det C_{ij})\right] \qquad (2)$$
where $C_i = \langle x_i^{T} x_i \rangle$, $C_j = \langle x_j^{T} x_j \rangle$, and $C_{ij} = \langle (x_i, x_j)^{T} (x_i, x_j) \rangle$. Here, $x_i = R_i - \langle R_i \rangle$ and $x_j = R_j - \langle R_j \rangle$, where $R_i$ and $R_j$ are the position vectors of atoms $i$ and $j$. In the nLMI calculation,
the LMI was considered greater than or equal to 0.3 and the distance threshold was less than or
equal to 7 Å. Values of 0 and 1 indicate no correlation and complete correlation of residues,
respectively.
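The covariance-based calculation can be sketched with NumPy for a single pair of atoms. This is not the correlationplus implementation: the function names and the synthetic coordinates are hypothetical, and the normalization r = sqrt(1 - exp(-2·LMI/3)), which maps LMI onto the 0 to 1 scale, is an assumption based on the commonly used generalized correlation coefficient.

```python
import numpy as np

def lmi(ri, rj):
    """Linear mutual information of two atoms from (n_models, 3) coordinate arrays."""
    xi = ri - ri.mean(axis=0)   # x_i = R_i - <R_i>
    xj = rj - rj.mean(axis=0)   # x_j = R_j - <R_j>
    c_i = np.cov(xi, rowvar=False)                     # 3 x 3
    c_j = np.cov(xj, rowvar=False)                     # 3 x 3
    c_ij = np.cov(np.hstack([xi, xj]), rowvar=False)   # 6 x 6 joint covariance
    # slogdet returns (sign, log|det|); working with logs avoids underflow.
    return 0.5 * (np.linalg.slogdet(c_i)[1] + np.linalg.slogdet(c_j)[1]
                  - np.linalg.slogdet(c_ij)[1])

def normalized_lmi(ri, rj, dim=3):
    """Assumed normalization of LMI onto [0, 1); see the lead-in paragraph."""
    return float(np.sqrt(1.0 - np.exp(-2.0 * max(lmi(ri, rj), 0.0) / dim)))

# Synthetic ensemble: one atom coupled to atom a, one atom moving independently.
rng = np.random.default_rng(0)
a = rng.normal(size=(500, 3))
b_coupled = a + 0.1 * rng.normal(size=(500, 3))
b_free = rng.normal(size=(500, 3))
r_coupled = normalized_lmi(a, b_coupled)
r_free = normalized_lmi(a, b_free)
```

Unlike a dot-product-based cross-correlation, the determinant of the joint covariance also captures coupled motions along perpendicular directions.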
Results and Discussion
SARS-CoV-2 genomes
Since the onset of the COVID-19 pandemic, several studies have identified mutations in SARS-
CoV-2 circulating genomes. However, in order to identify highly significant mutations, their
origin and frequency, we analysed SARS-CoV-2 genomic sequences in a time-series manner.
Since there can be a considerable time lapse between sample collection and sample processing,
we used the sample collection date to assign each SARS-CoV-2 genomic
sequence to a particular month. We included a total of 59541 SARS-CoV-2 genomes that were
collected from January 2020 to March 2021 (Table 1). Global distribution of the samples
revealed that the majority of the samples were from the USA followed by Australia and India
(Figure 1).
Identification of highly significant mutations in SARS-CoV-2 genomes
Once the SARS-CoV-2 genomic sequences were grouped on the basis of the month, we used
the META-CATS algorithm to identify significant mutations among the genomes. This
algorithm compares two different datasets to identify highly significant mutations among them.
As the genomic sequences collected at the start of the pandemic tend to be very similar to the
parent sequence, all the SARS-CoV-2 genomic sequences collected in the month of
January 2020 were grouped as control sequences. These control sequences were then analysed
against the sequences obtained in the subsequent months to identify highly significant
mutations in SARS-CoV-2 genomes of that month. We obtained highly significant mutations
for each month except December 2020 (Figure 2(A)). Since mutations at the nucleotide level
might not lead to a change in amino acids, we focussed our attention on the
mutations at the amino acid level (Figure 2(B-I)). Across all the months we identified 940 unique
mutations at the amino acid level, which were unevenly distributed among the genomes of
SARS-CoV-2. Our analysis identified 610, 256, 33, 2, 11, 10, 16 and 2 mutations in the
SARS-CoV-2 ORF1ab, S, ORF3a, M, ORF6, ORF8, N and ORF10 proteins respectively. Since the
length of SARS-CoV-2 proteins is highly variable, we calculated the frequency of
highly significant mutations at the amino acid level in order to understand their distribution in the
viral proteins. We observed that the Spike protein had the highest frequency of mutations (20.10),
followed by ORF6 (18.0) and ORF1ab (8.59) (Table 2). The membrane
protein of SARS-CoV-2 had the fewest mutations of all the proteins,
suggesting that this region of SARS-CoV-2 is highly conserved. The
SARS-CoV-2 variant named Omicron has been shown to carry more than 40 mutations in the Spike
protein, suggesting that this protein is highly amenable to mutations [34].
Additionally, it was also observed that some mutations including 241C>T, 3037C>T,
14408C>T and 23403A>G which originated in February-April 2020 were consistently present
till March 2021 suggesting that these mutations might play some critical role in viral
pathogenesis.
Correlation among highly significant SARS-CoV-2 mutations
Co-occurrence of several mutations has been shown to modulate the function of the proteins
[35]. Therefore, we sought to understand whether there was any association among the
mutations that we had identified in this study. For this purpose, we utilized two well-established
statistical approaches including Pearson Correlation and Hierarchical Clustering to identify the
mutations that were correlated among each other.
Pearson Correlation
Analysis of SARS-CoV-2 sequences in a time-series manner led us to the identification of
several highly significant mutations. In order to identify correlations among mutations in
SARS-CoV-2, a binary matrix was created. As discussed earlier, Pearson Correlation was
performed on the binary matrix, with “1” representing a significant mutation and “0”
representing the absence of a mutation in SARS-CoV-2 genomes. The correlation values range from -1
to +1, with negative values indicating negative correlation and positive values indicating
positive correlation. Absolute values closer to 1 indicate very strong
correlations. Therefore, the results obtained from the Pearson Correlation were filtered to
retain only those mutations where the absolute value of the correlation coefficient was greater
than 0.4. Using this criterion, we obtained 2205 mutation pairs (Figure 3(A)). Although these
mutation pairs were highly correlated, the frequency of the majority of these
mutations was very low. For instance, a correlated mutation pair such as 21306 and 22995, with
absolute correlation >0.4 but occurrence in less than 5% of the genomes, might not be of
interest. Therefore, to consider only statistically significant pairs of correlated mutations, we
filtered the results to retain only those correlated mutations that were present in >30% of the
genomes. Using this stringent criterion, we were able to identify 9 mutations (16 mutation
pairs) that had an absolute value of correlation > 0.4 and were present in >30% of the genomes
(Figure 3(B) and Table 3). It was further observed that six mutation pairs were present in >89%
of the genomes suggesting their possible role in viral fitness. Our analysis captured a highly
prevalent mutation in Spike protein (D614G) that has nearly replaced the wild-type genome
and has been shown to increase the viral infectivity [12]. The identification of the D614G mutation
further validated our approach and prompted us to explore the other
mutation pairs that were identified.
Hierarchical Clustering
In order to garner confidence in our approach, we used another widely used statistical approach
to identify the clusters among highly significant mutations in SARS-CoV-2. Since Hierarchical
Clustering is a computationally intensive process, we analysed only the top 25 mutations that
were highly significant and were present in >10% of the genomes used in this study (Table 4).
We obtained a binary matrix for the top 25 highly significant mutations. Similar to the results
obtained using Pearson Correlation, Hierarchical Clustering analysis led to clustering of the
similar mutations. Here we analysed only those mutation clusters that were present in greater
than 30% of genomes (Figure 3(C)). It can be observed that 241C>T, 14408C>T, 3037C>T
and 23403A>G form a cluster and are the most common concurrent mutations, followed by
28881G>A, 28882G>A and 28883G>C. Since both statistical tools provided similar
results, we then focussed our attention on these mutations for an in-depth understanding.
Frequency and global distribution of highly correlated and frequent significant
mutations
Once the correlated pairs of highly significant mutations that had an absolute correlation
coefficient >0.4 and were present in >30% of genomes were identified, we investigated their
global distribution among the circulating SARS-CoV-2 genomes. It can be observed that
C241T mutation in the 5’UTR region completely replaced the wild-type genomes as early as
June-July 2020 (Figure 4). Similar trends were observed with the mutations C3037T, C14408T,
and A23403G suggesting their critical role in viral pathogenesis. However, some of the
mutations showed a mosaic pattern of global distribution with increase in time.
Mutation in the 5’ UTR
The untranslated region of the viral genome plays a vital role in viral replication. The region
has been shown to form various secondary structures that allow binding of cellular and viral
proteins, thereby regulating the translation of viral proteins [36,37]. Therefore, any mutation
in these highly conserved regions has the potential to regulate viral replication. Statistical
approaches revealed that the 241C>T mutation was highly correlated with three different
mutations, 3037C>T (F106F), 14408C>T (P323L) and 23403A>G (D614G), in the SARS-CoV-2
Nsp3, RdRp and Spike proteins respectively. Remarkably, the
correlation of mutation 241C>T with all the other mutations mentioned above was >0.75,
pointing towards a very strong correlation. Additionally, these mutations were found in >89%
of the genomes, further pointing towards the critical role of these mutations in viral evolution.
These observations were further supported by Hierarchical Clustering, where these mutations
were clustered together. Our results are in agreement with published studies which have shown
similar correlations among these mutations [14]. However, these studies were conducted on
genomes from various countries including Israel, USA, and Bangladesh, whereas our
analysis is based on SARS-CoV-2 genomes collected globally. The correlation of the
241C>T mutation with highly occurring mutations in the SARS-CoV-2 genomes points towards
its role in viral pathogenesis and fitness.
Mutation in the Nsp2 protein
The Nsp2 protein of SARS-CoV-2 was recently shown to be associated with host proteins
involved in vesicle trafficking. It was also proposed that targeting viral protein N, Nsp2 and
Nsp8 interactions with host translational machinery might have therapeutic effects [38].
Therefore, understanding the dynamics of Nsp2 protein becomes essential. Our analysis
revealed that mutation 1059C>T (T85I) in viral Nsp2 protein was both positively and
negatively correlated with other mutations in SARS-CoV-2. As described in Table 3, 1059C>T
(T85I) mutation in Nsp2 protein was positively correlated with 25563G>T (Q57H) mutation
in ORF3a. The correlation coefficient of 0.863 and the presence of these mutations in >40% of the
genomes suggest that the co-occurrence of these mutations might play a role in viral evolution.
These observations are in agreement with earlier studies where co-occurrence of 1059C>T
(T85I) with 25563G>T (Q57H) was observed in nearly 70% of COVID cases across the USA
[15].
Additionally, we observed that the 1059C>T (T85I) mutation in Nsp2 protein was negatively
associated with three mutations, 28881G>A (R203K), 28882G>A (R203R) and 28883G>C
(G204R), in the N protein. Although the co-occurrence of these three N protein mutations has
been established in another study [14], here we show that they are negatively
correlated with 1059C>T (T85I) in SARS-CoV-2 genomic sequences from across the world.
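For two mutations A and B scored as present (1) or absent (0) in each genome, the Pearson coefficient reduces to the phi coefficient of the 2×2 co-occurrence table, which may help in reading the signs reported here. The binary encoding and the counts $n_{11}$, $n_{10}$, $n_{01}$ and $n_{00}$ are notation assumed for this illustration and are not taken from the original analysis.

```latex
r_{AB} = \frac{n_{11}\,n_{00} - n_{10}\,n_{01}}
              {\sqrt{(n_{11}+n_{10})\,(n_{01}+n_{00})\,(n_{11}+n_{01})\,(n_{10}+n_{00})}}
```

Here $n_{11}$ counts genomes carrying both mutations, $n_{10}$ and $n_{01}$ count genomes carrying only A or only B, and $n_{00}$ counts genomes carrying neither. The coefficient is negative whenever $n_{10}\,n_{01} > n_{11}\,n_{00}$, that is, when the two mutations tend to occur in different genomes rather than together.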
Since the T85I mutation was widespread in the Nsp2 protein, we sought to investigate the effect of
this mutation on protein function. The full-length 3.2 Å structure of Nsp2 (PDB:
7SMW) was solved by combining cryo-EM with the recently developed AI tool AlphaFold2 [39].
The structure contains a highly conserved zinc-binding site, which indicates a role
for Nsp2 in RNA binding. Figure 5(A) shows the structure of Nsp2. In the T85I mutation,
the polar threonine residue at position 85 is substituted by hydrophobic isoleucine in the
β-hairpin loop formed by two β-strands (1: 78-82 a.a. & 2: 98-104 a.a.). PredictSNP predicted that
this mutation is deleterious, with a confidence score of around 70%. The ENCoM-based
ΔΔSvib value suggests that the mutation introduces some flexibility into the Nsp2 protein. It can be
seen from Figure 6(A) that two helices (1: 19-28 a.a. & 2: 35-45 a.a.) at the N-terminus gain
slight flexibility (red colour). Four of the six predictors returned a negative ΔΔG, which implies
destabilization of Nsp2 (Table 5). Our results on Nsp2 stability and flexibility are in accordance
with previously published reports [15]. In both the WT and MT structures, the same two residues
(Phe83 & Asn87) interact with residue 85. In both structures, van der Waals clashes were observed:
between the sidechain oxygen of Thr85 and the aromatic carbons of Phe83 in WT, and between the
sidechain methyl carbon of Ile85 and the aromatic carbons of Phe83 in MT. In WT, the Thr85 amide
group oxygen and nitrogen interact with surrounding amide group atoms of Phe83 and Asn87
through hydrophobic, vdW and polar interactions. However, in MT, similar interactions were
noted, but a polar-vdW clash was also observed between Asn87 and Ile85. This might be the
cause of the predicted instability of the T85I mutation in Nsp2 protein. WT and MT interactions
are illustrated in Figure 6(D).
Mutations in the Nsp3 protein
The Nsp3 protein in coronaviruses has been shown to antagonize the innate immune responses
[40]. The mutations in the Nsp3 macrodomain region lead to enhanced type I IFN responses
and reduced viral replication [41]. Understanding the dynamics of mutations in Nsp3 might
provide clues on SARS-CoV-2 mediated evasion of type I IFN signalling. We identified a
synonymous mutation 3037C>T (F106F) in SARS-CoV-2 Nsp3 protein. Though silent in
nature, this mutation was shown to disrupt the mir-197-5p target sequence [42]. Mir-197-5p
has also been associated with several other viruses [43–45], pointing towards its
role in viral biology. This mutation was highly correlated with various mutations spread across
the genome of SARS-CoV-2. It was positively correlated with 241C>T in the 5'UTR region,
23403A>G (D614G) in S protein and 14408C>T (P323L) in the RNA-dependent RNA
polymerase (RdRp). These correlated mutation pairs have been shown to be critical in the
evolution of more infectious forms of SARS-CoV-2. The 14408C>T mutation was shown to
increase the mutation rate of SARS-CoV-2, whereas D614G has been shown to contribute
towards the infectivity of the virus [16]. The presence of all these mutations in >90% of the
genomes further points towards their critical role in driving viral evolution.
Mutation in the RdRp protein
The RdRp (Nsp12) protein of SARS-CoV-2 is the RNA-dependent RNA polymerase,
which is important for viral replication and transcription. This protein is also believed to be the
most prominent target for potential antiviral drugs [46]. Therefore, understanding the mutations
in this protein becomes critical for RdRp-based drug design. We identified a highly significant
mutation, 14408C>T (P323L), in this protein, which was present in >89% of genomes, suggesting
that the mutation is now an established part of the circulating genomes. Apart from its widespread
presence, this mutation was strongly associated with other mutations including 241C>T
in the 5’UTR, 3037C>T in Nsp3, and 23403A>G in S, with high absolute correlation
values of 0.73, 0.94, and 0.94 respectively. Interestingly, the P323L and D614G mutations were
reported in severe COVID-19 patients as compared to mild ones, suggesting their possible
role in viral pathology [47]. Owing to the widespread presence of the 14408C>T (P323L) mutation
in RdRp, we sought to study its putative effect on the stability of WT RdRp. The 2.83 Å
structure of RdRp in complex with Nsp7, Nsp8, Nsp9, and helicase was determined using
cryo-EM [48]. The RdRp structure comprises the RdRp domain (367-920 a.a.) [49] and an
N-terminal region (60-249 a.a.) that adopts the Nidovirus RdRp-associated nucleotidyltransferase
(NiRAN) structural scaffold [50]. Another region (4-118 a.a.) is composed of two helices and five
antiparallel β-strands. Additionally, a short β-strand (215-218 a.a.) was observed in RdRp, which is
more highly ordered in SARS-CoV-2 than in SARS-CoV. This β-strand contacts
another β-strand (96-100 a.a.) and consequently increases the conformational stability of
RdRp in SARS-CoV-2 [49]. The structure of RdRp is given in Figure 5(B).
The P323L mutation lies in the RdRp interface domain, specifically in the loop region that
connects the interface domain’s three helices to the same domain’s three β-strands. An earlier
study suggests that this mutation enhances the processivity of the RdRp protein [51]. It is predicted
to be functionally neutral, with a confidence score of 83%. In this mutation, a
conformationally rigid proline ring is replaced by leucine, which has a flexible side chain.
Since both the WT and MT residues are hydrophobic, the difference in their conformational
flexibility is likely the deciding factor for protein stability and flexibility. Nonetheless, this
mutation significantly rigidifies RdRp (Figure 6(B)), and its ΔΔSvib value was the lowest observed
(Table 5). Results show that this mutation has a strong communication network in RdRp and impacts
various helices and β-sheets. The P323L mutation is located in a loop between a helix (304-320 a.a.)
and a β-strand (328-335 a.a.); thus, these two secondary structures gain rigidity. However,
a helix near the mutation site gained greater rigidity than other regions of RdRp.
The helices at the N- and C-terminal domains also gained rigidity due to this mutation. All-atom
simulation data also suggested that the P323L mutation reduces RdRp motions, which
is in line with our results [52]. In the ΔΔG predictions, three predictors predicted
stabilization and the remaining three predicted destabilization, but the magnitudes of the
stabilizing ΔΔG values are considerably larger than those of the destabilizing values. Mohammad et
al. performed a 200 ns all-atom MD simulation and, by calculating the free energy (ΔG) of WT
and MT RdRp, confirmed that the P323L mutation increases the stability of MT RdRp [52].
Hence, this mutation stabilizes the RdRp structure.
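The reasoning used to combine several ΔΔG predictors (a majority of signs and, when the votes are tied, the relative magnitude of the predicted values) can be written down explicitly. The sketch below is one possible formalization, not the procedure used to build Table 5. The `summarize_ddg` helper, the tool names and all numerical values are hypothetical, and the sign convention (negative ΔΔG = destabilizing) is assumed from the text.

```python
def summarize_ddg(predictions):
    """Summarize signed ddG predictions (kcal/mol).

    Assumed convention: negative ddG = destabilizing, positive = stabilizing.
    """
    n_stab = sum(1 for v in predictions.values() if v > 0)
    n_destab = sum(1 for v in predictions.values() if v < 0)
    net = sum(predictions.values())
    if n_stab != n_destab:
        # Clear majority of predictors agree on the sign.
        call = "stabilizing" if n_stab > n_destab else "destabilizing"
    elif net != 0:
        # Tie in votes: fall back on the sign of the summed ddG values.
        call = "stabilizing" if net > 0 else "destabilizing"
    else:
        call = "undetermined"
    return {"n_stabilizing": n_stab, "n_destabilizing": n_destab,
            "net_ddg": net, "call": call}


# Hypothetical values for illustration only; they are not the entries of Table 5.
toy_tie = {"tool_1": 0.5, "tool_2": 0.75, "tool_3": 1.0,
           "tool_4": -0.25, "tool_5": -0.125, "tool_6": -0.25}
toy_majority = {"tool_1": -0.5, "tool_2": -1.0, "tool_3": -0.25,
                "tool_4": -0.75, "tool_5": 0.25, "tool_6": 0.5}

tie_summary = summarize_ddg(toy_tie)
majority_summary = summarize_ddg(toy_majority)
```

The first toy case corresponds to a three-versus-three split resolved by magnitude, and the second to a four-of-six majority. A consensus of this kind is only as reliable as the individual predictors, which is why the text also cites independent simulation data.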
Analysis of the RdRp WT and MT interactions revealed a greater number of
interactions in MT than in WT. The WT and MT residues are surrounded by
Phe321, Ser325, Phe326, Arg349 and Phe396 residues. In WT, only a single polar interaction
was observed between Ser325 and Pro323, whilst in MT, two additional hydrogen bonds were
observed through Ser325 and Phe326, along with a polar interaction through Arg349. Thus, the
higher stability of MT may arise from these interactions. Figure 6(E) shows the
interactions in WT and MT RdRp proteins.
Mutation in the ORF3a protein
The largest accessory protein ORF3a plays a key role in the viral infection cycle. Moreover,
this protein is essential for viral replication, and mutations in this protein are associated with
higher mortality rates [53]. We identified a highly significant mutation, 25563G>T (Q57H), in the
SARS-CoV-2 ORF3a protein. Interestingly, this mutation has been shown to be associated with
fewer deaths and more cases of COVID-19 [54]. We further show that the 25563G>T
(Q57H) mutation is positively associated with the 1059C>T mutation in Nsp2 protein, whereas it
is negatively associated with the 28881G>A, 28882G>A and 28883G>C mutations in N protein. Our
observations are in agreement with previous studies which have identified similar associations
within genomes of SARS-CoV-2 isolated from Israel [14].
The ORF3a functional domains are vital for SARS-CoV-2 infectivity, virulence, ion channel
synthesis and release of virus [55]. A previous study showed that ORF3a in SARS-CoV-2 has a
weaker potential for pro-apoptotic activity than SARS-CoV ORF3a. The difference
in pro-apoptotic activity might be linked to the infectivity of the viruses [56]. Furthermore,
another study has confirmed that ORF3a binds to the homotypic fusion and protein sorting
(HOPS) complex and prevents autolysosome formation [57]. ORF3a is also considered a potential
vaccine and drug target [58,59]. The experimental structure of ORF3a (PDB: 6XDC) was
determined using cryo-EM at 2.1 Å resolution. ORF3a has three main regions: N-terminal
(1-39 a.a.), cytoplasmic loop (175-180 a.a.) and C-terminal (239-275 a.a.) [60]. Figure 5(C) shows
the structure of ORF3a. Finally, the Q57H mutation in ORF3a was studied. In this mutation,
the polar glutamine residue is replaced by the polar, basic histidine residue. This mutation is
situated in the helix region of ORF3a. It is predicted to be deleterious with a 76%
confidence score (Table 5); thus, it might alter the functions of the ORF3a protein. This mutation
was reported to be predominant and deleterious in previous studies [55,61].
The vibrational entropy energy (ΔΔSvib) value was negative; hence, ORF3a gains
rigidity and becomes less flexible upon introduction of this mutation. Similar to RdRp, the
mutant residue in ORF3a also has wide communication dynamics. This single mutation at the helix
increases the rigidity of the whole ORF3a protein (Figure 6(C)). Our finding supports a previous
study on the Q57H mutation and its rigidity [15]. This mutation was predicted to be destabilizing by
the majority of predictors based on ΔΔG values (Table 5). WT and MT residues are in close
proximity to Leu52, Leu53, Ala54, Val55, Ala59, Ser60, Lys61, Val77 and Cys81. In WT, two
hydrogen bonds are formed between the amide amino groups of Ser60 and Lys61 and the amide
carbonyl group of the WT residue. Identical hydrogen bonds are also present in the MT
structure. Other types of interactions, such as polar and hydrophobic, were observed in both WT and
MT ORF3a. However, two new clashes are seen in the MT structure: between the histidine ring and
Lys61, and between the mutant amide group and the Leu53 residue. Thus, the overall number of clashes
increases in MT, and this might be the leading cause of the destabilization of ORF3a protein by
the Q57H mutation. This mutation was previously predicted to have significant potential
to alter ORF3a conformations and to disrupt intramolecular hydrogen bonds in
ORF3a [54]. Thus, our finding that the Q57H mutation destabilizes ORF3a
is consistent with that study. WT and MT interactions are illustrated in Figure 6(F).
Mutation in the Spike protein
Spike is a homotrimeric transmembrane protein that studs the surface of SARS-CoV-2, thereby
giving it a crown-like shape, and binds receptors on host cells to enable the virus to enter
the cells. The Spike protein of SARS-CoV-2 is cleaved in the infected cells and
consists of two subunits that remain non-covalently associated with each other. One of the subunits, S1,
binds to the ACE2 receptor on the target cells, whereas the other subunit, S2, mediates membrane fusion
[62,63]. The D614G mutation in the spike protein has been shown to increase the
infectivity of the virus. In our previous study [20], we characterized the effect of the
D614G mutation on protein activity and suggested that the mutation led to decreased protein
stability but enhanced protein movement. In the present study, we obtained a correlation of the
D614G mutation in spike protein with mutations in the 5’UTR (241C>T), Nsp3 protein (3037C>T)
and RdRp (14408C>T). The presence of these co-mutation pairs in >96% of the genomes points towards the
critical role that these mutations play in viral fitness.
Mutation in the Nucleocapsid protein
Nucleocapsid protein is one of the most conserved proteins among SARS coronaviruses. This
protein is known to interact with viral RNA as well as M protein to aid virion assembly. This
protein is also shown to play a role in regulating host immune responses [64] and cellular
apoptosis [65]. SARS-CoV-2 Nucleocapsid protein acts as a viral RNAi suppressor and
has been shown to antagonize cellular RNAi pathways [66]. Thus, understanding the role of
mutations in modulating the function of this protein becomes important. Our statistical analysis
and time series data analysis identified a negative correlation of 28881G>A, 28882G>A and
28883G>C with mutations in Nsp2 (1059C>T) and ORF3a (25563G>T). These
three mutations are correlated with each other. Among the three mutations in the Nucleocapsid
protein, mutations at 28881 and 28883 are missense mutations whereas 28882 is a synonymous
mutation. Since mutations in the nucleocapsid protein are known to increase the infectivity and
virulence of the virus [67], the correlation of these mutations with other mutations
warrants further study. In our recent report [20], we used computational tools to characterize
the mutations in N protein and reported that the mutation at 28881 (R203K) stabilizes the
parent protein whereas that at 28883 (G204R) destabilizes the N protein.
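The distinction between missense and synonymous changes at positions 28881 to 28883 follows directly from the codons involved, as the short sketch below illustrates. It assumes details of the reference genome NC_045512.2 that are not restated in this paper, namely that the N gene starts at genomic position 28274 and that codons 203 and 204 read AGG and GGA. The `codon_number` and `classify` helpers are hypothetical, and only the codons needed here are tabulated.

```python
# Assumed reference details (NC_045512.2): N gene start and the two affected codons.
N_START = 28274
REF_CODONS = {203: "AGG", 204: "GGA"}

# Minimal excerpt of the standard genetic code, limited to the codons used below.
CODON_TABLE = {"AGG": "R", "AAG": "K", "AGA": "R", "GGA": "G", "CGA": "R"}


def codon_number(position, gene_start):
    """1-based codon index of a 1-based genomic position within a gene."""
    return (position - gene_start) // 3 + 1


def classify(position, alt_base):
    """Return (codon number, ref amino acid, alt amino acid, effect) for one substitution."""
    codon = codon_number(position, N_START)
    offset = (position - N_START) % 3  # 0, 1 or 2 within the codon
    ref = REF_CODONS[codon]
    alt = ref[:offset] + alt_base + ref[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref], CODON_TABLE[alt]
    effect = "synonymous" if ref_aa == alt_aa else "missense"
    return codon, ref_aa, alt_aa, effect


# Each substitution is evaluated on its own against the reference codon.
results = {
    "28881G>A": classify(28881, "A"),
    "28882G>A": classify(28882, "A"),
    "28883G>C": classify(28883, "C"),
}
```

Under these assumptions the first and third substitutions change the encoded amino acid (R203K and G204R), whereas the second, taken alone, leaves arginine unchanged (R203R), in line with the annotation used in the text.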
Impacts of mutations on protein dynamics
Protein structure and function can be significantly altered by single point
mutations [68–70]. Investigating the structural or functional impacts of point mutations in all
proteins is a mammoth task; thus, computational algorithms and tools such as normal mode
analysis (NMA) models [71], Gaussian network models (GNM) [72], Anisotropic network models
(ANM) [73], Elastic network models (ENM) [74], Discrete molecular dynamics (DMD) [75],
All-atom molecular dynamics (AAMD) simulation [76] and protein evolutionary data are currently used.
Therefore, we employed numerous computational tools to probe the effects of mutations on
protein structures. The predicted results of the mutations are given in Table 5.
Linear mutual information (LMI) analysis
The normalized LMI (nLMI) gives insight into the protein residue network and dynamic correlation. Figure
7 illustrates the nLMI correlation and correlation difference plots of Nsp2, RdRp and ORF3a
proteins and their mutants. It can be seen from Figure 7(A-C) that Nsp2 and ORF3a protein
residues are strongly correlated (>0.625) as compared to RdRp, where correlation among
residues is considerably lower (<0.500). To understand the impact of each
mutation on its protein, we obtained correlation differences between the WT and MT
structures of all three proteins. The correlation difference plots are shown in Figure
7(D-F). The yellow regions indicate no correlation or slight correlation (0.00-0.25), whereas
cyan regions indicate slight anticorrelation between the residues. In RdRp
and ORF3a MT structures, residues have significantly less correlation. However, residues in the
Nsp2 MT structure are slightly anticorrelated. Thus, the T85I mutation in Nsp2 causes a slight
disruption in the structure that can be interpreted as destabilization of the Nsp2 MT structure. In contrast,
P323L in RdRp and Q57H in ORF3a do not bring notable destabilization to their mutant
structures.
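For readers unfamiliar with the measure, a commonly used definition of linear mutual information and its normalization is sketched below. It follows the generalized correlation formalism on which tools such as correlationplus [33] build. The symbols $C_i$, $C_j$, $C_{ij}$ and $d$ are introduced here for illustration, and the exact settings used in this study are not restated.

```latex
I_{\mathrm{lin}}(i,j) = \tfrac{1}{2}\left[\ln\det C_i + \ln\det C_j - \ln\det C_{ij}\right],
\qquad
\mathrm{nLMI}(i,j) = \sqrt{1 - \exp\!\left(-\frac{2\,I_{\mathrm{lin}}(i,j)}{d}\right)}
```

Here $C_i$ and $C_j$ are the $3\times3$ covariance matrices of the positional fluctuations of residues $i$ and $j$, $C_{ij}$ is their $6\times6$ joint covariance matrix, and $d = 3$ is the dimensionality of each residue's coordinates. Under this definition nLMI lies between 0 (uncorrelated motion) and 1 (fully correlated motion), so negative values can only arise in a difference plot, where they reflect a lower correlation in one structure than in the other.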
Conclusion
Since the onset of the SARS-CoV-2 pandemic in December 2019, the virus has mutated
significantly. These mutations have led to the emergence of mutants that have the capacity
to evade vaccines and antiviral therapies. It now seems that the virus is evolving to be more
infectious and less virulent. Therefore, understanding the dynamics of mutations in the viral
genome is of utmost importance. In this regard, we performed time-series analysis of viral
genomes to understand the origin and frequency of significant mutations in
the SARS-CoV-2 genome. Additionally, we used Pearson Correlation and Hierarchical
Clustering to identify correlations among highly significant mutations. We
identified sixteen mutation pairs that had an absolute correlation value >0.4 and were present in
>30% of the genomes analysed in this study. We identified a strong correlation
(>0.73) of the mutation at position 241 in the 5’UTR with 3037
(F106F) in Nsp3, 23403 (D614G) in Spike, and 14408 (P323L) in RdRp. These
mutations were found in >89% of the genomes that were studied.
To investigate the impacts of the T85I, P323L, and Q57H mutations in Nsp2, RdRp, and ORF3a
respectively on their structural stability and flexibility, we employed structure- and
sequence-based tools. The free energy change (ΔΔG) and vibrational entropy energy (ΔΔSvib) terms were
used to evaluate the stability and flexibility of proteins respectively. Results showed that
two of the three mutations (T85I in Nsp2 & Q57H in ORF3a) were predicted to be destabilizing while
P323L was predicted to be stabilizing. Consensus predictors were also used to predict the impact of
each mutation on protein function. T85I in Nsp2 and Q57H in ORF3a were predicted to be
deleterious, which implies that they alter protein function. However, P323L in RdRp
was predicted to be neutral, which suggests that this mutation does not affect protein
function. The graph theory-based nLMI correlation was also obtained for the WT and MT
structures of the three proteins to understand residue communication in proteins. The Nsp2 and
ORF3a residues have greater correlation in comparison with RdRp. Correlation difference
analysis suggests that T85I in Nsp2 is destabilizing whereas P323L in RdRp and Q57H in
ORF3a are slightly stabilizing. The fact that some of the mutations have destabilizing effects
but have very high frequency suggests that these might be playing some role in viral fitness.
Further experimental evidence is required to study the effect of these co-mutations on viral
transmission and pathogenesis.
Acknowledgements
NP is thankful to UGC for a PhD fellowship. VS received a research grant from UGC, Govt. of
India. SBR is thankful to his Chemistry Department for providing computational and
infrastructure facilities.
Author contribution
NP, SS, SBR, AJ, GS performed the experiments and analyzed the data. BK contributed to the
statistical experiments. VS, PA, BK, RPB, SBR, KRS designed and supervised the study. NP,
SBR, SS, and VS wrote the first draft. VS, PA, BK, SBR, RPB and KRS edited and finalized
the draft.
Conflict of Interest
The authors declare no conflict of interest.
ORCID
Shravan B. Rathod https://orcid.org/0000-0002-1870-2508
Ravi P. Barnwal https://orcid.org/0000-0003-3156-5357
Vikas Sood https://orcid.org/0000-0001-6128-4279
References
[1] D.X. Liu, J.Q. Liang, T.S. Fung, Human Coronavirus-229E, -OC43, -NL63, and -HKU1
(Coronaviridae), Encyclopedia of Virology. (2021) 428–440.
https://doi.org/10.1016/B978-0-12-809633-8.21501-X.
[2] B. Hu, H. Guo, P. Zhou, Z.-L. Shi, Characteristics of SARS-CoV-2 and COVID-19, Nat
Rev Microbiol. (2020) 1–14. https://doi.org/10.1038/s41579-020-00459-7.
[3] P. V’kovski, A. Kratzel, S. Steiner, H. Stalder, V. Thiel, Coronavirus biology and
replication: implications for SARS-CoV-2, Nature Reviews Microbiology. 19 (2021) 155–
170.
[4] N. Redondo, S. Zaldívar-López, J.J. Garrido, M. Montoya, SARS-CoV-2 accessory
proteins in viral pathogenesis: knowns and unknowns, Frontiers in Immunology. (2021)
2698.
[5] M. Sevajol, L. Subissi, E. Decroly, B. Canard, I. Imbert, Insights into RNA synthesis,
capping, and proofreading mechanisms of SARS-coronavirus, Virus Research. 194 (2014)
90–99.
[6] F. Ferron, L. Subissi, A.T.S. De Morais, N.T.T. Le, M. Sevajol, L. Gluais, E. Decroly, C.
Vonrhein, G. Bricogne, B. Canard, Structural and molecular basis of mismatch correction
and ribavirin excision from coronavirus RNA, Proceedings of the National Academy of
Sciences. 115 (2018) E162–E171.
[7] L. Wang, G. Cheng, Sequence analysis of the emerging SARS-CoV-2 variant Omicron in
South Africa, Journal of Medical Virology. (2021).
[8] P.A. Christensen, R.J. Olsen, S.W. Long, S. Subedi, J.J. Davis, P. Hodjat, D.R. Walley,
J.C. Kinskey, M.O. Saavedra, L. Pruitt, Delta variants of SARS-CoV-2 cause significantly
increased vaccine breakthrough COVID-19 cases in Houston, Texas, The American
Journal of Pathology. 192 (2022) 320–331.
[9] A. Rahimi, A. Mirzazadeh, S. Tavakolpour, Genetics and genomics of SARS-CoV-2: A
review of the literature with the special focus on genetic diversity and SARS-CoV-2
genome detection, Genomics. 113 (2021) 1221–1232.
[10] D. Kim, J.-Y. Lee, J.-S. Yang, J.W. Kim, V.N. Kim, H. Chang, The architecture of
SARS-CoV-2 transcriptome, Cell. 181 (2020) 914–921.
[11] A.M. Rice, A. Castillo Morales, A.T. Ho, C. Mordstein, S. Mühlhausen, S. Watson, L.
Cano, B. Young, G. Kudla, L.D. Hurst, Evidence for strong mutation bias toward, and
selection against, U content in SARS-CoV-2: implications for vaccine design, Molecular
Biology and Evolution. 38 (2021) 67–83.
[12] B. Korber, W.M. Fischer, S. Gnanakaran, H. Yoon, J. Theiler, W. Abfalterer, N.
Hengartner, E.E. Giorgi, T. Bhattacharya, B. Foley, Tracking changes in SARS-CoV-2
spike: evidence that D614G increases infectivity of the COVID-19 virus, Cell. 182 (2020)
812–827.
[13] R. Wang, J. Chen, G.-W. Wei, Mechanisms of SARS-CoV-2 evolution revealing
vaccine-resistant mutations in Europe and America, The Journal of Physical Chemistry
Letters. 12 (2021) 11850–11857.
[14] N.S. Zuckerman, E. Bucris, Y. Drori, O. Erster, D. Sofer, R. Pando, E. Mendelson, O.
Mor, M. Mandelboim, Genomic variation and epidemiology of SARS-CoV-2 importation
and early circulation in Israel, PloS One. 16 (2021) e0243265.
[15] R. Wang, J. Chen, K. Gao, Y. Hozumi, C. Yin, G.-W. Wei, Analysis of SARS-CoV-2
mutations in the United States suggests presence of four substrains and novel variants,
Communications Biology. 4 (2021) 1–14.
[16] M.M. Rahman, S.B. Kader, S.S. Rizvi, Molecular characterization of SARS-CoV-2
from Bangladesh: Implications in genetic diversity, possible origin of the virus, and
functional significance of the mutations, Heliyon. 7 (2021) e07866.
[17] Y. Chen, S. Li, W. Wu, S. Geng, M. Mao, Distinct mutations and lineages of SARS-CoV-2
virus in the early phase of COVID-19 pandemic and subsequent one-year global
expansion, Journal of Medical Virology. (2021).
[18] B.E. Pickett, E.L. Sadat, Y. Zhang, J.M. Noronha, R.B. Squires, V. Hunt, M. Liu, S.
Kumar, S. Zaremba, Z. Gu, ViPR: an open bioinformatics database and analysis resource
for virology research, Nucleic Acids Research. 40 (2012) D593–D598.
[19] B. Pickett, M. Liu, E. Sadat, R. Squires, J. Noronha, S. He, W. Jen, S. Zaremba, Z. Gu,
L. Zhou, Metadata-driven comparative analysis tool for sequences (meta-CATS): an
automated process for identifying significant sequence variations that correlate with virus
attributes, Virology. 447 (2013) 45–51.
[20] N. Periwal, S.B. Rathod, R. Pal, P. Sharma, L. Nebhnani, R.P. Barnwal, P. Arora, K.R.
Srivastava, V. Sood, In silico characterization of mutations circulating in SARS-CoV-2
structural proteins, Journal of Biomolecular Structure and Dynamics. (2021) 1–16.
[21] M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, S. Ovchinnikov, G.R. Lee, J. Wang,
Q. Cong, L.N. Kinch, R.D. Schaeffer, Accurate prediction of protein structures and
interactions using a three-track neural network, Science. 373 (2021) 871–876.
[22] J. Bendl, J. Stourac, O. Salanda, A. Pavelka, E.D. Wieben, J. Zendulka, J. Brezovsky,
J. Damborsky, PredictSNP: robust and accurate consensus classifier for prediction of
disease-related mutations, PLoS Computational Biology. 10 (2014) e1003440.
[23] C.H. Rodrigues, D.E. Pires, D.B. Ascher, DynaMut: predicting the impact of mutations
on protein conformation, flexibility and stability, Nucleic Acids Research. 46 (2018)
W350–W355.
[24] V. Frappier, R.J. Najmanovich, A coarse-grained elastic network atom contact model
and its use in the simulation of protein dynamics and the prediction of the effect of
mutations, PLoS Computational Biology. 10 (2014) e1003569.
[25] D.E. Pires, D.B. Ascher, T.L. Blundell, mCSM: predicting the effects of mutations in
proteins using graph-based signatures, Bioinformatics. 30 (2014) 335–342.
[26] C.L. Worth, R. Preissner, T.L. Blundell, SDM—a server for predicting effects of
mutations on protein stability and malfunction, Nucleic Acids Research. 39 (2011) W215–
W222.
[27] D.E. Pires, D.B. Ascher, T.L. Blundell, DUET: a server for predicting effects of
mutations on protein stability using an integrated computational approach, Nucleic Acids
Research. 42 (2014) W314–W319.
[28] G. Li, S.K. Panday, E. Alexov, SAAFEC-SEQ: a sequence-based method for predicting
the effect of single point mutations on protein thermodynamic stability, International
Journal of Molecular Sciences. 22 (2021) 606.
[29] C.F. Negre, U.N. Morzan, H.P. Hendrickson, R. Pal, G.P. Lisi, J.P. Loria, I. Rivalta, J.
Ho, V.S. Batista, Eigenvector centrality for characterization of protein allosteric pathways,
Proceedings of the National Academy of Sciences. 115 (2018) E12201–E12208.
[30] D.L. Penkler, C. Atilgan, Ö. Tastan Bishop, Allosteric modulation of human Hsp90α
conformational dynamics, Journal of Chemical Information and Modeling. 58 (2018) 383–
404.
[31] A. Sethi, J. Eargle, A.A. Black, Z. Luthey-Schulten, Dynamical networks in tRNA:
protein complexes, Proceedings of the National Academy of Sciences. 106 (2009) 6620–
6625.
[32] A.T. Van Wart, J. Durrant, L. Votapka, R.E. Amaro, Weighted implementation of
suboptimal paths (WISP): an optimized algorithm and tool for dynamical network analysis,
Journal of Chemical Theory and Computation. 10 (2014) 511–517.
[33] M. Tekpinar, B. Neron, M. Delarue, Extracting Dynamical Correlations and Identifying
Key Residues for Allosteric Communication in Proteins by correlationplus, J. Chem. Inf.
Model. 61 (2021) 4832–4838. https://doi.org/10.1021/acs.jcim.1c00742.
[34] D. Tian, Y. Sun, H. Xu, Q. Ye, The emergence and epidemic characteristics of the
highly mutated SARS-CoV-2 Omicron variant, Journal of Medical Virology. n/a (n.d.).
https://doi.org/10.1002/jmv.27643.
[35] S. Del Veliz, L. Rivera, D.M. Bustos, M. Uhart, Analysis of SARS-CoV-2 nucleocapsid
phosphoprotein N variations in the binding site to human 14-3-3 proteins, Biochem
Biophys Res Commun. 569 (2021) 154–160. https://doi.org/10.1016/j.bbrc.2021.06.100.
[36] D. Yang, J.L. Leibowitz, The structure and functions of coronavirus genomic 3′ and 5′
ends, Virus Research. 206 (2015) 120–133. https://doi.org/10.1016/j.virusres.2015.02.025.
[37] Z. Miao, A. Tidu, G. Eriani, F. Martin, Secondary structure of the SARS-CoV-2 5’-
UTR, RNA Biology. 18 (2021) 447–456.
https://doi.org/10.1080/15476286.2020.1814556.
[38] D.E. Gordon, G.M. Jang, M. Bouhaddou, J. Xu, K. Obernier, K.M. White, M.J.
O’Meara, V.V. Rezelj, J.Z. Guo, D.L. Swaney, T.A. Tummino, R. Hüttenhain, R.M.
Kaake, A.L. Richards, B. Tutuncuoglu, H. Foussard, J. Batra, K. Haas, M. Modak, M. Kim,
P. Haas, B.J. Polacco, H. Braberg, J.M. Fabius, M. Eckhardt, M. Soucheray, M.J. Bennett,
M. Cakir, M.J. McGregor, Q. Li, B. Meyer, F. Roesch, T. Vallet, A. Mac Kain, L. Miorin,
E. Moreno, Z.Z.C. Naing, Y. Zhou, S. Peng, Y. Shi, Z. Zhang, W. Shen, I.T. Kirby, J.E.
Melnyk, J.S. Chorba, K. Lou, S.A. Dai, I. Barrio-Hernandez, D. Memon, C. Hernandez-
Armenta, J. Lyu, C.J.P. Mathy, T. Perica, K.B. Pilla, S.J. Ganesan, D.J. Saltzberg, R.
Rakesh, X. Liu, S.B. Rosenthal, L. Calviello, S. Venkataramanan, J. Liboy-Lugo, Y. Lin,
X.-P. Huang, Y. Liu, S.A. Wankowicz, M. Bohn, M. Safari, F.S. Ugur, C. Koh, N.S. Savar,
Q.D. Tran, D. Shengjuler, S.J. Fletcher, M.C. O’Neal, Y. Cai, J.C.J. Chang, D.J.
Broadhurst, S. Klippsten, P.P. Sharp, N.A. Wenzell, D. Kuzuoglu-Ozturk, H.-Y. Wang, R.
Trenker, J.M. Young, D.A. Cavero, J. Hiatt, T.L. Roth, U. Rathore, A. Subramanian, J.
Noack, M. Hubert, R.M. Stroud, A.D. Frankel, O.S. Rosenberg, K.A. Verba, D.A. Agard,
M. Ott, M. Emerman, N. Jura, M. von Zastrow, E. Verdin, A. Ashworth, O. Schwartz, C.
d’Enfert, S. Mukherjee, M. Jacobson, H.S. Malik, D.G. Fujimori, T. Ideker, C.S. Craik,
S.N. Floor, J.S. Fraser, J.D. Gross, A. Sali, B.L. Roth, D. Ruggero, J. Taunton, T.
Kortemme, P. Beltrao, M. Vignuzzi, A. García-Sastre, K.M. Shokat, B.K. Shoichet, N.J.
Krogan, A SARS-CoV-2 protein interaction map reveals targets for drug repurposing,
Nature. 583 (2020) 459–468. https://doi.org/10.1038/s41586-020-2286-9.
[39] M. Gupta, C.M. Azumaya, M. Moritz, S. Pourmal, A. Diallo, G.E. Merz, G. Jang, M.
Bouhaddou, A. Fossati, A.F. Brilot, D. Diwanji, E. Hernandez, N. Herrera, H.T.
Kratochvil, V.L. Lam, F. Li, Y. Li, H.C. Nguyen, C. Nowotny, T.W. Owens, J.K. Peters,
A.N. Rizo, U. Schulze-Gahmen, A.M. Smith, I.D. Young, Z. Yu, D. Asarnow, C.
Billesbølle, M.G. Campbell, J. Chen, K.-H. Chen, U.S. Chio, M.S. Dickinson, L. Doan, M.
Jin, K. Kim, J. Li, Y.-L. Li, E. Linossi, Y. Liu, M. Lo, J. Lopez, K.E. Lopez, A. Mancino,
F.R. Moss, M.D. Paul, K.I. Pawar, A. Pelin, T.H. Pospiech, C. Puchades, S.G. Remesh, M.
Safari, K. Schaefer, M. Sun, M.C. Tabios, A.C. Thwin, E.W. Titus, R. Trenker, E. Tse,
T.K.M. Tsui, F. Wang, K. Zhang, Y. Zhang, J. Zhao, F. Zhou, Y. Zhou, L. Zuliani-Alvarez,
Q.S.B. Consortium, D.A. Agard, Y. Cheng, J.S. Fraser, N. Jura, T. Kortemme, A. Manglik,
D.R. Southworth, R.M. Stroud, D.L. Swaney, N.J. Krogan, A. Frost, O.S. Rosenberg, K.A.
Verba, CryoEM and AI reveal a structure of SARS-CoV-2 Nsp2, a multifunctional protein
involved in key host processes, (2021) 2021.05.10.443524.
https://doi.org/10.1101/2021.05.10.443524.
[40] S.G. Devaraj, N. Wang, Z. Chen, Z. Chen, M. Tseng, N. Barretto, R. Lin, C.J. Peters,
C.-T.K. Tseng, S.C. Baker, K. Li, Regulation of IRF-3-dependent Innate Immunity by the
Papain-like Protease Domain of the Severe Acute Respiratory Syndrome Coronavirus, J
Biol Chem. 282 (2007) 32208–32221. https://doi.org/10.1074/jbc.M704870200.
[41] A.R. Fehr, R. Channappanavar, G. Jankevicius, C. Fett, J. Zhao, J. Athmer, D.K.
Meyerholz, I. Ahel, S. Perlman, The Conserved Coronavirus Macrodomain Promotes
Virulence and Suppresses the Innate Immune Response during Severe Acute Respiratory
Syndrome Coronavirus Infection, MBio. 7 (2016) e01721-16.
https://doi.org/10.1128/mBio.01721-16.
[42] A. Hosseini Rad SM, A.D. McLellan, Implications of SARS-CoV-2 Mutations for
Genomic RNA Structure and Host microRNA Targeting, International Journal of
Molecular Sciences. 21 (2020) 4807. https://doi.org/10.3390/ijms21134807.
[43] T.N. Winther, C.H. Bang-Berthelsen, I.L. Heiberg, F. Pociot, B. Hogh, Differential
plasma microRNA profiles in HBeAg positive and HBeAg negative children with chronic
hepatitis B, PloS One. 8 (2013) e58236.
[44] K. Yu, G. Shi, N. Li, The function of MicroRNA in hepatitis B virus-related liver
diseases: from Dim to Bright, Annals of Hepatology. 14 (2015) 450–456.
[45] M. Hasan, E. McLean, O. Bagasra, A computational analysis to construct a potential
post-Exposure therapy against pox epidemic using miRNAs in silico, J Bioterror Biodef. 7
(2016).
[46] W.D. Jang, S. Jeon, S. Kim, S.Y. Lee, Drugs repurposed for COVID-19 by virtual
screening of 6,218 drugs and cell-based assay, Proceedings of the National Academy of
Sciences. 118 (2021).
[47] S.K. Biswas, S.R. Mudi, Spike protein D614G and RdRp P323L: the SARS-CoV-2
mutations associated with severity of COVID-19, Genomics Inform. 18 (2020) e44.
https://doi.org/10.5808/GI.2020.18.4.e44.
[48] L. Yan, J. Ge, L. Zheng, Y. Zhang, Y. Gao, T. Wang, Y. Huang, Y. Yang, S. Gao, M.
Li, Z. Liu, H. Wang, Y. Li, Y. Chen, L.W. Guddat, Q. Wang, Z. Rao, Z. Lou, Cryo-EM
Structure of an Extended SARS-CoV-2 Replication and Transcription Complex Reveals
an Intermediate State in Cap Synthesis, Cell. 184 (2021) 184-193.e10.
https://doi.org/10.1016/j.cell.2020.11.016.
[49] Y. Gao, L. Yan, Y. Huang, F. Liu, Y. Zhao, L. Cao, T. Wang, Q. Sun, Z. Ming, L.
Zhang, J. Ge, L. Zheng, Y. Zhang, H. Wang, Y. Zhu, C. Zhu, T. Hu, T. Hua, B. Zhang, X.
Yang, J. Li, H. Yang, Z. Liu, W. Xu, L.W. Guddat, Q. Wang, Z. Lou, Z. Rao, Structure of
the RNA-dependent RNA polymerase from COVID-19 virus, Science. 368 (2020) 779–
782. https://doi.org/10.1126/science.abb7498.
[50] K.C. Lehmann, A. Gulyaeva, J.C. Zevenhoven-Dobbe, G.M.C. Janssen, M. Ruben,
H.S. Overkleeft, P.A. van Veelen, D.V. Samborskiy, A.A. Kravchenko, A.M. Leontovich,
I.A. Sidorov, E.J. Snijder, C.C. Posthuma, A.E. Gorbalenya, Discovery of an essential
nucleotidylating activity associated with a newly delineated conserved domain in the RNA
polymerase-containing protein of all nidoviruses, Nucleic Acids Research. 43 (2015)
8416–8434. https://doi.org/10.1093/nar/gkv838.
[51] A.N. Spratt, S.R. Kannan, L.T. Woods, G.A. Weisman, T.P. Quinn, C.L. Lorson, A.
Sönnerborg, S.N. Byrareddy, K. Singh, Evolution, correlation, structural impact and
dynamics of emerging SARS-CoV-2 variants, Computational and Structural
Biotechnology Journal. 19 (2021) 3799–3809. https://doi.org/10.1016/j.csbj.2021.06.037.
[52] A. Mohammad, F. Al-Mulla, D.-Q. Wei, J. Abubaker, Remdesivir MD Simulations
Suggest a More Favourable Binding to SARS-CoV-2 RNA Dependent RNA Polymerase
Mutant P323L Than Wild-Type, Biomolecules. 11 (2021) 919.
https://doi.org/10.3390/biom11070919.
[53] J.M. Hyser, M.K. Estes, Pathophysiological Consequences of Calcium-Conducting
Viroporins, Annu Rev Virol. 2 (2015) 473–496. https://doi.org/10.1146/annurev-virology-
100114-054846.
[54] A. Oulas, M. Zanti, M. Tomazou, M. Zachariou, G. Minadakis, M.M. Bourdakou, P.
Pavlidis, G.M. Spyrou, Generalized linear models provide a measure of virulence for
specific mutations in SARS-CoV-2 strains, PLOS ONE. 16 (2021) e0238665.
https://doi.org/10.1371/journal.pone.0238665.
[55] E. Issa, G. Merhi, B. Panossian, T. Salloum, S. Tokajian, SARS-CoV-2 and ORF3a:
Nonsynonymous Mutations, Functional Domains, and Viral Pathogenesis, MSystems. 5
(2020) e00266-20. https://doi.org/10.1128/mSystems.00266-20.
[56] Y. Ren, T. Shu, D. Wu, J. Mu, C. Wang, M. Huang, Y. Han, X.-Y. Zhang, W. Zhou,
Y. Qiu, X. Zhou, The ORF3a protein of SARS-CoV-2 induces apoptosis in cells, Cell Mol
Immunol. 17 (2020) 881–883. https://doi.org/10.1038/s41423-020-0485-9.
[57] G. Miao, H. Zhao, Y. Li, M. Ji, Y. Chen, Y. Shi, Y. Bi, P. Wang, H. Zhang, ORF3a of
the COVID-19 virus SARS-CoV-2 blocks HOPS complex-mediated assembly of the
SNARE complex required for autolysosome formation, Developmental Cell. 56 (2021)
427-442.e5. https://doi.org/10.1016/j.devcel.2020.12.010.
[58] B. Lu, L. Tao, T. Wang, Z. Zheng, B. Li, Z. Chen, Y. Huang, Q. Hu, H. Wang, Humoral
and Cellular Immune Responses Induced by 3a DNA Vaccines against Severe Acute
Respiratory Syndrome (SARS) or SARS-Like Coronavirus in Mice, Clin Vaccine
Immunol. 16 (2009) 73–77. https://doi.org/10.1128/CVI.00261-08.
[59] X. Zhong, Z. Guo, H. Yang, L. Peng, Y. Xie, T.-Y. Wong, S.-T. Lai, Z. Guo, Amino
terminus of the SARS coronavirus protein 3a elicits strong, potentially protective humoral
responses in infected patients, Journal of General Virology. 87 (2006) 369–373.
https://doi.org/10.1099/vir.0.81078-0.
[60] D.M. Kern, B. Sorum, S.S. Mali, C.M. Hoel, S. Sridharan, J.P. Remis, D.B. Toso, A.
Kotecha, D.M. Bautista, S.G. Brohawn, Cryo-EM structure of SARS-CoV-2 ORF3a in
lipid nanodiscs, Nat Struct Mol Biol. 28 (2021) 573–582. https://doi.org/10.1038/s41594-
021-00619-0.
[61] G.K. Azad, P.K. Khan, Variations in Orf3a protein of SARS-CoV-2 alter its structure
and function, Biochem Biophys Rep. 26 (2021) 100933.
https://doi.org/10.1016/j.bbrep.2021.100933.
[62] J. Shang, Y. Wan, C. Luo, G. Ye, Q. Geng, A. Auerbach, F. Li, Cell entry mechanisms
of SARS-CoV-2, Proceedings of the National Academy of Sciences. 117 (2020) 11727–
11734.
[63] C.B. Jackson, M. Farzan, B. Chen, H. Choe, Mechanisms of SARS-CoV-2 entry into
cells, Nature Reviews Molecular Cell Biology. 23 (2022) 3–20.
[64] S.A. Kopecky-Bromberg, L. Martínez-Sobrido, M. Frieman, R.A. Baric, P. Palese,
Severe Acute Respiratory Syndrome Coronavirus Open Reading Frame (ORF) 3b, ORF 6,
and Nucleocapsid Proteins Function as Interferon Antagonists, J Virol. 81 (2007) 548–557.
https://doi.org/10.1128/JVI.01782-06.
[65] M. Surjit, B. Liu, V.T.K. Chow, S.K. Lal, The Nucleocapsid Protein of Severe Acute
Respiratory Syndrome-Coronavirus Inhibits the Activity of Cyclin-Cyclin-dependent
Kinase Complex and Blocks S Phase Progression in Mammalian Cells, J Biol Chem. 281
(2006) 10669–10681. https://doi.org/10.1074/jbc.M509233200.
[66] J. Mu, J. Xu, L. Zhang, T. Shu, D. Wu, M. Huang, Y. Ren, X. Li, Q. Geng, Y. Xu, Y.
Qiu, X. Zhou, SARS-CoV-2-encoded nucleocapsid protein acts as a viral suppressor of
RNA interference in cells, Sci China Life Sci. 63 (2020) 1413–1416.
https://doi.org/10.1007/s11427-020-1692-1.
[67] H. Wu, N. Xing, K. Meng, B. Fu, W. Xue, P. Dong, W. Tang, Y. Xiao, G. Liu, H. Luo,
W. Zhu, X. Lin, G. Meng, Z. Zhu, Nucleocapsid mutations R203K/G204R increase the
infectivity, fitness, and virulence of SARS-CoV-2, Cell Host Microbe. 29 (2021) 1788-
1801.e6. https://doi.org/10.1016/j.chom.2021.11.005.
[68] V.M. Prabantu, N. Naveenkumar, N. Srinivasan, Influence of Disease-Causing
Mutations on Protein Structural Networks, Front. Mol. Biosci. 7 (2021) 620554.
https://doi.org/10.3389/fmolb.2020.620554.
[69] E.I. Shakhnovich, A.M. Gutin, Influence of point mutations on protein structure:
Probability of a neutral mutation, Journal of Theoretical Biology. 149 (1991) 537–546.
https://doi.org/10.1016/S0022-5193(05)80097-9.
[70] C. Sotomayor-Vivas, E. Hernández-Lemus, R. Dorantes-Gilardi, Linking protein
structural and functional change to mutation using amino acid networks, PLOS ONE. 17
(2022) e0261829. https://doi.org/10.1371/journal.pone.0261829.
[71] H. Wako, S. Endo, Normal mode analysis as a method to derive protein dynamics
information from the Protein Data Bank, Biophys Rev. 9 (2017) 877–893.
https://doi.org/10.1007/s12551-017-0330-2.
[72] B. Erman, The Gaussian Network Model: Precise Predictions of Residue Fluctuations
and Application to Binding Problems, Biophysical Journal. 91 (2006) 3589–3599.
https://doi.org/10.1529/biophysj.106.090803.
[73] E. Eyal, L.-W. Yang, I. Bahar, Anisotropic network model: systematic evaluation and
a new web interface, Bioinformatics. 22 (2006) 2619–2627.
https://doi.org/10.1093/bioinformatics/btl448.
[74] L. Yang, G. Song, R.L. Jernigan, Protein elastic network models and the ranges of
cooperativity, Proceedings of the National Academy of Sciences. 106 (2009) 12347–
12352. https://doi.org/10.1073/pnas.0902159106.
[75] D. Shirvanyants, F. Ding, D. Tsao, S. Ramachandran, N.V. Dokholyan, Discrete
Molecular Dynamics: An Efficient And Versatile Simulation Method For Fine Protein
Characterization, J. Phys. Chem. B. 116 (2012) 8375–8382.
https://doi.org/10.1021/jp2114576.
[76] S.A. Hollingsworth, R.O. Dror, Molecular Dynamics Simulation for All, Neuron. 99
(2018) 1129–1143. https://doi.org/10.1016/j.neuron.2018.08.011.
Table 1: Number of genomes analysed in this study, grouped by month of collection.

| Month | Number of genomes |
| --- | --- |
| January 2020 | 141 |
| February 2020 | 289 |
| March 2020 | 8453 |
| April 2020 | 5023 |
| May 2020 | 3250 |
| June 2020 | 3379 |
| July 2020 | 7359 |
| August 2020 | 4833 |
| September 2020 | 2225 |
| October 2020 | 1528 |
| November 2020 | 1997 |
| December 2020 | 3782 |
| January 2021 | 9633 |
| February 2021 | 859 |
| March 2021 | 6970 |
| Total | 59541 |
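The month-wise grouping behind Table 1 can be sketched in a few lines. This is only an illustration, not the authors' pipeline: the function name `genomes_per_month`, the ISO `YYYY-MM-DD` date format and the four example dates are all assumptions made for the sketch.

```python
from collections import Counter
from datetime import date


def genomes_per_month(collection_dates):
    """Count genomes per collection month.

    Assumes every date is a complete ISO string (YYYY-MM-DD); records with
    partial dates would need to be filtered out beforehand.
    """
    counts = Counter(
        date.fromisoformat(d).strftime("%Y-%m") for d in collection_dates
    )
    # Sort chronologically; "YYYY-MM" keys sort correctly as plain strings.
    return dict(sorted(counts.items()))


# Hypothetical collection dates, for illustration only.
example_dates = ["2020-01-15", "2020-03-02", "2020-01-30", "2021-03-11"]
example_counts = genomes_per_month(example_dates)
print(example_counts)
```

Keying on the `YYYY-MM` string keeps the grouping independent of locale settings, which month names produced by `strftime("%B")` would not be.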
Table 2: Frequency of unique mutations in the proteins of SARS-CoV-2. The mutation frequency was calculated by dividing the number of unique mutations by the length of the respective protein and is expressed as a percentage.

| SARS-CoV-2 protein | Length of protein | Number of unique mutations | Frequency (%) |
| --- | --- | --- | --- |
| Orf1ab | 7096 | 610 | 8.59 |
| S | 1273 | 256 | 20.10 |
| Orf3a | 275 | 33 | 12 |
| M | 222 | 2 | 0.90 |
| Orf6 | 61 | 11 | 18.0 |
| Orf8 | 121 | 10 | 8.2 |
| N | 419 | 16 | 3.8 |
| Orf10 | 38 | 2 | 5.1 |
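The frequency calculation described in the Table 2 caption can be reproduced as follows. The sketch assumes the reported frequency is the number of unique mutations divided by protein length, multiplied by 100; this agrees with most tabulated values up to rounding but is not stated explicitly in the caption. The names `TABLE_2` and `mutation_frequency` are illustrative.

```python
# Values copied from Table 2: protein -> (length in residues, unique mutations).
TABLE_2 = {
    "Orf1ab": (7096, 610),
    "S": (1273, 256),
    "Orf3a": (275, 33),
    "M": (222, 2),
    "Orf6": (61, 11),
    "Orf8": (121, 10),
    "N": (419, 16),
    "Orf10": (38, 2),
}


def mutation_frequency(length, unique_mutations):
    """Unique mutations per 100 residues (assumed percentage convention)."""
    return 100.0 * unique_mutations / length


frequencies = {
    protein: mutation_frequency(length, count)
    for protein, (length, count) in TABLE_2.items()
}

# Print proteins from the highest to the lowest mutation frequency.
for protein, freq in sorted(frequencies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{protein:7s} {freq:6.2f}")
```

Normalizing by length lets a short protein such as Orf6 be compared with Orf1ab, which accumulates far more mutations in absolute terms simply because it is longer.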
Table 3: Correlations among unique mutations in SARS-CoV-2 genomes. The mutation pairs listed below have an absolute correlation value > 0.4, and each mutation is present in >30% of the genomes.

| Mutation I | Mutation II | Percentage of genomes that have mutation I | Percentage of genomes that have mutation II |
| --- | --- | --- | --- |
| 241 (5’ UTR) | 3037 (F106F NSP3) | 91.054 | 89.889 |
| 241 (5’ UTR) | 23403 (D614G Spike) | 91.054 | 90.082 |
| 241 (5’ UTR) | 14408 (P323L RdRp) | 91.054 | 89.385 |
| 1059 (T85I NSP2) | 25563 (Q57H ORF3a) | 40.022 | 46.948 |
| 1059 (T85I NSP2) | 28882 (R203R Nucleocapsid) | 40.022 | 30.234 |
| 1059 (T85I NSP2) | 28881 (R203K Nucleocapsid) | 40.022 | 30.381 |
| 1059 (T85I NSP2) | 28883 (G204R Nucleocapsid) | 40.022 | 30.259 |
| 3037 (F106F NSP3) | 23403 (D614G Spike) | 89.889 | 90.082 |
| 3037 (F106F NSP3) | 14408 (P323L RdRp) | 89.889 | 89.385 |
| 14408 (P323L RdRp) | 23403 (D614G Spike) | 89.385 | 90.082 |
| 25563 (Q57H ORF3a) | 28882 (R203R Nucleocapsid) | 46.948 | 30.234 |
| 25563 (Q57H ORF3a) | 28881 (R203K Nucleocapsid) | 46.948 | 30.381 |
| 25563 (Q57H ORF3a) | 28883 (G204R Nucleocapsid) | 46.948 | 30.234 |
| 28881 (R203K Nucleocapsid) | 28882 (R203R Nucleocapsid) | 30.381 | 30.259 |
| 28881 (R203K Nucleocapsid) | 28883 (G204R Nucleocapsid) | 30.381 | 30.259 |
| 28882 (R203R Nucleocapsid) | 28883 (G204R Nucleocapsid) | 30.234 | 30.259 |
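The two selection criteria in the Table 3 caption (absolute correlation above 0.4 and prevalence above 30%) can be illustrated with a small sketch. It assumes each mutation is encoded as a binary presence/absence vector across genomes and that the correlation is the Pearson coefficient; the preprint's exact correlation measure may differ. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no correlation signal
    return cov / sqrt(vx * vy)


def correlated_pairs(presence, min_abs_corr=0.4, min_prevalence=0.30):
    """Return (mutation_a, mutation_b, r) for prevalent, correlated pairs."""
    n_genomes = len(next(iter(presence.values())))
    prevalence = {m: sum(v) / n_genomes for m, v in presence.items()}
    common = [m for m in presence if prevalence[m] > min_prevalence]
    pairs = []
    for a, b in combinations(common, 2):
        r = pearson(presence[a], presence[b])
        if abs(r) > min_abs_corr:
            pairs.append((a, b, r))
    return pairs


# Hypothetical presence (1) / absence (0) of mutations in ten genomes.
toy_presence = {
    "241C>T": [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
    "23403A>G": [1, 1, 1, 1, 1, 1, 1, 0, 0, 0],
    "1059C>T": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
    "rare": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],
}
pairs = correlated_pairs(toy_presence)
for a, b, r in pairs:
    print(f"{a} ~ {b}: r = {r:.3f}")
```

Applying the prevalence filter before the pairwise loop keeps the number of correlation computations small when most mutations are rare.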
Table 4: Mutations and their frequencies. NC is the total number of sequences carrying a given mutation.

| Mutation | Gene | NC |
| --- | --- | --- |
| 241 | 5’UTR | 50771 |
| 23403 (D614G) | S | 50229 |
| 3037 (F924F) | ORF1AB | 50121 |
| 14408 (P4715L) | ORF1AB | 49840 |
| 25563 (Q57H) | ORF3A | 26178 |
| 1059 (T265I) | ORF1AB | 22316 |
| 28881 (R203K) | N | 16940 |
| 28883 (G204R) | N | 16872 |
| 28882 (R203R) | N | 16858 |
| 27964 (S24L) | ORF1AB | 10985 |
| 1163 (I300F) | ORF1AB | 9098 |
| 10319 (L3352F) | ORF1AB | 8882 |
| 18555 (D6097D) | ORF1AB | 8772 |
| 28869 (P199L) | N | 8770 |
| 16647 (T5641T) | Helicase | 8755 |
| 23401 (Q613H) | S | 8746 |
| 7540 (T2425T) | ORF1AB | 8729 |
| 18424 (N6054D) | ORF1AB | 8758 |
| 28472 (P67T) | N | 8572 |
| 21304 (R7014S) | ORF1AB | 8320 |
| 25907 (G172D) | ORF3A | 8236 |
| 22992 (S477N) | S | 8146 |
| 19 | 5’UTR | 6637 |
| 15 | 5’UTR | 5662 |
| 13 | 5’UTR | 5069 |
Table 5: Predicted effects of the mutations on the functionality, stability and flexibility of the proteins, obtained with the PredictSNP, DynaMut and SAAFEC-SEQ web servers.

| Predictor | Nsp2 (T85I) | RdRp (P323L) | ORF3a (Q57H) |
| --- | --- | --- | --- |
| PredictSNP confidence score (%) and nature of mutation | 72 (Deleterious) | 83 (Neutral) | 76 (Deleterious) |
| ENCoM ΔΔSVib (kcal·mol⁻¹·K⁻¹) and flexibility | 0.021 (Increase) | -0.225 (Decrease) | -0.117 (Decrease) |
| DynaMut ΔΔG (kcal·mol⁻¹) | -0.264 (Destabilizing) | 0.732 (Stabilizing) | 0.597 (Stabilizing) |
| ENCoM ΔΔG (kcal·mol⁻¹) | -0.017 (Destabilizing) | 0.180 (Destabilizing) | 0.094 (Destabilizing) |
| mCSM ΔΔG (kcal·mol⁻¹) | -0.125 (Destabilizing) | -0.261 (Destabilizing) | -0.843 (Destabilizing) |
| SDM ΔΔG (kcal·mol⁻¹) | 0.460 (Stabilizing) | 1.570 (Stabilizing) | 0.060 (Stabilizing) |
| DUET ΔΔG (kcal·mol⁻¹) | 0.151 (Stabilizing) | 0.457 (Stabilizing) | -0.652 (Destabilizing) |
| SAAFEC-SEQ ΔΔG (kcal·mol⁻¹) | -0.93 (Destabilizing) | -0.77 (Destabilizing) | -1.03 (Destabilizing) |
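Because the stability predictors in Table 5 do not always agree, it can help to tally their calls per mutation. The sketch below only counts the labels copied from the table. It is a reading aid, not the authors' consensus procedure, and the names `STABILITY_CALLS` and `tally_calls` are illustrative.

```python
from collections import Counter

# Stability labels copied from Table 5 (six free-energy predictors per mutation).
STABILITY_CALLS = {
    "Nsp2 (T85I)": {
        "DynaMut": "Destabilizing", "ENCoM": "Destabilizing",
        "mCSM": "Destabilizing", "SDM": "Stabilizing",
        "DUET": "Stabilizing", "SAAFEC-SEQ": "Destabilizing",
    },
    "RdRp (P323L)": {
        "DynaMut": "Stabilizing", "ENCoM": "Destabilizing",
        "mCSM": "Destabilizing", "SDM": "Stabilizing",
        "DUET": "Stabilizing", "SAAFEC-SEQ": "Destabilizing",
    },
    "ORF3a (Q57H)": {
        "DynaMut": "Stabilizing", "ENCoM": "Destabilizing",
        "mCSM": "Destabilizing", "SDM": "Stabilizing",
        "DUET": "Destabilizing", "SAAFEC-SEQ": "Destabilizing",
    },
}


def tally_calls(calls_by_mutation):
    """Count how many predictors give each stability label per mutation."""
    return {
        mutation: Counter(calls.values())
        for mutation, calls in calls_by_mutation.items()
    }


tally = tally_calls(STABILITY_CALLS)
for mutation, counts in tally.items():
    print(mutation, dict(counts))
```

A plain count weights every predictor equally, which is a simplification: the tools differ in method and training data, so the tally should not be read as a calibrated consensus.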
Figure 1: Geographical distribution of the samples used in this study. Circle size
indicates the number of samples from each geographical location. The X-axis represents
longitude and the Y-axis represents latitude.
Figure 2: Identification of highly significant mutations occurring in SARS-CoV-2 genomes
across various months. (A) Mutations at the nucleotide level among SARS-CoV-2 genomes.
Non-synonymous mutations at the amino-acid level in the (B) ORF1ab protein, (C) S protein,
(D) ORF3a protein, (E) M protein, (F) ORF6 protein, (G) ORF8 protein, (H) N protein and
(I) ORF10 protein of SARS-CoV-2. The X-axis represents months and the Y-axis represents
the respective nucleotide or amino-acid position.
Figure 3: Correlation and hierarchical clustering among significant mutations in
the genomes of SARS-CoV-2. (A) 3D plot showing the correlation among highly significant
mutations with absolute correlation >0.4. The X-axis represents mutation I of the correlated
pair, the Y-axis represents mutation II of the correlated pair and the Z-axis represents the
correlation coefficient. (B) Chord plot representing correlation among significant mutation
pairs with absolute correlation value >0.4 and mutation frequency >30% in the genomes used
in this study. (C) Hierarchical clustering of the top 25 highly significant mutations with
frequency >10%. The color bar represents the distance among the mutations.
Figure 4: Running monthly counts of the mutations in the SARS-CoV-2 genomes used
in this study. The highly significant, mutually correlated mutations shown here
are present in >30% of the genomes and have an absolute correlation coefficient >0.4. (A) 241C>T
(B) 1059C>T (C) 3037C>T (D) 14408C>T (E) 23403A>G (F) 25563G>T (G) 28881G>A (H)
28882G>A (I) 28883G>C.
Figure 5: Structural alignments of crystal and modelled proteins. (A) Nsp2. Crystal and
modelled structures are shown in cyan and red, respectively. (B) RdRp. Crystal and modelled
structures are shown in purple and red, respectively. (C) ORF3a. Crystal and modelled structures
are shown in orange and blue, respectively. Mutation positions are indicated by dashed circles and
mutant residues are shown in stick representation.
Figure 6: Visual representation of MT protein dynamics and intramolecular interactions of
WT and MT residues with proximal amino acids. Mutations are shown in stick representation
inside the red circles. Red and blue colours indicate flexibility and rigidity,
respectively. WT and MT residues are shown in cyan. Mutant residues are shown in red
and interacting residues in black. Interactions are drawn in different colours;
for their interpretation, readers are referred to the web version of the
Arpeggio web server. (A-C) Visual representation of Nsp2, RdRp, and ORF3a respectively.
(D-F) Intramolecular interactions of WT and MT residues in Nsp2, RdRp, and ORF3a
respectively.
Figure 7: Normalized linear mutual information (nLMI) correlation plots. (A) Nsp2_WT (B)
RdRp_WT (C) ORF3a_WT (D) Correlation difference plot of Nsp2 WT & MT (E) Correlation
difference plot of RdRp WT & MT (F) Correlation difference plot of ORF3a WT & MT.
Degree of correlation corresponds to the color bar.
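The caption does not restate how nLMI is defined. For orientation, a commonly used formulation of linear mutual information and its normalization is sketched below. Whether the calculation in this study matches it term for term is an assumption, as are the symbols introduced here ($\mathbf{C}_i$, $\mathbf{C}_{ij}$, $r_{ij}$) and the sign convention chosen for the WT/MT difference.

```latex
% C_i    : 3x3 covariance matrix of the positional fluctuations of residue i
% C_{ij} : 6x6 covariance matrix of the joint fluctuations of residues i and j
\begin{align}
I_{\mathrm{lin}}(i,j) &= \tfrac{1}{2}\left[\ln\det\mathbf{C}_i + \ln\det\mathbf{C}_j
                          - \ln\det\mathbf{C}_{ij}\right] \\
r_{ij} &= \sqrt{1 - \exp\!\left(-\tfrac{2}{3}\, I_{\mathrm{lin}}(i,j)\right)},
          \qquad 0 \le r_{ij} < 1 \\
\Delta r_{ij} &= r_{ij}^{\mathrm{MT}} - r_{ij}^{\mathrm{WT}}
          \quad\text{(assumed sign convention for the difference plots)}
\end{align}
```

Under this formulation, $r_{ij}=0$ means the fluctuations of the two residues are linearly uncorrelated, and values approaching 1 indicate strongly coupled motion regardless of the relative direction of the displacements.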
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
The function of a protein is strongly dependent on its structure. During evolution, proteins acquire new functions through mutations in the amino-acid sequence. Given the advance in deep mutational scanning, recent findings have found functional change to be position dependent, notwithstanding the chemical properties of mutant and mutated amino acids. This could indicate that structural properties of a given position are potentially responsible for the functional relevance of a mutation. Here, we looked at the relation between structure and function of positions using five proteins with experimental data of functional change available. In order to measure structural change, we modeled mutated proteins via amino-acid networks and quantified the perturbation of each mutation. We found that structural change is position dependent, and strongly related to functional change. Strong changes in protein structure correlate with functional loss, and positions with functional gain due to mutations tend to be structurally robust. Finally, we constructed a computational method to predict functionally sensitive positions to mutations using structural change that performs well on all five proteins with a mean precision of 74.7% and recall of 69.3% of all functional positions.
Article
Full-text available
Previous work found that the cooccurring mutations R203K/G204R on the SARS-CoV-2 nucleocapsid (N) protein are increasing in frequency amongst emerging variants of concern or interest. Through a combination of in silico analyses, this study demonstrates that R203K/G204R are adaptive, while large-scale phylogenetic analyses indicate that R203K/G204R associates with the emergence of the high-transmissibility SARS-CoV-2 lineage B.1.1.7. Competition experiments suggest that the 203K/204R variants possess a replication advantage over the preceding R203/G204 variants, possibly related to ribonucleocapsid (RNP) assembly. Moreover, the 203K/204R virus shows increased infectivity in human lung cells and hamsters. Accordingly, we observe a positive association between increased COVID-19 severity and sample frequency of 203K/204R. Our work suggests that the 203K/204R mutations contribute to the increased transmission and virulence of select SARS-CoV-2 variants. In addition to mutations in the spike protein, mutations in the nucleocapsid protein are important for viral spreading during the pandemic.
Article
Full-text available
The importance of understanding SARS-CoV-2 evolution cannot be overemphasized. Recent studies confirm that natural selection is the dominating mechanism of SARS-CoV-2 evolution, which favors mutations that strengthen viral infectivity. We demonstrate that vaccine-breakthrough or antibody-resistant mutations provide a new mechanism of viral evolution. Specifically, vaccine-resistant mutation Y449S in the spike (S) protein receptor-bonding domain (RBD), which occurred in co-mutation [Y449S, N501Y], has reduced infectivity compared to the original SARS-CoV-2 but can disrupt existing antibodies that neutralize the virus. By tracing the evolutionary trajectories of vaccine-resistant mutations in over 1.9 million SARS-CoV-2 genomes, we reveal that the occurrence and frequency of vaccine-resistant mutations correlate strongly with the vaccination rates in Europe and America. We anticipate that as a complementary transmission pathway, vaccine-resistant mutations will become a dominating mechanism of SARS-CoV-2 evolution when most of the world's population is vaccinated. Our study sheds light on SARS-CoV-2 evolution and transmission and enables the design of the next-generation mutation-proof vaccines and antibody drugs.
Article
Full-text available
In an attempt to understand the pathogenesis, evolution and epidemiology of the SARS-CoV-2 virus, scientists from all over the world are tracking its genomic changes in real time. Genomic studies can be helpful in understanding the disease dynamics. We downloaded 324 complete and near-complete SARS-CoV-2 genomes from Bangladesh that were submitted to the GISAID database and isolated between 30 March and 7 September 2020. We then compared these genomes with the Wuhan reference sequence and found 4160 mutation events, including 2253 missense single nucleotide variations, 38 deletions and 10 insertions. The C>T nucleotide change was the most prevalent (41% of all mutations), possibly owing to selective pressure to reduce CpG sites and thereby evade the CpG-targeted host immune response. The most frequent mutation, found in 98% of isolates, was 3037C>T, a synonymous change that usually accompanied three other mutations: 241C>T, 14408C>T (P323L in RdRp) and 23403A>G (D614G in spike protein). P323L has been reported to increase the mutation rate, and D614G is associated with increased viral replication and is currently the most prevalent variant circulating worldwide. We identified multiple missense mutations in predicted B-cell and T-cell epitope regions and/or PCR target regions (including R203K and G204R, which occurred in 86% of the isolates) that may affect immunogenicity and/or RT-PCR-based diagnosis. Our analysis revealed 5 large deletion events in the ORF7a and ORF8 gene products that may be associated with reduced disease severity and increased viral clearance. Our phylogenetic analysis showed that most of the isolates belonged to Nextstrain clade 20B (86%) and GISAID clade GR (88%). Most of our isolates shared common ancestors either directly with European countries or jointly with Middle Eastern countries as well as Australia and India. Interestingly, the 19B clade (GISAID S clade), which was originally prevalent in China, was unique to Chittagong. This suggests multiple introductions of the virus into Bangladesh via different routes. Hence, more genome sequencing and analysis, combined with related clinical data, are needed to interpret functional significance and better predict the disease dynamics, which may help policy makers control the COVID-19 pandemic.
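The abstract above reports the share of each nucleotide change (for example, C>T at 41% of all mutations) after comparing isolates with a reference sequence. As a rough illustration only, the tally can be sketched as below. The function names and the toy sequences are hypothetical, the sketch assumes the two sequences are already aligned to equal length, and it is not the pipeline the authors used.

```python
from collections import Counter


def substitution_spectrum(reference: str, sample: str) -> Counter:
    """Count single-nucleotide changes (e.g. 'C>T') between two aligned sequences.

    Assumes both sequences are pre-aligned and of equal length; positions
    holding gaps or ambiguous bases (anything other than A, C, G, T) are skipped.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    counts = Counter()
    for ref_base, alt_base in zip(reference.upper(), sample.upper()):
        if ref_base in valid and alt_base in valid and ref_base != alt_base:
            counts[f"{ref_base}>{alt_base}"] += 1
    return counts


def change_fraction(counts: Counter, change: str) -> float:
    """Fraction of all counted substitutions that are of the given type."""
    total = sum(counts.values())
    return counts[change] / total if total else 0.0


# Toy example (made-up sequences, not real SARS-CoV-2 data).
spectrum = substitution_spectrum("ACCGTCA", "ATCGTTG")
ct_share = change_fraction(spectrum, "C>T")
```

With real data, the same tally would be accumulated over all isolates before computing the fractions.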
Article
The severe acute respiratory syndrome coronavirus 2 (SARS‐CoV‐2) Omicron variant (B.1.1.529) was first identified in Botswana in November 2021. It was first reported to the World Health Organization (WHO) on Nov 24. On Nov 26, 2021, on the advice of scientists in the WHO's Technical Advisory Group on SARS‐CoV‐2 Virus Evolution (TAG‐VE), the WHO defined the strain as a variant of concern (VOC) and named it Omicron. Compared with the other four VOCs (Alpha, Beta, Gamma, and Delta), the Omicron variant was the most highly mutated strain, with 50 mutations accumulated throughout the genome. The Omicron variant contains at least 32 mutations in the spike protein, twice as many as the Delta variant. Studies have shown that carrying many mutations can increase the infectivity and immune escape of the Omicron variant compared with the early wild‐type strain and the other four VOCs. The Omicron variant is becoming the dominant strain in many countries worldwide and brings new challenges to preventing and controlling COVID‐19. The current review article aims to analyze and summarize information about the biological characteristics of amino acid mutations, the epidemic characteristics, immune escape, and vaccine reactivity of the Omicron variant, hoping to provide a scientific reference for monitoring, prevention, and vaccine development strategies for the Omicron variant.
Article
A novel coronavirus, SARS-CoV-2, has caused over 274 million cases and over 5.3 million deaths worldwide since it emerged in December 2019 in Wuhan, China. Here we conceptualized the temporospatial evolutionary and expansion dynamics of SARS-CoV-2 by taking a series of cross-sectional views of viral genomes, from the early outbreak in January 2020 in Wuhan, to the early phase of global ignition in early April, and finally to the subsequent global expansion by late December 2020. Based on the phylogenetic analysis of the early patients in Wuhan, Wuhan/WH04/2020 appears to be a more appropriate reference genome for SARS-CoV-2 than the first sequenced genome, Wuhan-Hu-1. By scrutinizing the cases from the very early outbreak, we found that a viral genotype from the Seafood Market in Wuhan, characterized by two concurrent mutations (i.e. M type), had become the overwhelmingly dominant genotype (95.3%) of the pandemic one year later. By analyzing 4,013 SARS-CoV-2 genomes collected from different continents by early April, we were able to interrogate the viral genomic composition dynamics of the initial phase of global ignition over a timespan of 14 weeks. Eleven major viral genotypes with unique geographic distributions were also identified. WE1 type, a descendant of M predominantly seen in western Europe, accounted for half of all cases (50.2%) at the time. The mutations of major genotypes at the same hierarchical level were mutually exclusive, implying that genotypes bearing specific mutations were propagated during human-to-human transmission rather than by accumulating hot-spot mutations during the replication of individual viral genomes. As the pandemic unfolded, we used the same approach to analyze 261,323 SARS-CoV-2 genomes from around the world collected since the outbreak in Wuhan (i.e. all publicly available viral genomes) in order to recapitulate our findings over a one-year timespan. By 25 December 2020, 95.3% of global cases were M type and 93.0% of M-type cases were WE1. In fact, at present all five variants of concern (VOC) are descendants of WE1 type. This study demonstrates that viral genotypes can be used as molecular barcodes, in combination with epidemiologic data, to monitor the spreading routes of the pandemic and evaluate the effectiveness of control measures. Moreover, the dynamics of the viral mutational spectrum in the study may help the early identification of new strains in patients to reduce further spread of infection, guide the development of molecular diagnosis and vaccines against COVID-19, and help assess their accuracy and efficacy in the real world in real time.
Article
Despite worldwide vaccination, the COVID-19 pandemic continues as SARS-CoV-2 evolves into numerous variants. Since the first identification of the novel SARS-CoV-2 variant of concern (VOC) Omicron on November 24th, 2021, from an immunocompromised patient in South Africa, the variant has overtaken Delta as the predominant lineage in South Africa and has quickly spread to over 40 countries. Here we provide an initial molecular characterization of the Omicron variant by analyzing its large number of mutations, especially those in the spike protein receptor-binding domain (RBD), and their potential effects on viral infectivity and host immunity. Our analysis indicates that the Omicron variant has two subclades and may have evolved from clade 20B rather than from the currently dominant Delta variant. In addition, we have identified mutations that may affect ACE2 receptor and/or antibody binding. Our study raises additional questions about the evolution, transmission, virulence, and immune escape properties of this new Omicron variant.
Article
Genetic variants of SARS-CoV-2 have repeatedly altered the course of the COVID-19 pandemic. Delta variants of concern are now the focus of intense international attention because they are causing widespread COVID-19 disease globally and are associated with vaccine breakthrough cases. We sequenced the genomes of 16,965 SARS-CoV-2 from samples acquired March 15, 2021 through September 20, 2021 in the Houston Methodist hospital system. This sample represents 91% of all Methodist system COVID-19 patients during the study period. Delta variants increased rapidly from late April onward to cause 99.9% of all COVID-19 cases and spread throughout the Houston metroplex. Compared to all other variants combined, Delta caused a significantly higher rate of vaccine breakthrough cases (23.7% for Delta compared to 6.6% for all other variants combined). Importantly, significantly fewer fully vaccinated individuals required hospitalization. Individuals with vaccine breakthrough cases caused by Delta had a low median PCR cycle threshold (Ct) value (a proxy for high virus load). This value was closely similar to the median Ct value for unvaccinated patients with COVID-19 caused by Delta variants, suggesting that fully vaccinated individuals can transmit SARS-CoV-2 to others. Patients infected with Alpha and Delta variants had several significant differences. Our integrated analysis emphasizes that vaccines used in the United States are highly effective in decreasing severe COVID-19 disease, hospitalizations, and deaths.
Article
Extracting dynamical pairwise correlations and identifying key residues from large molecular dynamics trajectories or normal-mode analysis of coarse-grained models are important for explaining various processes like ligand binding, mutational effects, and long-distance interactions. Efficient and flexible tools to perform this task can provide new insights about residues involved in allosteric regulation and protein function. In addition, combining and comparing dynamical coupling information with sequence coevolution data can help to better understand protein function. To this end, we developed a Python package called correlationplus to calculate, visualize, and analyze pairwise correlations. In this way, the package helps to identify key residues and interactions in proteins. The source code of correlationplus is available under LGPL version 3 at https://github.com/tekpinar/correlationplus. The current version of the package (0.2.0) can be installed with common installation methods like conda or pip in addition to source code installation. Moreover, docker images are also available for usage of the code without installation.
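To make the notion of a pairwise dynamical correlation concrete, the following is a minimal pure-Python sketch of the commonly used normalized dynamical cross-correlation between residue displacements. It does not use or reproduce the correlationplus API. The function name and the trajectory layout (`traj[t][i]` holding the (x, y, z) position of residue `i` at frame `t`) are assumptions made for illustration.

```python
import math


def dynamical_cross_correlation(traj):
    """Normalized cross-correlation matrix of residue displacements.

    traj[t][i] is an (x, y, z) tuple for residue i at frame t (assumed layout).
    Returns corr, where corr[i][j] lies in [-1, 1]: +1 for fully correlated
    motion, -1 for fully anti-correlated motion. A residue that never moves
    gets correlation 0 with everything (including itself) by convention here.
    """
    n_frames, n_res = len(traj), len(traj[0])

    # Mean position of each residue over all frames.
    mean = [
        [sum(traj[t][i][k] for t in range(n_frames)) / n_frames for k in range(3)]
        for i in range(n_res)
    ]

    def covariance(i, j):
        # Frame-averaged dot product of the displacement vectors of i and j.
        total = 0.0
        for t in range(n_frames):
            for k in range(3):
                total += (traj[t][i][k] - mean[i][k]) * (traj[t][j][k] - mean[j][k])
        return total / n_frames

    corr = [[0.0] * n_res for _ in range(n_res)]
    for i in range(n_res):
        for j in range(n_res):
            denom = math.sqrt(covariance(i, i) * covariance(j, j))
            corr[i][j] = covariance(i, j) / denom if denom > 0 else 0.0
    return corr


# Toy trajectory: residues 0 and 1 move together, residue 2 moves oppositely.
toy_traj = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
]
toy_corr = dynamical_cross_correlation(toy_traj)
```

This loop-based version recomputes covariances and is only meant to show the definition; for real trajectories a vectorized implementation or a dedicated package would be the practical choice.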
Article
The unprecedented public health and economic impact of the COVID-19 pandemic caused by infection with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) has been met with an equally unprecedented scientific response. Much of this response has focused, appropriately, on the mechanisms of SARS-CoV-2 entry into host cells, and in particular the binding of the spike (S) protein to its receptor, angiotensin-converting enzyme 2 (ACE2), and subsequent membrane fusion. This Review provides the structural and cellular foundations for understanding the multistep SARS-CoV-2 entry process, including S protein synthesis, S protein structure, conformational transitions necessary for association of the S protein with ACE2, engagement of the receptor-binding domain of the S protein with ACE2, proteolytic activation of the S protein, endocytosis and membrane fusion. We define the roles of furin-like proteases, transmembrane protease, serine 2 (TMPRSS2) and cathepsin L in these processes, and delineate the features of ACE2 orthologues in reservoir animal species and S protein adaptations that facilitate efficient human transmission. We also examine the utility of vaccines, antibodies and other potential therapeutics targeting SARS-CoV-2 entry mechanisms. Finally, we present key outstanding questions associated with this critical process. Entry of SARS-CoV-2 into host cells is mediated by the interaction between the viral spike protein and its receptor angiotensin-converting enzyme 2, followed by virus–cell membrane fusion. Worldwide research efforts have provided a detailed understanding of this process at the structural and cellular levels, enabling successful vaccine development for a rapid response to the COVID-19 pandemic.