Deep learning based drug screening for novel coronavirus 2019-nCov
Haiping Zhang1, Konda Mani Saravanan1, Yang Yang2, Md. Tofazzal Hossain1,6, Junxin Li3,
Xiaohu Ren4, Yi Pan5, Yanjie Wei1*
1Center for High Performance Computing, Joint Engineering Research Center for Health Big
Data Intelligent Analysis Technology
Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen,
Guangdong, PR China 518055
2Shenzhen Key Laboratory of Pathogen and Immunity, Guangdong Key Laboratory for
Diagnosis and Treatment of Emerging Infectious Diseases, State Key Discipline of Infectious
Disease, Second Hospital Affiliated to Southern University of Science and Technology,
Shenzhen Third People's Hospital, Shenzhen, 518112, China
3Shenzhen Laboratory of Human Antibody Engineering, Institute of Biomedicine and
Biotechnology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences,
1068 Xueyuan Boulevard, University City of Shenzhen, XiliNanshan, Shenzhen, 518055,
China
4Institute of Toxicology, Shenzhen Center for Disease Control and Prevention, No 8
Longyuan Road, Nanshan District, Shenzhen, 518055, China
5Department of Computer Science, Georgia State University, Atlanta, United States of
America 30302-5060
6University of Chinese Academy of Sciences, No.19(A) Yuquan Road, Shijingshan District,
Beijing, P.R. China 100049
Corresponding Author: yj.wei@siat.ac.cn
ABSTRACT
A novel coronavirus called 2019-nCoV was recently found in Wuhan, Hubei Province of
China, and is now spreading across China and other parts of the world. Although some drugs
are being used to treat 2019-nCoV, there is no proper scientific evidence of their activity
against the virus. It is of high significance to develop a drug that can combat the virus
effectively to save valuable human lives. Developing a drug with traditional methods usually
takes a long time. Because 2019-nCoV is highly homologous to SARS-CoV, it is now better to
rely on alternative methods such as deep learning to develop drugs that can combat the
disease effectively. In the present work, we first collected virus RNA sequences of 18
patients reported to have 2019-nCoV from a public domain database, translated the RNA into
protein sequences, and performed multiple sequence alignment. After a careful literature
survey and sequence analysis, the 3C-like protease was considered a major therapeutic target,
and we built a 3D model of the 3C-like protease using homology modeling. Relying on the
structural model, we used a pipeline to perform large scale virtual screening with a deep
learning based method, developed recently in our group, that ranks and identifies
protein-ligand interacting pairs. Our model identified potential drugs for the 2019-nCoV
3C-like protease by screening four chemical compound databases (ChemDiv,
Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library, and
Targetmol-Bioactive_Compound_Library) and a database of tripeptides. In this paper, we
provide the list of possible chemical ligands (Meglumine, Vidarabine, Adenosine, D-Sorbitol,
D-Mannitol, Sodium_gluconate, Ganciclovir and Chlorobutanol) and peptide drugs (combinations
of isoleucine, lysine and proline) from the databases to guide experimental scientists in
validating, in a shorter time, the molecules that can combat the virus.
Keywords
Coronavirus; Deep learning; Drug screening; Homology modeling; 3C-like protease
Introduction
In December 2019, a severe respiratory illness similar to that caused by severe acute
respiratory syndrome coronavirus emerged in Wuhan, Hubei, China, and has been spreading all
over the world with high mortality. In the past, the beta coronaviruses severe acute
respiratory syndrome coronavirus (SARS-CoV) and Middle East respiratory syndrome coronavirus
(MERS-CoV) caused high mortality rates and became a threat to human life [1]. The most
recent outbreak of viral pneumonia was first disclosed by the Wuhan Municipal Health
Commission [2, 3], and the World Health Organization (WHO) was alerted to the outbreak by
Chinese officials [4]. The novel coronavirus (2019-nCoV) was isolated from the 27 patients
who were initially reported, and the number of patients was subsequently revised to 31498 as
of March 23, 2020, with 3267 deaths [5]. The current 2019-nCoV outbreak shares some features
with the SARS outbreak: both happened in winter, were linked to live animal markets, and
were caused by previously unknown coronaviruses [2, 5].
Fever, cough, and shortness of breath are the symptoms in common cases, whereas pneumonia,
severe acute respiratory syndrome and kidney failure are reported in severe cases [4]. Most
of the 2019-nCoV patients are linked to the Huanan Seafood Wholesale Market, where several
wild animals including bats and snakes, as well as poultry, are sold. So far, no specific
wild animal has been identified as the host of the novel coronavirus. The bat is considered
the native host of 2019-nCoV, although other hosts are involved in the transmission from
bats to humans [5]. The Spring Festival travel rush has accelerated the spread, so it is a
top priority to prevent further spread, develop a new drug to combat the virus, and cure the
patients in time. Knowledge about 2019-nCoV can be drawn from the earlier SARS-CoV. For
SARS-CoV, a variety of modern machine learning methods, in particular deep neural networks,
were used for drug discovery and development. These methods take advantage of bigger
datasets compiled from high throughput screening data and predict the bioactivities of a
target with high accuracy [6].
The genetic sequences of 2019-nCoV have shown similarity to SARS-CoV (79.5%) [7, 8]. The S
protein and the 3C-like protease are potential drug targets. The S protein is the main
target of neutralizing antibodies, and antibodies binding to this protein have the potential
to stop virus entry into host cells [9]. The 3C-like protease catalyzes a reaction that is
important in SARS coronavirus replicase polyprotein processing [10, 11]. Neutralizing
antibodies against the S protein of SARS have been obtained from human patients, and an
anti-SARS-CoV S antibody triggered fusogenic conformational changes [9]. This provides an
important clue for preventing virus entry into host cells with antibodies or peptides.
3C-like protease inhibitors also have the potential to prevent coronavirus maturation, and a
series of unsaturated ester inhibitors of the SARS-CoV 3C-like protease was deposited in the
PDB (Crystal structures of SARS-CoV 3C-like protease complexed with a series of unsaturated
esters, Protein Data Bank identifier: 3TIT).
One can also use these previous SARS inhibitors to design inhibitors against 2019-nCoV. With
the increasing number of protein-ligand complex structures, deep learning algorithms for
identifying and predicting potential binding compounds for a given target have become
possible [12, 13]. In addition to small molecule compounds, scientists also rely on peptides
and antibodies to combat the virus because of their stronger binding affinity. In the
post-genomics era, a Dense Fully Convolutional Neural Network (DFCNN) model can make drug
discovery more effective, faster and cheaper, because the deep layers of the model can learn
more features from the data and make accurate predictions. Using such techniques, the
antimalarial drug “pyrimethamine” was discovered against the dihydrofolate reductase (DHFR)
enzyme, and another drug, BPM31510, is in a phase II trial involving humans with advanced
pancreatic cancer [14–16]. Hence we believe that the integrated application of such machine
learning models as a pipeline for drug discovery has implications for therapeutic drug
targeting.
Considering all the above facts, in the present work we consider the 2019-nCoV_3C-like
protease as a potential target and built a structural model after systematically analyzing
its sequence features. We built a pipeline around a deep learning based method developed in
our group, which represents molecules as vectors, to identify potential drugs (peptides or
small ligands) against this protein target of the 2019-nCoV virus [13]. Our method is
extremely fast in virtual drug screening: it takes less than a day to finish a virtual
screen over millions of protein-ligand or protein-peptide predictions, whereas traditional
docking methods take several weeks even with the help of a supercomputer. Although the
2019-nCoV outbreak is a major challenge for clinicians [17], we believe the proposed list of
potential drugs can help them rapidly validate drugs that relieve symptoms or even cure the
disease.
Materials & Methods
Dataset and Sequence Alignment
We retrieved the virus RNA sequences from the Global Initiative on Sharing All Influenza
Data (GISAID) database [18] and aligned them with a focus on the S protein and the ligand
binding region of the 2019-nCoV_3C-like protease. The amino acid sequences were translated
from the RNA sequences with the Translate web tool (https://web.expasy.org/translate/). We
used sequences from 18 patients in this work (EPI_ISL_402119 to EPI_ISL_404228). Details of
the sequences and acknowledgement of the authors who submitted the data to the server are
presented in Supplementary Table S1. Multiple sequence alignment was performed with the
Clustal Omega program [19].
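The translation step can be illustrated with a minimal sketch. This is not the ExPASy Translate tool itself; it assumes the standard genetic code, a single reading frame starting at position 0, and a made-up toy sequence. The function name `translate` and the example codons are illustrative only.

```python
from itertools import product

BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code: codons enumerated in UCAG order map onto AMINO_ACIDS.
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(rna: str) -> str:
    """Translate frame 0 of an RNA (or DNA) string into one-letter amino acids.

    Translation stops at the first stop codon, a trailing partial codon is
    ignored, and codons with ambiguous bases (e.g. N) are reported as 'X'.
    """
    rna = rna.upper().replace("T", "U")
    protein = []
    for i in range(0, len(rna) - 2, 3):
        aa = CODON_TABLE.get(rna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


# Toy sequences (not patient data): a single base change UUU -> AUU
# replaces phenylalanine (F) with isoleucine (I).
wild_type = translate("AUGUUUAAACCU")
mutant = translate("AUGAUUAAACCU")
```

A phenylalanine-to-isoleucine substitution of the kind reported for one patient sequence in the Results can arise from a single base change like the one in this toy example.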
Homology modeling of 2019-nCoV_3C-like protease
The structural model of the 2019-nCoV_3C-like protease was built using Modeller 9.9 [20].
The SARS coronavirus 3C-like protease (PDB ID: 3TNT), which has about 96.07% amino acid
sequence identity with the target, was used as the template. The software outputs multiple
predicted structures, which are ranked according to the Discrete Optimized Protein Energy
(DOPE) score [21]. The quality of the model was validated by inspecting its stereochemical
quality on the Ramachandran map. The model was further checked with PROCHECK [22], ERRAT
[23] and QMEAN [24], and the final optimized structural model was used for further analysis.
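As a rough illustration of how a sequence identity figure between target and template is obtained from a pairwise alignment, the sketch below counts identical residues over ungapped columns. The choice of denominator (ungapped columns) is an assumption, since identity can also be normalized by alignment length or by the shorter sequence; the two sequences are toy strings, not the protease sequences.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percent identity over aligned columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == "-" or b == "-":
            continue  # skip gap columns
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return 100.0 * matches / compared


# Toy aligned fragments, made up for illustration only.
target = "ACDEFGHIK-LMNPQR"
template = "ACDEFGHIKWLMNPQK"
identity = percent_identity(target, template)  # 14 matches over 15 compared columns
```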
A deep learning model is used to virtually screen large databases
In our previous work, we built a Dense Fully Convolutional Neural Network (DFCNN) deep
learning model to reverse search drug targets. Here we apply this model to large scale
virtual screening. Since the method has been shown to have relatively high accuracy and
efficiency, it is well suited to an emerging disease outbreak. The DFCNN is a densely
connected, fully connected neural network; the dense connectivity (similar to DenseNet, but
with the convolution layers replaced by fully connected layers) allows deep layers without
the vanishing gradient problem. The deeper layers allow it to learn more abstract features
from the data. The training data of DFCNN come from the PDBbind database [25], for which we
define the crystal protein-ligand PDB complexes as positives and cross-docking complexes as
negatives. The detailed process of building the deep learning model is described in our
recently published work on virtually screening targets for an input small molecule using a
vector representation [13]. The overall workflow of the proposed method is shown in
Figure 1. The DFCNN model has two advantages over many other methods: it is independent of
docking simulation, and its training dataset includes nonbinding decoys. The independence
from docking simulation makes it extremely fast, while the inclusion of nonbinding decoys
during training makes the model robust in real application scenarios.
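The dense connectivity described above (each fully connected layer receives the original input concatenated with the outputs of all earlier layers) can be sketched in plain Python. This is only a forward pass with random weights, not the trained DFCNN. The layer width `growth`, the number of blocks, the ReLU activation and the sigmoid output are all assumptions made for illustration; only the input size of 600 comes from the text.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random weights and zero biases for one fully connected layer."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense_forward(weights, biases, x):
    """Compute relu(W x + b) for a single input vector."""
    return [max(0.0, sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


def build_network(input_dim, growth, n_blocks, seed=0):
    """Hidden layer k sees the input plus the outputs of all earlier hidden layers."""
    rng = random.Random(seed)
    hidden = [make_layer(input_dim + k * growth, growth, rng) for k in range(n_blocks)]
    head = make_layer(input_dim + n_blocks * growth, 1, rng)
    return hidden, head


def predict(network, x):
    """Return a score in (0, 1) for one concatenated pocket-ligand vector."""
    hidden, (head_w, head_b) = network
    features = list(x)
    for weights, biases in hidden:
        # Dense connectivity: concatenate new features instead of replacing old ones.
        features = features + dense_forward(weights, biases, features)
    logit = sum(w * v for w, v in zip(head_w[0], features)) + head_b[0]
    return 1.0 / (1.0 + math.exp(-logit))


net = build_network(input_dim=600, growth=32, n_blocks=4)  # hypothetical sizes
score = predict(net, [0.1] * 600)
```

Because every layer keeps a direct path to the input and to all earlier features, gradients have short routes back through the network, which is the usual argument for why DenseNet-style connectivity tolerates depth.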
Virtual screening against the ChemDiv database
The structural model of the ligand binding region of the 2019-nCoV_3C-like protease is used
as the target protein structure. We define the pocket as the residues within a cutoff
distance of 1 nm from the known ligand (the binding site is defined based on the ligand from
the template PDB 3TNT). The ligand database is taken from ChemDiv
(https://www.chemdiv.com/) and contains around 1,000,000 compounds. We first used the DFCNN
model to perform large scale virtual screening. The mean and standard deviation of the
training dataset were used for data normalization to obtain a more stable performance. In
the second stage, the top predictions of the DFCNN model were chosen for AutoDock Vina based
docking simulation. The docking results were visualized and examined with Discovery Studio
Visualizer [26]. Finally, we provide a list of proposed compounds that have the potential to
bind the protein pocket.
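The pocket definition above can be sketched as a simple distance filter. The sketch assumes that a residue belongs to the pocket when any of its atoms lies within the cutoff of any ligand atom; the text does not say whether the distance is measured per atom or per residue centre. The coordinates are toy values in nanometres (PDB files store ångströms, where the same cutoff is 10 Å), and the function name is hypothetical.

```python
import math

CUTOFF_NM = 1.0  # cutoff from the text; toy coordinates below are in nanometres


def pocket_residues(protein_atoms, ligand_atoms, cutoff=CUTOFF_NM):
    """Return sorted residue ids with at least one atom within `cutoff` of a ligand atom.

    protein_atoms: iterable of (residue_id, (x, y, z))
    ligand_atoms:  iterable of (x, y, z)
    """
    ligand_atoms = list(ligand_atoms)
    selected = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in selected:
            continue  # residue already accepted through another atom
        if any(math.dist(xyz, lig) <= cutoff for lig in ligand_atoms):
            selected.add(residue_id)
    return sorted(selected)


# Toy coordinates for illustration only, not taken from 3TNT.
protein = [
    (41, (0.0, 0.0, 0.0)),
    (41, (0.2, 0.0, 0.0)),
    (145, (0.9, 0.3, 0.0)),
    (200, (2.5, 0.0, 0.0)),
]
ligand = [(0.1, 0.0, 0.0), (0.3, 0.1, 0.0)]
pocket = pocket_residues(protein, ligand)
```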
Virtual screening against Targetmol-Approved_Drug_Library,
Targetmol-Natural_Compound_Library, and Targetmol-Bioactive_Compound_Library
The Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library, and
Targetmol-Bioactive_Compound_Library contain about 2040, 1680, and 5370 compounds,
respectively. We applied the DFCNN model to perform virtual screening against these three
libraries for the 2019-nCoV_3C-like protease. The compounds with high DFCNN scores are
recommended as potential inhibitors for further experimental validation.
Virtual screening against tripeptide database
A tripeptide database with a total size of 8000 (20 × 20 × 20 combinations of the 20 amino
acids) was first built. Each amino acid in the tripeptide database was converted into a
molecule vector by Mol2vec [27]. For each peptide, the sum of its amino acid vectors was
used as the peptide’s vector. The protein pocket is defined as the residues within a cutoff
distance of 1 nm from the known ligand. The pocket is then converted into a vector. The
pocket and peptide vectors are then concatenated into one line as input with a maximum
dimension of 600. We use the same DFCNN model, a densely fully connected model trained on a
protein-ligand dataset from the PDBbind database. Since both ligands and peptides are
composed of chemical groups, the model trained on protein-ligand complexes should also be
suitable for protein-small peptide interactions.
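The construction of the tripeptide inputs can be sketched as follows. The embedding here is a random integer-valued stand-in, not Mol2vec, and the per-vector size of 300 is an assumption chosen so that pocket plus peptide gives the stated 600 dimensions. The names `toy_embedding`, `peptide_vector` and `model_input` are hypothetical.

```python
from itertools import product
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
EMBED_DIM = 300  # assumed vector size, so that 300 + 300 = 600

rng = random.Random(0)
# Hypothetical stand-in for Mol2vec: one integer-valued vector per amino acid.
toy_embedding = {aa: [rng.randint(-5, 5) for _ in range(EMBED_DIM)] for aa in AMINO_ACIDS}


def peptide_vector(peptide):
    """Sum the per-residue vectors, as described for the tripeptide database."""
    return [sum(column) for column in zip(*(toy_embedding[aa] for aa in peptide))]


def model_input(pocket_vector, peptide):
    """Concatenate the pocket vector and the peptide vector into one input row."""
    return list(pocket_vector) + peptide_vector(peptide)


tripeptides = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]  # 20**3 sequences
pocket_vector = [0] * EMBED_DIM  # placeholder for the real pocket vector
row = model_input(pocket_vector, "IKP")
distinct_inputs = {tuple(peptide_vector(p)) for p in tripeptides}
```

One consequence of summing residue vectors is that the peptide vector does not depend on residue order, so all permutations of the same three amino acids receive the same input. Under this representation the 8000 sequences collapse to at most 1540 distinct compositions, which may be why Table 5 lists permutations such as IKP, IPK and KIP together.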
Results
Sequence alignment and homology modeling
The RNA sequences of 18 patients obtained from the GISAID public domain database were
translated into protein sequences with the Translate tool. The ligand binding site of the
template protein (3TNT) is used as the reference to define the pocket region of our homology
model. We checked for mutations in the pocket region of the 2019-nCoV_3C-like protease, and
the sequences from the 18 different patients are 100% identical in this region. This
indicates that the virus is highly conserved in this region and that the site is suitable
for drug design. The alignment of S-protein epitope regions also shows high conservation
among the patients (Supplementary Figure S1). From the figure, it is observed that the RNA
sequence EPI_ISL_402132 has a point mutation at the 32nd position, where the codon of
phenylalanine is replaced by that of isoleucine. The 2019-nCoV_3C-like protease was also
aligned to the SARS-CoV protease with Clustal Omega [19]. The aligned sequences are shown in
Figure 2. There are 276 amino acid residues in both proteins. The figure indicates high
similarity between 2019-nCoV and SARS-CoV, which is consistent with the findings of Xu et
al. (2020) [5]. Using the X-ray crystallographic structure of the SARS coronavirus 3C-like
protease solved at 1.59 Å resolution, a theoretical protein model was built for the
2019-nCoV_3C-like protease using the Modeller software. Figure 3A shows the crystallographic
structure of the SARS_coronavirus_3C-like protease and Figure 3B shows the homology model of
the 2019-nCoV_3C-like protease. There are only four mutations (T35V, A46S, S94A and K180N)
between the SARS_coronavirus_3C-like protease and the 2019-nCoV_3C-like protease, as shown
in Figure 3A and B. In the figure, the mutated residues are marked in blue. Figure 3C shows
the model structure with a known SARS_coronavirus_3C-like protease inhibitor. The binding
pocket and the two-dimensional ligand interaction pattern of the target protein are shown
with reference to the template. There are 23 protein-ligand interactions, including 15
hydrogen bonds, one disulphide bond and a few pi stacking interactions, which are shown in
Figure 3D. The pocket extracted from the model is used for the subsequent large-scale
virtual screening.
Virtual screening against 4 small molecular compound databases
The ChemDiv dataset, widely used for large scale virtual screening, contains a large number
(~1,000,000) of drug-like compounds or drug leads. The potential drug candidates with the
best scores (AutoDock Vina score and our deep learning model score) from the ChemDiv dataset
are presented in Table 1. Interestingly, the compound with identifier “C998-0189” has the
best Vina score among the six compounds listed. The name of the compound is
N~2~-(3,5-dimethylphenyl)-N~2~-(5,5-dioxido-3a,4,6,6a-tetrahydrothieno[3,4-d][1,3]thiazol-2-yl)-N~1~-[3-(trifluoromethyl)phenyl]glycinamide,
with molecular formula C22H22F3N3O3S2. The molecular weight of the compound is 497.6 g/mol
and the compound satisfies most of the drug-likeness parameters, including Lipinski’s
filters. The other five recommended compounds also have reasonable Vina scores, between -7.3
and -7.9 kcal/mol (Table 1), with important stabilizing interactions.
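The reported molecular weight can be checked directly from the molecular formula. The sketch below uses rounded standard atomic weights and a small parser that handles only simple formulas without brackets; it illustrates the arithmetic and is not the tool used by the authors.

```python
import re

# Rounded standard atomic weights in g/mol; only the elements needed here.
ATOMIC_WEIGHT = {"C": 12.011, "H": 1.008, "F": 18.998, "N": 14.007, "O": 15.999, "S": 32.06}


def molecular_weight(formula: str) -> float:
    """Molecular weight of a simple formula such as 'C22H22F3N3O3S2' (no brackets)."""
    total = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        total += ATOMIC_WEIGHT[element] * (int(count) if count else 1)
    return total


mw = molecular_weight("C22H22F3N3O3S2")
passes_mw_filter = mw <= 500.0  # molecular-weight criterion of Lipinski's rule of five
```

The sum comes to roughly 497.6 g/mol, in agreement with the value quoted above and just under the 500 g/mol limit of Lipinski's rule of five.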
The top 100 predictions of our deep learning model against the database are shown in
Supplementary Table S2. The top five compounds, with ChemDiv identifiers 8017-4328,
8017-4325, 8002-7777, 8004-0123 and 8010-0095, have the highest DFCNN scores. Three other
well known compound libraries were screened in the present work:
Targetmol-Approved_Drug_Library, Targetmol-Natural_Compound_Library and
Targetmol-Bioactive_Compound_Library. It is worth testing whether any natural compound can
combat the virus by inhibiting the 2019-nCoV_3C-like protease. Table 2 shows the screening
result for the Targetmol-Natural compound library. The compounds with a DFCNN score of at
least 0.997 are listed in Table 2, and Adenosine, Vidarabine, Mannitol, Dulcitol,
D-Sorbitol, D-Mannitol, Allitol and Sodium_gluconate are the top predictions. Natural
products are often active ingredients of known herbal medicines and are relatively safe
because of their long history of use. If experiments prove that a compound is effective
against the target, patients can easily access it by taking the corresponding herbal
medicine. There are about 8 compounds with a score of 0.999 and about 20 compounds with a
score of 0.998, which are presented in Table S2. As indicated above, most of the drugs
listed by our model are antiviral drugs; hence they can be tested against 2019-nCoV and
validated in the clinical lab within a short time.
The screening result for the Targetmol-Approved Drug library is shown in Table 3. The
compounds with a DFCNN score of at least 0.997 are listed in Table 3. We randomly selected
drugs from the list of potential drugs and performed a systematic literature search.
Meglumine, Vidarabine, Adenosine, D-Sorbitol, D-Mannitol, Sodium_gluconate, Ganciclovir and
Chlorobutanol are the top predictions according to the DFCNN score (Table 3). Interestingly,
we found that most of the drugs in the list, such as Meglumine, Ganciclovir and Vidarabine,
show antiviral activity. The list of all compounds with a score above 0.990 is provided in
Table S4. The screening result for the Targetmol-Bioactive_Compound_Library is shown in
Table 4. The compounds with a DFCNN score of at least 0.997 are listed in Table 4. Bioactive
compounds are chemicals that can be found in plants and some foods and have been studied in
the prevention of various diseases. It is worth checking whether any of them can act on the
target protein. We found that compounds such as Vidarabine, Adenosine, Dulcitol, D-Sorbitol,
D-Mannitol, Ganciclovir and 5'-Deoxyadenosine are the top predictions in the
Targetmol-Bioactive compounds (Table 4). The list of all compounds with a score above 0.99
is provided in Table S5. The list in Table 4 narrows down the hit compounds for later drug
development stages, such as molecular dynamics simulation or even direct experimental
validation, for finding bioactive compounds against the 2019-nCoV_3C-like protease.
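The grouping of hits into score bands used in Tables 2 to 4 can be sketched as a small binning function. The compound names and scores below are placeholders for illustration and are not values from the screening; the function name and the choice to drop compounds below the lowest edge are assumptions.

```python
def bin_by_score(scores, edges=(0.999, 0.998, 0.997)):
    """Group names into descending score bands: >= 0.999, [0.998, 0.999), [0.997, 0.998).

    Compounds scoring below the lowest edge are left out.
    """
    bins = {edge: [] for edge in edges}
    for name, score in sorted(scores.items(), key=lambda item: -item[1]):
        for edge in edges:
            if score >= edge:
                bins[edge].append(name)
                break
    return bins


# Hypothetical scores for illustration only.
toy_scores = {
    "compound_A": 0.9995,
    "compound_B": 0.9982,
    "compound_C": 0.9971,
    "compound_D": 0.9500,
}
binned = bin_by_score(toy_scores)
```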
Virtual screening against database of tripeptides
Peptides have the potential to exert higher binding affinity and specificity than small
molecule compounds, while small peptides are easier to synthesize than small molecules and
antibodies. Since the known ligands of the SARS_3C-like protease are compounds similar to
tripeptides, and the number of tripeptide combinations of the 20 amino acids is affordable
for our method, we decided to perform virtual screening on the tripeptides. The screened
tripeptides with a DFCNN score of at least 0.995 (bands at 0.997, 0.996 and 0.995) for the
2019-nCoV_3C-like protease are shown in Table 5. A higher value indicates that the peptide
is more likely to bind the pocket of the 2019-nCoV_3C-like protease. Our method found that
the peptides formed by the amino acids I, K and P have the highest probability of binding in
the pocket. The combinations G, K, L; G, K, K; and K, P, V are also predicted by DFCNN to be
favorable binding partners (Table 5). The list of all tripeptides with a score above 0.99 is
provided in Table S6. The combination of short peptides and their composition play a crucial
role in the overall conformation of a protein [28, 29]. Tripeptides, pentapeptides and
octapeptides are considered promising candidates for drug development against infectious
diseases [30, 31]. Since these peptides are relatively easy to produce, many of the top
predictions can be validated experimentally in a fast and inexpensive manner.
Conclusion
Designing small compound or peptide drugs to cure 2019-nCoV infection is extremely urgent.
Effective and safe drugs are required for treating this deadly viral disease, which has
caused an epidemic outbreak all over the globe. Researchers use different modern
technologies to combat such diseases, and deep learning is one of them, offering faster
prediction and an accuracy of more than about 80%. With its extremely high speed and
relatively high accuracy, our DFCNN model for 3C-like protease-ligand interaction analysis
is suitable for screening tens of thousands of drugs in a short time in emergency situations
such as the 2019-nCoV outbreak. Our deep learning model based on DFCNN is a data-driven
model, which learns 3C-like protease-ligand interaction from known binder and non-binder
data. The model uses the binding pocket of the 3C-like protease-ligand conformation instead
of the whole conformation of the complex; hence it is fast and accurate compared with
molecular docking procedures.
The identified potential 3C-like protease-ligand pairs can be subjected to MD simulation to
further check the binding stability and atomic interaction pattern, or even the binding free
energy with techniques such as metadynamics, to narrow down the candidate list. A variety of
repurposed and investigational drugs have been identified in the past. Screening of National
Medical Products Administration (NMPA) approved drug libraries and other chemical libraries
has identified novel agents. Hundreds of clinical trials involving remdesivir, chloroquine,
favipiravir, convalescent plasma, TCM and other interventions are planned or underway. In
this connection, we have performed a deep learning based drug screening and provided lists
of potential compounds and tripeptides for the 2019-nCoV_3C-like protease. Since the
inhibitor candidates from the approved drug library are on-market drugs, the list can help
to facilitate drug development against the 2019-nCoV_3C-like protease and could be used
immediately.
References
1. Huang C, Wang Y, Li X, et al (2020) Clinical features of patients infected with 2019
novel coronavirus in Wuhan, China. Lancet 395(10223):497-506.
https://doi.org/10.1016/S0140-6736(20)30183-5
2. Lu H, Stratton CW, Tang Y (2020) Outbreak of Pneumonia of Unknown Etiology in
Wuhan China: the Mystery and the Miracle. J Med Virol 92(4):401-402.
https://doi.org/10.1002/jmv.25678
3. Thompson R (2020) Pandemic potential of 2019-nCoV. Lancet Infect Dis 20(3):P280.
https://doi.org/10.1016/s1473-3099(20)30068-2
4. Hui DS, I Azhar E, Madani TA, et al (2020) The continuing 2019-nCoV epidemic
threat of novel coronaviruses to global health - The latest 2019 novel coronavirus
outbreak in Wuhan, China. Int J Infect Dis 91:264-266.
https://doi.org/10.1016/j.ijid.2020.01.009
5. Xu X, Chen P, Wang J, Feng J, Zhou H, Li X, Zhong W, Hao P (2020) Evolution of the
novel coronavirus from the ongoing Wuhan outbreak and modeling of its spike protein
for risk of human transmission. Sci China Life Sci 63:457-460.
https://doi.org/10.1007/s11427-020-1637-5
6. Ekins S, Puhl AC, Zorn KM, et al (2019) Exploiting machine learning for end-to-end
drug discovery and development. Nat Mater 18:435-441.
https://doi.org/10.1038/s41563-019-0338-z
7. Zhou P, Yang X-L, Wang X-G, et al (2020) Discovery of a novel coronavirus
associated with the recent pneumonia outbreak in humans and its potential bat origin.
Nature 579:270-273. https://doi.org/10.1101/2020.01.22.914952
8. Lu R, Zhao X, Li J, et al (2020) Genomic characterisation and epidemiology of 2019
novel coronavirus: implications for virus origins and receptor binding. Lancet
395(10224):565-574. https://doi.org/10.1016/S0140-6736(20)30251-8
9. Walls AC, Xiong X, Park YJ, et al (2019) Unexpected Receptor Functional Mimicry
Elucidates Activation of Coronavirus Fusion. Cell 176(5):1026-1039.
https://doi.org/10.1016/j.cell.2018.12.028
10. Goetz DH, Choe Y, Hansell E, et al (2007) Substrate specificity profiling and
identification of a new class of inhibitor for the major protease of the SARS
Coronavirus. Biochemistry 46(30):8744-8752. https://doi.org/10.1021/bi0621415
11. Kim Y, Lovell S, Tiew K-C, et al (2012) Broad-Spectrum Antivirals against 3C or
3C-Like Proteases of Picornaviruses, Noroviruses, and Coronaviruses. J Virol
86(21):11754-11762. https://doi.org/10.1128/jvi.01348-12
12. Zhang H, Liao L, Saravanan KM, et al (2019) DeepBindRG: a deep learning based
method for estimating effective protein–ligand affinity. PeerJ 7:e7362.
https://doi.org/10.7717/peerj.7362
13. Zhang H, Liao L, Cai Y, et al (2019) IVS2vec: A tool of Inverse Virtual Screening
based on word2vec and deep learning techniques. Methods 166:57-65.
https://doi.org/10.1016/j.ymeth.2019.03.012
14. Fleming N (2018) How artificial intelligence is changing drug discovery. Nature
557:S55-S57. https://doi.org/10.1038/d41586-018-05267-x
15. Liu Z, Du J, Fang J, et al (2019) DeepScreening: a deep learning-based screening web
server for accelerating drug discovery. Database (Oxford) 2019:1-11.
https://doi.org/10.1093/database/baz104
16. Chen H, Engkvist O, Wang Y, et al (2018) The rise of deep learning in drug discovery.
Drug Discov Today 23(6):1241-1250. https://doi.org/10.1016/j.drudis.2018.01.039
17. Russell CD, Millar JE, Baillie JK (2020) Clinical evidence does not support
corticosteroid treatment for 2019-nCoV lung injury. Lancet 395:473-475.
https://doi.org/10.1016/S0140-6736(20)30317-2
18. Shu Y, McCauley J (2017) GISAID: Global initiative on sharing all influenza data –
from vision to reality. Eurosurveillance 22(13):30494.
https://doi.org/10.2807/1560-7917.ES.2017.22.13.30494
19. Sievers F, Higgins DG (2018) Clustal Omega for making accurate alignments of many
protein sequences. Protein Sci 27(1):135-145. https://doi.org/10.1002/pro.3290
20. Fiser A, Šali A (2003) MODELLER: Generation and Refinement of Homology-Based
Protein Structure Models. Methods Enzymol 374:461-491.
https://doi.org/10.1016/S0076-6879(03)74020-8
21. Shen M, Sali A (2006) Statistical potential for assessment and prediction of protein
structures. Protein Sci 15(11):2507-2524. https://doi.org/10.1110/ps.062416606
22. Laskowski RA, MacArthur MW, Moss DS, Thornton JM (1993) PROCHECK: a
program to check the stereochemical quality of protein structures. J Appl Crystallogr
26:283-291. https://doi.org/10.1107/S0021889892009944
23. Colovos C, Yeates TO (1993) Verification of protein structures: Patterns of nonbonded
atomic interactions. Protein Sci 2(9):1511-1519.
https://doi.org/10.1002/pro.5560020916
24. Benkert P, Tosatto SCE, Schomburg D (2008) QMEAN: A comprehensive scoring
function for model quality assessment. Proteins 71:261-277.
https://doi.org/10.1002/prot.21715
25. Liu Z, Li Y, Han L, et al (2015) PDB-wide collection of binding data: Current status of
the PDBbind database. Bioinformatics 31(3):405-412.
https://doi.org/10.1093/bioinformatics/btu626
26. Accelrys: Materials Studio is a Software Environment for Molecular Modeling (2009)
Dassault Systèmes BIOVIA. Discovery. https://doi.org/10.1007/s10822-010-9395-8
27. Jaeger S, Fulle S, Turk S (2018) Mol2vec: Unsupervised Machine Learning Approach
with Chemical Intuition. J Chem Inf Model 58(1):27-35.
https://doi.org/10.1021/acs.jcim.7b00616
28. Santos S, Torcato I, Castanho MARB (2012) Biomedical applications of dipeptides
and tripeptides. Biopolymers 98(4):288-293. https://doi.org/10.1002/bip.22067
29. Saravanan KM, Selvaraj S (2012) Search for identical octapeptides in unrelated
proteins: Structural plasticity revisited. Biopolymers 98(1):11-26.
https://doi.org/10.1002/bip.21676
30. Wendler J, Schröder BO, Ehmann D, et al (2018) Tu1860 - A Novel Octapeptide as a
Promising Candidate for Antibiotic Drug Development and Host Derived Microbiome
Regulation. Gastroenterology 154(6):S-1040.
https://doi.org/10.1016/s0016-5085(18)33486-3
31. Saravanan KM, Dunker AK, Krishnaswamy S (2017) Sequence Fingerprints
Distinguish Erroneous from Correct Predictions of Intrinsically Disordered Protein
Regions. J Biomol Struct Dyn 36(16):4338-4351.
https://doi.org/10.1080/07391102.2017.1415822
Figure 1. The workflow of virtual screening of small chemical compounds and tripeptides
against the 2019-nCoV_3C-like protease.
Figure 2. The sequence alignment of SARS_coronavirus_3C-like protease and
2019-nCoV_3C-like protease.
Figure 3. The structural model of 2019-nCoV_3C-like protease and its template. In panels A
and B, the modeled 2019-nCoV_3C-like protease and SARS_3C-like protease are shown with
the four mutated residues marked in blue. The ligand from the PDB 3TNT is
transferred to the modeled structure (Panel C) and based on residue distance from the
transferred ligand, we define the pocket (Panel D). The interaction between the ligand and the
modeled 2019-nCoV_3C-like protease is also shown (Panel D).
Table 1. The selected compounds that may inhibit 2019-nCoV_3C-like protease based on the
DFCNN score and AutoDock Vina score.
ChemDiv ID    Vina score (kcal/mol)    DeepBindVec    Recommendation
C998-0189     -8.5                     >0.995         Recommended
C998-0197     -7.9                     >0.995         Can Try
C998-0090     -7.8                     >0.995         Can Try
C998-0948     -7.7                     >0.995         Recommended
C998-1046     -7.6                     >0.995         Recommended
D076-0195     -7.3                     >0.995         Recommended
Table 2. The potential drug candidates selected from the Targetmol-Natural compound
library.
DFCNN score               Natural Compound
Score >= 0.999            Adenosine; Vidarabine; Mannitol; Dulcitol; D-Sorbitol; D-Mannitol; Allitol; Sodium_gluconate
0.999 > Score >= 0.998    L(-)-sorbose; D-(-)-Fructose; Guanosine; Inosine; Trichostatin_A; D-(-)-Ribose; DL-Xylose; Cordycepin; β-Glycerophosphate_disodium_salt_hydrate; Xanthosine; Zeatin; N6-methyladenosine; Atractylodin; Tubercidin; Glucosamine_sulfate; Panthenol; Dexpanthenol; Ubenimex; Phospho(enol)pyruvic_acid_monopotassium
0.998 > Score >= 0.997    Aztreonam; Cytidine; Cytarabine; D-Saccharic_acid_potassium_salt; D-Glucose_6-phosphate_sodium_salt; Quinic_acid; 2'-Deoxyadenosine_monohydrate; N-Sulfo-glucosamine_sodium_salt; 2'-Deoxyguanosine_monohydrate
Table 3. The potential drug candidates selected from the Targetmol-Approved Drug library.
DFCNN score               Approved Drug name
Score >= 0.999            Meglumine; Vidarabine; Adenosine; D-Sorbitol; D-Mannitol; Sodium_gluconate; Ganciclovir; Chlorobutanol
0.999 > Score >= 0.998    AICAR_(Acadesine); Mylosar; Inosine; D-Pantothenic_acid_sodium_salt; DL-Xylose; Ethambutol_dihydrochloride; Glucosamine; Myclobutanil; Sodium_etidronate; Fludarabine; Gemcitabine; Emtricitabine; Tubercidin; Bestatin_hydrochloride; Panthenol; Dexpanthenol; Cladribine; Entecavir; Ubenimex
0.998 > Score >= 0.997    Entecavir_hydrate; Procarbazine_hydrochloride; Aztreonam; Disopyramide; Benznidazole; Clofarabine; Bucetin; Nifuroxazide; Triflupromazine_hydrochloride; Doxifluridine; Cytarabine; Cefdinir; Bupropion_hydrochloride; Fluoxetine; Tenofovir; Pentostatin; Fluoxetine_hydrochloride; Imazalil; Atenolol
Table 4. The potential drug candidates selected from the Targetmol-Bioactive compounds.
DFCNN score               Bioactive Compound
Score >= 0.999            Vidarabine; Adenosine; Dulcitol; D-Sorbitol; D-Mannitol; Ganciclovir; 5'-Deoxyadenosine
0.999 > Score >= 0.998    Nelarabine; Tosedostat; Fosfomycin_Tromethamine; AICAR_(Acadesine); Mylosar; Guanosine; Inosine; Crotonoside; D-(-)-Ribose; Cordycepin; β-Glycerophosphate_disodium_salt_hydrate; Zeatin; Ethambutol_dihydrochloride; 5-Iodotubercidin; Myclobutanil; Sodium_etidronate; Atractylodin; Fludarabine; Heterophyllin_B; Gemcitabine; Emtricitabine; Disodium_clodronate_tetrahydrate; Ostarine; Tubercidin; Bestatin_hydrochloride; Panthenol; Dexpanthenol; FCCP; Cladribine; Z-VAD(OMe)-FMK; WP1066; Entecavir; Ubenimex; Batimastat; ML264; GSK4112; Degrasyn; Cefcapene_Pivoxil_Hydrochloride; Phospho(enol)pyruvic_acid_monopotassium; A-804598; SR3335; IPTG
0.998 > Score >= 0.997    KYA1797K; Mizoribine; 5-Hydroxy-1,7-diphenyl-6-hepten-3-one; ATPO; Entecavir_hydrate; Aztreonam; NXY-059; D-Pantothenic_acid; Bay_11-7085; Disopyramide; Benznidazole; SB_297006; Imidafenacin; Clofarabine; Bucetin; Nifuroxazide; Triflupromazine_hydrochloride; Doxifluridine; Selegiline_hydrochloride; Cytarabine; Cytidine; BGP-15; Cefdinir; Bupropion_hydrochloride; UK-371804; Fluoxetine; D-Saccharic_acid_potassium_salt; D-Glucose_6-phosphate_sodium_salt; J147; Tenofovir; N-Sulfo-glucosamine_sodium_salt; Pentostatin; Fluoxetine_hydrochloride; Nifurtimox; Imazalil; 5-Fluorouridine; Atenolol; Repertaxin; ACY-738
Table 5. The predicted tripeptides that have a high possibility (DFCNN score >= 0.995) of
binding the pocket of 2019-nCoV_3C-like protease according to the DFCNN score.
DFCNN score               Peptide sequence
Score >= 0.997            IKP; IPK; KIP; KPI; PIK; PKI
0.997 > Score >= 0.996    GKL; LGK; LKG; KGL; KLG; GKK; KGK; KKG; AKK; KAK; KKA; KPV; KVP; PKV; PVK; VKP; VPK
0.996 > Score >= 0.995    GKI; IGK; IKG; KGI; KIG; LKP; LPK; KLP; KPL; PLK; PKL; LLK; LKL; KLL