Reproducibility Project: DeepSite
Tatiana Malygina & Viacheslav Borovitskiy & Yuri Porozov
The laboratory of bioinformatics, ITMO University
Our goal is to reproduce one of the possible approaches to the in silico binding site prediction problem,
the one described in the DeepSite paper. We follow the original protocol and add several simplifications.
Our ultimate goal is to reproduce the approach as closely as possible, to compare results, to interpret the
outputs of network layers, and to experiment with modifications of the original CNN implementation. We
describe the problems we had to deal with and our future work below.
Suppose that we have a protein structure. We also know that it interacts with several other compounds
in the human body and that something is wrong with one of these interactions. If we want to prevent this
protein from interacting with other compounds (or, in the case of protein malfunction, to strengthen
these interactions), we must know the specific place on the protein's surface where the interaction of
interest takes place.
This specific region of the protein surface is called the active site (or binding site) and is shown
schematically in Figure 1.
Figure 1: Schematic of the interaction between a protein surface and a ligand (a small molecule, in this
case caffeine).
When information about the interaction is available, we can define this region with a distance cutoff
from the small molecule's atoms. When no such information is available, we want to predict de novo
which regions of the protein surface can potentially bind small molecules, and later use this
knowledge for molecular docking.
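The distance-cutoff definition can be illustrated with a minimal sketch. It assumes atom coordinates are already loaded as arrays; the function name `binding_site_mask`, the toy coordinates and the 4 Å cutoff value are illustrative assumptions and are not taken from the original protocol.

```python
import numpy as np

def binding_site_mask(protein_xyz, ligand_xyz, cutoff=4.0):
    """Boolean mask over protein atoms that lie within `cutoff` angstroms
    of at least one ligand atom (cutoff value is an assumption)."""
    protein_xyz = np.asarray(protein_xyz, dtype=float)
    ligand_xyz = np.asarray(ligand_xyz, dtype=float)
    # Pairwise difference vectors, shape (n_protein, n_ligand, 3)
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # An atom belongs to the site if its nearest ligand atom is close enough
    return dist.min(axis=1) <= cutoff

# Toy coordinates (hypothetical, not from a real structure)
protein = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0]])
site = binding_site_mask(protein, ligand, cutoff=4.0)
```

Here the first two protein atoms are 1 Å and 2 Å away from the ligand atom and fall inside the site, while the third one (9 Å away) does not.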
As the data source we use scPDB, a structural database of protein-ligand pairs extracted
from the Protein Data Bank, which were manually curated, annotated and clustered.
The currently available version of this database contains about 16k protein-ligand pairs. The
information for each pair includes .mol2 files for the ligand, the protein, and the protein's active site.
To better understand how to work with scPDB data, we started by reproducing the results reported
in the DeepSite paper.
Although the DeepSite Supplementary provides code for the neural network architecture written in keras,
the authors do not provide code for the data preparation and feature extraction step. We follow the
instructions given in the article to reproduce this step and slightly modify it to simplify the problem
and to be able to produce draft results before comparing on the whole dataset.
We start with feature extraction and dataset preparation. The original article uses scPDB v.2013 for
network training. Currently only the latest version, v.2017, is available; with 16k protein-ligand pairs
it is bigger than DeepSite's training dataset.
The DeepSite article mentions that scPDB has annotations with clustering information. We exported a .csv
file with these annotations from the scPDB website; for each protein-ligand pair it contains UNIPROT ID,
UNIPROT AC, and a CLUSTER ID field. CLUSTER ID is empty for all protein-ligand pairs in the database,
which is why we could not reproduce the original filtering. DeepSite's authors mention
that they provide a list of selected structures in the Supplementary, but it is not provided.
For each unique UNIPROT ID we select one protein-ligand pair, thus reducing the dataset from more than
16000 to 5010 records. This approach doesn't guarantee the elimination of similar binding sites, since
different proteins with different amino acid sequences can share similar function and shape. It also
diminishes the variability of the data, since a protein can have several binding sites, and protein-ligand
pairs with the same UNIPROT ID can describe different binding sites located in different parts of the
protein's surface. It simply reduces the amount of data, which is huge, and simplifies our experiment.
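The selection of one pair per UNIPROT ID can be sketched as follows. This is a minimal illustration rather than the project's actual code: the column names `PDB_ID` and `UNIPROT_ID`, the toy records, and the choice to keep the first record seen for each ID are all assumptions.

```python
import csv
import io

def one_pair_per_uniprot(rows, key="UNIPROT_ID"):
    """Keep the first record seen for every distinct value of `key`."""
    selected = {}
    for row in rows:
        # setdefault stores the row only if this ID has not been seen before
        selected.setdefault(row[key], row)
    return list(selected.values())

# Toy annotation table; column names and values are hypothetical
toy_csv = io.StringIO(
    "PDB_ID,UNIPROT_ID\n"
    "1abc,P00001\n"
    "2def,P00001\n"
    "3ghi,P00002\n"
)
rows = list(csv.DictReader(toy_csv))
kept = one_pair_per_uniprot(rows)
print([r["PDB_ID"] for r in kept])  # ['1abc', '3ghi']
```

Keeping the first record is an arbitrary rule; any other deterministic choice (for example, the best-resolution structure) would fit the same scheme.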
Next we follow DeepSite's original protocol. We use HTMD for feature extraction and split its
output into blocks of 16×16×16 voxels with a step of 4. We mark a block as positive if its center is
closer than 4 Å to the binding site's geometric center and as negative otherwise (as stated in the
original article).
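The block extraction and labelling step can be sketched with plain NumPy. The sketch does not call HTMD; it assumes the descriptors are already available as a dense array of shape (nx, ny, nz, channels). The function names, the 1 Å voxel size, the 8 channels of the toy grid and the `reference_center` argument (the point that block centers are compared against) are assumptions made for illustration.

```python
import numpy as np

def extract_blocks(grid, origin, voxel_size=1.0, block=16, step=4):
    """Slide a block x block x block window over a (nx, ny, nz, channels)
    grid; return the blocks and the coordinates of their centers."""
    origin = np.asarray(origin, dtype=float)
    nx, ny, nz = grid.shape[:3]
    blocks, centers = [], []
    for i in range(0, nx - block + 1, step):
        for j in range(0, ny - block + 1, step):
            for k in range(0, nz - block + 1, step):
                blocks.append(grid[i:i + block, j:j + block, k:k + block])
                # Center of the block in angstroms
                corner = np.array([i, j, k], dtype=float)
                centers.append(origin + (corner + block / 2.0) * voxel_size)
    return np.stack(blocks), np.array(centers)

def label_blocks(centers, reference_center, cutoff=4.0):
    """1 for blocks whose center is closer than `cutoff` to the reference."""
    dist = np.linalg.norm(centers - np.asarray(reference_center), axis=1)
    return (dist < cutoff).astype(int)

# Toy grid: 24 voxels per side, 8 hypothetical descriptor channels
grid = np.zeros((24, 24, 24, 8))
blocks, centers = extract_blocks(grid, origin=np.zeros(3))
labels = label_blocks(centers, reference_center=np.array([12.0, 12.0, 12.0]))
```

With a 24-voxel edge the window fits at offsets 0, 4 and 8 along each axis, which gives 27 blocks; only the block centered exactly at the reference point is labelled positive, which also illustrates why positives are rare.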
We follow the original paper and balance the data by undersampling, since for most proteins the fraction
of positively marked blocks doesn't exceed 0.008.
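A minimal sketch of undersampling is given below, assuming binary labels stored in a NumPy array. The function name, the fixed seed and the toy label vector are illustrative assumptions; the original text does not specify how the retained negative blocks are chosen, so a uniform random draw is used here.

```python
import numpy as np

def undersample(labels, seed=0):
    """Indices of a balanced subset: every minority-class sample plus an
    equally sized random draw (without replacement) from the majority."""
    labels = np.asarray(labels)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    keep_majority = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, keep_majority]))

# Toy labels with a 0.008 fraction of positives
labels = np.array([1] * 8 + [0] * 992)
idx = undersample(labels)
balanced = labels[idx]
```

After this step the toy subset contains 8 positive and 8 negative samples.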
Model modifications & Results
We use the prepared samples to train a keras model (Figure 2). We split the data into train and test
sets at a ratio of 9:1. We obtained results similar to those mentioned in the paper: in particular, with
the same network scheme we got 98.4% accuracy.
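The 9:1 split can be sketched as a random permutation of sample indices. The text does not say whether the split is made per block or per protein; the sketch below splits per sample, and the function name, seed and sample count are assumptions.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Shuffle sample indices and cut them into train and test parts."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    return order[n_test:], order[:n_test]

# 1000 hypothetical samples split 9:1
train_idx, test_idx = train_test_split_indices(1000)
```

Because overlapping blocks cut from the same protein are highly similar, a per-protein split would be a stricter alternative to the per-sample split shown here.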
Figure 2: Original network architecture proposed in the DeepSite paper
However, after a small tweak (we decreased the convolutional filter size to 3×3×3, as shown in Figure
3), we obtained 99% accuracy on balanced data.
Figure 3: Modified network architecture with decreased convolutional filter size, which gives slightly better accuracy on
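To make the effect of a smaller filter concrete, the sketch below computes the output edge length and the number of trainable parameters of a single 3D convolution without padding. Only the 16-voxel input edge and the 3×3×3 filter come from the text; the channel count (8), the number of filters (32) and the larger comparison kernel (8) are hypothetical values chosen for illustration.

```python
def conv3d_summary(input_size, in_channels, filters, kernel, stride=1):
    """Output edge length and trainable parameter count of one 3D
    convolution layer without padding ("valid")."""
    out_size = (input_size - kernel) // stride + 1
    params = (kernel ** 3 * in_channels + 1) * filters  # weights + biases
    return out_size, params

# Channel count, filter count and the larger kernel are hypothetical
small = conv3d_summary(input_size=16, in_channels=8, filters=32, kernel=3)
large = conv3d_summary(input_size=16, in_channels=8, filters=32, kernel=8)
print(small)  # (14, 6944)
print(large)  # (9, 131104)
```

Under these assumed settings the 3×3×3 layer keeps a larger feature map while using far fewer parameters than the larger kernel.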
We also modified the network by adding an attention block to compare molecular descriptors. The
result was not surprising: hydrophobicity and geometric descriptors had the greatest importance.
Hydrophobic amino acids are known to avoid water by taking part in protein-protein and protein-ligand
interactions and, when many of them are present, to form active sites, and in 95% of cases
protein-ligand active sites can be explained by geometry alone. The corresponding pictures and
pretrained models for this case can be found in the project's repository, together with the code for
feature extraction and processing.
Figure 4 shows two different proteins with predicted active sites.
Figure 4: Two proteins with predicted active sites colored in green: the structure with PDB ID 1ype (good
example) and the structure with PDB ID 2osl (bad example). The descriptors used in the DeepSite article are
not invertible, so to draw these pictures we used the original correspondence between atoms and 16×16×16
blocks: we colored green the atoms that are closest to the blocks predicted as positive.
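One way to read the atom-coloring rule is sketched below: for every block predicted as positive, pick the atom nearest to the block center. This interpretation of "closest", the function name and the toy coordinates are assumptions; the project's code may use a different correspondence between atoms and blocks.

```python
import numpy as np

def atoms_nearest_to_blocks(atom_xyz, block_centers, predicted):
    """Indices of the atoms nearest to the center of each positive block."""
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    mask = np.asarray(predicted, dtype=bool)
    centers = np.asarray(block_centers, dtype=float)[mask]
    if len(centers) == 0:
        return np.array([], dtype=int)
    # Distances between positive block centers and atoms, shape (n_pos, n_atoms)
    dist = np.linalg.norm(centers[:, None, :] - atom_xyz[None, :, :], axis=-1)
    # One nearest atom per positive block, duplicates removed
    return np.unique(dist.argmin(axis=1))

# Toy data (hypothetical coordinates and predictions)
atoms = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [20.0, 0.0, 0.0]])
block_centers = np.array([[1.0, 0.0, 0.0], [6.0, 0.0, 0.0], [19.0, 0.0, 0.0]])
predicted = np.array([1, 1, 0])
green_atoms = atoms_nearest_to_blocks(atoms, block_centers, predicted)
```

In the toy example the two positive blocks select atoms 0 and 1, and the atom next to the negative block stays uncolored.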
The prediction for the first structure looks good (probably because it is a globular protein). The
prediction for the second one is poor, although this is not obvious from the picture: it shows the heavy
chain and Fc fragment of an antibody, whose main binding site is located in a different place.
The original method works better on proteins that have "classic" geometry, with visible binding
In the future we plan to explore different types of descriptors and apply them to other problems
solvable with this dataset.
 Nglview - interactive molecular graphics for jupyter notebooks. Bioinformatics.
 Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig,
Ilya N. Shindyalov, and Philip E. Bourne. The protein data bank. Nucleic Acids Research,
 François Chollet et al. Keras. https://keras.io, 2015.
 S. Doerr, M. J. Harvey, Frank Noé, and G. De Fabritiis. HTMD: High-throughput molecular dynamics
for molecular discovery. Journal of Chemical Theory and Computation, 12(4):1845–1852,
2016. PMID: 26949976.
 Michael Feig. Computational protein structure reﬁnement: almost there, yet still so far to go.
Wiley Interdisciplinary Reviews: Computational Molecular Science, 7(3):e1307, 2017.
 J. Jiménez, S. Doerr, G. Martínez-Rosell, A. S. Rose, and G. De Fabritiis. DeepSite: protein-binding
site predictor using 3D-convolutional neural networks. Bioinformatics, 33(19):3036–
 Esther Kellenberger, Pascal Muller, Claire Schalon, Guillaume Bret, Nicolas Foata, and Didier
Rognan. sc-pdb: an annotated database of druggable binding sites from the protein data bank.
Journal of Chemical Information and Modeling, 46(2):717–727, 2006. PMID: 16563002.
 Joshua Meyers, Nathan Brown, and Julian Blagg. Mapping the 3d structures of small molecule
binding sites. Journal of Cheminformatics, 8(1):70, 12 2016.
 Peter Schmidtke, Catherine Souaille, F. Estienne, Nicolas Baurin, and Romano Kroemer. Large-
scale comparison of four binding site detection algorithms. Journal of chemical information and
modeling, 50:2191–200, 12 2010.
 CJ Tsai and R Nussinov. Hydrophobic folding units at protein-protein interfaces: implications to
protein folding and to protein-protein association. Protein science: a publication of the Protein
Society, 6(7):1426–1437, July 1997.
We would like to thank the Bioinformatics Institute (https://bioinf.me/en) for its cooperation and for
the opportunity to run this work as a "student project" (which provided us with several additional deadlines).