Our goal is to reproduce one approach to the in silico binding site prediction problem, described in the DeepSite paper [6]. We follow the original protocol with several simplifications. Our ultimate goal is to reproduce the approach as closely as possible, to compare results, to interpret the outputs of the network layers, and to experiment with modifications to the original CNN implementation. Below we describe the problems we had to deal with and our future work.
Reproducibility Project: DeepSite
Tatiana Malygina & Viacheslav Borovitskiy & Yuri Porozov
The laboratory of bioinformatics, ITMO University
Suppose that we have a protein structure. We also know that the protein interacts with several other
compounds in the human body and that something is wrong with one of these interactions. If we want to
prevent the protein from interacting with other compounds (or, in the case of protein malfunction, to
strengthen these interactions), we must know the specific place on the protein's surface where the
interaction of interest takes place.
This specific region of the protein surface is called the active site (or binding site) and is shown
schematically in Figure 1.
Figure 1: Schematic of the interaction between a protein surface and a ligand (a small molecule; in this case,
a caffeine molecule).
When we have information about the interaction, we can define this region by applying a distance cutoff
from the small molecule's atoms. When no such information is available, we want to predict de novo
which regions of the protein surface can potentially bind small molecules, and later use this
knowledge for molecular docking.
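The distance-cutoff definition can be illustrated with a minimal NumPy sketch. The function name `binding_site_mask`, the toy coordinates and the 4 Å cutoff are assumptions chosen for illustration only; they are not taken from the poster or from scPDB.

```python
import numpy as np

def binding_site_mask(protein_xyz, ligand_xyz, cutoff=4.0):
    """Boolean mask over protein atoms lying within `cutoff` (in Å) of any ligand atom.

    The cutoff value is an illustrative assumption, not a value from the poster.
    """
    protein_xyz = np.asarray(protein_xyz, dtype=float)
    ligand_xyz = np.asarray(ligand_xyz, dtype=float)
    # Pairwise distances, shape (n_protein_atoms, n_ligand_atoms).
    diff = protein_xyz[:, None, :] - ligand_xyz[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    # An atom belongs to the site if its nearest ligand atom is close enough.
    return dist.min(axis=1) <= cutoff

# Toy coordinates (hypothetical): three protein atoms and one ligand atom.
protein = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [10.0, 0.0, 0.0]])
ligand = np.array([[1.0, 0.0, 0.0]])
mask = binding_site_mask(protein, ligand, cutoff=4.0)
```

For large structures the full pairwise distance matrix may be replaced by a spatial index such as a k-d tree; the dense version is used here only for brevity.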
scPDB database
As the data source we use scPDB [7], a structural database of protein-ligand pairs extracted
from the Protein Data Bank [2], which were manually curated, annotated and clustered.
The currently available version of the database contains about 16k protein-ligand pairs; the
information for each pair includes .mol2 files for the ligand, the protein, and the protein's active site.
Reproducing DeepSite
To better understand how to work with scPDB data, we started by reproducing the results reported
in the DeepSite paper [6].
Although DeepSite provides Keras code for the neural network architecture in the Supplementary,
the authors do not provide code for the data preparation and feature extraction steps. We follow the
instructions given in the article to reproduce these steps, and slightly modify them to simplify the
problem and to be able to produce draft results before comparing on the whole dataset.
We start with feature extraction and dataset preparation. The original article uses scPDB [7] v.2013 for
network training. Currently only the latest version, v.2017, is available; with about 16k protein-ligand
pairs, it is larger than DeepSite's training dataset.
The DeepSite article mentions that scPDB has annotations with clustering information. We exported a .csv
file with these annotations from the scPDB website; for each protein-ligand pair it contains the
UNIPROT ID, UNIPROT AC and CLUSTER ID fields. CLUSTER ID is empty for all protein-ligand pairs in the
database, which is why we could not reproduce the original filtering. DeepSite's authors mention
that they provide the list of selected structures in the Supplementary, but it is not provided there.
For each unique UNIPROT ID we select one protein-ligand pair, thus reducing the dataset from more than
16000 to 5010 records. This approach doesn't guarantee the elimination of similar binding sites, since
different proteins with different amino acid sequences can share similar function and shape. It also
diminishes the variability of the data, since a protein can have several binding sites, and protein-ligand
pairs with the same UNIPROT ID can describe different binding sites located in different parts of the
protein's surface. It simply reduces the amount of data, which is huge, and simplifies our experiment.
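The selection of one pair per UNIPROT ID could look roughly like the sketch below. The record layout, the key names `PDB_ID` and `UNIPROT_ID`, the placeholder identifiers and the rule of keeping the first pair encountered are all assumptions for illustration; the actual column names of the exported .csv file and the selection rule used in the project may differ.

```python
def one_pair_per_uniprot(records, key="UNIPROT_ID"):
    """Keep the first record seen for each distinct value of `key`.

    `records` is a list of dicts, e.g. rows read with csv.DictReader.
    The key name is a hypothetical placeholder for the annotation column.
    """
    selected = {}
    for rec in records:
        # setdefault stores the record only if this ID has not been seen yet.
        selected.setdefault(rec[key], rec)
    return list(selected.values())

# Toy annotation rows with placeholder identifiers (not real scPDB entries).
records = [
    {"PDB_ID": "pdb_a", "UNIPROT_ID": "UP_1"},
    {"PDB_ID": "pdb_b", "UNIPROT_ID": "UP_1"},
    {"PDB_ID": "pdb_c", "UNIPROT_ID": "UP_2"},
]
unique_pairs = one_pair_per_uniprot(records)
```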
Next we follow DeepSite's original protocol. We use HTMD [4] for feature extraction and split its
output into blocks of 16×16×16 voxels with a step of 4. We mark a block as positive if its center is
closer than 4 Å to the binding site's geometric center and negative otherwise (as stated in the
original article).
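The block extraction and labeling step can be sketched in NumPy as follows. This is not the project's code: the helper names `iter_blocks` and `label_block`, the 1 Å voxel size, the 8 descriptor channels and the toy grid are assumptions made for illustration, and `center` stands for whichever reference center is used for labeling.

```python
import numpy as np

def iter_blocks(grid, size=16, step=4):
    """Yield (index_origin, block) for every size^3 block of an (X, Y, Z, C) voxel grid."""
    nx, ny, nz = grid.shape[:3]
    for i in range(0, nx - size + 1, step):
        for j in range(0, ny - size + 1, step):
            for k in range(0, nz - size + 1, step):
                yield (i, j, k), grid[i:i + size, j:j + size, k:k + size]

def label_block(index_origin, center, grid_origin=(0.0, 0.0, 0.0),
                voxel=1.0, size=16, cutoff=4.0):
    """Return True if the block center lies closer than `cutoff` Å to `center`.

    `voxel` (edge length of one voxel in Å) is an assumed value.
    """
    block_center = (np.asarray(grid_origin, dtype=float)
                    + (np.asarray(index_origin, dtype=float) + size / 2.0) * voxel)
    return bool(np.linalg.norm(block_center - np.asarray(center, dtype=float)) < cutoff)

# Toy grid: 24x24x24 voxels with 8 descriptor channels (assumed), filled with zeros.
grid = np.zeros((24, 24, 24, 8), dtype=np.float32)
center = (12.0, 12.0, 12.0)  # hypothetical reference center in Å
blocks = list(iter_blocks(grid))
labels = [label_block(origin, center) for origin, _ in blocks]
```

With a step of 4 voxels, neighbouring blocks overlap heavily, so only a handful of blocks around the reference center receive a positive label.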
We follow the original paper and balance the data by undersampling, since for most proteins the fraction
of positively marked blocks doesn't exceed 0.008.
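One simple way to undersample is to keep all positive blocks and draw an equally sized random subset of negative blocks. The sketch below illustrates that idea under stated assumptions; the function name `undersample_indices`, the fixed seed and the toy label vector are hypothetical, and the exact sampling scheme used in the project or in DeepSite may differ.

```python
import numpy as np

def undersample_indices(labels, seed=0):
    """Return shuffled indices with equal numbers of positive and negative samples."""
    labels = np.asarray(labels, dtype=bool)
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(labels)
    neg = np.flatnonzero(~labels)
    n = min(len(pos), len(neg))
    # Sample without replacement from both classes so that each contributes n items.
    keep = np.concatenate([
        rng.choice(pos, size=n, replace=False),
        rng.choice(neg, size=n, replace=False),
    ])
    rng.shuffle(keep)
    return keep

# Toy labels: 8 positive blocks out of 1000, i.e. a positive fraction of 0.008.
labels = np.array([True] * 8 + [False] * 992)
keep = undersample_indices(labels)
```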
Model modifications & Results
We use the prepared samples to train a Keras [3] model (Figure 2). We split the data into train and test
sets at a ratio of 9:1. We obtained results similar to those reported in the paper: in particular, with
the same network scheme, we got 98.4% accuracy.
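A 9:1 split can be sketched as a random permutation of sample indices. The function name `train_test_split_indices` and the fixed seed are assumptions; the sketch splits at the level of individual samples and says nothing about how the project grouped blocks from the same protein.

```python
import numpy as np

def train_test_split_indices(n_samples, test_fraction=0.1, seed=0):
    """Return (train_idx, test_idx) from a random permutation of range(n_samples)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    # The first n_test shuffled indices form the test set, the rest the train set.
    return order[n_test:], order[:n_test]

train_idx, test_idx = train_test_split_indices(1000)
```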
Figure 2: Original network architecture proposed in the DeepSite paper.
However, after a small tweak (we decreased the convolutional filter size to 3×3×3, as shown in Figure
3), we obtained 99% accuracy on balanced data.
Figure 3: Modified network architecture with decreased convolutional filter size, which gives slightly better accuracy on
the balanced dataset.
We also modified the network by adding an attention block in order to compare molecular descriptors.
Hydrophobicity and geometric descriptors had the greatest importance. This was not surprising:
hydrophobic amino acids are known to avoid water by taking part in protein-protein and protein-ligand
interactions and, when many of them are present, by forming active sites [10], and in about 95% of cases
protein-ligand active sites can be explained by geometry alone [9]. The corresponding pictures and
pretrained models can be found in the project's repository, together with the code for feature
extraction and processing¹.
Figure 4 shows two different proteins with predicted active sites.
Figure 4: Two proteins with predicted active sites colored in green: the structure with PDB ID 1ype (good
example) and the structure with PDB ID 2osl (bad example). The descriptors used in the DeepSite article are not
invertible, so to draw these pictures we used the original correspondence between atoms and 16×16×16 blocks: we
colored green the atoms that are closest to a block predicted as positive.
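The mapping from predicted blocks back to atoms can be approximated as in the sketch below, which selects atoms lying within a fixed radius of the center of any positively predicted block. The function name `atoms_near_positive_blocks`, the 8 Å radius and the toy coordinates are assumptions for illustration; the poster does not specify how "closest" was defined.

```python
import numpy as np

def atoms_near_positive_blocks(atom_xyz, block_centers, is_positive, radius=8.0):
    """Boolean mask over atoms within `radius` Å of any positively predicted block center.

    `radius` is an assumed value, not one stated in the poster.
    """
    atom_xyz = np.asarray(atom_xyz, dtype=float)
    centers = np.asarray(block_centers, dtype=float)[np.asarray(is_positive, dtype=bool)]
    if len(centers) == 0:
        # No block was predicted positive, so no atom is highlighted.
        return np.zeros(len(atom_xyz), dtype=bool)
    dist = np.linalg.norm(atom_xyz[:, None, :] - centers[None, :, :], axis=-1)
    return dist.min(axis=1) <= radius

# Toy data (hypothetical): three atoms, two block centers, only the first predicted positive.
atoms = np.array([[0.0, 0.0, 0.0], [5.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
centers = np.array([[2.0, 0.0, 0.0], [30.0, 0.0, 0.0]])
highlight = atoms_near_positive_blocks(atoms, centers, is_positive=[True, False])
```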
The first structure looks good (probably because it is a globular protein). The second one looks bad,
although this is not obvious from the picture: it shows the heavy chain and Fc fragment of an antibody,
and the main binding site is located in a different place.
The original method works better on proteins that have "classic" geometry - with visible binding
We plan to further explore different types of descriptors and apply them to other problems solvable
with this dataset.
[1] NGLview - interactive molecular graphics for Jupyter notebooks. Bioinformatics.
[2] Helen M. Berman, John Westbrook, Zukang Feng, Gary Gilliland, T. N. Bhat, Helge Weissig, Ilya N. Shindyalov, and Philip E. Bourne. The Protein Data Bank. Nucleic Acids Research, 28(1):235–242, 2000.
[3] François Chollet et al. Keras, 2015.
[4] S. Doerr, M. J. Harvey, Frank Noé, and G. De Fabritiis. HTMD: High-throughput molecular dynamics for molecular discovery. Journal of Chemical Theory and Computation, 12(4):1845–1852, 2016. PMID: 26949976.
[5] Michael Feig. Computational protein structure refinement: almost there, yet still so far to go. Wiley Interdisciplinary Reviews: Computational Molecular Science, 7(3):e1307, 2017.
[6] J. Jiménez, S. Doerr, G. Martínez-Rosell, A. S. Rose, and G. De Fabritiis. DeepSite: protein-binding site predictor using 3D-convolutional neural networks. Bioinformatics, 33(19):3036–3042, 2017.
[7] Esther Kellenberger, Pascal Muller, Claire Schalon, Guillaume Bret, Nicolas Foata, and Didier Rognan. sc-PDB: an annotated database of druggable binding sites from the Protein Data Bank. Journal of Chemical Information and Modeling, 46(2):717–727, 2006. PMID: 16563002.
[8] Joshua Meyers, Nathan Brown, and Julian Blagg. Mapping the 3D structures of small molecule binding sites. Journal of Cheminformatics, 8(1):70, 2016.
[9] Peter Schmidtke, Catherine Souaille, F. Estienne, Nicolas Baurin, and Romano Kroemer. Large-scale comparison of four binding site detection algorithms. Journal of Chemical Information and Modeling, 50:2191–2200, 2010.
[10] C. J. Tsai and R. Nussinov. Hydrophobic folding units at protein-protein interfaces: implications to protein folding and to protein-protein association. Protein Science, 6(7):1426–1437, 1997.
We would like to thank the Bioinformatics Institute for cooperation and the
opportunity to make this work a "student project" (it provided us with several additional deadlines).