Article

Modeling with Alternate Locations in X-ray Protein Structures

Abstract

In many molecular modeling applications, the standard procedure is still to handle proteins as single, rigid structures. While the importance of conformational flexibility is widely known, handling it remains challenging. Even the crystal structure of a protein usually contains variability exemplified in alternate side chain orientations or backbone segments. This conformational variability is encoded in PDB structure files by so-called alternate locations (AltLocs). Most modeling approaches either ignore AltLocs or resolve them with simple heuristics early on during structure import. We analyzed the occurrence and usage of AltLocs in the PDB and developed an algorithm to automatically handle AltLocs in PDB files enabling all structure-based methods using rigid structures to take the alternative protein conformations described by AltLocs into consideration. A respective software tool named AltLocEnumerator can be used as a structure preprocessor to easily exploit AltLocs. While the amount of data makes it difficult to show impact on a statistical level, handling AltLocs has a substantial impact on a case-by-case basis. We believe that the inspection and consideration of AltLocs is a very valuable approach in many modeling scenarios.
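To make the idea of expanding AltLocs into rigid structures concrete, here is a minimal sketch in Python. It is not the AltLocEnumerator algorithm, which the abstract does not describe. It assumes fixed-column legacy PDB records (the AltLoc identifier in column 17) and, as a deliberate simplification, treats every residue as independent, so the number of combinations grows exponentially with the number of residues carrying AltLocs. The function names are hypothetical.

```python
from itertools import product


def _altloc_key(line):
    """Return ((chain, resSeq, iCode), altloc) for an atom record with an AltLoc, else None."""
    if not line.startswith(("ATOM", "HETATM")) or len(line) < 27:
        return None
    altloc = line[16]  # column 17 of the legacy PDB format
    if altloc == " ":
        return None
    return (line[21], int(line[22:26]), line[26]), altloc


def altloc_ids_by_residue(pdb_lines):
    """Map each residue that carries AltLocs to the sorted list of its AltLoc identifiers."""
    found = {}
    for line in pdb_lines:
        parsed = _altloc_key(line)
        if parsed is not None:
            key, altloc = parsed
            found.setdefault(key, set()).add(altloc)
    return {key: sorted(ids) for key, ids in found.items()}


def enumerate_conformations(pdb_lines):
    """Yield one list of PDB lines per combination of per-residue AltLoc choices.

    Assumption: residues are combined independently (naive Cartesian product).
    """
    lines = list(pdb_lines)
    by_residue = altloc_ids_by_residue(lines)
    keys = sorted(by_residue)
    for choice in product(*(by_residue[k] for k in keys)):
        selected = dict(zip(keys, choice))
        kept = []
        for line in lines:
            parsed = _altloc_key(line)
            # Drop atoms whose AltLoc differs from the one chosen for their residue.
            if parsed is not None and selected[parsed[0]] != parsed[1]:
                continue
            kept.append(line)
        yield kept
```

Each yielded list can be written out as a single-conformation PDB file and passed to any method that expects a rigid structure. A practical tool would also need to group AltLocs that are coupled across neighbouring residues rather than combining them freely.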

... Here, the first encountered alternate location identifier is used for all atoms with alternate location identifiers in the structure. A recent discussion of alternate locations in PDB structures and different strategies to take them into account can be found in Gutermuth et al. (2023). 104 Furthermore, during PDB file interpretation, only the atom coordinates in the PDB file are considered, i.e., the asymmetric unit for structures obtained from the PDB. ...
... A recent discussion of alternate locations in PDB structures and different strategies to take them into account can be found in Gutermuth et al. (2023). 104 Furthermore, during PDB file interpretation, only the atom coordinates in the PDB file are considered, i.e., the asymmetric unit for structures obtained from the PDB. 63 Disulfide bridges are detected using a distance-based criterion; i.e., two cysteines are assumed to be connected by a disulfide bridge if both sulfur atoms have a distance of at most 2.5 Å. ...
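The distance-based disulfide criterion quoted above can be sketched in a few lines of Python. This is an illustration of the stated rule only, not code from the cited work; the function name, the input layout (a list of residue labels paired with SG coordinates), and the example coordinates are assumptions.

```python
import math
from itertools import combinations


def find_disulfides(sg_atoms, max_dist=2.5):
    """Return pairs of cysteines whose SG atoms lie within max_dist (in Å).

    sg_atoms: list of (residue_label, (x, y, z)) tuples, one per cysteine SG atom.
    """
    bridges = []
    for (res_a, xyz_a), (res_b, xyz_b) in combinations(sg_atoms, 2):
        # Euclidean distance between the two sulfur atoms
        if math.dist(xyz_a, xyz_b) <= max_dist:
            bridges.append((res_a, res_b))
    return bridges


# Hypothetical coordinates: the first two sulfurs are 2.05 Å apart, the third is far away.
example = [
    ("CYS A 10", (0.0, 0.0, 0.0)),
    ("CYS A 42", (2.05, 0.0, 0.0)),
    ("CYS A 77", (15.0, 3.0, 1.0)),
]
print(find_disulfides(example))
```

When a cysteine is modelled with alternate locations, the result can depend on which AltLoc supplies the SG coordinates, so the choice of conformation should be made before this check.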
... The construction of protein structure ensembles could also be integrated into the JAMDA preprocessing pipeline, e.g., by a PDB-wide search for similar binding sites using the SIENA tool 150 or our recently presented approach to translate alternate locations into protein structure ensembles. 104 ...
... Despite the dynamic nature of protein molecules, X-ray crystallography, perhaps the most widely used method for studying their structure, produces only static structure models. The ensemble nature of this measurement modality does, however, provide some information about the structure flexibility through the B-factors and electron density, and these structures allow specifying alternate locations for atoms to represent the dynamics (Stachowski and Fischer, 2023; Djinovic-Carugo 2015; Gutermuth 2023). For example, side chains are commonly found and modelled as a population of a discrete set of rotamer conformations (Figure 1). ...
... According to the OCA browser database for protein structure/function (Prilusky, 1996), over 95% of Protein Data Bank (PDB) structures deposited after 2010 contain backbone atoms with alternate locations (altlocs). Gutermuth (2023) recently noted that the prevalence of altlocs is often unnoticed since most modelling approaches either ignore altlocs altogether or resolve them with simple heuristics. This could bear substantial unintended consequences, for example when such data are used to train structure prediction models such as AlphaFold (Jumper et al. 2021) as we shall discuss below. ...
Preprint
Proteins jiggle around, adopting ensembles of interchanging conformations. Here we show through a large-scale analysis of the Protein Data Bank and using molecular dynamics simulations, that segments of protein chains can also commonly adopt dual, transiently stable conformations which is not explained by direct interactions. Our analysis highlights how alternate conformations can be maintained as non-interchanging, separated states intrinsic to the protein chain, namely through steric barriers or the adoption of transient secondary structure elements. We further demonstrate that despite the commonality of the phenomenon, current structural ensemble prediction methods fail to capture these bimodal distributions of conformations.
... While the phenomenon is very common in flexible side chains, the heterogeneity in the backbone conformations has remained underappreciated. In fact, common visualization platforms (e.g., PyMOL (DeLano et al., 2002) and ChimeraX (Pettersen et al., 2021)) and structural modeling tools (e.g., GROMACS (Van Der Spoel et al., 2005)) frequently read the first listed conformation as the default and disregard alternate conformations (Gutermuth et al., 2023). Nevertheless, the Protein Data Bank is actually riddled with such altlocs. ...
Preprint
Proteins exist as a dynamic ensemble of multiple conformations, and these motions are often crucial for their functions. However, current structure prediction methods predominantly yield a single conformation, overlooking the conformational heterogeneity revealed by diverse experimental modalities. Here, we present a framework for building experiment-grounded protein structure generative models that infer conformational ensembles consistent with measured experimental data. The key idea is to treat state-of-the-art protein structure predictors (e.g., AlphaFold3) as sequence-conditioned structural priors, and cast ensemble modeling as posterior inference of protein structures given experimental measurements. Through extensive real-data experiments, we demonstrate the generality of our method to incorporate a variety of experimental measurements. In particular, our framework uncovers previously unmodeled conformational heterogeneity from crystallographic densities, and generates high-accuracy NMR ensembles orders of magnitude faster than the status quo. Notably, we demonstrate that our ensembles outperform AlphaFold3 and sometimes fit experimental data better than structures publicly deposited in the Protein Data Bank (PDB). We believe that this approach will unlock building predictive models that fully embrace experimentally observed conformational diversity.
... However, where detectable, crystallographers model atoms in multiple alternate locations (commonly termed altlocs). Alternately located segments of the protein backbone have remained under-recognised, since most visualisation platforms (e.g., PyMOL and ChimeraX) and programs using structural models as inputs (e.g., GROMACS) ignore altlocs altogether or resolve them with simple heuristics [4]. A recent work [11] created a comprehensive catalogue of altlocs extracted from PDB structures, suggesting that this dataset should find use in efforts towards predicting multiple structures from a single sequence. ...
Preprint
Proteins are dynamic, adopting ensembles of conformations. The nature of this conformational heterogeneity is imprinted in the raw electron density measurements obtained from X-ray crystallography experiments. Fitting an ensemble of protein structures to these measurements is a challenging, ill-posed inverse problem. We propose a non-i.i.d. ensemble guidance approach to solve this problem using existing protein structure generative models and demonstrate that it accurately recovers complicated multi-modal alternate protein backbone conformations observed in certain single crystal measurements.
Article
Protein Data Bank (PDB) files list the relative spatial location of atoms in a protein structure as the final output of the process of fitting and refining to experimentally determined electron density measurements. Where experimental evidence exists for multiple conformations, atoms are modelled in alternate locations. Programs reading PDB files commonly ignore these alternate conformations by default, leaving users oblivious to the presence of alternate conformations in the structures they analyze. This has led to underappreciation of their prevalence and undercharacterisation of their features, and has limited access to this high-resolution data representing structural ensembles. We have trawled PDB files to extract structural features of residues with alternately located atoms. The output includes the distance between alternate conformations and identifies the location of these segments within the protein chain and in proximity of all other atoms within a defined radius. This dataset should be of use in efforts to predict multiple structures from a single sequence and support studies investigating protein flexibility and the association with protein function.
Article
Protein flexibility is important for ligand binding but often ignored in drug design. Considering proteins as ensembles rather than static snapshots creates opportunities to target dynamic proteins that lack FDA-approved drugs, such as the human chaperone, heat shock protein 90 (Hsp90). Hsp90α accommodates ligands with a dynamic lid domain, yet no comprehensive analysis relating lid conformations to ligand properties is available. To date, ∼300 ligand-bound Hsp90α crystal structures are deposited in the Protein Data Bank, which enables us to consider ligand binding as a perturbation of the protein conformational landscape. By estimating binding site volumes, we classified structures into distinct major and minor lid conformations. Supported by retrospective docking, each conformation creates unique hotspots that bind chemically distinguishable ligands. Clustering revealed insightful exceptions and the impact of crystal packing. Overall, Hsp90α's plasticity provides a cautionary tale of overinterpreting individual crystal structures and motivates an ensemble-based view of drug design.
Article
Motivation: After the outstanding breakthrough of AlphaFold in predicting protein 3D models, new questions appeared and remain unanswered. The ensemble nature of proteins, for example, challenges the structural prediction methods because the models should represent a set of conformers instead of single structures. The evolutionary and structural features captured by effective deep learning techniques may unveil the information to generate several diverse conformations from a single sequence. Here we address the performance of AlphaFold2 predictions obtained through ColabFold under this ensemble paradigm. Results: Using a curated collection of apo-holo pairs of conformers, we found that AlphaFold2 predicts the holo form of a protein in ∼70% of the cases, being unable to reproduce the observed conformational diversity with the same error for both conformers. More importantly, we found that AlphaFold2's performance worsens with the increasing conformational diversity of the studied protein. This impairment is related to the heterogeneity in the degree of conformational diversity found between different members of the homologous family of the protein under study. Finally, we found that main-chain flexibility associated with apo-holo pairs of conformers negatively correlates with the predicted local model quality score plDDT, indicating that plDDT values in a single 3D model could be used to infer local conformational changes linked to ligand binding transitions. Availability: Data and code used in this manuscript are publicly available at https://gitlab.com/sbgunq/publications/af2confdiv-oct2021 Supplementary Information: Supplementary data is available at the journal's web site.
Article
While protein conformational heterogeneity plays an important role in many aspects of biological function, including ligand binding, its impact has been difficult to quantify. Macromolecular X-ray diffraction is commonly interpreted with a static structure, but it can provide information on both the anharmonic and harmonic contributions to conformational heterogeneity. Here, through multiconformer modeling of time- and space-averaged electron density, we measure conformational heterogeneity of 743 stringently matched pairs of crystallographic datasets that reflect unbound/apo and ligand-bound/holo states. When comparing the conformational heterogeneity of side chains, we observe that when binding site residues become more rigid upon ligand binding, distant residues tend to become more flexible, especially in non-solvent exposed regions. Among ligand properties, we observe increased protein flexibility as the number of hydrogen bonds decreases and relative hydrophobicity increases. Across a series of 13 inhibitor bound structures of CDK2, we find that conformational heterogeneity is correlated with inhibitor features and identify how conformational changes propagate differences in conformational heterogeneity away from the binding site. Collectively, our findings agree with models emerging from NMR studies suggesting that residual side chain entropy can modulate affinity and point to the need to integrate both static conformational changes and conformational heterogeneity in models of ligand binding.
Article
Proteins are essential to life, and understanding their structure can facilitate a mechanistic understanding of their function. Through an enormous experimental effort [1–4], the structures of around 100,000 unique proteins have been determined [5], but this represents a small fraction of the billions of known protein sequences [6,7]. Structural coverage is bottlenecked by the months to years of painstaking effort required to determine a single protein structure. Accurate computational approaches are needed to address this gap and to enable large-scale structural bioinformatics. Predicting the three-dimensional structure that a protein will adopt based solely on its amino acid sequence—the structure prediction component of the ‘protein folding problem’ [8]—has been an important open research problem for more than 50 years [9]. Despite recent progress [10–14], existing methods fall far short of atomic accuracy, especially when no homologous structure is available. Here we provide the first computational method that can regularly predict protein structures with atomic accuracy even in cases in which no similar structure is known. We validated an entirely redesigned version of our neural network-based model, AlphaFold, in the challenging 14th Critical Assessment of protein Structure Prediction (CASP14) [15], demonstrating accuracy competitive with experimental structures in a majority of cases and greatly outperforming other methods. Underpinning the latest version of AlphaFold is a novel machine learning approach that incorporates physical and biological knowledge about protein structure, leveraging multi-sequence alignments, into the design of the deep learning algorithm.
Article
UCSF ChimeraX is the next‐generation interactive visualization program from the Resource for Biocomputing, Visualization, and Informatics (RBVI), following UCSF Chimera. ChimeraX brings (a) significant performance and graphics enhancements; (b) new implementations of Chimera's most highly used tools, many with further improvements; (c) several entirely new analysis features; (d) support for new areas such as virtual reality, light‐sheet microscopy, and medical imaging data; (e) major ease‐of‐use advances, including toolbars with icons to perform actions with a single click, basic “undo” capabilities, and more logical and consistent commands; and (f) an app store for researchers to contribute new tools. ChimeraX includes full user documentation and is free for noncommercial use, with downloads available for Windows, Linux, and macOS from https://www.rbvi.ucsf.edu/chimerax.
Article
The Protein Data Bank (PDB) is the global archive for structural information on macromolecules, and a popular resource for researchers, teachers and students, amassing more than one million unique users each year. Crystallographic structure models in the PDB (more than 100,000 entries) are optimized against the crystal diffraction data and geometrical restraints. This process of crystallographic refinement typically ignored hydrogen bond (H-bond) distances as a source of information. However, H-bond restraints can improve structures at low resolution where diffraction data are limited. To improve low-resolution structure refinement, we present methods for deriving H-bond information either globally from well-refined high-resolution structures from the PDB-REDO databank, or specifically from on-the-fly constructed sets of homologous high-resolution structures. Refinement incorporating HOmology DErived Restraints (HODER), improves geometrical quality and the fit to the diffraction data for many low-resolution structures. To make these improvements readily available to the general public, we applied our new algorithms to all crystallographic structures in the PDB: using massively parallel computing, we constructed a new instance of the PDB-REDO databank (https://pdb-redo.eu). This resource is useful for researchers to gain insight on individual structures, on specific protein families (as we demonstrate with examples), and on general features of protein structure using data mining approaches on a uniformly treated dataset.
Article
In macromolecular crystallography, the rigorous detection of changed states (for example, ligand binding) is difficult unless signal is strong. Ambiguous (‘weak’ or ‘noisy’) density is experimentally common, since molecular states are generally only fractionally present in the crystal. Existing methodologies focus on generating maximally accurate maps whereby minor states become discernible; in practice, such map interpretation is disappointingly subjective, time-consuming and methodologically unsound. Here we report the PanDDA method, which automatically reveals clear electron density for the changed state—even from inaccurate maps—by subtracting a proportion of the confounding ‘ground state’; changed states are objectively identified from statistical analysis of density distributions. The method is completely general, implying new best practice for all changed-state studies, including the routine collection of multiple ground-state crystals. More generally, these results demonstrate: the incompleteness of atomic models; that single data sets contain insufficient information to model them fully; and that accuracy requires further map-deconvolution approaches.
Article
Although noncovalent binding by small molecules cannot be assumed a priori to be stoichiometric in the crystal lattice, occupancy refinement of ligands is often avoided by convention. Occupancies tend to be set to unity, requiring the occupancy error to be modelled by the B factors, and residual weak density around the ligand is necessarily attributed to ‘disorder’. Where occupancy refinement is performed, the complementary, superposed unbound state is rarely modelled. Here, it is shown that superior accuracy is achieved by modelling the ligand as partially occupied and superposed on a ligand-free ‘ground-state’ model. Explicit incorporation of this model of the crystal, obtained from a reference data set, allows constrained occupancy refinement with minimal fear of overfitting. Better representation of the crystal also leads to more meaningful refined atomic parameters such as the B factor, allowing more insight into dynamics in the crystal. An outline of an approach for algorithmically generating ensemble models of crystals is presented, assuming that data sets representing the ground state are available. The applicability of various electron-density metrics to the validation of the resulting models is assessed, and it is concluded that ensemble models consistently score better than the corresponding single-state models. Furthermore, it appears that ignoring the superposed ground state becomes the dominant source of model error, locally, once the overall model is accurate enough; modelling the local ground state properly is then more meaningful than correcting all remaining model errors globally, especially for low-occupancy ligands. Implications for the simultaneous refinement of B factors and occupancies, and for future evaluation of the limits of the approach, in particular its behaviour at lower data resolution, are discussed.
Article
Protein side-chain conformation is closely related to biological function. Side-chain prediction is a key step in protein design, protein docking and structure optimization. However, side-chain polymorphism is widespread in proteins, takes various forms, and has long been overlooked by side-chain prediction. Such conformational variations have not been quantitatively studied, and the correlations between these variations and residue features remain vague. Here, we performed statistical analyses on large scale data sets and found that the side-chain conformational flexibility is closely related to the exposure to solvent, degree of freedom and hydrophilicity. These analyses allowed us to quantify different types of side-chain variabilities in PDB. The results underscore that protein side-chain conformation prediction is not a single-answer problem, leading us to reconsider the assessment approaches of side-chain prediction programs.
Article
Proteins must move between different conformations of their native ensemble to perform their functions. Crystal structures obtained from high-resolution X-ray diffraction data reflect this heterogeneity as a spatial and temporal conformational average. Although movement between natively populated alternative conformations can be critical for characterizing molecular mechanisms, it is challenging to identify these conformations within electron density maps. Alternative side chain conformations are generally well separated into distinct rotameric conformations, but alternative backbone conformations can overlap at several atomic positions. Our model building program qFit uses mixed integer quadratic programming (MIQP) to evaluate an extremely large number of combinations of sidechain conformers and backbone fragments to locally explain the electron density. Here, we describe two major modeling enhancements to qFit: peptide flips and alternative glycine conformations. We find that peptide flips fall into four stereotypical clusters and are enriched in glycine residues at the n+1 position. The potential for insights uncovered by new peptide flips and glycine conformations is exemplified by HIV protease, where different inhibitors are associated with peptide flips in the "flap" regions adjacent to the inhibitor binding site. Our results paint a picture of peptide flips as conformational switches, often enabled by glycine flexibility, that result in dramatic local rearrangements. Our results furthermore demonstrate the power of large-scale computational analysis to provide new insights into conformational heterogeneity. Overall, improved modeling of backbone heterogeneity with high-resolution X-ray data will connect dynamics to the structure-function relationship and help drive new design strategies for inhibitors of biomedically important systems.
Article
Introduction: Protein-ligand interactions play key roles in various metabolic pathways, and the proteins involved in these interactions represent major targets for drug discovery. Molecular docking is widely used to predict the structure of protein-ligand complexes, and protein flexibility stands out as one of the most important and challenging issues for binding mode prediction. Various docking methods accounting for protein flexibility have been proposed, tackling problems of ever-increasing dimensionality. Areas covered: This paper presents an overview of conformational sampling methods treating target flexibility during molecular docking. Special attention is given to approaches considering full protein flexibility. Contrary to what is frequently done, this review does not rely on classical biomolecular recognition models to classify existing docking methods. Instead, it applies algorithmic considerations, focusing on the level of flexibility accounted for. This review also discusses the diversity of docking applications, from virtual screening (VS) of small drug-like compounds to geometry prediction (GP) of protein-peptide complexes. Expert opinion: Considering the diversity of docking methods presented here, deciding which one is the best at treating protein flexibility depends on the system under study and the research application. In VS experiments, ensemble docking can be used to implicitly account for large-scale conformational changes, and selective docking can additionally consider local binding-site rearrangements. In other cases, on-the-fly exploration of the whole protein-ligand complex might be needed for accurate GP of the binding mode. Among other things, future methods are expected to provide alternative binding modes, which will better reflect the dynamic nature of protein-ligand interactions.
Article
Proteins fluctuate between alternative conformations, which presents a challenge for ligand discovery because such flexibility is difficult to treat computationally owing to problems with conformational sampling and energy weighting. Here we describe a flexible docking method that samples and weights protein conformations using experimentally derived conformations as a guide. The crystallographically refined occupancies of these conformations, which are observable in an apo receptor structure, define energy penalties for docking. In a large prospective library screen, we identified new ligands that target specific receptor conformations of a cavity in cytochrome c peroxidase, and we confirm both ligand pose and associated receptor conformation predictions by crystallography. The inclusion of receptor flexibility led to ligands with new chemotypes and physical properties. By exploiting experimental measures of loop and side-chain flexibility, this method can be extended to the discovery of new ligands for hundreds of targets in the Protein Data Bank for which similar experimental information is available.
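One common way to turn a refined occupancy into an energy-like penalty is Boltzmann inversion, sketched below in Python. The abstract states only that occupancies define the docking penalties; the exact functional form, any scaling factor, and the temperature used in that work are not given here, so the formula, the constant name, and the function name in this sketch are assumptions for illustration.

```python
import math

# Gas constant in kcal/(mol*K); using molar units is an assumption of this sketch.
R_KCAL = 0.0019872041


def occupancy_penalty(occupancy, temperature=298.15):
    """Boltzmann-inversion penalty in kcal/mol: -R * T * ln(occupancy).

    A fully occupied conformation (occupancy 1.0) gets no penalty;
    rarer conformations get a larger one.
    """
    if not 0.0 < occupancy <= 1.0:
        raise ValueError("occupancy must be in the interval (0, 1]")
    return -R_KCAL * temperature * math.log(occupancy)


# Hypothetical occupancies for two alternate loop conformations.
for label, occ in [("A", 0.7), ("B", 0.3)]:
    print(label, round(occupancy_penalty(occ), 3))
```

Under this form, a ligand docked against the minor conformation has to score better by the penalty difference before it outranks a pose against the major conformation.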
Article
The role of virtual ligand screening in modern drug discovery is to mine large chemical collections and to prioritize for experimental testing a comparatively small and diverse set of compounds with expected activity against a target. Several studies have pointed out that the performance of virtual ligand screening can be improved by taking into account receptor flexibility. Here, we systematically assess how multiple crystallographic receptor conformations, a powerful way of discretely representing protein plasticity, can be exploited in screening protocols to separate binders from non-binders. Our analyses encompass 36 targets of pharmaceutical relevance and are based on actual molecules with reported activity against those targets. The results suggest that an ensemble receptor-based protocol displays a stronger discriminating power between active and inactive molecules as compared to its standard single rigid receptor counterpart. Moreover, such a protocol can be engineered not only to enrich a higher number of active compounds, but also to enhance their chemical diversity. Finally, some clear indications can be gathered on how to select a subset of receptor conformations that is most likely to provide the best performance in a real life scenario.
Article
The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macromolecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning. Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license. Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/_Mailing_lists; peter.cock@scri.ac.uk.
Article
mentation will be kept publicly available and the distribution sites will mirror the PDB archive using identical contents and subdirectory structure. However, each member of the wwPDB will be able to develop its own web site, with a unique view of the primary data, providing a variety of tools and resources for the global community. An Advisory Board consisting of appointees from the wwPDB, the International Union of Crystallography and the International Council on Magnetic Resonance in Biological Systems will provide guidance through annual meetings with the wwPDB consortium. This board is responsible for reviewing and determining policy as well as providing a forum for resolving issues related to the wwPDB. Specific details about the Advisory Board can be found in the wwPDB charter, available on the wwPDB web site. The RCSB is the 'archive keeper' of wwPDB. It has sole write access to the PDB archive and control over directory structure and contents, as well as responsibility for distributing new PDB identifiers to all deposition sites. The PDB archive is a collection of flat files in the legacy PDB file format 3 and in the mmCIF 4 format that follows the PDB exchange dictionary (http://deposit.pdb.org/mmcif/). This dictionary describes the syntax and semantics of PDB data that are processed and exchanged during the process of data annotation. It was designed to provide consistency in data produced in structure laboratories, processed by the wwPDB members and used in bioinformatics applications. The PDB archive does not include the websites, browsers, software and database query engines developed by researchers worldwide. The members of the wwPDB will jointly agree to any modifications or extensions to the PDB exchange dictionary. As data technology progresses, other data formats (such as XML) and delivery methods may be included in the official PDB archive if all the wwPDB members concur on the alteration.
Any new formats will follow the naming and description conventions of the PDB exchange dictionary. In addition, the legacy PDB format would not be modified unless there is a compelling reason for a change. Should such a situation occur, all three wwPDB members would have to agree on the changes and give the structural biology community 90 days advance notice. The creation of the wwPDB formalizes the international character of the PDB and ensures that the archive remains single and uniform. It provides a mechanism to ensure consistent data for software developers and users worldwide. We hope that this will encourage individual creativity in developing tools for presenting structural data, which could benefit the scientific research community in general.
Article
Significance: The dynamic nature of biomolecules is typically neglected in docking screens for ligand discovery. The key to benefitting from various receptor conformations is not only structural but also thermodynamic information. Here, we test a general approach that uses conformational preferences from enhanced and conventional molecular dynamics simulations to account for the cost of transitions to high-energy states. Including this information as a conformational penalty term in a docking scoring function, we perform retrospective and prospective screens and experimentally confirm predicted ligands with Tm upshift and X-ray crystallography. This not only allows us to test the predicted ligands for binding, it also tests whether they bind to the conformation of the binding site for which they were predicted.
Article
Scoring and numerical optimization of protein-ligand poses is an integral part of docking tools. Although many scoring functions exist, many of them are not continuously differentiable and they are rarely explicitly analyzed with respect to their numerical optimization behavior. Here, we present a consistent scheme for pose scoring and gradient-based pose optimization. It consists of a novel variant of the BFGS algorithm enabling step-length control, named LSL-BFGS (limited step length BFGS), and the empirical JAMDA scoring function designed for pose prediction and good numerical optimizability. The JAMDA scoring function shows a high pose prediction performance in the CASF-2016 docking power benchmark, top-ranking a pose with an RMSD of ≤2 Å in about 89% of the cases. The combination of JAMDA scoring with the LSL-BFGS algorithm shows a significantly higher optimization locality (i.e., no excessive movement of poses) than with the classical BFGS algorithm while retaining the characteristically low number of scoring function evaluations. The JAMDA scoring and optimization scheme is freely available for noncommercial use and academic research.
Article
We investigate unexpectedly short non-covalent distances (< 85% of the sum of van der Waals radii) in X-ray crystal structures of proteins. We curate over 11,000 high quality protein crystal structures and an ultra-high resolution (1.2 Å or better) subset containing > 900 structures. Although our non-covalent distance criterion excludes standard hydrogen bonds known to be essential in protein stability, we observe over 75,000 close contacts in the curated protein structures. Analysis of the frequency of amino acids participating in these interactions demonstrates some expected trends (i.e., enrichment of charged Lys, Arg, Asp, and Glu) but also reveals unexpected enhancement of Tyr in such interactions. Nearly all amino acids are observed to form at least one close contact with all other amino acids, and most interactions are preserved in the much smaller ultra high-resolution subset. We quantum-mechanically characterize the interaction energetics of a subset of > 5,000 close contacts with symmetry adapted perturbation theory to enable decomposition of interactions. We observe the majority of close contacts to be favorable. The shortest favorable non-covalent distances are under 2.2 Å and are very repulsive when characterized with classical force fields. This analysis reveals stabilization by a combination of electrostatic and charge transfer effects between hydrophobic (i.e., Val, Ile, Leu) amino acids and charged Asp or Glu. We also observe a unique hydrogen bonding configuration between Tyr and Asn/Gln involving both residues acting simultaneously as hydrogen bond donors and acceptors. This work confirms the importance of first-principles simulation in explaining unexpected geometries in protein crystal structures.
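The distance criterion described in this abstract (a non-covalent distance below 85% of the sum of van der Waals radii) can be sketched as follows. The abstract does not state which radii set was used; the Bondi values below, the small element subset, and the function name are assumptions for illustration.

```python
import math

# Bondi van der Waals radii in angstroms for a few common elements.
# Which radii set the cited study used is not stated; this is an assumption.
VDW_RADII = {"H": 1.20, "C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}


def is_close_contact(elem_a, xyz_a, elem_b, xyz_b, fraction=0.85):
    """Return True if two atoms are closer than `fraction` of their vdW radii sum."""
    distance = math.dist(xyz_a, xyz_b)
    cutoff = fraction * (VDW_RADII[elem_a] + VDW_RADII[elem_b])
    return distance < cutoff
```

For a carbon and an oxygen atom the cutoff is 0.85 × (1.70 + 1.52) Å, about 2.74 Å. A real analysis would also have to exclude covalently bonded pairs and standard hydrogen bonds, which this sketch does not attempt.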
Article
UCSF ChimeraX is next-generation software for the visualization and analysis of molecular structures, density maps, 3D microscopy, and associated data. It addresses challenges in the size, scope, and disparate types of data attendant with cutting-edge experimental methods, while providing advanced options for high-quality rendering (interactive ambient occlusion, reliable molecular surface calculations, etc.) and professional approaches to software design and distribution. This paper highlights some specific advances in the areas of visualization and usability, performance, and extensibility. ChimeraX is free for noncommercial use and is available from http://www.rbvi.ucsf.edu/chimerax/ for Windows, Mac, and Linux.
Article
Structural flexibility of proteins has an important influence on molecular recognition and enzymatic function. In modeling, structure ensembles are therefore often applied as a valuable source of alternative protein conformations. However, their usage is often complicated by structural artifacts and inconsistent data annotation. Here, we present SIENA, a new computational approach for the automated assembly and preprocessing of protein binding site ensembles. Starting with an arbitrarily defined binding site in a single protein structure, SIENA searches for alternative conformations of the same or sequentially closely related binding sites. The method is based on an indexed database for identifying perfect k-mer matches and a recently published algorithm for the alignment of protein binding site conformations. Furthermore, SIENA provides a new algorithm for the interaction-based selection of binding site conformations which aims at covering all known ligand-binding geometries. Various experiments highlight that SIENA is able to generate comprehensive and well selected binding site ensembles improving the compatibility to both known and unconsidered ligand molecules. Starting with the whole PDB as data source, the computation time of the whole ensemble generation takes only a few seconds. SIENA is available via a Web service at www.zbh.uni-hamburg.de/siena.
Article
Introduction: Molecular docking has become a popular method for virtual screening. Docking small molecules to a rigid biological receptor is fast but could produce many false negatives and identify less diverse compounds. Flexible receptor docking has alleviated this problem. Areas covered: This article focuses on reviewing ensemble docking as an approximate but inexpensive method to incorporate receptor flexibility in molecular docking. It outlines key features and recent advances of this method and points out problem areas that need to be addressed to make it even more useful in drug discovery. Expert opinion: Among the different methods introduced for flexible receptor docking, ensemble docking represents one of the most popular approaches, especially for high-throughput virtual screening. One can generate structural ensembles by using experimental structures, by structural modeling and by various types of molecular simulations. In building a structural ensemble, a judicious choice of the structures to be included can improve performance. Furthermore, reducing the size of the structural ensemble can cut computational costs, and removing the structures that can bind few ligands well could enrich the number of true actives identified by ensemble docking. The ability of ensemble docking to identify more true positives at the top of a rank-ordered list also depends on the choice of the methods to score and rank compounds, an area that needs further research.
Article
The Protein Data Bank (PDB; http://www.rcsb.org/pdb/) is the single worldwide archive of structural data of biological macromolecules. This paper describes the goals of the PDB, the systems in place for data deposition and access, how to obtain further information, and near-term plans for the future development of the resource.
Article
Conceptually, the simplistic lock and key model has been superseded by more realistic views of molecular recognition that take into account the intrinsic dynamics of biological macromolecules. However, it is still common for structure-based drug discovery methods to represent the receptor as static structures. The practical advantages of this approximation, the notable success attained over the past few decades with such simple models and the absence of clear guidelines for weighing the pros and cons of accounting for flexibility may prompt some investigators to stretch the rigid model beyond its scope. Here, we investigate the relationship between protein flexibility and binding free energy and present some useful hints for understanding when, and to what extent, flexibility should be considered. Using molecular dynamics simulations of hen egg-white lysozyme (HEWL) with explicit aqueous/organic solvent mixtures and a range of restraint conditions, we find out how artificially restricted mobility affects binding hot spots. Barring sampling errors or an inappropriate choice of reference structure, we find that decreased mobility (measured as B-factors) leads to artifactually more negative binding free energies, but a logarithmic relationship between both terms attenuates the errors. Consequently, ignoring flexibility may be an acceptable approximation for intrinsically rigid regions (such as the active site of enzymes) but may lead to larger errors elsewhere. For the same reason, local conformational sampling yields very accurate predictions and, owing to its practical advantages, may be preferable to full conformational sampling for many applications.
Article
The analysis of small molecule crystal structures is a common way to gather valuable information for drug development. The necessary structural data is usually provided in specific file formats containing only element identities and three-dimensional atomic coordinates as reliable chemical information. Consequently, the automated perception of molecular structures from atomic coordinates has become a standard task in cheminformatics. The molecules generated by such methods must be both chemically valid and reasonable to provide a reliable basis for subsequent calculations. This can be a difficult task since the provided coordinates may deviate from ideal molecular geometries due to experimental uncertainties or low resolution. Additionally, the quality of the input data often differs significantly thus making it difficult to distinguish between actual structural features and mere geometric distortions. We present a method for the generation of molecular structures from atomic coordinates based on the recently published NAOMI model. By making use of this consistent chemical description, our method is able to generate reliable results even with input data of low quality. Molecules from 363 Protein Data Bank (PDB) entries could be perceived with a success rate of 98%, a result which could not be achieved with previously described methods. The robustness of our approach has been assessed by processing all small molecules from the PDB and comparing them to reference structures. The complete data set can be processed in less than 3 minutes, thus showing that our approach is suitable for large scale applications.
Article
In most cheminformatics workflows, chemical information is stored in files which provide the necessary data for subsequent calculations. The correct interpretation of the file formats is an important prerequisite to obtain meaningful results. Consistent reading of molecules from files, however, is not an easy task. Each file format implicitly represents an underlying chemical model, which has to be taken into consideration when the input data is processed. Additionally, many data sources contain invalid molecules. These have to be identified and either corrected or discarded. We present the chemical file format converter NAOMI, which provides efficient procedures for reliable handling of molecules from the common chemical file formats SDF, MOL2, and SMILES. These procedures are based on a consistent chemical model which has been designed for the appropriate representation of molecules relevant in the context of drug discovery. NAOMI's functionality is tested by round robin file IO exercises with public data sets, which we believe should become a standard test for every cheminformatics tool.
Article
Although proteins populate large structural ensembles, X-ray diffraction data are traditionally interpreted using a single model. To search for evidence of alternate conformers, we developed a program, Ringer, which systematically samples electron density around the dihedral angles of protein side chains. In a diverse set of 402 structures, Ringer identified weak, nonrandom electron-density features that suggest the presence of hidden, lowly populated conformations for >18% of uniquely modeled residues. Although these peaks occur at electron-density levels traditionally regarded as noise, statistically significant (P < 10⁻⁵) enrichment of peaks at successive rotameric chi angles validates the assignment of these features as unmodeled conformations. Weak electron density corresponding to alternate rotamers also was detected in an accurate electron density map free of model bias. Ringer analysis of the high-resolution structures of free and peptide-bound calmodulin identified shifts in ensembles and connected the alternate conformations to ligand recognition. These results show that the signal in high-resolution electron density maps extends below the traditional 1 sigma cutoff, and crystalline proteins are more polymorphic than current crystallographic models. Ringer provides an objective, systematic method to identify previously undiscovered alternate conformations that can mediate protein folding and function.
Article
Structure-based design usually focuses upon the optimization of ligand affinity. However, successful drug design also requires the optimization of many other properties. The primary source of structural information for protein-ligand complexes is X-ray crystallography. The uncertainties introduced during the derivation of an atomic model from the experimentally observed electron density data are not always appreciated. Uncertainties in the atomic model can have significant consequences when this model is subsequently used as the basis of manual design, docking, scoring, and virtual screening efforts. Docking and scoring algorithms are currently imperfect. A good correlation between observed and calculated binding affinities is usually observed only when very large ranges of affinity are considered. Errors in the correlation often exceed the range of affinities commonly encountered during lead optimization. Some structure-based design approaches now involve screening libraries by using technologies based on NMR spectroscopy and X-ray crystallography to discover small polar templates, which are used for further optimization. Such compounds are defined as leadlike and are also sought by more traditional high-throughput screening technologies. Structure-based design and HTS technologies show important complementarity and a degree of convergence.
Article
One of the current challenges in docking studies is the inclusion of receptor flexibility. This is crucial because the binding sites of many therapeutic targets sample a wide range of conformational states, which has major consequences on molecular recognition. In this paper, we make use of very large sets of X-ray structures of cyclin dependent kinase 2 (CDK2) and heat shock protein 90 (HSP90) to assess the performance of flexible receptor docking in binding-mode prediction and virtual screening experiments. Flexible receptor docking performs much better than rigid receptor docking in the former application. Regarding the latter, we observe a significant improvement in the prediction of binding affinities, but owing to an increase in the number of false positives, this is not translated into better hit rates. A simple scoring scheme to correct this limitation is presented. More importantly, pitfalls inherent to flexible receptor docking have been identified and guidelines are presented to avoid them.
Article
The sc-PDB is a collection of 6 415 three-dimensional structures of binding sites found in the Protein Data Bank (PDB). Binding sites were extracted from all high-resolution crystal structures in which a complex between a protein cavity and a small-molecular-weight ligand could be identified. Importantly, ligands are considered from a pharmacological and not a structural point of view. Therefore, solvents, detergents, and most metal ions are not stored in the sc-PDB. Ligands are classified into four main categories: nucleotides (< 4-mer), peptides (< 9-mer), cofactors, and organic compounds. The corresponding binding site is formed by all protein residues (including amino acids, cofactors, and important metal ions) with at least one atom within 6.5 angstroms of any ligand atom. The database was carefully annotated by browsing several protein databases (PDB, UniProt, and GO) and storing, for every sc-PDB entry, the following features: protein name, function, source, domain and mutations, ligand name, and structure. The repository of ligands has also been archived by diversity analysis of molecular scaffolds, and several chemoinformatics descriptors were computed to better understand the chemical space covered by stored ligands. The sc-PDB may be used for several purposes: (i) screening a collection of binding sites for predicting the most likely target(s) of any ligand, (ii) analyzing the molecular similarity between different cavities, and (iii) deriving rules that describe the relationship between ligand pharmacophoric points and active-site properties. The database is periodically updated and accessible on the web at http://bioinfo-pharma.u-strasbg.fr/scPDB/.
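The binding-site definition given in this abstract (all residues with at least one atom within 6.5 Å of any ligand atom) can be sketched in a few lines. The input layout, the residue labels, and the function name are hypothetical; this is not the sc-PDB implementation, and parsing of PDB files is left out.

```python
import math


def binding_site_residues(protein_atoms, ligand_coords, cutoff=6.5):
    """Collect residues that have at least one atom within `cutoff` angstroms of the ligand.

    protein_atoms: iterable of (residue_id, (x, y, z)) tuples, one per atom.
    ligand_coords: list of (x, y, z) tuples, one per ligand atom.
    """
    site = set()
    for residue_id, xyz in protein_atoms:
        if residue_id in site:
            continue  # residue already selected through another of its atoms
        if any(math.dist(xyz, lig_xyz) <= cutoff for lig_xyz in ligand_coords):
            site.add(residue_id)
    return site


# Hypothetical toy input: three residues along the x axis, one ligand atom.
atoms = [
    ("ALA1", (0.0, 0.0, 0.0)),
    ("ALA1", (1.0, 0.0, 0.0)),
    ("GLY2", (10.0, 0.0, 0.0)),
    ("SER3", (20.0, 0.0, 0.0)),
]
site = binding_site_residues(atoms, [(4.0, 0.0, 0.0)])
```

The pairwise loop is quadratic in the number of atoms; for whole-archive processing a spatial grid or k-d tree would be the usual replacement.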