Improved protein structure prediction using potentials from deep learning

Abstract and Figures

Protein structure prediction can be used to determine the three-dimensional shape of a protein from its amino acid sequence¹. This problem is of fundamental importance as the structure of a protein largely determines its function²; however, protein structures can be difficult to determine experimentally. Considerable progress has recently been made by leveraging genetic information. It is possible to infer which amino acid residues are in contact by analysing covariation in homologous sequences, which aids in the prediction of protein structures³. Here we show that we can train a neural network to make accurate predictions of the distances between pairs of residues, which convey more information about the structure than contact predictions. Using this information, we construct a potential of mean force⁴ that can accurately describe the shape of a protein. We find that the resulting potential can be optimized by a simple gradient descent algorithm to generate structures without complex sampling procedures. The resulting system, named AlphaFold, achieves high accuracy, even for sequences with fewer homologous sequences. In the recent Critical Assessment of Protein Structure Prediction⁵ (CASP13)—a blind assessment of the state of the field—AlphaFold created high-accuracy structures (with template modelling (TM) scores⁶ of 0.7 or higher) for 24 out of 43 free modelling domains, whereas the next best method, which used sampling and contact information, achieved such accuracy for only 14 out of 43 domains. AlphaFold represents a considerable advance in protein-structure prediction. We expect this increased accuracy to enable insights into the function and malfunction of proteins, especially in cases for which no structures for homologous proteins have been experimentally determined⁷.
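As a rough illustration of optimizing a distance-based potential by gradient descent, the sketch below makes several simplifying assumptions that are not in the paper's description: each predicted distance distribution is summarized as a Gaussian with mean `target[i, j]` and a shared width `sigma` (so the negative log-likelihood is a harmonic term), and the descent acts directly on Cartesian coordinates of one point per residue. All function names are hypothetical.

```python
import numpy as np

def pairwise_distances(x):
    """Return the (N, N) matrix of Euclidean distances between rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(-1))

def potential(x, target, sigma=1.0):
    """Sum over pairs i < j of (d_ij - target_ij)^2 / (2 sigma^2).

    This is the negative log-likelihood (up to a constant) under the
    assumed Gaussian distance distributions.
    """
    d = pairwise_distances(x)
    iu = np.triu_indices(len(x), k=1)
    return ((d[iu] - target[iu]) ** 2).sum() / (2 * sigma ** 2)

def gradient(x, target, sigma=1.0):
    """Analytic gradient of `potential` with respect to the coordinates."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(d, 1.0)            # avoid 0/0 on the diagonal
    coeff = (d - target) / d / sigma ** 2
    np.fill_diagonal(coeff, 0.0)        # a residue exerts no force on itself
    return (coeff[:, :, None] * diff).sum(axis=1)

def descend(x0, target, lr=0.01, steps=500):
    """Plain gradient descent with a fixed step size (assumed values)."""
    x = x0.copy()
    for _ in range(steps):
        x -= lr * gradient(x, target)
    return x

rng = np.random.default_rng(0)
native = rng.normal(size=(8, 3)) * 3.0   # toy "native" coordinates
target = pairwise_distances(native)      # stands in for predicted distances
start = rng.normal(size=(8, 3)) * 3.0    # random initial structure
final = descend(start, target)
print(potential(start, target), potential(final, target))
```

The toy run only shows that the potential decreases from a random start; it makes no claim about recovering the native structure or its mirror image.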
Analysis of structure accuracies. a, lDDT12 versus distogram lDDT12 (see Methods, ‘Accuracy’). The distogram accuracy predicts the lDDT of the realized structure well (particularly for medium- and long-range residue pairs, as well as the TM score as shown in Fig. 4a) for both CASP13 (n = 500: 5 decoys for domains excluding T0999) and test (n = 377) datasets. Data are shown with Pearson’s correlation coefficients. b, DLDDT12 against the effective number of sequences in the MSA (Neff) normalized by sequence length (n = 377). The number of effective sequences correlates with this measure of distogram accuracy (r = 0.634). c, Structure accuracy measures, computed on the test set (n = 377), for gradient descent optimization of different forms of the potential. Top, removing terms in the potential, and showing the effect of following optimization with Rosetta relax. ‘P’ shows the significance of the potential giving different results from ‘Full’, for a two-tailed paired data t-test. ‘Bins’ shows the number of bins fitted by the spline before extrapolation and the number in the full distribution. In CASP13, splines were fitted to the first 51 of 64 bins. Bottom, reducing the resolution of the distogram distributions. The original 64-bin distogram predictions are repeatedly downsampled by a factor of 2 by summing adjacent bins, in each case with constant extrapolation beyond 18 Å (the last quarter of the bins). The two-level potential in the final row, which was designed to compare with contact predictions, is constructed by summing the probability mass below 8 Å and between 8 and 14 Å, with constant extrapolation beyond 14 Å. The TM scores in this table are plotted in Fig. 4b.
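The repeated downsampling described in panel c can be sketched as follows. This is a minimal illustration, assuming the distogram is stored as an array whose last axis holds the 64 bin probabilities; the function name and the random test data are hypothetical, and the spline fitting and extrapolation steps are not shown.

```python
import numpy as np

def downsample_bins(probs, factor=2):
    """Merge each run of `factor` adjacent distance bins by summing their mass."""
    n_bins = probs.shape[-1]
    if n_bins % factor:
        raise ValueError("number of bins must be divisible by factor")
    new_shape = probs.shape[:-1] + (n_bins // factor, factor)
    return probs.reshape(new_shape).sum(axis=-1)

rng = np.random.default_rng(0)
# Toy distogram: a 5 x 5 residue grid, each pair with a 64-bin distribution.
distogram = rng.dirichlet(np.ones(64), size=(5, 5))

# Halve the resolution repeatedly: 64 -> 32 -> 16 -> 8 -> 4 -> 2 bins.
levels = [distogram]
while levels[-1].shape[-1] > 2:
    levels.append(downsample_bins(levels[-1]))
print([level.shape[-1] for level in levels])
```

Because adjacent bins are summed rather than averaged, every level remains a normalized probability distribution over distance.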
706 | Nature | Vol 577 | 30 January 2020
Article
Improved protein structure prediction using potentials from deep learning
Andrew W. Senior¹,⁴*, Richard Evans¹,⁴, John Jumper¹,⁴, James Kirkpatrick¹,⁴, Laurent Sifre¹,⁴, Tim Green¹, Chongli Qin¹, Augustin Žídek¹, Alexander W. R. Nelson¹, Alex Bridgland¹, Hugo Penedones¹, Stig Petersen¹, Karen Simonyan¹, Steve Crossan¹, Pushmeet Kohli¹, David T. Jones²,³, David Silver¹, Koray Kavukcuoglu¹ & Demis Hassabis¹
Proteins are at the core of most biological processes. As the function of a protein is dependent on its structure, understanding protein structures has been a grand challenge in biology for decades. Although several experimental structure determination techniques have been developed and improved in accuracy, they remain difficult and time-consuming². As a result, decades of theoretical work has attempted to predict protein structures from amino acid sequences.
CASP⁵ is a biennial blind protein structure prediction assessment run by the structure prediction community to benchmark progress in accuracy. In 2018, AlphaFold joined 97 groups from around the world in entering CASP13⁸. Each group submitted up to 5 structure predictions for each of 84 protein sequences for which experimentally determined structures were sequestered. Assessors divided the proteins into 104 domains for scoring and classified each as being amenable to template-based modelling (TBM, in which a protein with a similar sequence has a known structure, and that homologous structure is modified in accordance with the sequence differences) or requiring free modelling (FM, in cases in which no homologous structure is available), with an intermediate (FM/TBM) category. Figure 1a shows that AlphaFold predicts more FM domains with high accuracy than any other system, particularly in the 0.6–0.7 TM-score range. The TM score, ranging between 0 and 1, measures the degree of match of the overall (backbone) shape of a proposed structure to a native structure. The assessors ranked the 98 participating groups by the summed, capped z-scores of the structures, separated according to category. AlphaFold achieved a summed z-score of 52.8 in the FM category (best-of-five) compared with 36.6 for the next closest group (322). Combining FM and TBM/FM categories, AlphaFold scored 68.3 compared with 48.2. AlphaFold is able to predict previously unknown folds to high accuracy (Fig. 1b). Despite using only FM techniques and not using templates, AlphaFold also scored well in the TBM category according to the assessors’ formula 0-capped z-score, ranking fourth for the top-one model or first for the best-of-five models. Much of the accuracy of AlphaFold is due to the accuracy of the distance predictions, which is evident from the high precision of the corresponding contact predictions (Fig. 1c and Extended Data Fig. 2a).
https://doi.org/10.1038/s41586-019-1923-7
Received: 2 April 2019
Accepted: 10 December 2019
Published online: 15 January 2020
¹DeepMind, London, UK. ²The Francis Crick Institute, London, UK. ³University College London, London, UK. ⁴These authors contributed equally: Andrew W. Senior, Richard Evans, John Jumper, James Kirkpatrick, Laurent Sifre. *e-mail: andrewsenior@google.com
... For example, in vision and language Transformers, the relative distance among tokens is a well-established attention bias and has been proven essential to the final performance of various tasks [23,22,28]. Also, for scientific problems with rich domain-specific prior [39], attention bias is an indispensable component, such as the pair representation bias in AlphaFold [1,17,32]. However, the targeted efficiency optimization for attention with bias is still lacking. ...
... Similarly, relative position is also used to encode sequential information in language models [13,36]. In addition, attention bias widely exists in Transformer-based scientific deep models to incorporate domain-specific priors, such as the pair representation bias in AlphaFold [1,17,32]. ...
... As shown in Figure 7, the maximum rank value of the low-rank subset is smaller than 100 in 23 layers (except the third layer). Based on these statistical data, we set R in FlashBias as [32,32,280,40,48,40,40,88,80,32,64,32,32,32,32,32,32,88,32,32,32,32,32,32] for the low-rank subset in each layer. Especially, we set R ≡ 0 (mod 8) following FlashAttention v2 [8] to facilitate block-wise reading. ...
Preprint
Attention mechanism has emerged as a foundation module of modern deep learning models and has also empowered many milestones in various domains. Moreover, FlashAttention with IO-aware speedup resolves the efficiency issue of standard attention, further promoting its practicality. Beyond canonical attention, attention with bias also widely exists, such as relative position bias in vision and language models and pair representation bias in AlphaFold. In these works, prior knowledge is introduced as an additive bias term of attention weights to guide the learning process, which has been proven essential for model performance. Surprisingly, despite the common usage of attention with bias, its targeted efficiency optimization is still absent, which seriously hinders its wide applications in complex tasks. Diving into the computation of FlashAttention, we prove that its optimal efficiency is determined by the rank of the attention weight matrix. Inspired by this theoretical result, this paper presents FlashBias based on the low-rank compressed sensing theory, which can provide fast-exact computation for many widely used attention biases and a fast-accurate approximation for biases in general formalization. FlashBias can fully take advantage of the extremely optimized matrix multiplication operation in modern GPUs, achieving 1.5× speedup for AlphaFold, and over 2× speedup for attention with bias in vision and language models without loss of accuracy.
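To see why a low-rank bias can be handled with plain matrix multiplication, the NumPy sketch below checks one linear-algebra identity: if the bias factors as `u @ w.T`, then appending `u` to the scaled queries and `w` to the keys reproduces the biased scores exactly. This is an illustration of that identity only, not the FlashBias implementation; the function names, shapes, and rank are hypothetical.

```python
import numpy as np

def softmax(x):
    """Row-wise softmax with the usual max subtraction for stability."""
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_with_bias(q, k, v, bias):
    """Reference attention that materializes the full (n, n) bias matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1]) + bias
    return softmax(scores) @ v

def attention_low_rank_bias(q, k, v, u, w):
    """Same result when bias = u @ w.T, without forming the bias matrix.

    [q / sqrt(d), u] @ [k, w].T equals q @ k.T / sqrt(d) + u @ w.T.
    """
    scale = np.sqrt(q.shape[-1])
    q_ext = np.concatenate([q / scale, u], axis=-1)
    k_ext = np.concatenate([k, w], axis=-1)
    return softmax(q_ext @ k_ext.T) @ v

rng = np.random.default_rng(0)
n, d, r = 16, 8, 4                      # tokens, head dimension, bias rank
q, k, v = (rng.normal(size=(n, d)) for _ in range(3))
u, w = rng.normal(size=(n, r)), rng.normal(size=(n, r))

dense = attention_with_bias(q, k, v, u @ w.T)
factored = attention_low_rank_bias(q, k, v, u, w)
print(np.abs(dense - factored).max())
```

The factored form stores n × r numbers per factor instead of an n × n bias, which is the memory saving a low-rank treatment relies on.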
... In a recent study, we demonstrated the interaction between FlhG and the flagellar C-ring protein FliY, pinpointing residues Lysine 177, Arginine 277, and Phenylalanine 215 on FlhG as crucial for this interaction (Schuhmacher, Rossmann, Dempwolff, et al. 2015). To predict a potential binding site of GtGpsB on GtFlhG, we made use of Alphafold2 (Senior et al. 2020). In this prediction, GtGpsB interacted with a GtFlhG region, which was very similar to the binding site of FliY, irrespective of whether the interaction was modeled with a monomer or a trimer of the GtGpsB-CTD (Figure 3a). ...
Article
Number and arrangement of flagella, the bacterial locomotion organelles, are species‐specific and serve as key taxonomic markers. The FlhG ATPase (also: YlxH, FleN), along with FlhF, plays pivotal roles in determining flagellation patterns. In Bacillus subtilis, FlhG and FlhF govern the spatial arrangement of peritrichous flagella. FlhG aids in flagellar assembly by interacting with the flagellar C‐ring protein FliY, yet the molecular implications of this interaction have been unclear. Our study reveals that the ATP‐dependent FlhG homodimer interacts with the C‐terminal domain of GpsB, a cell cycle regulator, which recruits the peptidoglycan synthase PBP1 (also: ponA) to sites of cell wall elongation. A deletion of gpsB leads to dysregulation of the flagellation pattern mimicking the effects of a flhG deletion strain. The finding that GpsB can interact simultaneously with FlhG and PBP1, combined with the observation that GpsB and FliY can simultaneously interact with FlhG, strongly argues for a model in which FlhG confines flagella biosynthesis to regions of active cell wall biosynthesis. Thus, the FlhG‐GpsB interaction appears to enable the locally restrained stimulation of the GTPase FlhF, known for its role to localize flagella in various bacterial species.
... Deep learning models, particularly AlphaFold [5], have significantly advanced structural prediction but have key limitations. Their reliance on evolutionary data restricts the performance of novel folds or proteins with limited homologous sequences [6], [7]. Current models also struggle with multi chain protein complexes and intricate residue-residue interactions due to limitations in pairwise-based representations [7]. ...
Preprint
Accurate protein structure prediction remains a key challenge in computational biology, especially for novel folds and complex interactions. We present TextFold, a framework that integrates geometric hypergraphs with multimodal textual context to enhance protein folding predictions. By modeling higher-order spatial relationships and incorporating scientific insights from PubMed and PubTator Central scientific database, TextFold improves accuracy and interpretability through feature attribution analysis. Evaluated on PDB and AlphaFold Protein Structure Database, TextFold achieves a TM-score of 0.81 and an RMSD of 2.1Å for homologous and 3.1Å for low-homology proteins, outperforming DeepFold, AlphaFold, and RoseTTAFold in low-homology settings. Ablation studies demonstrate the impact of textual embeddings on prediction refinement. By integrating geometric modeling with domain knowledge, TextFold advances protein structure prediction, offering a valuable tool for drug discovery and functional genomics.
... Jumper et al. [1] introduced AlphaFold, demonstrating highly accurate protein structure prediction for various proteomes. Senior et al. [2] further refined this approach by incorporating potentials from deep learning, leading to improved predictions. Tunyasuvunakool et al. [3] focused on highly accurate predictions specifically for the human proteome, leveraging AlphaFold technology. ...
Article
Protein structure prediction plays a crucial role in understanding biological functions and drug design. However, the current methods face challenges in accuracy and efficiency due to the complexity of protein structures. This paper addresses the limitations by proposing a novel approach utilizing Lasso regression with L1 regularization. By incorporating the sparsity-inducing property of L1 regularization, our method efficiently selects relevant features and improves prediction accuracy. The research results demonstrate that our approach outperforms existing methods in both accuracy and computational efficiency, showcasing its potential for advancing protein structure prediction in biomedical research and pharmaceutical development.
... For example, sequence-based representations fail to distinguish different functional conformations of proteins (15,16), which are crucial for their proper functionality. Despite structure prediction models like AlphaFold (17)(18)(19) and RoseTTAFold (20) have demonstrated their ability to map sequences to structures, the quality of their folding largely relies on homologous sequences and structural templates. On the other hand, inverse folding models such as ProteinMPNN (6) can design sequences that stabilize and accommodate given backbone structures, but their generalizability to rare and novel folds remains to be tested. ...
Article
Modern protein engineering demands integrated sequence-structure representations to tackle key challenges in designing, modifying, and evolving proteins for specific functions. While sequence-based methods are promising for generating novel proteins, incorporating...
... Addressing data quality, legal frameworks, and ethical considerations could improve patient outcomes. 76,77,78 ...
Article
We describe AlphaFold, the protein structure prediction system that was entered by the group A7D in CASP13. Submissions were made by three free‐modelling methods which combine the predictions of three neural networks. All three systems were guided by predictions of distances between pairs of residues produced by a neural network. Two systems assembled fragments produced by a generative neural network, one using scores from a network trained to regress GDT_TS. The third system shows that simple gradient descent on a properly constructed potential is able to perform on‐par with more expensive traditional search techniques and without requiring domain segmentation. In the CASP13 free‐modelling assessors' ranking by summed z‐scores, this system scored highest with 68.3 vs 48.2 for the next closest group. (An average GDT_TS of 61.4.) The system produced high‐accuracy structures (with GDT_TS scores of 70 or higher) for 11 out of 43 free‐modelling domains. Despite not explicitly using template information, the results in the template category were comparable to the best performing template‐based methods. This article is protected by copyright. All rights reserved.
Article
Performance in the template‐based modeling (TBM) category of CASP13 is assessed here, using a variety of metrics. Performance of the predictor groups that participated is ranked using the primary ranking score that was developed by the assessors for CASP12. This reveals that the best results are obtained by groups that include contact predictions or inter‐residue distance predictions derived from deep multiple sequence alignments. In cases where there is a good homologue in the wwPDB (TBM‐easy category), the best results are obtained by modifying a template. However, for cases with poorer homologues (TBM‐hard), very good results can be obtained without using an explicit template, by deep learning algorithms trained on the wwPDB. Alternative metrics are introduced, to allow testing of aspects of structural models that are not addressed by traditional CASP metrics. These include comparisons to the main‐chain and side‐chain torsion angles of the target, and the utility of models for solving crystal structures by the molecular replacement method. The alternative metrics are poorly correlated with the traditional metrics, and it is proposed that modeling has reached a sufficient level of maturity that the best models should be expected to satisfy this wider range of criteria. This article is protected by copyright. All rights reserved.
Article
In this article, we describe our efforts in contact prediction in the CASP13 experiment. We employed a new deep learning‐based contact prediction tool, DeepMetaPSICOV (or DMP for short), together with new methods and data sources for alignment generation. DMP evolved from MetaPSICOV and DeepCov and combines the input feature sets used by these methods as input to a deep, fully convolutional residual neural network. We also improved our method for multiple sequence alignment generation and included metagenomic sequences in the search. We discuss successes and failures of our approach and identify areas where further improvements may be possible. DMP is freely available at: https://github.com/psipred/DeepMetaPSICOV. This article is protected by copyright. All rights reserved.
Article
Misoprostol is a life-saving drug in many developing countries for women at risk of post-partum hemorrhaging owing to its affordability, stability, ease of administration and clinical efficacy. However, misoprostol lacks receptor and tissue selectivities, and thus its use is accompanied by a number of serious side effects. The development of pharmacological agents combining the advantages of misoprostol with improved selectivity is hindered by the absence of atomic details of misoprostol action in labor induction. Here, we present the 2.5 Å resolution crystal structure of misoprostol free-acid form bound to the myometrium labor-inducing prostaglandin E2 receptor 3 (EP3). The active state structure reveals a completely enclosed binding pocket containing a structured water molecule that coordinates misoprostol's ring structure. Modeling of selective agonists in the EP3 structure reveals rationales for selectivity. These findings will provide the basis for the next generation of uterotonic drugs that will be suitable for administration in low resource settings. © 2018, The Author(s), under exclusive licence to Springer Nature America, Inc.
Article
Visualizations of biomolecular structures empower us to gain insights into biological functions, generate testable hypotheses, and communicate biological concepts. Typical visualizations (such as ball and stick) primarily depict covalent bonds. In contrast, non-covalent contacts between atoms, which govern normal physiology, pathogenesis, and drug action, are seldom visualized. We present the Protein Contacts Atlas, an interactive resource of non-covalent contacts from over 100,000 PDB crystal structures. We developed multiple representations for visualization and analysis of non-covalent contacts at different scales of organization: atoms, residues, secondary structure, subunits, and entire complexes. The Protein Contacts Atlas enables researchers from different disciplines to investigate diverse questions in the framework of non-covalent contacts, including the interpretation of allostery, disease mutations and polymorphisms, by exploring individual subunits, interfaces, and protein-ligand contacts and by mapping external information. The Protein Contacts Atlas is available at http://www.mrc-lmb.cam.ac.uk/pca/ and also through PDBe.
Article
CASP (Critical Assessment of Structure Prediction) assesses the state of the art in modeling protein structure from amino acid sequence. The most recent experiment (CASP13 held in 2018) saw dramatic progress in structure modeling without use of structural templates (historically ‘ab initio’ modeling). Progress was driven by the successful application of deep learning techniques to predict inter‐residue distances. In turn, these results drove dramatic improvements in three‐dimensional structure accuracy: With the proviso that there are an adequate number of sequences known for the protein family, the new methods essentially solve the long‐standing problem of predicting the fold topology of monomeric proteins. Further, the number of sequences required in the alignment has fallen substantially. There is also substantial improvement in the accuracy of template‐based models. Other areas ‐ model refinement, accuracy estimation, and the structure of protein assemblies ‐ have again yielded interesting results. CASP13 placed increased emphasis on the use of sparse data together with modeling and chemical crosslinking, SAXS, and NMR all yielded more mature results. This paper summarizes the key outcomes of CASP13. The special issue of PROTEINS contains papers describing the CASP13 assessments in each modeling category and contributions from the participants. This article is protected by copyright. All rights reserved.
Article
This paper reports the CASP13 results of distance‐based contact prediction, threading and folding methods implemented in three RaptorX servers, which are built upon the powerful deep convolutional residual neural network (ResNet) method initiated by us for contact prediction in CASP12. On the 32 CASP13 FM (free‐modeling) targets with a median MSA (multiple sequence alignment) depth of 36, RaptorX yielded the best contact prediction among 46 groups and almost the best 3D structure modeling among all server groups without time‐consuming conformation sampling. In particular, RaptorX achieved top L/5, L/2 and L long‐range contact precision of 70%, 58% and 45%, respectively, and predicted correct folds (TMscore>0.5) for 18 of 32 targets. In particular, RaptorX predicted correct folds for all FM targets with >300 residues (T0950‐D1, T0969‐D1 and T1000‐D2) and generated the best 3D models for T0950‐D1 and T0969‐D1 among all groups. This CASP13 test confirms our previous findings: (1) predicted distance is more useful than contacts for both template‐based and free modeling; and (2) structure modeling may be improved by integrating template and co‐evolutionary information via deep learning. This paper will discuss progress we have made since CASP12, the strength and weakness of our methods, and why deep learning performed much better in CASP13. This article is protected by copyright. All rights reserved.
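The top L/5, L/2 and L long-range precision figures quoted here follow a common evaluation recipe, sketched below under stated assumptions: "long-range" is taken to mean a sequence separation of at least 24 residues (the usual CASP convention, not stated in this abstract), and the function name and toy data are hypothetical.

```python
import numpy as np

def top_long_range_precision(pred, true_contacts, divisor, min_sep=24):
    """Precision of the top L/divisor long-range contact predictions.

    pred:          (L, L) array of predicted contact scores.
    true_contacts: (L, L) boolean array of native contacts.
    min_sep:       minimum sequence separation |i - j| (assumed to be 24).
    """
    length = pred.shape[0]
    i, j = np.triu_indices(length, k=min_sep)     # pairs with j - i >= min_sep
    n_top = max(1, length // divisor)
    order = np.argsort(-pred[i, j])[:n_top]       # highest-scoring pairs first
    return float(true_contacts[i[order], j[order]].mean())

rng = np.random.default_rng(0)
L = 60
upper = np.triu(rng.random((L, L)) < 0.3, k=1)
truth = upper | upper.T                            # symmetric toy contact map

perfect = truth.astype(float)                      # oracle predictor
inverted = 1.0 - truth                             # always wrong predictor
for divisor in (5, 2, 1):
    print(divisor, top_long_range_precision(perfect, truth, divisor))
print(top_long_range_precision(inverted, truth, 5))
```

Ranking only the top L/k pairs focuses the metric on the most confident predictions, which are the ones a downstream folding step would rely on most.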
Article
We report the results of residue‐residue contact prediction of a new pipeline built purely on the learning of coevolutionary features in the CASP13 experiment. For a query sequence, the pipeline starts with the collection of multiple sequence alignments (MSAs) from multiple genome and metagenome sequence databases using two complementary HMM‐based searching tools. Three profile matrices, built on covariance, precision, and pseudolikelihood maximization respectively, are then created from the MSAs, which are used as the input features of a deep residual convolutional neural network architecture for contact‐map training and prediction. Two ensembling strategies have been proposed to integrate the matrix features through end‐to‐end training and stacking, resulting in two complementary programs called TripletRes and ResTriplet, respectively. For the 31 free‐modeling (FM) domains that do not have homologous templates in the PDB, TripletRes and ResTriplet generated comparable results with an average accuracy of 0.640 and 0.646, respectively, for the top L/5 long‐range predictions, where 71% and 74% of the cases have an accuracy above 0.5. Detailed data analyses showed that the strength of the pipeline is due to the sensitive MSA construction and the advanced strategies for coevolutionary feature ensembling. Domain splitting was also found to help enhance the contact prediction performance. Nevertheless, contact models for tail regions, which often involve a high number of alignment gaps, and for targets with few homologous sequences are still suboptimal. Development of new approaches where the model is specifically trained on these regions and targets might help address these problems. This article is protected by copyright. All rights reserved.
Article
We present our assessment of tertiary structure predictions for hard targets in CASP13. The analysis includes (i) assignment and discussion of best models through scores‐aided visual inspection of models for each evaluation unit; (ii) ranking of predictors resulting from this evaluation and from global scores; and (iii) evaluation of progress, state of the art and current limitations of protein structure prediction. We witness a sizable improvement in tertiary structure prediction building on the progress observed from CASP11 to CASP12, with (i) top models reaching backbone RMSD <3 Å for several evaluation units of size <150 residues, contributed by many groups; (ii) at least one model that roughly captures global topology for all evaluation units, probably unprecedented in this track of CASP; and (iii) even quite good models for full, unsplit targets. Better structure predictions are brought about mainly by improved residue‐residue contact predictions, and since this CASP also by distance predictions, achieved through state‐of‐the‐art machine learning methods which also progressed to work with slightly shallower alignments compared to CASP12. As we reach a new realm of tertiary structure prediction quality, new directions are proposed and explored for future CASPs: (i) dropping splitting into evaluation units, (ii) rethinking difficulty metrics probably in terms of contact and distance predictions, (iii) assessing also side chains for models of high backbone accuracy, and (iv) assessing residue‐wise and possibly residue‐residue quality estimates. This article is protected by copyright. All rights reserved.
Article
Motivation: In addition to substitution frequency data from protein sequence alignments, many state-of-the-art methods for contact prediction rely on additional sources of information, or features, of protein sequences in order to predict residue-residue contacts, such as solvent accessibility, predicted secondary structure, and scores from other contact prediction methods. It is unclear how much of this information is needed to achieve state-of-the-art results. Here, we show that using deep neural network models, simple alignment statistics contain sufficient information to achieve state-of-the-art precision. Our prediction method, DeepCov, uses fully convolutional neural networks operating on amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods such as sparse inverse covariance or pseudolikelihood estimation. Results: Comparisons against CCMpred and MetaPSICOV2 show that using pairwise covariance data calculated from raw alignments as input allows us to match or exceed the performance of both of these methods. Almost all of the achieved precision is obtained when considering relatively local windows (around 15 residues) around any member of a given residue pairing; larger window sizes have comparable performance. Assessment on a set of shallow sequence alignments (fewer than 160 effective sequences) indicates that the new method is substantially more precise than CCMpred and MetaPSICOV2 in this regime, suggesting that improved precision is attainable on smaller sequence families. Overall, the performance of DeepCov is competitive with the state of the art, and our results demonstrate that global models, which employ features from all parts of the input alignment when predicting individual contacts, are not strictly needed in order to attain precise contact predictions. Availability: DeepCov is freely available at https://github.com/psipred/DeepCov. Contact: d.t.jones@ucl.ac.uk.
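The pair frequency and covariance inputs mentioned above can be computed directly from an integer-coded alignment. The sketch below is a minimal, unweighted version: it assumes every sequence counts equally (no sequence reweighting or pseudocounts) and uses a hypothetical function name and a three-letter toy alphabet; it is not taken from the DeepCov code.

```python
import numpy as np

def pair_covariance(msa, n_states=21):
    """Covariance of residue types between every pair of alignment columns.

    msa: (n_seqs, length) integer array, each entry in [0, n_states).
    Returns cov[i, a, j, b] = f_ij(a, b) - f_i(a) * f_j(b).
    """
    msa = np.asarray(msa)
    onehot = np.eye(n_states)[msa]                  # (n_seqs, length, n_states)
    f_single = onehot.mean(axis=0)                  # column frequencies f_i(a)
    f_pair = np.einsum("nia,njb->iajb", onehot, onehot) / len(msa)
    return f_pair - np.einsum("ia,jb->iajb", f_single, f_single)

# Toy alignment: columns 0 and 1 covary perfectly, column 2 is conserved.
toy_msa = [[0, 1, 2],
           [1, 0, 2],
           [0, 1, 2],
           [1, 0, 2]]
cov = pair_covariance(toy_msa, n_states=3)
print(cov.shape)
print(cov[0, 0, 1, 1])      # covarying pair of columns
print(cov[0, :, 2, :])      # conserved column carries no covariance
```

A conserved column gives zero covariance with every other column, which is why covariation signals need some sequence diversity in the alignment.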