-
ABSTRACT: For many years it has been accepted that the sequence of a protein can specify its three-dimensional structure. However, there has been limited progress in explaining how the sequence dictates its fold, and no attempt to do this computationally without the use of specific structural data has ever succeeded for any protein larger than 100 residues. We describe a method that can predict complex folds of up to almost 200 residues using only basic principles that do not include any elements of sequence homology. The method does not simulate the folding chain but generates many thousands of models based on an idealized representation of structure. Each rough model is scored and the best are refined. On a set of five proteins, the correct fold scored well, and when tested on a set of larger proteins, the correct fold was ranked highest for some proteins of more than 150 residues, with others being close topological variants. All other methods that approach this level of success rely on the use of templates or fragments of known structures. Our method is unique in using a database of ideal models based on general packing rules that, in spirit, is closer to an ab initio approach.
Proteins Structure Function and Bioinformatics 04/2008; 70(4):1610-9. · 3.39 Impact Factor
-
ABSTRACT: Structural genomics initiatives aim to elucidate representative 3D structures for the majority of protein families over the next decade, but many obstacles must be overcome. The correct design of constructs is extremely important since many proteins will be too large or contain unstructured regions and will not be amenable to crystallization. It is therefore essential to identify regions in protein sequences that are likely to be suitable for structural study. Scooby-Domain is a fast and simple method to identify globular domains in protein sequences. Domains are compact units of protein structure and their correct delineation will aid structural elucidation through a divide-and-conquer approach. Scooby-Domain predictions are based on the observed lengths and hydrophobicities of domains from proteins with known tertiary structure. The prediction method employs an A*-search to identify sequence regions that form a globular structure and those that are unstructured. On a test set of 173 proteins with consensus CATH and SCOP domain definitions, Scooby-Domain has a sensitivity of 50% and an accuracy of 29%, which is better than current state-of-the-art methods. The method does not rely on homology searches and, therefore, can identify previously unknown domains.
Nucleic Acids Research 03/2008; 36(2):578-88. · 8.03 Impact Factor
-
ABSTRACT: Many intracellular proteins do not work on their own but rather in complex with small molecules, DNA, or other proteins. To gain a more fundamental understanding of protein interactions and their resulting functions, one requires a detailed structural model of relevant complexes. The first step in this challenge is to grow well-diffracting crystals. Three examples of protein complex crystallization will be discussed in detail below. In the first example, biophysical techniques such as fluorescence titration, isothermal titration calorimetry (ITC), and dynamic light scattering (DLS) are used to characterize the protein and assess the most suitable conditions for complex formation. The second example utilizes bioinformatic information and proteomic techniques to engineer constructs of the protein that are most favorable for crystallization. The final example uses NMR information for optimizing complex-forming conditions, which allowed the growth of better-diffracting complex crystals.
10/2007;
-
ABSTRACT: A modeling method is described that avoids the need to consider the domain structure of the template used for modeling, and automatically extracts compact fragments of structure that would be of a suitable size to build the model. This aids automation as the size or nature of the template structure can be ignored and does not have to be broken into domain (or multi-domain) units beforehand. The approach leads to the generation of a large number of models each based on slightly differing domain definitions and this variation was further increased by considering alternative secondary structure predictions. Each model, of which there may be thousands, takes the form of a complete alpha-carbon trace and some methods (including residue burial) were investigated for their power to discriminate good models from bad models using decoys. The method is also compared to an earlier retroviral capsid modeling problem for which the X-ray structure is now known. Some potential extensions of the approach to more distant modeling problems are discussed.
Proteins Structure Function and Bioinformatics 09/2006; 64(3):601-14. · 3.39 Impact Factor
-
ABSTRACT: Scooby-domain (sequence hydrophobicity predicts domains) is a fast and simple method to identify globular domains in protein sequences, based on the observed lengths and hydrophobicities of domains from proteins with known tertiary structure. The prediction method successfully identifies sequence regions that will form a globular structure and those that are likely to be unstructured. The method does not rely on homology searches and, therefore, can identify previously unknown domains for structural elucidation. Scooby-domain is available as a Java applet at http://ibivu.cs.vu.nl/programs/scoobywww. It may be used to visualize local properties within a protein sequence, such as average hydrophobicity, secondary structure propensity and domain boundaries, and it also serves as a method for fast domain assignment of large sequence sets.
Nucleic Acids Research 08/2005; 33(Web Server issue):W160-3. · 8.03 Impact Factor
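The "average hydrophobicity" profile mentioned in the Scooby-domain abstract can be illustrated with a short sliding-window sketch. This is not the Scooby-domain algorithm: the abstract does not say which hydrophobicity scale or window size it uses, so the Kyte-Doolittle hydropathy scale, the function name `window_hydrophobicity`, and the default window of 5 are all assumptions made for illustration.

```python
# Kyte-Doolittle hydropathy values (assumed scale; the abstract names none).
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_hydrophobicity(sequence, window=5):
    """Return the mean hydropathy of every full window along the sequence.

    Hypothetical helper: the result has len(sequence) - window + 1 values,
    or is empty when the sequence is shorter than the window.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper()]
    if window > len(values):
        return []
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


if __name__ == "__main__":
    # Arbitrary example sequence, not taken from any of the papers listed.
    profile = window_hydrophobicity("MKVLAAGIVDEKKRS", window=5)
    print([round(v, 2) for v in profile])
```

Runs of high values in such a profile suggest buried, hydrophobic stretches, while runs of low values suggest exposed or unstructured regions; how Scooby-domain combines this with domain lengths is described in the paper, not here.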
-
ABSTRACT: In this paper, we present a secondary structure prediction method, YASPIN, that, unlike the current state-of-the-art methods, utilizes a single neural network for predicting the secondary structure elements in a 7-state local structure scheme and then optimizes the output using a hidden Markov model, which provides more information for the prediction.
YASPIN was compared with the current top-performing secondary structure prediction methods, such as PHDpsi, PROFsec, SSPro2, JNET and PSIPRED. The overall prediction accuracy on the independent EVA5 sequence set is comparable with that of the top performers, according to the Q3, SOV and Matthews correlation accuracy measures. YASPIN shows the highest accuracy in terms of Q3 and SOV scores for strand prediction.
YASPIN is available on-line at the Centre for Integrative Bioinformatics website (http://ibivu.cs.vu.nl/programs/yaspinwww/) at the Vrije Universiteit in Amsterdam and will soon be mirrored on the Mathematical Biology website (http://www.mathbio.nimr.mrc.ac.uk) at the NIMR in London.
Contact: kxlin@nimr.mrc.ac.uk
Bioinformatics 02/2005; 21(2):152-9. · 5.47 Impact Factor
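The Q3 measure cited in the YASPIN abstract is the percentage of residues whose three-state class (helix, strand, coil) is predicted correctly. A minimal sketch follows, assuming both strings use the letters H, E and C and are already reduced to three states; the function name `q3` and the example strings are hypothetical and are not YASPIN output.

```python
def q3(predicted, observed):
    """Percentage of positions where the predicted three-state
    secondary structure (H, E or C) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


if __name__ == "__main__":
    # Made-up strings for illustration only.
    print(q3("HHHEECCC", "HHHEEECC"))
```

Q3 treats every residue independently, which is why segment-based measures such as SOV are usually reported alongside it.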
-
ABSTRACT: This paper introduces the novel method of contact-based protein sequence alignment, where structural information in the form of contact mutation probabilities is incorporated into an alignment routine using contact-mutation matrices (CAO: Contact Accepted mutatiOn). The contact-based alignment routine optimizes the score of matched contacts, which involves four (two per contact) instead of two residues per match in pairwise alignments. The first contact refers to a real side-chain contact in a template sequence with known structure, and the second contact is the equivalent putative contact of a homologous query sequence with unknown structure. An algorithm has been devised to perform a pairwise sequence alignment based on contact information. The contact scores were combined with PAM-type (Point Accepted Mutation) substitution scores after parameterization of gap penalties and score weights by means of a genetic algorithm. We show that owing to the structural information contained in the CAO matrices, significantly improved alignments of distantly related sequences can be obtained. This has allowed us to annotate eight putative Drosophila IGF sequences. Contact-based sequence alignment should therefore prove useful in comparative modelling and fold recognition.
Nucleic Acids Research 02/2004; 32(8):2464-73. · 8.03 Impact Factor
-
ABSTRACT: Point Accepted Mutation (PAM) is the Markov model of amino acid replacements in proteins introduced by Dayhoff and her co-workers (Dayhoff et al., 1978). The PAM matrices and other matrices based on the PAM model have been widely accepted as the standard scoring system of protein sequence similarity in protein sequence alignment tools. Here, we present Contact Accepted mutatiOn (CAO), a Markov model of protein residue contact mutations. The CAO model simulates the interchanging of structurally defined side-chain contacts, and introduces additional structural information into protein sequence alignments. Therefore, similarities between structurally conserved sequences can be detected even without apparent sequence similarity. CAO has been benchmarked on the HOMSTRAD database and a subset of the CATH database, by comparing sequence alignments with reference alignments derived from structural superposition. CAO yields scores that reflect coherently the structural quality of sequence alignments, which has implications particularly for homology modelling and threading techniques.
Computational Biology and Chemistry 06/2003; 27(2):93-102. · 1.55 Impact Factor
-
ABSTRACT: A novel knot found in the SET domain is examined in the light of five recent crystal structures and their descriptions in the literature. Using the algorithm of Taylor it was established that the backbone chain does not form a true knot. However, only two crosslinks corresponding to hydrogen-bonds were needed to form a knotted structure. Such loosely knotted structures formed by hydrogen-bonded crosslinks were assessed as lying between covalent crosslinks (such as disulphide bonds) and threaded-loops which are formed by close (unbonded) contacts between different parts of the chain. The term pseudo-knot was introduced (from the RNA field) to distinguish hydrogen-bonded 'knots'.
Computational Biology and Chemistry 03/2003; 27(1):11-5. · 1.55 Impact Factor
-
Nature 02/2003; 421(6918):25. · 36.28 Impact Factor
-
ABSTRACT: Fold recognition programs align a probe protein sequence onto protein three-dimensional (3D) structure templates. The alignment between the probe sequence and the most suitable template can be used to predict the 3D structure and often biological function of the probe. Here we present a new threading scoring function of protein sequence-structure compatibility. An artificial neural network model is trained to predict compatibility of amino acid side-chains with structural environments. Log-odds scores of predicted probabilities from this model can then be used to construct protein sequence-structure alignments.
Our model is tested on discrimination of native and decoy protein 3D structures. With a residue-level structural description, its performance is comparable to that of pseudo-energy functions with atom-level structural descriptions, and better than that of the two functions with residue-level structural descriptions.
The C++ source code of our neural network model is available at http://mathbio.nimr.mrc.ac.uk/~kxlin.
Bioinformatics 11/2002; 18(10):1350-7. · 5.47 Impact Factor
-
ABSTRACT: Bioinformatic software has used various numerical encoding schemes to describe amino acid sequences. Orthogonal encoding, employing 20 numbers to describe the amino acid type of one protein residue, is often used with artificial neural network (ANN) models. However, this can increase the model complexity, thus leading to difficulty in implementation and poor performance. Here, we use ANNs to derive encoding schemes for the amino acid types from protein three-dimensional structure alignments. Each of the 20 amino acid types is characterized with a few real numbers. Our schemes are tested on the simulation of amino acid substitution matrices. These simplified schemes outperform the orthogonal encoding on small data sets. Using one of these encoding schemes, we generate a colouring scheme for the amino acids in which comparable amino acids are in similar colours. We expect it to be useful for visual inspection and manual editing of protein multiple sequence alignments.
Journal of Theoretical Biology 07/2002; 216(3):361-65. · 2.21 Impact Factor
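The orthogonal encoding that the abstract above uses as its baseline represents each residue as 20 numbers, one per amino acid type, with a single 1 marking the type present. A minimal sketch of that baseline follows; the alphabet ordering and the function name `orthogonal_encode` are assumptions, and the reduced ANN-derived schemes from the paper are not reproduced here.

```python
# The 20 standard amino acids in alphabetical one-letter order
# (the ordering is an arbitrary choice for this sketch).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def orthogonal_encode(sequence):
    """Encode each residue as a 20-element vector holding a single 1."""
    encoded = []
    for aa in sequence.upper():
        row = [0] * len(AMINO_ACIDS)
        row[INDEX[aa]] = 1  # raises KeyError for non-standard residues
        encoded.append(row)
    return encoded


if __name__ == "__main__":
    for residue, row in zip("ACW", orthogonal_encode("ACW")):
        print(residue, row)
```

A window of n residues therefore needs 20n inputs, which is the growth in model complexity the abstract refers to; a scheme using a few real numbers per amino acid type reduces that input size proportionally.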
-
Journal of Computational Biology. 01/2001; 8:471-481.