ABSTRACT: Methods for 3-class secondary-structure prediction are thought to be reaching the highest achievable accuracy. Their accuracy on the beta-sheet residue class is considerably lower than on the other two classes. We analysed the relevance of 315 individual input attributes for a predictor with the usual framework of using sequence-profile based data with an input window of fixed size. We propose two alternative knowledge representations with significantly smaller sets of input attributes. We also investigated the possibility of exploiting the prediction of connected pairs of beta-sheet residues and the prediction of residue contact maps for the improvement of accuracy of secondary-structure prediction.
International Journal of Data Mining and Bioinformatics 02/2007; 1(3):286-313. · 0.39 Impact Factor
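The fixed-window, profile-based input described in this abstract can be illustrated with a minimal sketch. The abstract gives the attribute count (315) but not the window layout, so the layout here is an assumption: a 15-residue window in which each position contributes 20 profile values plus one off-chain flag happens to give 15 × 21 = 315. The function name `window_features` and the parameter `half_width` are hypothetical.

```python
from typing import List


def window_features(profile: List[List[float]], center: int,
                    half_width: int = 7) -> List[float]:
    """Flatten a fixed-size window of profile rows centred on `center`.

    Assumed layout (not taken from the paper): every window position
    contributes its profile row plus one flag that is 1.0 when the
    position falls outside the chain and 0.0 otherwise.
    """
    n_cols = len(profile[0])
    features: List[float] = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(profile):
            features.extend(profile[pos])
            features.append(0.0)           # position is inside the chain
        else:
            features.extend([0.0] * n_cols)
            features.append(1.0)           # position is off the chain end
    return features
```

With 20 profile columns and the default `half_width` of 7, every residue yields a vector of 315 attributes, including residues near the chain termini.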
ABSTRACT: The number of experimentally verified, intrinsically disordered (ID) proteins is rapidly rising. Research is often focused on a structural characterization of a given protein, looking for several key features. However, ID proteins with their dynamic structures that interconvert on a number of time-scales are difficult targets for the majority of traditional biophysical and biochemical techniques. Structural and functional analyses of these proteins can be significantly aided by disorder predictions. The current advances in the prediction of ID proteins and the use of protein disorder prediction in the fields of molecular biology and bioinformatics are briefly overviewed herein. A method is provided to utilize intrinsic disorder knowledge to gain structural and functional information related to individual proteins, protein groups, families, classes, and even entire proteomes.
ABSTRACT: The Database of Protein Disorder (DisProt) links structure and function information for intrinsically disordered proteins (IDPs). Intrinsically disordered proteins do not form a fixed three-dimensional structure under physiological conditions, either in their entireties or in segments or regions. We define IDP as a protein that contains at least one experimentally determined disordered region. Although lacking fixed structure, IDPs and regions carry out important biological functions, being typically involved in regulation, signaling and control. Such functions can involve high-specificity low-affinity interactions, the multiple binding of one protein to many partners and the multiple binding of many proteins to one partner. These three features are all enabled and enhanced by protein intrinsic disorder. One of the major hindrances in the study of IDPs has been the lack of organized information. DisProt was developed to enable IDP research by collecting and organizing knowledge regarding the experimental characterization and the functional associations of IDPs. In addition to being a unique source of biological information, DisProt opens doors for a plethora of bioinformatics studies. DisProt is openly available at http://www.disprot.org.
Nucleic Acids Research 02/2007; 35(Database issue):D786-93. · 8.81 Impact Factor
ABSTRACT: Composition Profiler is a web-based tool for semi-automatic discovery of enrichment or depletion of amino acids, either individually or grouped by their physico-chemical or structural properties.
The program takes two samples of amino acids as input: a query sample and a reference sample. The latter provides a suitable background amino acid distribution, and should be chosen according to the nature of the query sample, for example, a standard protein database (e.g. SwissProt, PDB), a representative sample of proteins from the organism under study, or a group of proteins with a contrasting functional annotation. The results of the analysis of amino acid composition differences are summarized in textual and graphical form.
As an exploratory data mining tool, our software can be used to guide feature selection for protein function or structure predictors. For classes of proteins with significant differences in frequencies of amino acids having particular physico-chemical (e.g. hydrophobicity or charge) or structural (e.g. alpha helix propensity) properties, Composition Profiler can be used as a rough, light-weight visual classifier.
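The query-versus-reference comparison described above can be sketched as a per-residue relative difference in composition. This is a minimal illustration, not the tool's actual implementation: the function names are hypothetical, the relative-difference formula is an assumption about how enrichment might be expressed, and no significance testing is performed.

```python
from collections import Counter
from typing import Dict, Iterable

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(sequences: Iterable[str]) -> Dict[str, float]:
    """Fraction of each standard amino acid across a sample of sequences."""
    counts = Counter(aa for seq in sequences for aa in seq.upper()
                     if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample contains no standard amino acids")
    return {aa: counts[aa] / total for aa in AMINO_ACIDS}


def relative_difference(query: Iterable[str],
                        reference: Iterable[str]) -> Dict[str, float]:
    """(query - reference) / reference for each residue.

    Positive values indicate enrichment in the query sample, negative
    values depletion. Residues absent from the reference are skipped
    to avoid division by zero.
    """
    q = composition(query)
    r = composition(reference)
    return {aa: (q[aa] - r[aa]) / r[aa] for aa in AMINO_ACIDS if r[aa] > 0}
```

Grouping residues by property (for example, summing the fractions of charged residues before taking the difference) would follow the same pattern.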
ABSTRACT: About 10 years ago we published our first predictor of intrinsically disordered protein residues in another IEEE journal, the Proceedings of the IEEE International Conference on Neural Networks. Others call such proteins "natively unfolded" and "intrinsically unstructured." Since then, we and others have substantially improved the prediction of intrinsically disordered residues. The prediction of protein intrinsic disorder is similar to the prediction of secondary structure in terms of methodology, but, at the structural level, secondary structure (especially random coil) and intrinsic disorder differ completely in their dynamic motion. First, we will briefly describe the prediction of protein disorder, show the progress from ~70% to ~85% per-residue prediction accuracy, and show that intrinsically disordered proteins are common over the three domains of life, but are especially common among the eukaryotes. Next we will discuss our methods for deducing functions that are associated with disordered rather than structured proteins. In brief, structured proteins have advantages for catalysis while disordered proteins and regions have advantages for the reversible, weak binding often observed in signaling, control, and regulation. After that we will discuss how disorder facilitates binding diversity in protein-protein interaction networks, both for single disordered regions binding to many partners and for many disordered regions with different sequences binding to a common site on the surface of one structured protein. We will then present data indicating that alternative splicing is more prevalent in regions of RNA that code for disorder than in those that code for structure, thus providing a means for evolving tissue-specific signaling networks. Finally, we will present a novel approach to drug discovery based on disordered protein.
Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2007, October 14-17, 2007, Harvard Medical School, Boston, MA, USA; 01/2007
ABSTRACT: Just over 10 years ago, in June 1997, in the Proceedings of the IEEE International Conference on Neural Networks, we published our first predictor of intrinsically disordered protein. Since then, we have substantially improved our predictors, and more than 20 other laboratory groups have joined in efforts to improve the prediction of protein disorder. At the algorithmic level, prediction of protein intrinsic disorder is similar to the prediction of secondary structure, but, at the structural level, secondary structure and intrinsic disorder are entirely different. The secondary structure class called random coil or irregular differs from intrinsic disorder due to very different dynamic properties, with the secondary structure class being much less mobile than the region of disorder. At the biological level, unlike the prediction of secondary structure, the prediction of intrinsic disorder has been revolutionary. That is, for many years, experimentalists have provided evidence that some proteins lack fixed structure or are disordered (or unfolded) under physiological conditions. Experimentalists are further showing that, for some proteins, functions depend on the unstructured rather than the structured state. However, these examples have been mostly ignored. To our knowledge, not one disordered protein or disorder-associated function is discussed in any biochemistry textbook, even though such examples began to be discovered more than 50 years ago. Disorder prediction has been important for showing that the few experimentally characterized examples represent a very large cohort that is found across all three domains of life. We now know that many significant biological functions depend directly on, or are importantly associated with, the unfolded or partially folded state.
In this paper, we will briefly review some of the key discoveries that have occurred in the last decade, and, furthermore, will make a few highly speculative projections.
Proceedings of the 7th IEEE International Conference on Bioinformatics and Bioengineering, BIBE 2007, October 14-17, 2007, Harvard Medical School, Boston, MA, USA; 01/2007
ABSTRACT: Protein interaction networks display approximate scale-free topology, in which hub proteins that interact with a large number of other proteins determine the overall organization of the network. In this study, we aim to determine whether hubs are distinguishable from other networked proteins by specific sequence features. Proteins of different connectednesses were compared in the interaction networks of Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans, and Homo sapiens with respect to the distribution of predicted structural disorder, sequence repeats, low complexity regions, and chain length. Highly connected proteins ("hub proteins") contained significantly more of, and a greater proportion of, these sequence features and tended to be longer overall as compared to less connected proteins. These sequence features provide two different functional means for realizing multiple interactions: (1) extended interaction surface and (2) flexibility and adaptability, providing a mechanism for the same region to bind distinct partners. Our view contradicts the prevailing view that scaling in protein interactomes arose from gene duplication and preferential attachment of equivalent proteins. We propose an alternative evolutionary network specialization process, in which certain components of the protein interactome improved their fitness for binding by becoming longer or accruing regions of disorder and/or internal repeats and have therefore become specialized in network organization.
Journal of Proteome Research 12/2006; 5(11):2985-95. · 5.06 Impact Factor
ABSTRACT: Several proteomic studies in the last decade revealed that many proteins are either completely disordered or possess long structurally flexible regions. Many such regions were shown to be of functional importance, often allowing a protein to interact with a large number of diverse partners. Parallel to these findings, during the last five years structural bioinformatics has produced an explosion of results regarding protein-protein interactions and their importance for cell signaling. We studied the occurrence of relatively short (10-70 residues), loosely structured protein regions within longer, largely disordered sequences that were characterized as bound to larger proteins. We call these regions molecular recognition features (MoRFs, also known as molecular recognition elements, MoREs). Interestingly, upon binding to their partner(s), MoRFs undergo disorder-to-order transitions. Thus, in our interpretation, MoRFs represent a class of disordered region that exhibits molecular recognition and binding functions. This work extends previous research showing the importance of flexibility and disorder for molecular recognition. We describe the development of a database of MoRFs derived from the RCSB Protein Data Bank and present preliminary results of bioinformatics analyses of these sequences. Based on the structure adopted upon binding, at least three basic types of MoRFs are found: alpha-MoRFs, beta-MoRFs, and iota-MoRFs, which form alpha-helices, beta-strands, and irregular secondary structure when bound, respectively. Our data suggest that functionally significant residual structure can exist in MoRF regions prior to the actual binding event. The contribution of intrinsic protein disorder to the nature and function of MoRFs has also been addressed. The results of this study will advance the understanding of protein-protein interactions and help towards the future development of useful protein-protein binding site predictors.
Journal of Molecular Biology 11/2006; 362(5):1043-59. · 3.91 Impact Factor
ABSTRACT: Despite substantial increases in research funding by the pharmaceutical industry, drug discovery rates seem to have reached a plateau or perhaps are even declining, suggesting the need for new strategies. Protein-protein interactions have long been thought to provide interesting drug discovery targets, but the development of small molecules that modulate such interactions has so far achieved a low success rate. In contrast to this historic trend, a few recent successes raise hopes for routinely identifying druggable protein-protein interactions. In this Opinion article, we point out the importance of coupled binding and folding for protein-protein signalling interactions generally, and from this and associated observations, we develop a new strategy for identifying protein-protein interactions that would be particularly promising targets for modulation by small molecules. This novel strategy, based on intrinsically disordered protein, has the potential to increase significantly the discovery rate for new molecule entities.
Trends in Biotechnology 11/2006; 24(10):435-42. · 9.66 Impact Factor
ABSTRACT: Evidence that many protein regions and even entire proteins lacking stable tertiary and/or secondary structure in solution (i.e., intrinsically disordered proteins) might be involved in protein-protein interactions, regulation, recognition, and signal transduction is rapidly accumulating. These signaling proteins play a crucial role in the development of several pathological conditions, including cancer. To test a hypothesis that intrinsic disorder is also abundant in cardiovascular disease (CVD), a data set of 487 CVD-related proteins was extracted from SWISS-PROT. CVD-related proteins are depleted in major order-promoting residues (Trp, Phe, Tyr, Ile, and Val) and enriched in some disorder-promoting residues (Arg, Gln, Ser, Pro, and Glu). The application of a neural network predictor of natural disordered regions (PONDR VL-XT) together with cumulative distribution function (CDF) analysis, charge-hydropathy plot (CH plot) analysis, and alpha-helical molecular recognition feature (alpha-MoRF) indicator revealed that CVD-related proteins are enriched in intrinsic disorder. In fact, the percentage of proteins with 30 or more consecutive residues predicted by PONDR VL-XT to be disordered was 57 +/- 4% for CVD-associated proteins. This value is close to that described earlier for signaling proteins (66 +/- 6%) and is significantly larger than the content of intrinsic disorder in eukaryotic proteins from SWISS-PROT (47 +/- 4%) and in nonhomologous protein segments with a well-defined three-dimensional structure (13 +/- 4%). Furthermore, CDF and CH-plot analyses revealed that 120 and 36 CVD-related proteins, respectively, are wholly disordered. This high level of intrinsic disorder could be important for the function of CVD-related proteins and for the control and regulation of processes associated with cardiovascular disease. In agreement with this hypothesis, 198 alpha-MoRFs were predicted in 101 proteins from the CVD data set.
A comparison of disorder predictions with the experimental structural and functional data for a subset of the CVD-associated proteins indicated good agreement between predictions and observations. Thus, our data suggest that intrinsically disordered proteins might play key roles in cardiovascular disease.
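The statistic quoted above, the share of proteins with 30 or more consecutive residues predicted to be disordered, can be computed from per-residue scores as in the following minimal sketch. It assumes the predictor output is already available as a list of scores per protein and that a score of 0.5 or higher counts as disordered; the cutoff, the function names, and the input format are assumptions, and no predictor is invoked here.

```python
from typing import Dict, List


def longest_disordered_run(scores: List[float], threshold: float = 0.5) -> int:
    """Length of the longest stretch of consecutive scores >= threshold."""
    best = current = 0
    for score in scores:
        current = current + 1 if score >= threshold else 0
        best = max(best, current)
    return best


def fraction_with_long_disorder(predictions: Dict[str, List[float]],
                                min_length: int = 30,
                                threshold: float = 0.5) -> float:
    """Fraction of proteins whose longest disordered run is >= min_length."""
    if not predictions:
        raise ValueError("no proteins supplied")
    hits = sum(1 for scores in predictions.values()
               if longest_disordered_run(scores, threshold) >= min_length)
    return hits / len(predictions)
```

Multiplying the returned fraction by 100 gives a percentage comparable in form to the figures reported for the different data sets.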
ABSTRACT: It is now recognized that many functional proteins or their long segments are devoid of stable secondary and/or tertiary structure and exist instead as very dynamic ensembles of conformations. They are known by different names including natively unfolded, intrinsically disordered, intrinsically unstructured, rheomorphic, pliable, and different combinations thereof. Many important functions and activities have been associated with these intrinsically disordered proteins (IDPs), including molecular recognition, signaling, and regulation. It is also believed that disorder of these proteins allows function to be readily modified through phosphorylation, acetylation, ubiquitination, hydroxylation, and proteolysis. Bioinformatics analysis revealed that IDPs comprise a large fraction of different proteomes. Furthermore, it is established that intrinsic disorder is relatively abundant among cancer-related and other disease-related proteins and that IDPs play a number of key roles in oncogenesis. There are more than 100 different types of human papillomaviruses (HPVs), which are the causative agents of benign papillomas/warts, and cofactors in the development of carcinomas of the genital tract, head and neck, and epidermis. With respect to their association with cancer, HPVs are grouped into two classes, known as low-risk (e.g., HPV-6 and HPV-11) and high-risk (e.g., HPV-16 and HPV-18) types. The entire proteome of HPV includes six nonstructural proteins [E1, E2, E4, E5, E6, and E7 (the latter two are known to function as oncoproteins in the high-risk HPVs)] and two structural proteins (L1 and L2). To understand whether intrinsic disorder plays a role in the oncogenic potential of different HPV types, we have performed a detailed bioinformatics analysis of proteomes of high-risk and low-risk HPVs with the major focus on E6 and E7 oncoproteins.
The results of this analysis are consistent with the conclusion that high-risk HPVs are characterized by the increased amount of intrinsic disorder in transforming proteins E6 and E7.
Journal of Proteome Research 09/2006; 5(8):1829-42. · 5.06 Impact Factor
ABSTRACT: Aggregation of β-Lactoglobulin (β-Lg) solutions, with and without sodium polypectate (SPP), was investigated at pH 6.5 and 3.5 by turbidity measurements and gel permeation chromatography during heating at 1°C/min. The ratio of β-Lg:SPP was maintained at 10:1. At pH 6.5, the transition temperature of β-Lg aggregation decreased linearly with the logarithm of β-Lg concentration. Irrespective of β-Lg concentration, SPP did not affect the rate of β-Lg aggregation during heating at pH 6.5. However, SPP influenced the formation of high-molecular-weight (HMW) β-Lg aggregates during heating at pH 6.5, and this formation was related to bulk macromolecular concentration. No thermal aggregation transitions were detected for β-Lg solutions at pH 3.5. SPP interacted with β-Lg at pH 3.5 to form a complex that precipitated on heating.
ABSTRACT: β-lactoglobulin (β-LG) in the molten globule state induced by high hydrostatic pressure (HHP) at 500 MPa and 50 °C for 32 min exhibited a significant decrease in affinity for retinol and a significant increase in affinity for cis-parinaric acid (CPA) and 1-anilino-naphthalene-8-sulfonate (ANS) compared to native β-LG. The number of β-LG binding sites for retinol and CPA significantly decreased after HHP treatment. The HHP-induced molten globule state of β-LG exhibited less affinity for palmitic acid, capsaicin, or carvacrol ligands than native β-LG, and no detectable specific binding for α-ionone, β-ionone, cinnamaldehyde or vanillin flavors. HHP treatment resulted in changes in the hydrophobic calyx and surface hydrophobic sites of β-LG.
ABSTRACT: Intrinsic disorder (ID) is highly abundant in eukaryotes, which reflects the greater need for disorder-associated signaling and transcriptional regulation in nucleated cells. Although several well-characterized examples of intrinsically disordered proteins in transcriptional regulation have been reported, no systematic analysis has been reported so far. To test for the general prevalence of intrinsic disorder in transcriptional regulation, we used the predictor of natural disorder regions (PONDR) to analyze the abundance of intrinsic disorder in three transcription factor datasets and two control sets. This analysis revealed that from 82.63 to 94.13% of transcription factors possess extended regions of intrinsic disorder, relative to 54.51 and 18.64% of the proteins in two control datasets, which indicates the significant prevalence of intrinsic disorder in transcription factors. This propensity of transcription factors to intrinsic disorder was confirmed by cumulative distribution function analysis and charge-hydropathy plots. The amino acid composition analysis showed that all three transcription factor datasets were substantially depleted in order-promoting residues and significantly enriched in disorder-promoting residues. Our analysis of the distribution of disorder within the transcription factor datasets revealed that (a) the AT-hooks and basic regions of transcription factor DNA-binding domains are highly disordered; (b) the degree of disorder in transcription factor activation regions is much higher than that in DNA-binding domains; (c) the degree of disorder is significantly higher in eukaryotic transcription factors than in prokaryotic transcription factors; and (d) the level of alpha-MoRF (molecular recognition feature) prediction is much higher in transcription factors.
Overall, our data reflected the fact that eukaryotes with well-developed gene transcription machinery require transcription factor flexibility to be more efficient.
ABSTRACT: Calmodulin (CaM) signaling involves important, widespread eukaryotic protein-protein interactions. The solved structures of CaM associated with several of its binding targets, the distinctive binding mechanism of CaM, and the significant trypsin sensitivity of the binding targets combine to indicate that the process of association likely involves coupled binding and folding for both CaM and its binding targets. Here, we use bioinformatics approaches to test the hypothesis that CaM-binding targets are intrinsically disordered. We developed a predictor of CaM-binding regions and estimated its performance. Per-residue accuracy of this predictor reached 81%, which, in combination with a high recall/precision balance at the binding region level, suggests high predictability of CaM-binding partners. An analysis of putative CaM-binding proteins in yeast and human strongly indicates that their molecular functions are related to those of intrinsically disordered proteins. These findings add to the growing list of examples in which intrinsically disordered protein regions are indicated to provide the basis for cell signaling and regulation.
Proteins Structure Function and Bioinformatics 06/2006; 63(2):398-410. · 3.34 Impact Factor
ABSTRACT: Alternative splicing of pre-mRNA generates two or more protein isoforms from a single gene, thereby contributing to protein diversity. Despite intensive efforts, an understanding of the protein structure-function implications of alternative splicing is still lacking. Intrinsic disorder, which is a lack of equilibrium 3D structure under physiological conditions, may provide this understanding. Intrinsic disorder is a common phenomenon, particularly in multicellular eukaryotes, and is responsible for important protein functions including regulation and signaling. We hypothesize that polypeptide segments affected by alternative splicing are most often intrinsically disordered such that alternative splicing enables functional and regulatory diversity while avoiding structural complications. We analyzed a set of 46 differentially spliced genes encoding experimentally characterized human proteins containing both structured and intrinsically disordered amino acid segments. We show that 81% of 75 alternatively spliced fragments in these proteins were associated with fully (57%) or partially (24%) disordered protein regions. Regions affected by alternative splicing were significantly biased toward encoding disordered residues, with a vanishingly small P value. A larger data set composed of 558 SwissProt proteins with known isoforms produced by 1,266 alternatively spliced fragments was characterized by applying the PONDR VSL1 disorder predictor. Results from prediction data are consistent with those obtained from experimental data, further supporting the proposed hypothesis. Associating alternative splicing with protein disorder enables the time- and tissue-specific modulation of protein function needed for cell differentiation and the evolution of multicellular organisms.
Proceedings of the National Academy of Sciences 06/2006; 103(22):8390-5. · 9.81 Impact Factor
ABSTRACT: Regions of conserved disorder prediction (CDP) were found in protein domains from all available InterPro member databases, although with varying frequency. These CDP regions were found in proteins from all kingdoms of life, including viruses. However, eukaryotes had 1 order of magnitude more proteins containing long disordered regions than did archaea and bacteria. Sequence conservation in CDP regions varied, but was on average slightly lower than in regions of conserved order. In some cases, disordered regions evolve faster than ordered regions, in others they evolve slower, and in the rest they evolve at roughly the same rate. A variety of functions were found to be associated with domains containing conserved disorder. The most common were DNA/RNA binding, and protein binding. Many ribosomal proteins also were found to contain conserved disordered regions. Other functions identified included membrane translocation and amino acid storage for germination. Due to limitations of current knowledge as well as the methodology used for this work, it was not determined whether these functions were directly associated with the predicted disordered region. However, the functions associated with conserved disorder in this work are in agreement with the functions found in other studies to correlate to disordered regions. We have established that intrinsic disorder may be more common in bacterial and archaeal proteins than previously thought, but this disorder is likely to be used for different purposes than in eukaryotic proteins, as well as occurring in shorter stretches of protein. Regions of predicted disorder were found to be conserved within a large number of protein families and domains. Although many think of such conserved domains as being ordered, in fact a significant number of them contain regions of disorder that are likely to be crucial to their functions.
Journal of Proteome Research 05/2006; 5(4):888-98. · 5.06 Impact Factor
ABSTRACT: Many protein regions have been shown to be intrinsically disordered, lacking unique structure under physiological conditions. These intrinsically disordered regions are not only very common in proteomes, but also crucial to the function of many proteins, especially those involved in signaling, recognition, and regulation. The goal of this work was to identify the prevalence, characteristics, and functions of conserved disordered regions within protein domains and families. A database was created to store the amino acid sequences of nearly one million proteins and their domain matches from the InterPro database, a resource integrating eight different protein family and domain databases. Disorder prediction was performed on these protein sequences. Regions of sequence corresponding to domains were aligned using a multiple sequence alignment tool. From this initial information, regions of conserved predicted disorder were found within the domains. The methodology for this search consisted of finding regions of consecutive positions in the multiple sequence alignments in which 90% or more of the sequences were predicted to be disordered. This procedure was constrained to find such regions of conserved disorder prediction that were at least 20 amino acids in length. The results of this work included 3,653 regions of conserved disorder prediction, found within 2,898 distinct InterPro entries. Most regions of conserved predicted disorder detected were short, with less than 10% of those found exceeding 30 residues in length.
Journal of Proteome Research 05/2006; 5(4):879-87. · 5.06 Impact Factor
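The search described in the abstract above (runs of consecutive alignment columns in which 90% or more of the sequences are predicted disordered, at least 20 columns long) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: disorder calls are assumed to be already mapped onto alignment columns as `True`, `False`, or `None` for a gap, gaps are counted as not disordered, and the function name and the half-open `(start, end)` output format are hypothetical.

```python
from typing import List, Optional, Tuple


def conserved_disorder_regions(calls: List[List[Optional[bool]]],
                               min_fraction: float = 0.9,
                               min_length: int = 20) -> List[Tuple[int, int]]:
    """Return half-open (start, end) column ranges of conserved disorder.

    `calls[i][j]` is the disorder call for sequence i at alignment
    column j. A column is conserved when the share of sequences called
    disordered is at least `min_fraction`.
    """
    if not calls:
        return []
    n_seqs, n_cols = len(calls), len(calls[0])
    if any(len(row) != n_cols for row in calls):
        raise ValueError("all aligned rows must have the same length")

    regions: List[Tuple[int, int]] = []
    start: Optional[int] = None
    # One extra iteration past the last column closes a trailing run.
    for col in range(n_cols + 1):
        conserved = col < n_cols and (
            sum(1 for row in calls if row[col] is True) / n_seqs >= min_fraction
        )
        if conserved and start is None:
            start = col
        elif not conserved and start is not None:
            if col - start >= min_length:
                regions.append((start, col))
            start = None
    return regions
```

Counting gaps against the 90% requirement is the stricter of two plausible readings; dividing by the number of non-gap sequences in each column would be the alternative.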