Chapter

Bioinformatics for Functional and Structural Genomics at the Protein Design Group CNB-CSIC

Authors:
  • Estación Experimental de Aula Dei - Spanish National Research Council (CSIC)

Abstract and Figures

The Protein Design Group began its activity in 1994, when Alfonso Valencia joined the National Centre for Biotechnology (CNB-CSIC) in Madrid. At that time the orientation of the group was largely a continuation of the work carried out from 1988 to 1994 in Chris Sander's group at the EMBL in Heidelberg, which, not surprisingly, was also called the Protein Design Group. Since then the group has adopted new approaches to deal with the avalanche of genomic and structural information that was just starting in 1994. The application of literature mining to the analysis of expression arrays is a good example of an approach that could not have been predicted a few years ago. The growing importance of bioinformatics over the last few years has driven us toward the development of professional software closer to the needs of the community, an aspect that was not so clearly perceived when bioinformatics was still emerging. What remains of the spirit of Chris Sander's group is the interest in real-world biological problems and the continuous effort to collaborate with molecular and structural biologists. In this article we summarise our main lines of work in structural and functional genomics, describing the concepts behind the applications and methods and giving pointers to the servers where our programs and results are available.
Analysis of Effector Recognition of Ras and Ral GTP-binding Proteins. A fragment of the multiple sequence alignment of the Ras family used for the analysis is shown, together with the analysis of tree-determinant residues with SequenceSpace. SequenceSpace implements a principal component analysis, of which only the projection onto the first vectors is shown. SequenceSpace systematically analyses those positions that are conserved within the different subfamilies but differ between them. The best tree-determinant positions are those located at the extremes of the axes of the star-like topology and at the corresponding intersections. A selected set of tree-determinants was mapped onto the structure of the Rap1A-Raf_RBD complex (1GUA). The three-dimensional structure shows that the selected tree-determinants are mainly found at the interface between Rap1A (a Ras-related protein) and Raf_RBD (the Ras Binding Domain of the Raf protein). Positions 36 and 37 were detected as especially important in the SequenceSpace analysis since they appear as conserved in all the sub-families (see the vertices of the corresponding two-dimensional representation of SequenceSpace). The experimental evaluation of different mutants showed that these two positions determine the functional specificity of the binding of the different Ras and Ral effectors. Their interchange is able to convert the specificity of one into that of the other. In the figure, the experimental approach based on yeast two-hybrid analysis shows how the mutants in Ral behave as the wild-type Ras, and vice versa (Bauer et al., 1999).
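The idea of a tree-determinant position (conserved inside each subfamily, different between subfamilies) can be illustrated with a minimal sketch. This is a direct column scan over a toy alignment, not the principal component analysis that SequenceSpace implements; the function name `tree_determinants`, the subfamily labels and the toy sequences are all hypothetical.

```python
def tree_determinants(subfamilies):
    """Return 0-based alignment columns that are fully conserved within
    every subfamily but do not carry the same residue in all subfamilies.

    `subfamilies` maps a subfamily label to a list of aligned sequences
    of equal length (an assumed, simplified input format).
    """
    length = len(next(iter(subfamilies.values()))[0])
    hits = []
    for col in range(length):
        residues = []
        for seqs in subfamilies.values():
            column = {s[col] for s in seqs}
            if len(column) != 1:
                break  # not conserved inside this subfamily
            residues.append(column.pop())
        else:
            # Conserved in every subfamily: keep it only if the
            # subfamilies disagree on the residue.
            if len(set(residues)) > 1:
                hits.append(col)
    return hits


# Hypothetical toy alignment with two subfamilies.
toy = {
    "fam_a": ["MKDE", "MRDE"],
    "fam_b": ["MKSE", "MRSE"],
}
candidate_columns = tree_determinants(toy)
```

A strict scan like this tolerates no variation inside a subfamily; a real analysis would presumably need a softer conservation criterion.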
Article
Functional annotation of proteins encoded in newly sequenced genomes can be expected to meet two conflicting objectives: (i) provide as much information as possible, and (ii) avoid erroneous functional assignments and over-predictions. The continuing exponential growth of the number of sequenced genomes makes the quality of sequence annotation a critical factor in the efforts to utilize this new information. When dubious functional assignments are used as a basis for subsequent predictions, they tend to proliferate, leading to "database explosion". It is therefore important to identify the common factors that hamper functional annotation. As a first step towards that goal, we have compared the annotations of the Mycoplasma genitalium and Methanococcus jannaschii genomes produced in several independent studies. The most common causes of questionable predictions appear to be: i) non-critical use of annotations from existing database entries; ii) taking into account only the annotation of the best database hit; iii) insufficient masking of low complexity regions (e.g. non-globular domains) in protein sequences, resulting in spurious database hits obscuring relevant ones; iv) ignoring multi-domain organization of the query proteins and/or the database hits; v) non-critical functional inferences on the basis of the functions of neighboring genes in an operon; vi) non-orthologous gene displacement, i.e. involvement of structurally unrelated proteins in the same function. These observations suggest that case by case validation of functional annotation by expert biologists remains crucial for productive genome analysis.
Article
The maintenance of protein function and structure constrains the evolution of amino acid sequences. This fact can be exploited to interpret correlated mutations observed in a sequence family as an indication of probable physical contact in three dimensions. Here we present a simple and general method to analyze correlations in mutational behavior between different positions in a multiple sequence alignment. We then use these correlations to predict contact maps for each of 11 protein families and compare the result with the contacts determined by crystallography. For the most strongly correlated residue pairs predicted to be in contact, the prediction accuracy ranges from 37 to 68% and the improvement ratio relative to a random prediction from 1.4 to 5.1. Predicted contact maps can be used as input for the calculation of protein tertiary structure, either from sequence information alone or in combination with experimental information. © 1994 Wiley-Liss, Inc.
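The correlation analysis described in this abstract can be sketched roughly as follows. The sketch assumes a simple identity score (1.0 if two sequences carry the same residue at a column, 0.0 otherwise) as a stand-in for a proper residue similarity matrix, and ranks column pairs by the Pearson correlation of their per-sequence-pair similarity profiles. The function names and the toy alignment are hypothetical and are not taken from the paper.

```python
from itertools import combinations
from math import sqrt


def position_similarities(alignment, col):
    """Identity score for every pair of sequences at one alignment column."""
    return [1.0 if a[col] == b[col] else 0.0
            for a, b in combinations(alignment, 2)]


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((v - mx) ** 2 for v in x))
    sy = sqrt(sum((v - my) ** 2 for v in y))
    if sx == 0 or sy == 0:
        return 0.0
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (sx * sy)


def correlated_pairs(alignment):
    """Rank all column pairs by the correlation of their similarity profiles."""
    length = len(alignment[0])
    sims = [position_similarities(alignment, c) for c in range(length)]
    scores = {(i, j): pearson(sims[i], sims[j])
              for i, j in combinations(range(length), 2)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical toy alignment: columns 0 and 2 change together,
# and so do columns 1 and 3.
toy = ["AKLE", "AKLE", "ARLD", "ARLD", "GKVE", "GRVD"]
ranking = correlated_pairs(toy)
```

The top of the ranking gives the candidate contact pairs. In practice they would still need filtering, for example by sequence separation, before comparison with a contact map.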
Article
Many proteins have evolved to form specific molecular complexes and the specificity of this interaction is essential for their function. The network of the necessary inter-residue contacts must consequently constrain the protein sequences to some extent. In other words, the sequence of an interacting protein must reflect the consequence of this process of adaptation. It is reasonable to assume that the sequence changes accumulated during the evolution of one of the interacting proteins must be compensated by changes in the other. Here we apply a method for detecting correlated changes in multiple sequence alignments to a set of interacting protein domains and show that positions where changes occur in a correlated fashion in the two interacting molecules tend to be close to the protein-protein interfaces. This leads to the possibility of developing a method for predicting contacting pairs of residues from the sequence alone. Such a method would not need the knowledge of the structure of the interacting proteins, and hence would be both radically different and more widely applicable than traditional docking methods. We indeed demonstrate here that the information about correlated sequence changes is sufficient to single out the right inter-domain docking solution amongst many wrong alternatives of two-domain proteins. The same approach is also used here in one case (haemoglobin) where we attempt to predict the interface of two different proteins rather than two protein domains. Finally, we report here a prediction about the inter-domain contact regions of the heat-shock protein Hsc70 based only on sequence information.
Article
The sequences of at least 23 of the 43 CASP3 targets showed no significant similarity to the sequences of known structures. The experimental structures of all but three of these 23 targets revealed substantial similarities to known structures, with at least eleven of the target structures likely being distantly homologous to known structures. Nineteen of the 23 target structures were available at the time of the final CASP3 meeting in Asilomar in December 1998, whereas the experimental data on the protein folds of the remaining four targets were obtained afterwards. The predicted three-dimensional structures for each of the 23 targets were analyzed to select those predictions sharing with the experimental structures a similar overall fold and/or having correctly folded a substantial fraction of the target sequence. Initially, predicted models were numerically evaluated and the evaluation results aided the selection process. Each target structure was then classified to identify a minimal set of structural features characteristic to its protein fold and evolutionary superfamily. The predictions containing this set were assessed comparatively to find the best predictions for each target. The predictions of new folds were assessed separately. The total number of the selected 'correct' predictions and the quality of these predictions were used to compare the performance of different predictor teams and different prediction methods in the fold prediction/recognition category. Proteins Suppl 1999;3:88–103. © 1999 Wiley-Liss, Inc.
Article
An approach for genome comparison, combining function classification of gene products and sequence comparison, is presented. The genomes of Haemophilus influenzae and Escherichia coli are analyzed, and all genes are classified into nine major functional classes, corresponding to important cellular processes. To study gene order relationships and genome organization in the two bacteria, we performed statistics on neighboring pairs of genes. To estimate the significance of the observations, a statistical model based on binomial distributions has been developed. Significant patterns of gene order are observed within, as well as between, the two bacterial genomes: Functionally related genes tend to be neighbors more often than do unrelated genes. Some of these groups represent well-known operons, but additional gene clusters are identified. These clusters correspond to genomic elements that have been conserved during bacterial evolution. In addition to nearest-neighbor relationships, the method is also useful to study the relative direction of transcription in genomes, which is also highly conserved between homologous gene pairs. This new approach combines the high-level description of molecular function with pair statistics that express genome organization. It is expected to complement traditional methods of sequence analysis in the study of genomic structure, function, and evolution.
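The abstract does not spell out its binomial model, so the following is only a sketch under stated assumptions: each adjacent gene pair is treated as an independent trial, and the chance that a random pair shares a functional class is estimated from the class frequencies. The function names and the toy class string are hypothetical.

```python
from collections import Counter
from math import comb


def same_class_probability(classes):
    """Chance that two genes drawn at random (with replacement) share a class."""
    n = len(classes)
    return sum((count / n) ** 2 for count in Counter(classes).values())


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


def neighbor_enrichment(classes):
    """Count adjacent same-class pairs along the genome and score them.

    `classes` lists the functional class of each gene in genome order.
    Returns (observed same-class pairs, total adjacent pairs, tail probability).
    """
    pairs = list(zip(classes, classes[1:]))
    observed = sum(a == b for a, b in pairs)
    p = same_class_probability(classes)
    return observed, len(pairs), binomial_tail(observed, len(pairs), p)


# Hypothetical genome with three functional classes in contiguous blocks.
observed, total, p_value = neighbor_enrichment(list("AAAABBBBCCCC"))
```

A small tail probability means that functionally related genes are neighbors more often than the class frequencies alone would suggest. The independence assumption is a simplification, since adjacent pairs share genes.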
Article
The functional composition of organisms can be analysed for the first time with the appearance of complete or sizeable parts of various genomes. We have reduced the problem of protein function classification to a simple scheme with three classes of protein function: energy-, information- and communication-associated proteins. Finer classification schemes can be easily mapped to the above three classes. To deal with the vast amount of information, a system for automatic function classification using database annotations has been developed. The system is able to classify correctly about 80% of the query sequences with annotations. Using this system, we can analyse samples from the genomes of the most represented species in sequence databases and compare their genomic composition. The similarities and differences for different taxonomic groups are strikingly intuitive. Viruses have the highest proportion of proteins involved in the control and expression of genetic information. Bacteria have the highest proportion of their genes dedicated to the production of proteins associated with small molecule transformations and transport. Animals have a very large proportion of proteins associated with intra- and intercellular communication and other regulatory processes. In general, the proportion of communication-related proteins increases during evolution, indicating trends that led to the emergence of the eukaryotic cell and later the transition from unicellular to multicellular organisms.
Article
Protein families are a rich source of information; sequence conservation and sequence correlation are two of the main properties that can be derived from the analysis of multiple sequence alignments. Sequence conservation is related to the direct evolutionary pressure to retain the chemical characteristics of some positions in order to maintain a given function. Sequence correlation is attributed to the small sequence adjustments needed to maintain protein stability against constant mutational drift. Here, we showed that sequence conservation and correlation were each frequently informative enough to detect incorrectly folded proteins. Furthermore, combining conservation, correlation, and polarity, we achieved an almost perfect discrimination between native and incorrectly folded proteins. Thus, we made use of this information for threading by evaluating the models suggested by a threading method according to the degree of proximity of the corresponding correlated, conserved, and apolar residues. The results showed that the fold recognition capacity of a given threading approach could be improved almost fourfold by selecting the alignments that score best under the three different sequence-based approaches.