Effective connectivity profile: A structural representation that evidences the relationship between protein structures and sequences

Centro de Biología Molecular Severo Ochoa, (CSIC-UAM), Cantoblanco, 28049 Madrid, Spain.
Proteins Structure Function and Bioinformatics (Impact Factor: 2.63). 12/2008; 73(4):872-88. DOI: 10.1002/prot.22113
Source: PubMed


The complexity of protein structures calls for simplified representations of their topology. The simplest possible mathematical description of a protein structure is a one-dimensional profile representing, for instance, buriedness or secondary structure. This kind of representation has been introduced for studying the sequence-to-structure relationship, with applications to fold recognition. Here we define the effective connectivity profile (EC), a network-theoretical profile that self-consistently represents the network structure of the protein contact matrix. The EC profile makes the relationship between protein structure and protein sequence mathematically explicit, because it allows predicting the average hydrophobicity profile (HP) and the distributions of amino acids at each site for families of homologous proteins sharing the same structure. In this sense, the EC provides an analytic solution to the statistical inverse folding problem, which consists in finding the statistical properties of the set of sequences compatible with a given structure. We tested these predictions with simulations of the structurally constrained neutral (SCN) model of protein evolution with structure conservation, for single- and multi-domain proteins, and for a wide range of mutation processes, the latter producing sequences with very different hydrophobicity profiles; we found that the EC-based predictions are accurate even when only one sequence of the family is known. The EC profile is also very significantly correlated with the HP for sequence-structure pairs in the PDB. The EC profile generalizes the properties of previously introduced structural profiles to modular proteins such as multi-domain chains, and its correlation with the sequence profile is substantially improved with respect to the previously defined profiles, particularly for long proteins.
Furthermore, the EC profile has a dynamic interpretation: the EC components are strongly inversely related to the temperature factors measured in X-ray experiments, meaning that positions with a large EC component are more strongly constrained in their equilibrium dynamics. Moreover, the EC profile allows one to define a natural measure of modularity that correlates with the number of domains composing the protein, suggesting its application to domain decomposition. Finally, we show that structurally similar proteins have similar EC profiles, so that the similarity between aligned EC profiles can be used as a structure similarity measure, a property that we have recently applied to protein structure alignment. The code for computing the EC profile is available upon request by writing to [email protected], and the structural profiles discussed in this article can be downloaded from the SLOTH webserver.
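The contact-matrix eigenvector idea underlying such structural profiles can be sketched in a few lines. The snippet below is an illustrative approximation, not the paper's exact EC definition (which is obtained self-consistently): it builds a Cα contact matrix and computes its principal eigenvector by power iteration; the 8 Å cutoff and the function names are assumptions.

```python
import numpy as np

def contact_matrix(ca_coords, cutoff=8.0):
    """Binary contact matrix from C-alpha coordinates: C[i, j] = 1 when
    residues i and j lie within `cutoff` angstroms (diagonal excluded).
    The 8 A cutoff is a common convention, assumed here."""
    d = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    C = (d < cutoff).astype(float)
    np.fill_diagonal(C, 0.0)
    return C

def principal_eigenvector_profile(C, n_iter=200, tol=1e-10):
    """Principal eigenvector of the contact matrix by power iteration:
    a one-dimensional structural profile closely related to (but not
    identical with) the self-consistently defined EC profile."""
    v = np.ones(C.shape[0]) / np.sqrt(C.shape[0])
    for _ in range(n_iter):
        w = C @ v
        w /= np.linalg.norm(w)
        converged = np.linalg.norm(w - v) < tol
        v = w
        if converged:
            break
    return np.abs(v)  # the Perron vector of a non-negative matrix is non-negative
```

Sites with many contacts (buried positions) get large profile components, which is what makes such a profile a natural predictor of the hydrophobicity profile.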

    • "We call this a mean-field (MF) model, because each site evolves independently but taking into account in a self-consistent way the MF generated by the other sites. Approximating contact interaction energies with their hydrophobic component, the previous model established an explicit relationship between the average hydrophobicity of a site in a family of protein sequences and its connectivity at the structural level (Bastolla et al. 2005, 2008; Porto et al. 2005) and it was later extended to generate a substitution model (Bastolla et al. 2006). The MF model that we present here builds on that proposal, but is not explicitly based on hydrophobicity and it adopts an improved representation of the statistical mechanical model of the misfolded state (Minning et al. 2013). "
    ABSTRACT: Despite intense work, incorporating constraints on protein native structures into the mathematical models of molecular evolution remains difficult, since most models and programs assume that protein sites evolve independently, whereas protein stability is maintained by interactions between sites. Here we address this problem by developing a new mean-field substitution model that generates independent site-specific amino-acid distributions with constraints on the stability of the native state against both unfolding and misfolding. The model depends on a background distribution of amino acids and one selection parameter that we fix by maximizing the likelihood of the observed protein sequence. The analytic solution of the model shows that the main determinant of the site-specific distributions is the number of native contacts of the site and that the most variable sites are those with an intermediate number of native contacts. The mean-field models obtained by taking into account misfolded conformations yield larger likelihood than models that only consider the native state, since their average hydrophobicity is more realistic, and they produce on average stable sequences for most proteins. We evaluated the mean-field model with respect to empirical substitution models on 12 test datasets of different protein families. In all cases, the observed site-specific sequence profiles presented smaller Kullback-Leibler divergence from the mean-field distributions than from the empirical substitution model. Next, we obtained substitution rates by combining the mean-field frequencies with an empirical substitution model. The resulting mean-field substitution model assigns larger likelihood than the empirical model to all studied families when we consider sequences with identity larger than 0.35, plausibly a condition that enforces conservation of the native structure across the family.
We found that the mean-field model performs better than other structurally constrained models with similar or higher complexity. With respect to the much more complex model recently developed by Bordner and Mittelmann, which takes into account pairwise terms in the amino acid distributions and also optimizes the exchangeability matrix, our model performed worse for data with small sequence divergence but better for data with larger sequence divergence. The mean-field model has been implemented in the computer program Prot Evol, which is freely available at
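The qualitative shape of such site-specific distributions can be sketched with a minimal mean-field-style model. The functional form, the `beta` selection parameter, and the reduced Kyte-Doolittle hydropathy alphabet below are illustrative assumptions, not the paper's fitted model: the probability of an amino acid at a site grows exponentially with the product of the site's connectivity and the residue's hydropathy.

```python
import numpy as np

# Kyte-Doolittle hydropathy values (real scale values; the
# reduced ten-letter alphabet is an illustrative assumption)
KD = {"I": 4.5, "V": 4.2, "L": 3.8, "F": 2.8, "A": 1.8,
      "G": -0.4, "S": -0.8, "D": -3.5, "K": -3.9, "R": -4.5}

def site_distributions(conn, beta=0.3, background=None):
    """Sketch of mean-field site-specific amino-acid distributions:
    p_i(a) proportional to background(a) * exp(beta * conn[i] * h(a)),
    where h(a) is the hydropathy of amino acid a and conn[i] the
    structural connectivity of site i. `beta` plays the role of the
    selection parameter (the functional form is an assumption)."""
    aas = sorted(KD)
    h = np.array([KD[a] for a in aas])
    bg = np.full(len(aas), 1.0 / len(aas)) if background is None else np.asarray(background)
    logits = beta * np.outer(np.asarray(conn, float), h) + np.log(bg)
    w = np.exp(logits - logits.max(axis=1, keepdims=True))  # numerically stable softmax
    return aas, w / w.sum(axis=1, keepdims=True)
```

A buried site (large connectivity) then strongly favours hydrophobic residues such as Ile, while an exposed site stays closer to the background distribution, consistent with the finding that the number of native contacts is the main determinant of the site-specific distributions.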
    Molecular Biology and Evolution 04/2015; 32(8). DOI:10.1093/molbev/msv085 · 9.11 Impact Factor
    • "For instance, a gray-scale digital image can be represented by a lattice (i.e., a graph) where vertices represents image pixels and are labeled by the respective brightness value, while edges represent the neighborhood relations between adjacent pixels. Labeled graphs are used to represent data and systems in many different technical contexts, such as electrical circuits [17] [49], dynamical and complex networks [6] [50] [86], biochemical networks [4] [16] [21] [79], and segmented images [14] [51] [66] [73]. "
    ABSTRACT: Data analysis techniques have been traditionally conceived to cope with data described in terms of numeric vectors. The reason behind this fact is that numeric vectors have a well-defined and clear geometric interpretation, which facilitates the analysis from the mathematical viewpoint. However, state-of-the-art research on current topics of fundamental importance, such as smart grids, networks of dynamical systems, biochemical and biophysical systems, intelligent trading systems, multimedia content-based retrieval systems, and social network analysis, deals with structured and non-conventional information characterizing the data, providing richer and hence more complex patterns to be analyzed. As a consequence, representing patterns by complex (relational) structures and defining suitable, usually non-metric, dissimilarity measures is becoming a consolidated practice in related fields. However, as the data sources become more complex, the capability of judging the data quality (or reliability) and related interpretability issues can be seriously compromised. For this purpose, automated methods able to synthesize relevant information, and at the same time rigorously describe the uncertainty in the available datasets, are very important: information granulation is the key aspect in the analysis of complex data. In this paper, we discuss our general viewpoint on the adoption of information granulation techniques in the general context of soft computing and pattern recognition, conceived as a fundamental approach towards the challenging problem of automatic modeling of complex systems. We focus on the specific setting of processing so-called non-geometric data, which diverges significantly from what has been done so far in the related literature. We highlight the motivations and the founding concepts, and finally we provide the high-level conceptualization of the proposed data analysis framework.
    Applied Soft Computing 02/2015; 27:567-574. DOI:10.1016/j.asoc.2014.08.072 · 2.81 Impact Factor
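The image-as-lattice representation quoted above can be made concrete with a short sketch; the 4-neighbourhood choice and the function name are illustrative assumptions.

```python
import numpy as np

def image_to_lattice_graph(img):
    """Represent a gray-scale image as a labeled lattice graph:
    one vertex per pixel, labeled by its brightness value, and
    edges between 4-neighbouring pixels (up/down/left/right)."""
    h, w = img.shape
    labels = {(r, c): float(img[r, c]) for r in range(h) for c in range(w)}
    edges = []
    for r in range(h):
        for c in range(w):
            if c + 1 < w:
                edges.append(((r, c), (r, c + 1)))  # horizontal neighbour
            if r + 1 < h:
                edges.append(((r, c), (r + 1, c)))  # vertical neighbour
    return labels, edges
```

The same vertex/edge/label pattern carries over to the other examples in the quote (circuit nodes, reaction networks, segmented regions); only the labels and the neighbourhood relation change.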
    • "That is, the precise reason why operating directly on the labeled graph space yields worse results are not easy to rationalize, since both approaches (i.e., DS-G-454 and DS-S-454) operate basically on the same structural and chemico-physical information, although arranged in two different settings (i.e., respectively graph and sequence). However, as recently pointed out [6], a subset of the eigenvectors of the contact matrix provides good descriptors of the protein structure, showing strong correlation with hydrophobicity. It is worth to stress that, since the classification approach based on DS-S-454 effectively makes explicit use of the spectrum of the transition matrix (which is an elaboration of the contact/adjacency matrix) to define the order of the graph vertices in the sequence, such a technique is well-justified from the biological viewpoint, taking also into account chemico-physical information derived by the previously discussed statistical analysis. "
    ABSTRACT: This paper builds upon the fundamental paper by Niwa et al. (2009), which provides the unique possibility to analyze the relative aggregation/folding propensity of the elements of the entire Escherichia coli (E. coli) proteome in a cell-free standardized microenvironment. The hardness of the problem comes from the superposition between the driving forces of intra- and inter-molecule interactions, and it is mirrored by the evidence of shifts from folding to aggregation phenotypes caused by single-point mutations (doi:10.1021/ja1116233). Here we apply different state-of-the-art classification methods coming from the field of structural pattern recognition, with the aim of comparing different representations of the same proteins of the Niwa et al. database, going from pure sequence to chemico-physically labeled (contact) graphs. By this comparison, we are able to identify some interesting general properties of the protein universe, from the confirmation of a threshold size around 250 residues (discriminating "easily foldable" from "difficultly foldable" molecules, consistent with other independent data on protein domain architecture) to the relevance of contact-graph eigenvalue ordering for folding-behavior discrimination and characterization of the E. coli data. The soundness of the experimental results presented in this paper is proved by the statistically relevant relationships discovered among the chemico-physical description of proteins and the cost matrix of substitution developed for the various discrimination systems.
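The spectrum-based vertex ordering mentioned in the quoted excerpt can be sketched as follows. This is a hedged illustration, not the exact DS-S-454 recipe: it orders vertices by the dominant left eigenvector of the random-walk transition matrix built from the adjacency matrix.

```python
import numpy as np

def spectral_vertex_order(adj):
    """Order graph vertices by the dominant left eigenvector of the
    row-stochastic transition matrix T = D^{-1} A (for a connected
    graph this is the random-walk stationary distribution, i.e. the
    degree profile). Most 'central' vertices come first."""
    adj = np.asarray(adj, float)
    deg = adj.sum(axis=1)
    T = adj / np.where(deg > 0, deg, 1.0)[:, None]  # row-normalize
    vals, vecs = np.linalg.eig(T.T)                 # columns: left eigenvectors of T
    v = np.abs(vecs[:, np.argmax(vals.real)].real)  # dominant-eigenvalue component
    return np.argsort(-v)
```

Serializing a contact graph by such a spectral ordering yields a sequence representation that still encodes structural centrality, which is one plausible reading of why the sequence-based classifier remains biologically well justified.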