Article

Parameters of proteome evolution from histograms of amino-acid sequence identities of paralogous proteins

Center for Models of Life, Niels Bohr Institute, Blegdamsvej 17, DK-2100 Copenhagen Ø, Denmark.
Biology Direct (Impact Factor: 4.04). 02/2007; 2:32. DOI: 10.1186/1745-6150-2-32
Source: PubMed

ABSTRACT The evolution of the full repertoire of proteins encoded in a given genome is mostly driven by gene duplications, deletions, and sequence modifications of existing proteins. Indirect information about relative rates and other intrinsic parameters of these three basic processes is contained in the proteome-wide distribution of sequence identities of pairs of paralogous proteins.
We introduce a simple mathematical framework based on a stochastic birth-and-death model that allows one to extract some of this information, and apply it to the set of all pairs of paralogous proteins in H. pylori, E. coli, S. cerevisiae, C. elegans, D. melanogaster, and H. sapiens. It was found that the histogram of sequence identities p generated by an all-to-all alignment of all protein sequences encoded in a genome is well fitted by a power law of the form ~p^(-γ), with the exponent γ around 4 for the majority of organisms used in this study. This implies that the intra-protein variability of substitution rates is best described by the Gamma distribution with the exponent α ≈ 0.33. Different features of the shape of such histograms allow us to quantify the ratio between the genome-wide average deletion/duplication rates and the amino-acid substitution rate.
We separately measure the short-term ("raw") duplication and deletion rates r*_dup, r*_del, which include gene copies that will be removed soon after the duplication event, and their dramatically reduced long-term counterparts r_dup, r_del. The high deletion rate among recently duplicated proteins is consistent with a scenario in which they have not had enough time to significantly change their functional roles and thus are to a large degree disposable. Systematic trends of each of the four duplication/deletion rates with the total number of genes in the genome were analyzed. All but the deletion rate of recent duplicates, r*_del, were shown to systematically increase with N_genes. Abnormally flat shapes of sequence identity histograms observed for yeast and human are consistent with lineages leading to these organisms undergoing one or more whole-genome duplications. This interpretation is corroborated by our analysis of the genome of Paramecium tetraurelia, where the p^(-4) profile of the histogram is gradually restored by the successive removal of paralogs generated in its four known whole-genome duplication events.
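
As a rough illustration of the power-law fit described in the abstract, the sketch below estimates the exponent of a sequence-identity histogram with a straight-line fit in log-log space. It is not the authors' procedure: the function names, the number of bins, and the fitting window of 0.3 to 1.0 are hypothetical choices made for this example. The conversion between the two exponents assumes the relation gamma = 1 + 1/alpha, which is consistent with the quoted values (gamma around 4, alpha around 0.33) but is not stated in the abstract itself.

```python
import numpy as np


def fit_power_law_exponent(identities, bins=20, p_min=0.3, p_max=1.0):
    """Estimate gamma in N(p) ~ p**(-gamma) from a histogram of identities.

    Uses an unweighted least-squares line in log-log space over linear bins.
    The bin count and the window [p_min, p_max] are illustrative defaults.
    """
    identities = np.asarray(identities, dtype=float)
    counts, edges = np.histogram(identities, bins=bins, range=(p_min, p_max))
    centers = 0.5 * (edges[:-1] + edges[1:])
    keep = counts > 0  # empty bins have no logarithm, so skip them
    slope, _intercept = np.polyfit(np.log(centers[keep]), np.log(counts[keep]), 1)
    return -slope


def alpha_from_gamma(gamma):
    """Convert gamma to alpha.

    ASSUMPTION: gamma = 1 + 1/alpha, inferred from the quoted pair of values,
    not taken from the abstract.
    """
    return 1.0 / (gamma - 1.0)


def sample_power_law(n, gamma, p_min, p_max, rng):
    """Draw n synthetic identities with density ~ p**(-gamma) on [p_min, p_max].

    Inverse-transform sampling; used here only to produce test data.
    """
    u = rng.random(n)
    a = p_min ** (1.0 - gamma)
    b = p_max ** (1.0 - gamma)
    return (a + u * (b - a)) ** (1.0 / (1.0 - gamma))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    synthetic = sample_power_law(200_000, 4.0, 0.3, 1.0, rng)
    gamma_hat = fit_power_law_exponent(synthetic)
    print("fitted gamma:", gamma_hat)
    print("implied alpha (under the stated assumption):", alpha_from_gamma(gamma_hat))
```

With real data, `identities` would be the fractional identities of all aligned paralogous pairs in one genome. A log-log least-squares fit is the simplest option; a maximum-likelihood estimate would usually be less sensitive to binning.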

Available from: Sergei S Maslov, Jul 12, 2015
  • Source
    • "However, inclusion of paralogous sequences can potentially introduce noises in generating protein sequence conservation profiles since, compared with orthologs, protein paralogs are more likely to diverge in sequence and in cellular functions. On average the amino acid sequence identity between paralogous protein pairs is only 30% (Axelsen et al., 2007). "
    ABSTRACT: Recent advances in genome sequencing have revealed an abundance of non-synonymous polymorphisms among human individuals; consequently, it is of great interest and importance to predict whether such substitutions are functionally neutral or have deleterious effects. The accuracy of such prediction algorithms depends on the quality of the multiple sequence alignment, which is used to infer how well an amino acid substitution is tolerated at a given position. Because orthologous protein sequences were scarce in the past, the existing prediction algorithms all include sequences of protein paralogs in the alignment, which can dilute the conservation signal and affect prediction accuracy. However, we believe that, with the sequencing of a large number of mammalian genomes, it is now feasible to include only protein orthologs in the alignment and improve the prediction performance. We have developed a novel prediction algorithm, named SNPdryad, which includes only protein orthologs in building a multiple sequence alignment. Among many other innovations, SNPdryad uses different conservation scoring schemes and uses Random Forest as a classifier. We have tested SNPdryad on several datasets. We found that SNPdryad consistently outperformed other methods on several performance metrics, which is attributed to the exclusion of paralogous sequences. We have run SNPdryad on the complete human proteome, generating prediction scores for all possible amino acid substitutions. The algorithm and the prediction results can be accessed from the website: http://snps.ccbr.utoronto.ca:8080/SNPdryad/. http://datadryad.org/resource/doi:10.5061/dryad.n7m28
    Bioinformatics 01/2014; DOI:10.1093/bioinformatics/btt769 · 4.62 Impact Factor
  • Source
    ABSTRACT: Citation networks have re-emerged as a topic of intense interest in the complex networks community with the recent availability of large-scale data sets. The ranking of citation networks is a necessary practice as a means to improve information navigability and search. Unlike many other information networks, citation networks have strong aging characteristics that require the development of new ranking methods. To account for these aging characteristics, we modify the PageRank algorithm by initially distributing random surfers exponentially with age, in favor of more recent publications. The output of this algorithm, which we call CiteRank, is interpreted as approximate traffic to individual publications in a simple model of how researchers find new information. We optimize parameters of our algorithm to achieve the best performance. The results are compared for two rather different citation networks: all American Physical Society publications between 1893 and 2003 and the set of high-energy physics theory (hep-th) preprints. Despite major differences between these two networks, we find that their optimal parameters for the CiteRank algorithm are remarkably similar. The advantages and performance of CiteRank over more conventional methods of ranking publications are discussed.
    Collaborative voting systems have emerged as an abundant form of real-world, complex information systems that exist in a variety of online applications. These systems are composed of large populations of users that collectively submit and vote on objects. While the specific properties of these systems vary widely, many of them share a core set of features and dynamical behaviors that govern their evolution. We study a subset of these systems that involve material of a time-critical nature, as in the popular example of news items. We consider a general model system in which articles are introduced, voted on by a population of users, and subsequently expire after a prescribed period of time. To study the interaction between popularity and quality, we introduce simple stochastic models of user behavior that approximate differing user quality and susceptibility to the common notion of popularity. We define a metric to quantify user reputation in a manner that is self-consistent, adaptable and content-blind and shows good correlation with the probability that a user behaves in an optimal fashion. We further construct a mechanism for ranking documents that takes into account user reputation and provides substantial improvement in the time-critical performance of the system.
    The structure of complex systems has been well studied in the context of both information and biological systems. More recently, dynamics in complex systems that occur over the background of the underlying network have received a great deal of attention. In particular, the study of fluctuations in complex systems has emerged as an issue central to understanding dynamical behavior. We approach the problem of collective effects of the underlying network on dynamical fluctuations by considering the protein-protein interaction networks of the living cell. We consider two types of fluctuations in the mass-action equilibrium in protein binding networks. The first type is driven by relatively slow changes in total concentrations (copy numbers) of interacting proteins. The second type, which we refer to as spontaneous, is caused by quickly decaying thermodynamic deviations away from the mass-action equilibrium of the system. As such, these spontaneous fluctuations are amenable to the methods of equilibrium statistical mechanics used in our study. We investigate the effects of network connectivity on these fluctuations by comparing them to different scenarios in which the interacting pair is isolated from the rest of the network. Such comparison allows us to analytically derive upper and lower bounds on network fluctuations. The collective effects are shown to sometimes lead to relatively large amplification of spontaneous fluctuations as compared to the expectation for isolated dimers. As a consequence of this, the strength of both types of fluctuations is positively correlated with the overall network connectivity of proteins forming the complex. On the other hand, the relative amplitude of fluctuations is negatively correlated with the equilibrium concentration of the complex. Our general findings are illustrated using a curated network of protein-protein interactions and multi-protein complexes in baker's yeast with experimentally determined protein concentrations.
  • Source
    ABSTRACT: The relationship between the regulatory design and the functionality of molecular networks is a key issue in biology. Modules and motifs have been associated with various cellular processes, thereby providing anecdotal evidence for performance-based localization on molecular networks. To quantify the structure-function relationship, we investigate similarities of proteins that are close in the regulatory network of the yeast Saccharomyces cerevisiae. We find that the topology of the regulatory network shows only weak remnants of its history of network reorganizations, but strong features of co-regulated proteins associated with similar tasks. These functional correlations decrease strongly when one considers proteins separated by more than two steps in the regulatory network. The network topology primarily reflects the processes that are orchestrated by each individual hub, whereas there are nearly no remnants of the history of protein duplications. Our results suggest that local topological features of regulatory networks, including broad degree distributions, emerge as an implicit result of matching a number of needed processes to a finite toolbox of proteins.
    BMC Systems Biology 02/2008; 2:25. DOI:10.1186/1752-0509-2-25 · 2.85 Impact Factor