-
[show abstract]
[hide abstract]
ABSTRACT: Gene duplication is widely regarded as a major mechanism modelling genome evolution and function. However, the mechanisms that drive the evolution of the two, initially redundant, gene copies are still ill-defined. Many gene duplicates experience evolutionary rate acceleration, but the relative contribution of positive selection and random drift to the retention and subsequent evolution of gene duplicates, and for how long the molecular clock may be distorted by these processes, remains unclear. Focusing on rodent genes which duplicated before and after the mouse and rat split, we find significantly increased sequence divergence after duplication in only one of the copies, which in nearly all cases corresponds to the novel daughter copy, independent of the mechanism of duplication. We observe that the evolutionary rate of the accelerated copy, measured as the ratio of non-synonymous to synonymous substitutions, is on average 5 fold higher in the period spanning 4 to 12 My after the duplication than it was before the duplication. This increase can be explained, at least in part, by the action of positive selection according to the results of the maximum likelihood based branch-site test. Subsequently, the rate decelerates until purifying selection completely returns to preduplication levels. Reversion to the original rates has already been accomplished 40.5 My after the duplication event, corresponding to a genetic distance of about 0.28 synonymous substitutions per site. Differences in tissue gene expression patterns parallel those of substitution rates, reinforcing the role of neofunctionalization in explaining the evolution of young gene duplicates.
Molecular Biology and Evolution 04/2013; · 5.55 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: BACKGROUND: Proteins are composed of a combination of discrete, well-defined, sequence domains, associated with specific functions that have arisen at different times during evolutionary history. The emergence of novel domains is related to protein functional diversification and adaptation. But currently little is known about how novel domains arise and how they subsequently evolve. RESULTS: To gain insights into the impact of recently emerged domains in protein evolution we have identified all human young protein domains that have emerged in approximately the past 550 million years. We have classified them into vertebrate-specific and mammalian-specific groups, and compared them to older domains. We have found 426 different annotated young domains, totalling 995 domain occurrences, which represent about 12.3% of all human domains. We have observed that 61.3% of them arose in newly formed genes, while the remaining 38.7% are found combined with older domains, and have very likely emerged in the context of a previously existing protein. Young domains are preferentially located at the N-terminus of the protein, indicating that, at least in vertebrates, novel functional sequences often emerge there. Furthermore, young domains show significantly higher non-synonymous to synonymous substitution rates than older domains using human and mouse orthologous sequence comparisons. This is also true when we compare young and old domains located in the same protein, suggesting that recently arisen domains tend to evolve in a less constrained manner than older domains. CONCLUSIONS: We conclude that proteins tend to gain domains over time, becoming progressively longer. We show that many proteins are made of domains of different age, and that the fastest evolving parts correspond to the domains that have been acquired more recently.
BMC Evolutionary Biology 02/2013; 13(1):47. · 3.52 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Large-scale evolutionary studies often require the automated construction of alignments of a large number of homologous gene families. The majority of eukaryotic genes can produce different transcripts due to alternative splicing or transcription initiation, and many such transcripts encode different protein isoforms. As analyses tend to be gene-centered, one single protein isoform per gene is selected for the alignment, with the de facto approach being to use the longest protein isoform per gene (Longest), presumably to avoid including partial sequences and to maximize sequence information. Here we show that this approach is problematic because it increases the number of indels in the alignments due to the inclusion of non-homologous regions, such as those derived from species-specific exons, increasing the number of misaligned positions. With the aim of ameliorating this problem, we have developed a novel heuristic, PALO (Protein Alignment Optimizer), which, for each gene family, selects the combination of protein isoforms that are most similar in length. We examine several evolutionary parameters inferred from alignments in which the only difference is the method used to select the protein isoform combination: Longest, PALO, the combination that results in the highest sequence conservation and a randomly selected combination. We observe that Longest tends to overestimate both non-synonymous and synonymous substitution rates when compared to PALO, which is most likely due to an excess of misaligned positions. The estimation of the fraction of genes that have experienced positive selection by maximum likelihood is very sensitive to the method of isoform selection employed, both when alignments are constructed with MAFFT and with Prank(+F). Longest performs better than a random combination but still estimates up to 3 times more positively selected genes than the combination showing the highest conservation, indicating the presence of many false positives. We show that PALO can eliminate the majority of such false positives and thus that it is a more appropriate approach for large-scale analyses than Longest. A web server has been set up to facilitate the use of PALO given a user-defined set of gene families, it is available at http://evolutionarygenomics.imim.es/palo.
Genome Biology and Evolution 01/2013; · 4.62 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: BACKGROUND: Low-complexity regions (LCRs) in proteins are tracts that are highly enriched in one or a few amino acids. Given their high abundance, and their capacity to expand in relatively short periods of time through replication slippage, they can greatly contribute to increase protein sequence space and generate novel protein functions. However, little is known about the global impact of LCRs on protein evolution. RESULTS: We have traced back the evolutionary history of 2,802 LCRs from a large set of homologous protein families from H.sapiens, M.musculus, G.gallus, D.rerio and C.intestinalis. Transcriptional factors and other regulatory functions are overrepresented in proteins containing LCRs. We have found that the gain of novel LCRs is frequently associated with repeat expansion whereas the loss of LCRs is more often due to accumulation of amino acid substitutions as opposed to deletions. This dichotomy results in net protein sequence gain over time. We have detected a significant increase in the rate of accumulation of novel LCRs in the ancestral Amniota and mammalian branches, and a reduction in the chicken branch. Alanine and/or glycine-rich LCRs are overrepresented in recently emerged LCR sets from all branches, suggesting that their expansion is better tolerated than for other LCR types. LCRs enriched in positively charged amino acids show the contrary pattern, indicating an important effect of purifying selection in their maintenance. CONCLUSION: We have performed the first large-scale study on the evolutionary dynamics of LCRs in protein families. The study has shown that the composition of an LCR is an important determinant of its evolutionary pattern.
BMC Evolutionary Biology 08/2012; 12(1):155. · 3.52 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: What factors determine a protein's rate of evolution are actively debated. Especially unclear is the relative role of intrinsic factors of present-day proteins versus historical factors such as protein age. Here we study the interplay of structural properties and evolutionary age, as determinants of protein evolutionary rate. We use a large set of one-to-one orthologs between human and mouse proteins, with mapped PDB structures. We report that previously observed structural correlations also hold within each age group - including relationships between solvent accessibility, designabililty, and evolutionary rates. However, age also plays a crucial role: age modulates the relationship between solvent accessibility and rate. Additionally, younger proteins, despite being less designable, tend to evolve faster than older proteins. We show that previously reported relationships between age and rate cannot be explained by structural biases among age groups. Finally, we introduce a knowledge-based potential function to study the stability of proteins through large-scale computation. We find that older proteins are more stable for their native structure, and more robust to mutations, than younger ones. Our results underscore that several determinants, both intrinsic and historical, can interact to determine rates of protein evolution.
PLoS Computational Biology 05/2012; 8(5):e1002542. · 5.22 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Insertions and deletions (indels), together with nucleotide substitutions, are major drivers of sequence evolution. An excess of deletions over insertions in genomic sequences-the so-called deletional bias-has been reported in a wide range of species, including mammals. However, this bias has not been found in the coding sequences of some mammalian species, such as human and mouse. To determine the strength of the deletional bias in mammals, and the influence of mutation and selection, we have quantified indels in both neutrally evolving noncoding sequences and protein-coding sequences, in six mammalian branches: human, macaque, ancestral primate, mouse, rat, and ancestral rodent. The results obtained with an improved algorithm for the placement of insertions in multiple alignments, Prank(+F), indicate that contrary to previous results, the only mammalian branch with a strong deletional bias is the rodent ancestral branch. We estimate that such a bias has resulted in an ~2.5% sequence loss of mammalian syntenic region in the ancestor of the mouse and rat. Further, a comparison of coding and noncoding sequences shows that negative selection is acting more strongly against mutations generating amino acid insertions than against mutations resulting in amino acid deletions. The strength of selection against indels is found to be higher in the rodent branches than in the primate branches, consistent with the larger effective population sizes of the rodents.
Genome Research 11/2011; 22(3):478-85. · 13.61 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Low-complexity sequences are extremely abundant in eukaryotic proteins for reasons that remain unclear. One hypothesis is that they contribute to the formation of novel coding sequences, facilitating the generation of novel protein functions. Here, we test this hypothesis by examining the content of low-complexity sequences in proteins of different age. We show that recently emerged proteins contain more low-complexity sequences than older proteins and that these sequences often form functional domains. These data are consistent with the idea that low-complexity sequences may play a key role in the emergence of novel genes.
Molecular Biology and Evolution 10/2011; 29(3):883-6. · 5.55 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: The molecular clock hypothesis states that protein-coding genes evolve at an approximately constant rate. However, this is only expected to be true as long as the function and the tertiary structure of the molecule remain unaltered. An important implication of this statement is that significant deviations in the rate of evolution of a gene with respect to the species clock are likely to reflect functional and/or structural alterations. Here, we present a method to identify such deviations and apply it to a data set of 2,929 high-quality coding sequence alignments corresponding to one-to-one orthologous genes from six mammalian species--human, macaque, mouse, rat, cow, and dog. Deviated branches are defined as those that present significant alterations in both the rate of nonsynonymous substitutions (dN) and the selective pressure (dN/dS). Strikingly, we find that as many as 24.5% of the genes show branch-specific deviations in dN and dN/dS, though this is a relatively well-conserved set of genes. Around half of these genes show branch-specific acceleration of evolutionary rates. Positive selection (PS) tests based on divergence data only identify 17.7% of the accelerated branches. Failure to identify PS in accelerated branches with an excess of radical amino acid replacements suggests that these tests are conservative. Interestingly, genes with accelerated branches are significantly enriched in neural proteins, indicating that this type of protein might play a more important role than previously thought in species diversification, although they are generally not detected by PS tests. We discuss in detail several examples of genes that show lineage-specific evolutionary rate acceleration and are involved in synaptic transmission, chemosensory perception, and ubiquitination.
Molecular Biology and Evolution 01/2011; 28(1):383-98. · 5.55 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Gene duplication is a major mechanism for molecular evolutionary innovation. Young gene duplicates typically exhibit elevated rates of protein evolution and, according to a number of recent studies, increased expression divergence. However, the nature of these changes is still poorly understood. To gain novel insights into the functional consequences of gene duplication, we have undertaken an in-depth analysis of a large data set of gene families containing primate- and/or rodent-specific gene duplicates. We have found a clear tendency toward an increase in protein, promoter, and expression divergence with increasing number of duplication events undergone by each gene since the human-mouse split. In addition, gene duplication is significantly associated with a reduction in expression breadth and intensity. Interestingly, it is possible to identify three main groups regarding the evolution of gene expression following gene duplication. The first group, which comprises around 25% of the families, shows patterns compatible with tissue-expression partitioning. The second and largest group, comprising 33-53% of the families, shows broad expression of one of the gene copies and reduced, overlapping, expression of the other copy or copies. This can be attributed, in most cases, to loss of expression in several tissues of one or more gene copies. Finally, a substantial number of families, 19-35%, maintain a very high level of tissue-expression overlap (>0.8) after tens of millions of years of evolution. These families may have been subject to selection for increased gene dosage.
Molecular Biology and Evolution 10/2009; 27(2):325-35. · 5.55 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Genomes contain a large number of genes that do not have recognizable homologues in other species. These genes, found in only one or a few closely related species, are known as orphan genes. Their limited distribution implies that many of them are probably involved in lineage-specific adaptive processes. One important question that has remained elusive to date is how orphan genes originate. It has been proposed that they might have arisen by gene duplication followed by a period of very rapid sequence divergence, which would have erased any traces of similarity to other evolutionarily related genes. However, this explanation does not seem plausible for genes lacking homologues in very closely related species. In the present article, we review recent efforts to identify the mechanisms of formation of primate orphan genes. These studies reveal an unexpected important role of transposable elements in the formation of novel protein-coding genes in the genomes of primates.
Biochemical Society Transactions 09/2009; 37(Pt 4):778-82. · 3.71 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Single amino acid repeats are prevalent in eukaryote organisms, although the role of many such sequences is still poorly understood. We have performed a comprehensive analysis of the proteins containing homopolymeric histidine tracts in the human genome and identified 86 human proteins that contain stretches of five or more histidines. Most of them are endowed with DNA- and RNA-related functions, and, in addition, there is an overrepresentation of proteins expressed in the brain and/or nervous system development. An analysis of their subcellular localization shows that 15 of the 22 nuclear proteins identified accumulate in the nuclear subcompartment known as nuclear speckles. This localization is lost when the histidine repeat is deleted, and significantly, closely related paralogous proteins without histidine repeats also fail to localize to nuclear speckles. Hence, the histidine tract appears to be directly involved in targeting proteins to this compartment. The removal of DNA-binding domains or treatment with RNA polymerase II inhibitors induces the re-localization of several polyhistidine-containing proteins from the nucleoplasm to nuclear speckles. These findings highlight the dynamic relationship between sites of transcription and nuclear speckles. Therefore, we define the histidine repeats as a novel targeting signal for nuclear speckles, and we suggest that these repeats are a way of generating evolutionary diversification in gene duplicates. These data contribute to our better understanding of the physiological role of single amino acid repeats in proteins.
PLoS Genetics 04/2009; 5(3):e1000397. · 8.69 Impact Factor
-
Verónica Rodilla,
Alberto Villanueva,
Antonia Obrador-Hevia,
Alex Robert-Moreno,
Vanessa Fernández-Majada,
Andrea Grilli,
Nuria López-Bigas,
Nicolás Bellora, M Mar Albà,
Ferran Torres,
Mireia Duñach,
Xavier Sanjuan,
Sara Gonzalez,
Thomas Gridley,
Gabriel Capella,
Anna Bigas,
Lluís Espinosa
[show abstract]
[hide abstract]
ABSTRACT: Notch has been linked to beta-catenin-dependent tumorigenesis; however, the mechanisms leading to Notch activation and the contribution of the Notch pathway to colorectal cancer is not yet understood. By microarray analysis, we have identified a group of genes downstream of Wnt/beta-catenin (down-regulated when blocking Wnt/beta-catenin) that are directly regulated by Notch (repressed by gamma-secretase inhibitors and up-regulated by active Notch1 in the absence of beta-catenin signaling). We demonstrate that Notch is downstream of Wnt in colorectal cancer cells through beta-catenin-mediated transcriptional activation of the Notch-ligand Jagged1. Consistently, expression of activated Notch1 partially reverts the effects of blocking Wnt/beta-catenin pathway in tumors implanted s.c. in nude mice. Crossing APC(Min/+) with Jagged1(+/Delta) mice is sufficient to significantly reduce the size of the polyps arising in the APC mutant background indicating that Notch is an essential modulator of tumorigenesis induced by nuclear beta-catenin. We show that this mechanism is operating in human tumors from Familial Adenomatous Polyposis patients. We conclude that Notch activation, accomplished by beta-catenin-mediated up-regulation of Jagged1, is required for tumorigenesis in the intestine. The Notch-specific genetic signature is sufficient to block differentiation and promote vasculogenesis in tumors whereas proliferation depends on both pathways.
Proceedings of the National Academy of Sciences 04/2009; 106(15):6315-20. · 9.68 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Genomes contain a large number of genes that do not have recognizable homologues in other species and that are likely to be involved in important species-specific adaptive processes. The origin of many such "orphan" genes remains unknown. Here we present the first systematic study of the characteristics and mechanisms of formation of primate-specific orphan genes. We determine that codon usage values for most orphan genes fall within the bulk of the codon usage distribution of bona fide human proteins, supporting their current protein-coding annotation. We also show that primate orphan genes display distinctive features in relation to genes of wider phylogenetic distribution: higher tissue specificity, more rapid evolution, and shorter peptide size. We estimate that around 24% are highly divergent members of mammalian protein families. Interestingly, around 53% of the orphan genes contain sequences derived from transposable elements (TEs) and are mostly located in primate-specific genomic regions. This indicates frequent recruitment of TEs as part of novel genes. Finally, we also obtain evidence that a small fraction of primate orphan genes, around 5.5%, might have originated de novo from mammalian noncoding genomic regions.
Molecular Biology and Evolution 01/2009; 26(3):603-12. · 5.55 Impact Factor
-
04/2008; , ISBN: 9780470015902
-
[show abstract]
[hide abstract]
ABSTRACT: The gene content of any genome is a rich mosaic of genes that have originated at different times during evolution. Among the
most interesting properties related to gene age is the fact that younger genes tend to show accelerated evolutionary rates
with respect to older genes. Here, we use a large number of closely related mammalian genomes to gain further insights into
the relationship between gene age and evolutionary rate. We define a group of primate-specific genes that are absent from
11 non-primate mammalian genomes as well as from other eukaryotic genomes. These genes, of very recent origin, show the highest
evolutionary rate and the shortest protein length. We discuss how these results may shed light on understanding the proposed
mechanisms for the origin of lineage-specific, novel genes.
12/2007: pages 45-59;
-
[show abstract]
[hide abstract]
ABSTRACT: Single-amino-acid tandem repeats are very common in mammalian proteins but their function and evolution are still poorly understood. Here we investigate how the variability and prevalence of amino acid repeats are related to the evolutionary constraints operating on the proteins. We find a significant positive correlation between repeat size difference and protein nonsynonymous substitution rate in human and mouse orthologous genes. This association is observed for all the common amino acid repeat types and indicates that rapid diversification of repeat structures, involving both trinucleotide slippage and nucleotide substitutions, preferentially occurs in proteins subject to low selective constraints. However, strikingly, we also observe a significant negative correlation between the number of repeats in a protein and the gene nonsynonymous substitution rate, particularly for glutamine, glycine, and alanine repeats. This implies that proteins subject to strong selective constraints tend to contain an unexpectedly high number of repeats, which tend to be well conserved between the two species. This is consistent with a role for selection in the maintenance of a significant number of repeats. Analysis of the codon structure of the sequences encoding the repeats shows that codon purity is associated with high repeat size interspecific variability. Interestingly, polyalanine and polyglutamine repeats associated with disease show very distinctive features regarding the degree of repeat conservation and the protein sequence selective constraints.
Genomics 04/2007; 89(3):316-25. · 3.02 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Many DNA functional motifs tend to accumulate or cluster at specific gene locations. These locations can be detected, in a group of gene sequences, as high frequency 'peaks' with respect to a reference position, such as the transcription start site (TSS). We have developed a web tool for the identification of regions containing significant motif peaks. We show, by using different yeast gene datasets, that peak regions are strongly enriched in experimentally-validated motifs and contain potentially important novel motifs. AVAILABILITY: http://genomics.imim.es/peaks
Bioinformatics 02/2007; 23(2):243-4. · 5.47 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: Understanding the constraints that operate in mammalian gene promoter sequences is of key importance to understand the evolution of gene regulatory networks. The level of promoter conservation varies greatly across orthologous genes, denoting differences in the strength of the evolutionary constraints. Here we test the hypothesis that the number of tissues in which a gene is expressed is related in a significant manner to the extent of promoter sequence conservation.
We show that mammalian housekeeping genes, expressed in all or nearly all tissues, show significantly lower promoter sequence conservation, especially upstream of position -500 with respect to the transcription start site, than genes expressed in a subset of tissues. In addition, we evaluate the effect of gene function, CpG island content and protein evolutionary rate on promoter sequence conservation. Finally, we identify a subset of transcription factors that bind to motifs that are specifically over-represented in housekeeping gene promoters.
This is the first report that shows that the promoters of housekeeping genes show reduced sequence conservation with respect to genes expressed in a more tissue-restricted manner. This is likely to be related to simpler gene expression, requiring a smaller number of functional cis-regulatory motifs.
Genome biology 02/2007; 8(7):R140. · 6.63 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: The arrangement of regulatory motifs in gene promoters, or promoter architecture, is the result of mutation and selection processes that have operated over many millions of years. In mammals, tissue-specific transcriptional regulation is related to the presence of specific protein-interacting DNA motifs in gene promoters. However, little is known about the relative location and spacing of these motifs. To fill this gap, we have performed a systematic search for motifs that show significant bias at specific promoter locations in a large collection of housekeeping and tissue-specific genes.
We observe that promoters driving housekeeping gene expression are enriched in particular motifs with strong positional bias, such as YY1, which are of little relevance in promoters driving tissue-specific expression. We also identify a large number of motifs that show positional bias in genes expressed in a highly tissue-specific manner. They include well-known tissue-specific motifs, such as HNF1 and HNF4 motifs in liver, kidney and small intestine, or RFX motifs in testis, as well as many potentially novel regulatory motifs. Based on this analysis, we provide predictions for 559 tissue-specific motifs in mouse gene promoters.
The study shows that motif positional bias is an important feature of mammalian proximal promoters and that it affects both general and tissue-specific motifs. Motif positional constraints define very distinct promoter architectures depending on breadth of expression and type of tissue.
BMC Genomics 02/2007; 8:459. · 4.07 Impact Factor
-
[show abstract]
[hide abstract]
ABSTRACT: It has been shown in a variety of organisms, including mammals, that genes that appeared recently in evolution, for example orphan genes, evolve faster than older genes. Low functional constraints at the time of origin of novel genes may explain these results. However, this observation has been recently attributed to an artifact caused by the inability of Blast to detect the fastest genes in different eukaryotic genomes. Distinguishing between these two possible explanations would be of great importance for any studies dealing with the taxon distribution of proteins and the origin of novel genes.
Here we used simulations of protein sequences to examine the capacity of Blast to detect proteins of diverse evolutionary rates in the different species of an eukaryotic phylogenetic tree that included metazoans, fungi and plants. We simulated the evolution of protein genes with the same evolutionary rates than those observed in functional mammalian genes and with among-site rate heterogeneity. Under these conditions, we found that only a very small percentage of simulated ancestral eukaryotic proteins was affected by the Blast artifact. We show that the good detectability of Blast is due to the heterogeneity of protein evolutionary rates at different sites, since only a small conserved motif in a sequence suffices to detect its homologues. Our results indicate that Blast, at least when applied within eukaryotes, only misses homologues of extremely fast-evolving sequences, which are rare in the mammalian genome, as well as sequences evolving homogeneously or pseudogenes.
Although great care should be exercised in the recognition of remote homologues, most functional mammalian genes can be detected in eukaryotic genomes by Blast. That is, the majority of functional mammalian genes are not as fast as for not being detected in other metazoans, fungi or plants, if they had been present in these organisms. Thus, the correlation previously found between age and rate seems not to be due to a pure Blast artifact, at least for mammals. This may have important implications to understand the mechanisms by which novel genes originate.
BMC Evolutionary Biology 02/2007; 7:53. · 3.52 Impact Factor