Effects of GC content and mutational pressure on the lengths of exons and coding sequences

Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
Journal of Molecular Evolution (Impact Factor: 1.86). 03/2003; 56(3):362-70. DOI: 10.1007/s00239-002-2406-1
Source: PubMed

ABSTRACT It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
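
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names are hypothetical, the stop-codon calculation assumes independent bases with P(A) = P(T) and P(G) = P(C) under the standard genetic code, and the partial correlation uses the textbook first-order formula, which may differ from the procedure used in the paper.

```python
from math import sqrt


def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return 100.0 * (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def stop_codon_probability(gc_fraction):
    """Chance that a random codon is TAA, TAG or TGA.

    Assumption (not from the paper): bases are independent,
    with P(A) = P(T) and P(G) = P(C).
    """
    at = (1.0 - gc_fraction) / 2.0  # P(A) = P(T)
    gc = gc_fraction / 2.0          # P(G) = P(C)
    # TAA contributes at^3; TAG and TGA each contribute at^2 * gc.
    return at ** 3 + 2.0 * at ** 2 * gc


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2.0 + 1.0
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))


if __name__ == "__main__":
    # Made-up toy sequences, for illustration only.
    exons = ["ATGAAATTTTAA", "ATGGCCGCGGGCTGA", "ATGGCGCCGGGCGCCGGGTAG"]
    lengths = [len(e) for e in exons]
    gc_values = [gc_percent(e) for e in exons]
    print("Spearman(length, %GC):", spearman(lengths, gc_values))
    for g in (0.25, 0.68):
        print("GC =", g, "P(stop codon) =", stop_codon_probability(g))
```

Under these assumptions the stop-codon probability falls as GC content rises, which is the mechanism behind the hypothesis being tested. To mirror the genome-level analysis, `partial_correlation` would take the pairwise correlations among average CDS length, genomic GC content, and genome size.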

Available from: Xuhua Xia, Aug 21, 2015
    • "The sequences were aligned in Geneious 5.6 (Biomatters, Auckland, New Zealand) and substitution saturation at different codon positions of the coding regions was detected using the Xia et al. (2003) test performed in DAMBE5 (Xia 2013). Appropriate models of evolution for each gene partition were determined in jModelTest 2.1.4 "
    ABSTRACT: Callistomys pictus is an arboreal echimyid rodent, and the only living species in this genus. It is endemic to a very small Atlantic Forest region in the state of Bahia, east Brazil. Here we used DNA sequences from 4 genes to infer the phylogenetic position of Callistomys within Echimyidae. The results show that Callistomys forms a clade with the semi-aquatic coypu (Myocastor) from the grasslands of southern South America and terrestrial spiny rats (Proechimys) from the Amazon forest, but the relationships among these three genera are uncertain. This clade is sister to Thrichomys, a terrestrial rat from the dry lands of central South America. These clades are unexpected, given the contrasting morphology, ecology, and geographic ranges of their members. The resulting echimyid phylogeny indicates that Callistomys is not closely related to the other arboreal echimyids, and suggests that arboreal habits evolved more than once in this family.
    • "They suggested that the longest coding sequences/exons in vertebrates are GC rich, while the shortest ones are GC-poor. Subsequently, Xia et al. (2003) described positive correlations between GC content and coding regions (CDS) lengths in 68 genomes. It was later shown that highly expressed rice and human GC-rich genes have significantly more and longer introns than lowly expressed genes, whereas their average exon length per gene is significantly lower. "
    ABSTRACT: Background. The GC-content in the third codon position (GC3) exhibits a unimodal distribution in many plant and animal genomes. Interestingly, grasses and homeotherm vertebrates exhibit a unique bimodal distribution. High GC3 was previously found to be associated with variable expression, higher frequency of upstream TATA boxes, and an increase of GC3 from 5' to 3'. Moreover, GC3-rich genes are predominant in certain gene classes and are enriched in CpG dinucleotides that are potential targets for methylation. Based on the GC3 bimodal distribution we hypothesize that GC3 has a regulatory role involving methylation and gene expression. To test that hypothesis, we selected diverse taxa (rice, thale cress, bee, and human) that varied in the modality of their GC3 distribution and tested the association between GC3, DNA methylation and gene expression. Results. We examine the relationship between cytosine methylation levels and GC3, gene expression, genome signature, gene length, and other gene compositional features. We find a strong negative correlation (Pearson's correlation coefficient r = -0.67, p-value < 0.0001) between GC3 and genic CpG methylation. The comparison between 5'-3' gradients of CG3-skew and genic methylation for the taxa in the study suggests interplay between gene-body methylation and transcription-coupled cytosine deamination effect. Conclusions. Compositional features are correlated with methylation levels of genes in rice, thale cress, human, bee and fruit fly (which acts as an unmethylated control). These patterns allow us to generate evolutionary hypotheses about the relationship between GC3 and methylation and how these affect expression patterns. Specifically, we propose that the opposite effects of methylation and compositional gradients along coding regions of GC3-poor and GC3-rich genes are the products of several competing processes.
    Genome Biology and Evolution 07/2013; 5(8). DOI:10.1093/gbe/evt103 · 4.53 Impact Factor
    • "We analyzed the first, internal, and last exons separately. If exon length is different because of AT-biased mutations leading to a gain of stop codons and thus a shortening of the transcript, then a significant difference in GC content of those longer transcripts should be primarily found in the last exon (Xia et al. 2003). The average percent GC content was used for genes with multiple internal exons. "
    ABSTRACT: Studies have indicated that exon and intron size and intergenic distance are correlated with gene expression levels and expression breadth. Previous reports on these correlations in plants and animals have been conflicting. In this study, next-generation sequence data, which have been shown to be more sensitive than previous expression profiling technologies, were generated and analyzed from 14 tissues. Our results revealed a novel dichotomy. At the low expression level, an increase in expression breadth correlated with an increase in transcript size because of an increase in the number of exons and introns. No significant changes in intron or exon sizes were noted. Conversely, genes expressed at the intermediate to high expression levels displayed a decrease in transcript size as their expression breadth increased. This was due to smaller exons, with no significant change in the number of exons. Taking advantage of the known gene space of soybean, we evaluated the positioning of genes and found significant clustering of similarly expressed genes. Identifying the correlations between the physical parameters of individual genes could lead to uncovering the role of regulation owing to nucleotide composition, which might have potential impacts in discerning the role of the noncoding regions.
    Genome 12/2010; 54(1):10-18. · 1.56 Impact Factor