Effects of GC Content and Mutational Pressure on the Lengths of Exons and Coding Sequences

Department of Biology, University of Ottawa, Ottawa, Ontario, Canada
Journal of Molecular Evolution (Impact Factor: 1.68). 03/2003; 56(3):362-70. DOI: 10.1007/s00239-002-2406-1
Source: PubMed


It has been hypothesized that the length of an exon tends to increase with the GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction is applicable not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes in eight eukaryotic species into three groups (the first exon, the internal exons, and the last exon) and computed the Spearman correlation between the exon length and the percentage GC (%GC) for each of the three groups. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotic genomes, CDS length and %GC are positively correlated in each of the 68 completely sequenced prokaryotic genomes in GenBank with genomic GC contents varying from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
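The quantities named in the abstract can be illustrated with a short Python sketch. It is not the authors' code. The function names are hypothetical, and the stop-codon calculation assumes a deliberately simple model that is not taken from the paper: bases are independent, with P(G) = P(C) and P(A) = P(T). Under that assumption the chance that a random codon is TAA, TAG, or TGA falls as GC content rises, which is the intuition behind the hypothesis being tested. The Spearman and first-order partial correlation functions use the standard textbook formulas.

```python
import math

def gc_percent(seq):
    """Percentage of G and C among the A/C/G/T characters of seq."""
    s = seq.upper()
    acgt = sum(s.count(b) for b in "ACGT")
    if acgt == 0:
        raise ValueError("sequence has no A/C/G/T characters")
    return 100.0 * (s.count("G") + s.count("C")) / acgt

def stop_codon_probability(gc):
    """P(random codon is TAA, TAG or TGA) under the ASSUMED model:
    independent bases, P(G) = P(C) = gc / 2, P(A) = P(T) = (1 - gc) / 2,
    with gc given as a fraction in [0, 1]."""
    at = (1.0 - gc) / 2.0
    g = gc / 2.0
    return at * at * at + 2.0 * at * at * g  # TAA, plus TAG and TGA

def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z,
    e.g. x = average CDS length, y = genomic GC, z = genome size."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Under the assumed model, `stop_codon_probability(0.5)` equals 3/64, and the value shrinks as `gc` increases. To mirror the analysis described above, one would call `spearman(lengths, gc_values)` separately for first, internal, and last exons, where both lists are hypothetical per-exon measurements.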

Available from: Xuhua Xia
    • "The sequences were aligned in Geneious 5.6 (Biomatters, Auckland, New Zealand) and substitution saturation in different codon position for coding regions was detected using Xia et al. (2003) test performed in DAMBE5 (Xia 2013). Appropriate models of evolution for each gene partition were determined in jModelTest 2.1.4 "
    ABSTRACT: Callistomys pictus is an arboreal echimyid rodent, and the only living species in this genus. It is endemic to a very small Atlantic Forest region in the state of Bahia, east Brazil. Here we used DNA sequences from 4 genes to infer the phylogenetic position of Callistomys within Echimyidae. The results show that Callistomys forms a clade with the semi-aquatic coypu (Myocastor) from the grasslands of southern South America and terrestrial spiny rats (Proechimys) from the Amazon forest, but the relationships among these three genera are uncertain. This clade is sister to Thrichomys, a terrestrial rat from the dry lands of central South America. These clades are unexpected, given the contrasting morphology, ecology, and geographic ranges of their members. The resulting echimyid phylogeny indicates that Callistomys is not closely related to the other arboreal echimyids, and suggests that arboreal habits evolved more than once in this family.
    • "They suggested that the longest coding sequences/exons in vertebrates are GC rich, while the shortest ones are GC-poor. Subsequently, Xia et al. (2003) described positive correlations between GC content and coding regions (CDS) lengths in 68 genomes. It was later shown that highly expressed rice and human GC-rich genes have significantly more and longer introns than lowly expressed genes, whereas their average exon length per gene is significantly lower. "
    ABSTRACT: Background. The GC-content in the third codon position (GC3) exhibits a unimodal distribution in many plant and animal genomes. Interestingly, grasses and homeotherm vertebrates exhibit a unique bimodal distribution. High GC3 was previously found to be associated with variable expression, higher frequency of upstream TATA boxes, and an increase of GC3 from 5' to 3'. Moreover, GC3-rich genes are predominant in certain gene classes and are enriched in CpG dinucleotides that are potential targets for methylation. Based on the GC3 bimodal distribution we hypothesize that GC3 has a regulatory role involving methylation and gene expression. To test that hypothesis, we selected diverse taxa (rice, thale cress, bee, and human) that varied in the modality of their GC3 distribution and tested the association between GC3, DNA methylation and gene expression. Results. We examine the relationship between cytosine methylation levels and GC3, gene expression, genome signature, gene length, and other gene compositional features. We find a strong negative correlation (Pearson's correlation coefficient r=-0.67, p-value <0.0001) between GC3 and genic CpG methylation. The comparison between 5'-3' gradients of CG3-skew and genic methylation for the taxa in the study suggests interplay between gene-body methylation and transcription-coupled cytosine deamination effect. Conclusions. Compositional features are correlated with methylation levels of genes in rice, thale cress, human, bee and fruit fly (which acts as an unmethylated control). These patterns allow us to generate evolutionary hypotheses about the relationship between GC3 and methylation and how these affect expression patterns. Specifically, we propose that the opposite effects of methylation and compositional gradients along coding regions of GC3-poor and GC3-rich genes are the products of several competing processes.
    Genome Biology and Evolution 07/2013; 5(8). DOI:10.1093/gbe/evt103 · 4.23 Impact Factor
    • "Different exons in the same gene can be under different selection and mutation pressure and consequently may warrant separate sequence analyses (Xia, Xie, Li 2003). With DAMBE, one can extract first exons, middle exons, and last exons from multiexon coding sequences in an annotated GenBank files. "
    ABSTRACT: Since its first release in 2001 as mainly a software package for phylogenetic analysis, DAMBE has gained many new functions that may be classified into 6 categories: 1) sequence retrieval, editing, manipulation and conversion among more than 20 standard sequence formats including MEGA, NEXUS, PHYLIP, GenBank and the new NeXML format for interoperability, 2) motif characterization and discovery functions such as position weight matrix and Gibbs sampler, 3) descriptive genomic analysis tools with improved versions of codon adaptation index, effective number of codons, protein isoelectric point profiling, RNA and protein secondary structure prediction and calculation of minimum folding energy, and genomic skew plots with optimized window size, 4) molecular phylogenetics including sequence alignment, testing substitution saturation, distance-based, maximum parsimony and maximum likelihood methods for tree reconstructions, testing the molecular clock hypothesis with either a phylogeny or with relative-rate tests, dating gene duplication and speciation events, choosing the best-fit substitution models, and estimating rate heterogeneity over sites, 5) phylogeny-based comparative methods for continuous and discrete variables, and 6) graphic functions including secondary structure display, optimized skew plot, hydrophobicity plot as well as many other plots of amino acid properties along a protein sequence, tree display and drawing by dragging nodes to each other, and visual searching of the maximum parsimony tree. DAMBE features a graphic, user-friendly and intuitive interface, and is freely available from http://dambe.bio.uottawa.ca.
    Molecular Biology and Evolution 04/2013; 30(7). DOI:10.1093/molbev/mst064 · 9.11 Impact Factor