Effects of GC content and mutational pressure on the lengths of exons and coding sequences.

Department of Biology, University of Ottawa, Ottawa, Ontario, Canada K1N 6N5.
Journal of Molecular Evolution (Impact Factor: 2.15). 03/2003; 56(3):362-70. DOI: 10.1007/s00239-002-2406-1
Source: PubMed

ABSTRACT It has been hypothesized that the length of an exon tends to increase with GC content because stop codons are AT-rich and should occur less frequently in GC-rich exons. This prediction assumes that mutation pressure plays a significant role in the occurrence and distribution of stop codons. However, the prediction applies not to all exons, but only to the last coding exon of a gene and to single-exon CDS sequences. We classified exons in multiexon genes from eight eukaryotic species into three groups (first, internal, and last exons) and computed the Spearman correlation between exon length and percentage GC (%GC) for each group. In only five of the species studied is the correlation for the last coding exon greater than that for the first or internal exons. For the single-exon CDS sequences, the correlation between CDS length and %GC is mostly negative. Thus, eukaryotic genomes do not support the predicted relationship between exon length and %GC. In prokaryotes, CDS length and %GC are positively correlated in each of the 68 completely sequenced genomes in GenBank, whose genomic GC contents vary from 25 to 68%, except for the wall-less Mycoplasma genitalium and the syphilis pathogen Treponema pallidum. Moreover, the average CDS length and the genomic GC content are also positively correlated. After correcting for genome size, the partial correlation between the average CDS length and the genomic GC content is 0.3217 (p < 0.025).
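The two statistics named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The function names are hypothetical, and the abstract does not say which pairwise coefficient feeds the partial correlation, so `partial_correlation` simply takes three pairwise coefficients as inputs and applies the standard first-order formula.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y, controlling for z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical illustration data, not taken from the paper.
exon_lengths = [120, 95, 310, 150, 210]
exon_gc_percent = [41.0, 38.5, 55.2, 47.3, 44.9]
print(spearman(exon_lengths, exon_gc_percent))
```

In the setting of the abstract, x would be the average CDS length, y the genomic GC content, and z the genome size.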

  • ABSTRACT: Callistomys pictus is an arboreal echimyid rodent and the only living species in its genus. It is endemic to a very small Atlantic Forest region in the state of Bahia, eastern Brazil. Here we used DNA sequences from 4 genes to infer the phylogenetic position of Callistomys within Echimyidae. The results show that Callistomys forms a clade with the semi-aquatic coypu (Myocastor) from the grasslands of southern South America and the terrestrial spiny rats (Proechimys) from the Amazon forest, but the relationships among these three genera are uncertain. This clade is sister to Thrichomys, a terrestrial rat from the dry lands of central South America. These clades are unexpected, given the contrasting morphology, ecology, and geographic ranges of their members. The resulting echimyid phylogeny indicates that Callistomys is not closely related to the other arboreal echimyids and suggests that arboreal habits evolved more than once in this family.
    Natureza On Line. 11/2014; 12:132-136.
  • ABSTRACT: In this report we present the results of the analysis of approximately 2.7 Mb of genomic information for the American mink (Neovison vison) derived through BAC end sequencing. Our study, which encompasses approximately 1/1000th of the mink genome, suggests that simple sequence repeats (SSRs) are less common in the mink than in the human genome, whereas the average GC content of the mink genome is slightly higher than that of its human counterpart. The 2.7 Mb mink genomic dataset also contained 2,416 repeat elements (retroids and DNA transposons) occupying almost 31% of the sequence space. Among repeat elements, LINEs were over-represented and endogenous viruses (aka LTRs) under-represented in comparison to the human genome. Finally, we present a virtual map of the mink genome constructed with reference to the human and canine genome assemblies using a comparative genomics approach and incorporating over 200 mink BESs with unique hits to the human genome.
    Genes & genomics 02/2012; 34(1). · 0.50 Impact Factor
  • ABSTRACT: A frequently used approach for detecting potential coding regions is to search for stop codons. In the standard genetic code, 3 out of 64 trinucleotides are stop codons. Hence, in random or non-coding DNA one can expect every 21st trinucleotide to have the same sequence as a stop codon. In contrast, the open reading frames (ORFs) of most protein-coding genes are considerably longer. Thus, the stop codon frequency in coding sequences deviates from the background frequency of the corresponding trinucleotides. This has been utilized for gene prediction, in particular for detecting protein-coding ORFs. Traditional methods based on stop codon frequency assume that the GC content is about 50%. However, many genomes show significant deviations from that value. With the presented method we can describe the effects of GC content on the selection of appropriate length thresholds for potentially coding ORFs. Conversely, for a given length threshold, we can calculate the probability of observing it in a random sequence. Thus, we can derive the maximum GC content for which ORF length is practicable as a feature for gene prediction methods, and the resulting false positive rates. A rough estimate for an upper limit is a GC content of 80%. This estimate can be made more precise by including further parameters and by taking start codons into account as well. We demonstrate the feasibility of this method by applying it to the genomes of the bacteria Rickettsia prowazekii, Escherichia coli and Caulobacter crescentus, exemplifying the effect of GC content variations according to our predictions. We have adapted the method for predicting coding ORFs by stop codon frequency to the case of GC contents different from 50%. Usually, several methods for gene finding need to be combined. Thus, our results concern a specific part within a package of methods. Interestingly, for genomes with low GC content such as that of R. prowazekii, the presented method provides remarkably good results even when applied alone.
    Gene 12/2012; 511(2):441-446. · 2.20 Impact Factor
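The dependence of stop codon frequency on GC content described in the last abstract above can be sketched in a few lines of Python. This is an illustrative model only, not the published method: it assumes independent bases with P(G) = P(C) = GC/2 and P(A) = P(T) = (1 - GC)/2, ignores start codons, and all function names are hypothetical.

```python
def stop_codon_probability(gc):
    """Chance that a random trinucleotide is TAA, TAG or TGA.

    Assumes independent bases with P(G) = P(C) = gc / 2 and
    P(A) = P(T) = (1 - gc) / 2, where gc is a fraction in [0, 1].
    """
    if not 0.0 <= gc <= 1.0:
        raise ValueError("gc must be a fraction between 0 and 1")
    p_at = (1.0 - gc) / 2.0  # probability of A, and likewise of T
    p_g = gc / 2.0           # probability of G
    # TAA contributes p_at**3; TAG and TGA each contribute p_at**2 * p_g.
    return p_at ** 3 + 2.0 * p_at ** 2 * p_g


def expected_codons_to_stop(gc):
    """Mean number of trinucleotides read up to and including the first stop."""
    p = stop_codon_probability(gc)
    return float("inf") if p == 0.0 else 1.0 / p


def prob_no_stop(gc, n_codons):
    """Chance that n_codons consecutive random trinucleotides hold no stop."""
    return (1.0 - stop_codon_probability(gc)) ** n_codons


# Stop-free runs of 100 codons become more likely by chance as GC rises.
for gc in (0.3, 0.5, 0.7):
    print(gc, expected_codons_to_stop(gc), prob_no_stop(gc, 100))
```

At 50% GC the model gives a stop codon probability of 3/64, matching the "every 21st trinucleotide" figure in the abstract. Under these assumptions, higher GC values make long stop-free runs more likely by chance, which is why a fixed ORF length threshold becomes less discriminating in GC-rich genomes.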
