ABSTRACT: Orthologous introns have identical positions relative to the coding sequence in orthologous genes of different species. By analyzing the complete genomes of five plants we generated a database of 40,512 orthologous intron groups of dicotyledonous plants, 28,519 orthologous intron groups of angiosperms, and 15,726 of land plants (moss and angiosperms). Multiple sequence alignments of each orthologous intron group were obtained using the Mafft algorithm. The number of conserved regions in plant introns appeared to be hundreds of times fewer than in mammals or vertebrates. Approximately three quarters of conserved intronic regions among angiosperms and dicots, in particular, correspond to alternatively spliced exonic sequences. We registered only a handful of conserved intronic ncRNAs of flowering plants. However, the most evolutionarily conserved intronic region, which is ubiquitous in all plants examined in this study, including moss, possessed multiple structural features of tRNAs, which caused us to classify it as a putative tRNA-like ncRNA. Intronic sequences encoding tRNA-like structures are not unique to plants. Bioinformatics examination of the presence of tRNA inside introns revealed an unusually long-term association of four glycine tRNAs inside the Vac14 gene of fish, amniotes, and mammals.
ABSTRACT: Mammalian genomes are replete with millions of polymorphic sites, among which those genetic variants that are co-located on the same chromosome and exist close to one another form blocks of closely linked mutations known as haplotypes. The linkage within haplotypes is constantly disrupted by meiotic recombination events. Whole ensembles of such numerous haplotypes are subjected to evolutionary pressure, where mutations influence each other and should be considered as a whole entity: a gigantic matrix, unique for each individual genome. This idea was implemented in a computational approach, named Genome Evolution by Matrix Algorithms (GEMA), to model genomic changes taking into account all mutations in a population. GEMA has been tested for modeling of entire human chromosomes. The program can precisely mimic real biological processes that influence genome evolution, such as: 1) authentic arrangements of genes and functional genomic elements; 2) frequencies of various types of mutations in different nucleotide contexts; 3) non-random distribution of meiotic recombination events along chromosomes. Computer modeling with GEMA has demonstrated that the number of meiotic recombination events per gamete is among the most crucial factors influencing population fitness. In humans, these recombinations create a gamete genome consisting, on average, of 48 pieces of corresponding parental chromosomes. Such a highly mosaic gamete structure preserves the fitness of a population under the intense influx of novel mutations (40 per individual) even when mutations with deleterious effects are up to ten times more abundant than those with beneficial effects.
ABSTRACT: Background / Purpose:
We performed a bioinformatics investigation of the biased gene conversion hypothesis using the genotype data available from the 1000 Genomes database. Specifically, we chose four genes historically studied at our lab and identified groups of mutations forming haplotypes. Possible cases of short-scale recombinations of haplotypes were examined, and the outcome was tested for quantitative agreement with the biased gene conversion hypothesis, which states that such short-scale recombinations, or heteroduplexes, should cause a significant shift towards GC richness of the human genome.
In the light of our data, we conjecture that heteroduplexes do not cause a significant shift towards GC richness. Our results, which indicate an opposite bias towards AT richness in human genes, suggest that the biased gene conversion hypothesis is highly unlikely to explain the formation of GC-rich isochores.
ABSTRACT: Two factors are thought to have contributed to the origin of codon usage bias in eukaryotes: 1) genome-wide mutational forces that shape overall GC-content and create context-dependent nucleotide bias, and 2) positive selection for codons that maximize efficient and accurate translation. Particularly in vertebrates, these two explanations contradict each other and cloud the origin of codon bias in the taxon. On the one hand, mutational forces fail to explain GC-richness (~60%) of third codon positions, given the GC-poor overall genomic composition among vertebrates (~40%). On the other hand, positive selection cannot easily explain strict regularities in codon preferences. Large-scale bioinformatic assessment of nucleotide composition of coding and non-coding sequences in vertebrates and other taxa suggests a simple possible resolution for this contradiction. Specifically, we propose that the last common vertebrate ancestor had a GC-rich genome (~65% GC). The data suggest that whole-genome mutational bias is the major driving force for generating codon bias. As the bias becomes prominent, it begins to affect translation and can result in positive selection for optimal codons. The positive selection can, in turn, significantly modulate codon preferences.
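The contrast drawn above between GC-content at third codon positions and overall GC-content can be illustrated with a minimal Python sketch. The function names and the toy sequence are assumptions made for illustration; they are not taken from the paper, and the authors' actual pipeline is not described in the abstract.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def gc3_content(cds: str) -> float:
    """GC fraction at third codon positions of an in-frame coding sequence.

    Assumes cds starts at codon position 1; trailing bases that do not
    complete a codon are ignored.
    """
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3
    third_positions = cds[2:usable:3]
    return gc_content(third_positions)


# Toy example (hypothetical sequence): codons ATG GCC AAA TTC
toy_cds = "ATGGCCAAATTC"
overall = gc_content(toy_cds)
third = gc3_content(toy_cds)
```

For real data, the same two functions would be applied to annotated coding sequences and to the surrounding non-coding sequence, so that the two percentages can be compared gene by gene.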
ABSTRACT: Messenger RNA sequences possess specific nucleotide patterns distinguishing them from non-coding genomic sequences. In this study, we explore the utilization of modified Markov models to analyze sequences up to 44 bp, far beyond the 8-bp limit of conventional Markov models, for exon/intron discrimination. In order to analyze nucleotide sequences of this length, their information content is first reduced by conversion into shorter binary patterns via the application of numerous abstraction schemes. After the conversion of genomic sequences to binary strings, homogeneous Markov models trained on the binary sequences are used to discriminate between exons and introns. We term this approach the Binary Abstraction Markov Model (BAMM). High-quality abstraction schemes for exon/intron discrimination are selected using optimization algorithms on supercomputers. The best MM classifiers are then combined using support vector machines into a single classifier. With this approach, over 95% classification accuracy is achieved without taking reading frame into account. With further development, the BAMM approach can be applied to sequences lacking the genetic code such as ncRNAs and 5′-untranslated regions.
Full-text · Article · Feb 2012 · Nucleic Acids Research
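The general idea of the abstract above (abstract nucleotides into a binary string, then compare Markov models trained on exons and on introns) can be sketched as follows. This is only an illustration under stated assumptions: the purine/pyrimidine mapping, the model order, the pseudocounts, and all function names are hypothetical, and the published abstraction schemes, optimization step, and SVM combination are not reproduced here.

```python
import math
from collections import defaultdict

# Hypothetical abstraction scheme: purines -> 1, pyrimidines -> 0.
PURINE_SCHEME = {"A": "1", "G": "1", "C": "0", "T": "0"}


def abstract(seq, scheme=PURINE_SCHEME):
    """Convert a nucleotide sequence to a binary string; unknown bases are skipped."""
    return "".join(scheme[base] for base in seq.upper() if base in scheme)


def train_markov(binary_seqs, order=3):
    """Homogeneous Markov model: log P(next bit | previous `order` bits)."""
    counts = defaultdict(lambda: [1, 1])  # add-one pseudocounts for bits 0 and 1
    for s in binary_seqs:
        for i in range(order, len(s)):
            counts[s[i - order:i]][int(s[i])] += 1
    return {
        ctx: (math.log(c0 / (c0 + c1)), math.log(c1 / (c0 + c1)))
        for ctx, (c0, c1) in counts.items()
    }


def log_likelihood(binary, model, order=3):
    """Sum of log-probabilities; contexts unseen in training score as uniform."""
    uniform = math.log(0.5)
    total = 0.0
    for i in range(order, len(binary)):
        probs = model.get(binary[i - order:i])
        total += probs[int(binary[i])] if probs else uniform
    return total


def classify(seq, exon_model, intron_model, order=3):
    """Label a sequence by whichever model gives the higher log-likelihood."""
    binary = abstract(seq)
    exon_ll = log_likelihood(binary, exon_model, order)
    intron_ll = log_likelihood(binary, intron_model, order)
    return "exon" if exon_ll > intron_ll else "intron"


# Toy training data (hypothetical, not real exons or introns).
exon_model = train_markov([abstract("AGAGAGAGAGAG")])
intron_model = train_markov([abstract("CTCTCTCTCTCT")])
label = classify("GGAAGGAA", exon_model, intron_model)
```

Working on a two-letter alphabet is what keeps the context table small: an order-k binary model has 2^k contexts instead of 4^k, which is one plausible reading of how longer windows become tractable.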
ABSTRACT: Non-coding genomic regions in complex eukaryotes, including intergenic areas, introns, and untranslated segments of exons, are profoundly non-random in their nucleotide composition and consist of a complex mosaic of sequence patterns. These patterns include so-called Mid-Range Inhomogeneity (MRI) regions -- sequences 30-10000 nucleotides in length that are enriched in a particular base or combination of bases (e.g. (G+T)-rich, purine-rich, etc.). MRI regions are associated with unusual (non-B-form) DNA structures that are often involved in regulation of gene expression, recombination, and other genetic processes (Fedorova & Fedorov 2010). The existence of a strong fixation bias within MRI regions against mutations that tend to reduce their sequence inhomogeneity additionally supports the functionality and importance of these genomic sequences (Prakash et al. 2009).
Here we demonstrate a freely available Internet resource -- the Genomic MRI program package -- designed for computational analysis of genomic sequences in order to find and characterize various MRI patterns within them (Bechtel et al. 2008). This package also allows generation of randomized sequences with various properties and level of correspondence to the natural input DNA sequences. The main goal of this resource is to facilitate examination of vast regions of non-coding DNA that are still scarcely investigated and await thorough exploration and recognition.
Full-text · Article · May 2011 · Journal of Visualized Experiments
ABSTRACT: Little is known about pre-mRNA splicing in Dictyostelium discoideum although its genome has been completely sequenced. Our analysis suggests that pre-mRNA splicing plays an important role in D. discoideum gene expression as two thirds of its genes contain at least one intron. Ongoing curation of the genome to date has revealed 40 genes in D. discoideum with clear evidence of alternative splicing, supporting the existence of alternative splicing in this unicellular organism. We identified 160 candidate U2-type spliceosomal proteins and related factors in D. discoideum based on 264 known human genes involved in splicing. Spliceosomal small ribonucleoproteins (snRNPs), PRP19 complex proteins and late-acting proteins are highly conserved in D. discoideum and throughout the metazoa. In non-snRNP and hnRNP families, D. discoideum orthologs are closer to those in A. thaliana, D. melanogaster and H. sapiens than to their counterparts in S. cerevisiae. Several splicing regulators, including SR proteins and CUG-binding proteins, were found in D. discoideum, but not in yeast. Our comprehensive catalog of spliceosomal proteins provides useful information for future studies of splicing in D. discoideum where the efficient genetic and biochemical manipulation will also further our general understanding of pre-mRNA splicing.
ABSTRACT: Multicellular eukaryotic genomes are replete with nonprotein coding sequences, both within genes (introns) and between them (intergenic regions). Excluding the well-recognized functional elements within these sequences (ncRNAs, transcription factor binding sites, intronic enhancers/silencers, etc.), the remaining portion is made up of so-called "dark" DNA, which still occupies the majority of the genome. This dark DNA has a profound nonrandomness in its sequence composition seen at different scales, from a few nucleotides to regions that span over hundreds of thousands of nucleotides. At the mid-range scale (from 30 up to 10,000 nt), this nonrandomness is manifested in base compositional extremes detected for each of four nucleotides (A, G, T, or C) or any of their combinations. Examples of such compositional nonrandomness are A-rich, purine-rich, or G+T-rich regions. Almost every combination of nucleotides has such enriched regions. We refer to these regions as being "inhomogeneous". These regions are associated with unusual DNA conformations and/or particular DNA properties. In particular, mid-range inhomogeneous regions have complex arrangements relative to each other and to specific genomic sites, such as centromeres, telomeres, and promoters, pointing to their important role in genomic functioning and organization.
Full-text · Article · Apr 2011 · The Scientific World Journal
ABSTRACT: It has been widely acknowledged that non-coding RNAs are master-regulators of genomic functions. However, the significance of the presence of ncRNA within introns has not received proper attention. ncRNA within introns are commonly produced through the post-splicing process and are specific signals of gene transcription events, impacting many other genes and modulating their expression. This study, along with the following discussion, details the association of thousands of ncRNAs (snoRNA, miRNA, siRNA, piRNA and long ncRNA) within human introns. We propose that such an association between human introns and ncRNAs has a pronounced synergistic effect with important implications for fine-tuning gene expression patterns across the entire
Full-text · Article · Nov 2010 · Nucleic Acids Research
ABSTRACT: In mammals a considerable 92% of genes contain introns, with hundreds of these introns exceeding 50,000 nucleotides in length. These "large introns" must be spliced out of the pre-mRNA in a timely fashion, which involves bringing together distant 5' donor and 3' acceptor splice sites. In invertebrates, especially Drosophila, it has been shown that larger introns can be spliced efficiently through a process known as recursive splicing, a consecutive splicing from the 5'-end at a series of combined donor-acceptor splice sites called RP-sites. Using a computational analysis of the genomic sequences, we show that vertebrates lack the proper enrichment of RP-sites in their large introns and, therefore, require some other method to aid splicing. We analyzed over 15,000 non-redundant, large introns from six mammals, 1,600 from chicken and zebrafish, and 560 non-redundant large introns from five invertebrates. Our bioinformatic investigation demonstrates that, unlike the studied invertebrates, the studied vertebrate genomes contain consistently abundant direct and complementary strand interspersed repetitive elements (mainly SINEs and LINEs) that may form stems with each other in large introns. This examination showed that predicted stems are indeed abundant and stable in the large introns of mammals. We hypothesize that such stems with long loops within large introns allow intron splice sites to find each other more quickly by folding the intronic RNA upon itself at smaller intervals and, thus, reducing the distance between donor and acceptor sites.
ABSTRACT: Mid-range inhomogeneity or MRI is the significant enrichment of particular nucleotides in genomic sequences extending from 30 up to several thousands of nucleotides. The best-known manifestation of MRI is CpG islands representing CG-rich regions. Recently it was demonstrated that MRI could be observed not only for G+C content but also for all other nucleotide pairings (e.g. A+G and G+T) as well as for individual bases. Various types of MRI regions are 4-20 times enriched in mammalian genomes compared to their occurrences in random models.
This paper explores how different types of mutations change MRI regions. Human, chimpanzee and Macaca mulatta genomes were aligned to study the projected effects of substitutions and indels on human sequence evolution within both MRI regions and control regions of average nucleotide composition. Over 18.8 million fixed point substitutions, 3.9 million SNPs, and indels spanning 6.9 Mb were procured and evaluated in human. They include 1.8 Mb substitutions and 1.9 Mb indels within MRI regions. Ancestral and mutant (derived) alleles for substitutions have been determined. Substitutions were grouped according to their fixation within human populations: fixed substitutions (from the human-chimp-macaca alignment), major SNPs (> 80% mutant allele frequency within humans), medium SNPs (20% - 80% mutant allele frequency), minor SNPs (3% - 20%), and rare SNPs (<3%). Data on short (< 3 bp) and medium-length (3 - 50 bp) insertions and deletions within MRI regions and appropriate control regions were analyzed for the effect of indels on the expansion or diminution of such regions as well as on changing nucleotide composition.
MRI regions have comparable levels of de novo mutations to the control genomic sequences with average base composition. De novo substitutions rapidly erode MRI regions, bringing their nucleotide composition toward genome-average levels. However, those substitutions that favor the maintenance of MRI properties have a higher chance to spread through the entire population. Indels have a clear tendency to maintain MRI features yet they have a smaller impact than substitutions. All in all, the observed fixation bias for mutations helps to preserve MRI regions during evolution.
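The frequency classes used above to group substitutions can be written down as a small Python function. This is a sketch under assumptions: in the paper, fixed substitutions come from the human-chimp-macaca alignment rather than from a frequency of exactly 100%, and the abstract does not say how values falling exactly on the 3%, 20%, or 80% boundaries were assigned. The function name and the boundary choices below are hypothetical.

```python
def classify_substitution(derived_freq: float) -> str:
    """Assign a substitution to a fixation class by derived-allele frequency.

    derived_freq is the mutant (derived) allele frequency in humans, in [0, 1].
    Assumptions: a frequency of exactly 1.0 is treated as fixed, and boundary
    values (0.03, 0.20, 0.80) fall into the medium or minor classes as coded.
    """
    if not 0.0 <= derived_freq <= 1.0:
        raise ValueError("derived_freq must be between 0 and 1")
    if derived_freq == 1.0:
        return "fixed"
    if derived_freq > 0.80:
        return "major SNP"
    if derived_freq >= 0.20:
        return "medium SNP"
    if derived_freq >= 0.03:
        return "minor SNP"
    return "rare SNP"


# Example: tally classes for a few hypothetical frequencies.
example_freqs = [1.0, 0.92, 0.5, 0.1, 0.01]
example_classes = [classify_substitution(f) for f in example_freqs]
```

Counting, within each class, how many substitutions move an MRI region toward or away from its enriched composition is then enough to compare de novo erosion with the fixation bias described above.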
ABSTRACT: The human genome contains hundreds of small non-protein-coding RNA molecules, one group of which are the small nucleolar RNAs (snoRNAs). These molecules bind specifically to other cellular RNA targets via base pairing to form short, double-stranded structures. This binding causes the snoRNA targets to undergo specific chemical modifications. There are a number of (orphan) snoRNAs whose targets are still unknown; yet, because their removal causes genetic diseases such as Prader-Willi Syndrome, they clearly seem to play an important cellular function. In this project we aimed to computationally predict targets for a specific group of orphan snoRNAs of human and mouse (known as HBII-85 and MBII-85, respectively) that are known to be directly involved in the development of Prader-Willi Syndrome. To fulfill this task we modified our previously published snoTARGET program to search for targets in the entire set of human and mouse genomic sequences. Then we generated a computational pipeline to characterize targets common to these two species. This approach resulted in the discovery of dozens of putative HBII-85/MBII-85 targets within the evolutionarily conserved segments of mRNAs, introns, and intergenic regions. Several of these targets are located within mammalian-wide evolutionarily conserved sequences that have distinctive secondary structures detected by the Evofold program. These targets are the primary objects for further experimental validation of our findings. Such validation could enhance the understanding of the function and clinical relevance of this group of snoRNAs and could open novel modes of intervention for arresting or alleviating Prader-Willi Syndrome.
ABSTRACT: Genomes possess different levels of non-randomness, in particular, an inhomogeneity in their nucleotide composition. Inhomogeneity is manifest from the short-range where neighboring nucleotides influence the choice of base at a site, to the long-range, commonly known as isochores, where a particular base composition can span millions of nucleotides. A separate genomic issue that has yet to be thoroughly elucidated is the role that RNA secondary structure (SS) plays in gene expression.
We present novel data and approaches that show that a mid-range inhomogeneity (~30 to 1000 nt) not only exists in mammalian genomes but is also significantly associated with strong RNA SS. A whole-genome bioinformatics investigation of local SS in a set of 11,315 non-redundant human pre-mRNA sequences has been carried out. Four distinct components of these molecules (5'-UTRs, exons, introns and 3'-UTRs) were considered separately, since they differ in overall nucleotide composition, sequence motifs and periodicities. For each pre-mRNA component, the abundance of strong local SS (< -25 kcal/mol) was a factor of two to ten greater than a random expectation model. The randomization process preserves the short-range inhomogeneity of the corresponding natural sequences, thus, eliminating short-range signals as possible contributors to any observed phenomena.
We demonstrate that the excess of strong local SS in pre-mRNAs is linked to the little explored phenomenon of genomic mid-range inhomogeneity (MRI). MRI is an interdependence between nucleotide choice and base composition over a distance of 20-1000 nt. Additionally, we have created a public computational resource to support further study of genomic MRI.
Jason M Bechtel · Thomas Wittenschlaeger · Trisha Dwyer · Jun Song · Sasi Arunachalam · Sadeesh K Ramakrishnan · Samuel Shepard · Alexei Fedorov
ABSTRACT: Comparison of MRI-analyses of GC-, AG- and GT-content with a 50 nt window in masked DMD intron 1. The first intron of the DMD gene was masked using the RepeatMasker program. SRI-generated counterpart sequences retain all masked positions. In each figure the MRI pattern for the natural sequence and the randomized counterpart is shown above and below, respectively: (A) analyzed for MRI in GC-composition; (B) analysis for MRI in AG-composition; (C) analysis for MRI in GT-composition.
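The windowed content analysis named in the caption above (GC-, AG- and GT-content with a 50 nt window) can be approximated with a simple sliding-window scan. This is a minimal sketch, not the Genomic MRI package itself: the function name, the step parameter, and the treatment of masked positions (counted in the window length but never as a match) are assumptions.

```python
def window_content(seq, bases="GC", window=50, step=1):
    """Fraction of positions matching `bases` in each window along seq.

    Masked or ambiguous characters (for example N) never match, but they
    still count toward the window length; this is an assumption and may
    differ from how the Genomic MRI programs treat masked positions.
    """
    seq = seq.upper()
    target = set(bases.upper())
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        fractions.append(sum(1 for base in chunk if base in target) / window)
    return fractions


# Toy sequence (hypothetical): 50 G followed by 50 A, scanned in
# non-overlapping 50 nt windows for the three compositions in the caption.
toy = "G" * 50 + "A" * 50
gc_profile = window_content(toy, "GC", window=50, step=50)
ag_profile = window_content(toy, "AG", window=50, step=50)
gt_profile = window_content(toy, "GT", window=50, step=50)
```

Running the same scan on a natural sequence and on a randomized counterpart, then comparing how many windows exceed a chosen threshold, gives the kind of side-by-side comparison the caption describes.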
ABSTRACT: Among thousands of non-protein-coding RNAs which have been found in humans, a significant group represents snoRNA molecules that guide other types of RNAs to specific chemical modifications, cleavages, or proper folding. Yet, hundreds of mammalian snoRNAs have unknown function and are referred to as "orphan" molecules. In 2006, for the first time, it was shown that a particular orphan snoRNA (HBII-52) plays an important role in the regulation of alternative splicing of the serotonin receptor gene in humans and other mammals. In order to facilitate the investigation of possible involvement of snoRNAs in the regulation of pre-mRNA processing, we developed a new computational web resource, snoTARGET, which searches for possible guiding sites for snoRNAs among the entire set of human and rodent exonic and intronic sequences. Application of snoTARGET for finding possible guiding sites for a number of human and rodent orphan C/D-box snoRNAs showed that another subgroup of these molecules (HBII-85) have statistically elevated guiding preferences toward exons compared to introns. Moreover, these energetically favorable putative targets of HBII-85 snoRNAs are non-randomly associated with genes producing alternatively spliced mRNA isoforms. The snoTARGET resource is freely available at: (http://hsc.utoledo.edu/depts/bioinfo/snotarget.html).
ABSTRACT: Some mutations in the internal regions of exons occur within splicing enhancers and silencers, influencing the pattern of alternative splicing in the corresponding genes. To understand how these sequence changes affect splicing, we created a database of these mutations.
The Alternative Splicing Mutation Database (ASMD) serves as a repository for all exonic mutations not associated with splicing junctions that measurably change the pattern of alternative splicing. In this initial published release (version 1.2), only human sequences are present, but the ASMD will grow to include other organisms (see Availability and requirements section for the ASMD web address). This relational database allows users to investigate connections between mutations and features of the surrounding sequences, including flanking sequences, RNA secondary structures and strengths of splice junctions. Splicing effects of the mutations are quantified by the relative presence of alternative mRNA isoforms with and without a given mutation. This measure is further categorized by the accuracy of the experimental methods employed. The database currently contains 170 mutations in 66 exons, yet these numbers increase regularly. We developed an algorithm to derive a table of oligonucleotide Splicing Potential (SP) values from the ASMD dataset. We present the SP concept and tools in detail in our corresponding article.
The current data set demonstrates that mutations affecting splicing are located throughout exons and might be enriched within local RNA secondary structures. Exons from the ASMD have below average splicing junction strength scores, but the difference is small and is judged not to be significant.
Full-text · Article · Feb 2008 · BMC Research Notes
ABSTRACT: The Alternative Splicing Mutation Database (ASMD) presents a collection of all known mutations inside human exons which affect splicing enhancers and silencers and cause changes in the alternative splicing pattern of the corresponding genes.
An algorithm was developed to derive a Splicing Potential (SP) table from the ASMD information. This table characterizes the influence of each oligonucleotide on the splicing effectiveness of the exon containing it. If the SP value for an oligonucleotide is positive, it promotes exon retention, while negative SP values mean the sequence favors exon skipping. The merit of the SP approach is the ability to separate splicing signals from a wide range of sequence motifs enriched in exonic sequences that are attributed to protein-coding properties and/or translation efficiency. Due to its direct derivation from observed splice site selection, SP has an advantage over other computational approaches for predicting alternative splicing.
We show that a vast majority of known exonic splicing enhancers have highly positive cumulative SP values, while known splicing silencers have core motifs with strongly negative cumulative SP values. Our approach allows for computation of the cumulative SP value of any sequence segment and, thus, gives researchers the ability to measure the possible contribution of any sequence to the pattern of splicing.
Full-text · Article · Feb 2008 · BMC Research Notes
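The cumulative SP value of a sequence segment, as described in the abstract above, can be illustrated with a short Python sketch. Several things here are assumptions rather than the published method: the abstract does not state how per-oligonucleotide values are combined, so a plain sum over all overlapping oligonucleotides is used; the oligonucleotide length, the function name, and every SP value in the toy table are hypothetical.

```python
def cumulative_sp(seq, sp_table, default=0.0):
    """Sum SP values over all overlapping oligonucleotides in seq.

    sp_table maps oligonucleotides of one fixed length to SP values.
    Oligonucleotides absent from the table contribute `default`.
    """
    if not sp_table:
        raise ValueError("sp_table must not be empty")
    k = len(next(iter(sp_table)))  # oligonucleotide length used by the table
    seq = seq.upper()
    return sum(
        sp_table.get(seq[i:i + k], default)
        for i in range(len(seq) - k + 1)
    )


# Hypothetical SP values for illustration only (not from the ASMD).
toy_sp_table = {"GAAGA": 2.0, "AAGAA": 1.5, "TAGGG": -3.0}

enhancer_like = cumulative_sp("GAAGAA", toy_sp_table)   # positive: favors retention
silencer_like = cumulative_sp("TTAGGG", toy_sp_table)   # negative: favors skipping
```

Under this reading, a positive total marks a segment that promotes exon retention and a negative total marks one that favors exon skipping, matching the sign convention stated in the abstract.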
ABSTRACT: Investigation of exon-intron gene structures is a non-trivial task due to the enormous expansion of eukaryotic genomes, the great variety of gene forms, and imperfections in sequence data. A number of available information systems on various gene characteristics complement each other and are indispensable for many genomic studies. Among them, the Exon-Intron Database (EID) is a good choice for large-scale computational examination of exon/intron structure and splicing. It has many internal filters that control for sequence quality, consistency of gene descriptions, conformance to standards, and possible errors. Recent innovations in EID are described. The collection of exons and introns has been extended beyond coding regions, and current versions of EID contain data on untranslated regions of gene sequences as well. Intron-less genes are included as a special part of EID. For species with entirely sequenced genomes, species-specific databases have been generated. A novel Mammalian Orthologous Intron Database (MOID) has been introduced which includes the full set of introns that come from orthologous genes and have the same positions relative to the reading frames. Examples of statistical analyses of gene sequences using EID are provided. We present the latest data on our comparison of intron positions in 11,025 orthologous genes of human, mouse and rat, and find no convincing cases of intron gain. We discuss relevant data-quality issues of genomic databases. In particular, 5% of genes in genomic databases contain internal stop codons. This fact is due to a combination of biological reasons and errors in sequence annotations. The EID is freely available at www.meduohio.edu/bioinfo/eid/.
Preview · Article · Jul 2006 · Briefings in Bioinformatics