[Show abstract][Hide abstract] ABSTRACT: Background:
Transcriptomics reveals the existence of transcripts of different coding potential and strand orientation. Alternative splicing (AS) can yield proteins with altered number and types of functional domains, suggesting the global occurrence of transcriptional and post-transcriptional events. Many biological processes, including seed maturation and desiccation, are regulated post-transcriptionally (e.g., by AS), leading to the production of more than one coding or noncoding sense transcript from a single locus.
We present an integrated computational framework to predict isoform-specific functions of plant transcripts. This framework includes a novel plant-specific weighted support vector machine classifier called CodeWise, which predicts the coding potential of transcripts with over 96 % accuracy, and several other tools enabling global sequence similarity, functional domain, and co-expression network analyses. First, this framework was applied to all detected transcripts (103,106), out of which 13 % was predicted by CodeWise to be noncoding RNAs in developing soybean embryos. Second, to investigate the role of AS during soybean embryo development, a population of 2,938 alternatively spliced and differentially expressed splice variants was analyzed and mined with respect to timing of expression. Conserved domain analyses revealed that AS resulted in global changes in the number, types, and extent of truncation of functional domains in protein variants. Isoform-specific co-expression network analysis using ArrayMining and clustering analyses revealed specific sub-networks and potential interactions among the components of selected signaling pathways related to seed maturation and the acquisition of desiccation tolerance. These signaling pathways involved abscisic acid- and FUSCA3-related transcripts, several of which were classified as noncoding and/or antisense transcripts and were co-expressed with corresponding coding transcripts. Noncoding and antisense transcripts likely play important regulatory roles in seed maturation- and desiccation-related signaling in soybean.
This work demonstrates how our integrated framework can be implemented to make experimentally testable predictions regarding the coding potential, co-expression, co-regulation, and function of transcripts and proteins related to a biological process of interest.
[Show abstract][Hide abstract] ABSTRACT: Reconstructing system dynamics from sequential data traces is an important algorithmic challenge with applications in computational neuroscience, systems biology, paleontology, and physical plant engineering. Here, we formalize a key computational task in network reconstruction, namely re- covering complex order-theoretic constraints among the sys- tem variables underlying a given dataset. Specifically, we fo- cus on the problem of reconstructing partial orders (posets) from their linear extensions. We discuss the theoretical com- plexity of this problem, a general framework to pose and study various inference tasks, and sketch algorithmic results for mining restricted classes of posets.
[Show abstract][Hide abstract] ABSTRACT: Alternative splicing (AS) is a post-transcriptional regulatory mechanism for gene expression regulation. Splicing decisions are affected by the combinatorial behavior of different splicing factors that bind to multiple binding sites in exons and introns. These binding sites are called splicing regulatory elements (SREs). Here we develop CoSREM (Combinatorial SRE Miner), a graph mining algorithm to discover combinatorial SREs in human exons. Our model does not assume a fixed length of SREs and incorporates experimental evidence as well to increase accuracy. CoSREM is able to identify sets of SREs and is not limited to SRE pairs as are current approaches.
We identified 37 SRE sets that include both enhancer and silencer elements. We show that our results intersect with previous results, including some that are experimental. We also show that the SRE set GGGAGG and GAGGAC identified by CoSREM may play a role in exon skipping events in several tumor samples. We applied CoSREM to RNA-Seq data for multiple tissues to identify combinatorial SREs which may be responsible for exon inclusion or exclusion across tissues.
The new algorithm can identify different combinations of splicing enhancers and silencers without assuming a predefined size or limiting the algorithm to find only pairs of SREs. Our approach opens new directions to study SREs and the roles that AS may play in diseases and tissue specificity.
[Show abstract][Hide abstract] ABSTRACT: In the past decade, next generation sequencing technology (NGS) has produced billions of short read sequences. Mapping these short reads to a reference genome is a computationally expensive task. As a result, many mapping tools have been proposed. However, most of the existing tools ignore genomic variations during the mapping task. To address this problem, recently we introduced SNPwise, a short read aligner that takes into account known SNP variations provided by the database or the user. In this work, we improve efficiency of the lookup of the SNP information and evaluate the performance of the improved version of SNPwise using several human genome data sets.
11th International Symposium on Bioinformatics Research and Applications (ISBRA), Norfolk, Virginia; 06/2015
[Show abstract][Hide abstract] ABSTRACT: Background. Developing a universal standardized microbial typing and nomenclature system that provides phylogenetic and epidemiological information in real time has never been as urgent in public health as today. We previously proposed to use genome similarity as the basis for immediate and precise typing and naming of individual organisms or viruses. Here, we tested the validity of the proposed system applying it to the epidemiology of infectious diseases using Ebola virus disease (EVD) outbreaks as the example. Methods. One hundred twenty-eight publicly available ebolavirus genomes were compared with each other and average nucleotide identity (ANI) was calculated. ANI was then used to assign unique codes, from here on called Life Identification Numbers (LINs), to every viral isolate, whereby each LIN consists of a series of positions reflecting increasing genome similarity. Congruence of LINs with phylogenetic and epidemiological relationships was then determined. Results. Assigned LINs correlate with phylogeny at the species and infraspecies level and can even identify some individual transmission chains during the 2014-2015 EVD epidemic in West Africa. Conclusions. LINs could provide a fast, automated, standardized, and scalable approach to precisely identify and name viral isolates upon genome sequence submission, facilitating unambiguous communication during disease epidemics among clinicians, epidemiologists, and governments.
[Show abstract][Hide abstract] ABSTRACT: Current high-throughput sequencing technologies produce a large number of short reads from random locations in the genome. The next and time-consuming step is to map or align all the reads to a reference genome. Though a few dozen alignment tools have been introduced, few of them consider known genomic variants while aligning reads. We present SNPwise, a short read alignment tool that incorporates known SNPs (Single Nucleotide Polymorphisms) provided by a database such as dbSNP or the user while aligning reads. Results show that SNPwise significantly increases the number of reads that can be mapped to the reference genome and also improves the accuracy of the alignment, which is important as the alignment result is used for all downstream analyses. Although incorporating known variations into the mapping/alignment process requires more computing time, the benefit is the improved mapping results in terms of both the proportion of the mapped reads and the accuracy of the mapped reads.
7th International Conference on Bioinformatics and Computational Biology (BICoB), Hawaii, USA; 03/2015
[Show abstract][Hide abstract] ABSTRACT: Abstract Splicing regulatory elements (SREs) are short, degenerate sequences on pre-mRNA molecules that enhance or inhibit the splicing process via the binding of splicing factors, proteins that regulate the functioning of the spliceosome. Existing methods for identifying SREs in a genome are either experimental or computational. Here, we propose a formalism based on de Bruijn graphs that combines genomic structure, word count enrichment analysis, and experimental evidence to identify SREs found in exons. In our approach, SREs are not restricted to a fixed length (i.e., k-mers, for a fixed k). As a result, we identify 2001 putative exonic enhancers and 3080 putative exonic silencers for human genes, with lengths varying from 6 to 15 nucleotides. Many of the predicted SREs overlap with experimentally verified binding sites. Our model provides a novel method to predict variable length putative regulatory elements computationally for further experimental investigation.
Journal of computational biology: a journal of computational molecular cell biology 11/2014; 21(12). DOI:10.1089/cmb.2014.0183 · 1.74 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: A broadly accepted and stable biological classification system is a prerequisite for biological sciences. It provides the means to describe and communicate about life without ambiguity. Current biological classification and nomenclature use the species as the basic unit and require lengthy and laborious species descriptions before newly discovered organisms can be assigned to a species and be named. The current system is thus inadequate to classify and name the immense genetic diversity within species that is now being revealed by genome sequencing on a daily basis. To address this lack of a general intra-species classification and naming system adequate for today's speed of discovery of new diversity, we propose a classification and naming system that is exclusively based on genome similarity and that is suitable for automatic assignment of codes to any genome-sequenced organism without requiring any phenotypic or phylogenetic analysis. We provide examples demonstrating that genome similarity-based codes largely align with current taxonomic groups at many different levels in bacteria, animals, humans, plants, and viruses. Importantly, the proposed approach is only slightly affected by the order of code assignment and can thus provide codes that reflect similarity between organisms and that do not need to be revised upon discovery of new diversity. We envision genome similarity-based codes to complement current biological nomenclature and to provide a universal means to communicate unambiguously about any genome-sequenced organism in fields as diverse as biodiversity research, infectious disease control, human and microbial forensics, animal breed and plant cultivar certification, and human ancestry research.
PLoS ONE 02/2014; 9(2):e89142. DOI:10.1371/journal.pone.0089142 · 3.23 Impact Factor
[Show abstract][Hide abstract] ABSTRACT: Developing soybean seeds accumulate oils, proteins, and carbohydrates that are used as oxidizable substrates providing metabolic precursors and energy during seed germination. The accumulation of these storage compounds in developing seeds is highly regulated at multiple levels, including at transcriptional and post-transcriptional regulation. RNA sequencing was used to provide comprehensive information about transcriptional and post-transcriptional events that take place in developing soybean embryos. Bioinformatics analyses lead to the identification of different classes of alternatively spliced isoforms and corresponding changes in their levels on a global scale during soybean embryo development. Alternative splicing was associated with transcripts involved in various metabolic and developmental processes, including central carbon and nitrogen metabolism, induction of maturation and dormancy, and splicing itself. Detailed examination of selected RNA isoforms revealed alterations in individual domains that could result in changes in subcellular localization of the resulting proteins, protein-protein and enzyme-substrate interactions, and regulation of protein activities. Different isoforms may play an important role in regulating developmental and metabolic processes occurring at different stages in developing oilseed embryos.
[Show abstract][Hide abstract] ABSTRACT: There has been much research on the combinatorial problem of generating the linear extensions of a given poset. This paper focuses on the reverse of that problem, where the input is a set of linear orders, and the goal is to construct a poset or set of posets that generates the input. Such a problem finds applications in computational neuroscience, systems biology, paleontology, and physical plant engineering. In this paper, several algorithms are presented for efficiently finding a single poset that generates the input set of linear orders. The variation of the problem where a minimum set of posets that cover the input is also explored. It is found that the problem is polynomially solvable for one class of simple posets (kite(2) posets) but NP-complete for a related class (hammock(2,2,2) posets).
Discrete Mathematics Algorithms and Applications 10/2013; 05(04). DOI:10.1142/S1793830913500304
[Show abstract][Hide abstract] ABSTRACT: Soybean (Glycine max) seeds are an important source of seed storage compounds, including protein, oil, and sugar used for food, feed, chemical, and biofuel production. We assessed detailed temporal transcriptional and metabolic changes in developing soybean embryos to gain a systems biology view of developmental and metabolic changes and to identify potential targets for metabolic engineering. Two major developmental and metabolic transitions were captured enabling identification of potential metabolic engineering targets specific to seed filling and to desiccation. The first transition involved a switch between different types of metabolism in dividing and elongating cells. The second transition involved the onset of maturation and desiccation tolerance during seed filling and a switch from photoheterotrophic to heterotrophic metabolism. Clustering analyses of metabolite and transcript data revealed clusters of functionally related metabolites and transcripts active in these different developmental and metabolic programs. The gene clusters provide a resource to generate predictions about the associations and interactions of unknown regulators with their targets based on "guilt-by-association" relationships. The inferred regulators also represent potential targets for future metabolic engineering of relevant pathways and steps in central carbon and nitrogen metabolism in soybean embryos and drought and desiccation tolerance in plants.
[Show abstract][Hide abstract] ABSTRACT: Background
Cold acclimation in woody perennials is a metabolically intensive process, but coincides with environmental conditions that are not conducive to the generation of energy through photosynthesis. While the negative effects of low temperatures on the photosynthetic apparatus during winter have been well studied, less is known about how this is reflected at the level of gene and metabolite expression, nor how the plant generates primary metabolites needed for adaptive processes during autumn.
The MapMan tool revealed enrichment of the expression of genes related to mitochondrial function, antioxidant and associated regulatory activity, while changes in metabolite levels over the time course were consistent with the gene expression patterns observed. Genes related to thylakoid function were down-regulated as expected, with the exception of plastid targeted specific antioxidant gene products such as thylakoid-bound ascorbate peroxidase, components of the reactive oxygen species scavenging cycle, and the plastid terminal oxidase. In contrast, the conventional and alternative mitochondrial electron transport chains, the tricarboxylic acid cycle, and redox-associated proteins providing reactive oxygen species scavenging generated by electron transport chains functioning at low temperatures were all active.
A regulatory mechanism linking thylakoid-bound ascorbate peroxidase action with “chloroplast dormancy” is proposed. Most importantly, the energy and substrates required for the substantial metabolic remodeling that is a hallmark of freezing acclimation could be provided by heterotrophic metabolism.
[Show abstract][Hide abstract] ABSTRACT: Ancestral genome reconstruction can be understood as a phylogenetic study with more details than a traditional phylogenetic tree reconstruction. We present a new computational system called REGEN for ancestral bacterial genome reconstruction at both the gene and replicon levels. REGEN reconstructs gene content, contiguous gene runs, and replicon structure for each ancestral genome. Along each branch of the phylogenetic tree, REGEN infers evolutionary events, including gene creation and deletion and replicon fission and fusion. The reconstruction can be performed by either a maximum parsimony or a maximum likelihood method. Gene content reconstruction is based on the concept of neighboring gene pairs. REGEN was designed to be used with any set of genomes that are sufficiently related, which will usually be the case for bacteria within the same taxonomic order. We evaluated REGEN using simulated genomes and genomes in the Rhizobiales order.
[Show abstract][Hide abstract] ABSTRACT: Microarray gene expression profiling is a powerful technique to understand complex developmental processes, but making biologically meaningful inferences from such studies has always been challenging. We previously reported a microarray study of the freezing acclimation period in Sitka spruce (Picea sitchensis) in which a large number of candidate genes for climatic adaptation were identified. In the current paper, we apply additional systems biology tools to these data to further probe changes in the levels of genes and metabolites and activities of associated pathways that regulate this complex developmental transition. One aspect of this adaptive process that is not well understood is the role of the cell wall. Our data suggest coordinated metabolic and signaling responses leading to cell wall remodeling. Co-expression of genes encoding proteins associated with biosynthesis of structural and non-structural cell wall carbohydrates was observed, which may be regulated by ethylene signaling components. At the same time, numerous genes, whose products are putatively localized to the endomembrane system and involved in both the synthesis and trafficking of cell wall carbohydrates, were up-regulated. Taken together, these results suggest a link between ethylene signaling and biosynthesis, and targeting of cell wall related gene products during the period of winter hardening. Automated Layout Pipeline for Inferred NEtworks (ALPINE), an in-house plugin for the Cytoscape visualization environment that utilizes the existing GeneMANIA and Mosaic plugins, together with the use of visualization tools, provided images of proposed signaling processes that became active over the time course of winter hardening, particularly at later time points in the process. The resulting visualizations have the potential to reveal novel, hypothesis-generating, gene association patterns in the context of targeted subcellular location.
[Show abstract][Hide abstract] ABSTRACT: Sorting permutations by operations such as reversals and block-moves has received much interest because of its applications
in the study of genome rearrangements and in the design of interconnection networks. A short block-move is an operation on
a permutation that moves an element at most two positions away from its original position. This paper investigates the problem
of finding a minimum-length sorting sequence of short block-moves for a given permutation. A 4/3 -approximation algorithm
for this problem is presented. Woven double-strip permutations are defined and a polynomial-time algorithm for this class
of permutations is devised that employs graph matching techniques. A linear-time maximum matching algorithm for a special
class of grid graphs improves the time complexity of the algorithm for woven double-strip permutations.
Key words. Computational biology, Genome rearrangement, Approximation algorithms, Maximum matching, Permutations.
[Show abstract][Hide abstract] ABSTRACT: In a binary dataset, a rare-class problem occurs when one class of data (typically the class of interest) is far outweighed by the other. Such a problem is typically difficult to learn and classify and is quite common, especially among biological problems such as the identification of gene conversions. A multitude of solutions for this problem exist with varying levels of success. In this paper we present our solution, which involves using the MetaCost algorithm, a cost-sensitive "meta-classifier" that requires a cost matrix to adjust the learning of an underlying classifier. Our method finds this cost matrix for a given dataset and classification algorithm, creating a final classification model. Through a detailed description, a basic evaluation, and the application to the problem of identifying gene conversions, we show the effectiveness of this approach. Our novel approach to generating a cost matrix has proven to be quite effective in the identification of gene conversions and represents a robust way to tackle the rare-class data problem.