-
ABSTRACT: Regulatory sites that control gene expression are essential to the proper functioning of cells, and identifying them is critical for modeling regulatory networks. We have developed Magma (Multiple Aligner of Genomic Multiple Alignments), a software tool for multiple species, multiple gene motif discovery. Magma identifies putative regulatory sites that are conserved across multiple species and occur near multiple genes throughout a reference genome. Magma takes as input multiple alignments that can include gaps. It uses efficient clustering methods that make it about 70 times faster than PhyloNet, a previous program for this task, with slightly greater sensitivity. We ran Magma on all non-coding DNA conserved between Caenorhabditis elegans and five additional species, about 70 Mbp in total, in <4 h. We obtained 2,309 motifs with lengths of 6-20 bp, each occurring at least 10 times throughout the genome, which collectively covered about 566 kbp of the genomes, approximately 0.8% of the input. Predicted sites occurred in all types of non-coding sequence but were especially enriched in the promoter regions. Comparisons to several experimental datasets show that Magma motifs correspond to a variety of known regulatory motifs.
Journal of computational biology: a journal of computational molecular cell biology 02/2012; 19(2):139-47. · 1.69 Impact Factor
-
ABSTRACT: Detecting members of known noncoding RNA (ncRNA) families in genomic DNA is an important part of sequence annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for genome-wide search. This cost can be reduced by using a filter to exclude sequences that are unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect ncRNA instances lacking strong conservation while excluding most irrelevant sequences remains challenging. In this work, we design three types of filters based on multiple secondary structure profiles (SSPs). An SSP augments a regular profile (i.e., a position weight matrix) with secondary structure information but can still be efficiently scanned against long sequences. Multi-SSP-based filters combine evidence from multiple SSP matches and can achieve high sensitivity and specificity. Our SSP-based filters are extensively tested on the BRAliBase III data set, Rfam 9.0, and a published soil metagenomic data set. In addition, we compare the SSP-based filters with several other ncRNA search tools including Infernal (with profile HMMs as filters), ERPIN, and tRNAscan-SE. Our experiments demonstrate that carefully designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families. The designed filters and filter-scanning programs are available at our website: www.cse.msu.edu/~yannisun/ssp/.
IEEE/ACM transactions on computational biology and bioinformatics / IEEE, ACM 11/2011; 9(3):774-87. · 2.25 Impact Factor
-
ABSTRACT: Many biologically motivated problems are expressed as dynamic programming recurrences and are difficult to parallelize due to the intrinsic data dependencies in their algorithms. Therefore their solutions have been sped up using task level parallelism only. Emerging platforms such as GPUs are appealing parallel architectures for high-performance; at the same time they are a motivation to rethink the algorithms associated with these problems, to extract finer-grained parallelism such as data parallelism. In this paper, we consider the hmmersearch program as a representative of these problems and we re-design its computational algorithm to extract data parallelism for a more efficient execution on emerging platforms, despite the fact that hmmersearch has data dependencies. Our approach outperforms other existing methods when searching a very large database of unsorted sequences on GPUs.
09/2010;
-
Wilson Leung,
Christopher D Shaffer,
Taylor Cordonnier,
Jeannette Wong,
Michelle S Itano,
Elizabeth E Slawson Tempel,
Elmer Kellmann,
David Michael Desruisseau,
Carolyn Cain,
Robert Carrasquillo, [......],
Leah Sabin,
Anita Shah,
Anushree Sharma,
Sonal Singhal,
Fine Song,
Christopher Swope,
Craig B Wilen, Jeremy Buhler,
Elaine R Mardis,
Sarah C R Elgin
ABSTRACT: The distal arm of the fourth ("dot") chromosome of Drosophila melanogaster is unusual in that it exhibits an amalgamation of heterochromatic properties (e.g., dense packaging, late replication) and euchromatic properties (e.g., gene density similar to euchromatic domains, replication during polytenization). To examine the evolution of this unusual domain, we undertook a comparative study by generating high-quality sequence data and manually curating gene models for the dot chromosome of D. virilis (Tucson strain 15010-1051.88). Our analysis shows that the dot chromosomes of D. melanogaster and D. virilis have higher repeat density, larger gene size, lower codon bias, and a higher rate of gene rearrangement compared to a reference euchromatic domain. Analysis of eight "wanderer" genes (present in a euchromatic chromosome arm in one species and on the dot chromosome in the other) shows that their characteristics are similar to other genes in the same domain, which suggests that these characteristics are features of the domain and are not required for these genes to function. Comparison of this strain of D. virilis with the strain sequenced by the Drosophila 12 Genomes Consortium (Tucson strain 15010-1051.87) indicates that most genes on the dot are under weak purifying selection. Collectively, despite the heterochromatin-like properties of this domain, genes on the dot evolve to maintain function while being responsive to changes in their local environment.
Genetics 08/2010; 185(4):1519-34. · 4.01 Impact Factor
-
ABSTRACT: Queuing theory provides a theoretical framework for quantitative understanding of the performance of streaming applications. There are questions, however, as to the suitability of queuing models to real applications. Here, we investigate the correspondence between a relatively simple queuing model of an FPGA-accelerated BLAST implementation and empirical measurements taken from executions of the actual application.
08/2010;
-
Christopher D Shaffer,
Consuelo Alvarez,
Cheryl Bailey,
Daron Barnard,
Satish Bhalla,
Chitra Chandrasekaran,
Vidya Chandrasekaran,
Hui-Min Chung,
Douglas R Dorer,
Chunguang Du, [......],
Joyce Stamm,
Jeff S Thompson,
Matthew Wawersik,
Barbara A Wilson,
Jim Youngblom,
Wilson Leung, Jeremy Buhler,
Elaine R Mardis,
David Lopatto,
Sarah C R Elgin
ABSTRACT: Genomics is not only essential for students to understand biology but also provides unprecedented opportunities for undergraduate research. The goal of the Genomics Education Partnership (GEP), a collaboration between a growing number of colleges and universities around the country and the Department of Biology and Genome Center of Washington University in St. Louis, is to provide such research opportunities. Using a versatile curriculum that has been adapted to many different class settings, GEP undergraduates undertake projects to bring draft-quality genomic sequence up to high quality and/or participate in the annotation of these sequences. GEP undergraduates have improved more than 2 million bases of draft genomic sequence from several species of Drosophila and have produced hundreds of gene models using evidence-based manual annotation. Students appreciate their ability to make a contribution to ongoing research, and report increased independence and a more active learning approach after participation in GEP projects. They show knowledge gains on pre- and postcourse quizzes about genes and genomes and in bioinformatic analysis. Participating faculty also report professional gains, increased access to genomics-related technology, and an overall positive experience. We have found that using a genomics research project as the core of a laboratory course is rewarding for both faculty and students.
CBE life sciences education 01/2010; 9(1):55-69. · 1.19 Impact Factor
-
Proceedings of the 2010 ACM Symposium on Applied Computing (SAC), Sierre, Switzerland, March 22-26, 2010; 01/2010
-
21st IEEE International Conference on Application-specific Systems Architectures and Processors, ASAP 2010, Rennes, France, 7-9 July 2010; 01/2010
-
SPAA 2010: Proceedings of the 22nd Annual ACM Symposium on Parallelism in Algorithms and Architectures, Thira, Santorini, Greece, June 13-15, 2010; 01/2010
-
Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology, BCB 2010, Niagara Falls, NY, USA, August 2-4, 2010; 01/2010
-
ABSTRACT: The amount of biosequence data being produced each year is growing exponentially. Extracting useful information from this massive amount of data efficiently is becoming an increasingly difficult task. There are many available software tools that molecular biologists use for comparing genomic data. This paper focuses on accelerating the most widely used such tool, BLAST. Mercury BLAST takes a streaming approach to the BLAST computation by offloading the performance-critical sections to specialized hardware. This hardware is then used in combination with the processor of the host system to deliver BLAST results in a fraction of the time of the general-purpose processor alone. This paper presents the design of the ungapped extension stage of Mercury BLAST. The architecture of the ungapped extension stage is described along with the context of this stage within the Mercury BLAST system. The design is compact and runs at 100 MHz on available FPGAs, making it an effective and powerful component for accelerating biosequence comparisons. The performance of this stage is 25× that of the standard software distribution, yielding close to 50× performance improvement on the complete BLAST application. The sensitivity is essentially equivalent to that of the standard distribution.
Microprocessors and Microsystems 06/2009; 33(4):281-289. · 0.57 Impact Factor
-
ABSTRACT: Large-scale protein sequence comparison is an important but compute-intensive task in molecular biology. BLASTP is the most popular tool for comparative analysis of protein sequences. In recent years, an exponential increase in the size of protein sequence databases has required either exponentially more running time or a cluster of machines to keep pace. To address this problem, we have designed and built a high-performance FPGA-accelerated version of BLASTP, Mercury BLASTP. In this paper, we describe the architecture of the portions of the application that are accelerated in the FPGA, and we also describe the integration of these FPGA-accelerated portions with the existing BLASTP software. We have implemented Mercury BLASTP on a commodity workstation with two Xilinx Virtex-II 6000 FPGAs. We show that the new design runs 11-15 times faster than software BLASTP on a modern CPU while delivering close to 99% identical results.
ACM Transactions on Reconfigurable Technology and Systems 07/2008; 1(2):9. · 0.65 Impact Factor
-
ABSTRACT: Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and FP rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.
Computational systems bioinformatics / Life Sciences Society. Computational Systems Bioinformatics Conference 02/2008; 7:145-56.
-
19th IEEE International Conference on Application-Specific Systems, Architectures and Processors, ASAP 2008, July 2-4, 2008, Leuven, Belgium; 01/2008
-
ABSTRACT: Biosequence similarity search is an important application in modern molecular biology. Search algorithms aim to identify sets of sequences whose extensional similarity suggests a common evolutionary origin or function. The most widely used similarity search tool for biosequences is BLAST, a program designed to compare query sequences to a database. Here, we present the design of BLASTN, the version of BLAST that searches DNA sequences, on the Mercury system, an architecture that supports high-volume, high-throughput data movement off a data store and into reconfigurable hardware. An important component of application deployment on the Mercury system is the functional decomposition of the application onto both the reconfigurable hardware and the traditional processor. Both the Mercury BLASTN application design and its performance analysis are described.
Journal of VLSI signal processing systems for signal, image, and video technology. 02/2007; 49(1):101-121.
-
ABSTRACT: Single-nucleotide polymorphism (SNP) genotyping is an important molecular genetics process, which can produce results that will be useful in the medical field. Because of inherent complexities in DNA manipulation and analysis, many different methods have been proposed for a standard assay. One of the proposed techniques for performing SNP genotyping requires amplifying regions of DNA surrounding a large number of SNP loci. To automate a portion of this particular method, it is necessary to select a set of primers for the experiment. Selecting these primers can be formulated as the Multiple Degenerate Primer Design (MDPD) problem. The Multiple, Iterative Primer Selector (MIPS) is an iterative beam-search algorithm for MDPD. Theoretical and experimental analyses show that this algorithm performs well compared with the limits of degenerate primer design. Furthermore, MIPS outperforms an existing algorithm that was designed for a related degenerate primer selection problem.
Methods in molecular biology (Clifton, N.J.) 02/2007; 402:245-68.
-
ABSTRACT: Profile HMMs are a powerful tool for modeling conserved motifs in proteins. These models are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome.
We use PROSITE-like patterns as a filter to speed up the comparison between protein sequence and profile HMM. A set of patterns is designed starting from the HMM, and only sequences matching one of these patterns are compared to the HMM by full dynamic programming. We give an algorithm to design patterns with maximal sensitivity subject to a bound on the false positive rate. Experiments show that our patterns typically retain at least 90% of the sensitivity of the source HMM while accelerating search by an order of magnitude.
Bioinformatics 02/2007; 23(2):e36-43. · 5.47 Impact Factor
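The filter-then-score idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: `prosite_to_regex` handles only a small subset of PROSITE pattern syntax, and `filter_then_score`, `full_score`, and `threshold` are hypothetical names, with `full_score` standing in for the full dynamic-programming comparison against a profile HMM.

```python
import re


def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE syntax into a Python regex.

    Supported: single residues, x (any residue), [ABC] classes,
    {ABC} exclusions, and repeat counts such as (3) or (2,4).
    """
    parts = []
    for element in pattern.strip(".").split("-"):
        # Separate the element from an optional trailing repeat count.
        m = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, lo, hi = m.group(1), m.group(2), m.group(3)
        if core == "x":
            token = "."
        elif core.startswith("{") and core.endswith("}"):
            token = "[^" + core[1:-1] + "]"
        else:
            token = core  # a single residue or an [ABC] class
        if lo is not None:
            token += "{" + lo + ("," + hi if hi else "") + "}"
        parts.append(token)
    return "".join(parts)


def filter_then_score(sequences, patterns, full_score, threshold):
    """Run the expensive scorer only on sequences that match some pattern."""
    regexes = [re.compile(prosite_to_regex(p)) for p in patterns]
    hits = []
    for name, seq in sequences:
        if any(r.search(seq) for r in regexes):
            score = full_score(seq)  # hypothetical stand-in for HMM scoring
            if score >= threshold:
                hits.append((name, score))
    return hits
```

Sequences that match no pattern never reach `full_score`, which is where a filter of this kind would save time. The cost is that any true match missed by every pattern is lost, which is why the abstract frames pattern design as maximizing sensitivity under a bound on the false positive rate.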
-
VLSI Signal Processing. 01/2007; 49:101-121.
-
FPL 2007, International Conference on Field Programmable Logic and Applications, Amsterdam, The Netherlands, 27-29 August 2007; 01/2007
-
IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2007, 23-25 April 2007, Napa, California, USA; 01/2007