SeqWare Query Engine: Storing and Searching Sequence Data in the Cloud

UNC Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA.
BMC Bioinformatics (Impact Factor: 2.58). 12/2010; 11(Suppl 12):S2. DOI: 10.1186/1471-2105-11-S12-S2
Source: PubMed


Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands.
In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable, NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and an interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations, including coverage and functional consequences. As a proof of concept, we loaded several whole genome datasets, including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair to both profile performance and provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project.
The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.

Available from: Stanley F Nelson, Jul 30, 2014
  • Source
    • "Essentially, it provides a parallel read-mapping algorithm optimized for mapping sequence data to the human genome and other reference genomes, intended for use in a biological analysis including SNP discovery, genotyping, and personal genomics. Another large biological extensible workbench is SeqWire [29]. Users are allowed to write and share pipeline modules. "
    ABSTRACT: Rapid growth and storage of biomedical data have enabled many opportunities for predictive modeling and improvement of healthcare processes. On the other hand, analysis of such large amounts of data is a difficult and computationally intensive task for most existing data mining algorithms. This problem is addressed by proposing a cloud-based system that integrates a metalearning framework for ranking and selecting the best predictive algorithms for the data at hand with open source big data technologies for analysis of biomedical data.
    The Scientific World Journal 04/2014; Volume 2014 (2014)(Article ID 859279):10 pages. DOI:10.1155/2014/859279 · 1.73 Impact Factor
  • Source
    • "Twitter, Facebook and Amazon. Motivated by the potential scalability and throughput offered by Hadoop, there are an increasing number of Hadoop-based tools for processing sequencing data (Taylor, 2010), ranging from quality control (Robinson et al., 2011) and alignment (Langmead et al., 2009; Pireddu et al., 2011) to SNP calling (Langmead et al., 2009), variant annotation (O'Connor et al., 2010) and structural variant detection (Whelan et al., 2013), including general purpose workflow management (Schönherr et al., 2012). Note the recent publication of independent and complimentary work in (Nordberg et al., 2013). "
    ABSTRACT: Summary: Hadoop MapReduce-based approaches have become increasingly popular due to their scalability in processing large sequencing datasets. However, as these methods typically require in-depth expertise in Hadoop and Java, they are still out of reach of many bioinformaticians. To solve this problem, we have created SeqPig, a library and a collection of tools to manipulate, analyze and query sequencing datasets in a scalable and simple manner. SeqPig scripts use the Hadoop-based distributed scripting engine Apache Pig, which automatically parallelizes and distributes data processing tasks. We demonstrate SeqPig's scalability over many computing nodes and illustrate its use with example scripts. Availability and Implementation: Available under the open source MIT license at andre.schumacher@yahoo.com Supplementary information: Supplementary data are available at Bioinformatics online.
    Bioinformatics 10/2013; DOI:10.1093/bioinformatics/btt601 · 4.98 Impact Factor
  • Source
    • "Recently exploratory efforts have been made in cloud-based DNA sequence storage. O'Connor et al. [97] created SeqWare Query Engine using cloud computing technologies to support databasing and query of information from thousands of genomes. BaseSpace is a scalable cloud-computing platform for all of Illumina's sequencing systems. "
    ABSTRACT: The discovery of prostate cancer biomarkers has been boosted by the advent of next-generation sequencing (NGS) technologies. Nevertheless, many challenges still exist in exploiting the flood of sequence data and translating them into routine diagnostics and prognosis of prostate cancer. Here we review the recent developments in prostate cancer biomarkers by high throughput sequencing technologies. We highlight some fundamental issues of translational bioinformatics and the potential use of cloud computing in NGS data processing for the improvement of prostate cancer treatment.
    07/2013; 2013:901578. DOI:10.1155/2013/901578