
SeqWare Query Engine: storing and searching sequence data in the cloud

UNC Lineberger Comprehensive Cancer Center, University of North Carolina, Chapel Hill, NC 27599, USA.
BMC Bioinformatics (Impact Factor: 2.67). 12/2010; 11(Suppl 12):S2. DOI: 10.1186/1471-2105-11-S12-S2
Source: PubMed

ABSTRACT Since the introduction of next-generation DNA sequencers, the rapid increase in sequencer throughput, and the associated drop in costs, has resulted in more than a dozen human genomes being resequenced over the last few years. These efforts are merely a prelude to a future in which genome resequencing will be commonplace for both biomedical research and clinical applications. The dramatic increase in sequencer output strains all facets of computational infrastructure, especially databases and query interfaces. The advent of cloud computing, and a variety of powerful tools designed to process petascale datasets, provide a compelling solution to these ever-increasing demands.
In this work, we present the SeqWare Query Engine, which has been created using modern cloud computing technologies and designed to support databasing information from thousands of genomes. Our backend implementation was built using the highly scalable NoSQL HBase database from the Hadoop project. We also created a web-based frontend that provides both a programmatic and an interactive query interface and integrates with widely used genome browsers and tools. Using the query engine, users can load and query variants (SNVs, indels, translocations, etc.) with a rich level of annotations, including coverage and functional consequences. As a proof of concept, we loaded several whole genome datasets, including the U87MG cell line. We also used a glioblastoma multiforme tumor/normal pair both to profile performance and to provide an example of using the Hadoop MapReduce framework within the query engine. This software is open source and freely available from the SeqWare project (http://seqware.sourceforge.net).
The SeqWare Query Engine provided an easy way to make the U87MG genome accessible to programmers and non-programmers alike. This enabled a faster and more open exploration of results, quicker tuning of parameters for heuristic variant calling filters, and a common data interface to simplify development of analytical tools. The range of data types supported, the ease of querying and integrating with existing tools, and the robust scalability of the underlying cloud-based technologies make SeqWare Query Engine a natural fit for storing and searching ever-growing genome sequence datasets.
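
The abstract does not describe the table layout used in HBase, so the following is only a minimal sketch of the general idea of loading annotated variants into a sorted key-value store and querying them by genomic range. Everything in it is an assumption made for illustration: the `row_key` format, the `VariantStore` class, and the field names are hypothetical, and a plain in-memory Python structure stands in for HBase. For simplicity the sketch keeps one variant per position.

```python
import bisect
from dataclasses import dataclass, field


def row_key(contig: str, position: int) -> str:
    """Hypothetical key format: zero-padding makes string order match
    numeric position order within a contig."""
    return f"{contig}:{position:012d}"


@dataclass
class VariantStore:
    """Toy in-memory stand-in for a sorted key-value table (not HBase)."""
    _keys: list = field(default_factory=list)   # row keys, kept sorted
    _rows: dict = field(default_factory=dict)   # row key -> annotation dict

    def put(self, contig, position, variant_type, coverage, consequence=None):
        """Store one variant; a second put at the same position overwrites it."""
        key = row_key(contig, position)
        if key not in self._rows:
            bisect.insort(self._keys, key)
        self._rows[key] = {
            "contig": contig,
            "position": position,
            "type": variant_type,        # e.g. "SNV", "indel"
            "coverage": coverage,        # read depth at the site
            "consequence": consequence,  # functional annotation, if any
        }

    def scan(self, contig, start, end, min_coverage=0):
        """Return variants with start <= position < end on one contig,
        keeping only those with at least min_coverage reads."""
        lo = bisect.bisect_left(self._keys, row_key(contig, start))
        hi = bisect.bisect_left(self._keys, row_key(contig, end))
        rows = (self._rows[k] for k in self._keys[lo:hi])
        return [r for r in rows if r["coverage"] >= min_coverage]


store = VariantStore()
store.put("chr1", 2000, "indel", coverage=5)
store.put("chr1", 100, "SNV", coverage=30, consequence="missense")
store.put("chr2", 150, "SNV", coverage=40)

# Range query on chr1, then the same query with a coverage filter.
all_chr1 = store.scan("chr1", 0, 5000)
well_covered = store.scan("chr1", 0, 5000, min_coverage=10)
```

Because the keys sort by contig and then by position, a range query becomes a contiguous scan rather than a full-table search. That property is what makes sorted key-value stores attractive for this kind of data, although the actual SeqWare schema may differ.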

Full-text

Available from: Stanley F Nelson, Jul 30, 2014
  • ABSTRACT: Background: Whole-genome sequencing (WGS) and whole-exome sequencing (WES) technologies are increasingly used to identify disease-contributing mutations in human genomic studies. It can be a significant challenge to process such data, especially when a large family or cohort is sequenced. Our objective was to develop a big data toolset to efficiently manipulate genome-wide variants, functional annotations and coverage, together with conducting family-based sequencing data analysis. Methods: Hadoop is a framework for reliable, scalable, distributed processing of large data sets using MapReduce programming models. Based on Hadoop and HBase, we developed SeqHBase, a big data-based toolset for analysing family-based sequencing data to detect de novo, inherited homozygous, or compound heterozygous mutations that may contribute to disease manifestations. SeqHBase takes as input BAM files (for coverage at every site), variant call format (VCF) files (for variant calls) and functional annotations (for variant prioritisation). Results: We applied SeqHBase to a 5-member nuclear family and a 10-member 3-generation family with WGS data, as well as a 4-member nuclear family with WES data. Analysis times were almost linearly scalable with the number of data nodes. With 20 data nodes, SeqHBase took about 5 seconds to analyse WES familial data and approximately 1 minute to analyse WGS familial data. Conclusions: These results demonstrate SeqHBase's high efficiency and scalability, which is necessary as WGS and WES are rapidly becoming standard methods to study the genetics of familial disorders.
    Journal of Medical Genetics 01/2015; 52(4). DOI:10.1136/jmedgenet-2014-102907
  • ABSTRACT: This paper deals with a cloud query language for manipulating cloud data. The cloud definition and cloud data manipulation languages are proposed as components of the cloud query language. This paper also focuses on the retrieval and manipulation of cloud data dispersed across different data centres.
  • ABSTRACT: Introduction: Mathematical modeling enables in silico classification of cancers, prediction of disease outcomes, optimization of therapy, identification of promising drug targets, and prediction of resistance to anti-cancer drugs. In silico pre-screened drug targets can be validated by a small number of carefully selected experiments. Areas covered: This review discusses the basics of mathematical modeling in cancer drug discovery and development. The topics include in silico discovery of novel molecular drug targets, optimization of immunotherapies, personalized medicine, and guiding preclinical and clinical trials. Breast cancer has been used to demonstrate applications of mathematical modeling in cancer diagnostics, identification of high-risk populations, cancer screening strategies, prediction of tumor growth, and guiding cancer treatment. Expert opinion: Mathematical models are key components of the toolkit used in the fight against cancer. The combinatorial complexity of new drug discovery is enormous, making systematic drug discovery by experimentation alone difficult, if not impossible. The biggest challenges include seamless integration of the growing data, information, and knowledge and making them available for a multiplicity of analyses. Mathematical models are essential for bringing cancer drug discovery into the era of Omics, Big Data, and personalized medicine.
    Expert Opinion on Drug Discovery 07/2014; DOI:10.1517/17460441.2014.941351