A Scalable and Portable Framework For Massively Parallel Variable Selection in Genetic Association Studies

Division of Biostatistics, Department of Preventive Medicine, Los Angeles, CA 90089, USA.
Bioinformatics (Impact Factor: 4.98). 03/2012; 28(5):719-20. DOI: 10.1093/bioinformatics/bts015
Source: PubMed


The deluge of data emerging from high-throughput sequencing technologies poses large analytical challenges when testing for association with disease. We introduce a scalable framework for variable selection, implemented in C++ and OpenCL, that fits regularized regression across multiple Graphics Processing Units. Open-source code and documentation can be found at a Google Code repository under the URL http://bioinformatics.oxfordjournals.org/content/early/2012/01/10/bioinformatics.bts015.abstract.
Supplementary information: Supplementary data are available at Bioinformatics online.

Available from: Gary K Chen
  • Source
    • "Speedups relative to standard serial implementations of over two orders of magnitude (100 fold) are commonplace. Epistasis detection only scratches the surface of the set of biologically relevant problems which have already been addressed using GPUs, including proteomics (Hussong et al., 2009), phylogenetics (Suchard and Rambaut, 2009; Zhou et al., 2011a), gene-expression analysis (Buckner et al., 2010; Kohlhoff et al., 2011; Magis et al., 2011), high dimensional optimization (Zhou et al., 2010; Chen, 2012), sequence alignment (Campagna et al., 2009; Blom et al., 2011; Vouzis and Sahinidis, 2011; Liu et al., 2012b), systems biology (Liepe et al., 2010; Klingbeil et al., 2011; Vigelius et al., 2011; Zhou et al., 2011b; Liu et al., 2012a), and genotype imputation (Chen ..."
    ABSTRACT: Despite the enormous investments made in collecting DNA samples and generating germline variation data across thousands of individuals in modern genome-wide association studies (GWAS), progress has been frustratingly slow in explaining much of the heritability in common disease. Today's paradigm of testing independent hypotheses on each single nucleotide polymorphism (SNP) marker is unlikely to adequately reflect the complex biological processes in disease risk. Alternatively, modeling risk as an ensemble of SNPs that act in concert in a pathway, and/or interact non-additively on log risk, for example, may be a more sensible way to approach gene mapping in modern studies. Implementing such analyses genome-wide can quickly become intractable because even modest-sized SNP panels on modern genotype arrays (500k markers) pose a combinatorial nightmare, requiring tens of billions of models to be tested for evidence of interaction. In this article, we provide an in-depth analysis of programs that have been developed to explicitly overcome these enormous computational barriers through the use of processors on graphics cards known as Graphics Processing Units (GPUs). We include tutorials on GPU technology, which convey why GPUs are growing in appeal with today's numerical scientists. One obvious advantage is the impressive density of microprocessor cores available on a single GPU. Whereas high-end servers feature up to 24 Intel or AMD CPU cores, the latest GPU offerings from nVidia feature over 2600 cores. Each compute node may be outfitted with up to 4 GPU devices. Success on GPUs varies across problems. However, epistasis screens fare well due to the high degree of parallelism exposed in these problems. Papers that we review routinely report GPU speedups of over two orders of magnitude (>100x) over standard CPU implementations.
    Frontiers in Genetics 12/2013; 4:266. DOI:10.3389/fgene.2013.00266
  • Source
    • "Algorithms written for GPUs have moved well beyond computer games and animation, making inroads into diverse problems in computational biology such as proteomics [Hussong et al., 2009], phylogenetics [Suchard and Rambaut, 2009; Zhou et al., 2011a], gene-expression analysis [Magis et al., 2011; Kohlhoff et al., 2011; Buckner et al., 2010], high dimensional optimization [Zhou et al., 2010; Chen, 2012], epistasis modeling [Chikkagoudar et al., 2011; Greene et al., 2010; Kam-Thong et al., 2011; Yung et al., 2011; Ritchie and Venkatraman, 2010; Hemani et al., 2011], sequence alignment [Blom et al., 2011; Campagna et al., 2009; Vouzis and Sahinidis, 2011; Liu et al., 2012b], and systems biology [Klingbeil et al., 2011; Vigelius et al., 2011; Liu et al., 2012a; Zhou et al., 2011b; Liepe et al., 2010]. Our algorithms for haplotype frequency estimation and imputation offer ample opportunities for acceleration. "
    Dataset: supplement

  • Source
    • "GPUs have been employed in recent years to solve several high-dimensional problems in computational biology (Zhou et al., 2010; Chen, 2012) amenable to fine-grained parallelization."
    Dataset: document
