Vol. 28 no. 5 2012, pages 719–720
A scalable and portable framework for massively parallel variable
selection in genetic association studies
Gary K. Chen∗
Division of Biostatistics, Department of Preventive Medicine, 2001 North Soto Street Los Angeles,
CA 90089, USA
Associate Editor: Martin Bishop
Advance Access publication January 11, 2012
Summary: The deluge of data emerging from high-throughput
sequencing technologies poses large analytical challenges when
testing for association to disease. We introduce a scalable framework
for variable selection, implemented in C++ and OpenCL, that fits
regularized regression across multiple Graphics Processing Units.
Open source code and documentation can be found at a Google Code repository.
Supplementary information: Supplementary data are available at Bioinformatics online.
Received on October 2, 2011; revised on January 4, 2012; accepted
on January 5, 2012
As the cost of sequencing continues to drop exponentially, it
will soon be practical to test all variation in the genome for
association to disease using data from thousands of individuals.
There are obvious computational challenges in analyzing datasets
on this scale. Regularized regression methods such as the LASSO
(Tibshirani, 1996) and other extensions are appropriate tools for such settings, where the number of variables can far exceed the number of
observations. Programs like glmnet (Friedman et al., 2010) are
computationally efficient for small to moderately sized datasets, but
do not scale to extremely large datasets due to memory burden.
We introduce an object-oriented framework that scales across nodes
on Graphics Processing Unit (GPU) clusters yet shields users
from the underlying complexities of a distributed optimization
algorithm, allowing them to easily implement custom Monte Carlo
routines (e.g. permutation testing, bootstrapping, etc.). Practical use
of our framework is demonstrated by applications to real and simulated genetic data.
Our C++ framework, named gpu-lasso, implements the mixed
L1 and L2 penalized regression model of Zhou et al. (2010)
on datasets with an arbitrary number of variables. L1 penalties
enforce sparsity, whereas L2 penalties enable correlated predictors
within groups (e.g. genes, pathways) to enter the model as well.
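Written generically, a mixed penalty of this kind augments the negative log-likelihood with both an L1 term and a groupwise L2 term. The form below is a generic sparse-group sketch shown for illustration; Zhou et al. (2010) give the exact penalty that gpu-lasso implements:

```latex
\min_{\beta}\; -\ell(\beta)
  \;+\; \lambda_{1} \sum_{j=1}^{p} \lvert \beta_{j} \rvert
  \;+\; \lambda_{2} \sum_{g=1}^{G} \bigl\lVert \beta_{(g)} \bigr\rVert_{2}
```

Here ℓ(β) is the model log-likelihood, the L1 term drives most coefficients to exactly zero, and the groupwise L2 term lets correlated predictors in the same group (gene or pathway) enter together; with λ2 = 0 this reduces to the ordinary LASSO.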
gpu-lasso exploits the optimization scheme of greedy coordinate
descent (GCD), which evaluates all variables and then updates the single variable leading to the greatest
improvement to the likelihood with its new coefficient. In general,
this requires more iterations to converge than cyclic coordinate
descent (CCD), but tends to converge to sparser
models. More importantly, GCD exposes parallelism across subjects
and variables, which makes it both a better fit for GPU processors
and a more scalable algorithm compared with CCD, which only
exposes parallelization at the subject level. Since GPU memory is
limited relative to the scale of genome sequence data, it is essential to coordinate optimization
across multiple devices; our framework handles this coordination, enabling GPUs to be distributed across a network. Our
GPU kernels are implemented in OpenCL, which assures maximum
portability across both ATI and nVidia GPU devices.
We compared runtime behavior of our program across multiple
configurations and to glmnet (Friedman et al., 2010). We were also
interested in scalability properties as optimization is split across
nodes. Our host was configured with a pair of nVidia Tesla C2050s.
We created datasets of various sizes by extracting genotypes from the first 250 000 single
nucleotide polymorphisms (SNPs) and 1 million SNPs (ordered by
position) of a large Genome-Wide Association Study (GWAS) (see
Section 3). Table 1 shows that, due to its implementation in the
R environment, glmnet has a much heftier memory requirement
than our C++ implementation and could not load the 1 million SNP dataset.
Table 1. Computational requirements

Method                 Time
250 000 variables
  gpu-lasso (1 CPU)
  gpu-lasso (1 GPU)
  gpu-lasso (2 GPU)
1 million variables
  gpu-lasso (1 CPU)
  gpu-lasso (1 GPU)
  gpu-lasso (2 GPU)

A total of 7000 subjects was used in all analyses.
© The Author(s) 2012. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/
by-nc/3.0), which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Runtimes on the 250 000 SNP dataset were comparable across implementations, with gpu-lasso slightly faster. Memory use and runtime halved, as
expected, when the optimization was distributed across two nodes.
In the first example, we demonstrate how a mixed L1 and L2
penalization scheme can be beneficial for rare variant analyses by
carrying out a simulation study based on real data from Pilot 3 of the
1000 Genomes Project. We assigned 100 genic SNPs, most of which
had a minor allele frequency <0.01, to be causal with a relative risk
of disease of 2.0. Figure 1, which presents power as a function of
false discovery rate (FDR), shows that, as expected, inclusion of a
mixed penalty (in this specific case, L1=L2) can improve power
over a pure L1 penalty when informative groupings (i.e. genes) are available.
In our second example, we apply the method of stability selection
(Meinshausen and Bühlmann, 2010), which has been demonstrated
to provide good error control in gene expression data, to a large
GWAS on prostate cancer genotyped across 1 047 986 SNPs in
9641 African-American men (Haiman et al., 2011). We fit the model
across 100 subsampled replicates of the data; on a single CPU
using the same algorithm, this analysis would have taken ∼9 days
to complete.
Table 2 presents the three variables declared as being stable based
on a threshold (derived as a function of a pure L1 penalty) that
controls FDR at the 0.05 level. The first two SNPs listed in the table
replicate significant findings in earlier studies of prostate cancer
(Murabito et al., 2007; Schumacher et al., 2011) while the last SNP
appears to be a genuinely novel risk variant as we have recently
replicated this finding in an independent Stage 2 analysis (Haiman
et al., 2011).
Fig. 1. ROC for simulations based on 1KGP exome data (true discovery rate versus false discovery rate).
Table 2. Stable variables

SNP ID    Chr    Position    Selection probability

All reported variables are based on a threshold πthr = 0.506, which controls FDR at <0.05.
We describe our scalable framework gpu-lasso, which can be
particularly useful for fitting sparse models in high-dimensional
settings. To demonstrate how one can carry out Monte Carlo
routines with our framework, we provide a full source code listing
for the C++ class that implements stability selection in our
Supplementary Material. We should stress that our choice of GCD
as our optimization routine may not be ideal in other contexts,
particularly when large models need to be estimated, such as
exploration of the entire LASSO path over a grid of values for the
optimal penalty parameter. In this case, cyclic coordinate descent
may be preferable as first, the increased number of iterations for
GCD may swamp out gains from limited parallel resources, and
second, GCD may potentially converge to models that overestimate
sparsity. Alternatively, one could conceivably constrain the search
to a set of candidate (sparse) models by adding a BIC penalty for
example. For smaller datasets, software such as glmnet can be more
practical, since efficient routines like the LARS algorithm (Efron
et al., 2004), which solves the LASSO path without exploring a
penalty parameter grid, are already bundled. As datasets increase
in sample size, LARS and related approaches can lose their edge
over a simple (parallelized) penalty grid search since such methods
require inversion of a covariance matrix with dimension bounded
by the number of samples [i.e. O(n)].
I would like to thank Kenneth Lange and Marc Suchard for their
insights, Kai Wang and Paul Thomas for early feedback, the USC
Epigenome Center for GPU computing resources and the African
American Prostate Cancer Consortium, listed in Haiman et al.
(2011), for contributing data in the GWAS application.
Funding: This work was funded by National Institutes of Health
grant (R01 ES019876-01A1).
Conflict of Interest: none declared.
Efron,B. et al. (2004) Least angle regression. Ann. Statist., 32, 407–499. (With
discussion, and a rejoinder by the authors).
Friedman,J.H. et al. (2010) Regularization paths for generalized linear models via
coordinate descent. J. Stat. Softw., 33, 1–22.
Haiman,C.A. et al. (2011) Genome-wide association study of prostate cancer in men of
African ancestry identifies a susceptibility locus at 17q21. Nat. Genet., 43, 570–573.
Meinshausen,N. and Bühlmann,P. (2010) Stability selection. J. R. Stat. Soc. Ser. B, 72, 417–473.
Murabito,J.M. et al. (2007) A genome-wide association study of breast and prostate
cancer in the NHLBI’s Framingham Heart Study. BMC Med. Genet., 8 (Suppl. 1), S6.
Schumacher,F.R. et al. (2011) Genome-wide association study identifies new prostate
cancer susceptibility loci. Hum. Mol. Genet., 20, 3867–3875.
Tibshirani,R. (1996) Regression shrinkage and selection via the lasso. J. R. Stat. Soc.
Ser. B, 58, 267–288.
Zhou,H. et al. (2010) Association screening of common and rare genetic variants by
penalized regression. Bioinformatics, 26, 2375–2382.