Fast Identification of Biological Pathways Associated with a

Quantitative Trait Using Group Lasso with Overlaps

BY MATT SILVER, GIOVANNI MONTANA¹

& ALZHEIMER’S DISEASE NEUROIMAGING INITIATIVE

Department of Mathematics

Imperial College London

London, SW7 2AZ, UK

Abstract

Where causal SNPs (single nucleotide polymorphisms) tend to accumulate within bio-

logical pathways, the incorporation of prior pathways information into a statistical model is

expected to increase the power to detect true associations in a genetic association study. Most

existing pathways-based methods rely on marginal SNP statistics and do not fully exploit the

dependence patterns among SNPs within pathways. We use a sparse regression model, with

SNPs grouped into pathways, to identify causal pathways associated with a quantitative trait.

Notable features of our pathways group lasso with adaptive weights (P-GLAW) algorithm in-

clude the incorporation of all pathways in a single regression model, an adaptive pathway

weighting procedure that accounts for factors biasing pathway selection, and the use of a

bootstrap sampling procedure for the ranking of important pathways. P-GLAW takes ac-

count of the presence of overlapping pathways and uses a novel combination of techniques to

optimise model estimation, making it fast to run, even on whole genome datasets. In a com-

parison study with an alternative pathways method based on univariate SNP statistics, our

method demonstrates high sensitivity and specificity for the detection of important pathways,

showing the greatest relative gains in performance where marginal SNP effect sizes are small.

1 Introduction

The mixed success of attempts to identify genetic variants that account for a large part of the

heritability of common disease has focussed attention on the need to develop new methodological

approaches to the analysis of GWAS data. A number of factors that might explain this ‘missing

heritability’ have been suggested, including the failure of many current models to capture the

presence of gene-gene and gene-environment interactions, of multiple SNPs with small effect and

of rare variants (Manolio et al., 2009; Goldstein, 2009). One promising approach uses prior in-

formation on functional structure present within the genome to group genes and associated SNPs

into gene sets or pathways. The motivation here is that genes do not work in isolation, but instead

work together through their effect on molecular networks and cellular pathways. The hope is that

by jointly considering the effects of multiple SNPs or genes within a biological pathway, significant

associations might be identified that would otherwise be missed when considering markers indi-

vidually (Wang et al., 2010). First developed in the context of gene expression studies (Mootha

et al., 2003), pathways-based methods have more recently been extended to the analysis of GWAS

data (Holmans et al., 2009; Luo et al., 2010; Lango Allen et al., 2010; Lambert et al., 2010). This

has led to the identification of putative causal pathways for a number of diseases including Parkin-

son’s Disease (Lesnick et al., 2007), Crohn’s Disease (Wang et al., 2009b) and rheumatoid arthritis

(Eleftherohorinou et al., 2011). As well as offering the potential for increased statistical power,

pathways-based genetic association studies (PGAS) can aid the biological interpretation of results

through the identification of causal pathways, and may also facilitate comparisons between studies

genotyping different variants that nonetheless map to common pathways (Ma and Kosorok, 2010;

Cantor et al., 2010).

¹ Corresponding author. Email: g.montana@ic.ac.uk

arXiv:1201.5745v1 [stat.ME] 27 Jan 2012

The majority of existing PGAS methods begin with a univariate test of association, in which
individual SNPs are scored according to their degree of association with disease status or a
quantitative trait. Various techniques are then used to combine these univariate statistics into pathway
scores. For example, the GenGen method (Wang et al., 2007) first ranks all genes according to

the value of the highest-scoring SNP within 500kb of each gene. Pathway significance is then

assessed by determining the degree to which high-ranking genes are over-represented in a given

gene set, in comparison with the genomic background. The PLINK toolkit (Purcell et al., 2007)

also features a ‘set-based test’, in which pathway significance is measured by taking the average,

marginal p-value of a pre-determined maximum number of ‘uncorrelated’ SNPs within the path-

way. Here, uncorrelated SNPs are defined as those whose pairwise linkage disequilibrium (LD) is

below a certain threshold value. As a final step, where more than one pathway is considered, a

correction for multiple testing is generally made.

In contrast to univariate, ‘one SNP at a time’ methods, multivariate or multi-locus methods

allow all SNPs to be considered in the model at the same time, which can aid the identification of

weak signals while diminishing the importance of false ones. One such approach consists of fitting

a penalised, multivariate regression model, in which a subset of SNPs is selected by imposing

a penalty on some suitably selected norm of the regression coefficients, as in Lasso regression

(Tibshirani, 1996). This approach has been shown to yield higher statistical power, compared to

more common ‘mass univariate linear models’, especially with multivariate and high-dimensional

quantitative traits (Vounou et al., 2010). Several other studies have demonstrated the advantages

of this approach for the detection of genetic associations. For example, Wu et al. (2009) use

penalized logistic regression to select SNPs in a case-control study, and analyse two-way and

higher-order SNP-SNP interactions. Hoggart et al. (2008) propose a similar method for SNP

selection in a Bayesian context.

A number of penalized regression techniques that allow prior information on the relationship

between SNP markers to be incorporated into the model selection process have recently been

proposed. For example, Zhou et al. (2010) group SNPs into genes, and utilise a useful property

of the group lasso (Yuan and Lin, 2006) to aid the detection of rare variants within genes. The

GRASS method (Chen et al., 2010) begins by characterising within-gene variation as ‘eigenSNPs’,

obtained by principal component analysis (PCA). A combination of lasso and ridge regression,

followed by permutations is then used to measure significance for a single pathway. Finally, Zhao

et al. (2011) use a combination of PCA and lasso regression to identify a subset of genes within

a candidate pathway, followed by permutations to measure pathway significance. Once again this

method considers one pathway at a time.

The search for SNPs, or quantitative trait loci (QTL), influencing quantitative traits is gaining

momentum as a potentially more powerful way to study the underlying causes of complex disease

(Plomin et al., 2009). In the emerging field of neuroimaging genetics for example, in which we

have a particular interest, quantitative data in the form of MRI or PET scans serve as a type of

intermediate phenotype in the study of complex disorders such as Alzheimer’s Disease (AD) or

schizophrenia (Bigos and Weinberger, 2010). We use genotype data from the Alzheimer’s Disease

Neuroimaging Initiative (ADNI) dataset in this analysis.

Our focus here is on the identification of biological pathways associated with a quantitative

trait. Our assumption is that where causal SNPs are enriched in a pathway, the use of a regression

model that selects SNPs that are grouped into pathways will have increased power, compared to

a more traditional approach in which SNPs are considered one at a time. We also seek a true,

multivariate model which includes all mapped pathways at the same time. The hope is that this

will confer some of the benefits, in terms of detecting weaker signals and diminishing false positives,

described earlier. To achieve these ends, we use a modified version of the group lasso (GL) with

SNPs grouped into pathways, and develop a fast estimation algorithm applicable to the case of

non-orthogonal groups. We use a bootstrap sampling procedure to rank

pathways in decreasing order of importance. We face a number of challenges in applying GL to

SNP and pathway data for the identification of implicated pathways. These include the fact that

pathways overlap, since many SNPs map to multiple pathways; the problem of selection bias, that

is the tendency of the model to select pathways having specific statistical properties irrespective of

their association with phenotype; and the sheer scale of SNP datasets, making efficient estimation

a necessity.

We have found that the issue of overlapping pathways receives surprisingly little attention in

the PGAS literature, given that the presence of overlaps might be expected to have a significant

impact on the results of any PGAS analysis. For example, variation in the number and distribution

of causal SNPs with respect to genes that overlap multiple pathways will affect the number of

pathways defined to be ‘causal’, and different PGAS methods will be affected by such variation in

different ways. Additionally, the inclusion of multiple pathways in a single GL regression model

presents a particular problem, since GL in its original formulation will not select pathways in

the manner that we would wish. To account for this we employ a variable expansion procedure,

originally proposed in the context of microarray data analysis by Jacob et al. (2009), that ensures

that overlapping SNPs enter the regression model separately, for each pathway that they map to.

A number of factors may bias PGAS results, exaggerating pathway significance and giving

rise to inflated numbers of false positives. Depending on the methods used, and the underlying

disease-causing mechanism, such factors are likely to include pathway size (measured in number of

SNPs and/or genes), and the extent and distribution of pathway LD. Common strategies employed

by existing methods to reduce this bias include the use of permutation (of genes or phenotypes),

and dimensionality reduction techniques such as PCA (Fridley and Biernacka, 2011; Wang et al.,

2010). We propose a procedure that reduces bias by adjusting pathway weightings in the regression

model according to the empirical bias in pathway selection frequencies obtained by fitting the GL

model with a null response.

One potential drawback of using a regression model in the analysis of genetic data is the

typically very large number of predictors (here SNPs) that must be analysed. While the use of

penalized regression techniques at least makes the problem tractable when the number of predic-

tors vastly exceeds sample size, the very large matrix calculations required can still make model

estimation computationally infeasible. To address this, we combine a number of techniques that

speed up the estimation process including the use of an ‘active set’ of predictors, a Taylor ap-

proximation of the GL penalty and efficient computation of pathway block residuals. The final

estimation algorithm, which we call ‘Pathways Group Lasso with Adaptive Weights’ (P-GLAW),

is sufficiently fast to obviate the need either to undertake a preliminary stage of dimensionality

reduction, or to consider pathways individually.

We evaluate our method’s performance in a Monte Carlo (MC) simulation study, using real

genetic and pathway data with quantitative phenotypes simulated under an additive genetic model.

We consider a range of scenarios with different causal SNP distributions and effect sizes. We feel

the use of real genotype and pathway data is crucial, so as to capture the complex distributions

of gene size and number within a pathway, together with SNP LD patterns and overlaps between

pathways, all of which may have a significant effect on pathway ranking performance. To our

knowledge, this is the first such PGAS power study using GL with real SNP and pathway data.

The evaluation of GL pathway ranking performance however presents a number of challenges.

Firstly, as described above, variation in the number of causal pathways due to overlaps must be

taken into account when evaluating performance over multiple MC simulations. Secondly, we

require a means of evaluating the degree to which causal pathways are represented amongst the

top ranks. Thirdly, since GL performs variable selection, not all causal pathways may be ranked,

and ranking performance measures must reflect this. To address these issues we devise a battery

of measures that aim to capture different aspects of ranking performance. Finally, we compare

our method’s performance with another common PGAS method, derived from univariate SNP

statistics.

The article is organised as follows. Section 2 describes the GL model; our strategy for dealing

with overlapping pathways, model estimation and speed-ups; our proposed bias-adjusted pathway

weighting update procedure; our strategy for ranking pathways using a resampling procedure, and

our proposed ranking performance measures. In Section 3 we describe the real biological data sets

used in the experiments, the SNP to pathway mapping process, and the simulation framework

used to evaluate both methods under consideration. The results from these simulation studies are

provided in Section 4, and we conclude in Section 5 with a discussion and final remarks.

2 Methods

2.1 The group lasso for pathway selection

We assume $N$ unrelated individuals genotyped at $P$ SNPs, each with a univariate quantitative
trait $y_i$, for $i = 1,\ldots,N$. For an individual $i$, we denote by $x_{ij}$ the minor allele count for
SNP $j$, for $j = 1,\ldots,P$, and arrange these counts in an $(N \times P)$ design matrix $X$. Quantitative
phenotypes are arranged in an $(N \times 1)$ column vector $y$, and will be treated as quantitative
responses in a regression model.

We initially consider the situation where SNPs are partitioned into $L$ mutually exclusive
pathways, or groups. Each group $G_l$, for $l = 1,\ldots,L$, is a subset of $\{1,2,\ldots,P\}$ of cardinality
$S_l$, containing the indices $l_1, l_2, \ldots, l_{S_l}$ of the SNPs that belong to it, such that
$G_l \cap G_{l'} = \emptyset$ for any $l \neq l'$. We denote by $G = \{1,\ldots,P\}$ the set of all SNP indices.
We denote by $\mathcal{S} \subset \{1,\ldots,P\}$ the subset of SNPs that are causal, that is the SNPs
influencing $y$, and additionally denote the cardinality of $\mathcal{S}$ by $S$. Accordingly, we denote by
$\mathcal{C} \subset \{1,2,\ldots,L\}$ the subset of causal pathways containing one or more SNPs in
$\mathcal{S}$, having cardinality $|\mathcal{C}|$. We denote the complement of $\mathcal{C}$ by $\bar{\mathcal{C}}$.
We also assume that $|\mathcal{C}| \ll L$, so that only a small proportion of all pathways are causal.
Finally, we assume that $y$ can be optimally predicted, in the least squares sense, by a linear
combination of allele counts corresponding to SNPs in pathway $G_l$, where $l$ belongs to the set $\mathcal{C}$.

We denote the vector of SNP regression coefficients $\beta = (\beta_1,\ldots,\beta_P) \in \mathbb{R}^P$, and the
parameter vector corresponding to SNPs in pathway $G_l$ only as
$\beta_l = (\beta_{l_1},\ldots,\beta_{l_{S_l}}) \in \mathbb{R}^{S_l}$. Under these assumptions, one or more elements
of each $\beta_l$ for $l \in \mathcal{C}$ are expected to be non-zero, whereas all the regression coefficients
associated with SNPs in pathways that do not belong to $\mathcal{C}$ will be zero, that is $\beta_l = 0$
for $l \in \bar{\mathcal{C}}$. For example, for a single causal pathway $G_l$ with causal SNPs $\{a,b\}$ in
$\mathcal{S}$, the sparsity pattern might look like

$$ \beta = \{\underbrace{(0,\ldots,0)}_{G_1},\ldots,\underbrace{(0,\ldots,\beta_{l_a},0,\ldots,\beta_{l_b},0,\ldots,0)}_{G_l},\ldots,\underbrace{(0,\ldots,0)}_{G_L}\}. $$

A suitable regression model that would enforce the assumed block sparsity pattern above is

the group lasso (GL) (Yuan and Lin, 2006), in which estimates for β are obtained by minimising

the penalised least squares function

$$ f(\beta) = \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \sum_{l=1}^{L} w_l \|\beta_l\|_2 \tag{1} $$

with respect to $\beta$, where $\|\cdot\|_2$ denotes the $\ell_2$ (Euclidean) norm and $w_l$ is a pathway
weighting factor for group $l$. Sparsity at the pathway level is encouraged through the imposition
of an $\ell_1$ lasso penalty on $\|\beta_l\|_2$, which ensures that SNPs belonging to pathways not selected
by the model have zero regression coefficients. For selected pathways, i.e. those with
$\beta_l \neq 0$, SNP coefficients tend to shrink, through the imposition of a ridge-type penalty on
$\|\beta_l\|_2$. The degree of sparsity is controlled by the regularisation parameter, $\lambda$, such that
the number of pathways selected by the model increases with decreasing $\lambda$. For a given $\lambda$,
the block sparsity pattern is determined both by the data ($y$ and $X$), and by the distribution of
pathway weights, $w = (w_1,\ldots,w_L)$, such that an increase in $w_l$ means that pathway $l$ is less
likely to be selected, whereas a decrease in $w_l$ will have the opposite effect.
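To make the roles of $\lambda$ and the weights $w_l$ concrete, the objective (1) can be evaluated directly. The following is a minimal sketch in Python with NumPy, assuming mutually exclusive groups. The function name `group_lasso_objective` and the toy values are illustrative assumptions and are not taken from the authors' code.

```python
import numpy as np


def group_lasso_objective(y, X, beta, groups, weights, lam):
    """Evaluate 0.5 * ||y - X beta||_2^2 + lam * sum_l w_l * ||beta_l||_2.

    `groups` is a list of integer index arrays, one per pathway, assumed
    here to be mutually exclusive; `weights` holds the matching w_l.
    """
    residual = y - X @ beta
    loss = 0.5 * float(residual @ residual)
    penalty = sum(w * np.linalg.norm(beta[g]) for g, w in zip(groups, weights))
    return loss + lam * float(penalty)


# Toy data (illustrative values only): 2 individuals, 3 SNPs, 2 pathways.
X = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 1.0]])
y = np.array([1.0, 2.0])
groups = [np.array([0, 1]), np.array([2])]
beta = np.array([3.0, 4.0, 0.0])   # only the first pathway has non-zero coefficients

value = group_lasso_objective(y, X, beta, groups, weights=[1.0, 1.0], lam=0.5)
heavier = group_lasso_objective(y, X, beta, groups, weights=[2.0, 1.0], lam=0.5)
```

With these toy values, doubling $w_1$ from 1 to 2 raises the objective from 6.5 to 9 for the same coefficients, which illustrates why a larger weight makes a pathway less likely to be selected.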

The GL optimisation problem associated with minimising the objective function (1) is convex,

and can be solved using coordinate descent methods. Problems arise however in the situation

where pathways overlap, that is when a SNP is allowed to belong to more than one pathway, so

that $G_l \cap G_{l'} \neq \emptyset$ for some $l \neq l'$. Firstly, where groups overlap, the penalty term in (1) is no
longer separable into groups, since the same SNPs occur in multiple pathways, and convergence
using coordinate descent is no longer guaranteed (Tseng and Yun, 2009). Secondly, if we wish
to be able to select pathways independently, GL is unable to do this. We illustrate this last
point using a simple example in Fig. 1 A, where we consider only three pathways, $G_1$, $G_2$ and
$G_3$, two of which overlap. As a consequence of this, pathway parameter vectors $\beta_1$ and $\beta_2$ also
overlap, since they have a number of SNPs in common (shaded dark grey). If a shared SNP is
selected (i.e. it has a non-zero coefficient), then both pathways to which it belongs ($G_1$ and $G_2$)
are also selected, since their corresponding pathway parameter vectors have non-zero $\ell_2$ norms.
The GL regression model thus does not meet our requirements, since in order to be able to rank
pathways in order of importance, we wish to be able to distinguish overlapping pathways and
select them independently. Conversely, where shared SNPs have zero coefficients, for example in
the case that $G_1$ is not selected in the model, then these SNPs will have zero coefficients in each
and every pathway to which they belong (here $G_1$ and $G_2$). Hence SNPs retained in the model are
necessarily drawn from the complement of a union of (unselected) pathways. We instead require
retained SNPs to be drawn from a union of (selected) pathways, so that a SNP driving selection
in one pathway may still have a zero coefficient in another.

Figure 1: The problem of overlapping pathways: here there are three pathways, $G_1$, $G_2$ and $G_3$, two of
which overlap. A: Standard formulation. Pathway parameter vectors $\beta_1$ and $\beta_2$ overlap, since they have
SNPs in common (shaded dark grey). Where an overlapping SNP has a non-zero coefficient, only $G_3$ can
be selected independently. B: Formulation with duplicated SNPs. An expanded parameter vector, $\beta^*$,
is created by duplicating overlapping SNPs (dotted line). $\beta^*_1$ and $\beta^*_2$ now enter the model separately, so
that pathways can be selected independently.

Jacob et al. (2009) propose one possible solution to the problem of overlapping predictors in a
similar context, motivated by the analysis of gene expression data. The essence of this method is to
create duplicate, dummy SNPs, so that SNPs belonging to more than one pathway enter the model
separately (see Fig. 1 B). The process works as follows. An expanded design matrix is formed from
the column-wise concatenation of the $L$ sub-matrices of size $(N \times S_l)$, that is $X_{G_l} = \{x_{ij}\}$ with
$i = 1,\ldots,N$ and $j \in G_l$, to form the expanded design matrix $X^* = [X_{G_1}, X_{G_2}, \ldots, X_{G_L}]$ of size
$(N \times P^*)$, where $P^* = \sum_l S_l$. The corresponding parameter vector, $\beta^*$, size $(P^* \times 1)$, is formed
by joining the $L$, $(S_l \times 1)$ pathway parameter vectors, $\beta^*_l$, so that
$\beta^* = [\beta_1^{*T}, \beta_2^{*T}, \ldots, \beta_L^{*T}]^T$. The
model is then able to perform pathway selection in the way that we require, and the optimisation
(1), with $\beta$ replaced by $\beta^*$, and $X$ replaced by $X^*$, becomes block separable, so that it can be
solved by block coordinate descent. In the following sections we assume both $\beta$ and $X$ have been
expanded, but omit the $*$ superscript for clarity. Finally, we note that where one or more SNPs
in $\mathcal{S}$ overlap multiple pathways, the corresponding number, $|\mathcal{C}|$, of causal pathways will increase.
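The construction of the expanded design matrix $X^*$ can be sketched in a few lines of Python with NumPy. This is an illustration only: the function name `expand_design_matrix`, the representation of pathways as lists of column indices and the toy data are assumptions made here, and are not the authors' implementation.

```python
import numpy as np


def expand_design_matrix(X, pathways):
    """Duplicate columns of X so that each pathway gets its own copy of its SNPs.

    pathways: list of lists of column indices into X (pathways may overlap).
    Returns X_star of shape (N, sum_l S_l) and a list of index arrays giving
    the columns of X_star belonging to each pathway (disjoint by construction).
    """
    blocks = [X[:, list(p)] for p in pathways]
    X_star = np.hstack(blocks)                       # column-wise concatenation
    sizes = [len(p) for p in pathways]
    offsets = np.concatenate(([0], np.cumsum(sizes)))
    expanded_groups = [np.arange(offsets[l], offsets[l + 1])
                       for l in range(len(pathways))]
    return X_star, expanded_groups


# Toy example: N = 3 individuals, P = 4 SNPs, SNP 1 maps to two pathways.
X = np.arange(12, dtype=float).reshape(3, 4)
pathways = [[0, 1], [1, 2], [3]]
X_star, expanded_groups = expand_design_matrix(X, pathways)
```

In this toy example SNP 1 appears twice in `X_star` (columns 1 and 2), once for each pathway that contains it, so that $P^* = 5$ while $P = 4$, and the returned groups no longer overlap.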

2.2 Parameter estimation

We seek a solution, $\hat{\beta}$, that minimises the GL objective function (1). Where groups or pathways

are disjoint, so that the penalties are separable into groups, a global solution can be obtained using

block coordinate descent (BCD). Coordinate descent algorithms offer a highly efficient means of

solving convex optimisation problems, and work by breaking down the optimisation into a series of

univariate problems, solving the optimisation for each variable (here SNP) in turn, while holding all

the others fixed, until a suitable minimum based on some stopping criterion is reached (Friedman

et al., 2007). Where variables are grouped, as in GL, estimates are obtained for each pathway

parameter vector, $\beta_l$, in turn, while holding constant the current estimates for all other pathway

parameter vectors, $\hat{\beta}_m$ ($m \neq l$), and then cycling through each pathway until convergence.

Yuan and Lin (2006) derive a method for solving GL under the assumption that the group
design matrices, $X_{G_l}$, are orthogonal, that is $X_{G_l}^T X_{G_l} = I$. This assumption does not hold in our
case, so in the next section we derive a solution for GL in the case of non-orthogonal groups. We
additionally find that GL estimation using BCD can be slow, particularly for the large datasets
common to PGAS, and so in the following sections propose a number of strategies for speeding
up parameter estimation.

2.2.1 Block coordinate descent for non-orthogonal groups

We assume that (1) is block-separable, that is the groups indexed by $1,\ldots,L$ are disjoint by
construction. In our context, this is achieved by implementing the SNP duplication strategy of
section 2.1. We begin by considering a single pathway $l$. We collect the minor allele counts
observed in the $N$ individuals for a given SNP $j$ in a column vector $X_j = (x_{1j}, x_{2j}, \ldots, x_{Nj})$.
Using this notation, we define the matrix $X_{G_l} = (X_{l_1}, X_{l_2}, \ldots, X_{l_{S_l}})$ containing all $S_l$ SNPs
belonging to pathway $G_l$, and the corresponding vector of regression coefficients
$\beta_l = (\beta_{l_1}, \beta_{l_2}, \ldots, \beta_{l_{S_l}})$. We can then rewrite the objective function (1) for a single
block $l$ as a function of $\beta_l$,

$$ f(\beta_l) = \frac{1}{2}\Big\|\hat{r}_l - \sum_{j \in G_l} X_j\beta_j\Big\|_2^2 + \lambda w_l \|\beta_l\|_2 \tag{2} $$

where $\hat{r}_l = y - \sum_{m \neq l} X_m\hat{\beta}_m$. The vector $\hat{r}_l$ is the ‘partial residual’ vector for pathway $l$, based
on the current estimates, $\hat{\beta}_m$, $m \neq l$, of the other pathway parameter vectors.

Estimates for each $\beta_j$ are then obtained by taking partial derivatives with respect to $\beta_j$, that
is by setting

$$ \frac{\partial f(\beta_l)}{\partial \beta_j} = 0 \quad \text{for } j = l_1,\ldots,l_{S_l} \tag{3} $$

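Ignoring the penalty term, the partial derivative with respect to $\beta_j$ is

$$ \frac{\partial}{\partial \beta_j}\,\frac{1}{2}\Big\|\hat{r}_l - \sum_{k \in G_l} X_k\beta_k\Big\|_2^2 = -X_j^T\Big(\hat{r}_l - \sum_{k \in G_l} X_k\beta_k\Big) $$

where the sum runs over all SNPs $k$ in pathway $G_l$. We denote the partial derivative of the penalty term by

$$ s_j = \frac{\partial}{\partial \beta_j}\|\beta_l\|_2 $$

so that (3) can be written as

$$ -X_j^T\Big(\hat{r}_l - \sum_{k \in G_l} X_k\beta_k\Big) + \lambda w_l s_j = 0, \quad j = l_1,\ldots,l_{S_l} \tag{4} $$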

We first consider the case where $\beta_l = 0$, that is $\beta_j = 0$ for $j = l_1,\ldots,l_{S_l}$. In this case $\|\beta_l\|_2$
is not differentiable. We instead form the $S_l$ sub-differentials, $s_j \in [-1,1]$, so that

$$ \sum_j s_j^2 \leq 1 \tag{5} $$

The system of equations (4) can now be written

$$ s_j = \frac{1}{\lambda w_l} X_j^T\hat{r}_l, \quad j = l_1,\ldots,l_{S_l} $$

and using (5), we have

$$ \sum_j s_j^2 = \frac{1}{\lambda^2 w_l^2}\sum_j (X_j^T\hat{r}_l)^2 \leq 1. \tag{6} $$

Note that for (6) to be unbiased with respect to group size, a weight, $w_l = \sqrt{S_l}$, as proposed by
Yuan and Lin (2006), can be applied. Alternatively, since

$$ \sum_j (X_j^T\hat{r}_l)^2 = \|X_l^T\hat{r}_l\|_2^2 $$

we may rewrite (6) as

$$ \Big(\sum_j s_j^2\Big)^{1/2} = \frac{1}{\lambda w_l}\|X_l^T\hat{r}_l\|_2 \leq 1, $$

so that if $\beta_l = 0$

$$ \|X_l^T\hat{r}_l\|_2 \leq \lambda w_l. \tag{7} $$

When $\beta_l \neq 0$, the minimisation of (2) can be obtained numerically, using coordinate descent,
as a series of one-dimensional estimations over $\beta_j$, $j = l_1,\ldots,l_{S_l}$. Friedman et al. (2010) suggest
a golden section search over $\beta_j$, combined with parabolic interpolation. However, the number of
such estimations depends on $L$ and $P^*$, both of which increase with $P$, the latter markedly so.

This can make the GL optimisation prohibitively slow, particularly for the large P typically found

in PGAS. For this reason, we next describe three strategies for speeding up the estimation.

2.2.2 Taylor approximation of penalty

One means of speeding up the estimation for $\beta_j$ is to use a linear or quadratic approximation of the
GL $\ell_2$ penalty (Zou and Li, 2008; Fan and Li, 2001), enabling the replacement of the multi-step
numerical optimisation over $\beta_j$ with a one-step calculation. Breheny and Huang (2009) propose the
use of a Taylor approximation for a range of different estimation problems with grouped variables
and we adopt this approach for our GL estimation problem. We begin by rewriting the group $G_l$
objective function (2), for a single predictor as

$$ f(\beta_l \mid \hat{\beta}_k, k \in G_l, k \neq j) = \frac{1}{2}\Big\|\hat{r}_l - \sum_{k \in G_l,\, k \neq j} X_k\hat{\beta}_k - X_j\beta_j\Big\|_2^2 + \lambda w_l \Gamma(\beta_l \mid \hat{\beta}_k) $$

where $\Gamma(\beta_l \mid \hat{\beta}_k) = (c + \beta_j^2)^{1/2}$, with $c = \sum_{k \neq j}\hat{\beta}_k^2$, and the $\hat{\beta}_k$ are the current SNP coefficient
estimates. For convenience, we rewrite this as

$$ f(\beta_l \mid \hat{\beta}_k, k \neq j) = \frac{1}{2}\|\hat{r} + X_j\hat{\beta}_j - X_j\beta_j\|_2^2 + \lambda w_l \Gamma(\beta_l \mid \hat{\beta}_k) \tag{8} $$

where $\hat{r} = y - \sum_l X_l\hat{\beta}_l$ is the total residual, using the current estimates of all SNP coefficients.
We now consider the first order Taylor expansion of $\Gamma(\beta_l \mid \hat{\beta}_k)$ as a function of $x = \beta_j^2$, about the
point $a = \hat{\beta}_j^2$

$$ \Gamma(x) \approx \Gamma(a) + \Gamma'(a)(x - a) $$

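Now

$$ \Gamma(x) = (c + x)^{1/2} \quad \text{and} \quad \Gamma'(a) = \frac{1}{2(c + a)^{1/2}} $$

so that

$$ \Gamma(x) \approx (c + a)^{1/2} + \frac{x - a}{2(c + a)^{1/2}} $$

Substituting $a = \hat{\beta}_j^2$, and noting that $(c + a)^{1/2} = \|\hat{\beta}_l\|_2$, where $\hat{\beta}_l$ denotes the current estimate of
$\beta_l$, this gives

$$ \Gamma(\beta_j^2) \approx \|\hat{\beta}_l\|_2 + \frac{\beta_j^2 - \hat{\beta}_j^2}{2\|\hat{\beta}_l\|_2} $$

Substituting this expression in (8), we have

$$ f(\beta_l \mid \hat{\beta}_k, k \neq j) \approx \frac{1}{2}\|\hat{r} + X_j\hat{\beta}_j - X_j\beta_j\|_2^2 + \lambda w_l\left(\|\hat{\beta}_l\|_2 + \frac{\beta_j^2 - \hat{\beta}_j^2}{2\|\hat{\beta}_l\|_2}\right) $$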

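Differentiating with respect to $\beta_j$ gives

$$ \left.\frac{\partial f(\beta_l)}{\partial \beta_j}\right|_{\hat{\beta}_k, k \neq j} = -X_j^T(\hat{r} + X_j\hat{\beta}_j - X_j\beta_j) + \lambda w_l\frac{\beta_j}{\|\hat{\beta}_l\|_2} = -X_j^T\hat{r} - \hat{\beta}_j + \beta_j + \lambda w_l\frac{\beta_j}{\|\hat{\beta}_l\|_2} $$

since $\sum_i x_{ij}^2 = X_j^T X_j = 1$, which holds when each SNP column is scaled to unit Euclidean norm.
Rearranging terms and setting the partial derivative equal to zero, we
see that the minimum is achieved when

$$ \beta_j = \frac{X_j^T\hat{r} + \hat{\beta}_j}{1 + \lambda'} \quad \text{where } \lambda' = \frac{\lambda w_l}{\|\hat{\beta}_l\|_2} \tag{9} $$

Where the current estimate $\|\hat{\beta}_l\|_2 = 0$, that is when group $l$ first enters the estimation, we set
$\|\hat{\beta}_l\|_2$ to be a small positive quantity, $\eta$, enabling $\beta_j$ in (9) to be estimated.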
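The one-step update (9) can be written as a short function. The sketch below is illustrative only: the function name `taylor_update`, its argument names and the default value of `eta` are assumptions made here, and the SNP column is assumed to be scaled to unit norm as in the derivation. The final lines check numerically that the update sets the derivative of the approximated objective to zero.

```python
import numpy as np


def taylor_update(x_j, r_hat, beta_j_hat, lam, w_l, norm_beta_l, eta=1e-8):
    """One-step update (9) for a single SNP coefficient.

    x_j          : SNP column, assumed scaled so that x_j @ x_j == 1
    r_hat        : total residual y - X beta at the current estimates
    beta_j_hat   : current estimate of the coefficient being updated
    norm_beta_l  : current ||beta_l||_2 for the pathway containing SNP j
    eta          : small positive stand-in used when norm_beta_l is zero
    """
    lam_prime = lam * w_l / max(norm_beta_l, eta)
    return (x_j @ r_hat + beta_j_hat) / (1.0 + lam_prime)


# Illustrative values only.
x_j = np.array([0.6, 0.8])          # unit-norm column
r_hat = np.array([1.0, 2.0])
beta_j_hat, lam, w_l, norm_beta_l = 0.5, 1.0, 2.0, 4.0

new_beta_j = taylor_update(x_j, r_hat, beta_j_hat, lam, w_l, norm_beta_l)

# Derivative of the approximated objective, evaluated at the updated value.
lam_prime = lam * w_l / norm_beta_l
gradient = -(x_j @ r_hat) - beta_j_hat + new_beta_j + lam_prime * new_beta_j

# When the pathway first enters (norm zero), eta keeps the update finite.
first_entry = taylor_update(x_j, r_hat, 0.0, lam, w_l, 0.0)
```

Using `max(norm_beta_l, eta)` is one simple way of substituting $\eta$ for a zero norm; other choices are possible.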

BCD proceeds by obtaining estimates for each $\beta_j$, $j = l_1,\ldots,l_{S_l},\, l_1,\ldots,l_{S_l},\ldots$ until convergence

within the block, and for each pathway, $l = 1,\ldots,L,\, 1,\ldots,L,\ldots$ in turn, until a stopping criterion

indicating a global minimum of (1) has been satisfied. The estimation process is summarised in

Box 1.

2.2.3 Use of pathway ‘active set’

For large $P^*$ and $L$, the need for the repeated calculation of (7) to establish whether or not

a particular group can enter the estimation presents a major computational bottleneck. This

problem motivates another strategy providing substantial gains in computational efficiency for

a range of sparse regression problems. This ‘active set’ strategy relies on the pre-selection of a

subset of ‘potentially active’ predictors, or groups of predictors that are likely to be selected by

the model at a given λ (Tibshirani et al., 2010; Roth and Fischer, 2008). The optimisation can

then be run over this reduced set of variables, with a subsequent check to ensure that no other

predictors should have been included in the first place. The active set procedure offers potentially

dramatic speed up in execution times, particularly for very large datasets such as those found in

PGAS, due to the reduced number of computations that need to be performed. In addition there
are substantial savings in the amount of memory required to store data during processing, which
can also lead to dramatic reductions in computation times with large datasets where memory is
constrained.

Box 1 GL estimation algorithm using BCD

1. Set $\hat{\beta} = 0$.
2. For pathway $G_l$, $l = 1,2,\ldots,L$:
       set $\hat{r}_l = y - \sum_{m \neq l} X_m\hat{\beta}_m$
       if $\|X_l^T\hat{r}_l\|_2 \leq \lambda w_l$
           set $\hat{\beta}_l = 0$
       else
           do
               for $j = l_1,\ldots,l_{S_l}$
                   estimate $\beta_j$ using (9)
               end
           until convergence of $f(\beta_l)$ (2)
           set $\hat{\beta}_l = \beta_l$
       end
   end
3. Repeat step 2 until (global) convergence of $f(\beta)$ (1)
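The steps of Box 1 can be sketched in Python with NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: groups are disjoint index arrays, the columns of `X` have unit Euclidean norm, and the function name `group_lasso_bcd`, the tolerances and the stopping rules are choices made here for the example. For readability the residuals are recomputed from scratch at every step, which is exactly the cost that an efficient implementation would avoid.

```python
import numpy as np


def group_lasso_bcd(X, y, groups, weights, lam, eta=1e-8, tol=1e-10, max_iter=1000):
    """Block coordinate descent for objective (1) using the one-step update (9)."""
    beta = np.zeros(X.shape[1])

    def objective(b):
        res = y - X @ b
        pen = sum(w * np.linalg.norm(b[g]) for g, w in zip(groups, weights))
        return 0.5 * res @ res + lam * pen

    previous = objective(beta)
    for outer in range(max_iter):
        for g, w in zip(groups, weights):
            # Partial residual: remove the fit of all other pathways.
            r_l = y - X @ beta + X[:, g] @ beta[g]
            if np.linalg.norm(X[:, g].T @ r_l) <= lam * w:
                beta[g] = 0.0                 # condition (7): pathway not selected
                continue
            for inner in range(max_iter):     # cycle over SNPs within the block
                old_block = beta[g].copy()
                for j in g:
                    r = y - X @ beta          # total residual at current estimates
                    lam_prime = lam * w / max(np.linalg.norm(beta[g]), eta)
                    beta[j] = (X[:, j] @ r + beta[j]) / (1.0 + lam_prime)
                if np.max(np.abs(beta[g] - old_block)) < tol:
                    break
        current = objective(beta)
        if abs(previous - current) < tol:     # global convergence of (1)
            break
        previous = current
    return beta


# Toy example with an orthonormal design (illustrative values only).
X = np.eye(4)
y = np.array([3.0, 4.0, 0.1, 0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
beta_hat = group_lasso_bcd(X, y, groups, weights=[1.0, 1.0], lam=1.0)
```

For this orthonormal toy design the group lasso solution is available in closed form (Yuan and Lin, 2006): the first pathway is shrunk to $(1 - \lambda w_1/\|X_1^T y\|_2)(3, 4) = (2.4, 3.2)$ and the second is excluded, which the sketch should reproduce up to numerical tolerance.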

For the GL, we begin by considering the inequality (7). For groups to enter the model we
require

$$ \|X_l^T\hat{r}_l\|_2 > \lambda w_l, \quad l = 1,\ldots,L \tag{10} $$

and therefore, at the first iteration, with $\beta$ initialised to zero, a group $G_l$ enters the model if

$$ \|X_l^T y\|_2 > \lambda w_l, \quad l = 1,\ldots,L. \tag{11} $$

We define the ‘active set’ $A$ of potentially active groups that satisfy (11) as

$$ A = \{m \in \{1,\ldots,L\} : \|X_m^T y\|_2 > \lambda w_m\} $$

and additionally define

$$ \lambda_{\max} = \min\{\lambda : \|X_l^T y\|_2 \leq \lambda w_l, \; l = 1,\ldots,L\} \tag{12} $$

namely the smallest $\lambda$ value for which the active set is empty. Note that provided $\lambda$ is close to
$\lambda_{\max}$, then $|A| \ll L$. Once one or more groups enter the model, not all $\hat{\beta}_l$ will be zero and the
inequality (10) will then determine which groups may enter or leave the model.

The active set procedure rests on the observation that in practice, the final set of groups

selected by the model rarely includes any groups not in A (Tibshirani et al., 2010). We can

therefore perform the full estimation on A, followed by a check of the inequality (10), to see if

any additional groups not in A can enter the model. If there are no additional groups, then we

have the full solution. If not, then we run the full estimation again, with the additional groups

satisfying (10) added to A. A summary of the active set algorithm is given in Box 2.
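The quantities that drive the active set procedure, namely the initial set defined through (11), the value $\lambda_{\max}$ in (12) and the follow-up check of (10), can be sketched as below. The function names are hypothetical and chosen for this example, the groups are assumed to be disjoint index arrays, and the solver that would be run on the active set between these steps is deliberately left out.

```python
import numpy as np


def group_scores(X, y, groups, weights):
    """Return ||X_l^T y||_2 / w_l for each pathway l."""
    return np.array([np.linalg.norm(X[:, g].T @ y) / w
                     for g, w in zip(groups, weights)])


def lambda_max(X, y, groups, weights):
    """Smallest lambda for which no pathway satisfies (11)."""
    return float(group_scores(X, y, groups, weights).max())


def initial_active_set(X, y, groups, weights, lam):
    """Indices of the pathways that satisfy (11) at the given lambda."""
    scores = group_scores(X, y, groups, weights)
    return [l for l, s in enumerate(scores) if s > lam]


def groups_satisfying_10(X, y, beta, groups, weights, lam):
    """Pathways whose partial residual satisfies (10) at the estimate beta."""
    r = y - X @ beta
    selected = []
    for l, (g, w) in enumerate(zip(groups, weights)):
        r_l = r + X[:, g] @ beta[g]          # add back this pathway's own fit
        if np.linalg.norm(X[:, g].T @ r_l) > lam * w:
            selected.append(l)
    return selected


# Toy example (illustrative values only).
X = np.eye(4)
y = np.array([3.0, 4.0, 0.1, 0.2])
groups = [np.array([0, 1]), np.array([2, 3])]
weights = [1.0, 1.0]

lam_max = lambda_max(X, y, groups, weights)
active = initial_active_set(X, y, groups, weights, lam=1.0)
beta_on_active = np.array([2.4, 3.2, 0.0, 0.0])   # solution restricted to `active`
revised = groups_satisfying_10(X, y, beta_on_active, groups, weights, lam=1.0)
```

Here `revised` contains no pathway outside `active`, so in this toy case the estimate obtained on the active set would be accepted as the full solution.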

Box 2 Active set algorithm for a single $\lambda$ value

1. Form the active set, $A = \{m \in \{1,\ldots,L\} : \|X_m^T y\|_2 > \lambda w_m\}$
2. Set $\hat{\beta} = 0$, and solve the GL estimation at $\lambda$, using only the groups in $A$:
       $\hat{\beta} = \arg\min_{\beta} \frac{1}{2}\|y - \sum_{m \in A} X_m\beta_m\|_2^2 + \lambda \sum_{m \in A} w_m\|\beta_m\|_2$
3. Compute the revised active set on the full dataset:
       $A^+ = \{z \in \{1,\ldots,L\} : \|X_z^T\hat{r}_z\|_2 > \lambda w_z\}$
   if $A^+ \setminus A = \emptyset$
       $\hat{\beta}$ is the full solution
       STOP
   else
       set $A = A^+$
       repeat 2. and 3. with the new, (expanded) active set
   end

2.2.4 Efficient computation of block residuals

A further, large computational burden results from the repeated calculation of the residuals $\hat{r}_l$
and $\hat{r}$ in (7), (9) and (10). The computational overhead for these calculations is substantial,
both because of the size of the expanded design matrix ($N = 743$ and $P^* = 66{,}085$ in the
simulation study described in section 3, but substantially larger for a full PGWAS), and because
of the iterative nature of the BCD algorithm, meaning that a very large number of calculations are
performed. We therefore achieve one further substantial gain in computational efficiency by noting
that since the blocks are separable, during BCD only the single block residual, $h_l = y - X_l\beta_l$,
changes between iterations $j = l_1,\ldots,l_{S_l},\, l_1,\ldots,l_{S_l},\ldots$ within block $l$, and between iterations
$l = 1,\ldots,L,\, 1,\ldots,L,\ldots$ across blocks. We therefore only need update $h_l$ at each iteration, with $\hat{r}$ and
$\hat{r}_l$ updated using computationally inexpensive matrix subtractions and additions. Python code for
mapping SNPs to pathways, and for analysing SNP data using P-GLAW is available on request.
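The idea of updating residuals incrementally, rather than recomputing them, can be illustrated with a short sketch. This is not the authors' code, which is available only on request: the helper names `update_coefficient` and `partial_residual` are hypothetical, and the sketch updates the total residual one coefficient at a time, which is one possible variant of the bookkeeping described above.

```python
import numpy as np


def update_coefficient(X, beta, r, j, new_value):
    """Set beta[j] = new_value and update the total residual r = y - X beta in place.

    Costs O(N), compared with O(N P) for recomputing y - X @ beta.
    """
    r -= X[:, j] * (new_value - beta[j])
    beta[j] = new_value


def partial_residual(X, beta, r, g):
    """Partial residual for pathway g, obtained by adding the block fit back to r."""
    return r + X[:, g] @ beta[g]


# Toy check against direct recomputation (random illustrative data).
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
y = rng.normal(size=5)
beta = np.zeros(4)
r = y.copy()                                  # residual when beta = 0

update_coefficient(X, beta, r, 2, 0.7)
update_coefficient(X, beta, r, 0, -1.2)
r_block = partial_residual(X, beta, r, np.array([0, 1]))
```

After the two updates, `r` agrees with `y - X @ beta` computed from scratch, and `r_block` agrees with the residual obtained by removing only the fit of the SNPs outside the chosen pathway.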

2.3 Selection bias and pathway weighting

PGAS methods derived from univariate SNP statistics are subject to various biasing factors that

can influence pathway ranking under the null, where no SNPs influence the phenotypic trait, y.

These factors vary from method to method, but may include the number and size of genes in a

pathway, as well as LD between SNPs and genes. Such biasing factors are generally corrected

through the use of permutation procedures. For example, the ‘GenGen’ method (Wang et al., 2009b) measures the degree to which pathways are enriched with high ranking genes, and is subject to bias due to variation in the number of SNPs mapped to a gene, and to differences in LD between SNPs mapped to different genes. The bias correction procedure begins by forming multiple datasets through permutation of phenotype labels. For each permuted dataset, gene scores are generated from univariate SNP statistics, and a pathway enrichment score is calculated. A normalised (bias-corrected) pathway enrichment score is then derived by comparing the distribution of pathway scores under the null with the score obtained from the unpermuted data.

Regression-based methods are similarly prone to bias, and once again the use of permutation

has been proposed to correct for this, along with dimensionality reduction to extract non-redundant

information. For example, with the GRASS method for case-control data (Chen et al., 2010), genetic information within each gene is first summarised as ‘eigenSNPs’, obtained through PCA. The

biasing effect of gene size is once again accounted for through the generation of a null distribution,

formed by permuting phenotype labels.

With the GL under the null, pathway selection will be influenced by pathway size (i.e. the

number of SNPs within a pathway), since the accumulation of spurious associations in larger

pathways will give rise to larger $\|\beta_l\|_2$ in (1). In addition, variation in dependencies between SNPs within pathways, and to a lesser extent between pathways, will give rise to corresponding variations in $\|\beta_l\|_2$ where spurious associations arise in regions of high LD.

One way to correct for biases arising from variations in the statistical properties of different

pathways or groups is through the selection of appropriate group weights w = (w1,...,wL) for the

objective function (1). For example, as noted before, Yuan and Lin (2006) suggest one possible

choice for the pathway weighting would be

(13)   $w_l = \sqrt{S_l}$

which ensures that groups of different size are penalised equally, and so have an equal chance of

being selected by the model, other things being equal (see (6)). In principle, we could follow this

strategy and perhaps attempt to account for other, additional factors that may also bias pathway

selection. However, there are a number of problems with this approach. Consider for example the

biasing effect of dependencies between SNPs within a pathway. Where causal SNPs tag, or reside

within large blocks with strong LD, the pathway ‘signal’ will be high, increasing the chance that

such pathways will be selected by the model, compared with other pathways where LD is low.

This biasing effect will further depend on the distribution of LD within the pathway, which will in

turn depend on other factors such as the number and size of pathway genes. The precise form of

any additional term(s) that should be added to (13) to account for this bias is thus unclear. Even

if we were able to identify a list of potential biasing factors, and formulate bias-correcting weight

adjustments for each, we are still faced with the problem that there may be other, unknown factors

that contribute to the bias. We therefore choose to adopt a ‘hypothesis-free’ approach to adjusting

pathway weights, which makes no assumptions about those factors which might influence pathway

selection.

Consider pathway selection under the GL model (1), with $\lambda$ tuned to select $M$ pathways. We begin with the case $M = 1$. When there is no selection bias, and assuming no genetic association, a pathway $G_l$ should be randomly selected by the model according to a uniform distribution, namely with probability $\Pi_l = 1/L$, for $l = 1,\dots,L$. However, when biasing factors are present this is generally not the case, and the empirical probability distribution describing pathway selection, $\Pi^*(w)$, will not be uniform. Here the dependence upon the weight vector $w$ has been made explicit, since with $\lambda$ tuned to select a single pathway, $w$ alone determines the frequency distribution. A measure of distance between these two distributions can be obtained by computing their Kullback-Leibler (KL) divergence

(14)   $D = \sum_{l} \Pi^*_l(w) \log \frac{\Pi^*_l(w)}{\Pi_l}$

where $\Pi^*_l(w)$ is the empirical probability for the selection of pathway $G_l$ under the assumption of no genetic associations. When GL pathway selection is unbiased, we expect this distance to be approximately zero. Our strategy consists in adaptively adjusting all weights $w$ in order to minimise $D$.

Our adaptive weighting procedure is an iterative one, whereby at each iteration $\tau$ we first update the previous weight vector $w^{(\tau-1)}$, and then re-estimate $\Pi^*(w^{(\tau)})$ by fitting the GL model $R$ times, each with a random permutation of the response in order to create $R$ null data sets$^2$. $\Pi^*_l(w^{(\tau)})$ is then the frequency at which pathway $G_l$ is selected across the $R$ null data sets at iteration $\tau$. The algorithm is initialised at iteration $\tau = 0$ by using an initial weight vector $w^{(0)}$, for instance the standard size weighting (13). This procedure is then repeated until $D$ reaches some suitably small value.

$^2$ Alternatively, in a simulation study where the null distribution of the response is known (as in section 3), the $R$ models can be fitted after sampling a response from that null distribution.

From (14), a reduction in $D$ can be obtained by reducing the difference $d_l = \Pi^*_l(w) - \Pi_l$, for all $l$. As each $|d_l|$ approaches zero, the ratio, $\Pi^*_l(w)/\Pi_l$, approaches one, so that the contribution of pathway $G_l$ to $D$ is decreased. With this in mind, at each iteration, we adjust pathway weights according to the following formula,

(15)   $w_l^{(\tau)} = w_l^{(\tau-1)} \left[ 1 - \mathrm{sign}(d_l)\,(\alpha - 1)\,L^2 d_l^2 \right], \qquad 0 < \alpha < 1$

where the parameter $\alpha$ controls the maximum amount by which each $w_l$ can be reduced in a single iteration, in the case that pathway $G_l$ is selected with zero frequency. The weighting update equation has the following desirable properties. When $0 \le \Pi^*_l < \Pi_l$, i.e. $-\frac{1}{L} \le d_l < 0$, $w_l$ is decreased, up to a maximum factor $\alpha$ when $\Pi^*_l = 0$, increasing the chance that group $l$ is selected. When $\Pi^*_l > \Pi_l$, i.e. $d_l > 0$, $w_l$ is increased, decreasing the chance that group $l$ is selected. Finally, when $\Pi^*_l = \Pi_l$, i.e. $d_l = 0$, $w_l$ is unchanged. The square in the weight adjustment factor ensures that large values of $|d_l|$ result in relatively large adjustments to $w_l$.

The estimation of $\Pi^*$ when $M > 1$, that is where more than one pathway is selected by the model, is computationally infeasible even for a small value of $M$, since we would need to estimate the empirical joint probability distribution that $M$ pathways are jointly selected. However, we expect that many of the factors biasing pathway selection when $M = 1$ will similarly affect this joint probability distribution. Under this assumption, we estimate the optimal weight vector $w$ only in the $M = 1$ case. Extensive simulation studies (see section 4) indicate that this data-driven adaptive weighting scheme is able to substantially increase power and specificity compared with the standard weighting (13), even when $M > 1$, indicating that this assumption holds in practice.

Finally, we note that despite the need for multiple MC simulations over multiple iterations, our proposed bias-adjusted weighting strategy is fast, since it relies on fitting the GL model with $\lambda$ tuned to select a single pathway only, ensuring that the active set (see section 2.2.3) is very small, and model estimation time for each of the $R$ model fits is minimal.
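The adaptive weighting loop can be illustrated with a short NumPy sketch. It is a simplified reading of the procedure rather than the authors' code. All function names are hypothetical. Null responses are drawn from a standard normal distribution, as footnote 2 permits when the null distribution is known. The sketch further assumes that, with $\lambda$ tuned to select exactly one pathway, the selected pathway is the first to enter the regularisation path, i.e. the one maximising $\|X_l^T y\|_2 / w_l$, so no model fit is needed.

```python
import numpy as np

def null_selection_frequencies(X, groups, w, n_null, rng):
    """Estimate Pi*(w): how often each pathway is the single one selected
    when the response is pure noise (standard normal draws)."""
    counts = np.zeros(len(groups))
    for _ in range(n_null):
        y = rng.standard_normal(X.shape[0])
        # Assumption: the single selected pathway is the first to enter the path.
        scores = [np.linalg.norm(X[:, g].T @ y) / w[l] for l, g in enumerate(groups)]
        counts[int(np.argmax(scores))] += 1
    return counts / n_null

def kl_from_uniform(pi_star):
    """KL divergence D of equation (14) against the uniform Pi_l = 1/L."""
    L = len(pi_star)
    nz = pi_star > 0                     # terms with Pi*_l = 0 contribute zero
    return float(np.sum(pi_star[nz] * np.log(pi_star[nz] * L)))

def update_weights(w, pi_star, alpha):
    """Weight update of equation (15), with 0 < alpha < 1."""
    L = len(w)
    d = pi_star - 1.0 / L
    return w * (1.0 - np.sign(d) * (alpha - 1.0) * L**2 * d**2)

def adapt_weights(X, groups, w0, alpha, n_iter, n_null, rng):
    """Alternate between estimating Pi*(w) on null data and updating w."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_iter):
        pi_star = null_selection_frequencies(X, groups, w, n_null, rng)
        w = update_weights(w, pi_star, alpha)
    return w
```

With $L = 4$, $\alpha = 0.5$ and $\Pi^* = (0, 0.25, 0.5, 0.25)$, the update multiplies the four weights by $(0.5, 1, 1.5, 1)$: the never-selected pathway has its weight reduced by the maximum factor $\alpha$, and the over-selected pathway is penalised more heavily.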

2.4 Pathway ranking

Penalized regression typically proceeds by determining an optimal value for λ, corresponding to a

subset of variables that best predicts the response, and this is generally done by cross validating

the prediction error. In genetic association mapping, results are often instead presented in the

form of lists of pathways or SNPs, ranked in order of importance. We seek such a strategy for

the ranking of pathways within the regression model, such that pathways in C will achieve a

high ranking, whereas those outside C will be ranked low. This approach has the added advantage of

allowing us to make direct comparisons with alternative pathway methods that use p-values as a

ranking criterion.

One simple ranking criterion in penalised regression is to use the order in which each variable

enters the model along the regularization path - i.e. as λ is decreased from its maximal value, where

no variables are selected. We instead adopt a bootstrap sampling approach, in which we fit the

regression model over multiple subsamples of the data, drawn with replacement, at a single, fixed

value for λ. Pathways are ranked in order of importance according to their selection frequency

across subsamples. Our motivation here is to exploit knowledge of finite sample variability obtained

by subsampling, to achieve better estimates of pathway importance. In this respect our strategy

resembles the pointwise stability selection method proposed by Meinshausen and Bühlmann (2010)

in the context of variable selection.

As with stability selection, for our ranking strategy to be effective, the value of λ must be

small enough to ensure that all pathways in C are selected by the model with a high probability at

each subsample. Computation time increases rapidly with M, the number of selected pathways,

so that with the number, |C|, of causal pathways unknown, the choice of M is driven by the

number of causal pathways we seek to identify within computational constraints. We use B = 100

subsamples, each of size N/2, and at each subsample we perform a line search over λ, to ensure that

$M \ge M_{\min}$ pathways are selected. This procedure is described in appendix 5. Once $\lambda$ is tuned, for each subsample, $b$, we obtain estimates $\beta_j^{(b)}$ ($b = 1,\dots,B$) for each SNP coefficient ($j = 1,\dots,P^*$). For pathway $G_l$, we define $\pi_l^{(b)} = 1$ when $\|\beta_l^{(b)}\|_2 \ne 0$ and $\pi_l^{(b)} = 0$ otherwise, where $\beta_l^{(b)}$ is the pathway parameter vector estimated for subsample $b$. We rank pathways in order of their selection frequency across subsamples, $\bar\pi_{l_1} \ge \dots \ge \bar\pi_{l_L}$. We note that since typically $M \ll L$, some $\bar\pi_l$ may be zero. Such pathways are classified as unranked.
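As a small illustration of the ranking step, the sketch below turns a set of per-subsample coefficient vectors into selection indicators and a ranked list. The function names are hypothetical, and the selection frequency is taken to be the mean of the indicators over the $B$ subsamples, which is an assumption about how $\bar\pi_l$ is computed. Fitting the model on each subsample is outside the scope of the sketch.

```python
import numpy as np

def selection_indicators(betas, groups):
    """pi[b, l] = 1 if pathway l has a non-zero coefficient block in subsample b."""
    return np.array([[float(np.linalg.norm(beta[g]) != 0) for g in groups]
                     for beta in betas])

def rank_pathways(pi):
    """Rank pathways by selection frequency; zero-frequency pathways are unranked."""
    freq = pi.mean(axis=0)                     # assumed definition of pi-bar
    order = np.argsort(-freq, kind="stable")   # highest frequency first
    ranked = [int(l) for l in order if freq[l] > 0]
    unranked = [int(l) for l in order if freq[l] == 0]
    return freq, ranked, unranked
```

Ties in selection frequency are broken here by pathway index through the stable sort; the text does not say how ties are handled, so this is an arbitrary choice.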


2.5 Ranking performance measures

In order to evaluate the success of any PGAS method, some measure of ranking performance

is required. In this section we describe 3 separate ranking performance measures that we use

to evaluate the performance of our method in a simulation study described in section 3. One

complicating factor is the issue of overlapping pathways, making the effective number of causal

pathways, |C|, dependent on the degree to which SNPs in S overlap multiple pathways. In addition,

with any method based on variable selection, the possibility that causal pathways are unranked,

i.e. they are not selected by the model, must be taken into account.

Consider the situation where the set S of causal SNPs, with cardinality S > 1, is known. We

may choose to define C in its most restricted sense as the set of pathways that contain all members

of S, or alternatively C might include all pathways containing one or more SNPs belonging to S.

In either case |C| will depend on the degree to which SNPs in S overlap multiple pathways. This

in turn depends on the particular distribution of causal SNPs with respect to overlapping genes.

The need to accommodate this variability in |C| in part motivates our formulation of the ranking

measures described below.

We propose three separate ranking measures that capture different aspects of ranking perfor-

mance, and focus on the top 100 ranked pathways only. We do this firstly because in any method

attention is inevitably focused on the highest ranking pathways (or alternatively those with the

highest statistical significance in a hypothesis testing framework). Also, since in a simulation study

we compare the performance of our variable selection method which identifies a limited number

of pathways against an alternative method that scores all pathways, some suitable cutoff in rank

order must be chosen.

We denote the set of ranked causal pathways by $C^* = \{k \in C : \bar\pi_k > 0\}$, cardinality $|C^*|$, and their respective rankings by $r_{k_1}, r_{k_2}, \dots, r_{k_{|C^*|}}$, ranked in order of their respective selection frequencies, $\bar\pi_{k_1} > \bar\pi_{k_2} > \dots > \bar\pi_{k_{|C^*|}}$. We further denote by $C^*_{100} = \{k \in C^* : r_k \le 100\}$, cardinality $|C^*_{100}|$, the set of ranked causal pathways falling in the top 100 ranks, with corresponding rankings $r_{k_1}, r_{k_2}, \dots, r_{k_{|C^*_{100}|}}$. Our three proposed ranking measures are as follows:

1. Highest causal pathway rank, $r_{k_1}$, that is the single highest rank achieved by any pathway in $C^*_{100}$. This lies in the range $1 \le r_{k_1} \le 100$, and is only defined for $|C^*_{100}| \ge 1$.

2. Ranking power, $p_{100}$, defined as

   (16)   $p_{100} = \frac{|C^*_{100}|}{|C|}$

   with $0 \le p_{100} \le 1$. $p_{100} = 0$ when no causal pathways are ranked in the top 100 ($C^*_{100} = \emptyset$), and $p_{100} = 1$ when all causal pathways are ranked in the top 100 ($C^*_{100} = C$).

3. Power-adjusted, normalised, weighted ranking score, $R$. This takes account of the actual rankings, $r_{k_1}, \dots, r_{k_{|C^*_{100}|}}$, as well as the ranking power, $p_{100}$. We begin by defining a normalised, weighted ranking score,

   (17)   $R^* = \frac{\sum_{k \in C^*_{100}} r_k^{1/2}}{\sum_{k=1}^{|C^*_{100}|} k^{1/2}}$

   Here the square root increases the weight given to highly-ranked causal pathways. The denominator is a normalising factor which represents the minimum possible weighted ranking score, with $r_{k_1} = 1, r_{k_2} = 2, \dots, r_{k_{|C^*_{100}|}} = |C^*_{100}|$, ensuring that $R^*$ attains its minimum value of 1 when the pathways in $C^*_{100}$ are optimally ranked. Higher values of $R^*$ indicate suboptimal ranking. $R^*$ takes no account of the possibility that $C^*_{100} \ne C$, i.e. not all causal pathways are ranked. To do this we form the adjusted measure

   (18)   $R = \begin{cases} R^*/p_{100} & \text{if } p_{100} > 0 \\ \gamma & \text{if } p_{100} = 0 \end{cases}$

   $R$ thus attains a minimum value of 1 when all causal pathways are optimally ranked, and the value $\gamma$ when no causal pathways are ranked.
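The three measures are simple to compute once the ranks of the causal pathways are known. The following sketch implements equations (16) to (18) directly; the function name `ranking_measures` and its argument names are hypothetical, and the value of $\gamma$ must be supplied by the caller since no value is stated here.

```python
import math

def ranking_measures(causal_ranks, n_causal, gamma, top=100):
    """Return (highest rank, ranking power p, adjusted score R).

    causal_ranks: 1-based ranks of the causal pathways that were ranked at all.
    n_causal:     |C|, the total number of causal pathways.
    gamma:        value assigned to R when no causal pathway is in the top ranks.
    """
    top_ranks = sorted(r for r in causal_ranks if r <= top)
    if not top_ranks:
        # No causal pathway in the top ranks: power is zero and R = gamma.
        return None, 0.0, gamma
    highest = top_ranks[0]
    p = len(top_ranks) / n_causal
    # Equation (17): observed weighted score over the best achievable one.
    r_star = (sum(math.sqrt(r) for r in top_ranks)
              / sum(math.sqrt(k) for k in range(1, len(top_ranks) + 1)))
    return highest, p, r_star / p
```

For example, with two causal pathways ranked 4 and 150, only the first falls in the top 100, so $p_{100} = 0.5$, $R^* = \sqrt{4}/\sqrt{1} = 2$ and $R = 4$.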

3 Simulation Study

We assess the power of our proposed method in a simulation study using real genotype and pathway

data, with simulated, quantitative phenotypes generated under an additive genetic model from

SNPs within a single, selected causal pathway. The presence of overlapping SNPs means that

the actual number of causal pathways is typically greater than one. We additionally compare

our method’s performance with an alternative, univariate-based method commonly used in gene

set analysis. Computation times for both methods increase with P, and because of this, and the

large number of scenarios and simulations tested, we restrict this analysis to SNPs on a single

chromosome to keep execution times within practical limits.

3.1 Genotype and pathways data

We use genotypes obtained from the Alzheimer’s Disease Neuroimaging Initiative, ADNI (www.loni.ucla.edu/ADNI), derived from the Illumina Human 610-Quad BeadChip. Subjects comprise

a mix of healthy controls, those diagnosed as having mild cognitive impairment, and those with

AD. After removing variants with a call rate < 95%, minor allele frequency (MAF) < 0.1 and

significant deviation from Hardy-Weinberg equilibrium ($p < 5.7 \times 10^{-7}$), 448,294 SNPs remain. In

this study we use genotype data from N = 743 subjects, and consider only SNPs from chromosome

1 (33,850 SNPs).

Popular databases used for the mapping of genes to biological pathways include the Kyoto

Encyclopedia of Genes and Genomes (KEGG, www.genome.jp/kegg/pathway.html) and BioCarta

(www.biocarta.com/genes/index.asp). For this study we use data on ‘canonical pathways’ from

the Molecular Signatures Database (MSigDB, www.broadinstitute.org/gsea/msigdb/index.jsp),

which is a commonly-used, curated collection of pathways obtained from multiple sources. At the

time of writing this comprised 880 pathways mapped to 6,804 genes. 2,382 human gene locations

on chromosome 1, corresponding to assembly GRCh37.p3 are obtained using Ensembl’s biomart

API (www.biomart.org). ADNI-genotyped SNPs on chromosome 1 are then mapped to annotated

genes within 10kb (20,399 SNPs mapped to 2,096 genes), and these remaining genes and SNPs

are then mapped to pathways using MSigDB (8,102 SNPs mapped to 778 pathways). Thus we

see that the majority of chromosome 1 SNPs fail to map to any pathway, but that the majority

of annotated pathways map to at least 1 SNP on this chromosome. Finally, small (< 10 SNPs)

and identical pathways are removed. After all pre-processing we are left with a total of P = 8,078

SNPs mapped to 551 pathways (max: 1,059; min: 10; mean: 120 ± 142 SNPs per pathway). All

SNP to pathway mapping and filtering was performed using bespoke code written in Python. The

mapping and filtering process is illustrated in Fig. 2.

More than 80% of SNPs are observed to overlap more than 1 pathway, with around 20%

overlapping 10 or more pathways and 2% overlapping 60 or more (see Fig. 3). After variable

expansion to account for overlapping pathways (section 2.1), we have P∗= 66,085 SNPs.
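The filtering and overlap expansion steps can be sketched as follows. This is an illustrative reconstruction, not the bespoke code referred to above. The function names are hypothetical, `pathways` is assumed to be a dictionary mapping a pathway name to column indices of the genotype matrix, and one copy of each set of identical pathways is assumed to be retained. Expansion is taken to mean duplicating a SNP column once for every pathway that contains it.

```python
import numpy as np

def filter_pathways(pathways, min_snps=10):
    """Drop pathways with fewer than `min_snps` SNPs, then keep one copy of
    any pathways that map to an identical set of SNPs (an assumption)."""
    kept, seen = {}, set()
    for name, snps in pathways.items():
        key = frozenset(snps)
        if len(key) >= min_snps and key not in seen:
            seen.add(key)
            kept[name] = sorted(key)
    return kept

def expand_design(X, pathways):
    """Duplicate each SNP column once per pathway containing it.

    Returns the expanded matrix and, for each pathway, the indices of its
    columns in the expanded matrix (groups no longer overlap).
    """
    cols, groups, start = [], [], 0
    for snps in pathways.values():
        cols.extend(snps)
        groups.append(np.arange(start, start + len(snps)))
        start += len(snps)
    return X[:, cols], groups
```

After expansion the number of columns equals the sum of pathway sizes, which is how $P = 8{,}078$ distinct SNPs become $P^* = 66{,}085$ columns when most SNPs belong to several pathways.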

3.2 Simulation framework

We begin by adjusting the pathway weight vector, w, using the bias-adjusted adaptive weighting

procedure described in section 2.3. We do this over 10 iterations with R = 40,000 MC simulations,

each with response y sampled from a standard normal distribution, N(0,1) for simplicity, since

many quantitative traits are expected to follow a normal distribution.

For the simulation of a SNP-dependent response, we begin by drawing S SNPs from a single,

randomly selected causal pathway, Gφ, according to some specified distribution (see below), and

then form the set C, of causal pathways that contain all the members of S. We thus chose to define C in its most restricted sense, rather than for example including pathways that contain one or more SNPs in S. Note that the number, |C|, of causal pathways will vary according to the particular distribution of overlaps within S.

Figure 2: SNP to pathway mapping. [Flowchart. Inputs: genotypes (ADNI, chromosome 1; 33,850 SNPs), genes (GRCh37.p3, chromosome 1; 2,382 genes) and pathways (880 pathways containing 6,804 distinct genes). SNP to gene mapping: 20,399 SNPs mapped to 2,096 genes within 10kbp. SNP to pathway mapping: 8,102 SNPs mapped to 778 pathways. Remove pathways with < 10 mapped SNPs (130 pathways) and pathways with identical SNPs (97 pathways): P = 8,078 SNPs mapped to 551 pathways. Overlap expansion: P* = 66,085 SNPs mapped to 551 pathways.]

For each simulation, a univariate quantitative phenotype $y$ is simulated using an additive model,

$y_i = \sum_{k \in S} \zeta_k x_{ik} + \epsilon_i$

where $\zeta_k$ is the allelic effect per minor allele due to causal SNP $k$. Setting $w_k = \zeta_k x_k$, we define the effect size of SNP $k$ as $\delta_k = E(w_k)/E(y)$ for $k \in S$, and set $\epsilon \sim N(1, \sigma_\epsilon^2)$ so that $\delta_k = 0$ when $\zeta = 0$. We also record the average SNP effect size as a proportion of total phenotypic variance, $ES_k = \mathrm{Var}(w_k)/\mathrm{Var}(y)$, and the mean proportionate change in response per minor allele, $E(\zeta_k)$. For our simulations we control $\delta_k$, and set $\zeta_k$ accordingly, so that effect size is independent of SNP MAF, whereas $\zeta_k$ and $ES_k$ are MAF-dependent.

The power and specificity of any PGAS method is likely to depend on a range of factors including the number of causal pathways to be identified, the number and distribution of causal SNPs, and the size of their phenotypic effect (Wang et al., 2010; Fridley and Biernacka, 2011). We therefore assess the performance of our method across 6 different scenarios in which we vary each of these factors. Furthermore, we test each scenario over 500 MC simulations to account for variation in causal SNP MAFs, gene size and number within pathways, and LD patterns within and between causal pathways.
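To make the link between the controlled effect sizes $\delta_k$ and the allelic effects $\zeta_k$ concrete, the sketch below solves $\delta_k = E(w_k)/E(y)$ for $\zeta_k$ and then simulates a phenotype. The closed form is a derivation made for this example, not taken from the text. Writing $a_k = E(w_k) = \zeta_k E(x_k)$ and using $E(\epsilon) = 1$ gives $\delta_k = a_k / (1 + \sum_j a_j)$, hence $a_k = \delta_k / (1 - \sum_j \delta_j)$. The function names are hypothetical, and $E(x_k)$ is replaced by the sample mean of each causal SNP's minor allele count.

```python
import numpy as np

def allelic_effects(x_causal, delta):
    """Solve delta_k = E(w_k) / E(y) for zeta_k, assuming E(eps) = 1.

    x_causal: (N, S) minor allele counts (0, 1 or 2) for the causal SNPs.
    delta:    length-S target effect sizes, which must sum to less than 1.
    """
    delta = np.asarray(delta, dtype=float)
    if delta.sum() >= 1:
        raise ValueError("effect sizes must sum to less than 1")
    mean_x = x_causal.mean(axis=0)            # sample estimate of E(x_k)
    return delta / ((1.0 - delta.sum()) * mean_x)

def simulate_phenotype(x_causal, delta, sigma, rng):
    """Additive model y = sum_k zeta_k x_k + eps, with eps ~ N(1, sigma^2)."""
    zeta = allelic_effects(x_causal, delta)
    eps = rng.normal(1.0, sigma, size=x_causal.shape[0])
    return x_causal @ zeta + eps, zeta
```

Because $\zeta_k$ is inversely proportional to $E(x_k)$, rarer variants receive larger per-allele effects for the same $\delta_k$, which matches the statement that $\zeta_k$ is MAF-dependent while $\delta_k$ is not.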