
Genetic Epidemiology 36:451–462 (2012)

Reprioritizing Genetic Associations in Hit Regions Using

LASSO-Based Resample Model Averaging

William Valdar,1†∗ Jeremy Sabourin,2† Andrew Nobel,2 and Christopher C. Holmes3

1Department of Genetics, and Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina

2Department of Statistics and Operations Research, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina

3Department of Statistics, Oxford, United Kingdom

Significance testing one SNP at a time has proven useful for identifying genomic regions that harbor variants affecting

human disease. But after an initial genome scan has identified a “hit region” of association, single-locus approaches can falter.

Local linkage disequilibrium (LD) can make both the number of underlying true signals and their identities ambiguous.

Simultaneous modeling of multiple loci should help. However, it is typically applied ad hoc: conditioning on the top SNPs,

with limited exploration of the model space and no assessment of how sensitive model choice was to sampling variability.

Formal alternatives exist but are seldom used. Bayesian variable selection is coherent but requires specifying a full joint

model, including priors on parameters and the model space. Penalized regression methods (e.g., LASSO) appear promising

but require calibration, and, once calibrated, lead to a choice of SNPs that can be misleadingly decisive. We present a general

method for characterizing uncertainty in model choice that is tailored to reprioritizing SNPs within a hit region under strong

LD. Our method, LASSO local automatic regularization resample model averaging (LLARRMA), combines LASSO shrinkage

with resample model averaging and multiple imputation, estimating for each SNP the probability that it would be included

in a multi-SNP model in alternative realizations of the data. We apply LLARRMA to simulations based on case-control

genome-wide association studies data, and find that when there are several causal loci and strong LD, LLARRMA identifies

a set of candidates that is enriched for true signals relative to single locus analysis and to the recently proposed method of

Stability Selection. Genet. Epidemiol. 36:451–462, 2012.

Key words: GWAS; case-control; genotype imputation; model averaging; LASSO; Stability Selection

© 2012 Wiley Periodicals, Inc.

Supporting Information is available in the online issue at wileyonlinelibrary.com.

†These authors contributed equally to this work.

∗Correspondence to: William Valdar, Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599-7265.

E-mail: william.valdar@unc.edu

Received 11 October 2011; Revised 21 March 2012; Accepted 21 March 2012

Published online 30 April 2012 in Wiley Online Library (wileyonlinelibrary.com/journal/gepi).

DOI: 10.1002/gepi.21639

INTRODUCTION

Single locus regression has become a staple tool of hu-

man genome-wide association studies (GWAS; WTCCC

[2007]). Despite the fact that it simplistically reduces the

often complex genetic architecture of a phenotype down

to effects at an individual single nucleotide polymor-

phism (SNP) (or other localized variant), it has proved

powerful in identifying major genetic determinants and

predictors of disease susceptibility [Cantor et al., 2010].

Many would acknowledge that simultaneous modeling

of all loci potentially yields fairer estimates of genetic

effect, more stable phenotypic predictions, and better

characterization of between-locus confounding [Hoggart

et al., 2008; Lee et al., 2008]. However, such multiple lo-

cus approaches are at present seldom used. This could be

because they are considered impractical, potentially hard

for readers to understand, or, with some theoretical sup-

port [Fan and Lv, 2008], unnecessary in an initial genome

scan. Certainly, much of the genome-wide confounding that

explicit multiple locus modeling would hope to resolve is

efficiently, if bluntly, dealt with by the addition of regres-

sion covariates correcting for higher order geometric rela-

tionships in the data [Price et al., 2010] or probabilistically

inferred strata [Pritchard et al., 2000].

Nonetheless, once initial genome scans have been per-

formed and “hit regions” of association identified, short-

comings of a single-locus approach become apparent. Local

patterns of linkage disequilibrium (LD) in such hit regions

can make ambiguous both the number of underlying true

signals and the identity of the loci that most directly give

rise to them [e.g., Strange et al., 2010]. Statistical analysis

after this point is often ad hoc. It typically involves fitting

further regressions that condition on “top” loci that appear

most strongly associated in order to rule out neighbors or

rule in suspicions of an independent second signal [Barratt

et al., 2004; Ueda et al., 2003]. This is followed by more interpretive

analysis based on annotation as a prelude to, for

example, investigation at the bench. In ad hoc condition-

ing, rarely is there formal consideration of the fact that the

association of the top locus is often insignificantly different

from that of its correlated neighbors, and that whereas its


association with the phenotype is probably stable to sam-

pling error, its superiority in association over its neighbors

is probably not. This inherent instability of the relative

strengths of association between confounding loci makes

such strategies high risk: a slightly different sampling of in-

dividuals could demote the conditioning locus, result in an

alternative conditioning locus being chosen, and potentially

lead to altered conclusions. This approach becomes yet more

unstable when some of the loci are themselves known with

varying certainty, their genotypes having been partially or

wholly imputed [Zheng et al., 2011], such that weakness

of association is now also a function of imputation uncertainty

unrelated to the phenotype [e.g., Servin and Stephens,

2007].

There is thus great value in developing principled ap-

proaches to discriminate true from false signals in hit re-

gions. Joint modeling of all loci through multiple regression

seems attractive because it accounts for the LD of the data

[Balding, 2006]. Standard regression is unsuitable for this

purpose, however, because even when the number of con-

sidered loci p is much fewer than the number of individ-

uals n, LD creates multicollinearity that derails meaning-

ful estimation of locus effects. Stepwise multiple regression

techniques [Cordell and Clayton, 2002] formalize the ad

hoc conditioning approach but also inherit its weaknesses:

model selection choosing a single set of active loci typically

provides no indication about how sensitive that choice was

to, for example, sampling variability, making it a statistic

that is opaque at best and misleading at worst. Bayesian

approaches offer a coherent perspective by formally ac-

counting for uncertainty in model choice, effect estimation,

and imputation uncertainty [Stephens and Balding, 2009].

Nonetheless, these are often highly computationally intensive,

and require formal statements of prior belief relating to

the number of causal variants and their effects that analysts

may feel unprepared or unwilling to specify.

Penalized regression models can provide an alternative

that does not require a commitment to Bayesian learning.

Placing a penalty on the size of coefficients in the multiple regression

leads to moderated estimates of coefficient effects,

allowing their stable estimation even when many predic-

tors are in the model. In particular, the LASSO [Tibshirani,

1996], which penalizes increases in the absolute value of

each coefficient subject to a penalty parameter λ, results in

some effects being shrunk to exactly zero. The result is a

“sparse” model in which only a subset of effects are ac-

tive. Increasing the level of penalization leads to greater

sparsity, effectively making λ a continuous model selection

parameter. Recent advances in fitting LASSO-type models

have made them more practical for analysis of large-scale

genetic data [e.g., Wu et al., 2009]. Nonetheless, as a tool for

modeling effects at multiple loci, the LASSO leaves important

questions unanswered. One problem is how to select λ.

This is typically approached through criteria-based evaluation

methods [Wu et al., 2009; Zhou et al., 2010], such as AIC

and BIC, empirical measures of predictive accuracy (such

as cross-validation [Friedman et al., 2010]), or criteria aim-

ing to control type I error (such as permutation [Ayers and

Cordell, 2010]). Another problem is, given λ, how to characterize

uncertainty in model choice. Although LASSO moderates

estimated effects through shrinkage, it is no better

than stepwise methods in that it ultimately selects a single

model (or single “path” of models, when λ is varied), and

thus states with absolute confidence a statistic that could in

fact be highly sensitive to the sampling of observations.

An intuitive way to characterize variability of model

choice is to estimate for each locus a model inclusion prob-

ability (MIP). A Bayesian approach would formulate this as

a posterior probability that conditions on both the observed

data and prior uncertainty in model choice. The Bayesian

MIP embodies a statement about whether the researcher

should believe the locus is included in the true model. A

frequentist alternative is to formulate the MIP as the prob-

ability a locus would be included in a sparse model under

an alternative realization of the data. This frequentist MIP

is thus a statement about the expected long-run behavior

of the model selection procedure. Valdar et al. [2009] proposed

an approach that applied forward selection of genetic

loci to resamples of the data and defined the resample MIP

(RMIP) as the proportion of resampled datasets for which a

locus was selected. This resample model averaging (RMA) approach used either bootstrapping (i.e., “bagging”) or subsampling (i.e., “subagging”), and followed an earlier application to genome-wide association in Valdar et al. [2006] and work on general aggregation methods by Breiman [1996] and Bühlmann and Yu [2001] (cf. parallel applied work by Austin and Tu [2004] and Hoh et al. [2000]). Independently, Meinshausen and Bühlmann [2010] proposed “Stability Selection” (SS) that powerfully combines subagging with LASSO shrinkage to produce a set of frequentist MIPs at each specified λ. Recently, Alexander and Lange [2011]

adapted this method with limited success to whole-genome

association.

Herein, we propose a statistical method for reprioritizing

genetic associations in a hit region of a human GWAS based

on case-control data that exploits and extends the resam-

ple aggregation techniques developed in Valdar et al. [2009]

and Meinshausen and Bühlmann [2010]. We demonstrate a

principled approach, LASSO local automatic regularization

resample model averaging (LLARRMA), that characterizes

sensitivity of locus choice to sampling variability and un-

certainty due to missing genotype data, and that provides

LASSO shrinkage automatically regularized through either

predictive- or discovery-based criteria. We show that when

multiple correlated SNPs are present in a hit region that has

been identified by standard single-locus regression, LLAR-

RMA produces a reprioritization that is enriched for true

signals.

METHODS

We start by considering a standard logistic regression to estimate the effects of m SNPs in a hit region on a case-control outcome in n individuals, and then describe statistical approaches to identify a subset $m_q$ of SNPs that represent true signals. Herein, we define a “true signal” as the SNP that most strongly tags an underlying causal variant, a “background” SNP as an SNP that is not a true signal, and an optimal analysis as one that distinguishes true signals from background SNPs in the hit region. We assume that the hit region has been previously identified by an initial genome-wide screen using, for example, single-locus regression, that many of the m SNPs may be in high LD, and that $m_q < m < n$. Let $y = (y_1, \ldots, y_n)$ be an n-vector of the dichotomous response with each of the $n_1$ cases coded by 1 and the $n_0$ controls coded by 0, let X be an n × m matrix of SNP genotypes, where SNPs are coded to reflect additive-only effects as {0,1,2} for unphased genotypes {qq,qQ,QQ}, and let D = {y,X} and N = {1,...,n}. Logistic regression


models the case-control status of individual i as if sampled from $Y_i \sim \mathrm{Bin}(1, p_i)$, where i’s propensity $p_i = P(Y_i = 1)$ is determined by a linear function of the m SNP predictors

$$\mathrm{logit}(p_i) = \log\left(\frac{p_i}{1 - p_i}\right) = \mu + \sum_{j=1}^{m} \beta_j x_{ij}, \qquad (1)$$

where $x_{ij}$ is the value of the jth SNP for the ith individual and the ijth element of the column-centered design matrix X, $\mu$ is the intercept, and $\beta = (\beta_1, \ldots, \beta_m)$ are the effects of the m predictors.

We assume that only a subset of the m SNPs are true signals, and define a corresponding vector of 0–1 inclusions $\gamma = (\gamma_1, \ldots, \gamma_m)$ such that $\gamma_j = I(\beta_j \neq 0)$. A common way to infer $\gamma$, and to thereby estimate the identity of the true signal, is to use a model selection procedure that maximizes some criterion of fit. This returns a binary vector $\hat{\gamma}$, a hard estimate declaring which SNPs belong to the model. Although superficially attractive, $\hat{\gamma}$ has limited interpretability because it provides no information about how sensitive the selection could have been to finite sampling, that is, whether $\hat{\gamma}$ would be expected to vary dramatically when applied to alternative samples from the same population. Moreover, although many selection procedures guarantee that they will deliver the correct result in an infinite sample (i.e., are consistent), this offers little reassurance when the sample is finite, and suggests that the returned statistic $\hat{\gamma}$ could have high variance.

LLARRMA

Resample Model Averaging. We seek to estimate $\gamma$ in a way that incorporates uncertainty in model choice arising through, for example, potential variability of the selected set due to finite sampling. To do this we use RMA [Valdar et al., 2009], applying a model selection procedure to repeated resamples of the data, and basing subsequent inference on the aggregate of those results. Rather than obtaining a binary estimate of each $\gamma_j$, we instead seek to estimate its expectation $E(\gamma_j)$ over resamples, hoping to approximate its expectation over samples from the population. We start by drawing subsamples $k = 1, \ldots, K$ with subsampling proportion $\psi = 2/3$, such that each subsample comprises data $D^{(k)} = \{y^{(k)}, X^{(k)}\}$ on $|N^{(k)}| = \psi n$ individuals $N^{(k)} \subset N$. Each subsample is produced by drawing $\psi n_1$ individuals at random without replacement from the $n_1$ cases, and $\psi n_0$ individuals at random without replacement from the $n_0$ controls. For each subsample k, we perform a fixed model selection procedure to estimate $\hat{\gamma}(D^{(k)}) = \hat{\gamma}^{(k)}$, the m-length binary vector of inclusions based on the kth subsample. Applying this to all subsamples gives the K × m matrix $\Gamma$, where $\Gamma^{T} = [\hat{\gamma}^{(1)}, \hat{\gamma}^{(2)}, \cdots, \hat{\gamma}^{(K)}]$. The expected proportion of times that the jth predictor is included in the model is given by its RMA estimate

$$\widehat{\mathrm{RMIP}}_j = \frac{1}{K} \sum_{k=1}^{K} \hat{\gamma}(D^{(k)})_j = \frac{1}{K} \sum_{k=1}^{K} \hat{\gamma}^{(k)}_j = \frac{1}{K} \sum_{k=1}^{K} \Gamma_{kj}, \qquad (2)$$

which we refer to as its RMIP.
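The subsampling scheme and Equation 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function names, the use of NumPy, and the stand-in selector `top_correlated` (which simply picks the single SNP with the largest marginal covariance with the response) are assumptions introduced here. In LLARRMA the selector would instead be the LASSO with a locally chosen penalty.

```python
import numpy as np

def stratified_subsample(y, proportion, rng):
    """Indices of one subsample: the same proportion of cases and of
    controls, each drawn at random without replacement."""
    cases = np.flatnonzero(y == 1)
    controls = np.flatnonzero(y == 0)
    n_cases = int(round(proportion * cases.size))
    n_controls = int(round(proportion * controls.size))
    chosen_cases = rng.choice(cases, size=n_cases, replace=False)
    chosen_controls = rng.choice(controls, size=n_controls, replace=False)
    return np.sort(np.concatenate([chosen_cases, chosen_controls]))

def rmip(X, y, select, K=100, proportion=2 / 3, seed=0):
    """Resample model inclusion probabilities (Equation 2).
    `select(X_sub, y_sub)` must return a 0/1 vector of length m."""
    rng = np.random.default_rng(seed)
    Gamma = np.zeros((K, X.shape[1]))          # K x m matrix of inclusions
    for k in range(K):
        idx = stratified_subsample(y, proportion, rng)
        Gamma[k] = select(X[idx], y[idx])
    return Gamma.mean(axis=0)                  # column means give the RMIPs

def top_correlated(X_sub, y_sub):
    """Hypothetical stand-in selector: includes only the SNP whose centered
    genotypes have the largest absolute covariance with the response."""
    Xc = X_sub - X_sub.mean(axis=0)
    score = np.abs(Xc.T @ (y_sub - y_sub.mean()))
    gamma = np.zeros(X_sub.shape[1], dtype=int)
    gamma[np.argmax(score)] = 1
    return gamma
```

Because the selector is passed in as a function, the same averaging loop applies unchanged to any fixed model selection procedure.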

Selection Within a Subsample Using the Lasso.

To select SNPs within the kth subsample, we use LASSO

penalized regression [Tibshirani, 1996]. This estimates $\beta$ for subsample k as

$$\hat{\beta}(\lambda; D^{(k)}) = \arg\min_{\beta} \left\{ -\ell(\beta; D^{(k)}) + \lambda \sum_{j=1}^{m} |\beta_j| \right\}, \qquad (3)$$

where $\ell(\beta; D^{(k)})$ is the log-likelihood of $\beta$ for data $D^{(k)}$, and $\lambda$ is a penalty parameter. The LASSO estimate $\hat{\beta}(\lambda; D^{(k)})$ easily translates into an estimate of the inclusions $\hat{\gamma}(\lambda; D^{(k)}) = I(\hat{\beta}(\lambda; D^{(k)}) \neq 0)$. Nonetheless, to arrive at a single estimate of $\gamma$, as required for model averaging, we must devise a suitable criterion for choosing the penalty $\lambda$. We propose two alternatives, both of which identify a value $\lambda^{(k)}$ specific to subsample k (i.e., local): complement deviance selection and permutation selection.
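To make the link between the penalized estimate and the inclusion vector concrete, the following Python sketch fits an L1-penalized logistic regression by proximal gradient descent (ISTA). It is a minimal illustration under stated assumptions, not the fitting algorithm used in the paper: the solver, the function names, and the iteration count are introduced here, and the objective is scaled as the mean negative log-likelihood plus the penalty, a common software convention that rescales the penalty by the sample size relative to Equation 3.

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t * |z|; shrinks small values to exactly zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_logistic(X, y, lam, n_iter=2000):
    """Minimize mean negative log-likelihood + lam * sum(|beta_j|) with an
    unpenalized intercept, using proximal gradient steps (assumed solver)."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)                    # column-centered design
    mu, beta = 0.0, np.zeros(m)
    # The logistic Hessian is bounded by A'A / (4n), so 4n / ||A||^2 is a safe step.
    A = np.column_stack([np.ones(n), Xc])
    step = 4.0 * n / (np.linalg.norm(A, 2) ** 2)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(mu + Xc @ beta)))
        resid = p - y
        mu -= step * resid.mean()              # plain gradient step for the intercept
        beta = soft_threshold(beta - step * (Xc.T @ resid) / n, step * lam)
    return mu, beta

def inclusions(beta):
    """0/1 inclusion vector: 1 where the estimated effect is nonzero."""
    return (beta != 0).astype(int)
```

Soft thresholding is what produces exact zeros, so the inclusion vector can be read directly from the fitted coefficients without any further cut-off.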

Predictive-Based Choice of $\lambda^{(k)}$: Complement Deviance Selection. The complement deviance criterion seeks a model that would perform well in out-of-sample prediction. After estimating $\hat{\beta}(\lambda; D^{(k)})$ over a grid of $\lambda$ to calculate the LASSO path, this criterion finds the value of $\lambda$ that minimizes the deviance of the complement of subsample k, i.e.,

$$\hat{\lambda}^{(k)}_{\mathrm{CompDev}} = \arg\min_{\lambda} \left\{ -2 \sum_{i \in N^{(\setminus k)}} \left[ y_i \log(\hat{p}_{i,\lambda}) + (1 - y_i) \log(1 - \hat{p}_{i,\lambda}) \right] \right\},$$

where $N^{(\setminus k)} = N \setminus N^{(k)}$ is the set of $(1 - \psi)n$ individuals not selected for subsample k, and $\hat{p}_{i,\lambda}$ is the predicted probability of $P(Y_i = 1)$ based upon $\hat{\beta}(\lambda; D^{(k)})$ applied to the design matrix of the complement subsample $X^{(\setminus k)}$.
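The complement deviance criterion can be sketched in Python as below. The sketch assumes that a path of fits, one (intercept, coefficient vector) pair per grid value of the penalty, has already been estimated on subsample k, and that the complement design matrix has been centered in the same way as the subsample. The function names and the clipping constant `eps` are illustrative choices introduced here.

```python
import numpy as np

def binomial_deviance(y, p, eps=1e-12):
    """-2 times the Bernoulli log-likelihood of outcomes y under probabilities p."""
    p = np.clip(p, eps, 1 - eps)               # guard against log(0)
    return -2.0 * np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def complement_deviance_lambda(lambdas, fits, X_comp, y_comp):
    """Return the penalty whose fit (estimated on subsample k) gives the
    smallest deviance on the held-out complement, plus all deviances.
    `fits` is a list of (mu, beta) pairs aligned with `lambdas`."""
    deviances = []
    for mu, beta in fits:
        p_hat = 1.0 / (1.0 + np.exp(-(mu + X_comp @ beta)))
        deviances.append(binomial_deviance(y_comp, p_hat))
    best = int(np.argmin(deviances))
    return lambdas[best], np.array(deviances)
```

Since every subsample has its own complement, the selected penalty generally differs from one subsample to the next, which is the sense in which the choice is local.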

Discovery-Based Choice of $\lambda^{(k)}$: Permutation Selection. The permutation selection criterion is a modified version of that proposed by Ayers and Cordell [2010] and seeks a conservative model that would tend to include no SNPs under permutation of the response. Given a subsample k, we estimate for a given permutation of the response $\pi(y)$ the smallest penalty required to zero out all predictors, i.e.,

$$\lambda_{\mathrm{null}}(\pi, k) = \frac{1}{|N^{(k)}|} \max_j \left| \left\langle x^{(k)}_j, \pi(y^{(k)}) \right\rangle \right|,$$

where $x^{(k)}_j$ is the jth column of the subsampled and mean-centered design matrix $X^{(k)}$, and $\langle \cdot, \cdot \rangle$ denotes the inner product of its two arguments. Calculating this for each of S permutations $\pi_1, \ldots, \pi_S$, we estimate the permutation selection $\lambda$ for subsample k as

$$\hat{\lambda}^{(k)}_{\mathrm{Perm}} = \mathrm{median}(\{\lambda_{\mathrm{null}}(\pi_1, k), \lambda_{\mathrm{null}}(\pi_2, k), \ldots, \lambda_{\mathrm{null}}(\pi_S, k)\}). \qquad (4)$$

Ayers and Cordell [2010] apply a similar criterion when analyzing complete datasets, with the difference that they estimate $\hat{\lambda}_{\mathrm{null}}$ as the maximum of $\{\lambda_{\mathrm{null}}(\pi_1), \ldots, \lambda_{\mathrm{null}}(\pi_S)\}$ for S = 25. We prefer not to do this because the maximum is relatively unstable for S = 25, and is undesirable for larger S because it potentially allows $\hat{\lambda}_{\mathrm{null}} = \lambda_{\mathrm{null}}(\pi_s)$ where $\pi_s(y) = y$. In contrast, when using the median (Equation 4) the accuracy of $\hat{\lambda}_{\mathrm{Perm}}$ increases with S, although we find that in simulations S = 20 is adequate.
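Permutation selection needs no model fitting, only inner products, as the following Python sketch shows. The function names and the seeding scheme are assumptions made for illustration.

```python
import numpy as np

def lambda_null(X_sub, y_perm):
    """Smallest penalty that zeroes out every predictor for a (permuted)
    response: max_j |<x_j, y>| / n, with x_j the mean-centered columns."""
    Xc = X_sub - X_sub.mean(axis=0)
    return np.max(np.abs(Xc.T @ y_perm)) / X_sub.shape[0]

def permutation_lambda(X_sub, y_sub, S=20, seed=0):
    """Median of lambda_null over S random permutations of the response (Equation 4)."""
    rng = np.random.default_rng(seed)
    nulls = [lambda_null(X_sub, rng.permutation(y_sub)) for _ in range(S)]
    return float(np.median(nulls))
```

Because the columns are mean-centered, the inner product with the raw response equals the inner product with the centered response, so the response itself does not need to be centered.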


Fig. 1. A comparison of LLARRMA and Stability Selection.

Incorporating Uncertainty Due to Missing Geno-

types: Hard, Dosage, and Multiple Imputation.

SNP data within a hit region will often include combinations of markers and individuals for which the genotype is unknown or uncertain. To avoid a potentially wasteful complete cases analysis, it is common to impute the missing genotypes using a program such as MACH [Li et al., 2010], IMPUTE [Howie et al., 2009], or fastPHASE [Scheet and Stephens, 2006], and analyze the partly imputed data as if it were fully observed. Imputation methods are typically based on reconstruction and phasing of inferred haplotypes. Dividing the SNP matrix X into missing and observed elements $X = \{X_{\mathrm{mis}}, X_{\mathrm{obs}}\}$, methods such as fastPHASE [Scheet and Stephens, 2006] model the joint distribution $p(X_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta)$, where $\theta$ includes additional information used in the imputation (e.g., priors). Most GWAS, however, do not use this joint distribution directly. Rather, they replace $X_{\mathrm{mis}}$ with a point estimate $\hat{X}_{\mathrm{mis}}$, each element of which is constructed from its marginal distributions. Specifically, $X_{\mathrm{mis}}$ is replaced by either the “dosage,” $\hat{X}^{\mathrm{dose}}_{\mathrm{mis}}$, with elements defined as the expectation of the allele count $\hat{x}_{ij} = E(x_{ij} \mid X_{\mathrm{obs}}, \theta)$; or a “hard” imputation, $\hat{X}^{\mathrm{hard}}_{\mathrm{mis}}$, with elements imputed as their maximum a posteriori genotype

$$\hat{x}_{ij} = \arg\max_{g \in \{0,1,2\}} p(x_{ij} = g \mid X_{\mathrm{obs}}, \theta).$$

The simplest approach to modeling missing genotypes within LLARRMA is first to estimate $X_{\mathrm{mis}}$ as either $\hat{X}^{\mathrm{dose}}_{\mathrm{mis}}$ or $\hat{X}^{\mathrm{hard}}_{\mathrm{mis}}$ and then subsample $\hat{X} = \{\hat{X}_{\mathrm{mis}}, X_{\mathrm{obs}}\}$ as if it were complete. This plug-in approach underestimates variability because it fails to incorporate uncertainty about the imputation. Zheng et al. [2011] show that doing this when modeling effects at single loci reduces power by a negligible amount when the imputation accuracy is reasonably high. Nonetheless, ignoring imputation uncertainty could be more problematic in multiple-locus settings, if, for example, the posterior distribution of haplotypes $p(X_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta)$ differs substantially from the joint distribution implied by the product of marginal posteriors $\prod_{ij \in X_{\mathrm{mis}}} p(x_{ij} \mid X_{\mathrm{obs}}, \theta)$ [e.g., Servin and Stephens, 2007]. A natural way to incorporate imputation uncertainty into our resampling framework is through multiple imputation [Little and Rubin, 2002]. At each iteration k, we sample a new $X^{\star}_{\mathrm{mis}}$ from its posterior $p(X_{\mathrm{mis}} \mid X_{\mathrm{obs}}, \theta)$, subsample the resulting $X^{\star} = \{X^{\star}_{\mathrm{mis}}, X_{\mathrm{obs}}\}$ to give $\{X^{\star}, y\}^{(k)} = D^{\star(k)}$, and then calculate RMIPs using $\hat{\gamma}(D^{\star(k)})$ in place of $\hat{\gamma}(D^{(k)})$ in Equation 2. The resulting RMIPs incorporate additional variability because each subsample now includes a potentially different imputation of missing genotypes. We implement hard, dosage, and multiple imputation using posterior draws from fastPHASE (making use of the -s option).
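The three treatments of missing genotypes can be contrasted in a short Python sketch. It assumes, hypothetically, that R joint posterior draws of the complete genotype matrix are available as an integer array `draws` of shape (R, n, m), with observed genotypes identical across draws. This layout and the function names are illustrative and do not describe the output format of fastPHASE or of any other program. Dosage and hard calls are approximated here from the draws themselves.

```python
import numpy as np

def dosage_imputation(draws):
    """Posterior mean allele count per genotype, approximated by the
    average over draws (one plug-in matrix for all subsamples)."""
    return draws.mean(axis=0)

def hard_imputation(draws):
    """Most frequently sampled genotype per entry, an approximate maximum
    a posteriori call; ties resolve to the smaller genotype code."""
    counts = np.stack([(draws == g).sum(axis=0) for g in (0, 1, 2)])
    return counts.argmax(axis=0)

def multiple_imputation_draw(draws, rng):
    """One complete genotype matrix chosen at random, intended to be
    paired with a single subsample k before model selection."""
    return draws[rng.integers(draws.shape[0])]
```

Drawing a whole matrix at once, rather than sampling each entry from its marginal, is what keeps the dependence between neighboring imputed genotypes that the plug-in estimates discard.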

COMPETING METHODS

LLARRMA calculates a score (an RMIP) for each SNP in

a case-control study. We compare the ability of those scores

to discriminate true signals from background with the SNP

scores calculated by two alternatives: the traditional GWAS

approach of single-locus regression, and the LASSO-based

subsample model averaging method stability selection

(SS) recently proposed in a more general context by

Meinshausen and Bühlmann [2010].

Single Locus Regression. We perform single locus regression with logistic regression as used in, for example, PLINK [Purcell et al., 2007]. For each SNP, we fit a single-predictor version of Equation 1 and score its $-\log_{10} P$ (logP), where P is the P-value from a likelihood ratio test against an intercept-only model.

Stability Selection. SS differs from LLARRMA in two main respects (see Fig. 1). First, whereas LLARRMA selects variables within each subsample using a local (i.e., subsample-specific) penalty $\lambda^{(k)}$, SS uses a single global penalty $\lambda$ applied to all K subsamples. Second, whereas LLARRMA chooses each $\lambda^{(k)}$ automatically, SS leaves its global $\lambda$ as a free parameter. In SS, the RMIP (referred to as the “selection probability” in Meinshausen and Bühlmann [2010]) is thus left as a function of $\lambda$,

$$\widehat{\mathrm{RMIP}}_{\mathrm{SS}}(\lambda)_j = \frac{1}{K} \sum_{k=1}^{K} I(\hat{\beta}(\lambda; D^{(k)})_j \neq 0), \qquad (5)$$

giving rise to a sequence of RMIPs (a “stability path”) for each locus j. Meinshausen and Bühlmann [2010] provide little guidance for choosing $\lambda$. As a choice of $\lambda$ is required to produce a unique RMIP and thereby ensure meaningful comparison with LLARRMA, we select $\lambda$ to produce the stiffest possible competition: as the value that maximizes the criterion used for comparing methods. Specifically, given a criterion of success $u(\gamma, \hat{\gamma})$ comparing truth $\gamma$ with guess $\hat{\gamma}$, we define

$$\hat{\lambda}_{\mathrm{oracle}} = \arg\max_{\lambda} u(\gamma, \mathrm{RMIP}_{\mathrm{SS}}(\lambda)),$$


where “oracle” reflects the fact that choosing this unfairly advantageous value requires foreknowledge of $\gamma$. We consider SS with the oracle property defined by setting u to be the initial area under the curve (AUC) (described below).
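The stability path of Equation 5 and the oracle choice of penalty can be sketched in Python as follows. The sketch assumes that LASSO estimates have already been computed for every subsample and every penalty on a shared grid and stored in an array `coefs` of shape (K, L, m). That layout, the function names, and the simple criterion `mean_gap` used in the usage note are hypothetical, and `mean_gap` is only a stand-in for the initial AUC used in the paper.

```python
import numpy as np

def stability_path(coefs):
    """Selection frequencies under a global penalty (Equation 5).
    coefs: (K subsamples, L penalties, m SNPs) -> (L, m) frequencies."""
    return (coefs != 0).mean(axis=0)

def oracle_lambda(lambdas, path, gamma_true, criterion):
    """Penalty maximizing criterion(gamma_true, scores); requires knowing
    the true inclusions, so it is usable only in simulation."""
    scores = [criterion(gamma_true, path[l]) for l in range(len(lambdas))]
    best = int(np.argmax(scores))
    return lambdas[best], path[best]

def mean_gap(gamma_true, scores):
    """Hypothetical criterion: mean score of true signals minus that of background."""
    truth = np.asarray(gamma_true, dtype=bool)
    return scores[truth].mean() - scores[~truth].mean()
```

For example, `oracle_lambda(lambdas, stability_path(coefs), gamma_true, mean_gap)` returns the grid penalty with the best separation together with the corresponding vector of selection frequencies.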

ROC-BASED EVALUATION

We assess the performance of LLARRMA and its com-

petitors by simulation, examining the ability of each to dis-

criminate true signals from background in simulated case-

control studies. Performance is evaluated formally using

receiver operator characteristic (ROC) curves. ROC curve

methodology can vary between studies [Krzanowski and

Hand, 2009], so we describe ours in full. A given simulation

study comprises a set of simulation trials $\mathcal{S} = \{1, \ldots, S\}$. In each trial s, a given method is presented with m SNPs of which $m_q$ are true signals. That method calculates a single score for each SNP (an RMIP or logP). For a given threshold t, define $\mathrm{power}_s(t)$ as the proportion of the $m_q$ true signal SNPs scoring ≥ t (i.e., the power to detect), and the false-positive rate $\mathrm{FPR}_s(t)$ as the proportion of the $m - m_q$ background SNPs scoring ≥ t. We define the area under curve in trial s for FPRs between a and b as

$$\mathrm{AUC}_s(a, b) = \int_a^b \mathrm{power}_s\left(\mathrm{FPR}_s^{-1}(x)\right) dx,$$

where $\mathrm{FPR}_s^{-1}(x)$ returns the threshold t at which the FPR is x, and the integration is approximated using the trapezium rule. For a given method and set of simulations $\mathcal{S}$, we define the estimated AUC between FPR a and b as $\mathrm{AUC}(a, b) = S^{-1} \sum_{s \in \mathcal{S}} \mathrm{AUC}_s(a, b)$, and assume this estimate to be approximately normally distributed with variance $(S - 1)^{-1} \sum_{s \in \mathcal{S}} \left[ \mathrm{AUC}(a, b) - \mathrm{AUC}_s(a, b) \right]^2$. We define the “initial ROC” as the ROC curve in the range FPR ∈ [0,0.05], and the “initial AUC” as AUC(0,0.05); the “full ROC” is where FPR ∈ [0,1] and the “full AUC” is AUC(0,1). When plotting ROC curves for each method, we use threshold averaging [Fawcett, 2006], varying t over its range ([0,1] for RMIPs; [0,∞) for logP) and at each t plotting x and y coordinates $S^{-1} \sum_{s \in \mathcal{S}} \mathrm{FPR}_s(t)$ and $S^{-1} \sum_{s \in \mathcal{S}} \mathrm{power}_s(t)$, respectively.
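The per-trial quantities above can be computed with a few lines of Python. The sketch below is one possible reading of the definitions, not the authors' code: it builds the empirical ROC polyline from the distinct score values and integrates it with the trapezium rule, clipping each segment to the requested FPR range. The function names are assumptions, and vertical segments (where only power changes) contribute no area.

```python
import numpy as np

def roc_points(scores, is_true):
    """(FPR, power) at every distinct score threshold, from strictest to
    most lenient, starting at the origin."""
    scores = np.asarray(scores, dtype=float)
    is_true = np.asarray(is_true, dtype=bool)
    thresholds = np.unique(scores)[::-1]
    power = [0.0] + [np.mean(scores[is_true] >= t) for t in thresholds]
    fpr = [0.0] + [np.mean(scores[~is_true] >= t) for t in thresholds]
    return np.array(fpr), np.array(power)

def partial_auc(fpr, power, a=0.0, b=1.0):
    """Trapezium-rule area under the ROC polyline for FPR in [a, b]."""
    area = 0.0
    for x0, y0, x1, y1 in zip(fpr[:-1], power[:-1], fpr[1:], power[1:]):
        lo, hi = max(x0, a), min(x1, b)
        if hi <= lo:
            continue                           # vertical segment or outside [a, b]
        slope = (y1 - y0) / (x1 - x0)
        y_lo = y0 + slope * (lo - x0)
        y_hi = y0 + slope * (hi - x0)
        area += (hi - lo) * (y_lo + y_hi) / 2.0
    return area

def summarize_auc(trial_aucs):
    """Mean AUC over trials and the sample variance with divisor S - 1."""
    trial_aucs = np.asarray(trial_aucs, dtype=float)
    return trial_aucs.mean(), trial_aucs.var(ddof=1)
```

With these helpers the initial AUC of a trial is `partial_auc(fpr, power, 0.0, 0.05)` and the full AUC is `partial_auc(fpr, power, 0.0, 1.0)`.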

SIMULATION STUDY 1: FIVE LOCI IN CANCER

DATA

We obtained genotype data from phase 1 of a case-control GWAS for colorectal cancer from collaborators at the Wellcome Trust Centre for Human Genetics, University of Oxford. Two forms of the data are used here. The “cancer data” comprise complete genotype information on 1,493 subjects for 183 SNPs covering a hit region previously identified on 18q21. The cancer data are a subset of the “full cancer data,” which comprises incomplete genotype information on 1,859 subjects for the hit region.

Generating Missing Genotypes. To assess the sensitivity of the compared methods to alternative strategies for modeling missing genotypes, we generate incomplete versions of the cancer data by deleting genotypes according to a random missingness algorithm. The missingness algorithm is based on empirical modeling of the pattern of missing data in the full cancer data. The full cancer data genotypes contained 854 missing genotypes (∼0.25%). We observed that the proportion of missing genotypes varied considerably from SNP to SNP, but that missingness across individuals was consistent with a random allocation. To generate each incomplete dataset, we therefore do the following. First, for each SNP j, we assign a missingness proportion $\rho_{\mathrm{mis},j}$ generated as a random draw $\rho_{\mathrm{mis},j} \sim f_{\mathrm{mis}}$, where $f_{\mathrm{mis}}$ is an empirical density based on the histogram of missingness proportions of SNPs in the full cancer data. Second, we select a subset of $n_{\mathrm{mis}} < n$ individuals eligible to receive missing genotypes. Third, at each SNP j we delete $d_j = n_{\mathrm{mis}} \times \min(c \rho_{\mathrm{mis},j}, 1)$ marker genotypes at random from the $n_{\mathrm{mis}}$ individuals, where c is chosen such that the overall proportion of missing data is a fixed value $p_{\mathrm{mis}} = (mn)^{-1} \sum_j d_j$. To generate a more conservative level of missingness while ensuring at least 10% of individuals had complete data, we set $p_{\mathrm{mis}} = 0.1$ and $n_{\mathrm{mis}} = 0.9n$.

Simulating Phenotypes. Phenotypes are simulated based on a binomial draw from the logistic model in Equation 1. Given a set of SNPs representing true signals, with genotypes $X_q$ and effects $\beta_q$, we first calculate the intercept necessary for an expected 50/50 ratio of cases to controls as $\mu = -n^{-1} 1^{T} (X_q \beta_q)$, calculate individual propensities as $p_i = \mathrm{logit}^{-1}(\mu + x_{i,q}^{T} \beta_q)$, where $x_{i,q}$ is the ith row of $X_q$, and then draw phenotypes as $Y_i \sim \mathrm{Bin}(1, p_i)$.

Placing Causal Loci. To ensure a degree of confounding correlation between loci, we choose five true signal SNPs at random but in a restricted manner from the LD blocks shown in Figure 2. Specifically, in each simulation trial, two SNPs are chosen from block 1 at random but subject to correlation r ≥ 0.4, two SNPs are from block 2, also subject to r ≥ 0.4, and one SNP is randomly chosen from block 3.

Simulation 1A: Moderate Effects. To aid an initial illustrative comparison between methods, our first study on the cancer data simulates a relatively constant effects structure. In each simulation trial, we assign a permutation of the effects (on the odds scale) $\exp\{\beta_q\} = (1.287, 1.398, 1.246, 1.357, 1.419)$ to the selected five SNPs.

Simulation 1B: Small Effects. Providing a more challenging and variable set of causal targets, our second study on the cancer data randomly chooses true signal SNPs as in 1A but draws each element $\beta_{qj}$ of effects $\beta_q$ independently as $\exp\{\beta_{qj}\} \sim N(1.25^{(-1)^{\delta_j}}, 0.02^2)$ with $\delta_j \sim \mathrm{Bin}(1, 0.5)$. The resulting effects are comparable to the small effects estimated in many GWAS [Manolio et al., 2009].
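The phenotype simulation and the small-effects scheme of simulation 1B can be sketched in Python as follows. The function names and the use of NumPy's random generator are assumptions made for illustration, and the matrix of true-signal genotypes is passed in directly rather than chosen from LD blocks.

```python
import numpy as np

def simulate_effects_1b(n_signals, rng):
    """Log odds ratios whose odds ratios are N(1.25 or 1/1.25, 0.02^2),
    with the direction of each effect chosen by a fair coin."""
    delta = rng.integers(0, 2, size=n_signals)        # Bin(1, 0.5) draws
    mean_odds_ratio = 1.25 ** ((-1.0) ** delta)       # 1.25 or 0.8
    odds_ratio = rng.normal(mean_odds_ratio, 0.02)
    return np.log(odds_ratio)

def simulate_phenotypes(X_q, beta_q, rng):
    """Case-control outcomes from the logistic model, with the intercept
    set to minus the mean linear predictor so that roughly half the
    individuals are expected to be cases."""
    eta = X_q @ beta_q
    mu = -eta.mean()
    p = 1.0 / (1.0 + np.exp(-(mu + eta)))
    y = rng.binomial(1, p)
    return y, p, mu
```

Setting the intercept this way centers the linear predictor at zero, which for small effects keeps the average propensity close to one half.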

SIMULATION STUDY 2: ONE TO SEVEN LOCI

IN ‘58 DATA

The “‘58” data are a complete-genotypes subset of data collected during the human GWAS for seven diseases described in WTCCC [2007]. They comprise genotypes for 2,199 subjects on 500 SNPs in the region 39.063723–40.985321 Mb on chromosome 22, this region being chosen by us as a contiguous run of markers that exhibits a mixture of high and low LD (Fig. 2). To assess how the number of true signals affects the relative utility of modeling single vs. multiple loci, we evaluated methods in seven distinct simulation substudies, simulating 1,...,7 true signals, respectively. In each simulated trial of each substudy, the set of true signals is chosen entirely at random from the 500 SNPs and the SNP effects are generated as in simulation 1B above.

COMPUTATION

Genotype imputation was performed using fastPHASE

[Scheet and Stephens, 2006]. All other analyses were per-

formed in R [R Development Core Team, 2011], with the
