
Electronic Journal of Statistics

ISSN: 1935-7524

arXiv: math.PR/0000000

Selecting Massive Variables Using Iterated Conditional Modes/Medians

Vitara Pungpapong

Department of Statistics

Faculty of Commerce and Accountancy

Chulalongkorn University

Bangkok, Thailand

e-mail: vitara@cbs.chula.ac.th

Min Zhang and Dabao Zhang∗

Department of Statistics

Purdue University

West Lafayette, IN 47907

e-mail: minzhang@stat.purdue.edu; zhangdb@stat.purdue.edu

Abstract: Empirical Bayes methods are well suited to selecting variables out of massive yet structured candidates because of three attributes: taking prior information on model parameters, allowing data-driven hyperparameter values, and being free of tuning parameters. We propose an iterated conditional modes/medians (ICM/M) algorithm to implement empirical Bayes selection of massive variables while incorporating sparsity or more complicated a priori information. The iterative conditional modes are employed to obtain data-driven estimates of hyperparameters, and the iterative conditional medians are used to estimate the model coefficients and therefore enable the selection of massive variables. The ICM/M algorithm is computationally fast, and can easily extend the empirical Bayes thresholding, which is adaptive to parameter sparsity, to complex data. Empirical studies suggest competitive performance of the proposed method, even in the simple case of selecting massive regression predictors.

AMS 2000 subject classifications: Primary 62J05; secondary 62C12, 62F07.

Keywords and phrases: Empirical Bayes Variable Selection, High Dimensional Data, Prior, Sparsity.

∗Corresponding author

Contents

1 Introduction
2 The Method
   2.1 Iterated Conditional Modes/Medians
   2.2 Evaluation of Variable Importance
3 Selection of Sparse Variables
   3.1 The Algorithm
   3.2 Simulation Studies
4 Selection of Structured Variables
   4.1 The Algorithm
   4.2 Simulation Studies
5 Real Data Analysis
6 Discussion
Acknowledgements
Appendix A. Technical Details of the ICM/M Algorithms
References

1. Introduction

Selecting variables out of massive candidates is a challenging yet critical problem in analyzing high-dimensional data. Because high-dimensional data are usually of relatively small sample sizes, successful variable selection demands appropriate incorporation of a priori information. A fundamental piece of information is that only a few of the variables are significant and should be included in the underlying models, leading to a fundamental assumption of sparsity in variable selection (Fan and Li, 2001). Many methods have been developed to take full advantage of this sparsity assumption, mostly built upon thresholding procedures (Donoho and Johnstone, 1994); see Tibshirani (1996), Fan and Li (2001), and others.

Recently many efforts have been devoted to selecting variables from massive candidates by incorporating rich a priori information accumulated from historical research or practice. For example, Yuan and Lin (2006) defined group-wise norms for grouped variables. For graph-structured variables, Li and Li (2010) and Pan et al. (2010) proposed to use Laplacian matrices and $L_\gamma$ norms, respectively. Li and Zhang (2010) and Stingo et al. (2011) both employed Bayesian approaches to incorporate structural information on the variables, both formulating Ising priors.

Markov chain Monte Carlo (MCMC) algorithms have been commonly employed to develop Bayesian variable selection; see George and McCulloch (1993), Carlin and Chib (1995), Li and Zhang (2010), Stingo et al. (2011), and others. However, MCMC algorithms are computationally intensive, and it may be difficult to obtain appropriate hyperparameters with them. On the other hand, penalty-based variable selection usually demands predetermination of certain tuning parameters (e.g. Fan and Li, 2001; Li and Li, 2010; Pan et al., 2010; Tibshirani, 1996; Yuan and Lin, 2006), which challenges high-dimensional data analysis. Although cross-validation has been widely suggested to choose tuning parameters, it may be infeasible in certain situations, in particular the case that many variables rarely vary. Recently, Sun and Zhang (2012) proposed the scaled sparse linear regression to attach the tuning parameter to the estimable noise level.

Empirical Bayes methods are attractive in high-dimensional data analysis because they need no tuning parameters. They also allow incorporating a priori information while modeling the uncertainty of such prior information using hyperparameters. For example, Johnstone and Silverman (2004) modeled sparse normal means using a spike-and-slab prior. The mixing rate of the Dirac spike and the slab is taken as a hyperparameter to achieve data-driven thresholding, and the resultant empirical Bayes estimates are therefore adaptive to the sparsity of the high-dimensional parameters. As demonstrated by Johnstone and Silverman (2004), this empirical Bayes method can work better than traditional thresholding estimators. One important contribution of this paper is to develop a new algorithm which allows the construction of such empirical Bayes variable selection with complex data.

We propose an iterated conditional modes/medians (ICM/M) algorithm for easy implementation and fast computation of empirical Bayes variable selection (EBVS). Similar to the iterated conditional modes (Besag, 1986), iterative conditional modes are used for optimization of hyperparameters and parameters other than regression coefficients. Iterative conditional medians are used to enforce variable selection. As shown in Johnstone and Silverman (2004) and Zhang et al. (2010), when mixture priors are utilized, posterior medians can lead to thresholding rules and thus help screen out small and insignificant variables. Furthermore, ICM/M makes it easy to incorporate complicated priors for the purpose of selecting variables out of massive structured candidates. Taking the Ising prior as an example (Li and Zhang, 2010), we illustrate this strength of ICM/M.

The rest of this paper is organized as follows. In the next section, we propose the ICM/M algorithm for empirical Bayes variable selection (EBVS). We also explore controlling false discovery rates (FDR) using conditional posterior probabilities. We implement the ICM/M algorithm in Section 3 for high-dimensional linear regression models, assuming only that the non-zero regression coefficients are few. Shown in Section 4 is the ICM/M algorithm when incorporating a priori information on graphical relationships between the predictors. Simulation studies are carried out in both Sections 3 and 4 to evaluate the performance of the corresponding ICM/M algorithms. An application to a real dataset from a genome-wide association study (GWAS) is presented in Section 5. We conclude this paper with a discussion in Section 6.

In the rest of this paper, the $j$-th component of a vector parameter, say $\beta$, is denoted by $\beta_j$; $\beta_{-j}$ denotes all components of $\beta$ except the $j$-th component; and $\beta_{j:k}$ includes the components of $\beta$ from $\beta_j$ to $\beta_k$. A parameter with a parenthesized superscript, say $\hat{\beta}^{(k)}$, indicates an estimate from the $k$-th iteration.

2. The Method

2.1. Iterated Conditional Modes/Medians

Consider a general variable selection problem presented with a likelihood function,
$$L(\mathbf{Y};\, \mathbf{X}\beta;\, \phi), \qquad (2.1)$$
where $\mathbf{Y}$ is an $n\times 1$ random vector, $\mathbf{X}$ is an $n\times p$ matrix containing values of $p$ variables, $\beta$ is a $p\times 1$ parameter vector with the $j$-th component $\beta_j$ representing the effect of the $j$-th variable in the model, and $\phi$ includes all other auxiliary parameters.

A typical variable selection task is to identify the non-zero components of $\beta$, that is, to select important variables out of the $p$ candidates. For convenience, define $\tau_j = I\{\beta_j \neq 0\}$, which indicates whether the $j$-th variable should be selected into the model. Further denote $\tau = (\tau_1, \tau_2, \cdots, \tau_p)^t$. Here we consider an empirical Bayes variable selection, which assumes the priors
$$\beta \sim \pi(\beta|\tau, \psi_1) \times \pi(\tau|\psi_2), \qquad \phi \sim \pi(\phi|\psi_3), \qquad (2.2)$$
where $\psi = (\psi_1^t, \psi_2^t, \psi_3^t)^t$ includes all hyperparameters.

To avoid high-dimensional integrals, we cycle through coordinates to obtain the estimate of each component of $(\beta, \phi, \psi)$ iteratively,
$$\begin{cases} \hat{\beta}_j = \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}),\\[2pt] \hat{\phi}_j = \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}),\\[2pt] \hat{\psi}_j = \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}). \end{cases} \qquad (2.3)$$

Indeed, with properly chosen priors for $\phi$ and $\psi$, both $\hat{\phi}_j = \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}, \mathbf{Y}, \mathbf{X})$ and $\hat{\psi}_j = \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}, \mathbf{Y}, \mathbf{X})$ can be obtained by maximizing the fully conditional posterior, i.e.,
$$\begin{cases} \hat{\phi}_j = \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}) = \mathrm{mode}(\phi_j|\mathbf{Y}, \mathbf{X}, \hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}),\\[2pt] \hat{\psi}_j = \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}) = \mathrm{mode}(\psi_j|\mathbf{Y}, \mathbf{X}, \hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}). \end{cases} \qquad (2.4)$$

When each $\hat{\beta}_j$ is also obtained by maximizing its fully conditional posterior, this suggests the iterated conditional modes (ICM) algorithm of Besag (1986). However, calculating the conditional mode of $\hat{\beta}_j$ is either infeasible or practically undesirable (due to the lack of a variable selection function). Indeed, Bayesian or empirical Bayes variable selection for high-dimensional data usually assumes a spike-and-slab prior on each $\beta_j$ (e.g. Ishwaran and Rao, 2005; Mitchell and Beauchamp, 1988), which induces a spike-and-slab posterior for each $\beta_j$. With a Dirac spike, it is infeasible to obtain the mode of such a spike-and-slab posterior. However, its median can be exactly zero, which allows selecting the median probability model as suggested by Barbieri and Berger (2004). Henceforth, following Johnstone and Silverman (2004), we construct $\hat{\beta}_j = \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}, \mathbf{Y}, \mathbf{X})$ as the median of the fully conditional posterior, i.e.,
$$\hat{\beta}_j = \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}) = \mathrm{median}(\beta_j|\mathbf{Y}, \mathbf{X}, \hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}). \qquad (2.5)$$

With iterative conditional medians for $\beta_j$ and conditional modes for $\phi_j$ and $\psi_j$, each a Bayesian update of one component conditional on all other components, we hereafter propose the iterated conditional modes/medians (ICM/M) algorithm for implementing empirical Bayes variable selection. As shown later, the ICM/M algorithm allows an easy extension of the (generalized) empirical Bayes thresholding methods of Johnstone and Silverman (2004) and Zhang et al. (2010) to dependent data. Obviously, with a consistent initial point of $(\hat{\beta}, \hat{\phi}, \hat{\psi})$, the cycling Bayesian updates of this algorithm lead to a well-established estimate $(\hat{\beta}, \hat{\phi}, \hat{\psi})$.
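To make the cycling in (2.3)-(2.5) concrete, the following is a minimal sketch of the generic ICM/M loop, assuming the user supplies the fully conditional median of each $\beta_j$ and the fully conditional modes of $\phi$ and $\psi$; the function names here are hypothetical and not part of the paper.

```python
import numpy as np

def icmm(beta0, phi0, psi0, cond_median_beta, cond_mode_phi, cond_mode_psi,
         max_iter=100, tol=1e-6):
    """Generic ICM/M cycle: update each coordinate conditional on the rest."""
    beta, phi, psi = beta0.copy(), phi0, psi0
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(len(beta)):
            # conditional posterior median; can be exactly zero, enforcing selection
            beta[j] = cond_median_beta(j, beta, phi, psi)
        phi = cond_mode_phi(beta, phi, psi)  # auxiliary parameters, e.g. sigma
        psi = cond_mode_psi(beta, phi, psi)  # hyperparameters, e.g. omega
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta, phi, psi
```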

2.2. Evaluation of Variable Importance

When proposing a statistical model, we are primarily interested in evaluating the importance of variables besides the model's predictive ability. For example, the objective of high-dimensional data analysis is often to identify a list of $J$ predictors that are most important or significant among the $p$ predictors. This is a common practice in biomedical research using high-throughput biotechnologies: ranking all markers and selecting a short list of candidates for follow-up studies.

For a Bayesian approach, inference on the importance of each variable can be done through its marginal posterior probability $P(\beta_j \neq 0|\mathbf{Y}, \mathbf{X})$. However, this quantity involves high-dimensional integrals which are difficult to calculate even in the case of moderate $p$. Furthermore, the marginal posterior probability may not be meaningful when predictors are highly correlated (which usually occurs in a large $p$ small $n$ data set). For example, suppose predictors $\mathbf{X}_1$ and $\mathbf{X}_2$ are linearly dependent and both are associated with a response variable. The marginal posterior probability of $\mathbf{X}_1$ being included in the model might be very high and dominate the marginal posterior probability of $\mathbf{X}_2$ being included in the model.

We propose a local posterior probability to evaluate the importance of a variable. That is, conditional on the optimal point $(\hat{\beta}, \hat{\phi}, \hat{\psi})$ obtained from empirical Bayes variable selection through the ICM/M algorithm, the importance of the $j$-th variable is evaluated by its fully conditional posterior probability,
$$\zeta_j = P(\beta_j \neq 0|\mathbf{Y}, \mathbf{X}, \hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}). \qquad (2.6)$$
Such a probability has a closed form which can be easily computed. We will show later in simulation studies that the local posterior probability is a good indicator for quantifying the importance of variables.

Another challenging question is how large the list of important predictors should be. In much of the literature, the number of important variables reported is arbitrary. For instance, some laboratories may simply be interested in the top ten genes. Typically, however, there is interest in creating the list such that errors like type-I and type-II errors are controlled (Dudoit et al., 2003). False discovery rate (FDR) control is widely used in high-dimensional data analysis since it is less conservative and has more power than controlling the familywise error rate (Benjamini and Hochberg, 1995).

With the local posterior probabilities $\zeta$ and the assumption that the true $\beta$ is known, we can report a list containing the predictors whose posterior probabilities exceed some bound $\kappa$, $0 \leq \kappa < 1$. Given the data, the true FDR can be computed as
$$FDR(\kappa) = \sum_{j=1}^{p} I\{\beta_j = 0,\ \zeta_j > \kappa\} \Big/ \sum_{j=1}^{p} I\{\zeta_j > \kappa\}. \qquad (2.7)$$


Newton et al. (2004) proposed the expected FDR given the data in a Bayesian scheme as
$$\widehat{FDR}(\kappa) = \sum_{j=1}^{p} (1 - \zeta_j)\, I\{\zeta_j > \kappa\} \Big/ \sum_{j=1}^{p} I\{\zeta_j > \kappa\}. \qquad (2.8)$$
Therefore we can select predictors to report by controlling $\widehat{FDR}(\kappa)$ at a desired level.
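As a small illustration (our sketch, not code from the paper), the estimate in (2.8) and the resulting selection rule can be computed directly from the local posterior probabilities in (2.6):

```python
import numpy as np

def fdr_hat(zeta, kappa):
    """Estimated Bayesian FDR (2.8) for the list {j : zeta_j > kappa}."""
    selected = zeta > kappa
    if not selected.any():
        return 0.0
    return np.sum(1.0 - zeta[selected]) / selected.sum()

def select_by_fdr(zeta, level=0.1, grid=np.linspace(0.0, 0.99, 100)):
    """Report the predictors selected at the smallest kappa on the grid
    whose estimated FDR (2.8) is at or below the desired level."""
    for kappa in grid:
        if fdr_hat(zeta, kappa) <= level:
            return np.where(zeta > kappa)[0], kappa
    return np.array([], dtype=int), 1.0
```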

3. Selection of Sparse Variables

Here we consider empirical Bayes variable selection for the following regression model with high-dimensional data,
$$\mathbf{Y} = \mathbf{X}\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2\mathbf{I}_n). \qquad (3.1)$$
Further assume that the response is centered and the predictors are standardized, that is, $\mathbf{Y}^t\mathbf{1}_n = 0$, $\mathbf{X}^t\mathbf{1}_n = \mathbf{0}_p$, and
$$\mathbf{X}_j^t\mathbf{X}_j = n - 1, \qquad j = 1, \cdots, p,$$
where $\mathbf{X}_j$ is the $j$-th column of $\mathbf{X}$, i.e., $\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2, \cdots, \mathbf{X}_p)$.

Let $\tilde{\mathbf{Y}}_j = \mathbf{Y} - \mathbf{X}\beta + \mathbf{X}_j\beta_j$. Assuming all model parameters except $\beta_j$ are known, $\beta_j$ has a sufficient statistic
$$\frac{1}{n-1}\mathbf{X}_j^t\tilde{\mathbf{Y}}_j \sim N\Big(\beta_j,\ \frac{1}{n-1}\sigma^2\Big). \qquad (3.2)$$

To capture the sparsity of the regression coefficients, we put an independent prior on each scaled $\beta_j$ as follows,
$$\beta_j|\sigma \overset{iid}{\sim} (1 - \omega)\,\delta_0(\cdot) + \omega\,\gamma(\cdot|\sigma), \qquad (3.3)$$
where $\delta_0(\cdot)$ is a Dirac delta function at zero and $\gamma(\cdot|\sigma)$ is assumed to be a probability density function. This mixture prior implies that $\beta_j$ is zero with probability $1 - \omega$ and is drawn from the nonzero part of the prior, $\gamma(\cdot|\sigma)$, with probability $\omega$. As suggested by Johnstone and Silverman (2004), a heavy-tailed prior such as the Laplace distribution is a good choice for $\gamma(\cdot|\sigma)$, that is,
$$\gamma(\beta_j|\sigma) = \frac{\alpha\sqrt{n-1}}{2\sigma}\exp\Big(-\frac{\alpha\sqrt{n-1}}{\sigma}|\beta_j|\Big), \qquad (3.4)$$
where $\alpha > 0$ is a scale parameter. We take Jeffreys' prior on $\sigma$, $\pi(\sigma) \propto 1/\sigma$ (Jeffreys, 1946).

Note that there is a connection between the Laplace prior and the lasso. Indeed, setting $\omega = 1$ in (2.4) and (3.3) leads to a lasso estimate with $\alpha$ related to the tuning parameter of the lasso; see details in Tibshirani (1996). Our empirical Bayes variable selection allows a data-driven optimal choice of $\omega$. Indeed, a data-driven optimal $\alpha$ can also be obtained through the conditional mode suggested by (2.4), which avoids the issue brought to the lasso by its tuning parameter (the lasso usually relies on cross-validation to choose an optimal tuning parameter). Johnstone and Silverman (2004) also suggested a default value $\alpha = 0.5$, which in general works well.

3.1. The Algorithm

Here we implement the ICM/M algorithm described in (2.4) and (2.5). Note that $\phi = \sigma$, and $\psi = (\omega, \alpha)$ or $\psi = \omega$ depending on whether $\alpha$ is fixed. Throughout this paper, we fix $\alpha = 0.5$ as suggested by Johnstone and Silverman (2004).

To obtain $\hat{\beta}_j^{(k+1)} = \mathrm{median}(\beta_j|\mathbf{Y}, \mathbf{X}, \hat{\beta}_{1:(j-1)}^{(k+1)}, \hat{\beta}_{(j+1):p}^{(k)}, \hat{\sigma}^{(k)}, \hat{\omega}^{(k)})$, we notice the sufficient statistic of $\beta_j$ in (3.2), and it is therefore easy to calculate $\hat{\beta}_j^{(k+1)}$ as stated below. Indeed, $\hat{\beta}_j^{(k+1)}$ is an empirical Bayes thresholding estimator as shown in Johnstone and Silverman (2004).

Proposition 3.1. With pre-specified values of $\sigma$ and $\beta_{-j}$, $\frac{1}{n-1}\mathbf{X}_j^t\tilde{\mathbf{Y}}_j$ is a sufficient statistic for $\beta_j$ w.r.t. the model (3.1). Furthermore, the iterative conditional median of $\beta_j$ in the ICM/M algorithm can be constructed as the posterior median of $\beta_j$ in the following Bayesian analysis,
$$\frac{1}{\sigma\sqrt{n-1}}\mathbf{X}_j^t\tilde{\mathbf{Y}}_j \,\Big|\, \beta_j \sim N\Big(\frac{\sqrt{n-1}}{\sigma}\beta_j,\ 1\Big),$$
$$\beta_j \sim (1 - \omega)\,\delta_0(\beta_j) + \omega\,\frac{\sqrt{n-1}}{4\sigma}\exp\Big(-\frac{\sqrt{n-1}}{2\sigma}|\beta_j|\Big).$$

The conditional mode $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma|\mathbf{Y}, \mathbf{X}, \hat{\beta}^{(k+1)}, \hat{\omega}^{(k)})$ has an explicit solution,
$$\hat{\sigma}^{(k+1)} = \frac{1}{4d}\Big(c + \sqrt{c^2 + 16d\,\|\mathbf{Y} - \mathbf{X}\hat{\beta}^{(k+1)}\|^2}\Big),$$
where $c = \sqrt{n-1}\,\|\hat{\beta}^{(k+1)}\|_1$ and $d = n + \|\hat{\beta}^{(k+1)}\|_0 + 1$. Furthermore, the conditional mode $\hat{\omega}^{(k+1)} = \mathrm{mode}(\omega|\mathbf{Y}, \mathbf{X}, \hat{\beta}^{(k+1)}, \hat{\sigma}^{(k+1)})$ can be easily calculated as
$$\hat{\omega}^{(k+1)} = \|\hat{\beta}^{(k+1)}\|_0\,/\,p.$$
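For illustration, the following is a minimal sketch of one ICM/M sweep under Proposition 3.1, with the posterior median computed from the closed-form expressions collected in Appendix A.1 (with $\alpha = 0.5$). It is written for clarity rather than numerical robustness (e.g. $e^{z}$ may overflow for very large $|z|$), and the small floor on $\hat{\omega}$ is our own guard, not part of the paper.

```python
import numpy as np
from scipy.stats import norm

def posterior_median(z, w, sigma, n):
    """Posterior median of beta_j given z_j and prior weight w (Appendix A.1)."""
    s, z = np.sign(z), abs(z)      # antisymmetry: median(-z) = -median(z)
    den = (1.0 - norm.cdf(z + 0.5)) * np.exp(z) + norm.cdf(z - 0.5)
    F0 = (1.0 - norm.cdf(0.5 - z)) / den          # \tilde F(0 | z)
    bracket = (norm.cdf(z - 0.5) / norm.pdf(z - 0.5)
               + (1.0 - norm.cdf(z + 0.5)) / norm.pdf(z + 0.5))
    wj = 1.0 / (1.0 + 4.0 * (1.0 / w - 1.0) / bracket)  # P(beta_j != 0 | z_j)
    if wj * F0 <= 0.5:
        return 0.0                                # thresholded to exactly zero
    m = z - 0.5 - norm.ppf(den / (2.0 * wj))
    return s * sigma / np.sqrt(n - 1.0) * m

def icmm_sweep(X, Y, beta, sigma, w):
    """One cycle of beta_1, ..., beta_p, then sigma and omega (Prop. 3.1)."""
    n, p = X.shape
    resid = Y - X @ beta
    for j in range(p):
        resid_j = resid + X[:, j] * beta[j]       # residual excluding X_j
        z = X[:, j] @ resid_j / (sigma * np.sqrt(n - 1.0))
        beta[j] = posterior_median(z, w, sigma, n)
        resid = resid_j - X[:, j] * beta[j]
    c = np.sqrt(n - 1.0) * np.abs(beta).sum()
    d = n + np.count_nonzero(beta) + 1.0
    sigma = (c + np.sqrt(c**2 + 16.0 * d * resid @ resid)) / (4.0 * d)
    w = max(np.count_nonzero(beta) / p, 1e-8)     # floor keeps 1/w finite
    return beta, sigma, w
```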

3.2. Simulation Studies

To evaluate the performance of our proposed empirical Bayes variable selection (EBVS) via the ICM/M algorithm, we simulated data from model (3.1) with large $p$ and small $n$, i.e., $p = 1{,}000$ and $n = 100$. There are a total of 20 non-zero regression coefficients, namely $\beta_1 = \cdots = \beta_{10} = 2$ and $\beta_{101} = \cdots = \beta_{110} = 1$. The error standard deviation $\sigma$ is set to one. The predictors are partitioned into ten blocks, each including 100 predictors which are serially correlated at the same level of correlation coefficient $\rho$. We simulated 100 datasets for each $\rho$ taking values in $\{0, 0.1, 0.2, \cdots, 0.9\}$.

EBVS was compared with two popular approaches, i.e., the lasso of Tibshirani (1996) and the adaptive lasso of Zou (2006). The scaled lasso of Sun and Zhang (2012) was also applied to the simulated datasets. Ten-fold cross-validation was used to choose the optimal tuning parameters for the lasso and adaptive lasso respectively. The median values of the prediction error, false positive rate, and false negative rate were reported for each approach based on the 100 simulated datasets.

As shown in Figure 1, EBVS performs much better than the other three methods in terms of prediction error. In particular, when $\rho \geq 0.3$, EBVS consistently reported a median prediction error of approximately 1.5. Comparing the lasso and adaptive lasso, the adaptive lasso has smaller prediction error when $\rho < 0.3$, but the lasso has smaller prediction error when $\rho > 0.3$.

Fig 1. Comparison of median prediction errors of lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), and EBVS (thick solid) by averaging over 100 datasets simulated for each $\rho$ in Section 3.2.

It is known that the lasso can select variables inconsistently under certain conditions, and the adaptive lasso was proposed to solve this issue (Zou, 2006). Figure 2 shows that the lasso has very high false positive rates (more than 50%), and the adaptive lasso significantly reduces the false positive rates, especially when $\rho \geq 0.2$. Indeed, the lasso has much larger false positive rates than all other methods. It is interesting to observe that EBVS has zero false positive rates except in the cases $\rho = 0.5$ and $\rho = 0.9$. All methods have very low false negative rates.

Fig 2. Comparison of false positive rates (top) and false negative rates (bottom). Averaging over 100 datasets simulated for each $\rho$ in Section 3.2, the false positive/negative rates were calculated for lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), and EBVS (thick solid).

Recently, Meinshausen et al. (2009) proposed a multi-sample-split method to construct p-values for high-dimensional regressions, especially in the case that the number of predictors is larger than the sample size. Here we applied this method, as well as EBVS, to each simulated dataset with a total of 50 sample-splits, and compared its performance with that of $\zeta_j$ defined in (2.6). For each predictor, Figure 3 plots the median of $-\log_{10}(1 - \zeta_j)$, truncated at 10, against the median of $-\log_{10}(\text{p-value})$ across the 100 datasets simulated from the regression model with $\rho = 0.5$ and $\rho = 0.9$ respectively. For either model, $\zeta_j$ can clearly distinguish true positives (i.e., predictors with $\tau_j \neq 0$) from true negatives (i.e., predictors with $\tau_j = 0$). However, as shown in the bottom panel of Figure 3, where $\rho = 0.9$, there is no clear cutoff of p-values that distinguishes between true positives and true negatives. Here we also observed that $FDR(\kappa)$ can be well approximated by $\widehat{FDR}(\kappa)$ (results not shown), with both dropping sharply to zero for $\kappa > 0.05$. We therefore can select $\kappa$ to threshold $\zeta_j$ for the purpose of controlling FDR.

4. Selection of Structured Variables

When information on structural relationships among the predictors is available, it is unreasonable to assume an independent prior on each $\beta_j$, $j = 1, \ldots, p$, as described in the previous section. Instead, we introduce an indicator variable $\tau_j = I\{\beta_j \neq 0\}$. Then the prior distribution of $\beta$ is set to be dependent on $\tau = (\tau_1, \ldots, \tau_p)^t$. Specifically, given $\tau_j$, $\beta_j$ has the mixture distribution
$$\beta_j|\tau_j \sim (1 - \tau_j)\,\delta_0(\beta_j) + \tau_j\,\gamma(\beta_j), \qquad (4.1)$$
where $\gamma(\cdot)$ is the Laplace density with scale parameter $\alpha$.

The relationships among the predictors can be represented by an undirected graph $G = (V, E)$ comprising a set $V$ of vertices and a set $E$ of edges. In this case, each node is associated with a binary random variable $\tau_j \in \{0, 1\}$, and there is an edge between two nodes if the two covariates are correlated. The following Ising model (Onsager, 1943) is employed to model the a priori information on $\tau$,
$$P(\tau) = \frac{1}{Z(a, b)}\exp\Big(a\sum_i \tau_i + b\sum_{<i,j> \in E} \tau_i\tau_j\Big), \qquad (4.2)$$
where $a$ and $b$ are two parameters, and
$$Z(a, b) = \sum_{\tau \in \{0,1\}^p} \exp\Big(a\sum_i \tau_i + b\sum_{<i,j> \in E} \tau_i\tau_j\Big).$$

The parameter $b$ corresponds to the "energies" associated with interactions between nearest neighboring nodes. When $b > 0$, the interaction is called ferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to have the same value. When $b < 0$, the interaction is called antiferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to have different values. When $b = 0$, there is no interaction, and the prior reduces to independent and identical Bernoulli distributions. The value of $a + b$ indicates the preferred value of each $\tau_i$: if $a + b > 0$, $\tau_i$ tends to be one; if $a + b < 0$, $\tau_i$ tends to be zero.
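For intuition (our illustration, not part of the paper), the full conditional of $\tau_i$ under (4.2) is $P(\tau_i = 1|\tau_{-i}) = 1/(1 + \exp(-a - b\,s_i))$, with $s_i$ the number of selected neighbors of node $i$, so draws from the Ising prior can be sketched by Gibbs sampling; `edges` below is a hypothetical adjacency list.

```python
import numpy as np

def gibbs_ising(edges, a, b, n_sweeps=200, seed=None):
    """One (approximate) draw of tau from the Ising prior (4.2);
    edges[i] lists the neighbors of node i."""
    rng = np.random.default_rng(seed)
    p = len(edges)
    tau = rng.integers(0, 2, size=p)
    for _ in range(n_sweeps):
        for i in range(p):
            s = tau[edges[i]].sum() if len(edges[i]) else 0  # selected neighbors
            prob = 1.0 / (1.0 + np.exp(-(a + b * s)))        # P(tau_i = 1 | rest)
            tau[i] = rng.binomial(1, prob)
    return tau
```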

Fig 3. Comparison of the local posterior probabilities (with $-\log_{10}(1 - \zeta)$ truncated at 10) and p-values in evaluating variable importance by EBVS, with $\rho = 0.5$ (top) and $\rho = 0.9$ (bottom). Each plot is based on the 100 datasets simulated in Section 3.2. True positives are indicated by crosses and true negatives are indicated by circles.

4.1. The Algorithm

Here we implement the ICM/M algorithm to develop empirical Bayes variable selection with an Ising prior (abbreviated as EBVS$_i$) to incorporate the structure of the predictors in the modeling process. We assume the Ising prior to be a homogeneous Boltzmann model, but the algorithm can be extended to more general priors. With $\alpha = 0.5$, the ICM/M algorithm described in (2.4) and (2.5) proceeds with $\phi = \sigma$ and $\psi = (\omega, a, b)$.

For the hyperparameters $a$ and $b$, we calculate the conditional mode of $(a, b)$ simultaneously. Conceptually, we want $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ to maximize the prior likelihood $P(\tau)$ in (4.2). However, this requires computing $Z(a, b)$ by summing over the $p$-dimensional space of $\tau$, which demands intensive computation, especially for large $p$. Many methods have been proposed for approximate calculation; see Geyer (1991), Geyer and Thompson (1992), Zhou and Schmidler (2009), and others. Here we consider the composite likelihood approach (Varin et al., 2011), which is widely used when the actual likelihood is not easy to compute. In particular, $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ is obtained by maximizing a pseudo-likelihood function, a special type of composite conditional likelihood and a natural choice for a graphical model (Besag, 1975).

With the Ising prior on $\tau^{(k)}$, the pseudo-likelihood of $(a, b)$ is as follows,
$$L_p(a, b) = \prod_{i=1}^{p} P(\tau_i^{(k)}|\tau_{-i}^{(k)}, a, b) = \prod_{i=1}^{p} \frac{\exp\big(\tau_i^{(k)}(a + b\sum_{<i,j> \in E}\tau_j^{(k)})\big)}{1 + \exp\big(a + b\sum_{<i,j> \in E}\tau_j^{(k)}\big)}.$$

The surface of such a pseudo-likelihood is much smoother than that of the joint likelihood and therefore easier to maximize (Liang and Yu, 2003). The resultant estimator $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ obtained by maximizing $L_p(a, b)$ is biased for a finite sample size, but it is asymptotically unbiased and consistent (Guyon and Kunsch, 1992; Mase, 2000; Varin et al., 2011). The implementation of the pseudo-likelihood method is fast and straightforward, which makes it feasible for large-scale graphs. Indeed, $\hat{a}^{(k+1)}$ and $\hat{b}^{(k+1)}$ are the logistic regression coefficients when the binary variable $\hat{\tau}_i^{(k)}$ is regressed on $\sum_{<i,j> \in E}\hat{\tau}_j^{(k)}$ for $i = 1, \cdots, p$.
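The following is a sketch of this pseudo-likelihood step: each $\tau_i$ is regressed on its neighbor sum by plain logistic regression, here via a few Newton steps with a tiny ridge for numerical stability (the ridge is our own safeguard, not part of the paper); `edges` is the same hypothetical adjacency list as above.

```python
import numpy as np

def fit_ising_pseudo(tau, edges, n_newton=25):
    """Maximize L_p(a, b) = prod_i P(tau_i | tau_{-i}, a, b); returns (a, b)."""
    s = np.array([tau[nb].sum() if len(nb) else 0 for nb in edges], dtype=float)
    Z = np.column_stack([np.ones_like(s), s])   # design: intercept a, slope b
    theta = np.zeros(2)
    for _ in range(n_newton):
        mu = 1.0 / (1.0 + np.exp(-Z @ theta))   # fitted P(tau_i = 1 | tau_{-i})
        grad = Z.T @ (tau - mu)
        hess = Z.T @ (Z * (mu * (1.0 - mu))[:, None]) + 1e-8 * np.eye(2)
        theta += np.linalg.solve(hess, grad)
    return theta[0], theta[1]                   # (a_hat, b_hat)
```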

As shown in the previous sections, the conditional median $\hat{\beta}_j^{(k+1)}$ can be constructed on the basis of the following proposition.

Proposition 4.1. With pre-specified values of $\sigma$, $a$, $b$, and $\beta_{-j}$, $\frac{1}{n-1}\mathbf{X}_j^t\tilde{\mathbf{Y}}_j$ is a sufficient statistic for $\beta_j$ w.r.t. the model (3.1). Furthermore, the iterative conditional median of $\beta_j$ in the ICM/M algorithm can be constructed as the posterior median of $\beta_j$ in the following Bayesian analysis,
$$\frac{1}{\sigma\sqrt{n-1}}\mathbf{X}_j^t\tilde{\mathbf{Y}}_j \,\Big|\, \beta_j \sim N\Big(\frac{\sqrt{n-1}}{\sigma}\beta_j,\ 1\Big),$$
$$\beta_j \sim (1 - \varpi_j)\,\delta_0(\beta_j) + \varpi_j\,\frac{\sqrt{n-1}}{4\sigma}\exp\Big(-\frac{\sqrt{n-1}}{2\sigma}|\beta_j|\Big),$$
where the probability $\varpi_j$ is specified as follows,
$$\varpi_j^{-1} = 1 + \exp\Big(-a - b\sum_{k:\,<j,k> \in E}\tau_k\Big).$$

The conditional mode $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma|\mathbf{Y}, \mathbf{X}, \hat{\beta}^{(k+1)}, \hat{\omega}^{(k)})$ has an explicit solution,
$$\hat{\sigma}^{(k+1)} = \frac{1}{4d}\Big(c + \sqrt{c^2 + 16d\,\|\mathbf{Y} - \mathbf{X}\hat{\beta}^{(k+1)}\|^2}\Big),$$
where $c = \sqrt{n-1}\,\|\hat{\beta}^{(k+1)}\|_1$ and $d = n + \|\hat{\beta}^{(k+1)}\|_0 + 1$.

4.2. Simulation Studies

Here we simulated large $p$ small $n$ datasets from model (3.1) with structured predictors, i.e., the values of $\beta_j$ depend on correlated $\tau_j$. We consider two different correlation structures for $\tau$. Both EBVS and EBVS$_i$ were applied to each simulated dataset, and they were compared with three other methods, i.e., the lasso, adaptive lasso, and scaled lasso.

Case I. Markov Chain. For each $j = 1, \cdots, p$, $\beta_j = 0$ if $\tau_j = 0$; and if $\tau_j = 1$, $\beta_j$ is independently sampled from a uniform distribution on $[0.3, 2]$. The indicator variables $\tau_1, \cdots, \tau_p$ form a Markov chain with the transition probabilities specified as follows,
$$P(\tau_{j+1} = 0|\tau_j = 0) = 1 - P(\tau_{j+1} = 1|\tau_j = 0) = 0.99;$$
$$P(\tau_{j+1} = 0|\tau_j = 1) = 1 - P(\tau_{j+1} = 1|\tau_j = 1) = 0.5.$$
The first indicator variable $\tau_1$ is sampled from Bernoulli(0.5). The error variance is fixed at one. For each individual, the predictors were simulated from an AR(1) process with correlation coefficient $\rho$ ranging from 0 to 0.9 in steps of 0.1. A data-generation sketch is given below.
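A minimal data-generation sketch for Case I (our illustration; the paper does not restate $n$ and $p$ for this case, so the Section 3.2 sizes are assumed):

```python
import numpy as np

def simulate_case1(n=100, p=1000, rho=0.5, seed=None):
    rng = np.random.default_rng(seed)
    # two-state Markov chain for the indicators tau
    tau = np.zeros(p, dtype=int)
    tau[0] = rng.binomial(1, 0.5)
    for j in range(1, p):
        stay0 = 0.99 if tau[j - 1] == 0 else 0.5   # P(tau_j = 0 | tau_{j-1})
        tau[j] = 0 if rng.random() < stay0 else 1
    beta = tau * rng.uniform(0.3, 2.0, size=p)     # nonzero beta ~ U[0.3, 2]
    # AR(1) predictors: x_j = rho * x_{j-1} + sqrt(1 - rho^2) * e_j
    X = np.empty((n, p))
    X[:, 0] = rng.standard_normal(n)
    for j in range(1, p):
        X[:, j] = rho * X[:, j - 1] + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)
    Y = X @ beta + rng.standard_normal(n)          # error variance fixed at one
    return X, Y, beta, tau
```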

The median prediction errors of all methods are shown in Figure 4. EBVS performed slightly better than the adaptive lasso, and both performed much better than the lasso and scaled lasso. The lasso, adaptive lasso, scaled lasso, and EBVS all presented varying prediction errors as $\rho$ goes from 0 to 0.9. However, the prediction errors of EBVS$_i$ are rather stable across varying values of $\rho$, and are much smaller than those of the other four methods.

Shown in Figure 5 are the false positive rates and false negative rates of the different methods. Not surprisingly, the lasso has false positive rates over 70%, much higher than those of the other methods. The adaptive lasso significantly reduces the false positive rates, which nonetheless remain above 10%. On the other hand, the false positive rates of both EBVS and EBVS$_i$ are below 10%. Indeed, EBVS reported false positive rates of zero for all values of $\rho$, and EBVS$_i$ reported false positive rates of zero when $\rho < 0.6$ and of 0.1 when $\rho \geq 0.6$. However, EBVS$_i$ reported lower false negative rates than EBVS. Therefore, EBVS tends to select correct true positives by including fewer true positives in the final model than EBVS$_i$. We then conjecture that, when covariates are highly correlated, EBVS$_i$ tends to select more variables into the model. In particular, if one covariate is selected into the model, EBVS$_i$ tends to include its highly correlated neighboring predictors as well.

Fig 4. Comparison of median prediction errors of lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), EBVS (dashed), and EBVS$_i$ (thick solid) by averaging over 100 datasets simulated for each $\rho$ in Case I of Section 4.2.

Figure 6 shows $FDR(\kappa)$ and $\widehat{FDR}(\kappa)$ of EBVS$_i$ for the models with $\rho = 0.5$ and $\rho = 0.9$ respectively (we also observed that $FDR(\kappa)$ of EBVS is similar to that of EBVS$_i$; results not shown). Overall, the estimate $\widehat{FDR}(\kappa)$ dominates $FDR(\kappa)$, i.e., the true FDR. Therefore, we will be conservative in selecting variables when controlling FDR using $\widehat{FDR}(\kappa)$. For example, if one would like to list important predictors while controlling the FDR at 0.1 for the model with $\rho = 0.9$, $\kappa$ should be set around 0.1 based on $FDR(\kappa)$. However, one can set $\kappa$ around 0.4 based on $\widehat{FDR}(\kappa)$, which corresponds to a true FDR as low as zero.

Plotted in Figure 7 are the p-values calculated using the multi-sample-split method (Meinshausen et al., 2009) against $\zeta_j$ for each predictor. For both EBVS and EBVS$_i$, $\zeta_j$ quantified variable importance better than the p-values in terms of distinguishing true positives from true negatives. Overall, EBVS$_i$ outperforms EBVS since it provides larger values of $\zeta_j$ for true positives, while both EBVS and EBVS$_i$ keep $\zeta_j$ close to zero for true negatives. Indeed, EBVS produced $\zeta_j$ close to 0 for several true positives for which EBVS$_i$ produced larger values of $\zeta_j$. We then summarize empirically that, by incorporating a priori information, EBVS$_i$ has more power to detect true positives than EBVS.

Fig 5. Comparison of false positive rates (top) and false negative rates (bottom). Averaging over 100 datasets simulated for each $\rho$ in Case I of Section 4.2, the false positive/negative rates were calculated for lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), EBVS (dashed), and EBVS$_i$ (thick solid).

Fig 6. Plots of median true FDR (solid) and estimated FDR (dotted) versus $\kappa$ based on the results of applying EBVS$_i$ to 100 datasets simulated for Case I in Section 4.2, with $\rho = 0.5$ (top) and $\rho = 0.9$ (bottom) respectively.

Fig 7. Comparison of local posterior probabilities (with $-\log_{10}(1 - \zeta)$ truncated at 10) and p-values in evaluating variable importance by EBVS and EBVS$_i$: (a) EBVS, $\rho = 0.5$; (b) EBVS, $\rho = 0.9$; (c) EBVS$_i$, $\rho = 0.5$; (d) EBVS$_i$, $\rho = 0.9$. Each plot is based on 100 datasets simulated for Case I in Section 4.2. True positives are indicated by crosses and true negatives are indicated by circles.

Case II. Pathway Information. To mimic a real genome-wide association study (GWAS), we took values of some single nucleotide polymorphisms (SNPs) in the Framingham dataset (Cupples et al., 2007) to generate $\mathbf{X}$ in model (3.1). Specifically, 24 human regulatory pathways were retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database, involving 1,502 genes. For each gene in these pathways, at most two SNPs listed in the

Framingham dataset were randomly selected out of those SNPs residing in the genetic region. If no SNP could be found within the genetic region, a nearest neighboring SNP was identified instead. A total of 1,782 SNPs were selected. We first identified 952 unrelated individuals in the Framingham dataset and used them to generate the predictor values of the training dataset. From the rest of the Framingham dataset, we identified 653 unrelated individuals to generate the predictor values of the test dataset. Five pathways were assumed to be associated with the phenotype $\mathbf{Y}$. That is, all 311 SNPs involved in these five pathways were assumed to have nonzero regression coefficients, which were randomly sampled from a uniform distribution over $[0.5, 3]$. With the error variance set at five, a total of 100 datasets were simulated.

As shown in Table 1, the lasso has relatively low prediction error. However, its median false positive rate is as high as 69%, much higher than the others. The adaptive lasso (LASSO$_a$), on the other hand, has very large prediction error, but its false positive rate is much smaller than that of the lasso. EBVS presented the lowest false positive rate among all the methods, and its false negative rate is also smaller than that of the adaptive lasso. Indeed, with initial values obtained from the lasso, EBVS reduces the false positive rate of the lasso by more than 98%. By incorporating the pathway information using an Ising prior on $\tau$, EBVS$_i$ reported the lowest prediction error. Furthermore, EBVS$_i$ compromised between the lasso, adaptive lasso, and EBVS to balance well between the false positive rate and false negative rate. The scaled lasso (LASSO$_s$) performed unstably in analyzing our simulated datasets, selecting more than 800 positives in seven of them.

Table 1
Results of Simulation Studies with Pathway Information (Case II).

Method       Prediction Error (s.e.)   False Positive (s.e.)   False Negative (s.e.)
LASSO         30.6928 (.4050)           .6905 (.0004)           .0204 (.0004)
LASSO$_a$    206.1994 (.5726)           .0744 (.0017)           .1266 (.0002)
LASSO$_s$*   368.6464 (6.1308)          .1290 (.0077)           .1475 (.0012)
EBVS          95.3686 (1.8820)          .0118 (.0010)           .0970 (.0008)
EBVS$_i$      21.7731 (.2320)           .0308 (.0015)           .0394 (.0003)

*The results of the scaled lasso exclude seven datasets. Applying the scaled lasso to these seven datasets reported a median prediction error of $2.45 \times 10^{10}$, a false positive rate of .7059, and a false negative rate of .0043.

5. Real Data Analysis

The empirical Bayes variable selection using the ICM/M algorithm was applied to the Framingham dataset (Cupples et al., 2007) to find SNPs associated with vitamin D level. The SNPs of the dataset were preprocessed following common criteria of GWAS, that is, both missingness per individual and missingness per SNP are less than 10%; minor allele frequency (MAF) is no less than 5%; and the significance level of the Hardy-Weinberg test on each SNP is 0.001. This resulted in a total of 370,773 SNPs, of which 84,834 resided in 2,167 genetic regions involving 112 pathways relevant to vitamin D level. We pre-screened the SNPs by selecting those having p-values of univariate tests smaller than 0.1, and ended up with 7,824 SNPs for the following analysis. As in Section 4.2, a training dataset and a test dataset were constructed with 952 and 519 unrelated individuals respectively. The response variable is the log-transformed vitamin D level.

We applied the lasso, adaptive lasso, scaled lasso, EBVS, and EBVS$_i$ to the training dataset, and calculated the prediction errors using the test dataset. The results are reported in Table 2. While identifying many more SNPs than all other methods, the lasso reported the largest prediction error. Other than the scaled lasso (LASSO$_s$), EBVS has the smallest prediction error, though it identified only one SNP. The adaptive lasso (LASSO$_a$) and EBVS$_i$ each identified five SNPs, and their prediction errors are slightly higher than that of EBVS.

Table 2
Prediction Errors for the Framingham Dataset.

Method       Prediction Error   No. of Identified SNPs
LASSO        .2560              14
LASSO$_a$    .2085               5
LASSO$_s$    .2066              25
EBVS         .2078               1
EBVS$_i$     .2121               5

Presented in Table 3 are the seven SNPs identified to have non-zero regression coefficients by the adaptive lasso, EBVS, and EBVS$_i$. Each SNP is labeled by the chromosome it resides on and four digits. The only SNP identified by EBVS, 5-2773, was identified by all other methods. While the adaptive lasso and EBVS$_i$ each identified five SNPs with non-zero regression coefficients, only three SNPs were identified in common, i.e., 1-3887, 5-2773, and 8-5143. The two SNPs on chromosome 17, i.e., 17-3907 identified by EBVS$_i$ and 17-9089 identified by the adaptive lasso, neighbor each other with 16k bases in between. However, the two SNPs on chromosome 4 are far apart from each other.

As in the previous section, we also used the multi-sample-split method to calculate p-values based on 50 sample splits for all methods. When we followed Benjamini and Hochberg (1995) to control the FDR at 0.1, none of these methods reported any significant SNPs, though the adaptive lasso and EBVS$_i$ reported SNP 5-2773 with p-values as small as 0.0031 and 0.0034 respectively. Instead, when controlling $\widehat{FDR}(\kappa) \leq 0.1$ for both EBVS and EBVS$_i$, EBVS identified only SNP 5-2773, and EBVS$_i$ identified both SNP 5-2773 and SNP 17-3907, with $\kappa = 0.8$. Note that SNP 17-3907 is one of the neighboring pair on chromosome 17. As shown in the simulation studies, $\widehat{FDR}(\kappa)$ usually overestimates $FDR(\kappa)$, so we expect that $FDR(0.8) < 0.1$ for both EBVS and EBVS$_i$.

Table 3
Results of Analyzing the Framingham Data Using LASSO, Adaptive LASSO, Scaled LASSO, EBVS, and EBVS$_i$.

                          Chromosome-SNP
                1-3887   4-0894   4-1174   5-2773   8-5143   17-3907  17-9089
$\hat{\beta}$
  LASSO          .0412    0        .0355    .0402    0        0        0
  LASSO$_a$      .1521    0        .0434    .1539   -.0200    0        .0167
  LASSO$_s$      .0990   -.0112    .0528    .1366   -.0207    0        .0294
  EBVS           0        0        0        .3778    0        0        0
  EBVS$_i$       .2417   -.0542    0        .3047   -.0857    .1093    0
p-value*
  LASSO          .2694   1        1         .6050   1        1        1
  LASSO$_a$      .2060   1        1         .0031   1        1        1
  LASSO$_s$     1        1        1         .0328   1        1        1
  EBVS           .3138   1        1         .0187   1        1        1
  EBVS$_i$       .0837   1        1         .0034   1        1        1
$\zeta$
  EBVS           .1277    .0133    .0347    .9976    .0981    .0869    .0966
  EBVS$_i$       .7609    .5275    .3269    .9718    .7464    .8450    .0009

*p-values were calculated using the multi-sample-split method.

6. Discussion

We intend to extend empirical Bayes thresholding (Johnstone and Silverman, 2004) to high-dimensional dependent data, allowing the incorporation of complicated a priori information on model parameters. An iterated conditional modes/medians

(ICM/M) algorithm is proposed to cycle through each coordinate of the parameters for a Bayesian update conditional on all other coordinates. The idea of cycling through coordinates has been revived recently for analyzing high-dimensional data. For example, the coordinate descent algorithm has been suggested for obtaining penalized least squares estimates; see Fu (1998), Daubechies et al. (2004), Wu and Lange (2008), and Breheny and Huang (2011). However, direct application of the coordinate descent algorithm here is challenged by the spike-and-slab posteriors.

Without a priori information other than that the regression coefficients are sparse, many lasso-type methods have been proposed, each with tuning parameters. It is difficult to select values for the tuning parameters, and in practice cross-validation is widely used. However, high-dimensional data are usually of small sample sizes, and available model-fitting algorithms demand intensive computation, both of which disfavor cross-validation. In particular, as genome-wide association studies focus more and more on complex diseases associated with rare variants (Nawy, 2012), the limited data usually contain a large number of SNPs which differ in only a small number of individuals. It is almost infeasible to use cross-validation, as the small number of individuals carrying a rare variant is likely to be included in the same fold. Instead, the proposed ICM/M algorithm obtains data-driven hyperparameters via conditional modes, which takes full advantage of each observation in the small sample.

With a large number of predictors and complicated correlation between estimates, classical p-values are difficult to compute, and it is therefore challenging to evaluate the significance of selected predictors. Wasserman and Roeder (2009) and Meinshausen et al. (2009) recently proposed calculating p-values by splitting the samples. That is, a sample is split into two folds; one fold is used as the training data to select variables, and the other is used to calculate the p-values of the selected variables. Similar to cross-validation, splitting samples significantly reduces the power of variable selection and p-value calculation, especially for high-dimensional data of small sample sizes. Again, it is almost infeasible to apply such a splitting method to genome-wide association studies with rare variants.

As shown in Section 4, an Ising model like (4.2) can be used to model a priori graphical information on the predictors. A maximum pseudo-likelihood approach is utilized to obtain the conditional mode of the Ising model parameters, and therefore the ICM/M algorithm can be easily implemented. Indeed, at each iteration of the ICM/M algorithm, we cycle through all parameters by obtaining the conditional modes/medians of one parameter (or a set of parameters) at a time, and therefore many classical approximation methods for low-dimensional problems may be used to simplify the implementation. On the other hand, the Ising prior (4.2) can also be modified to incorporate more complicated a priori information on the predictors. For example, we may multiply the interaction $\tau_i\tau_j$ by a weight $w_{ij}$ to model a known relationship between the $i$-th and $j$-th predictors. A copula model may also be established to model more complicated graphical relationships between the predictors.

For high-dimensional data, stochastic search has been employed to implement Bayesian variable selection; see Hans et al. (2007), Bottolo and Richardson (2010), Li and Zhang (2010), Stingo et al. (2011), and others. The reviewers pointed out that Rockova and George (2014) recently proposed EMVS, an EM approach for rapid Bayesian variable selection. EMVS assumes a "spike-and-slab" Gaussian mixture prior on each $\beta_j$,
$$\beta_j|\omega_j \sim (1 - \omega_j)N(0, \nu_0\sigma^2) + \omega_j N(0, \nu_1\sigma^2),$$
where $\omega_j$ is a prior probability, $\nu_1$ takes either a prespecified large value or a $g$-prior, and it is suggested that $\nu_0$ explore a sequence of positive values with $\nu_0 < \nu_1$. With an absolutely continuous spike, EMVS estimates $\omega_j$ at the E-step and estimates $\beta_j$ at the M-step. Note that a positive $\nu_0$ will not automatically yield a sparse estimate of $\beta$, which has to be sparsified using a prespecified threshold. However, the ICM/M algorithm estimates a common $\omega$ based on a conditional mode, and estimates $\beta_j$ based on a conditional median, which enables variable selection following Johnstone and Silverman (2004). We also propose a local posterior probability to evaluate the importance of each predictor, which helps control the false discovery rate.

Acknowledgements

This work was partially supported by NSF CAREER award IIS-0844945, U01CA128535 from the National Cancer Institute, and the Cancer Care Engineering project at the Oncological Science Center of Purdue University. We would like to thank the Editor and the Associate Editor for their insightful comments on the paper, which led to improvement of the manuscript.

The Framingham Heart Study is conducted and supported by the National Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.

Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64178. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.

Appendix A. Technical Details of the ICM/M Algorithms

A.1 The Algorithm in Section 3.1

Given $\hat{\beta}^{(k)}$, $\hat{\sigma}^{(k)}$, and $\hat{\omega}^{(k)}$ from the $k$-th iteration, the $(k+1)$-st iteration of the ICM/M algorithm proceeds in the order of $\hat{\beta}_1^{(k+1)}, \cdots, \hat{\beta}_p^{(k+1)}, \hat{\sigma}^{(k+1)}$, and $\hat{\omega}^{(k+1)}$, based on their fully conditional distributions.

Let
$$\begin{cases} \tilde{\mathbf{Y}}_j = \mathbf{Y} - \sum_{l=1}^{j-1}\mathbf{X}_l\beta_l^{(k+1)} - \sum_{l=j+1}^{p}\mathbf{X}_l\beta_l^{(k)},\\[2pt] z_j = \mathbf{X}_j^t\tilde{\mathbf{Y}}_j\,\big/\,\big(\hat{\sigma}^{(k)}\sqrt{n-1}\big). \end{cases}$$
Following Proposition 3.1, $\hat{\beta}_j^{(k+1)}$ is updated as the median of its posterior distribution conditional on $(z_j, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)})$.

Let
$$\tilde{F}^{(k+1)}(0|z_j) = P(\beta_j \geq 0|z_j, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)}) = \frac{1 - \Phi(0.5 - z_j)}{[1 - \Phi(z_j + 0.5)]e^{z_j} + \Phi(z_j - 0.5)},$$
and let $\omega_j = P(\beta_j \neq 0|z_j, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)})$, which can be calculated as follows,
$$\omega_j^{-1} = 1 + 4\big(1/\hat{\omega}^{(k)} - 1\big)\left[\frac{\Phi(z_j - 0.5)}{\phi(z_j - 0.5)} + \frac{1 - \Phi(z_j + 0.5)}{\phi(z_j + 0.5)}\right]^{-1}.$$

If $z_j > 0$, as shown in Johnstone and Silverman (2004), the posterior median $\hat{\beta}_j^{(k+1)}$ is zero if $\omega_j\tilde{F}^{(k+1)}(0|z_j) \leq 0.5$; otherwise,
$$\hat{\beta}_j^{(k+1)} = \frac{\hat{\sigma}^{(k)}}{\sqrt{n-1}}\left\{z_j - 0.5 - \Phi^{-1}\left(\frac{[1 - \Phi(z_j + 0.5)]e^{z_j} + \Phi(z_j - 0.5)}{2\omega_j}\right)\right\}.$$
If $z_j < 0$, $\hat{\beta}_j^{(k+1)}$ can be calculated on the basis of the antisymmetry property. That is, when a function $\hat{\beta}(z_j) = \hat{\beta}_j^{(k+1)}$ is defined, then $\hat{\beta}(-z_j) = -\hat{\beta}(z_j)$.

The conditional mode $\hat{\sigma}^{(k+1)}$ can be easily derived from the fact that $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma|\mathbf{Y}, \mathbf{X}, \hat{\beta}^{(k+1)})$, and the conditional mode $\hat{\omega}^{(k+1)}$ from the fact that $\hat{\omega}^{(k+1)} = \mathrm{mode}(\omega|\hat{\beta}^{(k+1)})$.

A.2 The Algorithm in Section 4.1

Following Proposition 4.1, $\hat{\beta}_j^{(k+1)}$ is updated as the median of its posterior distribution conditional on $(z_j, \hat{\varpi}_j, \hat{\sigma}^{(k)})$, where $\hat{\varpi}_j$ is calculated as follows,
$$\hat{\varpi}_j^{-1} = 1 + \exp\Big(-\hat{a}^{(k+1)} - \hat{b}^{(k+1)}\sum_{l:\,<j,l> \in E}\hat{\tau}_l\Big),$$
with $\hat{\tau}_l = I\{\hat{\beta}_l^{(k+1)} \neq 0\}$ for $l = 1, \cdots, j-1$, and $\hat{\tau}_l = I\{\hat{\beta}_l^{(k)} \neq 0\}$ for $l = j+1, \cdots, p$.

The conditional median $\hat{\beta}_j^{(k+1)}$ can be calculated following A.1, except that the posterior probability $\omega_j = P(\beta_j \neq 0|z_j, \hat{\varpi}_j, \hat{\sigma}^{(k)})$ should be updated as follows,
$$\omega_j^{-1} = 1 + 4\big(1/\hat{\varpi}_j - 1\big)\left[\frac{\Phi(z_j - 0.5)}{\phi(z_j - 0.5)} + \frac{1 - \Phi(z_j + 0.5)}{\phi(z_j + 0.5)}\right]^{-1}.$$
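In code, the only change from A.1 is the node-specific prior weight; a sketch (reusing the hypothetical `posterior_median` from the Section 3.1 sketch and the hypothetical `edges` adjacency list from Section 4):

```python
import numpy as np

def varpi(a_hat, b_hat, tau_neighbors):
    """Ising prior weight: varpi_j^{-1} = 1 + exp(-a - b * sum of neighbor tau's)."""
    return 1.0 / (1.0 + np.exp(-a_hat - b_hat * np.sum(tau_neighbors)))

# e.g. beta[j] = posterior_median(z_j, varpi(a_hat, b_hat, tau[edges[j]]), sigma, n)
```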

References

Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection. The Annals of Statistics, 32:870–897.

Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57:289–300.

Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal Statistical Society, Series D (The Statistician), 24:179–195.

Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society, Series B, 48:259–302.

Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for Bayesian model exploration. Bayesian Analysis, 5:583–618.

Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5:232–253.

Carlin, B. P. and Chib, S. (1995). Bayesian model choice via Markov chain Monte Carlo methods. Journal of the Royal Statistical Society, Series B, 57:473–484.

Cupples, L. A., Arruda, H. T., Benjamin, E. J., et al. (2007). The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Medical Genetics, 8(Suppl 1):S1.

Daubechies, I., Defrise, M., and De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413–1457.

Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet shrinkage. Biometrika, 81:425–455.

Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18:71–103.

Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96:1348–1360.

Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7:397–416.

George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88:881–889.

Geyer, C. J. (1991). Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface, pages 156–163.

Geyer, C. J. and Thompson, E. A. (1992). Constrained Monte Carlo maximum likelihood for dependent data. Journal of the Royal Statistical Society, Series B, 54:657–699.

Guyon, X. and Kunsch, H. R. (1992). Asymptotic comparison of estimators in the Ising model. In Barone, P., Frigessi, A., and Piccioni, M., editors, Stochastic Models, Statistical Methods, and Algorithms in Image Analysis, pages 177–198. Springer, New York.

Hans, C., Dobra, A., and West, M. (2007). Shotgun stochastic search for "large p" regression. Journal of the American Statistical Association, 102:507–516.

Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist and Bayesian strategies. The Annals of Statistics, 33:730–773.

Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London, Series A, 186:453–461.

Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32:1594–1649.

Li, C. and Li, H. (2010). Variable selection and regression analysis for graph-structured covariates with an application to genomics. The Annals of Applied Statistics, 4:1498–1516.

Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-dimensional covariate spaces with application in genomics. Journal of the American Statistical Association, 105:1202–1214.

Liang, G. and Yu, B. (2003). Maximum pseudo likelihood estimation in network tomography. IEEE Transactions on Signal Processing, 51:2043–2053.

Mase, S. (2000). Marked Gibbs processes and asymptotic normality of maximum pseudo-likelihood estimators. Mathematische Nachrichten, 209:151–169.

Meinshausen, N., Meier, L., and Buehlmann, P. (2009). P-values for high-dimensional regression. Journal of the American Statistical Association, 104:1671–1681.

Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83:1023–1036.

Nawy, T. (2012). Rare variants and the power of association. Nature Methods, 9:324.

Newton, M. A., Noueiry, A., Sarkar, D., and Ahlquist, P. (2004). Detecting differential gene expression with a semiparametric hierarchical mixture method. Biostatistics, 5:155–176.

Onsager, L. (1943). Crystal statistics. I. A two-dimensional model with an order-disorder transition. Physical Review, 65:117–149.

Pan, W., Xie, B., and Shen, X. (2010). Incorporating predictor network in penalized regression with application to microarray data. Biometrics, 66:474–484.

Rockova, V. and George, E. I. (2014). EMVS: the EM approach to Bayesian variable selection. Journal of the American Statistical Association, 109:828–846.

Stingo, F. C., Chen, Y. A., Tadesse, M. G., and Vannucci, M. (2011). Incorporating biological information into linear models: a Bayesian approach to the selection of pathways and genes. The Annals of Applied Statistics, 5:1978–2002.

Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika, 99:879–898.

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58:267–288.

Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21:5–42.

Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection. The Annals of Statistics, 37:2178–2201.

Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2:224–244.

Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society, Series B, 68:49–67.

Zhang, M., Zhang, D., and Wells, M. T. (2010). Generalized thresholding estimators for high-dimensional location parameters. Statistica Sinica, 20:911–926.

Zhou, X. and Schmidler, S. C. (2009). Bayesian parameter estimation in Ising and Potts models: a comparative study with applications to protein modeling. Technical report, Duke University.

Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the American Statistical Association, 101:1418–1429.