Electronic Journal of Statistics
ISSN: 1935-7524
arXiv: math.PR/0000000
Selecting Massive Variables Using
Iterated Conditional Modes/Medians
Vitara Pungpapong
Department of Statistics
Faculty of Commerce and Accountancy
Chulalongkorn University
Bangkok, Thailand
e-mail: vitara@cbs.chula.ac.th
Min Zhang and Dabao Zhang
Department of Statistics
Purdue University
West Lafayette, IN 47907
e-mail: minzhang@stat.purdue.edu; zhangdb@stat.purdue.edu
Abstract: Empirical Bayes methods are privileged in selecting variables out of massive yet structured candidates because of three attributes: taking prior information on model parameters, allowing data-driven hyperparameter values, and requiring no tuning parameters. We propose an iterated conditional modes/medians (ICM/M) algorithm to implement empirical Bayes selection of massive variables while incorporating sparsity or more complicated a priori information. The iterative conditional modes are employed to obtain data-driven estimates of hyperparameters, and the iterative conditional medians are used to estimate the model coefficients and therefore enable the selection of massive variables. The ICM/M algorithm is computationally fast, and can easily extend the empirical Bayes thresholding, which is adaptive to parameter sparsity, to complex data. Empirical studies suggest competitive performance of the proposed method, even in the simple case of selecting massive regression predictors.
AMS 2000 subject classifications: Primary 62J05; secondary 62C12,
62F07.
Keywords and phrases: Empirical Bayes Variable Selection, High Di-
mensional Data, Prior, Sparsity.
Contents

1 Introduction
2 The Method
   2.1 Iterated Conditional Modes/Medians
   2.2 Evaluation of Variable Importance
3 Selection of Sparse Variables
   3.1 The Algorithm
   3.2 Simulation Studies
4 Selection of Structured Variables
   4.1 The Algorithm
   4.2 Simulation Studies
5 Real Data Analysis
6 Discussion
Acknowledgements
Appendix A. Technical Details of the ICM/M Algorithms
References
1. Introduction
Selecting variables out of massive candidates is a challenging yet critical problem in analyzing high-dimensional data. Because high-dimensional data are usually of relatively small sample sizes, successful variable selection demands appropriate incorporation of a priori information. A fundamental piece of information is that only a few of the variables are significant and should be included in the underlying models, leading to a fundamental assumption of sparsity in variable selection (Fan and Li, 2001). Many methods have been developed to take full advantage of this sparsity assumption, mostly built upon thresholding procedures (Donoho and Johnstone, 1994); see Tibshirani (1996), Fan and Li (2001), and others.
Recently many efforts have been devoted to selecting variables from massive candidates by incorporating rich a priori information accumulated from historical research or practice. For example, Yuan and Lin (2006) defined group-wise norms for grouped variables. For graph-structured variables, Li and Li (2010) and Pan et al. (2010) proposed to use Laplacian matrices and $L_\gamma$ norms, respectively. Li and Zhang (2010) and Stingo et al. (2011) both employed Bayesian approaches to incorporate structural information on the variables, both formulating Ising priors.
Markov chain Monte Carlo (MCMC) algorithms have been commonly employed to develop Bayesian variable selection; see George and McCulloch (1993), Carlin and Chib (1995), Li and Zhang (2010), Stingo et al. (2011), and others. However, MCMC algorithms are computationally intensive, and appropriate hyperparameters may be difficult to obtain. On the other hand, penalty-based variable selection usually demands predetermination of certain tuning parameters (e.g., Fan and Li, 2001; Li and Li, 2010; Pan et al., 2010; Tibshirani, 1996; Yuan and Lin, 2006), which challenges high-dimensional data analysis. Although cross-validation has been widely suggested for choosing tuning parameters, it may be infeasible in certain situations, in particular when many variables rarely vary. Recently, Sun and Zhang (2012) proposed the scaled sparse linear regression to attach the tuning parameter to the estimable noise level.
Empirical Bayes methods are privileged in high-dimensional data analysis because they require no tuning parameters. They also allow incorporating a priori information while modeling the uncertainty of such prior information using hyperparameters. For example, Johnstone and Silverman (2004) modeled sparse normal means using a spike-and-slab prior. The mixing rate of the Dirac spike and the slab is taken as a hyperparameter to achieve data-driven thresholding, and the resultant empirical Bayes estimates are therefore adaptive to the sparsity of the high-dimensional parameters. As demonstrated by Johnstone and Silverman (2004), this empirical Bayes method can work better than traditional thresholding estimators. One important contribution of this paper is to develop a new algorithm which allows such empirical Bayes variable selection to be constructed for complex data.
We propose an iterated conditional modes/medians (ICM/M) algorithm for easy implementation and fast computation of empirical Bayes variable selection (EBVS). Similar to the iterated conditional modes algorithm (Besag, 1986), iterative conditional modes are used to optimize hyperparameters and parameters other than regression coefficients. Iterative conditional medians are used to enforce variable selection. As shown in Johnstone and Silverman (2004) and Zhang et al. (2010), when mixture priors are utilized, posterior medians can lead to thresholding rules and thus help screen out small and insignificant variables. Furthermore, ICM/M makes it easy to incorporate complicated priors for the purpose of selecting variables out of massive structured candidates. Taking the Ising prior as an example (Li and Zhang, 2010), we illustrate this strength of ICM/M.
The rest of this paper is organized as follows. In the next section, we propose the ICM/M algorithm for empirical Bayes variable selection (EBVS). We also explore controlling the false discovery rate (FDR) using conditional posterior probabilities. We implement the ICM/M algorithm in Section 3 for high-dimensional linear regression models, assuming only that the non-zero regression coefficients are few. Shown in Section 4 is the ICM/M algorithm when incorporating a priori information on graphical relationships between the predictors. Simulation studies are carried out in both Sections 3 and 4 to evaluate the performance of the corresponding ICM/M algorithms. An application to a real dataset from a genome-wide association study (GWAS) is presented in Section 5. We conclude this paper with a discussion in Section 6.
In the rest of this paper, the $j$-th component of a vector parameter, say $\beta$, is denoted by $\beta_j$; $\beta_{-j}$ denotes all components of $\beta$ except the $j$-th component; and $\beta_{j:k}$ includes the components of $\beta$ from $\beta_j$ to $\beta_k$. A parameter with a parenthesized superscript, say $\hat{\beta}^{(k)}$, indicates an estimate from the $k$-th iteration.
2. The Method
2.1. Iterated Conditional Modes/Medians
Consider a general variable selection problem presented with a likelihood function,
\[
L(Y; X\beta; \phi), \tag{2.1}
\]
where $Y$ is an $n \times 1$ random vector, $X$ is an $n \times p$ matrix containing values of $p$ variables, $\beta$ is a $p \times 1$ parameter vector with the $j$-th component $\beta_j$ representing the effect of the $j$-th variable in the model, and $\phi$ includes all other auxiliary parameters.
A typical variable selection task is to identify non-zero components of $\beta$, that is, to select important variables out of the $p$ candidates. For convenience, define $\tau_j = I\{\beta_j \neq 0\}$, which indicates whether the $j$-th variable should be selected into the model. Further denote $\tau = (\tau_1, \tau_2, \cdots, \tau_p)^t$. Here we consider an empirical Bayes variable selection, which assumes the priors
\[
\beta \sim \pi(\beta \mid \tau, \psi_1) \times \pi(\tau \mid \psi_2),
\qquad
\phi \sim \pi(\phi \mid \psi_3), \tag{2.2}
\]
where $\psi = (\psi_1^t, \psi_2^t, \psi_3^t)^t$ includes all hyperparameters.
To avoid high-dimensional integrals, we cycle through coordinates to obtain the estimate of each component of $(\beta, \phi, \psi)$ iteratively,
\[
\begin{aligned}
\hat{\beta}_j &= \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}),\\
\hat{\phi}_j &= \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}),\\
\hat{\psi}_j &= \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}).
\end{aligned} \tag{2.3}
\]
Indeed, with properly chosen priors for $\phi$ and $\psi$, both $\hat{\phi}_j = \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}, Y, X)$ and $\hat{\psi}_j = \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}, Y, X)$ can be obtained by maximizing the fully conditional posteriors, i.e.,
\[
\begin{aligned}
\hat{\phi}_j &= \hat{\phi}_j(\hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}) = \mathrm{mode}(\phi_j \mid Y, X, \hat{\beta}, \hat{\phi}_{-j}, \hat{\psi}),\\
\hat{\psi}_j &= \hat{\psi}_j(\hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}) = \mathrm{mode}(\psi_j \mid Y, X, \hat{\beta}, \hat{\phi}, \hat{\psi}_{-j}).
\end{aligned} \tag{2.4}
\]
When each $\hat{\beta}_j$ is also obtained by maximizing its fully conditional posterior, this suggests the iterated conditional modes (ICM) algorithm of Besag (1986). However, calculating the conditional mode of $\hat{\beta}_j$ is either infeasible or practically undesirable (it provides no variable selection). Indeed, Bayesian or empirical Bayes variable selection for high-dimensional data usually places a spike-and-slab prior on each $\beta_j$ (e.g. Ishwaran and Rao, 2005; Mitchell and Beauchamp, 1988), which induces a spike-and-slab posterior for each $\beta_j$. With a Dirac spike, it is infeasible to obtain the mode of such a spike-and-slab posterior. However, its median can be exactly zero, allowing selection of the median probability model as suggested by Barbieri and Berger (2004). Henceforth, following Johnstone and Silverman (2004), we construct $\hat{\beta}_j = \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}, Y, X)$ as the median of the fully conditional posterior, i.e.,
\[
\hat{\beta}_j = \hat{\beta}_j(\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}) = \mathrm{median}(\beta_j \mid Y, X, \hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}). \tag{2.5}
\]
With the iterative conditional median for $\beta_j$, and the iterative conditional modes for $\phi_j$ and $\psi_j$, each providing a Bayesian update of one component conditional on all other components, we hereafter propose the iterated conditional modes/medians (ICM/M) algorithm for implementing the empirical Bayes variable selection. As shown later, the ICM/M algorithm allows an easy extension of the (generalized) empirical Bayes thresholding methods of Johnstone and Silverman (2004) and Zhang et al. (2010) to dependent data. With a consistent initial point $(\hat{\beta}, \hat{\phi}, \hat{\psi})$, the cycling Bayesian updates of this algorithm lead to a well-established estimate $(\hat{\beta}, \hat{\phi}, \hat{\psi})$.
2.2. Evaluation of Variable Importance
When proposing a statistical model, we are primarily interested in evaluating the importance of variables, besides the model's predictive ability. For example, the objective of high-dimensional data analysis is often to identify a list of $J$ predictors that are most important or significant among the $p$ predictors. This is a common practice in biomedical research using high-throughput biotechnologies: ranking all markers and selecting a short list of candidates for follow-up studies.

For a Bayesian approach, inference on the importance of each variable can be done through its marginal posterior probability $P(\beta_j \neq 0 \mid Y, X)$. However, this quantity involves high-dimensional integrals which are difficult to calculate even in the case of moderate $p$. Furthermore, the marginal posterior probability may not be meaningful when predictors are highly correlated (which usually occurs in a large $p$ small $n$ data set). For example, suppose predictors $X_1$ and $X_2$ are linearly dependent and both are associated with the response variable. The marginal posterior probability of $X_1$ being included in the model might be very high and dominate the marginal posterior probability of $X_2$ being included in the model.
We propose a local posterior probability to evaluate the importance of a variable. That is, conditional on the optimal point $\{\hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}\}$ obtained from empirical Bayes variable selection through the ICM/M algorithm, the importance of a variable is evaluated by its full conditional posterior probability,
\[
\zeta_j = P(\beta_j \neq 0 \mid Y, X, \hat{\beta}_{-j}, \hat{\phi}, \hat{\psi}). \tag{2.6}
\]
Such a probability has a closed form which can be easily computed. We will
show later in simulation studies that the local posterior probability is a good
indicator to quantify the importance of variables.
Another challenging question is how large the list of important predictors should be. In much of the literature, the number of important variables reported is arbitrary. For instance, some laboratories may be interested in the top ten genes. Typically, however, there is interest in creating the list such that errors like type-I and type-II errors are controlled (Dudoit et al., 2003). False discovery rate (FDR) control is widely used in high-dimensional data analysis since it is less conservative and has more power than controlling the familywise error rate (Benjamini and Hochberg, 1995).

With the local posterior probabilities $\zeta$ and the assumption that the true $\beta$ is known, we can report a list containing predictors whose posterior probabilities exceed some bound $\kappa$, $0 \le \kappa < 1$. Given the data, the true FDR can be computed as
\[
FDR(\kappa) = \sum_{j=1}^{p} I\{\beta_j = 0, \zeta_j > \kappa\} \Big/ \sum_{j=1}^{p} I\{\zeta_j > \kappa\}. \tag{2.7}
\]
Newton et al. (2004) proposed the expected FDR given the data in a Bayesian scheme as
\[
\widehat{FDR}(\kappa) = \sum_{j=1}^{p} (1 - \zeta_j)\, I\{\zeta_j > \kappa\} \Big/ \sum_{j=1}^{p} I\{\zeta_j > \kappa\}. \tag{2.8}
\]
Therefore we can select predictors to report by controlling $\widehat{FDR}(\kappa)$ at a desired level.
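As an illustration only (not code from the paper), the estimate in (2.8) and a threshold search can be sketched as follows; zeta is assumed to be the vector of local posterior probabilities $\zeta_j$ from (2.6).

```python
import numpy as np

def fdr_hat(zeta, kappa):
    """Estimated FDR (2.8) of reporting predictors with zeta_j > kappa."""
    selected = zeta > kappa
    if not selected.any():
        return 0.0
    return np.sum(1.0 - zeta[selected]) / selected.sum()

def select_by_fdr(zeta, level=0.1):
    """Return the smallest observed threshold kappa whose estimated FDR is
    at or below `level`, together with the indices of reported predictors."""
    for kappa in np.sort(np.unique(zeta)):
        if fdr_hat(zeta, kappa) <= level:
            return kappa, np.flatnonzero(zeta > kappa)
    return 1.0, np.array([], dtype=int)
```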
3. Selection of Sparse Variables
Here we consider the empirical Bayes variable selection for the following regression model with high-dimensional data,
\[
Y = X\beta + \epsilon, \qquad \epsilon \sim N(0, \sigma^2 I_n). \tag{3.1}
\]
Further assume that the response is centered and the predictors are standardized, that is, $Y^t 1_n = 0$, $X^t 1_n = 0_p$, and
\[
X_j^t X_j = n - 1, \qquad j = 1, \cdots, p,
\]
where $X_j$ is the $j$-th column of $X$, i.e., $X = (X_1, X_2, \cdots, X_p)$.

Let $\tilde{Y}_j = Y - X\beta + X_j\beta_j$. Assuming all model parameters except $\beta_j$ are known, $\beta_j$ has a sufficient statistic
\[
\frac{1}{n-1}\, X_j^t \tilde{Y}_j \sim N\!\left(\beta_j,\; \frac{\sigma^2}{n-1}\right). \tag{3.2}
\]
on each of scaled βjas follows,
βj|σiid
(1 ω)δ0(·) + ωγ(·|σ),(3.3)
where δ0(·) is a Dirac delta function at zero, γ(·|σ) is assumed to be a probability
density function. This mixture prior implies that βjis zero with probability
(1 ω) and is drawn from the nonzero part of prior, γ(·|σ), with probability ω.
As suggested by Johnstone and Silverman (2004), a heavy-tailed prior such as
Laplace distribution is a good choice for γ(·|σ), that is,
γ(βj|σ) = αn1
2σexp αn1
σ|βj|,(3.4)
where α > 0 is a scale parameter. We take Jeffreys’ prior on σas π(σ)1
(Jeffreys,1946).
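For intuition only, here is a small sketch (not from the paper) of drawing coefficients from the spike-and-slab prior (3.3)–(3.4); note that (3.4) is a Laplace density with scale $\sigma/(\alpha\sqrt{n-1})$.

```python
import numpy as np

def sample_prior(p, omega, sigma, n, alpha=0.5, seed=None):
    """Draw a coefficient vector from the prior (3.3)-(3.4): each beta_j is 0
    with probability 1 - omega, otherwise Laplace with scale sigma/(alpha*sqrt(n-1))."""
    rng = np.random.default_rng(seed)
    nonzero = rng.random(p) < omega
    scale = sigma / (alpha * np.sqrt(n - 1))
    return np.where(nonzero, rng.laplace(0.0, scale, size=p), 0.0)
```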
Note that there is a connection between the Laplace prior and the lasso. Indeed, setting $\omega = 1$ in (2.4) and (3.3) leads to a lasso estimate with $\alpha$ related to the tuning parameter of the lasso; see the details in Tibshirani (1996). Our empirical Bayes variable selection allows a data-driven optimal choice of $\omega$. Indeed, a data-driven optimal $\alpha$ can also be obtained through the conditional mode suggested by (2.4), which avoids the issue a tuning parameter brings to the lasso (the lasso usually relies on cross-validation to choose an optimal tuning parameter). Johnstone and Silverman (2004) also suggested the default value $\alpha = 0.5$, which in general works well.
3.1. The Algorithm
Here we implement the ICM/M algorithm described in (2.4) and (2.5). Note that $\phi = \sigma$, and $\psi = (\omega, \alpha)$ or $\psi = \omega$ depending on whether $\alpha$ is fixed. Throughout this paper, we fix $\alpha = 0.5$ as suggested by Johnstone and Silverman (2004).

To obtain $\hat{\beta}_j^{(k+1)} = \mathrm{median}(\beta_j \mid Y, X, \hat{\beta}_{1:(j-1)}^{(k+1)}, \hat{\beta}_{(j+1):p}^{(k)}, \hat{\sigma}^{(k)}, \hat{\omega}^{(k)})$, we notice the sufficient statistic of $\beta_j$ in (3.2), and it is therefore easy to calculate $\hat{\beta}_j^{(k+1)}$ as stated below. Indeed, $\hat{\beta}_j^{(k+1)}$ is an empirical Bayes thresholding estimator as shown in Johnstone and Silverman (2004).

Proposition 3.1. With pre-specified values of $\sigma$ and $\beta_{-j}$, $\frac{1}{n-1} X_j^t \tilde{Y}_j$ is a sufficient statistic for $\beta_j$ w.r.t. the model (3.1). Furthermore, the iterative conditional median of $\beta_j$ in the ICM/M algorithm can be constructed as the posterior median of $\beta_j$ in the following Bayesian analysis,
\[
\frac{1}{\sigma\sqrt{n-1}}\, X_j^t \tilde{Y}_j \,\Big|\, \beta_j \sim N\!\left(\frac{\sqrt{n-1}}{\sigma}\,\beta_j,\; 1\right),
\]
\[
\beta_j \sim (1-\omega)\,\delta_0(\beta_j) + \omega\,\frac{\sqrt{n-1}}{4\sigma}\exp\!\left(-\frac{\sqrt{n-1}}{2\sigma}\,|\beta_j|\right).
\]

The conditional mode $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma \mid Y, X, \hat{\beta}^{(k+1)}, \hat{\omega}^{(k)})$ has an explicit solution,
\[
\hat{\sigma}^{(k+1)} = \frac{1}{4d}\left(c + \sqrt{c^2 + 16\,d\,\|Y - X\hat{\beta}^{(k+1)}\|^2}\right),
\]
where $c = \sqrt{n-1}\,\|\hat{\beta}^{(k+1)}\|_1$ and $d = n + \|\hat{\beta}^{(k+1)}\|_0 + 1$. Furthermore, the conditional mode $\hat{\omega}^{(k+1)} = \mathrm{mode}(\omega \mid Y, X, \hat{\beta}^{(k+1)}, \hat{\sigma}^{(k+1)})$ can be easily calculated as
\[
\hat{\omega}^{(k+1)} = \|\hat{\beta}^{(k+1)}\|_0 \big/ p.
\]
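A minimal sketch of the two closed-form conditional-mode updates in Proposition 3.1 (an illustration under the assumptions above, with $\alpha = 0.5$ and Jeffreys' prior on $\sigma$, not the authors' code):

```python
import numpy as np

def update_sigma(y, X, beta):
    """Conditional mode of sigma from Proposition 3.1."""
    n = X.shape[0]
    resid2 = np.sum((y - X @ beta) ** 2)           # ||Y - X beta||^2
    c = np.sqrt(n - 1) * np.sum(np.abs(beta))      # sqrt(n-1) * ||beta||_1
    d = n + np.count_nonzero(beta) + 1             # n + ||beta||_0 + 1
    return (c + np.sqrt(c ** 2 + 16.0 * d * resid2)) / (4.0 * d)

def update_omega(beta):
    """Conditional mode of omega: the proportion of non-zero coefficients."""
    return np.count_nonzero(beta) / beta.size
```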
3.2. Simulation Studies
To evaluate the performance of our proposed empirical Bayes variable selection
(EBVS) via ICM/M algorithm, we simulated data from model (3.1) with large p
small n, i.e., p= 1,000 and n= 100. There are a total of 20 non-zero regression
coefficients which are β1=··· =β10 = 2 and β101 =··· =β110 = 1. The
error standard deviation σis set to one. The predictors are partitioned into ten
D. Zhang et al./Iterated Conditional Modes/Medians Algorithm 8
blocks, each including 100 predictors which are serially correlated at the same
level of correlation coefficient ρ. We simulated 100 datasets for ρtaking values
in {0,0.1,0.2,···,0.9}respectively.
EBVS was compared with two popularly considered approaches, i.e., lasso
by Tibshirani (1996), and adaptive lasso by Zou (2006). The scaled lasso by
Sun and Zhang (2012) was also applied to the simulated datasets. Ten-fold cross-
validation was used to choose optimal tuning parameters for lasso and adaptive
lasso respectively. The median values of prediction error, false positive, and
false negative rates were reported for each approach based on the 100 simulated
datasets.
As shown in Figure 1, EBVS performs much better than the other three methods in terms of prediction error; in particular, over a range of values of $\rho$, EBVS consistently reported a median prediction error of approximately 1.5. Comparing lasso and adaptive lasso, adaptive lasso has the smaller prediction error when $\rho < 0.3$, but lasso has the smaller prediction error when $\rho > 0.3$.
Fig 1. Comparison of median prediction errors of lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), and EBVS (thick solid) by averaging over 100 datasets simulated for each $\rho$ in Section 3.2. (Figure not reproduced; the plot shows median prediction error against $\rho$.)
It is known that lasso can select variables inconsistently under certain conditions, and adaptive lasso was proposed to solve this issue (Zou, 2006). Figure 2 shows that lasso has very high false positive rates (more than 50%), and adaptive lasso significantly reduces the false positive rates. Indeed, lasso has much larger false positive rates than all other methods. It is interesting to observe that EBVS has zero false positive rates except when $\rho = 0.5$ and $\rho = 0.9$. All methods have very low false negative rates.

Fig 2. Comparison of false positive rates (top) and false negative rates (bottom). Averaging over 100 datasets simulated for each $\rho$ in Section 3.2, the false positive/negative rates were calculated for lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), and EBVS (thick solid).

Recently, Meinshausen et al. (2009) proposed a multi-sample-split method to construct p-values for high-dimensional regressions, especially in the case that the number of predictors is larger than the sample size. Here we applied this method, as well as EBVS, to each simulated dataset with a total of 50
sample-splits, and compared its performance with that of $\zeta_i$ defined in (2.6). For each predictor, Figure 3 plots the median of $-\log_{10}(1 - \zeta_i)$, truncated at 10, against the median of $-\log_{10}(\text{p-value})$ across the 100 datasets simulated from the regression model with $\rho = 0.5$ and $\rho = 0.9$ respectively. For either model, $\zeta_i$ can clearly distinguish true positives (i.e., predictors with $\tau_i \neq 0$) from true negatives (i.e., predictors with $\tau_i = 0$). However, as shown in the bottom panel of Figure 3, where $\rho = 0.9$, there is no clear cutoff of p-values to distinguish between true positives and true negatives. Here we also observed that $FDR(\kappa)$ can be well approximated by $\widehat{FDR}(\kappa)$ (results are not shown), with both dropping sharply to zero for $\kappa > 0.05$. We therefore can select $\kappa$ to threshold $\zeta_i$ for the purpose of controlling FDR.
4. Selection of Structured Variables
When information on structural relationships among the predictors is available, it is unreasonable to assume an independent prior on each $\beta_j$, $j = 1, \ldots, p$, as described in the previous section. Instead, we introduce an indicator variable $\tau_j = I\{\beta_j \neq 0\}$. Then the prior distribution of $\beta$ is set to depend on $\tau = (\tau_1, \ldots, \tau_p)^T$. Specifically, given $\tau_j$, $\beta_j$ has the mixture distribution
\[
\beta_j \mid \tau_j \sim (1 - \tau_j)\,\delta_0(\beta_j) + \tau_j\,\gamma(\beta_j), \tag{4.1}
\]
where $\gamma(\cdot)$ is the Laplace density with scale parameter $\alpha$.

The relationship among predictors can be represented by an undirected graph $G = (V, E)$ comprising a set $V$ of vertices and a set $E$ of edges. In this case, each node is associated with a binary-valued random variable $\tau_j \in \{0, 1\}$, and there is an edge between two nodes if the two covariates are correlated. The following Ising model (Onsager, 1943) is employed to model the a priori information on $\tau$,
\[
P(\tau) = \frac{1}{Z(a, b)}\exp\!\left(a\sum_i \tau_i + b\!\!\sum_{<i,j>\in E}\!\!\tau_i\tau_j\right), \tag{4.2}
\]
where $a$ and $b$ are two parameters, and
\[
Z(a, b) = \sum_{\tau \in \{0,1\}^p}\exp\!\left(a\sum_i \tau_i + b\!\!\sum_{<i,j>\in E}\!\!\tau_i\tau_j\right).
\]
The parameter $b$ corresponds to the "energies" associated with interactions between nearest neighboring nodes. When $b > 0$, the interaction is called ferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to have the same value. When $b < 0$, the interaction is called antiferromagnetic, i.e., neighboring $\tau_i$ and $\tau_j$ tend to have different values. When $b = 0$, there is no interaction, and the prior reduces to independent and identical Bernoulli distributions. The value of $a + b$ indicates the preferred value of each $\tau_i$. That is, if $a + b > 0$, $\tau_i$ tends to be one; if $a + b < 0$, $\tau_i$ tends to be zero.
Fig 3. Comparison of the local posterior probabilities (with $-\log_{10}(1-\zeta)$ truncated at 10) and p-values in evaluating variable importance by EBVS, with $\rho = 0.5$ (top) and $\rho = 0.9$ (bottom). Each plot is based on the 100 datasets simulated in Section 3.2. True positives are indicated by crosses and true negatives are indicated by circles.
4.1. The Algorithm
Here we implement the ICM/M algorithm to develop empirical Bayes variable selection with the Ising prior (abbreviated as EBVSi), incorporating the structure of the predictors into the modeling process. We assume the Ising prior to be a homogeneous Boltzmann model, but the algorithm can be extended to more general priors. With $\alpha = 0.5$, the ICM/M algorithm described in (2.4) and (2.5) proceeds with $\phi = \sigma$ and $\psi = (\omega, a, b)$.

For the hyperparameters $a$ and $b$, we calculate the conditional mode of $(a, b)$ simultaneously. Conceptually, we want $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ to maximize the prior likelihood $P(\tau)$ in (4.2). However, this requires computing $Z(a, b)$ by summing over the $p$-dimensional space of $\tau$, which demands intensive computation, especially for large $p$. Many methods have been proposed for approximate calculation; see Geyer (1991), Geyer and Thompson (1992), Zhou and Schmidler (2009), and others. Here we consider the composite likelihood approach (Varin et al., 2011), which is widely used when the actual likelihood is not easy to compute. In particular, $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ is obtained by maximizing a pseudo-likelihood function, a special type of composite conditional likelihood and a natural choice for a graphical model (Besag, 1975).
With the Ising prior on $\tau^{(k)}$, the pseudo-likelihood of $(a, b)$ is as follows,
\[
L_p(a, b) = \prod_{i=1}^{p} P(\tau_i^{(k)} \mid \tau_{-i}^{(k)}, a, b)
= \prod_{i=1}^{p} \frac{\exp\!\left(\tau_i^{(k)}\bigl(a + b\sum_{<i,j>\in E}\tau_j^{(k)}\bigr)\right)}{1 + \exp\!\left(a + b\sum_{<i,j>\in E}\tau_j^{(k)}\right)}.
\]
The surface of such a pseudo-likelihood is much smoother than that of the joint likelihood and therefore easy to maximize (Liang and Yu, 2003). The resultant estimator $(\hat{a}^{(k+1)}, \hat{b}^{(k+1)})$ obtained by maximizing $L_p(a, b)$ is biased for a finite sample size, but it is asymptotically unbiased and consistent (Guyon and Kunsch, 1992; Mase, 2000; Varin et al., 2011). The implementation of the pseudo-likelihood method is fast and straightforward, which makes it feasible for large-scale graphs. Indeed, $\hat{a}^{(k+1)}$ and $\hat{b}^{(k+1)}$ are the logistic regression coefficients obtained when the binary variable $\hat{\tau}_i^{(k)}$ is regressed on $\sum_{<i,j>\in E}\hat{\tau}_j^{(k)}$ for $i = 1, \cdots, p$.
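As an illustrative sketch (not the paper's code), the pseudo-likelihood update of $(a, b)$ can be carried out by directly maximizing $L_p(a, b)$; here `adjacency` is assumed to be a $p \times p$ 0/1 symmetric matrix with zero diagonal encoding the edge set $E$.

```python
import numpy as np
from scipy.optimize import minimize

def update_ising_params(tau, adjacency):
    """Maximize the Ising pseudo-likelihood L_p(a, b): a logistic regression of
    each tau_i on its neighbour sum  s_i = sum_{<i,j> in E} tau_j."""
    s = adjacency @ tau

    def neg_log_pl(theta):
        a, b = theta
        eta = a + b * s
        # -log L_p = -sum_i [ tau_i * eta_i - log(1 + exp(eta_i)) ]
        return -np.sum(tau * eta - np.logaddexp(0.0, eta))

    res = minimize(neg_log_pl, x0=np.zeros(2), method="BFGS")
    return res.x  # (a_hat, b_hat)
```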
As shown in the previous sections, the conditional median $\hat{\beta}_j^{(k+1)}$ can be constructed on the basis of the following proposition.

Proposition 4.1. With pre-specified values of $\sigma$, $a$, $b$, and $\beta_{-j}$, $\frac{1}{n-1} X_j^t \tilde{Y}_j$ is a sufficient statistic for $\beta_j$ w.r.t. the model (3.1). Furthermore, the iterative conditional median of $\beta_j$ in the ICM/M algorithm can be constructed as the posterior median of $\beta_j$ in the following Bayesian analysis,
\[
\frac{1}{\sigma\sqrt{n-1}}\, X_j^t \tilde{Y}_j \,\Big|\, \beta_j \sim N\!\left(\frac{\sqrt{n-1}}{\sigma}\,\beta_j,\; 1\right),
\]
\[
\beta_j \sim (1-\varpi_j)\,\delta_0(\beta_j) + \varpi_j\,\frac{\sqrt{n-1}}{4\sigma}\exp\!\left(-\frac{\sqrt{n-1}}{2\sigma}\,|\beta_j|\right),
\]
where the probability $\varpi_j$ is specified as follows,
\[
\varpi_j^{-1} = 1 + \exp\!\left(-a - b\!\!\sum_{k:\,<j,k>\in E}\!\!\tau_k\right).
\]
The conditional mode $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma \mid Y, X, \hat{\beta}^{(k+1)}, \hat{\omega}^{(k)})$ has an explicit solution,
\[
\hat{\sigma}^{(k+1)} = \frac{1}{4d}\left(c + \sqrt{c^2 + 16\,d\,\|Y - X\hat{\beta}^{(k+1)}\|^2}\right),
\]
where $c = \sqrt{n-1}\,\|\hat{\beta}^{(k+1)}\|_1$ and $d = n + \|\hat{\beta}^{(k+1)}\|_0 + 1$.
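For illustration (a sketch under the same adjacency-matrix assumption as above, not code from the paper), the conditional slab probability $\varpi_j$ of Proposition 4.1 is a logistic function of the neighbour sum:

```python
import numpy as np

def slab_probability(j, tau, adjacency, a, b):
    """varpi_j = P(tau_j = 1 | tau_{-j}, a, b) under the Ising prior (4.2);
    adjacency is a p x p 0/1 matrix with zero diagonal."""
    neighbor_sum = adjacency[j] @ tau
    return 1.0 / (1.0 + np.exp(-a - b * neighbor_sum))
```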
4.2. Simulation Studies
Here we simulated large $p$ small $n$ datasets from model (3.1) with structured predictors, i.e., the values of $\beta_j$ depend on correlated $\tau_j$. We consider two different correlation structures for the $\tau_i$. Both EBVS and EBVSi were applied to each simulated dataset, and they were compared with three other methods, i.e., lasso, adaptive lasso, and scaled lasso.
Case I. Markov Chain. For each $j = 1, \cdots, p$, $\beta_j = 0$ if $\tau_j = 0$; and if $\tau_j = 1$, $\beta_j$ is independently sampled from a uniform distribution on $[0.3, 2]$. The indicator variables $\tau_1, \cdots, \tau_p$ form a Markov chain with the transition probabilities specified as follows,
\[
\begin{aligned}
P(\tau_{j+1} = 0 \mid \tau_j = 0) &= 1 - P(\tau_{j+1} = 1 \mid \tau_j = 0) = 0.99;\\
P(\tau_{j+1} = 0 \mid \tau_j = 1) &= 1 - P(\tau_{j+1} = 1 \mid \tau_j = 1) = 0.5.
\end{aligned}
\]
The first indicator variable $\tau_1$ is sampled from Bernoulli(0.5). The error variance is fixed at one. For each individual, the predictors were simulated from an AR(1) process with correlation coefficient $\rho$ ranging from 0 to 0.9 in steps of 0.1.
The median prediction errors of all methods are shown in Figure 4. EBVS performed slightly better than adaptive lasso, and both performed much better than lasso and scaled lasso. Lasso, adaptive lasso, scaled lasso, and EBVS all presented varying prediction errors as $\rho$ goes from 0 to 0.9. However, the prediction errors of EBVSi are rather stable for varying values of $\rho$, and are much smaller than those of the other four methods.
Shown in Figure 5 are the false positive rates and false negative rates of the different methods. Not surprisingly, lasso has false positive rates over 70%, much higher than those of the other methods. Adaptive lasso significantly reduces the false positive rates, though they remain above 10%. On the other hand, the false positive rates of both EBVS and EBVSi are less than 10%. Indeed, EBVS reported false positive rates of zero for all values of $\rho$, while EBVSi reported false positive rates of zero when $\rho < 0.6$ and of 0.1 when $\rho \ge 0.6$. However, EBVSi reported smaller false negative rates than EBVS. Therefore, EBVS tends to select correct true positives by including fewer true positives in the final model than the model obtained by EBVSi. We then conjecture that, when covariates are highly correlated, EBVSi tends to select more variables into the model. In particular, if one covariate is selected into the model, EBVSi tends to include its highly correlated neighboring predictors as well.
Fig 4. Comparison of median prediction errors of lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), EBVS (dashed), and EBVSi (thick solid) by averaging over 100 datasets simulated for each $\rho$ in Case I of Section 4.2.
Figure 6 shows $FDR(\kappa)$ and $\widehat{FDR}(\kappa)$ of EBVSi for the models with $\rho = 0.5$ and $\rho = 0.9$ respectively (we also observed that $FDR(\kappa)$ of EBVS is similar to that of EBVSi; results are not shown). Overall, the estimate $\widehat{FDR}(\kappa)$ dominates $FDR(\kappa)$, i.e., the true FDR. Therefore, we will be conservative in selecting variables when controlling FDR using $\widehat{FDR}(\kappa)$. For example, if one would like to list important predictors while controlling FDR at 0.1 for the model with $\rho = 0.9$, $\kappa$ should be set around 0.1 based on $FDR(\kappa)$. However, one can set $\kappa$ around 0.4 based on $\widehat{FDR}(\kappa)$, which corresponds to a true FDR as low as zero.
Plotted in Figure 7 are the p-values calculated using the multi-sample-split method (Meinshausen et al., 2009) against $\zeta_j$ for each predictor. For both EBVS and EBVSi, $\zeta_j$ quantified variable importance better than p-values in terms of distinguishing true positives from true negatives. Overall, EBVSi outperforms EBVS since it provides larger values of $\zeta$ for true positives, while both EBVS and EBVSi keep true negatives with $\zeta_j$ close to zero. Indeed, EBVS produced $\zeta_j$ close to 0 for several true positives, while EBVSi produced larger values of $\zeta_j$ for these true positives. We then summarize empirically that, by incorporating a priori information, EBVSi has more power to detect true positives than EBVS.
Fig 5. Comparison of false positive rates (top) and false negative rates (bottom). Averaging over 100 datasets simulated for each $\rho$ in Case I of Section 4.2, the false positive/negative rates were calculated for lasso (dotted), adaptive lasso (dash-dotted), scaled lasso (thin solid), EBVS (dashed), and EBVSi (thick solid).

Fig 6. Plots of median true FDR (solid) and estimated FDR (dotted) versus $\kappa$ based on the results of applying EBVSi to 100 datasets simulated for Case I in Section 4.2, with $\rho = 0.5$ (top) and $\rho = 0.9$ (bottom) respectively.

Fig 7. Comparison of local posterior probabilities (with $-\log_{10}(1-\zeta)$ truncated at 10) and p-values in evaluating variable importance by EBVS and EBVSi: (a) EBVS, $\rho = 0.5$; (b) EBVS, $\rho = 0.9$; (c) EBVSi, $\rho = 0.5$; (d) EBVSi, $\rho = 0.9$. Each plot is based on 100 datasets simulated for Case I in Section 4.2. True positives are indicated by crosses and true negatives are indicated by circles.

Case II. Pathway Information. To mimic a real genome-wide association study (GWAS), we took values of some single nucleotide polymorphisms (SNPs) in the Framingham dataset (Cupples et al., 2007) to generate $X$ in model (3.1). Specifically, 24 human regulatory pathways involving 1,502 genes were retrieved from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database. For each gene involved in these pathways, at most two SNPs listed in the
Framingham dataset were randomly selected out of those SNPs residing in the
genetic region. If no SNP could be found within the genetic region, a nearest
neighboring SNP would be identified. A total of 1,782 SNPs were selected. We
first identified 952 unrelated individuals out of the Framingham dataset, and
used them to generate predictor values of the training dataset. For the rest of the
Framingham dataset, we identified 653 unrelated individuals to generate predic-
tor values of the test dataset. Five pathways were assumed to be associated with
the phenotype Y. That is, all 311 SNPs involved in these five pathways were
assumed to have nonzero regression coefficients, which were randomly sampled
from a uniform distribution over [0.5,3]. With the error variance at five, a total
of 100 datasets were simulated.
As shown in Table 1, lasso has a relatively low prediction error. However, its median false positive rate is as high as 69%, much higher than those of the other methods. Adaptive lasso (LASSOa), on the other hand, has a very large prediction error, but its false positive rate is much smaller than that of lasso. EBVS presented the lowest false positive rate among all the methods, and its false negative rate is also smaller than that of adaptive lasso. Indeed, with initial values obtained from lasso, EBVS reduces the false positive rate of lasso by more than 98%. By incorporating the pathway information using an Ising prior on $\tau$, EBVSi reported the lowest prediction error. Furthermore, EBVSi compromised between lasso, adaptive lasso, and EBVS to balance well between the false positive rate and the false negative rate. Scaled lasso (LASSOs) performed unstably in analyzing our simulated datasets, and it selected more than 800 positives in seven of the simulated datasets.
Table 1
Results of Simulation Studies with Pathway Information (Case II).

Method    Prediction Error (s.e.)   False Positive (s.e.)   False Negative (s.e.)
LASSO     30.6928 (.4050)           .6905 (.0004)           .0204 (.0004)
LASSOa    206.1994 (.5726)          .0744 (.0017)           .1266 (.0002)
LASSOs    368.6464 (6.1308)         .1290 (.0077)           .1475 (.0012)
EBVS      95.3686 (1.8820)          .0118 (.0010)           .0970 (.0008)
EBVSi     21.7731 (.2320)           .0308 (.0015)           .0394 (.0003)

Note: The results of the scaled lasso excluded seven datasets. Applying the scaled lasso to these seven datasets reported a median prediction error of 2.45 × 10^10, a false positive rate of .7059, and a false negative rate of .0043.
5. Real Data Analysis
The empirical Bayes variable selection using the ICM/M algorithm was applied to the Framingham dataset (Cupples et al., 2007) to find SNPs associated with vitamin D level. The SNPs in the dataset were preprocessed following common GWAS criteria, that is, both missingness per individual and missingness per SNP are less than 10%; the minor allele frequency (MAF) is no less than 5%; and the Hardy–Weinberg test on each SNP uses a significance level of 0.001. This resulted in a total of 370,773 SNPs, of which 84,834 resided in 2,167 genetic regions involving 112 pathways relevant to vitamin D level. We pre-screened SNPs by selecting those having p-values of univariate tests smaller than 0.1, and ended up with 7,824 SNPs for the following analysis. As in Section 4.2, a training dataset and a test dataset were constructed with 952 and 519 unrelated individuals respectively. The response variable is the log-transformed vitamin D level.

We applied lasso, adaptive lasso, scaled lasso, EBVS, and EBVSi to the training dataset, and calculated the prediction errors using the test dataset. The results are reported in Table 2. While identifying many more SNPs than all other methods, lasso reported the largest prediction error. Other than scaled lasso (LASSOs), EBVS has the smallest prediction error though it identified only one SNP. Adaptive lasso (LASSOa) and EBVSi each identified five SNPs, and their prediction errors are slightly higher than that of EBVS.
Table 2
Prediction Errors for the Framingham Dataset.

Method    Prediction Error   No. of Identified SNPs
LASSO     .2560              14
LASSOa    .2085              5
LASSOs    .2066              25
EBVS      .2078              1
EBVSi     .2121              5
Presented in Table 3 are the seven SNPs identified to have non-zero regression coefficients by adaptive lasso, EBVS, and EBVSi. Each SNP is identified by the chromosome it resides on and four digits. The only SNP identified by EBVS, 5-2773, was identified by all other methods. While adaptive lasso and EBVSi each identified five SNPs with non-zero regression coefficients, there are only three commonly identified SNPs, i.e., 1-3887, 5-2773, and 8-5143. The two SNPs on chromosome 17, i.e., 17-3907 identified by EBVSi and 17-9089 identified by adaptive lasso, neighbor each other with 16 kilobases in between. However, the two SNPs on chromosome 4 are far apart from each other.

As in the previous section, we also used the multi-sample-split method to calculate p-values based on 50 sample splits for all methods. When we followed Benjamini and Hochberg (1995) to control FDR at 0.1, none of these methods reported any significant SNPs, though adaptive lasso and EBVSi reported SNP 5-2773 with p-values as small as 0.0031 and 0.0034 respectively. Instead, when controlling $\widehat{FDR}(\kappa) \le 0.1$ for both EBVS and EBVSi, EBVS identified only SNP 5-2773, and EBVSi identified both SNP 5-2773 and SNP 17-3907, with $\kappa = 0.8$. Note that SNP 17-3907 is one of the neighboring pair on chromosome 17. As shown in the simulation studies, $\widehat{FDR}(\kappa)$ usually overestimates $FDR(\kappa)$, so we expect that $FDR(0.8) < 0.1$ for both EBVS and EBVSi.
Table 3
Results of Analyzing the Framingham Data Using LASSO, Adaptive LASSO, Scaled LASSO, EBVS, and EBVSi.

                      Chromosome-SNP
                 1-3887   4-0894   4-1174   5-2773   8-5143   17-3907  17-9089
$\hat{\beta}$
   LASSO         .0412    0        .0355    .0402    0        0        0
   LASSOa        .1521    0        .0434    .1539    -.0200   0        .0167
   LASSOs        .0990    -.0112   .0528    .1366    -.0207   0        .0294
   EBVS          0        0        0        .3778    0        0        0
   EBVSi         .2417    -.0542   0        .3047    -.0857   .1093    0
p-value
   LASSO         .2694    1        1        .6050    1        1        1
   LASSOa        .2060    1        1        .0031    1        1        1
   LASSOs        1        1        1        .0328    1        1        1
   EBVS          .3138    1        1        .0187    1        1        1
   EBVSi         .0837    1        1        .0034    1        1        1
$\zeta$
   EBVS          .1277    .0133    .0347    .9976    .0981    .0869    .0966
   EBVSi         .7609    .5275    .3269    .9718    .7464    .8450    .0009

p-values were calculated using the multi-sample-split method.

6. Discussion

We intend to extend empirical Bayes thresholding (Johnstone and Silverman, 2004) to high-dimensional dependent data, allowing the incorporation of complicated a priori information on model parameters. An iterated conditional modes/medians
(ICM/M) algorithm is proposed to cycle through each coordinate of the param-
eters for a Bayesian update conditional on all other coordinates. The idea of
cycling through coordinates has been revived recently for analyzing high dimen-
sional data. For example, the coordinate descent algorithm has been suggested
to obtain penalized least squares estimates, see Fu (1998), Daubechies et al.
(2004), Wu and Lange (2008), and Breheny and Huang (2011). However, di-
rect application of the coordinate descent algorithm here is challenged with the
spike-and-slab posteriors.
Without a priori information other than that regression coefficients are sparse,
many lasso-type methods have been proposed with some tuning parameters. It
is difficult to select a value for the tuning parameters, and in practice the cross-
validation method is widely used. However, high-dimensional data are usually
of small sample sizes, and available model fitting algorithms demand intensive
computation, both of which disfavor the cross-validation method. In particular, as genome-wide association studies focus more and more on complex diseases associated with rare variants (Nawy, 2012), the limited data usually contain a large number of SNPs which differ in only a small number of individuals. It is almost infeasible to use cross-validation, as the few individuals carrying a rare variant are likely to be included in the same fold. Instead, the proposed ICM/M algorithm obtains data-driven hyperparameters via conditional modes, which takes full advantage of each observation in the small sample.
With a large number of predictors and complicated correlations between estimates, classical p-values are difficult to compute, and it is therefore challenging to evaluate the significance of selected predictors. Wasserman and Roeder (2009) and Meinshausen et al. (2009) recently proposed to calculate p-values by splitting the samples. That is, when a sample is split into two folds, one fold is used as the training data to select variables, and the other is used to calculate p-values of the selected variables. Similar to applying the cross-validation method, splitting samples significantly reduces the power of variable selection and p-value calculation, especially for high-dimensional data of small sample sizes. Again, it is almost infeasible to apply such a splitting method to genome-wide association studies with rare variants.
As shown in Section 4, an Ising model such as (4.2) can be used to model a priori graphical information on the predictors. A maximum pseudo-likelihood approach is utilized to obtain the conditional mode of the Ising model parameters, and therefore the ICM/M algorithm can be easily implemented. Indeed, at each iteration of the ICM/M algorithm, we cycle through all parameters by obtaining conditional modes/medians of one parameter (or a set of parameters); therefore, many classical approximation methods for low-dimensional problems may be used to simplify the implementation. On the other hand, the Ising prior (4.2) can also be modified to incorporate more complicated a priori information on the predictors. For example, we may multiply the interaction $\tau_i\tau_j$ by a weight $w_{ij}$ to model a known relationship between the $i$-th and $j$-th predictors. A copula model may be established to model more complicated graphical relationships between the predictors.
For high-dimensional data, stochastic search has been employed to implement Bayesian variable selection; see Hans et al. (2007), Bottolo and Richardson (2010), Li and Zhang (2010), Stingo et al. (2011), and others. The reviewers pointed out that Rockova and George (2014) recently proposed EMVS as an EM approach for rapid Bayesian variable selection. EMVS assumes a "spike-and-slab" Gaussian mixture prior on each $\beta_j$,
\[
\beta_j \mid \omega_j \sim (1 - \omega_j)\,N(0, \nu_0\sigma^2) + \omega_j\,N(0, \nu_1\sigma^2),
\]
where $\omega_j$ is a prior probability, $\nu_1$ takes either a prespecified large value or a g-prior, and $\nu_0$ is suggested to be explored over a sequence of positive values with $\nu_0 < \nu_1$. With an absolutely continuous spike, EMVS estimates $\omega_j$ at the E-step and estimates $\beta_j$ at the M-step. Note that a positive $\nu_0$ will not automatically yield a sparse estimate of $\beta$, which has to be sparsified using a prespecified threshold. However, the ICM/M algorithm estimates a common $\omega$ based on a conditional mode, and estimates $\beta_j$ based on a conditional median, which enables variable selection following Johnstone and Silverman (2004). We also propose a local posterior probability to evaluate the importance of each predictor, which helps control the false discovery rate.
Acknowledgements
This work was partially supported by NSF CAREER award IIS-0844945, U01CA128535
from the National Cancer Institute, and the Cancer Care Engineering project
at the Oncological Science Center of Purdue University. We would like to thank
the Editor and the Associate Editor for their insightful comments on the paper,
which led to improvement of the manuscript.
The Framingham Heart Study is conducted and supported by the National
Heart, Lung, and Blood Institute (NHLBI) in collaboration with Boston Uni-
versity (Contract No. N01-HC-25195). This manuscript was not prepared in
collaboration with investigators of the Framingham Heart Study and does not
necessarily reflect the opinions or views of the Framingham Heart Study, Boston
University, or NHLBI.
Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract
N02-HL-64178. SHARe Illumina genotyping was provided under an agreement
between Illumina and Boston University.
Appendix A. Technical Details of the ICM/M Algorithms

A.1 The Algorithm in Section 3.1

Given $\hat{\beta}^{(k)}$, $\hat{\sigma}^{(k)}$, and $\hat{\omega}^{(k)}$ from the $k$-th iteration, the $(k+1)$-st iteration of the ICM/M algorithm can proceed in the order of $\hat{\beta}_1^{(k+1)}, \cdots, \hat{\beta}_p^{(k+1)}, \hat{\sigma}^{(k+1)}$, and $\hat{\omega}^{(k+1)}$, based on their fully conditional distributions.

Let
\[
\tilde{Y}_j = Y - \sum_{l=1}^{j-1} X_l\hat{\beta}_l^{(k+1)} - \sum_{l=j+1}^{p} X_l\hat{\beta}_l^{(k)},
\qquad
z_j = \frac{X_j^t\tilde{Y}_j}{\hat{\sigma}^{(k)}\sqrt{n-1}}.
\]
Following Proposition 3.1, $\hat{\beta}_j^{(k+1)}$ is updated as the median of its posterior distribution conditional on $(z_j, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)})$.

Let
\[
\tilde{F}^{(k+1)}(0 \mid z_j) = P(\beta_j > 0 \mid z_j, \beta_j \neq 0, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)})
= \frac{1 - \Phi(0.5 - z_j)}{[1 - \Phi(z_j + 0.5)]\,e^{z_j} + \Phi(z_j - 0.5)},
\]
and let $\omega_j = P(\beta_j \neq 0 \mid z_j, \hat{\omega}^{(k)}, \hat{\sigma}^{(k)})$, which can be calculated as follows,
\[
\omega_j^{-1} = 1 + 4\left(1/\hat{\omega}^{(k)} - 1\right)
\left[\frac{\Phi(z_j - 0.5)}{\phi(z_j - 0.5)} + \frac{1 - \Phi(z_j + 0.5)}{\phi(z_j + 0.5)}\right]^{-1}.
\]
If $z_j > 0$, as shown in Johnstone and Silverman (2005), the posterior median $\hat{\beta}_j^{(k+1)}$ is zero if $\omega_j\,\tilde{F}^{(k+1)}(0 \mid z_j) \le 0.5$; otherwise,
\[
\hat{\beta}_j^{(k+1)} = \frac{\hat{\sigma}^{(k)}}{\sqrt{n-1}}
\left(z_j - 0.5 - \Phi^{-1}\!\left(\frac{[1 - \Phi(z_j + 0.5)]\,e^{z_j} + \Phi(z_j - 0.5)}{2\,\omega_j}\right)\right).
\]
If $z_j < 0$, $\hat{\beta}_j^{(k+1)}$ can be calculated on the basis of its antisymmetry property. That is, when a function $\hat{\beta}(z_j) = \hat{\beta}_j^{(k+1)}$ is defined, then $\hat{\beta}(-z_j) = -\hat{\beta}(z_j)$.

The conditional mode $\hat{\sigma}^{(k+1)}$ can be easily derived following the fact that $\hat{\sigma}^{(k+1)} = \mathrm{mode}(\sigma \mid Y, X, \hat{\beta}^{(k+1)})$, and the conditional mode $\hat{\omega}^{(k+1)}$ can be easily derived following the fact that $\hat{\omega}^{(k+1)} = \mathrm{mode}(\omega \mid \hat{\beta}^{(k+1)})$.
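A minimal sketch of the conditional-median update above (an illustration under the formulas of A.1 with $\alpha = 0.5$, not the authors' code); the inputs are the standardized statistic $z_j$ and the current $\hat{\omega}^{(k)}$ and $\hat{\sigma}^{(k)}$.

```python
import numpy as np
from scipy.stats import norm

def conditional_median(z, omega_hat, sigma_hat, n):
    """Posterior median of beta_j given z_j = X_j' Ytilde_j / (sigma_hat*sqrt(n-1)),
    under the spike-and-slab Laplace prior with alpha = 0.5 (Appendix A.1)."""
    sign = 1.0 if z >= 0 else -1.0
    z = abs(z)                                  # antisymmetry: beta_hat(-z) = -beta_hat(z)
    den = (1.0 - norm.cdf(z + 0.5)) * np.exp(z) + norm.cdf(z - 0.5)
    F_tilde = norm.cdf(z - 0.5) / den           # P(beta_j > 0 | z_j, slab part)
    ratio = (norm.cdf(z - 0.5) / norm.pdf(z - 0.5)
             + (1.0 - norm.cdf(z + 0.5)) / norm.pdf(z + 0.5))
    omega_j = 1.0 / (1.0 + 4.0 * (1.0 / omega_hat - 1.0) / ratio)
    if omega_j * F_tilde <= 0.5:
        return 0.0                              # thresholded to exactly zero
    m = z - 0.5 - norm.ppf(den / (2.0 * omega_j))
    return sign * sigma_hat / np.sqrt(n - 1) * m
```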
A.2 The Algorithm in Section 4.1

Following Proposition 4.1, $\hat{\beta}_j^{(k+1)}$ is updated as the median of its posterior distribution conditional on $(z_j, \hat{\varpi}_j, \hat{\sigma}^{(k)})$, where $\hat{\varpi}_j$ is calculated as follows,
\[
\hat{\varpi}_j^{-1} = 1 + \exp\!\left(-\hat{a}^{(k+1)} - \hat{b}^{(k+1)}\!\!\sum_{l:\,<j,l>\in E}\!\!\hat{\tau}_l\right),
\]
with $\hat{\tau}_l = I\{\hat{\beta}_l^{(k+1)} \neq 0\}$ for $l = 1, \cdots, j-1$, and $\hat{\tau}_l = I\{\hat{\beta}_l^{(k)} \neq 0\}$ for $l = j+1, \cdots, p$.

The conditional median $\hat{\beta}_j^{(k+1)}$ can be calculated following A.1, except that the posterior probability $\omega_j = P(\beta_j \neq 0 \mid z_j, \hat{\varpi}_j, \hat{\sigma}^{(k)})$ should be updated as follows,
\[
\omega_j^{-1} = 1 + 4\left(1/\hat{\varpi}_j - 1\right)
\left[\frac{\Phi(z_j - 0.5)}{\phi(z_j - 0.5)} + \frac{1 - \Phi(z_j + 0.5)}{\phi(z_j + 0.5)}\right]^{-1}.
\]
References
Barbieri, M. M. and Berger, J. O. (2004). Optimal predictive model selection.
The Annals of Statistics, 32:870–897.
Benjamini, Y. and Hochberg, Y. (1995). Controlling the false discovery rate:
a practical and powerful approach to multiple testing. Journal of the Royal
Statistical Society, Series B, 57:289–300.
Besag, J. (1975). Statistical analysis of non-lattice data. Journal of the Royal
Statistical Society Series D (The Statistician), 24:179–195.
Besag, J. (1986). On the statistical analysis of dirty pictures. Journal of the
Royal Statistical Society Series B, 48:259–302.
Bottolo, L. and Richardson, S. (2010). Evolutionary stochastic search for
bayesian model exploration. Bayesian Analysis, 5:583–618.
Breheny, P. and Huang, J. (2011). Coordinate descent algorithms for nonconvex penalized regression, with applications to biological feature selection. The Annals of Applied Statistics, 5:232–253.
Carlin, B. P. and Chib, S. (1995). Bayesian model choice via markov chain monte
carlo methods. Journal of the Royal Statistical Society Series B, 57:473–484.
Cupples, L. A., Arruda, H. T., Benjamin, E. J., et al. (2007). The Framingham Heart Study 100K SNP genome-wide association study resource: overview of 17 phenotype working group reports. BMC Medical Genetics, 8(Suppl 1):S1.
Daubechies, I., Defrise, M., and Mol, C. D. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57:1413–1457.
Donoho, D. L. and Johnstone, I. M. (1994). Ideal spatial adaptation by wavelet
shrinkage. Biometrika, 81:425–455.
Dudoit, S., Shaffer, J. P., and Boldrick, J. C. (2003). Multiple hypothesis testing in microarray experiments. Statistical Science, 18:71–103.
Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood
and its oracle properties. Journal of the American Statistical Association,
96:1348–1360.
Fu, W. J. (1998). Penalized regressions: The bridge versus the lasso. Journal of
Computational and Graphical Statistics, 7:397–416.
George, E. I. and McCulloch, R. E. (1993). Variable selection via Gibbs sampling. Journal of the American Statistical Association, 88:881–889.
Geyer, C. J. (1991). Markov chain monte carlo maximum likelihood. In Com-
puting Science and Statistics, Proceedings of the 23rd Symposium on the In-
terface, pages 156–163.
Geyer, C. J. and Thompson, E. A. (1992). Constrained monte carlo maximum
likelihood for dependent data. Journal of the Royal Statistical Society Series
B, 54:657–699.
Guyon, X. and Kunsch, H. R. (1992). Asymptotic comparison of estimators
in the ising model. In Barone, P., Frigessi, A., and Piccioni, M., editors,
Stochastic Models, Statistical Methods, and Algorithms in Image Analysis,
pages 177–198. Springer New York.
Hans, C., Dobra, A., and West, M. (2007). Shotgun stochastic search for “large
p” regression. Journal of the American Statistical Association, 102:507–516.
Ishwaran, H. and Rao, J. S. (2005). Spike and slab variable selection: frequentist
and bayesian strategies. The Annals of Statistics, 33:730–773.
Jeffreys, H. (1946). An invariant form for the prior probability in estimation problems. Proceedings of the Royal Society of London Series A, 186:453–461.
Johnstone, I. M. and Silverman, B. W. (2004). Needles and straw in haystacks: empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32:1594–1649.
Li, C. and Li, H. (2010). Variable selection and regression analysis for graph-
structured covariates with an application to genomics. The Annals of Applied
Statistics, 4:1498–1516.
Li, F. and Zhang, N. R. (2010). Bayesian variable selection in structured high-
dimensional covariate spaces with application in genomics. Journal of the
American Statistical Association, 105:1202–1214.
Liang, G. and Yu, B. (2003). Maximum pseudo likelihood estimation in network
tomography. IEEE Transactions on Signal Processing, 51:2043–2053.
Mase, S. (2000). Marked gibbs processes and asymptotic normality of maximum
pseudo-likelihood estimators. Mathematische Nachrichten, 209:151–169.
Meinshausen, N., Meier, L., and Buehlmann, P. (2009). P-values for high-
dimensional regression. Journal of the American Statistical Association,
104:1671–1681.
Mitchell, T. J. and Beauchamp, J. J. (1988). Bayesian variable selection in linear
regression. Journal of the American Statistical Association, 83:1023–1036.
Nawy, T. (2012). Rare variants and the power of association. Nature Methods,
9:324.
Onsager, L. (1943). Crystal statistics. i. a two-dimensional model with an order-
disorder transition. Physical Review, 65:117–149.
Pan, W., Xie, B., and Shen, X. (2010). Incorporating predictor network in
penalized regression with application to microarray data. Biometrics, 66:474–
484.
Rockova, V. and George, E. I. (2014). EMVS: The EM approach to Bayesian variable selection. Journal of the American Statistical Association, 109:828–846.
Stingo, F. C., Chen, Y. A., Tadesse, M. G., and Vannucci, M. (2011). Incorpo-
rating biological information into linear models: a bayesian approach to the
selection of pathways and genes. The Annals of Applied Statistics, 5:1978–
2002.
Sun, T. and Zhang, C.-H. (2012). Scaled sparse linear regression. Biometrika,
99:879–898.
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal
of Royal Statistical Society Series B, 58:267–288.
Varin, C., Reid, N., and Firth, D. (2011). An overview of composite likelihood
methods. Statistica Sinica, 21:5–42.
Wasserman, L. and Roeder, K. (2009). High-dimensional variable selection.
Annals of Statistics, 37:2178–2201.
Wu, T. T. and Lange, K. (2008). Coordinate descent algorithms for lasso pe-
nalized regression. The Annals of Applied Statistics, 2:224–244.
Yuan, M. and Lin, Y. (2006). Model selection and estimation in regression with
grouped variables. Journal of Royal Statistical Society Series B, 68:49–67.
Zhang, M., Zhang, D., and Wells, M. T. (2010). Generalized thresholding estima-
tors for high-dimensional location parameters. Statistica Sinica, 20:911–926.
Zhou, X. and Schmidler, S. C. (2009). Bayesian parameter estimation in ising
and potts models: a comparative study with applications to protein modeling.
Technical report, Duke University.
Zou, H. (2006). The adaptive lasso and its oracle properties. Journal of the
American Statistical Association, 101:1418–1429.
... Moreover, extensions of the EMVS algorithm 60 to GLMs are not straightforward. Pungpapong et al. (2015) presented a contribution of empirical Bayes thresholding (Johnstone and Silverman (2004)) to select variables in a linear regression framework which can identify and estimate sparse predictors efficiently. A dirac spike-and-slab prior, a mixture prior of an atom of probability at zero 65 and a heavy-tailed density, is put on each regression coefficient. ...
... With a dirac spike-and-slab posterior for each regression coefficient, a conditional median is employed to enforce the variable selection. As demonstrated in Pungpapong et al. (2015), empirical Bayes variable selection can also handle the case when the information about structural relationship among predictors is available through the Ising prior. The ICM/M algorithm can be easily 75 implemented for such complicated prior. ...
... To introduce sparsity in the modeling process, an independent mixture of an atom of probability at zero and a distribution describing non-zero effect can be put on each of the regression coefficient β j as prior in the Bayesian framework. With the approximated distribution of pseudodata Z and assuming that all other parameters except β j are known, β j has a sufficient statistic Pungpapong et al. (2015), the prior of β j is in the form ...
Article
Full-text available
High-dimensional linear and nonlinear models have been extensively used to identify associations between response and explanatory variables. The variable selection problem is commonly of interest in the presence of massive and complex data. An empirical Bayes model for high-dimensional generalized linear models (GLMs) is considered in this paper. The extension of the Iterated Conditional Modes/Medians (ICM/M) algorithm is proposed to build up a GLM. With the construction of pseudodata and pseudovariances based on iteratively reweighted least squares (IRLS), conditional modes are employed to obtain data-drive optimal values for hyperparameters and conditional medians are used to estimate regression coefficients. With a spike-and-slab prior for each coefficient, a conditional median can enforce variable estimation and selection at the same time. The ICM/M algorithm can also incorporate more complicated prior by taking the network structural information into account through the Ising model prior. Here we focus on two extensively used models for genomic data: binary logistic and Cox's proportional hazards models. The performance of the proposed method is demonstrated through both simulation studies and real data examples.
... The explicit solution for each / 2 k is crucial for the high computational efficiency of our new SBL. This approach has been applied to variable selection and it is called iterative conditional mode algorithm (Pungpapong et al., 2015). ...
... This explains the high computational efficiency of the proposed SBL in this paper. This approach has been taken in statistics for variable selection by Johnstone and Silverman (2004) and recently by Pungpapong et al. (2015) who called the method an iterative conditional mode algorithm. A theoretical justification for replacing the term 'conditional on variances' by the term 'conditional on modes' can be found in Equation (4.4) of Mackay (1992). ...
Article
Motivation: Genomic scanning approaches that detect one locus at a time are subject to many problems in genome-wide association studies (GWAS) and quantitative trait locus (QTL) mapping. The problems include large matrix inversion, over-conservativeness of tests after Bonferroni correction, and difficulty in evaluating the total genetic contribution to a trait's variance. Targeting these problems, we take a further step and investigate a multiple-locus model that detects all markers simultaneously in a single model. Results: We developed a sparse Bayesian learning (SBL) method for QTL mapping and GWAS. This new method adopts a coordinate descent algorithm to estimate parameters (marker effects) by updating one parameter at a time conditional on the current values of all other parameters. It uses an L2 type of penalty that allows the method to handle extremely large sample sizes (>100,000). Simulation studies show that SBL often has higher statistical power, and the simulated true loci are often detected with extremely small p-values, indicating that SBL is insensitive to stringent thresholds in significance testing. Availability: An R package (sbl) is available on the Comprehensive R Archive Network (CRAN) and https://github.com/MeiyueComputBio/sbl/tree/master/R%20packge. Supplementary information: Supplementary data are available at Bioinformatics online.
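As an illustration of the coordinate-wise updating described above, the sketch below performs generic coordinate descent under an L2-type penalty; it is a simplified stand-in rather than the SBL update itself, and the penalty value is illustrative:

import numpy as np

def coordinate_descent_ridge(X, y, lam=1.0, n_iter=50):
    """Update each coefficient in closed form, conditional on the
    current values of all the other coefficients."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]               # partial residual
            beta[j] = X[:, j] @ r_j / (X[:, j] @ X[:, j] + lam)  # scalar closed-form update
    return beta

Because each update is a scalar closed form, no large matrix needs to be inverted, which is the source of the computational efficiency claimed above.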
Article
Logistic regression is one of the widely used classification tools to construct prediction models. For datasets with a large number of features, feature subset selection methods are considered to obtain accurate and interpretable prediction models, in which irrelevant and redundant features are removed. In this paper, we address the problem of feature subset selection in logistic regression using modern optimization techniques. To this end, we formulate this problem as a mixed-integer exponential cone program (MIEXP). To the best of our knowledge, this is the first time both nonlinear and discrete aspects of the underlying problem are fully considered within an exact optimization framework. We derive different versions of the MIEXP model by means of regularization and goodness-of-fit measures, including the Akaike Information Criterion and the Bayesian Information Criterion. Finally, we solve our MIEXP models using the solver MOSEK and evaluate the performance of our different versions over a set of toy examples and benchmark datasets. The results show that our approach is quite successful in obtaining accurate and interpretable prediction models compared to other methods from the literature.
Article
The Cox proportional hazards model has been widely used in cancer genomic research that aims to identify genes, from a high-dimensional gene expression space, associated with the survival time of patients. With the increase in expertly curated biological pathways, it is challenging to incorporate such complex networks in fitting a high-dimensional Cox model. This paper considers a Bayesian framework that employs the Ising prior to capture relations among genes represented by graphs. A spike-and-slab prior is also assigned to each of the coefficients for the purpose of variable selection. The iterated conditional modes/medians (ICM/M) algorithm is proposed for the implementation of Cox models. The ICM/M estimates hyperparameters using conditional modes and obtains coefficients through conditional medians. This procedure produces some coefficients that are exactly zero, making the model more interpretable. Comparisons of the ICM/M and other regularized Cox models were carried out with both simulated and real data. Compared to lasso, adaptive lasso, elastic net, and DegreeCox, the ICM/M yielded more parsimonious models with consistent variable selection. The ICM/M model also produced fewer false positives than the other methods and showed promising results in terms of predictive accuracy. In terms of computing time among the network-aware methods, the ICM/M algorithm is substantially faster than DegreeCox, even when incorporating a large complex network. The implementation of the ICM/M algorithm for the Cox regression model is provided in the R package icmm, available on the Comprehensive R Archive Network (CRAN).
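The Ising prior mentioned above couples the inclusion indicators of genes that are neighbors in the graph. Under one common parameterization, assumed here only for illustration (the hyperparameter values are not those estimated by the method), the full-conditional prior probability that a gene is included is:

import numpy as np

def ising_inclusion_probability(j, gamma, neighbors, a=-2.0, b=1.0):
    """Full conditional P(gamma_j = 1 | rest) under the Ising prior
    p(gamma) proportional to exp(a * sum_j gamma_j + b * sum_{j~k} gamma_j * gamma_k),
    where neighbors[j] lists the graph neighbors of variable j."""
    s = a + b * sum(gamma[k] for k in neighbors[j])
    return 1.0 / (1.0 + np.exp(-s))

A positive b raises the prior odds of including a gene when its neighbors in the pathway are already included.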
Article
Logistic regression is an effective tool in case-control analysis. With advanced high-throughput technology, the quest for fast and efficient methods for fitting high-dimensional logistic regression has gained much interest. An empirical Bayes model for logistic regression is considered in this article. A spike-and-slab prior is used for variable selection, which plays a vital role in building an effective predictive model while keeping the model interpretable. To increase the power of variable selection, we incorporate biological knowledge through the Ising prior. The iterated conditional modes/medians (ICM/M) algorithm is developed to fit the logistic model and has a computational advantage over Markov chain Monte Carlo (MCMC) algorithms. The implementation of the ICM/M algorithm for both linear and logistic models can be found in the R package icmm, freely available on the Comprehensive R Archive Network (CRAN). Simulation studies were carried out to assess the performance of our method, with lasso and adaptive lasso as benchmarks. Overall, the simulation studies show that the ICM/M outperforms the others in terms of the number of false positives and has competitive predictive ability. An application to a real data set from a Parkinson's disease study is also presented for illustration. To identify important variables, our approach provides the flexibility to select variables based on local posterior probabilities while controlling the false discovery rate at a desired level, rather than relying only on regression coefficients.
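One common way to turn local posterior probabilities into a selection rule with false discovery rate control, shown here only as an illustration of the idea described above, is to keep the largest set of variables whose average local fdr stays below the target level:

import numpy as np

def select_by_local_fdr(posterior_prob, alpha=0.05):
    """Select variables whose average local fdr (one minus the local
    posterior inclusion probability) is at most alpha."""
    local_fdr = 1.0 - np.asarray(posterior_prob)
    order = np.argsort(local_fdr)                                  # most promising first
    running_mean = np.cumsum(local_fdr[order]) / np.arange(1, len(order) + 1)
    n_select = int(np.sum(running_mean <= alpha))
    return np.sort(order[:n_select])

print(select_by_local_fdr([0.99, 0.20, 0.97, 0.50], alpha=0.05))   # -> [0 2]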
Article
With high-dimensional data emerging in various domains, sparse logistic regression models have gained much interest among researchers. Variable selection plays a key role in both improving prediction accuracy and enhancing the interpretability of built models. Bayesian variable selection approaches enjoy many advantages, such as high selection accuracy and the ability to easily incorporate many kinds of prior knowledge. However, because Bayesian methods generally make inference from the posterior distribution with Markov chain Monte Carlo (MCMC) techniques, they become intractable in high-dimensional situations due to the large search space. To address this issue, a novel variational Bayesian method for variable selection in high-dimensional logistic regression models is presented. The proposed method is based on the indicator model, in which each covariate is equipped with a binary latent variable indicating whether it is important. A Bernoulli-type prior is adopted for the latent indicator variable. For the specification of the hyperparameter in the Bernoulli prior, we provide two schemes to determine its optimal value so that the model can achieve sparsity adaptively. To identify important variables and make predictions, an efficient variational Bayesian approach is employed to make inference from the posterior distribution. Experiments conducted with both synthetic and publicly available data show that the new method outperforms or is very competitive with other popular counterparts.
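A minimal sketch of the indicator model described above, in which each coefficient is the product of a Bernoulli latent indicator and a slab effect (the prior inclusion probability and slab scale below are illustrative):

import numpy as np

rng = np.random.default_rng(0)
p, pi0, slab_sd = 1000, 0.02, 1.0
gamma = rng.random(p) < pi0                  # binary latent indicators of importance
beta = gamma * rng.normal(0.0, slab_sd, p)   # coefficients are zero unless the indicator is on

The variational approach then approximates the posterior over (gamma, beta) with a factorized distribution rather than sampling it by MCMC.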
Article
A crucial problem in building a multiple regression model is the selection of predictors to include. The main thrust of this article is to propose and develop a procedure that uses probabilistic considerations for selecting promising subsets. This procedure entails embedding the regression setup in a hierarchical normal mixture model where latent variables are used to identify subset choices. In this framework the promising subsets of predictors can be identified as those with higher posterior probability. The computational burden is then alleviated by using the Gibbs sampler to indirectly sample from this multinomial posterior distribution on the set of possible subset choices. Those subsets with higher probability—the promising ones—can then be identified by their more frequent appearance in the Gibbs sample.
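A minimal sketch of the key Gibbs step for one latent subset indicator in this hierarchical normal mixture setup (the hyperparameter values are illustrative):

import numpy as np
from scipy.stats import norm

def gibbs_update_indicator(beta_j, w=0.1, tau=0.01, c=100.0):
    """Sample the indicator for coefficient j given its current value, when
    beta_j ~ N(0, tau^2) if excluded, beta_j ~ N(0, (c*tau)^2) if included,
    and w is the prior inclusion probability."""
    dens_in = norm.pdf(beta_j, scale=c * tau)
    dens_out = norm.pdf(beta_j, scale=tau)
    p_include = w * dens_in / (w * dens_in + (1.0 - w) * dens_out)
    return np.random.rand() < p_include

Subsets whose indicators are switched on most frequently across the Gibbs sample are the promising ones.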
Article
Bridge regression, a special family of penalized regressions with penalty function Σ|β_j|^γ for γ ≤ 1, is considered. A general approach to solving for the bridge estimator is developed. A new algorithm for the lasso (γ = 1) is obtained by studying the structure of the bridge estimators. The shrinkage parameter γ and the tuning parameter λ are selected via generalized cross-validation (GCV). A comparison between the bridge model (γ ≤ 1) and several other shrinkage models, namely ordinary least squares regression (λ = 0), the lasso (γ = 1), and ridge regression (γ = 2), is made through a simulation study. It is shown that bridge regression performs well compared to the lasso and ridge regression. These methods are demonstrated through an analysis of prostate cancer data. Some computational advantages and limitations are discussed.
Article
Analyzing high-throughput genomic, proteomic, and metabolomic data usually involves estimating high-dimensional location parameters. Thresholding estimators can significantly improve such estimation when many parameters are zero, i.e., parameters are sparse. Several such estimators have been constructed to be adaptive to parameter sparsity. However, they assume that the underlying parameter spaces are symmetric. Since many applications present asymmetric parameter spaces, we introduce a class of generalized thresholding estimators. A construction of these estimators is developed using a Bayes approach, where an important constraint on the hyperparameters is identified. A generalized empirical Bayes implementation is presented for estimating high-dimensional yet sparse normal means. This implementation provides generalized thresholding estimators which are adaptive to both sparsity and asymmetry of high-dimensional parameters.
Article
Model search in regression with very large numbers of candidate predictors raises challenges for both model specification and computation, for which standard approaches such as Markov chain Monte Carlo (MCMC) methods are often infeasible or ineffective. We describe a novel shotgun stochastic search (SSS) approach that explores “interesting” regions of the resulting high-dimensional model spaces and quickly identifies regions of high posterior probability over models. We describe algorithmic and modeling aspects, priors over the model space that induce sparsity and parsimony over and above the traditional dimension penalization implicit in Bayesian and likelihood analyses, and parallel computation using cluster computers. We discuss an example from gene expression cancer genomics, comparisons with MCMC and other methods, and theoretical and simulation-based aspects of performance characteristics in large-scale regression model searches. We also provide software implementing the methods.
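The "interesting regions" are explored by scoring, in parallel, all models one move away from the current one. A sketch of that neighborhood (addition, deletion, and swap moves; the scoring step is omitted):

def sss_neighborhood(model, p):
    """All models reachable from `model`, a set of variable indices out of
    p candidates, by adding one variable, deleting one variable, or
    swapping one in-model variable with one out-of-model variable."""
    inside, outside = set(model), set(range(p)) - set(model)
    additions = [inside | {k} for k in outside]
    deletions = [inside - {j} for j in inside]
    swaps = [(inside - {j}) | {k} for j in inside for k in outside]
    return additions + deletions + swaps

print(len(sss_neighborhood({0, 1}, p=5)))  # 3 additions + 2 deletions + 6 swaps = 11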
Article
Statistical methods vary widely in the power to predict which rare genetic sequence variants are functional.
Article
A Markovian approach to the specification of spatial stochastic interaction for irregularly distributed data points is reviewed. Three specific methods of statistical analysis are proposed; the first two are generally applicable whilst the third relates only to "normally" distributed variables. Some reservations are expressed and the need for practical investigations is emphasized.
Article
Markov chain Monte Carlo (MCMC) integration methods enable the fitting of models of virtually unlimited complexity, and as such have revolutionized the practice of Bayesian data analysis. However, comparison across models may not proceed in a completely analogous fashion, owing to violations of the conditions sufficient to ensure convergence of the Markov chain. In this paper we present a framework for Bayesian model choice, along with an MCMC algorithm that does not suffer from convergence difficulties. Our algorithm applies equally well to problems where only one model is contemplated but its proper size is not known at the outset, such as problems involving integer-valued parameters, multiple changepoints or finite mixture distributions. We illustrate our approach with two published examples.