Evolutionary Stochastic Search for Bayesian model exploration
Institute for Mathematical Sciences, Imperial College London, UK
Centre for Biostatistics, Imperial College, London, UK
Address for correspondence: Sylvia Richardson, Department of Epidemiology and Public Health, Imperial College, 1 Norfolk Place, London, W2 1PG, UK.
Implementing Bayesian variable selection for linear Gaussian regression models for analysing high dimensional data sets is of current interest in many fields. In order to make such analysis operational, we propose a new search algorithm based upon Evolutionary Monte Carlo and designed to work equally well when n > p or under the “large p, small n” paradigm, thus making fully Bayesian multivariate analysis feasible, for example, in genomics experiments. The methodology is compared with a recently proposed search algorithm in an extensive simulation study. Finally, two real data examples in genomics are presented, demonstrating the performance of the algorithm in a space of up to 10,000 covariates.
Keywords: Evolutionary Monte Carlo; Fast Scan Metropolis-Hastings schemes; Linear Gaussian regression
models; Variable selection.
1 Introduction

This paper is a contribution to the methodology of Bayesian variable selection for linear Gaussian regression
models, an important problem which has been much discussed both from a theoretical and a practical perspective
(see Chipman et al., 2001 and Clyde and George, 2004 for extensive literature reviews). Recent advances have
been made in two directions, unravelling the theoretical properties of different choices of prior structure for the
regression coefficients (Fernández et al., 2001; Liang et al., 2008) and proposing algorithms that can explore
efficiently the huge model space consisting of all the possible subsets when there are a large number of covariates,
using either MCMC or other search algorithms (Kohn et al., 2001; Dellaportas et al., 2002; Nott and Kohn,
2005; Hans et al., 2007).
In this paper, we propose a new sampling algorithm for implementing the variable selection model, based on
tailoring ideas from Evolutionary Monte Carlo (Liang and Wong, 2000; Jasra et al., 2007) in order to overcome the known difficulties that MCMC samplers face in a high-dimensional multimodal model space: enumerating the model space rapidly becomes infeasible even for a moderate number of covariates. For a Bayesian approach to
be operational, it needs to be accompanied by an algorithm that samples the indicators of the selected subsets
of covariates, together with any other parameters that have not been integrated out. We stress that our new
algorithm for searching through the model space has many generic features that are of interest per se and
can be easily coupled with any prior formulation for the variance-covariance of the regression coefficients. We
illustrate this point by implementing the case of g-priors for the regression coefficients as well as the case of
independent priors: in both cases the formulation we adopt is general and allows the specification of a further
level of hierarchy on the priors for the regression coefficients, if so desired.
The paper is structured as follows. In Section 2, we present the background of Bayesian variable selection,
reviewing briefly alternative prior specifications for the regression coefficients, namely g-priors and independent
priors. Section 3 is devoted to the description of our MCMC sampler which uses a wide portfolio of moves.
Section 4 demonstrates the good performance of our new MCMC algorithm in a variety of examples with
different structure on the predictors, where the number of covariates p ranges between 30 and 1,000, and the
number of samples n is both larger and smaller than p. A comparison with the recent Shotgun Stochastic Search
algorithm of Hans et al. (2007) is presented. In Section 5 we complement the simulation results by illustrating
the performance of our algorithm with two real data sets, including a challenging case where the number of
predictors is extremely large (p = 10,000) with respect to the sample size (n = 50). Finally Section 6 contains
some concluding remarks and a discussion of extensions.
2 Bayesian variable selection

Let $y = (y_1, \ldots, y_n)^T$ be a sequence of $n$ observed responses and $x_i = (x_{i1}, \ldots, x_{ip})^T$ a vector of predictors for $y_i$, $i = 1, \ldots, n$, of dimension $p \times 1$. Moreover let $X$ be the $n \times p$ design matrix with $i$th row $x_i^T$. A Gaussian linear model can be described by the equation

$$y = \alpha 1_n + X\beta + \varepsilon, \qquad (1)$$

where $\alpha$ is the unknown constant, $1_n$ is a column vector of ones, $\beta = (\beta_1, \ldots, \beta_p)^T$ is a $p \times 1$ vector of unknown parameters and $\varepsilon \sim N(0, \sigma^2 I_n)$.
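To fix ideas, the following minimal Python sketch simulates data from model (1) with a sparse $\beta$, the situation variable selection targets; all sizes and values are our own illustrative choices, not from the paper, and the columns of $X$ are centred as assumed below.

```python
import numpy as np

# Illustrative simulation from model (1); all sizes and values are assumptions.
rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)               # centre the columns of the design matrix
beta = np.zeros(p)
beta[[0, 3]] = [1.5, -2.0]        # a sparse beta: only two predictors are truly active
alpha, sigma = 1.0, 0.5
y = alpha + X @ beta + sigma * rng.standard_normal(n)
gamma_true = beta != 0            # the latent indicator gamma introduced below
```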
Suppose one wants to model the relationship between y and a subset of x1,...,xp, but there is uncertainty
about which subset to use. Following the usual convention of only considering models that have the intercept
α, this problem, known as variable selection or subset selection, is particularly interesting when p is large and
parsimonious models containing only a few predictors are sought, with a view to gaining interpretability. From a Bayesian perspective the problem is tackled by placing a constant prior density on $\alpha$ and a prior on $\beta$ such that if $\beta_j = 0$ the $j$th predictor does not appear in the expected value of $y$: as a result the prior structure on $\beta$ depends on a latent binary vector $\gamma = (\gamma_1, \ldots, \gamma_p)^T$, where $\gamma_j = 1$ if $\beta_j \neq 0$ and $\gamma_j = 0$ if $\beta_j = 0$. The overall number of possible models grows exponentially with $p$, and selecting the best model that predicts $y$ is equivalent to finding one among the $2^p$ subsets that form the model space.
Given the latent variable $\gamma$, a Gaussian linear model can therefore be written as

$$y = \alpha 1_n + X_\gamma \beta_\gamma + \varepsilon,$$

where $\beta_\gamma$ is the non-zero vector of coefficients extracted from $\beta$ and $X_\gamma$ is the design matrix of dimension $n \times p_\gamma$, $p_\gamma \equiv \gamma^T 1_p$, with columns corresponding to $\gamma_j = 1$. In the following we will assume that, apart from the intercept $\alpha$, $x_1, \ldots, x_p$ contains no variables that would be included in every possible model and that the columns of the design matrix have all been centred with mean 0.
It is recommended to treat the intercept separately and assign it a constant prior, $p(\alpha) \propto 1$ (Fernández et al., 2001; Berger and Molina, 2005). When coupled with the latent variable $\gamma$, the conjugate prior structure of $(\beta_\gamma, \sigma^2)$ follows a normal-inverse-gamma distribution

$$p(\beta_\gamma \,|\, \gamma, \sigma^2) = N(m_\gamma, \sigma^2 \Sigma_\gamma) \qquad (2)$$

$$p(\sigma^2 \,|\, \gamma) = p(\sigma^2) = \mathrm{InvGa}(a_\sigma, b_\sigma) \qquad (3)$$

with $a_\sigma, b_\sigma > 0$. Some guidelines on how to fix the values of the hyperparameters $a_\sigma$ and $b_\sigma$ are provided in Cripps et al. (2006). The limit case of (3) when $a_\sigma \to 0$ and $b_\sigma \to 0$ corresponds to Jeffreys' prior (e.g. Bernardo and Smith, 1994) for the error variance, $p(\sigma^2) \propto \sigma^{-2}$. Taking into account the likelihood structure (1) and the prior specifications for $\alpha$, (2) and (3), the joint distribution of all the variables (based on further conditional independence conditions) can be written in general form as

$$p(y, \gamma, \alpha, \beta_\gamma, \sigma^2) = p(y \,|\, \gamma, \alpha, \beta_\gamma, \sigma^2)\, p(\alpha)\, p(\beta_\gamma \,|\, \gamma, \sigma^2)\, p(\sigma^2)\, p(\gamma). \qquad (4)$$
The main advantage of the conjugate structure (2) and (3) is the analytical tractability of the marginal likelihood, whatever the specification of the prior covariance matrix $\Sigma_\gamma$:

$$p(y \,|\, \gamma) \propto \big| X_\gamma^T X_\gamma + \Sigma_\gamma^{-1} \big|^{-1/2}\, |\Sigma_\gamma|^{-1/2}\, (2b_\sigma + S(\gamma))^{-(2a_\sigma + n - 1)/2}, \qquad (5)$$

where $S(\gamma) = C - M^T K_\gamma^{-1} M$, with $K_\gamma = X_\gamma^T X_\gamma + \Sigma_\gamma^{-1}$, $C = (y - \bar{y}_n)^T (y - \bar{y}_n) + m_\gamma^T \Sigma_\gamma^{-1} m_\gamma$ and $M = X_\gamma^T (y - \bar{y}_n) + \Sigma_\gamma^{-1} m_\gamma$ (Brown et al., 1998). Here $\bar{y}_n$ denotes the $n \times 1$ vector whose entries all equal the sample mean of $y$.
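As an illustration, here is a minimal numerical sketch of (5) in Python, assuming $m_\gamma = 0$; the function and argument names are ours, not from the paper.

```python
import numpy as np

def log_marginal_likelihood(y, X, gamma, Sigma_inv, a_sigma=1e-3, b_sigma=1e-3):
    """Log of the marginal likelihood (5), up to an additive constant, for m_gamma = 0.

    gamma     : boolean vector of length p selecting the covariates (gamma_j = 1)
    Sigma_inv : prior precision of beta_gamma, i.e. the inverse of Sigma_gamma
    """
    n = len(y)
    yc = y - y.mean()                          # y - ybar_n
    Xg = X[:, gamma]                           # X_gamma
    K = Xg.T @ Xg + Sigma_inv                  # K_gamma = X_g^T X_g + Sigma^{-1}
    M = Xg.T @ yc                              # M, with m_gamma = 0
    S = yc @ yc - M @ np.linalg.solve(K, M)    # S(gamma) = C - M^T K^{-1} M
    _, logdet_K = np.linalg.slogdet(K)
    _, logdet_Sinv = np.linalg.slogdet(Sigma_inv)
    # |Sigma_gamma|^{-1/2} = |Sigma_gamma^{-1}|^{+1/2}
    return 0.5 * logdet_Sinv - 0.5 * logdet_K \
           - 0.5 * (2 * a_sigma + n - 1) * np.log(2 * b_sigma + S)
```

For the g-prior of the next paragraph one would pass Sigma_inv = (X_g^T X_g)/tau; for the independent prior, Sigma_inv = I/tau.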
While the mean of the prior (2) is usually set equal to zero, $m_\gamma = 0$, a neutral choice with respect to positive or negative values of the coefficients (Chipman et al., 2001; Clyde and George, 2004), the specification of the prior covariance matrix $\Sigma_\gamma$ leads to at least two different classes of priors:

• When $\Sigma_\gamma = g V_\gamma$, where $g$ is a scalar and $V_\gamma = (X_\gamma^T X_\gamma)^{-1}$, it replicates the covariance structure of the likelihood, giving rise to the so-called g-priors first proposed by Zellner (1986).

• When $\Sigma_\gamma = c V_\gamma$, but $V_\gamma = I_{p_\gamma}$, the components of $\beta_\gamma$ are conditionally independent; in contrast to g-priors, independent priors weaken the likelihood covariance structure.

In the following we will adopt the notation $\Sigma_\gamma = \tau V_\gamma$ as we want to cover both cases in a unified manner. Thus in the g-prior case $\Sigma_\gamma = \tau (X_\gamma^T X_\gamma)^{-1}$, while in the independent case $\Sigma_\gamma = \tau I_{p_\gamma}$. We will refer to $\tau$ as the variable selection coefficient, for reasons that will become clear in the next Section.

To complete the prior specification in (4), $p(\gamma)$ must be defined. A complete discussion of alternative priors on the model space can be found in Chipman (1996) and Chipman et al. (2001); for a new proposal see Scott and Berger (2006). Here we adopt the beta-binomial prior illustrated in Kohn et al. (2001)

$$p(\gamma) = \int p(\gamma \,|\, \omega)\, p(\omega)\, d\omega = \frac{B(p_\gamma + a_\omega,\, p - p_\gamma + b_\omega)}{B(a_\omega, b_\omega)} \qquad (6)$$

with $p_\gamma \equiv \gamma^T 1_p$, where the choice $p(\gamma \,|\, \omega) = \omega^{p_\gamma} (1 - \omega)^{p - p_\gamma}$ implicitly induces a binomial prior distribution over the model size and $p(\omega) = \omega^{a_\omega - 1} (1 - \omega)^{b_\omega - 1} / B(a_\omega, b_\omega)$. The hyperparameters $a_\omega$ and $b_\omega$ can be chosen once $E(p_\gamma)$ and $V(p_\gamma)$ have been elicited. From this point of view (6) offers more flexibility than the simpler binomial prior obtained by fixing $\omega$.
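The elicitation of $a_\omega$ and $b_\omega$ from $E(p_\gamma)$ and $V(p_\gamma)$ is a standard moment inversion for the beta-binomial. A sketch (helper names are ours; the inversion requires the elicited variance to exceed the binomial one, since the beta-binomial is overdispersed):

```python
import numpy as np
from scipy.special import betaln

def elicit_beta_binomial(p, mean_pg, var_pg):
    """Moment inversion for (6): recover (a_omega, b_omega) from E(p_gamma), V(p_gamma)."""
    m = mean_pg / p                                   # implied prior mean of omega
    s = (p * mean_pg * (1 - m) - var_pg) / (var_pg - mean_pg * (1 - m))
    if s <= 0:
        raise ValueError("V(p_gamma) must exceed the binomial variance E(p_gamma)(1 - E(p_gamma)/p)")
    return m * s, (1 - m) * s                         # a_omega, b_omega

def log_prior_gamma(gamma, a_om, b_om, p):
    """log p(gamma) from (6): log B(p_g + a, p - p_g + b) - log B(a, b)."""
    pg = int(np.sum(gamma))
    return betaln(pg + a_om, p - pg + b_om) - betaln(a_om, b_om)
```

For instance, elicit_beta_binomial(p=1000, mean_pg=5, var_pg=20) returns hyperparameters consistent with a prior expectation of five selected covariates out of 1,000.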
2.2 Priors for the variable selection coefficient τ
It is a known fact that g-priors have two attractive properties. Firstly, they possess an automatic scaling feature (Chipman et al., 2001; Kohn et al., 2001). In contrast to g-priors, for independent priors the effect of $V_\gamma = I_{p_\gamma}$ on the posterior depends on the relative scale of $X$, and standardisation of the design matrix to units of standard deviation is recommended (Chipman et al., 2001). However, this is not always the best procedure, for example when the distribution of $X$ is skewed or when the columns of $X$ are not defined on a common scale of measurement, a case which often occurs in practice when analysing joint data sets.
The second feature that makes g-priors particularly appealing is the rather simple structure of the marginal likelihood (5) with respect to the constant $\tau$, which becomes

$$p(y \,|\, \gamma) \propto (1 + \tau)^{-p_\gamma/2}\, (2b_\sigma + S(\gamma))^{-(2a_\sigma + n - 1)/2}, \qquad (7)$$

where, if $m_\gamma = 0$, $S(\gamma) = \mathrm{ESS}(\gamma) + \mathrm{RSS}(\gamma)/(1 + \tau)$ with

• $\mathrm{ESS}(\gamma) = (y - \bar{y}_n)^T (y - \bar{y}_n) - (y - \bar{y}_n)^T X_\gamma (X_\gamma^T X_\gamma)^{-1} X_\gamma^T (y - \bar{y}_n)$, the error sum of squares;

• $\mathrm{RSS}(\gamma) = (y - \bar{y}_n)^T X_\gamma (X_\gamma^T X_\gamma)^{-1} X_\gamma^T (y - \bar{y}_n)$, the regression sum of squares.

Given the above notation, we define $R_\gamma^2 = \mathrm{RSS}(\gamma) / \big( (y - \bar{y}_n)^T (y - \bar{y}_n) \big)$. Despite the simplicity of the marginal likelihood (7), the choice of the constant $\tau$ for g-priors is quite complex; see Fernández et al. (2001), George and Foster (2000), Cui and George (2008) and Liang et al. (2008).
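Since (7) depends on the data only through the two sums of squares, evaluating a model costs little more than one least-squares fit. A minimal sketch, assuming $m_\gamma = 0$ (names are ours, reusing the simulated X, y from above):

```python
import numpy as np

def log_marginal_g_prior(y, X, gamma, tau, a_sigma=1e-3, b_sigma=1e-3):
    """Log of (7) up to an additive constant (g-prior, m_gamma = 0)."""
    n = len(y)
    yc = y - y.mean()
    Xg = X[:, gamma]
    coef, *_ = np.linalg.lstsq(Xg, yc, rcond=None)   # least-squares fit on X_gamma
    fit = Xg @ coef
    RSS = fit @ fit                                  # regression sum of squares RSS(gamma)
    ESS = yc @ yc - RSS                              # error sum of squares ESS(gamma)
    S = ESS + RSS / (1 + tau)                        # S(gamma) under the g-prior
    pg = int(np.sum(gamma))
    return -0.5 * pg * np.log(1 + tau) \
           - 0.5 * (2 * a_sigma + n - 1) * np.log(2 * b_sigma + S)
```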
Historically, the first attempt to build a comprehensive Bayesian analysis placing a prior distribution on $\tau$ dates back to Zellner and Siow (1980), where the data adaptivity of the degree of shrinkage accommodates different scenarios better than standard fixed values. Zellner–Siow priors can be thought of as a mixture of g-priors and an inverse-gamma prior on $\tau$, $\mathrm{InvGa}(1/2, n/2)$, leading to

$$p(\beta_\gamma \,|\, \gamma, \sigma^2) \propto \int N\!\big(m_\gamma,\, \sigma^2 \tau (X_\gamma^T X_\gamma)^{-1}\big)\, p(\tau)\, d\tau \qquad (8)$$

with

$$p(\tau) = \mathrm{InvGa}(1/2, n/2). \qquad (9)$$

The joint distribution representation (4) can now be written as

$$p(y, \gamma, \alpha, \beta_\gamma, \sigma^2, \tau) = p(y \,|\, \gamma, \alpha, \beta_\gamma, \sigma^2)\, p(\alpha)\, p(\beta_\gamma \,|\, \gamma, \sigma^2, \tau)\, p(\sigma^2)\, p(\gamma)\, p(\tau), \qquad (10)$$

while the marginal likelihood has the new integral representation

$$p(y \,|\, \gamma) = \int p(y \,|\, \gamma, \tau)\, p(\tau)\, d\tau. \qquad (11)$$

Liang et al. (2008) analyse Zellner–Siow priors in detail, pointing out a variety of theoretical properties. From a computational point of view, under (8) and (9) the marginal likelihood (11) is no longer available in closed form, a form which is desirable in order to perform a stochastic search quickly (George and McCulloch, 1997; Chipman et al., 2001). Even though in the prior set-up (9) no hyperparameters need to be specified, so that no calibration is required, and a Laplace approximation can be derived (Tierney and Kadane, 1986), Zellner–Siow priors never became as popular as the simpler g-prior with a constant value suitably chosen for the coefficient $\tau$. For alternative prior specifications see also Celeux et al. (2006), Cui and George (2008) and Liang et al. (2008).
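Since (11) is one-dimensional in $\tau$, a sketch can approximate it by numerical quadrature rather than the Laplace approximation mentioned above. The code below (our own, reusing log_marginal_g_prior from the g-prior sketch) is numerically naive, as exponentiating a log marginal can underflow, but it shows the structure of the integral:

```python
import numpy as np
from scipy import integrate, stats

def log_marginal_zellner_siow(y, X, gamma):
    """Quadrature approximation of (11) with p(tau) = InvGa(1/2, n/2)."""
    n = len(y)
    prior = stats.invgamma(a=0.5, scale=n / 2)       # Zellner-Siow prior (9) on tau
    integrand = lambda t: np.exp(log_marginal_g_prior(y, X, gamma, t)) * prior.pdf(t)
    val, _ = integrate.quad(integrand, 0.0, np.inf)  # one-dimensional integral over tau
    return np.log(val)
```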
When all the variables are defined on the same scale, which often occurs in biological experiments, independent priors represent an attractive alternative to g-priors. The likelihood marginalised over $\alpha$, $\beta_\gamma$ and $\sigma^2$ has the form

$$p(y \,|\, \gamma) \propto \tau^{-p_\gamma/2}\, \big| X_\gamma^T X_\gamma + \tau^{-1} I_{p_\gamma} \big|^{-1/2}\, S(\gamma)^{-(2a_\sigma + n - 1)/2}, \qquad (12)$$

where $S(\gamma) = 2b_\sigma + (y - \bar{y}_n)^T (y - \bar{y}_n) - (y - \bar{y}_n)^T X_\gamma \big( X_\gamma^T X_\gamma + \tau^{-1} I_{p_\gamma} \big)^{-1} X_\gamma^T (y - \bar{y}_n)$. Note that (12) is computationally more demanding than (7) due to the extra determinant operator. From the above equations it is also evident that the role of the constant $\tau$ in the independent prior set-up is to regularise the quadratic form $X_\gamma^T X_\gamma$ when it is ill-conditioned.
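A matching sketch of (12), again with $m_\gamma = 0$ and our own names; compared with the g-prior version it needs the log-determinant of the regularised Gram matrix, which is the extra cost noted above:

```python
import numpy as np

def log_marginal_indep_prior(y, X, gamma, tau, a_sigma=1e-3, b_sigma=1e-3):
    """Log of (12) up to an additive constant (independent prior, m_gamma = 0)."""
    n = len(y)
    yc = y - y.mean()
    Xg = X[:, gamma]
    pg = Xg.shape[1]
    K = Xg.T @ Xg + np.eye(pg) / tau                 # tau^{-1} I regularises X_g^T X_g
    M = Xg.T @ yc
    S = 2 * b_sigma + yc @ yc - M @ np.linalg.solve(K, M)
    _, logdet_K = np.linalg.slogdet(K)
    return -0.5 * pg * np.log(tau) - 0.5 * logdet_K \
           - 0.5 * (2 * a_sigma + n - 1) * np.log(S)
```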
Different approaches have been proposed to fix the value of $\tau$. Geweke (1996) suggests fixing a different value of $\tau_j$, $j = 1, \ldots, p$, based on the idea of a “substantially significant determinant” of $\Delta X_j$ with respect to $\Delta y$. Under this formulation, (12) changes accordingly, with $\tau^{-p_\gamma/2}$ replaced by $\prod_{j: \gamma_j = 1} \tau_j^{-1/2}$ and $\tau^{-1} I_{p_\gamma}$ by $T_\gamma^{-1}$, where $T_\gamma$ is the diagonal matrix containing the coefficients attached to the selected covariates. However, it is common practice to standardise the predictor variables, taking $\tau = 1$ in order to place appropriate prior mass on reasonable values of the regression coefficients (Hans et al., 2007). A final approach consists in placing a prior distribution on $\tau$ (or on $\tau_j$, $j = 1, \ldots, p$) without standardising the predictors: such a strategy is illustrated for instance in Bae and Mallick (2004) and Sha et al. (2004).