RESEARCH Open Access
Infinite mixture-of-experts model for sparse survival regression with application to breast cancer
Sudhir Raman1*, Thomas J Fuchs2,3, Peter J Wild4, Edgar Dahl5, Joachim M Buhmann2,3, Volker Roth1
From Machine Learning in Computational Biology (MLCB) 2009
Whistler, Canada. 10-11 December 2009
Abstract
Background: We present an infinite mixture-of-experts model to find an unknown number of subgroups within a given patient cohort based on survival analysis. The effect of patient features on survival is modeled using Cox's proportional hazards model, which yields a non-standard regression component. The model is able to find key explanatory factors (chosen from main effects and higher-order interactions) for each subgroup by enforcing sparsity on the regression coefficients via the Bayesian Group-Lasso.
Results: Simulated examples justify the need for such an elaborate framework for identifying subgroups along with their key characteristics, compared with simpler models. When applied to a breast-cancer dataset consisting of survival times and protein expression levels of patients, the model identifies two distinct subgroups with different survival patterns (low-risk and high-risk) along with the respective sets of compound markers.
Conclusions: The unified framework presented here, combining elements of cluster and feature detection for survival analysis, is clearly a powerful tool for analyzing survival patterns within a patient group. The model also demonstrates the feasibility of analyzing complex interactions, which can contribute to the definition of novel prognostic compound markers.
Background
Survival analysis is a branch of statistics dealing with the analysis of time-to-failure data and is applicable to a variety of domains such as biology, engineering and economics. More generally, it is the analysis of time-to-event data, where an event could signify death, failure, etc. Particularly in the context of disease studies, it is a powerful tool for understanding the effect of patient features on survival patterns within a group. A parametric approach to such an analysis involves the estimation of the parameters of a probability density function which models time. The model is further extended by considering the effect of covariates (x) on time via a regression component. Cox's proportional hazards model, as explained in [1], is a popular model for such an effect:
\[
h(t \mid x) = h_0(t) \exp(x^T \beta), \tag{1}
\]
where h_0(t) is the baseline hazard function (the instantaneous risk of death given survival until time t), x is the vector of covariates and β is a vector of regression coefficients. In this paper, we focus on covariates which are categorical in nature, since this is a frequently encountered case in biological applications.
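As a small illustration of eq. (1), the following Python sketch evaluates the hazard for two covariate vectors and shows that their ratio does not depend on t. The constant baseline hazard, the coefficient values and the function name `hazard` are hypothetical choices made only for this example.

```python
import math

def hazard(t, x, beta, baseline_hazard):
    """Proportional hazards: h(t | x) = h0(t) * exp(x^T beta)."""
    linear_predictor = sum(xj * bj for xj, bj in zip(x, beta))
    return baseline_hazard(t) * math.exp(linear_predictor)

# Hypothetical baseline hazard and coefficients, for illustration only.
def h0(t):
    return 0.1

beta = [0.5, -1.0]

# Two patients that differ only in the first covariate.
ratio_early = hazard(2.0, [1.0, 0.0], beta, h0) / hazard(2.0, [0.0, 0.0], beta, h0)
ratio_late = hazard(9.0, [1.0, 0.0], beta, h0) / hazard(9.0, [0.0, 0.0], beta, h0)
print(ratio_early, ratio_late)  # both equal exp(0.5)
```

The hazard ratio depends only on the covariates and β, which is the proportionality assumption behind the model.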
* Correspondence: sudhir.raman@unibas.ch
1 Department of Computer Science, University of Basel, Bernoullistr. 16, CH-4056 Basel, Switzerland
Full list of author information is available at the end of the article
Raman et al. BMC Bioinformatics 2010, 11(Suppl 8):S8
http://www.biomedcentral.com/1471-2105/11/S8/S8
© 2010 Raman et al; licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

In the past, such models have been extended to a mixture model (mixture of survival experts) in order to find subgroups in data with respect to survival time along with measuring the effect of covariates within each subgroup. In this context, Rosen and Tanner [2] define a finite mixture-of-experts (MOE) model by maximizing the partial likelihood for the regression coefficients and
by using some heuristics to resolve the number of experts in the model. A more recent attempt at this analysis, carried out in [3], uses a maximum likelihood approach to infer the parameters of the model and the Akaike information criterion (AIC) to determine the number of mixture components. A Bayesian version of the mixture model has been investigated in [4], which analyzes the model with respect to time but does not capture the effect of covariates. On the other hand, the work in [5] performs variable selection based on the covariates but ignores the clustering aspect of the modeling. Similarly, [6] defines an infinite mixture model but does not include a mixture of experts, hence assuming all the covariates to be generated from the same distribution, and also assumes a common shape parameter for the Weibull distribution.
In this paper, we unify the various important elements of this analysis into a Bayesian infinite mixture-of-experts (MOE) framework to model survival time, while capturing the effect of covariates and also dealing with an unknown number of mixing components. The number of experts is inferred using a Dirichlet process prior on the mixing proportions, which overcomes the issue of deciding the number of mixture components beforehand [7]. The regression component, introduced via the proportional hazards model, is non-standard since the Weibull distribution is not part of the exponential family of distributions due to the lack of fixed-length sufficient statistics. Another novel feature of this framework is the addition of sparsity constraints to the regression coefficients β in order to determine the key explanatory factors (covariates) for each mixture component. Since the covariates are discrete in nature, each variable is transformed to a group of dummy variables and sparsity is achieved by applying a Bayesian version of the Group-Lasso (as described in [8] and [9]), which is based on a sparsity constraint for grouped coefficients [10]. We demonstrate the ability of the model to recover the right sparsity pattern with simulated examples. In a related work, [11] show sparsistency (sparse pattern consistency) of the lasso in the limit of a large number of observations. The following sections describe all the components of this unified framework, with some results on a breast-cancer dataset.
Methods
In this section, we explain the overall model in an incremental way, starting with a regression model for survival analysis and then attaching a clustering model to it. This also highlights the incremental nature of the algorithm presented for inference.
Bayesian survival regression
To begin with, we focus on defining a single cluster
model. For survival analysis, we model the distribution
of a random variable T (representing time) over the
interval [0, ∞). Further, a standard survival function is
defined based on the cumulative distribution over T as
follows:
\[
S(t_0) = 1 - p(T \le t_0) = 1 - \int_0^{t_0} p(t)\, dt, \tag{2}
\]
which models the probability of an individual surviving up to time t_0. The hazard function h(t), the instantaneous rate of failure at time t, is defined as follows:
\[
h(t) = \lim_{\Delta t \to 0} \frac{P(t < T \le t + \Delta t \mid T > t)}{\Delta t} = \frac{p(T = t)}{S(t)}. \tag{3}
\]
To model time, we choose the Weibull distribution, which is flexible enough to represent a variety of survival functions and hazard rates. Apart from flexibility, it is also the only distribution which captures both the accelerated failure time model and the proportional hazards model; see [12] for details. The Weibull distribution is defined as follows:
\[
p(t \mid \alpha_w, \lambda_w) = \alpha_w \frac{1}{\lambda_w} t^{\alpha_w - 1} \exp\left( -\frac{1}{\lambda_w} t^{\alpha_w} \right), \tag{4}
\]
where α_w and λ_w are the shape and scale parameters, respectively.

Algorithm 1 Blocked Gibbs sampling for a truncated Dirichlet process
1: Input: N observations D = {(x_i, t_i)}.
2: Initialize: c_i = random cluster assignments and parameters φ_{c_i}.
3: Draw from the posterior of the joint distribution p(π, Φ*, c) by drawing from the conditionals.
4: while not converged do
5:   Sample Φ* | π, c, D. This is carried out individually for each parameter in the model, conditioned on the rest.
6:   Sample c | Φ*, π, D. For i = 1, …, N, draw values P(c_i | π, Φ*, D) ~ P(c_i | π) P(x_i, t_i | φ_{c_i}), c_i = 1, …, M.
7:   Sample π | Φ*, c, D. The mixture proportions are drawn based on the posterior P(π | α) P(c | π).
8: end while

Based on the above definition and assuming right-censored data (see [1] for details), the likelihood can be written as:
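\[
p(\{t_i\}_{i=1}^{N} \mid \alpha_w, \lambda_w) = \prod_{i=1}^{N} \left( \alpha_w \frac{1}{\lambda_w} t_i^{\alpha_w - 1} \right)^{\delta_i} \exp\left( -\frac{1}{\lambda_w} t_i^{\alpha_w} \right), \tag{5}
\]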
where N is the number of observations and δ_i = 0 when the i-th observation is censored and 1 otherwise. Further, to model the effect of covariates x on the distribution over time, we apply Cox's proportional hazards model. Under this model, the covariates are assumed to have a multiplicative effect on the hazard function:
\[
h(t \mid x) = h_0(t) \exp(f(x, \beta)), \tag{6}
\]
where h_0(t) is the baseline hazard function, x is the vector of covariates and β is a vector of regression coefficients. In our model, we assume the function f to be a linear predictor, i.e. f(x, β) = η = x^T β. We also consider higher-order interactions (first-order: pairs of features; second-order: triplets of features; etc.) instead of modeling just the main effects (individual features). Further flexibility is added to the linear predictor by adding a random effect in the following manner:
\[
\eta = x^T \beta + \epsilon, \quad \text{where } \epsilon \sim N(0, \sigma^2). \tag{7}
\]
The likelihood is modified as follows to include the
covariate effect:
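\[
p(\{t_i\}_{i=1}^{N} \mid \{\eta_i\}_{i=1}^{N}, \alpha_w, \lambda_w) = \prod_{i=1}^{N} \left[ \alpha_w \frac{1}{\lambda_w} t_i^{\alpha_w - 1} \exp(\eta_i) \right]^{\delta_i} \exp\left( -\frac{1}{\lambda_w} \exp(\eta_i)\, t_i^{\alpha_w} \right). \tag{8}
\]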
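For reference, the logarithm of the right-censored Weibull likelihood with covariate effect, eq. (8), can be evaluated as in the following Python sketch. The function name `weibull_ph_loglik` and its argument layout are assumptions made for illustration and are not part of the original implementation.

```python
import math

def weibull_ph_loglik(times, deltas, etas, alpha_w, lambda_w):
    """Log of eq. (8) for right-censored data.

    times  : observed times t_i (> 0)
    deltas : 1 if the event was observed, 0 if the observation is censored
    etas   : linear predictors eta_i
    """
    loglik = 0.0
    for t, delta, eta in zip(times, deltas, etas):
        if delta == 1:
            # density factor, present only for uncensored observations
            loglik += (math.log(alpha_w) - math.log(lambda_w)
                       + (alpha_w - 1.0) * math.log(t) + eta)
        # survival factor, present for every observation
        loglik -= math.exp(eta) * t ** alpha_w / lambda_w
    return loglik

# One observed event and one censored observation (hypothetical values).
print(weibull_ph_loglik([1.0, 1.0], [1, 0], [0.0, 0.0], alpha_w=1.0, lambda_w=2.0))
```

Working on the log scale avoids underflow when the product in eq. (8) runs over many observations.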
We note that although most parts of the model described so far resemble an enhancement of a generalized linear model (GLM) (see [13]) called a random-intercept model, it is not strictly a GLM, since the Weibull distribution lacks fixed-length sufficient statistics and is not, in a strict sense, part of the exponential family of distributions unless the shape parameter is known. Nevertheless, for the two parameters (α_w, λ_w) it is still possible to define a joint conjugate prior [14], as explained in the subsection on priors, eq. (10). In order to provide a full Bayesian treatment of the model, we define suitable conjugate priors for the other parameters of the model, namely σ and β.
Contrast coding
In biological applications, it is very common to encounter categorical data. When the x_i's are categorical variables, a suitable coding procedure is applied to the variables (see standard textbooks such as [15]) in order to obtain the design matrix for inference. Apart from single variables (interactions of order zero), the design matrix also contains higher-order terms: first-order (pairwise) interactions and second-order (triplet) interactions. An example of an observation matrix with two variables (three categories each) and a first-order interaction, transformed using dummy coding, is shown in Fig. 1 (top). A default dummy coding procedure leads to over-parametrization (redundancy in the number of columns), and this effect becomes more pronounced with a greater number of levels and higher-order interactions. Also, in many biological applications the categorical variables have a natural ordering in the values that they take, for example intensity values. Based on these requirements, we use polynomial contrast codes, since they are suited for ordered categorical variables and avoid over-parametrization by representing a K-level variable with K−1 columns (see Fig. 1 (bottom)). This results in representing each categorical variable as a group of contrast-coded variables. Hence, to create the full design matrix, first the levels are contrast-coded (using a standard R function), which gives us the codes for the respective levels (see Fig. 1 (bottom-right)), and then each observation is recoded (for main effects and higher-order interactions) using these codes as reference.
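The contrast codes for the levels can also be reproduced outside R. The following Python sketch builds orthonormal polynomial contrasts by Gram-Schmidt orthogonalization of the columns 1, x, x², …; for three levels it yields the values 0.71, 0.41 and −0.82 that appear in Fig. 1. It assumes the level ordering Low < Med < High and that interaction columns are formed as products of the main-effect codes; the helper names are hypothetical.

```python
import math

def poly_contrasts(k):
    """Orthonormal polynomial contrasts for k ordered levels.

    Returns k rows (one per level) with k - 1 columns each
    (linear, quadratic, ...). The constant column is dropped.
    """
    xs = [float(i + 1) for i in range(k)]
    basis = []  # orthonormal columns found so far
    for degree in range(k):
        col = [x ** degree for x in xs]
        for b in basis:  # remove components along earlier columns
            proj = sum(c * bi for c, bi in zip(col, b))
            col = [c - proj * bi for c, bi in zip(col, b)]
        norm = math.sqrt(sum(c * c for c in col))
        basis.append([c / norm for c in col])
    contrasts = basis[1:]
    return [[col[i] for col in contrasts] for i in range(k)]

def interaction(code_a, code_b):
    """Columns of a first-order interaction: all pairwise products."""
    return [a * b for a in code_a for b in code_b]

codes = dict(zip(["Low", "Med", "High"], poly_contrasts(3)))
# Design-matrix row for a patient with X1 = High and X2 = Med:
row = codes["High"] + codes["Med"] + interaction(codes["High"], codes["Med"])
print([round(v, 2) for v in row])
```

With K = 3, each main effect takes 2 columns and the pairwise interaction 4 columns, compared with 3 and 9 under dummy coding.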
Priors
One of the major requirements of the model is to find the key explanatory factors from data. To achieve this goal, we need to apply sparsity constraints on the regression coefficients β to identify the key interactions. As described, the coding procedure gives rise to groups of contrast-coded variables. This transformation of data leads to the task of inferring sparsity on a group level, i.e. on grouped dummy variables, where each group represents a single variable in the original formulation. Hence, for the parameter β, we apply the general prior defined in [9], specialized to the Bayesian Group-Lasso (as defined in [8] for a Poisson model), which is suitable for sparse inference in grouped variables for the model that we have defined. The sparse prior is motivated by the classical Group-Lasso, which can be recovered in log-space by defining the prior as a product of multivariate Laplacians. Although a direct representation of the prior exists, in order to make the posterior analysis feasible (to obtain standard conditional posteriors), we redefine the prior as a two-level hierarchical model by introducing latent variables λ_g. For the Bayesian Group-Lasso, the hierarchical prior over the regression coefficients is defined as follows:
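\[
\prod_{g=1}^{G} p(\beta_g \mid \rho, \sigma^2) = \prod_{g=1}^{G} \int N(\beta_g \mid 0, \sigma^2 \lambda_g^2 I)\; \mathrm{Gamma}\left( \lambda_g^2 \,\Big|\, \frac{p_g + 1}{2}, \frac{2}{\rho} \right) d\lambda_g^2, \tag{9}
\]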
where G is the number of groups, p_g is the size of group g, ρ and σ² play the role of the Lagrange parameter in the classical Group-Lasso, and each β_g is a scale mixture of multivariate Gaussians. Based on (9), we can derive the marginal pdf of β_g analytically as a product of multivariate Laplacians (for details, see [8]).
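To make the two-level hierarchy of eq. (9) concrete, the following Python sketch draws a latent variance λ_g² and then a coefficient group β_g from it. It assumes that the Gamma distribution is parameterized by shape (p_g + 1)/2 and scale 2/ρ; the function name and all numerical values are hypothetical.

```python
import math
import random

def sample_group_prior(p_g, rho, sigma2, rng):
    """Draw (lambda_g^2, beta_g) from the hierarchical prior.

    Assumption: Gamma with shape (p_g + 1) / 2 and scale 2 / rho
    (random.gammavariate takes shape and scale, in that order).
    """
    lam2 = rng.gammavariate((p_g + 1) / 2.0, 2.0 / rho)
    sd = math.sqrt(sigma2 * lam2)
    beta_g = [rng.gauss(0.0, sd) for _ in range(p_g)]
    return lam2, beta_g

rng = random.Random(0)
draws = [sample_group_prior(p_g=2, rho=4.0, sigma2=1.0, rng=rng)
         for _ in range(20000)]
mean_lam2 = sum(lam2 for lam2, _ in draws) / len(draws)
print(mean_lam2)  # close to shape * scale = 1.5 * 0.5
```

Larger values of ρ shrink λ_g² and therefore pull the whole group β_g towards zero together, which is the group-level sparsity effect.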
A full Bayesian treatment of the model is achieved by introducing a prior on σ², based on a standard conjugate joint prior (see [16]), described as the product of a Normal distribution of β given σ² and an inverse-chi-square distribution of σ²,

\[
p(\beta, \sigma^2) = p(\sigma^2)\, p(\beta \mid \sigma^2), \qquad p(\sigma^2) = \text{Inv-}\chi^2(\sigma^2 \mid \nu_0, s_0^2),
\]

where p(β | σ²) is the Normal distribution appearing in eq. (9), and a conjugate Gamma prior on ρ. Although the Weibull distribution lacks fixed-length sufficient statistics, for the remaining two parameters (α_w, λ_w) it is possible to define a joint conjugate prior, as explained in [14]:

\[
p(\alpha_w, \lambda_w \mid a, b, c, d) \propto \alpha_w^{a - 1} \exp(-b \alpha_w)\, \lambda_w^{-c} \exp\left( -\frac{d^{\alpha_w}}{\lambda_w} \right), \tag{10}
\]
where a, b, c > 0 and d allows us to deal with the lack of fixed-length sufficient statistics.
The full model with all the variables is described in Figure 2.
Posteriors
In practice, sampling from the posterior distribution is not possible directly, hence we propose to use a Gibbs sampling strategy for stochastic integration. The sampling process further enables this procedure to be incorporated very naturally as another step in the clustering algorithm discussed in the next section. Additionally, for the lasso model, the blocked Gibbs sampler has been shown to be geometrically ergodic in [17]. Hence the convergence of the Gibbs sampler is expected to be very rapid. Multiplying the priors with the likelihood and rearranging the relevant terms yields the full conditional posteriors, which are needed in the Gibbs sampler for carrying out the stochastic integrations. The posteriors for σ, β, ρ and λ_g² are exactly as defined in [8]. The conditional posterior of η_i is difficult to sample from since it is not of standard form. However, since the conditional posterior is log-concave, we propose the use of a Laplace approximation, similar to that in [18], which approximates the conditional posterior by a Normal distribution and simplifies sampling considerably. Although alternatives exist in the form of adaptive rejection sampling, the Laplace approximation gives results that are indistinguishable while speeding up computations considerably.

For the Weibull parameters α_w and λ_w, sampling based on their individual posteriors conditioned on each other is avoided, since this results in slow mixing of the Markov chain due to a high correlation between samples from the two conditionals. To overcome this issue, the conditional posterior of (α_w, λ_w) is split up into the conditional of λ_w given α_w, which results in an Inverse-Gamma distribution,

\[
p(\lambda_w \mid \alpha_w, \bullet) \propto \text{Inverse-Gamma}\left( c + y - 1,\; d^{\alpha_w} + \sum_i t_i^{\alpha_w} \exp(\eta_i) \right), \tag{11}
\]
where y is the number of deaths (the number of data points for which δ_i = 1), and the marginal of α_w, which is derived based on the work in [14]:
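\[
p(\alpha_w \mid \bullet) \propto \frac{\alpha_w^{a + y - 1} \exp\left( -\alpha_w \left( b - \log(P_y) \right) \right)}{\left( d^{\alpha_w} + \sum_i t_i^{\alpha_w} \exp(\eta_i) \right)^{c + y - 1}}, \tag{12}
\]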
where P_y is the product of the t_i's for which δ_i = 1 and (●) represents all the unknown parameters. This marginal results in a non-standard distribution, and sampling is done via a discretized version of the same.

[Figure 1: panels "Original Design Matrix", "Dummy Coding", "Polynomial Contrast Coding" and "Transformed Design Matrix" for the levels High, Med and Low. Dummy codes shown: (1, 0, 0), (0, 1, 0), (0, 0, 1). Polynomial contrast codes shown: (0.71, 0.41), (0.00, −0.82), (−0.71, 0.41).]
Figure 1 Dummy coding illustration. On the top-left, categorical observations for 2 patients are shown for whom 2 biomarkers (X1 and X2) are measured for expression values. Each biomarker (categorical variable) can have three possible values (high, med and low). The top-right side shows the transformed covariate matrix after the dummy coding procedure has been applied. The resulting design matrix represents each variable as a group of dummy variables. Hence identifying key features from the original matrix is translated to the problem of identifying key groups of dummy variables. The bottom-right shows the transformed matrix after using a polynomial contrast coding procedure. The resulting contrast-coded matrix uses (K − 1)^(order+1) columns for an interaction, as opposed to K^(order+1) columns in a dummy-coded matrix, where K is the number of categories for a variable and order denotes the order of the interaction (zeroth, first, second, etc.).
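The two sampling steps for the Weibull parameters, a discretized draw of α_w from eq. (12) followed by a draw of λ_w from eq. (11), might look as follows in Python. The grid, the hyperparameter values, the toy data and the function names are assumptions chosen for illustration; they are not taken from the paper.

```python
import math
import random

def scale_term(alpha, times, etas, d):
    """d^alpha + sum_i t_i^alpha * exp(eta_i), shared by eqs. (11) and (12)."""
    return d ** alpha + sum(t ** alpha * math.exp(e) for t, e in zip(times, etas))

def log_marginal_alpha(alpha, times, deltas, etas, a, b, c, d):
    """Unnormalized log of eq. (12)."""
    y = sum(deltas)  # number of deaths
    log_py = sum(math.log(t) for t, dl in zip(times, deltas) if dl == 1)
    return ((a + y - 1) * math.log(alpha) - alpha * (b - log_py)
            - (c + y - 1) * math.log(scale_term(alpha, times, etas, d)))

def sample_weibull_params(times, deltas, etas, a, b, c, d, grid, rng):
    """Draw alpha_w on a grid (eq. 12), then lambda_w | alpha_w (eq. 11)."""
    logw = [log_marginal_alpha(g, times, deltas, etas, a, b, c, d) for g in grid]
    top = max(logw)
    weights = [math.exp(v - top) for v in logw]  # stabilized weights
    alpha = rng.choices(grid, weights=weights, k=1)[0]
    shape = c + sum(deltas) - 1                  # must be positive
    # If G ~ Gamma(shape, scale = 1/s), then 1/G ~ Inverse-Gamma(shape, s).
    lam = 1.0 / rng.gammavariate(shape, 1.0 / scale_term(alpha, times, etas, d))
    return alpha, lam

rng = random.Random(1)
times, deltas, etas = [1.0, 2.0, 0.5, 1.5], [1, 1, 0, 1], [0.0, 0.1, -0.2, 0.0]
grid = [0.1 * i for i in range(1, 51)]           # hypothetical grid on (0, 5]
print(sample_weibull_params(times, deltas, etas, 1.0, 1.0, 1.0, 1.0, grid, rng))
```

Drawing α_w from its marginal and λ_w from its conditional gives a joint draw of the pair, which avoids the slow mixing of alternating between the two individual conditionals.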
Infinite mixture of survival experts
Finite mixture of experts. The previous section described the inference procedure when the data is assumed to be generated from one global group. We further enhance this idea by removing this assumption and modeling data which is potentially generated from multiple (and a known number of) subgroups/clusters. In order to model the clustering in terms of the combined effects of features x and survival time t, we use an MOE model as defined in [19] (see Figure 3, left panel). It consists of a fixed number of experts, each expert explaining the distribution of time for a particular region in the covariate space. Hence the t-based clusters or mixing components, represented by experts, are probability distributions conditioned on the covariates x. The distribution of t can be written based on a standard mixture model conditioned on x as:
\[
p(t \mid x, \bullet) = \sum_{j=1}^{k} p(c_j \mid x, \bullet)\, p(t \mid x, c_j, \bullet), \tag{13}
\]
where (●) represents all the unknown parameters and the c_j's are the mixture components. The first term in eq. (13) is the gate function, which decides which expert j is best suited for making a prediction for the feature vector x. Using Bayes' rule, we can rewrite the model in the following way in order to resemble a standard mixture model, as shown in [20]:
\[
p(t \mid x, \bullet) \propto \sum_{j=1}^{k} p(c_j)\, p(x \mid c_j, \bullet)\, p(t \mid x, c_j, \bullet). \tag{14}
\]
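The relation between the gate in eq. (13) and the factorization in eq. (14) can be sketched in Python: the gate probabilities are the normalized products p(c_j) p(x | c_j). The sketch assumes an isotropic Normal density for p(x | c_j); the function names and the example values are hypothetical.

```python
import math

def normal_logpdf_iso(x, mean, var):
    """Log density of N(x | mean, var * I) for a list-valued x."""
    dim = len(x)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (dim * math.log(2.0 * math.pi * var) + sq / var)

def gate_probabilities(x, weights, means, variances):
    """p(c_j | x), proportional to p(c_j) * p(x | c_j), normalized over j."""
    logs = [math.log(w) + normal_logpdf_iso(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    unnorm = [math.exp(v - top) for v in logs]  # subtract max for stability
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two experts with hypothetical covariate means; x lies closer to the first.
print(gate_probabilities([0.2, 0.1], [0.5, 0.5],
                         [[0.0, 0.0], [2.0, 2.0]], [1.0, 1.0]))
```

Modeling x within each component in this way is what lets every mixture component be treated as a joint distribution over (x, t).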
This representation allows us to visualize each mixture component as a joint distribution over (x, t). The distribution over x is modeled as a Normal distribution N(x | μ_c, σ_c² I), as shown in Figure 2. The standard joint conjugate prior of Normal-Inv-χ² is applied to the parameters (μ_c, σ_c²). The posterior conditionals are also of standard form and hence can be easily incorporated into the Gibbs sampling scheme introduced in the previous section. To complete the Bayesian picture, we need to apply a suitable prior to the mixing proportions. In a finite MOE model, a Dirichlet distribution is a standard conjugate prior for the mixing proportions. All other parameters and priors, based on the modeling of (x, t), follow from the previous section.

Figure 2 Model description with all the parameters involved for a single cluster. The complete hierarchical model with the parametrization for a single-cluster model. Depicted in blue are the hyperparameters for the respective distributions, like (r, s) for the Gamma prior on ρ. The observed variables x, denoting the covariates, and t, denoting time, are shown in green. The part of the figure centered around t forms the core, which defines the generalized linear model with a Normal random link between η and the covariates and coefficients, and priors for the Weibull distribution. The block on the right defines the hierarchy related to the sparse regression on the covariates via the hierarchical representation of the Normal-Gamma prior on the regression coefficients β. Furthermore, the left block defines the variables for describing the distribution of the covariate space.

Infinite mixture of experts. The above model was described for the case when the underlying number of clusters is fixed/known. We now add the final enhancement to our model by removing this limiting assumption as well. The model is extended to an infinite mixture-of-experts by replacing the finite number of clusters by an infinite number, and hence replacing the Dirichlet distribution by a Dirichlet process (DP) as prior for the mixing proportions, similar to [20]. The Dirichlet process is a distribution on distributions, i.e. a particular sample from a DP is also a probability distribution from which samples can be drawn. The draws from a DP are discrete, which makes it a useful prior for clustering purposes. In this manner, the effective number of clusters can be inferred from data by carrying out MCMC sampling from the posterior distribution. This model extension is described in a hierarchical manner as follows (see Figure 3):
\[
\begin{aligned}
(x_i, t_i) \mid c_i, \phi &\sim F(\phi_{c_i}), \\
c_i &\sim G, \\
\phi_c &\sim G_0, \\
G &\sim DP(\alpha, G_0),
\end{aligned} \tag{15}
\]
where DP denotes a Dirichlet process prior with base distribution G_0 and concentration parameter α, c_i is the latent class to which an observation (x_i, t_i) belongs, and φ_c denotes the parameters which determine the distribution of class c. Further hierarchy is added to φ_c (parameters) by adding suitable priors as defined in Section 2.
Markov Chain Monte Carlo (MCMC) sampling for Inference and Parameter Estimation. The inference of the infinite mixture-of-experts model is carried out by MCMC sampling of the posterior distribution. Although there exist non-conjugate versions of the Dirichlet process algorithms (as given in [21]) which can be applied for inference, for practical reasons we use a truncated version of the Dirichlet process called the Dirichlet-Multinomial allocation model [22], specifying an upper bound on the maximum number of clusters based on prior knowledge of the particular application. It serves as a good approximation to the DP measure and results in a finite-sum random probability measure which is computationally easy to deal with and easy to implement. More specifically, we carry out blocked Gibbs sampling on a truncated Dirichlet process (see Algorithm 1 for details). After initializing all the parameters, the sampling algorithm is executed until convergence. The point of convergence can be determined based on the length-control diagnosis explained in [23], or the run can be fixed to a maximum number of iterations based on studying the trace plots of the sampling process in simulations.
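Steps 6 and 7 of Algorithm 1 can be sketched in Python as follows. The sketch assumes the symmetric Dirichlet(α/M, …, α/M) prior commonly used in Dirichlet-Multinomial allocation, with M the truncation level, and it takes the per-cluster log-likelihoods log P(x_i, t_i | φ_k) as a precomputed table; both function names and the toy values are hypothetical.

```python
import math
import random

def sample_proportions(assignments, n_clusters, alpha, rng):
    """pi | c ~ Dirichlet(alpha/M + n_1, ..., alpha/M + n_M)."""
    counts = [0] * n_clusters
    for c in assignments:
        counts[c] += 1
    # A Dirichlet draw is a vector of normalized Gamma draws.
    gammas = [rng.gammavariate(alpha / n_clusters + n, 1.0) for n in counts]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_assignments(loglik, proportions, rng):
    """Draw c_i with P(c_i = k) proportional to pi_k * P(x_i, t_i | phi_k).

    loglik[i][k] holds log P(x_i, t_i | phi_k).
    """
    new_c = []
    for row in loglik:
        logp = [math.log(max(p, 1e-300)) + l for p, l in zip(proportions, row)]
        top = max(logp)
        weights = [math.exp(v - top) for v in logp]
        new_c.append(rng.choices(range(len(weights)), weights=weights, k=1)[0])
    return new_c

rng = random.Random(0)
c = [0, 0, 1, 0]                                  # current assignments, M = 3
pi = sample_proportions(c, n_clusters=3, alpha=1.0, rng=rng)
loglik = [[-1.0, -2.0, -5.0]] * 4                 # toy log-likelihood table
c = sample_assignments(loglik, pi, rng)
print(pi, c)
```

Because the truncation fixes M, both steps are draws from standard distributions, and clusters that receive no observations simply remain empty.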
Results and discussion
Simulations. In order to demonstrate the effectiveness of the model, experiments were carried out on simulated data. The first experiment shows the capability of the model to correctly identify two subgroups in data along with identifying the key explanatory factors in both groups. The dataset of size 150 was generated from two equally proportioned clusters with (5, 5) and (1, 1) being the shape and scale parameters of the Weibull distribution for each cluster. The features consisted of 7 variables with expansion up to second-order interactions (63 terms). For the first cluster, the significant factors included the main effects X1, X3 and X4, all first-order interactions among these three variables, i.e. (X1 : X3), (X1 : X4), (X3 : X4), and a second-order interaction (X1 : X3 : X4). Similarly, for the second cluster, the significant factors included the main effects X2, X6 and X7, all first-order interactions among these three variables, i.e. (X2 : X6), (X2 : X7), (X6 : X7), and a second-order interaction (X2 : X6 : X7).
Figure 3 Infinite mixture-of-experts model. Left panel: Mixture-of-experts model for two experts with a gating node representing the function that decides which of the two experts is chosen to make a prediction for x, which is represented by p(c_j | x, ●) in eq. (13). Right panel: Infinite mixture of experts using a Dirichlet process prior G with parameters (α, G_0). N denotes the number of observations and c_i the respective assignment variables. The observed variables x and t are represented in green with the priors collapsed to φ_{c_i}. In the full model, the φ_{c_i} part will be replaced by Figure 2.

Significance was achieved by assigning β values of (3, 3, 3, 3, 3, 3, 3) and (3, 3, 3, 3, 3, 3, 3) to the specific factors in the respective clusters and setting the rest of the β coefficients to zero. The covariates themselves were sampled from a Normal distribution with means (0.3, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3) and (0.7, 0.7, 0.7, 0.7, 0.7, 0.7, 0.7) for each cluster respectively. The Gibbs sampling process was executed for 50,000 iterations and the burn-in was observed to be very early (in the first ≈ 100 iterations). Both clusters were detected and all the true significant factors for both clusters were identified successfully. See Figure 4 for details.
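Survival times for such a simulation can be generated by inverting the survival function implied by eqs. (4) and (6), S(t) = exp(−exp(η) t^α_w / λ_w). The following Python sketch is one way to do this; it is not the authors' simulation code, the function name is hypothetical, and the linear predictor is set to zero purely for illustration.

```python
import math
import random

def sample_survival_time(eta, alpha_w, lambda_w, rng):
    """Inverse-CDF draw from S(t) = exp(-exp(eta) * t**alpha_w / lambda_w)."""
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    return (-lambda_w * math.log(u) * math.exp(-eta)) ** (1.0 / alpha_w)

rng = random.Random(0)
# (shape, scale) pairs of the two clusters described in the text.
cluster_params = [(5.0, 5.0), (1.0, 1.0)]
for alpha_w, lambda_w in cluster_params:
    sample = [sample_survival_time(0.0, alpha_w, lambda_w, rng) for _ in range(75)]
    print(alpha_w, lambda_w, sum(sample) / len(sample))
```

A positive η shortens the sampled times and a negative η lengthens them, matching the multiplicative effect of the covariates on the hazard.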
In the second experiment, we compare our mixture-of-experts
model to a global single-cluster model in order to justify
the need for a mixture model. The training data generated in
the first experiment was used again for learning the
parameters of a single-cluster model. To compare the two
models, a separate test set (of size 500) was generated, and
both models were evaluated by comparing the log-likelihood
of every test point under the parameters learned by each
model. The per-point comparison is shown in Figure 5, which
indicates the improvement achieved by using an MOE model. We
also performed a standard Kruskal-Wallis rank test, which
likewise ranks the MOE model higher than the single-cluster
model (see Figure 5, left panel). Apart from the quantitative
evaluation, the single-cluster model also does poorly at
identifying the significant factors (see Figure 5, right
panel), both in recognizing the true factors and in terms of
false positives. This can be explained by the fact that a
single-cluster model has to assume a common baseline model
for both clusters. To adjust for the real survival patterns,
it can then only compensate by making suitable adjustments to
the regression component, and in doing so it compromises the
identification of significant factors from the data. As a
result, the MOE model performs much better than a one-cluster
model, which justifies the need for a cluster-based model.
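The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names are hypothetical, the per-point log-likelihoods are assumed to have been computed already for each model, and the Kruskal-Wallis statistic is written out by hand for the two-sample case (midranks for ties, no tie correction), with the p-value taken from the chi-square distribution with one degree of freedom.

```python
import math


def win_counts(ll_a, ll_b):
    """Count test points on which model A, respectively model B, scores higher."""
    wins_a = sum(a > b for a, b in zip(ll_a, ll_b))
    wins_b = sum(b > a for a, b in zip(ll_a, ll_b))
    return wins_a, wins_b


def kruskal_wallis_two_groups(x, y):
    """Kruskal-Wallis H and p-value for two samples (assumption: no tie correction)."""
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    i = 0
    while i < n:
        # Find the block of equal values starting at position i.
        j = i
        while j + 1 < n and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            rank_sums[pooled[k][1]] += avg_rank
        i = j + 1
    h = (12.0 / (n * (n + 1))
         * (rank_sums[0] ** 2 / len(x) + rank_sums[1] ** 2 / len(y))
         - 3 * (n + 1))
    # Survival function of a chi-square variable with 1 degree of freedom.
    p = math.erfc(math.sqrt(max(h, 0.0) / 2.0))
    return h, p


# Toy usage with made-up log-likelihood values (not data from the paper).
ll_moe = [-1.0, -0.8, -1.2, -0.9]
ll_single = [-2.1, -1.9, -0.7, -2.4]
print(win_counts(ll_moe, ll_single))
print(kruskal_wallis_two_groups(ll_moe, ll_single))
```

The win counts correspond to the kind of tally reported in the left panel of Figure 5, and the rank test asks whether one set of scores is systematically higher than the other without assuming a particular distribution for the log-likelihoods.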
Application to Breast-Cancer dataset. The dataset
consists of measured intensity levels obtained from
tissue microarrays of the following markers:
karyopherin-alpha-2 (KPNA2), nuclear staining for p53, the
anti-cytokeratin CK5/6, the fibrous structural protein
Collagen-VI, the inter-alpha-trypsin inhibitor ITIH5, the
estrogen receptor (ER) and the human epidermal growth
factor receptor HER2. From these categorical variables
we constructed covariates arranged in a design matrix
which includes all dummy-coded interactions up to the
second order.
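To make the size of this design matrix concrete, the following sketch enumerates the factor groups for seven markers, following the paper's convention that first-order interactions are pairs and second-order interactions are triplets. The number of levels per marker is not stated in this section, so the `levels` dictionary is a purely hypothetical input, and the function names are illustrative.

```python
from itertools import combinations

MARKERS = ["KPNA2", "p53", "CK5/6", "Collagen-VI", "ITIH5", "ER", "HER2"]


def interaction_groups(factors, max_order=2):
    """Main effects plus all interactions up to max_order.

    An interaction of order k involves k + 1 factors, so max_order=2
    yields single factors, pairs and triplets.
    """
    groups = []
    for size in range(1, max_order + 2):
        groups.extend(combinations(factors, size))
    return groups


def dummy_columns(group, levels):
    """Number of dummy-coded columns for one group: product of (levels - 1)."""
    n = 1
    for factor in group:
        n *= levels[factor] - 1
    return n


groups = interaction_groups(MARKERS)
# Hypothetical assumption: every marker is scored on three levels.
levels = {m: 3 for m in MARKERS}
total_columns = sum(dummy_columns(g, levels) for g in groups)
print(len(groups), total_columns)
```

Each group of columns is the unit on which a group-sparsity prior would act, which is why the number of groups (rather than the number of columns) determines how many candidate factors have to be screened.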
Cross-validation experiments were conducted for both
the MOE and the single-cluster model; they showed
similar trends but with unclear significance. Despite
the fact that this dataset is one of the biggest of its kind,
the rather low number of samples (270 patients)
remains the main challenge in these scenarios. A further
difficulty is the large number of censored patients (60%),
which is a common problem in long-term retrospective
studies.
Over a wide range of prior values, the Dirichlet process
mixture model for selecting “survival experts” finds
two large and highly stable clusters. In order to externally
validate these clusters, we analyze the survival of
the underlying patient populations by way of classical
Kaplan-Meier plots, see Figure 6. The survival experiences
of patients belonging to the two clusters differ markedly,
with cluster 1 containing essentially all patients who die
early. In Figure 7, the interaction patterns within the two
clusters are shown as lines connecting pairs or triplets of
markers, where the line width encodes the significance in
terms of posterior quantiles which do not contain zero.
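For readers unfamiliar with the Kaplan-Meier estimator used for this validation, a minimal sketch is given below. It is a plain product-limit implementation under the usual convention that a censored patient stays in the risk set up to and including the censoring time; the function name and the toy data are illustrative and are not taken from the study.

```python
def kaplan_meier(times, events):
    """Return [(t, S(t))] at each distinct time with at least one observed death.

    times  : observed follow-up times
    events : 1 if death was observed at that time, 0 if the patient was censored
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0
        removed = 0
        # Process all patients sharing the same follow-up time together.
        while i < len(data) and data[i][0] == t:
            deaths += data[i][1]
            removed += 1
            i += 1
        if deaths > 0:
            surv *= 1.0 - deaths / n_at_risk
            curve.append((t, surv))
        n_at_risk -= removed
    return curve


# Toy usage: five patients, two of them censored.
print(kaplan_meier([1, 2, 3, 4, 5], [1, 0, 1, 0, 1]))
```

Computing one such curve per cluster and plotting them side by side gives the kind of comparison shown in Figure 6. A cluster in which almost all patients are censored yields a curve that stays close to 1.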
The high-risk patient cluster is characterized by a global
underexpression of ER and overexpression of basically
all other markers, in particular KPNA2, CK5/6 and
HER2. Overexpression of the latter two markers clearly
Figure 4 Results for simulated data: 2 clusters with 7 categorical variables having interaction terms up to second order. In all interaction
graphs, the light-blue circles represent the main effects, the blue lines represent 1st-order pairs and the reddish triangular lines indicate
2nd-order triplet interactions. In each case, the size of the circle or the width of the lines indicates the estimated significance of the main effect or
the higher-order interaction. For example, in the right cluster, more than 90% of the posterior samples for variable 2 have a positive sign.
Based on the results of the inference process, we observe that all the key features have been correctly identified.
Figure 6 Kaplan-Meier plots for the identified subgroups. Kaplan-Meier plots for the high-risk group (left) and the low-risk group (right).
The high-risk group contains a large number of patients who die early.
Figure 5 Comparison to a global model. Left: The actual number of points in the test set which scored better in a particular model (442 for
MOE vs. 58 for Single Cluster) based on the likelihood scores. Results of the Kruskal-Wallis rank test also validated this observation with a p-value
≪ 0.001. Right: Results of the key interactions found for a single-cluster model. Some of the key factors are not identified, and there are
many false positives.
Figure 7 Breast Cancer results - key interaction patterns for the identified subgroups. Identified interaction patterns for the high-risk
group (left) and the low-risk group (right). The size of the circles indicates the estimated significance of the main effects. For instance, the
largest circle for ER means that the 0.9 posterior quantile does not contain zero. Correspondingly, the line width of the interactions (blue lines:
1st-order, reddish triangles: 2nd-order) indicates their significance.
identifies this cluster as a collection of basal and
HER2-type breast-cancer patients. The occurrence of
KPNA2 in the high-risk group is also in accordance
with previous studies: KPNA2 is a member of the
karyopherin (importin) family, which is part of the nuclear
transport protein complex. KPNA2 overexpression has
been shown in several gene expression signatures in
breast cancer and other cancer types, and it has
previously been identified as a possible
prognostic marker in breast cancer [24].
The Group-Lasso detects several strong higher-order
interactions. Interpreting these interaction terms can be
a complex problem, but a close analysis of the contrast
codes and the sign of the regression coefficients shows
that the poor prognosis of members of this cluster is
dominated by some of the combinations. Details are given in
Table 1, where ↘ means underexpression and ↗ overexpression.
The observation that high-order interaction terms
seem to be even more indicative than the individual main
effects is a highly interesting result of this study which
may lead to the definition of novel prognostic markers
for better differentiation between high-risk patients.
Together with our medical partners we are currently
testing these new hypothetical compound markers.
The low-risk cluster has a clear luminal-type signature
(strong ER response). Hardly any significant patterns
can be identified, which, however, is quite understandable
given that the survival curve is almost flat for
these patients: in the proportional hazards model the
individual covariates influence the “passage of time”, and
a flat curve basically means that there is almost no
intra-class variation that could be explained by individual
covariate effects.
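This argument can be made explicit with the generic proportional hazards form. The symbols below (baseline hazard h_0, baseline survival S_0, cumulative baseline hazard H_0) are standard textbook notation assumed here for illustration and may differ from the notation used in the model section.

```latex
h(t \mid x) = h_0(t)\,\exp(x^\top \beta),
\qquad
S(t \mid x) = \exp\!\Big(-\int_0^t h(u \mid x)\,du\Big) = S_0(t)^{\exp(x^\top \beta)},
\qquad
1 - S(t \mid x) \approx \exp(x^\top \beta)\,H_0(t) \quad \text{when } H_0(t) = -\log S_0(t) \text{ is small.}
```

When the baseline survival curve is almost flat, H_0(t) is close to zero over the follow-up period, so S(t | x) stays close to 1 for any moderate value of the linear predictor. The observed survival times then carry very little information about β.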
Conclusions
We have introduced a fully Bayesian survival infinite
mixture-of-experts model which extends classical
approaches by including feature selection for
contrast-coded categorical variables. Random links and a
mixture-of-experts architecture allow for both stochastic
and model-driven deviations from the underlying
parametric survival model. The inherent clustering property
of the final model makes it possible to identify patient
groups which are homogeneous with respect to the
predictive power of their covariates for the observed
survival times. The built-in Bayesian feature selection
mechanism reveals cluster-specific explanatory factors
and interactions. Due to the Bayesian treatment within a
suitably expanded model, posterior samples can be
generated efficiently, which makes it possible to assess the
statistical significance based on a very large number of
draws.
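As an illustration of how significance can be read off posterior draws, the sketch below computes two simple summaries for a single coefficient: the fraction of draws sharing the majority sign, and whether a central posterior interval excludes zero. Both are common conventions and are assumptions here; the exact criterion used for the figures is the one described in their captions. The function names and the toy draws are hypothetical.

```python
import math


def sign_probability(samples):
    """Fraction of posterior draws that share the majority sign."""
    pos = sum(s > 0 for s in samples)
    neg = sum(s < 0 for s in samples)
    return max(pos, neg) / len(samples)


def interval_excludes_zero(samples, level=0.9):
    """True if the central `level` posterior interval does not contain zero.

    Uses simple order statistics, rounding outwards, so the check is
    slightly conservative for small numbers of draws.
    """
    ordered = sorted(samples)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lo = ordered[int(math.floor(tail * (n - 1)))]
    hi = ordered[int(math.ceil((1.0 - tail) * (n - 1)))]
    return lo > 0 or hi < 0


# Toy usage with made-up draws for one regression coefficient.
draws = [-1.0] + [1.0] * 99
print(sign_probability(draws), interval_excludes_zero(draws))
```

Because these summaries are simple counts and order statistics, their Monte Carlo error shrinks as the number of posterior draws grows, which is the practical benefit of being able to sample efficiently.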
Applied to survival data from a breast cancer study,
the model identified two stable patient clusters that
show a clear distinction in terms of survival probability.
Several strong high-order interactions between marker
proteins were detected which carry more information
about the survival targets than the markers themselves.
Not only does this result confirm earlier studies, it also
shows that the analysis of complex interactions is
feasible and may lead to the definition of novel prognostic
markers. We are currently conducting new experiments
to test these new hypothetical compound markers.
Authors' contributions
SR, TJF, JMB and VR have contributed toward designing the model and
drafting the manuscript. PJW and ED are domain experts in pathology and
molecular biology and have contributed with respect to conducting
biological experiments, generating the required samples and analyzing
the results, i.e. estimating the protein expression on the
immunohistochemically stained slides. All authors read and approved the final
manuscript.
List of abbreviations
AIC: Akaike information criterion; MOE: Mixture of experts; GLM: Generalized
linear model; MCMC: Markov chain Monte Carlo; DP: Dirichlet Process
Competing Interests
The authors declare that they have no competing interests.
Acknowledgements
The work was supported by a grant from the Swiss SystemsX.ch Initiative
(Swiss National Science Foundation) to the project “LiverX” (Competence
Center for Systems Physiology and Metabolic Diseases). We also
acknowledge financial support from the FET programme within the EU FP7,
under the SIMBAD project (Contract 213250).
Author details
¹Department of Computer Science, University of Basel, Bernoullistr. 16,
CH-4056 Basel, Switzerland. ²Department of Computer Science, ETH Zurich,
Universitaetstrasse 6, CH-8092 Zurich, Switzerland. ³Competence Center for
Systems Physiology and Metabolic Diseases, Schafmattstr. 18, CH-8093
Zurich, Switzerland. ⁴Institute of Pathology, University Hospital Zurich,
Schmelzbergstrasse 12, CH-8091 Zurich, Switzerland. ⁵Institute of Pathology,
University Hospital Aachen, Pauwelsstrasse 30, 52074 Aachen, Germany.
Published: 26 October 2010
References
1. Klein JP, Moeschberger ML: Survival Analysis: Techniques for Censored and
Truncated Data. Springer-Verlag: New York Inc 1997.
2. Rosen O, Tanner M: Mixtures of Proportional Hazards Regression models.
Statistics in Medicine 1999, 18:1119-1131.
Table 1 Interpretation of interaction terms
↘
ER
ER
↘
CK5/6
↘
KPNA2
↘
p53
↘
CollagenVI
↘
ITIH5
↗
HER2
↗
ER
↘
CollagenVI
↘
HER2
↗
ER
↘
KPNA2
↘
ITIH5
↗
ER
↘
p53
↗
CK5/6
↘
ER
↘
KPNA2
↘
CollagenVI
↘
3. Ando T, Imoto S, Miyano S: Kernel Mixture Survival Models for Identifying
Cancer Subtypes, Predicting Patient’s Cancer Types and Survival
Probabilities. Genome Informatics 2004, 15(2):201-210.
4. Kottas A: Nonparametric Bayesian Survival Analysis using Mixtures of
Weibull distributions. Journal of Statistical Planning and Inference 2006,
136(3):578-596.
5. Ibrahim JG, Chen MH, Maceachern SN: Bayesian Variable Selection for
Proportional Hazards Models. The Canadian Journal of Statistics 1999,
27(4):701-717.
6. Paserman MD: Bayesian Inference for Duration Data with Unobserved
and Unknown Heterogeneity: Monte Carlo Evidence and an Application.
IZA Discussion Papers 996, Institute for the Study of Labor (IZA) 2004.
7. Rasmussen CE, Ghahramani Z: Infinite Mixtures of Gaussian Process
Experts. Advances in Neural Information Processing Systems 14 MIT Press
2002, 881-888.
8. Raman S, Fuchs T, Wild P, Dahl E, Roth V: The Bayesian Group-Lasso for
Analyzing Contingency Tables. Proceedings of the 26th International
Conference on Machine Learning Omnipress 2009, 881-888.
9. Raman S, Roth V: Sparse Bayesian Regression for Grouped Variables in
Generalized Linear Models. Proceedings of the 31st DAGM Symposium on
Pattern Recognition Springer-Verlag 2009, 242-251.
10. Yuan M, Lin Y: Model Selection and Estimation in Regression with
Grouped Variables. J. Roy. Stat. Soc. B 2006, 49-67.
11. Ravikumar P, Liu H, Lafferty J, Wasserman L: SpAM: Sparse additive models.
Advances in Neural Information Processing Systems 20 MIT Press 2007.
12. Ibrahim JG, Chen MH, Sinha D: Bayesian Survival Analysis. Springer-Verlag:
New York Inc 2001.
13. McCullagh P, Nelder J: Generalized Linear Models. Chapman & Hall 1983.
14. Fink D: A Compendium of Conjugate Priors. In progress report:
Extension and enhancement of methods for setting data quality
objectives. Technical Report 1995.
15. Everitt B: The Analysis of Contingency Tables. Chapman & Hall 1997.
16. Gelman A, Carlin J, Stern H, Rubin D: Bayesian Data Analysis. Chapman & Hall
1995.
17. Kyung M, Gill J, Ghosh M, Casella G: Penalized Regression, Standard Errors
and Bayesian Lassos. Bayesian Analysis 2010, 5(2):369-412.
18. Green P, Park T: Bayesian Methods for Contingency Tables using Gibbs
Sampling. Statistical Papers 2004, 45:33-50.
19. Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE: Adaptive Mixtures of Local
Experts. Neural Computation 1991, 3:79-87.
20. Kim S, Smyth P, Stern H: A Nonparametric Bayesian Approach to
Detecting Spatial Activation Patterns in fMRI Data. In Proceedings of the
9th International Conference on Medical Image Computing and Computer
Assisted Intervention 2006, 217-224.
21. Neal RM: Markov Chain Sampling Methods for Dirichlet Process Mixture
Models. Journal of Computational and Graphical Statistics 2000, 9:249-265.
22. Ishwaran H, Zarepour M: Exact and Approximate Sum Representations for
the Dirichlet process. The Canadian Journal of Statistics 2002, 30:269-283.
23. Raftery A, Lewis S: One long run with diagnostics: Implementation
strategies for Markov chain Monte Carlo. Statistical Science 1992,
7:493-497.
24. Dahl E, Kristiansen G, Gottlob K, Klaman I, Ebner E, Hinzmann B, Hermann K,
Pilarsky C, Dürst M, Klinkhammer-Schalke M, Blaszyk H, Knuechel R,
Hartmann A, Rosenthal A, Wild PJ: Molecular Profiling of Laser
Microdissected Matched Tumor and Normal Breast Tissue Identifies
Karyopherin α2 as a Potential Novel Prognostic Marker in Breast Cancer.
Clinical Cancer Research 2006, 12:3950-60.
doi:10.1186/1471-2105-11-S8-S8
Cite this article as: Raman et al.: Infinite mixture-of-experts model for
sparse survival regression with application to breast cancer. BMC
Bioinformatics 2010 11(Suppl 8):S8.