Assessing the Validity Domains of Graphical Gaussian Models
in order to Infer Relationships among Components
of Complex Biological Systems

by

Fanny Villers, Brigitte Schaeffer, Caroline Bertin and Sylvie Huet

Research Report No. 16
July 2008

Statistics for Systems Biology Group
Jouy-en-Josas/Paris/Evry, France
http://genome.jouy.inra.fr/ssb/
Fanny Villers (1), Brigitte Schaeffer (2), Caroline Bertin (3), Sylvie Huet (4)

Unité Mathématiques et Informatique Appliquées
INRA
Domaine de Vilvert
F-78352 Jouy-en-Josas Cedex

(4) sylvie.huet@jouy.inra.fr
Abstract. The study of the interactions of cellular components is an essential first step towards understanding the structure and dynamics of biological networks, and various methods were recently developed for this purpose. While most of them combine different types of data and a priori knowledge, methods based on graphical Gaussian models are capable of learning the network directly from raw data. They consider the full-order partial correlations, which are partial correlations between two variables given all the remaining ones, for modelling direct links between variables. Statistical methods were developed for estimating these links when the number of observations is larger than the number of variables. However, the rapid advance of new technologies that allow the simultaneous measurement of genome expression has led to large-scale datasets where the number of variables is far larger than the number of observations. To get around this dimensionality problem, different strategies and new statistical methods were proposed. In this study we focus on recently published statistical methods. All are based on the assumption that the number of direct relationships between two variables is very small compared with the number of possible relationships, p(p-1)/2. In the biological context, this assumption is not always satisfied over the whole graph, so it is essential to know precisely how the methods behave with respect to the characteristics of the studied object before applying them. For this purpose, we evaluated the validity domain of each method on wide-ranging simulated datasets. We then illustrate our results using recently published biological data.
Subjclass. 62H12
Keywords. Graphical Gaussian Model, Estimation, Simulation
1. Introduction
Biological systems involve complex cellular processes built up from physical and functional interactions between molecular entities (genes, proteins, small molecules, ...). Thus, to understand how these processes are regulated, it is necessary to study the behavior of the molecular machinery. Recently, biotechnological developments have focused on the characterization and quantification of cellular system components, producing a huge amount of varied data. One of the major challenges nowadays is therefore to understand from these data how molecular entities interact, i.e. what the functional links are, in the context of a whole system. To this end, several mathematical and computational approaches are being developed. Some methods based on correlations or clustering can reveal proximities between variables but do not bring to light the direct or functional links. Other methods, such as kernel-based methods (Okamoto et al., 2007; Yellaboina et al., 2007), imply a learning phase and thus need a training data set. Bayesian approaches are also used to infer relations between biological entities in order to understand the regulatory mechanisms of living cells (Husmeier, 2003; Werhli & Husmeier, 2007). However, these methods have to deal with the prior probability, which has a non-negligible influence on the posterior probability when data are sparse and noisy.
A valuable complement to all of these methods is graphical Gaussian modeling (Kishino & Waddell, 2000; Dobra et al., 2004; Wu & Ye, 2006), which can infer direct relations between variables from a set of repeated observations of these variables without any a priori knowledge. Graphical modeling is the use of a graph to represent a model. A graph is a set of nodes and edges, which can be represented as a drawing for visual study or as a matrix for computer processing. Graphical modeling is based on the concept of conditional independence: a direct relation between two variables exists if those two variables are conditionally dependent given all remaining variables. In the Gaussian setting, a direct relation between two variables corresponds to a non-zero entry in the partial correlation matrix. As the partial correlation matrix is related to the inverse of the covariance matrix, a direct relation between two variables also corresponds to a non-zero entry in the inverse of the covariance matrix.
Graphical models are classically used when the number of observations, denoted n, is larger than the number of variables, denoted p. This is generally the case in financial or sociological studies, where surveys concern few variables and many observations. But it is not the case in the post-genomic context, where each experiment is costly in time and money. The number of repetitions is therefore limited; moreover, each experiment generates numerous data. The data set structure, p ≫ n, then does not match the assumptions of the classical graphical modeling approach, and the empirical covariance matrix cannot be inverted. Over the last years, mathematical and computational research was carried out to circumvent this dimensionality problem, and various methods were proposed. Most of them are based on the fact that the number of direct relations between two variables is very small compared with the number of possible relations, p(p−1)/2.
The purpose of our study is to determine the validity domain of some of these recently proposed methods. The reason for this work is to give biologists hints for choosing the most appropriate methods. Indeed, biologists are very interested in inferring biological networks, but they generally have a small number of repetitions, on the order of ten.
The core of this document is divided into three parts. The first one describes the statistical methodology involved in the approaches of Schäfer & Strimmer (2005a), Schäfer & Strimmer (2005b), Wille & Bühlmann (2006), Meinshausen & Bühlmann (2006), Friedman et al. (2007), Kalisch & Bühlmann (2007) and Giraud (2008). The second part presents simulations carried out with each of these methods under different conditions of dataset structure. The third part illustrates the interest of graphical Gaussian modeling with an application to the flow cytometry data produced by Sachs et al. (2005). In the conclusion we discuss the performances of each method and give some recommendations according to their validity domains.
2. STATISTICAL METHODS
Let Γ = {1, . . . , p} be the set of nodes of the graph. The p nodes of the graph are identified with p Gaussian random variables. Let us denote by X = (X_1, . . . , X_p)^T a random vector of dimension p distributed as a multivariate Gaussian N(0, Σ). For m a subset of {1, . . . , p} with cardinality |m|, we denote by X_m the random vector of dimension |m| whose components are the variables X_c, where c ∈ m. Moreover, we denote by Γ_{−m} the set of nodes that are not in m, Γ_{−m} = Γ \ m, and by X_{−m} the random vector of dimension p − |m| whose components are the variables X_c, where c ∈ Γ_{−m}. There exists an edge between nodes a and b if and only if the random variables X_a and X_b are not independent conditionally on X_{−{a,b}}. In other words, assuming that the matrix Σ is nonsingular, there exists an edge between nodes a and b if and only if the component (a, b) of the concentration matrix K = Σ^{−1} is non-zero. These graphs are called concentration graphs or full conditional independence graphs. For each node a, the set of neighbors of a is defined as the set of nodes in Γ_{−{a}} that are connected with a. Finally, let us denote by E the set of edges of the graph.
The statistical challenge is to detect the edges of the graph on the basis of an n-sample from a multivariate distribution N(0, Σ). For each i = 1, . . . , n we denote by X_i = (X_{i1}, . . . , X_{ip}) the i-th observation. When the number of observations n is large enough, at least n ≥ p + 1 in order to guarantee that the sample covariance matrix S is nonsingular, several methods are available; a detailed review can be found in a recent paper by Drton & Perlman (2007). However, when the interest lies in genomic networks, we are generally dealing with data where the number of variables p is large and the number of experiments n is small. Several methods have been proposed recently in that context.
Some of these methods aim at estimating the concentration matrix K. For instance, Schäfer & Strimmer (2005b,a) proposed methods based on bagging or shrinkage in order to stabilize either the estimator of P, the correlation matrix associated with Σ, or the estimator of Π, the partial correlation matrix. Then they estimate the probability of an edge between two nodes (a, b) by estimating the density of the estimated partial correlation coefficient. More recently, some authors (Yuan & Lin, 2007; Banerjee et al., 2008; Huang et al., 2006; Friedman et al., 2007) proposed algorithms to estimate K by maximizing the penalized log-likelihood, the penalty term being proportional to the sum of the absolute values of the components of K. The coefficient of proportionality may be chosen so as to control the probability of error in estimating the graphical model.
Other methods are based on the estimation of a graph that is an approximation of the full conditional independence graph. Wille & Bühlmann (2006) suggested estimating a lower-order conditional independence graph in place of the full conditional independence graph; they use a multiple testing procedure for detecting edges. Kalisch & Bühlmann (2007) considered the PC-algorithm (Spirtes et al., 2000) to estimate a graph defined through conditional dependencies on any subset of the variables. The PC-algorithm starts from the complete graph and recursively deletes edges based on conditional independencies.
We finally consider a third kind of method, based on neighborhood estimation. Meinshausen & Bühlmann (2006) proposed to estimate the neighbors of each node using a model selection procedure based on the LASSO method. The choice of the penalty parameter allows one to control the probability of falsely joining distinct connectivity components of the graph. More recently, Giraud (2008) suggested estimating graphs using a model selection procedure based on a penalized empirical risk. The procedure controls the mean square error of prediction, and its performances are established in a non-asymptotic setting.
In the next section we briefly describe these methods, specifying their theoretical properties if any.
2.1. Estimating the concentration matrix.
2.1.1. Bagging or shrinkage for improving the covariance estimator. Schäfer and Strimmer proposed to use bagging (Schäfer & Strimmer, 2005a) or shrinkage (Schäfer & Strimmer, 2005b) for obtaining accurate and reliable estimates of the covariance matrix Σ or its inverse K.
The bagging approach. Bootstrap aggregation (bagging) is used in order to reduce the variance of the estimator of the correlation matrix P. For each bootstrap sample, the empirical correlation matrix P̂ is calculated. The bagged estimator is the empirical mean of the P̂'s over the bootstrap samples. The partial correlation matrix Π is estimated from the pseudo-inverse of the bagged correlation matrix estimator and is denoted by Π̂_bagged.
The shrinkage approach. The shrinkage estimator is a linear combination of the empirical covariance matrix S and of a target estimator T chosen for its very low variability. Precisely, Σ̂(λ) = λT + (1 − λ)S, where the parameter λ is chosen so as to minimize the quadratic risk function defined as R(λ) = E{Σ_a Σ_b (Σ̂_{a,b}(λ) − Σ_{a,b})²}. The parameter λ can be explicitly calculated and is estimated using the data only. Let λ̂ be this estimator. The partial correlation matrix Π is estimated by Π̂_shrinked from the inverse of the matrix Σ̂(λ̂).
Estimating the graph. It remains to define a decision rule for detecting the significant components of Π. Let us denote by Π̂ either Π̂_bagged or Π̂_shrinked. Schäfer and Strimmer assume that the distribution of the Π̂_{a,b}'s is known up to some parameters that are estimated. They deduce from this estimator the posterior probability for an edge to be present in the graph and decide to keep the edges whose posterior probability is greater than a given threshold 1 − α.
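In practice this decision rule is available in the GeneNet package (the successor of GeneTS); a hedged usage sketch, with function names as in recent versions of GeneNet and the threshold 1 − α fixed at 0.95, could read:

    library(GeneNet)
    pc  <- ggm.estimate.pcor(X)                     # shrinkage estimate of the partial correlations
    et  <- network.test.edges(pc)                   # posterior probability of each possible edge
    net <- extract.network(et, cutoff.ggm = 0.95)   # keep edges with posterior probability > 0.95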
2.1.2. Penalized maximum likelihood. Banerjee et al. (2008) considered the problem of estimating the parameters of a Gaussian distribution by solving a maximum likelihood problem with an added ℓ1-norm penalty term. Precisely, they proposed to estimate the inverse covariance matrix K by maximizing, with respect to Ω in the set of positive definite matrices, the following criterion:

C(Ω, λ) = log(det Ω) − trace(SΩ) − λ Σ_a Σ_b |Ω_{a,b}|.

Friedman et al. (2007) recently proposed an efficient algorithm to estimate K by showing that this optimisation problem amounts to recursively solving and updating a LASSO regression problem. For a given parameter λ, let us denote by K̂(λ) the estimator of K. The set of pairs (a, b) such that K̂_{a,b}(λ) is non-zero constitutes the set of edges of the graph. Banerjee and collaborators proposed a choice of λ for which the probability of connecting two distinct connectivity components of the graph is bounded by some α. Precisely,

(1)   λ(α) = T⁻¹_{n−2}(1 − α/(2p²)) / √(n − 2 + [T⁻¹_{n−2}(1 − α/(2p²))]²),

where T_{n−2} is the distribution function of a Student variable with n − 2 degrees of freedom.
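As an illustration, a minimal R sketch combines Equation (1) with the glasso package (the data matrix X is assumed centered with unit-variance columns, as this choice of λ presumes):

    library(glasso)
    n <- nrow(X); p <- ncol(X); alpha <- 0.05
    t.alpha <- qt(1 - alpha / (2 * p^2), df = n - 2)   # Student quantile of Equation (1)
    lambda  <- t.alpha / sqrt(n - 2 + t.alpha^2)       # penalty level lambda(alpha)
    fit <- glasso(cov(X), rho = lambda)                # penalized estimates of Sigma and K
    A   <- fit$wi != 0; diag(A) <- FALSE               # edges: non-zero entries of K-hat(lambda)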
2.2. Approximation of the concentration graph.
2.2.1. The 0-1 conditional independence graph. Wille and Bühlmann proposed to infer the first-order conditional independence graph instead of the full conditional independence graph. Their method has nice computational properties, but the drawback is that 0-1 conditional independence graphs do not generally coincide with concentration graphs, though links between the two graphs can be established in some cases. The 0-1 conditional independence graph is defined as follows: for each pair of nodes (a, b), let R_{ab|∅} be the correlation between the variables X_a and X_b, and for each c ∈ Γ_{−{a,b}}, let R_{ab|c} be the correlation between X_a and X_b conditionally on X_c; there exists an edge between nodes (a, b) if R_{ab|∅} ≠ 0 and R_{ab|c} ≠ 0 for all c ∈ Γ_{−{a,b}}, or equivalently if

(2)   φ_{a,b} = min{|R_{ab|c}|, c ∈ Γ_{−{a,b}} ∪ {∅}}

is non-zero. Therefore, detecting the edges of the graph amounts to testing p(p−1)/2 statistical hypotheses: for each (a, b), 1 ≤ a < b ≤ p, there exists an edge between nodes (a, b) if the hypothesis "φ_{ab} = 0" is rejected. Wille and Bühlmann propose the following testing procedure: for each (a, b) and c ∈ Γ_{−{a,b}} ∪ {∅}, the likelihood ratio test statistic of the hypothesis "R_{ab|c} = 0" is calculated, as well as the corresponding p-value, denoted P(a, b|c). Then the hypothesis "φ_{ab} = 0" is rejected at level α if

Pmax(a, b) = max{P(a, b|c), c ∈ Γ_{−{a,b}} ∪ {∅}} ≤ α.

It remains to calculate the adjusted p-values to take into account the multiplicity of the hypotheses to test, considering for example the Bonferroni procedure or the Benjamini-Hochberg one.
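The procedure is easy to program; the following R sketch (our own code, not the authors' implementation) uses the standard likelihood-ratio statistic −n log(1 − r²) for a zero correlation, compared to a χ² distribution with one degree of freedom (see the next paragraph), followed by the Benjamini-Hochberg adjustment:

    wb.graph <- function(X, alpha = 0.05) {
      n <- nrow(X); p <- ncol(X); R <- cor(X)
      pval <- function(r) pchisq(-n * log(1 - r^2), df = 1, lower.tail = FALSE)
      Pmax <- matrix(NA, p, p)
      for (a in 1:(p - 1)) for (b in (a + 1):p) {
        pv <- pval(R[a, b])                          # conditioning set c = empty set
        for (c in setdiff(1:p, c(a, b))) {           # all first-order conditioning sets
          r.abc <- (R[a, b] - R[a, c] * R[b, c]) /
            sqrt((1 - R[a, c]^2) * (1 - R[b, c]^2))  # partial correlation R_ab|c
          pv <- max(pv, pval(r.abc))                 # Pmax(a, b)
        }
        Pmax[a, b] <- pv
      }
      keep <- p.adjust(Pmax[upper.tri(Pmax)], "BH") <= alpha   # Benjamini-Hochberg
      E <- matrix(FALSE, p, p); E[upper.tri(E)] <- keep
      E | t(E)                                       # symmetric incidence matrix
    }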
Considering 0-1 conditional independence in place of full conditional independence has several advantages. The test statistics are very easy to calculate: for each hypothesis "R_{ab|c} = 0" to test, one considers the marginal distribution of the 3-dimensional Gaussian vector (X_a, X_b, X_c)^T. Therefore, provided that n is large enough, the distribution of the likelihood ratio test statistic of the hypothesis "R_{ab|c} = 0" is well approximated by the distribution of a χ² with one degree of freedom. Note that it is not necessary to assume that p is small. It follows that, for each (a, b), the probability of detecting an edge between a and b when it does not exist is smaller than α if n is large (see Proposition 3 in Wille & Bühlmann (2006)). Moreover, it can be shown that if p increases with n in such a way that log(p)/n tends to 0 as n tends to infinity, then the estimators of the R_{ab|c}'s are uniformly convergent for all a, b ∈ Γ and c ∈ Γ_{−{a,b}}.
Castelo & Roverato (2006) and Malouche & Sevestre (2007) proposed a similar approach for estimating "up to q"-order conditional independence graphs, where the presence or absence of edges is associated with all marginal distributions up to order q. We only present the method proposed by Wille and Bühlmann in our simulation study.
2.2.2. The strong conditional independence graph. Let us consider graphs defined as follows: there exists an edge between nodes a and b if and only if, for every set of nodes m ⊂ Γ_{−{a,b}}, the random variables X_a and X_b are not independent conditionally on X_m. This graph is a subset of the full conditional independence graph and will be called the strong conditional independence graph.
Such graphs can be estimated using an iterative procedure called the PC-algorithm, which proves to be computationally very fast for sparse graphs. The procedure starts with the complete graph and removes edges with zero-order conditional independence relations. Then edges with first-order conditional independence relations are removed, and so on. At each step s, let us denote by E_s the set of edges and, for each node a, by V_a^s the set of neighbors of a. At step s + 1, we only need to consider the ordered pairs of nodes (a, b) ∈ E_s such that the cardinality of V_a^s is strictly greater than s. For each of these pairs (a, b), the procedure consists in keeping an edge between nodes a and b if X_a and X_b are not independent conditionally on X_m for all subsets of nodes m contained in V_a^s with cardinality equal to s + 1.
Kalisch & Bühlmann (2007) considered a sample version of the PC-algorithm as follows: the testing procedure for deciding whether to keep an edge between nodes a and b at step s consists in testing, for each subset of nodes m to be considered, that the correlation between X_a and X_b conditionally on X_m is zero. The test statistic is based on Fisher's z-transform of the sample partial correlations R̂_{ab|m}. Precisely,

Z_{a,b|m} = (1/2) log[(1 + R̂_{ab|m}) / (1 − R̂_{ab|m})],

and for some α > 0, the null hypothesis is rejected if √(n − |m| − 3) |Z_{a,b|m}| > Φ⁻¹(1 − α/2), where |m| denotes the cardinality of m and Φ the distribution function of a centered Gaussian variable with unit variance. The edge between nodes a and b is removed at step s of the algorithm if there exists m with cardinality s such that the test is not rejected.
Under some conditions on the distribution of X, the estimated graph is a consistent estimate of the strong conditional independence graph. The asymptotic framework considers sparse graphs of high dimension: as n tends to infinity, the maximum number of neighbors tends to infinity more slowly than n, while the number of nodes p may grow like any power of n, and the parameter α has to tend to zero.
For practical purposes, the choice of the parameter α is an open problem. Kalisch & Bühlmann (2007) discussed this point on the basis of a simulation study for estimating the skeleton of a directed acyclic graph.
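For illustration, the individual test can be written in a few lines of R, and the whole algorithm is available in the pcalg package (a hedged sketch; argument names may differ across package versions):

    # Fisher z-test of "R_ab|m = 0": r is the sample partial correlation, k = |m|
    fisher.z.reject <- function(r, n, k, alpha = 0.05) {
      z <- 0.5 * log((1 + r) / (1 - r))
      sqrt(n - k - 3) * abs(z) > qnorm(1 - alpha / 2)   # TRUE: reject independence
    }

    library(pcalg)
    fit <- pc(suffStat = list(C = cor(X), n = nrow(X)),
              indepTest = gaussCItest, p = ncol(X), alpha = 0.05)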
2.3. Estimating the neighbors.
2.3.1. LASSO procedure. Detecting the neighbors of all nodes leads to detecting the edges of the graph. Because of the Gaussian assumption on the distribution of X, for each variable X_a a conditional regression model can be defined as follows:

(3)   X_a = Σ_{b ∈ Γ_{−{a}}} θ_{a,b} X_b + ε_a,

where the parameters θ_{a,b} are equal to −K_{a,b}/K_{a,a}. The variable ε_a is distributed as a centered Gaussian variable and is independent of the X_b's for all b ∈ Γ_{−{a}}. Meinshausen & Bühlmann (2006) proposed to detect the non-zero coefficients of the regression of X_a on the variables X_b, b ∈ Γ_{−{a}}, on the basis of the n-sample (X_1, . . . , X_n), using the LASSO method as a model selection procedure. Precisely, for a given smoothing parameter λ, the estimators of {θ_{a,b}, b ∈ Γ_{−{a}}} minimize the sum of squares penalized by the ℓ1-norm of the parameter vector:

(4)   Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ_{−{a}}} θ_{a,b} X_{ib})² + λ Σ_{b ∈ Γ_{−{a}}} |θ_{a,b}|.

The solution to this minimization problem is a set of estimators (θ̂_{a,b}, b ∈ Γ_{−{a}}) that are either equal to zero or not. The set of nodes b ∈ Γ_{−{a}} such that θ̂_{a,b} is non-zero constitutes V̂_a, the estimated set of neighbors of the node a. Two estimated graphs may be deduced from the V̂_a's, a = 1, . . . , p, depending on whether we decide to put an edge between nodes a and b if both θ̂_{a,b} and θ̂_{b,a} are non-zero, or if at least one of them is non-zero.
Meinshausen and Bühlmann proved that, under some conditions ensuring that the signal-to-noise ratio is not too small, the method is consistent, namely the probability that V̂_a is exactly equal to V_a tends to one. The asymptotic framework is similar to the one considered by Kalisch & Bühlmann (2007): sparse graphs of high dimension. The smoothing parameter λ is assumed to decrease to zero at a rate slower than n^{−1/2}.
For the sake of application, they propose a choice of λ such that the probability of connecting two distinct connectivity components of the graph is bounded by some α. Precisely,

(5)   λ = 2 √(Σ_{i=1}^n X²_{ia}) Φ⁻¹(1 − α/(2p²)).

This choice is based on the Bonferroni inequality, and it assumes that the variances of the variables X_a, a = 1, . . . , p, are all equal to one.
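A sketch of the whole procedure in R with the lars package (our own code, not the original implementation; note that the internal scaling of the penalty in lars may differ from Equation (5) by a constant factor, so the line computing lambda is indicative). X is assumed centered with unit-variance columns:

    library(lars)
    mb.graph <- function(X, alpha = 0.05, variant = c("and", "or")) {
      variant <- match.arg(variant)
      n <- nrow(X); p <- ncol(X)
      Theta <- matrix(0, p, p)
      for (a in 1:p) {
        lambda <- 2 * sqrt(sum(X[, a]^2)) * qnorm(1 - alpha / (2 * p^2))  # Equation (5)
        fit <- lars(X[, -a], X[, a], type = "lasso",
                    intercept = FALSE, normalize = FALSE)
        Theta[a, -a] <- predict(fit, s = lambda, type = "coefficients",
                                mode = "lambda")$coefficients
      }
      A <- Theta != 0
      if (variant == "and") A & t(A) else A | t(A)   # the two symmetrization rules
    }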
2.3.2. Model selection procedure. Giraud (2008) considered the problem of estimating, by a model selection procedure, the non-zero θ's occurring in the p regression models defined in Equation (3). The procedure starts with the choice of a collection of graphs with p nodes, or equivalently of a collection of sets of edges, denoted {E_1, . . . , E_L}, where L is the cardinality of the collection. For the sake of simplicity, we say that a p × p matrix Ω is compatible with E if Ω_{a,a} = 0 and if Ω_{a,b} = 0 is equivalent to (a, b) ∉ E. For each set E in the collection, the parameters θ are estimated by minimizing the residual sum of squares: θ̂(E) is the p × p matrix compatible with E that minimizes

SCR(Ω) = Σ_{a ∈ Γ} Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ} Ω_{a,b} X_{ib})².

The choice of the best graph among {E_1, . . . , E_L} is done by selecting the estimator θ̂(E) that minimizes the following criterion:

Crit(E) = Σ_{a ∈ Γ} q(K, ν_a(E)) Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ} θ̂_{a,b}(E) X_{ib})²,
where ν_a(E) is the number of neighbors of node a in the graph associated with E, K is a constant greater than 1, and q is a penalty function given in Giraud (2008). We denote by θ̂ this estimator.
The theoretical properties of the method are given in a non-asymptotic framework with n < p. The graph is assumed to be sparse in the following sense: the maximum number of neighbors over all the nodes of the graph, denoted D, must be smaller than a quantity of the order n/(2 log p). Under this assumption, it is proved that the mean square error of prediction of the estimator, MSEP(θ̂), is bounded above, up to a log p factor, by a quantity close to the minimum over E of the mean square error of prediction of θ̂(E).
In practice a collection of graphs has to be chosen. For example, one can choose the set of all graphs with at most D edges. Obviously such a choice leads to a very high computational cost for large values of p.
3. SIMULATIONS
3.1. Methods of simulation.
3.1.1. Simulating a graph. Graphs were simulated according to two
different approaches.
The first approach is based on the Erdős-Rényi model, denoted ER model, which assumes that edges are independent and occur with the same probability. In practice, we fix the number of nodes p and the percentage of edges η, then we draw the number of edges according to a binomial distribution with parameters p(p−1)/2 and η. Next we choose the positions of the edges uniformly and independently.
The second approach was proposed by Daudin et al. (2006) to take into account topological features of biological networks such as connectivity degree or clustering coefficient. Their model, called Erdős-Rényi Mixtures for Graphs and denoted ERMG, supposes that the nodes are spread into Q clusters with probabilities {p_1, . . . , p_Q}, and that the connection probabilities within and between clusters are heterogeneous. These connection probabilities constitute the connectivity matrix C. The parameters available from the Daudin et al. (2006) study correspond to a graph with 199 nodes. As we wanted to study the influence of p, we adapted those parameters to our simulations. However, we kept the same graph structure by taking a large weakly connected cluster, a small highly connected cluster and the same group connection structure. Thus we used the following parameter values:

(6)   Q = 4,   (p_1, . . . , p_Q) = (0.07, 0.10, 0.18, 0.65),

(7)   C = [ 0.999   10⁻⁶    10⁻⁶    0.005
            10⁻⁶    0.4     0.014   0.003
            10⁻⁶    0.014   0.2065  0.011
            0.005   0.003   0.011   0.013 ].

This leads to a mean percentage of edges η equal to 2.5%.
Whatever the approach, we finally obtain a matrix of 0's and 1's, the 1's indicating the edge positions in the corresponding graph. This matrix is called the incidence matrix.
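As an illustration, the ER simulation of an incidence matrix can be written in a few lines of R (a sketch; the function name is ours):

    simulate.er <- function(p, eta) {
      A  <- matrix(0, p, p)
      ut <- which(upper.tri(A))                 # the p(p-1)/2 possible edges
      n.edges <- rbinom(1, length(ut), eta)     # binomial number of edges
      A[sample(ut, n.edges)] <- 1               # uniform, independent edge positions
      A + t(A)                                  # symmetric incidence matrix
    }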
3.1.2. Simulating the data. From the incidence matrix of a given graph, we simulated n observations as follows. First we generate a partial correlation matrix Π by replacing the values 1 indicating the edge positions in the incidence matrix by values drawn from the uniform distribution between −1 and 1. Then we compute the column-wise sums of the absolute values and set the corresponding diagonal elements equal to these sums plus a small constant. This ensures that the resulting matrix is diagonally dominant and thus positive definite. Next we standardize the matrix so that each diagonal entry equals 1. Finally, we generate n independent samples from the multivariate normal distribution with mean zero, unit variances, and the correlation structure associated with the partial correlation matrix Π.
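One way to implement these steps in R (a sketch under the assumptions above; the function name and the value of the small constant eps are ours):

    library(MASS)   # for mvrnorm()

    simulate.data <- function(A, n, eps = 0.1) {
      p <- ncol(A)
      K <- A * matrix(runif(p * p, -1, 1), p, p)   # fill the edges with uniform values
      K[lower.tri(K)] <- t(K)[lower.tri(K)]        # keep the matrix symmetric
      diag(K) <- colSums(abs(K)) + eps             # diagonal dominance
      K <- K / sqrt(outer(diag(K), diag(K)))       # standardize: unit diagonal
      Sigma <- cov2cor(solve(K))                   # correlation matrix associated to Pi
      mvrnorm(n, mu = rep(0, p), Sigma = Sigma)    # n Gaussian observations
    }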
3.2. Simulation setup. We simulated graphs and data for different values of p, η and n, and we estimated graphs from these data using different methods. We first review the methods and the way we carried them out; then we present how we assessed their performances.
3.2.1. Methods. The methods for which we present simulations are the following:
- the Π̂_bagged and Π̂_shrinked methods proposed by Schäfer & Strimmer (2005a,b), with the decision rule based on posterior probabilities. The threshold 1 − α is fixed at 0.95. Both methods are implemented in R (GeneTS package, R-2.2.0; GeneNet package, R-2.4.1).
- the glasso proposed by Friedman et al. (2007), with α = 5% in accordance with Banerjee et al. (2008). This method is implemented in R (glasso package, R-2.4.1).
- the 0-1 conditional independence graph approach proposed by Wille & Bühlmann (2006), with the decision rule based on the adjusted p-values following the Benjamini-Hochberg procedure, taking α = 5%. We implemented the method in R-2.4.1.
- the PC-algorithm, as proposed by Kalisch & Bühlmann (2007), with α = 5%. This method is implemented in R (pcalg package, R-2.6.1).
- the LASSO approach, with the two variants and and or proposed by Meinshausen & Bühlmann (2006), and α = 5%. This method is implemented in R using the lars package (R-2.4.1); part of the algorithm is implemented according to the description given in Section 6.
- the model selection approach proposed by Giraud (2008), taking K = 3 in the penalty function as suggested by the author to better control the FDR. The method, implemented in R-2.4.1, was kindly provided by the author. To save computational time, the collection of graphs was a subset of the set of all graphs with at most 3 neighbors per node.
In the remainder of this document we will denote these methods by bagging, shrinkage, glasso, pcAlgo, WB, MB.and, MB.or and KGGM, respectively.
3.2.2. Assessing the performance of the methods. To assess the performance of the investigated methods, we compared each simulated graph with the estimated graph by counting true positives TP (correctly identified edges), false positives FP (wrongly detected edges), true negatives TN (correctly identified zero-edges), and false negatives FN (unrecognized edges). From these quantities we estimated the power and the false discovery rate FDR, defined by:

power = E[TP / (TP + FN)],
FDR = E[FP / (TP + FP) | (TP + FP) > 0].
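The counting itself reduces to comparing two incidence matrices; a minimal R sketch (function name ours):

    edge.counts <- function(E.true, E.hat) {
      ut <- upper.tri(E.true)                       # each edge counted once
      tp <- sum(E.hat[ut] == 1 & E.true[ut] == 1)   # correctly identified edges
      fp <- sum(E.hat[ut] == 1 & E.true[ut] == 0)   # wrongly detected edges
      fn <- sum(E.hat[ut] == 0 & E.true[ut] == 1)   # unrecognized edges
      c(power = tp / (tp + fn),
        FDR   = if (tp + fp > 0) fp / (tp + fp) else NA)
    }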
The power and FDR values presented in this work are means over 2000 simulations (our preliminary results showed that the stability of the FDR estimation was reached with 2000 simulations).
The performance of the methods was evaluated for several combinations of the parameters p, η and n, according to the questions we wanted to investigate. Moreover, the parameter values were chosen both to keep the computing time reasonable and to allow extrapolation of the results to biological applications.
The first question we focused on is the influence of the sample size. To this aim, we simulated random graphs fixing the number of nodes p equal to 30 and η equal to 2.5%, and varying the number of observations n in {15, 22, 30, 60}. Secondly, we investigated the sparsity assumption common to all methods, taking η in {0%, 2.5%, 4%, 5%, 10%}. Third, we were interested in the influence of the number of nodes p: we increased p and chose n so as to keep the p/n ratios similar to the values used in the first study. For these three studies, graphs were simulated with the ER method.
The fourth question we investigated concerns the influence of the graph structure. To this end, we also simulated graphs with the ERMG method, fixing p equal to 30 and varying n in {15, 22, 30, 60}.
Finally, we focused on the method proposed by Wille and Bühlmann to evaluate the consequences of estimating the 0-1 graph instead of the concentration graph. For this purpose we fixed p = 30 and varied η from 0.025 to 0.2 and n from 60 to 1200.
3.3. Results and discussion.
3.3.1. Comparing the methods. As shown in Figure 1, the methods behave very differently. Let us first discuss the methods presenting high FDR values.
Comments on the shrinkage, glasso and pcAlgo methods. The FDR values for these methods are very high for all considered values of n when p = 30 and η = 2.5%, as shown in Figure 1a. The FDR does not vary with n and remains close to 47% and 30% for glasso and pcAlgo respectively, while it increases with n from 45% for n = 15 to 75% for n = 60 for shrinkage. When η = 0, the FDR is small for the shrinkage and glasso methods, while it equals 1 for pcAlgo (at least one edge detected in each simulation, see Figure 1c). The high FDR values are associated with high power values. When the graph is sparse enough, say η smaller than 5%, the methods are powerful (Figures 1b and 1d), particularly glasso: its power varies from 97% to 99% when η = 2.5% and n varies from 15 to 60. This result suggests that it may be of interest to look for better choices of the thresholding parameter α. This will be the object of Section 3.3.2.
Comments on the bagging, WB, MB and KGGM methods. For these methods the FDR values never exceed 6%, except for the bagging method with n = 15. The FDR values obtained with MB.or remain steady around 5.5%, whereas the FDR values obtained with MB.and never exceed 1%. KGGM behaves similarly to MB.or, with slightly smaller FDR and power values. The FDR of bagging reaches 18% when n = 15, then drops below 3%. The power, represented in Figure 1b, gradually increases with the number of observations n, except for the bagging method, which shows a drop at n = p. This phenomenon was commented on by the authors in Schäfer & Strimmer (2005a). Let us note that MB.or and MB.and do not work when n = 15. This is due to the fact that when n, α, p satisfy Equation (11), no edge will be detected whatever the data.

Figure 1. FDR and power of the different methods tested (WB, MB.and, MB.or, bagging, KGGM, shrinkage, glasso, pcAlgo), as functions of the sample size (plots a and b, respectively) and of the edge percentage (plots c and d, respectively). Plots a and b were obtained with η = 0.025; plots c and d with n = 22. All plots correspond to p = 30.
The influence of the edge percentage η is shown in Figures 1c and 1d for n = 22. When η is in {2.5%, 4%, 5%}, the FDR values, shown in Figure 1c, stay under 1% with the WB and MB.and methods, around 5% for KGGM, and exceed 5% with bagging and MB.or. For all methods the power falls dramatically as η increases and is close to 0 when η equals 10%, whatever the method used. Similar graphics were obtained for n = 30 and n = 60. When η = 0, the FDR values lie between 0 for the MB methods and 2.4% for WB.
Considering the reliability feature (low FDR), the results presented in Figure 1 reveal that the MB.and and WB methods perform quite well in all cases. Referring to the power, the MB.or and KGGM methods outperform the others. The MB.and and WB methods are less powerful, with the advantage of producing smaller FDR values. The bagging method appears to be the least competitive in terms of power. All methods show a strong decrease of power when η increases, in accordance with the sparsity assumption.
3.3.2. Focus on the high FDRs. We have seen previously that the FDR values of the shrinkage, glasso and pcAlgo methods were very high. This behavior may be due to a bad choice of the thresholding parameter α occurring in each of these methods. Hence it is worthwhile to verify whether a more severe thresholding reduces the FDR while keeping a good power. Therefore, we estimated by simulation the power and the FDR for decreasing values of the thresholding parameter, using p = 30, η = 0.025 and n varying from 15 to 60.
The curves of power versus FDR are shown in Figure 2 for the shrinkage and glasso methods and n = 22. For the sake of comparison, we represent the corresponding curve for the MB.or method on the same graphic. This graphic shows that we cannot both reduce the FDR and keep a good power with the shrinkage and glasso methods. When the FDR equals 5%, the powers of MB.or, glasso and shrinkage are respectively equal to 0.86, 0.47 and 0.05. These values are obtained with α equal to 1% for MB.or and α equal to 10⁻¹² for shrinkage. For glasso, FDR values smaller than 0.45 could not be obtained by varying α: indeed, when α equals 10⁻¹², λ given by Equation (1) equals 0.981, the FDR equals 0.46 and the power 0.95. We therefore carried out the glasso procedure varying the values of λ directly; we obtained an FDR of 0.05 with λ = 0.9996.
When η increases (η = 4%, 5%, 10%) or when p is taken equal to 60, these two methods behave in the same way (results not shown). They therefore cannot be used for estimating graphs if one wants to control the FDR at a value around 5%, so we did not keep the shrinkage and
glasso methods for further studies.
For the pcAlgo algorithm, the power decreases with α, while the FDR is not a monotone function of α, as shown in Figure 3. Indeed, the pcAlgo algorithm is a stepwise procedure, and at each step only nodes for which the estimated neighborhood is large enough are involved in the next step. If α is too small, not enough nodes are kept for the following step, so edges may appear between two nodes even though they are linked through a dropped node. One interesting feature of the variation of the FDR versus α is that the FDR is minimum at α = 0.1% whatever the value of n (Figure 3). Simulations (not shown) with η = 4% lead to the same result: the FDR is a convex function of α and is minimum at α = 0.1% whatever the value of n. We therefore tested the performances of the pcAlgo method again using α = 0.1% for different values of n and η. The FDR decreases from 8.4% to 1.6% when n increases (Figure 4a). It equals 47% when η = 0, remains around 5% when η varies between 2.5% and 5%, and equals 8.8% when η = 10% (Figure 4b). Concerning the power, pcAlgo behaves nearly like the WB method.
3.3.3. Influence of the number of nodes. In Section 3.3.1 we showed that all methods lose power when η increases. We now investigate the influence of the number of nodes p on that loss of power. The graphic in Figure 5 represents the power as a function of η for different values of p, with n = p. Results are shown for the MB.and procedure, the behavior of the other procedures being similar. In all cases the FDR is smaller than 1%. Figure 5 shows that, whatever the value of p, the power decreases when η increases. However, the larger p, the faster the loss of power with η. Consequently, all methods are efficient for sparse graphs, and the edge percentage from which the methods fail depends on the number of nodes.
3.3.4. Influence of the number of neighbors. In Section 3.3.1 we showed that if the graph is highly connected, the methods are no longer powerful. In this section we aim at understanding why, and we show in particular the behavior of the methods according to the number of neighbors of the nodes. We focus on the procedure proposed by Meinshausen and Bühlmann, and we consider the simulation experiment with p = 30, η = 0.025 and n = 30. For each node of the 2000 simulated graphs we count the number of neighbors and the number of correctly detected neighbors. In Table 1 we present, for i in {1, . . . , 5}, the number n_i of nodes with i neighbors and the percentage p_{i,j} of nodes for which the method has correctly detected exactly j neighbors, for j in {0, . . . , i}.

Figure 2. Power as a function of FDR for the glasso and shrinkage methods. The curve for the MB.or method is given as a reference. Plots correspond to p = 30, n = 22 and η = 0.025.

The percentage (p_{ii})_{i=1,...,5} of nodes for which the whole set of neighbors is correctly detected decreases when the number of neighbors i increases. In other words, when a node has several neighbors, it often happens that at least one neighbor is not detected. This may explain the loss of power previously observed (see Section 3.3.1) when η increases, because the average number of neighbors increases with η.

Figure 3. Influence of α on FDR and power (plots a and b, respectively) for different values of n. The level α = 0.001 is indicated by the dashed line. Plots correspond to p = 30 and η = 0.025.

Figure 4. Performance of pcAlgo for α = 0.001: a) FDR and power as functions of the sample size for η = 0.025; b) FDR and power as functions of the edge percentage for n = 22. Plots correspond to p = 30.

Let us now compare the results obtained with the MB.and and MB.or procedures. In Section 3.3.1 we showed that the MB.or procedure is more powerful, and we recover in Table 1 that the percentages of nodes for which the whole set of neighbors is correctly detected are significantly larger with the MB.or procedure than with the MB.and procedure. Consider for example, as illustrated in Figure 6, a node a with two neighbors b and c such that a is the only neighbor of b and of c. As noticed just before, the procedure of Meinshausen and Bühlmann will more easily detect that a is the neighbor of b and of c than that the nodes b and c are both neighbors of a. This is the reason why the MB.or procedure is more powerful than the MB.and procedure.

Figure 5. Power according to the edge percentage, for different values of p, with n = p, for the Meinshausen and Bühlmann method using its and variant.
    i        1      2      3     4     5
    n_i  21398   7684   1788   287    38

    Percentage p_{i,j} obtained with MB.and:

    j \ i      1      2      3      4      5
    0      0.182  0.078  0.062  0.066  0.105
    1      0.818  0.643  0.465  0.380  0.263
    2             0.279  0.414  0.411  0.500
    3                    0.059  0.139  0.132
    4                           0.004  0.000

    Percentage p_{i,j} obtained with MB.or:

    j \ i      1      2      3      4      5
    0      0.022  0.016  0.030  0.049  0.079
    1      0.978  0.267  0.097  0.063  0.105
    2             0.717  0.370  0.195  0.079
    3                    0.503  0.387  0.132
    4                           0.306  0.447
    5                                  0.158

Table 1. Number of nodes with i neighbors and percentages of nodes for which exactly j neighbors have been correctly detected by the two methods of Meinshausen and Bühlmann, with n = 30. Graphs are simulated according to the ER model with p = 30, η = 0.025.
Figure 6. Node a with two neighbors b and c, such that a is the only neighbor of b and of c.
3.3.5. Influence of the graph structure. In this section we present results for graphs simulated according to the ERMG model described in Section 3.1.1. Our aim is to evaluate the influence of heterogeneous clusters in the graph. Results are shown in Figure 7 for p = 30, for n taking the values 15, 22, 30 and 60, and for all the methods chosen for their low FDR. The parameter α for the pcAlgo method was taken equal to 0.1%, in accordance with the results given in Section 3.3.2. For the parameters given in Equations (6) and (7), the percentage η of edges equals 2.5%, which makes the results comparable with those of Figures 1a, 1b and 4.
Using the ERMG model for simulating graphs does not change the shapes of the FDR and power curves. As in Figure 1a, the FDR value obtained with bagging is high when n = 15 and then drops sharply, and the power drops at n = p. Moreover, we recover that the FDR values stay very low with the WB and MB.and procedures, stay under 5% for KGGM, and are larger with the MB.or and pcAlgo procedures. Referring to the power, as in Figure 1b, the MB.or and KGGM procedures outperform the others.
The main difference when graphs are simulated according to the ERMG model is that the power remains under 0.8 even for large n (Figure 7b), whereas it reaches 0.95 when the ER model is used (Figure 1b). Thus, the methods are less powerful when graphs are simulated according to the ERMG model than according to the ER model. The next section sheds light on this loss of power.
Figure 7. FDR (a) and power (b) obtained with the different methods tested, as functions of the sample size. Graphs were simulated according to the ERMG model with p = 30.
3.3.6. Influence of the neighborhood structure. In this section we study why the methods are less powerful when graphs are simulated according to the ERMG model than according to the ER model, and we underline in particular the influence of the neighborhood structure.
We consider the same simulation experiment as in Section 3.3.4, except that the graphs are simulated according to the ERMG model. In Table 2, one can read, for each i in {1, . . . , 6}, the number n_i of nodes with i neighbors and the percentage p_{i,j} of nodes for which the method has correctly detected j neighbors, for j in {0, . . . , i}. Results are obtained with the MB.or procedure.
    i        1      2      3     4     5    6
    n_i  16326   6720   2603   941   332   63

    Percentage p_{i,j} obtained with MB.or:

    j \ i      1      2      3      4      5      6
    0      0.042  0.129  0.289  0.530  0.654  0.841
    1      0.958  0.305  0.206  0.148  0.133  0.048
    2             0.566  0.256  0.128  0.075  0.048
    3                    0.249  0.121  0.057  0.032
    4                           0.073  0.054  0.000
    5                                  0.027  0.032
    6                                         0.000

Table 2. Number of nodes with i neighbors and percentages of nodes for which j neighbors have been correctly detected by the MB.or procedure, with n = 30. Graphs are simulated according to the ERMG model with p = 30.
Comparing Tables 1 and 2 shows that the numbers of nodes with 1 or 2 neighbors are smaller for the ERMG model than for the ER model, while the numbers of nodes with 3 or more neighbors are greater. It also appears that the percentages p_{1,j} are similar in both tables for nodes with one neighbor. But when the number of neighbors i is larger than one, the percentages p_{ii} of nodes for which the whole set of neighbors is correctly detected are smaller in Table 2 than in Table 1. Moreover, the main difference between Tables 1 and 2 concerns the percentage of nodes for which no neighbor is detected: these percentages p_{i,0} are very small in Table 1 but large in Table 2, and they increase with the number of neighbors. In other words, for graphs simulated according to the ERMG model, detecting no neighbor at all happens often, especially for nodes with a large number of neighbors. This can be explained by the structure of the neighborhoods, which is more complex for graphs simulated according to the ERMG model. This point is illustrated below.
In the following we present the FDR and the power estimated within each cluster and between clusters. We first simulate a graph G according to the ERMG model with the parameters defined in Section 3.1.1, in order to fix the number of nodes and the localisation of the edges within each cluster and between clusters. We simulate a graph with p = 120 nodes to ensure that each cluster contains a minimal number of nodes. We denote by (n_1, . . . , n_Q) the numbers of nodes in the clusters and by Nedges the matrix which specifies the number of edges within each cluster and between each pair of clusters. For the simulated graph G, these parameters are:

(n_1, . . . , n_Q) = (7, 11, 23, 79)

and

Nedges = [ 21   0   0   3
            0  21   6   3
            0   6  38  19
            3   3  19  35 ].
We simulate 2000 data matrices from this graph G, as described in Section 3.1.2, and we estimate the FDR and the power for detecting edges within and between clusters. The results obtained with the MB.or procedure and n = p are presented in the matrices FDR and power given in Equations (8) and (9). The component (a, b), a ≠ b, of the matrix FDR (respectively power) gives the estimated false discovery rate (respectively power) for edges between clusters a and b. When there is no edge between two clusters, estimating the power does not make sense and we put NA. The elements on the diagonal give the estimated false discovery rate (respectively power) for edges within each cluster.

(8)   FDR = [ 0.000  0.001  0.016  0.008
              0.001  0.005  0.004  0.012
              0.016  0.004  0.006  0.014
              0.008  0.012  0.014  0.021 ]

(9)   power = [ 0.10   NA    NA   0.46
                 NA   0.26  0.22  0.61
                 NA   0.22  0.29  0.61
                0.46  0.61  0.61  0.87 ]
We can notice from Equation (8) that all the estimated FDR values are small. The estimated powers, however, vary a lot across clusters. Indeed, in the first cluster, which contains 21 edges among the n_1(n_1 − 1)/2 = 21 possible edges, the power is very small, whereas in the fourth cluster, which contains 35 edges among the n_4(n_4 − 1)/2 = 3081 possible edges, the power is large. The neighbors of the neighbors also influence the power. This can be observed by comparing the power for detecting edges between the second and third clusters, power[2,3] = 0.22, with the power for detecting edges within the fourth cluster, power[4,4] = 0.87. It thus appears more difficult to detect edges between clusters 2 and 3 than within cluster 4, while in both cases the percentage of edges to detect is approximately equal to 0.01. This comes from the fact that clusters 2 and 3 are both highly connected; these two clusters therefore involve nodes for which the neighborhood structure is complex.
Because of these highly connected parts, the power estimated over the whole graph G is smaller than if the edges were distributed uniformly in the graph. Indeed, the FDR and the power estimated for the whole graph G equal respectively 0.016 and 0.44, while for graphs simulated according to the ER model with p = 120 and η = 0.025, the average FDR and power estimated over 2000 simulations with the MB.or procedure and n = 120 equal respectively 0.009 and 0.50.
3.3.7. Inferring a concentration graph using a 0-1 conditional independence graph. If the Gaussian distribution is faithful to the concentration graph G (see Proposition 1 in Wille & Bühlmann (2006)), then all edges of G are edges of the 0-1 conditional independence graph, denoted G_{0,1}. A comparison between G and G_{0,1} is given in Table 3. For each concentration matrix whose values are simulated as described in Section 3.1.2, and for each pair (a, b), 1 ≤ a < b ≤ p, we calculated φ_{a,b} defined in Equation (2). It appears that, as already noticed by Wille and Bühlmann, the number of edges in G_{0,1} may be considerably larger than in G. The power and FDR for estimating the graph G are reported in Figure 8, plots a) and b). It shows that the FDR increases with n and reaches its maximum for η = 10%.

    η      |       G ∩ G_{0,1}           |       G_{0,1} \ G
           | Number  mean  range         | Number  mean   range
    0.025  |   11    0.72  [10⁻⁴, 0.99]  |   0.3   0.09   [10⁻⁴, 0.33]
    0.05   |   22    0.57  [10⁻⁵, 0.99]  |  17     0.05   [10⁻⁶, 0.46]
    0.1    |   43    0.35  [10⁻⁷, 0.99]  | 217     0.02   [10⁻⁹, 0.41]
    0.15   |   65    0.24  [10⁻⁹, 0.99]  | 322     0.012  [10⁻⁹, 0.30]
    0.2    |   87    0.18  [10⁻⁸, 0.99]  | 337     0.01   [10⁻⁹, 0.21]
    0.3    |  131    0.11                | 304     0.009

Table 3. Comparison of G_{0,1} and G for p = 30 and several values of η. The column G ∩ G_{0,1} gives the mean (over 2000 simulations) number of edges that are both in G and G_{0,1}, followed by the mean and range of the φ_{a,b}'s corresponding to these edges. The column G_{0,1} \ G gives the same results for edges that are in G_{0,1} and not in G. In all simulations the edges of G are edges of G_{0,1}.
This behaviour is easily explained by looking at Figure 8, plots c) and d), where the FDR and the power for estimating G_{0,1} are reported. It shows that the FDR for estimating G_{0,1} stays very small and that the power increases with n, as expected. Unfortunately, the detected edges of G_{0,1} that are not in G increase the FDR for detecting edges in G.
When η is small, say η ≤ 2.5%, the number of edges that are in G_{0,1} but not in G is very small, and the FDR for detecting edges in G is not changed. But when η is large, this FDR becomes very large, up to 20% for η = 10%. Nevertheless, when η is larger still, the FDR decreases. This can be explained by the values of the φ_{a,b}'s, which are smaller when η increases, as shown in Table 3. Obviously, the behaviour of the procedure proposed by Wille and Bühlmann shown in this simulation study may depend on the way we simulate the concentration matrix. Nevertheless, we have to keep in mind that if the graph, or a part of it, is highly connected, then inferring a concentration graph on the basis of its approximation by a 0-1 conditional independence graph may lead to wrongly detected edges.

Figure 8. FDR and power for estimating G (plots a and b) and G_{0,1} (plots c and d), as functions of η, for p = 30 and different values of n.
4. APPLICATION TO BIOLOGICAL DATA
In this section, we apply the different methods to the multivariate flow cytometry data produced by Sachs et al. (2005). These data concern a human T cell signaling pathway whose deregulation may lead to carcinogenesis. This pathway was therefore extensively studied in the literature, and a network involving 11 proteins and 18 interactions is conventionally accepted (Sachs et al., 2005). This network, which we denote Graf, is represented in Figure 9. Sachs et al.'s data consist of the amounts of these 11 proteins, simultaneously measured in single cells under several perturbed conditions. In the sequel, we focus on one general perturbation (+ ICAM-2) that stimulates the cellular signaling network as a whole. In this condition the quantities of the 11 proteins were measured in 902 cells. Let us denote by D this data set of p = 11 variables and n = 902 observations.
A log-transformation of the data was made to better fit the Gaussian assumption, and the vector of the n observations for each protein was centered and normalized.
Contrary to most post-genomic data, flow cytometry data provide a large sample of observations, which allows us to measure the influence of the sample size on the power of the estimation methods. From this data set we first compare the networks inferred using the five methods retained for their low FDR. As such an abundance of data is rarely available in post-genomic studies, we then carry out a study to determine the influence of the number of observations on the methods.
Figure 9. Graf: the classic signaling network of the human T cell pathway. The connections well established in the literature are in grey; the connections cited at least once in the literature are represented by red dashed lines.
We represent the estimated graphs in Figure 10. The graphs inferred with the bagging, WB and pcAlgo methods are identical; this graph, involving 10 edges, is denoted G1. The KGGM method and the two variants MB.or and MB.and infer the same graph, denoted G2. This graph involves 9 edges and is identical to G1 except for the edge between PKA and Erk1/2, which is missing. To assess the quality of the methods, we refer to the conventionally accepted network shown in Figure 9. This network involves 18 connections, among which 16 are well established. As the data set D is obtained by considering only one perturbed condition, we do not expect the methods to detect all the connections established in the literature. In fact, 10 connections are detected by three of the five methods. Among those connections, nine were well established or cited at least once in the literature. The tenth, between p38 and JNK, was detected by the five methods previously cited. Moreover, the same ten connections were detected by Sachs et al. (2005) (Supplementary Material) applying a Bayesian network analysis. Therefore, in the following, we assume that the graph G1 represents the conditional independence structure of the data set D.

Figure 10. Inferred graphs. The graph G1 estimated with the bagging, WB and pcAlgo methods is represented in blue. The graph G2 estimated with the KGGM, MB.or and MB.and methods is in green dashed lines. The values of the partial correlation matrix associated with the data set D are reported along each edge.
We now investigate the influence of the number of observations n on the power of the methods for estimating the graph G1. We chose n equal to 15, 30, 100, 200 and 300. For each value of n, 2000 n-samples were drawn from D without replacement. From each sample, we estimate graphs using the five methods and compare each estimated graph with the graph G1. We compute the proportion of wrongly detected edges among the detected edges, and the proportion of correctly identified edges among the edges of G1. The means of these quantities over the 2000 simulations are denoted FDR and power. Results are presented in Table 4. As expected, the power of all methods increases with the number of observations n. However, n has to be large in order to detect most of the edges. This comes from the fact that the graph G1 involves 11 proteins and 10 edges, which corresponds to a large percentage of edges (18%). In this study, we notice that the edges Raf–Mek1/2 and Erk1/2–Akt are detected in most of the simulations, even for small n and whatever the method; on the contrary, the edge PKA–Erk1/2 is detected less often. This is in accordance with the values of the partial correlation matrix given in Figure 10: the largest values of the partial correlation matrix correspond to the most often detected edges.
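The subsampling experiment itself is straightforward; for instance, with the sketch functions introduced in Sections 3.1 and 3.2 (our own code, not the original implementation, and assuming G1 is stored as an incidence matrix):

    idx   <- sample(nrow(D), n)                         # n-sample drawn from D without replacement
    G.hat <- mb.graph(scale(log(D[idx, ])), variant = "or")   # log-transformed, centered, normalized
    edge.counts(G1, G.hat)                              # proportions averaged for Table 4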
Let us now compare the methods according to the number of observations at our disposal. When n is small (n = 15), the pcAlgo and KGGM methods are the most powerful, with an FDR around 1%. When n is moderate (n = 30 or n = 100), we advise using the MB.or procedure, because the FDR is small and the gain in power is large. When n is very large, all methods perform quite well in terms of power, though KGGM is slightly less powerful; the FDR obtained with the MB.and procedure being null, this procedure is then recommended.
FDR
 n     bagging   WB       MB.and   MB.or    pcAlgo   KGGM
 15    0.0227    0.0037   0.0007   0.0017   0.0086   0.0106
 30    0.0159    0.0020   0.0011   0.0044   0.0030   0.0051
 100   0.0117    0.0017   0.0001   0.0067   0.0018   0.0051
 200   0.0098    0.0010   0.0000   0.0111   0.0011   0.0068
 300   0.0056    0.0005   0.0000   0.0136   0.0005   0.0056

Power
 n     bagging   WB       MB.and   MB.or    pcAlgo   KGGM
 15    0.27      0.33     0.23     0.26     0.38     0.40
 30    0.43      0.47     0.47     0.62     0.48     0.57
 100   0.68      0.69     0.68     0.77     0.69     0.69
 200   0.79      0.79     0.77     0.81     0.79     0.75
 300   0.85      0.83     0.82     0.83     0.83     0.79

Table 4. FDR and power for estimating the graph G1. Results for the different methods and for different values of n.
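The subsampling scheme described above can be summarized by the following sketch. It is an illustrative reconstruction rather than the code used for the study: estimate_graph stands for any of the five inference procedures and is assumed to return the inferred edge set as a Python set, D is the full data matrix, and ref_edges is the edge set of G1.

    import numpy as np

    rng = np.random.default_rng(0)

    def fdr_and_power(estimate_graph, D, ref_edges, n, n_rep=2000):
        """Mean FDR and power of a graph-inference procedure over subsamples.

        Each replicate draws n rows of D without replacement, infers a graph
        and compares its edge set with the reference edge set.
        """
        fdr, power = [], []
        for _ in range(n_rep):
            idx = rng.choice(D.shape[0], size=n, replace=False)
            edges = estimate_graph(D[idx])
            n_false = len(edges - ref_edges)
            fdr.append(n_false / max(len(edges), 1))   # FDR taken as 0 if nothing is detected
            power.append(len(edges & ref_edges) / len(ref_edges))
        return float(np.mean(fdr)), float(np.mean(power))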
5. CONCLUSION
In this work, we were interested in recent methods that infer direct links between entities from experimental datasets. The results we obtained underline both common features and specificities of these methods regarding the parameters p, n and η of the application context. The most relevant points from our simulation study are the following:
- If one aims to control the FDR at a low level, shrinkage and glasso should not be used.
- pcAlgo gives a better control of the FDR when the parameter α is suitably chosen. However, there is no simulation condition where it performs better than, for example, the MB methods.
- The bagging procedure is less powerful than the others, though the FDR is not better controlled.
- The WB method performs well, but we have to keep in mind that it aims at estimating an approximation of the concentration graph, and may lead to high FDR values when the 0-1 conditional independence graph differs from the concentration graph.
- KGGM performs well, particularly when n is small. However, this procedure cannot be carried out when p is large, say larger than 40.
- We recommend using the MB procedure when it can be applied (n large enough so that Equation (11) is not satisfied). If one can accept a false discovery rate of the order of 5%, then we recommend the variant MB.or, which is more powerful than the variant MB.and. The latter should be preferred when the false discovery rate has to be very small.
The structure of the graph should also be considered: if the edges are not uniformly distributed over the nodes, as they are in the Erdős–Rényi model, then edges localized in highly connected parts of the graph, or edges joining two highly connected parts, may be difficult to detect.
In the end, methods inferring graphs do not behave equivalently when faced with different graph and dataset structures. Consequently, great attention must be paid to the validity domain of each method before applying it.
6. Appendix. Algorithm for the MB method
For each variable $a \in \Gamma$, let $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ be the LASSO estimators of the parameters $\theta_{a,b}$ defined in Equation (4). In this section we describe the algorithm used for detecting the $\hat\theta_{a,b}$ that are non-zero. The first step of the algorithm consists in using the LARS algorithm for ranking the variables $X_{\Gamma \setminus \{a\}}$ according to the covariance structure of the matrix $X$. Then, for the chosen value of $\lambda$, the non-zero components of $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ are detected. This second step is described below.
Let us define the following notation: for $x$ a vector with $q$ components, $\|x\|^2 = \sum_{l=1}^{q} x_l^2$ and $\|x\|_\infty = \sup_{l=1,\ldots,q} |x_l|$. For the sake of simplicity, we set $Y = X_a$ and $U = X_{\Gamma \setminus \{a\}}$, and we assume that $Y$ and the columns of $U$ are centered and scaled such that $\|Y\|^2 = n$ and, for all $b = 1, \ldots, q$ (with $q = p - 1$), $\|U_b\|^2 = n$. Let

(10)   $\hat\beta(\lambda) = \mathrm{Arg\,min}_{\beta \in \mathbb{R}^{p-1}}\ \|Y - U\beta\|^2 + \lambda \sum_{b=1}^{p-1} |\beta_b|.$
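Equation (10) is the standard LASSO criterion, so for a fixed $\lambda$ it can be solved with off-the-shelf software once the parametrisations are matched. A minimal sketch assuming scikit-learn, whose Lasso minimises $\|Y - U\beta\|^2/(2n) + \alpha \sum_b |\beta_b|$, so that $\alpha = \lambda/(2n)$ recovers the penalisation of (10):

    import numpy as np
    from sklearn.linear_model import Lasso

    def beta_hat(U, Y, lam):
        """Solve Equation (10): argmin ||Y - U b||^2 + lam * sum |b_j|.

        sklearn's Lasso minimises ||Y - U b||^2 / (2n) + alpha * sum |b_j|,
        hence alpha = lam / (2 * n) yields the same minimiser.
        """
        n = U.shape[0]
        fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(U, Y)
        return fit.coef_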
We will use the following properties.

Property A. $\hat\beta(\lambda)$ is a solution of Equation (10) if and only if there exists a $(p-1)$-vector $v$ satisfying
- for all $b = 1, \ldots, p-1$, $v_b = \mathrm{sign}(\hat\beta_b(\lambda))$ if $\hat\beta_b(\lambda) \neq 0$, and $v_b \in [-1, 1]$ otherwise;
- $\lambda v = 2\, U^T (Y - U \hat\beta(\lambda))$.
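Property A also provides a direct numerical check that a candidate vector solves Equation (10). A small sketch, under the same assumptions and conventions as the snippet above:

    import numpy as np

    def kkt_check(U, Y, beta, lam, tol=1e-8):
        """Verify the subgradient conditions of Property A for a candidate beta."""
        v = 2.0 * U.T @ (Y - U @ beta) / lam
        active = beta != 0
        ok_active = np.all(np.abs(v[active] - np.sign(beta[active])) < tol)
        ok_inactive = np.all(np.abs(v[~active]) <= 1 + tol)
        return bool(ok_active and ok_inactive)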
Property B. Solving (10) is equivalent to solving the following constrained minimization problem:

$\hat\beta(t) = \mathrm{Arg\,min}_{\sum_{b=1}^{p-1} |\beta_b| \le t}\ \|Y - U\beta\|^2.$

Therefore, for all $\lambda$, there exists $t_\lambda$ such that $\hat\beta(\lambda) = \hat\beta(t_\lambda)$.
Property C. Let

$C(t) = 2\, \|U^T (Y - U \hat\beta(t))\|_\infty.$

It can be shown that the function $C$ satisfies the following properties: $C$ is a decreasing function of $t$ (see Efron et al. (2004), Lemma 7), and $\lambda = C(t_\lambda)$.
From Property A it follows that $\lambda \ge 2\,\|U^T Y\|_\infty$ is equivalent to $\hat\beta(\lambda) = 0$. As $2\,\|U^T Y\|_\infty \le 2n$, we get that $\hat\beta(\lambda) = 0$ as soon as $\lambda \ge 2n$. Comparing this lower bound with the value of $\lambda$ given by Meinshausen and Bühlmann (see Equation (5)), it appears that the parameters will be estimated by zero, whatever the data, if

(11)   $n \le \left(\Phi^{-1}(1 - \alpha/2p^2)\right)^2.$
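Equation (11) can be turned into a practical check of whether the MB procedure is applicable at all for given p and α, as recommended in the conclusion. A sketch assuming scipy is available; the function name is ours, and it relies on our reconstruction of the inequality in (11):

    import math
    from scipy.stats import norm

    def mb_min_n(p, alpha=0.05):
        """Smallest n for which Equation (11) is violated, i.e. the smallest
        sample size at which the MB procedure can select any edge."""
        z = norm.ppf(1 - alpha / (2 * p**2))
        return math.floor(z**2) + 1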
Consider now the case where $\lambda < 2\,\|U^T Y\|_\infty$, and let us denote by $A(t)$ the set of active parameters:

$A(t) = \{\, b \in \{1, \ldots, p-1\} : \hat\beta_b(t) \neq 0 \,\}.$

When $t$ increases, $A(t)$ becomes larger. The LARS algorithm gives the values of $t$, namely $t_0 = 0, t_1, t_2, \ldots$, at which $A(t)$ gains a variable: for each $k = 1, 2, \ldots$ and for all $t \in \left] t_{k-1}, t_k \right]$, $A(t)$ is constant and equal to $A(t_k)$.
Thanks to the third property, it remains to find $k^\star = \min\{k : C(t_k) < \lambda\}$. The non-zero components of $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ are then given by $A(t_{k^\star})$.
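Putting the two steps together, the sketch below mirrors the algorithm described above: scikit-learn's lars_path provides the breakpoints $t_k$ of the LASSO path, and Property C is used to locate $k^\star$. It is an illustration under our naming assumptions (mb_neighborhood, lam), not the authors' implementation.

    import numpy as np
    from sklearn.linear_model import lars_path

    def mb_neighborhood(X, a, lam):
        """Non-zero LASSO coefficients of the regression of X_a on the others.

        Two-step procedure: (1) LARS computes the breakpoints t_0 = 0, t_1, ...
        of the LASSO path; (2) the active set is read off at
        k* = min{k : C(t_k) < lam}, with C(t) = 2 ||U^T (Y - U beta(t))||_inf.
        """
        n, p = X.shape
        Y = X[:, a].astype(float).copy()
        U = np.delete(X, a, axis=1).astype(float)
        # centre and scale so that ||Y||^2 = n and ||U_b||^2 = n for every column b
        Y -= Y.mean()
        Y *= np.sqrt(n) / np.linalg.norm(Y)
        U -= U.mean(axis=0)
        U *= np.sqrt(n) / np.linalg.norm(U, axis=0)
        # Step 1: LASSO path; coefs[:, k] is beta_hat(t_k)
        _, _, coefs = lars_path(U, Y, method="lasso")
        # Step 2: C(t) decreases in t; stop at the first breakpoint with C(t_k) < lam
        for k in range(coefs.shape[1]):
            C = 2.0 * np.max(np.abs(U.T @ (Y - U @ coefs[:, k])))
            if C < lam:
                active = np.flatnonzero(coefs[:, k])
                break
        else:
            active = np.flatnonzero(coefs[:, -1])   # lam below the end of the path
        return np.delete(np.arange(p), a)[active]   # indices among the original variables

Note that when lam exceeds 2n the loop stops at the very first breakpoint with an empty active set, consistent with the bound derived from Property A.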
References
Banerjee, O., Ghaoui, L., & d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Castelo, R., & Roverato, A. (2006). A robust procedure for Gaussian graphical model search from microarray data with p larger than n. Journal of Machine Learning Research, 7, 2621–2650.
Daudin, J.-J., Picard, F., & Robin, S. (2006). A mixture model for random graphs. Research Report RR-5840, INRIA.
Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G., & West, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90, 196–212.
Drton, M., & Perlman, M. (2007). Multiple testing and error control in Gaussian graphical model selection. Statistical Science, to appear.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–451.
Friedman, J., Hastie, T., & Tibshirani, R. (2007). Sparse inverse covariance estimation with the lasso. Technical report, http://www-stat.stanford.edu/tibs/ftp/graph.pdf.
Giraud, C. (2008). Estimation of Gaussian graphs by model selection. Electronic Journal of Statistics, 2, 542–563.
Huang, J., Liu, N., Pourahmadi, M., & Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1), 85–98.
Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19, 2271–2282.
Kalisch, M., & Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8, 613–636.
Kishino, H., & Waddell, P. J. (2000). Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Informatics, 11, 83–95.
Malouche, D., & Sevestre, S. (2007). Estimating high dimensional faithful Gaussian graphical models: UPC-algorithm. Technical Report arXiv:0705.1613.
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3), 1436–1462.
Okamoto, S., Yamanishi, Y., Ehira, S., Kawashima, S., Tonomura, K., & Kanehisa, M. (2007). Prediction of nitrogen metabolism-related genes in Anabaena by kernel-based network analysis. Proteomics, 7(6), 900–909.
Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 523–529.
Schäfer, J., & Strimmer, K. (2005a). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6), 754–764.
Schäfer, J., & Strimmer, K. (2005b). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, 1–32.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction and Search. The MIT Press, 2nd edition.
Werhli, A., & Husmeier, D. (2007). Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6.
Wille, A., & Bühlmann, P. (2006). Low-order conditional independence graphs for inferring genetic networks. Statistical Applications in Genetics and Molecular Biology, 5, 1–34.
Wu, W., & Ye, Y. (2006). Exploring gene causal interactions using an enhanced constraint-based method. Pattern Recognition, 39, 2439–2449.
Yellaboina, S., Goyal, K., & Mande, S. (2007). Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: Comparison with high-throughput experimental data. Genome Research, 17(4), 527–535.
Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.