Assessing the Validity Domains of Graphical Gaussian Models
in order to Infer Relationships among Components
of Complex Biological Systems

by

Fanny Villers, Brigitte Schaeffer, Caroline Bertin and Sylvie Huet

Research Report No. 16
July 2008

Statistics for Systems Biology Group
Jouy-en-Josas/Paris/Evry, France
http://genome.jouy.inra.fr/ssb/
Fanny Villers (1), Brigitte Schaeffer (2), Caroline Bertin (3), Sylvie Huet (4)

Unité Mathématiques et Informatique Appliquées
INRA
Domaine de Vilvert
F-78352 Jouy-en-Josas Cedex

(4) sylvie.huet@jouy.inra.fr
Abstract. The study of the interactions of cellular components is an essential first step towards understanding the structure and dynamics of biological networks, and various methods were recently developed for this purpose. While most of them combine different types of data and a priori knowledge, methods based on graphical Gaussian models are capable of learning the network directly from raw data. They consider the full-order partial correlations, which are partial correlations between two variables given all the remaining ones, for modelling direct links between variables. Statistical methods were developed for estimating these links when the number of observations is larger than the number of variables. However, the rapid advance of new technologies that allow the simultaneous measurement of genome expression has led to large-scale datasets where the number of variables is far larger than the number of observations. To get around this dimensionality problem, different strategies and new statistical methods were proposed. In this study we focus on recently published statistical methods. All are based on the assumption that the number of direct relationships between two variables is very small compared with the number of possible relationships, p(p-1)/2. In the biological context, this assumption is not always satisfied over the whole graph, so it is essential to know precisely how the methods behave with respect to the characteristics of the studied object before applying them. For this purpose, we evaluated the validity domain of each method on wide-ranging simulated datasets. We then illustrate our results using recently published biological data.
Subjclass. 62H12
Keywords. Graphical Gaussian Model, Estimation, Simulation
1. Introduction
Biological systems involve complex cellular processes built up from physical and functional interactions between molecular entities (genes, proteins, small molecules, ...). Thus, to understand how these processes are regulated, it is necessary to study the behavior of the molecular machinery. Recently, biotechnological developments have focused on the characterization and quantification of cellular system components, producing a huge amount of varied data. One of the major challenges nowadays is therefore to understand from these data how molecular entities interact, i.e. what the functional links are, in the context of a whole system. To this end, several mathematical and computational approaches are being developed. Some methods based on correlations or clustering can reveal proximities between variables but do not bring to light the direct or functional links. Other methods, such as kernel-based methods (Okamoto et al., 2007; Yellaboina et al., 2007), imply a learning phase and thus need a training data set. Bayesian approaches are also used to infer relations between biological entities in order to understand the regulatory mechanisms of living cells (Husmeier, 2003; Werhli & Husmeier, 2007). However, these methods have to deal with the prior probability, which has a non-negligible influence on the posterior probability when data are sparse and noisy.
A valuable complement to all of these methods is graphical Gaussian modeling (Kishino & Waddell, 2000; Dobra et al., 2004; Wu & Ye, 2006), which can infer direct relations between variables from a set of repeated observations of these variables without any a priori knowledge. Graphical modeling is the use of a graph to represent a model. A graph is a set of nodes and edges, which can be represented as a drawing for visual study or as a matrix for computer processing. Graphical modeling is based on the concept of conditional independence: a direct relation between two variables exists if those two variables are conditionally dependent given all remaining variables. In the Gaussian setting, a direct relation between two variables corresponds to a non-zero entry in the partial correlation matrix. As the partial correlation matrix is related to the inverse of the covariance matrix, a direct relation between two variables also corresponds to a non-zero entry in the inverse of the covariance matrix.
Graphical models are classically used when the number of observations, denoted n, is larger than the number of variables, denoted p. This is generally the case in financial or sociological studies, where surveys concern few variables and many observations. But it is not the case in the post-genomic context, where each experiment is costly in time and money. The number of repetitions is therefore limited; moreover, each experiment generates numerous data. The data set structure, p ≫ n, then does not match the assumptions of the classical graphical modeling approach, and the empirical covariance matrix cannot be inverted. Over the last years, mathematical and computational research was carried out to circumvent this dimensionality problem, and various methods were proposed. Most of them are based on the fact that the number of direct relations between two variables is very small compared with the number of possible relations, p(p−1)/2.
The purpose of our study is to determine the validity domain of some of these recently proposed methods. The reason for this work is to give biologists hints for choosing the most appropriate methods. Indeed, biologists are very interested in inferring biological networks, but they generally have a small number of repetitions, on the order of ten.
The core of this document is divided into three parts. The first one describes the statistical methodology involved in the approaches of Schäfer & Strimmer (2005a), Schäfer & Strimmer (2005b), Wille & Bühlmann (2006), Meinshausen & Bühlmann (2006), Friedman et al. (2007), Kalisch & Bühlmann (2007) and Giraud (2008). The second part presents simulations carried out with each of these methods under different conditions of dataset structure. The third part illustrates the interest of graphical Gaussian modeling with an application to the flow cytometry data produced by Sachs et al. (2005). In the conclusion we discuss the performances of each method and give some recommendations according to their validity domains.
2. STATISTICAL METHODS
Let Γ = {1, . . . , p} be the set of nodes of the graph. The p nodes of the graph are identified with p Gaussian random variables. Let us denote by X = (X_1, . . . , X_p)^T a random vector of dimension p distributed as a multivariate Gaussian N(0, Σ). For m a subset of {1, . . . , p} with cardinality |m|, we denote by X_m the random vector of dimension |m| whose components are the variables X_c, where c ∈ m. Moreover, we denote by Γ_{−m} the set of nodes that are not in m, Γ_{−m} = Γ \ m, and by X_{−m} the random vector of dimension p − |m| whose components are the variables X_c, where c ∈ Γ_{−m}. There exists an edge between nodes a and b if and only if the random variables X_a and X_b are not independent conditionally on X_{−{a,b}}. In other words, assuming that the matrix Σ is nonsingular, there exists an edge between nodes a and b if and only if the component (a, b) of the concentration matrix K = Σ^{−1} is non-zero. These graphs are called concentration graphs or full conditional independence graphs. For each node a, the set of neighbors of a is defined as the set of nodes in Γ_{−{a}} that are connected with a. Finally, let us denote by E the set of edges of the graph.
The statistical challenge is to detect the edges of the graph on the basis of an n-sample from a multivariate distribution N(0, Σ). For each i = 1, . . . , n we denote by X_i = (X_{i1}, . . . , X_{ip}) the i-th observation. When the number of observations n is large enough, at least n ≥ p + 1 in order to guarantee that the sample covariance matrix S is nonsingular, several methods are available; a detailed review can be found in a recent paper by Drton & Perlman (2007). However, when the interest lies in genomic networks, we are generally dealing with data where the number of variables p is large and the number of experiments n is small. Several methods have been proposed recently in that context.
Some of these methods aim at estimating the concentration matrix K. For instance, Schäfer & Strimmer (2005b,a) proposed methods based on bagging or shrinkage in order to stabilize either the estimator of P, the correlation matrix associated with Σ, or the estimator of Π, the partial correlation matrix. Then they estimate the probability of an edge between two nodes (a, b) by estimating the density of the estimated partial correlation coefficient. More recently, some authors (Yuan & Lin, 2007; Banerjee et al., 2008; Huang et al., 2006; Friedman et al., 2007) proposed algorithms to estimate K by maximizing the penalized log-likelihood, the penalty term being proportional to the sum of the absolute values of the components of K. The coefficient of proportionality may be chosen so as to control the probability of error in estimating the graphical model.
Other methods are based on the estimation of a graph that is an approximation of the full conditional independence graph. Wille & Bühlmann (2006) suggested estimating a lower-order conditional independence graph in place of the full conditional independence graph; they use a multiple testing procedure for detecting edges. Kalisch & Bühlmann (2007) considered the PC-algorithm (Spirtes et al., 2000) to estimate a graph defined through conditional dependencies on any subset of the variables. The PC-algorithm starts from the complete graph and recursively deletes edges based on conditional independencies.
We finally consider a third kind of method, based on neighborhood estimation. Meinshausen & Bühlmann (2006) proposed to estimate the neighbors of each node using a model selection procedure based on the LASSO method. The choice of the penalty parameter allows one to control the probability of falsely joining distinct connectivity components of the graph. More recently, Giraud (2008) suggested estimating graphs using a model selection procedure based on a penalized empirical risk. The procedure controls the mean square error of prediction, and its performances are established in a non-asymptotic setting.
In the next section we briefly describe these methods, specifying their theoretical properties if any.
2.1. Estimating the concentration matrix.
2.1.1. Bagging or shrinkage for improving the covariance estimator. Schäfer and Strimmer proposed to use bagging (Schäfer & Strimmer, 2005a) or shrinkage (Schäfer & Strimmer, 2005b) for obtaining accurate and reliable estimates of the covariance matrix Σ or its inverse K.
The bagging approach. Bootstrap aggregation (bagging) is used in order to reduce the variance of the estimator of the correlation matrix P. For each bootstrap sample, the empirical correlation matrix P̂ is calculated. The bagged estimator is the empirical mean of the P̂'s over the bootstrap samples. The partial correlation matrix Π is estimated from the pseudo-inverse of the bagged correlation matrix estimator and is denoted by Π̂_bagged.
The shrinkage approach. The shrinkage estimator is a linear combination of the empirical covariance matrix S and of a target estimator T chosen for its very low variability. Precisely, Σ̂(λ) = λT + (1 − λ)S, where the parameter λ is chosen so as to minimize the quadratic risk function defined as R(λ) = E{Σ_a Σ_b (Σ̂_{a,b}(λ) − Σ_{a,b})²}. The parameter λ can be explicitly calculated and is estimated using the data only. Let λ̂ be this estimator. The partial correlation matrix Π is estimated by Π̂_shrinked from the inverse of the matrix Σ̂(λ̂).
Estimating the graph. It remains to define a decision rule for detecting the significant components of Π. Let us denote by Π̂ either Π̂_bagged or Π̂_shrinked. Schäfer and Strimmer assume that the distribution of the Π̂_{a,b}'s is known up to some parameters that are estimated. They deduce from this estimator the posterior probability for an edge to be present in the graph and decide to keep the edges whose posterior probability is greater than a given threshold 1 − α.
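In practice this decision rule is available in the GeneNet package (the successor of GeneTS); a hedged usage sketch, with function names as in recent versions of GeneNet and the threshold 1 − α fixed at 0.95, could read:

    library(GeneNet)
    pc  <- ggm.estimate.pcor(X)                     # shrinkage estimate of the partial correlations
    et  <- network.test.edges(pc)                   # posterior probability of each possible edge
    net <- extract.network(et, cutoff.ggm = 0.95)   # keep edges with posterior probability > 0.95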
2.1.2. Penalized maximum likelihood. Banerjee et al. (2008) considered the problem of estimating the parameters of a Gaussian distribution by solving a maximum likelihood problem with an added ℓ1-norm penalty term. Precisely, they proposed to estimate the inverse covariance matrix K by maximizing, with respect to Ω in the set of positive definite matrices, the following criterion:

C(Ω, λ) = log(det Ω) − trace(SΩ) − λ Σ_a Σ_b |Ω_{a,b}|.

Friedman et al. (2007) recently proposed an efficient algorithm to estimate K by showing that this optimisation problem amounts to recursively solving and updating a LASSO regression problem. For a given parameter λ, let us denote by K̂(λ) the estimator of K. The set of pairs (a, b) such that K̂_{a,b}(λ) is non-zero constitutes the set of edges of the graph. Banerjee and collaborators proposed a choice of λ for which the probability of connecting two distinct connectivity components of the graph is bounded by some α. Precisely,

(1)   λ(α) = T⁻¹_{n−2}(1 − α/(2p²)) / √(n − 2 + [T⁻¹_{n−2}(1 − α/(2p²))]²),

where T_{n−2} is the distribution function of a Student variable with n − 2 degrees of freedom.
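As an illustration, a minimal R sketch combines Equation (1) with the glasso package (the data matrix X is assumed centered with unit-variance columns, as this choice of λ presumes):

    library(glasso)
    n <- nrow(X); p <- ncol(X); alpha <- 0.05
    t.alpha <- qt(1 - alpha / (2 * p^2), df = n - 2)   # Student quantile of Equation (1)
    lambda  <- t.alpha / sqrt(n - 2 + t.alpha^2)       # penalty level lambda(alpha)
    fit <- glasso(cov(X), rho = lambda)                # penalized estimates of Sigma and K
    A   <- fit$wi != 0; diag(A) <- FALSE               # edges: non-zero entries of K-hat(lambda)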
2.2. Approximation of the concentration graph.
2.2.1. The 0-1 conditional independence graph. Wille and Bühlmann proposed to infer the first-order conditional independence graph instead of the full conditional independence graph. Their method has nice computational properties, but the drawback is that 0-1 conditional independence graphs do not generally coincide with concentration graphs, though links between the two graphs can be established in some cases. The 0-1 conditional independence graph is defined as follows: for each pair of nodes (a, b), let R_{ab|∅} be the correlation between the variables X_a and X_b, and for each c ∈ Γ_{−{a,b}}, let R_{ab|c} be the correlation between X_a and X_b conditionally on X_c; there exists an edge between nodes (a, b) if R_{ab|∅} ≠ 0 and R_{ab|c} ≠ 0 for all c ∈ Γ_{−{a,b}}, or equivalently if

(2)   φ_{a,b} = min{|R_{ab|c}|, c ∈ Γ_{−{a,b}} ∪ {∅}}

is non-zero. Therefore, detecting the edges of the graph amounts to testing p(p−1)/2 statistical hypotheses: for each (a, b), 1 ≤ a < b ≤ p, there exists an edge between nodes (a, b) if the hypothesis "φ_{ab} = 0" is rejected. Wille and Bühlmann propose the following testing procedure: for each (a, b) and c ∈ Γ_{−{a,b}} ∪ {∅}, the likelihood ratio test statistic of the hypothesis "R_{ab|c} = 0" is calculated, as well as the corresponding p-value, denoted P(a, b|c). Then the hypothesis "φ_{ab} = 0" is rejected at level α if

Pmax(a, b) = max{P(a, b|c), c ∈ Γ_{−{a,b}} ∪ {∅}} ≤ α.

It remains to calculate the adjusted p-values to take into account the multiplicity of the hypotheses to test, considering for example the Bonferroni procedure or the Benjamini-Hochberg one.
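The procedure is easy to program; the following R sketch (our own code, not the authors' implementation) uses the standard likelihood-ratio statistic −n log(1 − r²) for a zero correlation, compared to a χ² distribution with one degree of freedom (see the next paragraph), followed by the Benjamini-Hochberg adjustment:

    wb.graph <- function(X, alpha = 0.05) {
      n <- nrow(X); p <- ncol(X); R <- cor(X)
      pval <- function(r) pchisq(-n * log(1 - r^2), df = 1, lower.tail = FALSE)
      Pmax <- matrix(NA, p, p)
      for (a in 1:(p - 1)) for (b in (a + 1):p) {
        pv <- pval(R[a, b])                          # conditioning set c = empty set
        for (c in setdiff(1:p, c(a, b))) {           # all first-order conditioning sets
          r.abc <- (R[a, b] - R[a, c] * R[b, c]) /
            sqrt((1 - R[a, c]^2) * (1 - R[b, c]^2))  # partial correlation R_ab|c
          pv <- max(pv, pval(r.abc))                 # Pmax(a, b)
        }
        Pmax[a, b] <- pv
      }
      keep <- p.adjust(Pmax[upper.tri(Pmax)], "BH") <= alpha   # Benjamini-Hochberg
      E <- matrix(FALSE, p, p); E[upper.tri(E)] <- keep
      E | t(E)                                       # symmetric incidence matrix
    }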
Considering 0-1 conditional independence in place of full conditional independence has several advantages. The test statistics are very easy to calculate: for each hypothesis "R_{ab|c} = 0" to test, one considers the marginal distribution of the 3-dimensional Gaussian vector (X_a, X_b, X_c)^T. Therefore, provided that n is large enough, the distribution of the likelihood ratio test statistic of the hypothesis "R_{ab|c} = 0" is well approximated by the distribution of a χ² with one degree of freedom. Note that it is not necessary to assume that p is small. It follows that, for each (a, b), the probability of detecting an edge between a and b when it does not exist is smaller than α if n is large (see Proposition 3 in Wille & Bühlmann (2006)). Moreover, it can be shown that if p increases with n in such a way that log(p)/n tends to 0 as n tends to infinity, then the estimators of the R_{ab|c}'s are uniformly convergent for all a, b ∈ Γ and c ∈ Γ_{−{a,b}}.
Castelo & Roverato (2006) and Malouche & Sevestre (2007) proposed a similar approach for estimating "up to q"-order conditional independence graphs, where the presence or absence of edges is associated with all marginal distributions up to order q. We only present the method proposed by Wille and Bühlmann in our simulation study.
2.2.2. The strong conditional independence graph. Let us consider graphs defined as follows: there exists an edge between nodes a and b if and only if, for every set of nodes m ⊂ Γ_{−{a,b}}, the random variables X_a and X_b are not independent conditionally on X_m. This graph is a subset of the full conditional independence graph and will be called the strong conditional independence graph.
Such graphs can be estimated using an iterative procedure called the PC-algorithm, which proves to be computationally very fast for sparse graphs. The procedure starts with the complete graph and removes edges with zero-order conditional independence relations. Then edges with first-order conditional independence relations are removed, and so on. At each step s, let us denote by E_s the set of edges and, for each node a, by V_a^s the set of neighbors of a. At step s + 1, we only need to consider the ordered pairs of nodes (a, b) ∈ E_s such that the cardinality of V_a^s is strictly greater than s. For each of these pairs (a, b), the procedure consists in keeping an edge between nodes a and b if X_a and X_b are not independent conditionally on X_m for all subsets of nodes m contained in V_a^s with cardinality equal to s + 1.
Kalisch & Bühlmann (2007) considered a sample version of the PC-algorithm as follows: the testing procedure for deciding whether to keep an edge between nodes a and b at step s consists in testing, for each subset of nodes m to be considered, that the correlation between X_a and X_b conditionally on X_m is zero. The test statistic is based on Fisher's z-transform of the sample partial correlations R̂_{ab|m}. Precisely,

Z_{a,b|m} = (1/2) log[(1 + R̂_{ab|m}) / (1 − R̂_{ab|m})],

and for some α > 0, the null hypothesis is rejected if √(n − |m| − 3) |Z_{a,b|m}| > Φ⁻¹(1 − α/2), where |m| denotes the cardinality of m and Φ the distribution function of a centered Gaussian variable with unit variance. The edge between nodes a and b is removed at step s of the algorithm if there exists m with cardinality s such that the test is not rejected.
Under some conditions on the distribution of X, the estimated graph is a consistent estimate of the strong conditional independence graph. The asymptotic framework considers sparse graphs of high dimension: as n tends to infinity, the maximum number of neighbors tends to infinity more slowly than n, while the number of nodes p may grow like any power of n, and the parameter α has to tend to zero.
For practical purposes, the choice of the parameter α is an open problem. Kalisch & Bühlmann (2007) discussed this point on the basis of a simulation study for estimating the skeleton of a directed acyclic graph.
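For illustration, the individual test can be written in a few lines of R, and the whole algorithm is available in the pcalg package (a hedged sketch; argument names may differ across package versions):

    # Fisher z-test of "R_ab|m = 0": r is the sample partial correlation, k = |m|
    fisher.z.reject <- function(r, n, k, alpha = 0.05) {
      z <- 0.5 * log((1 + r) / (1 - r))
      sqrt(n - k - 3) * abs(z) > qnorm(1 - alpha / 2)   # TRUE: reject independence
    }

    library(pcalg)
    fit <- pc(suffStat = list(C = cor(X), n = nrow(X)),
              indepTest = gaussCItest, p = ncol(X), alpha = 0.05)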
2.3. Estimating the neighbors.
2.3.1. LASSO procedure. Detecting the neighbors of all nodes leads to detecting the edges of the graph. Because of the Gaussian assumption on the distribution of X, for each variable X_a a conditional regression model can be defined as follows:

(3)   X_a = Σ_{b ∈ Γ_{−{a}}} θ_{a,b} X_b + ε_a,

where the parameters θ_{a,b} are equal to −K_{a,b}/K_{a,a}. The variable ε_a is distributed as a centered Gaussian variable and is independent of the X_b's for all b ∈ Γ_{−{a}}. Meinshausen & Bühlmann (2006) proposed to detect the non-zero coefficients of the regression of X_a on the variables X_b, b ∈ Γ_{−{a}}, on the basis of the n-sample (X_1, . . . , X_n), using the LASSO method as a model selection procedure. Precisely, for a given smoothing parameter λ, the estimators of {θ_{a,b}, b ∈ Γ_{−{a}}} minimize the sum of squares penalized by the ℓ1-norm of the parameter vector:

(4)   Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ_{−{a}}} θ_{a,b} X_{ib})² + λ Σ_{b ∈ Γ_{−{a}}} |θ_{a,b}|.

The solution to this minimization problem is a set of estimators (θ̂_{a,b}, b ∈ Γ_{−{a}}) that are either equal to zero or not. The set of nodes b ∈ Γ_{−{a}} such that θ̂_{a,b} is non-zero constitutes V̂_a, the estimated set of neighbors of the node a. Two estimated graphs may be deduced from the V̂_a's, a = 1, . . . , p, depending on whether we decide to put an edge between nodes a and b if both θ̂_{a,b} and θ̂_{b,a} are non-zero, or if at least one of them is non-zero.
Meinshausen and Bühlmann proved that, under some conditions ensuring that the signal-to-noise ratio is not too small, the method is consistent, namely the probability that V̂_a is exactly equal to V_a tends to one. The asymptotic framework is similar to the one considered by Kalisch & Bühlmann (2007): sparse graphs of high dimension. The smoothing parameter λ is assumed to decrease to zero at a rate slower than n^{−1/2}.
For the sake of application, they propose a choice of λ such that the probability of connecting two distinct connectivity components of the graph is bounded by some α. Precisely,

(5)   λ = 2 √(Σ_{i=1}^n X²_{ia}) Φ⁻¹(1 − α/(2p²)).

This choice is based on the Bonferroni inequality, and it assumes that the variances of the variables X_a, a = 1, . . . , p, are all equal to one.
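A sketch of the whole procedure in R with the lars package (our own code, not the original implementation; note that the internal scaling of the penalty in lars may differ from Equation (5) by a constant factor, so the line computing lambda is indicative). X is assumed centered with unit-variance columns:

    library(lars)
    mb.graph <- function(X, alpha = 0.05, variant = c("and", "or")) {
      variant <- match.arg(variant)
      n <- nrow(X); p <- ncol(X)
      Theta <- matrix(0, p, p)
      for (a in 1:p) {
        lambda <- 2 * sqrt(sum(X[, a]^2)) * qnorm(1 - alpha / (2 * p^2))  # Equation (5)
        fit <- lars(X[, -a], X[, a], type = "lasso",
                    intercept = FALSE, normalize = FALSE)
        Theta[a, -a] <- predict(fit, s = lambda, type = "coefficients",
                                mode = "lambda")$coefficients
      }
      A <- Theta != 0
      if (variant == "and") A & t(A) else A | t(A)   # the two symmetrization rules
    }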
2.3.2. Model selection procedure. Giraud (2008) considered the problem of estimating, by a model selection procedure, the non-zero θ's occurring in the p regression models defined in Equation (3). The procedure starts with the choice of a collection of graphs with p nodes, or equivalently of a collection of sets of edges, denoted {E_1, . . . , E_L}, where L is the cardinality of the collection. For the sake of simplicity, we say that a p × p matrix Ω is compatible with E if Ω_{a,a} = 0 and if Ω_{a,b} = 0 is equivalent to (a, b) ∉ E. For each set E in the collection, the parameters θ are estimated by minimizing the residual sum of squares: θ̂(E) is the p × p matrix compatible with E that minimizes

SCR(Ω) = Σ_{a ∈ Γ} Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ} Ω_{a,b} X_{ib})².

The choice of the best graph among {E_1, . . . , E_L} is done by selecting the estimator θ̂(E) that minimizes the following criterion:

Crit(E) = Σ_{a ∈ Γ} q(K, ν_a(E)) Σ_{i=1}^n (X_{ia} − Σ_{b ∈ Γ} θ̂_{a,b}(E) X_{ib})²,
where ν_a(E) is the number of neighbors of node a in the graph associated with E, K is a constant greater than 1, and q is a penalty function given in Giraud (2008). We denote by θ̂ this estimator.
The theoretical properties of the method are given in a non-asymptotic framework with n < p. The graph is assumed to be sparse in the following sense: the maximum number of neighbors over all the nodes of the graph, denoted D, must be smaller than a quantity of the order n/(2 log p). Under this assumption, it is proved that the mean square error of prediction of the estimator, MSEP(θ̂), is bounded above, up to a log p factor, by a quantity close to the minimum over E of the mean square error of prediction of θ̂(E).
In practice a collection of graphs has to be chosen. For example, one can choose the set of all graphs with at most D edges. Obviously such a choice leads to a very high computational cost for large values of p.
3. SIMULATIONS
3.1. Methods of simulation.
3.1.1. Simulating a graph. Graphs were simulated according to two
different approaches.
The first approach is based on the Erdős-Rényi model, denoted ER model, which assumes that edges are independent and occur with the same probability. In practice, we fix the number of nodes p and the percentage of edges η, then we draw the number of edges according to a binomial distribution with parameters p(p−1)/2 and η. Next we choose the positions of the edges uniformly and independently.
The second approach was proposed by Daudin et al. (2006) to take into account topological features of biological networks such as connectivity degree or clustering coefficient. Their model, called Erdős-Rényi Mixtures for Graphs and denoted ERMG, supposes that the nodes are spread into Q clusters with probabilities {p_1, . . . , p_Q}, and that the connection probabilities within and between clusters are heterogeneous. These connection probabilities constitute the connectivity matrix C. The parameters available from the Daudin et al. (2006) study correspond to a graph with 199 nodes. As we wanted to study the influence of p, we adapted those parameters to our simulations. However, we kept the same graph structure by taking a large weakly connected cluster, a small highly connected cluster and the same group connection structure. Thus we used the following parameter values:

(6)   Q = 4,   (p_1, . . . , p_Q) = (0.07, 0.10, 0.18, 0.65),

(7)   C = [ 0.999   10⁻⁶    10⁻⁶    0.005
            10⁻⁶    0.4     0.014   0.003
            10⁻⁶    0.014   0.2065  0.011
            0.005   0.003   0.011   0.013 ].

This leads to a mean percentage of edges η equal to 2.5%.
Whatever the approach, we finally obtain a matrix of 0's and 1's, the 1's indicating the edge positions in the corresponding graph. This matrix is called the incidence matrix.
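As an illustration, the ER simulation of an incidence matrix can be written in a few lines of R (a sketch; the function name is ours):

    simulate.er <- function(p, eta) {
      A  <- matrix(0, p, p)
      ut <- which(upper.tri(A))                 # the p(p-1)/2 possible edges
      n.edges <- rbinom(1, length(ut), eta)     # binomial number of edges
      A[sample(ut, n.edges)] <- 1               # uniform, independent edge positions
      A + t(A)                                  # symmetric incidence matrix
    }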
3.1.2. Simulating the data. From the incidence matrix of a given graph, we simulated n observations as follows. First we generate a partial correlation matrix Π by replacing the values 1 indicating the edge positions in the incidence matrix by values drawn from the uniform distribution between −1 and 1. Then we compute the column-wise sums of the absolute values and set the corresponding diagonal elements equal to these sums plus a small constant. This ensures that the resulting matrix is diagonally dominant and thus positive definite. Next we standardize the matrix so that each diagonal entry equals 1. Finally, we generate n independent samples from the multivariate normal distribution with mean zero, unit variances, and the correlation structure associated with the partial correlation matrix Π.
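One way to implement these steps in R (a sketch under the assumptions above; the function name and the value of the small constant eps are ours):

    library(MASS)   # for mvrnorm()

    simulate.data <- function(A, n, eps = 0.1) {
      p <- ncol(A)
      K <- A * matrix(runif(p * p, -1, 1), p, p)   # fill the edges with uniform values
      K[lower.tri(K)] <- t(K)[lower.tri(K)]        # keep the matrix symmetric
      diag(K) <- colSums(abs(K)) + eps             # diagonal dominance
      K <- K / sqrt(outer(diag(K), diag(K)))       # standardize: unit diagonal
      Sigma <- cov2cor(solve(K))                   # correlation matrix associated to Pi
      mvrnorm(n, mu = rep(0, p), Sigma = Sigma)    # n Gaussian observations
    }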
3.2. Simulation setup. We simulated graphs and data for different values of p, η and n, and we estimated graphs from these data using different methods. We first review the methods and the way we carried them out; then we present how we assessed their performances.
3.2.1. Methods. The methods for which we present simulations are the following:
- the Π̂_bagged and Π̂_shrinked methods proposed by Schäfer & Strimmer (2005a,b), with the decision rule based on posterior probabilities. The threshold 1 − α is fixed at 0.95. Both methods are implemented in R (GeneTS package, R-2.2.0; GeneNet package, R-2.4.1).
- the glasso proposed by Friedman et al. (2007), with α = 5% in accordance with Banerjee et al. (2008). This method is implemented in R (glasso package, R-2.4.1).
- the 0-1 conditional independence graph approach proposed by Wille & Bühlmann (2006), with the decision rule based on the adjusted p-values following the Benjamini-Hochberg procedure, taking α = 5%. We implemented the method in R-2.4.1.
- the PC-algorithm, as proposed by Kalisch & Bühlmann (2007), with α = 5%. This method is implemented in R (pcalg package, R-2.6.1).
- the LASSO approach, with the two variants and and or proposed by Meinshausen & Bühlmann (2006), and α = 5%. This method is implemented in R using the lars package (R-2.4.1); part of the algorithm is implemented according to the description given in Section 6.
- the model selection approach proposed by Giraud (2008), taking K = 3 in the penalty function as suggested by the author to better control the FDR. The method, implemented in R-2.4.1, was kindly provided by the author. To save computational time, the collection of graphs was a subset of the set of all graphs with at most 3 neighbors per node.
In the remainder of this document we will denote these methods by bagging, shrinkage, glasso, pcAlgo, WB, MB.and, MB.or and KGGM, respectively.
3.2.2. Assessing the performance of the methods. To assess the performance of the investigated methods, we compared each simulated graph with the estimated graph by counting true positives TP (correctly identified edges), false positives FP (wrongly detected edges), true negatives TN (correctly identified zero-edges), and false negatives FN (unrecognized edges). From these quantities we estimated the power and the false discovery rate FDR, defined by:

power = E[TP / (TP + FN)],
FDR = E[FP / (TP + FP) | (TP + FP) > 0].
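The counting itself reduces to comparing two incidence matrices; a minimal R sketch (function name ours):

    edge.counts <- function(E.true, E.hat) {
      ut <- upper.tri(E.true)                       # each edge counted once
      tp <- sum(E.hat[ut] == 1 & E.true[ut] == 1)   # correctly identified edges
      fp <- sum(E.hat[ut] == 1 & E.true[ut] == 0)   # wrongly detected edges
      fn <- sum(E.hat[ut] == 0 & E.true[ut] == 1)   # unrecognized edges
      c(power = tp / (tp + fn),
        FDR   = if (tp + fp > 0) fp / (tp + fp) else NA)
    }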
The power and FDR values presented in this work are means over 2000 simulations (our preliminary results showed that the stability of the FDR estimation was reached with 2000 simulations).
The performance of the methods was evaluated for several combinations of the parameters p, η and n, according to the questions we wanted to investigate. Moreover, the parameter values were chosen both to keep the computing time reasonable and to allow extrapolation of the results to biological applications.
The first question we focused on is the influence of the sample size. To this aim, we simulated random graphs fixing the number of nodes p equal to 30 and η equal to 2.5%, and varying the number of observations n in {15, 22, 30, 60}. Secondly, we investigated the sparsity assumption common to all methods, taking η in {0%, 2.5%, 4%, 5%, 10%}. Third, we were interested in the influence of the number of nodes p: we increased p and chose n so as to keep the p/n ratios similar to the values used in the first study. For these three studies, graphs were simulated with the ER method.
The fourth question we investigated concerns the influence of the graph structure. To this end, we also simulated graphs with the ERMG method, fixing p equal to 30 and varying n in {15, 22, 30, 60}.
Finally, we focused on the method proposed by Wille and Bühlmann to evaluate the consequences of estimating the 0-1 graph instead of the concentration graph. For this purpose we fixed p = 30 and varied η from 0.025 to 0.2 and n from 60 to 1200.
3.3. Results and discussion.
3.3.1. Comparing the methods. As shown in Figure 1, the methods behave very differently. Let us first discuss the methods presenting high FDR values.
Comments on the shrinkage, glasso and pcAlgo methods. The FDR values for these methods are very high for all considered values of n when p = 30 and η = 2.5%, as shown in Figure 1a. The FDR does not vary with n and remains close to 47% and 30% for glasso and pcAlgo respectively, while it increases with n from 45% for n = 15 to 75% for n = 60 for shrinkage. When η = 0, the FDR is small for the shrinkage and glasso methods, while it equals 1 for pcAlgo (at least one edge detected in each simulation, see Figure 1c). The high FDR values are associated with high power values. When the graph is sparse enough, say η smaller than 5%, the methods are powerful (Figures 1b and 1d), particularly glasso: its power varies from 97% to 99% when η = 2.5% and n varies from 15 to 60. This result suggests that it may be of interest to look for better choices of the thresholding parameter α. This will be the object of Section 3.3.2.
Comments on the bagging, WB, MB and KGGM methods. For these methods the FDR values never exceed 6%, except for the bagging method with n = 15. The FDR values obtained with MB.or remain steady around 5.5%, whereas the FDR values obtained with MB.and never exceed 1%. KGGM behaves similarly to MB.or, with slightly smaller FDR and power values. The FDR of bagging reaches 18% when n = 15, then drops below 3%. The power, represented in Figure 1b, gradually increases with the number of observations n, except for the bagging method, which shows a drop at n = p. This phenomenon was commented on by the authors in Schäfer & Strimmer (2005a). Let us note that MB.or and MB.and do not work when n = 15. This is due to the fact that when n, α, p satisfy Equation (11), no edge will be detected whatever the data.

Figure 1. FDR and power of the different methods tested (WB, MB.and, MB.or, bagging, KGGM, shrinkage, glasso, pcAlgo), as functions of the sample size (plots a and b, respectively) and of the edge percentage (plots c and d, respectively). Plots a and b were obtained with η = 0.025; plots c and d with n = 22. All plots correspond to p = 30.
The influence of the edge percentage η is shown in Figures 1c and 1d for n = 22. When η is in {2.5%, 4%, 5%}, the FDR values, shown in Figure 1c, stay under 1% with the WB and MB.and methods, around 5% for KGGM, and exceed 5% with bagging and MB.or. For all methods the power falls dramatically as η increases and is close to 0 when η equals 10%, whatever the method used. Similar graphics were obtained for n = 30 and n = 60. When η = 0, the FDR values lie between 0 for the MB methods and 2.4% for WB.
Considering the reliability feature (low FDR), the results presented in Figure 1 reveal that the MB.and and WB methods perform quite well in all cases. Referring to the power, the MB.or and KGGM methods outperform the others. The MB.and and WB methods are less powerful, with the advantage of producing smaller FDR values. The bagging method appears to be the least competitive in terms of power. All methods show a strong decrease of power when η increases, in accordance with the sparsity assumption.
3.3.2. Focus on the high FDRs. We have seen previously that the FDR values of the shrinkage, glasso and pcAlgo methods were very high. This behavior may be due to a bad choice of the thresholding parameter α occurring in each of these methods. Hence it is worthwhile to verify whether a more severe thresholding reduces the FDR while keeping a good power. Therefore, we estimated by simulation the power and the FDR for decreasing values of the thresholding parameter, using p = 30, η = 0.025 and n varying from 15 to 60.
The curves of power versus FDR are shown in Figure 2 for the shrinkage and glasso methods and n = 22. For the sake of comparison, we represent the corresponding curve for the MB.or method on the same graphic. This graphic shows that we cannot both reduce the FDR and keep a good power with the shrinkage and glasso methods. When the FDR equals 5%, the powers of MB.or, glasso and shrinkage are respectively equal to 0.86, 0.47 and 0.05. These values are obtained with α equal to 1% for MB.or and α equal to 10⁻¹² for shrinkage. For glasso, FDR values smaller than 0.45 could not be obtained by varying α: indeed, when α equals 10⁻¹², λ given by Equation (1) equals 0.981, the FDR equals 0.46 and the power 0.95. We therefore carried out the glasso procedure varying the values of λ directly; we obtained an FDR of 0.05 with λ = 0.9996.
When η increases (η = 4%, 5%, 10%) or when p is taken equal to 60, these two methods behave in the same way (results not shown). They therefore cannot be used for estimating graphs if one wants to control the FDR at a value around 5%, so we did not keep the shrinkage and
glasso methods for further studies.
For the pcAlgo algorithm, the power decreases with α, while the FDR is not a monotone function of α, as shown in Figure 3. Indeed, the pcAlgo algorithm is a stepwise procedure, and at each step only nodes for which the estimated neighborhood is large enough are involved in the next step. If α is too small, not enough nodes are kept for the following step, so edges may appear between two nodes even though they are linked through a dropped node. One interesting feature of the variation of the FDR versus α is that the FDR is minimum at α = 0.1% whatever the value of n (Figure 3). Simulations (not shown) with η = 4% lead to the same result: the FDR is a convex function of α and is minimum at α = 0.1% whatever the value of n. We therefore tested the performances of the pcAlgo method again using α = 0.1% for different values of n and η. The FDR decreases from 8.4% to 1.6% when n increases (Figure 4a). It equals 47% when η = 0, remains around 5% when η varies between 2.5% and 5%, and equals 8.8% when η = 10% (Figure 4b). Concerning the power, pcAlgo behaves nearly like the WB method.
3.3.3. Influence of the number of nodes. In Section 3.3.1 we showed that all methods lose power when η increases. We now investigate the influence of the number of nodes p on that loss of power. The graphic in Figure 5 represents the power as a function of η for different values of p, with n = p. Results are shown for the MB.and procedure, the behavior of the other procedures being similar. In all cases the FDR is smaller than 1%. Figure 5 shows that, whatever the value of p, the power decreases when η increases. However, the larger p, the faster the loss of power with η. Consequently, all methods are efficient for sparse graphs, and the edge percentage from which the methods fail depends on the number of nodes.
3.3.4. Influence of the number of neighbors. In Section 3.3.1 we showed that if the graph is highly connected, the methods are no longer powerful. In this section we aim at understanding why, and we show in particular the behavior of the methods according to the number of neighbors of the nodes. We focus on the procedure proposed by Meinshausen and Bühlmann, and we consider the simulation experiment with p = 30, η = 0.025 and n = 30. For each node of the 2000 simulated graphs we count the number of neighbors and the number of correctly detected neighbors. In Table 1 we present, for i in {1, . . . , 5}, the number n_i of nodes with i neighbors and the percentage p_{i,j} of nodes for which the method has correctly detected exactly j neighbors, for j in {0, . . . , i}.

Figure 2. Power as a function of FDR for the glasso and shrinkage methods. The curve for the MB.or method is given as a reference. Plots correspond to p = 30, n = 22 and η = 0.025.

The percentage (p_{ii})_{i=1,...,5} of nodes for which the whole set of neighbors is correctly detected decreases when the number of neighbors i increases. In other words, when a node has several neighbors, it often happens that at least one neighbor is not detected. This may explain the loss of power previously observed (see Section 3.3.1) when η increases, because the average number of neighbors increases with η.

Figure 3. Influence of α on FDR and power (plots a and b, respectively) for different values of n. The level α = 0.001 is indicated by the dashed line. Plots correspond to p = 30 and η = 0.025.

Figure 4. Performance of pcAlgo for α = 0.001: a) FDR and power as functions of the sample size for η = 0.025; b) FDR and power as functions of the edge percentage for n = 22. Plots correspond to p = 30.

Let us now compare the results obtained with the MB.and and MB.or procedures. In Section 3.3.1 we showed that the MB.or procedure is more powerful, and we recover in Table 1 that the percentages of nodes for which the whole set of neighbors is correctly detected are significantly larger with the MB.or procedure than with the MB.and procedure. Consider for example, as illustrated in Figure 6, a node a with two neighbors b and c such that a is the only neighbor of b and of c. As noticed just before, the procedure of Meinshausen and Bühlmann will more easily detect that a is the neighbor of b and of c than that the nodes b and c are both neighbors of a. This is the reason why the MB.or procedure is more powerful than the MB.and procedure.

Figure 5. Power according to the edge percentage, for different values of p, with n = p, for the Meinshausen and Bühlmann method using its and variant.
    i        1      2      3     4     5
    n_i  21398   7684   1788   287    38

    Percentage p_{i,j} obtained with MB.and:

    j \ i      1      2      3      4      5
    0      0.182  0.078  0.062  0.066  0.105
    1      0.818  0.643  0.465  0.380  0.263
    2             0.279  0.414  0.411  0.500
    3                    0.059  0.139  0.132
    4                           0.004  0.000

    Percentage p_{i,j} obtained with MB.or:

    j \ i      1      2      3      4      5
    0      0.022  0.016  0.030  0.049  0.079
    1      0.978  0.267  0.097  0.063  0.105
    2             0.717  0.370  0.195  0.079
    3                    0.503  0.387  0.132
    4                           0.306  0.447
    5                                  0.158

Table 1. Number of nodes with i neighbors and percentages of nodes for which exactly j neighbors have been correctly detected by the two methods of Meinshausen and Bühlmann, with n = 30. Graphs are simulated according to the ER model with p = 30, η = 0.025.
Figure 6. Node a with two neighbors b and c, such that a is the only neighbor of b and of c.
3.3.5. Influence of the graph structure. In this section we present results for graphs simulated according to the ERMG model described in Section 3.1.1. Our aim is to evaluate the influence of heterogeneous clusters in the graph. Results are shown in Figure 7 for p = 30, for n taking the values 15, 22, 30 and 60, and for all the methods chosen for their low FDR. The parameter α for the pcAlgo method was taken equal to 0.1%, in accordance with the results given in Section 3.3.2. For the parameters given in Equations (6) and (7), the percentage η of edges equals 2.5%, which makes the results comparable with those of Figures 1a, 1b and 4.
Using the ERMG model for simulating graphs does not change the shapes of the FDR and power curves. As in Figure 1a, the FDR value obtained with bagging is high when n = 15 and then drops sharply, and the power drops at n = p. Moreover, we recover that the FDR values stay very low with the WB and MB.and procedures, stay under 5% for KGGM, and are larger with the MB.or and pcAlgo procedures. Referring to the power, as in Figure 1b, the MB.or and KGGM procedures outperform the others.
The main difference when graphs are simulated according to the ERMG model is that the power remains under 0.8 even for large n (Figure 7b), whereas it reaches 0.95 when the ER model is used (Figure 1b). Thus, the methods are less powerful when graphs are simulated according to the ERMG model than according to the ER model. The next section sheds light on this loss of power.
Figure 7. FDR (a) and power (b) obtained with the different methods tested, as functions of the sample size. Graphs were simulated according to the ERMG model with p = 30.
3.3.6. Influence of the neighborhood structure. In this section we study why the methods are less powerful when graphs are simulated according to the ERMG model than according to the ER model, and we underline in particular the influence of the neighborhood structure.
We consider the same simulation experiment as in Section 3.3.4, except that the graphs are simulated according to the ERMG model. In Table 2, one can read, for each i in {1, . . . , 6}, the number n_i of nodes with i neighbors and the percentage p_{i,j} of nodes for which the method has correctly detected j neighbors, for j in {0, . . . , i}. Results are obtained with the MB.or procedure.
    i        1      2      3     4     5    6
    n_i  16326   6720   2603   941   332   63

    Percentage p_{i,j} obtained with MB.or:

    j \ i      1      2      3      4      5      6
    0      0.042  0.129  0.289  0.530  0.654  0.841
    1      0.958  0.305  0.206  0.148  0.133  0.048
    2             0.566  0.256  0.128  0.075  0.048
    3                    0.249  0.121  0.057  0.032
    4                           0.073  0.054  0.000
    5                                  0.027  0.032
    6                                         0.000

Table 2. Number of nodes with i neighbors and percentages of nodes for which j neighbors have been correctly detected by the MB.or procedure, with n = 30. Graphs are simulated according to the ERMG model with p = 30.
Comparing Tables 1 and 2 shows that the numbers of nodes with 1 or 2 neighbors are smaller for the ERMG model than for the ER model, while the numbers of nodes with 3 or more neighbors are greater. It also appears that the percentages p_{1,j} are similar in both tables for nodes with one neighbor. But when the number of neighbors i is larger than one, the percentages p_{ii} of nodes for which the whole set of neighbors is correctly detected are smaller in Table 2 than in Table 1. Moreover, the main difference between Tables 1 and 2 concerns the percentage of nodes for which no neighbor is detected: these percentages p_{i,0} are very small in Table 1 but large in Table 2, and they increase with the number of neighbors. In other words, for graphs simulated according to the ERMG model, detecting no neighbor at all happens often, especially for nodes with a large number of neighbors. This can be explained by the structure of the neighborhoods, which is more complex for graphs simulated according to the ERMG model. This point is illustrated below.
In the following we present the FDR and the power estimated within each cluster and between clusters. We first simulate a graph G according to the ERMG model with the parameters defined in Section 3.1.1, in order to fix the number of nodes and the localisation of the edges within each cluster and between clusters. We simulate a graph with p = 120 nodes to ensure that each cluster contains a minimal number of nodes. We denote by (n_1, . . . , n_Q) the numbers of nodes in the clusters and by Nedges the matrix which specifies the number of edges within each cluster and between each pair of clusters. For the simulated graph G, these parameters are:

(n_1, . . . , n_Q) = (7, 11, 23, 79)

and

Nedges = [ 21   0   0   3
            0  21   6   3
            0   6  38  19
            3   3  19  35 ].
We simulate 2000 data matrices from this graph G, as described in Section 3.1.2, and we estimate the FDR and the power for detecting edges within and between clusters. The results obtained with the MB.or procedure and n = p are presented in the matrices FDR and power given in Equations (8) and (9). The component (a, b), a ≠ b, of the matrix FDR (respectively power) gives the estimated false discovery rate (respectively power) for edges between clusters a and b. When there is no edge between two clusters, estimating the power does not make sense and we put NA. The elements on the diagonal give the estimated false discovery rate (respectively power) for edges within each cluster.

(8)   FDR = [ 0.000  0.001  0.016  0.008
              0.001  0.005  0.004  0.012
              0.016  0.004  0.006  0.014
              0.008  0.012  0.014  0.021 ]

(9)   power = [ 0.10   NA    NA   0.46
                 NA   0.26  0.22  0.61
                 NA   0.22  0.29  0.61
                0.46  0.61  0.61  0.87 ]
We can notice from Equation (8) that all the estimated FDR values are small. The estimated powers, however, vary a lot across clusters. Indeed, in the first cluster, which contains 21 edges among the n_1(n_1 − 1)/2 = 21 possible edges, the power is very small, whereas in the fourth cluster, which contains 35 edges among the n_4(n_4 − 1)/2 = 3081 possible edges, the power is large. The neighbors of the neighbors also influence the power. This can be observed by comparing the power for detecting edges between the second and third clusters, power[2,3] = 0.22, with the power for detecting edges within the fourth cluster, power[4,4] = 0.87. It thus appears more difficult to detect edges between clusters 2 and 3 than within cluster 4, while in both cases the percentage of edges to detect is approximately equal to 0.01. This comes from the fact that clusters 2 and 3 are both highly connected; these two clusters therefore involve nodes for which the neighborhood structure is complex.
Because of these highly connected parts, the power estimated over the whole graph G is smaller than if the edges were distributed uniformly in the graph. Indeed, the FDR and the power estimated for the whole graph G equal respectively 0.016 and 0.44, while for graphs simulated according to the ER model with p = 120 and η = 0.025, the average FDR and power estimated over 2000 simulations with the MB.or procedure and n = 120 equal respectively 0.009 and 0.50.
3.3.7. Inferring a concentration graph using a 0-1 conditional independence graph. If the Gaussian distribution is faithful to the concentration graph G (see Proposition 1 in Wille & Bühlmann (2006)), then all edges of G are edges of the 0-1 conditional independence graph, denoted G_{0,1}. A comparison between G and G_{0,1} is given in Table 3. For each concentration matrix whose values are simulated as described in Section 3.1.2, and for each pair (a, b), 1 ≤ a < b ≤ p, we calculated φ_{a,b} defined in Equation (2). It appears that, as already noticed by Wille and Bühlmann, the number of edges in G_{0,1} may be considerably larger than in G. The power and FDR for estimating the graph G are reported in Figure 8, plots a) and b). It shows that the FDR increases with n and reaches its maximum for η = 10%.

    η      |       G ∩ G_{0,1}           |       G_{0,1} \ G
           | Number  mean  range         | Number  mean   range
    0.025  |   11    0.72  [10⁻⁴, 0.99]  |   0.3   0.09   [10⁻⁴, 0.33]
    0.05   |   22    0.57  [10⁻⁵, 0.99]  |  17     0.05   [10⁻⁶, 0.46]
    0.1    |   43    0.35  [10⁻⁷, 0.99]  | 217     0.02   [10⁻⁹, 0.41]
    0.15   |   65    0.24  [10⁻⁹, 0.99]  | 322     0.012  [10⁻⁹, 0.30]
    0.2    |   87    0.18  [10⁻⁸, 0.99]  | 337     0.01   [10⁻⁹, 0.21]
    0.3    |  131    0.11                | 304     0.009

Table 3. Comparison of G_{0,1} and G for p = 30 and several values of η. The column G ∩ G_{0,1} gives the mean (over 2000 simulations) number of edges that are both in G and G_{0,1}, followed by the mean and range of the φ_{a,b}'s corresponding to these edges. The column G_{0,1} \ G gives the same results for edges that are in G_{0,1} and not in G. In all simulations the edges of G are edges of G_{0,1}.
This behaviour is easily explained by looking at Figure 8, plots c) and d), where the FDR and the power for estimating G_{0,1} are reported. It shows that the FDR for estimating G_{0,1} stays very small and that the power increases with n, as expected. Unfortunately, the detected edges of G_{0,1} that are not in G increase the FDR for detecting edges in G.
When η is small, say η ≤ 2.5%, the number of edges that are in G_{0,1} but not in G is very small, and the FDR for detecting edges in G is not changed. But when η is large, this FDR becomes very large, up to 20% for η = 10%. Nevertheless, when η is larger still, the FDR decreases. This can be explained by the values of the φ_{a,b}'s, which are smaller when η increases, as shown in Table 3. Obviously, the behaviour of the procedure proposed by Wille and Bühlmann shown in this simulation study may depend on the way we simulate the concentration matrix. Nevertheless, we have to keep in mind that if the graph, or a part of it, is highly connected, then inferring a concentration graph on the basis of its approximation by a 0-1 conditional independence graph may lead to wrongly detected edges.

Figure 8. FDR and power for estimating G (plots a and b) and G_{0,1} (plots c and d), as functions of η, for p = 30 and different values of n.
4. APPLICATION TO BIOLOGICAL DATA
In this section, we apply the different methods to the multivariate flow cytometry data produced by Sachs et al. (2005). These data concern a human T cell signaling pathway whose deregulation may lead to carcinogenesis. This pathway was therefore extensively studied in the literature, and a network involving 11 proteins and 18 interactions is conventionally accepted (Sachs et al., 2005). This network, which we denote Graf, is represented in Figure 9. Sachs et al.'s data consist of the amounts of these 11 proteins, simultaneously measured in single cells under several perturbed conditions. In the sequel, we focus on one general perturbation (+ ICAM-2) that stimulates the cellular signaling network as a whole. In this condition the quantities of the 11 proteins were measured in 902 cells. Let us denote by D this data set of p = 11 variables and n = 902 observations.
A log-transformation of the data was made to better fit the Gaussian assumption, and the vector of the n observations for each protein was centered and normalized.
Contrary to most post-genomic data, flow cytometry data provide a large sample of observations, which allows us to measure the influence of the sample size on the power of the estimation methods. From this data set we first compare the networks inferred using the five methods retained for their low FDR. As such an abundance of data is rarely available in post-genomic studies, we then carry out a study to determine the influence of the number of observations on the methods.
Figure 9. Graf: the classic signaling network of the human T cell pathway. The connections well established in the literature are in grey; the connections cited at least once in the literature are represented by red dashed lines.
We represent the estimated graphs in Figure 10. The graphs inferred with the bagging, WB and pcAlgo methods are identical; this graph, involving 10 edges, is denoted G1. The KGGM method and the two variants MB.or and MB.and infer the same graph, denoted G2. This graph involves 9 edges and is identical to G1 except for the edge between PKA and Erk1/2, which is missing. To assess the quality of the methods, we refer to the conventionally accepted network shown in Figure 9. This network involves 18 connections, among which 16 are well established. As the data set D is obtained by considering only one perturbed condition, we do not expect the methods to detect all the connections established in the literature. In fact, 10 connections are detected by three of the five methods. Among those connections, nine were well established or cited at least once in the literature. The tenth, between p38 and JNK, was detected by the five methods previously cited. Moreover, the same ten connections were detected by Sachs et al. (2005) (Supplementary Material) applying a Bayesian network analysis. Therefore, in the following, we assume that the graph G1 represents the conditional independence structure of the data set D.

Figure 10. Inferred graphs. The graph G1 estimated with the bagging, WB and pcAlgo methods is represented in blue. The graph G2 estimated with the KGGM, MB.or and MB.and methods is in green dashed lines. The values of the partial correlation matrix associated with the data set D are reported along each edge.
We now investigate the influence of the number of observations n on the power of the methods for estimating the graph G1. We chose n equal to 15, 30, 100, 200 and 300. For each value of n, 2000 n-samples were drawn from D without replacement. From each sample, we estimate graphs using the five methods and compare each estimated graph with the graph G1. We compute the proportion of wrongly detected edges among the detected edges, and the proportion of correctly identified edges among the edges of G1. The means of these quantities over the 2000 simulations are denoted FDR and power. Results are presented in Table 4. As expected, the power of all methods increases with the number of observations n. However, n has to be large in order to detect most of the edges. This comes from the fact that the graph G1 involves 11 proteins and 10 edges, which corresponds to a large percentage of edges (18%). In this study, we notice that the edges Raf–Mek1/2 and Erk1/2–Akt are detected in most of the simulations, even for small n and whatever the method; on the contrary, the edge PKA–Erk1/2 is detected less often. This is in accordance with the values of the partial correlation matrix given in Figure 10: the largest values of the partial correlation matrix correspond to the most often detected edges.
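The subsampling experiment itself is straightforward; for instance, with the sketch functions introduced in Sections 3.1 and 3.2 (our own code, not the original implementation, and assuming G1 is stored as an incidence matrix):

    idx   <- sample(nrow(D), n)                         # n-sample drawn from D without replacement
    G.hat <- mb.graph(scale(log(D[idx, ])), variant = "or")   # log-transformed, centered, normalized
    edge.counts(G1, G.hat)                              # proportions averaged for Table 4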
Let us now compare the methods according to the number of observations at our disposal. When n is small (n = 15), the pcAlgo and KGGM methods are the most powerful, with an FDR around 1%. When n is moderate (n = 30 or n = 100), we advise using the MB.or procedure, because the FDR is small and the gain in power is large. When n is very large, all methods perform quite well in terms of power, though KGGM is slightly less powerful; the FDR obtained with the MB.and procedure being null, this procedure is then recommended.
FDR
 n     bagging   WB       MB.and   MB.or    pcAlgo   KGGM
 15    0.0227    0.0037   0.0007   0.0017   0.0086   0.0106
 30    0.0159    0.0020   0.0011   0.0044   0.0030   0.0051
 100   0.0117    0.0017   0.0001   0.0067   0.0018   0.0051
 200   0.0098    0.0010   0.0000   0.0111   0.0011   0.0068
 300   0.0056    0.0005   0.0000   0.0136   0.0005   0.0056

Power
 n     bagging   WB       MB.and   MB.or    pcAlgo   KGGM
 15    0.27      0.33     0.23     0.26     0.38     0.40
 30    0.43      0.47     0.47     0.62     0.48     0.57
 100   0.68      0.69     0.68     0.77     0.69     0.69
 200   0.79      0.79     0.77     0.81     0.79     0.75
 300   0.85      0.83     0.82     0.83     0.83     0.79

Table 4. FDR and power for estimating the graph G1. Results for the different methods and for different values of n.
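The subsampling scheme described above can be summarized by the following sketch. It is an illustrative reconstruction rather than the code used for the study: estimate_graph stands for any of the five inference procedures and is assumed to return the inferred edge set as a Python set, D is the full data matrix, and ref_edges is the edge set of G1.

    import numpy as np

    rng = np.random.default_rng(0)

    def fdr_and_power(estimate_graph, D, ref_edges, n, n_rep=2000):
        """Mean FDR and power of a graph-inference procedure over subsamples.

        Each replicate draws n rows of D without replacement, infers a graph
        and compares its edge set with the reference edge set.
        """
        fdr, power = [], []
        for _ in range(n_rep):
            idx = rng.choice(D.shape[0], size=n, replace=False)
            edges = estimate_graph(D[idx])
            n_false = len(edges - ref_edges)
            fdr.append(n_false / max(len(edges), 1))   # FDR taken as 0 if nothing is detected
            power.append(len(edges & ref_edges) / len(ref_edges))
        return float(np.mean(fdr)), float(np.mean(power))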
5. CONCLUSION
In this work, we were interested in recent methods that infer direct links between entities from experimental datasets. The results we obtained underline both common features and specificities of these methods regarding the parameters p, n and η of the application context. The most relevant points from our simulation study are the following:
- If one aims to control the FDR at a low level, shrinkage and glasso should not be used.
- pcAlgo gives a better control of the FDR when the parameter α is suitably chosen. However, there is no simulation condition where it performs better than, for example, the MB methods.
- The bagging procedure is less powerful than the others, though the FDR is not better controlled.
- The WB method performs well, but we have to keep in mind that it aims at estimating an approximation of the concentration graph, and may lead to high FDR values when the 0-1 conditional independence graph differs from the concentration graph.
- KGGM performs well, particularly when n is small. However, this procedure cannot be carried out when p is large, say larger than 40.
- We recommend using the MB procedure when it can be applied (n large enough so that Equation (11) is not satisfied). If one can accept a false discovery rate of the order of 5%, then we recommend the variant MB.or, which is more powerful than the variant MB.and. The latter should be preferred when the false discovery rate has to be very small.
The structure of the graph should also be considered: if the edges are not uniformly distributed over the nodes, as they are in the Erdős–Rényi model, then edges localized in highly connected parts of the graph, or edges joining two highly connected parts, may be difficult to detect.
In the end, methods inferring graphs do not behave equivalently when faced with different graph and dataset structures. Consequently, great attention must be paid to the validity domain of each method before applying it.
6. Appendix. Algorithm for the MB method
For each variable $a \in \Gamma$, let $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ be the LASSO estimators of the parameters $\theta_{a,b}$ defined in Equation (4). In this section we describe the algorithm used for detecting the $\hat\theta_{a,b}$ that are non-zero. The first step of the algorithm consists in using the LARS algorithm for ranking the variables $X_{\Gamma \setminus \{a\}}$ according to the covariance structure of the matrix $X$. Then, for the chosen value of $\lambda$, the non-zero components of $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ are detected. This second step is described below.
Let us define the following notation: for $x$ a vector with $q$ components, $\|x\|^2 = \sum_{l=1}^{q} x_l^2$ and $\|x\|_\infty = \sup_{l=1,\ldots,q} |x_l|$. For the sake of simplicity, we set $Y = X_a$ and $U = X_{\Gamma \setminus \{a\}}$, and we assume that $Y$ and the columns of $U$ are centered and scaled such that $\|Y\|^2 = n$ and, for all $b = 1, \ldots, q$ (with $q = p - 1$), $\|U_b\|^2 = n$. Let

(10)   $\hat\beta(\lambda) = \mathrm{Arg\,min}_{\beta \in \mathbb{R}^{p-1}}\ \|Y - U\beta\|^2 + \lambda \sum_{b=1}^{p-1} |\beta_b|.$
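Equation (10) is the standard LASSO criterion, so for a fixed $\lambda$ it can be solved with off-the-shelf software once the parametrisations are matched. A minimal sketch assuming scikit-learn, whose Lasso minimises $\|Y - U\beta\|^2/(2n) + \alpha \sum_b |\beta_b|$, so that $\alpha = \lambda/(2n)$ recovers the penalisation of (10):

    import numpy as np
    from sklearn.linear_model import Lasso

    def beta_hat(U, Y, lam):
        """Solve Equation (10): argmin ||Y - U b||^2 + lam * sum |b_j|.

        sklearn's Lasso minimises ||Y - U b||^2 / (2n) + alpha * sum |b_j|,
        hence alpha = lam / (2 * n) yields the same minimiser.
        """
        n = U.shape[0]
        fit = Lasso(alpha=lam / (2 * n), fit_intercept=False).fit(U, Y)
        return fit.coef_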
We will use the following properties.

Property A. $\hat\beta(\lambda)$ is a solution of Equation (10) if and only if there exists a $(p-1)$-vector $v$ satisfying
- for all $b = 1, \ldots, p-1$, $v_b = \mathrm{sign}(\hat\beta_b(\lambda))$ if $\hat\beta_b(\lambda) \neq 0$, and $v_b \in [-1, 1]$ otherwise;
- $\lambda v = 2\, U^T (Y - U \hat\beta(\lambda))$.
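Property A also provides a direct numerical check that a candidate vector solves Equation (10). A small sketch, under the same assumptions and conventions as the snippet above:

    import numpy as np

    def kkt_check(U, Y, beta, lam, tol=1e-8):
        """Verify the subgradient conditions of Property A for a candidate beta."""
        v = 2.0 * U.T @ (Y - U @ beta) / lam
        active = beta != 0
        ok_active = np.all(np.abs(v[active] - np.sign(beta[active])) < tol)
        ok_inactive = np.all(np.abs(v[~active]) <= 1 + tol)
        return bool(ok_active and ok_inactive)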
Property B. Solving (10) is equivalent to solving the following constrained minimization problem:

$\hat\beta(t) = \mathrm{Arg\,min}_{\sum_{b=1}^{p-1} |\beta_b| \le t}\ \|Y - U\beta\|^2.$

Therefore, for all $\lambda$, there exists $t_\lambda$ such that $\hat\beta(\lambda) = \hat\beta(t_\lambda)$.
Property C. Let

$C(t) = 2\, \|U^T (Y - U \hat\beta(t))\|_\infty.$

It can be shown that the function $C$ satisfies the following properties: $C$ is a decreasing function of $t$ (see Efron et al. (2004), Lemma 7), and $\lambda = C(t_\lambda)$.
From Property A it follows that $\lambda \ge 2\,\|U^T Y\|_\infty$ is equivalent to $\hat\beta(\lambda) = 0$. As $2\,\|U^T Y\|_\infty \le 2n$, we get that $\hat\beta(\lambda) = 0$ as soon as $\lambda \ge 2n$. Comparing this lower bound with the value of $\lambda$ given by Meinshausen and Bühlmann (see Equation (5)), it appears that the parameters will be estimated by zero, whatever the data, if

(11)   $n \le \left(\Phi^{-1}(1 - \alpha/2p^2)\right)^2.$
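Equation (11) can be turned into a practical check of whether the MB procedure is applicable at all for given p and α, as recommended in the conclusion. A sketch assuming scipy is available; the function name is ours, and it relies on our reconstruction of the inequality in (11):

    import math
    from scipy.stats import norm

    def mb_min_n(p, alpha=0.05):
        """Smallest n for which Equation (11) is violated, i.e. the smallest
        sample size at which the MB procedure can select any edge."""
        z = norm.ppf(1 - alpha / (2 * p**2))
        return math.floor(z**2) + 1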
Consider now the case where $\lambda < 2\,\|U^T Y\|_\infty$, and let us denote by $A(t)$ the set of active parameters:

$A(t) = \{\, b \in \{1, \ldots, p-1\} : \hat\beta_b(t) \neq 0 \,\}.$

When $t$ increases, $A(t)$ becomes larger. The LARS algorithm gives the values of $t$, namely $t_0 = 0, t_1, t_2, \ldots$, at which $A(t)$ gains a variable: for each $k = 1, 2, \ldots$ and for all $t \in \left] t_{k-1}, t_k \right]$, $A(t)$ is constant and equal to $A(t_k)$.
Thanks to the third property, it remains to find $k^\star = \min\{k : C(t_k) < \lambda\}$. The non-zero components of $(\hat\theta_{a,b}(\lambda),\ b \in \Gamma \setminus \{a\})$ are then given by $A(t_{k^\star})$.
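Putting the two steps together, the sketch below mirrors the algorithm described above: scikit-learn's lars_path provides the breakpoints $t_k$ of the LASSO path, and Property C is used to locate $k^\star$. It is an illustration under our naming assumptions (mb_neighborhood, lam), not the authors' implementation.

    import numpy as np
    from sklearn.linear_model import lars_path

    def mb_neighborhood(X, a, lam):
        """Non-zero LASSO coefficients of the regression of X_a on the others.

        Two-step procedure: (1) LARS computes the breakpoints t_0 = 0, t_1, ...
        of the LASSO path; (2) the active set is read off at
        k* = min{k : C(t_k) < lam}, with C(t) = 2 ||U^T (Y - U beta(t))||_inf.
        """
        n, p = X.shape
        Y = X[:, a].astype(float).copy()
        U = np.delete(X, a, axis=1).astype(float)
        # centre and scale so that ||Y||^2 = n and ||U_b||^2 = n for every column b
        Y -= Y.mean()
        Y *= np.sqrt(n) / np.linalg.norm(Y)
        U -= U.mean(axis=0)
        U *= np.sqrt(n) / np.linalg.norm(U, axis=0)
        # Step 1: LASSO path; coefs[:, k] is beta_hat(t_k)
        _, _, coefs = lars_path(U, Y, method="lasso")
        # Step 2: C(t) decreases in t; stop at the first breakpoint with C(t_k) < lam
        for k in range(coefs.shape[1]):
            C = 2.0 * np.max(np.abs(U.T @ (Y - U @ coefs[:, k])))
            if C < lam:
                active = np.flatnonzero(coefs[:, k])
                break
        else:
            active = np.flatnonzero(coefs[:, -1])   # lam below the end of the path
        return np.delete(np.arange(p), a)[active]   # indices among the original variables

Note that when lam exceeds 2n the loop stops at the very first breakpoint with an empty active set, consistent with the bound derived from Property A.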
References
Banerjee, O., Ghaoui, L., & d'Aspremont, A. (2008). Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data. Journal of Machine Learning Research, 9, 485–516.
Castelo, R., & Roverato, A. (2006). A robust procedure for Gaussian graphical model search from microarray data with p larger than n. Journal of Machine Learning Research, 7, 2621–2650.
Daudin, J.-J., Picard, F., & Robin, S. (2006). A mixture model for random graphs. Research Report RR-5840, INRIA.
Dobra, A., Hans, C., Jones, B., Nevins, J. R., Yao, G., & West, M. (2004). Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis, 90, 196–212.
Drton, M., & Perlman, M. (2007). Multiple testing and error control in Gaussian graphical model selection. Statistical Science, to appear.
Efron, B., Hastie, T., Johnstone, I., & Tibshirani, R. (2004). Least angle regression. The Annals of Statistics, 32, 407–451.
Friedman, J., Hastie, T., & Tibshirani, R. (2007). Sparse inverse covariance estimation with the lasso. Technical report, http://www-stat.stanford.edu/tibs/ftp/graph.pdf.
Giraud, C. (2008). Estimation of Gaussian graphs by model selection. Electronic Journal of Statistics, 2, 542–563.
Huang, J., Liu, N., Pourahmadi, M., & Liu, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika, 93(1), 85–98.
Husmeier, D. (2003). Sensitivity and specificity of inferring genetic regulatory interactions from microarray experiments with dynamic Bayesian networks. Bioinformatics, 19, 2271–2282.
Kalisch, M., & Bühlmann, P. (2007). Estimating high-dimensional directed acyclic graphs with the PC-algorithm. Journal of Machine Learning Research, 8, 613–636.
Kishino, H., & Waddell, P. J. (2000). Correspondence analysis of genes and tissue types and finding genetic links from microarray data. Genome Informatics, 11, 83–95.
Malouche, D., & Sevestre, S. (2007). Estimating high dimensional faithful Gaussian graphical models: UPC-algorithm. Technical Report arXiv:0705.1613.
Meinshausen, N., & Bühlmann, P. (2006). High-dimensional graphs and variable selection with the Lasso. Annals of Statistics, 34(3), 1436–1462.
Okamoto, S., Yamanishi, Y., Ehira, S., Kawashima, S., Tonomura, K., & Kanehisa, M. (2007). Prediction of nitrogen metabolism-related genes in Anabaena by kernel-based network analysis. Proteomics, 7(6), 900–909.
Sachs, K., Perez, O., Pe'er, D., Lauffenburger, D. A., & Nolan, G. P. (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308, 523–529.
Schäfer, J., & Strimmer, K. (2005a). An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics, 21(6), 754–764.
Schäfer, J., & Strimmer, K. (2005b). A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4, 1–32.
Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction and Search. The MIT Press, 2nd edition.
Werhli, A., & Husmeier, D. (2007). Reconstructing gene regulatory networks with Bayesian networks by combining expression data with multiple sources of prior knowledge. Statistical Applications in Genetics and Molecular Biology, 6.
Wille, A., & Bühlmann, P. (2006). Low-order conditional independence graphs for inferring genetic networks. Statistical Applications in Genetics and Molecular Biology, 5, 1–34.
Wu, W., & Ye, Y. (2006). Exploring gene causal interactions using an enhanced constraint-based method. Pattern Recognition, 39, 2439–2449.
Yellaboina, S., Goyal, K., & Mande, S. (2007). Inferring genome-wide functional linkages in E. coli by combining improved genome context methods: Comparison with high-throughput experimental data. Genome Research, 17(4), 527–535.
Yuan, M., & Lin, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika, 94, 19–35.