Algorithms for Molecular Biology
Open Access
Research
A linear programming approach for estimating the structure of a
sparse linear genetic network from transcript profiling data
Sahely Bhadra1, Chiranjib Bhattacharyya*1,2, Nagasuma R Chandra*2 and I
Saira Mian3
Address: 1Department of Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka, India, 2Bioinformatics Centre,
Indian Institute of Science, Bangalore, Karnataka, India and 3Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California
94720, USA
Email: Sahely Bhadra - sahely@csa.iisc.ernet.in; Chiranjib Bhattacharyya* - chiru@csa.iisc.ernet.in;
Nagasuma R Chandra* - nchandra@serc.iisc.ernet.in; I Saira Mian - smian@lbl.gov
* Corresponding authors
Abstract
Background: A genetic network can be represented as a directed graph in which a node
corresponds to a gene and a directed edge specifies the direction of influence of one gene on
another. The reconstruction of such networks from transcript profiling data remains an important
yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological
sample of interest. Prevailing strategies for learning the structure of a genetic network from high-dimensional transcript profiling data assume sparsity and linearity. Many methods consider relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work examines large undirected graph representations of genetic networks, graphs with many thousands of nodes where an undirected edge between two nodes does not indicate the direction of influence, and the problem of estimating the structure of such a sparse linear genetic network (SLGN) from transcript profiling data.
Results: The structure learning task is cast as a sparse linear regression problem which is then posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-One-Out Error. The accuracy and utility of LP-SLGNs is assessed quantitatively and qualitatively using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are comparable to those proposed by the first and/or second ranked teams in the DREAM2 competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae LP-SLGN, the number of nodes with a particular degree follows an approximate power law, suggesting that its degree distribution is similar to those observed in real-world networks. Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental verification.
Published: 24 February 2009
Algorithms for Molecular Biology 2009, 4:5 doi:10.1186/1748-7188-4-5
Received: 30 May 2008
Accepted: 24 February 2009
This article is available from: http://www.almob.org/content/4/1/5
© 2009 Bhadra et al; licensee BioMed Central Ltd.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Conclusion: A statistically robust and computationally efficient LP-based method for estimating the topology of a large sparse undirected graph from high-dimensional data yields representations of genetic networks that are biologically plausible and useful abstractions of the structures of real genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may have practical value; for example, genes with high random walk betweenness, a measure of the centrality of a node in a graph, are good candidates for intervention studies and hence integrated computational-experimental investigations designed to infer more realistic and sophisticated probabilistic directed graphical model representations of genetic networks. The LP-based solutions of the sparse linear regression problem described here may provide a method for learning the structure of transcription factor networks from transcript profiling and transcription factor binding motif data.
Background
Understanding the dynamic organization and function of networks involving molecules such as transcripts and proteins is important for many areas of biology. The ready availability of high-dimensional data sets generated using high-throughput molecular profiling technologies has stimulated research into mathematical, statistical, and probabilistic models of networks. For example, GEO [1] and ArrayExpress [2] are public repositories of well-annotated and curated transcript profiling data from diverse species and varied phenomena obtained using different platforms and technologies.
A genetic network can be represented as a graph consisting
of a set of nodes and a set of edges. A node corresponds to
a gene (transcript) and an edge between two nodes
denotes an interaction between the connected genes that
may be linear or nonlinear. In a directed graph, the oriented edge A → B signifies that gene A influences gene B. In an undirected graph, the unoriented edge A - B encodes a symmetric relationship and signifies that genes A and B may be co-expressed, co-regulated, interact, or share some other common property. Empirical observations indicate that most genes are regulated by a small number of other genes, usually fewer than ten [3-5]. Hence, a genetic network can be viewed as a sparse graph, i.e., a graph in which a node is connected to a handful of other nodes. If directed (acyclic) graphs or undirected graphs are imbued with probabilities, the results are probabilistic directed graphical models and probabilistic undirected graphical models respectively [6].
Extant approaches for deducing the structure of genetic networks from transcript profiling data [7-9] include Boolean networks [10-14], linear models [15-18], neural networks [19], differential equations [20], pairwise mutual information [10,21-23], Gaussian graphical models [24,25], heuristic approaches [26,27], and co-expression clustering [16,28]. Theoretical studies of sample complexity indicate that although sparse directed acyclic graphs or Boolean networks could be learned, inference would be problematic because in current data sets, the number of variables (genes) far exceeds the number of observations (transcript profiles) [12,14,25]. Although probabilistic graphical models provide a powerful framework for representing, modeling, exploring, and making inferences about genetic networks, there remain many challenges in learning tabula rasa the topology and probability parameters of large, directed (acyclic) probabilistic graphical models from uncertain, high-dimensional transcript profiling data [7,25,29-33]. Dynamic programming approaches [26,27] use Singular Value Decomposition (SVD) to preprocess the data and heuristics to determine stopping criteria. These methods have high computational complexity and yield approximate solutions.
This work focuses on a plausible, albeit incomplete, representation of a genetic network, a sparse undirected graph, and the task of estimating the structure of such a network from high-dimensional transcript profiling data. Since the degree of every node in a sparse graph is small, the model embodies the biological notion that a gene is regulated by only a few other genes. An undirected edge indicates that although the expression levels of two connected genes are related, the direction of influence is not specified. The final simplification is that of restricting the type of interaction that can occur between two genes to a single class, namely a linear relationship. This particular representation of a genetic network is termed a sparse linear genetic network (SLGN).
Here, the task of learning the structure of a SLGN is equated with that of solving a collection of sparse linear regression problems, one for each gene in a network (node in the graph). Each linear regression problem is posed as a LASSO (l1-constrained fitting) problem [34] that is solved by formulating a Linear Program (LP). A virtue of this LP-based approach is that the use of the Huber loss function reduces the impact of variation in the training data on the weight vector that is estimated by regression analysis. This feature is of practical importance because technical noise arising from the transcript profiling platform used, coupled with the stochastic nature of gene expression [35], leads to variation in measured abundance values. Thus, the ability to estimate parameters in a robust manner should increase confidence in the structure of an LP-SLGN estimated from noisy transcript profiles. An additional benefit of the approach is that the LP formulations can be solved quickly and efficiently using widely available software and tools capable of solving LPs involving tens of thousands of variables and constraints on a desktop computer.
Two different LP formulations are proposed: one based on a positive class of linear functions and the other on a general class of linear functions. The accuracy of this LP-based approach for deducing the structure of networks is assessed statistically using gold standard data and evaluation metrics from the Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative [36]. The LP-based approach compares favourably with algorithms proposed by the top two ranked teams in the DREAM2 competition. The practical utility of LP-SLGNs is examined by estimating and analyzing network models from two published Saccharomyces cerevisiae transcript profiling data sets [37] (ALPHA; CDC15). The node degree distributions of the learned S. cerevisiae LP-SLGNs, undirected graphs with many hundreds of nodes and thousands of edges, follow approximate power laws, a feature observed in real biological networks. Inspection of these LP-SLGNs from a biological perspective suggests they capture known regulatory associations and thus provide plausible and useful approximations of real genetic networks.
Methods
Genetic network: sparse linear undirected graph
representation
A genetic network can be viewed as an undirected graph, G = {V, W}, where V is a set of N nodes (one for each gene in the network) and W is an N × N connectivity matrix encoding the set of edges. The (i, j)th element of the matrix W specifies whether nodes i and j do (W_ij ≠ 0) or do not (W_ij = 0) influence each other. The degree of node n, k_n, indicates the number of other nodes connected to n and is equivalent to the number of nonzero elements in row n of W. In real genetic networks, a gene is often regulated by a small number of other genes [3,4], so a reasonable representation of a network is a sparse graph. A sparse graph is a graph parametrized by a sparse matrix W, a matrix with few nonzero elements W_ij, and where most nodes have a small degree, k_n < 10.
Linear interaction model: static and dynamic settings
If the relationship between two genes is restricted to the class of linear models, the abundance value of a gene is treated as a weighted sum of the abundance values of other genes. A high-dimensional transcript profile is a vector of abundance values for N genes. An N × T matrix E is the concatenation of T profiles, [e(1),..., e(T)], where e(t) = [e_1(t),..., e_N(t)]^T and e_n(t) is the abundance of gene n in profile t. In most extant profiling studies, the number of transcripts monitored exceeds the number of available profiles (N >> T).

In the static setting, the T transcript profiles in the data matrix E are assumed to be unrelated and so independent of one another. In the linear interaction model, the abundance value of a gene is treated as a weighted sum of the abundance values of all genes in the same profile,

  e_n(t) = w_n^T e(t) = Σ_{j=1}^N w_nj e_j(t),  where w_nn = 0.   (1)

The parameter w_n = [w_n1,..., w_nN]^T is a weight vector for gene n and the jth element indicates whether genes n and j do (w_nj ≠ 0) or do not (w_nj = 0) influence each other. The constraint w_nn = 0 prevents gene n from influencing itself at the same instant, so its abundance is a function of the abundances of the remaining N - 1 genes in the same profile.

In the dynamic setting, the T transcript profiles in E are assumed to form a time series. In the linear interaction model, the abundance value of a gene at time t is treated as a weighted sum of the abundance values of all genes in the profile from the previous time point, t - 1, i.e., e_n(t) = w_n^T e(t - 1). There is no constraint w_nn = 0 because a gene can influence its own abundance at the next time point.

As described in detail below, the SLGN structure learning problem involves solving N independent sparse linear regression problems, one for each node in the graph (gene in the network), such that every weight vector w_n is sparse. The sparse linear regression problem is cast as an LP and uses a loss function which ensures that the weight vector is resilient to small changes in the training data. Two LPs are formulated and each formulation contains one user-defined parameter, A, the upper bound of the l1 norm of the weight vector. One LP is based on a general class of linear functions. The other LP formulation is based on a positive class of linear functions and yields an LP with fewer variables than the first.
Simulated and real data
DREAM2 InSilico-Network Challenges data
A component of Challenge 4 of the DREAM2 competition [38] is predicting the connectivity of three in silico networks generated using simulations of biological interactions. Each DREAM2 data set includes time courses (trajectories) of the network recovering from several external perturbations. The INSILICO1 data were produced from a gene network with 50 genes where the rate of synthesis of the mRNA of each gene is affected by the mRNA levels of other genes; there are 23 different perturbations and 26 time points for each perturbation. The INSILICO2 data are similar to INSILICO1 but the topology of the 50-gene network is qualitatively different. The INSILICO3 data were produced from a full in silico biochemical network that had 16 metabolites, 23 proteins and 20 genes (mRNA concentrations); there are 22 different perturbations and 26 time points for each perturbation. Since the LP-based method yields network models in the form of undirected graphs, the data were used to make predictions in the DREAM2 competition category UNDIRECTED-UNSIGNED. Thus, the simulated data sets used to estimate LP-SLGNs are an N = 50 × T = 26 matrix (INSILICO1), an N = 50 × T = 26 matrix (INSILICO2), and an N = 59 × T = 26 matrix (INSILICO3).
S. cerevisiae transcript profiling data
A published study of S. cerevisiae monitored 2,467 genes at various time points and under different conditions [37]. In the investigations designated ALPHA and CDC15, measurements were made over T = 15 and T = 18 time points respectively. Here, a gene was retained only if an abundance measurement was present in all 33 profiles. Only 605 genes met this criterion of no missing values and these data were not processed any further. Thus, the real transcript profiling data sets used to estimate LP-SLGNs are an N = 605 × T = 15 matrix (ALPHA) and an N = 605 × T = 18 matrix (CDC15).
Training data for regression analysis
A training set for regression analysis, D_n = {(x_ni, y_ni)}_{i=1}^I, is created by generating training points for each gene from the data matrix E. For gene n, the ith training point consists of an "input" vector, x_ni = [x_1i,..., x_Ni]^T (abundance values for N genes), and an "output" scalar, y_ni (the abundance value for gene n).
In the static setting, I = T training points are created because both the input and output are generated from the same profile; the linear interaction model (Equation 1) includes the constraint w_nn = 0. If e_n(t) is the abundance of gene n in profile t, the ith training point is x_ni = e(t) = [e_1(t),..., e_N(t)], y_ni = e_n(t), with t = 1,..., T.

In the dynamic setting, I = T - 1 training points are created because the output is generated from the profile for a given time point whereas the input is generated from the profile for the previous time point; there is no constraint w_nn = 0 in the linear interaction model. The ith training point is x_ni = e(t - 1) = [e_1(t - 1),..., e_N(t - 1)], y_ni = e_n(t), with t = 2,..., T.
The results reported below are based on training data generated under the static setting, so the constraint w_nn = 0 is imposed.
Notation
Let R^N denote the N-dimensional Euclidean vector space and card(A) the cardinality of a set A. For a vector x = [x_1,..., x_N]^T in this space, the l2 (Euclidean) norm is the square root of the sum of the squares of its elements, ||x||_2 = sqrt(Σ_{n=1}^N x_n^2); the l1 norm is the sum of the absolute values of its elements, ||x||_1 = Σ_{n=1}^N |x_n|; and the l0 norm is the total number of nonzero elements, ||x||_0 = card({n : x_n ≠ 0; 1 ≤ n ≤ N}). The term x ≥ 0 signifies that every element of the vector is zero or positive, x_n ≥ 0, ∀n ∈ {1,..., N}. The one- and zero-vectors are 1 = [1,..., 1]^T and 0 = [0,..., 0]^T respectively.
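For concreteness, the three norms can be computed directly in NumPy; a small sketch with an example vector chosen for this illustration:

```python
import numpy as np

x = np.array([0.0, -3.0, 4.0, 0.0])
l2 = np.sqrt(np.sum(x ** 2))   # Euclidean norm: sqrt(9 + 16) = 5
l1 = np.sum(np.abs(x))         # sum of absolute values: 3 + 4 = 7
l0 = np.count_nonzero(x)       # number of nonzero elements: 2
```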
Sparse linear regression: an LP-based formulation
Given a training set for gene n,

  D_n = { (x_ni, y_ni); x_ni ∈ R^N; y_ni ∈ R; i = 1,..., I },   (2)

the sparse linear regression problem is the task of inferring a sparse weight vector, w_n, under the assumption that gene-gene interactions obey a linear model, i.e., the abundance of gene n is a weighted sum of the abundances of other genes, y_ni = w_n^T x_ni.

Sparse weight vector estimation
l0 norm minimization
The problem of learning the structure of an SLGN involves estimating a weight vector w such that w best approximates y and most of the elements of w are zero. Thus, one strategy for obtaining sparsity is to stipulate that w should have at most k nonzero elements, ||w||_0 ≤ k. The value of k is equivalent to the degree of the node, so a biologically plausible constraint for a genetic network is ||w||_0 ≤ 10.
Given a value of k, the number of possible choices of predictors that must be examined is ^N C_k. Since there are many genes (N is large) and each choice of predictor variables requires solving an optimization problem, learning a sparse weight vector using an l0 norm-based approach is prohibitive, even for small k. Furthermore, the problem is NP-hard [39] and cannot even be approximated in time 2^(log^(1-ε) N), where ε is a small positive quantity.
LASSO
A tractable approximation of the l0 norm is the l1 norm [40,41] (for other approximations see [42]). LASSO [34] uses an upper bound for the l1 norm of the weight vector, specified by a parameter A, and formulates the l1 norm minimization problem as follows,

  minimize_{w,v}  Σ_{i=1}^I |v_i|
  subject to  w^T x_i + v_i = y_i,  ||w||_1 ≤ A.   (3)

This formulation attempts to choose w such that it minimizes deviations between the predicted and the actual values of y. In particular, w is chosen to minimize the loss function L(w) = Σ_{i=1}^I |y_i - w^T x_i|. Here, "Empirical Error" is used as the loss function. The Empirical Error of a graph is (1/N) Σ_{n=1}^N Empiricalerror(n), where Empiricalerror(n) = (1/I) Σ_{i=1}^I |y_ni - f(x_ni; w_n)|. The user-defined parameter A controls the upper bound of the l1 norm of the weight vector and hence the trade-off between sparsity and accuracy. If A = 0, the result is a poor approximation, as the sparsest solution is a zero weight vector, w = 0. When A = ∞, deviations are not allowed and a non-sparse w is found if the problem is feasible.

LP formulation: general class of linear functions
Consider the robust regression function f(.; w). For the general class of linear functions, f(x; w) = w^T x, an element of the parameter vector can be zero, w_j = 0, or nonzero, w_j ≠ 0. When w_j > 0, the predictor variable j makes a positive contribution to the linear interaction model, whereas if w_j < 0, the contribution is negative. Since the representation of a genetic network considered here is an undirected graph and thus the connectivity matrix is symmetric, the interactions (edges) in a SLGN are not categorized as activation or inhibition.

For the general class of linear functions, the LASSO problem (Equation 3) can be posed as the following LP by substituting w = u - v, ||w||_1 = (u + v)^T 1, |v_i| = ξ_i + ξ_i* and v_i = ξ_i - ξ_i*,

  minimize_{u,v,ξ,ξ*}  Σ_{i=1}^I (ξ_i + ξ_i*)
  subject to  (u - v)^T x_i + ξ_i - ξ_i* = y_i,  (u + v)^T 1 ≤ A,
              u ≥ 0,  v ≥ 0,  ξ_i ≥ 0,  ξ_i* ≥ 0.   (4)

The user-defined parameter A controls the upper bound of the l1 norm of the weight vector and thus the trade-off between sparsity and accuracy. Problem (4) is an LP in (2N + 2I) variables, with I equality constraints, one inequality constraint and (2N + 2I) nonnegativity constraints.

LP formulation: positive class of linear functions
An optimization problem with fewer variables than Problem (4) can be formulated by considering a weaker class of linear functions. For the positive class of linear functions f(x; w) = w^T x, every element of the weight vector w is required to be nonnegative, w_j ≥ 0. Then, the LASSO problem (Equation 3) can be posed as the following LP,

  minimize_{w,ξ,ξ*}  Σ_{i=1}^I (ξ_i + ξ_i*)
  subject to  w^T x_i + ξ_i - ξ_i* = y_i,  w^T 1 ≤ A,
              w ≥ 0,  ξ_i ≥ 0,  ξ_i* ≥ 0.   (5)

Problem (5) is an LP with (N + 2I) variables, I equality constraints, one inequality constraint, and (N + 2I) nonnegativity constraints.

In most transcript profiling studies, the number of genes monitored is considerably greater than the number of profiles produced, N >> I. Thus, an LP based on a restrictive
positive linear class of functions and involving (N + 2I) variables (Problem (5)) offers substantial computational advantages over a formulation based on a general linear class of functions and involving (2N + 2I) variables (Problem (4)). LPs involving thousands of variables can be solved efficiently using extant software and tools.
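As an illustration of how Problem (5) maps onto a standard LP solver, the following sketch uses SciPy's `linprog`. This is not the authors' MATLAB implementation; the function name and the layout of the decision vector are assumptions made for this example:

```python
import numpy as np
from scipy.optimize import linprog

def solve_positive_lasso_lp(X, y, A):
    """Sketch of Problem (5): minimize the sum of absolute residuals
    subject to w >= 0 and the l1 bound w^T 1 <= A.

    X is the I x N matrix of input vectors x_i, y the I outputs.
    Decision vector z = [w (N), xi (I), xi* (I)]; only the slacks
    xi, xi* appear in the objective."""
    I, N = X.shape
    c = np.concatenate([np.zeros(N), np.ones(2 * I)])
    # Equality constraints: w^T x_i + xi_i - xi*_i = y_i for each point.
    A_eq = np.hstack([X, np.eye(I), -np.eye(I)])
    # Single inequality: w^T 1 <= A (since w >= 0, this is ||w||_1 <= A).
    A_ub = np.concatenate([np.ones(N), np.zeros(2 * I)]).reshape(1, -1)
    res = linprog(c, A_ub=A_ub, b_ub=[A], A_eq=A_eq, b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:N], res   # estimated weight vector and solver result
```

For an SLGN this LP would be solved once per gene, with the rows of X drawn from the transcript profiles as described in the training-data section.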
To estimate a graph G, the training points for the nth gene, D_n, are used to solve a sparse linear regression problem posed as a LASSO and formulated as an LP. The outcome of such regression analysis is a sparse weight vector w_n whose small number of nonzero elements specifies which genes influence gene n. Aggregating the N sparse weight vectors produced by solving N independent sparse linear regression problems, [w_1,..., w_N], yields the matrix W that parameterizes the graph.
Statistical assessment of LP-SLGNs: Error, Sparsity and Leave-One-Out (LOO) Error
The "Sparsity" of a graph is the average degree of a node,

  Sparsity = (1/N) Σ_{n=1}^N k_n = (1/N) Σ_{n=1}^N ||w_n||_0,   (6)

where ||w_n||_0 is the l0 norm of the weight vector for node n.

Unfortunately, the small number of available training points (I) means that the empirical error will be optimistic and biased. Consequently, the Leave-One-Out (LOO) Error is used to analyze the stability and generalization performance of the method proposed here.

Given a training set D_n = [(x_n1, y_n1),..., (x_nI, y_nI)], two modified training sets are built as follows:

• Remove the ith element: D_n^{\i} = D_n \ {(x_ni, y_ni)}.
• Change the ith element: D_n^{i} = D_n \ {(x_ni, y_ni)} ∪ {(x', y')},

where (x', y') is any point other than one in the training set D_n.

The Leave-One-Out Error of a graph, LOO Error, is the average over the N nodes of the LOO error of every node. The LOO error of node n, LOOerror(n), is the average over the I training points of the magnitude of the discrepancy between the actual response, y_ni, and the predicted linear response f^{\i}(x_ni; w_n^{\i}) = (w_n^{\i})^T x_ni,

  LOO Error = (1/N) Σ_{n=1}^N LOOerror(n),
  LOOerror(n) = (1/I) Σ_{i=1}^I |y_ni - f^{\i}(x_ni; w_n^{\i})|.   (7)

The parameter w_n^{\i} of the function f^{\i} is learned using the modified training set D_n^{\i}.

A bound for the Generalization Error of a graph
A key issue in the design of any machine learning system is an algorithm that has low generalization error. Here, the Leave-One-Out (LOO) error is utilized to estimate the accuracy of the LP-based algorithm employed to learn the structure of a SLGN. In this section, a bound on the generalization error based on the LOO Error is derived. Furthermore, a low LOO Error of the method proposed here is shown to signify good generalization.

The generalization error of a graph, Error, is the average over all N nodes of the generalization error of every node, Error(n),

  Error = (1/N) Σ_{n=1}^N Error(n),
  Error(n) = E_{(x,y)}[ l(f; x, y) ],  l(f; x, y) = |y - w_n^T x|.   (8)

The parameter w_n is learned from D_n as follows,

  w_n = argmin_{w : ||w||_1 ≤ t}  (1/I) Σ_{i=1}^I l(w; (x_ni, y_ni)).   (9)

The approach is based on the following Theorem (for details, see [43]).

Theorem 1. Given a training set S = {z_1,..., z_m} of size m, let the modified training set be S^i = {z_1,..., z_{i-1}, z'_i, z_{i+1},..., z_m}, where the ith element has been changed and z'_i is drawn from the data space Z but independent of S. Let F : Z^m → R be any measurable function for which there exist constants c_i (i = 1,..., m) such that

  sup_{S ∈ Z^m, z'_i ∈ Z} |F(S) - F(S^i)| ≤ c_i;  then
  P_S[ F(S) - E_S[F(S)] ≥ ε ] ≤ exp( -2ε² / Σ_{i=1}^m c_i² ).
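The LOO computation of Equation 7 can be sketched directly. This is an illustrative implementation, not the authors' code; `fit(X, y)` stands in for whichever regression solver (e.g. the LASSO LP) produces a weight vector:

```python
import numpy as np

def loo_error_node(X, y, fit):
    """LOO error of one node: average |y_i - f^{\\i}(x_i; w^{\\i})|, where
    the weight vector is refit with the ith training point removed."""
    I = len(y)
    total = 0.0
    for i in range(I):
        keep = np.arange(I) != i
        w = fit(X[keep], y[keep])       # train on the I - 1 remaining points
        total += abs(y[i] - X[i] @ w)   # discrepancy on the held-out point
    return total / I

def loo_error_graph(node_data, fit):
    """LOO Error of the graph: average of the per-node LOO errors."""
    return np.mean([loo_error_node(X, y, fit) for X, y in node_data])
```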
Elsewhere [44], the above was given as Theorem 2.

Theorem 2. Consider a graph with N nodes. Let the data points for the nth node be D_n = {(x_ni, y_ni); i = 1,..., I}, where the (x_ni, y_ni) are iid. Assume that ||x_ni||_∞ ≤ d and |y_ni| ≤ b. Let f : R^N → R and y = f(x; w) = w^T x. Using techniques from [44], it can be stated that for 0 ≤ δ ≤ 1 and with probability at least 1 - δ over a random draw of the sample graph,

  Error ≤ LOO Error + 2td + ( 6td + (b + td)/I ) sqrt( (I/2) ln(1/δ) ),   (10)

where t is the l1 norm of the weight vector, ||w||_1. LOO Error and Error are calculated using Equation 7 and Equation 8 respectively.

PROOF. "Random draw" means that if the algorithm is run for different graphs, one graph from the set of learned graphs is selected at random. The proposed bound on the generalization error will hold for this graph with high probability. This term is unrelated to the term "random graph" used in graph theory.

The following proof makes use of Hölder's inequality,

  |y_ni - f(x_ni; w_n)| - |y_ni - f^{\i}(x_ni; w_n^{\i})| ≤ |(w_n - w_n^{\i})^T x_ni|
  ≤ ||w_n - w_n^{\i}||_1 ||x_ni||_∞ ≤ 2td.   (11)

A bound on the Empirical Error can be found as

  max( |y_ni - f(x_ni; w_n)| ) ≤ |y_ni| + |w_n^T x_ni| ≤ b + ||w_n||_1 ||x_ni||_∞ ≤ b + td.   (12)

Let Error(n)^{\i} be the Generalization Error after training with D_n^{\i}. Then using Equation 11,

  |Error(n) - Error(n)^{\i}| = | E[ |y - f(x; w_n)| ] - E[ |y - f^{\i}(x; w_n^{\i})| ] | ≤ 2td.   (13)

Let Error(n)^{i} be the Generalization Error after training with D_n^{i}. Then using Equation 13,

  |Error(n) - Error(n)^{i}| = | (Error(n) - Error(n)^{\i}) - (Error(n)^{i} - Error(n)^{\i}) |
  ≤ |Error(n) - Error(n)^{\i}| + |Error(n)^{i} - Error(n)^{\i}| ≤ 4td.   (14)

If LOOerror(n)^{i} is the LOO error when the training set is D_n^{i}, then using Equation 11 and Equation 12,

  |LOOerror(n) - LOOerror(n)^{i}|
  ≤ (1/I) [ | |y_ni - f^{\i}(x_ni; w_n^{\i})| - |y' - f^{\i}(x'; w_n^{\i})| |
      + Σ_{j≠i} | |y_nj - f^{\j}(x_nj; w_n^{\j})| - |y_nj - f^{i\j}(x_nj; w_n^{i\j})| | ]
  ≤ (1/I) [ (b + td) + (I - 1)(2td) ] ≤ 2td + (b + td)/I.   (15)

Thus, the random variable (Error - LOO Error) satisfies the condition of Theorem 1. Using Equation 14 and Equation 15, the condition is

  sup | (Error - LOO Error) - (Error^i - LOO Error^i) |
  ≤ (1/N) Σ_{n=1}^N ( |Error(n) - Error(n)^i| + |LOOerror(n) - LOOerror(n)^i| )
  ≤ 4td + 2td + (b + td)/I = 6td + (b + td)/I,   (16)
where Error^i is the Generalization Error of the graph and LOO Error^i is the LOO Error of the graph when the ith data points for all genes are changed. Thus, only a bound on the expectation of the random variable (Error - LOO Error) is needed. Using Equation 11,

  E[Error - LOO Error] = (1/N)(1/I) Σ_{n=1}^N Σ_{i=1}^I E( |y_ni - f(x_ni; w_n)| - |y_ni - f^{\i}(x_ni; w_n^{\i})| ) ≤ 2td.

Hence, Theorem 1 can be used to state that if Equation 16 holds, then

  P[ Error - LOO Error - E[Error - LOO Error] ≥ ε ] ≤ exp( -2ε² / ( I ( 6td + (b + td)/I )² ) ).   (17)

By equating the right hand side of Equation 17 to δ,

  P[ Error < LOO Error + 2td + ( 6td + (b + td)/I ) sqrt( (I/2) ln(1/δ) ) ] ≥ 1 - δ.

Given this bound on the generalization error, a low LOO Error in the method proposed here signifies good generalization. □
Implementation and numerical issues
Prototype software implementing the two LP-based formulations of sparse regression was written using the tools and solvers present in the commercial software MATLAB [45]. The software is available as Additional file 1 (LPSLGN.tar). It should be straightforward to develop an implementation using C and R wrapper functions for lp_solve [46], a freely available solver for linear, integer and mixed integer programs. The outcome of regression analysis is an optimal weight vector w. Limitations in the numerical precision of solvers mean that an element is never exactly zero but a small finite number. Once a solver finds a vector w, a "small" user-defined threshold is used to assign zero and nonzero elements: if the value produced by the solver is greater than the threshold, the element is treated as nonzero (w_j = 1); otherwise, w_j = 0. Here, a cutoff of 10^-8 was used.

The computational experiments described here were performed on a large shared machine. The hardware specifications are 6 COMPAQ AlphaServer ES40s with 4 CPUs per server running at 667 MHz, 64 KB + 64 KB primary cache per CPU, 8 MB secondary cache per CPU, 8 GB memory with 4-way interleaving, 4 × 36 GB 10K rpm Ultra3 SCSI disk drives, and 2 × 10/100 Mbit PCI Ethernet Adapters. However, the programs can be run readily on a powerful PC. For the MATLAB implementation of the LP formulation based on the general class of linear functions, each LP took a few seconds of wall clock time. An additional few seconds were required to read in files and to set up the problem.
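The thresholding step can be sketched as follows (the 10^-8 cutoff is from the text; the function name is invented for this illustration):

```python
import numpy as np

def binarize(w, cutoff=1e-8):
    """Map a solver's near-zero weights to 0 and the rest to 1.

    Finite-precision LP solvers return tiny nonzero values instead of
    exact zeros; the absolute value is compared against the cutoff so
    that negative weights (general linear class) are handled too."""
    return (np.abs(w) > cutoff).astype(int)
```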
Results and discussion
DREAM2 InSilico-Network Challenges data
Statistical assessment of LP-SLGNs estimated from simulated data
LP-SLGNs were estimated from the INSILICO1, INSILICO2, and INSILICO3 data sets using both LP formulations and different settings of the user-defined parameter A, which controls the upper bound of the l1 norm of the weight vector and hence the trade-off between sparsity and accuracy. The results are shown in Figure 1. For all data sets, smaller values of A yield sparser graphs (left column) but Sparsity comes at the expense of higher LOO Error (right column). Higher A values produce graphs where the average degree of a node is larger (left column). The LOO Error decreases with increasing Sparsity (right column). The maximum Sparsity occurs at high A values and is equal to the number of genes, N.
LP-SLGNs based on the general class of linear functions were estimated using the parameter A = 1. For the INSILICO1 data set, the Sparsity is ~10. For the INSILICO2 data set, the Sparsity is ~13. For the INSILICO3 data set, the Sparsity is ~35.
The learned LP-SLGNs were evaluated using a script provided by the DREAM2 Project [38]. The results are shown in Table 1. The INSILICO2 LP-SLGN is considerably better than the network predicted by Team 80, the top-ranked team for this data set in the DREAM2 competition (Challenge 4). The INSILICO1 LP-SLGN is comparable to the predicted network of Team 70, the top-ranked team, but better than that of Team 80, the second-ranked team. Team rankings
The gap between Error and LOO Error concentrates around its expectation: replacing the i-th sample changes the gap by at most

\[
\sup_{(\mathbf{x},y)} \left| \left( \mathrm{Error} - \mathrm{LOO\ Error} \right) - \left( \mathrm{Error}^{\setminus i} - \mathrm{LOO\ Error}^{\setminus i} \right) \right| \le t_d + \frac{b}{I},
\tag{16}
\]

where

\[
t_d = \mathbb{E}\left[ \mathrm{Error} - \mathrm{LOO\ Error} \right] = \frac{1}{N} \sum_{n=1}^{N} \frac{1}{I} \sum_{i=1}^{I} \left( \left( y_{ni} - f(\mathbf{x}_{ni};\mathbf{w}_n) \right) - \left( y_{ni} - f(\mathbf{x}_{ni};\mathbf{w}_n^{\setminus i}) \right) \right) \le \frac{1}{2}.
\]

McDiarmid's method of bounded differences then gives

\[
P\left( \left| \mathrm{Error} - \mathrm{LOO\ Error} - \mathbb{E}\left[ \mathrm{Error} - \mathrm{LOO\ Error} \right] \right| \ge t + \frac{b}{I} \right) \le 2 \exp\left( - \frac{2 I t^2}{\left( 6 \left( t_d + \frac{b}{I} \right) \right)^2} \right),
\tag{17}
\]

so that

\[
P\left( \mathrm{Error} < \mathrm{LOO\ Error} + t_d + \frac{b}{I} + 6 \left( t_d + \frac{b}{I} \right) \sqrt{ \frac{1}{2I} \ln \frac{2}{\delta} } \right) \ge 1 - \delta.
\]
Figure 1
Quantitative evaluation of the INSILICO network models. Statistical assessment of the LP-SLGNs estimated from the INSILICO1, INSILICO2, and INSILICO3 DREAM2 data sets [36]. The left column shows plots of "Sparsity" (Equation 6) versus the user-defined parameter A (Equation 3). The right column shows plots of "LOO Error" (Equation 7) versus Sparsity. Each plot shows results for an LP formulation based on a general class of linear functions (diamond) and a positive class of linear functions (cross).
are not available for the INSILICO3 data set. The networks predicted by the LP-SLGN method can be found in Additional file 2 ("Result.tar").
S. cerevisiae transcript profiling data
Statistical assessment of LP-SLGNs estimated from real data
LP-SLGNs for the ALPHA and CDC15 data sets were estimated using both LP formulations and different settings of the user-defined parameter A. The learned undirected graphs were evaluated by computing LOO Error (Equation 7), a quantity indicating generalization performance, and Sparsity (Equation 6), a quantity based on the degree of each node. The results are shown in Figure 2. LP formulations based on a weaker positive class of linear functions (cross) and a general class of linear functions (diamond) produce similar results. However, the formulation based on a positive class of linear functions can be solved more quickly because it has fewer variables. For both data sets, smaller A values yield sparser graphs (left column) but sparsity comes at the expense of higher LOO Error (right column). For high A values, the average degree of a node is larger (left column). The LOO Error decreases with increasing Sparsity (right column). The maximum Sparsity occurs at high A values and is equal to the number of genes N. The minimum LOO Error occurs at A = 1 for ALPHA and A = 0.9 for CDC15; the Sparsity is ~15 for these A values. The degree of most of the nodes in the LP-SLGNs lies in the range 5–20, i.e., most genes are influenced by 5–20 other genes.
Figure 3 shows logarithmic plots of the distribution of node degree for the ALPHA and CDC15 LP-SLGNs. In each case, the degree distribution roughly follows a straight line, i.e., the number of nodes with degree k follows a power law, P(k) = βk^α where β, α ∈ R. Such a power-law distribution is observed in a number of real-world networks [47]. Thus, the connectivity pattern of edges in LP-SLGNs is consistent with that of known biological networks.
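The straight-line fit on log-log axes used for Figure 3 can be sketched as follows (a minimal sketch on a hypothetical adjacency matrix; the published analysis may differ in detail):

```python
import numpy as np

def degree_powerlaw_fit(adj):
    """Fit log P(k) = log(beta) + alpha * log(k) to a graph's degree counts.

    adj: symmetric 0/1 adjacency matrix of an undirected graph.
    Returns (alpha, beta) from a least-squares straight line fitted to
    the (log k, log count) points, i.e., the power law P(k) = beta * k**alpha.
    """
    degrees = np.asarray(adj).sum(axis=1)
    ks, counts = np.unique(degrees[degrees > 0], return_counts=True)
    alpha, log_beta = np.polyfit(np.log(ks), np.log(counts), 1)
    return alpha, np.exp(log_beta)
```

For a power-law degree distribution the fitted slope alpha is negative, with steeper slopes indicating fewer high-degree hub nodes.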
Biological evaluation of S. cerevisiae LP-SLGNs
The profiling data examined here were the outcome of a study of the cell cycle in S. cerevisiae [37]. The published study described gene expression clusters (groups of genes) with similar patterns of abundance across different conditions. Whereas two genes in the same expression cluster have similarly shaped expression profiles, two genes linked by an edge in an LP-SLGN model have linearly related abundance levels (a non-zero element in the connectivity matrix of the undirected graph, wij ≠ 0). The ALPHA and CDC15 LP-SLGNs were evaluated from a biological perspective by manual analysis and visual inspection of LP-SLGNs estimated using the LP formulation based on a general class of linear functions and A = 1.01.
Figure 4 shows a small, illustrative portion of the ALPHA and CDC15 LP-SLGNs centered on the POL30 gene. For each of the genes depicted in the figure, the Saccharomyces Genome Database (SGD) [48] description, Gene Ontology (GO) [49] terms and InterPro [50] protein domains (when available) are listed in Additional file 3 ("Supplementary.pdf"). The genes connected to POL30 encode proteins that are associated with maintenance of genomic integrity (DNA recombination repair: RAD54, DOA1, HHF1, RAD27), cell cycle regulation, MAPK signalling and morphogenesis (BEM1, SWE1, CLN2, HSL1, ALX2/SRO4), nucleic acid and amino acid metabolism (RPB5, POL12, GAT1), and carbohydrate metabolism and cell wall biogenesis (CWP1, RPL40A, CHS2, MNN1, PIG2). Physiologically, the KEGG [51] pathways associated with these genes include "Cell cycle" (CDC5, CLN2, SWE1, HSL1), "MAPK signaling pathway" (BEM1), "DNA polymerase" (POL12), "RNA polymerase" (RPB5), "Aminosugars metabolism" (CHS2), "Starch and sucrose metabolism" (RAD54), "High-mannose type N-glycan biosynthesis" (MNN1), "Purine metabolism" (POL12, RPB5), "Pyrimidine metabolism" (POL12, RPB5), and "Folate biosynthesis" (RAD54).

Table 1: Comparison of the networks (undirected graphs) produced by three different approaches: the LP-based method proposed here, and the techniques proposed by the top two teams of the DREAM2 competition (Challenge 4). Columns k = 1 to k = 20 give the precision at the kth correct prediction; the last two columns give the areas under the PR and ROC curves.

Dataset    Team      k = 1     k = 2     k = 5     k = 20    AUPR      AUROC
INSILICO1  Team 70   1.000000  1.000000  1.000000  1.000000  0.596721  0.829266
INSILICO1  Team 80   0.142857  0.181818  0.045045  0.059524  0.070330  0.459704
INSILICO1  LP-SLGN   0.083333  0.086957  0.089286  0.117647  0.087302  0.509624
INSILICO2  Team 80   0.333333  0.074074  0.102041  0.069204  0.080266  0.536187
INSILICO2  Team 70   0.142857  0.250000  0.121320  0.081528  0.084303  0.511436
INSILICO2  LP-SLGN   1.000000  1.000000  0.192308  0.183486  0.200265  0.750921
INSILICO3  LP-SLGN   0.068966  0.068966  0.068966  0.068966  0.068966  0.500000

For the first k predictions (ranked by score; predictions with the same score are taken in the order they were submitted in the prediction files), the DREAM2 evaluation script defines precision as the fraction of the k predictions that are correct, and recall as the proportion of correct predictions out of all the possible true connections. The other metrics are the areas under the Precision-Recall (PR) and Receiver Operating Characteristic (ROC) curves.
The learned LP-SLGNs provide a basis for generating biological hypotheses and thus directions for future experimental investigations. The edge between SWE1 and BEM1 indicates that the transcript levels of these two genes exhibit a linear relationship; the physical interactions section of their SGD [48] entries indicates that the encoded proteins interact. These results suggest that cellular and/or environmental factor(s) that perturb the transcript levels of both SWE1 and BEM1 may affect cell polarity and the cell cycle. NCE102 is connected to genes involved in cell cycle regulation (CDC5) and cell wall remodelling (CWP1, MNN1). A recent report indicates that the transcript level of NCE102 changes when S. cerevisiae cells expressing human cytochrome CYP1A2 are treated with the hepatotoxin and hepatocarcinogen aflatoxin B1 [52]. Thus, this uncharacterized gene may be part of a cell cycle-related response to genotoxic and/or other stress.
Studies of the yeast NCE102 gene may be relevant to human health and disease. The protein encoded by NCE102 was used as the query for a PSI-BLAST [53] search using the WWW interface to the software at NCBI and
Figure 2
Quantitative evaluation of the S. cerevisiae network models. Statistical assessment of the LP-SLGNs estimated from the S. cerevisiae ALPHA and CDC15 data sets [37]. The left column shows plots of "Sparsity" (Equation 6) versus the user-defined parameter A (Equation 3). The right column shows plots of "LOO Error" (Equation 7) versus Sparsity. Each plot shows results for an LP formulation based on a general class of linear functions (diamond) and a positive class of linear functions (cross).
default parameter settings. Amongst the proteins exhibiting statistically significant similarity (E-value << 1e-05) were members of the mammalian physin and gyrin families, four-transmembrane domain proteins with roles in vesicle trafficking and membrane morphogenesis [54]. Human synaptogyrin 1 (SYNGR1; E-value ~ 1e-28) has been linked to schizophrenia and bipolar disorder [55].
Conclusion
Like this work, a previous study [17] framed the question of deducing the structure of a genetic network from transcript profiling data as a problem of sparse linear regression. The earlier investigation utilized SVD and robust regression to deduce the structure of a network. In particular, the set of all possible networks was characterized by a connectivity matrix A defined by the equation A = A0 + CV^T. The matrix A0, computed from the data matrix E via SVD, can be seen as the best connectivity matrix, in the l2-norm sense, that can generate the data. The columns of the matrix V are the right singular vectors of E. The requirement of a sparse graph was enforced by choosing the matrix C such that most of the entries in the matrix A are zero. An approximate solution to the original equation was obtained by posing it as a robust regression problem such that CV^T =
Figure 3
Node degree distribution of the S. cerevisiae network models. The distribution of the degrees of nodes in the LP-SLGNs estimated from the S. cerevisiae ALPHA and CDC15 data sets using both LP formulations (a general class of linear functions; a positive class of linear functions). The best-fit straight line in each logarithmic plot means that the number P(k) of nodes with degree k follows a power law, P(k) ∝ k^α. The goodness of fit and the value of the exponent α are given.
A0 was enforced approximately. This new regression problem was solved by formulating an LP that included an l1-norm penalty for deviations from equality. In contrast, the solution to the sparse linear regression problem proposed here avoids the need for SVD by formulating the problem directly within the framework of LOO Error and Empirical Risk Minimization and enforcing sparsity via an upper bound on the l1 norm of the weight vector, i.e., the original regression problem is posed as a series of LPs. The virtues of this LP-based approach for learning the structure of SLGNs include: (i) the method is tractable; (ii) a sparse graph is produced because very few predictor variables are used; (iii) the network model can be parametrized by a positive class of linear functions to produce LPs with few variables; (iv) efficient algorithms and resources for solving LPs in many thousands of variables and constraints are widely and freely available; and (v) the learned network models are biologically reasonable and can be used to devise hypotheses for subsequent experimental investigation.
Another method for deducing the structure of genetic networks framed the task as one of finding a sparse inverse covariance matrix from a sample covariance matrix [56]. This approach involved solving a maximum likelihood problem with an l1-norm penalty term added to encourage sparsity in the inverse covariance matrix. The algorithms proposed for this problem can do no better than O(N^3). Better results were achieved by incorporating prior information about errors in the sample covariance matrix. In contrast, the LP-based approach to the sparse linear regression problem avoids calculation of a covariance matrix and does not require prior knowledge. Furthermore, the approach proposed here can learn networks with thousands of genes in a few minutes on a personal computer.
The quality and utility of the learned LP-SLGNs could be enhanced in a number of ways. The network models examined here were estimated from transcript profiles that were subject to minimal data preprocessing. Appropriate low-level analysis of profiling data is known to be important [57], so estimating network models from suitably processed data would improve both their accuracy and reliability. The biological predictions were made by visual inspection of a small portion of the LP-SLGNs and in an ad hoc manner. Hypotheses could be generated in a systematic manner by exploiting statistical and topological properties of sparse undirected graphs. For example, a feature that unites the local and global aspects of a node is its "betweenness", the influence the node has over the spread of information through the graph. The random-walk betweenness centrality of a node [58] captures the proportion of times a node lies on the path between other nodes in the graph. Nodes with high betweenness but small degree (low connectivity) are likely to play a role in maintaining the integrity of the graph. Betweenness values could be computed from a weighted undirected graph created from an ensemble of LP-SLGNs produced by varying the user-defined parameter A. Given a variety of LP-SLGNs estimated from data, the cost of an edge could be equated with the frequency with which it appears in the learned network models. For the profiling data analyzed here, genes with high betweenness and low degree may have important but unrecognized roles in the S. cerevisiae cell cycle and hence correspond to good candidates for experimental investigations of this phenomenon.
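This ensemble-plus-betweenness idea can be sketched as follows (a minimal sketch following Newman's current-flow formulation of random-walk betweenness; the edge-frequency weighting and all names are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def random_walk_betweenness(weights):
    """Random-walk (current-flow) betweenness from a weighted adjacency
    matrix: for every source-target pair, inject unit current at the
    source, extract it at the target, and measure the current flowing
    through each intermediate node. Assumes a connected graph.
    """
    W = np.asarray(weights, dtype=float)
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W          # weighted graph Laplacian
    Lp = np.linalg.pinv(L)                  # pseudo-inverse gives potentials
    b = np.zeros(n)
    for s in range(n):
        for t in range(s + 1, n):
            rhs = np.zeros(n)
            rhs[s], rhs[t] = 1.0, -1.0
            v = Lp @ rhs                    # node potentials
            # throughput of node i: half the total current entering/leaving it
            flow = 0.5 * np.abs(v[:, None] - v[None, :]) * W
            through = flow.sum(axis=1)
            through[s] = through[t] = 0.0   # endpoints excluded
            b += through
    return b / ((n - 1) * (n - 2) / 2)      # pairs not containing each node

# Hypothetical ensemble of LP-SLGN adjacency matrices estimated at
# different A values; an edge's weight is the frequency of its appearance.
nets = [np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]]),
        np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]])]
W = np.mean(nets, axis=0)
centrality = random_walk_betweenness(W)
```

In this toy ensemble the middle node, which sits on the consistently recovered edges, receives the highest betweenness.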
The weighted sparse undirected graph described above could serve as the starting point for integrated computational and experimental studies aimed at learning the topology and probability parameters of a probabilistic directed graphical model, a more realistic representation of a genetic network because the edges are oriented and the statistical framework provides powerful tools for asking questions related to the values of variables (nodes) given the values of other variables (inference), handling hidden or unobserved variables, and so on. However, estimating the topology of probabilistic directed graphical model representations of genetic networks from transcript profiling data is challenging [59]. Genes with high betweenness and low degree could be targeted for intervention studies whereby a specific gene would be knocked out in order to determine the orientation of edges associated with it (see, for example, [60]). A variety of theoretical improvements
Figure 4
The local environment of POL30 in the S. cerevisiae network models. Genes connected to POL30 in the LP-SLGNs estimated from the S. cerevisiae ALPHA and CDC15 data sets (further information about the proteins encoded by the genes shown can be found in Additional file 3). Genes in black (SWE1, POL12, CDC5, NCE102) were assigned to the same expression cluster in the original transcript profiling study [37]. Functionally related genes are boxed.
are possible. An explicit model for uncertainty in transcript profiling data could be used to formulate and then solve robust sparse linear regression problems and hence produce models of genetic networks that are more resilient to variation in training data than those generated using the Huber loss function considered here. Expanding the class of interactions from linear models to non-linear models is an important research topic.
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
SB, CB and ISM conceived and developed the computational ideas presented in this work. SB and CB formulated the optimization problems, wrote the software and performed the experiments. NC analyzed the data with contributions from the other authors. All authors read and approved the final version of the manuscript.
Note
1. http://mllab.csa.iisc.ernet.in/html/users/sahely/Network_yeast.html
Additional material
Acknowledgements
ISM was supported by grants from the U.S. National Institute on Aging and
U.S. Department of Energy (OBER). CB and NC are supported by a grant
from MHRD, Government of India.
References
1. GEO [http://www.ncbi.nlm.nih.gov/geo/]
2. ArrayExpress [http://www.ebi.ac.uk/arrayexpress/]
3. Arnone MI, Davidson EH: Hardwiring of Development: Organization and function of Genomic Regulatory Systems. Development 1997, 124:1851-1864.
4. Guelzim N, Bottani S, Bourgine P, Képès F: Topological and causal structure of the yeast transcriptional regulatory network. Nature Genetics 2002, 31:60-63.
5. Luscombe NM, Babu MM, Yu H, Snyder M, Teichmann SA, Gerstein M: Genomic analysis of regulatory network dynamics reveals large topological changes. Nature 2004, 431:308-312.
6. Jordan M: Graphical models. Statistical Science 2004, 19:140-155.
7. Spirtes P, Glymour C, Scheines R, Kauffman S, Aimale V, Wimberly F: Constructing Bayesian Network models of gene expression networks from microarray data. Proceedings of the Atlantic Symposium on Computational Biology, Genome Information Systems & Technology 2000.
8. Jong HD: Modeling and Simulation of Genetic Regulatory Systems: A Literature review. Journal of Computational Biology 2002, 9:67-103.
9. Wessels LFA, Someren EPA, Reinders MJT: A comparison of genetic network models. Pacific Symposium on Biocomputing '01 2001, 6:508-519.
10. Andrecut M, Kauffman SA: A simple method for reverse engineering causal networks. Journal of Physics A: Mathematical and General (46).
11. Liang S, Fuhrman S, Somogyi R: Reveal, a general reverse engineering algorithm for inference of genetic network architectures. Pac Symp Biocomput 1998:18-29.
12. Akutsu T, Miyano S, Kuhara S: Identification of genetic networks from a small number of gene expression patterns under the Boolean network model. Pacific Symposium on Biocomputing 1999, 4:17-28.
13. Shmulevich I, Dougherty E, Kim S, Zhang W: Probabilistic Boolean Networks: a rule-based uncertainty model for gene regulatory networks. Bioinformatics 2002, 18:261-274.
14. Friedman N, Yakhini Z: On the sample complexity of learning Bayesian networks. Conference on Uncertainty in Artificial Intelligence 1996:272-282.
15. D'Haeseleer P, Wen X, Fuhrman S, Somogyi R: Linear modelling of mRNA expression levels during CNS development and injury. Pacific Symposium on Biocomputing '99 1999, 4:41-52.
16. Someren E, Wessels LFA, Reinders M: Linear Modelling of genetic networks from experimental data. Proceedings of the eighth international conference on Intelligent Systems for Molecular Biology 2000:355-366.
17. Yeung M, Tegnér J, Collins J: Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci USA 2002, 99:6163-6168.
18. Stolovitzky G, Monroe D, Califano A: Dialogue on Reverse-Engineering Assessment and Methods: The DREAM of High-Throughput Pathway Inference. Annals of the New York Academy of Sciences 2007, 1115:1-22.
19. Weaver D, Workman C, Stormo G: Modelling regulatory networks with weight matrices. Pacific Symposium on Biocomputing '99 1999, 4:112-123.
20. Chen T, He H, Church G: Modelling gene expression with differential equations. Pacific Symposium on Biocomputing '99 1999, 4:29-40.
21. Butte A, Tamayo P, Slonim D, Golub T, Kohane I: Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci USA 2000, 97:12182-12186.
22. Basso K, Margolin A, Stolovitzky G, Klein U, Dalla-Favera R, Califano A: Reverse engineering of regulatory networks in human B cells. Nature Genetics 2005, 37:382-390.
23. Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Dalla Favera R, Califano A: ARACNE: an algorithm for the reconstruction of gene regulatory networks in a mammalian cellular context. BMC Bioinformatics 2006, 7(Suppl 1).
24. Schäfer J, Strimmer K: An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics 2005, 21:754-764.
25. Friedman N: Inferring Cellular Networks Using Probabilistic Graphical Models. Science 2004, 303(5659):799-805.
26. Andrecut M, Kauffman SA: On the sparse reconstruction of gene networks. Journal of Computational Biology.
Additional file 1
The code of LP-SLGN is available here.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1748-7188-4-5-S1.tar]
Additional file 2
Predicted networks obtained for the InSilico and Yeast data sets using LP-SLGN are available here.
Click here for file
[http://www.biomedcentral.com/content/supplementary/1748-7188-4-5-S2.tar]
Additional file 3
Information about the proteins encoded by the genes depicted in Figure 4. For each gene, the Saccharomyces Genome Database (SGD) [48] description, Gene Ontology (GO) [49] terms and InterPro [50] protein domains are listed (when available).
Click here for file
[http://www.biomedcentral.com/content/supplementary/1748-7188-4-5-S3.pdf]
27. Andrecut M, Huang S, Kauffman SA: Heuristic Approach to Sparse Approximation of Gene Regulatory Networks. Journal of Computational Biology 2008, 15(9):1173-1186.
28. Akutsu T, Kuhara S, Maruyama O, Miyano S: Identification of Gene Regulatory Networks by Strategic Gene Disruptions and Gene Overexpressions. SODA 1998:695-702.
29. Murphy K, Mian I: Modelling gene expression data using Dynamic Bayesian Networks. Tech. rep., Division of Computer Science, University of California Berkeley; 1999 [http://www.cs.berkeley.edu/~murphyk/Papers/ismb99.ps.gz].
30. Murphy K: Learning Bayes net structure from sparse data sets. Tech. rep., Division of Computer Science, University of California Berkeley; 2001 [http://http.cs.berkeley.edu/~murphyk/Papers/bayesBNlearn.ps.gz].
31. Friedman N, Linial M, Nachman I, Pe'er D: Using Bayesian Networks to Analyze Expression Data. Journal of Computational Biology 2000, 7:601-620.
32. Imoto S, Kim S, Goto T, Aburatani S, Tashiro K, Kuhara S, Miyano S: Bayesian Networks and Heteroscedastic regression for nonlinear modelling of Genetic Networks. Computer Society Bioinformatics Conference 2002:219-227.
33. Hartemink A, Gifford D, Jaakkola T, Young R: Using Graphical Models and Genomic Expression Data to Statistically Validate Models of Genetic Regulatory Networks. In Pacific Symposium on Biocomputing 2001 (PSB01). Edited by: Altman R, Dunker A, Hunter L, Lauderdale K, Klein T. New Jersey: World Scientific; 2001:422-433.
34. Tibshirani R: Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B 1996, 58:267-288.
35. Kaern M, Elston T, Blake W, Collins J: Stochasticity in gene expression: from theories to phenotypes. Nature Reviews Genetics 2005, 6:451-464.
36. DREAM Project [http://wiki.c2b2.columbia.edu/dream/index.php/The_DREAM_Project/DREAM2_Data]
37. Eisen M, Spellman P, Brown P, Botstein D: Cluster Analysis and display of genome-wide expression patterns. Proceedings of the National Academy of Sciences of the USA 1998, 95:14863-14868.
38. Scoring Methodologies for DREAM2 [http://wiki.c2b2.columbia.edu/dream/data/golstandardScoring_Methodologies_for_DREAM2.doc]
39. Amaldi E, Kann V: On the approximability of minimizing nonzero variables or unsatisfied relations in linear systems. Theoretical Computer Science 1998.
40. Chen SS, Donoho DL, Saunders MA: Atomic Decomposition by Basis Pursuit. Dept. of Statistics Technical Report, Stanford University; 1996.
41. Donoho DL, Elad M, Temlyakov V: Stable recovery of sparse overcomplete representations in the presence of noise. IEEE Trans Inform Theory 2004, 52:6-18.
42. Weston J, Elisseeff A, Schölkopf B, Tipping M: Use of the Zero-Norm with Linear Models and Kernel Methods. Journal of Machine Learning Research 2003, 3.
43. McDiarmid C: On the method of bounded differences. In Surveys in Combinatorics. Cambridge University Press; 1989:148-188.
44. Bousquet O, Elisseeff A: Stability and Generalization. Tech. rep., Centre de Mathematiques Appliquees; 2000.
45. MATLAB [http://www.mathworks.com/products/matlab/]
46. Lpsolve [http://packages.debian.org/stable/math/lpsolve]
47. Newman M: The physics of Networks. Physics Today 2008.
48. SGD [http://www.yeastgenome.org/]
49. GO [http://www.geneontology.org/]
50. InterPro [http://www.ebi.ac.uk/interpro/]
51. KEGG [http://www.genome.jp/kegg/pathway.html]
52. Guo Y, Breeden L, Fan W, Zhao L, Eaton D, Zarbl H: Analysis of cellular responses to aflatoxin B(1) in yeast expressing human cytochrome P450 1A2 using cDNA microarrays. Mutat Res 2006, 593:121-142.
53. BLAST [http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html]
54. Hubner K, Windoffer R, Hutter H, Leube R: Tetraspan vesicle membrane proteins: synthesis, subcellular localization, and functional properties. Int Rev Cytol 2002, 214:103-159.
55. Verma R, Kubendran S, Das SSK, Jain, Brahmachari S: SYNGR1 is associated with schizophrenia and bipolar disorder in southern India. J Hum Genet 2005, 50:635-640.
56. Banerjee O, Ghaoui LE, d'Aspremont A, Natsoulis G: Convex optimization techniques for fitting sparse Gaussian graphical models. ICML '06 2006:89-96.
57. Rubinstein B, McAuliffe J, Cawley S, Palaniswami M, Ramamohanarao K, Speed T: Machine Learning in Low-Level Microarray Analysis. SIGKDD Explorations 2003, 5.
58. Newman M: A measure of betweenness centrality based on random walks. 2003 [http://aps.arxiv.org/abs/cond-mat/0309045/].
59. Friedman N, Koller D: Being Bayesian about network structure: a Bayesian approach to structure discovery in Bayesian Networks. Machine Learning 2003, 50:95-126.
60. Sachs K, Perez O, Pe'er D, Lauffenburger D, Nolan G: Causal protein-signaling networks derived from multiparameter single-cell data. Science 2005, 308:523-529.