
BioMed Central

Page 1 of 15

(page number not for citation purposes)

Algorithms for Molecular Biology

Open Access

Research

A linear programming approach for estimating the structure of a

sparse linear genetic network from transcript profiling data

Sahely Bhadra1, Chiranjib Bhattacharyya*1,2, Nagasuma R Chandra*2 and I

Saira Mian3

Address: 1Department of Computer Science and Automation, Indian Institute of Science, Bangalore, Karnataka, India, 2Bioinformatics Centre,

Indian Institute of Science, Bangalore, Karnataka, India and 3Life Sciences Division, Lawrence Berkeley National Laboratory, Berkeley, California

94720, USA

Email: Sahely Bhadra - sahely@csa.iisc.ernet.in; Chiranjib Bhattacharyya* - chiru@csa.iisc.ernet.in;

Nagasuma R Chandra* - nchandra@serc.iisc.ernet.in; I Saira Mian - smian@lbl.gov

* Corresponding authors

Abstract

Background: A genetic network can be represented as a directed graph in which a node

corresponds to a gene and a directed edge specifies the direction of influence of one gene on

another. The reconstruction of such networks from transcript profiling data remains an important

yet challenging endeavor. A transcript profile specifies the abundances of many genes in a biological

sample of interest. Prevailing strategies for learning the structure of a genetic network from high-

dimensional transcript profiling data assume sparsity and linearity. Many methods consider

relatively small directed graphs, inferring graphs with up to a few hundred nodes. This work

examines large undirected graph representations of genetic networks, graphs with many

thousands of nodes where an undirected edge between two nodes does not indicate the direction

of influence, and the problem of estimating the structure of such a sparse linear genetic network

(SLGN) from transcript profiling data.

Results: The structure learning task is cast as a sparse linear regression problem which is then

posed as a LASSO (l1-constrained fitting) problem and finally solved by formulating a Linear

Program (LP). A bound on the Generalization Error of this approach is given in terms of the Leave-

One-Out Error. The accuracy and utility of LP-SLGNs are assessed quantitatively and qualitatively

using simulated and real data. The Dialogue for Reverse Engineering Assessments and Methods

(DREAM) initiative provides gold standard data sets and evaluation metrics that enable and facilitate

the comparison of algorithms for deducing the structure of networks. The structures of LP-SLGNs

estimated from the INSILICO1, INSILICO2 and INSILICO3 simulated DREAM2 data sets are

comparable to those proposed by the first and/or second ranked teams in the DREAM2

competition. The structures of LP-SLGNs estimated from two published Saccharomyces cerevisiae

cell cycle transcript profiling data sets capture known regulatory associations. In each S. cerevisiae

LP-SLGN, the number of nodes with a particular degree follows an approximate power law

suggesting that its degree distribution is similar to those observed in real-world networks.

Inspection of these LP-SLGNs suggests biological hypotheses amenable to experimental

verification.

Published: 24 February 2009

Algorithms for Molecular Biology 2009, 4:5 doi:10.1186/1748-7188-4-5

Received: 30 May 2008

Accepted: 24 February 2009

This article is available from: http://www.almob.org/content/4/1/5

© 2009 Bhadra et al; licensee BioMed Central Ltd.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0),

which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


Conclusion: A statistically robust and computationally efficient LP-based method for estimating

the topology of a large sparse undirected graph from high-dimensional data yields representations

of genetic networks that are biologically plausible and useful abstractions of the structures of real

genetic networks. Analysis of the statistical and topological properties of learned LP-SLGNs may

have practical value; for example, genes with high random walk betweenness, a measure of the

centrality of a node in a graph, are good candidates for intervention studies and hence integrated

computational – experimental investigations designed to infer more realistic and sophisticated

probabilistic directed graphical model representations of genetic networks. The LP-based solutions

of the sparse linear regression problem described here may provide a method for learning the

structure of transcription factor networks from transcript profiling and transcription factor binding

motif data.

Background

Understanding the dynamic organization and function of

networks involving molecules such as transcripts and pro-

teins is important for many areas of biology. The ready

availability of high-dimensional data sets generated using

high-throughput molecular profiling technologies has

stimulated research into mathematical, statistical, and

probabilistic models of networks. For example, GEO [1]

and ArrayExpress [2] are public repositories of well-anno-

tated and curated transcript profiling data from diverse

species and varied phenomena obtained using different

platforms and technologies.

A genetic network can be represented as a graph consisting

of a set of nodes and a set of edges. A node corresponds to

a gene (transcript) and an edge between two nodes

denotes an interaction between the connected genes that

may be linear or non-linear. In a directed graph, the ori-

ented edge A → B signifies that gene A influences gene B.

In an undirected graph, the un-oriented edge A - B

encodes a symmetric relationship and signifies that genes

A and B may be co-expressed, co-regulated, interact or

share some other common property. Empirical observa-

tions indicate that most genes are regulated by a small

number of other genes, usually fewer than ten [3-5].

Hence, a genetic network can be viewed as a sparse graph,

i.e., a graph in which a node is connected to a handful of

other nodes. If directed (acyclic) graphs or undirected

graphs are imbued with probabilities, the result is proba-

bilistic directed graphical models and probabilistic undi-

rected graphical models respectively [6].

Extant approaches for deducing the structure of genetic

networks from transcript profiling data [7-9] include

Boolean networks [10-14], linear models [15-18], neural

networks [19], differential equations [20], pairwise

mutual information [10,21-23], Gaussian graphical mod-

els [24,25], heuristic approaches [26,27], and co-expres-

sion clustering [16,28]. Theoretical studies of sample

complexity indicate that although sparse directed acyclic

graphs or Boolean networks could be learned, inference

would be problematic because in current data sets, the

number of variables (genes) far exceeds the number of

observations (transcript profiles) [12,14,25]. Although

probabilistic graphical models provide a powerful frame-

work for representing, modeling, exploring, and making

inferences about genetic networks, there remain many

challenges in learning tabula rasa the topology and proba-

bility parameters of large, directed (acyclic) probabilistic

graphical models from uncertain, high-dimensional tran-

script profiling data [7,25,29-33]. Dynamic programming

approaches [26,27] use Singular Value Decomposition

(SVD) to pre-process the data and heuristics to determine

stopping criteria. These methods have high computa-

tional complexity and yield approximate solutions.

This work focuses on a plausible, albeit incomplete repre-

sentation of a genetic network – a sparse undirected graph

– and the task of estimating the structure of such a net-

work from high-dimensional transcript profiling data.

Since the degree of every node in a sparse graph is small,

the model embodies the biological notion that a gene is

regulated by only a few other genes. An undirected edge

indicates that although the expression levels of two con-

nected genes are related, the direction of influence is not

specified. The final simplification is that of restricting the

type of interaction that can occur between two genes to a

single class, namely a linear relationship. This particular

representation of a genetic network is termed a sparse lin-

ear genetic network (SLGN).

Here, the task of learning the structure of a SLGN is

equated with that of solving a collection of sparse linear

regression problems, one for each gene in a network

(node in the graph). Each linear regression problem is

posed as a LASSO (l1-constrained fitting) problem [34]

that is solved by formulating a Linear Program (LP). A vir-

tue of this LP-based approach is that the use of the Huber

loss function reduces the impact of variation in the train-

ing data on the weight vector that is estimated by regres-

sion analysis. This feature is of practical importance

because technical noise arising from the transcript profil-


ing platform used coupled with the stochastic nature of

gene expression [35] leads to variation in measured abun-

dance values. Thus, the ability to estimate parameters in a

robust manner should increase confidence in the structure

of an LP-SLGN estimated from noisy transcript profiles.

An additional benefit of the approach is that the LP for-

mulations can be solved quickly and efficiently using

widely available software and tools capable of solving LPs

involving tens of thousands of variables and constraints

on a desktop computer.

Two different LP formulations are proposed: one based on

a positive class of linear functions and the other on a gen-

eral class of linear functions. The accuracy of this LP-based

approach for deducing the structure of networks is

assessed statistically using gold standard data and evalua-

tion metrics from the Dialogue for Reverse Engineering

Assessments and Methods (DREAM) initiative [36]. The

LP-based approach compares favourably with algorithms

proposed by the top two ranked teams in the DREAM2

competition. The practical utility of LP-SLGNs is exam-

ined by estimating and analyzing network models from

two published Saccharomyces cerevisiae transcript profiling

data sets [37] (ALPHA; CDC15). The node degree distri-

butions of the learned S. cerevisiae LP-SLGNs, undirected

graphs with many hundreds of nodes and thousands of

edges, follow approximate power laws, a feature observed

in real biological networks. Inspection of these LP-SLGNs

from a biological perspective suggests they capture known

regulatory associations and thus provide plausible and

useful approximations of real genetic networks.

Methods

Genetic network: sparse linear undirected graph

representation

A genetic network can be viewed as an undirected graph,

G = {V, W}, where V is a set of N nodes (one for each

gene in the network), and W is an N × N connectivity

matrix encoding the set of edges. The (i, j)th element of the

matrix W specifies whether nodes i and j do (Wij ≠ 0) or

do not (Wij = 0) influence each other. The degree of node

n, kn, indicates the number of other nodes connected to n

and is equivalent to the number of non-zero elements in

row n of W. In real genetic networks, a gene is often regulated by a small number of other genes [3,4], so a reasonable representation of a network is a sparse graph. A sparse

graph is a graph parametrized by a sparse matrix W, a

matrix with few non-zero elements Wij, and where most

nodes have a small degree, kn < 10.

Linear interaction model: static and dynamic settings

If the relationship between two genes is restricted to the

class of linear models, the abundance value of a gene is

treated as a weighted sum of the abundance values of

other genes. A high-dimensional transcript profile is a vec-

tor of abundance values for N genes. An N × T matrix E is

the concatenation of T profiles, [e(1),..., e(T)], where e(t)

= [e1(t),..., eN(t)]^T and en(t) is the abundance of gene n in

profile t. In most extant profiling studies, the number of

transcripts monitored exceeds the number of available

profiles (N ≫ T).

In the static setting, the T transcript profiles in the data

matrix E are assumed to be unrelated and so independent

of one another. In the linear interaction model, the abundance value of a gene is treated as a weighted sum of the abundance values of all genes in the same profile (Equation 1).

The parameter wn = [wn1,..., wnN]^T is a weight vector for

gene n and the jth element indicates whether genes n and j

do (wnj ≠ 0) or do not (wnj = 0) influence each other. The

constraint wnn = 0 prevents gene n from influencing itself

at the same instant so its abundance is a function of the

abundances of the remaining N - 1 genes in the same pro-

file.

In the dynamic setting, the T transcript profiles in E are

assumed to form a time series. In the linear interaction

model, the abundance value of a gene at time t is treated

as a weighted sum of the abundance values of all genes in

the profile from the previous time point, t - 1, i.e.,

en(t) = wn^T e(t − 1). There is no constraint wnn = 0 because

a gene can influence its own abundance at the next time

point.

As described in detail below, the SLGN structure learning

problem involves solving N independent sparse linear

regression problems, one for each node in the graph (gene

in the network), such that every weight vector wn is sparse.

The sparse linear regression problem is cast as an LP and

uses a loss function which ensures that the weight vector

is resilient to small changes in the training data. Two LPs

are formulated and each formulation contains one user-

defined parameter, A, the upper bound of the l1 norm of

the weight vector. One LP is based on a general class of lin-

ear functions. The other LP formulation is based on a pos-

itive class of linear functions and yields an LP with fewer

variables than the first.

en(t) = wn^T e(t) = Σ_{j=1}^{N} wnj ej(t),  where wnn = 0   (1)


Simulated and real data

DREAM2 In-Silico-Network Challenges data

A component of Challenge 4 of the DREAM2 competition

[38] is predicting the connectivity of three in silico net-

works generated using simulations of biological interac-

tions. Each DREAM2 data set includes time courses

(trajectories) of the network recovering from several exter-

nal perturbations. The INSILICO1 data were produced from

a gene network with 50 genes where the rate of synthesis

of the mRNA of each gene is affected by the mRNA levels

of other genes; there are 23 different perturbations and 26

time points for each perturbation. The INSILICO2 data are

similar to INSILICO1 but the topology of the 50-gene net-

work is qualitatively different. The INSILICO3 data were

produced from a full in silico biochemical network that

had 16 metabolites, 23 proteins and 20 genes (mRNA

concentrations); there are 22 different perturbations and

26 time points for each perturbation. Since the LP-based

method yields network models in the form of undirected

graphs, the data were used to make predictions in the

DREAM2 competition category UNDIRECTED-UNSIGNED. Thus, the simulated data sets used to estimate LP-SLGNs are an N = 50 × T = 26 matrix (INSILICO1), an N = 50 × T = 26 matrix (INSILICO2), and an N = 59 × T = 26 matrix (INSILICO3).

S. cerevisiae transcript profiling data

A published study of S. cerevisiae monitored 2,467 genes

at various time points and under different conditions

[37]. In the investigations designated ALPHA and CDC15,

measurements were made over T = 15 and T = 18 time

points respectively. Here, a gene was retained only if an

abundance measurement was present in all 33 profiles.

Only 605 genes met this criterion of no missing values

and these data were not processed any further. Thus, the

real transcript profiling data sets used to estimate LP-

SLGNs are an N = 605 × T = 15 matrix (ALPHA) and an N

= 605 × T = 18 matrix (CDC15).
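The filtering criterion described above (retain a gene only if an abundance measurement is present in every profile) can be sketched as follows; the small matrix and its values are illustrative, not the actual ALPHA/CDC15 data:

```python
import numpy as np

# Hypothetical abundance matrix: rows are genes, columns are profiles;
# np.nan marks a missing measurement.
E = np.array([[1.0,    2.0, 0.5],
              [np.nan, 1.2, 0.7],
              [0.3,    0.9, 1.1]])

# Keep only genes (rows) with no missing values in any profile.
complete = ~np.isnan(E).any(axis=1)
E_filtered = E[complete]
```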

Training data for regression analysis

A training set for regression analysis is created by generating training points for each gene from the data matrix E. For gene n, the training points are Dn = {(xni, yni)}, i = 1,..., I. The ith training point consists of an "input" vector, xni = [x1i,..., xNi]^T (abundance values for N genes), and an "output" scalar yni (abundance value for gene n).

In the static setting, I = T training points are created

because both the input and output are generated from the

same profile; the linear interaction model (Equation 1)

includes the constraint wnn = 0. If en(t) is the abundance of

gene n in profile t, the ith training point is xni = e(t) =

[e1(t),..., eN(t)], yni = en(t), and t = 1,..., T.

In the dynamic setting, I = T - 1 training points are created

because the output is generated from the profile for a

given time point whereas the input is generated from the

profile for the previous time point; there is no constraint

wnn = 0 in the linear interaction model. The ith training

point is xni = e(t - 1) = [e1(t - 1),..., eN(t - 1)], yni = en(t), and

t = 2,..., T.

The results reported below are based on training data gen-

erated under a static setting so the constraint wnn = 0 is

imposed.
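The static and dynamic constructions above can be sketched as a small helper; this is an illustrative reimplementation (the function name is not from the paper), assuming E is an N × T NumPy array:

```python
import numpy as np

def training_points(E, n, dynamic=False):
    """Build regression training points for gene n from an N x T matrix E.

    Static setting:  x_i = e(t),     y_i = e_n(t),  t = 1,..., T   (I = T)
    Dynamic setting: x_i = e(t - 1), y_i = e_n(t),  t = 2,..., T   (I = T - 1)
    """
    if dynamic:
        X = E[:, :-1].T   # inputs: full profiles at the previous time point
        y = E[n, 1:]      # outputs: gene n's abundance at the next time point
    else:
        X = E.T           # inputs: full profiles
        y = E[n, :]       # outputs: gene n's abundance in the same profile
    return X, y
```

In the static setting the constraint wnn = 0 must still be imposed when the regression is solved, since gene n itself appears in each input vector.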

Notation

Let ℝ^N denote the N-dimensional Euclidean vector space and card(A) the cardinality of a set A. For a vector x = [x1,..., xN]^T in this space, the l2 (Euclidean) norm is the square root of the sum of the squares of its elements, ||x||2 = (Σ_{n=1}^{N} xn^2)^{1/2}; the l1 norm is the sum of the absolute values of its elements, ||x||1 = Σ_{n=1}^{N} |xn|; and the l0 norm is the total number of non-zero elements, ||x||0 = card({n | xn ≠ 0; 1 ≤ n ≤ N}). The term x ≥ 0 signifies that every element of the vector is zero or positive, xn ≥ 0, ∀n ∈ {1,..., N}. The one- and zero-vectors are 1 = [1,..., 1]^T and 0 = [0,..., 0]^T respectively.
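The three norms can be computed directly, e.g. with NumPy:

```python
import numpy as np

x = np.array([3.0, 0.0, -4.0])

l2 = np.sqrt(np.sum(x ** 2))   # Euclidean norm: sqrt(9 + 16) = 5.0
l1 = np.sum(np.abs(x))         # sum of absolute values: 7.0
l0 = np.count_nonzero(x)       # number of non-zero elements: 2
```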

Sparse linear regression: an LP-based formulation

Given a training set for gene n, Dn = {(xni, yni) | xni ∈ ℝ^N; yni ∈ ℝ; i = 1,..., I} (Equation 2), the sparse linear regression problem is the task of inferring a sparse weight vector, wn, under the assumption that gene-gene interactions obey a linear model, i.e., the abundance of gene n, yni, is a weighted sum of the abundances of other genes, yni = wn^T xni.

Sparse weight vector estimation

l0 norm minimization

The problem of learning the structure of an SLGN involves estimating a weight vector w that best approximates y and in which most of the elements are zero. Thus, one strategy

for obtaining sparsity is to stipulate that w should have at

most k non-zero elements, ||w||0 ≤ k. The value of k is

equivalent to the degree of the node so a biologically

plausible constraint for a genetic network is ||w||0 ≤ 10.

Dn = {(xni, yni) | xni ∈ ℝ^N; yni ∈ ℝ; i = 1,..., I}   (2)


Given a value of k, the number of possible choices of predictors that must be examined is ^N C_k. Since there are many genes (N is large) and each choice of predictor variables requires solving an optimization problem, learning a sparse weight vector using an l0 norm-based approach is prohibitive, even for small k. Furthermore, the problem is NP-hard [39] and cannot even be approximated in time 2^{log^{1−ε} N}, where ε is a small positive quantity.

LASSO

A tractable approximation of the l0 norm is the l1 norm [40,41] (for other approximations see [42]). LASSO [34] uses an upper bound for the l1 norm of the weight vector, specified by a parameter A, and formulates the l1 norm minimization problem as Equation (3). This formulation attempts to choose w so as to minimize deviations between the predicted and the actual values of y. In particular, w is chosen to minimize the loss function L(w) = Σ_{i=1}^{I} |yi − w^T xi|. Here, "Empirical Error" is used as the loss function. The Empirical Error of a graph is (1/N) Σ_{n=1}^{N} EmpiricalError(n), where EmpiricalError(n) = (1/I) Σ_{i=1}^{I} |yni − f(xni; wn)|. The user-defined parameter A controls the upper bound of the l1 norm of the weight vector and hence the trade-off between sparsity and accuracy. If A = 0, the result is a poor approximation, as the most sparse solution is the zero weight vector, w = 0. When A = ∞, deviations are not allowed and a non-sparse w is found if the problem is feasible.

LP formulation: general class of linear functions

Consider the robust regression function f(.; w). For the

general class of linear functions, f(x; w) = w^T x, an element

of the parameter vector can be zero, wj = 0, or non-zero, wj

≠ 0. When wj > 0, the predictor variable j makes a positive

contribution to the linear interaction model, whereas if wj

< 0, the contribution is negative. Since the representation

of a genetic network considered here is an undirected

graph and thus the connectivity matrix is symmetric, the

interactions (edges) in a SLGN are not categorized as acti-

vation or inhibition.

For the general class of linear functions f(x; w) = w^T x, an element of the weight vector w can be positive, negative or zero. The LASSO problem (Equation 3) can be posed as the LP in Equation (4) by substituting w = u − v, ||w||1 = (u + v)^T 1, |vi| = ξi + ξi* and vi = ξi − ξi*. The user-defined parameter A controls the upper bound of the l1 norm of the weight vector and thus the trade-off between sparsity and accuracy. Problem (4) is an LP in (2N + 2I) variables, with I equality constraints, 1 inequality constraint and (2N + 2I) non-negativity constraints.
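A minimal sketch of Problem (4) using scipy.optimize.linprog; this is an illustrative reimplementation under the substitution described above (variable layout [u, v, ξ, ξ*]), not the authors' software:

```python
import numpy as np
from scipy.optimize import linprog

def lasso_lp_general(X, y, A):
    """Solve the LASSO problem as the LP of Problem (4), general class.

    X : (I, N) array whose rows are the input vectors x_i
    y : (I,) array of outputs y_i
    A : upper bound on the l1 norm of the weight vector
    Returns the weight vector w = u - v.
    """
    I, N = X.shape
    # Variable layout: [u (N), v (N), xi (I), xi* (I)], all non-negative.
    c = np.concatenate([np.zeros(2 * N), np.ones(2 * I)])  # minimize sum(xi + xi*)
    # Equality constraints: (u - v)^T x_i + xi_i - xi*_i = y_i
    A_eq = np.hstack([X, -X, np.eye(I), -np.eye(I)])
    # Inequality constraint: (u + v)^T 1 <= A
    A_ub = np.concatenate([np.ones(2 * N), np.zeros(2 * I)])[None, :]
    res = linprog(c, A_ub=A_ub, b_ub=[A], A_eq=A_eq, b_eq=y,
                  bounds=(0, None), method="highs")
    u, v = res.x[:N], res.x[N:2 * N]
    return u - v
```

With A large enough to admit an exact fit, the recovered w reproduces the generating weights; shrinking A trades fit accuracy for sparsity.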

LP formulation: positive class of linear functions

An optimization problem with fewer variables than problem (4) can be formulated by considering a weaker class of linear functions. For the positive class of linear functions f(x; w) = w^T x, every element of the weight vector w must be non-negative, wj ≥ 0. Then, the LASSO problem (Equation 3) can be posed as the LP in Equation (5). Problem (5) is an LP with (N + 2I) variables, I equality constraints, 1 inequality constraint, and (N + 2I) non-negativity constraints.
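Problem (5) can be set up the same way with fewer variables, since w itself is constrained to be non-negative; again an illustrative sketch with scipy.optimize.linprog:

```python
import numpy as np
from scipy.optimize import linprog

def lasso_lp_positive(X, y, A):
    """Solve the LASSO problem as the LP of Problem (5), positive class (w >= 0)."""
    I, N = X.shape
    # Variable layout: [w (N), xi (I), xi* (I)], all non-negative.
    c = np.concatenate([np.zeros(N), np.ones(2 * I)])    # minimize sum(xi + xi*)
    A_eq = np.hstack([X, np.eye(I), -np.eye(I)])         # w^T x_i + xi_i - xi*_i = y_i
    A_ub = np.concatenate([np.ones(N), np.zeros(2 * I)])[None, :]  # w^T 1 <= A
    res = linprog(c, A_ub=A_ub, b_ub=[A], A_eq=A_eq, b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:N]
```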

In most transcript profiling studies, the number of genes

monitored is considerably greater than the number of

profiles produced, N ≫ I. Thus, an LP based on a restrictive

minimize_{w,v}  Σ_{i=1}^{I} |vi|
subject to  w^T xi + vi = yi,  i = 1,..., I;
            ||w||1 ≤ A.   (3)

minimize_{u,v,ξ,ξ*}  Σ_{i=1}^{I} (ξi + ξi*)
subject to  (u − v)^T xi + ξi − ξi* = yi,  i = 1,..., I;
            (u + v)^T 1 ≤ A;  u ≥ 0;  v ≥ 0;  ξ ≥ 0;  ξ* ≥ 0.   (4)

minimize_{w,ξ,ξ*}  Σ_{i=1}^{I} (ξi + ξi*)
subject to  w^T xi + ξi − ξi* = yi,  i = 1,..., I;
            w^T 1 ≤ A;  w ≥ 0;  ξ ≥ 0;  ξ* ≥ 0.   (5)