
Inferring Adaptive Regulation Thresholds and Association Rules from Gene Expression Data through Combinatorial Optimization Learning

Ignacio Ponzoni, Francisco J. Azuaje, Juan Carlos Augusto, and David H. Glass
Abstract—There is a need to design computational methods to support the prediction of gene regulatory networks (GRNs). Such
models should offer both biologically meaningful and computationally accurate predictions which, in combination with other techniques,
may improve large-scale integrative studies. This paper presents a new machine-learning method for the prediction of putative
regulatory associations from expression data which exhibit properties never or only partially addressed by other techniques recently
published. The method was tested on a Saccharomyces cerevisiae gene expression data set. The results were statistically validated
and compared with the relationships inferred by two machine-learning approaches to GRN prediction. Furthermore, the resulting
predictions were assessed using domain knowledge. The proposed algorithm may be able to accurately predict relevant biological
associations between genes. One of the most relevant features of this new method is the prediction of adaptive regulation thresholds
for the discretization of gene expression values, which is required prior to the rule association learning process. Moreover, an important
advantage consists of its low computational cost to infer association rules. The proposed system may significantly support exploratory
large-scale studies of automated identification of potentially relevant gene expression associations.
Index Terms—Combinatorial optimization, genetic regulatory networks, machine learning, gene expression data, decision trees.
1 BACKGROUND
A gene regulatory network (GRN) aims to represent
high-level relationships that govern the rates at which
genes in the network are transcribed into mRNA. In this
way, genes can be viewed as nodes in this network whose
expression levels (outputs) are controlled by other nodes
(transcription factors). Nowadays, the inference, modeling,
and simulation of GRNs is a fundamental topic in
functional genomics [1], [2]. Over the past few years,
several statistical and artificial intelligence techniques have
been proposed to carry out the reverse engineering of GRNs
from monitoring and analyzing large-scale gene expression
data [3], [4]. Clustering algorithms represented one of the
first approaches to support the large-scale identification of
regulatory modules [5], [6]. Such an approach approxi-
mated regulatory networks by 1) identifying groups of
coexpressed genes and 2) analyzing relationships between
their regulatory regions and DNA binding motifs targeted
by known transcription factors. A key limitation of this
approach is that it assumes that coexpression is always
equivalent to regulation. Moreover, this method implies
symmetric relationships between the genes, which may not
always correspond to biological phenomena.
Within the area of machine learning, Boolean Networks
were one of the first models to be employed in GRNs
inference [7], [8] and variations of this approach have been
published recently [9]. These models basically aim at
inferring logical rules from a discretization of gene
expression time series. Even though these models can be
easily applied, they depend on arbitrary discretizations of
the gene expression values [10], which impose strong
assumptions and restrictions about the biological system
under study.
Bayesian Networks have also provided the basis for
several approaches to inferring GRNs [11], [12], [13]. These
methods employ conditional probabilistic distributions for
gene interactions modeling. Despite the strong theoretical
rationale behind these approaches, the exponential explo-
sion of the parameter space required for these models,
together with the large quantity of data needed to make
reliable inferences, reduces their capacity to infer complex
GRNs by using gene expression data only. Since they are
acyclic directed graphs, they cannot represent autoregula-
tion or time-course regulation in a straightforward way [2].
From the area of evolutionary computing approaches
[14], several methods were proposed. Ando et al. [15]
presented an algorithm that combines genetic programming
with the minimum least squares method. This technique
infers a differential equation system that represents regula-
tion interactions between genes. Although this method may
be robust in statistical terms, the algorithm was only tested
on small GRNs (10 genes) and the authors detected
important scalability limitations when applied to more
complex data. Iba and Mimura [16] proposed an iterative
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 4, NO. 4, OCTOBER-DECEMBER 2007
• I. Ponzoni is with the Department of Computer Science and Engineering,
Universidad Nacional del Sur, Av. Alem 1253, Bahía Blanca, CP 8000,
Argentina. E-mail: ip@cs.uns.edu.ar.
• F.J. Azuaje is with the Computer Science Research Institute and the School
of Computing and Mathematics, University of Ulster at Jordanstown,
BT37 0QB, Newtownabbey, Co. Antrim, UK. E-mail: fj.azuaje@ieee.org.
• J.C. Augusto and D.H. Glass are with the School of Computing and
Mathematics, University of Ulster at Jordanstown, BT37 0QB,
Newtownabbey, Co. Antrim, UK. E-mail: {jc.augusto, dh.glass}@ulster.ac.uk.
Manuscript received 14 Mar. 2006; revised 23 Aug. 2006; accepted 6 Nov.
2006; published online 22 Jan. 2007.
For information on obtaining reprints of this article, please send e-mail to:
tcbb@computer.org, and reference IEEECS Log Number TCBB-0057-0306.
Digital Object Identifier no. 10.1109/TCBB.2007.1049.
1545-5963/07/$25.00 © 2007 IEEE Published by the IEEE CS, CI, and EMB Societies & the ACM
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
inference approach based on a genetic algorithm (GA)
whose learning process was guided by a molecular
biologist. The main goal was to allow the expert to perform
interactive analysis and validation of the results based on
the introduction of new constraints until a GRN with a high
level of predictive confidence was achieved. One of the
most important drawbacks of this methodology is that it
requires the biologist to have a good understanding of the
dynamics of the GA in order to select optimum learning
parameters. Recently, Hallinan and Wiles proposed an
evolutionary algorithm [17], which predicts GRNs based on
the Artificial Genome model presented by Reil [18]. Although
this model is more biologically plausible than traditional
machine-learning methods and presents potentially useful
properties, the network dynamics rely on synchronous
updating, which is biologically implausible. On the other
hand, when a more realistic asynchronous updating scheme
was used, the dynamic behavior collapsed at a single point
attractor under almost all conditions [17].
Soinov et al. [10] approached the task of reconstructing
GRNs as a classification problem. In summary, the authors
proposed the application of decision trees to infer classifiers
that may represent regulatory rules (relationships) between
genes. They applied the C4.5 algorithm to infer the decision
trees [19]. This method’s computational efficiency limita-
tions are well known for classification problems with
continuous-valued attributes [20], which is the case in the
GRNs inference problem since the gene expression values
are real numbers. Although this is a sound and concep-
tually interesting approach, it may exhibit significant
predictive limitations when dealing with more complex
GRNs (that is, networks that may consist of hundreds or
thousands of genes).
Another important category of predictive approaches
includes several methods that are based on the detection of
modules of genes significantly coexpressed under specific
conditions [21], [22], [23]. Such modules allow both the
approximation of higher level network representations and
module-specific relationships. This divide-and-conquer
approach is a useful option for achieving reliable predictions
in the absence of larger amounts of expression
samples. However, recent evidence suggests conflicting
views about the meaning and nature of functional modules
represented in GRNs [24]. For a more comprehensive
review of GRN inference methods, the reader is referred
to [2], [25], and [26].
1.1 Proposed Approach
The method proposed in this paper addresses key limita-
tions shown by data-driven whole-set GRNs prediction
methods. The main objective is to provide a user-friendly,
biologically meaningful, and computationally efficient
algorithm to support the inference of complex putative
GRNs. We do not claim that data-driven machine-learning
approaches are sufficient to infer biologically meaningful
networks. However, such tools may provide significant
evidence necessary to aid scientists in detecting and
validating biologically relevant associations. Moreover, the
method proposed here neither makes strong statistical
assumptions nor applies arbitrary expression discretization
schemes (including adaptive thresholds for inferring
regulation rules). Thus, a new machine-learning algorithm
based on combinatorial optimization, from now on referred
to as GRNCOP (abbreviated from Gene Regulatory Net-
work inference by Combinatorial OPtimization), is as-
sessed. This method infers association rules that represent
interactions between genes, which are obtained from gene
expression data sets. The discovered rules may be used to
predict the gene expression states of a gene in terms of the
gene expression values of other genes and, in this way, a
putative GRN may then be reconstructed by applying and
combining these rules.
Our approach offers several advantages in relation to
existing methods. First of all, it does not assume arbitrary
and uniform gene expression value discretizations. Second,
GRNCOP is not constrained by regulatory symmetry
relationships that are shown by clustering-based network
inference techniques. Third, the results can be easily
interpreted since the association rules are derived from
models that classify the different regulation states. Finally,
the algorithm computes the potential interactions between
genes with a low computational effort of $O(n^2)$, where $n$ is
the number of genes in the GRN. Moreover, the new
methodology may in principle be adapted to other model-
ing approaches, such as modular methods [21], [22], [23]
and multisource-based prediction techniques [27].
GRNCOP aims at inferring different types of rules that
capture relevant associations (that is, potential regulatory
relationships) reflected in the expression values of the genes.
In order to test this approach, GRNCOP was applied to the
microarray data sets presented by Spellman et al. [28] to infer
a GRN relevant to the yeast cell cycle. The results were
statistically validated. Moreover, the rules generated by
GRNCOP were compared to relationships inferred by two
other published methods [10], [12]. Furthermore, biologically
relevant predictions were verified and potentially novel
predictions were assessed through literature searches and an
analysis of curated functional annotations derived from the
Saccharomyces cerevisiae Genome Database (SGD).
The rest of this paper is organized as follows: Key
definitions to interpret the regulatory rules and the
combinatorial optimization problem are introduced in
Section 2, the new algorithm is explained in Section 3, and
experimental results obtained by GRNCOP are discussed in
Section 4. A summary of contributions, future research, and
conclusions are presented in Section 5.
2 SYSTEMS AND METHODS
2.1 Gene Expression Association Rules and GRNs
Inference
The time series encoded in a gene expression data set may
be represented by means of a gene expression data matrix,
$X$, where the rows and columns represent genes and
samples (experimental perturbations or conditions), respectively.
In this way, each element $x_{ij}$ of $X$ contains the
expression value of gene $i$ in the sample $j$.
Although the gene expression values belong to a
continuous range of the real numbers, it is possible to
define a finite expression state set for each gene by means of
a discretization procedure. Such a procedure is required in
order to encode the inputs to any combinatorial optimization
process or other machine-learning methods. The results
reported in this paper, as in previous representative studies,
concentrate on two states for each gene: upregulated (when
the gene is expressed with a value greater than its mean gene
expression value) and downregulated (when the gene is
expressed with a value less than or equal to its mean expression
value). Nevertheless, the model can be generalized to any
number of states in a straightforward way. In GRNCOP, the
state of a gene $i$ in a sample $j$ is denoted as $s_{ij}$ and $\mu_i$
represents the mean expression value for this gene. Thus,
$s_{ij} = 1$ if $x_{ij} > \mu_i$; otherwise, $s_{ij} = -1$.
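The two-state assignment just described can be illustrated with a minimal sketch. The function name `discretize_states` and the toy matrix are assumptions made for this example only; they do not come from the GRNCOP implementation.

```python
from statistics import mean

def discretize_states(X):
    """Map each expression value to +1 (upregulated) or -1 (downregulated).

    X is a list of rows, one row of expression values per gene.
    A value strictly above the gene's own mean maps to +1, otherwise to -1.
    """
    states = []
    for row in X:
        mu = mean(row)  # mean expression value of this gene
        states.append([1 if x > mu else -1 for x in row])
    return states

# Toy data (made up for illustration): 2 genes, 4 samples.
X = [
    [0.5, -0.2, 0.1, 0.8],
    [-1.0, -1.0, 1.0, 1.0],
]
S = discretize_states(X)
print(S)
```

Values equal to the mean fall into the downregulated state, which matches the "less than or equal to" wording used for that state.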
On the other hand, the inference process also requires the
definition of discretization thresholds in order to infer
putative regulatory relationships between genes. These
“regulation thresholds” have traditionally been estimated
as unique static values for all of the genes under study. For
example, ad hoc methods based on mean expression values
have been applied. However, a more biologically meaningful
scheme should model the fact that a gene may actually have
distinct regulation thresholds in relation to different genes in
the regulatory network. For example, regarding the regula-
tory network under study (see Section 4), the genes CLB2 and
SWI5 are shown to be inhibited by gene CLB1, but their
respective downregulation thresholds are different. CLB2 is
downregulated (or inhibited) when the gene expression
value of CLB1 is above 0.07, whereas SWI5 is downregulated
when the gene expression of CLB1 is above -0.28. Therefore, a
fundamental problem consists of estimating the regulation
thresholds for each gene in relation to each potential target
gene, which can more accurately reflect significant interac-
tions between genes.
At this point, our hypothesis is stated as follows:
Association rules (that is, potential regulatory relationships)
may be accurately inferred from expression data to reveal
how the present and future state of a gene may be affected by the
gene expression values of the other genes, taking into account
their relative regulation thresholds. In this paper, we consider
three types of association rules: simultaneous, time-delay, and
change-based rules. The rule types are the same as those
studied by Soinov et al. [10] and Bulashevska and Eils [12],
but the rule syntax adopted here is slightly different. In
particular, Soinov et al. [10] referred to the third group of
associations as changes rules. In this paper, we refer to such
relationships as change-based rules.
Simultaneous rules represent the situation in which the
state of a gene $i$ in a sample $j$ depends on the gene expression
values of other genes in the same sample $j$. The syntax for
these rules is <symbol><gene> $\Leftrightarrow$ <symbol><gene>.
The symbols $+$ and $-$ on the left side of the rule indicate
above and below some specific regulation threshold, respectively,
whereas the symbols $+$ and $-$ on the right side of the rule
indicate upregulated and downregulated states, respectively.
For example, the rule $+$CLB1 $\Leftrightarrow$ $+$CLB2 denotes that, when
CLB1 is above its regulation threshold in relation to CLB2,
$t_{CLB1,CLB2}$, in a sample, then CLB2 will be upregulated in the
same sample.
Time-delay rules represent the situation in which the state
of a gene $i$ in a sample $j$ depends on the gene expression
values of other genes in the previous sample (that is, previous
experimental condition) $j - 1$. The syntax for these rules is
<symbol><gene> $\rightarrow$ <symbol><gene>. The symbols
$+$ and $-$ on the left side of the rule indicate above and below
some specific regulation threshold, respectively, whereas the
symbols $+$ and $-$ on the right side of the rule indicate
upregulated and downregulated states, respectively. For
example, the rule $+/-$ CLB1 $\rightarrow$ $-/+$ MCM1 denotes that,
if CLB1 is above its regulation threshold in relation to
MCM1, $t_{CLB1,MCM1}$, in a sample, then MCM1 will be
downregulated in the next sample and, if CLB1 is below
or equal to $t_{CLB1,MCM1}$ in a sample $j$, then MCM1 will be
upregulated in the next sample $j + 1$.
Finally, change-based rules represent events of the
transition-state machine corresponding to the GRN. The syntax for
these rules is <symbol><gene> $\Rightarrow$ <symbol><gene>.
In both sides of the rule, the symbols $+$ and $-$ indicate
upregulated and downregulated states, respectively. For
example, the rule $+$CLB1 $\Rightarrow$ $+$CLB2 denotes that, when the
gene CLB1 changes its state from downregulated to
upregulated, then the gene CLB2 will also change its state
from downregulated to upregulated at the same experimental
condition $j$. The six resulting regulation cases for the
three types of rules are shown in Table 1.
Note that two different types of discretization are
defined in this paper. The first one is to set the state of
each gene, which is computed using its mean expression
value, and the second one is to evaluate the potential
interaction between each pair of genes and it is calculated in
an adaptive gene-pair-specific way. In this paper, we focus
on the impact of adaptive regulation thresholds in the rule
inference process. However, the study of adaptive thresh-
olds for the definition of the gene’s states is another
potential improvement of existing GRN inference methods.
This task will be part of future research.
2.2 Combinatorial Optimization for Putative GRNs
Inference
GRNCOP infers the association rules described above by
exploring the possible combinations of interactions between
each pair of genes. In this sense, we assume six particular
cases, which are represented by the nonnull integer
numbers between $-3$ and $3$ and a special case that indicates
the absence of association, which is represented by the
number 0. All of these cases are described in Table 1.
TABLE 1
Summary of the Different Types of Association Rules
Inferred by GRNCOP
The first column encodes the cases.
In mathematical terms, the inference of the rules to
reconstruct a GRN can be expressed as the following
combinatorial optimization problem:
$$\bigcup_{i=1}^{n} \max_{\phi_i \in P} \psi(\phi_i, \delta(X, i)), \tag{1}$$
subject to:
• $n$ = number of genes in the microarray data set,
• $m$ = number of samples in the microarray data set,
• $X \in \mathbb{R}^{n \times m}$, matrix with the gene expression data,
• $P$ is the space of all vectors $v$ of dimension $n$ such
that $v_i \in \{-3, -2, -1, 0, 1, 2, 3\}\ \forall i$, $i = 1..n$,
• $\delta(X, i)$ is the discretization function such that
$\delta(X, i) = D_i$ and $D_i \in \{-1, 1\}^{n \times m}$,
• $\phi_i \in P$ is a classifier for $D_i$, and
• $\psi(\phi_i, D_i)$ is a performance function of $\phi_i$ as classifier
of $D_i$.
From now on, the symbol $\Phi$ indicates the set of optimal
classifiers, $\Phi = \{\phi_1, \phi_2, \ldots, \phi_n\}$. It is important to note that
the general optimization problem is the same for the three
types of rules. The only difference lies in the definition of
the discretization function because each type of rule is
based on different expression discretizations of $X$.
3 ALGORITHMS
3.1 GRNCOP: Combinatorial Optimization
Algorithm
The machine-learning process to obtain the rules consists of
three phases, one for each type of rule (see the left side of
Fig. 1). Phases 1 and 2 follow a similar processing principle
(see the right side of Fig. 1). The core of the algorithm is a
loop, where the vector of potential regulators ($\phi_i$) for a
gene $i$ is calculated at each iteration. After $n$ iterations, the
set of potential regulators corresponding to all the genes is
held in set $\Phi$.
Phases 1 and 2 differ in terms of the procedure applied to
calculate the discretization thresholds, which are required
for the discretization of the matrix $X$ and for obtaining the
discretization function $\delta(X, i)$. Both procedures will be
explained in Section 3.2.
Note that, although computing the threshold is concep-
tually a subroutine of the discretization procedure, these
two procedures are actually independent components in the
algorithm due to efficiency reasons. All of the regulation
thresholds corresponding to each gene are calculated
simultaneously, whereas the discretization of $X$ is calculated
in relation to each gene.
With respect to phase 3 (Fig. 2), the main difference with
the previous phases is the discretization procedure, which
is calculated only once. This is because the same discrete
matrix is common to all genes. The rationale behind this
difference is further explained later. Therefore, the discretization
function has matrix $X$ as its unique argument.
Consequently, only the procedure to obtain the optimum $\phi_i$
for a gene $i$ is calculated in each iteration. In this way, as we
have mentioned before, the procedure to calculate the
optimal solution is the same for each type of rule.
3.2 Discretization Step: Function $\delta$ Calculation
During discretization, the real numbers corresponding to
the gene expression values, which are held in matrix $X$, are
mapped to values $-1$ and $1$ using the function $\delta(X, i)$. The
main question at this point is how to define the discretization
regulation thresholds for each gene in relation to the
others. A traditional approach consists of using the mean
expression value from a gene $i$ in the sample set $X$. This
solution is easy to implement, but it represents a strong
simplification of reality because it assumes a unique
putative regulation threshold for each gene with respect
to the others. It is well known that the gene expression
value required by $gene_R$ to activate (or inhibit) a $gene_{T1}$ is
not necessarily the same value required by the same $gene_R$
to activate (inhibit) a $gene_{T2}$. For this reason, we propose
applying a more flexible and dynamic threshold-selection
policy which calculates a specific regulation threshold for
each pair of genes.
In particular, GRNCOP calculates the thresholds by
applying the same continuous-valued attribute discretization
techniques as those used for classification algorithms
based on decision trees. Basically, it considers each
expression value shown by $gene_R$ in $X$ as a potential
threshold for the discretization of $gene_R$. A partition of the
sample set $X$ into two subsets, namely, $Do$ and $Up$, is
generated for each gene and each candidate threshold, $t$. $Do$
Fig. 1. General schema of the GRNCOP algorithm. Dotted arcs indicate
the connection between the main program and the subroutine used for
phases 1 and 2 (gray box).
Fig. 2. Schema of the phase 3 subroutine corresponding to the
GRNCOP algorithm. The discretization step is shown outside of the
loop.
contains all samples where $gene_R$ has an expression
value less than or equal to $t$, whereas $Up$ contains all of the
samples where $gene_R$ has an expression value greater
than $t$. In other words, $Do$ and $Up$ represent sample sets in
which $gene_R$ has values equal to $-1$ and $1$, respectively,
on the basis of $t$, which is the candidate discretization
regulation threshold for $gene_R$.
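The partition step can be sketched as follows. This is only an illustration: the helper name `partition_samples` and the sample values are assumptions introduced here, and samples are represented by their column indices.

```python
def partition_samples(reg_values, t):
    """Split sample indices by a candidate threshold t for the regulator.

    `do` holds the indices whose value is <= t; `up` holds those with value > t.
    """
    do = [j for j, x in enumerate(reg_values) if x <= t]
    up = [j for j, x in enumerate(reg_values) if x > t]
    return do, up

# Made-up expression values of one potential regulator over 4 samples.
reg_values = [0.07, -0.28, 0.40, -0.10]

# Every observed value is tried as a candidate threshold.
for t in sorted(set(reg_values)):
    do, up = partition_samples(reg_values, t)
    print(t, do, up)
```

Using only the observed values as candidates keeps the number of thresholds to test at most equal to the number of samples.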
The next step consists of the calculation of the partition
entropy, which is a statistical indicator of the quality of a
threshold $t$ as a discretization value for $gene_R$ with respect
to another $gene_T$. To further illustrate this concept, suppose
that we are trying to infer the potential regulators for a
given $gene_T$; then, for each $gene_R$ (potential regulator of
$gene_T$), we obtain a discretization of this gene’s expression
values, which can help us to infer whether or not $gene_R$
is actually a regulator of $gene_T$. In numerical terms, the
partition entropy is 0 when all samples satisfy the same
association rule case (ideal situation from a predictive
viewpoint) and the partition entropy is 1 when the samples
belong to both regulation scenarios in equal proportion
(50 percent and 50 percent). Then, when the partition
entropy value associated with a discretization approaches
0, the threshold that generates this discretization
represents a better solution. Thus, such a threshold value
allows one to optimally detect potential significant relationships
between $gene_T$ and $gene_R$ in terms of the association
rule cases. The entropy calculation is based on definitions
given in [29] and the partition entropy equation was
previously applied by Kohani [30] as follows:
$$P\_Entropy(R, t, X) = \frac{|Do|}{|X|}\, Entropy(Do) + \frac{|Up|}{|X|}\, Entropy(Up) \tag{2}$$
where
• $R$ identifies the gene under consideration (potential
regulator),
• $t$ is the partition threshold,
• $X$ is the set of samples corresponding to the time
series,
• $Do$ is the subset of $X$ with the samples where the
gene expression value of $gene_R$ is less than or
equal to $t$, and
• $Up$ is the subset of $X$ with the samples where the
gene expression value of $gene_R$ is greater than $t$.
Then, for each pair of genes, GRNCOP calculates the
threshold that minimizes the partition entropy using (2).
After that, for each $gene_i$, the function $\delta(X, i)$ maps the
corresponding gene expression values in $X$ to the discrete
matrix $D_i$ using the thresholds previously calculated. Thus,
each gene $i$ in the original matrix $X$ is associated with a
discrete matrix $D_i$.
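A minimal sketch of the threshold search is given below. It assumes that the entropy of a subset is the base-2 Shannon entropy of the target gene's states (-1 or +1) among the samples in that subset, as in decision-tree induction; the exact definition used by GRNCOP follows [29] and may differ. All function names and data are hypothetical.

```python
from math import log2

def entropy(labels):
    """Base-2 Shannon entropy of a list of -1/+1 labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    h = 0.0
    for value in set(labels):
        p = labels.count(value) / n
        h -= p * log2(p)
    return h

def partition_entropy(reg_values, target_states, t):
    """Weighted entropy of the target states after splitting on reg <= t vs. reg > t."""
    do = [s for x, s in zip(reg_values, target_states) if x <= t]
    up = [s for x, s in zip(reg_values, target_states) if x > t]
    n = len(reg_values)
    return len(do) / n * entropy(do) + len(up) / n * entropy(up)

def best_threshold(reg_values, target_states):
    """Return the observed regulator value that minimizes the partition entropy."""
    candidates = sorted(set(reg_values))
    return min(candidates,
               key=lambda t: partition_entropy(reg_values, target_states, t))

# Made-up data: regulator expression values and target states over 4 samples.
reg = [-0.5, -0.3, 0.1, 0.4]
tgt = [-1, -1, 1, 1]
t_best = best_threshold(reg, tgt)
print(t_best, partition_entropy(reg, tgt, t_best))
```

In this toy case the split at -0.3 separates the two target states perfectly, so its partition entropy is 0, while the split at the largest value leaves one subset empty and yields the maximum value of 1.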
This discretization policy for $X$ is used for both
simultaneous and time-delay rules. However, a temporal
shift for the vector encoding the expression values of $gene_i$
is required for the latter type. The time-delay rules predict
the situation when the state of a $gene_i$ in a sample $j$ depends
on the gene expression values of its regulators in the
previous sample $j - 1$. In other words, these rules determine
the correlations between the expression value of a
$gene_i$ in a sample $j$, $X_{i,j}$, and the values of the other genes
in the previous sample, $X_{k,j-1}$, for $k = 1 \ldots n$. For this
reason, if $X \in \mathbb{R}^{n \times m}$, then $D_i \in \{-1, 1\}^{n \times (m-1)}$, where the
$i$th row of $D_i$ corresponds to the discretization of $gene_i$ in
the last $m - 1$ samples of $X$, whereas the values of the
remaining rows of $D_i$ correspond to the discretization of
other genes in the first $m - 1$ samples of $X$.
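The temporal shift can be illustrated with a short sketch that pairs the regulator's value in each sample with the target's state in the following sample. The function name `time_delay_pairs` and the values are assumptions made for this example.

```python
def time_delay_pairs(reg_values, target_states):
    """Pair the regulator value at sample j-1 with the target state at sample j.

    Both inputs have m entries; the result has m-1 pairs, because the first
    target state has no preceding sample and the last regulator value has no
    following one.
    """
    return list(zip(reg_values[:-1], target_states[1:]))

# Made-up series over m = 4 samples.
reg_values = [0.2, -0.4, 0.6, 0.1]
target_states = [-1, 1, -1, 1]
pairs = time_delay_pairs(reg_values, target_states)
print(pairs)
```

Dropping one sample at each end is what reduces the number of columns from $m$ to $m - 1$.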
Finally, the discretization procedure for the change rules
is significantly different. The discretization goal in this case
is to obtain a matrix $D$ that represents the transition of each
gene between the upregulated and downregulated states in
time. As explained in Section 2, because the state of a gene is
discretized using its mean expression value, the discretization
function $\delta(X)$ does not require a threshold for each
pair of genes. In this situation, we are only interested in
identifying the state changes of each gene. For this reason,
only one matrix, $D$, is generated which is common to all
genes. This discretization coincides with the change rules
modeling presented by Soinov et al. [10].
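One possible way to record state changes is sketched below. The numeric encoding (+1 for a switch to upregulated, -1 for a switch to downregulated, 0 for no change) is an assumption chosen for illustration; the text does not specify how transitions are stored, and the function name is hypothetical.

```python
def state_changes(states):
    """Return one transition code per pair of consecutive samples.

    +1: downregulated -> upregulated
    -1: upregulated -> downregulated
     0: no change of state
    (hypothetical encoding, not taken from the paper)
    """
    return [(curr - prev) // 2 for prev, curr in zip(states, states[1:])]

# Made-up state series of one gene over 4 samples (-1 = down, +1 = up).
changes = state_changes([-1, 1, 1, -1])
print(changes)
```

Because the states depend only on each gene's own mean, the same transition series can be reused when any other gene is the target.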
3.3 Optimization Step: Function $\psi$ and $\phi_i$ Calculation
As defined in (1), the optimization problem consists of
finding a set of optimal $\phi_i$ which define potential association
rules between $i$ and the other genes (potential
regulators). Basically, $\phi_i$ is a vector that represents the set
of potential regulators of $gene_i$. Each component of the
vector holds an integer value between $-3$ and $3$, which
represents one of the seven regulatory cases shown in
Table 1. Thus, $\phi_i(k)$ indicates the regulation case detected
between $gene_k$ and $gene_i$, that is, $\phi_i$ is a gene expression
classifier that represents the potential regulators for
$gene_i$ along with the characteristics of these potential
relationships.
The next step is the definition of an objective function
for the selection of the optimal set of association rules.
Taking into account that $\phi_i$ represents a classifier obtained
from the set of samples $D_i$, the optimum $\phi_i$ could be
calculated by maximizing a typical classifier performance
function. In particular, we use the following function
proposed by Carvalho and Freitas [31]:
$$\psi(\phi_i, D_i) = \frac{TP}{TP + FN} \times \frac{TN}{FP + TN}, \tag{3}$$
where
• $TP$ (True Positives) is the number of positive
association cases (see Table 1) of $D_i$ correctly
classified by $\phi_i$,
• $FN$ (False Negatives) is the number of positive cases
of $D_i$ incorrectly classified by $\phi_i$,
• $TN$ (True Negatives) is the number of negative cases
of $D_i$ classified correctly by $\phi_i$, and
• $FP$ (False Positives) is the number of negative cases
of $D_i$ incorrectly classified by $\phi_i$.
In this formula, the first factor is usually known as the
sensitivity of a classifier, whereas the second one is typically
recognized as the specificity of a classifier. Both factors generate
values between 0 and 1 and, so, $\psi(\phi_i, D_i)$ is always in this
range too. The best classifier is obtained when $\psi(\phi_i, D_i) = 1$
because this represents the situation where all expression
association states were correctly classified, whereas
$\psi(\phi_i, D_i) = 0$ refers to the opposite case.
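The performance function in (3) can be sketched as the product of sensitivity and specificity. The sketch assumes that +1 is the positive class and returns 0 when one of the two classes is absent from the data; both choices, and the function name, are assumptions for illustration.

```python
def performance(predicted, actual):
    """Sensitivity times specificity for -1/+1 labels, with +1 as the positive class."""
    tp = sum(p == 1 and a == 1 for p, a in zip(predicted, actual))
    fn = sum(p == -1 and a == 1 for p, a in zip(predicted, actual))
    tn = sum(p == -1 and a == -1 for p, a in zip(predicted, actual))
    fp = sum(p == 1 and a == -1 for p, a in zip(predicted, actual))
    if tp + fn == 0 or fp + tn == 0:
        return 0.0  # assumption: an undefined rate counts as the worst score
    sensitivity = tp / (tp + fn)
    specificity = tn / (fp + tn)
    return sensitivity * specificity

# Made-up predictions against made-up observed states.
score = performance([1, 1, -1, -1], [1, -1, -1, -1])
print(score)  # sensitivity 1/1 times specificity 2/3
```

Multiplying the two rates penalizes classifiers that do well on one class only, since a zero in either factor drives the whole score to zero.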
GRNCOP calculates the classifier for gene $i$ using a constructive approach, which explores all possible combinations of values for its components, one component per gene $k$. In short, GRNCOP computes the sensitivity and specificity for each possible interaction case value (encoded by values ranging from −3 to 3) for each component and assigns to that component the value that maximizes the product of both rates. After repeating this for each component, with $k = 1 \ldots n$, the resulting classifier maximizes (3).
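The constructive step described above can be sketched as follows. This is only an illustration under stated assumptions: the confusion counts for each candidate case code are taken as already computed, the case codes are treated as opaque integer labels, and the names `product_of_rates`, `best_case`, `build_classifier`, and `toy_counts` are hypothetical.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int, int]  # (TP, FN, TN, FP)


def product_of_rates(counts: Counts) -> float:
    """Sensitivity times specificity; an empty class contributes 0.0 (assumption)."""
    tp, fn, tn, fp = counts
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (fp + tn) if (fp + tn) else 0.0
    return sensitivity * specificity


def best_case(counts_by_case: Dict[int, Counts]) -> int:
    """Pick the case code whose counts maximize the product of both rates."""
    return max(counts_by_case, key=lambda case: product_of_rates(counts_by_case[case]))


def build_classifier(counts: Dict[int, Dict[int, Counts]]) -> Dict[int, int]:
    """Choose the best case code independently for each component k."""
    return {k: best_case(by_case) for k, by_case in counts.items()}


# Made-up confusion counts for two components and three of the case codes.
toy_counts = {
    1: {-3: (1, 9, 2, 8), 0: (5, 5, 5, 5), 3: (9, 1, 8, 2)},
    2: {-3: (8, 2, 9, 1), 0: (5, 5, 5, 5), 3: (2, 8, 1, 9)},
}
print(build_classifier(toy_counts))  # {1: 3, 2: -3}
```

Each component is chosen on its own, so no search over joint combinations is needed. In this sketch, ties are broken in favor of the first case code encountered, which is a choice made here and not one stated in the text.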
It is important to stress the low computational effort required for GRNCOP to infer a putative GRN. For example, for a problem with n genes, the algorithm only needs to calculate the metrics TP, FP, TN, and FN n times to find the association rules relative to each gene. These four metrics can be calculated simultaneously for a gene. Taking into account the fact that the sensitivity and the specificity are calculated with a computational complexity of $O(n)$, where n is the number of genes, the total runtime required to find the exact combinatorial solution to this problem is $O(n^2)$. That is, the calculation of the sensitivity and specificity values is repeated n times, one iteration per gene. This represents an improvement in relation to previous research. For example, the C4.5 algorithm applied by Soinov et al. [10] has a complexity of $O(n^2 \log(n))$ for problems with continuous-valued attributes [32].
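The quadratic bound can be made explicit with a short cost accounting. This is a sketch that assumes a constant $c$ (a symbol introduced here, not in the original) bounds the cost per candidate gene of one sensitivity and specificity evaluation:

```latex
T(n) \;=\; \sum_{i=1}^{n} \underbrace{c\,n}_{\text{one target gene}}
     \;=\; c\,n^{2} \;=\; O(n^{2}),
\qquad
\frac{n^{2}\log n}{n^{2}} \;=\; \log n .
```

Under the two bounds quoted above, the gap relative to C4.5 therefore grows as $\log n$.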
4 RESULTS AND DISCUSSION
The predictive performance of GRNCOP was tested using
the microarray data in [22], which also includes data from
S. cerevisiae cell cultures [27]. The cultures were synchronized by three different methods: cdc15, cdc28, and alpha-factor. Therefore, the three resulting gene expression data sets may be treated as statistically independent [10].
For the performance analysis of the proposed method,
the same training and validation experiments used by
Soinov et al. [10] and Bulashevska and Eils [12] were
analyzed in order to achieve a fair comparison between the
three inference methods. The results reported here focus on
genes CLN1-3, CLB1-6, CDC28, MBP1, CDC53, CDC34,
SKP1, SWI4-6, HCT1, CDC20, SIC1, and MCM1 in order to
establish comparisons with previous studies ([10], [12]);
hence, n = 21. The largest database, cdc15, was used as a training set, that is, as the matrix X of the optimization problem. All of the samples in cdc15 were used for the prediction of simultaneous rules, whereas only adjacent equidistant samples were used for the inference of time-delay and change-based rules. The accuracy of the rules obtained from
the cdc15 training set was assessed by three different
validation procedures: a 10-fold stratified cross-validation
[30] and independent tests using the cdc28 and alpha-factor
data sets. Our choices of training and validation data sets,
as well as validation procedures, were the same as those
implemented by Soinov et al. [10] and Bulashevska and Eils
[12]. However, these studies differ in the sense that
Bulashevska and Eils did not carry out a 10-fold cross-
validation test.
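The restriction to adjacent equidistant samples for the time-delay and change-based rules can be illustrated with a small sketch. The text does not spell out the selection procedure, so this shows only one possible reading: keep consecutive samples whose time gap equals a fixed step. The function name, the `step` and `tol` parameters, and the sampling times are all hypothetical.

```python
from typing import List, Sequence, Tuple


def adjacent_equidistant_pairs(
    times: Sequence[float], step: float, tol: float = 1e-9
) -> List[Tuple[int, int]]:
    """Return index pairs (j, j + 1) of consecutive samples separated by `step`."""
    return [
        (j, j + 1)
        for j in range(len(times) - 1)
        if abs((times[j + 1] - times[j]) - step) <= tol
    ]


# Hypothetical sampling times in minutes (not the actual cdc15 schedule).
sample_times = [10, 30, 50, 70, 80, 90, 100]
pairs = adjacent_equidistant_pairs(sample_times, step=20)
print(pairs)  # [(0, 1), (1, 2), (2, 3)]
```

Under this reading, pairs taken at a different spacing are left out, so that every transition used for a time-delay or change-based rule covers the same time interval.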
The association rules inferred by GRNCOP are summar-
ized in Table 2. Only the rules that achieved the highest
levels of accuracy after the validation process are reported.
All of the rules included in Table 2 reached an accuracy
over 70 percent in each validation procedure and an overall
mean accuracy higher than 80 percent, that is, taking into
account the average of the three validation tests. As a
consequence of this stringent evaluation, none of the
change-based rules detected from cdc15 passed the valida-
tion test.
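The acceptance criterion stated above can be written as a small predicate. This is a sketch only: the function name and parameter names are assumptions, and the strict inequalities follow the wording "over 70 percent" and "higher than 80 percent" as one possible reading.

```python
from typing import Sequence


def passes_validation(
    accuracies: Sequence[float],
    per_test_min: float = 70.0,
    mean_min: float = 80.0,
) -> bool:
    """True if every test accuracy exceeds per_test_min and the mean exceeds mean_min."""
    if not accuracies:
        return False
    mean_accuracy = sum(accuracies) / len(accuracies)
    return all(a > per_test_min for a in accuracies) and mean_accuracy > mean_min


# Made-up accuracies (percent) for cross-validation, cdc28, and alpha-factor.
print(passes_validation([92.0, 75.0, 81.0]))  # True
print(passes_validation([92.0, 65.0, 95.0]))  # False: one test is below 70
print(passes_validation([75.0, 78.0, 79.0]))  # False: the mean is below 80
```

A rule therefore has to do well on all three tests and cannot pass on cross-validation accuracy alone.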
The last two columns of Table 2 indicate interaction
relationships that were also inferred by the methods
proposed by Soinov et al. [10] and Bulashevska and Eils
[12], respectively, using the same data sets. The GRN
corresponding to the simultaneous rules inferred by
GRNCOP is shown in Fig. 3. The nodes represent genes,
and the arcs indicate the potential regulatory relationships.
The direction of the arcs determines the direction of the
putative regulatory interactions. In particular, the dotted
arcs denote new potential relationships discovered exclu-
sively by GRNCOP.
The accuracy values obtained for the classifiers proposed
in the three studies are presented in Tables 3 and 4. Each
row holds the mean accuracy value obtained by the rules
that represent the set of potential interactions for a
particular gene. It is important to clarify that Bulashevska
and Eils [12] did not report results by applying 10-fold
stratified cross-validation. They only carried out indepen-
dent tests using the cdc28 and alpha-factor data sets.
In summary, all simultaneous rules inferred by the decision-tree method [10] were also detected by GRNCOP, with the exception of the rules associated with genes MBP1, CDC34, and SKP1. Nevertheless, the simultaneous rules involving these genes are reported as “Questionable rules” by Soinov et al. [10] because these rules have a high 10-fold cross-validation accuracy on cdc15 data, but their accuracy decreases significantly with the cdc28 and alpha-factor data sets. For example, Soinov et al. reported that the accuracy estimated by 10-fold cross-validation for the rule +MBP1 ⇔ SKP1 under “simultaneous” events is almost 92 percent, but the performance of the rule was not confirmed by estimations with the cdc28 and alpha-factor test sets. In other words, the number of false positives (FP) for this rule is high when cdc28 and alpha-factor are used as test sets. Therefore, GRNCOP inferred all the highly accurate simultaneous rules obtained by the decision-tree method [10].

TABLE 2
Candidate Association Rules Inferred by GRNCOP, Soinov et al. (2003), and Bulashevska and Eils (2005) for S. cerevisiae Using the cdc15 Data Set from Spellman et al. (1998)
Furthermore, the unquestionable simultaneous rules discovered by Soinov et al. [10] also belong to the most accurate rule subset inferred by GRNCOP. Each unquestionable simultaneous rule obtained by Soinov et al.’s algorithm was inferred by GRNCOP with an accuracy of approximately 90 percent. It is important to note that GRNCOP inferred 40.5 percent more rules than Soinov et al.’s algorithm with the same overall accuracy (the mean accuracy of the three validation tests was 84.93 percent for both methods).
Although Bulashevska and Eils’s method, which is based on Bayesian networks, infers several interaction relationships that were not detected by GRNCOP, the emergence of many of their rules may be explained by the relaxation of the accuracy percentages required during the validation test, as shown in Table 4, and not by a better predictive ability. From this table, it is clear that, both in [10] and in our work, a more conservative validation test was carried out to decide the final set of association rules. Moreover, if the required accuracy is decreased to 60 percent, GRNCOP obtains more rules, but an accuracy of at least 70 percent was used in order to achieve a fair but stringent comparison in relation to Soinov et al.’s experiments.
In addition, from Table 2, it is evident that several of the
rules inferred by GRNCOP and Soinov et al. [10] were not
detected by Bulashevska and Eils. Finally, it is important to
stress that, despite these differences, no major inconsisten-
cies were found between the methods.
4.1 Biological Relevance of Results
The biological relevance of the inferred rules was estimated by analyzing whether such interrelationships reflect key functional properties relating to the different cell cycle phases G1, S, G2, M, and M/G1. Genes CLN1 and CLN2 encode G1 cyclins, whereas CLB5 and CLB6 encode B-cyclins. They share a similar expression pattern and attain their highest expression level during the G1 phase, which can be verified in the experimental data analyzed [34], [35], [36]. This knowledge is consistent with the rules inferred for the following ordered gene pairs (left side, right side of the rule):
(CLB6, CLB5); (CLN1, CLB5);
(CLN2, CLB5); (CLB5, CLB6);
(CLN1, CLB6); (CLN2, CLB6);
(CLB5, CLN1); (CLN2, CLN1);
(CLN1, CLN2).
In particular, the new rules inferred by GRNCOP, for the pairs (CLN2, CLB5), (CLB5, CLN1), and (CLN1, CLN2), are consistent with observations on the partial functional redundancy existing among CLB5, CLN1, and CLN2, which have been reported by Epstein and Cross [37] and Levine et al. [38].
CLB1 and CLB2 are specific cyclins of the G2 phase and there is biological evidence that they are coexpressed in this process [39]. Gene SWI5 encodes a transcription factor whose activation occurs during the G2 phase. These facts justify the rules inferred for the following ordered gene pairs (left side, right side of the rule):
(CLB2, CLB1); (SWI5, CLB1);
(CLB1, CLB2); (SWI5, CLB2);
(CLB1, SWI5); (CLB2, SWI5);
which are further supported by biological evidence presented by Koranda et al. [40]. Furthermore, the transcription of SWI5 is activated late in phase S and its peak of mRNA concentration occurs during the G2 phase [41]. This information is consistent with the rule for the pair (CLB1, SWI5).

TABLE 3
Comparison between the Validation Test Results Obtained by GRNCOP and Soinov et al. (2003) for S. cerevisiae Using the cdc15 Data Set from Spellman et al. (1998)

Fig. 3. GRN based on the simultaneous rules inferred by GRNCOP for S. cerevisiae. Dotted arcs show potentially novel association rules for CLN1, CLN2, CLB1, CLB5, and SWI4.

TABLE 4
Comparison between the Validation Test Results Obtained by GRNCOP and Bulashevska and Eils (2005) for S. cerevisiae Using the cdc15 Data Set from Spellman et al. (1998)
It is also well known that, in budding yeast, the G1 cyclins such as CLN1 and CLN2 are expressed in G1 and S phases, whereas mitotic cyclins such as CLB1 and CLB2 are expressed in G2 and M phases. Amon et al. [42] found that the CLBs play a central role in the transition from S to G2 phases, showing evidence that CLBs repress CLNs. This negative regulation of CLNs may occur via the transcription factor SWI4 because CLBs are necessary for G2 repression of SCB-regulated genes like CLN1 and CLN2. On the other hand, Andrews and Measday [43] present evidence that the Cyclin/CDK complexes (CDC28/CLN1 and CDC28/CLN2) regulate CLB proteolysis. These data are consistent with the inhibitory relationships inferred between G1- and G2-specific genes:
+/− CLN1 ⇔ −/+ CLB1; +/− CLN2 ⇔ −/+ CLB1;
+/− CLB6 ⇔ −/+ CLB1; +/− CLN1 ⇔ −/+ CLB2;
+/− CLN2 ⇔ −/+ CLB2; +/− CLB2 ⇔ −/+ CLN2;
+/− SWI5 ⇔ −/+ CLN2;
and the time-delay rule +/− CLB6 → −/+ CLB1. In particular, the rules +/− CLN1 ⇔ −/+ CLB1, +/− CLN1 ⇔ −/+ CLB2, +/− CLN2 ⇔ −/+ CLB1, and +/− CLN2 ⇔ −/+ CLB2 were only inferred by GRNCOP. The reader is referred to [39], [41], and [44] for additional detailed information on the biological relevance of these associations.
With regard to SIC1, it is well known that this gene is an inhibitor of CLB complexes and that it is active during the G1 phase, inhibiting CLB1 and CLB2 [45]. This validates the simultaneous rule +/− SIC1 ⇔ −/+ CLB2. CDC20 is transcribed late in the S/G2 phase [36], whereas CLN1 is expressed during the G1 phase. This explains its interaction with CLN1, which may be represented by the rule +/− CDC20 ⇔ −/+ CLN1. Prinz et al. [46] presented evidence that CLB2 stimulates the synthesis of CDC20 and Chen et al. [47] described time delays between the expression of CLB2 and the activation of CDC20. This feature is captured by a new rule inferred by GRNCOP, with CLB2 on the left side and CDC20 on the right side. This rule was not detected by the methods compared with GRNCOP.
The protein SWI4 is a component of the SBF complex, which controls the expression of genes during phase G1 [48]. This is in concordance with the inhibitory action of SWI4 on the genes expressed in the G2 phase, as represented by the rule +/− SWI4 ⇔ −/+ CLB1, and its activator role for the genes expressed during the G1 phase, as revealed by the rule with SWI4 on the left side and CLN2 on the right side. Moreover, Igual et al. [48] showed experimental evidence that SWI4 regulates the transcription of gene CLN2, which is represented in one of the simultaneous rules inferred by GRNCOP only. These observations offer evidence of the biological relevance of the association rules inferred by GRNCOP.
Additionally, a functional annotation-driven analysis of the interacting pairs predicted only by GRNCOP (Table 2) further suggests its potential for making biologically meaningful predictions. Their curated Gene Ontology (GO) [49] annotations derived from the SGD (http://www.yeastgenome.org/) were processed to assess functional similarity between such pairs under the three GO hierarchies: Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), as investigated elsewhere [50]. Only higher quality annotations were processed, that is, electronically inferred annotations were not considered.
All of the pairs exhibited relatively high functional
similarity values over all the GO hierarchies using the
March 2005 release of the SGD. This stresses that these pairs
are linked to common biological functions, pathways, and
cellular localizations. With regard to the BP hierarchy, for
example, all of the similarity values were higher than 0.4,
which is above the SGD mean similarity value. Only the
pair SWI4-CLN2 showed null similarity under the MF
hierarchy. All of the pairs showed CC similarity values
above 0.6, except for the pair CLB2-CDC20 (0.20). A closer
look at the GO annotations for these novel predictions
confirms the relevance of these findings. For example, the
pair CLN1-CLB1 is involved in regulation of cyclin
dependent proteins. Similarly, CLB2 is a known regulator
of cyclin dependent protein kinase activity, which was
predicted by GRNCOP as a regulator of CDC20. CDC20 is
known to be involved in cyclin catabolism.
5 CONCLUSIONS
In this paper, GRNCOP, a combinatorial optimization
algorithm designed for the inference of putative GRNs,
was presented. GRNCOP obtains an optimal classifier that
represents potential interaction relationships between
genes. This classifier is attained in two sequentially
executed main steps. First, a gene-specific discretization of
the gene expression values is carried out, which can more
accurately reflect the complexity of the regulatory relation-
ships between pairs of genes. In a second stage, the
association rules are inferred by means of a combinatorial
exploration of the predictive relationships existing between
the discretized values.
This study does not claim that our or other data-driven
machine learning approaches are sufficient to infer biologi-
cally meaningful regulatory networks. However, such tools
may offer significant evidence necessary to aid scientists in
exploring and identifying biologically relevant associations.
The method proposed here is also computationally efficient in terms of runtime, it does not require arbitrary assumptions about the discretization of gene expression values, and it also proved to have a good predictive performance in the inference of a GRN for S. cerevisiae. The results obtained by GRNCOP were compared with the relationships inferred by two recently published methods [10], [12]. This comparison reveals the efficacy of GRNCOP as a prediction tool. This is not only because it detects a high percentage of the rules inferred by the other methods, but also because it finds new relevant relationships, which satisfied a stringent statistical validation. Moreover, all interactions between genes inferred by GRNCOP are consistent with previous biological knowledge. It is also important to remark that the low computational effort required by our method makes it suitable for the inference of complex GRNs involving thousands of genes.
As future work, we plan to extend our algorithm in order
to implement other types of inferable interaction relation-
ships. The algorithm currently has the ability to infer
potential regulatory rules with one-to-one cardinality, that
is, rules where the precedent (left side of the rule) contains
only one gene. Biological phenomena may, of course,
comprise relationships described by a “many-to-one”
cardinality. Moreover, a variety of motifs may also be
found, such as “one-to-many” relationships. Therefore, we
will incorporate the prediction of rules with higher
cardinality.
A related future direction concerns the manner in which inferences of potential interactions are made. At present, our algorithm assesses each gene independently as a potential regulator for the target gene under consideration, that is, in determining whether gene R is a potential regulator of gene T, we do not take into account gene R’s relationship with other regulators of gene T. This means that
it is possible that some direct interactions identified by our
approach are, in fact, due to indirect relationships between
genes. A first step to addressing this issue would be to
investigate thresholds for both direct and indirect processes
identified by our approach in order to determine whether
there is any redundancy in the GRN that has been inferred.
We also intend to integrate additional data sources such as factor binding motifs and location analysis data, as well as prior functional knowledge (for example, ontology-based) and network constraints such as topological constraints. With respect to possible applications, we also plan to test our method on data from other organisms such as mice. Finally, further comparisons with other methods and hybridization with modular techniques constitute another long-term goal.
ACKNOWLEDGMENTS
Dr. Ponzoni did this work as a visiting researcher at the
School of Computing and Mathematics, University of Ulster.
The authors would like to express their acknowledgment to
the ANPCyT from Argentina for their economic support
given through Grant No. 11-12778 (Res. 117/2003) as part of
the “Contrato de Préstamo BID 1728/OC-AR” and to the
Universidad Nacional del Sur for their economic support
given through Grants Res. CSU-598 and PGI 24/N019.
REFERENCES
[1] H. Bolouri and E.H. Davidson, “Modeling Transcriptional
Regulatory Networks,” BioEssays, vol. 24, pp. 1118-1129, 2002.
[2] M.P. Styczynski and G. Stephanopoulos, “Overview of Computa-
tional Methods for the Inference of Gene Regulatory Networks,”
Computers and Chemical Eng., vol. 29, pp. 519-534, 2005.
[3] H. De Jong, “Modeling and Simulation of Genetic Regulatory
Systems: A Literature Review,” J. Computational Biology, vol. 9,
pp. 67-103, 2002.
[4] C. Pridgeon and D. Corne, “Genetic Network Reverse-Engineer-
ing and Network Size; Can We Identify Large GRNs?” Proc. 2004
IEEE Symp. Computational Intelligence in Bioinformatics and Compu-
tational Biology, pp. 32-36, Oct. 2004.
[5] J.L. DeRisi, V.R. Iyer, and P.O. Brown, “Exploring the Metabolic
and Genetic Control of Gene Expression on a Genomic Scale,”
Science, vol. 278, pp. 680-686, 1997.
[6] M.B. Eisen, P.T. Spellman, P.O. Brown, and D. Botstein, “Cluster
Analysis and Display of Genome-Wide Expression Patterns,” Proc.
Nat’l Academy of Sciences USA, vol. 95, pp. 14863-14868, 1998.
[7] S. Liang, S. Fuhrman, and R. Somogyi, “REVEAL, A General
Reverse Engineering Algorithm for Inference of Genetic Network
Architectures,” Proc. Pacific Symp. Biocomputing, vol. 3, pp. 18-29,
Jan. 1998.
[8] T. Akutsu, S. Miyano, and S. Kuhara, “Identification of Genetic
Networks from a Small Number of Gene Expression Patterns
under the Boolean Network Model,” Proc. Pacific Symp. Biocomputing,
vol. 4, pp. 17-28, Jan. 1999.
[9] S. Mehra, W.-S. Hu, and G. Karypis, “G: A Boolean Algorithm for
Reconstructing the Structure of Regulatory Networks,” Metabolic
Eng., vol. 6, pp. 326-339, 2004.
[10] L.A. Soinov, M.A. Krestyaninova, and A. Brazma, “Towards
Reconstruction of Gene Networks from Expression Data by
Supervised Learning,” Genome Biology, vol. 4, Article R6, 2003.
[11] N. Friedman, M. Linial, I. Nachman, and D. Pe’er, “Using Bayesian
Networks to Analyze Expression Data,” J. Computational Biology,
vol. 7, pp. 601-620, 2000.
[12] S. Bulashevska and R. Eils, “Inferring Genetic Regulatory Logic
from Expression Data,” Bioinformatics, vol. 21, pp. 2706-2713, 2005.
[13] M. Zou and S.D. Conzen, “A New Dynamic Bayesian Network
(DBN) Approach for Identifying Gene Regulatory Networks from
Time Course Microarray Data,” Bioinformatics, vol. 21, pp. 71-79,
2005.
[14] A.E. Eiben and J.E. Smith, Introduction to Evolutionary Computing.
Springer, 2003.
[15] S. Ando, E. Sakamoto, and H. Iba, “Evolutionary Modeling and
Inference of Gene Network,” Information Sciences, vol. 145, pp. 225-
236, 2002.
[16] H. Iba and A. Mimura, “Inference of a Gene Regulatory Network
by Means of Interactive Evolutionary Computing,” Information
Sciences, vol. 145, pp. 225-236, 2002.
[17] J. Hallinan and J. Wiles, “Evolving Genetic Regulatory Networks
Using an Artificial Genome,” Proc. Second Asia-Pacific Bioinfor-
matics Conf., 2004.
[18] T. Reil, “Dynamics of Gene Expression in an Artificial Genome:
Implications for Biological and Artificial Ontogeny,” Proc. Fifth
European Conf. Artificial Life,
D. Floreano, F. Mondada, and
J.D. Nicoud, eds., pp. 457-466, 1999.
[19] J.R. Quinlan, C4.5: Programs for Machine Learning. Morgan
Kaufmann, 1992.
[20] S. Ruggieri, “Efficient C4.5,” IEEE Trans. Knowledge and Data Eng.,
vol. 14, pp. 438-444, 2002.
[21] J. Ihmels, G. Friedlander, S. Bergmann, O. Sarig, Y. Ziv, and N.
Barkai, “Revealing Modular Organization in the Yeast Transcrip-
tional Network,” Nature Genetics, vol. 31, pp. 370-377, 2002.
[22] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and
N. Friedman, “Module Networks: Identifying Regulatory Mod-
ules and Their Condition-Specific Regulators from Gene Expres-
sion Data,” Nature Genetics, vol. 34, pp. 166-176, 2003.
[23] P.H. Lee and D. Lee, “Inferring Genetic Regulatory Logic from
Expression Data,” Bioinformatics, vol. 21, pp. 2739-2747, 2005.
[24] B. Snel and M.A. Huynen, “Quantifying Modularity in the
Evolution of Biomolecular Systems,” Genome Research, vol. 14,
pp. 391-397, 2004.
[25] T. Schlitt and A. Brazma, “Modelling Gene Networks at Different
Organisational Levels,” FEBS Letters, vol. 579, pp. 1859-1866, 2005.
[26] T. Schlitt and A. Brazma, “Modelling in Molecular Biology:
Describing Transcription Regulatory Networks at Different
Scales,” Philosophical Trans. Royal Soc. of London Series B, Biological
Sciences, vol. 361, no. 1467, pp. 483-494, 2006.
[27] C.-H. Yeang and T. Jaakkola, “Physical Network Models and
Multi-Source Data Integration,” Proc. Seventh Ann. Int’l Conf.
Research in Computational Molecular Biology, pp. 312-321, Apr. 2003.
[28] P.T. Spellman, G. Sherlock, M.Q. Zhang, V.R. Iyer, K. Anders, M.B.
Eisen, P.O. Brown, D. Botstein, and B. Futcher, “Comprehensive
Identification of Cell Cycle-Regulated Genes of the Yeast
Saccharomyces cerevisiae by Microarray Hybridization,” Molecu-
lar Biology of the Cell, vol. 9, pp. 3273-3297, 1998.
[29] T. Mitchell, Machine Learning, chapter 3. WCB/McGraw-Hill, 1997.
[30] R. Kohavi, “Wrappers for Performance Enhancement and Ob-
livious Decision Graphs,” PhD dissertation, Computer Science
Dept., Stanford Univ., 1995.
[31] D.R. Carvalho and A.A. Freitas, “A Hybrid Decision Tree/Genetic
Algorithm Method for Data Mining,” Information Sciences, vol. 163,
pp. 13-35, 2004.
[32] F. Provost, D. Jensen, and T. Oates, “Efficient Progressive
Sampling,” Proc. Fifth ACM SIGKDD Int’l Conf. Knowledge
Discovery and Data Mining, Paper ID 442, Aug. 1999.
[33] R.J. Cho, M.J. Campbell, E.A. Winzeler, L. Steinmetz, A. Conway,
L. Wodicka, T.G. Wolfsberg, A.E. Gabrielian, D. Landsman, D.J.
Lockhart, and R.W. Davis, “A Genome-Wide Transcriptional
Analysis of the Mitotic Cell Cycle,” Molecular Cell, vol. 2, pp. 65-73,
1998.
[34] C. Kuhne and P. Linder, “A New Pair of B-Type Cyclins from
Saccharomyces cerevisiae that Function Early in the Cell Cycle,”
European Molecular Biology Organization J., vol. 12, pp. 3437-3447,
1993.
[35] K.C. Chen, A. Csikasz-Nagy, B. Gyorffy, J. Val, B. Novak, and J.J.
Tyson, “Kinetic Analysis of a Molecular Model of the Budding
Yeast Cell Cycle,” Molecular Biology of the Cell, vol. 11, pp. 369-391,
2000.
[36] L.H. Hwang, L.F. Lau, D.L. Smith, C.A. Mistrot, K.G. Hardwick,
E.S. Hwang, A. Amon, and A.W. Murray, “Budding Yeast CDC20:
A Target of the Spindle Checkpoint,” Science, vol. 279, pp. 1041-
1044, 1998.
[37] C.B. Epstein and F.R. Cross, “CLB5: A Novel B Cyclin from
Budding Yeast with a Role in S Phase,” Genes and Development,
vol. 6, pp. 1695-1706, 1992.
[38] K. Levine, K. Huang, and F.R. Cross, “Saccharomyces cerevisiae
G1 Cyclins Differ in Their Intrinsic Functional Specificities,”
Molecular and Cellular Biology, vol. 16, pp. 6794-6803, 1996.
[39] H. Althoefer, A. Schleiffer, K. Wassmann, A. Nordheim, and G.
Ammerer, “Mcm1 Is Required to Coordinate G2-Specific Transcription
in Saccharomyces cerevisiae,” Molecular and Cellular
Biology, vol. 15, pp. 5917-5928, 1995.
[40] M. Koranda, A. Schleiffer, L. Endler, and G. Ammerer, “Forkhead-
Like Transcription Factors Recruit Ndd1 to the Chromatin of G2/
M-Specific Promoters,” Nature, vol. 406, pp. 94-98, 2000.
[41] C.J. Loy, D. Lydall, and U. Surana, “NDD1, a High-Dosage
Suppressor of cdc28-1N, Is Essential for Expression of a Subset of
Late-S-Phase-Specific Genes in S. cerevisiae,” Molecular and
Cellular Biology, vol. 19, pp. 3312-3327, 1999.
[42] A. Amon, M. Tyers, B. Futcher, and K. Nasmyth, “Mechanisms
that Help the Yeast Cell Cycle Clock Tick: G2 Cyclins Tran-
scriptionally Activate G2 Cyclins and Repress G1 Cyclins,” Cell,
vol. 74, pp. 993-1007, 1993.
[43] B. Andrews and V. Measday, “The Cyclin Family of Budding
Yeast: Abundant Use of a Good Idea,” Trends in Genetics, vol. 14,
pp. 66-72, 1998.
[44] B. Schneider, E. Patton, S. Lanker, M. Mendenhall, C. Wittenberg,
B. Futcher, and M. Tyers, “Yeast G1 Cyclins Are Instable in G1
Phase,” Nature, vol. 395, pp. 86-89, 1998.
[45] J.H. Toyn, A.L. Johnson, J.D. Donovan, W.M. Toone, and L.H.
Johnston, “The Swi5 Transcription Factor of Saccharomyces
cerevisiae Has a Role in Exit from Mitosis through Induction of
the Cdk-Inhibitor Sic1 in Telophase,” Genetics, vol. 145, pp. 85-96,
1997.
[46] S. Prinz, E.S. Hwang, R. Visintin, and A. Amon, “The Regulation
of Cdc20 Proteolysis Reveals a Role for the APC Components
Cdc23 and Cdc27 during S Phase and Early Mitosis,” Current
Biology, vol. 8, pp. 750-760, 1998.
[47] K.C. Chen, A. Csikasz-Nagy, B. Gyorffy, J. Val, B. Novak, and J.J.
Tyson, “Kinetic Analysis of a Molecular Model of the Budding
Yeast Cell Cycle,” Molecular Biology of the Cell, vol. 11, pp. 369-391,
2000.
[48] J.C. Igual, W.M. Toone, and L.H. Johnston, “A Genetic Screen
Reveals a Role for the Late G1-Specific Transcription Factor Swi4p
in Diverse Cellular Functions Including Cytokinesis,” J. Cell
Science, vol. 110, pp. 1647-1654, 1997.
[49] “The Gene Ontology Consortium: Creating the Gene Ontology
Resource: Design and Implementation,” Genome Research, vol. 11,
pp. 1425-1433, 2001.
[50] H. Wang, F.J. Azuaje, O. Bodenreider, and J. Dopazo, “Gene
Expression Correlation and Gene Ontology-Based Similarity: An
Assessment of Quantitative Relationships,” Proc. IEEE 2004 Symp.
Computational Intelligence in Bioinformatics and Computational
Biology, pp. 25-31, 2004.
Ignacio Ponzoni received the BSc degree in
computer science from the Universidad Nacional
del Sur, Bahía Blanca, Argentina, in 1996 and
the PhD degree in computer science from the
Universidad Nacional del Sur in 2001. He
received a fellowship from the National Council
of Scientific and Technological Research of
Argentina (CONICET) in 1996. He is a lecturer
in the Department of Computer Science and
Engineering at the Universidad Nacional del Sur,
and a scientific researcher at Planta Piloto de Ingeniería Química, which
is a National Research Institute of CONICET. His research interests
focus mainly on computational intelligence applied to chemical
engineering problems and bioinformatics. With his academic work, he
has contributed 11 journal publications and more than 20 international
conference/workshop publications. He is a member of the ACM.
Francisco J. Azuaje received the BSc degree in electronic engineering
from Simon Bolivar University, Caracas, Venezuela, in 1995, the MSc
degree in policy and management of technological innovation from the
Central University of Venezuela in 1996, and the PhD degree in artificial
intelligence and medical informatics from the University of Ulster,
Jordanstown, United Kingdom. Before joining the University of Ulster as
a reader in 2002, he was a lecturer in the Department of Computer
Science at Trinity College, Dublin, Ireland. He has published extensively
in journals, books, and conference proceedings related to the areas of
bioinformatics, artificial intelligence, and medical informatics. He is an
editorial board member of the IEEE Transactions on Nanobioscience,
BioMedical Engineering OnLine, Cancer Informatics, and the Online
Journal of Bioinformatics. He has coedited three books in the areas of
biomedical informatics and systems biology. He coadministers the IEEE
Forum on Bioinformatics and Systems Biology. He is a senior member
of the IEEE.
Juan Carlos Augusto is a lecturer at the
University of Ulster, Jordanstown, United King-
dom. His research interests focus mainly on
Artificial Intelligence (AI), particularly the subar-
ea of temporal reasoning (TR). Since writing his
PhD thesis (1998), he has explored diverse
areas of application for the concept of temporal
reasoning and its relevance for dynamic sys-
tems. With his academic work, he contributed 16
journal/edited volumes/book chapter publica-
tions and more than 30 international conference/workshop publications.
He has also been actively involved in the organization of scientific
events, participating in more than 20 of them as the chair/cochair or a
steering or program committee member.
David H. Glass received the degree in pure and
applied mathematics, the PhD degree in theore-
tical atomic physics, and the MA degree in
philosophy from the Queen’s University of
Belfast in 1994, 1997, and 2000, respectively.
After completing the PhD, he continued to carry
out research on the theory of laser interactions
with atoms and molecules. He has been a
lecturer in the School of Computing and Mathe-
matics at the University of Ulster since 2000,
where he has carried out research in the field of artificial intelligence,
including work on probabilistic reasoning and possibilistic logic.
10 IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 4, NO. 4, OCTOBER-DECEMBER 2007
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
... Ji and Tan's method (Ji and Tan, 2004) 3 variation between time points soinov Soinov's change state method (Soinov et al., 2003) 2 variation between time points mean-sd mean plus standard (Ponzoni et al., 2007) . We also explore how size penalty range influences network optimization. ...
... TDT target discretization threshold Gallo et al. (2011) 2 min(var(S 1 ) + var(S 2 )), where S i represents a gene state var(S i ) are variance for S i , i = 1, 2, S 1 ∩ S 2 = ∅ iX equal width discretization 2-5 discretize data by splitting the range of data with X levelsMadeira and Oliveira (2005) into X intervals equally qX equal frequency discretization 2-5 split data into strata with each strata with X levels Madeira and Oliveira (2005) having the same amount of data mean discretization through 2 a ij = 1 if a ij > E(A i ), otherwise a ij = 0 comparing to mean value Madeira and Oliveira (2005) where A i = (a i1 , a i1 , ...) kmeansX kmeans discretization 2-5 assign data into k (k is given ) levels with X levels Li et al. (2010) through k-means clustering data into k clusters ji&tan Ji and Tan's method Ji and Tan (2004) 3 compare ratio between consecutive samples soinov Soinov's change 2 minimize sum of entropy for up-regulating(1) and state method Soinov et al. (2003); Ponzoni et al. (2007) down-regulating(0) gene states mean-sd mean plus standard Ponzoni et al. (2007) 3 . CC-BY-NC-ND 4.0 International license author/funder. ...
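The discretization rules listed in the snippet above can be illustrated with a minimal Python sketch. It is not the implementation used by any of the cited works: the function names are hypothetical, only the standard library is used, and `variance_split_threshold` is merely one simple reading of the min(var(S_1) + var(S_2)) criterion (an exhaustive scan over cut points), not the TDT procedure of Gallo et al. (2011).

```python
from statistics import mean, pvariance


def discretize_mean(profile):
    """Binary states: 1 if the value exceeds the profile mean, else 0."""
    m = mean(profile)
    return [1 if v > m else 0 for v in profile]


def discretize_equal_width(profile, levels):
    """Split the value range into `levels` intervals of equal width."""
    lo, hi = min(profile), max(profile)
    if hi == lo:
        return [0] * len(profile)
    width = (hi - lo) / levels
    # The maximum value is clamped into the top interval.
    return [min(int((v - lo) / width), levels - 1) for v in profile]


def discretize_equal_frequency(profile, levels):
    """Assign values to `levels` strata holding (nearly) equal counts."""
    order = sorted(range(len(profile)), key=lambda i: profile[i])
    states = [0] * len(profile)
    for rank, i in enumerate(order):
        states[i] = rank * levels // len(profile)
    return states


def variance_split_threshold(profile):
    """Scan all cuts of the sorted values and return the midpoint cut
    that minimizes var(S1) + var(S2) over the two resulting states."""
    values = sorted(profile)
    best_cut, best_cost = None, float("inf")
    for k in range(1, len(values)):
        low, high = values[:k], values[k:]
        cost = pvariance(low) + pvariance(high)
        if cost < best_cost:
            best_cost = cost
            best_cut = (low[-1] + high[0]) / 2
    return best_cut


# Hypothetical expression profile of one gene across six samples.
profile = [0.1, 0.2, 0.3, 2.0, 2.1, 2.2]
binary_states = discretize_mean(profile)
width_states = discretize_equal_width(profile, 2)
frequency_states = discretize_equal_frequency(profile, 3)
cut = variance_split_threshold(profile)
```

For this clearly bimodal toy profile, the mean rule, the two-level equal-width rule and the variance-based cut all separate the three low values from the three high ones. On real expression data the rules can disagree, which is the motivation the snippet gives for comparing them.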
Preprint
The rapid development in quantitatively measuring DNA, RNA, and protein has generated a great interest in the development of reverse-engineering methods, that is, data-driven approaches to infer the network structure or dynamical model of the system. Many reverse-engineering methods require discrete quantitative data as input, while many experimental data are continuous. Some studies have started to reveal the impact that the choice of data discretization has on the performance of reverse-engineering methods. However, more comprehensive studies are still greatly needed to systematically and quantitatively understand the impact that discretization methods have on inference methods. Furthermore, there is an urgent need for systematic comparative methods that can help select between discretization methods. In this work, we consider 4 published intracellular networks inferred with their respective time-series datasets. We discretized the data using different discretization methods. Across all datasets, changing the data discretization to a more appropriate one improved the reverse-engineering methods’ performance. We observed no universal best discretization method across different time-series datasets. Thus, we propose DiscreeTest, a two-step evaluation metric for ranking discretization methods for time-series data. The underlying assumption of DiscreeTest is that an optimal discretization method should preserve the dynamic patterns observed in the original data across all variables. We used the same datasets and networks to show that DiscreeTest is able to identify an appropriate discretization among several candidate methods. To our knowledge, this is the first time that a method for benchmarking and selecting an appropriate discretization method for time-series data has been proposed. 
Availability: All the datasets, reverse-engineering methods and source code used in this paper are available in Vera-Licona’s lab GitHub repository: https://github.com/VeraLiconaResearchGroup/Benchmarking_TSDiscretizations
... This work presents the method named GarNet. Finally, Ponzoni et al. present in [36] the methods GRNCOP and GRNCOP2, which are combinatorial optimization algorithms. ...
... In Table 3, the performance of PRIORREGNET and REGNET can be observed. Furthermore, the proposal is compared against other benchmark methods for the inference of gene networks in a way similar to that established in [26], where the association rule-based method named GarNet, the GRNCOP method [36], the decision tree-based methods [41] and the first-order-based method [7] are used as benchmark methods. The contribution in [3] has been added as a more current method. ...
Article
Traditional computational techniques in the area of gene expression data analysis have recently been improved with the use of prior biological knowledge from open-access repositories. In this work, we propose the use of prior knowledge as a heuristic in a method for inferring gene-gene associations from gene expression profiles. We use Gene Ontology, an open-access ontology in which genes are annotated according to their biological functionality, as a source of prior knowledge, together with a pairwise Gene-Ontology-based measure between genes. The performance of our proposal has been compared to other benchmark methods for the inference of gene networks, outperforming them in some cases and obtaining similar and competitive results in others, but with the advantage of providing simple and interpretable models, which is a desired feature for Artificial Intelligence health-related models as stated by the European Union.
... This technology introduces a large amount of information to analyze, hence several data analysis issues (Alves et al., 2010). There are several techniques that carry out the reverse engineering of GRNs by analyzing gene expression data (Alves et al., 2010; De Jong, 2002; Espanés et al., 2016; Gallo et al., 2011; Karlebach and Shamir, 2008; Lakshmanan et al., 2013; Li et al., 2008; Ponzoni et al., 2007; Pridgeon and Corne, 2004; Zamani et al., 2010). They range from Boolean models and model-free approaches to data mining techniques, each trying to overcome the disadvantages of the previous one. ...
... Particularly, one machine-learning approach for the inference of time-lagged rules, called GRNCOP2 (Gallo et al., 2011), performs this task. These time-delayed gene regulation rules are a common phenomenon (Bulashevska and Elis, 2005; Li et al., 2006; Ponzoni et al., 2007; Soinov et al., 2003; van Someren et al., 2000) and they add complexity and computational cost. The potential interactions between genes produced by the algorithm are used to predict the expression state of a gene in terms of the expression values of other genes and to assemble a GRN by applying and combining these rules. ...
Article
Gene regulatory networks (GRNs) are crucial in every process of life since they govern the majority of molecular processes. Therefore, the task of assembling these networks is highly important. In particular, the so-called model-free approaches have an advantage in modeling the complexities of dynamic molecular networks, since most gene networks are hard to map accurately with any other mathematical model. A highly abstract model-free approach, called the rule-based approach, offers several advantages for data-driven analysis, such as requiring the least amount of data. Its simplicity also allows the inference of large models at a higher speed of analysis. However, with these techniques the reconstruction of the relational structure of the network is partial, and hence incomplete for an effective biological analysis. This situation motivated us to explore the possibility of hybridizing with other approaches, such as biclustering techniques, which led us to incorporate a biclustering tool that finds new relations between the nodes of the GRN. In this work we present a new software tool, called GeRNeT, that integrates the algorithms of GRNCOP2 and BiHEA along with a set of tools for interactive visualization, statistical analysis and ontological enrichment of the resulting GRNs. In this regard, results associated with Alzheimer's disease datasets are presented that show the usefulness of integrating both bioinformatics tools.
... • Soinov et al. [53], a C4.5-based method; • Bulashevska et al. [54], a Bayesian-based method; • Ponzoni et al. [55], a combinatorial optimization algorithm (GRNCOP); • Gallo et al. [52], an upgraded version of the previous algorithm named GRNCOP2; and • Gomez-Vela et al. [15], a fuzzy method to infer gene co-expression networks named FyNe. ...
Article
Gene networks have become a powerful tool in the comprehensive analysis of gene expression. Due to the increasing amount of available data, computational methods for networks generation must deal with the so-called curse of dimensionality in the quest for the reliability of the obtained results. In this context, ensemble strategies have significantly improved the precision of results by combining different measures or methods. On the other hand, structure optimization techniques are also important in the reduction of the size of the networks, not only improving their topology but also keeping a positive prediction ratio. In this work, we present Ensemble and Greedy networks (EnGNet), a novel two-step method for gene networks inference. First, EnGNet uses an ensemble strategy for co-expression networks generation. Second, a greedy algorithm optimizes both the size and the topological features of the network. Not only do achieved results show that this method is able to obtain reliable networks, but also that it significantly improves topological features. Moreover, the usefulness of the method is proven by an application to a human dataset on post-traumatic stress disorder, revealing an innate immunity-mediated response to this pathology. These results are indicative of the method’s potential in the field of biomarkers discovery and characterization.
... In particular, GRNCOP [13] and GRNCOP2 [10] belong to this family of AR extraction methods. GRNCOP uses combinatorial optimization and machine learning for inferring gene pairwise associations as classifiers. ...
Article
Background and objective: Gene regulatory networks (GRNs) are essential for understanding most molecular processes. In this context, the so-called model-free approaches have an advantage in modeling the complex topologies behind these dynamic molecular networks, since most GRNs are difficult to map correctly with any other mathematical model. Abstract model-free approaches, also known as rule-based extraction methods, offer valuable benefits when performing data-driven analysis, such as requiring the least amount of data and simplifying the inference of large models at a faster analysis speed. In particular, GRNCOP2 is a combinatorial optimization method with an adaptive criterion for the discretization of gene expression data and high performance, in contrast to other rule-based extraction methods for discovering GRNs. However, the analysis of the large relational structures of the networks inferred by GRNCOP2 requires the support of effective tools for interactive network visualization and topological analysis of the extracted associations. This need motivated the integration of GRNCOP2 into the Cytoscape ecosystem in order to benefit from Cytoscape's core functionality, as well as all the other apps in its ecosystem. Methods: In this paper, we introduce the implementation of a GRNCOP2 Cytoscape app. This incorporation into the Cytoscape platform includes new functionality for GRN visualization, dynamic user interaction and integration with other apps for topological analysis of the networks. Results: In order to demonstrate the usefulness of integrating GRNCOP2 in Cytoscape, the new app was used to tackle a novel use case for GRNCOP2: the analysis of crosstalk between pathways. In this regard, datasets associated with Alzheimer's disease (AD) were analyzed using the GRNCOP2 app and other apps of the Cytoscape ecosystem by performing a topological analysis of AD progression and its synchronization with the Ubiquitin Mediated Proteolysis pathway. Finally, the biological relevance of the findings achieved with this new app was evaluated by searching for evidence in the literature. Conclusions: The proposed crosstalk analysis with the new GRNCOP2 app focused on assessing the phase of Alzheimer's disease progression in which the coordination with the Ubiquitin Mediated Proteolysis pathway increases, and on identifying the genes that explain the signalling between these cellular processes. Both questions were explored by topological contrastive analysis of the GRNs generated by the GRNCOP2 app, where several facilities of Cytoscape were exploited. The topological patterns inferred by this new app are consistent with biological evidence reported in the scientific literature, illustrating the effectiveness of using the GRNCOP2 app in pathway analysis. Availability: The GRNCOP2 app is freely available at the official Cytoscape app store: http://apps.cytoscape.org/apps/grncop2.
... A key step in the analysis of the gene expression data is the identification of the groups of genes that manifest similar expression patterns [17]. If two expression profiles are similar, one can hypothesize that the respective genes are functionally related and they can be helpful in gene function prediction [18,19], gene regulatory network analysis [20][21][22][23], functional module prediction [24,25], and disease prediction [26,27]. There exist a number of measures to find out the expression similarity between the genes, namely, Pearson correlation (PC) [10,28], Euclidean distance (ED) [29], Manhattan distance/Cityblock distance (MD) [9], and Spearman rank correlation (SRC) [30]. ...
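The four similarity measures named in the snippet above (Pearson correlation, Euclidean distance, Manhattan distance and Spearman rank correlation) can be sketched in plain Python as follows. This is an illustrative sketch only: the function names and the two toy expression profiles are assumptions, not code or data from the cited work, and ties in the Spearman ranking are handled by average ranks.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def manhattan(x, y):
    """Manhattan (city-block) distance between two profiles."""
    return sum(abs(a - b) for a, b in zip(x, y))


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Two hypothetical profiles related monotonically but not linearly.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [1.0, 4.0, 9.0, 16.0]
r_pearson = pearson(gene_a, gene_b)
r_spearman = spearman(gene_a, gene_b)
d_euclidean = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
```

In this toy case the Spearman correlation is 1 because the two profiles have the same ordering, while the Pearson correlation is slightly below 1 because the relationship is not linear. The two distances depend on the absolute expression levels rather than on the shape of the profiles.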
... Data preprocessing is an important task as original gene expressions may contain noise, missing values and technical or experimental errors. The utility of discretization is shown in several studies such as gene regulatory network analysis [17,50] and disease prediction [13,46]. While in gene regulatory network analysis discretization helps in constructing the network more accurately [50], in disease prediction it helps in improving classification accuracy in identifying normal and diseased patients/genes [13]. ...
Article
Discretizing gene expression values is an important step in data preprocessing as it helps in reducing noise and experimental errors. This in turn provides better results in various tasks such as gene regulatory network analysis and disease prediction. A supervised discretization method for gene expressions using gene annotation is developed. The method is called “Gene Annotation Based Discretization” (GABD), where the discretization width is determined by maximizing the positive predictive value (PPV), computed using gene annotations, for the top 20,000 gene pairs. The method can capture gene similarity better than the original expressions. The performance of GABD is compared with some existing discretization methods, such as equal width discretization, equal frequency discretization and k-means discretization, in terms of PPV. The utility of GABD is also shown by clustering genes using the k-medoid algorithm and thereby predicting the function of 23 unclassified Saccharomyces cerevisiae genes using a p-value cutoff of 10^-10. The source code for GABD is available at http://www.sampa.droppages.com/GABD.html.
Chapter
Microarray technology is a powerful tool to analyze thousands of gene expression values with a single experiment. Due to the huge amount of data, most of recent studies are focused on the analysis and the extraction of useful and interesting information from microarray data. Examples of applications include detecting genes highly correlated to diseases, selecting genes which show a similar behavior under specific conditions, building models to predict the disease outcome based on genetic profiles, and inferring regulatory networks. This chapter presents a review of four popular data mining techniques (i.e., Classification, Feature Selection, Clustering and Association Rule Mining) applied to microarray data. It describes the main characteristics of microarray data in order to understand the critical issues which are introduced by gene expression values analysis. Each technique is analyzed and examples of pertinent literature are reported. Finally, prospects of data mining research on microarray data are provided.
Chapter
Nowadays, a huge amount of high-throughput molecular data is available for analysis and provides novel and useful insights into complex biological systems, through the acquisition of a high-resolution picture of their molecular status in defined experimental conditions. In this context, microarrays are a powerful tool to analyze thousands of gene expression values with a single experiment. A number of approaches have been developed for detecting genes highly correlated to diseases, selecting genes that exhibit a similar behavior under specific conditions, building models to predict disease outcome based on genetic profiles, and inferring regulatory networks. This paper discusses popular and recent data mining techniques (i.e., Feature Selection, Clustering, Classification, and Association Rule Mining) applied to microarray data. The main characteristics of microarray data and preprocessing procedures are presented to understand the critical issues introduced by gene expression values analysis. Each technique is analyzed, and relevant examples of pertinent literature are reported. Moreover, real use cases exploiting analytic pipelines that use these methods are also introduced. Finally, future directions of data mining research on microarray data are envisioned.
Article
The rapid development in quantitatively measuring DNA, RNA and protein has generated a great interest in the development of reverse-engineering methods, that is, data-driven approaches to infer the network structure or dynamical model of the system. Many reverse-engineering methods require discrete quantitative data as input, while many experimental data are continuous. Some studies have started to reveal the impact that the choice of data discretization has on the performance of reverse-engineering methods. However, more comprehensive studies are still greatly needed to systematically and quantitatively understand the impact that discretization methods have on inference methods. Furthermore, there is an urgent need for systematic comparative methods that can help select between discretization methods. In this work, we consider four published intracellular networks inferred with their respective time-series datasets. We discretized the data using different discretization methods. Across all datasets, changing the data discretization to a more appropriate one improved the reverse-engineering methods' performance. We observed no universal best discretization method across different time-series datasets. Thus, we propose DiscreeTest, a two-step evaluation metric for ranking discretization methods for time-series data. The underlying assumption of DiscreeTest is that an optimal discretization method should preserve the dynamic patterns observed in the original data across all variables. We used the same datasets and networks to show that DiscreeTest is able to identify an appropriate discretization among several candidate methods. To our knowledge, this is the first time that a method for benchmarking and selecting an appropriate discretization method for time-series data has been proposed. 
Availability and implementation: All the datasets, reverse-engineering methods and source code used in this paper are available in Vera-Licona's lab GitHub repository: https://github.com/VeraLiconaResearchGroup/Benchmarking_TSDiscretizations. Supplementary information: Supplementary data are available at Bioinformatics online.
Article
Motivation: Inferring the genetic interaction mechanism using Bayesian networks has recently drawn increasing attention due to its well-established theoretical foundation and statistical robustness. However, the relative insufficiency of experiments with respect to the number of genes leads to many false positive inferences. Results: We propose a novel method to infer genetic networks by alleviating the shortage of available mRNA expression data with prior knowledge. We call the proposed method 'modularized network learning' (MONET). Firstly, the proposed method divides a whole gene set into overlapping modules, considering biological annotations and expression data together. Secondly, it infers a Bayesian network for each module, and integrates the learned subnetworks into a global network. An algorithm that measures the similarity between genes based on the hierarchy, specificity and multiplicity of biological annotations is presented. The proposed method draws a global picture of inter-module relationships as well as a detailed look at intra-module interactions. We applied the proposed method to analyze Saccharomyces cerevisiae stress data, and found several hypotheses to suggest putative functions of unclassified genes. We also compared the proposed method with a whole-set-based approach and two expression-based clustering approaches.
Article
DNA hybridization arrays simultaneously measure the expression level for thousands of genes. These measurements provide a 'snapshot' of transcription levels within the cell. A major challenge in computational biology is to uncover, from such measurements, gene/protein interactions and key biological features of cellular systems. In this paper, we propose a new framework for discovering interactions between genes based on multiple expression measurements. This framework builds on the use of Bayesian networks for representing statistical dependencies. A Bayesian network is a graph-based model of joint multi-variate probability distributions that captures properties of conditional independence between variables. Such models are attractive for their ability to describe complex stochastic processes, and for providing clear methodologies for learning from (noisy) observations. We start by showing how Bayesian networks can describe interactions between genes. We then present an efficient algorithm capable of learning such networks and a statistical method to assess our confidence in their features. Finally, we apply this method to the S. cerevisiae cell-cycle measurements of Spellman et al. to uncover biological features.
Article
The spindle checkpoint regulates the cell division cycle by keeping cells with defective spindles from leaving mitosis. In the two-hybrid system, three proteins that are components of the checkpoint, Mad1, Mad2, and Mad3, were shown to interact with Cdc20, a protein required for exit from mitosis. Mad2 and Mad3 coprecipitated with Cdc20 at all stages of the cell cycle. The binding of Mad2 depended on Mad1 and that of Mad3 on Mad1 and Mad2. Overexpression of Cdc20 allowed cells with a depolymerized spindle or damaged DNA to leave mitosis but did not overcome the arrest caused by unreplicated DNA. Mutants in Cdc20 that were resistant to the spindle checkpoint no longer bound Mad proteins, suggesting that Cdc20 is the target of the spindle checkpoint.
Article
Increasing volumes of data about the cellular phenotype and classes of intracellular molecules have necessitated the introduction of systemic methods for the analysis of biological systems. These methods bring to focus the integrated nature and complex interactions of biological molecules and processes and, as such, define the emerging field of systems biology. Of the multitude of systems thus analyzed, we provide here an overview of foundational and current methods in the inference of gene regulatory networks (GRNs) and sequence-based pattern discovery. In GRN analysis, the reverse engineering paradigm is given particular attention, including the various types of models (discrete, continuous, hybrid) which may be utilized in reverse engineering a network's structure. Future research directions in these areas are discussed, particularly the potential for ventures that integrate GRN inference, pattern discovery, and experimental methods into a cohesive, productive methodology.
Article
This paper addresses the well-known classification task of data mining, where the objective is to predict the class which an example belongs to. Discovered knowledge is expressed in the form of high-level, easy-to-interpret classification rules. In order to discover classification rules, we propose a hybrid decision tree/genetic algorithm method. The central idea of this hybrid method involves the concept of small disjuncts in data mining, as follows. In essence, a set of classification rules can be regarded as a logical disjunction of rules, so that each rule can be regarded as a disjunct. A small disjunct is a rule covering a small number of examples. Due to their nature, small disjuncts are error prone. However, although each small disjunct covers just a few examples, the set of all small disjuncts can cover a large number of examples, so that it is important to develop new approaches to cope with the problem of small disjuncts. In our hybrid approach, we have developed two genetic algorithms (GA) specifically designed for discovering rules covering examples belonging to small disjuncts, whereas a conventional decision tree algorithm is used to produce rules covering examples belonging to large disjuncts. We present results evaluating the performance of the hybrid method in 22 real-world data sets.
Article
This paper describes an Evolutionary Modeling (EM) approach to building a causal model of a differential equation system from time series data. The main target of the modeling is the gene regulatory network. Our work features a hybrid method of Genetic Programming (GP) and statistical analysis: GP and the Least Mean Square (LMS) method were combined to identify a concise form of regulation between the variables from a given set of time series. Our approach was evaluated on several real-world problems. Further, Monte Carlo analysis is applied to identify the robust and significant influences in the results for the purpose of gene network analysis.
Article
Inferring a gene regulatory network is one of the challenging topics in the field of bioinformatics. To infer a network structure effectively, a new approach that allows human intervention and strategic data acquisition in the inference process seems to be necessary. In this paper, we propose an effective approach for interactively inferring gene regulatory networks using gene expression data from DNA microarrays. We also build a system that realizes our approach with a GA-based interactive algorithm. Experimental results show that our method can infer the network structure accurately with a relatively small amount of expression data.