Inference of Molecular Regulatory Systems Using Statistical Path-Consistency Algorithm



Citation: Yan, Y.; Jiang, F.; Zhang, X.; Tian, T. Inference of Molecular Regulatory Systems Using
Statistical Path-Consistency Algorithm. Entropy 2022, 24, 693. https://doi.org/10.3390/e24050693
Academic Editor: Christian Beck
Received: 31 March 2022
Accepted: 12 May 2022
Published: 13 May 2022
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and
institutional affiliations.
Copyright: © 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access
article distributed under the terms and conditions of the Creative Commons Attribution (CC BY)
license (https://creativecommons.org/licenses/by/4.0/).
Yan Yan 1, Feng Jiang 2, Xinan Zhang 3,* and Tianhai Tian 4,*
1 School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan 430205, China; yanyan@wit.edu.cn
2 School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, China; fjiang@zuel.edu.cn
3 School of Mathematics and Statistics, Central China Normal University, Wuhan 430079, China
4 School of Mathematics, Monash University, Melbourne 3800, Australia
* Correspondence: xinanzhang@mail.ccnu.edu.cn (X.Z.); tianhai.tian@monash.edu (T.T.)
Abstract:
One of the key challenges in systems biology and molecular sciences is how to infer
regulatory relationships between genes and proteins using high-throughput omics datasets. Although
a wide range of methods have been designed to reverse engineer regulatory networks, recent
studies show that the inferred network may depend on the variable order in the dataset. In this work,
we develop a new algorithm, called the statistical path-consistency algorithm (SPCA), to solve the
problem of dependence on variable order. This method generates a number of different variable
orders using random samples, and then infers a network by using the path-consistency algorithm based
on each variable order. We propose measures to determine the edge weights using the corresponding
edge weights in the inferred networks, and choose the edges with the largest weights as the putative
regulations between genes or proteins. The developed method is rigorously assessed using six
benchmark networks from the DREAM challenges, the mitogen-activated protein (MAP) kinase pathway,
and a cancer-specific gene regulatory network. The inferred networks are compared with those
obtained by using two up-to-date inference methods. The accuracy of the inferred networks shows
that the developed method is effective for discovering molecular regulatory systems.
Keywords: molecular regulation; complex network; graphic model; path consistency; statistical inference
1. Introduction
A molecular regulatory network is a collection of molecular regulators that interact
with each other and with other substances in the cell to govern the functions of mRNA
molecules and proteins. Experimental studies in recent years have produced a large number
of high-throughput datasets that measure gene expression levels or protein activities at
the genome scale. Among them, microarray gene expression data and RNA-sequencing
data provide rich information for reconstructing genetic regulatory networks, and proteomic
data give opportunities to find protein–protein interactions in cell signaling transduction
pathways [1,2]. To study these molecular systems with complex regulations, network science
is a powerful tool for investigating the structure of networks and regulatory mechanisms [3,4].
The reverse-engineering study, which is designed to develop genetic regulatory networks
or protein–protein interaction networks, is one of the challenging research topics in systems
biology and molecular sciences [5–7]. In recent years, advances in single-cell technologies
have provided both opportunities and substantial challenges for the development of molecular
regulatory networks using single-cell data [8–10].
To meet the high demand from biological studies, a wide range of inference methods
have been designed to reconstruct regulatory networks based on experimental datasets.
There are three major types of inference algorithms according to the mathematical methods
used in these algorithms, namely, the correlation-based methods, mechanistic methods, and
machine learning methods [11–14]. The majority of the correlation-based methods use
statistical measures or information theory methods to calculate the relationship between each
pair of genes or proteins. Owing to their computational efficiency, these algorithms can be used
to reconstruct large-scale molecular systems. A number of statistical measures have been
employed in these methods, including the Pearson correlation coefficient, Spearman rank
correlation, Kendall rank correlation, partial correlation, and distance measures [15–17].
The correlation-based methods have also been used to explore the relationship between
various types of molecules in both healthy and diseased cells [18,19]. A recent study used
213 single-cell datasets to evaluate the performance of 17 association measures [20]. This
study suggests that a few association methods do not obtain accurate results for certain
single-cell or bulk transcriptional datasets. In addition, since time delay
is an important issue in gene expression, correlation methods based on time delay have
also been proposed to explain the time differences in gene expression [21,22].
Compared with the correlation coefficients, mutual information is able to quantify
the nonlinear relationship between pairs of variables in the system [23], and thus
provides an alternative approach to measuring dependence. A number of
algorithms have been designed to reconstruct network models in biology and financial
sciences [24–26]. Other measures from information theory, such as the conditional
mutual information and part mutual information [27,28], are able to search for joint
regulations based on the concept of conditional dependency between a subset of variables.
The combination of mutual information and conditional mutual information is able to
detect false positive interactions.
The mechanistic methods use mathematical models to simulate the dynamics of molecular
behavior. Thus, the developed models can be used to make testable predictions.
Boolean models use binary state vectors to describe state transition trajectories that are
governed by a network with Boolean logic functions [29,30]. To provide detailed
regulatory functions, ordinary differential equation (ODE) models are a widely used
approach to describing the continuous changes of molecular dynamics [31–34]. The linear ODE
model is the first option for simulating the dynamics of gene expression [35,36]. In addition,
stochastic models, such as the Ornstein–Uhlenbeck (OU) process and Markov processes,
have been used to model differential processes [37,38]. Generally, the model-based methods
are limited to relatively small-scale networks because of the computational cost
and the complexity of the parameter space.
To reduce the complexity, hybrid approaches, which combine the correlation-based
methods and model-based methods, are used to infer gene regulatory networks.
The correlation-based methods are employed first to generate sparse networks that are
the basis for the next step, which uses the model-based methods [39–43]. In recent years, there
has been a trend to use machine learning techniques for developing genetic regulatory
networks [44–46].
The path-consistency algorithm (PCA) combines mutual information and high-order
mutual information to infer regulatory networks [24,27,28]. Since the conditional mutual
information (CMI) needs other variables to determine the correlation relations, PCA is
order dependent; namely, the generated network depends on the order of variables in the
dataset [47]. To address this issue, the path-consistency (PC) stable algorithm was designed
to remove the effect of order dependence [48]. However, this algorithm uses all of the
possible CMI values in the network and thus may increase the false positive rate. In addition,
the part mutual information (PMI) is a more accurate measure than the conditional mutual
information [28], but it has not been fully considered in the PC-stable algorithm.
Calculating the CMI of genes X and Y requires a third gene Z that has a high
correlation with both X and Y. However, if one of these correlation relationships
is removed from the network before the computation, this CMI does not exist. Thus, the
path-consistency algorithm depends on the order of variables in the system. If the variable
order is changed, we may infer a different network using the same algorithm and the same
experimental dataset. To address the issue of dependence on variable order, this work designs a
new algorithm, called the statistical path-consistency algorithm (SPCA), to develop regulatory
networks. Rather than using one single variable order to develop the regulatory network,
we propose to generate multiple variable orders and then develop a number of different
networks. According to the weights of each edge in all these generated networks, we propose
measures to select the final edges. The proposed algorithm is rigorously validated using
six gold-standard benchmark networks from the DREAM challenges, the mitogen-activated protein (MAP)
kinase pathway, and a cancer-specific gene regulatory network.
2. Methods
In this section, we first briefly introduce the information theory measures used to quantify the
dependence between pairs of genes. The detailed formulas are given in the
Supplementary Information. Then we introduce the path-consistency algorithm. To improve
the accuracy, we propose the statistical path-consistency algorithm (SPCA) with part
mutual information to estimate the regulatory relationships between genes and proteins.
2.1. Information Theory
Mutual information (MI) is designed to measure the dependence between
two random variables. Unlike the correlation coefficient, which can only measure the linear
relationship between random variables, MI is able to describe the nonlinear dependency
between random variables. Let the joint density function of two random variables $(X, Y)$
be $p(x, y)$, and let the marginal density functions of $X$ and $Y$ be $p(x)$ and
$q(y)$, respectively. MI can be calculated by

$$\mathrm{MI}(X, Y) = \iint_{\mathcal{X} \times \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x)\, q(y)} \, dx \, dy, \tag{1}$$

where $\mathcal{X}$ and $\mathcal{Y}$ are the integration regions of $X$ and $Y$, respectively. In applications,
we normally have observed samples $\{(x_1, y_1), \ldots, (x_n, y_n)\}$ of the random
variables $(X, Y)$. We may use the samples to estimate the density functions, and then use
the discrete form of (1) to calculate the MI value, which is the widely used bin method [25].
We may also assume that the random variables $(X, Y)$ follow a particular distribution (e.g.,
the Gaussian distribution) and then use Formula (1) to calculate the MI value directly. In
this method, the sample data are used to estimate the key parameters of the distribution
functions. Alternatively, MI can be obtained by using the values of entropies. The detailed
formulas can be found in the Supplementary Information.
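The Gaussian option mentioned above can be illustrated with a minimal sketch. It assumes that $(X, Y)$ is bivariate Gaussian, in which case Formula (1) reduces to $-\frac{1}{2}\log(1 - \rho^2)$, where $\rho$ is the correlation coefficient. The function name `gaussian_mi` and the simulated samples are illustrative assumptions and are not taken from the paper or its Supplementary Information.

```python
import numpy as np

def gaussian_mi(x, y):
    """MI of two 1-D samples, assuming (X, Y) is bivariate Gaussian."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]          # sample correlation coefficient
    # Closed form of Formula (1) for a bivariate Gaussian.
    return -0.5 * np.log(1.0 - rho ** 2)

# Simulated data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.normal(size=2000)
y_dep = x + 0.5 * rng.normal(size=2000)    # depends strongly on x
y_ind = rng.normal(size=2000)              # generated independently of x
mi_dep = gaussian_mi(x, y_dep)
mi_ind = gaussian_mi(x, y_ind)
```

In this toy setting the dependent pair should give a clearly positive MI, while the independent pair should give a value close to zero. The Gaussian form only captures linear dependence, so the bin method remains the option for strongly nonlinear relationships.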
Although a larger value of MI suggests that two random variables may have a closer
relationship, in networks with a large number of random variables, the close relationship of
two random variables may be caused by their strong relationships to a third random variable.
CMI is designed to find the conditional relationship between two random variables $X$ and
$Y$, given a third random variable $Z$, and is defined by

$$\mathrm{CMI}(X, Y \mid Z) = \iiint_{\mathcal{X} \times \mathcal{Y} \times \mathcal{Z}} p(x, y, z) \log \frac{p(x, y \mid z)}{p(x \mid z)\, q(y \mid z)} \, dx \, dy \, dz, \tag{2}$$

where $p(x, y, z)$ is the joint density function of these three random variables, $p(x \mid z)$ and
$q(y \mid z)$ are the conditional density functions of $X$ and $Y$ given
$Z$, respectively, and $p(x, y \mid z)$ is the conditional density function of $(X, Y)$ given $Z$.
In addition, $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$ are the integration regions of $X$, $Y$ and $Z$, respectively.
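To make the role of the third variable concrete, the following sketch compares MI and CMI for two variables that are both driven by a common regulator. It assumes jointly Gaussian variables, for which (2) has the closed form $\frac{1}{2}\log\frac{|C(X,Z)|\,|C(Y,Z)|}{|C(Z)|\,|C(X,Y,Z)|}$, with $|C(\cdot)|$ the determinant of the covariance matrix. The helper names and the simulated data are hypothetical.

```python
import numpy as np

def _logdet_cov(*cols):
    """Log-determinant of the sample covariance matrix of the given 1-D samples."""
    cov = np.atleast_2d(np.cov(np.vstack(cols)))
    return np.linalg.slogdet(cov)[1]

def gaussian_mi(x, y):
    """MI(X, Y) under a joint Gaussian assumption."""
    return 0.5 * (_logdet_cov(x) + _logdet_cov(y) - _logdet_cov(x, y))

def gaussian_cmi(x, y, z):
    """CMI(X, Y | Z) under a joint Gaussian assumption."""
    return 0.5 * (_logdet_cov(x, z) + _logdet_cov(y, z)
                  - _logdet_cov(z) - _logdet_cov(x, y, z))

# Z drives both X and Y; X and Y have no direct link (simulated toy data).
rng = np.random.default_rng(0)
z = rng.normal(size=2000)
x = z + 0.5 * rng.normal(size=2000)
y = z + 0.5 * rng.normal(size=2000)
mi_xy = gaussian_mi(x, y)        # large: X and Y look related
cmi_xy = gaussian_cmi(x, y, z)   # near zero once Z is accounted for
```

Here the MI between $X$ and $Y$ is large even though the two variables are only indirectly related, whereas the CMI given $Z$ is close to zero, which is the behavior that allows indirect edges to be removed.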
When the values of $p(x \mid z)$ and/or $q(y \mid z)$ are very small, CMI may be sensitive to small
perturbations in the dataset. To address this issue, PMI was proposed to replace CMI for
measuring the dependency relationship [28,49]. The key difference between CMI and PMI
is that the conditional density functions $p(x \mid z)$ and $q(y \mid z)$ in CMI are replaced by the partial
dependence functions $p^*(x \mid z)$ and $q^*(y \mid z)$ in PMI, respectively. For example, the function
$p^*(x \mid z)$ in PMI is defined by

$$p^*(x \mid z) = \sum_{y} p(x \mid z, y)\, p(y),$$
where $p(x \mid z, y)$ is the conditional density of $X$ given $(Y, Z)$. The detailed formulas for
calculating PMI are given in the Supplementary Information.
2.2. Path-Consistency Algorithm
In this work, a gene regulatory network is represented by a graph model $G(V, E)$,
where $V$ is a set of nodes (i.e., genes) whose size is denoted as $|V| = p$, and $E$ is a set
of edges, where each edge $e(i, j)$ connects genes $i$ and $j$. The PC algorithm starts from the
fully connected network and then removes the edges that have relatively weak dependence
relationships, based on the selected threshold value.
In the first step, MI is used to remove edges from the fully connected network. If the
MI value of an edge $e(i, j)$ is less than the threshold value $\epsilon_1$, it is assumed that gene $i$ and
gene $j$ are independent, and this edge is removed from the network. After the first stage,
we obtain a network whose density should be larger than the desired density. Since PMI is
not used in this step, the derived network is called the zero-order PMI network.
In the second step, the first-order PMI is used to remove edges from the zero-order
PMI network. For an edge $e(i, j)$ of the zero-order PMI network, if we cannot find any gene
that has edges connecting to both gene $i$ and gene $j$ in the zero-order PMI network, we keep
edge $e(i, j)$ in the network. If we can find one or more such genes, we calculate the first-order
PMI values and keep edge $e(i, j)$ in the network only when the maximal value of these PMI
values is larger than the threshold value $\epsilon_2$.
We can also apply high-order PMI to further remove edges from the network. For
example, to apply the second-order PMI to edge $e(i, j)$, two genes $k_1$ and $k_2$ must exist,
each of which has edges that connect to both gene $i$ and gene $j$. If these two genes do
not exist, edge $e(i, j)$ remains in the network. Otherwise, we calculate the value of the
second-order $\mathrm{PMI}(i, j \mid k_1, k_2)$ and keep edge $e(i, j)$ in the network when this PMI value is
larger than the threshold value $\epsilon_3$. Normally, we use a single threshold value for PMI with
different orders. The following Algorithm 1 summarizes the computational steps of the
PC algorithm using PMI.
Algorithm 1 PC algorithm using PMI (PCA-PMI).
1: Input: dataset $D$ for a network with $N$ genes. Set the threshold $\epsilon$ for deciding independence.
2: Generate a complete network represented by a matrix $G_{N \times N}$ whose elements are all 1.
3: Stage $L = 0$ to get the zero-order PMI network. For each edge $e(i, j)$ in network $G$,
   calculate the value of $\mathrm{MI}(i, j)$, which is the zero-order PMI. If $\mathrm{MI}(i, j) < \epsilon$, let $G_{ij} = 0$ and
   delete edge $e(i, j)$ from the network. Otherwise keep $G_{ij} = 1$.
4: Let $L = L + 1$ to get the $L$th-order PMI network.
   1. For each edge $e(i, j)$ in network $G$ derived from the previous stage, find the genes that
      are connected to both genes $i$ and $j$. Let $T$ be the number of all such genes.
   2. If $T < L$, we cannot calculate the $L$th-order PMI and keep this edge in the network.
   3. If $T \geq L$, select $L$ genes from these $T$ genes (i.e., the number of selections is $C_T^L$). For
      each selection, calculate the $L$th-order PMI.
   4. Find the maximal value $\mathrm{PMI}_{\max}$ of all $C_T^L$ values of the $L$th-order PMI. If
      $\mathrm{PMI}_{\max} < \epsilon$, let $G_{ij} = 0$. Otherwise keep $G_{ij} = 1$.
   5. If $L$ is less than the selected order, go to Step 4 and continue the computation.
      Otherwise stop the program and output the inferred network $G$.
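The control flow of Algorithm 1 can be sketched as follows. This is a simplified illustration under stated assumptions, not the published PCA-PMI code: the PMI is replaced by the Gaussian conditional mutual information (a swapped-in measure that is easier to compute), self-loops are excluded from the initial complete network, and all function names are hypothetical.

```python
from itertools import combinations
import numpy as np

def _logdet(data, idx):
    """Log-determinant of the sample covariance of the selected rows."""
    cov = np.atleast_2d(np.cov(data[list(idx)]))
    return np.linalg.slogdet(cov)[1]

def cond_mi(data, i, j, cond=()):
    """Gaussian MI (empty cond) or CMI between rows i and j given rows in cond.
    Assumption: used here as a stand-in for the PMI of Algorithm 1."""
    c = tuple(cond)
    if not c:
        return 0.5 * (_logdet(data, (i,)) + _logdet(data, (j,)) - _logdet(data, (i, j)))
    return 0.5 * (_logdet(data, (i,) + c) + _logdet(data, (j,) + c)
                  - _logdet(data, c) - _logdet(data, (i, j) + c))

def pc_network(data, eps, max_order=1):
    """data has shape (genes, observations); returns a symmetric 0/1 matrix G."""
    n = data.shape[0]
    G = np.ones((n, n), dtype=int) - np.eye(n, dtype=int)   # complete network
    for order in range(max_order + 1):                      # L = 0, 1, ...
        for i in range(n):
            for j in range(i + 1, n):
                if not G[i, j]:
                    continue
                # Genes connected to both i and j in the current network. Edges are
                # deleted immediately, so this set depends on the variable order.
                nbrs = [k for k in range(n) if k not in (i, j) and G[i, k] and G[j, k]]
                if len(nbrs) < order:
                    continue                                # T < L: keep the edge
                best = max(cond_mi(data, i, j, s) for s in combinations(nbrs, order))
                if best < eps:
                    G[i, j] = G[j, i] = 0
    return G

# Toy chain X0 -> X1 -> X2 (simulated data): the indirect edge (0, 2) should be removed.
rng = np.random.default_rng(0)
x0 = rng.normal(size=500)
x1 = x0 + 0.5 * rng.normal(size=500)
x2 = x1 + 0.5 * rng.normal(size=500)
G = pc_network(np.vstack([x0, x1, x2]), eps=0.05, max_order=1)
```

Because an edge is deleted as soon as its maximal value falls below the threshold, the set of common neighbors seen by later edges changes with the processing order, which is the source of the order dependence discussed in this paper.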
2.3. Statistical Path-Consistency Algorithm (SPCA)
A key issue of the PC algorithm is that the developed network depends on the
order of variables in the system. When $\mathrm{PMI}_{\max}(i, j \mid k) < \epsilon$, edge $e(i, j)$ should be deleted in
Step 4.4 of Algorithm 1. However, if the edge $e(i, k)$ or $e(j, k)$ is deleted before computing
$\mathrm{PMI}(i, j \mid k)$, this PMI does not exist, and edge $e(i, j)$ will remain in the network. To address
this issue, the PC-stable algorithm examines all the possible PMI values first. Rather than
removing an edge immediately after calculating a particular PMI, the PC-stable algorithm saves
the indexes of the edges that should be removed but still retains the edges in the
network. After all the PMI values are determined at one stage, the algorithm removes all
the edges that should be removed [48]. In this way, the influence of variable order is
minimized.
Since the PC-stable algorithm keeps all the possible edges at one stage, this algorithm may
increase the probability of obtaining large PMI values. In addition, the algorithm uses
the maximal value of all the related PMI values as the final value. Although one large value
of PMI is enough to keep that regulation in the network, this method may increase the
maximal value and lead to false positive regulations.
To address these issues, we propose a new method to infer regulatory networks. Since
the order of variables may be simply determined by the experimental conditions (e.g.,
the alphabetical order of gene/protein names), we may not be able to obtain accurate
inferred networks by using this order. We use random samples to generate a large number
of different variable orders. For each generated order, we infer a regulatory network by
using Algorithm 1 with a given threshold value. In this way, $N$ networks are inferred by
using these $N$ variable orders. For each edge $e(i, j)$ connecting genes $i$ and $j$, there are three
possibilities for the appearance of this edge in the $N$ inferred networks.
Case 1: Edge $e(i, j)$ appears in all of the $N$ inferred networks.
Case 2: Edge $e(i, j)$ appears in some of the networks but not in the others.
Case 3: Edge $e(i, j)$ does not appear in any of the $N$ inferred networks.
Note that the threshold value in Algorithm 1 should be selected so that the edge
number in Case 1 is less than the expected edge number in the inferred network, but the
total edge number in Cases 1 and 2 is larger than the expected edge number. Thus,
the key issue is how to rank and select the edges from Case 2.
For any edge $e(i, j)$ in the $k$-th network ($k = 1, \ldots, N$), a weight $w_{ij}^{(k)}$ is defined based
on the appearance of this edge in the $k$-th network, namely $w_{ij}^{(k)} = 1$ if this edge appears in
the network, or $w_{ij}^{(k)} = 0$ if this edge is not in the network. We define the mean weight for
edge $e(i, j)$ as

$$\mathrm{MW}_{ij} = \frac{1}{N} \sum_{k=1}^{N} w_{ij}^{(k)}, \tag{3}$$

which is the first criterion to select the edges for the final network. An edge will remain in
the network if this weight is larger than the given threshold value.
The mean weight (3) may be sensitive to the given threshold value $\epsilon$ in Algorithm 1. Thus,
we consider a second measure that is based on the average of PMI values, defined by

$$\mathrm{APMI}_{ij} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{PMI}_{ij}^{(k)} w_{ij}^{(k)}, \tag{4}$$

where $\mathrm{PMI}_{ij}^{(k)}$ is the PMI value of edge $e(i, j)$ in the $k$-th network. An alternative approach
is to consider the maximal PMI value of the corresponding values in all inferred networks,
given by

$$\mathrm{MPMI}_{ij} = \max_{k = 1, \ldots, N} \left\{ \mathrm{PMI}_{ij}^{(k)} \right\}. \tag{5}$$

We will test the effectiveness of these three criteria in the following studies. Figure 1 gives
the diagram of the proposed SPCA with a detailed description. The following Algorithm 2
gives the major steps of SPCA.
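As a worked illustration of the three criteria (3)–(5), the sketch below computes them from stacked per-network arrays. The array layout (one $n \times n$ slice per inferred network), the function name `edge_measures`, and the toy numbers are assumptions made for this example.

```python
import numpy as np

def edge_measures(W, P):
    """W, P: arrays of shape (N, n, n) holding the 0/1 edge weights w_ij^(k)
    and the PMI values PMI_ij^(k) of the N inferred networks."""
    W = np.asarray(W, dtype=float)
    P = np.asarray(P, dtype=float)
    mw = W.mean(axis=0)            # mean weight, Eq. (3)
    apmi = (P * W).mean(axis=0)    # average PMI, Eq. (4)
    mpmi = P.max(axis=0)           # maximal PMI, Eq. (5)
    return mw, apmi, mpmi

# Toy example: one edge (0, 1) observed in N = 4 networks (made-up values).
W = np.zeros((4, 2, 2))
P = np.zeros((4, 2, 2))
W[:, 0, 1] = W[:, 1, 0] = [1, 1, 0, 1]
P[:, 0, 1] = P[:, 1, 0] = [0.04, 0.06, 0.01, 0.05]
mw, apmi, mpmi = edge_measures(W, P)
```

For this edge the mean weight is 3/4 = 0.75, the average PMI is (0.04 + 0.06 + 0.05)/4 = 0.0375, and the maximal PMI is 0.06. The third network contributes nothing to the average PMI because its weight is zero.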
[Figure 1 panels: (1) Gene expression data; (2) Complete network; (3) Networks by different orders; (4) Full network with weights; (5) Final network. Each network panel shows genes G1–G5; example edge-weight labels are W12 and W15.]
Figure 1. The diagram of SPCA. (1) The omics dataset with $n$ genes and $m$ observations. (2) The
complete network in which each pair of genes is connected. The solid and dashed lines in the network
represent direct and indirect regulations, respectively. (3) A number of networks are inferred by using
the PCA-PMI algorithm (Algorithm 1) based on the different generated variable orders. (4) Calculate
the weight of each edge in the complete network based on the corresponding PMI values in all
inferred networks in Step (3). (5) Select edges with the largest values of PMI measures to form the
inferred network.
Algorithm 2 Statistical path-consistency algorithm (SPCA).
Input: $D_{n \times m}$: molecular (gene or protein) activity dataset with $n$ variables and $m$ observations.
  $V = \{v_1, \ldots, v_n\}$: node set. $N$: number of different variable orders. $\epsilon$: threshold value.
  $M$: number of edges in the output network.
Output: Network $G(V, E)$. $E$ is the set of selected edges.
1: for $id = 1 : N$ do
2:   Generate a sample order $\{k_1, \ldots, k_n\}$ of $\{1, \ldots, n\}$ and reorganize the dataset based
     on the new order of variables $\{x_{k_1}, \ldots, x_{k_n}\}$.
3:   Use the PCA-PMI algorithm (Algorithm 1) to construct a network based on the new order.
4:   For each edge $e(i, j)$, find the PMI value of the highest order $\mathrm{PMI}_{ij}^{(id)}$ (or the edge
     weight $w_{ij}^{(id)}$ based on the threshold value $\epsilon$).
5: end for
6: Calculate the value of $\mathrm{MW}_{ij}$ (3), $\mathrm{APMI}_{ij}$ (4) or $\mathrm{MPMI}_{ij}$ (5). Sort all edges according to
   the calculated values.
7: Select the top $M$ edges with the highest values of the chosen measure (mean weight, average PMI
   or maximal PMI), and form the regulatory network.
8: Export the generated network $G(V, E)$.
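The outer loop of Algorithm 2 can be sketched as below. The sketch treats the network inference step as a pluggable callback: `infer` is a hypothetical function that returns a symmetric matrix of edge scores (zero for removed edges), and `toy_infer`, a thresholded absolute correlation, is only a stand-in for Algorithm 1 so that the example runs on its own. The mapping from the reordered variables back to the original indices is the step that is easy to get wrong in practice.

```python
import numpy as np

def spca(data, infer, n_orders, n_edges, criterion="mpmi", seed=0):
    """Sketch of Algorithm 2. data has shape (n variables, m observations).
    infer(data) is a hypothetical callback returning an (n, n) symmetric matrix
    of edge scores (e.g. PMI values), with 0 for edges that were removed."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    scores = np.zeros((n_orders, n, n))
    for r in range(n_orders):
        perm = rng.permutation(n)                 # Step 2: random variable order
        s = infer(data[perm])                     # Step 3: network for this order
        # Step 4: row a of the reordered data is variable perm[a], so map back.
        scores[r][np.ix_(perm, perm)] = s
    present = scores > 0                          # edge weights w_ij^(k)
    measure = {
        "mw": present.mean(axis=0),               # Eq. (3)
        "apmi": (scores * present).mean(axis=0),  # Eq. (4)
        "mpmi": scores.max(axis=0),               # Eq. (5)
    }[criterion]
    iu, ju = np.triu_indices(n, k=1)              # each undirected pair once
    top = np.argsort(measure[iu, ju])[::-1][:n_edges]   # Steps 6-7
    return sorted((int(iu[t]), int(ju[t])) for t in top)

def toy_infer(d, eps=0.3):
    """Stand-in for Algorithm 1 (assumption): thresholded absolute correlation."""
    c = np.abs(np.corrcoef(d))
    np.fill_diagonal(c, 0.0)
    c[c < eps] = 0.0
    return c

# Simulated data with two true pairs: (0, 1) and (2, 3).
rng = np.random.default_rng(1)
x0 = rng.normal(size=300)
x2 = rng.normal(size=300)
data = np.vstack([x0, x0 + 0.5 * rng.normal(size=300),
                  x2, x2 + 0.5 * rng.normal(size=300)])
edges = spca(data, toy_infer, n_orders=20, n_edges=2)
```

With an order-dependent inference function in place of `toy_infer`, the per-order score matrices would differ, and the three criteria would then rank the Case 2 edges differently.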
To test the influence of the threshold value on the inferred network structure, we use $N$
different variable orders to obtain $N$ different networks. We consider the maximal, minimal
and average edge numbers of these $N$ networks based on the various threshold values. The
variation ratio is defined by

$$\text{Variation ratio} = \frac{\text{max edge number} - \text{min edge number}}{\text{average edge number}}. \tag{6}$$
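As a small worked example of (6), the function below computes the variation ratio from a list of edge numbers. The function name and the three toy edge numbers are illustrative assumptions; they are chosen so that the maximum, minimum and mean match the first row of Table 2 (113, 95 and 105).

```python
def variation_ratio(edge_counts):
    """Eq. (6): (max - min) / mean of the edge numbers of the N networks."""
    counts = list(edge_counts)
    mean = sum(counts) / len(counts)
    return (max(counts) - min(counts)) / mean

# Toy edge numbers (not real data) with max 113, min 95 and mean 105.
ratio = variation_ratio([95, 113, 107])
```

The result is 18/105, i.e., about 17.14%.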
2.4. Accuracy Measures
The sensitivity and accuracy (ACC) are used in this work to measure the accuracy of
inferred networks. The accuracy is defined by

$$\mathrm{ACC} = \frac{TP + TN}{TP + FP + TN + FN},$$

where TP, FP, TN and FN are the numbers of true positive, false positive, true negative
and false negative regulations in the generated network, respectively. To show the accuracy of the
methods, we use the receiver operating characteristic (ROC) curve and the area under the
ROC curve (AUC) as the measures. In this case, we are interested in the true positive rate
(TPR), namely the proportion of true regulations that are correctly predicted, and the false positive rate
(FPR), namely the proportion of non-existent regulations that are wrongly predicted, given by

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}.$$

We also use the positive predictive value (PPV), also called precision, to describe the
accuracy of the methods, defined by

$$\mathrm{PPV} = \frac{TP}{TP + FP}. \tag{7}$$

The ideal value of the PPV, with a perfect test, is 1 (100%), and the worst possible value is
zero. In addition, we use the harmonic mean of precision and sensitivity (i.e., the F1-score) to
measure the accuracy of prediction, defined by

$$F_1 = \frac{2TP}{2TP + FP + FN}. \tag{8}$$
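The measures in this section can be computed from a true and an inferred adjacency matrix as sketched below. The sketch assumes undirected networks, so each gene pair is counted once, and it assumes that every denominator is nonzero. The function names and the four-gene example are hypothetical.

```python
import numpy as np

def edge_metrics(true_adj, pred_adj):
    """Accuracy measures for undirected networks given as 0/1 adjacency matrices."""
    t = np.asarray(true_adj, dtype=bool)
    p = np.asarray(pred_adj, dtype=bool)
    iu = np.triu_indices(t.shape[0], k=1)      # count each gene pair once
    t, p = t[iu], p[iu]
    tp = int(np.sum(t & p))
    fp = int(np.sum(~t & p))
    tn = int(np.sum(~t & ~p))
    fn = int(np.sum(t & ~p))
    return {
        "ACC": (tp + tn) / (tp + fp + tn + fn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "PPV": tp / (tp + fp),                 # Eq. (7)
        "F1": 2 * tp / (2 * tp + fp + fn),     # Eq. (8)
    }

def adjacency(n, edges):
    """Build a symmetric 0/1 adjacency matrix from a list of (i, j) pairs."""
    a = np.zeros((n, n), dtype=int)
    for i, j in edges:
        a[i, j] = a[j, i] = 1
    return a

# Toy example: 4 genes, 3 true edges, 3 predicted edges (2 of them correct).
true_adj = adjacency(4, [(0, 1), (1, 2), (2, 3)])
pred_adj = adjacency(4, [(0, 1), (1, 2), (0, 3)])
m = edge_metrics(true_adj, pred_adj)
```

In this example TP = 2, FP = 1, TN = 2 and FN = 1, so ACC = 4/6, TPR = 2/3, FPR = 1/3, PPV = 2/3 and F1 = 2/3.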
2.5. Experimental Datasets
We use three datasets to test the accuracy of the proposed new algorithm. The first
dataset comes from the DREAM3 and DREAM4 challenges. The DREAM3 dataset has
the gene expression levels of the SOS DNA repair system [50,51]. This dataset includes
100 genes, and each gene has 100 observations. The exact network includes 166 gene–gene
connections. The DREAM4 data consist of in silico networks of gene expression
measurements of steady-state levels, obtained by applying 100 different multifactorial
perturbations to the original network with, in total, 100 genes [52]. Brief information
on the five standard networks from DREAM4 is given in Table 1. For each network, the
degree of each gene varies substantially, from 1 to up to 36. In addition, the density of each
network is quite low.
Table 1. Descriptions of the five datasets for gene expression activities from the DREAM4 challenge.
Dataset     No. of Genes   No. of Samples   Max Degree   Min Degree   No. of Edges   Density
Dataset 1   100            100              26           1            169            0.03483
Dataset 2   100            100              38           1            242            0.04988
Dataset 3   100            100              16           1            192            0.03957
Dataset 4   100            100              16           1            207            0.04267
Dataset 5   100            100              18           1            191            0.03937
The second dataset is for the mitogen-activated protein (MAP) kinase pathway, which includes
the ERK, JNK and p38 pathways and is one of the most important pathways that regulate
a wide range of cellular functions [53,54]. A recent proteomics study generated peptides
to almost 12,000 distinct proteins by using 40 breast cancer cell lines and 4 primary breast
tumors [55]. To generate a relatively small dataset for network analysis, we select proteins
whose functions are connected to cell proliferation. We use the pathway maps in the Kyoto
Encyclopedia of Genes and Genomes (KEGG) [56] to select 57 proteins in the MAP kinase
pathway. These proteins are important components of the MAP kinase pathway as well
as of the crosstalk between the MAP kinase pathway and the Ras and PI3K pathways [57].
However, some of these 57 proteins have about half, or even more than half, of their values
missing. Thus, we first use the regularized iterative principal component analysis algorithm
in the R package missMDA to estimate the missing values and generate a complete dataset
for these 57 proteins.
The third dataset is the RNA sequencing data for acute myeloid leukemia (AML) based
on a large cohort of AML patients from TCGA (http://cancergenome.nih.gov/, accessed on
23 May 2020) [27,58,59]. The Level-3 processed data are used in this study. The RPKM (reads
per kilobase of exon per million mapped reads) values are used as the gene expression
data. This dataset has the expression levels of 81 genes, including 16 transcription factors
and 65 target genes. These transcription factors include a number of well-known tumor
driver genes, such as c-Fos, PU.1 and Egr-1.
3. Results
3.1. Dependence of Network Structure on Variable Orders
We first examine the dependence of the developed networks on the gene order in the
regulatory network using the DREAM3 dataset [50,51]. To examine this dependence,
we use random samples to obtain 1000 different gene orders and use
Algorithm 1 to develop 1000 networks based on the sampled gene orders. Table 2 provides
the maximal, minimal and average edge numbers of these 1000 networks based on different
threshold values. It shows that the developed networks depend on the generated
gene orders. For the four tests in Table 2, the variation ratios are all above 10%. Among
them, the largest ratio is 17.14%. In addition, the ratio value is related to the total number
of edges in the network: the smaller the edge number, the larger the ratio.
Table 2. Variation ratios of the 1000 generated networks using four different threshold values $\epsilon$ for
the DREAM3 network with 100 genes.
$\epsilon$     Mean Edge Number   Min Edge Number   Max Edge Number   Variation Ratio
0.05         105                95                113               17.14%
0.03         155                136               161               16.13%
0.02         197                186               209               11.68%
0.016        250                234               262               11.12%
Figure 2 gives the frequency of the Case 2 edges, namely the edges appearing in only
part of the networks, in all 1000 networks, using four different threshold values. In all four
tests, the edge numbers are evenly distributed. About 50% of the Case 2 edges
appear in more than half of the 1000 generated networks. In addition, the edge number
with frequency 0.5 is relatively large, which increases the difficulty of selecting edges.
The edge number of Case 2 increases with the total network edge
number when a smaller threshold value is used. For example, the edge number of Case 2 is 104
when $\epsilon = 0.05$. However, it increases to 155 when $\epsilon = 0.016$.
Figure 2. Edge frequency of the Case 2 edges in 1000 networks. (A) Threshold value $\epsilon = 0.05$.
(B) $\epsilon = 0.03$. (C) $\epsilon = 0.02$. (D) $\epsilon = 0.016$.
3.2. Effectiveness of SPCA
After examining the dependence of network structure on the variable order, we next
show the accuracy of the proposed SPCA by comparing it with two published methods,
namely the PCA-PMI and the PC-stable algorithm with PMI. We first use the SOS DNA repair
gene dataset for reconstructing gene regulatory networks [51]. We use the published
PCA-PMI algorithm in the literature [24] and the R package pcalg for the PC-stable method [60].
Figure 3 shows that SPCA has an AUC value of 0.8649, which is larger than the value of
the PC-stable algorithm with PMI (i.e., 0.8605) and that of PCA-PMI (i.e., 0.8571). These
results suggest that SPCA has better accuracy than these two algorithms. In addition, the
AUC value of the PC-stable algorithm is slightly smaller than that of the PCA-PMI, which
suggests that the PC-stable algorithm may increase the false positive regulations.
Figure 3. ROC curves of the inferred gene networks by using PCA-PMI (green line), PC-stable-PMI
(blue line) and our proposed SPCA (red line). A larger value of AUC shows that the method is more accurate.
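For readers who wish to reproduce this kind of comparison, the AUC can be computed directly from the edge scores without drawing the ROC curve. The sketch below uses the rank-based identity that the AUC equals the probability that a randomly chosen true edge receives a higher score than a randomly chosen non-edge, with ties counted as one half. The function name and the toy labels and scores are illustrative assumptions, not values from this study.

```python
import numpy as np

def auc_from_scores(labels, scores):
    """AUC as the fraction of (true edge, non-edge) pairs ranked correctly."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[labels], scores[~labels]
    diff = pos[:, None] - neg[None, :]         # all positive-negative score gaps
    wins = np.sum(diff > 0) + 0.5 * np.sum(diff == 0)
    return float(wins / (pos.size * neg.size))

# Toy example: 1 marks a true edge, 0 a non-edge; scores are e.g. MPMI values.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.6, 0.2]
auc = auc_from_scores(labels, scores)
```

In the toy example, eight of the nine (true edge, non-edge) pairs are ranked correctly, so the AUC is 8/9. The pairwise comparison is quadratic in the number of edges, which is acceptable for networks of about 100 genes but would need a sort-based variant for much larger networks.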
We next use five datasets from the DREAM4 challenge to further investigate the
accuracy of our proposed algorithm. We apply the three methods, namely the PCA-PMI,
PC-stable algorithm with PMI, and our SPCA algorithm, to infer the five gene networks.
For the SPCA algorithm, we test three criteria for selecting the putative regulations. For
each network, we run the algorithms with five threshold values ($\epsilon$ = 0.031–0.035) and
obtain networks with slightly more edges than the true network. The ranges
of edge numbers of all the inferred networks are given in Table 3. Figure 4 gives the AUC
values of these three methods applied to networks 1–4. The averaged AUC values in Table 3
suggest that our proposed SPCA algorithm achieves better accuracy than the other two
methods. In addition, the maximal PMI criterion achieves better accuracy than the other
two average criteria. Note that the accuracy of the PC-stable algorithm is not as good as
that of the PCA-PMI algorithm for some networks. In addition, the performance of SPCA
with the mean weight criterion (3) is not as good as the other criteria, but the SPCA with
the maximal PMI criterion (5) has the best performance among the three proposed criteria.
Table 3. Ranges of the edge numbers and values of the area under ROC curve (AUC) for the inferred
networks of the five gene expression datasets from the DREAM4 challenge. (MW: mean weight (3),
APMI: average PMI (4), MPMI: maximal PMI (5)).
Dataset  Edge Ranges  PCA-PMI  PC-Stable  SPCA(MW)  SPCA(APMI)  SPCA(MPMI)
1        232–294      0.6441   0.6484     0.6475    0.6537      0.6523
2        227–296      0.5828   0.5902     0.5978    0.5988      0.6025
3        219–263      0.6980   0.6991     0.7021    0.7024      0.7124
4        236–279      0.6565   0.6507     0.6611    0.6628      0.6679
5        243–297      0.6902   0.6882     0.6914    0.6919      0.6999
Figure 4. Values of the area under ROC curve (AUC) of the inferred networks for datasets 1–4 from
the DREAM4 challenge by using the PCA-PMI algorithm, PC-stable algorithm with PMI and our
proposed SPCA algorithm. (A) Dataset 1; (B) dataset 2; (C) dataset 3; (D) dataset 4. (PCA-PMI:
solid line, PC-stable-PMI: dashed line, SPCA with mean weight (3): dash-dot line, SPCA with average
PMI (4): circle, SPCA with maximal PMI (5): star).
Tables 4 and 5 give the PPV values and F1-scores of these three methods applied
to networks 1–5. Both the PPV values and F1-scores suggest that our proposed SPCA
algorithm with the maximal PMI criterion (5) achieves better accuracy than the other two
methods. Note that the accuracy of the PC-stable algorithm is not as good as that
of the PCA-PMI algorithm for some networks. In addition, the performance of SPCA with
the mean weight (3) and average PMI (4) criteria is not always as good as that of the other two
methods, but SPCA with the maximal PMI criterion (5) has the best performance among the
three proposed criteria. Figure 5 gives the PPV values and F1-scores of these three methods
applied to networks 1–3. When networks have more edges, the PPV values and F1-scores
are smaller. The results in Figure 5 are consistent with those shown in Figure 4. However,
if the AUC values of the SPCA algorithm with mean weight (3) or average PMI (4) are
just marginally better than those of the other two methods, the corresponding PPV value
and/or F1-score of the SPCA algorithm may not be as good as those of the other two
methods for certain datasets.
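For reference, the positive predictive value (PPV, i.e., precision) and the F1-score discussed above
can be computed from an inferred edge set and a reference edge set as in the following minimal
sketch. The edge lists and the function name ppv_recall_f1 are hypothetical and only illustrate the
standard definitions; edges are treated as undirected here, which is an assumption.

```python
def ppv_recall_f1(inferred_edges, true_edges):
    """Return (PPV, sensitivity, F1) for undirected edge lists."""
    inferred = {frozenset(e) for e in inferred_edges}
    truth = {frozenset(e) for e in true_edges}
    tp = len(inferred & truth)  # correctly inferred edges
    ppv = tp / len(inferred) if inferred else 0.0
    recall = tp / len(truth) if truth else 0.0
    f1 = 2 * ppv * recall / (ppv + recall) if (ppv + recall) > 0 else 0.0
    return ppv, recall, f1


# Hypothetical example: four inferred edges, three true edges, two in common.
inferred_edges = [("g1", "g2"), ("g2", "g3"), ("g1", "g4"), ("g3", "g4")]
true_edges = [("g2", "g1"), ("g2", "g3"), ("g2", "g4")]
ppv, recall, f1 = ppv_recall_f1(inferred_edges, true_edges)
print(f"PPV = {ppv:.4f}, sensitivity = {recall:.4f}, F1 = {f1:.4f}")
```

Because the PPV divides the number of true positives by the number of inferred edges, adding edges
that are not in the reference network lowers the PPV, which is in line with the smaller values
observed for denser networks.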
Table 4. Ranges of the edge numbers and positive predictive values (PPV) for the inferred networks
of the five gene expression datasets from the DREAM4 challenge. (MW: mean weight (3), APMI:
average PMI (4), MPMI: maximal PMI (5)).
Dataset  Edge Ranges  PCA-PMI  PC-Stable  SPCA(MW)  SPCA(APMI)  SPCA(MPMI)
1        232–294      0.1964   0.2084     0.2015    0.2074      0.2083
2        227–296      0.2671   0.2738     0.2826    0.2718      0.2898
3        219–263      0.3192   0.3208     0.3259    0.3316      0.3470
4        236–279      0.3064   0.2919     0.3104    0.3041      0.3200
5        243–297      0.3462   0.3296     0.3505    0.3436      0.3614
Table 5. Ranges of the edge numbers and values of the harmonic mean of precision and sensitivity
(F1-score) for the inferred networks of the five gene expression datasets from the DREAM4 challenge.
(MW: mean weight (3), APMI: average PMI (4), MPMI: maximal PMI (5)).
Dataset  Edge Ranges  PCA-PMI  PC-Stable  SPCA(MW)  SPCA(APMI)  SPCA(MPMI)
1        232–294      0.2325   0.2468     0.2388    0.2458      0.2468
2        227–296      0.2671   0.2738     0.2826    0.2718      0.2898
3        219–263      0.3530   0.3549     0.3605    0.3669      0.3838
4        236–279      0.3462   0.3296     0.3505    0.3436      0.3614
5        243–297      0.2964   0.2881     0.2962    0.2946      0.3036
The computing time is not an issue for the implementation of the proposed method.
The CPU time of PCA-PMI is only 5 s for a network with 100 genes using an Apple iMac
with a 3.4 GHz processor. Although the required time of the proposed method is N times
the computing time of PCA-PMI, the value of mutual information is the same for different
variable orders, which can be used to reduce the computing time. To infer a network with
100 genes using 1000 different variable orders, the computing time is less than 5 min using
the same computer.
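The saving mentioned above can be sketched as follows: the pairwise mutual information is computed
once and then looked up for every random variable order, so only the order-dependent steps need to
be repeated. This is a minimal sketch and not the implementation used in this study. The Gaussian
closed form MI = -0.5 ln(1 - r^2), the toy expression values and all function names are assumptions
made for illustration only.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def gaussian_mi(x, y):
    """MI under a joint Gaussian assumption: -0.5 * ln(1 - r^2)."""
    r2 = min(pearson(x, y) ** 2, 1.0 - 1e-12)  # guard against log(0)
    return -0.5 * math.log(1.0 - r2)


def pairwise_mi(data):
    """Compute MI once per unordered pair; the result does not depend on variable order."""
    return {frozenset((a, b)): gaussian_mi(data[a], data[b])
            for a, b in combinations(sorted(data), 2)}


def random_orders(genes, n_orders, seed=0):
    """Generate n_orders random permutations of the gene list."""
    rng = random.Random(seed)
    return [rng.sample(genes, len(genes)) for _ in range(n_orders)]


# Hypothetical expression profiles for three genes over five samples.
data = {
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.1, 3.9, 6.2, 7.8, 10.1],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
}
mi_cache = pairwise_mi(data)  # computed once
orders = random_orders(list(data), n_orders=1000)
lookups = 0
for order in orders:
    for a, b in combinations(order, 2):
        _ = mi_cache[frozenset((a, b))]  # reused for every variable order
        lookups += 1
print(len(mi_cache), "MI values computed,", lookups, "lookups served from the cache")
```

Keying the cache by an unordered pair makes the lookup independent of the position of the two genes
in a given variable order.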
Figure 5. Positive predictive values (PPV) and harmonic mean of precision and sensitivity (F1-score)
of the inferred networks for datasets 1–3 from the DREAM4 challenge by using the PCA-PMI
algorithm, PC-stable algorithm with PMI and our proposed SPCA algorithm. (A) PPV values of dataset 1;
(B) F1-scores of dataset 1; (C) PPV values of dataset 2; (D) F1-scores of dataset 2; (E) PPV values of
dataset 3; (F) F1-scores of dataset 3. (PCA-PMI: solid line, PC-stable-PMI: dashed line, SPCA with mean
weight (3): dash-dot line, SPCA with average PMI (4): circle, SPCA with maximal PMI (5): star).
3.3. MAP Kinase Network
The MAP kinase cascade includes a number of kinases that transfer cellular signals
from the transmembrane protein receptors on the cell surface to DNA in the nucleus to
regulate cellular functions [53,54]. We next apply the proposed SPCA to infer the regulatory
structure of the MAP kinase pathway. By using different threshold values, we develop three
networks with 57, 114 and 170 edges. When the number of edges is 57 or 114,
quite a few proteins form small subnetworks and are isolated from the main network. To
build a connected network, we first use the minimum spanning tree (MST) to connect all
proteins, and then add the edges with the largest PMI weights to the network.
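The two-step construction described above can be sketched as follows. This is a minimal
illustration under stated assumptions and not the code used in this study: the spanning tree is
built with Kruskal's algorithm using the negative PMI value as the edge cost (so that the tree keeps
the strongest associations), and the node names, the PMI values and the function name
mst_then_top_edges are hypothetical.

```python
def mst_then_top_edges(nodes, pmi, n_edges):
    """Connect all nodes with a spanning tree, then add the strongest remaining edges.

    pmi maps frozenset({u, v}) to a PMI weight for every candidate pair.
    The tree edges are always kept, even if n_edges is smaller than the tree.
    """
    parent = {v: v for v in nodes}

    def find(v):
        # Union-find root lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    # Strongest association first, i.e., minimum cost when cost = -PMI (assumption).
    ranked = sorted(pmi, key=pmi.get, reverse=True)

    tree = []
    for edge in ranked:
        u, v = tuple(edge)
        ru, rv = find(u), find(v)
        if ru != rv:  # the edge joins two components, so it creates no cycle
            parent[ru] = rv
            tree.append(edge)

    chosen = set(tree)
    for edge in ranked:  # top up with the largest PMI weights
        if len(chosen) >= n_edges:
            break
        chosen.add(edge)
    return chosen


# Hypothetical PMI weights for four proteins; D is only weakly associated.
nodes = ["A", "B", "C", "D"]
pmi = {
    frozenset(("A", "B")): 0.90,
    frozenset(("A", "C")): 0.80,
    frozenset(("B", "C")): 0.70,
    frozenset(("C", "D")): 0.10,
    frozenset(("A", "D")): 0.05,
    frozenset(("B", "D")): 0.02,
}
by_threshold = set(sorted(pmi, key=pmi.get, reverse=True)[:3])  # leaves D isolated
connected = mst_then_top_edges(nodes, pmi, n_edges=4)
print(sorted(tuple(sorted(e)) for e in connected))
```

In this toy example, selecting the three strongest edges alone leaves protein D isolated, whereas
the tree-first construction keeps the weak C-D edge so that every protein is attached to the main
network.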
Figure 6 gives the inferred network with 106 connections. We use different colors to
represent proteins with different connection numbers (i.e., degrees). It shows that proteins
in the MAP kinase three-tier modules, such as Raf1, RafB, and MKK3, have relatively
larger degrees. On the other hand, the downstream targets, such as the transcription factors
p53 and JunD, may have fewer connections to other proteins. However, this rule is not
always true. For example, kinase p38 has only one connection.
Figure 6. Inferred network of the MAP kinase pathway with N = 57 proteins and 106 connections by
using the proposed SPCA. A protein with darker color has more inferred connections.
There are four types of connections in the developed network. The first type represents
protein regulations that have been confirmed by experimental studies, such as the
connections MEK-ERK, PI3K-AKT, and RafB-AKT. The second type is for connections between protein
isoforms or proteins that are regulated by the same upstream protein. For example,
MEK1-MEK2 connects two isoforms of the MEK protein. Similar connections include MKK3-MKK6.
Quite a number of connections belong to the third type, namely indirect connections that
may have other proteins between the connected proteins. For example, RafB-ERK is an
indirect connection within the MAP kinase module Raf-MEK-ERK. Similar connections include
GRB2-Raf1 and MEKK3-MAPKAPK. For the last type, we cannot find any
published results to support the connections. These connections are predictions of
putative regulations or may be special connections in cancer cells. Note that the above
conclusions are based on the pathway maps from KEGG. In recent years, experimental
studies have found more connections between the proteins in the MAP kinase pathway.
3.4. Reconstruction of Cancer-Specific Gene Regulatory Network
After the study of the MAP kinase pathway, we next examine a gene regulatory network
for acute myeloid leukemia (AML) [27]. Due to the dysfunction of certain functional modules
and pathways, the regulatory relationships between transcription factors and target genes
in disease cells may differ from those in healthy cells [61]. To explore the gene
regulations in cancer cells, we apply SPCA to develop a regulatory network for AML using
the RNA-seq data of a large cohort of AML patients from TCGA [27,58,59]. Figure 7 shows
the inferred AML-specific gene network by using SPCA. In this network, there are 81
cancer genes in total, including 16 transcription factors and 65 target genes. This system has also
been studied by RACER [62] to develop a regulatory network using the same AML gene
expression dataset. By using different threshold values, we can detect different numbers of
regulations. Although there are 227 reported regulations, the real regulations in cancer cells
are not known. Here, our inferred network is only a prediction of the potential regulations.
Further experimental studies are needed to evaluate these predictions.
Figure 7.
The inferred acute myeloid leukemia (AML) genetic regulatory network generated by
SPCA. Blue genes: regulators. Pink genes: target genes.
4. Discussions and Conclusions
The motivation of this study is to address the issue of variable order dependence
when using conditional correlation methods to infer complex systems. An observation
dataset has a variable order that is determined by the experimental conditions, for example,
the alphabetic order of gene names or the rank of patient ID numbers. Varying
the given variable order leads to different inferred networks, some of which may have better accuracy.
Since the optimal order may not exist or is not known, we generate a number of different
variable orders, and then calculate the PMI values of each edge based on different variable
orders and infer different regulatory networks. The key question is how to derive the
final inferred network based on these networks. In this work, we propose three criteria
to determine the weight or PMI value of each edge and then infer the final network. The
research results suggest that the criterion using the maximal PMI value leads to inferred
networks with better accuracy.
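The aggregation step summarized above can be illustrated by the following minimal sketch, which
combines the networks inferred from several variable orders into one weight per edge. The exact
definitions of criteria (3)-(5) are given earlier in the paper; the formulas used here are only one
plausible reading and are assumptions: the mean weight averages over all variable orders and counts
an absent edge as zero, the average PMI averages only over the networks that contain the edge, and
the maximal PMI takes the largest value. The per-order networks and all names are hypothetical.

```python
def aggregate_edge_weights(networks):
    """networks: list of dicts mapping frozenset({u, v}) to the PMI of a retained edge."""
    n_orders = len(networks)
    edges = set().union(*networks)  # every edge retained for at least one variable order
    table = {}
    for edge in edges:
        values = [net[edge] for net in networks if edge in net]
        table[edge] = {
            "mean_weight": sum(values) / n_orders,    # absent edge counted as 0 (assumption)
            "average_pmi": sum(values) / len(values),  # only networks containing the edge
            "maximal_pmi": max(values),
        }
    return table


def top_edges(table, criterion, n_keep):
    """Keep the n_keep edges with the largest aggregated weight under one criterion."""
    ranked = sorted(table, key=lambda e: table[e][criterion], reverse=True)
    return set(ranked[:n_keep])


AB, BC, AC = frozenset("AB"), frozenset("BC"), frozenset("AC")
# Hypothetical networks inferred from three different variable orders.
networks = [
    {AB: 0.5, BC: 0.2},
    {AB: 0.3, AC: 0.4},
    {AB: 0.4, BC: 0.6},
]
table = aggregate_edge_weights(networks)
best_by_mean = top_edges(table, "mean_weight", 1)
best_by_max = top_edges(table, "maximal_pmi", 1)
print(best_by_mean == {AB}, best_by_max == {BC})
```

In this toy example the edge A-B is favored by the mean weight because it is retained for every
variable order, whereas the maximal PMI favors B-C, which reaches the largest single value. This
shows how the choice of criterion can change the final network.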
With more and more generated omics datasets, there is a strong need to design effective
algorithms to infer regulatory relationships or functional relationships between genes and
proteins. However, it is well recognized that the gene expression process is nonlinear,
sparse and systematic. In addition, time delay widely exists in gene expression and the
length of delay varies from gene to gene. In the current information-theory-based methods,
a single threshold value is normally used to select the significant regulations. We tested
the DREAM challenge datasets but found certain regulations have quite small correlation
coefficient values and MI values. Thus, these regulations may never be selected unless a
small threshold value is used. However, in that case, the inferred network would be quite
dense. On the other hand, it is relatively easy to incorporate nonlinearity and time delay
into a dynamic model, but the inference of model parameters is still a challenging issue
for a relatively large network. Although quite a large number of algorithms have been
proposed to infer regulatory networks [13,63], the accuracy of developed networks is still
not satisfactory. Further research in this area remains challenging and exciting.
In summary, this work develops a new method called SPCA to infer molecular
regulation networks using various omics datasets. To address the dependence of
network structure on the variable order, this method generates a number of variable orders
and develops a network for each variable order by using the PCA-PMI algorithm.
Information theory is used to measure the correlation relationship between each pair of genes or
proteins. We calculate the weight or PMI value of each edge based on the networks obtained from
the different variable orders, and use a threshold value to select the final putative regulations.
The standard networks in DREAM challenges are used to examine the accuracy of the
proposed method. We also use the new method to develop networks for the mitogen-activated protein
(MAP) kinase pathway and a cancer-specific gene regulatory network. Our studies suggest
that the developed method is effective for discovering molecular regulatory systems. In
this work, we propose a general approach to improve the inference accuracy by generating
a large number of different variable orders. This approach may be applied to certain
current state-of-the-art methods to obtain more promising results, which will be an interesting
topic for our future research.
Supplementary Materials:
The following are available online at https://www.mdpi.com/article/10.3390/e24050693/s1,
Supplementary Information.
Author Contributions:
Conceptualization, T.T.; methodology, X.Z. and T.T.; software, Y.Y.; validation,
Y.Y., F.J., X.Z. and T.T.; formal analysis, Y.Y., F.J., X.Z. and T.T.; investigation, Y.Y., F.J., X.Z. and
T.T.; writing—original draft preparation, Y.Y. and T.T.; writing—review and editing, X.Z. and T.T.;
funding acquisition, Y.Y., F.J. and X.Z. All authors have read and agreed to the published version of
the manuscript.
Funding:
This research was funded by National Natural Science Foundation of China (11871238,
11931019, 61773401), the Science Foundation of Wuhan Institute of Technology (20QD47), and the
Foundation of Zhongnan University of Economics and Law (3173211205).
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: Data sharing is not applicable to this article.
Conflicts of Interest: The authors declare no conflict of interest.
Abbreviations
The following abbreviations are used in this manuscript:
SPCA Statistical Path-Consistency Algorithm
MAP Mitogen-Activated Protein
DREAM Dialogue for Reverse Engineering Assessments and Methods
CMI Conditional Mutual Information
PCA Path-Consistency Algorithm
AUC Area Under ROC Curve
AML Acute Myeloid Leukemia
References
1. Haas, R.; Zelezniak, A.; Iacovacci, J.; Kamrad, S.; Townsend, S.; Ralser, M. Designing and interpreting ‘multi-omic’ experiments that may change our understanding of biology. Curr. Opin. Syst. Biol. 2017, 6, 37–45.
2. Vogel, C.; Marcotte, E.M. Insights into the regulation of protein abundance from proteomic and transcriptomic analyses. Nat. Rev. Genet. 2012, 13, 227–232.
3. Saintantoine, M.M.; Singh, A. Network inference in systems biology: Recent developments, challenges, and applications. Curr. Opin. Biotechnol. 2020, 63, 89–98.
4. Karlebach, G.; Shamir, R. Modelling and analysis of gene regulatory networks. Nat. Rev. Mol. Cell Biol. 2008, 9, 770–780.
5. Basso, K.; Margolin, A.A.; Stolovitzky, G.; Klein, U.; Dallafavera, R.; Califano, A. Reverse engineering of regulatory networks in human b cells. Nat. Genet. 2005, 37, 382–390.
6. De Smet, R.; Marchal, K. Advantages and limitations of current network inference methods. Nat. Rev. Microbiol. 2010, 8, 717–729.
7. Li, H.; Xie, L.; Zhang, X.; Wang, Y. Wisdom of crowds for robust gene network inference. Nat. Methods 2012, 9, 796–804.
8. Laehnemann, D.; Köster, J.; Szczurek, E.; McCarthy, D.J.; Hicks, S.C.; Robinson, M.D.; Vallejos, C.A.; Campbell, K.R.; Beerenwinkel, N.; Mahfouz, A.; et al. Eleven grand challenges in single-cell data science. Genome Biol. 2020, 21, 31.
9. Dai, H.; Jin, Q.Q.; Li, L.; Chen, L.N. Reconstructing gene regulatory networks in single-cell transcriptomic data analysis. Zool. Res. 2020, 41, 599–604.
10. Stumpf, M.P.H. Inferring better gene regulation networks from single-cell data. Curr. Opin. Syst. Biol. 2021, 27, 100342.
11. Maetschke, S.R.; Madhamshettiwar, P.B.; Davis, M.J.; Ragan, M.A. Supervised, semi-supervised and unsupervised inference of gene regulatory networks. Brief. Bioinform. 2013, 15, 195–211.
12. Huynthu, V.A.; Sanguinetti, G. Gene regulatory network inference: An Introductory Survey. Methods Mol. Biol. 2019, 1883, 1–23.
13. Zhao, M.; He, W.; Tang, J.; Zou, Q.; Guo, F. A comprehensive overview and critical evaluation of gene regulatory network inference technologies. Brief. Bioinform. 2021, 22, bbab009.
14. Li, X.; Li, W.; Zeng, M.; Zheng, R.; Li, M. Network-based methods for predicting essential genes or proteins: A survey. Brief. Bioinform. 2020, 21, 566–583.
15. Liu, Z. Quantifying Gene Regulatory Relationships with Association Measures: A Comparative Study. Front. Genet. 2017, 8, 96.
16. Stuart, J.M.; Segal, E.; Koller, D.; Kim, S.M. A gene-coexpression network for global discovery of conserved genetic modules. Science 2003, 302, 249–255.
17. Casadiego, J.; Nitzan, M.; Hallerberg, S.; Timme, M. Model-free inference of direct network interactions from nonlinear collective dynamics. Nat. Commun. 2017, 8, 2192.
18. Peng, C.; Zou, L.; Huang, D.S. Discovery of relationships between long non-coding RNAs and genes in human diseases based on tensor completion. IEEE Access 2018, 6, 59152–59162.
19. Yuan, L.; Guo, L.; Yuan, C.; Zhang, Y.; Han, K.; Nandi, A.K.; Honig, B.; Huang, D. Integration of multi-omics data for gene regulatory network inference and application to breast cancer. IEEE/ACM Trans. Comput. Biol. Bioinf. 2019, 16, 782–791.
20. Skinnider, M.A.; Squair, J.W.; Foster, L.J. Evaluating measures of association for single-cell transcriptomics. Nat. Methods 2019, 16, 381–386.
21. Yang, B.; Bao, W.; Huang, D.S.; Chen, Y. Inference of large-scale time-delayed gene regulatory network with parallel MapReduce cloud platform. Sci. Rep. 2018, 8, 17787.
22. Yang, B.; Chen, Y.; Zhang, W.; Lv, J.; Bao, W.; Huang, D. Hscvfnt: Inference of time-delayed gene regulatory network based on complex-valued flexible neural tree model. Int. J. Mol. Sci. 2018, 19, 3178.
23. Meyer, P.E.; Kontos, K.; Lafitte, F.; Bontempi, G. Information-theoretic inference of large transcriptional regulatory networks. EURASIP J. Bioinf. Syst. Biol. 2007, 2007, 8.
24. Zhang, X.; Zhao, X.; He, K.; Lu, L.; Cao, Y.; Liu, J.; Hao, J.; Liu, Z.; Chen, L. Inferring gene regulatory networks from gene expression data by path consistency algorithm based on conditional mutual information. Bioinformatics 2012, 28, 98–104.
25. Guo, X.; Zhang, H.; Tian, T. Development of stock correlation networks using mutual information and financial big data. PLoS ONE 2018, 13, e0195941.
26. Yan, Y.; Wu, B.; Tian, T.; Zhang, H. Development of Stock Networks Using Part Mutual Information and Australian Stock Market Data. Entropy 2020, 22, 773.
27. Zhang, X.; Zhao, J.; Hao, J.; Zhao, X.; Chen, L. Conditional mutual inclusive information enables accurate quantification of associations in gene regulatory networks. Nucleic Acids Res. 2015, 43, e31.
28. Zhao, J.; Zhou, Y.; Zhang, X.; Chen, L. Part mutual information for quantifying direct associations in networks. Proc. Natl. Acad. Sci. USA 2016, 113, 5130–5135.
29. Li, H.; Xie, L.; Wang, Y. Output Regulation of Boolean Control Networks. IEEE Trans Auto Control 2017, 62, 2993–2998.
30. Ouyang, H.; Fang, J.; Shen, L.; Dougherty, E.R.; Liu, W. Learning restricted Boolean network model by time-series data. EURASIP J. Bioinform. Syst. Biol. 2014, 2014, 10.
31. Cantone, I.; Marucci, L.; Iorio, F.; Ricci, M.A.; Belcastro, V.; Bansal, M.; Santini, S.; Di Bernardo, M.; Di Bernardo, D.; Cosma, M.P. A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell 2019, 137, 172–181.
32. Chan, T.E.; Stumpf, M.P.; Babtie, A.C. Gene regulatory network inference from single-cell data using multivariate information measures. Cell Syst. 2017, 5, 251–267.
33. Kishan, K.C.; Li, R.; Cui, F.; Haake, A.R. GNE: A deep learning framework for gene network inference by aggregating biological information. BMC Syst. Biol. 2019, 13, 38.
34. Ma, B.; Fang, M.; Jiao, X. Inference of gene regulatory networks based on nonlinear ordinary differential equations. Bioinformatics 2020, 36, 4885–4893.
35. Specht, A.T.; Li, J. LEAP: Constructing gene co-expression networks for single-cell RNA-sequencing data using pseudo-time ordering. Bioinformatics 2016, 33, 764–766.
36. Matsumoto, H.; Kiryu, H.; Furusawa, C.; Gunawan, R. SCODE: An efficient regulatory network inference algorithm from single-cell RNA-seq during differentiation. Bioinformatics 2017, 33, 2314–2321.
37. Matsumoto, H.; Kiryu, H. SCOUP: probabilistic model based on the Ornstein–Uhlenbeck process to analyze single-cell expression data during differentiation. BMC Bioinform. 2016, 17, 232.
38. Bonnaffoux, A.; Herbach, U.; Richard, A.; Guillemin, A.; Gonin-Giraud, S.; Gros, P.A.; Gandrillon, O. WASABI: A dynamic iterative framework for gene regulatory network inference. BMC Bioinform. 2019, 20, 220.
39. Andrea, O.; Laleh, H.; Mueller, N.S.; Theis, F.J. Reconstructing gene regulatory dynamics from high-dimensional single-cell snapshot data. Bioinformatics 2015, 31, 89–96.
40. Wang, J.; Wu, Q.; Hu, X.T.; Tian, T. An integrated platform for reverse-engineering protein-gene interaction network. Methods 2016, 110, 3–13.
41. Wei, J.; Hu, X.; Zou, X.; Tian, T. Reverse-engineering of gene networks for regulating early blood development from single-cell measurements. BMC Med. Genom. 2017, 10, 72.
42. Wu, S.; Cui, T.; Zhang, X.; Tian, T. A non-linear reverse-engineering method for inferring genetic regulatory networks. PeerJ 2020, 8, e9065.
43. Yan, Y.; Jiang, F.; Zhang, X.; Tian, T. Integrated inference of asymmetric protein interaction networks using dynamic model and individual patient proteomics data. Symmetry 2021, 13, 1097.
44. Huynh-Thu, V.A.; Irrthum, A.; Wehenkel, L.; Geurts, P. Inferring regulatory networks from expression data using tree-based methods. PLoS ONE 2010, 5, e12776.
45. Yang, B.; Bao, W. RNDEtree: regulatory network with differential equation based on flexible neural tree with novel criterion function. IEEE Access 2019, 7, 58255–58263.
46. Ye, Y.; Bar-Joseph, Z. Deep learning for inferring gene relationships from single-cell expression data. Proc. Natl. Acad. Sci. USA 2019, 116, 27151–27158.
47. Yan, Y.; Zhang, X.; Tian, T. Inference method for reconstructing regulatory networks using statistical path-consistency algorithm and mutual information. Lect. Notes Comput. Sci. 2020, 12464, 45–56.
48. Colombo, D.; Maathuis, M.H. Order-independent constraint-based causal structure learning. J. Mach. Learn Res. 2014, 15, 3741–3782.
49. Janzing, D.; Balduzzi, D.; Grosse-Wentrup, M.; Schölkopf, B. Quantifying causal influences. Ann. Stat. 2013, 41, 2324–2358.
50. Marbach, D.; Prill, R.J.; Schaffter, T.; Mattiussi, C.; Floreano, D.; Stolovitzky, G. Revealing strengths and weaknesses of methods for gene network inference. Proc. Natl. Acad. Sci. USA 2010, 107, 6286–6291.
51. Ronen, M.; Rosenberg, R.; Shraiman, B.I. Assigning numbers to the arrows: Parameterizing a gene regulation network by using accurate expression kinetics. Proc. Natl. Acad. Sci. USA 2002, 99, 10555–10560.
52. Greenfield, A.; Madar, A.; Ostrer, H.; Bonneau, R. DREAM4: combining genetic and dynamic information to identify biological networks and dynamical models. PLoS ONE 2014, 5, e13397.
53. Cargnello, M.; Roux, P.P. Activation and function of the mapks and their substrates, the mapk-activated protein kinases. Microbiol. Mol. Biol. Rev. 2011, 75, 50–83.
54. Tian, T.; Song, J. Mathematical modelling of the MAP kinase pathway based on proteomics dataset. PLoS ONE 2012, 7, e42230.
55. Pozniak, Y.; Balintlahat, N.; Rudolph, J.D.; Lindskog, C.; Katzir, R.; Avivi, C.; Ponten, F.; Ruppin, E.; Barshack, I.; Geiger, T. System-wide clinical proteomics of breast cancer reveals global remodeling of tissue homeostasis. Cell Syst. 2016, 2, 172–184.
56. Kanehisa, M.; Goto, S.; Kawashima, S.; Okuno, Y.; Hattori, M. The kegg resource for deciphering the genome. Nucleic Acids Res. 2004, 32 (Suppl. 1), 277–280.
57. Lawrence, R.T.; Perez, E.M.; Hernandez, D.; Miller, C.P.; Haas, K.M.; Irie, H.Y.; Lee, S.I.; Blau, C.A.; Villén, J. The proteomic landscape of triple-negative breast cancer. Cell Rep. 2015, 11, 630–644.
58. McLendon, R.; Friedman, A.D.B.; Van Meir, E.G.; Brat, D.J.; Mastrogianakis, G.M.; Olson, J.J.; Mikkelsen, T.; Lehman, N.; Aldape, K. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature 2008, 455, 1061–1068.
59. Cancer Genome Atlas Research Network. Genomic and epigenomic landscapes of adult de novo acute myeloid leukemia. N. Engl. J. Med. 2013, 368, 2059.
60. Kalisch, M.; Maechler, M.; Colombo, D. Causal inference using graphical models with the r package pcalg. J. Stat. Softw. 2012, 47, 1–26.
61. Liu, K.; Liu, Z.; Hao, J.; Chen, L.; Zhao, X.M. Identifying dysregulated pathways in cancers from pathway interaction networks. BMC Bioinform. 2012, 13, 126.
62. Li, Y.; Liang, M.; Zhang, Z. Regression analysis of combined gene expression regulation in acute myeloid leukemia. PLoS Comput. Biol. 2014, 10, 1003908.
63. Wu, Z.; Liao, Q.; Liu, B. A comprehensive review and evaluation of computational methods for identifying protein complexes from protein-protein interaction networks. Brief. Bioinform. 2020, 21, 1531–1548.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Recent advances in experimental biology studies have produced large amount of molecular activity data. In particular, individual patient data provide non-time series information for the molecular activities in disease conditions. The challenge is how to design effective algorithms to infer regulatory networks using the individual patient datasets and consequently address the issue of network symmetry. This work is aimed at developing an efficient pipeline to reverse-engineer regulatory networks based on the individual patient proteomic data. The first step uses the SCOUT algorithm to infer the pseudo-time trajectory of individual patients. Then the path-consistent method with part mutual information is used to construct a static network that contains the potential protein interactions. To address the issue of network symmetry in terms of undirected symmetric network, a dynamic model of ordinary differential equations is used to further remove false interactions to derive asymmetric networks. In this work a dataset from triple-negative breast cancer patients is used to develop a protein-protein interaction network with 15 proteins.
Article
Full-text available
Gene regulatory networks play pivotal roles in our understanding of biological processes/mechanisms at the molecular level. Many studies have developed sample-specific or cell-type-specific gene regulatory networks from single-cell transcriptomic data based on a large amount of cell samples. Here, we review the state-of-the-art computational algorithms and describe various applications of gene regulatory networks in biological studies. © 2020 Editorial Office of Zoological Research, Kunming Institute of Zoology, Chinese Academy of Sciences
Article
Full-text available
Complex network is a powerful tool to discover important information from various types of big data. Although substantial studies have been conducted for the development of stock relation networks, correlation coefficient is dominantly used to measure the relationship between stock pairs. Information theory is much less discussed for this important topic, though mutual information is able to measure nonlinear pairwise relationship. In this work we propose to use part mutual information for developing stock networks. The path-consistency algorithm is used to filter out redundant relationships. Using the Australian stock market data, we develop four stock relation networks using different orders of part mutual information. Compared with the widely used planar maximally filtered graph (PMFG), we can generate networks with cliques of large size. In addition, the large cliques show consistency with the structure of industrial sectors. We also analyze the connectivity and degree distributions of the generated networks. Analysis results suggest that the proposed method is an effective approach to develop stock relation networks using information theory.
Article
Full-text available
Hematopoiesis is a highly complex developmental process that produces various types of blood cells. This process is regulated by different genetic networks that control the proliferation, differentiation, and maturation of hematopoietic stem cells (HSCs). Although substantial progress has been made for understanding hematopoiesis, the detailed regulatory mechanisms for the fate determination of HSCs are still unraveled. In this study, we propose a novel approach to infer the detailed regulatory mechanisms. This work is designed to develop a mathematical framework that is able to realize nonlinear gene expression dynamics accurately. In particular, we intended to investigate the effect of possible protein heterodimers and/or synergistic effect in genetic regulation. This approach includes the Extended Forward Search Algorithm to infer network structure (top-down approach) and a non-linear mathematical model to infer dynamical property (bottom-up approach). Based on the published experimental data, we study two regulatory networks of 11 genes for regulating the erythrocyte differentiation pathway and the neutrophil differentiation pathway. The proposed algorithm is first applied to predict the network topologies among 11 genes and 55 non-linear terms which may be for heterodimers and/or synergistic effect. Then, the unknown model parameters are estimated by fitting simulations to the expression data of two different differentiation pathways. In addition, the edge deletion test is conducted to remove possible insignificant regulations from the inferred networks. Furthermore, the robustness property of the mathematical model is employed as an additional criterion to choose better network reconstruction results. Our simulation results successfully realized experimental data for two different differentiation pathways, which suggests that the proposed approach is an effective method to infer the topological structure and dynamic property of genetic regulations.
Article
Full-text available
The recent boom in microfluidics and combinatorial indexing strategies, combined with low sequencing costs, has empowered single-cell sequencing technology. Thousands-or even millions-of cells analyzed in a single experiment amount to a data revolution in single-cell biology and pose unique data science problems. Here, we outline eleven challenges that will be central to bringing this emerging field of single-cell data science forward. For each challenge, we highlight motivating research questions, review prior work, and formulate open problems. This compendium is for established researchers, newcomers, and students alike, highlighting interesting and rewarding problems for the coming years.
Article
Networks can provide a graphical representation of complex systems, including gene regulation processes. They can also provide a basis for further quantitative and computational analysis and modelling of biological systems. Single cell technology is now allowing us to probe molecular mechanisms and processes at unprecedented detail, and there is hope that we can learn better, more detailed network models from such data, in order to help us understand the mechanisms underlying cellular processes. But learning the structure of networks is notoriously difficult for at least two reasons: (i) network inference is a statistically challenging problem; and (ii) the naive picture of static networks may be fundamentally inapplicable to the description of biological systems. Here I will give some overview of the basic problem; discuss a set of promising network inference methods and how to validate them; and outline how we can go beyond the limitations imposed by static network models.
Article
A gene regulatory network (GRN) is an important mechanism for maintaining life processes, controlling biochemical reactions and regulating compound levels, and it plays an important role in various organisms and systems. Reconstructing GRNs can help us to understand the molecular mechanisms of organisms and to reveal the essential rules of a large number of biological processes and reactions in organisms. To deal with uncertainty, network reconstruction algorithms rely on specific assumptions that affect prediction accuracy. In order to study why a certain method is more suitable for a specific research problem or experimental dataset, we review methods in three classes: model-based, information-based and machine learning-based. Furthermore, we discuss several classical, representative and recent methods in each category to analyze their core ideas, general steps and characteristics. We compare the performance of state-of-the-art GRN reconstruction techniques on simulated networks and real networks of different scales. Through standardized performance metrics and common benchmarks, we quantitatively evaluate the stability of various methods and the sensitivity of the same algorithm when applied to networks of different scales. The aim of this study is to explore the most appropriate method for a specific GRN, which helps biologists and medical scientists in discovering potential drug targets and identifying cancer biomarkers.
Chapter
Advances in high-throughput technologies have produced huge amounts of data on gene expression and protein activity under various experimental conditions. The reverse-engineering of regulatory networks using these datasets is one of the most important research topics in computational biology. Although substantial efforts have been devoted to designing effective inference methods, there are still a number of significant challenges in dealing with the weak correlations between the observation data and the dependence of network structure on the order of variables in the systems. To address these issues, this work proposes a novel statistical approach to infer the structure of regulatory networks. Instead of using one single variable order, we generate a number of variable orders and then obtain different networks based on these orders. The weight of each edge connecting genes/proteins is determined by statistical measures based on the networks generated using different variable orders. Our proposed algorithm is evaluated using the gold standard networks in the DREAM challenges and a cell signal transduction pathway with experimental data. Inference results suggest that our proposed algorithm is an effective approach for the reverse-engineering of regulatory networks with better accuracy.
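The order-resampling idea in this abstract (infer one network per random variable order, then score each edge by how often it appears) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `threshold` and `n_orders` parameters, and the use of first-order partial correlation as the pruning test are all assumptions made for illustration.

```python
import numpy as np

def partial_corr(c, i, j, k):
    """First-order partial correlation of variables i and j given k,
    computed from a correlation matrix c."""
    denom = np.sqrt((1 - c[i, k] ** 2) * (1 - c[j, k] ** 2))
    return (c[i, j] - c[i, k] * c[j, k]) / denom if denom > 1e-12 else 0.0

def infer_once(data, order, threshold):
    """Toy order-dependent pruning (a stand-in for a path-consistency step).
    Returns a symmetric 0/1 adjacency matrix."""
    n = data.shape[1]
    c = np.corrcoef(data, rowvar=False)      # columns are variables
    adj = np.abs(c) > threshold              # zero-order screening
    np.fill_diagonal(adj, False)
    for a in range(n):
        for b in range(a + 1, n):
            i, j = order[a], order[b]        # visit pairs in the given order
            if not adj[i, j]:
                continue
            # Condition on the *current* common neighbours; earlier removals
            # change this set, which is what makes the result order-dependent.
            for k in np.flatnonzero(adj[i] & adj[j]):
                if abs(partial_corr(c, i, j, k)) <= threshold:
                    adj[i, j] = adj[j, i] = False
                    break
    return adj.astype(int)

def edge_frequencies(data, n_orders=50, threshold=0.3, seed=0):
    """Average the adjacency matrices obtained under random variable orders.
    Entry (i, j) is the fraction of orders in which edge i-j was retained."""
    rng = np.random.default_rng(seed)
    n = data.shape[1]
    counts = np.zeros((n, n))
    for _ in range(n_orders):
        counts += infer_once(data, rng.permutation(n), threshold)
    return counts / n_orders
```

Edges with the highest frequencies would then be kept as putative regulations; the cut-off for "highest" is a further choice that this sketch leaves open.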
Article
Genes that are thought to be critical for the survival of organisms or cells are called essential genes. The prediction of essential genes and their products (essential proteins) is of great value in exploring the mechanisms of complex diseases, studying the minimal genome required for living cells and developing new drug targets. As laboratory methods are often complicated, costly and time-consuming, a great many computational methods have been proposed to identify essential genes/proteins at the network level, driven by the in-depth understanding of network biology and the rapid development of biotechnologies. By analyzing the topological characteristics of essential genes/proteins in protein-protein interaction networks (PINs), integrating biological information and considering the dynamic features of PINs, network-based methods have proven effective in the identification of essential genes/proteins. In this paper, we survey the advanced methods for network-based prediction of essential genes/proteins and present the challenges and directions for future research.
Article
Motivation: Gene regulatory networks (GRNs) capture the regulatory interactions between genes, resulting from the fundamental biological processes of transcription and translation. In some cases, the topology of GRNs is not known, and has to be inferred from gene expression data. Most of the existing GRN reconstruction algorithms are applied to either time-series data or steady-state data. Although time-series data include more information about the system dynamics, steady-state data imply stability of the underlying regulatory networks. Results: In this paper, we propose a method for inferring GRNs from time-series data and steady-state data jointly. We make use of a nonlinear ordinary differential equations framework to model dynamic gene regulation and an importance measurement strategy to infer all putative regulatory links efficiently. The proposed method is evaluated extensively on the artificial DREAM4 dataset and two real gene expression datasets of yeast and Escherichia coli. Based on public benchmark datasets, the proposed method outperforms other popular inference algorithms in terms of overall score. A comparison of performance on datasets of different scales shows that our method maintains good robustness and accuracy at a low computational complexity. Availability and implementation: The proposed method is written in the Python language, and is available at: https://github.com/lab319/GRNs_nonlinear_ODEs. Supplementary information: Supplementary data are available at Bioinformatics online.