Meta-analytic framework for liquid association

Gene expression
Lin Wang (1,†), Silvia Liu (2,3,†), Ying Ding (2,3), Shin-sheng Yuan (4), Yen-Yi Ho (5,*) and George C. Tseng (2,3,*)
(1) School of Statistics, Capital University of Economics and Business, Fengtai, Beijing 100070, China
(2) Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA
(3) Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA
(4) Institute of Statistical Science, Academia Sinica, Nankang, Taipei 115, Taiwan
(5) Department of Statistics, College of Arts and Sciences, University of South Carolina, Columbia, SC 29208, USA
†The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint Authors.
*To whom correspondence should be addressed.
Associate Editor: Bonnie Berger
Received on September 20, 2016; revised on February 11, 2017; editorial decision on March 6, 2017; accepted on March 9, 2017
Abstract
Motivation: Although coexpression analysis via pairwise expression correlation is widely used to elucidate gene-gene interactions at the whole-genome scale, many complicated multi-gene regulations require more advanced detection methods. Liquid association (LA) is a powerful tool to detect the dynamic correlation of two gene variables depending on the expression level of a third variable (the LA scouting gene). LA detection from a single transcriptomic study, however, is often unstable and not generalizable due to cohort bias, biological variation and limited sample size. With the rapid development of microarray and NGS technology, LA analysis combining multiple gene expression studies can provide more accurate and stable results.
Results: In this article, we proposed two meta-analytic approaches for LA analysis (MetaLA and MetaMLA) to combine multiple transcriptomic studies. To offset the heavy computational load, we also proposed a two-step fast screening algorithm for more efficient genome-wide screening: bootstrap filtering and sign filtering. We applied the methods to five Saccharomyces cerevisiae datasets related to environmental changes. The fast screening algorithm reduced running time by 98%. Compared with single-study analysis, MetaLA and MetaMLA provided stronger detection signals and more consistent and stable results. The top triplets are highly enriched in fundamental biological processes related to environmental changes. Our method can help biologists understand underlying regulatory mechanisms under different environmental exposures or disease states.
Availability and Implementation: A MetaLA R package, data and code for this article are available at http://tsenglab.biostat.pitt.edu/software.htm
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Bioinformatics, 33(14), 2017, 2140–2147
doi: 10.1093/bioinformatics/btx138
Advance Access Publication Date: 11 March 2017
Original Paper
1 Introduction
Gene co-expression analysis is widely applied to study pairwise gene synchronization and to elucidate potential gene regulatory mechanisms. For example, an unweighted gene co-expression network can be constructed from a transcriptomic study given a co-expression measure (e.g. Pearson correlation) and an edge cut-off (e.g. two nodes are connected if the absolute correlation is ≥0.6 and disconnected if it is <0.6). In the literature, different measures such as Pearson correlation, Spearman correlation and mutual information (Butte and Kohane, 2000) have been used (see Song et al., 2012 for a comparative study). Alternatively, Zhang et al. (2005) developed a weighted correlation network analysis (WGCNA) framework that uses cluster analysis to construct gene co-expression modules and their associated weighted co-expression networks. Network properties and extended pathway analysis can then be studied to investigate disease-related network alterations and mechanisms.
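The edge cut-off rule described above can be illustrated with a short sketch. It is not taken from the paper's software: the function names `pearson` and `coexpression_edges`, the toy gene names and the toy expression values are all hypothetical, and only the 0.6 cut-off comes from the text.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def coexpression_edges(expr, cutoff=0.6):
    """Return gene pairs whose absolute Pearson correlation is >= cutoff.

    expr maps a gene name to its expression values across samples.
    """
    genes = sorted(expr)
    edges = []
    for i, g1 in enumerate(genes):
        for g2 in genes[i + 1:]:
            if abs(pearson(expr[g1], expr[g2])) >= cutoff:
                edges.append((g1, g2))
    return edges

# Hypothetical toy data: geneB tracks geneA, geneC mirrors it, geneD is unrelated.
toy = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.1, 9.8],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [1.0, -1.0, 0.0, -1.0, 1.0],
}
print(coexpression_edges(toy))
```

In this toy network geneD stays disconnected because its correlation with every other gene is close to zero.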
Although the guilt-by-association heuristic assumed in gene co-expression network analysis is widely used in genomics (Wolfe et al., 2005), many complex regulatory mechanisms cannot be readily captured by direct association because of multi-way interactions. The first column in Figure 1A shows an example of liquid association (LA), first described in Li (2002). Genes YCR005C and YPL262W are essentially uncorrelated overall in study GSE11452 (Spearman correlation = 0.239), but they exhibit a high correlation (cor = 0.692) when a third gene, YGR175C, is expressed at a low level (expression intensity < −0.424) and a much lower correlation (cor = −0.790) when expression of YGR175C is high (> 0.441). This simple interaction among the trio is biologically meaningful, since the third gene YGR175C may serve as a surrogate of a certain (hidden) cellular state or regulator that controls the presence and absence of co-regulation between YCR005C and YPL262W.
To quantify the conditional association in a gene triplet, Li (2012) proposed an LA measure that quantifies the dynamic correlation of two variables depending on a third variable (Ho et al., 2011; Li, 2002; Li et al., 2004). Li (2002) introduced this concept and proposed a computationally efficient three-product-moment measure (see Section 2.2). Zhang et al. (2007) adopted a simplified LA score based on the z-transformed Pearson correlation conditional on the discretized expression of the third gene. Ho et al. (2011) extended the trivariate dependency structure into a parametric Gaussian framework (called modified LA; MLA) to develop improved estimation frameworks and a statistical test for the existence of the LA dependence. The computational complexity of screening all possible triplets is $O(n^3)$, which is generally too high for applying LA methods at a genome-wide scale. Gunderson and Ho (2014) introduced an efficient screening algorithm, fastLA, for the MLA method. It contains two steps: (i) screening the candidate triplets by the difference between the correlations of the LA pair when the scouting gene is high and low; (ii) fitting and estimating the model based on conditional normal distributions. The algorithm greatly improved the computing efficiency of genome-wide LA analysis.
LA estimated from a single study is often unstable and not generalizable due to cohort bias, biological variation and limited sample size. With the rapid accumulation of transcriptomic studies in the public domain, identifying LA triplets by combining multiple studies is likely to produce more stable and biologically reproducible results. For example, Figure 1A shows an LA triplet (genes YCR005C, YPL262W and YGR175C) for which the LA is statistically significant in the first yeast study, GSE11452, but does not hold in the remaining four independent studies. Such an association is likely condition-specific for the first study or a false positive. On the other hand, the LA triplet (YGR264C, YOR197W and YDR519W) in Figure 1B is obtained from the combined meta-analysis of the first three studies. The association is more likely to validate in the fourth and fifth studies. In this article, we develop two meta-analytic frameworks for LA to accurately identify LA triplets that are consistent across multiple studies. The results show that the meta-analytic methods generate more stable LA triplets that are more reproducible in independent studies. The LA triplets also generate better pathway enrichment results, which help to interpret the underlying biology and to generate further hypotheses.
Fig. 1. Scatter plots of gene expression in the high and low bins. (A) The triplet selected from GSE11452 by single-study MLA; (B) the triplet selected from studies GSE11452, Causton and Gasch by MetaMLA (Color version of this figure is available at Bioinformatics online.)
2 Materials and methods
2.1 Datasets and databases
We used five yeast (Saccharomyces cerevisiae) datasets to illustrate our meta-analytic methods: Causton (Causton et al., 2001), Gasch (Gasch et al., 2000), Rosetta (Hughes et al., 2000), GSE60613 (Chasman et al., 2014) and GSE11452 (Knijnenburg et al., 2009). In each study, yeast samples are exposed to a variety of environmental stresses and the transcriptomic expression profiles are measured. Causton et al. includes a yeast gene expression series with yeasts treated with acid, alkali, heat, hydrogen peroxide, salt and sorbitol, and during diauxic shift; Gasch et al. contains yeasts treated with amino acid starvation, diamide, DTT, exposure to peroxide, menadione, nitrogen depletion, osmolarity and temperature shifts; Rosetta corresponds to 300 diverse mutations and chemical treatments; GSE60613 analyzes the stress-activated signaling network; GSE11452 corresponds to chemostat cultures under 55 different conditions. As shown in the data preprocessing step in Figure 2, within each individual study we first deleted genes with more than 10% missing values and samples with more than 30% missing values, imputed the remaining missing values with the K-nearest neighbors algorithm (Altman, 1992), and quantile normalized the samples (Amaratunga and Cabrera, 2001). We further performed unbiased filtering within each study to filter out non-expressed genes (lowest 35% of mean expression) and non-informative genes (lowest 35% of expression variance). Finally, our datasets include 1770 genes shared across the five studies and 45, 173, 300, 67 and 170 samples for studies Causton, Gasch, Rosetta, GSE60613 and GSE11452, respectively.
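The unbiased gene filtering step can be sketched as follows. This is only an illustration under stated assumptions: the text does not say whether the mean and variance filters are applied sequentially or jointly, so the sketch assumes they are applied one after the other, and the function name `filter_genes` and the toy data are hypothetical.

```python
import statistics

def filter_genes(expr, mean_frac=0.35, var_frac=0.35):
    """Drop the lowest-mean genes, then the lowest-variance genes.

    expr maps a gene name to its expression values within one study.
    Assumption: the variance filter is applied to the genes that
    survive the mean filter (sequential filtering).
    """
    by_mean = sorted(expr, key=lambda g: statistics.mean(expr[g]))
    kept = by_mean[int(len(by_mean) * mean_frac):]
    by_var = sorted(kept, key=lambda g: statistics.pvariance(expr[g]))
    kept = by_var[int(len(by_var) * var_frac):]
    return sorted(kept)

# Hypothetical toy study: gene g<i> has mean i; g3 and g4 barely vary.
spread = {3: 0.1, 4: 0.2}
toy_expr = {
    f"g{i}": [i - spread.get(i, 1.0 + i), i + spread.get(i, 1.0 + i)]
    for i in range(10)
}
print(filter_genes(toy_expr))
```

With ten toy genes, the mean filter removes g0 to g2 and the variance filter then removes g3 and g4.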
As an in silico biological evaluation of the LA triplets, we downloaded the yeast protein–protein interaction (PPI) database from the Saccharomyces Genome Database (Cherry et al., 2011). The database included 101 325 unique PPI pairs involving 5706 genes. We applied pathway enrichment analysis using two databases, Gene Ontology (GO) (Cherry et al., 2011) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016), and obtained 1398 GO terms and 95 KEGG pathways with at least five genes. Additionally, in order to test how co-regulated genes are enriched in transcription factor (TF)-binding data, we downloaded TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013) and selected 96 gene sets with 5–200 validated genes for further enrichment analysis. Fisher's exact test (Upton, 1992) was used for pathway enrichment analysis. The P-values were corrected by the Benjamini-Hochberg (BH) algorithm (Benjamini and Hochberg, 1995) and the significance level was set to α = 0.05.
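A minimal sketch of the enrichment test and the BH correction follows. It assumes a one-sided (over-representation) Fisher's exact test, which the text does not specify; the function names and the small numbers in the usage lines are hypothetical.

```python
from math import comb

def fisher_enrichment_p(k, n_selected, n_in_set, n_universe):
    """One-sided Fisher's exact test for over-representation.

    Returns P(X >= k), where X is the number of selected genes that fall
    in the gene set under the hypergeometric null.
    """
    upper = min(n_selected, n_in_set)
    tail = sum(
        comb(n_in_set, x) * comb(n_universe - n_in_set, n_selected - x)
        for x in range(k, upper + 1)
    )
    return tail / comb(n_universe, n_selected)

def benjamini_hochberg(pvals):
    """Return BH-adjusted P-values (q-values) in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical example: 3 of 3 selected genes lie in a 4-gene set, 10 genes total.
p = fisher_enrichment_p(k=3, n_selected=3, n_in_set=4, n_universe=10)
q = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
print(p, q)
```

A gene set would then be called enriched when its adjusted value falls below the chosen threshold (0.05 in this section).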
2.2 LA methods (LA and MLA) for a single study
Li (2002) introduced the concept of 'LA' and defined the LA score for a gene pair $X_1$ and $X_2$ given a scouting gene $X_3$ as $LA(X_1, X_2 \mid X_3) = E\,g'(X_3)$, where $g(x_3) = E(X_1 X_2 \mid X_3 = x_3)$ and $g'(x)$ is the first derivative of $g(x)$. After standardizing the three gene expressions to fit the Gaussian assumption and applying Stein's lemma, Li proposed the computationally efficient estimator $\widehat{LA} = \sum_{l=1}^{n} X_{1l} X_{2l} X_{3l} / n$, where $n$ is the total number of observations (samples) and $X_{1l}$, $X_{2l}$ and $X_{3l}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$, respectively.
Ho et al. (2011) proposed the MLA method, defined by $MLA(X_1, X_2 \mid X_3) = E\,h'(X_3)$, where $h(X_3) = \rho(X_1, X_2 \mid X_3)$, $h'(x)$ is the first derivative of $h(x)$, and $\rho$ is the Pearson correlation coefficient. They proposed a direct estimate of the MLA score, $\widehat{MLA} = \sum_{j=1}^{M} \hat{\rho}_j \bar{X}_{3j} / M$, where $M$ is the number of bins over $X_3$, $\bar{X}_{3j}$ is the sample mean of $X_3$ within bin $j$, and $\hat{\rho}_j$ is the correlation of the LA pair $X_1$ and $X_2$ in bin $j$. A key advantage of the MLA estimator is that it allows testing the hypothesis $H_0: MLA(X_1, X_2 \mid X_3) = 0$ with the Wald test statistic $T_{MLA} = \widehat{MLA} / SE(\widehat{MLA})$ to assess the P-value, where $SE(\widehat{MLA})$ is the standard error of $\widehat{MLA}$.
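The two single-study estimators can be sketched in a few lines. This is an illustration, not the MetaLA package: it uses plain z-scores in place of the normal-score transformation of the scouting gene, and it assumes equal-count bins over $X_3$; the names `la_score` and `mla_score` and the toy data are hypothetical.

```python
import math

def standardize(x):
    """Center and scale to mean 0 and (population) variance 1."""
    n = len(x)
    mean = sum(x) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in x) / n)
    return [(v - mean) / sd for v in x]

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def la_score(x1, x2, x3):
    """Three-product-moment estimator: mean of X1*X2*X3 after standardizing."""
    z1, z2, z3 = standardize(x1), standardize(x2), standardize(x3)
    return sum(a * b * c for a, b, c in zip(z1, z2, z3)) / len(z1)

def mla_score(x1, x2, x3, n_bins=3):
    """Direct MLA estimate: average over bins of (bin correlation * bin mean of X3)."""
    z3 = standardize(x3)
    n = len(z3)
    order = sorted(range(n), key=lambda i: z3[i])  # samples sorted by scouting gene
    total = 0.0
    for j in range(n_bins):
        idx = order[j * n // n_bins:(j + 1) * n // n_bins]  # equal-count bin j
        rho = pearson([x1[i] for i in idx], [x2[i] for i in idx])
        zbar = sum(z3[i] for i in idx) / len(idx)
        total += rho * zbar
    return total / n_bins

# Hypothetical toy triplet: X1 and X2 are anti-correlated when X3 is low
# and perfectly correlated when X3 is high.
x3 = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
x1 = [1.0, 2.0, 3.0, 4.0, 1.0, 2.0, 3.0, 4.0]
x2 = [4.0, 3.0, 2.0, 1.0, 1.0, 2.0, 3.0, 4.0]
print(la_score(x1, x2, x3), mla_score(x1, x2, x3, n_bins=2))
```

Both scores are positive for this toy triplet, reflecting a correlation between $X_1$ and $X_2$ that increases with $X_3$.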
2.3 MetaMLA and MetaLA methods
In this section, we extend the original three-product-moment LA method (Li, 2002) and the model-based MLA method (Gunderson and Ho, 2014; Ho et al., 2011) into a meta-analytic scheme for combining information from multiple transcriptomic studies.
Suppose that we have $K$ studies. For a gene triplet $t: (X_1, X_2, X_3)$ with LA scouting gene $Z = X_i$ ($i = 1, 2, 3$), after standardizing all three genes to have mean 0 and variance 1 and transforming the scouting gene to follow a normal distribution, the direct estimate of the MLA score (Ho et al., 2011) for single study $k$ ($k = 1, \ldots, K$) is defined as $\widehat{MLA}_t^{(k,i)} = \sum_{j=1}^{M} \hat{\rho}_{t,j}^{(k,i)} \bar{z}_{t,j}^{(k,i)} / M$, where $M$ is the number of bins, $\hat{\rho}_{t,j}^{(k,i)}$ is the sample Pearson correlation coefficient of the LA pair in bin $j$ when the scouting gene $Z$ is $X_i$ in triplet $t$, and $\bar{z}_{t,j}^{(k,i)}$ is the mean of $Z$ in bin $j$. The test statistic for single study $k$ is $T_{MLA,t}^{(k,i)} = \widehat{MLA}_t^{(k,i)} / SE(\widehat{MLA}_t^{(k,i)})$, where $SE(\widehat{MLA}_t^{(k,i)})$ is the standard error of $\widehat{MLA}_t^{(k,i)}$ for $k = 1, \ldots, K$ and $i = 1, 2, 3$. The MetaMLA statistic combines the individual study MLA statistics $T_{MLA,t}^{(k,i)}$ and is defined as $mMLA_t^{(i)} = \bar{T}_{MLA,t}^{(i)} / (s_t^{(i)} + s_0)$, where $\bar{T}_{MLA,t}^{(i)}$ and $s_t^{(i)}$ are the sample mean and SD of $\{T_{MLA,t}^{(k,i)}; k = 1, 2, \ldots, K\}$, respectively. $s_t^{(i)}$ provides standardization according to the variance of MLA scores across studies. $s_0$ is a fudge parameter that prevents large mMLA scores caused by very small $s_t^{(i)}$ values, which happen frequently in genome-wide screening. In our yeast datasets, let $N$ be the total number of triplets for hypothesis testing. We choose $s_0$ to be $10 \times \mathrm{med}\{s_t^{(i)}; i = 1, 2, 3 \text{ and } t = 1, \ldots, N\}$ (where $\mathrm{med}(\cdot)$ denotes the median) to guarantee the stability of the test statistics, especially when the sample size is small. Dividing by the standard error in the $T_{MLA,t}^{(k,i)}$ score accounts for both sample size and sample heterogeneity effects in single studies. For a study with a large sample size, the standard error of the MLA score is usually smaller, which yields a larger $T_{MLA,t}^{(k,i)}$ score. For a study containing large biological variation or considerable outliers, the standard error of the MLA score is large, which results in a smaller $T_{MLA,t}^{(k,i)}$ score.
Fig. 2. A process map of the genome-wide application of the MetaMLA algorithm
The MetaLA statistic is defined analogously to the MetaMLA statistic. The estimate of the LA score (Li, 2002) for single study $k$ ($k = 1, \ldots, K$) is defined as $\widehat{LA}_t^{(k)} = \sum_{l=1}^{n_k} X_{1l}^{(k)} X_{2l}^{(k)} X_{3l}^{(k)} / n_k$, where $n_k$ is the total number of observations (samples) and $X_{1l}^{(k)}$, $X_{2l}^{(k)}$ and $X_{3l}^{(k)}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$ in study $k$, respectively. The MetaLA statistic combines the individual study LA scores $\widehat{LA}_t^{(k)}$ and is defined as $mLA_t = \overline{\widehat{LA}}_t / (s_t + s_0)$, where $\overline{\widehat{LA}}_t$ and $s_t$ are the sample mean and SD of $\{\widehat{LA}_t^{(k)}; k = 1, 2, \ldots, K\}$, respectively. $s_t$ provides standardization according to the variance of LA scores across studies. $s_0$ is a fudge parameter that prevents large mLA scores caused by very small $s_t$ values.
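Both meta statistics share the same form: a mean across studies divided by the across-study SD plus a fudge parameter. The sketch below illustrates that form under the definitions in this section; the function names and the example numbers are hypothetical.

```python
import statistics

def meta_statistic(per_study_values, s0):
    """Mean across studies divided by (sample SD across studies + s0).

    per_study_values holds T_MLA^(k,i) for MetaMLA or LA^(k) for MetaLA,
    one value per study k = 1..K (K >= 2).
    """
    mean = statistics.mean(per_study_values)
    sd = statistics.stdev(per_study_values)  # sample SD across the K studies
    return mean / (sd + s0)

def fudge_parameter(all_sds, multiplier=10.0):
    """s0 chosen as a multiple of the median across-study SD over all triplets."""
    return multiplier * statistics.median(all_sds)

# Hypothetical across-study SDs for three triplets, then one triplet's statistic.
s0 = fudge_parameter([0.02, 0.05, 0.08])
print(s0, meta_statistic([2.0, 2.5, 3.0], s0))
```

Without $s_0$, a triplet whose per-study values happen to be nearly identical would receive an arbitrarily large score; with $s_0$, the denominator stays bounded away from zero.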
2.4 Hypothesis testing and inference for MetaMLA and MetaLA
Based on MetaMLA, the hypothesis for LA in the gene triplet $t: (X_1, X_2, X_3)$ is
$H_0: mMLA_t^{(i)} = 0, \ \forall i \in \{1, 2, 3\} \ \leftrightarrow \ H_1: \exists\, i \in \{1, 2, 3\}, \ \text{s.t.} \ mMLA_t^{(i)} \neq 0,$
where $i = 1, 2, 3$ corresponds to the LA scouting gene $Z = X_i$. The null hypothesis states that all LAs are zero no matter which of $X_1$, $X_2$ and $X_3$ acts as the scouting gene $Z$. The test statistic is defined as
$T_t = \max_{i = 1, 2, 3} |mMLA_t^{(i)}|.$
The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples of the LA scouting gene $Z$ when calculating each $mMLA_t^{(i)}$ in the $T_t$ statistic. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$; $1 \le t \le N$) as the null distribution. The P-value is given by $P = \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}) / (B \times N)$, where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the false discovery rate (FDR) is set to α = 0.01. Since the number of possible triplets $N$ is usually very large, a small $B$ suffices, and $B = 40$ is used in this article. We note that, in theory, we should perform permutation for each triplet to form its own null distribution. That computation is, however, not feasible (its cost is the number of permutations × the number of triplets). In our approach, we assume a common null distribution across all triplets to allow affordable computation.
Based on MetaLA, the hypothesis for LA in the gene triplet $t: (X_1, X_2, X_3)$ is $H_0: mLA_t = 0 \leftrightarrow H_1: mLA_t \neq 0$. The test statistic is defined as $T_t = |mLA_t|$. The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples within gene $X_1$, $X_2$ or $X_3$ in turn. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$; $1 \le t \le N$) as the null distribution. The P-value is given by $P = \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}) / (B \times N)$, where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the FDR is set to α = 0.01. As for MetaMLA, $B = 40$ is used.
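The pooled-null P-value formula above amounts to counting, for each observed statistic, the fraction of all $B \times N$ permuted statistics that are at least as large. A minimal sketch, assuming the permuted statistics have already been computed and pooled into one list (the function name and toy numbers are hypothetical):

```python
from bisect import bisect_left

def pooled_permutation_pvalues(observed, null_pool):
    """P-value for each observed T: fraction of pooled null values >= T.

    null_pool contains all B*N permuted statistics, shared by every triplet.
    """
    sorted_null = sorted(null_pool)
    total = len(sorted_null)
    # bisect_left finds the first index whose value is >= t,
    # so total - index counts the null values that are >= t.
    return [(total - bisect_left(sorted_null, t)) / total for t in observed]

# Hypothetical pooled null and three observed statistics.
null_pool = [0.1, 0.2, 0.3, 0.4, 0.5]
print(pooled_permutation_pvalues([0.35, 0.3, 1.0], null_pool))
```

Sorting the pooled null once and using binary search keeps the cost at roughly $O((BN + N) \log(BN))$ instead of comparing every observed value against every null value.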
2.5 Filtering to reduce computation of MetaMLA
Genome-wide calculation of the LA is usually time-consuming and resource-intensive for a single study (Ho et al., 2011; Li, 2002). This problem is further aggravated when combining multiple studies. In this section, we develop a screening algorithm to perform genome-wide MetaMLA analysis more efficiently. As illustrated in Figure 2, our algorithm reduces the number of triplets that need to be examined in depth through two screening steps: bootstrap filtering and sign filtering.
In the first step, bootstrap filtering, we filter out triplets with a small correlation difference between the high and low bins. Define $\rho_{diff}$ to be the difference between the LA pair correlations when the scouting gene is in the highest and lowest bins. In the literature, the fastLA algorithm for a single study (Gunderson and Ho, 2014) has used a similar screening procedure for fast computing. In meta-analysis, we aim to detect triplets with consistently large or consistently small LAs across multiple studies. For the triplet $t: (X_1, X_2, X_3)$, given the scouting gene $Z = X_i$ ($i = 1, 2, 3$), we define $\rho_{diff,t}^{(k,i)} = \rho_{high,t}^{(k,i)} - \rho_{low,t}^{(k,i)}$, where $\rho_{high,t}^{(k,i)}$ and $\rho_{low,t}^{(k,i)}$ are the Pearson correlations when gene $Z$ is in the high and low bins of study $k$, respectively. We use the score $\sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| / K$ as the meta-filtering criterion. Since the scouting gene $Z$ could be $X_1$, $X_2$ or $X_3$, we use $\max_{i=1,2,3} \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| / K$ to order the triplets and filter out those that are unlikely to have LA. To reduce the effect of outliers when calculating correlations in the bins, we propose to bootstrap (Efron and Tibshirani, 1986) the samples in each study $B$ times and obtain $\rho_{diff,t}^{(meta,b)} = \max_{i=1,2,3} \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i,b)}| / K$, where $b = 1, 2, \ldots, B$. Finally, we use
$\rho_{diff,t}^{(meta)} = \mathrm{med}(\rho_{diff,t}^{(meta,b)}; b = 1, \ldots, B)$
to screen the triplets, where $\mathrm{med}(\cdot)$ denotes the median. We set $\rho_{diff,t}^{(meta)} > 0.4$ as the cutoff for keeping triplets for further testing. $\rho_{diff,t}^{(meta)}$ largely reduces computational complexity for two reasons: (i) calculating $\rho_{diff,t}^{(meta)}$ is computationally much simpler than calculating the MetaMLA statistic; (ii) $\rho_{diff,t}^{(meta)}$ filters out a large percentage of triplets and thereby further reduces the computational cost of P-value calculation in the permutation step.
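The bootstrap filtering score can be sketched as below. This is a hedged illustration rather than the authors' implementation: it assumes equal-count bins, resamples whole samples with replacement within each study, and treats a correlation with a constant vector as 0; the function names, the number of bootstrap replicates and the toy studies are hypothetical.

```python
import math
import random
import statistics

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / math.sqrt(sxx * syy)

def rho_diff(x1, x2, z, n_bins=3):
    """Correlation of (x1, x2) in the highest bin of z minus that in the lowest bin."""
    order = sorted(range(len(z)), key=lambda i: z[i])
    size = len(z) // n_bins
    low, high = order[:size], order[-size:]

    def corr(idx):
        return pearson([x1[i] for i in idx], [x2[i] for i in idx])

    return corr(high) - corr(low)

def meta_rho_diff(studies, n_boot=20, n_bins=3, seed=0):
    """Median over bootstraps of max_i sum_k |rho_diff^(k,i)| / K.

    studies is a list with one (g1, g2, g3) tuple of equal-length lists per study.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        per_scout = [0.0, 0.0, 0.0]
        for genes in studies:
            n = len(genes[0])
            idx = [rng.randrange(n) for _ in range(n)]  # resample samples
            boot = [[g[i] for i in idx] for g in genes]
            for i in range(3):  # each gene takes a turn as scouting gene
                pair = [boot[j] for j in range(3) if j != i]
                per_scout[i] += abs(rho_diff(pair[0], pair[1], boot[i], n_bins))
        scores.append(max(per_scout) / len(studies))
    return statistics.median(scores)

def toy_study(n):
    """Hypothetical study: x2 equals x1 when z > 0 and -x1 otherwise."""
    z = [(i - (n - 1) / 2) / n for i in range(n)]
    x1 = [((7 * i) % 11) - 5 for i in range(n)]
    x2 = [a if zi > 0 else -a for a, zi in zip(x1, z)]
    return (x1, x2, z)

score = meta_rho_diff([toy_study(30), toy_study(36)])
print(score, score > 0.4)  # a triplet is kept when the score exceeds 0.4
```

Taking the median over bootstrap replicates means that a handful of resamples dominated by outlying samples cannot push a triplet above or below the cutoff on their own.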
In the second step, sign filtering, we filter out triplets for which the signs of the meta and single-study MLA test statistics are inconsistent. The scouting gene is chosen to maximize the MetaMLA test statistic. In other words, we keep the triplets satisfying $\prod_{k=1}^{K} I(\mathrm{sign}(mMLA_t^{(i_0)}) \times \mathrm{sign}(T_{MLA,t}^{(k,i_0)}) = 1) = 1$, where $I(\cdot)$ is the indicator function and $i_0 = \arg\max_{i=1,2,3} |mMLA_t^{(i)}|$. For fair comparison, we use the same triplets filtered by MetaMLA to perform MetaLA and single-study MLA.
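The sign filtering rule can be written compactly. The sketch below assumes that the three MetaMLA statistics and the per-study test statistics of a triplet are already available; the function names and the example values are hypothetical.

```python
def sign(value):
    """Return 1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

def passes_sign_filter(meta_by_scout, study_stats_by_scout):
    """Keep a triplet only if every study agrees in sign with the meta statistic.

    meta_by_scout: [mMLA^(1), mMLA^(2), mMLA^(3)] for one triplet.
    study_stats_by_scout[i]: the K single-study statistics T^(k,i).
    """
    # i0 is the scouting gene that maximizes |mMLA^(i)|.
    i0 = max(range(3), key=lambda i: abs(meta_by_scout[i]))
    meta_sign = sign(meta_by_scout[i0])
    return all(sign(t) * meta_sign == 1 for t in study_stats_by_scout[i0])

# Hypothetical triplet: scouting gene 2 has the largest |mMLA| and all
# single-study statistics for that choice are negative, so the triplet is kept.
meta = [0.5, -3.0, 1.0]
stats = [[1.0, 1.2], [-2.0, -1.0], [1.0, -1.0]]
print(passes_sign_filter(meta, stats))
```

Only the statistics for the selected scouting gene $i_0$ are inspected, so sign disagreement under the other two choices does not remove a triplet.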
3 Results
3.1 Computational reduction by filtering
Below we describe the results of the screening, which avoids the high computational load of evaluating all possible triplets in MetaMLA. After unbiased filtering of non-expressed and non-informative genes, we kept 1770 genes, which led to a total of $\binom{1770}{3} \approx 9.23 \times 10^8$ triplets. The computing time is demanding if we perform hypothesis testing for all possible triplets. Applying bootstrap filtering with $\rho_{diff,t}^{(meta)} > 0.4$ and three bins reduced the number of triplets to $2.18 \times 10^7$, 2.36% of the original total. Furthermore, the sign filtering step decreased the number of remaining triplets to $1.21 \times 10^7$, which was only 1.32% of the total number.
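The triplet counts above can be checked with a few lines of arithmetic. The sketch uses only the rounded counts quoted in the text, so the percentages it prints agree with the reported ones only up to rounding.

```python
from math import comb

total_triplets = comb(1770, 3)   # all unordered gene triplets
after_bootstrap = 2.18e7         # rounded count quoted in the text
after_sign = 1.21e7              # rounded count quoted in the text

print(total_triplets)                                    # 922639640
print(round(100 * after_bootstrap / total_triplets, 2))  # 2.36
print(100 * after_sign / total_triplets)                 # roughly 1.3
```

The last figure comes out slightly below the reported 1.32% because the input count is itself rounded to three significant digits.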
Given that our screening pipeline dramatically reduces the number of triplets, we assessed whether the filtering procedures discard statistically significant LA triplets. We performed MetaMLA both on all $9.23 \times 10^8$ triplets and on the 1.32% of triplets retained after filtering. As shown in Supplementary Table S1, our screening steps missed only 89, 219, 375, 520 and 690 of the top 2000, 4000, 6000, 8000 and 10 000 triplets obtained from the full analysis without filtering. P-values from Fisher's exact test are almost 0 and the odds ratios are between 1000 and 1600 (Supplementary Table S1). In summary, we missed only about 5% of the significant triplets but saved almost 99% of the computing time, which makes genome-wide LA triplet screening possible. Since the filtering step also consumes computing time, we compared the computing time of analyses with and without filtering on a small dataset of 95 genes (obtained with stringent selection criteria that remove genes with small means and small variances). Using five computing threads (Intel Xeon E7-2850), the analysis with filtering saved about 88% of the computing time compared with the analysis without filtering (16.3 versus 134.6 min).
In general, filtering out potentially non-significant triplets increases statistical power (Bourgon et al., 2010; van Iterson et al., 2010). In other words, we can detect more significant triplets under the same FDR control. To demonstrate the empirical effect of filtering in real data, we randomly selected 500 genes from the five yeast studies and re-ran our MetaMLA algorithm with both the filtering and the non-filtering pipelines. Supplementary Figure S1 shows that, for a given reasonable FDR (e.g. 0.005 and 0.01), the filtering pipeline detects more significant triplets than the full analysis, as expected.
3.2 MetaMLA detects more over-represented pathways
We performed pathway enrichment analysis using GO and KEGG for all the genes from the top $m$ significant triplets ($m = 200, 300, \ldots, 1000$) selected by single-study MLA, MetaMLA and MetaLA. Figure 3 shows the numbers of enriched GO terms and KEGG pathways for different numbers of top triplets under an FDR = 0.05 threshold. MetaMLA (solid square line) consistently performed better than any single-study MLA (five dashed lines) and MetaLA (solid rhombus line) by detecting more enriched pathways. Jitter plots of the q-values of the GO terms and KEGG pathways for the top 500 triplets, on a −log10 scale, are shown in Supplementary Figure S2. Since the single-study MLA and MetaMLA methods can identify the LA scouting gene $Z$, a similar pathway enrichment analysis was done using only the $Z$ genes from the top triplets (Supplementary Figs S3 and S4).
3.3 MetaMLA provides more consistent biomarker and pathway results with single study analyses
Figure 1A shows an example with LA in the first study (the correlation dropped from 0.692 in the low expression group to −0.79 in the high expression group of the LA scouting gene YGR175C) that fails to reproduce in the remaining four studies. Such an LA with failed reproducibility is likely a false positive. Figure 1B demonstrates another example, with consistent LA in all five studies (the correlation differs significantly between the high and low expression groups of YDR519W). To inspect the agreement of top LA triplets between pairs of studies, Supplementary Figures S5 and S6 show scatter plots of test statistics and rank correlations of the pairwise top 1000 triplets. The MetaMLA method combines information from all single studies. Conceptually, MetaMLA results should therefore agree better with each single-study MLA result than single-study MLA results agree with each other. In Figure 4A, we examined the pairwise overlap of the top 1000 triplets detected by the five single-study MLA analyses and by MetaMLA. The result shows zero overlap between the top triplets of any two single-study MLA analyses. (We also tried other numbers of top triplets in Supplementary Fig. S7, and they all show small overlap among single studies.) On the other hand, top triplets from MetaMLA have a much higher percentage of overlap with the results from each single-study MLA.
We next calculated the number of overlapping enriched GO terms and KEGG pathways when all the genes from the top 500 triplets of each MLA analysis were used for pathway enrichment. The results are shown in Figure 4B and C. Numbers in the diagonal cells give the number of enriched GO terms or KEGG pathways from each single-study MLA and from MetaMLA. (Similarly, overlapping pathways using only the $Z$ genes from the top 1000 triplets are shown in Supplementary Fig. S8.) As with the overlapping triplets in Figure 4A, we observed many more overlapping pathways between the MetaMLA result and each single-study MLA result than between pairs of single-study MLA results. For example, study Causton detected 27 enriched GO terms, among which 8, 9, 6 and 9 overlapped with the results from the other four single-study MLA analyses. Notably, it has 12 and 15 GO terms that overlap with MetaLA and MetaMLA, respectively. Comparing the two meta-analytic methods, MetaMLA performed much better than MetaLA.
Fig. 3. The number of enriched gene sets for all the genes from different numbers of top triplets detected by meta and single analysis. (A) GO terms; (B) KEGG pathways (Color version of this figure is available at Bioinformatics online.)
Fig. 4. Overlap of meta and single analysis. (A) The number of overlapping triplets among the top 1000 significant triplets; (B) the number of overlapping enriched GO terms using all the genes from the top 500 triplets for gene set enrichment analysis; (C) the number of overlapping enriched KEGG pathways using all the genes from the top 500 triplets for gene set enrichment analysis (Color version of this figure is available at Bioinformatics online.)
3.4 MetaLA and MetaMLA provide more stable results
Below we apply subsampling and bootstrap techniques to compare the stability of LA triplets detected by single-study MLA, MetaLA and MetaMLA. Figure 5A and B show the number of overlapping triplets between the top triplets detected from the original full dataset and from subsampled datasets (90% and 80%, respectively). The x-axis shows the number of top triplets and the y-axis shows the number of overlapping triplets. The results show much better reproducibility of top triplets detected from subsampled data for MetaMLA (solid square line) and MetaLA (solid rhombus line) than for single-study MLA (five dashed lines). The comparison with bootstrapped data in Figure 5C shows a similar trend. In summary, MetaMLA provides better stability in detecting top LA triplets than single-study MLA. MetaLA further outperforms MetaMLA.
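The overlap measure plotted in Figure 5 can be sketched as the size of the intersection of two top-$m$ lists. The sketch assumes that triplets are ranked by the absolute value of their test statistic; the function name and the toy scores are hypothetical.

```python
def top_overlap(scores_full, scores_sub, m):
    """Number of triplets shared by the top-m lists of two analyses.

    Each argument maps a triplet identifier to its test statistic;
    triplets are ranked by absolute value (an assumption of this sketch).
    """
    def top(scores):
        ranked = sorted(scores, key=lambda t: abs(scores[t]), reverse=True)
        return set(ranked[:m])

    return len(top(scores_full) & top(scores_sub))

# Hypothetical statistics from the full data and from one subsample.
full = {"t1": 5.0, "t2": -4.0, "t3": 3.0, "t4": 0.5}
sub = {"t1": 4.5, "t2": -0.2, "t3": 3.5, "t4": 1.0}
print(top_overlap(full, sub, m=2))
```

Repeating this over several subsamples and averaging, as described in the Figure 5 caption, gives one point per value of $m$ on the stability curve.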
3.5 Pathway enrichment analysis and network visualization
Across Sections 3.2–3.4, although MetaLA provides more stable results than MetaMLA (Section 3.4), it detects far fewer enriched pathways (Section 3.2) and generates biomarkers and pathways that are less consistent with single studies (Section 3.3). We therefore focus on MetaMLA for further biological investigation in this subsection.
To test how consistent the LA genes detected by the MetaMLA method are with TF binding, we downloaded the TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013) and selected 96 gene sets with 5–200 genes. Among these 96 TF genes, Hog1 (YLR113W) has the highest frequency among all the genes from the top 20 000 triplets detected by the MetaMLA method. Genes inside the same triplet as Hog1 are enriched in the Hog1 binding gene set (P = 0.027). More significantly, Hog1 is also the most frequent gene among the LA scouting genes $Z$ in the top 100 000 triplets. Genes regulated by Hog1 (inside the same triplets) are more significantly enriched in the Hog1 binding gene set (P = 1.44E−5). Supplementary Table S2 shows the top enriched TF binding gene sets. Among them, Hot1 is another enriched gene set (P = 7.67E−6), and Alepuz et al. (2003) show that Hot1 targets Hog1p to osmostress-responsive promoters and that Hog1 mediates recruitment/activation of RNAPII at Hot1p-dependent promoters. The analysis shows that the top triplets selected by the MetaMLA method are highly consistent with known TF regulation patterns.
Table 1 shows 18 significantly enriched KEGG pathways with their hierarchical structure, using all the genes from the top 500 triplets selected by MetaMLA. Pathway enrichment using the GO database identified 68 GO terms (Supplementary Table S3). Since the five transcriptomic studies contain yeast samples treated with different environmental conditions and mutations, we observed many pathways related to energy metabolism (q = 5.67E−12), carbohydrate metabolism (q = 1.40E−8), amino acid metabolism (q = 5.87E−8) and translation (q = 0.0065).
To further investigate the gene interactions identified by LA, we considered the top 20 000 LA triplets (q < 6.64E−5) and included a total of 41 triplets with all three genes in the metabolism category (q = 1.85E−21 in Table 1) for network visualization (Fig. 6). Genes within one triplet are connected by edges of the same color. Dashed lines represent interactions or regulations reported in the PPI database. In this network, a total of four interactions are validated by the PPI database, more than expected from a randomly generated PPI database (0.69 random interactions on average, with P-value 0.00197 by Fisher's exact test).
Fig. 5. The number of overlapping top significant triplets between the original dataset and the subsampled or bootstrap datasets. (A) and (B) show the means and standard errors over ten subsamplings at proportions of 0.90 and 0.80, respectively; (C) shows the means and standard errors over ten bootstrap replicates (Color version of this figure is available at Bioinformatics online.)
In Figure 6, we observed a cluster of gene modules related to carbohydrate metabolism (purple background circle in Fig. 6; almost all genes annotated with gray dots). IDH2 and BDH2 are two notable hub genes that have many LA associations with other neighboring genes. IDH2 is a subunit of mitochondrial NAD(+)-dependent isocitrate dehydrogenase, a key complex in the tricarboxylic acid (TCA) cycle that catalyzes the oxidation of isocitrate to alpha-ketoglutarate (Reinders et al., 2007). BDH2 is a putative medium-chain alcohol dehydrogenase (Dickinson et al., 2003). In carbohydrate metabolism, pyruvate is the main input for the series of chemical reactions of the aerobic TCA cycle. In the subnetwork, CDC19 is a key pyruvate kinase, which converts phosphoenolpyruvate to pyruvate (Byrne and Wolfe, 2005; Xu et al., 2012), and its physical PPI with IDH2 has been previously reported in Gavin et al. (2006). In addition, PDC5 is a minor isoform of pyruvate decarboxylase and PDC1 is the major of three pyruvate decarboxylase isozymes that decarboxylate pyruvate to acetaldehyde (Dickinson et al., 2003). ENO2 is also a phosphopyruvate hydratase involved in pyruvate metabolism that catalyzes the conversion of 2-phosphoglycerate to phosphoenolpyruvate during glycolysis (Byrne and Wolfe, 2005; McAlister and Holland, 1982). All these genes from the top MetaMLA triplets are potentially co-regulated, with functional annotation from the carbohydrate metabolism pathway. However, if we examine the direct gene–gene Pearson correlations, the pair-wise correlations are low and co-expression analysis would fail to identify associations among these genes (see Supplementary Table S4).
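The contrast described above, where pair-wise correlation is low although the genes are related through a third gene, can be reproduced on simulated data. The sketch below is a toy model under stated assumptions, not the yeast data: `simulate`, its noise level and the split of Z at zero are all hypothetical choices made for illustration.

```python
import random
from math import sqrt


def pearson(x, y):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def simulate(n=5000, seed=0):
    """Simulate a toy triplet where Z switches the sign of the X-Y relation."""
    rng = random.Random(seed)
    z = [rng.gauss(0, 1) for _ in range(n)]
    x = [rng.gauss(0, 1) for _ in range(n)]
    # Y follows X when Z is high and opposes X when Z is low (assumed model).
    y = [zi * xi + rng.gauss(0, 0.5) for zi, xi in zip(z, x)]
    return x, y, z


x, y, z = simulate()
high = [i for i, zi in enumerate(z) if zi > 0]
low = [i for i, zi in enumerate(z) if zi <= 0]

r_all = pearson(x, y)
r_high = pearson([x[i] for i in high], [y[i] for i in high])
r_low = pearson([x[i] for i in low], [y[i] for i in low])
print(f"overall r = {r_all:.2f}, r | Z high = {r_high:.2f}, r | Z low = {r_low:.2f}")
```

Under this model the overall correlation is close to zero while the correlations within the high-Z and low-Z groups are strong and of opposite sign, which is the pattern a pair-wise co-expression screen would miss.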
Table 1. Enriched KEGG pathways and their hierarchical categories for all the genes from the top 500 triplets selected by the MetaMLA method

| Entry and category | P-value | q-value | Odds ratio | Count | Size | Name |
|---|---|---|---|---|---|---|
| Metabolism | 1.53E-23 | 1.85E-21 | 2.69 | 200 | 835 | |
| Energy metabolism | 9.37E-14 | 5.67E-12 | 5.36 | 43 | 122 | |
| sce00190 | 2.03E-12 | 8.21E-11 | 7.91 | 29 | 72 | Oxidative phosphorylation |
| sce00680 | 0.007957 | 0.041109 | 3.49 | 8 | 28 | Methane metabolism |
| Carbohydrate metabolism | 4.64E-10 | 1.40E-08 | 2.86 | 62 | 229 | |
| sce00620 | 1.91E-06 | 3.30E-05 | 5.53 | 17 | 39 | Pyruvate metabolism |
| sce00630 | 3.15E-05 | 0.000423 | 6.14 | 12 | 26 | Glyoxylate and dicarboxylate metabolism |
| sce00020 | 6.30E-05 | 0.000763 | 4.99 | 13 | 32 | Citrate cycle (TCA cycle) |
| sce00010 | 0.000527 | 0.004249 | 2.99 | 17 | 58 | Glycolysis/Gluconeogenesis |
| sce00051 | 0.005765 | 0.034881 | 3.76 | 8 | 25 | Fructose and mannose metabolism |
| sce00030 | 0.007158 | 0.039369 | 3.23 | 9 | 28 | Pentose phosphate pathway |
| Amino acid metabolism | 2.42E-09 | 5.87E-08 | 3.06 | 51 | 178 | |
| sce00260 | 6.40E-08 | 1.29E-06 | 8.99 | 16 | 32 | Glycine, serine and threonine metabolism |
| sce00270 | 0.000146 | 0.001468 | 4.44 | 13 | 36 | Cysteine and methionine metabolism |
| sce00250 | 0.002637 | 0.016791 | 3.60 | 10 | 30 | Alanine, aspartate and glutamate metabolism |
| Lipid metabolism | 0.000913 | 0.006501 | 2.11 | 29 | 126 | |
| sce00100 | 0.000183 | 0.001705 | 6.89 | 9 | 17 | Steroid biosynthesis |
| sce01040 | 0.000454 | 0.003927 | 12.21 | 6 | 11 | Biosynthesis of unsaturated fatty acids |
| sce00062 | 0.002175 | 0.014623 | 10.16 | 5 | 8 | Fatty acid elongation |
| Metabolism of cofactors and vitamins | 0.011664 | 0.052272 | 1.80 | 24 | 117 | |
| sce00670 | 0.008629 | 0.041763 | 4.57 | 6 | 15 | One carbon pool by folate |
| Genetic information processing | 0.214481 | 0.447452 | 1.10 | 114 | 1123 | |
| Translation | 0.000861 | 0.006501 | 1.59 | 70 | 682 | |
| sce03010 | 9.42E-05 | 0.001036 | 2.22 | 37 | 181 | Ribosome |
| Folding, sorting and degradation | 0.105526 | 0.283748 | 1.27 | 41 | 263 | |
| sce03050 | 0.008154 | 0.041109 | 2.91 | 10 | 35 | Proteasome |
| Cellular processes | 0.718769 | 0.995721 | 0.92 | 45 | 382 | |
| Transport and catabolism | 0.010656 | 0.049591 | 1.61 | 36 | 194 | |
| sce04145 | 2.96E-05 | 0.000423 | 5.07 | 14 | 36 | Phagosome |

Fig. 6. Gene network associated with metabolism. Genes within one triplet are connected by edges of the same color. A dashed line means that the edge is in the PPI database. A small circle connected to a gene means that the gene is in the corresponding subcategory (Color version of this figure is available at Bioinformatics online.)
4 Conclusion and discussion
In this article, we proposed two meta-analytic methods (MetaLA and MetaMLA) for LA analysis combining multiple studies. We used the mean of the singleMLA test statistics as the main part of the MetaMLA statistic and the SD to penalize inconsistent patterns among different studies. For genome-wide application, we proposed to screen genes by bootstrap filtering and sign filtering (Fig. 2) to reduce the computational load. In the yeast datasets, we reduced >98% of the triplets for hypothesis testing and captured 94–95% of the top triplets with large MetaMLA statistics. Compared with the singleMLA method, MetaMLA provides a stronger pathway enrichment signal, results more consistent with single-study analysis, and more stable results under data subsampling or bootstrapping. Although MetaLA generated more stable results than MetaMLA, it detected fewer enriched pathways and is less consistent with single-study analysis. Among the top significant triplets selected by MetaMLA, we constructed a gene regulatory network visualization to investigate the complex three-way conditional associations. The result identifies a subnetwork in the carbohydrate metabolism network, which otherwise cannot be identified by traditional pair-wise co-expression analysis. We found validation in the PPI database and focused functional annotation in the TCA cycle.
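The idea of combining a mean across studies with an SD penalty, together with a sign-based screen, can be sketched as follows. Both functions are hypothetical illustrations: the ratio form `mean / (SD + s0)`, the constant `s0`, and the rule that all studies must agree in sign are assumptions made here for the sake of example, and the exact definitions of the MetaMLA statistic and of sign filtering should be taken from the Methods section rather than from this sketch.

```python
from statistics import mean, stdev


def same_sign(study_stats):
    """Assumed sign filter: keep a triplet only if every study agrees in sign."""
    return all(s > 0 for s in study_stats) or all(s < 0 for s in study_stats)


def meta_score(study_stats, s0=1.0):
    """Hypothetical meta statistic: mean across studies, shrunk by their SD.

    s0 is an illustrative positive constant that keeps the ratio finite when
    the studies agree exactly; it is not a value taken from the paper.
    """
    return mean(study_stats) / (stdev(study_stats) + s0)


# Toy per-study statistics (made-up numbers) for three triplets in three studies.
triplets = {
    "consistent": [2.0, 2.0, 2.0],
    "noisy": [1.0, 3.0, 2.0],
    "conflicting": [2.5, -2.0, 2.4],
}
for name, stats in triplets.items():
    if same_sign(stats):
        print(name, round(meta_score(stats), 3))
    else:
        print(name, "filtered out by sign")
```

With the same mean, the triplet whose per-study statistics vary more receives the smaller score, which is the penalizing behaviour the paragraph above describes.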
The LA and MLA methods for detecting LA triplets each have their pros and cons. On the one hand, the LA score, obtained by three-product estimation on normalized gene intensities, is much easier to compute than the model-free estimation of the MLA score. On the other hand, MLA is more accurate when interdependency among the triplet (i.e. the conditional mean and variance of two genes depend on the third gene) exists, and such interdependency is theoretically ignored by the LA method. MLA also provides systematic inference to assess P-values and FDR control. To circumvent the computational burden of MetaMLA, our proposed two-stage filtering can significantly reduce computing time. In this article, we demonstrated genome-wide screening on all possible gene triplets. To further reduce the computational load, one may restrict the analysis to scouting genes pre-selected from prior biological knowledge, TF or PPI databases.
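To make the three-product estimation mentioned above concrete, the following sketch transforms each gene profile to normal scores and averages the product of the three transformed values across samples, in the spirit of Li (2002). It is a minimal illustration rather than the MetaLA package code: the helper names `normal_scores` and `la_score`, the `rank / (n + 1)` quantile convention and the tie handling are assumptions, and the toy profiles are made-up numbers.

```python
from statistics import NormalDist


def normal_scores(values):
    """Rank-based normal-score transform, rescaled to mean 0 and variance 1.

    Ties are broken by position, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    inv = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = inv(rank / (n + 1))
    m = sum(scores) / n
    sd = (sum((s - m) ** 2 for s in scores) / n) ** 0.5
    return [(s - m) / sd for s in scores]


def la_score(x, y, z):
    """Three-product estimate: mean of X*Y*Z over samples after transformation."""
    xs, ys, zs = normal_scores(x), normal_scores(y), normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)


# Toy profiles: Y tracks X when Z is high and opposes it when Z is low.
z = [-3.0, -2.0, -1.0, 1.0, 2.2, 3.0]
x = [-2.5, 1.5, -0.5, 2.5, -1.5, 0.5]
y = [zi * xi for zi, xi in zip(z, x)]
print(round(la_score(x, y, z), 3))
```

Each score costs only one pass over the samples, which is why this estimator is cheap compared with a model fit per triplet; the price is that it does not account for the dependence of the conditional means and variances on the third gene.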
Our meta-analytic framework has the advantage of stably combining multiple studies from different microarray or next-generation sequencing platforms. Potential heterogeneity from platform, batch effect or measurement scaling issues is automatically standardized in the meta-analysis. It has been well acknowledged in the literature that simple correlation and co-expression analysis are not sufficient to describe the complex system of gene regulation. Applying advanced association models elucidates novel regulatory mechanisms, and meta-analysis combining multiple transcriptomic studies will greatly reduce false positive findings. Our proposed meta-analytic LA methods help accurately detect complicated three-way interactions and regulatory mechanisms.
Funding
This work was supported by the National Institutes of Health (NIH) (R01CA190766 to S.L. and G.C.T.); China Scholarship Council (201508110051 to L.W.); National Natural Science Foundation of China (11526146 to L.W.); Scientific Research Level Improvement Quota Project of Capital University of Economics and Business (to L.W.); and University of Minnesota Grant-In-Aid (to Y.Y.H.).
Conflict of Interest: none declared.
References
Alepuz,P.M. et al. (2003) Osmostress-induced transcription by Hot1 depends on a Hog1-mediated recruitment of the RNA Pol II. EMBO J., 22, 2433–2442.
Altman,N.S. (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat., 46, 175–185.
Amaratunga,D. and Cabrera,J. (2001) Analysis of data from viral DNA microchips. J. Am. Stat. Assoc., 96, 1161–1170.
Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B, 57, 289–300.
Bourgon,R. et al. (2010) Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA, 107, 9546–9551.
Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Pac. Symp. Biocomput., Vol. 5, pp. 418–429.
Byrne,K.P. and Wolfe,K.H. (2005) The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res., 15, 1456–1461.
Causton,H.C. et al. (2001) Remodeling of yeast genome expression in response to environmental changes. Mol. Biol. Cell, 12, 323–337.
Chasman,D. et al. (2014) Pathway connectivity and signaling coordination in the yeast stress-activated signaling network. Mol. Syst. Biol., 10, 759.
Cherry,J.M. et al. (2011) Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res., 40, D700–D705.
Dickinson,J.R. et al. (2003) The catabolism of amino acids to long chain and complex alcohols in Saccharomyces cerevisiae. J. Biol. Chem., 278, 8028–8034.
Efron,B. and Tibshirani,R. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci., pp. 54–75.
Gasch,A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257.
Gavin,A.-C. et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, 631–636.
Gunderson,T. and Ho,Y.-Y. (2014) An efficient algorithm to explore liquid association on a genome-wide scale. BMC Bioinformatics, 15, 371.
Ho,Y.-Y. et al. (2011) Modeling liquid association. Biometrics, 67, 133–141.
Hughes,T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
Kanehisa,M. et al. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res., 44, D457–D462.
Knijnenburg,T.A. et al. (2009) Combinatorial effects of environmental parameters on transcriptional regulation in Saccharomyces cerevisiae: a quantitative analysis of a compendium of chemostat-based transcriptome data. BMC Genomics, 10, 1.
Li,K.-C. (2002) Genome-wide coexpression dynamics: theory and application. Proc. Natl. Acad. Sci. USA, 99, 16875–16880.
Li,K.-C. et al. (2004) A system for enhancing genome-wide coexpression dynamics study. Proc. Natl. Acad. Sci. USA, 101, 15561–15566.
McAlister,L. and Holland,M.J. (1982) Targeted deletion of a yeast enolase structural gene. Identification and isolation of yeast enolase isozymes. J. Biol. Chem., 257, 7181–7188.
Reinders,J. et al. (2007) Profiling phosphoproteins of yeast mitochondria reveals a role of phosphorylation in assembly of the ATP synthase. Mol. Cell. Proteomics, 6, 1896–1906.
Song,L. et al. (2012) Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 328.
Teixeira,M.C. et al. (2013) The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res., 42, D161–D166.
Upton,G.J. (1992) Fisher’s exact test. J. Roy. Stat. Soc. A Stat., 155, 395–402.
van Iterson,M. et al. (2010) Filtering, FDR and power. BMC Bioinformatics, 11, 450.
Wolfe,C.J. et al. (2005) Systematic survey reveals general applicability of ‘guilt-by-association’ within gene coexpression networks. BMC Bioinformatics, 6, 1.
Xu,Y.-F. et al. (2012) Regulation of yeast pyruvate kinase by ultrasensitive allostery independent of phosphorylation. Mol. Cell, 48, 52–62.
Zhang,B. et al. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4, 1128.
Zhang,J. et al. (2007) Extracting three-way gene interactions from microarray data. Bioinformatics, 23, 2903–2909.
Meta liquid association 2147
... In the literature, different measures such as structural equation models, Bayesian networks and other probabilistic graphical models are widely used for study conditional correlation and causal relationship [33][34][35]. However, many complex regulatory in the biological system can't be captured by direct guilt-by association using above methods because of multi-way interaction [36]. For example, two gene expression levels are overall non-correlated but they exhibited high correlation when a third gene is high expressed and a much lower correlation when expression of the third gene is low. ...
... For example, two gene expression levels are overall non-correlated but they exhibited high correlation when a third gene is high expressed and a much lower correlation when expression of the third gene is low. In this case, the third gene may serve as an indicator of certain cellular state or regulator that controls the presence and absence of coregulation between two gene pairs [36]. To identify the conditional association in the gene triplet, Li (2002) proposed a liquid association to explore the dynamicpattern as opposed to the static-pattern of gene expression in cell, and previous studies and our results has proved that LA method is a useful tool for investigating the dynamic nature of co-expression on a genome-wide scale [18,[20][21][22]. ...
... LAPs with high LA scores are likely to be involved in biological pathways (Additional file 6: Table S4), which implies that metabolism-related genes are more susceptible to being regulated in this manner. Of course, with rapid accumulation of transcript omic studies, combining multiple studies to indentifying LA triplets is likely to produce more accurate and stable results [36]. ...
Article
Full-text available
Background Dissecting the genetic basis and regulatory mechanisms for the biosynthesis and accumulation of nutrients in maize could lead to the improved nutritional quality of this crop. Gene expression is regulated at the genomic, transcriptional, and post-transcriptional levels, all of which can produce diversity among traits. However, the expression of most genes connected with a particular trait usually does not have a direct association with the variation of that trait. In addition, expression profiles of genes involved in a single pathway may vary as the intrinsic cellular state changes. To work around these issues, we utilized a statistical method, liquid association (LA) to investigate the complex pattern of gene regulation in maize kernels. Results We applied LA to the expression profiles of 28,769 genes to dissect dynamic trait-trait correlation patterns in maize kernels. Among the 1000 LA pairs (LAPs) with the largest LA scores, 686 LAPs were identified conditional correlation. We also identified 830 and 215 LA-scouting leaders based on the positive and negative LA scores, which were significantly enriched for some biological processes and molecular functions. Our analysis of the dynamic co-expression patterns in the carotene biosynthetic pathway clearly indicated the important role of lcyE, CYP97A, ZEP1, and VDE in this pathway, which may change the direction of carotene biosynthesis by controlling the influx and efflux of the substrate. The dynamic trait-trait correlation patterns between gene expression and oil concentration in the fatty acid metabolic pathway and its complex regulatory network were also assessed. 23 of 26 oil-associated genes were correlated with oil concentration conditioning on 580 LA-scoutinggenes, and 5% of these LA-scouting genes were annotated as enzymes in the oil metabolic pathway. 
Conclusions By focusing on the carotenoid and oil biosynthetic pathways in maize, we showed that a genome-wide LA analysis provides a novel and effective way to detect transcriptional regulatory relationships. This method will help us understand the biological role of maize kernel genes and will benefit maize breeding programs. Electronic supplementary material The online version of this article (10.1186/s12870-017-1119-y) contains supplementary material, which is available to authorized users.
... Li (2002) examined these dynamic correlation changes (referred to as liquid association in his paper) in canonical pathways using microarray gene expression data from a model organism, Saccharomyces cerevisiae. For a typical genomic study, a pathway-based or a genome-wide screening strategy can be implemented as presented in several studies to effectively identify potential dynamic correlation changes (Dawson and Kendziorski, 2012;Gunderson and Ho, 2014;Wang et al., 2017;Yu, 2018;Kinzy et al., 2019). Li's study and other studies since then have evidently established its biological validity and popularized it to be a useful tool for analyzing genomic data (Li, 2002;Ho et al., 2007;Zhang et al., 2007;Ho et al., 2011;Wang et al., 2013;Khayer et al., 2017;Wang et al., 2017;Xu et al., 2017;Ai et al., 2019;Kong and Yu, 2019;Wen et al., 2020). ...
... For a typical genomic study, a pathway-based or a genome-wide screening strategy can be implemented as presented in several studies to effectively identify potential dynamic correlation changes (Dawson and Kendziorski, 2012;Gunderson and Ho, 2014;Wang et al., 2017;Yu, 2018;Kinzy et al., 2019). Li's study and other studies since then have evidently established its biological validity and popularized it to be a useful tool for analyzing genomic data (Li, 2002;Ho et al., 2007;Zhang et al., 2007;Ho et al., 2011;Wang et al., 2013;Khayer et al., 2017;Wang et al., 2017;Xu et al., 2017;Ai et al., 2019;Kong and Yu, 2019;Wen et al., 2020). ...
Article
Interactions between biological molecules in a cell are tightly coordinated and often highly dynamic. As a result of these varying signaling activities, changes in gene co‐expression patterns could often be observed. The advancements in next‐generation sequencing technologies bring new statistical challenges for studying these dynamic changes of gene co‐expression. In recent years, methods have been developed to examine genomic information from individual cells. Single‐cell RNA sequencing (scRNA‐seq) data are count‐based, and often exhibit characteristics such as over‐dispersion and zero‐inflation. To explore the dynamic dependence structure in scRNA‐seq data and other zero‐inflated count data, new approaches are needed. In this paper, we consider over‐dispersion and zero‐inflation in count outcomes and propose a ZEro‐inflated Negative binomial dynamic COrrelation model (ZENCO). The observed count data are modeled as a mixture of two components: success amplifications and dropout events in ZENCO. A latent variable is incorporated into ZENCO in order to model the covariate‐dependent correlation structure. We conduct simulation studies to evaluate the performance of our proposed method and to compare it with existing approaches. We also illustrate the implementation of our proposed approach using scRNA‐seq data from a study of minimal residual disease in melanoma.
... Most of the methods try to find "network markers", i.e. small subnetworks that change expression levels in response to clinical conditions [6][7][8][9][10][11][12][13][14][15][16][17]. Some other methods study the dynamic correlation patterns on the network, without considering the clinical outcome [18][19][20]. ...
Article
Full-text available
Background: The biological network is highly dynamic. Functional relations between genes can be activated or deactivated depending on the biological conditions. On the genome-scale network, subnetworks that gain or lose local expression consistency may shed light on the regulatory mechanisms related to the changing biological conditions, such as disease status or tissue developmental stages. Results: In this study, we develop a new method to select genes and modules on the existing biological network, in which local expression consistency changes significantly between clinical conditions. The method is called DNLC: Differential Network Local Consistency. In simulations, our algorithm detected artificially created local consistency changes effectively. We applied the method on two publicly available datasets, and the method detected novel genes and network modules that were biologically plausible. Conclusions: The new method is effective in finding modules in which the gene expression consistency change between clinical conditions. It is a useful tool that complements traditional differential expression analyses to make discoveries from gene expression data. The R package is available at https://cran.r-project.org/web/packages/DNLC.
... Currently, the existing methods suffer from computational scalability when examining the entire biological system since it is difficult to examine gene-level three-way interactions triplet-by-triplet as the amount of possible combinations is extremely large. Efforts have been made to focus on a smaller number of subsets, by considering consistent LA relations across multiple datasets [14], focusing on subnetwork-level LA relations [15]. ...
Article
Full-text available
Background The biological regulatory system is highly dynamic. Correlations between functionally related genes change over different biological conditions, which are often unobserved in the data. At the gene level, the dynamic correlations result in three-way gene interactions involving a pair of genes that change correlation, and a third gene that reflects the underlying cellular conditions. This type of ternary relation can be quantified by the Liquid Association statistic. Studying these three-way interactions at the gene triplet level have revealed important regulatory mechanisms in the biological system. Currently, due to the extremely large amount of possible combinations of triplets within a high-throughput gene expression dataset, no method is available to examine the ternary relationship at the biological system level and formally address the false discovery issue. Results Here we propose a new method, Hypergraph for Dynamic Correlation (HDC), to construct module-level three-way interaction networks. The method is able to present integrative uniform hypergraphs to reflect the global dynamic correlation pattern in the biological system, providing guidance to down-stream gene triplet-level analyses. To validate the method’s ability, we conducted two real data experiments using a melanoma RNA-seq dataset from The Cancer Genome Atlas (TCGA) and a yeast cell cycle dataset. The resulting hypergraphs are clearly biologically plausible, and suggest novel relations relevant to the biological conditions in the data. Conclusions We believe the new approach provides a valuable alternative method to analyze omics data that can extract higher order structures. The software is at https://github.com/yunchuankong/HypergraphDynamicCorrelation. Electronic supplementary material The online version of this article (10.1186/s12864-019-5787-x) contains supplementary material, which is available to authorized users.
... To examine how co-occurrence might be mediated by biological or environmental variables we employed liquid association (LA) analysis [26,27] in succession to co-occurrence network analysis by the local similarity analysis (LSA) [6,28,29]. The rationale here is that LSA acts as a filtering mechanism to limit our three way analysis to only those variables that were at least sometimes associated with each other. ...
Article
Full-text available
Background Discovering the key microbial species and environmental factors of microbial community and characterizing their relationships with other members are critical to ecosystem studies. The microbial co-occurrence patterns across a variety of environmental settings have been extensively characterized. However, previous studies were limited by their restriction toward pairwise relationships, while there was ample evidence of third-party mediated co-occurrence in microbial communities. Methods We implemented and applied the triplet-based liquid association analysis in combination with the local similarity analysis procedure to microbial ecology data. We developed an intuitive scheme to visualize those complex triplet associations along with pairwise correlations. Using a time series from the marine microbial ecosystem as example, we identified pairs of operational taxonomic units (OTUs) where the strength of their associations appeared to relate to the values of a third “mediator” variable. These “mediator” variables appear to modulate the associations between pairs of bacteria. Results Using this analysis, we were able to assess the OTUs’ ability to regulate its functional partners in the community, typically not manifested in the pairwise correlation patterns. For example, we identified Flavobacteria as a multifaceted player in the marine microbial ecosystem, and its clades were involved in mediating other OTU pairs. By contrast, SAR11 clades were not active mediators of the community, despite being abundant and highly correlated with other OTUs. Our results suggested that Flavobacteria are more likely to respond to situations where particles and unusual sources of dissolved organic material are prevalent, such as after a plankton bloom. On the other hand, SAR11s are oligotrophic chemoheterotrophs with inflexible metabolisms, and their relationships with other organisms may be less governed by environmental or biological factors. 
Conclusions By integrating liquid association with local similarity analysis to explore the mediated co-varying dynamics, we presented a novel perspective and a useful toolkit to analyze and interpret time series data from microbial community. Our augmented association network analysis is thus more representative of the true underlying dynamic structure of the microbial community. The analytic software in this study was implemented as new functionalities of the ELSA (Extended local similarity analysis) tool, which is available for free download (http://bitbucket.org/charade/elsa).
... The method scans through all possible gene triplets to find potential dynamic correlations. Similar approaches that utilize genes as mediators [8,9], integrative analysis utilizing Liquid Association [10,11], as well as statistical theory of Liquid Association [12] were later developed. ...
Article
Full-text available
Dynamic correlations are pervasive in high-throughput data. Large numbers of gene pairs can change their correlation patterns in response to observed/unobserved changes in physiological states. Finding changes in correlation patterns can reveal important regulatory mechanisms. Currently there is no method that can effectively detect global dynamic correlation patterns in a dataset. Given the challenging nature of the problem, the currently available methods use genes as surrogate measurements of physiological states, which cannot faithfully represent true underlying biological signals. In this study we develop a new method that directly identifies strong latent dynamic correlation signals from the data matrix, named DCA: Dynamic Correlation Analysis. At the center of the method is a new metric for the identification of pairs of variables that are highly likely to be dynamically correlated, without knowing the underlying physiological states that govern the dynamic correlation. We validate the performance of the method with extensive simulations. We applied the method to three real datasets: a single cell RNA-seq dataset, a bulk RNA-seq dataset, and a microarray gene expression dataset. In all three datasets, the method reveals novel latent factors with clear biological meaning, bringing new insights into the data.
... To advance from microarray to next generation sequencing (NGS) technology, Ma et al. (2016) andMa et al. (2017) proposed Bayesian hierarchical model for meta-analysis to combine multiple RNA-seq studies or to combine microarray and RNA-seq cross-platform studies. In addition to these methods for detecting differentially expressed genes, omics meta-analysis for other biological purposes have been developed, including quality control ( Kang et al., 2012), pathway analysis ( Shen and Tseng, 2010), cluster analysis ( Huo et al., 2016), classification analysis ( Kim et al., 2016a), dimension reduction ( ), gene regulatory association ( Wang et al., 2017) and differential co-expression network analysis ( Zhu et al., 2016). ...
... The method scans through all possible gene triplets to find potential dynamic correlations. Similar approaches that utilize genes are mediators (15,16), integrative analysis utilizing LA (17,18), as well as some statistical theory of LA (19) were later developed. Although focusing on gene-level dynamic correlations can reveal some important local regulatory mechanisms, a more global approach to dynamic correlation could discover critical regulation mechanisms that penetrate multiple biological processes, or help identify hidden sub-groups in the samples. ...
Article
In high-throughput data, dynamic correlation between genes, i.e. changing correlation patterns under different biological conditions, can reveal important regulatory mechanisms. Given the complex nature of dynamic correlation, and the underlying conditions for dynamic correlation may not manifest into clinical observations, it is difficult to recover such signal from the data. Current methods seek underlying conditions for dynamic correlation by using certain observed genes as surrogates, which may not faithfully represent true latent conditions. In this study we develop a new method that directly identifies strong latent signals that regulate the dynamic correlation of many pairs of genes, named DCA: Dynamic Correlation Analysis. At the center of the method is a new metric for the identification of gene pairs that are highly likely to be dynamically correlated, without knowing the underlying conditions of the dynamic correlation. We validate the performance of the method with extensive simulations. In real data analysis, the method reveals novel latent factors with clear biological meaning, bringing new insights into the data.
Article
Local associations refer to spatial–temporal correlations that emerge from the biological realm, such as time-dependent gene co-expression or seasonal interactions between microbes. One can reveal the intricate dynamics and inherent interactions of biological systems by examining the biological time series data for these associations. To accomplish this goal, local similarity analysis algorithms and statistical methods that facilitate the local alignment of time series and assess the significance of the resulting alignments have been developed. Although these algorithms were initially devised for gene expression analysis from microarrays, they have been adapted and accelerated for multi-omics next generation sequencing datasets, achieving high scientific impact. In this review, we present an overview of the historical developments and recent advances for local similarity analysis algorithms, their statistical properties, and real applications in analyzing biological time series data. The benchmark data and analysis scripts used in this review are freely available at http://github.com/labxscut/lsareview.
Article
With recent advances in technologies to profile multi‐omics data at the single‐cell level, integrative multi‐omics data analysis has been increasingly popular. It is increasingly common that information such as methylation changes, chromatin accessibility, and gene expression are jointly collected in a single‐cell experiment. In biomedical studies, it is often of interest to study the associations between various data types and to examine how these associations might change according to other factors such as cell types and gene regulatory components. However, since each data type usually has a distinct marginal distribution, joint analysis of these changes of associations using multi‐omics data is statistically challenging. In this paper, we propose a flexible copula‐based framework to model covariate‐dependent correlation structures independent of their marginals. In addition, the proposed approach could jointly combine a wide variety of univariate marginal distributions, either discrete or continuous, including the class of zero‐inflated distributions. The performance of the proposed framework is demonstrated through a series of simulation studies. Finally, it is applied to a set of experimental data to investigate the dynamic relationship between single‐cell RNA‐sequencing, chromatin accessibility, and DNA methylation at different germ layers during mouse gastrulation. This article is protected by copyright. All rights reserved
Yeast contain two nontandemly repeated enolase structural genes which have been isolated on bacterial plasmids designated peno46 and peno8 (Holland, M. J., Holland, J. P., Thill, G. P., and Jackson, K. A. (1981) J. Biol. Chem. 256, 1385-1395). In order to study the expression of the enolase genes in vivo, the resident enolase gene in a wild type yeast strain corresponding to the gene isolated on peno46 was replaced with a deletion, constructed in vitro, which lacks 90% of the enolase coding sequences. Three catalytically active enolases are resolved after DEAE-Sephadex chromatography of wild type cellular extracts. As expected, a single form of enolase was resolved from extracts of the mutant cell. Immunological and electrophoretic analyses of the multiple forms of enolase confirm that two enolase genes are expressed in wild type cells and that isozymes are formed in the cell by random assortment of the two polypeptides into three active enolase dimers. The yeast enolase loci have been designated ENO1 and ENO2. The deletion mutant lacks the enolase 1 polypeptide, confirming that this polypeptide is encoded by the gene isolated on peno46. The intracellular steady state concentrations of the two polypeptides are dependent on the carbon source used to propagate the cells. Log phase cells grown on glucose contain 20-fold more enolase 2 polypeptide than enolase 1 polypeptide, whereas cells grown on ethanol or glycerol plus lactate contain similar amounts of the two polypeptides. The 20-fold higher than in cells grown on the nonfermentable carbon sources. In vitro translation of total cellular RNA suggests that the steady state concentrations of the two enolase mRNAs in cells grown on different carbon sources are proportional to the steady state concentrations of the respective enolase polypeptides.
KEGG (http://www.kegg.jp/ or http://www.genome.jp/kegg/) is an integrated database resource for biological interpretation of genome sequences and other high-throughput data. Molecular functions of genes and proteins are associated with ortholog groups and stored in the KEGG Orthology (KO) database. The KEGG pathway maps, BRITE hierarchies and KEGG modules are developed as networks of KO nodes, representing high-level functions of the cell and the organism. Currently, more than 4000 complete genomes are annotated with KOs in the KEGG GENES database, which can be used as a reference data set for KO assignment and subsequent reconstruction of KEGG pathways and other molecular networks. As an annotation resource, the following improvements have been made. First, each KO record is re-examined and associated with protein sequence data used in experiments of functional characterization. Second, the GENES database now includes viruses, plasmids, and the addendum category for functionally characterized proteins that are not represented in complete genomes. Third, new automatic annotation servers, BlastKOALA and GhostKOALA, are made available utilizing the non-redundant pangenome data set generated from the GENES database. As a resource for translational bioinformatics, various data sets are created for antimicrobial resistance and drug interaction networks.
Background: The growing wealth of publicly available gene expression data has made systemic studies of how genes interact in a cell more feasible. Liquid association (LA) describes the extent to which coexpression of two genes may vary based on the expression level of a third gene (the controller gene). However, genome-wide application has been difficult and resource-intensive. We propose a new screening algorithm for more efficient processing of LA estimation on a genome-wide scale and apply it to a Saccharomyces cerevisiae data set. Results: On a test subset of the data, the fast screening algorithm achieved >99.8% agreement with the exhaustive search of LA values, while reducing run time by 81–93%. Using a well-known yeast cell-cycle data set with 6,178 genes, we identified triplet combinations with significantly large LA values. In an exploratory gene set enrichment analysis, the top terms for the controller genes in these triplets with large LA values are involved in some of the most fundamental processes in yeast such as energy regulation, transportation, and sporulation. Conclusion: In this paper we propose a novel, efficient algorithm to explore LA on a genome-wide scale and identify triplets of interest in cell cycle pathways using the proposed method in a yeast data set. A software package named fastLiquidAssociation for implementing the algorithm is available through http://www.bioconductor.org.
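To make the LA quantity in the abstract above concrete, the following is a minimal sketch that scores a single triplet as the mean of the triple product X·Y·Z, after standardizing X and Y and replacing Z by its normal scores. This is the simple moment-type estimator usually associated with liquid association, not the fast screening algorithm the abstract proposes, and the function names are hypothetical (they are not the fastLiquidAssociation API).

```python
from statistics import NormalDist, mean, pstdev


def standardize(values):
    """Center to mean 0 and scale to unit (population) standard deviation.

    Assumes the values are not all identical (otherwise the scale is zero).
    """
    m = mean(values)
    s = pstdev(values)
    return [(v - m) / s for v in values]


def normal_scores(values):
    """Replace each value by the standard normal quantile of rank / (n + 1).

    Ties are broken by input order, which is a simplification.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    nd = NormalDist()
    scores = [0.0] * n
    for rank, i in enumerate(order, start=1):
        scores[i] = nd.inv_cdf(rank / (n + 1))
    return scores


def liquid_association(x, y, z):
    """Estimate LA(X, Y | Z) as the average of x_i * y_i * z_i.

    A positive value suggests the X-Y correlation increases with Z;
    a negative value suggests it decreases with Z.
    """
    xs = standardize(x)
    ys = standardize(y)
    zs = normal_scores(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)
```

A genome-wide scan would evaluate this score for every (X, Y, Z) triplet, which is why screening steps that avoid the exhaustive search matter in practice.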
Stressed cells coordinate a multi-faceted response spanning many levels of physiology. Yet knowledge of the complete stress-activated regulatory network as well as design principles for signal integration remains incomplete. We developed an experimental and computational approach to integrate available protein interaction data with gene fitness contributions, mutant transcriptome profiles, and phospho-proteome changes in cells responding to salt stress, to infer the salt-responsive signaling network in yeast. The inferred subnetwork presented many novel predictions by implicating new regulators, uncovering unrecognized crosstalk between known pathways, and pointing to previously unknown ‘hubs’ of signal integration. We exploited these predictions to show that Cdc14 phosphatase is a central hub in the network and that modification of RNA polymerase II coordinates induction of stress-defense genes with reduction of growth-related transcripts. We find that the orthologous human network is enriched for cancer-causing genes, underscoring the importance of the subnetwork's predictions in understanding stress biology.
The YEASTRACT (http://www.yeastract.com) information system is a tool for the analysis and prediction of transcription regulatory associations in Saccharomyces cerevisiae. Last updated in June 2013, this database contains over 200 000 regulatory associations between transcription factors (TFs) and target genes, including 326 DNA binding sites for 113 TFs. All regulatory associations stored in YEASTRACT were revisited and new information was added on the experimental conditions in which those associations take place and on whether the TF is acting on its target genes as activator or repressor. Based on this information, new queries were developed allowing the selection of specific environmental conditions, experimental evidence or positive/negative regulatory effect. This release further offers tools to rank the TFs controlling a gene or genome-wide response by their relative importance, based on (i) the percentage of target genes in the data set; (ii) the enrichment of the TF regulon in the data set when compared with the genome; or (iii) the score computed using the TFRank system, which selects and prioritizes the relevant TFs by walking through the yeast regulatory network. We expect that with the new data and services made available, the system will continue to be instrumental for yeast biologists and systems biology researchers.
Background: Co-expression measures are often used to define networks among genes. Mutual information (MI) is often used as a generalized correlation measure. It is not clear how much MI adds beyond standard (robust) correlation measures or regression model based association measures. Further, it is important to assess what transformations of these and other co-expression measures lead to biologically meaningful modules (clusters of genes). Results: We provide a comprehensive comparison between mutual information and several correlation measures in 8 empirical data sets and in simulations. We also study different approaches for transforming an adjacency matrix, e.g. using the topological overlap measure. Overall, we confirm close relationships between MI and correlation in all data sets which reflects the fact that most gene pairs satisfy linear or monotonic relationships. We discuss rare situations when the two measures disagree. We also compare correlation and MI based approaches when it comes to defining co-expression network modules. We show that a robust measure of correlation (the biweight midcorrelation transformed via the topological overlap transformation) leads to modules that are superior to MI based modules and maximal information coefficient (MIC) based modules in terms of gene ontology enrichment. We present a function that relates correlation to mutual information which can be used to approximate the mutual information from the corresponding correlation coefficient. We propose the use of polynomial or spline regression models as an alternative to MI for capturing non-linear relationships between quantitative variables. Conclusion: The biweight midcorrelation outperforms MI in terms of elucidating gene pairwise relationships. Coupled with the topological overlap matrix transformation, it often leads to more significantly enriched co-expression modules. Spline and polynomial networks form attractive alternatives to MI in case of non-linear relationships. Our results indicate that MI networks can safely be replaced by correlation networks when it comes to measuring co-expression relationships in stationary data.
The common approach to the multiplicity problem calls for controlling the familywise error rate (FWER). This approach, though, has faults, and we point out a few. A different approach to problems of multiple significance testing is presented. It calls for controlling the expected proportion of falsely rejected hypotheses, the false discovery rate. This error rate is equivalent to the FWER when all hypotheses are true but is smaller otherwise. Therefore, in problems where the control of the false discovery rate rather than that of the FWER is desired, there is potential for a gain in power. A simple sequential Bonferroni-type procedure is proved to control the false discovery rate for independent test statistics, and a simulation study shows that the gain in power is substantial. The use of the new procedure and the appropriateness of the criterion are illustrated with examples.
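The sequential procedure summarized in the abstract above is commonly stated as a step-up rule: sort the m p-values, find the largest rank k with p_(k) ≤ k·q/m, and reject the k smallest. The sketch below follows that standard formulation; the function name and the default level q = 0.05 are illustrative assumptions.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k (1-based, in sorted order) such
    that p_(k) <= k * q / m, then reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])

    k = 0  # largest rank that satisfies the threshold (0 means no rejection)
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * q / m:
            k = rank

    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value later in the ordering meets its threshold.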
Nonparametric regression is a set of techniques for estimating a regression curve without making strong assumptions about the shape of the true regression function. These techniques are therefore useful for building and checking parametric models, as well as for data description. Kernel and nearest-neighbor regression estimators are local versions of univariate location estimators, and so they can readily be introduced to beginning students and consulting clients who are familiar with such summaries as the sample mean and median.
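The remark that kernel and nearest-neighbor estimators are local versions of location estimators can be illustrated with two small functions: a kernel-weighted local mean (the Nadaraya-Watson form, here with a Gaussian kernel) and a k-nearest-neighbor local mean. This is a sketch under assumed choices; the kernel, the bandwidth parameter, and the function names are not taken from the paper.

```python
import math


def kernel_regression(x_train, y_train, x0, bandwidth):
    """Kernel-weighted mean of y at x0 (Nadaraya-Watson, Gaussian kernel).

    Assumes x0 is close enough to the data that the weights do not all
    underflow to zero.
    """
    weights = [math.exp(-0.5 * ((x - x0) / bandwidth) ** 2) for x in x_train]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, y_train)) / total


def knn_regression(x_train, y_train, x0, k):
    """Plain mean of y over the k training points nearest to x0."""
    nearest = sorted(range(len(x_train)), key=lambda i: abs(x_train[i] - x0))[:k]
    return sum(y_train[i] for i in nearest) / k
```

With a very large bandwidth, or with k equal to the sample size, both estimators reduce to the ordinary sample mean of y, which is the sense in which they localize a familiar summary.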
This paper reviews the problems that bedevil the selection of an appropriate test for the analysis of a 2 x 2 table. In contradiction to an earlier paper, the author now argues the case for the use of Fisher's exact test. It is noted that all test statistics for the 2 x 2 table have discrete distributions and it is suggested that it is irrational to prescribe an unattainable fixed significance level. The use of mid-P is suggested, if a formula is required for prescribing a variable tail probability. The problems of two-tail tests are discussed.
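For a 2 x 2 table with fixed margins, the one-sided Fisher exact p-value is a hypergeometric tail probability, and the mid-P variant counts only half of the probability of the observed table. The sketch below computes both for the upper tail; the function names and the cell layout [[a, b], [c, d]] are illustrative assumptions, and two-tailed versions are deliberately left out.

```python
from math import comb


def hypergeom_pmf(k, row1, col1, n):
    """P(top-left cell = k) given first row total, first column total, and n."""
    return comb(col1, k) * comb(n - col1, row1 - k) / comb(n, row1)


def fisher_upper_tail(a, b, c, d, mid_p=False):
    """One-sided (upper tail) p-value for the table [[a, b], [c, d]].

    Exact:  P(X >= a).
    Mid-P:  P(X > a) + 0.5 * P(X = a).
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    k_max = min(row1, col1)
    # Probability of tables strictly more extreme than the observed one.
    tail = sum(hypergeom_pmf(k, row1, col1, n) for k in range(a + 1, k_max + 1))
    p_obs = hypergeom_pmf(a, row1, col1, n)
    return tail + (0.5 * p_obs if mid_p else p_obs)
```

Because the test statistic is discrete, the exact p-value can only take a limited set of values; the mid-P value is always smaller than the exact one by half the probability of the observed table.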