Gene expression
Meta-analytic framework for liquid association
Lin Wang [1,†], Silvia Liu [2,3,†], Ying Ding [2,3], Shin-sheng Yuan [4], Yen-Yi Ho [5,*] and George C. Tseng [2,3,*]
[1] School of Statistics, Capital University of Economics and Business, Fengtai, Beijing 100070, China
[2] Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA
[3] Department of Computational and Systems Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15213, USA
[4] Institute of Statistical Science, Academia Sinica, Nankang, Taipei 115, Taiwan
[5] Department of Statistics, College of Arts and Sciences, University of South Carolina, Columbia, SC 29208, USA
† The authors wish it to be known that, in their opinion, the first two authors should be regarded as Joint Authors.
* To whom correspondence should be addressed.
Associate Editor: Bonnie Berger
Received on September 20, 2016; revised on February 11, 2017; editorial decision on March 6, 2017; accepted on March 9, 2017
Abstract
Motivation: Although coexpression analysis via pairwise expression correlation is widely used to elucidate gene-gene interactions at the whole-genome scale, many complicated multi-gene regulations require more advanced detection methods. Liquid association (LA) is a powerful tool to detect the dynamic correlation of two gene variables depending on the expression level of a third variable (the LA scouting gene). LA detection from a single transcriptomic study, however, is often unstable and not generalizable due to cohort bias, biological variation and limited sample size. With the rapid development of microarray and NGS technology, LA analysis combining multiple gene expression studies can provide more accurate and stable results.
Results: In this article, we proposed two meta-analytic approaches for LA analysis (MetaLA and MetaMLA) to combine multiple transcriptomic studies. To offset the heavy computational demand, we also proposed a two-step fast screening algorithm for more efficient genome-wide screening: bootstrap filtering and sign filtering. We applied the methods to five Saccharomyces cerevisiae datasets related to environmental changes. The fast screening algorithm reduced running time by 98%. Compared with single-study analysis, MetaLA and MetaMLA provided stronger detection signal and more consistent and stable results. The top triplets are highly enriched in fundamental biological processes related to environmental changes. Our method can help biologists understand underlying regulatory mechanisms under different environmental exposures or disease states.
Availability and Implementation: A MetaLA R package, data and code for this article are available at http://tsenglab.biostat.pitt.edu/software.htm
Contact: ctseng@pitt.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author 2017. Published by Oxford University Press. All rights reserved. For Permissions, please e-mail: journals.permissions@oup.com
Bioinformatics, 33(14), 2017, 2140–2147
doi: 10.1093/bioinformatics/btx138
Advance Access Publication Date: 11 March 2017
Original Paper
1 Introduction
Gene co-expression analysis is widely applied to study pairwise gene synchronization and to elucidate potential gene regulatory mechanisms. For example, an unweighted gene co-expression network can be constructed from a transcriptomic study given a co-expression measure (e.g. Pearson correlation) and an edge cut-off (e.g. two nodes are connected if the absolute correlation is ≥0.6 and disconnected if it is <0.6). In the literature, different measures such as Pearson correlation, Spearman correlation and mutual information (Butte and Kohane, 2000) have been used (see Song et al., 2012 for a comparative study). Alternatively, Zhang et al. (2005) developed a weighted correlation network analysis (WGCNA) framework using cluster analysis to construct gene co-expression modules and their associated weighted co-expression networks. Network properties and extended pathway analysis can then be studied to investigate disease-related network alterations and mechanisms.
Although the guilt-by-association heuristic assumed in gene co-expression network analysis is widely used in genomics (Wolfe et al., 2005), many complex regulatory mechanisms in the system cannot be readily captured by direct association because of multi-way interactions. The first column in Figure 1A shows an example of liquid association (LA), first described in Li (2002). Genes YCR005C and YPL262W are overall uncorrelated in study GSE11452 (Spearman correlation = 0.239), but they exhibit high correlation (cor = 0.692) when a third gene, YGR175C, is lowly expressed (expression intensity < −0.424) and a much lower correlation (cor = −0.790) when expression of YGR175C is high (> 0.441). This simple interaction among the trio is biologically meaningful, since the third gene YGR175C may serve as a surrogate for a certain (hidden) cellular state or regulator that controls the presence and absence of co-regulation between genes YCR005C and YPL262W.
To quantify the conditional association in the triplet genes, Li (2012) proposed an LA measure to quantify the dynamic correlation of two variables depending on a third variable (Ho et al., 2011; Li, 2002; Li et al., 2004). Li (2002) introduced this concept and proposed a computationally efficient three-product-moment measure (see Section 2.2). Zhang et al. (2007) adopted a simplified LA score based on z-transformed Pearson correlation conditional on discretized expression of the third gene. Ho et al. (2011) extended the trivariate dependency structure into a parametric Gaussian framework (called modified LA; MLA) to develop improved estimation frameworks and a statistical test for the existence of the LA dependence. The computational complexity of screening all possible triplets is $O(n^3)$, which is generally too high for applying LA methods at a genome-wide scale. Gunderson and Ho (2014) introduced an efficient screening algorithm, fastLA, for the MLA method containing two steps: (i) screening the candidate triplets by the difference between the correlations of the LA pair when the scouting gene is high and low; (ii) fitting and estimating the model based on conditional normal distributions. The algorithm greatly improved the computing efficiency for genome-wide LA analysis.
LA estimated from a single study is often unstable and not generalizable due to cohort bias, biological variation and sample size limitation. With the rapid accumulation of transcriptomic studies in the public domain, identifying LA triplets by combining multiple studies is likely to produce more stable and biologically reproducible results. For example, Figure 1A shows an LA triplet (genes YCR005C, YPL262W and YGR175C) where the LA is statistically significant in the first yeast study GSE11452 but does not hold for the remaining four independent studies. Such an association is likely condition-specific for the first study or a false positive. On the other hand, the LA triplet (YGR264C, YOR197W and YDR519W) in Figure 1B is obtained from the combined meta-analysis of the first three studies. The association is more likely to validate in the fourth and fifth studies. In this article, we develop two meta-analytic frameworks for LA to accurately identify LA triplets that are consistent across multiple studies. The results show that meta-analytic methods generate more stable LA triplets that are more reproducible in independent studies. These LA triplets also yield better pathway enrichment results, which help interpret the underlying biology and generate further hypotheses.
Fig. 1. Scatter plots of the gene expressions in the high and low bins. (A) is for the triplet selected by GSE11452 through singleMLA and (B) is for the triplet selected by the studies GSE11452, Causton and Gasch through MetaMLA (Color version of this figure is available at Bioinformatics online.)
2 Materials and methods
2.1 Datasets and databases
We used five yeast (Saccharomyces cerevisiae) datasets—Causton
(Causton et al., 2001), Gasch (Gasch et al., 2000), Rosetta (Hughes
et al., 2000), GSE60613 (Chasman et al., 2014) and GSE11452
(Knijnenburg et al., 2009)—to illustrate our meta-analytic methods.
In each study, yeast samples are exposed to a variety of environmental stresses and the transcriptomic expression profiles are measured. Causton et al. includes a gene expression series of yeast treated with acid, alkali, heat, hydrogen peroxide, salt and sorbitol, and during diauxic shift; Gasch et al. contains yeast treated with amino acid starvation, diamide, DTT, exposure to peroxide, menadione, nitrogen depletion, osmolarity and temperature shifts; Rosetta corresponds to 300 diverse mutations and chemical treatments; GSE60613 analyzes the stress-activated signaling network; GSE11452 corresponds to chemostat cultures under 55 different conditions. As shown in the data preprocessing step in Figure 2, within each individual study we first deleted genes and samples with >10% and >30% missing values, respectively, imputed the remaining missing values with the K-nearest neighbors algorithm (Altman, 1992), and quantile normalized the samples (Amaratunga and Cabrera, 2001). We further performed unbiased filtering within each study to filter out non-expressed genes (lowest 35% of mean expression) and non-informative genes (lowest 35% of expression variances). Finally, our datasets include 1770 overlapping genes across the five studies and 45, 173, 300, 67 and 170 samples for studies Causton, Gasch, Rosetta, GSE60613 and GSE11452, respectively.
As an in silico biological evaluation of the LA triplets, we downloaded the yeast protein–protein interaction (PPI) database from the Saccharomyces Genome Database (Cherry et al., 2011). The database included 101 325 unique PPI pairs involving 5706 genes. We applied pathway enrichment analysis on two databases, gene ontology (GO) (Cherry et al., 2011) and Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa et al., 2016), and obtained 1398 GO terms and 95 KEGG pathways with at least five genes. Additionally, to test how co-regulated genes are enriched in transcription factor (TF)-binding data, we downloaded TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013) and selected 96 gene sets with 5–200 validated genes for further enrichment analysis. Fisher's exact test (Upton, 1992) was used for pathway enrichment analysis. The P-values were corrected by the Benjamini-Hochberg (BH) algorithm (Benjamini and Hochberg, 1995) and the significance level was set to α = 0.05.
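The enrichment step can be illustrated with a minimal sketch. It assumes a one-sided (over-representation) Fisher's exact test, which the text does not specify, and the function names `enrichment_pvalue` and `benjamini_hochberg` are hypothetical; this is not the authors' R code.

```python
from math import comb

def enrichment_pvalue(hits, selected, set_size, universe):
    """One-sided Fisher's exact test (assumed here): probability of seeing
    at least `hits` gene-set members among `selected` genes drawn from a
    `universe` that contains `set_size` members of the gene set."""
    upper = min(selected, set_size)
    total = comb(universe, selected)
    tail = sum(comb(set_size, k) * comb(universe - set_size, selected - k)
               for k in range(hits, upper + 1))
    return tail / total

def benjamini_hochberg(pvalues):
    """Return BH-adjusted q-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

# Toy numbers only: 12 of 40 selected genes fall in a 60-gene set,
# out of a universe of 1000 genes.
p = enrichment_pvalue(12, 40, 60, 1000)
q = benjamini_hochberg([p, 0.2, 0.9])
```

A gene set would be called enriched when its q-value falls below the chosen significance level (0.05 in this section).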
2.2 LA methods (LA and MLA) for a single study
Li (2002) introduced the concept of 'LA' and defined the LA score for a gene pair $X_1$ and $X_2$ given a scouting gene $X_3$ as $LA(X_1, X_2 \mid X_3) = E\,g'(X_3)$, where $g(x_3) = E(X_1 X_2 \mid X_3 = x_3)$ and $g'(x)$ is the first derivative of $g(x)$. After standardizing the three gene expressions to fit the Gaussian assumption and applying Stein's lemma, Li (2002) proposed the computationally efficient estimator
$$\widehat{LA} = \frac{1}{n} \sum_{l=1}^{n} X_{1l} X_{2l} X_{3l},$$
where $n$ is the total number of observations (samples) and $X_{1l}$, $X_{2l}$ and $X_{3l}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$, respectively.
Ho et al. (2011) proposed an MLA method defined by $MLA(X_1, X_2 \mid X_3) = E\,h'(X_3)$, where $h(X_3) = \rho(X_1, X_2 \mid X_3)$, $h'(x)$ is the first derivative of $h(x)$, and $\rho$ is the Pearson correlation coefficient. They proposed a direct estimate of the MLA score,
$$\widehat{MLA} = \frac{1}{M} \sum_{j=1}^{M} \hat{\rho}_j \bar{X}_{3j},$$
where $M$ is the number of bins over $X_3$, $\bar{X}_{3j}$ is the sample mean of $X_3$ within bin $j$, and $\hat{\rho}_j$ is the correlation of the LA pair $X_1$ and $X_2$ in bin $j$. A key advantage of the MLA estimator is that it allows testing the hypothesis $H_0: MLA(X_1, X_2 \mid X_3) = 0$ with the Wald test statistic $T_{MLA} = \widehat{MLA} / SE(\widehat{MLA})$ to assess the P-value, where $SE(\widehat{MLA})$ is the standard error of $\widehat{MLA}$.
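The two single-study estimators can be sketched as follows. This is only an illustration under stated assumptions: plain z-scoring replaces the normal-score transformation, bins are equal-sized slices of the sorted scouting gene, the function names (`la_score`, `mla_score`) are hypothetical, and the standard error needed for the Wald test is omitted.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Z-score a list (mean 0, variance 1); a simplification of the
    Gaussian standardization described in the text."""
    mu, sd = fmean(values), pstdev(values)
    return [(v - mu) / sd for v in values]

def pearson(a, b):
    """Pearson correlation as the mean product of z-scores."""
    za, zb = standardize(a), standardize(b)
    return fmean([p * q for p, q in zip(za, zb)])

def la_score(x1, x2, x3):
    """Three-product-moment LA estimate: mean of X1 * X2 * X3."""
    z1, z2, z3 = standardize(x1), standardize(x2), standardize(x3)
    return fmean([a * b * c for a, b, c in zip(z1, z2, z3)])

def mla_score(x1, x2, x3, n_bins=3):
    """Direct MLA estimate: average over bins of
    (correlation of the LA pair in the bin) * (bin mean of X3)."""
    z1, z2, z3 = standardize(x1), standardize(x2), standardize(x3)
    order = sorted(range(len(z3)), key=lambda i: z3[i])
    size = len(order) // n_bins
    total = 0.0
    for j in range(n_bins):
        start = j * size
        stop = len(order) if j == n_bins - 1 else start + size
        idx = order[start:stop]  # samples falling in bin j
        rho_j = pearson([z1[i] for i in idx], [z2[i] for i in idx])
        total += rho_j * fmean([z3[i] for i in idx])
    return total / n_bins

# Toy triplet: the pair is anti-correlated when x3 is low and
# correlated when x3 is high, so overall correlation is zero.
x1 = [1, 2, 3, 4, 1, 2, 3, 4]
x2 = [4, 3, 2, 1, 1, 2, 3, 4]
x3 = [-2, -1.5, -1, -0.5, 0.5, 1, 1.5, 2]
print(pearson(x1, x2), la_score(x1, x2, x3), mla_score(x1, x2, x3, n_bins=2))
```

In the toy data the pairwise correlation vanishes while both LA-type scores are positive, which is the situation that ordinary co-expression analysis cannot detect.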
2.3 MetaMLA and MetaLA methods
In this section, we extend the original three-product-moment LA
method (Li, 2002) and the model-based MLA method (Gunderson
and Ho, 2014;Ho et al., 2011) into a meta-analytic scheme for com-
bining information from multiple transcriptomic studies.
Suppose that we have $K$ studies. For a gene triplet $t: (X_1, X_2, X_3)$, if the LA scouting gene is $Z = X_i$ ($i = 1, 2, 3$), after standardizing all three genes to have mean 0 and variance 1 and the scouting gene to follow a normal distribution, the direct estimate of the MLA score (Ho et al., 2011) for single study $k$ ($k = 1, \ldots, K$) is defined as
$$\widehat{MLA}_t^{(k,i)} = \frac{1}{M} \sum_{j=1}^{M} \hat{\rho}_{t,j}^{(k,i)} \bar{z}_{t,j}^{(k,i)},$$
where $M$ is the number of bins, $\hat{\rho}_{t,j}^{(k,i)}$ is the sample Pearson correlation coefficient of the LA pair in bin $j$ when the scouting gene $Z$ is $X_i$ in triplet $t$, and $\bar{z}_{t,j}^{(k,i)}$ is the mean of $Z$ in bin $j$. The test statistic for single study $k$ is
$$T_{MLA,t}^{(k,i)} = \widehat{MLA}_t^{(k,i)} / SE(\widehat{MLA}_t^{(k,i)}),$$
where $SE(\widehat{MLA}_t^{(k,i)})$ is the standard error of $\widehat{MLA}_t^{(k,i)}$ for $k = 1, \ldots, K$ and $i = 1, 2, 3$. The MetaMLA statistic combines the individual study MLA statistics $T_{MLA,t}^{(k,i)}$ and is defined as
$$mMLA_t^{(i)} = \frac{\bar{T}_{MLA,t}^{(i)}}{s_t^{(i)} + s_0},$$
where $\bar{T}_{MLA,t}^{(i)}$ and $s_t^{(i)}$ are the sample mean and SD of $\{T_{MLA,t}^{(k,i)};\ k = 1, 2, \ldots, K\}$, respectively. $s_t^{(i)}$ provides standardization according to the variance of MLA scores across studies. $s_0$ is a fudge parameter that avoids large mMLA scores caused by very small $s_t^{(i)}$ values, which happen frequently in genome-wide screening. In our yeast datasets, suppose $N$ is the total number of triplets for the hypothesis testing. We choose $s_0$ to be $10\,\mathrm{med}\{s_t^{(i)};\ i = 1, 2, 3 \text{ and } t = 1, \ldots, N\}$ (where $\mathrm{med}(\cdot)$ means the median) to guarantee the stability of the test statistics, especially when sample size is small. Standardizing each MLA score by its standard error in the $T_{MLA,t}^{(k,i)}$ score accounts for both sample size and sample heterogeneity effects in single studies. For a study with a large sample size, the SD of the MLA score is usually smaller and thus generates a larger $T_{MLA,t}^{(k,i)}$ score. For a study containing large biological variation or considerable outliers in samples, the SD of the MLA score is large and results in a smaller $T_{MLA,t}^{(k,i)}$ score.
Fig. 2. A process map of the genome-wide application of the MetaMLA algorithm
The MetaLA statistic can be defined similarly to the MetaMLA statistic. The estimate of the LA score (Li, 2002) for single study $k$ ($k = 1, \ldots, K$) is defined as
$$\widehat{LA}_t^{(k)} = \frac{1}{n_k} \sum_{l=1}^{n_k} X_{1l}^{(k)} X_{2l}^{(k)} X_{3l}^{(k)},$$
where $n_k$ is the total number of observations (samples) and $X_{1l}^{(k)}$, $X_{2l}^{(k)}$ and $X_{3l}^{(k)}$ are the $l$th observations for genes $X_1$, $X_2$ and $X_3$ in study $k$, respectively. The MetaLA statistic combines the individual study LA scores $\widehat{LA}_t^{(k)}$ and is defined as
$$mLA_t = \frac{\overline{LA}_t}{s_t + s_0},$$
where $\overline{LA}_t$ and $s_t$ are the sample mean and SD of $\{\widehat{LA}_t^{(k)};\ k = 1, 2, \ldots, K\}$, respectively. $s_t$ provides standardization according to the variance of LA scores across studies. $s_0$ is a fudge parameter that avoids large mLA scores caused by very small $s_t$ values.
2.4 Hypothesis testing and inference for MetaMLA
and MetaLA
Based on MetaMLA, the hypothesis for LA in the gene triplet $t: (X_1, X_2, X_3)$ is
$$H_0: mMLA_t^{(i)} = 0,\ \forall i \in \{1, 2, 3\} \quad \text{versus} \quad H_1: \exists\, i \in \{1, 2, 3\}\ \text{s.t.}\ mMLA_t^{(i)} \neq 0,$$
where $i = 1, 2, 3$ corresponds to the LA scouting gene $Z = X_i$. The null hypothesis represents zero LA no matter which one of $X_1$, $X_2$ and $X_3$ acts as the scouting gene $Z$. The test statistic is defined as
$$T_t = \max_{i = 1, 2, 3} |mMLA_t^{(i)}|.$$
The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples of the LA scouting gene $Z$ when calculating each $mMLA_t^{(i)}$ in the $T_t$ statistic. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$; $1 \le t \le N$) as the null distribution. The P-value is given by
$$P = \frac{1}{BN} \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}),$$
where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the false discovery rate (FDR) is set to α = 0.01. Since the number of possible triplets $N$ is usually very large, only a small $B$ is needed ($B = 40$ is used in this article). We note that theoretically we should perform permutation for each triplet to form its own null distribution. That computation is, however, not feasible (= number of permutations × number of triplets). In our approach, we imposed the assumption of a common null distribution across all triplets to allow affordable computation.
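The pooled-null P-value can be sketched as a counting step once the permuted statistics are available. The function name `pooled_permutation_pvalues` is hypothetical, and generating the permuted statistics (re-computing the meta statistic after shuffling the scouting gene) is assumed to have been done already.

```python
from bisect import bisect_left

def pooled_permutation_pvalues(observed, null_values):
    """P-value of each observed statistic against one shared null:
    the fraction of all B*N permuted statistics that are >= it."""
    null_sorted = sorted(null_values)
    total = len(null_sorted)
    pvalues = []
    for t_obs in observed:
        # bisect_left counts null values strictly below t_obs,
        # so the remainder are the values >= t_obs.
        n_ge = total - bisect_left(null_sorted, t_obs)
        pvalues.append(n_ge / total)
    return pvalues

# Toy pooled null (in practice B * N values) and three observed statistics.
null = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
print(pooled_permutation_pvalues([0.95, 0.3, 2.0], null))
```

Sorting the pooled null once and using binary search keeps the cost near $O((BN + N) \log(BN))$, which matters when $N$ is in the millions. Because the null is shared, the smallest attainable non-zero P-value is $1/(BN)$ rather than $1/B$.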
Based on MetaLA, the hypothesis for LA in the gene triplet $t: (X_1, X_2, X_3)$ is $H_0: mLA_t = 0$ versus $H_1: mLA_t \neq 0$. The test statistic can be defined as $T_t = |mLA_t|$. The distribution of $T_t$ under the null hypothesis can be obtained by randomly permuting the samples inside gene $X_1$, $X_2$ or $X_3$ in turn. We repeat the permutation $B$ times and use the resulting $B \times N$ permuted values $T_t^{(b)}$ ($1 \le b \le B$; $1 \le t \le N$) as the null distribution. The P-value is given by
$$P = \frac{1}{BN} \sum_{b=1}^{B} \sum_{t=1}^{N} I(T_t^{(b)} \ge T_{obs}),$$
where $T_{obs}$ is the observed value of the test statistic. The P-values are corrected by the BH algorithm (Benjamini and Hochberg, 1995) and the FDR is set to α = 0.01. As for MetaMLA, $B = 40$ is used.
2.5 Filtering to reduce computation of MetaMLA
Genome-wide calculation of the LA is usually time-consuming and resource-intensive for a single study (Ho et al., 2011; Li, 2002). This problem is further aggravated when combining multiple studies. In this section, we develop a screening algorithm to perform genome-wide MetaMLA analysis with higher efficiency. As illustrated in Figure 2, our algorithm seeks to reduce the number of triplets that need to be examined in depth through two screening steps: bootstrap filtering and sign filtering.
In the first step, bootstrap filtering, we filter out triplets with a small correlation difference between the high and low bins. Define $\rho_{diff}$ to be the difference between the LA pair correlations when the scouting gene is in the highest and lowest bins. In the literature, the fastLA algorithm for a single study (Gunderson and Ho, 2014) has used a screening procedure for fast computing. In meta-analysis, we aim to detect triplets with consistently large or consistently small LAs across multiple studies. For the triplet $t: (X_1, X_2, X_3)$, given the scouting gene $Z = X_i$ ($i = 1, 2, 3$), we define
$$\rho_{diff,t}^{(k,i)} = \rho_{high,t}^{(k,i)} - \rho_{low,t}^{(k,i)},$$
where $\rho_{high,t}^{(k,i)}$ and $\rho_{low,t}^{(k,i)}$ are the Pearson correlations when gene $Z$ is in the high and low bins of study $k$, respectively. We use the score $\sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| / K$ as the meta-filtering criterion. Since the scouting gene $Z$ could be $X_1$, $X_2$ or $X_3$, we use $\max_{i = 1, 2, 3} \left( \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i)}| \right) / K$ to order and filter out triplets that are unlikely to have LA association. To avoid outlier effects when calculating correlations in the bins, we propose to bootstrap (Efron and Tibshirani, 1986) samples in each study $B$ times and obtain
$$\rho_{diff,t}^{(meta,b)} = \max_{i = 1, 2, 3} \sum_{k=1}^{K} |\rho_{diff,t}^{(k,i,b)}| / K,$$
where $b = 1, 2, \ldots, B$. Finally, we use
$$\rho_{diff,t}^{(meta)} = \mathrm{med}(\rho_{diff,t}^{(meta,b)};\ b = 1, \ldots, B)$$
to screen the triplets, where $\mathrm{med}(\cdot)$ means taking the median. We set $\rho_{diff,t}^{(meta)} > 0.4$ as the cutoff to keep triplets for further testing. $\rho_{diff,t}^{(meta)}$ can largely reduce computational complexity for two reasons: (i) calculating $\rho_{diff,t}^{(meta)}$ is computationally much simpler than calculating the MetaMLA statistic; (ii) $\rho_{diff,t}^{(meta)}$ can filter out a large percentage of triplets and thereby further reduce the computational cost of P-value calculation in the permutation step.
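The bootstrap filter can be sketched as below, under stated assumptions: bins are equal-sized slices of the sorted scouting gene, a degenerate bin (zero variance after resampling) is treated as uncorrelated, and the names `rho_diff` and `meta_rho_diff` are hypothetical. It is an illustration rather than the authors' implementation.

```python
import random
from statistics import fmean, median, pstdev

def pearson(a, b):
    sa, sb = pstdev(a), pstdev(b)
    if sa == 0 or sb == 0:
        return 0.0  # convention chosen here for degenerate bins
    ma, mb = fmean(a), fmean(b)
    return fmean([(p - ma) * (q - mb) for p, q in zip(a, b)]) / (sa * sb)

def rho_diff(pair_a, pair_b, scout, n_bins=3):
    """Correlation of the LA pair in the highest bin of the scouting
    gene minus the correlation in the lowest bin."""
    order = sorted(range(len(scout)), key=lambda i: scout[i])
    size = len(order) // n_bins
    low, high = order[:size], order[-size:]
    corr = lambda idx: pearson([pair_a[i] for i in idx],
                               [pair_b[i] for i in idx])
    return corr(high) - corr(low)

def meta_rho_diff(studies, n_boot=20, n_bins=3, seed=0):
    """Median over bootstraps of max_i mean_k |rho_diff|.
    `studies` is a list of (x1, x2, x3) tuples, one per study."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_boot):
        resampled = []
        for genes in studies:
            n = len(genes[0])
            idx = [rng.randrange(n) for _ in range(n)]  # resample samples
            resampled.append([[g[i] for i in idx] for g in genes])
        per_scout = []
        for i in range(3):  # try each gene as the scouting gene
            a, b = [j for j in range(3) if j != i]
            per_scout.append(fmean(
                [abs(rho_diff(s[a], s[b], s[i], n_bins)) for s in resampled]))
        scores.append(max(per_scout))
    return median(scores)
```

A triplet would be kept for permutation testing when `meta_rho_diff(...)` exceeds the cutoff (0.4 in this section).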
In the second step, sign filtering, we filter out triplets with inconsistent signs of test statistics between the meta-analysis and the single-study MLA (singleMLA) analyses. The scouting gene is chosen to maximize the test statistic of MetaMLA. In other words, we keep the triplets satisfying
$$\prod_{k=1}^{K} I\left( \mathrm{sign}(mMLA_t^{(i_0)}) \times \mathrm{sign}(T_{MLA,t}^{(k,i_0)}) = 1 \right) = 1,$$
where $I(\cdot)$ is the indicator function and $i_0 = \arg\max_{i = 1, 2, 3} |mMLA_t^{(i)}|$. For fair comparison, we use the same triplets filtered by MetaMLA to perform MetaLA and single-study MLA.
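The sign filter reduces to a simple check per triplet, sketched here with a hypothetical function name `passes_sign_filter` and toy statistics.

```python
def sign(value):
    """Return 1, -1 or 0 according to the sign of value."""
    return (value > 0) - (value < 0)

def passes_sign_filter(meta_scores, study_stats):
    """meta_scores: the three meta statistics, one per candidate
    scouting gene. study_stats[i]: the K single-study statistics when
    gene i is the scouting gene. Keep the triplet only if every study
    agrees in sign with the meta statistic of the best scouting gene."""
    i0 = max(range(3), key=lambda i: abs(meta_scores[i]))
    target = sign(meta_scores[i0])
    return target != 0 and all(sign(t) == target for t in study_stats[i0])

meta = [0.2, -1.5, 0.7]
agree = [[1.0, -1.0, 1.0], [-2.0, -1.0, -3.0], [1.0, 1.0, -1.0]]
disagree = [[1.0, -1.0, 1.0], [-2.0, 1.0, -3.0], [1.0, 1.0, -1.0]]
print(passes_sign_filter(meta, agree), passes_sign_filter(meta, disagree))
```

Only the statistics for the selected scouting gene are inspected, so disagreement under the other two scouting-gene choices does not remove a triplet.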
3 Results
3.1 Computational reduction by filtering
Below we describe the screening results, which avoid the high computational load of evaluating all possible triplets in MetaMLA. After unbiased filtering of non-expressed and non-informative genes, we kept 1770 genes, which led to a total of $\binom{1770}{3} \approx 9.23 \times 10^8$ triplets. The computing time is demanding if we perform hypothesis testing for all possible triplets. By applying bootstrap filtering with $\rho_{diff,t}^{(meta)} > 0.4$ and three bins, the number of triplets was reduced to $2.18 \times 10^7$, 2.36% of the original total. Furthermore, the sign filtering step decreased the number of remaining triplets to $1.21 \times 10^7$, which was only 1.32% of the total number.
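The size of the search space can be checked with a short calculation. The retained counts below are the rounded figures quoted in the text, so the resulting fractions can differ from the reported percentages in the last digit.

```python
from math import comb

n_genes = 1770
n_triplets = comb(n_genes, 3)  # unordered triplets of distinct genes
print(n_triplets)              # 922639640, i.e. about 9.23e8

# Rounded counts quoted in the text after each filtering step.
after_bootstrap = 2.18e7
after_sign = 1.21e7
print(f"{after_bootstrap / n_triplets:.2%}")
print(f"{after_sign / n_triplets:.2%}")
```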
Because our screening pipeline dramatically reduces the number of triplets, we assessed whether the filtering procedures discard statistically significant LA triplets. We performed MetaMLA on all $9.23 \times 10^8$ triplets and on the 1.32% of triplets retained after filtering. As shown in Supplementary Table S1, our screening steps missed only 89, 219, 375, 520 and 690 of the top 2000, 4000, 6000, 8000 and 10 000 triplets obtained from the full analysis without filtering. P-values from Fisher's exact test are almost 0 and odds ratios are between 1000 and 1600 (Supplementary Table S1). In summary, we missed only about 5% of significant triplets but saved almost 99% of computing time, making genome-wide LA triplet screening possible. Since the filtering step also consumes computing time, we compared the computing time of analyses with and without filtering on a small dataset of 95 genes (using stringent selection criteria that remove genes with small means and small variances). Using five computing threads (Intel Xeon E7-2850), the analysis with filtering saved about 88% of computing time compared with the analysis without filtering (16.3 versus 134.6 min).
In general, filtering out potentially non-significant triplets will gain statistical power (Bourgon et al., 2010; van Iterson et al., 2010). In other words, we can detect more significant triplets under the same FDR control. To demonstrate the empirical effect of filtering in real data, we randomly selected 500 genes from the five yeast studies and re-ran our MetaMLA algorithm with both the filtering and non-filtering pipelines. Supplementary Figure S1 shows that, for a given reasonable FDR (e.g. 0.005 and 0.01), the filtering pipeline detects more significant triplets than the full analysis, as expected.
3.2 MetaMLA detects more over-represented pathways
We performed pathway enrichment analysis using GO and KEGG for all the genes from the top $m$ significant triplets ($m = 200, 300, \ldots, 1000$) selected by single-study MLA, MetaMLA and MetaLA. Figure 3 shows the numbers of enriched GO terms and KEGG pathways for different numbers of top triplets under the FDR = 0.05 threshold. MetaMLA (solid square line) consistently performed better than any single-study MLA (five dashed lines) and MetaLA (solid rhombus line) by detecting more enriched pathways. Jitter plots of q-values of the GO terms and KEGG pathways for the top 500 triplets on the minus log 10 scale are further shown in Supplementary Figure S2. Since the single-study MLA and MetaMLA methods can differentiate the LA scouting gene Z, similar pathway enrichment analyses were done using only the Z genes from the top triplets (Supplementary Figs S3 and S4).
3.3 MetaMLA provides more consistent biomarker and
pathway results with single study analyses
Figure 1A shows an example with LA association in the first study (correlation dropped from 0.692 to −0.790 between the low and high expression groups of the LA scouting gene YGR175C) that fails to reproduce in the remaining four studies. Such an LA association with failed reproducibility is likely a false positive. Figure 1B demonstrates another example with consistent LA association in all five studies (correlation dropped significantly between the expression groups of YDR519W). To inspect the agreement of top LA triplets between pairs of studies, Supplementary Figures S5 and S6 show scatter plots of test statistics and rank correlations of the pairwise top 1000 triplets. The MetaMLA method combines information from all single studies. Conceptually, MetaMLA results should agree better with each single-study MLA result than the single-study MLA results agree with one another. In Figure 4A, we examined the pairwise overlap of the top 1000 triplets detected by the five single-study MLA analyses and MetaMLA. The result shows zero overlap among the top triplets of any two single-study MLA analyses. (We also tried other numbers of top triplets in Supplementary Fig. S7, and they all show small overlap among single studies.) On the other hand, top triplets from MetaMLA have a much higher percentage of overlap with the results from each single-study MLA.
We next calculated the overlap of enriched GO terms and KEGG pathways when all the genes from the top 500 triplets of each MLA analysis were used for pathway enrichment. The results are shown in Figure 4B and C. Numbers on the diagonal cells give the number of enriched GO terms or KEGG pathways from each single-study MLA and MetaMLA. (Similarly, overlapped pathways using only the Z genes from the top 1000 triplets are shown in Supplementary Fig. S8.) Similar to the overlapped triplets in Figure 4A, we observed much higher pathway overlap between the MetaMLA result and each single-study MLA result than between pairs of single-study MLA results. For example, study Causton detected 27 enriched GO terms, among which 8, 9, 6 and 9 overlapped with results from the other four single-study MLA analyses. Notably, it has 12 and 15 GO terms overlapping with MetaLA and MetaMLA, respectively. Comparing the two meta-analytic methods, MetaMLA performed much better than MetaLA.
Fig. 3. The number of enriched gene sets for all the genes from different numbers of top triplets detected by meta and single analysis. (A) is for GO terms and (B) is for KEGG pathways (Color version of this figure is available at Bioinformatics online.)
Fig. 4. Overlap of meta and single analysis. (A) is for the number of overlapped triplets for the top 1000 significant triplets; (B) is for the number of overlapped enriched GO terms using all the genes from top 500 triplets for gene set enrichment analysis; (C) is for the number of overlapped enriched KEGG pathways using all the genes from top 500 triplets for gene set enrichment analysis (Color version of this figure is available at Bioinformatics online.)
3.4 MetaLA and MetaMLA provide more stable results
Below we apply subsampling and bootstrap techniques to compare the stability of LA triplets detected by single-study MLA, MetaLA and MetaMLA. Figure 5A and B show the number of overlapped triplets between the top triplets detected from the original full dataset and from subsampled datasets (90 and 80%, respectively). The numbers of top triplets are displayed on the x-axis and the numbers of overlapping triplets on the y-axis. The result shows much better reproducibility of top triplets detected from subsampled data in MetaMLA (solid square line) and MetaLA (solid rhombus line) compared to single-study MLA (five dashed lines). Comparison with bootstrapped data in Figure 5C shows a similar trend. In summary, MetaMLA provides better stability in detecting top LA triplets than single-study MLA. MetaLA further outperforms MetaMLA.
3.5 Pathway enrichment analysis and network
visualization
In Sections 3.2–3.4, although MetaLA provides more stable results than MetaMLA (Section 3.4), it detects far fewer enriched pathways (Section 3.2) and generates biomarkers and pathways that are less consistent with single studies (Section 3.3). As a result, we focus on MetaMLA for further biological investigation in this subsection.
To test how consistent the LA genes detected by the MetaMLA method are with TF binding, we downloaded the TF-binding gene sets from the YEASTRACT database (Teixeira et al., 2013) and selected 96 gene sets with 5–200 genes. Among these 96 TF genes, Hog1 (YLR113W) has the highest frequency among all the genes from the top 20 000 triplets detected by the MetaMLA method. Genes inside the same triplet as Hog1 are enriched in the Hog1 binding gene set (P = 0.027). More significantly, Hog1 is also the most frequent gene among the LA scouting genes Z in the top 100 000 triplets. Genes regulated by Hog1 (inside the same triplets) are more significantly enriched in the Hog1 binding gene set (P = 1.44E-5). Supplementary Table S2 shows the top enriched TF binding gene sets. Among them, Hot1 is another enriched gene set (P = 7.67E-6), and Alepuz et al. (2003) show that Hot1 targets Hog1p to osmostress-responsive promoters and that Hog1 mediates recruitment/activation of RNAPII at Hot1p-dependent promoters. The analysis shows that top triplets selected by the MetaMLA method are highly consistent with known TF regulation patterns.
Table 1 shows 18 significantly enriched KEGG pathways with hierarchical structure using all the genes from the top 500 triplets selected by MetaMLA. Pathway enrichment using the GO database identified 68 GO terms (Supplementary Table S3). Since the five transcriptomic studies contain yeast samples treated with different environmental conditions and mutations, we observed many pathways related to energy metabolism (q = 5.67E-12), carbohydrate metabolism (q = 1.40E-8), amino acid metabolism (q = 5.87E-8) and translation (q = 0.0065).
To further investigate the identified LA gene interactions, we chose among the top 20 000 LA triplets (q < 6.64E-5) and included a total of 41 triplets with all three genes involved in the metabolism category (q = 1.85E-21 in Table 1) for network visualization (Fig. 6). Genes within one triplet are connected by edges in the same color. The dashed lines represent reported interactions or regulations in the PPI database. In this network, a total of four interactions are validated by the PPI database, more than expected from a randomly generated PPI database (0.69 random interactions on average, with P-value 0.00197 by Fisher's exact test).
Fig. 5. The number of overlapped top significant triplets between the original dataset and the subsampled or bootstrap datasets. (A) and (B) show the means and standard errors (SEs) over ten subsamplings at proportions 0.90 and 0.80, respectively; (C) shows the means and SEs over ten bootstrap replicates (Color version of this figure is available at Bioinformatics online.)
In Figure 6, we observed a cluster of gene modules related to
carbohydrate metabolism (purple background circle in Fig. 6; almost
all genes annotated with gray dots). IDH2 and BDH2 are two notable
hub genes that have many LA associations with neighboring genes.
IDH2 is a subunit of mitochondrial NAD(+)-dependent isocitrate
dehydrogenase, a key complex in the tricarboxylic acid (TCA) cycle
that catalyzes the oxidation of isocitrate to alpha-ketoglutarate
(Reinders et al., 2007). BDH2 is a putative medium-chain alcohol
dehydrogenase (Dickinson et al., 2003). In carbohydrate metabolism,
pyruvate is the main input for the series of chemical reactions of
the aerobic TCA cycle. In the subnetwork, CDC19 is a key pyruvate
kinase, which converts phosphoenolpyruvate to pyruvate (Byrne and
Wolfe, 2005; Xu et al., 2012), and its physical PPI with IDH2 has
been previously reported in Gavin et al. (2006). In addition, PDC5 is
a minor isoform of pyruvate decarboxylase and PDC1 is the major of
three pyruvate decarboxylase isozymes that decarboxylate pyruvate to
acetaldehyde (Dickinson et al., 2003). ENO2 is a phosphopyruvate
hydratase involved in pyruvate metabolism that catalyzes the
conversion of 2-phosphoglycerate to phosphoenolpyruvate during
glycolysis (Byrne and Wolfe, 2005; McAlister and Holland, 1982). All
these genes from the top MetaMLA triplets are potentially
co-regulated, with functional annotation from the carbohydrate
metabolism pathway. However, if we examine the direct gene–gene
Pearson correlations, the pair-wise correlations are low and
co-expression analysis would fail to identify associations among
these genes (see Supplementary Table S4).
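To illustrate why pair-wise correlation can miss such triplets, the following minimal Python sketch simulates toy data (not the yeast datasets) in which the correlation between two hypothetical genes X and Y switches sign with the level of a scouting gene Z. The variable names, the tanh form of the conditional correlation and the cut-offs at -0.5 and 0.5 are illustrative assumptions.

```python
import math
import random

def pearson(a, b):
    """Plain Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

rng = random.Random(0)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]  # toy scouting gene
x = [rng.gauss(0, 1) for _ in range(n)]  # toy gene X

# Toy gene Y: its correlation with X given Z is tanh(Z) (an assumption),
# i.e. negative when Z is low and positive when Z is high.
y = []
for xi, zi in zip(x, z):
    rho = math.tanh(zi)
    y.append(rho * xi + math.sqrt(1 - rho * rho) * rng.gauss(0, 1))

def subset_corr(keep):
    """Pearson correlation of X and Y restricted to samples where keep(Z) holds."""
    pairs = [(xi, yi) for xi, yi, zi in zip(x, y, z) if keep(zi)]
    return pearson([p[0] for p in pairs], [p[1] for p in pairs])

overall = pearson(x, y)
low_z = subset_corr(lambda zi: zi < -0.5)
high_z = subset_corr(lambda zi: zi > 0.5)
print(f"overall r = {overall:.3f}, r | low Z = {low_z:.3f}, r | high Z = {high_z:.3f}")
```

In such a setting the overall Pearson correlation is expected to be close to zero, while the correlations within the low-Z and high-Z subsets have opposite signs, which is the pattern LA is designed to capture.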
Table 1. Enriched KEGG pathways and their hierarchical categories for all the genes from the top 500 triplets selected by the MetaMLA method
Entry and category                      P-value   q-value   Odds ratio  Count  Size  Name
Metabolism                              1.53E-23  1.85E-21  2.69        200    835
  Energy metabolism                     9.37E-14  5.67E-12  5.36        43     122
    sce00190                            2.03E-12  8.21E-11  7.91        29     72    Oxidative phosphorylation
    sce00680                            0.007957  0.041109  3.49        8      28    Methane metabolism
  Carbohydrate metabolism               4.64E-10  1.40E-08  2.86        62     229
    sce00620                            1.91E-06  3.30E-05  5.53        17     39    Pyruvate metabolism
    sce00630                            3.15E-05  0.000423  6.14        12     26    Glyoxylate and dicarboxylate metabolism
    sce00020                            6.30E-05  0.000763  4.99        13     32    Citrate cycle (TCA cycle)
    sce00010                            0.000527  0.004249  2.99        17     58    Glycolysis/Gluconeogenesis
    sce00051                            0.005765  0.034881  3.76        8      25    Fructose and mannose metabolism
    sce00030                            0.007158  0.039369  3.23        9      28    Pentose phosphate pathway
  Amino acid metabolism                 2.42E-09  5.87E-08  3.06        51     178
    sce00260                            6.40E-08  1.29E-06  8.99        16     32    Glycine, serine and threonine metabolism
    sce00270                            0.000146  0.001468  4.44        13     36    Cysteine and methionine metabolism
    sce00250                            0.002637  0.016791  3.60        10     30    Alanine, aspartate and glutamate metabolism
  Lipid metabolism                      0.000913  0.006501  2.11        29     126
    sce00100                            0.000183  0.001705  6.89        9      17    Steroid biosynthesis
    sce01040                            0.000454  0.003927  12.21       6      11    Biosynthesis of unsaturated fatty acids
    sce00062                            0.002175  0.014623  10.16       5      8     Fatty acid elongation
  Metabolism of cofactors and vitamins  0.011664  0.052272  1.80        24     117
    sce00670                            0.008629  0.041763  4.57        6      15    One carbon pool by folate
Genetic information processing          0.214481  0.447452  1.10        114    1123
  Translation                           0.000861  0.006501  1.59        70     682
    sce03010                            9.42E-05  0.001036  2.22        37     181   Ribosome
  Folding, sorting and degradation      0.105526  0.283748  1.27        41     263
    sce03050                            0.008154  0.041109  2.91        10     35    Proteasome
Cellular processes                      0.718769  0.995721  0.92        45     382
  Transport and catabolism              0.010656  0.049591  1.61        36     194
    sce04145                            2.96E-05  0.000423  5.07        14     36    Phagosome
Fig. 6. Gene network associated with metabolism. Genes within one triplet are connected by edges in the same color. A dashed line indicates that the edge is in the PPI database. A small circle connected to a gene indicates that the gene belongs to the corresponding subcategory (Color version of this figure is available at Bioinformatics online.)
4 Conclusion and discussion
In this article, we proposed two meta-analytic methods (MetaLA and
MetaMLA) for LA analysis combining multiple studies. We used the mean
of the singleMLA test statistics as the main part of the MetaMLA
statistic and the SD to penalize inconsistent patterns among
different studies. For the genome-wide application, we proposed to
screen genes by bootstrap filtering and sign filtering (Fig. 2) to
reduce the computational load. In the yeast datasets, we reduced the
number of triplets for hypothesis testing by >98% and captured
94–95% of the top triplets with large MetaMLA statistic. When
compared with the singleMLA method, MetaMLA provides stronger pathway
enrichment signal, results more consistent with single-study
analysis, and more stable results under data subsampling or
bootstrapping. Although MetaLA generated more stable results than
MetaMLA, it detected fewer enriched pathways and was less consistent
with single-study analysis. Among the top significant triplets
selected by MetaMLA, we constructed a gene regulatory network
visualization to investigate the complex three-way conditional
associations. The result identifies a subnetwork in the carbohydrate
metabolism network that otherwise cannot be identified by traditional
pair-wise co-expression analysis. The subnetwork was supported by
validation in the PPI database and by focused functional annotation
in the TCA cycle.
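As a rough illustration of the penalization idea, the Python sketch below combines per-study statistics by dividing their mean by (SD + c). This specific form, the constant c and the helper names are assumptions made for illustration and should not be read as the exact MetaMLA definition; the sign check is likewise only a simplified stand-in for a sign filtering step. The input values are toy numbers.

```python
from statistics import mean, stdev

def penalized_meta_statistic(study_stats, c=1.0):
    """Mean of per-study statistics divided by (SD + c).

    Assumed illustrative form; c is a hypothetical constant that keeps the
    denominator away from zero when the studies agree almost perfectly.
    """
    return mean(study_stats) / (stdev(study_stats) + c)

def same_sign(study_stats):
    """Hypothetical sign check: True when all studies agree in direction."""
    return all(s > 0 for s in study_stats) or all(s < 0 for s in study_stats)

# Toy per-study statistics for two triplets with the same mean.
consistent = [2.1, 1.9, 2.3, 2.0, 2.2]
inconsistent = [4.5, -1.0, 3.8, -0.6, 3.8]

score_consistent = penalized_meta_statistic(consistent)
score_inconsistent = penalized_meta_statistic(inconsistent)
print(score_consistent, score_inconsistent)
print(same_sign(consistent), same_sign(inconsistent))
```

With equal means, the triplet whose statistics vary more across studies receives the smaller score, which is the intended effect of the SD penalty.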
The LA and MLA methods to detect LA triplets have their pros and
cons. On the one hand, the LA score by three-product estimation on
normalized gene intensities is much easier to compute than the
model-free estimation of the MLA score. On the other hand, MLA is
more accurate when interdependency among the triplet exists (i.e. the
conditional mean and variance of two genes depend on the third gene),
and such interdependency is theoretically ignored by the LA method.
MLA additionally provides systematic inference for P-values and FDR
control. To circumvent the computational burden of MetaMLA, our
proposed two-stage filtering can significantly reduce computing time.
In this article, we demonstrate genome-wide screening on all possible
gene triplets. To further reduce the computational load, one may
restrict the analysis to scouting genes pre-selected from prior
biological knowledge, TF or PPI databases.
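For reference, the three-product LA estimate can be sketched in a few lines of Python. The sketch assumes that normalization means a rank-based normal-score transform of each gene; the function names normal_score and la_score are hypothetical, and the simulated triplet is toy data rather than any of the yeast studies.

```python
import math
import random
from statistics import NormalDist

def normal_score(values):
    """Rank-based normal-score transform (assumed normalization step).

    Ties are broken by input order, which is adequate for continuous data.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    inv_cdf = NormalDist().inv_cdf
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = inv_cdf(rank / (n + 1))  # ranks mapped into (0, 1)
    return scores

def la_score(x, y, z):
    """Three-product LA estimate: average of x*y*z after normalization."""
    xs, ys, zs = normal_score(x), normal_score(y), normal_score(z)
    return sum(a * b * c for a, b, c in zip(xs, ys, zs)) / len(xs)

# Toy triplet: corr(X, Y | Z) = tanh(Z), so co-expression of X and Y
# increases with the scouting gene Z.
rng = random.Random(1)
n = 5000
z = [rng.gauss(0, 1) for _ in range(n)]
x = [rng.gauss(0, 1) for _ in range(n)]
y = [math.tanh(zi) * xi + math.sqrt(1 - math.tanh(zi) ** 2) * rng.gauss(0, 1)
     for xi, zi in zip(x, z)]
w = [rng.gauss(0, 1) for _ in range(n)]  # control gene independent of X and Z

la_dependent = la_score(x, y, z)    # expected to be clearly positive
la_independent = la_score(x, w, z)  # expected to be near zero
print(f"LA(X, Y | Z) = {la_dependent:.3f}, LA(X, W | Z) = {la_independent:.3f}")
```

The estimate costs a single pass over the samples per triplet, which is why it is much cheaper than MLA; the trade-off is that it relies on the normalization alone and does not model how the conditional means and variances change with Z.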
Our meta-analytic framework has the advantage of stably combining
multiple studies from different microarray or next-generation
sequencing platforms. Potential heterogeneity from platform, batch
effect or measurement scaling issues is automatically standardized in
the meta-analysis. In the literature, it has been well acknowledged
that simple correlation and co-expression analysis are not sufficient
to describe the complex system of gene regulation. Applying advanced
association models elucidates novel regulatory mechanisms, and
meta-analysis combining multiple transcriptomic studies will greatly
reduce false positive findings. Our proposed meta-analytic LA methods
help accurately detect complicated three-way interactions and
regulatory mechanisms.
Funding
This work was supported by the National Institutes of Health
(R01CA190766 to S.L. and G.C.T.); China Scholarship Council
(201508110051 to L.W.); National Natural Science Foundation of China
(11526146 to L.W.); Scientific Research Level Improvement Quota Project of
Capital University of Economics and Business (to L.W.); and University of
Minnesota Grant-In-Aid (to Y.Y.H.).
Conflict of Interest: none declared.
References
Alepuz,P.M. et al. (2003) Osmostress-induced transcription by Hot1 depends on a Hog1-mediated recruitment of the RNA Pol II. EMBO J., 22, 2433–2442.
Altman,N.S. (1992) An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat., 46, 175–185.
Amaratunga,D. and Cabrera,J. (2001) Analysis of data from viral DNA microchips. J. Am. Stat. Assoc., 96, 1161–1170.
Benjamini,Y. and Hochberg,Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Roy. Stat. Soc. B., 57, 289–300.
Bourgon,R. et al. (2010) Independent filtering increases detection power for high-throughput experiments. Proc. Natl. Acad. Sci. USA, 107, 9546–9551.
Butte,A.J. and Kohane,I.S. (2000) Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac. Symp. Biocomput., 5, 418–429.
Byrne,K.P. and Wolfe,K.H. (2005) The yeast gene order browser: combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Res., 15, 1456–1461.
Causton,H.C. et al. (2001) Remodeling of yeast genome expression in response to environmental changes. Mol. Biol. Cell, 12, 323–337.
Chasman,D. et al. (2014) Pathway connectivity and signaling coordination in the yeast stress-activated signaling network. Mol. Syst. Biol., 10, 759.
Cherry,J.M. et al. (2011) Saccharomyces genome database: the genomics resource of budding yeast. Nucleic Acids Res., 40, D700–D705.
Dickinson,J.R. et al. (2003) The catabolism of amino acids to long chain and complex alcohols in Saccharomyces cerevisiae. J. Biol. Chem., 278, 8028–8034.
Efron,B. and Tibshirani,R. (1986) Bootstrap methods for standard errors, confidence intervals, and other measures of statistical accuracy. Stat. Sci., 1, 54–75.
Gasch,A.P. et al. (2000) Genomic expression programs in the response of yeast cells to environmental changes. Mol. Biol. Cell, 11, 4241–4257.
Gavin,A.-C. et al. (2006) Proteome survey reveals modularity of the yeast cell machinery. Nature, 440, 631–636.
Gunderson,T. and Ho,Y.-Y. (2014) An efficient algorithm to explore liquid association on a genome-wide scale. BMC Bioinformatics, 15, 371.
Ho,Y.-Y. et al. (2011) Modeling liquid association. Biometrics, 67, 133–141.
Hughes,T.R. et al. (2000) Functional discovery via a compendium of expression profiles. Cell, 102, 109–126.
Kanehisa,M. et al. (2016) KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res., 44, D457–D462.
Knijnenburg,T.A. et al. (2009) Combinatorial effects of environmental parameters on transcriptional regulation in Saccharomyces cerevisiae: a quantitative analysis of a compendium of chemostat-based transcriptome data. BMC Genomics, 10, 1.
Li,K.-C. (2002) Genome-wide coexpression dynamics: theory and application. Proc. Natl. Acad. Sci. USA, 99, 16875–16880.
Li,K.-C. et al. (2004) A system for enhancing genome-wide coexpression dynamics study. Proc. Natl. Acad. Sci. USA, 101, 15561–15566.
McAlister,L. and Holland,M.J. (1982) Targeted deletion of a yeast enolase structural gene. Identification and isolation of yeast enolase isozymes. J. Biol. Chem., 257, 7181–7188.
Reinders,J. et al. (2007) Profiling phosphoproteins of yeast mitochondria reveals a role of phosphorylation in assembly of the ATP synthase. Mol. Cell. Proteomics, 6, 1896–1906.
Song,L. et al. (2012) Comparison of co-expression measures: mutual information, correlation, and model based indices. BMC Bioinformatics, 13, 328.
Teixeira,M.C. et al. (2013) The YEASTRACT database: an upgraded information system for the analysis of gene and genomic transcription regulation in Saccharomyces cerevisiae. Nucleic Acids Res., 42, D161–D166.
Upton,G.J. (1992) Fisher’s exact test. J. Roy. Stat. Soc. A Stat., 155, 395–402.
van Iterson,M. et al. (2010) Filtering, FDR and power. BMC Bioinformatics, 11, 450.
Wolfe,C.J. et al. (2005) Systematic survey reveals general applicability of ‘guilt-by-association’ within gene coexpression networks. BMC Bioinformatics, 6, 1.
Xu,Y.-F. et al. (2012) Regulation of yeast pyruvate kinase by ultrasensitive allostery independent of phosphorylation. Mol. Cell, 48, 52–62.
Zhang,B. et al. (2005) A general framework for weighted gene co-expression network analysis. Stat. Appl. Genet. Mol. Biol., 4, 1128.
Zhang,J. et al. (2007) Extracting three-way gene interactions from microarray data. Bioinformatics, 23, 2903–2909.