Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies.
ABSTRACT Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies are hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious false associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods, and finds in particular substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at http://ml.sheffield.ac.uk/qtl/.
- Pekka Marttinen, Matti Pirinen, Antti-Pekka Sarin, Jussi Gillberg, Johannes Kettunen, Ida Surakka, Antti J Kangas, Pasi Soininen, Paul F O'Reilly, Marika Kaakinen, Mika Kähönen, Terho Lehtimäki, Mika Ala-Korpela, Olli T Raitakari, Veikko Salomaa, Marjo-Riitta Järvelin, Samuli Ripatti, Samuel Kaski[Show abstract] [Hide abstract]
ABSTRACT: A typical genome-wide association study searches for associations between single nucleotide polymorphisms (SNPs) and a univariate phenotype. However, there is a growing interest to investigate associations between genomics data and multivariate phenotypes, for example in gene expression or metabolomics studies. A common approach is to perform a univariate test between each genotype-phenotype pair, and then to apply a stringent significance cutoff to account for the large number of tests performed. However, this approach has limited ability to uncover dependencies involving multiple variables. Another trend in the current genetics is the investigation of the impact of rare variants on the phenotype, where the standard methods often fail due to lack of power when the minor allele is present in only a limited number of individuals. We propose a new statistical approach based on Bayesian reduced rank regression to assess the impact of multiple SNPs on a high-dimensional phenotype. Due to the method's ability to combine information over multiple SNPs and phenotypes, it is particularly suitable for detecting associations involving rare variants. We demonstrate the potential of our method and compare it with alternatives using the Northern Finland Birth Cohort with 4,702 individuals, for whom genome-wide SNP data along with lipoprotein profiles comprising 74 traits are available. We discovered two genes (XRCC4 and MTHFD2L) without previously reported associations, which replicated in a combined analysis of two additional cohorts: 2,390 individuals from the Cardiovascular Risk in Young Finns study and 3,659 individuals from the FINRISK Study. R-code freely available for download at http://users.ics.aalto.fi/pemartti/gene_metabolome/. samuli.ripatti@helsinki.fi, samuel.kaski@aalto.fi.Bioinformatics 03/2014; · 5.47 Impact Factor - [Show abstract] [Hide abstract]
ABSTRACT: Abstract Sparse modeling, a feature selection method widely used in the machine-learning community, has been recently applied to identify associations in genetic studies including expression quantitative trait locus (eQTL) mapping. These genetic studies usually involve high dimensional data where the number of features is much larger than the number of samples. The high dimensionality of genetic data introduces a problem that there exist multiple solutions for optimizing a sparse model. In such situations, a single optimization result provides only an incomplete view of the data and lacks power to find alternative features associated with the same trait. In this article, we propose a novel method aimed to detecting alternative eQTLs where two genetic variants have alternative relationships regarding their associations with the expression of a particular gene. Our method accomplishes this goal by exploring multiple solutions sampled from the solution space. We proved our method theoretically and demonstrated its usage on simulated data. We then applied our method to a real eQTL data and identified a set of alternative eQTLs with potential biological insights. Additionally, these alternative eQTLs implicate a network view of understanding gene regulation.Journal of computational biology: a journal of computational molecular cell biology 04/2014; · 1.69 Impact Factor - SourceAvailable from: Chun Jimmie Ye[Show abstract] [Hide abstract]
ABSTRACT: Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.Genome biology 04/2014; 15(4):R61. · 10.30 Impact Factor
Page 1
Joint Modelling of Confounding Factors and Prominent
Genetic Regulators Provides Increased Accuracy in
Genetical Genomics Studies
Nicolo ´ Fusi1.*, Oliver Stegle2.*, Neil D. Lawrence1*
1Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield, United Kingdom, 2Machine Learning and Computational Biology Research Group,
Max Planck Institute for Developmental Biology, Tu ¨bingen, Germany
Abstract
Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene
expression variation. A major challenge in the analysis of such studies are hidden confounding factors, such as unobserved
covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation
structure in the expression profiles, which may create spurious false associations or mask real genetic association signals.
Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding
factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of
prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals
from confounding variation. We applied our model and compared it to existing methods on different datasets and
biological systems. PANAMA consistently performs better than alternative methods, and finds in particular substantially
more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits
that are biologically more plausible and can be better reproduced between independent studies. A software
implementation of PANAMA is freely available online at http://ml.sheffield.ac.uk/qtl/.
Citation: Fusi N, Stegle O, Lawrence ND (2012) Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in
Genetical Genomics Studies. PLoS Comput Biol 8(1): e1002330. doi:10.1371/journal.pcbi.1002330
Editor: Matthew Stephens, University of Chicago, United States of America
Received August 11, 2011; Accepted November 13, 2011; Published January 5, 2012
Copyright: ? 2012 Fusi et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted
use, distribution, and reproduction in any medium, provided the original author and source are credited.
Funding: This research was supported by the FP7 PASCAL II Network of Excellence. NF was supported by PhD scholarships from the University of Sheffield and
the University of Manchester. OS was supported by a fellowship from the Volkswagen Foundation. The funders had no role in study design, data collection and
analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have declared that no competing interests exist.
* E-mail: nicolo.fusi@sheffield.ac.uk (NF); oliver.stegle@tuebingen.mpg.de (OS); N.Lawrence@sheffield.ac.uk (NDL)
. These authors contributed equally to this work.
Introduction
Genome-wide analysis of the regulatory role of polymorphic loci
on gene expression has been carried out in a range of different
study designs and biological systems. For example, association
mapping in human has uncovered an abundance of cis associations
that contribute to the variation of a third of all human genes [1,2].
In segregating yeast strains, linkage studies have revealed extensive
genetic trans regulation, with a few regulatory hotspots controlling
the expression profiles of tens or hundreds of genes [3,4].
Despite the success of such expression quantitative trait loci
(eQTL) studies, it has also become clear that the analysis of these
data comes along with non-trivial statistical hurdles [5]. Different
types of external confounding factors, including environment or
technical influences, can substantially alter the outcome of an
eQTL scan. Unobserved confounders can both obscure true
association signals and create new spurious associations that are
false [6,7].
Suitable data preprocessing, or careful design of randomized
studies are helpful measures to avoid confounders in the first place
[8], however they rarely rule out confounding influences entirely.
It is also relatively straightforward to account for those factors that
are known and measured. For example, it is standard procedure to
include covariates such as age and gender in the analysis [9,10].
Similarly, the effect of populational relatedness between samples, a
confounding effect that is observed or can be reliably estimated
form the genotype data [11,12], is usually included in the model.
However other factors, including subtle environmental or
technical influences, often remain unknown to the experimenter,
but still need to be accounted for. Their potential impact has
previously been characterized in multiple studies; for example
Plagnol et al. [13] and Locke et al. [14] showed that virtually any
aspect of sample handling can impact the analysis.
Several computational methods have been developed to account
for unknown confounding variation within eQTL analyses
[2,6,7,15,16]. A common assumption these methods built on is
that confounders are prone to exhibit broad effects, influencing
large fractions of the measured gene expression levels. This
characteristic has been exploited to learn the profile of hidden
confounders using models that are related to PCA [2,6,15]. Once
learnt, these factors can then be included in the analysis
analogously to known covariates. Another branch of methods
avoids recovering the hidden factors explicitly, instead correcting
for the correlation structure they induce between the samples
[7,16]. Here, the inter-sample correlation is estimated from the
expression profiles first, to then account for its influence in an
association scan using mixed linear models. Both types of methods
have been applied in a number of studies. Advantages versus naive
PLoS Computational Biology | www.ploscompbiol.org1 January 2012 | Volume 8 | Issue 1 | e1002330
Page 2
analysis include better-calibrated test statistics [16] and improved
reproducibility of hits between independent studies [7]. Perhaps
most strikingly, statistical methods to correct for hidden con-
founders have also been shown to substantially increase the power
to detect eQTLs, increasing the number of significant cis
associations by up to 3-fold [2,17].
While improved sensitivity to detect cis-acting eQTLs is an
important and necessary step, we expect that even more valuable
insights can be gained from those loci that regulate multiple target
genes in trans. The interest in these regulatory hotspots has been
tremendous in recent years, but limited reproducibility between
studies has been a concern (see for example the discussion in
Breitling et al. [18]). Accurate correction for confounding factors is
key to improve the reliability of these regulatory associations,
however statistical overlap between confounding factors and true
association signals from downstream effects can hamper the
identification and fitting of confounders. For example, methodol-
ogy that merely accounts for broad variance components, such as
PCA, is doomed to fail. If the effect size of trans regulatory hotspots
is large enough, they induce a correlation structure that is similar
to the one caused by confounding factors. As a result, true trans
regulators tend to be mistaken for confounders and are
erroneously explained away.
Here, we report an integrated probabilistic model PANAMA
(Probabilistic ANAlysis of genoMic dAta) to address these
shortcoming of established approaches. PANAMA learns a
dictionary of confounding factors from the observed expression
profiles. Unique to PANAMA is to jointly learn these factors while
accounting for the effect of loci with a pronounced trans regulatory
effect, thereby avoiding overlaps between true genetic association
signals and the covariance structure induced by the learnt
confounders. The statistical model underlying our algorithm is
simple and computationally tractable for large eQTL datasets.
PANAMA is based on the framework of mixed linear models, and
combines the advantages of factor-based methods, such as PCA,
SVA [6] or PEER [2,15] with methods that estimate the implicit
covariance structure induced by confounding variation [12,16].
The model is fully automated and can be easily adapted to include
additional observed confounding sources of variation, such as
population structure or known covariates.
We applied PANAMA to a range of eQTL studies, including
synthetic data and studies from yeast, mouse and human. Across
datasets, PANAMA performed better than previous methods,
identifying a greater number of significant eQTLs and in
particular additional trans regulators. We provide multiple sources
of evidence that the associations recovered by PANAMA are
indeed likely to be real. Most strikingly in yeast, the findings by
PANAMA can be better reproduced between independent studies
and are more consistent with prior knowledge about the
underlying regulatory network. Finally, we also give insights into
the limitations of current methods to account for confounders that
help to understand the relationship between confounding
variation, cis regulation and trans effects.
Results
Learning of confounding factors in the presence of trans
regulators
The statistical model underlying PANAMA assumes additive
contributions from true genetic effects and hidden confounding
factors. Briefly, this linear model expresses the gene expression of
gene g measured in N individuals as the sum of weighted
contributions from a set of K SNPs S~fs1,...,sKg as well as Q
confounders X~fx1,...,xQg, a mean term mgand a noise term
Eg(See Figure 1a)
yg~mgz
X
K
k~1
vk,gskz
X
Q
q~1
wg,qxqzE E E Eg:
Neither the regression weights wg,q nor the profiles of the
confounding factors xqare known a priori and hence need to be
learnt from the expression data. Parameter inference in PANAMA
is done in the mixed model framework [12,19]. In this hierarchical
model, the regression weights of the hidden factors are
marginalized out, yielding a covariance structure in a multivariate
Gaussian model to capture the effect of confounders. Intuitively,
the objective during learning in PANAMA is to explain the
empirical correlation structure between samples shared across
genes by the state of the hidden factors. In the presence of
extensive trans regulation this approach leads to over-correction,
running the risk of explaining away true genetic association
signals. To circumvent this side effect, PANAMA also includes a
subset of all SNPs in the model, resulting in a more complete
covariance structure that satisfies an appropriate balance between
explaining confounding variation and preserving true genetic
signals (Figure 1b,c). In this approach, the variance contribution of
few major signal SNPs and the state of the hidden factors are then
jointly estimated. Moreover, an appropriate number of hidden
factors is determined automatically during learning. As a result,
PANAMA is statistically robust and inference of hidden factors is
feasible without manual setting of any tuning parameters.
Additional observed covariates, if available, can also be included
in the model; see Methods and the supplementary Text S1 for full
details.
Simulation study
The evaluation of methods to call eQTLs is difficult as reliable
ground truth information is not available. Following previous work
[2,20,21], we have used synthetic data to assess and compare
PANAMA with alternative approaches. To minimize assumptions
Author Summary
The computational analysis of genetical genomics studies
is challenged by confounding variation that is unrelated to
the genetic factors of interest. Several approaches to
account for these confounding factors have been pro-
posed, greatly increasing the sensitivity in recovering
direct genetic (cis) associations between variable genetic
loci and the expression levels of individual genes. Crucially,
these existing techniques largely rely on the true
association signals being orthogonal to the confounding
variation. Here, we show that when studying indirect
(trans) genetic effects, for example from master regulators,
their association signals can overlap with confounding
factors estimated using existing methods. This technical
overlap can lead to overcorrection, erroneously explaining
away true associations as confounders. To address these
shortcomings, we propose PANAMA, a model that jointly
learns hidden factors while accounting for the effect of
selected genetic regulators. In applications to several
studies, PANAMA is more accurate than existing methods
in recovering the hidden confounding factors. As a result,
we find an increase in the statistical power for direct (cis)
and indirect (trans) associations. Most strikingly on yeast,
PANAMA not only finds additional associations but also
identifies master regulators that can be better reproduced
between independent studies.
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org 2January 2012 | Volume 8 | Issue 1 | e1002330
Page 3
we need to impose on the simulation procedure we created an
artificial dataset that borrows key characteristics from a real eQTL
study in yeast [4] (See also Application to segregating yeast strains).
In this approach, we first fit PANAMA to the original yeast eQTL
data, thereby estimating the number of cis and trans associations,
an empirical distribution of effect sizes, and finally the character-
istics of confounding variation. Based on these estimates we
recreated an in silico eQTL dataset using standard linear
assumptions; see Text S1 for full details on the exact approach.
To rule out possible biases of this dataset towards our method, we
additionally considered a simulation setting when fitting the ICE
model [7] to the real data for estimating simulation parameters
(see below).
Given the synthetic eQTL study, we employed alternative
methods to recover the underlying simulated associations. We
compared PANAMA to standard linear regression (LINEAR),
ignoring the presence of confounders entirely, as well as SVA [6],
ICE [7] and PEER [2,15], established and widely used approaches
to correct for hidden confounders. For reference, we also
compared to an idealized model with the simulated confounders
perfectly removed (IDEAL). First, Figure 2a and 2b show the
respective number of significant cis and trans associations as a
function of the false discovery rate (FDR) cutoff. To avoid overly
optimistic association counts due to linkage disequilibrium, we
considered at most a single cis association per gene and at most one
trans association per chromosome for each gene. PANAMA found
more cis associations than any other approach and retrieved the
greatest number of trans associations among methods that correct
for hidden confounders. Notably, the linear model appeared to
find even more trans associations, however the majority of these
calls were inconsistent with the simulated ground truth and were
spurious false positives. The extent of false associations called by
the linear model is also reflected in Figure 2c, which shows the
receiver operating characteristics for each method. All approaches
that correct for confounders performed strikingly better than the
linear model. Among these, PANAMA was most accurate,
achieving greater sensitivity than any other method for a large
range of false positive rates (FPR), approaching the performance of
an ideal model (IDEAL).
Since some models, including SVA and PEER, allow to account
for additional known covariates, we investigated their performance
when adding the strongest genetic regulators as covariates. This
procedure is mimicking the central concept of PANAMA using
previous methods. However, comparative results (Supplementary
Figure S7) show that iterative learning of PANAMA still performs
significantly better.
Next, we studied the statistics of obtained p-values, checking for
departure from a uniform distribution that either indicates
inflation (genomic control lw1) or deflation (genomic control
lv1) of the respective methods (Figure 2d and Supplementary
Figure S8 for corresponding Q-Q-plots). All methods except for
ICE yielded an inflated p-value distribution. Notably, this
observation also applies to the ideal model where the effect of
confounders had been perfectly removed. Thus, in settings with
Figure 1. Illustration of the PANAMA model. (a) Representation of the linear model used by PANAMA to correct for the effect of confounding
factors. (b) Alternative settings of confounders in relation to true genetic signals: First, orthogonality between confounders and genetics. The
variation in the gene expression levels (green arrow) can be better explained by the SNP (blue arrow). Second, statistical overlap between variation
explained by confounders and the genetic variation as often found in trans hotspots. Gene expression variation can be equally well explained as
genetic or due to a confounding factor. Previous methods focus in the first setting, while PANAMA is able to handle both situations. (c) PANAMA
applied to the yeast eQTL dataset. Pronounced trans regulators that overlap with the learnt confounding factors are highlighted in red.
doi:10.1371/journal.pcbi.1002330.g001
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org3 January 2012 | Volume 8 | Issue 1 | e1002330
Page 4
sufficiently strong trans regulation, inflated statistics are not
necessarily due to poor calibration because of confounders, but
instead may occur as a consequence of an excess of true biological
signals themselves. We also checked that calls by the various
methods were not overly optimistic and artificially inflated.
Indeed, false discovery rate estimates from all methods but the
linear model were approximately in line with the empirical rate of
errors when taking the ground truth into account (Supporting
Figure S1), with PANAMA being the best calibrated method.
We then repeated the same analysis on a broader range of
simulated datasets, varying particular aspects of the simulation
procedure around the parameters obtained from the fit to the real
yeast data. Figure 2e shows the accuracy of alternative methods
when reducing the extent of simulated trans regulation by
subsampling from the set of initial trans effects. These results
highlight that previous methods only work well in the regime of
little trans regulation, while PANAMA provides for accurate calls
for a wider range of settings. Similarly, Figure 2f shows results for
strong trans regulation, now varying the extent of confounding
factors from weaker to stronger influences. Again, PANAMA was
found to be more robust than previous approaches, recovering
true simulated associations with great accuracy irrespectively of
the magnitude of simulated confounding.
Finally, we investigated the impact of the exact of model used to
fit the association characteristics to the initial yeast dataset.
Supporting Figure S2 shows summary results for a second
synthetic dataset fitted using ICE. As ICE tends to be the most
conservative approach among the considered methods, the extent
of trans regulation on this simulated data was severely reduced. As
a consequence, the differences between methods were consider-
ably smaller, however confirming the previously observed trends.
Application to segregating yeast strains
Having established the accuracy of PANAMA in recovering
hidden confounders, we applied PANAMA and the alternative
methods to the primary eQTL dataset from segregating yeast
strains [4]. These data cover a set of 108 genetically diverse strains
that have been expression profiled in two environmental
conditions, glucose and ethanol. First, we focused on the glucose
condition, which has previously been expression profiled [3],
providing an independent study for the purpose of comparison.
Figure 3a and 3b show the number of cis and trans associations
for different methods as a function of the FDR cutoff. Again, we
considered at most one association per chromosome to avoid
confounding the size of associations with their number. In line
with previously reported results [2,7] and the simulated setting
(Simulation study), the standard linear model identified fewer cis
associations than methods that correct for confounding variation.
The trends from the simulated dataset also carried over for trans
associations, where the linear model called many more associa-
tions than methods that account for confounders, yielding an
excess of regulatory hotspots (See Supporting Figure S3). It has
previously been suggested that many of these are likely to be false;
see for example the discussion in Kang et al. [7]. Among the
methods that correct for confounding variation, PANAMA
identified the greatest number of associations. Among the
Figure 2. Evaluation of PANAMA and alternative methods on the simulated eQTL dataset. (a,b) number of recovered cis and trans
associations as a function of the chosen false discovery rate cutoff. To circumvent biases due to linkage, at most one association per chromosome
and gene is counted. (c) Receiver Operating Characteristics (ROC) for recovering true simulated associations, depicting the true positive rate (TPR) as a
function of the permitted false positive rate (FPR). (d) inflation factors, defined as Dl~l{1, indicating either inflated p-value distributions (Dlw0) or
deflation (Dlv0) of the respective tests statistics. (e) Area under the ROC curve for alternative simulated datasets, subsampling certain fractions of
number of simulated trans association. (f) Area under the ROC curve for alternative simulated datasets, subsampling the number of simulated
confounding factors.
doi:10.1371/journal.pcbi.1002330.g002
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org4 January 2012 | Volume 8 | Issue 1 | e1002330
Page 5
alternative methods, ICE appeared to be more sensitive in
recovering cis associations while PEER and SVA retrieved a
greater number of trans associations. Also note that models that
account for confounding factors yielded slightly inflated p-value
distributions (Figure 3c, Supplementary Figure S9), supporting
that also in real settings, a certain degree of inflation may be
caused by extensive trans regulation. Finally, supporting Figure S3
shows the number of associations called by different methods as a
function of the genomic position. This summary of genome-wide
eQTLs confirms that ICE is most conservative in detecting
hotspots, whereas all other methods do find multiple trans bands.
For comparison we also included a version of PANAMA that also
corrects for the trans regulators that are accounted for while
learning (PANAMAtrans, see Methods and supporting Text S1).
PANAMAtransyields near-identical results to ICE, which explains
the differences and similarities between the two approaches, where
PANAMA can be regarded as generalization of ICE. By
accounting for pronounced regulators PANAMA circumvents
the over-conservative correction of the ICE model.
Reproducibility of eQTLs between studies.
shed light on the validity of the associations called, we considered
the consistency of calls between two independent studies. The
glucose environment from Smith et al. [4] has previously been
studied [3], sharing a common set of segregants. We checked the
consistencyin calling genes with a cisassociation for increasing FDR
cutoffs (Figure 3d). Alternatively, focusing on the consistency of
regulatory hotspots, Figure 3e shows the ranking consistency of
polymorphisms ordered by their regulatory potential on multiple
genes. Reassuringly, for both cis effects and trans regulatoryhotspots,
PANAMA yielded results with far greater consistency than any
To objectively
other currently available method. In particular the consistency of
trans hotspots suggest that PANAMA achieved an appropriate
balance between explaining away spurious signals as confounding
variation and identifying hotspots that are likely to have a true
genetic underpinning.
Consistency of trans regulatory hotspots with respect to
known regulatory mechanisms in yeast.
of validating trans eQTLs, we investigated to what extent poly-
morphisms that regulate multiple genes in trans can be interpreted as
indirect effects that are mediated by known transcriptional regulators.
For this analysis we considered an established regulatory network of
transcription factors extracted from Yeastract [22]. Although we do
not expect trans associations to be exclusively mediated by direct
transcriptional regulation, the degree of associations that are con-
sistent with this regulatory structure is nevertheless an informative
indicator for the validity of eQTL calls from different models.
For each transcription factor, we considered polymorphisms in
the vicinity of the coding region of the transcription factor
(+10 kb around the coding region), and tested the fraction of
associations with genes that are known targets of the transcription
factor versus other associations with genes that are no direct
targets. Table S1 shows the F-score (harmonic mean between
precision and recall) for each of 129 transcription factors that had
at least one SNP in the local cis window. For half of the 129 TFs,
PANAMA yielded a higher F-score than any of the other methods
considered. Interestingly, the standard linear models performed
second best under this metric, achieving the greatest F-score in
36% of all cases, followed by PEER (28%), SVA (15%) and ICE
(6%). Among the methods that correct for confounders, PANAMA
consistently yielded the highest F-score.
As a second means
Figure 3. Evaluation of alternative methods on the eQTL dataset from segregating yeast strains (glucose condition). (a,b): number of
cis and trans associations found by alternative methods as a function of the chosen FDR cutoff. (c) Inflation factors of alternative methods, defined as
Dl~l{1. (d) Consistency of calling cis associations between two independent glucose yeast eQTL datasets. (e) Consistency of calling eQTL hotspots
between two independent glucose yeast datasets, where SNPs are ordered by the extent of trans regulation as determined by v{log(pv)w.
doi:10.1371/journal.pcbi.1002330.g003
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org 5January 2012 | Volume 8 | Issue 1 | e1002330
Page 6
Detecting
ments.
Smith et al. [4], combining expression measurement in an ethanol
and glucose background. Because each yeast strain was profiled
twice, the set of samples was not independent, but instead had a
replicate population structure. Similarly as done in previous work
[16], we accounted for this genetic relatedness in PANAMA by
adding a population covariance term (Material and Methods).
Supporting Figure S4 shows the number of associations retrieved
by PANAMA and alternative methods on this joint yeast dataset.
Because PANAMA accounted for the replicate structure of the
dataset, the increase in the number of associations compared to
the analysis of the single-condition analysis was modest. Other
methods, not accounting for the replicate structure of the
genotypes, yielded severely inflated test statistics, identifying a
trans effect for the great majority of all genes. To check the impact
of the population structure covariance, we also applied PANAMA
without the correction for artificial genetic relatedness, yielding
similarly inflated results (data not shown).
The complete set of eQTL calls from PANAMA, on the glucose
condition alone and the joint analysis on both conditions, are available
as Supporting Dataset S1 and Supporting Dataset S2 respectively.
eQTLsthat areshared acrossenviron-
Finally, we considered the full expression dataset from
Application to further eQTL studies
We successfully applied PANAMA to additional ongoing and
retrospective studies. For example, on a dataset from inbred
mouse crosses [23], PANAMA identified a greater number of
associations than other methods (Supplementary Figure S5). In
contrast to the yeast dataset, the distribution of p-values on this
dataset was almost uniform, suggesting that the extent of true
trans regulation is lower. We also investigated parts of a dataset of
the genetics of human cortical gene expression [24]. On
chromosome 17, methods that account for confounders identified
more genes in associations than a linear model, with SVA and
PANAMA retrieving the greatest number (see supporting Figure
S6). Results on other four other chromosomes were similar (data
not shown).
Finally, results of PANAMA applied to an RNA-Seq eQTL
study on Arabidopsis [25] indicate that expression heterogeneity as
accounted for by PANAMA is also present on expression estimates
from short read technologies, which is consistent with previous
reports in human RNA-Seq studies [26]. This suggests that
statistical challenges due to confounding variation are not specific
to a particular platform for measuring gene expression.
Discussion
We have reported the development of PANAMA, an
advanced statistical model to correct for confounding influences
while preserving genuine genetic association signals. We have
shown that this approach is of substantial practical use in a
range of real settings and studies. The correction approach of
PANAMA, for the first time, is able to not only find more cis
eQTLs, but also greatly improves the statistical power to
uncover true trans regulators. PANAMA finds a greater number
of associations, and calls eQTLs that are more likely to be real,
as validated by means of realistic simulated settings and an
analysis of eQTL consistency between independent studies.
Most notably, PANAMA identified several strong trans hotspots
on yeast, out of which at least 40% could be reproduced on a
replication dataset.
There are several previous approaches to correct for confound-
ing influences in eQTL studies. These methods can be broadly
grouped into factor-based models like PCA, SVA [6] and PEER
[2,15], and approaches that employ a mixed linear model [7,16],
estimating a covariance structure that captures the confounding
variation. An important reason why PANAMA performs well is
the intermediate approach taken here, that is, learning a
covariance structure within a linear mixed model (LMM), but at
the same time retaining the low-rank constraint which yields an
explicit representation of factors. Moreover, PANAMA systemat-
ically exploits the flexibility provided by the representation in
terms of covariance structures, jointly accounting for genetic
regulators while estimating the confounding factors. Our approach
is stable and robust, avoiding the need to first subtract off the
genetic contribution greedily, as for example suggested and
implemented in SVA [6] and PEER [2,15]. Although this is not
the focus of this work, we have shown how our approach can be
combined with additional measures to correct for observed sources
of confounding variation, such as known covariates or popula-
tional relatedness. The utility of such measures has been illustrated
in the joint analysis on data from two environmental conditions. A
more specialized approach that is aimed at the combined
correction for expression confounders and population structure
has recently been proposed by Listgarten et al. [16]. This LMM-
EH approach is methodologically related to what is done here, as
the contribution from multiple sources of variation are combined
within a single covariance structure. Importantly, the main
contribution in PANAMA is an integrated model that does not
include additional confounders but true genetic regulators. Unique
to PANAMA, these regulators are jointly identified and accounted
for during learning of the confounding factors. Our analysis shows,
that this approach yields a significant improvement in the
sensitivity of recovering trans associations and plausible regulatory
hotspots. A tabular overview of the relation between alternative
methods is shown in Supporting Table S2.
In conclusion, PANAMA is an important step towards
exhaustively addressing common types of confounding variation
in eQTL studies. The number of datasets that benefit from careful
dissection of true genetic signals and confounders, as done here, is
expected to rise quickly. Growing sample sizes and expression
profiling in more than one environment allow for the estimation of
more subtle confounding influences and at the same time provide
the statistical power to detect many more trans effects than possible
as of today.
Materials and Methods
PANAMA is based on a linear additive linear model,
accounting for effects from K observed SNPs S~(s1,...,sK)
and contributions from a dictionary of Q hidden factors
X~(x1,...,xQ). The resulting generative model for G gene
expression levels Y~(y1,...,yG) can then be cast as
Y~mzSVzXWzE E E E:
ð1Þ
We assume that expression levels and SNPs are observed in each
of n~1,...,N individuals, m~(m1,...,mG) is a vector of gene-
specific mean terms and e denotes Gaussian distributed observa-
tion noise, En,g*N(0,s2
weights for the SNP effects and hidden factor effects respectively.
To improve the parameters estimation, we introduce a hierarchy
on the weights of genetic influences and hidden factors in Equation
(1). We marginalize out the effect of the latent factors, X and a
subset of the SNPs with a strong regulatory role (see below),
resulting in a mixed linear model. We choose independent
Gaussian priors for the factors weights wq and the weights of
respective SNPs vk
e). The matrices V and W represent the
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org6January 2012 | Volume 8 | Issue 1 | e1002330
Page 7
p(W)~ P
Q
q~1N(wq0,a2
qI
???
),
p(V)~ P
K
k~1N(vk0,b2
kI
??
),
and integrate them out. The corresponding marginal likelihood,
conditioned on the state of the confounding factors X is now
factorized across genes
p(Y X,H
j
)~ P
G
g~1N
yg0,
X
K
k~1
b2
ksksT
kz
X
Q
q~1
a2
qxqxT
qzs2
eI
?????
!
:
ð2Þ
For notational convenience we dropped the mean term m and we
have defined H~ffb2
of the model.
kg,fa2
qg,s2
eg, the set of all hyperparameters
Known covariates
If available, additional covariates can directly be included in the
background covariance structure from Equation (2)
p(Y X
j ,H)~ P
G
g~1N yg0,
X
K
k~1
b2
ksksT
kz
X
Q
q~1
a2
qxqxT
qzc2K0zs2
eI
?????
!
,ð3Þ
where K0 denotes the covariance induced by these additional
covariates and c2the corresponding scaling parameter. Examples
for possible choices of this covariance include the covariance
induced by a fixed covariate vectors, i.e. K0~ccTor a kinship
matrix that accounts for the genetic relatedness (see for example
Kang et al. [12] and Listgarten et al. [16]).
Model fitting
The most probable state of the latent variables X and the
hyperparameters H can be identified via a straightforward
maximum likelihood approach
f
^
H,
^
Xg~argmaxp(Y X,H
H,x
j
),
ð4Þ
for example employing a gradient-based optimizer. In practical
applications of PANAMA, this model fitting (Equation (4)) is not
carried out with the set of all genome-wide SNPs included in
Equation (1), because the number of weight parameters b2
SNP would be prohibitive. Only those genetic regulators with strong
effects on multiple genes do play a role during the estimation of
hidden factors and thus need to be accounted for. Our inference
scheme determines the set of relevant regulators in an iterative
procedure. The number of hidden factors to be learnt, Q is not set a
priori and instead Q is set to a sufficiently large value. During the
optimization, theindividualvarianceparametersfor each factors,a2
automatically determine an appropriate number of effective factors,
switching off unused ones. For full details of the algorithm and
analysis of the robustness of this approach see Supporting Text S1.
kfor each
q,
Significance testing
Once the confounding-correcting covariance structure is
determined from the maximum likelihood solution of Equation
(4), significance testing can be carried out in the framework of
mixed linear models. The association between a SNP k and gene g
to be tested is treated as fixed effect, allowing to construct a
likelihood ratio statistics of the form
LODg,k~log
N yghsk,s2
N yg0, s2
kKzs2
eI
??
??
kKzs2
eI
??
?? :
ð5Þ
Here, the covariance matrix K denotes the covariance structure
explaining confounding variation, which is derived from the fitted
PANAMA model. Computationally, the likelihood ratio tests
(Equation (5)) can be efficiently implemented using recently
proposed computational tricks [19], allowing for application to
large-scale genomic data (Supporting Text S1).
In PANAMA, this correction covariance structure K only
accounts for the confounding factors, excluding the genetic
regulators (See Equation (2))
K~
X
Q
q~1
a2
qxqxT
q:
In PANAMAtrans, also correcting for the trans factors, the
covariance equals to
Ktrans~
X
K
k~1
b2
ksksT
kz
X
Q
q~1
a2
qxqxT
q:
For computational efficiency we fix the covariance structure K that
is learnt from the full expression dataset upfront. The relative
weighting of the covariance (s2
adjusted on the background and null model (Equation (5)) for
every single test carried out, using recent advances for efficient
mixed model inference [19].
k) and the noise term (s2
e) are then
Yeast datasets
We used the yeast expression dataset from Smith et al. [4]
(GEO accession number GSE9376), which consists of 5,493
probes measured in 109 segregants derived from a cross between
BY and RM. The authors provided the genotypes, which consisted
of 2,956 genotyped loci.
An association was defined as cis if the location of the SNP and the
location of the opening reading frame (ORF) of the gene were within
10 kb, and trans otherwise. In order to validate the associations found,
we also used data from Brem et al. [3] (GEO accession number
GSE1990), which consisted of 7,084 probes and 2,956 genotyped loci
in 112 segregants. For the purpose of comparison, we defined cis
associations in the same way as we did for the previous dataset.
Mouse dataset
We used the data described in Schadt [23], consisting of 23,698
expression measurements and 137 genotyped loci for 111 F2
mouse lines.
Human dataset
We used the dataset from [24] (GEO accession number
GSE8919), which consists of 14,078 transcripts and 366,140 SNPs
genotyped on 193 human samples.
Yeastract
We used data from Yeastract [22], which contains information
about the regulatory network between 185 transcription factors
and 6,298 genes. Out of these 189 transcription factors, we
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org7January 2012 | Volume 8 | Issue 1 | e1002330
Page 8
selected the 129 TFs that had a polymorphism in the vicinity
(10 kb) of the coding region.
Supporting Information
Dataset S1
condition alone in the yeast dataset.
(CSV)
List of eQTL calls from PANAMA, on the glucose
Dataset S2
analysis of both conditions (ethanol, glucose) in the yeast dataset.
(CSV)
List of eQTL calls from PANAMA in the joint
Figure S1
discovery estimates for alternative methods. Shown is the
estimated false discovery rate (E(FDR)) as a function of the
empirical false discovery rate for associations called on the
simulated dataset. In summary, PANAMA is better calibrated
than any other method, neither underestimating nor overestimat-
ing the FDR.
(PDF)
Comparison of the calibration accuracy of false
Figure S2
simulated dataset based on a fit of ICE to the original yeast
dataset. While the general performance differences are smaller, the
general trends remain. The kink in ICE is due to deflation of the
model. See the main paper Figure 2 for complementary results on
a dataset simulated from PANAMA.
(PDF)
Receiver operating characteristics for an alternative
Figure S3
genomic position for alternative methods on the eQTL dataset
from segregating yeast strains (glucose condition).
(PDF)
Number of associations called as a function of the
Figure S4
dataset from segregating yeast strains (glucose and ethanol
jointly). (a,b) number of recovered cis and trans associations as
a function of the false discovery rate cutoff. At most one
association per chromosome and gene was counted. (b) inflation
factors, defined as Dl~l{1. Note that PANAMA included a
covariance term that accounts for the genetic relatedness of
identical individuals profiled in two conditions. As a result,
PANAMA yielded bettercalibrated
associations than other methods.
(PDF)
Evaluation of alternative methods on the eQTL
results, callingfewer
Figure S5
dataset from mouse. (a) Number of cis and trans associations found
by alternative methods as a function of the FDR cutoff. (b)
Inflation factors of alternative methods, defined as Dl~l{1.
(PDF)
Evaluation of alternative methods on the eQTL
Figure S6
discovery rate cutoff on the human dataset.
(PDF)
Number of associations as a function of the false
Figure S7
comparing PANAMA to a modified version of SVA that models
the most prominent genetic regulators as covariates.
(PDF)
Receiver operating characteristics (ROC) curve
Figure S8
distribution. Figure shows the quantile-quantile plots for alterna-
tive methods evaluated on the simulated dataset.
(PDF)
Comparison of theoretical PV statistics with empirical
Figure S9
distribution. Figure shows the quantile-quantile plots for alterna-
tive methods evaluated on the yeast dataset.
(PDF)
Comparison of theoretical PV statistics with empirical
Table S1
confounders (SVA,PEER, ICE, LMM-EH, PANAMA) and
LINEAR. A mark indicates that the model exhibits that property.
The properties are: Low rank: is the model using a low-rank
representation of the confounders? LMM: is it a linear mixed
model? Preserve genetic signal: is the model explicitly preserving the
genetic signal or is it greedily subtracting the confounding effects?
PANAMA is the only model that spans all the different properties,
since it imposes a low-rank structure for the confounders, but is
efficiently implemented as a linear mixed model. Moreover, the
latent confounders are learned in conjunction with the genetics,
thereby preserving true genetic signals.
(PDF)
F-score (F~2:precision:recall
precisionzrecall) for alternative meth-
ods in recovering known regulatory mechanisms from Yeastract.
(PDF)
Comparison of the different models that account for
Table S2
Text S1
(PDF)
Supplementary methods.
Acknowledgments
The authors would like to thank Leonid Kruglyak, Erin Smith and Rachel
Brem for access to gene expression and genotype data as well as permission
to include the primary data alongside with this manuscript.
Author Contributions
Conceived and designed the experiments: NF OS NDL. Performed the
experiments: NF OS NDL. Analyzed the data: NF OS NDL. Contributed
reagents/materials/analysis tools: NF OS NDL. Wrote the paper: NF OS
NDL.
References
1. Stranger BE, Nica AC, Forrest MS, Dimas A, Bird CP, et al. (2007) Population
genomics of human gene expression. Nat Genet 39: 1217–24.
2. Stegle O, Parts L, Durbin R, Winn J (2010) A Bayesian Framework to Account
for Complex Non-Genetic Factors in Gene Expression Levels Greatly Increases
Power in eQTL Studies. PLoS Comput Biol 6: e1000770.
3. Brem RB, Yvert G, Clinton R, Kruglyak L (2002) Genetic dissection of
transcriptional regulation in budding yeast. Science 296: 752–5.
4. Smith EN, Kruglyak L (2008) Gene-environment interaction in yeast gene
expression. PLoS Biol 6: e83.
5. McCarthy MI, Abecasis GR, Cardon LR, Goldstein DB, Little J, et al. (2008)
Genome-wide as- sociation studies for complex traits: consensus, uncertainty and
challenges. Nat Rev Genet 9: 356–369.
6. Leek JT, Storey JD (2007) Capturing heterogeneity in gene expression studies by
surrogate variable analysis. PLoS Genet 3: 1724–35.
7. Kang HM, Ye C, Eskin E (2008) Accurate discovery of expression quantitative
trait loci under confounding from spurious and genuine regulatory hotspots.
Genetics 180: 1909–25.
8. Churchill G (2002) Fundamentals of experimental design for cDNA microarrays.
Nat Genet 32 Suppl: 490–5.
9. Balding D, Bishop M, Cannings (2003) Handbook of Statistical Genetics. N.Y.:
Wiley J. and Sons Ltd., second edition.
10. Johnson WE, Li C, Rabinovic A (2007) Adjusting batch effects in microarray
expression data using empirical Bayes methods. Biostatistics 8: 118–27.
11. Kang H, Zaitlen N, Wade C, Kirby A, Heckerman D, et al. (2008) Efficient control
ofpopulationstructureinmodelorganismassociationmapping.Genetics178:1709.
12. Kang HM, Sul JH, Service SK, Zaitlen NA, Kong SY, et al. (2010) Variance
component model to account for sample structure in genome- wide association
studies. Nat Genet 42: 348–354.
13. Plagnol V, Uz E, Wallace C, Stevens H, Clayton D, et al. (2008) Extreme
clonality in lymphoblas- toid cell lines with implications for allele specific
expression analyses. PLoS One 3: 2966.
14. Locke D, Segraves R, Carbone L, Archidiacono N, Albertson D, et al. (2003)
Large-scale variation among human and great ape genomes determined by
array comparative genomic hybridization. Genome Res 13: 347.
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org8January 2012 | Volume 8 | Issue 1 | e1002330
Page 9
15. Stegle O, Parts L, Winn J, Durbin R (2011) Using Probabilistic Estimation of
Expression Residuals (PEER) to obtain increased power and interpretability of
gene expression analyses. Nat Protoc. In press.
16. Listgarten J, Kadie C, Schadt E, Heckerman D (2010) Correction for hidden
confounders in the genetic analysis of gene expression. Proc Natl Acad Sci U S A
107: 16465.
17. Nica A, Parts L, Glass D, Nisbet J, Barrett A, et al. (2011) The Architecture of
Gene Regulatory Variation across Multiple Human Tissues: The MuTHER
Study. PLoS Genet 7: e1002003.
18. Breitling R, Li Y, Tesson B, Fu J, Wu C, et al. (2008) Genetical genomics:
spotlight on QTL hotspots. PLoS Genet 4: e1000232.
19. Lippert C, Listgarten J, Liu Y, Kadie C, Davidson R, et al. (2011) Fast linear
mixed models for genome-wide association studies. Nat Methods 8: 833–835.
20. Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, et al. (2006) Principal
components analysis corrects for stratification in genome-wide association
studies. Nat Genet 38: 904–909.
21. Yu J, Pressoir G, Briggs W, Bi I, Yamasaki M, et al. (2005) A unified mixed-
model method for association mapping that accounts for multiple levels of
relatedness. Nat Genet 38: 203–208.
22. Teixeira MC, Monteiro P, Jain P, Tenreiro S, Fernandes AR, et al. (2006) The
YEASTRACT database: a tool for the analysis of transcription regulatory
associations in Saccharomyces cere- visiae. Nucleic Acids Res 34: D3–D5.
23. Schadt E, Lamb J, Yang X, Zhu J, Edwards S, et al. (2005) An integrative
genomics approach to infer causal associations between gene expression and
disease. Nat Genet 37: 710–717.
24. Myers A, Gibbs J, Webster J, Rohrer K, Zhao A, et al. (2007) A survey of genetic
human cortical gene expression. Nat Genet 39: 1494–1499.
25. Gan X, Stegle O, Behr J, Steffen J, Drewe P, et al. (2011) Multiple reference
genomes and tran- scriptomes for arabidopsis thaliana. Nature 477: 419–423.
26. Pickrell J, Marioni J, Pai A, Degner J, Engelhardt B, et al. (2010) Understanding
mechanisms underlying human gene expression variation with rna sequencing.
Nature 464: 768.
Accurate Confounder Correction for eQTL Studies
PLoS Computational Biology | www.ploscompbiol.org9January 2012 | Volume 8 | Issue 1 | e1002330