Joint Modelling of Confounding Factors and Prominent Genetic Regulators Provides Increased Accuracy in Genetical Genomics Studies

Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield, United Kingdom.
PLoS Computational Biology (Impact Factor: 4.62). 01/2012; 8(1):e1002330. DOI: 10.1371/journal.pcbi.1002330
Source: PubMed


Author Summary
The computational analysis of genetical genomics studies is challenged by confounding variation that is unrelated to the genetic factors of interest. Several approaches to account for these confounding factors have been proposed, greatly increasing the sensitivity in recovering direct genetic (cis) associations between variable genetic loci and the expression levels of individual genes. Crucially, these existing techniques largely rely on the true association signals being orthogonal to the confounding variation. Here, we show that when studying indirect (trans) genetic effects, for example from master regulators, their association signals can overlap with confounding factors estimated using existing methods. This technical overlap can lead to overcorrection, erroneously explaining away true associations as confounders. To address these shortcomings, we propose PANAMA, a model that jointly learns hidden factors while accounting for the effect of selected genetic regulators. In applications to several studies, PANAMA is more accurate than existing methods in recovering the hidden confounding factors. As a result, we find an increase in the statistical power for direct (cis) and indirect (trans) associations. Most strikingly on yeast, PANAMA not only finds additional associations but also identifies master regulators that can be better reproduced between independent studies.
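The overcorrection problem described above can be illustrated with a small simulation. The sketch below is not PANAMA and is not taken from the paper: it uses plain PCA as a stand-in for an existing hidden-factor correction, and every name and parameter (`simulate`, `remove_top_pcs`, `mean_abs_corr`, the sample sizes, the effect sizes) is a hypothetical choice made for illustration. It assumes one master-regulator SNP acting in trans on many genes and one hidden confounder.

```python
import numpy as np


def simulate(n_samples=200, n_genes=100, n_targets=60, seed=0):
    """Toy data: one master-regulator SNP plus one hidden confounder."""
    rng = np.random.default_rng(seed)
    snp = rng.integers(0, 2, size=n_samples).astype(float)  # 0/1 genotype
    hidden = rng.normal(size=n_samples)                     # unobserved confounder
    effect = np.zeros(n_genes)
    effect[:n_targets] = 1.0                                # trans effect on many genes
    loadings = rng.normal(size=n_genes)                     # confounder weight per gene
    noise = 0.5 * rng.normal(size=(n_samples, n_genes))
    expr = np.outer(snp, effect) + np.outer(hidden, loadings) + noise
    return snp, expr


def remove_top_pcs(expr, k):
    """Subtract the rank-k PCA reconstruction from column-centred expression."""
    centred = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    return centred - (u[:, :k] * s[:k]) @ vt[:k]


def mean_abs_corr(snp, expr):
    """Mean absolute Pearson correlation between the SNP and each gene column."""
    z_snp = (snp - snp.mean()) / snp.std()
    z_expr = (expr - expr.mean(axis=0)) / expr.std(axis=0)
    return float(np.abs(z_snp @ z_expr / len(snp)).mean())


snp, expr = simulate()
targets = slice(0, 60)  # genes truly regulated by the SNP in this toy setup
for k in (0, 1, 2):
    corrected = remove_top_pcs(expr, k)
    print(k, round(mean_abs_corr(snp, corrected[:, targets]), 3))
```

In this toy setting the regulator's broad footprint is itself a strong low-rank component of the expression matrix, so removing two components would be expected to strip out most of the true trans signal along with the confounder. A model that conditions on the regulator while learning the hidden factors, as the summary describes for PANAMA, is meant to avoid this.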

Available from: Oliver Stegle, Oct 05, 2015
    • "Central problems include that these signals are often very weak, and the found signals can be spurious due to confounding. Confounding can stem from varying experimental conditions and demographics such as age, ethnicity, gender [5], and—crucially—population structure, which is due to the relatedness between the samples [6] [5] [7]. "
    ABSTRACT: A large class of problems in statistical genetics amounts to finding a sparse linear effect in a binary classification setup, such as finding a small set of genes that most strongly predict a disease. Very often, these signals are spurious and obfuscated by confounders such as age, ethnicity or population structure. In the probit regression model, such confounding can be modeled in terms of correlated label noise, but poses mathematical challenges for learning algorithms. In this paper we propose a learning algorithm to overcome these problems. We manage to learn sparse signals that are less influenced by the correlated noise. This problem setup generalizes to fields outside statistical genetics. Our method can be understood as a hybrid between an $\ell_1$ regularized probit classifier and a Gaussian Process (GP) classifier. In addition to a latent GP to capture correlated noise, the model captures sparse signals in a linear effect. Because the observed labels can be explained in part by the correlated noise, the linear effect will try to learn signals that capture information beyond just correlated noise. As we show on real-world data, signals found by our model are less correlated with the top confounders. Hence, we can find effects closer to the unconfounded sparse effects we are aiming to detect. Besides that, we show that our method outperforms Gaussian process classification and uncorrelated probit regression in terms of prediction accuracy.
    • "We compared our method with several methods including SVA [19], ICE [18], LMM-EH [20] and PANAMA [21]. They were all used to correct expression heterogeneity on the simulated data and the results are shown on eQTL plots (Figure 2). "
    ABSTRACT: Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.
    Genome biology 04/2014; 15(4):R61. DOI:10.1186/gb-2014-15-4-r61 · 10.81 Impact Factor
    • "Finally, it is worth to be noticed that an appropriate choice of the kernel function in the Gaussian process latent variable model (GPLVM) (Lawrence, 2005) allows to account for a partially observed input variable. This was notably studied in (Fusi et al, 2012). However, as explained in (Lawrence, 2005), the mapping yielded by GPLVM cannot be "inverted", due to the non-linear nature of the kernels used in practice. "
    ABSTRACT: A partially-latent-output mapping (PLOM) method is proposed. PLOM infers a regression function between an observed input (typically high-dimensional) and a partially-latent output (typically low-dimensional). More precisely, the vector-valued output variable is formed of both observed and unobserved components. The main and novel feature of PLOM is that it provides a framework to deal with situations where some of the output's components can be observed while the remaining components can neither be measured nor be easily annotated. Moreover, by modeling the non-observed output components as latent variables, we prevent the observed components from being contaminated with artifacts that cannot be absorbed with standard noise models. We also emphasize that the proposed formulation unifies regression and dimensionality reduction into a common framework referred to as Gaussian Locally-Linear Mapping (GLLiM). We formally derive EM inference procedures for the corresponding family of models. Tests and comparisons with state-of-the-art methods reveal the PLOM's prominent advantage to be robust to various experimental conditions.