Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies.

Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield, United Kingdom.
PLoS Computational Biology (Impact Factor: 4.83). 01/2012; 8(1):e1002330. DOI: 10.1371/journal.pcbi.1002330
Source: PubMed

ABSTRACT Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies is the presence of hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods and, in particular, finds substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at
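The confounding effect described in the abstract can be illustrated with a small simulation. The sketch below is not PANAMA and does not use its software; it is a hypothetical toy example, assuming a single hidden factor that affects all genes and one SNP with a genuine effect on one gene. It shows how the hidden factor inflates gene-gene correlations, and how a simple stand-in correction (removing the top principal component, a technique swapped in here purely for illustration) reduces them. All function names, effect sizes, and sample sizes are assumptions.

```python
import numpy as np


def simulate_expression(n_samples=200, n_genes=50, seed=0):
    """Simulate expression with one hidden confounder and one true eQTL (toy data)."""
    rng = np.random.default_rng(seed)
    snp = rng.integers(0, 3, size=n_samples).astype(float)  # genotype coded 0/1/2
    hidden = rng.normal(size=n_samples)                      # unobserved confounder
    loadings = rng.normal(loc=1.0, scale=0.2, size=n_genes)  # confounder effect per gene
    noise = rng.normal(size=(n_samples, n_genes))
    expr = np.outer(hidden, loadings) + noise
    expr[:, 0] += 0.5 * snp                                  # genuine genetic effect on gene 0
    return snp, expr


def mean_abs_offdiag_corr(expr):
    """Average absolute correlation between distinct genes."""
    corr = np.corrcoef(expr, rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return float(np.mean(np.abs(off_diag)))


def remove_top_components(expr, k=1):
    """Remove the k leading principal components (a simple stand-in correction)."""
    centered = expr - expr.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    s_reduced = s.copy()
    s_reduced[:k] = 0.0
    return (u * s_reduced) @ vt


snp, expr = simulate_expression()
corrected = remove_top_components(expr, k=1)

print("mean |gene-gene corr| before:", mean_abs_offdiag_corr(expr))
print("mean |gene-gene corr| after: ", mean_abs_offdiag_corr(corrected))
print("corr(SNP, gene 0) before:", np.corrcoef(snp, expr[:, 0])[0, 1])
print("corr(SNP, gene 0) after: ", np.corrcoef(snp, corrected[:, 0])[0, 1])
```

A correction of this kind estimates the hidden factors without reference to the genotypes, so in less favourable settings it may also absorb part of a genuine genetic signal. The abstract motivates learning hidden factors jointly with prominent genetic regulators for this reason.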

Available from: Oliver Stegle, Jun 30, 2015
  • Source
    ABSTRACT: A partially-latent-output mapping (PLOM) method is proposed. PLOM infers a regression function between an observed input (typically high-dimensional) and a partially-latent output (typically low-dimensional). More precisely, the vector-valued output variable is formed of both observed and unobserved components. The main and novel feature of PLOM is that it provides a framework to deal with situations where some of the output's components can be observed while the remaining components can neither be measured nor be easily annotated. Moreover, by modeling the non-observed output components as latent variables, we prevent the observed components from being contaminated with artifacts that cannot be absorbed with standard noise models. We also emphasize that the proposed formulation unifies regression and dimensionality reduction into a common framework referred to as Gaussian Locally-Linear Mapping (GLLiM). We formally derive EM inference procedures for the corresponding family of models. Tests and comparisons with state-of-the-art methods show that PLOM's main advantage is its robustness to varying experimental conditions.
  • Source
    ABSTRACT: High-dimensional phenotypes hold promise for richer findings in association studies, but testing of several phenotype traits aggravates the grand challenge of association studies, that of multiple testing. Several methods have recently been proposed for testing jointly all traits in a high-dimensional vector of phenotypes, with prospect of increased power to detect small effects that would be missed if tested individually. However, the methods have rarely been compared to the extent of enabling assessment of their relative merits and setting up guidelines on which method to use, and how to use it. We compare the methods on simulated data and with a real metabolomics data set comprising 137 highly correlated variables and approximately 550,000 SNPs. Applying the methods to genome-wide data with hundreds of thousands of markers inevitably requires division of the problem into manageable parts facilitating parallel processing, parts corresponding to individual genetic variants, pathways, or genes, for example. Here we utilize a straightforward formulation according to which the genome is divided into blocks of nearby correlated genetic markers, tested jointly for association with the phenotypes. This formulation is computationally feasible, reduces the number of tests, and lets the methods take advantage of combining information over several correlated variables not only on the phenotype side, but also on the genotype side. Our experiments show that canonical correlation analysis has higher power than alternative methods, while remaining computationally tractable for routine use in the GWAS setting, provided the number of samples is sufficient compared to the numbers of phenotype and genotype variables tested. Sparse canonical correlation analysis and regression models with latent confounding factors show promising performance when the number of samples is small compared to the dimensionality of the data.
    Statistical Applications in Genetics and Molecular Biology 06/2013; 12(4):1-19. DOI:10.1515/sagmb-2012-0032 · 1.52 Impact Factor
  • Source
    ABSTRACT: Exploring the genetic basis of heritable traits remains one of the central challenges in biomedical research. In simple cases, single polymorphic loci explain a significant fraction of the phenotype variability. However, many traits of interest appear to be subject to multifactorial control by groups of genetic loci instead. Accurate detection of such multivariate associations is nontrivial and often hindered by limited power. At the same time, confounding influences such as population structure cause spurious association signals that result in false positive findings if they are not accounted for in the model. Here, we propose LMM-Lasso, a mixed model that allows for both multi-locus mapping and correction for confounding effects. Our approach is simple and free of tuning parameters, effectively controls for population structure, and scales to genome-wide datasets. We show practical use in genome-wide association studies and linkage mapping through retrospective analyses. In data from Arabidopsis thaliana and mouse, our method is able to find a genetic cause for significantly greater fractions of phenotype variation in 91% of the phenotypes considered. At the same time, our model dissects this variability into components that result from individual SNP effects and population structure. In addition to this increase in the phenotype variation explained by genetics, enrichment of known candidate genes suggests that the associations retrieved by LMM-Lasso are more likely to be genuine.