Joint modelling of confounding factors and prominent genetic regulators provides increased accuracy in genetical genomics studies.

Sheffield Institute for Translational Neuroscience, University of Sheffield, Sheffield, United Kingdom.
PLoS Computational Biology (Impact Factor: 4.87). 01/2012; 8(1):e1002330. DOI: 10.1371/journal.pcbi.1002330
Source: PubMed

ABSTRACT Expression quantitative trait loci (eQTL) studies are an integral tool to investigate the genetic component of gene expression variation. A major challenge in the analysis of such studies is the presence of hidden confounding factors, such as unobserved covariates or unknown subtle environmental perturbations. These factors can induce a pronounced artifactual correlation structure in the expression profiles, which may create spurious associations or mask real genetic association signals. Here, we report PANAMA (Probabilistic ANAlysis of genoMic dAta), a novel probabilistic model to account for confounding factors within an eQTL analysis. In contrast to previous methods, PANAMA learns hidden factors jointly with the effect of prominent genetic regulators. As a result, this new model can more accurately distinguish true genetic association signals from confounding variation. We applied our model and compared it to existing methods on different datasets and biological systems. PANAMA consistently performs better than alternative methods and, in particular, finds substantially more trans regulators. Importantly, our approach not only identifies a greater number of associations, but also yields hits that are biologically more plausible and can be better reproduced between independent studies. A software implementation of PANAMA is freely available online at

  • ABSTRACT: A typical genome-wide association study searches for associations between single nucleotide polymorphisms (SNPs) and a univariate phenotype. However, there is a growing interest in investigating associations between genomics data and multivariate phenotypes, for example in gene expression or metabolomics studies. A common approach is to perform a univariate test between each genotype-phenotype pair, and then to apply a stringent significance cutoff to account for the large number of tests performed. However, this approach has limited ability to uncover dependencies involving multiple variables. Another trend in current genetics is the investigation of the impact of rare variants on the phenotype, where the standard methods often fail due to lack of power when the minor allele is present in only a limited number of individuals. We propose a new statistical approach based on Bayesian reduced rank regression to assess the impact of multiple SNPs on a high-dimensional phenotype. Due to the method's ability to combine information over multiple SNPs and phenotypes, it is particularly suitable for detecting associations involving rare variants. We demonstrate the potential of our method and compare it with alternatives using the Northern Finland Birth Cohort with 4,702 individuals, for whom genome-wide SNP data along with lipoprotein profiles comprising 74 traits are available. We discovered two genes (XRCC4 and MTHFD2L) without previously reported associations, which replicated in a combined analysis of two additional cohorts: 2,390 individuals from the Cardiovascular Risk in Young Finns study and 3,659 individuals from the FINRISK Study. R-code freely available for download at,
    Bioinformatics 03/2014; · 5.47 Impact Factor
  • ABSTRACT: Sparse modeling, a feature selection method widely used in the machine-learning community, has recently been applied to identify associations in genetic studies, including expression quantitative trait locus (eQTL) mapping. These genetic studies usually involve high-dimensional data where the number of features is much larger than the number of samples. The high dimensionality of genetic data introduces the problem that multiple solutions exist for optimizing a sparse model. In such situations, a single optimization result provides only an incomplete view of the data and lacks power to find alternative features associated with the same trait. In this article, we propose a novel method aimed at detecting alternative eQTLs, where two genetic variants have alternative relationships regarding their associations with the expression of a particular gene. Our method accomplishes this goal by exploring multiple solutions sampled from the solution space. We proved our method theoretically and demonstrated its usage on simulated data. We then applied our method to a real eQTL dataset and identified a set of alternative eQTLs with potential biological insights. Additionally, these alternative eQTLs suggest a network view of gene regulation.
    Journal of computational biology: a journal of computational molecular cell biology 04/2014; · 1.69 Impact Factor
  • ABSTRACT: Expression quantitative trait loci (eQTL) mapping is a tool that can systematically identify genetic variation affecting gene expression. eQTL mapping studies have shown that certain genomic locations, referred to as regulatory hotspots, may affect the expression levels of many genes. Recently, studies have shown that various confounding factors may induce spurious regulatory hotspots. Here, we introduce a novel statistical method that effectively eliminates spurious hotspots while retaining genuine hotspots. Applied to simulated and real datasets, we validate that our method achieves greater sensitivity while retaining low false discovery rates compared to previous methods.
    Genome biology 04/2014; 15(4):R61. · 10.30 Impact Factor

