Mining Genetic Epidemiology Data with Bayesian Networks: Application to APOE Gene Variation and Plasma Lipid Levels

Human Genetics Center, University of Texas Health Science Center, 1200 Herman Pressler Drive, Houston, TX 77030, USA.
Journal of Computational Biology (Impact Factor: 1.74). 02/2005; 12(1):1-11. DOI: 10.1089/cmb.2005.12.1
Source: PubMed


There is a critical need for data-mining methods that can identify SNPs that predict among-individual variation in a phenotype of interest and reverse-engineer the biological network of relationships between SNPs, phenotypes, and other factors. This problem is both challenging and important in light of the large number of SNPs in many genes of interest and across the human genome. A potentially fruitful form of exploratory data analysis is the Bayesian or Belief network. A Bayesian or Belief network provides an analytic approach for identifying robust predictors of among-individual variation in disease endpoints or risk factor levels. We have applied Belief networks to SNP variation in the human APOE gene and plasma apolipoprotein E levels from two samples: 702 African-Americans from Jackson, MS, and 854 non-Hispanic whites from Rochester, MN. Twenty variable sites in the APOE gene were genotyped in both samples. In Jackson, MS, SNPs 4036 and 4075 were identified as influencing plasma apoE levels. In Rochester, MN, SNPs 3937 and 4075 were identified as influencing plasma apoE levels. All three SNPs had previously been implicated in affecting measures of lipid and lipoprotein metabolism. Like all data-mining methods, Belief networks are meant to complement traditional hypothesis-driven methods of data analysis. These results document the utility of a Belief network approach for mining large-scale genotype-phenotype association data.
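
The abstract does not spell out how predictors are read off a learned network. One common approach is to take the Markov blanket of the phenotype node: its parents, its children, and the other parents of those children. The following is a minimal sketch of that idea only. The function name `markov_blanket`, the graph `toy_dag`, and all of its node names and edges are hypothetical illustrations, not the network or the results reported in the paper.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG given as {node: set of parent nodes}."""
    # Children are the nodes that list the target among their parents.
    children = {node for node, ps in parents.items() if target in ps}
    # Co-parents ("spouses") share at least one child with the target.
    spouses = {p for child in children for p in parents[child]}
    return (set(parents.get(target, ())) | children | spouses) - {target}


# Hypothetical toy network; the edges are illustrative, not findings.
toy_dag = {
    "snp_a": set(),
    "snp_b": set(),
    "snp_c": set(),
    "age": set(),
    "apoe_level": {"snp_a", "snp_b"},
    "cholesterol": {"apoe_level", "age"},
}

print(sorted(markov_blanket(toy_dag, "apoe_level")))
```

In this toy graph `snp_c` has no edge to the phenotype or to its children, so it falls outside the blanket and would not be reported as a predictor.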

  • Source
    • "The CA subset selects the intersection of the DA and MBA subsets. During the course of our analyses, we discovered that the Markov blanket had previously been used to identify involved SNPs in an analysis of plasma lipid levels [4,5]. In our application to the GAW17 data, using the DA subset in most cases resulted in a lower p-value than using the MBA subset. "
    ABSTRACT: Using single-nucleotide polymorphism (SNP) genotypes from the 1000 Genomes Project pilot3 data provided for Genetic Analysis Workshop 17 (GAW17), we applied Bayesian network structure learning (BNSL) to identify potential causal SNPs associated with the Affected phenotype. We focus on the setting in which target genes that harbor causal variants have already been chosen for resequencing; the goal was to detect true causal SNPs from among the measured variants in these genes. Examining all available SNPs in the known causal genes, BNSL produced a Bayesian network from which subsets of SNPs connected to the Affected outcome were identified and measured for statistical significance using the hypergeometric distribution. The exploratory phase of analysis for pooled replicates sometimes identified a set of involved SNPs that contained more true causal SNPs than expected by chance in the Asian population. Analyses of single replicates gave inconsistent results. No nominally significant results were found in analyses of African or European populations. Overall, the method was not able to identify sets of involved SNPs that included a higher proportion of true causal SNPs than expected by chance alone. We conclude that this method, as currently applied, is not effective for identifying causal SNPs that follow the simulation model for the GAW17 data set, which includes many rare causal SNPs.
    BMC proceedings 11/2011; 5 Suppl 9(Suppl 9):S109. DOI:10.1186/1753-6561-5-S9-S109
  • Source
    • "BNs with LVs none no Bayesian fine mapping studies HaploBlock MRFs allele states decomposability frequentist fine mapping studies HapGraph (Thomas, 2005) BNs genotype states none fine mapping studies - BNs none candidate gene studies - BNs genotype states SNP preselection no frequentist GWASs allele states yes Beagle (Browning, 2006) MRFs genotype states yes Bayesian graphminer BNs SNP-phenotype genotype states none yes frequentist physical distance & interval graphs (Thomas, 2009a) (Thomas, 2009b) physical distance & decomposability allele states & haplotype clusters genotype or allele states (Mourad et al., 2010) (Mourad et al., 2011) BNs with LVs (HMMs) allele states & haplotype clusters physical order of SNPs SNP-SNP & SNP-phenotype allele states & haplotype clusters (Greenspan and Geiger, 2004) (Greenspan and Geiger, 2005) (Rodin et al., 2005) (Sebastiani et al., 2005) "
    ABSTRACT: Probabilistic graphical models have been widely recognized as a powerful formalism in the bioinformatics field, especially in gene expression studies and linkage analysis. Although less well known in association genetics, many successful methods have recently emerged to dissect the genetic architecture of complex diseases. In this review article, we cover the applications of these models to the population association studies' context, such as linkage disequilibrium modeling, fine mapping and candidate gene studies, and genome-scale association studies. Significant breakthroughs of the corresponding methods are highlighted, but emphasis is also given to their current limitations, in particular, to the issue of scalability. Finally, we give promising directions for future research in this field.
    Briefings in Bioinformatics 03/2011; 13(1):20-33. DOI:10.1093/bib/bbr015 · 9.62 Impact Factor
  • Source
    • "BNs have already been successfully applied in association studies, for example to study overt stroke in sickle cell anaemia [4] and to identify the relationships between SNP variations in the human APOE gene and plasma apolipoprotein E levels [5]. "
    ABSTRACT: Bayesian networks are powerful instruments to learn genetic models from association studies data. They are able to derive the existing correlation between genetic markers and phenotypic traits and, at the same time, to find the relationships between the markers themselves. However, learning Bayesian networks is often non-trivial because the number of variables in the model is high relative to the number of instances in the dataset. Therefore, it becomes very interesting to use an abstraction of the variable space that suitably reduces its dimensionality without losing information. In this paper we present a new strategy to achieve this goal by mapping the SNPs related to the same gene to one meta-variable. In order to assign states to the meta-variables we employ an approach based on classification trees. We applied our approach to data coming from a genome-wide scan on 288 individuals affected by arterial hypertension and 271 nonagenarians without history of hypertension. After pre-processing, we focused on a subset of 24 SNPs. We compared the performance of the proposed approach with the Bayesian network learned with SNPs as variables and with the network learned with haplotypes as meta-variables. The results were obtained by running a hold-out experiment five times. The mean accuracy of the new method was 64.28%, while the mean accuracy of the SNPs network was 58.99% and the mean accuracy of the haplotype network was 54.57%. The new approach presented in this paper is able to derive a gene-based predictive model based on SNP data. Such a model is more parsimonious than the one based on single SNPs, while preserving the capability of highlighting predictive SNP configurations. The prediction performance of this approach was consistently superior to the SNP-based and the haplotype-based ones in all the test sets of the evaluation procedure. The method can then be considered as an alternative way to analyze the data coming from association studies.
    BMC Bioinformatics 02/2009; 10 Suppl 2(Suppl 2):S7. DOI:10.1186/1471-2105-10-S2-S7 · 2.58 Impact Factor
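
The GAW17 abstract above reports scoring subsets of network-selected SNPs with the hypergeometric distribution. A minimal sketch of such an enrichment test follows, assuming a one-sided upper-tail probability; the abstract does not state which tail or parameterization the authors used. The function name `hypergeom_upper_tail` and the counts in the example call are hypothetical.

```python
from math import comb


def hypergeom_upper_tail(total, causal, selected, hits):
    """P(X >= hits) when `selected` SNPs are drawn without replacement
    from `total` SNPs, of which `causal` are truly causal."""
    max_hits = min(causal, selected)
    ways = sum(
        comb(causal, k) * comb(total - causal, selected - k)
        for k in range(hits, max_hits + 1)
    )
    return ways / comb(total, selected)


# Hypothetical counts: 100 SNPs, 10 causal, 5 selected, 2 of them causal.
p_value = hypergeom_upper_tail(total=100, causal=10, selected=5, hits=2)
print(f"one-sided p-value: {p_value:.4f}")
```

A small value means the selected subset holds more causal SNPs than random selection would typically give, which is the sense of "more than expected by chance" in that abstract.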