Power of Data Mining Methods to Detect Genetic Associations and Interactions

Division of Biostatistics, School of Public Health, Yale University, New Haven, Conn., USA. annette.molinaro@yale.edu
Human Heredity (Impact Factor: 1.47). 09/2011; 72(2):85-97. DOI: 10.1159/000330579
Source: PubMed


Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).
We use the algorithm-specific variable importance measures (VIMs) as test statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three methods is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.
The power of RF is highest in all simulation models, that of MCLR is similar to that of RF in half of the models, and that of MDR is consistently the lowest.
Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.
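
The permutation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the add-one correction to the p value is an assumption, and the importance statistic used here (absolute difference in mean genotype between cases and controls) is a simple stand-in for the RF, MCLR, or MDR VIMs, which would be passed in through `importance_fn` instead.

```python
import random


def permutation_p_values(X, y, importance_fn, n_perm=200, seed=0):
    """Permutation p value per variable for an arbitrary importance statistic.

    X is a list of rows (one per subject); y is a list of 0/1 labels.
    importance_fn(X, y) must return one score per column (larger = more important).
    """
    rng = random.Random(seed)
    observed = importance_fn(X, y)
    exceed = [0] * len(observed)
    y_perm = list(y)
    for _ in range(n_perm):
        # Shuffling the labels breaks any X-y association but keeps X intact.
        rng.shuffle(y_perm)
        null_scores = importance_fn(X, y_perm)
        for j, (obs, null) in enumerate(zip(observed, null_scores)):
            if null >= obs:
                exceed[j] += 1
    # Assumption: add-one correction so that no p value is exactly zero.
    return [(count + 1) / (n_perm + 1) for count in exceed]


def mean_difference_importance(X, y):
    """Stand-in statistic (not a tree-based VIM): per-SNP absolute difference
    in mean genotype between cases (y == 1) and controls (y == 0)."""
    cases = [row for row, label in zip(X, y) if label == 1]
    controls = [row for row, label in zip(X, y) if label == 0]
    return [
        abs(sum(r[j] for r in cases) / len(cases)
            - sum(r[j] for r in controls) / len(controls))
        for j in range(len(X[0]))
    ]


# Toy data: SNP 0 separates cases from controls; SNP 1 is unrelated to status.
X = [[2, i % 2] for i in range(20)] + [[0, i % 2] for i in range(20)]
y = [1] * 20 + [0] * 20

p_values = permutation_p_values(X, y, mean_difference_importance)
print(p_values)
```

In this toy example the first SNP receives a small p value and the second does not. Power in a simulation study would then be estimated as the fraction of simulated data sets in which a truly associated variable falls below the chosen significance level.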



Available from: Patricia Hartge
  • "Based on the current findings it will be interesting to take into account the substantive non-additive genetic variation underlying the association of alcohol intake and GGT by performing gene-finding studies that assume a (2 df) genotypic model instead of an (1 df) additive model. In addition, prediction models may be fitted that involve complex interactions among the genetic markers, such as random forests (Molinaro et al., 2011)."
    ABSTRACT: Blood levels of gamma-glutamyl transferase (GGT) are used as a marker for (heavy) alcohol use. The role of GGT in the anti-oxidant defense mechanism that is part of normal metabolism supposes a causal effect of alcohol intake on GGT. However, there is variability in the response of GGT to alcohol use, which may result from genetic differences between individuals. This study aimed to determine whether the epidemiological association between alcohol intake and GGT at the population level is necessarily a causal one or may also reflect effects of genetic pleiotropy (genes influencing multiple traits). Data on alcohol intake (grams alcohol/day) and GGT, originating from twins, their siblings and parents (N=6465) were analyzed with structural equation models. Bivariate genetic models tested whether genetic and environmental factors influencing alcohol intake and GGT correlated significantly. Significant genetic and environmental correlations are consistent with a causal model. If only the genetic correlation is significant, this is evidence for genetic pleiotropy. Phenotypic correlations between alcohol intake and GGT were significant in men (r=.17) and women (r=.09). The genetic factors underlying alcohol intake correlated significantly with those for GGT, whereas the environmental factors were weakly correlated (explaining 4-7% vs. 1-2% of the variance in GGT respectively). In this healthy population sample, the epidemiological association of alcohol intake with GGT is at least partly explained by genetic pleiotropy. Future longitudinal twin studies should determine whether a causal mechanism underlying this association might be confined to heavy drinking populations.
    Drug and Alcohol Dependence 09/2013; 134(1). DOI:10.1016/j.drugalcdep.2013.09.016 · 3.42 Impact Factor
  • ABSTRACT: Random forests (RF) is a popular tree-based ensemble machine learning tool that is highly data adaptive, applies to "large p, small n" problems, and is able to account for correlation as well as interactions among features. This makes RF particularly appealing for high-dimensional genomic data analysis. In this article, we systematically review the applications and recent progress of RF for genomic data, including prediction and classification, variable selection, pathway analysis, genetic association and epistasis detection, and unsupervised learning.
    Genomics 04/2012; 99(6):323-9. DOI:10.1016/j.ygeno.2012.04.003 · 2.28 Impact Factor
  • ABSTRACT: The identification of causal single nucleotide polymorphisms (SNPs) for complex diseases like type 2 diabetes (T2D) is a challenge because of the low statistical power of individual markers from a genome-wide association study (GWAS). SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity. Hence, we aim to detect T2D causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. The selected SNPs with SNP combinations are mapped with multi-dimensional levels of T2D-related information and gene set enrichment analysis (GSEA). A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected, with an error rate of 10.25%. Matching with known disease genes and gene sets revealed the relationships between T2D and SNP combinations. We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
    Proceedings of the ACM sixth international workshop on Data and text mining in biomedical informatics; 10/2012