Power of Data Mining Methods to Detect Genetic Associations and Interactions

Division of Biostatistics, School of Public Health, Yale University, New Haven, Conn., USA. annette.molinaro @
Human Heredity (Impact Factor: 1.64). 09/2011; 72(2):85-97. DOI: 10.1159/000330579
Source: PubMed

ABSTRACT Genetic association studies, thus far, have focused on the analysis of individual main effects of SNP markers. Nonetheless, there is a clear need for modeling epistasis or gene-gene interactions to better understand the biologic basis of existing associations. Tree-based methods have been widely studied as tools for building prediction models based on complex variable interactions. An understanding of the power of such methods for the discovery of genetic associations in the presence of complex interactions is of great importance. Here, we systematically evaluate the power of three leading algorithms: random forests (RF), Monte Carlo logic regression (MCLR), and multifactor dimensionality reduction (MDR).
We use the algorithm-specific variable importance measures (VIMs) as statistics and employ permutation-based resampling to generate the null distribution and associated p values. The power of the three is assessed via simulation studies. Additionally, in a data analysis, we evaluate the associations between individual SNPs in pro-inflammatory and immunoregulatory genes and the risk of non-Hodgkin lymphoma.
The power of RF is highest in all simulation models, that of MCLR is similar to RF in half, and that of MDR is consistently the lowest.
Our study indicates that the power of RF VIMs is most reliable. However, in addition to tuning parameters, the power of RF is notably influenced by the type of variable (continuous vs. categorical) and the chosen VIM.


Available from: Patricia Hartge, Apr 01, 2015
1 Bookmark
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Epistasis helps to explain how multiple single-nucleotide polymorphisms (SNPs) interact to cause disease. A variety of tools have been developed to detect epistasis. In this article, we explore the strengths and weaknesses of an information theory approach for detecting epistasis and compare it to the logistic regression approach through simulations. We consider several scenarios to simulate the involvement of SNPs in an epistasis network with respect to linkage disequilibrium patterns among them and the presence or absence of main and interaction effects. We conclude that the information theory approach more efficiently detects interaction effects when main effects are absent, whereas, in general, the logistic regression approach is appropriate in all scenarios but results in higher false positives. We compute epistasis networks for SNPs in the FSD1L gene using a two-phase head and neck cancer genome-wide association study involving 2,185 cases and 4,507 controls to demonstrate the practical application of the methods.
    Cancer informatics 01/2015; 14(Suppl 2):17-23. DOI:10.4137/CIN.S17289
  • [Show abstract] [Hide abstract]
    ABSTRACT: Genome-wide association studies (GWASs) have become the focus of the statistical analysis of complex traits in humans, successfully shedding light on several aspects of genetic architecture and biological aetiology. Single-nucleotide polymorphisms (SNPs) are usually modelled as having additive, cumulative and independent effects on the phenotype. Although evidently a useful approach, it is often argued that this is not a realistic biological model and that epistasis (that is, the statistical interaction between SNPs) should be included. The purpose of this Review is to summarize recent directions in methodology for detecting epistasis and to discuss evidence of the role of epistasis in human complex trait variation. We also discuss the relevance of epistasis in the context of GWASs and potential hazards in the interpretation of statistical interaction terms.
    Nature Reviews Genetics 09/2014; 15(11). DOI:10.1038/nrg3747 · 39.79 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: The identification of causal single nucleotide polymorphisms (SNPs) for complex diseases like type 2 diabetes (T2D) is a challenge because of the low statistical power of individual markers from a genome-wide association study (GWAS). SNP combinations are suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity. Hence, we aim to detect T2D causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. The selected SNPs with SNP combinations are mapped with multi-dimensional levels of T2D-related information and gene set enrichment analysis (GSEA). A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset are selected, with an error rate of 10.25%. Matching with known disease genes and gene sets revealed the relationships between T2D and SNP combinations. We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
    Proceedings of the ACM sixth international workshop on Data and text mining in biomedical informatics; 10/2012