An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.
BMC Genetics (Impact Factor: 2.4). 06/2010; 11(1):49. DOI: 10.1186/1471-2156-11-49
Source: PubMed


As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has been limited to small datasets or simulation studies.
Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and overlap with them. In addition, RF analysis identifies four interesting new candidate MS genes (MPHOSPH9, CTNNA3, PHACTR2 and IL7) that warrant further follow-up in independent studies.
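The abstract does not say how the LD-based pruning was carried out. As an illustration only, the sketch below shows one common greedy scheme: walk along SNPs in chromosomal order and keep a SNP only if its squared correlation (r²) with each recently kept SNP stays below a threshold. The function names `r_squared` and `ld_prune`, the 0.8 threshold, and the window of 50 kept SNPs are all hypothetical choices, not values taken from the study.

```python
from statistics import mean


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treat as unlinked
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def ld_prune(genotypes, threshold=0.8, window=50):
    """Greedy LD pruning (illustrative, not the study's procedure).

    genotypes: list of SNP vectors in chromosomal order, one value per subject.
    threshold: hypothetical r^2 cutoff above which a SNP is dropped.
    window:    number of most recently kept SNPs to compare against
               (counted in SNPs, not base pairs).
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for j, snp in enumerate(genotypes):
        recent = kept[-window:]
        if all(r_squared(snp, genotypes[k]) < threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: SNP 1 duplicates SNP 0, and SNP 3 is SNP 2 with alleles flipped.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
    [0, 2, 1, 1, 2, 0],
]
kept = ld_prune(snps)
print(kept)  # [0, 2]
```

Pruning of this kind reduces the number of near-duplicate predictors competing for splits, which is one plausible reason it would help an RF analysis.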
This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and that the results obtained are biologically plausible in light of previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
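To make the tuning point concrete, here is a minimal sketch of comparing out-of-bag (OOB) error across several values of the number of predictors sampled per split (often called mtry). It uses scikit-learn's `RandomForestClassifier` on small simulated genotypes. The data, the effect sizes, the candidate mtry values and the helper `oob_error_by_mtry` are assumptions made for illustration; they are not the study's data, software or settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Simulated genotypes coded 0/1/2 (hypothetical data, not the MS dataset).
rng = np.random.default_rng(0)
n_samples, n_snps = 300, 200
X = rng.integers(0, 3, size=(n_samples, n_snps))

# Hypothetical phenotype: case status depends on the first two SNPs only.
logit = 1.0 * (X[:, 0] - 1) + 1.0 * (X[:, 1] - 1)
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-logit))).astype(int)


def oob_error_by_mtry(X, y, mtry_values, n_trees=200, seed=0):
    """Fit one forest per mtry value and return {mtry: OOB error}."""
    errors = {}
    for mtry in mtry_values:
        rf = RandomForestClassifier(
            n_estimators=n_trees,
            max_features=mtry,   # predictors sampled at each split
            oob_score=True,      # estimate accuracy from out-of-bag samples
            random_state=seed,
        )
        rf.fit(X, y)
        errors[mtry] = 1.0 - rf.oob_score_
    return errors


# sqrt(p) is the usual classification default; compare it with larger values.
default_mtry = int(np.sqrt(n_snps))
errors = oob_error_by_mtry(X, y, [default_mtry, n_snps // 10, n_snps // 2])
for mtry, err in errors.items():
    print(f"mtry={mtry:4d}  OOB error={err:.3f}")
```

On a real GWA dataset the same loop would normally also vary the number of trees, and each fit is far more expensive, so a coarse grid evaluated on a sub-sample is a reasonable starting point.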



Available from: Adele Cutler
    • "However, epistatic effects (gene-gene interactions) as well as gene-environment interactions are thought to play key roles in determining phenotype and there are a number of models of interaction among SNPs known as epistasis where the individual (or main) effect of each SNP might be small but in combination, the effect is large [32]. There are a number of possible models of epistatic interaction in [23] and a number of methods used for the discovery of these models: regression based methods such as logistic regression [16] and penalized regression [24]; decision tree [3], [20], [31], [11], [29]; multifactor dimensionality reduction [25], [14]; combinatorial partitioning [21]; Restricted Partitioning Method [8]. Here, a global search based algorithm, ant colony optimisation is used to derive near-optimal decision tree interactions between a number of SNPs."
    ABSTRACT: In this paper ant colony optimisation is used to derive near-optimal interactions between a number of single nucleotide polymorphisms. This approach is used to discover small numbers of single nucleotide polymorphisms that are combined into a decision tree or contingency table model. It is shown that these two models can be highly discriminatory from a statistical perspective and a number of the single nucleotide polymorphisms discovered have been identified previously in large genome-wide association studies.
    Workshop on Computational Intelligence for Biomedicine and Bioinformatics as part of the IEEE International Conference on Bioinformatics and Biomedicine (BIBM); 12/2014
    • "In this section, we empirically compared our USR algorithm with single-marker test (χ² test), elastic-net, orthogonal matching pursuit (OMP), focal underdetermined system solver (FOCUSS) [Rao and Kreutz-Delgado, 1999], Random Forest [Chen and Ishwaran, 2012; Goldstein et al., 2010], and Gemma [Zhou and Stephens, 2012]."
    ABSTRACT: Joint adjustment of cryptic relatedness and population structure is necessary to reduce bias in DNA sequence analysis; however, existing sparse regression methods model these two confounders separately. Incorporating prior biological information has great potential to enhance statistical power, but such information is often overlooked in existing sparse regression models. We developed a unified sparse regression (USR) to incorporate prior information and jointly adjust for cryptic relatedness, population structure, and other environmental covariates. Our USR models cryptic relatedness as a random effect and population structure as a fixed effect, and uses weighted penalties to incorporate prior knowledge. As demonstrated by extensive simulations, our USR algorithm can discover more true causal variants and maintain a lower false discovery rate than several commonly used feature selection methods. It can handle both rare and common variants simultaneously. Applying our USR algorithm to DNA sequence data of Mexican Americans from GAW18, we replicated three hypertension pathways, demonstrating its effectiveness in identifying susceptibility genetic variants.
    Genetic Epidemiology 12/2014; 38(8). DOI:10.1002/gepi.21849 · 2.60 Impact Factor
    • "Nonetheless, this methodology has been primarily applied to the study of complex diseases in humans, considering a small number of SNPs (Chang et al. 2008; Ballard et al. 2010). Its application to large SNP data sets (such as those produced by next-generation techniques) is more complicated and requires the modification of certain standard assumptions (Goldstein et al. 2010; Chen & Ishwaran 2012). Still, this methodology offers an interesting option to the study of genetic pathways shaping natural adaptations. "
    ABSTRACT: Establishing the genetic and molecular basis underlying adaptive traits is one of the major goals of evolutionary geneticists in order to understand the connection between genotype and phenotype, and elucidate the mechanisms of evolutionary change. Despite considerable effort to address this question, there remain relatively few systems in which the genes shaping adaptations have been identified. Here we review the experimental tools that have been applied to document the molecular basis underlying evolution in several natural systems, in order to highlight their benefits, limitations and suitability. In most cases a combination of DNA, RNA and functional methodologies with field experiments will be needed to uncover the genes and mechanisms shaping adaptation in nature.
    Methods in Ecology and Evolution 12/2014; 6(4). DOI:10.1111/2041-210X.12324 · 6.55 Impact Factor