An application of Random Forests to a genome-wide association dataset: Methodological considerations & new findings

Division of Biostatistics, School of Public Health, University of California, Berkeley, CA, USA.
BMC Genetics (Impact Factor: 2.4). 06/2010; 11(1):49. DOI: 10.1186/1471-2156-11-49
Source: PubMed


As computational power improves, the application of more advanced machine learning techniques to the analysis of large genome-wide association (GWA) datasets becomes possible. While most traditional statistical methods can only elucidate main effects of genetic variants on risk for disease, certain machine learning approaches are particularly suited to discovering higher-order and non-linear effects. One such approach is the Random Forests (RF) algorithm. The use of RF for SNP discovery related to human disease has grown in recent years; however, most work has been restricted to small datasets or simulation studies.
Using a multiple sclerosis (MS) case-control dataset comprising 300 K SNP genotypes across the genome, we outline an approach and some considerations for optimally tuning the RF algorithm based on the empirical dataset. Importantly, results show that typical default parameter values are not appropriate for large GWA datasets. Furthermore, gains can be made by sub-sampling the data, pruning based on linkage disequilibrium (LD), and removing strong effects from RF analyses. The new RF results are compared to findings from the original MS GWA study and demonstrate overlap. In addition, RF analysis identifies four interesting new candidate MS genes, MPHOSPH9, CTNNA3, PHACTR2 and IL7, which warrant further follow-up in independent studies.
This study presents one of the first illustrations of successfully analyzing GWA data with a machine learning algorithm. It is shown that RF is computationally feasible for GWA data and the results obtained make biologic sense based on previous studies. More importantly, new genes were identified as potentially being associated with MS, suggesting new avenues of investigation for this complex disease.
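The LD-based pruning step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' procedure: it is a minimal greedy pruner written for illustration, assuming genotypes coded as 0/1/2 minor-allele counts, using the squared Pearson correlation between genotype vectors as a stand-in for LD r², and using hypothetical values for `r2_threshold` and `window`.

```python
def r_squared(a, b):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # monomorphic SNP: correlation undefined, treated as no LD
    return cov * cov / (var_a * var_b)


def ld_prune(genotypes, r2_threshold=0.8, window=50):
    """Greedy pruning over SNPs in genomic order.

    A SNP is kept unless its r^2 with an already-kept SNP located within
    the previous `window` positions reaches `r2_threshold`.
    Both parameter values are illustrative assumptions.
    Returns the indices of the SNPs that are kept.
    """
    kept = []
    for j, snp in enumerate(genotypes):
        recent = [k for k in kept if j - k <= window]
        if all(r_squared(snp, genotypes[k]) < r2_threshold for k in recent):
            kept.append(j)
    return kept


# Toy data: one list per SNP, one entry per individual.
# SNP 1 is a perfect copy of SNP 0; SNP 2 is only weakly correlated with it.
snps = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 2, 0],
]
kept = ld_prune(snps)
print(kept)
```

In a real analysis the pruned SNP set, rather than the full panel, would then be passed to the RF algorithm, which reduces the number of near-duplicate predictors competing for the same splits.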

Available from: Adele Cutler
    • "RF has been widely used for modeling complex joint and interactive associations between response and multiple features [12,32,33,53]. In particular, many nice properties of RF make it an extremely attractive tool for genome studies: the data structure of response and features can be a mixture of categorical and continuous variables; it can nonparametrically incorporate complex nonlinear associations between feature and response; it can implicitly incorporate joint and unknown complex interactions among a large number of features (higher orders or any structure); it is able to handle big data with a large number of features but limited sample size; it can implicitly accommodate highly correlated features; it is less prone to over-fitting; it has good predictive performance even when the majority of features are noise; it is invariant to monotone transformations of the features; it is robust to changes in its tuning parameters; it performs internal estimation of error, so does not need to assess classification performance by cross-validation, and hence greatly reduces computational time [13,32,53,54]. Using an ensemble method (also called committee method), RF creates multiple classification and regression trees (CARTs)."
    ABSTRACT: Background: Genome-wide association studies (GWAS) interrogate the whole genome at large scale to characterize the complex genetic architecture of biomedical traits. When the number of SNPs increases dramatically to half a million but the sample size is still limited to thousands, the traditional p-value based statistical approaches suffer from unprecedented limitations. Feature screening has proved to be an effective and powerful approach to handle ultrahigh dimensional data statistically, yet it has not received much attention in GWAS. Feature screening reduces the feature space from millions to hundreds by removing non-informative noise. However, the univariate measures used to rank features are mainly based on individual effects, without considering mutual interactions with other features. In this article, we explore the performance of a random forest (RF) based feature screening procedure to emphasize the SNPs that have complex effects for a continuous phenotype. Results: Both simulation and real data analysis are conducted to examine the power of the forest-based feature screening. We compare it with five other popular feature screening approaches via simulation and conclude that RF can serve as a decent feature screening tool to accommodate complex genetic effects such as nonlinear, interactive, correlative, and joint effects. Unlike the traditional p-value based Manhattan plot, we use the Permutation Variable Importance Measure (PVIM) to display the relative significance and believe that it will provide as much useful information as the traditional plot. Conclusion: Most complex traits are found to be regulated by epistatic and polygenic variants. The forest-based feature screening is proven to be an efficient, easily implemented, and accurate approach to cope with whole-genome data with complex structures. Our explorations should add to the growing body of work on feature screening that better serves the demands of contemporary genome data.
    Full-text · Article · Dec 2015 · BMC Genetics
    • "Among them, the application of random forests (RFs) to the discovery of SNPs related to human diseases has grown in recent years. [8] The importance of empirical power studies based on realistic datasets is fully acknowledged (http://www.gaworkshop.org/)."
    ABSTRACT: It is generally acknowledged that most complex diseases are affected in part by interactions between genes and/or between genes and environmental factors. Taking into account environmental exposures and their interactions with genetic factors in genome-wide association studies (GWAS) can help to identify high-risk subgroups in the population and provide a better understanding of the disease. For this reason, many methods have been developed to detect gene-environment (G×E) interactions. Despite this, few loci that interact with environmental exposures have been identified so far. Indeed, the modest effect of G×E interactions as well as confounding factors entail low statistical power to detect such interactions. Another potential obstacle to detecting G×E interaction is the fact that true exposure is seldom observed: indeed, only proxy effects are measured in general. Furthermore, power studies used to evaluate a new method often are done through simulations that give an advantage to the new approach over the other methods. Methods: In this work, we compare the relative performance of popular methods such as PLINK, random forests and linear mixed models to detect G×E interactions in the particular scenario where the causal exposure (E) is unknown and only proxy covariates are observed. For this purpose, we provide an adapted simulated dataset and apply a recently introduced method for H1 simulations called waffect. Results: When the causal environmental exposure is unobserved and only a proxy of this exposure is observed, all the methods considered fail to detect G×E interaction. Conclusions: The hidden causal exposure is an obstacle to detecting G×E interaction in GWAS and the approaches considered in our power study all have insufficient power to detect the strong simulated interaction.
    Preview · Article · Dec 2015
    • "However, epistatic effects (gene-gene interactions) as well as gene-environment interactions are thought to play key roles in determining phenotype and there are a number of models of interaction among SNPs known as epistasis where the individual (or main) effect of each SNP might be small but in combination, the effect is large [32]. There are a number of possible models of epistatic interaction in [23] and a number of methods used for the discovery of these models: • Regression based methods such as logistic regression [16] and penalized regression [24]; • Decision tree [3], [20], [31], [11], [29]; • Multifactor dimensionality reduction [25], [14]; • Combinatorial partitioning [21]; • Restricted Partitioning Method [8]. Here, a global search based algorithm, ant colony optimisation is used to derive near-optimal decision tree interactions between a number of SNPs."
    ABSTRACT: In this paper ant colony optimisation is used to derive near-optimal interactions between a number of single nucleotide polymorphisms. This approach is used to discover small numbers of single nucleotide polymorphisms that are combined into a decision tree or contingency table model. It is shown that these two models can be highly discriminatory from a statistical perspective and a number of the single nucleotide polymorphisms discovered have been identified previously in large genome-wide association studies.
    Full-text · Conference Paper · Dec 2014
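The permutation variable importance idea referred to in the first citing abstract above (PVIM), together with the epistatic setting described in the third, can be sketched in a few lines. This is a generic, model-agnostic illustration rather than any of the cited authors' implementations: Random Forests usually compute this measure on out-of-bag samples of a fitted forest, whereas here a hand-written interaction rule (`epistatic_rule`, a hypothetical stand-in for a fitted model) is scored on the full toy dataset. All names, the sample size, and the seeds are assumptions.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows for which the model's prediction matches the label."""
    correct = sum(1 for row, y in zip(rows, labels) if predict(row) == y)
    return correct / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [row[j] for row in rows]
            rng.shuffle(column)  # break the link between feature j and the label
            permuted = [row[:j] + [v] + row[j + 1:] for row, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, permuted, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def epistatic_rule(row):
    """Hypothetical stand-in for a fitted model: a pure interaction between
    SNP 0 and SNP 1 (carrier of exactly one of the two), ignoring SNP 2."""
    return int((row[0] > 0) != (row[1] > 0))


# Toy genotypes (0/1/2) for three SNPs; SNP 2 is pure noise.
data_rng = random.Random(1)
rows = [[data_rng.randint(0, 2) for _ in range(3)] for _ in range(200)]
labels = [epistatic_rule(row) for row in rows]

importances = permutation_importance(epistatic_rule, rows, labels)
print(importances)
```

Because the rule never reads SNP 2, shuffling that column leaves every prediction unchanged and its importance is exactly zero, while the two interacting SNPs each receive a positive score even though neither determines the label on its own.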