Use of support vector machines for disease risk prediction in genome-wide association studies: Concerns and opportunities

Center for Bioinformatics Tuebingen (ZBIT), University of Tuebingen, Tübingen, Germany.
Human Mutation (Impact Factor: 5.14). 12/2012; 33(12). DOI: 10.1002/humu.22161
Source: PubMed


The success of genome-wide association studies (GWAS) in deciphering the genetic architecture of complex diseases has raised the expectation that individual risk can also be quantified from that genetic architecture. So far, disease risk prediction based on top-validated single-nucleotide polymorphisms (SNPs) has shown little predictive value. Here, we applied a support vector machine (SVM) to Parkinson disease (PD) and type 1 diabetes (T1D) to show that, apart from the magnitude of the effect sizes of risk variants, the heritability of the disease also plays an important role in disease risk prediction. Furthermore, we performed a simulation study to show the role of uncommon (frequency 1-5%) as well as rare (frequency <1%) variants in the etiology of complex diseases. Using a cross-validation model, we achieved predictions with an area under the receiver operating characteristic curve (AUC) of ∼0.88 for T1D, highlighting its strong heritable component (∼90%). In contrast, for PD we were unable to achieve a satisfactory prediction (AUC ∼0.56; heritability ∼38%). Our simulations showed that simultaneous inclusion of uncommon and rare variants in GWAS would eventually lead to feasible disease risk prediction for complex diseases such as PD. The software used is available at
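The AUC values quoted in the abstract can be read as the probability that a randomly chosen case receives a higher risk score than a randomly chosen control. The following is a minimal sketch of that definition in plain Python, not the authors' pipeline: the function name `roc_auc` and the example scores and labels are hypothetical and chosen only for illustration, and no SVM is trained here.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via the Mann-Whitney rank statistic.

    Returns the fraction of (case, control) pairs in which the case has
    the higher score; tied pairs count as half a win.
    Labels: 1 = case, 0 = control.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")

    wins = 0.0
    for case_score in cases:
        for control_score in controls:
            if case_score > control_score:
                wins += 1.0
            elif case_score == control_score:
                wins += 0.5
    return wins / (len(cases) * len(controls))


# Hypothetical classifier outputs for three cases and three controls
# (illustrative numbers only, not data from the study).
scores = [0.9, 0.8, 0.35, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(roc_auc(scores, labels), 3))  # 8 of 9 pairs ranked correctly -> 0.889
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why ∼0.56 for PD is considered unsatisfactory while ∼0.88 for T1D is not. The pairwise loop is quadratic in sample size; for GWAS-scale cohorts a rank-based implementation would be preferable.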



Available from: Manu Sharma, Oct 09, 2015
  • Source
    • "Until now, the underlying genetic and molecular mechanisms of PD have not been completely understood. Mittag et al. showed in their study that it is not possible to predict the disease risk for PD with top-validated single-nucleotide polymorphisms, although such a prediction is possible for type 1 diabetes [8]. Thus, in the case of PD, genetic markers alone cannot explain the disease outbreak. "
    ABSTRACT: Parkinson's disease is an age-related disease whose pathogenesis is not completely known. Animal models exist for investigating the disease but not all results can be easily transferred to humans. Therefore, mathematical or probabilistic models for the human disease are to be constructed in silico in order to predict specific processes within a cell, such as the dopamine metabolism and transport processes in a neuron. We present a Systems Biology Markup Language (SBML) model of a whole dopaminergic nerve cell consisting of 139 reactions and 111 metabolites which includes, among others, the dopamine metabolism and transport, oxidative stress, aggregation of alpha-synuclein (alphaSYN), lysosomal and proteasomal degradation, and mitophagy. The predictive power of the model was investigated using flux balance analysis for the identification of steady model states. To this end, we performed six experiments: (i) investigation of the normal cell behavior, (ii) increase of O2, (iii) increase of ATP, (iv) influence of neurotoxins, (v) increase of alphaSYN in the cell, and (vi) increase of dopamine synthesis. The SBML model is available in the BioModels database with identifier MODEL1302200000. It is possible to simulate the normal behavior of an in vivo nerve cell with the developed model. We show that the model is sensitive to neurotoxins and oxidative stress. Further, an increased level of alphaSYN induces apoptosis and an increased flux of alphaSYN to the extracellular space was observed.
    BMC Neuroscience 11/2013; 14(1):136. DOI:10.1186/1471-2202-14-136 · 2.67 Impact Factor
  • Source
    ABSTRACT: It is important to correctly and efficiently map drugs and enzymes to their possible interaction network in modern drug research. In this work, a novel approach was introduced to encode drug and enzyme molecules with physicochemical molecular descriptors and pseudo amino acid composition, respectively. Based on this encoding method, Random Forest was adopted to build the drug-enzyme interaction network. After selecting the optimal features that are able to represent the main factors of drug-enzyme interaction in our prediction, a total of 129 features were obtained, which can be clustered into nine categories: Elemental Analysis, Geometry, Chemistry, Amino Acid Composition, Secondary Structure, Polarity, Molecular Volume, Codon Diversity and Electrostatic Charge. It was further found that Geometry features were the most important of all the features. As a result, our prediction model achieved an MCC of 0.915 and a Sensitivity of 87.9% at a Specificity of 99.8% in the 10-fold cross-validation test, and an MCC of 0.895 and a Sensitivity of 95.7% at a Specificity of 95.4% in the independent set test. This article is part of a Special Issue entitled: Computational Proteomics, Systems Biology & Clinical Implications.
    Biochimica et Biophysica Acta 07/2013; 1844(1). DOI:10.1016/j.bbapap.2013.07.008 · 4.66 Impact Factor
  • Source
    ABSTRACT: Objective: Genome-wide association studies (GWAS) and subsequently their meta-analyses have changed the landscape of rheumatoid arthritis (RA) genetics by uncovering several novel genes. Such studies are heavily weighted by samples from Caucasian populations but they explain only a small proportion of total heritability. Our previous studies in the genetically distinct north Indian (NI) RA cohorts have demonstrated apparent allelic/genetic heterogeneity between NI and the West, warranting GWAS in non-European populations. Methods: High-quality genotypes for over 600,000 SNPs in 706 RA cases and 761 controls from NI were generated in the discovery phase. 12 SNPs showing suggestive association (P<5×10⁻⁵) were then tested in an independent cohort of 927 cases and 1148 controls. Additional disease-associated loci were determined by support vector machine (SVM) analyses. Fine mapping of the novel locus was done by employing imputation. Results: In addition to the expected association of the HLA locus with RA, we identified association with a novel intronic SNP of ARL15 [rs255758; chromosome 5; Pcombined=6.57E-06; OR=1.42]. Genotype-phenotype correlation by assaying adiponectin levels demonstrated the functional significance of this novel gene in disease pathogenesis. SVM analysis confirmed this association along with a few more replication phase genes. Conclusion: In this first GWAS of RA among NI, ARL15 emerged as a novel genetic risk factor in addition to the classical HLA locus, suggestive of contributions of population-specific as well as shared genetic loci between Asian and European populations to RA etiology. Further, our study reveals the potential of machine learning methods in unraveling gene-gene interactions using genome-wide data. © 2013 American College of Rheumatology.
    Arthritis & Rheumatology 08/2013; DOI:10.1002/art.38110 · 7.76 Impact Factor