A structured approach to predictive modeling of a two-class problem using multidimensional data sets

Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA
Methods (Impact Factor: 3.65). 01/2013; 61(1). DOI: 10.1016/j.ymeth.2013.01.002
Source: PubMed


Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post-hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
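The sequence described in the abstract (pre-processing, feature reduction, supervised classification, model selection) can be sketched in a few lines of standard-library Python. This is only an illustrative sketch under stated assumptions, not the authors' pipeline. The synthetic data, the mean-difference feature ranking, the nearest-centroid classifier, and all function names (`make_data`, `fit`, `predict`, `cross_validated_accuracy`) are hypothetical stand-ins chosen for brevity. The point it tries to show is that scaling and feature reduction are fitted inside each cross-validation fold, so the held-out samples never influence which features are kept.

```python
import random
import statistics


def make_data(n_per_class=30, n_features=50, n_informative=3, shift=2.0, seed=0):
    """Synthetic two-class data (an assumption for illustration only).

    Only the first n_informative features differ between the classes.
    """
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            if label == 1:
                for j in range(n_informative):
                    row[j] += shift
            X.append(row)
            y.append(label)
    return X, y


def fit(X, y, k):
    """Pre-process, reduce to k features, and fit, using training data only."""
    n_features = len(X[0])
    cols = [[row[j] for row in X] for j in range(n_features)]
    # Pre-processing: per-feature mean and standard deviation for scaling.
    means = [statistics.mean(c) for c in cols]
    sds = [statistics.pstdev(c) or 1.0 for c in cols]  # guard against zero spread
    Z = [[(row[j] - means[j]) / sds[j] for j in range(n_features)] for row in X]

    def class_mean(j, label):
        return statistics.mean([z[j] for z, lab in zip(Z, y) if lab == label])

    # Feature reduction: keep the k features with the largest absolute
    # difference between standardized class means.
    ranked = sorted(range(n_features),
                    key=lambda j: abs(class_mean(j, 0) - class_mean(j, 1)),
                    reverse=True)
    selected = ranked[:k]
    # Classifier: one centroid per class over the selected features.
    centroids = {lab: [class_mean(j, lab) for j in selected] for lab in (0, 1)}
    return {"means": means, "sds": sds, "selected": selected, "centroids": centroids}


def predict(model, row):
    """Assign a sample to the class with the nearest centroid."""
    z = [(row[j] - model["means"][j]) / model["sds"][j] for j in model["selected"]]

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(z, model["centroids"][label]))

    return min((0, 1), key=squared_distance)


def cross_validated_accuracy(X, y, k, n_folds=5, seed=0):
    """Estimate accuracy with n_folds-fold cross-validation.

    Scaling and feature selection are refitted on each training split.
    """
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_folds] for i in range(n_folds)]
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], k)
        correct += sum(predict(model, X[i]) == y[i] for i in fold)
    return correct / len(X)


X, y = make_data()
# Model selection: compare candidate numbers of retained features.
for k in (3, 10, 50):
    print(k, cross_validated_accuracy(X, y, k))
```

Comparing the cross-validated accuracy across values of `k` is one simple way to choose how aggressively to reduce features. On real data, the ranking statistic and the classifier would be replaced by whatever the exploration step suggests is appropriate.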

Available from: Hyunsu Ju, Oct 02, 2015
  • ABSTRACT: Asthma is an idiopathic disease characterized by episodic inflammation and reversible airway obstruction triggered by exposure to environmental agents. Because this disease is heterogeneous in onset, exacerbations, inflammatory states, and response to therapy, there is intense interest in developing personalized approaches to its management. This review focuses on the recognition that a component of the pathophysiology of asthma is mediated by inflammation, which has implications for understanding its etiology and individualizing its therapy. Although we understand how Th2 polarization mediates asthma exacerbations triggered by aeroallergen exposure, we do not yet fully understand how RNA virus infections produce asthmatic exacerbations. This review summarizes the explosion of information that has revealed how patterns produced by RNA virus infection trigger the innate immune response (IIR) in sentinel airway cells. When the IIR is triggered, these cells elaborate inflammatory cytokines and protective mucosal interferons whose actions activate long-lived adaptive immunity and limit organismal replication. Recent work has shown the multifaceted way that dysregulation of the IIR is linked to viral-induced exacerbation, steroid insensitivity, and T helper polarization of adaptive immunity. New developments in quantitative proteomics now enable accurate identification of subgroups of individuals that demonstrate activation of the IIR ("innate endotype"). Potential applications to clinical research are proposed. Together, these developments open realistic prospects for how identification of the IIR endotype may inform asthma therapy in the future.
    Current Allergy and Asthma Reports 06/2013; 13(5). DOI:10.1007/s11882-013-0363-y · 2.77 Impact Factor
  • ABSTRACT: While acting upon chromatin compaction, histone post-translational modifications (PTMs) are involved in modulating gene expression through histone-DNA affinity and protein-protein interactions. These dynamic and environment-sensitive modifications are constitutive of the histone code that reflects the transient transcriptional state of the chromatin. Here we describe a global screening approach for revealing epigenetic disruption at the histone level. This original approach enables fast and reliable relative abundance comparison of histone PTMs and variants in human cells within a single LC-MS experiment. As a proof of concept, we exposed BeWo human choriocarcinoma cells to sodium butyrate (SB), a universal histone deacetylase (HDAC) inhibitor. Histone acid-extracts (n = 45) equally representing 3 distinct classes, Control, 1 mM and 2.5 mM SB, were analysed using ultra-performance liquid chromatography coupled with a hybrid quadrupole time-of-flight mass spectrometer (UPLC-QTOF-MS). Multivariate statistics allowed us to discriminate control from treated samples based on differences in their mass spectral profiles. Several acetylated and methylated forms of core histones emerged as markers of sodium butyrate treatment. Indeed, this untargeted histonomic approach could be a useful exploratory tool in many cases of xenobiotic exposure when histone code disruption is suspected.
    Molecular BioSystems 08/2014; 10(11). DOI:10.1039/c4mb00395k · 3.21 Impact Factor
  • ABSTRACT: Dengue virus (DENV) infection is a significant risk to over a third of the human population and causes a wide spectrum of illness, ranging from sub-clinical disease to an intermediate syndrome of vascular complications called dengue fever complicated (DFC) and the severe dengue hemorrhagic fever (DHF). Methods for discriminating outcomes will impact clinical trials and understanding of disease pathophysiology. We integrated a proteomics discovery pipeline with a heuristics approach to develop a molecular classifier to identify an intermediate phenotype of DENV-3 infectious outcome. 121 differentially expressed proteins were identified in plasma from DHF vs dengue fever (DF), and informative candidates were selected using nonparametric statistics. These were combined with markers that measure complement activation, acute phase response, cellular leak, granulocyte differentiation and viral load. From this, we applied quantitative proteomics to select a 15-member panel of proteins that accurately predicted DF, DHF, and DFC using a random forest classifier. The classifier primarily relied on acute phase (A2M), complement (CFD), platelet counts and cellular leak (TPM4) to produce an 86% accuracy of prediction with an area under the receiver operating characteristic curve of >0.9 for DHF and DFC vs DF. Integrating discovery and heuristic approaches to sample distinct pathophysiological processes is a powerful approach in infectious disease. Early detection of intermediate outcomes of DENV-3 will speed clinical trials evaluating vaccines or drug interventions. Copyright © 2015 Elsevier B.V. All rights reserved.
    Journal of clinical virology: the official publication of the Pan American Society for Clinical Virology 03/2015; 64. DOI:10.1016/j.jcv.2015.01.011 · 3.02 Impact Factor