A structured approach to predictive modeling of a two-class problem using multidimensional data sets

Department of Preventive Medicine and Community Health, University of Texas Medical Branch (UTMB), Galveston, TX, USA
Methods (Impact Factor: 3.65). 01/2013; 61(1). DOI: 10.1016/j.ymeth.2013.01.002
Source: PubMed


Biological experiments in the post-genome era can generate a staggering amount of complex data that challenges experimentalists to extract meaningful information. Increasingly, the success of an appropriately controlled experiment relies on a robust data analysis pipeline. In this paper, we present a structured approach to the analysis of multidimensional data that relies on a close, two-way communication between the bioinformatician and experimentalist. A sequential approach employing data exploration (visualization, graphical and analytical study), pre-processing, feature reduction and supervised classification using machine learning is presented. This standardized approach is illustrated by an example from a proteomic data analysis that has been used to predict the risk of infectious disease outcome. Strategies for model selection and post-hoc model diagnostics are presented and applied to the case illustration. We discuss some of the practical lessons we have learned applying supervised classification to multidimensional data sets, one of which is the importance of feature reduction in achieving optimal modeling performance.
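The sequence described above (pre-processing, feature reduction, then supervised classification) can be illustrated with a minimal sketch. This is not the authors' pipeline: the synthetic data, the Welch t-statistic ranking, the nearest-centroid classifier, and all function names are assumptions chosen only to keep the example self-contained in the Python standard library. The one design point it is meant to show is that scaling and feature selection are refit inside each cross-validation fold, so the held-out sample never influences which features are kept.

```python
import random
import statistics


def make_data(n_per_class=20, n_features=50, n_informative=5, shift=2.0, seed=0):
    """Synthetic two-class data (hypothetical): only the first
    n_informative features differ between the classes."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            X.append([rng.gauss(shift * label if j < n_informative else 0.0, 1.0)
                      for j in range(n_features)])
            y.append(label)
    return X, y


def standardize(train, test):
    """Pre-processing: z-score each feature using training statistics only."""
    cols = list(zip(*train))
    means = [statistics.fmean(c) for c in cols]
    sds = [statistics.stdev(c) or 1.0 for c in cols]  # guard constant features

    def scale(rows):
        return [[(v - m) / s for v, m, s in zip(r, means, sds)] for r in rows]

    return scale(train), scale(test)


def top_features(X, y, k):
    """Feature reduction: keep the k features with the largest
    absolute Welch t-statistic between the two classes."""
    scores = []
    for j in range(len(X[0])):
        a = [r[j] for r, lab in zip(X, y) if lab == 0]
        b = [r[j] for r, lab in zip(X, y) if lab == 1]
        se = (statistics.variance(a) / len(a)
              + statistics.variance(b) / len(b)) ** 0.5
        t = abs(statistics.fmean(a) - statistics.fmean(b)) / (se or 1.0)
        scores.append((t, j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def nearest_centroid_predict(X, y, x_new):
    """Supervised classification: assign the class whose centroid is closest."""
    best = None
    for label in (0, 1):
        rows = [r for r, lab in zip(X, y) if lab == label]
        centroid = [statistics.fmean(c) for c in zip(*rows)]
        dist = sum((v - c) ** 2 for v, c in zip(x_new, centroid))
        if best is None or dist < best[0]:
            best = (dist, label)
    return best[1]


def loo_accuracy(X, y, k):
    """Leave-one-out cross-validation; scaling and feature selection
    are refit on the training part of every fold."""
    correct = 0
    for i in range(len(X)):
        X_tr, y_tr = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        X_tr, X_te = standardize(X_tr, [X[i]])
        keep = top_features(X_tr, y_tr, k)
        X_tr_k = [[r[j] for j in keep] for r in X_tr]
        x_te_k = [X_te[0][j] for j in keep]
        correct += nearest_centroid_predict(X_tr_k, y_tr, x_te_k) == y[i]
    return correct / len(X)


if __name__ == "__main__":
    X, y = make_data()
    for k in (5, 50):
        print(f"top {k} features: LOO accuracy = {loo_accuracy(X, y, k):.2f}")
```

Running the script compares a reduced feature set against the full set on the same synthetic data; on real data the number of retained features would itself be a tuning choice made during model selection.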



Available from: Hyunsu Ju
  • ABSTRACT: Asthma is an idiopathic disease characterized by episodic inflammation and reversible airway obstruction triggered by exposure to environmental agents. Because this disease is heterogeneous in onset, exacerbations, inflammatory states, and response to therapy, there is intense interest in developing personalized approaches to its management. Of focus in this review, the recognition that a component of the pathophysiology of asthma is mediated by inflammation has implications for understanding its etiology and individualizing its therapy. Despite understanding how Th2 polarization mediates asthma exacerbations by aeroallergen exposure, we do not yet fully understand how RNA virus infections produce asthmatic exacerbations. This review will summarize the explosion of information that has revealed how patterns produced by RNA virus infection trigger the innate immune response (IIR) in sentinel airway cells. When the IIR is triggered, these cells elaborate inflammatory cytokines and protective mucosal interferons whose actions activate long-lived adaptive immunity and limit organismal replication. Recent work has shown the multifaceted way that dysregulation of the IIR is linked to viral-induced exacerbation, steroid insensitivity, and T helper polarization of adaptive immunity. New developments in quantitative proteomics now enable accurate identification of subgroups of individuals that demonstrate activation of IIR ("innate endotype"). Potential applications to clinical research are proposed. Together, these developments open realistic prospects for how identification of the IIR endotype may inform asthma therapy in the future.
    Current Allergy and Asthma Reports 06/2013; 13(5). DOI:10.1007/s11882-013-0363-y · 2.77 Impact Factor
  • ABSTRACT: Molecular classification using robust biochemical measurements provides a level of diagnostic precision that is unattainable using indirect phenotypic measurements. Multidimensional measurements of proteins, genes, or metabolites (analytes) can identify subtle differences in the pathophysiology of patients with asthma in a way that is not otherwise possible using physiological or clinical assessments. We overview a method for relating biochemical analyte measurements to generate predictive models of discrete (categorical) clinical outcomes, a process referred to as "supervised classification." We consider problems inherent in wide (small n and large p) high-dimensional data, including the curse of dimensionality, collinearity and lack of information content. We suggest methods for reducing the data to the most informative features. We describe different approaches for phenotypic modeling, using logistic regression, classification and regression trees, random forest and nonparametric regression spline modeling. We provide guidance on post hoc model evaluation and methods to evaluate model performance using ROC curves and generalized additive models. The application of validated predictive models for outcome prediction will significantly impact the clinical management of asthma.
    Advances in Experimental Medicine and Biology 01/2014; 795:273-88. DOI:10.1007/978-1-4614-8603-9_17 · 1.96 Impact Factor
  • ABSTRACT: While acting upon chromatin compaction, histone post-translational modifications (PTMs) are involved in modulating gene expression through histone-DNA affinity and protein-protein interactions. These dynamic and environment-sensitive modifications are constitutive of the histone code that reflects the transient transcriptional state of the chromatin. Here we describe a global screening approach for revealing epigenetic disruption at the histone level. This original approach enables fast and reliable relative abundance comparison of histone PTMs and variants in human cells within a single LC-MS experiment. As a proof of concept, we exposed BeWo human choriocarcinoma cells to sodium butyrate (SB), a universal histone deacetylase (HDAC) inhibitor. Histone acid-extracts (n = 45) equally representing 3 distinct classes, Control, 1 mM and 2.5 mM SB, were analysed using ultra-performance liquid chromatography coupled with a hybrid quadrupole time-of-flight mass spectrometer (UPLC-QTOF-MS). Multivariate statistics allowed us to discriminate control from treated samples based on differences in their mass spectral profiles. Several acetylated and methylated forms of core histones emerged as markers of sodium butyrate treatment. Indeed, this untargeted histonomic approach could be a useful exploratory tool in many cases of xenobiotic exposure when histone code disruption is suspected.
    Molecular BioSystems 08/2014; 10(11). DOI:10.1039/c4mb00395k · 3.21 Impact Factor