Statistical external validation and consensus modeling: A QSPR case study for KOC prediction

Università degli Studi dell'Insubria, Varese, Lombardy, Italy
Journal of Molecular Graphics and Modelling (Impact Factor: 1.72). 04/2007; 25(6):755-66. DOI: 10.1016/j.jmgm.2006.06.005
Source: PubMed


The soil sorption partition coefficient (log K(oc)) of a heterogeneous set of 643 non-ionic organic compounds, spanning a range of more than 6 log units, is predicted by a statistically validated QSAR modeling approach. The applied multiple linear regression (ordinary least squares, OLS) is based on a variety of theoretical molecular descriptors selected by the genetic algorithm-variable subset selection (GA-VSS) procedure. The models were validated for predictivity by different internal and external validation approaches. For external validation we applied self-organizing maps (SOM) to split the original data set: the best four-dimensional model, developed on a reduced training set of 93 chemicals, has a predictivity of 78% when applied to 550 validation chemicals (prediction set). The selected molecular descriptors, which can be interpreted through their mechanistic meaning, were compared with the more common physico-chemical descriptors log K(ow) and log S(w). The chemical applicability domain of each model was verified by the leverage approach so that only reliable predictions are proposed. The best predicted data were obtained by consensus modeling from 10 different models in the genetic algorithm model population.
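
To make the leverage check and the consensus step in the abstract more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that each model is an OLS regression on its own subset of descriptor columns, that the consensus prediction is the plain arithmetic mean of the individual predictions, and that a query chemical is inside a model's applicability domain when its leverage does not exceed the commonly used warning value h* = 3(p + 1)/n (p descriptors, n training chemicals). The abstract does not state the averaging scheme or the cutoff, and the function names and the `subsets` argument are hypothetical.

```python
import numpy as np


def design(X):
    """Prepend an intercept column of ones to the descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])


def fit_ols(X, y):
    """Ordinary least squares fit; returns coefficients, intercept first."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef


def predict(coef, X):
    """Predicted response for each row of X."""
    return design(X) @ coef


def leverages(X_train, X_query):
    """Leverage h = x^T (A^T A)^-1 x of each query row, where A is the
    training design matrix (intercept included)."""
    A, Q = design(X_train), design(X_query)
    G = np.linalg.pinv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, G, Q)


def consensus_predict(X_train, y_train, X_query, subsets):
    """Average the predictions of one OLS model per descriptor subset.

    Returns the mean prediction and a boolean mask that is True where the
    query chemical lies inside the leverage domain of every model.
    ASSUMPTION: cutoff h* = 3(p + 1)/n, not taken from the source.
    """
    n = len(X_train)
    preds, inside = [], []
    for cols in subsets:
        cols = list(cols)
        coef = fit_ols(X_train[:, cols], y_train)
        preds.append(predict(coef, X_query[:, cols]))
        h_star = 3 * (len(cols) + 1) / n
        h = leverages(X_train[:, cols], X_query[:, cols])
        inside.append(h <= h_star)
    return np.mean(preds, axis=0), np.all(inside, axis=0)
```

With a training matrix of 93 rows and, for example, ten four-column subsets, `consensus_predict` returns one averaged log K(oc) estimate per query chemical together with a flag marking chemicals whose prediction would be an extrapolation for at least one model. Requiring every model to accept the chemical is a conservative choice made for this sketch; a majority rule would be an equally plausible alternative.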



Available from: Paola Gramatica, Jun 19, 2014
    • "In addition, to assess the risk of chance correlation, permutation test (y scrambling) was performed (Gramatica et al. 2007). In this way, the dependent variable of model (pIC50 of different peptides) was randomly shuffled 50 times and maximum correlation coefficient of cross-validation in permutation test (Q²(MP)) was calculated which was 0.14. "
    ABSTRACT: The treatment of infections caused by multidrug-resistant bacteria and fungi is a particular challenge. Whereas cationic antimicrobial peptides (CAPs) are considered promising drug candidates for the treatment of such superbugs, recent studies have focused on the design of peptides with increased bioavailability and stability against proteases. In this context, quantitative structure-activity relationship (QSAR) studies, which provide information on the activities of CAPs based on descriptors for each individual amino acid, are indispensable. However, satisfactory results from a QSAR model depend highly on the choice of amino acid descriptors and the mathematical strategy used to relate the descriptors to the CAPs' activity. In this study, the quantitative sequence-activity modeling (QSAM) of 60 CAPs derived from the O-W-F-I-F-H(1-Bzl)-NH2 sequence, which showed excellent activities against a broad range of hazardous microorganisms (e.g., MRSA, MRSE, E. coli and C. albicans), is discussed. The peptides contained natural and non-natural amino acids (AAs) of both D and L isomers. A segmented principal component strategy was applied to the structural descriptors of the AAs to extract AA indices. Our results showed that the constructed models covered more than 82, 94, 80 and 78 % of the cross-validated variance of the C. albicans, MRSA, MRSE and E. coli data sets, respectively. The results were also used to determine the AAs that are significant for CAP activity. To the best of our knowledge, this is the first successful attempt at QSAM of peptides containing both natural and non-natural AAs of both L and D isomers.
    Amino Acids 10/2014; 47:125-134. DOI:10.1007/s00726-014-1850-8 · 3.29 Impact Factor
    • "Since Van Zelm et al. (2009) demonstrated that uncertainty in the HC50 drastically decreases when data on three instead of two test species are available, we used HC50 values based on toxicity data for at least three test species for porewater as well as for freshwater. The input for the QSAR that was used to predict K(oc), i.e. the descriptor values, was taken from the Supporting Information of Gramatica et al. (2007) or, if not available, calculated with DRAGON software version 5.5 (Todeschini et al., 2007). For the log K(oc) values and more details on the QSAR, we refer to the Supporting Information. "
    ABSTRACT: Since chemicals' ecotoxic effects depend for most soil species on the dissolved concentration in pore water, the equilibrium partitioning (EP) method is generally used to estimate hazardous concentrations (HC50) in the soil from aquatic toxicity tests. The present study analyzes the statistical uncertainty in terrestrial HC50s derived by the EP-method. For 47 organic chemicals, we compared freshwater HC50s derived from standard aquatic ecotoxicity tests with porewater HC50s derived from terrestrial ecotoxicity tests. Statistical uncertainty in the HC50s due to limited species sample size and in organic carbon-water partitioning coefficients due to predictive error was treated with probability distributions propagated by Monte Carlo simulations. Particularly for specifically acting chemicals, it is very important to base the HC50 on a representative sample of species, composed of both target and non-target species. For most chemical groups, porewater HC50 values were approximately a factor of 3 higher than freshwater HC50 values. The ratio of the porewater HC50/freshwater HC50 was typically 3.0 for narcotic chemicals (2.8 for nonpolar and 3.4 for polar narcotics), 0.8 for reactive chemicals, 2.9 for neurotoxic chemicals (4.3 for AChE agents and 0.1 for the cyclodiene type), and 2.5 for herbicides-fungicides. However, the statistical uncertainty associated with this ratio was large (typically 2.3 orders of magnitude). For 81% of the organic chemicals studied, there was no statistical difference between the hazardous concentration of aquatic and terrestrial species. We conclude that possible systematic deviations between the HC50s of aquatic and terrestrial species appear to be less prominent than the overall statistical uncertainty.
    Chemosphere 06/2013; 93(2). DOI:10.1016/j.chemosphere.2013.05.007 · 3.34 Impact Factor
    • "Random forest is a representative algorithm that consists of many decision trees that vote to select class membership. Some authors also reported that ensemble methods have worked well in QSAR models [21,36] and microarray-based studies [22,23] with a small number of datasets, but a literature search did not produce any comprehensive evaluations of the utility of ensemble methods in microarray-based classifier development. The MAQC-II study participants did not determine a preferred approach to select a best model for each endpoint, leaving that selection as part of an individual team’s preference. "
    ABSTRACT: Genomic biomarkers play an increasing role in both preclinical and clinical application. Development of genomic biomarkers with microarrays is an area of intensive investigation. However, despite sustained and continuing effort, developing microarray-based predictive models (i.e., genomics biomarkers) capable of reliable prediction for an observed or measured outcome (i.e., endpoint) of unknown samples in preclinical and clinical practice remains a considerable challenge. No straightforward guidelines exist for selecting a single model that will perform best when presented with unknown samples. In the second phase of the MicroArray Quality Control (MAQC-II) project, 36 analysis teams produced a large number of models for 13 preclinical and clinical endpoints. Before external validation was performed, each team nominated one model per endpoint (referred to here as 'nominated models') from which MAQC-II experts selected 13 'candidate models' to represent the best model for each endpoint. Both the nominated and candidate models from MAQC-II provide benchmarks to assess other methodologies for developing microarray-based predictive models. We developed a simple ensemble method by taking a number of the top performing models from cross-validation and developing an ensemble model for each of the MAQC-II endpoints. We compared the ensemble models with both nominated and candidate models from MAQC-II using blinded external validation. For 10 of the 13 MAQC-II endpoints originally analyzed by the MAQC-II data analysis team from the National Center for Toxicological Research (NCTR), the ensemble models achieved equal or better predictive performance than the NCTR nominated models. Additionally, the ensemble models had performance comparable to the MAQC-II candidate models. Most ensemble models also had better performance than the nominated models generated by five other MAQC-II data analysis teams that analyzed all 13 endpoints. 
Our findings suggest that an ensemble method can often attain a higher average predictive performance in an external validation set than a corresponding "optimized" model method. Using an ensemble method to determine a final model is a potentially important supplement to the good modeling practices recommended by the MAQC-II project for developing microarray-based genomic biomarkers.
    BMC Bioinformatics 10/2011; 12 Suppl 10(Suppl 10):S3. DOI:10.1186/1471-2105-12-S10-S3 · 2.58 Impact Factor