Statistical external validation and consensus modeling: A QSPR case study for KOC prediction

Università degli Studi dell'Insubria, Varese, Lombardy, Italy
Journal of Molecular Graphics and Modelling (Impact Factor: 1.72). 04/2007; 25(6):755-66. DOI: 10.1016/j.jmgm.2006.06.005
Source: PubMed


The soil sorption partition coefficient (log Koc) of a heterogeneous set of 643 organic non-ionic compounds, spanning a range of more than 6 log units, is predicted by a statistically validated QSAR modeling approach. The applied multiple linear regression (ordinary least squares, OLS) is based on a variety of theoretical molecular descriptors selected by the genetic algorithms-variable subset selection (GA-VSS) procedure. The models were validated for predictivity by different internal and external validation approaches. For external validation we applied self-organizing maps (SOM) to split the original data set: the best four-dimensional model, developed on a reduced training set of 93 chemicals, has a predictivity of 78% when applied to 550 validation chemicals (prediction set). The selected molecular descriptors, which can be interpreted through their mechanistic meaning, were compared with the more common physico-chemical descriptors log Kow and log Sw. The chemical applicability domain of each model was verified by the leverage approach so that only reliable predictions are proposed. The best predictions were obtained by consensus modeling from 10 different models in the genetic algorithm model population.
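
The leverage approach mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, and the warning threshold h* = 3(p + 1)/n (p descriptors, n training chemicals) is a commonly used convention that the abstract itself does not state. The leverage of a chemical with descriptor vector x is h = x^T (X^T X)^-1 x, where X is the training descriptor matrix with an intercept column.

```python
def solve(a, b):
    """Solve the linear system a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular matrix: descriptors are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * w for v, w in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def leverages(train, query):
    """Return h = x^T (X^T X)^-1 x for every row of `query`.

    `train` and `query` are lists of descriptor rows; an intercept column
    of ones is added here, matching an OLS model with a constant term.
    """
    X = [[1.0] + list(row) for row in train]
    k = len(X[0])
    xtx = [[sum(r[i] * r[j] for r in X) for j in range(k)] for i in range(k)]
    result = []
    for row in query:
        x = [1.0] + list(row)
        z = solve(xtx, x)  # z = (X^T X)^-1 x, without forming the inverse
        result.append(sum(xi * zi for xi, zi in zip(x, z)))
    return result


def warning_leverage(n_train, n_descriptors):
    """Assumed threshold h* = 3 (p + 1) / n; predictions with h > h* are extrapolations."""
    return 3.0 * (n_descriptors + 1) / n_train


if __name__ == "__main__":
    train = [[0.0], [1.0], [2.0], [3.0]]  # toy one-descriptor training set
    h_star = warning_leverage(len(train), 1)
    for x, h in zip([0.0, 1.0, 10.0], leverages(train, [[0.0], [1.0], [10.0]])):
        status = "inside domain" if h <= h_star else "extrapolation"
        print(f"x={x:5.1f}  h={h:6.3f}  h*={h_star:.2f}  {status}")
```

In this toy case the leverages are 0.7, 0.3, and 14.7 against a threshold of 1.5, so only the query at x = 10 is flagged as an extrapolation. For the four-descriptor model trained on 93 chemicals, the same convention would give h* = 3 × 5 / 93 ≈ 0.16.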



Available from: Paola Gramatica, Jun 19, 2014
    • "In fact chemicals within this domain can be considered as reliable predicted data. In the case of structurally dissimilar molecules, the data predicted by the model must be judged as extrapolations [46] [55] [56]. To visualize the DOA of the model the Williams plot, the plot of standardized cross-validated residuals versus leverage (Hat diagonal) values (h), is used. "
    ABSTRACT: Configuring a radial basis function neural network (RBFN) consists of optimizing the architecture and the network parameters (centers, widths, and weights). Center selection methods include the genetic algorithm (GA), K-means, and cluster analysis (CA). In most reports on RBFN modeling, the optimum centers are selected from among the rows of the descriptor matrix. A combination of RBFN and GA is introduced for better description of quantitative structure-property relationship (QSPR) models. In this method, centers are not restricted to rows of the independent-variable matrix and can be located at any point of the sample space. In the proposed approach, initial centers are randomly selected from the calibration set. The GA then moves the initially selected centers to find their optimum positions in the whole space of the scores matrix, in order to obtain the highest prediction ability. This approach is called whole space GA-RBFN (wsGA-RBFN) and is applied to predict the adsorption coefficients (logk) of 40 small molecules on the surface of multi-walled carbon nanotubes (MWCNTs). The data consist of five solute descriptors [R, π, α, β, V] of the molecules and are referred to as data set1. The prediction ability of wsGA-RBFN is compared with that of GA-RBFN and MLR models. The Q2 values obtained for wsGA-RBFN, GA-RBFN and MLR are 0.95, 0.85, and 0.78, respectively, which shows the merit of wsGA-RBFN. The method is also applied to the logarithm of surface area normalized adsorption coefficients (logKSA) of organic compounds (OCs) on the MWCNT surface. Data set2 includes 69 aromatic molecules with 13 physicochemical properties of the OCs. Thirty-nine of these molecules were similar to those of data set1, and the others were aromatic compounds including both small and large molecules. The prediction ability of wsGA-RBFN for the second data set was compared with that of GA-RBF. The Q2 values obtained for wsGA-RBFN and GA-RBF are 0.89 and 0.80, respectively.
    Article · Oct 2015 · Journal of Molecular Structure
    • "In addition, to assess the risk of chance correlation, permutation test (y scrambling) was performed (Gramatica et al. 2007). In this way, the dependent variable of model (pIC50 of different peptides) was randomly shuffled 50 times and maximum correlation coefficient of cross-validation in permutation test (Q2MP) was calculated which was 0.14. "
    ABSTRACT: The treatment of infections caused by multidrug-resistant bacteria and fungi is a particular challenge. Whereas cationic antimicrobial peptides (CAPs) are considered promising drug candidates for the treatment of such superbugs, recent studies have focused on the design of peptides with increased bioavailability and stability against proteases. Meanwhile, quantitative structure-activity relationship (QSAR) studies, which provide information on the activities of CAPs based on descriptors for each individual amino acid, have become indispensable. However, satisfactory results from a QSAR model depend strongly on the choice of amino acid descriptors and on the mathematical strategy used to relate the descriptors to the CAPs' activity. In this study, the quantitative sequence-activity modeling (QSAM) of 60 CAPs derived from the O-W-F-I-F-H(1-Bzl)-NH2 sequence, which showed excellent activities against a broad range of hazardous microorganisms (e.g., MRSA, MRSE, E. coli and C. albicans), is discussed. The peptides contained natural and non-natural amino acids (AAs) of both the D and L isomers. A segmented principal component strategy was applied to the structural descriptors of the AAs to extract AA indices. Our results showed that the constructed models covered more than 82, 94, 80 and 78 % of the cross-validated variance of the C. albicans, MRSA, MRSE and E. coli data sets, respectively. The results were also used to determine which AAs are significant for CAP activity. To the best of our knowledge, this is the first successful attempt at QSAM of peptides containing both natural and non-natural AAs of both the L and D isomers.
    Article · Oct 2014 · Amino Acids
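
The permutation test (y-scrambling) described in the excerpt above can be sketched as follows. This is an illustrative toy, not the cited authors' implementation: it assumes a single-descriptor linear model, leave-one-out cross-validation, and Q2 defined as 1 - PRESS / total sum of squares; all function names and the data are hypothetical. The response is shuffled 50 times, as in the excerpt, and the largest cross-validated Q2 among the scrambled models is compared with the Q2 of the real model.

```python
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def q2_loo(xs, ys):
    """Leave-one-out cross-validated Q2 = 1 - PRESS / SS_total."""
    press = 0.0
    for i in range(len(xs)):
        # Refit without observation i, then predict it.
        slope, intercept = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        press += (ys[i] - (slope * xs[i] + intercept)) ** 2
    mean_y = sum(ys) / len(ys)
    return 1.0 - press / sum((y - mean_y) ** 2 for y in ys)


def y_scrambling(xs, ys, n_perm=50, seed=0):
    """Return the LOO Q2 of n_perm models fitted to randomly shuffled responses."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    scores = []
    for _ in range(n_perm):
        shuffled = ys[:]  # copy, so the caller's data are left untouched
        rng.shuffle(shuffled)
        scores.append(q2_loo(xs, shuffled))
    return scores


if __name__ == "__main__":
    xs = [float(i) for i in range(12)]
    # Toy response: a linear trend plus a small alternating perturbation.
    ys = [2.0 * x + 1.0 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
    scrambled = y_scrambling(xs, ys)
    print("Q2 (real model):     ", round(q2_loo(xs, ys), 3))
    print("max Q2 (scrambled y):", round(max(scrambled), 3))
```

A real model whose Q2 is far above the maximum obtained on scrambled responses is unlikely to be the result of chance correlation; a multi-descriptor model would replace `fit_line` with a multiple regression but keep the same shuffling loop.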
    • "Since Van Zelm et al. (2009) demonstrated that uncertainty in the HC50 drastically decreases when data on three instead of two test species are available, we used HC50 values based on toxicity data for at least three test species for porewater as well as for freshwater. The input for the QSAR that was used to predict Koc, i.e. the descriptor values, was taken from the Supporting Information of Gramatica et al. (2007) or, if not available, calculated with DRAGON software version 5.5 (Todeschini et al., 2007). For the log Koc values and more details on the QSAR, we refer to the Supporting Information. "
    ABSTRACT: Since chemicals' ecotoxic effects depend for most soil species on the dissolved concentration in pore water, the equilibrium partitioning (EP) method is generally used to estimate hazardous concentrations (HC50) in the soil from aquatic toxicity tests. The present study analyzes the statistical uncertainty in terrestrial HC50s derived by the EP-method. For 47 organic chemicals, we compared freshwater HC50s derived from standard aquatic ecotoxicity tests with porewater HC50s derived from terrestrial ecotoxicity tests. Statistical uncertainty in the HC50s due to limited species sample size and in organic carbon-water partitioning coefficients due to predictive error was treated with probability distributions propagated by Monte Carlo simulations. Particularly for specifically acting chemicals, it is very important to base the HC50 on a representative sample of species, composed of both target and non-target species. For most chemical groups, porewater HC50 values were approximately a factor of 3 higher than freshwater HC50 values. The ratio of the porewater HC50/freshwater HC50 was typically 3.0 for narcotic chemicals (2.8 for nonpolar and 3.4 for polar narcotics), 0.8 for reactive chemicals, 2.9 for neurotoxic chemicals (4.3 for AChE agents and 0.1 for the cyclodiene type), and 2.5 for herbicides-fungicides. However, the statistical uncertainty associated with this ratio was large (typically 2.3 orders of magnitude). For 81% of the organic chemicals studied, there was no statistical difference between the hazardous concentration of aquatic and terrestrial species. We conclude that possible systematic deviations between the HC50s of aquatic and terrestrial species appear to be less prominent than the overall statistical uncertainty.
    Article · Jun 2013 · Chemosphere