Article

Statistical external validation and consensus modeling: A QSPR case study for K-oc prediction

Università degli Studi dell'Insubria, Varese, Lombardy, Italy
Journal of Molecular Graphics and Modelling (Impact Factor: 2.02). 04/2007; 25(6):755-66. DOI: 10.1016/j.jmgm.2006.06.005
Source: PubMed

ABSTRACT The soil sorption partition coefficient (log K(oc)) of a heterogeneous set of 643 organic non-ionic compounds, spanning a range of more than 6 log units, is predicted by a statistically validated QSAR modeling approach. The applied multiple linear regression (ordinary least squares, OLS) is based on a variety of theoretical molecular descriptors selected by the genetic algorithm-variable subset selection (GA-VSS) procedure. The models were validated for predictivity by different internal and external validation approaches. For external validation we applied self-organizing maps (SOM) to split the original data set: the best four-dimensional model, developed on a reduced training set of 93 chemicals, has a predictivity of 78% when applied to 550 validation chemicals (prediction set). The selected molecular descriptors, which could be interpreted through their mechanistic meaning, were compared with the more common physico-chemical descriptors log K(ow) and log S(w). The chemical applicability domain of each model was verified by the leverage approach so that only reliable predictions are proposed. The best predicted data were obtained by consensus modeling from 10 different models in the genetic algorithm model population.
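The workflow summarized in the abstract (OLS models, a leverage check of the applicability domain, external predictivity, and consensus prediction) can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the descriptor subsets in `subsets` are hypothetical stand-ins for models from the genetic algorithm population, the training/prediction split is a plain slice rather than a SOM-based split, the leverage cut-off h* = 3(p + 1)/n is a commonly used convention that the abstract does not state, and the external Q² formula (referenced to the training-set mean) and the simple averaging used for the consensus are likewise assumed.

```python
import numpy as np

def design(X):
    """Prepend an intercept column to a descriptor matrix."""
    return np.column_stack([np.ones(len(X)), X])

def fit_ols(X, y):
    """Ordinary least squares fit; returns coefficients (intercept first)."""
    coef, *_ = np.linalg.lstsq(design(X), y, rcond=None)
    return coef

def predict(coef, X):
    return design(X) @ coef

def leverages(X_train, X_query):
    """Leverage h_i = x_i^T (A^T A)^-1 x_i of each query compound,
    where A is the training design matrix."""
    A, Q = design(X_train), design(X_query)
    G = np.linalg.inv(A.T @ A)
    return np.einsum("ij,jk,ik->i", Q, G, Q)

def q2_ext(y_true, y_pred, y_train_mean):
    """External explained variance (assumed definition, training-mean reference)."""
    press = np.sum((y_true - y_pred) ** 2)
    return 1.0 - press / np.sum((y_true - y_train_mean) ** 2)

# Synthetic stand-in data: 643 "compounds", 6 descriptors (all values invented).
rng = np.random.default_rng(0)
X = rng.normal(size=(643, 6))
y = 2.0 + X @ np.array([0.9, -0.6, 0.4, 0.3, 0.2, 0.1]) + 0.3 * rng.normal(size=643)

# 93 training compounds, 550 prediction compounds (plain slice, not a SOM split).
X_tr, y_tr, X_te, y_te = X[:93], y[:93], X[93:], y[93:]

# Hypothetical four-descriptor models standing in for the GA model population.
subsets = [(0, 1, 2, 3), (0, 1, 2, 4), (1, 2, 3, 5)]

preds, in_domain = [], np.ones(len(X_te), dtype=bool)
for cols in subsets:
    cols = list(cols)
    coef = fit_ols(X_tr[:, cols], y_tr)
    preds.append(predict(coef, X_te[:, cols]))
    # Assumed warning leverage: h* = 3 (p + 1) / n_train.
    h_star = 3 * (len(cols) + 1) / len(X_tr)
    in_domain &= leverages(X_tr[:, cols], X_te[:, cols]) <= h_star

# Consensus prediction: unweighted mean over the individual models.
consensus = np.mean(preds, axis=0)

print("Q2_ext per model:", [round(q2_ext(y_te, p, y_tr.mean()), 3) for p in preds])
print("Q2_ext consensus:", round(q2_ext(y_te, consensus, y_tr.mean()), 3))
print("compounds inside every model's domain:", int(in_domain.sum()))
```

In this sketch a compound is kept only if its leverage is below the cut-off in every model; restricting reported predictions to `consensus[in_domain]` mirrors the idea of proposing only data that fall inside the applicability domain.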

Available from: Paola Gramatica, Jun 19, 2014
  • Source
    ABSTRACT: Since chemicals' ecotoxic effects depend for most soil species on the dissolved concentration in pore water, the equilibrium partitioning (EP) method is generally used to estimate hazardous concentrations (HC50) in the soil from aquatic toxicity tests. The present study analyzes the statistical uncertainty in terrestrial HC50s derived by the EP-method. For 47 organic chemicals, we compared freshwater HC50s derived from standard aquatic ecotoxicity tests with porewater HC50s derived from terrestrial ecotoxicity tests. Statistical uncertainty in the HC50s due to limited species sample size and in organic carbon-water partitioning coefficients due to predictive error was treated with probability distributions propagated by Monte Carlo simulations. Particularly for specifically acting chemicals, it is very important to base the HC50 on a representative sample of species, composed of both target and non-target species. For most chemical groups, porewater HC50 values were approximately a factor of 3 higher than freshwater HC50 values. The ratio of the porewater HC50/freshwater HC50 was typically 3.0 for narcotic chemicals (2.8 for nonpolar and 3.4 for polar narcotics), 0.8 for reactive chemicals, 2.9 for neurotoxic chemicals (4.3 for AChE agents and 0.1 for the cyclodiene type), and 2.5 for herbicides-fungicides. However, the statistical uncertainty associated with this ratio was large (typically 2.3 orders of magnitude). For 81% of the organic chemicals studied, there was no statistical difference between the hazardous concentration of aquatic and terrestrial species. We conclude that possible systematic deviations between the HC50s of aquatic and terrestrial species appear to be less prominent than the overall statistical uncertainty.
    Chemosphere 06/2013; 93(2). DOI:10.1016/j.chemosphere.2013.05.007 · 3.50 Impact Factor
  • Source
    ABSTRACT: in determining their bioaccessibility/bioavailability and their fate and transport in aquatic environments. We here explore and evaluate the use of a state-of-the-art data analysis technique (Projection to Latent Structures, PLS) to estimate log S of environmentally relevant chemicals. A large number (n = 624) of molecular descriptors was computed for over 1400 organic chemicals, and then refined by a feature selection technique. Candidate predictor descriptors were fitted to data by means of PLS, which was optimized by an internal leave-one-out cross-validation technique and validated by an external data set. The final (best) PLS model with only four variables (AlogP, X1sol, Mv, and E) exhibited noteworthy stability and good predictive power. It was able to explain 91% of the data (n = 1400) variance with an average absolute error of 0.5 log units across solubilities spanning more than 12 orders of magnitude. The newly proposed model is transparent, easily portable from one user to another, and robust enough to accurately estimate log S of a wide range of emerging contaminants.
    Water Research 06/2013; DOI:10.1016/j.watres.2013.06.011 · 5.32 Impact Factor
  • Source
    ABSTRACT: A Quantitative Structure-Property Relationships (QSPRs) study for the prediction of the environmental persistence of a set of 250 heterogeneous organic compounds is here presented. Three a priori defined classes of environmental persistence were generated, by Hierarchical Cluster Analysis, from the combination of half-life data in air, water, soil and sediment available for all the studied compounds. QSPR classification models were successfully developed using different techniques (k-NN, CART and CP-ANN) and three interpretable theoretical molecular descriptors. Robust external validation was provided by statistical splitting and also on completely new data. The good performances of all these models were compared and their structural domains were analyzed. The analysis of the errors highlights a slight tendency of persistence overestimation, misclassifying chemicals from a lower to a higher class of persistence, in line with the precautionary principle. Finally, the reliability of the proposed QSPR models was verified further with new data from the literature. The structure-based classification models, applicable for the prediction of potential persistence of heterogeneous organic compounds, could be useful as preliminary support tools for the identification and prioritization of new potential POPs among already existing chemicals as well as "screening prior to synthesis" procedures to avoid the production, and consequent release into the environment, of new POPs.
    Journal of molecular graphics & modelling 04/2008; 27(1):59-65. DOI:10.1016/j.jmgm.2008.02.004 · 2.02 Impact Factor