Random forest models to predict aqueous solubility

Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
Journal of Chemical Information and Modeling (Impact Factor: 4.07). 01/2007; 47(1):150-8. DOI: 10.1021/ci060164k
Source: PubMed

ABSTRACT Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
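The abstract reports model quality as r² and RMSE in log S units. As a minimal sketch of how these two figures can be computed from observed and predicted values, the Python below uses only the standard library. It assumes r² means the squared Pearson correlation, which the abstract does not specify, and the function names and the four log S values are hypothetical, invented purely for illustration.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation (an assumed definition of r²)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


# Hypothetical log S values, not taken from the paper.
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed_log_s, predicted_log_s):.3f} log S units")
print(f"r^2  = {r_squared(observed_log_s, predicted_log_s):.3f}")
```

Because RMSE is expressed in the units of the predicted quantity, a value of 0.69 log S units corresponds to a typical error of a factor of roughly five in molar solubility.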

  • ABSTRACT: The research in the group for computational chemistry at the ETH Zurich focuses on the development of methods and software for classical molecular dynamics simulations and cheminformatics, and their application to biological and chemical questions. Here, important advances and challenges in these subfields of computational chemistry are reviewed and potential opportunities for cross-fertilization are outlined.
    CHIMIA International Journal for Chemistry 09/2014; 68(9). DOI:10.2533/chimia.2014.620 · 1.09 Impact Factor
  • ABSTRACT: Accurate estimation of the canopy chlorophyll content of a crop is essential for crop production. Ground-based hyperspectral datasets were obtained under a wide range of plant and environmental conditions in Jilin using Analytical Spectral Devices (ASD) spectroradiometers, and canopy chlorophyll content was measured with a Soil and Plant Analyzer Development (SPAD)-502 meter. The objective of this study is to determine the most suitable input variables for estimating canopy chlorophyll content with a Random Forest model. On the basis of a comprehensive analysis of the spectral data, the RF model is explored to provide an accurate and robust assessment of canopy chlorophyll content (CCC). The correlation coefficient (R²) of the second RF model between the measured and predicted chlorophyll content is 0.82, and the root mean square error (RMSE) is 12.5738, which is better than the first RF model and the other indices.
    2013 10th International Conference on Fuzzy Systems and Knowledge Discovery (FSKD); 07/2013
  • ABSTRACT: Arsenic stress induces subtle changes in the canopy chlorophyll content (CCC), which are reflected in the crop spectrum. Therefore, establishing a spectral index that is sensitive to subtle changes in the CCC is important for monitoring crop arsenic contamination over large areas by remote sensing. Experimental sites with three contamination levels were selected in Chang Chun City and Jilin City, Jilin Province, China. This study created a new index to monitor the CCC. The results from the index were then compared with those from other indices and from the random forest model. The aim of this study is to find an optimal index that is sensitive to small changes in the CCC under arsenic stress, for monitoring regional CCC in rice. The results indicate that the distribution of the CCC is aligned with the distribution of the arsenic stress level and that NVI (R640, R732, and R752) is the best index for monitoring CCC. The correlation coefficient (R²) between the values predicted using NVI and the measured values of canopy chlorophyll content is 0.898, which is better than the random forest model and the other indices.
    International Journal of Applied Earth Observation and Geoinformation 03/2015; 36. DOI:10.1016/j.jag.2014.10.017 · 2.54 Impact Factor