Random Forest Models To Predict Aqueous Solubility

Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
Journal of Chemical Information and Modeling (Impact Factor: 3.74). 01/2007; 47(1):150-8. DOI: 10.1021/ci060164k
Source: PubMed


Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN. It also offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
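
The two figures of merit quoted in the abstract can be illustrated with a short sketch. It assumes that r² denotes the coefficient of determination (the paper may instead use the squared Pearson correlation) and that RMSE is taken over observed and predicted log S values. The function names and the four data points are made up for illustration and are not from the paper.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    squared_errors = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared_errors / n)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (assumed definition)."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical log S values, invented purely to exercise the functions.
observed_log_s = [-1.0, -2.0, -3.0, -4.0]
predicted_log_s = [-1.5, -2.0, -2.5, -4.0]

print("RMSE:", rmse(observed_log_s, predicted_log_s))
print("r2:", r_squared(observed_log_s, predicted_log_s))
```

Because RMSE is expressed in the units of the predicted quantity, a value of 0.69 log S units corresponds to a typical error of a little under one order of magnitude in molar solubility.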

  • Source
    • "In each tree, the ensemble predicts the data that are not in the tree, and by calculating the difference in the mean square errors between the OOB (out of bag) data and the data that were used to grow the regression trees, the RF algorithm gives the OOB error of each variable. (Breiman, 2001a,b,c; Maindonald and Braun, 2006; Prasad et al., 2006; Palmer et al., 2007). Modeling RF could balance the bias of different sets, and improve performance by tuning a few parameters; in addition, even the defaults present a high performance. "
    ABSTRACT: Arsenic stress induces subtle changes in the canopy chlorophyll content (CCC). Therefore, the establishment of a spectral index that is sensitive to subtle changes in the CCC is important for monitoring crop arsenic contamination in large areas by remote sensing. Experimental sites with three contamination levels were selected and were located in Chang Chun City, Jilin City, Jilin Province, China. Arsenic stress can induce small changes in the CCC, which are reflected in the crop spectrum. This study created a new index to monitor the CCC. Then, the results from the index were compared with those from other indices and from the random forest model. The final purpose of this study is to find an optimal index that is sensitive to small changes in the CCC under arsenic stress, for monitoring regional CCC in rice. The results indicate that the distribution of the CCC is aligned with the distribution of the arsenic stress level and that NVI (R640, R732, and R752) is the best index for monitoring CCC. The correlation coefficient R2 between the values predicted using NVI and the measured values of canopy chlorophyll content is 0.898, which is better than the values obtained with the random forest model and the other indices.
    International Journal of Applied Earth Observation and Geoinformation 03/2015; 36. DOI:10.1016/j.jag.2014.10.017 · 3.47 Impact Factor
  • Source
    • "Random forests [36] are general classification and regression algorithms and they are well adapted to dependent input data. They have already been successfully applied to numerous problems [37] [38], including compound classification [39]. Here, the 32 compounds represent only around 7% of the mechanically stable half-Heuslers. "
    ABSTRACT: The lattice thermal conductivity (\kappa_{\omega}) is a key property for many potential applications of compounds. Discovery of materials with very low or high \kappa_{\omega} remains an experimental challenge due to high costs and time-consuming synthesis procedures. High-throughput computational pre-screening is a valuable approach for significantly reducing the set of candidate compounds. In this article, we introduce efficient methods for reliably estimating the bulk \kappa_{\omega} for a large number of compounds. The algorithms are based on a combination of machine-learning algorithms, physical insights, and automatic ab-initio calculations. We scanned approximately 79,000 half-Heusler entries in the database. Among the 450 mechanically stable ordered semiconductors identified, we find that \kappa_{\omega} spans more than two orders of magnitude, a much larger range than previously thought. \kappa_{\omega} is lowest for compounds whose elements in equivalent positions have large atomic radii. We then perform a thorough screening of thermodynamical stability that allows us to reduce the list to 77 systems. We can then provide a quantitative estimate of \kappa_{\omega} for this selected range of systems. Three semiconductors having \kappa_{\omega} < 5 W/(m K) are proposed for further experimental study.
    Physical Review X 01/2014; 4(1). DOI:10.1103/PhysRevX.4.011019 · 9.04 Impact Factor
  • Source
    • "As we see from Figure 4, the sensitivity of the classification did not significantly change once ntree>20. The number of variables randomly sampled as candidates at each split (mtry) was chosen as the square root of the number of features (262 in our case), hence mtry was set to 16. Palmer et al. [23] and Liaw et al. [24] also reported that RF is usually insensitive to the training parameters. "
    ABSTRACT: We present a method utilizing the Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare. We employed the National Inpatient Sample (NIS) data, which is publicly available through HCUP, to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process. In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.
    BMC Medical Informatics and Decision Making 07/2011; 11(1):51. DOI:10.1186/1472-6947-11-51 · 1.83 Impact Factor
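
The last entry above mentions two concrete procedures: choosing mtry as the square root of the number of features, and building fully balanced sub-samples by repeated random sub-sampling. A minimal sketch of both follows. It assumes binary labels coded as 0 and 1, and it assumes that each sub-sample keeps every minority-class record and draws an equal number of majority-class records at random; the cited paper may partition the data differently. The function names are hypothetical.

```python
import math
import random


def mtry_sqrt(n_features):
    """Square-root rule for mtry, rounded down to an integer."""
    return math.isqrt(n_features)


def balanced_subsamples(labels, n_subsamples, seed=0):
    """Return index lists, each with equal numbers of the two classes.

    Assumed scheme: keep all minority-class indices and draw the same
    number of majority-class indices without replacement, once per sub-sample.
    """
    rng = random.Random(seed)
    positives = [i for i, y in enumerate(labels) if y == 1]
    negatives = [i for i, y in enumerate(labels) if y == 0]
    if len(positives) <= len(negatives):
        minority, majority = positives, negatives
    else:
        minority, majority = negatives, positives
    subsamples = []
    for _ in range(n_subsamples):
        drawn = rng.sample(majority, len(minority))
        subsamples.append(sorted(minority + drawn))
    return subsamples


# 262 features, as in the excerpt, gives mtry = 16.
print(mtry_sqrt(262))

# Toy imbalanced label vector: 3 positives, 10 negatives (invented data).
toy_labels = [1] * 3 + [0] * 10
for indices in balanced_subsamples(toy_labels, n_subsamples=4):
    print(indices)
```

One classifier would then be trained per sub-sample and the predictions combined, which is what makes the approach an ensemble; that training step is omitted here.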