Random forest models to predict aqueous solubility

Unilever Centre for Molecular Science Informatics, Department of Chemistry, University of Cambridge, Lensfield Road, Cambridge CB2 1EW, UK.
Journal of Chemical Information and Modeling (Impact Factor: 4.07). 01/2007; 47(1):150-8. DOI: 10.1021/ci060164k
Source: PubMed

ABSTRACT Random Forest regression (RF), Partial-Least-Squares (PLS) regression, Support Vector Machines (SVM), and Artificial Neural Networks (ANN) were used to develop QSPR models for the prediction of aqueous solubility, based on experimental data for 988 organic molecules. The Random Forest regression model predicted aqueous solubility more accurately than those created by PLS, SVM, and ANN and offered methods for automatic descriptor selection, an assessment of descriptor importance, and an in-parallel measure of predictive ability, all of which serve to recommend its use. The prediction of log molar solubility for an external test set of 330 molecules that are solid at 25 °C gave r² = 0.89 and RMSE = 0.69 log S units. For a standard data set selected from the literature, the model performed well with respect to other documented methods. Finally, the diversity of the training and test sets is compared to the chemical space occupied by molecules in the MDL Drug Data Report, on the basis of molecular descriptors selected by the regression analysis.
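The two test-set statistics quoted in the abstract can be illustrated with a minimal sketch. It assumes that r² denotes the squared Pearson correlation between observed and predicted log S; the paper may define it differently. The function names and the four log S values are hypothetical, invented purely for illustration, and are not data from the study.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error, in the same units as the inputs (here log S)."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values.

    Assumption: this is one common reading of "r2"; other definitions exist.
    """
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov * cov / (var_o * var_p)


# Hypothetical log S values, not taken from the paper.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.0, -2.5, -4.0]

print(f"RMSE = {rmse(observed, predicted):.3f} log S units")
print(f"r2   = {r_squared(observed, predicted):.3f}")
```

Because RMSE is expressed in log S units, an error of 0.69 corresponds to roughly a factor of five (10^0.69) in molar solubility.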

Available from: Robert Glen, Oct 21, 2014
  • ABSTRACT: Nitrogen (N) is one of the most important limiting nutrients for sugarcane production. Conventionally, sugarcane N concentration is examined using direct methods such as collecting leaf samples from the field followed by analytical assays in the laboratory. These methods do not offer real-time, quick, and non-destructive strategies for estimating sugarcane N concentration. Methods that take advantage of remote sensing, particularly hyperspectral data, can present reliable techniques for predicting sugarcane leaf N concentration. Hyperspectral data are extremely large and of high dimensionality. Many hyperspectral features are redundant due to the strong correlation between adjacent wavebands. Hence, the analysis of hyperspectral data is complex and needs to be simplified by selecting the most relevant spectral features. The aim of this study was to explore the potential of a random forest (RF) regression algorithm for selecting spectral features in hyperspectral data necessary for predicting sugarcane leaf N concentration. To achieve this, two Hyperion images were captured from fields of 6–7 month-old sugarcane, variety N19. The machine-learning RF algorithm was used as a feature-selection and regression method to analyse the spectral data. Stepwise multiple linear (SML) regression was also examined to predict the concentration of sugarcane leaf N after the reduction of the redundancy in hyperspectral data. The results showed that sugarcane leaf N concentration can be predicted using both non-linear RF regression (coefficient of determination, R² = 0.67; root mean square error of validation, RMSEV = 0.15%; 8.44% of the mean) and SML regression models (R² = 0.71; RMSEV = 0.19%; 10.39% of the mean) derived from the first-order derivative of reflectance. It was concluded that the RF regression algorithm has potential for predicting sugarcane leaf N concentration using hyperspectral data.
    International Journal of Remote Sensing 01/2013; 34(2):712-728. DOI:10.1080/01431161.2012.713142 · 1.36 Impact Factor
  • 10/2008, Degree: PhD, Supervisor: João Aires-de-Sousa, Fernando FM Silva Fernandes, Filomena FM Freitas
  • ABSTRACT: Drug design is a process driven by technological breakthroughs involving advanced experimental and computational methods. Nowadays, drug design methods are of paramount importance for predicting biological profiles, identifying hits, generating leads, and accelerating the optimization of leads into drug candidates. Quantitative structure–activity relationship (QSAR) has served as a valuable predictive tool in the design of pharmaceuticals and agrochemicals. For decades, QSAR methods have been applied to relate the properties of chemical substances to their biological activities, in order to obtain reliable statistical models for predicting the activities of new chemical entities. Classical QSAR studies include ligands with their binding sites, inhibition constants, rate constants, and other biological end points, in addition to molecular properties such as lipophilicity, polarizability, electronic, and steric properties or with certain structural features. 3D-QSAR has emerged as a natural extension to the classical Hansch and Free–Wilson approaches; it exploits the three-dimensional properties of the ligands to predict their biological activities using robust chemometric techniques such as PLS, G/PLS, and ANN. This paper provides an overview of QSAR methods developed on the basis of one to six dimensions and their approaches. In particular, we present various dimensional QSAR approaches, such as comparative molecular field analysis (CoMFA), comparative molecular similarity analysis, Topomer CoMFA, self-organizing molecular field analysis, comparative molecule/pseudo receptor interaction analysis, comparative molecular active site analysis, FLUFF-BALL, 4D-QSAR, and G-QSAR approaches.
    Medicinal Chemistry Research 12/2014; 23(12):1-17. DOI:10.1007/s00044-014-1072-3 · 1.61 Impact Factor