ABSTRACT: A modified random forest (RF) algorithm was developed as a novel machine learning technique to estimate the maximum recommended daily dose (MRDD) for phase I human trials across a large and diverse pharmaceutical dataset, using substructure fingerprint descriptors calculated from molecular structure alone. This type of molecular descriptor encodes a molecule's structure as a series of binary bits, each representing the presence or absence of a particular substructure, and can therefore accurately and directly capture local structural information about the molecule. Two model validation approaches, 5-fold cross-validation and an independent validation set, were used to assess the prediction capability of the models. The modified RF gave a prediction accuracy of 80.45%, sensitivity of 75.08%, and specificity of 84.85% for 5-fold cross-validation, and a prediction accuracy of 80.5%, sensitivity of 76.47%, and specificity of 83.48% for the independent validation set; taken as a whole, these results are better than those of the original RF. In addition, the important substructure fingerprints identified by the RF technique gave some insight into the structural features related to the toxicity of pharmaceuticals, which could help provide medicinal chemists with an intuitive understanding.
Analytica chimica acta 04/2011; 692(1-2):50-6. · 4.31 Impact Factor
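
The abstract reports accuracy, sensitivity, and specificity under 5-fold cross-validation. As a minimal sketch of how such figures are conventionally computed for a binary classifier, the Python below derives the three metrics from confusion-matrix counts and splits sample indices into five folds. It is not the authors' code: the function names (`confusion_counts`, `classification_metrics`, `k_fold_indices`), the choice of 1 as the positive class, the contiguous unshuffled folds, and the toy labels are all assumptions made for illustration, and the toy numbers are unrelated to the paper's data.

```python
from typing import Dict, List, Sequence, Tuple


def confusion_counts(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (tp, tn, fp, fn) for binary labels, treating 1 as the positive class (an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute accuracy, sensitivity (true positive rate), and specificity (true negative rate)."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    total = tp + tn + fp + fn
    nan = float("nan")  # returned when a denominator would be zero
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,
        "specificity": tn / (tn + fp) if (tn + fp) else nan,
    }


def k_fold_indices(n_samples: int, k: int = 5) -> List[List[int]]:
    """Split range(n_samples) into k contiguous folds whose sizes differ by at most one.

    Real studies usually shuffle (and often stratify) before splitting;
    this hypothetical helper skips that to stay deterministic.
    """
    base, extra = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


if __name__ == "__main__":
    # Toy labels invented for illustration only.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
    print(classification_metrics(y_true, y_pred))
    print([len(fold) for fold in k_fold_indices(12, k=5)])
```

In a cross-validation run, each fold would serve once as the held-out set while the model is trained on the other four, and the metrics would then be computed over the held-out predictions.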