SAR and QSAR in Environmental Research, Vol. 16, No. 4, August 2005, 339–347
An in silico ensemble method for lead discovery: decision forest
H. HONG†, W. TONG‡*, Q. XIE†, H. FANG† and R. PERKINS†
†Division of Bioinformatics, Z-Tech at National Center for Toxicological Research, U.S. Food and
Drug Administration, Jefferson, AR 72079, USA
‡Division of Systems Toxicology, Center for Toxicoinformatics, National Center for Toxicological
Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
(Received 8 December 2004; in final form 20 April 2005)
Recent progress in combinatorial chemistry and parallel synthesis has radically changed the approach to
drug discovery in the pharmaceutical industry. At present, thousands of compounds can be made in a
short period, creating a need for fast and effective in silico methods to select the most promising lead
candidates. Decision forest is a novel pattern recognition method, which combines the results of multiple
distinct but comparable decision tree models to reach a consensus prediction. In this article, a decision
forest model was developed using a structurally diverse training data set containing 232 compounds
whose estrogen receptor binding activity was tested at the U.S. Food and Drug Administration (FDA)’s
National Center for Toxicological Research (NCTR). The model was subsequently validated using a test
data set of 463 compounds selected from the literature, and then applied to a large data set with 57,145
compounds as a screening example. The results show that the decision forest method is a fast, reliable and
effective in silico approach, which could be useful in drug discovery.
Keywords: Decision forest; Lead discovery; Classification; Estrogen binding; QSAR
1. Introduction

The pharmaceutical industry has a growing and pressing need for accurate and economical
methods to identify new lead candidates and to optimize lead compounds during drug
development. In silico methods play an important role in this process. Common in silico
methods include, but are not limited to, pharmacophore identification followed by searching
against three-dimensional (3D) structure databases, virtual screening by docking small
molecules into a target enzyme for lead discovery, and quantitative structure–activity
relationship (QSAR) methods for lead optimization and absorption, distribution,
metabolism, excretion and toxicity (ADME/Tox) prediction.
*Corresponding author. Email: email@example.com
ISSN 1062-936X print/ISSN 1029-046X online © 2005 Taylor & Francis Group Ltd

Given the fact that the data derived from high-throughput screening are normally noisy and
categorical in nature, classification has become another important in silico approach for
discovering lead compounds in drug discovery. Supervised learning-based
classification methods require a training data set of compounds of known class, such as active
or inactive, or of multiple classes such as strong, medium, weak and inactive. A classification
model can be developed from the known compounds using molecular descriptors derived
from physicochemical and molecular structure properties, and the model can then be
used to predict the class membership of untested compounds.
One of the most common classification methods is linear discriminant analysis and its variants,
including quadratic discriminant analysis, diagonal linear discriminant analysis,
regularized discriminant analysis and flexible discriminant analysis. Other classification
approaches widely used and well supported in the literature include Bayes's rule, K-nearest
neighbors (KNN), soft independent modeling of class analogy (SIMCA), artificial
neural networks (ANN), support vector machines and decision trees (or recursive partitioning).
Among these classification methods, the decision tree methodology has been demonstrated to be
a computationally economical method for screening large data sets to provide potential
hits for in vitro or in vivo experimental evaluation [6,7]. Lim and Loh compared 22
decision tree methods with 9 statistical methods and 2 ANN methods using 32 data sets and
concluded that no statistical difference existed among the methods evaluated. In our
investigation of the structure–activity relationships of estrogens, we also found
that the decision tree method gives results comparable to KNN, SIMCA and ANN.
Based on the decision tree algorithm, a novel ensemble method named decision forest was
developed in our laboratory that combines the predictions from multiple independent decision
tree models to reach a better prediction. The method outperforms decision tree in both
training and validation. In this paper, we demonstrate through an example the potential and
importance of the decision forest method in drug discovery as well as in environmental research.
2. Materials and methods
2.1 Decision forest algorithm

There were two premises for decision forest: (1) each decision tree to be combined should be
of similar good quality to ensure an equal contribution to the combination; and (2) each decision
tree should be distinct and thus make a unique contribution to the pattern recognition of the
ensemble of trees. These criteria were used to develop the algorithm as described below:
1. The process is initiated with a predefined number (N) of decision tree models to be
combined. The choice of N depends on many factors, including the size of the data set,
the number and quality of the descriptors used, and the parameter settings in the tree
construction; finding it is a trial-and-error process. However, extensive evaluation revealed
that the best ensemble results are normally realized with N = 4 for most data sets we have
evaluated (results not shown).
2. Since the misclassifications of a tree depend on the quality of the data set, the minimum
number of misclassifications can be determined from the decision tree that is developed
without pruning. In this study, the decision tree model is developed using a variant of the
classification and regression tree (CART) method. In this method, the descriptors are
identified that best divide the samples in a node into two child nodes. The splits maximize the
homogeneity of samples in each node. The splitting continues on each child node until
compounds in each node are either in one class or can not be split further to improve the
quality of the tree. The number of misclassified samples (NMS) in the initial tree then
serves as a quality criterion to guide individual tree construction and pruning in the
following iterative steps 3–6.
3. An individual decision tree is constructed and pruned. The extent of pruning is determined
by the NMS. The pruned decision tree assigns a probability (0–1) for each sample in the
training data to each class.
4. The descriptors used in the previous decision tree model are removed from the original
descriptor pool, and the remaining descriptors are used to develop the next decision tree
model. Removal of the previously used descriptors increases the independence of the
individual tree models.
5. Steps 3 and 4 are repeated drawing from unused descriptors until no additional trees can be
developed with the defined NMS from the unused portion of the original descriptor pool.
6. If the total number of decision tree models is less than N, the NMS is incremented by one
and steps 3–5 are repeated until the number of trees is equal to or larger than N.
7. At the end of the process, the probabilities for a compound assigned by all decision tree
models are linearly combined, with the resultant mean probability taken as the classification
of the compound.

The algorithm was written in C.
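The combination at the heart of the method (steps 3–7 above) can be sketched as follows. This is an illustrative Python sketch, not the authors' C implementation; the toy "trees", descriptor names and thresholds are invented for demonstration:

```python
from statistics import mean

# Illustrative sketch of the ensemble step (steps 3-7 above): each fitted
# tree assigns a probability (0-1) that a compound is active; the forest
# linearly combines them as a mean, and >= 0.5 is classified as active.
class DecisionForest:
    def __init__(self, trees):
        self.trees = trees  # list of callables: descriptors -> P(active)

    def predict_proba(self, descriptors):
        # Step 7: linear combination of the individual tree probabilities.
        return mean(tree(descriptors) for tree in self.trees)

    def predict(self, descriptors):
        return "active" if self.predict_proba(descriptors) >= 0.5 else "inactive"

# Toy "trees": threshold rules on two hypothetical descriptors.
trees = [
    lambda d: 0.9 if d["logP"] > 3.0 else 0.2,
    lambda d: 0.8 if d["n_rings"] >= 2 else 0.1,
    lambda d: 0.7 if d["logP"] > 2.5 else 0.3,
    lambda d: 0.6 if d["n_rings"] >= 1 else 0.4,
]
forest = DecisionForest(trees)
print(forest.predict({"logP": 3.5, "n_rings": 2}))  # active
print(forest.predict({"logP": 1.0, "n_rings": 0}))  # inactive
```

Because the trees vote through probabilities rather than hard class labels, the mean retains a measure of confidence that the later probability-bin analysis relies on.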
2.2 Model development
The reliability and predictability of an in silico model not only depends on the computational
method itself but also on the quality of the training data set. The training data set is used to
generate valid rules and to guide decision making. Ideally, a robust and predictive
classification model should be developed based on a training set that spans both broad
structural diversity (active and inactive compounds) and activity. We used such a training
set, the so-called NCTR data set [12,13], based on estrogen receptor binding affinity data to
demonstrate decision forest and compare decision forest with decision tree. The NCTR data
set contains 232 structurally diverse compounds, of which 131 compounds exhibit
estrogen receptor binding spanning a million-fold range of affinity, while 101 are inactive in
the competitive estrogen receptor binding assay.
The tree-based model classifies compounds into active and inactive categories using a set
of rules based on values of molecular descriptors. The descriptors used in this model
characterize the structural similarity as represented by the molecular descriptors across
compounds that are to be correlated with associated biological activity. Therefore, selecting
descriptors that appropriately encode activity-determining structural features is paramount.
The Molconn-Z software [15,16] was used to compute descriptors encompassing a wide
range of topological indices of structure to encode molecular information. The descriptors
are calculated from the two-dimensional (2D) structures of compounds. While other software tools
such as Cerius2 are available to calculate descriptors based on three-dimensional (3D)
structure, selecting the biologically active conformation is not only time-consuming but also
could be ambiguous since the active conformer is usually unknown in the first place. We
compared 2D with 3D descriptors with respect to the model performances based on the
NCTR data set using several classification methods, and we found that all results were
comparable (data not shown here). We conclude, as also suggested by others, that 2D
structural descriptors are promising for application in drug development and environmental
research because of easy and fast calculation and high reproducibility.
A number of the descriptors are linearly correlated, so the effect of correlated
descriptors on model performance was investigated. Removing the highly correlated
descriptors (correlation coefficient > 0.9) actually decreased the quality of the forest model
(results not shown). A possible explanation is that linear correlation among descriptors does
not translate into redundancy in the non-linear tree development process; in other words, the
correlated descriptors contribute distinctly to the different tree models. Thus, all
descriptors were included in the tree development without preprocessing.
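The correlation screen described above can be sketched as follows (a generic illustration, not the authors' code; the descriptor names and values are invented):

```python
# Sketch of the descriptor-correlation screen: compute pairwise Pearson
# correlations and flag pairs with |r| > 0.9. (In the study, removing such
# descriptors was found to *hurt* the forest, so all were retained.)
from math import sqrt

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def correlated_pairs(descriptors, threshold=0.9):
    names = list(descriptors)
    return [
        (names[i], names[j])
        for i in range(len(names))
        for j in range(i + 1, len(names))
        if abs(pearson(descriptors[names[i]], descriptors[names[j]])) > threshold
    ]

# Toy descriptor columns (hypothetical values for five compounds).
desc = {
    "chi0": [1.0, 2.0, 3.0, 4.0, 5.0],
    "chi1": [1.1, 2.1, 2.9, 4.2, 5.0],   # nearly collinear with chi0
    "kappa": [3.0, 1.0, 4.0, 1.0, 5.0],  # unrelated
}
print(correlated_pairs(desc))  # [('chi0', 'chi1')]
```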
The decision forest model developed using the NCTR data set consists of five comparable
decision tree models whose misclassification rate ranges from 6.03 to 6.90%. However, the
decision forest model yields a 3.02% misclassification rate, one half or less of the
misclassification rates of individual decision tree models. Table 1 compares decision forest
and decision tree results and shows that decision forest has consistently better predictive
performance by all statistical measures shown.
3. Results and discussion

3.1 Model validation

The current challenge in classification is no longer to develop a fitted model with sound
statistical measures, but rather to develop a model that generalizes well in predicting
unknown compounds. Thus, we conducted both cross-validation and external validation
to assess the predictivity of the decision forest model.
Leave-10%-out cross-validation (L10O) was conducted in this study by randomly dividing
the data set into ten equal portions, where each portion was excluded once to be the test set
while the remaining nine portions were used as the training set to develop a model. The
overall accuracy was taken as the average of the test set prediction accuracies for the 10 models.

Table 1. Comparison of model performance of decision forest with decision tree; the reported
measures include the false positive rate and the false negative rate. *The first decision tree is
used for comparison.

Each random division of the data set into 10 portions leads to 10 specific training and
test sets, with a corresponding accuracy in a single L10O that is likely biased compared to the
average random case. Therefore, L10O was repeated 2000 times to calculate an unbiased
predictive accuracy. Predictive accuracy based on the 2000 L10O average was 81.9% for
decision forest, considerably higher than the 75.8% obtained for decision tree.
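The repeated L10O protocol can be sketched generically as follows (an illustration of the procedure, not the original code; the toy data set and `fit` learner are invented stand-ins):

```python
import random

# Sketch of repeated leave-10%-out (L10O) cross-validation: shuffle the data,
# split into 10 folds, hold each fold out once as the test set, and average
# test accuracy; repeat the whole division many times so that no single
# (possibly biased) split dominates the estimate.
def l10o_accuracy(data, fit, repeats=5, folds=10, seed=0):
    rng = random.Random(seed)
    accs = []
    for _ in range(repeats):
        shuffled = data[:]
        rng.shuffle(shuffled)
        chunks = [shuffled[i::folds] for i in range(folds)]
        for k in range(folds):
            test = chunks[k]
            train = [s for i, c in enumerate(chunks) if i != k for s in c]
            model = fit(train)
            correct = sum(model(x) == y for x, y in test)
            accs.append(correct / len(test))
    return sum(accs) / len(accs)

# Toy data: (descriptor value, class); a trivial threshold "learner".
data = [(x, x >= 50) for x in range(100)]
def fit(train):
    # Hypothetical learner: threshold midway between the class means.
    a = [x for x, y in train if y]
    b = [x for x, y in train if not y]
    cut = (sum(a) / len(a) + sum(b) / len(b)) / 2
    return lambda x: x >= cut

print(round(l10o_accuracy(data, fit, repeats=3), 2))
```

On this cleanly separable toy problem the averaged accuracy is close to 1.0; in the study, 2000 repeats of the full procedure were averaged.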
Both decision tree and decision forest assign a probability value (0–1) that a compound is
active or inactive; compounds with the probability equal to or greater than 0.5 are designated
as active, while others are designated as inactive. To assess the confidence of predictions of
decision forest and decision trees, it is useful to place each prediction from the L10O cross
validation in one of ten equal bins of probability value between 0 and 1, and separately
compare confidence of prediction for each bin. Figures 1A and B show the distribution of
correct and incorrect predictions within each probability bin from the L10O cross validation
for decision tree and decision forest, respectively. For both decision forest and decision tree,
the majority of predictions are within either the high confidence bins from 0 to 0.1 or from 0.9
to 1.0 that have the highest accuracy. Lower prediction accuracy is found in the bins from 0.2
to 0.9. Figure 1C compares the ratio of correct to incorrect prediction for all probability bins
for both decision forest and decision tree. Both averaged across all bins and within the high
confidence bins, prediction accuracy is 10% higher for decision forest than for decision tree.
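The bin-wise confidence analysis described above can be sketched as follows (an illustration with invented predictions, not the study's data):

```python
# Sketch of the confidence-bin analysis: place each (probability, truth)
# prediction into one of ten equal probability bins and count correct vs.
# wrong calls per bin (probability >= 0.5 means "predicted active").
def bin_predictions(preds, n_bins=10):
    bins = [{"correct": 0, "wrong": 0} for _ in range(n_bins)]
    for prob, is_active in preds:
        idx = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 -> last bin
        predicted_active = prob >= 0.5
        key = "correct" if predicted_active == is_active else "wrong"
        bins[idx][key] += 1
    return bins

# Hypothetical predictions: (predicted probability of activity, true class).
preds = [(0.95, True), (0.92, True), (0.88, False), (0.45, True),
         (0.05, False), (0.02, False), (0.55, False), (0.98, True)]
bins = bin_predictions(preds)
print(bins[9])  # {'correct': 3, 'wrong': 0}
print(bins[0])  # {'correct': 2, 'wrong': 0}
```

Predictions near 0 or 1 land in the outer bins, where (as in figure 1) accuracy is highest; predictions near 0.5 land in the low-confidence middle bins.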
Even if a model is validated as high quality by L10O, uncertainty will remain regarding its
ability to predict compounds not in the training set. To address this question, validation with
an external data set is required. An ideal external testing data set should contain the types of
compounds to which the model will be applied. Thus, an external data set reported by
Nishihara et al. was used in this study. The data set contains 517 compounds whose
estrogenic activity was tested with the yeast two-hybrid assay. Only 463 compounds
remained after eliminating those lacking unique compound structures (e.g. mixtures), of
which 62 compounds were categorized as active based on the definition given by the authors.
Accordingly, the prevalence of active compounds is low, which is similar to the real-world
situation in drug discovery where inactive compounds far outnumber the active ones. The
decision forest model based on the NCTR training data set has overall accuracy of 82.94%
for predicting the test set, whereas the decision tree model has the prediction accuracy of
80.99%. Importantly, decision forest yields much higher prediction accuracy in the high
confidence regions compared to decision tree (see figure 2). Specifically, in the probability
region 0.9–1.0, decision forest has fewer false positives than decision tree, with a
corresponding enrichment of the proportion of active compounds. Since there is no
additional computational expense in using decision forest compared with decision tree, there
is no cost associated with improving the fraction of lead prospects for subsequent testing.
3.2 Screening a large data set
To further demonstrate the usage of decision forest in drug discovery with respect to speed
and efficiency, we screened a large data set to identify potential estrogens. Walker et al.
[18,19] developed a database that contains a large and diverse collection of known pesticides
and industrial compounds as well as some food additives and drugs. The database contains
92,964 chemical abstract service (CAS) registry numbers of compounds. A final data set of
57,145 compounds resulted after eliminating compounds for which structures were not
available or structural descriptors could not be calculated.

Figure 1. Results of 2000 leave-10%-out cross-validation runs across 10 equal bins of probability
(0–1). Compounds in the left 5 bins are predicted as inactive while those in the right 5 bins are
predicted as active. Distributions of correct and wrong predictions for decision tree and decision
forest are shown in A and B, respectively. C compares the distributions of the correct/wrong ratio
for both decision forest and decision tree.

Figure 2. External validation results of both decision forest and decision tree models on a test set
reported by Nishihara et al. The distributions of correct and wrong predictions across 10 equal
probability bins for decision tree and decision forest are shown in A and B, respectively.

Activity predictions for the 57,145 compounds by the decision forest model derived from the
NCTR data set for estrogen receptor binding were completed in less than one minute. Some 15%
of compounds were predicted active
according to the probability bin distribution depicted in figure 3. An effective algorithm for
lead discovery should produce a smaller population of leads with lower false positives to be
subsequently synthesized and assayed, or to be investigated with more time-consuming
QSAR techniques. Figure 3 shows that the high confidence bin for active prediction contains
1384 compounds, a 50-fold decrease in the population from which to test for leads.
Figure 3. Screening result of the decision forest model on a large data set. The pie chart shows the
number of compounds (and the percentage of the data size) predicted in each probability bin.

In silico methods are widely used in drug discovery to identify potential leads for further
experimental evaluation. The methods range from the simple Lipinski's rule-of-five type of
approaches, through docking and pharmacophore searching against databases, to supervised and
unsupervised clustering/classification and QSARs. Success is achieved if a few good leads
ultimately survive to the drug development stage. It is not necessary to discover or design all
potentially active compounds. In other words, high false negatives are tolerable, but false
positives should be low to ensure that the selected compounds contain enough active leads to
warrant further investigation and investment. Thus, high positive predictivity (true positives
out of total predicted positives) is important since subsequent testing of predicted positives
consumes time and money. Decision forest generally gives higher positive predictivity than
decision tree, and even higher positive predictivity within definable high confidence regions.
Thus, decision forest is a suitable in silico tool for drug discovery.
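The positive predictivity referred to above is computed directly from confusion-matrix counts; a minimal sketch with invented counts:

```python
# Positive predictivity (precision): the fraction of predicted positives
# that are truly active -- the key metric when every predicted positive
# costs synthesis and assay effort downstream.
def positive_predictivity(true_positives, false_positives):
    return true_positives / (true_positives + false_positives)

# Hypothetical screening outcome: 40 true actives and 10 inactives among
# the compounds a model flagged as active.
print(positive_predictivity(40, 10))  # 0.8
```

Note that false negatives do not enter this quantity at all, which mirrors the argument above: missed actives are tolerable, wasted assays are not.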
A recursive process combining screening assays and in silico modeling has become
prevalent throughout much of the pharmaceutical industry. In lead discovery, the process has
been called sequential screening  as depicted in figure 4. The process starts with assay
data for an initial set of compounds from an existing compound library. The resulting data for
active compounds, and sometimes for inactive compounds, are then used for initial in silico
modeling. The resultant model can be used to identify potential leads in many ways, such as
searching a library of existing compounds or assisting in the design of a virtual combinatorial
library. The identified potential lead compounds are assayed, and these data are then used to
refine the model. In this process, the speed and efficiency of the in silico method is important
to enable rapid modeling and provision of new leads for synthesis.

Figure 4. Depiction of the sequential screening process now prevalent in the pharmaceutical
industry, where in silico modeling and prediction are integral to the process. The output of the
process is drug leads that may be further developed.

Decision forest
outperforms most classification methods in speed and efficiency for model development,
refinement, and in screening large databases. This is in part because 2D descriptors alone are
sufficient to ensure a high quality model. Of course, any type of molecular descriptors,
compound properties or even other related biological properties can also be used in decision
forest.
References

[1] B.A. Beutel. Annu. Rep. Med. Chem., 32, 261 (1997).
[2] H. Hong, N. Neamati, S. Wang, M.C. Nicklaus, A. Mazumder, Y. Pommier, T.R. Burke Jr., H. Zhao,
G.W.A. Milne. J. Med. Chem., 40, 930 (1997).
[3] R.A. Fisher. Ann. Eugen., 7, 179 (1936).
[4] B.R. Kowalski, C.F. Bender. J. Am. Chem. Soc., 94, 5632 (1972).
[5] S. Wold. Pattern Recogn., 8, 127 (1976).
[6] A. Rusinko III, M.W. Farmen, C.G. Lambert, P.L. Brown, S.S. Young. J. Chem. Inf. Comput. Sci., 39,
1017 (1999).
[7] D.M. Hawkins, S.S. Young, A. Rusinko III. Quant. Struct.–Act. Relat., 16, 296 (1997).
[8] T.S. Lim, W.Y. Loh, Y.S. Shin. Machine Learn., 40, 203 (2000).
[9] L. Shi, W. Tong, H. Fang, Q. Xie, H. Hong, R. Perkins, J. Wu, M. Tu, R.M. Blair, W.S. Branham,
C. Waller, J. Walker, D.M. Sheehan. SAR QSAR Environ. Res., 13, 69 (2002).
[10] W. Tong, H. Hong, H. Fang, Q. Xie, R. Perkins. J. Chem. Inf. Comput. Sci., 43, 525 (2003).
[11] L.A. Clark, D. Pregibon. Tree-based models. Chapter 9 in Statistical Models in S, pp. 413–430,
J.M. Chambers, T.J. Hastie (Eds.), Wadsworth & Brooks/Cole, Pacific Grove, CA (1992).
[12] R. Blair, H. Fang, W.S. Branham, B. Hass, S.L. Dial, C.L. Moland, W. Tong, L. Shi, R. Perkins,
D.M. Sheehan. Toxicol. Sci., 54, 138 (2000).
[13] W.S. Branham, S.L. Dial, C.L. Moland, B. Hass, R. Blair, H. Fang, L. Shi, W. Tong, R. Perkins,
D.M. Sheehan. J. Nutr., 132, 658 (2002).
[14] H. Fang, W. Tong, L. Shi, R. Blair, R. Perkins, W.S. Branham, S.L. Dial, C.L. Moland, D.M. Sheehan.
Chem. Res. Toxicol., 14, 280 (2001).
[15] L.B. Kier, L.H. Hall. Molecular Structure Description: The Electrotopological State, Academic Press,
New York (1999).
[16] L.B. Kier, L.H. Hall. Molecular Connectivity in Chemistry and Drug Research, Academic Press,
New York (1976).
[17] T. Nishihara, J. Nishikawa, T. Kanayama, F. Dakeyama, K. Saito, M. Imagawa, S. Takatori,
Y. Kitagawa, S. Hori, H. Utsumi. J. Health Sci., 46, 282 (2000).
[18] J.D. Walker, C.W. Waller, S. Kane. The endocrine disruption priority setting database (EDPSD): a tool to
rapidly sort and prioritize compounds for endocrine disruption screening and testing. In Handbook on
Quantitative Structure Activity Relationships (QSARs) for Predicting Compound Endocrine Disruption
Potentials, J.D. Walker (Ed.), SETAC Press, Pensacola, FL (2001).
[19] H. Hong, W. Tong, H. Fang, L. Shi, Q. Xie, J. Wu, R. Perkins, J.D. Walker, W. Branham, D.M. Sheehan.
Environ. Health Persp., 110, 29 (2002).
[20] C.A. Lipinski, F. Lombardo, B.W. Dominy, P.J. Feeney. Adv. Drug Deliv. Rev., 23, 3 (1997).
[21] M.F.M. Engels, T. Thielemans, D. Verbinnen, J.P. Tollenaere, R. Verbeeck. J. Chem. Inf. Comput. Sci.,
40, 241 (2000).