Beyond Random Split for Assessing Statistical
Model Performance
Carlos Catania, Jorge Guerra, Juan Manuel Romero, and Gabriel Caffaratti
and Martin Marchetta
Universidad Nacional de Cuyo.
Facultad de Ingenier´ıa. LABSIN.
Mendoza. Argentina
Abstract. Even though a random train/test split of the dataset is common practice, it is not always the best approach for estimating generalization performance. In fact, the usual machine learning methodology can sometimes produce overly optimistic estimates of the generalization error when a dataset is not representative or when rare and elusive examples are a fundamental aspect of the detection problem. In the present work, we analyze strategies based on the predictors' variability for splitting data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness, and provide a more accurate estimation of the generalization error when the dataset is not representative. Two baseline classifiers based on decision trees were used to test the four splitting strategies considered. Both classifiers were applied to CTU19, a low-representativeness dataset for a network security detection problem. Preliminary results showed the importance of applying the three strategies alternative to Monte Carlo splitting in order to get a more accurate error estimation on different but feasible scenarios.
Keywords: Sampling strategies · Population representativeness · Dataset
1 Motivation
An experimental design is a fundamental part of the machine learning workflow. In the particular case of prediction problems, part of the design includes estimating the model's generalization performance. Estimating this performance is a critical aspect of developing a model, since it gives an idea of whether the model can deal with future (unseen) scenarios reasonably well.
The standard experimental design for evaluating the performance of a machine learning model is well known. As depicted in Figure 1, a dataset is usually split in a 70/30 ratio. The 30% of the data, called the testing set, should be left aside and ideally never used until model tuning has finished. The remaining 70%, referred to as the training set, can be used to train the model and, optionally, to validate it or to conduct a hyperparameter search for model tuning.
arXiv:2209.03346v1 [cs.LG] 4 Sep 2022
Fig. 1: Standard experimental design for evaluating the performance of a ma-
chine learning model.
Train and test datasets need some degree of similarity (both need to follow the same distribution); otherwise, it would be impossible for the model to achieve decent performance on the test set. However, if the examples are too similar in both datasets, it is not possible to assure an accurate generalization performance for the model. Moreover, the model's overall generalization performance could be overestimated.
Even though a random train/test split of the dataset is common practice, it is not always the best approach for estimating generalization performance under some scenarios. A common situation arises when predicting patient
outcomes. In these cases, the model should be constructed using certain patient sets (e.g., from the same clinical site or disease stage), but then needs to be tested on a different sample population [4]. Another situation is that it is not always possible to have access to a representative sample. Detecting a non-representative sample is possible through the application of several techniques, such as cross-validation, learning curves, and confidence intervals, among others. Unfortunately, in many cases a non-representative sample is all we have to generate a machine learning model. In those cases, when the sample does not follow the population distribution, a random split might not provide the required level of representativeness for rare or elusive examples in the testing set. As a result, the standard error metrics could overestimate the performance of the model. In classification problems, it is possible to deal with the lack of representativeness using a stratification strategy. However, when rare examples are not labeled, a predictor-based sampling strategy is necessary [9,1].
In the present work, we analyze several strategies based on the predictors' variability for splitting data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness. The hypothesis is that, by including rare examples during model evaluation, more accurate performance estimates will be obtained.
The contributions of the present article are:
- The analysis of four splitting strategies with different distributions for training and testing sets.
- The evaluation of two different tree-based baseline classifiers over four different splitting strategies.
2 Splitting Strategies
2.1 Monte Carlo
The usual strategy for model evaluation consists of taking a uniformly random sample without replacement of a portion of the data for the training set, while all other data points are added to the testing set. Such a strategy can be thought of as a special case of Monte Carlo Cross Validation (MCCV) [10] with just one resample instance. The Monte Carlo (MC) splitting strategy guarantees the same distribution across not only the response but also the predictor variables for the training and testing sets.
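As an illustration (not code from the paper), a minimal sketch of this single-resample Monte Carlo split, using the 2000/1000 training/testing sizes that appear later in Section 3.2; the function name and synthetic data are our own:

```python
import numpy as np

def monte_carlo_split(X, y, train_size=2000, test_size=1000, seed=0):
    """Single-resample Monte Carlo split: a uniform random sample without
    replacement for training, the next test_size points for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    train_idx = idx[:train_size]
    test_idx = idx[train_size:train_size + test_size]
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]

# Synthetic stand-in for the 10 flow-based predictors
X = np.random.rand(3000, 10)
y = np.random.randint(0, 2, size=3000)
X_tr, y_tr, X_te, y_te = monte_carlo_split(X, y)
print(X_tr.shape, X_te.shape)  # (2000, 10) (1000, 10)
```

Because the permutation is uniform, training and testing sets follow the same joint distribution over predictors and response, which is exactly the property the alternative strategies below deliberately break.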
In comparison with Monte Carlo, the remaining splitting strategies provide steps to create different test sets that include rare or elusive examples while maintaining properties across the predictor space similar to those of the training set.
2.2 Dissimilarity-based
Maximum-dissimilarity splitting strategies were proposed by [1] and [9]. The simplest method to measure dissimilarity consists of using the distance between the predictor values of two samples: the larger the distance between points, the larger the dissimilarity. The application of dissimilarity during data splitting requires a set initialized with a few samples. Then, the dissimilarity between this set and the rest of the unallocated samples can be calculated, and the unallocated sample that is most dissimilar to the initial set is added to the test set.
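A sketch of this greedy selection loop (our own illustration, not the authors' implementation), taking Euclidean distance to the nearest already-selected sample as the dissimilarity measure, one common choice in the literature cited above:

```python
import numpy as np

def max_dissimilarity_select(X_pool, X_init, n_select):
    """Greedily pick, n_select times, the pool sample whose distance to its
    nearest already-selected sample is largest; return the chosen indices."""
    selected = list(X_init)
    chosen_idx = []
    remaining = list(range(len(X_pool)))
    for _ in range(n_select):
        sel = np.asarray(selected)
        # Distance from each unallocated sample to the closest selected sample
        d = np.array([np.min(np.linalg.norm(sel - X_pool[i], axis=1))
                      for i in remaining])
        best = remaining[int(np.argmax(d))]
        chosen_idx.append(best)
        selected.append(X_pool[best])
        remaining.remove(best)
    return chosen_idx
```

For example, with a pool of 1-D points [0, 1, 10] and an initial set {0}, the first point moved to the test set is the one at 10, since it is farthest from everything selected so far.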
Dissimilarity splitting has proved useful for splitting chemical databases [6,5]. Nevertheless, this method strongly depends on the initial set used to calculate the dissimilarity of the remaining samples, which causes problems for small datasets where the initial set is not representative enough [11].
2.3 Informed-based
A well-known non-random split strategy consists of using some kind of grouping information from the data to restrict the set of samples used for testing. The general idea is that, after splitting the data, members of a group present in the training set should not be included in the testing set. Such strategies are well known in areas such as medicine and finance [4]: testing should be conducted on a different patient group or, in the finance field, the model should be tested on a time series from a different time period.
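One way to sketch such a group-disjoint split (again our own illustration) is with scikit-learn's GroupShuffleSplit; the capture labels below are randomly generated stand-ins for real grouping information such as the CTU19 capture of origin:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(0)
X = rng.random((3000, 10))
y = rng.integers(0, 2, size=3000)
# Hypothetical group labels: which of 19 captures each sample came from
groups = rng.integers(0, 19, size=3000)

# test_size here is the fraction of *groups* sent to the testing side
gss = GroupShuffleSplit(n_splits=1, test_size=0.33, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=groups))

# No capture contributes to both sides of the split
assert set(groups[train_idx]).isdisjoint(set(groups[test_idx]))
```

The key property is the disjointness assertion at the end: every group (capture, patient, time period) lands entirely in either the training or the testing set.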
2.4 Clustering-based
The clustering split strategy follows the same principle as the informed split. However, there could be situations where no grouping information can be extracted from the samples to perform an informed split. In these cases, a clustering algorithm can be applied to replace the missing information. The labels generated by this procedure are then used for performing a group split, similarly to the informed split strategy.
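A minimal sketch of this idea (our own, with an arbitrary choice of k-means and cluster counts, not taken from the paper): cluster the predictors, then treat the cluster labels as surrogate groups and split on disjoint cluster sets.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((3000, 10))

# Surrogate group labels obtained by clustering the predictor space
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(X)

# Assign disjoint random sets of clusters to training and testing
clusters = rng.permutation(8)
train_clusters, test_clusters = set(clusters[:5]), set(clusters[5:])
train_idx = np.where(np.isin(labels, list(train_clusters)))[0]
test_idx = np.where(np.isin(labels, list(test_clusters)))[0]
assert set(labels[train_idx]).isdisjoint(labels[test_idx])
```

Because entire regions of the predictor space are withheld from training, this construction is more aggressive than the informed split, which matches the larger train/test differences reported for the clustering-based strategy later in Table 1.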
3 Application to a Network Security Dataset
The four splitting strategies described in Section 2 were applied to a network security dataset for botnet detection composed of nineteen network captures published by the Stratosphere IPS research group at CTU [7]. Specifically, the dataset has fourteen botnet captures and five normal captures, including traffic such as DNS, HTTPS and P2P. In total, the captures represent 20866 connections, 19271 labeled as "Botnet" and 1595 labeled as "Normal". All these captures were gathered between 2013 and 2017.
The first ten predictors of the CTU19 dataset summarize each flow of connections with the same IP, protocol and destination port into a 10-dimensional numerical vector:
v_feat = ⟨ x_sp, x_wp, x_wnp, x_snp, x_ds, x_dm, x_dl, x_ss, x_sm, x_sl ⟩   (1)
wherein the first four dimensions of the numerical vector represent the periodicity predictors (strong periodicity (x_sp), weak periodicity (x_wp), weak non-periodicity (x_wnp) and strong non-periodicity (x_snp)), the next three refer to the duration predictors (duration short (x_ds), duration medium (x_dm) and duration large (x_dl)), and the last three represent the size predictors (size short (x_ss), size medium (x_sm) and size large (x_sl)). The vector for a given connection is generated from the cumulative frequency of the values associated with the behavior of each predictor.
In addition to the information provided by the flow-based predictors, the CTU19 dataset includes flow-related information such as source IP, destination IP, protocol, port, and the source capture of each flow. However, the present study focuses only on the information provided by the flow-based predictors, as discussed in [2,8].
3.1 Initial Exploratory Analysis
Fig. 2 shows a 2D projection of the CTU19 dataset onto the first two principal components (PCA was applied to the flow-based predictors). In addition, the figure includes the distribution of each predictor. As depicted by the box plots, the botnet and normal classes present different distributions over the flow-based predictors. However, the two classes partially overlap when projected into the 2D space.
Fig. 2: Boxplot and 2D projection of the flow-based predictors in the CTU19
dataset. Botnet traffic in red and Normal in blue.
Fig. 3 decomposes the CTU19 dataset by traffic capture with the same information as provided by Fig. 2. In general, the 2D-projection patterns of the normal captures differ from those of the botnet captures: normal captures are concentrated, while botnet captures spread along the 2D predictor space. Most normal captures overlap the same predictor space; in other words, each capture has examples in every portion of the normal predictor subspace, which suggests adequate representativeness. Nevertheless, in the case of botnet traffic, several captures have only a limited presence in the class predictor subspace (see captures 2014-02-07-win3 and 2014-01-25-win3). Such lack of representativeness in some botnet captures could hamper the estimation of the classification model's performance.
3.2 Training and Testing Sets Creation
Fig. 4 describes the process used for creating the training and testing sets according to each splitting strategy. A subset of 3000 data points from CTU19 was used for each strategy: 2000 data points for the training set and 1000 for the testing set. The generation of the training and testing sets depends on the splitting strategy applied, as discussed in Section 2. The procedure is repeated 25 times; therefore, 25 pairs of training and testing sets are generated for each splitting strategy.
The differences between the splitting strategies can be observed in Fig. 5, considering data points from the 25 samples. The 2D projection onto the first two principal components confirms the similarity between training and testing sets when the baseline Monte Carlo splitting is applied: both datasets follow the same pattern in the predictor space, which corresponds with the similarity observed in the predictor distributions (see the box plots below). However, different patterns are observed between training and testing sets for the remaining splitting strategies. In particular, dissimilarity-based and informed-based splitting present the most different patterns. For the clustering-based splitting, the same pattern is observed in both datasets, although with a small displacement from the axis.
Fig. 3: The scatter plots on the top depict the 2D projection using PCA for each of the 19 captures conforming the CTU19 dataset (Botnet traffic in red and Normal in blue). On the bottom, the corresponding distributions of the 10 flow-based predictors for each capture.
[Fig. 4 panel text]
- Monte Carlo sampling: pick the first 2000 data points for training and the next 1000 for testing.
- Dissimilarity-based sampling: pick the first 2000 data points for training; then pick the 900 most dissimilar Botnet data points and the 100 most dissimilar Normal data points for testing.
- Clustering-based sampling: find clusters in the dataset; pick the first 2000 data points from a random set of clusters (α) for training; then pick 1000 data points for testing from a different random set (β).
- Informed sampling: pick the first 2000 data points from a random set of captures (α) for training; then pick 1000 data points for testing from a different random set (β).
Fig. 4: Procedure used for generating the training and testing sets for the four splitting strategies: Monte Carlo, dissimilarity-based, clustering-based and informed-based. In all cases, a training set with 2000 data points and a testing set with 1000 data points were generated.
When projecting many predictors into two dimensions, intricate predictor relationships can mask regions within the predictor space where the model will inadequately predict new samples. The algorithmic approach described in [3] is therefore applied to get more detailed information about the similarities between the training and testing datasets. The general idea is to create a new dataset by randomly permuting the predictors of the training set, concatenate it row-wise to the original training set while labeling original and permuted samples, and run a classification model on the resulting dataset to predict the probability of new data belonging to the training-set class. Table 1 shows the average percentage of samples from the testing set not recognized as part of the training set.
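A sketch of this permutation-based similarity check (our own reading of the procedure in [3]; the function name is ours, and hard class predictions are used here instead of a probability threshold):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pct_not_recognized(X_train, X_test, seed=0):
    """Permute each training predictor independently to build a 'random'
    class, train a classifier to tell real from permuted rows, and return
    the fraction of test samples classified as not belonging to the
    training data."""
    rng = np.random.default_rng(seed)
    # Column-wise permutation preserves marginals but destroys dependence
    X_perm = np.column_stack([rng.permutation(col) for col in X_train.T])
    X_all = np.vstack([X_train, X_perm])
    y_all = np.r_[np.ones(len(X_train)), np.zeros(len(X_perm))]
    clf = RandomForestClassifier(random_state=seed).fit(X_all, y_all)
    return float(np.mean(clf.predict(X_test) == 0))
```

A testing set drawn from the same joint distribution as the training set yields a low value, while a testing set built from withheld clusters or captures yields a higher one, which is the pattern Table 1 reports.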
As expected, Monte Carlo splitting exhibits the lowest error and standard deviation, confirming the similarity between training and testing sets observed in the 2D projection of Fig. 5. The dissimilarity-based and informed-based strategies have the same average error, although the informed-based one shows a higher variation. Finally, the clustering-based strategy shows the biggest differences between training and testing sets.
Fig. 5: 2D projection and boxplot distribution for the 25 pairs of training and
testing datasets for each splitting strategy.
Table 1: Average percentage and standard deviation of test samples not recognized as part of the training set for the four splitting strategies

Splitting Strategy    Avg Err %  sd
informed-based        0.06       0.05
monte carlo           0.01       0.01
cluster-based         0.17       0.13
dissimilarity-based   0.06       0.02
4 Error Estimation on Baseline Classifiers
The impact of the different splitting strategies discussed in Section 2 is measured on Random Forest (RF) and CatBoost (CB) classifiers. Both CB and RF are well-known classifiers that provide acceptable results on tabular data without hyperparameter tuning, and both baseline classifiers were executed with default parameters. Nevertheless, a downsampling technique is applied to the training set to deal with class imbalance.
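The paper does not spell out the downsampling step; one plain sketch of it (our own, with an illustrative function name) randomly drops majority-class samples until both classes have equal size:

```python
import numpy as np

def downsample_majority(X, y, seed=0):
    """Randomly keep only n_min samples per class, where n_min is the
    size of the smallest class, then shuffle the kept rows."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.where(y == c)[0], size=n_min, replace=False)
        for c in classes])
    keep = rng.permutation(keep)
    return X[keep], y[keep]
```

For CTU19, where "Botnet" connections outnumber "Normal" ones roughly 12 to 1, this balances the training set before fitting either baseline classifier.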
4.1 Metrics
Several standard metrics for classification model assessment were used to evaluate the baseline classifiers' performance on the different train/test datasets. Sensitivity measures the proportion of positives that are correctly identified (the True Positive Rate, TPR), and Specificity measures the proportion of negatives that are correctly identified (1 − FPR, where FPR is the False Positive Rate).
Additional metrics were used to deal with class imbalance: F1-Score and Balanced Accuracy. F1-Score is computed as the harmonic mean of precision and sensitivity. Balanced Accuracy is calculated as the average of the correctly classified proportions of each class individually.
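These definitions can be written out directly for the binary case; the following sketch (ours, treating "Botnet" as the positive class encoded as 1) makes the relationships between the four metrics explicit:

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    """Sensitivity (TPR), specificity (1 - FPR), F1 and balanced accuracy
    for a binary problem where 1 is the positive class."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    sens = tp / (tp + fn)                # TPR
    spec = tn / (tn + fp)                # 1 - FPR
    prec = tp / (tp + fp)
    f1 = 2 * prec * sens / (prec + sens) # harmonic mean
    bal_acc = (sens + spec) / 2
    return dict(sensitivity=sens, specificity=spec, f1=f1,
                balanced_accuracy=bal_acc)
```

Balanced Accuracy averages the per-class recalls, which is why it penalizes a classifier that labels everything "Botnet" even though plain accuracy would reward it on this imbalanced dataset.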
4.2 Results
(a) Monte Carlo splitting strategy (b) Dissimilarity-based splitting strategy
(c) Informed-based splitting strategy (d) Clustering-based splitting strategy
Fig. 6: Random Forest and CatBoost models' performance results for each splitting strategy.
Fig. 6 depicts the overall performance results of the baseline classification models using the splitting strategies described in Section 2. Both models were used to predict "Botnet" or "Normal" connections. Since no significant difference is observable in the performance of the two baseline models, the results discussed in this section correspond only to the CatBoost algorithm.
Fig. 6a displays the testing set results for RF (below) and CB (above) using the MC splitting strategy. The prediction performance in terms of the median Balanced Accuracy is around 0.91. The remaining metrics show similar values for both baseline models, and low variability is observed in all the considered metrics. For the dissimilarity-based splitting strategy (see Fig. 6b), all considered metrics decrease in comparison with the MC strategy: the median Balanced Accuracy decreases to 0.75, while Specificity decreases to 0.70. The median F1 and Sensitivity maintained higher values, although lower than those observed with the MC strategy.
Fig. 6c presents the results for the informed-based splitting strategy. In this case, Balanced Accuracy and F1 show median values of 0.85 and 0.87, respectively, while the median Specificity and Sensitivity are around 0.78 and 0.93, respectively. Notice that, despite the good performance in terms of median values, a considerable variation is observed for the F1 and Sensitivity metrics. Finally, Fig. 6d presents the models' results using the clustering splitting strategy. The median Balanced Accuracy decreased to 0.82, while F1 increased to 0.93 compared with the informed-based strategy. The rest of the metrics, Sensitivity and Specificity, showed similar results, with medians around 0.89 and 0.87, respectively.
5 Discussion
In general, both baseline models presented acceptable performance for predicting the "Botnet" and "Normal" classes, since the median Balanced Accuracy was over 0.75 in most cases. As expected, the Monte Carlo splitting strategy showed the smallest estimation error (a high Balanced Accuracy). The similarity between training and testing sets was already observed in Table 1, where only 1% of the testing set was not recognized as part of the training set. When the share of unrecognized samples increases to just 6%, as in the case of the dissimilarity strategy, the estimation error increases considerably (from 0.91 to 0.75 median Balanced Accuracy). Since the Monte Carlo and dissimilarity-based splitting strategies use the same procedure for generating the training set, their comparison provides a valuable error estimation under an anomalous but certainly feasible set of samples.
In the case of the informed- and clustering-based splitting strategies, the difference between training and testing sets is not only larger than for Monte Carlo, but both also show a larger variation (see Table 1). A considerable variation is also observed in the performance of both baseline classifiers. However, both splitting strategies still provide information about the robustness of the models on sets with different representativeness. This is particularly valid for the informed-based splitting strategy, where captures with very different representativeness levels are used for building the training and testing sets. For instance, the informed-based Balanced Accuracy range provides ad-hoc information about the expected values when less representative sets are used for training. Moreover, it is possible to infer that a Balanced Accuracy of 0.75 could be the worst performance scenario observed for the model, a value still suitable in some real-life situations.
The clustering-based strategy provides a more extreme scenario than the informed-based one for estimating the model performance on less representative sets. Under the clustering-based strategy, a concentrated portion of the predictor space present in the testing set is excluded from the training set, whereas under the informed-based strategy it is possible to find data points spread along the whole predictor space. Consequently, a higher variation is observed in the baseline models' performance.
6 Concluding Remarks and Future Work
Despite being the standard splitting strategy, Monte Carlo can overestimate the results when the dataset is not representative. Other splitting techniques, based on dissimilarity, on information present in the dataset, or on the application of a clustering algorithm, can help in the estimation under different low-representativeness scenarios.
Multiple training and testing sets were generated using the different strategies on the CTU19 botnet dataset. The differences between training and testing sets were quantified using the algorithm proposed by [3] for all four techniques: as expected, Monte Carlo showed the smallest differences, while clustering-based showed the biggest.
Two baseline classifiers were used for evaluating each splitting strategy in the error estimation process. The dissimilarity-based splitting strategy provided a valuable error estimation under an anomalous but certainly feasible set of samples. On the other hand, the informed-based strategy offers ad-hoc information about the expected values when less representative sets are used for training, while the clustering-based strategy emerges as an alternative to the informed-based one for estimating the model performance on low-representativeness sets, with a more pessimistic estimation.
Preliminary results showed the importance of applying the three strategies alternative to Monte Carlo splitting in order to get a more accurate error estimation on different but possible situations. However, given the particularly low-representativeness nature of the botnet detection problem, a deeper analysis and an evaluation on other datasets should be conducted.
7 Acknowledgments
The authors would like to thank SIIP-UNCuyo for the financial support received during this work, in particular through projects 06/B363 and 06/B374. In addition, we gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan V GPU used for this research.
References
1. Clark, R.D.: OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. Journal of Chemical Information and Computer Sciences 37(6), 1181–1188 (1997)
2. Guerra, J.L., Veas, E., Catania, C.A.: A study on labeling network hostile behavior with intelligent interactive tools. In: 2019 IEEE Symposium on Visualization for Cyber Security (VizSec). pp. 1–10 (2019)
3. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media (2009)
4. Kuhn, M., Johnson, K.: Applied Predictive Modeling with Applications in R. Springer (2013)
5. Martin, T.M., Harten, P., Young, D.M., Muratov, E.N., Golbraikh, A., Zhu, H., Tropsha, A.: Does rational selection of training and test sets improve the outcome of QSAR modeling? Journal of Chemical Information and Modeling 52(10), 2570–2578 (2012)
6. Snarey, M., Terrett, N.K., Willett, P., Wilton, D.J.: Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling 15(6), 372–385 (1997)
7. Stratosphere IPS Project: The CTU-19 dataset, Malware Captures. https://mcfp. (October 2017), [Online; accessed Jun-2020]
8. Torres, J.L.G., Catania, C.A., Veas, E.: Active learning approach to label network traffic datasets. Journal of Information Security and Applications 49, 102388 (2019)
9. Willett, P.: Dissimilarity-based algorithms for selecting structurally diverse sets of compounds. Journal of Computational Biology 6(3-4), 447–457 (1999)
10. Xu, Q.S., Liang, Y.Z.: Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems 56(1), 1–11 (2001)
11. Yang, Y., Ye, Z., Su, Y., Zhao, Q., Li, X., Ouyang, D.: Deep learning for in vitro prediction of pharmaceutical formulations. Acta Pharmaceutica Sinica B 9(1), 177–185 (2019)