Beyond Random Split for Assessing Statistical
Carlos Catania, Jorge Guerra, Juan Manuel Romero, and Gabriel Caﬀaratti
and Martin Marchetta
Universidad Nacional de Cuyo.
Facultad de Ingenier´ıa. LABSIN.
Abstract. Even though a random train/test split of the dataset is a common practice, it is not always the best approach for
estimating generalization performance under some scenarios. The fact
is that the usual machine learning methodology can sometimes overes-
timate the generalization error when a dataset is not representative or
when rare and elusive examples are a fundamental aspect of the detec-
tion problem. In the present work, we analyze strategies based on the
predictors' variability for splitting the data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness and provide a more accurate estimation of the generalization error when the dataset is
not representative. Two baseline classiﬁers based on decision trees were
used for testing the four splitting strategies considered. Both classiﬁers
were applied on CTU19, a low-representativeness dataset for a network security detection problem. Preliminary results showed the importance of
applying the three alternatives to the Monte Carlo splitting strategy in order to get a more accurate error estimation on different
but feasible scenarios.
Keywords: Sampling strategies · Population representativeness · Dataset

1 Introduction
An experimental design is a fundamental part of the machine learning workﬂow.
In the particular case of prediction problems, part of the design includes esti-
mating the model’s generalization performance. Estimating this performance is
a critical aspect of developing a model since it gives an idea of whether it can
deal with future (not seen) scenarios reasonably well.
The standard experimental design for evaluating the performance of a ma-
chine learning model is well known. As depicted in Figure 1, a dataset is usually split in a 70/30 ratio. The 30% of the data, called the testing set, should be left aside and ideally never used until model tuning has finished. On the other hand, the remaining 70% of the data, referred to as the training set, can be used to train and optionally validate
the model or conduct a hyperparameter search for model tuning.
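The 70/30 procedure above can be sketched in a few lines (a minimal illustration using only the standard library; in practice a library routine such as scikit-learn's train_test_split serves the same purpose):

```python
import random

def train_test_split(data, train_frac=0.7, seed=42):
    """Shuffle the indices and split the dataset into train/test partitions."""
    rng = random.Random(seed)
    idx = list(range(len(data)))
    rng.shuffle(idx)
    cut = int(len(data) * train_frac)
    train = [data[i] for i in idx[:cut]]
    test = [data[i] for i in idx[cut:]]
    return train, test

train, test = train_test_split(list(range(100)))
```

The fixed seed makes the split reproducible, which matters when the same partition must be reused across model-tuning experiments.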
arXiv:2209.03346v1 [cs.LG] 4 Sep 2022
Fig. 1: Standard experimental design for evaluating the performance of a ma-
chine learning model.
Train and test datasets need some degree of similarity (both need to follow the
same distribution). Otherwise, it would be impossible for the model to achieve
a decent performance on the test set. However, if the examples are too similar in both datasets, then it is not possible to ensure an accurate generalization performance estimate for the model. Moreover, the model's overall generalization performance could be overestimated.
Even though a random train/test split of the dataset is a common practice, it is not always the best approach for estimating generalization performance under some scenarios. A common situation is when predicting patient
outcomes. In these cases, the model should be constructed using certain patient
sets (e.g., from the same clinical site or disease stage) but then needs to be tested on a different sample population. Another situation is the fact that
it is not always possible to have access to a representative sample. Detecting a
non-representative sample is possible through the application of several techniques,
such as cross-validation, learning curves, and conﬁdence intervals, among oth-
ers. Unfortunately, in many cases, a non-representative sample is all we have
to generate a machine learning model. In those cases when a sample does not
follow the same population distribution, a random split might not provide the
required level of representativeness for rare or elusive examples in a testing set.
As a result, the standard error metrics could overestimate the performance of
the model. In classiﬁcation problems, it is possible to deal with the lack of rep-
resentativeness using a stratiﬁcation strategy. However, when rare examples are
not labeled, a predictor-based sampling strategy will be necessary [9,1].
In the present work, we analyze several strategies based on the predictors' variability for splitting the data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness. The hypothesis is that, by including rare examples during model evaluation, more accurate performance estimates will be obtained.
The contributions of the present article are:
– The analysis of four splitting strategies with different distributions for training and testing sets.
– The evaluation of two different tree-based baseline classifiers over four different splitting strategies.
2 Splitting Strategies
2.1 Monte Carlo
The usual strategy for model evaluation consists of taking a uniformly random sample without replacement as the training set, while all other data points are added to the testing set. Such a strategy can be thought of as a special case of Monte Carlo Cross-Validation (MCCV) [10] with just one resample instance. The Monte Carlo (MC) splitting strategy guarantees the same distribution across not only the response but also the predictor variables for training and testing sets.
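As a hedged sketch, the MCCV idea with a fixed number of resamples (mirroring the 25 repetitions used later in Section 3.2) might look like:

```python
import random

def mccv_splits(n, n_train, n_test, repeats, seed=0):
    """Monte Carlo cross-validation: repeated uniformly random train/test
    resampling without replacement within each repetition."""
    rng = random.Random(seed)
    splits = []
    for _ in range(repeats):
        # draw n_train + n_test distinct indices, then partition them
        idx = rng.sample(range(n), n_train + n_test)
        splits.append((idx[:n_train], idx[n_train:]))
    return splits

# 25 pairs of 2000-point training and 1000-point testing index sets
pairs = mccv_splits(n=20866, n_train=2000, n_test=1000, repeats=25)
```

With repeats=1 this reduces to the plain MC split described above.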
In comparison with Monte Carlo, the remaining splitting strategies provide
steps to create diﬀerent test sets that include rare or elusive examples while
maintaining similar properties across the predictor space as the training set.
2.2 Maximum Dissimilarity

Maximum dissimilarity splitting strategies were proposed in [9] and [1]. The simplest method to measure dissimilarity consists of using the distance between the predictor values of two samples: the larger the distance between points, the more dissimilar the samples. The application of dissimilarity during data
splitting requires a set initialized with a few samples. Then, the dissimilarity
between this set and the rest of the unallocated samples can be calculated. The
unallocated sample that is most dissimilar to the initial set would then be added
to the test set.
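A minimal sketch of this greedy procedure follows; the average distance to the already-selected set is used as the dissimilarity score, which is one of several possible aggregation choices (function and parameter names are illustrative, not the authors' implementation):

```python
import math

def max_dissimilarity_split(points, init_idx, n_test):
    """Greedily grow a test set: repeatedly move the unallocated point
    whose average distance to the already-selected points is largest."""
    test = list(init_idx)
    pool = [i for i in range(len(points)) if i not in test]
    while len(test) < n_test and pool:
        best = max(
            pool,
            key=lambda i: sum(math.dist(points[i], points[j]) for j in test) / len(test),
        )
        test.append(best)
        pool.remove(best)
    return pool, test  # the remaining pool forms the training set

# the farthest point from the initial set is selected first
train, test = max_dissimilarity_split(
    [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (10.0, 0.0)], init_idx=[0], n_test=2
)
```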
Dissimilarity splitting proved to be useful over chemical databases’ splitting
[6,5]. Nevertheless, this method strongly depends on the initial set used to calculate the dissimilarity of the remaining samples, causing problems in cases of small datasets where the initial set is not representative enough.
2.3 Informed Split

A well-known non-random split strategy consists of using some kind of grouping
information from the data to restrict the set of samples used for testing. The
general idea is that, after splitting the data, members of a group present in the training set should not be included in the testing set. Such strategies are well known in areas such as medicine and finance, where testing should be conducted on a different patient group or, in the finance field, where the model should be
tested on a time series from a diﬀerent time period.
2.4 Clustering Split

The clustering split strategy follows the same principle as the informed split.
However, there could be situations where no grouping information can be ex-
tracted from the samples to perform an informed split. In these cases, the appli-
cation of a clustering algorithm could be used to replace the missing information.
The labels generated by this procedure will then be used to perform a group split similarly to the informed split strategy.
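Both the informed and the clustering splits reduce to the same group-disjoint selection once group labels are available (known captures for the informed split, cluster assignments for the clustering split). A minimal sketch, with illustrative names:

```python
import random

def group_split(labels, train_frac=0.7, seed=1):
    """Split indices so that no group appears in both train and test.
    `labels` may be known groups (informed split) or labels produced
    by a clustering algorithm (clustering split)."""
    rng = random.Random(seed)
    groups = sorted(set(labels))
    rng.shuffle(groups)
    cut = max(1, int(len(groups) * train_frac))
    train_groups = set(groups[:cut])
    train = [i for i, g in enumerate(labels) if g in train_groups]
    test = [i for i, g in enumerate(labels) if g not in train_groups]
    return train, test

# three groups of five samples each: two groups train, one group test
train, test = group_split(["a"] * 5 + ["b"] * 5 + ["c"] * 5)
```

scikit-learn's GroupShuffleSplit implements the same idea for production use.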
3 Application to a Network Security Dataset
The four splitting strategies described in section 2 were applied to a network
security dataset for botnet detection comprising nineteen network captures published by the Stratosphere IPS research group at CTU [7]. Specifically, the dataset has fourteen botnet captures and five normal captures including traffic such as DNS, HTTPS, and P2P. In total, the captures represent 20866 connections, with 19271 labeled as "Botnet" and 1595 labeled as "Normal". All these captures were gathered between 2013 and 2017.
The first ten predictors of the CTU19 dataset summarize each flow of connections with the same IP, protocol, and destination port into a 10-dimensional numerical vector:

    vfeat = <x_sp, x_wp, x_wnp, x_snp, x_ds, x_dm, x_dl, x_ss, x_sm, x_sl>    (1)

wherein the first four dimensions of the numerical vector represent the periodicity predictors (strong periodicity (x_sp), weak periodicity (x_wp), weak non-periodicity (x_wnp), and strong non-periodicity (x_snp)), the next three refer to the duration predictors (duration short (x_ds), duration medium (x_dm), and duration large (x_dl)), and the last three represent the size predictors (size short (x_ss), size medium (x_sm), and size large (x_sl)). The vector for a given connection is generated considering the cumulative frequency of the corresponding values associated with the behavior of each predictor.
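Under the assumption that each flow carries one symbolic label per behavior family (the tuple format and category abbreviations below are illustrative, not the actual CTU19 preprocessing), the vector construction can be sketched as:

```python
from collections import Counter

# Hypothetical category labels: each flow is assumed to be tagged with one
# periodicity, one duration, and one size category.
PERIODICITY = ["sp", "wp", "wnp", "snp"]
DURATION = ["ds", "dm", "dl"]
SIZE = ["ss", "sm", "sl"]

def vfeat(flows):
    """Build the 10-dimensional vector of relative category frequencies
    for one (IP, protocol, destination port) connection."""
    n = len(flows)
    counts = Counter()
    for periodicity, duration, size in flows:
        counts[periodicity] += 1
        counts[duration] += 1
        counts[size] += 1
    return [counts[c] / n for c in PERIODICITY + DURATION + SIZE]

# two flows: both strongly periodic and short-sized, mixed durations
v = vfeat([("sp", "ds", "ss"), ("sp", "dm", "ss")])
```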
In addition to the information provided by the ﬂow-based predictors, the
CTU19 dataset includes flow-related information such as source IP, destination IP, protocol, port, and the source capture linked with each flow. However,
the present study will focus only on the information provided by the ﬂow-based
predictors as discussed in [2,8].
3.1 Initial Exploratory Analysis
Fig. 2 represents a 2D projection of the CTU19 dataset considering the ﬁrst 2
principal components (PCA was applied to the ﬂow-based predictors). In ad-
dition, the ﬁgure includes the distribution for each predictor. As depicted by
the box plot, botnet and normal classes present a diﬀerent distribution over the
flow-based predictors. However, both classes show a partial overlap when projected into a 2D space.
Fig. 2: Boxplot and 2D projection of the ﬂow-based predictors in the CTU19
dataset. Botnet traﬃc in red and Normal in blue.
Fig. 3 decomposes the CTU19 dataset by traffic capture with the same information as provided by Fig. 2. In general, the 2D projection patterns of Normal captures differ from those of Botnet captures: Normal captures are concentrated, while Botnet captures spread along the 2D predictor space. When analysing Normal captures, most of them overlap the same predictor space. In other words, each capture has some examples on every portion of the Normal predictor subspace, which suggests adequate representativeness. Neverthe-
less, in the case of Botnet, there are several cases of captures having only a
limited presence on the class predictor subspace (see captures 2014-02-07-win3
and 2014-01-25-win3). Such lack of representativeness in some Botnet captures could hinder the estimation of the classification model's performance.
3.2 Training and Testing Sets Creation
Fig. 4 describes the process used for the creation of the training and testing sets
according to each splitting strategy. A subset of 3000 data points from CTU19 was used for each strategy: 2000 data points to form the training set and 1000 for the testing set. The generation of the training and testing sets depends on the splitting strategy applied, as discussed in section 2. The procedure is repeated 25 times; therefore, 25 pairs of training and testing sets are generated
for each splitting strategy.
The differences between the splitting strategies can be observed in Fig. 5, considering data points from the 25 samples. The 2D projection using the first two princi-
pal components conﬁrms the similarities between training and testing sets when
the baseline Monte Carlo splitting is applied. Moreover, both datasets follow
the same pattern in the predictor space which corresponds with the similarity
observed in the predictor distributions (see box plots below). However, diﬀerent
patterns are observed between training and testing sets in the remaining split-
ting strategies. In particular, dissimilarity-based and informed-based splitting
Fig. 3: On the top, scatter plots depict the 2D projection using
PCA for each one of the 19 captures conforming the CTU19 dataset (Botnet traf-
ﬁc in red and Normal in blue). On the bottom, the corresponding distributions
for the 10 ﬂow-based predictors on each capture.
present the most diﬀerent patterns. Nevertheless, the same pattern is observed
in both datasets for the clustering-based splitting case, although with a small
displacement from the axis.
[Fig. 4 panels – Monte Carlo sampling: pick the first 2000 data points for training and the next 1000 for testing. Dissimilarity-based: pick the first 2000 data points for training, then pick the 900 most different Botnet data points and the 100 most different Normal data points for testing. Clustering-based: find clusters in the dataset, pick the first 2000 data points from a random set of clusters (α) for training, then pick 1000 for testing from a different random set (β). Informed-based: pick the first 2000 data points from a random set of captures (α) for training, then pick 1000 for testing from a different random set (β).]

Fig. 4: Procedure used for generating the training and testing sets for the four splitting strategies: Monte Carlo, Dissimilarity-based, clustering-based and informed-based. In all cases, a training set with 2000 data points and a testing set with 1000 data points were generated.
When projecting many predictors into two dimensions, intricate predictor rela-
tionships could mask regions within the predictor space where the model will
inadequately predict new samples.
An algorithmic approach, as described in [4], is applied to get more detailed
information about the similarities between training and testing datasets. The
general idea is to create a new dataset by randomly permuting the predictors from the training set and then row-wise concatenating it to the original training set, labeling original and permuted samples. A classification model is run on the resulting dataset to predict the probability of new data being in the class of the training set. Table 1 shows the average percentage of samples from the testing set not recognized as part of the training set.
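The dataset-construction step of this procedure can be sketched as follows (the permutation and labeling only; any binary classifier can then be trained on the result — names are illustrative):

```python
import random

def permuted_copy(X, seed=0):
    """Create a synthetic dataset by independently permuting each predictor
    column of the training set, destroying the joint structure while
    keeping the marginal distributions."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(p)]
    for col in cols:
        rng.shuffle(col)
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def similarity_dataset(X_train):
    """Stack original (label 1) and permuted (label 0) rows; a classifier
    trained on this data estimates membership in the training-set region."""
    X_perm = permuted_copy(X_train)
    return X_train + X_perm, [1] * len(X_train) + [0] * len(X_perm)

X = [[1, 2], [3, 4], [5, 6]]
Xy, y = similarity_dataset(X)
```

Test-set rows the classifier assigns a low training-class probability are then counted as "not recognized" as part of the training set.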
As expected, Monte Carlo splitting exhibits the lowest error and standard deviation, confirming the similarities between training and testing sets observed in the 2D projection from Fig. 5. Both dissimilarity and informed-based strategies have the same average error; however, informed-based shows a higher variation. Finally, the clustering-based strategy shows the biggest differences between training and testing sets.
Fig. 5: 2D projection and boxplot distribution for the 25 pairs of training and
testing datasets for each splitting strategy.
Table 1: Average percentage and standard deviation of test samples not recognized as part of the training set for the four splitting strategies

Splitting Strategy    Avg Err %   sd
Informed-based        0.06        0.05
Monte Carlo           0.01        0.01
Cluster-based         0.17        0.13
Dissimilarity-based   0.06        0.02
4 Error Estimation on Baseline Classiﬁers
The impact of the diﬀerent splitting strategies discussed in section 2 is measured
on Random Forest (RF) and CatBoost (CB) classiﬁers. Both CB and RF are
two well-known classifiers providing acceptable results on tabular data without conducting hyperparameter tuning. Both baseline classifiers were executed with default parameters. Nevertheless, a downsampling technique is applied to the training set to deal with class imbalance issues.
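A minimal sketch of random downsampling to the minority-class size (illustrative, not the authors' exact procedure):

```python
import random

def downsample(X, y, seed=0):
    """Randomly drop majority-class rows until all classes have as many
    rows as the smallest class."""
    rng = random.Random(seed)
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n_min = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for xi in rng.sample(rows, n_min):  # keep n_min rows per class
            Xb.append(xi)
            yb.append(label)
    return Xb, yb

# 10 positives vs 3 negatives -> balanced to 3 of each
Xb, yb = downsample(list(range(13)), [1] * 10 + [0] * 3)
```

Only the training set is balanced this way; the testing set keeps its original class proportions.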
Several standard metrics for classification model assessment were used to evaluate the baseline classifiers' performance on the different train/test datasets. The metrics correspond to the True Positive Rate (TPR) and False Positive Rate (FPR).
The Sensitivity measures the proportion of positives that are correctly identi-
ﬁed (TPR), and the Speciﬁcity measures the proportion of negatives that are
correctly identiﬁed (1 −FPR).
Additional metrics were used to deal with class imbalance: F1-Score and Balanced Accuracy. F1-Score is computed as the harmonic mean of precision and sensitivity (TPR). Balanced Accuracy is calculated as the average of the correctly classified proportion of each class.
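These four metrics can be computed directly from confusion-matrix counts; a minimal sketch of the standard definitions:

```python
def classification_metrics(tp, fp, tn, fn):
    """Sensitivity (TPR), specificity (1 - FPR), F1, and balanced accuracy
    from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall on the positive class
    specificity = tn / (tn + fp)            # 1 - FPR
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    balanced_accuracy = (sensitivity + specificity) / 2
    return sensitivity, specificity, f1, balanced_accuracy

sens, spec, f1, bal = classification_metrics(tp=90, fp=20, tn=80, fn=10)
```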
(a) Monte Carlo splitting Strategy (b) Dissimilar splitting Strategy
(c) Informed-based splitting Strategy (d) Clustering-based splitting Strategy
Fig. 6: Random Forest and CatBoost models' performance results for each splitting strategy.
Fig. 6 depicts the overall performance results for baseline classiﬁcations mod-
els using the splitting strategies described in Section 2. Both models were used to predict "Botnet" or "Normal" connections. Since no significant difference is observable in the performance of the two baseline models, the results discussed in this section correspond only to the CatBoost algorithm.
Fig. 6a displays the testing set results for RF (below) and CB (above) using the
MC splitting strategy. Prediction performance in terms of Balanced Accuracy
median is around 0.91. The remaining metrics obtained similar values for both
baseline models. A low variability is observed in all the considered metrics. For
the Dissimilarity-based splitting strategy (see Fig. 6b), all considered metrics de-
crease in comparison to MC strategy. The Balanced Accuracy median decreases
to 0.75 while the specificity decreases to 0.70. The F1 and Sensitivity medians maintained higher values, although lower than those observed in the MC strategy.
Fig. 6c presents the results for the Informed-based splitting strategy. In this case, Balanced Accuracy and F1 show median values of 0.85 and 0.87, respectively.
On the other hand, in terms of Speciﬁcity and Sensitivity, the median values are
around 0.78 and 0.93, respectively. Notice that despite the good performance
in terms of median values, a considerable variation is observed for the F1 and
Sensitivity metrics. Finally, Fig. 6d presents the models’ results using the Clus-
tering splitting strategy. The Balanced Accuracy median value decreased to 0.82 while F1 increased to 0.93 compared with the informed-splitting strategy. The rest of the metrics, Sensitivity and Specificity, showed similar results with medians around 0.89 and 0.87, respectively.
5 Discussion

In general, both baseline models presented acceptable performance for predict-
ing “Botnet” and “Normal” classes since the median Balanced Accuracy metric
was over 0.75 in most cases. As expected, the Monte Carlo splitting strategy showed the smallest estimation error (a high balanced accuracy). The similarities between training and testing sets were already observed in Table 1, where only 1% of the testing set was not recognized as part of the training set. When the proportion of unrecognized samples increases to just 6%, as in the case of the dissimilarity strategy, the estimation error increases considerably (from 0.91 to 0.75 median Balanced Accuracy). Since Monte Carlo and Dissimilarity-based splitting strate-
gies use the same procedure for generating the training set, their analysis can
provide a valuable error estimation under an anomalous but certainly feasible
set of samples.
In the case of informed and clustering based splitting strategies, the diﬀer-
ence between training and testing sets is not only larger than Monte Carlo but
also both show a larger variation (see Table 1). A considerable variation is also
observed in the performance of both baseline classiﬁers. However, both split-
ting strategies still provide information about the robustness of the models on
sets with different representativeness. The previous statement is particularly valid for the informed-based splitting strategy, where captures with very different representativeness levels are used for building training and testing sets. For instance,
informed-based Balanced Accuracy range provides ad-hoc information about the
expected values when not so representative sets are used for training. Moreover,
it is possible to infer that a 0.75 value for Balanced Accuracy could be the worst performance scenario observed for the model. Such a value is still acceptable under some real-life situations.
The clustering-based strategy provides a more extreme scenario than informed-
based for estimating the model performance under not so representative sets.
Under the clustering-based strategy, a concentrated portion of the predictor space present in the testing set is excluded from the training set, whereas in the informed-based strategy it is possible to find data points spread along the whole predictor space. Consequently, it is possible to observe higher variation in the baseline models' performance.
6 Concluding Remarks and Future Work
Despite being the standard splitting strategy, Monte Carlo can overestimate the results when the dataset is not representative. Other splitting techniques based on dissimilarity, on grouping information present in the dataset, and on the application of a clustering algorithm can help in the estimation under different low-representativeness scenarios.
Multiple training and testing sets were generated using the different strate-
gies on the CTU19 Botnet dataset. Small diﬀerences between training and testing
sets were corroborated using the algorithm proposed in [4] for all four techniques. As expected, Monte Carlo showed the smallest differences while clustering-
based showed the biggest.
Two baseline classiﬁers were used for evaluating each splitting strategy in the
error estimation process. The Dissimilarity-based splitting strategy provided a
valuable error estimation under an anomalous but certainly feasible set of sam-
ples. On the other hand, the informed-based strategy offers ad-hoc information about the expected values when not-so-representative sets are used for training, while the clustering-based strategy emerges as an alternative to informed-based for estimating the model performance under low-representativeness sets with more pronounced differences between training and testing sets.
Preliminary results showed the importance of applying the three strategies alternative to the Monte Carlo splitting strategy in order to get a more accurate
error estimation on diﬀerent but possible situations. However, given the par-
ticular low-representativeness nature of the botnet detection problem, a deeper
analysis and evaluation on other datasets should be conducted.
Acknowledgments

The authors would like to acknowledge the financial support received from SIIP-UNCuyo during this work, in particular through projects 06/B363 and 06/B374. In addition,
we want to gratefully acknowledge the support of NVIDIA Corporation with the
donation of the Titan V GPU used for this research.
References

1. Clark, R.D.: OptiSim: An extended dissimilarity selection method for finding di-
verse representative subsets. Journal of Chemical Information and Computer Sci-
ences 37(6), 1181–1188 (1997). https://doi.org/10.1021/ci970282v
2. Guerra, J.L., Veas, E., Catania, C.A.: A study on labeling network
hostile behavior with intelligent interactive tools. In: 2019 IEEE Sym-
posium on Visualization for Cyber Security (VizSec). pp. 1–10 (2019).
3. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data
mining, inference, and prediction. Springer Science & Business Media (2009)
4. Kuhn, M., Johnson, K.: Applied Predictive Modeling with Applications in R. Springer (2013)
5. Martin, T.M., Harten, P., Young, D.M., Muratov, E.N., Golbraikh, A., Zhu, H.,
Tropsha, A.: Does rational selection of training and test sets improve the outcome
of QSAR modeling? Journal of Chemical Information and Modeling 52(10), 2570–
2578 (2012). https://doi.org/10.1021/ci300338w
6. Snarey, M., Terrett, N.K., Willett, P., Wilton, D.J.: Comparison of algorithms
for dissimilarity-based compound selection. Journal of Molecular Graphics and
Modelling 15(6), 372–385 (1997). https://doi.org/10.1016/S1093-3263(98)00008-4
7. Stratosphere IPS Project: The CTU-19 dataset, Malware Captures. https://mcfp.
felk.cvut.cz/publicDatasets/ (October 2017), [Online; accessed Jun-2020]
8. Torres, J.L.G., Catania, C.A., Veas, E.: Active learning approach to label network traffic datasets. Journal of Information Security and Applications 49, 102388 (2019). https://doi.org/10.1016/j.jisa.2019.102388
9. Willett, P.: Dissimilarity-based algorithms for selecting structurally diverse
sets of compounds. Journal of Computational Biology 6(3-4), 447–457 (1999).
10. Xu, Q.S., Liang, Y.Z.: Monte carlo cross validation. Chemometrics and Intelligent
Laboratory Systems 56(1), 1–11 (2001)
11. Yang, Y., Ye, Z., Su, Y., Zhao, Q., Li, X., Ouyang, D.: Deep learning for in vitro
prediction of pharmaceutical formulations. Acta Pharmaceutica Sinica B 9(1), 177–
185 (2019). https://doi.org/10.1016/j.apsb.2018.09.010