
Beyond Random Split for Assessing Statistical

Model Performance

Carlos Catania, Jorge Guerra, Juan Manuel Romero, Gabriel Caffaratti, and Martin Marchetta

Universidad Nacional de Cuyo, Facultad de Ingeniería, LABSIN. Mendoza, Argentina

Abstract. Even though a random train/test split of the dataset is common practice, it is not always the best approach for estimating generalization performance in some scenarios. The fact is that the usual machine learning methodology can sometimes overestimate the generalization error when a dataset is not representative or when rare and elusive examples are a fundamental aspect of the detection problem. In the present work, we analyze strategies based on the predictors' variability for splitting the data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness, and provide a more accurate estimate of the generalization error when the dataset is not representative. Two baseline classifiers based on decision trees were used to test the four splitting strategies considered. Both classifiers were applied to CTU19, a low-representativeness dataset for a network security detection problem. Preliminary results showed the importance of applying the three alternatives to the Monte Carlo splitting strategy in order to get a more accurate error estimation in different but feasible scenarios.

Keywords: Sampling strategies · Population representativeness · Dataset splitting

1 Motivation

Experimental design is a fundamental part of the machine learning workflow. In the particular case of prediction problems, part of the design includes estimating the model's generalization performance. Estimating this performance is a critical aspect of developing a model, since it gives an idea of whether the model can deal with future (unseen) scenarios reasonably well.

The standard experimental design for evaluating the performance of a machine learning model is well known. As depicted in Figure 1, a dataset is usually split in a 70/30 ratio. 30% of the data, called the testing set, should be left aside and ideally never used until model tuning has finished. On the other hand, the remaining 70% of the data, referred to as the training set, can be used to train and optionally validate the model, or to conduct a hyperparameter search for model tuning.


Fig. 1: Standard experimental design for evaluating the performance of a machine learning model.

Train and test datasets need some degree of similarity (both need to follow the same distribution); otherwise, it would be impossible for the model to achieve decent performance on the test set. However, if the examples are too similar in both datasets, it is not possible to guarantee an accurate estimate of the model's generalization performance. Moreover, the model's overall generalization performance could be overestimated.

Even though a random train/test split of the dataset is common practice, it is not always the best approach for estimating generalization performance in some scenarios. A common situation arises when predicting patient outcomes. In these cases, the model should be constructed using certain patient sets (e.g., from the same clinical site or disease stage), but then needs to be tested on a different sample population [4]. Another situation is that it is not always possible to have access to a representative sample. Detecting a non-representative sample is possible through the application of several techniques, such as cross-validation, learning curves, and confidence intervals, among others. Unfortunately, in many cases, a non-representative sample is all we have to generate a machine learning model. In those cases, when a sample does not follow the population distribution, a random split might not provide the required level of representativeness for rare or elusive examples in the testing set. As a result, the standard error metrics could overestimate the performance of the model. In classification problems, it is possible to deal with the lack of representativeness using a stratification strategy, as illustrated in the sketch below. However, when rare examples are not labeled, a predictor-based sampling strategy is necessary [9,1].
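For reference, stratification is straightforward when labels are available. A minimal sketch with scikit-learn follows, assuming `X` and `y` hold the predictors and labels:

```python
# Stratified split: class proportions are preserved in both sets,
# which mitigates labeled imbalance but cannot cover unlabeled rare
# examples.
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)
```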

In the present work, we analyze several strategies based on the predictors' variability for splitting the data into training and testing sets. Such strategies aim at guaranteeing the inclusion of rare or unusual examples with a minimal loss of the population's representativeness. The hypothesis is that, by including rare examples during model evaluation, more accurate performance estimates will be obtained.


The contributions of the present article are:

– The analysis of four splitting strategies with different distributions for training and testing sets.

– The evaluation of two different tree-based baseline classifiers over the four splitting strategies.

2 Splitting Strategies

2.1 Monte Carlo

The usual strategy for model evaluation consists of taking a uniformly random sample without replacement of a portion of the data for the training set, while all other data points are added to the testing set. Such a strategy can be thought of as a special case of Monte Carlo Cross-Validation (MCCV) [10] with just one resample instance. The Monte Carlo (MC) splitting strategy guarantees the same distribution across not only the response but also the predictor variables for the training and testing sets.
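A minimal sketch of this one-resample MCCV split, in Python with NumPy (the function name and set sizes are illustrative, mirroring the 2000/1000 sets used later in Section 3.2):

```python
# One Monte Carlo resample: shuffle, then cut train/test without
# replacement.
import numpy as np

def monte_carlo_split(X, y, n_train=2000, n_test=1000, seed=None):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    tr, te = idx[:n_train], idx[n_train:n_train + n_test]
    return X[tr], y[tr], X[te], y[te]
```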

In comparison with Monte Carlo, the remaining splitting strategies provide steps to create different test sets that include rare or elusive examples, while maintaining properties across the predictor space similar to those of the training set.

2.2 Dissimilarity-based

Maximum dissimilarity splitting strategies were proposed by [1] and [9]. The simplest method to measure dissimilarity consists of using the distance between the predictor values of two samples: the larger the distance between points, the larger the dissimilarity. The application of dissimilarity during data splitting requires a set initialized with a few samples. Then, the dissimilarity between this set and the rest of the unallocated samples can be calculated. The unallocated sample that is most dissimilar to the initial set is then added to the test set.
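A sketch of this greedy selection, assuming Euclidean distance over the predictors (a hypothetical helper, not necessarily the exact procedure of [1] or [9]):

```python
# Greedy maximum-dissimilarity selection: repeatedly move the
# unallocated point with the largest minimum distance to the
# already-selected set into the test set.
import numpy as np
from scipy.spatial.distance import cdist

def max_dissimilarity_split(X, n_test=1000, n_seed=5, seed=None):
    rng = np.random.default_rng(seed)
    test_idx = list(rng.choice(len(X), size=n_seed, replace=False))
    remaining = [i for i in range(len(X)) if i not in set(test_idx)]
    while len(test_idx) < n_test:
        # distance from each candidate to its nearest selected point
        d = cdist(X[remaining], X[test_idx]).min(axis=1)
        test_idx.append(remaining.pop(int(np.argmax(d))))
    return np.array(remaining), np.array(test_idx)  # train, test
```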

Dissimilarity splitting has proved useful for splitting chemical databases [6,5]. Nevertheless, this method strongly depends on the initial set used to calculate the dissimilarity of the remaining samples, which causes problems for small datasets where the initial set is not representative enough [11].

2.3 Informed-based

A well-known non-random split strategy consists of using some kind of grouping information from the data to restrict the set of samples used for testing. The general idea is that, after splitting the data, members of a group present in the training set should not be included in the testing set. Such strategies are well known in areas such as medicine and finance [4], where testing should be conducted on a different patient group or, in the finance field, where the model should be tested on a time series from a different time period.
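A minimal sketch using scikit-learn's GroupShuffleSplit, where `groups` is assumed to carry the grouping information (e.g., the capture each connection belongs to):

```python
# Informed (group-based) split: train and test are drawn from
# disjoint sets of groups.
from sklearn.model_selection import GroupShuffleSplit

def informed_split(X, y, groups, test_frac=0.33, seed=None):
    # test_size here is the fraction of *groups* held out for testing
    splitter = GroupShuffleSplit(n_splits=1, test_size=test_frac,
                                 random_state=seed)
    train_idx, test_idx = next(splitter.split(X, y, groups=groups))
    return train_idx, test_idx
```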

3

2.4 Clustering-based

The clustering split strategy follows the same principle as the informed split. However, there could be situations where no grouping information can be extracted from the samples to perform an informed split. In those cases, a clustering algorithm can be applied to replace the missing information. The labels generated by this procedure are then used to perform a group split, similarly to the informed split strategy.
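A sketch reusing the group split above, with k-means labels as groups (the number of clusters is an illustrative assumption; the paper does not specify the clustering algorithm's parameters):

```python
# Clustering-based split: derive group labels with k-means, then
# split by cluster so that train and test cover disjoint clusters.
from sklearn.cluster import KMeans

def clustering_split(X, y, n_clusters=10, test_frac=0.33, seed=None):
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=seed).fit_predict(X)
    return informed_split(X, y, groups=labels,
                          test_frac=test_frac, seed=seed)
```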

3 Application to a Network Security Dataset

The four splitting strategies described in Section 2 were applied to a network security dataset for botnet detection composed of nineteen network captures published by the Stratosphere IPS research group at CTU [7]. Specifically, the dataset has fourteen botnet captures and five normal captures, including traffic such as DNS, HTTPS, and P2P. In total, the captures represent 20,866 connections, 19,271 labeled as "Botnet" and 1,595 labeled as "Normal". All these captures were gathered between 2013 and 2017.

The first ten predictors of the CTU19 dataset summarize each flow of connections with the same IP address, protocol, and destination port into a 10-dimensional numerical vector:

$v_{feat} = \langle x_{sp}, x_{wp}, x_{wnp}, x_{snp}, x_{ds}, x_{dm}, x_{dl}, x_{ss}, x_{sm}, x_{sl} \rangle$  (1)

where the first four dimensions of the numerical vector represent the periodicity predictors (strong periodicity ($x_{sp}$), weak periodicity ($x_{wp}$), weak non-periodicity ($x_{wnp}$), and strong non-periodicity ($x_{snp}$)), the next three refer to the duration predictors (duration short ($x_{ds}$), duration medium ($x_{dm}$), and duration large ($x_{dl}$)), and the last three represent the size predictors (size short ($x_{ss}$), size medium ($x_{sm}$), and size large ($x_{sl}$)). The vector for a given connection is generated considering the cumulative frequency of the corresponding values associated with the behavior of each predictor.
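As a rough illustration only: assuming each flow of a connection has already been discretized into one of the ten behavioral categories (a step the paper does not detail), the vector could be built as a normalized frequency count:

```python
# Hypothetical construction of the vector in Eq. (1): relative
# frequency of each behavioral category over a connection's flows.
import numpy as np

CATEGORIES = ["sp", "wp", "wnp", "snp",   # periodicity
              "ds", "dm", "dl",           # duration
              "ss", "sm", "sl"]           # size

def connection_vector(flow_categories):
    counts = np.array([flow_categories.count(c) for c in CATEGORIES],
                      dtype=float)
    return counts / max(counts.sum(), 1.0)
```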

In addition to the information provided by the flow-based predictors, the CTU19 dataset includes flow-related information such as source IP, destination IP, protocol, port, and the capture each flow is linked with. However, the present study focuses only on the information provided by the flow-based predictors, as discussed in [2,8].

3.1 Initial Exploratory Analysis

Fig. 2 shows a 2D projection of the CTU19 dataset considering the first two principal components (PCA was applied to the flow-based predictors). In addition, the figure includes the distribution of each predictor. As depicted by the box plot, the botnet and normal classes present different distributions over the flow-based predictors. However, both classes show a partial overlap when projected into the 2D space.


Fig. 2: Boxplot and 2D projection of the flow-based predictors in the CTU19 dataset. Botnet traffic in red and Normal in blue.

Fig. 3 decomposes the CTU19 dataset by traffic capture, with the same information as provided by Fig. 2. In general, the 2D projection patterns of Normal captures differ from those of Botnet captures: Normal captures are concentrated, while Botnet captures spread across the 2D predictor space. When analyzing Normal captures, most of them overlap the same predictor space. In other words, each capture has some examples in every portion of the Normal predictor subspace, which suggests adequate representativeness. Nevertheless, in the case of Botnet, several captures have only a limited presence in the class predictor subspace (see captures 2014-02-07-win3 and 2014-01-25-win3). Such lack of representativeness in some Botnet captures could hinder the estimation of the classification model's performance.

3.2 Training and Testing Sets Creation

Fig. 4 describes the process used for creating the training and testing sets according to each splitting strategy. A subset of 3000 data points from CTU19 was used for each strategy: 2000 data points for the training set and 1000 for the testing set. The generation of the training and testing sets depends on the splitting strategy applied, as discussed in Section 2. The procedure is repeated 25 times; therefore, 25 pairs of training and testing sets are generated for each splitting strategy.
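Putting the earlier sketches together, the resampling procedure could look as follows (function names refer to the sketches in Section 2; `X`, `y`, and `groups` are assumed to be loaded from CTU19, and return conventions vary slightly across the sketches):

```python
# 25 train/test pairs per strategy, 2000/1000 points each.
strategies = {
    "monte_carlo": lambda s: monte_carlo_split(X, y, seed=s),
    "dissimilarity": lambda s: max_dissimilarity_split(X, seed=s),
    "informed": lambda s: informed_split(X, y, groups, seed=s),
    "clustering": lambda s: clustering_split(X, y, seed=s),
}
splits = {name: [fn(s) for s in range(25)]
          for name, fn in strategies.items()}
```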

The differences between the splitting strategies can be observed in Fig. 5, considering data points from the 25 resamples. The 2D projection using the first two principal components confirms the similarity between training and testing sets when the baseline Monte Carlo splitting is applied. Moreover, both datasets follow the same pattern in the predictor space, which corresponds with the similarity observed in the predictor distributions (see the box plots below). However, different patterns between training and testing sets are observed for the remaining splitting strategies. In particular, the dissimilarity-based and informed-based splittings present the most different patterns.


Fig. 3: On the top, scatter plots depict the 2D projection using PCA for each of the 19 captures in the CTU19 dataset (Botnet traffic in red and Normal in blue). On the bottom, the corresponding distributions of the 10 flow-based predictors for each capture.

Nevertheless, the same pattern is observed in both datasets for the clustering-based splitting, although with a small displacement from the axes.


[Figure 4 flowchart. Monte Carlo sampling: shuffle the original dataset; pick the first 2000 data points for train and the next 1000 for test. Dissimilarity-based sampling: shuffle; pick the first 2000 data points for train; then pick the 900 most different Botnet data points and the 100 most different Normal data points for test. Clustering-based sampling: find clusters in the dataset; pick the first 2000 data points from a random set of clusters (α) for train, then pick 1000 for test from a different random set (β), with α∩β=∅. Informed sampling: pick the first 2000 data points from a random set of captures (α) for train, then pick 1000 for test from a different random set (β), with α∩β=∅.]

Fig. 4: Procedure used for generating the training and testing sets for the four splitting strategies: Monte Carlo, dissimilarity-based, clustering-based, and informed-based. In all cases, a training set with 2000 data points and a testing set with 1000 data points were generated.

When projecting many predictors into two dimensions, intricate predictor rela-

tionships could mask regions within the predictor space where the model will

inadequately predict new samples.

An algorithmic approach, as described in [3], is applied to obtain more detailed information about the similarities between the training and testing datasets. The general idea is to create a new dataset by randomly permuting the predictors of the training set and then row-wise concatenating it to the original training set, labeling original and permuted samples. A classification model is run on the resulting dataset to predict the probability of new data belonging to the training-set class. Table 1 shows the average percentage of samples from the testing set not recognized as part of the training set.
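A sketch of this check under common assumptions (a random forest as the discriminator and a 0.5 probability threshold; neither is specified by the paper):

```python
# Train/test similarity check after [3]: permute each predictor
# column to build "background" rows, train a classifier to separate
# original from permuted rows, then measure how many test points are
# not assigned to the training-set class.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def pct_not_recognized(X_train, X_test, seed=0):
    rng = np.random.default_rng(seed)
    X_perm = np.column_stack([rng.permutation(c) for c in X_train.T])
    X_all = np.vstack([X_train, X_perm])
    y_all = np.r_[np.ones(len(X_train)), np.zeros(len(X_perm))]
    clf = RandomForestClassifier(random_state=seed).fit(X_all, y_all)
    p_train = clf.predict_proba(X_test)[:, 1]  # P(row ~ training set)
    return float(np.mean(p_train < 0.5))
```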

As expected, Monte Carlo splitting exhibits the lowest error and standard deviation, confirming the similarity between training and testing sets observed in the 2D projection of Fig. 5. The dissimilarity-based and informed-based strategies have the same average error; however, the informed-based strategy shows a higher variation. Finally, the clustering-based strategy shows the largest differences between training and testing sets.


Fig. 5: 2D projection and boxplot distribution for the 25 pairs of training and

testing datasets for each splitting strategy.

Table 1: Average percentage and standard deviation of test samples not recognized as part of the training set for the four splitting strategies.

Avg Err %   sd     Splitting Strategy
0.06        0.05   informed-based
0.01        0.01   monte carlo
0.17        0.13   cluster-based
0.06        0.02   dissimilarity-based

4 Error Estimation on Baseline Classiﬁers

The impact of the different splitting strategies discussed in Section 2 is measured on Random Forest (RF) and CatBoost (CB) classifiers. Both CB and RF are well-known classifiers that provide acceptable results on tabular data without hyperparameter tuning, so both baseline classifiers were executed with default parameters. Nevertheless, a downsampling technique is applied to the training set to deal with class imbalance.
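A sketch of this baseline setup, using default parameters plus a simple random majority-class downsampling (the exact downsampling scheme is not specified in the paper; `X_train`, `y_train` are assumed to come from one of the splits above):

```python
# Balance the training set by downsampling the majority class, then
# fit both baselines with default parameters.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from catboost import CatBoostClassifier

def downsample(X, y, seed=0):
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == c), size=n_min, replace=False)
        for c in classes])
    return X[keep], y[keep]

X_bal, y_bal = downsample(X_train, y_train)
rf = RandomForestClassifier().fit(X_bal, y_bal)
cb = CatBoostClassifier(verbose=0).fit(X_bal, y_bal)
```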

4.1 Metrics

Several standard metrics for classification model assessment were used to evaluate the baseline classifiers' performance on the different train/test datasets. The metrics are built from the True Positive Rate (TPR) and False Positive Rate (FPR): Sensitivity measures the proportion of positives that are correctly identified (TPR), and Specificity measures the proportion of negatives that are correctly identified (1 − FPR).


Additional metrics were used to deal with class imbalance: F1-Score and Balanced Accuracy. F1-Score is computed as the harmonic mean of precision and recall (TPR). Balanced Accuracy is calculated as the average of the proportion of correctly classified examples of each class individually.
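These metrics are readily available in scikit-learn; a small helper, assuming "Botnet" as the positive class, could be:

```python
# Reported metrics computed from predictions on the test set.
from sklearn.metrics import (balanced_accuracy_score, f1_score,
                             recall_score)

def report(y_true, y_pred, pos="Botnet", neg="Normal"):
    return {
        "Sensitivity (TPR)": recall_score(y_true, y_pred, pos_label=pos),
        "Specificity (1 - FPR)": recall_score(y_true, y_pred,
                                              pos_label=neg),
        "F1": f1_score(y_true, y_pred, pos_label=pos),
        "Balanced Accuracy": balanced_accuracy_score(y_true, y_pred),
    }
```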

4.2 Results

Fig. 6: Random Forest and CatBoost models' performance results for each splitting strategy: (a) Monte Carlo, (b) Dissimilarity-based, (c) Informed-based, (d) Clustering-based.


Fig. 6 depicts the overall performance results for the baseline classification models using the splitting strategies described in Section 2. Both models were used to predict "Botnet" or "Normal" connections. Since no significant difference is observable in the performance of the two baseline models, the results discussed in this section correspond only to the CatBoost algorithm.

Fig. 6a displays the testing set results for RF (below) and CB (above) using the MC splitting strategy. Prediction performance in terms of the Balanced Accuracy median is around 0.91. The remaining metrics obtained similar values for both baseline models, and low variability is observed in all the considered metrics. For the dissimilarity-based splitting strategy (see Fig. 6b), all considered metrics decrease in comparison to the MC strategy: the Balanced Accuracy median decreases to 0.75, while Specificity decreases to 0.70. The F1 and Sensitivity medians maintained higher values, although lower than those observed in the MC strategy.

Fig. 6c shows the results for the informed-based splitting strategy. In this case, Balanced Accuracy and F1 show median values of 0.85 and 0.87, respectively. In terms of Specificity and Sensitivity, the median values are around 0.78 and 0.93, respectively. Notice that, despite the good performance in terms of median values, a considerable variation is observed for the F1 and Sensitivity metrics. Finally, Fig. 6d presents the models' results using the clustering splitting strategy. The Balanced Accuracy median decreases to 0.82 while F1 increases to 0.93 compared with the informed-based strategy. The remaining metrics, Sensitivity and Specificity, show similar results, with medians around 0.89 and 0.87, respectively.

5 Discussion

In general, both baseline models presented acceptable performance for predicting the "Botnet" and "Normal" classes, since the median Balanced Accuracy was over 0.75 in most cases. As expected, the Monte Carlo splitting strategy showed the smallest estimation error (a high Balanced Accuracy). The similarity between training and testing sets was already observed in Table 1, where only 1% of the testing set was not recognized as part of the training set. When the share of unrecognized samples rises to just 6%, as in the case of the dissimilarity-based strategy, the estimation error increases considerably (from 0.91 to 0.75 median Balanced Accuracy). Since the Monte Carlo and dissimilarity-based splitting strategies use the same procedure for generating the training set, their analysis can provide a valuable error estimation under an anomalous but certainly feasible set of samples.

In the case of the informed-based and clustering-based splitting strategies, the difference between training and testing sets is not only larger than for Monte Carlo, but both also show a larger variation (see Table 1). A considerable variation is also observed in the performance of both baseline classifiers. However, both splitting strategies still provide information about the robustness of the models on sets with different representativeness. This is particularly true for the informed-based splitting strategy, where captures with very different representativeness levels are used for building the training and testing sets.


For instance, the informed-based Balanced Accuracy range provides ad-hoc information about the expected values when less representative sets are used for training. Moreover, it is possible to infer that a Balanced Accuracy of 0.75 could be the worst performance scenario observed for the model. Such a value is still suitable in some real-life situations.

The clustering-based strategy provides a more extreme scenario than the informed-based one for estimating model performance under less representative sets. Under the clustering-based strategy, a concentrated portion of the predictor space present in the testing set is excluded from the training set, whereas under the informed-based strategy it is possible to find data points spread along the whole predictor space. Consequently, a higher variation is observed in the baseline models' performance.

6 Concluding Remarks and Future Work

Despite being the standard splitting strategy, Monte Carlo can overestimate the results when the dataset is not representative. Other splitting techniques, based on dissimilarity, on information present in the dataset, or on the application of a clustering algorithm, can help in the estimation under different low-representativeness scenarios.

Multiple training and testing sets were generated using the different strategies on the CTU19 Botnet dataset. Small differences between training and testing sets were corroborated using the algorithm proposed by [3] for all four techniques. As expected, Monte Carlo showed the smallest differences, while clustering-based showed the largest.

Two baseline classifiers were used for evaluating each splitting strategy in the error estimation process. The dissimilarity-based splitting strategy provided a valuable error estimation under an anomalous but certainly feasible set of samples. On the other hand, the informed-based strategy offers ad-hoc information about the expected values when less representative sets are used for training, while the clustering-based strategy emerges as an alternative to the informed-based one for estimating model performance under low-representativeness sets, with a more pessimistic estimation.

Preliminary results showed the importance of applying the three alternatives to the Monte Carlo splitting strategy in order to get a more accurate error estimation in different but feasible situations. However, given the particular low-representativeness nature of the botnet detection problem, a deeper analysis and an evaluation on other datasets should be conducted.

7 Acknowledgments

The authors would like to thank SIIP-UNCuyo for the financial support received during this work, in particular through projects 06/B363 and 06/B374. In addition,


we want to gratefully acknowledge the support of NVIDIA Corporation with the

donation of the Titan V GPU used for this research.

References

1. Clark, R.D.: OptiSim: An extended dissimilarity selection method for finding diverse representative subsets. Journal of Chemical Information and Computer Sciences 37(6), 1181–1188 (1997). https://doi.org/10.1021/ci970282v

2. Guerra, J.L., Veas, E., Catania, C.A.: A study on labeling network hostile behavior with intelligent interactive tools. In: 2019 IEEE Symposium on Visualization for Cyber Security (VizSec). pp. 1–10 (2019). https://doi.org/10.1109/VizSec48167.2019.9161489

3. Hastie, T., Tibshirani, R., Friedman, J.: The elements of statistical learning: data mining, inference, and prediction. Springer Science & Business Media (2009)

4. Kuhn, M., Johnson, K.: Applied Predictive Modeling with Applications in R. Springer (2013)

5. Martin, T.M., Harten, P., Young, D.M., Muratov, E.N., Golbraikh, A., Zhu, H., Tropsha, A.: Does rational selection of training and test sets improve the outcome of QSAR modeling? Journal of Chemical Information and Modeling 52(10), 2570–2578 (2012). https://doi.org/10.1021/ci300338w

6. Snarey, M., Terrett, N.K., Willett, P., Wilton, D.J.: Comparison of algorithms for dissimilarity-based compound selection. Journal of Molecular Graphics and Modelling 15(6), 372–385 (1997). https://doi.org/10.1016/S1093-3263(98)00008-4

7. Stratosphere IPS Project: The CTU-19 dataset, Malware Captures. https://mcfp.felk.cvut.cz/publicDatasets/ (October 2017), [Online; accessed Jun-2020]

8. Torres, J.L.G., Catania, C.A., Veas, E.: Active learning approach to label network traffic datasets. Journal of Information Security and Applications 49, 102388 (2019). https://doi.org/10.1016/j.jisa.2019.102388

9. Willett, P.: Dissimilarity-based algorithms for selecting structurally diverse sets of compounds. Journal of Computational Biology 6(3-4), 447–457 (1999). https://doi.org/10.1089/106652799318382

10. Xu, Q.S., Liang, Y.Z.: Monte Carlo cross validation. Chemometrics and Intelligent Laboratory Systems 56(1), 1–11 (2001)

11. Yang, Y., Ye, Z., Su, Y., Zhao, Q., Li, X., Ouyang, D.: Deep learning for in vitro prediction of pharmaceutical formulations. Acta Pharmaceutica Sinica B 9(1), 177–185 (2019). https://doi.org/10.1016/j.apsb.2018.09.010
