problexity -- an open-source Python library for binary classification problem complexity assessment
Joanna Komorniczak and Pawel Ksieniewicz
Department of Systems and Computer Networks,
Faculty of Information and Communication Technology,
Wrocław University of Science and Technology,
Wybrzeże Wyspiańskiego 27, 50-370 Wrocław, Poland
Abstract
The classification problem's complexity assessment is an essential element of many topics in the supervised learning domain. It plays a significant role in meta-learning, becoming the basis for determining meta-attributes, and in multi-criteria optimization, allowing the evaluation of training set resampling without needing to rebuild the recognition model. The tools currently available to the academic community that enable the calculation of problem complexity measures exist only as libraries for the C++ and R languages. This paper describes a software module that allows the estimation of 22 complexity measures in the Python language, compatible with the scikit-learn programming interface, allowing research using them to be carried out in the most popular programming environment of the machine learning community.
Keywords: Problem complexity, Classification, Python
1. Motivation and significance
Proper evaluation of the algorithms and methods proposed for solving the pattern classification task, in accordance with good practices adopted in the machine learning research environment [1], requires extensive experiments performed on a large pool of appropriately diverse data sets [2]. In a typical approach to designing an experimental evaluation procedure, the usefulness of a selected group of problems is most often assessed by basic reporting of the sets' dimensionality, the number of examples describing them, the number of problem classes [3], and their prior distribution in the case of non-uniform representation [4]. However, it is necessary to remember that these are only simple measures, briefly describing the problem difficulty without giving the researcher sufficient insight into the actual complexity of the task. The true complexity of a problem is contained not only in the basic characteristics of the problem space or class imbalance but also in the problem-specific distribution, understood as its linearity, neighborhood characteristics, geometrical and topological complexity, and feature dependency [5].
The classification task is the main issue of supervised machine learning,
finding its applications in almost every branch of life and science, starting
from economics, through epidemiology and medicine, to machine vision and
predictive maintenance systems [6]. Such a variety of undertaken problems leads to a plethora of difficulties, expressed in a multiplicity of class-building clusters, a significant imbalance ratio, or substantial class overlap, to name a few. The considered complexity metrics demonstrate the potential for assessing the diversity of collected data sets; they were precisely described and organized into a taxonomy by Lorena et al. [7].
Computations of problem complexity do not only find application in the proper selection of benchmark datasets for experimental evaluation. Their use in meta-learning solutions has been particularly popular in recent years [8], being one of the primary sources of meta-features for problem identification. Such an approach allows for the induction of task representations [9] and the automation of deep neural network structure configuration [10]. Measures of problem complexity are also used with geospatial data for filtering predictor noise [11] or in difficulty metrics dedicated to spectral data [12]. Preliminary works assessing the usefulness of such measures in data stream processing are also present in the literature [13].
All the above-mentioned examples concern research from the narrow span
of the last few years, in which researchers carried out elements of problem
complexity analysis with the use of two problem-complexity libraries cur-
rently available to the scientific community.
The first library, DCoL, was developed in 2010 by a team from Universitat Ramon Llull and Bell Laboratories [14] and implements 14 basic complexity metrics in the C++ language. In 2017, its source code was made available on GitHub.
A second library, ECoL, published as supporting software for the publication by Lorena et al. [7], has been developed since September 2016 and implements 22 metrics in the R language, currently being the most comprehensive solution of this type available to the scientific community. It reached its current version (0.3.0) in December 2020. Its entire version history is available in a public GitHub repository.
It is important to emphasize that in recent years, the Python program-
ming language has started playing a much more significant role in the devel-
opment of machine learning methods than any other experimental environ-
ment [15].
This publication presents the problexity library, containing the implementation of 22 problem complexity measures divided into six categories: (i) feature-based, (ii) linearity, (iii) neighborhood, (iv) network, (v) dimensionality, and (vi) class imbalance, as well as a ComplexityCalculator class, introducing additional utilities that facilitate research and enable simple expansion of the module with additional metrics. This library allows new methods considering classification complexity to be developed, which will further impact the evolution of machine learning algorithms both in the meta-learning field and in other research applications of recent years.
The measures of problem complexity can also be used as a substitute criterion in optimization tasks. We can expect that the quality of classification will depend on the problem's difficulty expressed in measures of complexity. Testing the classifier's ability to recognize objects, often using cross-validation and induction algorithms with computational overhead significantly larger than measure computation, is time-consuming. We can potentially speed up the optimization process by using problem complexity assessment as an alternative criterion.
2. Software description
This section contains the software description of the library. It presents the package structure, a minimal processing example, and an exemplary analysis of the results. It also describes the implemented measures of problem complexity and the ComplexityCalculator model.
2.1. Software Architecture
The library consists of two main elements: the measures submodule and the ComplexityCalculator class. Within the measures submodule, six categories of measures are distinguished:
Feature-based: containing the F1, F1v, F2, F3, and F4 measures,
Linearity: containing the L1, L2, and L3 measures,
Neighborhood: with the N1, N2, N3, N4, T1, and LSC measures,
Network: containing the density, ClsCoef, and Hubs measures,
Dimensionality: containing the T2, T3, and T4 measures,
Class imbalance: containing the C1 and C2 measures.
The ComplexityCalculator class enables the computation and analysis of data sets in the context of the difficulty of the classification task. It offers a set of methods that allow calculating the measure values and presenting the result as a single score, a report, or an illustrative graph.
2.2. Measures
The package divides measures into six categories, introduced in the publication by Lorena et al. [7]. The measures return 0 or a value close to 0 for simple problems and values close to 1 for complex problems. The only measures not limited to 1 are T2 and T3 from the dimensionality category. Four implemented measures (L1, L2, L3, N4) are non-deterministic; therefore, subsequent calculations on the same data can yield varying results. In the case of measures from the linearity category, this behavior results from the optimization of the linear SVM classifier, whose weights are initialized randomly. In the case of the L3 and N4 measures, the randomized generation of synthetic instances makes the calculations non-deterministic.
2.2.1. Feature-based measures
The measures describe the ability of features to separate classes in the classifi-
cation problem. They analyze features separately or evaluate how attributes
work together.
Figure 1: Overall schema of the software architecture, showing the ComplexityCalculator methods (fit(X, y), report(), score(), and plot()) and the measure functions, from f1() to c2(), grouped into the six categories.
F1: Maximum Fisher's discriminant ratio.
The measure describes the overlap of feature values in each class. As in ECoL, the inverse of the original formulation is used, taking into account the largest discriminant ratio (a sketch of this measure follows the list below).
F1v: Directional-vector maximum Fisher's discriminant ratio.
The measure computes a projection that maximizes class separation according to the directional Fisher criterion.
F2: Volume of the overlapping region.
The measure describes the overlap of feature values between the classes. It is determined by the minimum and maximum values of the features in each class. The overlap is then calculated and normalized by the range of values in each class.
F3: Maximum individual feature efficiency.
The measure describes the efficiency of each feature in separating the classes, considering the maximum value among all features. The equation proposed by Lorena et al. [7] has been slightly modified to obtain a maximum complexity value of 1 in case all instances of the separate classes overlap.
F4: Collective feature efficiency.
The measure describes the synergy of features. Instances separated by the most discriminative attribute not yet used are excluded from further analysis. The process continues until all instances are classified or all features are used. The measure is calculated according to the number of instances in the overlapping region and the total number of samples.
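To make the feature-based idea concrete, the following sketch computes a two-class F1 value from its standard formulation. It is a minimal illustration written for this description, not the library's internal code; the helper name, the 0/1 label convention, and the use of the inverse form (as in ECoL) are assumptions.
import numpy as np

def f1_sketch(X, y):
    """Illustrative maximum Fisher's discriminant ratio (F1), two classes.

    Assumes X is an (n_samples, n_features) array and y holds labels 0/1.
    """
    X0, X1 = X[y == 0], X[y == 1]
    # Per-feature discriminant ratio: squared mean difference over summed variances
    ratio = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2 / (
        X0.var(axis=0) + X1.var(axis=0) + 1e-12)
    # Inverse of the largest ratio, so that 0 means easy and 1 means hard
    return 1.0 / (1.0 + ratio.max())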
2.2.2. Linearity measures
The measures evaluate the level of linear separability of the problem classes. They use the linear Support Vector Machine (SVM) classifier.
L1: Sum of the error distance by linear programming.
The measure calculates the distance of incorrectly classified samples from the SVM hyperplane.
L2: Error rate of a linear classifier.
The measure is described by the error rate of the linear SVM classifier within the dataset (see the sketch after this list).
L3: Non-linearity of a linear classifier.
The measure is described by the classifier's error rate on synthesized points of the dataset. The synthetic points are obtained by linearly interpolating instances of each class. The class of the original examples determines the label of an augmented point, and the number of artificial points is equal to the size of the original dataset.
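A hedged sketch of the L2 computation follows. The library is stated (Section 2.4) to use scikit-learn's LinearSVC; the function name below and the omission of any preprocessing are illustrative assumptions.
import numpy as np
from sklearn.svm import LinearSVC

def l2_sketch(X, y):
    """Illustrative error rate of a linear classifier (L2)."""
    clf = LinearSVC().fit(X, y)          # linear SVM, as used by the library
    return np.mean(clf.predict(X) != y)  # fraction of misclassified samples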
2.2.3. Neighborhood measures
The measures analyze the neighborhood of instances in a feature space.
Neighbors of each sample are established based on the distance between
problem instances.
N1: Fraction of borderline points.
A Minimum Spanning Tree (MST) is generated over the input instances in order to obtain this measure. The value is computed as the number of edges in the MST between examples of different classes over the total number of samples.
N2: Ratio of intra/extra class nearest neighbor distance.
The measure depends on the distance of each problem instance to its nearest neighbor of the same class and the distance to its nearest neighbor of a different class. The final value is calculated according to the proportions of those values.
N3: Error rate of the NN classifier.
The measure is determined by the error rate of the one-nearest-neighbor classifier under the leave-one-out evaluation protocol (see the sketch after this list).
N4: Non-linearity of the NN classifier.
The measure is determined by the error rate of the k-nearest-neighbor classifier on synthetic points, generated by linearly interpolating the original instances. The classifier is fitted on the original points and evaluated on the synthetic instances.
T1: Fraction of hyperspheres covering data.
The measure is defined by the number of hyperspheres needed to cover the data divided by the number of instances. First, a hypersphere is generated for each problem sample, with the sample lying at its center. Its radius depends on the distance to the nearest instance of another class. Hyperspheres are eliminated if their center instance is already covered by a different one. The elimination starts from the hyperspheres with the largest radius and continues toward those with smaller radii. The hyperspheres that were not eliminated are taken into account during the calculation of complexity.
LSC: Local set average cardinality.
The measure depends on the distances between instances and the distances to each instance's nearest enemy, i.e., the nearest sample of the opposite class. The number of cases that lie closer to a sample than its nearest enemy is considered during the calculation.
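As an example from this category, N3 can be pictured with standard scikit-learn utilities. This is a sketch of the described procedure, not the library's implementation; the function name and estimator configuration are assumptions.
import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

def n3_sketch(X, y):
    """Illustrative N3: error rate of 1-NN under leave-one-out evaluation."""
    predictions = cross_val_predict(
        KNeighborsClassifier(n_neighbors=1), X, y, cv=LeaveOneOut())
    return np.mean(predictions != y)  # fraction of misclassified samples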
2.2.4. Network measures
The measures consider the instances as the vertices of the graph. All mea-
sures of this category generate an epsilon-Nearest Neighbours graph. The
epsilon value is set to 0.15, same as in the ECoL package. The edges are
selected based on the Gower distance between samples, normalized to the
range between 0 and 1. The edge is placed between the points if a normal-
ized Gower distance is smaller than 0.15. Edges between instances of distinct
classes are removed.
density: Density metric.
The measure calculates the number of edges in the final graph divided by the total possible number of edges.
clsCoef: Clustering coefficient metric.
For the purpose of obtaining this measure, the neighborhood of each vertex is calculated, i.e., the instances directly connected to it. Then, the number of edges between the sample's neighbors is calculated and divided by the maximum possible number of edges between them. The final measure is calculated based on the neighborhood of each point in the dataset.
hubs: Hubs metric.
For the purpose of obtaining this measure, the neighborhood of each vertex is obtained. The measure scores each sample by the number of connections to its neighbors, weighted by the number of connections the neighbors themselves have.
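A hedged sketch of the graph construction and the density measure as described above. Plain Euclidean distance normalized by its maximum stands in for the Gower distance (a reasonable stand-in only for purely numeric data), and the function name and normalization details are assumptions rather than the library's exact procedure.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def density_sketch(X, y, epsilon=0.15):
    """Illustrative density of the epsilon-NN graph described above."""
    y = np.asarray(y)
    distances = squareform(pdist(X))        # pairwise distances
    distances /= distances.max()            # normalize to the [0, 1] range
    adjacency = distances < epsilon         # connect sufficiently close pairs
    np.fill_diagonal(adjacency, False)      # no self-loops
    adjacency &= y[:, None] == y[None, :]   # drop edges between distinct classes
    n_edges = adjacency.sum() / 2           # each edge was counted twice
    n = len(y)
    return n_edges / (n * (n - 1) / 2)      # fraction of all possible edges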
2.2.5. Dimensionality measures
The measures analyze the relation between the number of features and the number of instances in the dataset (a joint sketch of the three measures follows the list below).
T2: Average number of features per dimension.
For the purpose of obtaining this measure, the number of dimensions describing the dataset is divided by the number of instances.
T3: Average number of PCA dimensions per point.
To obtain this measure, first, the number of PCA components needed to represent 95% of data variability is calculated. Then, the value is divided by the number of instances in the dataset.
T4: Ratio of the PCA dimension to the original dimension.
To obtain this measure, the number of PCA components needed to represent 95% of data variability is divided by the original number of dimensions. This measure describes the proportion of relevant dimensions in the dataset.
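The three dimensionality measures can be sketched jointly, since they share the PCA step; the function name is illustrative and the library's PCA configuration may differ.
from sklearn.decomposition import PCA

def dimensionality_sketch(X):
    """Illustrative T2, T3 and T4, following the descriptions above."""
    n_samples, n_features = X.shape
    pca = PCA(n_components=0.95).fit(X)  # components covering 95% of variance
    m = pca.n_components_
    t2 = n_features / n_samples          # dimensions per instance
    t3 = m / n_samples                   # PCA components per instance
    t4 = m / n_features                  # proportion of relevant dimensions
    return t2, t3, t4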
2.2.6. Class imbalance measures
The Class Imbalance measures evaluate the dataset based on the degree of
data imbalance.
C1: Entropy of class proportions.
The measure is obtained based on the proportion of each class's samples relative to the total number of samples (see the sketch after this list).
C2: Imbalance ratio.
The measure is obtained based on the proportion of each class's samples relative to the number of samples of the opposite class.
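A hedged sketch of C1, using the complement of the normalized entropy of the class priors so that, per the convention of Section 2.2, values near 0 indicate simple (balanced) problems. The exact form is an assumption, though for the priors reported in Section 2.4 (0.373/0.627) it yields approximately 0.047, consistent with the report output shown there.
import numpy as np

def c1_sketch(y):
    """Illustrative entropy of class proportions (C1)."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()            # prior probability of each class
    entropy = -np.sum(p * np.log(p))     # Shannon entropy of the priors
    return 1 - entropy / np.log(len(p))  # 0 for balanced priors, near 1 when extreme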
2.3. ComplexityCalculator
The library introduces the ComplexityCalculator class to facilitate the use of the measure implementations. Its objects are initialized with a list of measures and, optionally, (a) a list of category colors for visualization purposes and (b) a dictionary indicating the number of measures in a given category, which are necessary only in the case of a non-default collection of measures.
By default, the module will analyze all 22 metrics of the measures module. Executing the fit() method, which takes a set of features X and a set of labels y as arguments, calculates the values of the metrics.
The obtained measures’ values can be accessed as a single value using the
score() method, optionally taking a vector of weights as a parameter. The
vector length has to correspond to the number of analyzed measures. By
default, each measure has an equal weight, which means that executing the
score() method will return the arithmetic mean of measured complexities.
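For instance, assuming a fitted calculator cc as in the example of Section 2.4, a non-uniform weighting might look as follows; passing the weights positionally and ordering them by the _metrics() listing are assumptions based on the descriptions in this paper.
# Equal-weight score: the arithmetic mean of all computed measures
overall = cc.score()

# Hypothetical weighting that doubles the influence of the linearity measures
weights = [2 if name in ('l1', 'l2', 'l3') else 1 for name in cc._metrics()]
weighted = cc.score(weights)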
The report() method provides a more detailed description of the classification problem. It returns a dictionary containing a summary of each metric value, their arithmetic mean, and other data set characteristics, such as the number of samples, the dimensionality, the labels, and the number of classes with their prior probabilities.
The values can also be presented in the form of a graph. Executing the
plot method returns a chart that illustrates the value of each measure from
respective categories as well as the default score of the problem.
2.4. Minimal processing example
The problexity module is open Python software released under the GPL-3.0 license and versioned in the public Python Package Index (PyPI) repository. Therefore, it can be easily obtained with the pip package installer:
> pip install problexity
To enable modification of the measures provided by problexity, or to expand it with functions that it does not yet include, it is also possible to install the module directly from the source code. Any introduced modifications then propagate to the module available in the environment:
> git clone
> cd problexity
> make install
The problexity module is imported in the standard Python fashion.
At the same time, for the convenience of implementation, the authors rec-
ommend importing it under the px alias:
# Importing problexity
import problexity as px
The library is equipped with the ComplexityCalculator, which serves as the basic tool for establishing the metrics. The following code presents an example of loading a benchmark data set shipped with the scikit-learn module and determining the values of the measures by fitting the complexity model in accordance with the standard API adopted for scikit-learn estimators:
# Loading benchmark dataset from scikit-learn
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)

# Initialize ComplexityCalculator with default parametrization
cc = px.ComplexityCalculator()

# Fit model with data, y)
As the L1, L2 and L3 measures use the recommended LinearSVC implementation from the svm module of the scikit-learn package in their calculations, the warning "ConvergenceWarning: Liblinear failed to converge, increase the number of iterations." might occur. It is not a problem for the metric calculation; it only indicates the lack of linear separability of the problem.
The complexity calculator object stores a list of all estimated measures
that can be read by the model’s complexity attribute:
[0.227 0.064 0.000 0.478 0.012 0.225 0.070 0.042 0.043 0.296 0.084
0.025 0.178 0.912 0.741 0.268 0.569 0.053 0.002 0.033 0.047 0.122]
They appear in the list in the same order as the declarations of the used
metrics, which can also be obtained from the hidden method _metrics():
[’f1’, ’f1v’, ’f2’, ’f3’, ’f4’, ’l1’, ’l2’, ’l3’, ’n1’, ’n2’, ’n3’,
’n4’, ’t1’, ’lsc’, ’density’, ’clsCoef’, ’hubs’, ’t2’, ’t3’, ’t4’,
’c1’, ’c2’]
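Since the two lists share an order, they can be paired for convenient inspection (an illustrative one-liner, not a library utility):
# Map measure names to their computed values
named_complexities = dict(zip(cc._metrics(), cc.complexity))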
The problem difficulty score can also be obtained, using the score() method, as a single scalar value: the arithmetic mean of all measures used in the calculation.
The problexity module, in addition to raw data output, also provides two
standard representations of problem analysis. The first is a report in the form
of a dictionary presenting the number of patterns (n_samples), attributes
(n_features), classes (classes), their prior distribution (prior_probability),
average metric (score) and all member metrics (complexities), which can
be obtained using the model’s report() method:
{
    'n_samples': 569,
    'n_features': 30,
    'n_classes': 2,
    'classes': array([0, 1]),
    'prior_probability': array([0.373, 0.627]),
    'score': 0.214,
    'complexities': {
        'f1': 0.227, 'f1v': 0.064, 'f2': 0.001, 'f3': 0.478, 'f4': ...,
        'l1': 0.433, 'l2': 0.069, 'l3': 0.049, 'n1': 0.043, 'n2': ...,
        'n3': 0.084, 'n4': 0.039, 't1': 0.178, 't2': 0.053, 't3': ...,
        't4': 0.033, 'c1': 0.047, 'c2': 0.122,
        'lsc': 0.912, 'density': 0.741, 'clsCoef': 0.268, 'hubs': 0.569
    }
}
The second form of reporting is a graph which, in the polar projection, collates all metrics, grouped into categories using color codes:
red: feature-based measures,
orange: linearity measures,
yellow: neighborhood measures,
green: network measures,
teal: dimensionality measures,
blue: class imbalance measures.
Each problem difficulty category occupies the same graph area, mean-
ing that contexts that are less numerous in metrics (class imbalance) are
not dominated in this presentation by categories described by many met-
rics (neighborhood). The illustration is built with the standard tools of the
matplotlib module as a subplot of a figure and can be generated with the
following source code:
# Import matplotlib
import matplotlib.pyplot as plt

# Prepare figure
fig = plt.figure(figsize=(7, 7))

# Generate plot describing the dataset
cc.plot(fig, (1, 1, 1))
An example of a complexity graph is shown in Figure 2.
Figure 2: Exemplary complexity graph generated by the problexity module
3. Comparison with available modules
The juxtaposition in Table 1 presents a comparison of the available libraries for analyzing the difficulty of classification problems. The columns successively present the ECoL, DCoL, and problexity libraries. The rows show the categories of the complexity measures. The values in the cells indicate the number of available metrics compared to the number described in the publication by Lorena et al. [7].
Since DCoL was the earliest implemented library of the analyzed ones, it contains the fewest measures. The ECoL and problexity libraries are based on the same publication, so the number of measures in each category matches. The table also shows the availability of utilities offered by the libraries and basic information describing them. All of the compared packages offer the possibility of generating a report containing a summary of the values of selected measures. In addition, the problexity package includes a method for representing the selected measures as a single value, as well as a tool for presenting the measures graphically.
Table 1: Comparison of measures and utilities available in ECoL, DCoL and problexity

area                functionality     ECoL    DCoL    problexity
Measures            Feature-based     5/5     5/5     5/5
                    Linearity         3/3     3/3     3/3
                    Neighborhood      5/5     5/5     5/5
                    Network           3/3     0/3     3/3
                    Dimensionality    3/3     1/3     3/3
                    Class imbalance   2/2     0/2     2/2
Utility             Score                             X
                    Report            X       X       X
                    Plot                              X
Basic information   Language          R       C++     Python
                    Current version   0.3.0           0.3.2
4. Impact
In recent years, the complexity measures for classification problems have gained particular interest in the scientific community. Their most common use is the construction of meta-attributes (features describing data sets [16]) to automate the selection of processing flows, typical of the meta-learning topic [8]. An interesting trend here is, in particular, the construction of abstract representations of recognition tasks [9], which allow for an initial generalization of the problem under consideration [17] and significantly reduce the time necessary to select the optimal classification model for a given task [10].
An alternative field of use is the classification of difficult data, with particular emphasis on multidimensional discrete signals, for example in the form of multispectral and geospatial problems [12]. On the one hand, the measures of complexity allow for an agnostic estimation of the correct structure of the objects in the predictor's action space [18]. On the other hand, they are useful tools for filtering its noise [11]. The same agnostic characteristics show particular potential in the processing of imbalanced data [19], allowing the quality of a proposed resampling to be assessed without having to rebuild the recognition model [20].
As with imbalanced data, complexity measures also find application in processing data streams [13]. By demonstrating the relationship between the quality of the recognition model and some currently available measures, it is possible to use them as a proxy for the classifier, which allows for a significant reduction in the time spent reviewing available solutions when using any optimization methods [21].
The field of classification problem complexity is still very active, not only in the applications described above but also in the proposals of new measures that appear each year. The newly proposed measures allow the assessment of other processing contexts, such as category learning [22], rule-based dissociations [23], or lost points identification [24], to name a few.
5. Conclusions
This paper presents the problexity library for Python programming language.
The library contains measures for assessing the complexity of binary classifi-
cation problems. Twenty-two evaluation measures of the problem complexity
have been implemented in the following categories: feature-based, linearity,
neighborhood, network, dimensionality, and class imbalance. Additionally,
the library incorporates the ComplexityCalculator module, which provides
additional tools for analyzing classification data sets.
The library was created to fill the gap related to the lack of accessible
measures for assessing the complexity of the classification problem in Python.
Increasing the availability of complexity assessment methods and creating
a tool for exploring them will allow for a more detailed analysis of data
sets’ characteristics, potentially impacting the development of new machine
learning methods.
The current version of the package offers a set of measures adapted to binary classification datasets, which are the most frequent objective of machine learning applications. The intent for future versions is for the library to include measures adapted to multiclass problems. As pointed out in the impact section, new measures are continually being proposed in the field of data complexity evaluation. The package will be maintained to incorporate other measures of classification complexity and possibly extended with regression complexity evaluation measures. Further works will focus on adapting the library to analyze data streams in the context of data difficulties. Finally, the problexity library will continue to be used in research studies employing classification problem complexity.
Acknowledgments
This work was supported by the Polish National Science Centre under the grant No. 2019/35/B/ST6/0442 as well as by the statutory funds of the Department of Systems and Computer Networks, Faculty of Information and Communication Technology, Wroclaw University of Science and Technology.
References

[1] K. Stapor, P. Ksieniewicz, S. García, M. Woźniak, How to design the fair experimental classifier evaluation, Applied Soft Computing 104 (2021).
[2] F. Hoffmann, T. Bertram, R. Mikut, M. Reischl, O. Nelles, Benchmark-
ing in classification and regression, Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery 9 (5) (2019) e1318.
[3] J. M. Sotoca, J. Sánchez, R. A. Mollineda, A review of data complexity
measures and their applicability to pattern classification problems, Actas
del III Taller Nacional de Mineria de Datos y Aprendizaje. TAMIDA
(2005) 77–83.
[4] A. Fernández, S. García, M. Galar, R. C. Prati, B. Krawczyk, F. Herrera,
Learning from imbalanced data sets, Vol. 10, Springer, 2018.
[5] T. K. Ho, M. Basu, Complexity measures of supervised classification
problems, IEEE transactions on pattern analysis and machine intelli-
gence 24 (3) (2002) 289–300.
[6] A. A. Soofi, A. Awan, Classification techniques in machine learning:
applications and issues, Journal of Basic & Applied Sciences 13 (2017)
[7] A. C. Lorena, L. P. Garcia, J. Lehmann, M. C. Souto, T. K. Ho, How
complex is your classification problem? a survey on measuring classi-
fication complexity, ACM Computing Surveys (CSUR) 52 (5) (2019)
[8] J. Vanschoren, Meta-learning: A survey, arXiv preprint
[9] M. M. Meskhi, A. Rivolli, R. G. Mantovani, R. Vilalta,
Learning abstract task representations, in: I. Guyon, J. N. van Rijn,
S. Treguer, J. Vanschoren (Eds.), AAAI Workshop on Meta-Learning
and MetaDL Challenge, Vol. 140 of Proceedings of Machine Learning
Research, PMLR, 2021, pp. 127–137.
[10] E. Konuk, K. Smith, An empirical study of the relation between network
architecture and complexity, in: Proceedings of the IEEE/CVF Inter-
national Conference on Computer Vision (ICCV) Workshops, 2019.
[11] H. Guillon, C. F. Byrne, B. A. Lane, S. Sandoval Solis, G. B. Pasternack,
Machine learning predicts reach-scale channel types from coarse-scale
geospatial data in a large river basin, Water Resources Research 56 (3)
(2020) e2019WR026691.
[12] F. Branchaud-Charron, A. Achkar, P.-M. Jodoin, Spectral metric for
dataset complexity assessment, in: Proceedings of the IEEE/CVF Con-
ference on Computer Vision and Pattern Recognition (CVPR), 2019.
[13] M. Ellis, A. S. Bosman, A. P. Engelbrecht, Characterisation of envi-
ronment type and difficulty for streamed data classification problems,
Information Sciences 569 (2021) 615–649.
[14] A. Orriols-Puig, N. Macia, T. K. Ho, Documentation for the data com-
plexity library in c++, Universitat Ramon Llull, La Salle 196 (1-40)
(2010) 12.
[15] G. Nguyen, S. Dlugolinsky, M. Bobák, V. Tran, Á. López García, I. Heredia, P. Malík, L. Hluchý, Machine learning and deep learning frameworks and libraries for large-scale data mining: a survey, Artificial Intelligence Review 52 (1) (2019) 77–124.
[16] A. Rivolli, L. P. Garcia, C. Soares, J. Vanschoren, A. C. de Carvalho,
Characterizing classification datasets: a study of meta-features for meta-
learning, arXiv preprint arXiv:1808.10406.
[17] A. Rivolli, L. P. Garcia, C. Soares, J. Vanschoren, A. C. de Carvalho,
Meta-features for meta-learning, Knowledge-Based Systems 240 (2022)
[18] L. P. Garcia, A. C. de Carvalho, A. C. Lorena, Effect of label noise in
the complexity of classification problems, Neurocomputing 160 (2015)
[19] D. Lee, K. Kim, An efficient method to determine sample size in over-
sampling based on classification complexity for imbalanced data, Expert
Systems with Applications 184 (2021) 115442.
[20] V. H. Barella, L. P. Garcia, M. P. de Souto, A. C. Lorena, A. de Car-
valho, Data complexity measures for imbalanced classification tasks,
in: 2018 International Joint Conference on Neural Networks (IJCNN),
IEEE, 2018, pp. 1–8.
[21] Z. Cai, Y. Long, L. Shao, Classification complexity assessment for hyper-
parameter optimization, Pattern Recognition Letters 125 (2019) 396–
[22] L. A. Rosedahl, F. G. Ashby, A difficulty predictor for perceptual cate-
gory learning, Journal of Vision 19 (6) (2019) 20–20.
[23] F. G. Ashby, J. D. Smith, L. A. Rosedahl, Dissociations between rule-
based and information-integration categorization are not caused by dif-
ferences in task difficulty, Memory & cognition 48 (4) (2020) 541–552.
[24] C. Lancho, I. Martín de Diego, M. Cuesta, V. Aceña, J. M. Moguerza, A complexity measure for binary classification problems based on lost points, in: International Conference on Intelligent Data Engineering and Automated Learning, Springer, 2021, pp. 137–146.
Required Metadata
Current executable software version
Table 2: Code metadata (mandatory)

Nr.  Code metadata description                                         Please fill in this column
C1   Current code version                                              0.3.2
C2   Permanent link to code/repository used for this code version      https://...
C3   Legal Code License                                                GPL-3.0
C4   Code versioning system used                                       git
C5   Software code languages, tools, and services used                 Python
C6   Compilation requirements, operating environments & dependencies
C7   If available, link to developer documentation/manual              https://...
C8   Support email for questions
Achieving the best performance with many machine learning methods depends critically on model hyper-parameter optimization. However, this optimization which requires strong expertise is often a “black magic” especially on deep learning models. Currently, even some widely used classic methods such as random search, grid search and manual search have obtained success to some extent, whose evaluation on hyper-parameter optimization problems is still computationally expensive and unpractical. They have to face the same challenge about how to choose the initial set of trials from random hyper-parameter permutation and combination. In this paper, to develop these methods, we present a simple and efficient framework for improving the efficiency and accuracy of hyper-parameter optimization by combining classification complexity and hyper-parameter optimization. Through this framework, it can quickly choose an initial set of hyper-parameters for a new coming classification task, thus reducing the number of trials in terms of hyper-parameter space. Results of six real-world datasets on three representative deep learning models demonstrate that the initial hyper-parameters set which is provided by our framework can make good performance, while plays a significant and efficient role on hyper-parameter optimization.