Autorank: A Python package for automated ranking of classifiers
Steen Herbold1
1Institute for Computer Science, University of Goettingen, Germany
DOI: 10.21105/joss.02173
Editor: Arfon Smith
Submitted: 17 February 2020
Published: 14 April 2020
Authors of papers retain copyright and release the work under a Creative Commons Attribution 4.0 International License (CC-BY).
Analyses to determine differences in the central tendency, e.g., mean or median values, are an
important application of statistics. Often, such comparisons must be done with paired samples,
i.e., populations that are not independent of each other. This is, for example, required if the
performance of different machine learning algorithms should be compared on multiple data sets.
The performance measures on each data set are then the paired samples; the difference in
the central tendency can be used to rank the different algorithms. This problem is not new
and how such tests could be done was already described in the well-known article by Demšar (2006).
Regardless, the correct use of Demšar’s guidelines is hard for non-experts in statistics. The
distribution of the populations must be analyzed with the Shapiro-Wilk test for normality
and, depending on the normality, with Levene’s test or Bartlett’s test for the homogeneity of the
data. Based on the results and the number of populations, researchers must decide whether the
paired t-test, Wilcoxon’s signed rank test, repeated measures ANOVA with Tukey’s HSD as post-
hoc test, or Friedman’s test with Nemenyi’s post-hoc test is the suitable statistical framework.
All this is already quite complex. Additionally, researchers must adjust the significance level
due to the number of tests to achieve the desired family-wise significance and control the
false-positive rate of the test results.
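The test selection part of this procedure can be sketched with SciPy’s statistical tests. This is only an illustration of the selection logic, not Autorank’s actual implementation; the helper name `choose_test` is ours, and the post-hoc tests as well as the adjustment of the significance level are omitted for brevity.

```python
# Minimal sketch of the manual omnibus test selection that Autorank automates.
from scipy import stats

def choose_test(samples, alpha=0.05):
    # Shapiro-Wilk test for the normality of each population.
    all_normal = all(stats.shapiro(s).pvalue >= alpha for s in samples)
    # Homogeneity of the data: Bartlett's test if normal, Levene's test otherwise.
    if all_normal:
        homogeneous = stats.bartlett(*samples).pvalue >= alpha
    else:
        homogeneous = stats.levene(*samples).pvalue >= alpha
    # Select the suitable test based on normality and the number of populations.
    if len(samples) == 2:
        return "paired t-test" if all_normal else "Wilcoxon signed rank test"
    if all_normal and homogeneous:
        return "repeated measures ANOVA with Tukey's HSD"
    return "Friedman test with Nemenyi post-hoc test"
```

Even this simplified sketch shows how many decisions must be made correctly before any result can be reported.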
Moreover, there are important aspects that go beyond Demšar’s guidelines regarding best
practices for the reporting of statistical results. Good reporting of the results goes beyond
simply stating the significance of findings. Additional aspects also matter, e.g., effect sizes,
confidence intervals, and the decision whether it is appropriate to report the mean value and
standard deviation, or whether the median value and the median absolute deviation are better suited.
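For example, the robust alternatives to mean and standard deviation can be computed directly with NumPy and SciPy; the accuracy values below are purely illustrative.

```python
# Median (MD) and median absolute deviation (MAD) as robust
# alternatives to mean and standard deviation for non-normal data.
import numpy as np
from scipy.stats import median_abs_deviation

accuracies = np.array([0.88, 0.85, 0.91, 0.79, 0.86, 0.90])
md = np.median(accuracies)               # median (MD)
mad = median_abs_deviation(accuracies)   # median absolute deviation (MAD)
print(md, mad)
```

Choosing between these and the parametric measures is exactly the kind of decision Autorank makes based on the normality of the data.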
The goal of Autorank is to simplify the statistical analysis for non-experts. Autorank takes
care of all of the above with a single function call. This is the difference between Autorank and
other packages, like SciPy (Virtanen et al., 2020), which expect users to know which tests to
use and how to interpret the results. The decision flow of Autorank is as follows.
Herbold, S., (2020). Autorank: A Python package for automated ranking of classifiers. Journal of Open Source Software, 5(48), 2173.
Figure 1: Decision Flow
Additional functions allow the generation of appropriate plots, result tables, and even of a
complete LaTeX document. All that is required is that the data about the populations is in a
dataframe. We believe that Autorank can help to avoid common statistical errors, such as the use
of inappropriate statistical tests, reporting of parametric measures instead of non-parametric
measures in case of non-normal data, and incomplete reporting in general.
Using Autorank
In our research, we recently used Autorank to compare differences between data generation
methods for defect prediction research (Herbold, Trautsch, & Trautsch, 2020). In general,
Autorank can be used anywhere where different classifiers are compared on multiple data
sets. The results must only be prepared as a dataframe. For example, the dataframe could
contain the accuracy of classifiers trained on different data sets. The following three lines
would then perform the statistical tests, generate the textual description of the results, as well
as the plot of the results.
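For instance, such a dataframe could be constructed as follows; the classifier names and accuracy values here are synthetic, purely for illustration.

```python
# Build an input dataframe: one column per classifier,
# one row per data set, each cell holding e.g. the accuracy.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
classifiers = ["Naive Bayes", "Random Forest", "Linear SVM"]
data = pd.DataFrame({clf: rng.uniform(0.6, 0.95, 20) for clf in classifiers})
print(data.shape)  # (20, 3)
```

With such a dataframe, the calls below run the complete analysis.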
> from autorank import autorank, create_report, plot_stats
> results = autorank(data)
> create_report(results)
The statistical analysis was conducted for 6 populations with 20 paired samples.
The family-wise significance level of the tests is alpha=0.050.
We rejected the null hypothesis that the population is normal for the
population Random Forest (p=0.000). Therefore, we assume that not all
populations are normal.
Because we have more than two populations and one of
them is not normal, we use the non-parametric Friedman test as omnibus
test to determine if there are any significant differences between the
median values of the populations. We use the post-hoc Nemenyi test to
infer which differences are significant. We report the median (MD), the
median absolute deviation (MAD) and the mean rank (MR) among all
populations over the samples. Differences between populations are
significant, if the difference of the mean rank is greater than the
critical distance CD=1.686 of the Nemenyi test.
We reject the null hypothesis (p=0.000) of the Friedman test that there
is no difference in the central tendency of the populations Naive Bayes
(MD=0.875+-0.065, MAD=0.053, MR=2.750), Random Forest (MD=0.850+-0.100,
MAD=0.062, MR=2.850), RBF SVM (MD=0.885+-0.217, MAD=0.059, MR=2.900),
Neural Net (MD=0.876+-0.070, MAD=0.045, MR=3.300), Decision Tree
(MD=0.810+-0.173, MAD=0.074, MR=4.525), and Linear SVM (MD=0.710+-0.245,
MAD=0.253, MR=4.675). Therefore, we assume that there is a statistically
significant difference between the median values of the populations.
Based on the post-hoc Nemenyi test, we assume that there are no significant
differences within the following groups: Naive Bayes, Random Forest, RBF
SVM, and Neural Net; Random Forest, RBF SVM, Neural Net, and Decision
Tree; Neural Net, Decision Tree, and Linear SVM. All other differences
are significant.
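The reported critical distance can be reproduced from Demšar’s formula CD = q_α · sqrt(k(k+1)/(6N)); the following sketch uses the critical value q_0.05 for k = 6 populations from Demšar (2006).

```python
# Verify the critical distance of the Nemenyi test for
# k = 6 populations and N = 20 paired samples.
import math

k, n = 6, 20
q_alpha = 2.850  # critical value q_0.05 for k = 6 (Demšar, 2006)
cd = q_alpha * math.sqrt(k * (k + 1) / (6 * n))
print(round(cd, 3))  # 1.686
```

This matches the value CD=1.686 stated in the report above.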
Figure 2: CD Diagram
This work is partially funded by DFG Grant 402774445.
References

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. J. Mach.
Learn. Res., 7, 1–30.
Herbold, S., Trautsch, A., & Trautsch, F. (2020). Issues with SZZ: An empirical assessment
of the state of practice of defect prediction data collection. Retrieved from http://arxiv.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D.,
Burovski, E., et al. (2020). SciPy 1.0: Fundamental Algorithms for Scientific Computing
in Python. Nature Methods, 17, 261–272. doi:10.1038/s41592-019-0686-2