Preprint

On the intrinsic robustness to noise of some leading classifiers and symmetric loss function -- an empirical evaluation


Abstract

Paper accepted at the Workshop "Data Quality Assessment for Machine Learning" (DQAML), SIGKDD 2021 --- In some industrial applications such as fraud detection, common supervision techniques may not be efficient because they rely on the quality of labels. In concrete cases, these labels may be weak in quantity, quality or trustworthiness. We propose a benchmark to evaluate the natural robustness of various algorithms, taken from various paradigms, on artificially corrupted datasets, with a focus on noisy labels. This paper studies the intrinsic robustness of some leading classifiers and, building on results from recent literature, how symmetric loss functions may help.
On the intrinsic robustness to noise of some leading classifiers
and symmetric loss function - an empirical evaluation
Hugo Le Baher
Orange Labs
Lannion, France
Vincent Lemaire
Orange Labs
Lannion, France
Romain Trinquart
Orange Labs
Lannion, France
In some industrial applications such as fraud detection, the performance of common supervision techniques may be affected by the poor quality of the available labels: in actual operational use-cases, these labels may be weak in quantity, quality or trustworthiness. We propose a benchmark to evaluate the natural robustness of different algorithms taken from various paradigms on artificially corrupted datasets, with a focus on noisy labels. This paper studies the intrinsic robustness of some leading classifiers. The algorithms under scrutiny include SVM, logistic regression, random forests, XGBoost, Khiops. Furthermore, building on results from recent literature, the study is supplemented with an investigation into the opportunity to enhance some algorithms with symmetric loss functions.

CCS Concepts: Computing methodologies → Supervised learning by classification; Classification and regression trees.

Keywords: robustness, label noise, supervised classification, tabular data
In recent years, there has been a surge for businesses to turn to ML solutions, especially supervised classifiers, as a solution for automation and hence scaling. In this paradigm, what was previously tackled with hard-coded expertise is now supposedly discovered and exploited automatically by learning algorithms. Alas, the path to ML starts with a strong prerequisite: the availability of labeled data. In some domains, the collection of these labels is a costly process, if not an impossible one. Classifiers are then trained with imperfect labels or proxies. Even in such an adverse context, the performance of the classifiers may still deliver added business value, such as filtering events and relieving the human operator in a monitoring scenario. But there are some application domains where the high financial stakes push for controlling the effect that noisy labels may have on classification performance. Fraud detection is one of these domains, especially for large companies where even a small fraction of fraudulent activities may yield important losses. As a motivation for the present work, let us consider one specific realm for fraudsters: the wholesale markets in Telecommunication.
Telecommunication companies use a variety of international routes to send traffic to each other across different countries. In a "wholesale market", telecom carriers can obtain traffic to make up a shortfall, or send traffic on other routes, by trading with other carriers in the wholesale or carrier-to-carrier market. Minutes exchanges allow carriers to buy and sell terminations. Prices in the wholesale market can change on a daily or weekly basis. A carrier will look for a least-cost routing function to optimize its trading on the wholesale market. The quality of routes on the wholesale market can also vary, as the traffic may be going on a grey route.

(Both authors contributed equally to this research.)
A value chain exists between the operators that provide the connection between two customers. But this value chain can be broken if a fraudster finds a way to generate communication without paying. For the last decades, fraud has been a growing concern in the telecommunication industry. In a 2017 survey [ ], the CFCA estimates the annual global fraud loss at $30 billion (USD). Therefore, detecting and, when possible, preventing fraud is paramount in this domain. Regarding the wholesale market, a list of frauds is known [ ] by the operators, and fraud detection platforms already exist.
One of these platforms has been realized by Orange (as a wholesale operator). This platform contains modules which exploit the information given by the expert (the scoring module, for example, which is a classifier); others explore the data to interact with the expert(s) by finding new patterns, including malevolent ones. This goal is achieved by knowledge discovery and active learning techniques. Those exploration modules [ ] are responsible for adapting to the constant evolution of fraudsters' behaviors under (in the case of this platform) the constraint of the limited time that experts can afford to spend in this exploration. This platform shares similarities with the one presented by Veeramachaneni et al. in [ ]. Both platforms combine a supervised model for predictions with unsupervised models for the exploration of unknown patterns, and take into account the user feedback in the learning phase.
In this paper, we are interested in the inspection of the label noise which is natively incorporated in this kind of platform. Here the noise comes from an "inaccurate supervision" [ ]. The values used as a target for learning, also called labels, could be wrong, due to at least two factors: the human annotators could make errors, and the fraud could drift over time, so that previous normal behaviors could contain not-yet-detected fraud behaviors. These errors will be referred to as noise or corruption.
In this section, we delineate the scope of our empirical evaluation. First we provide basic definitions and notations on binary classification. Then we discuss how the task of classification can be affected by various types of noise.
The goal of binary classification is to learn a model from a limited sample of observations, in order to predict an associated class or label. Such a technique is based on the hypothesis that rules, patterns or associations learned from a representative subset of individuals, identified as the training set, can be reused on new data from a similar source, denoted the testing set.
In this article, an individual observation will refer to a vector of attributes x. The observations are organised in a matrix X ∈ X^n. Each observation x has an associated class y; those associated classes (or labels) are themselves organised into a vector Y whose domain is Y, with Y = {−1, +1}. The aim is to model the best relationship between attributes and the associated class, defined as a couple (x, y) ∈ (X, Y). The goal of the model is to find a classification function built from observed examples: f : X → Y.
In theory, the dataset used for training is supposed to represent a subset of the ground truth. However, data in real-world applications rarely correspond perfectly to reality. As defined in [ ], noise is "anything that obscures the relationship between the features of an instance and its class". According to this definition, every error or imprecision in the labels is considered as noise in this paper.
2.1 Impact of Label Noise
When label noise occurs, i.e. a degradation of the classes of the learning examples, it is no surprise that a significant decrease of the performance has been widely observed and studied. The impact on the learning process could be due to some form of overfitting: the affected models focus too much on corrupted individuals and have trouble generating rules or relations that would generalize properly on new data. For example, in [ ], Adaboost shows overfitting behavior in the presence of label noise, due to wrong labels having a strong influence on the decision boundary. But in some situations, these corrupted labels can offer a new artificial diversity in the original dataset which prevents existing overfitting. For example, in [ ], randomized targets as a technique is compared to bagging, with better results. Besides the impact on performance, other problems may arise. These issues may look less obvious but should still be taken into consideration: (i) interpretability, (ii) statistical significance tests, (iii) feature ranking, etc., which are important in fraud detection.
2.2 Type of Noise
The classifier is a function which models the relationship between input features and an output qualitative variable. Only label noise (i.e. only the labels are corrupted) is considered in this paper. Label noise is a stochastic process which consists in a misclassification of the individuals. Considering (x, ỹ) ∈ (X, Ỹ), with ỹ describing an example with measured attributes and an observed label, the following probabilities apply:

∀(x, ỹ) ∈ (X, Y): P(ỹ = −1 | y = +1) = ρ₁, P(ỹ = +1 | y = −1) = ρ₋₁
In other terms, the observed label randomly takes one value among the other possible values, according to two parameters ρ₁ and ρ₋₁, the noise parameters. Since only binary classification will be considered later in the paper (Y = {−1, +1}), introducing noise on the labels consists in swapping the observed value to the other possible one: positive becomes negative, negative becomes positive.
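As a concrete illustration, the label-swapping process above can be sketched as follows (a minimal sketch with NumPy; the function name is ours, not taken from the benchmark code):

```python
import numpy as np

def flip_labels(y, rho, seed=None):
    """Flip each label in {-1, +1} to the other value with probability rho
    (uniform, class-independent noise, i.e. rho_1 = rho_-1)."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    mask = rng.random(y.shape) < rho   # each example is corrupted independently
    return np.where(mask, -y, y)       # swap -1 <-> +1 on the selected examples

y = np.array([+1, -1, -1, +1, -1])
print(flip_labels(y, rho=0.0))  # rho = 0 leaves every label untouched
```

With rho = 1.0, every label is swapped; intermediate values corrupt the corresponding expected fraction of examples.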
2.3 Experimental settings and Noise
Frenay et al. [ ] offer a taxonomy regarding different settings of label noise. We discuss it briefly and refer to it in order to specify the environment our evaluation sets in, with controlled random settings. In their publication, the authors compare the context of label noise to the context of missing values, described in [ ]. The noise generation is characterized in terms of its distribution and the magnitude it depends on.
One can isolate 3 distinct cases where the labels are noised: Noisy Completely At Random (NCAR), Noisy At Random (NAR) and Noisy Not At Random (NNAR). In this paper, only NCAR will be studied in the experiments. This entails that, with ρ₋₁ and ρ₁ defined as the probability of noise insertion on negative and positive individuals respectively, ρ₋₁ = ρ₁. The same proportion of the positive class is noised as the noised proportion in the negative class. The noise is uniform.
2.4 Class Balance
The final element that describes our targeted task is the balance of the categories. In natural phenomena or datasets, labels are rarely perfectly balanced. For a binary labelling, the two classes may appear with almost 50 percent each, which would lead to fairly balanced datasets. However, some types of problems exhibit less balanced proportions, and sometimes even really imbalanced ones. Let us consider the example of fraud detection: one can reasonably imagine that in some system, the large majority of users have an acceptable behavior. Only a small minority would present signs of hostile actions.
This issue should be a particular focus that induces some appropriate benchmark design choices, as detailed hereafter:
2.4.1 Choice of Metrics. Metrics should be chosen appropriately [ ]. While most common metrics work well when imbalance is not an issue, some may be deeply affected when most of the individuals are from the same category. A simple example can illustrate this: a really unintelligent algorithm that would always predict the majority class would achieve an accuracy of 95% if the evaluation dataset is composed of only 5% of the minority class. If no proper attention is paid to the dataset or the built model, then this result could be considered as satisfactory.
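This pitfall is easy to reproduce numerically (a toy illustration with invented data, not taken from the benchmark):

```python
import numpy as np

# Hypothetical evaluation set: 95 negative examples, 5 positive ones.
y_true = np.array([-1] * 95 + [+1] * 5)
y_pred = np.full(100, -1)  # a degenerate model that always predicts the majority class

accuracy = np.mean(y_true == y_pred)
# Balanced accuracy: average of the per-class recalls instead of a global rate.
recalls = [np.mean(y_pred[y_true == c] == c) for c in (-1, +1)]
balanced_accuracy = np.mean(recalls)

print(accuracy)           # 0.95: looks satisfactory...
print(balanced_accuracy)  # 0.5: ...no better than random guessing
```

The balanced accuracy used later in Section 3.4.1 corrects exactly this blind spot.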
2.4.2 Imbalance of Categories. As studied in [ ], imbalanced and noisy data entail a decrease in performance. As this article concluded, the addition of noise yields a proportional drop in performance for every tested learner, even though a few of them did actually benefit from noise in terms of robustness. The study also proves that noise in the minority class is a lot more critical than noise in the majority class. Finally, the authors conclude that filters aiming to detect and correct potential mislabels before any supervised learning procedure look inappropriate in the case of skewed class distributions.
2.5 Current Solutions
In this section, we sketch out the existing solutions for coping with noisy labels. We follow the classification proposed by [ ] into 4 categories; we shall then focus on the latter two.
2.5.1 Manual Review. The first pragmatic way of tackling this issue is to manually review the labels with the goal of identifying corrupted ones. However, this task is very similar to manual labelling and it is very costly in time and human resources.

2.5.2 Automatic filtering and cleansing. In binary classification, once a corrupted label is detected, correction is easy: since only two values are available for labelling, the faulty label should be switched to the only other possible value. Alas, automating the detection of faulty labels is of similar complexity to anomaly detection, which is far from trivial. The interested reader may take a look at [25, 27, 28, 36, 38].
2.5.3 Robust Algorithms. A third way of coping with noise is to look for learning algorithms which embed strong resistance to skewed labels in datasets and produce classification models whose performance may be fairly robust to noise addition. It should be an output of our evaluation to identify such classifiers.
2.5.4 Changing the loss function for more robustness. The fourth and last approach is a variation on the previous one: instead of looking for classifiers that are robust to noise by design, one might tweak part of the algorithm to reach robustness. Recent studies (see [ ] or [ ]) have put a new emphasis on the research of more relevant loss functions: aiming at risk minimization in the presence of noisy labels, [ ] shows theoretically and experimentally that when the loss function satisfies a symmetry condition, it contributes to the robustness of the algorithm.
The concept of symmetry for loss functions already has multiple conflicting definitions. Here, we will be using the following. Considering a loss function L(f(x), y), where f(x) is the prediction of the model and y is the target label taken amongst Y, L is symmetric if

Σ_{y∈Y} L(f(x), y) = c

c being a constant. In a binary context where Y = {+1, −1}, we have: L(f(x), +1) + L(f(x), −1) = c.
In other terms, an algorithm will be penalized equally according to the relationship between the target y and the predicted value f(x), whether the target is positive or negative. Some implementations of algorithms offer multiple loss functions, or at least allow users to implement their own. In our study, a relevant goal is to know if using such losses is worth the change, in general or in the case of corrupted datasets.
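The symmetry condition is easy to verify numerically. The sketch below (our own illustration) contrasts the hinge loss with the unhinged loss L(f(x), y) = 1 − y·f(x), used later in the benchmark as XGB_UNHINGED: for the unhinged loss, the sum over both labels is the constant c = 2, while for the hinge loss it varies with the prediction.

```python
import numpy as np

def hinge(margin):     # L(f(x), y) = max(0, 1 - y*f(x)): not symmetric
    return np.maximum(0.0, 1.0 - margin)

def unhinged(margin):  # L(f(x), y) = 1 - y*f(x): symmetric
    return 1.0 - margin

for f_x in (-2.0, -0.5, 0.0, 1.0, 2.0):  # a few arbitrary prediction scores
    sum_unh = unhinged(+1 * f_x) + unhinged(-1 * f_x)  # sum over y in {+1, -1}
    sum_hin = hinge(+1 * f_x) + hinge(-1 * f_x)
    print(f"f(x)={f_x:+.1f}  unhinged sum={sum_unh:.1f}  hinge sum={sum_hin:.1f}")
```

The hinge sum equals 2 only while |f(x)| ≤ 1 and grows beyond that, so the hinge loss fails the condition.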
Note: In this paper, we are more interested in an evaluation of the robustness of some recent leading classifiers and symmetric loss functions; thus the manual review and automatic filtering / cleansing will not be studied.
2.6 Objectives and organisation of this paper
The objectives of this paper are twofold. First, we are interested in an empirical study of the performance of some of the leading classifiers, as well as the classifier used in the platform described in the introduction. Secondly, we are interested in assessing the performance of dedicated symmetric loss functions and hence their relevance to our platform for mitigating the effects of label noise. For our experiments, the following hypotheses will be considered: (i) If label noise has an impact on performance in general, some algorithms may be robust to it. (ii) The addition of noise over labels may entail a heterogeneous performance decrease over datasets, that is to say noise may entail a lesser performance decrease on some specific problems. (iii) Some simple tweaks in the algorithms might increase the robustness; the use of symmetrical losses, when available, is a promising option.

After the introduction (Section 1) and the description of the context above in this section, the rest of the paper is organized as follows: Section 3 describes the methodology used to design our benchmark, then Section 4 gives detailed results before a discussion / conclusion in the last section.
Following previously published evaluations such as [ ], we present the design of our empirical evaluation through the following facets of specification: (i) the overall protocol, (ii) the datasets used for the benchmark, (iii) the list of learning algorithms under evaluation, (iv) the procedures to generate the artificial noise, (v) the criteria for the final evaluation of the tests.
3.1 Protocol
In order to ensure that our evaluation delivers relevant results, the protocol must explore a large variety of options. First, it should integrate algorithms that are taken from different paradigms. Second, it should confront those algorithms with diverse datasets. Not only should these datasets differ in their features, they should exhibit a sufficient range of target distributions, from fairly balanced to heavily imbalanced ones. Finally, the number of run tests must be sufficient to draw trustworthy conclusions; this number should also be configurable so as to control computing time.

The protocol can be described as a pipeline, composed of many processes, from collecting datasets to results. Here are all the steps that have to be implemented, with associated adjustable parameters:
- Collection of publicly available datasets, whatever the format is (.csv, .json, .arff, ...).
- Preprocessing and standardization of the collected datasets, to be used in the following steps with a uniform code. This implies various transformations:
  - Interpreting the format of the dataset as tabular.
  - Filling missing values. Categorical and numerical columns are considered separately: for categorical columns, missing values are considered as their own separate value; for numeric columns, missing values are replaced by the average of the whole column. Note that such a process may penalize models that can handle missing values in their design, compared to others.
  - Selection of the relevant variables, according to the documentation furnished (or not) with the dataset.
  - Standardization of the labels into {−1: negative, +1: positive}.
  - Cleaning of the strings (values and columns).
- Splitting the datasets into K folds, repeated R times. Note that the folds are stratified, i.e. they respect the original class proportions in the splits.
- Applying noise on the splits. A random portion of the training labels is flipped, chosen according to a fixed parameter. These fixed parameters are called ρ. For the noise to be comparable across all datasets, the chosen proportion scales over the minority class percentage. If we have ρ = 0.5 and the prior is 0.25 (25% of the examples are from the minority class), then we would apply an effective label noise of 0.5 × 0.25 = 0.125 = 12.5%.
- Training the different algorithms on the corrupted training sets and evaluating them over the testing sets. Some algorithms may use different preprocessing than others.
- Computing the chosen metrics according to the predicted sets produced by the models.
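The noise-scaling step above can be sketched as follows (a minimal illustration; the function name is ours):

```python
import numpy as np

def effective_noise(rho, y):
    """Scale the nominal noise level rho by the minority-class prior,
    so that corruption is comparable across datasets."""
    p_pos = np.mean(np.asarray(y) == +1)
    prior = min(p_pos, 1.0 - p_pos)  # minority-class proportion
    return rho * prior

y = np.array([-1] * 75 + [+1] * 25)  # toy dataset with a 25% minority prior
print(effective_noise(0.5, y))       # 0.5 * 0.25 = 0.125, i.e. 12.5% of labels flipped
```

For the splitting step, scikit-learn's RepeatedStratifiedKFold (with n_splits=5 and n_repeats=5, matching the parameters recapped in Section 3.5) provides exactly the stratified K-fold repeated R times described above.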
At the end of this pipeline, (D × K × R × |ρ| × A × M) results will be produced: M metrics computed over the learning of A algorithms. The algorithms learn over D datasets, split K × R times. These splits are corrupted depending on a fixed parameter ρ that scales on the prior.
3.2 Datasets
In the introduction to this paper, we motivated our benchmark through the description of the operational fraud detection system we contribute to. This use-case may seem like a good candidate for providing datasets for evaluation purposes. Unfortunately, this source of data is sensitive and cannot be disclosed. Moreover, these data are afflicted by the exact plague that this study focuses upon: the labels are not fully known and hence the actual performance in classification cannot be measured. Instead, we chose to turn to datasets which are publicly available and commonly used in benchmarks; the target labels in those datasets can then be partially changed in a controlled manner. The datasets chosen and used in this paper are listed in Table 1. They come from NASA or UCI and have been used in [ ], [ ], or [ ]. These datasets are willingly more or less simplified versions of existing use-cases; they are close to our real application: tabular datasets with a medium number of explanatory variables which are mixed (categorical and numerical) and which are not made of images, music or language.
Name Num Cat N Min%
1 Trucks 170 0 76000 1.8
2 Bank 10 10 41188 3.3
3 PC1 22 0 1109 6.9
4 CM1 22 0 498 9.8
5 KC1 22 0 2105 15.4
6 KC3 40 0 194 18.6
7 JM1 17 5 10885 19.3
8 KC2 22 0 522 20.5
9 Adult 5 8 48842 23.9
10 Breast Cancer 9 0 699 34.4
11 Spambase 57 0 4601 39.4
12 Eye State 14 0 14980 44.8
13 Phishing 68 0 11055 44.3
14 Mushroom 0 22 8416 46.6
Table 1: Datasets used for the benchmark and some characteristics: name, number of numerical features (Num), number of categorical features (Cat), number of examples (N), percentage of the minority class (Min%).
NASA datasets were chosen because they are used in [ ] and because fault detection and fraud detection are quite related topics. However, the original versions linked in the article are not available anymore. Moreover, the authors use a cleansing process over the datasets that we are not able to reproduce, because of a lack of information. For these reasons, we will not be able to compare our results with the article.

With this choice of datasets, a large range of class balance is covered: Mushroom is balanced while Trucks is really imbalanced. Also, volumes vary widely in number of rows or columns, which leads to learning tasks of different difficulty.
3.3 Algorithms
The aim of this paper is to demonstrate the different levels of robustness to label noise that are achieved by various algorithm paradigms. In this subsection, we list the algorithms we have chosen as paragons. For each algorithm, we provide the motivations for our choice as well as the parameters and implementations used. Note that, given the datasets used in this benchmark are tabular, with mixed variables, we chose not to include deep learners. But the reader may find a recent survey focused on this learning paradigm in [ ]. An overall criterion for picking algorithms was interpretability, which is a key feature for experts in fraud detection.
3.3.1 Linear SVC. To serve as a baseline or starting point, a simple and common model has been chosen in the context of binary classification. The support vector machine classifier seems to fit these requirements. The goal of this algorithm is to find a maximum-margin hyperplane that provides the greatest separation between the classes [ ]. An interface to the Liblinear implementation [ ] is available in Scikit-learn ( Moreover, the Liblinear implementation allows scaling to large datasets.

For our experiments, the following parameters have been used: L2 penalty is applied, regularized at C = 1.0, with a squared hinge loss function. As advised in the documentation, in the case where the number of samples outnumbers the number of features, we prefer the primal optimization problem rather than the dual. We also want to reach convergence in as many cases as possible while finding a solution in a reasonable time: the number of iterations is fixed at 20000. This model is expected to have poor results on corrupted labels without cleansing [14].
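As far as we can tell, the configuration described above corresponds to the following scikit-learn call (a sketch; the toy data and variable names are ours):

```python
from sklearn.svm import LinearSVC

# L2 penalty, C = 1.0, squared hinge loss, primal problem, 20000 iterations,
# matching the parameters described in the text.
clf = LinearSVC(penalty="l2", C=1.0, loss="squared_hinge", dual=False, max_iter=20000)

# Toy sanity check on a linearly separable problem.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [-1, -1, +1, +1]
clf.fit(X, y)
print(clf.predict([[0.5], [2.5]]))  # expected: one negative, one positive
```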
3.3.2 Logistic Regression (LR). Adding another linear model to the experiments may confirm (or not) the expected results. Logistic Regression is expected to be affected in the same way as Linear SVC [ ]. This is confirmed by a review of the use of unsymmetrical losses on corrupted labels: this problem can indeed be expressed as an optimization using a logistic loss, which does not satisfy the symmetry condition. Using this algorithm also serves the purpose of showing the faults of such loss functions. The implementation used is also from Scikit-learn, with parameters similar to SVC.
(Footnotes: Only a portion have been backed up and made available here: The code of all the experiments, and the datasets, are available at: Even though there have been a lot of recent tools, like saliency maps and activation differences, that work great for some domains, they do not transfer completely to all applications. It is still difficult to interpret per-feature importance to the overall decision of the deep net.)
The lbfgs solver is used with the only options available for it: L2 regularization with C = 1.0 and primal optimization. The number of iterations is also set at 20000.
3.3.3 Random Forests (RF). An option shown to outperform other approaches when dealing with label noise is the use of ensemble methods, or Bagging [ ]. This kind of method was developed mostly to upgrade the performance of existing models and prevent overfitting. The principle is quite simple: multiple versions of a learner are trained on different samples and the outcome is a vote [ ]. This process works well when learners are naturally unstable. For this reason Random Forests [7] are often privileged.

To multiply the tests, two different implementations of Random Forests were used: one from Scikit-learn, one from Weka ( Parameters were chosen to be similar in both implementations. The forest is composed of 100 trees. The trees have an unlimited depth. In Weka, the minimum number of instances per leaf equals 1, while in Scikit-learn the minimum amount to authorize a split is 2, which leads to the same idea. According to [15], good results can be expected on the noised datasets.
3.3.4 Khiops. This is a tool developed and used in Orange Labs. The software, named Khiops, is available online. This is also the classifier used in the platform described in Section 1. To summarize briefly, for supervised classification it contains a Selective Naive Bayes (SNB): i.e. feature selection and feature weighting are used.
Various methods of feature selection have been proposed [ ] to focus the description of the examples on supposedly relevant features. For instance, heuristics for adding and removing features can be used to select the best features using a wrapper approach [ ]. One way to average a large number of selective naive Bayes classifiers obtained with different subsets of features is to use one model only, but with feature weighting [ ]. The Bayes formula, under the hypothesis of feature independence conditionally to the classes, becomes:

P(y | x) ∝ P(y) · Π_i P(x_i | y)^(w_i)

where w_i represents the weight of the feature i, x_i is component i of the vector x, and y is the class label. The predicted class is the one that maximizes the conditional probability P(y | x). The probabilities P(x_i | y) are estimated by interval using a discretization for continuous features. For categorical features, this estimation can be done directly if the feature has few different modalities; otherwise, grouping into modalities is used [ ]. The computation of the weights of the SNB, as well as the discretization of the numerical variables and the grouping of modalities of categorical variables, are done with the use of data-dependent priors and a model selection approach [ ] which has no recourse to cross-validation. As shown in [ ], good results can be expected on the noised datasets since this kind of method is auto-regularized.
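A numerical illustration of this weighted naive Bayes rule (all numbers below are invented for illustration; this is not Khiops code):

```python
import numpy as np

# P(y | x) ∝ P(y) * Π_i P(x_i | y)^(w_i), for one observation with two features.
priors = {+1: 0.25, -1: 0.75}          # class priors P(y)
cond = {+1: [0.6, 0.7],                # P(x_i | y) for the observed feature values
        -1: [0.2, 0.4]}
weights = [1.0, 0.5]                   # feature weights w_i learned by the SNB

scores = {y: priors[y] * np.prod([p ** w for p, w in zip(cond[y], weights)])
          for y in (+1, -1)}
total = sum(scores.values())
posterior = {y: s / total for y, s in scores.items()}

print(max(posterior, key=posterior.get))  # predicted class: argmax of P(y | x)
```

With a weight of 0, a feature drops out of the product entirely, which is how feature selection and weighting interact in this formulation.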
An interface, available in Python and developed to be uniform with Scikit-learn, was used for the benchmark. For this use of Khiops, no parameters are required. Another additional model, referred to as "KhiopsRF" in the results analysis, has also been tested: for this variant, Khiops builds some supplementary input variables thanks to Random Trees (100 trees) [41]. (An evaluation license can be obtained by everyone, for free and for two months, without any technical limitations.)
3.3.5 XGBoost. One of the solutions that needed to be tested with high priority is the use of a symmetrical loss, as explained in Section 2.5.4. Hence, an implementation that allows the use of a symmetrical loss, or at least of a custom loss, is needed. The first case seems to be quite rare, so we made the choice to use XGBoost, which allows us to implement our own losses (objective functions, in XGBoost terms) [10].

Boosting is similar to bagging in that it combines many simple learners to get a single decision. However, in bagging, each learner is trained at the same time and independently, while in boosting we learn model after model, and each model serves as a baseline for the next one. Other boosting methods like Adaboost have been shown to have really good results overall, but less so than Bagging when label noise occurs [ ]. This algorithm has been chosen mostly for the comparison of loss functions.
Other than the choice of loss function, here are the parameter choices. Eta, or the learning rate, shrinks the feature weights used in the next step to prevent overfitting. min_child_weight corresponds to the minimum sum of instance weights needed in a child to continue partitioning; it also corresponds to the sum of the hessian. In some contexts, like linear regression, the sum of the hessian corresponds to the number of instances. With the symmetrical losses that we use, the hessian is always null, so this parameter needs to be null as well if we want to trigger the learning. Lambda is the L2 regularization parameter. 100 estimators are built.
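As an illustration of why the hessian is null, here is a sketch of how the unhinged loss L(f, y) = 1 − y·f could be passed to XGBoost as a custom objective (the function names are ours; labels are assumed to be mapped to {−1, +1}):

```python
import numpy as np

def unhinged_grad_hess(preds, labels):
    """Gradient and hessian of the unhinged loss L(f, y) = 1 - y*f,
    in the (grad, hess) form expected by an XGBoost custom objective."""
    grad = -labels               # dL/df = -y
    hess = np.zeros_like(preds)  # the loss is linear in f, so d2L/df2 = 0
    return grad, hess

def unhinged_obj(preds, dtrain):
    """Wrapper with the XGBoost objective signature; dtrain stores {0, 1} labels."""
    labels = dtrain.get_label() * 2.0 - 1.0  # map {0, 1} to {-1, +1}
    return unhinged_grad_hess(preds, labels)
```

This would be passed as xgboost.train(params, dtrain, obj=unhinged_obj); since the hessian is identically zero, min_child_weight must be 0 for any split to be accepted, as discussed above.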
3.4 Evaluation
To evaluate the quality of the results of all classifiers, a comparison between the predicted targets and the ground truth is needed, both on the training and the testing set. The score obtained on the training set helps to understand if the algorithms learn effectively. The testing score evaluates how much the model manages to generalize its results on new data. Since our datasets range from balanced to very unbalanced, three appropriate metrics will be considered:
3.4.1 Balanced Accuracy. A good alternative to accuracy in our imbalanced scenario is the balanced accuracy, also called "bacc" for short. To compute this metric, the accuracy is computed with each individual weighted according to the balance of its class. Thanks to this, errors on examples taken from the minority class are emphasized and thus not ignored. Balanced accuracy is defined on [0, +1], where +1 is the perfect score.
3.4.2 Area Under ROC Curve. Already defined in the review of [ ], this metric is relevant because it shows the trade-off between detection and false alarm rates. The AUC is defined on [0, +1], where +1 is the perfect score.
3.4.3 Cohen's Kappa. The last metric we have chosen is the Cohen's Kappa ([ ]). It is a statistic that measures agreement between two annotators. Here, we can consider the original target values and the predicted ones as assigned by two different annotators. If p_o is considered as the observed agreement ratio and p_e the expected agreement probability if the labeling were randomly chosen by both annotators, then: Kappa = (p_o − p_e) / (1 − p_e). The Cohen's Kappa is defined on [−1, +1], where +1 is the perfect score and a negative or null value means a random classification.
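A minimal implementation of this formula (our own sketch for binary labels in {−1, +1}; scikit-learn's cohen_kappa_score provides the same statistic):

```python
import numpy as np

def cohen_kappa(y_true, y_pred):
    """Kappa = (p_o - p_e) / (1 - p_e) for binary labels in {-1, +1}."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    p_o = np.mean(y_true == y_pred)  # observed agreement ratio
    # Expected agreement if both "annotators" labeled at random with their own marginals.
    p_e = sum(np.mean(y_true == c) * np.mean(y_pred == c) for c in (-1, +1))
    return (p_o - p_e) / (1.0 - p_e)

y_true = [+1, +1, -1, -1, -1, -1]
print(cohen_kappa(y_true, y_true))    # 1.0: perfect agreement
print(cohen_kappa(y_true, [-1] * 6))  # 0.0: always predicting the majority scores like chance
```

The second call makes the link with Section 2.4.1: a constant majority-class predictor gets a misleading accuracy but a kappa of zero.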
3.5 Summary: Parameters Chosen and Volume of Results
To conclude this section, here is a recap of all parameters used during the benchmark:
- 14 datasets: detailed in Subsection 3.2.
- 10 algorithms: Linear SVC, Logistic Regression, Scikit-Learn Random Forest (skRF), Weka Random Forest (wekaRF), Khiops, Khiops with Random Forest features (KhiopsRF), XGBoost with asymmetric loss functions: Hinge loss (XGB_HINGE) and Squared error loss (XGB_SQUERR); XGBoost with symmetric loss functions: Unhinged loss (XGB_UNHINGED) and Ramp loss (XGB_RAMP).
- 5-fold splits, repeated 5 times.
- 12 levels of noise.
- 3 metrics: AUC, Balanced Accuracy and Cohen's Kappa.

When running the full tests, one obtains 42,000 run models. Each of these models has been evaluated with 3 metrics, on both train and test. Below, only testing results will be used.
4.1 General Results
As stated at the end of the previous section, the benchmark
comprises 42,000 runs. To interpret results over such a volume,
looking closely at each individual result is not an option: it is
mandatory to aggregate the results. Considering the average results
by model and dataset is a sound approach to get an overview of the
algorithms’ performance. On the contrary, averaging over all values
of noise would defeat the purpose of the benchmark, since it would
mask the evolution of performances along with noise.
For the dataset breastcancer, which is almost balanced, one can
observe in Figure 1 that performances decrease along with noise
for all models (the bars that represent each level of noise are sorted
from left to right, from dark to lighter color). For spambase
(see Figure 2) the results are the same. For trucks (see Figure 3),
which is the largest dataset (in number of examples and explanatory
variables) and the most unbalanced, the results seem stable for all
models.
Note that the Y axis is not drawn on the full range of values: only
from half the range to the maximum values. The following observations
can be made on this case:
• Performances genuinely collapse at 𝜌 = 1.25.
• While the models achieve similar results without noise, khiops,
khiopsRF, linearSVC and logistic regression show only a really
small decrease, even at 𝜌 = 1.00.
As a reminder, the effective noise is scaled on the minority class. These levels have
been chosen to go progressively from 0 to 1: from no noise, to a level of noise equal
to the minority balance proportion. Above 1, no information should be available on
the minority class, so that performances should collapse. A level of noise of 1.25 is
added to validate this hypothesis. In this case, 𝜌 = 1.25 means an effective noise level
of 1.25 times the percentage of the minority class.
• Both symmetric Ramp and Unhinged losses are more robust
than Hinge and Squared Error with XGBoost.
• Both implementations of Random Forests have the same
results. Although this solution was considered a promising
option [ ], it seems that the presence of noise yields a
significant decrease in performance.
Due to space considerations we cannot show the detailed results
for all datasets, but they are all available here:
Hugoswnw/NoiseEvaluation for the AUC, the balanced accuracy,
as well as the Cohen’s Kappa (the plots of every dataset’s results are
available). This supplementary material also contains the retained
performance, which is discussed in the next subsection.
The following conclusions can be drawn from the supplementary
material: the results on the NASA datasets are really poor.
On KC1, KC3 and CM1 especially, the algorithms struggle to achieve
more than random guesses (half of the metric’s range) and they
exhibit a high variance. These datasets, being quite small and very
specific, must have features that are challenging for the models. The
collapse at 𝜌 = 1.25 is also clear with eyestate, mushroom and phishing.
4.2 Retained Performances - View 1: Averaged
on all datasets
To measure robustness only, a simple metric has been used to
emphasize this aspect. For each run with a given 𝜌, the result is
compared to the run done under the exact same conditions at 𝜌 = 0,
i.e. when there is no noise: the retained performance is
RP = Result_𝜌 / Result_{𝜌=0}.
Using this metric, the robustness on all datasets can be compared at
a glance (Figure 4).
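A minimal sketch of this ratio (the helper name and the illustrative values are ours; the Kappa shift follows the paper's footnote):

```python
def retained_performance(results, kappa=False):
    """results: dict mapping noise level rho -> metric value.
    Returns rho -> Result_rho / Result_{rho=0}. Cohen's Kappa is first
    shifted from [-1, +1] to [0, +2] so the ratio stays well defined."""
    shift = 1.0 if kappa else 0.0
    base = results[0.0] + shift
    return {rho: (v + shift) / base for rho, v in results.items()}

auc_by_rho = {0.0: 0.90, 0.5: 0.81, 1.0: 0.45}   # illustrative values
print(retained_performance(auc_by_rho))
# {0.0: 1.0, 0.5: 0.9, 1.0: 0.5} (up to float rounding)
```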
These results corroborate the following assertions:
• Performances collapse at 𝜌 = 1.25.
• Both symmetric Ramp and Unhinged losses are more robust
than Hinge and Squared Error with XGBoost.
• Both implementations of Random Forests yield the same
results, and their decrease in performance is significant.
This observation conflicts with the conclusions made in [15].
However, the retained-performance metric completely ignores
absolute performances; therefore the next section focuses on this point.
4.3 Retained Performances - View 2:
Algorithms compared with each other on
a dataset
An aspect of the results that we really care about is how the algorithms
behave compared with each other. The barplots in Figures 5 and
6 allow a clear view over the results, but it is far from trivial to
conclude whether algorithm A is better than B, and for which noise
values. Digging further into this idea, we can plot aggregated
results that focus on the evolution of the performances of all models
with respect to the noise level. Figure 5 and Figure 6 illustrate
the results for the datasets SpamBase and Adult. Results for all datasets
are available in the supplementary material.
Note that for these computations, Cohen’s Kappa must be scaled from
[−1; +1] to [0; +2] with a simple addition.
Figure 1: Results on Breastcancer averaged per metrics, algorithms and noise applied: AUC, Cohen’s Kappa and Bacc
Figure 2: Results on Spambase averaged per metrics, algorithms and noise applied: AUC, Cohen’s Kappa and Bacc
For clarity purposes, algorithms that belong to similar paradigms
are represented with the same color: Khiops and KhiopsRF in orange,
Weka and Scikit-Learn Random Forests in blue, SVC and
logistic regression in green, XGBoost versions in red. Since the
level 𝜌 = 1.25 entails different behaviours which are not representative,
only the values between 0 and 1 are kept.
From Figures 5 and 6, as well as from the others available in the
supplementary material, we can make the following observations
(some of which reinforce our previous conclusions):
• Random Forests achieve very good results overall when there
is no noise. However, they become among the worst when noise
occurs. This disagrees with the results in [14].
• Khiops and KhiopsRF achieve very good results overall and
remain quite stable when noise occurs.
• Logistic Regression and Linear SVC show surprisingly good
results overall and a “small” sensitivity to label corruption.
• Symmetric loss functions in XGBoost appear stable indeed.
However, they are still outperformed by a large margin by
asymmetric losses and other models in most experiments. They
appear to be worth using only for very high noise levels
Figure 3: Results on Trucks averaged per metrics, algorithms and noise applied: AUC, Cohen’s Kappa and Bacc
Figure 4: Retained Performance averaged on all datasets: AUC, Cohen’s Kappa and Bacc
(For Phishing and Spambase, they outperform asymmetric
losses only after 𝜌=0.66).
Note on symmetric loss functions - First, it should be recalled
that the results presented in papers such as [ ] are asymptotic results,
valid as 𝑁 → ∞, which is not often the case in real situations. If we
consider only the datasets where the XGBoost algorithm equipped with
the symmetric unhinged loss function (XGB_UNHINGED) has good
results (i.e. where its AUC performance is above 0.7), a link does exist
between the size of the training set and the retained performance,
but that is also the case for other algorithms such as Linear
SVC or Khiops. Moreover, even if XGB_UNHINGED is more stable than
XGB_SQUERR, the performances of XGB_SQUERR for high values of 𝜌
are better than those of XGB_UNHINGED. XGB_UNHINGED’s
performances are also often low on these datasets (whatever the
value of 𝜌), as well as on the other datasets (where its AUC performance
is below 0.7, mainly when the percentage of the minority class is low).
Secondly, it appears that loss functions are not universal: they
should be tailored to the learning algorithm. There are loss functions
adapted to regression, classification, etc. Even though Unhinged and
Ramp losses are symmetric, they do not seem suited to
XGBoost for binary classification in our empirical study. It has also
been reported that performances with such losses are significantly
affected by noisy labels [ ]. Such implementations perform well
Figure 5: Impact on performances along with noise addition on spambase: AUC versus 𝜌.
Figure 6: Impact on performances along with noise addition on adult: AUC versus 𝜌.
only in simple cases, when learning is easy or the number of classes
is small. Moreover, the modification of the loss function increases
the training time needed for convergence [42].
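As an illustration of how such a loss plugs into XGBoost, here is a hedged sketch of a custom objective for the unhinged loss L(y, f) = 1 − y·f with y ∈ {−1, +1}. XGBoost's custom-objective API expects per-example gradients and Hessians; the signature below is simplified to take labels directly, and the tiny constant Hessian is our workaround since the true second derivative is zero:

```python
import numpy as np

def unhinged_objective(preds, labels01):
    """Gradient/Hessian of the unhinged loss L(y, f) = 1 - y*f,
    with labels mapped from {0, 1} to {-1, +1}."""
    y = 2.0 * np.asarray(labels01, dtype=float) - 1.0
    grad = -y                         # dL/df = -y, independent of preds
    hess = np.full_like(grad, 1e-6)   # true Hessian is 0; a small constant
                                      # keeps the Newton step finite
    return grad, hess

def unhinged_loss(y, f):
    return 1.0 - y * f

# The symmetry property behind the robustness results:
# L(+1, f) + L(-1, f) is constant in f.
f = np.linspace(-3.0, 3.0, 7)
print(np.allclose(unhinged_loss(1.0, f) + unhinged_loss(-1.0, f), 2.0))  # True
```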
This paper presents an extensive benchmark dedicated to binary
classification problems, testing the ability of various learning
algorithms to cope with noisy labels. Through the example of
operational fraud detection, the introduction shows that
such an ability is crucial for automatic classifiers to be adopted in
"real-life" applications.
The article also has tutorial value in some parts and provides
additional results to [ ], either in the datasets considered or in the
classifiers considered. The benchmark protocol covers a wide range
of settings, with diverse datasets and a variety of target class
balances (or imbalances). The algorithms under scrutiny include SVM,
logistic regression, random forests, XGBoost, and Khiops. Furthermore,
the study is supplemented with an investigation into the opportunity
to enhance some algorithms with symmetric loss functions.
For the benchmark’s conclusions to be as meaningful as possible,
the set of use-cases and parameters to test yields a very large set of
results to be synthesized. The motivation for picking a few metrics
is explained, and the aggregated results are then discussed. The
conclusions can be summed up in the few sentences hereunder.
If the labels in the data available for model training are trusted
and there is no reason to believe that any corruption process happened,
then Random Forest is a good option. However, as soon as
the dataset is suspected not to be perfectly reliable and more
stability is required, Khiops looks like a better, safer solution.
Simple models such as SVM and Logistic Regression also seem to
provide trustworthy results in such a context. We also disagree
with the results in [ ], which consider random forests robust to
label noise.
The lead of symmetric losses still looks promising. In most of
our tests, their results remained quite stable even at high levels of
noise. However, the main issue that would have to be solved is their
underwhelming performance in low-noise contexts, as discussed
in the note of the previous section.
In spite of the efforts to make the test protocol as complete as
possible, some design choices were made to stay focused, which leaves
room for further investigation. First, in the experiments, the
only type of noise considered for flipping labels is Noise Completely
At Random (NCAR). It would be interesting to pursue the benchmark
with NAR (Noise At Random) and NNAR (Noise Not At Random).
Second, the test protocol was designed to compare classifiers faced
with “raw” noisy data. Another possible setting would consist in
preparing the data with the application of automatic cleaning
operations such as [ ], and then measuring how the classifiers perform.
This would have been an added value to this empirical study, so
we consider it as future work.
Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A Training
Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual
Workshop on Computational Learning Theory. Association for Computing Machinery,
New York, NY, USA, 144–152.
M. Boullé. 2005. A Bayes optimal approach for partitioning the values of categor-
ical attributes. Journal of Machine Learning Research 6 (2005), 1431–1452.
Marc Boullé. 2006. MODL: a Bayes optimal discretization method for continuous
attributes. Machine Learning 65, 1 (2006), 131–165.
Marc Boullé. 2007. Compression-Based Averaging of Selective Naive Bayes
Classifiers. Journal of Machine Learning Research 8 (2007), 1659–1685.
Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (Aug. 1996).
Leo Breiman. 2000. Randomizing Outputs to Increase Prediction Accuracy. Mach.
Learn. 40, 3 (Sept. 2000), 229–242.
[7] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32.
CFCA. 2018. 2017 Global Fraud Loss Survey. Survey Results. Communications
Fraud Control Association.
Nontawat Charoenphakdee, Jongyeong Lee, and Masashi Sugiyama. 2019. On
Symmetric Losses for Learning from Corrupted Labels. In Proceedings of the 36th
International Conference on Machine Learning, Vol. 97. 961–970.
Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting Sys-
tem. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining (2016), 785–794.
J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and
Psychological Measurement 20, 1 (1960), 37–46.
Thomas G. Dietterich. 2000. An Experimental Comparison of Three Methods for
Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization.
Machine Learning 40, 2 (Aug. 2000), 139–157.
Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen
Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. The Journal of
Machine Learning Research 9 (June 2008), 1871–1874.
Andres Folleco, Taghi M. Khoshgoftaar, Jason Van Hulse, and Lofton Bullard.
2008. Identifying Learners Robust to Low Quality Data. In 2008 IEEE International
Conference on Information Reuse and Integration. 190–195.
Andres A. Folleco, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano.
2009. Identifying Learners Robust to Low Quality Data. Informatica 33 (2009),
Benoît Frénay and Michel Verleysen. 2014. Classification in the Presence of Label
Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems 25,
5 (2014), 845–869.
Aritra Ghosh, Naresh Manwani, and P. S. Sastry. 2020. Making Risk Minimization
Tolerant to Label Noise. Neurocomputing 160 (July 2020), 93–107.
Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature
selection. J. Mach. Learn. Res. 3 (2003), 1157–1182.
I. Guyon, A. Saari, G. Dror, and G. Cawley. 2010. Model selection: Beyond the
Bayesian/frequentist divide. The Journal of Machine Learning Research 11 (2010),
Ray J. Hickey. 1996. Noise Modelling and Evaluating Learning from Examples.
Articial Intelligence 82, 1-2 (1996), 157–179.
[21] I3 Forum. 2014. I3F Fraud Classication. White paper 3. I3Forum.
Elias Kalapanidas, Nikolaos Avouris, Marian Craciun, and Daniel Neagu. 2003.
Machine Learning algorithms: a study on noise sensitivity. In 1st Balkan
Conference in Informatics.
Pat Langley. 1994. Selection of Relevant Features in Machine Learning. In
Proceedings of the AAAI Fall Symposium on Relevance. AAAI Press, 140–144.
Pierre Lejeail, Vincent Lemaire, Antoine Cornuéjols, and Adam Ouorou. 2018.
TriClustering based outlier-shape score for time series in a fraud detection
platform. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal
Data.
Andrea Malossini, Enrico Blanzieri, and Raymond T. Ng. 2006. Detecting potential
labeling errors in microarrays by data perturbation. Bioinformatics 22, 17 (2006),
Naresh Manwani and P. S. Sastry. 2013. Noise Tolerance under Risk Minimization.
IEEE Transactions on Cybernetics 43, 3 (June 2013), 1146–1151.
N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik. 1992. Computer aided
cleaning of large databases for character recognition. In Proceedings., 11th IAPR
International Conference on Pattern Recognition. Vol.II. Conference B: Pattern Recog-
nition Methodology and Systems. 330–333.
André L. B. Miranda, Luís Paulo F. Garcia, André C. P. L. F. Carvalho, and Ana C.
Lorena. 2009. Use of Classication Algorithms in Noise Detection and Elimination.
In Hybrid Articial Intelligence Systems (Lecture Notes in Computer Science). 417–
David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. 2010. A Study of
the Effect of Different Types of Noise on the Precision of Supervised Learning
Techniques. Artificial Intelligence Review 33, 4 (2010), 275–306.
Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols, and Adam
Ouorou. 2021. From Weakly Supervised Learning to Biquality Learning: an
Introduction. In Proceedings of the International Joint Conference on Neural
Networks (IJCNN).
Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean:
Holistic Data Repairs with Probabilistic Inference. Proc. VLDB Endow. 10, 11 (Aug.
2017), 1190–1201.
Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to
Reweight Examples for Robust Deep Learning. In Proceedings of Machine Learning
Research, Vol. 80. 4334–4343.
Gunnar Rätsch, Takashi Onoda, and Klaus Robert Müller. 1998. An improvement
of AdaBoost to avoid overfitting. In Proc. of the Int. Conf. on Neural Information
Processing. 506–509.
Joseph L. Schafer and John W. Graham. 2002. Missing Data: Our View of the
State of the Art. Psychological Methods 7, 2 (2002), 147–177.
Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. 2020. Learning
from Noisy Labels with Deep Neural Networks: A Survey. arXiv:2007.08199
[cs.LG] (2020).
Jiang-wen Sun, Feng-ying Zhao, Chong-jun Wang, and Shi-fu Chen. 2007. Iden-
tifying and Correcting Mislabeled Training Instances. In Future Generation Com-
munication and Networking (FGCN 2007), Vol. 1. 244–250. ISSN: 2153-1463.
Alaa Tharwat. 2018. Classication assessment methods. Applied Computing and
Informatics (2018).
Jason Van Hulse and Taghi Khoshgoftaar. 2009. Knowledge Discovery from
Imbalanced and Noisy Data. Data & Knowledge Engineering 68, 12 (Dec. 2009),
Brendan van Rooyen, Aditya Krishna Menon, and Robert C. Williamson. 2015.
Learning with Symmetric Label Noise: The Importance of Being Unhinged.
arXiv:1505.07634 [cs] (May 2015).
Kalyan Veeramachaneni, Ignacio Arnaldo, Constantinos Bassias, Ke Li, and Al-
fredo Cuesta-Infante. 2016. AI^2: Training a Big Data Machine to Defend. In
IEEE International Conference on Intelligent Data and Security (IDS). 49–54.
N. Voisine, M. Boullé, and C. Hue. 2010. A Bayes Evaluation Criterion for Decision
Trees. Advances in Knowledge Discovery and Management (AKDM-1) 292 (2010),
Zhilu Zhang and Mert Sabuncu. 2018. Generalized Cross Entropy Loss for Train-
ing Deep Neural Networks with Noisy Labels. In Advances in Neural Information
Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-
Bianchi, and R. Garnett (Eds.). Curran Associates, Inc., 8778–8788.
Zhi-Hua Zhou. 2017. A Brief Introduction to Weakly Supervised Learning.
National Science Review 5, 1 (2017), 44–53.
Xingquan Zhu and Xindong Wu. 2004. Class Noise vs. Attribute Noise: A Quan-
titative Study of Their Impacts. Artif. Intell. Rev. 22, 3 (Nov. 2004), 177–210.