
On the intrinsic robustness to noise of some leading classifiers and symmetric loss function - an empirical evaluation

Hugo Le Baher

Orange Labs

Lannion, France

hugo.le-baher@orange.com

Vincent Lemaire

Orange Labs

Lannion, France

vincent.lemaire@orange.com

Romain Trinquart∗

Orange Labs

Lannion, France

romain.trinquart@orange.com
∗Both authors contributed equally to this research.

ABSTRACT

In some industrial applications such as fraud detection, the performance of common supervision techniques may be affected by the poor quality of the available labels: in actual operational use-cases, these labels may be weak in quantity, quality or trustworthiness. We propose a benchmark to evaluate the natural robustness of different algorithms taken from various paradigms on artificially corrupted datasets, with a focus on noisy labels. This paper studies the intrinsic robustness of some leading classifiers. The algorithms under scrutiny include SVM, logistic regression, random forests, XGBoost and Khiops. Furthermore, building on results from recent literature, the study is supplemented with an investigation into the opportunity to enhance some algorithms with symmetric loss functions.

CCS CONCEPTS

• Computing methodologies → Supervised learning by classification; Classification and regression trees.

KEYWORDS

robustness, label noise, supervised classification, tabular data

1 INTRODUCTION

In recent years, there has been a surge for businesses to turn to ML solutions, especially supervised classifiers, as a solution for automation and hence scaling. In this paradigm, what was previously tackled with hard-coded expertise is now supposedly discovered and exploited automatically by learning algorithms. Alas, the path to ML starts with a strong prerequisite: the availability of labeled data. In some domains, the collection of these labels is a costly process, if not an impossible one. Classifiers are then trained with imperfect labels or proxies. Even in such an adversarial context, the performance of the classifiers may still deliver added business value, such as filtering events and relieving the human operator in a monitoring scenario. But there are some application domains where the high financial stakes push for controlling the effect that noisy labels may have on classification performance. Fraud detection is one of these domains, especially for large companies where even a small fraction of fraudulent activities may yield important losses. As a motivation for the present work, let us consider one specific realm for fraudsters: the wholesale markets in Telecommunication.

Telecommunication companies use a variety of international routes to send traffic to each other across different countries. In a "wholesale market", telecom carriers can obtain traffic to make up a shortfall, or send traffic on other routes, by trading with other carriers in the wholesale or carrier-to-carrier market. Minutes exchanges allow carriers to buy and sell terminations. Prices in the wholesale market can change on a daily or weekly basis. A carrier will look for a least-cost routing function to optimize its trading on the wholesale market. The quality of routes on the wholesale market can also vary, as the traffic may be going over a grey route.

A value chain exists between the operators that provide the connection between two customers. But this value chain can be broken if a fraudster finds a way to generate communication without paying. For the last decades, fraud has been a growing concern in the telecommunication industry. In a 2017 survey [8], the CFCA estimates the annual global fraud loss at $30 billion (USD). Therefore, detecting and, when possible, preventing fraud is paramount in this domain. Regarding the wholesale market, a list of known fraud types [21] exists among the operators, and fraud detection platforms already exist.

One of these platforms has been realized by Orange (as a wholesale operator). This platform contains modules which exploit the information given by the expert (the scoring module, for example, which is a classifier); others explore the data to interact with the expert(s) by finding new patterns, including malevolent ones. This goal is achieved by knowledge discovery and active learning techniques. Those exploration modules [24] are responsible for adapting to the constant evolution of fraudsters' behaviors under (in the case of this platform) the constraint of the limited time that experts can afford to spend in this exploration. This platform shares similarities with the one presented by Veeramachaneni et al. in [40]. Both platforms combine a supervised model for predictions with unsupervised models for the exploration of unknown patterns, and take into account the user feedback in the learning phase.

In this paper, we are interested in the inspection of the label noise which is natively incorporated in this kind of platform. Here the noise comes from an "inaccurate supervision" [30, 43]. The values used as a target for learning, also called labels, could be wrong due to at least two factors: the human annotators could make errors, and the fraud could drift over time, so that previously normal behaviors could contain not-yet-detected fraudulent behaviors. These errors will be referred to as noise or corruption.

2 CONTEXT AND OBJECTIVES OF THIS STUDY

In this section, we delineate the scope of our empirical evaluation. First we provide basic definitions and notations on binary classification. Then we discuss how the task of classification can be affected by various types of noise.

The goal of binary classification is to learn a model from a limited sample of observations, in order to predict an associated class or label. Such a technique is based on the hypothesis that rules, patterns or associations learned from a representative subset of individuals, identified as the training set, can be reused on new data from a similar source, denoted the testing set.

In this article, an individual observation will refer to a vector of features $x$. The observations are organised in a vector $X \in \mathcal{X}^n$. Each observation $x$ has an associated class $y$; those associated classes (or labels) are themselves organised into a vector $Y$ whose domain is $\mathcal{Y}^n$, with $|X| = |Y| = n$. The aim is to model the best relationship between attributes and the associated class, defined as a couple $(x, y) \in X, Y$. The goal of the model is to find a classification function built from observed examples: $f : \mathcal{X} \rightarrow \mathcal{Y}$.

In theory, the dataset used for training is supposed to represent a subset of the ground truth. However, data in real-world applications rarely correspond perfectly to reality. As defined in [20], noise is "anything that obscures the relationship between the features of an instance and its class". According to this definition, every error or imprecision in the labels is considered as noise in this paper.

2.1 Impact of Label Noise

When label noise occurs, i.e. a degradation of the classes of the learning examples, it is no surprise that a significant decrease in performance has been widely observed and studied. The impact on the learning process could be due to some form of overfitting: the affected models focus too much on corrupted individuals and have trouble generating rules or relations that would generalize properly on new data. For example, in [33], Adaboost shows overfitting behavior in the presence of label noise, due to wrong labels having a strong influence on the decision boundary. But in some situations, these corrupted labels can offer a new artificial diversity in the original dataset which prevents existing overfitting. For example, in [6], randomizing the targets is compared to bagging and yields better results. Besides the impact on performance, other problems may arise. These issues may look less obvious but should still be taken into consideration: (i) interpretability, (ii) statistical significance tests, (iii) feature ranking, etc., all of which are important in a fraud detection platform.

2.2 Type of Noise

The classifier is a function which models the relationship between input features and an output qualitative variable. Only label noise, where the labels $Y$ are corrupted, is considered in this paper. Label noise is a stochastic process which consists in a misclassification of the individuals. Considering $(x, \tilde{y}) \in X, Y$, with $x$ and $\tilde{y}$ describing an example with measured attributes and observed label, the following probabilities apply:

$\forall (x, \tilde{y}) \in (\mathcal{X}, \mathcal{Y}),\ P(\tilde{y} = +1 \mid y = -1) = \rho_{-1},\quad P(\tilde{y} = -1 \mid y = +1) = \rho_{+1}$

In other terms, the observed label randomly takes one value among the other possible values, according to two parameters $\rho_{-1}$ and $\rho_{+1}$, the noise parameters. Since only binary classification will be considered later in the paper ($\mathcal{Y} = \{+1, -1\}$), introducing noise on the labels consists in swapping the observed value to the other possible one: positive becomes negative, negative becomes positive.
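As an illustration, here is a minimal sketch (our own, not taken from the benchmark code) of this class-conditional flipping process, assuming labels in $\{-1, +1\}$:

```python
import numpy as np

def flip_labels(y, rho_neg, rho_pos, seed=None):
    """Apply the label-noise model above: a negative example is
    flipped to +1 with probability rho_neg, a positive example
    is flipped to -1 with probability rho_pos."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    u = rng.random(len(y))
    flip = np.where(y == -1, u < rho_neg, u < rho_pos)
    return np.where(flip, -y, y)
```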

2.3 Experimental settings and Noise dependencies

Frenay et al. [16] offer a taxonomy of the different settings of label noise. We discuss it briefly and refer to it in order to specify the setting of our evaluation, with controlled random corruption. In their publication, the authors compare the context of label noise to the context of missing values, described in [34]. The noise generation is characterized in terms of its distribution and the magnitude it depends on.

One can isolate 3 distinct cases where the labels are noised: Noisy Completely At Random (NCAR), Noisy At Random (NAR) and Noisy Not At Random (NNAR). In this paper, only NCAR will be studied in the experiments. This entails that, with $\rho_{-1}$ and $\rho_{+1}$ defined as the probability of noise insertion on negative and positive individuals respectively, $\rho_{-1} = \rho_{+1} = \rho$. The same proportion of the positive class is noised as in the negative class. The noise is uniform.

2.4 Class Balance

The final element that describes our targeted task is the balance of the categories. In natural phenomena or datasets, labels are rarely perfectly balanced. Some proportions may appear with almost 50 percent each for a binary labelling, for example, which would lead to fairly balanced datasets. However, some types of problems exhibit less balanced proportions and sometimes even heavily imbalanced ones. Let us consider the example of fraud detection: one can reasonably imagine that in some system, the large majority of users have an acceptable behavior. Only a small minority would present signs of hostile actions.

This issue should be a particular focus that induces some appropriate benchmark design choices, as detailed hereafter:

2.4.1 Choice of Metrics. Metrics should be chosen appropriately [37]. While most common metrics work well when imbalance is no issue, some may be deeply affected when most of the individuals are from the same category. A simple example illustrates this: a trivial algorithm that always predicts the majority class would achieve an accuracy of 95% if the evaluation dataset is composed of only 5% of the minority class. If no proper attention is paid to the dataset or the built model, then the result could be considered satisfactory.

2.4.2 Imbalance of Categories. As studied in [38], imbalanced and noisy data entail a decrease in performance. As this article concluded, the addition of noise yields a proportional drop in performance for every tested learner, even though a few of them did actually benefit from noise in terms of robustness. The study also shows that noise in the minority class is far more critical than noise in the majority class. Finally, the authors conclude that filters, aiming to detect and correct potential mislabels before any supervised learning procedure, look inappropriate in the case of skewed datasets.

2.5 Current Solutions

In this section, we sketch out the existing solutions for coping with noisy labels. We follow the classification proposed by [16] into 4 categories; we shall then focus on the latter two.


2.5.1 Manual Review. The first pragmatic way of tackling this issue is to manually review the labels with the goal of identifying corrupted ones. However, this task is very similar to manual labelling and it is very costly in time and human resources.

2.5.2 Automatic filtering and cleansing. In binary classification, once a corrupted label is detected, correction is easy: since only two values are available for labelling, the faulty label should be switched to the only other possible value. Alas, automating the detection of faulty labels is of similar complexity to anomaly detection, which is far from trivial. The interested reader may take a look at [25, 27, 28, 36, 38].

2.5.3 Robust Algorithms. A third way of coping with noise is to look for learning algorithms which embed a strong resistance to skewed labels in datasets and produce classification models whose performance may be fairly robust to noise addition. It is an expected output of our evaluation to identify such classifiers.

2.5.4 Changing the loss function for more robustness. The fourth and last approach is a variation on the previous one: instead of looking for classifiers that are robust to noise by design, one might tweak part of the algorithm to reach robustness. Recent studies (see [26, 39] or [17]) have put a new emphasis on the search for more relevant loss functions: aiming at risk minimization in the presence of noisy labels, [9] shows theoretically and experimentally that when the loss function satisfies a symmetry condition, it contributes to the robustness of the algorithm.

The concept of symmetry for loss functions already has multiple conflicting definitions. Here, we will be using the following. Considering a loss function $\mathcal{L}$, where $f(x)$ is the prediction of the model and $y$ is the target label taken amongst $\mathcal{Y}$, $\mathcal{L}$ is symmetric if $\sum_{y \in \mathcal{Y}} \mathcal{L}(f(x), y) = c$, $c$ being a constant. In a binary context where $\mathcal{Y} = \{+1, -1\}$, we have: $\mathcal{L}(f(x), +1) + \mathcal{L}(f(x), -1) = c$.
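As a concrete check, consider the unhinged loss [39], which is used later in our benchmark and which can be written as $\mathcal{L}_{\mathrm{unh}}(f(x), y) = 1 - y f(x)$; the symmetry condition holds with $c = 2$:

$\mathcal{L}_{\mathrm{unh}}(f(x), +1) + \mathcal{L}_{\mathrm{unh}}(f(x), -1) = (1 - f(x)) + (1 + f(x)) = 2.$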

In other terms, an algorithm will be penalized equally according to the relationship between the target $y$ and the predicted value $f(x)$, whether the target is positive or negative. Some implementations of algorithms offer multiple loss functions, or at least allow users to implement their own. In our study, a relevant goal is to know whether using such losses is worth the change, in general or in the case of corrupted datasets.

Note: In this paper, we are more interested in an evaluation of the robustness of some recent leading classifiers and symmetric loss functions; thus manual review and automatic filtering / cleansing will not be studied.

2.6 Objectives and organisation of this paper

The objectives of this paper are twofold. First, we are interested in an empirical study of the performance of some of the leading classifiers, as well as the classifier used in the platform described in the introduction. Secondly, we are interested in assessing the performance of dedicated symmetric loss functions and hence their relevance to our platform for mitigating the effects of label noise. For our experiments, the following hypotheses will be considered: (i) if label noise has an impact on performance in general, some algorithms may be robust to it; (ii) the addition of noise over labels may entail a heterogeneous performance decrease over datasets, that is to say noise may entail a lesser performance decrease on some specific problems; (iii) some simple tweaks in the algorithms might increase the robustness, and the use of symmetric losses, when available, is a promising option.

After the introduction (Section 1) and the description of the context above in this section, the rest of the paper is organized as follows: Section 3 describes the methodology used to design our benchmark, then Section 4 gives detailed results before a discussion and conclusion in the last section (Section 5).

3 METHODOLOGY IMPLEMENTED

Following previously published evaluations such as [14, 29, 44], we present the design of our empirical evaluation through the following facets of specification: (i) the overall protocol, (ii) the datasets used for the benchmark, (iii) the list of learning algorithms under evaluation, (iv) the procedures to generate the artificial noise, (v) the criteria for the final evaluation of the tests.

3.1 Protocol

In order to ensure that our evaluation delivers relevant results, the protocol must explore a large variety of options. First, it should integrate algorithms that are taken from different paradigms. Second, it should confront those algorithms with diverse datasets. Not only should these datasets differ in their features, they should exhibit a sufficient range of target distributions, from fairly balanced to heavily imbalanced ones. Finally, the number of run tests must be sufficient to draw trustworthy conclusions; this number should also be configurable so as to control computing time.

The protocol can be described as a pipeline, composed of several processes, from collecting the datasets to producing the results. Here are all the steps that have to be implemented, with their associated adjustable parameters:

(1) Collection of publicly available datasets, whatever the format (.csv, .json, .arff, ...).

(2) Preprocessing and standardization of the collected datasets, to be used in the following steps with uniform code. This implies various transformations:
• Interpreting the format of the dataset as tabular.
• Filling missing values. Categorical and numerical features are considered separately:
– For categorical columns, missing values are considered as their own separate value.
– For numeric columns, missing values are replaced by the average of the whole column.
Note that such a process may penalize models that can handle missing values by design, compared to other models.
• Selection of the relevant variables, according to the documentation furnished (or not) with the dataset.
• Standardization of the labels into {−1: negative, +1: positive}.
• Cleaning of the strings (values and columns).

(3) Splitting the datasets into $K$ folds, repeated $R$ times. Note that the folds are stratified, i.e. they respect the original class proportions in the splits.

(4) Applying noise on the splits. A random portion of the training labels is chosen according to a fixed parameter, called $\rho$. For the noise to be comparable across all datasets, the chosen proportion scales with the minority class percentage. If we have $\rho = 0.5$ and the prior is 0.25 (25% of the examples are from the minority class), then we would apply an effective label noise of $0.5 \times 0.25 = 0.125 = 12.5\%$ (a minimal sketch of this step is given at the end of this subsection).

(5) Learning $A$ different algorithms on the corrupted training sets and evaluating them over the testing sets. Some algorithms may use different preprocessing than others.

(6) Computing the chosen metrics according to the predicted sets produced by the models.

At the end of this pipeline, $(D \times K \times R \times |\rho| \times A \times M)$ results will be obtained: $M$ metrics computed over the learning of $A$ algorithms. The algorithms learn over $D$ datasets, split $K \times R$ times. These splits are corrupted depending on a fixed parameter $\rho$ that scales with the prior.
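The noise-injection step (4) can be summarized by the following minimal sketch (our own illustration, assuming labels in $\{-1, +1\}$; the actual benchmark code is available in the repository cited in Section 3.3):

```python
import numpy as np

def apply_ncar_noise(y, rho, seed=None):
    """NCAR noise scaled by the minority-class prior: with rho=0.5
    and a 25% minority class, the effective flip rate is 12.5%.
    Flips are uniform, so both classes are corrupted in the same
    proportion (rho_{-1} = rho_{+1})."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    prior = min((y == -1).mean(), (y == +1).mean())  # minority proportion
    effective_rate = rho * prior                     # e.g. 0.5 * 0.25 = 0.125
    flip = rng.random(len(y)) < effective_rate
    return np.where(flip, -y, y)
```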

3.2 Datasets

In the introduction to this paper, we motivated our benchmark through the description of the operational fraud detection system we contribute to. This use-case may seem like a good candidate for providing datasets for evaluation purposes. Unfortunately, this source of data is sensitive and cannot be disclosed. Moreover, these data are afflicted by the exact plague that this study focuses upon: the labels are not fully known and hence the actual performance in classification cannot be measured. Instead, we chose to turn to datasets which are publicly available and commonly used in benchmarks; the target labels in those datasets could then be partially changed in a controlled manner. The datasets chosen and used in this paper are listed in Table 1. They come from NASA or UCI and have been used in [29], [9], or [14]. These datasets are deliberately more or less simplified versions of existing use-cases; they are close to our real application: tabular datasets with a medium number of explanatory variables which are mixed (categorical and numerical) and which are not made of images, music or language.

    Name           Num  Cat      N  Min%
 1  Trucks         170    0  76000   1.8
 2  Bank            10   10  41188   3.3
 3  PC1             22    0   1109   6.9
 4  CM1             22    0    498   9.8
 5  KC1             22    0   2105  15.4
 6  KC3             40    0    194  18.6
 7  JM1             17    5  10885  19.3
 8  KC2             22    0    522  20.5
 9  Adult            5    8  48842  23.9
10  Breast Cancer    9    0    699  34.4
11  Spambase        57    0   4601  39.4
12  Eye State       14    0  14980  44.8
13  Phishing        68    0  11055  44.3
14  Mushroom         0   22   8416  46.6

Table 1: Datasets used for the benchmark and some characteristics: name, number of numerical features (Num), number of categorical features (Cat), number of examples (N), percentage of the minority class (Min%).

NASA datasets were chosen because they are used in [14] and because fault detection and fraud detection are quite related topics. However, the original versions linked in that article are not available anymore (only a portion has been backed up and made available at https://datahub.io/machine-learning). Moreover, the authors use a cleansing process over the datasets that we are not able to reproduce, because of a lack of information. For these reasons, we will not be able to compare our results with that article.

With this choice of datasets, a large range of class balances is covered: Mushroom is balanced while Trucks is heavily imbalanced. Also, the volumes vary considerably in number of rows and columns, which leads to learning tasks of varying difficulty.

3.3 Algorithms

The aim of this paper is to demonstrate the different levels of robustness to label noise that are achieved by various algorithm paradigms. In this subsection, we list the algorithms we have chosen as paragons. For each algorithm, we provide the motivations for our choice as well as the parameters and implementations used (the code of all the experiments and the datasets are available at: https://github.com/Hugoswnw/NoiseEvaluation).

Note that, given the datasets used in this benchmark are tabular, with mixed variables, we chose not to include deep learners; the reader may find a recent survey focused on this learning paradigm in [35]. An overall criterion for picking algorithms was interpretability, which is a key feature for experts in fraud detection (even though there have been many recent tools, such as saliency maps and activation differences, that work well in some domains, they do not transfer completely to all applications; it is still difficult to interpret per-feature importance in the overall decision of a deep net).

3.3.1 Linear SVC. To serve as a baseline or starting point, a simple and common model has been chosen in the context of binary classification. The support vector machine classifier fits these requirements. The goal of this algorithm is to find a maximum margin hyperplane that provides the greatest separation between the classes [1]. An interface to the Liblinear implementation [13] is available in Scikit-learn (http://scikit-learn.org/). Moreover, the Liblinear implementation allows scaling to large datasets.

For our experiments, the following parameters have been used: an L2 penalty is applied, regularized at $C = 1.0$, with a squared hinge loss function. As advised in the documentation, in the case where the number of samples outnumbers the number of features, we prefer the primal optimization problem rather than the dual. We also want to reach convergence in as many cases as possible while finding a solution in a reasonable time: the number of iterations is fixed at 20000. This model is expected to have poor results on corrupted labels without cleansing [14].
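In Scikit-learn terms, this corresponds roughly to the following configuration (a sketch assuming current Scikit-learn parameter names, not an extract from the benchmark code):

```python
from sklearn.svm import LinearSVC

# L2 penalty, C=1.0, squared hinge loss, primal problem (dual=False)
# since n_samples >> n_features, and a 20000-iteration budget.
svc = LinearSVC(penalty="l2", C=1.0, loss="squared_hinge",
                dual=False, max_iter=20000)
```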

3.3.2 Logistic Regression (LR). Adding another linear model to the experiments may confirm (or not) the expected results. Logistic Regression is expected to be affected in the same way as Linear SVC [14]. This is confirmed by a review of the use of non-symmetric losses on corrupted labels: this problem can indeed be expressed as an optimization using a logistic loss, which does not satisfy the symmetry condition. Using this algorithm also serves the purpose of showing the faults of such loss functions. The implementation used is also from Scikit-learn, with parameters similar to those of SVC: the lbfgs solver is used with its only available options, namely L2 regularization with $C = 1.0$ and primal optimization. The number of iterations is also set at 20000.
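Again as a sketch of the reported settings in Scikit-learn terms (assuming current parameter names):

```python
from sklearn.linear_model import LogisticRegression

# lbfgs solver, L2 regularization with C=1.0, 20000 iterations.
lr = LogisticRegression(solver="lbfgs", penalty="l2",
                        C=1.0, max_iter=20000)
```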

3.3.3 Random Forests (RF). An option shown to outperform other approaches when dealing with label noise is the use of ensemble methods, or Bagging [12]. This kind of method was developed mostly to improve the performance of existing models and prevent overfitting. The principle is quite simple: multiple versions of a learner are trained on different samples and the outcome is a vote [5]. This process works well when the learners are naturally unstable. For this reason, Random Forests [7] are often privileged.

To multiply the tests, two different implementations of Random Forests were used: one from Scikit-learn, one from Weka (http://weka.sourceforge.io/). Parameters were chosen to be similar in both implementations. The forest is composed of 100 trees of unlimited depth. In Weka, the minimum number of instances per leaf equals 1, while in Scikit-learn the minimum number of samples required to authorize a split is 2, which amounts to the same idea. According to [15], good results can be expected on the noised datasets.
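A sketch of the corresponding Scikit-learn configuration (assuming current parameter names):

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees of unlimited depth; min_samples_split=2 mirrors
# Weka's minimum of one instance per leaf.
rf = RandomForestClassifier(n_estimators=100, max_depth=None,
                            min_samples_split=2)
```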

3.3.4 Khiops. Khiops is a tool developed and used in Orange Labs; the software is available at www.khiops.com (an evaluation license can be obtained by anyone, free of charge, for two months and without any technical limitations). This is also the classifier used in the platform described in Section 1. To summarize briefly, for supervised classification it contains a Selective Naive Bayes (SNB), i.e. feature selection and feature weighting are used.

Various methods of feature selection have been proposed [23] to focus the description of the examples on supposedly relevant features. For instance, heuristics for adding and removing features can be used to select the best features using a wrapper approach [18]. One way to average a large number of selective naive Bayes classifiers obtained with different subsets of features is to use one model only, but with feature weighting [4]. The Bayes formula, under the hypothesis of feature independence conditionally to the classes, becomes:

$P(j|X) = \frac{P(j) \prod_f P(X_f|j)^{W_f}}{\sum_{j=1}^{K} \left[ P(j) \prod_f P(X_f|j)^{W_f} \right]}$

where $W_f$ represents the weight of the feature $f$, $X_f$ is component $f$ of $X$, and $j$ is the class label. The predicted class $j$ is the one that maximizes the conditional probability $P(j|X)$. The probabilities $P(X_f|j)$ are estimated by intervals, using a discretization for continuous features. For categorical features, this estimation can be done directly if the feature has few distinct modalities; otherwise, a grouping into modalities is used, following [2]. The computation of the weights of the SNB, as well as the discretization of the numerical variables and the grouping of the modalities of the categorical variables, are done with the use of data-dependent priors and a model selection approach [19] which has no recourse to cross-validation. As shown in [3], good results can be expected on the noised datasets since this kind of method is auto-regularized.
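To make the weighted rule above concrete, here is a minimal sketch of the scoring step in log space (our own illustration with hypothetical array names, not Khiops' API):

```python
import numpy as np

def snb_log_posterior(log_prior, log_cond, weights):
    """Compute log P(j | X) for one example under the weighted
    naive Bayes rule: P(j) * prod_f P(X_f | j)^{W_f}, normalized.

    log_prior: shape (K,), log P(j) for each class.
    log_cond:  shape (K, F), log P(X_f | j) for this example.
    weights:   shape (F,), feature weights W_f.
    """
    scores = log_prior + (weights * log_cond).sum(axis=1)
    return scores - np.logaddexp.reduce(scores)  # log-sum-exp normalization
```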

An interface, available in Python and developed to be uniform with Scikit-learn, was used for the benchmark. For this use of Khiops, no parameters are required. An additional model, referred to as "KhiopsRF" in the results analysis, has also been tested: for this variant, Khiops builds some supplementary input variables by means of Random Trees (100 trees) [41].

3.3.5 XGBoost. One of the solutions that needed to be tested with high priority is the use of a symmetric loss, as explained in Section 2.5.4. Hence, an implementation that allows the use of a symmetric loss, or at least of custom losses, is needed. The first case seems to be quite rare, so we made the choice to use XGBoost, which allows us to implement our own losses (called objective functions in XGBoost) [10].

Boosting is similar to bagging in that it combines many simple learners to reach a single decision. However, in bagging, each learner is trained at the same time and independently, while in boosting we learn the models one after another, each model serving as a baseline for the next one. Other boosting methods like Adaboost have been shown to have really good results overall, but worse than Bagging when label noise occurs [12]. This algorithm has been chosen mostly for the comparison of loss functions.

Other than the choice of the loss function, here are the parameter choices: $\eta = 0.3$, $min\_child\_weight = 0$, $\lambda = 1$. Eta ($\eta$) is the learning rate that shrinks the feature weights used in the next step to prevent overfitting. $min\_child\_weight$ corresponds to the minimum sum of instance weights needed in a child to continue partitioning; it also corresponds to the sum of the Hessian. In some contexts, like linear regression, the sum of the Hessian corresponds to the number of instances. With the symmetric losses that we use, the Hessian is always null, so this parameter needs to be null as well if we want the learning to proceed. Lambda ($\lambda$) is the L2 regularization parameter. 100 estimators are built.
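As an illustration of how a symmetric loss plugs into XGBoost, here is a minimal sketch of the unhinged loss as a custom objective (our own illustration, assuming labels are stored as $\{0, 1\}$ in the DMatrix; the benchmark repository holds the actual implementations):

```python
import numpy as np
import xgboost as xgb

def unhinged_objective(preds, dtrain):
    """Unhinged loss L(f, y) = 1 - y*f with y in {-1, +1}:
    the gradient is -y and the Hessian is identically zero,
    which is why min_child_weight must be set to 0."""
    y = dtrain.get_label() * 2.0 - 1.0  # map {0, 1} labels to {-1, +1}
    grad = -y                           # dL/df
    hess = np.zeros_like(y)             # d2L/df2 = 0 (linear loss)
    return grad, hess

params = {"eta": 0.3, "min_child_weight": 0, "lambda": 1}
# booster = xgb.train(params, dtrain, num_boost_round=100,
#                     obj=unhinged_objective)
```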

3.4 Evaluation

To evaluate the quality of the results of all classifiers, a comparison between the predicted targets and the ground truth is needed, both on the training and the testing set. The score obtained on the training set helps to understand whether the algorithms learn effectively. The testing score evaluates how well the model manages to generalize its results to new data. Since our datasets range from balanced to very unbalanced, three appropriate metrics will be considered:

3.4.1 Balanced Accuracy. A good alternative to accuracy in our imbalanced scenario is the balanced accuracy, also called "bacc" for short. To compute this metric, the accuracy is computed per example and each example is weighted inversely to the prevalence of its class. Thanks to this, errors on examples taken from the minority class are emphasized and thus not ignored. Balanced accuracy is defined on $[0, +1]$, where $+1$ is the perfect score.
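Returning to the trivial majority-class predictor of Section 2.4.1, a quick sketch shows why this metric is preferred here:

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# 95% majority class, and a classifier that always predicts it:
y_true = np.array([+1] * 95 + [-1] * 5)
y_pred = np.ones(100)
print(accuracy_score(y_true, y_pred))           # 0.95, looks satisfactory
print(balanced_accuracy_score(y_true, y_pred))  # 0.50, i.e. random guessing
```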

3.4.2 Area Under the ROC Curve. Already described in the review of [14], this metric is relevant because it shows the trade-off between detection and false alarm rates. The AUC is defined on $[0, +1]$, where $+1$ is the perfect score.

3.4.3 Cohen's Kappa. The last metric we have chosen is Cohen's Kappa [11]. It is a statistic that measures the agreement between two annotators. Here, we can consider the original target values and the predicted ones as assigned by two different annotators. If $p_0$ is the observed agreement ratio and $p_e$ the agreement probability expected if the labeling were randomly chosen by both annotators, then: Kappa $= \frac{p_0 - p_e}{1 - p_e}$. Cohen's Kappa is defined on $[-1, +1]$, where $+1$ is the perfect score and a negative or null value means a random classification.

3.5 Summary: Parameters Chosen and Volume of Results

To conclude this section, here is a recap of all the parameters used during the benchmark:

• 14 datasets: detailed in Subsection 3.2.
• 10 algorithms: Linear SVC, Logistic Regression, Scikit-learn Random Forest (skRF), Weka Random Forest (wekaRF), Khiops, Khiops with Random Forest features (KhiopsRF), XGBoost with asymmetric loss functions: Hinge loss (XGB_HINGE) and Squared error loss (XGB_SQUERR); XGBoost with symmetric loss functions: Unhinged loss (XGB_UNHINGED) and Ramp loss (XGB_RAMP).
• 5-fold splits, repeated 5 times.
• 12 levels of noise: $\rho$ = [0.00, 0.05, 0.10, 0.20, 0.25, 0.33, 0.50, 0.66, 0.75, 0.90, 1.00, 1.25]. As a reminder, the effective noise is scaled by the minority class proportion. These levels have been chosen to go progressively from 0 to 1: from no noise to a level of noise equal to the minority class proportion. Above 1, no information should remain on the minority class, so performances should collapse; the level 1.25 is added to validate this hypothesis. In this case, $\rho = 1.25$ means a noise level of 1.25 times the percentage of the minority class.
• 3 metrics: AUC, Balanced Accuracy and Cohen's Kappa.

When running the full tests, one obtains this number of trained models: $14 \times 10 \times 5 \times 5 \times 12 = 42{,}000$. Each of these models has been evaluated with the 3 metrics on both the train and test sets. Below, only the testing results will be used.

4 RESULTS

4.1 General Results

As stated at the end of the previous section, the benchmark comprises 42 thousand runs. To interpret results over such volumes, looking closely at each individual result is not an option: it is mandatory to aggregate the results. Considering the average results by model and dataset is a sound approach to get an overview of the algorithms' performance. On the contrary, averaging over all values of noise would defeat the purpose of the benchmark, since it would mask the evolution of performance along with noise addition.

For the dataset breastcancer, which is almost balanced, one can observe in Figure 1 that performance decreases along with noise for all models (the bars that represent each level of noise are sorted from left to right, from dark to lighter color). For spambase (see Figure 2), the results are the same. For trucks (see Figure 3), which is the largest dataset (in number of examples and explanatory variables) and the most unbalanced, the results seem stable for all algorithms.

Note that the Y axis is not drawn over the full range of values: only from half of the range to the maximum values. The following observations can be made on this case:

• Performances genuinely collapse at $\rho = 1.25$.
• While the models achieve similar results without noise, khiops, khiopsRF, linearSVC and logistic regression show only a really small decrease, even at $\rho = 1.00$.
• Both symmetric Ramp and Unhinged losses are more robust than Hinge and Squared Error with XGBoost.
• Both implementations of Random Forests have the same results. Although this solution was considered a promising option [14], it seems that the presence of noise yields a significant decrease in performance.

Due to space considerations we cannot show the detailed results for all datasets, but they are all available at https://github.com/Hugoswnw/NoiseEvaluation for the AUC, the balanced accuracy as well as Cohen's Kappa (the plots for every dataset's results are available). This supplementary material also contains the retained performance, which is discussed in the next subsection.

The following conclusions can be drawn from the supplementary material: the results on the NASA datasets are really poor. On KC1, KC3 and CM1 especially, the algorithms struggle to achieve more than random guesses (half of the metrics' range) and they exhibit a high variance. These datasets, being quite small and very specific, must have features which are challenging for the models. The collapse at $\rho = 1.25$ is also clear with eyestate, mushroom and phishing.

4.2 Retained Performances - View 1: Averaged on all datasets

To measure robustness only, a simple metric has been used to emphasize this aspect. For each run with a given $\rho$, the result is compared to the run done in the exact same conditions at $\rho = 0.0$, i.e. when there is no noise; this gives the retained performance (note that for this computation, Cohen's Kappa must first be scaled from $[-1, +1]$ to $[0, +2]$ with a simple addition):

$P_{k\rho} = \frac{Result_{\rho}}{Result_{\rho=0}}$

Using this metric, the robustness over all datasets can be compared at a glance (Figure 4).
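A minimal sketch of this computation (our own illustration):

```python
def retained_performance(result_rho, result_clean, metric="auc"):
    """Ratio of the noisy-run score to the clean-run score.
    Cohen's Kappa is shifted from [-1, +1] to [0, +2] first,
    so that the ratio stays well defined."""
    if metric == "kappa":
        result_rho, result_clean = result_rho + 1.0, result_clean + 1.0
    return result_rho / result_clean
```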

These results corroborate the following assertions:

• Performances collapse at $\rho = 1.25$.
• Both symmetric Ramp and Unhinged losses are more robust than Hinge and Squared Error with XGBoost.
• Both implementations of Random Forests yield the same results, and their decrease in performance is significant. This observation conflicts with the conclusions made in [15].

However, the $P_{k\rho}$ metric completely ignores the absolute performances; therefore the next section focuses on this point.

4.3 Retained Performances - View 2: Algorithms compared with each other on a dataset

An aspect of the results that we really care about is how the algorithms behave compared with each other. The barplots in Figures 5 and 6 allow a clear view of the results, but it is far from trivial to conclude whether algorithm A is better than B, and for which noise values. Digging further into this idea, we can plot these aggregated results focusing on the evolution of the performance of all models with respect to the noise level. Figure 5 and Figure 6 illustrate the results for the datasets SpamBase and Adult. The results for all datasets are available in the supplementary material.

Figure 5: Impact on performance along with noise addition on spambase: AUC versus $\rho$.

Figure 6: Impact on performance along with noise addition on adult: AUC versus $\rho$.


Figure 1: Results on Breastcancer averaged per metric, algorithm and noise applied: AUC, Cohen's Kappa and Bacc.

Figure 2: Results on Spambase averaged per metric, algorithm and noise applied: AUC, Cohen's Kappa and Bacc.

For clarity purposes, algorithms that belong to similar paradigms are represented with the same color: Khiops and KhiopsRF in orange, Weka and Scikit-learn Random Forests in blue, SVC and logistic regression in green, XGBoost versions in red. Since the value $\rho = 1.25$ entails different behaviours which are not representative, only the values between 0 and 1 are kept.

From Figures 5 and 6, as well as from the others available in the supplementary material, we can make the following observations (some of which reinforce our previous conclusions):

• Random Forests achieve very good results overall when there is no noise. However, they become among the worst when noise occurs. This disagrees with the results in [14].
• Khiops and KhiopsRF achieve very good results overall and show quite stable results when noise occurs.
• Logistic Regression and Linear SVC show surprisingly good results overall and a "small" sensitivity to label corruption.

• Symmetric loss functions in XGBoost appear stable indeed. However, they are still outperformed by a large margin by asymmetric losses and other models in most of the experiments. They appear to be worth using only at very high noise levels (for Phishing and Spambase, they outperform asymmetric losses only beyond $\rho = 0.66$).

Figure 3: Results on Trucks averaged per metric, algorithm and noise applied: AUC, Cohen's Kappa and Bacc.

Figure 4: Retained performance averaged on all datasets: AUC, Cohen's Kappa and Bacc.

Note on symmetric loss functions - First, it should be recalled that the results presented in papers such as [9] are asymptotic results, valid when $N \rightarrow \infty$, which is not often the case in real situations. If we consider only the datasets where the XGBoost algorithm equipped with the symmetric unhinged loss function (XGB_UNHINGED) has good results (i.e. where its AUC performance is above 0.7), it is true that a link exists between the size of the training set and the retained performance, but that is also the case for other algorithms such as Linear SVC or Khiops. Moreover, even if XGB_UNHINGED is more stable than XGB_SQUERR, the performances of XGB_SQUERR for high values of $\rho$ are better than those of XGB_UNHINGED. XGB_UNHINGED's performances are also often low on these datasets (whatever the value of $\rho$), as well as on the other datasets (where its AUC performance is below 0.7, mainly those where the percentage of the minority class is low).

Secondly, it appears that loss functions are not universal: they should be tailored to the learning algorithm. There are loss functions adapted to regression, to classification, etc. Even if the Unhinged and Ramp losses are symmetric, they do not seem suited to XGBoost for binary classification in our empirical study. It has also been reported that performances with such losses are significantly affected by noisy labels [32]. Such implementations perform well only in simple cases, when learning is easy or the number of classes is small. Moreover, the modification of the loss function increases the training time needed for convergence [42].

5 CONCLUSION

This paper presents an extensive benchmark dedicated to binary classification problems, for testing the ability of various learning algorithms to cope with noisy labels. Through the example of operational fraud detection, it is shown in the introduction that such an ability is crucial for automatic classifiers to be adopted in "real-life" applications.

The article also has a tutorial value in some parts and provides additional results to [22], either in the datasets considered or in the classifiers considered. The benchmark protocol covers a wide range of settings, with diverse datasets and a variety of target class balances (or imbalances). The algorithms under scrutiny include SVM, logistic regression, random forests, XGBoost and Khiops. Furthermore, the study is supplemented with an investigation into the opportunity to enhance some algorithms with symmetric loss functions.

For the benchmark's conclusions to be as meaningful as possible, the set of use-cases and parameters to test yields a very large set of results to be synthesized. The motivations for picking a few metrics are exposed and the aggregated results are then discussed. The conclusions can be summed up in the few sentences hereunder.

If the labels in the data available for model training are trusted and there is no reason to believe that any corruption process happened, then Random Forest is a good option. However, as soon as the dataset is suspected not to be perfectly reliable and more stability is required, Khiops looks like a better, safer solution. Simple models such as SVM and Logistic Regression also seem to provide trustworthy results in such a context. We thus disagree with the results in [14], which consider random forests robust to label noise.

The lead of symmetric losses still looks promising. In most of our tests, their results remained quite stable even at high levels of noise. However, the main issue that would have to be solved is their underwhelming performance in low-noise contexts, as discussed in the note of the previous section.

In spite of the efforts to make the test protocol as complete as possible, some design choices were made to stay focused, which leaves room for further investigation. First, in the experiments, the only type of noise considered for flipping labels is Noisy Completely At Random (NCAR). It will be interesting to pursue the benchmark with NAR (Noisy At Random) and NNAR (Noisy Not At Random). Second, the test protocol was designed to compare classifiers faced with "raw" noisy data. Another possible setting would consist in preparing the data with the application of automatic cleaning operations such as [31] and then measuring how the classifiers perform. This would have been an added value to this empirical study and we consider it as future work.


REFERENCES

[1] Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. 1992. A Training Algorithm for Optimal Margin Classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory. Association for Computing Machinery, New York, NY, USA, 144–152.
[2] Marc Boullé. 2005. A Bayes optimal approach for partitioning the values of categorical attributes. Journal of Machine Learning Research 6 (2005), 1431–1452.
[3] Marc Boullé. 2006. MODL: a Bayes optimal discretization method for continuous attributes. Machine Learning 65, 1 (2006), 131–165.
[4] Marc Boullé. 2007. Compression-Based Averaging of Selective Naive Bayes Classifiers. Journal of Machine Learning Research 8 (2007), 1659–1685.
[5] Leo Breiman. 1996. Bagging predictors. Machine Learning 24, 2 (1996), 123–140.
[6] Leo Breiman. 2000. Randomizing Outputs to Increase Prediction Accuracy. Machine Learning 40, 3 (2000), 229–242.
[7] Leo Breiman. 2001. Random Forests. Machine Learning 45, 1 (2001), 5–32.
[8] CFCA. 2018. 2017 Global Fraud Loss Survey. Survey Results. Communications Fraud Control Association.
[9] Nontawat Charoenphakdee, Jongyeong Lee, and Masashi Sugiyama. 2019. On Symmetric Losses for Learning from Corrupted Labels. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97. 961–970.
[10] Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 785–794.
[11] J. Cohen. 1960. A Coefficient of Agreement for Nominal Scales. Educational and Psychological Measurement 20, 1 (1960), 37–46.
[12] Thomas G. Dietterich. 2000. An Experimental Comparison of Three Methods for Constructing Ensembles of Decision Trees: Bagging, Boosting, and Randomization. Machine Learning 40, 2 (2000), 139–157.
[13] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A Library for Large Linear Classification. Journal of Machine Learning Research 9 (2008), 1871–1874.
[14] Andres Folleco, Taghi M. Khoshgoftaar, Jason Van Hulse, and Lofton Bullard. 2008. Identifying Learners Robust to Low Quality Data. In 2008 IEEE International Conference on Information Reuse and Integration. 190–195.
[15] Andres A. Folleco, Taghi M. Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. 2009. Identifying Learners Robust to Low Quality Data. Informatica 33 (2009), 245–259.
[16] Benoit Frenay and Michel Verleysen. 2014. Classification in the Presence of Label Noise: A Survey. IEEE Transactions on Neural Networks and Learning Systems 25, 5 (2014), 845–869.
[17] Aritra Ghosh, Naresh Manwani, and P. S. Sastry. 2015. Making Risk Minimization Tolerant to Label Noise. Neurocomputing 160 (2015), 93–107.
[18] Isabelle Guyon and André Elisseeff. 2003. An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003), 1157–1182.
[19] I. Guyon, A. Saffari, G. Dror, and G. Cawley. 2010. Model selection: Beyond the Bayesian/frequentist divide. Journal of Machine Learning Research 11 (2010), 61–87.
[20] Ray J. Hickey. 1996. Noise Modelling and Evaluating Learning from Examples. Artificial Intelligence 82, 1-2 (1996), 157–179.
[21] I3 Forum. 2014. I3F Fraud Classification. White paper 3. I3Forum.
[22] Elias Kalapanidas, Nikolaos Avouris, Marian Craciun, and Daniel Neagu. 2003. Machine Learning algorithms: a study on noise sensitivity. In Proceedings of the 1st Balkan Conference in Informatics.
[23] Pat Langley. 1994. Selection of Relevant Features in Machine Learning. In Proceedings of the AAAI Fall Symposium on Relevance. AAAI Press, 140–144.
[24] Pierre Lejeail, Vincent Lemaire, Antoine Cornuéjols, and Adam Ouorou. 2018. TriClustering based outlier-shape score for time series in a fraud detection platform. In ECML/PKDD Workshop on Advanced Analytics and Learning on Temporal Data.
[25] Andrea Malossini, Enrico Blanzieri, and Raymond T. Ng. 2006. Detecting potential labeling errors in microarrays by data perturbation. Bioinformatics 22, 17 (2006), 2114–2121. https://doi.org/10.1093/bioinformatics/btl346
[26] Naresh Manwani and P. S. Sastry. 2013. Noise Tolerance under Risk Minimization. IEEE Transactions on Cybernetics 43, 3 (2013), 1146–1151.
[27] N. Matic, I. Guyon, L. Bottou, J. Denker, and V. Vapnik. 1992. Computer aided cleaning of large databases for character recognition. In Proceedings of the 11th IAPR International Conference on Pattern Recognition, Vol. II: Pattern Recognition Methodology and Systems. 330–333.
[28] André L. B. Miranda, Luís Paulo F. Garcia, André C. P. L. F. Carvalho, and Ana C. Lorena. 2009. Use of Classification Algorithms in Noise Detection and Elimination. In Hybrid Artificial Intelligence Systems (Lecture Notes in Computer Science). 417–424.
[29] David F. Nettleton, Albert Orriols-Puig, and Albert Fornells. 2010. A Study of the Effect of Different Types of Noise on the Precision of Supervised Learning Techniques. Artificial Intelligence Review 33, 4 (2010), 275–306.
[30] Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols, and Adam Ouorou. 2021. From Weakly Supervised Learning to Biquality Learning: an Introduction. In Proceedings of the International Joint Conference on Neural Networks (IJCNN).
[31] Theodoros Rekatsinas, Xu Chu, Ihab F. Ilyas, and Christopher Ré. 2017. HoloClean: Holistic Data Repairs with Probabilistic Inference. Proceedings of the VLDB Endowment 10, 11 (2017), 1190–1201. https://doi.org/10.14778/3137628.3137631
[32] Mengye Ren, Wenyuan Zeng, Bin Yang, and Raquel Urtasun. 2018. Learning to Reweight Examples for Robust Deep Learning. In Proceedings of Machine Learning Research, Vol. 80. 4334–4343.
[33] Gunnar Rätsch, Takashi Onoda, and Klaus-Robert Müller. 1998. An improvement of AdaBoost to avoid overfitting. In Proceedings of the International Conference on Neural Information Processing. 506–509.
[34] Joseph L. Schafer and John W. Graham. 2002. Missing Data: Our View of the State of the Art. Psychological Methods 7, 2 (2002), 147–177.
[35] Hwanjun Song, Minseok Kim, Dongmin Park, and Jae-Gil Lee. 2020. Learning from Noisy Labels with Deep Neural Networks: A Survey. arXiv:2007.08199 [cs.LG] (2020).
[36] Jiang-wen Sun, Feng-ying Zhao, Chong-jun Wang, and Shi-fu Chen. 2007. Identifying and Correcting Mislabeled Training Instances. In Future Generation Communication and Networking (FGCN 2007), Vol. 1. 244–250.
[37] Alaa Tharwat. 2018. Classification assessment methods. Applied Computing and Informatics (2018).
[38] Jason Van Hulse and Taghi Khoshgoftaar. 2009. Knowledge Discovery from Imbalanced and Noisy Data. Data & Knowledge Engineering 68, 12 (2009), 1513–1542.
[39] Brendan van Rooyen, Aditya Krishna Menon, and Robert C. Williamson. 2015. Learning with Symmetric Label Noise: The Importance of Being Unhinged. arXiv:1505.07634 [cs] (2015).
[40] Kalyan Veeramachaneni, Ignacio Arnaldo, Constantinos Bassias, Ke Li, and Alfredo Cuesta-Infante. 2016. AI^2: Training a Big Data Machine to Defend. In IEEE International Conference on Intelligent Data and Security (IDS). 49–54.
[41] N. Voisine, M. Boullé, and C. Hue. 2010. A Bayes Evaluation Criterion for Decision Trees. Advances in Knowledge Discovery and Management (AKDM-1) 292 (2010), 21–38.
[42] Zhilu Zhang and Mert Sabuncu. 2018. Generalized Cross Entropy Loss for Training Deep Neural Networks with Noisy Labels. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 8778–8788.
[43] Zhi-Hua Zhou. 2017. A Brief Introduction to Weakly Supervised Learning. National Science Review 5, 1 (2017), 44–53.
[44] Xingquan Zhu and Xindong Wu. 2004. Class Noise vs. Attribute Noise: A Quantitative Study of Their Impacts. Artificial Intelligence Review 22, 3 (2004), 177–210.