
Importance Reweighting for Biquality Learning

Pierre Nodet
Orange Labs; AgroParisTech, INRAe
46 av. de la République, Châtillon, France

Vincent Lemaire
Orange Labs
2 av. P. Marzin, Lannion, France

Alexis Bondu
Orange Labs
46 av. de la République, Châtillon, France

Antoine Cornuéjols
UMR MIA-Paris, AgroParisTech, INRAe, Université Paris-Saclay
16 r. Claude Bernard, Paris, France

Adam Ouorou
Orange Labs
46 av. de la République, Châtillon, France

Abstract—The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies", namely: poor quality, non-adaptability, and insufficient quantity of labels. Regarding quality, label noise can be of different types: completely-at-random, at-random, or even not-at-random. All these kinds of label noise are addressed separately in the literature, leading to highly specialized approaches. This paper proposes an original, encompassing view of Weakly Supervised Learning, which results in the design of generic approaches capable of dealing with any kind of label noise. For this purpose, an alternative setting called "Biquality data" is used. It assumes that a small trusted dataset of correctly labeled examples is available, in addition to an untrusted dataset of noisy examples. In this paper, we propose a new reweighting scheme capable of identifying noncorrupted examples in the untrusted dataset. This allows one to learn classifiers using both datasets. Extensive experiments that simulate several types of label noise, and that vary the quality and quantity of untrusted examples, demonstrate that the proposed approach outperforms baselines and state-of-the-art approaches.

Index Terms—Supervised Classification, Weakly Supervised Learning, Biquality Learning, Trusted data, Label noise

I. INTRODUCTION

The supervised classification problem aims to learn a classifier from a set of labeled training examples in order to predict the class of new examples. In practice, conventional classification techniques may fail because of the imperfections of real-world datasets. Accordingly, the field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of "supervision deficiencies" [1], namely:

Insufficient quantity: when many training examples are available, but only a small portion is labeled, e.g. due to the cost of labelling. For instance, this occurs in the field of cyber security, where human forensics is needed to label attacks. Usually, this issue is addressed by semi-supervised learning (SSL) [2] or active learning (AL) [3].

Poor quality labels: in this case, all training examples are labeled but the labels may be corrupted. This may happen when the labeling task is outsourced to crowd labeling. Robust Learning to Label Noise (RLL) approaches address this problem [4], with three identified types of label noise: i) completely-at-random noise, which corresponds to a uniform probability of label change; ii) at-random label noise, where the probability of label change depends upon each class, with uniform label changes within each class; iii) not-at-random label noise, where the probability of label change varies over the input space of the classifier. This last type of label noise is recognized as the most difficult to deal with [5], [6].

Inappropriate labels: for instance, in Multi-Instance Learning (MIL) [7] the labels are assigned to bags of examples, with a positive label indicating that at least one example of the bag is positive. Some scenarios in Transfer Learning (TL) [8] imply that only the labels in the source domain are provided while the target domain labels are not. Often, these non-adapted labels are associated with slightly different learning tasks (e.g. more precise and numerous classes dividing the original categories). Alternatively, non-adapted labels may characterize a differing statistical individual [9] (e.g. a subpart of an image instead of the entire image).

All these types of supervision deficiencies are addressed separately in the literature, leading to highly specialized approaches. In practice, it is very difficult to identify the type(s) of deficiencies with which a real dataset is associated. For this reason, we argue that it would be very useful to find a unified framework for Weakly Supervised Learning, in order to design generic approaches capable of dealing with any type of supervision deficiency.

In Section II of this paper, we present "biquality data", an alternative WSL setting allowing a unified view of weakly supervised learning. A generic learning framework using the two training sets of biquality data (one trusted and the other untrusted) is suggested in Section III. We identify three possible ways of implementing this framework and consider one of them: Section IV proposes a new approach based on example reweighting. The effectiveness of this new approach in dealing with different types of supervision deficiencies, without a priori knowledge about them, is demonstrated through experiments with real datasets in Sections V and VI. Finally, perspectives and future work are discussed in Section VII.

II. BIQUALITY DATA

This section presents an alternative setting called "Biquality Data" which covers a large range of supervision deficiencies and allows for unifying the WSL approaches. The interested reader may find a more detailed introduction on WSL and its links with Biquality Learning in [10].

arXiv:2010.09621v4 [cs.LG] 26 Apr 2021

Learning using biquality data has recently been put forward

in [11]–[13] and consists in learning a classiﬁer from two

distinct training sets, one trusted and the other untrusted. The

initial motivation was to unify semi-supervised and robust

learning through a combination of the two. We show in this

paper that this scenario is not limited to this uniﬁcation and

that it can cover a larger range of supervision deﬁciencies as

demonstrated with the algorithms we propose and the obtained

results.

The trusted dataset D_T consists of pairs of labeled examples (x_i, y_i) where all labels y_i ∈ Y are supposed to be correct according to the true underlying conditional distribution P_T(Y|X). In the untrusted dataset D_U, examples x_i may be associated with incorrect labels. We note P_U(Y|X) the corresponding conditional distribution.

At this stage, no assumption is made about the nature of the supervision deficiencies, which could be of any type, including label noise, missing labels, concept drift, non-adapted labels... and more generally a mixture of these supervision deficiencies.

The difficulty of a learning task performed on biquality data can be characterised by two quantities. First, the ratio of trusted data over the whole dataset, denoted by p:

$$p = \frac{|D_T|}{|D_T| + |D_U|} \quad (1)$$

Second, a measure of the quality, denoted by q, which evaluates the usefulness of the untrusted data D_U to learn the trusted concept P_T(Y|X), where q ∈ [0, 1] and 1 indicates high quality. For example, in [13] q is defined using a ratio of Kullback-Leibler divergences between P_T(Y|X) and P_U(Y|X).

Fig. 1. The different learning tasks covered by the biquality setting, represented in a 2D (p, q) plane.

The biquality setting covers a wide range of learning tasks by varying the quantities q and p (as represented in Figure 1):

• When (p = 1 OR q = 1)¹, all examples can be trusted. Thus, this setting corresponds to a standard supervised learning (SL) task.

¹ p = 1 ⟹ D_U = ∅ ⟹ q = 1

• When (p = 0 AND q = 0), there are no trusted examples and the untrusted labels are not informative. We are left with only the inputs {x_i}_{1≤i≤m}, as in unsupervised learning (UL).

• On the vertical axis defined by q = 0, except for the two points (p, q) = (0, 0) and (p, q) = (1, 0), the untrusted labels are not informative, and trusted examples are available. The learning task becomes semi-supervised learning (SSL), with the untrusted examples as unlabeled and the trusted ones as labeled.

• An upward move on the vertical axis, from a point (p, q) = (ε, 0) characterized by a low proportion of labeled examples p = ε, to a point (p′, 0) with p′ > p, corresponds to Active Learning, where an oracle provides new labels asked by a given strategy. The same upward move can also be realized in Self-training and Co-training [14], where unlabeled training examples are labeled using the predictions of the current classifier(s).

• On the horizontal axis defined by p = 0, except for the points (p, q) = (0, 0) and (p, q) = (0, 1), only untrusted examples are provided, which corresponds to the range of learning tasks typically addressed by Robust Learning to Label noise (RLL) approaches.

Only the edges of Figure 1 have been envisioned in previous works – i.e. the points mentioned above – and a whole new range of problems is addressed in this paper. Moreover, biquality learning may be used to tackle tasks belonging to WSL, for instance:

• Positive Unlabeled Learning (PUL) [15], where only positive (trusted) and unlabeled instances are available, the latter of which can be considered as untrusted.

• Self-training and Co-training [14] could be addressed at the end of the self-labeling process: the initial training set is then the trusted dataset, and all self-labeled examples can be considered as the untrusted ones.

• Concept drift [16]: when a concept drift occurs, all the examples used before a detected drift may be considered as the untrusted examples, while the examples available after it are viewed as the trusted ones, assuming a perfect labeling process.

• Self-supervised learning systems, as illustrated by Snorkel [17] or Snuba [18]: the small initial training set can be the trusted dataset, whereas all examples automatically labeled using the labeling functions may be considered as untrusted.

As can be seen from the above list, the Biquality framework is quite general and its investigation seems a promising avenue to unify different aspects of Weakly Supervised Learning. A main contribution of this paper is to suggest one generic framework for achieving biquality learning and thus covering many facets of WSL. This is presented in the next section. This framework will then be applied, in the experimental part of this paper, to the problem of label noise.

III. BIQUALITY LEARNING

Learning the true concept² P_T(Y|X) on D = D_T ∪ D_U means minimizing the risk R on D with a loss L for a probabilistic classifier f:

$$R_{D,L}(f) = \mathbb{E}_{D,(X,Y)\sim T}[L(f(X), Y)] = P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] + P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim T}[L(f(X), Y)] \quad (2)$$

where L(·,·) is a loss function from R^{|Y|} × Y to R, since f(X) is a vector of probabilities over the classes. Since the true concept P_T(Y|X) cannot be learned from D_U, the last term of Equation 2 is not tractable as it stands. That is why we propose a generic formalization based on a mapping function g that enables us to learn the true concept from the modified untrusted examples of D_U. Equation 2 becomes:

$$R_{D,L}(f) = P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] + \lambda\, P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim U}[g(L(f(X), Y))] \quad (3)$$

In Equation 3, the parameter λ ∈ [0, 1] reflects the quality of the untrusted examples of D_U modified by the function g. This time, the last term is tractable, since it consists of a risk expectancy estimated over the training examples of D_U, which follow the untrusted concept P_U(Y|X), modified by the function g.

Accordingly, the estimation of the expected risk requires learning three items: g, λ and then f. To learn g, a mapping function between the two datasets, both D_T and D_U are used. Then, either λ is considered as a hyperparameter to be learned using D_T, or λ is provided by an appropriate quality measure and is considered as an input of the learning algorithm. Finally, f is learned by minimizing the risk R on D using the mapping g.

In this formalization, the mapping function g plays a central role. Not exhaustively, we identify three different ways of designing the mapping function. For each of these, a different function g₀ enters the definition of the function g:

• The first option consists in correcting the label of each untrusted example of D_U. The mapping function thus takes the form g(L(f(X), Y)) = L(f(X), g₀(Y, X)), with g₀(Y, X) denoting the new corrected labels and f(X) the predictions of the classifier.

• In the second option, the untrusted labels are used unchanged. The untrusted examples X are moved in the input space to where the untrusted labels become correct with respect to the true underlying concept. The mapping function becomes g(L(f(X), Y)) = L(f(g₀(X)), Y), where g₀(X) is the "moved" input vector of the modified untrusted examples.

• In the last option, g₀ weights the contribution of the untrusted examples in the risk estimate. Accordingly, we have g(L(f(X), Y)) = g₀(Y, X) L(f(X), Y). In this case, the parameter λ may disappear from Equation 3, since it can be considered as included in the function g₀.

² For reasons of space, we denote P_T(Y|X) by T and P_U(Y|X) by U.

Section IV considers the last option in depth and proposes a new approach where g₀ acts as an Importance Reweighting for Biquality Learning.

IV. A NEW IMPORTANCE REWEIGHTING APPROACH FOR BIQUALITY LEARNING

To estimate the mapping function g₀, we suggest adapting the importance reweighting trick from the covariate shift literature [19] to biquality learning. This trick relies on reweighting untrusted samples using the Radon-Nikodym derivative (RND) [20] of P_T(X, Y) with respect to P_U(X, Y), which is dP_T(X, Y)/dP_U(X, Y). Contrary to the "covariate shift" setting, the biquality setting handles the same distribution P(X) in the trusted and untrusted datasets. However, the two underlying concepts P_T(Y|X) and P_U(Y|X) are possibly different due to a supervision deficiency. Using these assumptions and the Bayes formula, we can further simplify the reweighting function to the RND of P_T(Y|X) with respect to P_U(Y|X), dP_T(Y|X)/dP_U(Y|X).

$$\begin{aligned}
R_{(X,Y)\sim T, L}(f) &= \mathbb{E}_{(X,Y)\sim T}[L(f(X), Y)] \\
&= \int L(f(X), Y)\, dP_T(X, Y) \\
&= \int \frac{dP_T(X, Y)}{dP_U(X, Y)}\, L(f(X), Y)\, dP_U(X, Y) \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(X, Y)}{P_U(X, Y)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(Y|X)\,P(X)}{P_U(Y|X)\,P(X)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(Y|X)}{P_U(Y|X)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}[\beta\, L(f(X), Y)] \\
&= R_{(X,Y)\sim U, \beta L}(f)
\end{aligned} \quad (4)$$

Equation 4 shows that β = P_T(Y|X)/P_U(Y|X) is an estimation of the mapping function g₀. Following Section III, estimating β is the last step before an actual Biquality Learning algorithm.

Algorithm: Importance Reweighting for Biquality Learning (IRBL)

Input: Trusted Dataset D_T, Untrusted Dataset D_U, Probabilistic Classifier Family F
1: Learn f_U ∈ F on D_U
2: Learn f_T ∈ F on D_T
3: for (x_i, y_i) ∈ D_U, where y_i ∈ [[1, K]] do
4:   β̂(x_i, y_i) = [f_T(x_i) / f_U(x_i)]_{y_i}
5: for (x_i, y_i) ∈ D_T do
6:   β̂(x_i, y_i) = 1
7: Learn f ∈ F on D_T ∪ D_U with weights β̂
Output: f

The proposed algorithm, Importance Reweighting for Biquality Learning (IRBL), aims at estimating β from D_T and D_U, whatever the unknown supervision deficiency. It consists of two successive steps. First, a probabilistic classifier f_T is learned from the trusted dataset D_T, and another probabilistic classifier f_U is learned from the untrusted dataset D_U. Thanks to their probabilistic nature, each of them estimates P_T(Y|X) and P_U(Y|X), respectively, by a probability distribution over the set of the K classes. Thus we can estimate the weight β of an untrusted sample (x_i, y_i) by dividing the prediction of f_T(x_i) by f_U(x_i) for the y_i class (see line 4). The weight β for all trusted samples is fixed to 1 (see line 6). Then a final classifier is learned from both datasets D_T and D_U, with examples reweighted by β̂.
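The two-step procedure above can be sketched with scikit-learn. This is a minimal sketch under the assumptions that both datasets contain every class and that `LogisticRegression` stands in for any probabilistic classifier family; the function and variable names are ours:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def irbl(X_t, y_t, X_u, y_u, make_clf=lambda: LogisticRegression()):
    """Sketch of IRBL: weight untrusted examples by beta = P_T(y|x) / P_U(y|x)."""
    f_T = make_clf().fit(X_t, y_t)  # classifier on trusted data
    f_U = make_clf().fit(X_u, y_u)  # classifier on untrusted data

    # beta for each untrusted example: ratio of the predicted probabilities
    # of its observed (possibly corrupted) label under f_T and under f_U
    idx = np.searchsorted(f_T.classes_, y_u)
    rows = np.arange(len(y_u))
    beta = (f_T.predict_proba(X_u)[rows, idx]
            / np.clip(f_U.predict_proba(X_u)[rows, idx], 1e-12, None))

    # final classifier on the union; trusted examples keep weight 1
    X = np.vstack([X_t, X_u])
    y = np.concatenate([y_t, y_u])
    w = np.concatenate([np.ones(len(y_t)), beta])
    return make_clf().fit(X, y, sample_weight=w), beta
```

On synthetic data with flipped untrusted labels, corrupted examples receive markedly smaller β̂ weights than clean ones, which is the behavior analyzed in Section VI-A.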

Our algorithm is theoretically grounded, since it is asymptotically equivalent to minimizing the risk on the true concept using the entire data set (see proof in Equation 5).

$$\begin{aligned}
\hat{R}_{D,\hat{\beta} L}(f) &= \frac{1}{|D|} \sum_{(x_i,y_i)\in D} \mathbb{1}_{(x_i,y_i)\in D_T}\, L(f(x_i), y_i) + \mathbb{1}_{(x_i,y_i)\in D_U}\, \hat{\beta}(x_i, y_i)\, L(f(x_i), y_i) \\
&= \frac{1}{|D_T|+|D_U|} \sum_{(x_i,y_i)\in D_T} L(f(x_i), y_i) + \frac{1}{|D_T|+|D_U|} \sum_{(x_i,y_i)\in D_U} L(f(x_i), y_i)\, \hat{\beta}(x_i, y_i) \\
&= \frac{p}{|D_T|} \sum_{(x_i,y_i)\in D_T} L(f(x_i), y_i) + \frac{1-p}{|D_U|} \sum_{(x_i,y_i)\in D_U} L(f(x_i), y_i)\, \hat{\beta}(x_i, y_i) \\
&= p\,\hat{R}_{D_T,L}(f) + (1-p)\,\hat{R}_{D_U,\hat{\beta} L}(f) \\
&\approx p\,\hat{R}_{D_T,L}(f) + (1-p)\,\hat{R}_{D_T,L}(f) \\
&\approx \hat{R}_{D_T,L}(f)
\end{aligned} \quad (5)$$

The proof in Equation 5 is an asymptotic result: in practice, our algorithm relies on the quality of the estimation of P_T(Y|X) and P_U(Y|X) in order to be efficient. In the biquality setting, they both could be hard to estimate because of the small size of D_T and the poor quality of D_U.

V. EXPERIMENTS

The aim of the experiments is to answer the following questions: i) is our algorithm properly designed, and does it perform better than baseline approaches? ii) is our algorithm competitive with state-of-the-art approaches?

First, Section V-A presents the supervision deficiencies which are simulated in our experiments. They correspond to two different kinds of weak supervision, namely, Noisy label Completely at Random (i.e. not X-dependent) and Noisy label Not at Random (i.e. X-dependent). According to Frénay's taxonomy [4], the former is the easiest to deal with and the latter is often considered difficult to manage. Then, Section V-B consists of three parts: a presentation of the baseline competitors, a brief report on the state-of-the-art competitors, and a description of the set of classifiers used. Finally, Section V-C describes the datasets used in the experiments and the chosen criterion to evaluate the learned classifiers. For full reproducibility, source code, datasets and results are available at: https://github.com/pierrenodet/irbl.

A. Simulated supervision deﬁciencies

The datasets listed in Section V-C consist of a collection of training examples that are assumed to be correctly labeled, denoted by D_total. In order to obtain a trusted dataset D_T and an untrusted one D_U, each dataset is split in two parts using a stratified random draw, where p is the percentage for the trusted part. The trusted datasets are left untouched, whereas corrupted labels are simulated in the untrusted datasets using two different techniques:

a) Noisy Completely At Random (NCAR): Corrupted untrusted examples are uniformly drawn from D_U with a probability r, and are assigned a random label that is also uniformly drawn from Y.

In the particular case of binary classification problems, the conditional distribution of the untrusted labels is defined by Equation 6:

$$\forall y \in \mathcal{Y},\quad P_U(Y = y \mid X) = \frac{r}{2} + (1 - r)\, P_T(Y = y \mid X) \quad (6)$$

Here, r controls the overall number of random labels and thus is our proxy for the quality: q = 1 − r.
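The NCAR corruption above can be sketched in a few lines (a minimal sketch; the function and parameter names are ours, and, as in Equation 6, the uniformly drawn replacement label may coincide with the original one):

```python
import numpy as np

def ncar_noise(y, r, classes, seed=0):
    """Draw each example for corruption with probability r, then assign it
    a label drawn uniformly from `classes` (it may equal the original)."""
    rng = np.random.default_rng(seed)
    y_noisy = np.array(y, copy=True)
    corrupt = rng.random(len(y_noisy)) < r      # examples drawn with probability r
    y_noisy[corrupt] = rng.choice(classes, size=corrupt.sum())
    return y_noisy
```

With r = 0 the labels are untouched; with r = 1 the labels become uniform over Y, which is the q = 0 case studied in Section VI-A.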

b) Noisy Not At Random (NNAR): Corrupted untrusted examples are drawn from D_U with a probability r(x) that depends on the instance value. To generate an instance-dependent label noise, we design a noise that depends on the decision boundary of a classifier f_total learned from D_total. The probability of a random label r(x) should be high when an instance x is close to the decision boundary, and low when it is far. In our experiments, the probability outputs of f_total are used to model our label noise as follows:

$$\forall x \in \mathcal{X},\quad r(x) = 1 - \theta\, \left|1 - 2 f_{total}(x)\right|^{\frac{1}{\theta}} \quad (7)$$

where θ ∈ [0, 1] is a constant that controls the overall number of random labels and thus is our proxy for the quality: q = θ. The parameter θ influences both the slope (factor) and the curvature (power) of r(x), so as to modify the area under the curve of r(x): E[r(x)].

For binary classification problems, the conditional distribution of the untrusted labels is defined by Equation 8:

$$\forall x \in \mathcal{X},\ \forall y \in \mathcal{Y},\quad P_U(Y = y \mid X = x) = \frac{r(x)}{2} + (1 - r(x))\, P_T(Y = y \mid X = x) \quad (8)$$
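Equation 7 can be sketched directly (a minimal sketch; `p_pos` stands for the probability output f_total(x) of the classifier learned on the clean data, and the function name is ours):

```python
import numpy as np

def nnar_flip_prob(p_pos, theta):
    """r(x) = 1 - theta * |1 - 2 f_total(x)|**(1/theta)   (Equation 7).
    Highest (r = 1) on the decision boundary, where f_total(x) = 0.5,
    and lowest (r = 1 - theta) far from it, where f_total(x) is 0 or 1."""
    p_pos = np.asarray(p_pos, dtype=float)
    return 1.0 - theta * np.abs(1.0 - 2.0 * p_pos) ** (1.0 / theta)
```

The noise probability therefore decreases monotonically as the instance moves away from the decision boundary, which is the intended instance-dependent behavior.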

B. Competitors

a) Baseline competitors: The first part of our experiments consists of a sanity check which compares the performance of the proposed algorithm to the following baselines:

• Trusted: The final classifier f obtained with our algorithm should be better than a classifier f_T learned only from the trusted dataset, insofar as untrusted data bring useful information about the trusted concept. At the least, f should not be worse than using only trusted data.

• Mixed: The final classifier f should be better than a classifier f_mixed learned from both the trusted and untrusted datasets without correction. A biquality learning algorithm should leverage the information provided by having two distinct datasets.

• Untrusted: The final classifier should be better than a classifier f_U that learns only from the untrusted dataset, if there are trusted labels. Using trusted data should improve the classifier's final performance.

b) State-of-the-art competitors: The second part of our experiments compares our algorithm with two state-of-the-art methods: (i) a method from the Robust Learning to Label noise (RLL) [21], [22] family, and (ii) the GLC approach [12].

• RLL: In recent literature, a new emphasis is put on the search for new loss functions that are conducive to better risk minimization in the presence of noisy labels. For example, [21], [22] show theoretically and experimentally that when the loss function satisfies a symmetry condition, described below, this contributes to the robustness of the classifier. Accordingly, in this paper we train a classifier with a symmetric loss function as a competitor. This first competitor is expected to have good results on the completely-at-random label noise described in Section V-A. A loss function L_s is said to be symmetric if $\sum_{y \in \{-1, 1\}} L_s(f(x), y) = c$, where c is a constant and f(x) is the score on the class y. This loss function is used on D_T ∪ D_U.
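As a concrete instance, the Unhinged loss used for this competitor (see Section V-B, c) satisfies the symmetry condition; a quick numeric check:

```python
def unhinged_loss(score, y):
    """Unhinged loss for binary labels y in {-1, +1}: L(s, y) = 1 - y * s.
    It is symmetric: L(s, +1) + L(s, -1) = 2 for every score s."""
    return 1.0 - y * score

# the sum over both labels is the constant c = 2, whatever the score
for s in (-3.0, -0.5, 0.0, 0.7, 2.0):
    assert unhinged_loss(s, +1) + unhinged_loss(s, -1) == 2.0
```

Because the sum over the two labels is constant, uniform label flipping shifts the empirical risk by a constant and leaves its minimizer unchanged, which is the intuition behind the robustness result of [21], [22].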

• GLC: To the best of our knowledge, GLC [12] is among the best performing algorithms that can learn from biquality data. It has been successfully compared to many competing approaches. Like ours, it is a two-step approach which is simple and easy to implement.

In a first step, a model f_U is learned from the untrusted dataset D_U. It is then used to estimate a transition matrix C of P_{U|T}(Y) by making probabilistic predictions with f_U on the trusted dataset D_T and comparing them to the trusted labels.

In a second step, this matrix is used to correct the labels from the untrusted dataset D_U when learning the final model f. Indeed, f is learned with L on D_T and with L(C^⊤ f(X), Y) on D_U.

c) Classifiers: First of all, the choice of classifiers was guided by the idea of comparing algorithms for biquality learning, not searching for the best classifiers. This choice was also guided by the nature of the datasets used in the experiments (see Section V-C). Secondly, our algorithm, as well as GLC, implies two learning phases. For both reasons, and for simplicity, we decided to use Logistic Regression (LR) for each phase. LR is known to be limited in the sense of the Vapnik-Chervonenkis dimension [23], since it can only learn linear separations of the input space X, which could underfit the conditional probabilities P(Y|X) on D_T and D_U and lead to bad β estimations. But this impediment, if met, will affect all the compared algorithms equally. LR is also used for the RLL classifier, using the Unhinged symmetric loss function.

To obtain reliable estimations of the conditional probabilities P(Y|X), the outputs of all classifiers have been calibrated using Isotonic Regression with the default parameters provided by scikit-learn [24].

Logistic Regression is always used and learned with SGD, with a learning rate of 0.005, a weight decay of 10⁻⁶, 20 epochs, and a batch size of 24, with PyTorch [25].
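The calibration step can be sketched with scikit-learn. This is a sketch only: the SGD-trained PyTorch model of our experiments is replaced by scikit-learn's `LogisticRegression` for brevity, and the data are synthetic:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import LogisticRegression

# synthetic binary problem: two Gaussian blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(1, 1, (200, 2)), rng.normal(-1, 1, (200, 2))])
y = np.array([1] * 200 + [0] * 200)

# wrap the base classifier so its probability outputs are
# calibrated with isotonic regression (cross-validated fit)
calibrated = CalibratedClassifierCV(LogisticRegression(), method="isotonic", cv=3)
calibrated.fit(X, y)
proba = calibrated.predict_proba(X)  # calibrated P(Y|X) estimates
```

The calibrated probabilities are what IRBL divides to obtain β̂, so the quality of this step directly conditions the weights.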

C. Datasets

In industrial applications familiar to us, such as fraud detection, Customer Relationship Management (CRM) and churn prediction, we are mostly faced with binary classification problems. The available data is of average size in terms of the number of explanatory variables and involves mixed variables (numerical and categorical).

For this reason, we limited the experiments in this paper to binary classification tasks, even though our algorithm can address multi-class problems. The chosen tabular datasets, used for the experiments, have characteristics similar to those of our real applications.

TABLE I
BINARY CLASSIFICATION DATASETS USED FOR THE EVALUATION. COLUMNS: NUMBER OF EXAMPLES (|D|), NUMBER OF FEATURES (|X|), AND RATIO OF EXAMPLES FROM THE MINORITY CLASS (MIN).

name     | |D|     | |X|  | min | name     | |D|     | |X| | min
4class   | 862     | 2    | 36  | ibnsina  | 20,722  | 92  | 38
ad       | 3,278   | 1558 | 14  | zebra    | 61,488  | 154 | 4.6
adult    | 48,842  | 14   | 23  | musk     | 6,598   | 169 | 15
aus      | 690     | 14   | 44  | phishing | 11,055  | 30  | 44
banknote | 1,372   | 4    | 44  | spam     | 4,601   | 57  | 39
breast   | 683     | 9    | 35  | ijcnn1   | 141,691 | 22  | 9
eeg      | 1,498   | 13   | 45  | svmg3    | 1,284   | 4   | 26
diabetes | 768     | 8    | 35  | svmg1    | 7,089   | 22  | 43
german   | 1,000   | 20   | 30  | sylva    | 145,252 | 108 | 6.5
hiva     | 42,678  | 1617 | 3.5 | web      | 49,749  | 300 | 3

They come from different sources: UCI [26], libsvm³ and the active learning challenge [27]. A part of these datasets comes from past challenges on active learning, where high performance with a low number of labeled examples has proved difficult to obtain. For each dataset, 80% of the samples were used for training and 20% for the test. With this choice of datasets, a large range of class ratios is covered: Australian is almost balanced while Web is really unbalanced. Also, the size varies significantly in number of rows or columns, with a corresponding impact on the difficulty of the learning tasks.

VI. RESULTS

The empirical performance of our approach is evaluated in two steps. First, we investigate the efficiency of the reweighting scheme and its influence on the learning procedure of the final classifier. Second, our approach is benchmarked against competitors to evaluate its efficiency on real tasks.

³ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

A. Behavior of the IRBL method

In order to illustrate the proposed reweighting scheme, we picked a dataset, here the "ad" dataset used with a ratio of trusted data p = 0.25, and examined the histogram of the weights assigned to each untrusted example, either corrupted or not. The Noisy label Completely at Random case is chosen, and the hardest setting, where all labels are random (q = 0), is considered.

Figure 2 shows the histogram of the weights assigned to each untrusted example, either corrupted or not. It is clear that the proposed method is able to distinguish corrupted from noncorrupted labels in the untrusted dataset. Figure 3 confirms this behavior when varying the value of the quality. For a perfect quality, the distribution of the β is unimodal, with a median equal to one and a very narrow interquartile range, whereas, when the quality drops, the distribution of the β for the corrupted labels decreases to zero.

Fig. 2. Histogram of the β values on AD for p = 0.25 and q = 0 for NCAR, for the corrupted and noncorrupted examples.

Fig. 3. Boxplot of the β values on AD for p = 0.25 versus the quality, from q = 0 to q = 1, for NCAR.

It is equally interesting to look at the classification error when q, the quality of the untrusted data, varies. Figure 4 reports the performance of the proposed method and of the baseline competitors. It is remarkable that the performance of our algorithm, IRBL, remains stable when q decreases, while the performance of the mixed and untrusted algorithms worsens. In addition, IRBL always obtains better performance than the trusted baseline.

Fig. 4. Classification error on the test set for IRBL against baselines over the full range of quality levels (AD dataset, p = 0.25, NCAR).

B. Comparison with competitors

For a first global comparison, two critical diagrams are presented in Figures 5 and 6, which rank the various methods for the NCAR and NNAR label noise. The Nemenyi test [28] is used to rank the approaches in terms of mean accuracy, calculated for all values of p and q and over all the 20 datasets described in Section V-C. The Nemenyi test consists of two successive steps. First, the Friedman test is applied to the mean accuracy of the competing approaches to determine whether their overall performances are similar. Second, if not, the post-hoc test is applied to determine groups of approaches whose overall performance is significantly different from that of the other groups.
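The two steps can be sketched as follows. This is a sketch under our own naming: `scores` is a hypothetical datasets × methods accuracy matrix; scipy provides the Friedman test, and the Nemenyi post-hoc step then compares the mean ranks computed here against a critical difference:

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

def friedman_and_ranks(scores):
    """scores: (n_datasets, n_methods) accuracies, higher is better.
    Returns the Friedman p-value and the mean rank of each method
    (rank 1 = best); the Nemenyi post-hoc step then checks whether
    differences in mean rank exceed a critical difference."""
    stat, p_value = friedmanchisquare(*scores.T)
    ranks = np.vstack([rankdata(-row) for row in scores])  # rank methods per dataset
    return p_value, ranks.mean(axis=0)
```

If the Friedman p-value is below the chosen level, the mean ranks are the quantities plotted on the critical diagrams of Figures 5 and 6.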

Fig. 5. Nemenyi test for the 20 datasets ∀p, q for NCAR.

Fig. 6. Nemenyi test for the 20 datasets ∀p, q for NNAR.

These figures show that the IRBL method is ranked first for the two kinds of label noise and provides better performance than the other competitors. Table II provides a more detailed perspective by reporting the mean accuracy and its standard deviation. These values are computed for different values of p, over all qualities q and all datasets. This table also helps to see how the methods fare compared to learning on perfect

TABLE II
MEAN ACCURACY (SCORE RESCALED TO BE FROM 0 TO 100) AND STANDARD DEVIATION COMPUTED ON THE 20 DATASETS ∀q FOR (1) NCAR AND (2) NNAR. THE MEAN ACC WHEN USING ALL THE TRAINING DATA WITHOUT NOISE IS 88.65.

     p    | trusted      | irbl         | mixed         | glc          | rll
(1) 0.02  | 72.48 ± 5.70 | 83.46 ± 3.56 | 83.40 ± 8.30  | 78.34 ± 7.94 | 77.94 ± 6.37
    0.05  | 78.50 ± 4.33 | 84.94 ± 2.24 | 83.85 ± 7.35  | 81.19 ± 5.15 | 77.97 ± 6.44
    0.10  | 81.40 ± 3.33 | 86.56 ± 1.68 | 85.44 ± 5.34  | 83.00 ± 3.90 | 78.98 ± 5.26
    0.25  | 85.61 ± 2.39 | 87.96 ± 1.18 | 86.99 ± 2.80  | 86.27 ± 2.03 | 79.86 ± 2.61
(2) 0.02  | 72.48 ± 5.70 | 82.93 ± 3.18 | 81.30 ± 10.05 | 77.55 ± 7.78 | 75.47 ± 9.47
    0.05  | 78.50 ± 4.33 | 85.34 ± 2.55 | 82.52 ± 7.72  | 80.77 ± 5.04 | 76.94 ± 6.64
    0.10  | 81.40 ± 3.33 | 86.82 ± 1.45 | 84.44 ± 5.14  | 83.22 ± 4.10 | 77.95 ± 4.51
    0.25  | 85.61 ± 2.39 | 88.21 ± 1.05 | 86.74 ± 2.56  | 86.56 ± 2.00 | 79.67 ± 2.70
Mean      | 79.50 ± 3.94 | 85.71 ± 2.11 | 84.33 ± 6.16  | 82.11 ± 4.74 | 78.10 ± 5.50

(a) IRBL vs Mixed for NCAR  (b) IRBL vs RLL for NCAR  (c) IRBL vs GLC for NCAR
(d) IRBL vs Mixed for NNAR  (e) IRBL vs RLL for NNAR  (f) IRBL vs GLC for NNAR

Fig. 7. Results of the Wilcoxon signed-rank test computed on the 20 datasets. Each figure compares IRBL versus one of the competitors. Figures a, b, c cover the case of Noisy label Completely at Random, and Figures d, e, f the case of Noisy label Not at Random. In each figure, "◦", "·" and "•" indicate respectively a win, a tie or a loss of IRBL compared to the competitor; the vertical axis is p and the horizontal axis is q.

data. Overall, IRBL obtains the best results, and with lower variability.

To get more refined results, the Wilcoxon signed-rank test [29] is used⁴. It enables us to find out under which conditions – i.e. by varying the values of p and q – IRBL performs better or worse than the competitors.
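The per-(p, q) comparison can be sketched with scipy (a sketch under our own naming: `acc_irbl` and `acc_comp` are hypothetical per-dataset mean accuracies of the two methods at a fixed (p, q)):

```python
import numpy as np
from scipy.stats import wilcoxon

def win_tie_loss(acc_irbl, acc_comp, alpha=0.05):
    """Paired Wilcoxon signed-rank test over the per-dataset accuracies:
    'win'/'loss' if the difference is significant at level alpha, else 'tie'."""
    diff = np.asarray(acc_irbl) - np.asarray(acc_comp)
    if np.all(diff == 0):                 # wilcoxon rejects all-zero differences
        return "tie"
    _, p_value = wilcoxon(acc_irbl, acc_comp)
    if p_value >= alpha:
        return "tie"
    return "win" if diff.mean() > 0 else "loss"
```

Each cell of Figure 7 corresponds to one such call at a given (p, q), marked "◦", "·" or "•" for win, tie or loss respectively.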

Figure 7 presents six graphics, each reporting the Wilcoxon test that evaluates our approach against a competitor, based on the mean accuracy over the 20 datasets. The two types of label noise (see Section V-A) correspond to the rows in Figure 7, and a wide range of q and p values is considered.

Thanks to these graphs, we can compare our method (IRBL) in more detail with the mixed method, as well as with RLL and GLC. Regarding the mixed method, Figures 7 (a) and (d) report the results obtained for varying values of p and q. For low quality values q, whatever the value of p, IRBL is significantly better. For middle values of the quality there is no winner, and for high quality values and low values of p, the mixed method is significantly better (this result seems to be observed in [12] as well). This is not surprising, since at high quality values the mixed baseline is equivalent to learning with perfect labels.

⁴ Here the test is used with a confidence level of 5%.

These detailed results help us to understand why, in the

critical diagram in Figure 5, although IRBL has a better

ranking, it is not signiﬁcantly better than the mixed method:

mainly because of the presence of high quality value cases.

Regarding the competitors RLL and GLC, Figures 7(b), 7(c), 7(e) and 7(f) show that IRBL always has better or indistinguishable performance. Indeed, IRBL performs well regardless of the type of noise. This is an important result, since it shows that we are able to deal not only with NCAR noise but also with instance-dependent label noise (NNAR), which is more difficult. As expected, the method RLL gets more ties with IRBL on NCAR than on NNAR. It is noteworthy that GLC ties with IRBL when the quality is high, whatever the label noise.
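The completely-at-random corruption discussed above amounts to flipping each label, independently of the instance, with some fixed probability. A minimal sketch follows; the `flip_labels_ncar` helper and the flip rate are illustrative, not the paper's exact simulation protocol.

```python
import numpy as np

def flip_labels_ncar(y, flip_prob, n_classes, rng):
    """Flip each label to a different class uniformly at random,
    independently of the instance (NCAR noise)."""
    y = np.asarray(y).copy()
    flip_mask = rng.random(len(y)) < flip_prob
    for i in np.flatnonzero(flip_mask):
        # Draw the new label among the other classes only.
        others = [c for c in range(n_classes) if c != y[i]]
        y[i] = rng.choice(others)
    return y

rng = np.random.default_rng(0)
y_clean = rng.integers(0, 2, size=1000)
y_noisy = flip_labels_ncar(y_clean, flip_prob=0.3, n_classes=2, rng=rng)

# The empirical noise rate should be close to the requested flip probability.
noise_rate = (y_clean != y_noisy).mean()
print(noise_rate)
```

NNAR noise, by contrast, would make `flip_prob` a function of the instance features, which is what makes it harder to correct for.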

To sum up, the proposed method has been tested on a wide range of types and strengths of label corruption. In all cases, IRBL has obtained top or competitive results. Consequently, IRBL appears to be a method of choice for applications where biquality learning is needed. Moreover, IRBL has no user parameter and a low computational complexity.

VII. CONCLUSION

This paper has presented an original view of Weakly Supervised Learning and has described a generic approach capable of dealing with any kind of label noise. A formal framework for biquality learning has been developed, in which the empirical risk is minimized on the small set of trusted examples in addition to an appropriately chosen criterion computed on the untrusted examples. We identified three different ways to design a mapping function, leading to three different criteria within the biquality learning framework. We implemented one of them: a new Importance Reweighting approach for Biquality Learning (IRBL). Extensive experiments, simulating completely-at-random and not-at-random label noise over a wide range of quality and ratio values of untrusted data, have shown that IRBL significantly outperforms state-of-the-art approaches.
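The reweighting idea can be sketched as follows: one probabilistic classifier is fit on the trusted set and one on the untrusted set, and each untrusted example is weighted by the ratio of the two estimated conditional label probabilities, so that examples whose labels look corrupted get small weights. This is a minimal sketch under stated assumptions (logistic regression as the probability estimator, toy separable data, a clipping constant `eps`), not the paper's exact implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reweight_untrusted(X_t, y_t, X_u, y_u, eps=1e-6):
    """Weight each untrusted example by an estimate of
    P_trusted(y|x) / P_untrusted(y|x)."""
    clf_t = LogisticRegression().fit(X_t, y_t)  # trusted data only
    clf_u = LogisticRegression().fit(X_u, y_u)  # untrusted data only
    idx = np.arange(len(y_u))
    # Labels are 0..K-1, so the label value indexes the proba column.
    p_t = clf_t.predict_proba(X_u)[idx, y_u]
    p_u = clf_u.predict_proba(X_u)[idx, y_u]
    return p_t / np.maximum(p_u, eps)

# Toy biquality data: a clean trusted set, and an untrusted set whose
# second half has flipped labels.
rng = np.random.default_rng(0)
X_t = rng.normal(size=(200, 2)) + np.array([[2.0, 0.0]])
y_t = (X_t[:, 0] > 1.0).astype(int)
X_u = rng.normal(size=(400, 2)) + np.array([[2.0, 0.0]])
y_u = (X_u[:, 0] > 1.0).astype(int)
y_u[200:] = 1 - y_u[200:]  # corrupt half of the untrusted labels

w = reweight_untrusted(X_t, y_t, X_u, y_u)
# Corrupted examples should receive lower weights on average.
print(w[:200].mean(), w[200:].mean())
```

A final classifier could then be trained on the trusted examples together with the untrusted examples weighted by `w`, e.g. via the `sample_weight` argument of scikit-learn estimators.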

Future work will extend the experiments to multiclass classification datasets and to other classifiers such as Gradient Boosted Trees [30]. An adaptation of IRBL to Deep Learning tasks with an online algorithm will also be studied.

REFERENCES

[1] Z.-H. Zhou, "A brief introduction to weakly supervised learning," National Science Review, vol. 5, no. 1, pp. 44–53, 2017.

[2] O. Chapelle, B. Schölkopf, and A. Zien, "Semi-supervised learning," IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542, 2009.

[3] B. Settles, "Active learning literature survey," University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep., 2009.

[4] B. Frénay and M. Verleysen, "Classification in the presence of label noise: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2013.

[5] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, "Learning with bounded instance and label-dependent label noise," in International Conference on Machine Learning (ICML), vol. 119. PMLR, 2020, pp. 1789–1799.

[6] A. Menon, B. van Rooyen, and N. Natarajan, "Learning from binary labels with instance-dependent corruption," arXiv preprint arXiv:1605.00751, 2016.

[7] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, "Multiple instance learning: A survey of problem characteristics and applications," Pattern Recognition, vol. 77, pp. 329–353, 2018.

[8] K. Weiss, T. M. Khoshgoftaar, and D. Wang, "A survey of transfer learning," Journal of Big Data, vol. 3, no. 1, p. 9, 2016.

[9] D. Conte, P. Foggia, G. Percannella, F. Tufano, and M. Vento, "A method for counting people in crowded scenes," in 2010 7th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2010, pp. 225–232.

[10] P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, and A. Ouorou, "From Weakly Supervised Learning to Biquality Learning: an Introduction," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2021.

[11] M. Charikar, J. Steinhardt, and G. Valiant, "Learning from untrusted data," in Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, 2017, pp. 47–60.

[12] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, "Using trusted data to train deep networks on labels corrupted by severe noise," in Advances in Neural Information Processing Systems 31, 2018, pp. 10456–10465.

[13] R. Hataya and H. Nakayama, "Unifying semi-supervised and robust learning by mixup," in The 2nd Learning from Limited Labeled Data Workshop, ICLR, 2019.

[14] J. Zhao, X. Xie, X. Xu, and S. Sun, "Multi-view learning overview: Recent progress and new challenges," Information Fusion, vol. 38, pp. 43–54, 2017.

[15] J. Bekker and J. Davis, "Learning from positive and unlabeled data: a survey," Machine Learning, vol. 109, pp. 719–760, 2020.

[16] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, vol. 46, no. 4, 2014.

[17] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, "Snorkel: Rapid training data creation with weak supervision," The VLDB Journal, vol. 29, no. 2, pp. 709–730, 2020.

[18] P. Varma and C. Ré, "Snuba: Automating weak supervision to label training data," Proc. VLDB Endow., vol. 12, no. 3, pp. 223–236, 2018.

[19] T. Liu and D. Tao, "Classification with noisy labels by importance reweighting," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 447–461, 2016.

[20] O. Nikodym, "Sur une généralisation des intégrales de M. J. Radon," Fundamenta Mathematicae, vol. 15, no. 1, pp. 131–179, 1930. [Online]. Available: http://eudml.org/doc/212339

[21] B. van Rooyen, A. Menon, and R. C. Williamson, "Learning with symmetric label noise: The importance of being unhinged," in Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 10–18.

[22] N. Charoenphakdee, J. Lee, and M. Sugiyama, "On symmetric losses for learning from corrupted labels," in International Conference on Machine Learning, vol. 97, 2019, pp. 961–970.

[23] V. N. Vapnik, The Nature of Statistical Learning Theory. Springer, 1995.

[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "PyTorch: An imperative style, high-performance deep learning library," in Advances in Neural Information Processing Systems 32, 2019, pp. 8024–8035.

[26] D. Dua and C. Graff, "UCI Machine Learning Repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml

[27] I. Guyon, "Datasets of the active learning challenge," University of Wisconsin-Madison, Department of Computer Sciences, Tech. Rep., 2010.

[28] P. Nemenyi, "Distribution-free multiple comparisons," Biometrics, vol. 18, no. 2, p. 263, 1962.

[29] F. Wilcoxon, "Individual comparisons by ranking methods," Biometrics Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available: http://www.jstor.org/stable/3001968

[30] J. H. Friedman, "Greedy function approximation: a gradient boosting machine," Annals of Statistics, pp. 1189–1232, 2001.