Importance Reweighting for Biquality Learning
Pierre Nodet
Orange Labs
AgroParisTech, INRAe
46 av. de la République
Châtillon, France

Vincent Lemaire
Orange Labs
2 av. P. Marzin
Lannion, France

Alexis Bondu
Orange Labs
46 av. de la République
Châtillon, France

Antoine Cornuéjols
UMR MIA-Paris
AgroParisTech, INRAe
Université Paris-Saclay
16 r. Claude Bernard
Paris, France

Adam Ouorou
Orange Labs
46 av. de la République
Châtillon, France
Abstract—The field of Weakly Supervised Learning (WSL)
has recently seen a surge of popularity, with numerous papers
addressing different types of “supervision deficiencies”, namely:
poor quality, non-adaptability, and insufficient quantity of labels.
Regarding quality, label noise can be of different types, including
completely-at-random, at-random or even not-at-random. All
these kinds of label noise are addressed separately in the
literature, leading to highly specialized approaches. This paper
proposes an original, encompassing, view of Weakly Supervised
Learning, which results in the design of generic approaches
capable of dealing with any kind of label noise. For this purpose,
an alternative setting called “Biquality data” is used. It assumes
that a small trusted dataset of correctly labeled examples is
available, in addition to an untrusted dataset of noisy examples.
In this paper, we propose a new reweighting scheme capable
of identifying noncorrupted examples in the untrusted dataset.
This allows one to learn classifiers using both datasets. Extensive
experiments that simulate several types of label noise and that
vary the quality and quantity of untrusted examples, demonstrate
that the proposed approach outperforms baselines and state-of-
the-art approaches.
Index Terms—Supervised Classification, Weakly Supervised
Learning, Biquality Learning, Trusted data, Label noise
I. INTRODUCTION
The supervised classification problem aims to learn a classi-
fier from a set of labeled training examples in order to predict
the class of new examples. In practice, conventional classifica-
tion techniques may fail because of the imperfections of real-
world datasets. Accordingly, the field of Weakly Supervised
Learning (WSL) has recently seen a surge of popularity, with
numerous papers addressing different types of “supervision
deficiencies” [1], namely:
Insufficient quantity: when many training examples are
available, but only a small portion is labeled, e.g. due to the
cost of labelling. For instance, this occurs in the field of cyber
security where human forensics is needed to label attacks.
Usually, this issue is addressed by semi-supervised learning
(SSL) [2] or active learning (AL) [3].
Poor quality labels: in this case, all training examples are
labeled but the labels may be corrupted. This may happen
when the labeling task is outsourced to crowd labeling. The
Robust Learning to Label Noise (RLL) approaches address this problem [4], with three identified types of label noise: i) completely-at-random noise, which corresponds to a uniform probability of label change; ii) at-random label noise, where the probability of label change depends on each class, with uniform label changes within each class; iii) not-at-random label noise, where the probability of label change varies over the input space of the classifier. This last type of label noise is recognized as the most difficult to deal with [5], [6].
Inappropriate labels: for instance, in Multi Instance Learning (MIL) [7] the labels are assigned to bags of examples, with a positive label indicating that at least one example of the bag is positive. Some scenarios in Transfer Learning (TL) [8] imply that only the labels in the source domain are provided while the target domain labels are not. Often, these non-adapted labels are associated with slightly different learning tasks (e.g. more precise and numerous classes dividing the original categories). Alternatively, non-adapted labels may characterize a different statistical individual [9] (e.g. a subpart of an image instead of the entire image).
All these types of supervision deficiencies are addressed
separately in the literature, leading to highly specialized ap-
proaches. In practice, it is very difficult to identify the type(s)
of deficiencies with which a real dataset is associated. For
this reason, we argue that it would be very useful to find a
unified framework for Weakly Supervised Learning, in order
to design generic approaches capable of dealing with any type
of supervision deficiency.
In Section II of this paper, we present “biquality data”,
an alternative WSL setting allowing a unified view of weakly
supervised learning. A generic learning framework using the
two training sets of biquality data (the one trusted and the
other one untrusted) is suggested in Section III. We identify
three possible ways of implementing this framework and
consider one of them. This article proposes a new approach
using example reweighting in Section IV. The effectiveness
of this new approach in dealing with different types of
supervision deficiencies, without a priori knowledge about
them, is demonstrated through experiments with real datasets
in Sections V and VI. Finally, perspectives and future works
are discussed in Section VII.
II. BIQUALITY DATA
This section presents an alternative setting called “Biquality
Data” which covers a large range of supervision deficiencies
and allows for unifying the WSL approaches. The interested
reader may find a more detailed introduction to WSL and its
links with Biquality Learning in [10].
Learning using biquality data has recently been put forward
in [11]–[13] and consists in learning a classifier from two
distinct training sets, one trusted and the other untrusted. The
initial motivation was to unify semi-supervised and robust
learning through a combination of the two. We show in this
paper that this scenario is not limited to this unification and
that it can cover a larger range of supervision deficiencies as
demonstrated with the algorithms we propose and the obtained
results.
The trusted dataset $D_T$ consists of pairs of labeled examples $(x_i, y_i)$ where all labels $y_i \in \mathcal{Y}$ are supposed to be correct according to the true underlying conditional distribution $P_T(Y|X)$. In the untrusted dataset $D_U$, examples $x_i$ may be associated with incorrect labels. We note $P_U(Y|X)$ the corresponding conditional distribution.

At this stage, no assumption is made about the nature of the supervision deficiencies, which could be of any type including label noise, missing labels, concept drift, non-adapted labels, and more generally a mixture of these supervision deficiencies.

The difficulty of a learning task performed on biquality data can be characterised by two quantities. First, the ratio of trusted data over the whole data set, denoted by $p$:

$$p = \frac{|D_T|}{|D_T| + |D_U|} \quad (1)$$

Second, a measure of the quality, denoted by $q$, which evaluates the usefulness of the untrusted data $D_U$ to learn the trusted concept $P_T(Y|X)$, where $q \in [0, 1]$ and 1 indicates high quality. For example, in [13] $q$ is defined using a ratio of Kullback-Leibler divergences between $P_T(Y|X)$ and $P_U(Y|X)$.
Fig. 1. The different learning tasks covered by the biquality setting, represented in the 2D $(p, q)$ plane.
The biquality setting covers a wide range of learning tasks by varying the quantities $q$ and $p$ (as represented in Figure 1):

• When ($p = 1$ OR $q = 1$)¹, all examples can be trusted. Thus, this setting corresponds to a standard supervised learning (SL) task.

¹ $p = 1 \implies D_U = \emptyset \implies q = 1$

• When ($p = 0$ AND $q = 0$), there are no trusted examples and the untrusted labels are not informative. We are left with only the inputs $\{x_i\}_{1 \le i \le m}$, as in unsupervised learning (UL).

• On the vertical axis defined by $q = 0$, except for the two points $(p, q) = (0, 0)$ and $(p, q) = (1, 0)$, the untrusted labels are not informative and trusted examples are available. The learning task becomes semi-supervised learning (SSL), with the untrusted examples as unlabeled and the trusted ones as labeled.

• An upward move on the vertical axis, from a point $(p, q) = (\epsilon, 0)$ characterized by a low proportion of labeled examples $p = \epsilon$, to a point $(p', 0)$ with $p' > p$, corresponds to Active Learning, where an oracle provides new labels requested by a given strategy. The same upward move can also be realized in Self-training and Co-training [14], where unlabeled training examples are labeled using the predictions of the current classifier(s).

• On the horizontal axis defined by $p = 0$, except for the points $(p, q) = (0, 0)$ and $(p, q) = (0, 1)$, only untrusted examples are provided, which corresponds to the range of learning tasks typically addressed by Robust Learning to Label noise (RLL) approaches.
Only the edges of Figure 1 have been envisioned in previous
works – i.e. the points mentioned above – and a whole new
range of problems are addressed in this paper. Moreover,
biquality learning may be used to tackle tasks belonging to
WSL, for instance:
• Positive Unlabeled Learning (PUL) [15], where only positive (trusted) and unlabeled instances are available, the latter of which can be considered as untrusted.
• Self-training and Co-training [14] could be addressed at the end of the self-labeling process: the initial training set is then the trusted dataset, and all self-labeled examples can be considered as the untrusted ones.
•Concept drift [16]: when a concept drift occurs, all the
examples used before a detected drift may be considered
as the untrusted examples, while the examples available
after it are viewed as the trusted ones, assuming a perfect
labeling process.
• Self Supervised Learning systems, as illustrated by Snorkel [17] or Snuba [18]: the small initial training set can be considered as trusted, whereas all examples automatically labeled using the labeling functions may be considered as untrusted.
As can be seen from the above list, the Biquality framework is quite general and its investigation seems a promising avenue to unify different aspects of Weakly Supervised Learning.
A main contribution of this paper is to suggest one generic framework for achieving biquality learning and thus covering many facets of WSL. This is presented in the next section. This framework will then be applied, in the experiments part of this paper, to the problem of label noise.
III. BIQUALITY LEARNING
Learning the true concept² $P_T(Y|X)$ on $D = D_T \cup D_U$ means minimizing the risk $R$ on $D$ with a loss $L$ for a probabilistic classifier $f$:

$$\begin{aligned}
R_{D,L}(f) &= \mathbb{E}_{D,(X,Y)\sim T}[L(f(X), Y)] \\
&= P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] \\
&\quad + P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim T}[L(f(X), Y)]
\end{aligned} \quad (2)$$

where $L(\cdot, \cdot)$ is a loss function from $\mathbb{R}^{|\mathcal{Y}|} \times \mathcal{Y}$ to $\mathbb{R}$, since $f(X)$ is a vector of probabilities over the classes. Since the true concept $P_T(Y|X)$ cannot be learned from $D_U$, the last line of Equation 2 is not tractable as it stands. That is why we propose a generic formalization based on a mapping function $g$ that enables us to learn the true concept from the modified untrusted examples of $D_U$. Equation 2 becomes:

$$\begin{aligned}
R_{D,L}(f) &= P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] \\
&\quad + \lambda\, P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim U}[g(L(f(X), Y))]
\end{aligned} \quad (3)$$

In Equation 3, the parameter $\lambda \in [0, 1]$ reflects the quality of the untrusted examples of $D_U$ modified by the function $g$. This time, the last line is tractable since it consists of a risk expectation estimated over the training examples of $D_U$, which follow the untrusted concept $P_U(Y|X)$, modified by the function $g$.
Accordingly, the estimation of the expected risk requires learning three items: $g$, $\lambda$, and then $f$. To learn $g$, a mapping function between the two datasets, both $D_T$ and $D_U$ are used. Then, either $\lambda$ is considered as a hyperparameter to be learned using $D_T$, or $\lambda$ is provided by an appropriate quality measure and is considered as an input of the learning algorithm. Finally, $f$ is learned by minimizing the risk $R$ on $D$ using the mapping $g$.

In this formalization, the mapping function $g$ plays a central role. Not exhaustively, we identify three different ways of designing the mapping function. For each of these, a different function $g'$ enters the definition of the function $g$:

• The first option consists in correcting the label of each untrusted example of $D_U$. The mapping function thus takes the form $g(L(f(X), Y)) = L(f(X), g'(Y, X))$, with $g'(Y, X)$ denoting the new corrected labels and $f(X)$ the predictions of the classifier.

• In the second option, the untrusted labels are used unchanged. The untrusted examples $X$ are moved in the input space to where the untrusted labels become correct with respect to the true underlying concept. The mapping function becomes $g(L(f(X), Y)) = L(f(g'(X)), Y)$, where $g'(X)$ is the "moved" input vector of the modified untrusted examples.

• In the last option, $g'$ weights the contribution of the untrusted examples in the risk estimate. Accordingly, we have $g(L(f(X), Y)) = g'(Y, X)\,L(f(X), Y)$. In this case, the parameter $\lambda$ may disappear from Equation 3 since it can be considered as included in the function $g'$.
² For reasons of space, we denote $P_T(Y|X)$ by $T$ and $P_U(Y|X)$ by $U$.
Section IV considers the last option in depth and proposes a new approach where $g'$ acts as an Importance Reweighting for Biquality Learning.
IV. A NEW IMPORTANCE REWEIGHTING APPROACH FOR BIQUALITY LEARNING
To estimate the mapping function $g'$, we suggest adapting the importance reweighting trick from the covariate shift literature [19] to biquality learning. This trick relies on reweighting untrusted samples using the Radon-Nikodym derivative (RND) [20] of $P_T(X, Y)$ with respect to $P_U(X, Y)$, which is $\frac{dP_T(X,Y)}{dP_U(X,Y)}$. Contrary to the "covariate shift" setting, the biquality setting handles the same distribution $P(X)$ in the trusted and untrusted datasets. However, the two underlying concepts $P_T(Y|X)$ and $P_U(Y|X)$ are possibly different, due to a supervision deficiency. Using these assumptions and the Bayes formula, we can further simplify the reweighting function to the RND of $P_T(Y|X)$ with respect to $P_U(Y|X)$, $\frac{dP_T(Y|X)}{dP_U(Y|X)}$:

$$\begin{aligned}
R_{(X,Y)\sim T,\,L}(f) &= \mathbb{E}_{(X,Y)\sim T}[L(f(X), Y)] \\
&= \int L(f(X), Y)\, dP_T(X, Y) \\
&= \int \frac{dP_T(X, Y)}{dP_U(X, Y)}\, L(f(X), Y)\, dP_U(X, Y) \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(X, Y)}{P_U(X, Y)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(Y|X)\,P(X)}{P_U(Y|X)\,P(X)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\!\left[\frac{P_T(Y|X)}{P_U(Y|X)}\, L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}[\beta\, L(f(X), Y)] \\
&= R_{(X,Y)\sim U,\,\beta L}(f)
\end{aligned} \quad (4)$$

Equation 4 shows that $\beta = \frac{P_T(Y|X)}{P_U(Y|X)}$ is an estimation of the mapping function $g'$. Given the framework of Section III, estimating $\beta$ is the last step towards an actual Biquality Learning algorithm.
Algorithm: Importance Reweighting for Biquality Learning (IRBL)

Input: Trusted dataset $D_T$, untrusted dataset $D_U$, probabilistic classifier family $\mathcal{F}$
1: Learn $f_U \in \mathcal{F}$ on $D_U$
2: Learn $f_T \in \mathcal{F}$ on $D_T$
3: for $(x_i, y_i) \in D_U$, where $y_i \in [\![1, K]\!]$ do
4:     $\hat{\beta}(x_i, y_i) = \left\langle \frac{f_T(x_i)}{f_U(x_i)} \right\rangle_{y_i}$
5: for $(x_i, y_i) \in D_T$ do
6:     $\hat{\beta}(x_i, y_i) = 1$
7: Learn $f \in \mathcal{F}$ on $D_T \cup D_U$ with weights $\hat{\beta}$

Output: $f$
The proposed algorithm, Importance Reweighting for Biquality Learning (IRBL), aims at estimating $\beta$ from $D_T$ and $D_U$, whatever the unknown supervision deficiency. It consists of two successive steps. First, a probabilistic classifier $f_T$ is learned from the trusted dataset $D_T$ and another probabilistic classifier $f_U$ is learned from the untrusted dataset $D_U$. Thanks to their probabilistic nature, each of them estimates $P_T(Y|X)$ and $P_U(Y|X)$, respectively, by a probability distribution over the set of the $K$ classes. Thus we can estimate the weight $\beta$ of an untrusted sample $(x_i, y_i)$ by dividing the prediction of $f_T(x_i)$ by that of $f_U(x_i)$ for the $y_i$ class (see line 4). The weight $\beta$ of all trusted samples is fixed to 1 (see line 6). Then a final classifier is learned from both datasets $D_T$ and $D_U$ with examples reweighted by $\hat{\beta}$.
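A minimal Python sketch of IRBL with scikit-learn is given below. It is a sketch rather than the paper's exact implementation (which calibrates the classifiers and trains them by SGD, see Section V-B); the `eps` guard against zero predicted probabilities and all names are our own.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def irbl(X_trusted, y_trusted, X_untrusted, y_untrusted,
         make_clf=lambda: LogisticRegression(max_iter=1000)):
    """Minimal IRBL sketch: reweight untrusted examples by the estimated
    Radon-Nikodym derivative beta = P_T(y|x) / P_U(y|x)."""
    # Steps 1-2: one probabilistic classifier per dataset. Assumes every
    # class appears in both datasets, so the classes_ attributes match.
    f_U = make_clf().fit(X_untrusted, y_untrusted)
    f_T = make_clf().fit(X_trusted, y_trusted)

    # Steps 3-4: beta(x_i, y_i) = f_T(x_i)[y_i] / f_U(x_i)[y_i].
    rows = np.arange(len(y_untrusted))
    cols_T = np.searchsorted(f_T.classes_, y_untrusted)
    cols_U = np.searchsorted(f_U.classes_, y_untrusted)
    p_T = f_T.predict_proba(X_untrusted)[rows, cols_T]
    p_U = f_U.predict_proba(X_untrusted)[rows, cols_U]
    eps = 1e-12                     # our own guard against division by zero
    beta = p_T / np.maximum(p_U, eps)

    # Steps 5-7: trusted weights are fixed to 1; final fit on both datasets.
    X_all = np.vstack([X_trusted, X_untrusted])
    y_all = np.concatenate([y_trusted, y_untrusted])
    weights = np.concatenate([np.ones(len(y_trusted)), beta])
    return make_clf().fit(X_all, y_all, sample_weight=weights)
```

Any probabilistic classifier family that supports `predict_proba` and `sample_weight` could be plugged in as `make_clf`.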
Our algorithm is theoretically grounded, since it is asymptotically equivalent to minimizing the risk on the true concept using the entire data set (see the proof in Equation 5):

$$\begin{aligned}
\hat{R}_{D,\,\hat{\beta} L}(f) &= \frac{1}{|D|} \sum_{(x_i, y_i) \in D} \mathbb{1}_{(x_i, y_i) \in D_T}\, L(f(x_i), y_i) + \mathbb{1}_{(x_i, y_i) \in D_U}\, \hat{\beta}(x_i, y_i)\, L(f(x_i), y_i) \\
&= \frac{1}{|D_T| + |D_U|} \sum_{(x_i, y_i) \in D_T} L(f(x_i), y_i) + \frac{1}{|D_T| + |D_U|} \sum_{(x_i, y_i) \in D_U} L(f(x_i), y_i)\, \hat{\beta}(x_i, y_i) \\
&= \frac{p}{|D_T|} \sum_{(x_i, y_i) \in D_T} L(f(x_i), y_i) + \frac{1 - p}{|D_U|} \sum_{(x_i, y_i) \in D_U} L(f(x_i), y_i)\, \hat{\beta}(x_i, y_i) \\
&= p\, \hat{R}_{D_T,\,L}(f) + (1 - p)\, \hat{R}_{D_U,\,\hat{\beta} L}(f) \\
&\approx p\, \hat{R}_{D_T,\,L}(f) + (1 - p)\, \hat{R}_{D_T,\,L}(f) \\
&\approx \hat{R}_{D_T,\,L}(f)
\end{aligned} \quad (5)$$
The proof in Equation 5 is an asymptotic result: in practice, our algorithm relies on the quality of the estimations of $P_T(Y|X)$ and $P_U(Y|X)$ in order to be efficient. In the biquality setting, both may be hard to estimate because of the small size of $D_T$ and the poor quality of $D_U$.
V. EXPERIMENTS
The aim of the experiments is to answer the following questions: i) is our algorithm properly designed and does it perform better than baseline approaches? ii) is our algorithm competitive with state-of-the-art approaches?

First, Section V-A presents the supervision deficiencies which are simulated in our experiments. They correspond to two different kinds of weak supervision, namely Noisy label Completely at Random (i.e. not $X$-dependent) and Noisy label Not at Random (i.e. $X$-dependent). In Frénay's taxonomy [4], the former is the easiest to deal with and the latter is often considered difficult to manage. Then, Section V-B consists of three parts: a presentation of the baseline competitors, a brief report on the state-of-the-art competitors, and a description of the set of classifiers used. Finally, Section V-C describes the datasets used in the experiments, and the chosen criterion to evaluate the learned classifiers. For full reproducibility, source code, datasets and results are available at: https://github.com/pierrenodet/irbl.
A. Simulated supervision deficiencies
The datasets listed in Section V-C consist of collections of training examples that are assumed to be correctly labeled, denoted by $D_{total}$. In order to obtain a trusted dataset $D_T$ and an untrusted one $D_U$, each dataset is split in two parts using a stratified random draw, where $p$ is the proportion of the trusted part. The trusted datasets are left untouched, whereas corrupted labels are simulated in the untrusted datasets by using two different techniques:

a) Noisy Completely At Random (NCAR): Corrupted untrusted examples are uniformly drawn from $D_U$ with a probability $r$, and are assigned a random label that is also uniformly drawn from $\mathcal{Y}$.

In the particular case of binary classification problems, the conditional distribution of the untrusted labels is defined by Equation 6:

$$\forall y \in \mathcal{Y}, \quad P_U(Y = y | X) = \frac{r}{2} + (1 - r)\, P_T(Y = y | X) \quad (6)$$

Here, $r$ controls the overall number of random labels and thus is our proxy for the quality: $q = 1 - r$.
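A hedged NumPy sketch of this corruption process (function and variable names are ours):

```python
import numpy as np

def ncar_noise(y, r, classes, rng=None):
    """NCAR sketch: each untrusted example is corrupted with probability r;
    the replacement label is drawn uniformly from the label set, so a
    corrupted example may keep its original label, consistent with Eq. 6."""
    rng = rng or np.random.default_rng(0)
    y_noisy = y.copy()
    corrupted = rng.random(len(y)) < r              # uniform victim draw
    y_noisy[corrupted] = rng.choice(classes, size=corrupted.sum())
    return y_noisy, corrupted
```

The `corrupted` mask is returned only to allow the kind of inspection done in Section VI-A.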
b) Noisy Not At Random (NNAR): Corrupted untrusted examples are drawn from $D_U$ with a probability $r(x)$ that depends on the instance value. To generate an instance-dependent label noise, we design a noise that depends on the decision boundary of a classifier $f_{total}$ learned from $D_{total}$. The probability of a random label, $r(x)$, should be high when an instance $x$ is close to the decision boundary, and low when it is far. In our experiments, the probability outputs of $f_{total}$ are used to model our label noise as follows:

$$\forall x \in \mathcal{X}, \quad r(x) = 1 - \theta\, |1 - 2 f_{total}(x)|^{\frac{1}{\theta}} \quad (7)$$

where $\theta \in [0, 1]$ is a constant that controls the overall number of random labels and thus is our proxy for the quality: $q = \theta$. The parameter $\theta$ influences both the slope (factor) and the curvature (power) of $r(x)$ so as to modify the area under the curve of $r(x)$: $\mathbb{E}[r(x)]$.

For binary classification problems, the conditional distribution of the untrusted labels is defined by Equation 8:

$$\forall x \in \mathcal{X}, \forall y \in \mathcal{Y}, \quad P_U(Y = y | X = x) = \frac{r(x)}{2} + (1 - r(x))\, P_T(Y = y | X = x) \quad (8)$$
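A matching sketch of the NNAR corruption of Equation 7, assuming `proba_pos` holds $f_{total}(x)$, the positive-class probabilities of a classifier trained on the clean data, and $\theta > 0$ (at $\theta = 0$, i.e. $q = 0$, every label is random):

```python
import numpy as np

def nnar_noise(y, proba_pos, theta, classes, rng=None):
    """NNAR sketch (binary case): instance-dependent corruption probability
    r(x) = 1 - theta * |1 - 2 * f_total(x)| ** (1 / theta), which is
    highest near the decision boundary (f_total(x) close to 0.5)."""
    rng = rng or np.random.default_rng(0)
    r_x = 1.0 - theta * np.abs(1.0 - 2.0 * proba_pos) ** (1.0 / theta)
    y_noisy = y.copy()
    corrupted = rng.random(len(y)) < r_x            # instance-dependent draw
    y_noisy[corrupted] = rng.choice(classes, size=corrupted.sum())
    return y_noisy, corrupted
```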
B. Competitors
a) Baseline competitors: The first part of our experiments consists of a sanity check which compares the performance of the proposed algorithm to the following baselines:

• Trusted: The final classifier $f$ obtained with our algorithm should be better than a classifier $f_T$ learned only from the trusted dataset, insofar as untrusted data bring useful information about the trusted concept. At the least, $f$ should not be worse than using only trusted data.

• Mixed: The final classifier $f$ should be better than a classifier $f_{mixed}$ learned from both the trusted and untrusted datasets without correction. A biquality learning algorithm should leverage the information provided by having two distinct datasets.

• Untrusted: The final classifier should be better than a classifier $f_U$ learned only from the untrusted dataset, whenever trusted labels are available. Using trusted data should improve the final classifier's performance.
b) State-of-the-art competitors: The second part of our experiments compares our algorithm with two state-of-the-art methods: (i) a method from the Robust Learning to Label noise (RLL) [21], [22] family and (ii) the GLC approach [12]. A sketch of both is given after this list.

• RLL: In the recent literature, a new emphasis is put on the search for new loss functions that are conducive to better risk minimization in the presence of noisy labels. For example, [21], [22] show theoretically and experimentally that when the loss function satisfies a symmetry condition, described below, this contributes to the robustness of the classifier. Accordingly, in this paper we train a classifier with a symmetric loss function as a competitor. This first competitor is expected to have good results on the completely-at-random label noise described in Section V-A. A loss function $L_s$ is said to be symmetric if $\sum_{y \in \{-1, 1\}} L_s(f(x), y) = c$, where $c$ is a constant and $f(x)$ is the score on the class $y$. This loss function is used on $D_T \cup D_U$.

• GLC: To the best of our knowledge, GLC [12] is among the best performing algorithms that can learn from biquality data. It has been successfully compared to many competing approaches. Like ours, it is a two-step approach which is simple and easy to implement.

In a first step, a model $f_U$ is learned from the untrusted dataset $D_U$. It is then used to estimate a transition matrix $C$ of $P_{U|T}(Y)$, by making probabilistic predictions with $f_U$ on the trusted dataset $D_T$ and comparing them to the trusted labels.

In a second step, this matrix is used to correct the labels from the untrusted dataset $D_U$ when learning the final model $f$. Indeed, $f$ is learned with $L$ on $D_T$ and with $L(C^\top f(X), Y)$ on $D_U$.
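To make these two competitors concrete, here is a minimal Python sketch under our own naming: the unhinged loss of [21] and the GLC transition-matrix estimator of [12]. It assumes classes indexed $0, \dots, K-1$, each present in $D_T$, and a fitted probabilistic model `f_U`:

```python
import numpy as np

def unhinged_loss(scores, y_signed):
    """Unhinged loss [21]: L_s(f(x), y) = 1 - y * f(x) for y in {-1, 1}.
    It is symmetric: L_s(f(x), 1) + L_s(f(x), -1) = 2 for any score."""
    return 1.0 - y_signed * scores

def glc_transition_matrix(f_U, X_trusted, y_trusted, n_classes):
    """GLC sketch [12]: row C[t] is the mean prediction of the untrusted
    model f_U over trusted examples whose true class is t, i.e. an
    estimate of P_U(Y | Y_trusted = t)."""
    proba = f_U.predict_proba(X_trusted)
    return np.vstack([proba[y_trusted == t].mean(axis=0)
                      for t in range(n_classes)])

# The final GLC model is then trained with the loss L(C.T @ f(x), y)
# on untrusted examples and the plain loss L(f(x), y) on trusted ones.
```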
c) Classifiers: First of all, the choice of classifiers was guided by the idea of comparing algorithms for biquality learning, not searching for the best classifiers. This choice was also guided by the nature of the datasets used in the experiments (see Section V-C). Secondly, our algorithm, as well as GLC, implies two learning phases. For both reasons, and for simplicity, we decided to use Logistic Regression (LR) for each phase. LR is known to be limited in the sense of the Vapnik-Chervonenkis dimension [23], since it can only learn linear separations of the input space $\mathcal{X}$, which could underfit the conditional probabilities $P(Y|X)$ on $D_T$ and $D_U$ and lead to bad $\beta$ estimations. But this impediment, if met, will affect all the compared algorithms equally. LR is also used for the RLL classifier, with the Unhinged symmetric loss function.

To obtain reliable estimations of the conditional probabilities $P(Y|X)$, the outputs of all classifiers have been calibrated thanks to Isotonic Regression with the default parameters provided by scikit-learn [24].

Logistic Regression is always learned by SGD with a learning rate of 0.005, a weight decay of $10^{-6}$, 20 epochs and a batch size of 24, with PyTorch [25].
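A rough scikit-learn sketch of this setup (the paper's exact optimizer is implemented in PyTorch; `SGDClassifier` is only an approximate stand-in, and `loss="log_loss"` is named `"log"` in older scikit-learn versions):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.linear_model import SGDClassifier

# Linear model trained by SGD (log loss, i.e. logistic regression) with
# the hyperparameters quoted above, then calibrated by isotonic regression
# with scikit-learn defaults. SGDClassifier updates one sample at a time,
# so the batch size of 24 has no direct analogue here.
base = SGDClassifier(loss="log_loss", alpha=1e-6,
                     learning_rate="constant", eta0=0.005, max_iter=20)
clf = CalibratedClassifierCV(base, method="isotonic")
```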
C. Datasets
In industrial applications familiar to us, such as fraud detection, Customer Relationship Management (CRM) and churn prediction, we are mostly faced with binary classification problems. The available data are of average size in terms of the number of explanatory variables and involve mixed variables (numerical and categorical).

For this reason, we limited the experiments in this paper to binary classification tasks, even though our algorithm can address multi-class problems. The chosen tabular datasets, used for the experiments, have characteristics similar to those of our real applications.

TABLE I: Binary classification datasets used for the evaluation. Columns: number of examples (|D|), number of features (|X|), and ratio of examples from the minority class (min).

name      |D|      |X|    min    name      |D|      |X|   min
4class    862      2      36     ibnsina   20,722   92    38
ad        3,278    1,558  14     zebra     61,488   154   4.6
adult     48,842   14     23     musk      6,598    169   15
aus       690      14     44     phishing  11,055   30    44
banknote  1,372    4      44     spam      4,601    57    39
breast    683      9      35     ijcnn1    141,691  22    9
eeg       1,498    13     45     svmg3     1,284    4     26
diabetes  768      8      35     svmg1     7,089    22    43
german    1,000    20     30     sylva     145,252  108   6.5
hiva      42,678   1,617  3.5    web       49,749   300   3

They come from different sources: UCI [26], libsvm (https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/) and the active learning challenge [27]. A part of these datasets comes from past challenges on active learning, where high performance with a low number of labeled examples has proved difficult to obtain. For each dataset, 80% of the samples were used for training and 20% for testing. With this choice of datasets, a large range of class ratios is covered: Australian is almost balanced while Web is really unbalanced. Also, the size varies significantly in number of rows or columns, with a corresponding impact on the difficulty of the learning tasks.
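Under these choices, the experimental split can be sketched as follows (`X`, `y` and `p` are assumed given; names are ours):

```python
from sklearn.model_selection import train_test_split

# 80/20 train/test split, then a stratified draw of a fraction p of the
# training samples as the trusted set (Section V-A); the remainder becomes
# the untrusted set, whose labels are then corrupted by NCAR or NNAR.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)
X_untrusted, X_trusted, y_untrusted, y_trusted = train_test_split(
    X_train, y_train, test_size=p, stratify=y_train, random_state=0)
```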
VI. RESULTS
The empirical performance of our approach is evaluated in two steps. First, we investigate the efficiency of the reweighting scheme and its influence on the learning procedure of the final classifier. Second, our approach is benchmarked against competitors to evaluate its efficiency in real tasks.
A. Behavior of the IRBL method
In order to illustrate the proposed reweighting scheme, we picked a dataset, here the "ad" dataset used with a ratio of trusted data $p = 0.25$, and examined the histogram of the weights assigned to each untrusted example, either corrupted or not. The Noisy label Completely at Random case is chosen, and the hardest setting, where all labels are random ($q = 0$), is considered.

Figure 2 shows the histogram of the weights assigned to each untrusted example, either corrupted or not. It is clear that the proposed method is able to separate corrupted from noncorrupted labels in the untrusted dataset. Figure 3 confirms this behavior when varying the value of the quality. For a perfect quality, the distribution of the $\beta$ values is unimodal with a median equal to one and a very narrow interquartile range, whereas, when the quality drops, the distribution of the $\beta$ values for the corrupted labels decreases to zero.

Fig. 2. Histogram of the $\beta$ values on AD for $p = 0.25$ and $q = 0$ for NCAR, for the corrupted and noncorrupted examples.

Fig. 3. Boxplot of the $\beta$ values on AD for $p = 0.25$ versus the quality, from $q = 0$ to $q = 1$ for NCAR.
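As an illustration of how such a figure can be produced, a matplotlib sketch, assuming arrays `beta` (the untrusted weights, computed as in the IRBL sketch of Section IV) and `corrupted` (the boolean mask returned by the noise-simulation sketches of Section V-A):

```python
import matplotlib.pyplot as plt

# Weight distributions on the untrusted set, split by corruption status.
plt.hist([beta[~corrupted], beta[corrupted]], bins=50,
         label=["noncorrupted", "corrupted"])
plt.xlabel(r"$\hat{\beta}$")
plt.ylabel("count")
plt.legend()
plt.show()
```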
It is equally interesting to look at the classification error when $q$, the quality of the untrusted data, varies. Figure 4 reports the performance of the proposed method and of the baseline competitors. It is remarkable that the performance of our algorithm, IRBL, remains stable when $q$ decreases, while the performance of the mixed and untrusted algorithms worsens. In addition, IRBL always obtains better performance than the trusted baseline.

Fig. 4. Classification error on the test set for IRBL against the baselines over the full range of quality levels (AD dataset, $p = 0.25$, NCAR).
B. Comparison with competitors
For a first global comparison, two critical diagrams are presented in Figures 5 and 6, which rank the various methods for the NCAR and NNAR label noise. The Nemenyi test [28] is used to rank the approaches in terms of mean accuracy, calculated for all values of $p$ and $q$ and over all the 20 datasets described in Section V-C. The Nemenyi test consists of two successive steps. First, the Friedman test is applied to the mean accuracy of the competing approaches to determine whether their overall performance is similar. Second, if not, the post-hoc test is applied to determine groups of approaches whose overall performance is significantly different from that of the other groups.
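This two-step procedure can be sketched with scipy, plus the `scikit_posthocs` package (assumed here) for the post-hoc step; `acc` is a hypothetical NumPy array of shape (20, 5) holding one mean accuracy per dataset and method:

```python
from scipy.stats import friedmanchisquare
import scikit_posthocs as sp

# Step 1: global Friedman test over the five methods (columns of acc).
stat, p_value = friedmanchisquare(*acc.T)
# Step 2: Nemenyi post-hoc comparisons, only if the global test rejects.
if p_value < 0.05:
    pairwise_p = sp.posthoc_nemenyi_friedman(acc)
```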
Fig. 5. Nemenyi test for the 20 datasets ∀p, q for NCAR.
Fig. 6. Nemenyi test for the 20 datasets ∀p, q for NNAR.
These figures show that the IRBL method is ranked first for the two kinds of label noise and provides better performance than the other competitors. Table II provides a more detailed perspective by reporting the mean accuracy and its standard deviation. These values are computed for different values of $p$, over all qualities $q$ and all datasets. This table also helps to see how the methods fare compared to learning on perfect data.

TABLE II: Mean accuracy (score rescaled to be from 0 to 100) and standard deviation computed on the 20 datasets, $\forall q$, for (1) NCAR and (2) NNAR. The mean accuracy when using all the training data without noise is 88.65.

p     trusted        irbl           mixed          glc            rll
(1) NCAR
0.02  72.48 ± 5.70   83.46 ± 3.56   83.40 ± 8.30   78.34 ± 7.94   77.94 ± 6.37
0.05  78.50 ± 4.33   84.94 ± 2.24   83.85 ± 7.35   81.19 ± 5.15   77.97 ± 6.44
0.10  81.40 ± 3.33   86.56 ± 1.68   85.44 ± 5.34   83.00 ± 3.90   78.98 ± 5.26
0.25  85.61 ± 2.39   87.96 ± 1.18   86.99 ± 2.80   86.27 ± 2.03   79.86 ± 2.61
(2) NNAR
0.02  72.48 ± 5.70   82.93 ± 3.18   81.30 ± 10.05  77.55 ± 7.78   75.47 ± 9.47
0.05  78.50 ± 4.33   85.34 ± 2.55   82.52 ± 7.72   80.77 ± 5.04   76.94 ± 6.64
0.10  81.40 ± 3.33   86.82 ± 1.45   84.44 ± 5.14   83.22 ± 4.10   77.95 ± 4.51
0.25  85.61 ± 2.39   88.21 ± 1.05   86.74 ± 2.56   86.56 ± 2.00   79.67 ± 2.70
Mean  79.50 ± 3.94   85.71 ± 2.11   84.33 ± 6.16   82.11 ± 4.74   78.10 ± 5.50
Fig. 7. Results of the Wilcoxon signed-rank test computed on the 20 datasets. Each panel compares IRBL against one of the competitors: (a) IRBL vs Mixed for NCAR, (b) IRBL vs RLL for NCAR, (c) IRBL vs GLC for NCAR, (d) IRBL vs Mixed for NNAR, (e) IRBL vs RLL for NNAR, (f) IRBL vs GLC for NNAR. Panels a, b, c are for the case of Noisy label Completely at Random and panels d, e, f for the case of Noisy label Not at Random. In each panel, "◦", "·" and "•" indicate respectively a win, a tie, or a loss of IRBL compared to the competitor; the vertical axis is $p$ and the horizontal axis is $q$.
Overall, IRBL obtains the best results, with lower variability.
To get more refined results, the Wilcoxon signed-rank test [29] is used⁴. It enables us to find out under which conditions, i.e. by varying the values of $p$ and $q$, IRBL performs better or worse than the competitors.
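A single $(p, q)$ cell of this comparison can be sketched with scipy (`acc_irbl` and `acc_other` are hypothetical arrays of the 20 per-dataset accuracies of IRBL and one competitor at that setting):

```python
from scipy.stats import wilcoxon

# Paired, two-sided test over the 20 datasets at one (p, q) setting;
# a small p-value marks a significant win or loss for IRBL.
stat, p_value = wilcoxon(acc_irbl, acc_other)
significant = p_value < 0.05
```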
Figure 7 presents six graphics, each reporting the Wilcoxon test that evaluates our approach against a competitor, based on the mean accuracy over the 20 datasets. The two types of label noise (see Section V-A) correspond to the rows in Figure 7, and a wide range of $q$ and $p$ values is considered.

Thanks to these graphs, we can compare our method (IRBL) in more detail with the mixed method, as well as with RLL and GLC. Regarding the mixed method, Figures 7(a) and 7(d) report the results obtained for varying values of $p$ and $q$. For low quality values $q$, whatever the value of $p$, IRBL is significantly better. For middle values of the quality there is no winner, and for high quality values and low values of $p$, the mixed method is significantly better (this result seems to be observed in [12] as well). This is not surprising since, at high quality values, the mixed baseline is equivalent to learning with perfect labels.

⁴ Here the test is used with a confidence level of 5%.
These detailed results help us understand why, in the critical diagram of Figure 5, although IRBL has a better ranking, it is not significantly better than the mixed method: mainly because of the presence of high-quality cases.

Regarding the competitors RLL and GLC, Figures 7(b), 7(c), 7(e) and 7(f) show that IRBL always has better or indistinguishable performance. Indeed, IRBL performs well regardless of the type of noise. This is an important result, since it shows that we are able to deal not only with NCAR noise but also with instance-dependent label noise (NNAR), which is more difficult. The RLL method gets more ties with IRBL on NCAR than on NNAR, as expected. It is noteworthy that GLC has ties with IRBL when the quality is high, whatever the label noise.

To sum up, the proposed method has been tested on a large range of types and strengths of label corruption. In all cases, IRBL has obtained top or competitive results. Consequently, IRBL appears to be a method of choice for applications where biquality learning is needed. Moreover, IRBL has no user parameter and a low computational complexity.
VII. CONCLUSION
This paper has presented an original view of Weakly Supervised Learning and has described a generic approach capable of dealing with any kind of label noise. A formal framework for biquality learning has been developed, where the empirical risk is minimized on the small set of trusted examples in addition to an appropriately chosen criterion using the untrusted examples. We identified three different ways to design a mapping function, leading to three such criteria within the biquality learning framework. We implemented one of them: a new Importance Reweighting approach for Biquality Learning (IRBL). Extensive experiments, simulating completely-at-random and not-at-random label noise over a wide range of quality and ratio values of untrusted data, have shown that IRBL significantly outperforms state-of-the-art approaches.

Future work will extend the experiments to multiclass classification datasets and to other classifiers such as Gradient Boosted Trees [30]. An adaptation of IRBL to Deep Learning tasks with an online algorithm will also be studied.
REFERENCES
[1] Z.-H. Zhou, “A brief introduction to weakly supervised learning,”
National Science Review, vol. 5, no. 1, pp. 44–53, 08 2017.
[2] O. Chapelle, B. Scholkopf, and A. Zien, “Semi-supervised learning,”
IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542,
2009.
[3] B. Settles, “Active learning literature survey,” University of Wisconsin-
Madison Department of Computer Sciences, Tech. Rep., 2009.
[4] B. Frénay and M. Verleysen, "Classification in the presence of label noise: a survey," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 5, pp. 845–869, 2013.
[5] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, “Learning with
bounded instance and label-dependent label noise,” in International
Conference on Machine Learning (ICML), vol. 119. PMLR, 13–18
Jul 2020, pp. 1789–1799.
[6] A. Menon, B. V. Rooyen, and N. Natarajan, “Learning from binary labels
with instance-dependent corruption,” ArXiv, vol. abs/1605.00751, 2016.
[7] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon, “Multi-
ple instance learning: A survey of problem characteristics and applica-
tions,” Pattern Recognition, vol. 77, p. 329–353, May 2018.
[8] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer
learning,” Journal of Big data, vol. 3, no. 1, p. 9, 2016.
[9] D. Conte, P. Foggia, G. Percannella, F. Tufano, and M. Vento, “A method
for counting people in crowded scenes,” in 2010 7th IEEE International
Conference on Advanced Video and Signal Based Surveillance, 2010,
pp. 225–232.
[10] P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, and A. Ouorou, "From Weakly Supervised Learning to Biquality Learning: an Introduction," in Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2021.
[11] M. Charikar, J. Steinhardt, and G. Valiant, “Learning from untrusted
data,” in Proceedings of the 49th Annual ACM SIGACT Symposium on
Theory of Computing, 2017, p. 47–60.
[12] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, “Using trusted
data to train deep networks on labels corrupted by severe noise,” in
Advances in Neural Information Processing Systems 31, 2018, pp.
10 456–10 465.
[13] R. Hataya and H. Nakayama, “Unifying semi-supervised and robust
learning by mixup,” in The 2nd Learning from Limited Labeled Data
Workshop, ICLR, 2019.
[14] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview:
Recent progress and new challenges,” Information Fusion, vol. 38, pp.
43 – 54, 2017.
[15] J. Bekker and J. Davis, “Learning from positive and unlabeled data: a
survey,” Machine Learning, vol. 109, pp. 719–760, 2020.
[16] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A
survey on concept drift adaptation,” ACM Computing Surveys, vol. 46,
no. 4, 2014.
[17] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré, "Snorkel: Rapid training data creation with weak supervision," The VLDB Journal, vol. 29, no. 2, pp. 709–730, 2020.
[18] P. Varma and C. Ré, "Snuba: Automating weak supervision to label training data," Proc. VLDB Endow., vol. 12, no. 3, pp. 223–236, Nov. 2018.
[19] T. Liu and D. Tao, “Classification with noisy labels by importance
reweighting,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 38, no. 3, p. 447–461, Mar 2016.
[20] O. Nikodym, "Sur une généralisation des intégrales de M. J. Radon," Fundamenta Mathematicae, vol. 15, no. 1, pp. 131–179, 1930. [Online]. Available: http://eudml.org/doc/212339
[21] B. van Rooyen, A. Menon, and R. C. Williamson, “Learning with sym-
metric label noise: The importance of being unhinged,” in Advances in
Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence,
D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 10–18.
[22] N. Charoenphakdee, J. Lee, and M. Sugiyama, “On symmetric losses for
learning from corrupted labels,” in International Conference on Machine
Learning, vol. 97, 2019, pp. 961–970.
[23] V. N. Vapnik, “The nature of statistical learning theory,” 1995.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al.,
“Scikit-learn: Machine learning in python,” Journal of machine learning
research, vol. 12, pp. 2825–2830, 2011.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-
performance deep learning library,” in Advances in Neural Information
Processing Systems 32, 2019, pp. 8024–8035.
[26] D. Dua and C. Graff, “Uci machine learning repository,” 2017.
[Online]. Available: http://archive.ics.uci.edu/ml
[27] I. Guyon, “Datasets of the active learning challenge,” University of
Wisconsin-Madison Department of Computer Sciences, Tech. Rep.,
2010.
[28] P. Nemenyi, “Distribution-free multiple comparisons,” Biometrics,
vol. 18, no. 2, p. 263, 1962.
[29] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics
Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available:
http://www.jstor.org/stable/3001968
[30] J. H. Friedman, “Greedy function approximation: a gradient boosting
machine,” Annals of statistics, pp. 1189–1232, 2001.