Importance Reweighting for Biquality Learning

https://arxiv.org/abs/2010.09621 (this paper has been accepted at IJCNN 2021). The field of Weakly Supervised Learning (WSL) has recently seen a surge of popularity, with numerous papers addressing different types of “supervision deficiencies”, namely: poor quality, non-adaptability, and insufficient quantity of labels. Regarding quality, label noise can be of different kinds, including completely-at-random, at-random or even not-at-random. All these kinds of label noise are addressed separately in the literature, leading to highly specialized approaches. This paper proposes an original view of Weakly Supervised Learning, to design generic approaches capable of dealing with any kind of label noise. For this purpose, an alternative setting called “Biquality data” is used. This setting assumes that a small trusted dataset of correctly labeled examples is available, in addition to the untrusted dataset of noisy examples. In this paper, we propose a new reweighting scheme capable of identifying noncorrupted examples in the untrusted dataset. This allows one to learn classifiers using both datasets. Extensive experiments, which simulate several kinds of label noise and vary the quality and quantity of untrusted examples, demonstrate that the proposed approach outperforms baselines and state-of-the-art approaches.
Pierre Nodet
Orange Labs
AgroParisTech, INRAe
46 av. de la République
Châtillon, France
Vincent Lemaire
Orange Labs
2 av. P. Marzin
Lannion, France
Alexis Bondu
Orange Labs
46 av. de la République
Châtillon, France
Antoine Cornuéjols
UMR MIA-Paris
AgroParisTech, INRAe
Université Paris-Saclay
16 r. Claude Bernard
Paris, France
Adam Ouorou
Orange Labs
46 av. de la République
Châtillon, France
Abstract—The field of Weakly Supervised Learning (WSL)
has recently seen a surge of popularity, with numerous papers
addressing different types of “supervision deficiencies”, namely:
poor quality, non-adaptability, and insufficient quantity of labels.
Regarding quality, label noise can be of different types, including
completely-at-random, at-random or even not-at-random. All
these kinds of label noise are addressed separately in the
literature, leading to highly specialized approaches. This paper
proposes an original, encompassing, view of Weakly Supervised
Learning, which results in the design of generic approaches
capable of dealing with any kind of label noise. For this purpose,
an alternative setting called “Biquality data” is used. It assumes
that a small trusted dataset of correctly labeled examples is
available, in addition to an untrusted dataset of noisy examples.
In this paper, we propose a new reweighting scheme capable
of identifying noncorrupted examples in the untrusted dataset.
This allows one to learn classifiers using both datasets. Extensive
experiments that simulate several types of label noise and that
vary the quality and quantity of untrusted examples, demonstrate
that the proposed approach outperforms baselines and state-of-
the-art approaches.
Index Terms—Supervised Classification, Weakly Supervised
Learning, Biquality Learning, Trusted data, Label noise
I. INTRODUCTION
The supervised classification problem aims to learn a classi-
fier from a set of labeled training examples in order to predict
the class of new examples. In practice, conventional classifica-
tion techniques may fail because of the imperfections of real-
world datasets. Accordingly, the field of Weakly Supervised
Learning (WSL) has recently seen a surge of popularity, with
numerous papers addressing different types of “supervision
deficiencies” [1], namely:
Insufficient quantity: when many training examples are
available, but only a small portion is labeled, e.g. due to the
cost of labelling. For instance, this occurs in the field of cyber
security where human forensics is needed to label attacks.
Usually, this issue is addressed by semi-supervised learning
(SSL) [2] or active learning (AL) [3].
Poor quality labels: in this case, all training examples are
labeled but the labels may be corrupted. This may happen
when the labeling task is outsourced to crowd labeling. The
Robust Learning to Label Noise (RLL) approaches address this
problem [4], with three identified types of label noise: i) the
completely-at-random noise, which corresponds to a uniform
probability of label change; ii) the at-random label noise,
when the probability of label change depends upon each class,
with uniform label changes within each class; iii) the
not-at-random label noise, when the probability of label change
varies over the input space of the classifier. This last type of
label noise is recognized as the most difficult to deal with [5], [6].
Inappropriate labels: for instance, in Multi Instance Learn-
ing (MIL) [7] the labels are assigned to bags of examples, with
positive label indicating that at least one example of the bag is
positive. Some scenarios in Transfer Learning (TL) [8] imply
that only the labels in the source domain are provided while
the target domain labels are not. Often, these non-adapted
labels are associated with slightly different learning tasks (e.g.
more precise and numerous classes are dividing the original
categories). Alternatively, non-adapted labels may characterize
a differing statistical individual [9] (e.g. a subpart of an image
instead of the entire image).
All these types of supervision deficiencies are addressed
separately in the literature, leading to highly specialized ap-
proaches. In practice, it is very difficult to identify the type(s)
of deficiencies with which a real dataset is associated. For
this reason, we argue that it would be very useful to find a
unified framework for Weakly Supervised Learning, in order
to design generic approaches capable of dealing with any type
of supervision deficiency.
In Section II of this paper, we present “biquality data”,
an alternative WSL setting allowing a unified view of weakly
supervised learning. A generic learning framework using the
two training sets of biquality data (the one trusted and the
other one untrusted) is suggested in Section III. We identify
three possible ways of implementing this framework and
consider one of them. This article proposes a new approach
using example reweighting in Section IV. The effectiveness
of this new approach in dealing with different types of
supervision deficiencies, without a priori knowledge about
them, is demonstrated through experiments with real datasets
in Sections V and VI. Finally, perspectives and future works
are discussed in Section VII.
II. BIQUALITY DATA
This section presents an alternative setting called “Biquality
Data” which covers a large range of supervision deficiencies
and allows for unifying the WSL approaches. The interested
arXiv:2010.09621v4 [cs.LG] 26 Apr 2021
reader may find a more detailed introduction on WSL and its
links with Biquality Learning in [10].
Learning using biquality data has recently been put forward
in [11]–[13] and consists in learning a classifier from two
distinct training sets, one trusted and the other untrusted. The
initial motivation was to unify semi-supervised and robust
learning through a combination of the two. We show in this
paper that this scenario is not limited to this unification and
that it can cover a larger range of supervision deficiencies as
demonstrated with the algorithms we propose and the obtained
results.
The trusted dataset $D_T$ consists of pairs of labeled examples
$(x_i, y_i)$ where all labels $y_i \in \mathcal{Y}$ are supposed to be
correct according to the true underlying conditional distribution
$P_T(Y|X)$. In the untrusted dataset $D_U$, examples $x_i$ may
be associated with incorrect labels. We denote by $P_U(Y|X)$ the
corresponding conditional distribution.
At this stage, no assumption is made about the nature of the
supervision deficiencies which could be of any type including
label noise, missing labels, concept drift, non-adapted labels...
and more generally a mixture of these supervision deficiencies.
The difficulty of a learning task performed on biquality
data can be characterised by two quantities. First, the ratio
of trusted data over the whole data set, denoted by $p$:
$$p = \frac{|D_T|}{|D_T| + |D_U|} \tag{1}$$
Second, a measure of the quality, denoted by $q$, which
evaluates the usefulness of the untrusted data $D_U$ to learn
the trusted concept $P_T(Y|X)$, where $q \in [0,1]$ and 1
indicates high quality. For example, in [13] $q$ is defined using a
ratio of Kullback-Leibler divergences between $P_T(Y|X)$ and
$P_U(Y|X)$.
Fig. 1. The different learning tasks covered by the biquality setting, repre-
sented on a 2D representation.
The biquality setting covers a wide range of learning tasks
by varying the quantities $q$ and $p$ (as represented in Figure 1):
• When ($p = 1$ OR $q = 1$)¹ all examples can be trusted.
  Thus, this setting corresponds to a standard supervised
  learning (SL) task.
• When ($p = 0$ AND $q = 0$), there are no trusted examples
  and the untrusted labels are not informative. We are
  left with only the inputs $\{x_i\}_{1 \le i \le m}$ as in
  unsupervised learning (UL).
• On the vertical axis defined by $q = 0$, except for the
  two points $(p, q) = (0, 0)$ and $(p, q) = (1, 0)$, the
  untrusted labels are not informative, and trusted examples
  are available. The learning task becomes semi-supervised
  learning (SSL) with the untrusted examples as unlabeled
  and the trusted as labeled.
• An upward move on the vertical axis, from a point
  $(p, q) = (\epsilon, 0)$ characterized by a low proportion of
  labeled examples $p = \epsilon$, to a point $(p', 0)$ with $p' > p$,
  corresponds to Active Learning, when an oracle provides
  new labels asked by a given strategy. The same upward
  move can also be realized in Self-training and Co-training
  [14], where unlabeled training examples are
  labeled using the predictions of the current classifier(s).
• On the horizontal axis defined by $p = 0$, except for
  the points $(p, q) = (0, 0)$ and $(p, q) = (0, 1)$, only
  untrusted examples are provided, which corresponds
  to the range of learning tasks typically addressed by
  Robust Learning to Label noise (RLL) approaches.
¹ $p = 1 \implies D_U = \emptyset \implies q = 1$
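The case analysis above can be summarized as a small lookup over the $(p, q)$ square. The following is a minimal sketch for illustration only: the function name `biquality_regime` and the returned strings are hypothetical, and the interior of the square is simply labeled "biquality learning".

```python
def biquality_regime(p: float, q: float) -> str:
    """Map a point (p, q) of the biquality square to the task named in the text."""
    if not (0.0 <= p <= 1.0 and 0.0 <= q <= 1.0):
        raise ValueError("p and q must both lie in [0, 1]")
    if p == 1.0 or q == 1.0:
        return "supervised learning"             # every label can be trusted
    if p == 0.0 and q == 0.0:
        return "unsupervised learning"           # no trusted data, uninformative labels
    if q == 0.0:
        return "semi-supervised learning"        # untrusted examples act as unlabeled
    if p == 0.0:
        return "robust learning to label noise"  # only untrusted examples
    return "biquality learning"                  # interior of the square


if __name__ == "__main__":
    for point in [(1.0, 0.3), (0.0, 0.0), (0.1, 0.0), (0.0, 0.5), (0.1, 0.5)]:
        print(point, "->", biquality_regime(*point))
```

Active Learning, Self-training and Co-training are not separate regions in this sketch: as described above, they correspond to moves along the $q = 0$ axis rather than to fixed points.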
Only the edges of Figure 1 have been envisioned in previous
works – i.e. the points mentioned above – and a whole new
range of problems are addressed in this paper. Moreover,
biquality learning may be used to tackle tasks belonging to
WSL, for instance:
• Positive Unlabeled Learning (PUL) [15], where only
  positive (trusted) and unlabeled instances are available, the
  latter of which can be considered as untrusted.
• Self Training and Cotraining [14] could be addressed at
  the end of the self labeling process: the initial training set
  is then the trusted dataset, and all self-labeled examples
  can be considered as the untrusted ones.
• Concept drift [16]: when a concept drift occurs, all the
  examples used before a detected drift may be considered
  as the untrusted examples, while the examples available
  after it are viewed as the trusted ones, assuming a perfect
  labeling process.
• Self Supervised Learning systems as illustrated by Snorkel
  [17] or Snuba [18]: the small initial training set can be the
  trusted dataset, whereas all examples automatically labeled using
  the labeling functions may be considered as untrusted.
As can be seen from the above list, the Biquality framework
is quite general and its investigation seems a promising avenue
to unify different aspects of the Weakly Supervised Learning.
A main contribution of this paper is to suggest one generic
framework for achieving biquality learning and thus covering
many facets of WSL. This is presented in the next section.
This framework is then applied to the problem of label noise
in the experimental part of this paper.
III. BIQUALITY LEARNING
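Learning the true concept² $P_T(Y|X)$ on $D = D_T \cup D_U$
means minimizing the risk $R$ on $D$ with a loss $L$ for a
probabilistic classifier $f$:
$$\begin{aligned}
R_{D,L}(f) &= \mathbb{E}_{D,(X,Y)\sim T}[L(f(X), Y)] \\
&= P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] \\
&\quad + P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim T}[L(f(X), Y)]
\end{aligned} \tag{2}$$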
where $L(\cdot,\cdot)$ is a loss function from
$\mathbb{R}^{|\mathcal{Y}|} \times \mathcal{Y}$ to $\mathbb{R}$, since
$f(X)$ is a vector of probabilities over the classes. Since the
true concept $P_T(Y|X)$ cannot be learned from $D_U$, the last
line of Equation 2 is not tractable as it stands. That is why we
propose a generic formalization based on a mapping function
$g$ that enables us to learn the true concept from the modified
untrusted examples of $D_U$. Equation 2 becomes:
$$\begin{aligned}
R_{D,L}(f) &= P(X \in D_T)\,\mathbb{E}_{D_T,(X,Y)\sim T}[L(f(X), Y)] \\
&\quad + \lambda\,P(X \in D_U)\,\mathbb{E}_{D_U,(X,Y)\sim U}[g(L(f(X), Y))]
\end{aligned} \tag{3}$$
In Equation 3, the parameter $\lambda \in [0,1]$ reflects the quality
of the untrusted examples of $D_U$ modified by the function
$g$. This time, the last line is tractable since it consists of a
risk expectation estimated over the training examples of $D_U$,
which follow the untrusted concept $P_U(Y|X)$, modified by
the function $g$.
Accordingly, the estimation of the expected risk requires
learning three items: $g$, $\lambda$ and then $f$. To learn $g$, a mapping
function between the two datasets, both $D_T$ and $D_U$ are used.
Then, either $\lambda$ is considered as a hyperparameter to be learned
using $D_T$, or $\lambda$ is provided by an appropriate quality measure
and is considered as an input of the learning algorithm. Finally,
$f$ is learned by minimizing the risk $R$ on $D$ using the mapping
$g$.
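To make Equation 3 concrete, the sketch below computes its empirical counterpart on two small lists of examples. It is an illustration under stated assumptions: $P(X \in D_T)$ and $P(X \in D_U)$ are estimated by the dataset proportions, the names `biquality_risk` and `log_loss` are hypothetical, and the mapping `g` receives the loss value together with `(x, y)` so that all three designs discussed in this section can be plugged in.

```python
import math
from typing import Callable, List, Sequence, Tuple

Example = Tuple[Sequence[float], int]  # (input vector x, class index y)


def log_loss(proba: Sequence[float], y: int) -> float:
    """Cross-entropy of a predicted probability vector against class index y."""
    return -math.log(max(proba[y], 1e-12))


def biquality_risk(
    f: Callable[[Sequence[float]], Sequence[float]],
    g: Callable[[float, Sequence[float], int], float],
    lam: float,
    trusted: List[Example],
    untrusted: List[Example],
    loss: Callable[[Sequence[float], int], float] = log_loss,
) -> float:
    """Empirical version of Equation 3 (assumes at least one example overall)."""
    n_t, n_u = len(trusted), len(untrusted)
    n = n_t + n_u
    # Mean loss on the trusted examples, used as is.
    risk_t = sum(loss(f(x), y) for x, y in trusted) / n_t if n_t else 0.0
    # Mean mapped loss on the untrusted examples.
    risk_u = sum(g(loss(f(x), y), x, y) for x, y in untrusted) / n_u if n_u else 0.0
    return (n_t / n) * risk_t + lam * (n_u / n) * risk_u


if __name__ == "__main__":
    uniform = lambda x: [0.5, 0.5]        # toy classifier
    identity = lambda value, x, y: value  # g that leaves the loss unchanged
    d_t = [([0.0], 0)]
    d_u = [([1.0], 1), ([2.0], 0), ([3.0], 1)]
    print(biquality_risk(uniform, identity, 1.0, d_t, d_u))
```

With the identity mapping and $\lambda = 1$, this reduces to the risk of the "mixed" strategy that pools both datasets without correction.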
In this formalization, the mapping function $g$ plays a central
role. Not exhaustively, we identify three different ways of
designing the mapping function. For each of these, a different
function $g'$ enters the definition of function $g$:
• The first option consists in correcting the label of each
  untrusted example of $D_U$. The mapping function thus
  takes the form $g(L(f(X), Y)) = L(f(X), g'(Y, X))$,
  with $g'(Y, X)$ denoting the new corrected labels and
  $f(X)$ the predictions of the classifier.
• In the second option, the untrusted labels are used
  unchanged. The untrusted examples $X$ are moved in the
  input space to where the untrusted labels become correct
  with respect to the true underlying concept. The mapping
  function becomes $g(L(f(X), Y)) = L(f(g'(X)), Y)$,
  where $g'(X)$ is the “moved” input vector of the modified
  untrusted examples.
• In the last option, $g'$ weights the contribution of the
  untrusted examples in the risk estimate. Accordingly,
  we have $g(L(f(X), Y)) = g'(Y, X)\,L(f(X), Y)$. In this
  case, the parameter $\lambda$ may disappear from Equation 3
  since it can be considered as included in the function $g'$.
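The three designs differ only in where $g'$ is applied: to the label, to the input, or to the loss value. A minimal sketch, with hypothetical helper names (`relabel`, `move_input`, `reweight`) and a toy loss chosen purely for illustration:

```python
from typing import Callable, Sequence

Classifier = Callable[[float], Sequence[float]]  # x -> probability vector
Loss = Callable[[Sequence[float], int], float]   # (probabilities, y) -> loss value


def relabel(loss: Loss, g_prime: Callable[[int, float], int]):
    """Option 1: L(f(x), g'(y, x)), the label is corrected."""
    return lambda f, x, y: loss(f(x), g_prime(y, x))


def move_input(loss: Loss, g_prime: Callable[[float], float]):
    """Option 2: L(f(g'(x)), y), the input is moved and the label is kept."""
    return lambda f, x, y: loss(f(g_prime(x)), y)


def reweight(loss: Loss, g_prime: Callable[[int, float], float]):
    """Option 3: g'(y, x) * L(f(x), y), the loss is weighted."""
    return lambda f, x, y: g_prime(y, x) * loss(f(x), y)


if __name__ == "__main__":
    # Toy one-dimensional classifier and a simple loss, 1 - P(y | x).
    f = lambda x: [0.8, 0.2] if x < 0 else [0.2, 0.8]
    loss = lambda proba, y: 1.0 - proba[y]
    x, y = 1.0, 0  # an untrusted example whose label disagrees with f

    print(relabel(loss, lambda y, x: 1)(f, x, y))       # label corrected to class 1
    print(move_input(loss, lambda x: -x)(f, x, y))      # input mirrored
    print(reweight(loss, lambda y, x: 0.25)(f, x, y))   # contribution down-weighted
```

In this toy case the three mappings happen to produce the same value; in general they lead to different criteria, which is why the remainder of the paper commits to the third one.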
² For reasons of space, we denote $P_T(Y|X)$ by $T$ and $P_U(Y|X)$ by $U$.
Section IV considers in depth the last option and proposes
a new approach where $g'$ acts as an Importance Reweighting
for Biquality Learning.
IV. A NEW IMPORTANCE REWEIGHTING APPROACH FOR
BIQUALITY LEARNING
To estimate the mapping function $g'$, we suggest adapting the
importance reweighting trick from the covariate shift literature
[19] to biquality learning. This trick relies on reweighting
untrusted samples by using the Radon-Nikodym derivative
(RND) [20] of $P_T(X, Y)$ with respect to $P_U(X, Y)$, which
is $\frac{dP_T(X,Y)}{dP_U(X,Y)}$. Contrary to the “covariate shift” setting, the
biquality setting handles the same distribution $P(X)$ in the
trusted and untrusted datasets. However, the two underlying
concepts $P_T(Y|X)$ and $P_U(Y|X)$ are possibly different due
to a supervision deficiency. By using these assumptions and
the Bayes formula, we can further simplify the reweighting
function to the RND of $P_T(Y|X)$ with respect to $P_U(Y|X)$,
$\frac{dP_T(Y|X)}{dP_U(Y|X)}$.
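$$\begin{aligned}
R_{(X,Y)\sim T,\,L}(f) &= \mathbb{E}_{(X,Y)\sim T}[L(f(X), Y)] \\
&= \int L(f(X), Y)\,dP_T(X, Y) \\
&= \int \frac{dP_T(X, Y)}{dP_U(X, Y)}\,L(f(X), Y)\,dP_U(X, Y) \\
&= \mathbb{E}_{(X,Y)\sim U}\left[\frac{P_T(X, Y)}{P_U(X, Y)}\,L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\left[\frac{P_T(Y|X)\,P(X)}{P_U(Y|X)\,P(X)}\,L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}\left[\frac{P_T(Y|X)}{P_U(Y|X)}\,L(f(X), Y)\right] \\
&= \mathbb{E}_{(X,Y)\sim U}[\beta\,L(f(X), Y)] \\
&= R_{(X,Y)\sim U,\,\beta L}(f)
\end{aligned} \tag{4}$$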
Equation 4 shows that $\beta = \frac{P_T(Y|X)}{P_U(Y|X)}$ can serve as the
mapping function $g'$. Following Section III, estimating $\beta$ is
therefore the last step needed to obtain an actual Biquality
Learning algorithm.
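The identity of Equation 4 can be checked numerically on a toy discrete problem. In the sketch below every number is made up for illustration: $X$ and $Y$ are binary, $P(X)$ is shared by both concepts, and `LOSS[x][y]` stands for an arbitrary value of $L(f(x), y)$. The check requires $P_U(y|x) > 0$ wherever $P_T(y|x) > 0$.

```python
# Toy check that E_U[beta * L] equals E_T[L] when P(X) is shared (Equation 4).
P_X = [0.3, 0.7]                  # P(X = x), identical for both concepts
P_T = [[0.9, 0.1], [0.1, 0.9]]    # trusted concept, P_T[x][y] = P_T(Y = y | X = x)
P_U = [[0.6, 0.4], [0.4, 0.6]]    # untrusted (noisier) concept, same layout
LOSS = [[0.2, 1.5], [2.0, 0.1]]   # arbitrary loss values L(f(x), y)


def risk(cond, weight=lambda x, y: 1.0):
    """Expected (optionally weighted) loss under P(X) * cond(Y | X)."""
    return sum(
        P_X[x] * cond[x][y] * weight(x, y) * LOSS[x][y]
        for x in range(2)
        for y in range(2)
    )


def beta(x, y):
    """Importance weight P_T(y | x) / P_U(y | x)."""
    return P_T[x][y] / P_U[x][y]


risk_trusted = risk(P_T)            # risk under the true concept
risk_untrusted = risk(P_U)          # naive risk under the untrusted concept
risk_reweighted = risk(P_U, beta)   # untrusted risk corrected by beta

if __name__ == "__main__":
    print(risk_trusted, risk_untrusted, risk_reweighted)
```

The reweighted untrusted risk coincides with the trusted risk up to floating-point error, whereas the unweighted untrusted risk does not.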
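Algorithm: Importance Reweighting for Biquality
Learning (IRBL)
Input: Trusted Dataset $D_T$, Untrusted Dataset $D_U$,
Probabilistic Classifier Family $\mathcal{F}$
1 Learn $f_U \in \mathcal{F}$ on $D_U$
2 Learn $f_T \in \mathcal{F}$ on $D_T$
3 for $(x_i, y_i) \in D_U$, where $y_i \in [[1, K]]$ do
4     $\hat{\beta}(x_i, y_i) = \left\langle \frac{f_T(x_i)}{f_U(x_i)} \right\rangle_{y_i}$
5 for $(x_i, y_i) \in D_T$ do
6     $\hat{\beta}(x_i, y_i) = 1$
7 Learn $f \in \mathcal{F}$ on $D_T \cup D_U$ with weights $\hat{\beta}$
Output: $f$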
The proposed algorithm, Importance Reweighting for Biquality
Learning (IRBL), aims at estimating $\beta$ from $D_T$ and
$D_U$ whatever the unknown supervision deficiency. It consists
of two successive steps. First, a probabilistic classifier $f_T$ is
learned from the trusted dataset $D_T$ and another probabilistic
classifier $f_U$ is learned from the untrusted dataset $D_U$. Thanks
to their probabilistic nature, they respectively estimate $P_T(Y|X)$
and $P_U(Y|X)$ by a probability distribution over the set of
the $K$ classes. Thus we can estimate the weight $\beta$ of an
untrusted sample $(x_i, y_i)$ by dividing the prediction of $f_T(x_i)$
by that of $f_U(x_i)$ for the class $y_i$ (see line 4). The weight $\beta$ for
all trusted samples is fixed to 1 (see line 6). Then a
final classifier is learned from both datasets $D_T$ and $D_U$ with
examples reweighted by $\hat{\beta}$.
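Lines 3 to 6 of the algorithm reduce to an element-wise ratio once the two probabilistic classifiers have produced their predictions. The sketch below assumes these predictions are already available as lists of probability vectors; the function names are hypothetical, classes are indexed from 0 rather than from 1, and the small constant `eps` guarding against division by zero is an addition of this sketch, not part of the algorithm as stated.

```python
from typing import List, Sequence


def irbl_weights(
    proba_trusted_model: Sequence[Sequence[float]],
    proba_untrusted_model: Sequence[Sequence[float]],
    y_untrusted: Sequence[int],
    eps: float = 1e-12,
) -> List[float]:
    """beta_hat(x_i, y_i) = f_T(x_i)[y_i] / f_U(x_i)[y_i] for each untrusted example."""
    weights = []
    for p_t, p_u, y in zip(proba_trusted_model, proba_untrusted_model, y_untrusted):
        weights.append(p_t[y] / max(p_u[y], eps))  # eps: assumption of this sketch
    return weights


def all_weights(n_trusted: int, untrusted_weights: Sequence[float]) -> List[float]:
    """Weights for D_T followed by D_U: trusted examples always get weight 1."""
    return [1.0] * n_trusted + list(untrusted_weights)


if __name__ == "__main__":
    # Two untrusted examples at a point where f_T strongly favours class 0.
    f_t = [[0.9, 0.1], [0.9, 0.1]]
    f_u = [[0.6, 0.4], [0.6, 0.4]]
    beta_hat = irbl_weights(f_t, f_u, y_untrusted=[0, 1])
    print(all_weights(n_trusted=3, untrusted_weights=beta_hat))
```

The example labeled with the class that $f_T$ considers unlikely receives a weight below 1, while the other one is slightly up-weighted. The final step (line 7) can then use any learner that accepts per-example weights, for instance through the `sample_weight` argument of scikit-learn's `fit` methods.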
Our algorithm is theoretically grounded, since it is asymp-
totically equivalent to minimizing the risk on the true concept
using the entire data set (see proof in Equation 5).
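$$\begin{aligned}
\hat{R}_{D,\hat{\beta}L}(f) &= \frac{1}{|D|} \sum_{(x_i,y_i)\in D} \Big[ \mathbb{1}_{(x_i,y_i)\in D_T}\,L(f(x_i), y_i) + \mathbb{1}_{(x_i,y_i)\in D_U}\,\hat{\beta}(x_i, y_i)\,L(f(x_i), y_i) \Big] \\
&= \frac{1}{|D_T|+|D_U|} \sum_{(x_i,y_i)\in D_T} L(f(x_i), y_i) + \frac{1}{|D_T|+|D_U|} \sum_{(x_i,y_i)\in D_U} L(f(x_i), y_i)\,\hat{\beta}(x_i, y_i) \\
&= \frac{p}{|D_T|} \sum_{(x_i,y_i)\in D_T} L(f(x_i), y_i) + \frac{1-p}{|D_U|} \sum_{(x_i,y_i)\in D_U} L(f(x_i), y_i)\,\hat{\beta}(x_i, y_i) \\
&= p\,\hat{R}_{D_T,L}(f) + (1-p)\,\hat{R}_{D_U,\hat{\beta}L}(f) \\
&\approx p\,\hat{R}_{D_T,L}(f) + (1-p)\,\hat{R}_{D_T,L}(f) \\
&\approx \hat{R}_{D_T,L}(f)
\end{aligned} \tag{5}$$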
The proof in Equation 5 is an asymptotic result: in practice, our
algorithm relies on the quality of the estimation of $P_T(Y|X)$
and $P_U(Y|X)$ in order to be efficient. In the biquality setting,
both could be hard to estimate because of the small size
of $D_T$ and the poor quality of $D_U$.
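The exact (non-asymptotic) part of Equation 5 is simple bookkeeping: the weighted empirical risk over the pooled data equals $p$ times the trusted risk plus $(1-p)$ times the reweighted untrusted risk. A small sketch on made-up loss values, with hypothetical function names, verifies this decomposition:

```python
from typing import Sequence


def pooled_weighted_risk(
    losses_t: Sequence[float], losses_u: Sequence[float], beta_u: Sequence[float]
) -> float:
    """First line of Equation 5: weighted mean over D = D_T u D_U (trusted weight 1)."""
    total = sum(losses_t) + sum(b * l for b, l in zip(beta_u, losses_u))
    return total / (len(losses_t) + len(losses_u))


def decomposed_risk(
    losses_t: Sequence[float], losses_u: Sequence[float], beta_u: Sequence[float]
) -> float:
    """Fourth line of Equation 5: p * R_hat(D_T) + (1 - p) * R_hat(D_U, beta)."""
    n_t, n_u = len(losses_t), len(losses_u)
    p = n_t / (n_t + n_u)
    risk_t = sum(losses_t) / n_t
    risk_u = sum(b * l for b, l in zip(beta_u, losses_u)) / n_u
    return p * risk_t + (1 - p) * risk_u


if __name__ == "__main__":
    lt = [0.2, 0.4]                 # per-example losses on D_T (made up)
    lu = [1.0, 0.3, 0.8, 0.1]       # per-example losses on D_U (made up)
    bu = [0.1, 1.2, 0.2, 1.1]       # estimated weights on D_U (made up)
    print(pooled_weighted_risk(lt, lu, bu), decomposed_risk(lt, lu, bu))
```

The last two lines of Equation 5 are not covered by this check: they rely on $\hat{\beta}$ approaching the true ratio of Equation 4, which only holds asymptotically.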
V. EXPERIMENTS
The aim of the experiments is to answer the following
questions: i) is our algorithm properly designed and does it
perform better than baseline approaches? ii) is our algorithm
competitive with state-of-the-art approaches?
First, Section V-A presents the supervision deficiencies
which are simulated in our experiments. They correspond to
two different kinds of weak supervision, namely, Noisy label
Completely at Random (i.e. not $X$-dependent) and Noisy
label Not at Random (i.e. $X$-dependent). According to Frénay's
taxonomy [4], the former is the easiest to deal with and the
latter is often considered difficult to manage. Then, Section
V-B consists of three parts: a presentation of the baseline
competitors, a brief report on the state-of-the-art competitors,
and a description of the set of classifiers used. Finally, Section
V-C describes the datasets used in the experiments, and the
criterion chosen to evaluate the learned classifiers. For full
reproducibility, source code, datasets and results are available
at: https://github.com/pierrenodet/irbl.
A. Simulated supervision deficiencies
The datasets listed in Section V-C consist of a collection
of training examples that are assumed to be correctly labeled,
denoted by $D_{total}$. In order to obtain a trusted dataset $D_T$ and
an untrusted one $D_U$, each dataset is split in two parts using
a stratified random draw, where $p$ is the percentage for the
trusted part. The trusted datasets are left untouched, whereas
corrupted labels are simulated in the untrusted datasets by
using two different techniques:
a) Noisy Completely At Random (NCAR): Corrupted
untrusted examples are uniformly drawn from $D_U$ with a
probability $r$, and are assigned a random label that is also
uniformly drawn from $\mathcal{Y}$.
In the particular case of binary classification problems, the
conditional distribution of the untrusted labels is defined by
Equation 6.
$$\forall y \in \mathcal{Y}, \quad P_U(Y = y|X) = \frac{r}{2} + (1 - r)\,P_T(Y = y|X) \tag{6}$$
Here, $r$ controls the overall number of random labels and thus
is our proxy for the quality: $q = 1 - r$.
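The NCAR corruption can be simulated in a few lines. This is a minimal sketch of the procedure described above, not the code used for the experiments; the function name `corrupt_ncar` and the `seed` parameter are assumptions made for illustration.

```python
import random
from typing import List, Sequence


def corrupt_ncar(
    labels: Sequence[int], classes: Sequence[int], r: float, seed: int = 0
) -> List[int]:
    """With probability r, replace a label by one drawn uniformly from `classes`."""
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < r:
            noisy.append(rng.choice(classes))  # may redraw the original label
        else:
            noisy.append(y)
    return noisy


if __name__ == "__main__":
    clean = [1] * 20000  # every example truly belongs to class 1
    noisy = corrupt_ncar(clean, classes=[0, 1], r=0.4)
    # Equation 6 with P_T(Y = 1 | X) = 1 predicts r / 2 + (1 - r) = 0.8.
    print(sum(noisy) / len(noisy))
```

Because a randomly drawn label can coincide with the original one, the fraction of labels that actually change is lower than $r$; in the binary case it is $r/2$, in agreement with Equation 6.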
b) Noisy Not At Random (NNAR): Corrupted untrusted
examples are drawn from $D_U$ with a probability $r(x)$ that
depends on the instance value. To generate an instance-dependent
label noise, we design a noise that depends on the
decision boundary of a classifier $f_{total}$ learned from $D_{total}$.
The probability of random label $r(x)$ should be high when an
instance $x$ is close to the decision boundary, and low when it
is far. In our experiments, the probability outputs of $f_{total}$ are
used to model our label noise as follows:
$$\forall x \in \mathcal{X}, \quad r(x) = 1 - \theta\,|1 - 2 f_{total}(x)|^{\frac{1}{\theta}} \tag{7}$$
where $\theta \in [0; 1]$ is a constant that controls the overall number
of random labels and thus is our proxy for the quality: $q = \theta$.
The parameter $\theta$ influences both the slope (factor) and the
curvature (power) of $r(x)$ to modify the area under the curve of
$r(x)$: $\mathbb{E}[r(x)]$.
For binary classification problems, the conditional distribution
of the untrusted labels is defined by Equation 8.
$$\forall x \in \mathcal{X}, \forall y \in \mathcal{Y}, \quad P_U(Y = y|X = x) = \frac{r(x)}{2} + (1 - r(x))\,P_T(Y = y|X = x) \tag{8}$$
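Equation 7 and the associated corruption can be sketched as follows. The names `noise_rate` and `corrupt_nnar` are hypothetical, `scores` stands for the outputs of $f_{total}$ (probability of the positive class), and the handling of $\theta = 0$ by its limit value is an assumption of this sketch, since the exponent $1/\theta$ is undefined there.

```python
import random
from typing import List, Sequence


def noise_rate(f_total_x: float, theta: float) -> float:
    """r(x) = 1 - theta * |1 - 2 f_total(x)| ** (1 / theta), as in Equation 7."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must lie in [0, 1]")
    if theta == 0.0:
        return 1.0  # assumption: limit of the formula as theta -> 0+
    return 1.0 - theta * abs(1.0 - 2.0 * f_total_x) ** (1.0 / theta)


def corrupt_nnar(
    labels: Sequence[int],
    scores: Sequence[float],
    classes: Sequence[int],
    theta: float,
    seed: int = 0,
) -> List[int]:
    """Replace each label by a uniform draw with instance-dependent probability r(x)."""
    rng = random.Random(seed)
    return [
        rng.choice(classes) if rng.random() < noise_rate(s, theta) else y
        for y, s in zip(labels, scores)
    ]


if __name__ == "__main__":
    for score in (0.5, 0.75, 1.0):
        print(score, noise_rate(score, theta=0.5))
```

An instance on the decision boundary ($f_{total}(x) = 0.5$) always receives a random label, while confident instances are corrupted with probability $1 - \theta$ at most.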
B. Competitors
a) Baseline competitors: The first part of our experiments
consists of a sanity check which compares the performance
of the proposed algorithm to the following baselines:
• Trusted: The final classifier $f$ obtained with our algorithm
  should be better than a classifier $f_T$ that learned only
  from the trusted dataset, insofar as untrusted data bring
  useful information about the trusted concept. At least, $f$
  should not be worse than using only trusted data.
• Mixed: The final classifier $f$ should be better than a
  classifier $f_{mixed}$ learned from both trusted and untrusted
  datasets, without correction. A biquality learning
  algorithm should leverage the information provided by having
  two distinct datasets.
• Untrusted: The final classifier should be better than a
  classifier $f_U$ that learns only from the untrusted dataset
  if there are trusted labels. Using trusted data should
  improve the final performance of the classifier.
b) State-of-the-art competitors: The second part of our
experiments compares our algorithm with two state-of-the-art
methods: (i) a method from the Robust Learning to Label noise
(RLL) [21], [22] family and (ii) the GLC approach [12].
• RLL: In recent literature a new emphasis is put on
  the search for new loss functions that are conducive
  to better risk minimization in the presence of noisy
  labels. For example, [21], [22] show theoretically and
  experimentally that when the loss function satisfies a
  symmetry condition, described below, this contributes
  to the robustness of the classifier. Accordingly, in this
  paper we train a classifier with a symmetric loss function
  as a competitor. This first competitor is expected to
  have good results on the completely-at-random label noise
  described in Section V-A. A loss function $L_s$ is said to be
  symmetric if $\sum_{y \in \{-1;1\}} L_s(f(x), y) = c$, where $c$ is a
  constant and $f(x)$ is the score on the class $y$. This loss
  function is used on $D_T \cup D_U$.
• GLC: To the best of our knowledge, GLC [12] is among
  the best performing algorithms that can learn from
  biquality data. It has been successfully compared to many
  competing approaches. Like ours, it is a two-step
  approach which is simple and easy to implement.
  In a first step, a model $f_U$ is learned from the untrusted
  dataset $D_U$. Then it is used to estimate a transition matrix
  $C$ of $P_{U|T}(Y)$ by making probabilistic predictions with
  $f_U$ on the trusted dataset $D_T$ and comparing them to the
  trusted labels.
  In a second step, this matrix is used to correct the labels
  from the untrusted dataset $D_U$ when learning the final
  model $f$. Indeed, $f$ is learned with $L$ on $D_T$ and with
  $L(C^\top f(X), Y)$ on $D_U$.
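The two GLC steps can be sketched as follows. This is one plausible reading of the description above, in which row $k$ of $C$ is the average prediction of $f_U$ over the trusted examples of class $k$; the function names are hypothetical, and the fallback to an identity row for a class without trusted examples is an assumption of this sketch.

```python
from typing import List, Sequence


def estimate_transition_matrix(
    proba_u_on_trusted: Sequence[Sequence[float]],
    y_trusted: Sequence[int],
    n_classes: int,
) -> List[List[float]]:
    """C[k][j]: mean probability f_U assigns to class j on trusted examples of class k."""
    sums = [[0.0] * n_classes for _ in range(n_classes)]
    counts = [0] * n_classes
    for proba, y in zip(proba_u_on_trusted, y_trusted):
        counts[y] += 1
        for j in range(n_classes):
            sums[y][j] += proba[j]
    matrix = []
    for k in range(n_classes):
        if counts[k] == 0:
            # Assumption: no trusted example of class k, so assume no corruption.
            matrix.append([1.0 if j == k else 0.0 for j in range(n_classes)])
        else:
            matrix.append([s / counts[k] for s in sums[k]])
    return matrix


def corrected_proba(matrix: Sequence[Sequence[float]], proba: Sequence[float]) -> List[float]:
    """C^T f(x): predicted distribution over the untrusted (noisy) labels."""
    n = len(proba)
    return [sum(matrix[k][j] * proba[k] for k in range(n)) for j in range(n)]


if __name__ == "__main__":
    f_u_on_trusted = [[0.8, 0.2], [0.6, 0.4], [0.3, 0.7]]
    C = estimate_transition_matrix(f_u_on_trusted, y_trusted=[0, 0, 1], n_classes=2)
    print(C)
    print(corrected_proba(C, [1.0, 0.0]))
```

The loss on $D_U$ is then evaluated on `corrected_proba(C, f(x))` against the untrusted label, while the loss on $D_T$ uses `f(x)` directly.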
c) Classifiers: First of all, the choice of classifiers was
guided by the idea of comparing algorithms for biquality
learning and not searching for the best classifiers. This choice
was also guided by the nature of the datasets used in the
experiments (see Section V-C). Secondly, our algorithm, as well
as GLC, implies two learning phases. For both reasons and
for simplicity, we decided to use Logistic Regression (LR)
for each phase. LR is known to be limited, in the sense of the
Vapnik-Chervonenkis dimension [23], since it can only learn
linear separations of the input space $\mathcal{X}$, which could underfit
the conditional probabilities $P(Y|X)$ on $D_T$ and $D_U$ and lead
to bad $\beta$ estimations. But this impediment, if met, will affect
equally all the compared algorithms. LR is also used for the
RLL classifier using the Unhinged symmetric loss function.
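The symmetry condition of the RLL competitor is easy to verify numerically. The sketch below uses the usual definition of the unhinged loss, $L(s, y) = 1 - ys$ for a score $s$ and a label $y \in \{-1, 1\}$, and contrasts it with the logistic loss; the helper names are hypothetical.

```python
import math


def unhinged(score: float, y: int) -> float:
    """Unhinged loss: 1 - y * score, with y in {-1, +1}."""
    return 1.0 - y * score


def logistic(score: float, y: int) -> float:
    """Logistic loss: log(1 + exp(-y * score)), with y in {-1, +1}."""
    return math.log1p(math.exp(-y * score))


def sum_over_labels(loss, score: float) -> float:
    """Left-hand side of the symmetry condition: L(s, -1) + L(s, +1)."""
    return loss(score, -1) + loss(score, 1)


if __name__ == "__main__":
    for s in (-2.0, 0.0, 0.5, 3.0):
        print(s, sum_over_labels(unhinged, s), sum_over_labels(logistic, s))
```

For the unhinged loss the sum is the constant $c = 2$ whatever the score, whereas for the logistic loss it varies with the score, so the latter does not satisfy the condition.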
To obtain reliable estimations of conditional probabilities
$P(Y|X)$, the outputs of all classifiers have been calibrated
using Isotonic Regression with the default parameters
provided by scikit-learn [24].
Logistic Regression is always used and learned using
SGD with a learning rate of 0.005, a weight decay of $10^{-6}$,
20 epochs and a batch size of 24, with PyTorch [25].
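To show how per-example weights enter such a training loop, here is a plain-Python stand-in for a weighted logistic regression trained by mini-batch SGD. It is not the PyTorch code used in the experiments: only the hyperparameter values (learning rate, weight decay, epochs, batch size) are taken from the text, and the function names, the shuffling scheme and the choice to apply weight decay to the weights but not the bias are assumptions of this sketch.

```python
import math
import random
from typing import List, Sequence, Tuple


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_weighted_logreg(
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    sample_weight: Sequence[float],
    lr: float = 0.005,
    weight_decay: float = 1e-6,
    epochs: int = 20,
    batch_size: int = 24,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """Binary logistic regression (labels in {0, 1}) minimizing a weighted log loss."""
    rng = random.Random(seed)
    d = len(X[0])
    w, b = [0.0] * d, 0.0
    indices = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(indices)
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            grad_w, grad_b = [0.0] * d, 0.0
            for i in batch:
                z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
                # Each example's gradient is scaled by its weight (e.g. beta_hat).
                err = sample_weight[i] * (sigmoid(z) - y[i])
                for j in range(d):
                    grad_w[j] += err * X[i][j]
                grad_b += err
            m = len(batch)
            w = [wj - lr * (gj / m + weight_decay * wj) for wj, gj in zip(w, grad_w)]
            b -= lr * grad_b / m
    return w, b


if __name__ == "__main__":
    X = [[-2.0], [-1.0], [1.0], [2.0]] * 10
    y = [0, 0, 1, 1] * 10
    print(fit_weighted_logreg(X, y, sample_weight=[1.0] * len(X)))
```

An example with weight 0 contributes nothing to the gradient, which is how strongly down-weighted untrusted examples are effectively ignored by the final classifier.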
C. Datasets
In industrial applications familiar to us, such as fraud detec-
tion, Customer Relationship Management (CRM) and churn
prediction, we are mostly faced with binary classification
problems. The available data is of average size in terms of the
number of explanatory variables and involves mixed variables
(numerical and categorical).
For this reason, we limited the experiments in this paper to
binary classification tasks, even if our algorithm can address
multi-class problems. The tabular datasets chosen for the
experiments have characteristics similar to those of our real
applications.
TABLE I
BINARY CLASSIFICATION DATASETS USED FOR THE EVALUATION.
COLUMNS: NUMBER OF EXAMPLES (|D|), NUMBER OF FEATURES (|X|),
AND RATIO OF EXAMPLES FROM THE MINORITY CLASS (MIN).
name      |D|       |X|    min (%)   name      |D|       |X|   min (%)
4class    862       2      36        ibnsina   20,722    92    38
ad        3,278     1,558  14        zebra     61,488    154   4.6
adult     48,842    14     23        musk      6,598     169   15
aus       690       14     44        phishing  11,055    30    44
banknote  1,372     4      44        spam      4,601     57    39
breast    683       9      35        ijcnn1    141,691   22    9
eeg       1,498     13     45        svmg3     1,284     4     26
diabetes  768       8      35        svmg1     7,089     22    43
german    1,000     20     30        sylva     145,252   108   6.5
hiva      42,678    1,617  3.5       web       49,749    300   3
They come from different sources: UCI [26], libsvm³ and
the active learning challenge [27]. A part of these datasets comes
from past challenges on active learning where high performance
with a low number of labeled examples has proved
difficult to obtain. For each dataset, 80% of samples were used
for training and 20% were used for the test. With this choice of
datasets, a large range of class ratios is covered: Australian
is almost balanced while Web is really unbalanced. Also, the
size varies significantly in number of rows or columns, with
corresponding impact on the difficulty of the learning tasks.
VI. RESULTS
The empirical performance of our approach is evaluated in
two steps. First, we investigate the efficiency of the reweighting
scheme and its influence on the learning procedure of the
final classifier. Second, our approach is benchmarked against
competitors to evaluate its efficiency in real tasks.
³ https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/
A. Behavior of the IRBL method
In order to illustrate the proposed reweighting scheme, we
picked a dataset, here the “ad” dataset used with a ratio of
trusted data $p = 0.25$, and examined the histogram of the
weights assigned to each untrusted example, either corrupted
or not. The Noisy Completely At Random (NCAR) case is
chosen, and the hardest setting, where all labels are random
($q = 0$), is considered.
Figure 2 shows this histogram. It is clear that
the proposed method is able to distinguish corrupted from
noncorrupted labels in the untrusted dataset. Figure 3 confirms
this behavior when varying the value of the quality. For a
perfect quality, the distribution of the $\beta$ values is unimodal with a
median equal to one and a very narrow inter-quantile range,
whereas, when the quality drops, the distribution of the $\beta$ values for
the corrupted labels decreases to zero.
Fig. 2. Histogram of the $\beta$ values on AD for $p = 0.25$ and $q = 0$ for NCAR
for the corrupted and noncorrupted examples.
Fig. 3. Boxplot of the $\beta$ values on AD for $p = 0.25$ versus the quality, from
$q = 0$ to $q = 1$ for NCAR.
It is equally interesting to look at the classification error
when $q$, the quality of the untrusted data, varies. Figure 4
reports the performance for the proposed method and for the
baseline competitors. It is remarkable that the performance
of our algorithm, IRBL, remains stable when $q$ decreases
while the performance of the mixed and untrusted algorithms
worsens. In addition, IRBL always obtains better performance
than the trusted baseline.
Fig. 4. Classification error on test set for IRBL against baselines on a full
range of quality level (AD dataset, $p = 0.25$, NCAR).
B. Comparison with competitors
For a first global comparison, two critical diagrams are
presented in Figures 5 and 6, which rank the various methods
for the NCAR and NNAR label noise. The Nemenyi test [28]
is used to rank the approaches in terms of mean accuracy,
calculated for all values of $p$ and $q$ and over all the 20 data
sets described in Section V-C. The Nemenyi test consists of
two successive steps. First, the Friedman test is applied to the
mean accuracy of competing approaches to determine whether
their overall performance is similar. Second, if not, the
post-hoc test is applied to determine groups of approaches whose
overall performance is significantly different from that of the
other groups.
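The quantities behind such a diagram can be computed with standard formulas: average ranks per method, the Friedman chi-square statistic, and the Nemenyi critical difference. The sketch below is illustrative and is not the evaluation code of the paper; the function names are hypothetical, and the critical value `q_alpha` is left as an argument to be read from a Studentized range table for the chosen significance level and number of methods.

```python
import math
from typing import List, Sequence


def average_ranks(scores: Sequence[Sequence[float]]) -> List[float]:
    """scores[i][j]: accuracy of method j on dataset i. Rank 1 is best; ties share ranks."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            mean_rank = (pos + end) / 2 + 1  # mean of the tied positions (1-based)
            for t in range(pos, end + 1):
                totals[order[t]] += mean_rank
            pos = end + 1
    return [t / len(scores) for t in totals]


def friedman_statistic(avg_ranks: Sequence[float], n_datasets: int) -> float:
    """Friedman chi-square statistic computed from the average ranks."""
    k = len(avg_ranks)
    return 12 * n_datasets / (k * (k + 1)) * (
        sum(r * r for r in avg_ranks) - k * (k + 1) ** 2 / 4
    )


def nemenyi_critical_difference(k: int, n_datasets: int, q_alpha: float) -> float:
    """Two methods differ significantly if their average ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_datasets))


if __name__ == "__main__":
    toy = [[0.90, 0.80, 0.70], [0.85, 0.80, 0.60], [0.70, 0.90, 0.60], [0.90, 0.90, 0.50]]
    ranks = average_ranks(toy)
    print(ranks, friedman_statistic(ranks, n_datasets=len(toy)))
```

In a critical diagram, methods whose average ranks lie within one critical difference of each other are connected, which signals that they are not significantly different.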
Fig. 5. Nemenyi test for the 20 datasets, for all $p$, $q$, for NCAR.
Fig. 6. Nemenyi test for the 20 datasets, for all $p$, $q$, for NNAR.
These figures show that the IRBL method is ranked first for
the two kinds of label noise and provides better performance
than the other competitors. Table II provides a more detailed
perspective by reporting the mean accuracy and its standard
deviation. These values are computed for different values of
$p$ over all qualities $q$ and all datasets. This table also helps to
see how the methods fare as compared to learning on perfect
data. Overall, IRBL obtains the best results and with a lower
variability.
TABLE II
MEAN ACCURACY (SCORE RESCALED TO RANGE FROM 0 TO 100) AND STANDARD DEVIATION COMPUTED ON THE 20 DATASETS ∀q FOR (1) NCAR AND (2)
NNAR. THE MEAN ACCURACY WHEN USING ALL THE TRAINING DATA WITHOUT NOISE IS 88.65.
      p      trusted        irbl           mixed           glc            rll
(1)   0.02   72.48 ± 5.70   83.46 ± 3.56   83.40 ± 8.30    78.34 ± 7.94   77.94 ± 6.37
      0.05   78.50 ± 4.33   84.94 ± 2.24   83.85 ± 7.35    81.19 ± 5.15   77.97 ± 6.44
      0.10   81.40 ± 3.33   86.56 ± 1.68   85.44 ± 5.34    83.00 ± 3.90   78.98 ± 5.26
      0.25   85.61 ± 2.39   87.96 ± 1.18   86.99 ± 2.80    86.27 ± 2.03   79.86 ± 2.61
(2)   0.02   72.48 ± 5.70   82.93 ± 3.18   81.30 ± 10.05   77.55 ± 7.78   75.47 ± 9.47
      0.05   78.50 ± 4.33   85.34 ± 2.55   82.52 ± 7.72    80.77 ± 5.04   76.94 ± 6.64
      0.10   81.40 ± 3.33   86.82 ± 1.45   84.44 ± 5.14    83.22 ± 4.10   77.95 ± 4.51
      0.25   85.61 ± 2.39   88.21 ± 1.05   86.74 ± 2.56    86.56 ± 2.00   79.67 ± 2.70
Mean         79.50 ± 3.94   85.71 ± 2.11   84.33 ± 6.16    82.11 ± 4.74   78.10 ± 5.50
(a) IRBL vs Mixed for NCAR (b) IRBL vs RLL for NCAR (c) IRBL vs GLC for NCAR
(d) IRBL vs Mixed for NNAR (e) IRBL vs RLL for NNAR (f) IRBL vs GLC for NNAR
Fig. 7. Results of the Wilcoxon signed rank test computed on the 20 datasets. Each figure compares IRBL versus one of the competitors. Figures a, b, c are
in the case of Noisy label Completely at Random and Figures d, e, f for the case of Noisy label Not at Random. In each figure “”, “·” and “” indicate
respectively a win, a tie or a loss of IRBL compared to the competitors, the vertical axis is $p$ and the horizontal axis is $q$.
To get more refined results, the Wilcoxon signed-rank test
[29] is used⁴. It enables us to find out under which conditions
(i.e. by varying the values of $p$ and $q$) IRBL performs better
or worse than the competitors.
Figure 7 presents six graphics, each reporting the Wilcoxon
test that evaluates our approach against a competitor, based
on the mean accuracy over the 20 datasets. The two types of
label noise (see Section V-A) correspond to the rows in Figure
7 and a wide range of $q$ and $p$ values are considered.
Thanks to these graphs we can compare in more detail our
method (IRBL) with the mixed method, as well as with RLL
and GLC. Regarding the mixed method, Figures 7(a) and 7(d)
report the results obtained for varying values of $p$ and $q$.
For low quality values $q$, whatever the value of $p$, IRBL is
significantly better. For middle values of the quality there is
no winner, and for high quality values and low values of $p$, the
mixed method is significantly better (this result seems to be
observed in [12] as well). This is not surprising since at high
quality values, the mixed baseline is equivalent to learning
with perfect labels.
⁴ Here the test is used with a significance level of 5%.
These detailed results help us to understand why, in the
critical diagram in Figure 5, although IRBL has a better
ranking, it is not significantly better than the mixed method:
mainly because of the presence of high quality value cases.
Regarding the competitors RLL and GLC, Figures 7(b),
7(c), 7(e) and 7(f) show that the performance of IRBL is
always better than or indistinguishable from theirs. Indeed,
IRBL performs well regardless of the type of noise. This is an
important result since it shows that we are able to deal not
only with NCAR noise but also with instance-dependent label
noise (NNAR), which is more difficult. As expected, RLL gets
more ties with IRBL on NCAR than on NNAR. It is noteworthy
that GLC has ties with IRBL when the quality is high, whatever
the type of label noise.
To sum up, the proposed method has been tested on a large
range of types and strengths of label corruptions. In all cases,
IRBL has obtained top or competitive results. Consequently,
IRBL appears to be a method of choice for applications where
biquality learning is needed. Moreover, IRBL has no user
parameter and a low computational complexity.
VII. CONCLUSION
This paper has presented an original view of Weakly
Supervised Learning and has described a generic approach
capable of dealing with any kind of label noise. A formal
framework for biquality learning has been developed where
the empirical risk is minimized on the small set of trusted
examples in addition to some appropriately chosen criterion
using the untrusted examples. We identified three different
ways to design a mapping function leading to three different
such criteria within the biquality learning framework. We
implemented one of them: a new Importance Reweighting
approach for Biquality Learning (IRBL). Extensive experiments
have shown that IRBL significantly outperforms state-of-the-
art approaches, by simulating completely-at-random and not-
at-random label noise over a wide range of quality and ratio
values of untrusted data.
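As a rough illustration of the kind of objective summarized above, the sketch below computes a weighted empirical risk in which trusted examples count fully and each untrusted example is scaled by an importance weight. It is only a sketch under stated assumptions: the function names, the choice of log-loss, and the normalization by the total number of examples are assumptions made here, and the weights are simply supplied by the caller rather than estimated as IRBL does.

```python
import math


def log_loss(prob_of_label):
    """Negative log-likelihood of the observed label (clipped for safety)."""
    return -math.log(max(prob_of_label, 1e-12))


def biquality_risk(trusted_probs, untrusted_probs, untrusted_weights):
    """Weighted empirical risk over trusted and untrusted examples.

    trusted_probs / untrusted_probs: probability the model assigns to the
    observed label of each example. untrusted_weights: one non-negative
    importance weight per untrusted example (hypothetical input; a weight
    near 0 effectively discards an example believed to be corrupted).
    """
    if len(untrusted_probs) != len(untrusted_weights):
        raise ValueError("one weight is required per untrusted example")
    # Trusted examples always contribute with weight 1.
    total = sum(log_loss(p) for p in trusted_probs)
    # Untrusted examples contribute in proportion to their weight.
    total += sum(
        w * log_loss(p) for p, w in zip(untrusted_probs, untrusted_weights)
    )
    return total / (len(trusted_probs) + len(untrusted_probs))


# Illustrative values: the second untrusted example is down-weighted to 0.
risk = biquality_risk([0.9, 0.8], [0.7, 0.1], [1.0, 0.0])
```

With all weights set to 1 this reduces to ordinary training on the union of both datasets, and with all weights set to 0 only the trusted examples influence the risk.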
Future work will extend the experiments to multiclass
classification datasets and to other classifiers such as
Gradient Boosted Trees [30]. An adaptation of IRBL to Deep
Learning tasks with an online algorithm will also be studied.
REFERENCES
[1] Z.-H. Zhou, “A brief introduction to weakly supervised learning,”
National Science Review, vol. 5, no. 1, pp. 44–53, 08 2017.
[2] O. Chapelle, B. Schölkopf, and A. Zien, “Semi-supervised learning,”
IEEE Transactions on Neural Networks, vol. 20, no. 3, pp. 542–542,
2009.
[3] B. Settles, “Active learning literature survey,” University of Wisconsin-
Madison Department of Computer Sciences, Tech. Rep., 2009.
[4] B. Frénay and M. Verleysen, “Classification in the presence of label
noise: a survey,” IEEE Transactions on Neural Networks and Learning
Systems, vol. 25, no. 5, pp. 845–869, 2013.
[5] J. Cheng, T. Liu, K. Ramamohanarao, and D. Tao, “Learning with
bounded instance and label-dependent label noise,” in International
Conference on Machine Learning (ICML), vol. 119. PMLR, 13–18
Jul 2020, pp. 1789–1799.
[6] A. Menon, B. V. Rooyen, and N. Natarajan, “Learning from binary labels
with instance-dependent corruption,” ArXiv, vol. abs/1605.00751, 2016.
[7] M.-A. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon,
“Multiple instance learning: A survey of problem characteristics and
applications,” Pattern Recognition, vol. 77, pp. 329–353, May 2018.
[8] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer
learning,” Journal of Big data, vol. 3, no. 1, p. 9, 2016.
[9] D. Conte, P. Foggia, G. Percannella, F. Tufano, and M. Vento, “A method
for counting people in crowded scenes,” in 2010 7th IEEE International
Conference on Advanced Video and Signal Based Surveillance, 2010,
pp. 225–232.
[10] P. Nodet, V. Lemaire, A. Bondu, A. Cornuéjols, and A. Ouorou, “From
Weakly Supervised Learning to Biquality Learning: an Introduction,”
in Proceedings of the International Joint Conference on Neural
Networks (IJCNN), 2021.
[11] M. Charikar, J. Steinhardt, and G. Valiant, “Learning from untrusted
data,” in Proceedings of the 49th Annual ACM SIGACT Symposium on
Theory of Computing, 2017, pp. 47–60.
[12] D. Hendrycks, M. Mazeika, D. Wilson, and K. Gimpel, “Using trusted
data to train deep networks on labels corrupted by severe noise,” in
Advances in Neural Information Processing Systems 31, 2018, pp.
10456–10465.
[13] R. Hataya and H. Nakayama, “Unifying semi-supervised and robust
learning by mixup,” in The 2nd Learning from Limited Labeled Data
Workshop, ICLR, 2019.
[14] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview:
Recent progress and new challenges,” Information Fusion, vol. 38, pp.
43–54, 2017.
[15] J. Bekker and J. Davis, “Learning from positive and unlabeled data: a
survey,” Machine Learning, vol. 109, pp. 719–760, 2020.
[16] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A
survey on concept drift adaptation,” ACM Computing Surveys, vol. 46,
no. 4, 2014.
[17] A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré,
“Snorkel: Rapid training data creation with weak supervision,” The
VLDB Journal, vol. 29, no. 2, pp. 709–730, 2020.
[18] P. Varma and C. Ré, “Snuba: Automating weak supervision to label
training data,” Proc. VLDB Endow., vol. 12, no. 3, pp. 223–236, Nov.
2018.
[19] T. Liu and D. Tao, “Classification with noisy labels by importance
reweighting,” IEEE Transactions on Pattern Analysis and Machine
Intelligence, vol. 38, no. 3, pp. 447–461, Mar 2016.
[20] O. Nikodym, “Sur une généralisation des intégrales de M. J. Radon,”
Fundamenta Mathematicae, vol. 15, no. 1, pp. 131–179, 1930. [Online].
Available: http://eudml.org/doc/212339
[21] B. van Rooyen, A. Menon, and R. C. Williamson, “Learning with sym-
metric label noise: The importance of being unhinged,” in Advances in
Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence,
D. D. Lee, M. Sugiyama, and R. Garnett, Eds., 2015, pp. 10–18.
[22] N. Charoenphakdee, J. Lee, and M. Sugiyama, “On symmetric losses for
learning from corrupted labels,” in International Conference on Machine
Learning, vol. 97, 2019, pp. 961–970.
[23] V. N. Vapnik, “The nature of statistical learning theory,” 1995.
[24] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion,
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg et al.,
“Scikit-learn: Machine learning in Python,” Journal of Machine Learning
Research, vol. 12, pp. 2825–2830, 2011.
[25] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan,
T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf,
E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner,
L. Fang, J. Bai, and S. Chintala, “PyTorch: An imperative style, high-
performance deep learning library,” in Advances in Neural Information
Processing Systems 32, 2019, pp. 8024–8035.
[26] D. Dua and C. Graff, “UCI machine learning repository,” 2017.
[Online]. Available: http://archive.ics.uci.edu/ml
[27] I. Guyon, “Datasets of the active learning challenge,” University of
Wisconsin-Madison Department of Computer Sciences, Tech. Rep.,
2010.
[28] P. Nemenyi, “Distribution-free multiple comparisons,” Biometrics,
vol. 18, no. 2, p. 263, 1962.
[29] F. Wilcoxon, “Individual comparisons by ranking methods,” Biometrics
Bulletin, vol. 1, no. 6, pp. 80–83, 1945. [Online]. Available:
http://www.jstor.org/stable/3001968
[30] J. H. Friedman, “Greedy function approximation: a gradient boosting
machine,” Annals of statistics, pp. 1189–1232, 2001.