biquality-learn: a Python library for Biquality
Learning
Pierre Nodet
pierre.nodet@orange.com
Orange Innovation, Université Paris-Saclay
Châtillon, France
Vincent Lemaire
vincent.lemaire@orange.com
Orange Innovation
Lannion, France
Alexis Bondu
alexis.bondu@orange.com
Orange Innovation
Châtillon, France
Antoine Cornuéjols
antoine.cornuejols@agroparistech.fr
AgroParisTech, INRAe, Université Paris-Saclay
Palaiseau, France
Abstract
The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries. These libraries have been particularly tailored to tackle Supervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. In addition to weaknesses of supervision, dataset shifts are another kind of phenomenon that occurs when deploying machine learning models in the real world. That is why Biquality Learning has been proposed as a machine learning framework to design algorithms capable of handling multiple weaknesses of supervision and dataset shifts without assumptions on their nature and level, by relying on the availability of a small trusted dataset composed of cleanly labeled and representative samples. Thus we propose biquality-learn: a Python library for Biquality Learning with an intuitive and consistent API to learn machine learning models from biquality data, with well-proven algorithms, accessible and easy to use for everyone, and enabling researchers to experiment in a reproducible way on biquality data.
Keywords: Python, Biquality Learning, Weakly Supervised
Learning, Dataset Shift
1 Introduction
The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries such as scikit-learn [22], weka [29], or caret [14]. These libraries have been at the core of enforcing good practices in Machine Learning and providing efficient solutions to complex problems. They have been particularly tailored to tackle Supervised Learning and occasionally Semi-Supervised Learning and Unsupervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. Learning with weak supervision, or Weakly-Supervised Learning [33], is a diverse field, as diverse as the identified weaknesses of supervision. Usually, weaknesses of supervision are divided into three groups: inaccurate supervision when samples are mislabeled, inexact supervision when labels are not adapted to the classification task, and incomplete supervision when labels are missing, which reflects the inadequacy of the available labels in the real world [14]. For each weakness of supervision, algorithms have to be specifically hand-designed to alleviate it. In addition to weaknesses of supervision, dataset shifts are another kind of phenomenon that occurs when deploying machine learning models in the real world [23]. Dataset shifts happen when the data distribution observed at training time differs from the data distribution expected at testing time [16]. Shifts in the joint distribution of features and targets can be further divided into four subgroups: covariate shift for shifts in the feature distribution, prior shift for shifts in the target distribution, concept drift for shifts in the decision boundary, and class-conditional shift for shifts in the feature distribution for a given target. Again, designing algorithms to handle dataset shifts usually requires assumptions on the nature of the shift [7]. Because of the diverse nature of possible weaknesses of supervision and dataset shifts, and the assumptions associated with robust algorithms, it is impossible for practitioners to choose the approach suited to their problem.
Biquality Learning is a machine learning framework that has been proposed to design algorithms capable of handling multiple weaknesses of supervision and dataset shifts without assumptions on their nature [18]. It relies on the availability of a small trusted dataset composed of cleanly labeled and representative samples for the targeted classification task, in addition to the usual untrusted dataset composed of potentially corrupted and biased samples. Even though the trusted dataset is not big or rich enough to properly learn the targeted classification task, it is sufficient to learn a mapping function from the untrusted distribution to the trusted distribution and to train machine learning models on corrected untrusted samples.
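For concreteness, the setting can be summarized as follows (a minimal formalization consistent with [18]; the notation below is ours). The trusted dataset 𝐷𝑇 = {(𝑥𝑖, 𝑦𝑖)} is drawn from the clean distribution 𝑃𝑇(𝑋, 𝑌) of the targeted task, while the untrusted dataset 𝐷𝑈 = {(𝑥𝑗, 𝑦𝑗)} is drawn from a potentially corrupted and biased distribution 𝑃𝑈(𝑋, 𝑌), with |𝐷𝑇| ≪ |𝐷𝑈|. The goal is to use 𝐷𝑇 ∪ 𝐷𝑈 to learn a classifier that performs well under 𝑃𝑇, typically by estimating a correction, such as importance weights or a label-transition matrix, that maps 𝑃𝑈 toward 𝑃𝑇.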
Leveraging trusted data has proven to be particularly e-
cient to combat distribution shifts [
9
,
12
] especially on the
most engaging corruptions such as instance-dependant label
noise [
19
]. In many real-world scenarios, these trusted data
arXiv:2308.09643v1 [cs.LG] 18 Aug 2023
Nodet, et al.
are available or can easily be made available to use Biquality
Learning algorithms to train robust machine learning mod-
els. One occurrence is when annotating an entire dataset is
expansive to the point of being prohibitive, but labeling a
small part of the dataset is doable. In Fraud Detection and
Cyber Security, labeling samples require complex forensics
from domain experts, limiting the number of clean samples.
However, the rest of the dataset can be labeled by hand-
engineered rules [
24
] with labels that cannot properly be
trusted. Another scenario happens where data shifts happen
during the labeling process over time. It arises in MLOps
[
13
], when a model is rst learned on clean data and then
deployed in production, or when past predictions are used
to learn an updated model [
26
]. Finally, when multiple anno-
tators are responsible for dataset labeling, which happens in
NLP, the annotators’ eciency in following these guidelines
may vary. However, suppose one annotator can be trusted.
In that case, all the other annotators can be considered un-
trusted, and associating each untrusted annotator against
the trusted annotator can be viewed as a Biquality Learning
task [31].
Multiple libraries have been developed recently for the purpose of handling covariate shift, especially for Domain Adaptation [8], or for dealing with weak supervision [4]. However, Biquality Learning lacks an accessible library with an intuitive and consistent API to learn machine learning models from Biquality Data, with well-proven algorithms. Thus we propose biquality-learn: a Python library for Biquality Learning.
2 biquality-learn
We designed the biquality-learn library following the general design principles of scikit-learn, meaning that it provides a consistent interface for training and using biquality learning algorithms, with an easy way to compose building blocks provided by the library with other blocks from libraries sharing these design principles [3]. It includes various reweighting algorithms, plugin correctors, and functions for simulating label noise and generating sample data to benchmark biquality learning algorithms.
biquality-learn and its dependencies can be easily installed through pip:
pip install biquality-learn
Overall, the goal of biquality-learn is to make well-known and proven biquality learning algorithms accessible and easy to use for everyone and to enable researchers to experiment in a reproducible way on biquality data.
Source Code: hps://github.com/biquality-learn/biquality-
learn
Documentation: hps://biquality-learn.readthedocs.
io/
License: BSD 3-Clause
3 Design of the API
Scikit-learn [22] is a machine learning library for Python with a design philosophy emphasizing consistency, simplicity, and performance. The library provides a consistent interface for various algorithms, making it easy for users to switch between models. It also aims to make machine learning easy to get started with through a user-friendly API and precise documentation. Additionally, it is built on top of efficient numerical libraries (NumPy [11] and SciPy [27]) to ensure that models can be trained and used on large datasets in a reasonable amount of time.
In biquality-learn, we followed the same principles, implementing a similar API with fit, transform, and predict methods. In addition to passing the input features 𝑋 and the labels 𝑌 as in scikit-learn, biquality-learn needs information on whether each sample comes from the trusted or the untrusted dataset. The additional sample_quality parameter serves this purpose: it specifies from which dataset each sample originates, where a value of 0 indicates an untrusted sample and a value of 1 indicates a trusted one.
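As an illustration, here is a minimal sketch of preparing data in this format. The synthetic dataset and the way the trusted subset is chosen are illustrative assumptions; only the sample_quality convention comes from the library.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real biquality dataset.
X_all, y_all = make_classification(n_samples=1000, random_state=0)

# Pretend a small fraction of the samples is cleanly labeled (trusted)
# and the rest is potentially corrupted (untrusted).
X_trusted, X_untrusted, y_trusted, y_untrusted = train_test_split(
    X_all, y_all, train_size=0.1, random_state=0
)

X = np.concatenate([X_trusted, X_untrusted])
y = np.concatenate([y_trusted, y_untrusted])

# sample_quality: 1 for trusted samples, 0 for untrusted ones.
sample_quality = np.concatenate(
    [np.ones(len(X_trusted)), np.zeros(len(X_untrusted))]
)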
4 Algorithms Implemented
In biquality-learn, we purposely implemented only a specific class of algorithms centered on approaches for tabular data and classifiers, thus restricting ourselves to approaches that are genuinely classifier-agnostic or implementable within scikit-learn's API. We did so not to break the design principles shared with scikit-learn and not to impose a particular deep learning library such as PyTorch [20] or TensorFlow [1] on the user.
Table 1 summarizes all implemented algorithms and the kind of corruption they can handle.
Algorithms                              Dataset Shifts    Weaknesses of Supervision
EasyAdapt [6]                           ✓                 ✗
TrAdaBoost [5]                          ✓                 ✗
Unhinged (Linear/Kernel) [25]           ✗                 ✓
Backward [17, 21] (with GLC [12])       ✗                 ✓
IRLNL [15, 28] (with GLC [12])          ✗                 ✓
Plugin [32] (with GLC [12])             ✗                 ✓
𝐾-KMM [9]                               ✓                 ✓
IRBL [19]                               ✗                 ✓
𝐾-PDR                                   ✓                 ✓

Table 1. Algorithms implemented in biquality-learn.
5 Training Biquality Classiers
Training a biquality learning algorithm using biquality-
learn is the same procedure as training a supervised algo-
rithm with scikit-learn thanks to the design presented in
Section 3. The features
𝑋
and the targets
𝑌
of samples be-
longing to the trusted dataset
𝐷𝑇
and untrusted dataset
𝐷𝑈
must be provided as one global dataset
𝐷
. Additionally, the
indicator representing if a sample is trusted or not has to be
provided: sample_quality =
1
𝑋𝐷𝑇.
Here is an example of how to train a biquality classier
using the
𝐾
-KMM (
𝐾
-Kernel Mean Matching) [
9
] algorithm
from biquality-learn:
from sklearn.linear_models import LogisticRegression
from bqlearn.kdr import KKMM
kkmm = KKMM(kernel="rbf", LogisticRegression())
kkmm.fit(X, y, sample_quality=sample_quality)
kkmm.predict(X_new)
6 scikit-learn’s metadata routing
scikit-learn's metadata routing is a Scikit-Learn Enhancement Proposal (SLEP006) describing a system that can be used to seamlessly incorporate various metadata, in addition to the required features and targets, in scikit-learn estimators, scorers, and transformers. biquality-learn uses this design to integrate the sample_quality property into the training and prediction process of biquality learning algorithms. It allows one to use biquality-learn's algorithms in a similar way to scikit-learn's algorithms by passing the sample_quality property as an additional argument to the fit, predict, and other methods.
Currently, the main components provided by scikit-learn support this design, and it is already usable for cross-validators. However, it will be extended to all components in the future, and biquality-learn will significantly benefit from many "free" features. When https://github.com/scikit-learn/scikit-learn/pull/24250 is merged, it will be possible to make a bagging ensemble of biquality classifiers thanks to the BaggingClassifier implemented in scikit-learn, without overriding its behavior on biquality data.

from sklearn.ensemble import BaggingClassifier

bag = BaggingClassifier(kkmm).fit(X, y, sample_quality=sample_quality)
7 Cross-Validating Biquality Classiers
Any cross-validators working for usual Supervised Learn-
ing can work in the case of Biquality Learning. However,
when splitting the data into a train and test set, untrusted
samples need to be removed from the test set to avoid com-
puting supervised metrics on corrupted labels. That is why
make_biquality_cv is provided by biquality-learn to post-
process any scikit-learn compatible cross-validators.
Here is an example of how to use scikit-learn’sRan-
domizedSearchCV function to perform hyperparameter vali-
dation for a biquality learning algorithm in biquality-learn:
from sklearn.model_selection import RandomizedSearchCV
from sklearn.utils.fixes import loguniform
from bqlearn.model_selection import make_biquality_cv

param_dist = {"final_estimator__C": loguniform(1e3, 1e5)}
n_iter = 20

random_search = RandomizedSearchCV(
    kkmm,
    param_distributions=param_dist,
    n_iter=n_iter,
    cv=make_biquality_cv(X, sample_quality, cv=3),
)
random_search.fit(X, y, sample_quality=sample_quality)
8 Simulating Corruptions with the
Corruption API
The corruption module in biquality-learn provides several functions to artificially create biquality datasets by introducing synthetic corruption. These functions can be used to simulate various types of label noise or imbalance in the dataset. We hope to ease the benchmarking of biquality learning algorithms thanks to the corruption API, with a special touch on the reproducibility and standardization of these benchmarks for researchers.
Here is a brief overview of the functions available in the corruption module (a short usage sketch follows the list):
make_weak_labels: Adds weak labels to a dataset by learning a classifier on a subset of the dataset and using its predictions as new labels.
make_label_noise: Adds noisy labels to a dataset by randomly corrupting a specified fraction of the samples according to a given noise matrix.
make_instance_dependent_label_noise: Adds instance-dependent noisy labels by corrupting samples with a probability depending on the sample and a given noise matrix.
uncertainty_noise_probability: Computes the probability of corrupting a sample based on the prediction uncertainty of a given classifier [19].
make_feature_dependent_label_noise: Adds instance-dependent noisy labels by corrupting a specified fraction of the labels with a probability depending on a random linear map between the feature space and the label space [30].
make_imbalance: Creates an imbalanced dataset by oversampling or undersampling the minority classes [2].
make_sampling_biais: Creates a sampling bias by sampling a non-random subset of the original dataset. The sampling scheme follows a Gaussian distribution with a shifted mean and scaled variance computed from the first principal component of a PCA learned from the dataset [10].
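As an illustration, the following sketch builds a synthetic biquality dataset by corrupting the labels of the untrusted part of a clean dataset. Only the function name make_label_noise comes from the list above; the "uniform" noise specification and the noise_ratio and random_state argument names are assumptions to be checked against the documentation.

import numpy as np
from sklearn.datasets import load_digits
from bqlearn.corruptions import make_label_noise  # import path assumed

X, y = load_digits(return_X_y=True)

# Mark a small random subset as trusted and the rest as untrusted.
rng = np.random.default_rng(0)
trusted = rng.random(len(y)) < 0.05
sample_quality = trusted.astype(int)

# Corrupt only the untrusted labels with uniform label noise
# (argument names are assumed; refer to the corruption API documentation).
y_noisy = y.copy()
y_noisy[~trusted] = make_label_noise(
    y[~trusted], "uniform", noise_ratio=0.3, random_state=0
)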
9 Conclusion
We presented biquality-learn, a Python library for Biquality Learning. We exposed the design behind its API, which makes it easy to use and consistent with scikit-learn. We notably showed that our design is future-proof by showing how well it integrates with the upcoming design of scikit-learn. In the future, biquality-learn could be complemented with deep learning capabilities through a twin library with a principled design committing to a deep learning library. Finally, the capacities of biquality-learn could be extended to capabilities particularly needed in real-world scenarios, such as evaluating machine learning models on untrusted data.
References
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
[2] Mateusz Buda, Atsuto Maki, and Maciej A. Mazurowski. 2018. A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106 (2018), 249–259.
[3] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, et al. 2013. API design for machine learning software: experiences from the scikit-learn project. arXiv preprint arXiv:1309.0238 (2013).
[4] Andrea Campagner, Julian Lienen, Eyke Hüllermeier, and Davide Ciucci. 2022. Scikit-Weak: A Python Library for Weakly Supervised Machine Learning. In Rough Sets, JingTao Yao, Hamido Fujita, Xiaodong Yue, Duoqian Miao, Jerzy Grzymala-Busse, and Fanzhang Li (Eds.). Springer Nature Switzerland, Cham, 57–70.
[5] Wenyuan Dai, Qiang Yang, Gui-Rong Xue, and Yong Yu. 2007. Boosting for transfer learning. In International Conference on Machine Learning. 193–200.
[6] Hal Daumé III. 2009. Frustratingly easy domain adaptation. arXiv preprint arXiv:0907.1815 (2009).
[7] Shai Ben David, Tyler Lu, Teresa Luu, and Dávid Pál. 2010. Impossibility theorems for domain adaptation. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, 129–136.
[8] Antoine de Mathelin, François Deheeger, Guillaume Richard, Mathilde Mougeot, and Nicolas Vayatis. 2021. ADAPT: Awesome Domain Adaptation Python Toolbox. arXiv preprint arXiv:2107.03049 (2021).
[9] Tongtong Fang, Nan Lu, Gang Niu, and Masashi Sugiyama. 2020. Rethinking Importance Weighting for Deep Learning under Distribution Shift. Advances in Neural Information Processing Systems 33 (2020), 11996–12007.
[10] Arthur Gretton, Alex Smola, Jiayuan Huang, Marcel Schmittfull, Karsten Borgwardt, and Bernhard Schölkopf. 2009. Covariate shift by kernel mean matching. Dataset shift in machine learning 3, 4 (2009), 5.
[11] Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E. Oliphant. 2020. Array programming with NumPy. Nature 585 (2020), 357–362. https://doi.org/10.1038/s41586-020-2649-2
[12] Dan Hendrycks, Mantas Mazeika, Duncan Wilson, and Kevin Gimpel. 2018. Using Trusted Data to Train Deep Networks on Labels Corrupted by Severe Noise. In Advances in Neural Information Processing Systems, Vol. 31. 10456–10465.
[13] Dominik Kreuzberger, Niklas Kühl, and Sebastian Hirschl. 2022. Machine Learning Operations (MLOps): Overview, Definition, and Architecture. arXiv preprint arXiv:2205.02302 (2022).
[14] Max Kuhn. 2008. Building Predictive Models in R Using the caret Package. Journal of Statistical Software, Articles 28, 5 (2008), 1–26. https://doi.org/10.18637/jss.v028.i05
[15] Tongliang Liu and Dacheng Tao. 2015. Classification with noisy labels by importance reweighting. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 3 (2015), 447–461.
[16] Jose G. Moreno-Torres, Troy Raeder, Rocío Alaiz-Rodríguez, Nitesh V. Chawla, and Francisco Herrera. 2012. A unifying view on dataset shift in classification. Pattern Recognition 45, 1 (2012), 521–530.
[17] Nagarajan Natarajan, Inderjit S. Dhillon, Pradeep Ravikumar, and Ambuj Tewari. 2017. Cost-Sensitive Learning with Noisy Labels. Journal of Machine Learning Research 18, 1 (2017), 5666–5698.
[18] Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols, and Adam Ouorou. 2021. From Weakly Supervised Learning to Biquality Learning: an Introduction. In International Joint Conference on Neural Networks (IJCNN). IEEE.
[19] Pierre Nodet, Vincent Lemaire, Alexis Bondu, Antoine Cornuéjols, and Adam Ouorou. 2021. Importance Reweighting for Biquality Learning. In International Joint Conference on Neural Networks (IJCNN). IEEE.
[20] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32. 8024–8035.
[21] Giorgio Patrini, Alessandro Rozza, Aditya Menon, Richard Nock, and Lizhen Qu. 2017. Making Deep Neural Networks Robust to Label Noise: a Loss Correction Approach. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[22] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[23] Joaquin Quinonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence. 2008. Dataset shift in machine learning. MIT Press.
[24] Alexander Ratner, Stephen H. Bach, Henry Ehrenberg, Jason Fries, Sen Wu, and Christopher Ré. 2020. Snorkel: Rapid training data creation with weak supervision. The VLDB Journal 29, 2 (2020), 709–730.
[25] Brendan van Rooyen, Aditya Menon, and Robert C. Williamson. 2015. Learning with Symmetric Label Noise: The Importance of Being Unhinged. In Neural Information Processing Systems, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.). 10–18.
[26] Kalyan Veeramachaneni, Ignacio Arnaldo, Vamsi Korrapati, Constantinos Bassias, and Ke Li. 2016. AI2: Training a Big Data Machine to Defend. In 2016 IEEE 2nd International Conference on Big Data Security on Cloud (BigDataSecurity). 49–54.
[27] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, C J Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. 2020. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods 17 (2020), 261–272. https://doi.org/10.1038/s41592-019-0686-2
[28] Ruxin Wang, Tongliang Liu, and Dacheng Tao. 2017. Multiclass learning with partially corrupted labels. IEEE Transactions on Neural Networks and Learning Systems 29, 6 (2017), 2568–2580.
[29] Ian H. Witten and Eibe Frank. 2002. Data mining: practical machine learning tools and techniques with Java implementations. ACM SIGMOD Record 31, 1 (2002), 76–77.
[30] Xiaobo Xia, Tongliang Liu, Bo Han, Nannan Wang, Mingming Gong, Haifeng Liu, Gang Niu, Dacheng Tao, and Masashi Sugiyama. 2020. Part-dependent label noise: Towards instance-dependent label noise. Advances in Neural Information Processing Systems 33 (2020), 7597–7610.
[31] Man-Ching Yuen, Irwin King, and Kwong-Sak Leung. 2011. A survey of crowdsourcing systems. In 2011 IEEE Third International Conference on Privacy, Security, Risk and Trust and 2011 IEEE Third International Conference on Social Computing. IEEE, 766–773.
[32] Mingyuan Zhang, Jane Lee, and Shivani Agarwal. 2021. Learning from noisy labels with no change to the training process. In International Conference on Machine Learning. PMLR, 12468–12478.
[33] Zhi-Hua Zhou. 2017. A brief introduction to weakly supervised learning. National Science Review 5, 1 (2017), 44–53.