The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries. These libraries have been particularly tailored to tackle Supervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. In addition to weaknesses of supervision, dataset shifts are another kind of phenomenon that occurs when deploying machine learning models in the real world. That is why Biquality Learning has been proposed as a machine learning framework to design algorithms capable of handling multiple weaknesses of supervision and dataset shifts without assumptions on their nature and level by relying on the availability of a small trusted dataset composed of cleanly labeled and representative samples. Thus we propose biquality-learn: a Python library for Biquality Learning with an intuitive and consistent API to learn machine learning models from biquality data, with well-proven algorithms, accessible and easy to use for everyone, and enabling researchers to experiment in a reproducible way on biquality data.
biquality-learn: a Python library for Biquality Learning
Pierre Nodet
Orange Innovation, Université Paris-Saclay
Châtillon, France
Vincent Lemaire
Orange Innovation
Lannion, France
Alexis Bondu
Orange Innovation
Châtillon, France
Antoine Cornuéjols
AgroParisTech, INRAe, Université Paris-Saclay
Palaiseau, France
1 Introduction
The democratization of Data Mining has been widely successful thanks in part to powerful and easy-to-use Machine Learning libraries such as scikit-learn [ ], weka [ ], or caret [ ]. These libraries have been at the core of enforcing good practices in Machine Learning and providing efficient solutions to complex problems. They have been particularly tailored to tackle Supervised Learning and occasionally Semi-Supervised Learning and Unsupervised Learning. However, strong supervision signals are scarce in practice, and practitioners must resort to weak supervision. Learning with weak supervision, or Weakly-Supervised Learning [ ], is a diverse field, as diverse as the identified weaknesses of supervision. Usually, weaknesses of supervision are divided into three groups: inaccurate supervision when samples are mislabeled, inexact supervision when labels are not adapted to the classification task, and incomplete supervision when labels are missing, which reflects the inadequacy of the available labels in the real world [ ]. For each weakness of supervision, algorithms have to be specifically hand-designed to alleviate it. In addition to weaknesses of supervision, dataset shifts are another kind of phenomenon that occurs when deploying machine learning models in the real world [ ]. Dataset shifts happen when the data distribution observed at training time differs from the data distribution expected at testing time [ ]. Shifts in the joint distribution of features and targets can be further divided into four subgroups: covariate shift for shifts in the feature distribution, prior shift for shifts in the target distribution, concept drift for shifts in the decision boundary, and class-conditional shift for shifts in the feature distribution for a given target. Again, designing algorithms to handle dataset shifts usually requires assumptions on the nature of the shift [ ]. Because of the diverse nature of possible weaknesses of supervision and dataset shifts, and of the assumptions attached to each robust algorithm, it is impossible for practitioners to choose the approach suited to their problem.
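The four kinds of shift listed above can be read off the two standard factorizations of the joint distribution. As a sketch, writing $P_{\text{train}}$ and $P_{\text{test}}$ for the training-time and testing-time distributions (this notation is assumed here and does not come from the text), each shift is named after the factor that changes:

```latex
\begin{align*}
P(X, Y) &= P(Y \mid X)\, P(X) = P(X \mid Y)\, P(Y) \\[4pt]
\text{covariate shift:} \quad & P_{\text{train}}(X) \neq P_{\text{test}}(X) \\
\text{prior shift:} \quad & P_{\text{train}}(Y) \neq P_{\text{test}}(Y) \\
\text{concept drift:} \quad & P_{\text{train}}(Y \mid X) \neq P_{\text{test}}(Y \mid X) \\
\text{class-conditional shift:} \quad & P_{\text{train}}(X \mid Y) \neq P_{\text{test}}(X \mid Y)
\end{align*}
```

A shift in the joint distribution can involve several of these factors at once, which is one reason methods built on a single assumption are hard to choose between in practice.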
Biquality Learning is a machine learning framework that has been proposed to design algorithms capable of handling multiple weaknesses of supervision and dataset shifts without assumptions on their nature [ ]. It relies on the availability of a small trusted dataset composed of cleanly labeled and representative samples for the targeted classification task, in addition to the usual untrusted dataset composed of potentially corrupted and biased samples. Even though the trusted dataset is not big or rich enough to properly learn the targeted classification task, it is sufficient to learn a mapping function from the untrusted distribution to the trusted distribution, so that machine learning models can be trained on corrected untrusted samples.
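One common way to realize such a correction is importance weighting, which underlies reweighting approaches in general. The following is a sketch only, not a description of any specific algorithm in the library. The notation is assumed here: $P_T$ and $P_U$ are the trusted and untrusted joint distributions, $\ell$ is a loss, $f$ is a classifier, and $P_T$ is assumed to be absolutely continuous with respect to $P_U$.

```latex
\begin{align*}
R_T(f)
  &= \mathbb{E}_{(x, y) \sim P_T}\big[\ell(f(x), y)\big] \\
  &= \mathbb{E}_{(x, y) \sim P_U}\big[\beta(x, y)\, \ell(f(x), y)\big],
  \qquad \beta(x, y) = \frac{\mathrm{d}P_T}{\mathrm{d}P_U}(x, y)
\end{align*}
```

Under these assumptions, estimating $\beta$ with the help of the few trusted samples and then minimizing the weighted empirical risk on the untrusted samples reduces the problem to standard weighted supervised learning.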
Leveraging trusted data has proven to be particularly e-
cient to combat distribution shifts [
] especially on the
most engaging corruptions such as instance-dependant label
noise [
]. In many real-world scenarios, these trusted data
Nodet, et al.
are available or can easily be made available to use Biquality
Learning algorithms to train robust machine learning mod-
els. One occurrence is when annotating an entire dataset is
expansive to the point of being prohibitive, but labeling a
small part of the dataset is doable. In Fraud Detection and
Cyber Security, labeling samples require complex forensics
from domain experts, limiting the number of clean samples.
However, the rest of the dataset can be labeled by hand-
engineered rules [
] with labels that cannot properly be
trusted. Another scenario happens where data shifts happen
during the labeling process over time. It arises in MLOps
], when a model is rst learned on clean data and then
deployed in production, or when past predictions are used
to learn an updated model [
]. Finally, when multiple anno-
tators are responsible for dataset labeling, which happens in
NLP, the annotators’ eciency in following these guidelines
may vary. However, suppose one annotator can be trusted.
In that case, all the other annotators can be considered un-
trusted, and associating each untrusted annotator against
the trusted annotator can be viewed as a Biquality Learning
task [31].
Multiple libraries have been developed recently to handle covariate shift, especially for Domain Adaptation [ ], or to deal with weak supervision [ ]. However, Biquality Learning lacks an accessible library with an intuitive and consistent API to learn machine learning models from biquality data with well-proven algorithms. Thus we propose biquality-learn: a Python library for Biquality Learning.
2 biquality-learn
We designed the biquality-learn library following the general design principles of scikit-learn, meaning that it provides a consistent interface for training and using biquality learning algorithms, with an easy way to compose building blocks provided by the library with blocks from other libraries sharing these design principles [ ]. It includes various reweighting algorithms, plugin correctors, and functions for simulating label noise and generating sample data to benchmark biquality learning algorithms.
biquality-learn and its dependencies can be easily installed through pip:
pip install biquality-learn
Overall, the goal of biquality-learn is to make well-known and proven biquality learning algorithms accessible and easy to use for everyone and to enable researchers to experiment in a reproducible way on biquality data.
Source Code: hps://
Documentation: hps://biquality-learn.readthedocs.
License: BSD 3-Clause
3 Design of the API
Scikit-learn [ ] is a machine learning library for Python with a design philosophy emphasizing consistency, simplicity, and performance. The library provides a consistent interface for various algorithms, making it easy for users to switch between models. It also aims to make machine learning easy to get started with through a user-friendly API and precise documentation. Additionally, it is built on top of efficient numerical libraries (NumPy [ ] and SciPy [ ]) to ensure that models can be trained and used on large datasets in a reasonable amount of time.
In biquality-learn, we followed the same principles, implementing a similar API with fit, transform, and predict methods. In addition to passing the input features X and the labels y as in scikit-learn, in biquality-learn we need to provide information regarding whether each sample comes from the trusted or the untrusted dataset: the additional sample_quality property specifies from which dataset each sample originates, where a value of 0 indicates an untrusted sample and 1 a trusted one.
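As a minimal sketch of this convention, the snippet below builds the three aligned inputs from two separate datasets using only NumPy. The toy data, sizes, and variable names such as X_trusted are illustrative assumptions, not part of the library.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 20 trusted and 200 untrusted samples with 5 features.
X_trusted = rng.normal(size=(20, 5))
y_trusted = rng.integers(0, 2, size=20)
X_untrusted = rng.normal(size=(200, 5))
y_untrusted = rng.integers(0, 2, size=200)

# Stack both datasets into one global dataset.
X = np.concatenate([X_trusted, X_untrusted])
y = np.concatenate([y_trusted, y_untrusted])

# 1 marks a trusted sample, 0 an untrusted one, aligned row by row with X and y.
sample_quality = np.concatenate([
    np.ones(len(X_trusted), dtype=int),
    np.zeros(len(X_untrusted), dtype=int),
])
```

Keeping the indicator as a per-sample array, rather than passing two datasets, is what lets the same arrays flow through generic tooling that slices X and y by index.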
4 Algorithms Implemented
In biquality-learn, we purposely implemented only a specific class of algorithms centered on approaches for tabular data and classifiers, thus restricting ourselves to approaches that are genuinely classifier-agnostic or implementable within scikit-learn’s API. We did so in order not to break the design principles shared with scikit-learn and not to impose a particular deep learning library such as PyTorch [ ] or TensorFlow [ ] on the user.
We summarize all implemented algorithms and the kind of corruption they can handle in Table 1.
Algorithms    Dataset Shifts    Weaknesses of Supervision
EasyAdapt [6] ×
TrAdaBoost [5] ×
Unhinged (Linear/Kernel) [25] ×
Backward [17, 21] (with GLC [12]) ×
IRLNL [15, 28] (with GLC [12]) ×
Plugin [32] (with GLC [12]) ×
𝐾-KMM [9]
IRBL [19] ×
Table 1. Algorithms Implemented in biquality-learn
5 Training Biquality Classifiers
Training a biquality learning algorithm using biquality-learn follows the same procedure as training a supervised algorithm with scikit-learn, thanks to the design presented in Section 3. The features X and the targets y of the samples belonging to the trusted dataset and to the untrusted dataset must be provided as one global dataset. Additionally, the indicator representing whether a sample is trusted has to be provided through sample_quality, which equals 1 for a trusted sample and 0 for an untrusted one.
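Here is an example of how to train a biquality classifier using the K-KMM (K-Kernel Mean Matching) [9] algorithm from biquality-learn:
from sklearn.linear_model import LogisticRegression
from bqlearn.kdr import KKMM

# The base classifier comes first: a positional argument cannot follow a keyword argument.
kkmm = KKMM(LogisticRegression(), kernel="rbf")
# X and y hold trusted and untrusted samples; sample_quality tells them apart.
kkmm.fit(X, y, sample_quality=sample_quality)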
6 scikit-learn’s metadata routing
scikit-learn’s metadata routing is a scikit-learn Enhancement Proposal (SLEP006) describing a system that can be used to seamlessly incorporate various metadata, in addition to the required features and targets, in scikit-learn estimators, scorers, and transformers. biquality-learn uses this design to integrate the sample_quality property into the training and prediction process of biquality learning algorithms. It allows one to use biquality-learn’s algorithms in a similar way to scikit-learn’s algorithms by passing the sample_quality property as an additional argument to the fit, predict, and other methods.
Currently, the main components provided by scikit-learn support this design, and it is already usable for cross-validators. It will be extended to all components in the future, and biquality-learn will significantly benefit from many “free” features. Once the corresponding pull request (https://…learn/pull/24250) is merged, it will be possible to make a bagging ensemble of biquality classifiers thanks to the BaggingClassifier implemented in scikit-learn, without overriding its behavior on biquality data.
from sklearn.ensemble import BaggingClassifier
bag = BaggingClassifier(kkmm).fit(X, y, sample_quality=sample_quality)
7 Cross-Validating Biquality Classifiers
Any cross-validator that works for usual Supervised Learning can also be used for Biquality Learning. However, when splitting the data into a train and a test set, untrusted samples need to be removed from the test set to avoid computing supervised metrics on corrupted labels. That is why biquality-learn provides make_biquality_cv to post-process any scikit-learn compatible cross-validator.
Here is an example of how to use scikit-learn’s RandomizedSearchCV class to perform hyperparameter search for a biquality learning algorithm in biquality-learn:
from scipy.stats import loguniform  # replaces the sklearn.utils.fixes backport
from sklearn.model_selection import RandomizedSearchCV
from bqlearn.model_selection import make_biquality_cv

# Search over the regularization strength of the final estimator.
param_dist = {"final_estimator__C": loguniform(1e3, 1e5)}
random_search = RandomizedSearchCV(
    kkmm,
    param_dist,
    cv=make_biquality_cv(X, sample_quality, cv=3),
)
random_search.fit(X, y, sample_quality=sample_quality)
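To make the post-processing idea concrete, here is a minimal NumPy-only sketch that filters untrusted samples out of the test part of each split. It is not the implementation of make_biquality_cv. The helpers drop_untrusted_from_test and k_fold_indices are hypothetical names introduced for illustration.

```python
import numpy as np


def drop_untrusted_from_test(splits, sample_quality):
    """Yield (train, test) index pairs whose test part holds trusted samples only."""
    sample_quality = np.asarray(sample_quality)
    for train_idx, test_idx in splits:
        # Keep only test indices flagged as trusted (sample_quality == 1).
        trusted_test = test_idx[sample_quality[test_idx] == 1]
        yield train_idx, trusted_test


def k_fold_indices(n_samples, n_splits, seed=0):
    """Plain shuffled K-fold splitter, used here as a stand-in cross-validator."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), n_splits)
    for i in range(n_splits):
        train_idx = np.concatenate([folds[j] for j in range(n_splits) if j != i])
        yield train_idx, folds[i]


# Hypothetical indicator: 6 trusted samples followed by 24 untrusted ones.
sample_quality = np.array([1] * 6 + [0] * 24)
splits = list(drop_untrusted_from_test(k_fold_indices(30, 3), sample_quality))
```

The training part of each split is left untouched, so untrusted samples still contribute to learning. With few trusted samples, a fold may end up with an empty test set in this naive version, which is one reason a splitter that stratifies on the trusted indicator could be preferable.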
8 Simulating Corruptions with the Corruption API
The corruption module in biquality-learn provides several functions to artificially create biquality datasets by introducing synthetic corruption. These functions can be used to simulate various types of label noise or imbalances in the dataset. We hope that the corruption API will ease the benchmarking of biquality learning algorithms, with particular attention to the reproducibility and standardization of these benchmarks for researchers.
Here is a brief overview of the functions available in the corruption module:
• make_weak_labels: Adds weak labels to a dataset by learning a classifier on a subset of the dataset and using its predictions as a new label.
• make_label_noise: Adds noisy labels to a dataset by randomly corrupting a specified fraction of the samples thanks to a given noise matrix.
• make_instance_dependent_label_noise: Adds instance-dependent noisy labels by corrupting samples with a probability depending on the sample and a given noise matrix.
• uncertainty_noise_probability: Computes the probability of corrupting a sample based on the prediction uncertainty of a given classifier [19].
• make_feature_dependent_label_noise: Adds instance-dependent noisy labels by corrupting a specified fraction of the labels with a probability depending on a random linear map between the feature space and the label space [30].
• make_imbalance: Creates an imbalanced dataset by oversampling or undersampling the minority classes.
• make_sampling_biais: Creates a sampling bias by non-randomly sampling a subset of the original dataset. The sampling scheme follows a Gaussian distribution with a shifted mean and scaled variance computed from the first principal component of a PCA learned from the dataset [10].
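To illustrate what corrupting labels with a noise matrix means, here is a minimal NumPy-only sketch. It is not the library's make_label_noise: the function name flip_labels and its signature are hypothetical, labels are assumed to be encoded as integers 0..K-1, and the control over the fraction of corrupted samples is left out.

```python
import numpy as np


def flip_labels(y, noise_matrix, seed=0):
    """Resample each label from the row of noise_matrix indexed by its clean class.

    noise_matrix[i, j] is the probability that clean class i is observed as class j.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    noise_matrix = np.asarray(noise_matrix, dtype=float)
    n_classes = noise_matrix.shape[0]
    # Each row must be a probability distribution over the noisy labels.
    if not np.allclose(noise_matrix.sum(axis=1), 1.0):
        raise ValueError("each row of noise_matrix must sum to 1")
    return np.array([rng.choice(n_classes, p=noise_matrix[label]) for label in y])


# Asymmetric noise: class 0 is flipped to class 1 with probability 0.2,
# while class 1 is never corrupted.
y_clean = np.array([0, 1] * 500)
T = np.array([[0.8, 0.2],
              [0.0, 1.0]])
y_noisy = flip_labels(y_clean, T)
```

Fixing the seed keeps the corrupted labels identical across runs, which matters when benchmarks are meant to be reproduced.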
9 Conclusion
We presented biquality-learn, a Python library for Biquality Learning. We described the design behind its API, which makes it easy to use and consistent with scikit-learn. We notably showed that this design is future-proof by showing how well it integrates with the upcoming design of scikit-learn. In the future, biquality-learn could be complemented by a twin library with the same principled design that commits to a specific deep learning library and thereby brings deep learning capabilities. Finally, biquality-learn could be extended with capabilities that are particularly needed in real-world scenarios, such as evaluating machine learning models on untrusted data.
