Enhancing Sample Utilization in
Noise-robust Deep Metric Learning with
Subgroup-based Positive-pair Selection
Zhipeng Yu, Qianqian Xu, Senior Member, IEEE, Yangbangyan Jiang, Member, IEEE,
Yingfei Sun, and Qingming Huang, Fellow, IEEE
Abstract—The existence of noisy labels in real-world data negatively impacts the performance of deep learning models. Although much research effort has been devoted to improving the robustness towards noisy labels in classification tasks, the problem of noisy labels in deep metric learning (DML) remains under-explored. Existing noisy label learning methods designed for DML mainly discard suspicious noisy samples, resulting in a waste of the training data. To address this issue, we propose a noise-robust DML framework with SubGroup-based Positive-pair Selection (SGPS), which constructs reliable positive pairs for noisy samples to enhance the sample utilization. Specifically, SGPS first effectively identifies clean and noisy samples by a probability-based clean sample selection strategy. To further utilize the remaining noisy samples, we discover their potential similar samples based on the subgroup information given by a subgroup generation module and then aggregate them into informative positive prototypes for each noisy sample via a positive prototype generation module. Afterward, a new contrastive loss is tailored for the noisy samples with their selected positive pairs. SGPS can be easily integrated into the training process of existing pair-wise DML tasks, like image retrieval and face recognition. Extensive experiments on multiple synthetic and real-world large-scale label noise datasets demonstrate the effectiveness of our proposed method. Without any bells and whistles, our SGPS framework outperforms the state-of-the-art noisy label DML methods. Code is available at https://github.com/smuelpeng/SGPS-NoiseFreeDML.
Index Terms—Metric Learning, Noisy Label, Deep Learning,
Pair-wise Loss, Positive-pair Selection
I. INTRODUCTION
Deep learning has achieved remarkable success in various computer vision domains, including image classification [1], image retrieval (IR) [2], and face recognition (FR) [3]. Due to the powerful representation ability, deep neural networks can extract useful information from large-scale annotated datasets. However, human annotations or auto-collected data will inevitably introduce unexpected noise in real-world datasets. The ubiquitous noisy labels might largely degrade the performance of deep learning models.

Z. Yu and Y. Sun are with the School of Electronic, Electrical and Communication Engineering, University of Chinese Academy of Sciences, Beijing 100049, China (E-mail: yuzhipeng21@mails.ucas.ac.cn, yfsun@ucas.ac.cn).
Q. Xu is with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (E-mail: xuqianqian@ict.ac.cn).
Y. Jiang is with the School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China (E-mail: jiangyangbangyan@ucas.ac.cn).
Q. Huang is with the School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 101408, China, and also with the Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China (E-mail: qmhuang@ucas.ac.cn).
Corresponding authors.
Recent noisy label learning research has made great
progress on classification tasks through various methods, such
as sample selection [7]–[10], weight generation [11], [12],
transfer matrix [13]–[15], and semi-supervised methods [16]–
[18]. Nevertheless, little effort has been devoted to the problem
of noisy labels in deep metric learning (DML). In DML, the
test set usually consists of classes that are not present in the
training set. Moreover, DML relies solely on the similarity between sample features to determine sample relationships
instead of the probability prediction of classifiers. Therefore,
many DML tasks prefer to use pair-wise loss functions [2],
[19], [20] to optimize the model, which makes it difficult
to directly apply the classifier-based noisy label learning
methods. What's worse, DML models are more sensitive to label noise than basic classification models [21]. As shown in Fig. 1(a),
we can observe a rapid performance drop with the increase of
uniform label noise level on retrieval tasks. Specifically, even
5% of label noise could cause a relative performance drop of
nearly 10% on image retrieval (CARS) and more than 80%
on face recognition (LFW) tasks. In contrast, the classification
accuracy on ImageNet under the same condition only exhibits
a much smaller drop (less than 2%).
The current state-of-the-art noisy label learning method for
DML is PRISM [4], which mainly relies on prototype similar-
ity ranking to select clean samples and construct reliable sam-
ple pairs for DML training. However, PRISM tends to discard
excessive noisy samples, resulting in a prevalent overfitting
issue. As shown in Fig.1(b), even in the presence of noisy
labels, PRISM could outperform the baseline MCL model
trained on the original clean data in terms of the Precision@1
on the clean training subset used in PRISM. However, on the
test set, we can observe a substantial performance drop for
PRISM during later training stages. This phenomenon can be
attributed to confirmation bias, where the model accumulates
confidence in the selected clean samples and continually fails to explore potentially valuable noisy samples. Therefore, it is
urgent to utilize the ignored noisy samples to benefit the noisy
label DML.
Fig. 1. (a) Performance drop corresponding to the noise level for image classification and DML tasks (image retrieval and face recognition). (b) Precision@1 on the clean training subset and the test set for different methods on CARS. (c) Precision@1 of PRISM [4] and clustering-based denoising methods, i.e., K-means [5] and agglomerative hierarchical clustering (AHC) [6], on CARS with 50% symmetric noise.

The most direct way to involve the noisy samples is to learn
with the pseudo-labels generated by unsupervised methods
like clustering. Nevertheless, clustering-based methods are not
immune to the overfitting problem. We apply K-means [5]
and agglomerative hierarchical clustering (AHC) [6] to gen-
erate pseudo-labels from the features extracted by an early-
stopped PRISM model. Then new models are trained using
these pseudo-labels from scratch. From Fig.1(c), although the
performance is improved at the very beginning, it eventually
suffers from significant degradation. This decline is primarily
attributed to the additional error introduced in cluster labels,
whose accuracy is significantly constrained by the quality of
the pretrained features.
In fact, from the perspective of pairwise learning, the noisy
labels (also including pseudo-labels) will incur both wrong
positive and negative pairs. In DML tasks, positive pairs often
play a more important role than negative ones. Motivated by this, we turn our attention to constructing more informative sample pairs, especially positive ones, for noisy samples.
We propose a noise-robust DML training framework which
enhances the sample utilization by SubGroup-based Positive
pair Selection (SGPS). The whole framework is illustrated
in Fig. 2. Without needing the predictions of classifiers, SGPS introduces a probability-based clean sample selection (PCS) strategy to effectively identify the clean and noisy samples. We simply use the historical features stored in the memory bank to calculate the probability of each sample being clean. The memory bank works as a first-in-first-out queue with a limited size, which stores only the latest sample features.
with clean samples using general DML objectives (denoted
as Lclean), SGPS goes a step further by discovering extra
positive pairs for noisy samples based on the subgroup labels
obtained by a subgroup generation module (SGM). Another
feature bank is used for subgroup generation in SGM, which
stores the features of all training samples and is updated in a
momentum way. Compared with cluster labels, our subgroup
labels integrate the knowledge from both the original labels
and the feature distribution, reducing the incorrect positive and
negative pairs. After that, a positive prototype generation mod-
ule (PPM) is applied to aggregate multiple positive samples
for each noisy sample into positive prototypes. These aggregated pairs are then used to compute a noisy contrastive loss Lnoise for the noisy samples. By adding Lclean and Lnoise
together, both clean and noisy samples can be well utilized
for DML training.
The contributions of this paper are summarized as follows:
• We propose a new noise-robust training framework for DML with SubGroup-based Positive pair Selection (SGPS). Unlike existing methods, SGPS integrates both clean and noisy samples into the training process to enhance the sample utilization.
• At the core of the framework, we introduce SGM to efficiently generate subgroup labels for noisy samples. The subgroup information helps the discovery of true positive pairs without imposing significant time costs during training.
• We then present PPM to aggregate multiple positive samples into a prototype for each noisy sample, leading to more informative positive pairs.
Finally, we provide extensive experimental results to demonstrate the effectiveness of our proposed method. SGPS can be plugged into both existing proxy- and pair-based DML models to improve their performance.
II. RELATED WORK
A. Metric Learning
We broadly categorize deep metric learning into 1) pair-
wise and 2) proxy-based methods. Pair-wise methods [22], [2],
[23], [24] calculate the loss based on the contrast between
positive pairs and negative pairs. Commonly used loss func-
tions include contrastive loss [19], triplet loss [22] and softmax
loss [25]. In this process, identifying informative positive
and negative pairs becomes an important consideration [2].
The problem of instance sampling in DML has also been
considered [26]. Nevertheless, objectives like the triplet loss in fact optimize a distance metric rather than a ranking metric, which is sub-optimal when evaluation uses a ranking metric such as Average Precision (AP) or Area Under
the Precision-Recall Curve (AUPRC). Therefore, a series of
ranking-based methods have been proposed to address the
shortcomings of distance-based pair-wise methods, enabling
neural networks to achieve end-to-end training to directly op-
timize the ranking metrics. As a typical approach, FastAP [27]
optimizes the AP metric using an approximation derived
from distance quantization. Roadmap [28] proposes robust
and decomposable objective to address non-differentiability
and non-decomposability challenges for end-to-end training
of deep neural networks with AP. [29] proposes a sampling-
rate-invariant unbiased stochastic estimator with superior sta-
bility to optimize AUPRC. On the other hand, proxy-based
methods [30]–[32], [33], [34] represent each class with one
or more proxy vectors, and use the similarities between the
input data and the proxies to calculate the loss. Proxies are
learned from data during model training, which could deviate
from the class center under heavy noise and cause performance
degradation. Recently, Contextual [35] introduces a contextual
loss to learn the feature representation and the similarity metric
jointly. Some methods like [23], [36] try to integrate pair-
wise and proxy-based approaches into a unified framework.
MetricFormer [36] considers the correlations from a unified perspective, capturing and modeling the multiple correlations with an elaborate metric transformer. Additionally, CIIM [37]
proposes a causality-invariant interactive mining method to
learn the cross-modal similarity with the causal structure by
modality-aware hard mining and modality-invariant feature
embedding for cross-modal image DML tasks.
B. Noisy Label Learning
Training under noisy labels has been studied extensively for
classification tasks [10], [12], [38]–[42]. A common approach
is to detect noisy labels and exclude them from the training
set. Recent works have also started exploring correcting the
noisy labels [12], [38] or treating noisy data as unlabeled data
for semi-supervised learning [16]. F-correction [13] models
label noise with a class transition matrix. MentorNet [11]
trains a teacher network that provides a weight for each sample
to the student network. Co-teaching [7] trains two networks
concurrently. Samples identified as having small losses by
one network are used to train the other network. This is
further improved in [43] by training on samples with small
losses and different predictions from the two networks. Many
recent works [16], [18], [42], [44], [45] filter noisy data based
on the predicted class output by the current classification
model and introduce semi-supervised learning to improve the
robustness of the model. Among them, Unicon [46] employs
a semi-supervised approach with class balance constraints,
enhancing the performance of classical semi-supervised based
methods like DivideMix [16]. Sel-CL [9] is based on selective-
supervised contrastive learning and reliable sample and pair selection. Seeing the recent advances in diffusion models,
LRA-Diffusion [47] attempts to combine pretrained features
and the diffusion model to infer real labels directly and
achieves the state-of-the-art performance. Early stopping [48],
[49] acts as a simple yet effective method to reduce the
impact of noisy samples by stopping the training process when
the model starts to overfit the noisy samples. Specifically,
LabelWave [49] simplifies the early stopping method which
does not require validation data for selecting the desired model
in the presence of label noise. These studies could exhibit
a good label-noise robustness on the image classification
datasets. Besides, unsupervised clustering-based methods [5],
[6], [50]–[52] are also widely used to generate pseudo-labels
for noisy samples.
C. Noisy Labels in Metric Learning
Previous work on noisy labels in metric learning is lim-
ited. [53] estimates the posterior label distribution using a
shallow Bayesian neural network, which may not scale well
to deeper network architectures. [54] uses the pair-wise loss
and trains a proxy for each class simultaneously to adjust
the weights of the outliers. In [55], noisy data in DML is
handled by first performing label cleaning using a model
trained on a clean dataset, which may not be available in real-
world applications. Recently, methods based on hyperbolic
embedding space [56], [57] and hierarchical labels [58] have
been proposed to improve the robustness of metric learning
on unsupervised and noisy label settings. There are also
many other ways to reduce the impact of noisy samples. For
example, One4More [59] learns a data sampler to reduce the
sampling frequency on noisy samples. RTMem [60] updates
the cluster centroid with a randomly sampled instance feature
in the current mini-batch without momentum to reduce the
mismatching between the model and feature memory. LaCoL
[61] proposes latent contrastive learning to leverage the neg-
ative correlations from the noisy data to guarantee the robust
generalization of DML models. LP [62] introduces multi-
view features and teacher-student distillation to purify noise in
pseudo labels. PRISM [4] uses memory features of the same
category to identify noisy samples and select clean samples
to construct reliable sample pairs for DML training, which
achieves state-of-the-art performance on noisy DML tasks.
Our work shares a similar motivation to the PRISM method.
Nevertheless, we resample positive pairs for noisy samples
instead of discarding them.
III. APPROACH
A. Problem Setting
Let {(x_i, y_i)}_{i=1}^{N} denote the set of N training samples and the corresponding annotated labels from M classes. The d-dimensional features are extracted by a neural network f parameterized by θ. Define the similarity between two samples as $S(f_\theta(x_i), f_\theta(x_j)) = \frac{f_\theta(x_i)^{\top} f_\theta(x_j)}{\|f_\theta(x_i)\|\,\|f_\theta(x_j)\|}$ (denoted as Sij for short). Then, the objective of DML is to maximize the
similarity between positive pairs (sample pairs with the same
label) and minimize the similarity between negative pairs
(sample pairs with different labels). Note that in DML, the
positive pairs are usually much less than the negative ones and
play a more critical role in learning. Recent pair-wise DML
methods such as MS loss [2] and Circle loss [23] well capture
pairwise similarity information by adjusting different weights
or assigning different margins for positive and negative pairs.
However, existing DML methods are sensitive to label noise.
The incurred false positive and negative pairs, especially the
former, could lead to a sharp performance drop.
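For concreteness, the following sketch (ours, not from the paper's released code) computes the similarity matrix Sij for a batch and the positive/negative pair masks implied by the annotated labels; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_and_masks(features, labels):
    """Cosine similarity S_ij plus positive/negative masks from annotated labels."""
    z = F.normalize(features, dim=1)          # f_theta(x_i) / ||f_theta(x_i)||
    S = z @ z.t()                             # S_ij for all pairs in the batch
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=labels.device)
    pos_mask = same & ~eye                    # positive pairs: same label, i != j
    neg_mask = ~same                          # negative pairs: different labels
    return S, pos_mask, neg_mask
```

Under label noise, some entries of these masks are wrong, which is exactly the failure mode the rest of this section addresses.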
Current methods for handling label noise in DML, such
as PRISM [4], mainly focus on filtering clean samples for
training. For each sample x_i, they calculate a confidence score pclean(i) of how likely the sample is correctly labeled. If pclean(i) > t, where t is a threshold that changes during training, the sample is added into the clean-labeled subset Bclean. Then DML losses such as the contrastive loss [19] can be applied to Bclean. A typical shortcoming of this type of method is that it might discard a substantial number of samples, potentially fostering confirmation bias. Therefore, we propose to discover more positive pairs from the filtered-out noisy samples for learning.

Fig. 2. The framework of our proposed method. The input batch will first be fed into the feature extractor network to obtain the features. Then, inputs will be separated into a clean set and a noisy set by the PCS module. Samples in the clean set will be used to compute Lclean. Based on the subgroup labels generated by SGM, samples in the noisy set also obtain corresponding positive pairs Pi. PPM will aggregate Pi to generate a positive prototype ri to compute Lnoise with the noisy samples.
B. Overview
An overview of the proposed SGPS framework is illustrated
in Fig. 2. For each batch, we first perform probability-based
sample selection to divide samples in a batch into a clean-
labeled subset Bclean and a noisy-labeled subset Bnoise. For
Bclean, we can directly compute a commonly-used DML loss Lclean. Both pair-wise and proxy-based DML losses can be selected as Lclean. Moreover, for each instance in Bnoise, we obtain additional positive samples from the corresponding subgroup using the SGM. In this way, we can obtain a set
of positive pairs regardless of the original annotated labels in
Bnoise. To better utilize the selected positive pairs, a PPM is
designed to aggregate the positive pairs to generate positive
prototypes for noisy samples. Finally, we can compute the
Lnoise based on Bnoise.
C. Probability-based Clean-sample Selection
Inspired by PRISM [4], we adopt a probability-based clean-
sample selection (PCS) strategy to identify clean samples in
a mini-batch, together with a memory bank storing historic
sample features. Note that the class membership could be
indicated by the similarity between a sample and all class
centroids. If a sample is the most similar to the class consistent
with its annotation, then it is likely to be clean. Therefore,
based on the similarity to each class centroid calculated using
the memory bank, we could compute the probability that a
sample’s label is clean as follows:
\[
p_{\mathrm{clean}}(i) = \frac{\exp\!\left(w_{y_i}^{\top} \frac{f_\theta(x_i)}{\|f_\theta(x_i)\|}\right)}{\sum_{m \in [M]} \exp\!\left(w_m^{\top} \frac{f_\theta(x_i)}{\|f_\theta(x_i)\|}\right)}, \tag{1}
\]
where w_m is the centroid of the m-th class obtained from the memory bank. This may be seen as modeling the posterior probability of the data label. A smooth top-R (sTRM) trick [4] is also adopted to adjust the threshold t for noisy data identification. Let Q_j be the R-th percentile of the pclean(i) values in the j-th mini-batch. Then the threshold t is defined as:
\[
t = \frac{1}{\omega} \sum_{j = \mathrm{iter} - \omega}^{\mathrm{iter}} Q_j, \tag{2}
\]
where ω is the sliding window size of training iterations and R is a predefined noise ratio. For data points with a high probability of being clean, i.e., pclean(i) ≥ t, a common DML loss Lclean will be calculated, and their features will also be used to update the memory bank.
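As an illustration of PCS, the sketch below computes pclean via Eq. (1) from class centroids derived from the memory bank and keeps a sliding window of per-batch R-th percentiles for the threshold t of Eq. (2). It is a simplified reading under our own assumptions (plain softmax over precomputed centroids, a Python deque for the window), not the authors' implementation; class names and defaults are illustrative.

```python
import torch
from collections import deque

class ProbCleanSelector:
    """Sketch of probability-based clean-sample selection (PCS), Eqs. (1)-(2)."""
    def __init__(self, noise_ratio_R=0.5, window=10):
        self.noise_ratio_R = noise_ratio_R          # predefined noise ratio R
        self.percentiles = deque(maxlen=window)     # sliding window of Q_j values

    def p_clean(self, feats, labels, centroids):
        # feats: (B, d) L2-normalized features; centroids: (M, d) class centroids w_m
        logits = feats @ centroids.t()              # w_m^T f / ||f||
        probs = torch.softmax(logits, dim=1)        # Eq. (1)
        return probs[torch.arange(len(labels)), labels]

    def select(self, feats, labels, centroids):
        p = self.p_clean(feats, labels, centroids)
        q_j = torch.quantile(p, self.noise_ratio_R) # R-th percentile in this mini-batch
        self.percentiles.append(q_j)
        t = torch.stack(list(self.percentiles)).mean()   # Eq. (2): average over the window
        return p >= t                               # clean mask for the batch
```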
D. Subgroup Generation Module
The subgroup generation module (SGM) is designed to
generate high-confidence candidate positive pairs for Bnoise.
SGM works as a backend server consisting of four parts: (1)
feature bank, (2) intra-class splitting submodule, (3) bottom-up
subgroup generation submodule, and (4) complementary top-
down subgroup generation submodule. The wrongly-annotated
labels will incur two types of incorrect pairs in DML:
• false negative pairs, in which the samples annotated as different classes actually belong to the same ground-truth category;
• false positive pairs, where the samples with the same annotated labels belong to different ground-truth categories.
Therefore, we first design a strategy to mitigate the false positive pairs by splitting the noisy samples labeled as the same category into multiple cleaner subgroups. As for false negative pairs, we propose two sub-cluster merging strategies: bottom-up subgroup generation and complementary top-down subgroup generation. The two kinds of obtained subgroup labels will be used to generate positive pairs for each noisy sample. The whole pipeline of SGM is shown in Fig. 3.

Fig. 3. The workflow of SGM. SGM maintains two kinds of subgroup labels for each sample in the dataset. Once new features are added to the feature bank, the subgrouping modules conditionally recompute the subgroup labels and update them in an asynchronous manner. The subgroup labels will be used to select positive pairs for each noisy sample in a batch.
Feature Bank. Motivated by previous work like MoCo [63]
and ODC [64], we maintain a feature bank that stores features
and labels of the entire dataset. Each time a sample is extracted
by the network forward pass with L2 normalization, its feature
is used to update the sample memory:
\[
F_m(x_i) \leftarrow \alpha \frac{f_\theta(x_i)}{\|f_\theta(x_i)\|_2} + (1-\alpha)\, F_m(x_i), \tag{3}
\]
where α is a momentum parameter. Considering the large scale of training data in current DML tasks, in order to conserve training resources and speed up the training process, we deploy the feature bank on additional training nodes in a distributed manner.
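The momentum update of Eq. (3) can be kept in a simple per-sample buffer, as in the sketch below; this is a single-node, in-memory version (the paper's distributed deployment is omitted) and all names are illustrative.

```python
import torch
import torch.nn.functional as F

class FeatureBank:
    """Sketch of the momentum feature bank in Eq. (3)."""
    def __init__(self, num_samples, dim, momentum=0.5):
        self.feats = F.normalize(torch.randn(num_samples, dim), dim=1)
        self.labels = torch.full((num_samples,), -1, dtype=torch.long)
        self.m = momentum                      # momentum parameter alpha

    @torch.no_grad()
    def update(self, indices, features, labels):
        f = F.normalize(features, dim=1)       # L2-normalized forward-pass features
        # F_m(x_i) <- alpha * f_hat + (1 - alpha) * F_m(x_i)
        self.feats[indices] = self.m * f + (1 - self.m) * self.feats[indices]
        self.labels[indices] = labels
```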
Intra-class splitting. This submodule aims at mitigating the
false positive pairs by splitting samples labeled as the same
category into multiple cleaner subgroups. Samples within
the same subgroup are more likely to belong to the same
real category. Specifically, we first calculate the intra-class
similarity matrix Smfor all samples annotated as the m-th
class. For the i-th sample, we select its 1-nearest-neighbor
and other samples very close to it as candidates that might
belong to the same subgroup. This subsequently constructs an
intra-class adjacency matrix Wm:
Wm
ij =(1,if Sm
ij = maxkSm
ik or Sm
ij > λmax
0,otherwise ,(4)
where λmax is a predefined threshold. Besides, we further
set a lower bound λmin for Sm
ij to eliminate outliers. We then
apply a common Connected Components Labeling (CCL) [65]
process to obtain subgroups {Cm
i}. Here, we refer to the
subgroup with the largest number of samples for each class
mas its meta cluster, denoted as Cm⋆. The whole pipeline
is shown in Algorithm 1. After this process, we can obtain a
set of subgroups for each category m. However, directly using
Algorithm 1 Intra Class Splitting
Input: Training set S={(xi, yi)}N
i; The lower and upper
similarity thresholds λmin, λmax
1: for each class m[M]do
2: Select all xiwith yi=mfrom S
3: Calculate the corresponding similarity matrix Sm
4: Construct the adjacency matrix Wmbased on Eq. (4)
5: Set Wm
ij = 0 if Sm
ij < λmin Remove outliers
6: Acquire the subgroups {Cm
i}based on the connected
components of Wm
7: Choose the meta cluster Cm⋆ = arg maxi|Cm
i|
8: end for
Output: All the subgroups C=SM
mCm
these subgroups to build positive pairs is still insufficient. On
the other hand, the total number of subgroups is much larger
than that of categories M, and different subgroups may still
belong to the same category, resulting in extra false negative
pairs.
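A minimal single-class sketch of Algorithm 1 is given below, assuming L2-normalized features of one annotated class and using SciPy's connected-components routine for the CCL step; the thresholds and names are placeholders rather than the paper's settings.

```python
import numpy as np
from scipy.sparse.csgraph import connected_components

def intra_class_split(feats, lam_min=0.4, lam_max=0.8):
    """Sketch of Algorithm 1 for one class: feats is an (n, d) array of
    L2-normalized features all annotated with the same label."""
    S = feats @ feats.T                                   # intra-class similarity S^m
    np.fill_diagonal(S, -np.inf)                          # exclude self-similarity
    nn = S.argmax(axis=1)                                 # 1-nearest neighbor
    W = (S > lam_max).astype(np.int8)                     # very close samples, Eq. (4)
    W[np.arange(len(S)), nn] = 1                          # ... plus the 1-NN edge
    W[S < lam_min] = 0                                    # remove outlier edges
    W = np.maximum(W, W.T)                                # symmetrize for undirected CCL
    _, subgroup = connected_components(W, directed=False) # CCL -> subgroup ids
    sizes = np.bincount(subgroup)
    meta = sizes.argmax()                                 # meta cluster C^{m*}
    return subgroup, meta
```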
Bottom-up subgroup generation. Initially, each small subgroup is treated as a separate clean cluster, resulting in a total of M_c clusters. We first calculate the similarity between each pair of cluster centroids:
\[
\bar{S}_{ij} = \frac{\bar{f}_i \cdot \bar{f}_j}{\|\bar{f}_i\| \cdot \|\bar{f}_j\|}, \tag{5}
\]
where \bar{f}_i is the centroid (mean feature) of cluster i. Following the hierarchical bottom-up subgrouping approach [66], [67], we merge the two most similar clusters at each step and then recalculate the similarity between the merged cluster and the other clusters. The number of clusters decreases by one after each merge step. This merging process continues until the number of clusters is no more than τk or the similarity between the two most similar clusters is less than λ′min. The whole process is shown in Algorithm 2. During this process, potential disruptions might be caused by highly-ranked noise clusters. Such noisy clusters are often highly similar to other clusters, which may result in a supercluster containing multiple categories. To tackle this problem, we establish a set of rules to avoid over-merging:
• Meta group exclusion: a meta cluster is not allowed to be merged with other meta clusters unless their similarity is greater than λ′max.
• Class size control: the number of samples in the merged cluster cannot exceed τmax.
At the end of bottom-up subgroup generation, we can obtain the bottom-up subgroup labels cB.
Algorithm 2 Bottom-up Subgroup Generation
Input: Initial subgroups {C_i}_{i=1}^{M_c}; the minimum number of subgroups after merging τk; the maximum sample number in each subgroup τmax; the threshold λ′min
1: Set M′_c = M_c
2: Compute the centroid \bar{f}_i of each subgroup
3: Compute the similarity \bar{S}_{ij} between each pair of cluster centroids
4: repeat
5:   Find the two subgroups C_i, C_j with the largest similarity
6:   if not both C_i and C_j are meta clusters then
7:     if |C_i| + |C_j| < τmax then
8:       Merge C_i, C_j into a new subgroup C_h
9:       Compute the similarity between C_h and the other subgroups
10:      Set M′_c = M′_c − 1
11:    end if
12:  end if
13:  Set \bar{S}_{ij} = 0   ▷ Ignore the current subgroup pair
14: until M′_c < τk ∨ ∀i, j : \bar{S}_{ij} < λ′min
Output: Subgroup labels for each sample cB = {cB_i}_{i=1}^{N}

Complementary top-down subgroup generation. The bottom-up merging procedure heavily relies on the feature quality and the density distribution of samples. Once high-similarity noisy clusters are merged, more noisy samples will be included in the merged cluster, which finally results in a super-large group containing multiple categories. Positive groups with low similarity are often unable to merge effectively, leading to the omission of many potential positive samples. To address this issue, we propose a complementary top-down subgroup generation procedure to discover more underlying positive samples. In this method, we first assume that all clusters belong to one cell, then apply a recursive partitioning method to divide the entire cluster vector space into small cells. In each iteration, we select a cell containing more than B samples, then randomly select a pair of clusters from the cell (giving preference to meta clusters, if any), and compute a hyperplane that separates the selected pair with the maximum margin. The hyperplane's normal vector is:
\[
h = \frac{\bar{f}_i - \bar{f}_j}{2}. \tag{6}
\]
This hyperplane divides the cell into two parts. This process continues until there are no cells containing more than two meta clusters and no cells containing more than B clusters. The overall division process is listed in Algorithms 3 and 4. Finally, we can obtain the final top-down subgroup labels cT.

Algorithm 3 Top-Down Division
Input: A set of subgroups C = {C_i}_{i=1}^{M_c}; the maximum number of samples in one cell B
1: if Σ_{i=1}^{M_c} |C_i| > B ∧ M_c > 1 then
2:   if more than one meta cluster exists in C then
3:     Sample two different meta clusters C⋆_i, C⋆_j
4:   else
5:     Sample two different subgroups C_i, C_j
6:   end if
7:   Get the hyperplane vector h = (\bar{f}_i − \bar{f}_j)/2
8:   C_l = {C_k : \bar{f}_k^{\top} h ≥ 0}
9:   C_r = {C_k : \bar{f}_k^{\top} h < 0}
10:  C_l_sub = Top-Down Division(C_l, B)
11:  C_r_sub = Top-Down Division(C_r, B)
12:  C_sub = {C_l_sub, C_r_sub}
13: else
14:  C_sub = {C}
15: end if
Output: C_sub

Algorithm 4 Complementary Top-Down Subgroup Generation
Input: A set of subgroups C = {C_i}_{i=1}^{M_c}; the maximum number of samples in one cell B
1: Compute and normalize the centroid \bar{f}_i of each subgroup
2: C_sub = Top-Down Division(C, B)
Output: Cluster labels for each sample cT = {cT_i}_{i=1}^{N}
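The greedy merging loop of Algorithm 2, including the meta-group-exclusion and class-size-control rules, can be sketched as follows. This is a simplified, unoptimized reading (pairwise similarities are recomputed naively, and the default thresholds are placeholders, not the paper's values), not the authors' implementation.

```python
import numpy as np

def bottom_up_merge(centroids, sizes, is_meta,
                    tau_k=100, tau_max=500, lam_min=0.3, lam_max=0.9):
    """Greedy sketch of the bottom-up merging (Algorithm 2). Inputs are per-subgroup
    centroids (rows, roughly L2-normalized), sample counts, and meta-cluster flags."""
    C = {i: np.asarray(c, dtype=float) for i, c in enumerate(centroids)}
    N = {i: int(n) for i, n in enumerate(sizes)}
    meta = {i: bool(m) for i, m in enumerate(is_meta)}
    members = {i: [i] for i in C}              # merged cluster -> original subgroups
    skipped = set()                            # pairs ruled out by the merge rules

    def sim(a, b):                             # cosine similarity of centroids, Eq. (5)
        return float(C[a] @ C[b] / (np.linalg.norm(C[a]) * np.linalg.norm(C[b]) + 1e-12))

    while len(C) > tau_k:
        pairs = [(sim(a, b), a, b) for a in C for b in C if a < b and (a, b) not in skipped]
        if not pairs:
            break
        s, a, b = max(pairs)
        if s < lam_min:                        # remaining clusters are too dissimilar
            break
        if (meta[a] and meta[b] and s <= lam_max) or N[a] + N[b] > tau_max:
            skipped.add((a, b))                # meta-group exclusion / class-size control
            continue
        C[a] = (C[a] * N[a] + C[b] * N[b]) / (N[a] + N[b])   # size-weighted new centroid
        N[a] += N[b]
        meta[a] = meta[a] or meta[b]
        members[a] += members.pop(b)
        del C[b], N[b], meta[b]
        skipped = {(x, y) for (x, y) in skipped if b not in (x, y)}

    return {sub: cid for cid, subs in members.items() for sub in subs}   # c^B labels
```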
E. Positive Prototype Generation Module
The subgroup labels cB, cT obtained from SGM provide extra cleaner supervision information. For each noisy sample, compared with directly sampling its positive samples based on the annotated labels, sampling based on subgroup labels is more likely to construct true positive pairs. Therefore, we propose to select extra K positive samples, denoted as Pi, based on cB and cT for the i-th noisy sample in Bnoise. To further eliminate the impact of possible noisy samples in Pi without losing valuable hard positive samples, we propose to aggregate the features of the K positive samples into a single reliable prototype ri. Several prototype aggregation methods are designed as follows:
• Mean: A straightforward approach is taking the mean of the features of all positive samples corresponding to the i-th noisy sample as the positive prototype:
\[
r^{i}_{\mathrm{mean}} = \frac{1}{K} \sum_{k=1}^{K} f(x_k), \tag{7}
\]
where x_k ∈ Pi.
• Max: Instead of treating all positive samples equally, we choose the sample most similar to the noisy sample as the positive prototype:
\[
j^{\star} = \underset{j \in [K]}{\arg\max}\; S_{ij}, \qquad r^{i}_{\max} = f(x_{j^{\star}}). \tag{8}
\]
• SoftMax: In order to better utilize the information of all positive samples, we conduct the aggregation by weighting them based on their correlation. Concatenating the extracted features from Pi as a feature matrix F^{i} ∈ R^{K×d}, we then have
\[
\mathrm{Corr}^{i} = F^{i} {F^{i}}^{\top} - I, \qquad r^{i}_{\mathrm{soft}} = \mathrm{softmax}\!\left(\tfrac{1}{K}\,\mathrm{Corr}^{i} \cdot \mathbf{1}\right) \cdot F^{i}, \tag{9}
\]
where I is the K×K identity matrix and 1 is a K×1 all-one vector. Namely, the weight of each sample in Pi is measured by summing up its correlations to the other samples in this set.
• TransProto: In addition to the fixed aggregation strategies, we introduce a novel learnable mechanism based on a transformer for enhanced flexibility in aggregation. As illustrated in Fig. 4, we concatenate the extracted features of Pi to create an input sequence for a 3-layer cross-attention-based transformer. It is equipped with the cross-batch memory bank M to store historical features. Leveraging the historical features of samples within the same subgroup as Pi as the conditioning features for cross-attention, we sum the transformer's outputs and normalize them to obtain the prototype ri. Unlike the aforementioned methods that merely assign weights to existing positive samples, TransProto works as a learned mechanism that leverages the powerful information extraction capabilities of cross-attention. This allows it to generate novel prototypes distinct from the original positive feature distribution. As a result, TransProto exhibits the ability to handle more complex noisy label scenarios.

Fig. 4. Visualization of the TransProto PPM. A 3-layer transformer-based model learns to aggregate the features of positive samples into the prototype ri.

The cross-batch memory in Fig. 4 is actually an extension
of XBM [20], which is maintained as a queue with the current
mini-batch enqueued and the oldest mini-batch dequeued. In
addition to features which are the only contents in XBM,
our cross-batch memory also stores the indicator of being
clean samples generated by the PCS module together with
the corresponding noisy labels and subgroup labels. These
contents will be also used to update the centroids in PCS,
generate positive pairs for noisy samples in PPM and provide
negative pairs for the memory-bank-based contrastive loss
(MCL).
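The three fixed aggregation strategies of Eqs. (7)–(9) can be summarized in a few lines; the learnable TransProto variant is omitted here. The sketch below is illustrative only, and the trailing L2 normalization is our assumption rather than something stated for Eqs. (7)–(9).

```python
import torch
import torch.nn.functional as F

def aggregate_prototype(noisy_feat, pos_feats, mode="softmax"):
    """Fixed prototype aggregation over the K selected positives (Eqs. (7)-(9)).
    noisy_feat: (d,) feature of the noisy sample; pos_feats: (K, d) features of Pi."""
    if mode == "mean":                                        # Eq. (7)
        r = pos_feats.mean(dim=0)
    elif mode == "max":                                       # Eq. (8)
        sims = pos_feats @ noisy_feat                         # similarity to each positive
        r = pos_feats[sims.argmax()]
    elif mode == "softmax":                                   # Eq. (9)
        K = pos_feats.size(0)
        corr = pos_feats @ pos_feats.t() - torch.eye(K, device=pos_feats.device)
        w = torch.softmax(corr.sum(dim=1) / K, dim=0)         # softmax((1/K) Corr . 1)
        r = w @ pos_feats
    else:
        raise ValueError(f"unknown mode: {mode}")
    return F.normalize(r, dim=0)                              # assumption: keep prototype unit-norm
```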
F. Loss Functions
Loss for Bclean: The traditional pair-wise contrastive loss function computes similarities between all pairs of data samples. The loss function encourages f(·) to assign small distances between samples in the same class and large distances between samples from different classes. More formally, a typical contrastive loss (CL) over Bclean is:
\[
\mathcal{L}_{\mathrm{batch}} = \sum_{\substack{i,j \in \mathcal{B}_{\mathrm{clean}} \\ y_i \neq y_j}} \left[ S_{ij} - \lambda \right]_{+} \; - \sum_{\substack{i,j \in \mathcal{B}_{\mathrm{clean}} \\ y_i = y_j}} S_{ij}, \tag{10}
\]
where λ ∈ [0, 1] is the predefined margin hyperparameter and [x]_+ = max(x, 0). With the cross-batch memory M (the same one used in PCS and PPM) that stores the features v of data samples, we can obtain more positive and negative pairs in the loss, which may reduce the variance in the gradient estimation. The memory-bank-based contrastive loss (MCL) [20] can be written as:
\[
\mathcal{L}_{\mathrm{bank}} = \sum_{i \in \mathcal{B}_{\mathrm{clean}}} \Bigg( \sum_{\substack{j \in \mathcal{M} \\ y_i \neq y_j}} \left[ S(f_\theta(x_i), v_j) - \lambda \right]_{+} \; - \sum_{\substack{j \in \mathcal{M} \\ y_i = y_j}} S(f_\theta(x_i), v_j) \Bigg). \tag{11}
\]
The total loss over clean samples is the sum of the basic batch loss Lbatch and the memory bank loss Lbank, which is denoted as:
\[
\mathcal{L}_{\mathrm{clean}} = \mathcal{L}_{\mathrm{batch}} + \mathcal{L}_{\mathrm{bank}}. \tag{12}
\]
Other novel DML methods like SoftTriple [32] can also be
used as the loss function for clean samples.
Loss for Bnoise: We introduce a novel loss function for samples in Bnoise based on the prototypes generated in PPM. Denote the feature of the i-th sample in Bnoise as z_i = f_θ(x_i) and let r_i be the corresponding positive prototype generated by PPM from Pi. The loss function for each noisy sample is defined as:
\[
\mathcal{L}^{\mathrm{batch}}_{\mathrm{SGPS}} = -\log \frac{\exp\!\left((z_i \cdot r_i - \delta)/\tau\right)}{\exp\!\left((z_i \cdot r_i - \delta)/\tau\right) + \sum_{z_j \in \mathcal{N}_i} \exp\!\left(z_i \cdot z_j / \tau\right)}, \tag{13}
\]
where τ is the temperature hyperparameter and δ is a margin parameter to control the distance between positive and negative samples. N_i is the set of negative samples for the i-th sample, which is defined as:
\[
\mathcal{N}_i = \left\{ z_j \;\middle|\; y_j \neq y_i \,\wedge\, c^{B}_{j} \neq c^{B}_{i} \,\wedge\, c^{T}_{j} \neq c^{T}_{i},\; j \in \mathcal{B} \right\}. \tag{14}
\]
Similar to Lbank, we employ the feature memory bank M to further increase the number of negative pairs in the loss. The memory-bank-based loss can be defined as:
\[
\mathcal{L}^{\mathrm{bank}}_{\mathrm{SGPS}} = -\log \frac{\exp\!\left((z_i \cdot r_i - \delta)/\tau\right)}{\exp\!\left((z_i \cdot r_i - \delta)/\tau\right) + \sum_{z_j \in \mathcal{N}^{\mathrm{bank}}_i} \exp\!\left(z_i \cdot z_j / \tau\right)}, \tag{15}
\]
where N^bank_i is the set of negative samples for the i-th sample, defined similarly to Eq. (14) except that j is drawn from M. Therefore, the loss over the noisy subset is:
\[
\mathcal{L}_{\mathrm{noise}} = \gamma_1 \cdot \mathcal{L}^{\mathrm{batch}}_{\mathrm{SGPS}} + \gamma_2 \cdot \mathcal{L}^{\mathrm{bank}}_{\mathrm{SGPS}}, \tag{16}
\]
where γ1 and γ2 are loss weights that balance the noisy-sample loss on the batch and the bank.
Overall. Putting all together, the total objective of SGPS is:
\[
\mathcal{L}_{\mathrm{all}} = \mathcal{L}_{\mathrm{clean}} + \mathcal{L}_{\mathrm{noise}}. \tag{17}
\]
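For reference, the per-sample noisy loss of Eq. (13) can be sketched as below; the memory-bank variant of Eq. (15) only changes where the negatives come from, and Lnoise then weights the two terms by γ1 and γ2. The values of τ and δ here are placeholders, not the paper's settings.

```python
import torch

def sgps_noisy_loss(z, r, negatives, tau=0.1, delta=0.1):
    """Sketch of the per-sample noisy contrastive loss in Eq. (13).
    z: (d,) noisy-sample feature z_i; r: (d,) its positive prototype r_i;
    negatives: (n, d) features z_j in N_i."""
    pos = torch.exp((z @ r - delta) / tau)            # exp((z_i . r_i - delta)/tau)
    neg = torch.exp((negatives @ z) / tau).sum()      # sum over z_j in N_i
    return -torch.log(pos / (pos + neg))
```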
IV. EXPERIMENTS
To evaluate the effectiveness of the proposed approach
on noisy DML tasks, we compare it against 13 baseline
methods on both synthetic and real-world image retrieval
benchmarks. Further, we conduct experiments on large-scale
face recognition tasks to demonstrate the generalization of our
approach.
A. Datasets
The evaluation is conducted on seven image retrieval or face recognition benchmark datasets, including:
CARS [68] contains 16,185 images of 196 different car
models. We use the first 98 models for training, the rest
for testing.
CUB [69] contains 11,788 images of 200 different bird
species. We use the first 100 species for training and the
rest for testing.
Stanford Online Products (SOP) [70] contains 120,053 images of 22,634 products sold on eBay. We use 59,551 images of 11,318 classes for training and the rest for testing.
Food-101N [71] is a real-world noisy dataset that con-
tains 310,009 images of food recipes in 101 classes. It
has the same 101 classes as Food-101 [72] (which is
considered a clean dataset). We use 144,086 images in
the first 50 classes (in alphabetical order) as the training
set, and the remaining 51 classes in Food-101 as the test
set which contains 51,000 images.
CARS-98N [4] is a real-world noisy dataset that contains
9,558 images for 98 car models. The noisy images often
contain the interior of the car, car parts, or images of other
car models. The CARS-98N is only used for training, and
the test set of CARS is used for performance evaluation.
MS1MV0 [73] contains 10M face images of 100K iden-
tities collected from the search engine based on a name
list, in which there is around 50% noise.
MS1MV2 [3] contains 5.8M face images of 85K identi-
ties, which is the cleaned version of MS1MV0 [73] by a
semi-automatic pipeline.
Clothing1M [74] is a large-scale real-world dataset with
noisy labels. It contains 1M images from 14 different
cloth-related classes. We use the 1M noisy images for
training, and the same test set as used for classification, which contains 10,525 clean images, for testing.
We employ two types of synthetic label noise: 1) symmetric
noise [75] and 2) small cluster noise [4]. Symmetric noise
is widely utilized to assess the robustness of classification
models. Under this model, a predefined portion of data from
each ground truth class is assigned to all other classes with
equal probability, irrespective of the similarity between data
samples. The number of classes remains unchanged after
applying this noise synthesis. On the other hand, small cluster
noise emulates naturally occurring label noise by iteratively
flipping labels. In each iteration, images are clustered from
a randomly selected ground-truth class into numerous small
clusters, and each cluster is then merged into another randomly
selected ground-truth class. This process creates an open-set
noisy label scenario [39], where some images do not belong
to any other existing class in the training set.
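For reproducibility of the symmetric-noise setting, a minimal injection routine might look as follows (our sketch; the small-cluster noise of [4] requires clustering and is not shown).

```python
import numpy as np

def add_symmetric_noise(labels, noise_rate, num_classes, seed=0):
    """Sketch of symmetric label-noise injection: a fixed fraction of samples is
    reassigned uniformly to one of the other classes (class count unchanged)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    flip = rng.random(len(labels)) < noise_rate           # which samples get corrupted
    for i in np.where(flip)[0]:
        choices = [c for c in range(num_classes) if c != labels[i]]
        labels[i] = rng.choice(choices)                   # uniform over the other classes
    return labels
```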
B. Baselines
We compare SGPS against 19 baseline approaches on im-
age retrieval tasks, including (1) noise-resistant classification
methods: Co-teaching [7], Co-teaching+ [43], Co-teaching
with Temperature [30], F-correction [13], Unicon [46],
DivideMix [16], LRA-Diffusion [47] and LabelWave [49].
(2) DML methods with proxy-based losses: SoftTriple [32],
FastAp [27], nSoftmax [30] and proxyNCA [31]; (3) DML
methods with pair-wise losses: MS loss [2], circle loss [23],
contrastive loss [19], memory contrastive loss (MCL) [20],
SupCon [76], Roadmap [28], Contextual [35] and PRISM [4].
Among noise-resistant classification baselines, F-correction
assumes closed-world noise and can only be used under
symmetric noise. We train the classification baselines using
the cross-entropy loss and use the l2-normalized features from
the penultimate layer when retrieving images during inference.
On the other hand, to demonstrate the applicability of our
framework, we instantiated our SGPS on four DML losses:
MCL, SupCon, Roadmap and Contextual.
For all experiments on image retrieval, a consistent batch size of 64 is used, and training is conducted on one NVIDIA RTX 3090 GPU. During training, the input images are first resized to
256×256, then randomly cropped to 224×224. A horizontal
flip is performed on the training data with a probability of
0.5. The validation and testing images are resized to 224×224
without data augmentation. Following [20], when comparing
performance on CARS, CUB and CARS-98N, we use BN-
inception [77] as the backbone model. The dimension of
the output feature is set as 512. For SOP and Food-101N
datasets, we use ResNet-50 [1] with a 128-dimensional output.
In terms of evaluation metrics, we base our assessment on
the ranked list of nearest neighbors for test images. Specif-
ically, Precision@1 (P@1) and Mean Average Precision@R
(MAP@R) [78] are adopted as our evaluation metrics.
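As a reference for the evaluation protocol, Precision@1 can be computed from the ranked nearest neighbors as sketched below (MAP@R follows [78] and is omitted); the query itself is assumed to be excluded from the gallery beforehand, and the names are illustrative.

```python
import torch

def precision_at_1(query_feats, query_labels, gallery_feats, gallery_labels):
    """Precision@1: does the nearest retrieved neighbor share the query label?
    Features are assumed L2-normalized."""
    sims = query_feats @ gallery_feats.t()
    nn_idx = sims.argmax(dim=1)                           # rank-1 neighbor per query
    return (gallery_labels[nn_idx] == query_labels).float().mean().item()
```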
For the large-scale metric learning task, i.e., face recog-
nition, we follow ArcFace [3] to get the aligned face crops
and resize them into (112×112). Then, a ResNet-like [1]
network R50 is used to extract representations and return a
512-D embedding for each image. The experiments of face
recognition are implemented by PaddlePaddle [79] and trained
on 8 NVIDIA 3090 GPUs with a total batch size of 512. For
proxy-based methods, the class center is a learnable vector,
the same as the classifier. We set the learning rate as 0.1 and
use 64 and 0.4 as the scale and margin for other proxy-based
methods. For pair-wise methods, including ours, we set the
learning rate as 0.06 at the start of training and downscale it
by 0.1 at 4th, 8th and 9th epoch. The training process ends
at the 10th epoch. The weight decay is set to 1e-4, and the
momentum of the SGD optimizer is 0.9. We adopt the loss function in DCQ [80] as Lface_clean:
\[
\mathcal{L}^{\mathrm{face}}_{\mathrm{clean}} = -\ln \frac{e^{s(\cos(\theta_y)-m)}}{e^{s(\cos(\theta_y)-m)} + \sum_{j \in [M] \setminus \{y\}} e^{s\cos(\theta_j)}}, \tag{18}
\]
where cos(θ_y) is the cosine similarity between the feature vector and the corresponding class center (dynamically generated from the class queue [80]). The scale s and margin m in the loss are set to 50 and 0.3, respectively.
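Eq. (18) is an additive-margin softmax over dynamically generated class centers; a generic sketch of this loss (with the DCQ class-queue mechanism omitted, and with illustrative names) is shown below.

```python
import torch
import torch.nn.functional as F

def face_clean_loss(feats, class_centers, labels, s=50.0, m=0.3):
    """Sketch of Eq. (18): additive-margin softmax over class centers.
    feats: (B, d) embeddings; class_centers: (M, d); labels: (B,) long."""
    z = F.normalize(feats, dim=1)
    w = F.normalize(class_centers, dim=1)
    cos = z @ w.t()                                    # cos(theta_j) for all classes
    margin = torch.zeros_like(cos)
    margin[torch.arange(len(labels)), labels] = m      # subtract margin on the true class
    logits = s * (cos - margin)
    return F.cross_entropy(logits, labels)             # equals -ln of the target softmax prob
```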
C. Results
Symmetric Label Noise. Table I shows the evaluation results
on CARS, SOP, and CUB under symmetric label noise. Our
TABLE I
Precision@1 (%) on CARS, SOP, and CUB datasets with symmetric label noise. Best and second best values are both highlighted in bold. SGPS-MCL indicates that we perform the posterior data cleaning strategy.
CARS SOP CUB
Noisy Rate 10% 20% 50% 70% 90% 10% 20% 50% 70% 90% 10% 20% 50% 70% 90%
Algorithms for image classification under label noise
Co-teaching [7] 73.47 70.39 59.55 45.47 30.45 62.60 60.26 52.18 50.21 40.62 53.74 51.12 45.01 24.81 1.17
Co-teaching+ [43] 71.49 69.62 62.35 47.54 38.29 63.44 67.93 58.29 43.61 31.59 53.31 51.04 45.16 27.51 7.24
Co-teaching w/ T [30] 77.51 76.30 66.87 54.59 32.77 73.71 71.97 64.07 50.88 45.63 55.25 54.18 50.65 29.98 11.45
F-correction [13] 71.00 69.47 59.54 44.27 25.21 51.18 46.34 48.92 45.12 39.47 53.41 52.65 48.84 24.81 8.95
UNICON [46] 77.80 77.24 71.90 51.03 25.87 55.44 52.55 47.20 38.25 35.24 52.21 51.68 46.89 24.57 18.53
DivideMix [16] 65.34 62.32 51.10 40.14 30.47 59.28 54.99 50.03 48.12 41.12 56.51 53.79 47.34 28.71 26.65
LRA-Diffusion [47] 73.71 72.61 71.34 69.65 64.90 28.51 25.26 26.97 24.73 20.59 52.44 50.64 50.21 49.42 44.37
LabelWave [49] 79.05 74.12 44.55 43.89 22.10 61.14 59.74 54.36 51.38 46.72 57.83 55.64 54.72 31.63 0.84
DML with proxy-based losses
FastAP [27] 66.74 66.39 58.87 13.47 9.97 69.20 67.94 65.83 59.62 25.38 54.10 53.70 51.18 4.24 2.57
nSoftmax [30] 72.72 70.10 54.80 46.59 29.48 70.10 68.90 57.32 43.91 34.15 51.99 49.66 42.81 27.25 7.87
ProxyNCA [31] 69.75 70.31 61.75 42.06 29.22 71.10 69.50 61.49 59.25 56.60 47.13 46.64 41.63 34.58 33.06
Soft Triple [32] 76.18 71.82 52.53 43.34 33.47 68.60 55.21 38.45 0.01 0.00 51.94 49.14 41.46 32.63 28.99
DML with pair-wise losses
MS [2] 66.31 67.14 38.41 6.16 5.06 69.90 67.60 59.58 66.40 63.14 57.44 54.52 40.70 5.351 4.051
Circle [23] 71.00 56.24 15.24 3.78 2.89 72.80 70.50 41.17 60.05 53.21 47.48 45.32 12.98 15.33 11.45
Contrastive [19] 72.34 70.93 22.91 13.80 5.65 68.70 68.80 61.16 62.15 57.34 51.77 51.50 38.59 56.18 34.72
XBM-MCL [20] 74.22 69.17 46.88 21.79 16.50 79.00 76.60 67.21 58.70 50.80 56.72 50.74 31.18 17.03 11.60
PRISM-MCL [4] 80.06 78.03 72.93 56.86 33.55 80.11 79.47 72.85 67.10 54.40 58.78 58.73 56.03 44.93 35.88
Supcon [76] 74.01 72.24 62.06 23.92 0.68 68.53 67.92 62.97 57.32 53.11 53.58 52.13 10.89 56.60 35.84
Roadmap [28] 82.74 80.27 74.12 64.96 25.08 80.82 79.69 68.17 69.92 64.59 66.03 65.31 62.78 47.92 7.11
Contextual [35] 88.18 76.58 17.00 10.25 8.28 79.85 72.90 68.02 63.56 59.64 65.85 62.21 29.27 16.12 14.16
Our methods with different DML frameworks
SGPS-MCL 80.90 79.11 76.33 68.09 40.56 80.91 80.55 74.61 71.03 62.46 63.18 62.60 60.97 50.47 45.19
SGPS-SupCon 80.31 79.77 75.43 64.35 36.49 76.21 71.64 67.32 63.43 61.48 60.05 58.12 52.21 48.71 43.78
SGPS-Roadmap 83.11 82.43 78.32 70.27 52.91 81.90 80.33 75.24 72.44 66.91 66.01 65.78 64.21 52.49 47.30
SGPS-Contextual 90.21 89.35 84.57 71.45 53.77 82.30 81.75 78.31 73.58 68.11 70.54 68.43 64.80 53.53 46.12
methods achieve the highest performance in all the cases. As
the noise rate increases, all approaches experience a decrease
in Precision@1 scores. Notably, our SGPS-based models in-
stantiated on different DML losses are more stable towards
the noise rate. In the case of smaller datasets like CARS and
CUB, SGPS-based models exhibit a decrease of less than 6%
in P@1 as the noise level increases from 10% to 50%. In
contrast, most competitors suffer from a performance drop of
more than 10%. For example, the recent noisy label learning method LRA-Diffusion shows promising results on CARS, but drops significantly on SOP, where the class distribution is imbalanced and the number of images per class decreases. We can see that DivideMix and UNICON both perform poorly: since they are designed for image classification tasks, their label-guessing strategy may not be suitable for DML tasks. On the other hand, Roadmap exhibits relatively stable
performance among DML methods. The reason might be that
the calibration loss in Roadmap enforces the score of the
negative pairs to be smaller than β, which alleviates the impact
of noisy labels to some extent. PRISM effectively improves
the ratio of correct pairs by preserving clean samples with
high confidence. Nevertheless, discarding noisy samples could
impair overall performance. Our framework takes a step fur-
ther by effectively utilizing noisy samples, leading to superior
performance compared to all other methods. When the noise
rate increases, SGPS suffers less performance degradation
compared to the competitors, where SGPS-Contextual under
70% noise rate even outperforms most competitors under 50%
noise rate.
Small Cluster Label Noise. Table II reports the Precision@1
scores on datasets with small-cluster label noise. The results
are similar to the case of symmetric noise. Contextual achieves
a very high performance in the low noise rate setting. When
trained on CARS on 25% small cluster noise, Contextual
achieves a higher performance than PRISM-MCL. However,
in the case of high noise rates, it drops significantly. Our
SGPS-Contextual outperforms SoftTriple and PRISM-MCL by
a significant margin across all datasets. This success further
demonstrates the importance of utilizing the noisy samples
rather than discarding them.
Real-world Noise. Table III displays the performance on
three datasets with real-world label noise, CARS98N and
Food-101N [71] and Clothing1M [74]. Unlike the previous
datasets, these three datasets contain a significant number
of out-of-distribution (OOD) samples. The actual number of
categories and the total noise ratio are both unknown. It is
easy to see that the multi-proxy-based method SoftTriple is more capable of handling such a phenomenon, as it can assign multiple
centers even for mislabeled training samples. Roadmap also
shows good performance as its extra constraints can help
to alleviate the influence of a large amount of uncertain
negative pairs. In contrast, PRISM-MCL and XBM-MCL cannot handle this situation well. To handle such a complicated noise situation, we set τk to be twice the number of categories
TABLE II
Precision@1 (%) on CARS, SOP, and CUB with small cluster label noise.
CARS SOP CUB
Noisy Rate 25% 50% 75% 25% 50% 75% 25% 50% 75%
Algorithms for image classification under label noise
Co-teaching [7] 70.57 62.91 53.47 61.97 58.08 49.39 51.75 48.85 30.87
Co-teaching+ [43] 70.05 61.58 51.90 62.57 59.27 50.43 51.55 47.60 27.59
Co-teaching w/ T [30] 75.26 66.19 55.78 70.19 68.50 55.13 54.59 48.32 31.66
UNICON [46] 55.29 47.19 39.99 59.16 55.82 50.48 52.59 50.10 24.98
DivideMix [16] 55.55 58.89 44.36 60.50 56.36 54.59 54.08 46.70 34.67
LRA-Diffusion [47] 71.27 68.03 65.93 34.04 24.87 22.48 49.83 47.33 42.77
LabelWave [49] 77.58 68.17 61.27 61.83 58.46 50.18 54.49 51.25 50.02
DML with proxy-based losses
FastAP [27] 62.49 53.07 2.98 70.66 67.55 6.94 52.18 48.46 3.85
nSoftmax [30] 71.61 62.29 54.53 70.00 61.92 60.50 49.61 41.78 38.35
ProxyNCA [31] 69.50 58.34 55.91 67.95 62.25 62.12 42.07 36.48 23.14
Soft Triple [32] 73.26 66.66 56.05 73.63 64.14 58.57 56.18 50.35 46.71
DML with pair-wise losses
MS [2] 63.92 43.73 17.46 67.32 62.17 61.36 53.60 41.66 11.90
Circle [23] 53.03 19.95 3.62 70.33 40.48 39.87 44.07 22.96 18.60
Contrastive [19] 65.60 26.45 22.05 68.25 64.27 63.93 47.27 39.43 35.25
XBM-MCL [20] 69.46 36.43 22.23 75.61 68.71 68.64 52.25 41.58 32.74
PRISM-MCL [4] 76.50 69.56 58.07 79.09 73.56 69.07 57.58 54.92 46.45
SupCon [76] 68.59 60.28 6.00 66.39 63.00 60.36 49.98 49.36 46.79
Roadmap [28] 78.43 69.02 62.63 80.14 76.37 73.38 64.42 59.59 53.90
Contextual [35] 84.91 32.28 26.67 76.98 71.71 69.62 66.19 50.68 26.67
Our methods with different DML frameworks
SGPS-MCL 79.95 73.50 67.88 80.55 75.46 71.96 61.63 59.80 50.21
SGPS-SupCon 72.89 62.54 60.41 70.29 62.32 62.44 59.77 51.60 48.64
SGPS-Roadmap 81.84 74.73 68.45 81.05 78.65 75.79 66.43 60.32 56.89
SGPS-Contextual 85.57 71.31 65.73 77.28 75.39 70.46 66.59 63.41 57.49
to enhance the accuracy of positive pairs selected by SGM.
Considering the test set of Clothing1M shares the same 14
categories as its training set, the performance differences among different methods might not be very large. Nonetheless, SGPS-
Roadmap achieves the best performance on all three datasets,
and the proposed SGPS-based framework could improve the
performance of its base model (SupCon, MCL, Contextual,
and Roadmap). This again demonstrates the strength of inte-
grating advanced DML methods with our SGPS framework in
real-world cases.
Large-scale Noisy Dataset. Table IV shows the results on the
large-scale noisy face recognition dataset MS1MV0 [73]. The
training of ArcFace [3] and CosFace [81] on MS1MV0 is
hindered by the gradient conflict arising from massive fine-
grained intra-class noise and inter-class noise, resulting in
limited performance. Approaches like NT [82], NR [83],
and Co-mining [84] assign different weights or margins for
clean and noisy samples, effectively improving performance
by fully leveraging clean data. Sub-center [85] introduces
multiple class centroids in the classifier to address intra-class
conflict, while SKH [86] goes a step further by assigning
multiple switchable classification layers to handle inter-class
conflict. Although algorithms with classifiers can effectively
alleviate noise influence compared to those without classifiers,
they are time-consuming and resource-intensive when training
on large datasets with millions of identities. Recently, methods
like DCQ [80] and FFC [87] aim to mitigate time and GPU
memory costs by constructing dynamic class pools as a substi-
tute for classifiers. They can achieve comparable performance
with classifier-based methods when training on clean data.
However, it is noteworthy that classifier-free methods tend to
encounter challenges in handling extensive noise. DCQ [80],
the SOTA classifier-free FR method, achieves only 53.86%
on the challenging IJB-C (TAR@FAR=1e-5), resulting in a performance drop of more than 40% compared with training on the cleaner MS1MV2 [3] data. We train DCQ with SGPS on MS1MV0,
resulting in a substantial improvement of recognition accuracy.
Moreover, our method obtains the performance of 92.91% on
IJB-C@1e-5, surpassing DCQ trained with clean data by a
notable margin. This improvement can be attributed to our
method’s utilization of noisy data in MS1MV0, which is
discarded in MS1MV2. Comparing our method with classifier-
based methods proposed for noise-robust training, e.g., SKH,
our method achieves comparable performance, but with much
less hardware cost when training on large-scale datasets, which
is further discussed in Fig. 9.
D. Ablation Studies
Effectiveness of SGM. To demonstrate the effectiveness of our proposed SGM, we decouple the subgroup modules in SGM to ablate their effectiveness on CARS and SOP with 50%
symmetric noise. As shown in Fig. 5(a), selecting positive samples from both cB and cT contributes to the performance enhancement. Notably, when selecting samples from cB (SGM-B), a superior Precision@1 is achieved compared to selecting samples from cT (SGM-T). This is mainly because cB well utilizes the prior knowledge from the feature distribution.
TABLE III
Precision@1 (%) and Mean Average Precision@R (%) on CARS-98N, Food-101N and Clothing1M.
CARS-98N Food-101N Clothing1M
Metric P@1 MAP@R P@1 MAP@R P@1 MAP@R
Algorithms for image classification under label noise
Co-teaching [7] 58.74 9.10 59.08 14.66 70.41 46.43
Co-teaching+ [43] 56.66 8.40 57.59 14.72 70.53 47.89
Co-teaching w/ T [30] 60.72 9.61 63.18 17.38 70.66 47.39
UNICON [46] 52.60 7.38 44.80 7.15 67.46 23.59
Dividemix [16] 49.03 6.14 56.65 12.72 63.38 25.28
LRA-Diffusion [47] 64.71 13.66 61.90 13.98 69.21 35.91
LabelWave [49] 64.46 13.20 63.30 16.78 70.81 40.42
DML with proxy-based losses
FastAP [27] 52.04 9.88 52.58 12.46 70.79 46.96
nSoftmax [30] 63.54 12.56 62.15 14.61 70.89 39.44
ProxyNCA [31] 53.55 8.75 48.41 9.30 70.14 38.63
Soft Triple [32] 63.36 10.88 63.61 16.23 70.57 40.03
DML with pair-wise losses
MS [2] 49.00 5.92 52.53 9.82 69.71 43.94
Circle [23] 41.04 4.97 50.26 9.77 62.66 26.02
Contrastive [19] 44.91 4.76 50.04 9.42 70.76 47.56
XBM-MCL [20] 38.73 3.34 52.58 9.88 68.98 44.55
PRISM-MCL [4] 58.27 9.24 52.47 9.64 70.43 47.37
SupCon [76] 59.34 10.30 49.78 9.22 70.60 46.29
Roadmap [28] 64.29 12.25 60.33 14.82 70.85 47.22
Contextual [35] 44.30 5.22 57.88 14.77 70.26 46.04
Our methods with different DML frameworks
SGPS-MCL 65.43 10.42 60.44 15.01 71.11 48.08
SGPS-SupCon 62.45 10.12 60.54 15.13 71.76 49.02
SGPS-Roadmap 73.56 16.38 64.39 17.32 73.21 49.21
SGPS-Contextual 63.67 10.24 63.47 16.46 71.36 48.21
Fig. 5. Ablation of (a) SGM and (b) PPM on CARS and SOP with 50% symmetric noise.
Nevertheless, cT can also serve as a complementary part to cB to further improve the performance.
Effectiveness of PPM. We also present the results with dif-
ferent aggregation strategies in PPM on CARS and SOP with
50% symmetric noise in Fig.5(b). Notably, weighted-based
aggregating methods achieve better P@1 than simply selecting
the most similar positive sample, as Max always selects easy
positive pairs, leading to rapid overfitting. The performance of
TransProto surpasses that of Mean and SoftMax, suggesting
the transformer block can discover potentially more effective
learning directions when conditioned on the positive pairs in
the memory bank.
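The aggregation variants compared in Fig. 5(b) can be summarized by the short sketch below; the tensor shapes and the softmax temperature are assumptions for illustration, and the TransProto variant (a transformer block over the memory bank) is omitted.

import torch
import torch.nn.functional as F

def aggregate_prototype(anchor, positives, mode="softmax", temp=0.1):
    """Aggregate K selected positives (K x d, L2-normalized) into one prototype.

    `anchor` is the d-dim embedding of the noisy sample. Illustrative sketch
    of the Max / Mean / SoftMax strategies ablated in Fig. 5(b); `temp` is an
    assumed value.
    """
    sims = positives @ anchor                 # (K,) cosine similarities
    if mode == "max":                         # pick the single most similar positive
        proto = positives[sims.argmax()]
    elif mode == "mean":                      # uniform average of all positives
        proto = positives.mean(dim=0)
    elif mode == "softmax":                   # similarity-weighted average
        w = F.softmax(sims / temp, dim=0)
        proto = (w.unsqueeze(1) * positives).sum(dim=0)
    else:
        raise ValueError(mode)
    return F.normalize(proto, dim=0)          # keep the prototype on the unit sphere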
Sensitivity of hyperparameters. The hyperparameters τ and δ in Eq. (13) and Eq. (15) and the loss weights γ1 and γ2 in Eq. (16) are important in our method. We demonstrate their influence in Tab. V. We observe that a smaller τ achieves better performance, mainly because a smaller τ gives more weight to the hard negative samples. However, a τ that is too small leads to significant fluctuations in the loss, resulting in training failures. Increasing δ clearly improves the performance; however, if δ becomes excessively large, it leads to rapid overfitting of LSGPS, which hampers the learning process for clean samples.
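To illustrate why a smaller τ emphasizes hard negatives, the snippet below computes a generic InfoNCE-style loss between a noisy sample and its positive prototype; it is only a stand-in with assumed shapes, not the exact formulation of Eq. (13).

import torch
import torch.nn.functional as F

def prototype_contrastive_loss(anchor, proto, negatives, tau=0.02):
    """InfoNCE-style loss: a smaller tau sharpens the softmax over negatives,
    so hard (high-similarity) negatives dominate the gradient; a very small tau
    can make the loss unstable, matching the sensitivity study in Tab. V.

    anchor: (d,), proto: (d,), negatives: (N, d); all assumed L2-normalized.
    Illustrative sketch with assumed shapes, not the paper's exact loss.
    """
    pos = (anchor * proto).sum() / tau         # positive logit
    neg = negatives @ anchor / tau             # (N,) negative logits
    logits = torch.cat([pos.view(1), neg])
    target = torch.zeros(1, dtype=torch.long)  # the positive sits at index 0
    return F.cross_entropy(logits.unsqueeze(0), target)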
Fig. 6. Quantitative accuracy of clean sample selection on CARS (a) and SOP (b) with 50% symmetric noise.
Fig. 7. Qualitative visualization of the detailed selection results in a training batch on CARS (a) and SOP (b). Yellow and green boxes represent the selected true clean and discarded true noisy samples, respectively. Blue and red boxes represent the discarded clean samples and the selected noisy samples, respectively. Samples with the same noisy label are separated by solid lines.
E. Other Analysis
Quantitative and qualitative analysis of PCS. We present
the quantitative accuracy of clean sample selection on CARS
and SOP datasets with 50% symmetric noise in Figure 6. We can observe that the selection accuracy increases during training, reaching more than 90% on CARS and nearly 80% on SOP, both much higher than the noise rate (50%). Note that
the lower accuracy on SOP mainly arises from the fact that
it has a notably lower average number of images per class
(5.26) than CARS (82.18). Nevertheless, a selection accuracy
of 80% could be sufficiently helpful. To better illustrate the
selection results, we also provide the qualitative visualization
in a training batch with a size of 64, in Figure 7(a) and
Figure 7(b). It can be observed that only a few noisy samples
are mistakenly considered as clean (red boxes), while most
noisy samples (green boxes) are accurately identified and not
TABLE IV
Experiments of different settings on the MS1MV0 dataset compared with state-of-the-art methods. The number in red denotes the best result. Best and second-best values among algorithms without a classifier are both highlighted in bold. We adopt the 1:1 verification TAR (@FAR=1e-3, 1e-4, 1e-5) on the IJB-B and IJB-C datasets as the evaluation metric.
Method Dataset LFW CFP-FP AgeDB-30 IJB-B IJB-C
1e-03 1e-04 1e-05 1e-03 1e-04 1e-05
Algorithms with classifier
ArcFace [3] MS1Mv0 99.75 97.17 97.26 93.27 87.87 74.74 94.59 90.27 81.11
CosFace [81] MS1Mv0 99.78 97.60 97.28 93.44 86.87 74.20 95.15 90.56 83.01
NT [82] MS1Mv0 99.73 97.76 97.41 94.79 91.57 85.56 95.86 93.65 90.48
NR [83] MS1Mv0 99.75 97.71 97.42 94.77 91.58 85.53 95.88 93.60 90.41
Co-mining [84] MS1Mv0 99.75 97.55 97.37 94.99 91.80 85.57 95.95 93.82 90.71
Subcenter [85] MS1Mv0 99.72 97.73 97.40 94.88 91.70 85.62 95.98 93.72 90.50
SKH [86] MS1Mv0 99.73 97.81 97.47 95.89 93.50 89.34 96.85 95.25 93.00
Algorithms without classifier
DCQ [80] MS1Mv0 99.65 96.14 97.23 90.23 74.07 46.63 92.24 77.60 53.86
F2C[87] MS1Mv0 99.23 96.01 97.28 89.31 74.10 43.21 91.87 74.34 51.33
DCQ [80] + SKH [86] MS1Mv0 99.43 97.34 97.13 93.23 87.13 75.63 95.42 91.60 88.46
DCQ + SGPS (Ours) MS1Mv0 99.73 98.07 97.36 95.61 93.20 87.89 96.62 94.60 92.91
Algorithms under clean data
ArcFace [3] MS1MV2 99.82 98.27 98.01 96.16 94.41 88.64 97.23 95.81 93.63
DCQ [80] MS1MV2 99.73 97.71 97.60 95.89 93.22 87.56 96.93 94.89 91.99
TABLE V
Sensitivity of the hyperparameters on CARS with 50% symmetric noise. τ and δ are defined in Eq. (13), while γ1 and γ2 are in Eq. (16).
τ δ γ1 γ2 Precision@1 (%) MAP@R (%)
0.02 0.1 1.0 0.1 76.33 20.90
1.0 0.1 1.0 0.1 73.05 18.56
0.1 0.1 1.0 0.1 74.99 19.63
0.02 0.0 1.0 0.1 74.79 19.80
0.02 0.3 1.0 0.1 73.16 18.56
0.02 0.1 2.0 0.1 75.41 19.92
0.02 0.1 5.0 0.1 73.22 18.61
0.02 0.1 1.0 0.5 75.56 20.60
0.02 0.1 1.0 1.0 74.20 19.27
used in Lclean. These misselected noisy samples and the corre-
sponding within-class clean samples are semantically similar.
For example, in the 5th row of Figure 7(a), the misselected
noisy sample (red box) is similar to the clean samples (yellow
boxes, pickup trucks). The unselected clean samples (blue
boxes) are often hard samples within this category.
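For reference, the selection accuracy plotted in Fig. 6 can be computed as the fraction of PCS-selected samples whose labels are actually clean; a minimal sketch with assumed boolean masks is given below.

import numpy as np

def clean_selection_accuracy(selected_mask, is_clean_mask):
    """Fraction of PCS-selected samples that are truly clean.

    selected_mask: boolean array, True where PCS keeps a sample as clean.
    is_clean_mask: boolean array, True where the (synthetic) label is correct.
    Both masks are assumptions for illustration; under 50% symmetric noise a
    random selector would score about 0.5.
    """
    selected = selected_mask.sum()
    return (selected_mask & is_clean_mask).sum() / max(selected, 1)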
Posterior Data Cleaning and Training Strategy. We also design a posterior data cleaning and training strategy to further improve the performance, which contains three stages. First, we train SGPS-MCL with the noisy training set. Once the training process has converged, a DML model with a discriminative feature space is acquired, denoted as the stage-1 model. We then use the stage-1 model to generate pseudo labels, i.e., cB, for the training set. Second, we train the stage-2 model with the pseudo labels, without SGPS but with an early-stopped PRISM. The stage-2 model can achieve better performance than the stage-1 model by setting a smaller noise rate. In the third stage, we train the stage-3 model with the original noisy labels, using the stage-2 model as the pretrained model, and use its extracted features to initialize the features of SGM.
Fig. 8. Change of Precision (Precision@1, R-Precision) and MAP (MAP@100, MAP@R) during training for SGPS and PRISM. SGPS overcomes the overfitting problem in PRISM.
The related experiments with SGPS-MCL in Tab. I demonstrate the results. This strategy further improves the performance of SGPS-MCL by a large margin, outperforming SGPS-MCL by more than 1.2% on CARS, SOP, and CUB with 50% symmetric noise. With the proposed cleaning strategy, we can catch up with, and even surpass, models trained with clean labels when the noise rate is not very large.
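A compact sketch of this three-stage procedure is given below; the callables train_stage and assign_subgroups are placeholders for the corresponding training and clustering routines and are not part of the released code.

def posterior_clean_training(train_stage, assign_subgroups, noisy_labels, data):
    """Three-stage posterior cleaning strategy (illustrative sketch; the
    callables `train_stage` and `assign_subgroups` are assumed placeholders).

    1) Train SGPS-MCL on the noisy labels to obtain the stage-1 model.
    2) Re-label the training set with the stage-1 subgroups (cB) and train a
       stage-2 model on these pseudo labels with an early-stopped PRISM only.
    3) Retrain on the original noisy labels, initialized from stage-2, and
       initialize the SGM features from the stage-2 embeddings.
    """
    model1 = train_stage(data, noisy_labels, method="SGPS-MCL")
    pseudo_labels = assign_subgroups(model1, data)          # cB used as pseudo labels
    model2 = train_stage(data, pseudo_labels, method="PRISM", early_stop=True)
    model3 = train_stage(data, noisy_labels, method="SGPS-MCL",
                         init_from=model2, init_sgm_features=True)
    return model3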
Overfitting Problems. We also investigate the overfitting behavior of SGPS-MCL. As shown in Fig. 8(a) and Fig. 8(b), our method achieves better performance than PRISM [4] on all metrics, including Precision@1, R-Precision, MAP@R, and MAP@100. Different from PRISM [4], which suffers a performance degradation during training, our method keeps a stable performance improvement, indicating that SGPS is less likely to overfit the noisy labels. In the training of SGPS, we can simply take the model at the end of training as the best one, without resorting to cherry-picked strategies like early stopping.
Runtime analysis. We provide the training speed comparisons
for PRISM and SGPS (K=4) on CARS and SOP in Table VI.
It is indicated that the proposed model is about 1.1 times
TABLE VI
GPU hours during the training. The speed is tested on an NVIDIA RTX 3090 GPU.
Dataset PRISM SGPS
CARS (10% noise) 0.307 0.332
CARS (50% noise) 0.312 0.361
SOP (10% noise) 3.090 3.368
SOP (50% noise) 3.120 3.883
Fig. 9. GPU memory cost (MB) of the multiproxy-based methods (CosFace, SKH) and SGPS under different numbers of identities (100k, 200k, 500k, 1M).
slower than PRISM on CARS (10% noise) and SOP (10% noise), and 1.2 times slower on CARS (50% noise) and SOP (50% noise). As discussed before, the increased time is mainly caused by the extra K positive samples selected by SGM and the subsequent aggregation procedure in PPM. Owing to the asynchronous updates of cB and cT, SGM does not add much time cost during training. Moreover, the costs of back-propagation and optimization are not affected.
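The asynchronous update can be pictured as refreshing the subgroup assignments only every few hundred iterations while the per-step forward/backward pass is untouched; the interval and callables in the sketch below are assumed placeholders.

def training_loop(steps, update_interval, train_step, refresh_subgroups):
    """Illustrative loop: the forward/backward pass runs every iteration, while
    the subgroup assignments (cB, cT) are refreshed only occasionally, so SGM
    adds little wall-clock overhead. `update_interval` and both callables are
    assumed placeholders.
    """
    subgroups = refresh_subgroups()
    for it in range(steps):
        if it > 0 and it % update_interval == 0:
            subgroups = refresh_subgroups()   # infrequent, asynchronous refresh
        train_step(subgroups)                 # per-step cost is unchanged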
Discussion with multiproxy-based methods. Multiproxy-based methods such as SoftTriple [32], Subcenter [85], and SKH [86] extend the softmax loss with multiple centers for each class, achieving SOTA performance on DML benchmarks with both clean and noisy labels. Our SGPS differs from these multiproxy-based methods in two aspects.
Firstly, SGPS is hardware-friendly and easy to incorporate into pair-wise methods like MCL [20] and DCQ [80]. As shown in Fig. 9, multiproxy-based methods like SKH [86] cost several times more memory than CosFace [81]. In scenarios with over one million identities, the memory cost of SKH [86] becomes unacceptable. SGPS adds no extra GPU memory cost during training and can thus be easily deployed on large-scale datasets. Secondly, SGPS introduces a novel SGM to discover positive pairs instead of relying on predefined multiple centers. This allows SGPS to merge or filter out noisy samples and outliers more effectively than proxy-based methods. As
illustrated in Fig. 10, when dealing with samples labeled as category 11308 in SOP with 50% symmetric noise, the intra-class splitting process initially divides these samples into three subgroups: 11308-0, 11308-1, and 11308-2. After bottom-up subgroup generation, the resulting merged subgroups are displayed in the three lower rows of Fig. 10. It is noteworthy that samples within the same subgroup exhibit a high likelihood of belonging to the same ground-truth category (illustrated by the same color). Nevertheless, multiproxy-based methods
Fig. 10. Visualization of SGM's results on SOP with 50% symmetric noise. The first row shows samples with the same noisy label (id 11308). The subsequent three rows show the samples from the three subgroups (11308-0, 11308-1, 11308-2) in cB. Samples within the same rectangle belong to the same subgroup and noisy label category. Different colors denote different ground-truth categories.
are incapable of effectively modeling noise situations like 11308-1, where positive samples are distributed across numerous different categories. This example highlights the capability of SGPS in handling complex noise scenarios.
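A back-of-the-envelope estimate clarifies the memory argument in Fig. 9; the embedding dimension, sub-center count, and fp32 storage in the sketch below are illustrative assumptions.

def classifier_memory_gb(num_ids, dim=512, sub_centers=3, bytes_per_float=4):
    """Rough GPU memory for a multiproxy classifier weight matrix of shape
    (num_ids x sub_centers x dim), parameters only; gradients and optimizer
    states further multiply this. All constants are illustrative assumptions.
    """
    return num_ids * sub_centers * dim * bytes_per_float / 1024**3

# e.g., one million identities with 3 sub-centers and 512-d features:
# classifier_memory_gb(1_000_000) is roughly 5.7 GB of parameters alone,
# whereas a pair-wise method equipped with SGPS only keeps a fixed-size
# feature memory bank that does not grow with the number of identities.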
V. CONCLUSION
In this paper, we propose a novel SGPS framework for
noise-robust DML. SGPS can be smoothly integrated with
existing DML methods and notably reduces the impact of
label noise, particularly in pair-wise DML approaches. This
framework starts by efficiently distinguishing between clean
and noisy samples using the PCS strategy. To enhance the
utilization of noisy data, we introduce SGM to generate subgroup labels, which enable SGPS to discover positive pairs for noisy samples. Following that, PPM aggregates positive pairs into informative prototypes to involve the noisy samples in the training process. Extensive experiments on both synthetic
and real-world datasets demonstrate the effectiveness of SGPS.
ACKNOWLEDGMENTS
This work was supported in part by the National Key
R&D Program of China under Grant 2018AAA0102000,
in part by National Natural Science Foundation of China:
62236008, U21B2038, U23B2051, 61931008, 62122075,
62406305, 62471013, 62476068 and 62272439, in part by
Youth Innovation Promotion Association CAS, in part by the
Strategic Priority Research Program of the Chinese Academy
of Sciences, Grant No. XDB0680000, in part by the Innovation
Funding of ICT, CAS under Grant No.E000000, in part by the
China Postdoctoral Science Foundation (CPSF) under Grant
No.2023M743441, and in part by the Postdoctoral Fellowship
Program of CPSF under Grant No.GZB20230732.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in CVPR, 2016, pp. 770–778.
[2] X. Wang, X. Han, W. Huang, D. Dong, and M. R. Scott, “Multi-
similarity loss with general pair weighting for deep metric learning,”
in CVPR, 2019, pp. 5022–5030.
[3] J. Deng, J. Guo, N. Xue, and S. Zafeiriou, “Arcface: Additive angular
margin loss for deep face recognition, in CVPR, 2019, pp. 4690–4699.
[4] C. Liu, H. Yu, B. Li, Z. Shen, Z. Gao, P. Ren, X. Xie, L. Cui,
and C. Miao, “Noise-resistant deep metric learning with ranking-based
instance selection,” in CVPR, 2021, pp. 6811–6820.
[5] D. Arthur and S. Vassilvitskii, “K-means++ the advantages of careful
seeding,” in SODA, 2007, pp. 1027–1035.
[6] F. Nielsen and F. Nielsen, “Hierarchical clustering,” Introduction to HPC
with MPI for Data Science, pp. 195–211, 2016.
[7] B. Han, Q. Yao, X. Yu, G. Niu, M. Xu, W. Hu, I. Tsang, and
M. Sugiyama, “Co-teaching: Robust training of deep neural networks
with extremely noisy labels,” NeurIPS, vol. 31, 2018.
[8] X. Xia, T. Liu, B. Han, M. Gong, J. Yu, G. Niu, and M. Sugiyama,
“Sample selection with uncertainty of losses for learning with noisy
labels,” in ICLR, 2022.
[9] S. Li, X. Xia, S. Ge, and T. Liu, “Selective-supervised contrastive
learning with noisy labels,” in CVPR, 2022, pp. 316–325.
[10] Y. Zhao, Q. Xu, Y. Jiang, P. Wen, and Q. Huang, “Dist-pu: Positive-
unlabeled learning from a label distribution perspective, in CVPR, 2022,
pp. 14 461–14 470.
[11] L. Jiang, Z. Zhou, T. Leung, L.-J. Li, and L. Fei-Fei, “Mentornet:
Learning data-driven curriculum for very deep neural networks on
corrupted labels,” in ICML. PMLR, 2018, pp. 2304–2313.
[12] G. Zheng, A. H. Awadallah, and S. Dumais, “Meta label correction for
noisy label learning,” in AAAI, vol. 35, no. 12, 2021, pp. 11053–11 061.
[13] G. Patrini, A. Rozza, A. Krishna Menon, R. Nock, and L. Qu, “Making
deep neural networks robust to label noise: A loss correction approach,
in CVPR, 2017, pp. 1944–1952.
[14] “Estimating noise transition matrix with label correlations for noisy
multi-label learning,” in 36th Conference on Neural Information Pro-
cessing Systems (NeurIPS 2022), 2022.
[15] S. Yang, S. Wu, E. Yang, B. Han, Y. Liu, M. Xu, G. Niu, and T. Liu, “A
parametrical model for instance-dependent label noise,” IEEE TPAMI,
vol. 45, no. 12, pp. 14 055–14 068, 2023.
[16] J. Li, R. Socher, and S. C. Hoi, “Dividemix: Learning with noisy labels
as semi-supervised learning,” arXiv preprint arXiv:2002.07394, 2020.
[17] Y. Bai and T. Liu, “Me-momentum: Extracting hard confident examples
from noisily labeled data,” in ICCV, 2021.
[18] Y. Jiang, X. Li, Y. Chen, Y. He, Q. Xu, Z. Yang, X. Cao, and Q. Huang,
“Maxmatch: Semi-supervised learning with worst-case consistency,
IEEE TPAMI, vol. 45, no. 5, pp. 5970–5987, 2022.
[19] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric
discriminatively, with application to face verification, in CVPR, vol. 1.
IEEE, 2005, pp. 539–546.
[20] X. Wang, H. Zhang, W. Huang, and M. R. Scott, “Cross-batch memory
for embedding learning,” in CVPR, 2020, pp. 6388–6397.
[21] S. Dereka, I. Karpukhin, and S. Kolesnikov, “Deep image retrieval is
not robust to label noise,” in CVPR, 2022, pp. 4975–4980.
[22] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embed-
ding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
[23] Y. Sun, C. Cheng, Y. Zhang, C. Zhang, L. Zheng, Z. Wang, and Y. Wei,
“Circle loss: A unified perspective of pair similarity optimization, in
CVPR, 2020, pp. 6398–6407.
[24] B. Zhang, W. Zheng, J. Zhou, and J. Lu, “Attributable visual similarity
learning,” in CVPR, 2022, pp. 7532–7541.
[25] J. Goldberger, G. E. Hinton, S. Roweis, and R. R. Salakhutdinov,
“Neighbourhood components analysis,” NeurIPS, vol. 17, 2004.
[26] C.-Y. Wu, R. Manmatha, A. J. Smola, and P. Krahenbuhl, “Sampling
matters in deep embedding learning,” in ICCV, 2017, pp. 2840–2848.
[27] F. Cakir, K. He, X. Xia, B. Kulis, and S. Sclaroff, “Deep metric learning
to rank,” in CVPR, 2019, pp. 1861–1870.
[28] E. Ramzi, N. Thome, C. Rambour, N. Audebert, and X. Bitot, “Robust
and decomposable average precision for image retrieval, NeurIPS,
vol. 34, pp. 23 569–23 581, 2021.
[29] P. Wen, Q. Xu, Z. Yang, Y. He, and Q. Huang, “Exploring the algorithm-
dependent generalization of auprc optimization with list stability, in
NeurIPS, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho,
and A. Oh, Eds., vol. 35. Curran Associates, Inc., 2022, pp. 28 335–
28 349.
[30] A. Zhai and H.-Y. Wu, “Classification is a strong baseline for deep
metric learning,” BMVC, 2018.
[31] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh,
“No fuss distance metric learning using proxies,” in ICCV, 2017, pp.
360–368.
[32] Q. Qian, L. Shang, B. Sun, J. Hu, H. Li, and R. Jin, “Softtriple loss: Deep
metric learning without triplet sampling,” in ICCV, 2019, pp. 6450–6458.
[33] E. W. Teh, T. DeVries, and G. W. Taylor, “Proxynca++: Revisiting
and revitalizing proxy neighborhood component analysis,” in Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XXIV 16. Springer, 2020, pp. 448–464.
[34] X. An, J. Deng, K. Yang, J. Li, Z. Feng, J. Guo, J. Yang, and T. Liu,
“Unicom: Universal and compact representation learning for image
retrieval, arXiv preprint arXiv:2304.05884, 2023.
[35] C. Liao, T. Tsiligkaridis, and B. Kulis, “Supervised metric learning to
rank for retrieval via contextual similarity optimization, in ICML, ser.
Proceedings of Machine Learning Research, A. Krause, E. Brunskill,
K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, Eds., vol. 202. PMLR,
23–29 Jul 2023, pp. 20 906–20 938.
[36] J. Yan, E. Yang, C. Deng, and H. Huang, “Metricformer: A unified
perspective of correlation exploring in similarity learning, NeurIPS,
vol. 35, pp. 33 414–33 427, 2022.
[37] J. Yan, C. Deng, H. Huang, and W. Liu, “Causality-invariant interactive
mining for cross-modal similarity learning,” IEEE TPAMI, 2024.
[38] D. Angluin and P. Laird, “Learning from noisy examples,” Machine
Learning, vol. 2, pp. 343–370, 1988.
[39] Y. Wang, W. Liu, X. Ma, J. Bailey, H. Zha, L. Song, and S.-T. Xia,
“Iterative learning with open-set noisy labels, in CVPR, 2018, pp. 8688–
8696.
[40] Y. Yao, M. Gong, Y. Du, J. Yu, B. Han, K. Zhang, and T. Liu, “Which
is better for learning with noisy labels: the semi-supervised method or
modeling label noise?” in ICML. PMLR, 2023, pp. 39 660–39 673.
[41] W. Ni, Q. Xu, Y. Jiang, Z. Cao, X. Cao, and Q. Huang, “Psnea: Pseudo-
siamese network for entity alignment between multi-modal knowledge
graphs,” in ACMMM, 2023, pp. 3489–3497.
[42] Y. Jiang, Q. Xu, Y. Zhao, Z. Yang, P. Wen, X. Cao, and Q. Huang,
“Positive-unlabeled learning with label distribution alignment, IEEE
TPAMI, 2023.
[43] X. Yu, B. Han, J. Yao, G. Niu, I. Tsang, and M. Sugiyama, “How does
disagreement help generalization against label corruption?” in ICML.
PMLR, 2019, pp. 7164–7173.
[44] K. Nishi, Y. Ding, A. Rich, and T. Höllerer, “Improving label noise robustness with data augmentation and semi-supervised learning (student abstract),” in AAAI, vol. 35, no. 18, 2021, pp. 15855–15856.
[45] Y. Wang, Q. Xu, Y. Jiang, S. Dai, and Q. Huang, “Regularized
contrastive partial multi-view outlier detection, in ACMMM, 2024, pp.
8711–8720.
[46] N. Karim, M. N. Rizve, N. Rahnavard, A. Mian, and M. Shah, “Uni-
con: Combating label noise through uniform selection and contrastive
learning,” in CVPR, 2022, pp. 9676–9686.
[47] J. Chen, R. Zhang, T. Yu, R. Sharma, Z. Xu, T. Sun, and C. Chen,
“Label-retrieval-augmented diffusion models for learning from noisy
labels,” NeurIPS, vol. 36, 2024.
[48] Y. Bai, E. Yang, B. Han, Y. Yang, J. Li, Y. Mao, G. Niu, and T. Liu,
“Understanding and improving early stopping for learning with noisy
labels,” in NeurIPS, 2021.
[49] S. Yuan, L. Feng, and T. Liu, “Early stopping against label noise without
validation data, ICLR, 2024.
[50] Y. Jiang, Q. Xu, Z. Yang, X. Cao, and Q. Huang, “Dm2c: Deep mixed-
modal clustering,” NeurIPS, vol. 32, 2019.
[51] Y. Jiang, Z. Yang, Q. Xu, X. Cao, and Q. Huang, “When to learn what:
Deep cognitive subspace clustering, in ACMMM, 2018, pp. 718–726.
[52] Y. Jiang, Q. Xu, Z. Yang, X. Cao, and Q. Huang, “Duet robust deep
subspace clustering,” in ACMMM, 2019, pp. 1596–1604.
[53] D. Wang and X. Tan, “Robust distance metric learning via bayesian
inference,” IEEE TIP, vol. 27, no. 3, pp. 1542–1553, 2017.
[54] X. Wang, Y. Hua, E. Kodirov, G. Hu, and N. M. Robertson, “Deep
metric learning by online soft mining and class-aware attention,” in
AAAI, vol. 33, no. 01, 2019, pp. 5361–5368.
[55] K. Ozaki and S. Yokoo, “Large-scale landmark retrieval/recognition
under a noisy and diverse dataset, arXiv preprint arXiv:1906.04087,
2019.
[56] A. Ermolov, L. Mirvakhabova, V. Khrulkov, N. Sebe, and I. Oseledets,
“Hyperbolic vision transformers: Combining improvements in metric
learning,” in CVPR, 2022, pp. 7409–7419.
[57] J. Yan, L. Luo, C. Deng, and H. Huang, “Adaptive hierarchical similarity
metric learning with noisy labels,” IEEE TIP, vol. 32, pp. 1245–1256,
2023.
[58] ——, “Unsupervised hyperbolic metric learning,” in CVPR, 2021, pp.
12 465–12 474.
[59] E. Zhang, X. Jiang, H. Cheng, A. Wu, F. Yu, K. Li, X. Guo, F. Zheng,
W. Zheng, and X. Sun, “One for more: Selecting generalizable samples
for generalizable reid model,” in AAAI, vol. 35, no. 4, 2021, pp. 3324–
3332.
[60] M. Ye, H. Li, B. Du, J. Shen, L. Shao, and S. C. Hoi, “Collaborative
refining for person re-identification with label noise,” IEEE Transactions
on Image Processing, vol. 31, pp. 379–391, 2021.
[61] J. Yan, L. Luo, C. Xu, C. Deng, and H. Huang, “Noise is also useful:
Negative correlation-steered latent contrastive learning,” in CVPR, 2022,
pp. 31–40.
[62] L. Lan, X. Teng, J. Zhang, X. Zhang, and D. Tao, “Learning to purifi-
cation for unsupervised person re-identification,” IEEE Transactions on
Image Processing, 2023.
[63] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast
for unsupervised visual representation learning,” in CVPR, June 2020.
[64] X. Zhan, J. Xie, Z. Liu, Y.-S. Ong, and C. C. Loy, “Online deep
clustering for unsupervised representation learning,” in CVPR, June
2020.
[65] J. Hoshen and R. Kopelman, “Percolation and cluster distribution. i.
cluster multiple labeling technique and critical concentration algorithm,”
Physical Review B, vol. 14, no. 8, p. 3438, 1976.
[66] S. C. Johnson, “Hierarchical clustering schemes,” Psychometrika,
vol. 32, no. 3, pp. 241–254, 1967.
[67] D. Comaniciu and P. Meer, “Mean shift: A robust approach toward
feature space analysis,” IEEE TPAMI, vol. 24, no. 5, pp. 603–619, 2002.
[68] J. Krause, M. Stark, J. Deng, and L. Fei-Fei, “3d object representations
for fine-grained categorization,” in ICCV workshops, 2013, pp. 554–561.
[69] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie, “The
caltech-ucsd birds-200-2011 dataset,” 2011.
[70] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric
learning via lifted structured feature embedding,” in CVPR, 2016, pp.
4004–4012.
[71] K.-H. Lee, X. He, L. Zhang, and L. Yang, “Cleannet: Transfer learning
for scalable image classifier training with label noise,” in CVPR, 2018,
pp. 5447–5456.
[72] L. Bossard, M. Guillaumin, and L. Van Gool, “Food-101–mining
discriminative components with random forests, in Computer Vision–
ECCV 2014: 13th European Conference, Zurich, Switzerland, September
6-12, 2014, Proceedings, Part VI 13. Springer, 2014, pp. 446–461.
[73] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao, “Ms-celeb-1m: A dataset
and benchmark for large-scale face recognition, in Computer Vision–
ECCV 2016: 14th European Conference, Amsterdam, The Netherlands,
October 11-14, 2016, Proceedings, Part III 14. Springer, 2016, pp.
87–102.
[74] T. Xiao, T. Xia, Y. Yang, C. Huang, and X. Wang, “Learning from
massive noisy labeled data for image classification, in CVPR, 2015,
pp. 2691–2699.
[75] B. Van Rooyen, A. Menon, and R. C. Williamson, “Learning with
symmetric label noise: The importance of being unhinged,” NeurIPS,
vol. 28, 2015.
[76] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola,
A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learn-
ing,” NeurIPS, vol. 33, pp. 18661–18 673, 2020.
[77] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
network training by reducing internal covariate shift, in ICML. pmlr,
2015, pp. 448–456.
[78] K. Musgrave, S. Belongie, and S.-N. Lim, A metric learning reality
check,” in Computer Vision–ECCV 2020: 16th European Conference,
Glasgow, UK, August 23–28, 2020, Proceedings, Part XXV 16. Springer,
2020, pp. 681–699.
[79] Y. Ma, D. Yu, T. Wu, and H. Wang, “Paddlepaddle: An open-source deep learning platform from industrial practice,” Frontiers of Data and Computing, vol. 1, no. 1, pp. 105–115, 2019.
[80] B. Li, T. Xi, G. Zhang, H. Feng, J. Han, J. Liu, E. Ding, and W. Liu,
“Dynamic class queue for large scale face recognition in the wild, in
CVPR, 2021, pp. 3763–3772.
[81] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu,
“Cosface: Large margin cosine loss for deep face recognition, in CVPR,
2018, pp. 5265–5274.
[82] W. Hu, Y. Huang, F. Zhang, and R. Li, “Noise-tolerant paradigm for
training face recognition cnns,” in CVPR, 2019, pp. 11887–11 896.
[83] Y. Zhong, W. Deng, M. Wang, J. Hu, J. Peng, X. Tao, and Y. Huang,
“Unequal-training for deep face recognition with long-tailed noisy data,”
in CVPR, 2019, pp. 7812–7821.
[84] X. Wang, S. Wang, J. Wang, H. Shi, and T. Mei, “Co-mining: Deep face
recognition with noisy labels,” in ICCV, 2019, pp. 9358–9367.
[85] J. Deng, J. Guo, T. Liu, M. Gong, and S. Zafeiriou, “Sub-center arcface:
Boosting face recognition by large-scale noisy web faces, in Computer
Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August
23–28, 2020, Proceedings, Part XI 16. Springer, 2020, pp. 741–757.
[86] B. Liu, G. Song, M. Zhang, H. You, and Y. Liu, “Switchable k-class
hyperplanes for noise-robust representation learning, in ICCV, 2021,
pp. 3019–3028.
[87] K. Wang, S. Wang, P. Zhang, Z. Zhou, Z. Zhu, X. Wang, X. Peng,
B. Sun, H. Li, and Y. You, An efficient training approach for very
large scale face recognition, in CVPR, 2022, pp. 4083–4092.
Zhipeng Yu received the B.E. degree in communica-
tion engineering and the M.E. degree in electrical and
communication engineering from the Beijing Uni-
versity of Posts and Telecommunications (BUPT),
Beijing, China, in 2015 and 2018, respectively. He is
currently pursuing the Ph.D. degree with University
of the Chinese Academy of Sciences. His research
interests include machine learning and computer
vision.
Qianqian Xu received the B.S. degree in com-
puter science from China University of Mining and
Technology in 2007 and the Ph.D. degree in com-
puter science from University of Chinese Academy
of Sciences in 2013. She is currently a Professor
with the Institute of Computing Technology, Chinese
Academy of Sciences, Beijing, China. Her research
interests include statistical machine learning, with
applications in multimedia and computer vision. She
has authored or coauthored 90+ academic papers
in prestigious international journals and conferences
(including T-PAMI, IJCV, T-IP, NeurIPS, ICML, CVPR, AAAI, etc). More-
over, she serves as an associate editor of IEEE Transactions on Circuits and
Systems for Video Technology, IEEE Transactions on Multimedia, and ACM
Transactions on Multimedia Computing, Communications, and Applications.
Yangbangyan Jiang received the B.S. degree in
instrumentation and control from Beihang University
in 2017 and the Ph.D. degree in computer science
from University of Chinese Academy of Sciences in
2023. She is currently a postdoctoral research fellow
with University of Chinese Academy of Sciences.
Her research interests include machine learning and
computer vision. She has authored or coauthored
several academic papers in international journals and
conferences including T-PAMI, NeurIPS, CVPR,
AAAI, ACM MM, etc. She served as a reviewer for
several top-tier conferences such as ICML, NeurIPS, ICLR, CVPR, ICCV,
AAAI.
Yingfei Sun received the Ph.D. degree in applied
mathematics from the Beijing Institute of Technol-
ogy, in 1999. He is currently a Full Professor with
the School of Electronic, Electrical and Communi-
cation Engineering, University of Chinese Academy
of Sciences. His current research interests include
machine learning and pattern recognition.
Qingming Huang is a chair professor in University
of Chinese Academy of Sciences and an adjunct re-
search professor in the Institute of Computing Tech-
nology, Chinese Academy of Sciences. He graduated
with a Bachelor degree in Computer Science in 1988
and Ph.D. degree in Computer Engineering in 1994,
both from Harbin Institute of Technology, China. His
research areas include multimedia computing, image
processing, computer vision and pattern recognition.
He has authored or coauthored more than 400 aca-
demic papers in prestigious international journals
and top-level international conferences. He was the associate editor of IEEE
Trans. on CSVT and Acta Automatica Sinica, and the reviewer of various
international journals including IEEE Trans. on PAMI, IEEE Trans. on Image
Processing, IEEE Trans. on Multimedia, etc. He is a Fellow of IEEE and has
served as general chair, program chair, area chair and TPC member for various
conferences, including ACM Multimedia, CVPR, ICCV, ICME, ICMR, PCM,
BigMM, PSIVT, etc.