Conference Paper
PDF available

Pseudo-Labeling and Confirmation Bias in Deep Semi-Supervised Learning

... Nevertheless, its implementation with non-pre-trained custom CNN architectures remains an open problem, since the number of labeled samples required for satisfactory results is rather high (∼5% of the dataset) [9, 10]. In addition, confirmation bias is a potential drawback of this method: pseudo-labeling errors may adversely affect the model's generalizability [11]. ...
... Self-training approaches use the confidence of the model's predictions to infer pseudo-labels for unlabeled data. However, recent works have shown that the model's performance can be further enhanced by using an auxiliary model (e.g., graph-based methods) in a teacher-student setup [16,5,3,4,11]. It is noteworthy that a number of issues may arise from naive approaches, such as confirmation bias [11], imbalance bias [17] or concept drift [18], which adversely impact the generalizability of the model. Some approaches to mitigate these issues are: considering the pseudo-labeling confidence during regularization [11], resetting the model's parameters at the start of each learning iteration [18], and selecting only the most confident pseudo-labeled samples to ensure labeling correctness [3,4]. ...
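To make the confidence-based mitigation concrete, here is a minimal sketch of thresholded pseudo-label selection in PyTorch. The threshold value and loss weighting are illustrative assumptions, not the exact procedure of any cited work.

```python
import torch
import torch.nn.functional as F

def pseudo_label_loss(model, unlabeled_x, threshold=0.95):
    """Cross-entropy on confident pseudo-labels only (threshold is a
    placeholder value; cited works tune or schedule it)."""
    with torch.no_grad():
        probs = F.softmax(model(unlabeled_x), dim=1)  # current predictions
        conf, pseudo_y = probs.max(dim=1)             # confidence + hard label
        mask = conf.ge(threshold)                     # keep confident samples
    logits = model(unlabeled_x)                       # differentiable pass
    per_sample = F.cross_entropy(logits, pseudo_y, reduction="none")
    return (mask * per_sample).mean()                 # drop uncertain samples
```

Resetting model parameters between iterations or per-class balancing, as in the cited mitigations, would wrap around a step like this.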
Preprint
A major challenge that prevents the training of DL models is the limited availability of accurately labeled data. This shortcoming is highlighted in areas where data annotation becomes a time-consuming and error-prone task. In this regard, SSL tackles this challenge by capitalizing on scarce labeled and abundant unlabeled data; however, SoTA methods typically depend on pre-trained features and large validation sets to learn effective representations for classification tasks. In addition, the reduced set of labeled data is often randomly sampled, neglecting the selection of more informative samples. Here, we present active-DeepFA, a method that effectively combines CL, teacher-student-based meta-pseudo-labeling and AL to train non-pre-trained CNN architectures for image classification in scenarios with scarce labeled and abundant unlabeled data. It integrates DeepFA into a co-training setup that implements two cooperative networks to mitigate confirmation bias from pseudo-labels. The method starts with a reduced set of labeled samples by warming up the networks with supervised CL. Afterward, at regular epoch intervals, label propagation is performed on the 2D projections of the networks' deep features. Next, the most reliable pseudo-labels are exchanged between networks in a cross-training fashion, while the most meaningful samples are annotated and added to the labeled set. The networks independently minimize an objective loss function comprising supervised contrastive, supervised and semi-supervised loss components, enhancing the representations towards image classification. Our approach is evaluated on three challenging biological image datasets using only 5% of labeled samples; it improves on its baselines and outperforms six other SoTA methods. In addition, it reduces annotation effort by achieving results comparable to those of its counterparts with only 3% of labeled data.
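The cross-training exchange at the heart of the co-training setup can be pictured with a minimal two-network sketch. This is only our illustration of the exchange idea; it omits active-DeepFA's contrastive warm-up, 2D feature projection, label propagation and active sample selection, and all names and the threshold are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_training_step(net_a, net_b, unlabeled_x, opt_a, opt_b, tau=0.9):
    """Each network trains on the *other* network's confident pseudo-labels,
    so a network's own mistakes are not fed straight back into it."""
    with torch.no_grad():
        conf_a, y_a = F.softmax(net_a(unlabeled_x), dim=1).max(dim=1)
        conf_b, y_b = F.softmax(net_b(unlabeled_x), dim=1).max(dim=1)

    # net_a's reliable pseudo-labels supervise net_b ...
    loss_b = (conf_a.ge(tau) *
              F.cross_entropy(net_b(unlabeled_x), y_a, reduction="none")).mean()
    opt_b.zero_grad(); loss_b.backward(); opt_b.step()

    # ... and net_b's reliable pseudo-labels supervise net_a.
    loss_a = (conf_b.ge(tau) *
              F.cross_entropy(net_a(unlabeled_x), y_b, reduction="none")).mean()
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()
```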
... Let us illustrate the core concepts of reciprocal learning through a straightforward example involving self-training within semi-supervised learning (SSL); see, e.g., Arazo et al. (2020); Rizve et al. (2020); Rodemann (2023, 2024); Rodemann et al. (2023a,b); Li et al. (2020); Bordini et al. (2024); Dietrich et al. (2024). The objective of SSL is to learn a predictive classification function ŷ(x, θ) parameterized by θ, leveraging both labeled and unlabeled data. ...
... For ease of exposition, assume Θ = [−100, 100]² in what follows. Further assume self-predicted data is added according to a regularized Lipschitz-continuous uncertainty measure, as in Rizve et al. (2020) or Arazo et al. (2020), such that the sufficient conditions for the non-greedy sample adaptation f_s to be L_s-Lipschitz are fulfilled. Moreover, assume only m = 1 data point is changed per iteration, such that L_s = (n−1)/n. ...
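For the quoted constant, one consistent reading (our interpretation, not the paper's derivation) is that L_s reflects the fraction of the size-n training sample left unchanged when m points are replaced per iteration:

```latex
L_s = \frac{n - m}{n}, \qquad \text{so for } m = 1:\quad L_s = \frac{n - 1}{n}.
```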
Preprint
Full-text available
Many learning paradigms self-select training data in light of previously learned parameters. Examples include active learning, semi-supervised learning, bandits, or boosting. Rodemann et al. (2024) unify them under the framework of "reciprocal learning". In this article, we address the question of how well these methods can generalize from their self-selected samples. In particular, we prove universal generalization bounds for reciprocal learning using covering numbers and Wasserstein ambiguity sets. Our results require no assumptions on the distribution of self-selected data, only verifiable conditions on the algorithms. We prove results for both convergent and finite iteration solutions. The latter are anytime valid, thereby giving rise to stopping rules for a practitioner seeking to guarantee the out-of-sample performance of their reciprocal learning algorithm. Finally, we illustrate our bounds and stopping rules for reciprocal learning's special case of semi-supervised learning.
... This work focuses mainly on the two currently dominant approaches to SSL, i.e., pseudo-labeling and consistency regularization. The goal of pseudo-labeling is to generate pseudo-labels for unlabeled samples with a model trained on labeled data, expanding the labeled data pool for training [42–45]. Consistency regularization aims to obtain an output distribution that is invariant to perturbation/augmentation [46–51]. ...
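A minimal sketch of the consistency-regularization idea follows; the augmentation function and the KL divergence are generic assumptions, and the cited methods [46–51] differ in these details.

```python
import torch
import torch.nn.functional as F

def consistency_loss(model, x, augment):
    """Penalize divergence between predictions for two augmented views
    of the same inputs (augment is any stochastic perturbation)."""
    log_p1 = F.log_softmax(model(augment(x)), dim=1)
    p2 = F.softmax(model(augment(x)), dim=1).detach()  # fixed target view
    return F.kl_div(log_p1, p2, reduction="batchmean")
```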
Article
Full-text available
Cross-domain few-shot scene classification (CDFSSC) aims to tackle the challenge of classifying target-domain data with limited labeled samples under distribution shift and category mismatch in remote sensing imagery classification. Existing methods primarily focus on extracting domain-common knowledge while overlooking domain-specific knowledge, which is insufficient for effective cross-domain representation learning. Additionally, the category mismatch between the source and target domains further hinders the model's ability to adapt and perform effectively on few-shot tasks in the target domain. Hence, in this article, a novel CDFSSC method called the domain knowledge decomposition (DKD) framework is proposed to effectively exploit domain-common and domain-specific knowledge from the pseudo-labels of target samples, improve the certainty of cross-domain representation learning, and enhance the model's adaptability to the target domain. First, a cross-domain pseudo-label decomposed learning structure is proposed for DKD to facilitate domain-common knowledge transfer and correct the interference from domain-specific knowledge. It decomposes the pseudo-labels at the logit level to separately exploit the two types of knowledge, ensuring more effective and robust cross-domain representation learning. Second, a certainty-enhanced dynamic loss is designed to strengthen the certainty of cross-domain representation learning by minimizing the self-entropy of predictions and applying a dynamic loss re-weighting principle. Third, in the target-domain adaptation and few-shot evaluation stage, a target-domain-specific adapter is designed to improve the model's adaptability to target-domain few-shot tasks, while addressing category mismatch through classifier fine-tuning. Extensive experimental results on 12 RS cross-domain scenarios demonstrate the strong performance of the proposed DKD framework.
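The certainty-enhancing term built on prediction self-entropy can be sketched generically; this is an assumption-laden illustration, and DKD's dynamic re-weighting and logit-level decomposition are not reproduced here.

```python
import torch
import torch.nn.functional as F

def self_entropy(logits, eps=1e-8):
    """Mean Shannon entropy of softmax predictions; adding this term to a
    loss and minimizing it pushes the model toward more certain outputs."""
    p = F.softmax(logits, dim=1)
    return -(p * (p + eps).log()).sum(dim=1).mean()
```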
... Notably, our experiments suggest the ideal balance is around 100 iterations per active learning cycle; longer cycles appear to reduce overall anomaly detection efficiency, highlighting the need for careful tuning and early-stopping strategies. Such overfitting risk in SSL has been observed previously, especially in low-label regimes without strong regularisation [31]. Given that the original FixMatch and its derivatives, such as MSMatch, typically employ extensive training iterations, our adaptation of shorter, more frequent retraining cycles emerges as a beneficial modification tailored to active learning scenarios with heavily imbalanced classes and a simplified binary classification task. ...
Preprint
Full-text available
Anomaly detection in large datasets is essential in fields such as astronomy and computer vision; however, supervised methods typically require extensive anomaly labelling, which is often impractical. We present AnomalyMatch, an anomaly detection framework combining the semi-supervised FixMatch algorithm using EfficientNet classifiers with active learning. By treating anomaly detection as a semi-supervised binary classification problem, we efficiently utilise limited labelled and abundant unlabelled images. We allow iterative model refinement in a user interface for expert verification of high-confidence anomalies and correction of false positives. Built for astronomical data, AnomalyMatch generalises readily to other domains facing similar data challenges. Evaluations on the GalaxyMNIST astronomical dataset and the miniImageNet natural-image benchmark under severe class imbalance (1% anomalies for miniImageNet) display strong performance: starting from five to ten labelled anomalies and after three active learning cycles, we achieve an average AUROC of 0.95 (miniImageNet) and 0.86 (GalaxyMNIST), with respective AUPRC of 0.77 and 0.71. After the active learning cycles, anomalies are ranked with 71% (miniImageNet) to 93% precision in the 1% of highest-ranked images. AnomalyMatch is tailored for large-scale applications, efficiently processing predictions for 100 million images within three days on a single GPU. Integrated into ESA's Datalabs platform, AnomalyMatch facilitates targeted discovery of scientifically valuable anomalies in vast astronomical datasets. Our results underscore the exceptional utility and scalability of this approach for anomaly discovery, highlighting the value of specialised approaches for domains characterised by severe label scarcity.
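At its core, the FixMatch update used here pairs a weakly augmented view (to produce pseudo-labels) with a strongly augmented view (to be trained). The sketch below shows that coupling for a binary anomaly-vs-nominal classifier; the augmentations and threshold are placeholder assumptions rather than AnomalyMatch's exact configuration.

```python
import torch
import torch.nn.functional as F

def fixmatch_unlabeled_loss(model, x, weak_aug, strong_aug, tau=0.95):
    """Confident predictions on the weak view supervise the strong view."""
    with torch.no_grad():
        probs = F.softmax(model(weak_aug(x)), dim=1)
        conf, pseudo_y = probs.max(dim=1)     # 0 = nominal, 1 = anomaly
        mask = conf.ge(tau)                   # keep only confident samples
    logits = model(strong_aug(x))
    loss = F.cross_entropy(logits, pseudo_y, reduction="none")
    return (mask * loss).mean()
```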
Article
Full-text available
How can we perform unsupervised domain adaptation when transferring a black-box source model to a target domain? Black-box Unsupervised Domain Adaptation focuses on transferring the labels derived from a pre-trained black-box source model to an unlabeled target domain. The problem setting is motivated by privacy concerns associated with accessing and utilizing source data or source model parameters. Recent studies typically train the target model by mimicking the labels derived from the black-box source model, which often contain noise due to domain gaps between the source and the target. Directly exploiting such noisy labels, or disregarding them, may lead to a decrease in the model's performance. We propose Threshold-Based Exploitation of Noisy Predictions (TEN), a method to accurately learn the target model with noisy labels in Black-box Unsupervised Domain Adaptation. To preserve the information from the black-box source model, we employ a threshold-based approach to distinguish between clean labels and noisy labels, thereby allowing the transfer of high-confidence knowledge from both types of labels. We utilize a flexible thresholding approach to adjust the threshold for each class, thereby obtaining an adequate amount of clean data for hard-to-learn classes. We also exploit knowledge distillation for clean data and negative learning for noisy labels to extract high-confidence information. Extensive experiments show that TEN outperforms baselines with an accuracy improvement of up to 9.49%.
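The confidence split and the negative-learning term can be sketched as follows. This is a generic illustration under our assumptions: TEN's per-class flexible thresholds and its knowledge-distillation details are not reproduced, and a fixed threshold stands in for them.

```python
import torch
import torch.nn.functional as F

def clean_noisy_loss(student_logits, source_probs, tau=0.8, eps=1e-8):
    """Cross-entropy on 'clean' (confident) source labels; negative
    learning on the rest, pushing probability away from a randomly
    drawn complementary label that is believed to be wrong."""
    conf, src_y = source_probs.max(dim=1)
    clean = conf.ge(tau)                          # trusted source predictions
    p = F.softmax(student_logits, dim=1)
    ce = F.cross_entropy(student_logits, src_y, reduction="none")
    num_classes = p.size(1)
    # complementary label: any class other than the (noisy) source label
    comp_y = (src_y + torch.randint_like(src_y, 1, num_classes)) % num_classes
    nl = -(1.0 - p.gather(1, comp_y.unsqueeze(1)).squeeze(1) + eps).log()
    return (clean * ce + (~clean) * nl).mean()
```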
Conference Paper
Full-text available
We introduce Interpolation Consistency Training (ICT), a simple and computationally efficient algorithm for training deep neural networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark datasets.
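The ICT objective is compact enough to state in code. A minimal sketch: the published method computes targets with a mean-teacher (exponential moving average) copy of the student, which appears here simply as a `teacher` argument.

```python
import torch
import torch.nn.functional as F

def ict_loss(student, teacher, u1, u2, alpha=1.0):
    """Prediction at a mixup of two unlabeled batches should match the
    same mixup of the teacher's predictions at those points."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    mixed = lam * u1 + (1.0 - lam) * u2
    with torch.no_grad():
        target = (lam * F.softmax(teacher(u1), dim=1)
                  + (1.0 - lam) * F.softmax(teacher(u2), dim=1))
    return F.mse_loss(F.softmax(student(mixed), dim=1), target)
```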
Article
Full-text available
For many applications, the collection of labeled data is expensive and laborious. Exploitation of unlabeled data during training is thus a long-pursued objective of machine learning. Self-supervised learning addresses this by positing an auxiliary task (different from, but related to, the supervised task) for which data is abundantly available. In this paper, we show how ranking can be used as a proxy task for some regression problems. As another contribution, we propose an efficient backpropagation technique for Siamese networks which prevents the redundant computation introduced by the multi-branch network architecture. We apply our framework to two regression problems: Image Quality Assessment (IQA) and Crowd Counting. For both, we show how to automatically generate ranked image sets from unlabeled data. Our results show that networks trained to regress to the ground-truth targets for labeled data, and simultaneously to rank unlabeled data, obtain significantly better, state-of-the-art results for both IQA and crowd counting. In addition, we show that measuring network uncertainty on the self-supervised proxy task is a good measure of the informativeness of unlabeled data. This can be used to drive an algorithm for active learning, and we show that this reduces labeling effort by up to 50%.
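The efficient-backpropagation idea can be sketched as one forward pass per batch, with all pairwise ranking hinges formed on the resulting scores, avoiding the duplicated computation of a multi-branch Siamese network. The details below (hinge loss, margin, ordering convention) are our assumptions, not the paper's exact implementation.

```python
import torch

def batch_ranking_loss(model, x_ranked, margin=0.1):
    """x_ranked is ordered so that item i should outscore item j for i < j.
    One forward pass; all pairs are formed at the loss level."""
    s = model(x_ranked).squeeze(1)             # one scalar score per image
    diff = s.unsqueeze(1) - s.unsqueeze(0)     # diff[i, j] = s_i - s_j
    i, j = torch.triu_indices(len(s), len(s), offset=1)
    hinge = torch.clamp(margin - diff[i, j], min=0)  # want s_i > s_j + margin
    return hinge.mean()
```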
Article
One of the successful approaches in semi-supervised learning is based on consistency regularization. Typically, a student model is trained to be consistent with the teacher's predictions for inputs under different perturbations. To be successful, the prediction targets given by the teacher must be of good quality; otherwise, the student can be misled by the teacher. Unfortunately, existing methods do not assess the quality of the teacher targets. In this paper, we propose a novel Certainty-driven Consistency Loss (CCL) that exploits predictive uncertainty in the consistency loss to let the student dynamically learn from reliable targets. Specifically, we propose two approaches, i.e., Filtering CCL and Temperature CCL, to either filter out uncertain predictions or pay less attention to them in the consistency regularization. We further introduce a novel decoupled framework to encourage model difference. Experimental results on SVHN, CIFAR-10, and CIFAR-100 demonstrate the advantages of our method over several existing methods.
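A minimal sketch of the filtering variant, under our assumptions: teacher predictive entropy stands in as the uncertainty measure and a hard threshold does the filtering; the paper's exact certainty estimate and the Temperature CCL variant are not reproduced.

```python
import torch
import torch.nn.functional as F

def filtering_consistency_loss(student, teacher, x, max_entropy=0.5):
    """Consistency loss kept only where the teacher is certain enough."""
    with torch.no_grad():
        pt = F.softmax(teacher(x), dim=1)
        entropy = -(pt * pt.clamp_min(1e-8).log()).sum(dim=1)
        keep = entropy.le(max_entropy)          # filter uncertain targets
    ps = F.softmax(student(x), dim=1)
    per_sample = ((ps - pt) ** 2).sum(dim=1)    # squared-error consistency
    return (keep * per_sample).mean()
```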
Chapter
Despite the steady progress in video analysis led by the adoption of convolutional neural networks (CNNs), the relative improvement has been less drastic than that in 2D static image classification. Three main challenges exist: spatial (image) feature representation, temporal information representation, and model/computation complexity. It was recently shown by Carreira and Zisserman that 3D CNNs, inflated from 2D networks and pretrained on ImageNet, could be a promising way for spatial and temporal representation learning. However, regarding model/computation complexity, 3D CNNs are much more expensive than 2D CNNs and prone to overfitting. We seek a balance between speed and accuracy by building an effective and efficient video classification system through systematic exploration of critical network design choices. In particular, we show that it is possible to replace many of the 3D convolutions with low-cost 2D convolutions. Rather surprisingly, the best result (in both speed and accuracy) is achieved when replacing the 3D convolutions at the bottom of the network, suggesting that temporal representation learning on high-level "semantic" features is more useful. Our conclusion generalizes to datasets with very different properties. When combined with several other cost-effective designs, including separable spatial/temporal convolution and feature gating, this yields an effective video classification system that produces very competitive results on several action classification benchmarks (Kinetics, Something-Something, UCF101 and HMDB), as well as two action detection (localization) benchmarks (JHMDB and UCF101-24).
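The "replace 3D convolutions at the bottom with 2D ones" design reduces to kernel shapes: in `nn.Conv3d`, a (1, k, k) kernel is a per-frame 2D convolution, while (k, k, k) also mixes frames over time. A minimal sketch with illustrative channel counts follows; it is not the paper's architecture.

```python
import torch
import torch.nn as nn

class Bottom2DTop3D(nn.Module):
    """Cheap spatial-only convolutions on low-level features, full
    spatio-temporal convolutions on high-level 'semantic' features."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.bottom = nn.Sequential(  # (1,3,3) kernels: no temporal mixing
            nn.Conv3d(3, 32, (1, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
            nn.Conv3d(32, 64, (1, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
        )
        self.top = nn.Sequential(     # (3,3,3) kernels: temporal mixing
            nn.Conv3d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.fc = nn.Linear(128, num_classes)

    def forward(self, video):         # video: (N, 3, T, H, W)
        return self.fc(self.top(self.bottom(video)).flatten(1))
```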