Figure 4 - uploaded by Peyman Bateni
Comparison of the feature extraction and classification in CNAPS versus Simple CNAPS: Both CNAPS and Simple CNAPS share the feature extraction adaptation architecture detailed in Figure 3. The two models differ in how distances between query feature vectors and class feature representations are computed for classification. CNAPS uses a trained, adapted linear classifier, whereas Simple CNAPS uses a differentiable but fixed, parameter-free deterministic distance computation. Components in light blue have parameters that are trained, specifically f^τ_θ in both models and ψ^c_φ in the CNAPS adaptive classification. CNAPS classification requires 778k parameters, while Simple CNAPS classification is fully deterministic.

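To make the contrast concrete, here is a minimal NumPy sketch of the two classification heads (an illustration under assumed names and shapes, not the authors' code): CNAPS scores queries with per-class linear weights produced by the trained adaptation network ψ^c_φ, while Simple CNAPS scores them with a fixed, parameter-free Mahalanobis distance to the class means.

```python
import numpy as np

def cnaps_head(query_feats, class_weights, class_biases):
    """Adapted linear classifier: logits_k = f(x) . w_k + b_k, where the
    per-class weights come from the trained adaptation network psi."""
    return query_feats @ class_weights.T + class_biases

def simple_cnaps_head(query_feats, class_means, class_covs, ridge=1e-3):
    """Parameter-free head: negative squared Mahalanobis distance to each
    class mean, using a (regularized) per-class covariance estimate."""
    logits = []
    for mu, cov in zip(class_means, class_covs):
        prec = np.linalg.inv(cov + ridge * np.eye(cov.shape[0]))  # assumed ridge for invertibility
        diff = query_feats - mu
        logits.append(-np.einsum("nd,de,ne->n", diff, prec, diff))
    return np.stack(logits, axis=1)  # shape: (num_queries, num_classes)
```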

Source publication
Conference Paper
Full-text available
Few-shot learning is a fundamental task in computer vision that carries the promise of alleviating the need for exhaustively labeled data. Most few-shot learning approaches to date have focused on progressively more complex neural feature extractors and classifier adaptation strategies, and the refinement of the task definition itself. In this pape...

Contexts in source publication

Context 1
... we show that regularized class-specific covariance estimation from task-specific adapted feature vectors allows the use of the Mahalanobis distance for classification, achieving a significant improvement over the state of the art. A high-level diagrammatic comparison of our "Simple CNAPS" architecture to CNAPS can be found in Figure 4. ...
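A hedged reconstruction of the quantities this context refers to (notation paraphrased from the source publication; the ridge term βI is a standard regularizer assumed here for invertibility): the per-class covariance blends a class-level estimate Σ^τ_k with a task-level estimate Σ^τ, and classification uses the resulting squared Mahalanobis distance.

```latex
Q_k^{\tau} = \lambda_k^{\tau}\,\Sigma_k^{\tau} + \bigl(1-\lambda_k^{\tau}\bigr)\,\Sigma^{\tau} + \beta I,
\qquad
d_k(x) = \bigl(f_\theta^{\tau}(x) - \mu_k\bigr)^{\top} \bigl(Q_k^{\tau}\bigr)^{-1} \bigl(f_\theta^{\tau}(x) - \mu_k\bigr)
```

Class probabilities then follow from a softmax over the negated distances −d_k(x).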
Context 2
... class mean µ_k is obtained by mean-pooling the feature vectors of the support examples for class k extracted by the adapted feature extractor f^τ_θ. A visual overview of the CNAPS adapted classifier architecture is shown in Figure 4, bottom left, red. ...
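In symbols, with S_k denoting the support set for class k, the mean-pooling described above is simply:

```latex
\mu_k = \frac{1}{|S_k|} \sum_{x_i \in S_k} f_\theta^{\tau}(x_i)
```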
Context 3
... considered other ratios and making the λ^τ_k learnable parameters, but found that, of all the considered alternatives, the simple deterministic ratio above produced the best results. The architecture of the classifier in Simple CNAPS appears in Figure 4, bottom-right, blue. ...
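For reference, the deterministic ratio referred to here is, in the paper's notation (with n^τ_k the number of support examples for class k):

```latex
\lambda_k^{\tau} = \frac{n_k^{\tau}}{n_k^{\tau} + 1}
```

so classes with more support examples lean more heavily on their own covariance estimate.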

Similar publications

Article
Full-text available
We present a numerical solver for the incompressible Navier-Stokes equations that combines fourth-order-accurate discrete approximations and an adaptive tree grid (i.e. h-refinement). The scheme employs a novel compact-upwind advection scheme and a fourth-order-accurate projection algorithm whereby the numerical solution exactly satisfies the incompre...
Preprint
Full-text available
We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density ma...

Citations

... Inspired by the success of few-shot learning in computer vision [2,20,46] and natural language processing [1,36,48], few-shot learning on graphs has recently seen significant development [23,25,27,28,43,49,50]. The core concept of current mainstream methods is to develop complicated algorithms to address the problem of few-shot learning on graphs. ...
Preprint
Graph neural networks have been demonstrated as a powerful paradigm for effectively learning graph-structured data on the web and mining content from it. Current leading graph models require a large number of labeled samples for training, which unavoidably leads to overfitting in few-shot scenarios. Recent research has sought to alleviate this issue by simultaneously leveraging graph learning and meta-learning paradigms. However, these graph meta-learning models assume the availability of numerous meta-training tasks to learn transferable meta-knowledge. Such an assumption may not be feasible in the real world due to the difficulty of constructing tasks and the substantial costs involved. Therefore, we propose a SiMple yet effectIve approach for graph few-shot Learning with fEwer tasks, named SMILE. We introduce a dual-level mixup strategy, encompassing both within-task and across-task mixup, to simultaneously enrich the available nodes and tasks in meta-learning. Moreover, we explicitly leverage the prior information provided by node degrees in the graph to encode expressive node representations. Theoretically, we demonstrate that SMILE can enhance the model's generalization ability. Empirically, SMILE consistently outperforms other competitive models by a large margin across all evaluated datasets in both in-domain and cross-domain settings. Our anonymous code can be found here.
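As a rough illustration of the within-task half of such a dual-level mixup strategy, the sketch below mixes pairs of support examples and their one-hot labels (function names, shapes, and the Beta parameter are assumptions, not the SMILE implementation):

```python
import numpy as np

def within_task_mixup(feats, labels_onehot, alpha=0.5, rng=None):
    """Mix random pairs: x' = w*x_i + (1-w)*x_j, likewise for labels."""
    if rng is None:
        rng = np.random.default_rng(0)
    n = feats.shape[0]
    perm = rng.permutation(n)                 # random partner for each example
    w = rng.beta(alpha, alpha, size=(n, 1))   # per-pair mixing weights in (0, 1)
    mixed_x = w * feats + (1 - w) * feats[perm]
    mixed_y = w * labels_onehot + (1 - w) * labels_onehot[perm]
    return mixed_x, mixed_y
```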
... Metric-based methods learn a semantic embedding space and classify query samples based on their similarity. Metrics such as cosine similarity [27], Euclidean distance [21], Mahalanobis distance [47], and Earth Mover's Distance (EMD) [48] have been effectively applied to FSL. CrossTransformers [49] explored coarse spatial correspondence between the query and the labeled images and then used the spatially-corresponding features for classification. ...
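For readers comparing these metrics, here are illustrative NumPy definitions of three of them (generic textbook forms, not taken from any cited paper; EMD is omitted because it requires an optimal-transport solver):

```python
import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def euclidean_distance(a, b):
    return float(np.linalg.norm(a - b))

def mahalanobis_distance(a, b, cov):
    # cov: covariance matrix of the reference class; must be invertible
    diff = a - b
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))
```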
Article
Full-text available
The goal of Few-Shot Continual Learning (FSCL) is to incrementally learn novel tasks with limited labeled samples while simultaneously preserving previous capabilities. However, current FSCL works lack research on domain increment and domain generalization ability, and so cannot cope with changes in the visual perception environment. In this paper, we set up a Generalized FSCL (GFSCL) protocol involving both class- and domain-incremental scenarios together with domain generalization assessment. Firstly, two benchmark datasets and protocols are newly arranged, and detailed baselines are provided for this unexplored configuration. Furthermore, we find that common continual learning methods have poor generalization ability on unseen domains and cannot adequately tackle the catastrophic forgetting issue in cross-incremental tasks. Hence, we propose a rehearsal-free framework based on Vision Transformer (ViT) named Contrastive Mixture of Adapters (CMoA). It contains two non-conflicting parts: (1) By applying the fast-adaptation characteristic of adapter-embedded ViT, the Mixture of Adapters (MoA) module is incorporated into ViT. For stability purposes, cosine similarity regularization and dynamic weighting are designed to make each adapter learn specific knowledge and concentrate on particular classes. (2) To further enhance domain generalization ability, we alleviate intra-class variation via prototype-calibrated contrastive learning to improve domain-invariant representation learning. Finally, six evaluation indicators covering overall performance and forgetting are compared in comprehensive experiments on two benchmark datasets to validate the efficacy of CMoA, and the results illustrate that CMoA achieves performance comparable to rehearsal-based continual learning methods. The codes and protocols are available at https://github.com/yawencui/CMoA.
... While there exists multiple strategies to construct a good metric [36,56], using the Mahalanobis distance with the covariance matrix estimated from the training data is a simple yet well-performing one [3]: ...
Preprint
The growing popularity of Contrastive Language-Image Pretraining (CLIP) has led to its widespread application in various visual downstream tasks. To enhance CLIP's effectiveness and versatility, efficient few-shot adaptation techniques have been widely adopted. Among these approaches, training-free methods, particularly caching methods exemplified by Tip-Adapter, have gained attention for their lightweight adaptation without the need for additional fine-tuning. In this paper, we revisit Tip-Adapter from a kernel perspective, showing that caching methods function as local adapters and are connected to a well-established kernel literature. Drawing on this insight, we offer a theoretical understanding of how these methods operate and suggest multiple avenues for enhancing the Tip-Adapter baseline. Notably, our analysis shows the importance of incorporating global information in local adapters. Therefore, we subsequently propose a global method that learns a proximal regularizer in a reproducing kernel Hilbert space (RKHS) using CLIP as a base learner. Our method, which we call ProKeR (Proximal Kernel ridge Regression), has a closed form solution and achieves state-of-the-art performances across 11 datasets in the standard few-shot adaptation benchmark.
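To ground the "closed form solution" the abstract mentions, here is generic kernel ridge regression in an RKHS (a sketch of the standard estimator, not ProKeR's exact proximal variant; the kernel choice and hyperparameters are assumptions):

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between row vectors of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def krr_fit_predict(X_train, Y_train, X_test, lam=1e-3, gamma=1.0):
    """Closed-form kernel ridge regression: alpha = (K + lam*n*I)^-1 Y."""
    n = len(X_train)
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * n * np.eye(n), Y_train)
    return rbf_kernel(X_test, X_train, gamma) @ alpha
```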
... Simple CNAPS [21], in contrast to the ProtoNet method, uses the Mahalanobis distance instead of the Euclidean distance. In this method, the class covariance matrix, unlike in Gaussian prototypical networks [22], which learn a diagonal covariance matrix, captures the covariance of the class data across all dimensions. ...
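A small numeric example (illustrative only) of why the full covariance matters: with correlated feature dimensions, the diagonal-only distance ignores the off-diagonal terms and yields a different value.

```python
import numpy as np

x, mu = np.array([1.0, 2.0]), np.array([0.0, 0.0])
cov = np.array([[2.0, 1.5],
                [1.5, 2.0]])                     # correlated dimensions

d_full = (x - mu) @ np.linalg.inv(cov) @ (x - mu)
d_diag = (x - mu) @ np.linalg.inv(np.diag(np.diag(cov))) @ (x - mu)
print(d_full, d_diag)  # ~2.286 vs 2.5: the off-diagonal terms change the distance
```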
Article
Full-text available
Prominent prototype-based classification (PbC) approaches, such as Prototypical Networks (ProtoNet), use the average of samples within a class as the class prototype. In these methods, which we call Mean-PbC, a discriminant classifier is defined based on the minimum Mahalanobis distance from class prototypes. It is well known that if the data of each class is normally distributed, then the use of the Mahalanobis distance leads to an optimal discriminant classifier. We propose the Hard-Positive Prototypical Networks (HPP-Net), which also employ the Mahalanobis distance, despite assuming that the class distribution may not be normal. HPP-Net learns class prototypes from hard (near-boundary) samples that are less similar to the class center and have a higher misclassification probability. It also employs a learnable parameter to capture the covariance of samples around the new prototypes. The valuable finding of this paper is that a more accurate discriminant classifier can be attained by applying the Mahalanobis distance in which the mean is a "hard-positive prototype" and the covariance is learned via the model. The experimental results on the Omniglot, CUB, miniImagenet and CIFAR-100 datasets demonstrate that HPP-Net achieves competitive performance compared to ProtoNet and several other prototype-based few-shot learning (FSL) methods.
... Leveraging the Meta-Dataset, various Cross-Domain Few-Shot Learning (CDFSL) methods have been developed (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Dvornik, Schmid, and Mairal 2020; Liu et al. 2020; Guo et al. 2023; Tian et al. 2024), demonstrating significant advancements in this field. These approaches typically parameterize deep neural networks with a large set of task-agnostic parameters alongside a smaller set of task-specific parameters. ...
... Task-specific parameters are optimized to the target task through an adaptation mechanism, generally following one of two primary methodologies. The first approach utilizes an auxiliary network functioning as a parameter generator, which, upon receiving a few labeled examples from the target task, outputs optimized task-specific parameters (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2020, 2021). The second approach directly fine-tunes the task-specific parameters through gradient descent using a few labeled examples from the target task (Dvornik, Schmid, and Mairal 2020; Li, Liu, and Bilen 2021, 2022; Triantafillou et al. 2021; Tian et al. 2024). ...
... Task-agnostic parameters can be designed as a single network or multiple networks. The single network is trained on a large dataset from a single domain (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021) or from multiple domains (Triantafillou et al. 2021; Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), whereas the multiple networks are trained individually on each domain (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020). Task-specific parameters can be designed as selection parameters (Dvornik, Schmid, and Mairal 2020; Liu et al. 2020), a pre-classifier transformation (Li, Liu, and Bilen 2021, 2022; Guo et al. 2023), Feature-wise Linear Modulation (FiLM) layers (Requeima et al. 2019; Bateni et al. 2020, 2022; Liu et al. 2021; Triantafillou et al. 2021), or Residual Adapters (RA) (Li, Liu, and Bilen 2022; Guo et al. 2023). ...
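Since FiLM layers recur throughout this taxonomy, a minimal sketch may help (a generic illustration of feature-wise linear modulation, not any specific cited implementation): task-specific scale and shift vectors modulate the task-agnostic backbone's activations channel-wise.

```python
import numpy as np

def film(activations, gamma, beta):
    """activations: (N, C, H, W); gamma, beta: (C,) generated per task."""
    return gamma[None, :, None, None] * activations + beta[None, :, None, None]
```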
Preprint
Full-text available
Cross-Domain Few-Shot Learning~(CDFSL) methods typically parameterize models with task-agnostic and task-specific parameters. To adapt task-specific parameters, recent approaches have utilized fixed optimization strategies, despite their potential sub-optimality across varying domains or target tasks. To address this issue, we propose a novel adaptation mechanism called Task-Specific Preconditioned gradient descent~(TSP). Our method first meta-learns Domain-Specific Preconditioners~(DSPs) that capture the characteristics of each meta-training domain, which are then linearly combined using task-coefficients to form the Task-Specific Preconditioner. The preconditioner is applied to gradient descent, making the optimization adaptive to the target task. We constrain our preconditioners to be positive definite, guiding the preconditioned gradient toward the direction of steepest descent. Empirical evaluations on the Meta-Dataset show that TSP achieves state-of-the-art performance across diverse experimental scenarios.
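A hedged sketch of the kind of update the abstract describes (names, shapes, and the combination rule are assumptions, not the TSP implementation): a positive combination of positive-definite DSPs is itself positive definite, so the preconditioned step remains a descent direction.

```python
import numpy as np

def tsp_step(theta, grad, dsps, task_coeffs, lr=0.01):
    """One preconditioned gradient step: theta <- theta - lr * P @ grad.

    theta, grad: (d,) parameter and gradient vectors.
    dsps: list of (d, d) positive-definite domain-specific preconditioners.
    task_coeffs: positive per-domain coefficients for this task.
    """
    P = sum(c * D for c, D in zip(task_coeffs, dsps))  # task-specific preconditioner
    return theta - lr * (P @ grad)
```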
... [Flattened results table from the citing article; each row is method (year) [ref] followed by two accuracy columns whose headers, first entry, and last entry are truncated in this snippet:]
…61 [27]: 62.81 / 82.11
Relation Net (2018) [28]: 68.26 / 80.94
MsSoSN (2020) [57]: 84.69 / 94.21
SoSN ArL (2021) [58]: 76.21 / 88.36
CMKD (2021) [12]: 50.30 ± 0.01 / –
MProtoNet (2021) [18]: 86.52 ± 0.36 / 94.57 ± 0.13
CNAPS (2020) [59]: – / 90.70 ± 0.50
MAP-Net (2022) [17]: 77.41 / … ...
Article
Full-text available
The development of transformer-based models has significantly advanced research in natural language processing and computer vision, allowing us to create models with excellent results across various domains. However, in real-world scenarios, a model may lack generalization ability and perform poorly due to data distribution shifts, insufficient training data, or low-quality data. This work proposes the generic multimodal optimization-based few-shot learning framework (GoFSL). The framework leverages few-shot learning to learn from little data, multimodal learning to learn a rich representation of image and text data, and meta-learning to aid model generalization. We evaluated the framework using ten datasets from various domains and with varied characteristics, including short texts from Twitter, long legal-domain texts, texts in alphabetic (English and Portuguese) and non-alphabetic (Japanese) languages, images from the medical domain, and multimodal benchmark datasets. GoFSL outperformed the state-of-the-art model ALMO by 1.05% on CUB-200-2011 and multimodal ProtoNet by 0.86% on the Oxford-102 dataset. GoFSL is a small but efficient model, with low estimated CO2 emissions (0.01 kgCO2eq), and is adaptable to different domains, data modalities, and languages.
... Typically, each category has very few available training samples, leading to poor performance of traditional deep learning methods in few-shot scenarios. To overcome this issue, researchers have proposed various methods and techniques, including transfer learning [12], meta-learning [10,13], and metric learning [14,15], to enhance the performance and generalization of few-shot classification tasks. These approaches usually rely on a large-scale meta-training set, followed by training with a small amount of support set data to adapt to new tasks. ...
Article
Full-text available
Few-shot learning focuses on training efficient models with limited amounts of training data. Its mainstream approaches have evolved from single-modal to multi-modal methods. The Contrastive Language-Image Pre-training model, known as CLIP, achieves image classification by aligning the embedding spaces of images and text. To better achieve knowledge transfer between the image domain and the text domain, we propose a fine-tuning framework for vision-language models with CLIP. It introduces a novel adversarial domain adaptation approach, which trains a symmetrical text-and-image classifier to identify the differences between the two domains. To more effectively align text and image in the same space, we adapt two types of confusion loss to construct the aligned semantic space by fine-tuning the multi-modal feature extractor. Experiments on 11 public datasets show that our proposed method has superior performance compared with state-of-the-art CLIP-driven learning methods.
... It is noteworthy that performance improves when using large language models (LLMs) to generate descriptors [8,20,21] or even employing random strings as descriptors [32]. On the other hand, few-shot classification [1,37,38,41] is another branch that classifies new classes with minimal training data. This method is useful for real-world problems with limited data or labeling difficulties [25]. ...
Preprint
The performance of vision-language models (VLMs), such as CLIP, in visual classification tasks, has been enhanced by leveraging semantic knowledge from large language models (LLMs), including GPT. Recent studies have shown that in zero-shot classification tasks, descriptors incorporating additional cues, high-level concepts, or even random characters often outperform those using only the category name. In many classification tasks, while the top-1 accuracy may be relatively low, the top-5 accuracy is often significantly higher. This gap implies that most misclassifications occur among a few similar classes, highlighting the model's difficulty in distinguishing between classes with subtle differences. To address this challenge, we introduce a novel concept of comparative descriptors. These descriptors emphasize the unique features of a target class against its most similar classes, enhancing differentiation. By generating and integrating these comparative descriptors into the classification framework, we refine the semantic focus and improve classification accuracy. An additional filtering process ensures that these descriptors are closer to the image embeddings in the CLIP space, further enhancing performance. Our approach demonstrates improved accuracy and robustness in visual classification tasks by addressing the specific challenge of subtle inter-class differences.
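The filtering step mentioned above might look roughly like the following (a hypothetical sketch; the scoring rule and keep_ratio are assumptions, not the paper's procedure): rank descriptors by the closeness of their text embeddings to the image embeddings in CLIP space and keep the top fraction.

```python
import numpy as np

def filter_descriptors(text_embs, image_embs, keep_ratio=0.5):
    """text_embs: (D, d) descriptor embeddings; image_embs: (N, d).
    Both are assumed L2-normalized, so dot products are cosine similarities."""
    sims = text_embs @ image_embs.T        # (D, N) descriptor-image similarities
    scores = sims.mean(axis=1)             # average closeness to the image set
    k = max(1, int(keep_ratio * len(text_embs)))
    return np.argsort(-scores)[:k]         # indices of retained descriptors
```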
... Model-based methods focus on quickly updating the parameters for a small amount of sample data using model structures [13,14]. Metric-based methods aim to learn a distance metric or similarity function to classify new samples efficiently [15][16][17]. Optimization-based methods focus on designing optimal objective functions, thereby enabling models to complete few-shot classification tasks by iteratively adjusting the optimization [18,19]. Among these, metric-based methods have become mainstream because of their simplicity and efficiency. ...
Article
Full-text available
With the advancement of technology and improvements in cultivation techniques coupled with growing market demand, the global orchid market continues to expand, giving the orchid industry extremely high research and commercial value. However, because of the wide variety of orchid species, relying solely on human visual recognition or traditional paper-based data for comparison is time-consuming and labor-intensive, making orchid species recognition challenging. Deep learning technology has brought significant advancements to the field of image recognition. However, publicly available orchid datasets in the real world are scarce, and manually collecting a large amount of data incurs high costs. In situations with limited data, training deep learning models on a large scale is extremely difficult. To address these issues, this study employed the few-shot learning method, enabling the training of highly effective models under limited data conditions. This study proposes a few-shot learning and diffusion model data augmentation method based on dual-attention mechanisms for orchid species recognition. In the data preprocessing stage, stable diffusion technology was used to generate additional images to augment the original dataset. In addition, this study adopted ResNet34 as the backbone network through transfer learning mechanisms and utilized prototypical network classification algorithms for model training. The experimental results demonstrate that the proposed model achieves accuracies of 84.23% and 93.32% on 5-way 1-shot and 5-way 5-shot tasks, respectively. Furthermore, the generalizability of the proposed model was validated on the Omniglot and CIFAR-FS public datasets. The experimental results also indicate that the method proposed in this study can reduce reliance on large amounts of data and improve training efficiency.
... For example, Zhang et al. (2022b) utilized Earth Mover's Distance (EMD) to model the distance between images as an optimal transport plan. Bateni et al. (2020) proposed constructing Mahalanobis-distance classifiers to improve the accuracy of few-shot classification. ...
Article
Full-text available
Few-shot Segmentation aims to segment the interested objects in the query image with just a handful of labeled samples (i.e., support images). Previous schemes would leverage the similarity between support-query pixel pairs to construct the pixel-level semantic correlation. However, in remote sensing scenarios with extreme intra-class variations and cluttered backgrounds, such pixel-level correlations may produce tremendous mismatches, resulting in semantic ambiguity between the query foreground (FG) and background (BG) pixels. To tackle this problem, we propose a novel Agent Mining Transformer, which adaptively mines a set of local-aware agents to construct agent-level semantic correlation. Compared with pixel-level semantics, the given agents are equipped with local-contextual information and possess a broader receptive field. At this point, different query pixels can selectively aggregate the fine-grained local semantics of different agents, thereby enhancing the semantic clarity between query FG and BG pixels. Concretely, the Agent Learning Encoder is first proposed to erect the optimal transport plan that arranges different agents to aggregate support semantics under different local regions. Then, for further optimizing the agents, the Agent Aggregation Decoder and the Semantic Alignment Decoder are constructed to break through the limited support set for mining valuable class-specific semantics from unlabeled data sources and the query image itself, respectively. Extensive experiments on the remote sensing benchmark iSAID indicate that the proposed method achieves state-of-the-art performance. Surprisingly, our method remains quite competitive when extended to more common natural scenarios, i.e., PASCAL-5^i and COCO-20^i.