Preprint

Singular Value Fine-tuning: Few-shot Segmentation requires Few-parameters Fine-tuning


Abstract

Freezing the pre-trained backbone has become a standard paradigm to avoid overfitting in few-shot segmentation. In this paper, we rethink the paradigm and explore a new regime: fine-tuning a small part of the parameters in the backbone. We present a solution to overcome the overfitting problem, leading to better model generalization on learning novel classes. Our method decomposes backbone parameters into three successive matrices via the Singular Value Decomposition (SVD), then fine-tunes only the singular values and keeps everything else frozen. This design allows the model to adjust feature representations on novel classes while maintaining the semantic clues within the pre-trained backbone. We evaluate our Singular Value Fine-tuning (SVF) approach on various few-shot segmentation methods with different backbones. We achieve state-of-the-art results on both Pascal-5^i and COCO-20^i across 1-shot and 5-shot settings. Hopefully, this simple baseline will encourage researchers to rethink the role of backbone fine-tuning in few-shot settings. The source code and models will be available at https://github.com/syp2ysy/SVF.
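To make the regime concrete, the sketch below shows what singular value fine-tuning could look like for a single convolution in PyTorch. The module name, the flattening of the 4-D kernel, and the bias handling are our own assumptions for illustration; the official repository above contains the authors' actual implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SVFConv2d(nn.Module):
        """Rebuilds a conv weight as W = U diag(S) V^T from a frozen SVD basis;
        only the singular values S receive gradients."""

        def __init__(self, conv: nn.Conv2d):
            super().__init__()
            w = conv.weight.data                        # (out_c, in_c, kh, kw)
            out_c = w.shape[0]
            # Flatten the 4-D kernel to a matrix so a plain SVD applies.
            u, s, vt = torch.linalg.svd(w.reshape(out_c, -1), full_matrices=False)
            self.register_buffer("u", u)                # frozen singular vectors
            self.register_buffer("vt", vt)
            self.s = nn.Parameter(s)                    # the only trainable part
            self.weight_shape = w.shape
            self.stride, self.padding = conv.stride, conv.padding
            self.bias = conv.bias                       # (dilation/groups omitted)
            if self.bias is not None:
                self.bias.requires_grad_(False)         # keep the bias frozen too

        def forward(self, x):
            w = (self.u * self.s) @ self.vt             # U diag(S) V^T
            return F.conv2d(x, w.reshape(self.weight_shape), self.bias,
                            stride=self.stride, padding=self.padding)

Replacing each convolution in a frozen backbone with such a module leaves the singular vectors, i.e. the vast majority of the weights, untouched; only the short vectors of singular values are updated when adapting to novel classes.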


References
Conference Paper
Full-text available
Recently few-shot segmentation (FSS) has been extensively developed. Most previous works strive to achieve generalization through the meta-learning framework derived from classification tasks; however, the trained models are biased towards the seen classes instead of being ideally class-agnostic, thus hindering the recognition of new concepts. This paper proposes a fresh and straightforward insight to alleviate the problem. Specifically, we apply an additional branch (base learner) to the conventional FSS model (meta learner) to explicitly identify the targets of base classes, i.e., the regions that do not need to be segmented. Then, the coarse results output by these two learners in parallel are adaptively integrated to yield precise segmentation prediction. Considering the sensitivity of the meta learner, we further introduce an adjustment factor to estimate the scene differences between the input image pairs for facilitating the model ensemble forecasting. The substantial performance gains on PASCAL-5^i and COCO-20^i verify the effectiveness, and surprisingly, our versatile scheme sets a new state-of-the-art even with two plain learners. Moreover, in light of the unique nature of the proposed approach, we also extend it to a more realistic but challenging setting, i.e., generalized FSS, where the pixels of both base and novel classes are required to be determined. The source code is available at github.com/chunbolang/BAM.
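As a rough illustration of the two-learner ensemble (our own simplification, not BAM's exact merging rule; see the linked repository for the real one), the integration step can be pictured as suppressing the meta learner's foreground wherever the base learner is confident a pixel belongs to an already-seen class, modulated by the scene-difference factor:

    import torch

    def integrate_predictions(meta_fg, base_fg, psi):
        """Illustrative merge of the two coarse maps; names and formula are a
        simplification of BAM's ensemble, not the paper's exact rule.

        meta_fg, base_fg: (B, 1, H, W) foreground probabilities from the meta
        and base learners; psi: (B, 1, 1, 1) adjustment factor estimated from
        the support/query scene difference.
        """
        # Down-weight novel-class evidence on confident base-class regions.
        return meta_fg * (1.0 - psi * base_fg)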
Conference Paper
Full-text available
Semantic segmentation assigns a class label to each image pixel. This dense prediction problem requires large amounts of manually annotated data, which is often unavailable. Few-shot learning aims to learn the pattern of a new category with only a few annotated examples. In this paper, we formulate the few-shot semantic segmentation problem from 1-way (class) to N-way (classes). Inspired by few-shot classification, we propose a generalized framework for few-shot semantic segmentation with an alternative training scheme. The framework is based on prototype learning and metric learning. Our approach outperforms the baselines by a large margin and shows comparable performance for 1-way few-shot semantic segmentation on PASCAL VOC 2012 dataset.
Article
Full-text available
The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions. This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.
Article
Full-text available
We present techniques for speeding up the test-time evaluation of large convolutional networks, designed for object recognition tasks. These models deliver impressive accuracy but each image evaluation requires millions of floating point operations, making their deployment on smartphones and Internet-scale clusters problematic. The computation is dominated by the convolution operations in the lower layers of the model. We exploit the linear structure present within the convolutional filters to derive approximations that significantly reduce the required computation. Using large state-of-the-art models, we demonstrate speedups by a factor of 2x, while keeping the accuracy within 1% of the original model.
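The underlying trick can be sketched in a few lines: flatten a filter bank to a matrix, truncate its SVD, and replace one wide multiply with two thin ones. The function name and synthetic weights below are our own; the paper develops considerably more refined approximations.

    import torch

    def low_rank_factorize(w, rank):
        """Approximate W (out x in) by A @ B with A: (out, r), B: (r, in).
        One dense multiply costing out*in MACs becomes two thin ones costing
        r*(out + in), a large saving when r << min(out, in)."""
        u, s, vt = torch.linalg.svd(w, full_matrices=False)
        a = u[:, :rank] * s[:rank]      # absorb the singular values into A
        b = vt[:rank]
        return a, b

    # Synthetic, genuinely low-rank "filter matrix" for demonstration:
    # 256 filters over 256 x 3 x 3 input patches, built from 50 components.
    w = torch.randn(256, 50) @ torch.randn(50, 2304)
    a, b = low_rank_factorize(w, rank=50)
    print(torch.linalg.norm(w - a @ b) / torch.linalg.norm(w))  # ~0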
Article
Few-shot fine-grained recognition (FS-FGR) aims to distinguish highly similar objects from different sub-categories with limited supervision. Traditional few-shot learning solutions, however, typically exploit image-level features and focus on capturing global silhouettes while overlooking local details, so inconspicuous yet discriminative information is inevitably lost. How to effectively address fine-grained recognition given limited samples therefore remains a major challenge. In this article, we propose an effective bidirectional pyramid architecture that enhances internal feature representations to suit the fine-grained image recognition task in the few-shot learning scenario. Specifically, we deploy a multi-scale feature pyramid and a multi-level attention pyramid on the backbone network, and progressively aggregate features from different granular spaces through both. We further present an attention-guided refinement strategy, working with the multi-level attention pyramid, to reduce the uncertainty introduced by backgrounds under limited samples. In addition, the proposed method is trained with the meta-learning framework in an end-to-end fashion without any extra supervision. Extensive experimental results on four challenging and widely-used fine-grained benchmarks show that the proposed method performs favorably against the state of the art, especially in one-shot scenarios.
Article
Contextual information has been shown to be powerful for semantic segmentation. This work proposes a novel Context-based Tandem Network (CTNet) by interactively exploring the spatial contextual information and the channel contextual information, which can discover the semantic context for semantic segmentation. Specifically, the Spatial Contextual Module (SCM) is leveraged to uncover the spatial contextual dependency between pixels by exploring the correlation between pixels and categories. Meanwhile, the Channel Contextual Module (CCM) is introduced to learn the semantic features including the semantic feature maps and class-specific features by modeling the long-term semantic dependence between channels. The learned semantic features are utilized as the prior knowledge to guide the learning of SCM, which can make SCM obtain more accurate long-range spatial dependency. Finally, to further improve the performance of the learned representations for semantic segmentation, the results of the two context modules are adaptively integrated to achieve better results. Extensive experiments are conducted on four widely-used datasets, i.e., PASCAL-Context, Cityscapes, ADE20K and PASCAL VOC2012. The results demonstrate the superior performance of the proposed CTNet by comparison with several state-of-the-art methods. The source code and models are available at https://github.com/syp2ysy/CTNet .
Article
State-of-the-art semantic segmentation methods require sufficient labeled data to achieve good results and hardly work on unseen classes without fine-tuning. Few-shot segmentation is thus proposed to tackle this problem by learning a model that quickly adapts to new classes with a few labeled support samples. These frameworks still face the challenge of reduced generalization ability on unseen classes due to inappropriate use of high-level semantic information of training classes and spatial inconsistency between query and support targets. To alleviate these issues, we propose the Prior Guided Feature Enrichment Network (PFENet). It consists of novel designs of (1) a training-free prior mask generation method that not only retains generalization power but also improves model performance and (2) a Feature Enrichment Module (FEM) that overcomes spatial inconsistency by adaptively enriching query features with support features and prior masks. Extensive experiments on PASCAL-5^i and COCO prove that the proposed prior generation method and FEM both improve the baseline method significantly. Our PFENet also outperforms state-of-the-art methods by a large margin without efficiency loss. It is surprising that our model even generalizes to cases without labeled support samples.
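The training-free prior is simple enough to sketch: each query pixel's prior value is its best cosine match to any support foreground pixel, min-max normalized to [0, 1]. Shapes and names below are our own minimal rendering; consult the paper for the exact construction.

    import torch
    import torch.nn.functional as F

    def prior_mask(query_feat, support_feat, support_mask, eps=1e-7):
        """Training-free prior: each query pixel's prior is its highest cosine
        similarity to any support foreground pixel, min-max normalized.

        query_feat:   (C, Hq, Wq) high-level query features
        support_feat: (C, Hs, Ws) high-level support features
        support_mask: (Hs, Ws)    binary foreground mask
        """
        c, hq, wq = query_feat.shape
        q = F.normalize(query_feat.reshape(c, -1), dim=0)    # (C, Nq)
        s = F.normalize(support_feat.reshape(c, -1), dim=0)  # (C, Ns)
        sim = q.t() @ s                                      # (Nq, Ns) cosines
        sim = sim.masked_fill(support_mask.reshape(1, -1) < 0.5, -1.0)
        prior = sim.max(dim=1).values                        # best match per pixel
        prior = (prior - prior.min()) / (prior.max() - prior.min() + eps)
        return prior.reshape(hq, wq)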
Article
One-shot image semantic segmentation poses a challenging task of recognizing the object regions from unseen categories with only one annotated example as supervision. In this article, we propose a simple yet effective similarity guidance network to tackle the one-shot (SG-One) segmentation problem. We aim at predicting the segmentation mask of a query image with reference to one densely labeled support image of the same category. To obtain the robust representative feature of the support image, we first adopt a masked average pooling strategy for producing the guidance features by only taking the pixels belonging to the support image into account. We then leverage the cosine similarity to build the relationship between the guidance features and features of pixels from the query image. In this way, the possibilities embedded in the produced similarity maps can be adopted to guide the process of segmenting objects. Furthermore, our SG-One is a unified framework that can efficiently process both support and query images within one network and be learned in an end-to-end manner. We conduct extensive experiments on Pascal VOC 2012. In particular, our SG-One achieves the mIoU score of 46.3%, surpassing the baseline methods.
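The two key operations, masked average pooling and cosine-similarity guidance, can be sketched directly; the shapes and function names below are our own minimal choices:

    import torch
    import torch.nn.functional as F

    def masked_average_pooling(feat, mask):
        """Average support features over foreground pixels only.
        feat: (C, H, W) support features; mask: (H, W) binary foreground mask.
        Returns a (C,) guidance vector for the support class."""
        return (feat * mask.unsqueeze(0)).sum(dim=(1, 2)) / (mask.sum() + 1e-7)

    def similarity_map(query_feat, guidance):
        """Cosine similarity between the guidance vector and each query pixel."""
        c, h, w = query_feat.shape
        q = F.normalize(query_feat.reshape(c, -1), dim=0)    # (C, HW)
        g = F.normalize(guidance, dim=0)                     # (C,)
        return (g @ q).reshape(h, w)                         # values in [-1, 1]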
Article
Restart techniques are common in gradient-free optimization to deal with multimodal functions. Partial restarts are also gaining popularity in gradient-based optimization to improve the rate of convergence in accelerated gradient schemes to deal with ill-conditioned functions. In this paper, we propose a simple restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks. We empirically study its performance on CIFAR-10 and CIFAR-100 datasets where we demonstrate new state-of-the-art results below 4% and 19%, respectively. Our source code is available at https://github.com/loshchil/SGDR.
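Within one restart period of length T_i, the schedule anneals the learning rate with a cosine from eta_max down to eta_min, and a warm restart simply resets the within-period step counter (typically while enlarging T_i). A minimal sketch of the per-step rate:

    import math

    def sgdr_lr(eta_min, eta_max, t_cur, t_i):
        """Cosine-annealed learning rate within one restart period: starts at
        eta_max (t_cur = 0) and decays to eta_min (t_cur = t_i); a warm
        restart resets t_cur to 0."""
        return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

PyTorch ships the same schedule as torch.optim.lr_scheduler.CosineAnnealingWarmRestarts.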
Article
The numerical techniques of transform image coding are well known in the image bandwidth compression literature. This concise paper presents a new transform method in which the singular values and singular vectors of an image are computed and transmitted instead of transform coefficients. The singular value decomposition (SVD) method is known to be the deterministically optimal transform for energy compaction [2]. A systems implementation is hypothesized, and a variety of coding strategies is developed. Statistical properties of the SVD are discussed, and a self-adaptive set of experimental results is presented. Imagery compressed to 1, 1.5, and 2.5 bits per pixel with less than 1.6, 1, and 1/3 percent mean-square error, respectively, is displayed. Finally, additional image coding scenarios are postulated for further consideration.
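The transmission idea reduces to keeping only the top-k singular triplets of the image matrix; a minimal NumPy sketch (function name our own):

    import numpy as np

    def svd_compress(img, k):
        """Rank-k approximation of a grayscale image. Transmitting the top-k
        singular triplets costs k * (m + n + 1) numbers instead of m * n
        pixel values."""
        u, s, vt = np.linalg.svd(img.astype(float), full_matrices=False)
        return (u[:, :k] * s[:k]) @ vt[:k]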
Very deep convolutional networks for large-scale image recognition
  • Karen Simonyan
  • Andrew Zisserman
Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
An image is worth 16x16 words: Transformers for image recognition at scale
  • Alexey Dosovitskiy
  • Lucas Beyer
  • Alexander Kolesnikov
  • Dirk Weissenborn
  • Xiaohua Zhai
  • Thomas Unterthiner
  • Mostafa Dehghani
  • Matthias Minderer
  • Georg Heigold
  • Sylvain Gelly
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
Parameter-efficient transfer learning for nlp
  • Neil Houlsby
  • Andrei Giurgiu
  • Stanislaw Jastrzebski
  • Bruna Morrone
  • Quentin De Laroussilhe
  • Andrea Gesmundo
  • Mona Attariyan
  • Sylvain Gelly
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-efficient transfer learning for nlp. In International Conference on Machine Learning, pages 2790-2799. PMLR, 2019.
Visual prompt tuning
  • Menglin Jia
  • Luming Tang
  • Bor-Chun Chen
  • Claire Cardie
  • Serge Belongie
  • Bharath Hariharan
  • Ser-Nam Lim
Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. arXiv preprint arXiv:2203.12119, 2022.
Few-shot segmentation with global and local contrastive learning
  • Weide Liu
  • Zhonghua Wu
  • Henghui Ding
  • Fayao Liu
  • Jie Lin
  • Guosheng Lin
Weide Liu, Zhonghua Wu, Henghui Ding, Fayao Liu, Jie Lin, and Guosheng Lin. Few-shot segmentation with global and local contrastive learning. arXiv preprint arXiv:2108.05293, 2021.
Part-aware prototype network for few-shot semantic segmentation
  • Yongfei Liu
  • Xiangyi Zhang
  • Songyang Zhang
  • Xuming He
Yongfei Liu, Xiangyi Zhang, Songyang Zhang, and Xuming He. Part-aware prototype network for few-shot semantic segmentation. In European Conference on Computer Vision, pages 142-158. Springer, 2020.
Partial is better than all: Revisiting fine-tuning strategy for few-shot learning
  • Zhiqiang Shen
  • Zechun Liu
  • Jie Qin
  • Marios Savvides
  • Kwang-Ting Cheng
Zhiqiang Shen, Zechun Liu, Jie Qin, Marios Savvides, and Kwang-Ting Cheng. Partial is better than all: Revisiting fine-tuning strategy for few-shot learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9594-9602, 2021.
Prototypical networks for few-shot learning
  • Jake Snell
  • Kevin Swersky
  • Richard Zemel
Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. Advances in neural information processing systems, 30, 2017.
Multimodal few-shot learning with frozen language models
  • Maria Tsimpoukelli
  • Jacob L Menick
  • Serkan Cabi
  • SM Eslami
  • Oriol Vinyals
  • Felix Hill
Maria Tsimpoukelli, Jacob L Menick, Serkan Cabi, SM Eslami, Oriol Vinyals, and Felix Hill. Multimodal few-shot learning with frozen language models. Advances in Neural Information Processing Systems, 34:200-212, 2021.
Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning
  • Colin Wei
  • Sang Michael Xie
  • Tengyu Ma
Colin Wei, Sang Michael Xie, and Tengyu Ma. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. Advances in Neural Information Processing Systems, 34, 2021.
Scale-aware graph neural network for few-shot semantic segmentation
  • Guo-Sen Xie
  • Jie Liu
  • Huan Xiong
  • Ling Shao
Guo-Sen Xie, Jie Liu, Huan Xiong, and Ling Shao. Scale-aware graph neural network for few-shot semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5475-5484, 2021.
Few-shot semantic segmentation with cyclic memory network
  • Guo-Sen Xie
  • Huan Xiong
  • Jie Liu
  • Yazhou Yao
  • Ling Shao
Guo-Sen Xie, Huan Xiong, Jie Liu, Yazhou Yao, and Ling Shao. Few-shot semantic segmentation with cyclic memory network. In Proceedings of the IEEE International Conference on Computer Vision, pages 7293-7302, 2021.
Prototype mixture models for few-shot semantic segmentation
  • Boyu Yang
  • Chang Liu
  • Bohao Li
  • Jianbin Jiao
  • Qixiang Ye
Boyu Yang, Chang Liu, Bohao Li, Jianbin Jiao, and Qixiang Ye. Prototype mixture models for few-shot semantic segmentation. In European Conference on Computer Vision, pages 763-778. Springer, 2020.
Object-contextual representations for semantic segmentation
  • Yuhui Yuan
  • Xilin Chen
  • Jingdong Wang
Yuhui Yuan, Xilin Chen, and Jingdong Wang. Object-contextual representations for semantic segmentation. In European Conference on Computer Vision, pages 173-190. Springer, 2020.
Feature pyramid transformer
  • Dong Zhang
  • Hanwang Zhang
  • Jinhui Tang
  • Meng Wang
  • Xiansheng Hua
  • Qianru Sun
Dong Zhang, Hanwang Zhang, Jinhui Tang, Meng Wang, Xiansheng Hua, and Qianru Sun. Feature pyramid transformer. In European Conference on Computer Vision, pages 323-339. Springer, 2020.
Few-shot segmentation via cycle-consistent transformer
  • Gengwei Zhang
  • Guoliang Kang
  • Yi Yang
  • Yunchao Wei
Gengwei Zhang, Guoliang Kang, Yi Yang, and Yunchao Wei. Few-shot segmentation via cycle-consistent transformer. Advances in Neural Information Processing Systems, 34, 2021.
Revisiting few-sample bert fine-tuning
  • Tianyi Zhang
  • Felix Wu
  • Arzoo Katiyar
  • Kilian Q Weinberger
  • Yoav Artzi
Tianyi Zhang, Felix Wu, Arzoo Katiyar, Kilian Q Weinberger, and Yoav Artzi. Revisiting few-sample bert fine-tuning. In International Conference on Learning Representations, 2021.
Discrimination-aware channel pruning for deep neural networks
  • Zhuangwei Zhuang
  • Mingkui Tan
  • Bohan Zhuang
  • Jing Liu
  • Yong Guo
  • Qingyao Wu
  • Junzhou Huang
  • Jinhui Zhu
Zhuangwei Zhuang, Mingkui Tan, Bohan Zhuang, Jing Liu, Yong Guo, Qingyao Wu, Junzhou Huang, and Jinhui Zhu. Discrimination-aware channel pruning for deep neural networks. Advances in neural information processing systems, 31, 2018.