Niki Parmar’s research while affiliated with Mountain View College and other places




Publications (24)


Guest Editorial Introduction to the Special Section on Transformer Models in Vision
  • Article
  • November 2023 · 51 Reads
  • IEEE Transactions on Pattern Analysis and Machine Intelligence
  • [...] · Mubarak Shah

Transformer models have achieved outstanding results on a variety of language tasks, such as text classification, machine translation, and question answering. This success in the field of Natural Language Processing (NLP) has sparked interest in the computer vision community to apply these models to vision and multi-modal learning tasks. However, visual data has a unique structure, which requires rethinking network designs and training methods. As a result, Transformer models and their variations have been successfully used for image recognition, object detection, segmentation, image super-resolution, video understanding, image generation, text-image synthesis, and visual question answering, among other applications.



Decoder Denoising Pretraining for Semantic Segmentation
  • May 2022 · 42 Reads · 2 Citations

Semantic segmentation labels are expensive and time consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be suboptimal, especially when few labeled examples are available. We propose a decoder pretraining approach based on denoising, which can be combined with supervised pretraining of the encoder. We find that decoder denoising pretraining on the ImageNet dataset strongly outperforms encoder-only supervised pretraining. Despite its simplicity, decoder denoising pretraining achieves state-of-the-art results on label-efficient semantic segmentation and offers considerable gains on the Cityscapes, Pascal Context, and ADE20K datasets.
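The pretraining recipe described in this abstract can be illustrated with a short training-step sketch. The snippet below is a minimal, hypothetical illustration assuming a generic PyTorch encoder-decoder; the toy `encoder` and `decoder` modules, the noise scale `sigma`, and the choice of predicting the added noise are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical stand-ins for a real segmentation backbone's encoder and decoder;
# these toy modules are NOT the paper's architecture.
encoder = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
decoder = nn.Conv2d(64, 3, 3, padding=1)

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

def decoder_denoising_step(images, sigma=0.2):
    """One pretraining step: corrupt the images with Gaussian noise and train
    the decoder to predict that noise from the encoder's features."""
    noise = torch.randn_like(images)
    noisy = images + sigma * noise       # corrupted input
    with torch.no_grad():                # encoder kept fixed here (e.g. supervised-pretrained)
        features = encoder(noisy)
    pred_noise = decoder(features)       # decoder learns a denoising objective
    loss = F.mse_loss(pred_noise, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage on a random batch; in practice this would run over ImageNet images.
print(decoder_denoising_step(torch.rand(4, 3, 64, 64)))
```

After this pretraining stage, the decoder weights replace the usual random initialization and the full model is fine-tuned on the labeled segmentation data.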




Simple and Efficient ways to Improve REALM
  • April 2021 · 29 Reads

Dense retrieval has been shown to be effective for retrieving relevant documents for Open Domain QA, surpassing popular sparse retrieval methods like BM25. REALM (Guu et al., 2020) is an end-to-end dense retrieval system that relies on MLM based pretraining for improved downstream QA efficiency across multiple datasets. We study the finetuning of REALM on various QA tasks and explore the limits of various hyperparameter and supervision choices. We find that REALM was significantly undertrained during finetuning, and that simple improvements to the training, supervision, and inference setups can significantly benefit QA results and exceed the performance of models published after it. Our best model, REALM++, incorporates all of the best-working findings and achieves significant QA accuracy improvements over baselines (~5.5% absolute accuracy) without any model design changes. Additionally, REALM++ matches the performance of large Open Domain QA models which have 3x more parameters, demonstrating the efficiency of the setup.
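To make the retrieval setup being tuned here concrete, the sketch below shows the dense-retrieval step at the core of REALM-style open-domain QA. It is a simplified illustration: the `retrieve` helper, the random embeddings, and the toy corpus size are assumptions for demonstration; a real system encodes questions and passages with BERT-style encoders and searches a precomputed index with exact or approximate inner-product search.

```python
import torch
import torch.nn.functional as F

def retrieve(question_vec, doc_embeddings, k=5):
    """Score every document by inner product with the question and return the
    top-k document indices together with a softmax retrieval distribution."""
    scores = doc_embeddings @ question_vec      # [num_docs]
    top = torch.topk(scores, k)
    return top.indices, F.softmax(top.values, dim=0)

# Toy corpus: 1,000 passages embedded in a 128-dimensional space.
doc_embeddings = F.normalize(torch.randn(1000, 128), dim=-1)
question_vec = F.normalize(torch.randn(128), dim=-1)

doc_ids, probs = retrieve(question_vec, doc_embeddings)
print(doc_ids.tolist(), probs.tolist())

# The retrieved passages (weighted by `probs`) are then handed to a reader that
# extracts the answer span; the tuning choices studied in the paper (exact
# search, larger batches, more retrieved documents, better supervision) act on
# this retrieval/reading pipeline rather than on the model design.
```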


Figure 1. HaloNet local self-attention architecture: the different stages of blocked local attention for a [4, 4, c] image, block size b = 2, and halo h = 1. The image is first blocked into non-overlapping [2, 2, c] images from which the queries are computed. The subsequent haloing step then extracts a [4, 4, c] memory around each of the blocks, which is linearly transformed to keys and values. The spatial dimensions after attention are the same as those of the queries.
Figure 2. The attention downsampling layer subsamples the queries but keeps the neighborhood the same as in the stride = 1 case.
Figure 5. Relaxing translational equivariance improves accuracies.
Figure 6. The accuracy gap between HaloNet-50 and ResNet-50 is maintained with increasing image sizes. The HaloNet experiments are annotated with block size (b) and halo size (h).
Figure 7. Increasing window sizes improves accuracy up to a point. The experiments in the graph are annotated with their block size (b) and halo size (h); h = 0 implies attention with non-overlapping blocks.
Scaling Local Self-Attention For Parameter Efficient Visual Backbones
  • Preprint
  • File available
  • March 2021 · 507 Reads · 2 Citations

Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-parameter trade-offs compared to baseline convolutional models such as ResNet-50. In this work, we aim to develop self-attention models that can outperform not just the canonical baseline models, but even the high-performing convolutional models. We propose two extensions to self-attention that, in conjunction with a more efficient implementation of self-attention, improve the speed, memory usage, and accuracy of these models. We leverage these improvements to develop a new self-attention model family, HaloNets, which reach state-of-the-art accuracies on the parameter-limited setting of the ImageNet classification benchmark. In preliminary transfer learning experiments, we find that HaloNet models outperform much larger models and have better inference performance. On harder tasks such as object detection and instance segmentation, our simple local self-attention and convolutional hybrids show improvements over very strong baselines. These results mark another step in demonstrating the efficacy of self-attention models on settings traditionally dominated by convolutional models.
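Figure 1 above describes the blocked local attention with haloing that underlies HaloNet; the sketch below implements that single operation for the toy [4, 4, c] example (block size b = 2, halo h = 1). It is a simplified, single-head version written for clarity, not the paper's optimized implementation; `halo_attention` and the projection matrices are illustrative names.

```python
import math
import torch
import torch.nn.functional as F

def halo_attention(x, wq, wk, wv, block=2, halo=1):
    """Single-head blocked local self-attention with haloing: queries come from
    non-overlapping b x b blocks, while keys/values come from the
    (b + 2h) x (b + 2h) haloed neighborhood around each block."""
    B, H, W, C = x.shape
    d = wq.shape[1]
    nh, nw = H // block, W // block

    # Queries from non-overlapping blocks: [B, num_blocks, block*block, d]
    q = (x @ wq).view(B, nh, block, nw, block, d)
    q = q.permute(0, 1, 3, 2, 4, 5).reshape(B, nh * nw, block * block, d)

    # Haloed memory: pad by h, then take (block + 2*halo) windows with stride = block.
    win = block + 2 * halo
    mem = F.pad(x.permute(0, 3, 1, 2), (halo, halo, halo, halo))      # [B, C, H+2h, W+2h]
    mem = F.unfold(mem, kernel_size=win, stride=block)                # [B, C*win*win, num_blocks]
    mem = mem.view(B, C, win * win, nh * nw).permute(0, 3, 2, 1)      # [B, num_blocks, win*win, C]
    k, v = mem @ wk, mem @ wv

    attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(d), dim=-1)
    out = attn @ v                                                    # [B, num_blocks, block*block, d]
    return out.view(B, nh, nw, block, block, d).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, d)

# Toy usage matching Figure 1: a [4, 4, c] image, block size b = 2, halo h = 1.
c, d = 8, 16
x = torch.randn(1, 4, 4, c)
wq, wk, wv = (torch.randn(c, d) for _ in range(3))
print(halo_attention(x, wq, wk, wv).shape)  # torch.Size([1, 4, 4, 16])
```

The spatial dimensions of the output match the queries, as stated in the caption of Figure 1.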


Bottleneck Transformers for Visual Recognition
  • January 2021 · 282 Reads · 7 Citations

We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no other changes, our approach improves upon the baselines significantly on instance segmentation and object detection while also reducing the parameters, with minimal overhead in latency. Through the design of BoTNet, we also point out how ResNet bottleneck blocks with self-attention can be viewed as Transformer blocks. Without any bells and whistles, BoTNet achieves 44.4% Mask AP and 49.7% Box AP on the COCO Instance Segmentation benchmark using the Mask R-CNN framework; surpassing the previous best published single model and single scale results of ResNeSt evaluated on the COCO validation set. Finally, we present a simple adaptation of the BoTNet design for image classification, resulting in models that achieve a strong performance of 84.7% top-1 accuracy on the ImageNet benchmark while being up to 2.33x faster in compute time than the popular EfficientNet models on TPU-v3 hardware. We hope our simple and effective approach will serve as a strong baseline for future research in self-attention models for vision.
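The core change described in the abstract, swapping the 3x3 spatial convolution of a ResNet bottleneck block for global multi-head self-attention, can be sketched as follows. `BoTBlockSketch` is a hypothetical, simplified module (batch normalization, strides, and the relative position encodings used in BoTNet are omitted), not the authors' code.

```python
import torch
import torch.nn as nn

class BoTBlockSketch(nn.Module):
    """A ResNet-style bottleneck block whose 3x3 spatial convolution is replaced
    by global multi-head self-attention over all spatial positions."""
    def __init__(self, channels, bottleneck=4, heads=4):
        super().__init__()
        mid = channels // bottleneck
        self.reduce = nn.Conv2d(channels, mid, kernel_size=1)   # 1x1 conv: shrink channels
        self.mhsa = nn.MultiheadAttention(mid, heads, batch_first=True)
        self.expand = nn.Conv2d(mid, channels, kernel_size=1)   # 1x1 conv: restore channels
        self.relu = nn.ReLU()

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.relu(self.reduce(x))
        tokens = y.flatten(2).transpose(1, 2)                   # [B, H*W, mid]: every position attends to every other
        y, _ = self.mhsa(tokens, tokens, tokens)
        y = y.transpose(1, 2).view(b, -1, h, w)
        return self.relu(x + self.expand(y))                    # residual connection, as in a ResNet block

# Toy usage on a feature map of the size typical for the last ResNet stage.
block = BoTBlockSketch(channels=256)
print(block(torch.randn(2, 256, 14, 14)).shape)  # torch.Size([2, 256, 14, 14])
```

Restricting the replacement to the final stage keeps the quadratic cost of global attention manageable, since the feature maps there are already small.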





Citations (19)


... Diffusion models [24,73,75] have emerged as a dominant force, outperforming Generative Adversarial Networks (GANs) [32] and opening a new era in generative modeling. Previous studies [1,6,7,25,85,86,88] have shown that generative models can learn competitive recognition representations, indicating a strong relationship between image understanding and generation. ...

Reference: USP: Unified Self-Supervised Pretraining for Image Generation and Understanding

Denoising Pretraining for Semantic Segmentation
  • Citing Conference Paper
  • June 2022

... Although interesting, these generative models are usually difficult to train, and require large computing capability to be applied on complex 3D images. In a simplified setting, denoising can also be used to extend pretrained "off the shelf" encoders into full segmentation networks [22]. ...

Decoder Denoising Pretraining for Semantic Segmentation
  • Citing Preprint
  • May 2022

... As an extension work, Balachandran et al. [236] perform a thorough tuning of REALM on a variety of QA tasks. They conduct extensive experiments with multiple training tricks, including using exact vector similarity search, training with a larger batch, retrieving more documents for the reader, and incorporating human annotations for evidence passages. ...

Simple and Efficient ways to Improve REALM
  • Citing Conference Paper
  • January 2021

... Several studies investigate the idea of applying attention modules or utilizing additional relational data to improve the performance of convolutional neural networks [12]. Srinivas et al. [41] replace the convolutional layers with self-attention in the model's final stages. Bello et al. [40] propose to augment self-attention with convolution by concatenating feature maps from the self-attention pipeline with convolutions in certain layers. ...

Bottleneck Transformers for Visual Recognition
  • Citing Conference Paper
  • June 2021

... These models possess deeper network architectures and enhanced feature extraction capabilities, enabling them to autonomously learn representations from large datasets, thereby improving adaptability. Advances in deep learning have propelled fields such as computer vision (CV) and natural language processing (NLP), and these models are also extensively used in fault diagnosis [3][4][5][6]. For instance, Zhong et al. proposed a transfer learning model for gas turbine fault diagnosis based on CNN and SVM, which effectively transfers fault classification knowledge to small-scale fault data [7]. ...

Scaling Local Self-Attention for Parameter Efficient Visual Backbones
  • Citing Conference Paper
  • June 2021

... ASR models take speech features as inputs and output characters or words. We used two ASR models in this study, a state-of-the-art Conformer-Transducer (CTD) [57] and a wav2vec2 ASR model [58]. The CTD model was used for evaluating the speech attributes that are important for improving the performance of ASR while the wav2vec2 model was finetuned on the final datasets generated after combining all the speech attributes to create a diverse synthetic dataset. ...

Conformer: Convolution-augmented Transformer for Speech Recognition
  • Citing Conference Paper
  • October 2020

... Therefore, an additional region matching loss is proposed that mitigates the lost semantic correspondences during the merging process in multi-stage transformer architectures. Vaswani et al. (2021) reduce computational complexity of a standard transformer by reducing the number of patches that go through the transformer at each layer. For this, a special image patch merging module merges the patches at each layer and attention is calculated between them via sparse self-attention modules. ...

Scaling Local Self-Attention For Parameter Efficient Visual Backbones

... Each residual block comprises a sequence of convolutional layers, particularly the BottleneckBlock units. [48] These units start with a 1 × 1 convolution to reduce data dimensions, followed by a 3 × 3 convolution for comprehensive feature extraction, and conclude with another 1 × 1 convolution to restore the data dimensions. Moreover, our discretization model adopts the CrossEntropyLoss function, which combines log softmax and negative log-likelihood loss to accurately compute gradients. ...

Bottleneck Transformers for Visual Recognition
  • Citing Preprint
  • January 2021
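As an aside, the claim in the excerpt above that CrossEntropyLoss combines log softmax with the negative log-likelihood loss is easy to check in PyTorch; the short snippet below is an illustrative verification only, unrelated to the cited models.

```python
import torch
import torch.nn as nn

# Verify: CrossEntropyLoss(logits) == NLLLoss(log_softmax(logits))
logits = torch.randn(8, 5)            # 8 samples, 5 classes
targets = torch.randint(0, 5, (8,))

ce = nn.CrossEntropyLoss()(logits, targets)
nll = nn.NLLLoss()(torch.log_softmax(logits, dim=1), targets)
assert torch.allclose(ce, nll)
```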

... It has outperformed traditional deep neural network models based on CNNs or RNNs in numerous NLP tasks, such as machine translation. Additionally, the Transformer model has been applied to other fields such as speech processing and computer vision (Dosovitskiy et al., 2020;Liu et al., 2021;Gulati et al., 2020). In recent years, more and more researchers have also introduced it to time series related fields. ...

Conformer: Convolution-augmented Transformer for Speech Recognition
  • Citing Preprint
  • May 2020

... Zhao et al. [10] used a large-scale unlabeled corpus to pre-train a denoising autoencoder [11,12] with a copy mechanism for the Transformer model [13] and achieved results close to those of Ge et al. [14] based on word-level and sentence-level multi-task learning methods, using only publicly available "error-corrected" parallel corpora. Based on the idea of round-trip translation, Lichtarge et al. [15] used Wikipedia data to generate a large number of pseudo-parallel sentence pairs to pre-train a forward grammatical error-correction model, using an intermediate language as a pivot for round-trip translation. ...

Corpora Generation for Grammatical Error Correction
  • Citing Conference Paper
  • January 2019