Niki Parmar's research while affiliated with Google Inc. and other places

Publications (23)

Preprint
Semantic segmentation labels are expensive and time consuming to acquire. Hence, pretraining is commonly used to improve the label-efficiency of segmentation models. Typically, the encoder of a segmentation model is pretrained as a classifier and the decoder is randomly initialized. Here, we argue that random initialization of the decoder can be su...
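For context, here is a minimal sketch (assuming PyTorch/torchvision; not this preprint's method) of the common baseline the abstract questions: the encoder is initialized from classification pretraining while the decoder starts from random weights.

```python
# Minimal sketch (not the paper's method): the common baseline the abstract
# questions -- a classification-pretrained encoder with a randomly
# initialized segmentation decoder.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

encoder = resnet50(weights=ResNet50_Weights.IMAGENET1K_V1)  # pretrained as a classifier
backbone = nn.Sequential(*list(encoder.children())[:-2])    # drop avgpool + fc head

# Hypothetical decoder: its weights start from random initialization.
decoder = nn.Sequential(
    nn.Conv2d(2048, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(256, 21, kernel_size=1),           # 21 = example number of classes
    nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
)

x = torch.randn(1, 3, 224, 224)
logits = decoder(backbone(x))                    # (1, 21, 224, 224) per-pixel logits
```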
Preprint
Dense retrieval has been shown to be effective for retrieving relevant documents for Open Domain QA, surpassing popular sparse retrieval methods like BM25. REALM (Guu et al., 2020) is an end-to-end dense retrieval system that relies on MLM based pretraining for improved downstream QA efficiency across multiple datasets. We study the finetuning of R...
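As a rough illustration of the dense-retrieval scoring that the snippet contrasts with BM25, here is a toy NumPy sketch; the `embed` function is a hypothetical stand-in for a trained encoder, not REALM's implementation.

```python
# Toy sketch of dense retrieval scoring (not REALM's implementation):
# questions and passages are embedded into the same vector space and
# ranked by inner product, in contrast to term-matching scores like BM25.
import numpy as np

rng = np.random.default_rng(0)

def embed(texts, dim=128):
    """Stand-in encoder; a real system would use a trained BERT-style model."""
    return rng.normal(size=(len(texts), dim))

passages = ["passage one ...", "passage two ...", "passage three ..."]
passage_vecs = embed(passages)                   # precomputed index, shape (N, dim)

question_vec = embed(["who wrote hamlet?"])[0]   # shape (dim,)
scores = passage_vecs @ question_vec             # inner-product relevance scores
top_k = np.argsort(-scores)[:2]                  # indices of the 2 best passages
print([passages[i] for i in top_k])
```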
Preprint
Full-text available
Self-attention has the promise of improving computer vision systems due to parameter-independent scaling of receptive fields and content-dependent interactions, in contrast to parameter-dependent scaling and content-independent interactions of convolutions. Self-attention models have recently been shown to have encouraging improvements on accuracy-...
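The contrast the abstract draws can be made concrete by counting parameters: a convolution's parameter count grows with its kernel size, while a self-attention layer's projections do not depend on how large a neighborhood each position attends over. A small illustrative sketch, assuming PyTorch:

```python
# Illustration of the abstract's contrast: convolution parameters grow with
# the receptive field (kernel size), while self-attention's QKV projections
# do not depend on how large a neighborhood each pixel attends over.
import torch.nn as nn

d = 64  # channel width

def n_params(module):
    return sum(p.numel() for p in module.parameters())

for k in (3, 5, 7):
    conv = nn.Conv2d(d, d, kernel_size=k)
    print(f"conv {k}x{k}: {n_params(conv):>7d} params")   # grows as k**2

attn = nn.MultiheadAttention(embed_dim=d, num_heads=8)
print(f"self-attention: {n_params(attn):>7d} params")     # fixed, window-size free
```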
Preprint
We present BoTNet, a conceptually simple yet powerful backbone architecture that incorporates self-attention for multiple computer vision tasks including image classification, object detection and instance segmentation. By just replacing the spatial convolutions with global self-attention in the final three bottleneck blocks of a ResNet and no othe...
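A simplified sketch of the substitution described (assuming PyTorch, and omitting the relative position encodings BoTNet uses): the 3x3 spatial convolution inside a bottleneck block is swapped for global multi-head self-attention over the feature map.

```python
# Simplified sketch of the substitution BoTNet describes: replace the 3x3
# spatial convolution inside a ResNet bottleneck block with global
# multi-head self-attention over the feature map. (Omits the relative
# position encodings used in the paper.)
import torch
import torch.nn as nn

class BottleneckSelfAttention(nn.Module):
    def __init__(self, in_ch, mid_ch, heads=4):
        super().__init__()
        self.reduce = nn.Conv2d(in_ch, mid_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(mid_ch, heads, batch_first=True)
        self.expand = nn.Conv2d(mid_ch, in_ch, kernel_size=1)

    def forward(self, x):
        b, _, h, w = x.shape
        y = self.reduce(x)                                # 1x1 conv, as in a bottleneck
        seq = y.flatten(2).transpose(1, 2)                # (B, H*W, C): all positions
        seq, _ = self.attn(seq, seq, seq)                 # global self-attention (was 3x3 conv)
        y = seq.transpose(1, 2).reshape(b, -1, h, w)
        return x + self.expand(y)                         # residual connection

block = BottleneckSelfAttention(in_ch=2048, mid_ch=512)
out = block(torch.randn(1, 2048, 14, 14))                 # same shape in and out
```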
Preprint
Recently, Transformer and Convolutional Neural Network (CNN) based models have shown promising results in Automatic Speech Recognition (ASR), outperforming Recurrent Neural Networks (RNNs). Transformer models are good at capturing content-based global interactions, while CNNs exploit local features effectively. In this work, we achieve the best of bot...
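A much-simplified sketch, assuming PyTorch, of how such a block can interleave self-attention (global, content-based interactions) with convolution (local features); the exact module ordering and normalization in the paper differ in detail.

```python
# Simplified sketch of a Conformer-style block (details reduced): a half-step
# feed-forward, multi-head self-attention for global context, a depthwise
# convolution module for local features, and another half-step feed-forward.
import torch
import torch.nn as nn

class ConformerishBlock(nn.Module):
    def __init__(self, d=256, heads=4, kernel=15):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_attn = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.conv = nn.Sequential(                        # local-feature module
            nn.Conv1d(d, d, kernel, padding=kernel // 2, groups=d),  # depthwise conv
            nn.SiLU(),
            nn.Conv1d(d, d, 1),                           # pointwise conv
        )
        self.ff2 = nn.Sequential(nn.LayerNorm(d), nn.Linear(d, 4 * d), nn.SiLU(), nn.Linear(4 * d, d))
        self.norm_out = nn.LayerNorm(d)

    def forward(self, x):                                 # x: (B, T, d) acoustic frames
        x = x + 0.5 * self.ff1(x)
        a = self.norm_attn(x)
        x = x + self.attn(a, a, a)[0]                     # global, content-based interactions
        x = x + self.conv(x.transpose(1, 2)).transpose(1, 2)  # local interactions
        x = x + 0.5 * self.ff2(x)
        return self.norm_out(x)

out = ConformerishBlock()(torch.randn(2, 100, 256))       # (2, 100, 256)
```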
Preprint
Medical images, such as 3D computerized tomography (CT) scans and pathology images, have hundreds of millions or billions of voxels/pixels. It is infeasible to train CNN models directly on such high-resolution images, because the neural activations of a single image do not fit in the memory of a single GPU/TPU. Existing image analysis approaches allevia...
Preprint
Full-text available
Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of...
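As a toy illustration of the "augmenting" recipe the visible text refers to (not necessarily this paper's own proposal), a convolution's output can be concatenated with a self-attention output over the same feature map; the module below is a hypothetical sketch assuming PyTorch.

```python
# Toy sketch of augmenting a convolution with content-based interactions:
# concatenate the output of a convolution with the output of a
# self-attention layer over the same feature map.
import torch
import torch.nn as nn

class AttentionAugmentedConv(nn.Module):
    def __init__(self, ch=64, attn_ch=32, heads=4):
        super().__init__()
        self.conv = nn.Conv2d(ch, ch - attn_ch, kernel_size=3, padding=1)
        self.proj = nn.Conv2d(ch, attn_ch, kernel_size=1)
        self.attn = nn.MultiheadAttention(attn_ch, heads, batch_first=True)

    def forward(self, x):
        b, _, h, w = x.shape
        conv_out = self.conv(x)                               # local, content-independent
        seq = self.proj(x).flatten(2).transpose(1, 2)         # (B, H*W, attn_ch)
        attn_out, _ = self.attn(seq, seq, seq)                # global, content-based
        attn_out = attn_out.transpose(1, 2).reshape(b, -1, h, w)
        return torch.cat([conv_out, attn_out], dim=1)         # channels: (ch-attn_ch)+attn_ch

y = AttentionAugmentedConv()(torch.randn(1, 64, 16, 16))      # (1, 64, 16, 16)
```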
Preprint
Grammatical Error Correction (GEC) has been recently modeled using the sequence-to-sequence framework. However, unlike sequence transduction problems such as machine translation, GEC suffers from the lack of plentiful parallel data. We describe two approaches for generating large parallel datasets for GEC using publicly available Wikipedia data. Th...
Preprint
Batch-splitting (data-parallelism) is the dominant distributed Deep Neural Network (DNN) training strategy, due to its universal applicability and its amenability to Single-Program-Multiple-Data (SPMD) programming. However, batch-splitting suffers from problems including the inability to train very large models (due to memory constraints), high lat...
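A toy NumPy illustration of batch-splitting as described: every replica holds the same parameters, processes its own slice of the batch, and the per-shard gradients are averaged before a synchronized update. This is generic data parallelism, not Mesh-TensorFlow code.

```python
# Toy illustration of batch-splitting (data parallelism): each "device"
# processes a slice of the batch with the same (replicated) parameters,
# and gradients are averaged across replicas.
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=(8, 1))                     # replicated model parameters
x, y = rng.normal(size=(32, 8)), rng.normal(size=(32, 1))

def grad(w, xb, yb):
    """Gradient of mean squared error for a linear model on one shard."""
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

num_devices = 4
x_shards = np.split(x, num_devices)             # split the batch, not the model
y_shards = np.split(y, num_devices)
shard_grads = [grad(w, xb, yb) for xb, yb in zip(x_shards, y_shards)]
avg_grad = np.mean(shard_grads, axis=0)         # the "all-reduce" step

w -= 0.1 * avg_grad                             # identical update on every replica
```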
Preprint
We describe an approach to Grammatical Error Correction (GEC) that is effective at making use of models trained on large amounts of weakly supervised bitext. We train the Transformer sequence-to-sequence model on 4B tokens of Wikipedia revisions and employ an iterative decoding strategy that is tailored to the loosely-supervised nature of the Wikip...
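A sketch of the iterative decoding idea in plain Python; `correct_once` is a hypothetical stand-in for one pass of a trained seq2seq corrector, not the paper's API.

```python
# Sketch of iterative decoding for GEC: repeatedly re-feed the model's own
# output, stopping when the hypothesis stops changing or a round limit is
# reached. `correct_once` is a hypothetical single-pass corrector.
def correct_once(sentence: str) -> str:
    """Stand-in for one beam-search pass of a trained seq2seq GEC model."""
    return sentence.replace("has went", "has gone")        # toy "correction"

def iterative_decode(sentence: str, max_rounds: int = 4) -> str:
    for _ in range(max_rounds):
        corrected = correct_once(sentence)
        if corrected == sentence:                          # converged: no further edits
            break
        sentence = corrected                               # feed the output back in
    return sentence

print(iterative_decode("She has went to the store."))
```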
Preprint
Deep neural networks with discrete latent variables offer the promise of better symbolic reasoning and of learning abstractions that are more useful for new tasks. There has been a surge of interest in discrete latent variable models; however, despite several recent improvements, the training of discrete latent variable models has remained challenging...
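For background on why such training is hard: discrete samples block gradient flow, and one common workaround is the straight-through estimator, sketched below with PyTorch. This is a generic illustration, not necessarily the improvement this preprint proposes.

```python
# Generic illustration of the difficulty and one common workaround, the
# straight-through estimator: the forward pass uses a hard one-hot sample,
# the backward pass uses the soft probabilities.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10, requires_grad=True)         # 4 items, 10 discrete codes
probs = F.softmax(logits, dim=-1)

hard = F.one_hot(probs.argmax(dim=-1), num_classes=10).float()  # non-differentiable
straight_through = hard + probs - probs.detach()         # forward: hard, backward: soft

loss = straight_through.sum()                            # placeholder downstream loss
loss.backward()
print(logits.grad is not None)                           # True: gradients reach the logits
```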
Preprint
The past year has witnessed rapid advances in sequence-to-sequence (seq2seq) modeling for Machine Translation (MT). The classic RNN-based approaches to MT were first out-performed by the convolutional seq2seq model, which was then out-performed by the more recent Transformer model. Each of these new approaches consists of a fundamental architecture...
Article
Full-text available
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
Article
Image generation has been successfully cast as an autoregressive sequence generation or transformation problem. Recent work has shown that self-attention is an effective way of modeling textual sequences. In this work, we generalize a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation o...
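The sequence-modeling formulation can be pictured as flattening an image in raster order and sampling pixels one at a time, each conditioned on the prefix. A toy NumPy sketch, with a hypothetical `next_pixel_logits` standing in for the trained self-attention model:

```python
# Toy picture of casting image generation as autoregressive sequence
# generation: pixels are flattened in raster order and sampled one at a
# time, each conditioned on the previously generated prefix.
import numpy as np

rng = np.random.default_rng(0)
H, W, LEVELS = 8, 8, 256                        # tiny image, 8-bit intensities

def next_pixel_logits(prefix):
    """Stand-in for the model p(x_t | x_<t); a real one would attend over prefix."""
    return rng.normal(size=LEVELS)

pixels = []
for _ in range(H * W):
    logits = next_pixel_logits(pixels)
    p = np.exp(logits - logits.max())
    p /= p.sum()
    pixels.append(rng.choice(LEVELS, p=p))      # sample the next intensity
image = np.array(pixels).reshape(H, W)          # un-flatten back to 2-D
```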
Article
Deep learning yields great results across many fields, from speech recognition and image classification to translation. But for each problem, getting a deep model to work well involves research into the architecture and a long period of tuning. We present a single model that yields good results on a number of problems spanning multiple domains. In pa...
Article
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing...
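The core operation behind the attention mechanism referred to here is scaled dot-product attention, softmax(QK^T / sqrt(d_k)) V; a minimal NumPy sketch:

```python
# Minimal NumPy version of scaled dot-product attention, the core operation
# of the Transformer: softmax(Q K^T / sqrt(d_k)) V.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ V                                    # weighted sum of values

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 64)) for _ in range(3))    # 5 positions, d_k = 64
out = scaled_dot_product_attention(Q, K, V)               # shape (5, 64)
```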

Citations

... We think the philosophy of the noise-to-box paradigm is analogous to the noise-to-image process in denoising diffusion models [15,35,79], a class of likelihood-based models that generate an image by gradually removing noise from it via a learned denoising model. Diffusion models have achieved great success in many generation tasks [3,4,37,63,85] and are starting to be explored in perception tasks such as image segmentation [1,5,6,12,28,42,89]. However, to the best of our knowledge, there is no prior art that successfully adopts them for object detection. ...
... Open-Domain Question-Answering is the task of answering an input question given a large external knowledge-base, such as the entire Wikipedia. This problem is typically approached by leveraging a retriever model to first retrieve a set of relevant documents/passages using some IR method, which are then passed on to a reader model (Guu et al., 2020; Xiong et al., 2021; Balachandran et al., 2021). ...
... More recently, the attention mechanism [31] has gradually been applied in the field of machine vision. Several backbones for image feature extraction already exist, such as BoTNet [32] and the Swin Transformer [33]. In this work, BoTNet is chosen as the backbone in consideration of efficiency. ...
... NA also approaches self-attention itself as its window size grows, and unlike WSA, it would not need pixel shifts, as it is a dynamic operation. Similar operations, in which self-attention is restricted in a token-wise manner, had been investigated prior to this work [35], but were less actively studied due to implementation difficulties [29,35,41]. To that end, the Neighborhood Attention Extension (NATTEN) [9] was created as an extension to PyTorch, with efficient CUDA kernels, which allow NA to run even faster than WSA while using less memory. ...
... These features are fed to the downstream ASR model. We use conformers [27] for acoustic modeling in the downstream ASR. The hyperparameters for the conformer model are listed in Table 3. ...
... The advantage of this is that the model gets a second chance to correct errors it might have missed during the first iteration. Lichtarge et al. (2019) thus proposed an iterative decoding algorithm that allows a model to make multiple incremental corrections. In each iteration, the model is allowed to generate a different output only if it has high confidence. ...
... However, the self-attention mechanism of the Transformer makes it good at capturing global information but weak at capturing local information (such as structural information) from SMILES sequences [22,23]. In summary, RNNs and Transformers tend to focus on different types of features during feature extraction [24], and each has its own tendencies in molecular property prediction. However, there has been little work integrating both models for molecular property prediction. ...
... These models were first proposed by Vaswani et al. (2017). (A TensorFlow implementation is available as part of the Tensor2Tensor package (Vaswani et al., 2018).) Transformers focus on learning context from an input stream (temporal sequence) and have become the dominant architecture in the NLP domain. ...
... Pre-trained language models have been shown to learn skills that can transfer to new modalities [35]; however, this will not be effective for task-specific skills such as a desired captioning style or learning the space of output labels. Several multi-modal, multi-task models have learned many tasks in different modalities simultaneously [22,31,34,62] and could thus potentially transfer skills between them, with HighMMT in particular showing positive results [31]. Our work studies the more challenging zero-shot setting (meaning no training data in the target modality is available) and therefore requires all the needed skills to be learned from a modality different from the one used in evaluation. ...
... This implicitly increases the difficulty of the modeling task and also adversely affects the performance of the resulting model. Fortunately, after the Transformer (Vaswani et al., 2017) was introduced, researchers discovered the great efficacy of the self-attention mechanism. Self-attention allows a neural network to focus on different subsets of the data in different situations and to make more reasonable and effective use of the available information. ...