
Andreas Veit- Researcher at Google Inc.
Andreas Veit
- Researcher at Google Inc.
About
48
Publications
39,344
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
4,835
Citations
Introduction
Skills and Expertise
Current institution
Additional affiliations
August 2014 - present
Publications
Publications (48)
Latent Diffusion Models (LDMs) produce high-quality, photo-realistic images, however, the latency incurred by multiple costly inference iterations can restrict their applicability. We introduce LatentCRF, a continuous Conditional Random Field (CRF) model, implemented as a neural network layer, that models the spatial and semantic relationships amon...
Cross-Encoder (CE) and Dual-Encoder (DE) models are two fundamental approaches for query-document relevance in information retrieval. To predict relevance, CE models use joint query-document embeddings, while DE models maintain factorized query and document embeddings; usually, the former has higher quality while the latter benefits from lower late...
Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However, this quality comes at significant computational cost: nearly all of these models are iterative and require running inference multiple times with large models. This iterative process is needed to ensure that diff...
Large language models (LLMs) have led to a series of breakthroughs in natural language processing (NLP), owing to their excellent understanding and generation abilities. Remarkably, what further sets these models apart is the massive amounts of world knowledge they internalize during pretraining. While many downstream applications provide the model...
The remarkable performance gains realized by large pretrained models, e.g., GPT-3, hinge on the massive amounts of data they are exposed to during training. Analogously, distilling such large models to compact models for efficient deployment also necessitates a large amount of (labeled or unlabeled) training data. In this paper, we propose the teac...
Pairwise dot product-based attention allows Transformers to exchange information between tokens in an input-dependent way, and is key to their success across diverse applications in language and vision. However, a typical Transformer model computes such pairwise attention scores repeatedly for the same sequence, in multiple heads in multiple layers...
State-of-the-art transformer models use pairwise dot-product based self-attention, which comes at a computational cost quadratic in the input sequence length. In this paper, we investigate the global structure of attention scores computed using this dot product mechanism on a typical distribution of inputs, and study the principal components of the...
Deep Convolutional Neural Networks (CNNs) have long been the architecture of choice for computer vision tasks. Recently, Transformer-based architectures like Vision Transformer (ViT) have matched or even surpassed ResNets for image classification. However, details of the Transformer architecture -- such as the use of non-overlapping patches -- lead...
Standard training techniques for neural networks involve multiple sources of randomness, e.g., initialization, mini-batch ordering and in some cases data augmentation. Given that neural networks are heavily over-parameterized in practice, such randomness can cause {\em churn} -- for the same input, disagreements between predictions of the two model...
Modern image retrieval systems increasingly rely on the use of deep neural networks to learn embedding spaces in which distance encodes the relevance between a given query and image. In this setting, existing approaches tend to emphasize one of two properties. Triplet-based methods capture top-$k$ relevancy, where all top-$k$ scoring documents are...
The label shift problem refers to the supervised learning setting where the train and test label distributions do not match. Existing work addressing label shift usually assumes access to an \emph{unlabelled} test sample. This sample may be used to estimate the test label distribution, and to then train a suitably re-weighted classifier. While appr...
Real-world classification problems typically exhibit an imbalanced or long-tailed label distribution, wherein many labels are associated with only a few samples. This poses a challenge for generalisation on such labels, and also makes na\"ive learning biased towards dominant labels. In this paper, we present two simple modifications of standard sof...
Modern retrieval problems are characterised by training sets with potentially billions of labels, and heterogeneous data distributions across subpopulations (e.g., users of a retrieval system may be from different countries), each of which poses a challenge. The first challenge concerns scalability: with a large number of labels, standard losses ar...
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences? Currently, a network would first need to execute sometimes hundreds of intermediate layers that specialize in unrelated aspects....
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD across important tasks, such as attention models. The settings under which SGD performs poorly in comparison to Adam are not well understood yet. In this paper, we provide empirical and theoretica...
Research experiences today are limited to a privileged few at select universities. Providing open access to research experiences would enable global upward mobility and increased diversity in the scientific workforce. How can we coordinate a crowd of diverse volunteers on open-ended research? How could a PI have enough visibility into each person’s...
Do convolutional networks really need a fixed feed-forward structure? What if, after identifying the high-level concept of an image, a network could move directly to a layer that can distinguish fine-grained differences? Currently, a network would first need to execute sometimes hundreds of intermediate layers that specialize in unrelated aspects....
Federated learning enables multiple participants to jointly construct a deep learning model without sharing their private training data with each other. For example, multiple smartphones can jointly train a predictive keyboard model without revealing what individual users type into their phones. We demonstrate that any participant in federated lear...
Semantic segmentation is a challenging vision problem that usually necessitates the collection of large amounts of finely annotated data, which is often quite expensive to obtain. Coarsely annotated data provides an interesting alternative as it is usually substantially more cheap. In this work, we present a method to leverage coarsely annotated da...
Evaluation metrics for image captioning face two challenges. Firstly, commonly used metrics such as CIDEr, METEOR, ROUGE and BLEU often do not correlate well with human judgments. Secondly, each metric has well known blind spots to pathological caption constructions, and rule-based metrics lack provisions to repair such blind spots once identified....
Do convolutional networks really need a fixed feed-forward structure? Often, a neural network is already confident after a few layers about the high-level concept shown in the image. However, due to the fixed network structure, all remaining layers still need to be evaluated. What if the network could jump right to a layer that is specialized in fi...
The variety, abundance, and structured nature of hashtags make them an interesting data source for training vision models. For instance, hashtags have the potential to significantly reduce the problem of manual supervision and annotation when learning vision models for a large number of concepts. However, a key challenge when learning from hashtags...
Research experiences today are limited to a privileged few at select universities. Providing open access to research experiences would enable global upward mobility and increased diversity in the scientific workforce. How can we coordinate a crowd of diverse volunteers on open-ended research? How could a PI have enough visibility into each person's...
Deep neural networks trained on large supervised datasets have led to impressive results in recent years. However, since well-annotated datasets can be prohibitively expensive and time-consuming to collect, recent work has explored the use of larger but noisy datasets that can be more easily obtained. In this paper, we investigate the behavior of d...
We present an approach to effectively use millions of images with noisy annotations in conjunction with a small subset of cleanly-annotated images to learn powerful image representations. One common approach to combine clean and noisy data is to first pre-train a network using the large noisy dataset and then fine-tune with the clean dataset. We sh...
In this work, we introduce a novel interpretation of residual networks showing they are exponential ensembles. This observation is supported by a large-scale lesion study that demonstrates they behave just like ensembles at test time. Subsequently, we perform an analysis showing these ensembles mostly consist of networks that are each relatively sh...
In typical perceptual tasks, higher-order concepts are inferred from visual features to assist with perceptual decision making. However, there is a multitude of visual concepts which can be inferred from a single stimulus. When learning nonlinear embeddings with siamese or triplet networks from similarities, we typically assume they are sourced fro...
This paper describes the COCO-Text dataset. In recent years large-scale datasets like SUN and Imagenet drove the advancement of scene understanding and object recognition. The goal of COCO-Text is to advance state-of-the-art in text detection and recognition in natural images. The dataset is based on the MS COCO dataset, which contains images of co...
One key challenge in creating a sustainable society is to make consumer demand adaptive to the supply of electricity. We propose a generalization for the multiagent coordination algorithm for partially-centralized demand side management. In this setting, a central unit buys the electricity for the whole group, while the individual agents make their...
A problem receiving increasing attention is autonomously adjusting the electricity demand of consumers in response to real-time supply. The current literature mostly operates under the assumption that it is desirable to have an autonomous energy management system. However, autonomous demand-side management requires that smart agents understand thei...
With the rapid proliferation of smart mobile devices, users now take millions
of photos every day. These include large numbers of clothing and accessory
images. We would like to answer questions like `What outfit goes well with this
pair of shoes?' To answer these types of questions, one has to go beyond
learning visual similarity and learn a visua...
When crowdsourcing systems are used in combination with machine inference
systems in the real world, they benefit the most when the machine system is
deeply integrated with the crowd workers. However, if researchers wish to
integrate the crowd with "off-the-shelf" machine classifiers, this deep
integration is not always possible. This work explores...
With the rapid proliferation of smart mobile devices, users now take millions of photos every day. These include large numbers of clothing and accessory images. We would like to answer questions like `What outfit goes well with this pair of shoes?' To answer these types of questions, one has to go beyond learning visual similarity and learn a visua...
A key challenge in creating a sustainable and energy-efficient society is to make consumer demand adaptive to the supply of energy, especially to the renewable supply. In this article, we propose a partially-centralized organization of consumers (or agents), namely, a consumer cooperative that purchases electricity from the market. In the cooperati...
The increasing use of renewable energy sources with variable output, such as
solar photovoltaic and wind power generation, calls for Smart Grids that
effectively manage flexible loads and energy storage. The ability to forecast
consumption at different locations in distribution systems will be a key
capability of Smart Grids. The goal of this paper...
A key challenge to create a sustainable and energy-efficient society is in making consumer demand adaptive to energy supply, especially renewable supply. In this paper, we propose a partially-centralized organization of consumers, namely, a consumer cooperative for purchasing electricity from the market. We propose a novel multiagent coordination a...
A key challenge to create a sustainable and energy-efficient society is in making consumer demand adaptive to energy supply, especially renewable supply. In this paper, we propose a partially-centralized organization of consumers, namely, a consumer cooperative for purchasing electricity from the market. We propose a novel multiagent coordination a...