About
28
Publications
44,994
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
94,372
Citations
Publications
Publications (28)
Over the past three years Pinterest has experimented with several visual search and recommendation systems, from enhancing existing products such as Related Pins (2014), to powering new products such as Similar Looks (2015), Flashlight (2016), and Lens (2017). This paper presents an overview of our visual discovery engine powering these services, a...
Over the past three years Pinterest has experimented with several visual search and recommendation services, including Related Pins (2014), Similar Looks (2015), Flashlight (2016) and Lens (2017). This paper presents an overview of our visual discovery engine powering these services, and shares the rationales behind our technical and product decisi...
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image...
The ability of the Generative Adversarial Networks (GANs) framework to learn generative models mapping from simple latent distributions to arbitrarily complex data distributions has been demonstrated empirically, with compelling results showing generators learn to "linearize semantics" in the latent space of such models. Intuitively, such latent sp...
We present an unsupervised visual feature learning algorithm driven by context-based pixel prediction. By analogy with auto-encoders, we propose Context Encoders -- a convolutional neural network trained to generate the contents of an arbitrary image region conditioned on its surroundings. In order to succeed at this task, context encoders need to...
Clearly explaining a rationale for a classification decision to an end-user can be as important as the decision itself. Existing approaches for deep visual recognition are generally opaque and do not output any justification text; contemporary vision-language models can describe image content but fail to take into account class-discriminative image...
Object detection performance, as measured on the canonical PASCAL VOC Challenge datasets, plateaued in the final years of the competition. The best-performing methods were complex ensemble systems that typically combined multiple low-level image features with high-level context. In this paper, we propose a simple and scalable detection algorithm th...
Convolutional Neural Networks spread through computer vision like a wildfire,
impacting almost all visual tasks imaginable. Despite this, few researchers
dare to train their models from scratch. Most work builds on one of a handful
of ImageNet pre-trained models, and fine-tunes or adapts these for specific
tasks. This is in large part due to the di...
We demonstrate that, with the availability of distributed computation
platforms such as Amazon Web Services and open-source tools, it is possible for
a small engineering team to build, launch and maintain a cost-effective,
large-scale visual search system with widely available tools. We also
demonstrate, through a comprehensive set of live experime...
Real-world videos often have complex dynamics; methods for generating
open-domain video descriptions should be senstive to temporal structure and
allow both input (sequence of frames) and output (sequence of words) of
variable length. To approach this problem we propose a novel end-to-end
sequence-to-sequence model to generate captions for videos....
Solving the visual symbol grounding problem has long been a goal of
artificial intelligence. The field appears to be advancing closer to this goal
with recent breakthroughs in deep learning for natural language grounding in
static images. In this paper, we propose to translate videos directly to
sentences using a unified deep neural network with bo...
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48%...
Models based on deep convolutional networks have dominated recent image
interpretation tasks; we investigate whether models which are also recurrent,
or "temporally deep", are effective for tasks involving sequences, visual and
otherwise. We develop a novel recurrent convolutional architecture suitable for
large-scale visual learning which is end-t...
-1We address the problem of visual domain adaptation for transferring object models from one dataset or visual domain to another. We introduce a unified flexible model for both supervised and semi-supervised learning that allows us to learn transformations between domains. Additionally, we present two instantiations of the model, one for general fe...
A major challenge in scaling object detection is the difficulty of obtaining
labeled images for large numbers of categories. Recently, deep convolutional
neural networks (CNN) have emerged as clear winners on object classification
benchmarks, in part due to training with 1.2M+ labeled classification images.
Unfortunately, only a small fraction of t...
Semantic part localization can facilitate fine-grained categorization by
explicitly isolating subtle appearance differences associated with specific
object parts. Methods for pose-normalized representations have been proposed,
but generally presume bounding box annotations at test time due to the
difficulty of object detection. We propose a model f...
Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models ef...
Dataset bias remains a significant barrier towards solving real world
computer vision tasks. Though deep convolutional networks have proven to be a
competitive approach for image classification, a question remains: have these
models have solved the dataset bias problem? In general, training or
fine-tuning a state-of-the-art deep model on a new doma...
Can a large convolutional neural network trained for whole-image
classification on ImageNet be coaxed into detecting objects in PASCAL? We show
that the answer is yes, and that the resulting system is simple, scalable, and
boosts mean average precision, relative to the venerable deformable part model,
by more than 40% (achieving a final mAP of 48%...
We evaluate whether features extracted from the activation of a deep
convolutional network trained in a fully supervised fashion on a large, fixed
set of object recognition tasks can be re-purposed to novel generic tasks. Our
generic tasks may differ significantly from the originally trained tasks and
there may be insufficient labeled or unlabeled...
We evaluate whether features extracted from the activation of a deep convolutional network trained in a fully supervised fashion on a large, fixed set of object recognition tasks can be re-purposed to novel generic tasks. Our generic tasks may differ significantly from the originally trained tasks and there may be insufficient labeled or unlabeled...
Images seen during test time are often not from the same distribution as
images used for learning. This problem, known as domain shift, occurs when
training classifiers from object-centric internet image databases and trying to
apply them directly to scene understanding tasks. The consequence is often
severe performance degradation and is one of th...
Most successful object classification and detection methods rely on classifiers trained on large labeled datasets. However, for domains where labels are limited, simply borrowing labeled data from existing datasets can hurt performance, a phenomenon known as "dataset bias." We propose a general framework for adapting classifiers from "borrowed" dat...
We present an algorithm that learns representations which explicitly
compensate for domain mismatch and which can be efficiently realized as linear
classifiers. Specifically, we form a linear transformation that maps features
from the target (test) domain to the source (training) domain as part of
training the classifier. We optimize both the trans...
Traditional supervised visual learning simply asks annotators “what” label an image should have. We propose an approach for image classification problems requiring subjective judgment that also asks “why”, and uses that information to enrich the learned model. We develop two forms of visual annotator rationales: in the first, the annotator highligh...