About
Publications: 19
Reads: 77,121
Citations: 121,209
Publications (19)
The main contribution of this paper is an approach for introducing additional context into state-of-the-art general object detection. To achieve this we first combine a state-of-the-art classifier (Residual-101[14]) with a fast detection framework (SSD[18]). We then augment SSD+Residual-101 with deconvolution layers to introduce additional large-sc...
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of default boxes over different aspect ratios and scales per feature map location. At prediction time, the network generates scores for the presence of each object category in eac...
For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose esti...
We present a technique for adding global context to deep convolutional networks for semantic segmentation. The approach is simple, using the average feature for a layer to augment the features at each location. In addition, we study several idiosyncrasies of training, significantly increasing the performance of baseline networks (e.g. from FCN). Wh...
We have seen remarkable recent progress in computational visual recognition, producing systems that can classify objects into thousands of different categories with increasing accuracy. However, one question that has received relatively less attention is "what labels should recognition systems output?" This paper looks at the problem of predicting...
We present a method for detecting objects in images using a single deep neural network. Our approach, named SSD, discretizes the output space of bounding boxes into a set of bounding box priors over different aspect ratios and scales per feature map location. At prediction time, the network generates confidences that each prior corresponds to objec...
We present a technique for adding global context to deep convolutional networks for semantic segmentation. The approach is simple, using the average feature for a layer to augment the features at each location. In addition, we study several idiosyncrasies of training, significantly increasing the performance of baseline networks (e.g. from FCN). Wh...
Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition...
We study Refer-to-as relations as a new type of semantic knowledge. Compared to the much studied Is-a relation, which concerns factual taxonomy knowledge, Refer-to-as relations aim to address pragmatic semantic knowledge. For example, a “penguin” is a “bird” from a taxonomy point of view, but people rarely refer to a “penguin” as a “bird” in vernacular u...
We study Refer-to-as relations as a new type of semantic knowledge. Compared to the much studied Is-a relation, which concerns factual taxonomic knowledge, Refer-to-as relations aim to address pragmatic semantic knowledge. For example, a “penguin” is a “bird” from a taxonomic point of view, but people rarely refer to a “penguin” as a “bird” in vern...
We propose a deep convolutional neural network architecture codenamed "Inception", which was responsible for setting the new state of the art for classification and detection in the ImageNet Large-Scale Visual Recognition Challenge 2014 (ILSVRC 2014). The main hallmark of this architecture is the improved utilization of the computing resources insi...
Multimedia Event Detection (MED) is a multimedia retrieval task with the goal of finding videos of a particular event in video archives, given example videos and event descriptions; different from MED, multimedia classification is a task that classifies given videos into specified classes. Both tasks require mining features of example videos to lear...
Multimedia Event Detection is a multimedia retrieval task with the goal of finding videos of a particular event in an internet video archive, given example videos and descriptions. We focus here on mining features of example videos to learn the most characteristic features, which requires a combination of multiple complementary types of features. G...
The Informedia group participated in four tasks this year: Semantic indexing, Known-item search, Surveillance event detection, and the Event detection in Internet multimedia pilot. For semantic indexing, in addition to training traditional SVM classifiers for each high-level feature using different low-level features, a kind of cascade classif...
The Bag of Words model has been widely used in the task of object categorization, and SIFT, computed for local interest regions, has been extracted from the image as the representative feature, which can provide robustness and invariance to many kinds of image transformation. Even so, these features capture only local information, while being blind to t...