ABSTRACT: Leveraging the Manhattan assumption, we generate metrically rectified novel views from a single image, even for non-box scenarios. Our novel views enable already trained classifiers to handle views missing from the training data (blind spots) without additional training. We demonstrate this on end-to-end scene text spotting under perspective. Additionally, utilizing our fronto-parallel views, we discover invariant mid-level patches in an unsupervised manner given a few widely separated training examples (the small-data domain). These invariant patches outperform various baselines on a small-data image retrieval challenge.
ABSTRACT: Do we really need 3D labels in order to learn how to predict 3D? In this paper, we show that one can learn a
mapping from appearance to 3D properties without ever seeing a single explicit 3D label. Rather than use explicit supervision, we use the regularity of indoor scenes to learn the mapping in a completely unsupervised manner. We demonstrate this on both a standard 3D scene understanding dataset as well as Internet images for which 3D is unavailable, precluding supervised learning. Despite never seeing a 3D label, our method produces competitive results.
ABSTRACT: Robot perception is generally viewed as the interpretation of data from various types of sensors such as cameras. In this paper, we study indirect perception, where a robot can perceive new information by making inferences from non-visual observations of human teammates. As a proof-of-concept study, we focus on a door detection problem in a stealth mission setting, where a team operation must not be exposed to the visibility of the team's opponents. We use a special type of Noisy-OR model, known as the BN2O model, as a Bayesian inference network to represent inter-visibility and to infer the locations of the doors, i.e., potential locations of the opponents. Experimental results on both synthetic data and real person-tracking data achieve an F-measure of over 0.9 on average, suggesting further investigation of non-visual perception in human-robot team operations.
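The Noisy-OR combination rule at the heart of a BN2O (two-layer Noisy-OR) network can be sketched as follows; the function name, argument layout, and leak term here are illustrative assumptions, not the paper's implementation:

```python
def noisy_or(parent_states, link_probs, leak=0.0):
    """P(effect = 1 | parents) under the Noisy-OR model:
    1 - (1 - leak) * product over active parents of (1 - p_i)."""
    prob_all_links_fail = 1.0 - leak
    for active, p in zip(parent_states, link_probs):
        if active:
            prob_all_links_fail *= (1.0 - p)
    return 1.0 - prob_all_links_fail

# Two active causes with link strengths 0.8 and 0.5:
# 1 - (1 - 0.8) * (1 - 0.5) = 0.9
p = noisy_or([True, True, False], [0.8, 0.5, 0.9])
```

In a two-layer network of this kind, each observed variable is a Noisy-OR child of the hidden causes, which keeps inference over quantities such as door locations tractable.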
ABSTRACT: We propose a data-driven approach to estimate the likelihood that an image segment corresponds to a scene object (its “objectness”) by comparing it to a large collection of example object regions. We demonstrate that when the application domain is known, for example, in our case activities of daily living (ADL), we can capture the regularity of the domain-specific objects using millions of exemplar object regions. Our approach to estimating the objectness of an image region proceeds in two steps: 1) finding the exemplar regions that are most similar to the input image segment; 2) calculating the objectness of the image segment by combining segment properties, mutual consistency across the nearest exemplar regions, and the prior probability of each exemplar region. In previous work, parametric objectness models were built from a small number of manually annotated object regions; in contrast, our data-driven approach uses 5 million object regions along with their metadata. Results on multiple data sets demonstrate the advantage of our data-driven approach over existing model-based techniques. We also show the application of our approach in improving the performance of object discovery algorithms.
No preview · Article · Sep 2015 · IEEE Transactions on Pattern Analysis and Machine Intelligence
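The two-step estimation in the abstract above can be sketched as a nearest-exemplar weighted score; the feature space, the exponential similarity, and the prior weighting are illustrative assumptions rather than the paper's actual model:

```python
import numpy as np

def objectness_score(segment_feat, exemplar_feats, exemplar_priors, k=5):
    """Data-driven objectness sketch:
    1) retrieve the k exemplar regions nearest to the input segment;
    2) combine similarity to those neighbors with each exemplar's
       prior probability of being an object."""
    dists = np.linalg.norm(exemplar_feats - segment_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    sims = np.exp(-dists[nearest])  # turn distances into similarities
    return float(np.sum(sims * exemplar_priors[nearest]) / np.sum(sims))
```

At scale, the nearest-neighbor step would be served by an approximate index over the millions of exemplar regions; the weighted average then gives a score dominated by the closest, highest-prior exemplars.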
ABSTRACT: Health care providers typically rely on family caregivers (CG) of persons with dementia (PWD) to describe difficult behaviors manifested by their underlying disease. Although invaluable, such reports may be selective or biased during brief medical encounters. Our team explored the usability of a wearable camera system with 9 caregiving dyads (CGs: 3 males, 6 females, 67.00 ± 14.95 years; PWDs: 2 males, 7 females, 80.00 ± 3.81 years, MMSE 17.33 ± 8.86) who recorded 79 salient events over a combined total of 140 hours of data capture, from 3 to 7 days of wear per CG. Prior to using the system, CGs assessed its benefits to be worth the invasion of privacy; post-wear privacy concerns did not differ significantly. CGs rated the system easy to learn to use, although cumbersome and obtrusive. Few negative reactions by PWDs were reported or evident in resulting video. Our findings suggest that CGs can and will wear a camera system to reveal their daily caregiving challenges to health care providers.
No preview · Article · Jun 2015 · Journal of Healthcare Engineering
ABSTRACT: We present a semi-supervised approach that localizes multiple unknown object
instances in long videos. We start with a handful of labeled boxes and
iteratively learn and label hundreds of thousands of object instances. We
propose criteria for reliable object detection and tracking for constraining
the semi-supervised learning process and minimizing semantic drift. Our
approach does not assume exhaustive labeling of each object instance in any
single frame, or any explicit annotation of negative data. Working in such a generic setting allows us to tackle multiple object instances in video, many of
which are static. In contrast, existing approaches either do not consider
multiple object instances per video, or rely heavily on the motion of the
objects present. The experiments demonstrate the effectiveness of our approach
by evaluating the automatically labeled data on a variety of metrics like
quality, coverage (recall), diversity, and relevance to training an object detector.
ABSTRACT: Given a scene, what is going to move, and in what direction will it move?
Such a question could be considered a non-semantic form of action prediction.
In this work, we present predictive convolutional neural networks (P-CNN).
Given a static image, P-CNN predicts the future motion of each and every pixel
in the image in terms of optical flow. Our P-CNN model leverages the data in tens of thousands of realistic videos for training. Our method relies on
absolutely no human labeling and is able to predict motion based on the context
of the scene. Since P-CNNs make no assumptions about the underlying scene they
can predict future optical flow on a diverse set of scenarios. In terms of
quantitative performance, P-CNN outperforms all previous approaches by large margins.
ABSTRACT: Cameras provide a rich source of information while being passive, cheap and
lightweight for small and medium Unmanned Aerial Vehicles (UAVs). In this work
we present the first implementation of receding horizon control, which is
widely used in ground vehicles, with monocular vision as the only sensing mode
for autonomous UAV flight in dense clutter. We make it feasible on UAVs via a
number of contributions: a novel coupling of perception and control via multiple relevant and diverse interpretations of the scene around the robot, leveraging
recent advances in machine learning to showcase anytime budgeted cost-sensitive
feature selection, and fast non-linear regression for monocular depth
prediction. We empirically demonstrate the efficacy of our novel pipeline via
real-world experiments of more than 2 km through dense trees with a quadrotor
built from off-the-shelf parts. Moreover our pipeline is designed to combine
information from other modalities like stereo and lidar as well if available.
ABSTRACT: Robot teleoperation systems introduce a unique set of challenges including
latency, intermittency, and asymmetry in control inputs. User control with
Brain-Computer Interfaces (BCIs) exacerbates these problems through especially
noisy and even erratic low-dimensional motion commands due to the difficulty in
decoding neural activity. We introduce a general framework to address these
challenges through a combination of Machine Vision, User Intent Inference, and
Human-Robot Autonomy Control Arbitration. Adjustable levels of assistance allow
the system to balance the operator's capabilities and feelings of comfort and
control while compensating for a task's difficulty. We present experimental
results demonstrating significant performance improvement using the
shared-control assistance framework on adapted rehabilitation benchmarks with
two subjects implanted with intracortical brain-computer interfaces controlling
a high degree-of-freedom robotic manipulator as a prosthetic. Our results
further indicate shared assistance mitigates perceived user difficulty and even
enables successful performance on previously infeasible tasks. We showcase the
extensibility of our architecture with applications to quality-of-life tasks
such as opening a door with a BCI, pouring liquids from a container with a
dual-joystick game controller, and manipulation in dense clutter with a 6-DoF robotic manipulator.
ABSTRACT: Robots are increasingly becoming key players in human-robot teams. To become effective teammates, robots must possess a profound understanding of an environment, be able to reason about desired commands and goals within a specific context, and be able to communicate with human teammates in a clear and natural way. To address these challenges, we have developed an intelligence architecture that combines cognitive components to carry out high-level cognitive tasks, semantic perception to label regions in the world, and a natural language component to reason about the command and its relationship to the objects in the world. This paper describes recent developments using this architecture on a fielded mobile robot platform operating in unknown urban environments. We report a summary of extensive outdoor experiments; the results suggest that a multidisciplinary approach to robotics has the potential to create competent human-robot teams.
ABSTRACT: We consider detecting objects in an image by iteratively selecting from a set
of arbitrarily shaped candidate regions. Our generic approach, which we term
visual chunking, reasons about the locations of multiple object instances in an
image while expressively describing object boundaries. We design an
optimization criterion for measuring the performance of a list of such
detections as a natural extension to a common per-instance metric. We present
an efficient algorithm with provable performance for building a high-quality
list of detections from any candidate set of region-based proposals. We also
develop a simple class-specific algorithm to generate a candidate region
instance in near-linear time in the number of low-level superpixels that
outperforms other region generating methods. In order to make predictions on
novel images at testing time without access to ground truth, we develop
learning approaches to emulate these algorithms' behaviors. We demonstrate that
our new approach outperforms sophisticated baselines on benchmark datasets.
Preview · Article · Oct 2014 · Proceedings - IEEE International Conference on Robotics and Automation
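The list-building step of the abstract above can be sketched as a greedy procedure that repeatedly appends the candidate with the largest marginal gain under the list metric; the callback-based interface is an assumed simplification, not the paper's algorithm:

```python
def build_detection_list(candidates, gain, max_len):
    """Greedy list-building sketch: at each step, append the candidate
    region with the largest marginal gain given what is already in the
    list. `gain(selected, c)` is an assumed callback scoring candidate
    c relative to the current list."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < max_len:
        best = max(remaining, key=lambda c: gain(selected, c))
        if gain(selected, best) <= 0:
            break  # no candidate improves the list metric
        selected.append(best)
        remaining.remove(best)
    return selected
```

Greedy construction of this kind is the standard way to obtain provable approximation guarantees when the list metric has diminishing returns, which matches the abstract's claim of an efficient algorithm with provable performance.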
ABSTRACT: We propose a regularized linear learning algorithm to sequence groups of
features, where each group incurs test-time cost or computation. Specifically,
we develop a simple extension to Orthogonal Matching Pursuit (OMP) that
respects the structure of groups of features with variable costs, and we prove
that it achieves near-optimal anytime linear prediction at each budget
threshold where a new group is selected. Our algorithm and analysis extend to
generalized linear models with multi-dimensional responses. We demonstrate the
scalability of the resulting approach on large real-world data-sets with many
feature groups associated with test-time computational costs. Our method
improves over Group Lasso and Group OMP in the anytime performance of linear
predictions, measured in timeliness, an anytime prediction performance metric,
while providing rigorous performance guarantees.
ABSTRACT: State-of-the-art approaches for articulated human pose estimation are rooted in parts-based graphical models. These models are often restricted to tree-structured representations and simple parametric potentials in order to enable tractable inference. However, these simple dependencies fail to capture all the interactions between body parts. While models with more complex interactions can be defined, learning the parameters of these models remains challenging with intractable or approximate inference. In this paper, instead of performing inference on a learned graphical model, we build upon the inference machine framework and present a method for articulated human pose estimation. Our approach incorporates rich spatial interactions among multiple parts and information across parts of different scales. Additionally, the modular framework of our approach enables both ease of implementation without specialized optimization solvers, and efficient inference. We analyze our approach on two challenging datasets with large pose variation and outperform the state-of-the-art on these benchmarks.
ABSTRACT: In the task of activity recognition in videos, computing the video representation often involves pooling feature vectors over spatially local neighborhoods. The pooling is done over the entire video, over coarse spatio-temporal pyramids, or over pre-determined rigid cuboids. Similarly to pooling image features over superpixels in images, it is natural to consider pooling spatio-temporal features over video segments, e.g., supervoxels. However, since the number of segments is variable, this produces a video representation of variable size. We propose Motion Words - a new, fixed size video representation, where we pool features over supervoxels. To segment the video into supervoxels, we explore two recent video segmentation algorithms. The proposed representation enables localization of common regions across videos in both space and time. Importantly, since the video segments are meaningful regions, we can interpret the proposed features and obtain a better understanding of why two videos are similar. Evaluation on classification and retrieval tasks on two datasets further shows that Motion Words achieves state-of-the-art performance.
ABSTRACT: In this paper we present a conceptually simple but surprisingly powerful method for visual prediction which combines the effectiveness of mid-level visual elements with temporal modeling. Our framework can be learned in a completely unsupervised manner from a large collection of videos. More importantly, because our approach builds the prediction framework on these mid-level elements, we can not only predict the possible motion in the scene but also predict visual appearances - how appearances are going to change over time. This yields a visual 'hallucination' of probable events on top of the scene. We show that our method is able to accurately predict and visualize simple future events, and that our approach is comparable to supervised methods for event prediction.
ABSTRACT: We consider the problem of discovering discriminative exemplars suitable for object detection. Due to the diversity in appearance of real-world objects, an object detector must capture variations in scale, viewpoint, illumination, etc. Current approaches do this by using mixtures of models, where each mixture is designed to capture one (or a few) axes of variation. Current methods usually rely on heuristics to capture these variations; however, it is unclear which axes of variation exist and are relevant to a particular task. Another issue is the requirement of a large set of training images to capture such variations. Current methods do not scale to large training sets, either because of training-time complexity or test-time complexity. In this work, we explore the idea of compactly capturing task-appropriate variation from the data itself. We propose a two-stage data-driven process, which selects and learns a compact set of exemplar models for object detection. These selected models have an inherent ranking, which can be used for anytime/budgeted detection scenarios. Another benefit of our approach (beyond the computational speedup) is that the selected set of exemplar models performs better than the entire set.
ABSTRACT: This paper presents a fast and efficient computational approach to higher-order spectral graph matching. Exploiting the redundancy in a tensor representing the affinity between feature points, we approximate the affinity tensor with a linear combination of Kronecker products between bases and index tensors. The bases and index tensors are highly compressed representations of the approximated affinity tensor, requiring much less memory than previous methods, which store the full affinity tensor. We compute the principal eigenvector of the approximated affinity tensor using the small bases and index tensors without explicitly storing the approximated tensor. To compensate for the loss of matching accuracy due to the approximation, we also adopt and incorporate a marginalization scheme that maps a higher-order tensor to a matrix, as well as a one-to-one mapping constraint, into the eigenvector computation process. The experimental results show that the proposed method is faster and requires less memory than existing methods, with little or no loss of accuracy.
No preview · Article · Mar 2014 · IEEE Transactions on Software Engineering
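The eigenvector computation at the core of spectral matching can be sketched with plain power iteration; the Kronecker-product compression and marginalization scheme of the paper above are omitted here, so this sketch operates on an explicit (already marginalized) affinity matrix:

```python
import numpy as np

def principal_eigenvector(A, iters=100, tol=1e-9):
    """Power iteration sketch for the principal eigenvector of a
    (marginalized) affinity matrix, as used in spectral matching.
    Repeatedly applies A and renormalizes until convergence."""
    v = np.ones(A.shape[0]) / np.sqrt(A.shape[0])
    for _ in range(iters):
        v_new = A @ v
        v_new /= np.linalg.norm(v_new)
        if np.linalg.norm(v_new - v) < tol:
            return v_new
        v = v_new
    return v
```

The point of the paper's compressed representation is that the matrix-vector product inside this loop can be evaluated from the small bases and index tensors without ever materializing the full affinity tensor.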
ABSTRACT: We present Marvin, a system that can search physical objects using a mobile or wearable device. It integrates HOG-based object recognition, SURF-based localization information, automatic speech recognition, and user feedback information with a probabilistic model to recognize the “object of interest” at high accuracy and at interactive speeds. Once the object of interest is recognized, the information that the user is querying, e.g. reviews, options, etc., is displayed on the user's mobile or wearable device. We tested this prototype in a real-world retail store during business hours, with varying degrees of background noise and clutter. We show that this multi-modal approach achieves superior recognition accuracy compared to using a vision system alone, especially in cluttered scenes where a vision system would be unable to distinguish which object is of interest to the user without additional input. It is computationally able to scale to large numbers of objects by focusing compute-intensive resources on the objects most likely to be of interest, inferred from user speech and implicit localization information. We present the system architecture, the probabilistic model that integrates the multi-modal information, and empirical results showing the benefits of multi-modal integration.
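A naive-Bayes-style sketch of how per-modality scores might be fused into a posterior over the object of interest; the function and the multiplicative fusion rule are illustrative assumptions about such a probabilistic model, not Marvin's actual formulation:

```python
def fuse_modalities(vision_scores, speech_scores, loc_scores):
    """Multiply per-object likelihoods from each modality and
    renormalize; a modality with no opinion on an object
    contributes a neutral factor of 1.0."""
    fused = {obj: score
                  * speech_scores.get(obj, 1.0)
                  * loc_scores.get(obj, 1.0)
             for obj, score in vision_scores.items()}
    total = sum(fused.values())
    return {obj: s / total for obj, s in fused.items()}

# Vision is ambiguous between 'a' and 'b'; speech favors 'a'.
posterior = fuse_modalities({'a': 0.6, 'b': 0.4}, {'a': 0.9}, {})
```

Fusion of this form captures the abstract's key behavior: when vision alone is ambiguous in clutter, speech and localization evidence shift the posterior toward the intended object.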
ABSTRACT: Our long-term goal is to develop a general solution to the Lifelong Robotic Object Discovery (LROD) problem: to discover new objects in the environment while the robot operates, for as long as the robot operates. In this paper, we consider the first step towards LROD: we automatically process the raw data stream of an entire workday of a robotic agent to discover objects.
Our key contribution to achieve this goal is to incorporate domain knowledge—robotic metadata—in the discovery process, in addition to visual data. We propose a general graph-based formulation for LROD in which generic domain knowledge is encoded as constraints. To make long-term object discovery feasible, we encode into our formulation the natural constraints and non-visual sensory information in service robotics. A key advantage of our generic formulation is that we can add, modify, or remove sources of domain knowledge dynamically, as they become available or as conditions change.
In our experiments, we show that by adding domain knowledge we discover 2.7x more objects and decrease processing time 190 times. With our optimized implementation, HerbDisc, we show for the first time a system that processes a video stream of 6 h 20 min of continuous exploration in cluttered human environments (over half a million images) in 18 min 34 s, discovering 206 new objects with their 3D models.
Full-text · Article · Jan 2014 · The International Journal of Robotics Research