Scene Recognition for Mobile Robots by Relational Object Search using
Next-Best-View Estimates from Hierarchical Implicit Shape Models
Pascal Meißner, Ralf Schleicher, Robin Hutmacher, Sven R. Schmidt-Rohr and Rüdiger Dillmann
Abstract: We present an approach for recognizing indoor scenes in object constellations that require object search by a mobile robot because they cannot be captured from a single viewpoint. In our approach, which we call Active Scene Recognition (ASR), robots predict object poses from learnt spatial relations and combine these predictions with their estimates about present scenes. Our models for estimating scenes and predicting poses are Implicit Shape Model (ISM) trees from prior work [1]. ISMs model scenes as sets of objects with spatial relations in-between and are learnt from observations. In prior work [2], we presented a realization of ASR that was limited to choosing orientations for a fixed robot head and searched objects based on positions alone, ignoring their types. In this paper, we introduce an integrated system that extends ASR to selecting positions and orientations of camera views for a mobile robot with a pivoting head. We contribute an approach for Next-Best-View estimation in object search on predicted object poses. It is defined on 6 DoF viewing frustums and optimizes the searched view, together with the objects to be searched in it, based on 6 DoF pose predictions. To prevent combinatorial explosion when searching camera pose space, we introduce a hierarchical approach that samples robot positions with increasing resolution.
I. INTRODUCTION
Programming by Demonstration (PbD) is a paradigm that aims to allow non-experts to teach robots manipulation tasks. Given a taught task description, a robot needs a notion of the scene in which it is situated to decide whether it is adequate to begin that task. In general, estimating isolated locations of present objects is not sufficient, since objects may have different usages in different scenes. Not only the occurrence of objects, but also the spatial relations between them have to be considered. To address this issue, we presented hierarchical Implicit Shape Models [1] that recognize indoor scenes in object constellations. They model scenes as sets of objects, including the spatial relations between them as six Degree-of-Freedom (DoF) coordinate transforms. Relations are learnt entirely from demonstrations.
Objects in indoor scenes may be distributed or occluded. Object configurations often have to be perceived from several viewpoints before a scene can be recognized. Without prior knowledge of which perspectives have to be checked for which objects, the 6 DoF space of camera viewpoints has to be browsed by uninformed search¹. In a realistic scenario, such an approach is infeasible due to combinatorial explosion.
All authors are with the Institute of Anthropomatics, Karlsruhe Institute of Technology, 76131 Karlsruhe, Germany. pascal.meissner@kit.edu
¹One approach for a mobile robot with a pivoting camera is to discretize the space of robot positions to a grid with a given resolution. At each position, the camera has to be rotated to a set of views lying on a tessellated sphere, and at each view, localization has to be performed for all searched objects.
In [2], we presented a method that, given a partially recognized scene, predicts poses for the remaining objects, not yet found, with the same relations that are used for recognition. Jointly
recognizing scenes and predicting poses with the help of
relations makes it possible to identify object constellations independent of their absolute locations. We call this concept of alternating scene recognition and object search Active Scene Recognition (ASR). In prior work [2], we demonstrated it with a simplified approach for a fixed robot head: camera views were only optimized regarding their orientations for a single, given view position. In this paper, we introduce a new integrated system that implements the ASR concept as a hierarchical state machine to control a mobile robot instead of a fixed robot head. For this state machine, we contribute a method that estimates Next-Best-Views (NBVs) on clouds of predicted object poses. Both position and orientation of the pivoting head of a mobile robot are optimized in terms of utility and costs. Beyond prior work [2], we introduce a new utility function and an algorithm for generating NBV candidates on a 4 DoF space of discrete combinations of robot positions and sensor orientations. Since uniform sampling of the robot position space is infeasible for the resolutions required in our scenario, we contribute a hierarchical strategy to reduce the number of searched positions: We start an iterative process by searching a view at a coarse resolution. Next, we increase the resolution only in the proximity of the result from the prior iteration and restart searching until we converge to an NBV at a sufficient resolution. Instead of searching all missing objects in NBVs [2], we estimate an optimal set of objects per NBV. When considering only how many predicted object positions fall into a view [2] during NBV estimation, we noticed many detection errors with real object localizers. We address this with a utility concept of three measures, each of which models an aspect that relates the 6 DoF pose of an object in a view to the confidence of detecting it correctly.
II. RELATED WORK
Methods in indoor scene understanding, such as [3], often model scenes using Qualitative Spatial Relations (QSRs). Being a symbolic abstraction, QSRs generalize well over object relations, but they lack the detailed description of relative object locations that we require. In contrast, the Implicit Shape Model (ISM) [4] is a non-parametric method from object recognition that models object part configurations of arbitrary complexity. In [1], we extended ISMs for scene recognition so that they process 6 DoF poses instead of 3D data, and we introduced the hierarchical ISM, a set of connected ISMs, including learning and inference methods.
Fig. 1: 1: Robot looking for two objects in scenario C2, see Sec. VII. 2,3: State diagrams for the presented ASR system, see Sec. VI. 2: Highest level of the ASR system. The grey box stands for the object search subsystem, whose substates are in 3, starting at the top right. Outcomes (red boxes in 3) are transitions in 2.
Fig. 2: [1] 1: Trajectories & relations in a single ISM, in different colors per object. 2: False positive recognition by the ISM in 1. 4: Trajectories & relations for the tree of ISMs shown in 3. One ISM relates cup & plate, the other ISM relates box & the reference of the first ISM (plate). 5: Correct recognition by the ISM tree in 4.
In the area of object search, most methods optimize the
next view to be reached, called the Next-Best-View (NBV). Like [5], approaches in NBV planning are mainly concerned with object model reconstruction. Their optimization criteria for choosing views address the characteristics of that application, e.g. by maximizing the unknown space within a view instead of rating sets of predicted object poses as we do. NBV candidates are generated by uniform sampling in large numbers instead of by more advanced generation schemes with fewer views to evaluate. In contrast, NBV planning in our application is required to produce views of constant quality in real time within a search loop. One of the first approaches employing NBV planning for object search was [6]. In this work, the utility of views is derived from a distribution of absolute positions of a single object. In our approach, we instead search multiple objects at once, using their relative 6 DoF poses. [7] combines predefined absolute locations of objects with QSRs to calculate object position predictions. QSRs are applied without differentiating among scenes. Object search is performed by extracting 2D views from the robot pose space. For the best pose, the camera orientation is optimized in 3D afterwards. Not considering the distribution of predictions in the vertical direction during the 2D search might lead to suboptimal views. Calculation of view utilities is based only on the density of position predictions.
III. SCENE-RELATED DEFINITIONS
We model a scene S = ({o}, {R}) as a set of objects {o} together with a set of spatial relations {R} between the objects. Recognizing a scene in a constellation of objects amounts to rating the similarity between a set of input objects {i}, located at absolute poses {T}, and the model learnt for the scene. An exemplary input constellation is shown in 1 in Fig. 1. Scene models, in particular relations, are extracted from demonstrations in which trajectories of objects are recorded over time. Trajectories consist of sequences of absolute 6 DoF poses T, shown as thick lines in 1 in Fig. 2. Poses T are estimated for objects by visual localization [8]. Spatial relations are given as sets of relative poses {T_jk} between pairs of objects (o_j, o_k). In 1 in Fig. 2, the relative poses in relations are depicted by arrows. They make up two spatial relations, each between the box and one moving object.
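The spatial relations above can be illustrated with a small sketch: for each point in time, the relative pose between two objects is obtained by composing the inverse of one absolute pose with the other. Homogeneous 4x4 transforms as numpy arrays are assumed; names are illustrative.

import numpy as np

def relative_poses(traj_j, traj_k):
    # Relation {T_jk} between objects o_j and o_k from their demonstrated
    # trajectories, given as lists of 4x4 absolute poses in the world frame:
    # T_jk(t) = T_j(t)^-1 * T_k(t), the pose of o_k expressed in the frame of o_j.
    return [np.linalg.inv(T_j) @ T_k for T_j, T_k in zip(traj_j, traj_k)]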
IV. PRIOR WORK ON IMPLICIT SHAPE MODELS
A. Passive scene recognition with a single ISM
In [1], we introduced scene model learning and recognition
with ISMs for scenes as defined in Sec. III. From all n² spatial relations among a given set {o} of n objects, a single ISM can only model the relations from each object (circles in 3 in Fig. 2) towards a common reference (boxes in 3 in Fig. 2) with pose T_F. We define an object as equal to the reference according to a heuristic in [1]. We call it the reference object o_F. In 1 in Fig. 2, an ISM only models relations towards a box, which is the scene reference. Learning an ISM is equal to transforming the absolute poses T(t) in every demonstrated trajectory into relative poses T_oF(t) and T_Fo(t) for each point in time t. T_oF(t) encodes how the reference object o_F is located relative to object o, with T_Fo(t) being the inverse relationship. Recognition of a scene S on a configuration of input objects {i} at poses {T} is a voting process. Each object in {i} that is part of the scene votes where it expects absolute poses T_F ← T · T_iF of the scene reference by using the relative reference poses T_iF in its relation. The reference object votes on its own pose if it is present in {i}. An instance of the scene, i.e. a recognition result, is expected where a maximum of votes from different objects coincide.
Fig. 3: Demonstrations for scene model learning in C1 in 1 & in C2 in 3. Recording object trajectories in 1 & moving the robot towards a training view in 3. Objects of S1 in the yellow circle in 1, those of S2 resp. S3 in the blue resp. green circles in 2. Resulting ISM tree for S1 in 2, for S2 in 4 & for S3 in 5.
Votes are collected in a voxel grid on R³, which enables detecting maxima
by counting votes per voxel. The edge length of the voxels, called sensitivity, determines how much votes are allowed to differ within a recognition result, thereby adjusting how much objects in a scene are allowed to deviate from the relations.
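The voting scheme of a single ISM can be sketched as follows; 4x4 poses as numpy arrays are assumed, and the verification steps of [1] are omitted.

import numpy as np
from collections import defaultdict

def recognize_with_single_ism(input_objects, relation, sensitivity):
    # input_objects maps object names to absolute 4x4 poses T; relation maps
    # object names to the lists of relative reference poses T_iF learnt for them.
    votes = defaultdict(list)                # voxel index -> list of (object name, T_F)
    for name, T in input_objects.items():
        for T_iF in relation.get(name, []):
            T_F = T @ T_iF                   # expected absolute pose of the scene reference
            voxel = tuple(np.floor(T_F[:3, 3] / sensitivity).astype(int))
            votes[voxel].append((name, T_F))
    if not votes:
        return []
    # A scene instance is expected where votes of the most different objects coincide.
    best_voxel = max(votes, key=lambda v: len({name for name, _ in votes[v]}))
    return votes[best_voxel]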
B. Passive scene recognition with a tree of ISMs
2 in Fig. 2 shows a false positive recognition result of an ISM: the violation of the relation between the objects cup and plate, which are not the reference, is not detected². To be able to also represent relations among non-reference objects in scenes, we introduced trees {m} of ISMs, which we call hierarchical Implicit Shape Models and which consist of connected ISMs. Each spatial relation in a scene that is to be modeled is represented by a separate ISM in the tree. In 4 in Fig. 2, the relations between cup and plate and between plate and box are modeled by two ISMs. ISMs at adjacent levels in the tree are connected by objects, like breakfast_sub0 in 3 in Fig. 2, which appear as reference in the ISM at level n+1 and as regular object in the ISM at level n. Voting results from an ISM at level n+1 are input at level n. The ISM at the tree root returns the recognition results of the entire tree. In the tree in 5 in Fig. 2, voting from cup towards plate produces a result at level 2, which is used at level 1 to relate the reference of level 2 to the box. The plate is excluded from recognition. Recognition with ISM trees and scene model learning, including the structure of the trees, is described in [1].
C. Prediction of object poses with a tree of ISMs
If a set {i} of input objects does not contain all objects of a scene S, the spatial relations R towards the missing objects can be used to predict their poses T_P. An algorithm that predicts object poses with the help of the same trees of ISMs that are used for scene recognition is described in prior work [2]. Given the reference pose T_F of a scene recognition result, the ISMs in a tree employ the inverses T_Fo of the transforms used for voting on references to calculate object pose predictions T ← T_F · T_Fo. Pose predictions, as in 3 in Fig. 4, are used to estimate Next-Best-Views to find objects.
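A sketch of this prediction step, under the same assumptions as before (4x4 transforms as numpy arrays, illustrative names):

def predict_object_poses(T_F, inverse_relations):
    # For every missing object, compose the recognized reference pose T_F with
    # each stored inverse relative pose T_Fo to obtain predictions T_P <- T_F * T_Fo.
    return {obj: [T_F @ T_Fo for T_Fo in rel] for obj, rel in inverse_relations.items()}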
V. OBJECT SEARCH-RELATED DEFINITIONS
The robot that we use for searching objects is depicted in 1 in Fig. 1. It has a laser rangefinder for navigation with [9] and a stereo camera on a motorized Pan-Tilt Unit (PTU) for object localization. We define a robot configuration C = (x, y, q, r, t) based on the values r (pan) and t (tilt) of the two degrees of freedom of the PTU and on the robot pose (x, y, q). The robot pose is estimated relative to a global frame and against a 2D map. The workspace of the PTU is limited to [r_min, r_max] × [t_min, t_max]. The poses of the left camera T_L and the right camera T_R in the stereo rig are given in a global frame in 6 DoF³. The viewing area of a camera is defined by a frustum F = (fov_x, fov_y, ncp, fcp). A frustum is a four-sided truncated pyramid that is described by the horizontal and vertical fields of view fov_x and fov_y of the camera as well as the near and far clipping planes ncp and fcp, which set the minimal and maximal distance from which an object can be localized. A view V(T) is the viewing area of a camera, transformed into a global frame according to the pose T of the camera. Each camera view has a direction x(V), given by the optical axis. A pair of views for a stereo rig is shown in 3 in Fig. 4.
²Scene recognition results are visualized in terms of the relations in the searched scene, which are shown as straight lines. The colors of the relations change from green to red as more objects do not match the scene model. [1]
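The following sketch captures these definitions as simple data structures together with a rudimentary frustum-containment test. It assumes a z-forward optical axis and 4x4 homogeneous camera poses as numpy arrays; the class and method names are illustrative, not those of our software.

from dataclasses import dataclass
import numpy as np

@dataclass
class RobotConfiguration:          # C = (x, y, q, r, t)
    x: float                       # robot position on the 2D map
    y: float
    q: float                       # robot orientation (yaw)
    r: float                       # PTU pan
    t: float                       # PTU tilt

@dataclass
class Frustum:                     # F = (fov_x, fov_y, ncp, fcp)
    fov_x: float                   # horizontal field of view [rad]
    fov_y: float                   # vertical field of view [rad]
    ncp: float                     # near clipping plane [m]
    fcp: float                     # far clipping plane [m]

@dataclass
class View:                        # V(T): frustum placed at camera pose T
    T: np.ndarray                  # 4x4 camera pose in the global frame
    frustum: Frustum

    def direction(self):           # x(V): optical axis, assumed to be the camera z-axis
        return self.T[:3, 2]

    def contains(self, p):
        # Rudimentary check whether a 3D point p (global frame) lies in the frustum.
        local = np.linalg.inv(self.T) @ np.append(p, 1.0)
        x, y, z = local[:3]
        if not (self.frustum.ncp <= z <= self.frustum.fcp):
            return False
        return (abs(np.arctan2(x, z)) <= self.frustum.fov_x / 2 and
                abs(np.arctan2(y, z)) <= self.frustum.fov_y / 2)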
VI. ACTIVE SCENE RECOGNITION
BY RELATION-BASED OBJECT SEARCH
A. Decision-making system for Active Scene Recognition
Scene recognition as described in Sec. IV is a passive process, since it interprets data from an external source. In this section, we embed scene recognition and pose prediction into a decision-making system that searches objects in the context of Active Scene Recognition (ASR). Both are from prior work and are described in Sec. IV-B and IV-C. Besides an integrated system, realized as a hierarchical state machine [10] and extending prior work on ASR [2] to a navigating robot, our contribution is a method for consecutively estimating Next-Best-Views (NBVs) that we introduce in Sec. VI-B and VI-C. In the following, we suppose that at least one object is available to object search in order to deduce poses of further objects⁴ and that scene models have been learnt by robot teleoperation⁵.
The highest level of our ASR system, shown in 2 in Fig. 1, contains as major steps the states SCENE RECOGNITION, POSE PREDICTION and OBJECT SEARCH, which run in a loop. OBJECT SEARCH is divided into a series of substates, visible in 3 in Fig. 1.
³We present all operations as performed on the view of the left camera.
⁴Initial object poses could be acquired with an uninformed search approach, as described in Sec. I, before our system starts execution.
⁵A robot is guided to views in which it detects the objects that belong to a scene and the changes in object poses that occur during the demonstration.
Fig. 4: First NBV estimation in scenario C2. 1: Top view on robot position samples from all iterations of Algo. 1. 2: Camera orientation set for the best robot position in the first iteration. 3: NBV result in detail. View pair of the robot as transparent frustums with arrows in their middle as view directions. Current view of the robot in red. NBV in blue, containing pose predictions as shrunk 3D models. Pose predictions within the NBV colored in blue, their lines of sight in yellow.
ASR begins once an initial set of objects is detected
in OBJECT DETECTION. When OBJECT DETECTION is visited, all estimated poses in the robot frustum are transformed from the camera to the world frame with the current robot configuration. The results are stored together with the views from which they are acquired. Storing views prevents OBJECT SEARCH in Sec. VI-B from searching views for the same objects more than once. The next step is SCENE RECOGNITION, which passes all localized objects to passive scene recognition with hierarchical ISMs. In case all objects of all scenes are found, ASR stops and returns the results of scene recognition. Otherwise, all scene instances from SCENE RECOGNITION that do not comprise all objects of their scenes are buffered. We greedily extract the best-rated instance from that buffer and start to predict poses in POSE PREDICTION for all target objects {o}_P that are missing in that scene instance. The resulting set of poses T_P is passed to the OBJECT SEARCH subsystem. In its substates, it tries to find the target objects by successively estimating views that are most suitable for detecting the searched objects.
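The following sketch outlines this loop in plain Python instead of a SMACH state machine [10]. All helper functions and the robot interface are placeholders for the components described in Sec. IV and in the rest of Sec. VI, not an actual API.

def active_scene_recognition(robot, scene_models):
    # Top-level ASR loop from 2 in Fig. 1: SCENE RECOGNITION, POSE PREDICTION
    # and OBJECT SEARCH alternate until all objects of all scenes are found.
    found_objects = robot.detect_objects()                         # initial OBJECT DETECTION
    while True:
        instances = recognize_scenes(scene_models, found_objects)  # SCENE RECOGNITION
        if all_objects_found(instances, scene_models):
            return instances                                       # outcome: found_all_objects
        buffer = sorted(partial_instances(instances), key=rating, reverse=True)
        for instance in buffer:                                    # greedy instance selection
            predictions = predict_missing_object_poses(instance)   # POSE PREDICTION
            new_objects = object_search(robot, predictions)        # OBJECT SEARCH subsystem
            if new_objects:
                found_objects.update(new_objects)
                break                                              # re-run scene recognition
        else:
            return instances                                       # no instance led to a detection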
B. Relational object search subsystem
Predicted poses T_P alone are insufficient to reliably estimate views for object detection, since objects o are not equally well detectable from every perspective. We therefore use lines of sight {n}, which we empirically estimate per object type. Each line, defined relative to the object frame, represents a perspective suitable for detection. In world coordinates, each line depends on the pose of its object. Lines are assigned to each pose T_P in the first substate NBV SET POINT CLOUD of OBJECT SEARCH. Example lines are visible in 3 in Fig. 4. We designate {(T_P, {n})} as poses with lines. Before NBV estimation starts, the poses with lines are prefiltered by the method updateSet(V), which is called for all views that have already been explored. Instead of invalidating entire predicted poses in the frustum of a view V, it only deletes the lines of sight that point in the direction x(V) of the view, so that the poses can still be searched from other directions.
The next state, NBV CALCULATION, is the starting point of the iterative object search process and estimates a Next-Best-View V_N together with the optimal set of objects {o}_N to be searched in view V_N. Combinations of object set and NBV ({o}_N, V_N) are returned by Algo. 1 in Sec. VI-C, which receives the current view V_C, the current robot configuration C_C and the predicted poses with lines {(T_P, {n})} as input. In case an NBV V_N that is classified as accessible is found, the robot goal configuration C_N corresponding to that view is transmitted to the state SM MOVE TO VIEW. This state controls PTU and navigation to reach the goal. Once navigation has stopped at a configuration more or less close to the goal, a transition to OBJECT DETECTION is triggered. OBJECT DETECTION performs object localization for the optimal set of objects {o}_N.
If at least one object is found, OBJECT SEARCH is left in order to generate updated recognition results in SCENE RECOGNITION, replacing the old ones in the buffer. Otherwise, e.g. in case of occlusions, NBV UPDATE POINT CLOUD triggers updateSet(V_C) to invalidate all lines of sight that belong to poses in the frustum of the current view V_C and point in the direction of V_C. If lines exist but none can be invalidated, updateSet(V_N) is called for the NBV V_N instead. If the number of remaining lines falls below a threshold, the scene instance currently used for predicting poses is discarded and OBJECT SEARCH is left. New object pose predictions are then calculated in POSE PREDICTION with the next-best scene instance in the buffer. We exhaustively process all scene instances in the buffer for pose prediction until an object is found or ASR is aborted. Greedy scene instance selection, which aims to complete scenes as fast as possible, only sets the order in which instances are processed.
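A sketch of updateSet(V), reusing the View structure sketched in Sec. V. Poses with lines are assumed to be stored as (object name, T_P, lines) triples with unit line-of-sight vectors; the angular threshold is an illustrative assumption.

import numpy as np

def update_set(poses_with_lines, view, angle_threshold=0.5):
    # For every predicted pose inside the frustum of an already explored view,
    # delete the lines of sight that point in the view direction x(V), so that
    # the pose can still be searched from other directions.
    x_v = view.direction()
    for _obj, T_P, lines in poses_with_lines:
        if not view.contains(T_P[:3, 3]):
            continue                                   # pose outside the frustum: keep all lines
        lines[:] = [n for n in lines
                    if np.arccos(np.clip(np.dot(n, x_v), -1.0, 1.0)) > angle_threshold]
    return poses_with_lines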
C. Next-Best-View estimation
We divide the 6 DoF camera pose space into a space of robot positions (x, y) and a space of camera orientations q. This 4 DoF search space is iteratively discretized with increasing resolution e by repeatedly executing Algo. 2 within Algo. 1.
1: {q}_S ← getSpiralApprox(t_min, t_max) and V_A ← V_C and e ← e_0
2: repeat
3:   Get robot configuration C_A from currently best view V_A
4:   ({o}_I, V_I) ← iterationStep(e, C_A, {q}_S, {(T_P, {n})}, C_C)
5:   d ← ‖(x_I, y_I) − (x_A, y_A)‖ and V_A ← V_I and e ← 2·e
6: until d < threshold
7: return ({o}_N, V_N) ← ({o}_I, V_I)
Algorithm 1: ({o}_N, V_N) ← calculateNBV({(T_P, {n})}, V_C, C_C)
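A sketch of Algo. 1, where iteration_step() corresponds to Algo. 2 (a sketch follows below) and the remaining helpers (orientation sampling, extracting a configuration or position from a view) are placeholders; e0 and threshold are illustrative values, not our settings.

import numpy as np

def calculate_nbv(poses_with_lines, current_view, current_config, e0=1.0, threshold=0.1):
    # Hierarchical NBV search: repeatedly tessellate the robot position space
    # around the best view of the previous iteration with doubled resolution
    # until the positions of two consecutive best views nearly coincide.
    orientations = sample_orientations_on_sphere()        # {q}_S, e.g. spiral approximation
    best_objects, best_view, e = set(), current_view, e0  # V_A <- V_C, e <- e0
    while True:
        anchor = config_from_view(best_view)              # C_A from currently best view V_A
        best_objects, new_view = iteration_step(e, anchor, orientations,
                                                poses_with_lines, current_config)
        d = np.linalg.norm(position_of(new_view) - position_of(best_view))
        best_view, e = new_view, 2 * e                    # halve the area, double the resolution
        if d < threshold:
            return best_objects, best_view                # ({o}_N, V_N)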
Fig. 5: Iteration steps, each in different colors, for the first NBV in C2. Robot as a red circle. Its environment as a 2D map with light colored free space & borders in black. Robot position samples as cylinders on a blue, square area of the floor and camera orientations as squares on a sphere make up the search space in each step. Both in the same color as the arrow of the resulting view (its position & direction). Blue pillars connect the best-rated position with the orientation sphere and the result arrow. Sphere height displaced while iterating. Deviations among the arrows from steps 1 & 2 in 1, 2 & 3 in 2 and 3 & 4 in 3 decrease from left to right.
In the first execution of iterationStep(), we tessellate the entire area in robot position space that is of interest to object search at a coarse resolution e_0. We do so with a hex grid that is aligned to the current position of the robot C_C, from which getHexGrid() returns the corners as a position set {(x,y)}_H. From this position set, each position whose distance to all pose predictions T_P of the searched objects is larger than fcp or that lies too close to an obstacle on the 2D environment map is discarded at line 2 of Algo. 2. A set of camera orientations {q}_S, obtained by evenly sampling a unit sphere in getSpiralApprox() at line 1 of Algo. 1, is passed to iterationStep(). These orientations are combined with each robot position sampled in Algo. 2 at line 5 of iterationStep() to generate view candidates V with the frustum F. In 2 in Fig. 4, a camera orientation set, limited to the workspace of the PTU, is visualized on a sphere for a given robot position. Each execution of iterationStep() estimates the best-rated view V_I at resolution e according to a reward r({o}, V) at line 10. It is passed as robot configuration C_A to the next call of iterationStep() in Algo. 1.
1: {(x,y)}_H ← getHexGrid(e, x_A, y_A) with (x_A, y_A) from C_A
2: pruneIsolatedAndOccupiedPositions({(x,y)}_H)
3: for all (x,y) ∈ {(x,y)}_H do
4:   for all q ∈ {q}_S do
5:     V ← getView(F, x, y, q)
6:     {(T_F, {n})} ← frustumCulling({(T_P, {n})}, V)
7:     for all {o} ∈ 2^{{o}_P} do
8:       Extract (T_F, {n}) ∈ {(T_F, {n})} belonging to o ∈ {o}
9:       r({o}, V) ← u(V, {(T_F, {n})}) · i({o}, V, C_C)
10: return ({o}_I, V_I) ← argmax over ({o}, V) from {(x,y)}_H × {q}_S × 2^{{o}_P} of r({o}, V)
Algorithm 2: ({o}_I, V_I) ← iterationStep(e, C_A, {q}_S, {(T_P, {n})}, C_C)
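A sketch of Algo. 2, in which hex_grid(), prune_positions() and make_view() stand in for getHexGrid(), pruneIsolatedAndOccupiedPositions() and getView(), and reward() stands for r({o}, V) = u(V, ·) · i({o}, V, C_C), whose ingredients are sketched after Algo. 3. Poses with lines are again assumed to be (object name, T_P, lines) triples.

from itertools import combinations

def iteration_step(e, anchor_config, orientations, poses_with_lines, current_config):
    # Enumerate view candidates from a hex grid of resolution e around the anchor
    # configuration combined with the sampled camera orientations, and return the
    # object subset / view pair with maximal reward.
    positions = prune_positions(hex_grid(e, anchor_config), poses_with_lines)
    target_objects = sorted({obj for obj, _, _ in poses_with_lines})   # {o}_P
    best, best_reward = None, float("-inf")
    for (x, y) in positions:
        for q in orientations:
            view = make_view(x, y, q)
            in_frustum = [(obj, T_P, lines) for obj, T_P, lines in poses_with_lines
                          if view.contains(T_P[:3, 3])]                # frustum culling
            for k in range(1, len(target_objects) + 1):                # power set of {o}_P
                for subset in combinations(target_objects, k):
                    culled = [(T_P, lines) for obj, T_P, lines in in_frustum
                              if obj in subset]
                    r = reward(set(subset), view, culled, current_config)
                    if r > best_reward:
                        best, best_reward = (set(subset), view), r
    return best                                                        # ({o}_I, V_I)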
In the next execution of iterationStep(), a new hex grid whose origin lies at configuration C_A is created with doubled resolution e and a halved area to be tessellated. C_A is the best-rated configuration from the preceding iteration. Tessellation results, in particular robot position sets, for pairs of consecutive iteration steps are shown in Fig. 5. As we only tessellate with increasing resolution in the proximity of the best-rated view of each iteration step, the search converges to a locally optimal view. These steps are aborted after line 5 in Algo. 1 once the positions (x_I, y_I), (x_A, y_A) of the views V_I, V_A from two consecutive iterations are similar enough. An NBV is returned. By decreasing the tessellated area proportionally to the increase in resolution in each step, the size of the robot position sets stays constant throughout the iterations. Uniformly tessellating the whole area of interest for object search with the resolution used in the last iteration would return far more positions. Position sets resulting from an exemplary run of Algo. 1 are shown in 1 in Fig. 4 with different colors per iteration. A sequence of four views V_I, generated by Algo. 2 during a run of Algo. 1, is shown in Fig. 5. The last pair of nearly equal views before aborting is omitted. At line 7 in Algo. 2, we extend our search space to the power set of the target objects {o}_P, as we do not only look for a best view V, but also for the best objects {o} ⊆ {o}_P to search in this view.
u(V, {(T_F, {n})}) = (1 / |{(T_F, {n})}|) · Σ_{(T_F, {n}) ∈ {(T_F, {n})}} u_a(V, T_F) · u_d(V, T_F) · u_n(V, {n})

u_a(V, T_F) = f( ∠(x(V), →pp_F), 0.5 · min(fov_x, fov_y) )

u_d(V, T_F) = f( |⟨x(V), →pp_F⟩ − (fcp + ncp)/2|, |fcp − ncp| / 2 )

u_n(V, {n}) = f( min_{n ∈ {n}} ∠(x(V), n), threshold )
A reward r({o}, V), which is the product of a utility u(V, {(T_F, {n})}) and the inverse i({o}, V, C_C) of a cost function, is used to rate combinations of object sets and views ({o}, V) in order to find the best one. The utility is calculated only on the portion of the predicted poses {(T_P, {n})} that lies in the frustum of view V (line 6 in Algo. 2) and that belongs to an object we selected for searching. A set of pose predictions is shown in 3 in Fig. 4 together with their lines of sight. The utility function u(V, {(T_F, {n})}) rates each pose prediction (T_F, {n}) in a view V regarding its confidence of being detectable for object localizers, given its 6 DoF pose T_F. Confidence of detection is defined as optimal for a pose when the measures u_a(V, T_F), u_d(V, T_F), u_n(V, {n}) are maximized at once: u_a(V, T_F), which evaluates the angle ∠(x(V), →pp_F) between the camera view direction x(V) and the ray →pp_F from the camera position p to the position p_F (from predicted pose T_F), favors that the predicted pose lies at the center of the camera field of view.
Fig. 6: 1,2,3: 1st, 2nd & last iteration step of ASR for S1 in C1, each with its current view & NBV. 4: Perfect scene recognition result for S1 after object search finished. 1: Purple box in the middle assigned to S1, resulting pose predictions at both sides of the box with those for two objects within the NBV frustum on the left. 2: Pose prediction causes frustum next to 2nd detection. 3: NBV for the last missing object on the right. 4: Last object found among its pose predictions.
u_d(V, T_F), which evaluates the projection ⟨x(V), →pp_F⟩ of the ray →pp_F onto the view direction, favors that the predicted pose lies halfway between ncp and fcp along the view direction. u_n(V, {n}), which evaluates the angle between the view direction x(V) and the line of sight of the predicted pose that is most similar to x(V), ensures that the predicted object is observed from a perspective favorable for its detection⁶. The inverse costs from the current configuration C_C to a view V in Algo. 3 are a weighted sum of the normalized inverted travel costs for the PTU i_p(r, t), the robot orientation i_n(q) and the robot position i_d(x, y) (distance normalized by a Gaussian), as well as i_o({o}), which relates the runtimes t(o) of running detection for all objects in the combination ({o}, V) to the sum of the detection runtimes for all target objects {o}_P.
1: Get C = (x, y, q, r, t) for V
2: i_p(r, t) ← w_r · (1 − |r − r_C| / |r_max − r_min|) + w_t · (1 − |t − t_C| / |t_max − t_min|)
3: i_n(q) ← w_q · (1 − |q − q_C| / 2π)
4: i_d(x, y) ← w_d · exp(−((x − x_C)² + (y − y_C)²) / (2σ²))
5: i_o({o}) ← w_o · (1 − Σ_{o ∈ {o}} t(o) / Σ_{o ∈ {o}_P} t(o))
6: return i_p(r, t) + i_n(q) + i_d(x, y) + i_o({o})
Algorithm 3: i({o}, V, C_C) - Inverse cost function
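The following sketch combines the three utility measures with the inverse cost function of Algo. 3; the reward maximized in Algo. 2 is the product utility(...) · inverse_costs(...). The linear shape of the rating function f, the weights, sigma and the line-of-sight threshold are illustrative assumptions, and View / RobotConfiguration refer to the structures sketched in Sec. V.

import math
import numpy as np

def f(x, maximum):
    # Rating function from footnote 6: monotonically decreasing with f(0, max) = 1
    # and f(max, max) = 0. A linear ramp is assumed; the paper does not fix its shape.
    return max(0.0, 1.0 - abs(x) / maximum)

def angle(a, b):
    return math.acos(np.clip(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)), -1.0, 1.0))

def utility(view, culled_poses, los_threshold=math.radians(45)):
    # u(V, {(T_F, {n})}): mean of u_a * u_d * u_n over the predicted poses in the frustum.
    if not culled_poses:
        return 0.0
    x_v, p, fr = view.direction(), view.T[:3, 3], view.frustum
    total = 0.0
    for T_F, lines in culled_poses:
        ray = T_F[:3, 3] - p                                       # ray from camera position to p_F
        u_a = f(angle(x_v, ray), 0.5 * min(fr.fov_x, fr.fov_y))    # pose centered in the view
        u_d = f(np.dot(x_v, ray) - (fr.fcp + fr.ncp) / 2.0,
                abs(fr.fcp - fr.ncp) / 2.0)                        # halfway between ncp and fcp
        u_n = f(min(angle(x_v, n) for n in lines), los_threshold)  # favorable line of sight
        total += u_a * u_d * u_n
    return total / len(culled_poses)

def inverse_costs(objects, goal, current, ptu_limits, det_times, all_targets,
                  w_r=0.25, w_t=0.25, w_q=0.2, w_d=0.2, w_o=0.1, sigma=1.0):
    # i({o}, V, C_C) as in Algo. 3; goal and current are RobotConfiguration instances,
    # det_times maps object names to detection runtimes t(o).
    r_min, r_max, t_min, t_max = ptu_limits
    i_p = (w_r * (1 - abs(goal.r - current.r) / abs(r_max - r_min)) +
           w_t * (1 - abs(goal.t - current.t) / abs(t_max - t_min)))
    i_n = w_q * (1 - abs(goal.q - current.q) / (2 * math.pi))
    i_d = w_d * math.exp(-((goal.x - current.x) ** 2 + (goal.y - current.y) ** 2) / (2 * sigma ** 2))
    i_o = w_o * (1 - sum(det_times[o] for o in objects) / sum(det_times[o] for o in all_targets))
    return i_p + i_n + i_d + i_o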
VII. EXPERIMENTS AND RESULTS
A. Experimental setups
In this section, we present scene recognition results from our Active Scene Recognition (ASR) approach for 5 exemplary object constellations. Despite being a general approach for finding scenes as defined in Sec. III, we designed constellations that enable us to compactly show how ASR works in practice and what its conceptual limitations are. The experiments are done with 3 scene models S1 to S3, acquired from demonstrations⁷ during which objects are horizontally slid in squares of 10 cm around their initial positions⁸.
(a) S1 - 4 objects, differently oriented on a table (2 in Fig. 3),
(b) S2 - 3 objects, one on 2 levels of a shelf (4 in Fig. 3),
(c) S3 - 3 pairs of objects on distant tables (5 in Fig. 3).
⁶Each criterion is rated by a function f(x, max) that is monotonically decreasing and for which f(0, max) = 1 and f(max, max) = 0 holds.
⁷Sensitivity is set to 0.5 m resp. 1 m for C1 resp. C2. Maximal accepted orientation deviations between demonstrated & detected poses are set to 45° resp. 60°. Experiments are done with a “Xeon E5-1650” & 32 GB RAM.
⁸Apart from a box in S2, presented on two different levels of a shelf.
S2 and S3 have two objects in common and are searched by ASR at once, which we call scenario C2. Scenario C1 consists of S1⁹. The major limitations for our system are the external components for robot navigation [9] and object detection [8]. In order to restrict the experiments to the limitations of ASR itself, we chose object locations surrounded by views that are accessible to [9] and trained object detectors from these perspectives to enable robust object detection.
B. Recognition results for scenes at changing places
In this section, we present scene recognition results for 2 different constellations of scene S1 in scenario C1. The 1st constellation consists of objects at poses similar to those observed during the demonstration of S1. To find all objects in that constellation, we execute ASR on a real robot using its real sensor readings. The presented execution of ASR illustrates how the interplay between utility and costs influences the choice of views during the ASR process. ASR starts with the robot standing in front of a purple box as depicted in 1 in Fig. 6. The object is detected and assigned to S1, and poses for the three remaining objects are predicted according to S1. NBV estimation chooses the blue-colored view¹⁰ on the left, as it contains two objects at once, their predictions being shown in blue, too. This illustrates how the utility of searching two objects at once exceeds the increment in travel costs compared to a view closer to the current robot position but containing predictions for only one object. Due to limitations of navigation in positioning the robot, ASR takes 5 consecutive views, the first shown in 2 in Fig. 6 and the last in 3 in Fig. 6, to find both searched objects. During this, NBV estimation chooses to stay close to both objects due to low travel costs. The pose of the object detection results within the view in 3 in Fig. 6 illustrates that the three measures in our utility function prefer views in which pose predictions, and eventually detection results, lie in the middle of the viewing frustums and are turned towards the robot. In 4 in Fig. 6, the last object is found and S1 is entirely recognized.
⁹S1: Big red box & purple box as well as small red box & yellow box are each connected by an ISM. A third ISM relates both object pairs to a tree of ISMs. S2: 2 objects on a table are paired by an ISM. Its reference is related to the box in the shelf by a 2nd ISM. S3: Relations among 3 object pairs are each modeled by ISMs, while the pairs are connected by 2 ISMs.
¹⁰View pairs of the stereo rig are designated as views for simplicity.
Fig. 7: 1,2,3,4: 1st, 2nd, 3rd & 4th iteration step of ASR for S2 & S3 in C2. 1: Robot pose localization with 2D laser scans in white. Robot position sampling distorted on the left due to pruning on empirical data. 3: Perfect scene recognition result for S3. Pose predictions for the tea box on the right, separated according to their shelf levels. 4: False detection of the yellow cup discarded as not in the frustum. 5: Perfect recognition results for S2 & S3 after object search.
Fig. 8: Recognition results for two consecutive executions of ASR on constellations for S1 with all views reached by ASR. Robot as a red arrow.
The 2nd constellation for S1, visible at the bottom of Fig. 8, results from displacing all objects to a different table while maintaining the relative poses between the objects. To simplify comparison of both constellations, the 1st is depicted at the top of Fig. 8 as well. This second experiment relies on the same models as the first, but is performed with simulated navigation and object recognition to simplify evaluation and to bypass navigation issues due to unreliable robot hardware. In simulation, the 1st constellation is recognized just as on the real robot, as can be seen at the top of Fig. 8 from the recognition result and from the views reached during ASR, which are shown as yellow arrows. When the simulated robot is put in front of the purple box in the 2nd constellation, the scene is completely recognized as well. Even though the absolute object poses and the geometry of the edges of the underlying table are different, ASR first heads to the object pair on one side before searching the last missing object on the other. This shows that our approach is capable of recognizing scenes independent of their placement in the world.
C. Recognition results for scenes with long relations
In scenario C2, which is made up of scenes with relations over long distances, we first present scene recognition results for S2 and S3 on a constellation that roughly corresponds to the demonstrated data. Object search in this constellation, the 3rd that we analyze, illustrates additional properties of the ASR process and is executed on the real robot, using its sensor data. Initially the robot stands in front of a cup and a plate. Once both are assigned to S3, poses for all missing objects in S3 are predicted on two different tables, as can be seen in 1 in Fig. 7. ASR chooses to look for the object pair at the bottom. Detecting this object pair in 2 in Fig. 7 causes equally rated recognition results for S2 and S3, since it belongs to both. S3 is chosen for pose prediction at random, leading to a view on the models at the top right of 1 in Fig. 7. Once these objects are found, the search moves on to S2 and its tea box. Predicted poses for the tea box are divided into clusters at two heights, which can be seen in 3 in Fig. 7. NBV estimation opts twice for the lower cluster before enough lines of sight are invalidated, see 4 in Fig. 7. Switching to the top, the box is found and both scenes are entirely recognized, as shown in 5 in Fig. 7.
Experiments on the next two constellations are conducted in simulation, just like those for the 2nd constellation in Sec. VII-B. They show how ASR copes with objects deviating from the trained relative poses. We transform our 3rd constellation into a 4th, shown in 1 in Fig. 9, by translating several objects¹¹. A 5th constellation, visible in 2 in Fig. 9, arises from the 3rd by rotating objects instead¹². To simplify comparison of both new constellations to the original, we visualize the original in transparent blue in 1, 2 in Fig. 9. ASR for the 4th constellation starts with the robot in front of cup and plate, at the bottom left of 1 in Fig. 9, and finds all objects apart from the tea box of S2 at the top right. For S3, it does not only find the object pair at the top left, whose relative poses towards cup and plate did not change, but also the pair at the bottom right, which, relative to cup and plate, is 30 cm away from its trained poses. These objects are found as they are still inside the view generated for their pose predictions. NBV estimation does not find a collision-free and sufficiently distant view in which the tea box is centered. The box is at the right corner of the view chosen instead (visible in 1 in Fig. 9)¹³.
¹¹On the left of 1 in Fig. 9, all objects are shifted by 30 cm to the left and on the right, only the tea box at the top is moved to the right by 30 cm. Compared to 1 in Fig. 7, the image is rotated by 90° counterclockwise.
¹²On the bottom right of that figure, 2 boxes are rotated clockwise by 10°.
¹³We configured simulated object detection to require an object to be in both frustums for successful detection in order to highlight this fact.
Fig. 9: 1: ASR searching the last object in the 4th constellation with shifted objects. 2: ASR looking for objects of S3 on the left side. Objects belong to the 5th constellation with rotated objects. 3,4: All views, shown as clusters at different positions and visited during uninformed search in C1 & C2, according to Sec. VII-D. Views shown as big yellow arrows, connected to blue pillars, which show the associated robot positions. Positions connected by blue line segments.
                 US - C1   ASR - C1   US - C2   ASR - C2
Views               69          7        117          6
Duration [min]   17.00       4.55      42.15       3.52
Table I: Number of views & object search duration with US & ASR.

           SCE REC   POS PRE   NBV CAL   MO TO VI   OBJ DET
Number           4         3         6          6         7
Avg [s]       0.05      0.03      1.15      37.38      3.40
Table II: Number of executions & average runtimes for the steps of ASR in C1.

           SCE REC   POS PRE   NBV CAL   MO TO VI   OBJ DET
Number           4         3         5          5         6
Avg [s]       0.27      0.09      1.45      33.89      2.77
Table III: Number of executions & average runtimes for the steps of ASR in C2.
In 2 in Fig. 9, ASR for the 5th constellation is started with the robot at the bottom right in front of both rotated boxes. While ASR manages to assign these objects and the tea box to S2, it fails to detect the remaining objects of S3 on the left side. For those objects, the pose predictions differ so much from the actual object poses that the NBVs, like the one shown at the top left of 2 in Fig. 9, do not contain the objects at all. The minimal distance between actual poses and predictions is 0.37 m at the bottom left and 0.58 m at the top left¹⁴. In general, ASR compensates position errors proportional to the frustum size, but is susceptible to orientation errors, depending on the length of the encountered relations.
D. Evaluation of Efficiency of ASR
Our last experiments, conducted on the real robot, focus on runtime. As a baseline for our ASR approach, we realized an uninformed search (US) as described in Sec. I¹⁵. US is performed on the 1st and the 3rd constellation, on which we also executed ASR. We compare the effort of both approaches in recognizing the scenes in C1 and C2 in Table I. The data about ASR in Table I is taken from Sec. VII-B and VII-C. The results of US in C1 and C2 are visible in 3 in Fig. 9 and in 4 in Fig. 9. The speedup when using ASR instead of US increases from 3.73 in C1 to 11.97 in C2, as US requires more robot positions to recognize larger scenes and searches all objects at once. ASR outperforms US not only in runtime, but also in scene recognition results. Due to false positive detections and poor perspectives, both visible in the implausible pose estimates in 3, 4 in Fig. 9, US fails to recognize the scenes. In Tables II and III, we separate the runtime of ASR into the average runtimes of its components for the 1st and the 3rd constellation.
¹⁴With increasing length of the relations, orientation errors in detection results cause increasing errors in the positions of predicted poses.
¹⁵To optimize performance, we reduced the robot positions that lie on a grid to a few promising points & oriented the robot towards the objects.
VIII. CONCLUSIONS
We presented an Active Scene Recognition approach that allows mobile robots to iteratively improve their estimates about present scenes by relational object search. As the search only relies on spatial relations, this approach recognizes scenes independent of the absolute poses of the detected objects.
IX. ACKNOWLEDGMENTS
This research was financially supported by the DFG
- Deutsche Forschungsgemeinschaft. We thank Florian
Aumann-Cleres and Jocelyn Borella for their support.
REFERENCES
[1] P. Meißner, R. Reckling, R. Jäkel, S. R. Schmidt-Rohr, and R. Dillmann, “Recognizing scenes with hierarchical implicit shape models
based on spatial object relations for programming by demonstration,”
in Int. Conf. on Advanced Robotics, 2013.
[2] P. Meißner, R. Reckling, V. Wittenbeck, S. R. Schmidt-Rohr, and
R. Dillmann, “Active scene recognition for programming by demon-
stration using next-best-view estimates from hierarchical implicit
shape models,” in Int. Conf. on Robotics and Automation, 2014.
[3] D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for
3d object detection with rgbd cameras,” in Int. Conf. on CV, 2013.
[4] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with
interleaved categorization and segmentation,” Int. Journal of CV, 2008.
[5] J. I. Vasquez-Gomez, L. E. Sucar, and R. Murrieta-Cid, “View plan-
ning for 3d object reconstruction with a mobile manipulator robot,” in
Int. Conf. on Intelligent Robots and Systems, 2014.
[6] Y. Ye and J. K. Tsotsos, “Sensor planning for 3d object search,”
Computer Vision and Image Understanding, 1999.
[7] L. Kunze, K. K. Doreswamy, and N. Hawes, “Using qualitative spatial
relations for indirect object search,” in Int. Conf. on Robotics and
Automation, 2014.
[8] P. Azad, T. Asfour, and R. Dillmann, “Stereo-based 6d object local-
ization for grasping with humanoid robot systems,” in Int. Conf. on
Intelligent Robots and Systems, 2007.
[9] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige,
“The office marathon: Robust navigation in an indoor office environ-
ment,” in Int. Conf. on Robotics and Automation, 2010.
[10] J. Bohren and S. Cousins, “The smach high-level executive,” IEEE
Robotics & Automation Magazine, 2010.