Scene Recognition for Mobile Robots by Relational Object Search using
Next-Best-View Estimates from Hierarchical Implicit Shape Models
Pascal Meißner, Ralf Schleicher, Robin Hutmacher, Sven R. Schmidt-Rohr and Rüdiger Dillmann
Abstract— We present an approach for recognizing indoor
scenes in object constellations that require object search by
a mobile robot, as they cannot be captured from a single
viewpoint. In our approach, which we call Active Scene Recognition (ASR), robots predict object poses from learnt spatial
relations that they combine with their estimates about present
scenes. Our models for estimating scenes and predicting poses
are Implicit Shape Model (ISM) trees from prior work [1].
ISMs model scenes as sets of objects with spatial relations
in-between and are learnt from observations. In prior work
[2], we presented a realization of ASR that was limited to choosing orientations for a fixed robot head and whose object search relied on predicted positions while ignoring object types. In this paper, we
introduce an integrated system that extends ASR to selecting
positions and orientations of camera views for a mobile robot
with a pivoting head. We contribute an approach for Next-Best-View estimation for object search on predicted object poses. It is
defined on 6 DoF viewing frustums and optimizes the searched
view, together with the objects to be searched in it, based on
6 DoF pose predictions. To prevent combinatorial explosion
when searching camera pose space, we introduce a hierarchical
approach to sample robot positions with increasing resolution.
I. INTRODUCTION
Programming by Demonstration (PbD) is a paradigm
that aims to allow non-experts to teach robots manipulation tasks. Given a taught task description, a robot needs a notion of the scene in which it is situated to decide whether it is adequate to begin that task. In general, estimating isolated locations of present objects is not sufficient, since objects may have different usages in different scenes. Not
only the occurrence of objects, but also the spatial relations
between them have to be considered. To address this issue,
we presented hierarchical Implicit Shape Models [1] that
recognize indoor scenes in object constellations. They model
scenes as sets of objects, including the spatial relations
between them as six Degree-of-Freedom (DoF) coordinate
transforms. Relations are entirely learnt from demonstrations.
Objects in indoor scenes may be distributed or occluded.
Object configurations often have to be perceived from several
viewpoints before a scene can be recognized. Without prior
knowledge about which perspectives have to be checked for which objects, the 6 DoF space of camera viewpoints has to be browsed by uninformed search¹. In a realistic scenario, such
an approach is infeasible due to combinatorial explosion. In
[2], we presented a method that, given a partially recognized
All authors are with the Institute of Anthropomatics, Karlsruhe Institute of
Technology, 76131 Karlsruhe, Germany. pascal.meissner@kit.edu
1An approach for a mobile robot with a pivoting camera is to discretize the
space of robot positions with a given resolution to a grid. At each position,
the camera has to be rotated to a set of views, lying on a tessellated sphere
and at each view, localization is to be performed for all searched objects.
scene, predicts poses for the remaining objects, not yet found,
with the same relations that are used for recognition. Jointly
recognizing scenes and predicting poses with the help of
relations makes it possible to identify object constellations
independent of their absolute locations. We call this concept of alternating scene recognition and object search Active Scene Recognition (ASR). In prior work [2], we demonstrated it
with a simplified approach for a fixed robot head. Camera
views were only optimized regarding their orientations for
a single, given view position. In this paper, we introduce a
new integrated system that implements the ASR concept as a
hierarchical state machine to control a mobile robot instead
of a fixed robot head. For this state machine, we contribute a
method that estimates Next-Best-Views (NBVs) on clouds of
predicted object poses. Both position and orientation of the pivoting head of a mobile robot are optimized in terms of utility and costs. Beyond prior work [2], we introduce a new utility
function and an algorithm for generating NBV candidates on
a 4 DoF space of discrete combinations of robot positions
and sensor orientations. Since uniform sampling of robot
position space is infeasible at the resolutions required in our
scenario, we contribute a hierarchical strategy to reduce the
number of searched positions: We start an iterative process
by searching a view at a coarse resolution. Next, we increase
the resolution only in proximity of the result from the prior
iteration and restart searching until we converge to an NBV
at a sufficient resolution. Instead of searching all missing
objects in NBVs [2], we estimate an optimal set of objects
per NBV. When NBV estimation only considers how many predicted object positions fall into a view [2], we noticed many detection errors with real object localizers. We
address this by a utility concept with three measures, each of
which models an aspect that relates 6 DoF poses of objects
in a view to the confidence of detecting them correctly.
II. RELATED WORK
Methods in indoor scene understanding such as [3] often model scenes using Qualitative Spatial Relations (QSRs). Being a symbolic abstraction, QSRs generalize object relations well, while lacking the detailed description of relative object locations that we require. In contrast, the Implicit
Shape Model (ISM) [4] is a non-parametric method from object recognition that models object part configurations of arbitrary complexity. We extended ISMs in [1] for scene
recognition so that they process 6 DoF poses instead of 3D
data and introduced a hierarchical ISM, which is a set of
connected ISMs, including learning and inference methods.
In the area of object search, most methods optimize the
Fig. 1: 1: Robot looking for two objects in scenario C2, see Sec. VII. 2, 3: State diagrams for the presented ASR system, see Sec. VI. 2: Highest level of the ASR system. The grey box stands for the object search subsystem whose substates are in 3, starting at top right. Outcomes (red boxes in 3) are transitions in 2.
Fig. 2: [1] 1: Trajectories & relations in a single ISM, in different colors per object. 2: False positive recognition by the ISM in 1. 4: Trajectories & relations for a tree of ISMs, shown in 3. One ISM relates cup & plate, the other ISM relates box & reference of the first ISM (plate). 5: Correct recognition by the ISM tree in 4.
next view to be reached, called the Next-Best-View (NBV).
Approaches in NBV planning, like [5], are mainly concerned with object model reconstruction. Their optimization criteria for choosing views address the characteristics of that application, e.g. by maximizing the unknown space within a view instead of rating sets of predicted object poses as we do. NBV candidates are generated by uniform sampling in large numbers instead of a more targeted generation of fewer views to evaluate. In contrast, NBV planning in our application is
required to produce views of constant quality in real time
in a search loop. Among the first to employ NBV planning for object search were the authors of [6]. In their work, the utility of views is derived from a distribution of absolute positions of a single object. In our approach, we instead search multiple objects at
once, using their relative 6D poses. [7] combines predefined
absolute locations of objects with QSRs to calculate object
position predictions. QSRs are applied without differentiating
among scenes. Object search is performed by extracting 2D
views from robot pose space. For the best pose, camera
orientation is optimized in 3D afterwards. Not considering the vertical distribution of predictions during the 2D search might lead to suboptimal views. View utilities are calculated only from the density of position predictions.
III. SCENE-RELATED DEFINITIONS
We model a scene S = ({o}, {R}) as a set of objects {o} together with a set of spatial relations {R} between the objects. Recognition of a scene in a constellation of objects is equal to rating the similarity between a set of input objects {i}, located at absolute poses {T}, and the model learnt for the scene. An exemplary input constellation is shown in 1 in Fig. 1. Scene models, in particular relations, are extracted from demonstrations in which trajectories of objects are recorded over time. Trajectories consist of sequences of absolute 6 DoF poses T, shown as thick lines in 1 in Fig. 2. Poses T are estimated for objects by visual localization [8]. Spatial relations are given as sets of relative poses {T_jk} between pairs of objects (o_j, o_k). In 1 in Fig. 2, relative poses in relations are depicted by arrows. They make up two spatial relations, each between the box and one moving object.
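For illustration, the relative poses of a relation can be obtained by expressing the pose of one object in the frame of its partner at every recorded time step. The following minimal Python sketch assumes poses are given as 4x4 homogeneous transformation matrices; the function name is illustrative and not taken from our implementation.

import numpy as np

def relative_poses(traj_j, traj_k):
    """Relation {T_jk}: pose of object o_k expressed in the frame of object o_j,
    computed for every time step of two synchronized demonstrated trajectories.
    Each trajectory is a list of 4x4 homogeneous transforms in a world frame."""
    return [np.linalg.inv(T_j) @ T_k for T_j, T_k in zip(traj_j, traj_k)]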
IV. PRIOR WORK ON IMPLICIT SHAPE MODELS
A. Passive scene recognition with a single ISM
In [1], we introduced scene model learning and recognition with ISMs for scenes as in Sec. III. From all n² spatial relations among a given set {o} of n objects, a single ISM can only model the relations from each object (circles in 3 in Fig. 2) towards a common reference (boxes in 3 in Fig. 2) with pose T_F. We define an object as equal to the reference according to a heuristic in [1]. We call it the reference object o_F. In 1 in Fig. 2, an ISM only models relations towards a box, being the scene reference. Learning an ISM is equal to transforming the absolute poses T(t) in every demonstrated trajectory into relative poses T_oF(t) and T_Fo(t) for each point in time t. T_oF(t) encodes how the reference object o_F is located relative to object o, with T_Fo(t) being the inverse relationship. Recognition of a scene S on a configuration of input objects {i} at poses {T} is a voting process. Each object in {i} that is part of the scene votes where it expects absolute poses T_F ← T · T_iF of the scene reference by using the relative reference poses T_iF in its relation. The reference object votes on its own pose if it is present in {i}. An instance of the scene, i.e. a recognition result, is expected where a maximum of votes of different objects coincide. Votes are collected in a voxel grid on R³, which enables detecting maxima
Fig. 3: Demonstrations for scene model learning in C1 in 1 & in C2 in 3. Recording object trajectories in 1 & moving robot towards training view in 3. Objects of S1 in yellow circle in 1, those of S2 resp. S3 in the blue resp. green circles in 2. Resulting ISM tree for S1 in 2, for S2 in 4 & for S3 in 5.
by counting votes per voxel. The edge length of the voxels, called sensitivity, decides by how much votes are allowed to differ in a recognition result, thereby adjusting how far objects in a scene may deviate from the relations.
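The voting step can be sketched in Python as follows; the fragment is illustrative (helper and variable names are not those of our implementation) and, as in the description above, only bins the vote positions into a voxel grid with edge length equal to the sensitivity.

import numpy as np
from collections import defaultdict

def vote_for_reference(input_objects, sensitivity):
    """Voting step of a single ISM. input_objects is a list of tuples
    (object_id, T, rel_poses): the absolute 6 DoF pose T of a detected object
    and the relative reference poses {T_iF} stored in its relation."""
    voters = defaultdict(set)   # voxel index -> ids of distinct objects voting there
    votes = defaultdict(list)   # voxel index -> voted absolute reference poses T_F
    for obj_id, T, rel_poses in input_objects:
        for T_iF in rel_poses:
            T_F = T @ T_iF                                   # reference pose hypothesis
            voxel = tuple(np.floor(T_F[:3, 3] / sensitivity).astype(int))
            voters[voxel].add(obj_id)
            votes[voxel].append(T_F)
    # a scene instance is expected where votes of the most distinct objects coincide
    best = max(voters, key=lambda v: len(voters[v]))
    return votes[best], len(voters[best])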
B. Passive scene recognition with a tree of ISMs
2 in Fig. 2 shows a false positive recognition result of an ISM, as the violation of the relation between the objects cup and plate, neither of which is the reference, is not detected². To be able to represent relations among non-reference objects in scenes as well, we introduced trees {m} of ISMs, which we call hierarchical Implicit Shape Models and which consist of connected ISMs. Each spatial relation in a scene that is to be modeled is represented by a separate ISM in the tree. In 4 in Fig. 2, the relations between cup and plate and between plate and box are modeled by two ISMs. ISMs at adjacent levels in the tree are connected by objects, like breakfast_sub0 in 3 in Fig. 2, which appear as reference in the ISM at level n+1 and as regular object in the ISM at level n. Voting results from an ISM at level n+1 are input at level n. The ISM at the tree root returns recognition results of the entire tree. In the tree in 5 in Fig. 2, voting from cup towards plate produces a result at level 2, which is used at level 1 to relate the reference of level 2 to the box. The plate is excluded from recognition. Recognition with ISM trees and scene model learning, including the structure of the trees, is described in [1].
C. Prediction of object poses with a tree of ISMs
If a set {i} of input objects does not contain all objects of a scene S, the spatial relations R towards the missing objects can be used to predict their poses T_P. An algorithm that predicts object poses with the help of the same trees of ISMs that are used for scene recognition is described in prior work [2]. Given the reference pose T_F of a scene recognition result, the ISMs in a tree employ the inverses T_Fo of the transforms used for voting on references to calculate object pose predictions T ← T_F · T_Fo. Pose predictions, as in 3 in Fig. 4, are used to estimate Next-Best-Views to find objects.
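A minimal sketch of this prediction step, under the same conventions as the previous fragments:

import numpy as np

def predict_object_poses(T_F, rel_poses_Fo):
    """Predict absolute 6 DoF poses T_P = T_F * T_Fo for one missing object,
    given the reference pose T_F of a scene recognition result and the relative
    poses {T_Fo} learnt for the relation towards that object."""
    return [T_F @ T_Fo for T_Fo in rel_poses_Fo]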
V. OBJECT SEARCH-RELATED DEFINITIONS
The robot that we use for searching objects is depicted in 1 in Fig. 1. It has a laser rangefinder for navigation with
2Scene recognition results are visualized in terms of the relations in the searched scene, which are shown as straight lines. The colors of the relations change from green to red as more objects do not match the scene model [1].
[9] and a stereo camera on a motorized Pan-Tilt Unit (PTU) for object localization. We define a robot configuration C = (x, y, q, r, t) based on the values r (pan) and t (tilt) of both degrees of freedom of the PTU and on the robot pose (x, y, q). The robot pose is estimated relative to a global frame and against a 2D map. The workspace of the PTU is limited to [r_min, r_max] × [t_min, t_max]. The poses T_L and T_R of the left and right camera in the stereo rig are given in a global frame in 6 DoF³. The viewing area of a camera is defined by a frustum F = (fov_x, fov_y, ncp, fcp). A frustum is a four-sided truncated pyramid that is described by the horizontal and vertical fields of view fov_x and fov_y of the camera as well as the near and far clipping planes ncp and fcp, which set the minimal and maximal distance from which an object can be localized. A view V(T) is the viewing area of a camera, transformed into a global frame according to the pose T of the camera. Each camera view has a direction x(V), given by the optical axis. A pair of views for a stereo rig is shown in 3 in Fig. 4.
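For illustration, a containment test against such a frustum, which we later need for frustum culling, can be sketched as follows. The convention that the camera looks along its +z axis and all names are illustrative assumptions.

import numpy as np

def in_frustum(p_world, T_cam, fov_x, fov_y, ncp, fcp):
    """True if a 3D point lies inside the frustum F = (fov_x, fov_y, ncp, fcp)
    of a camera at pose T_cam (4x4 transform from camera to world frame).
    Angles in radians, distances in meters; the camera looks along its +z axis."""
    p = np.linalg.inv(T_cam) @ np.append(p_world, 1.0)   # point in the camera frame
    x, y, z = p[:3]
    if not (ncp <= z <= fcp):                            # clipping planes
        return False
    return (abs(np.arctan2(x, z)) <= fov_x / 2.0 and     # horizontal field of view
            abs(np.arctan2(y, z)) <= fov_y / 2.0)        # vertical field of view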
VI. ACTIVE SCENE RECOGNITION
BY RELATION-BASED OBJECT SEARCH
A. Decision-making system for Active Scene Recognition
Scene recognition in Sec. IV is a passive process, since it
interprets data from an external source. In this section, we embed scene recognition and pose prediction in a decision-making system to search objects in the context of Active Scene Recognition (ASR). Both are from prior work and described in Sec. IV-B and IV-C. Besides an integrated system, realized as a hierarchical state machine [10] and extending prior work in ASR [2] to a navigating robot, our contribution is a method for consecutively estimating Next-Best-Views (NBVs) that we introduce in Sec. VI-B and VI-C. In the following, we suppose that at least one object is available to object search in order to deduce poses of further objects⁴ and that scene models have been learnt by robot teleoperation⁵.
The highest level of our ASR system, shown in 2 in Fig. 1, contains as major steps the states SCENE RECOGNITION, POSE PREDICTION and OBJECT SEARCH that run in a loop. OBJECT SEARCH is divided into a series of substates, visible in 3
3We present all operations as performed on the view of the left camera.
4Initial object poses could be acquired with an uninformed search
approach, as described in Sec. I, before our system starts execution.
5A robot is guided to views in which it detects objects that belong to a
scene and the changes in object poses occurring during the demonstration.
Fig. 4: First NBV estimation in scenario C2. 1: Top view on robot position samples from all iterations of Algo. 1. 2: Camera orientation set for best robot position in first iteration. 3: NBV result in detail. View pair of robot as transparent frustums with arrows in their middle as view directions. Current view of robot in red. NBV in blue, containing pose predictions as shrunken 3D models. Pose predictions within NBV colored in blue, their lines of sight in yellow.
in Fig. 1. ASR begins once an initial set of objects is detected
in OBJECT DETECTION. When OBJECT DETECTION is visited,
all estimated poses in the robot frustum are transformed from
camera to world frame with the current robot configuration.
The results are stored together with the views from which
they are acquired. Storing views prevents OBJECT SEARCH in
Sec. VI-B from searching views for the same objects more
than once. The next step is SCENE RECOGNITION that passes
all localized objects to passive scene recognition with hierarchical ISMs. In case all objects of all scenes are found, ASR
stops, returning the results of scene recognition. Otherwise
all scene instances from SCENE RECOGNITION, which do not
comprise all objects of their scenes, are buffered. We greedily
extract the best-rated instance from that buffer and start to
predict poses in POSE PREDICTION for all target objects {o}_P that are missing in that scene instance. The resulting set of poses T_P is passed to the OBJECT SEARCH subsystem. In its substates,
it tries to find target objects by successively estimating views
that are most suitable for detecting searched objects.
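The control flow of this highest level can be summarized as a plain loop. The following sketch abstracts from the state machine framework [10]; all function names are illustrative placeholders for the components described above.

def active_scene_recognition(detect_objects, recognize_scenes,
                             predict_poses, object_search):
    """Simplified top-level ASR loop: alternate passive scene recognition,
    pose prediction and relational object search until all scenes are complete."""
    found = detect_objects()                          # OBJECT DETECTION on initial view
    while True:
        instances = recognize_scenes(found)           # SCENE RECOGNITION with ISM trees
        if all(inst.is_complete() for inst in instances):
            return instances                          # outcome: found_all_objects
        # buffer incomplete instances, best-rated first (greedy instance selection)
        buffer = sorted((i for i in instances if not i.is_complete()),
                        key=lambda i: i.confidence, reverse=True)
        new_objects = []
        for instance in buffer:                       # process buffer until a hit
            predictions = predict_poses(instance)     # POSE PREDICTION for missing objects
            new_objects = object_search(predictions)  # OBJECT SEARCH subsystem (Sec. VI-B)
            if new_objects:
                break
        if not new_objects:                           # nothing found for any instance
            return instances
        found = found + new_objects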
B. Relational object search subsystem
Predicted poses T_P alone are insufficient to reliably estimate views for object detection. Objects o are not equally well detectable from every perspective. Instead, we use lines of sight {n}, which we empirically estimate per object type. Each line, defined relative to the object frame, represents a perspective suitable for detection. In world coordinates, each line depends on the pose of its object. Lines are assigned to each pose T_P in the first substate NBV SET POINT CLOUD of OBJECT SEARCH. Example lines are visible in 3 in Fig. 4. We designate {(T_P, {n})} as poses with lines. Before NBV estimation starts, the poses with lines are prefiltered by the method updateSet(V). It is called for all views that have already been explored. Instead of invalidating entire predicted poses in the frustum of a view V, it only deletes the lines of sight that point into the direction x(V) of the view so that the poses can still be searched from other directions.
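A sketch of updateSet(V) under the poses-with-lines representation introduced above; the field names and the assumption of unit-length direction vectors are illustrative.

import numpy as np

def update_set(poses_with_lines, view_dir, in_frustum_of_view, angle_threshold):
    """updateSet(V): for every predicted pose inside the frustum of an already
    explored view V, drop the lines of sight pointing into the view direction
    x(V), so that the pose can still be searched from other directions."""
    for pose in poses_with_lines:
        if not in_frustum_of_view(pose.position):    # only poses inside the frustum of V
            continue
        pose.lines = [n for n in pose.lines          # keep lines not covered by x(V)
                      if np.arccos(np.clip(np.dot(n, view_dir), -1.0, 1.0))
                      > angle_threshold]
    return poses_with_lines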
The next state NBV CALCULATION is the starting point of the iterative object search process and estimates a Next-Best-View V_N together with the optimal set of objects {o}_N to be searched in view V_N. Combinations of object set and NBV ({o}_N, V_N) are returned by Algo. 1 in Sec. VI-C, which receives the current view V_C, the current robot configuration C_C and the predicted poses with lines {(T_P, {n})} as input. In case an NBV V_N that is classified as accessible is found, the robot goal configuration C_N that corresponds to the view is transmitted to the state SM MOVE TO VIEW. This state controls PTU and navigation to reach the goal. Once navigation has stopped at a configuration more or less close to the goal, a transition to OBJECT DETECTION is triggered. OBJECT DETECTION performs object localization for the optimal set of objects {o}_N. If at least one object is found, OBJECT SEARCH is left to generate updated recognition results in SCENE RECOGNITION, replacing the old ones in the buffer. Otherwise, e.g. in case of occlusions, NBV UPDATE POINT CLOUD triggers updateSet(V_C) to invalidate all lines of sight of poses in the frustum of the current view V_C that point in the direction of V_C. If lines exist, but none can be invalidated, updateSet(V_N) is called for the NBV V_N instead. If the number of remaining lines falls below a threshold, the scene instance currently used for predicting poses is discarded and OBJECT SEARCH is left. New object pose predictions are calculated in POSE PREDICTION with the next-best scene instance in the buffer. We exhaustively process all scene instances in the buffer for pose prediction until an object is found or ASR is aborted. Greedy scene instance selection, which aims to complete scenes as fast as possible, only sets the order in which instances are processed.
C. Next-Best-View estimation
We divide the 6 DoF camera pose space into a space of robot positions (x, y) and of camera orientations q. This 4 DoF search space is iteratively discretized with increasing resolution e by repeatedly executing Algo. 2 within Algo. 1.
1: {q}_S ← getSpiralApprox(t_min, t_max) and V_A ← V_C and e ← e_0
2: repeat
3:   Get robot configuration C_A from currently best view V_A
4:   ({o}_I, V_I) ← iterationStep(e, C_A, {q}_S, {(T_P, {n})}, C_C)
5:   d ← ‖(x_I, y_I) - (x_A, y_A)‖ and V_A ← V_I and e ← 2·e
6: until d < threshold
7: return ({o}_N, V_N) ← ({o}_I, V_I)
Algorithm 1: ({o}_N, V_N) ← calculateNBV({(T_P, {n})}, V_C, C_C)
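Algo. 1 can be transcribed to Python for illustration as follows; iteration_step stands for Algo. 2, and the helper methods on views are illustrative assumptions.

import numpy as np

def calculate_nbv(poses_with_lines, view_current, config_current,
                  iteration_step, get_spiral_approx, e_0, threshold):
    """Coarse-to-fine Next-Best-View estimation corresponding to Algo. 1."""
    orientations = get_spiral_approx()               # {q}_S: sampled camera orientations
    view_best, e = view_current, e_0                 # start from the current view V_C
    while True:
        config_best = view_best.robot_configuration()            # C_A from V_A
        objects, view_it = iteration_step(e, config_best, orientations,
                                          poses_with_lines, config_current)
        d = np.linalg.norm(np.asarray(view_it.position_2d())
                           - np.asarray(view_best.position_2d()))
        view_best, e = view_it, 2.0 * e              # recenter and double the resolution
        if d < threshold:                            # consecutive positions coincide
            return objects, view_best                # ({o}_N, V_N)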
In the first execution of iterationStep(), we tessellate the entire area of robot position space that is of interest to object
Fig. 5: Iteration steps, each in different colors, for first NBV in C2. Robot as red circle. Its environment as 2D map with light colored free space & borders in black. Robot position samples as cylinders on a blue, square area of the floor and camera orientations as squares on a sphere make up the search space in each step. Both in same color as arrow of resulting view (its position & direction). Blue pillars connect best-rated position with orientation sphere and result arrow. Sphere height displaced while iterating. Deviations among arrows from steps 1 & 2 in 1, 2 & 3 in 2, and 3 & 4 in 3 decrease from left to right.
search at a coarse resolution e_0. We do so with a hex grid that is aligned to the current position of the robot C_C, from which getHexGrid() returns the corners as a position set {(x, y)}_H. From position set {(x, y)}_H, each position whose distance to any pose prediction T_P of a searched object is larger than the fcp, or that lies too close to an obstacle on the 2D environment map, is discarded at line 2 of Algo. 2. A set of camera orientations {q}_S, obtained from evenly sampling a unit sphere in getSpiralApprox() at line 1 in Algo. 1, is passed to iterationStep(). These orientations are combined with each robot position, sampled in Algo. 2, at line 5 of iterationStep() to generate view candidates V with the frustum F. In 2 in Fig. 4, a camera orientation set, limited to the workspace of the PTU, is visualized on a sphere for a given robot position. Each execution of iterationStep() estimates the best-rated view V_I at a resolution e according to a reward r({o}, V) at line 10. It is passed as robot configuration C_A to the next call of iterationStep() in Algo. 1.
1: {(x, y)}_H ← getHexGrid(e, x_A, y_A) with (x_A, y_A) from C_A
2: pruneIsolatedAndOccupiedPositions({(x, y)}_H)
3: for all (x, y) ∈ {(x, y)}_H do
4:   for all q ∈ {q}_S do
5:     V ← getView(F, x, y, q)
6:     {(T_F, {n})} ← frustumCulling({(T_P, {n})}, V)
7:     for all {o} ∈ 2^{{o}_P} do
8:       Extract (T_F, {n}) ∈ {(T_F, {n})} belonging to o ∈ {o}
9:       r({o}, V) ← u(V, {(T_F, {n})}) · i({o}, V, C_C)
10: return ({o}_I, V_I) ← argmax over ({o}, V) from {(x, y)}_H × {q}_S × 2^{{o}_P} of r({o}, V)
Algorithm 2: ({o}_I, V_I) ← iterationStep(e, C_A, {q}_S, {(T_P, {n})}, C_C)
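The joint optimization over views and object subsets at lines 3 to 10 of Algo. 2 can be sketched as follows; utility and inverse_cost stand for u(·) and i(·) defined below, and all other names are illustrative.

from itertools import combinations

def best_view_and_objects(views, target_objects, predictions_in_frustum,
                          utility, inverse_cost, config_current):
    """Joint argmax of r({o}, V) = u(V, .) * i({o}, V, C_C) over candidate views
    and all non-empty subsets of the target objects (power set of {o}_P)."""
    best, best_reward = None, float("-inf")
    for view in views:
        in_frustum = predictions_in_frustum(view)            # frustumCulling result
        for size in range(1, len(target_objects) + 1):
            for subset in combinations(target_objects, size):
                preds = [p for p in in_frustum if p.object in subset]
                if not preds:
                    continue
                reward = utility(view, preds) * inverse_cost(subset, view, config_current)
                if reward > best_reward:
                    best, best_reward = (subset, view), reward
    return best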
In the next iterationStep() execution, a new hex grid whose origin lies at configuration C_A is created with doubled resolution e and a halved area to be tessellated. C_A is the best-rated configuration from the preceding iteration. Tessellation results, in particular robot position sets, for pairs of consecutive iteration steps are shown in Fig. 5. As we only tessellate with increasing resolution in the proximity of the best-rated view of each iteration step, the search converges to a locally optimal view. Iterating is aborted after line 5 in Algo. 1 once the positions (x_I, y_I), (x_A, y_A) of the views V_I, V_A from two consecutive iterations are similar enough. An NBV is returned. By decreasing the tessellated area proportionally to the increase in resolution in each step, the size of the robot position sets stays constant throughout iterating. Uniformly tessellating the whole area of interest for object search with the resolution used in the last iteration would return far more positions. Position sets resulting from an exemplary run of Algo. 1 are shown in 1 in Fig. 4 with a different color per iteration. A sequence of four views V_I, generated by Algo. 2 during a run of Algo. 1, is shown in Fig. 5. The last pair of nearly equal views before aborting is omitted. At line 7 in Algo. 2, we extend our search space to the power set of the target objects {o}_P, as we do not only look for a best view V, but also for the best objects {o} ⊂ {o}_P to search in this view.
u(V, {(T_F, {n})}) = [ Σ_{(T_F, {n}) ∈ {(T_F, {n})}} u_a(V, T_F) · u_d(V, T_F) · u_n(V, {n}) ] / |{(T_F, {n})}|

u_a(V, T_F) = f( ∠(x(V), p→p_F), 0.5 · min(fov_x, fov_y) )

u_d(V, T_F) = f( ⟨x(V), p→p_F⟩ - (fcp + ncp)/2, |fcp - ncp| / 2 )

u_n(V, {n}) = f( min_{n ∈ {n}} ∠(x(V), n), threshold )
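The three measures, explained in the following paragraph, can be evaluated for a single view as in this sketch. It uses a linear rating function f, which we only constrain (footnote 6) to be monotonically decreasing with f(0, max) = 1 and f(max, max) = 0; the linear choice, the absolute mid-depth deviation in u_d and the unit-length view directions and lines of sight are illustrative assumptions.

import numpy as np

def f(x, maximum):
    """Rating function: monotonically decreasing, f(0, max) = 1, f(max, max) = 0."""
    return max(0.0, 1.0 - x / maximum)

def angle(a, b):
    return np.arccos(np.clip(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)),
                             -1.0, 1.0))

def utility(view_dir, cam_pos, preds, fov_x, fov_y, ncp, fcp, angle_threshold):
    """u(V, {(T_F, {n})}): mean of u_a * u_d * u_n over the predicted poses in the
    frustum. preds is a list of (p_F, lines) with the predicted position p_F and
    the lines of sight {n} of one prediction; view_dir is the unit vector x(V)."""
    total = 0.0
    for p_F, lines in preds:
        ray = np.asarray(p_F) - np.asarray(cam_pos)          # ray from p to p_F
        u_a = f(angle(view_dir, ray), 0.5 * min(fov_x, fov_y))
        u_d = f(abs(np.dot(view_dir, ray) - 0.5 * (fcp + ncp)),   # deviation from
                0.5 * abs(fcp - ncp))                             # the mid-depth
        u_n = f(min(angle(view_dir, n) for n in lines), angle_threshold)
        total += u_a * u_d * u_n
    return total / len(preds)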
A reward r({o}, V), consisting of the product of a utility u(V, {(T_F, {n})}) and the inverse i({o}, V, C_C) of a cost function, is used to rate combinations of object sets and views ({o}, V) in order to find the best one. The utility is only calculated on the portion of predicted poses {(T_P, {n})} that lies in the frustum of view V, line 6 in Algo. 2, and that belongs to an object we selected for searching. A set of pose predictions is shown in 3 in Fig. 4 together with their lines of sight. The utility function u(V, {(T_F, {n})}) rates each pose prediction (T_F, {n}) in a view V regarding the confidence that it is detectable by the object localizers, given its 6 DoF pose T_F. Confidence of detection is defined as optimal for a pose when the measures u_a(V, T_F), u_d(V, T_F), u_n(V, {n}) are maximized at once: u_a(V, T_F), which evaluates the angle between the camera view direction x(V) and the ray from the camera position p to the position p_F of the predicted pose T_F, favors that the predicted pose lies at the center of the camera field of view. u_d(V, T_F), which evaluates the projection of that ray onto the view direction,
Fig. 6: 1, 2, 3: 1st, 2nd & last iteration step of ASR for S1 in C1, each with its current view & NBV. 4: Perfect scene recognition result for S1 after object search finished. 1: Purple box in middle assigned to S1, resulting pose predictions at both sides of box, with those for two objects within NBV frustum on left. 2: Pose prediction causes frustum next to 2nd detection. 3: NBV for last missing object on the right. 4: Last object found among its pose predictions.
favors that the predicted pose lies halfway between ncp and fcp along the view direction. u_n(V, {n}), which evaluates the angle between the view direction x(V) and the line of sight of the predicted pose that is most similar to x(V), ensures that the predicted object is observed from a perspective favorable for its detection⁶. Inverse costs from the current configuration C_C to a view V in Algo. 3 are a weighted sum of normalized, inverted travel costs for the PTU i_p(r, t), the robot orientation i_n(q) and the robot position i_d(x, y) (distance normalized by a Gaussian), as well as i_o({o}), which relates the runtimes t(o) of running detection for all objects in combination ({o}, V) to the sum of detection runtimes for all target objects {o}_P.
1: Get C = (x, y, q, r, t) for V
2: i_p(r, t) ← w_r · (1 - |r - r_C| / |r_max - r_min|) + w_t · (1 - |t - t_C| / |t_max - t_min|)
3: i_n(q) ← w_q · (1 - |q - q_C| / 2π)
4: i_d(x, y) ← w_d · exp(-((x - x_C)² + (y - y_C)²) / (2σ²))
5: i_o({o}) ← w_o · (1 - Σ_{o ∈ {o}} t(o) / Σ_{o ∈ {o}_P} t(o))
6: return i_p(r, t) + i_n(q) + i_d(x, y) + i_o({o})
Algorithm 3: i({o}, V, C_C) - inverse cost function
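For illustration, a direct transcription of Algo. 3; the parameter layout and names are illustrative assumptions, while the formulas follow the listing above.

import math

def inverse_cost(objects, view_config, current_config, runtimes, all_targets,
                 pan_tilt_limits, weights, sigma):
    """i({o}, V, C_C): weighted sum of normalized, inverted travel and detection
    costs. Configurations are tuples (x, y, q, r, t) as defined in Sec. V."""
    x, y, q, r, t = view_config
    x_c, y_c, q_c, r_c, t_c = current_config
    r_min, r_max, t_min, t_max = pan_tilt_limits
    w_r, w_t, w_q, w_d, w_o = weights
    i_p = (w_r * (1 - abs(r - r_c) / abs(r_max - r_min))          # pan travel
           + w_t * (1 - abs(t - t_c) / abs(t_max - t_min)))       # tilt travel
    i_n = w_q * (1 - abs(q - q_c) / (2 * math.pi))                # robot rotation
    i_d = w_d * math.exp(-((x - x_c) ** 2 + (y - y_c) ** 2)       # robot translation,
                         / (2 * sigma ** 2))                      # Gaussian-normalized
    i_o = w_o * (1 - sum(runtimes[o] for o in objects)            # share of detection
                 / sum(runtimes[o] for o in all_targets))         # runtimes
    return i_p + i_n + i_d + i_o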
VII. EXPERIMENTS AND RESULTS
A. Experimental setups
In this section, we present scene recognition results from
our Active Scene Recognition (ASR) approach for 5 exemplary object constellations. Despite being a general approach for finding scenes as defined in Sec. III, we designed constellations that enable us to compactly show how ASR works in practice and what its conceptual limitations are. The experiments are done with 3 scene models S1 to S3, acquired from demonstrations⁷ during which objects are horizontally slid in squares of 10 cm around their initial positions⁸.
(a) S1 - 4 objects, differently oriented on a table (2 in Fig. 3),
(b) S2 - 3 objects, one on 2 levels of a shelf (4 in Fig. 3),
6Each criterion is rated by a function f(x, max) that is monotonically decreasing and for which f(0, max) = 1 and f(max, max) = 0 hold.
7Sensitivity is set to 0.5 m resp. 1 m for C1 resp. C2. Maximal accepted orientation deviations between demonstrated & detected poses are set to 45° resp. 60°. Experiments were done with a “Xeon E5-1650” & 32 GB RAM.
8Apart from a box in S2, presented on two different levels of a shelf.
(c) S3 - 3 pairs of objects on distant tables (5 in Fig. 3).
S2 and S3 have two objects in common and are searched
by ASR at once, which we call scenario C2. Scenario C1
consists of S1⁹. The major limitations for our system are
external components for robot navigation [9] and object detection [8]. In order to restrict experiments to the limitations
of ASR itself, we chose object locations surrounded by views
that are accessible to [9] and trained object detectors from
these perspectives to enable robust object detection.
B. Recognition results for scenes at changing places
In this section, we present scene recognition results on 2 different constellations of scene S1 in scenario C1. The 1st constellation consists of objects at poses similar to those observed during the demonstration of S1. To find all objects
in that constellation, we execute ASR on a real robot using
its real sensor readings. The presented execution of ASR
shall illustrate how the interplay between utility and costs
influences the choice of views during the ASR process. ASR
starts with the robot standing in front of a purple box as
depicted in 1 in Fig. 6. The object is detected, assigned to S1, and poses for the three remaining objects are predicted according to S1. NBV estimation chooses the blue-colored view¹⁰ on the left, as it contains two objects at once, their predictions being shown in blue, too. This illustrates how the utility of searching two objects at once exceeds the increase in travel costs compared to a view closer to the current robot position but containing predictions for only one object. Due to limitations of navigation in positioning the robot, ASR takes 5 consecutive views, the first shown in 2 in Fig. 6 and the last in 3 in Fig. 6, to find both searched objects. During this, NBV estimation chooses to stay close to both objects due to low travel costs. The poses of the object detection results within the view in 3 in Fig. 6 illustrate that the three measures in our
utility function prefer views in which pose predictions, and
eventually detection results, lie in the middle of the viewing
9S1: Big red box & purple box as well as small red box & yellow box
are each connected by an ISM. A third ISM relates both object pairs to a
tree of ISMs. S2: 2 objects on a table are paired by an ISM. Its reference is
related to the box in the shelf by a 2nd ISM. S3: Relations among 3 object
pairs are each modeled by ISMs, while the pairs are connected by 2 ISMs.
10View pairs of the stereo rig are designated as views for simplicity.
Fig. 7: 1, 2, 3, 4: 1st, 2nd, 3rd & 4th iteration step of ASR for S2 & S3 in C2. 1: Robot pose localization with 2D laser scans in white. Robot position sampling distorted on the left due to pruning on empirical data. 3: Perfect scene recognition result for S3. Pose predictions for the tea box on the right, separated according to their shelf levels. 4: False detection of yellow cup discarded as not in frustum. 5: Perfect recognition results for S2 & S3 after object search.
Fig. 8: Recognition results for two consecutive executions of ASR on constellations for S1, with all views reached by ASR. Robot as red arrow.
frustums and are turned towards the robot. In 4 in Fig. 6, the last object is found and S1 is entirely recognized.
The 2nd constellation for S1, visible at the bottom of Fig.
8, results from displacing all objects to a different table
while maintaining the relative poses between the objects. To
simplify comparison of both constellations, the 1st is depicted
at the top of Fig. 8 as well. This second experiment relies on
the same models as the first, but is performed with simulated
navigation and object recognition to simplify evaluation and
bypass navigation issues due to unreliable robot hardware.
In simulation, the 1st constellation is recognized just as on the real robot, as visible at the top of Fig. 8 from the recognition result and from the views reached during ASR, shown as yellow arrows. When the simulated robot is put in front of the purple box in the 2nd constellation, the scene is completely recognized as well. Even though the absolute object poses and the geometry of the edges of the underlying table are different, ASR first heads to the object pair on one side before searching the last missing object on the other. This shows that our approach is capable of recognizing scenes independent of their placement in the world.
C. Recognition results for scenes with long relations
In scenario C2, made up of scenes with relations over
long distances, we first present scene recognition results for
S2 and S3 on a constellation that roughly corresponds to
demonstrated data. Object search in the 3rd constellation that we analyze shall illustrate additional properties of the ASR process and is executed on the real robot using its sensor data. Initially, the robot stands in front of a cup and a plate. Once both are assigned to S3, poses for all missing objects in S3 are predicted on two different tables, as can be seen in 1 in Fig. 7. ASR chooses to look for the object pair at the bottom. Detecting this object pair in 2 in Fig. 7 causes equally rated recognition results for S2 and S3, since it belongs to both. S3 is chosen for pose prediction at random, leading to a view on the models on the top right of 1 in Fig. 7. Once the objects are found, search goes over to S2 and its tea box. Predicted poses for the tea box are divided into clusters at two heights, which can be seen in 3 in Fig. 7. NBV estimation opts twice for the lower cluster before enough lines of sight are invalidated, see 4 in Fig. 7. Switching to the top, the box is found and both scenes are entirely recognized, as shown in 5 in Fig. 7.
Experiments on the next two constellations are conducted
in simulation just like those for the 2nd constellation in
Sec. VII-B. They show how ASR copes with objects deviating from the trained relative poses. We transform our 3rd constellation into a 4th, shown in 1 in Fig. 9, by translating several objects¹¹. A 5th constellation, visible in 2 in Fig. 9, arises from the 3rd by rotating objects instead¹². To simplify comparison of both new constellations to the original, we visualize the original in transparent blue in 1, 2 in Fig. 9. ASR for the 4th constellation starts with the robot in front of cup and plate, at the bottom left of 1 in Fig. 9, and finds all objects apart from the tea box in S2 at the top right. For S3, it does not only find the object pair on the top left, whose relative poses towards cup and plate did not change, but also the pair on the bottom right, which, according to cup and plate, lies 30 cm away from its trained poses. These objects are found as they are still inside the view for their pose predictions. NBV estimation does not find a collision-free and sufficiently distant view in which the tea box is centered. The box is at the right corner of the view chosen instead (visible in 1 in Fig. 9)¹³.
11On the left of 1 in Fig. 9, all objects are shifted by 30 cm to the left and on the right, only the tea box at the top is moved to the right by 30 cm. Compared to 1 in Fig. 7, the image is rotated by 90° counterclockwise.
12On the bottom right of that figure, 2 boxes are rotated clockwise by 10°.
13We configured simulated object detection to require an object to be in
both frustums for successful detection in order to highlight this fact.
Fig. 9: 1: ASR searching last object in 4th constellation with shifted objects. 2: ASR looking for objects of S3 on left side. Objects belong to 5th constellation with rotated objects. 3, 4: All views, shown as clusters at different positions, visited during uninformed search in C1 & C2, according to Sec. VII-D. Views, shown as big yellow arrows, connected to blue pillars, which show associated robot positions. Positions connected by blue line segments.
                  US - C1   ASR - C1   US - C2   ASR - C2
Views                69          7        117          6
Duration [min]     17.00       4.55      42.15       3.52
Table I: Number of views & object search duration with US & ASR.
          SCE REC   POS PRE   NBV CAL   MO TO VI   OBJ DET
Number        4         3         6          6         7
Avg [s]     0.05      0.03      1.15      37.38      3.40
Table II: Number of executions & average runtimes for steps of ASR in C1.
          SCE REC   POS PRE   NBV CAL   MO TO VI   OBJ DET
Number        4         3         5          5         6
Avg [s]     0.27      0.09      1.45      33.89      2.77
Table III: Number of executions & average runtimes for steps of ASR in C2.
In 2 in Fig. 9, ASR for the 5th constellation is started with the robot on the bottom right in front of both rotated boxes. While ASR manages to assign these objects and the tea box to S2, it fails to detect the remaining objects of S3 on the left side. For those objects, pose predictions differ so much from the actual object poses that the NBVs, like the one shown in 2 in Fig. 9 on the top left, do not contain the objects at all. The minimal distance between actual poses and predictions is 0.37 m at the bottom left and 0.58 m at the top left¹⁴. In general, ASR compensates position errors proportionally to the frustum size, but is susceptible to orientation errors, depending on the length of the encountered relations.
D. Evaluation of Efficiency of ASR
Our last experiments, conducted on the real robot, focus
on runtime. As a baseline for our ASR approach, we realized an uninformed search (US) as in Sec. I¹⁵. US is performed on the 1st and the 3rd constellation, on which we executed ASR. We compare the effort of both approaches in recognizing the scenes in C1 and C2 in Table I. Data about ASR in Table I is taken from Sec. VII-B and VII-C. Results of US in C1 and C2 are visible in 3 in Fig. 9 and in 4 in Fig. 9. The
speedup when using ASR instead of US increases from 3.73
in C1 to 11.97 in C2 as US requires more robot positions
14With increasing length of the relations, orientation errors in detection
results cause increasing errors in the positions of predicted poses.
15To optimize performance, we reduced the robot positions, which lie on a grid, to a few promising points & oriented the robot towards the objects.
to recognize larger scenes and searches all objects at once.
ASR outperforms US not only in runtime, but also in scene
recognition results. Due to false positive detections and poor
perspectives, both visible in implausible pose estimates in 3,
4 in Fig. 9, US fails to recognize the scenes. In Table II and
III, we separate the runtime of ASR into average runtimes
of its components for the 1st and the 3rd constellation.
VIII. CONCLUSIONS
We presented an Active Scene Recognition approach that allows mobile robots to iteratively improve their estimates about present scenes by relational object search. As the search only relies on spatial relations, this approach recognizes scenes independent of the absolute poses of the detected objects.
IX. ACKNOWLEDGMENTS
This research was financially supported by the DFG
- Deutsche Forschungsgemeinschaft. We thank Florian
Aumann-Cleres and Jocelyn Borella for their support.
REFERENCES
[1] P. Meißner, R. Reckling, R. Jäkel, S. R. Schmidt-Rohr, and R. Dillmann, “Recognizing scenes with hierarchical implicit shape models based on spatial object relations for programming by demonstration,” in Int. Conf. on Advanced Robotics, 2013.
[2] P. Meißner, R. Reckling, V. Wittenbeck, S. R. Schmidt-Rohr, and R. Dillmann, “Active scene recognition for programming by demonstration using next-best-view estimates from hierarchical implicit shape models,” in Int. Conf. on Robotics and Automation, 2014.
[3] D. Lin, S. Fidler, and R. Urtasun, “Holistic scene understanding for
3d object detection with rgbd cameras,” in Int. Conf. on CV, 2013.
[4] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with
interleaved categorization and segmentation,” Int. Journal of CV, 2008.
[5] J. I. Vasquez-Gomez, L. E. Sucar, and R. Murrieta-Cid, “View planning for 3d object reconstruction with a mobile manipulator robot,” in Int. Conf. on Intelligent Robots and Systems, 2014.
[6] Y. Ye and J. K. Tsotsos, “Sensor planning for 3d object search,”
Computer Vision and Image Understanding, 1999.
[7] L. Kunze, K. K. Doreswamy, and N. Hawes, “Using qualitative spatial
relations for indirect object search,” in Int. Conf. on Robotics and
Automation, 2014.
[8] P. Azad, T. Asfour, and R. Dillmann, “Stereo-based 6d object localization for grasping with humanoid robot systems,” in Int. Conf. on Intelligent Robots and Systems, 2007.
[9] E. Marder-Eppstein, E. Berger, T. Foote, B. Gerkey, and K. Konolige, “The office marathon: Robust navigation in an indoor office environment,” in Int. Conf. on Robotics and Automation, 2010.
[10] J. Bohren and S. Cousins, “The smach high-level executive,” IEEE
Robotics & Automation Magazine, 2010.