Recognizing Scenes with Hierarchical Implicit Shape Models
based on Spatial Object Relations for Programming by Demonstration
Pascal Meißner, Reno Reckling, Rainer Jäkel, Sven R. Schmidt-Rohr and Rüdiger Dillmann
Abstract— We present an approach for recognizing scenes,
consisting of spatial relations between objects, in unstructured
indoor environments, which change over time. Object relations
are represented by full six Degree-of-Freedom (DoF) coordinate
transformations between objects. They are acquired from object
poses that are visually perceived while people demonstrate
actions that are typically performed in a given scene. We
recognize scenes using an Implicit Shape Model (ISM) that
is similar to the Generalized Hough Transform. We extend
it to take orientations between objects into account. This
includes a verification step that allows us to infer not only
the existence of scenes, but also the objects they are composed
of. ISMs are restricted to represent scenes as star topologies of
relations, which insufficiently approximate object relations in
complex dynamic settings. False positive detections may occur.
Our solution is a set of exchangeable heuristics that identify object
relations which have to be represented explicitly; these relations are
then modeled by separate ISMs. We use
hierarchical agglomerative clustering, employing the heuristics,
to construct a tree of ISMs. Learning and recognition of scenes
with a single ISM is naturally extended to multiple ISMs.
I. INTRODUCTION
Programming by Demonstration (PbD) is a learning
paradigm in robotics that consists of recording, with sensors,
demonstrations of everyday actions performed by humans, and of
generalizing these demonstrations to conceptual knowledge
using learning algorithms. This knowledge shall enable
autonomous robots to reproduce actions taking their goals
and effects into account. Everyday actions usually take place
in specific contexts that can be seen as their preconditions.
Scenes are a fundamental aspect of this notion of context, and
we present a method for recognizing them. As Quattoni et al.
[1] state, scenes in indoor environments are best described
by the objects they contain. In relation to PbD, objects can
be regarded as entities of the world on which actions are
applied. In such a scenario, not only the occurrence of objects but
also their spatial relations, i.e. six DoF relative object poses, are
necessary to discriminate scenes. An example is cutlery, which
usually lies beside plates before breakfast and on top of them
afterwards. As the environment is dynamic, neither object occurrences
nor spatial configurations can be assumed to be static.
Besides the fact that object relations carry semantic informa-
tion about a scene, typical object detection on local portions
of image data often performs poorly in realistic scenarios,
P. Meißner, R. Reckling, R. Jäkel, S. R. Schmidt-Rohr and R. Dillmann
are with the Institute of Anthropomatics, Karlsruhe Institute of Technology,
76131 Karlsruhe, Germany. pascal.meissner@kit.edu
because of the large number of objects that can be encountered
and visual ambiguities between objects. Recognizing
scenes provides a means to overcome these issues. We present
a system to do so in the setting described so far. Based on
the PbD principle, it learns object occurrences and relations
from demonstrated data instead of having them modeled by
experts for given scenes. Demonstrations for learning scenes
differ from those for learning manipulation actions: suitable
data are configurations of objects before and after actions,
observed over a longer period of time. Consequently, issues like
occlusions during manipulation do not affect our approach.
II. RELATED WORK
Scene understanding is a field in computer vision that
mainly deals with two-dimensional image data. In general,
methods in scene understanding can be divided into two
classes. One option is to infer scenes from detected objects
and relationships between them. The other is to deduce
scenes directly from features extracted in image data. Global
image descriptors like the gist descriptor [2] reduce images
to compact feature vectors by dividing images into regular
intervals and aggregating filter responses calculated on each
of the intervals. Other approaches [3] calculate local descrip-
tors only in interesting image regions designated by visual
saliency. These methods have in common that they perform
well on outdoor scenes and poorly on indoor scenes.
Common approaches in object-based scene understanding
use probabilistic graphical models like Markov Random
Fields (MRF) or Conditional Random Fields (CRF) [4]. For
example, both have been applied to image segmentation, in
which knowledge about expected co-occurrence of adjacent
image regions is used to correct erroneously labeled regions.
Such methods do not meet our requirements as they represent
correlations between identities of regions or objects instead
of modeling variations in spatial relations between them.
Only a few publications deal with using characteristics of
spatial relations for scene understanding. One approach is to
map relative object poses onto symbolic qualitative relations
based on manually defined formulas. Qualitative relations
have been shown to be a strong alternative [5] to MRFs or
CRFs for correcting false object detections when operating
in three-dimensional space. Simplification to symbolic relations
provides high generalization capabilities. However, it
is limited when it comes to expressing slight differences between
object configurations, as expected in our scenario.
Fig. 1. Center-left: The localized plate and cup are translated while the relation between them stays static. The Smacks box is the reference of the ISM learnt on the visualized trajectories. Trajectories converted to relative poses are depicted as arrows. Center-right: False positive detection, because the ISM ignores the relation between cup and plate. Right: Arrow-shaped votes and two-dimensional projection of the ISM accumulator. The false positive corresponds to the red peak in the elevation profile of bucket values.
Substantial research on representing spatial relationships
has been conducted in part-based object recognition. Ranganathan
et al. [6] apply Constellation Models to learning
spatial configurations of objects in office environments in
a supervised manner. They represent object constellations
as normal distributions over object positions, omitting their
orientations. Such distributions only deliver rough approximations
of complex object configurations as found in our scenario.
Feiten et al. [7] state that parametric representation of
position and orientation uncertainty is a non-trivial issue.
Complementary work [8] learns scene models in an unsupervised
manner. It clusters scenes into local recurring
object configurations using Bayesian non-parametric models.
Positions are represented by normal distributions, and no six
DoF poses are taken into account. Moreover, our scenario may
not contain recurring structures. A non-parametric method for
representing configurations of arbitrary complexity in part-
based object recognition is the Implicit Shape Model (ISM)
[9]. It has been applied to two-dimensional image data for
detecting pedestrians in street scenes [10]. In range data,
furniture approximated by bounding box primitives [11] has
been recognized using ISMs. These approaches are limited to
using relative positions of object parts, ignoring orientations.
III. OBJECTS, SCENES AND DATA ACQUISITION
We define objects o as entities of our world for which we
require that state estimations E(o) = (c, d, T) can be provided
from sensor data. This triple consists of a label c defining
the object class, an identifier d discriminating different
instantiations of the same class, as well as a transformation
matrix T ∈ R^{4×4} that contains the object position p ∈ R^3.
A scene S = ({o}, {R}) contains spatial relations {R} between
objects o_j, o_k ∈ {o} of this scene. They are represented as
sets of relative six DoF poses {T_jk}. We define recognition of
scenes as calculating to which degree a configuration of objects
{E(o, t)}, captured at a point in time t, corresponds to scene S
located at pose T_F, taking into consideration beliefs b_S(o),
which express how much recognition depends on having captured object o.
To acquire object estimations E(o), we employ three complementary
object localizers that interpret stereo images in
real time: a system [12] based on global matching of shapes
in eigenspace, a method [12] that calculates local image
descriptors and homographies, and a fiducial marker localizer.
Scene models are learnt from demonstrations during which
object estimations E(o, t) are recorded for a duration of l time
steps. For each object o of a scene S, we obtain a sequence
of estimations J(o) = (E(o, 1), ..., E(o, l)), called a trajectory,
in which E(o, t) is non-empty for every time step t in which
o has been observed. In trajectory J(o), class label c and
identifier d are equal for each time step t.
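To make the notation above concrete, the following is a minimal Python
sketch of an estimation E(o) = (c, d, T) and a trajectory J(o). The names
ObjectEstimation and Trajectory are hypothetical illustrations, not taken
from the paper's implementation; time steps without an observation are
stored as empty entries.

from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class ObjectEstimation:
    c: str           # class label
    d: str           # identifier discriminating instances of the same class
    T: np.ndarray    # 4x4 homogeneous transform; position p = T[:3, 3]

# J(o) = (E(o,1), ..., E(o,l)); None marks time steps where o was not observed
Trajectory = List[Optional[ObjectEstimation]]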
IV. IMPLICIT SHAPE MODEL
FOR SPATIAL OBJECT RELATIONS
A. Learning a Scene Model
We developed an Implicit Shape Model to represent spatial
restrictions {T_oF} of n objects o towards a common reference.
As scenes S specify relations between n objects {o},
we designate one of the o to be identical to the reference of
the ISM, calling it o_F and its pose T_F. As selection heuristic
H_F, we define that the less the non-empty poses T(t) of an object
o change during its trajectory J(o) from t = 1 ... l, the better
o suits as reference. For instance, in scenes with a stove, the
stove itself should be the stationary reference. Learning our
ISM for a scene S consists of adding entries, extracted
from estimations E(o, t), to a table. Each entry consists of
the scene label S, which is assigned to class label c and identifier
d of object o, as well as two relative poses: T_Fo(t) represents
the pose of object o with respect to reference object o_F, and
T_oF(t) stands for the pose of reference object o_F with respect
to o. The learning process is performed for every time step
t and trajectory J(o) assigned to scene S, as follows.
for t ← 1 ... l do
  o_F ← argmax_{o ∈ {o} ∧ E(o,t) ≠ ∅ in J(o)} H_F(o)
  for all {o | o ∈ {o} ∧ E(o,t) ≠ ∅ in J(o)} do
    T_Fo(t) ← T_F(t)^{-1} · T(t), with T(t) belonging to o
    T_oF(t) ← T(t)^{-1} · T_F(t)
Beliefs b_S(o) for each object o in scene S are added to
the ISM. They are either set according to a uniform distribution
over the object set {o} or deduced from the relative frequency
of the table entries of object o in scene S.
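The learning loop above can be rendered in Python as follows, building
on the data structures from Sec. III. The function name learn_ism is
hypothetical, and the heuristic H_F is assumed to map a trajectory to a
stationarity score (higher meaning more stationary), as motivated by the
stove example.

def learn_ism(scene, trajectories, H_F):
    # trajectories: dict mapping object name -> Trajectory (all of length l)
    table = []
    l = len(next(iter(trajectories.values())))
    for t in range(l):
        # objects with non-empty estimations E(o, t) at time step t
        present = {o: J[t] for o, J in trajectories.items() if J[t] is not None}
        if not present:
            continue
        o_F = max(present, key=lambda o: H_F(trajectories[o]))  # reference
        T_F = present[o_F].T
        for o, E in present.items():
            T_Fo = np.linalg.inv(T_F) @ E.T   # pose of o w.r.t. o_F
            T_oF = np.linalg.inv(E.T) @ T_F   # pose of o_F w.r.t. o
            table.append((scene, E.c, E.d, T_Fo, T_oF))
    return table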
B. Scene Recognition
Suppose a set of objects {i}, called input objects, is
detected at poses {T} and a set of ISMs for different scenes
{S} is given as a common table.
Fig. 2. Left: Image of the objects from which data is recorded for Fig. 1 and 2. Center-left: The relation between cup and plate is detected by the direction continuity heuristic with the plate as reference: parallel arrows between both objects. The reference and the Smacks box are subsumed into an ISM for the entire scene. Center-right: No false positive detection. The detection result is incomplete, as the plate is not matched. Right: The poor rating corresponds to the yellow peak in the elevation profile.
As ISMs are a variant of
the Generalized Hough Transform, scene recognition takes place
as a voting process: every input object i casts votes where it
expects poses of references of scenes S. Then, a verification
step searches for hypotheses about present scenes, given all
cast votes. The voting step, which casts votes v on reference
positions p_F using input object poses T and relative reference
positions p_oF, is realized as follows.
for all S ∈ {S} do
  for all i ∈ {i} do
    Extract all ISM table entries matching S and i
    for all matching table entries do
      Extract T_oF from entry and p_oF from T_oF
      Get p_F from T_F ← T · T_oF, with T given for i
      (X, Y, Z)^T ← ⌊(x, y, z)^T · s^{-1}⌋ = ⌊p_F · s^{-1}⌋
      v ← (T_F, T_Fo, c, d)
      B_S(X, Y, Z) ← B_S(X, Y, Z) ∪ v
To vote as above, we discretize R^3 into an accumulator B_S
in the form of a voxel grid for each scene S. Thus, we are
able to accumulate votes according to reference positions
p_F. B_S is defined on the position space R^3 instead of the six DoF
pose space to reduce the required time and memory. We call its
elements buckets B_S(X, Y, Z). Their granularity is set by the edge
length (bucket size) s. While p_F decides into which bucket
B_S(X, Y, Z) a vote v falls, v = (T_F, T_Fo, c, d) itself consists of
the reference pose T_F according to the voter, the voter pose T_Fo relative
to scene reference o_F, and c, d for voter identification.
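A possible Python rendering of the voting step, under the same
assumptions as the previous sketches; here the accumulator B_S is
realized as a dictionary from bucket indices to vote lists.

import math
from collections import defaultdict

def cast_votes(table, inputs, scene, s):
    # inputs: list of (T, c, d) for detected objects; s: bucket edge length
    B_S = defaultdict(list)
    for T, c, d in inputs:
        for S_e, c_e, d_e, T_Fo, T_oF in table:
            if S_e != scene or c_e != c or d_e != d:  # entries matching S, i
                continue
            T_F = T @ T_oF                            # expected reference pose
            p_F = T_F[:3, 3]
            X, Y, Z = (math.floor(x / s) for x in p_F)  # bucket index (X,Y,Z)
            B_S[(X, Y, Z)].append((T_F, T_Fo, c, d))    # vote v
    return B_S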
The verification process consists of an exhaustive search
on the accumulator to detect instances of scene S. For each
bucket B_S(X, Y, Z), we perform a fast greedy search on the six DoF
poses separately. It looks for the largest set of input objects
{i}_S whose votes have fallen into this bucket and that are
consistent with scene S. Greedy search is sound, as recognizing
the same scene with references at the same position but at
different orientations is unusual. An object set {o} is consistent
with S when we find an arbitrary object o ∈ {o} that enables
us to predict the absolute poses T'_p = T_F · T'_Fo of all other objects
o' ∈ {o} with sufficient accuracy, using the vote for reference
pose T_F of o. Accuracy is calculated by comparing the measured
pose T' from E(o') with the predicted pose T'_p. This approach
defines the similarity of a set of objects {o} to a scene S on
the objects o themselves. Verification allows us to determine
which objects contribute to a recognition result instead of just
telling how evident it is that a scene exists. We calculate a
rating for bucket B_S(X, Y, Z) by summing the beliefs b_S(i) of
all objects i in the greatest object set {i}_S. When this rating
exceeds a given threshold, we return a recognition result for
B_S(X, Y, Z) consisting of: its rating, the scene reference pose T_F,
all objects {i}_S that lead to this rating with their relative
poses {T_Fo}_S, their classes {c}_S and their identifiers {d}_S.
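The consistency test inside the greedy search can be sketched as below,
reusing the imports from the earlier sketches. The thresholds pos_eps and
rot_eps for positional and angular deviation are our assumptions; the text
only states that predicted and measured poses are compared with sufficient
accuracy.

def largest_consistent_set(bucket_votes, measured, pos_eps, rot_eps):
    # bucket_votes: votes (T_F, T_Fo, c, d) fallen into one bucket;
    # measured: dict (c, d) -> measured absolute pose T of that input object
    best = []
    for T_F, _, _, _ in bucket_votes:         # candidate reference pose
        accepted = set()
        for _, T_Fo, c, d in bucket_votes:
            T_p = T_F @ T_Fo                  # predicted pose T'_p = T_F * T'_Fo
            T_m = measured[(c, d)]
            dp = np.linalg.norm(T_p[:3, 3] - T_m[:3, 3])
            R = T_p[:3, :3].T @ T_m[:3, :3]   # relative rotation
            da = math.acos(max(-1.0, min(1.0, (np.trace(R) - 1.0) / 2.0)))
            if dp < pos_eps and da < rot_eps:
                accepted.add((c, d))
        if len(accepted) > len(best):
            best = list(accepted)             # greedy: keep the largest set
    return best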
V. IMPLICIT SHAPE MODEL TREE
FROM AGGLOMERATIVE CLUSTERING
A. Learning a Scene Model
Implicit Shape Models, as presented so far, are only able to
represent star topologies of spatial relations from a common
reference o_F to n objects {o}, ignoring all potential relationships
between the objects in {o}. In certain situations, like the one
illustrated in Fig. 1, this leads to incorrect scene detections
even though all relations modeled by an ISM are fulfilled.
To cope with this issue, we developed an approach that
analyzes spatial restrictions in sets of objects {o} in order to
separate {o} into clusters, each of which is dealt with in a
separate ISM m. It uses hierarchical agglomerative clustering
to construct a binary tree of ISMs {m} in which leaves
{o}_L represent objects for which estimations E(o) have been
acquired, and internal vertices {o} \ {o}_L stand for the references
{o_F} of the different ISMs in the tree. Each ISM m relates child
nodes to a parent, thereby modeling relations to a common
reference o_F. Recognition results of m ∈ {m} are propagated
as input to the ISM m' ∈ {m} at the next lower level of the
tree. In m', o_F is treated as an object whose relations to
other objects are modeled. This process ends at the root o_R,
whose ISM m_R returns the scene recognition results.
During clustering, heuristics {H} that analyze relations
between two objects are used as linkage criterion. To fix the
issue shown in Fig. 1, we implemented a heuristic that rates
the temporal continuity H_C(o_j, o_k) = 1 − u/l in the direction of
the vectors p_jk(t) connecting two objects o_j, o_k ∈ {o}. H_C relates
the number of direction discontinuities u to the trajectory length
l. Discontinuities u are detected by repeatedly comparing
angles ∠(p_jk(t), p_jk(t+x)) between the same vectors at
consecutive points in time to a given threshold. This allows
us to capture not only a distribution of directions, but also
their temporal development. We opted for restricting vector
directions instead of six DoF transformations in order to
obtain a minimal criterion sufficient to overcome the issue in Fig. 1.
Fig. 3. Left: Trajectories of localized markers and visualization of markerless localizer results. Other panels: Object trajectories used as learning data for the three ISMs employed in Sec. VI. In each cluster m, arrows visualize poses of objects with respect to the reference of the ISM that models an object relation.
Supposing trajectories J(o_j) and J(o_k), both of
length l, are given for o_j ∈ {o} and o_k ∈ {o}, heuristic H_C
calculates the direction continuity when a sufficient number ε · l
of points in time exist in which J(o_j) and J(o_k) both contain
non-empty poses T_j(t) and T_k(t). H_C works as follows.
n ← 1
for t ← 1 ... l do
  if E(o_j, t) ≠ ∅ in J(o_j) ∧ E(o_k, t) ≠ ∅ in J(o_k) then
    p_jk(n) ← T_j(t)^{-1} · p_k(t)
    n ← n + 1
if n > ε · l then
  x ← 1 and u ← 0
  for i ← 1 ... n − 1 do
    while i + x ≤ n ∧ ∠(p_jk(i), p_jk(i + x)) < d do
      x ← x + 1
    if i + x ≤ n then
      u ← u + 1
      i ← i + x, and x ← 1
  return 1 − u/l
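A Python sketch of H_C under the assumptions above; the window handling
follows the pseudocode, and returning None for pairs with too few common
observations is our assumption about how unratable pairs are signaled.

def angle(a, b):
    c = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return math.acos(max(-1.0, min(1.0, c)))

def direction_continuity(J_j, J_k, eps, d, l):
    p_jk = []
    for t in range(l):
        if J_j[t] is not None and J_k[t] is not None:
            p_k_h = np.append(J_k[t].T[:3, 3], 1.0)   # homogeneous position
            p_jk.append((np.linalg.inv(J_j[t].T) @ p_k_h)[:3])
    n = len(p_jk)
    if n <= eps * l:
        return None                 # too few common observations to rate
    u, i, x = 0, 0, 1
    while i + x < n:
        if angle(p_jk[i], p_jk[i + x]) < d:
            x += 1                  # direction still continuous, extend window
        else:
            u += 1                  # discontinuity detected
            i, x = i + x, 1
    return 1.0 - u / l              # H_C(o_j, o_k)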
Given a set of heuristics {H} with normalized ratings H(o_j, o_k)
and a set of object trajectories {J(o)} for
o ∈ {o}, agglomerative clustering is performed as follows.
(H_M, o_M, q_M) ← argmax_{(H, o, q) ∈ ({H}, {o}, {o})} H(o, q)
while H_M(o_M, q_M) > e do
  Learn ISM m with J(o_M), J(q_M); o_F taken among o_M, q_M
  {J(o)} ← {J(o)} \ (J(o_M) ∪ J(q_M))
  {J(o)} ← {J(o)} ∪ J(o_F)
  (H_M, o_M, q_M) ← argmax_{(H, o, q) ∈ ({H}, {o}, {o})} H(o, q)
  {m} ← {m} ∪ m
Learn root ISM m_R with {J(o)}
{m} ← {m} ∪ m_R
Pairings of trajectories J(o_j), J(o_k) are subsumed into clusters
m as long as the best rating of any heuristic H exceeds
threshold e. All remaining trajectories J(o) are unified into a
root ISM m_R to prevent creating unnecessary clusters.
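The clustering loop may be sketched as follows. learn_ism_pair and
learn_root_ism are assumed helpers wrapping the ISM learning step from
Sec. IV-A (returning the new ISM, its reference object and the reference
trajectory); heuristics are assumed to return normalized float ratings.

def build_ism_tree(trajectories, heuristics, e):
    # trajectories: dict object name -> Trajectory
    isms = []
    def best_pair():
        candidates = [(H(trajectories[o], trajectories[q]), o, q)
                      for H in heuristics
                      for o in trajectories for q in trajectories if o != q]
        return max(candidates, key=lambda c: c[0])
    while len(trajectories) > 1:
        rating, o_M, q_M = best_pair()
        if rating <= e:
            break                               # no pair rated above threshold
        m, o_F, J_F = learn_ism_pair(trajectories[o_M], trajectories[q_M])
        del trajectories[o_M], trajectories[q_M]
        trajectories[o_F] = J_F                 # reference re-enters the pool
        isms.append(m)
    isms.append(learn_root_ism(trajectories))   # remaining trajectories to root
    return isms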
B. Scene Recognition
The resulting hierarchy of ISMs for a scene S can either be
imagined as a binary tree or as a set of ISMs {m} related to
each other through scene reference objects o_F. Suppose a set
of input objects {i} at poses {T} is given in addition to the
ISM set {m}. For leaves {o}_L, beliefs b_S(o) are initialized to
one. For internal vertices {o} \ {o}_L, beliefs b_S(o) are set to
zero. Input objects {i} cast votes as leaves {o}_L in the ISMs m
they match. No votes are cast for internal vertices, resulting
in missing input for some ISMs m. After scene recognition
is done in each ISM m, those that contain leaves may have
returned scene references o_F. Their ratings are used to update
the beliefs b_S(o_F) and are equal to the number of objects whose
poses they restrict. ISMs may return multiple references o_F.
All acquired references {o_F} are added to the initial object
set {i} = {i} ∪ {o_F}, and the process described so far is
repeated. In case multiple references o_F with identical class
c and identifier d labels are present in the input set {i},
all of them simultaneously cast votes in the ISMs m they
match. As new reference objects o_F appear in the object set {i},
the input for some ISMs m grows, resulting in additional
references o_F being returned or in increasing beliefs b_S(o_F) for
already existing references. For an arbitrary reference object
o_F in the binary tree {m}, its belief b_S(o_F) corresponds
to the number of leaves {o}_L in its subtree. This iterative
procedure, propagating beliefs b_S(o) from the leaves {o}_L
to the root o_R, is repeated until the beliefs converge throughout
the tree. Overall, recognition with a single ISM is naturally
extended to recognition with multiple ISMs, as shown in Fig.
2. After the last iteration of the method described above,
multiple ISMs may provide recognition results. Results for
entire scenes S are to be expected in the ISM m_R, whose
reference object is at the same time the root o_R of the binary
tree {m}. While the root ISM m_R itself provides the poses of scene
references T_F, the objects {i}_S that lead to this result
have to be collected among the leaves {o}_L of the tree {m}.
The tree has to be traversed for every result produced by m_R,
going along interrelated ISM recognition results.
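The iterative evaluation of the ISM tree may be summarized in Python as
below. recognize_single_ism is assumed to wrap voting and verification
from Sec. IV-B and to return results carrying the recognized reference
object; the root m_R is assumed to be last in the ISM list, as produced
by the clustering sketch above.

def recognize_scene_tree(isms, inputs):
    objects = list(inputs)                 # {i}, grows by references {o_F}
    while True:
        new_refs = [r.reference
                    for m in isms
                    for r in recognize_single_ism(m, objects)
                    if r.reference not in objects]
        if not new_refs:                   # beliefs converged through the tree
            break
        objects.extend(new_refs)           # {i} = {i} ∪ {o_F}, then repeat
    return recognize_single_ism(isms[-1], objects)   # results of root m_R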
VI. EXPERIMENTS AND RESULTS
A. Experimental Setups
Three hierarchical scene models are learnt to evaluate
the capabilities of our hierarchical Implicit Shape Model.
The involved objects and their trajectories, estimated by the
object localizers described in Sec. III, are visible in Fig.
3: The configuration on the center-left represents a computer
workplace consisting of two screens, a keyboard and a
mouse, which are all detected using markers.
Fig. 4. Scene recognition results for computer workplace configurations. Object models, resized for better visibility, are located according to the markers fixed on them. ISM recognition results are shown as two lines meeting at a green sphere. Lines from reference objects o_F to the ISM references themselves are of length zero. Lines turn from red to green as the ratings of recognition results increase. Red arrows indicate ISM references being reused by other ISMs.
On the right, two configurations made up of everyday objects, taken
from a household scenario, are shown. They have two dishes in
common. Here, we employ markerless object localizers.
B. Influence of Object Pose on Scene Recognition
Experiments conducted in the computer workplace scenario
provide examples of how the proposed ISMs rate
whether object relations are consistent with a given scene or
not. As visible in Fig. 3, the direction continuity heuristic H_C
separates the present objects into two clusters. One cluster
contains both screens, the other keyboard and mouse. The
references of the ISMs learnt on these clusters are subsumed
in an additional ISM to model the entire scene.
All object configurations displayed in Fig. 4 are discussed
in terms of the scene recognition results they provide.
Starting at the top-left corner, the first two images present
recognition results after the right screen is shifted along
different spatial dimensions. As both screens stand in parallel
and next to each other during learning, the right screen is
moved away from its valid pose in relation to the left screen
in both pictures. A recognition result can be provided, but it
only considers three objects in both configurations, leading
to a mediocre recognition rating. It is not rated as poorly
as the result of the ISM relating both displays, because of
the good rating of keyboard and mouse staying at their initial
positions. The third image differs from the former two in
that the right screen is rotated instead of translated.
The recognition result, equal to those already
presented, illustrates that object orientation is taken into
account just like position. The last two images in this line show
the left screen being rotated and then translated instead of
the right. Now, the right screen is part of a recognized scene
that is based on three objects. These results confirm that the
scene model treats both displays equally.
In all further configurations, both screens stay at their
initial locations and keyboard and mouse are moved instead.
In the first two images in the second line, the mouse is
shifted with respect to the keyboard, keeping its orientation.
Recognition delivers well-rated results as long as the mouse
stays on the right of the keyboard, as this has been observed
during learning. Translating the mouse behind the keyboard
has the same effect as rotating it to the right of the keyboard.
The mouse keeps its orientation during learning and never
appears behind the keyboard. Recognition results are similar
to the configurations where the left screen is displaced. This
illustrates that both clusters, generated by heuristic H_C, have
equal effect on scene recognition. The next image shows how
the keyboard is translated instead of the mouse. The scene
is recognized considering three objects, leaving out the
keyboard. The last image shows a configuration where each
of the clusters fulfills the requirements of the learnt scene
model, but the relative pose of the cluster references differs
from before. Recognition returns a well-rated result.
C. Influence of Object Occurrence on Scene Recognition
In a household scenario, we evaluate the capabilities of
the proposed scene recognition in distinguishing two similar
scenes A and B. Each scene is divided into two disjoint
binary clusters, as visible in Fig. 5. The first line in this figure
displays how one scene is transformed into the other. From
its left to its middle, two objects of scene A are removed
from the depicted object configurations. On the left, scene A is
well recognized, while scene B is detected with a poor rating.
As only those objects that both scenes have in common remain
in the middle, the ratings of the recognition results for A and B
are equal in this configuration. By adding objects belonging
to scene B from the middle to the right, the recognition ratings
for scene B improve. The second line of Fig. 5 consists
of configurations with ambiguous ratings. The first two
configurations contain three and four objects of both scenes
at once, respectively. Both scenes are recognized at the same time,
with mediocre and good ratings, respectively. The objects that both
scenes have in common are missing in the middle. This object
configuration is complementary to that in the middle of the first
line, but their recognition ratings are equal. The cup that scenes A and B
have in common is missing in the fourth configuration. However,
the ratings of the first and fourth configurations in the second line
are equal. The last configuration, on the right, consists of all
objects belonging to scenes A and B, each of which is
well recognized. A superfluous cup is present as well.
D. Scene Recognition Runtime
The processing time of scene recognition is analyzed for both
hierarchical and non-hierarchical ISMs on a PC with a
“Core i5 750” and 4 GB of RAM.
Fig. 5. Recognition results for household object configurations. The visualization of scene recognition results is identical to Fig. 4.
Results of experiments
that are conducted in the household scenario and in which
different ISM parameter values are varied are shown in Fig.
6. Throughout the experiments, both systems take at most
40 ms to recognize scenes with six input objects. In Fig. 6,
recognition runtime is measured for different sizes of input
sets and accumulator buckets. The smallest runtimes are nearly
invisible. Recognition runs are performed separately on each
snapshot already used for learning, and the resulting runtimes
are averaged. Varying the bucket size has little effect, except
for very small values, where votes for the same pose in
reality spread across several buckets. This particularly affects
hierarchical ISMs, where votes are not only dispersed within one
ISM, but within the elements of a whole ISM set. The input set size
hardly affects recognition with non-hierarchical ISMs. This changes
when using clustering, since additional input produces votes across
entire subtrees of the ISM tree. In Fig. 6, runtime is also measured
for different input object set sizes and lengths of the trajectories
used for learning. As in the bucket size experiments, large
jumps in computation time result from cache effects. While
additional learning data has almost no impact on non-hierarchical
ISMs, the runtime of hierarchical ISMs grows linearly with trajectory length.
VII. CONCLUSIONS AND FUTURE WORK
An approach has been presented that models scenes as
sets of objects and takes their six DoF interrelationships into
account. Scene models are learnt from demonstrations and
allow interpreting object configurations in order to figure out
which scenes are present and which objects belong to them. Limitations
in modeling object relation topologies, imposed by ISMs, are
overcome by agglomerative clustering, which generates trees of
ISMs through spatial restriction analysis. Experiments show
that constraining object relations and distinguishing similar
scenes is achieved in realistic scenarios, while keeping the
system real-time capable. Future work includes integrating
heuristics that learn kinematic models of articulations, as well
as extending recognition to input sets in which class or
identifier labels are missing, with the aim of inferring such information.
REFERENCES
[1] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Com-
puter Vision and Pattern Recognition, 2009.
Fig. 6. Recognition runtimes for 1 to 6 input objects and hierarchical (C)
vs. non-hierarchical (NC) ISMs, depending on bucket size and trajectory length.
[2] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic
representation of the spatial envelope,” Int. Journal of Computer
Vision, 2001.
[3] A. Borji and L. Itti, “Scene classification with a sparse set of salient
regions,” in Int. Conf. on Robotics and Automation, 2011.
[4] S. Kumar and M. Hebert, “A hierarchical field framework for unified
context-based classification,” in Int. Conf. on Computer Vision, 2005.
[5] T. Southey and J. Little, “3D spatial relationships for improving object
detection,” in Int. Conf. on Robotics and Automation, 2013.
[6] A. Ranganathan and F. Dellaert, “Semantic modeling of places using
objects,” in Robotics: Science and Systems Conference, 2007.
[7] W. Feiten, P. Atwal, R. Eidenberger, and T. Grundmann, “6D pose
uncertainty in robotic perception,” in Advances in Robotics Research.
Springer, 2009.
[8] D. Joho, G. Tipaldi, N. Engelhard, C. Stachniss, and W. Burgard,
“Nonparametric Bayesian models for unsupervised scene analysis and
reconstruction,” in Robotics: Science and Systems Conference, 2012.
[9] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with
interleaved categorization and segmentation,” Int. Journal of Computer
Vision, 2008.
[10] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in
crowded scenes,” in Computer Vision and Pattern Recognition, 2005.
[11] S. Gächter, A. Harati, and R. Siegwart, “Structure verification toward
object classification using a range camera,” in Int. Conf. on Intelligent
Autonomous Systems, 2008.
[12] P. Azad, T. Asfour, and R. Dillmann, “Stereo-based 6D object local-
ization for grasping with humanoid robot systems,” in Int. Conf. on
Intelligent Robots and Systems, 2007.