Recognizing Scenes with Hierarchical Implicit Shape Models
based on Spatial Object Relations for Programming by Demonstration
Pascal Meißner, Reno Reckling, Rainer Jäkel, Sven R. Schmidt-Rohr and Rüdiger Dillmann
Abstract: We present an approach for recognizing scenes,
consisting of spatial relations between objects, in unstructured
indoor environments, which change over time. Object relations
are represented by full six Degree-of-Freedom (DoF) coordinate
transformations between objects. They are acquired from object
poses that are visually perceived while people demonstrate
actions that are typically performed in a given scene. We
recognize scenes using an Implicit Shape Model (ISM) that
is similar to the Generalized Hough Transform. We extend
it to take orientations between objects into account. This
includes a verification step that allows us to infer not only
the existence of scenes, but also the objects they are composed
of. ISMs are restricted to represent scenes as star topologies of
relations, which insufficiently approximate object relations in
complex dynamic settings. False positive detections may occur.
Our solution is a set of exchangeable heuristics for identifying object
relations that have to be represented explicitly in separate ISMs;
these relations are then modeled by the ISMs themselves. We use
hierarchical agglomerative clustering, employing the heuristics,
to construct a tree of ISMs. Learning and recognition of scenes
with a single ISM is naturally extended to multiple ISMs.
I. INTRODUCTION
Programming by Demonstration (PbD) is a learning
paradigm in robotics that consists of recording demon-
strations of everyday actions performed by humans with
sensors and generalizing them to conceptual knowledge
by using learning algorithms. This knowledge shall enable
autonomous robots to reproduce actions taking their goals
and effects into account. Everyday actions usually take place
in specific contexts that can be seen as their preconditions.
A central aspect of the notion of context is the scene, for
which we present a recognition method. As Quattoni et al.
[1] state, scenes in indoor environments are best described
by the objects they contain. In relation to PbD, objects can
be regarded as entities of the world on which actions are
applied. In such a scenario, not only the occurrence of objects but
also their spatial relations, i.e. six DoF relative object poses, are
necessary to discriminate scenes. An example is cutlery, which
usually lies beside plates before breakfast and on top of them
afterwards. As the environment is dynamic, neither object occurrences
nor spatial configurations can be assumed to be static.
Beyond the fact that object relations carry semantic informa-
tion about a scene, typical object detection on local portions
of image data often performs poorly in realistic scenarios,
P. Meißner, R. Reckling, R. Jäkel, S. R. Schmidt-Rohr and R. Dillmann
are with the Institute of Anthropomatics, Karlsruhe Institute of Technology,
76131 Karlsruhe, Germany. pascal.meissner@kit.edu
because of the large number of objects that can be encoun-
tered and visual ambiguities between objects. Recognizing
scenes provides a means to overcome these issues. We present
a system to do so in the setting described so far. Based on
the PbD principle, it learns object occurrences and relations
from demonstrated data instead of having them modeled by
experts for given scenes. Demonstrations for learning scenes
differ from those for learning manipulation actions: Config-
urations of objects before and after actions over a longer
period of time are suitable data. Consequently, issues like
occlusions during manipulation do not affect our approach.
II. RELATED WORK
Scene understanding is a field in computer vision that
mainly deals with two-dimensional image data. In general,
methods in scene understanding can be divided into two
classes. One option is to infer scenes from detected objects
and relationships between them. The other is to deduce
scenes directly from features extracted in image data. Global
image descriptors like the gist descriptor [2] reduce images
to compact feature vectors by dividing images into regular
intervals and aggregating filter responses calculated on each
of the intervals. Other approaches [3] calculate local descrip-
tors only in interesting image regions designated by visual
saliency. These methods have in common that they perform
well on outdoor scenes and poorly on indoor scenes.
Common approaches in object-based scene understanding
use probabilistic graphical models like Markov Random
Fields (MRF) or Conditional Random Fields (CRF) [4]. For
example both have been applied to image segmentation in
which knowledge about expected co-occurrence of adjacent
image regions is used to correct erroneously labeled regions.
Such methods do not meet our requirements as they represent
correlations between identities of regions or objects instead
of modeling variations in spatial relations between them.
Just few publications deal with using characteristics of
spatial relations for scene understanding. One approach is to
map relative object poses on symbolic qualitative relations
based on manually defined formulas. Qualitative relations
have been shown to be a strong alternative [5] to MRFs or
CRFs for correcting false object detections when operating
in three-dimensional space. The simplification to symbolic
relations provides high generalization capabilities. However,
it is limited when trying to express slight differences between
object configurations, as expected in our scenario.
978-1-4799-2722-7/13/$31.00 ©2013 IEEE
Fig. 1. Center-left: The localized plate and cup are translated while the relation between them stays fixed. The Smacks box is the reference of the ISM learnt on the visualized trajectories.
Trajectories converted to relative poses are depicted as arrows. Center-right: A false positive detection occurs because the ISM ignores the relation between cup and plate.
Right: Arrow-shaped votes and a two-dimensional projection of the ISM accumulator. The false positive corresponds to the red peak in the elevation profile of bucket values.
Substantial research on representing spatial relationships
has been conducted in part-based object recognition. Ran-
ganathan et al. [6] apply Constellation Models on learning
spatial configurations of objects in office environments in
a supervised manner. They represent object constellations
as normal distributions over object positions, omitting their
orientations. Such models only deliver rough approximations of
complex object configurations as found in our scenario.
Feiten et al. [7] state that parametric representation of
position and orientation uncertainty is a non-trivial issue.
Complementary work [8] learns scene models in an unsu-
pervised manner. It clusters scenes into locally recurring
object configurations using Bayesian non-parametric models.
Positions are represented by normal distributions, and no six
DoF poses are taken into account. Our scenario may not
contain recurring structures. A non-parametric method for
representing configurations of arbitrary complexity in part-
based object recognition is the Implicit Shape Model (ISM)
[9]. It has been applied to two-dimensional image data for
detecting pedestrians in street scenes [10]. In range data,
furniture approximated by bounding box primitives [11] has
been recognized using ISMs. These approaches are limited to
using relative positions of object parts, ignoring orientations.
III. OBJECTS, SCENES AND DATA ACQUISITION
We define objects o as entities of our world for which we
require that state estimations E(o) = (c, d, T) from sensor
data can be provided. This triple consists of a label c defining
the object class, an identifier d discriminating different instantiations of
the same class, as well as a transformation matrix T ∈ ℝ^(4×4)
that contains the object position p ∈ ℝ³. A scene S = ({o}, {R})
contains spatial relations {R} between objects o_j, o_k ∈ {o}
of this scene. They are represented as sets of relative six DoF
poses {T_jk}. We define recognition of scenes as calculating
to which degree a configuration of objects {E(o, t)}, captured
at a point in time t, corresponds to scene S located at pose
T_F, taking into consideration beliefs b_S(o), which express
how much recognition depends on having captured object o.
To acquire object estimations E(o), we employ three com-
plementary object localizers that interpret stereo images in
real time: a system [12] based on global matching of shapes
in eigenspace, a method [12] that calculates local image de-
scriptors and homographies, and a fiducial marker localizer.
Scene models are learnt from demonstrations during which
object estimations E(o, t) are recorded for a duration of l time
steps. For each object o of a scene S, we obtain a sequence
of estimations J(o) = (E(o, 1), ..., E(o, l)), called a trajectory,
in which E(o, t) is non-empty for every time step t in which
o has been observed. In trajectory J(o), class label c and
identifier d are equal for each time step t.
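As a concrete illustration of the relative six DoF poses T_jk defined above, the following Python sketch computes a relative transform from two homogeneous 4×4 poses. The plate and cup, their poses, and the helper names are illustrative assumptions of ours, not data from the experiments; the sketch also checks that the relation stays fixed when both objects are moved rigidly together, as in Fig. 1.

```python
import numpy as np

def make_pose(R, p):
    """Assemble a homogeneous 4x4 transform T from rotation R (3x3) and position p (3,)."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = p
    return T

def relative_pose(T_j, T_k):
    """Relative six DoF pose T_jk of object o_k expressed in the frame of object o_j."""
    return np.linalg.inv(T_j) @ T_k

# Hypothetical plate and cup poses; identity orientation for simplicity.
T_plate = make_pose(np.eye(3), np.array([0.5, 0.0, 0.8]))
T_cup   = make_pose(np.eye(3), np.array([0.7, 0.1, 0.8]))
T_jk = relative_pose(T_plate, T_cup)

# The relation is invariant when both objects undergo the same rigid motion.
shift = make_pose(np.eye(3), np.array([1.0, 2.0, 0.0]))
assert np.allclose(T_jk, relative_pose(shift @ T_plate, shift @ T_cup))
```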
IV. IMPLICIT SHAPE MODEL
FOR SPATIAL OBJECT RELATIONS
A. Learning a Scene Model
We developed an Implicit Shape Model to represent spatial
restrictions {T_oF} of n objects o towards a common refer-
ence. As scenes S specify relations between n objects {o},
we designate one of the o to be identical to the reference of
the ISM, calling it o_F and its pose T_F. We define as selection
heuristic H_F that the less the non-empty poses T(t) of an object
o change during its trajectory J(o) from t = 1...l, the better
o suits as reference. For instance, in scenes with a stove, the
stove itself should be the stationary reference. Learning our
ISM for a scene S consists of adding entries, extracted
from estimations E(o, t), to a table. Each entry consists of the
scene label S that is assigned to class label c and identifier
d of object o, as well as two relative poses: T_Fo(t) represents
the pose of object o with respect to reference object o_F, and
T_oF(t) stands for the pose of reference object o_F with respect
to o. The learning process is accomplished for any time step
t and trajectory J(o), assigned to scene S, as follows.

for t ← 1...l do
    o_F ← argmax over o ∈ {o} with E(o, t) ≠ ∅ in J(o) of H_F(o)
    for all {o | o ∈ {o} ∧ E(o, t) ≠ ∅ in J(o)} do
        T_Fo(t) ← T_F(t)⁻¹ · T(t), with T(t) belonging to o
        T_oF(t) ← T(t)⁻¹ · T_F(t)

Beliefs b_S(o) for each object o in scene S are added to
the ISM. They are set according to an equal distribution on
the object set {o}, or can be deduced from the relative frequency
of the table entries of object o depending on scene S.
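The learning loop above can be sketched in Python as follows. This is a minimal sketch under our own assumptions: trajectories are given as lists of 4×4 pose matrices with None for unobserved time steps, the reference o_F has already been selected by H_F, and the ISM table is simplified to a flat list of dictionaries.

```python
import numpy as np

def make_pose(p):
    """4x4 transform with identity orientation at position p (illustrative helper)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def learn_ism_table(scene_label, trajectories, reference_name):
    """Fill the ISM table with relative poses T_Fo and T_oF for every time step.

    trajectories: dict mapping object name -> list of 4x4 poses (None = not observed).
    reference_name: reference o_F; its selection via H_F is assumed done elsewhere.
    """
    table = []
    length = len(trajectories[reference_name])
    for t in range(length):
        T_F = trajectories[reference_name][t]
        if T_F is None:
            continue
        for name, traj in trajectories.items():
            T = traj[t]
            if T is None:
                continue
            table.append({
                "scene": scene_label, "object": name,
                "T_Fo": np.linalg.inv(T_F) @ T,   # object pose w.r.t. reference
                "T_oF": np.linalg.inv(T) @ T_F,   # reference pose w.r.t. object
            })
    return table

# Two time steps; the cup is unobserved at t = 1.
trajs = {
    "smacks": [make_pose([0.0, 0.0, 0.0]), make_pose([1.0, 0.0, 0.0])],
    "cup":    [make_pose([0.3, 0.0, 0.0]), None],
}
table = learn_ism_table("breakfast", trajs, "smacks")
```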
B. Scene Recognition
Fig. 2. Left: Image of the objects from which data is recorded for Fig. 1 and 2. Center-left: The relation between cup and plate is detected by the direction continuity
heuristic with the plate as reference: parallel arrows between both objects. Reference and Smacks box are subsumed into an ISM for the entire scene. Center-right: No
false positive detection. The detection result is incomplete, as the plate is not matched. Right: The poor rating corresponds to the yellow peak in the elevation profile.

Suppose a set of objects {i}, called input objects, is
detected at poses {T}, and a set of ISMs for different scenes
{S} is given as a common table. As ISMs are a variation of
the Generalized Hough Transform, scene recognition takes place
as a voting process: Every input object i casts votes where it
expects poses of references of scenes S. Then, a verification
step searches for hypotheses about present scenes, given all
cast votes. The voting step, which casts votes v on reference
positions p_F using input object poses T and relative reference
positions p_oF, is realized as follows.

for all S ∈ {S} do
    for all i ∈ {i} do
        Extract all ISM table entries matching S and i
        for all matching table entries do
            Extract T_oF from entry and p_oF from T_oF
            Get p_F from T_F ← T · T_oF, with T given for i
            (X, Y, Z)ᵀ ← ⌊(x, y, z)ᵀ · s⁻¹⌋ = ⌊p_F · s⁻¹⌋
            v ← (T_F, T_Fo, c, d)
            B_S(X, Y, Z) ← B_S(X, Y, Z) ∪ v

To vote as above, we discretize ℝ³ into an accumulator B_S
in the form of a voxel grid for each scene S. Thus we are
able to accumulate votes according to reference positions
p_F. B_S is defined on the position space ℝ³ instead of the six DoF
pose space to reduce the required time and memory. We call its
elements buckets B_S(X, Y, Z). Their granularity is set by the edge
length (bucket size) s. While p_F decides into which bucket
B_S(X, Y, Z) a vote v falls, v = (T_F, T_Fo, c, d) itself consists of
the reference pose T_F according to the voter, the voter pose T_Fo relative
to the scene reference o_F, and c, d for voter identification.
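The voting step can be sketched as follows, assuming the flat table layout from before and a dictionary keyed by bucket indices as the voxel accumulator B_S. Object names, poses, and the bucket size are illustrative choices of ours.

```python
import numpy as np
from collections import defaultdict

def pose(p):
    """4x4 transform with identity orientation at position p (illustrative helper)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def cast_votes(table, detections, bucket_size):
    """Cast votes on reference positions into a voxel-grid accumulator B_S.

    detections: dict mapping object name -> measured 4x4 pose T. Each matching
    table entry predicts a reference pose T_F = T . T_oF; the position p_F is
    discretized with edge length s (bucket_size) into a bucket key (X, Y, Z).
    """
    accumulator = defaultdict(list)
    for entry in table:
        if entry["object"] not in detections:
            continue
        T = detections[entry["object"]]
        T_F = T @ entry["T_oF"]
        bucket = tuple(np.floor(T_F[:3, 3] / bucket_size).astype(int))
        accumulator[bucket].append((T_F, entry))
    return accumulator

# Both objects agree on a common reference at the origin.
table = [
    {"object": "cup",   "T_oF": pose([-0.3, 0.0, 0.0]), "T_Fo": pose([0.3, 0.0, 0.0])},
    {"object": "plate", "T_oF": pose([-0.6, 0.0, 0.0]), "T_Fo": pose([0.6, 0.0, 0.0])},
]
detections = {"cup": pose([0.3, 0.0, 0.0]), "plate": pose([0.6, 0.0, 0.0])}
B = cast_votes(table, detections, bucket_size=0.1)
```

Since both detected objects are consistent with the learnt relation, their votes fall into the same bucket, producing a peak in the accumulator.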
The verification process consists of an exhaustive search
on the accumulator to detect instances of scene S. For each
bucket B_S(X, Y, Z), we perform a fast greedy search on six DoF
poses separately. It looks for the largest set of input objects
{i}_S whose votes have fallen into this bucket and that are
consistent with scene S. Greedy search is sound, as recognizing
the same scene with references at the same position but at
different orientations is unusual. An object set {o} is consistent
with S when we find an arbitrary object o ∈ {o} that enables
us to predict the absolute poses T'_p = T_F · T'_Fo of all other objects
o' ∈ {o} with sufficient accuracy, using the vote for reference
pose T_F of o. Accuracy is calculated by comparing the measured
pose T' from E(o') and the predicted pose T'_p. This approach
defines similarity of a set of objects {o} to a scene S on
the objects o themselves. Verification allows us to determine
which objects contribute to a recognition result instead of just
telling how evident it is that a scene exists. We calculate a
rating for bucket B_S(X, Y, Z) by summing the beliefs b_S(i) of
all objects i in the greatest object set {i}_S. When this rating
exceeds a given threshold, we return a recognition result for
B_S(X, Y, Z) consisting of: its rating, the scene reference pose T_F,
all objects {i}_S that lead to this rating with their relative
poses {T_Fo}_S, their classes {c}_S and their identifiers {d}_S.
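A simplified sketch of the greedy verification of one bucket follows. For each candidate vote it predicts T'_p = T_F · T'_Fo for every voter and compares it to the measured pose; for brevity, only positions are compared rather than full six DoF poses, and all names and values are illustrative assumptions of ours.

```python
import numpy as np

def pose(p):
    """4x4 transform with identity orientation at position p (illustrative helper)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def verify_bucket(votes, detections, tol=0.05):
    """Greedy verification of one accumulator bucket.

    votes: list of (T_F, entry) pairs that fell into the bucket.
    For each candidate reference pose T_F, predict the absolute pose of every
    voter and keep those whose measured position matches within tol.
    Returns the largest consistent set of object names.
    """
    best = set()
    for T_F, _ in votes:
        supported = set()
        for _, entry in votes:
            T_pred = T_F @ entry["T_Fo"]
            T_meas = detections[entry["object"]]
            if np.linalg.norm(T_pred[:3, 3] - T_meas[:3, 3]) < tol:
                supported.add(entry["object"])
        if len(supported) > len(best):
            best = supported
    return best

# The cup matches its learnt relation; the plate has been displaced by 0.3 m.
votes = [
    (pose([0.0, 0.0, 0.0]), {"object": "cup",   "T_Fo": pose([0.3, 0.0, 0.0])}),
    (pose([0.3, 0.0, 0.0]), {"object": "plate", "T_Fo": pose([0.6, 0.0, 0.0])}),
]
detections = {"cup": pose([0.3, 0.0, 0.0]), "plate": pose([0.9, 0.0, 0.0])}
consistent = verify_bucket(votes, detections)
```

Verification thus reports not just that a scene may be present, but exactly which objects support the result, here only the cup.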
V. IMPLICIT SHAPE MODEL TREE
FROM AGGLOMERATIVE CLUSTERING
A. Learning a Scene Model
Implicit Shape Models, as presented so far, are only able to
represent star topologies of spatial relations from a common
reference o_F to n objects {o}, ignoring all potential relation-
ships between objects in {o}. In certain situations, like the one
illustrated in Fig. 1, this leads to incorrect scene detections even
though all relations modeled by an ISM are fulfilled.
To cope with this issue, we developed an approach that
analyzes spatial restrictions in sets of objects {o} in order to
separate {o} into clusters, each of which is dealt with in a
separate ISM m. It uses hierarchical agglomerative clustering
to construct a binary tree of ISMs {m} in which leaves
{o}_L represent objects for which estimations E(o) have been
acquired, and internal vertices {o}\{o}_L stand for the references
{o_F} of the different ISMs in the tree. Each ISM m relates child
nodes to a parent, thereby modeling relations to a common
reference o_F. Recognition results of m ∈ {m} are propagated
as input to the ISM m' ∈ {m} at the next lower level in the
tree. In m', o_F is treated as an object whose relations to
other objects are modeled. This process ends at the root o_R,
whose ISM m_R returns the scene recognition results.

Fig. 3. Left: Trajectories of localized markers and visualization of markerless localizer results. Other images: Object trajectories used as learning data for the three ISMs
employed in Sec. VI. In each cluster m, arrows visualize the poses of objects with respect to the reference of the ISM that models an object relation.

During clustering, heuristics {H} that analyze relations
between two objects are used as linkage criterion. To fix the
issue shown in Fig. 1, we implemented a heuristic that rates
temporal continuity H_C(o_j, o_k) = 1 - u/l in the direction of
the vectors p_jk(t) connecting two objects o_j, o_k ∈ {o}. H_C relates
the number of direction discontinuities u to the trajectory length
l. Discontinuities u are detected by repeatedly comparing
angles ∠(p_jk(t), p_jk(t+x)) between the same vectors at
consecutive points in time to a given threshold. This allows
us to capture not only a distribution of directions, but also
their temporal development. We opted for restricting vector
directions instead of six DoF transformations in order to
obtain a minimal criterion sufficient to overcome the issue
in Fig. 1. Supposing trajectories J(o_j) and J(o_k), both of
length l, are given for o_j ∈ {o} and o_k ∈ {o}, heuristic H_C
calculates direction continuity when a sufficient number ε·l
of points in time exist in which J(o_j) and J(o_k) both contain
non-empty poses T_j(t) and T_k(t). H_C works as follows.
n ← 1
for t ← 1...l do
    if E(o_j, t) ≠ ∅ in J(o_j) ∧ E(o_k, t) ≠ ∅ in J(o_k) then
        p_jk(n) ← T_j(t)⁻¹ · p_k(t)
        n ← n + 1
if n > ε·l then
    x ← 1 and u ← 0
    for i ← 1...n-1 do
        while i + x ≤ n ∧ ∠(p_jk(i), p_jk(i+x)) < d do
            x ← x + 1
        if i + x ≤ n then
            u ← u + 1
            i ← i + x, and x ← 1
    return 1 - u/l
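The heuristic H_C can be sketched in Python as follows. This is an approximation rather than a literal transcription: the skip-ahead loop is simplified to comparing consecutive connecting vectors, and all trajectories and names are illustrative assumptions of ours.

```python
import numpy as np

def pose(p):
    """4x4 transform with identity orientation at position p (illustrative helper)."""
    T = np.eye(4)
    T[:3, 3] = p
    return T

def direction_continuity(traj_j, traj_k, angle_threshold, eps=0.5):
    """Simplified direction continuity heuristic H_C = 1 - u/l.

    traj_j, traj_k: lists of 4x4 poses (None = not observed), both of length l.
    A discontinuity u is counted whenever the angle between consecutive
    connecting vectors p_jk, expressed in the frame of o_j, reaches
    angle_threshold (radians). Returns None when fewer than eps*l common
    observations exist.
    """
    l = len(traj_j)
    p_jk = []
    for T_j, T_k in zip(traj_j, traj_k):
        if T_j is None or T_k is None:
            continue
        # Position of o_k expressed in the frame of o_j.
        p_hom = np.linalg.inv(T_j) @ np.append(T_k[:3, 3], 1.0)
        p_jk.append(p_hom[:3])
    if len(p_jk) <= eps * l:
        return None
    u = 0
    for a, b in zip(p_jk[:-1], p_jk[1:]):
        cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
        if np.arccos(np.clip(cos, -1.0, 1.0)) >= angle_threshold:
            u += 1
    return 1.0 - u / l

# Keyboard-like trajectory, a mouse keeping a fixed relation, and one that flips side.
steady = [pose([t, 0.0, 0.0]) for t in range(4)]
right  = [pose([t + 0.2, 0.0, 0.0]) for t in range(4)]
flip   = [pose([0.2, 0.0, 0.0]), pose([1.2, 0.0, 0.0]),
          pose([1.8, 0.0, 0.0]), pose([2.8, 0.0, 0.0])]
```

For the steady pair the connecting vector never changes direction, giving the maximal rating, while a single side flip lowers it by 1/l.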
Given a set of heuristics {H} with their ratings H(o_j, o_k)
being normalized, and a set of object trajectories {J(o)} for
o ∈ {o}, agglomerative clustering is performed as follows.

(H_M, o_M, q_M) ← argmax over (H, o, q) ∈ ({H}, {o}, {o}) of H(o, q)
while H_M(o_M, q_M) > e do
    Learn ISM m with J(o_M), J(q_M); o_F taken among o_M, q_M
    {J(o)} ← {J(o)} \ (J(o_M) ∪ J(q_M))
    {J(o)} ← {J(o)} ∪ J(o_F)
    (H_M, o_M, q_M) ← argmax over (H, o, q) ∈ ({H}, {o}, {o}) of H(o, q)
    {m} ← {m} ∪ m
Learn root ISM m_R with {J(o)}
{m} ← {m} ∪ m_R

Pairings of trajectories J(o_j), J(o_k) are subsumed into clus-
ters m as long as the best rating of any heuristic H exceeds the
threshold e. All remaining trajectories J(o) are unified in a
root ISM m_R to prevent creating unnecessary clusters.
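A toy sketch of this clustering loop follows. ISM learning is abbreviated to recording which two clusters each new reference subsumes, the selection of o_F is omitted, and the heuristic is replaced by a stand-in that compares precomputed group labels; all names are our own illustrative assumptions.

```python
def build_ism_tree(trajectories, heuristics, threshold):
    """Agglomerative clustering of trajectories into a binary tree of ISMs.

    heuristics: list of rating functions h(traj_a, traj_b) -> [0, 1].
    Returns a list of (reference, child, child) triples plus a final root
    entry subsuming everything that remains.
    """
    trajs = dict(trajectories)
    isms = []
    counter = 0
    while len(trajs) > 1:
        # Best-rated pair under any heuristic (linkage criterion).
        rating, a, b = max((h(trajs[x], trajs[y]), x, y)
                           for h in heuristics
                           for x in trajs for y in trajs if x < y)
        if rating <= threshold:
            break
        ref = f"ref_{counter}"          # internal vertex o_F of the new ISM
        counter += 1
        isms.append((ref, a, b))
        trajs[ref] = trajs[a]           # reference inherits one child's trajectory
        del trajs[a], trajs[b]
    isms.append(("root", tuple(sorted(trajs))))  # remaining clusters -> root ISM
    return isms

# Stand-in heuristic: rate 1.0 when two trajectories carry the same group label.
same_cluster = lambda a, b: 1.0 if a == b else 0.0
trajs = {"screen_l": "g1", "screen_r": "g1", "keyboard": "g2", "mouse": "g2"}
tree = build_ism_tree(trajs, [same_cluster], threshold=0.5)
```

On the workplace-like input, the loop first pairs the two screens, then keyboard and mouse, and finally unifies both cluster references in the root ISM.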
B. Scene Recognition
The resulting hierarchy of ISMs for a scene S can either be
imagined as a binary tree or as a set of ISMs {m} related to
each other through scene reference objects o_F. Suppose a set
of input objects {i} at poses {T} is given in addition to the
ISM set {m}. For leaves {o}_L, beliefs b_S(o) are initialized to
one. For internal vertices {o}\{o}_L, beliefs b_S(o) are set to
zero. Input objects {i} cast votes as leaves {o}_L in the ISMs m
they match. No votes are cast for internal vertices, resulting
in missing input for some ISMs m. After scene recognition
is done in each ISM m, those that contain leaves may have
returned scene references o_F. Their ratings are used to update
the beliefs b_S(o_F) and are equal to the number of objects whose
poses they restrict. ISMs may return multiple references o_F.
All acquired references {o_F} are added to the initial object
set {i} = {i} ∪ {o_F}, and the process described so far is
repeated. In case multiple references o_F with identical class
c and identifier d labels are present in the input set {i},
all of them simultaneously cast votes in the ISMs m they
match. As new reference objects o_F appear in the object set {i},
the input for some ISMs m grows, resulting in additional
references o_F being returned or in increased beliefs b_S(o_F) for
already existing references. For an arbitrary reference object
o_F in the binary tree {m}, its belief b_S(o_F) corresponds
to the number of leaves {o}_L in its subtree. This iterative
procedure, propagating beliefs b_S(o) from the leaves {o}_L
to the root o_R, is repeated until beliefs converge throughout
the tree. Overall, recognition with a single ISM is naturally
extended to recognition with multiple ISMs, as shown in Fig.
2. After the last iteration of the method described above,
multiple ISMs may provide recognition results. Results for
entire scenes S are to be expected in the ISM m_R whose
reference object is the root o_R of the binary tree {m}.
While the root ISM m_R itself provides the poses of scene
references T_F, the objects {i}_S that lead to this result
have to be collected among the leaves {o}_L of the tree {m}.
The tree has to be traversed for every result produced by m_R,
following interrelated ISM recognition results.
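The belief update described above can be sketched as a fixed-point iteration over the tree. The tree encoding and names are our own assumptions, and ISM ratings are reduced to counting the objects each reference restricts; after convergence, each reference's belief equals the number of leaves in its subtree.

```python
def propagate_beliefs(tree, leaves):
    """Iterative belief propagation from the leaves to the root of an ISM tree.

    tree: list of (reference, children) pairs, one per ISM. Leaf beliefs start
    at one, internal vertices at zero; each ISM repeatedly updates its
    reference's belief to the sum over its children until nothing changes.
    """
    beliefs = {name: 1 for name in leaves}
    for ref, _ in tree:
        beliefs[ref] = 0
    changed = True
    while changed:
        changed = False
        for ref, children in tree:
            total = sum(beliefs.get(c, 0) for c in children)
            if beliefs[ref] != total:
                beliefs[ref] = total
                changed = True
    return beliefs

# Tree of ISMs for the workplace-like example: two clusters under one root.
tree = [("ref_0", ("screen_l", "screen_r")),
        ("ref_1", ("keyboard", "mouse")),
        ("root", ("ref_0", "ref_1"))]
beliefs = propagate_beliefs(tree, ["screen_l", "screen_r", "keyboard", "mouse"])
```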
VI. EXPERIMENTS AND RESULTS
A. Experimental Setups
Three hierarchical scene models are learnt to evaluate
the capabilities of our hierarchical Implicit Shape Model.
The involved objects and their trajectories, estimated by the
object localizers described in Sec. III, are visible in Fig.
3: The configuration on the center-left represents a computer
workplace consisting of two screens, a keyboard and a
mouse, which are all detected using markers. On the right,
two configurations made up of everyday objects, taken from
a household scenario, are shown. They have two dishes in
common. Here, we employ markerless object localizers.

Fig. 4. Scene recognition results for computer workplace configurations. Object models, resized for better visibility, are located according to the markers
fixed on them. ISM recognition results are shown as two lines meeting at a green sphere. Lines from reference objects o_F to the ISM references themselves
have length zero. Lines turn from red to green as the ratings of recognition results increase. Red arrows indicate ISM references being reused by other ISMs.
B. Influence of Object Pose on Scene Recognition
Experiments conducted in the computer workplace sce-
nario provide examples of how the proposed ISMs rate
whether object relations are consistent with a given scene.
As visible in Fig. 3, the direction continuity heuristic H_C
separates the present objects into two clusters: one cluster
contains both screens, the other keyboard and mouse. The
references of the ISMs learnt on these clusters are subsumed
in an additional ISM to model the entire scene.
All object configurations displayed in Fig. 4 are discussed
in terms of the scene recognition results they provide.
Starting at the top-left corner, the first two images present
recognition results after the right screen is shifted along
different spatial dimensions. As both screens stand in parallel
and next to each other during learning, the right screen is
moved away from its valid pose in relation to the left screen
in both pictures. A recognition result can be provided, but it
only considers three objects in both configurations, leading
to a mediocre recognition rating. It is not rated as poorly
as the result of the ISM relating both displays, because of
the good rating of keyboard and mouse staying at their initial
positions. The next image to the right differs from the former
two in that the right screen is rotated instead of being
translated. The recognition result, equal to those already
presented, illustrates that object orientation is considered
equally to position. The last two images in this line show
the left screen being rotated and then translated instead of
the right. Now, the right screen is part of a recognized scene
that is based on three objects. These results confirm that the
scene model treats both displays equally.
In all further configurations, both screens stay at their
initial locations and keyboard and mouse are moved instead.
In the first two images in the second line, the mouse is
shifted with respect to the keyboard, keeping its orientation.
Recognition delivers well-rated results as long as the mouse
stays on the right of the keyboard, as this has been observed
during learning. Translating the mouse behind the keyboard
has the same effect as rotating it at the right of the keyboard:
the mouse kept its orientation during learning and never
appeared behind the keyboard. Recognition results are similar
to the configurations where the left screen is displaced. This
illustrates that both clusters, generated by heuristic H_C, have
an equal effect on scene recognition. The next image shows how
the keyboard is translated instead of the mouse. The scene
is recognized considering three objects, putting aside the
keyboard. The last image shows a configuration where each
of the clusters fulfills the requirements of the learnt scene
model, but the relative pose of the cluster references differs
from beforehand. Recognition returns a well-rated result.
C. Influence of Object Occurrence on Scene Recognition
In a household scenario, we evaluate the capabilities of
the proposed scene recognition in distinguishing two similar
scenes A and B. Each scene is divided into two disjoint
binary clusters, as visible in Fig. 5. The first line in this figure
displays how one scene is transformed into the other. From
its left to its middle, two objects of scene A are removed
from the depicted object configurations. On the left, scene A is
well recognized, while scene B is detected with a poor rating.
As only those objects that both scenes have in common remain
in the middle, the ratings of recognition results for A and B
are equal in this configuration. By adding objects belonging
to scene B, from the middle to the right, recognition ratings
for scene B improve. The second line of Fig. 5 consists
of configurations with ambiguous ratings. The first two
configurations contain three resp. four objects of both scenes
at once. Both scenes are recognized with mediocre resp. good
ratings at the same time. The objects that both scenes have in
common are missing in the middle. The object configuration
is opposite to that in the middle of the first line, but their
recognition ratings are equal. The cup that scenes A and B
have in common is missing in the fourth constellation. However,
the ratings of the first and fourth configuration in the second line
are equal. The last configuration, on the right, consists of all
objects that are elements of scenes A and B, each of which is
well recognized. A superfluous cup is present as well.
D. Scene Recognition Runtime
Fig. 5. Recognition results for household object configurations. The visualization of scene recognition results is identical to Fig. 4.

The processing time of scene recognition is analyzed for both
hierarchical and non-hierarchical ISMs on a PC with a
“Core i5 750” and 4 GB RAM. Results of experiments
that are conducted in the household scenario and in which
different ISM parameter values are varied are shown in Fig.
6. Throughout the experiments, both systems take at most
40 ms to recognize scenes with six input objects. In Fig. 6,
recognition runtime is measured for different sizes of input
sets and accumulator buckets. The smallest runtimes are nearly
invisible. Recognition runs are performed separately on each
snapshot already used for learning, and the resulting runtimes
are averaged. Varying the bucket size has little effect, except
for very small values, where votes for the same pose in
reality spread across several buckets. This particularly affects
hierarchical ISMs, where votes are not only dispersed in one
ISM, but in the elements of a whole ISM set. The input set size hardly
affects recognition with non-hierarchical ISMs. This changes when
using clustering, since additional input produces votes across
entire subtrees of the ISM tree. In Fig. 6, runtime is also measured
for different input object set sizes and lengths of the trajectories
used for learning. As in the bucket size experiments, large
jumps in computation time result from cache effects. While
additional learning data has almost no impact on non-hierarchical
ISMs, the runtime of hierarchical ISMs grows linearly with trajectory length.
VII. CONCLUSIONS AND FUTURE WORKS
An approach has been presented that models scenes as
sets of objects and takes their six DoF interrelationships into
account. Scene models are learnt from demonstrations and
allow interpreting object configurations in order to figure out
present scenes and which objects belong to them. Limitations
in modeling object relation topologies, imposed by ISMs, are
overcome by agglomerative clustering, generating trees of
ISMs through spatial restriction analysis. Experiments show
that constraining object relations and distinguishing similar
scenes is achieved in realistic scenarios, while keeping the
system real-time capable. Future work includes integrating
heuristics that learn kinematic models of articulations, as well
as extending recognition to input sets in which class or
identifier labels are missing, with the aim of inferring such information.
REFERENCES
[1] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” in Com-
puter Vision and Pattern Recognition, 2009.
Fig. 6. Recognition runtimes for 1 to 6 input objects resp. hierarchical (C)
vs. non-hier. (NC) ISMs, depending on bucket size and trajectory lengths.
[2] A. Oliva and A. Torralba, “Modeling the shape of the scene: A holistic
representation of the spatial envelope,” Int. Journal of Computer
Vision, 2001.
[3] A. Borji and L. Itti, “Scene classification with a sparse set of salient
regions,” in Int. Conf. on Robotics and Automation, 2011.
[4] S. Kumar and M. Hebert, “A hierarchical field framework for unified
context-based classification,” in Int. Conf. on Computer Vision, 2005.
[5] T. Southey and J. Little, “3d spatial relationships for improving object
detection,” in Int. Conf. on Robotics and Automation, 2013.
[6] A. Ranganathan and F. Dellaert, “Semantic modeling of places using
objects,” in Robotics: Science and Systems Conference, 2007.
[7] W. Feiten, P. Atwal, R. Eidenberger, and T. Grundmann, “6d pose
uncertainty in robotic perception,” in Advances in Robotics Research.
Springer, 2009.
[8] D. Joho, G. Tipaldi, N. Engelhard, C. Stachniss, and W. Burgard,
“Nonparametric bayesian models for unsupervised scene analysis and
reconstruction,” in Robotics: Science and Systems Conference, 2012.
[9] B. Leibe, A. Leonardis, and B. Schiele, “Robust object detection with
interleaved categorization and segmentation,” Int. Journal of Computer
Vision, 2008.
[10] B. Leibe, E. Seemann, and B. Schiele, “Pedestrian detection in
crowded scenes,” in Computer Vision and Pattern Recognition, 2005.
[11] S. Gächter, A. Harati, and R. Siegwart, “Structure verification toward
object classification using a range camera,” in Int. Conf. on Intelligent
Autonomous Systems, 2008.
[12] P. Azad, T. Asfour, and R. Dillmann, “Stereo-based 6d object local-
ization for grasping with humanoid robot systems,” in Int. Conf. on
Intelligent Robots and Systems, 2007.
... To provide robots with real-time scene recognition capabilities, we introduced a variant of the Implicit Shape Model (ISM) in [1] that represents spatial relations as six Degreeof-Freedom (DoF) coordinate transforms between objects as well as object occurrences. Learning takes place during human demonstrations. ...
... Object configurations, as shown in 1c in Fig. 1, where only the latter are broken, are erroneously judged as correct examples of this scene by such an ISM, yielding a false positive scene detection. To overcome false positive detections, we presented binary trees of ISMs in [1], constructed via hierarchical clustering. While ISMs that are stacked upon each other to hierarchical ISMs, worked well as representations of non-trivial relation topologies, their construction method proved to be limited. ...
... Graphs for the same set of objects, but with different relations, are shown in In Sec. IV, we present our ISM-based scene model from prior work [1] that enabled us to represent and recognize scenes. In a scene, modeled by an ISM like in 1 in Fig. 3, all spatial relations must meet in a common reference. ...
Conference Paper
Full-text available
We present an approach that uses combinatorial optimization to decide which spatial relations between objects are relevant to accurately describe an indoor scene made up of objects. We extract scene models from object configurations that are acquired during demonstrations of actions characteristic for a certain scene. We model scenes as graphs with Implicit Shape Models (ISMs), a Generalized Hough Transform variant. ISMs are limited to representing scenes as star-shaped topologies of object relations, leading to false positives in recognizing scenes. To describe other relation topologies, we introduced a representation of trees of ISMs in prior work, together with a method to learn such ISM trees from demonstrations. Limited to creating topologies corresponding to spanning trees, that method omits certain relations, so that false positives still occur. In this paper, we introduce a method to convert any relation topology corresponding to a connected graph into an ISM tree using a heuristic depth-first search. It allows using complete graphs as scene models. Despite causing no false positives, complete graphs are intractable for scene recognition. To achieve efficiency, we contribute a method that searches for an optimal relation topology by traversing the space of connected scene graphs for a given set of objects, using an optimization similar to hill climbing. Optimality is defined as minimizing computational costs during scene recognition while producing a minimum of false positives. Experiments with up to 15 objects show that both are achievable by the presented method. Costs, growing exponentially with the number of objects, are transferred from online recognition to offline optimization.
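The hill-climbing-style traversal of connected scene graphs described in this abstract can be sketched in simplified form. Everything below is illustrative: the function names, the chain initialization, and the toy cost functions are assumptions, not the authors' implementation, which additionally evaluates recognition runtime and false positives per topology.

```python
import itertools

def hill_climb_topology(objects, cost, max_iters=100):
    """Greedy local search over connected relation topologies (edge sets).

    `cost` is a caller-supplied function on a frozenset of edges, e.g.
    combining expected recognition runtime with a false-positive count.
    The search starts from a chain (a spanning tree) and repeatedly
    applies the single edge addition or removal that lowers the cost
    most, rejecting removals that would disconnect the scene graph.
    """
    n = len(objects)
    all_edges = [frozenset(p) for p in itertools.combinations(objects, 2)]
    # Start from a simple chain o0-o1-...-o(n-1), which is always connected.
    current = frozenset(frozenset((objects[i], objects[i + 1]))
                        for i in range(n - 1))

    def connected(edges):
        adj = {o: set() for o in objects}
        for a, b in (tuple(e) for e in edges):
            adj[a].add(b)
            adj[b].add(a)
        seen, stack = {objects[0]}, [objects[0]]
        while stack:
            for nb in adj[stack.pop()]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        return len(seen) == n

    for _ in range(max_iters):
        # Neighbouring topologies differ by exactly one edge.
        neighbours = [current - {e} if e in current else current | {e}
                      for e in all_edges]
        neighbours = [c for c in neighbours if connected(c)]
        best = min(neighbours, key=cost)
        if cost(best) >= cost(current):
            break  # local optimum: no single edge change improves the cost
        current = best
    return current
```

With `cost=len`, the search settles on a spanning tree (the sparsest connected topology); a cost that rewards more relations drives it toward the complete graph, mirroring the paper's trade-off between recognition cost and false positives.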
... Not only the occurrence of objects, but also the spatial relations between them have to be considered. To address this issue, we presented hierarchical Implicit Shape Models [1] that recognize indoor scenes in object constellations. They model scenes as sets of objects, including the spatial relations between them as six Degree-of-Freedom (DoF) coordinate transforms. ...
... Object configurations often have to be perceived from several viewpoints before a scene can be recognized. Without prior knowledge of which perspectives have to be checked for which objects, the 6 DoF space of camera viewpoints has to be explored by uninformed search 1 . In a realistic scenario, such an approach is infeasible due to combinatorial explosion. ...
pascal.meissner@kit.edu 1 An approach for a mobile robot with a pivoting camera is to discretize the space of robot positions, with a given resolution, to a grid. At each position, the camera has to be rotated to a set of views lying on a tessellated sphere, and at each view, localization is to be performed for all searched objects. ...
Conference Paper
Full-text available
We present an approach for recognizing indoor scenes in object constellations that require object search by a mobile robot, as they cannot be captured from a single viewpoint. In our approach that we call Active Scene Recognition (ASR), robots predict object poses from learnt spatial relations that they combine with their estimates about present scenes. Our models for estimating scenes and predicting poses are Implicit Shape Model (ISM) trees from prior work [1]. ISMs model scenes as sets of objects with spatial relations in-between and are learnt from observations. In prior work [2], we presented a realization of ASR, limited to choosing orientations for a fixed robot head with an approach to search objects that uses positions and ignores types. In this paper, we introduce an integrated system that extends ASR to selecting positions and orientations of camera views for a mobile robot with a pivoting head. We contribute an approach for Next-Best-View estimation in object search on predicted object poses. It is defined on 6 DoF viewing frustums and optimizes the searched view, together with the objects to be searched in it, based on 6 DoF pose predictions. To prevent combinatorial explosion when searching camera pose space, we introduce a hierarchical approach to sample robot positions with increasing resolution.
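The greedy Next-Best-View choice outlined in this abstract can be sketched in a much-reduced form: pick the candidate view whose frustum covers the most predicted object positions. The frustum test, field of view, distance limit, and scoring rule below are simplifying assumptions, not the system's actual 6 DoF optimization over views and object sets.

```python
import math

def best_next_view(candidate_views, predicted_points,
                   fov_deg=58.0, max_range=2.5):
    """Greedily pick the camera view whose frustum covers the most
    predicted object positions (a stand-in for detection confidence).

    Each view is (position, unit viewing direction); predicted_points
    are 3-D pose predictions obtained from the scene model.
    """
    half_fov = math.radians(fov_deg) / 2.0

    def in_frustum(view_pos, view_dir, p):
        # Vector from the camera to the predicted point.
        d = [pi - vi for pi, vi in zip(p, view_pos)]
        dist = math.sqrt(sum(c * c for c in d))
        if dist == 0 or dist > max_range:
            return False
        # Accept the point if it lies within the cone-shaped frustum.
        cos_angle = sum(di * vi for di, vi in zip(d, view_dir)) / dist
        return cos_angle >= math.cos(half_fov)

    def score(view):
        pos, direction = view
        return sum(in_frustum(pos, direction, p) for p in predicted_points)

    return max(candidate_views, key=score)
```

In the full system, candidate views would be sampled hierarchically over robot positions and head orientations, as the abstract describes, rather than enumerated as a flat list.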
... In the second category of methods, which use probabilistic graphical models such as Markov Random Fields [6,10], Conditional Random Fields [16], Implicit Shape Models [22], and latent generative models [5], a probability distribution is modeled for relations between objects or entities. In these studies, Anand et al. [6] considered relations like "on-top" and "in-front" (and their symmetries); ...
... Celikkanat et al. [10], "left", "on", and "in-front" (and their symmetries); Lin et al. [16], "on-top" and "close-to" relations; Meissner et al. [22], 6-DoF relations (rotation and translation) between objects; and in Joho et al. [5], an implicit model over local arrangements of objects is learned. ...
Preprint
Full-text available
Scene modeling is very crucial for robots that need to perceive, reason about and manipulate the objects in their environments. In this paper, we adapt and extend Boltzmann Machines (BMs) for contextualized scene modeling. Although there are many models on the subject, ours is the first to bring together objects, relations, and affordances in a highly-capable generative model. To this end, we introduce a hybrid version of BMs where relations and affordances are introduced with shared, tri-way connections into the model. Moreover, we contribute a dataset for relation estimation and modeling studies. We evaluate our method in comparison with several baselines on object estimation, out-of-context object detection, relation estimation, and affordance estimation tasks. Moreover, to illustrate the generative capability of the model, we show several example scenes that the model is able to generate.
... We extend Implicit Shape Models to represent spatial relations between n objects o in a scene S, as introduced in [13]. As an ISM represents a scene S as a star topology of relations {T_oF} towards a common reference, a reference object o_F with pose T_F is chosen among the objects o. ...
... As such relations are not considered by the ISM, their violation is not detected, leading to false positive scene detections. Our solution [13] ... For two vector sequences p_jk and q_jk, both of length l, discontinuities u are determined by repeatedly comparing the angles ∠(p_jk(t), p_jk(t + x)) resp. ∠(q_jk(t), q_jk(t + x)) between the same vectors, valid at different time steps, to a given threshold. ...
Conference Paper
Full-text available
We present an approach that combines passive scene understanding with object search in order to recognize scenes in indoor environments that cannot be perceived from a single point of view. Passive scene recognition is done based on spatial relations between objects using Implicit Shape Models. ISMs, a variant of the Generalized Hough Transform, are extended to describe scenes as sets of objects with relations lying in-between. Relations are expressed as six Degree-of-Freedom (DoF) relative object poses. They are extracted from sensor recordings of human demonstrations of actions usually taking place in the corresponding scene. In a scene, ISMs solely represent relations of n objects towards a common reference. Violations of other relations are not detectable. To overcome this limitation, we extend our scene models to binary trees consisting of ISMs using hierarchical agglomerative clustering. Active scene recognition aims to simultaneously detect present scenes and localize the objects these scenes consist of. For a pivoting stereo camera rig, we achieve this by performing recognition with hierarchical ISMs in an object search loop using Next-Best-View (NBV) estimation. A criterion on which we greedily choose views the rig shall adopt next is the confidence of detecting objects in them. In each search step, confidence on potential positions of objects not found yet is calculated based on the best available scene hypothesis. This is done by partly reversing the basic principle of ISMs and using spatial relations to predict potential object positions starting from objects already detected.
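The star-topology voting at the core of an ISM can be illustrated with a minimal, translation-only sketch: each detected object casts a vote for the common scene reference via its learnt relative offset, and votes are accumulated on a grid. The actual model uses full 6-DoF transforms plus a verification step; the names, binning, and confidence measure below are simplifying assumptions.

```python
from collections import Counter

def ism_vote(detections, relations, bin_size=0.1):
    """Cast Hough-style votes for a scene reference position.

    `detections` maps object names to observed (x, y, z) positions;
    `relations` maps object names to the learnt offset from that object
    to the common scene reference. Votes are binned on a grid; the
    fullest bin yields the scene hypothesis and a confidence score.
    """
    accumulator = Counter()
    for name, pos in detections.items():
        if name not in relations:
            continue  # object does not participate in this scene
        ref = tuple(p + o for p, o in zip(pos, relations[name]))
        key = tuple(round(c / bin_size) for c in ref)
        accumulator[key] += 1
    if not accumulator:
        return None, 0.0
    best, votes = accumulator.most_common(1)[0]
    confidence = votes / max(len(relations), 1)
    center = tuple(k * bin_size for k in best)
    return center, confidence
```

Running the same machinery in reverse, starting from the winning reference and applying inverted offsets, is what yields the pose predictions that the abstract's object search builds on.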
... These approaches defined relations using rules based on 2D/3D distances between objects, e.g., [18]. With advances in probabilistic graphical modeling, many approaches used models such as Markov Random Fields [6], [19], Conditional Random Fields [7], Implicit Shape Models [20], latent generative models [11]. Many studies also proposed formulating relation detection as a classification problem, e.g., using logistic regression [21], and deep learning [22]. ...
Article
Full-text available
Scene models allow robots to reason about what is in the scene, what else should be in it, and what should not be in it. In this paper, we propose a hybrid Boltzmann Machine (BM) for scene modeling where relations between objects are integrated. To be able to do that, we extend BMs to include tri-way edges between visible (object) nodes and make the network share the relations across different objects. We evaluate our method against several baseline models (Deep Boltzmann Machines, and Restricted Boltzmann Machines) on a scene classification dataset, and show that it performs better in several scene reasoning tasks.
Chapter
Detailed technical presentation of our contributions that are related to Passive Scene Recognition. This includes the learning of Trees of Implicit Shape Models as well as carrying out scene recognition on the basis of these classifiers.
Article
Scene modeling is very crucial for robots that need to perceive, reason about and manipulate the objects in their environments. In this paper, we adapt and extend Boltzmann Machines (BMs) for contextualized scene modeling. Although there are many models on the subject, ours is the first to bring together objects, relations, and affordances in a highly-capable generative model. To this end, we introduce a hybrid version of BMs where relations and affordances are incorporated with shared, tri-way connections into the model. Moreover, we introduce a dataset for relation estimation and modeling studies. We evaluate our method in comparison with several baselines on object estimation, out-of-context object detection, relation estimation, and affordance estimation tasks. Moreover, to illustrate the generative capability of the model, we show several example scenes that the model is able to generate, and demonstrate the benefits of the model on a humanoid robot. The code and the dataset are publicly made available at: https://github.com/bozcani/COSMO.
Conference Paper
Full-text available
Robots operating in domestic environments need to deal with a variety of different objects. Often, these objects are neither placed randomly, nor independently of each other. For example, objects on a breakfast table such as plates, knives, or bowls typically occur in recurrent configurations. In this paper, we propose a novel hierarchical generative model to reason about latent object constellations in a scene. The proposed model is a combination of Dirichlet processes and beta processes, which allow for a probabilistic treatment of the unknown dimensionality of the parameter space. We show how the model can be employed to address a set of different tasks in scene understanding ranging from unsupervised scene segmentation to completion of a partially specified scene. We describe how sampling in this model can be done using Markov chain Monte Carlo (MCMC) techniques and present an experimental evaluation with simulated as well as real-world data obtained with a Kinect camera.
Article
Full-text available
This paper proposes an incremental object classification based on parts detected in a sequence of noisy range images. Primitive parts are jointly tracked and detected as probabilistic bounding boxes using a particle filter which accumulates the information of the local structure over time. A voting scheme is presented as a procedure to verify the structure of the object, i.e. the desired geometrical relations between the parts. This verification is necessary to disambiguate object parts from potentially irrelevant parts which are structurally similar. The experimental results obtained using a mobile robot in a real indoor environment show that the presented approach is able to successfully detect chairs in the range images.
Conference Paper
Full-text available
Robust vision-based grasping is still a hard problem for humanoid robot systems. When being restricted to using the camera system built into the robot's head for object localization, the scenarios are often greatly simplified in order to allow the robot to grasp autonomously. Within the computer vision community, many object recognition and localization systems exist, but in general, they are not tailored to the application on a humanoid robot. In particular, accurate 6D object localization in the camera coordinate system with respect to a 3D rigid model is crucial for a general framework for grasping. While many approaches try to avoid the use of stereo calibration, we will present a system that makes explicit use of the stereo camera system in order to achieve maximum depth accuracy. Our system can deal with textured objects as well as objects that can be segmented globally and are defined by their shape. Thus, it covers the cases of objects with complex texture and complex shape. Our work is directly linked to a grasping framework being implemented on the humanoid robot ARMAR and serves as its perception module for various grasping and manipulation experiments in a kitchen scenario.
Conference Paper
Full-text available
The problem of accurate 6-DoF pose estimation of 3D objects based on their shape has so far been solved only for specific object geometries. Edge-based recognition and tracking methods rely on the extraction of straight line segments or other primitives. Straightforward extensions of 2D approaches are potentially more general, but assume a limited range of possible view angles. The general problem is that a 3D object can potentially produce completely different 2D projections depending on the view angle. One way to tackle this problem is to use canonical views. However, accurate shape-based 6-DoF pose estimation requires more information than matching of canonical views can provide. In this paper, we present a novel approach to 6-DoF pose estimation of single-colored objects based on their shape. Our approach combines stereo triangulation with matching against a high-resolution view set of the object, each view having associated orientation information. The errors that arise from separating the position and orientation computation in the first place are corrected by a subsequent correction procedure based on online 3D model projection. The proposed approach can estimate the pose of a single object within 20 ms using conventional hardware.
Conference Paper
Full-text available
Indoor scene recognition is a challenging open problem in high level vision. Most scene recognition models that work well for outdoor scenes perform poorly in the indoor domain. The main difficulty is that while some indoor scenes (e.g. corridors) can be well characterized by global spatial properties, others (e.g., bookstores) are better characterized by the objects they contain. More generally, to address the indoor scene recognition problem we need a model that can exploit local and global discriminative information. In this paper we propose a prototype based model that can successfully combine both sources of information. To test our approach we created a dataset of 67 indoor scene categories (the largest available) covering a wide range of domains. The results show that our approach can significantly outperform a state of the art classifier for the task.
Conference Paper
Full-text available
While robot mapping has seen massive strides recently, higher level abstractions in map representation are still not widespread. Maps containing semantic concepts such as objects and labels are essential for many tasks in manmade environments as well as for human-robot interaction and map communication. In keeping with this aim, we present a model for places using objects as the basic unit of representation. Our model is a 3D extension of the constellation object model, popular in computer vision, in which the objects are modeled by their appearance and shape. The 3D location of each object is maintained in a coordinate frame local to the place. The individual object models are learned in a supervised manner using roughly segmented and labeled training images. Stereo range data is used to compute 3D locations of the objects. We use the Swendsen-Wang algorithm, a cluster MCMC method, to solve the correspondence problem between image features and objects during inference. We provide a technique for building panoramic place models from multiple views of a location. An algorithm for place recognition by comparing models is also provided. Results are presented in the form of place models inferred in an indoor environment. We envision the use of our place model as a building block towards a complete object-based semantic mapping system.
Conference Paper
This work demonstrates how 3D qualitative spatial relationships can be used to improve object detection by differentiating between true and false positive detections. Our method identifies the most likely subset of 3D detections using seven types of 3D relationships and adjusts detection confidence scores to improve the average precision. A model is learned using a structured support vector machine [1] from examples of 3D layouts of objects in offices and kitchens. We test our method on synthetic detections to determine how factors such as localization accuracy, number of detections and detection scores change the effectiveness of 3D spatial relationships for improving object detection rates. Finally, we describe a technique for generating 3D detections from 2D image-based object detections and demonstrate how our method improves the average precision of these 3D detections.
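The idea of using qualitative 3D relationships to separate true from false positive detections can be illustrated with a few toy predicates over axis-aligned bounding boxes. The predicate names and the contact tolerance are assumptions for illustration; the paper's actual seven relation types and its structured-SVM scoring are not reproduced here.

```python
def spatial_relations(a, b, touch_tol=0.05):
    """Evaluate a few illustrative qualitative 3-D relations between two
    axis-aligned boxes, each given as (min_corner, max_corner) with
    z pointing up. Returns a dict of boolean relation features that a
    learned model could score to rerank detections.
    """
    (ax0, ay0, az0), (ax1, ay1, az1) = a
    (bx0, by0, bz0), (bx1, by1, bz1) = b
    # The boxes' footprints overlap in the horizontal plane.
    overlap_xy = ax0 < bx1 and bx0 < ax1 and ay0 < by1 and by0 < ay1
    return {
        "above": overlap_xy and az0 >= bz1,
        "on_top": overlap_xy and abs(az0 - bz1) <= touch_tol,
        "below": overlap_xy and az1 <= bz0,
    }
```

A layout model along the paper's lines would then adjust each detection's confidence according to how plausible its relation features are for the scene category, instead of treating detections independently.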
Chapter
Robotic perception is fundamental to important application areas. In the Joint Research Project DESIRE, we develop a robotic perception system with the aim of perceiving and modeling an unprepared kitchen scenario with many objects. It relies on the fusion of information from weak features from heterogeneous sensors in order to classify and localize objects. This requires the representation of widespread probability distributions of the 6D pose. In this paper we present a framework for probabilistic modeling of 6D poses that represents a large class of probability distributions and provides, among others, the operations of fusion of estimates and propagation of uncertain estimates. The orientation part of a pose is described by a unit quaternion. The translation part is described either by a 3D vector (when we define the probability density function) or by a purely imaginary quaternion (which leads to a representation of a transform by a dual quaternion). A basic probability density function over the poses is defined by a tangent point on the 3D sphere (representing unit quaternions), and a 6D Gaussian distribution over the product of the tangent space of the sphere and of the space of translations. The projection of this Gaussian induces a distribution over 6D poses. One such base element is called a Projected Gaussian. The set of Mixtures of Projected Gaussians can approximate the probability density functions that arise in our application, is closed under the operations mentioned above, and allows for an efficient implementation.
Article
This paper presents a novel method for detecting and localizing objects of a visual category in cluttered real-world scenes. Our approach considers object categorization and figure-ground segmentation as two interleaved processes that closely collaborate towards a common goal. As shown in our work, the tight coupling between those two processes allows them to benefit from each other and improve the combined performance. The core part of our approach is a highly flexible learned representation for object shape that can combine the information observed on different training examples in a probabilistic extension of the Generalized Hough Transform. The resulting approach can detect categorical objects in novel images and automatically infer a probabilistic segmentation from the recognition result. This segmentation is then in turn used to again improve recognition by allowing the system to focus its efforts on object pixels and to discard misleading influences from the background. Moreover, the information from where in the image a hypothesis draws its support is employed in an MDL based hypothesis verification stage to resolve ambiguities between overlapping hypotheses and factor out the effects of partial occlusion. An extensive evaluation on several large data sets shows that the proposed system is applicable to a range of different object categories, including both rigid and articulated objects. In addition, its flexible representation allows it to achieve competitive object detection performance already from training sets that are between one and two orders of magnitude smaller than those used in comparable systems.
Conference Paper
This work proposes an approach for scene classification by extracting and matching visual features only at the focuses of visual attention instead of the entire scene. Analysis over a database of natural scenes demonstrates that regions proposed by the saliency-based model of visual attention are robust to image transformations. Using a nearest neighbor classifier and a distance measure defined over the salient regions, we obtained 97.35% and 78.28% classification rates with SIFT and C2 features from the HMAX model at 5 salient regions covering at most 31% of the image. Classification with features extracted from the entire image results in 99.3% and 82.32% using SIFT and C2 features, respectively. Comparing attentional and adhoc approaches shows that classification rate of the first approach is 0.95 of the second. Overall, our results prove that efficient scene classification, in terms of reducing the complexity of feature extraction is possible without a significant drop in performance.