Indexing 3D Scenes Using the Interaction Bisector Surface

Xi Zhao
University of Edinburgh
He Wang
University of Edinburgh
Taku Komura
University of Edinburgh
The spatial relationship between different objects plays an important role in
defining the context of scenes. Most previous 3D classification and retrieval
methods take into account either the individual geometry of the objects or
simple relationships between them such as the contacts or adjacencies. In
this paper we propose a new method for the classification and retrieval of
3D objects based on the Interaction Bisector Surface (IBS), a subset of the
Voronoi diagram defined between objects. The IBS is a sophisticated repre-
sentation that describes topological relationships such as whether an object
is wrapped in, linked to or tangled with others, as well as geometric rela-
tionships such as the distance between objects. We propose a hierarchical
framework to index scenes by examining both the topological structure and
the geometric attributes of the IBS. The topology-based indexing can com-
pare spatial relations without being severely affected by local geometric
details of the object. Geometric attributes can also be applied in compar-
ing the precise way in which the objects are interacting with one another.
Experimental results show that our method is effective at relationship clas-
sification and content-based relationship retrieval.
Categories and Subject Descriptors: I.3.5 [Computer Graphics]: Com-
putational Geometry and Object Modeling—Geometric algorithms, lan-
guages, and systems
General Terms: Algorithms, Design, Experimentation, Theory
Additional Key Words and Phrases: Spatial relationships, Classification,
Context-based Retrieval
ACM Reference Format:
Zhao, X., Wang, H. and Komura, T. 2013. Indexing 3D Scenes Using the
Interaction Bisector Surface. ACM Trans. Graph. ??, ?, Article ??? (January
2014), ?? pages.
Authors’ addresses: Xi Zhao, He Wang, Taku Komura: School of Informatics,
The University of Edinburgh, 10 Crichton Street, Edinburgh EH8 9AB, UK.
(emails) zhaoxi.jade@yahoo.com, He.Wang@ed.ac.uk, tkomura@ed.ac.uk
Supported by EPSRC Standard Grant (EP/H012338/1), EU FP7/TOMSY and China
Scholarship Council.
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
show this notice on the first page or initial screen of a display along with
the full citation. Copyrights for components of this work owned by others
than ACM must be honored. Abstracting with credit is permitted. To copy
otherwise, to republish, to post on servers, to redistribute to lists, or to
use any component of this work in other works requires prior specific
permission and/or a fee. Permissions may be requested from Publications
Dept., ACM, Inc., 2 Penn Plaza, Suite 701, New York, NY 10121-0701 USA, fax
+1 (212) 869-0481, or permissions@acm.org.
© ???? ACM 0730-0301/????/14-ART??? $10.00
DOI 10.1145/1559755.1559763
http://doi.acm.org/10.1145/1559755.1559763
Fig. 1. Examples of the Interaction Bisector Surface (the blue surface) for
two parts of the 3D scene (shown as red and green). (a) belt on uniform,
(b) bag on hook, (c) baby on chair, (d) a pentagon tangled with five other
pentagons.
1. INTRODUCTION
Understanding contexts is important for applications such as managing 3D
animated scenes and video surveillance. In such applications, the
individual geometry and movement of objects do not provide enough
information and should be complemented by a description of their
interactions. For example, contexts such as “a boy wearing a cap” or “a
book on a bookshelf” are defined by the fact that the upper half of the
boy’s head is covered by the inner area of the cap, or that the book is
surrounded by other books and the bookshelf. Such contexts need to be
described using a representation based on spatial relationships between
different objects.
ACM Transactions on Graphics, Vol. ??, No. ?, Article ???, Publication date: ??? ????.
The importance of context is well recognized in the area of com-
puter vision and image comprehension. Contextual data encoded
by the adjacency information of individual objects in the image
has been widely applied in shape matching [Belongie et al. 2002],
annotation [Rabinovich et al. 2007], object detection [Giannarou
and Stathaki 2007] and indexing [Harchaoui and Bach 2007]. In
[Harchaoui and Bach 2007], scene graphs are produced by connect-
ing adjacent objects by an edge and conducting graph matching for
scene comparison. The innovation of this approach is that it does
not index images based on only individual object features, but also
the spatial relations of multiple objects.
It is not an easy task to directly extend such an approach for
3D scenes where complex spatial relationships are present. Fisher
and his colleagues encode the spatial context of 3D scenes using
contacts between objects [Fisher et al. 2011] and the adjacency in-
formation is represented by relative vectors [Fisher and Hanrahan
2010]. Although such approaches can successfully classify static
scenes where objects are correlated only by simple adjacencies or
support, they may not be enough for encoding more complex rela-
tions such as enclosures, links and tangles, or those that involve ar-
ticulated models such as human bodies or deformable objects such
as ropes and clothes. Therefore, a more descriptive representation
that can evaluate the complex nature of interactions is needed for
successfully indexing such spatial relationships.
In this paper, we propose using the Interaction Bisector Surface
(IBS), which is a subset of the Voronoi diagram, for the represen-
tation of the spatial context of the scene. The Voronoi diagram has
been applied in indexing and recognizing the relationships of pro-
teins in the area of biology [Kim et al. 2006]. In a similar man-
ner to the Voronoi diagram, the IBS is the collection of points that
are equidistant from at least two objects in the scene. The IBS can
describe the topological and geometric nature of a spatial relation-
ship. By computing a topological feature set called the Betti num-
bers of the IBS, we can detect relationships such as enclosures and
windings, which characterize scenes such as a house surrounded
by fences, a lady with a hand bag hanging on her arm, or an ob-
ject contained in a box. The geometric nature of the relationships
can be analysed using the shape of the IBS, the direction of its
normal vectors, and the distance between the objects and the IBS.
The computation of the IBS makes minimal assumptions about the
forms of data input, which can be polygon meshes, skeletons or
point-clouds, making it applicable to a wide range of existing data.
In this paper, we aim to analyse spatial relationships only based on
the topological and geometric features, thus avoiding object labels
as used in [Fisher and Hanrahan 2010; Fisher et al. 2011].
Using the IBS as the interaction descriptor, we present the fol-
lowing three applications:
Interaction Classification The topological and geometric features
of the IBS can be used for the classification of different spatial re-
lationships.
Automatic Construction of Scene Hierarchies Using scenes that
are composed of multiple objects such as room data, we show that
the IBS can be applied in composing a hierarchy that describes the
scene. Given an input scene, we group individual objects or object
groups iteratively using a closeness-metric based on the IBS.
Content-based Relationship Retrieval Our distance function can
be used for finding similar relationships in the database, based
purely on the relationship information.
Contributions
—A rich representation of relationships between objects in a scene,
which can encode not only the geometric but also the topological
nature of the spatial relationships.
—An automated mechanism to build hierarchical structures for
scenes based on the spatial relationships of the objects.
—Similarity metrics for object-object relations and an approach for
conducting context-based relationship retrieval.
The rest of the paper is organized as follows. After reviewing related
work in Section 2, we explain how to compute the IBS in Section 3, and its
topological and geometric features in Section 4. Then we propose an
algorithm for building hierarchical structures for 3D scenes in Section 5.
Based on the hierarchy, we explain how to represent and measure the
similarity of spatial contexts of objects in Section 6. Next, we show the
experimental results in Section 7 and finally discuss the methodology and
draw conclusions in Section 8.
2. RELATED WORK
We will first review work about 3D analysis and synthesis, which
is a relatively new topic in the area of computer graphics. As the
medial axis is quite relevant to the IBS, we also review works about
medial axis computation, and discuss the difference between the
medial axis and the IBS.
Analysis of 3D Objects and Scenes Research into the retrieval and
synthesis of 3D objects and scenes has recently been growing due to the
large number of datasets available from, for example, Google Warehouse.
Among such works, we are mainly interested in methods that use the spatial
relationships between multiple components in the data to describe the
entire object or scene.
Several methods to analyse the structure of man-made objects have recently
been proposed [Wang et al. 2011; Kalogerakis et al. 2012; van Kaick et al.
2013a; Zheng et al. 2013]. Wang et al. [2011] compute hierarchical pyramids
of single objects based on symmetry and contact information. Kalogerakis
et al. [2012] produce a probabilistic model of the shape structure from
examples. Van Kaick et al. [2013a] use a co-hierarchical analysis to learn
the structure of models. Zheng et al. [2013] build a graph structure from
an object based on the spatial relationships of its components. As these
methods are focused on single objects, the spatial structure of the
objects is mainly based on the contact information, and the spatial
relationships between separate parts are either ignored or only described
by simple features such as relative vectors. Some recent works that aim to
achieve shape matching [van Kaick et al. 2013b; Zheng et al. 2013] propose
new features based on pairwise points to encode the spatial context of
shapes. These works also show that the spatial relationship between
different parts of a 3D shape is important for shape understanding.
Structure analysis is also applied for scenes composed of multi-
ple objects. Fisher et al. [2011] propose to construct scene graphs
based on contextual groups and contact information between ob-
jects. The scenes are then compared by a kernel-based graph match-
ing algorithm, which has been applied in image analysis [Har-
chaoui and Bach 2007]. The spatial relationships are highly ab-
stracted by simple binary information of contacts. Yu et al. [2011]
encode the relationships between furniture using metrics such as
distance, orientation and ergonomic measures. The objects are
grouped into hierarchies to learn the furniture arrangements. The
complex interactions are manually labelled by the users in these
studies due to the difficulty of automatically learning them by sim-
ple measures. Fisher et al. [2012] learn contexts from examples by
using Bayesian networks and mixture models. The relationships be-
tween adjacent objects are represented by relative vectors, and are
compared using bipartite matching. Paraboschi et al. [2007] use
the distance from the barycenter, height distance and geodesic dis-
tance as a metric and compose a graph Laplacian to encode the re-
lationship of adjacent objects. Tang et al. [2012] similarly encode
the interactions of multiple characters by applying Delaunay tetra-
hedralization to the joints composing the character skeletons and
computing the Hamming distance between them. These methods
require the objects to be manually labelled in order to reinforce the
simple representations used to describe the relationships. In many
situations, however, the objects may not be tagged or tagged in an
inconsistent manner.
In our case, we compare scenes which may be composed of unla-
belled, dense mesh structures that may interact with one another in
a complex manner. Such relationships are difficult to represent us-
ing simple relative vectors or distances. We cope with this problem
by using a more expressive representation that takes into account
the relationships of the entire surfaces of the objects composing the
scenes.
Medial Axis and Shape Recognition Here, we briefly review the computation
and applications of the medial axis, and how the medial axis is related to
our work.
The medial axis of a 3D object is the set of points within the object that
have more than one closest point on the boundary of the object. It has a
long history of being used for the recognition of 2D and 3D shapes
[Sebastian et al. 2001; Chang and Kimia 2011].
The success of the medial axis for shape recognition lies in the fact
that it produces a discrete graph structure that abstracts the shape.
As a result, the shape recognition problem can be converted into
a graph matching problem, for which various efficient techniques
have been proposed.
Previous research on computing the medial axis can be classified into two
main categories: continuous methods and discrete methods. Continuous
methods [Culver et al. 1999; Sherbrooke et al. 1995] aim to compute an
accurate medial axis for polyhedra, while discrete methods [Amenta et al.
2001] approximate the medial axis using sample points on the boundary of
the shape.
One issue with using the medial axis for pattern recognition is its
instability, as small perturbations of the boundary of the shape can
introduce large changes to its medial axis. Many methods, including [Sud
et al. 2007; Imai 1996], have been proposed to extract a stable subset of
the medial axis that is not sensitive to small perturbations. Our method
does not suffer from such instability as we only use the bisector surface
that is defined between two groups of polygons. We discuss this further in
Section 3.
The medial axis can be computed inside an object as well as
at the external area of the object. The bisector surface, which is
a subset of the medial axis, has been applied in representing pro-
tein interactions [Kim et al. 2006]. Instead of dealing with specific
structures like proteins, we define a more general metric, which can
compare the spatial relationship between interacting parts by mea-
suring the features of the IBS. In our research, we use the IBS as a
descriptor of the spatial relationships between objects in the scene.
3. INTERACTION BISECTOR SURFACE
Here we define the IBS and then describe how it is computed.
Fig. 2. Steps for computing the IBS: Given a segmented scene (a) (in this
example the scene has two segments: a cup and a table), which is composed
of polygon meshes (b), we first subdivide the mesh into triangles of
similar size (c) and then take the centre point of each triangle (d). The
IBS consists of the Voronoi surfaces produced by pairs of samples from
different objects (e) and is also represented by a polygon mesh (f). The
table and cup models are from the Stanford Scene Database [Fisher et al.
2012].
3.1 Definition
Given $N$ point sets $S_1, S_2, \ldots, S_N$ in 3D space, where
$S_i = \{p^i_1, p^i_2, \ldots, p^i_{n_i}\}$, an Interaction Bisector Surface
(IBS) divides the space into $N$ regions with the following properties:
—Points from the same point set lie in exactly one region.
—If a point $q \notin S_1 \cup S_2 \cup \cdots \cup S_N$ lies in the same
region as $S_i$, then the Hausdorff distance (in Euclidean space) between
the set $\{q\}$ and $S_i$ is shorter than the Hausdorff distance between
the set $\{q\}$ and $S_j$, where $S_j$ is any other point set.
The IBS is the set of points equidistant from two sets of points
sampled on different objects. It is an approximation of the Voronoi
diagram for objects in the scene. Examples of the IBS for differ-
ent scenes are shown as blue surfaces in Figure 1. It can be either
open or closed. Although the IBS can reach infinity when it is open
(the same as the Voronoi diagram), we truncate it by a bounding
sphere (details are given in Section 3.2). Despite the possibility that
the IBS can produce a complicated polyhedral complex, it tends to
form smooth shapes with stable topology when computed from ob-
jects in daily life such as those presented in the paper.
3.2 IBS Computation
Here we give details about how we compute the IBS for a given
scene. We start by sampling points on the surfaces of the scene
models uniformly, and then compute the Voronoi diagram for all
these samples. The Quickhull algorithm [Barber et al. 1996] was
used in this process. The result of the Quickhull algorithm is a sim-
plicial complex consisting of polygons called ridges. Every ridge
is equidistant to the two sample points which produce it. Hence
there is a correspondence between ridges and the sample points.
Assuming that the scene data is pre-segmented into objects, which
is usually the case in scene data, we only select ridges that corre-
spond to sample points from two different objects for computing
the IBS. These steps are shown in Figure 2.
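The ridge selection described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: it assumes SciPy's `scipy.spatial.Voronoi` (which wraps Qhull) as the Voronoi backend, the function name `ibs_ridges` and the random test data are hypothetical, and unbounded ridges are simply dropped instead of being trimmed by a bounding sphere.

```python
import numpy as np
from scipy.spatial import Voronoi


def ibs_ridges(samples, labels):
    """Return Voronoi ridges separating samples that belong to different objects.

    samples: (n, 3) array of surface sample points from all objects.
    labels:  length-n sequence giving the object id of each sample.
    Each result is (p, q, polygon): the two sample indices and the ridge vertices.
    """
    samples = np.asarray(samples, dtype=float)
    labels = np.asarray(labels)
    vor = Voronoi(samples)  # SciPy calls Qhull internally
    ridges = []
    for (p, q), verts in zip(vor.ridge_points, vor.ridge_vertices):
        if labels[p] == labels[q]:
            continue  # ridge between two samples of the same object: not IBS
        if -1 in verts:
            continue  # unbounded ridge; simplification: dropped, not trimmed
        ridges.append((int(p), int(q), vor.vertices[verts]))
    return ridges


# Hypothetical data: two thin slabs of random samples standing in for two objects.
rng = np.random.default_rng(0)
lower = rng.random((30, 3)) * [1.0, 1.0, 0.2]
upper = rng.random((30, 3)) * [1.0, 1.0, 0.2] + [0.0, 0.0, 0.8]
samples = np.vstack([lower, upper])
labels = [0] * 30 + [1] * 30
ridges = ibs_ridges(samples, labels)
```

Every vertex of a returned ridge is equidistant from the two samples that generate it, which is the property the IBS relies on.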
Fig. 3. (a) Penetrations between the models and the IBS, which are caused
by inadequate sampling on the objects. (b) After four refinement iterations
there are no more penetrations, and the shape of the IBS becomes smoother.
(The big gap between the cup and the table is for visualization purposes.)
Fig. 4. The IBS (the blue line) is the stable part of the medial axis (the
blue line and the grey lines). It does not fluctuate under subtle geometric
changes.
As the IBS by definition could reach infinity, we trim it by adding a
bounding sphere to the scene data used to compute the Voronoi diagram. In
practice, the bounding sphere is found in the following way. We first find
the minimum bounding box of the scene, and use the centre of the bounding
box as the centre of the bounding sphere. The diameter of the sphere is set
to 1.5 times the diagonal of the bounding box.
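The bounding sphere construction can be written down directly. The sketch below assumes an axis-aligned bounding box of the sample points, which may differ from the minimum bounding box used in the paper; the function name `bounding_sphere` is hypothetical.

```python
import numpy as np


def bounding_sphere(points, scale=1.5):
    """Centre and diameter of the trimming sphere for a set of scene points.

    Assumption: the bounding box is axis-aligned. The diameter is `scale`
    times the box diagonal (1.5 in the text).
    """
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    centre = (lo + hi) / 2.0
    diameter = scale * np.linalg.norm(hi - lo)
    return centre, float(diameter)


# Box from (0, 0, 0) to (1, 2, 2): the diagonal is 3, so the diameter is 4.5.
centre, diameter = bounding_sphere([[0, 0, 0], [1, 2, 2], [0.5, 1, 0]])
```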
Special attention is needed if two objects are very close to each
other, as there is a chance that the IBS will penetrate the objects
due to the inadequate sampling density. In this case, we iteratively
refine the IBS by the following process: if penetrations are found
between the IBS and any object, we sample more on the parts of
the object where the penetrations happen and recompute the entire
IBS. Figure 3 shows the IBS between a table and a coffee cup. In
this example, there are no more penetrations after four iterations.
Although the topological structure of the medial axis can be sensitive to
subtle geometric changes of the relevant surfaces, the IBS is rather
robust against such changes as it is computed between two objects. The
instability of the medial axis is due to the “fluctuating spikes” [Attali
et al. 2009], which are produced by concave dips on the surfaces (the grey
branches in Figure 4). As the IBS is only produced between separate
objects, such spikes are not included in its structure, and it is
therefore less likely to be affected by subtle geometric changes of the
objects. More examples of the IBS of 3D object pairs are shown in Figure 1
and Figure 5.
Given a scene, we only need to compute the IBS for the whole scene once,
and it already contains the spatial relation of every pair of objects. We
denote the subset of the IBS between object $i$ and object $j$ as
$IBS(i, j)$. Furthermore, the subset of the IBS between two groups of
objects, $g_x$ and $g_y$, can be represented by
$IBS(g_x, g_y) = \bigcup_{i \in g_x,\, j \in g_y} IBS(i, j)$.
4. IBS FEATURES
In this section, we give details about how to compute the topologi-
cal and geometric features of the IBS.
4.1 Topological Features of the IBS
Topological descriptions of relationships are succinct and robust
against small geometric variations. Consider a ball in a box. The
description “in” here is irrelevant to the ball position or orientation
as long as it is inside the box. Thus, capturing the topological nature
of the interaction between two objects is crucial in relationship
understanding. Good indicators of the topological nature are the Betti
numbers of the IBS. We will first briefly give the definition of Betti
numbers and then demonstrate how they can be applied as features to
classify complex interactions.
The Betti number is a concept in algebraic topology. Formally, the $k$-th
Betti number refers to the number of independent $k$-dimensional surfaces
[Massey 1991]. We make use of the second (denoted as $b_1$) and third
(denoted as $b_2$) Betti numbers in this research. They represent the
number of two-dimensional or “circular” holes ($b_1$), and the number of
three-dimensional holes or “voids” ($b_2$). Intuitively speaking, $b_1$
represents the number of “cuts” needed to transform a shape into a flat
sheet. For example, objects that are laterally surrounded by others, such
as a house surrounded by fences (see Figure 5 (c)), form an IBS of a
cylindrical shape, resulting in $b_1 = 1$. For objects tangled with other
objects, such as toilet paper on a holder (see Figure 5 (d)), a partial
torus is generated, resulting in $b_1 = 2$. $b_1$ can be even larger under
complex interactions whose IBS involves many loops (see Figure 5 (e)).
$b_2$ represents the number of closed surfaces. In our scenario, it counts
how many objects are wrapped by other objects (see Figure 5 (b)). The Betti
numbers can be easily computed from the mesh data by the incremental
algorithm [Delfinado and Edelsbrunner 1995].
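To make the Betti numbers concrete, the sketch below computes $b_0$, $b_1$ and $b_2$ of a small triangle mesh. It is not the incremental algorithm cited above: it uses the textbook linear-algebra route (ranks of the simplicial boundary matrices over the reals), which is only practical for small meshes because the matrices are dense. The function name `betti_numbers` and the two test meshes are illustrative assumptions.

```python
import numpy as np


def betti_numbers(triangles):
    """Betti numbers (b0, b1, b2) of a triangle mesh given as vertex-index triples.

    Uses b0 = V - rank(d1), b1 = E - rank(d1) - rank(d2), b2 = F - rank(d2),
    where d1 and d2 are the oriented boundary matrices.
    """
    tris = sorted({tuple(sorted(t)) for t in triangles})
    edges = sorted({(t[a], t[b]) for t in tris for a, b in ((0, 1), (0, 2), (1, 2))})
    verts = sorted({v for t in tris for v in t})
    v_idx = {v: i for i, v in enumerate(verts)}
    e_idx = {e: i for i, e in enumerate(edges)}

    d1 = np.zeros((len(verts), len(edges)))  # boundary of edge (a, b) is b - a
    for j, (a, b) in enumerate(edges):
        d1[v_idx[a], j] = -1.0
        d1[v_idx[b], j] = 1.0

    d2 = np.zeros((len(edges), len(tris)))  # boundary of (a,b,c) is (b,c)-(a,c)+(a,b)
    for j, (a, b, c) in enumerate(tris):
        d2[e_idx[(b, c)], j] = 1.0
        d2[e_idx[(a, c)], j] = -1.0
        d2[e_idx[(a, b)], j] = 1.0

    r1 = int(np.linalg.matrix_rank(d1))
    r2 = int(np.linalg.matrix_rank(d2))
    return len(verts) - r1, len(edges) - r1 - r2, len(tris) - r2


# Closed surface (boundary of a tetrahedron): one void, no circular hole.
closed_surface = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
# Open cylindrical band: one circular hole, no void.
open_cylinder = [(0, 1, 3), (1, 4, 3), (1, 2, 4), (2, 5, 4), (2, 0, 5), (0, 3, 5)]
```

The closed surface corresponds to the "wrapped" case ($b_2 = 1$) and the open band to the "laterally surrounded" case ($b_1 = 1$) discussed above.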
4.2 Geometric Features of the IBS
Although the Betti numbers can distinguish the qualitative differ-
ence of interactions, they cannot distinguish subtle differences. For
example, the IBS of two boxes laterally adjacent to each other has
exactly the same Betti numbers as that of an apple in a bowl. To ad-
dress this problem, we evaluate the following geometric attributes
of the IBS:
(1) geometric shape,
(2) distribution of the direction vectors, and
(3) distribution of distance between the IBS and the objects.
These features are computed at points sampled on the IBS. As
different parts of the IBS are not equally descriptive of the rela-
tionship, we use an importance-based sampling scheme that is de-
scribed in Appendix A. In brief, more points are sampled where the
IBS is in close proximity with the objects defining it.
Geometric Shape of the IBS The geometric shape of the IBS is
useful for comparing the nature of the interactions. For example,
when flat planes of two objects are simply parallel to each other,
the IBS will become planar, but it will form a bowl shape when one
object is surrounded by another object.
Fig. 5. The IBS (in blue) of two-object scenes: (a) table and chair, (b)
bird in cage, (c) house surrounded by fences, (d) toilet paper on holder,
(e) gift box and ribbon, and their Betti numbers. The 3D models in (a) and
(c) are from the Princeton Shape Benchmark [Shilane et al. 2004].
Various shape descriptors can be considered for the IBS. One possibility
is to use the curvature profile; however, the curvature data can be
unstable as the IBS may include ridges with sharp turns. This occurs when
the mapping of the closest point between the IBS and the object becomes
discontinuous due to the concavity of the object. Also, the IBS may be
either an open or a closed surface.
Taking these characteristics into account, we use the Point Feature
Histogram (PFH) descriptor [Rusu et al. 2008a]. PFH is a histogram of the
relative rotation between each pair of normals in the whole point cloud.
It describes the local geometrical properties by generalizing the mean
curvature at every point. It provides an overall pose- and
density-invariant feature that is robust to noise. PFH has been applied to
3D point cloud classification [Rusu et al. 2008a] and registration [Rusu
et al. 2008b].
More specifically, for each sample point on a given IBS, we compute a
125-bin histogram of the relative rotation angles between the normal
vector at the sample point and those of the other sample points. This
produces a set of histograms for the whole IBS. Then, following the method
proposed in [Alexandre 2012], we compute the centroid and the standard
deviation for each dimension of the histogram set, and use the resulting
250-dimensional vector as the final feature of the IBS. More details on
computing the PFH feature are described in Appendix B.
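The final aggregation step (per-bin centroid and standard deviation) can be illustrated as follows. The per-point 125-bin PFH histograms are taken as given, since their computation is deferred to Appendix B; the function name and the random input are hypothetical, and the use of the population standard deviation is an assumption because the text does not specify the estimator.

```python
import numpy as np


def aggregate_pfh(histograms):
    """Collapse per-sample PFH histograms of shape (n_samples, 125) into one vector.

    Returns the per-bin mean followed by the per-bin standard deviation,
    giving 2 * 125 = 250 values. Assumption: population standard deviation.
    """
    h = np.asarray(histograms, dtype=float)
    return np.concatenate([h.mean(axis=0), h.std(axis=0)])


# Hypothetical input: 40 IBS samples, each with a 125-bin histogram.
rng = np.random.default_rng(1)
per_point = rng.random((40, 125))
feature = aggregate_pfh(per_point)
```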
Direction The normal vectors of sample points on an IBS contain
the direction information about the spatial relationship. For exam-
ple, if all the normal vectors of the IBS samples are pointing up-
wards, one of the objects forming the IBS is above the other.
The direction of the normal vector of each IBS sample is defined
so that it points toward the reference object. In our definition, spa-
tial relations are unidirectional. The relationship of A with respect
to B is different from B with respect to A. Because of this, we first
need to specify the reference object, and then use the normal direc-
tion that is defined on the side of the reference object.
The direction feature of the IBS is computed as follows. To reduce the
dimensionality of the feature while maintaining the ability to tell the
difference between relations such as “above” and “below”, we use the angle
between the normal vector and the $+z$ direction (the upwards direction),
denoted here by $\theta$, to compute the direction feature. We compute
$\theta$ for each sample on an IBS, and produce a uniform histogram with 10
bins in the range of 0 to $\pi$. The number of samples that fall into each
bin is counted, and normalized by the total number of samples.
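A minimal sketch of the direction feature, assuming the IBS sample normals are already oriented towards the reference object and that $+z$ is the up axis of the scene; the function name `direction_feature` is hypothetical.

```python
import numpy as np


def direction_feature(normals, bins=10):
    """Normalized histogram of the angle between each normal and the +z axis."""
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)   # make normals unit length
    theta = np.arccos(np.clip(n[:, 2], -1.0, 1.0))     # angle to +z, in [0, pi]
    hist, _ = np.histogram(theta, bins=bins, range=(0.0, np.pi))
    return hist / len(theta)


# All normals pointing up fall into the first bin, all pointing down into the last.
up = direction_feature([[0, 0, 1], [0, 0, 2]])
down = direction_feature([[0, 0, -1]])
```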
Distance Between the Object Surface and IBS The distribution of the
distance between the IBS and the object surface is descriptive of the
relation between the two objects. The larger the distance is, the less
likely it is that the two objects are closely related. We produce a uniform
histogram with 10 bins whose range is from 0 to $0.5 \times d$, where $d$
is the diagonal length of the bounding box of the two objects. We compute
the distance for each sample on the IBS, and accumulate the number of
sample points that fall into each bin. The histogram is normalized by the
total number of samples and is used as another geometric feature.
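The distance feature follows the same pattern. In this sketch the per-sample distances and the bounding-box diagonal $d$ are assumed to be precomputed; samples farther than $0.5 \times d$ fall outside the histogram range and are not counted in any bin, which is one possible reading since the text does not say how such samples are treated. The function name is hypothetical.

```python
import numpy as np


def distance_feature(distances, d, bins=10):
    """Histogram of IBS-to-object distances over [0, 0.5 * d], normalized by sample count."""
    distances = np.asarray(distances, dtype=float)
    hist, _ = np.histogram(distances, bins=bins, range=(0.0, 0.5 * d))
    return hist / len(distances)


# With d = 1 each bin is 0.05 wide.
feature = distance_feature([0.01, 0.02, 0.26, 0.49], d=1.0)
```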
5. AUTOMATIC HIERARCHICAL SCENE ANALYSIS
In this section, we propose a method to automatically build a hier-
archy out of a scene by making use of the IBS data. The method
is an adapted version of the Hierarchical Agglomerative Clustering
(HAC) algorithm [Hastie et al. 2009]. The resulting scene structure
is used later for content-based relationship retrieval.
We first give the motivation, then a metric to measure inter-object
and inter-group relations and finally an algorithm for constructing
a hierarchy based on spatial relations.
5.1 Motivation
The idea to represent scenes by graph structures has been applied
in content-based scene retrieval [Fisher and Hanrahan 2010; Fisher
et al. 2011] and synthesis [Yu et al. 2011; Fisher et al. 2012]. In
their works, the relationships between objects in a 3D scene are
either generated from the information embedded manually at the
design stage, or computed based on contact. Examples of this type
of scene graph are shown in Figure 6 (b)(d).
The major difference between our method and previous works is
that we adopt a multiresolution structure that encodes not only the
spatial relations of the individual objects but also those between
the object groups, which are more descriptive about the scene, es-
pecially when the number of scene components is large.
Let us first describe the advantage of considering the inter-group
relationship with an example. For the sake of simplicity, we shall call an
object group a community. A community containing only an object and its
immediately surrounding objects is called the local community of the
object. A larger community containing other objects further away in the
scene is called the extended community of the object. In scene-b of Figure
6, the status of the bowl can be described through its local community (the
table and the bowl) first, and then further described by the relation
between the bowl’s local community and other communities (the two chairs)
in the room. This description is far easier to recognize than using the
raw, low level relationships of all the individual objects as shown in
scene-c in Figure 6. The reason behind this is that humans tend to
recognize a scene at the group level when observing it from a global
perspective [Goldstein 2010], by aggregating objects based on proximity,
continuation, uniformity, etc. Our multiresolution representation is also
more descriptive than the raw graph used in previous work. This can be seen
through scene-a and scene-b in Figure 6; the two are the same under the raw
graph representation (Figure 6 (b) and (d)) while the objects are grouped
based on the spatial relationships and distinguished in our multiresolution
representation (Figure 6 (a) and (c)).
Fig. 6. Scene structures of two example scenes: scene-a and scene-b. (a)
and (c) show the hierarchical structures produced by our method; (b) and
(d) show the scene graphs produced by the method in [Fisher et al. 2011].
The 3D models shown in this figure are from the Stanford Scene Database
[Fisher et al. 2012].
The terms “local” and “extended” community are only used for
description purpose, and we do not arbitrarily classify neighbours
into such categories. The inter-community relationships are pro-
duced by first grouping individual objects into communities of
closer objects and then recursively grouping them into larger com-
munities. The details of this procedure are described in Section 5.2.
This structure naturally forms different abstraction levels of the
scene. Given a reference object, the inter-community relationships
on each level reflect the relationships between the reference object
and the scene at different abstraction levels.
5.2 Closeness Measure and Hierarchy Construction
To formally define the hierarchy and the relationships between one
object and its environment, we define a measure called closeness
between communities, each of which can contain a single object or a
set of objects. Given a scene $S$ with $n$ communities,
$G = \{g_1, g_2, \ldots, g_n\}$, the closeness measure between any two
communities, $g_x$ and $g_y$, is defined as below:
$$R_c(g_x, g_y) = R_{\mathrm{ratio}}(g_x, g_y) + R_{\mathrm{ratio}}(g_y, g_x) \tag{1}$$
$$R_{\mathrm{ratio}}(g_x, g_y) = \frac{W(\mathrm{IBS}(g_x, g_y))}{W(\mathrm{IBS}(g_x, G \setminus g_x))}, \qquad \mathrm{IBS}(g_x, g_y) = \bigcup_{i \in g_x,\, j \in g_y} \mathrm{IBS}(i, j) \tag{2}$$
where $\mathrm{IBS}(i, j)$ represents the IBS subset shared by objects
$i$ and $j$. The function $W$ computes the weighting of the IBS region.
Note that simply computing the area of $\mathrm{IBS}(i, j)$ does not give
a good measure of the importance, as mentioned in Section 4.2. In
practice, we use $W(\mathrm{IBS}(i, j)) = n$, where $n$ is the number of
sample points (described in Section 4.2) on the IBS shared between
objects $i$ and $j$, instead of computing its actual area. This gives
more weight to the parts where the two communities are closely
interacting with each other.
$R_{\mathrm{ratio}}(g_x, g_y)$ is the commitment of $g_x$ towards $g_y$;
it is larger if $g_x$ shares a larger amount of the IBS with $g_y$ than
with other communities, which means that $g_x$ commits more to $g_y$
than to any other community. Note that $R_{\mathrm{ratio}}(g_x, g_y)$ is
not necessarily symmetric. Essentially, $R_c$ measures the relation
between two communities under the context of the whole scene.
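Equations 1 and 2 can be illustrated with a small worked example. The sketch below is only an illustration under stated assumptions: the dictionary `samples` (IBS sample counts per object pair), the object names and the helper names are hypothetical and are not part of the original method description.

```python
from itertools import product

def ibs_weight(samples, group_a, group_b):
    """W(IBS(g_a, g_b)): number of IBS sample points shared by two object sets."""
    return sum(samples.get(frozenset((i, j)), 0)
               for i, j in product(group_a, group_b) if i != j)

def r_ratio(samples, communities, x, y):
    """Commitment of community x towards community y (Equation 2)."""
    gx = communities[x]
    # G \ g_x: every object that does not belong to community x.
    rest = set().union(*(g for k, g in enumerate(communities) if k != x))
    total = ibs_weight(samples, gx, rest)
    return ibs_weight(samples, gx, communities[y]) / total if total else 0.0

def closeness(samples, communities, x, y):
    """R_c(g_x, g_y) = R_ratio(g_x, g_y) + R_ratio(g_y, g_x) (Equation 1)."""
    return (r_ratio(samples, communities, x, y)
            + r_ratio(samples, communities, y, x))

# Hypothetical sample counts: a cup on a table, with a chair nearby.
samples = {
    frozenset(("cup", "table")): 90,
    frozenset(("table", "chair")): 60,
    frozenset(("cup", "chair")): 10,
}
communities = [{"cup"}, {"table"}, {"chair"}]

cup_to_table = r_ratio(samples, communities, 0, 1)   # 90 / (90 + 10)
table_to_cup = r_ratio(samples, communities, 1, 0)   # 90 / (90 + 60)
cup_table = closeness(samples, communities, 0, 1)
```

In this toy scene the cup commits more to the table than the table commits to the cup, which shows why $R_{\mathrm{ratio}}$ is not symmetric while $R_c$ is.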
With $R_c$ as a distance function, we present an adapted HAC
algorithm to build a hierarchical structure of a scene. This hierarchy
is built iteratively in a bottom-up fashion. Starting from individual
objects (leaf nodes of the tree), we measure the $R_c$ between nodes
and group them into nodes that represent bigger communities. A
merge can combine more than two nodes. This process is repeated
until the whole scene is merged into one single node. The details
of the approach can be found in Algorithm 1. Figure 6 (a) and
(c) show simple examples.
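The bottom-up grouping can be sketched as follows. This is a minimal illustration rather than the authors' implementation: `closeness` and `in_contact` are hypothetical callables supplied by the caller, groups are tracked with a union-find structure, nodes that do not merge are carried to the next level unchanged, and all remaining nodes are merged into a root when no pair exceeds the threshold so that the loop terminates. The last three choices are assumptions that Algorithm 1 does not spell out.

```python
def build_hierarchy(objects, closeness, in_contact, tau):
    """Return a list of levels; each level is a list of frozensets of objects."""
    level = [frozenset([o]) for o in objects]
    hierarchy = []
    while len(level) > 1:
        hierarchy.append(level)
        n = len(level)
        pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
        contact = {p: in_contact(level[p[0]], level[p[1]]) for p in pairs}
        use_mask = any(contact.values())   # mask by contacts only if any exist
        parent = list(range(n))            # union-find over the current nodes

        def find(k):
            while parent[k] != k:
                parent[k] = parent[parent[k]]
                k = parent[k]
            return k

        for i, j in pairs:
            score = closeness(level, i, j) * (contact[(i, j)] if use_mask else 1)
            if score > tau:
                parent[find(j)] = find(i)

        groups = {}
        for k in range(n):
            groups.setdefault(find(k), []).append(level[k])
        merged = [frozenset().union(*members) for members in groups.values()]
        if len(merged) == n:               # assumption: force a root if stuck
            merged = [frozenset().union(*level)]
        level = merged
    hierarchy.append(level)
    return hierarchy

# Hypothetical IBS sample counts between four objects a, b, c, d.
counts = {frozenset("ab"): 80, frozenset("cd"): 70, frozenset("bc"): 20}

def toy_closeness(level, i, j):
    """R_c between two nodes of the current level, from the toy counts."""
    def weight(ga, gb):
        return sum(counts.get(frozenset((p, q)), 0) for p in ga for q in gb)

    def ratio(x, y):
        rest = frozenset().union(*(g for k, g in enumerate(level) if k != x))
        total = weight(level[x], rest)
        return weight(level[x], level[y]) / total if total else 0.0

    return ratio(i, j) + ratio(j, i)

hierarchy = build_hierarchy("abcd", toy_closeness, lambda ga, gb: True, tau=0.5)
```

With these toy counts the first pass groups a with b and c with d, and the second pass merges the two groups into the root, giving three levels.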
6. SIMILARITY METRICS BASED ON IBS
In this section, we describe how we make use of the features of
the IBS and the scene structure to compute the similarity of in-
teractions. We first explain the similarity measure of relationships
between two objects. We then describe the similarity measure of
relationships between reference objects and their immediate neigh-
bours in the local community. Finally, we explain the similarity
measure of relationships between objects and their extended com-
munities. Note that these measures are only used for content-based
retrieval explained in Section 7.3. For classification, a different
equation based on Radial Basis Function is used, which is ex-
plained in Section 7.1.
6.1 Similarity Measure for Relationships Between Two Objects
Given an IBS, we can compute its feature
$\mathbf{f} = \{f^{b}, f^{PFH}, f^{dir}, f^{dis}\}$. The four items are
the Betti numbers, PFH, direction, and distance respectively, as
explained in Section 4. To compare two IBS features $\mathbf{f}_1$ and
$\mathbf{f}_2$, we first use a simple Kronecker delta kernel as a
measure for the topological features:
$$\delta(f^{b}_1, f^{b}_2) = \begin{cases} 1 & \text{if } f^{b}_1 = f^{b}_2 \\ 0 & \text{otherwise} \end{cases} \tag{3}$$
Data: A scene S, grouping threshold τ, 0 ≤ τ ≤ 1
Result: A hierarchy H
Compute and sample IBS;
The first level of grouping G_0 = {g_1, g_2, ..., g_m};
Initialize the current level G = G_0;
Initialize H = ∅;
Define next level G' = {g'_1, g'_2, ..., g'_n};
while size(G) > 1 do
    H ← H ∪ G;
    n = size(G);
    compute matrix M1 (n × n): M1_{i,j} ← R_c(i, j) (Equation 1);
    compute matrix M2 (n × n): M2_{i,j} ← 1 if g_i and g_j have contact(s), 0 otherwise;
    for 0 ≤ i, j ≤ n do
        if M2 ≠ 0 then
            M3_{i,j} ← M1_{i,j} · M2_{i,j};
        else
            M3_{i,j} ← M1_{i,j};
        end
        if M3_{i,j} > τ then
            if ∃ g', g' ∈ G', g_i ⊂ g' or g_j ⊂ g' then
                if g_j ⊄ g', g' ← g' ∪ g_j, else g' ← g' ∪ g_i;
            else
                build g' ← g_i ∪ g_j and G' ← G' ∪ g';
            end
        end
    end
    G ← G';
end
H ← H ∪ G
Algorithm 1: Automatic Hierarchy Construction

Next we define a measure for the geometric features of the IBS
that uses the $L_1$ distance of the PFH, direction and distance
features:
$$d_{geo}(\mathbf{f}_1, \mathbf{f}_2) = a \cdot L_1(f^{PFH}_1, f^{PFH}_2) + b \cdot L_1(f^{dir}_1, f^{dir}_2) + c \cdot L_1(f^{dis}_1, f^{dis}_2) \tag{4}$$
where $a + b + c = 1$ ($0 \le a, b, c \le 1$). As the three features are
in different ranges, we apply the Inverse Variance Weighting
scheme [Hartung et al. 2011] to the $L_1$ distances of the three
features. We set $a = 0.1$, $b = 0.4$ and $c = 0.5$ in our experiments.
We combine the topology and geometry measures of the IBS,
and compute the final similarity between two IBS by:
$$s_{sr}(\mathbf{f}_1, \mathbf{f}_2) = \delta(f^{b}_1, f^{b}_2)^{w} \cdot (1 - d_{geo}(\mathbf{f}_1, \mathbf{f}_2)) \tag{5}$$
where $w$ is a “switch” for using topological features. From the
experiment in Section 7.1 we can see that the Betti number is quite
useful for complex interactions like tangles or enclosures, while it
can contradict geometric features for data that contains penetrations.
The metric function for measuring the similarity between two sets
of IBS features should be defined based on the nature of the dataset
and the purpose of retrieval. If the data mainly contains complex
relations, $w$ should be 1 so that the Betti number is used as a
filter for different interaction types with respect to its topology; if
the data mainly contains simple relations, which have Betti numbers
$b_1 = 0$, $b_2 = 0$, $w$ should be set to 0 to speed up the computation
and avoid the influence of possible penetrations.
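Equations 3 to 5 can be sketched in a few lines. The feature layout below (a dictionary with the keys `betti`, `pfh`, `dir` and `dis`) and the numeric values are hypothetical, and the sketch assumes the features have already been scaled so that $d_{geo}$ lies in [0, 1]. It also relies on the convention $0^0 = 1$, which is what lets $w = 0$ switch the topological filter off.

```python
def l1(u, v):
    """L1 distance between two equally long feature vectors."""
    return sum(abs(p - q) for p, q in zip(u, v))

def d_geo(f1, f2, a=0.1, b=0.4, c=0.5):
    """Weighted geometric distance between two IBS features (Equation 4)."""
    return (a * l1(f1["pfh"], f2["pfh"])
            + b * l1(f1["dir"], f2["dir"])
            + c * l1(f1["dis"], f2["dis"]))

def s_sr(f1, f2, w=1):
    """Similarity of two IBS features (Equation 5); w switches the Betti filter."""
    delta = 1 if f1["betti"] == f2["betti"] else 0   # Equation 3
    return (delta ** w) * (1 - d_geo(f1, f2))        # Python evaluates 0 ** 0 as 1

# Two hypothetical, already normalised features with different Betti numbers.
f_a = {"betti": (0, 0), "pfh": [0.5, 0.5], "dir": [0.2], "dis": [0.1]}
f_b = {"betti": (1, 0), "pfh": [0.5, 0.5], "dir": [0.2], "dis": [0.3]}

with_topology = s_sr(f_a, f_b, w=1)      # filtered out by the Betti numbers
without_topology = s_sr(f_a, f_b, w=0)   # geometry only
```

Here the two features are geometrically close, so they score highly with $w = 0$ but are rejected with $w = 1$ because their Betti numbers differ.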
6.2 Similarity Measure for Local Communities
We now describe how we can compare two objects with respect to
their local communities. We define a profile of object $o_i$ in a local
community $g = \{o_1, o_2, \ldots, o_m\}$ by:
$$f_{local_i} = \bigcup_{1 \le j \le m,\, j \ne i} \mathbf{f}_{i,j} \tag{6}$$
where $\mathbf{f}_{i,j}$ is the feature computed from the IBS between
$o_i$ and $o_j$. Therefore, $f_{local_i}$ is the set of IBS features
between $o_i$ and all the other objects in $g$. We call $f_{local_i}$
the local profile of $o_i$. Given two objects $o_i$, $o'_i$ from
different communities $g$ and $g'$, their local profiles $f_{local_i}$
and $f_{local_{i'}}$ are first computed. Then we can compute the
similarity between $o_i$ and $o'_i$ under the contexts of their local
communities. We define a similarity measure $s_{local}$, normalized in a
way similar to the graph kernel normalization [Fisher et al. 2011]:
$$s_{local}(i, i') = \frac{K(i, i')}{\max(K(i, i), K(i', i'))} \tag{7}$$
$$K(i, i') = \sum_{\mathbf{f}_1 \in f_{local_i}} \sum_{\mathbf{f}_2 \in f_{local_{i'}}} s_{sr}(\mathbf{f}_1, \mathbf{f}_2) \tag{8}$$
where $s_{sr}$ is defined in Equation 5.
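The normalised kernel of Equations 7 and 8 can be illustrated as below. To keep the example short, each IBS feature is reduced to a single hypothetical scalar and `toy_sim` stands in for $s_{sr}$; both are assumptions made purely for illustration.

```python
def kernel(profile_a, profile_b, pair_sim):
    """K(i, i'): sum of pairwise similarities over two profiles (Equation 8)."""
    return sum(pair_sim(f1, f2) for f1 in profile_a for f2 in profile_b)

def s_local(profile_a, profile_b, pair_sim):
    """Normalised profile similarity (Equation 7)."""
    denom = max(kernel(profile_a, profile_a, pair_sim),
                kernel(profile_b, profile_b, pair_sim))
    return kernel(profile_a, profile_b, pair_sim) / denom if denom else 0.0

def toy_sim(f1, f2):
    """Stand-in for s_sr on scalar features in [0, 1]."""
    return 1 - abs(f1 - f2)

profile_i = [0.2, 0.8]     # object i has two neighbours in its community
profile_j = [0.2]          # object i' has a single neighbour

similarity = s_local(profile_i, profile_j, toy_sim)
```

Dividing by the larger self-kernel keeps the score at most 1 when the pairwise similarity is non-negative, and penalises profiles of very different sizes.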
6.3 Similarity Metric for Extended Neighbourhood
After defining $s_{local}$, we are ready to combine it with the
hierarchical structure and define a profile for an object $o_i$ at
every level of the hierarchy. For a scene $S$ and its hierarchy, we
assume that the leaf nodes are at level 1. Let $l_d$ denote the nodes
at the $d$th level, so that $l_d$ is a set of communities
$\{g^d_1, g^d_2, \ldots, g^d_m\}$. Assume an object $o_i \in g^d_x$,
$1 \le x \le m$; a profile of $o_i$ at the $d$th level is defined as:
$$f^{d}_{ext_i} = \bigcup_{1 \le y \le m,\, y \ne x} \mathbf{f}_{g^d_x, g^d_y} \tag{9}$$
where $\mathbf{f}_{g^d_x, g^d_y}$ is the IBS feature set computed from
$\mathrm{IBS}(g^d_x, g^d_y)$, which is the IBS subset shared by
communities $g^d_x$ and $g^d_y$. Given two objects $o_i$ and $o'_i$, we
can compute their profiles $f^{d}_{ext_i}$ and $f^{d}_{ext_{i'}}$.
Then the similarity between $o_i$ and $o'_i$ at level $d$ can be
computed by:
$$s_{ext_d}(i, i') = \frac{K_e(i, i')}{\max(K_e(i, i), K_e(i', i'))} \tag{10}$$
$$K_e(i, i') = \sum_{\mathbf{f}_1 \in f^{d}_{ext_i}} \sum_{\mathbf{f}_2 \in f^{d}_{ext_{i'}}} s_{sr}(\mathbf{f}_1, \mathbf{f}_2) \tag{11}$$
Finally, given a search depth parameter $d_{depth}$, we can find the
similarity between objects $o_i$ and $o'_i$ by accumulating their
similarities from level 1 to $d_{depth}$:
$$s_{all}(i, i') = \sum_{d=1}^{d_{depth}} \gamma^{d-1} s_{ext_d}(i, i') \tag{12}$$
where $\gamma$ is set to 0.5, so that the similarity at level $d$ is
weighted by $0.5^{d-1}$. This is the contextual similarity for two
objects in the database up to a given level.
A detailed example can be found in Figure 7. Assume that we
want to compare side table $t_1$ and side table $t_2$ in two scenes. If
$d_{depth} = 1$, then only $s_{local}(t_1, t_2)$ is calculated. The only
objects involved are the objects on top of $t_1$ and $t_2$. If
$d_{depth} = 2$, then $s_{ext_1} = s_{local}(t_1, t_2)$ and
$s_{ext_2}(t_1, t_2)$ is calculated based on the IBS subset between the
red areas and other areas within level 2. Finally,
$s_{all}(t_1, t_2) = s_{ext_1}(t_1, t_2) + 0.5 \times s_{ext_2}(t_1, t_2)$.
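The accumulation in Equation 12 is a geometrically weighted sum, which the following sketch makes concrete. The per-level similarity values are hypothetical; the function only shows how the weights $\gamma^{d-1}$ are applied.

```python
def s_all(level_similarities, gamma=0.5):
    """Accumulate per-level similarities s_ext_1 .. s_ext_ddepth (Equation 12).

    level_similarities[0] is the similarity at level 1, and so on; the length
    of the list plays the role of the search depth d_depth.
    """
    return sum(gamma ** (d - 1) * s
               for d, s in enumerate(level_similarities, start=1))

# Hypothetical similarities of two side tables at levels 1 and 2.
depth_1 = s_all([0.8])          # only the local community is compared
depth_2 = s_all([0.8, 0.6])     # 0.8 + 0.5 * 0.6
```

Because the weights halve at each level, distant parts of the scene contribute progressively less to the final score.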
Our idea of taking the extended communities into account when
comparing the status of an object in a scene resembles the
part-in-whole queries in Shapira et al. [2010]. In Shapira et al.
[2010], a hierarchical structure of objects is constructed, and when
one part of an object is compared with parts of another object, the
location of the part with respect to the other parts in the hierarchy
is taken into account and the similarity is computed based on the
maximum flow in a bipartite graph. While their method focuses on the
geometrical similarity of the parts in the hierarchy, our method
computes similarities purely based on the relationship similarities.
Also, we change the weights according to the distance such that the
extended neighbourhood is less influential to the results; this is due
to the nature of the data we handle.
Fig. 7. An example of hierarchical comparison. The two side tables are
the “centre” objects we want to compare. The red, yellow and light blue
regions contain the side table’s neighbours in level 1 (bottom level),
level 2, and level 3 of the scene structures respectively. The 3D models
are from the Stanford Scene Database [Fisher et al. 2012].
7. EXPERIMENTS AND EVALUATION
In this section, we present three experiments. The first is super-
vised classification of interactions between two objects, the second
is building hierarchical structures for 3D scenes, and the third is
relationship-based retrieval. For each experiment, we first explain
the idea, then give experimental settings and results, and present
the evaluation at the end.
7.1 Classification of Interactions
Here we show how geometrical and topological features in different
combinations help in classifying two-object relationships.
Experiment The data set we use contains 1381 items, each of
which is an object pair. We ask the user to label them based on their
spatial relations. The database consists of 16 classes. We show one
example in each class in Figure 8. Descriptions for these classes
are summarized in Table I. Note that there are some scenes from
different classes with identical geometry but different object order,
as the spatial relation between two objects is not symmetric. For
example, there are two types of the relation “enclose”: one object is
enclosed by another, and one object encloses the other. This is the
same for the other relation types except type 5 and type 6.

Table I. Descriptions of spatial relationships of 16 classes
Examples   Description
1, 2       Enclose
3, 4       Encircle
5          Interlocked
6          Side by side, similar sizes
7, 8       Tucked in
9, 10      Side by side, one considerably higher
11, 12     Loosely above
13, 14     On top of
15, 16     Partially inside, with open areas
In order to facilitate the description, we refer to the interactions
with Betti numbers b1= 0, b2= 0 as simple relations, and com-
plex relations otherwise. The first part of our database contains
1289 items that are extracted from the Stanford Scene Database
used in [Fisher et al. 2012]. Since this mainly consists of simple
relations, we denote it as S. We manually label the data into mean-
ingful classes, which turn out to be 11 classes (class 6 to class 16 in
Figure 8). The second part of the database contains 92 examples of
complex relations labelled into 5 classes (class 1 to 5 in Figure 8)
by the user. As all data in this part represents complex relations, we
refer to it as C. More examples from these 16 classes are shown in
the supplementary material.
Fig. 8. Examples from 16 classes in our database. One example for each
class. The 3D models shown above are from the Princeton Shape Bench-
mark [Shilane et al. 2004] (1-5) and the Stanford Scene Database [Fisher
et al. 2012] (6-16).
We do the experiments first on S and C individually and
then on the whole database S+C. In each experiment, the
data is split into a training set and a testing set in the ratio
of 7:3. We performed classification on different combinations of
features to investigate how individual features and combinations
of them influence the classification. Specifically, we test PFH
(P), PFH+Direction (PDI), PFH+Direction+Distance (PDD) and
PFH+Direction+Distance+Betti number (PDDB). The feature vector in
each experiment is a concatenation of the involved individual
features. Individual features are first normalized. In different
experiments, we use different combinations of the normalized features.
In other words, if we use a fixed-length feature vector to represent
each feature, with non-zero values on its corresponding dimensions
and all other values zeroed, then the concatenation can also
be seen as linearly summing up several features. We tried different
weights for this linear combination to achieve good results.
Empirically, an equal weighting scheme is used for all the
classification experiments. For comparison, we also tested two
features, absolute height displacement and absolute radial separation,
used in [Fisher and Hanrahan 2010] for the whole data set, denoted by
DIS.
We choose Support Vector Machines (SVMs) [Boser et al.
1992][Cortes and Vapnik 1995] for the classification task because
of their simplicity. Specifically, we use the soft margin method
[Cortes and Vapnik 1995]. For our multi-class problem, a one-vs-one
scheme is used. For the kernel, we use a Radial Basis Function
(RBF), $K(x, y) = e^{-\theta \|x - y\|^2}$, where $x$ and $y$ are the
concatenated feature vectors. To find the best parameter values, we do
5-fold cross-validation and hierarchical grid search. We start with
coarser grids, then subdivide the best grid for another iteration of
search until the improvement of the accuracy falls below 0.001 or it
reaches the maximum iteration. Finally, we train the model with the
whole training set again using the optimal values and then test it.
For implementation, we use libSVM [Chang and Lin 2011].
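The coarse-to-fine parameter search described above can be sketched independently of the classifier. In the sketch below, `score` is a hypothetical callable that would return the 5-fold cross-validation accuracy for a given pair of parameters (for example the logarithms of the soft-margin penalty and of $\theta$); the grid size, the search ranges and the toy objective are assumptions, since the text does not specify them.

```python
def hierarchical_grid_search(score, lo, hi, steps=5, tol=1e-3, max_iter=10):
    """Coarse-to-fine search over two parameters.

    score(p, q) -> accuracy; lo and hi are (p, q) corners of the initial grid.
    Stops when the accuracy gain of a refinement falls below tol or after
    max_iter refinements.
    """
    (p_lo, q_lo), (p_hi, q_hi) = lo, hi
    best, best_point = float("-inf"), None
    for _ in range(max_iter):
        dp = (p_hi - p_lo) / (steps - 1)
        dq = (q_hi - q_lo) / (steps - 1)
        grid = [(p_lo + i * dp, q_lo + j * dq)
                for i in range(steps) for j in range(steps)]
        acc, point = max((score(p, q), (p, q)) for p, q in grid)
        improvement = acc - best
        if acc > best:
            best, best_point = acc, point
        if improvement < tol:
            break
        # Subdivide: restrict the next grid to the cells around the best point.
        p_lo, p_hi = best_point[0] - dp, best_point[0] + dp
        q_lo, q_hi = best_point[1] - dq, best_point[1] + dq
    return best_point, best

def toy_score(p, q):
    """Smooth stand-in for cross-validation accuracy, peaking at (3, -5)."""
    return 0.9 - 0.01 * ((p - 3) ** 2 + (q + 5) ** 2)

point, accuracy = hierarchical_grid_search(toy_score, (-5, -15), (15, 3))
```

Each refinement evaluates only `steps * steps` parameter pairs, which keeps the number of cross-validation runs far below that of a single fine grid over the whole range.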
Evaluation and Comparison The prediction accuracy is shown
in Table II. The first column consists of the data set. The first row
lists features. The cells are filled with prediction accuracies of each
experiment. They are calculated by feeding the testing data set into
the trained SVM classifier and the accuracy is the percentage of
correctly classified data out of the whole testing data set.
Overall, IBS features are more discriminative for one-to-one
relationship classification than DIS used in [Fisher and Hanrahan
2010]. Note that Fisher and Hanrahan [2010] achieve good retrieval
results by using other information such as labelling, but we aim to
avoid using it as such data is not always available. The results show
that DIS does not perform as well as the IBS features under our
setting.
For further comparison of features within PDDB, one can see
that the PFH descriptor of the IBS already gives good results for
this 16-class classification problem. On top of PFH, the direction
improves the result. On the complex data set, the distance is a bit
detrimental to the result. This is because distance and direction can
contradict one another in this data set. However, we find that the
prediction accuracy of the complex data set is also 100% when
just using PFH and Betti numbers. It reflects the fact that for com-
plex relationships, direction and distances are not discriminative
enough. It also shows that Betti numbers provide vital information,
especially in classifying complex interactions.
One noteworthy point is that there is a slight decrease in accuracy
from PDD to PDDB on S. One cause is the discrete nature of
Betti numbers. Relations with similar geometric features may have
different Betti numbers. Also, penetrations between objects can
cause the Betti numbers to be calculated incorrectly. Although most
scenes in the Stanford Database do not have penetrations, some still
exist for geometrically adjacent objects. A preprocessing stage to
exclude such penetrations can improve the results.
Table II. Prediction Accuracy
       P        PDI      PDD      PDDB      DIS
S      78.33%   84.07%   85.90%   82.77%    41.25%
C      80.00%   96.00%   92.00%   100.00%   44.00%
S+C    76.47%   81.37%   83.33%   84.31%    38.73%
Fig. 9. Confusion matrices of PDDB (Left) and DIS (Right) on the whole
data set (16 classes). Results are normalized within each column.
Table III. Cross-validation accuracy and time consumption
                                   PDDB     DIS
Cross-validation accuracy          83.56%   39.36%
Time for cross-validation (secs)   580.56   106.07
Time for training (secs)           0.22     0.06
In Figure 9 we plot the confusion matrices. The values are normalized
for each column. The classes that have the lowest prediction
accuracies are classes 9 and 10 (Figure 9 left). Most of their
misclassified instances are assigned to class 6. As shown in Figure 8
and Table I, classes 6, 9 and 10 all have a “side-by-side” relation,
but classes 9 and 10 have one object higher than the other. The height
difference in some scenes is not big enough for them to be classified
correctly. This is the main source of the prediction error.
Performance Table III shows the timing and accuracy information
of cross-validation and training on the whole data set (S+C). The
first row contains best average accuracy of the 5-fold cross valida-
tion during the hierarchical grid search. The second row contains
the time consumption for the grid search and the last row contains
the training time after we find the optimal values of the parameters.
The configuration of the computer where these numbers are cal-
culated is: Intel i7-2760QM CPU, 8GB memory, Windows 7 Pro-
fessional 64 bit and Matlab R2012a (64 bit).
7.2 Building Hierarchical Structures for 3D Scenes
Experiment We build hierarchical structures for 130 scenes in the
Stanford Database by using Algorithm 1. One example of the re-
sults is shown in Figure 10. More results are shown in the supple-
mentary document.
The parameter τ controls the speed of merging when building
the hierarchy. A lower τ will merge more nodes together in each
round, which means that the number of levels is smaller compared
to the structure built with a higher τ. The choice of τ should depend
on the nature of the data as well as the higher-level application. For
the Stanford Scene Database, we found τ = 0.32 gives visually
reasonable structures for most of the scenes. The average number
of levels of the hierarchical structure under this τ setting is 2.89.
Fig. 10. An example of the hierarchical structure. The objects in the same
group are shown in the same colour. The 3D models shown in this figure
are from the Stanford Scene Database [Fisher et al. 2012].
Evaluation and Analysis As the scene structure will be used as
the input for content-based retrieval, the stability of the hierarchi-
cal structure with respect to the parameter setting is important. We
evaluate the stability of our HAC algorithm in this experiment, and
its benefits for retrieval will be evaluated together with the retrieval
results in the next section.
Following the scheme by Goodman and Kruskal [1954], which
has been employed to compute the stability of hierarchical
algorithms for image segmentation [MacDonald et al. 2006] and speech
classification [Smith and Dubes 1980], we assess the stability of
generating the hierarchical structure using a consistency measure
(denoted here by γ) of how the merges happen under different
parameter settings. Briefly speaking, γ (−1 ≤ γ ≤ 1) is the
difference between the probability that the “correct” and the “wrong”
order occurs. See Appendix C for the details of computing γ.
In our algorithm, the grouping threshold τ is the main parameter.
We check the stability of the structure under different values of τ.
For each scene, we compute five hierarchies by varying τ from 0.1
to 0.5 with an interval of 0.1. Then, we compute γ for every
pair within the five hierarchies. Finally, we use the mean of the γ
values, denoted here by $\hat{\gamma}$, as the indicator of the
stability of our method. We compute $\hat{\gamma}$ for 130 scenes, and
the average $\hat{\gamma}$ is 0.989, with the lowest stability 0.779
(scene00042), which means the stability of our algorithm is very high.
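The stability measure can be illustrated with the textbook form of the Goodman and Kruskal statistic, γ = (C − D) / (C + D), where C and D count concordant and discordant comparisons and ties are ignored. The sketch below assumes this standard form and assumes a hierarchy is stored as a list of levels, each a list of sets; both the representation and the toy hierarchies are hypothetical, and the authors' exact computation is the one given in Appendix C.

```python
from itertools import combinations

def rank(hierarchy, pair):
    """Level (1-based) at which both elements of the pair share a cluster."""
    a, b = pair
    for level, groups in enumerate(hierarchy, start=1):
        if any(a in g and b in g for g in groups):
            return level
    raise ValueError("the pair never merges in this hierarchy")

def goodman_kruskal_gamma(h1, h2, elements):
    """(concordant - discordant) / (concordant + discordant) over pairs of pairs."""
    concordant = discordant = 0
    for p, q in combinations(list(combinations(elements, 2)), 2):
        sign = (rank(h1, p) - rank(h1, q)) * (rank(h2, p) - rank(h2, q))
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 1.0  # assumption for all ties

# Two toy hierarchies over the same three objects.
h_ab_first = [[{"a"}, {"b"}, {"c"}], [{"a", "b"}, {"c"}], [{"a", "b", "c"}]]
h_bc_first = [[{"a"}, {"b"}, {"c"}], [{"a"}, {"b", "c"}], [{"a", "b", "c"}]]

same = goodman_kruskal_gamma(h_ab_first, h_ab_first, "abc")
different = goodman_kruskal_gamma(h_ab_first, h_bc_first, "abc")
```

Identical merge orders give γ = 1, while hierarchies that merge the pairs in opposite order give a negative value.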
7.3 Content-based Retrieval
Experiment We tested the capability of our algorithm for content-
based scene retrieval using a data set that consists of 130 scenes,
which come from the Stanford Scene Database. We first calculate
the hierarchical structures by the techniques explained in Section 5.
The user then selects any single object from a scene as a query.
The system then returns objects from any scene that have similar
spatial relations with their surrounding objects.
Fig. 11. The solid lines show the precision-recall curves for our
algorithm on four test scenes. The dashed lines show the precision-recall
curves when using the DIS feature. The IDs of the four queries in
the Stanford Scene Database are: scene00050-object8 (red),
scene00118-object14 (green), scene00109-object14 (blue),
scene00087-object37 (yellow).
Fig. 12. Scene regions and precision-recall curves for different values
of $d_{depth}$. Each row corresponds to one query we used for the user
study. The red object in the scene is the query object. The boundary
lines in the left column show the regions corresponding to the p-r curves
of the same colour. The 3D models shown in this figure are from the
Stanford Scene Database [Fisher et al. 2012].
Evaluation and Comparison The retrieval results are shown in
Figure 14 and Figure 15. It can be observed that our system returns
contextually similar results without using the geometry feature or
label of each individual model. More retrieval results are presented
in the supplementary document.
In order to evaluate our system, we prepared manually labelled
data, which was produced as follows. Four query objects were
selected from the Stanford Scene Database, and then for each query
object an additional 500 objects were randomly selected from the
database. The set of 500 objects was shown to the user with the
scene in random order, and the user was asked if the object has
similar spatial relations with the surrounding objects as compared
to the corresponding query object. We label the spatial relations
as similar if more than half of the users think they are similar. Ten
users, including students and staff from different schools in the
university, took part in the user study.
For quantitative evaluation of the results, the precision and
recall curves (pr-curves) are drawn based on the search results and the
ground-truth (manually labelled) data. In Figure 11, we present the
curves based on our features (solid lines) and DIS (dashed lines).
It can be observed that all the solid lines show significantly higher
precision than the dashed lines of the same colour, which indicates
that our features outperform the DIS feature. At a recall of 50%, our
algorithm achieves a precision of at least 40%, meaning that
at least 1 in every 2.5 resulting scenes is desirable.
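For reference, the points of a precision-recall curve can be computed from a ranked result list and a ground-truth set as sketched below. The identifiers and the ranking are hypothetical; the sketch is only meant to make the two quantities explicit.

```python
def precision_recall_points(ranked_ids, relevant_ids):
    """Return (recall, precision) after each relevant item in the ranking."""
    relevant = set(relevant_ids)
    points, hits = [], 0
    for k, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / k))
    return points

# Hypothetical ranking returned for one query, and the user-labelled ground truth.
ranking = ["obj3", "obj9", "obj1", "obj7", "obj4"]
ground_truth = {"obj3", "obj1", "obj4", "obj8"}

curve = precision_recall_points(ranking, ground_truth)
```

A precision of 40% at a given recall means that 2 out of every 5 returned objects, or 1 in every 2.5, are labelled as similar.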
We now show the results of analysing the extent of the community
($d_{depth}$) the users take into account when comparing the
similarity of relationships. Figure 12 shows the pr-curves for three
query scenes with different $d_{depth}$ values. In Figure 12 (a), the
pr-curve for $d_{depth} = 1$ gives slightly higher precision than
$d_{depth} = 2$, which means that the users tend to pay more attention
to the immediate neighbours of the “plate” (which are the toast, the
table, and the other objects on the table) than those in the extended
community. The results in Figure 12 (c) are similar to (a), but there
is a larger difference in precision between the two depths. In
Figure 12 (b), $d_{depth} = 2$ gives the highest precision, which means
the users also consider the objects in the extended communities when
evaluating the similarity of the spatial relations. We can assume that
factors such as the scale of the scene and the density of objects
affect the perception of the neighbourhood. The scene in Figure 12 (b)
has more objects densely located around the desk compared with the
desk in Figure 12 (c), and the scale is larger than the scene shown in
Figure 12 (a).
Figure 13 shows a failure case for our method on the Stanford
Scene Database, with half of the top ten results (object on table)
not matching the query object (decoration hanging above the bed). As
the bottom of the decoration is almost in contact with the bed head,
it is similar to the “one object on another” examples in terms of
spatial relationship. This is a typical failure case in which the
retrieval results are not consistent with the user’s intuition,
because we do not take into account any geometry information of the
individual objects nor their semantic labels. The failure cases are
mostly removed from the retrieval results when the decoration is
lifted a little so that the relationship between the decoration and
the bed head is not mistaken for a contact.
8. CONCLUSION AND DISCUSSIONS
In this paper, we have proposed a new descriptor called the Inter-
action Bisector Surface to capture spatial relationships between 3D
objects. The rich information among numerous types of object rela-
tions is well contained in the IBS. Its capacity for describing these
relationships lies in its topological and geometric features. Betti
numbers enable us to recognize very sophisticated relations while
the distance, the direction and the shape of the IBS further indicate
nuances at a finer level. The IBS is the cornerstone of the research
Fig. 13. Failure case. The results with red bounding boxes are not
satisfactory, as they are “object on table” while the query is “object
hanging above another object”. The 3D models shown in this figure are
from the Stanford Scene Database [Fisher et al. 2012].
and it provides a new perspective to model spatial relationships. In
addition, we have proposed an automated mechanism to understand
the structure of big scenes consisting of a large number of objects.
Knowing the hierarchy of big scenes is crucial for applications such
as content-based relationship retrieval. Because of the nature of the
IBS, the calculation naturally rules out relationships between ob-
jects that are too far from each other or have too many objects in
between. The structure of the IBS segments the scene into groups
at every level. Therefore, we are able to automatically build up a
hierarchical structure that has meaningful geographical groups. We
also propose similarity metrics based on the IBS that effectively
distinguish different types of spatial relationships between objects
or object groups. These metrics are equipped with the scene hierar-
chy so that comparison can be made between objects based on their
contexts. Finally, we also show how the features, metrics and algo-
rithms based on the IBS can be applied to solve practical problems.
Although our approach to computing the IBS is a heuristic, it is
a good compromise in terms of computational cost and precision.
For the computation of the IBS, we use a sampling-based approach
in which points are sampled on the object surfaces and then the
Quickhull algorithm [Barber et al. 1996] is used. This is a heuristic
approach that does not guarantee the exact topology and geome-
try of the resulting medial axis. This could be an issue if we need
to match the homotopy of the IBS. In order to avoid such confu-
sion, we use abstract topology features (the first and second Betti
numbers) whose values are less influenced by the accuracy of the
IBS. Although the Betti numbers can be affected by topological
noise such as holes, this is less likely to appear in the bisector sur-
faces as they are defined between distinct separate objects. Also,
the geometric features of the IBS are statistical values that are less
influenced by the accuracy of the IBS. In addition, exact methods
to compute the medial axis [Culver et al. 1999] and bisector
surface [Elber and Kim 1997] are not practical to apply to a set
of high-resolution meshes. In summary, our method makes use of
Fig. 14. Retrieval results by using IBS features. (left) In the query scene, the object with a bounding box (the desk) is the query object. We show the other
objects within the search depth in colour while leaving the rest of the scene in grey. (right) The resulting scenes from left to right are in order of similarity.
The red object is the retrieved object with a similar context to the desk. To show the context clearly, we also render other objects within the search depth with
slightly different pale green/blue colour in the result scenes. The 3D scenes are from the Stanford Scene Database [Fisher et al. 2012].
Fig. 15. More retrieval results by using IBS features. The 3D scenes are from the Stanford Scene Database [Fisher et al. 2012].
features that are less computationally costly and less affected by
parameter values.
Limitations Although the IBS is good for identifying spatial
relationships, its computational cost is higher than that of the
simple features used in [Fisher and Hanrahan 2010]. We believe
that it is a fair trade-off between precision and performance. The
method can be easily parallelised, and can greatly benefit from
implementation on multi-core systems. Secondly, the discriminative
power of the IBS deteriorates when the distance between objects
increases. When there are just two objects in the scene and they
are far apart from each other, we suspect that IBS features can be
replaced by simpler features used in [Fisher and Hanrahan 2010]
such as the height displacement and radial separation. Lastly, as we
focus on relationship understanding in this research, individual ge-
ometry plays a less important role. This is different from previous
works. Hence, for applications such as retrieval, it might cause con-
fusion when the user tries to retrieve scenes not only with similar
relationships but also with similar geometries.
Future Work We believe that the potential of using the IBS for
spatial relation representation has not been fully explored. In the
future, one possible direction is to use it for comparing two scenes.
This can be useful for whole scene retrieval. Another promising
direction is to further exploit IBS features and explore along the
time domain. By observing the feature variations on the time di-
mension, we might be able to understand, recognize and classify
animated scenes. At the same time, by adding human knowledge
via learning algorithms, we can pursue a semantic understanding
of the relationships between motions and environments.
ACKNOWLEDGMENTS
We thank the anonymous reviewers for their constructive comments.
We also thank Rami Ali Al-ashqar for his help in preparing the 3D
models used in Figure 1 and Figure 5, Shin Yoshizawa for the
discussions and the test-users for their help in evaluation of our
system.
The scene and object data are provided courtesy of the Stanford
Scene Database [Fisher et al. 2012] and the Princeton Shape Bench-
mark [Shilane et al. 2004].
APPENDIX
A. SAMPLING
Fig. 16. (left) The direction angle of a polygon on the IBS, (right) The
sampling result
Fig. 17. PFH angles
Here we describe how we sample points on the IBS where we
compute the geometric features. This is done by calculating the
weights of the triangles composing the IBS. First, we define a direc-
tion angle αfor each triangle. A triangle T on the IBS is equidistant
to sample points, sband scon o1and o2respectively. Let us define
a vector, v, from the centre of T to sband a normal nof T pointing
towards the side of o1.α(Figure 16(a)) is the angle between vand
n. Note that this angle is the same if we compute it between T and
o2because the normal is flipped in that case. The larger the angle
is, the higher the chance that the sample point is far away from the
objects defining it and less informative about the interaction. We
compute a weight W(T):
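\[ W(T) = W_{\text{area}}(T) \times W_{\text{scene-distance}}(T) \times W_{\text{angle}}(T) \tag{13} \]
where $W_{\text{area}}(T)$ is the area of triangle $T$ and $W_{\text{angle}}$ is computed as:
\[ W_{\text{angle}} = \begin{cases} 1 - \dfrac{\alpha}{45} & \text{if } \alpha < 45 \\ 0 & \text{otherwise} \end{cases} \tag{14} \]
$W_{\text{scene-distance}}$ is computed by:
\[ W_{\text{scene-distance}} = \left(1 - \frac{d}{D}\right)^{n} \tag{15} \]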
where $d$ is the distance between the centre of $T$ and $s_b$ (or,
equivalently, $s_c$), and $D = d_{\text{diag}}/2$, where $d_{\text{diag}}$
is the length of the diagonal of the bounding box of the whole scene.
We empirically set $n$ equal to 20. We then normalize $W(T)$ over all
the triangles. Next, we set a target number of samples for the whole
IBS; the final target number for each triangle is this target number
times the triangle's weight. Finally, we use the final target numbers
to do random sampling on every triangle.
Figure 16(b) shows the result of the weighted sampling.
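As an illustration, the weighting of Equations 13 to 15 and the allocation of per-triangle sample counts can be sketched in Python as follows. This is a minimal sketch, not the authors' implementation: the function names, the `(area, d, alpha_deg)` tuple layout, the clamping of 1 - d/D at zero, the rounding of the sample counts, and the uniform barycentric sampling inside each triangle are all assumptions made for the example.

```python
import math
import random

def angle_weight(alpha_deg):
    """W_angle (Eq. 14): falls linearly from 1 at 0 degrees to 0 at 45 degrees."""
    return 1.0 - alpha_deg / 45.0 if alpha_deg < 45.0 else 0.0

def scene_distance_weight(d, scene_diagonal, n=20):
    """W_scene-distance (Eq. 15) with D = d_diag / 2.
    Clamping at zero for d > D is an assumption of this sketch."""
    D = scene_diagonal / 2.0
    return max(0.0, 1.0 - d / D) ** n

def triangle_weights(triangles, scene_diagonal, n=20):
    """triangles: list of (area, d, alpha_deg) tuples (hypothetical layout).
    Returns W(T) (Eq. 13) normalized to sum to 1 over all triangles."""
    raw = [area * scene_distance_weight(d, scene_diagonal, n) * angle_weight(alpha)
           for area, d, alpha in triangles]
    total = sum(raw)
    return [w / total for w in raw] if total > 0 else raw

def sample_counts(weights, target_total):
    """Final target number per triangle: total target times the triangle's weight."""
    return [round(target_total * w) for w in weights]

def sample_point_in_triangle(a, b, c, rng=random):
    """Uniform random point inside triangle (a, b, c) via barycentric coordinates.
    Uniform sampling is an assumption; the text only says 'random sampling'."""
    r1, r2 = rng.random(), rng.random()
    s = math.sqrt(r1)
    u, v, w = 1.0 - s, s * (1.0 - r2), s * r2
    return tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3))
```

With a large exponent such as n = 20, the distance term decays very quickly, so triangles far from the interacting objects receive almost no samples.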
B. POINT FEATURE HISTOGRAM (PFH)
PFH is a feature for encoding the geometry of a point cloud. Given
two points (Figure 17), $p_1$ and $p_2$, with normals $n_1$ and $n_2$,
three unit vectors ($u$, $v$ and $w$) are built by the following
procedure: 1) $u$ is the normal vector of $p_1$; 2) $v = u \times \frac{p_2 - p_1}{d}$;
3) $w = u \times v$, where $d = \|p_2 - p_1\|_2$. Then the difference
between $n_1$ and $n_2$ is represented by three angles
$(\alpha, \phi, \theta)$, which are computed as:
\[ \alpha = v \cdot n_2, \quad \phi = u \cdot \frac{p_2 - p_1}{d}, \quad \theta = \arctan(w \cdot n_2,\; u \cdot n_2). \]
The triplet $\langle \alpha, \phi, \theta \rangle$ is computed for each
pair of points in the $k$-neighbourhood, and the triplets are binned
into a histogram. Usually the range of each angle is divided into $b$
equal parts, so the triplets form a histogram of size $b^3$ in which
each bin represents one combination of the value ranges of the three
angles. In our case, we compute the triplet for each pair of points
in the point cloud, which is the set of samples computed by the
method described in Appendix A. We set $b = 5$, so the PFH feature we
use is a vector of length 125.
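The construction above can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: the helper names are hypothetical, $v$ is explicitly normalized so that $(u, v, w)$ are unit vectors as the text states, degenerate pairs (coincident points, or a connecting line parallel to $n_1$) are skipped, the bin ranges are taken as $[-1, 1]$ for $\alpha$ and $\phi$ and $[-\pi, \pi]$ for $\theta$, and the histogram is normalized by the number of pairs.

```python
import math
from itertools import combinations

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def pfh_triplet(p1, n1, p2, n2):
    """Return (alpha, phi, theta) for two points with unit normals, or None if degenerate."""
    diff = sub(p2, p1)
    d = math.sqrt(dot(diff, diff))
    if d == 0.0:
        return None
    direction = tuple(x / d for x in diff)      # (p2 - p1) / d
    u = n1
    v = cross(u, direction)
    v_len = math.sqrt(dot(v, v))
    if v_len < 1e-12:                           # direction parallel to n1
        return None
    v = tuple(x / v_len for x in v)             # normalization is an assumption
    w = cross(u, v)
    alpha = dot(v, n2)
    phi = dot(u, direction)
    theta = math.atan2(dot(w, n2), dot(u, n2))  # two-argument arctan
    return alpha, phi, theta

def bin_index(value, lo, hi, b):
    """Index of the equal-width bin containing value, clamped to [0, b - 1]."""
    i = int((value - lo) / (hi - lo) * b)
    return min(max(i, 0), b - 1)

def pfh_histogram(points, normals, b=5):
    """b^3-bin histogram over all point pairs (normalized; an assumption)."""
    hist = [0.0] * (b ** 3)
    count = 0
    for i, j in combinations(range(len(points)), 2):
        t = pfh_triplet(points[i], normals[i], points[j], normals[j])
        if t is None:
            continue
        alpha, phi, theta = t
        idx = (bin_index(alpha, -1.0, 1.0, b) * b
               + bin_index(phi, -1.0, 1.0, b)) * b \
              + bin_index(theta, -math.pi, math.pi, b)
        hist[idx] += 1.0
        count += 1
    return [h / count for h in hist] if count else hist
```

Visiting every pair costs time quadratic in the number of samples, which is one reason to keep the number of IBS samples moderate.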
C. STABILITY
Here we explain the definition of $\gamma$ introduced by Goodman and
Kruskal [Goodman and Kruskal 1954]. Consider two hierarchical
structures for the same data, $h_1$ and $h_2$, and two pairs of elements
of this data, $p_i = (x_{i1}, x_{i2})$ and $p_j = (x_{j1}, x_{j2})$. The
rank $r(h, p)$ is defined as the level at which the two elements of pair
$p$ first appear in the same cluster in hierarchy $h$. Over all the
pairs of elements in the data, we set
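\[
\begin{aligned}
\Pi_s &= \Pr\{(r(h_1, p_i) < r(h_1, p_j) \wedge r(h_2, p_i) < r(h_2, p_j)) \\
      &\qquad \vee\, (r(h_1, p_i) > r(h_1, p_j) \wedge r(h_2, p_i) > r(h_2, p_j))\} \\
\Pi_d &= \Pr\{(r(h_1, p_i) < r(h_1, p_j) \wedge r(h_2, p_i) > r(h_2, p_j)) \\
      &\qquad \vee\, (r(h_1, p_i) > r(h_1, p_j) \wedge r(h_2, p_i) < r(h_2, p_j))\} \\
\Pi_t &= \Pr\{(r(h_1, p_i) = r(h_1, p_j)) \vee (r(h_2, p_i) = r(h_2, p_j))\}
\end{aligned}
\tag{16}
\]
Then
\[ \gamma = \frac{\Pi_s - \Pi_d}{1 - \Pi_t} \tag{17} \]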
$\gamma$ measures the difference between the probabilities of “right
order” and “wrong order”. In other words, $\gamma$ shows how much more
probable it is to get the same rather than different orders in the two
hierarchies. It ranges from -1 for inconsistency to 1 for consistency.
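For concreteness, $\gamma$ can be estimated by counting over all pairs of element pairs, as in the following Python sketch. The representation is an assumption made for illustration: each hierarchy is given as a dictionary (hypothetical names `rank1` and `rank2`) mapping an element pair to its rank $r(h, p)$, and a pair of pairs counts as tied if the ranks are equal in either hierarchy, so that the three probabilities sum to one.

```python
from itertools import combinations

def goodman_kruskal_gamma(rank1, rank2):
    """gamma = (Pi_s - Pi_d) / (1 - Pi_t) for two hierarchies over the same data.

    rank1, rank2: dicts mapping each element pair p to r(h1, p) and r(h2, p).
    Both dicts must share the same keys.
    """
    same = diff = tied = total = 0
    for p_i, p_j in combinations(sorted(rank1), 2):
        a = rank1[p_i] - rank1[p_j]   # order of the two pairs in h1
        b = rank2[p_i] - rank2[p_j]   # order of the two pairs in h2
        total += 1
        if a == 0 or b == 0:
            tied += 1                 # contributes to Pi_t
        elif (a < 0) == (b < 0):
            same += 1                 # same order: contributes to Pi_s
        else:
            diff += 1                 # opposite order: contributes to Pi_d
    if total == tied:
        raise ValueError("gamma is undefined when every comparison is tied")
    # Dividing counts by `total` gives the probabilities; the factor cancels.
    return (same - diff) / (total - tied)
```

Identical hierarchies give a value of 1, and hierarchies that merge the pairs in exactly the opposite order give -1.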
REFERENCES
ALEXANDRE, L. A. 2012. 3D descriptors for object and category recognition: a comparative evaluation. In Workshop on Color-Depth Camera Fusion in Robotics at the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). Vilamoura, Portugal.
AMENTA, N., CHOI, S., AND KOLLURI, R. K. 2001. The power crust. In Proceedings of the sixth ACM symposium on Solid modeling and applications. 249–266.
ATTALI, D., BOISSONNAT, J.-D., AND EDELSBRUNNER, H. 2009. Stability and computation of medial axes - a state-of-the-art report. In Mathematical Foundations of Scientific Visualization, Computer Graphics, and Massive Data Exploration, T. Möller, B. Hamann, and R. D. Russell, Eds. Mathematics and Visualization. Springer Berlin Heidelberg, 109–125.
BARBER, C. B., DOBKIN, D. P., AND HUHDANPAA, H. 1996. The quickhull algorithm for convex hulls. ACM Transactions on Mathematical Software 22, 4, 469–483.
BELONGIE, S., MALIK, J., AND PUZICHA, J. 2002. Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 4, 509–522.
BOSER, B. E., GUYON, I. M., AND VAPNIK, V. N. 1992. A training algorithm for optimal margin classifiers. In Proceedings of the 5th Annual ACM Workshop on Computational Learning Theory. ACM Press, 144–152.
CHANG, C.-C. AND LIN, C.-J. 2011. LIBSVM: a library for support vector machines. ACM Transactions on Intelligent Systems and Technology 2, 3, 27:1–27:27. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
CHANG, M.-C. AND KIMIA, B. B. 2011. Measuring 3D shape similarity by graph-based matching of the medial scaffolds. Computer Vision and Image Understanding 115, 5, 707–720.
CORTES, C. AND VAPNIK, V. 1995. Support-vector networks. In Machine Learning. 273–297.
CULVER, T., KEYSER, J., AND MANOCHA, D. 1999. Accurate computation of the medial axis of a polyhedron. In Proceedings of the fifth ACM symposium on Solid modeling and applications. 179–190.
DELFINADO, C. J. A. AND EDELSBRUNNER, H. 1995. An incremental algorithm for betti numbers of simplicial complexes on the 3-sphere. Computer Aided Geometric Design 12, 7 (Nov.), 771–784.
ELBER, G. AND KIM, M.-S. 1997. The bisector surface of freeform rational space curves. In Proceedings of the thirteenth annual symposium on Computational geometry. ACM, 473–474.
FISHER, M. AND HANRAHAN, P. 2010. Context-based search for 3D models. ACM Transactions on Graphics 29, 6, 182.
FISHER, M., RITCHIE, D., SAVVA, M., FUNKHOUSER, T., AND HANRAHAN, P. 2012. Example-based synthesis of 3d object arrangements. ACM Transactions on Graphics 31, 6, 135.
FISHER, M., SAVVA, M., AND HANRAHAN, P. 2011. Characterizing structural relationships in scenes using graph kernels. ACM Transactions on Graphics 30, 4.
GIANNAROU, S. AND STATHAKI, T. 2007. Object identification in complex scenes using shape context descriptors and multi-stage clustering. In 2007 15th International Conference on Digital Signal Processing. 244–247.
GOLDSTEIN, E. B. 2010. Sensation and perception. CengageBrain.com.
GOODMAN, L. A. AND KRUSKAL, W. H. 1954. Measures of association for cross classifications. Journal of the American Statistical Association, 732–764.
HARCHAOUI, Z. AND BACH, F. 2007. Image classification with segmentation graph kernels. In Proceedings of CVPR.
HARTUNG, J., KNAPP, G., AND SINHA, B. K. 2011. Statistical meta-analysis with applications. Vol. 738. Wiley.com.
HASTIE, T., TIBSHIRANI, R., AND FRIEDMAN, J. 2009. The Elements of Statistical Learning. New York: Springer.
IMAI, T. 1996. A topology oriented algorithm for the voronoi diagram of polygons. In Proceedings of the 8th Canadian Conference on Computational Geometry. Carleton University Press, 107–112.
KALOGERAKIS, E., CHAUDHURI, S., KOLLER, D., AND KOLTUN, V. 2012. A probabilistic model for component-based shape synthesis. ACM Transactions on Graphics 31, 4, 55.
KIM, C.-M., WON, C. I., CHO, Y., KIM, D., LEE, S., BHAK, J., AND KIM, D.-S. 2006. Interaction interfaces in proteins via the voronoi diagram of atoms. Computer-Aided Design 38.
MACDONALD, D., LANG, J., AND MCALLISTER, M. 2006. Evaluation of colour image segmentation hierarchies. In Proceedings of the 3rd Canadian Conference on Computer and Robot Vision. 27.
MASSEY, W. 1991. A basic course in algebraic topology. Vol. 127. Springer.
PARABOSCHI, L., BIASOTTI, S., AND FALCIDIENO, B. 2007. Comparing sets of 3D digital shapes through topological structures. In Graph-Based Representations in Pattern Recognition, F. Escolano and M. Vento, Eds. Number 4538 in Lecture Notes in Computer Science. Springer Berlin Heidelberg, 114–125.
RABINOVICH, A., VEDALDI, A., GALLEGUILLOS, C., WIEWIORA, E., AND BELONGIE, S. 2007. Objects in context. Proceedings of ICCV.
RUSU, R., MARTON, Z., BLODOW, N., AND BEETZ, M. 2008a. Learning informative point classes for the acquisition of object model maps. In 10th International Conference on Control, Automation, Robotics and Vision, 2008. ICARCV 2008. 643–650.
RUSU, R. B., MARTON, Z. C., BLODOW, N., AND BEETZ, M. 2008b. Persistent point feature histograms for 3d point clouds. Intelligent Autonomous Systems 10: IAS-10, 119.
SEBASTIAN, T. B., KLEIN, P. N., AND KIMIA, B. B. 2001. Recognition of shapes by editing shock graphs. In Proceedings of ICCV. 755–762.
SHAPIRA, L., SHALOM, S., SHAMIR, A., COHEN-OR, D., AND ZHANG, H. 2010. Contextual part analogies in 3d objects. International Journal of Computer Vision 89, 2-3 (Sept.), 309–326.
SHERBROOKE, E. C., PATRIKALAKIS, N. M., AND BRISSON, E. 1995. Computation of the medial axis transform of 3-d polyhedra. In Proceedings of the third ACM symposium on Solid modeling and applications. 187–200.
SHILANE, P., MIN, P., KAZHDAN, M., AND FUNKHOUSER, T. 2004. The princeton shape benchmark. In Shape Modeling Applications, 2004. Proceedings. IEEE, 167–178.
SMITH, S. P. AND DUBES, R. 1980. Stability of a hierarchical clustering. Pattern Recognition 12, 3, 177–187.
SUD, A., FOSKEY, M., AND MANOCHA, D. 2007. Homotopy-preserving medial axis simplification. International Journal of Computational Geometry & Applications 17, 05, 423–451.
TANG, J. K. T., CHAN, J. C. P., LEUNG, H., AND KOMURA, T. 2012. Retrieval of interactions by abstraction of spacetime relationships. Computer Graphics Forum 31, 2.
VAN KAICK, O., XU, K., ZHANG, H., WANG, Y., SUN, S., SHAMIR, A., AND COHEN-OR, D. 2013a. Co-hierarchical analysis of shape structures. ACM Transactions on Graphics 32, 4, 69.
VAN KAICK, O., ZHANG, H., AND HAMARNEH, G. 2013b. Bilateral maps for partial matching. Computer Graphics Forum 32, 6, 189–200.
WANG, Y., XU, K., LI, J., ZHANG, H., SHAMIR, A., LIU, L., CHENG, Z.-Q., AND XIONG, Y. 2011. Symmetry hierarchy of man-made objects. Computer Graphics Forum 30, 2, 287–296.
YU, L.-F., YEUNG, S.-K., TANG, C.-K., TERZOPOULOS, D., CHAN, T. F., AND OSHER, S. J. 2011. Make it home: automatic optimization of furniture arrangement. ACM Transactions on Graphics 30, 4 (July), 86:1–86:12.
ZHENG, Y., COHEN-OR, D., AND MITRA, N. J. 2013. Smart variations: Functional substructures for part compatibility. Computer Graphics Forum (Eurographics) 32, 2, 195–204.
ZHENG, Y., TAI, C.-L., ZHANG, E., AND XU, P. 2013. Pairwise harmonics for shape analysis. IEEE Transactions on Visualization and Computer Graphics 19, 7, 1172–1184.