Structure-based Object Representation and
Classification in Mobile Robotics through a Microsoft
Kinect
Antonio Sgorbissa and Damiano Verda
DIBRIS - University of Genova, Via Opera Pia 13, 16145 Genova, Italy.
Abstract
A new approach enabling a mobile robot to recognize and classify furniture-
like objects composed of assembled parts using a Microsoft Kinect is pre-
sented. Starting from considerations about the structure of furniture-like
objects, i.e., objects which can play a role in the course of a mobile robot
mission, the 3D point cloud returned by the Kinect is first segmented into
a set of “almost convex” clusters. Objects are then represented by means
of a graph expressing mutual relationships between such clusters. Off-line,
snapshots of the same object taken from different positions are processed and
merged, in order to produce multiple-view models that are used to populate a
database. On-line, as soon as a new object is observed, a run-time window of
subsequent snapshots is used to search for a correspondence in the database.
Experiments validating the approach with a set of objects (i.e., chairs,
tables, but also other robots) are reported and discussed in detail.
Keywords: Object classification, Structure-based modelling and recognition, Microsoft Kinect, Mobile Robotics
Email address: antonio.sgorbissa@unige.it, damiano.verda@unige.it (Antonio Sgorbissa and Damiano Verda)
Preprint submitted to Robotics and Autonomous Systems, July 15, 2013
1. Introduction
The article describes a system for recognizing and classifying furniture-
like objects composed of assembled parts on the basis of a sequence of snap-
shots taken with a Microsoft Kinect1. The system is meant to provide a
mobile robot with advanced perceptual capabilities, with the final aim of
enabling it to interact with the environment in different ways depending on
the affordances offered by different classes of objects.
The concept of affordance, initially proposed by the psychologist J.J. Gibson [1], has been widely adopted in robotics, mainly in the context of grasping
and manipulation. Even if the present work does not deal explicitly with
object affordances, the concept helps us to motivate the development of a
system for classifying furniture-like objects in the context of mobile robotics,
and therefore deserves a deeper discussion. “Affordance” is a term used to
describe a possibility for actions: this possibility varies depending both on
the object and on the agent performing the action, and therefore cannot be
uniquely expressed as an intrinsic characteristic of a given object or environ-
mental feature. For example a chair affords pushing both to humans and
mobile robots, but it affords sitting only to humans. A table affords pushing
to humans but usually not to mobile robots. A door affords passing through
to robots and humans, but not to elephants or cars.
1. The Microsoft Kinect is an RGB-D camera: in addition to standard RGB information it returns, for every pixel, the distance to the closest object, thus effectively providing depth information that can be used to build a volumetric representation of the scene.
Suppose now that, in the course of a mission, the robot encounters some-
thing that blocks its path. The robot can try to push it away, or ask the
object to move on: this corresponds to exploring and learning the affordances
of the new object. However, if the robot is able to use perceptual data to
recognize the object and to classify it as belonging to a given category (e.g., a
chair, a table, or a human), it can immediately infer what the object affords
or not, and act accordingly. As the reader can imagine, what the robot needs
is not the capability to recognize a particular chair or table, but the ability to abstract from the peculiar characteristics of individual objects through a
conceptualization process whose final output is a model of a given class of ob-
jects. The robot needs to be able to distinguish chairs from tables (since the
former can be pushed away, whereas the latter cannot), and not to recognize
a particular chair or table.
In the recent robotic literature, object recognition and classification has
received great attention [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13]. The major
contribution of this work is to propose a novel approach that relies on the
depth data returned by a Microsoft Kinect, and focuses on recognizing and
classifying only those objects that offer possibilities of actions to a mobile
robot: for instance chairs, tables, other robots, but not cups, pens, or wall
clocks, which appear not to play a relevant role in the context of mobile robot
navigation. Using the Kinect, objects are initially perceived as point clouds,
possibly through different snapshots taken from different perspectives: the
general idea is to merge the information acquired in subsequent snapshots to
model objects as composed of “almost convex” geometrical primitives, and
finally to match data acquired in run-time with pre-computed models stored
Figure 1: Pictographs representing a chair (a), a table (b), a bed (c), and a door (d).
in a database.
The approach is motivated by the consideration that a very small set of
geometric primitives and their mutual geometrical relationships appear to be
sufficient for humans to distinguish among common use objects. Consider
the pictographs in Figure 1: when the picture is shown to volunteers and they are asked which objects are represented in it, almost 100% of the interviewed volunteers have no difficulty in recognizing a) as a chair, b) as a table, c) as a bed, and d) as a door (the test has open answers, i.e., volunteers are not provided with a list of objects to choose from). In a similar spirit,
we conjecture that modelling an object in the real world through a limited
set of geometrical primitives (labelled with their geometrical properties, as
well as their mutual geometrical and topological relationships) should be
sufficient to characterize it as a member of the class it belongs to, thus being
recognizable by the system. For instance, chairs have a seat orthogonal to
legs, a back orthogonal to the seat, and legs parallel to each other.
This conjecture is reinforced by the fact that, in most cases, furniture-like objects in human environments are not only modelable as composed of “almost convex” parts: pieces of furniture are actually built by assembling smaller parts, owing to the obvious constraints imposed by the manufacturing process. Interestingly, the design of many pieces of furniture has not substantially changed for centuries: the way a modern chair or table is designed and built has inherited the constraints imposed by woodworking centuries ago, even if some of these constraints no longer hold after the introduction of plastic. Thus, even when a plastic chair is actually moulded as a single piece, it is still modelable as if it were composed of a seat, a back, and legs.
Section 2 describes related work. Section 3 introduces the system’s ar-
chitecture. The processing phases which are required to build the model of
an object starting from depth data are described in Sections 4 to 6, whereas
the classification process is described in Section 7. Section 8 describes exper-
imental results. Conclusions follow.
2. Related Work
Object recognition and classification is a widely investigated topic in the
scientific literature: among the others, it plays an important role in robotics,
for navigation as well as for grasping and manipulation, in order to enable
the robot to behave differently when dealing with different objects.
Most approaches in the recent literature rely on appearance-based tech-
niques, and perform object recognition and classification on the basis of a set
of invariant descriptors. In a few words, the basic idea is that of taking snap-
shots of the objects, computing a set of global and local descriptors which
summarize data, and then comparing objects by computing the “distance”
between vectors of such descriptors. This is different from past research (i.e.,
up to twenty years ago), when holistic approaches based on global features
were intensively investigated in computer vision.
Among the others, many works assume that the robot has 2D vision ca-
pabilities, and rely on local descriptors such as SIFT features [14] or similar.
As an example, the authors of [15] focus on robots controlled via the Inter-
net, and deal with object recognition as a prerequisite to allow the user to
issue voice commands to refer to the objects by their names (e.g., “grasp the
cube”): a neural-network is proposed, able to classify the objects depicted in
low-resolution and noisy images on the basis of a set of invariant descriptors.
In [2] an agent-based architecture is presented which is capable of continuous,
supervised learning through the feedback that the robot receives from the
user. The approach relies on a sequence of processing phases involving multi-
ple object representations (i.e., different orientation-independent features are
extracted from the same image) as well as multiple classifiers (i.e., different
similarity and membership measures are adopted), which are then combined
in a probabilistic framework. The work described in [8] focuses on the prob-
lem of learning affordances of objects which are relevant for navigation and
manipulation tasks, and presents a probabilistic model that describes the
relationships between object categories, affordances, and their visual appear-
ance.
Some authors argue that object recognition can be improved by consid-
ering multiple perceptual and motor modalities. In this spirit, [3] proposes
a statistical technique for multimodal object categorization based on audio-
visual and haptic information, by allowing the robot to use its physical em-
bodiment to grasp and observe an object from various view points, as well as
listen to the sound during the observation. In [16] a method is proposed to
learn a motor/perceptual model of objects belonging to different categories
(e.g., container/non-container objects), which can be generalized to classify
novel objects. In [13] a dataset of 100 objects divided into 20 categories is
considered to test the ability of a humanoid robotic torso to interact in the
most appropriate way with different categories of objects: to this purpose,
the authors propose a supervised recognition method that does not rely on
vision alone, but on the integration of different exploration behaviours tak-
ing into account visual, proprioceptive and acoustic information: look, grasp,
lift, hold, shake, drop, tap, poke, push, and press. A feature vector is prop-
erly recorded for each combination of perceptual modality / behaviour /
object, which is then used to train a Support Vector Machine for classifica-
tion. Similarly, [17] starts from the consideration that simple interactions
with objects in the environment lead to a manifestation of the perceptual
properties of objects, and tries to predict the effects of actions in relation to
the perceptual features extracted from the objects. The authors of [18] deal
with the problem of augmenting visual information about an object with
motor information about it, that is the way the object can be grasped by a
human being. After acquiring data with human volunteers and training the
system, a visuo-motor map is built, mapping the visual features of an object
to an associated grasp (i.e., mapping visual appearance to affordances): the
article shows that the presence of motor information improves object recog-
nition, even when the test data to be classified do not include actual motor
information, but the latter is simply inferred through the visuo-motor map.
Our approach differs substantially from all the approaches above, in that
we rely on the depth data provided by a Microsoft Kinect sensor to build a
3D representation of the objects to be classified2.
Many recent works rely on the depth data returned by stereo vision, a
Time Of Flight (TOF) camera or a 3D laser scanner. For example, [19] fo-
cuses on object classification in mobile robotics through 3D range data, using
a cascade of classifiers composed of several simple classifiers which are able
to detect simple features such as edges or lines. The authors of [5] inves-
tigate the recognition capabilities of personal robots aimed at picking and
placing objects in the context of everyday manipulation tasks in domestic
environments, by integrating in a probabilistic framework depth information
returned from a TOF camera, as well as color and thermal information re-
turned by a standard and a thermal camera. The work in [6] focuses on
autonomous car navigation, and shows how to use objects from Google's 3D
Warehouse (containing thousands of 3D models of objects such as furniture,
cars, buildings, people, vegetation, and street signs) to train classifiers for
2. Depth data returned by an RGB-D camera are also referred to as 2.5D data: however, in the following, we always speak about 3D representations for the sake of simplicity.
3D laser scans collected by a robot navigating through urban environments.
The approach first segments the data and then computes, for a subset of
laser points in each segment, a spin-image signature [20, 21] which repre-
sents the shape around that point: each segment is then characterized by
a set of shape descriptors, and a distance function operating on descriptors
is introduced. The article addresses the problem of domain adaptation, which is needed
to deal with the different characteristics of the web data (where scan points
are obtained through ray tracing) and the real laser data. The authors of [7]
investigate how to use the Google Warehouse as a source for obtaining the
required data for building a fast, stereo-vision based object categorization
system for robotics. 3D models are automatically built by taking different
views of synthetic data, and then compared with stereo data acquired in
real-time through the Spherical Harmonics Descriptor [22], on which basis
the k-nearest neighbours between the data to be classified and the pre-stored
models are computed. In a non robotic context, a 3D model-based algorithm
which performs viewpoint independent recognition of free-form objects is pre-
sented in [23]. The views are converted into multidimensional table represen-
tations, i.e., tensors: correspondences are automatically established between
these views by simultaneously matching the tensors of a view with those
of the remaining views using a hash table-based voting scheme. Similarly,
the approach described in [24] requires to take into account many images
of the same object, gathered from different points of view: the authors pro-
pose to use Algebraic Functions of Views (AFoVs) to predict the geometric
appearance of an object due to viewpoint changes.
Among the sensors able to return depth information, the Microsoft Kinect
sensor has been recently used in different applications belonging to this re-
search field. For instance, the authors of [25] propose a technique allowing a mobile robot to detect objects in the scene, and state their intention to continue their work by developing a recognition technique. A similar work is described in [26], which proposes an object detection technique based on Kinect data for the purpose of obstacle avoidance. The authors of [27] focus on mobile ma-
nipulation in a domestic domain: the approach is aimed at planning collision
free grasps, but does not deal with object recognition and classification. Not
strictly related to robotics, [10] introduces a large-scale, hierarchical multi-
view object data set collected using a PrimeSense3, Kinect-style sensor, and proposes an algorithm for object recognition and classification, both at the
instance and at the category level. The dataset contains the full 3D recon-
struction of objects, performed by taking a video with the RGB-D camera,
which can be projected on different views whenever required. Recognition is
performed by extracting state-of-the-art features, including spin-images (for category recognition) and SIFT descriptors (for instance recognition), and by using different state-of-the-art classifiers. In [11, 12] the Kinect sensor is used both for object recognition (at the instance and at the category level) and for scene labelling, the latter being the task of assigning a semantic label to every element composing a scene. Specifically, the articles focus on
developing good features for object recognition using depth maps, proposing
to use the so–called Kernel descriptors in addition to spin-images and other
descriptors commonly used for other types of 3D point clouds. A comparison
3. www.primensense.com
between the performance of the Kinect and other depth sensors (i.e., Swiss
Ranger SR-4000 and Fotonic B70) is performed in [28].
Our system differs from all appearance-based approaches above, since it
does not perform object recognition and classification by computing a set of
descriptors which summarize data. Instead, our system segments the point
cloud representing an object into clusters corresponding to its constituent
parts, and searches for a correspondence between the structure of the per-
ceived object and the models of objects in a database.
Structure-based techniques, aimed at describing objects with a set of geometrical primitives in order to identify them, have a long tradition: in the seminal work by Fischler et al. [29] (as well as in subsequent works) object parts are represented by surface patches; however, other geometrical primi-
tives have been proposed, such as sticks, plates and blobs in [30], Geons in
[31, 32], tubular structures in [33], and others [34]. Some authors [35, 36]
propose to recognize categories of objects also using the information relative
to the functionality of an object, i.e., a set of volumetric shape primitives
and their relations are combined with a set of functional primitives and their
relations to perform recognition. The work described in [37] introduces a
probabilistic grammar to represent object classes recursively in terms of their
parts, thereby exploiting the hierarchical structure inherent to many types
of objects: the framework models the 3D geometric characteristics of object
parts using multivariate conditional Gaussians over dimensions, position, and
rotation. In [4] the focus is on indoor mobile robotics, and a TOF camera
is used to track parts of an object (e.g., the legs of a chair) over subsequent
range images using a particle filter: the volumetric representation of parts
is made with bounding boxes. Similar approaches have been proposed in
[38], describing an approach to recognize furniture based on local features,
as well as in [39], integrating grey scale values and depth information for
furniture recognition. The authors of [40] propose once again to represent
objects through their component parts, but focus on the problem of gener-
ating novel views when it is not possible to store in the memory a complete
set of views for every object. In [9] 2D data and 3D data returned by a
stereo system are used to extract multi-modal contours which compose ob-
jects: applications and results are shown in the domain of driver assistance,
depth prediction and robot grasping. A probabilistic approach for object
class recognition, applied on images, is proposed in [41]. The authors sug-
gest to model objects as sets of distinguishable parts. Each part is identified
as a region of the image which is salient over both location and scale [42].
Then each part is characterized performing a Principal Component Analy-
sis (PCA) on the corresponding region. Object recognition is realized in a
Bayesian framework, checking if, given the parts corresponding to the input
image and those corresponding to the model, a match is more likely than a
mismatch.
Our system differs from other structure-based approaches in the kind of representation adopted to model the structure of objects (i.e., as if they were composed of “almost convex” shapes), which is moti-
vated by considerations about the nature of furniture-like objects offering
affordances both to humans and to mobile robots. Also, the system differs
in the approach adopted to build and match models starting from multiple
views of the same object, which heavily relies on the underlying graph-like
representation of the object structure.

Figure 2: Block diagram which shows the main components of the system and the information flow. In the Off-line phase (dashed path) the point cloud returned by the Microsoft Kinect is processed by the Segmentation, Graph construction and Model extraction blocks, and the resulting models populate the Object model database; in the On-line phase (solid path) the Recognition block compares the processed data with the stored models and returns a recognition score.
3. System Architecture
An overview of the system architecture is shown in Figure 2. Information
can flow along two different paths: the dashed path, corresponding to the
Off-line phase, and the solid path, corresponding to the On-line phase.
The Off-line phase comprises three components: Segmentation, Graph construction, and Model extraction.
Segmentation isolates the objects in the point cloud returned by the Mi-
crosoft Kinect sensor and decomposes each of them into clusters, which we
classify depending on their shape: mostly elongated shapes are referred to as
“tubes”, whereas mostly flat shapes are referred to as “planes”. The point
cloud is processed using the Point Cloud Library (PCL) [43]. PCL offers
some clustering primitives [44] but, as it will be shown in Section 4, these are
not well-suited for the proposed system: another clustering technique, based
on the recognition of approximately convex shapes, will be proposed.
Graph construction represents the set of clusters, provided by the Seg-
mentation block, as a graph. Nodes correspond to clusters, whereas edges
represent the existence of one of the following geometrical relations between
the connected clusters:
- two orthogonal tubes;
- two parallel tubes;
- two orthogonal planes;
- a tube orthogonal to a plane.
Model extraction considers a number of snapshots of the same object
gathered from a single point of view and sequentially processed by the Seg-
mentation and Graph construction components, and uses these snapshots to
build a model summarizing the information contained in them. Specifically,
the nodes and edges which are present in a statistically significant subset of
snapshots are labelled as being part of the model characterizing the object,
whereas the others are discarded.
Finally, the output of the Model extraction block is used to populate the
Object model database.
The On-line phase relies on the same blocks: the difference lies in the fact that the Object model database is now used as an input
for the Recognition block. Specifically, the Recognition block compares the
point cloud perceived from the Kinect in real–time (and processed by the
Segmentation and Graph construction blocks) with all the models included
in the database. If such comparison returns a recognition score higher than
a given threshold, the object is classified as belonging to the corresponding
category. Otherwise, the data acquired in real-time can possibly be used to build a new model, which is then stored in the database.
All the blocks are now described in greater detail.
4. Segmentation
In order to be able to recognize categories of objects, the point cloud
returned by the Kinect is first decomposed into a number of clusters.
The first step of the process is a standard euclidean clustering, which
aims at segmenting the different objects in the scene from the background.
To ensure the effectiveness of this operation, the points lying on the plane
representing the floor (or the ground), on which all objects are assumed to
lie, must first be removed. This is done through 3-point RANSAC plane
estimation [45], using the implementation available in the PCL library. Notice
also that many approaches in the literature make the assumption that the
object to be recognized lies on the floor or on a table, thus being coherent
with the assumptions we have made in our work (approaches not relying on
this assumption can be found as well, e.g., [46]).
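As an illustration of this pre-processing step, the following sketch shows a minimal 3-point RANSAC plane fit for floor removal. It is written here in Python with numpy for self-containment, whereas the actual system relies on the PCL implementation; the parameter values simply mirror those reported in the caption of Figure 3 and are purely illustrative.

import numpy as np

def ransac_floor_removal(points, n_iters=1000, inlier_thresh=0.05):
    # points: (N, 3) array; returns the points that do not lie on the
    # dominant plane, assumed to be the floor on which objects stand.
    best_inliers = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # sample 3 distinct points and compute the plane they define
        idx = np.random.choice(len(points), 3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-9:                      # degenerate (collinear) sample
            continue
        normal /= norm
        # distance of every point from the candidate plane
        dist = np.abs((points - p0) @ normal)
        inliers = dist < inlier_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[~best_inliers]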
Then, each of the resulting clusters is decomposed again into a number of
tubes and planes. The PCL library offers some primitives to segment a point
cloud such that each cluster fits a geometric primitive in the sense of the least
squares. This segmentation method is not suitable for the present application
because it lacks robustness and repeatability, i.e., the results of the process
may differ significantly even for similar images. See for example Figure 3:
in the image on the left, the back of the chair is segmented as belonging to
a single “plane”; in the image on the right, the same points are interpreted
as belonging to a number of different “lines”. This increases the uncertainty
about the segmentation process and hence makes the algorithm unreliable for
an object recognition application which is based on the extraction of (nearly)
invariant clusters.
Another segmentation method, based on the search for clusters corresponding to approximately convex shapes, is proposed, and its higher suitability for this application is validated through experiments. To describe the
behaviour of the algorithm, the following variables and functions are in order:
- O = {o_i}, C = {c_i} are lists of points in the 3-dimensional space, used to represent, respectively, the point cloud and a generic cluster of points;
- 𝒞 = {C_i} is a set of clusters;
- remove(L) removes the first element in the list L, and returns it;
- neighbours(p, L, d) returns a list of points in L which are closer than a distance d from the point p;
Figure 3: Two different images representing the same chair, from the same point of view,
clustered using PCL primitives. Different colors correspond to different clusters: the
centroids of each cluster are shown and labelled with their identifier. Parameters used in
the PCL implementation of RANSAC: maximum number of iterations = 1000; maximum distance threshold for inliers = 0.05.
- random(L) returns a random point in the list L;
- midpoint(p, n) returns the midpoint between the points p and n, i.e., the point whose coordinates are computed as the average of p's and n's coordinates;
- move(p, L, N) removes the point p from the list L and inserts it at the head of N;
- append(L, N) appends the list of points N to the list L.
The point cloud is segmented into sub-clusters, which comply with a euclidean distance constraint and with a relaxed version of the convexity
constraint.
The algorithm requires as input the point cloud O to be processed. The main cycle in lines 2–24 is executed as long as there are points to be considered in O: lines 3–4 initialize a new cluster C with a point p removed from the point cloud; line 5 initializes the list L of candidate points which are neighbours of p within a given distance; lines 6–22 iteratively consider all candidate points to check whether they belong to the cluster or not, adding to L the neighbours of every candidate point which passes the test. When there are no more candidate points in L, line 23 inserts the new cluster into 𝒞.
Specifically, line 7 gets a candidate point n1 in L; lines 9–16 check whether n1 belongs to the cluster or not using the convexity test (details in the following); in case of an affirmative answer, line 18 removes n1 from the point cloud to avoid considering it again in the future, and lines 19–20 update the list L of candidate points by adding all neighbours of n1.
Algorithm 1 Segmentation with approximately convex shapes
Require: O, r, d
Ensure: 𝒞
1: initialize 𝒞 = ∅
2: while O ≠ ∅ do
3:   p = remove(O)
4:   C = {p}
5:   L = neighbours(p, O, d)
6:   while L ≠ ∅ do
7:     n1 = remove(L)
8:     flag = true
9:     for i = 1 → r do
10:      n2 = random(C)
11:      m = midpoint(n1, n2)
12:      if neighbours(m, C, d) = ∅ then
13:        flag = false
14:        break
15:      end if
16:    end for
17:    if flag then
18:      move(n1, O, C)
19:      N = neighbours(n1, O, d)
20:      append(L, N)
21:    end if
22:  end while
23:  𝒞 = 𝒞 ∪ C
24: end while
The convexity test in lines 8–16 works as follows. Assume a set of points that already belong to the cluster C. Then assume a new point n1, which must be tested to check whether it belongs to the cluster or not. The test is based on the consideration that, in a convex shape, every point can be connected to every other point by a segment lying entirely within the shape: first, the algorithm randomly selects a subset of r points4 already belonging to C (lines 9–10); second, a segment is ideally drawn from the candidate point n1 to each of the randomly selected points, and the midpoint m of the segment is computed (line 11); third, n1 is classified as not belonging to the cluster if there exists a midpoint m which has no neighbours within a given distance in the cluster (lines 12–15). Finally, please notice that the threshold distance d in line 12 should depend on the distance between the object to be segmented and the robot itself: in fact, as this latter distance increases, the distance between points in the cloud increases as well, and the convexity test must take this into account by properly increasing d5.
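To make the procedure more concrete, the following Python sketch reproduces the region-growing logic of Algorithm 1 with the relaxed convexity test. It uses a brute-force neighbour search over a numpy array, which is an assumption made here for brevity (the actual implementation relies on PCL data structures); the default values r = 30 and d = 3 cm follow footnotes 4 and 5.

import numpy as np

def segment_almost_convex(points, r=30, d=0.03):
    # points: (N, 3) array. A candidate point joins a cluster only if the
    # midpoint towards each of r random cluster points has at least one
    # cluster neighbour within distance d (relaxed convexity test).
    def near(p, idx):
        # indices in idx whose points lie closer than d to p (brute force)
        idx = np.asarray(idx, dtype=int)
        return idx[np.linalg.norm(points[idx] - p, axis=1) < d].tolist()

    remaining = list(range(len(points)))          # the point cloud O
    clusters = []                                 # the set of clusters
    while remaining:
        seed = remaining.pop(0)                   # p = remove(O)
        cluster = [seed]                          # C = {p}
        frontier = near(points[seed], remaining)  # L = neighbours(p, O, d)
        while frontier:
            n1 = frontier.pop(0)
            if n1 not in remaining:               # already assigned elsewhere
                continue
            sample = np.random.choice(cluster, size=min(r, len(cluster)))
            convex = True
            for n2 in sample:                     # relaxed convexity test
                m = (points[n1] + points[n2]) / 2.0
                if not near(m, cluster):          # no cluster point near midpoint
                    convex = False
                    break
            if convex:
                remaining.remove(n1)              # move(n1, O, C)
                cluster.append(n1)
                frontier.extend(near(points[n1], remaining))
            # points failing the test stay in O and may join another cluster
        clusters.append(cluster)
    return clusters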
As an illustrative example, Figure 4 reports the output of the algorithm in a couple of 2D prototypical situations, i.e., the segmentation of a U-shaped and an S-shaped point cloud. The example shows that the output of the algorithm can depend, to a limited extent, on the order in which points are considered and removed from the point cloud in Line 3 of Algorithm 1. Consider Figure 4 on the right: since, in this case, the algorithm starts from the point on the top-left of the point cloud, an “almost triangular” cluster marked with “a” is produced. However, the approach is sufficiently robust to also produce five “almost rectangular” clusters, three of which are “horizontal” and two “vertical”, as is expected when segmenting an S-shaped cloud. Similar considerations hold when considering the point cloud returned by the Kinect, which defines surfaces in a 3D workspace.
4. In all experiments r = 30, a number arbitrarily selected to find a compromise between computational efficiency and the “quality” of the result: ideally, one should consider all points already belonging to C at every iteration.
5. In the current implementation we assume that all objects are perceived by the Kinect at a distance which ranges from 1.5 m to 2 m, which allows us to set the constant value d = 3 cm, yielding reasonable results in all experiments performed.
Figure 4: Segmentation of convex clusters in prototypical 2D cases: different colors correspond to different clusters. An “almost triangular” cluster marked with “a” is segmented from the S-shaped cloud.
After the points have been segmented, each cluster is labelled depending
on its shape. To this purpose, the covariance matrix that describes the
distribution of points in the cluster is computed: if the highest eigenvalue
of the covariance matrix is significantly higher than the other eigenvalues,
the cluster is labelled as a tube. Otherwise, it is labelled as a plane. Also,
each cluster is labelled with its direction, computed as follows: for a tube,
the direction is the eigenvector associated with the highest eigenvalue; for a
plane, the direction is the eigenvector associated with the lowest eigenvalue.
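A minimal sketch of this labelling step is given below, assuming a cluster is an (N, 3) numpy array; the ratio of 10 used to decide whether the largest eigenvalue is "significantly higher" than the others is an illustrative assumption, not a value reported in the paper.

import numpy as np

def label_cluster(cluster_points, ratio=10.0):
    # Label a cluster as "tube" or "plane" from the eigenstructure of its
    # covariance matrix, and return its characteristic direction.
    cov = np.cov(cluster_points.T)              # 3x3 covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    if eigvals[2] > ratio * eigvals[1]:         # one dominant eigenvalue
        # elongated cluster: direction = eigenvector of the largest eigenvalue
        return "tube", eigvecs[:, 2]
    # flat cluster: direction = eigenvector of the smallest eigenvalue (the normal)
    return "plane", eigvecs[:, 0]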
Notice that, differently from [30] (where the primitive shapes “sticks” and
“plates” closely resemble our tubes and planes), we do not consider “blobs”,
i.e., clusters where all three eigenvalues have the same order of magnitude.
This is due to the fact that, using depth data, the point cloud never includes blobs, because of self-occlusions. For instance, when observing a cube, the
depth data allows for detecting one, two or three faces, which can be modelled
as planes but do not contain enough information to make inferences about
the occupation of the space within them (i.e., it is not possible to distinguish
between an empty box and a solid cube).
Using the same input data already discussed for Figure 3, it is possible to notice that the reconstruction based on Algorithm 1 is more stable, in terms of the type and number of sub-clusters, even in the presence of measurement noise (see Figure 5). However, due to the presence of noise, two problems persist. First, some clusters have no correspondences in subsequent images: for instance, this is the case of cluster 17 in the left image, which is classified as a plane (cluster 7 in the right image has been classified as a tube). Second, some clusters are composed of a very small number of points: for instance, this is the case of cluster 15 in the left image.
Both problems are solved in the following phases: specifically, only clusters
that are repeatedly detected in subsequent snapshots will be used to build a
model of the object in the Model Extraction phase, and the relative size of
clusters will play a primary role in the Recognition phase. This allows for filtering out noisy readings, and for attributing a greater importance to those clusters that are both persistent and include a sufficiently high number of
Figure 5: Two different images representing the same chair, from the same point of view, clustered using Algorithm 1. Different colors correspond to different clusters: the centroids of each cluster are shown and labelled with their identifier. Clusters 11, 15, and 17 in the left image, as well as clusters 10 and 15 in the right image, have been classified as “planes”, whereas all other clusters have been classified as “tubes”.
points (e.g., the plane corresponding to the chair seat, labelled as 11 in the
left image and as 10 in the right one).
5. Graph Construction
The set of clusters obtained through Algorithm 1 is used as an input to
the Graph Construction block.
The basic idea is to build a graph where each cluster corresponds to a
node, and two nodes are connected by an edge if a geometrical relation exists
(parallelism or orthogonality) between the directions of the corresponding
clusters.
More formally, every snapshot taken by the Kinect and processed by the
Segmentation block is associated with a graph G = (N, E), where N = {n_i} is a set of nodes and E = {e_ij} is a set of edges.
Each node n_i is labelled with:
- the cluster type, i.e., tube or plane;
- the cluster direction;
- the cluster relative size, i.e., the number of points belonging to the cluster divided by the number of points of the whole object.
Each edge e_ij is characterized by the nodes it connects, and by the type of relationship between the corresponding clusters:
- two tubes are defined as orthogonal if their directions are orthogonal;
- two tubes are defined as parallel if their directions are parallel;
- two planes are defined as orthogonal if their directions are orthogonal;
- a tube is defined as orthogonal to a plane if their directions are parallel.
The algorithm for building the graph by expressing parallelism and orthogonality between tubes and planes is not shown here for the sake of brevity.
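A possible formulation of these tests is nevertheless sketched below for completeness: cluster directions are compared through their dot product, with an angular tolerance (the 10-degree value is an assumption for illustration only, not a parameter reported in the paper).

import numpy as np

def relation(type_a, dir_a, type_b, dir_b, tol_deg=10.0):
    # Return the edge type between two labelled clusters, or None.
    # dir_a, dir_b are unit vectors: the main axis for tubes, the normal for planes.
    cos = abs(float(np.dot(dir_a, dir_b)))            # orientation-independent
    parallel = cos > np.cos(np.radians(tol_deg))
    orthogonal = cos < np.sin(np.radians(tol_deg))    # angle close to 90 deg
    if type_a == "tube" and type_b == "tube":
        if orthogonal: return "orthogonal tubes"
        if parallel:   return "parallel tubes"
    elif type_a == "plane" and type_b == "plane":
        if orthogonal: return "orthogonal planes"
    else:  # one tube and one plane: orthogonal when the axis is parallel to the normal
        if parallel:   return "tube orthogonal to plane"
    return None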
It is worth pointing out that using predefined categories for relations between clusters, such as “parallel” or “orthogonal”, is not common in the state of the art: instead, explicit angular values are usually preferred to quantitatively express the relative orientations of object components at a higher resolution. The simplified approach proposed here is motivated by the type of objects to be modelled, i.e., furniture-like objects, which play a relevant role in mobile robot navigation tasks.
6. Model Extraction
When the Graph Construction block has processed a sufficient number of
snapshots, the resulting graphs can be used as inputs to the Model Extraction
block, to the end of computing a single graph which is “representative” of
a set of snapshots. Specifically, a set of snapshots of the same object taken
from neighbouring positions composes a view (e.g., the front side of a chair).
Each view (e.g., front, side, back) originates a model representing the object.
In Figure 6, it is possible to see five snapshots corresponding to the front
view of a chair.
As the reader may notice, a problem arises about how to estimate which
snapshots should be grouped to generate an object view. In the current
Figure 6: Five snapshots corresponding to the front view of a chair: different colors
correspond to different clusters.
implementation, this is done manually6: however, by adopting the same criterion used to evaluate the similarity between models in the Recognition block (described in detail in Section 7), the design of an automatic procedure appears feasible and is currently being investigated. For
instance, while turning around the object and taking subsequent snapshots,
the system can iteratively evaluate the similarity of the current snapshot with
previous ones: when it detects a sudden decrement in the similarity function
(e.g., the back of the seat in Figure 6 is no more perceived as a plane), it
infers that it is necessary to start a new model, which represents a different
view of the object.
6. As it will be clarified in Section 8, four views are currently considered for each model stored in the Object model database, and each view is built in the Off-line phase by processing 5 snapshots taken from different positions around the object with an angular displacement of 18 deg. In the On-line phase, a run-time window of 2 subsequent snapshots is used to build the run-time model, which is then matched against models stored in the database for object recognition.
In order to extract a model comprising the invariant characteristics of a set of snapshots, we first introduce a distance measure to determine how similar two nodes are. Let n_i, n_j be two nodes, where n_j has a number of incident edges greater than or equal to that of n_i. The following functions are in order:
- d_S(n_i, n_j) returns a real number between 0 and 1, computed as the ratio between the sizes of the clusters corresponding to n_i and n_j (i.e., the number of points composing the smaller cluster divided by the number of points composing the bigger cluster);
- d_T(n_i, n_j) returns a number which counts, for each edge e_ik connecting n_i to an adjacent node n_k, whether there exists an edge e_jl of n_j expressing the same type of geometrical relationship with an adjacent node n_l of the same type as n_k (i.e., if the correspondence is found, the counter is increased by one, and e_jl is no longer available to be matched against another edge of n_i; if the correspondence is not found, the counter is decreased);
- d_N(n_i, n_j) returns a number greater than or equal to 0, computed as the difference between the number of edges of n_j and the number of edges of n_i;
- d_D(n_i, n_j) returns a real number, ranging from 0 to π/2, corresponding to the absolute value of the relative angle between the directions of the two clusters corresponding to n_i and n_j.
The similarity score s is computed as

s = w_s d_S(n_i, n_j) + w_t d_T(n_i, n_j) − w_n d_N(n_i, n_j) − w_d d_D(n_i, n_j)   (1)

when n_i and n_j are of the same type (i.e., they are either both tubes or both planes), and

s = 0   (2)

otherwise. The variables w_s, w_t, w_n, w_d represent weighting factors, which weigh the different components of the similarity score7. The similarity score is saturated to 1 if it has a higher value, or set to 0 if it has a negative value.
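The following sketch illustrates how the similarity score of equations (1) and (2) could be computed. It assumes that each node is represented as a small Python dictionary with its type, relative size, unit direction and list of edges (an assumption of this sketch, not the actual data structure); the weights are those reported in footnote 7.

import numpy as np

# Assumed node structure for this sketch: a dict with 'type' ("tube"/"plane"),
# relative 'size', unit 'direction', and 'edges', a list of (relation, other_type) pairs.

W_S, W_T, W_N, W_D = 1.0, 0.1, 0.1, 0.0025    # weights from footnote 7

def similarity(ni, nj):
    # Similarity score between two nodes, following equations (1) and (2).
    if ni["type"] != nj["type"]:
        return 0.0                                          # equation (2)
    if len(nj["edges"]) < len(ni["edges"]):                 # nj has >= edges
        ni, nj = nj, ni
    d_s = min(ni["size"], nj["size"]) / max(ni["size"], nj["size"])
    available = list(nj["edges"])                           # greedy edge matching
    d_t = 0
    for edge in ni["edges"]:
        if edge in available:
            available.remove(edge)    # each edge of nj matched at most once
            d_t += 1
        else:
            d_t -= 1
    d_n = len(nj["edges"]) - len(ni["edges"])
    cos = np.clip(abs(np.dot(ni["direction"], nj["direction"])), 0.0, 1.0)
    d_d = np.arccos(cos)              # relative angle in [0, pi/2]
    s = W_S * d_s + W_T * d_t - W_N * d_n - W_D * d_d
    return float(np.clip(s, 0.0, 1.0))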
The algorithm for building a model starting from a set of snapshots works
in two phases: the first part of the model extraction algorithm (Algorithm 2)
analyses each snapshot, and returns the subset of nodes of the corresponding
graph which are candidates to be included in the model: that is, it selects
nodes that are more likely to have a correspondence in other snapshots (thus
filtering out noise), but without univocally fixing this correspondence. The
second part (Algorithm 3) considers the subset of candidate nodes corre-
sponding to each snapshot and finally computes the model, by fixing a uni-
vocal correspondence between nodes belonging to different snapshots. Notice
that the two algorithms are executed in cascade, with the output of the for-
mer (the set of candidate nodes) constituting the input of the latter, and
could ideally be seen as a single algorithm.
Algorithm 2 requires the following variables and functions:
7. In the current implementation, w_s = 1, w_t = 0.1, w_n = 0.1, w_d = 0.0025.
- G = {G_i} is a set of graphs corresponding to a set of subsequent snapshots, i.e., taken at subsequent times from neighbouring positions using the Kinect; |G| is the number of graphs in G, and |N_i| is the number of nodes in G_i;
- F = {f_ik} is a set of boolean values, the element f_ik being true if the kth node in graph G_i belongs to the set of candidate nodes, false otherwise;
- S = {s_ik} is a set of real numbers, with the same number of elements as F;
- get(G, k) returns the kth node in G;
- similarity(n_i, n_j) returns the similarity score between n_i and n_j, computed according to (1), (2).
Algorithm 2 works as follows. Line 1 initializes S and F. Lines 2–13 iterate over all the available snapshots: in each iteration, a different graph G_i in G (corresponding to a different snapshot) is taken as the reference graph, and all its nodes n_k are considered. Lines 5–7 compare n_k with all the nodes belonging to graphs G_j ≠ G_i, and the node n_l in G_j which returns the maximum similarity score is used to update the cumulative value s_ik. Line 8 computes the average similarity score for each node n_k in G_i. Lines 9–11 label n_k as a candidate node to belong to the final model if the average similarity score is higher than a given threshold. Generally speaking, Algorithm 2 evaluates how often a node in a snapshot is also present in the other snapshots, and returns the set of labels F which classify each node as either a possible candidate for the model or not.
Algorithm 2 Model Extraction Part I, Computing Candidate Nodes
Require: G, th_s
Ensure: F
1: initialize s_ik = 0, f_ik = false ∀ i, k
2: for i = 1 → |G| do
3:   for k = 1 → |N_i| do
4:     n_k = get(G_i, k)
5:     for j = 1 → |G|, j ≠ i do
6:       s_ik = s_ik + max_{n_l ∈ G_j} [similarity(n_k, n_l)]
7:     end for
8:     s_ik = s_ik / (|G| − 1)
9:     if s_ik ≥ th_s then
10:      f_ik = true
11:    end if
12:  end for
13: end for
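For illustration, a compact Python rendering of Algorithm 2 is reported below; it assumes that each graph is given as a list of node dictionaries, reuses the similarity() sketch introduced above, and uses an arbitrary placeholder value for the threshold th_s.

def candidate_nodes(graphs, th_s=0.5):
    # Sketch of Algorithm 2: for every node of every graph (snapshot), average
    # its best similarity score against each other graph, and mark it as a
    # candidate when the average reaches th_s (the default value is assumed).
    flags = []                                   # F = {f_ik}
    for i, Gi in enumerate(graphs):
        fi = []
        for nk in Gi:
            total = 0.0
            for j, Gj in enumerate(graphs):
                if j == i:
                    continue
                total += max(similarity(nk, nl) for nl in Gj)
            avg = total / (len(graphs) - 1)      # average over the other snapshots
            fi.append(avg >= th_s)
        flags.append(fi)
    return flags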
Algorithm 3 requires the following variables and functions:
- M = {m_i} is a model, i.e., a set of nodes; each node m_i is labelled with its type and size, as well as with the number and type of relationships (orthogonality and parallelism) it is involved in8;
- H = {h_ik} is a set of real numbers with the same dimension as S and F;
- put(R, n) inserts the node n into the set R;
- candidates(G_i) returns the subset of candidate nodes belonging to the graph G_i, i.e., the nodes n_k ∈ N_i such that f_ik = true;
- setNotCandidate(R, F) sets to false the values of all elements in the set F corresponding to the nodes contained in the set R.
The algorithm works as follows. Line 1 initializes all elements of H, as well as the set of nodes R. The main cycle in lines 2–20 considers all nodes n_k of all graphs G_i ∈ G. First, if n_k has been labelled as a candidate node, lines 6–12 search every other graph G_j for the most similar candidate node n_l, and update the cumulative value h_ik only if the matching score is higher than a threshold. Second, after n_k has been compared with all nodes n_l, lines 13–17 compare the final value h_ik with a threshold to decide whether n_k should be included in the model or not. Notice also that lines 9 and 15, taken together, have the effect of removing from the set of candidate nodes all nodes n_j which have been judged very similar to n_k, thus preventing them from being reconsidered in subsequent iterations.
8. Notice that M is not a graph, since it does not include edges: when building the model starting from a set of graphs G, information about the graph connectivity is lost to improve computational efficiency in the recognition phase; see Figures 10 and 11.
Algorithm 3 Model Extraction Part II, Computing Model
Require: G, F, th_s1, th_s2
Ensure: M
1: h_ik = 0 ∀ i, k; R = ∅
2: for i = 1 → |G| do
3:   for k = 1 → |N_i| do
4:     if f_ik = true then
5:       n_k = get(G_i, k)
6:       for j = 1 → |G|, j ≠ i do
7:         s = max_{n_l ∈ candidates(G_j)} [similarity(n_k, n_l)]
8:         if s ≥ th_s1 then
9:           put(R, get(G_j, l))
10:          h_ik = h_ik + 1/(|G| − 1)
11:        end if
12:      end for
13:      if h_ik ≥ th_s2 then
14:        put(M, get(G_i, k))
15:        setNotCandidate(R, F)
16:        R = ∅
17:      end if
18:    end if
19:  end for
20: end for
Generally speaking, Algorithm 3 tries to match the candidate nodes belonging to different graphs, and stores in the model only those candidate nodes that appear in a sufficient number of snapshots.
It should be finally observed that finding correspondences between clus-
ters perceived in subsequent snapshots is similar to the data association
problem in multiple object tracking. Traditional approaches in this field,
such as Joint Probabilistic Data Association Filtering (JPDAF) [47] and
Multi-Hypothesis Tracking [48], rely on the computation of the probability
of each association for all possible matches. These probabilistic approaches
are widely used in many applications, but they are usually characterized by a
higher computational complexity with respect to a deterministic, greedy ap-
proach (see, for instance, [49, 50]), which is therefore preferred in the present
work.
7. Recognition
In the On-line phase, Algorithm 4 is used to recognize if an object cur-
rently perceived by the Kinect is similar to one of the objects that have been
stored in the Object model database during the Off-line phase. Specifically,
the goal of Algorithm 4 is to compute a recognition score, which corresponds
to the best match between the observed object and the pre-stored models.
Recalling that each model is a set of nodes, each node being labelled with its type and size, as well as with the number and type of relationships it is involved in, the following variables and functions are in order:
- M = {M_i} is the set of all models which have been saved in the Off-line phase; |M| is the dimension of the set;
- M′ = {m′_i} is an On-line model, i.e., a set of nodes built using the point cloud gathered by the Kinect in subsequent snapshots, i.e., using the same scheme proposed for the Off-line phase (involving Algorithms 2 and 3); |M′| is the dimension of the set;
- size(n) returns the relative size of the cluster corresponding to the node n, i.e., the ratio of points belonging to n with respect to the whole object;
- U = {u_i} is a set of real numbers with dimension |U| = |M|.
Algorithm 4 first compares each node included in each pre-stored model with the nodes contained in the On-line model M′, by searching for the best correspondences and computing a recognition score which is used to select the best matching pre-stored model M_l. Then, the recognition score is updated by performing the whole operation in the reverse order, i.e., by comparing each node in the best matching model M_l with the nodes in the On-line model M′.
Specifically, Algorithm 4 works as follows. Lines 2–7 compare each node n_j in each Off-line model M_i with the nodes n_k in the On-line model M′: lines 3–5 search for the node n_k which is most similar to n_j, weighing the matching score with the relative size of n_j, and integrating the matching scores of individual nodes; line 6 computes the matching score u_i of every model M_i by averaging over the number of nodes in M_i.
Algorithm 4 Object recognition
Require: M, M′
Ensure: v
1: initialize u_i = 0 ∀ i, v_1 = 0, v_2 = 0
2: for i = 1 → |M| do
3:   for all n_j ∈ M_i do
4:     u_i = u_i + size(n_j) · max_{n_k ∈ M′} [similarity(n_j, n_k)]
5:   end for
6:   u_i = u_i / |M_i|
7: end for
8: v_1 = max_i u_i
9: l = arg max_i u_i
10: for all n_j ∈ M′ do
11:   v_2 = v_2 + size(n_j) · max_{n_k ∈ M_l} [similarity(n_j, n_k)]
12: end for
13: v_2 = v_2 / |M′|
14: v = min(v_1, v_2)
As anticipated in Section 4, considering the size of clusters in the recognition process allows for attributing a greater importance to those clusters that play a fundamental role in characterizing an object.
Lines 8–9 compute the recognition score v_1 as the matching score of the model M_l which is the most similar to M′.
Lines 10–13 re-consider the two models M_l and M′, but this time the matching scores are computed by iterating over the nodes in the On-line model M′: this is required because M_l and M′ do not necessarily have the same number of nodes, and therefore searching for correspondences starting from M_l's nodes does not necessarily yield the same result as starting from M′'s nodes. In this phase, for every node n_j in M′, the best matching node n_k in M_l is found, and the recognition score v_2 is computed by integrating the contributions of individual nodes and averaging over the number of nodes in M′. Finally, line 14 computes the final recognition score v as the minimum between v_1 and v_2, which allows the attribution of a higher score to pairs M_l and M′ whose clusters have a bidirectional correspondence.
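A compact sketch of Algorithm 4, under the same assumptions used in the previous sketches (models given as lists of node dictionaries, similarity() as defined above), is reported below.

def recognition_score(models, online_model):
    # Bidirectional matching between the On-line model and the pre-stored models.
    # forward pass: each pre-stored model against the on-line model
    scores = []
    for Mi in models:
        u = sum(node["size"] * max(similarity(node, nk) for nk in online_model)
                for node in Mi) / len(Mi)
        scores.append(u)
    v1 = max(scores)
    best = models[scores.index(v1)]          # M_l, the best matching model
    # reverse pass: the on-line model against the best matching model
    v2 = sum(node["size"] * max(similarity(node, nk) for nk in best)
             for node in online_model) / len(online_model)
    return min(v1, v2)                       # bidirectional recognition score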
8. Experimental Results
The dataset of objects which have been used for experiments is shown in
Figure 7.
Experiments have been performed by placing an object in the center of
the room, and by moving the Kinect around it to take snapshots every 18
degrees, at a distance of about 1.5 meters, see Figure 8. Notice that no
calibration is required: however, in the current implementation, snapshots
Figure 7: Objects used in experiments.
Figure 8: Experimental setup.
are taken by assuming that the Kinect is not moving and it is parallel to the
floor, i.e., no experiments have yet been performed by placing the sensor on
a moving robot.
8.1. Graph Construction
Figure 9 shows two examples of graphs corresponding to Chair 1 and
Desk 1 (on the left), which have been built starting from two point clouds
decomposed into “almost convex” clusters (on the right). Blue nodes repre-
sent planes, pink nodes represent tubes. Pink edges connect two orthogonal
tubes, blue edges connect two parallel tubes. Yellow edges connect orthog-
onal tubes and planes, green edges connect orthogonal planes. The relative
size of each node is evaluated in terms of the number of points included in
the associated cluster with respect to the total number of points in the ob-
ject. Remember that the direction of a tube is defined as the eigenvector
associated to the highest eigenvalue of the covariance matrix describing the
cluster, whereas the direction of a plane is the eigenvector associated to the
lowest eigenvalue.
Figure 9: Two examples of graph reconstruction, using Chair 1 (top) and Desk 1 (bot-
tom). Blue nodes represent planes, pink nodes represent tubes. Pink edges connect two
orthogonal tubes, blue edges connect two parallel tubes. Yellow edges connect orthogonal
tubes and planes, green edges connect orthogonal planes.
8.2. Model Extraction
Figure 10 shows the models which have been extracted starting from
different snapshots of Chair 1 (5 snapshots per every view have been consid-
ered). The three models represent the front, side, and back view of Chair
1. In this Figure, each node of the original graph is represented as a sepa-
rate entity (e.g., in the front view there are 8 nodes), by reporting its type
(blue for planes, pink for tubes), the percentage of points that constitutes it
with respect to the whole object (white label9), and finally its adjacent nodes
(pink edges connect orthogonal tubes, blue edges connect parallel tubes, yel-
low edges connect orthogonal tubes and planes, and green edges connect
orthogonal planes). As it is possible to notice, both the front (top) and the
9. The sum of individual percentage values is not 100, because not all of the clusters composing the object in the original point cloud are used to build the model.
Figure 10: Three models of Chair 1, seen from the front view (top), from the side
(left/right) view (middle) and from the back view (bottom). Blue nodes represent planes,
pink nodes represent tubes. Pink edges connect two orthogonal tubes, blue edges connect
two parallel tubes. Yellow edges connect orthogonal tubes and planes, green edges connect
orthogonal planes.
back view (bottom) include a plane node with a relative size of more than
30% of the whole object. This represents the back of the chair which is not
visible if the chair is observed from the left or from the right view. The other
nodes effectively offer a coherent representation of the seat (a plane with at
least four orthogonal tubes, i.e., the legs) and the legs (at least three tubes
parallel to at least three other tubes).
As another example, Figure 11 shows the front, side, and back model of
Desk 1. Even in this case, the graphs are coherent with the most relevant
characteristics of the table. In fact the table top (a plane node) is visible
from every point of view and corresponds to the most significant cluster in
terms of its relative size. In the front and the back view it is possible to
identify the two legs (the two tube nodes), whereas only one leg is visible
from the left/right view.
8.3. Recognition
In order to validate the approach, we first try to recognize the same object
used to create the model. The result of this experiment, as far as Chair 1 is concerned, is shown in Figure 12. Two different datasets are used: the former
(composed of 15 snapshots) to build the front, side, and back models in the
Off-line phase (5 snapshots for each model)10; the latter (composed of 20
snapshots) to perform object recognition in the On-line phase, using a run-
time window of 2 subsequent snapshots to build run-time models, which are
then matched against models stored in the database.
10. Only 15 snapshots are required, since the same data can be used to model both the left and the right profile of the chair.
Figure 11: Three models of Desk 1, seen from the front view (top), from the left/right
view (middle) and from the back view (bottom).
[Figure 12 plots the similarity coefficient (vertical axis) against the snapshot number, 1–20 (horizontal axis), for Chair 1 Recognition with respect to the front, side, and back models of Chair 1; the central front, side, and back views are marked with vertical lines.]
Figure 12: After having computed the model associated to Chair 1, another set of snapshots of the same chair is used for recognition. The recognition score is 0.880253.
The horizontal axis corresponds to the different snapshots acquired at run-time, which have been taken at subsequent times by moving the Kinect
around the object: positions which correspond to the central (and hence
most representative) front, side and back views are outlined using a vertical
line. The vertical axis plots the recognition score. Each curve corresponds
to a different model pre-stored in the database, again corresponding to front,
side, and back models of the same object.
The overall recognition score is computed by considering, for every run-
time snapshot, the model in the database whose curve yields the maximum
score, and by averaging over the 20 snapshots taken by turning around the
object. Please notice in Figure 12 that the scores corresponding to the differ-
ent models grow and decrease coherently with the point of view from which
Chair 1 is seen: that is, when the Kinect is taking a snapshot from the front
view (e.g., snapshot number 3), the highest matching score is returned by
the front model (i.e., the blue curve). In this case the overall recognition
score is 0.880253.
As previously said, the proposed technique aims to recognize not only a specific instance of an object, but also categories of objects. Figure 13 reports
the results obtained trying to recognize Chair 2 and Chair 3 using the model
of Chair 1. To this purpose, 20 snapshots of Chair 2 and Chair 3 are taken,
5 for each view. The overall recognition scores for Chair 2 and Chair 3 are,
respectively, 0.711573 and 0.662804. As shown in Figure 14, these scores are
significantly higher with respect to the score obtained trying to recognize
another kind of object, i.e., Desk 1, using the models of Chair 1. The overall
recognition score in this case is, in fact, equal to 0.314906, less than a half of
the lowest score obtained trying to recognize other chairs.
In order to validate these results, the front, side, and back models of
Desk 1 are stored in the database. Then, the snapshots of Desk 1 acquired
at run-time are matched against the pre-stored models of the same object.
The result is shown in Figure 15: the overall recognition score is equal to
0.87558, even though the 11th snapshot is very noisy. Also, when comparing
both Desk 2 and Chair 1 with the pre-stored models of Desk 1 (Figure 16),
the results are consistent with those previously described: the overall
recognition score for Desk 2 is equal to 0.838437, while the overall
recognition score for Chair 1 is 0.37914.
The recognition scores for the whole dataset are summarized in Table 1.
Ch 1 stands for Chair 1, Ch 2 for Chair 2, Ch 3 for Chair 3, De 1 for Desk 1,
De 2 for Desk 2, Ch 2 T for Chair 2 Tilted (i.e., Chair 2 lying on its right
side), La for Ladder, Ha for Hanger, MR 1 for Mobile Robot 1, and MR 2 for
Mobile Robot 2; Noise represents a random sequence of snapshots. Each row
corresponds to the model of the corresponding object acquired in the On-line
phase, and each column corresponds to the models of an object stored in
the database. For each row, the two highest recognition scores are displayed
in bold, if above a threshold of 0.6.
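As an illustration of how such a table could be turned into a decision on board the robot, the snippet below applies the 0.6 threshold used for the bold entries together with a simple argmax rule; the rule itself is our assumption for illustration, not a procedure stated in the article.

def classify(overall_scores, threshold=0.6):
    """Given one row of Table 1 as a dictionary {model name: overall score},
    return the best-matching model (or None if below threshold) and the list
    of all models whose score reaches the threshold."""
    best = max(overall_scores, key=overall_scores.get)
    candidates = [m for m, s in overall_scores.items() if s >= threshold]
    return (best if overall_scores[best] >= threshold else None), candidates

# Example with the Chair 1 row of Table 1.
chair1_row = {"Ch 1": 0.880, "Ch 2": 0.711, "Ch 3": 0.663, "De 1": 0.315,
              "De 2": 0.302, "Ch 2 T": 0.369, "La": 0.391, "Ha": 0.010,
              "MR 1": 0.482, "MR 2": 0.651}
print(classify(chair1_row))  # ('Ch 1', ['Ch 1', 'Ch 2', 'Ch 3', 'MR 2'])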
As can be observed, each object gets the highest score when matched against
its own model in the database (the values on the diagonal). The only
exception is Chair 2 Tilted (which lies on its side, and hence presents
different views with respect to Chair 1, Chair 2, and Chair 3): the object
gets the highest score when matched against Chair 3, which however
[Plots: "Chair 2 Recognition (with respect to Chair 1 model)" (top) and "Chair 3 Recognition (with respect to Chair 1 model)" (bottom). Similarity coefficient vs. snapshot number 1-20; one curve per pre-stored model (front, side, back).]
Figure 13: After having computed the model associated to Chair 1, sets of snapshots
representing Chair 2 (top) and Chair 3 (bottom) are used for recognition. The recognition
scores are 0.711573 for Chair 2 and 0.662804 for Chair 3.
[Plot: "Desk 1 Recognition (with respect to Chair 1 model)". Similarity coefficient vs. snapshot number 1-20; one curve per pre-stored model (front, side, back); vertical lines mark the central front, side, back, and side views.]
Figure 14: After having computed the model associated to Chair 1, a set of snapshots
representing Desk 1 is used for recognition. The recognition score is 0.314906.
[Plot: "Desk 1 Recognition (with respect to Desk 1 model)". Similarity coefficient vs. snapshot number 1-20; one curve per pre-stored model (front, side, back); vertical lines mark the central front, side, back, and side views.]
Figure 15: After having computed the model associated to Desk 1, another set of snapshots
of the same table is used for recognition. The recognition score is 0.87558.
[Plots: "Desk 2 Recognition (with respect to Desk 1 model)" (top) and "Chair 1 Recognition (with respect to Desk 1 model)" (bottom). Similarity coefficient vs. snapshot number 1-20; one curve per pre-stored model (front, side, back).]
Figure 16: After having computed the model associated to Desk 1, sets of snapshots
representing Desk 2 (top) and Chair 1 (bottom) are used for recognition. The recognition
scores are 0.838437 for Desk 2 and 0.37914 for Chair 1.
Ch 1 Ch 2 Ch 3 De 1 De 2 Ch 2 T La Ha MR 1 MR 2
Ch 1 0.880 0.711 0.663 0.315 0.302 0.369 0.391 0.010 0.482 0.651
Ch 2 0.681 0.815 0.619 0.227 0.271 0.486 0.540 0.038 0.319 0.629
Ch 3 0.622 0.639 0.768 0.355 0.279 0.632 0.645 0.067 0.367 0.560
De 1 0.379 0.187 0.418 0.876 0.838 0.120 0.129 0.033 0.144 0.227
De 2 0.378 0.323 0.427 0.669 0.741 0.205 0.291 0.065 0.226 0.200
Ch 2 T 0.485 0.600 0.713 0.372 0.219 0.666 0.650 0.050 0.284 0.561
La 0.478 0.558 0.635 0.290 0.236 0.621 0.697 0.114 0.345 0.564
Ha 0.156 0.162 0.298 0.100 0.113 0.360 0.374 0.703 0.156 0.167
MR 1 0.454 0.431 0.377 0.171 0.136 0.262 0.252 0.013 0.653 0.686
MR 2 0.513 0.524 0.496 0.265 0.157 0.415 0.402 0.032 0.608 0.701
Noise 0.245 0.358 0.327 0.356 0.059 0.392 0.477 0.033 0.276 0.343
Table 1: Overall recognition scores: rows and columns correspond, respectively, to On-line
and Off-line models. Ch 1: Chair 1, Ch 2: Chair 2, Ch 3: Chair 3, De 1: Desk 1, De 2:
Desk 2, Ch 2 T: Chair 2 Tilted, La: Ladder, Ha: Hanger, MR 1: Mobile Robot 1, MR 2:
Mobile Robot 2. Noise represents a random sequence of snapshots.
Figure 17: Additional objects used in the On-line phase.
belongs to the same category. In general, objects of the same category obtain
recognition scores higher than the recognition threshold, whereas objects of
different categories do not: an interesting counterexample is Ladder, which
gets a high recognition score when matched against Chair 3. This is quite
reasonable, since the similarity between the ladder and the chairs in Figure
7 is high, both having vertical legs and a plane on which it is possible to sit.
Finally, and very importantly, Noise is never recognized as an object: the
absence of false positives when the Kinect collects random snapshots in
the environment is a necessary condition for the applicability of the system
on a robot involved in complex navigation tasks.
As a first attempt to validate the generalization capabilities of the ap-
proach, the performance of the system is also tested by trying to recognize
Ch 1 Ch 2 Ch 3 De 1 De 2 Ch 2 T La Ha MR 1 MR 2
Ch 4 0.708 0.723 0.638 0.308 0.235 0.391 0.387 0.012 0.509 0.695
Ch 5 0.617 0.626 0.777 0.418 0.299 0.557 0.554 0.040 0.446 0.565
Table 2: Overall recognition scores: rows and columns correspond, respectively, to On-line
and Off-line models. Ch 4: Chair 4, Ch 5: Chair 5.
Chair 4 and Chair 5 in Figure 17, which have not been considered in the Off-
line phase and hence do not have corresponding models in the Object model
database. Table 2 reports the recognition scores, showing that Chair 4 and
Chair 5, as expected, get higher recognition scores when matched against
chairs.
Table 3 reports the computational time (average, maximum, and min-
imum value) required to execute all the phases above when processing
small-size (e.g., hangers), medium-size (e.g., chairs), and big-size (e.g., desks)
objects. It should be noticed that Graph Construction concerns the time re-
quired for each individual snapshot, and it should be multiplied by
the number of snapshots required to build a model: in the Off-line phase,
each model requires 5 snapshots, whereas in the On-line phase the system
currently requires 2 subsequent snapshots to build an on-line model to be
compared with the database of off-line models. Similarly, Recognition con-
cerns the time required to compare the on-line model with one model in the
database, and it should be multiplied by the number of objects in
the database. As the reader may notice, the time required to process big-size
(Experiments were performed on an Intel(R) Core(TM) 2 Duo CPU T6670 @ 2.20 GHz with 4 GB of RAM.)
Graph Construction (per snapshot)    Av.       Max.      Min.
Small-size objects                   0.4 s     0.57 s    0.31 s
Medium-size objects                  2.0 s     3.2 s     1.55 s
Big-size objects                     21 s      30 s      9 s
Model extraction                     0.11 s    0.19 s    0.06 s
Recognition (per model)              0.022 s   0.03 s    0.015 s
Table 3: Computational Time Analysis.
objects is quite high in the current setup, and should be reduced to allow the
system to be efficiently adopted in a real-world robotic scenario.
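Using the figures in Table 3, a rough estimate of the per-object On-line cost (a back-of-the-envelope reading of the table, not a formula from the implementation) is

\[ T_{\mathrm{on\text{-}line}} \approx 2\,T_{\mathrm{graph}} + T_{\mathrm{model}} + N_{\mathrm{db}}\,T_{\mathrm{rec}}, \]

where T_graph is the per-snapshot Graph Construction time, T_model the Model extraction time, T_rec the per-model Recognition time, and N_db the number of comparisons against the database. For a medium-size object and ten database entries this gives roughly 2 · 2.0 s + 0.11 s + 10 · 0.022 s ≈ 4.3 s, i.e., the cost is dominated by graph construction, consistently with the remark above about big-size objects.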
9. Conclusions
A new approach to object recognition and classification has been pre-
sented, aimed at enabling a mobile robot to classify the objects encountered
in the course of a mission, with the final goal of predicting their affordances
and behaving accordingly.
In order to model and recognize objects, a Microsoft Kinect return-
ing depth information is used: starting from the consideration that most
furniture-like objects in human environments can be modelled as composed
of simpler parts, the point cloud returned by the Kinect is first segmented
into a set of "almost convex" clusters and modelled through a graph which
represents mutual relationships between simpler geometrical primitives. Off-
line, subsequent snapshots of the same object taken from different points of
view are processed and merged, in order to produce multiple-view models
that are used to populate a database. On-line (ideally, whenever the robot
encounters a new object while exploring the environment), snapshots are
taken and used to search for a correspondence in the database, possibly
adding a new instance to it if the new object does not match any of the
pre-stored ones.
The system has been validated through experiments. The results obtained
so far show that the system is able to correctly match objects against
other objects belonging to the same class, i.e., individual chairs always get
the highest matching score when compared with other chairs in the Object
model database. The recognition system relies on the structure of objects,
and therefore is not able to differentiate, for example, between a round table
and a square table. However, it is able to differentiate between a table and
a stool using size information, thus proving to be an effective tool whenever
categorization is required, rather than the recognition of individual entities.
It should be remarked that the experiments performed have some limita-
tions. Currently, only a few classes of objects have been considered (chairs,
desks, ladders, hangers, and mobile robots), and therefore the analysis should
be extended by considering more classes of objects in the Off-line phase, as
well as more individual objects to be recognized in the On-line phase. Also,
all experiments discussed in this article assume that the Kinect is placed at
a standard distance from the object (i.e., ranging from 1.5 m to 2 m), and
we have ignored the problem of occlusions. This implies that a robot, in order
to recognize an object, should first approach it to have a good view of it: this is
a constraint that can be met in a number of real-world scenarios, but should
be relaxed in order to extend the domain of applicability of the system.
As future work, in addition to addressing all the issues above, we plan to
integrate the system into the on-board architecture of our robot Merry Porter
(commercialized by Genova Robot srl, www.genovarobot.com) for transportation
and surveillance [51], and to automate the whole process of populating the
database and matching.
References
[1] J. Gibson, The Ecological Approach to Visual Perception, Houghton
Mifflin, Boston, 1979.
[2] L. Lopes, A. Chauhan, Scaling up category learning for language ac-
quisition in human-robot interaction (LangRo'2007), in: Symposium on
Language and Robots, 2007, pp. 83–92.
[3] T. Nakamura, T. Nagai, N. Iwahashi, Multimodal object categoriza-
tion by a robot, in: IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS 2007), 2007, pp. 2415–2420.
[4] S. Gachter, A. Harati, R. Siegwart, Incremental object part detection to-
ward object classification in a sequence of noisy range images, in: IEEE
International Conference on Robotics and Automation (ICRA 2008),
2008, pp. 4037–4042.
[5] Z.-C. Marton, R. Rusu, D. Jain, U. Klank, M. Beetz, Probabilistic cat-
egorization of kitchen objects in table settings with a composite sensor,
in: IEEE/RSJ International Conference on Intelligent Robots and Sys-
tems (IROS 2009), 2009, pp. 4777–4784.
[6] K. Lai, D. Fox, Object recognition in 3d point clouds using web data
and domain adaptation, Int. J. Rob. Res. 29 (8) (2010) 1019–1037.
[7] W. Wohlkinger, M. Vincze, 3d object classification for mobile robots in
home-environments using web-data, in: 2010 IEEE 19th International
Workshop on Robotics in Alpe-Adria-Danube Region (RAAD), 2010,
pp. 247–252.
[8] J. Sun, J. L. Moore, A. Bobick, J. M. Rehg, Learning visual object
categories for robot affordance prediction, Int. J. Rob. Res. 29 (2-3)
(2010) 174–197.
[9] E. Başeski, N. Pugeault, S. Kalkan, L. Bodenhagen, J. H. Piater,
N. Krüger, Using multi-modal 3d contours and their relations for vision
and robotics, Journal of Visual Communication and Image Representa-
tion 21 (8) (2010) 850–864.
[10] K. Lai, L. Bo, X. Ren, D. Fox, A large-scale hierarchical multi-view rgb-
d object dataset, in: IEEE International Conference on Robotics and
Automation (ICRA 2011), 2011, pp. 1817–1824.
[11] L. Bo, X. Ren, D. Fox, Depth kernel descriptors for object recognition,
in: 2011 IEEE/RSJ International Conference on Intelligent Robots and
Systems (IROS), 2011, pp. 821–826.
[12] X. Ren, L. Bo, D. Fox, Rgb-(d) scene labeling: Features and algorithms,
in: 2012 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), 2012, pp. 2759–2766.
[13] J. Sinapov, C. Schenck, K. Staley, V. Sukhoy, A. Stoytchev, Grounding
semantic categories in behavioral interactions: Experiments with 100
objects, Robotics and Autonomous Systems (2012), in press.
[14] D. G. Lowe, Distinctive image features from scale-invariant keypoints,
International Journal of Computer Vision 60 (2) (2004) 91–110.
[15] P. Sanz, R. Marin, J. Sanchez, Including efficient object recognition
capabilities in online robots: from a statistical to a neural-network clas-
sifier, IEEE Transactions on Systems, Man, and Cybernetics, Part C:
Applications and Reviews 35 (1) (2005) 87–96.
[16] S. Griffith, J. Sinapov, M. Miller, A. Stoytchev, Toward interactive
learning of object categories by a robot: A case study with container
and non-container objects, in: IEEE 8th International Conference on
Development and Learning (DEVLRN ’09), IEEE Computer Society,
Washington, DC, USA, 2009, pp. 1–6.
[17] N. Dag, I. Atil, S. Kalkan, E. Sahin, Learning affordances for categoriz-
ing objects and their properties, in: 20th International Conference on
Pattern Recognition (ICPR 2010), 2010, pp. 3089–3092.
[18] C. Castellini, T. Tommasi, N. Noceti, F. Odone, B. Caputo, Using ob-
ject affordances to improve object recognition, IEEE Transactions on
Autonomous Mental Development 3 (3) (2011) 207–215.
[19] A. Nüchter, H. Surmann, J. Hertzberg, Automatic classification of ob-
jects in 3d laser range scans, in: F. Groen, N. Amato, A. Bonarini,
E. Yoshida, B. Kröse (Eds.), IAS-8, Proc. 8th Conf. on Intelligent Au-
tonomous Systems, IOS Press, 2004, pp. 963–970.
[20] A. Johnson, M. Hebert, Using spin images for efficient object recogni-
tion in cluttered 3d scenes, IEEE Transactions on Pattern Analysis and
Machine Intelligence 21 (5) (1999) 433–449.
[21] J. Assfalg, M. Bertini, A. Bimbo, P. Pala, Content-based retrieval of 3-d
objects using spin image signatures, IEEE Transactions on Multimedia
9 (3) (2007) 589–599.
[22] M. Kazhdan, T. Funkhouser, S. Rusinkiewicz, Rotation invariant spheri-
cal harmonic representation of 3d shape descriptors, in: 2003 Eurograph-
ics/ACM SIGGRAPH symposium on Geometry processing (SGP ’03),
Eurographics Association, Aire-la-Ville, Switzerland, 2003, pp. 156–164.
[23] A. Mian, M. Bennamoun, R. Owens, Three-dimensional model-based
object recognition and segmentation in cluttered scenes, IEEE Trans-
actions on Pattern Analysis and Machine Intelligence 28 (10) (2006)
1584–1601.
[24] W. Li, G. Bebis, N. Bourbakis, 3-d object recognition using 2-d views,
IEEE Transactions on Image Processing 17 (11) (2008) 2236–2255.
[25] J.-J. Hernández-López, A.-L. Quintanilla-Olvera, J.-L. López-Ramírez,
F.-J. Rangel-Butanda, M.-A. Ibarra-Manzano, D.-L. Almanza-Ojeda,
Detecting objects using color and depth segmentation with kinect sen-
sor, Procedia Technology 3 (0) (2012) 196–204, the 2012 Iberoamerican
Conference on Electronics Engineering and Computer Science.
[26] V. Filipe, F. Fernandes, H. Fernandes, A. Sousa, H. Paredes, J. Barroso,
Blind navigation support system based on microsoft kinect, Procedia
Computer Science 14 (0) (2012) 94–101.
[27] J. Stückler, R. Steffens, D. Holz, S. Behnke, Efficient 3d object
perception and grasp planning for mobile manipulation in domestic
environments, Robotics and Autonomous Systems (2012), in press.
doi:10.1016/j.robot.2012.08.003.
[28] T. Stoyanov, R. Mojtahedzadeh, H. Andreasson, A. J. Lilienthal, Com-
parative evaluation of range sensor accuracy for indoor mobile robotics
and automated logistics applications, Robotics and Autonomous Sys-
tems (2012), in press. doi:10.1016/j.robot.2012.08.011.
[29] M. A. Fischler, R. A. Elschlager, The representation and matching of
pictorial structures, IEEE Transactions on Computers 22 (1) (1973) 67–
92.
[30] L. G. Shapiro, J. D. Moriarty, R. M. Haralick, P. G. Mulgaonkar,
Matching three-dimensional objects using a relational paradigm, Pat-
tern Recognition 17 (4) (1984) 385–405.
[31] I. Biederman, Recognition by components: A theory of human image
understanding., Psychological Review 94 (2) (1987) 115–147.
[32] Q.-L. Nguyen, M. D. Levine, Representing 3-d objects in range images
using geons, Computer Vision and Image Understanding 63 (1) (1996)
158–168.
[33] H. Rom, G. Medioni, Part decomposition and description of 3d shapes,
in: Proc. 12th International Conference on Pattern Recognition, volume
I, Society Press, 1994, pp. 629–632.
[34] L. Stark, K. Bowyer, Generic Object Recognition using Form and Func-
tion, Series in Machine Perception and Artificial Intelligence, World Sci-
entific, 1996.
[35] E. Rivlin, S. Dickinson, A. Rosenfeld, Recognition by functional parts
[function-based object recognition], in: IEEE Computer Society Confer-
ence on Computer Vision and Pattern Recognition (CVPR ’94), 1994,
pp. 267–274.
[36] G. Froimovich, E. Rivlin, I. Shimshoni, Object classification by func-
tional parts, in: First International Symposium on 3D Data Processing
Visualization and Transmission, 2002, pp. 648–655.
[37] M. A. Aycinena, Probabilistic geometric grammars for object recogni-
tion (August 2005).
[38] B. Steder, R. Rusu, K. Konolige, W. Burgard, Point feature extraction
on 3d range scans taking into account object boundaries, in: 2011 IEEE
Int. Conf. on Robotics and Automation (ICRA), 2011, pp. 2601–2608.
[39] P. Meßner, S. R. Schmidt-Rohr, M. Lösch, R. Jäkel, R. Dillmann, Local-
ization of furniture parts by integrating range and intensity data robust
against depths with low signal-to-noise ratio, Robotics and Autonomous
Systems (2012), in press.
[40] H.-P. Chiu, L. P. Kaelbling, T. Lozano-Pérez, Learning to generate
novel views of objects for class recognition, Comput. Vis. Image Un-
derst. 113 (12) (2009) 1183–1197.
[41] R. Fergus, P. Perona, A. Zisserman, Object class recognition by unsu-
pervised scale-invariant learning, in: Proc. 2003 IEEE Computer Society
Conference on Computer Vision and Pattern Recognition, Vol. 2, 2003,
pp. II–264–II–271 vol.2.
[42] T. Kadir, M. Brady, Saliency, scale and image description, Int. J. Com-
put. Vision 45 (2) (2001) 83–105.
[43] R. Rusu, S. Cousins, 3d is here: Point cloud library (pcl), in: 2011
IEEE International Conference on Robotics and Automation (ICRA),
2011, pp. 1–4.
[44] S. Blumenthal, E. Prassler, J. Fischer, W. Nowak, Towards identification
of best practice algorithms in 3d perception and modeling, in: 2011
IEEE International Conference on Robotics and Automation (ICRA),
2011, pp. 3554–3561.
[45] M. A. Fischler, R. C. Bolles, Random sample consensus: a paradigm
for model fitting with applications to image analysis and automated
cartography, Commun. ACM 24 (6) (1981) 381–395.
[46] T. Grundmann, R. Eidenberger, M. Schneider, M. Fiegert, G. V.
Wichert, Robust high precision 6D pose determination in complex en-
vironments for robotic manipulation, in: Workshop on Best Practice
in 3D Perception and Modeling for Mobile Manipulation, held on the
IEEE Int. Conf. on Robotics and Automation (ICRA 2010), Anchorage,
Alaska, 2010, pp. 1–7.
[47] T. E. Fortmann, Y. Bar-Shalom, M. Scheffe, Sonar tracking of multi-
ple targets using joint probabilistic data association, IEEE Journal of
Oceanic Engineering 8 (3) (1983) 173–184.
[48] D. Reid, An algorithm for tracking multiple targets, IEEE Transactions
on Automatic Control 24 (6) (1979) 843–854.
[49] C. Veenman, M. J. T. Reinders, E. Backer, Resolving motion correspon-
dence for densely moving points, IEEE Transactions on Pattern Analysis
and Machine Intelligence 23 (1) (2001) 54–72.
[50] K. Shafique, M. Shah, A non-iterative greedy algorithm for multi-frame
point correspondence, in: Proc. Ninth IEEE International Conference on
Computer Vision (ICCV 2003), 2003, pp. 110–115.
[51] F. Mastrogiovanni, A. Paikan, A. Sgorbissa, Semantic-aware real-
time scheduling in robotics, IEEE Trans. Robot., 1–18, in press.
doi:10.1109/TRO.2012.2222273.