6-DOF Model Based Tracking via Object
Coordinate Regression
Alexander Krull, Frank Michel, Eric Brachmann, Stefan Gumhold,
Stephan Ihrke, Carsten Rother
TU Dresden, Dresden, Germany
Abstract. This work investigates the problem of 6-Degrees-Of-Freedom
(6-DOF) object tracking from RGB-D images, where the object is rigid
and a 3D model of the object is known. As in many previous works, we
utilize a Particle Filter (PF) framework. In order to have a fast tracker,
the key aspect is to design a clever proposal distribution which works
reliably even with a small number of particles. To achieve this we build
on a recently developed state-of-the-art system for single image 6D pose
estimation of known 3D objects, using the concept of so-called 3D ob-
ject coordinates. The idea is to train a random forest that regresses
the 3D object coordinates from the RGB-D image. Our key technical
contribution is a two-way procedure to integrate the random forest pre-
dictions in the proposal distribution generation. This has many practical
advantages, in particular better generalization ability with respect to
occlusions, changes in lighting and fast-moving objects. We demonstrate
experimentally that we exceed state-of-the-art on a given, public dataset.
To raise the bar in terms of fast-moving objects and object occlusions,
we also create a new dataset, which will be made publicly available.
1 Introduction
In this paper we address the problem of tracking the pose of a previously known
rigid object from an RGB-D video stream in real time. The object pose is typically
expressed relative to the camera and has 6-DOF: three for orientation and three
for position. A solution to this problem has great impact in many application
fields, ranging from augmented reality and human-computer interaction to robotics.
However, given the constraints of high precision, real-time performance, and robustness to
real-world situations such as occlusions and changes in lighting conditions, this
is still an open problem.
Building on the results of multiple decades of research on object tracking, several
researchers have very recently re-investigated pose estimation [14, 21, 4] and
tracking approaches [16, 8, 18] in order to exploit new RGB-D sensor technology
and to ensure real-time performance through GPU-based implementations. For the
tracking problem the PF framework [12] has become a preferred choice, as it can
model the multi-modal probability densities that are essential for successful tracking
of objects under occlusion.
As described in more detail in Sec. 3.2, the PF framework incrementally traces
the posterior probability density of the 6D pose through a set of samples from
the 6D Euclidean group SE(3) (the group of rigid body transformations). Key ingredients of the PF are a model for
the relative motion of the object over time, as well as an observation model
for the image formation process. In single image pose estimation approaches,
like [4], solely the observation model is used. An important difficulty in the
application of the PF framework to 6D pose tracking is the high dimensionality
of the state space, as each particle represents a 6D pose estimate. This either
demands a huge number of particles, prohibiting real-time systems, or an
importance sampling approach (compare [10]) based on a proposal distribution
that effectively approximates the posterior probability distribution. Any pose
estimation approach can be used to implement a proposal distribution, as shown
by Stückler et al. in [24], where a multi-resolution surfel representation of the
tracked object was utilized.
In this work we demonstrate the combination of the PF framework with the
recently developed concept of object coordinate regression, which has achieved
state-of-the-art results for one-shot estimation of camera pose [22] and object
pose [4]. The object coordinate regression framework is detailed in Sec. 3.1.
It exploits random forests to automatically learn optimal features from RGB-D
training images. Brachmann et al. [4] have shown that such learning-based
approaches can efficiently deal with changes in illumination and with partial
occlusions. Given this distinction between a training and a test phase, our system
works as follows. Given a potentially large selection of 3D objects (here 4, and 20
in [4]), as well as example images of background, we first train a random forest.
At test time, we know which object we want to track and use the output of
the forest only for this particular object. Note that a straightforward extension, not
evaluated in this work, is to jointly perform multiple object detection and tracking.
In this work, we carefully extend the work of [4] to the 6D pose tracking problem.
Our main contributions are:
- A new model-based 6D pose tracking framework, based on the concept of predicting 3D object coordinates, which helps to generalize better to real-world settings of fast-moving objects, occlusions and changes in lighting. The technically new aspect is a two-way procedure for optimally using the output of the object coordinate regression framework to determine the proposal distribution of the tracker.
- A new, challenging 6D pose tracking dataset that will be made publicly available.
- A system that exceeds state-of-the-art results on 6D model-based tracking.
2 Related Work
Almost two decades ago the PF has been introduced for 2D visual tracking by
Isard and Blake [15]. Based on a statistical observation model and a motion
model the PF approximates the posterior distribution of the object’s position in
a non parametric form, using a set of samples. Ten years later Pupilli et al. [20]
adapted the framework to 6-DOF camera tracking using edge features. Shortly
thereafter, Klein and Murray [16] presented an implementation that utilized the GPU for
the evaluation of its observation model, which usually is the bottleneck in PF
applications. The GPU implementation enabled them to deal with hidden edges
while allowing a speedup [16].
The number of necessary particles and therefore the runtime can be reduced
by guiding particle sampling with a proposal distribution. Ideally the proposal
distribution is very close to the posterior distribution. Furthermore, it needs to
exploit both the observation model and the motion model in order to improve
over the standard PF as well as over one-shot pose estimation. Bray et al. [5]
improved hand pose tracking with a proposal distribution, which was defined
as the mixture of two distributions both represented as a particle ensemble.
The first particle ensemble is constructed in the default manner by applying the
motion model to the sampled posterior distribution of the previous frame. The
second ensemble is constructed by moving the resulting particles further to local
optima and using them as centers of a mixture of normal distributions. Teulière
et al. [26] used a similar approach for edge-based tracking of simple objects from
luminance images. Corresponding approaches for 3D pose tracking from RGB-D
videos can be found in [9, 7, 18].
A PF that has to operate on the 6D Euclidean group SE(3) brings some
theoretical challenges. The definition of probability distributions and the calculation
of average rotations are not straightforward. A theoretical analysis of these issues
can be found in [6] and [17]. While earlier methods relied on simple random walk
motion models, Choi and Christensen [9] used autoregressive models that assume a more
or less constant pose velocity and were thus able to deal with faster motion.
With the availability of RGB-D sensors, the question of the best image fea-
tures for the observation model has sparked recent research. Stückler and Behnke [24]
use RGB-D images to learn 3D surfel maps of objects and use them in a PF
operating on RGB-D. Choi et al. [8] present in their recent work a highly opti-
mized GPU implementation that uses a traditional mesh representation. Their
observation model is based on comparing rendered images and observations us-
ing color as well as depth features. Learning of the best features in a random
forest has been successfully applied to pose estimation of articulated objects [25],
camera pose [22] and object poses [4]. In these approaches a random forest is
trained to predict part or object probabilities and in the latter two approaches
also coordinates in a reference coordinate system of the background scene or
the learned objects. During detection the output of this discriminative model
is used as input to an optimization procedure for the 3D-pose with respect to
an observation model. Brachmann et al. [4] improve over this basic concept by
beneficially incorporating the predicted object probabilities and coordinates into
the observation model, which is formulated in the form of an energy.
In this work we build a PF tracking framework with the energy based obser-
vation model of [4] and carefully design a proposal distribution that intelligently
exploits the input RGB-D image as well as the output of the trained discrimina-
tive model. In this way we combine the robustness of the discriminative model
with respect to changes in illumination and the robustness of the PF framework
with respect to occlusions. Furthermore, we use a two-way PF that builds on
similar ideas as annealed PF approaches previously proposed in [16] and [2].
The random forest approach is similar in spirit to boosting-based approaches
that have been examined in the area of tracking in the works [19, 13, 1, 27]. Finally,
our approach also uses GPU rendering to efficiently evaluate the
observation model.
3 Method
Given a stream of RGB-D images of a moving object, our goal is to estimate the object pose $H_t$ in each frame $t$. We assume that a 3D model of the corresponding object is available. We define the pose $H_t$ as the transformation that maps a point from the local coordinate system of the object to the coordinate system of the camera. We cast pose estimation as a tracking problem and solve it with a PF framework. In Sec. 3.1 we introduce a regression forest that predicts object probabilities and local object coordinates. This output is used in several steps of our approach. Then, we briefly reiterate the basic PF framework (Sec. 3.2). We follow with a description of our motion model (Sec. 3.3) and our definition of particle likelihood (Sec. 3.4). Finally, in Sec. 3.5 we describe how we adapt sampling of particles according to a proposal distribution which concentrates on image areas where high particle likelihood is expected. This is the main component to facilitate efficient and robust pose tracking.
3.1 Discriminative Prediction of Object Probabilities and Object
Coordinates from a Single RGB-D Image
In order to assess the likelihood of particles (Sec. 3.4), and to concentrate hypothesis sampling on promising image areas (Sec. 3.5), we utilize a discriminative function. This function takes local image patches as input and estimates the following two outputs for the center pixel of each patch: the probability $p(c)$ of the pixel belonging to object $c$, and its coordinate $y_c$ in the object coordinate system. We follow exactly the setup of [4] to achieve this mapping densely for each pixel of the current frame $t$ using decision forests.
Note that we train the forest jointly for multiple objects although, in this work, we consider tracking one pre-specified object $c$ only. During test time, discriminative predictions are only calculated for object $c$. Other objects, which the forest might know, are not considered. However, the same forest can provide predictions for different objects, so that training has to be done only once. In the following we give a short summary of training and prediction.
Training. We train an ensemble $\mathcal{T}$ of decision trees $T_j$. Each tree is trained separately using a set of object images as well as neutral background images. Object images are segmented and show object $c$ in different poses. Each pixel within a segmentation mask is furthermore annotated with its object coordinate $y_c$. Trees operate on simple scale-invariant depth and RGB difference features [22]. During training we select for each node one feature out of a randomly generated pool that maximizes the standard information gain defined on a discrete
set of labels. These labels are obtained by separating object $c$ into $n_{cell}$ cells with a 3D grid, resulting in $n_{cell}$ discrete labels per object. One additional label is assigned to the background class.
We do not restrict tree depth but stop growing when less than a minimum number of pixels arrive at a node. At each leaf $l_j$ we approximate the object probability $p(c|l_j)$ by the fraction of pixels at that leaf that belong to object $c$. The estimate of the object coordinate $y_{c,l_j}$ is found by running mean-shift on all pixels of object $c$ in leaf $l_j$ and taking the largest mode.
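To make the discrete label construction concrete, the following minimal sketch maps a ground-truth object coordinate to one of the $n_{cell}$ grid labels. The grid resolution and the bounding-box handling are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def coordinate_to_cell_label(y_c, bbox_min, bbox_max, cells_per_axis=5):
    """Map a 3D object coordinate to a discrete cell label.

    y_c: (3,) object coordinate of a pixel (in the object's local frame).
    bbox_min, bbox_max: (3,) corners of the object's bounding box (assumed known).
    cells_per_axis: grid resolution; 5**3 = 125 labels would match the
                    n_cell = 125 used in the experiments. Background pixels
                    would receive a separate label elsewhere.
    """
    y_c = np.asarray(y_c, dtype=np.float64)
    rel = (y_c - bbox_min) / (bbox_max - bbox_min)          # normalize to [0, 1]
    idx = np.clip((rel * cells_per_axis).astype(int), 0, cells_per_axis - 1)
    # Flatten the 3D cell index into a single discrete label.
    return idx[0] * cells_per_axis**2 + idx[1] * cells_per_axis + idx[2]
```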
Prediction. Each pixel of an input image is run through each tree in $\mathcal{T}$ and arrives at leaves $(l_1, \dots, l_{|\mathcal{T}|})$. This results in a list of object probabilities $(p(c|l_1), \dots, p(c|l_{|\mathcal{T}|}))$ and object coordinates $(y_{c,l_1}, \dots, y_{c,l_{|\mathcal{T}|}})$ for each object $c$. The object probabilities of all trees are combined with Bayes rule over all known objects and the background class to obtain the final $p(c)$ for each pixel [4].
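The exact combination rule is given in [4]; a minimal sketch of one such Bayesian combination, assuming uniform priors over the known objects and the background class (an assumption made here for illustration), could look as follows.

```python
import numpy as np

def combine_tree_predictions(leaf_probs):
    """Combine per-tree leaf posteriors into a final per-class probability.

    leaf_probs: array of shape (num_trees, num_classes), where each row holds
                p(c | l_j) for all known objects plus the background class.
    Returns the normalized product of the per-tree posteriors (a simple
    Bayesian combination under uniform class priors; the exact rule used
    in [4] may differ).
    """
    log_post = np.sum(np.log(np.clip(leaf_probs, 1e-12, 1.0)), axis=0)
    log_post -= log_post.max()            # subtract max for numerical stability
    post = np.exp(log_post)
    return post / post.sum()
```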
3.2 PF for 3D Pose Tracking
A PF approximates the current posterior distribution $p(H_t|Z_{1:t})$ of the pose $H_t$ at time $t$ given all previous observations $Z_{1:t}$. In each time step $t$ the posterior distribution is represented by a set of samples $S_t = \{H_t^1, \dots, H_t^N\}$, which are referred to as particles. Each particle has two velocity vectors $v_t$ and $e_t$ attached. They correspond to the previous translational and rotational motion, respectively.
The PF requires an observation model and a motion model. The former describes the likelihood $p(Z_{t+1}|H_{t+1})$ of an observation given a pose. The latter describes the probability $p(H_{t+1}|H_t, v_t, e_t)$ of a pose given the previous pose $H_t$ as well as the last velocity vectors. With each new frame $t+1$ a new set of particles $S_{t+1}$ is found in three steps:
1. Sampling: For each particle $H_t^i$ an intermediate particle $\hat{H}_{t+1}^i$ is sampled according to the motion model $p(\hat{H}_{t+1}^i \mid H_t^i, v_t^i, e_t^i)$. New velocities $v_{t+1}^i$ and $e_{t+1}^i$ are calculated (see supplemental note for details) using $H_t^i$ and $\hat{H}_{t+1}^i$.
2. Weighting: Each intermediate particle is assigned a weight $\pi_{t+1}^i$, which is proportional to the likelihood $p(Z_{t+1} \mid \hat{H}_{t+1}^i)$ of the observed data given the pose $\hat{H}_{t+1}^i$.
3. Resampling: Finally, the set $S_{t+1} = \{H_{t+1}^1, \dots, H_{t+1}^N\}$ of unweighted particles (with their attached velocities) is randomly drawn from $\{\hat{H}_{t+1}^1, \dots, \hat{H}_{t+1}^N\}$ using probabilities proportional to the weights $\pi_{t+1}^i$.
The number of particles required to approximate the 6D posterior distribution can be drastically reduced if the sampling is concentrated in areas where one expects the true pose of the object. This is done using a proposal distribution $q(H_{t+1}|H_t, v_t, e_t)$ in the sampling step instead of the original motion model. To compensate for the fact that we sample from a different distribution, the calculation of weights has to be adjusted according to Eq. (19) in [10]:

$$\pi_{t+1}^i \propto \frac{p(Z_{t+1} \mid \hat{H}_{t+1}^i)\; p(\hat{H}_{t+1}^i \mid H_t^i, v_t^i, e_t^i)}{q(\hat{H}_{t+1}^i \mid H_t^i, v_t^i, e_t^i)} \quad (1)$$
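As a rough illustration of how these three steps and the importance correction of Eq. (1) fit together, the sketch below outlines one iteration of such a tracker. The function arguments (sample_proposal, motion_density, proposal_density, likelihood) are hypothetical placeholders for the components defined in Secs. 3.3-3.5, not code from the authors.

```python
import numpy as np

def particle_filter_step(particles, observation, rng,
                         sample_proposal, motion_density,
                         proposal_density, likelihood):
    """One PF update with importance correction as in Eq. (1).

    particles: list of (pose H, velocities (v, e)) tuples from the last frame.
    The four function arguments are stand-ins for the motion model, proposal
    distribution and observation likelihood of Secs. 3.3-3.5.
    """
    sampled, weights = [], []
    for H, (v, e) in particles:
        H_new, v_new, e_new = sample_proposal(H, v, e, observation, rng)
        # Importance weight: likelihood times motion prior over proposal density.
        w = (likelihood(observation, H_new)
             * motion_density(H_new, H, v, e)
             / max(proposal_density(H_new, H, v, e, observation), 1e-12))
        sampled.append((H_new, (v_new, e_new)))
        weights.append(w)

    weights = np.asarray(weights)
    weights /= weights.sum()
    # Resample with probabilities proportional to the weights.
    idx = rng.choice(len(sampled), size=len(sampled), p=weights)
    return [sampled[i] for i in idx]
```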
In the following we describe the specifics of our implementation of the PF frame-
work: the motion model, the observation likelihood, and finally, our main contri-
bution, the construction of our proposal distribution. An overview of our tracking
pipeline can be found in Fig. 1.
Fig. 1. Our tracking pipeline. For each frame $t$ the RGB-D image (a) is processed by the forest to predict object probabilities and local object coordinates (b). We use the observed depth from the original image, the forest predictions and the particles from the last frame together with our motion model (d) to construct our proposal distribution (c). Particles are sampled (e) according to the proposal distribution, then weighted (f) and resampled (g). Our final pose estimate is calculated as the mean of the resampled particles (h).
3.3 Our Motion Model
The motion model describes which movements of the object are plausible between two frames. Generally speaking, we assume that our object roughly continues its previous motion, and that an additional random, normally distributed translation is applied to our pose together with a random rotation around the center of the object. More specifically, we assume the rotation to be around a uniformly chosen random axis and the angle of the rotation to be normally distributed. We will introduce a continuous probability distribution on SE(3) representing such a random motion. It will be reused in the context of our proposal distribution as described in Sec. 3.5.

Let $R_t$ and $T_t$ be the homogeneous $4 \times 4$ matrix representations of the rotational and translational component of the pose $H_t = T_t R_t$. We model the change of the pose between two time steps $t$ and $t+1$ separately for both components: $H_{t+1} = T_{t+1} R_{t+1}$ with $T_{t+1} = T^{\Delta}_t T_t$ and $R_{t+1} = R^{\Delta}_t R_t$, where $T^{\Delta}_t$ and $R^{\Delta}_t$ are $4 \times 4$ translation and rotation matrices, representing the change in the translational and rotational component, respectively. Note that $R^{\Delta}_t$ contains the relative rotation around the object center and not the camera center. Translational change is defined as:

$$T^{\Delta}_t = T(\lambda_T v_t + \omega_T) \quad (2)$$

The vector $\omega_T$ contains independent zero-centered Gaussian noise with variance $\sigma^2_T$. The symbol $\lambda_T$ stands for a damping parameter. It determines how much the previous translation, described by the velocity vector $v_t$, is continued. Finally,
$T(v)$ shall be the translation matrix corresponding to the translation vector $v$. The rotational change is defined as:

$$R^{\Delta}_t = R(\omega_R \theta)\, R(\lambda_R e_t) \quad (3)$$

The symbol $\omega_R$ stands for a random unit vector that defines the rotation axis of the random movement. Here $\theta$ is a Gaussian distributed, zero-centered random variable determining the rotation angle². Its variance is $\sigma^2_R$. The symbol $\lambda_R$ stands for a damping parameter. It determines how much the previous rotation, described by the rotational velocity vector $e_t$, is continued. Finally, $R(e)$ shall stand for the rotation matrix corresponding to the rotation vector³ $e$.
Based on the model described above we can calculate the probability density $p(H_{t+1}|H_t, v_t, e_t)$ for a transition to pose $H_{t+1}$ given the previous pose $H_t$. We approximate our motion model using the following density function $f(H_{t+1}; H^{pred}_t, \Sigma_{mm}, \kappa_{mm})$, which is described at the end of this section:

$$p(H_{t+1}|H_t, v_t, e_t) \approx f(H_{t+1}; H^{pred}_t, \Sigma_{mm}, \kappa_{mm}) \quad (4)$$

This function describes the probability of an arbitrary pose $H_{t+1}$ with respect to the predicted pose $H^{pred}_t$, which is found by extrapolating previous motion given by the velocity vectors $v_t$ and $e_t$. The probability distribution depends on the variance in the translational component through $\Sigma_{mm}$ and the variance of the rotational component through $\kappa_{mm}$. The diagonal matrix $\Sigma_{mm} = I\sigma^2_T$, the term $\kappa_{mm} = 1/\sigma^2_R$, and

$$H^{pred}_t = T(\lambda_T v_t)\, T_t\, R(\lambda_R e_t)\, R_t \quad (5)$$

The density function $f(H; H_0, \Sigma, \kappa)$ corresponds to the random motion described in Eqs. (2) and (3):

$$f(H; H_0, \Sigma, \kappa) = f_n(v_{diff}; \mathbf{0}, \Sigma)\, f_{vm}(\theta_{diff}; 0, \kappa)\, \phi(\theta_{diff}) \quad (6)$$
It consists of a zero-centered multivariate normal distribution $f_n(v_{diff}; \mathbf{0}, \Sigma)$ and a zero-centered von Mises distribution $f_{vm}(\theta_{diff}; 0, \kappa)$. While the vector $v_{diff}$ denotes the translational difference between $H_0$ and $H$, the symbol $\theta_{diff}$ stands for the angle of the difference rotation. The normal distribution models the translational noise $\omega_T$ introduced in Eq. (2). The von Mises distribution in Eq. (6) is a close approximation of a wrapped normal distribution; it models the Gaussian rotational noise introduced by $\theta$ in Eq. (3). Note that the random rotation axis $\omega_R$ of Eq. (3) has no influence on the density, since each rotation axis has equal probability. Also note that an additional factor $\phi(\theta_{diff})$ is necessary to map the 1D density over angles onto the group of rotations SO(3). A more detailed discussion of $\phi(\theta_{diff})$ can be found in the supplemental note.
² Please note that, because of its circular nature, applying rotations with the normally distributed angles $\theta$ will result in angles distributed in the interval between 0 and $2\pi$ according to a wrapped normal distribution. Such a distribution is difficult to handle and we will use a von Mises distribution as approximation.
³ Direction and length of a rotation vector correspond to rotation axis and rotation angle, respectively.
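To illustrate the motion model of Eqs. (2) and (3), the sketch below draws one pose sample. It works with rotation matrices and translation vectors directly rather than homogeneous 4x4 matrices, and all parameter values are placeholders, so it should be read as an assumption-laden sketch rather than the authors' implementation.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def sample_motion_model(R_t, t_t, v_t, e_t, rng,
                        lambda_T=0.5, lambda_R=0.5, sigma_T=0.01, sigma_R=0.05):
    """Sample a pose (R_{t+1}, t_{t+1}) from the motion model of Eqs. (2)-(3).

    R_t, t_t: current rotation (3x3) and translation (3,) of the object pose.
    v_t, e_t: previous translational velocity (3,) and rotational velocity
              as a rotation vector (3,). Damping and noise parameters are
              illustrative placeholders.
    """
    # Translational change: damped previous velocity plus Gaussian noise, Eq. (2).
    omega_T = rng.normal(0.0, sigma_T, size=3)
    t_next = t_t + lambda_T * v_t + omega_T

    # Rotational change: damped previous rotation followed by a random rotation
    # around a uniformly drawn axis with a Gaussian-distributed angle, Eq. (3).
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    theta = rng.normal(0.0, sigma_R)
    R_noise = Rotation.from_rotvec(axis * theta).as_matrix()
    R_damped = Rotation.from_rotvec(lambda_R * e_t).as_matrix()
    # With H = T R, updating R while keeping t fixed rotates about the object center.
    R_next = R_noise @ R_damped @ R_t

    return R_next, t_next
```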
3.4 Our Observation Likelihood
We use a likelihood formulation based on the energy $E(H)$ from [4]:

$$p(Z_{t+1}|H) \propto \exp(-\alpha E(H)) \quad (7)$$

where $\alpha$ is a control parameter determining how harshly larger energy values should be punished. The energy term is a weighted sum of three components, $E_{depth}(H)$, $E_{coord}(H)$ and $E_{obj}(H)$. Detailed equations can be found in [4]. The depth component $E_{depth}(H)$ punishes deviations between the observed and rendered depth images within the object mask. The other two components are concerned with the output of the forest. The coordinate component $E_{coord}(H)$ punishes deviations between each pixel's true object coordinates for pose $H$ and the object coordinates predicted by the forest. The object component $E_{obj}(H)$ punishes deviations between the ideal segmentation obtained by rendering and the soft segmentations predicted by the trees in the form of $p(c|l_j)$.
In contrast to [4], we use a simple modification of the depth term that copes better with occlusion. Depth values that lie in front of the object can be explained by occlusion. This is not the case for depth values that lie behind the object. Our modified depth term accounts for this by reducing the threshold of possible punishment for values in front of the object. A detailed description can be found in the supplemental note.
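As a small illustration of how Eq. (7) turns energies into particle weights, the sketch below normalizes exp(-αE) over a set of particles in a numerically stable way; the value of α is a placeholder.

```python
import numpy as np

def weights_from_energies(energies, alpha=1.0):
    """Turn per-particle energies E(H) into normalized weights via Eq. (7).

    energies: array of energy values, one per particle.
    alpha: control parameter of Eq. (7) (placeholder value).
    """
    energies = np.asarray(energies, dtype=np.float64)
    log_w = -alpha * energies
    log_w -= log_w.max()          # subtract the max for numerical stability
    w = np.exp(log_w)
    return w / w.sum()
```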
3.5 Our Proposal Distribution
Our proposal distribution allows our method to cope with unexpected motion and occlusion, while maintaining high accuracy. It allows us to approximate the posterior distribution $p(H_t|Z_{1:t})$ more accurately with a small number of particles. The construction of our proposal distribution is described in the following and summarized in Fig. 2.
A proposal distribution describes the sampling of a new particle $\hat{H}_{t+1}^i$ on the basis of an old particle $H_t^i$. We define the proposal distribution $q(H_{t+1}|H_t, v_t, e_t)$ for the particle $H_t^i$ as a mixture of two parts (Fig. 2(o)):

$$q(H_{t+1}^i \mid H_t^i, v_t^i, e_t^i) = (1 - \alpha_{prop})\, f(H_{t+1}^i; H^{i,pred}_t, \Sigma_{prop}, \kappa_{prop}) + \alpha_{prop}\, f(H_{t+1}^i; H^{est}_{t+1}, \Sigma_{prop}, \kappa_{prop}) \quad (8)$$
The mixture is governed by the weight $\alpha_{prop}$. Both parts reuse the density function defined in Eq. (6) with variance parameters $\Sigma_{prop}$ and $\kappa_{prop}$, which can be found in the supplemental note. The first part of Eq. (8) is centered on $H^{i,pred}_t$, which is the extrapolation of the current particle $H_t^i$ according to our motion model as described in Eq. (5). Hence, $H^{i,pred}_t$ differs for each particle $H_t^i$. The second part of Eq. (8) is centered on a preliminary estimate $H^{est}_{t+1}$, which is found based on the output of our discriminative function (Sec. 3.1). It does not depend on $H_t^i$, but is shared among all particles.
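One way to draw a particle from the two-component mixture of Eq. (8) is sketched below; sample_density stands for sampling from the density f of Eq. (6) and is a hypothetical helper, as is the value of alpha_prop.

```python
import numpy as np

def sample_from_proposal(H_pred_i, H_est, sample_density, rng, alpha_prop=0.25):
    """Draw one particle from the mixture proposal of Eq. (8).

    H_pred_i: motion-model extrapolation of particle i (first mixture component).
    H_est: preliminary estimate shared by all particles (second component).
    sample_density(center, rng): hypothetical helper that samples a pose from
        the density f of Eq. (6) centered on the given pose.
    alpha_prop: mixture weight (placeholder value).
    """
    if rng.random() < alpha_prop:
        return sample_density(H_est, rng)     # sample around the shared estimate
    return sample_density(H_pred_i, rng)      # sample around the extrapolated particle
```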
Regarding $H^{est}_{t+1}$, we will discuss two different ways to quickly obtain a good estimate: one way finds a local estimate $H^{local}_{t+1}$, the other finds a global estimate $H^{global}_{t+1}$.
Fig. 2. To construct our proposal distribution we first calculate a continuous representation of the prior distribution for the pose at the current frame (a-d). Next we determine two pose estimates $H^{local}_{t+1}$ (light gray) and $H^{global}_{t+1}$ (dark gray). The local estimate is calculated based on a local search on propagated solutions via the motion model (i-j). The global estimate is based on the pose sampling scheme from [4] (e-l). We choose the particle with the lower energy and use it as starting point for a final optimization (n). Our final proposal distribution $q(H^i_{t+1}|H^i_t)$ for particle $H^i_t$ is a mixture of two components (o): one centered on $H^i_t$ and one on the newly found particle in (n). This figure represents component (c) of Fig. 1.
While a proposal distribution based on a local estimate is sufficient in most cases, it may fail in situations with fast unexpected motion. The global estimate, on the other hand, depends on the quality of the discriminative prediction and may at times give noisy results. As a consequence, we apply a combination of the two approaches:

$$H^{est}_{t+1} = \operatorname*{argmin}_{H \in \mathcal{H}} E'(H); \quad \mathcal{H} = \{H^{local}_{t+1}, H^{global}_{t+1}\} \quad (9)$$

Note that this is our main technical contribution. The energy $E'(H)$ will be defined below. The preliminary estimate $H^{est}_{t+1}$ is optimized (Fig. 2(n)) using a general purpose optimizer (details can be found in the supplemental note).
The remainder of this section is concerned with the calculation of $H^{local}_{t+1}$ and $H^{global}_{t+1}$. First, however, we discuss how we represent the prior knowledge used in both estimates.
Prior Knowledge. The proposal distribution should be an estimate of the posterior $p(H_{t+1}|Z_{1:t+1})$. Both of our estimates should thus include not only knowledge taken from the current observation, i.e. the likelihood and results from the discriminative function, but also information from the previous particle set, i.e. the prior. To include the prior we perform the following preparatory steps. We take the last set of particles $S_t = \{H_t^1, \dots, H_t^N\}$ and move each particle according to the motion model (Fig. 2(a-c)). The result is an extrapolated set of particles $\tilde{S}_{t+1} = \{\tilde{H}_{t+1}^1, \dots, \tilde{H}_{t+1}^N\}$. In order to obtain a continuous representation we reuse the distribution $f(H; H_{center}, \Sigma, \kappa)$ (Eq. (6)). We fit $f(H; H_{center}, \Sigma, \kappa)$ to the set $\tilde{S}_{t+1}$ (Fig. 2(d)). The resulting parameters are $\tilde{H}_{center}, \tilde{\Sigma}, \tilde{\kappa}$. For details on the fitting procedure, please refer to the supplemental note. The distribution
$f(H; \tilde{H}_{center}, \tilde{\Sigma}, \tilde{\kappa})$ is a representation of the knowledge we have about the pose at the current time $t+1$ without considering the current observation $Z_{t+1}$. It is a representation of the prior.
Local Estimate. To find $H^{local}_{t+1}$, we use $\tilde{H}_{center}$ (Fig. 2(i)) as starting point for refinement as described in [4] (Fig. 2(j)). This refinement is done by repeatedly finding inlier pixels. Their predicted object coordinates, together with the observed depth values, enable a rough but quick optimization using the Kabsch algorithm. In order to include prior knowledge in the refinement we change the objective function to:

$$E'(H) = \alpha E(H) - \ln f(H; \tilde{H}_{center}, \tilde{\Sigma}, \tilde{\kappa}) \quad (10)$$

Because of this adjustment of the objective function the resulting $H^{local}_{t+1}$ becomes a local maximum a posteriori (MAP) estimate.
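The Kabsch step aligns the predicted object coordinates of the inlier pixels with the corresponding camera-space points from the depth map. A minimal sketch of such a 3D-3D alignment is given below; it is not the authors' code, and the inlier selection and iteration logic of [4] are omitted.

```python
import numpy as np

def kabsch_pose(obj_points, cam_points):
    """Estimate a rigid pose (R, t) such that cam_points ≈ R @ obj_points + t.

    obj_points: (N, 3) predicted object coordinates of inlier pixels.
    cam_points: (N, 3) corresponding 3D points from the observed depth map.
    Returns the rotation matrix R and translation vector t (Kabsch algorithm).
    """
    obj_mean = obj_points.mean(axis=0)
    cam_mean = cam_points.mean(axis=0)
    # Cross-covariance of the centered point sets.
    H = (obj_points - obj_mean).T @ (cam_points - cam_mean)
    U, _, Vt = np.linalg.svd(H)
    # Reflection correction keeps the result a proper rotation.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T
    t = cam_mean - R @ obj_mean
    return R, t
```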
Global Estimate. Calculation of the global estimate $H^{global}_{t+1}$ is based on a sampling scheme similar to the one in [4]. We sample a set of $m$ particles $\check{H}_i$ (Fig. 2(e-g)). Details can be found in the supplemental note. Then, the particles $\check{H}_i$ are weighted using the distribution $f(H; \tilde{H}_{center}, \tilde{\Sigma}, \tilde{\kappa})$ (Fig. 2(h)). Finally, their weighted mean is calculated (Fig. 2(k)) and used as initialization for the refinement (Fig. 2(l)), again using $E'(H)$ from Eq. (10) as objective function. This yields $H^{global}_{t+1}$.
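Averaging poses requires some care for the rotational part; the paper refers to the supplemental note for details. One common way to compute a weighted mean pose, shown below purely as an assumption and not necessarily the scheme used by the authors, averages translations arithmetically and rotations via the weighted quaternion eigenvector method.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def weighted_mean_pose(rotations, translations, weights):
    """Weighted mean of poses; one common scheme, not necessarily the paper's.

    rotations: list of 3x3 rotation matrices.
    translations: list of (3,) translation vectors.
    weights: non-negative weights, one per pose.
    """
    w = np.asarray(weights, dtype=np.float64)
    w /= w.sum()
    t_mean = np.einsum('i,ij->j', w, np.asarray(translations))

    # Rotation average: principal eigenvector of the weighted quaternion
    # outer-product matrix (insensitive to the q / -q sign ambiguity).
    quats = Rotation.from_matrix(np.asarray(rotations)).as_quat()   # (N, 4)
    M = np.einsum('i,ij,ik->jk', w, quats, quats)
    eigvals, eigvecs = np.linalg.eigh(M)
    q_mean = eigvecs[:, -1]                  # eigenvector of the largest eigenvalue
    R_mean = Rotation.from_quat(q_mean).as_matrix()
    return R_mean, t_mean
```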
4 Experiments
Several RGB-D object tracking datasets have been published in recent years. For example, Fanelli et al. [11] recorded a dataset to track human head poses using a Kinect camera. Song and Xiao [23] used a Kinect camera to record 100 RGB-D video sequences of arbitrary objects, but only provide 2D bounding boxes as ground truth. For our purpose, we found only one relevant dataset, from Choi and Christensen [8]. It consists of 3D object models and synthetic test sequences. For further evaluation, we recorded a new, more challenging and realistic dataset on which we compared our approach. Additionally, we conduct experiments to demonstrate that our proposal distribution achieves superior results when unexpected object motion occurs.
Fig. 3. Example images of the dataset provided by Choi and Christensen [8]: Kinect Box, Milk, Orange Juice, Tide.
Dataset of Choi and Christensen [8]. The dataset of Choi and Christensen
[8] provides four textured 3D models and four synthetic test sequences (1000
RGB-D frames). To generate the test sequences, each of the four objects was
placed in a static texture-less 3D scene and the camera was slowly moved around
the object. The authors provide the ground truth camera trajectory which is
error-free since it was generated through rendering. Fig. 3 shows one image of
each sequence.
To gather the training data for our random forest we rendered RGB-D images of each model. We sampled the full view sphere of the model regularly with fixed distance, including in-plane rotations. For the background set we used renderings from multiple 3D scenes from Google Warehouse. We trained three decision trees with a maximum feature patch size of 20 pixel meter and $n_{cell} = 125$ discrete labels per object. We trained the trees for all 4 objects jointly.
For each testing sequence the object to be tracked is assumed to be known, and
predictions are only made for this object. Our PF uses 70 particles. The complete
list of parameters is included in the supplemental note.
Fig. 4. Averaged translation and rotation
RMSE on the dataset of [8].
Fig. 5. Reconstructed motion trajectory
(green) for one sequence of our dataset
(Cat, sequence 1). Ground truth is depicted in blue for comparison.
While testing we follow the evaluation protocol of Choi and Christensen [8] and compute the Root Mean Square Error (RMSE) of the translation parameters X, Y and Z and the rotation parameters Roll, Pitch and Yaw. We average the translational RMSE over three test runs, the coordinates (X, Y and Z), as well as over the four objects to obtain one translational error measure. We do the same for the rotational RMSE. We compare to the numbers provided in [8], which also include results for the tracking implementation of the Point Cloud Library (PCL) [3]. We base our comparison on the results in [8] achieved with 12,800 particles, for which the lowest error is reported.
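A small sketch of the error computation implied by this protocol (per-sequence RMSE of each pose parameter, then averaging over runs, axes, and objects) is given below; the array layout is an assumption made for illustration.

```python
import numpy as np

def averaged_rmse(errors):
    """Average RMSE following the protocol of [8] as described above.

    errors: array of shape (num_objects, num_runs, num_frames, 3) holding the
            per-frame error of three parameters (X, Y, Z or Roll, Pitch, Yaw).
    Returns a single scalar: RMSE per parameter and sequence, averaged over
    runs, the three parameters, and the objects.
    """
    errors = np.asarray(errors, dtype=np.float64)
    rmse = np.sqrt(np.mean(errors ** 2, axis=2))   # -> (objects, runs, 3)
    return float(rmse.mean())
```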
Our method results in an average translational RMSE of 0.83 mm, compared to 1.36 mm for [8], i.e. we achieve a 38% lower translational error (PCL: 18.7 mm). For the average rotational RMSE we report 1.38 deg compared to 2.45 deg in [8], which is 43% lower (PCL: 29.6 deg). We achieve these results while keeping the computation time on our system (Intel Core i7-3820 CPU @ 3.6 GHz with an Nvidia GTX 550 Ti GPU) comparable to the one reported
in [8]. Figure 4 depicts the average RMSE over all objects. Detailed results
including run-times can be found in Table 1.
Object       | Tracker              | X (mm) | Y (mm) | Z (mm) | Roll (deg) | Pitch (deg) | Yaw (deg) | Time (ms)
Kinect Box   | PCL                  | 43.99  | 42.51  | 55.89  | 7.62       | 1.87        | 8.31      | 4539
Kinect Box   | Choi and Christensen | 1.84   | 2.23   | 1.36   | 6.41       | 0.76        | 6.32      | 166
Kinect Box   | Our                  | 0.83   | 1.67   | 0.79   | 1.11       | 0.55        | 1.04      | 143
Milk         | PCL                  | 13.38  | 31.45  | 26.09  | 59.37      | 19.58       | 75.03     | 2205
Milk         | Choi and Christensen | 0.93   | 1.94   | 1.09   | 3.83       | 1.41        | 3.26      | 134
Milk         | Our                  | 0.51   | 1.27   | 0.62   | 2.19       | 1.44        | 1.90      | 135
Orange Juice | PCL                  | 2.53   | 2.20   | 1.91   | 85.81      | 42.12       | 46.37     | 1637
Orange Juice | Choi and Christensen | 0.96   | 1.44   | 1.17   | 1.32       | 0.75        | 1.39      | 117
Orange Juice | Our                  | 0.52   | 0.74   | 0.63   | 1.28       | 1.08        | 1.20      | 129
Tide         | PCL                  | 1.46   | 2.25   | 0.92   | 5.15       | 2.13        | 2.98      | 2762
Tide         | Choi and Christensen | 0.83   | 1.37   | 1.20   | 1.78       | 1.09        | 1.13      | 111
Tide         | Our                  | 0.69   | 0.81   | 0.81   | 2.10       | 1.38        | 1.27      | 116
Table 1. Comparison of the translation error (X,Y,Z), rotation error (Roll, Pitch,
Yaw) and computation time on the synthetic dataset of Choi and Christensen [8] with
results from our method, [8] and the PCL.
Fig. 6. (a)-(c) Our objects, from left to right: Cat, Toolbox, Samurai. (d) Color frame and (e) depth frame recorded with the commercially available Kinect camera. (f) Probability map and (g) predicted 3D object coordinates from a single tree, mapped to the RGB cube, for the object Cat.
Our dataset. The dataset provided by [8] is problematic for several reasons: testing sequences are generated synthetically, without camera noise and without occlusion. The objects are placed in a texture-less and static environment. In a static scene, a tracking method can in theory use the entire image to estimate the motion of the camera instead of estimating the motion of the object. Furthermore, the camera is moved around the object. The statistics of object motion when the camera is moved are very different from a situation where the camera is static and the object is moved. For example, a complete vertical flip of the object is unlikely in the first scenario.
To address these issues we introduce our own dataset, which consists of three objects. The objects were scanned in using a Kinect, and six RGB-D testing sequences were recorded (350+ frames each). The objects are moved quickly in front of a static camera and are at times strongly occluded. Ground truth poses
Fig. 7. Example images from our dataset. Blue object silhouettes depict ground truth
and green silhouettes depict the estimated poses. The first four columns show correctly
estimated poses and the last column missed poses.
were generated by hand annotation followed by ICP. In Fig. 7 five images of each object are shown. For our dataset we trained decision trees as discussed in the previous section, but with renderings of our scanned-in objects and a set of arbitrary RGB-D office images as background set. We keep all other training parameters as in the previously described experiment. We compare our approach to the one-shot pose estimation from [4].
Object  | Sequence | Frames used | Accuracy (full proposal) | Accuracy (local proposal) | Accuracy (Brachmann et al. [4])
Cat     | 1        | 100%        | 100%   | 89%    | 66.8%
Cat     | 1        | 33%         | 91.5%  | 90.4%  | –
Cat     | 2        | 100%        | 99.4%  | 100%   | 44.2%
Cat     | 2        | 33%         | 94.9%  | 87.8%  | –
Samurai | 1        | 100%        | 96.3%  | 96.7%  | 72%
Samurai | 1        | 33%         | 68.6%  | 52.4%  | –
Samurai | 2        | 100%        | 92.3%  | 98.1%  | 33.7%
Samurai | 2        | 33%         | 55.4%  | 29.1%  | –
Toolbox | 1        | 100%        | 88.8%  | 88.5%  | 54.7%
Toolbox | 1        | 33%         | 81.2%  | 68.2%  | –
Toolbox | 2        | 100%        | 100%   | 100%   | 59.4%
Toolbox | 2        | 33%         | 89.9%  | 34.5%  | –

Fig. 8. Accuracy measured on our dataset. Comparison of our full proposal distribution to the local proposal distribution and to [4]. Evaluation is done based on all frames and on every third frame of the image sequences.
Fig. 9. Average accuracy over all sequences for (a) [4]; our full proposal distribution using (b) all frames and (d) every third frame of the sequence; our local proposal distribution using (c) all frames and (e) every third frame of the sequence.
We exactly adhere to their training setup and parameters (3 trees, maximum patch size 20 pixel meter, 210 hypotheses per frame, Gaussian noise on training data). We measure accuracy as in [4], as the fraction of frames in which the object pose was estimated correctly.
While our approach achieves 96.2% accuracy on average over all sequences, the approach of [4] estimates only 58.9% of the poses correctly. Even though [4] is inherently robust to occlusion, the heavy occlusions in our dataset still cause it to fail. In contrast, our approach is able to estimate most poses correctly by using information from previous frames. The results are depicted in Fig. 9 (a)
and (b). Additionally, we computed rotational and translational distances to the ground truth for both methods. The distribution of errors for one sequence is depicted in Fig. 10. The plots again show the large number of outlier estimates of [4] (rightmost bins). The plots also reveal that, for correctly estimated poses, our approach yields considerably more precise estimates.
To show that our full proposal distribution (Sec. 3.5) increases the robustness of our method, we conducted the following experiment: we define a simplified variant of the proposal distribution, which is based only on $H^{local}_{t+1}$ and to which we also apply the final optimization. We term this variant the local proposal distribution. We use it together with 120 particles since it needs less computation time.
Fig. 10. Histogram of rotational and translational errors for our approach in comparison to [4], which is a single-frame pose estimation framework.
For this experiment, we artificially increase motion in our test sequences by using only every third frame. As before, we measure the number of correctly estimated poses. In this challenging setup the full proposal distribution achieves 80.3% accuracy on average, while the local proposal distribution achieves only 60.4% accuracy. The results are depicted in Fig. 9 (b)-(e). Fig. 8 provides detailed
information for the achieved accuracy on all sequences.
Fig. 5 shows the estimated object motion path for one sequence with fast
motion. The plot illustrates precision and robustness of our approach.
5 Conclusion
We have introduced a novel method applying the concept of 3D object coordinate
regression to 6-DOF pose tracking. We utilize predicted object coordinates in
a proposal distribution, making our method very robust with regard to fast
movements in combination with occlusion. We have evaluated our method on
the dataset by Choi and Christensen and demonstrated that it yields superior
results. The method was additionally evaluated using a new dataset, specially
designed for RGB-D 6-DOF pose tracking, which will be made available to the
community.
Acknowledgement. This work has partially been supported by the European
Social Fund and the Federal State of Saxony within project VICCI (#100098171).
We thank Daniel Schemala for the development of the manual pose annotation tool, which we used to generate the ground truth data.
References
1. Avidan, S.: Ensemble tracking. IEEE Trans. on PAMI 29 (2007) 261–271
2. Azad, P., Munch, D., Asfour, T., Dillmann, R.: 6-DoF model-based tracking of
arbitrarily shaped 3D objects. In: IEEE ICRA (2011) 5204–5209
3. Bersch, C., Pangercic, D., Osentoski, S., Hausman, K., Marton, Z.C., Ueda, R.,
Okada, K., Beetz, M.: Segmentation of textured and textureless objects through
interactive perception. In: RSS Workshop on Robots in Clutter: Manipulation,
Perception and Navigation in Human Environments (2012)
4. Brachmann, E., Krull, A., Michel, F., Shotton, J., Gumhold, S., Rother, C.: Learn-
ing 6d object pose estimation using 3d object coordinates. In: ECCV (2014)
5. Bray, M., Koller-Meier, E., Van Gool, L.: Smart particle filtering for 3D hand track-
ing. In: IEEE International Conference on Automatic Face and Gesture Recogni-
tion (2004) 675–680
6. Chiuso, A., Soatto, S.: Monte carlo filtering on lie groups. In: 39th IEEE Conference
on Decision and Control Volume 1 (2000) 304–309
7. Choi, C., Christensen, H.I.: 3D textureless object detection and tracking: An edge-
based approach. In: IEEE/RSJ International Conference on IROS (2012) 3877–3884
8. Choi, C., Christensen, H.I.: RGB-D object tracking: A particle filter approach on
GPU. In: IEEE/RSJ International Conference on IROS (2013) 1084–1091
9. Choi, C., Christensen, H.: Robust 3D visual tracking using particle filtering on the
SE(3) group. In: 2011 IEEE ICRA (2011) 4384–4390
10. Doucet, A., Godsill, S., Andrieu, C.: On sequential monte carlo sampling methods
for bayesian filtering. Statistics and Computing 10 (2000) 197–208
11. Fanelli, G., Weise, T., Gall, J., Van Gool, L.: Real time head pose estimation from
consumer depth cameras. In: Pattern Recognition. Springer (2011) 101–110
12. Gordon, N., Salmond, D., Smith, A.: Novel approach to nonlinear/non-gaussian
bayesian state estimation. In: IEEE Radar and Signal Processing. Volume 2 (1993)
107–113
13. Grabner, H., Bischof, H.: Online boosting and vision. In: IEEE CVPR Volume 1
(2006) 260–267
14. Hinterstoisser, S., Lepetit, V., Ilic, S., Holzer, S., Bradski, G.R., Konolige, K.,
Navab, N.: Model based training, detection and pose estimation of texture-less 3d
objects in heavily cluttered scenes. In: ACCV (2012) 548–562
15. Isard, M., Blake, A.: Contour tracking by stochastic propagation of conditional
density. In: ECCV (1996) 343–356
16. Klein, G., Murray, D.W.: Full-3D edge tracking with a particle filter. In: BMVC
(2006) 1119–1128
17. Kwon, J., Choi, M., Park, F.C., Chun, C.: Particle filtering on the euclidean group:
framework and applications. Robotica 25 (2007) 725–737
18. McElhone, M., Stuckler, J., Behnke, S.: Joint detection and pose tracking of multi-
resolution surfel models in RGB-D. In: IEEE ECMR (2013) 131–137
19. Okuma, K., Taleghani, A., De Freitas, N., Little, J.J., Lowe, D.G.: A boosted
particle filter: Multitarget detection and tracking. In: Computer Vision-ECCV
(2004) 28–39
20. Pupilli, M., Calway, A.: Real-time camera tracking using known 3d models and a
particle filter. In: IEEE ICPR Volume 1 (2006) 199–203
21. Rios-Cabrera, R., Tuytelaars, T.: Discriminatively trained templates for 3d object
detection: A real time scalable approach. In: IEEE ICCV (2013) 2048–2055
22. Shotton, J., Glocker, B., Zach, C., Izadi, S., Criminisi, A., Fitzgibbon, A.: Scene
coordinate regression forests for camera relocalization in RGB-D images. In: IEEE
CVPR (2013) 2930–2937
23. Song, S., Xiao, J.: Tracking revisited using rgbd camera: Unified benchmark and
baselines. In: ICCV (2013) 233–240
24. Stückler, J., Behnke, S.: Multi-resolution surfel maps for efficient dense 3D modeling
and tracking. Journal of Visual Communication and Image Representation 25
(2014) 137–147
25. Taylor, J., Shotton, J., Sharp, T., Fitzgibbon, A.W.: The vitruvian manifold:
Inferring dense correspondences for one-shot human pose estimation. In: IEEE
CVPR (2012) 103–110
26. Teuliere, C., Marchand, E., Eck, L.: Using multiple hypothesis in model-based
tracking. In: IEEE ICRA (2010) 4559–4565
27. Wu, B., Nevatia, R.: Detection and tracking of multiple, partially occluded humans
by bayesian combination of edgelet based part detectors. International Journal of
Computer Vision 75 (2007) 247–266