IEEE ROBOTICS AND AUTOMATION LETTERS. PREPRINT VERSION. ACCEPTED JUNE, 2019
Multiple Hypothesis Semantic Mapping for Robust
Data Association
Lukas Bernreiter, Abel Gawel, Hannes Sommer1, Juan Nieto, Roland Siegwart and Cesar Cadena
Abstract—In this paper, we present a semantic mapping
approach with multiple hypothesis tracking for data association.
As semantic information has the potential to overcome ambiguity
in measurements and place recognition, it forms an eminent
modality for autonomous systems. This is particularly evident
in urban scenarios with several similar looking surroundings.
Nevertheless, it requires the handling of a non-Gaussian and
discrete random variable coming from object detectors. Previous
methods facilitate semantic information for global localization
and data association to reduce the instance ambiguity between the
landmarks. However, many of these approaches do not deal with
the creation of complete globally consistent representations of the
environment and typically do not scale well. We utilize multiple
hypothesis trees to derive a probabilistic data association for
semantic measurements by means of position, instance and class
to create a semantic representation. We propose an optimized
mapping method and make use of a pose graph to derive a novel
semantic SLAM solution. Furthermore, we show that semantic
covisibility graphs allow for a precise place recognition in urban
environments. We verify our approach using a real-world outdoor dataset and demonstrate an average drift reduction of 33 % w.r.t. the raw odometry source. Moreover, our approach produces 55 % fewer hypotheses on average than a regular multiple hypotheses approach.
Index Terms—SLAM, Semantic Scene Understanding, Proba-
bility and Statistical Methods
I. INTRODUCTION
SEMANTIC data is a reliable and ubiquitous flow of in-
formation in structured and non-structured environments.
Especially for perception systems, semantically annotated data
and higher reasoning about the underlying scene on top of
purely geometric approaches have the potential to increase
the robustness of the estimation [1], [2]. Reliable mapping is especially important for autonomous systems as well as augmented reality systems, since recognizing the surrounding objects and localizing in a globally unknown environment are crucial capabilities there.
Manuscript received: February, 24, 2019; Revised May, 16, 2019; Accepted
June, 10, 2019.
This paper was recommended for publication by Editor Cyrill Stachniss
upon evaluation of the Associate Editor and Reviewers’ comments.
This work was supported by the National Center of Competence in Research
(NCCR) Robotics through the Swiss National Science Foundation and has
received funding from the European Union’s Horizon 2020 research and
innovation programme under grant agreement No 688652 and from the
Swiss State Secretariat for Education, Research and Innovation (SERI) under
contract number 15.0284.
All authors are with the Autonomous Systems Lab, ETH Zurich, Zurich
8092, Switzerland, {berlukas, gawela, sommerh, nietoj,
rsiegwart, cesarc}@ethz.ch.
1Additionally with Sevensense Robotics AG, Zurich.
Digital Object Identifier (DOI): see top of this page.
Fig. 1: We propose a semantic SLAM system that maintains
multiple hypotheses of the landmark locations structured in
a hypothesis tree (bottom right image). Data association is
done in a semantic framework to create new branches in the
hypothesis tree. Furthermore, we perform semantic place recognition utilizing the object class distribution of a submap (top left image).
Traditional approaches for localization often rely on specific low-level visual features such as points and lines, which are inherently ambiguous and prevent these approaches from scaling well to large environments. In contrast, semantic information offers a promising alternative for many robotic applications by allowing more unique local and global descriptors for landmarks as well as potential viewpoint-invariance. This constitutes a crucial factor for the association of measurements to mapped landmarks and thus influences the quality of the localization. Moreover, semantics are well suited for place recognition as they are less affected by seasonal or appearance changes or by large drift.
In a conventional SLAM setting, the measurement noise is
commonly relaxed to the continuous Gaussian case [1], which, however, does not apply to semantic variables. Uncertainties
in the object detection such as class labels and object instances
typically involve the handling of non-Gaussian discrete vari-
ables. How to properly handle such variables is still quite
challenging and remains an open research question [3].
Many existing semantic mapping approaches are primarily
concerned with the creation of an indoor semantic represen-
tation with minor illumination and viewpoint changes [4].
In contrast, realistic outdoor applications often come with
severe changes of illumination and viewpoint. This can hamper loop closure detection since drastic viewpoint changes might render scenes completely different when they are revisited.
Additionally, local descriptors for place recognition often
rely on the bag-of-words paradigm [6], [7], which often fails in the presence of repetitive features that are common in urban environments, leading to false loop closures.

Fig. 2: System overview of our proposed approach. We make use of our previous work X-View [5] for the semantic object extraction. The extracted semantic objects are converted to measurements and used as input for our system. We then simultaneously build the factor graph and perform the multiple hypothesis mapping of the environment. Loop closures are detected and placed into the factor graph when submaps are completed.
Other semantic SLAM systems do not directly incorporate
the semantic information into the estimation but rather use it
to filter out unreliable object classes such as cars or pedestrians beforehand [8].
In this work, we aim to build a globally consistent semantic
mapping formulation by improving the incorporation of dis-
crete random variables in the map building and localization
processes.
Throughout this paper, observations comprise the semantic
class and position from static landmarks as well as the
spatial relationship to other static landmarks. We approach
the measurement association problem by utilizing the semantic
class of an object and deferring the decision on associations
until the ambiguity is resolved. In other words, the decision on the association is made once more observations are available or a place is revisited, allowing the instance label to be identified with greater confidence. This is motivated by the fact that the most likely association given only a few measurements is not necessarily the correct one.
Furthermore, we derive a loop closure detection and veri-
fication algorithm operating directly on the level of semantic
objects. Utilizing the class labels and the spatial relationships
between the objects enables a robust recognition of places in
urban environments. An overview of our proposed system is
given in figure 2.
The main contributions of this work are:
• Consistent multiple hypothesis mapping using an optimized Multiple Hypothesis Tracking (MHT) approach.
• A Dirichlet Process (DP)-based relaxed probabilistic Hungarian algorithm for viewpoint-invariance.
• A semantic selection strategy to identify potential submaps for loop closures.
• Place recognition based on the semantic classes and the covisibility graphs.
• Incorporation of the proposed approach into a graph-based semantic SLAM pipeline and evaluation of the resulting system.
A. Related Work
In recent years, advances in deep learning have led to more reliable and practically usable object detectors [9].
Consequently, this allowed SLAM systems to additionally
include semantically rich information in order to improve their
estimation [10]–[12].
Recently, some research specifically addresses the problem
of correctly assigning measurements to already known objects
utilizing additional semantic information [13], [14]. These
systems, however, do not deal with the estimation of the camera's position, i.e., they assume a static camera and are typically deployed indoors. Thus, they are not optimized for viewpoint-invariance, but rather emphasize probabilistic data association and the tracking of objects across multiple scenes.
Nevertheless, we make use of the close relationship to
robotic mapping since target tracking is a special case of
mapping. Elfring et al. [13] presented a semantic anchoring
framework using MHTs [15] which defers the data associ-
ation until the ambiguity between the instances is resolved.
Generally, the MHT enables accurate results but is inherently intractable for a large number of objects and requires frequent optimizations [16]. The work of Wong et al. [14] presents an approach using DPs which yields estimation results comparable to the MHT but with substantially less
computational effort. Nevertheless, their proposed approach is
not incremental and therefore, not directly applicable for the
mapping of a robot's environment. Similarly, in their previous work [17], the authors propose a world modeling approach using dependent DPs to accommodate dynamic objects. In
their proposed framework, the optimal measurement assign-
ment is computed using the Hungarian method operating on
negative log-likelihoods for the individual cases. Furthermore,
Atanasov et al. [18] emphasize a novel derivation of
the likelihood of Random Finite Set (RFS) models using the
matrix permanent for localizing in a prior semantic map. Their
system utilizes a probabilistic approach for data association
which considers false positives in the measurements.
There is a vast literature on indoor semantic mapping which, however, often does not directly incorporate semantic information in a SLAM pipeline but rather uses the information for mapping and scene interpretation [19], [20].
The work most similar to ours is that of Bowman et al. [21], who proposed a semantic system that directly incorporates semantic factors into the optimization framework. Despite using a probabilistic formulation for the data association, their approach inherently neglects false positives and false negatives, and does not include a prior on the assignments. Moreover, they limited the possible classes in their mapping such that only cars were used in their outdoor experiments. This greatly reduces the complexity in outdoor scenarios with semantically rich information and, furthermore, is not a reliable source for place recognition.
Another direction of research is to represent landmarks as
quadrics to capture additional information such as size and
orientation [12], [22]. However, they either assume that the
measurement association is given [22] or utilize the seman-
tic labels for a hard association using a nearest neighbor
search [12]. Thus, their work does not include any probabilistic
inference for the association and does not consider false
positives.
Our previous work by Gawel et al. [5] represents the
environment using semantic graphs and performs global lo-
calization by matching query graphs of the current location
with a global graph. The query graphs, however, are not used in a data association framework and thus, landmarks could potentially be duplicated. This system deals neither with map management and optimization nor with drift reduction for globally consistent mapping. Our semantic SLAM system
does not require any prior of the object shapes and comprises a
soft probabilistic data association for semantic measurements.
To the best of our knowledge, a complete semantic SLAM
system comprising the aforementioned approaches has not
been reported in literature before.
In the remainder of this paper, we first derive a semantic mapping approach (section II) and a concrete algorithm for localization (section III). The presented work is evaluated in section IV. Finally, section V concludes this work and gives further research directions.
II. SEMANTIC MULTIPLE-HYPOTHESES MAPPING
When performing SLAM, measurement noise typically
leads to drift and inconsistent maps – in particular when mea-
surements get wrongly associated to landmarks. We approach
this problem by introducing locally optimized submaps. Each
submap maintains an individual Multiple Hypothesis Tree
(MHt) and propagates a first-moment estimate to proximate submaps. Specifically, for each submap we want to maximize the posterior distribution, f(Θ_t | Z^t), of the associations Θ_t of all measurements Z^t received up to time step t, which is proportional to¹

$$f(z_t \mid \Theta_t, Z^{t-1})\, f(\theta_t \mid \Theta_{t-1}, Z^{t-1})\, f(\Theta_{t-1} \mid Z^{t-1}). \qquad (1)$$

¹ Vectors are underlined and matrices are written with bold capital letters.

Here, the set of N measurement associations at time step t is represented by θ_t = [θ_t^1 ... θ_t^N]. The first factor, f(z_t | Θ_t, Z^{t-1}), in (1) represents the distribution of all measurements at time t, for which we assume conditional independence of the individual measurements such that it equals

$$\prod_{i=1}^{n} f(z_t^i \mid \theta_t^i = l, Z^{t-1}) = \prod_{i=1}^{n} p_s(c_t^i)\, p(c_t^i \mid \gamma_{t-1}^l)\, f(p_t^i \mid \pi_{t-1}^l), \qquad (2)$$

where z_t^i denotes the attributes of the i-th semantic measurement at time t, θ_t^i is the index of the landmark, l, this measurement is associated with, and π_{t-1}^l, γ_{t-1}^l the assigned landmark's position and class estimated at time t-1. Furthermore, a semantic measurement, z_t^i, is split into its position, p_t^i, and class, c_t^i, components, whose Probability Mass Function (PMF), p_s, is a prior assumption based on how well the classes fit into the current environment. For the i-th class measurement, c_t^i, we assume p(c_t^i | θ_t^i = l, γ_{t-1}^l) := δ_{c_t^i, γ_{t-1}^l}, where δ denotes the Kronecker delta. For p_t^i, we assume the following form of a stochastic measurement model

$$f(p_t^i \mid \theta_t^i = l, \pi_{t-1}^l) = f_\nu(p_t^i - \pi_{t-1}^l),$$

where f_ν denotes the Probability Density Function (PDF) of the additive position measurement noise, ν_t^i, which we model as a zero-mean Gaussian distribution with covariance Σ_z. For practical stability, we use an Unscented Kalman Filter (UKF) [23] for the estimation of π_t^l.

Fig. 3: Likelihood of assigning a measurement to a specific scenario. Each measurement z_t^i, at time t, could be assigned to any of the existing landmarks m_{1..6}, represent a new landmark (new) or a false positive (fp). The thickness and opacity denote how likely the association is given a certain example of measurements and landmarks.

The second factor, f(θ_t | Θ_{t-1}, Z^{t-1}), in (1) is the assignment prior and is calculated using the well-known equation [13], [16]

$$f(\theta_t \mid \Theta_{t-1}, Z^{t-1}) = \frac{N_t^n!\, N_t^f!}{N_t^m!}\, p_n(N_t^n)\, p_f(N_t^f), \qquad (3)$$

where N_t^n denotes the number of new measurements, N_t^f the number of false positives identified by the Hungarian method, and N_t^m the total number of measurements at time step t. The functions p_n and p_f are prior PMFs over the number of new measurements and false positives, respectively. Typically both are chosen as Poisson PMFs with a specific spatial density λ and a volume V [24], e.g.

$$p_n(N) = \exp(-\lambda_n V)\, \frac{(\lambda_n V)^N}{N!}.$$

Each branch in the MHt comprises a different set of associations Θ_t. Utilizing (2) and (3) together with the previous posterior distribution f(Θ_{t-1} | Z^{t-1}), we can evaluate these branches using (1).
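To make this evaluation concrete, the following minimal Python sketch (our own illustration, not the authors' implementation; helper names such as branch_weight, lam_n and lam_f as well as the parameter choices are assumptions) scores a single MHt branch by combining the measurement likelihood (2) with the assignment prior (3):

import numpy as np
from math import factorial

def position_likelihood(p_meas, pi_landmark, sigma_z):
    """Zero-mean Gaussian measurement noise f_nu(p - pi)."""
    d = np.asarray(p_meas, dtype=float) - np.asarray(pi_landmark, dtype=float)
    k = d.size
    norm = np.sqrt((2 * np.pi) ** k * np.linalg.det(sigma_z))
    return float(np.exp(-0.5 * d @ np.linalg.solve(sigma_z, d)) / norm)

def poisson_pmf(n, lam, volume):
    """Prior over the number of new measurements or false positives."""
    mean = lam * volume
    return np.exp(-mean) * mean ** n / factorial(n)

def branch_weight(associations, landmarks, n_new, n_fp, n_meas,
                  sigma_z, lam_n, lam_f, volume, p_s=1.0):
    """associations: list of ((p_meas, c_meas), landmark_id);
    landmarks: dict landmark_id -> (pi, gamma)."""
    w = 1.0
    for (p_meas, c_meas), l in associations:      # measurement likelihood, eq. (2)
        pi_l, gamma_l = landmarks[l]
        if c_meas != gamma_l:                      # Kronecker delta on the class label
            return 0.0
        w *= p_s * position_likelihood(p_meas, pi_l, sigma_z)
    # assignment prior, eq. (3)
    w *= (factorial(n_new) * factorial(n_fp) / factorial(n_meas)
          * poisson_pmf(n_new, lam_n, volume) * poisson_pmf(n_fp, lam_f, volume))
    return w

In this sketch, branches whose associations contradict the observed class labels receive zero weight, which mirrors the Kronecker-delta class model above.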
A. Probabilistic Measurement Association
Finding the correspondence θ_t between measurements and mapped objects can be challenging since the current set of measurements often does not allow deriving a correct assignment.
Fortunately, this problem can be considered as a weighted
combinatorial assignment problem for which the Hungarian
algorithm [17], [25] is well known. Figure 3 illustrates the
probabilistic combinatorial assignment problem. To find the
most likely assignment we utilize a stochastic association al-
gorithm based on the DP. DPs are a good choice for modeling
the probability of seeing new and re-observing already mapped
landmarks [26].
The likelihood of the associations θ_t of new measurements z_t at time t with landmarks is expressed by f(θ_t | z_t, Θ_{t-1}, Z^{t-1}). We assume that at each time step a landmark in the scene can generate at most one observation. Inspired by the dependent DP formulation in [17], we differentiate four cases: (i) landmarks that have already been seen in the current submap, (ii) landmarks seen in previous submaps, (iii) new landmarks, and (iv) false positives. The likelihood for the association of a measurement z_t^i with an existing landmark k of the same class in the current submap is modeled as

$$f(\theta_t^i = k \mid z_t^i, \Theta_{t-1}, Z^{t-1}) = \exp(N_t^k)\, f_\nu(p_t^i - \pi_t^k). \qquad (4)$$
Here, the scalar N_t^k denotes the number of assignments to the landmark k. Despite the fact that we only deal with static objects, a landmark, l, which was seen in a previous submap at time τ is modeled using a transitional density, T, i.e.,

$$f(\theta_t^i = l \mid z_t^i, \Theta_{t-1}, Z^{t-1}) = \int f_\nu(p_t^i - x)\, T(x, \pi_\tau^l)\, dx. \qquad (5)$$

The transitional density, T, depends on the semantic class of the object and is used to accommodate for the unknown shape of the landmarks. Since we take the centroid of the segmented objects as input, we employ two approaches for the choice of T to compensate for large measurement noise. Objects such as poles and trees are modeled using a Dirac-δ distribution, T(x | π_τ^l) = δ(x - π_τ^l), reducing the right-hand side of (5) to f_ν(p_t^i - π_τ^l), the measurement distribution of z_t^i given the last seen position of l. The transition of objects like buildings and fences is modeled using a Gaussian distribution with covariance Σ_α, resulting in

$$\int f_\nu(x - p_t^i)\, \mathcal{N}(x; \pi_\tau^l, \Sigma_\alpha)\, dx = \mathcal{F}^{-1}\{\mathcal{F}\{\mathcal{N}(0, \Sigma_z)\} \cdot \mathcal{F}\{\mathcal{N}(\pi_\tau^l, \Sigma_\alpha)\}\}(p_t^i) = \mathcal{N}(p_t^i; \pi_\tau^l, \Sigma_z + \Sigma_\alpha),$$

where F is the Fourier transform and Σ_α is a transitional covariance depending on the object's class c.

Fig. 4: Illustration of the Gaussian mixture landmarks created by the fusion of the weighted hypotheses.

The likelihood of assigning a new landmark, l, to the i-th observation at time t is approximated by the uniform distribution in z_t^i over the volume of the map M [17], i.e.,

$$f(\theta_t^i = l \mid z_t^i, \Theta_{t-1}, Z^{t-1}) = \alpha \int f_\nu(p_t^i - x)\, H_{DP}(x)\, dx \approx \alpha\, \mathcal{U}_{|M|},$$
where H_DP is the base distribution of the DP. Generally, false positives occur due to clutter in the images and are essentially detected objects which do not physically exist in the environment. Having such cases in the map may result in improper assignments of future measurements. False positives have the likelihood f(θ_t^i = 0 | z_t^i, Θ_{t-1}, Z^{t-1}), i.e., the likelihood of the measurement i being an observation of the false landmark, 0. False positives are assumed to occur at a fixed rate ρ, i.e.,

$$f(\theta_t^i = 0 \mid z_t^i, \Theta_{t-1}, Z^{t-1}) \propto \prod_j f(z_t^i \mid \theta_t^i = j, \gamma_{t-1}^j, \pi_{t-1}^j)^{-1} \cdot \begin{cases} \rho\, N_t^0 & N_t^0 > 0 \\ \rho\, \alpha & N_t^0 = 0, \end{cases}$$

where N_t^0 denotes the number of false positives that occurred until time step t, and α is the concentration parameter of the DP.
The aforementioned four cases will be used as an input to
the Hungarian algorithm yielding an optimal assignment for
each measurement as well as landmark. Based on this initial
assignment, the optimal branch of the MHt will be formed. In
case the assignment is not distinct enough, new branches in
the MHt are generated by re-running the Hungarian algorithm
without the previous optimal assignment. Since small MHts
are generally better for computational performance, we only
create branches for associations that are reasonable.
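As an illustration of this step, the following Python sketch (an assumed implementation, not the authors' code; lh_existing, lh_new and lh_fp stand for the case likelihoods derived above) builds a cost matrix of negative log-likelihoods over the four cases and solves it with the Hungarian method:

import numpy as np
from scipy.optimize import linear_sum_assignment

LARGE = 1e9  # cost used for forbidden assignments

def association_cost_matrix(measurements, landmarks, lh_existing, lh_new, lh_fp):
    """measurements: list of (p, c); landmarks: list of landmark states.
    lh_existing(z, lm) evaluates cases (i)/(ii), e.g. eq. (4) or (5);
    lh_new(z) and lh_fp(z) evaluate cases (iii) and (iv)."""
    n, m = len(measurements), len(landmarks)
    eps = 1e-12
    # columns: m existing landmarks, then one "new" and one "false positive" slot per measurement
    cost = np.full((n, m + 2 * n), LARGE)
    for i, z in enumerate(measurements):
        for k, lm in enumerate(landmarks):
            cost[i, k] = -np.log(lh_existing(z, lm) + eps)
        cost[i, m + i] = -np.log(lh_new(z) + eps)          # case (iii): new landmark
        cost[i, m + n + i] = -np.log(lh_fp(z) + eps)       # case (iv): false positive
    return cost

def best_assignment(cost):
    """Optimal measurement-to-landmark assignment (Hungarian method)."""
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))

Re-running the assignment with the previous optimum blocked out (e.g., by setting its entries to LARGE) then yields the alternative associations that spawn new branches in the MHt.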
B. Optimized Resampling of Hypotheses
Each hypothesis is weighted using its measurement likelihood (2) and assignment prior term (3). At each time step, the existing hypotheses are reweighted and, if necessary, resampled by a systematic resampling technique [27]. In general, this fuses the current knowledge in the hypothesis set, eliminates hypotheses with a low weight, and preserves hypotheses with a high weight.

Fig. 5: Illustration of the trajectories of the KITTI sequences 05, 00, and 07 together with a laser map. Regions in blue denote the estimated trajectory, orange regions are submaps which were checked for loop closures, and green areas show performed loop closures. We gain additional efficiency by only checking a subset of the submaps for loop closure.
A crucial factor is deciding when resampling should be performed on the hypothesis tree. Here, it is common to use selective resampling [28] based on the effective sample size, which essentially captures the diversity of the hypothesis set. Consequently, resampling is only performed when the effective sample size falls below a certain threshold. Furthermore, many particle filter implementations only consider a fixed number of particles. However, it is desirable that the number of particles is high when the state uncertainty is high, and
low when the uncertainty is low. Fox [29] introduced a variable
sampling algorithm based on the KLD distance for particle
filters. During each iteration of the resampling procedure, the
number of hypotheses is dynamically bounded by n using

$$n = \frac{k-1}{2}\left(1 - \frac{2}{9(k-1)} + \sqrt{\frac{2}{9(k-1)}}\, z_{1-\delta}\right), \qquad (6)$$

where k is the current number of resampled hypotheses and z_{1-δ} is the upper 1-δ quantile of a normal distribution which models how probable the approximation of the true sample size is [29]. The value n is dynamically calculated at each step
of the resampling until the number of resampled hypotheses is
greater than n. Nevertheless, we bound the maximum number
of resampled hypotheses to avoid drastic changes.
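The following Python sketch (assumed helper names; for brevity the bound is computed once here, whereas the paper recomputes it at each step of the resampling) illustrates the adaptive resampling with the bound from (6) and systematic resampling [27]:

import numpy as np

def kld_bound(k, z_quantile):
    """Dynamic bound n on the number of hypotheses, eq. (6) as reconstructed above.
    Note: Fox's original KLD-sampling bound [29] additionally divides by an error
    bound epsilon and cubes the parenthesis."""
    if k <= 1:
        return 1.0
    a = 2.0 / (9.0 * (k - 1))
    return (k - 1) / 2.0 * (1.0 - a + np.sqrt(a) * z_quantile)

def systematic_resample(weights, n_samples):
    """Draw n_samples hypothesis indices with systematic resampling."""
    w = np.asarray(weights, dtype=float)
    w = w / w.sum()
    positions = (np.arange(n_samples) + np.random.uniform()) / n_samples
    return np.searchsorted(np.cumsum(w), positions)

def adaptive_resample(weights, z_quantile=1.64, max_hyp=40):
    """Resample the hypothesis set; the kept count is capped by eq. (6) and a hard maximum.
    z_quantile=1.64 roughly corresponds to delta = 0.05 (assumed value)."""
    k = len(weights)
    n = int(np.ceil(kld_bound(k, z_quantile)))
    n = max(1, min(n, max_hyp))
    return systematic_resample(weights, n)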
III. SEMANTIC LOCALIZATION
Every time a submap is completed, the resulting map as
well as the odometry measurements are used to compute
a trajectory estimate. The weighted hypotheses allow for
the creation of weighted mixtures of probability distributions
resulting in a weighted fusion which considers the uncertainty
of each hypothesis (cf. figure 4). The result of the fusion
is formulated as a relative constraint and incorporated into
a nonlinear factor graph as semantic landmarks.
A. Semantic Evaluation of Submaps
Loop closures are identified by first evaluating the quality
of the submap in terms of the occurred landmarks. This is
motivated by the fact that in many cases (e.g. highways) it
is not necessary to check for loop closures. The examination
whether a submap is good enough for loop closure detection is
based on a decision tree. We train a decision tree by comparing
the trace of the state covariance before and after incorporating
a specific region in the factor graph. The trained decision tree
is specific to an urban environment and furthermore, to the
length of the submap. Thus for other environments, a retraining
of the decision tree or online learning approaches are required.
A submap is considered good either when it lowers the size of the bounding box or when it contains loop closures.
Evaluating a submap requires extracting descriptive at-
tributes from it and we argue that semantic information is
a crucial factor for this. In more detail, we first approximate
the Shannon entropy H of the mixture distribution using an approximate single multivariate Gaussian distribution over the submap, with covariance Σ, i.e.,

$$H = \frac{1}{2}\log\left((2\pi e)^3 \det(\Sigma)\right).$$
On the level of semantic classes we then calculate a term frequency-inverse document frequency (tf-idf) score, i.e.,

$$S^i_{\mathrm{tf\text{-}idf}} = \sum_c \frac{n_c^i}{n^i} \log\frac{N}{n_c},$$

where n_c^i denotes the number of occurrences of class c in submap i, and n^i the total number of classes in i. Furthermore, N denotes the total number of submaps processed so far and n_c represents the number of scenes within the submaps which included an object of type c. This is efficiently compared and updated with the previous submaps. As a final score, we make use of the number of landmarks within the submap.
As shown in figure 5, the loop closure detection is triggered
once the decision tree predicts that a submap is potentially
good in terms of its mapped objects.
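A minimal sketch of how such submap attributes could be computed is given below (our own helper names, not the authors' implementation); the resulting feature vector could then be used to train a standard decision tree classifier:

import numpy as np

def submap_entropy(cov):
    """Shannon entropy of the single 3D Gaussian approximating the landmark mixture."""
    return 0.5 * np.log((2 * np.pi * np.e) ** 3 * np.linalg.det(cov))

def tfidf_score(class_counts, total_submaps, submaps_with_class):
    """class_counts: {class: occurrences in this submap};
    submaps_with_class: {class: number of scenes/submaps containing the class}."""
    n_i = sum(class_counts.values())
    return sum(n_ic / n_i * np.log(total_submaps / submaps_with_class[c])
               for c, n_ic in class_counts.items())

def submap_features(landmark_positions, class_counts, total_submaps, submaps_with_class):
    """Feature vector fed to the loop-closure decision tree."""
    P = np.asarray(landmark_positions, dtype=float)
    cov = np.cov(P.T)                      # approximate Gaussian over the submap
    return [submap_entropy(cov),
            tfidf_score(class_counts, total_submaps, submaps_with_class),
            len(landmark_positions)]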
B. Semantic Loop Closure Detection
Loop closures are found in multiple steps. First, we find
similar submaps using an incremental kd-tree [30] of the
submap’s normalized class histograms while employing the
Jensen-Shannon divergence (JSD) [31] as the distance mea-
sure. For each similar submap the individual scene candidates
are identified with another kd-tree of the scenes' normalized class histograms using the L2-norm for faster retrieval. Additional efficiency can be achieved with a tuning parameter that restricts the search space of the kd-tree in terms of the distance.

Fig. 6: Comparison of several KITTI sequences by means of the RMSE plotted as a function of the frames. The two methods, UKF and MHM, show several jumps in the error due to wrong data associations. Each wrong association pulls the factor graph towards a wrong direction, which results in the jumps of the RMSE. Our method performs more correct data associations and therefore maintains the lowest error without any sudden changes.
Good loop closure candidates found by the second kd-tree
are verified and further filtered with a discrete Bayes filter. We
define a Markov chain between the events for loop closure and
no loop closure. The transitional probabilities are chosen to be
similar to [7].
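For reference, the Jensen-Shannon divergence between two normalized class histograms can be computed as follows (a small sketch; the incremental kd-tree itself is not shown):

import numpy as np

def jensen_shannon_divergence(p, q, eps=1e-12):
    """JSD [31] between two semantic class histograms (normalized internally)."""
    p = np.asarray(p, dtype=float); p = p / p.sum()
    q = np.asarray(q, dtype=float); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log((a + eps) / (b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)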
The verification process calculates two scores for how
similar the candidate scene and the current scene are. First,
the topology of a scene is represented by the Laplacian matrix
which is calculated based on the spatial relationship between
the semantic classes as well as their degrees in the scene. We
compare the topologies of two scenes based on a normalized
cross correlation (NCC) [32] score, S_NCC. Second, another score, S_scene, expresses the overall similarity of the landmarks
in the two scenes. For this the landmarks get associated with
the Hungarian algorithm on the estimated landmark positions
of each scene and their Euclidean distances. A pair of matched
landmarks i and j contributes to S_scene through

$$s^{i,j}_{\mathrm{match}} := 1 - \frac{H_{i,j}}{2}, \qquad s^{i,j}_{\mathrm{class}} := (1 - \delta_{c_i, c_j}) \cdot p,$$

where H is the Hungarian cost matrix (output of the Hungarian algorithm), p denotes a penalty factor, c_i, c_j are the labels of the landmarks i, j, and δ denotes the Kronecker delta. The two scores s_match and s_class are combined over all matched landmark pairs as follows:

$$S_{\mathrm{scene}} := \sum_{i,j} 1 - s^{i,j}_{\mathrm{match}}\, s^{i,j}_{\mathrm{class}}.$$
The sum of both scores, S_NCC and S_scene, has to be larger than a threshold (tuning parameter) to verify the match of the two scenes. This binary decision serves as input to the
discrete Bayes filter which finally gets to decide whether to use
the scene pair as a loop closure candidate for the next step. As
the last step, the set of all loop closure candidates undergoes
a final geometric consistency check based on RANSAC before
the actual loop closure constraints are inserted into the factor
graph. Both the Hungarian algorithm and RANSAC can be
computationally expensive. Therefore, we filter most invalid
candidates beforehand using the kd-trees which can be per-
formed in logarithmic time. For additional robustness, we use
m-estimators with Cauchy functions [33] in the optimization
of the factor graph.
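The two verification scores could be sketched as follows (assumed structures and helper names; the radius-based scene graph and the score combination follow the reconstructed formulas above and are not the authors' exact implementation):

import numpy as np
from scipy.optimize import linear_sum_assignment

def scene_laplacian(positions, radius=10.0):
    """Graph Laplacian of a scene; landmarks closer than radius are connected (assumption)."""
    P = np.asarray(positions, dtype=float)
    d = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
    A = ((d < radius) & (d > 0)).astype(float)
    return np.diag(A.sum(axis=1)) - A

def ncc(a, b):
    """Normalized cross-correlation S_NCC between two equally sized matrices."""
    a = a - a.mean(); b = b - b.mean()
    return float((a * b).sum() / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def scene_score(pos_a, cls_a, pos_b, cls_b, penalty=0.5):
    """Hungarian association of the two scenes' landmarks and the combined score S_scene."""
    cost = np.linalg.norm(np.asarray(pos_a, dtype=float)[:, None, :]
                          - np.asarray(pos_b, dtype=float)[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    score = 0.0
    for i, j in zip(rows, cols):
        s_match = 1.0 - cost[i, j] / 2.0
        s_class = penalty * float(cls_a[i] != cls_b[j])
        score += 1.0 - s_match * s_class
    return score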
IV. EVALUATION
We evaluate our system on the KITTI dataset sequences
00, 05, 06 and 07 [34] where we use SegNet [35] to derive
the semantic classes of the individual scenes. For each image,
the semantic objects are extracted and projected into the world
frame [5] using the Velodyne scans. In general, our approach is not limited to outdoor use but rather depends on the object detector. Additionally, one might need to adapt the decision tree and p_s in equation (2).
Since, to our knowledge, no appropriate approach for
comparison is publicly available, we could not compare
our proposed approach to another semantic SLAM system.
Therefore, we evaluate our proposed system against two other semantic solutions as well as two non-semantic approaches. As a baseline we compare to LeGo-LOAM [36] and ORB-SLAM2 stereo [37]. For a semantic baseline, we utilize a single UKF estimator with a Hungarian algorithm based on the L2 norm for data association. This essentially performs a nearest neighbor data association with a single hypothesis. We also add a multiple hypothesis mapping (MHM) baseline using a maximum likelihood approach and an MHt. Similar to our main approach, frequent optimizations of the MHt are needed. Hence, we threshold the likelihood if the MHt reaches a certain size (see equation (2)) and keep only the best third of all hypotheses.
Our proposed system is agnostic to the source of odometry
which we show by making use of two different ones for
all sequences. More specifically, we utilized the tracking of
ORB features [37] as well as LiDAR surface and corner
features [36] to get an odometry estimate. We have used the
provided camera calibration parameters from Geiger et al. [34] for both Visual Odometry (VO) and ORB-SLAM2. Thus, the results of ORB-SLAM2 differ from those reported in the work of Mur-Artal et al. [37], where they used different
parameters per sequence.
A. Results
We demonstrate the performance of our proposed approach
by means of calculating the RMSE of the estimated trajectory
location to the GPS ground truth provided by the KITTI
dataset using the VO and Laser Odometry (LO) sources. Both,
VO and LO, accumulate an error and hence, are subject to drift
over time. Using our DP-based multiple hypothesis mapping
approach together with our place recognition (cf. figure 5) we
can reduce the drift by up to 50 % for several sequences.
The simple UKF and MHM estimation approaches are
strongly affected by wrong measurement associations resulting
in bad constraints in the pose graph. These wrong assignments
can be observed in figure 6 as sudden jumps in the RMSE.
Consequently, the RMSE will have an increased total error
which is even worse than the raw odometry source for a
few sequences. Our approach is less perturbed by wrong associations and thus maintains a more robust RMSE over time.
Table I shows the mean and standard deviation of the RMSE for each sequence and estimator. Our approach does particularly well on the longer sequences (00, 05), which results from the correct data association together with the semantic place recognition. Regardless of the odometry source, our proposed system yields results comparable to the state-of-the-art SLAM approaches in VO and LO and, compared to the MHM, maintains fewer hypotheses about the environment
as shown in figure 7. Due to the fact that the total number
of hypotheses of the environment is only increased when the
state uncertainty is high, we gain additional efficiency for our
proposed system.
Figure 8 evaluates the performance of our algorithm when the semantic classes are removed as well as when it is restricted to a single hypothesis. The single-hypothesis, non-semantic solution still performs a probabilistic Hungarian method and achieves a mean RMSE of 5.45 m ± 2.94 m.
Sequence      00           05           06           07
VO            8.41±2.51    6.42±3.9     3.8±1.4      6.23±2.4
UKF           11.14±2.9    8.9±4.2      5.96±2.4     14.55±5.1
MHM           6.84±1.3     5.6±2.94     3.17±1.07    6.93±2.02
DPMHM         4.54±1.58    4.4±2.3      2.3±0.74     2.9±4.5
ORB-SLAM2     5.7±1.0      4.51±1.3     2.1±0.6      2.71±0.9
LO            7.33±2.5     2.96±1.3     3.3±1.2      6.65±2.85
UKF           6.96±1.7     6.96±1.5     5.47±0.75    10.4±2.8
MHM           5.3±2.2      3.8±1.4      2.67±0.35    11.3±3.9
DPMHM         3.94±1.17    2.42±0.66    2.66±0.35    5.5±2.4
LeGo-LOAM     5.8±2.2      2.54±0.72    2.15±0.52    1.0±0.16

TABLE I: Comparison of the mean RMSE and standard deviation in meters achieved with VO and LO as the underlying odometry source.
Fig. 7: Evaluation of the two multiple hypotheses-based imple-
mentations. The naive likelihood thresholding approach has an
average of 12 (sequence 00) and 18 (sequence 05) hypotheses,
respectively, whereas our proposed resampling approach has
an average of 7 (sequence 00) and 6 (sequence 05) hypotheses.
Fig. 8: Reduction of the RMSE for a single hypothesis, non-
semantic DPMHM. Including both modalities, we achieve an
average reduction of 34 %, with only multiple hypotheses 27 %
and 13 % with a pure semantic DPMHM.
V. CONCLUSION AND FUTURE WORK
In this work, we presented a novel semantic SLAM system
based on factor graphs and a MHt mapping approach aiming
to deal with ambiguities in data association in semantic-
based SLAM. We showed that our resampling method for
optimizing the hypothesis tree yields a more robust estimation
and requires substantially fewer hypotheses. Moreover, we gain
additional efficiency by preselecting submaps for loop closure
detection.
As further research, we intend to remove the assumption
that each object can generate at most one measurement per
time-step since a bad detector or viewpoint angle might
easily violate this assumption. Additionally, this work could
potentially also be extended towards utilizing an instance-
based detection. Instance information could possibly give a
prior on how to associate the measurements at the cost of an
additional non-Gaussian discrete random variable.
REFERENCES
[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira,
I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous
localization and mapping: Toward the robust-perception age,” IEEE
Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016.
[2] A. Ess, T. Müller, H. Grabner, and L. J. Van Gool, “Segmentation-Based Urban Traffic Scene Understanding,” in BMVC, vol. 1. Citeseer, 2009,
p. 2.
[3] H. Blum, A. Gawel, R. Siegwart, and C. Cadena, “Modular sensor fusion
for semantic segmentation,” in 2018 IEEE/RSJ International Conference
on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3670–3677.
[4] I. Kostavelis and A. Gasteratos, “Semantic mapping for mobile robotics
tasks: A survey,” Robotics and Autonomous Systems, vol. 66, pp.
86–103, 2015.
[5] A. Gawel, C. Del Don, R. Siegwart, J. Nieto, and C. Cadena, “X-View:
Graph-Based Semantic Multi-View Localization,” 2017.
[6] M. Labbe and F. Michaud, “Appearance-based loop closure detection
for online large-scale and long-term operation,” IEEE Transactions on
Robotics, vol. 29, no. 3, pp. 734–745, 2013.
[7] A. Angeli, D. Filliat, S. Doncieux, and J.-A. Meyer, “A Fast and Incremental Method for Loop-Closure Detection Using Bags of Visual Words,” IEEE Transactions on Robotics, vol. 24, 2008.
[8] W. Chen, M. Fang, Y.-H. Liu, and L. Li, “Monocular semantic SLAM in
dynamic street scene based on multiple object tracking,” in Cybernetics
and Intelligent Systems (CIS) and IEEE Conference on Robotics, Au-
tomation and Mechatronics (RAM), 2017 IEEE International Conference
on. IEEE, 2017, pp. 599–604.
[9] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and
J. Garcia-Rodriguez, “A review on deep learning techniques applied to
semantic segmentation,” arXiv preprint arXiv:1704.06857, 2017.
[10] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J.
Davison, “SLAM++: Simultaneous localisation and mapping at the level
of objects,” Proceedings of the IEEE Computer Society Conference on
Computer Vision and Pattern Recognition, pp. 1352–1359, 2013.
[11] J. Civera, D. Galvez-Lopez, L. Riazuelo, J. D. Tardos, and J. M. M.
Montiel, “Towards semantic SLAM using a monocular camera,” in 2011
IEEE/RSJ International Conference on Intelligent Robots and Systems,
2011, pp. 1277–1284.
[12] M. Hosseinzadeh, Y. Latif, T. Pham, N. Suenderhauf, and I. Reid,
“Structure Aware SLAM using Quadrics and Planes,” arXiv preprint
arXiv:1804.09111, 2018.
[13] J. Elfring, S. Van Den Dries, M. J. Van De Molengraft, and M. Stein-
buch, “Semantic world modeling using probabilistic multiple hypothesis
anchoring,” Robotics and Autonomous Systems, vol. 61, no. 2, pp. 95–
105, 2013.
[14] L. L. Wong, L. P. Kaelbling, and T. Lozano-Pérez, “Data association for semantic world modeling from partial views,” Springer Tracts in
Advanced Robotics, vol. 114, pp. 431–448, 2016.
[15] S. S. Blackman, “Multiple hypothesis tracking for multiple target
tracking,” IEEE Aerospace and Electronic Systems Magazine, vol. 19,
no. 1, pp. 5–18, 2004.
[16] I. Cox and S. Hingorani, “An Efficient Implementation and Evaluation of Reid’s Multiple Hypothesis Tracking Algorithm for Visual Tracking,” in Proceedings of the 12th IAPR International Conference on Pattern Recognition, no. 1, 1994, pp. 437–442.
[17] L. L. S. Wong, T. Kurutach, L. P. Kaelbling, and T. Lozano-Pérez, “Object-based World Modeling in Semi-Static Environments with Dependent Dirichlet-Process Mixtures,” pp. 1–23, 2015.
[18] N. Atanasov, M. Zhu, K. Daniilidis, and G. J. Pappas, “Localization from
semantic observations via the matrix permanent,” International Journal
of Robotics Research, vol. 35, no. 1-3, pp. 73–99, 2016.
[19] A. Nüchter and J. Hertzberg, “Towards semantic maps for mobile
robots,” Robotics and Autonomous Systems, vol. 56, no. 11, pp. 915–926,
2008.
[20] A. Pronobis and P. Jensfelt, “Large-scale semantic mapping and rea-
soning with heterogeneous modalities,” in 2012 IEEE International
Conference on Robotics and Automation. IEEE, 2012, pp. 3515–3522.
[21] S. L. Bowman, N. Atanasov, K. Daniilidis, and G. J. Pappas, “Prob-
abilistic data association for semantic SLAM,” Proceedings - IEEE
International Conference on Robotics and Automation, pp. 1722–1729,
2017.
[22] L. Nicholson, M. Milford, and N. Sünderhauf, “QuadricSLAM: Dual
Quadrics as SLAM Landmarks,” in Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops, 2018, pp. 313–
314.
[23] E. A. Wan and R. Van Der Merwe, “The unscented Kalman filter
for nonlinear estimation,” in IEEE 2000 Adaptive Systems for Signal
Processing, Communications, and Control Symposium, AS-SPCC 2000,
vol. 7, 2000, pp. 153–158.
[24] Y. Bar-Shalom, S. S. Blackman, and R. J. Fitzgerald, “Dimensionless
score function for multiple hypothesis tracking,” IEEE Transactions on
Aerospace and Electronic Systems, vol. 43, no. 1, pp. 392–400, 2007.
[25] R. Jonker and T. Volgenant, “Improving the Hungarian assignment
algorithm,” Operations Research Letters, vol. 5, no. 4, pp. 171–175,
1986.
[26] A. Ranganathan and F. Dellaert, “A rao-blackwellized particle filter for
topological mapping,” Proceedings - IEEE International Conference on
Robotics and Automation, vol. 2006, no. May, pp. 810–817, 2006.
[27] B. Ristic, S. Arulampalam, and N. Gordon, Beyond the Kalman Filter:
Particle Filters for Tracking Applications, 2004.
[28] G. Grisetti, C. Stachniss, and W. Burgard, “Improving Grid Based
SLAM with Rao Blackwellized Particle Filters by Adaptive Proposals
and Selective Resampling,” International Conference on Robotics and
Automation, no. April, pp. 2443–2448, 2005.
[29] D. Fox, “Adapting the sample size in particle filters through KLD
Sampling,” Intl Jour of Robotics Research, vol. 22, no. 12, pp. 985–
1004, 2003.
[30] J. L. Bentley, “Multidimensional binary search trees used for associative
searching,” Communications of the ACM, vol. 18, no. 9, pp. 509–517,
1975.
[31] J. Lin, “Divergence Measures Based on the Shannon Entropy,” IEEE
Transactions on Information Theory, vol. 37, no. 1, pp. 145–151, 1991.
[32] S. Cascianelli, G. Costante, E. Bellocchio, P. Valigi, M. L. Fravolini,
and T. A. Ciarfuglia, “Robust visual semi-semantic loop closure
detection by a covisibility graph and CNN features,” Robotics and
Autonomous Systems, vol. 92, pp. 53–65, 2017.
[33] G. H. Lee, F. Fraundorfer, and M. Pollefeys, “Robust pose-graph
loop-closures with expectation-maximization,” in Intelligent Robots and
Systems (IROS), 2013 IEEE/RSJ International Conference on. IEEE,
2013, pp. 556–563.
[34] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous
driving? the KITTI vision benchmark suite,” Proceedings of the IEEE
Computer Society Conference on Computer Vision and Pattern Recog-
nition, pp. 3354–3361, 2012.
[35] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A Deep Convo-
lutional Encoder-Decoder Architecture for Image Segmentation,” IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 39,
no. 12, pp. 2481–2495, 2017.
[36] T. Shan and B. Englot, “LeGO-LOAM: Lightweight and ground-
optimized lidar odometry and mapping on variable terrain,” in 2018
IEEE/RSJ International Conference on Intelligent Robots and Systems
(IROS). IEEE, 2018, pp. 4758–4765.
[37] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: An Open-Source SLAM
System for Monocular, Stereo, and RGB-D Cameras,” IEEE Transac-
tions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.