Learning to detect misaligned point clouds
Håkan Almqvist, Martin Magnusson, Tomasz Kucner and Achim J. Lilienthal*
Abstract
Matching and merging overlapping point clouds is a common procedure in many applications,
such as mobile robotics, 3D mapping and object visualization. However, fully automatic
point cloud matching, without manual verification, is still not possible, since no current matching
algorithm provides a reliable method for detecting misaligned point clouds. In
this article we make a comparative evaluation of geometric consistency methods for classifying
aligned and nonaligned point cloud pairs. We also propose a method that combines
the results of the evaluated methods to further improve the classification of the point clouds.
We compare a range of methods on two data sets from different environments related to
mobile robotics and mapping. The results show that methods based on a Normal Distributions
Transform representation of the point clouds perform best under the circumstances
presented herein.
*The authors are with the Centre for Applied Autonomous Sensor Systems (AASS) at
Örebro University, Sweden.
firstname.lastname@oru.se
1 Introduction
Point cloud registration is a central part of robot perception and is used in numerous mobile robotics
applications, such as mapping, object detection and manipulation. Point cloud registration can be formulated
as the problem of finding the relative transformation (i.e. a translation and a rotation) between two 3D point
clouds that best aligns them. However, no existing point cloud registration algorithm is guaranteed to
align every pair of point clouds correctly. Methods for automatically detecting misaligned point clouds are
therefore important, yet this problem has not been thoroughly studied before.
The problem is illustrated in Figures 1 and 2: two aligned point clouds are shown in Figure 1,
and two misaligned point clouds are shown in Figure 2.
Figure 1: Two point clouds, one in red and one in green, that are well aligned.
Figure 2: Two point clouds, one in red and one in green, that are not well aligned.
In this article we will present an investigation of methods that can be used for automatic detection of
misaligned point clouds.
In related work, some authors have previously proposed to measure global map quality by comparing the
output of a mapping algorithm to a ground-truth reference. Birk [5] and Schwertfeger and Birk [20] proposed
a formal definition of map brokenness, and an algorithm for measuring how many such inconsistencies are
present in a map. However, being dependent on a given reference map is a severe limitation in autonomous
mapping and localization applications. In the present paper, we will only investigate methods for determining
whether two point clouds are aligned or not, without requiring comparison to a ground-truth estimate.
The typical work cycle when doing metric mapping with, for example, a mobile robot is to acquire consecutive,
overlapping 3D point clouds and merge them together using a point cloud registration algorithm. Such
algorithms try to find the best relative transformation between two point clouds, starting from an initial
alignment estimate.
All currently available local and global registration algorithms can fail to find the correct alignment. For
challenging data sets, both local and global state-of-the-art registration algorithms can have relatively low
success rates [16]. The predominant causes of failure are bad initial alignment or small overlap of the point
clouds. Typically, the registration algorithm gets trapped in a local minimum (with respect to its objective
function) in the failure cases.
In any scenario where the process of aligning point clouds is intended to be automated, a reliable method
for verifying the point cloud alignment is required.
A number of popular methods in the literature approach this problem (see Section
2); however, there is currently no consensus regarding which method is most reliable. To the best of our
knowledge, no methodical evaluation of these measures has yet been carried out. We evaluate
some common methods for point cloud misalignment detection, as well as some less established ones, by
assessing how well they discriminate between well-aligned and misaligned scans on two
labeled large-scale data sets.
The methods presented in this article can be applied to any situation where an automated method for
determining whether two point clouds are aligned or not is required, although the datasets chosen for the
work herein are connected to localization in mobile robotics.
We will refer to a method capable of determining whether two point clouds are aligned or not as a “classifier”,
for the simple reason that what it does is to “classify” the point clouds as aligned or not aligned.
This article provides two main contributions:
1. Evaluations of several existing methods for detecting whether two point clouds are aligned or not.
2. The use of boosting to learn a strong classifier by combining the weaker individual classifiers.
2 Alignment quality measures
This section briefly lists the methods that we have studied. More details of each method can be found in
Section 3. Alignment quality of point clouds can be investigated at different levels, from maps consisting of
many point clouds to point cloud pairs.
Alignment quality measures for point cloud pairs can roughly be divided into two classes: those that can
be used directly on any pair of point clouds, and those that rely on measures taken both before and
after a point cloud registration, thus measuring how the alignment of the point clouds has changed as a
consequence of the registration. The second class rests on the assumption that if an alignment quality
measure gives a better result after registration than before, the alignment of the pair has
improved. Since we are interested in methods that can classify whether two point clouds
are aligned or not in all contexts, without requiring registration, we will not consider the second class of
methods here. It is also precarious to evaluate point cloud alignment before and after registration using
the same measures as those used by the registration algorithm. We will discuss this more thoroughly in Section
5.1.3.
A common property of all point cloud alignment measures is that the result is dependent on the scene being
evaluated. Therefore, no method provides a universal solution that is capable of correctly describing the
alignment of any point cloud pair.
Previous work on evaluating the performance of different point cloud alignment measures is scarce.
A notable exception is the work of Campbell and Whitty [6], who recently evaluated four measures as well as a
combined measure using a multi-class support vector machine. Their evaluations show that the combined
measure achieves better results than the single measures at the cost of requiring a larger training set.
RMS A popular method [19] is to use the root mean squared point-to-point distance between the point
clouds, which also is the function that is minimized by ICP registration [2]. ICP registration will be further
introduced in Section 5.1.3. The RMS distance is a measure of the average distance between the nearest-
neighbor point pairs in two point clouds. Point clouds that are well aligned should receive a small positive
RMS value.
However, as demonstrated by, for example, Silva et al. [21], the mean squared error is not always a good measure
of alignment quality, especially not for detecting small errors, since it does not reveal any information
concerning the distribution of the errors.
When computing the RMS distance, care has to be taken not to include outlier points, or points from non-
overlapping regions, because such points can influence the computed value drastically, even for otherwise
well-aligned point clouds. Therefore, points that do not have a neighbor in the other point cloud within a
threshold distance are considered outliers and removed.
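The outlier-rejecting RMS measure can be sketched as follows. This is a minimal Python illustration, not the authors' implementation: it assumes point clouds are lists of (x, y, z) tuples, uses brute-force nearest-neighbor search, and measures distances one way, from cloud A to cloud B. The name `rms_distance` is hypothetical.

```python
import math


def rms_distance(cloud_a, cloud_b, max_dist):
    """RMS of nearest-neighbour distances from points in cloud_a to cloud_b.

    Points of cloud_a whose nearest neighbour in cloud_b is farther than
    max_dist are treated as outliers (or non-overlapping) and ignored.
    Returns None if no point has a neighbour within max_dist.
    """
    squared = []
    for p in cloud_a:
        # Brute-force nearest neighbour; a k-d tree would be used in practice.
        d = min(math.dist(p, q) for q in cloud_b)
        if d <= max_dist:
            squared.append(d * d)
    if not squared:
        return None
    return math.sqrt(sum(squared) / len(squared))
```

Evaluating the same pair with several values of `max_dist` gives one RMS classifier per threshold, as with the fixed thresholds listed in Table 1.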
NDT Score NDT, the normal distributions transform [3, 4, 23], can be used to obtain a measure of
point cloud alignment. In the NDT representation, one [14] or both [22] point clouds are represented by a
combination of Gaussians, describing the probability of finding part of the surface at any point in space.
The level of alignment can then be measured by evaluating the NDT representation for the point clouds as
described in detail in Section 3.2.
NDT Hessian The inverse Hessian matrix of the NDT score function can be used as an estimate of the
covariance matrix of the pose parameters, and as such gives an indication of the certainty by which each
pose parameter can be determined. Although NDT is often used for point cloud registration [4, 16], the
score and Hessian, given a pair of point clouds and a relative transformation, can be computed without
actually performing a registration. Well-aligned point clouds should have an inverse NDT Hessian where all
eigenvalues are small.
Plane extraction Chandran-Ramesh and Newman [7, 8] train a conditional random field to detect “suspicious”
and “plausible” areas of a map. The suspicious areas, in this case, are those that contain intersecting
plane patches or parallel planes in close proximity. This work is one of the few where the accuracy of the proposed
method is evaluated. In their evaluations on 3D laser data, the percentage of correct classifications
reaches approximately 80%. In Section 3.4 we adapt the method to work on point cloud pairs.
Partitioned mean normals The partitioned mean normals measure, used by Makadia et al. [17], uses
consistency between normals as the alignment measure. A normal is calculated for each point in a point
cloud from a number of nearest-neighbor points within a specified radius. The space containing each point
cloud is voxelized and the mean normal is calculated for each voxel. By comparing the mean
normals of corresponding voxels in the two point clouds, a consistency value can be computed. Well-aligned
point clouds should have a small mean difference between corresponding voxel normals.
Surface interpenetration measure Silva et al. [21] propose a surface interpenetration measure, based
on the observation that well-aligned point clouds should present a “splotchy” surface, where coinciding
surfaces from the point clouds cross over each other repeatedly. The alignment measure can be acquired by
identifying and counting “interpenetrating” points between two point clouds, i.e. points where surfaces are
intersecting. Well-aligned point clouds should have a high number of interpenetrating points.
3 Point Cloud Alignment Classification
Point cloud alignment classification is the task of classifying the alignment of point clouds into two or
more “classes”. The aim of the classification is to detect whether the point clouds are correctly aligned
(within some tolerance margin) or not. In our evaluation, each point cloud pair is given a binary
classification as “aligned” or “not aligned”. It would also be possible to treat the
problem as a continuous one, using a value that describes the degree of misalignment; however,
since we evaluate several different classifiers with different objective functions, it would be difficult to find a
common measure that applies to all of them. For binary classification, we need a quantitative
measure of “alignedness” that we can threshold, and each classifier has been designed to provide such a measure.
Using AdaBoost, we find a suitable threshold for each classifier on a subset of the available data. The
threshold is then used to evaluate the remaining part of the data set. In general, no single threshold
suits every classification task for a given classifier, but by training on relevant data sets we can
find thresholds that provide a suitable tradeoff between classification precision and recall.
In this section we will describe, in detail, all the alignment measures that we will be evaluating, and the
parameters we select for each classifier.
3.1 RMS distance classifier
The neighbor threshold distance, explained in Section 2, has a major impact on the behavior of the RMS-distance
classifier, and also affects the magnitude of the errors that it can classify. Too small a threshold
might remove misaligned but overlapping points (which should not be considered outliers), leading to an
overconfident error measure. Too large a threshold might include points that are truly outliers, i.e.
points in non-overlapping parts of the point clouds, and consequently classify well-aligned point cloud pairs
as not aligned. We have included a set of six thresholds, chosen to range from fine to coarse errors (in the
context of our data sets), and we have also included an RMS classifier with a statistical threshold suggested
by Masuda et al. [18]. The statistical threshold is defined as 2.5 standard deviations from the mean
RMS distance of all points.
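A statistical threshold of this kind can be sketched as below. This is our reading, not necessarily that of Masuda et al. [18]: a point is kept if its nearest-neighbor distance is at most the mean plus 2.5 standard deviations of all nearest-neighbor distances. Brute-force search and the name `rms_statistical` are assumptions for illustration.

```python
import math
import statistics


def rms_statistical(cloud_a, cloud_b, k=2.5):
    """RMS distance with a data-driven outlier threshold (assumed reading).

    Keeps points whose nearest-neighbour distance is <= mean + k * std of all
    nearest-neighbour distances; returns (rms, threshold).
    """
    dists = [min(math.dist(p, q) for q in cloud_b) for p in cloud_a]
    threshold = statistics.mean(dists) + k * statistics.pstdev(dists)
    # The smallest distance never exceeds the mean, so `kept` is never empty.
    kept = [d for d in dists if d <= threshold]
    return math.sqrt(sum(d * d for d in kept) / len(kept)), threshold
```

Unlike the fixed thresholds, the cut-off here adapts to each point cloud pair, which is why a single far-away point is rejected while the tightly matched points are kept.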
The RMS-classifiers and their respective thresholds are specified in Table 1.
3.2 NDT Score classifier
Essentially, a 3D-NDT representation is constructed by creating a voxel grid over a point cloud and computing, for each
voxel, the mean and covariance of the points in it (that is, the parameters of a Gaussian function).
Consider two point clouds 𝒜 and ℬ, i.e. two unordered sets of 3D points. We compute an NDT grid for
the points in 𝒜, and keep the point cloud representation of ℬ = {x₁, ..., xₙ}.
Using the coordinate frame of 𝒜, we express the likelihood of ℬ being aligned with 𝒜 with the NDT score
function

$$ s(\mathcal{A}, \mathcal{B}, \mathbf{p}) = -\frac{1}{n} \sum_{i=1}^{n} \tilde{p}\big(T(\mathbf{p}, \mathbf{x}_i)\big), \qquad (1) $$

where p̃ is the Gaussian of the voxel of 𝒜 closest to the transformed point xᵢ, p is the pose of ℬ in the coordinate frame of 𝒜, T
is a function that transforms point xᵢ according to p, and n is the number of points in ℬ.
As described by Biber et al. [3], p̃ is a Gaussian that approximates the logarithm of a linear combination of
two probability functions: the normal distribution of the points of 𝒜 that are contained in the voxel, and a
constant distribution that is used for modeling outliers.
The sum s(𝒜, ℬ, p) corresponds to the negative log-likelihood that point cloud ℬ lies on the surface of point
cloud 𝒜 [14, Chapter 6], and is used directly as the alignment quality measure.
A good alignment should attain a large negative value. The closer to zero the NDT score is, the worse is the
alignment.
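The score in Equation 1 can be illustrated with the following minimal Python sketch. It is not the authors' implementation and simplifies in three labeled ways: the points of ℬ are assumed to be already transformed by T(p, ·); each point is evaluated against the Gaussian of the voxel that contains it rather than the closest voxel; and the outlier-mixture Gaussian p̃ of Biber et al. [3] is replaced by a plain exp(−½ q) term, where q is the squared Mahalanobis distance. The helper names, the minimum of 5 points per voxel and the diagonal regularization `eps` are our own assumptions. An `overlap_only` switch divides by the number m of points that hit a Gaussian instead of by n, mirroring the overlap-evaluation variant of Section 3.2.2.

```python
import math
from collections import defaultdict


def _mean_cov(points):
    """Sample mean and 3x3 covariance (nested lists) of a list of 3D points."""
    n = len(points)
    mu = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[sum((p[r] - mu[r]) * (p[c] - mu[c]) for p in points) / (n - 1)
            for c in range(3)] for r in range(3)]
    return mu, cov


def _inv3(m):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    adj = [[e * i - f * h, c * h - b * i, b * f - c * e],
           [f * g - d * i, a * i - c * g, c * d - a * f],
           [d * h - e * g, b * g - a * h, a * e - b * d]]
    return [[x / det for x in row] for row in adj]


def _voxel_key(p, voxel):
    return tuple(math.floor(x / voxel) for x in p)


def build_ndt(points_a, voxel=0.5, min_pts=5, eps=1e-3):
    """Voxel index -> (mean, inverse covariance) for point cloud A."""
    cells = defaultdict(list)
    for p in points_a:
        cells[_voxel_key(p, voxel)].append(p)
    ndt = {}
    for key, pts in cells.items():
        if len(pts) < min_pts:
            continue  # too few points for a stable covariance estimate
        mu, cov = _mean_cov(pts)
        for k in range(3):
            cov[k][k] += eps  # regularize near-singular (e.g. planar) cells
        ndt[key] = (mu, _inv3(cov))
    return ndt


def ndt_score(ndt, points_b, voxel=0.5, overlap_only=False):
    """Simplified Equation 1; more negative means better alignment.

    Points in voxels without a Gaussian contribute nothing. With
    overlap_only=True the sum is divided by the number m of points that
    hit a Gaussian instead of by the total number of points n.
    """
    total, m = 0.0, 0
    for x in points_b:
        cell = ndt.get(_voxel_key(x, voxel))
        if cell is None:
            continue
        mu, icov = cell
        d = [x[k] - mu[k] for k in range(3)]
        q = sum(d[r] * icov[r][c] * d[c] for r in range(3) for c in range(3))
        total += math.exp(-0.5 * q)
        m += 1
    denom = m if overlap_only else len(points_b)
    return -total / denom if denom else 0.0
```

The default voxel size of 0.5 follows the value stated above; the rest of the parameters would need tuning on real data.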
A characteristic weakness of the described implementation of this classifier is that the evaluation score will
be better for point clouds with a large overlap and worse for point clouds with a small overlap. We are
using four different variations of the NDT Score classifier to account for such effects: The standard NDT
implementation as described above and three different methods of point cloud segmentation described below.
The voxel size for the normal distribution calculation has been set to 0.5 meters for all classifiers. In our
experience, this voxel size provides a good tradeoff between the number of points in each voxel and the NDT
representation accuracy.
3.2.1 Remove ground floor
A common scenario is the acquisition of point clouds by a movable sensor in urban or indoor environments
where the ground floor is flat. The errors that appear when trying to fit point clouds in similar environments
are typically translations parallel to the ground floor and rotations around the ground floor normal since
those are the major movement directions. Translations along the normal of the ground floor and rotations
around any axis parallel to the ground floor are often orders of magnitude smaller. Even with a large
pose error between the two scans, the ground floor is therefore often still well aligned (for ground vehicles), and the
ground points risk outweighing the other structures in the scene that better reveal the pose error. Das
et al. [9] have proposed to use a Gaussian process model to remove ground points before registration. For
the data sets evaluated here, the ground is smooth enough to be reliably removed by filtering out all
points with a vertical normal vector. The normal is computed from the 10 nearest neighbors. All points
whose normal is close to the vertical axis, i.e. where the dot product of the unit normal and the
vertical unit vector exceeds 0.9, are removed from the point cloud.
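The ground filter described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: normals are estimated by principal component analysis of the 10 nearest neighbors (the point itself included), the smallest-eigenvalue eigenvector is found by power iteration on trace(C)·I − C, and the absolute value of the z component is compared with 0.9 because PCA normals have arbitrary sign. The names `_normal` and `remove_ground` are hypothetical.

```python
import math


def _normal(neigh, iters=200):
    """Unit normal of a neighbourhood: eigenvector of the smallest covariance eigenvalue.

    Found by power iteration on (trace(C) * I - C), whose dominant eigenvector
    is the eigenvector of C with the smallest eigenvalue.
    """
    n = len(neigh)
    mu = [sum(p[k] for p in neigh) / n for k in range(3)]
    c = [[sum((p[r] - mu[r]) * (p[s] - mu[s]) for p in neigh) / n
          for s in range(3)] for r in range(3)]
    shift = c[0][0] + c[1][1] + c[2][2]  # trace >= largest eigenvalue
    v = [0.3, 0.5, 0.8]  # arbitrary start vector
    for _ in range(iters):
        w = [shift * v[r] - sum(c[r][s] * v[s] for s in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return v


def remove_ground(points, k=10, max_dot=0.9):
    """Keep only points whose normal is not close to the vertical (z) axis."""
    kept = []
    for p in points:
        neigh = sorted(points, key=lambda q: math.dist(p, q))[:k]  # includes p
        if abs(_normal(neigh)[2]) <= max_dot:  # abs: normal sign is arbitrary
            kept.append(p)
    return kept
```

In practice a spatial index would replace the per-point sort, which is quadratic in the number of points.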
3.2.2 Overlap evaluation
Another subsampling strategy is to evaluate only those voxels where the point clouds overlap. By taking
Equation 1 and replacing the denominator n with m, where m is the number of points from ℬ that fall in
the occupied voxels of 𝒜, we get a measure that is less sensitive to the amount of overlap.
Compared with the standard implementation, this method has an increased risk of classifying
badly aligned point cloud pairs as well-aligned, as long as a small portion of the point clouds is aligned.
3.2.3 Ground floor removal and overlap evaluation
Finally, we also evaluate a classifier that both removes the ground floor and evaluates only the overlapping
parts of the point clouds. The risk here is that excessive subsampling leaves too little data for
a reliable classification.
3.3 NDT Hessian classifier
The NDT Hessian is also suitable as a binary classifier. We have divided it into two classifiers:
1. The Hessian translation classifier and
2. The Hessian rotation classifier.
The reason is that there is a distinct difference in magnitude between the eigenvalues related
to the translation parameters and the eigenvalues related to the rotation parameters, so a
single threshold value would be severely biased towards either the translation or the rotation parameters. Both
classifiers, however, work according to the same principle: the alignment measure is the largest of the
three diagonal elements of the inverse Hessian matrix related to the translational and rotational parameters,
respectively.
The Hessian is calculated as

$$ H_{ij} = \frac{\partial^2 s}{\partial p_i \, \partial p_j}, \qquad (2) $$

where pᵢ and pⱼ are elements of the pose vector p in Equation 1 and s is the score function in Equation
1. The voxel size for the Hessian classifiers is the same as for the NDT score classifiers.
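The two Hessian measures can be sketched as below. For simplicity this Python illustration approximates Equation 2 by central finite differences of an arbitrary score function and inverts the 6×6 matrix by Gauss-Jordan elimination; the pose ordering (tx, ty, tz, roll, pitch, yaw), the step size `h` and all function names are assumptions, not the authors' implementation.

```python
def _shifted(f, p, i, j, di, dj):
    """Evaluate f at p with coordinate i shifted by di and coordinate j by dj."""
    q = list(p)
    q[i] += di
    q[j] += dj
    return f(q)


def numeric_hessian(f, p, h=1e-4):
    """Central finite-difference approximation of Equation 2 at pose p."""
    n = len(p)
    return [[(_shifted(f, p, i, j, h, h) - _shifted(f, p, i, j, h, -h)
              - _shifted(f, p, i, j, -h, h) + _shifted(f, p, i, j, -h, -h)) / (4 * h * h)
             for j in range(n)] for i in range(n)]


def _invert(m):
    """Gauss-Jordan inversion with partial pivoting."""
    n = len(m)
    a = [list(row) + [1.0 if r == c else 0.0 for c in range(n)] for r, row in enumerate(m)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        pv = a[col][col]
        a[col] = [x / pv for x in a[col]]
        for r in range(n):
            if r != col:
                factor = a[r][col]
                a[r] = [x - factor * y for x, y in zip(a[r], a[col])]
    return [row[n:] for row in a]


def hessian_measures(score_fn, pose):
    """Largest inverse-Hessian diagonal element for translation and for rotation.

    pose is assumed to be ordered (tx, ty, tz, roll, pitch, yaw).
    """
    cov = _invert(numeric_hessian(score_fn, pose))
    diag = [cov[k][k] for k in range(6)]
    return max(diag[:3]), max(diag[3:])
```

Returning the two values separately reflects the design choice above: each gets its own threshold, so the larger translational variances do not swamp the rotational ones.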
3.4 Plane extraction classifier
Furthermore, we have investigated a method that is a modification of the plane extraction method of
Chandran-Ramesh and Newman [7, 8]. Similarly to the work of Chandran-Ramesh and Newman, this
method approximates the point cloud as a set of plane patches. However, in our implementation, instead of
classifying the relations between patches within a single point cloud map, we perform pairwise evaluation
between two scans. Knowing which points belong to which scan makes the conditional random field used
by Chandran-Ramesh and Newman superfluous, and thus we focus on scoring plane patch alignment. The
approach can be split into two steps: plane extraction and scoring.
Plane extraction This step is a modification of the method of Weingarten et al. [24], which is also part of the pipeline of
Chandran-Ramesh and Newman [7, 8]. The underlying idea is to split plane fitting into sub-problems. Given
a 3D point cloud, we split it into a regular voxel grid, where each voxel contains the points within its boundaries.
Next, for each non-empty voxel we find the best-fitting plane using least squares. The last step is to
define patch vertices by finding the points where the plane intersects the voxel edges. In this
way we obtain a set 𝒫 of plane patches. Algorithm 1 shows how to cluster adjacent patches that belong to
the same surface (having the same label) into sets Π_{L_i} = {P ∈ 𝒫 | label(P) = L_i}.
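The clustering step produces connected components of adjacent, coplanar patches. The Python sketch below reaches that result with a union-find structure, a swapped-in technique rather than the explicit label replacement of Algorithm 1. Patches are assumed to be (unit normal, offset d) pairs for planes n·x = d, and the `coplanar` test with its tolerances is a hypothetical stand-in for the "same plane" check.

```python
def coplanar(p, q, min_cos=0.95, max_offset=0.1):
    """Hypothetical same-plane test for patches given as (unit normal, d)."""
    (n1, d1), (n2, d2) = p, q
    dot = sum(a * b for a, b in zip(n1, n2))
    sign = 1.0 if dot >= 0 else -1.0  # opposite normals describe the same plane
    return abs(dot) >= min_cos and abs(d1 - sign * d2) <= max_offset


def cluster_patches(patches, adjacency, same_plane):
    """Label patches so that adjacent, coplanar patches share a label.

    adjacency maps a patch index to the indices of its neighbouring patches.
    """
    parent = list(range(len(patches)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, neighbours in adjacency.items():
        for j in neighbours:
            if same_plane(patches[i], patches[j]):
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)  # keep the smaller label
    # Relabel the roots to consecutive integers 0, 1, 2, ...
    roots, labels = {}, []
    for i in range(len(patches)):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels
```

Union-find avoids the bookkeeping of replacing every occurrence of a higher label, while still keeping the smaller label when two clusters merge.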
The next step is to perform plane reconstruction for each patch cluster Π_{L_i}. Weingarten et al. [24] solve
this problem by merging patches into convex polygons. This approach causes windows, doors, etc. to
disappear. To prevent this problem, we have replaced this patch merging with an algorithm based
on α-shapes.

Algorithm 1: Patch clustering.
Data: List of existing patches P of size l
Result: Each patch has an assigned label
Variables: L_c - label counter
for l_i ← 1 to l do
    if label of P[l_i] is unassigned then
        increment L_c
        set the label of P[l_i] as L_c
    end
    foreach adjacent patch P[a_i] to P[l_i] do
        if P[a_i] is on the same plane as P[l_i] then
            check which of the patch labels is smaller
            replace the higher label by the lower one
            decrement L_c
        end
    end
end
α-shapes were first introduced by Edelsbrunner et al. [10]. The idea behind α-shapes is to build a
hull enclosing all the points in the data set. Edelsbrunner and Mücke [11] describe the intuition for this
procedure as follows. We start with a hull enclosing all the points with some large margin. In each step we
remove a chunk of the hull of radius α that does not enclose any of the points in the data set. In this way,
parts of the hull can be removed even if such a chunk is isolated from the previously removed ones.
First, for each Π_{L_i} we collect the points from the voxels corresponding to its patches and fit a
least-squares plane to them. We then project these points onto the obtained plane and compute their
2D α-shape, following Bernardini and Bajaj [1].
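The plane-fitting and projection step that precedes the 2D α-shape can be sketched as follows. As an assumption, the "least-squares plane" is taken to be the total-least-squares (PCA) plane through the centroid, computed with the same power-iteration trick as for normals; the α-shape itself is not shown. The function names are hypothetical.

```python
import math


def fit_plane(points, iters=200):
    """Total-least-squares plane through the points: returns (centroid, unit normal)."""
    n = len(points)
    mu = [sum(p[k] for p in points) / n for k in range(3)]
    c = [[sum((p[r] - mu[r]) * (p[s] - mu[s]) for p in points) for s in range(3)]
         for r in range(3)]
    shift = c[0][0] + c[1][1] + c[2][2]
    v = [0.3, 0.5, 0.8]  # arbitrary start vector for power iteration
    for _ in range(iters):
        w = [shift * v[r] - sum(c[r][s] * v[s] for s in range(3)) for r in range(3)]
        norm = math.sqrt(sum(x * x for x in w)) or 1.0
        v = [x / norm for x in w]
    return mu, v


def _cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]


def project_to_plane_2d(points, centroid, normal):
    """2D in-plane coordinates of the points (input for a 2D alpha-shape)."""
    # Any helper axis that is not parallel to the normal spans the plane with it.
    helper = [1.0, 0.0, 0.0] if abs(normal[0]) < 0.9 else [0.0, 1.0, 0.0]
    u = _cross(normal, helper)
    norm_u = math.sqrt(sum(x * x for x in u))
    u = [x / norm_u for x in u]
    v = _cross(normal, u)  # unit length, since normal and u are orthonormal
    out = []
    for p in points:
        d = [p[k] - centroid[k] for k in range(3)]
        out.append((sum(d[k] * u[k] for k in range(3)),
                    sum(d[k] * v[k] for k in range(3))))
    return out
```

Dropping the component along the normal is exactly the orthogonal projection onto the plane, so in-plane distances between points are preserved.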
Scoring alignment quality After we have extracted planar polygon segments in both point clouds we
measure the alignment and overlap of all segments with their nearest neighboring segment in the other point
cloud. The final score is computed based on three factors:
1. Distance - The difference between the distances from the origin to the centers of gravity of the two patches,
measured along each plane's normal: |nᵢ·cogᵢ − nⱼ·cogⱼ|. The distance is normalized by D_f, the largest
distance between the origin and the center of gravity of any patch in the considered point cloud.
2. Orientation - The absolute value of the dot product of the normals of a plane pair. Since the normals are
unit vectors, this value lies between 0 and 1, so this factor does not require normalization.
3. Size - The difference in size between two neighboring patches. This factor is normalized with respect
to the largest patch in the given data set, A_f. The impact of this factor requires fine tuning: if the
difference in patch size is addressed carelessly, it might skew the score. However,
we should keep in mind that we are facing a real environment where dynamic changes can occur.
In such circumstances we need to consider all the differences that will help to filter out information
coming from such distortions.
To compute the final score, we use the same method as Chandran-Ramesh and Newman [7, 8] and employ
the following equation:

$$ S = \frac{1}{N} \sum_{i=1}^{N} \frac{\left( (D^S_i)^2 + (A^S_i)^2 \right) \cdot \left( 1 + O^S_i \right)}{\left( D_f^2 + A_f^2 \right) \cdot 2}, \qquad (3) $$

where D^S_i, A^S_i and O^S_i are the distance, size and orientation scores of the i-th plane pair, N is the
number of plane pairs, and the normalization factors are D_f² and A_f². We normalize the score by the number of plane pairs to obtain a score between 0 and 1.
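Equation 3 can be evaluated as in the Python sketch below. It assumes the nearest-segment matching has already been done, and represents each segment by a hypothetical (unit normal, center of gravity, area) triple; the three factors follow the numbered list above.

```python
def plane_pair_score(pairs, d_f, a_f):
    """Equation 3 for matched plane segments.

    pairs: list of ((n_i, cog_i, area_i), (n_j, cog_j, area_j)) with unit normals.
    d_f, a_f: distance and size normalization factors.
    """
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    total = 0.0
    for (n_i, c_i, a_i), (n_j, c_j, a_j) in pairs:
        d_s = abs(dot(n_i, c_i) - dot(n_j, c_j))  # distance factor
        a_s = abs(a_i - a_j)                       # size factor
        o_s = abs(dot(n_i, n_j))                   # orientation factor
        total += (d_s ** 2 + a_s ** 2) * (1 + o_s) / ((d_f ** 2 + a_f ** 2) * 2)
    return total / len(pairs)
```

For example, two parallel patches whose offsets differ by 0.5 and whose areas differ by 1, with d_f = 1 and a_f = 2, give (0.25 + 1) · 2 / 10 = 0.25.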
3.5 Partitioned mean normals classifier
In our implementation of the partitioned mean normals classifier [17], the normals are computed using the
20 nearest neighbors within a specified radius. In our experiments, we have set the neighborhood radius
to 0.5 m, which is consistent with the general sample density and scale of structures in our data sets. The
voxel grid used by this classifier must be static for all point clouds so that corresponding voxels covering the
same space can be found between both point clouds in a point cloud pair. Point clouds that are well-aligned
should receive a small absolute value from this classifier.
3.6 Surface interpenetration measure classifier
Given two point clouds 𝒜 and ℬ, the set of interpenetrating points, C_{𝒜,ℬ}, is defined as all points x in 𝒜
whose local neighborhood includes at least one pair of points that are separated by the local tangent plane
computed at the closest neighbor of x in point cloud ℬ. The surface interpenetration measure, SIM, is
calculated as the fraction of interpenetrating points in 𝒜 according to Equation 4:

$$ \mathrm{SIM} = \frac{|C_{\mathcal{A},\mathcal{B}}|}{|\mathcal{A}|} \qquad (4) $$

Well-aligned point clouds should have a high rate of interpenetrating points.
We have included two variants of the surface interpenetration measure classifier:
1. SIM1: This classifier is implemented as described by Silva et al. [21], without modifications.
2. SIM2: The measure in this classifier is scaled by the number of points that have any neighbors
within the specified threshold, i.e. the part of the point clouds that overlap, thus making it less
sensitive to differences in the amount of overlap between different point cloud pairs.
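Both variants can be sketched in Python as below, under assumptions: normals of ℬ are precomputed and passed in, the neighborhood of x is all points of 𝒜 within a hypothetical `radius`, and "separated by the tangent plane" means the neighborhood has points with both negative and positive signed distance to that plane. Passing `overlap_dist` switches from the SIM1 denominator |𝒜| to an SIM2-style count of points with a neighbor in ℬ within that distance. Search is brute force.

```python
import math


def sim(cloud_a, cloud_b, normals_b, radius=0.3, overlap_dist=None):
    """Surface interpenetration measure (brute-force sketch).

    Returns |C| / |A| (SIM1), or |C| divided by the number of points of A with
    a neighbour in B within overlap_dist (SIM2-style) if overlap_dist is given.
    """
    count, overlapping = 0, 0
    for x in cloud_a:
        j = min(range(len(cloud_b)), key=lambda k: math.dist(x, cloud_b[k]))
        b, n = cloud_b[j], normals_b[j]
        if overlap_dist is not None and math.dist(x, b) <= overlap_dist:
            overlapping += 1
        # Signed distances of x's neighbourhood in A to B's tangent plane at b.
        side = [sum((q[k] - b[k]) * n[k] for k in range(3))
                for q in cloud_a if math.dist(x, q) <= radius]
        if min(side) < 0 < max(side):  # neighbourhood straddles the plane
            count += 1
    if overlap_dist is None:
        return count / len(cloud_a)
    return count / overlapping if overlapping else 0.0
```

A zig-zag of points crossing a flat surface yields SIM = 1, while the same points lifted entirely above it yield SIM = 0, reflecting the "splotchy" intuition described in Section 2.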
3.7 AdaBoost classifier
To achieve a stronger classification, we propose a combined classifier created using AdaBoost [12]. A complete
explanation of this classifier will be given in Section 4.
In total we evaluate 17 individual classifiers; these, together with the combined AdaBoost classifier, are listed in
Table 1 with their respective abbreviations.
4 Adaptive boosting of quality measures
Adaptive boosting, or AdaBoost [12], is a method for iteratively adding weak binary classifiers to build a
stronger “expert” classifier. Each classifier emits a binary opinion, which, in this work, is whether two point
clouds are aligned or not. A labeled training set (x₁, y₁), ..., (x_m, y_m), where yᵢ is the label associated with
feature xᵢ, is used as input to the algorithm. In each iteration we find the classifier h_t from the collection
of weak classifiers that improves the classification the most, that is, the one that maximizes the difference between
its weighted error rate ε_t and 0.5 (the error rate of a random classifier). A distribution
of weights, D_t, that indicates the importance of each feature is updated in each iteration. The weights
of correctly classified features are decreased and the weights of incorrectly classified features are
increased, so that the classifier focuses on the misclassified features. The procedure is repeated for T iterations,
thus adding T classifiers. The set of classifiers with their respective weights α_t constitutes the strong classifier,
whose decision is determined by the sign of the weighted sum of all classifiers. The
algorithm is described in Algorithm 2.

Name   Description
RMS1   Root mean square classifier with nearest neighbor threshold 4 meters.
RMS2   Root mean square classifier with nearest neighbor threshold 2 meters.
RMS3   Root mean square classifier with nearest neighbor threshold 0.5 meters.
RMS4   Root mean square classifier with nearest neighbor threshold 0.25 meters.
RMS5   Root mean square classifier with nearest neighbor threshold 0.15 meters.
RMS6   Root mean square classifier with nearest neighbor threshold 0.05 meters.
RMS7   Root mean square classifier with statistical nearest neighbor threshold.
NDT1   NDT Score classifier with standard implementation.
NDT2   NDT Score classifier with removed ground floor.
NDT3   NDT Score classifier with overlap evaluation.
NDT4   NDT Score classifier with both removed ground floor and overlap evaluation.
HEST   NDT Hessian translation parameters classifier.
HESR   NDT Hessian rotation parameters classifier.
PLEX   Plane extraction classifier.
NORM   Partitioned mean normals classifier.
SIM1   Surface interpenetration measure classifier with standard implementation.
SIM2   Surface interpenetration measure classifier with overlap scaling.
ADA    The AdaBoost classifier.
Table 1: A summary of all classifiers and their abbreviations used in the evaluations.

Algorithm 2: AdaBoost algorithm
Input: (x₁, y₁), ..., (x_m, y_m)
Output: A strong classifier H(x)
Initialize D₁(i) = 1/m, i = 1, ..., m
for t = 1, ..., T do
    1. find the weak classifier h_t that maximizes |0.5 − ε_t| with respect to D_t
    2. choose α_t = ½ ln((1 − ε_t) / ε_t)
    3. update D_{t+1}(i) = D_t(i) exp(−α_t yᵢ h_t(xᵢ)) / Z_t, where Z_t is the normalization factor
       Z_t = Σᵢ D_t(i) exp(−α_t yᵢ h_t(xᵢ))
end
return the final classifier H(x) = sign(Σ_{t=1}^{T} α_t h_t(x))
Threshold selection We use a straightforward method, adapted from Granström et al. [13], to train the
threshold for each classifier in each evaluation round. The method is described in Algorithm 3. In short, we
sort the classifier outputs on the training set in ascending order and find the threshold at which the weighted
number of incorrect classifications is minimized.
Algorithm 3: Parameter search algorithm
Input: Classifier results F_i for a classifier from m training samples, with corresponding labels y_i and
weights D_i, i = 1, ..., m
Output: The optimal parameter value θ_t
Sort F_i in ascending order
for i = 1, ..., m do
    1. Assume classifier outputs with F ≤ F_i to be plausible and classifier outputs with F > F_i to be suspicious.
    2. Compute the weighted error ε_i as the sum of weights for all labels mismatching the classification in step 1.
end
Find the optimal parameter value θ_t as the classifier output F_i associated with the lowest ε_i.
return θ_t
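Algorithms 2 and 3 together can be sketched in Python with threshold "stumps" as weak classifiers. This is an illustration under assumptions, not the authors' implementation: each sample is a vector of measure outputs (one entry per weak classifier), labels are +1 for aligned and −1 for misaligned, and outputs F ≤ t are classified as aligned. The stump search uses Algorithm 2's |0.5 − ε| criterion rather than the plain minimum of Algorithm 3, so measures where higher values indicate alignment (such as SIM) are handled through a negative α_t. All names are hypothetical.

```python
import math


def best_stump(values, labels, weights):
    """Threshold search for one measure (cf. Algorithm 3).

    Outputs F <= t are classified as aligned (+1), higher ones as misaligned (-1).
    Returns (t, weighted_error) for the t maximizing |0.5 - error|.
    """
    best = None
    for t in sorted(set(values)):
        err = sum(w for v, y, w in zip(values, labels, weights)
                  if (1 if v <= t else -1) != y)
        if best is None or abs(0.5 - err) > abs(0.5 - best[1]):
            best = (t, err)
    return best


def adaboost(samples, labels, rounds=10):
    """Algorithm 2 with threshold stumps; samples[i][f] = output of measure f on pair i."""
    m = len(samples)
    d = [1.0 / m] * m
    model = []
    for _ in range(rounds):
        best = None
        for f in range(len(samples[0])):
            t, err = best_stump([s[f] for s in samples], labels, d)
            if best is None or abs(0.5 - err) > abs(0.5 - best[2]):
                best = (f, t, err)
        f, t, err = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) for perfect stumps
        alpha = 0.5 * math.log((1 - err) / err)  # negative if err > 0.5 (flips stump)
        h = [1 if s[f] <= t else -1 for s in samples]
        d = [di * math.exp(-alpha * y * hi) for di, y, hi in zip(d, labels, h)]
        z = sum(d)
        d = [di / z for di in d]
        model.append((alpha, f, t))
    return model


def predict(model, x):
    """Sign of the weighted vote; a zero sum is mapped to +1 (aligned)."""
    score = sum(a * (1 if x[f] <= t else -1) for a, f, t in model)
    return 1 if score >= 0 else -1
```

For instance, with two measures where only the second separates aligned from misaligned pairs, the first round selects that measure, and `predict` then classifies new pairs by its learned threshold.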
5 Experiments
In this section we will present the procedures and the results of the evaluations.
5.1 Environments
The applications of point clouds as spatial representations range from describing very small objects to vast
landscapes. The selection of environments for this work, however, is made from the outdoor mobile robotic
perspective. We have selected two environments: The first environment, the Kjula set, is situated at an
asphalt manufacturing plant featuring a slightly hilly landscape with an abundance of gravel piles and some
structures. The second environment, the Hannover set, is from a campus with buildings, streets and other
regular geometric features (i.e. objects with flat surfaces and/or well-defined edges). The environments have
been chosen in order to validate how well the methods apply in very different environments while still being
relevant in a field robotics context. The sensor setups also differ: the Kjula data were acquired
with a 180 by 40 degree field of view at high resolution, while the Hannover data have a spherical field
of view with lower resolution. The Hannover set features many flat surfaces and rectangular shapes,
while the Kjula environment features convex and concave structures with rugged surfaces. We have assumed
that the environments are static (there are some moving persons in one of the sets, but the vast
majority of the environment is static).
5.1.1 Kjula
The Kjula data, shown in Figure 3, were acquired at an asphalt manufacturing plant near the village of
Kjula, just outside the city of Eskilstuna in Sweden. The asphalt plant is a combined gravel pit and production
unit for asphalt. The work at the plant is performed by construction equipment such as wheel loaders, dumpers
and excavators. The scans were acquired with a mid-size Volvo wheel loader equipped with an actuated
SICK LMS291 laser scanner. Figure 4 shows the wheel loader in the Kjula asphalt plant.
The horizontal resolution of the scanner was set to 1 degree and the vertical field of view was 40 degrees
to +10 degrees with 0 degrees being the horizontal line. The range of the scanner setup in the asphalt plant
environment was approximately 30–50 meters, depending on the reflectivity of the surface. Each point cloud was acquired with the vehicle standing still during the entire scanner sweep and contains between 60 000 and 120 000 readings.
We have used 65 scans from the Kjula plant to generate 132 point cloud pairs, where 64 are well aligned
and 64 are not well aligned. These point cloud pairs have been used to generate a number of data sets, all
containing the same 132 point cloud pairs, but each with a different set of error offsets.
Figure 3: Overhead view of the Kjula data set. The set covers an area of approximately 150 by 150 meters.
Figure 4: The Volvo L120F wheel loader at the asphalt plant in Kjula. The laser scanner (see inset) is
mounted on top of the cabin underneath the gray protection shield.
Figure 5: Overhead view of the Hannover data set [15]. The data set covers an area of approximately 180
by 200 meters.
5.1.2 Hannover
The Hannover set is a publicly available set of point clouds¹ acquired at the University of Hannover (see Figure 5). It was acquired with a moving platform equipped with odometry sensors and two continuously rotating SICK scanners mounted back to back. Each point cloud covers one half revolution of the scanners (i.e., 360 degrees of readings), and the points are compensated for the platform's movement using the odometry. The platform moves slowly enough, relative to the accuracy of the odometry, that its motion does not cause any significant errors. These point clouds are much sparser than the Kjula set, with only 4 000 to 10 000 points in each point cloud. We have used 801 scans to create 1600 point cloud pairs, where 800 are well aligned and 800 are not well aligned, in the same manner as for the Kjula set.
5.1.3 Ground truth datasets
One crucial aspect of evaluating point cloud alignment is how the aligned ground truth data is created. We have used registration methods to align the point clouds to each other and afterwards visually verified the results. There are, of course, more sophisticated methods of acquiring ground truth data, such as GPS or camera-based motion capture (e.g., Vicon). However, Vicon is not feasible in large environments, since the whole volume must be in full view of the cameras. GPS might have been beneficial for the Kjula set but would probably be challenging to use in the Hannover set, due to multiple reflections (multipath) when driving close to walls.
¹ http://kos.informatik.uni-osnabrueck.de/3Dscans/
When verifying the registrations, there can be variations in alignment that are not obvious upon visual inspection. Therefore, we have compared two common registration methods, NDT registration and ICP registration, since some of the classifiers explicitly use the objective function of NDT and some use the objective function of ICP (the RMS error). Using one specific registration method to produce the ground truth data sets might introduce a bias towards certain classifiers.
In order to quantify this bias, we have created two ground truth datasets for each of the Hannover and Kjula point cloud sets: one using NDT registration and one using ICP registration. We have made evaluations with both to see how the differences affect the performance of the classifiers. The failure rate of the registration methods (identified by visual inspection) was approximately 20% for the Kjula point cloud pairs and 1.4% for the Hannover pairs. In those cases we have used the registration of the non-failing method as a replacement. Fortunately, the two methods never failed on the same point cloud pair. The environment in Kjula is much less structured than the Hannover campus, which explains the higher number of failed registrations.
We have calculated the magnitude of the difference between the transformations resulting from the two registration algorithms. The box plots in Figures 6, 7 and 8 show the differences. For most of the point cloud pairs the resulting transformations differ by less than 0.05 m in translation, and for all pairs the rotations differ by less than 0.032 radians. However, 25% of the point cloud pairs in the Hannover set differ by 5–30 cm, which could introduce a bias towards some classifiers; our evaluations in Section 5.3.2 show that this potential bias does not change the outcome of the evaluations. Errors of up to 30 cm would be clearly visible in most cases, but the ones encountered here occur in areas where the correct alignment is not obvious, for example in corridor-like areas.
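The comparison above can be sketched as follows, assuming that each registration result is available as a 4×4 homogeneous transform and that NumPy is used. The helper names are hypothetical, and reporting the relative rotation as absolute roll/pitch/yaw (X/Y/Z) angles from a ZYX decomposition is our assumption; the article does not state exactly how the per-axis differences in Figures 7 and 8 were computed.

```python
import numpy as np

def transform_difference(T_ndt, T_icp):
    """Compare the NDT and ICP results (4x4 transforms) for one point cloud pair.

    Returns the Euclidean translation difference (m) and the absolute
    roll/pitch/yaw angles (rad) of the relative rotation (assumed convention).
    """
    translation_diff = np.linalg.norm(T_ndt[:3, 3] - T_icp[:3, 3])
    # Relative rotation between the two results: R_icp^T * R_ndt.
    R = (np.linalg.inv(T_icp) @ T_ndt)[:3, :3]
    # ZYX Euler decomposition (valid while pitch is far from +/-90 degrees).
    roll = np.arctan2(R[2, 1], R[2, 2])
    pitch = np.arcsin(-np.clip(R[2, 0], -1.0, 1.0))
    yaw = np.arctan2(R[1, 0], R[0, 0])
    return translation_diff, np.abs(np.array([roll, pitch, yaw]))

def yaw_transform(x, y, yaw):
    """Hypothetical helper: planar pose (x, y, yaw) as a 4x4 transform."""
    c, s = np.cos(yaw), np.sin(yaw)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    T[:2, 3] = [x, y]
    return T

# Two slightly different registration results for the same pair.
t_diff, rot_diff = transform_difference(yaw_transform(1.0, 2.0, 0.010),
                                        yaw_transform(1.03, 2.04, 0.004))
```

Collecting `t_diff` over all pairs and taking `np.percentile(diffs, 75)` would give the upper quartile that Section 5.1.4 uses when sizing the small errors.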
Figure 6: The box plots show the distribution of the translation differences (in meters), comparing the transformations given by NDT and ICP registration on the Hannover and Kjula data sets, respectively.
5.1.4 Induced errors
To generate datasets with both correctly aligned and misaligned point cloud pairs, which are needed for training the classifiers, we have deliberately added errors. To cover a wide set of scenarios, we have created datasets with several error types and magnitudes.
Figure 7: The box plots show the distribution of the rotation differences (in radians) about the X, Y and Z axes, comparing the transformations given by NDT and ICP registration on the Hannover data set.
The added errors consist of two components, a translational part and a rotational part. One of the point clouds in each pair has been translated and rotated to put it out of alignment with the other point cloud.
Further, we have defined three different error magnitudes with the aim of creating three distinct levels of difficulty. The translational part is defined as a fixed distance in a random direction, meaning that the magnitude of the error is always the same but the direction in which it is applied is random. The rotational part is defined as a rotation of fixed magnitude in a random direction (clockwise or counter-clockwise). Each point cloud with an added error has first been translated and then rotated.
Small errors: These errors are approximately twice the magnitude of the upper quartile of the difference we measured between ICP and NDT registration in Section 5.1.3. This magnitude is selected to be (just) distinguishable from the noise occurring in point cloud registration methods. The error is only applied in three degrees of freedom, to simulate how odometry errors often behave. The small translational error is 0.1 meters in the X-Y plane, and the small rotational error is 0.01 radians (0.57 degrees) around the vertical axis. These errors are the most difficult to detect. Figure 9 shows a point cloud with a small error.
Medium errors: These errors are chosen to be less of a challenge for the classifiers. They are also limited to three degrees of freedom, but with a larger magnitude. The medium translational error is 0.3 meters in the X-Y plane, and the medium rotational error is 0.03 radians (1.72 degrees) around the vertical axis.
Large errors: These errors are chosen to be easy to detect for any reasonably good evaluation method. In contrast to the small and medium errors, they are applied in six degrees of freedom, and the magnitude is even larger. The large translational error is 0.5 meters, and the large rotational error is 0.05 radians (2.86 degrees).
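A minimal sketch of the error-induction procedure described above, assuming NumPy and point clouds stored as (N, 3) arrays. The function names are hypothetical. The rotation is assumed to be applied about the origin of the point cloud's coordinate frame, and for the six-degree-of-freedom large errors we assume a uniformly random translation direction and rotation axis; the article does not specify these details.

```python
import numpy as np

# (translation in m, rotation in rad, applied in 6 DOF?) per error level.
ERRORS = {
    "small": (0.10, 0.01, False),
    "medium": (0.30, 0.03, False),
    "large": (0.50, 0.05, True),
}

def random_unit_vector(rng, dims):
    """Uniformly random direction in `dims` dimensions."""
    v = rng.normal(size=dims)
    return v / np.linalg.norm(v)

def rotation_matrix(axis, angle):
    """Rodrigues' formula: rotation by `angle` (rad) about unit vector `axis`."""
    x, y, z = axis
    k = np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])
    return np.eye(3) + np.sin(angle) * k + (1.0 - np.cos(angle)) * (k @ k)

def induce_error(points, level, rng):
    """Return a misaligned copy of `points` and the applied rotation/translation."""
    t_mag, r_mag, six_dof = ERRORS[level]
    if six_dof:
        t = t_mag * random_unit_vector(rng, 3)      # random direction in 3D
        axis = random_unit_vector(rng, 3)           # random axis (assumption)
    else:
        t = np.zeros(3)
        t[:2] = t_mag * random_unit_vector(rng, 2)  # random direction in X-Y
        axis = np.array([0.0, 0.0, 1.0])            # vertical axis
    angle = r_mag * rng.choice([-1.0, 1.0])         # clockwise or counter-clockwise
    R = rotation_matrix(axis, angle)
    # Translate first, then rotate (about the frame origin, an assumption).
    moved = (points + t) @ R.T
    return moved, R, t

rng = np.random.default_rng(seed=1)
cloud = rng.uniform(-20.0, 20.0, size=(1000, 3))  # stand-in for a real scan
misaligned, R, t = induce_error(cloud, "small", rng)
```

Because the misaligned copy is derived from a verified pair, its label ("not aligned") is known by construction.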
Because the datasets with errors are created from manually verified datasets, we also obtain a correct labeling of all point cloud pairs, i.e., whether each pair is aligned or not. A correct labeling is a requirement for both training and evaluation of the classifiers.
Figure 8: The box plots show the distribution of the rotation differences (in radians) about the X, Y and Z axes, comparing the transformations given by NDT and ICP registration on the Kjula data set.
5.2 Evaluation methods
We denote a point cloud pair that is classified as aligned as a positive result of the classifier, and a point
cloud pair that is classified as unaligned as a negative result. A true positive (tp) is a point cloud pair that
is detected as aligned, and is indeed successfully aligned, according to our manually labeled ground truth. A
false positive (fp) is a point cloud pair that is detected as aligned, but that is, in fact, not properly aligned.
True and false negatives (tn, fn) are defined symmetrically.
The primary evaluation metric used in this article is accuracy, formalized as follows:
$$A = \frac{tp + tn}{tp + fp + tn + fn} \qquad (5)$$
The accuracy measures the overall quality of a binary classifier, taking both true positives and true negatives into account. The accuracy is also the measure that AdaBoost seeks to maximize. The accuracy ranges from 0 to 1, where 0 means that all classifications are wrong and 1 means that all are correct. Since our data sets contain equally many aligned and misaligned pairs, random classification should yield an accuracy of about 0.5.
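Equation (5) amounts to the counting below, shown as a minimal NumPy sketch; the function names are illustrative and not taken from the article's implementation. The toy labels mirror the balanced Hannover set (800 aligned and 800 misaligned pairs), which is why a coin-flip classifier ends up near 0.5.

```python
import numpy as np

def confusion_counts(predicted_aligned, truly_aligned):
    """Count tp, fp, tn, fn, where 'positive' means 'classified as aligned'."""
    p = np.asarray(predicted_aligned, dtype=bool)
    y = np.asarray(truly_aligned, dtype=bool)
    tp = int(np.sum(p & y))
    fp = int(np.sum(p & ~y))
    tn = int(np.sum(~p & ~y))
    fn = int(np.sum(~p & y))
    return tp, fp, tn, fn

def accuracy(predicted_aligned, truly_aligned):
    """Equation (5): fraction of correctly classified point cloud pairs."""
    tp, fp, tn, fn = confusion_counts(predicted_aligned, truly_aligned)
    return (tp + tn) / (tp + fp + tn + fn)

# Balanced toy labels: half aligned, half misaligned.
labels = np.array([True] * 800 + [False] * 800)
rng = np.random.default_rng(0)
coin_flip = rng.random(labels.size) < 0.5  # a random classifier
perfect = labels.copy()                    # an ideal classifier
```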
We also investigate the sensitivity of each classifier to threshold selection. We use receiver operating characteristic (ROC) plots to show how the performance of each classifier is affected by changes to the threshold. For all single classifiers, the thresholds have been chosen as 100 linearly distributed steps between the maximum and minimum output of the classifier in the associated evaluation round. The boosted classifier is a special case, where the returned value is a mixture of several classifiers that varies between datasets. To produce the ROC plot for the boosted classifier, we use Equation 6, where $\alpha_t$ and $h_t$ are the classifier weight and the corresponding classifier, respectively, as introduced in Section 4, to normalize the output, and then vary the threshold between −1 and 1 in increments of 0.02, thus creating 100 steps.
$$\frac{\sum_{t=1}^{T} \alpha_t h_t(x)}{\sum_{t=1}^{T} \alpha_t} \qquad (6)$$
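The threshold sweep behind the ROC plots can be sketched as follows (NumPy; the function names are hypothetical). We assume, as in standard discrete AdaBoost, that each weak classifier outputs −1 or +1, so the normalized score of Equation (6) lies in [−1, 1]. The `higher_is_aligned` flag is also an assumption, since for some classifiers (e.g., an RMS error) a lower score indicates better alignment.

```python
import numpy as np

def roc_points(scores, truly_aligned, thresholds, higher_is_aligned=True):
    """Return an array of (false positive rate, true positive rate) per threshold."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(truly_aligned, dtype=bool)
    points = []
    for thr in thresholds:
        # A pair is classified as aligned (positive) on the "good" side of thr.
        pred = scores >= thr if higher_is_aligned else scores <= thr
        tpr = np.sum(pred & y) / np.sum(y)
        fpr = np.sum(pred & ~y) / np.sum(~y)
        points.append((fpr, tpr))
    return np.array(points)

def single_classifier_thresholds(scores, steps=100):
    """Linearly spaced thresholds between the min and max classifier output."""
    return np.linspace(np.min(scores), np.max(scores), steps)

def boosted_output(weak_outputs, alphas):
    """Equation (6): alpha-weighted vote of the weak classifiers, normalized.

    weak_outputs has shape (T, N), with h_t(x) in {-1, +1} for each of N pairs.
    """
    alphas = np.asarray(alphas, dtype=float)
    return alphas @ np.asarray(weak_outputs, dtype=float) / alphas.sum()

# Boosted classifier: thresholds from -1 to 1 in increments of 0.02.
boosted_thresholds = np.linspace(-1.0, 1.0, 101)

# Toy example: two weak classifiers voting on four point cloud pairs.
labels = np.array([True, True, False, False])
weak = np.array([[1, 1, -1, 1],
                 [1, -1, -1, -1]])
h = boosted_output(weak, alphas=[0.75, 0.25])
roc = roc_points(h, labels, boosted_thresholds)
```

Dividing by the sum of the weights makes the boosted output comparable across datasets, even though the selected weak classifiers and their weights differ.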
Figure 9: Two point clouds, one in red and one in green, from the Hannover data set that are misaligned with a small 10 cm error. The two vertical lines inside the marked square both belong to the same wall but are clearly not aligned with each other.
Error name Error description
small Small translational and rotational errors
medium Medium translational and rotational errors
large Large translational and rotational errors
varying Evenly distributed small, medium and large combined errors
Table 2: The error types used in the evaluations.
5.3 Evaluations
5.3.1 Classifier performances
We have evaluated the performance of the classifiers using k-fold cross-validation.
A requirement for all weak classifiers used in AdaBoost is that they perform better than random. Therefore, we exclude from the boosting all classifiers whose accuracy lies inside the 95% confidence interval of random accuracy. The outcome of a classifier is considered to be a random variable following a binomial distribution with $n$ samples, where $n$ is the number of point cloud pairs. Using the normal approximation, the confidence interval is given by
$$I_p = p_{obs} \pm 1.96 \sqrt{\frac{p_{obs}(1 - p_{obs})}{n}} \qquad (7)$$
With the observed probability $p_{obs} = 0.5$, the 95% confidence interval $I_p$ is 0.4755 to 0.5245 for the Hannover set and 0.4147 to 0.5853 for the Kjula set. Weak classifiers whose accuracy falls inside these intervals are disqualified from use in the AdaBoost classifier, so that no random classifiers are included.
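Equation (7) and the resulting filter can be checked with a few lines of Python; the helper names and the toy accuracies are hypothetical. With $n = 1600$ (Hannover) and $n = 132$ (Kjula), the computed intervals reproduce the numbers given above.

```python
import math

def random_accuracy_interval(n, p_obs=0.5, z=1.96):
    """Equation (7): normal-approximation 95% interval around p_obs."""
    half_width = z * math.sqrt(p_obs * (1.0 - p_obs) / n)
    return p_obs - half_width, p_obs + half_width

def usable_weak_classifiers(accuracies, n):
    """Keep only classifiers whose accuracy lies outside the random interval."""
    low, high = random_accuracy_interval(n)
    return [name for name, acc in accuracies.items() if not (low <= acc <= high)]

hannover = random_accuracy_interval(1600)  # approx. (0.4755, 0.5245)
kjula = random_accuracy_interval(132)      # approx. (0.4147, 0.5853)
kept = usable_weak_classifiers({"clf_a": 0.93, "clf_b": 0.51}, n=1600)
```

The interval is much wider for the Kjula set because the half-width shrinks only with $1/\sqrt{n}$.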
To evaluate the performance of the classifiers, we ran cross-validation for all classifiers on the ten datasets (five error distributions for each of the two environments). The error types are described in Table 2. On the Hannover sets, consisting of 1600 point cloud pairs, we made an 8-fold cross-validation, i.e., training on 1400 pairs and evaluating on 200 pairs, 8 times. On the Kjula sets, consisting of 132 point cloud pairs, we made a 3-fold cross-validation, i.e., training on 88 pairs and evaluating on 44 pairs, 3 times. For single classifier evaluations, the training phase is used to find the classifier threshold as described in
Section 4.
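The cross-validation protocol can be sketched as follows (NumPy; helper names hypothetical). We assume the pairs are shuffled randomly before being split into folds. Since the threshold-training rule of Section 4 is not reproduced here, `train_threshold` assumes a simple exhaustive search for the threshold that maximizes training accuracy; `higher_is_aligned` selects whether high or low scores indicate alignment (for example, a lower RMS error indicates a better fit).

```python
import numpy as np

def k_fold_indices(n, k, rng):
    """Shuffle 0..n-1, split into k folds and yield (train, test) index arrays."""
    folds = np.array_split(rng.permutation(n), k)
    for i in range(k):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, folds[i]

def train_threshold(scores, truly_aligned, higher_is_aligned=True):
    """Pick the threshold with the highest training accuracy (assumed rule)."""
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(truly_aligned, dtype=bool)
    best_thr, best_acc = None, -1.0
    for thr in np.unique(scores):  # every observed score is a candidate
        pred = scores >= thr if higher_is_aligned else scores <= thr
        acc = np.mean(pred == y)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc

rng = np.random.default_rng(0)
hannover_sizes = [(len(tr), len(te)) for tr, te in k_fold_indices(1600, 8, rng)]
kjula_sizes = [(len(tr), len(te)) for tr, te in k_fold_indices(132, 3, rng)]
```

The fold sizes match the 1400/200 and 88/44 splits described above.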
The performance of a classifier is determined by its accuracy. An accuracy of 1 means that the classifier can
properly classify all point cloud pairs as aligned or not aligned. An accuracy of 0.5 means that the outcome
of the classifier is random. We expect to see the best performance on the datasets with large errors and then
declining performance as the errors become smaller.
Hannover results Figures 10 and 11 show the classifiers' accuracy on the Hannover datasets for all error types.
In Figure 10 we see that most classifiers obtain more than 90% accuracy for large errors, which can be considered a good result. The NORM classifier performs well above random for large errors but fails to classify the point cloud pairs with medium and small errors, as indicated by an accuracy of 50%, which implies a random classification result. For smaller error magnitudes, the difference between the mean normals is smaller and more sensitive to noise and to irregularities in the point clouds, caused for example by rough surfaces, which could explain the poor performance of the NORM classifier on small error magnitudes. The PLEX classifier fails to classify the point cloud pairs completely, which could imply that, in the implementation we have used, it is unsuitable for this type of problem. The NDT3 and NDT4 classifiers, together with the ADA classifier, have the highest accuracy for all error magnitudes.
Figure 10: Classifier accuracy on the Hannover data set where the erroneous point cloud alignments consist of small, medium and large errors, respectively.
Figure 11: Classifier accuracy on the Hannover data set where the erroneous point cloud alignments consist of varying errors.
Figure 12: Classifier accuracy on the Kjula data set where the erroneous point cloud alignments consist of small, medium and large errors.
It is interesting to note how differently the RMS classifiers perform: the best RMS classifier for large errors is not the best RMS classifier when the error magnitude changes. This indicates that it is important to select a suitable threshold for RMS classifiers.
If we look at the dataset with errors of all three magnitudes, shown in Figure 11, we observe that each classifier's performance is roughly the mean of the accuracies it obtained over the three error magnitudes in Figure 10. This result suggests that the thresholds acquired during the training phase of the classification process are relatively stable for all classifiers, even though the error magnitudes vary.
Kjula results Figures 12 and 13 show the classifiers’ accuracy on the Kjula dataset for all error types.
The results from the Kjula set in Figure 12 are more spread out than the corresponding results for the Hannover set, indicating the higher difficulty of the less structured Kjula environment. Some classifiers still achieve above 80% accuracy even on the small error magnitudes, but more classifiers perform close to or at random than in the Hannover set.
Worth noting among the large error evaluations is that the boosted classifier (ADA) performed slightly worse than the NDT3 classifier. Our analysis suggests that this is due to overfitting. Although AdaBoost is rather robust against overfitting, the Kjula data sets contain only 64 positive and 64 negative point cloud pairs. We have observed that the NDT3 and NDT4 classifiers perform equally well on the training data, but NDT3 performs better on the evaluation data. As a consequence, the boosted classifier uses a suboptimal combination of the “weak” classifiers in this case. The effect of this on the AdaBoost classifier can be seen in both Figure 12 and Figure 13, where it performs worse than NDT3 but better than or equal to NDT4. This result emphasizes the risks of small training sets and also shows that AdaBoost is not immune to overfitting.
We can also observe the same trend among the RMS classifiers as in the Hannover evaluations. The best RMS classifier for large errors is not the best one for small errors, which further shows the importance of a suitable threshold for RMS classifiers and that the RMS classification method is sensitive to varying error magnitudes.
We further notice the same behavior as in the Hannover evaluations when evaluating on data with varying error magnitudes. The result for each classifier is close to the mean value of the accuracy for small, medium and large errors.
Figure 13: Classifier accuracy on the Kjula data set where the erroneous point cloud alignments consist of varying errors.
Cross-dataset results In this section we evaluate classifiers that have been trained on data sets from one of the two environments and evaluated on the other. These evaluations are performed to investigate the classifiers' sensitivity to the similarity between the training data and the evaluation data.
Figures 14 and 15 show the accuracy for small, medium and large errors. In comparison to the evaluations with training and evaluation data from the same environment, we see that many of the classifiers produce random or close to random results. Some of the RMS classifiers perform well above random on large errors, but none perform better than random on small errors. The NDT classifiers (NDT1–4) all perform better than random. NDT3, which uses only overlapping points and includes the ground plane, is the best classifier on large and medium errors, and NDT4 is the best classifier on small errors. These results suggest that the NDT-based classifiers are the least sensitive classifiers when the environment changes, and that the other classifiers are much more sensitive to differences between the training data and the evaluation data.
These results show that the NDT-based classifiers generalise rather well between two data sets that have different geometric characteristics and different point cloud resolutions. However, the results might be different for denser point clouds (e.g., from land survey equipment such as the Faro Focus) or ones with a restricted field of view (e.g., from RGB-D sensors).
5.3.2 ICP vs NDT evaluations
In Section 5.1.3 we discussed the creation of ground truth data sets and the possible problems with using a
certain registration algorithm to create such data sets. In this section we compare evaluations made where
the ground truth data sets are created using either NDT registration or ICP registration.
Figure 16 shows the difference in accuracy for the best performing NDT and RMS classifiers when evaluated on the Hannover set with ground truth data created using NDT registration and ICP registration. The upper left diagram, NDT3 vs NDT3ICP, shows the accuracy for the NDT3 classifier on the Hannover set with NDT-based ground truth (blue/dark) and with ICP-based ground truth (red/light). As expected, the accuracy is highest for the evaluations made with NDT-based ground truth. The upper right diagram shows, in the same way, the accuracy for the RMS6 classifier on the same Hannover sets with NDT-based and ICP-based ground truth. In this case the accuracy is better on the dataset with ICP-based ground truth. The differences shown in the two upper diagrams demonstrate that the method used to create the ground truth data does have some effect on the outcome of the evaluations. However, in our experiments this effect is not large enough to alter the outcome of the evaluations. This is shown in the lower two diagrams in Figure 16, where the two classifiers NDT3 (blue/dark) and RMS6 (red/light) are compared to each other with NDT-based ground truth to the left and ICP-based ground truth to the right. The diagrams are nearly identical, since the differences in accuracy between the classifiers are much larger than the effect of the method used to create the ground truth. It might, however, be important to take this effect into account in a scenario where the difference in accuracy between classifiers is very small.
Figure 14: Accuracy of classifiers trained on data sets from the Kjula environment and evaluated on data sets from the Hannover environment.
5.3.3 Classifier threshold sensitivity
The threshold sensitivity evaluation is performed by inspecting ROC plots of the classifiers' performance when varying the threshold as described in Section 5.2. A low sensitivity to threshold selection is preferable, as it increases the robustness of the classifier. A low sensitivity is displayed as a high concentration of samples (dots) in the upper left corner of the ROC plot, where the true positive rate is high and the false positive rate is low.
Hannover dataset In Figure 17 we show example ROC plots for the classifiers that showed the best performance on the Hannover dataset. Each plot combines the ROC curves for the three difficulties. The best performance in terms of low sensitivity is achieved by the AdaBoost classifier (bottom right), where a large concentration of samples is found in the upper left corner of the diagram, suggesting that a wide range of thresholds will result in high performance. The RMS7 classifier (bottom left), on the other hand, shows only a few samples along the curves, and the highest concentration is found at the endpoints (close to (0,0) and (1,1)). This suggests that the range of acceptable thresholds is small. The other two classifiers, NDT3 (top left) and RMS6 (top right), show an even spread of samples along the curves, indicating that their sensitivity lies between that of the RMS7 and AdaBoost classifiers.
Figure 15: Accuracy of classifiers trained on data sets from the Hannover environment and evaluated on data sets from the Kjula environment.
Kjula dataset In Figure 18 we see examples of ROC plots for the Kjula set. The plots contain fewer samples because the data set contains fewer point clouds. Both classifiers (NDT3 to the left and RMS3 to the right) show an even distribution of samples along the curves, suggesting that the classifiers have a similar threshold sensitivity, which was also the case for the NDT3 and RMS6 classifiers in Figure 17.
5.4 Conclusions
We can conclude from the data that NDT with overlap evaluation (and in some cases removal of the ground plane), i.e., NDT3 and NDT4, is better than any of the other single classifiers (including several variants of the commonly used RMS measure as well as other classifiers proposed in the literature) for classifying data from both structured and unstructured outdoor environments.
In particular, the NDT score classifiers have been shown to be substantially more robust to changing environments than methods using RMS classifiers or SIM classifiers. Even when trained on a small sample from an unstructured outdoor work site and evaluated in an urban campus environment, the NDT classifiers alone achieve over 80% accuracy for the most difficult set of scan pairs. The NORM classifier seems to be more suitable for flat surfaces without irregularities and noise, since its performance dropped significantly when the error magnitudes were smaller.
An AdaBoost classifier built by combining all these classifiers performs on par with the best single classifiers
(i.e., the ones using NDT) suggesting that the classifiers do not complement each other in such a way that
improved performance can be achieved. However, the ROC-plot evaluations suggest that AdaBoost might
be less sensitive to threshold selection than any single classifier.
Figure 16: Diagrams showing the effects on classifier accuracy of using ground truth created with different methods. The top diagrams compare the change in accuracy for the classifiers NDT3 and RMS6 when evaluating on data with NDT-based ground truth (blue/dark) and ICP-based ground truth (red/light). The bottom diagrams show the effect of the method used to create the ground truth when the classifiers are compared to each other, with NDT-based ground truth to the left and ICP-based ground truth to the right.
6 Summary and future work
The present work is, to the best of our knowledge, the first comparative evaluation of geometric consistency
methods for classifying aligned and nonaligned point cloud pairs. Such classification methods can be used,
e.g., for detecting point cloud registration failures in applications of model reconstruction or localization
and mapping. In particular, we have evaluated several variants of common classifiers in two environments
pertinent to mobile robots: an urban campus environment and an unstructured work site.
We hope that the results presented here can also be used as a baseline for further research on automatically classifying point cloud alignment. In particular, we have identified two outstanding issues that deserve further attention.
The first is to investigate existing classifiers more deeply, in order to better isolate the specific cases in which they perform better or worse, to learn and evaluate optimal parameter selections for different environment types, and to investigate aspects not considered in this article, such as the computation time of the different classifiers. Depending on the application, it might be an advantage to sacrifice some accuracy in favor of faster computation. In this work we have tried to make a broad survey covering many methods. In future work, we hope that specific classifiers will be investigated in depth to provide a good foundation for our second suggested direction.
The second direction is to perform further evaluations on other datasets, common in other applications, and to further investigate the impact of variations in the point clouds on the classifiers in order to evaluate classifier robustness. One such variation could be different point cloud densities: the results achieved on sparse point clouds, such as the ones evaluated in this article, might not be reproducible on dense point clouds, i.e., point clouds consisting of millions of points. A special category of errors that we have left out of this article is the output of scan registration failures (as opposed to induced error offsets with fixed magnitudes), which occur when a registration method converges to a local optimum instead of the best relative pose. Creating and evaluating such datasets would also be a useful contribution.
In this work we used AdaBoost to create a strong classifier. The results are promising, but further research is needed in this area, and other methods might also prove to be good candidates. We encourage a deeper investigation into combinations of several classifiers, both using AdaBoost and with other methods.
Figure 17: ROC plots of small, medium and large errors from the Hannover dataset for the NDT3, RMS6, RMS7 and AdaBoost classifiers. The concentration of points along the curves shows the sensitivity to threshold selection for each classifier. A high concentration of points close to the point (0,1) in each plot shows that the classifier is not sensitive to threshold selection, while a sparse concentration of points at (0,1) shows that the classifier is sensitive to threshold selection.
Figure 18: Sample ROC-plots of small, medium and large errors from the Kjula dataset. Both classifiers
show an even spread of samples along the curves.
References
[1] Bernardini, F. and Bajaj, C. L. (1997). Sampling and reconstructing manifolds using alpha-shapes.
[2] Besl, P. J. and McKay, N. D. (1992). A method for registration of 3-D shapes. IEEE Transactions on Pattern Analysis
and Machine Intelligence, 14(2):239 – 256.
[3] Biber, P., Fleck, S., and Straßer, W. (2004). A probabilistic framework for robust and accurate matching of point clouds.
In 26th Pattern Recognition Symposium (DAGM 04).
[4] Biber, P. and Straßer, W. (2003). The normal distributions transform: A new approach to laser scan matching. In
Proceedings of the IEEE International Conference on Intelligent Robots and Systems (IROS), pages 2743–2748, Las Vegas,
USA.
[5] Birk, A. (2010). A quantitative assessment of structural errors in grid maps. Autonomous Robots, 28(2):187–196.
[6] Campbell, D. and Whitty, M. (2014). Metric-based detection of robot kidnapping with an SVM classifier. Robotics and
Autonomous Systems.
[7] Chandran-Ramesh, M. and Newman, P. (2007a). Assessing map quality and error causations using conditional random
fields. In Proceedings of the 6th. IFAC Symposium Intelligent Autonomous Vehicles (IAV), Toulouse, France.
[8] Chandran-Ramesh, M. and Newman, P. (2007b). Assessing map quality using conditional random fields. In Proc. of the
International Conference on Field and Service Robotics.
[9] Das, A., Servos, J., and Waslander, S. L. (2013). 3D scan registration using the normal distributions transform with ground
segmentation and point cloud clustering. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2207–2212.
[10] Edelsbrunner, H., Kirkpatrick, D. G., and Seidel, R. (1978). On the shape of a set. Transactions on Information Theory,
26(5):552–564.
[11] Edelsbrunner, H. and Mücke, E. P. (1994). Three-dimensional alpha shapes. ACM Transactions on Graphics (TOG),
13(1):43–72.
[12] Freund, Y. and Schapire, R. (1995). A decision-theoretic generalization of online learning and an application to boosting.
In Proceedings of the European Conference on Computational Learning Theory.
[13] Granström, K., Schön, T. B., Nieto, J. I., and Ramos, F. T. (2011). Learning to close loops from range data. Journal of
Robotics Research, pages 1728–1754.
[14] Magnusson, M. (2009). The Three-Dimensional Normal-Distributions Transform — an Efficient Representation for
Registration, Surface Analysis, and Loop Detection. PhD thesis, Örebro University, Sweden. Örebro Studies in Technology 36.
[15] Magnusson, M., Andreasson, H., Nüchter, A., and Lilienthal, A. J. (2009). Automatic appearance-based loop detection
from 3D laser data using the normal distributions transform. Journal of Field Robotics, 26(11–12):892–914.
[16] Magnusson, M., Vaskevicius, N., Stoyanov, T., Pathak, K., and Birk, A. (2015). Beyond points: Evaluating recent 3D
scan-matching algorithms. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
[17] Makadia, A., Patterson, A., and Daniilidis, K. (2006). Fully automatic registration of 3D point clouds. In Proceedings of
the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
[18] Masuda, T., Sakaue, K., and Yokoya, N. (1996). Registration and integration of multiple range images for 3-D model
construction. In Proceedings of the 13th International Conference on Pattern Recognition, pages 879–883.
[19] Rusinkiewicz, S. M. (2001). Efficient variants of the ICP algorithm. In Proceedings of the International Conference on
3-D Digital Imaging and Modeling, pages 145–152.
[20] Schwertfeger, S. and Birk, A. (2013). Evaluation of map quality by matching and scoring high-level, topological map
structures. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), pages 2221–2226,
Karlsruhe, Germany.
[21] Silva, L., Bellon, O. R., and Boyer, K. L. (2005). Precision range image registration using a robust surface interpenetration
measure and enhanced genetic algorithms. IEEE Transactions on Robotics, 27(5):762–776.
[22] Stoyanov, T., Magnusson, M., and Lilienthal, A. J. (2012). Fast and accurate scan registration through minimization of
the distance between compact 3D NDT representations. The International Journal of Robotics Research, 31(12):1377–1393.
[23] Stoyanov, T., Magnusson, M., and Lilienthal, A. J. (2013). Comparative evaluation of the consistency of three-dimensional
spatial representations used in autonomous robot navigation. Journal of Field Robotics, 30(2):216–236.
[24] Weingarten, J., Gruener, G., and Siegwart, R. (2003). A fast and robust 3D feature extraction algorithm for structured
environment reconstruction. In Proceedings of the International Conference on Advanced Robotics.
... Alsayed et al. (2017) detects localization failures using extracted lines and curves. Almqvist et al. (2018) proposes a method based on NDT (Normal Distributions Transform) for detecting misaligned point clouds and predicting the degree of misalignment using machine learning. Yin et al. (2019a) proposes a statistical learning-based approach for localization failure detection by defining the problem as a binary classification task. ...
... Specifically, for a query point cloud and a point cloud in the prior map, certain spatial properties such as distances and angles are preserved between them. Such spatial invariances, i.e. spatially verifiable cues, can be used to assess the similarity of scenes (Vidanapathirana et al., 2023) and verify the reliability of the local pose estimation (Almqvist et al., 2018). For a point cloud sequence, the sequential motions and uncertainties (Wan et al., 2021), i.e. temporal verifiable cues, provided by the odometry can also verify the reliability of the current localization result. ...
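The spatially verifiable cues mentioned above rely on the fact that a rigid transformation preserves distances between points. The following is a minimal sketch in Python with NumPy. It assumes that point correspondences between the query scan and the map are already known, so that row k of one array matches row k of the other. The function name and the tolerance `tol` are hypothetical and do not come from the cited works.

```python
import numpy as np


def pairwise_distance_consistency(src, dst, tol=0.1):
    """Fraction of correspondence pairs (i, j) whose distance |p_i - p_j| is
    preserved, within `tol`, between the source and destination point sets.

    Hypothetical helper: row k of `src` is assumed to correspond to row k of `dst`.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Full pairwise distance matrices within each point set.
    d_src = np.linalg.norm(src[:, None, :] - src[None, :, :], axis=-1)
    d_dst = np.linalg.norm(dst[:, None, :] - dst[None, :, :], axis=-1)
    # Compare each unordered pair once (upper triangle, no diagonal).
    iu = np.triu_indices(len(src), k=1)
    return float(np.mean(np.abs(d_src[iu] - d_dst[iu]) < tol))


src = np.array([[0, 0, 0], [1, 0, 0], [0, 2, 0], [0, 0, 3]], dtype=float)
R = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]], dtype=float)  # 90 degrees about z
dst = src @ R.T + np.array([5.0, 0.0, 0.0])                     # rigid motion

print(pairwise_distance_consistency(src, dst))                 # 1.0
print(pairwise_distance_consistency(src, dst[[0, 2, 1, 3]]))   # 0.3333333333333333
```

Swapping two correspondences breaks four of the six pairwise distances. A consistency score like this therefore drops even though each individual point still lies on the map.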
Preprint
A wearable laser scanning (WLS) system has the advantages of flexibility and portability. It can be used to determine the user's path within a prior map, which is in high demand for applications in pedestrian navigation, collaborative mapping, augmented reality, and emergency rescue. However, existing LiDAR-based global localization methods lack robustness, especially in complex, large-scale outdoor scenes with few features and incomplete coverage of the prior map. To address these challenges, we propose LiDAR-based reliable global localization (Reliable-loc), which exploits the verifiable cues in sequential LiDAR data. First, we propose a Monte Carlo Localization (MCL) method based on spatially verifiable cues, which uses the rich information embedded in local features to adjust the particles' weights and thereby prevents the particles from converging to erroneous regions. Second, we propose a localization status monitoring mechanism that is guided by the sequential pose uncertainties and adaptively switches the localization mode using the temporal verifiable cues, to avoid a crash of the localization system. To validate Reliable-loc, comprehensive experiments were conducted on a large-scale heterogeneous point cloud dataset consisting of high-precision vehicle-mounted mobile laser scanning (MLS) point clouds and helmet-mounted WLS point clouds, covering various street scenes over a length of more than 20 km. The experimental results indicate that Reliable-loc exhibits high robustness, accuracy, and efficiency in large-scale, complex street scenes, with a position accuracy of 1.66 m and a yaw accuracy of 3.09 degrees, and achieves real-time performance. For the code and detailed experimental results, please refer to https://github.com/zouxianghong/Reliable-loc.
... However, there have been some works on detecting wrong loop closures [12], [13], [14], which is a closely related problem, since an invalid map merge is fairly similar to a wrong global loop closure. Furthermore, there is some research on map quality measures [15], [16], [17], which could be used to detect invalid merges by computing a map quality measure and assuming that an invalid merge must have happened if it drops below a certain threshold. We adapt the method in [12] (histogram) and [16] (entropy) to compare our methods against. ...
... Another idea to detect invalid merges is to use some kind of map quality metric to assess the quality of the map and to treat low quality as an indicator that an invalid merge has happened. Different map quality metrics have been proposed in the literature, e.g., [15], [17]. Some of these metrics measure aspects of the map's quality that are not suitable for detecting invalid merges, since they are designed to respond to noisy maps or slightly misaligned scans. ...
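The map-quality idea above can be illustrated with the entropy measure mentioned in the first excerpt. The following is a minimal sketch. It assumes an occupancy grid stored as a NumPy array of occupancy probabilities. The mean per-cell binary entropy used here is one common choice and is not necessarily the exact measure of the cited work. The threshold `max_increase` is a hypothetical value.

```python
import numpy as np


def mean_cell_entropy(grid, eps=1e-9):
    """Mean binary Shannon entropy (in bits) of the occupancy probabilities in `grid`.

    0 bits: every cell is certainly free or occupied; 1 bit: p = 0.5 everywhere.
    """
    p = np.clip(np.asarray(grid, dtype=float), eps, 1.0 - eps)  # avoid log2(0)
    h = -(p * np.log2(p) + (1.0 - p) * np.log2(1.0 - p))
    return float(h.mean())


def merge_looks_invalid(grid_before, grid_after, max_increase=0.05):
    """Hypothetical rule: flag the merge if map quality (negative mean entropy)
    drops by more than `max_increase` bits."""
    return mean_cell_entropy(grid_after) - mean_cell_entropy(grid_before) > max_increase


sharp = np.array([[0.02, 0.98], [0.97, 0.03]])    # crisp walls and free space
blurred = np.array([[0.45, 0.55], [0.60, 0.40]])  # conflicting evidence after a bad merge
print(merge_looks_invalid(sharp, blurred))        # True
```

Unexplored cells at p = 0.5 also raise the entropy. A practical check would therefore probably compare only cells observed in both maps.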
Preprint
For Lifelong SLAM, one has to deal with temporary localization failures, e.g., induced by kidnapping. We achieve this by starting a new map and merging it with the previous map as soon as relocalization succeeds. Since relocalization methods are fallible, it can happen that such a merge is invalid, e.g., due to perceptual aliasing. To address this issue, we propose methods to detect and undo invalid merges. These methods compare incoming scans with scans that were previously merged into the current map and consider how well they agree with each other. Evaluation of our methods takes place using a dataset that consists of multiple flat and office environments, as well as the public MIT Stata Center dataset. We show that methods based on a change detection algorithm and on comparison of gridmaps perform well in both environments and can be run in real-time with a reasonable computational cost.
... In [105], the authors investigated the effect of geometric instability on alignment and proposed a novel learning-based approach to predict misalignment. A comparative evaluation of several methods for detecting misaligned point clouds in the point cloud registration problem, where multiple point clouds need to be aligned or merged, is presented in [106]. Similarly, in [107], researchers proposed a novel system to detect alignment errors in point cloud registration. ...
Article
Automated driving systems (ADSs) aim to improve the safety, efficiency and comfort of future vehicles. To achieve this, ADSs use sensors to collect raw data from their environment. This data is then processed by a perception subsystem to create semantic knowledge of the world around the vehicle. State-of-the-art ADSs’ perception systems often use deep neural networks for object detection and classification, thanks to their superior performance compared to classical computer vision techniques. However, deep neural network-based perception systems are susceptible to errors, e.g., failing to correctly detect other road users such as pedestrians. For a safety-critical system such as ADS, these errors can result in accidents leading to injury or even death to occupants and road users. Introspection of perception systems in ADS refers to detecting such perception errors to avoid system failures and accidents. Such safety mechanisms are crucial for ensuring the trustworthiness of ADSs. Motivated by the growing importance of the subject in the field of autonomous and automated vehicles, this paper provides a comprehensive review of the techniques that have been proposed in the literature as potential solutions for the introspection of perception errors in ADSs. We classify such techniques based on their main focus, e.g., on object detection, classification and localisation problems. Furthermore, this paper discusses the pros and cons of existing methods while identifying the research gaps and potential future research directions.
... We instead use odometry uncertainty as a prior, which we combine with additional loop evidence computed after registration. Finally, a range of methods aims at verifying pose estimates by detecting point cloud misalignment [30], [32], [37]. We fuse misalignment detection [32] together with place similarity and odometry uncertainty to achieve robust unified loop detection and verification. ...
Preprint
Robust SLAM in large-scale environments requires fault resilience and awareness at multiple stages, from sensing and odometry estimation to loop closure. In this work, we present TBV (Trust But Verify) Radar SLAM, a method for radar SLAM that introspectively verifies loop closure candidates. TBV Radar SLAM achieves a high correct-loop-retrieval rate by combining multiple place-recognition techniques: tightly coupled place similarity and odometry uncertainty search, creating loop descriptors from origin-shifted scans, and delaying loop selection until after verification. Robustness to false constraints is achieved by carefully verifying and selecting the most likely ones from multiple loop constraints. Importantly, the verification and selection are carried out after registration when additional sources of loop evidence can easily be computed. We integrate our loop retrieval and verification method with a fault-resilient odometry pipeline within a pose graph framework. By evaluating on public benchmarks, we found that TBV Radar SLAM achieves 65% lower error than the previous state of the art. We also show that it generalizes across environments without needing to change any parameters.
... Similarly, [15] relies on the condition number to determine the health of the optimization process and includes partial constraints along non-degenerate directions for sensor fusion. Other methods, such as [16,17], rely on the final alignment of scans to capture the adequacy of the geometric constraints provided by the environment for correct convergence. However, all these methods rely either on the point cloud registration process or on its result to determine the performance of the pose estimation process, and do not exploit the information provided by the point cloud data directly to facilitate the estimation process itself. ...
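For point-to-plane registration, the condition-number test mentioned in the excerpt can be sketched on the translational block of the information matrix, A = Σᵢ nᵢnᵢᵀ, where the nᵢ are unit surface normals. This is an illustrative assumption. The cited work may use the full 6×6 Hessian including rotation, and the `floor` value that avoids division by zero is arbitrary.

```python
import numpy as np


def translation_condition_number(normals, floor=1e-12):
    """Condition number of A = sum_i n_i n_i^T for unit surface normals n_i.

    A is the translational block of a point-to-plane ICP information matrix;
    a small eigenvalue means translation along its eigenvector is weakly constrained.
    """
    n = np.asarray(normals, dtype=float)
    n = n / np.linalg.norm(n, axis=1, keepdims=True)
    A = n.T @ n                    # 3x3 sum of outer products n_i n_i^T
    eig = np.linalg.eigvalsh(A)    # symmetric matrix: eigenvalues in ascending order
    return float(eig[-1] / max(eig[0], floor))


box = [[1, 0, 0], [-1, 0, 0], [0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]
tunnel = [[0, 1, 0], [0, -1, 0], [0, 0, 1], [0, 0, -1]]  # no normal has an x component
print(translation_condition_number(box))     # well constrained: ratio close to 1
print(translation_condition_number(tunnel))  # enormous: translation along x is unobservable
```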
Conference Paper
LiDAR-based localization and mapping is one of the core components in many modern robotic systems due to the direct integration of range and geometry, allowing for precise motion estimation and generation of high-quality maps in real time. Yet, as a consequence of insufficient environmental constraints present in the scene, this dependence on geometry can result in localization failure in self-symmetric surroundings such as tunnels. This work addresses precisely this issue by proposing a neural network-based estimation approach for detecting (non-)localizability during robot operation. Special attention is given to the localizability of scan-to-scan registration, as it is a crucial component in many LiDAR odometry estimation pipelines. In contrast to previous, mostly traditional detection approaches, the proposed method enables early detection of failure by estimating the localizability on raw sensor measurements without evaluating the underlying registration optimization. Moreover, previous approaches remain limited in their ability to generalize across environments and sensor types, as heuristic tuning of degeneracy detection thresholds is required. The proposed approach avoids this problem by learning from a collection of different environments, allowing the network to function in various scenarios. Furthermore, the network is trained exclusively on simulated data, avoiding arduous data collection in challenging and degenerate, often hard-to-access, environments. The presented method is tested during field experiments conducted across challenging environments and on two different sensor types without any modifications. The observed detection performance is on par with state-of-the-art methods after environment-specific threshold tuning.
Article
GNSS/INS-based positioning must be revised for forest mapping, especially inside the forest. This study deals with the issue of the processability of GNSS/INS-positioned MLS data collected in the forest environment. GNSS time-based point clustering processed the misaligned MLS point clouds collected from skid trails under a forest canopy. The points of a point cloud with two misaligned copies of the forest scene were manually clustered iteratively until two partial point clouds with the single forest scene were generated using a histogram of GNSS time. The histogram’s optimal bin width was the maximum bin width used to create the two correct point clouds. The influence of GNSS outage durations, signal strength statistics, and point cloud parameters on the optimal bin width were then analyzed using correlation and regression analyses. The results showed no significant influence of GNSS outage duration or GNSS signal strength from the time range of scanning the two copies of the forest scene on the optimal width. The optimal bin width was strongly related to the point distribution in time, especially by the duration of the scanned plot’s occlusion from reviewing when the maximum occlusion period influenced the optimal bin width the most (R² = 0.913). Thus, occlusion of the sub-plot scanning of tree trunks and the terrain outside it improved the processability of the MLS data. Therefore, higher stem density of a forest stand is an advantage in mapping as it increases the duration of the occlusions for a point cloud after it is spatially tiled.
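The time-histogram idea can be sketched as follows. The abstract describes manual, iterative clustering. This hypothetical automation simply treats every empty GNSS-time bin as a cluster boundary. It also shows why there is a maximum usable bin width: once bins are wider than the time gap between two passes over the scene, both passes fall into one cluster. Function and variable names are illustrative.

```python
import numpy as np


def cluster_by_gnss_time(times, bin_width):
    """Label each point by the run of consecutive non-empty GNSS-time histogram
    bins it falls into; an empty bin separates two clusters (hypothetical rule)."""
    t = np.asarray(times, dtype=float)
    idx = np.floor((t - t.min()) / bin_width).astype(int)  # histogram bin of each point
    occupied = np.zeros(idx.max() + 1, dtype=bool)
    occupied[idx] = True
    # A new cluster starts at every occupied bin whose predecessor bin is empty.
    starts = occupied & ~np.concatenate(([False], occupied[:-1]))
    cluster_of_bin = np.cumsum(starts) - 1
    return cluster_of_bin[idx]


# Two passes over the same scene, separated by a 10 s gap in GNSS time.
times = np.concatenate([np.arange(0.0, 10.0, 0.5), np.arange(20.0, 30.0, 0.5)])
print(len(set(cluster_by_gnss_time(times, 2.0))))   # 2: the gap is resolved
print(len(set(cluster_by_gnss_time(times, 15.0))))  # 1: bins too wide, copies merged
```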
Article
Robust SLAM in large-scale environments requires fault resilience and awareness at multiple stages, from sensing and odometry estimation to loop closure. In this work, we present TBV (Trust But Verify) Radar SLAM, a method for radar SLAM that introspectively verifies loop closure candidates. TBV Radar SLAM achieves a high correct-loop-retrieval rate by combining multiple place-recognition techniques: tightly coupled place similarity and odometry uncertainty search, creating loop descriptors from origin-shifted scans, and delaying loop selection until after verification. Robustness to false constraints is achieved by carefully verifying and selecting the most likely ones from multiple loop constraints. Importantly, the verification and selection are carried out after registration when additional sources of loop evidence can easily be computed. We integrate our loop retrieval and verification method with a robust odometry pipeline within a pose graph framework. By evaluating on public benchmarks, we found that TBV Radar SLAM achieves 65% lower error than the previous state of the art. We also show that it generalizes across environments without needing to change any parameters. We provide the open-source implementation at https://github.com/dan11003/tbv_public
Article
Reliability is a key factor for realizing safety guarantees of fully autonomous robot systems. In this paper, we focus on reliability in mobile robot localization. Monte Carlo localization (MCL) is widely used for mobile robot localization. However, it is still difficult to guarantee its safety because there are no methods for determining the reliability of an MCL estimate. This paper presents a novel localization framework that enables robust localization, reliability estimation, and quick relocalization simultaneously. The presented method can be implemented in a similar estimation manner to that of MCL. The method can increase localization robustness to environment changes by estimating known and unknown obstacles while performing localization; however, localization failures can still occur due to unanticipated errors. The method also includes a reliability estimation function that enables a robot to know whether localization has failed. Additionally, the method can seamlessly integrate a global localization method via importance sampling. Consequently, quick relocalization from a failure state can be realized while mitigating the noisy influence of global localization. We conduct three types of experiments using wheeled mobile robots equipped with a two‐dimensional LiDAR. Results show that reliable MCL that performs robust localization, self‐failure detection, and quick failure recovery can be realized.
Conference Paper
Given that 3D scan matching is such a central part of the perception pipeline for robots, thorough and large-scale investigations of scan matching performance are still surprisingly few. A crucial part of the scientific method is to perform experiments that can be replicated by other researchers in order to compare different results. In light of this fact, this paper presents a thorough comparison of 3D scan registration algorithms using a recently published benchmark protocol which makes use of a publicly available challenging data set that covers a wide range of environments. In particular, we evaluate two types of recent 3D registration algorithms: one local and one global. Both approaches take local surface structure into account, rather than matching individual points. After well over 100 000 individual tests, we conclude that algorithms using the normal distributions transform (NDT) provide accurate results compared to a modern implementation of the iterative closest point (ICP) method, when faced with scan data that has little overlap and weak geometric structure. We also demonstrate that the minimally uncertain maximum consensus (MUMC) algorithm provides accurate results in structured environments without needing an initial guess, and that it provides useful measures to detect whether it has succeeded or not. We also propose two amendments to the experimental protocol, in order to provide more valuable results in future implementations.
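As background for the NDT results above, the following is a minimal sketch of how an NDT representation can be built and used to score another scan. It makes three assumptions: a fixed cubic cell size, a minimum of five points per cell, and a plain Gaussian likelihood per cell. Magnusson's point-to-distribution score additionally mixes in a uniform outlier term, which is omitted here.

```python
import numpy as np
from collections import defaultdict


def build_ndt(points, cell_size=1.0, min_points=5):
    """Voxelize `points` and fit one Gaussian (mean, covariance) per cell
    that contains at least `min_points` points."""
    cells = defaultdict(list)
    for p in np.asarray(points, dtype=float):
        cells[tuple(np.floor(p / cell_size).astype(int))].append(p)
    ndt = {}
    for key, pts in cells.items():
        if len(pts) >= min_points:
            pts = np.asarray(pts)
            ndt[key] = (pts.mean(axis=0), np.cov(pts, rowvar=False))
    return ndt


def ndt_score(ndt, points, cell_size=1.0, reg=1e-3):
    """Sum of unnormalized Gaussian likelihoods of `points` under the NDT cell
    each point falls into; points in empty cells contribute nothing."""
    score = 0.0
    for p in np.asarray(points, dtype=float):
        key = tuple(np.floor(p / cell_size).astype(int))
        if key in ndt:
            mu, cov = ndt[key]
            cov = cov + reg * np.eye(len(mu))  # keep flat (planar) cells invertible
            d = p - mu
            score += float(np.exp(-0.5 * d @ np.linalg.solve(cov, d)))
    return score


rng = np.random.default_rng(0)
xy = rng.uniform(0.0, 4.0, size=(2000, 2))
z = 0.5 + rng.normal(0.0, 0.01, size=(2000, 1))  # a noisy floor plane
scan = np.hstack([xy, z])
ndt = build_ndt(scan)
print(ndt_score(ndt, scan) > ndt_score(ndt, scan + [0.0, 0.0, 0.3]))  # True
```

Because each cell stores a flat covariance, a 0.3 m offset along the floor normal drives the score of the shifted scan close to zero. This contrast is what NDT-based misalignment measures can exploit. The shift size here is arbitrary.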
Conference Paper
Mapping is an important task for mobile robots. However, assessing the quality of maps in a simple, efficient, and automated way is not trivial and remains an ongoing research topic. A new approach to map evaluation is presented here. It is based on Topology Graphs as a topological, abstracted representation of 2D grid maps. The Topology Graphs are derived from Voronoi Diagrams that are post-processed to capture the high-level spatial structures. Based on a similarity metric on vertices in Topology Graphs, the vertices can be matched across maps, and spatial (dis)similarities, and hence errors in the mapping, can be identified and measured. More precisely, the vertex similarity is the basis for matching the structures of Topology Graphs, up to the identification of subgraph isomorphisms through wave-front propagation. This makes it possible to determine important map quality attributes, up to very challenging structural elements like brokenness, i.e., the number of locally correct partitions in the candidate map and their relative placement towards each other. Experiments with real robot-generated maps, including examples from various teams in the RoboCup Rescue competition, are used to validate the usefulness of this method for map quality assessment.
Article
Kidnapping occurs when a robot is unaware that it has not correctly ascertained its position, potentially causing severe map deformation and reducing the robot’s functionality. This paper presents metric-based techniques for real-time kidnap detection, utilising either linear or SVM classifiers to identify all kidnapping events during the autonomous operation of a mobile robot. In contrast, existing techniques either solve specific cases of kidnapping, such as elevator motion, without addressing the general case or remove dependence on local pose estimation entirely, an inefficient and computationally expensive approach. Three metrics that measured the quality of a pose estimate were evaluated and a joint classifier was constructed by combining the most discriminative quality metric with a fourth metric that measured the discrepancy between two independent pose estimates. A multi-class Support Vector Machine classifier was also trained using all four metrics and produced better classification results than the simpler joint classifier, at the cost of requiring a larger training dataset. While metrics specific to 3D point clouds were used, the approach can be generalised to other forms of data, including visual, provided that two independent ways of estimating pose are available.
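The joint linear classifier described above can be illustrated with logistic regression, used here as a stand-in linear classifier. The abstract names linear and SVM classifiers but does not give training details. The two feature columns are assumptions: a pose-quality metric and the discrepancy between two independent pose estimates. All values below are made up for illustration.

```python
import numpy as np


def fit_logistic(X, y, lr=0.5, epochs=5000):
    """Fit a logistic-regression (linear) classifier by batch gradient descent."""
    X = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])  # bias column
    y = np.asarray(y, dtype=float)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))    # predicted probability of a kidnapping
        w -= lr * X.T @ (p - y) / len(y)    # gradient step on the mean log loss
    return w


def predict(w, X):
    """Return True where the classifier predicts a kidnapping (p >= 0.5)."""
    X = np.hstack([np.asarray(X, dtype=float), np.ones((len(X), 1))])
    return (X @ w) >= 0.0                   # equivalent to p >= 0.5


# Columns: [pose-quality metric, pose discrepancy]; illustrative values only.
X = np.array([[0.1, 0.05], [0.2, 0.1], [0.15, 0.0],
              [0.9, 0.8], [0.8, 1.0], [0.95, 0.7]])
y = np.array([0, 0, 0, 1, 1, 1])  # 1 = kidnapped
w = fit_logistic(X, y)
print(predict(w, X))  # expected to reproduce y on this separable toy data
```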
Article
An increasing number of robots for outdoor applications rely on complex three‐dimensional (3D) environmental models. In many cases, 3D maps are used for vital tasks, such as path planning and collision detection in challenging semistructured environments. Thus, acquiring accurate three‐dimensional maps is an important research topic of high priority for autonomously navigating robots. This article proposes an evaluation method that is designed to compare the consistency with which different representations model the environment. In particular, the article examines several popular (probabilistic) spatial representations that are capable of predicting the occupancy of any point in space, given prior 3D range measurements. This work proposes to reformulate the obtained environmental models as probabilistic binary classifiers, thus allowing for the use of standard evaluation and comparison procedures. To avoid introducing localization errors, this article concentrates on evaluating models constructed from measurements acquired at fixed sensor poses. Using a cross‐validation approach, the consistency of different representations, i.e., the likelihood of correctly predicting unseen measurements in the sensor field of view, can be evaluated. Simulated and real‐world data sets are used to benchmark the precision of four spatial models—occupancy grid, triangle mesh, and two variations of the three‐dimensional normal distributions transform (3D‐NDT)—over various environments and sensor noise levels. Overall, the consistency of representation of the 3D‐NDT is found to be the highest among the tested models, with a similar performance over varying input data. © 2013 Wiley Periodicals, Inc.
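Treating a spatial model as a probabilistic binary classifier makes standard metrics such as the area under the ROC curve (AUC) applicable. The following is a minimal sketch. It assumes that each held-out measurement yields a predicted occupancy probability and a binary label (occupied or free). The AUC is computed via the Mann-Whitney rank statistic, with ties counted as one half.

```python
import numpy as np


def roc_auc(scores, labels):
    """Area under the ROC curve via the Mann-Whitney U statistic.

    Equals the probability that a random positive scores higher than a random
    negative; ties count as one half.
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=bool)
    pos, neg = s[y], s[~y]
    greater = np.sum(pos[:, None] > neg[None, :])
    ties = np.sum(pos[:, None] == neg[None, :])
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))


# Predicted occupancy probabilities at held-out points; 1 = measured occupied.
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))  # 1.0
print(roc_auc([0.9, 0.4, 0.6, 0.1], [1, 1, 0, 0]))  # 0.75
```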
Conference Paper
The Normal Distributions Transform (NDT) scan registration algorithm models the environment as a set of Gaussian distributions and generates the Gaussians by discretizing the environment into voxels. With the standard approach, the NDT algorithm tends to converge poorly even for modest initial transformation errors. In this work, a segmented greedy cluster NDT (SGC-NDT) variant is proposed, which uses natural features in the environment to generate Gaussian clusters for the NDT algorithm. By segmenting the ground plane and clustering the remaining features, the SGC-NDT approach results in a smooth and continuous cost function which guarantees that the optimization will converge. Experiments show that the SGC-NDT algorithm results in scan registrations with higher accuracy and better convergence properties when compared against other state-of-the-art methods for both urban and forested environments.
Article
Registration of range sensor measurements is an important task in mobile robotics and has received a lot of attention. Several iterative optimization schemes have been proposed in order to align three-dimensional (3D) point scans. With the more widespread use of high-frame-rate 3D sensors and increasingly challenging application scenarios for mobile robots, there is a need for fast and accurate registration methods that current state-of-the-art algorithms cannot always meet. This work proposes a novel algorithm that achieves accurate point cloud registration an order of magnitude faster than the current state of the art. The speedup is achieved through the use of a compact spatial representation: the Three-Dimensional Normal Distributions Transform (3D-NDT). In addition, a fast global descriptor based on the 3D-NDT is defined and used to achieve reliable initial poses for the iterative algorithm. Finally, a closed-form expression for the covariance of the proposed method is also derived. The proposed algorithms are evaluated on two standard point cloud data sets, resulting in stable performance on a par with or better than the state of the art. The implementation is available as an open-source package for the Robot Operating System (ROS).
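The compact 3D-NDT representation allows two scans to be compared Gaussian by Gaussian rather than point by point. The sketch below shows one distribution-to-distribution cost term in the form we understand D2D-NDT to use: -d1·exp(-(d2/2)·μᵀ(R C_a Rᵀ + C_b)⁻¹μ), with μ = Rμ_a + t − μ_b. The constants d1 and d2 are placeholders rather than the values from the paper. The full method sums such terms over pairs of components and optimizes R and t.

```python
import numpy as np


def d2d_cost(mu_a, cov_a, mu_b, cov_b, R=None, t=None, d1=1.0, d2=1.0):
    """Distribution-to-distribution cost between Gaussian a (moved by R, t)
    and Gaussian b. More negative means better agreement; d1, d2 are placeholders."""
    R = np.eye(3) if R is None else np.asarray(R, dtype=float)
    t = np.zeros(3) if t is None else np.asarray(t, dtype=float)
    mu = R @ np.asarray(mu_a, dtype=float) + t - np.asarray(mu_b, dtype=float)  # mean offset
    cov = R @ np.asarray(cov_a, dtype=float) @ R.T + np.asarray(cov_b, dtype=float)  # combined
    return float(-d1 * np.exp(-0.5 * d2 * mu @ np.linalg.solve(cov, mu)))


C = np.diag([0.5, 0.5, 0.01])  # a flat surface patch with its normal along z
origin = [0.0, 0.0, 0.0]
print(d2d_cost(origin, C, origin, C))                   # -1.0: perfect overlap
print(d2d_cost(origin, C, origin, C, t=[0, 0, 0.2]))    # about -0.37: offset along the normal
print(d2d_cost(origin, C, origin, C, t=[0.2, 0, 0]))    # about -0.98: offset along the surface
```

For a flat patch, an offset along the surface normal is penalized far more than the same offset along the surface.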
Article
Frequently, data in scientific computing is in its abstract form a finite point set in space, and it is sometimes useful or required to compute what one might call the "shape" of the set. For that purpose, this article introduces the formal notion of the family of α-shapes of a finite point set in ℝ³. Each shape is a well-defined polytope, derived from the Delaunay triangulation of the point set, with a parameter α ∈ ℝ controlling the desired level of detail. An algorithm is presented that constructs the entire family of shapes for a given set of size n in time O(n²), worst case. A robust implementation of the algorithm is discussed, and several applications in the area of scientific computing are mentioned.