ISPRS BENCHMARK ON MULTISENSORY INDOOR MAPPING AND POSITIONING
Cheng Wang1 *, Yudi Dai1, Naser Elsheimy2, Chenglu Wen1, Guenther Retscher3, Zhizhong Kang4, and Andrea Lingua5
1 Fujian Key Laboratory of Sensing and Computing, School of Informatics, Xiamen University, 422 Siming Road South, Xiamen
361005, China - (cwang@xmu.edu.cn, daiyudi@stu.xmu.edu.cn, clwen@xmu.edu.cn)
2 University of Calgary, Canada - elsheimy@ucalgary.ca
3 TU Wien - Vienna University of Technology, Austria - guenther.retscher@geo.tuwien.ac.at
4 China University of Geosciences, Beijing - zzkang@cugb.edu.cn
5 Polytechnic University of Turin, Italy - andrea.lingua@polito.it
KEY WORDS: Multi-sensor, Indoor, Benchmark Dataset, SLAM, BIM, Indoor Positioning
ABSTRACT:
In this paper, we present a publicly available benchmark dataset on multisensorial indoor mapping and positioning (MiMAP), which
is sponsored by ISPRS scientific initiatives. The benchmark dataset includes point clouds captured by an indoor mobile laser scanning
system in indoor environments of various complexity. The benchmark aims to stimulate and promote research in the following three
fields: (1) LiDAR-based Simultaneous Localization and Mapping (SLAM); (2) automated Building Information Model (BIM) feature
extraction; and (3) multisensory indoor positioning. The MiMAP project provides a common framework for the evaluation and
comparison of LiDAR-based SLAM, BIM feature extraction, and smartphone-based indoor positioning methods. This paper describes
the multisensory setup, data acquisition process, data description, challenges, and evaluation metrics included in the MiMAP project.
1. INTRODUCTION
Indoor environments such as offices, classrooms, shopping malls,
and parking lots are essential to our daily life. Three-dimensional
(3D) mapping and positioning technologies for indoor
environments have been in increasingly high demand in recent years.
Online visualization, location-based services (LBS), indoor
navigation, elder assistance, and emergency evacuation are just a
few examples of the emerging applications that require 3D
mapping and positioning of indoor environments. SLAM-based
indoor mobile laser scanning systems (IMLS) (Wen et al., 2016)
provide a useful tool for indoor applications. During the IMLS
procedure, 3D point clouds and high accuracy trajectories with
position and orientation are acquired. Many efforts have been
made in the last few years to improve the SLAM algorithms
(Zhang et al., 2014) and the geometric/semantic information
extraction from point clouds and images (Armeni et al., 2016;
Wang et al., 2018). However, both significant opportunities and
severe challenges exist in the multisensory data processing of
IMLS. First, efficient or real-time methods for generating 3D point
clouds of as-built indoor environments are still lacking; second,
extracting building information model (BIM) features is difficult in
cluttered and occluded indoor environments.
Also, given its relatively high accuracy, the IMLS trajectory
provides a valuable reference for low-cost indoor positioning
solutions. Standard datasets are critical for research on these
topics.
Under the sponsorship of ISPRS Scientific Initiatives 2019, we
developed the ISPRS Benchmark on Multisensorial Indoor
Mapping and Positioning (MiMAP). MiMAP aims to promote
research in three aspects: (1) LiDAR-based SLAM; (2)
automated BIM feature extraction from point clouds, focusing on
the extraction of building elements, such as floors, walls, ceilings,
doors, and windows, which are important in building management
and navigation tasks; and (3) multisensory indoor positioning,
focusing on smartphone-platform solutions. MiMAP also
provides evaluation methods for these three aspects. The MiMAP
dataset is open access via the ISPRS WG I/6 official website
(ISPRS WG I/6, 2020) or the mirror website http://mi3dmap.net/.
The rest of this paper describes the multisensory setup, data
acquisition, dataset description, challenges and evaluation
metrics in the MiMAP project.
2. DATASET
The MiMAP project team upgraded the XBeibao system (Wen et al.,
2016), a multisensory backpack system developed by Xiamen
University, to build the MiMAP benchmark. The upgraded
system (Figure 1(a)) can synchronously collect data from multi-
beam laser scanners, fisheye cameras, and smartphones' built-in
sensors, such as the barometer, magnetometer, MEMS IMU, and
WiFi. Baseline SLAM 3D point clouds of the indoor test
environments are also provided, generated with the XBeibao
processing software. We used a Riegl VZ-1000 (Figure 1(b))
to collect high-accuracy point clouds as the ground truth for
indoor mapping.
2.1 Multisensory setup
The sensors involved are listed as follows:
XBeibao system
- 1 Velodyne VLP-Ultra Puck™ rotating 3D laser scanner: 20 Hz, 32 beams, 4 cm accuracy, collecting 0.6 million points/second, 200 m range, with a field of view of 360° horizontal × 40° vertical.
- 1 Velodyne VLP-16 rotating 3D laser scanner: 20 Hz, 16 beams, 0.1°~0.4° horizontal angle resolution, 3 cm accuracy, collecting 0.3 million points/second, 100 m range, with a field of view of 360° horizontal × ±15° vertical.
- 4 fisheye cameras: 1280×720 @ 30 fps video resolution, with a field of view of 4175.
Smartphones
- Sensors: gyroscope, accelerometer, barometer, electronic compass, Wi-Fi, magnetometer, GNSS (GPS).
Riegl VZ-1000 laser scanner
- Range from 1.5 m up to 1200 m, 5 mm precision, 8 mm accuracy, collecting 0.3 million points/second, with a field of view of 100° vertical × 360° horizontal.
Figure 1. Data acquisition devices. (a) XBeibao. (b) Riegl VZ-1000.
When collecting the data, we placed one smartphone facing up
on top of the upper LiDAR sensor; the other smartphones were held
in hand. A laptop is used to control the data collection of the
cameras and LiDAR sensors; it also acts as a hotspot to which the
smartphones connect for time synchronization, and it stores the
incoming LiDAR data streams. A system operator carries the
laptop during the collection process, and all collected data are
transferred to the laptop through wired connections.
2.2 Dataset
2.2.1 Dataset overview:
The MiMAP benchmark includes three datasets:
Indoor LiDAR-based SLAM dataset
We collected indoor point cloud data in three multi-floor
buildings with the upgraded XBeibao. This dataset represents
typical indoor building complexity. We provide the raw data of
one indoor scene together with ground truth for users' own
evaluation, and the raw data of two further scenes that are
evaluated by submitting results to us. The evaluation criterion is
the error with respect to the ground-truth point cloud acquired
with a millimeter-level accuracy terrestrial laser scanner (TLS)
(Figure 2(b)).
BIM feature extraction dataset
We provide three scenes with ground truth for evaluating BIM
feature extraction on indoor 3D point clouds. The ground-truth
data were built manually; examples are presented in Figure 3.
Indoor positioning dataset
We provide two data sequences with ground truth and three data
sequences without ground truth that are evaluated by submitting
results. The evaluation criterion is the error with respect to the
centimeter-level accuracy platform trajectory from the SLAM
processing (Figure 4).
Figure 2. Illustration of the indoor LiDAR-based point cloud.
(a) Multi-beam laser scanning sensor model. (b) The high-accuracy
reference TLS point cloud.
Figure 3. Illustration of the BIM feature extraction dataset.
(a, c, e) Point clouds. (b, d, f) Corresponding BIM frame features.
Figure 4. Illustration of the indoor positioning dataset.
(a) Setup of the smartphone with XBeibao. (b) SLAM trajectory
as the synchronized reference for indoor positioning.
2.2.2 Dataset description:
A sequence of data is compressed into a file named
mimap_type_number.zip, where type denotes one of the three
datasets and number indicates the serial number of that type's
recording round. The type field takes three values, in_slam, bim,
and in_pose, representing the indoor LiDAR-based SLAM
dataset, the BIM feature extraction dataset, and the indoor
positioning dataset, respectively. The dataset's directory
structure and detailed description are given below.
The indoor LiDAR-based SLAM dataset consists of three
scenes captured by multi-beam laser scanners in indoor
environments of various complexity. The original scan frames
from the scanners are provided as pcap files, which also contain
the timestamp of every point from the LiDAR sensor.
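For readers who want to access the raw frames directly, the following minimal Python sketch illustrates one way to iterate over such a pcap file. It assumes the dpkt package and the common Velodyne UDP packet layout (1206-byte data packets whose last six bytes carry a four-byte microsecond timestamp plus two status bytes); the exact layout should be verified against the sensor manual, and the packet-to-point decoding itself is omitted.

```python
import struct

import dpkt  # third-party pcap parser; any pcap reader could be used instead


def read_velodyne_packets(pcap_path):
    """Yield (capture_time_s, device_time_us, payload) for each Velodyne data packet."""
    with open(pcap_path, "rb") as f:
        for capture_time, buf in dpkt.pcap.Reader(f):
            eth = dpkt.ethernet.Ethernet(buf)
            if not isinstance(eth.data, dpkt.ip.IP):
                continue
            ip = eth.data
            if not isinstance(ip.data, dpkt.udp.UDP):
                continue
            payload = bytes(ip.data.data)
            if len(payload) != 1206:  # Velodyne data packets are 1206 bytes long
                continue
            # Assumed layout: 12 data blocks (1200 bytes) + 4-byte timestamp + 2 status bytes.
            device_time_us = struct.unpack_from("<I", payload, 1200)[0]
            yield capture_time, device_time_us, payload


# Example: inspect the first packet of a sequence.
capture_time, device_time_us, _ = next(read_velodyne_packets("mimap_in_slam_00.pcap"))
print(capture_time, device_time_us)
```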
The mimap_in_slam_00.zip and mimap_in_slam_01.zip sequences
were acquired by a Velodyne Ultra Puck™, while
mimap_in_slam_02.zip was acquired by a Velodyne HDL-32E.
Only the mimap_in_slam_00.zip dataset provides ground-truth
point cloud data, which were acquired by a Riegl VZ-1000.
We provide the raw videos captured by the four cameras in
mimap_in_slam_02.zip. The videos are named position.avi,
where position is a placeholder for the front, rear, left, or right
camera. The time of every frame is saved in
video_frame_time.txt; each line of the file is a timestamp (us)
relative to the system boot time, and the line number corresponds
to the frame number in the video. The four videos share the same
timestamps.
If video data are provided, each camera's intrinsic matrix,
extrinsic matrix, and distortion coefficients are saved in
parameter.xml. There are four cameras, front, rear, left, and right,
named after their viewing direction and position on the XBeibao
system. The extrinsic matrix converts the camera's coordinate
system to LiDAR A's coordinate system.
If the original pcap files of the two Velodyne sensors are provided,
the 4×4 calibration matrix converting LiDAR B's coordinate
system to LiDAR A's coordinate system is also saved in
parameter.xml.
Figure 5. Structure of the indoor LiDAR-based SLAM dataset.
Table 1. Data description of the indoor LiDAR-based SLAM dataset.

Data | File name | Description
00 | mimap_in_slam_00.pcap | Raw data of a two-floor building scene. Scanned by a Velodyne Ultra Puck.
00 | GroundTruth.las | Scanned by the Riegl VZ-1000.
01 | mimap_in_slam_01.pcap | Raw data of a five-floor building scene. Scanned by a Velodyne Ultra Puck.
02 | mimap_in_slam_02.pcap | A five-floor building scene. Scanned by a Velodyne HDL-32E.
02 | front / rear / left / right.avi | Four video files (25 fps).
02 | parameter.xml | Cameras' intrinsic, extrinsic, and distortion coefficient parameters.
02 | video_frame_time.txt | Timestamp (us) of every video frame; the line number equals the frame number.
The BIM feature extraction dataset contains data from three
indoor scenes of various complexity. For each scene, the raw data
(point cloud in LAS format) and the corresponding BIM line
framework (in OBJ format) are provided. Users can evaluate their
methods against the downloaded reference line frameworks.
Figure 6. Structure of the BIM feature extraction dataset.
Table 2. Data description of the BIM feature extraction dataset.

Data | Description
00 | A closed-loop corridor scene, with the line framework of the point cloud scene.
01 | A corridor and multiple rooms scene, with the line framework of the point cloud scene.
02 | A closed-loop corridor and multiple rooms scene, with the line framework of the point cloud scene.
The indoor positioning dataset consists of five data sequences
acquired in indoor environments of various complexity.
Sequences of sensor records from smartphones are provided, on
which users can test their positioning algorithms. The first
two sequences (mimap_in_pose_00 and mimap_in_pose_01)
were acquired in one building, and the other three sequences
(mimap_in_pose_02, mimap_in_pose_03, mimap_in_pose_04)
were acquired in another building. Only mimap_in_pose_00 and
mimap_in_pose_02 contain a ground-truth trajectory file (in TXT
format). The trajectory is the SLAM result of the LiDAR,
containing the position, rotation, and timestamp (us) of every
frame; the detailed format is listed in the file.
Each data sequence contains one directory per phone and a
phones_data_description.txt file. Each phone directory is named
after the smartphone, and there are usually multiple phone
directories. In every smartphone's folder there are a
timeOffset.txt file and several sensor_name.txt files, where
sensor_name is the abbreviated name of the smartphone sensor,
including the gyroscope, accelerometer, barometer, electronic
compass, Wi-Fi sensor, magnetometer, GPS, etc. The
timeOffset.txt file records the time offsets between the phone and the
NTP server. The phones_data_description.txt file details the format
of each file in the phone directories.
The residual offset between the smartphones and the LiDAR is
small relative to the accuracy required for indoor positioning
tasks, so we do not provide calibration files between the
smartphones and the LiDARs.
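As a small, hedged illustration of how the recorded offset can be applied when loading phone data, the sketch below shifts the timestamps of one sensor file by the offset read from timeOffset.txt. The column layout assumed here (a Unix timestamp in seconds in the first column) and the sign of the correction are assumptions; the authoritative layout is given in phones_data_description.txt.

```python
import numpy as np


def load_sensor_with_offset(sensor_file, time_offset_s):
    """Load one sensor_name.txt file and correct its timestamps by the phone-NTP offset.

    Assumes whitespace-separated numeric rows whose first column is a Unix
    timestamp in seconds; the authoritative layout is in phones_data_description.txt.
    """
    data = np.loadtxt(sensor_file)
    # Sign convention is an assumption: here the offset is removed so that the
    # corrected timestamps refer to the NTP (laptop) clock.
    data[:, 0] = data[:, 0] - time_offset_s
    return data


# Hypothetical usage for one phone directory.
offset_s = float(open("XIAOMI6/timeOffset.txt").read().split()[0])
accelerometer = load_sensor_with_offset("XIAOMI6/accelerometer.txt", offset_s)
```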
Figure 7. Structure of the indoor positioning dataset.
Table 3. Data description of the indoor positioning dataset.

Data | File name | Description
00 | GroundTruth_traj.txt | LiDAR trajectory data containing the position, rotation, and timestamp.
00 | phones_data_description.txt | A five-floor building scene including data of individual rooms, closed-loop corridors, and stairs.
00 | XIAOMI5S / XIAOMI6 | Two directories of smartphone data files.
01 | phones_data_description.txt | A three-floor building scene including data of individual rooms, closed-loop corridors, and stairs.
01 | XIAOMI5S / XIAOMI6 | Two directories of smartphone data files.
02 | GroundTruth_traj.txt | LiDAR trajectory data containing the position, rotation, and timestamp.
02 | phones_data_description.txt | A six-floor building scene including data of corridors and stairs.
02 | HuaweiP8lite / MI6 | Two directories of smartphone data files.
03 | phones_data_description.txt | A single-floor building scene including data of multiple rooms.
03 | MI6 / ALE-L21 | Two directories of smartphone data files.
04 | phones_data_description.txt | A single-floor building scene including data of multiple rooms.
04 | MI5S | The directory of the smartphone data files.
3. CHALLENGES
3.1 Time synchronization
To synchronize the smartphones and LiDARs, a laptop is set up as
a local NTP (Network Time Protocol) server, and the phones
connect to it to synchronize their time. The LiDAR is connected
to the laptop through a network cable. The timestamp of every
point cloud frame is relative to the start of recording; we read
the recording start Unix timestamp on the laptop and add it to
all frame timestamps, so the point cloud timestamps are also
referenced to the NTP server time. Thus, the smartphone and
LiDAR clocks are synchronized through the laptop, which acts
as a bridge.
The smartphones synchronize to the local NTP server during
recording, so the Unix timestamp in every record is reasonably
accurate. However, due to the instability of the Wi-Fi connection,
there are time offsets between the smartphones and the NTP
server, ranging from 20 ms to 500 ms; we record these offsets
before the data collection starts.
Since all data timestamps are available, we can obtain the
platform position at any time by interpolation and use the
LiDAR positioning result as the smartphones' positioning ground truth.
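A minimal sketch of this interpolation step is given below. It assumes the LiDAR trajectory has already been parsed into a timestamp array and an (N, 3) position array (see the ground-truth format in Section 4.3) and that the smartphone timestamps have already been corrected to the NTP clock; positions are interpolated linearly, which is adequate at typical walking speeds.

```python
import numpy as np


def interpolate_positions(traj_t, traj_xyz, query_t):
    """Linearly interpolate LiDAR ground-truth positions at arbitrary timestamps.

    traj_t   : (N,) strictly increasing trajectory timestamps
    traj_xyz : (N, 3) trajectory positions from the LiDAR SLAM result
    query_t  : (M,) smartphone timestamps, already corrected to the NTP clock
    returns  : (M, 3) interpolated positions used as the positioning ground truth
    """
    return np.stack(
        [np.interp(query_t, traj_t, traj_xyz[:, k]) for k in range(3)], axis=1
    )
```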
3.2 Multi-Sensors Calibration
In this system, LiDAR sensor A (L_A) is mounted horizontally, and
LiDAR sensor B (L_B) is mounted 45° below LiDAR sensor A
(Figure 1). Based on our previous work (Gong et al., 2018), the
point cloud of LiDAR sensor A, P_A, and the point cloud of LiDAR
sensor B, P_B, are fused into P by the 4×4 transform matrix between
the two LiDAR sensors, T_BA (Eq. (1)). Additionally, Terrestrial
Laser Scanning (TLS) data are introduced to bridge the calibration
between the LiDAR sensors and the cameras. The calibration
process is shown in Figure 8.

$P = P_A \cup T_{BA} P_B$  (1)
Figure 8. Flowchart of the calibration process (Wen et al.,
2019).
3.2.1 LiDAR-to-LiDAR calibration: The calibration of the
multi-LiDAR sensors is computed recursively during the
construction of the sub-map and its isomorphism constraint
(Gong et al., 2018). Let T_A^t (t = 0, ..., n) be the trajectory of
LiDAR sensor A estimated by the mapping algorithm, P_B^n the
point cloud of LiDAR sensor B at time n, and T_init the initial
coordinate-system transformation between the LiDAR sensors.
Calibration is the computation of the exact calibration matrix
T_BA by:
$Q_B^n = NN\left(M,\; T_A^n\, T_{init}\, P_B^n\right)$  (2)

$T_{BA} = \arg\min_{T} \sum_{p \in P_B^n} \left\| T_A^n\, T\, p - q_p \right\|^2, \quad q_p \in Q_B^n$  (3)

where NN(·) is the nearest-neighbour point search algorithm.
Using T_A^n and T_init, P_B^n is first transformed to its location
at time n in the sub-map M. Then the NN(·) algorithm is used to
search the sub-map for the nearest-neighbour point set Q_B^n.
Lastly, an environmental consistency constraint is introduced to
obtain T_BA.
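The sketch below is a simplified, single-step illustration of this idea rather than the full recursive procedure of Gong et al. (2018): LiDAR B's points are moved into the sub-map with the current calibration estimate, nearest neighbours are found with a k-d tree, and a point-to-point rigid alignment (Kabsch/SVD) supplies the environmental-consistency correction that updates the calibration.

```python
import numpy as np
from scipy.spatial import cKDTree


def refine_calibration_once(sub_map, points_b, T_a_n, T_est, max_dist=0.2):
    """One nearest-neighbour refinement step for the LiDAR B -> LiDAR A calibration.

    sub_map  : (K, 3) points of the sub-map built from LiDAR A's frames
    points_b : (M, 3) points of LiDAR B at time n, in its own sensor frame
    T_a_n    : (4, 4) pose of LiDAR A at time n in the sub-map frame
    T_est    : (4, 4) current estimate of the B -> A calibration
    max_dist : assumed gating distance (m) for nearest-neighbour correspondences
    """
    # Transform B's points into the sub-map using the current estimate.
    homog = np.hstack([points_b, np.ones((points_b.shape[0], 1))])
    p = (T_a_n @ T_est @ homog.T).T[:, :3]

    # Nearest-neighbour association, i.e. the NN(.) step of Eq. (2).
    dist, idx = cKDTree(sub_map).query(p)
    keep = dist < max_dist
    p, q = p[keep], sub_map[idx[keep]]

    # Rigid point-to-point alignment (Kabsch/SVD) enforcing environmental consistency.
    mu_p, mu_q = p.mean(axis=0), q.mean(axis=0)
    U, _, Vt = np.linalg.svd((p - mu_p).T @ (q - mu_q))
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:  # guard against a reflection
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mu_q - R @ mu_p
    dT = np.eye(4)
    dT[:3, :3], dT[:3, 3] = R, t  # correction expressed in the sub-map frame

    # Fold the correction back into the calibration estimate.
    return np.linalg.inv(T_a_n) @ dT @ T_a_n @ T_est
```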
3.2.2 Camera-to-LiDAR calibration: The camera intrinsic
calibration is described by the matrix
$K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}$
and the distortion coefficients $D = (k_1, k_2)$, where f_x and f_y
are the focal lengths of the camera, (c_x, c_y) is the principal point
position, and k_1, k_2 are the factors of radial distortion.
Scaramuzza's camera calibration method (Scaramuzza et al., 2006)
is used to determine the internal parameters and distortion factors
of the camera.
We utilized a TLS (e.g., Riegl VZ-1000) to bridge the calibration
between the LiDAR sensors and the cameras. By manually selecting
matching points between them, we acquire the camera's
extrinsic transformation (R, t), where R is the 3×3 rotation
matrix and t is the 1×3 translation vector.
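For illustration, the hedged sketch below projects LiDAR points into an image using a pinhole approximation of these parameters; R and t here denote the LiDAR-to-camera direction (i.e., the inverse of the stored extrinsic if that maps the camera frame to LiDAR A's frame), and the full fisheye model of Scaramuzza's calibration is not reproduced.

```python
import numpy as np


def project_to_image(points_lidar, K, R, t):
    """Project LiDAR-frame points to pixel coordinates with a pinhole approximation.

    points_lidar : (N, 3) points in LiDAR A's frame
    K            : (3, 3) camera intrinsic matrix
    R, t         : rotation (3, 3) and translation (3,) taking LiDAR A's frame
                   into the camera frame (inverse of the stored camera-to-LiDAR extrinsic)
    """
    cam = (R @ points_lidar.T).T + t        # points in the camera frame
    cam = cam[cam[:, 2] > 0]                # keep points in front of the camera
    uvw = (K @ cam.T).T
    return uvw[:, :2] / uvw[:, 2:3]         # pixel coordinates (u, v)
```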
Phone-to-LiDAR calibration: We placed the smartphone face
up on LiDAR A (Figure 9), with its Y-axis parallel to the laser
beam scanning direction, so the phone's coordinate system and
the LiDAR's coordinate system share the same XYZ-axis
directions. In some scenes we carried more than one smartphone;
apart from the one on LiDAR A, the other smartphones were held
in hand. We did not provide calibration files, because the residual
offset is small enough for indoor positioning tasks.
Figure 9. The smartphone's position and coordinate system.
3.3 Reference data generation
For benchmark evaluation, we generated reference data from a
subset of the raw data and introduced additional high-accuracy data.
3.3.1 SLAM-based indoor point cloud
The reference data for the SLAM-based indoor point cloud were
collected by a millimeter-level accuracy terrestrial laser scanner
(TLS) (Figure 10). Before scanning, a number of highly reflective
rectangular markers were placed on the walls and ground. Several
sub-maps were then generated by scanning the scene from
different positions, with overlap guaranteed between adjacent
sub-maps. Finally, these sub-maps were manually registered by
picking the common markers and other feature points in RiSCAN
PRO.
Figure 10. The reference data of SLAM-based indoor point cloud.
3.3.2 BIM feature
We used the building line framework method developed by Wang
et al. (2018) and semantic objects labeled by manual editing.
We selected building lines longer than 0.1 m in the structured
indoor buildings and saved the coordinates of their two
endpoints. Figure 11 gives an example of BIM features.
Wang's method first semantically labels the raw point clouds
into walls, ceilings, floors, and other objects. Then, line
structures are extracted from the labeled points to achieve an
initial description of the building line framework. To optimize
line structures corrupted by occlusion, a conditional Generative
Adversarial Network (cGAN) deep learning model is constructed;
the line framework optimization includes structure completion,
extrusion removal, and regularization. Finally, CloudCompare
(Girardeau-Montaut, 2011) is used to fine-tune the line framework
against the raw point clouds with human intervention.
Figure 11. BIM feature examples. (a) Point cloud data. (b) BIM
structure model. The green lines are doors and pillars. The red
lines are ceilings. The blue lines represent the ground.
3.3.3 Indoor positioning
First, we collected the smartphone sensor data and the LiDAR
data at the same time. Then we applied the SLAM method (Wen
et al., 2019) to the LiDAR pcap data to generate a trajectory file
with timestamps; time synchronization was performed as
described in subsection 3.1. The LiDAR trajectory file is treated
as the reference data for indoor positioning. An indoor positioning
reference example is shown in Figure 12, where the red line is the
reference trajectory from the SLAM process.
4. EVALUATION METRICS
4.1 SLAM-based indoor point cloud
Kümmerle et al. (2009) proposed a metric for
measuring the performance of a SLAM algorithm by considering
the poses of a robot during data acquisition. However, for indoor
environments it is hard to obtain a reference for the trajectory poses.
We follow the metric for point cloud comparison proposed by
Lehtola et al. (2017). Specifically, our evaluation first reconstructs
the point cloud based on the submitted trajectory. Then a 3 cm
voxel filter is applied to ensure that the evaluated and reference
point clouds have the same resolution. The error of a single point
is given by the weighted point-to-point (p2p) absolute distance:
$e_i = w_i \, d(p_i, q_i)$

where p_i is a point in the evaluated point cloud, q_i is the
corresponding nearest neighbour point in the reference point cloud,
and d(·,·) is the Euclidean distance between two points. The weight
w_i is calculated as

$w_i = \begin{cases} 1, & d(p_i, q_i) \le d_{max} \\ 0, & \text{otherwise} \end{cases}$

and the error of the whole point cloud is summarized by the mean
and standard deviation over the retained points:

$\mu = \frac{1}{N}\sum_{i=1}^{N} e_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(e_i - \mu\right)^2}$

where N is the number of points in the evaluated point cloud
that satisfy $d(p_i, q_i) \le d_{max}$.
The motivation for using the absolute distance is that it can be
computed by nearest-neighbour search instead of manually
selecting corresponding feature points between the two point
cloud maps, which would introduce manual errors and unfairness.
Since nearest-neighbour search is used for point association, the
coordinate system of the evaluated point cloud should be the same
as that of the reference. The point cloud generated by a SLAM
algorithm uses the local coordinate system of the first frame as its
global coordinate system. To make a fair comparison, we manually
register the first frame of the SLAM point cloud to the reference
point cloud to obtain a transformation matrix T; applying T to each
evaluated point cloud aligns it to the reference point cloud. The
evaluation table will rank methods according to the average of the
absolute errors.
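A minimal sketch of this error computation, assuming both clouds are already voxel-filtered to 3 cm and aligned with T, is shown below; the truncation distance d_max is an assumed parameter rather than a value fixed by the benchmark.

```python
import numpy as np
from scipy.spatial import cKDTree


def p2p_error(eval_points, ref_points, d_max=1.0):
    """Weighted point-to-point error between an evaluated and a reference cloud.

    Points farther than d_max from their nearest reference neighbour get weight 0
    and are excluded, mirroring the truncation in the metric above; d_max is an
    assumed parameter, not a value fixed by the benchmark.
    """
    dist, _ = cKDTree(ref_points).query(eval_points)
    inliers = dist[dist <= d_max]
    return inliers.mean(), inliers.std(), len(inliers)
```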
Figure 12. Indoor positioning reference example. The red line is
the reference trajectory from the SLAM process. The point cloud
map generated from the red trajectory is of a five-floor building
(only part of the building is shown).
4.2 BIM feature
The BIM feature extraction dataset contains data from three
indoor scenes of various complexity. For each scene, the raw data
(point cloud in LAS format) and the corresponding BIM line
framework (in OBJ format) are provided.
Imitating the COCO evaluation criterion (Lin et al., 2014), we adopt
the average precision (AP) of the predicted line framework as the
primary metric. We use a distance threshold θ to decide whether two
lines are coincident, instead of the Intersection over Union (IoU)
used in COCO.
Given a line l^gt with endpoints (e_1^gt, e_2^gt) in the ground-truth
annotations and a line l^pred with endpoints (e_1^pred, e_2^pred) in
the prediction, the two lines are considered coincident if the mean
value d̄ of the distances between the two pairs of endpoints is less
than the threshold θ:

$\bar{d} = \frac{1}{2}\left( d(e_1^{gt}, e_1^{pred}) + d(e_2^{gt}, e_2^{pred}) \right) < \theta$
Figure 13 shows one example: a predicted line whose mean
endpoint distance to a ground-truth line is below the threshold θ
is counted as a true positive, while a predicted line whose mean
endpoint distance exceeds θ is counted as a false positive.
Figure 13. An example of evaluating BIM feature lines. The dots
and lines in the figure represent the vertices and lines of the BIM
features; the numbers are the distances (m) between the line vertices.
AP is defined as the area under the precision-recall curve and is
averaged over multiple thresholds θ. Specifically, we use ten
thresholds from 1.4 cm down to 0.5 cm in steps of 0.1 cm.
The proposed metric measures the spatial consistency of the
predicted and ground-truth line frameworks. If an algorithm fails
to localize the endpoints or to capture the correct line direction,
the number of true positives will be limited under a strict
threshold θ and the AP will be small.
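The sketch below illustrates the coincidence test and a simple AP computation over the ten thresholds. It greedily matches each predicted line (assumed to carry a confidence score) to at most one unmatched ground-truth line and integrates the precision-recall curve numerically, so it approximates rather than reproduces the official COCO-style evaluation code.

```python
import numpy as np


def line_distance(gt, pred):
    """Mean endpoint distance between two lines given as (2, 3) endpoint arrays.

    Both endpoint orderings are tried, since line endpoints have no canonical order.
    """
    d1 = 0.5 * (np.linalg.norm(gt[0] - pred[0]) + np.linalg.norm(gt[1] - pred[1]))
    d2 = 0.5 * (np.linalg.norm(gt[0] - pred[1]) + np.linalg.norm(gt[1] - pred[0]))
    return min(d1, d2)


def average_precision(gt_lines, pred_lines, scores,
                      thresholds=np.arange(0.005, 0.0145, 0.001)):
    """Approximate AP of predicted BIM lines, averaged over distance thresholds (m)."""
    order = np.argsort(scores)[::-1]                # rank predictions by confidence
    aps = []
    for th in thresholds:
        matched, tp = set(), np.zeros(len(pred_lines))
        for rank, j in enumerate(order):
            best, best_i = np.inf, None
            for i, gt in enumerate(gt_lines):
                d = line_distance(np.asarray(gt), np.asarray(pred_lines[j]))
                if d < best and i not in matched:
                    best, best_i = d, i
            if best_i is not None and best < th:    # coincident: true positive
                matched.add(best_i)
                tp[rank] = 1
        cum_tp = np.cumsum(tp)
        recall = cum_tp / max(len(gt_lines), 1)
        precision = cum_tp / (np.arange(len(pred_lines)) + 1)
        aps.append(np.trapz(precision, recall))     # area under the PR curve
    return float(np.mean(aps))
```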
4.3 Indoor positioning:
The approach to evaluating indoor positioning is similar to the
translation error evaluation proposed by Geiger et al. (2012).
Our evaluation first locates, for each pose timestamp in the
ground-truth file, the corresponding pose in the submitted
trajectory, and then computes the average translation error over
all possible sub-sequences of given lengths (5, 10, 25, 50 meters):

$E_{trans} = \frac{1}{N} \sum_{(i,j)} \left\| \hat{T}_{ij} \ominus T_{ij} \right\|$

where N is the number of relative sub-sequences, ⊖ is the inverse
of the standard motion composition operator, T_ij is the relative
transformation from pose j to pose i in the submitted trajectory,
$\hat{T}_{ij}$ is the corresponding reference relative transformation,
and ‖·‖ measures the translation component.
The indoor positioning dataset provides two data sequences with
ground truth for evaluation. Each ground-truth trajectory file (in
TXT format) contains an N×9 table, where N is the number of
frames in the sequence. Each row of the file has the format

frame_id p_x p_y p_z q_x q_y q_z q_w timestamp

Here, frame_id is the index of the LiDAR frame associated with the
current pose, p_x, p_y, and p_z are the translation components of
the current pose, and
q_x, q_y, q_z, and q_w are the quaternion representations of the
rotation component of the current pose.
The dataset also provides three data sequences for which results
are submitted. Each line of the submitted trajectory file has the format

frame_id p_x p_y p_z timestamp (UTC time in seconds)
.
The evaluation table will rank methods according to the average
of translation errors, where errors are measured in percent.
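A hedged sketch of this evaluation is given below. It assumes the submitted and ground-truth positions have already been associated by timestamp into two (N, 3) arrays, ignores the rotation component, and reports the end-point drift of each fixed-length sub-sequence as a percentage of the travelled distance, in the spirit of the KITTI metric.

```python
import numpy as np


def translation_error_percent(gt_xyz, est_xyz, lengths=(5.0, 10.0, 25.0, 50.0)):
    """Average relative translation error (%) over sub-sequences of given path lengths.

    gt_xyz, est_xyz : (N, 3) positions sampled at the same, already associated timestamps
    """
    seg = np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)
    travelled = np.concatenate([[0.0], np.cumsum(seg)])        # cumulative path length
    errors = []
    for L in lengths:
        for i in range(len(gt_xyz)):
            ahead = np.where(travelled >= travelled[i] + L)[0]  # first pose L metres ahead
            if len(ahead) == 0:
                break
            j = ahead[0]
            gt_rel = gt_xyz[j] - gt_xyz[i]                      # reference relative motion
            est_rel = est_xyz[j] - est_xyz[i]                   # estimated relative motion
            errors.append(np.linalg.norm(est_rel - gt_rel) / L * 100.0)
    return float(np.mean(errors)) if errors else float("nan")
```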
4.4 Examples of dataset
Figure 14 shows some examples from the dataset. Figure 14(a)
shows one frame of the Velodyne VLP-16 LiDAR data; the colors
represent the intensity of each point, with brighter colors indicating
stronger intensity. Figure 14(b) shows the high-accuracy data
from the Riegl VZ-1000, which are used as the indoor LiDAR
SLAM ground truth. Figures 14(c) and (d) show two examples from
the BIM benchmark, and Figures 14(e) and (f) show two examples
from the indoor positioning benchmark; the blue dots are
trajectories generated by the LiDAR-based SLAM method, and the
yellow dots are trajectories generated from the smartphone sensor data.
5. CONCLUSION
This paper presents the design of the benchmark dataset on
multisensory indoor mapping and positioning (MiMAP). Each scene
in the dataset contains point clouds from the multi-beam laser
scanners, images from the fisheye cameras, and records
from the attached smartphone sensors. The benchmark dataset
can be used to evaluate algorithms for: (1) SLAM-based indoor
point cloud generation; (2) automated BIM feature extraction
from point clouds; and (3) low-cost multisensory indoor
positioning, focusing on smartphone platform solutions.
Figure 14. (a) A single frame from the LiDAR stream. (b) An
indoor view of the Riegl VZ-1000 data. (c) BIM structure model of
a circular corridor. (d) BIM structure model with its point cloud.
(e) An example of the indoor positioning benchmark. (f) The
ground-truth trajectory with the corresponding point cloud.
6. ACKNOWLEDGEMENTS
This work was supported by the ISPRS Scientific Initiatives 2019
project "ISPRS BENCHMARK ON MULTISENSORY INDOOR
MAPPING AND POSITIONING" and by National Natural Science
Foundation of China projects U1605254 and 61771413.
7. REFERENCES
Armeni, I., Sener, O., Zamir, A. R., Jiang, H., Brilakis, I., Fischer,
M., & Savarese, S., 2016. 3d semantic parsing of large-scale
indoor spaces. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (pp. 1534-1543).
Geiger, A., Lenz, P., Urtasun, R., 2012. Are we ready for
autonomous driving? the kitti vision benchmark suite. In 2012
IEEE Conference on Computer Vision and Pattern
Recognition (pp. 3354-3361).
Girardeau-Montaut, D., 2011. 3D point cloud and mesh
processing software. Open Source Project. cloudcompare.org
Gong, Z., Wen, C., Wang, C., Li, J., 2018. A target-free
automatic self-calibration approach for multibeam laser
scanners. IEEE Trans. Instrum. Meas, 67(1), 238-240.
ISPRS Commission I WG I/6, ISPRS Benchmark on
Multisensorial Indoor Mapping and Positioning, 2020.
http://www2.isprs.org/commissions/comm1/wg6/isprs-
benchmark-on-multisensory-indoor-mapping-and-
positioning.html
Kümmerle, R., Steder, B., Dornhege, C., Ruhnke, M., Grisetti,
G., Stachniss, C., Kleiner, A., 2009. On measuring the accuracy
of SLAM algorithms. Auton. Robots, 27(4), 387.
Lin, T. Y., Maire, M., et al., 2014. Microsoft COCO: common
objects in context. In European Conference on Computer
Vision (pp. 740-755). Springer, Cham.
Lehtola, V. V., Kaartinen, H., Nüchter, A., Kaijaluoto, R.,
Kukko, A., Litkey, P., Honkavaara, E., Rosnell, T., Vaaja, M. T.,
Virtanen, J.-P., Kurkela, M., El Issaoui, A., Zhu, L., Jaakkola, A.,
Hyyppä, J., 2017. Comparison of the selected state-of-the-art
3D indoor scanning and point cloud generation methods. Remote
Sens., 9(8), 796.
Scaramuzza, D., Martinelli, A., Siegwart, R., 2006. A toolbox for
easily calibrating omnidirectional cameras. In 2006 IEEE/RSJ
International Conference on Intelligent Robots and Systems (pp.
5695-5701).
Wen, C., Pan, S., Wang, C., Li, J., 2016. An indoor backpack
system for 2-D and 3-D mapping of building interiors. IEEE
Geosci. Rem. Sens. Lett., 13(7), 992-996.
Wen, C., Dai, Y., Xia, Y., Lian, Y., Tan, J., Wang, C., Li, J., 2019.
Toward efficient 3-D colored mapping in GPS/GNSS-denied
environments. IEEE Geosci. Rem. Sens. Lett., 17(1), 147-151.
Wang, C., Hou, S., Wen, C., Gong, Z., Li, Q., Sun, X., Li, J.,
2018. Semantic line framework-based indoor building modeling
using backpacked laser scanning point cloud. ISPRS J.
Photogramm. Rem. Sens., 143, 150-166.
Zhang, J., Singh, S., 2014. LOAM: Lidar odometry and mapping
in real-time. In Robotics: Science and Systems (Vol. 2, p. 9).