Automatic Label Injection Into Local
Infrastructure LiDAR Point Cloud for
Training Data Set Generation
ZSOLT VINCZE1, ANDRÁS RÖVID2, VIKTOR TIHANYI3
1Budapest University of Technology and Economics, Faculty of Transportation Engineering and Vehicle Engineering, Department of Automotive Technologies,
1111 Budapest, Stoczek str. 6. J. building fifth floor (vincze.zsolt@kjk.bme.hu)
2Budapest University of Technology and Economics, Faculty of Transportation Engineering and Vehicle Engineering, Department of Automotive Technologies,
1111 Budapest, Stoczek str. 6. J. building fifth floor (rovid.andras@kjk.bme.hu)
3Budapest University of Technology and Economics, Faculty of Transportation Engineering and Vehicle Engineering, Department of Automotive Technologies,
1111 Budapest, Stoczek str. 6. J. building fifth floor (tihanyi.viktor@kjk.bme.hu)
Corresponding author: Zsolt Vincze (e-mail: vincze.zsolt@kjk.bme.hu).
ABSTRACT The representation of objects in LiDAR point clouds changes as the mounting height of the sensor devices increases. Most of the available open datasets for training machine learning based object detectors are generated with vehicle top mounted sensors, thus the detectors trained on such datasets perform worse when the sensor is observing the scene from a significantly higher viewpoint (e.g. an infrastructure sensor). In this paper a novel Automatic Label Injection method is proposed to label the objects in the point cloud of a high-mounted infrastructure LiDAR sensor based on the output of a well performing "trainer" detector deployed at optimal height, while considering the uncertainties caused by various factors described in detail throughout the paper. The proposed automatic labeling approach has been validated on a small scale sensor setup in a real-world traffic scenario where accurate differential GNSS reference data were also available for each test vehicle. Furthermore, the concept of a distributed multi-sensor system covering a larger area, aimed at automatic dataset generation, is also presented. It is shown that a machine learning based detector trained on a differential GNSS-based training dataset performs very similarly to the detector retrained on a dataset generated by the proposed Automatic Label Injection technique. According to our results, a significant increase in the maximum detection range can be achieved by retraining the detector on viewpoint specific data generated fully automatically by the proposed label injection technique, compared to a detector trained on vehicle top mounted sensor data.
INDEX TERMS object label, automatic label injection, roadside infrastructure sensors, training dataset
I. INTRODUCTION
A rapidly evolving area of autonomous driving is infrastructure aided traffic. In this case, the ego vehicle makes decisions not only based on its own sensors but also on information from fixed-mounted infrastructure sensor stations. This high level information can represent objects that are too far away from the ego vehicle to be detectable but still have relevance, or objects which are occluded by another vehicle or building, for instance. Multiple perception stations deployed in the infrastructure may cover a significantly larger area than vehicle mounted sensors. Moreover, a large number of infrastructure sensors may provide more reliable detections when fused together. Nowadays, neural network-based detectors perform the recognition of traffic participants such as cars, pedestrians and cyclists [1]. These networks need to be trained prior to operation to develop their recognition capability.
A. PROBLEM STATEMENT
The available open datasets such as KITTI [2], NuScenes
[3], Waymo [4], Level 5 [5], Argo [6], PandaSet [7], Cana-
dian Adverse Driving Conditions Dataset [8], Ford Campus
LiDAR dataset [9], Sydney Urban Objects [10], Stanford
Track Collection [11] contain data acquired by various type
of sensors one of which is the LiDAR (Light Detection And
Ranging). These sensors are mounted on top of different
types of passenger cars. At these mounting heights, the rep-
resentation of objects in the point cloud differs significantly
from the case when the sensor is deployed in the infrastructure (the mounting height of infrastructure sensors might be several meters). LiDAR-based object detector networks trained on these open datasets perform poorly on point clouds acquired by infrastructure LiDAR sensors.
B. RELATED WORKS
In [12] the different representation of objects in point clouds
acquired by different types of LiDAR sensors is handled by
training a neural network to increase the density of points
along the surface of objects. A recently presented unique infrastructure data set [13], on the other hand, enables the training of machine learning-based object detectors to detect objects from a specific (higher) viewpoint. Creating sensor and mounting position specific training datasets for infrastructure LiDAR sensors manually requires significant time and effort; moreover, in case of manual labeling there might be considerable variance in the quality of labels. One solution to overcome this problem is to generate the training dataset in a
simulated environment, such as described in [14]. However,
simulation based datasets lack all the real-life noise, which
can reduce the performance of the trained detector on real-
life data.
Another way to find objects of interest in point clouds
yielded by LiDAR sensors is to filter out the background,
thus the remaining points will represent the searched objects regardless of their representation in the point cloud. In [15] the
authors aggregated consecutive point cloud frames, then with
map cleaning methods they separated the estimated static
map from the moving objects. To generate static background
with vehicle mounted LiDAR sensors authors of [16] made
recordings at the same location on several different occasions,
building upon the concept that mobile objects are unlikely to
stay in the same place in every recording session. In case
of infrastructure mounted sensors the background remains
the same over time due to the fix installation, however
the shaking of the mounting structure must be considered.
Therefore, in [17] the authors voxelised the point cloud, then
modeled the average height and the number of points for each
voxel as Gaussian distribution in order to filter out the static
background.
Because the representation of objects in the point cloud depends on the channel number, the beam configuration and the mounting height of the given LiDAR sensor, the corresponding detector network requires retraining if the used LiDAR sensor has different features than the one which recorded the dataset the detector network was originally trained on. In case of background segmentation, a good estimation of the bounding boxes of objects can only be made when the objects are represented in the point cloud with a larger number of points. This criterion limits the maximum
detection distance of the trained detector network, because
with the increase of the distance between the object and the
LiDAR sensor, the amount of points representing the object
reduces. For example, in [17] it means that the maximum
detection distance of the trained detector network was set to
66.32 m.
C. CONTRIBUTION
The main contribution of this work is an automatic label in-
jection method, which realizes the automatic transformation
of object labels (provided by an object detector with good
performance which processes the point cloud frames of a
temporarily deployed LiDAR sensor) to the recorded point
clouds acquired by a high mounted infrastructure LiDAR
sensor by considering the time synchronisation and sensor
calibration related errors. Here we would like to emphasize
the problems caused by the independent operation of the
LiDAR sensors, namely their laser beams are not synchro-
nised, thus moving objects are scanned at different moments, which (if not handled properly) may have a significant impact
on the accuracy of the injected labels. The proposed system
was validated under real conditions on real-world data.
A further contribution is a conceptual distributed multi-
sensor based labeling system, which relies on the proposed
Automatic Label Injection method to create a training dataset
being tailored for the specific infrastructure sensors deployed
at elevated positions into the given traffic environment. With
the help of such tailored datasets, the performance of the de-
tectors might be optimised for these specific LiDAR sensors
and their specific locations. This system can be adopted by
intelligent traffic systems [18], which have the capability of
sharing perception and object level information.
With the help of our proposed Automatic Label Injec-
tion method, LiDAR sensor and sensor placement specific
training datasets can be generated automatically. This means that for an arbitrary LiDAR device (independently of its laser beam distribution, sensitivity and other sensor specific factors) and its hosting environment (where the sensor is deployed), a tailored training dataset suited for that particular setup can be generated. Furthermore, for the automated dataset generation, there is no need for any sensors (such as for instance a GNSS device, LiDAR, etc.) to be present in the vehicles. In addition, the trained detector network which processes the point clouds from the infrastructure sensor can estimate appropriate bounding boxes from fewer object representing points, thus it has a greater detection range than the detectors which were trained on data obtained by background segmentation techniques.
D. STRUCTURAL OVERVIEW
This article consists of the following sections: II-A gives the
detailed description of the proposed automatic label injec-
tion procedure, II-B introduces the concept of the training
dataset generator system, II-C shows how the detectability
of objects can be increased by training on a custom dataset
which is tailored for the mounting position of the sensor.
Section III-A contains the description of the measurement
setup. The Intersection Over Union (IOU) metric of GNSS-
based ground truth and the labels given by the detector
(which processes the point clouds of the Trainer station) is
described in section III-B. Section III-C is devoted to the
description of the IOU metric in case of the GNSS-based
ground truth and the injected labels. Section III-D reports
performance metrics of the detectors trained on the following
datasets independently: KITTI, GNSS based, the proposed
label injection-based. Section IV is devoted to the evaluation
of the measurement results reported in III-C and III-D and
the discussion of the limitations of the proposed method.
II. METHODOLOGY
A. AUTOMATIC LABEL INJECTION
In order to create a training dataset for a neural network
which is aimed for detecting objects in LiDAR point clouds,
each frame must contain a point cloud and labels for the
objects represented in it. The problem is how to generate the labels for the objects while avoiding the huge amount of human effort. The proposed automatic label injection method relies
on the detection capabilities of a neural network trained on
open datasets. This detector network processes the point
clouds of a LiDAR sensor, which is deployed at similar
height as vehicle mounted LiDARs, thus the neural network
can operate with optimal performance. Let us call this low-mounted LiDAR unit and the detector network together the TRAINER SENSOR. On the other hand, the TRAINED
SENSOR is a fix mounted infrastructure sensor deployed at
an increased height above the road surface in order to observe
the traffic from an elevated position. From such a view-
point, the point cloud representation of traffic participants changes, which leads to degraded detection performance.
The detector of the Trained sensor needs to be trained on
a custom dataset in order for the detector to adapt to the
altered patterns of object representation caused by the higher
viewpoint of the LiDAR. The point cloud of each frame of
the dataset is recorded by the Trained sensor. The labels
for all the objects in the saved point cloud are provided by
the Trainer sensor itself. The labels are processed and trans-
formed from the local coordinate system of the Trainer sensor
into the local coordinate system of the Trained one. Fig. 1
shows the Trainer and the Trained sensors during operation,
and Fig. 6 gives an overview of the Automatic Label Injection
process. In the following, the proposed method is described in detail.
FIGURE 1. The Trainer and Trained Sensors during operation.
First, the point cloud recorded by the Trainer sensor goes through a two stage preparation process. Let us denote the $k$th frame acquired by the sensor by $F_k$. When the sensor starts recording the frame $F_k$, it starts the measurement from a certain yaw angle $\omega$, and sweeps the environment with laser beams in clockwise direction. The whole 360° horizontal field of view of the sensor is divided into $N_{seg}$ segments, where $N_{seg} \in \{512, 1024, 2048\}$ depending on the actual configuration of the LiDAR sensor. Let us denote these segments by $S_q$, where $q = 1..N_{seg}$. The LiDAR sensor records the time $t_{S_q}$ when the scanning of the segment $S_q$ was finished. From the scanning results of all segments a single point cloud is composed and the timestamp of the last segment $S_{N_{seg}}$ is assigned as the timestamp $t_{F_k}$ of the whole point cloud frame. Let us define $f_S: \mathbb{N} \times \mathbb{N} \to \mathbb{N}$ which maps the point index of point $p_{i,k}$ to the index of its corresponding segment. During the first preparation stage, each point in the point cloud frame, on top of the $x, y, z$ parameters, gets the timestamp $t_{S_q}$ of the corresponding segment. Therefore each point of the $k$th frame is represented as follows:

$$p_{i,k} = [x_{i,k},\ y_{i,k},\ z_{i,k},\ t_{S_q}], \qquad (1)$$

where $q = f_S(i, k)$.

In the second stage, the whole point cloud is rotated to a pose which is considered to be optimal for the object detector network. Let us denote this rotation by $R_{det}$.
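To make the two preparation stages concrete, the following minimal numpy sketch (not the authors' code) attaches the segment scan time to every point and applies $R_{det}$. It assumes the driver delivers the points ordered by segment with one timestamp per segment; all names are illustrative.

```python
import numpy as np

def prepare_frame(points, seg_timestamps, n_channels, R_det):
    """points:         (N_seg * n_channels, 3) x, y, z values, ordered segment by segment
    seg_timestamps: (N_seg,) scan-finish times t_{S_q} of the azimuth segments
    R_det:          (3, 3) rotation bringing the cloud into the detector-friendly pose
    Returns an (N, 4) array with columns x, y, z, t_{S_q} (Eq. (1)), already rotated."""
    # Stage 1: every point inherits the timestamp of its segment.
    t_per_point = np.repeat(seg_timestamps, n_channels)
    # Stage 2: rotate the whole cloud by R_det.
    rotated = points @ R_det.T
    return np.column_stack([rotated, t_per_point])
```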
As the next step, the point cloud is processed by the PointPillars [19] detector network which was trained on the KITTI open dataset. For the object detection phase, any well performing detector network can be utilised instead of the one considered in this paper. We have selected the PointPillars network because it performs well on the KITTI benchmark and has the ability to process point clouds in real-time (16 ms). In order to reach our future goal, namely the online operation of the proposed method, high performance (fast and accurate) object detectors have to be considered.

When processed by the PointPillars detector network, the point cloud is rasterized along the x-y plane with a preset resolution and divided into so called pillars. Each point in the point cloud is assigned to one of these pillars, and then the so called pseudo image is created which is processed by the "Backbone" and the "Detection Head" convolutional neural networks [19]. The general architecture of the PointPillars detector network is shown in Fig. 2.
FIGURE 2. The general architecture of PointPillars network [19].
The detector provides the labels $l^{(k)}_1, l^{(k)}_2, \ldots, l^{(k)}_{N_{obj}}$ for each object in the current frame, where $N_{obj}$ denotes the
TABLE 1. Pseudo code for the Trainer Sensor's process
READ point cloud frame
RESHAPE point cloud into (segments, channels, point features)
CONCATENATE segment timestamps to points
ROTATE point cloud
DETECT objects with Point Pillars detector
FOR each object label:
DETERMINE bounding box corners
FOR each point in point cloud frame:
FOR each object bounding box:
IF bounding box contains point:
ASSIGN point to object
FOR each object:
CALCULATE mean timestamp
ASSIGN mean time stamp to object as object scan time
FOR each bounding box corner:
TRANSFORM coordinates from Trainer into Trained local system
APPEND object scan time to bounding box corner list
SEND bounding box list to Trained sensor
number of the recognised objects in the $k$th frame $F_k$. A label $l^{(k)}_j$ contains the coordinates of the bottom center point (expressed wrt. the local coordinate system of the Trainer sensor, see Fig. 8 in Section III-A), the width, length and yaw of the bounding box of the detected object. Based on this information the eight corners of the bounding box are determined for each label $l^{(k)}_j$. Every point $p_{i,k}$ of the $k$th frame is then evaluated whether or not it is encapsulated by one of the bounding boxes and if it is, it is assigned to the corresponding object. Every point $p_{i,k}$ holds the time stamp $t_{S_q}$, where $q = f_S(i, k)$, which is the time when the corresponding segment was measured. Let us define the following membership function:

$$\mu(i, j, k) = \begin{cases} 1, & \text{if } p_{i,k} \in j\text{th bounding box} \\ 0, & \text{otherwise} \end{cases} \qquad (2)$$

Let $t^{(j)}_{obj}$ denote the timestamp assigned to the $j$th object. $t^{(j)}_{obj}$ is determined as the mean of the timestamps of all the points which have been assigned to the $j$th object, i.e.:

$$t^{(j)}_{obj} = \frac{\sum_{i=1}^{N_{pts}} \mu(i, j, k)\, t_{S_{f_S(i,k)}}}{\sum_{i=1}^{N_{pts}} \mu(i, j, k)}, \qquad (3)$$

where $N_{pts}$ represents the number of points of the $k$th frame. The time stamps are written in Unix Epoch Clock format. Fig. 3 shows the segmented horizontal field of view of the Trainer Sensor.

By using the estimated rotations $R_{UTM \to Tr}$, $R_{UTM \to Td}$ and translations $t_{UTM \to Tr}$, $t_{UTM \to Td}$ from the UTM to the local coordinate system (see Section III-A) of both devices and the saved rotation $R_{det}$ (see above in this section) from the preparation step, the corner points of the bounding boxes are transformed from the local coordinate system of the Trainer sensor into the local coordinate system of the Trained sensor. Additionally, each transformed bounding box $l^{(k)}_j$ is augmented with the time $t^{(j)}_{obj}$. The pseudo code for the above described process is listed in Table 1.
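A minimal sketch of Eqs. (2)-(3), assuming a helper has already produced, for each detected label, the boolean membership mask $\mu(\cdot, j, k)$ over the prepared points (the oriented-box containment test itself is omitted); names are illustrative and not the authors' implementation.

```python
import numpy as np

def object_scan_times(prepared_points, box_masks):
    """prepared_points: (N_pts, 4) points of frame F_k, last column is t_{S_q}
    box_masks:       list over objects j of (N_pts,) boolean arrays, mask[i] == mu(i, j, k)
    Returns t_obj^(j): the mean segment timestamp of the points inside the j-th box (Eq. (3))."""
    times = prepared_points[:, 3]
    return [float(times[mask].mean()) if mask.any() else None for mask in box_masks]
```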
FIGURE 3. The segmented Field Of View (FOV) of the LiDAR device.
The Trained Sensor has a temporary storage which operates in a First In First Out (FIFO) manner for the recorded point clouds $F_{FIFO} = \{F^{(0)}_{FIFO}, \ldots, F^{(N_{FIFO}-1)}_{FIFO}\}$, where $N_{FIFO}$ denotes the maximum number of frames in the queue. When a new frame is recorded by the Trained Station, this queue is updated accordingly, thus its first element (the oldest frame) is dropped. In the Trained Station, each frame in addition to its timestamp is augmented by the timestamps $t'_{S_0}..t'_{S_{N_{seg}}}$ corresponding to the individual segments. We have set the size of this FIFO queue to contain 25 frames at most, i.e. $N_{FIFO} = 25$. When a new set of labels $L'_k = \{l^{(k)}_1, l^{(k)}_2, \ldots, l^{(k)}_N\}$ with timestamp $t^{(j)}_{obj}$ arrives to the Trained Station, the frame having the closest time stamp is selected from the queue based on the segment time stamps of the frames.
The timestamp for the point cloud selection from the FIFO storage is determined as follows: For the first label $l^{(k)}_0$ from the given frame, the center point of the bottom side of the bounding box is calculated. It is sufficient because the goal is to find the frame measured by the Trained sensor being nearest in time to the scan time $t^{(j)}_{obj}$ of the objects yielded by the Trainer Station. The labels arriving from the Trainer Station fall in a narrow region of the perception field of the Trained Station, therefore the difference in the scan times $t^{(0,k)}_{obj}, t^{(1,k)}_{obj}, \ldots, t^{(N_k,k)}_{obj}$ of the detected objects is much less than the time difference between two consecutive frames ($t^{(j_1,k)}_{obj} - t^{(j_2,k)}_{obj} \ll t'_{F_k} - t'_{F_{k+1}}$, where $j_1, j_2 = 1..N_k$, $j_1 \neq j_2$). Let $v$ represent the vector which points from the origin of the local coordinate system of the LiDAR sensor to the center point of the bottom side of the bounding box of the 1st object (with index 0). Let us denote by $\alpha$ the angle between $v$ and the $x$ axis of the local coordinate system of the LiDAR sensor. The range of $\alpha$ is between 0° and 180° and it is symmetric to the $x$ axis, therefore the evaluation of the signs of the coordinates of the center point is also necessary to calculate $\alpha$ in the range between 0° and 360°.
TABLE 2. Pseudo code for the label-frame assignment process

READ label[0] from label set
CALCULATE bottom center point of bounding box from its corners
CALCULATE α
DETERMINE q_r
FOR each frame in FIFO:
APPEND the timestamp t^(m)_{S_{q_r}} corresponding to the q_r index to TS list
FIND nearest timestamp from TS list to t^(0,k)_obj
USE index m of the assigned timestamp from TS list to pair the frame with the labels
$\alpha$ determines in which segment $S'_q$ of the Trained Sensor the center point of the object was scanned. Let us denote this segment index by $q_r$. When $q_r$ is known, from all frames in the queue, the timestamp corresponding to the segment $S_{q_r}$ is taken and used to form a list of timestamps as follows: $t'_{S_{q_r}\_list} = \{t^{(0)}_{S_{q_r}}, \ldots, t^{(N_{FIFO}-1)}_{S_{q_r}}\}$. The timestamps in the composed list have the same order as the frames in the queue. Then the timestamp from the list $t'_{S_{q_r}\_list}$ is selected which is nearest in time to the scan time of the 1st object, $t^{(0,k)}_{obj}$. It is now obvious that the list index of the selected timestamp $t^{(m)}_{S_{q_r}}$, where $m = 1..N_{FIFO}-1$, is the same as the index of the corresponding frame $F^{(m)}_{FIFO}$, which gets paired with the label set $L'_k$. The pseudo code for the above explained process is described in Table 2.
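The label-to-frame assignment of Table 2 can be summarised by the following sketch. It assumes the segment timestamps of the queued frames are stored as a 2-D array and that segment index 0 starts at azimuth 0° with indices increasing with $\alpha$; the actual mapping depends on the sensor configuration (start angle $\omega$, sweep direction), so the index arithmetic here is illustrative only.

```python
import numpy as np

def assign_frame(fifo_seg_times, bottom_center_xy, t_obj_0, n_seg=2048):
    """fifo_seg_times:   (N_FIFO, n_seg) segment timestamps t'_{S_q} of the queued frames
    bottom_center_xy: (x, y) of the bottom-side centre of the first label l_0^(k)
    t_obj_0:          scan time t_obj^(0,k) reported by the Trainer Sensor
    Returns the queue index m of the frame nearest in time."""
    x, y = bottom_center_xy
    alpha = np.degrees(np.arctan2(y, x)) % 360.0      # angle in [0, 360)
    q_r = int(alpha / 360.0 * n_seg) % n_seg          # segment S'_{q_r} containing the object
    ts_list = fifo_seg_times[:, q_r]                  # one timestamp per queued frame
    return int(np.argmin(np.abs(ts_list - t_obj_0)))
```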
FIGURE 4. The relation of the laser beam and the object's orientation vector.
Although the labels $L'_k$ and the associated frame $F'$ of the Trained Sensor are close in time, there may still be some time
Trained Sensor are close in time, there may still be some time
difference, since the LiDAR devices operate independently,
i.e. their scanning beams are not aligned. Therefore, a given
object in the point cloud and the corresponding label may not
fit exactly. To reduce such inaccuracies the pose of the bound-
ing box is refined by a box fitting method described later in
this section. According to the time difference between the
labels and the point cloud the labels might be ahead or behind
in time wrt. the point cloud. Due to the time difference be-
tween the labels and the point cloud some of the object points
may fall outside the bounding box. In order to consider all the
points belonging to the object an increased point searching
region is considered which is based on the original bounding
box but has increased length and width. Such extended point
searching region is used during the mentioned box fitting
process. The length increase value is determined from the
distance which the object can travel during the maximum
time difference between the label and the point cloud. The
speed limit of Hungarian highway roads is 130 km/h. The
LiDAR sensors are operating with 20 frame/s configuration
therefore, the maximum time difference between the closest
label and the point cloud is 25 ms. With 130 km/h velocity,
the movement during that time is 0.9 m, so the bounding
box length is increased with this length in both forward and
backward directions. The width increase, which in this case is 0.2 m for both sides, compensates the inaccuracy of the object's orientation estimation. After the search regions are determined, the enclosed points are collected for each region. To ensure that no road surface points are collected, the points with $z$ coordinate being below the bottom boundary of the searching region plus 0.2 m are filtered out.
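The quoted 0.9 m margin is simply the distance covered at the speed limit during the maximum label-to-frame time offset:

$$\Delta \ell = v_{max}\,\Delta t_{max} = \frac{130}{3.6}\ \mathrm{m/s} \times 0.025\ \mathrm{s} \approx 0.9\ \mathrm{m}.$$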
To decide which part of the vehicle is represented by the collected points, the orientation information and the position of the bottom front right corner ($C_5$) of the bounding box are used, as shown in Fig. 4. Let the vector $C_5 = (C_{5x}, C_{5y})$ represent the scanning laser beam. The scanned side of the target vehicle can be determined based on the object orientation wrt. the laser beam direction ($C_5$) as follows: First the orientation vector of the label (bounding box) is computed as $s = C_5 - C_1$, where $C_1$ is the bottom back right corner of the label. Then the unit vector $q$ (pointing in the same direction as $C_5$) and its normal $r$ are determined:

$$q = (q_1, q_2) = (C_{5x}/\|C_5\|,\ C_{5y}/\|C_5\|) \qquad (4)$$

$$r = (-q_2, q_1) \qquad (5)$$

The vector $s$ can be expressed as $s = a_1 q + a_2 r$, where $a$ represents the coordinates of $s$ in the new orthogonal basis with basis vectors $q$ and $r$. In matrix form: $s = Ba$. The coefficient vector $a$ can be obtained as $a = B^{-1} s$, where $B = [q\,|\,r]$ represents the matrix of basis vectors. The signs of the coefficients $a_1, a_2$ show which side of the vehicle was scanned (visible by the LiDAR). This can be determined by Table 3.
TABLE 3. Rules for label fitting side decision

a1   a2   Long side    Short side
+    +    Left side    Back
+    -    Right side   Back
-    -    Right side   Front
-    +    Left side    Front
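The side decision of Eqs. (4)-(5) and Table 3 amounts to a change of basis in 2-D. The sketch below (illustrative, not the authors' code) returns the signs of $a_1$ and $a_2$, which are then looked up in Table 3.

```python
import numpy as np

def visible_side_signs(C5, C1):
    """C5, C1: (x, y) of the bottom front-right and bottom back-right corners.
    Returns the signs of a1 and a2 used as the row key of Table 3."""
    C5 = np.asarray(C5, dtype=float)
    s = C5 - np.asarray(C1, dtype=float)   # orientation vector of the label
    q = C5 / np.linalg.norm(C5)            # unit vector along the laser beam, Eq. (4)
    r = np.array([-q[1], q[0]])            # its normal, Eq. (5)
    B = np.column_stack([q, r])            # B = [q | r]
    a1, a2 = np.linalg.solve(B, s)         # a = B^{-1} s
    return np.sign(a1), np.sign(a2)
```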
The visible sides of the object determine which pair
of perpendicular edges elong,eshort of the bottom side of
the corresponding bounding box will be used for fitting.
However, it is necessary to determine which object points
are going to be used as reference for fitting. Let us de-
note these points by $P_{s\_ref} = (P_{s\_ref,x}, P_{s\_ref,y})^T$ and $P_{l\_ref} = (P_{l\_ref,x}, P_{l\_ref,y})^T$, which represent the reference
points of fitting the short and long sides of the bounding box, respectively (see Fig. 5). The positions of the objects in the local coordinate system of the Trained Sensor and their orientations relative to the scanning laser beam must be considered, because the position and relative direction together with a rule base (see later in this section) are used to determine $P_{s\_ref}$ and $P_{l\_ref}$. The two axes of the Trained Sensor's local coordinate system divide the $x, y$ plane into four quarters. In Fig. 4 the quarters are numbered with blue labels. The quarter index is determined based on which quarter $C_5$ is located in. A rule set has been set up to select $P_{s\_ref}$ and $P_{l\_ref}$ among the object points based on their coordinate values, with the evaluation of the quarter index and the signs of $a_1$ and $a_2$. The rules are described in Table 4. For example, in Fig. 4, $P_{s\_ref}$ and $P_{l\_ref}$ are selected for the object being in the outer lane as follows: the quarter index is 1, and the signs of $a_1$ and $a_2$ are $-$ and $+$, respectively. According to the corresponding rule, the object point which has the least $x$ coordinate value is selected as $P_{s\_ref}$, and the object point which has the least $y$ coordinate value is selected as $P_{l\_ref}$.
TABLE 4. Rules for fitting point selection

Quarter index   a1   a2   Selection Rule for Short side   Selection Rule for Long side
1               +    +    min y                           max y
1               +    -    min x                           min y
1               -    -    min y                           max y
1               -    +    min x                           min y
2               +    +    min x                           max y
2               +    -    max y                           min y
2               -    -    min x                           max y
2               -    +    max y                           min y
3               +    +    max y                           min y
3               +    -    max x                           max y
3               -    -    max y                           min y
3               -    +    max x                           max y
4               +    +    max x                           min y
4               +    -    min y                           max y
4               -    -    max x                           min y
4               -    +    min y                           max y
Let us denote by $p_1 = (p_{1x}, p_{1y})^T$ and $p_2 = (p_{2x}, p_{2y})^T$ the nodes of $e_{short}$; $p_1$ is also the common node of $e_{short}$ and $e_{long}$. $p_1$ and $p_2$ determine the normal vectors $n = (n_x, n_y)^T$ and $m = (m_x, m_y)^T$ of the bottom short edge $e_{short}$ and the bottom long edge $e_{long}$ of the bounding box. The normal vectors are computed with the following expressions:

$$m_x = p_{2x} - p_{1x}, \qquad n_x = -m_y$$
$$m_y = p_{2y} - p_{1y}, \qquad n_y = m_x$$

The algorithm then calculates the equations of the lines $e_s$, $e_l$ which have the normal vectors $n$, $m$, are therefore parallel with the side lines, and pass through the corresponding fitting reference points. Finally the crossing point $P$ of the lines $e_s$ and $e_l$ is calculated, thus for its coordinates we get:

$$p_y = (n_x c_l - m_x c_s)/(n_x m_y - m_x n_y)$$
$$p_x = (c_s - n_y p_y)/n_x,$$

where $c_s$, $c_l$ are the offset coefficients of the lines $e_s$, $e_l$, respectively. This calculated point represents the common node of $e_{short}$ and $e_{long}$ of the fitted label. Subtracting the point $p_1$ from the calculated crossing point $p$ results in the translation vector $\overrightarrow{p_1 p}$, with which the original label can be fitted to the object points. Fig. 5 shows an example of the above described process.
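The fitting step reduces to intersecting two lines and shifting the label by the resulting offset. A minimal sketch of these formulas (degenerate cases such as $n_x = 0$ are not handled; names are illustrative):

```python
import numpy as np

def fitting_translation(p1, p2, P_s_ref, P_l_ref):
    """p1, p2:           2-D nodes of e_short (p1 is shared with e_long)
    P_s_ref, P_l_ref: 2-D reference object points for the short and long sides
    Returns the translation vector p1 -> p that fits the label to the object points."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    m = p2 - p1                        # m = (m_x, m_y)
    n = np.array([-m[1], m[0]])        # n = (-m_y, m_x)
    c_s = n @ np.asarray(P_s_ref)      # offset of line e_s (normal n) through P_s_ref
    c_l = m @ np.asarray(P_l_ref)      # offset of line e_l (normal m) through P_l_ref
    p_y = (n[0] * c_l - m[0] * c_s) / (n[0] * m[1] - m[0] * n[1])
    p_x = (c_s - n[1] * p_y) / n[0]
    return np.array([p_x, p_y]) - p1   # translation vector p1 -> p
```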
FIGURE 5. Side fitting example for one of the two objects visible in the point
cloud (left) and the transformed label, the extended search area and the fitted
labels represented with white, blue and green color (right).
B. CONCEPT OF TRAINING DATASET GENERATOR
SYSTEM
The proposed Automatic Label Injection method gives an
opportunity for training dataset generation without the need
of manual annotation. It can be used, first of all, in measurement systems consisting of multiple trainer sensors (covering a larger area) to provide fully automatic labeling
for point clouds recorded by the infrastructure mounted
LiDAR sensor. The trainer sensors can be temporarily de-
ployed units, being mounted at similar height as the device
used for the creation of the datasets the detector network
was trained on. Thus, the detectors (of the trainer sensors)
have appropriate detection performance on the recorded point
clouds. The conceptual design is illustrated by Fig. 7. The
theoretical concept of the system is described in the following
paragraph.
Since the performance of the detector degrades as the object distance to the Trainer sensor increases, for each sensor a range limit can be defined inside which the detector performance is considered acceptable for labeling. Let us call this area the Label Injection Region. In this concept, the
label with higher confidence value gets assigned to an object
which is currently within an overlapping Label Injection
Region. With this configuration double labeling of an object
can simply be avoided, otherwise an object level fusion
solution might be considered. The point cloud acquired by
FIGURE 6. The overview of the Automatic Label Injection method.
FIGURE 7. The overview of the dataset generator concept.
the Trained Sensor is saved by the proposed dataset generator
system if injected object labels from any Trainer Sensor have
been assigned to it. The Label Injection Region of the Trainer
Sensors is determined based on the results in Section III-B.
The aim is to define an area where the error rate of the
detector is low enough, therefore the rarely happening false
negative detections (absence of label) have negligible effect
on the training process of the Trained Sensor. Based on the
evaluation of the measurement results, the Label Injection
range of the Trainer Sensors was set to 35 m. The distance
between each Trainer Sensor must be defined according to
the Label Injection range and the current traffic environment.
For example, in a highway section, all traffic lanes must be
covered with at least one Label Injection Region in order to
avoid blind zones. Obviously, for a two-by-two lane arrange-
ment (when each carriageway consists of two traffic lanes),
the gap between the Trainer sensors can be larger than in
case of a three-by-three lane arrangement (where there are
3 allocated lanes for each traffic direction). For the highway
section which is displayed in Fig. 7, with a lane width of 3.75
m and with 35 m Label Injection range, the optimal distance
between Trainer Stations is 67 m. There are, however, some constraints regarding the proposed system. First, it requires a traffic environment where the deployment of Trainer Sensors is supported, i.e. there is enough space for their temporary deployment, and the devices can be protected against theft or abuse. Secondly, sparse traffic is more favorable, because the chance of one object being shadowed by another is reduced. The third constraint is the necessity to include multiple LiDAR devices as trainer sensors (to fully cover the area perceived by the Trained Sensor), which are costly nowadays; therefore, a significant financial effort is required for the installation of such a system.
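The 67 m spacing is consistent with a simple chord computation, assuming the Label Injection Region is a 35 m radius disc around a roadside Trainer Sensor and the farthest lane centre lies about 10 m laterally from it (this lateral offset is an assumption of the sketch, not a value stated above):

$$d = 2\sqrt{R^2 - y^2} = 2\sqrt{35^2 - 10^2}\ \mathrm{m} \approx 67\ \mathrm{m}.$$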
C. INCREASING THE DETECTION RANGE OF THE
TRAINED SENSOR
If high precision GNSS (Global Navigation Satellite System)
data is available for all the objects inside the perception range
of the Trained Sensor, the GNSS position, orientation, the
GNSS reference point configuration and vehicle dimension
data might be used as ground truth [20] and based on that,
labels can be injected into the point cloud of the Trained
Sensor. This ground truth can be used for two purposes: to
validate the described Automatic Label Injection method and
to generate a training set, similar to the one which would have
been produced by the dataset generator system. This section
describes the steps of increasing the object detection range
of the detector (PointPillars neural network) operating on the
data of a fix mounted infrastructure sensor. This goal was
achieved by creating a training dataset based on available
GNSS data. An infrastructure LiDAR sensor was deployed
on a highway section at the ZalaZONE Automotive Proving Ground. The detailed description of the measurement and
the test environment is given in Section III-A. Multiple test
vehicles were equipped with high precision GNSS devices
which logged the accurate position and orientation of the
test vehicles with 10 ms sampling interval. The installation
configuration (i.e. the position of the GNSS device inside
the vehicle wrt. a reference point) of the GNSS device was
given for all the vehicles as well as the dimensions of the test
vehicles. With the help of such data, labels can be assigned to
each object. The width, height and length properties for the
label is determined by the vehicle dimensions. The position
of the center point and the orientation of the label comes from
the GNSS data. The GNSS-based label set (ground truth) is
assigned to the point cloud being closest in time. Because the time difference between two GNSS-based label sets was 10 ms, the maximum time shift between the assigned ground truth and the point cloud is 5 ms. To reduce this inaccuracy, the test vehicles performed the test scenarios at reduced speed, which meant 50 km/h and 100 km/h velocity. The position information from the GNSS device is logged in the UTM (Universal Transverse Mercator) global coordinate system. The deployed infrastructure sensor was calibrated wrt. this global reference system, which resulted in the local-to-UTM
transformation. By using these extrinsics, the labels given in
the UTM system were transformed into the local coordinate
system of the infrastructure sensors. Next, the labels were
filtered by their position and the number of enclosed object
points. The position filter was set to eliminate all the labels
which were closer than 22 m or had a distance greater than
100 m. The lower threshold of the filter comes from the
features of the LiDAR device and the mounting height, which
creates a blind zone on the ground under 25 m from the origin
of the LiDAR. The test vehicles had an average height of
1.5 m, thus, point cloud points belonging to vehicles can be
observed starting from 22 m distance. The upper limit on the
other hand ensures that the labels encapsulate at least one
object point. In addition, one of the constraints of the training algorithm of the PointPillars detector is that the number of encapsulated object points should be at least five in order for the label to be taken into account during training. Based on the statistical evaluation of the injected labels in Section III-C, Gaussian white noise $\mathcal{N}(\mu, \Sigma)$ was added to the ground truth labels. This noise models the uncertainties in the position and heading in the overall label injection procedure.
A point cloud frame combined with the corresponding label
represents a frame in the training set. A training set was also
created for benchmark, based on the measurement records. In
this case, no noise has been added to the ground truth during
the label generation phase.
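A minimal sketch of the noise-addition step, assuming labels are stored as dictionaries and using the per-axis statistics of Table 7 as noise parameters; since Table 7 reports absolute errors, the random sign applied here is an assumption of the sketch, not necessarily the authors' exact noise model.

```python
import numpy as np

# Per-axis (mean, std) noise parameters taken from Table 7 (heading in radians).
NOISE_PARAMS = {"x": (0.2363, 0.1807), "y": (0.1311, 0.0975),
                "z": (0.3274, 0.0751), "yaw": (0.1133, 0.0683)}

def corrupt_label(label, rng=None):
    """label: dict with keys x, y, z, yaw (w, l, h stay untouched).
    Returns a copy with Gaussian noise added to position and heading."""
    rng = rng or np.random.default_rng()
    noisy = dict(label)
    for key, (mu, sigma) in NOISE_PARAMS.items():
        noisy[key] = label[key] + rng.choice([-1.0, 1.0]) * rng.normal(mu, sigma)
    return noisy
```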
During the test measurement, multiple test runs were per-
formed. The ground truth data have been paired with the
corresponding point cloud frames of the Trained station from
all of the test runs. A single test run was selected and reserved
for evaluation of the trained detector. The frames from this se-
lected test run obviously were not used during the generation
of the training set. After the frame collection was done the
obtained dataset (including the point clouds and the labels)
was converted into KITTI data format. In the KITTI data
format, each point in the LiDAR point cloud is represented
by x, y, z coordinates and an intensity value. The values are
encoded in float32 data type. To ensure compatibility with the
KITTI label format, the object class name, the position, the
yaw angle and the dimension of the bounding box had to be
computed for each label. The object class name for all labels
is set to "car". The 3-D dimension information stands for the
height, with and length of the object in meters. The location
gives the 3-D position of the object in meters. The Rotation_y
means the heading of the object in range between πand π.
During the KITTI format transformation, the frames of the
dataset are shuffled. This step breaks the consecutive frame
series with small position difference of objects, ensuring
better generalisation capability for the trained network.
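For illustration, a frame can be exported in this format roughly as follows; the dummy 2-D image fields and the coordinate convention of the location are assumptions of this sketch (KITTI defines the location in camera coordinates, and the exact handling depends on the training pipeline used).

```python
import numpy as np

def write_kitti_frame(points, labels, idx, velo_dir, label_dir):
    """points: (N, 4) float array of x, y, z, intensity
    labels: list of dicts with keys h, w, l, x, y, z, ry (metres / radians)"""
    # Point cloud: float32 binary, KITTI velodyne layout.
    points.astype(np.float32).tofile(f"{velo_dir}/{idx:06d}.bin")
    # Labels: one line per object, KITTI-style fields.
    with open(f"{label_dir}/{idx:06d}.txt", "w") as f:
        for lb in labels:
            f.write("Car 0.00 0 0.00 0 0 50 50 "                  # truncation, occlusion, alpha, 2-D bbox (dummies)
                    f"{lb['h']:.2f} {lb['w']:.2f} {lb['l']:.2f} "  # dimensions
                    f"{lb['x']:.2f} {lb['y']:.2f} {lb['z']:.2f} "  # location
                    f"{lb['ry']:.2f}\n")                           # rotation_y
```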
From the 14 recorded test runs, 13 were used to make the
training set. The data set has 1734 frames and over 2400 car
labels, which is considered a smaller training set. From this
set, 1213 frames were used for the training and the rest was
reserved for evaluation. The detection range of the network
was set from 0 m to 110 m in longitudinal direction and from
-39.68 m to 39.68 m in lateral direction. The height range was
set between -3 and -7 meters, because the mounting height
of the sensor was approximately 6 m above the road surface.
The size of the voxels is set to 0.16 m in both longitudinal
and lateral directions and their height is set to 4 m. The
training ran through 160 epochs (296,950 steps overall) and reached a classification loss of 0.0803 and a localisation loss of 0.144.
The detector network was also trained on the GNSS ground truth-based training set. The procedure and the configuration were the same as previously, which means 160 epochs and 296,950 steps. At the end of the training, the network reached a classification loss of 0.0555 and a localisation loss of 0.12.

The network trained on the ground truth and the network trained on the ground truth with added noise were compared to the network trained on the KITTI dataset. The results are presented in Section III-D.
III. RESULTS
A. MEASUREMENT SETUP
The test measurement for the proposed Automatic Label
Injection method was performed at the Motorway module
of ZalaZONE Automotive Proving Ground. This module is
a real 1.5 km long highway section with an overpass. The
bridge provides the possibility to set up the Trained Sensor
above the road surface at any chosen lateral position.
The trainer station consisted of an Ouster OS2 high range
LiDAR unit with 128 channels, a Flir BlackFly 2 MP camera
with GigE output, and a Cohda MK5 unit. The camera
was used to give visual feedback for the measurements but
the captured data was not utilised for the Automatic Label
Injection method. The Cohda MK5 unit provided the time
synchronisation service. The OS2 unit was mounted at vehi-
cle rooftop level above the ground. The station of the Trainer
Sensor was deployed at 50 m distance from the overpass (see
Fig. 9).
The Trained Sensor was an Ouster OS2 high range LiDAR
unit. In addition to this LiDAR, the station was equipped
with a Cohda MK5 unit. The sensors were mounted on the
overpass element of the Motorway module. For testing the
proposed Automatic Label Injection method, the output data
of the OS2 unit and the time synchronisation service of the
Cohda MK5 device was used from this sensor station. Fig. 11
shows the station of the Trainer Sensor (left) and the station
which holds the Trained Sensor (right) and the location of the
two stations can be seen in Fig. 9.
In order to perform the proposed label injection method,
the transformation between the two local coordinate systems
had to be determined. The local coordinate system of the
LiDAR sensors is defined according to Fig. 8. The transformation was determined by calibrating the sensors.
FIGURE 8. The local coordinate system of each sensor from top-view (left) and side-view (right). The origin of the local coordinate system is at the center of the bottom of the housing of the sensor. The X-axis is pointing forward, the Y-axis is pointing to the left and the Z-axis is pointing to the top of the housing of the sensor. The X, Y and Z axes are marked with $X_s$, $Y_s$, $Z_s$, respectively [21].
The UTM (Universal Transverse Mercator) system [22]
was selected as the reference global coordinate system, the
extrinsics of the two sensors have been estimated wrt. this
global reference. The transformation from the local coordinate system of the Trainer sensor ($TR_{local}$) to the UTM system and the transformation from the local coordinate system of the Trained sensor ($TD_{local}$) to UTM provided the transformation from $TR_{local}$ to $TD_{local}$. The following paragraphs describe how the calibration was performed.
First, several easily distinguishable points were marked
in the environment as calibration points. Then the precise
GPS coordinates of each point were measured with a high
precision GNSS surveyor. Fig. 9 shows the layout of the
points. To determine the location of each calibration point,
an indicator box (of size 1x1x1 m) was used to detect the
calibration points in LiDAR point clouds. The box was
placed at the location of calibration points in such a way that
one of its corners indicated the exact location of the given
calibration point. At each location, the corresponding point
cloud (containing also the indicator box) was recorded. Then
the local coordinates of the calibration points were extracted
from the point cloud in case of both sensors.
To enable the transformation from the local coordinate
system of a given sensor to the global UTM system, a six
degrees of freedom problem must be solved. This means that at least 3 non-collinear points are necessary to estimate the transformation. However, due to the measurement noise,
thirteen calibration points were marked and surveyed. This
generates an over-determined system of linear equations. The
two representations (local coordinate system of the sensor
and UTM) of the calibration points are normalised to have
zero mean and unit variance. To determine the rotation, a system of linear equations must be solved, resulting in a 3x3 matrix $Q$, which is not yet a rotation matrix. In order to get the closest rotation matrix $R$ to $Q$, the following problem has to be solved:

$$\min_R \|R - Q\|_F^2 \quad \text{subject to} \quad R^T R = I \qquad (6)$$

By taking the singular value decomposition of $Q$, i.e. $Q = USV^T$, $R$ can be obtained as $R = UV^T$ [23]. With $R$ and the center points of the point clouds, $centroid_A$ and $centroid_B$, the translation can be computed with the expression below:

$$translation = -R \cdot centroid_A + centroid_B \qquad (7)$$
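A compact numpy sketch of this rotation and translation estimation. It uses the standard SVD-based (Kabsch) formulation on the cross-covariance of the centred point sets rather than the intermediate least-squares matrix $Q$ of Eq. (6); both give the nearest rotation in the Frobenius sense. Names are illustrative.

```python
import numpy as np

def estimate_rigid_transform(A, B):
    """A: (N, 3) calibration points in the sensor's local frame
    B: (N, 3) the same points in UTM coordinates
    Returns R, t such that B is approximately R @ A + t (row-wise)."""
    centroid_A, centroid_B = A.mean(axis=0), B.mean(axis=0)
    H = (A - centroid_A).T @ (B - centroid_B)      # cross-covariance of centred sets
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:                       # guard against a reflection
        Vt[-1, :] *= -1
        R = Vt.T @ U.T
    t = centroid_B - R @ centroid_A                # cf. Eq. (7)
    return R, t
```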
Because of the measurement noise during the GNSS survey and the local position measurement, the obtained rotation matrix and the translation vector are only estimates of the real transformation, resulting in a rough calibration. Let us refer to the calibration of the Lidar-to-UTM extrinsics as UTM calibration. To obtain a more precise calibration, further refinement is needed. Since the HD map of the Motorway section is available, it can be used to refine the calibration as follows: the HD map and the LiDAR point cloud are registered by the ICP (Iterative Closest Point) method [24]. This method yielded acceptable results for our proposed Automatic Label Injection method. For each point in the source point cloud the closest point from the reference point cloud is assigned. Based on this assignment the transformation is estimated and applied to the source point cloud. These steps are repeated
until the two point clouds get aligned (i.e. the defined cost is
minimised). The whole calibration process can be followed
in Fig. 10. This process was performed for both the Trained
and the Trainer Sensors.
Let us denote the obtained sensor-to-UTM transformations as follows: $T_{TD \to UTM}$ and $T_{TR \to UTM}$. With the help of these transformation matrices the labels can be transformed from $TR_{local}$ to $TD_{local}$. The transformation is calculated with the following expression, where $T_{orig}$ and $C$ represent the transformation of the Trainer Sensor's original point cloud (acquired by the Trainer sensor) to such a pose which
FIGURE 9. Layout of the calibration points and the location of the stations.
FIGURE 10. The process of the calibration.
is optimal for the detector network and a corner point of a detected label, respectively:

$$C' = T_{TD \to UTM}^{-1}\, T_{TR \to UTM}\, T_{orig}^{-1}\, C^T \qquad (8)$$
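Assuming all transformations are expressed as 4x4 homogeneous matrices mapping local coordinates to UTM (and $T_{orig}$ as the homogeneous form of the detector-pose rotation), Eq. (8) chains them as in the following sketch; names are illustrative.

```python
import numpy as np

def inject_corner(C, T_td_utm, T_tr_utm, T_orig):
    """C: (3,) bounding-box corner in the (rotated) Trainer frame.
    Returns the corner expressed in the Trained Sensor's local frame, cf. Eq. (8)."""
    C_h = np.append(np.asarray(C, dtype=float), 1.0)                 # homogeneous coordinates
    M = np.linalg.inv(T_td_utm) @ T_tr_utm @ np.linalg.inv(T_orig)   # chained transform
    return (M @ C_h)[:3]
```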
Each data recorder framework at the Trainer and the Trained stations had its own system clock, which had to be synchronised in order to collect data with synchronised timestamps. The reference clock for the synchronisation was the high precision clock of the GPS system. Every second, a precise clock signal is transmitted by the GPS satellites. The Cohda MK5 devices used at the stations are capable of receiving those transmissions and adjusting their own system clock accordingly. The clock of the data recording system is kept synchronised with the clock of the corresponding Cohda MK5 device by using the Network Time Protocol [25].
B. TRAINER DETECTOR PERFORMANCE
The performance of the detector processing the data of the
Trainer Sensor was evaluated by relying on the IOU metric.
For a given frame, the predictions are paired with the ground
truth labels based on the center points of the bounding boxes.
Then the IOU score for a ground truth-prediction pair is
calculated. The IOU scores are calculated as follows:

$$OverallIOUScore = \frac{\sum_{i=1}^{n} \frac{\sum_{j=1}^{m_i} IOU_{ij}}{m_i}}{n}, \qquad (9)$$

where $m_i$, $i = 1..n$, and $n$ stand for the number of objects in the $i$th frame and the number of frames, respectively; furthermore, it is obvious that $0 \le OverallIOUScore \le 1$.
If there is no detection available for a given object, the corre-
sponding IOU is set to zero. Furthermore, the mean and stan-
dard deviation of the distances between the center points of
the bounding boxes (detected and the corresponding ground
truth) were calculated.

FIGURE 11. The Trainer Station (left) and the Trained Station (right).

In this case, the detector of the Trainer
Station was evaluated on a recording which was used during
the training dataset (for the detector corresponding to the
Trained Station) generation process. To determine the label
injection range of the Trainer Sensor, the OverallIOUScore for different distance intervals was computed, see Table 5.
Based on the measurement results, the label injection range
for the Trainer Sensor is set at 35 m due to the decreasing
tendency of the IOU score values after the 35-0 m range,
shown in Fig. 12.
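A small sketch of Eq. (9), assuming the per-frame IOU values (with zeros for missed objects) have already been collected:

```python
def overall_iou_score(per_frame_ious):
    """per_frame_ious: list over frames; per_frame_ious[i][j] is the IOU of the j-th
    ground-truth object of frame i with its paired prediction (0.0 if undetected)."""
    frame_means = [sum(frame) / len(frame) for frame in per_frame_ious if frame]
    return sum(frame_means) / len(frame_means) if frame_means else 0.0
```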
TABLE 5. Distance intervals and the corresponding overall IOU scores

Distance interval [m]   IOU score
10-0                    0.6260
15-0                    0.6517
20-0                    0.6473
25-0                    0.6691
30-0                    0.6689
35-0                    0.6722
40-0                    0.6530
45-0                    0.6486
50-0                    0.6011
55-0                    0.5876
60-0                    0.5458
65-0                    0.4989
70-0                    0.4725
75-0                    0.4323
C. LABEL VERIFICATION
The injected labels were evaluated just as the original labels
yielded by the Trainer sensor. The evaluation sequence was
the same as in III-B. The evaluation of the injected labels
gives information about the overall performance of the Au-
tomatic Label Injection method, especially when the IOU
results are compared with the IOU results of the detections of
the Trainer Sensor. The difference between the two results (see Table 6) reflects the effectiveness of the proposed label injection approach (described in II-A), which prevents significant accuracy losses caused by the time difference between the scans of the two sensors and the measurement system calibration error.

FIGURE 12. IOU scores of the Trainer Sensor's detector over various distance intervals.
TABLE 6. Injected labels and Trainer Sensor's labels evaluation results: overall IOU score, center point distance mean and standard deviation values.

                          IOU score   Mean value [m]   Std. deviation
Injected labels           0.6117      0.3196           0.1623
Trainer Sensor's labels   0.6722      0.3274           0.0751
Let $\mu_{INJ}$ and $\sigma_{INJ}$ denote the mean and standard deviation of the distance between the center points of the bounding boxes of the ground truth and the bounding boxes of the injected labels. Let $\mu_{TR}$ and $\sigma_{TR}$ denote the mean and standard deviation of the distance between the center points of the bounding boxes of the ground truth and the labels provided by the detector of the Trainer Station. The OverallIOUScore for the injected labels and for the labels provided by the detector of the Trainer Station, together with $\mu_{TR}$, $\mu_{INJ}$, $\sigma_{TR}$ and $\sigma_{INJ}$, can be followed in Table 6.
The center point distance between the injected labels and
the ground truth has been evaluated along each axis. Further-
more the absolute error of the heading was measured as well
(see Table 7).
TABLE 7. The mean and standard deviation of the distance between the center points of the injected labels and the center points of the ground truth labels.

              Mean value [m]   Std. deviation
x direction   0.2363           0.1807
y direction   0.1311           0.0975
z direction   0.3274           0.0751
heading       0.1133           0.0683
D. COMPARISON OF THE ORIGINAL AND THE
RETRAINED DETECTORS
One of the test runs was not involved during the training
dataset generation procedure. This recording serves as ref-
erence to evaluate the performance of three detectors: the
detector which was trained on the KITTI open dataset, the
detector which was trained on a generated training dataset
based on the ground truth and the detector which was trained
on a generated dataset corrupted by Gaussian white noise
with mean and std. deviation given by Table 7 (this noise
was added to the ground truth labels). The test run realises a
scenario where two cars are approaching the sensor stations
beside each other and a third one follows them in the outer
lane (in an L-shaped formation). Here, the ability of the
detectors to recognise objects from different distances was
tested. The maximum detection range was determined in case
of the mentioned three detector variants. Table 8 contains the
measurement results.
TABLE 8. Maximum detection ranges of the compared detector networks

Detector variant                      Maximum detection range [m]
Trained on KITTI                      53.22
Trained on GNSS-based training set    104.87
Trained on noise-added training set   104.87
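For illustration, the maximum detection range could be extracted from the evaluation data roughly as sketched below. The sketch assumes per-frame lists of ground truth and predicted boxes (hypothetical structures in sensor coordinates) and a simple nearest-center association with a distance gate; the actual matching criterion used for Table 8 may differ.

```python
import numpy as np

MATCH_GATE = 2.0  # maximum center distance [m] for counting a prediction as a detection (assumption)


def max_detection_range(frames):
    """frames: iterable of (gt_boxes, pred_boxes) pairs; each box starts with (x, y, z).
    Returns the largest ground truth range at which a matched prediction exists."""
    best = 0.0
    for gt_boxes, pred_boxes in frames:
        if len(gt_boxes) == 0 or len(pred_boxes) == 0:
            continue
        preds = np.asarray(pred_boxes, dtype=float)[:, :3]
        for gt in np.asarray(gt_boxes, dtype=float):
            d = np.linalg.norm(preds - gt[:3], axis=1)           # distances to every prediction
            if d.min() <= MATCH_GATE:                            # the object was detected
                best = max(best, float(np.linalg.norm(gt[:2])))  # horizontal range from the sensor
    return best
```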
In case of the original detector, the Overall IOU Score was calculated for the ranges 22–100 m and 22–50 m. The narrower range falls within the maximum detection range of the original detector. The Overall IOU Score for the 22–100 m range enables the comparison of the detection performance of the three detector variants, while the results obtained for the 22–50 m range show how the predicted labels overlap with the ground truth labels inside the detection range (see Table 8) of the original detector. For the detectors trained on the generated training datasets, a separate evaluation in the 22–50 m range is not necessary, because the 22–100 m range already falls into their detection range. The Overall IOU Scores of the detectors are given in Table 9.
TABLE 9. Overall IOU Score of the original and custom-trained detectors

Detector variant                      IOU at 22–50 m   IOU at 22–100 m
Trained on KITTI                      0.222            0.106
Trained on GNSS-based training set    -                0.442
Trained on noise-added training set   -                0.439
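The range-restricted scores above can be obtained, for example, by filtering the matched pairs by the ground truth distance before averaging. A small sketch with assumed inputs (per-pair IOU values and ground truth ranges) is given below.

```python
import numpy as np


def overall_iou_in_range(ious, gt_ranges, r_min, r_max):
    """ious: IOU of each matched prediction/ground truth pair.
    gt_ranges: distance of each ground truth object from the sensor [m].
    Returns the mean IOU of the pairs whose ground truth lies in [r_min, r_max]."""
    ious = np.asarray(ious, dtype=float)
    gt_ranges = np.asarray(gt_ranges, dtype=float)
    mask = (gt_ranges >= r_min) & (gt_ranges <= r_max)
    return float(ious[mask].mean()) if mask.any() else float("nan")


# e.g. overall_iou_in_range(ious, gt_ranges, 22.0, 50.0) and overall_iou_in_range(ious, gt_ranges, 22.0, 100.0)
```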
The third comparison method was the evaluation of the
distances between the center points of the ground truth and
the predicted boxes. This metric considers only those frames
where the ground truth and the corresponding predicted label
are present for an object. The mean distance and standard
deviation are calculated (see Table 10).
TABLE 10. Mean value and standard deviation of the distance between ground truth and prediction

Detector variant                      Mean value [m]   Std. deviation [m]
Trained on KITTI                      0.3287           0.0929
Trained on GNSS-based training set    0.2059           0.1364
Trained on noise-added training set   0.2538           0.1679
The distance between the center points of the predicted labels and the ground truth was evaluated along each axis. Furthermore, the heading difference was also examined. The detailed evaluation results for the three detector variants are reported in Table 11.
TABLE 11. The center point mean distance and standard deviation along each axis, and the mean and standard deviation of the heading difference, for the three detector variants.

Trained on KITTI
              Mean value [m]   Std. deviation
x direction   0.2249           0.1414
y direction   0.1264           0.0863
z direction   0.1577           0.0756
heading       3.954            3.159

Trained on GNSS-based training set
              Mean value [m]   Std. deviation
x direction   0.1047           0.1582
y direction   0.1339           0.0939
z direction   0.0799           0.0529
heading       6.4695           26.4724

Trained on training set corrupted by noise
              Mean value [m]   Std. deviation
x direction   0.1154           0.1570
y direction   0.1832           0.1175
z direction   0.0623           0.0580
heading       15.4679          44.6587
IV. DISCUSSION
FIGURE 13. Predicted labels in case of the original (left), the ground truth-based (center) and the noise-added (right) custom-trained detectors for objects at the same distance.
A. LABEL VERIFICATION
The results in Table 6 suggest that the labels yielded by the Trainer Sensor suffered only a slight accuracy loss during the label injection process. However, σ_INJ (see III-C) is increased compared to σ_TR, showing that the distance between the center points of the injected label and the corresponding ground truth label may be larger than in the case of the labels coming from the Trainer Station. The reason for this is the rule set of the label fitting process. There are several boundary situations where the orientation vector is close to a segment separation line; in those cases, the label fitting algorithm can adjust the label into a non-optimal position. Fig. 14 shows an optimal and a non-optimal solution of the label fitting process.

FIGURE 14. Optimal (left) and non-optimal (right) fitting examples. The injected label and the ground truth are represented with a green and a blue bounding box, respectively.
B. COMPARISON OF THE ORIGINAL AND THE
RETRAINED DETECTORS
As for the comparison of the original and the retrained detectors (see III-D), the results in Table 8 show that the maximum detection range increased by 97% with the use of the ground truth-based dataset. This range remained the same when noise was added to the training labels on which the detector was trained.
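The noise-corrupted training set mentioned above was produced by perturbing the ground truth labels with Gaussian noise parameterised by Table 7. A minimal sketch of such a perturbation is given below; the random sign of the mean offset and the exact parameterisation are our own assumptions about the original procedure.

```python
import numpy as np

# Mean and standard deviation of the center point error along x, y, z and of the
# heading error, taken from Table 7.
NOISE_MEAN = np.array([0.2363, 0.1311, 0.3274, 0.1133])
NOISE_STD = np.array([0.1807, 0.0975, 0.0751, 0.0683])


def corrupt_labels(labels, rng=None):
    """labels: (N, 7) array of (x, y, z, length, width, height, yaw) ground truth boxes.
    Returns a copy whose centers and headings are perturbed by Gaussian noise."""
    rng = np.random.default_rng(0) if rng is None else rng
    labels = np.array(labels, dtype=float, copy=True)
    n = len(labels)
    sign = rng.choice([-1.0, 1.0], size=(n, 4))            # direction of the offset (assumption)
    noise = sign * rng.normal(NOISE_MEAN, NOISE_STD, size=(n, 4))
    labels[:, [0, 1, 2, 6]] += noise                       # perturb x, y, z and yaw
    return labels
```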
The reason for the increased standard deviation in the case of the detector trained on the ground truth-based dataset (see Table 10) is the increased detection range, since at larger distances the bounding box predictions are less accurate. In the case of the original detector, on the other hand, we obtained only false negative (missing) detections at such distances. The mean and standard deviation values for the detector trained on the noise-corrupted training set were larger than those of the ground truth-based variant; however, the mean distance was still smaller than in the case of the original detector.
There are two reasons that enable the performance increase seen in Fig. 13 despite the relatively small number of frames in the generated training dataset. Firstly, the retrained detector network could learn the new object representation from the elevated viewpoint (this was the original goal). Secondly, the environment in which the sensor was deployed was simple, static and well separable, therefore the network could easily recognise the objects even at longer distances. The static environment is a great advantage for object detection algorithms running on infrastructure-mounted LiDAR point clouds, compared to the continuously changing environment in the case of vehicle-mounted sensors.
C. LIMITATIONS
In order to perform tailored training dataset generation with the proposed system, numerous temporarily deployed Trainer Sensors are required. These sensors are currently expensive, therefore a large amount of funding is necessary. Furthermore, the installation area has to be safe enough to prevent theft of or intentional damage to the equipment. The method operates on pre-collected data in offline mode. Finally, the quality of the generated labels significantly depends on the performance of the detector network which processes the point clouds of the Trainer Sensor.
V. CONCLUSION
In this article a novel Automatic Label Injection method has been proposed, which automatically generates the labels for objects in the point cloud acquired by the Trained Sensor, relying on the labels provided by a well performing detector operating on the point cloud of the Trainer Sensor. The proposed technique enables the creation of a training dataset which is tailored for a particular high-mounted infrastructure LiDAR unit, and the creation process does not require manual labour for object labeling. The proposed method was tested under real conditions at a motorway section where a single Trainer and a single Trained station were deployed, and the performance of the proposed Automatic Label Injection technique was evaluated on the data recorded at the test site. The case of multiple Trainer stations covering a larger area was also considered and its concept was elaborated.
It is shown in the article that the detector neural networks can be retrained on the created dataset to enhance their
detection performance on point clouds provided by fixed-mounted infrastructure LiDAR devices at an elevated position. A further development possibility for the proposed method is to replace the temporarily deployed Trainer Sensors with fixed-mounted sensors assigned with already optimised detectors, involving not only LiDAR devices but also the fusion of different sensor types. Furthermore, the implementation of object-level fusion, which increases the robustness of the labeling, is a possible way of improvement. The preliminary calibration process which provides the transformation between the local and UTM coordinate systems is performed manually; in further development stages this process can be automated to reduce the necessary manual labour. Also, refinement of the label fitting rule set to eliminate non-optimal bounding box adjustments can be considered as a further development direction.
VI. ACKNOWLEDGMENT
The research reported in this paper and carried out at the
Budapest University of Technology and Economics has been
supported by the National Research Development and In-
novation Fund (TKP2020 National Challenges Subprogram,
Grant No. BME-NC) based on the charter of bolster issued
by the National Research Development and Innovation Of-
fice under the auspices of the Ministry for Innovation and
Technology.
ZSOLT VINCZE was born in Budapest, Hungary
in 1986. He received the B.S. degree in electrical
engineering from Óbuda University, Budapest, in
2012 and the M.S. degree in electrical engineering
from Budapest University of Technology and Eco-
nomics, Budapest, in 2015. He is currently pursu-
ing the Ph.D. degree in transportation and vehicle
engineering at Budapest University of Technology
and Economics at Budapest.
From 2015 to 2020, he worked as an electrical engineer at Geoelectro Ltd, Nagykovácsi, Hungary. Since 2020 he has been working as a Research Assistant at the Department of Automotive Technologies at Budapest University of Technology and Economics, Budapest, Hungary. His research interests include the development of intelligent infrastructure systems for aiding autonomous traffic, which involves the development of new neural network based object detectors, as well as research into new solutions that lessen the amount of manual labour required for the supervised learning of detector networks.
ANDRÁS RÖVID was born in Rimaszombat, Slovakia, in 1978. He graduated in computer engineering from the Faculty of Electrical Engineering and Informatics, Technical University of Kosice, Kosice, Slovakia, in 2001. He received the Ph.D. degree in transportation sciences from the Budapest University of Technology and Economics (BUTE), Budapest, Hungary, in 2005. He is currently a senior research fellow and, since 2019, the leader of the Perception Group at the Department of Automotive Technologies of BUTE. He has been author or co-author of over 100 publications. His main interests include image processing, 3D machine vision, and sensor fusion.
VIKTOR TIHANYI was born in Hungary, on April 13, 1981. He graduated from Budapest University of Technology and Economics in 2005 as an electrical engineer at the Faculty of Electric Machines and Drives, and received his PhD in 2012. He also obtained a BSc degree in mechanical engineering in 2014 from the University of Obuda, at the vehicle technology faculty. He worked at Hyundai Technology Center Hungary for five years from 2008. In 2013 he changed to the automotive sector, joining Knorr-Bremse Fekrendszerek Kft. as project leader and team leader of electromobility and autonomous vehicle related projects until 2019. Since 2020 he has been working at the ZalaZONE proving ground as team leader of research and innovation activities. Besides his industrial employment, he has also been working at the Budapest University of Technology and Economics at the Department of Automotive Technologies since 2016 as an associate professor and research leader of autonomous vehicle related research projects.