L2D2: Learnable Line Detector and Descriptor
Hichem Abdellali1, Robert Frohlich1, Viktor Vilagos1, Zoltan Kato1,2
1University of Szeged, Institute of Informatics, Hungary. 2J. Selye University, Komarno, Slovakia.
{hichem,frohlich,vilagosv,kato}@inf.u-szeged.hu
This work was partially supported by the NKFI-6 fund through project
K135728; EFOP-3.6.3-VEKOP-16-2017-0002.
Abstract
A novel learnable line segment detector and descriptor
is proposed which allows efficient extraction and matching
of 2D lines via the angular distance of 128 dimensional unit
descriptor vectors. While many handcrafted and deep fea-
tures have been proposed for keypoints, only a few meth-
ods exist for line segments. It is well known, however, that
line segments are commonly found in man-made environ-
ments, in particular urban scenes, thus they are important
for applications like pose estimation, visual odometry, or
3D reconstruction. Our method relies on a 2-stage deep
convolutional neural network architecture: In stage 1, can-
didate 2D line segments are detected, and in stage 2, a de-
scriptor is generated for the extracted lines. The network is
trained in a self-supervised way using an automatically col-
lected dataset of matching and non-matching line segments
across (substantially) different views of 3D lines. Experi-
mental results confirm the state-of-the-art performance of
the proposed L2D2 network on two well-known datasets for
autonomous driving both in terms of detected line matches
as well as when used for line-based camera pose estimation
and tracking.
1. Introduction
Local features [45,7] play a fundamental role in almost
all fields of computer vision, where matching between im-
ages is needed, e.g. pose estimation [23,25,19,17,1], reg-
istration, 3D reconstruction, structure from motion, visual
localization [37,36,35], object recognition, visual odome-
try and simultaneous localization and mapping, etc. Hand-
crafted keypoint extractors like SIFT [26], BRIEF [6], or
ORB [34] are still widely used in spite of numerous alterna-
tive end-to-end learning based approaches like R2D2 [33],
LIFT [55], MatchNet [10], or DeepCompare [56]. Although
learnable features have been a rather active research topic
recently, their advantages over handcrafted ones are still not
evident [39,4,7]. For example, according to [7], extraction
of handcrafted descriptors on a CPU is often much faster
than extraction of learned descriptors on a GPU; and re-
cent benchmarks targeting their application in image-based
reconstruction and localization pipelines suggest that hand-
crafted features still perform just as well or even better than
deep-learned features on such tasks. Among learned key-
point descriptors, L2-Net [43] and its variants became par-
ticularly popular. In [29], it was shown that a more pow-
erful descriptor (called HardNet) with significantly simpler
learning objective can be learned via an efficient hard neg-
ative mining strategy. In [52], a modified robust angu-
lar loss function (RAL-Net) was proposed which, coupled
with HardNet’s negative mining strategy, yields state-of-the-
art performance. A significant amount of work on local
features and feature matching (surveys in [39,4,7]) was
presented in the past 20 years to overcome common chal-
lenges like texture-less areas or repetitive patterns typical
to a man-made environment found e.g. on planar building
facades, traffic signs on roads, printed circuits and various
other mass-produced objects.
1.1. Line detection and descriptor
As opposed to the vast amount of work done on extract-
ing and matching keypoints, the field of line descriptor ex-
traction is less active. The clear advantage of line features
is a higher robustness to repetitive structures and occlusion.
2D lines are typically detected using the LSD [9] or the
faster EDLines detector [3]. Hough transform was also used
in different ways trying to extend it to line segment detec-
tion [53], but without a significant breakthrough. Recently,
promising deep learning based solutions started emerging
with the advent of the wireframe parsing [14,59,54] ap-
proach, where two parallel branches are predicting junction
maps and line heatmaps, finally merged into line segments.
Wireframes are constructed from the salient structural lines
in the scene and their junction points, usually manually la-
beled for ground truth data [14], and can be parametrized in
a multitude of ways; [54] uses a holistic 4D attraction field
map, while in contrast, line junctions are validated by a line
verification module in [14]. Most recently [32] proposed
SOLD2, a self-supervised solution using synthetic train-
ing data in the first stage, then adding homographic aug-
mentation in their training pipeline, outperforming the cur-
rent state-of-the-art. Besides the two-step approach of most
wireframe related solutions, [15] proposed a compact and
fast one-step model using a tri-point line-representation.
Recently, semantic line attributes have also been used to gain a
higher-level representation of lines [21,40,44].
Like handcrafted keypoint descriptors, line segment de-
scriptors can be constructed from the neighborhood appear-
ance of the detected line, without resorting to any other con-
straints or a priori knowledge. Handcrafted line descrip-
tors include the Mean Standard deviation Line Descriptor
(MSLD) proposed in [50], the SIFT-like method, LEHF
(Line-based Eight-directional Histogram Feature) [11]. A
wide baseline stereo line matching method is proposed
in [5], which compares the histograms of neighboring color
profiles and iteratively eliminates mismatches by a topolog-
ical filter. In [57], a multi-scale line detection and matching
strategy coupled with the Line Band Descriptor (LBD) is
proposed which relies on both the local appearance of lines
and their geometric attributes. Another approach, SMSLD [47],
adds scale-invariance to line segment descriptors
using 5 basic rules. In [47], these rules are applied to en-
hance both the line descriptor of [5] as well as MSLD [50].
Other methods typically rely on some kind of partial 3D
information or structural organization of a set of lines to
solve matching: Putative line correspondences in [38] rely
on both cross-correlation based on gray-level information
and the multiple view geometric relations (epipolar geom-
etry and trifocal tensor) between the images. [28] relies on
VisualSFM, and uses the estimated camera poses to project
2D lines into 3D. Visual localization is cast as an alignment
problem of the edges of the query image to a 3D model
consisting of line segments in [27] thus avoiding descriptor
extraction and matching. They define an efficient Chamfer
distance-based aligning cost, incorporated into a first-best-
search strategy to register image lines with the model lines.
A line matching method in an affine projection space is pro-
posed in [49] to compensate for viewing angle changes in
aerial oblique images. A hierarchical method is proposed
in [24], where line segments are matched first in groups and
then individually. Group matching relies on Line-Junction-
Line (LJL), which consists of two adjacent line segments
and their intersecting junction. LJLs are then matched
based on a robust descriptor, and unmatched lines are subse-
quently matched using local homographies of neighboring
matched LJLs.
To the best of our knowledge, the first learnable line seg-
ment detector and descriptor (SOLD2) has been proposed
in [32], which combines line segment detection and de-
scriptor generation in a single pipeline, although the de-
tector and descriptor branch of the proposed deep network is
separated: the detection branch is inspired by the line detec-
tion architecture of wireframe detectors while the descrip-
tor is essentially a keypoint descriptor sampled along the
detected line segments. A similar descriptor (but without
detection), called LLD, has been proposed in [46]; however,
it is specifically designed for automatic 2D-2D matching in
a SLAM problem, where image pairs have a short baseline
thus avoiding drastic changes in viewpoint and photomet-
ric properties. LLD descriptors are constructed on top of
a deep yet lightweight fully convolutional neural network
inspired by L2Net [43] with triplet loss where triplets are
mined from subsequent frames of the KITTI [8] and EuRoC
datasets.
1.2. Contributions
In this paper, we aim to develop a new Learnable Line
Detector and Descriptor (L2D2) that is robust enough to
detect and match 2D lines across wide view-point changes.
The clear advantage of line features over keypoints is that
they are commonly found in man-made environments (e.g.
buildings), have fewer issues with repetitive structures and,
most importantly, do not require a point-wise match, hence
potentially they can be used when cameras have very lit-
tle overlapping views. Our deep convolutional net’s archi-
tecture consists of two phases: 1) a line segment detector
with a lightweight residual network architecture inspired
by wireframe networks, followed by 2) a patch-based de-
scriptor network inspired by L2Net [43] with a rectangular
patch size adapted to line-based orientation normalization,
yielding 128-dimensional unit feature vectors that can be
matched via an angular distance. For efficient training, we
adopt the hard negative mining strategy from [29] combined
with the robust angular loss function of RAL-Net [52].
The training data is generated automatically from the Lyft
dataset [18], which contains high quality Lidar point clouds
and precise ground truth poses for RGB camera images. Ex-
perimental results obtained on KITTI [8], KITTI360 [51]
and Lyft [18] show state-of-the-art performance compared
to SOLD2 [32], LLD [46], and SMSLD [47]. Our self-
supervised training allows for further improvement using
other datasets, but the performance of our descriptor on the
KITTI and KITTI360 datasets shows that it already gener-
alizes well to rather different road scenes. The code and
dataset will be publicly available when the paper gets ac-
cepted.
2. Learning to Detect Matchable Line Segments
Feature repeatability and reliability is the key for estab-
lishing good matches between image pairs. For keypoints,
such an end-to-end network is R2D2 proposed in [33]. Re-
peatability simply means that the same keypoint can be de-
tected in all views regardless of the geometric/photometric
changes, while reliability (of a descriptor) means that such
keypoints can be reliably matched. For lines, repeatabil-
ity has a slightly different meaning: any detector (including
the proposed one) will detect line segments, but a line corre-
spondence is always interpreted as a matching infinite line
pair. Therefore, it is the infinite lines defined by the actu-
ally detected line segments that should be considered, since
e.g. any line-based pose estimation method relies on infinite
line pairs [13,2,1].
Thus a repeatable line is an infinite line, parts of which can
be detected on multiple images. This is an important dif-
ference w.r.t keypoints or wireframes, where exact position
of the point feature or line segment is critical. This higher
level aspect is rarely considered in current line detectors,
but in a pose estimation or visual odometry application,
detecting multiple line segments along the same infinite
line is useless. Thus we argue that detecting fewer but
more relevant line segments is better, as this way the de-
scriptor can also be more reliable than when matching too
many irrelevant lines.
2.1. Training Data
An important part of our method is the automatic con-
struction of a large training dataset of matching 2D line
segment pairs. For this purpose, a dataset with ground truth
camera poses and corresponding 3D point cloud is needed.
Herein, the Lyft dataset [18] has been used, which contains
several sequences of images captured by a fully calibrated
multi-view RGB camera system and a dense, high preci-
sion, metric Lidar. Each of the 150 sequences consists of
exactly 126 frames that were captured in one single burst
by the system in 25 seconds. The extraction of matching
lines is done automatically using the 3D data, hence our
model can be trained in a self-supervised way, also includ-
ing other future datasets when they become available. The total size of the
dataset is 741706 line pairs, which is divided into a training
(60%) and test (40%) set.
The training data consists of 426496 line pairs. A se-
quence of non-overlapping images is called a cluster. Thus
images in a cluster cover the full scene of the dataset with-
out redundancy in the image views. Since the Lyft dataset
contains multiple drives along the same trajectories, multi-
ple clusters can be constructed (22 in total), which can be
considered as separate scans of the full scene captured at
different times. In the process of creating a cluster, over-
laps were avoided by calculating the Euclidean distance of
each frame’s absolute position in a sequence to all the other
cameras from different sequences, and keeping images with
a distance larger than 27 m (the maximum distance between
the lanes on the same road).
To create 2D line pairs, first 2D-3D line correspondences
were determined by detecting lines on the images and pro-
jecting them into the 3D point cloud (see Fig. 1).
Figure 1. 3D line (red) fitted to the merged 3D points coming from
two 2D lines on a building edge where there is a discontinuity in
3D, and only one side is visible from the left view (shown in
green), while the right view sees the yellow 3D points.
Using the algorithm outlined below, all the relevant information can
be extracted from the data structure, such as what are the 2D
views of each 3D line (i.e. which camera sees the 3D line),
or which 3D lines are visible on a given 2D image. These
2D line pairs are then used for the detector training, while
the line support regions (patches) are extracted for all 2D
lines corresponding to the same 3D line, yielding a list of
all possible patch-correspondences of the 3D line used for
the descriptor training. The dataset construction consists of
the following main steps:
1. Detect lines on all 2D images using any line detector
and keep only the lines that are at least 48 pixels long
(the height h of the line support region).
2. For each 2D image, we project the 2D lines onto the 3D
point cloud using the ground truth pose and keep the
3D points that backproject to a 2D line within a 2 pixel
(point-to-line) distance. A 3D line is then fitted to these
points. This gives us the candidate 3D lines for the
dataset (a sketch of this projection-and-fitting step is
shown after this list).
3. Then we cluster 3D line segments that are assumed to
be part of the same 3D line using the following proce-
dure for each candidate: (1) we calculate the smallest
bounding box, extended by ±20 cm in all directions, of
the 3D line segment; (2) for subsequent filtering, we
consider only 3D lines whose midpoint is closer than
1.5 m to the midpoint of the selected 3D line; (3) we
keep lines whose point-to-line distance from the mid-
point of the selected 3D line is less than 25 cm; (4) we
keep lines whose angle w.r.t. the selected 3D line is
less than 5 degrees; (5) the endpoints of the kept lines
are checked: at least one endpoint must be inside the
bounding box of the selected 3D line, or the line must
intersect that bounding box; (6) the lines that fulfilled
the previous constraints are marked as processed, we
merge their 3D points and perform a robust 3D line fit
with RANSAC using a point-to-line distance threshold
of 3 cm; (7) the fitted 3D line is backprojected to all
the corresponding 2D lines, and those whose line back-
projection error [12] is above 0.01 are discarded. These
2D lines form the visibility group of the fitted 3D line,
hence they all correspond to each other.
4. After processing all the 3D lines, we obtain all the
views with at least two 2D correspondences, for which
the 48 × 32 line support regions are extracted.
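A minimal sketch of the point-filtering and line-fitting part of step 2 is given below, assuming a pinhole camera with intrinsics K and a ground-truth pose (R, t) that maps world coordinates to the camera frame; the function names and the plain least-squares fit are illustrative only (the paper uses a robust RANSAC fit with a 3 cm threshold in step 3):

```python
import numpy as np

def points_supporting_2d_line(pts_3d, p1, p2, K, R, t, max_dist=2.0):
    """Keep the 3D points whose projection lies within max_dist pixels
    (point-to-line distance) of the 2D line segment with endpoints p1, p2."""
    p1, p2 = np.asarray(p1, float), np.asarray(p2, float)
    cam = (R @ pts_3d.T + t.reshape(3, 1)).T        # world -> camera frame
    in_front = cam[:, 2] > 0                        # ignore points behind the camera
    proj = (K @ cam[in_front].T).T
    uv = proj[:, :2] / proj[:, 2:3]                 # pixel coordinates
    n = np.array([p1[1] - p2[1], p2[0] - p1[0]])
    n = n / np.linalg.norm(n)                       # unit normal of the image line
    dist = np.abs((uv - p1) @ n)                    # point-to-line distances
    keep = np.flatnonzero(in_front)[dist <= max_dist]
    return pts_3d[keep]

def fit_3d_line(pts_3d):
    """Least-squares 3D line fit: centroid plus dominant PCA direction."""
    c = pts_3d.mean(axis=0)
    _, _, vt = np.linalg.svd(pts_3d - c)
    return c, vt[0]                                 # point on the line, unit direction
```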
3. Line Detector Network
The network architecture is inspired by the line detec-
tion branch of the Wireframe Parsing [14] method, which
in turn was inspired by the Stacked Hourglass network [31].
Based on our observations, the stacked module in the line
detection of [14], also used as the backbone of the most re-
cent state-of-the-art SOLD2 [32] method, is not necessary
for repeatable line detection, thus the architecture’s com-
plexity can be greatly reduced from 20.77M to only 1M
parameters, while the line detection performance is simi-
lar (see an ablation study in the Supplementary). As shown
in Fig. 2, its main components are the three Residual Mod-
ules R1, R2, R3, followed by deconvolutions to produce a
line segment heat map at the input resolution.
For training the line detector network, a mean squared
difference loss is applied on each output of the batches of
b = 20 images:
L_{MSE} = \frac{1}{b} \sum_{i=1}^{b} \left( \frac{1}{N} \sum_{j=1}^{N} \big( h_j(I_i) - GT_j(I_i) \big)^2 \right)    (1)
where h_j(I_i) is the j-th pixel of the detection heatmap out-
put of the network for image I_i, GT_j(I_i) is the corre-
sponding binary pixel of the ground truth lines on the
image, and N is the number of pixels of the image. Each
batch is constructed by starting from a randomly selected
3D line, collecting the camera views that see that line, then
adding new lines from the field of view of the selected cam-
eras, and repeating the process until the desired batch size
is reached. This way, in each batch seen by the network in
one iteration we have different views of the same 3D line,
i.e. it is guaranteed that the 2D lines used for training are re-
peatable. We used the Stochastic Gradient Descent (SGD)
optimizer for 100 epochs, with a fixed learning rate of
0.025, momentum 0.9, and weight decay 0.0001.
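As an illustration, the per-batch loss of Eq. (1) and one SGD training step could be written as follows in PyTorch; detector_net stands for the three-residual-module network of Fig. 2 and is assumed here, not defined by the text:

```python
import torch

def heatmap_mse_loss(pred_heatmaps, gt_line_masks):
    """Eq. (1): squared difference between the predicted line heatmap and the
    binary ground-truth line mask, averaged over pixels, then over the batch."""
    # pred_heatmaps, gt_line_masks: (b, 1, H, W) tensors
    per_image = ((pred_heatmaps - gt_line_masks) ** 2).mean(dim=(1, 2, 3))
    return per_image.mean()

# Sketch of one training step with the stated hyper-parameters:
# optimizer = torch.optim.SGD(detector_net.parameters(), lr=0.025,
#                             momentum=0.9, weight_decay=1e-4)
# for images, gt_masks in loader:            # batches of b = 20 images
#     loss = heatmap_mse_loss(detector_net(images), gt_masks)
#     optimizer.zero_grad()
#     loss.backward()
#     optimizer.step()
```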
The detection network provides a heatmap, where each
pixel's intensity represents the probability that it belongs to a
line segment. In most applications a parametric representa-
tion of the lines is needed, thus in a postprocessing step we
find connected components using Hough transform. False
detections of multiple versions of the same line due to noisy
network output are filtered based on the angle of lines. Note
that SOLD2 also uses postprocessing to get the lines, but it
is more complex as it has distinct heat maps for line seg-
ments and junctions which need to be combined involving a
regular sampling of points along each line (between 2 junc-
tions), adaptive local-maximum search, and accepting the
lines verifying a minimum average score and inlier ratio.
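The exact postprocessing parameters are not given in the paper; a minimal sketch of the Hough-based step described above, with illustrative parameter values, could be:

```python
import cv2
import numpy as np

def heatmap_to_segments(heatmap, prob_thresh=0.5, min_len=48, max_gap=5):
    """Threshold the detector heatmap and extract parametric line segments
    with the probabilistic Hough transform (all parameter values here are
    illustrative, not the ones used by the authors)."""
    binary = (heatmap >= prob_thresh).astype(np.uint8) * 255
    segments = cv2.HoughLinesP(binary, rho=1, theta=np.pi / 180, threshold=40,
                               minLineLength=min_len, maxLineGap=max_gap)
    if segments is None:
        return np.empty((0, 4), dtype=np.int32)
    # The paper additionally filters multiple detections of the same line
    # based on the line angles; that step is omitted in this sketch.
    return segments.reshape(-1, 4)              # each row: x1, y1, x2, y2
```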
4. Line Descriptor Network
Given a set of 2D lines \{l_i\}_{i=1}^{\ell}, how can we extract descrip-
tors that are sufficiently discriminative and reliable across
wide viewpoint changes? Our approach consists of a sup-
port region selection mechanism, which guarantees a nor-
malized orientation with respect to each line l_i, and a deep
neural network architecture based on RAL-Net [52] (which
adopts the HardNet [29] architecture, which is identical to
L2Net [43]). Fig. 2 summarizes the layers. The input is the
32×48 pixels line support region with normalized grayscale
values (subtracting the mean and dividing by the standard
deviation) and the output is an L2 normalized 128D unit
length descriptor. The whole feature extraction is built of
fully convolutional layers, with downsampling performed
by stride-2 convolutions. There is a Batch Normalization
(BN) [16] layer and a ReLU [30] activation layer in every
layer except the last one. To prevent overfitting, there is a
0.3 Dropout layer [42] above the bottom layer, as in RAL-Net [52].
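A minimal PyTorch sketch of such a descriptor network is given below; the channel counts and the kernel of the final convolution are assumptions modeled on L2Net, while the 32 × 48 input, the BN/ReLU placement, the 0.3 Dropout, and the 128-D unit-length output follow the text:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(c_in, c_out, stride=1):
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True))

class L2D2Descriptor(nn.Module):
    """L2Net/HardNet-style patch descriptor adapted to a 48x32 (h x w)
    line support region; channel counts are assumptions."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            conv_bn_relu(1, 32),
            conv_bn_relu(32, 32),
            conv_bn_relu(32, 64, stride=2),    # 48x32 -> 24x16
            conv_bn_relu(64, 64),
            conv_bn_relu(64, 128, stride=2),   # 24x16 -> 12x8
            conv_bn_relu(128, 128),
            nn.Dropout(0.3),                   # 0.3 dropout before the last layer
            nn.Conv2d(128, 128, kernel_size=(12, 8), bias=False))  # -> 1x1

    def forward(self, patch):
        # patch: (B, 1, 48, 32), already mean/std normalized
        x = self.features(patch).flatten(1)    # (B, 128)
        return F.normalize(x, p=2, dim=1)      # 128-D unit-length descriptor
```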
4.1. Support region of a line
The standard strategy for keypoint descriptors is to use
a squared window around the detected keypoint and con-
struct (for handcrafted features like SIFT [26]) or learn (for
learned descriptors like L2Net [43]) a feature vector based
on the visual pattern within that window. A critical step in
these approaches is the coordinate transformation applied
to the window before extracting the descriptor such that it
will be invariant to local deformations (mainly scale and ori-
entation) of corresponding visual patterns across matched
views. A typical solution for handcrafted features, proposed
originally for SIFT [26], is to take the dominant gradient di-
rection of the pattern as the orientation and the keypoint’s
detection scale as the scale of the local coordinate system.
For learned keypoint detectors, e.g. in [22,58], the covari-
ant constraint of the local feature detector is adopted via
training of a transformation predictor which also provides a
local coordinate frame in which a canonical feature can be
extracted.
Since, unlike keypoints, each line l_i has its own length
|l_i| and orientation ϕ_i, a normalization is needed to fix a
local coordinate frame for generating the descriptor. In
MSLD [50], LBD [57], and DLD [20], a rectangular region
centered at the line is used as the local region (hence orien-
tation is normalized w.r.t. ϕ_i), which is divided into several
sub-regions and a SIFT-like descriptor is calculated for each
sub-region. The normalization of the length |l_i| is achieved by
constructing the final descriptor from the mean and stan-
dard deviation of these sub-region descriptors. In contrast,
LLD [46] samples a few points along the line l_i and for
each of these points an L2Net-like keypoint descriptor is
extracted and then averaged into a descriptor.
Figure 2. Architecture of the proposed L2D2 network. Stage 1: Detector, Stage 2: Descriptor.
Figure 3. Loss functions and their derivatives: LLD (2) in red, L2D2 (3) in green.
Hence for LLD, the orientation normalization relies on the
individual keypoint's orientation, which is acceptable for the
short-baseline images typical in SLAM, but for wide baseline
matching, we propose another approach: since lines have a
natural orientation, we simply rotate each line by its ϕ_i into
a vertical orientation and then extract a w × h (w < h)
patch centered at the vertical line. Of course, only lines
with |l_i| > h are considered, but the detection accuracy
of short line segments is unstable anyway. Note that we
do not normalize the length; instead, we simply take a sam-
ple of length h from the beginning of the vertical line (see
Fig. 2). The reason is that intensity patterns around lines
are typically repetitive, or homogeneous with a characteris-
tic change across the line, hence scaling down will not help
in characterizing corresponding lines across views, as the
lengths of their extracted line segments might be drastically
different due to occlusion and other visibility constraints.
Hence, an equally sampled area along the line tells more
than a downscaled version with an uncorrelated scale across
views!
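A minimal sketch of this support-region extraction for a grayscale image and a line given by its two endpoints; the exact resampling and the choice of the "beginning" endpoint are not specified in the text, so these details are illustrative:

```python
import cv2
import numpy as np

def line_support_region(gray, p1, p2, w=32, h=48):
    """Rotate the line p1 -> p2 to a vertical (downward) orientation and
    crop a w x h patch centered on the line, starting at its first endpoint."""
    p1, p2 = np.float32(p1), np.float32(p2)
    dx, dy = p2 - p1
    if np.hypot(dx, dy) < h:                       # only lines longer than h are used
        return None
    # Rotation (about p1) that maps the line direction onto the downward vertical.
    angle = np.degrees(np.arctan2(-dx, dy))
    M = cv2.getRotationMatrix2D((float(p1[0]), float(p1[1])), angle, 1.0)
    rot = cv2.warpAffine(gray, M, (gray.shape[1], gray.shape[0]))
    x0, y0 = int(round(p1[0] - w / 2)), int(round(p1[1]))
    if x0 < 0 or y0 < 0:
        return None
    patch = rot[y0:y0 + h, x0:x0 + w].astype(np.float32)
    if patch.shape != (h, w):                      # sample fell outside the image
        return None
    return (patch - patch.mean()) / (patch.std() + 1e-8)   # normalized grayscale
```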
4.2. Loss function
Triplet loss has been successfully applied in many de-
scriptor learning tasks [29,52,46]:
L_3 = [\, d(a, p) - d(a, n) + m \,]_+    (2)
with a, p, n being the anchor, positive (w.r.t. a), and nega-
tive (w.r.t. a) descriptors of the triplet, respectively. [x]_+
denotes clamping at 0, i.e. max(x, 0), while d is the dis-
tance of a descriptor pair. The common difficulty for triplet
loss (2) is that the choice of the margin m (which is typ-
ically set to m = 1 [41,29,46]) has a great impact on
the result [48,52], since there is always a part of the se-
lected triplets with 0 derivative (see Fig. 3). However, as
also argued in [52], we can relax the clipping effect of (2)
by choosing a nonlinear loss function [52]
L = 1 + \tanh\big( d(a, p) - d(a, n) \big)    (3)
which has a smooth derivative (see Fig. 3)
L' = 1 - \tanh^2\big( d(a, p) - d(a, n) \big)    (4)
In this way, a positive pair (a, p) and a negative pair (a, n)
receive more weight when they are hard to separate and
gradually less weight otherwise, rather than having their
derivative clipped to 0. This is important when a consid-
erable amount of false negative matches exists (since we
are training in a self-supervised way, this is inevitable).
Therefore, our loss puts less weight on triplets where the
cosine distance of the negative pair (a, n) is much higher
than that of the positive pair (a, p).
Herein, we follow the HardNet [29] strategy to construct
batches: First, a matching set M = \{l_{a_i}, l^+_{a_i}\}_{i=1}^{N} of N line
pairs is generated, where l_{a_i} stands for an anchor line and
l^+_{a_i} for its positive pair (i.e. they correspond to the same
3D line). M must contain exactly one pair originating from
a given 3D line! Then the line support regions are ex-
tracted (see Section 4.1) and passed through our L2D2 net-
work. That provides the descriptors (a_i, p_i), from which a
pairwise N × N distance matrix D is calculated such that
D_{i,j} = d(a_i, p_j), i = 1..N, j = 1..N. While LLD [46]
uses the L2 norm d(a_i, p_j) = ||a_i - p_j|| of the descriptor vec-
tor differences, following [48,52], we will use cosine simi-
larity for our metric learning, since our descriptors are unit
vectors:
d(a_i, p_j) = 1 - a_i \cdot p_j = 1 - \cos\big( \angle(a_i, p_j) \big)    (5)
Statistics/Detector             | L2D2   | SOLD2  | EDLines
detected line segments          | 73,063 | 70,836 | 66,887
unique infinite detected lines  | 59,395 | 53,381 | 44,485
percentage                      | 81.29% | 75.36% | 66.51%
validated line segments         | 10,685 | 13,552 | 17,771
unique infinite validated lines | 9,785  | 11,771 | 13,762
percentage                      | 91.58% | 86.86% | 77.44%
Table 1. Detector performance comparison.
Using D, for each matching pair a_i and p_i, the closest non-
matching descriptor n_i is found by searching for the minimum
over the off-diagonal elements of the i-th row and i-th column
of D. The following loss is then minimized for each batch:
\frac{1}{N} \sum_{i=1}^{N} \Big( 1 + \tanh\big( d(a_i, p_i) - d(a_i, n_i) \big) \Big)    (6)
The training data is partitioned into 3332 batches of 128
corresponding patch pairs, each batch being created from a
cluster. We applied the strategy of RAL-Net [52], training
for 200 epochs with the learning rate linearly decreasing
to 0 by the end. We chose Stochastic Gradient Descent
(SGD) as our optimizer, with an initial learning rate of 0.1,
momentum 0.9, dampening 0.9, and weight decay 0.0001.
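A compact sketch of this batch loss with the hardest-in-batch negative mining is shown below, assuming the N anchor and positive descriptors are already stacked into (N, 128) unit-norm tensors a and p (one pair per 3D line); the actual training code may differ:

```python
import torch

def l2d2_batch_loss(a, p):
    """Eq. (6) with HardNet-style mining: for each pair (a_i, p_i) the hardest
    negative is the closest non-matching descriptor in the i-th row or i-th
    column of the cosine-distance matrix D."""
    D = 1.0 - a @ p.t()                                      # Eq. (5), shape (N, N)
    pos = D.diag()                                           # d(a_i, p_i)
    off = D + 10.0 * torch.eye(D.size(0), device=D.device)   # mask the diagonal
    neg = torch.minimum(off.min(dim=1).values,               # hardest negative per row
                        off.min(dim=0).values)               # ... and per column
    return (1.0 + torch.tanh(pos - neg)).mean()
```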
5. Experimental Results
Experimental validation of the proposed L2D2 method
was performed on the separate testing data from Lyft [18]
(see Section 2.1), and to show the generalization capabil-
ity of the method, we used the KITTI Visual Odometry [8],
and the KITTI360 [51] datasets. The line detection capa-
bilities will be measured and compared to the state-of-the-art
in terms of repeatability of the extracted 2D lines, as discussed
in Section 2. From the KITTI dataset, we randomly se-
lect 1257 image-pairs where, according to our ground-truth
(GT) generation procedure outlined in Section 2.1, at least
1 line pair exists. Note that EDLines [3] has been used for
the GT line detection, hence the selection of this image set is
somewhat biased towards EDLines. Then lines are detected
on each image using SOLD2 [32] and our L2D2. Each de-
tected line returned by each method is validated using the
pointcloud of the test data. Thus a detected line is accepted
as correct if it has 2 views (i.e. it is repeatable) and the point-
cloud supports it (i.e. it corresponds to a true 3D line). Out
of the image pairs, 1091/1060/1170 contain at least 1 such
validated line in case of the L2D2/SOLD2/EDLines detectors.
The average/maximum length of the detected line segments
is 94/484 for L2D2, 89/401 for SOLD2, and 81/393 for EDLines.
Analyzing this data together with the number of detected
and validated line segments, as well as the number of distinct
infinite lines they correspond to (shown in Table 1), we can
conclude that EDLines detects the most repeatable lines
across image pairs, but the proposed method tends to de-
tect the longest line segments and has the highest ratio of
detected individual repeatable infinite lines, which are more
useful for real applications like pose estimation (see Sec-
tion 5.1).
Figure 4. Global matching performance on the Lyft (L) and KITTI
(K) testing data. The percentage of pairs that were ranked in the top 10
positions is shown. The legend contains, for each method, the % of
lines matched correctly / % of lines found in positions (1 to 10).
Descriptor/Detector | L2D2               | SOLD2              | EDLines
L2D2                | 84.08% (5398/6420) | 72.11% (5843/8103) | 84.21% (9760/11590)
SOLD2               | 82.49% (5296/6420) | 70.15% (5685/8103) | 78.04% (9045/11590)
SMSLD               | 72.35% (4645/6420) | 65.36% (5296/8103) | 74.78% (8667/11590)
Table 2. Detector/descriptor performance on validated line pairs
using the proposed L2D2, SOLD2, and EDLines/SMSLD.
Descriptor matching performance was evaluated in two
different settings, using the cosine similarity metric of (5).
Global matching is when a line is matched against a large
set of lines corresponding to an entire cluster in Lyft and a
whole sequence in KITTI. This case characterizes the per-
formance of the descriptors in applications like localization
or SLAM, where matching lines have to be retrieved from
a large image dataset. Fig. 4 shows comparative results
with the LLD [46] and SMSLD [47] descriptors. Our method
clearly outperforms them: 85% of the true line-pairs are ranked
in the top 10 out of 4499 possible line matches on average on
Lyft, compared to 68% (SMSLD) and 71% (LLD). On
KITTI these percentages are 71% (L2D2), 56% (SMSLD),
and 60% (LLD). In spite of the quite different scenes in
KITTI compared to our training dataset, our method out-
performs the others, which clearly demonstrates its gener-
alization capability.
Matching between image pairs is a typical scenario for
relative pose estimation or 3D reconstruction. Here, the aim is
to find all of the true line correspondences (i.e. inliers) and
avoid wrong matches (i.e. outliers). Thus an appropriate fil-
tering of the putative matches is needed, relying on the dis-
criminative power of the descriptor, in order to maximize
the inlier/outlier ratio of the returned set of putative line-pairs.
Figure 5. Matching scores of the correct and wrong line pairs on KITTI. The threshold τ is also visualized with a dotted line, at which
only 10% of the correct matches are lost. Lastly, the inlier ratio of the image pairs is shown after applying this τ threshold.
Figure 6. Matching examples on a KITTI image pair (left to right: SMSLD, LLD, SOLD2, L2D2 (proposed)). Correct matches are shown
in green, wrong matches in red. Note: SOLD2 automatically filters bad matches (shown in red boxes) based on its own metrics, while for
the other methods we show all putative matches.
On the KITTI dataset, we match a line with the can-
didate lines detected on another image (on average 44 lines
per image) for which a ground truth match is located. The
matching is evaluated in terms of the matching score, which
is the ratio of the descriptor distance of the best ranked
match over that of the second best match; it is expected to be
low for a highly discriminative descriptor and close to 1 for
a weakly discriminative matching. Fig. 5 shows compara-
tive results on 45000 image pairs. The matching ratio (num-
ber of correct matches found over the number of true line-
pairs in an image pair) is 79.92% for our method, 74.40%
for SMSLD, 73.75% for LLD and 73.05% for SOLD2.
The most significant difference between the discriminative
power of the methods can be highlighted when the matching
ratio is checked at a τ threshold value set to drop at most
10% of the correct matches. At the expense of losing 10%
of the correct matches, as shown in Fig. 5 with the hori-
zontal line corresponding to this particular τ threshold, the
above mentioned ratios can be improved to 93.17% (L2D2),
87.69% (SMSLD) and 78.27% (LLD). The last plot in Fig. 5
shows that almost 1/3 of the test cases contain more than
30% outliers with LLD, while SMSLD and the proposed
method improve this ratio; with L2D2, the resulting test set
contains only 16% of image pairs where there are out-
liers in the matches. Note that SOLD2 has its own selection
mechanism, hence it returns only the correct matches ac-
cording to its selection metric [32], which is 73.05%! Ex-
amples of the detected and matched lines between two im-
age frames are shown in Fig. 6, where we can observe how
LLD and SMSLD tend to introduce some incorrect matches
between similar line segments, while the proposed method
matches them correctly.
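For illustration, the matching-score based filtering between an image pair can be sketched as follows, assuming the descriptors of the two images are stacked into unit-norm NumPy arrays; τ = 0.92 is the threshold later used for the pose-estimation experiments in Section 5.1:

```python
import numpy as np

def match_with_score(desc_a, desc_b, tau=0.92):
    """Match each line of image A to its best candidate in image B and keep
    the match only if the ratio of the best over the second-best cosine
    distance (the matching score) is at most tau."""
    D = 1.0 - desc_a @ desc_b.T                  # cosine distances, Eq. (5)
    matches = []
    for i in range(D.shape[0]):
        if D.shape[1] < 2:
            break
        j, k = np.argsort(D[i])[:2]              # best and second-best candidates
        score = D[i, j] / max(D[i, k], 1e-12)
        if score <= tau:
            matches.append((i, int(j), float(score)))
    return matches
```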
Full Detector-Descriptor Pipeline is the real scenario for
working with line-correspondences. The detector outputs of
SOLD2 [32], EDLines [3], and L2D2 were analyzed previ-
ously. Using these 2D lines, we match lines between the im-
age pairs using SOLD2 and L2D2 deep descriptors as well
as the handcrafted SMSLD. Results are summarized in Ta-
ble 2, where in the diagonal we can see the performance of
the SOLD2, L2D2, and EDLines-SMSLD pipelines, while
off-diagonal numbers show the performance of the various
detector/descriptor combinations. In each case, the numbers
in parentheses represent the number of matched line pairs
out of all validated GT line pairs. Our pipeline is the most
efficient by successfully matching 84% of detected GT line
pairs, while the EDLines/SMSLD pipeline detects the high-
est number of GT line pairs between image pairs (but note
that the random selection is based on EDLines-detected GT
line pairs, as discussed earlier in this section, and is thus bi-
ased). It is also interesting to note that the lines detected
by SOLD2 are the hardest to match for any of the descrip-
tors: each descriptor achieves a better matching ratio with
both EDLines- and L2D2-detected lines.
5.1. Pose Estimation and Tracking
As we have seen in the previous section, the L2D2 en-
ables the matching of 2D lines between two views with a
high inlier ratio. Herein, we will use such matches to es-
tablish 2D-3D line correspondences by taking one of the
images as a reference image that has 3D lines for each 2D
lines (this can be achieved by knowing the camera pose
w.r.t. the Lidar sensor as it is a common practice for Lidar
scanners). To validate the line matches in a RANSAC abso-
Figure 7. Pose estimation errors of [1] on KITTI image pairs (m
stands for median value).
lute pose estimation application, we used the robust method
of [1] using the publicly available MATLAB implementa-
tion provided by the authors. Since pose estimation needs a
minimum of 3 line-correspondences and the lines should not
be e.g. parallel, we selected image pairs from the 1257 image-
pairs described in the matching evaluation step where at
least 6 ground truth line-pairs are available, resulting in
442, 222, and 695 image-pairs for the L2D2, SOLD2 and hand-
crafted pipelines, respectively. Note that our method pro-
vides 2 times more image pairs with at least 6 line-pairs
than SOLD2, while the (somewhat biased) number of image
pairs from the EDLines+SMSLD pipeline is the highest.
Using L2D2 and SMSLD, detected lines on the reference
image are matched with all detected lines on the other im-
age; a unique one-to-one matching is obtained by relying on
the matching score to keep the best match for each line, and
matches with scores above a τ = 0.92 threshold are discarded.
SOLD2 manages the matching and filtering of bad matches
internally. These putative line-correspondences are then fed
into the robust pose estimation [1]. The estimated pose has
been evaluated in terms of the angular distance of the rotation
error and the translation error as a percentage of the
ground truth translation. Fig. 7 shows these pose errors for
each pipeline on the 85 image pairs on which all of them
had at least 6 validated GT line-pairs. We can observe that
L2D2 outperforms both methods, solving 94% of the cases
with an error of less than 5° and 5%, which clearly shows
that our method detects repeatable lines which can be reli-
ably matched.
Pose Tracking is fundamental for visual odometry and
navigation. Herein, we present quantitative results using
our L2D2 method within a minimalistic Kalman-filter de-
signed to track a camera’s extrinsic parameters. The filter
gives a pose prediction for each frame, and receives a mea-
surement from the pose estimator [1] to update the state
of the tracked pose. The line correspondences come from
matching 3D lines from previous frames with the detected
2D lines from the current image. Again, repeatability and
reliability is critical, since for tracking we need to detect
the same 3D line on subsequent frames, which should be
reliably matched across neighboring frames. While we
make use of the prediction and covariance provided by the
Kalman filter to define a bounding box in which the corre-
sponding 2D line is searched, which helps reliable match-
ing once the line is detected, nearby detected 2D lines typ-
ically have a similar visual appearance, thus challenging
descriptor reliability.
                            | L2D2 descriptor | SMSLD
% of good poses             | 69.3%           | 59.65%
% of bad poses              | 3.3%            | 10.53%
mean of good matches        | 8.87            | 10.49
mean of bad matches         | 3.3             | 7.17
mean of good RANSAC inliers | 5.96            | 6.93
mean of bad RANSAC inliers  | 0.92            | 1.08
Figure 8. Pose tracking results of our L2D2 vs. SMSLD.
Pose tracking was tested on
lenging descriptor reliability. Pose tracking was tested on
the KITTI360 [51] dataset, which is recorded in rural ar-
eas with challenging narrow, winding streets. We selected
short sequences from the full dataset, where we could val-
idate at least 3 trackable lines on each consecutive frame.
Our test data consists of 6 such sequences with a total of
114 frames, and 7.39 trackable lines per frame. Using the
lines from the proposed line detector, we compared the re-
sults obtained with the proposed L2D2 line descriptor and
with SMSLD. The errors of the estimated poses and the
statistics are summarized in Fig. 8. We can see that, in
spite of matching slightly fewer lines and having approxi-
mately the same number of inliers after RANSAC, L2D2
line correspondences yield consistently better pose esti-
mates from the Kalman filter.
6. Conclusions
A robust learnable line detector and descriptor (L2D2)
is proposed for wide baseline line matching. The network
provides a 128D unit descriptor vector which can be easily
matched via cosine similarity. The training data prepara-
tion is fully automatic, and can be adapted to other datasets
as well. Experimental results confirm the state-of-the-art
performance of the proposed method on three different
datasets for autonomous driving both in terms of detected
line matches as well as in terms of inlier/outlier ratio. Fur-
thermore, the detected line-pairs were successfully used for
line-based camera pose estimation and pose tracking.
References
[1] H. Abdellali, R. Frohlich, and Z. Kato. Robust absolute and
relative pose estimation of a central camera system from 2d-
3d line correspondences. In Proceedings of ICCV Workshop
on Computer Vision for Road Scene Understanding and Au-
tonomous Driving, Seoul, Korea, Oct. 2019. IEEE. 1,3,8
[2] H. Abdellali and Z. Kato. Absolute and relative pose esti-
mation of a multi-view camera system using 2d-3d line pairs
and vertical direction. In Proceedings of International Con-
ference on Digital Image Computing: Techniques and Appli-
cations, pages 1–8, Canberra, Australia, Dec. 2018. IEEE.
3
[3] C. Akinlar and C. Topal. EDLines: A real-time line segment
detector with a false detection control. Pattern Recognition
Letters, 32(13):1633 – 1642, 2011. 1,6,7
[4] V. Balntas, K. Lenc, A. Vedaldi, and K. Mikolajczyk.
Hpatches: A benchmark and evaluation of handcrafted and
learned local descriptors. In Proceedings of Conference
on Computer Vision and Pattern Recognition, pages 3852–
3861, July 2017. 1
[5] H. Bay, V. Ferraris, and L. Van Gool. Wide-baseline stereo
matching with line segments. In Proceedings of Confer-
ence on Computer Vision and Pattern Recognition, volume 1,
pages 329–336. IEEE, June 2005. 2
[6] M. Calonder, V. Lepetit, C. Strecha, and P. Fua. BRIEF: Bi-
nary robust independent elementary features. In Proceedings
of European Conference on Computer Vision, pages 778–
792. Springer, 2010. 1
[7] G. Csurka, C. R. Dance, and M. Humenberger. From
handcrafted to deep local features. arXiv preprint
arXiv:1807.10254, 2018. 1
[8] A. Geiger, P. Lenz, and R. Urtasun. Are we ready for au-
tonomous driving? the KITTI vision benchmark suite. In
Proceedings of International Conference on Computer Vi-
sion and Pattern Recognition. IEEE, jun 2012. 2,6
[9] R. Grompone von Gioi, J. Jakubowicz, J. Morel, and G. Ran-
dall. LSD: A fast line segment detector with a false detection
control. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(4):722–732, Apr. 2010. 1
[10] X. Han, T. Leung, Y. Jia, R. Sukthankar, and A. C. Berg.
MatchNet: Unifying feature and metric learning for patch-
based matching. In Proceedings of Conference on Computer
Vision and Pattern Recognition, pages 3279–3286, 2015. 1
[11] K. Hirose and H. Saito. Fast line description for line-based
SLAM. In Proceedings of the British Machine Vision Con-
ference. BMVA, 2012. 2
[12] N. Horanyi and Z. Kato. Generalized pose estimation from
line correspondences with known vertical direction. In Pro-
ceedings of International Conference on 3D Vision, pages
1–10, Qingdao, China, Oct. 2017. IEEE. 3
[13] N. Horanyi and Z. Kato. Multiview absolute pose using 3D
- 2D perspective line correspondences and vertical direction.
In Proceedings of ICCV Workshop on Multiview Relation-
ships in 3D Data, pages 1–9, Venice, Italy, Oct. 2017. IEEE.
3
[14] K. Huang, Y. Wang, Z. Zhou, T. Ding, S. Gao, and Y. Ma.
Learning to parse wireframes in images of man-made envi-
ronments. In Proceedings of Conference on Computer Vision
and Pattern Recognition, pages 626–635, 2018. 1,4
[15] S. Huang, F. Qin, P. Xiong, N. Ding, Y. He, and X. Liu. Tp-
lsd: Tri-points based line segment detector. In A. Vedaldi,
H. Bischof, T. Brox, and J.-M. Frahm, editors, Proceedings
of European Conference Computer Vision, pages 770–785.
Springer International Publishing, 2020. 2
[16] S. Ioffe and C. Szegedy. Batch normalization: Accelerating
deep network training by reducing internal covariate shift. In
Proceedings of International Conference on Machine Learn-
ing, pages 448–456. JMLR.org, 2015. 4
[17] T. Ke and S. I. Roumeliotis. An efficient algebraic solution to
the perspective-three-point problem. In Proceedings of Con-
ference on Computer Vision and Pattern Recognition, pages
1–9, Honolulu, HI, USA, July 2017. IEEE. 1
[18] R. Kesten, M. Usman, J. Houston, T. Pandya, K. Nad-
hamuni, A. Ferreira, M. Yuan, B. Low, A. Jain, P. On-
druska, S. Omari, S. Shah, A. Kulkarni, A. Kazakova, C. Tao,
L. Platinsky, W. Jiang, and V. Shet. Lyft Level 5 AV dataset
2019. https://level5.lyft.com/dataset/, 2019. 2,3,6
[19] L. Kneip, H. Li, and Y. Seo. UPnP: an optimal O(n) solution
to the absolute pose problem with universal applicability. In
D. J. Fleet, T. Pajdla, B. Schiele, and T. Tuytelaars, editors,
Proceedings of European Conference Computer Vision, Part
I, volume 8689 of Lecture Notes in Computer Science, pages
127–142, Zurich, Switzerland, Sept. 2014. Springer. 1
[20] M. Lange, F. Schweinfurth, and A. Schilling. Dld: A
deep learning based line descriptor for line feature match-
ing. In Proceedings of International Conference on Intelli-
gent Robots and Systems, pages 5910–5915, 2019. 4
[21] J.-T. Lee, H.-U. Kim, C. Lee, and C.-S. Kim. Semantic line
detection and its applications. In Proceedings of Interna-
tional Conference on Computer Vision, pages 3249–3257,
2017. 2
[22] K. Lenc and A. Vedaldi. Learning covariant feature detec-
tors. In G. Hua and H. Jégou, editors, Proceedings of ECCV
Workshops, pages 100–117, Amsterdam, Netherlands, 2016.
Springer. 4
[23] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: an accurate
O(n) solution to the PnP problem. International Journal of
Computer Vision, 81(2), 2009. 1
[24] K. Li, J. Yao, X. Lu, L. Li, and Z. Zhang. Hierarchical line
matching based on line-junction-line structure descriptor and
local homography estimation. Neurocomputing, 184:207 –
220, 2016. RoLoD: Robust Local Descriptors for Computer
Vision 2014. 2
[25] S. Li, C. Xu, and M. Xie. A robust O(n) solution to the
perspective-n-point problem. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 34(7):1444–1450, 2012.
1
[26] D. G. Lowe. Distinctive image features from scale-invariant
keypoints. International Journal of Computer Vision,
60(2):91–110, 2004. 1,4
[27] B. Micusik and H. Wildenauer. Descriptor free visual in-
door localization with line segments. In Proceedings of Con-
ference on Computer Vision and Pattern Recognition, pages
3165–3173. IEEE, June 2015. 2
[28] P. Miraldo, T. Dias, and S. Ramalingam. A minimal closed-
form solution for multi-perspective pose estimation using
points and lines. In The European Conference on Computer
Vision, pages 1–17, Munich, Germany, Sept. 2018. Springer.
2
[29] A. Mishchuk, D. Mishkin, F. Radenovic, and J. Matas. Work-
ing hard to know your neighbor’s margins: Local descriptor
learning loss. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wal-
lach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,
Advances in Neural Information Processing Systems, pages
4826–4837. Curran Associates, Inc., 2017. 1,2,4,5
[30] V. Nair and G. E. Hinton. Rectified linear units improve
restricted Boltzmann machines. In Proceedings of Inter-
national Conference on Machine Learning, pages 807–814,
Madison, WI, USA, 2010. Omnipress. 4
[31] A. Newell, K. Yang, and J. Deng. Stacked hourglass net-
works for human pose estimation. In B. Leibe, J. Matas,
N. Sebe, and M. Welling, editors, Proceedings of European
Conference Computer Vision, volume 9912, pages 483–499.
Springer International Publishing, 2016. 4
[32] R. Pautrat, L. Juan-Ting, V. Larsson, M. R. Oswald, and
M. Pollefeys. SOLD2: Self-supervised occlusion-aware line
description and detection. In Proceedings of Conference on
Computer Vision and Pattern Recognition, 2021. 1,2,4,6,7
[33] J. Revaud, P. Weinzaepfel, C. R. de Souza, and M. Humen-
berger. R2D2: repeatable and reliable detector and descrip-
tor. In NeurIPS, 2019. 1,2
[34] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski. ORB:
An efficient alternative to SIFT or SURF. In Proceedings of
International Conference on Computer Vision. IEEE, Nov.
2011. 1
[35] P.-E. Sarlin, C. Cadena, R. Siegwart, and M. Dymczyk. From
coarse to fine: Robust hierarchical localization at large scale.
In The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2019. 1
[36] T. Sattler, B. Leibe, and L. Kobbelt. Efficient & effective pri-
oritized matching for large-scale image-based localization.
IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 39(9):1744–1756, 2016. 1
[37] T. Sattler, A. Torii, J. Sivic, M. Pollefeys, H. Taira, M. Oku-
tomi, and T. Pajdla. Are large-scale 3D models really nec-
essary for accurate visual localization? In Proceedings of
International Conference on Computer Vision and Pattern
Recognition, 2017. 1
[38] C. Schmid and A. Zisserman. Automatic line matching
across views. In Proceedings of Conference on Computer
Vision and Pattern Recognition, pages 666–671. IEEE, June
1997. 2
[39] J. L. Schönberger, H. Hardmeier, T. Sattler, and M. Polle-
feys. Comparative evaluation of hand-crafted and learned lo-
cal features. In Proceedings of Conference on Computer Vi-
sion and Pattern Recognition, pages 6959–6968, July 2017.
1
[40] J. L. Schönberger, M. Pollefeys, A. Geiger, and T. Sattler. Se-
mantic visual localization. In Proceedings of Conference on
Computer Vision and Pattern Recognition, June 2018. 2
[41] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A
unified embedding for face recognition and clustering. In
Proceedings of Conference on Computer Vision and Pattern
Recognition, pages 815–823, June 2015. 5
[42] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov. Dropout: A simple way to prevent neu-
ral networks from overfitting. Journal of Machine Learning
Research, 15:1929–1958, 2014. 4
[43] Y. Tian, B. Fan, and F. Wu. L2-net: Deep learning of discrim-
inative patch descriptor in Euclidean space. In Proceedings of
Conference on Computer Vision and Pattern Recognition. IEEE,
July 2017. 1,2,4
[44] C. Toft, E. Stenborg, L. Hammarstrand, L. Brynte, M. Polle-
feys, T. Sattler, and F. Kahl. Semantic match consistency
for long-term visual localization. In V. Ferrari, M. Hebert,
C. Sminchisescu, and Y. Weiss, editors, Proceedings of Euro-
pean Conference Computer Vision, pages 391–408. Springer
International Publishing, 2018. 2
[45] T. Tuytelaars and K. Mikolajczyk. Local invariant feature
detectors: A survey. Found. Trends. Comput. Graph. Vis.,
3(3):177–280, July 2008. 1
[46] A. Vakhitov and V. Lempitsky. Learnable line segment de-
scriptor for visual SLAM. IEEE Access, 7:39923–39934,
2019. 2,4,5,6
[47] B. Verhagen, R. Timofte, and L. Van Gool. Scale-invariant
line descriptors for wide baseline matching. In Proceedings
of Winter Conference on Applications of Computer Vision,
pages 493–500. IEEE, Mar. 2014. 2,6
[48] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li,
and W. Liu. Cosface: Large margin cosine loss for deep face
recognition. In Proceedings of Conference on Computer Vi-
sion and Pattern Recognition, pages 5265–5274. IEEE, June
2018. 5
[49] Q. Wang, W. Zhang, X. Liu, Z. Zhang, M. H. A. Baig,
G. Wang, L. He, and T. Cui. Line matching of wide baseline
images in an affine projection space. International Journal
of Remote Sensing, pages 1–23, July 2019. 2
[50] Z. Wang, F. Wu, and Z. Hu. MSLD: A robust descriptor for
line matching. Pattern Recognition, 42(5):941 – 953, 2009.
2,4
[51] J. Xie, M. Kiefel, M.-T. Sun, and A. Geiger. Semantic in-
stance annotation of street scenes by 3d to 2d label transfer.
In Proceedings of Conference on Computer Vision and Pat-
tern Recognition, 2016. 2,6,8
[52] Y. Xu, M. Gong, T. Liu, K. Batmanghelich, and C. Wang.
Robust angular local descriptor learning. In C. Jawahar,
H. Li, G. Mori, and K. Schindler, editors, Proceedings
of Asian Conference on Computer Vision, pages 420–435,
Perth, Australia, 2019. Springer. 1,2,4,5,6
[53] Z. Xu, B.-S. Shin, and R. Klette. Accurate and ro-
bust line segment extraction using minimum entropy with
hough transform. IEEE Transactions on Image Processing,
24(3):813–822, 2015. 1
[54] N. Xue, T. Wu, S. Bai, F. Wang, G.-S. Xia, L. Zhang, and
P. H. Torr. Holistically-attracted wireframe parsing. In
Proceedings of Conference on Computer Vision and Pattern
Recognition, pages 2785–2794, 2020. 1
[55] K. M. Yi, E. Trulls Fortuny, V. Lepetit, and P. Fua. LIFT:
Learned invariant feature transform. In Proceedings of Euro-
pean Conference on Computer Vision, volume 9910 of Lec-
ture Notes in Computer Science, pages 467–483, Amster-
dam, Netherlands, Oct. 2016. Springer. 1
[56] S. Zagoruyko and N. Komodakis. Learning to compare im-
age patches via convolutional neural networks. In Proceed-
ings of Conference on Computer Vision and Pattern Recog-
nition, pages 4353–4361, June 2015. 1
[57] L. Zhang and R. Koch. An efficient and robust line segment
matching approach based on LBD descriptor and pairwise
geometric consistency. Journal of Visual Communication
and Image Representation, 24(7):794 – 805, 2013. 2,4
[58] X. Zhang, F. X. Yu, S. Karaman, and S. Chang. Learning dis-
criminative and transformation covariant local feature detec-
tors. In Proceedings of Conference on Computer Vision and
Pattern Recognition, pages 4923–4931. IEEE, July 2017. 4
[59] Y. Zhou, H. Qi, and Y. Ma. End-to-end wireframe parsing.
In Proceedings of International Conference on Computer Vi-
sion. IEEE, Oct. 2019. 1