A Benchmark for 3D Reconstruction from Aerial Imagery
in an Urban Environment
Susana Ruano1, Aljosa Smolic1
1V-SENSE, Trinity College Dublin, Ireland
ruanosas, smolica@scss.tcd.ie
Keywords: Benchmark, 3D Reconstruction, Structure-from-Motion, Multi-View Stereo
Abstract: This paper presents a novel benchmark to evaluate 3D reconstruction methods using aerial images in a large-
scale urban scenario. In particular, it presents an evaluation of open-source state-of-the-art pipelines for image-
based 3D reconstruction including, for the first time, an analysis per urban object category. Therefore, the
standard evaluation presented in generalist image-based reconstruction benchmarks is extended and adapted
to the city. Furthermore, our benchmark uses the densest annotated LiDAR point cloud available at city scale
as ground truth and the imagery captured alongside. Additionally, an online evaluation server will be made
available to the community.
1 INTRODUCTION
Creating 3D models from a collection of images is a classic problem in computer vision (Hartley and Zisserman, 2003). It has been extensively studied (Snavely et al., 2008; Sweeney et al., 2015; Moulon et al., 2013; Schönberger and Frahm, 2016; Furukawa and Ponce, 2010; Barnes et al., 2009; Schönberger et al., 2016) and it is used in fields such as augmented and virtual reality (Ruano et al., 2017; Pagés et al., 2018). The common procedure to generate a 3D reconstruction from a collection of images begins with the detection and matching of features, which are given as input to a Structure-from-Motion (SfM) algorithm to recover the camera poses and a sparse model (Snavely et al., 2008). Then, a Multi-View Stereo (MVS) algorithm is applied to densify the skeleton of the reconstruction given by the sparse 3D point cloud (Hartley and Zisserman, 2003). The methods can be configured according to the particularities of the scene (e.g., the smoothness of the terrain (Ruano et al., 2014)), but most of them consider general scenarios and, especially, the open-source ones are widely used (Stathopoulou et al., 2019).
Despite the constant and recent progress in 3D reconstruction techniques, one of the problems that has been pointed out is the lack of ground-truth models to test the algorithms (Schops et al., 2017; Knapitsch et al., 2017). Such models are difficult to obtain with the appropriate density and quality because their collection usually relies on active techniques that require special equipment (e.g., LiDAR), which is not always accessible. Still, some efforts were made in the past to fulfill those needs and several benchmarks were created (Seitz et al., 2006; Schops et al., 2017; Knapitsch et al., 2017). These range from the first widely known Middlebury benchmark (Seitz et al., 2006), with only two pieces of ground truth captured indoors in a controlled environment, to the latest releases such as ETH3D (Schops et al., 2017) and Tanks and Temples (Knapitsch et al., 2017), which include outdoor areas.
A commonality in recent benchmarks is that the challenges for the reconstruction methods are based on the variety of scenarios they provide for the evaluation. Although this is a valid strategy, urban scenarios can themselves provide a great variety of challenges, even within the same city, due to the heterogeneity of elements found there. As suggested in (Zolanvari et al., 2019), analysing a reconstruction in terms of categories of city elements is of interest to support, with quantitative measurements, the observations that can be made about different parts of the city. Fortunately, the interest of the community in deep learning techniques for 3D data has pushed the creation of ground-truth data with a categorization of urban elements, but there are still new benchmarks focusing on urban environments which do not incorporate this information in their 3D reconstruction analysis (Özdemir et al., 2019).
In this paper, we present a benchmark that offers an extensive evaluation of 3D reconstruction algorithms thanks to a broad study of a large urban environment. It uses as ground truth the densest annotated point cloud available at city scale, DublinCity (Zolanvari et al., 2019), whose remarkable level of detail in the annotations allows image-based 3D reconstruction methods to be evaluated not only at scene level but also per category of urban element, which is a unique feature compared to previously available benchmarks. The benchmark includes three different sets of images created from aerial images with two different camera configurations. Furthermore, we provide an online evaluation service so that new 3D reconstruction methods can be assessed through a common comparative evaluation setup.
2 RELATED WORK
The first widely known benchmark for MVS is Middlebury (Seitz et al., 2006), which consists of two ground-truth models captured with a laser stripe scanner and three different sets of images per model. This benchmark was later extended in (Aanæs et al., 2016), which increased the number of 3D models with corresponding images, with different lighting conditions for each set. However, both of them were captured in a controlled environment and lack any scene outside the laboratory. One benchmark that overcomes the limitation of providing ground truth only in a confined space is the EPFL benchmark (Strecha et al., 2008), which includes realistic scenes captured outdoors with a terrestrial LiDAR. This was an advantage over previous benchmarks, but it only covers a few scenarios, such as a building facade or a fountain. Further, it has the limitation of terrestrial LiDAR, which can only partially cover the scenes.
The ETH3D benchmark (Schops et al., 2017), however, shows a wide variety of indoor and outdoor scenarios. In particular, different areas are covered (e.g., a courtyard, a meadow, a playground) to test the algorithms against different challenges. However, the models were captured with a terrestrial LiDAR and they do not provide details of the tops of the buildings, such as the roofs. Around the same time the Tanks and Temples benchmark (Knapitsch et al., 2017) was released, which also gained a lot of interest. The main difference to ETH3D is that it uses video as input and evaluates the complete pipeline including SfM and MVS, not only MVS, in order to stimulate research that solves the image-based 3D reconstruction problem as a whole, including the estimation of the camera positions.
Figure 1: Areas evaluated in the benchmark. At the top, the areas of the city under study are overlaid on the map. At the bottom, a top view of the ground truth of each area is depicted with a different color per class; Area 1 is on the left and Area 2 is on the right.

Other benchmarks which are solely focused on urban environments are the KITTI benchmark (Menze and Geiger, 2015) and the TerraMobilita/iQmulus benchmark (Vallet et al., 2015). Both of them use 3D Mobile Laser Scanning (MLS) as ground truth. The former includes the images captured alongside to evaluate several tasks such as scene flow, stereo and object detection, but it includes neither aerial data nor a specific MVS evaluation. The latter only provides annotated 3D data, similar to the Paris-rue-Madame dataset (Serna et al., 2014) and the Oakland 3D Point Cloud dataset (Munoz et al., 2009); no images are included, and therefore the evaluation is done for tasks like object segmentation and object classification.
The Toronto/Vaihingen ISPRS benchmark used in (Zhang et al., 2018) is not limited by the use of MLS to create the ground-truth model. However, although a large-scale urban scenario is covered, the 3D building models are not created from the images but from the Aerial Laser Scanning (ALS) point cloud. Also, the Urban Semantic 3D dataset (Bosch et al., 2019) is a large-scale public dataset that includes semantic labels for two large cities, but it uses satellite imagery and does not provide a semantic evaluation of the 3D reconstruction. The ISPRS Test Project on Urban Classification and 3D Building Reconstruction (Rottensteiner et al., 2014) has airborne images, but no more than 20 images are offered per area, the reconstruction task is focused on roofs, the density of the ALS is 6 points/m², and only one type of image is provided per set. One of the most recent annotated LiDAR point clouds captured with ALS is DublinCity (Zolanvari et al., 2019), which also includes the associated airborne images (Laefer et al., 2015). The average density of the ALS point cloud (348.43 points/m²) is remarkable compared to other existing datasets, and the LiDAR point cloud is used as ground truth for the evaluation of an image-based 3D reconstruction, but the annotations are not included in that evaluation.

Figure 2: Hidden areas. On the left, an example of a hidden area (in color) over the whole region (in grey). On the right, the hidden area represented in 3D.
3 BENCHMARK DATASET
Area selection. The initial dataset of Dublin city center (Laefer et al., 2015) and the annotations incorporated in (Zolanvari et al., 2019) are used for the creation of this benchmark. We analysed the whole annotated dataset (the potential ground truth) to ensure that the selected areas constitute a good representation of the city and that they have a balanced distribution of points in each category, so that content diversity and representativeness are guaranteed for an evaluation per urban object category.

Figure 3: Images. (a) A sample of the nadir set of images; (b) a sample of the oblique one.

The criteria used to choose the specific tiles for evaluation are based on the area of the city covered, the variety of city elements in them, and the avoidance of potential sources of error. In particular, we discarded the tiles covering less than 250 × 250 m² of the city, the ones with less than 1% of points corresponding to grass, and the ones with less than 4% of points corresponding to trees. We also discarded tiles with more than 10% of points in the undefined class, as well as those containing cranes, because they may degrade performance. With these criteria two tiles are left, which will be the areas used in this benchmark (shown in Fig. 1); they provide a representative part of the city while keeping the amount of data to be processed reasonable.
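For illustration, these selection criteria can be expressed as a simple filter over per-tile class statistics. The sketch below is not the authors' script: the tile dictionary layout, its field names and the helper itself are hypothetical, while the 250 × 250 m² footprint and the percentage thresholds come from the text above.

```python
# Hypothetical sketch of the tile-selection criteria described above. Each tile
# is assumed to be a dict with its footprint (in metres), the percentage of
# LiDAR points per annotated class, and a crane flag; none of these field names
# come from the benchmark itself.

def select_tiles(tiles):
    """Keep only the tiles that pass the selection criteria of the benchmark."""
    kept = []
    for tile in tiles:
        pct = tile["class_percentages"]                 # e.g. {"grass": 5.9, "tree": 13.1, ...}
        if tile["width_m"] * tile["height_m"] < 250 * 250:
            continue                                    # covers too small a part of the city
        if pct.get("grass", 0.0) < 1.0 or pct.get("tree", 0.0) < 4.0:
            continue                                    # not enough vegetation for a balanced evaluation
        if pct.get("undefined", 0.0) > 10.0:
            continue                                    # too many unlabelled points (error source)
        if tile.get("contains_cranes", False):
            continue                                    # cranes may degrade performance
        kept.append(tile)
    return kept
```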
Hidden areas. We also evaluate a specific region in each area that will not be revealed to the user, to avoid fine tuning during the online evaluation (see Section 4). We call these regions hidden areas and we show an example of a potential hidden area in Fig. 2. As can be seen, they consist of meaningful sections of the city spread across the ground truth (each of them including several classes). We selected them taking the class distribution into account, but the exact portions of the ground truth that belong to the hidden area are not revealed.
Area description. We evaluate two different regions of the city of Dublin (depicted in red in Fig. 1), which were selected following the aforementioned analysis. Area 1 encompasses the south-west part of the Trinity College Dublin (TCD) campus, several streets of the city, buildings and green areas. Area 2 includes the north-east of the TCD campus, different buildings and streets, and parts of rail tracks.
Table 1 shows the distribution of the classes in terms of the number of points and the percentage of points in the whole ground truth and in the hidden areas. It can be seen that the most populated class is roof and the one with the least representation is door. Nevertheless, the percentage of the roof class in the hidden areas is more balanced with the other classes. It can also be observed that the undefined data, which is a potential source of inaccuracies, represents between 5% and 9% of the points. Area 1 has almost twice as many points in the window category as Area 2, and almost four times as many in the roof window category. Furthermore, Area 2 roughly triples the number of points associated with grass in Area 1.
For each area, three different sets of images are used to create the reconstructions: oblique, nadir and combined. The oblique and nadir sets contain images from the homonymous groups described in (Zolanvari et al., 2019), and a sample of each group is shown in Fig. 3. The combined set contains the images from both of them.
Table 1: Number and percentage of points per class in each evaluated area, together with the number and percentage of points used as the hidden zone.

Area 1
Class      | # points (×10³) | percentage | hidden # points (×10³) | hidden percentage
undefined  |  1397           |  5.27      |  1015                  |  7.06
facade     |  3048           | 11.49      |  1816                  | 12.63
window     |   540           |  2.04      |   312                  |  2.17
door       |    40           |  0.15      |    27                  |  0.19
roof       | 10727           | 40.43      |  2849                  | 19.81
r. window  |   481           |  1.81      |   247                  |  1.72
r. door    |     3           |  0.01      |     2                  |  0.01
sidewalk   |  2029           |  7.65      |  1602                  | 11.14
street     |  3210           | 12.10      |  2441                  | 16.97
grass      |  1587           |  5.98      |  1353                  |  9.41
tree       |  3471           | 13.08      |  2720                  | 18.91

Area 2
Class      | # points (×10³) | percentage | hidden # points (×10³) | hidden percentage
undefined  |  1615           |  6.46      |  1262                  |  9.05
facade     |  2776           | 11.10      |  1547                  | 11.10
window     |   308           |  1.23      |   220                  |  1.58
door       |    40           |  0.16      |    27                  |  0.19
roof       |  7807           | 31.22      |  2708                  | 19.43
r. window  |   112           |  0.45      |    47                  |  0.34
r. door    |     3           |  0.01      |     1                  |  0.01
sidewalk   |  2789           | 11.15      |  2010                  | 14.42
street     |  2642           | 10.57      |  1993                  | 14.30
grass      |  4061           | 16.24      |  2226                  | 15.97
tree       |  2852           | 11.41      |  1897                  | 13.61
The initial dataset contains a large number of oblique and nadir images, many more than needed for a meaningful 3D reconstruction of a particular area. Further, it is not obvious which images are associated with a certain area. In order to identify a meaningful subset of images, we ran the COLMAP SfM algorithm (Schönberger and Frahm, 2016) with all the oblique images of the initial dataset and then selected the subset of images where each image generated at least 1500 3D points in the areas under evaluation. The same process was done with the nadir images.
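As a rough sketch of this selection rule, the snippet below counts, for every image, how many triangulated points of a COLMAP sparse model fall inside the evaluated area and keeps the images that contribute at least 1500 of them. It parses COLMAP's documented points3D.txt text export; the axis-aligned bounding box used to delimit the area is an assumption made for illustration.

```python
# Sketch of the image-selection rule described above: keep an image only if it
# contributed at least 1500 triangulated 3D points inside the evaluated area.
# It parses COLMAP's text export (points3D.txt); the axis-aligned bounding box
# is a stand-in for the real evaluation area, which is an assumption here.
from collections import Counter

def select_images(points3d_txt, area_bbox, min_points=1500):
    xmin, ymin, xmax, ymax = area_bbox
    counts = Counter()
    with open(points3d_txt) as f:
        for line in f:
            if line.startswith("#"):
                continue
            fields = line.split()
            x, y = float(fields[1]), float(fields[2])
            if not (xmin <= x <= xmax and ymin <= y <= ymax):
                continue
            # the track is stored as (IMAGE_ID, POINT2D_IDX) pairs after the error column
            track = fields[8:]
            for image_id in track[0::2]:
                counts[int(image_id)] += 1
    return {img for img, n in counts.items() if n >= min_points}
```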
4 EVALUATION
Pipelines tested. In this benchmark, we evaluate the dense 3D point cloud obtained when an image-based reconstruction pipeline is applied to a collection of aerial images. Usually, research focuses on solving one specific problem of the pipeline: SfM or MVS. However, we want to allow the evaluation of new techniques that solve the problem with a different approach, not necessarily applying SfM followed by MVS. Furthermore, as we are not in a controlled environment, we do not have camera positions accurate enough to be used as ground truth; instead, we have the GPS, which is only a coarse approximation. For these reasons, we only evaluate the final 3D reconstruction, not the intermediate steps.
For the state-of-the-art evaluation we selected different pipelines that were studied in the latest comparison of open-source 3D reconstruction methods (Stathopoulou et al., 2019). Following this approach, we established the pipelines by assembling compatible SfM and MVS methods. In particular, for SfM we use COLMAP (Schönberger and Frahm, 2016), which includes a geometric verification strategy to improve robustness of initialization and triangulation, and an improved bundle adjustment strategy with outlier filtering. Besides, we selected OpenMVG with two different approaches: global (Moulon et al., 2013), with a global calibration approach based on the fusion of relative motions between image pairs; and incremental (Moulon et al., 2012), which iteratively adds new estimations to an initial reconstruction, minimizing the drift with successive steps of non-linear refinement. Furthermore, for MVS we choose COLMAP (Schönberger et al., 2016), which makes a joint estimation of depth and normal information and performs a pixelwise view selection using photometric and geometric priors. We also use OpenMVS (Schönberger et al., 2016), which uses an efficient patch-based stereo matching followed by a depth-map refinement process. In (Stathopoulou et al., 2019), AliceVision was also tested, but we do not include it in our evaluation because it directly produces a mesh, without providing a dense point cloud representation as an exploitable intermediate step. However, AliceVision is based on the incremental OpenMVG (Moulon et al., 2012) and on CMPMVS (Jancosek and Pajdla, 2011), which can handle weakly textured surfaces; we selected the former method for SfM, but the latter is no longer publicly available. To sum up, Fig. 4 shows the details of the six pipelines tested in the benchmark: (1) COLMAP (SfM) + COLMAP (MVS); (2) COLMAP (SfM) + OpenMVS; (3) OpenMVG-g + COLMAP (MVS); (4) OpenMVG-g + OpenMVS; (5) OpenMVG-i + COLMAP (MVS); (6) OpenMVG-i + OpenMVS. The versions of COLMAP, OpenMVG and OpenMVS used are v3.6, v1.5 and v1.1, respectively, and, as depicted in the figure, four stages are needed: SfM, geo-registration, data preparation and MVS. The SfM step includes the feature detection and matching provided by each of the methods used. The parameters used for COLMAP are the same as in DublinCity (Zolanvari et al., 2019).
Figure 4: Scheme of the 3D reconstruction pipelines tested. (1) COLMAP+COLMAP, (2) COLMAP+OpenMVS, (3) OpenMVG-g+COLMAP, (4) OpenMVG-g+OpenMVS, (5) OpenMVG-i+COLMAP, (6) OpenMVG-i+OpenMVS.
Figure 5: Registration. Skyline of the ground-truth (in red)
against the 3D reconstructions with oblique (in green), nadir
(in purple) and the combined (in blue) sets of images. On
the top, the coarse registration, at the bottom, the fine one.
The parameters used for OpenMVG are the defaults: SIFT for feature detection, essential-matrix filtering of the matches for the global pipeline, and fundamental-matrix filtering for the incremental one. After SfM, the geo-information of the dataset is used for a coarse registration of the point cloud (i.e., geo-registration), using the methods provided by COLMAP and OpenMVG.
After the geo-registration, the data have to be prepared for densification. For that reason, we use the procedures proposed by each method for converting the formats and undistorting the images, using the same parameters every time the same procedure is applied. Finally, the last step is the densification, which is done with the recommended parameters in COLMAP and OpenMVS, except when an image reduction was needed for OpenMVS; in that case, we used the parameters reported in the ETH3D benchmark that were not hard-coded.
Alignment. The strategies for aligning the 3D reconstructions with the ground truth usually consist of two steps: a coarse alignment followed by a refinement of the initial estimation. In our evaluation, the coarse registration is done in the geo-registration step of the pipelines, so the dense point clouds generated are already coarsely registered with the LiDAR scan. As an example, the results of the coarse registration with the COLMAP + COLMAP pipeline are shown at the top of Fig. 5. The skyline of the ground truth is depicted in red and the 3D reconstructions in blue (oblique images), green (nadir images) and purple (combined).
The refinement of the registration is commonly done by applying a 7DoF ICP algorithm; this is the strategy followed in (Knapitsch et al., 2017). A more sophisticated approach is used in (Schops et al., 2017), but it relies on the color information of the laser scan, which is not available in our benchmark. In our approach, we use the point cloud obtained from the first pipeline to refine the coarse registration against the ground truth with an ICP algorithm. Then, for the rest of the pipelines, we use the camera positions of the already refined one as a reference and apply the same ICP algorithm to obtain the transformations that align the cameras. After that, we apply the transformation to the entire 3D point clouds. For online evaluation using our benchmark, we will require the input 3D point cloud to be already registered, which allows users to use and optimize their own registration.
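A minimal sketch of this refinement step, assuming the Open3D library, is shown below: the coarsely geo-registered reconstruction is aligned to the LiDAR ground truth with a point-to-point ICP that also estimates scale (i.e., a 7DoF similarity transform). The correspondence threshold and the down-sampling voxel size are illustrative values, not the ones used in the benchmark.

```python
# Minimal sketch of refining the coarse geo-registration with ICP, assuming
# Open3D. with_scaling=True yields a similarity (7DoF) transform; threshold and
# voxel size are illustrative values rather than the benchmark's settings.
import numpy as np
import open3d as o3d

def refine_registration(recon_ply, gt_ply, max_corr_dist=1.0, voxel=0.25):
    source = o3d.io.read_point_cloud(recon_ply)   # coarsely geo-registered reconstruction
    target = o3d.io.read_point_cloud(gt_ply)      # LiDAR ground truth
    src = source.voxel_down_sample(voxel)         # down-sample for a faster ICP
    tgt = target.voxel_down_sample(voxel)
    result = o3d.pipelines.registration.registration_icp(
        src, tgt, max_corr_dist, np.eye(4),
        o3d.pipelines.registration.TransformationEstimationPointToPoint(with_scaling=True))
    source.transform(result.transformation)       # apply the refined transform to the full cloud
    return source
```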
Figure 6: Qualitative 3D reconstruction results. Point clouds obtained with the oblique and nadir images combined in Area 1 with each of the pipelines tested: (a) COLMAP + COLMAP, (b) COLMAP + OpenMVS, (c) OpenMVG-g + COLMAP, (d) OpenMVG-g + OpenMVS, (e) OpenMVG-i + COLMAP, (f) OpenMVG-i + OpenMVS.

Measurements. For the evaluation we use the measurements proposed in (Knapitsch et al., 2017; Zolanvari et al., 2019). In particular, we use precision, P, recall, R, and F score, F. The precision shows how close the reconstruction is to the ground truth, the recall is related to how complete the reconstruction is, and the F score is a combination of both. Other measurements, such as the mean distance between the point clouds, could be used as in (Stathopoulou et al., 2019), but the advantage of the selected ones is that they are less affected by outliers. They are defined in Eq. (1), Eq. (2) and Eq. (3), respectively, for a given threshold distance d. In the equations, $I$ is the point cloud under evaluation, $G$ is the ground-truth point cloud, $|\cdot|$ denotes cardinality, $\mathrm{dist}_{I \rightarrow G}(d)$ is the set of points in $I$ with a distance to $G$ of less than $d$, and $\mathrm{dist}_{G \rightarrow I}(d)$ is analogous, i.e., $\mathrm{dist}_{A \rightarrow B}(d) = \{ a \in A \mid \min_{b \in B} \lVert a - b \rVert_2 < d \}$, with $A$ and $B$ being point clouds.

\[ P(d) = \frac{|\mathrm{dist}_{I \rightarrow G}(d)|}{|I|} \cdot 100 \tag{1} \]

\[ R(d) = \frac{|\mathrm{dist}_{G \rightarrow I}(d)|}{|G|} \cdot 100 \tag{2} \]

\[ F(d) = \frac{2\,P(d)\,R(d)}{P(d) + R(d)} \tag{3} \]

To perform the evaluation per class, each point under evaluation is assigned the same class as its nearest neighbor in the ground truth.
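For reference, the sketch below computes these three measurements and the per-class label transfer just described, assuming the point clouds are given as NumPy arrays; the use of SciPy's KD-tree is our own implementation choice for illustration, not necessarily the benchmark's.

```python
# Sketch of the precision/recall/F-score computation of Eqs. (1)-(3) and of the
# per-class label transfer, assuming NumPy arrays of shape (N, 3) for the
# reconstruction I and ground truth G. SciPy's KD-tree handles the
# nearest-neighbour queries (an implementation choice, not the benchmark's).
import numpy as np
from scipy.spatial import cKDTree

def evaluate(I, G, d=0.25, G_labels=None):
    gt_tree = cKDTree(G)
    dist_i_to_g, nn_idx = gt_tree.query(I)            # distance from each point of I to G
    dist_g_to_i, _ = cKDTree(I).query(G)               # distance from each point of G to I
    P = 100.0 * np.mean(dist_i_to_g < d)               # Eq. (1): precision / accuracy
    R = 100.0 * np.mean(dist_g_to_i < d)               # Eq. (2): recall / completeness
    F = 2 * P * R / (P + R) if (P + R) > 0 else 0.0    # Eq. (3): F score
    # Per-class evaluation: each reconstructed point inherits the class of its
    # nearest neighbour in the annotated ground truth.
    I_labels = G_labels[nn_idx] if G_labels is not None else None
    return P, R, F, I_labels
```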
Online evaluation. We will provide online evaluation of the 3D reconstructions (https://v-sense.scss.tcd.ie/research/6dof/benchmark-3D-reconstruction) to stimulate and support progress in the field. Users should provide the final dense point cloud already registered to the ground truth; this is possible because the ground truth will be publicly available. To ensure a fair comparison, we will calculate the precision, recall and F score not only on the complete ground truth but also on the hidden areas.
5 RESULTS
We report the precision, recall and F score values for each selected pipeline, set of images and area, including the hidden parts. The measurements were calculated with d in the range of 1 cm to 100 cm, obtaining, as expected, increasing performance for every method as the distance grows. We report the results at 25 cm, similar to (Zolanvari et al., 2019), as a good compromise between the limitations of the image resolution and the meaningfulness of the precision: selecting a very small distance would mean poorer performance for all the methods, whereas with a larger distance the precision would be less informative.
Scene-level evaluation. Table 2 shows the evaluation of the pipelines' outcomes against the ground-truth models without considering the urban element category. From these measurements, we can observe that reconstructions with oblique imagery as input achieve the lowest recall values in all the areas. Therefore, the precision plays the main role in achieving a high F score for this image-set category. In particular, COLMAP + COLMAP has the best performance (F score) with that type of imagery in every area. Additionally, the reconstructions done with the nadir image set have higher recall values than the ones with the oblique imagery; the recall is also usually higher than the precision for these sets, so the accuracy is not as determinant as in the oblique sets. As can be observed, COLMAP + COLMAP is the best pipeline in Area 1, whereas OpenMVG-i + OpenMVS is the best in Area 2 (even in the hidden parts). These results suggest that having different camera angles and less coverage of the same parts of the scene (as is the case in the oblique set and not in the nadir one) makes the recall decrease while the precision remains similar.
Area 1 Area 2
oblique nadir oblique and nadir oblique nadir oblique and nadir
(1) 79.18 / 60.5 / 68.59 73.08 / 68.98 / 70.97 74.89 / 74.15 / 74.52 80.48 / 65.51 / 72.23 74.97 / 72.34 / 73.63 76.54 / 77.98 / 77.25
(2) 22.74 / 28.28 / 25.21 23.96 / 46.23 / 31.57 27.93 / 60.84 / 38.29 26.85 / 41.81 / 32.7 24.45 / 49.96 / 32.83 24.7 / 63.53 / 35.57
(3) 49.42 / 13.09 / 20.69 44.02 / 47.95 / 45.9 48.74 / 58.24 / 53.07 36.92 / 15.7 / 22.03 33.26 / 36.4 / 34.76 41.94 / 50.3 / 45.74
(4) 61.07 / 57.1 / 59.02 56.61 / 73.46 / 63.94 78.27 /80.49 /79.36 40.23 / 54.27 / 46.21 75.2 / 75.92 / 75.56 79.3 / 82.03 / 80.64
(5) 37.13 / 16.59 / 22.94 36.59 / 43.62 / 39.8 39.48 / 51.36 / 44.64 38.11 / 15.11 / 21.64 43.76 / 52.0 / 47.52 36.62 / 48.53 / 41.74
(6) 55.12 / 64.37 / 59.39 49.14 / 70.75 / 58.0 74.19 / 79.5 / 76.76 58.52 / 70.43 / 63.92 71.44 / 79.0 / 75.03 79.77 /82.54 /81.13
hidden Area 1 hidden Area 2
oblique nadir oblique and nadir oblique nadir oblique and nadir
(1) 78.68 / 49.89 / 61.06 72.69 / 62.77 / 67.37 74.54 / 68.34 / 71.3 80.06 / 61.48 / 69.55 73.63 / 68.9 / 71.18 75.5 / 75.27 / 75.39
(2) 23.55 / 18.97 / 21.01 24.42 / 36.88 / 29.39 28.38 / 50.45 / 36.32 27.05 / 37.02 / 31.26 24.24 / 43.98 / 31.25 24.4 / 56.88 / 34.15
(3) 43.12 / 6.61 / 11.46 42.52 / 38.01 / 40.14 48.48 / 48.74 / 48.61 36.47 / 13.19 / 19.37 33.08 / 31.82 / 32.43 40.97 / 43.7 / 42.29
(4) 56.57 / 48.7 / 52.34 56.03 / 69.48 / 62.03 75.36 /75.76 /75.56 40.31 / 54.08 / 46.19 74.93 / 74.31 / 74.62 77.19 / 79.52 / 78.34
(5) 30.83 / 8.88 / 13.79 36.51 / 35.47 / 35.98 38.79 / 42.08 / 40.37 35.24 / 12.3 / 18.24 43.7 / 47.28 / 45.42 35.35 / 42.39 / 38.55
(6) 52.93 / 59.33 / 55.95 48.47 / 67.81 / 56.53 70.56 / 75.02 / 72.72 56.16 / 67.58 / 61.34 67.99 / 77.08 / 72.25 78.38 /80.49 /79.42
Table 2: Quantitative results with the whole ground truth. Each row shows the results of a specific 3D reconstruction pipeline, giving the precision / recall / F score at d = 25 cm obtained for the reconstruction with each set of images in each area. The best score for each area and image set is in bold, and the pipelines are as follows: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS.
F SCORE
Area 1 - Area 1 hidden      Area 2 - Area 2 hidden
oblique nadir combined oblique nadir combined
facade 58.34 (1) - 57.1 (1) 59.96 (4) - 59.08 (4) 67.01 (4) - 65.91 (4) 63.12 (1) - 62.17 (1) 67.7 (6) - 65.82 (6) 71.73 (6) -70.83 (6)
window 58.83 (1) - 58.18 (1) 53.92 (4) - 53.83 (4) 62.41 (1) - 62.45 (1) 55.06 (1) - 57.2 (1) 61.73 (6) - 60.9 (6) 63.34 (6) -63.59 (6)
door 41.97 (1) - 45.46 (1) 51.55 (4) - 52.55 (4) 53.14 (4) - 52.96 (4) 49.18 (6) - 48.63 (6) 53.84 (6) - 51.07 (6) 56.57 (6) -56.69 (6)
roof 80.78 (1) - 75.55 (1) 80.18 (1) - 77.39 (1) 87.98 (4) -85.55 (4) 81.57 (1) - 80.66 (1) 81.41 (1) - 80.06 (1) 86.05 (6) - 85.53 (6)
r. window 77.33 (1) - 77.67 (1) 73.78 (1) - 74.68 (1) 85.06 (4) -85.04 (4) 74.54 (1) - 74.53 (1) 76.36 (6) - 78.54 (6) 83.0 (4) - 84.89 (6)
r. door 57.72 (1) - 65.94 (4) 53.58 (4) - 58.63 (3) 63.83 (4) -69.3 (4) 50.1 (6) - 45.14 (1) 53.29 (6) - 41.27 (4) 50.71 (6) - 43.13 (6)
sidewalk 76.46 (6) - 74.99 (6) 79.76 (1) - 79.85 (1) 87.86 (4) - 87.42 (4) 79.11 (1) - 80.05 (1) 87.56 (4) - 87.14 (4) 88.28 (6) -88.03 (6)
street 77.28 (1) - 74.91 (1) 85.85 (1) - 85.9 (1) 90.72 (4) - 90.78 (4) 71.02 (1) - 68.67 (1) 90.88 (4) - 90.55 (4) 90.81 (6) - 90.5 (6)
grass 84.12 (6) - 82.76 (6) 79.75 (1) - 77.38 (1) 88.79 (4) - 87.97 (4) 91.1 (1) - 89.1 (1) 95.59 (4) - 93.6 (4) 96.36 (6) -94.91 (6)
tree 31.74 (6) - 32.55 (6) 38.47 (4) - 38.37 (4) 40.92 (1) - 40.21 (6) 25.34 (1) - 28.09 (1) 33.7 (1) - 35.08 (1) 39.45 (1) - 41.34 (1)
PRECISION
Area 1 - Area 1 hidden      Area 2 - Area 2 hidden
oblique nadir combined oblique nadir combined
facade 75.23 (1) -75.84 (1) 61.32 (1) - 62.59 (1) 66.9 (1) - 68.04 (1) 73.77 (1) - 74.81 (1) 66.71 (6) - 64.7 (6) 68.33 (1) - 69.01 (1)
window 64.97 (1) - 65.13 (1) 54.16 (1) - 56.04 (1) 60.75 (1) - 62.14 (1) 63.77 (1) - 63.84 (1) 56.53 (6) - 55.17 (6) 59.09 (1) - 59.16 (1)
door 59.1 (1) -58.75 (1) 42.31 (4) - 43.6 (4) 47.34 (1) - 46.05 (1) 55.49 (1) - 54.64 (1) 49.25 (6) - 45.61 (6) 48.06 (4) - 48.44 (6)
roof 81.56 (1) - 80.58 (1) 77.05 (1) - 74.96 (1) 84.08 (4) - 80.21 (4) 83.21 (1) - 83.35 (1) 78.56 (1) - 77.46 (1) 81.16 (6) - 80.91 (6)
r. window 78.96 (1) - 80.13 (1) 70.18 (1) - 72.01 (1) 79.43 (4) - 78.29 (4) 81.25 (1) - 83.9 (1) 80.25 (1) - 82.71 (1) 80.65 (1) - 84.08 (1)
r. door 63.5 (1) - 62.5 (1) 48.32 (1) - 55.11 (4) 54.66 (1) - 59.58 (1) 65.19 (1) -63.67 (1) 45.04 (6) - 40.6 (4) 45.99 (6) - 40.99 (1)
sidewalk 88.51 (1) -88.48 (1) 80.37 (1) - 81.26 (1) 83.35 (4) - 83.16 (1) 87.79 (1) - 87.55 (1) 83.66 (4) - 83.26 (4) 82.67 (6) - 82.32 (6)
street 90.97 (1) -90.86 (1) 83.27 (1) - 83.28 (1) 86.43 (4) - 86.35 (4) 87.23 (1) - 86.92 (1) 86.71 (4) - 86.46 (4) 85.91 (6) - 85.87 (6)
grass 92.32 (1) - 91.33 (1) 84.18 (1) - 82.37 (1) 89.35 (4) - 88.73 (4) 95.91 (1) - 94.75 (1) 96.11 (4) - 94.39 (4) 95.88 (4) - 94.15 (6)
tree 78.37 (1) - 77.2 (1) 75.41 (4) - 75.15 (4) 74.22 (1) - 72.94 (1) 80.73 (1) -80.49 (1) 77.41 (6) - 76.66 (6) 77.63 (1) - 77.23 (1)
RECALL
Area 1 - Area 1 hidden      Area 2 - Area 2 hidden
oblique nadir combined oblique nadir combined
facade 47.64 (1) - 45.78 (1) 63.92 (4) - 62.69 (4) 72.38 (4) - 71.14 (4) 60.48 (6) - 56.97 (6) 68.72 (6) - 66.99 (6) 77.28 (6) -75.63 (6)
window 53.75 (1) - 52.58 (1) 62.35 (4) - 61.28 (4) 78.64 (4) - 77.22 (4) 56.4 (6) - 54.72 (6) 67.99 (6) - 67.96 (6) 77.05 (6) - 77.71 (6)
door 40.86 (6) - 40.72 (6) 65.95 (4) - 66.12 (4) 68.76 (4) - 67.99 (4) 51.1 (6) - 54.05 (6) 59.37 (6) - 58.03 (6) 68.94 (6) -68.34 (6)
roof 80.03 (1) - 76.27 (6) 83.82 (4) - 83.92 (4) 92.25 (4) -91.65 (4) 80.08 (6) - 81.01 (6) 87.8 (6) - 88.35 (6) 91.57 (6) - 90.72 (6)
r. window 75.76 (1) - 84.01 (6) 77.75 (1) - 77.57 (1) 91.54 (4) - 93.07 (4) 70.23 (6) - 72.07 (6) 79.95 (6) - 84.97 (6) 85.78 (4) - 87.82 (6)
r. door 56.41 (4) - 71.34 (4) 63.3 (3) - 70.56 (3) 83.51 (4) -92.79 (4) 60.73 (6) - 35.79 (6) 65.25 (6) - 42.42 (6) 65.8 (4) - 56.62 (6)
sidewalk 79.36 (6) - 77.95 (6) 90.37 (4) - 91.4 (4) 92.88 (4) - 92.56 (4) 84.66 (6) - 84.74 (6) 93.77 (6) - 93.31 (6) 94.71 (6) -94.59 (6)
street 80.46 (6) - 79.17 (6) 92.97 (4) - 92.75 (4) 95.45 (4) - 95.68 (4) 77.66 (6) - 73.99 (6) 95.47 (4) - 95.05 (4) 96.31 (6) - 95.66 (6)
grass 81.14 (6) - 79.23 (6) 75.77 (1) - 74.63 (4) 88.45 (6) - 87.43 (6) 86.75 (1) - 84.09 (1) 95.08 (4) - 92.82 (4) 96.88 (6) -95.67 (6)
tree 20.94 (6) - 21.65 (6) 25.82 (4) - 25.8 (6) 29.44 (2) - 28.76 (2) 15.03 (1) - 17.01 (1) 21.62 (1) - 22.82 (1) 26.58 (2) - 30.28 (2)
Table 3: Quantitative results per urban element. Results for F score (top), precision (middle) and recall (bottom) for Area 1 and Area 2 at d = 25 cm. Each row gives the result for a particular class, first in the whole area and then in the hidden area, separated by a hyphen. The best score obtained among the pipelines is shown with the pipeline that generated it in brackets, numbered as in Fig. 4. In bold, the best measurement per class among all the image sets in both areas. The pipelines are as follows: (1) COLMAP + COLMAP, (2) COLMAP + OpenMVS, (3) OpenMVG-g + COLMAP, (4) OpenMVG-g + OpenMVS, (5) OpenMVG-i + COLMAP, (6) OpenMVG-i + OpenMVS.
For the reconstructions obtained with the combined imagery, OpenMVG-g + OpenMVS and OpenMVG-i + OpenMVS have the highest F scores: the former in Area 1 and the latter in Area 2. Fig. 6 also includes the qualitative results of the point clouds obtained for the combined dataset in Area 1, rendered under the same configuration (e.g., point size, shading) to make them comparable. From these results, we can observe that the point cloud obtained with COLMAP + COLMAP (Fig. 6 (a)) is sharper than the one obtained with COLMAP + OpenMVS (Fig. 6 (b)), in accordance with the precision values (74.89 and 27.93, respectively). Moreover, there are also differences in the completeness of the reconstructions: OpenMVG-i + OpenMVS (Fig. 6 (f)) and OpenMVG-g + OpenMVS (Fig. 6 (d)) are denser than the rest, and they also seem to be accurate. As before, the F score values confirm these observations, as these two pipelines obtained the highest scores (79.36 and 76.76).
Urban-category-centric evaluation. Additionally, Table 3 presents a summary of the same measurements calculated above (precision, recall and F score), but this time per urban element category. This summary comprises three tables, one per measurement, where each row has the results of a specific class (i.e., urban category) and each column corresponds to a unique set of images. The result presented in each cell is the maximum score obtained among the six pipelines tested (see Fig. 4), and the pipeline that generated it is shown in brackets. The results for the complete area and the hidden one are presented in the same cell, in that order, separated by a hyphen.
Analyzing the results obtained per class across all the image sets available, we can observe that the method that most frequently achieves the maximum precision is COLMAP + COLMAP. The picture is different when looking at recall: in that case, the pipeline OpenMVG-i + OpenMVS is the one that most frequently achieves the highest scores. When looking at the F score, we cannot identify a clearly predominant pipeline for each class. In all the scenarios, the class with the lowest F score values is tree, and these results are strongly influenced by the low recall values. They confirm the hypothesis in (Zolanvari et al., 2019): trees in the parks of the city can degrade the scores of the reconstructions. We can also analyze the results with respect to the image set under study. For example, with the combined set, the pipeline with the best performance in the majority of classes is OpenMVG-g + OpenMVS in Area 1 and OpenMVG-i + OpenMVS in Area 2. These results are in accordance with the ones discussed before, which do not consider the class information (Table 2).
The situation is different when we look at the results with the nadir images. In Area 2, the scores of the OpenMVG-i + OpenMVS pipeline are the highest in most of the classes, but it is not the best pipeline in the scene-level evaluation. This is because OpenMVG-i + OpenMVS is the best with the facade class, with 11.10% of occupancy in the ground truth (see Table 1), but also in classes with 1.23% of occupancy or less. However, OpenMVG-g + OpenMVS achieves the highest scores in the classes grass, sidewalk and street (16.24%, 11.15% and 10.57% of occupancy, respectively), which also have the highest scores among all the classes. Also, if we look at the scores per class with the combined set in Area 1, COLMAP + COLMAP and OpenMVG-g + OpenMVS obtain the highest F scores in the same number of classes; this time, the most populated classes are the ones where COLMAP + COLMAP is better. These results reflect the importance of incorporating an object-category-centric evaluation, since a more detailed analysis can be done. The results of the evaluation per urban category in the hidden part (also shown in Table 3) are different from the ones for the whole area but still follow the same pattern. For example, we can observe that the method that generates the best results per image set and class remains the same for the majority of them, except for the door class, due to its small number of samples.
Evaluation of pipeline components. We can also observe that, in general, the pipelines that obtain the best results are COLMAP + COLMAP, OpenMVG-g + OpenMVS and OpenMVG-i + OpenMVS. These results are in accordance with previous studies that used the same kind of metric, where COLMAP + COLMAP and OpenMVG-i + OpenMVS obtained the best results (Knapitsch et al., 2017). In particular, in that study OpenMVG-g + OpenMVS never performs better than COLMAP + COLMAP, but this situation is plausible in our study given the different camera trajectories (aerial grid configuration vs. circle around an object), software versions and parameters used. COLMAP used as MVS is better than OpenMVS only if it is applied after COLMAP SfM, whereas OpenMVS is better with the other SfM methods tested. This leads to the necessity of testing not only a particular MVS method but a complete pipeline, since it is influenced by the results obtained in the SfM step, the data conversion and preparation for the MVS step, as well as memory and computing limitations. In this benchmark, the results after each step are not evaluated because, as explained in Section 4, we want to encourage the creation of end-to-end solutions and do not focus on any particular part of a pipeline. Similarly, specific processing times are not reported, but for all the pipelines they were on the order of hours due to the quantity of images that had to be processed.
6 CONCLUSION
In this paper, we have presented a novel benchmark for evaluating image-based 3D reconstruction pipelines with aerial images in urban environments. The results obtained with the considered state-of-the-art SfM+MVS pipelines are evaluated at scene level and per urban category. This allows for further analysis of the reconstructions (i.e., analysis of the influence of each urban category on the scene-level scores) and it supports previous hypotheses (e.g., parks can degrade the F score values in a scene-level evaluation) with quantitative measurements. Also, we provide the means for evaluating results in a hidden area to avoid fine tuning of algorithms to the given ground truth. Furthermore, we stimulate and support the evaluation of new approaches for image-based 3D reconstruction, as we do not limit the evaluation to a specific stage of the pipeline (e.g., MVS). Finally, to support the progress of research in the community, we provide the dataset and an online evaluation platform at https://v-sense.scss.tcd.ie/research/6dof/benchmark-3D-reconstruction.
ACKNOWLEDGEMENTS
This publication has emanated from research con-
ducted with the financial support of Science Foun-
dation Ireland (SFI) under the Grant Number
15/RP/2776.
REFERENCES
Aanæs, H., Jensen, R. R., Vogiatzis, G., Tola, E., and Dahl,
A. B. (2016). Large-scale data for multiple-view stere-
opsis. International Journal of Computer Vision.
Barnes, C., Shechtman, E., Finkelstein, A., and Goldman,
D. B. (2009). Patchmatch: A randomized correspon-
dence algorithm for structural image editing. In ACM
Transactions on Graphics (ToG). ACM.
Bosch, M., Foster, K., Christie, G., Wang, S., Hager, G. D.,
and Brown, M. (2019). Semantic stereo for incidental
satellite images. In WACV 2019. IEEE.
Furukawa, Y. and Ponce, J. (2010). Accurate, dense, and ro-
bust multiview stereopsis. IEEE Transactions on Pat-
tern Analysis and Machine Intelligence.
Hartley, R. and Zisserman, A. (2003). Multiple view geom-
etry in computer vision. Cambridge University Press.
Jancosek, M. and Pajdla, T. (2011). Multi-view reconstruc-
tion preserving weakly-supported surfaces. In CVPR
2011. IEEE.
Knapitsch, A., Park, J., Zhou, Q.-Y., and Koltun, V. (2017).
Tanks and temples: Benchmarking large-scale scene
reconstruction. ACM Transactions on Graphics.
Laefer, D. F., Abuwarda, S., Vo, A.-V., Truong-Hong, L.,
and Gharibi, H. (2015). 2015 aerial laser and pho-
togrammetry survey of dublin city collection record.
Menze, M. and Geiger, A. (2015). Object scene flow for
autonomous vehicles. In CVPR 2015.
Moulon, P., Monasse, P., and Marlet, R. (2012). Adaptive
structure from motion with a contrario model estima-
tion. In ACCV 2012. Springer Berlin Heidelberg.
Moulon, P., Monasse, P., and Marlet, R. (2013). Global fu-
sion of relative motions for robust, accurate and scal-
able structure from motion. In ICCV 2013.
Munoz, D., Bagnell, J. A., Vandapel, N., and Hebert, M.
(2009). Contextual classification with functional max-
margin markov networks. In CVPR 2009.
Özdemir, E., Toschi, I., and Remondino, F. (2019). A multi-purpose benchmark for photogrammetric urban 3d reconstruction in a controlled environment.
Pagés, R., Amplianitis, K., Monaghan, D., Ondřej, J., and Smolić, A. (2018). Affordable content creation for free-viewpoint video and vr/ar applications. Journal of Visual Communication and Image Representation.
Rottensteiner, F., Sohn, G., Gerke, M., Wegner, J. D., Bre-
itkopf, U., and Jung, J. (2014). Results of the isprs
benchmark on urban object detection and 3d building
reconstruction. ISPRS journal of photogrammetry and
remote sensing.
Ruano, S., Cuevas, C., Gallego, G., and García, N. (2017). Augmented reality tool for the situational awareness improvement of uav operators. Sensors.
Ruano, S., Gallego, G., Cuevas, C., and García, N. (2014). Aerial video georegistration using terrain models from dense and coherent stereo matching. In Geospatial InfoFusion and Video Analytics IV; and Motion Imagery for ISR and Situational Awareness II. International Society for Optics and Photonics.
Schönberger, J. L. and Frahm, J.-M. (2016). Structure-from-motion revisited. In CVPR 2016.
Schönberger, J. L., Zheng, E., Pollefeys, M., and Frahm, J.-M. (2016). Pixelwise view selection for unstructured multi-view stereo. In ECCV 2016. Springer.
Schops, T., Schonberger, J. L., Galliani, S., Sattler, T.,
Schindler, K., Pollefeys, M., and Geiger, A. (2017).
A multi-view stereo benchmark with high-resolution
images and multi-camera videos. In CVPR 2017.
Seitz, S. M., Curless, B., Diebel, J., Scharstein, D., and
Szeliski, R. (2006). A comparison and evaluation of
multi-view stereo reconstruction algorithms. In CVPR
2006.
Serna, A., Marcotegui, B., Goulette, F., and Deschaud, J.-
E. (2014). Paris-rue-madame database: a 3d mobile
laser scanner dataset for benchmarking urban detec-
tion, segmentation and classification methods.
Snavely, N., Seitz, S. M., and Szeliski, R. (2008). Modeling
the world from internet photo collections. Interna-
tional Journal of Computer Vision.
Stathopoulou, E. K., Welponer, M., and Remondino, F.
(2019). Open-source image-based 3d reconstruction
pipelines: review, comparison and evaluation. The
International Archives of Photogrammetry, Remote
Sensing and Spatial Information Sciences.
Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., and
Thoennessen, U. (2008). On benchmarking camera
calibration and multi-view stereo for high resolution
imagery. In CVPR 2008.
Sweeney, C., Hollerer, T., and Turk, M. (2015). Theia:
A fast and scalable structure-from-motion library. In
Proceedings of the ACM International Conference on
Multimedia, pages 693–696. ACM.
Vallet, B., Brédif, M., Serna, A., Marcotegui, B., and Paparoditis, N. (2015). Terramobilita/iqmulus urban point cloud analysis benchmark. Computers & Graphics.
Zhang, L., Li, Z., Li, A., and Liu, F. (2018). Large-scale
urban point cloud labeling and reconstruction. ISPRS
Journal of Photogrammetry and Remote Sensing.
Zolanvari, S., Ruano, S., Rana, A., Cummins, A., da Silva,
R. E., Rahbar, M., and Smolic, A. (2019). Dublincity:
Annotated lidar point cloud and its applications. 30th
BMVC.
... In this paper, we present an extension of the previous work [15] with an evaluation for image-based 3D reconstruction pipelines not limited to combination of open-source techniques. We expand the previous work by including commercial applications in the evaluation and thus improving the study of a metropolitan area. ...
... We expand the previous work by including commercial applications in the evaluation and thus improving the study of a metropolitan area. We use the annotated point cloud of DublinCity [30] and make use of the initial study done in [15] where the city is analyzed. The analysis not only includes the evaluation at scene level but also per category of urban element, as it was done in the benchmark. ...
... Some of the benchmarks that include images such as the Kitti benchmark [8], and the ISPRS Test project on Urban Classification and 3D Building Reconstruction [12] are not focused on the evaluation of the MVS output; instead, they focus on the evaluation of other tasks such as stereo, object detection, or reconstructions made from roofs. Some of the benchmarks cited above have an annotated point cloud of the city [26,20,11], but as pointed out, they do not provide images, with the exception of the ISPRS Test project on Urban Classification and 3D Building Reconstruction [12] and the Ai3Dr Benchmark [15]. The main difference between these last two regarding the ground-truth model is the density. ...
Chapter
Full-text available
We present a study of image-based 3D reconstruction methods that use aerial images in large-scale metropolitan areas. Specifically, the study analyzes both open-source methods from the state of the art, and some of the most used commercial photogrammetry applications. The performance of these methods is measured against the densest annotated LiDAR point cloud available at a city scale. The study not only analyzes the accuracy and completeness of the reconstruction methods in the metropolitan area as in previous studies, but it also evaluates their performance for different city elements such as trees, windows, roofs, etc. which are present in the annotated ground truth model.
... In this paper, we present an extension of the previous work [15] with an evaluation for image-based 3D reconstruction pipelines not limited to combination of open-source techniques. We expand the previous work by including commercial applications in the evaluation and thus improving the study of a metropolitan area. ...
... We expand the previous work by including commercial applications in the evaluation and thus improving the study of a metropolitan area. We use the annotated point cloud of DublinCity [30] and make use of the initial study done in [15] where the city is analyzed. The analysis not only includes the evaluation at scene level but also per category of urban element, as it was done in the benchmark. ...
... Some of the benchmarks that include images such as the Kitti benchmark [8], and the ISPRS Test project on Urban Classification and 3D Building Reconstruction [12] are not focused on the evaluation of the MVS output; instead, they focus on the evaluation of other tasks such as stereo, object detection, or reconstructions made from roofs. Some of the benchmarks cited above have an annotated point cloud of the city [26,20,11], but as pointed out, they do not provide images, with the exception of the ISPRS Test project on Urban Classification and 3D Building Reconstruction [12] and the Ai3Dr Benchmark [15]. The main difference between these last two regarding the ground-truth model is the density. ...
Preprint
Full-text available
We present a study of image-based 3D reconstruction methods that use aerial images in large-scale metropolitan areas. Specifically, the study analyzes both open-source methods from the state of the art, and some of the most used commercial photogrammetry applications. The performance of these methods is measured against the densest annotated LiDAR point cloud available at a city scale. The study not only analyzes the accuracy and completeness of the reconstruction methods in the metropolitan area as in previous studies, but it also evaluates their performance for different city elements such as trees, windows, roofs, etc. which are present in the annotated ground truth model.
... Experiment results: From Table 1, it can be observed that the best quality 3D rendering is achieved when the highest resolution (i.e., setting scale=1) video frames/images are used for reconstruction. Here, for quality evaluation, we use widely accepted F-score metric proposed in [20]. From the table, we also observe that this baseline case requires a latency of 16.86 sec. ...
Conference Paper
Full-text available
Multi-view 3D reconstruction driven augmented, virtual, and mixed reality applications are becoming increasingly edge-native, due to factors such as, rapid reconstruction needs, security/privacy concerns, and lack of connectivity to cloud platforms. Managing edge-native 3D reconstruction, due to edge resource constraints and inherent dynamism of 'in the wild' 3D environments, involves striking a balance between conflicting objectives of achieving rapid reconstruction and satisfying minimum quality requirements. In this paper, we take a deeper dive into multi-view 3D reconstruction latency-quality trade-off, with an emphasis on reconstruction of dynamic 3D scenes. We propose data-level and task-level par-allelization of 3D reconstruction pipelines, holistic edge system optimizations to reduce reconstruction latency, and long-term minimum reconstruction quality satisfaction. The proposed solutions are validated through collection of real-world 3D scenes with varying degree of dynamism that are used to perform experiments on hardware edge testbed. The results show that our solutions can achieve between 50% to 75% latency reduction without violating long term minimum quality requirements.
... For offline qualitative evaluation, i.e., to test the performance of our optimized pipeline (without the concern of evaluation latency), we use the metric proposed in [18]. In this method, the quality is measured in terms of precision (P ), recall (R), and F-score (F ) -where P measures how close a 3D point cloud is to the ground truth, R measures the completeness of the reconstruction, and F is a function of P and R, i.e. , ...
Preprint
Full-text available
In order to plan rapid response during disasters, first responder agencies often adopt `bring your own device' (BYOD) model with inexpensive mobile edge devices (e.g., drones, robots, tablets) for complex video analytics applications, e.g., 3D reconstruction of a disaster scene. Unlike simpler video applications, widely used Multi-view Stereo (MVS) based 3D reconstruction applications (e.g., openMVG/openMVS) are exceedingly time consuming, especially when run on such computationally constrained mobile edge devices. Additionally, reducing the reconstruction latency of such inherently sequential algorithms is challenging as unintelligent, application-agnostic strategies can drastically degrade the reconstruction (i.e., application outcome) quality making them useless. In this paper, we aim to design a latency optimized MVS algorithm pipeline, with the objective to best balance the end-to-end latency and reconstruction quality by running the pipeline on a collaborative mobile edge environment. The overall optimization approach is two-pronged where: (a) application optimizations introduce data-level parallelism by splitting the pipeline into high frequency and low frequency reconstruction components and (b) system optimizations incorporate task-level parallelism to the pipelines by running them opportunistically on available resources with online quality control in order to balance both latency and quality. Our evaluation on a hardware testbed using publicly available datasets shows upto ~54% reduction in latency with negligible loss (~4-7%) in reconstruction quality.
... For offline qualitative evaluation, i.e., to test the performance of our optimized pipeline (without the concern of evaluation latency), we use the metric proposed in [18]. In this method, the quality is measured in terms of precision (P ), recall (R), and F-score (F ) -where P measures how close a 3D point cloud is to the ground truth, R measures the completeness of the reconstruction, and F is a function of P and R, i.e. , ...
... Two different sets of aerial images (Table 1) are employed to test the capabilities of learning-based methods within AT processes and their influence on the generation of dense point clouds: the ISPRS/EuroSDR Dortmund benchmark (Nex et al., 2015) and the Dublin benchmark (Ruano and Smolic, 2021). These urban datasets were chosen due to their complementarity in terms of acquisitions, resolution, and ground truth (GT). ...
Article
Full-text available
Aerial triangulation (AT) has reached outstanding progress in the last decades, and now fully automated solutions for nadir and oblique images are available. Usually, image correspondences (tie points) are found using hand-crafted methods, such as SIFT or its variants. But in the last years, there were many investigations and developments to promote the use of machine and deep learning solutions within the photogrammetric processing pipeline. The paper explores learning-based methods for the extraction of tie points in aerial image blocks. Image correspondences are used to perform aerial triangulation (AT) and successively generate dense point clouds. Two different datasets are used to compare conventional hand-crafted detector/descriptor methods with respect to learning-based methods. Accuracy analyses are performed using GCPs as well as ground truth LiDAR point clouds. Results confirm the potential of learningbased methods in finding reliable image correspondences in the aerial block, still showing space for improvements due to camera rotations.
Article
Full-text available
The popularity of Structure from Motion (SfM) techniques has significantly advanced 3D reconstruction in various domains, including construction site mapping. Central to SfM, is the feature extraction and matching process, which identifies and correlates keypoints across images. Previous benchmarks have assessed traditional and learning-based methods for these tasks but have not specifically focused on construction sites, often evaluating isolated components of the SfM pipeline. This study provides a comprehensive evaluation of traditional methods (e.g., SIFT, AKAZE, ORB) and learning-based methods (e.g., D2-Net, DISK, R2D2, SuperPoint, SOSNet) within the SfM pipeline for construction site mapping. It also compares matching techniques, including SuperGlue and LightGlue, against traditional approaches such as nearest neighbor. Our findings demonstrate that deep learning-based methods such as DISK with LightGlue and SuperPoint with various matchers consistently outperform traditional methods like SIFT in both reconstruction quality and computational efficiency. Overall, the deep learning methods exhibited better adaptability to complex construction environments, leveraging modern hardware effectively, highlighting their potential for large-scale and real-time applications in construction site mapping. This benchmark aims to assist researchers in selecting the optimal combination of feature extraction and matching methods for SfM applications at construction sites.
Article
Accurate and up-to-date 3D maps, often represented as point clouds, are crucial for autonomous vehicles. Crowd-sourcing has emerged as a low-cost and scalable approach for collecting mapping data utilizing widely available dashcams and other sensing devices. However, it is still a non-trivial task to utilize crowdsourced data, such as dashcam images and video, to efficiently create or update high-quality point clouds using technologies like Structure from Motion (SfM). This study assesses and compares different image matching options available in open-source SfM software, analyzing their applicability and limitations for mapping urban scenes in different practical scenarios. Furthermore, the study analyzes the impact of various camera setups (i.e., the number of cameras and their placement) and weather conditions on the quality of the generated 3D point clouds in terms of completeness and accuracy. Based on these analyses, our study provides guidelines for creating more accurate point clouds.
Conference Paper
Full-text available
Current human digitization techniques from a single image are showing promising results when it comes to the quality of the estimated geometry, but they often fall short when it comes to the texture of the generated 3D model, especially on the occluded side of the person, while some others do not even output a texture for the model. Our goal in this paper is to improve the predicted texture of these models without requiring any other additional input more than the original image used to generate the 3D model in the first place. For that, we propose a novel way to predict the back view of the person by including semantic and positional information that outperforms the state-of-the-art techniques. Our method is based on a general-purpose image-to-image translation algorithm with conditional adversarial networks adapted to predict the back view of a human. Furthermore, we use the predicted image to improve the texture of the 3D estimated model and we provide a 3D dataset, V-Human, to train our method and also any 3D human shape estimation algorithms which use meshes such as PIFu.
Article
State-of-the-art automated image orientation (Structure from Motion) and dense image matching (Multiple View Stereo) methods commonly used to produce 3D information from 2D images can generate 3D results, such as point clouds or meshes, of varying geometric and visual quality. Pipelines are generally robust and reliable enough, mostly capable of processing even large sets of unordered images, yet the final results often lack completeness and accuracy, especially when dealing with real-world cases where objects are typically characterized by complex geometries and textureless surfaces, and obstacles or occluded areas may also occur. In this study we investigate three of the commonly used open-source solutions, namely COLMAP, OpenMVG+OpenMVS and AliceVision, evaluating their results under diverse large-scale scenarios. Comparisons and a critical evaluation of the image orientation and dense point cloud generation algorithms are performed with respect to the corresponding ground-truth data. The presented FBK-3DOM datasets are available for research purposes.
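Comparisons against ground truth of this kind are typically expressed through accuracy (how close reconstructed points lie to the reference data) and completeness (how much of the reference data is covered), computed from nearest-neighbour distances between point clouds. The sketch below is a minimal, generic version of that computation; the random arrays and the 5 cm threshold are illustrative assumptions, not values from the study.

```python
# A minimal sketch of point-cloud accuracy/completeness against a ground-truth scan,
# using nearest-neighbour distances. Arrays and the 5 cm threshold are illustrative.
import numpy as np
from scipy.spatial import cKDTree

def accuracy_completeness(reconstruction, ground_truth, threshold=0.05):
    """reconstruction, ground_truth: (N, 3) arrays of 3D points in the same frame."""
    d_rec_to_gt = cKDTree(ground_truth).query(reconstruction)[0]
    d_gt_to_rec = cKDTree(reconstruction).query(ground_truth)[0]
    accuracy = float(np.mean(d_rec_to_gt < threshold))      # precision of the reconstruction
    completeness = float(np.mean(d_gt_to_rec < threshold))  # coverage of the ground truth
    return accuracy, completeness

rec = np.random.rand(1000, 3)  # stand-in for a dense MVS point cloud
gt = np.random.rand(2000, 3)   # stand-in for the laser-scanned ground truth
print(accuracy_completeness(rec, gt))
```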
Article
This paper presents a novel multi-purpose benchmark for assessing the performance of the entire image-based pipeline for 3D urban reconstruction and 3D data classification. The proposed benchmark is called the “3DOMcity” photogrammetric contest and is publicly available at https://3dom.fbk.eu/3domcity-benchmark. Compared to previous benchmark initiatives, the innovative aspects introduced by 3DOMcity are threefold. First, it covers the entire 3D reconstruction pipeline, including image orientation, dense image matching, point cloud classification and 3D building reconstruction. Second, the performance assessment is performed within a metrological context. Finally, the provided datasets offer 2D and 3D data collected at a very high spatial resolution. Furthermore, the benchmark is multi-purpose since it involves a multiplicity of tasks that can either be performed independently of each other or grouped together. The contributions of this paper are (i) the presentation of the novel benchmark dataset and corresponding reference data, (ii) the description of the multi-purpose benchmark tasks and evaluation procedures, and (iii) the discussion of first results of image orientation, dense image matching and classification.
Conference Paper
Scene understanding of full-scale 3D models of an urban area remains a challenging task. While advanced computer vision techniques offer cost-effective approaches to analyse 3D urban elements, a precise and densely labelled dataset is essential. This paper presents the first labelled dataset for a highly dense Aerial Laser Scanning (ALS) point cloud at city scale. This work introduces a novel benchmark dataset that includes a manual annotation of over 260 million laser scanning points into approximately 100,000 assets from the 2015 Dublin LiDAR point cloud (Laefer et al.). Objects are labelled into 13 classes using hierarchical levels of detail, from large (i.e. building, vegetation and ground) to refined (i.e. window, door and tree) elements. To validate the usefulness of the dataset, two different applications are showcased. First, the labelled point cloud is employed for training Convolutional Neural Networks (CNNs) to classify urban elements; the dataset is tested on well-known state-of-the-art networks (i.e. PointNet, PointNet++ and So-Net). Second, the complete ALS dataset is applied as detailed ground truth for city-scale image-based 3D reconstruction.
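A labelled ground-truth cloud of this kind makes a per-category evaluation of image-based reconstructions straightforward: each reconstructed point can inherit the class of its nearest annotated LiDAR point, and error statistics can then be grouped by class. The sketch below illustrates the idea; the class names and random arrays are illustrative assumptions rather than the actual annotation scheme.

```python
# A minimal sketch of a per-category evaluation: reconstructed points inherit the
# class of their nearest annotated LiDAR point, and errors are grouped by class.
# Class names and arrays are illustrative assumptions.
import numpy as np
from scipy.spatial import cKDTree

gt_points = np.random.rand(5000, 3)             # annotated ALS points (stand-in)
gt_labels = np.random.randint(0, 3, size=5000)  # e.g. 0=building, 1=vegetation, 2=ground
rec_points = np.random.rand(1000, 3)            # image-based reconstruction (stand-in)

dist, idx = cKDTree(gt_points).query(rec_points)
for cls, name in enumerate(["building", "vegetation", "ground"]):
    mask = gt_labels[idx] == cls
    if mask.any():
        print(f"{name}: mean distance {dist[mask].mean():.3f} m over {mask.sum()} points")
```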
Research
This record serves as an index to a suite of high-density aerial remote sensing data for a 2 km² area of Dublin, Ireland obtained at an average flying altitude of 300 m. Collected in March 2015, the data include aerial laser scanning (ALS) from 41 flight paths in the form of a 3D point-cloud (LAZ) and 3D full waveform ALS (LAS and Pulsewave), and imagery data including ortho-rectified 2D rasters (RGBi) and oblique images. The ALS data consist of over 1.4 billion points (inclusive of partially covered areas) and were acquired by a TopEye system S/N 443. Imagery data were captured using a Phase One camera system. In this data offering, the ALS and imagery data are structured both by flight paths and by 500 × 500 m rectangular tiles. Miscellaneous data including video records and instrument parameters are available in the preservation record. For more information on these data including related publications, reports, and bulk-download instructions, please consult the documentation. For a visualization of the data as compared to standard LiDAR data density, please consult the video flythrough available at https://youtu.be/qEi2Wo7Bcuk. This dataset was collected with funding from European Research Council Consolidator project RETURN – Rethinking Tunnelling in Urban Neighbourhoods [ERC-2012-StG-307836] and additional funding from Science Foundation Ireland [12/ERC/I2534]. These data were released with an Attribution 4.0 International (CC BY 4.0) license.
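As a practical note, the tiled LAZ point clouds can be loaded with standard open-source tooling. The sketch below assumes the laspy library with a LAZ backend (e.g., lazrs) is installed; the tile file name is a placeholder, not an actual file from the release.

```python
# A minimal sketch (assuming laspy with a LAZ backend such as lazrs) of loading one
# of the 500 x 500 m ALS tiles into a NumPy array. The tile name is a placeholder.
import numpy as np
import laspy

las = laspy.read("dublin_tile_example.laz")       # placeholder tile name
points = np.column_stack([las.x, las.y, las.z])   # georeferenced coordinates in metres
print(points.shape, points.mean(axis=0))
```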
Article
Unmanned Aerial Vehicles (UAVs) are being extensively used nowadays. Therefore, pilots of traditional aerial platforms should adapt their skills to operate them from a Ground Control Station (GCS). Common GCSs provide information on separate screens: one presents the video stream while the other displays the mission plan and data coming from other sensors. To avoid the burden of fusing information displayed on the two screens, an Augmented Reality (AR) tool is proposed in this paper. The AR system has two functionalities for Medium-Altitude Long-Endurance (MALE) UAVs: route orientation and target identification. Route orientation allows the operator to identify the upcoming waypoints and the path that the UAV is going to follow. Target identification allows fast target localization, even in the presence of occlusions. The AR tool is implemented following North Atlantic Treaty Organization (NATO) standards so that it can be used in different GCSs. The experiments show how the AR tool significantly improves the situational awareness of UAV operators.
Article
The large number of object categories and the many overlapping or closely neighboring objects in large-scale urban scenes pose great challenges for point cloud classification. In this paper, a novel framework is proposed for the classification and reconstruction of airborne laser scanning point cloud data. To label point clouds, we present a rectified linear units neural network named ReLu-NN, in which rectified linear units (ReLU) replace the traditional sigmoid as the activation function in order to speed up convergence. Since the features of the point cloud are sparse, dropout is used to reduce the number of active neurons and avoid over-fitting during training. The set of feature descriptors for each 3D point is encoded through self-taught learning and forms a discriminative feature representation which is taken as the input of the ReLu-NN. The segmented building points are consolidated through an edge-aware point set resampling algorithm, and then they are reconstructed into 3D lightweight models using the 2.5D contouring method (Zhou and Neumann, 2010). Compared with deep learning approaches, the introduced ReLu-NN can easily classify unorganized point clouds without rasterizing the data, and it does not need a large number of training samples. Most of the parameters in the network are learned, and thus the intensive parameter tuning cost is significantly reduced. Experimental results on various datasets demonstrate that the proposed framework achieves better performance than other related algorithms in terms of classification accuracy and reconstruction quality.
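To make the described architecture concrete, the sketch below shows a per-point classifier in the same spirit: a small fully connected network with ReLU activations and dropout, taking a hand-crafted feature descriptor per 3D point. The layer sizes, descriptor dimension and number of classes are illustrative assumptions, not the values used by ReLu-NN.

```python
# A minimal sketch of a per-point MLP classifier with ReLU activations and dropout.
# Layer sizes, descriptor dimension and class count are illustrative assumptions.
import torch
import torch.nn as nn

class PointFeatureMLP(nn.Module):
    def __init__(self, feature_dim=32, num_classes=8, dropout=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(64, num_classes),  # per-point class scores
        )

    def forward(self, descriptors):  # descriptors: (num_points, feature_dim)
        return self.net(descriptors)

model = PointFeatureMLP()
logits = model(torch.randn(100, 32))  # 100 points with 32-D descriptors
print(logits.shape)                   # torch.Size([100, 8])
```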
Article
We present a scalable pipeline for Free-Viewpoint Video (FVV) content creation, also considering visualisation in Augmented Reality (AR) and Virtual Reality (VR). We support a range of scenarios where there may be a limited number of handheld consumer cameras, but also demonstrate how our method can be applied in professional multi-camera setups. Our novel pipeline extends many state-of-the-art techniques (such as structure-from-motion, shape-from-silhouette and multi-view stereo) and incorporates bio-mechanical constraints through 3D skeletal information as well as efficient camera pose estimation algorithms. We introduce multi-source shape-from-silhouette (MS-SfS) combined with fusion of different geometry data as crucial components for accurate reconstruction in sparse camera settings. Our approach is highly flexible and our results indicate suitability either for affordable content creation for VR/AR or for interactive FVV visualisation where a user can choose an arbitrary viewpoint or sweep between known views using view synthesis.
Article
We present a benchmark for image-based 3D reconstruction. The benchmark sequences were acquired outside the lab, in realistic conditions. Ground-truth data was captured using an industrial laser scanner. The benchmark includes both outdoor scenes and indoor environments. High-resolution video sequences are provided as input, supporting the development of novel pipelines that take advantage of video input to increase reconstruction fidelity. We report the performance of many image-based 3D reconstruction pipelines on the new benchmark. The results point to exciting challenges and opportunities for future work.
Conference Paper
In this paper, we have presented a comprehensive multi-view geometry library, Theia, that focuses on large-scale SfM. In addition to state-of-the-art scalable SfM pipelines, the library provides numerous tools that are useful for students, researchers, and industry experts in the field of multi-view geometry. Theia contains clean code that is well documented (with code comments and the website) and easy to extend. The modular design allows users to easily implement and experiment with new algorithms within our current pipeline without having to implement a full end-to-end SfM pipeline themselves. Theia has already gathered a large number of diverse users from universities, startups, and industry, and we hope to continue to attract users and active contributors from the open-source community.