From Google Street View to 3D City Models
Akihiko Torii, Michal Havlena, Tomáš Pajdla
Center for Machine Perception, Department of Cybernetics
Faculty of Elec. Eng., Czech Technical University in Prague
Abstract

We present a structure-from-motion (SfM) pipeline for visual 3D modeling of a large city area using 360° field of view Google Street View images. The core of the pipeline combines state-of-the-art techniques such as SURF feature detection, tentative matching by an approximate nearest neighbour search, relative camera motion estimation by solving the 5-point minimal camera pose problem, and sparse bundle adjustment. The robust and stable camera poses estimated by PROSAC with soft voting and by scale selection using a visual cone test provide a high quality initial structure for bundle adjustment. Furthermore, searching for trajectory loops based on co-occurring visual words and closing them by adding new constraints for the bundle adjustment enforces the global consistency of camera poses and 3D structure in the sequence. We present a large-scale reconstruction computed from 4,799 images of the Google Street View Pittsburgh Research Data Set.
1. Introduction
Large scale 3D models of cities built from video sequences acquired by car-mounted cameras provide richer 3D content than models built from aerial images only. Embedding such 3D content into Google Earth or Microsoft Virtual Earth could, in the near future, yield a virtual reality system covering the whole world. In this paper, we present a structure-from-motion (SfM) pipeline for visual 3D modeling of such a large city area using 360° field of view omnidirectional images.
Recently, work [27] demonstrated 3D modeling from perspective images exported from Google Street View using piecewise planar structure constraints. Another recent related work [38] demonstrated the performance of an SfM pipeline for calibrated perspective images acquired by the Point Grey Ladybug Spherical Digital Video Camera System [32]; it employs guided matching using epipolar geometries computed in previous frames, and robust camera trajectory estimation computing camera orientations and positions individually. This paper shows a large-scale sparse 3D reconstruction using the original omnidirectional panoramic images.
Previously, city reconstruction has been addressed us-
ing aerial images [9, 3, 10, 22, 40, 41] which allowed re-
constructing large areas from a small number of images.
The resulting models, however, often lacked visual realism
when viewed from the ground level since it was impossible
to texture the facades of the buildings.
A framework for city modeling from ground-level im-
age sequences working in real-time has been developed, e.g.
in [1] and [5]. Work [5] uses SfM to reconstruct camera trajectories and 3D key points in the scene and performs fast dense image matching, assuming that there is a single gravity vector in the scene and that all the building facades are ruled surfaces parallel to it. The system gives good results, but the 3D reconstruction could not survive sharp camera turns when a large part of the scene moved away from the limited field of view of the cameras. A recent extension of [5] using a pair of calibrated fisheye lens cameras [12], which have hemispherical fields of view, could successfully reconstruct a trajectory
with sharp turns. In this work, we assume a single moving
camera which provides sparse image sequences only.
Short baseline SfM using simple image features [5],
which performs real-time detection and matching, recovers
camera poses and trajectory sufficiently well when all cam-
era motions between consecutive frames in the sequence
are small. On the other hand, wide baseline SfM based
methods, which use richer features such as MSER [25],
Laplacian-Affine, Hessian-Affine [28], SIFT [21], and
SURF [2], are capable of producing feasible tentative
matches under large changes of visual appearance between
images induced by rapid changes of camera pose and illu-
mination. Work [7] presented SfM based on wide baseline matching of SIFT features using a single omnidirectional camera and demonstrated its performance in indoor environments. We use SURF features [2] since they are
the fastest among those features used for the wide baseline
matching and produce sufficiently robust tentative matches
even on distorted omnidirectional images.
Figure 1. Camera trajectory computed by SfM. (a) Camera positions (red circles) exported into Google Earth [8]. To increase the visibility,
every 12th camera position in the original sequence is plotted. (b) The 3D model representing 4,799 camera positions (red circles) and
123,035 3D points (color dots).
An inevitable problem of sequential SfM is the accumulation of drift errors while proceeding along the trajectory. Loop closing [16, 34] is essentially capable of removing the drift errors since it brings global consistency of camera poses and 3D structure by providing additional constraints for the final refinement accomplished by bundle adjustment. In [16], loop closing is achieved by merging partial reconstructions of overlapping sequences which are extracted using an image similarity matrix [36, 17]. Work [34] finds loop endpoints by using the image similarity matrix and verifies the loops by computing the rotation transform between the pairs of origins and endpoints under the assumption that the positions of the origin and the endpoint of each loop coincide. Furthermore, they constrain the camera motions to a plane to reduce the number of parameters in bundle adjustment. Unlike [34], we aim at
proposing a pipeline which recovers camera poses in 3D
and tests the loops by solving camera resectioning [31] in
order to accomplish large scale 3D modeling of cities, see
Figure 1.
The main contribution of this paper is in demonstrating
that one can achieve SfM from a single sparse omnidirec-
tional sequence with only an approximate knowledge of cal-
ibration as opposed to [5, 38] where the large scale mod-
els are computed from dense sequences and with precisely
calibrated cameras. We present an experiment with the
Google Street View Pittsburgh Research Data Set¹, which has denser images than the data freely available at Google Maps. Therefore, we processed every second image and could have processed even every fourth image with only a small degradation of the results.

¹Provided and copyrighted by Google.
2. The Pipeline
The proposed SfM pipeline is an extension of the previous work [39], which demonstrated the recovery of camera poses and trajectory on an image sequence acquired by a single fisheye lens camera. We refer the reader to [39] for more technical details of each step in the pipeline.
2.1. Calibration
Assuming that the input omnidirectional images are pro-
duced by the equirectangular projection, see Figure 2, the
transformation from image points to unit vectors of their
rays can be formulated as follows. For an equirectangular image with dimensions IW and IH, a point u = (ui, uj) in image coordinates is transformed into a unit vector p = (px, py, pz) in spherical coordinates:

    px = cos φ sin θ,  py = sin φ,  pz = cos φ cos θ,    (1)

where the angle θ is computed from the column index ui and the angle φ from the row index uj, see Figure 2(c).
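As a concrete illustration, the pixel-to-ray mapping of Equation (1) can be sketched in a few lines of Python. The linear mapping from pixel indices to the angles θ and φ below, with the image center mapped to θ = φ = 0, is our assumption; the exact offsets depend on the panorama convention.

```python
import numpy as np

def pixel_to_ray(ui, uj, IW, IH):
    """Map an equirectangular pixel (ui, uj) to a unit ray p = (px, py, pz).

    Assumes a linear pixel-to-angle mapping with the image center at
    theta = phi = 0; the actual offsets depend on the panorama convention.
    """
    theta = 2.0 * np.pi * (ui / IW - 0.5)   # horizontal angle, [-pi, pi)
    phi = np.pi * (0.5 - uj / IH)           # vertical angle, [-pi/2, pi/2]
    # Equation (1): spherical angles to a unit vector.
    return np.array([np.cos(phi) * np.sin(theta),
                     np.sin(phi),
                     np.cos(phi) * np.cos(theta)])
```

With this convention the image center maps to the forward ray (0, 0, 1), and every returned vector has unit norm by construction.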
2.2. Generating Tracks by Concatenating Pairwise Matches
Tracks used for SfM are generated in several steps. First,
up to thousands of SURF features [2] are detected and de-
scribed on each of the input images.
Figure 2. Omnidirectional imaging. (a) Point Grey Ladybug Spherical Digital Video Camera System [32] used for acquiring the Street View images. (b) Omnidirectional image used as input data for SfM. (c) Transformation between a unit vector p on a unit sphere and a pixel u of the equirectangular image. The coordinates px, py, and pz of the unit vector p are transformed into angles θ and φ. Column index ui is computed from the angle θ and row index uj from the angle φ.
Secondly, sets of tentative matches are constructed be-
tween pairs of consecutive images. The matching is
achieved by finding features with closest descriptors be-
tween the pair of images, which is done for each feature
independently. When conflicts appear, we select the most discriminative match by computing the ratio between the first and the second best match. We use the Fast Library for Approximate Nearest Neighbors (FLANN) [29], which delivers approximate nearest neighbours significantly faster than exact matching thanks to using several random kd-trees.
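The descriptor matching with the ratio test can be sketched as follows. This is a brute-force stand-in for the approximate kd-tree search performed by FLANN, and the 0.8 ratio threshold is an illustrative assumption rather than the paper's setting.

```python
import numpy as np

def ratio_test_matches(desc_a, desc_b, ratio_threshold=0.8):
    """Match descriptors of image A to image B with a first/second ratio test.

    Brute-force stand-in for the approximate kd-tree search (FLANN);
    the 0.8 threshold is an illustrative assumption.
    Returns a list of (index_in_a, index_in_b, ratio) tuples.
    """
    matches = []
    for i, d in enumerate(desc_a):
        # Squared Euclidean distances to all descriptors of image B.
        dists = np.sum((np.asarray(desc_b) - d) ** 2, axis=1)
        order = np.argsort(dists)
        first, second = order[0], order[1]
        ratio = np.sqrt(dists[first] / max(dists[second], 1e-12))
        # Keep the match only if the nearest neighbour is clearly more
        # discriminative than the second nearest one.
        if ratio < ratio_threshold:
            matches.append((i, int(first), float(ratio)))
    return matches
```

The ratio doubles as the discriminativity score used later to order the samples drawn by PROSAC.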
Thirdly, tentative matches between each pair of consec-
utive images are verified through epipolar geometry (EG)
computed by solving the 5-point minimal relative pose
problem for calibrated cameras [30]. The tentative matches
are verified with a RANSAC based robust estimation [6]
which searches for the largest subset of the set of tenta-
tive matches consistent with the given epipolar geometry.
We use PROSAC [4], a simple modification of RANSAC, which achieves good performance [33] by reducing the number of samples drawn thanks to ordered sampling. The 5-tuples of tentative matches are drawn from the list ordered ascendingly by their discriminativity scores, which are the ratios between the distances of the first and the second nearest neighbours in the feature space. Finally, the tracks are constructed by concatenating inlier matches.
The pairwise matches, obtained by epipolar geometry
validation, often contain incorrect matches lying on epipo-
lar lines or in the vicinity of epipoles since they may sup-
port the epipolar geometry even without violating geomet-
ric consistency. In practice, such incorrect matches can be mostly filtered out by keeping only sufficiently long tracks: we reject tracks spanning fewer than three images.
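The concatenation of pairwise inlier matches into multi-view tracks can be sketched as follows; this is a simplified illustration where features are identified by integer indices, and the minimum track length of three is taken from the text above.

```python
def build_tracks(pairwise_matches, min_length=3):
    """Concatenate pairwise inlier matches into tracks.

    pairwise_matches maps a frame index i to a list of (a, b) pairs,
    meaning feature a of frame i matches feature b of frame i+1.
    Returns tracks as lists of (frame, feature) pairs, keeping only
    tracks observed in at least min_length images.
    """
    open_tracks = {}   # (frame, feature) of the track head -> track so far
    finished = []
    for i in sorted(pairwise_matches):
        next_open = {}
        for a, b in pairwise_matches[i]:
            # Extend an existing track or start a new one at (i, a).
            track = open_tracks.pop((i, a), [(i, a)])
            track.append((i + 1, b))
            next_open[(i + 1, b)] = track
        # Tracks not extended in this step are closed.
        finished.extend(open_tracks.values())
        open_tracks = next_open
    finished.extend(open_tracks.values())
    return [t for t in finished if len(t) >= min_length]
```

A track surviving the length filter has been verified by at least two chained epipolar geometries, which is what makes the later scale computation possible.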
2.3. Robust Initial Camera Pose Estimation
Initial camera poses and positions in a canonical coor-
dinate system are recovered by using the epipolar geome-
tries of pairs of consecutive images computed in the stage
of verifying tracks. The essential matrix Eij, encoding the relative camera pose between frames i and j = i + 1, can be decomposed into Eij = [tij]× Rij. Although there exist four possible decompositions, the right one can be selected as the one which reconstructs the largest number of 3D points in front of both cameras. Having the normalized camera matrix [11] of the i-th frame Pi = [Ri | Ti], the normalized camera matrix Pj can be computed as

    Pj = [Rij Ri | Rij Ti + γ tij]    (4)

where γ is the scale of the translation between frames i and j in the canonical coordinate system. The scale γ can be computed from any 3D point seen in at least three consecutive frames, but the precision depends on the uncertainty of the reconstructed 3D point. Therefore, a robust selection from the possible scale candidates has to be done while evaluating the quality of the computed camera position. The best scale is found by RANSAC maximizing the number of points that pass the "cone test" [13], which checks the intersection of pixel ray cones in a similar way as the feasibility test of L1- or L∞- triangulation [14, 15], see Algorithm 1. During the cone test, one pixel wide cones formed by four planes (up, down, left, and right) are cast around the matches, and we test whether the intersection of the cones is empty using the LP feasibility test [23] or an exhaustive test [13], which is faster when the number of intersected cones is smaller than four.
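The camera chaining of Equation (4) can be sketched directly in code; the matrices passed in below are illustrative stand-ins for the decomposed relative pose and a previously recovered camera.

```python
import numpy as np

def chain_camera(P_i, R_ij, t_ij, gamma):
    """Equation (4): chain the normalized camera P_i = [R_i | T_i] with the
    relative pose (R_ij, t_ij) and translation scale gamma, obtaining
    P_j = [R_ij R_i | R_ij T_i + gamma t_ij].
    """
    R_i, T_i = P_i[:, :3], P_i[:, 3]
    R_j = R_ij @ R_i
    T_j = R_ij @ T_i + gamma * t_ij
    return np.hstack([R_j, T_j[:, None]])
```

Starting from P1 = [I | 0] and chaining with an identity rotation, a unit translation t = (0, 0, 1), and γ = 2 yields a camera translated by (0, 0, 2), as expected.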
2.4. Bundle Adjustment Enforcing Global Camera
Pose Consistency
Even though the Google Street View data is not primarily acquired by driving the same street several times, there are some overlaps suitable for constructing loops that can compensate for drift errors accumulated while processing the trajectory sequentially. We construct loops by searching for pairs of images observing the same 3D structure at different times in the sequence.
Algorithm 1 Construction of the Initial Camera Poses by Chaining Epipolar Geometries

Input:  {Ei,i+1}, i = 1,...,n−1 ... Epipolar geometries of pairs of consecutive images.
        {mi}, i = 1,...,n−1 ... Matches (tracks) supporting the epipolar geometries.
Output: {Pi}, i = 1,...,n ... Normalized camera matrices.

 1: P1 := [I3×3 | 03×1] ... Set the first camera to be the origin of the canonical coordinates.
 2: for i := 1,...,n−1 do
 3:   Decompose Ei,i+1 and select the right rotation R and translation t where ||t|| = 1.
 4:   {Ui} := 3D points computed by triangulating the matches {mi} using R and t.
 5:   if i = 1 then
 6:     Pi+1 := [RA | Rb + t] where Pi = [A | b].
 7:     {X} := {Ui} ... Update 3D points.
 8:   else
 9:     Find 3D points {Ui−1,i+1} in {Ui}, in the i-th camera coordinates, seen in three images.
10:     Find 3D points {Xi−1,i+1} in {X}, in the canonical coordinates, seen in three images.
11:     k := 0, Smax := 0, N := |{Xi−1,i+1}| ... Initialization for the RANSAC cone test.
12:     while k ≤ N do
13:       k := k + 1 ... New sample.
14:       γ := ||Xi−1,i+1|| / ||A(Ui−1,i+1 − b)|| for the sampled point ... The scale to be tested.
15:       Pk := [RA | Rb + γt] where Pi = [A | b].
16:       Sk := the number of matches consistent with the motions Pi−1, Pi, and Pk.
17:       if Sk > Smax then
18:         Pi+1 := Pk ... The best motion with scale so far.
19:         Smax := Sk ... The maximum number of supports so far.
20:         Update the termination length N.
21:       end if
22:     end while
23:     Update {X} by merging {Ui−1,i+1} and adding {Ui} \ {Ui−1,i+1}.
24:   end if
25: end for
The knowledge of the GPS locations of Street View images alleviates the problem of image matching for loop closing but does not remove it completely, since common 3D structures can be seen even among relatively distant images. In this paper, we do not rely on GPS locations because image matching using the image similarity matrix is potentially capable of matching such distant images, and it is always important for the vision community to see that a certain problem can be solved entirely using vision.
Building Image Similarity Matrix. SURF descriptors of each image are quantized into visual words using a visual vocabulary containing 130,000 words computed from urban area omnidirectional images. Next, term frequency–inverse document frequency (tf-idf) vectors [36, 17], which weight words occurring often in a particular document and downweight words that appear often in the database, are computed for each image with more than 50 detected visual words. Finally, the image similarity matrix M is constructed by computing the image similarities, defined as cosines of angles between normalized tf-idf vectors, between all pairs of images.
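A minimal sketch of the similarity matrix construction follows. The tf and idf weighting below uses the standard tf-idf definition, which may differ in detail from the exact weighting of [36, 17].

```python
import numpy as np

def similarity_matrix(word_histograms):
    """Build an image similarity matrix from visual word histograms.

    word_histograms is an (n_images, n_words) array of word counts.
    Each image is described by a tf-idf vector; similarity is the
    cosine of the angle between normalized tf-idf vectors.
    """
    counts = np.asarray(word_histograms, dtype=float)
    n_images = counts.shape[0]
    # Term frequency: word counts normalized per image.
    tf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    # Inverse document frequency: downweight words common in the database.
    df = np.maximum((counts > 0).sum(axis=0), 1)
    idf = np.log(n_images / df)
    tfidf = tf * idf
    tfidf /= np.maximum(np.linalg.norm(tfidf, axis=1, keepdims=True), 1e-12)
    # Cosine similarities between all pairs of normalized tf-idf vectors.
    return tfidf @ tfidf.T
```

Two images with identical word histograms get similarity 1, and images sharing no visual words get similarity 0.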
Loop Finding and Closing. First, we take the upper triangular part of M to avoid a duplicate search. Since the entries near the diagonal of M, which correspond to neighbouring frames in the sequence, essentially have high scores, the 1st to 50th diagonals are zeroed in order to exclude very small loops. Next, for the image Ii in the sequence, we select the image Ij as the one having the highest similarity score in the i-th row of M. Image Ij is a candidate for the endpoint of the loop which starts from Ii. Note that the use of an upper triangular matrix constrains j > i.
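The candidate selection described above can be sketched as follows; the function name and the per-row argmax formulation are ours, but the masking of the first diagonals and the j > i constraint follow the text.

```python
import numpy as np

def loop_candidates(M, min_gap=50):
    """Select one loop endpoint candidate per image from similarity matrix M.

    Keeps only the upper triangle with the first min_gap diagonals zeroed
    (so j > i + min_gap, excluding very small loops), then picks, for each
    row i, the column j with the highest remaining similarity score.
    Returns a list of (i, j, score) with a positive score.
    """
    S = np.triu(np.asarray(M, dtype=float), k=min_gap + 1)
    candidates = []
    for i in range(S.shape[0]):
        j = int(np.argmax(S[i]))
        if S[i, j] > 0:
            candidates.append((i, j, float(S[i, j])))
    return candidates
```

Each returned pair (Ii, Ij) is only a candidate; it still has to pass the resectioning verification described next.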
Next, the candidate image Ij is verified by solving the camera resectioning [31]. Triplets of the tentative 2D-3D matches, constructed by matching the descriptors of the 3D points associated with the images Ii and Ii+1 with the descriptors of the features detected in the image Ij, are sampled by RANSAC to find the camera pose having the largest support, again evaluated by the cone test. The image Ii+1, which is the successive frame of Ii, is additionally used for performing the cone test with three images in order to enforce geometric consistency in the support evaluation of the RANSAC. Local optimization is achieved by repeated camera pose computation from all inliers [35] via SDP and
SeDuMi [37]. If the inlier ratio is higher than 70%, the camera resectioning is considered successful and the candidate image Ij is accepted as the endpoint of the loop. The inlier matches are used to give additional constraints on the final bundle adjustment. We perform this loop search for every image in the sequence and test only the pair of images having the highest similarity score. If one increased the number of candidates to be tested, our pipeline would approach SfM [24, 19, 26] for unorganized images based on exhaustive pairwise matching.

Figure 3. Results of SfM with loop closing. (a) Trajectory before bundle adjustment. (b) Trajectory after bundle adjustment with loop closing. Examples of the images used for the loop closing: (c) Frames 6597 and 8643. (d) Frames 6711 and 6895.
Finally, very distant points, i.e. likely outliers, are filtered out, and sparse bundle adjustment [20], modified to work with unit vectors in an approach similar to [18], refines both the points and the cameras.
3. Experimental Results
We used 4,799 omnidirectional images of the Google
Street View Pittsburgh Research Data Set. Since the input
omnidirectional images have large distortion at the top and
bottom, we clipped the original images by cropping 230 pixels from the top and 410 pixels from the bottom to obtain 3,328×1,024 pixel images, see Figure 2(b). Since
the tracks are generated based on wide baseline matching,
it is possible to save computation time by constructing ini-
tial camera poses and 3D structure from a sparser image
sequence. Our SfM was run on every second image in the
sequence, i.e. 2,400 images were used to create a global re-
construction. The remaining 2,399 images were attached to
the reconstruction in the final stage.
Figure 4. Resulting 3D model consisting of 2,400 camera positions (red circles) and 124,035 3D points (blue dots) recovered by our pipeline. (a) Initial estimation. (b) After bundle adjustment with loop closing.

The initial camera poses were estimated by computing epipolar geometries of pairs of successive images and chaining them by finding the global scale of the camera translation, see Algorithm 1. The resulting trajectory is shown in Figure 3(a). After estimating the initial camera poses and
reconstructing 3D points, the pairs of images acquired at the same location at different times were searched for. The red
lines in Figure 3(a) indicate links between the accepted im-
age pairs. Figure 3(b) shows the camera trajectory after the
bundle adjustment with the additional constraints obtained
from loop closing. Figures 3(c) and (d) show the exam-
ples of pairs of images used for closing the loops at frames
(6597,8643) and (6711,6895) respectively. Furthermore,
Figure 4 shows the camera positions and the 3D points of
the initial recovery (a) and after the loop closing (b) in dif-
ferent views. In Figure 5, the recovered trajectory is com-
pared to the GPS positions provided in the Google Street
View Pittsburgh Research Data Set. The computational
time spent in different steps of the pipeline implemented
in MATLAB+MEX running on a standard Core2Duo PC is
shown in Table 1. Since the method is scalable, storing the intermediate results of the computation on a hard drive instead of in RAM, its performance could be improved by using a fast SSD drive instead of a standard SATA drive.
Figure 5. Comparison to the GPS provided in the Google Street
View Pittsburgh Research Data Set. Camera trajectory by GPS
(red line) and estimated camera trajectory by our SfM (blue line).
Step           Time [h]
Detection        12.8
Matching          4.5
Chaining          1.0
Loop Closing      6.3
Bundle           14.5

Table 1. Computational time in hours. (Detection) SURF detection and description. (Matching) Tentative matching and computing EGs. (Chaining) Chaining EGs and computing scales. (Loop Closing) Searching and testing loops. (Bundle) Final sparse bundle adjustment.
Finally, the remaining 2,383 camera poses were com-
puted by solving the camera resectioning in the same man-
ner as used in the loop verification. Linear interpolation
was used for the 16 cameras that could not be resectioned
successfully. Figure 1(b) shows the 4,799 camera positions (red circles) and the 124,035 world 3D points (color dots) of the resulting 3D model.
4. Conclusions
We demonstrated the recovery of camera trajectory and
3D structure of a large city area from omnidirectional im-
ages and showed that the world can in principle be reconstructed from Google Street View images. We also showed that finding loops and using additional constraints in the final bundle adjustment significantly improves the quality of the resulting camera trajectory and 3D structure. Since the street view images on Google Maps are approximately 10 times sparser than the original sequence from the Google Street View Pittsburgh Research Data Set, testing the performance of the proposed pipeline on such sparse sequences will be our next challenge.
Acknowledgements

The authors were supported by EC project FP6-IST-027787 DIRAC. T. Pajdla was supported by the Czech Government under the research program MSM-684 0770038. Any
opinions expressed in this paper do not necessarily reflect
the views of the European Community. The Community is
not liable for any use that may be made of the information
contained herein.
References
[1] A. Akbarzadeh, J.-M. Frahm, P. Mordohai, B. Clipp, C. En-
gels, D. Gallup, P. Merrell, M. Phelps, S. Sinha, B. Tal-
ton, L. Wang, Q. Yang, H. Stewénius, R. Yang, G. Welch,
H. Towles, D. Nistér, and M. Pollefeys. Towards urban 3d
reconstruction from video. In 3DPVT06, May 2006.
[2] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up
robust features (SURF). CVIU, 110(3):346–359, June 2008.
[3] C. Brenner and N. Haala. Fast production of virtual reality
city models. IAPRS98, 32(4):77–84, 1998.
[4] O. Chum and J. Matas. Matching with PROSAC: Progressive
sample consensus. In CVPR05, pages I: 220–226, 2005.
[5] N. Cornelis, K. Cornelis, and L. Van Gool. Fast compact
city modeling for navigation pre-visualization. In CVPR06,
pages 1339–1344, 2006.
[6] M. A. Fischler and R. C. Bolles. Random sample consen-
sus: A paradigm for model fitting with applications to image
analysis and automated cartography. Communications of the
ACM, 24(6):381–395, June 1981.
[7] T. Goedemé, M. Nuttin, T. Tuytelaars, and L. Van Gool.
Omnidirectional vision based topological navigation. IJCV,
74(3):219–236, 2007.
[8] Google. Google earth -, 2004.
[9] A. Grün. Automation in building reconstruction. In Pho-
togrammetric Week’97, pages 175–186, 1997.
[10] N. Haala, C. Brenner, and C. Stätter. An integrated system
for urban model generation. In ISPRS Congress Comm. II,
pages 96–103, 1998.
[11] R. Hartley and A. Zisserman. Multiple View Geometry in
Computer Vision. Cambridge University Press, second edi-
tion, 2003.
[12] M. Havlena, T. Pajdla, and K. Cornelis. Structure from om-
nidirectional stereo rig motion for city modeling. In VIS-
APP08, pages II: 407–414, 2008.
[13] M. Havlena, A. Torii, and T. Pajdla. Randomized struc-
ture from motion based on atomic 3d models from camera
triplets. In CVPR09, 2009.
[14] F. Kahl. Multiple view geometry and the L∞-norm. In
ICCV05, pages II: 1002–1009, 2005.
[15] Q. Ke and T. Kanade. Quasiconvex optimization for robust
geometric reconstruction. PAMI, 29(10):1834–1847, 2007.
[16] M. Klopschitz, C. Zach, A. Irschara, and D. Schmalstieg.
Generalized detection and merging of loop closures for video
sequences. In 3DPVT, 2008.
[17] J. Knopp, J. Sivic, and T. Pajdla. Location recognition us-
ing large vocabularies and fast spatial matching. Research
Report CTU–CMP–2009–01, CMP Prague, January 2009.
[18] M. Lhuillier. Effective and generic structure from motion
using angular error. In ICPR06, pages I: 67–70, 2006.
[19] X. Li, C. Wu, C. Zach, S. Lazebnik, and J. Frahm. Modeling
and recognition of landmark image collections using iconic
scene graphs. In ECCV08, pages I: 427–440, 2008.
[20] M. Lourakis and A. Argyros. The design and implementa-
tion of a generic sparse bundle adjustment software package
based on the levenberg-marquardt algorithm. Tech. Report
340, Institute of Computer Science – FORTH, August 2004.
[21] D. Lowe. Distinctive image features from scale-invariant
keypoints. IJCV, 60(2):91–110, November 2004.
[22] H. Maas. The suitability for airborne laser scanner data for
automatic 3d object reconstruction. In Ascona01, pages 291–
296, 2001.
[23] A. Makhorin. GLPK: GNU linear programming kit -, 2000.
[24] D. Martinec and T. Pajdla. Robust rotation and translation
estimation in multiview reconstruction. In CVPR07, 2007.
[25] J. Matas, O. Chum, M. Urban, and T. Pajdla. Robust wide
baseline stereo from maximally stable extremal regions. IVC,
22(10):761–767, September 2004.
[26] Microsoft. Photosynth -,
[27] B. Micusik and J. Kosecka. Piecewise planar city 3d modeling
from street view panoramic sequences. In CVPR09, 2009.
[28] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman,
J. Matas, F. Schaffalitzky, T. Kadir, and L. V. Gool. A
comparison of affine region detectors. IJCV, 65(1-2):43–72, 2005.
[29] M. Muja and D. Lowe. Fast approximate nearest neighbors
with automatic algorithm configuration. In VISAPP09, 2009.
[30] D. Nistér. An efficient solution to the five-point relative pose
problem. PAMI, 26(6):756–770, June 2004.
[31] D. Nistér. A minimal solution to the generalized 3-point pose
problem. In CVPR04, pages I: 560–567, 2004.
[32] Point Grey Research, Inc. Ladybug2 -, 2005.
[33] R. Raguram, J.-M. Frahm, and M. Pollefeys. A comparative
analysis of RANSAC techniques leading to adaptive real-
time random sample consensus. In ECCV08, pages 500–513, 2008.
[34] D. Scaramuzza, F. Fraundorfer, R. Siegwart, and M. Polle-
feys. Closing the loop in appearance guided SfM for omni-
directional cameras. In OMNIVIS08, 2008.
[35] G. Schweighofer and A. Pinz. Globally optimal O(n) so-
lution to the PnP problem for general camera models. In
BMVC08, 2008.
[36] J. Sivic and A. Zisserman. Video google: Efficient visual
search of videos. In CLOR06, pages 127–144, 2006.
[37] J. Sturm. SeDuMi: A software package to solve optimization
problems -, 2006.
[38] J. Tardif, Y. Pavlidis, and K. Daniilidis. Monocular visual
odometry in urban environments using an omnidirectional
camera. In IROS08, 2008.
[39] A. Torii, M. Havlena, and T. Pajdla. Omnidirectional image
stabilization by computing camera trajectory. In PSIVT09,
pages 71–82, 2009.
[40] C. Vestri and F. Devernay. Using robust methods for auto-
matic extraction of buildings. In CVPR01, pages I:133–138, 2001.
[41] G. Vosselman and S. Dijkman. Reconstruction of 3d building
models from laser altimetry data. IAPRS01, 34(3):22–24, 2001.
... Much of the imagery covers urban areas that have not been subject of 3D city modelling, including cities that do not have even building footprints to begin with. As such, SVI has been employed for 3D building reconstruction using traditional techniques (Cavallo, 2015;Torii et al., 2009;Micusik and Kosecka, 2009). However, these approaches utilising SVI often require multiple images to form a dense correspondence, which is often not suitable for SVI as buildings are often partially or fully occluded by vegetation, vehicles, and other objects (Zhang et al., 2021c) (Fig. 1), and therefore are not available in more than one or two unobstructed images. ...
... Generating 3D building models from SVI has been of continuous interest (Zhang et al., 2021a), dating back to the work by Torii et al. (2009). Structure from Motion (SfM) techniques have been employed to reconstruct buildings by stitching a series of GSV images with known GPS location and camera internal parameters (Lee, 2009;Torii et al., 2009). ...
... Generating 3D building models from SVI has been of continuous interest (Zhang et al., 2021a), dating back to the work by Torii et al. (2009). Structure from Motion (SfM) techniques have been employed to reconstruct buildings by stitching a series of GSV images with known GPS location and camera internal parameters (Lee, 2009;Torii et al., 2009). For example, Bruno and Roncella (2019) investigated reconstruction using GSV photogrammetric strip but reported hit-or-miss results. ...
Full-text available
3D building models are an established instance of geospatial information in the built environment, but their acquisition remains complex and topical. Approaches to reconstruct 3D building models often require existing building information (e.g. their footprints) and data such as point clouds, which are scarce and laborious to acquire, limiting their expansion. In parallel, street view imagery (SVI) has been gaining currency, driven by the rapid expansion in coverage and advances in computer vision (CV), but it has not been used much for generating 3D city models. Traditional approaches that can use SVI for reconstruction require multiple images, while in practice, often only few street-level images provide an unobstructed view of a building. We develop the reconstruction of 3D building models from a single street view image using image-to-mesh reconstruction techniques modified from the CV domain. We regard three scenarios: (1) standalone single-view reconstruction; (2) reconstruction aided by a top view delineating the footprint; and (3) refinement of existing 3D models, i.e. we examine the use of SVI to enhance the level of detail of block (LoD1) models, which are common. The results suggest that trained models supporting (2) and (3) are able to reconstruct the overall geometry of a building, while the first scenario may derive the approximate mass of the building, useful to infer the urban form of cities. We evaluate the results by demonstrating their usefulness for volume estimation, with mean errors of less than 10% for the last two scenarios. As SVI is now available in most countries worldwide, including many regions that do not have existing footprint and/or 3D building data, our method can derive rapidly and cost-effectively the 3D urban form from SVI without requiring any existing building information. Obtaining 3D building models in regions that hitherto did not have any, may enable a number of 3D geospatial analyses locally for the first time.
... Large 3D reconstructing of outdoor scenes using a moving omnidirectional camera was performed in [31,110,156]. Torii et al. [156] and Micusík and Koseckà [110] use planar SURF descriptors to obtain correspondences across frames, but trajectory estimation exploring loop closure (as in other approaches for V-SLAM) was the focus in [156], whereas depth estimation (with the help of superpixels, under a piece-wise planar assumption) was the main goal of [110]. Caruso et al. [31] explore a robust photometric error to align the frames and explore keyframe matching to reine the alignment. ...
This paper provides a comprehensive survey on pioneer and state-of-the-art 3D scene geometry estimation methodologies based on single, two, or multiple images captured under omnidirectional optics. We first revisit the basic concepts of the spherical camera model and review the most common acquisition technologies and representation formats suitable for omnidirectional (also called 360°, spherical or panoramic) images and videos. We then survey monocular layout and depth inference approaches, highlighting the recent advances in learning-based solutions suited for spherical data. The classical stereo matching is then revised on the spherical domain, where methodologies for detecting and describing sparse and dense features become crucial. The stereo matching concepts are then extrapolated for multiple view camera setups, categorizing them among light fields, multi-view stereo, and structure from motion (or visual simultaneous localization and mapping). We also compile and discuss commonly adopted datasets and figures of merit indicated for each purpose and list recent results for completeness. We conclude this paper by pointing out current and future trends.
... The ground can be surveyed by various techniques such as LiDAR, photogrammetry, and computer vision. These include using LiDAR for surveying and extraction of houses and roads (Rottensteiner et al., 2005), building height prediction (Park and Guldmann, 2019), completing automated 3D city reconstruction (Garcíamoreno et al., 2012), using tilted aerial imagery to give texture information to 3D city models (Frueh et al., 2004), using UAV tilt photography to complete 3D city modeling (Wang et al., 2015), using Google Earth imagery and ground images for 3D city modeling (Ding et al., 2007), and using Google Street View for 3D city modeling (Torii et al., 2009). The underground can be surveyed non-destructively by ground-penetrating radar, ultrasonic, thermal infrared, and nuclear magnetic resonance techniques. ...
Digitalization of urban roads is an important part of smart city construction. In addition to having a basic understanding of the structure of the transportation network, we need to have a preliminary understanding of the information around the road, the current status of the road, and the impact that municipal projects may have on the road. At present, the three-dimensional information of the above-ground parts of roads can be obtained efficiently and accurately on a large scale by using three-dimensional scanning technology. However, there is a lack of comprehensive and intuitive understanding of the underground information, and a lack of coordinated consideration of the ground and underground information. In this paper, a ground and underground urban road surveying system based on 3D LiDAR and 3D ground penetrating radar (GPR) is presented. The system covers multi-sensor coordinated control, time-space datum setup, and data post-processing. Experiments show that the system can realize integrated ground and underground 3D surveying for urban roads, generate an intuitive three-dimensional point cloud model of the ground and underground of urban roads, and provide effective technical support for smart city construction.
... In addition to releasing street view map services for user browsing, they have released application program interfaces (APIs) for developers to customize web applications. GSV images are a useful potential data source for urban studies, including 3D city model construction (Torii et al., 2009), commercial-entity identification (Zamir et al., 2011), and public environment audits (Edwards et al., 2013). They are even used to interpret layers such as ground, pedestrians, buildings, and sky. ...
... Some studies used SVIs to reconstruct 3D cities where point cloud generation is a key step (Klingner et al., 2013;Micusik and Kosecka, 2009;Torii et al., 2009). According to the assessment by Bruno and Roncella (2019), the 3D location error for such models was a few meters without using control points. ...
Street view images are now widely used in web map services, providing on-site photos of street scenes for users to explore without physically being in the field. These photos record detailed visual information of the street environment with geospatial control; therefore, they can be used for metric mapping purposes. In this study, we present a method to convert street view images to measurable land cover maps using their associated depthmap data. The proposed method can autonomously extract and measure land cover objects over large areas covered by a mosaic of street view images. In the case study, we demonstrated the use of land cover maps derived from Google Street View images to extract sidewalk features and to measure sidewalk clear widths for wheelchair users. Sidewalk feature slopes were also extracted from the metadata of street view images. Using Washington D.C., U.S. as the study area, our method extracted a sidewalk network of 2,561 km in length with a precision of 0.8662 and recall of 0.8525. The extracted sidewalks have widths between 1 and 2 m, a mean width error of 0.24 m, and a mean slope error of 0.638°. In Washington D.C., most sidewalks meet the minimum width requirement (0.9 m), but 20% of them have slopes that exceed the maximum allowance (1:20, or about 2.9°). These results demonstrate that the land cover maps converted from street view images can be used for metric mapping purposes. The extracted sidewalk network can serve as a valuable inventory for urban planners to promote equitable walkability for mobility-disabled users, and, if widely available, mobility-impaired users could consult it prior to planning a route.
... There are many methods available to identify strong features in RGB images in order to locate relevant points; among them is the commonly used SURF, an abbreviation of Speeded Up Robust Features [?], as used in the works of [7][8]. SURF is called a blob detector and its main idea is to identify sets of pixels of the same tonality. ...
Visual odometry is the process of estimating the position and orientation of a camera by analyzing the images associated with it. This paper develops a quick and accurate approach to visual odometry of a moving RGB-D camera navigating in a static environment. The proposed algorithm uses SURF (Speeded Up Robust Features) as the feature extractor, RANSAC (Random Sample Consensus) to filter the results, and minimum mean-square estimation to obtain the six-parameter rigid transformation between successive video frames. Data from a Kinect camera were used in the tests. The results show that this approach is feasible and promising, surpassing in performance the algorithms ICP (Iterative Closest Point) and SfM (Structure from Motion) in tests using a publicly available dataset.
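The hypothesise-and-test scheme in the abstract above — sample a minimal set of matches, fit a rigid motion, and keep the largest consensus set — can be sketched on matched 3D points. This is a generic illustration (the function names, iteration count, and inlier threshold are our own choices, not the paper's implementation):

```python
import numpy as np

def rigid_transform(P, Q):
    """Least-squares R, t with Q ≈ P @ R.T + t (Kabsch algorithm)."""
    cp, cq = P.mean(axis=0), Q.mean(axis=0)
    H = (P - cp).T @ (Q - cq)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    return R, cq - R @ cp

def ransac_rigid(P, Q, iters=200, thresh=0.05, rng=np.random.default_rng(0)):
    """Robustly fit a rigid motion between matched 3D point sets P -> Q."""
    best_inliers = np.zeros(len(P), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(P), 3, replace=False)   # minimal sample for a rigid motion
        R, t = rigid_transform(P[idx], Q[idx])
        err = np.linalg.norm((P @ R.T + t) - Q, axis=1)
        inliers = err < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return rigid_transform(P[best_inliers], Q[best_inliers])  # refit on consensus set
```

The refit on the final consensus set plays the role of the least-squares estimation step applied after outlier rejection.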
... Ground-level images provide a valuable resource for exploring how features vary across regions, such as the amount of green and buildings (Li et al., 2015;Torii et al., 2009). In particular, GSV is a service that provides ground-level images for public access and with comprehensive spatial coverage. ...
Conference Paper
Graffiti is an inseparable element of most large cities. It is of critical value to recognize whether it is a product of artistry or a sign of defacement. This study develops a large graffiti dataset containing a variety of graffiti types and annotated bounding boxes. We use this data to obtain a robust graffiti detection model. Compared with existing methods for the task, the proposed model achieves superior results. As a case study, the created model is evaluated on a vast number of street view images to localize graffiti incidence in the city of São Paulo, Brazil. We also validated our model using the case study data, and, again, the method achieved outstanding performance. The robustness of the technique enabled further analysis of the geographical distribution of graffiti. Considering graffiti as a spatial element of the city, we investigated its relation with crime occurrences. Relatively high correlation values were obtained between graffiti and crimes against pedestrians. Finally, this work raises many questions, such as the understanding of how these relationships change across the city according to the types of graffiti.
... Street-view images have been used in various applications, including urban modeling, land-use functions, walkability assessment, and crime evaluation [9,11,18,21]. Currently, research on individual building segmentation and extraction from image datasets is mostly based on airborne remote sensing images [3,12,8,22,23,25]. ...
Conference Paper
This paper proposes a new method to join building footprint GIS data with the relevant buildings in a street-view image, taken by a vehicle-mounted camera. This is achieved by segmenting buildings in the street-view images and identifying the relevant building coordinates in the image. The building coordinates on the image are then estimated from the building vertices in the building footprint GIS data and vehicle trajectory history. Finally, the objective building is identified and relevant building attributes corresponding to each building image are linked together. This method enables the development of building image datasets with associated building attributes. The building image data, when linked to the relevant building attributes, could contribute to many innovative urban analyses, such as urban monitoring, the development of three-dimensional (3D) city models, and image datasets for training with annotated building attributes.
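Estimating a building's coordinates in the image from footprint vertices, as described above, amounts to projecting known world points through the camera pose recovered from the vehicle trajectory. A generic pinhole-projection sketch (the function name and calibration values below are illustrative; the paper's camera model is not reproduced here):

```python
import numpy as np

def project_points(X_world, R, t, K):
    """Project 3D world points into the image: x ~ K (R X + t).
    X_world is (N, 3); R, t map world to camera coordinates; K is the
    3x3 intrinsic calibration matrix. Returns (N, 2) pixel coordinates."""
    Xc = X_world @ R.T + t        # world -> camera frame
    x = Xc @ K.T                  # apply intrinsics
    return x[:, :2] / x[:, 2:3]   # perspective divide

# Example: a camera at the origin looking down +z, focal length 800 px,
# principal point (320, 240) -- all hypothetical values.
K = np.array([[800.0, 0.0, 320.0],
              [0.0, 800.0, 240.0],
              [0.0, 0.0, 1.0]])
uv = project_points(np.array([[0.0, 0.0, 2.0]]), np.eye(3), np.zeros(3), K)
```

A footprint vertex with a known ground elevation and building height gives two such world points per vertex, bounding the building's extent in the image.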
Overpopulated cities have practiced trading available land for necessary livelihood resources, depleting accessible green assets in the densely developed urban habitat. Facing climate change as a risk multiplier, cities must reckon with alternative mitigation strategies. This study ascertains that green space has been proven to increase one's psychological well-being; it is also one of the most important factors mitigating the effects of air pollution and temperature. For communities necessitating viable green resources, the addition of a vertical greening retrofit on building façades can satisfy this need as an intermediary medium to engage meaningful interaction with the urban green-scape. Thus, proper urban planning must re-think how to incorporate these measures for public health. The assessment of these criteria is important to accomplish the planning of vertical green to enhance residents' well-being. A central focus is placed on the benefits to health and well-being and the increase in physical activity and social interaction at the neighborhood scale. This article presents a systemic analysis of vertical green-scape attainment in central Taipei, allowing locality and place to be incorporated into the exploration framework. The case studies presented in this paper focus on an analytical exploration of the vertical green-scape attainment framework with sub-tropical Taipei in mind.
The attributes assessment framework is established; the research concludes that: (1) in line with the human-nature connection, raising its priority level within both design research and design practice should consider the environmental, social, economic and spatial criteria for design thinking exploration; (2) the analysis confirmed that the desired green-scape can be attained either indoors or outdoors, with tissue, support, and infill attributes as a probable solution suited to the locality; (3) integration with the urban context is highly capable of encouraging social well-being in the urban system.
The paper introduces a data collection system and a processing pipeline for automatic geo-registered 3D reconstruction of urban scenes from video. The system collects multiple video streams, as well as GPS and INS measurements in order to place the reconstructed models in geo-registered coordinates. Besides high quality in terms of both geometry and appearance, we aim at real-time performance. Even though our processing pipeline is currently far from being real-time, we select techniques and we design processing modules that can achieve fast performance on multiple CPUs and GPUs aiming at real-time performance in the near future. We present the main considerations in designing the system and the steps of the processing pipeline. We show results on real video sequences captured by our system.
In this work we present a method to detect overlaps in image sequences, and use this information to integrate overlapping sparse 3D structure from video sequences. The additional temporal information of these images is used to increase robustness over single image pair matching. A scanline optimization problem formulation is used to compute the best sequence alignment using wide-baseline image matching techniques. Compared to a direct dynamic programming approach, the scanline optimization formulation increases the robustness of sequence alignment for general relative motions. The proposed alignment method is employed to integrate sparse 3D models reconstructed from separate video sequences. In addition, loop closures are detected. Consequently, the 3D modeling process from sequential image data can be split into fast sequence processing and subsequent global integration steps.
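The direct dynamic-programming baseline that the abstract above compares against can be illustrated as a monotone alignment over a frame-similarity matrix. This is a simplified sketch (the scoring, gap penalty, and names are our own; it is not the paper's scanline formulation):

```python
import numpy as np

def align_sequences(S, gap=-0.25):
    """Dynamic-programming alignment of two image sequences given a
    similarity matrix S[i, j] (higher = frames i, j more likely to overlap).
    Returns the matched (i, j) frame pairs and the alignment score."""
    n, m = S.shape
    D = np.zeros((n + 1, m + 1))                 # free start anywhere
    back = np.zeros((n + 1, m + 1), dtype=int)   # 0 = match, 1 = skip row, 2 = skip col
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            moves = (D[i-1, j-1] + S[i-1, j-1],  # match frames i-1, j-1
                     D[i-1, j] + gap,            # skip a frame of sequence 1
                     D[i, j-1] + gap)            # skip a frame of sequence 2
            k = int(np.argmax(moves))
            back[i, j], D[i, j] = k, moves[k]
    pairs, (i, j) = [], (n, m)                   # trace the best path back
    while i > 0 and j > 0:
        if back[i, j] == 0:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif back[i, j] == 1:
            i -= 1
        else:
            j -= 1
    return pairs[::-1], D[n, m]
```

With wide-baseline matching scores filling `S`, the recovered path indicates which frames of the two sequences view the same part of the scene.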
Technical Report
Bundle adjustment using the Levenberg-Marquardt minimization algorithm is almost invariably used as the last step of every feature-based structure and motion estimation vision algorithm to obtain optimal 3D structure and viewing parameter estimates. However, due to the large number of unknowns contributing to the minimized reprojection error, a general purpose implementation of the Levenberg-Marquardt algorithm incurs high computational costs when applied to the problem of bundle adjustment. Fortunately, the lack of interaction among parameters for different 3D points and cameras in multiple view reconstruction results in the underlying normal equations exhibiting a sparse block structure, which can be exploited to gain considerable computational benefits. This paper presents the design and explains the use of sba, a publicly available C/C++ software package for generic bundle adjustment based on the sparse Levenberg-Marquardt algorithm.
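The sparse block structure mentioned above can be illustrated with the reduced camera system: because the point-point block C of the normal equations is block-diagonal (one 3×3 block per 3D point), it can be inverted block-by-block and eliminated via the Schur complement. A toy numpy sketch of this linear-algebra core (sizes and names are illustrative; this is not sba's API):

```python
import numpy as np

def schur_solve(B, E, C_blocks, v_cam, v_pt):
    """Solve the normal equations  [[B, E], [E.T, C]] [dc; dp] = [v_cam; v_pt]
    where C is block-diagonal with one 3x3 block per 3D point, as in
    bundle adjustment (B: camera-camera block, E: camera-point block)."""
    n = len(C_blocks)
    Cinv = np.zeros((3 * n, 3 * n))
    for k, Cb in enumerate(C_blocks):
        Cinv[3*k:3*k+3, 3*k:3*k+3] = np.linalg.inv(Cb)  # cheap per-point inverses
    S = B - E @ Cinv @ E.T                               # Schur complement: small
    dc = np.linalg.solve(S, v_cam - E @ Cinv @ v_pt)     # reduced camera system
    dp = Cinv @ (v_pt - E.T @ dc)                        # back-substitute points
    return dc, dp
```

The dense solve is confined to the camera block (6 parameters per camera), which is tiny compared to the full system when thousands of points are involved.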
An efficient algorithmic solution to the classical five-point relative pose problem is presented. The problem is to find the possible solutions for relative camera motion between two calibrated views given five corresponding points. The algorithm consists of computing the coefficients of a tenth degree polynomial and subsequently finding its roots. It is the first algorithm well suited for numerical implementation that also corresponds to the inherent complexity of the problem. The algorithm is used in a robust hypothesise-and-test framework to estimate structure and motion in real-time.
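The five-point solver itself requires the roots of the tenth-degree polynomial mentioned above; once an essential matrix is in hand, however, extracting the four candidate relative poses is a short SVD computation. A sketch of that standard decomposition (this is the textbook post-processing step, not the five-point solver itself; the true pose is then selected by a cheirality test on triangulated points):

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def decompose_essential(E):
    """Return the four (R, t) motion candidates encoded by an essential
    matrix E = [t]_x R (t recovered up to scale as a unit vector)."""
    U, _, Vt = np.linalg.svd(E)
    if np.linalg.det(U) < 0:   # enforce proper rotations
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0]])
    R1, R2 = U @ W @ Vt, U @ W.T @ Vt
    t = U[:, 2]                # left null vector of E
    return [(R1, t), (R1, -t), (R2, t), (R2, -t)]
```

Only one of the four candidates places the triangulated points in front of both cameras, which is the test applied inside a hypothesise-and-test loop.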
Conference Paper
It is a well known classical result that given the image projections of three known world points it is possible to solve for the pose of a calibrated perspective camera, with up to four pairs of solutions. We solve the generalised problem where the camera is allowed to sample rays in some arbitrary but known fashion and is not assumed to perform a central perspective projection. That is, given three back-projected rays that emanate from a camera or multi-camera rig in an arbitrary but known fashion, we seek the possible poses of the camera such that the three rays meet three known world points. We show that the generalised problem has up to eight solutions that can be found as the intersections between a circle and a ruled quartic surface. A minimal and efficient constructive numerical algorithm is given to find the solutions. The algorithm derives an octic polynomial whose roots correspond to the solutions. In the classical case, when the three rays are concurrent, the ruled quartic surface and the circle possess a reflection symmetry such that their intersections come in symmetric pairs. This manifests itself in the vanishing of the odd order terms of the octic polynomial. As a result, up to four pairs of solutions can be found in closed form. The proposed algorithm can be used to solve for the pose of any type of calibrated camera or camera rig. The intended use for the algorithm is in a hypothesise-and-test architecture.
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
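The integral-image trick behind SURF's speed, mentioned in the abstract above, is easy to illustrate: after one pass of cumulative sums, any axis-aligned box filter response costs four table lookups regardless of box size, which is what makes the Hessian approximation with box filters fast at all scales. A small numpy sketch (function names are ours):

```python
import numpy as np

def integral_image(img):
    """Summed-area table with an extra zero top row / left column,
    so ii[r, c] holds the sum of img[:r, :c]."""
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1))
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, r0, c0, r1, c1):
    """Sum of img[r0:r1, c0:c1] from four lookups, independent of box size."""
    return ii[r1, c1] - ii[r0, c1] - ii[r1, c0] + ii[r0, c0]
```

SURF's box-filter approximations of Gaussian second derivatives are assembled from a handful of such `box_sum` calls, so the cost per interest-point response is constant in the filter size.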