Comparative Analysis of Classification Techniques for Building Block Extraction using
Aerial Imagery and LiDAR data
E. Bratsolis1,3, S. Gyftakis1,2, E. Charou2, N. Vassilas1
1Department of Informatics,
Technological Educational Institute of Athens,
12210 Aigaleo, Greece
2Inst. of Informatics and Telecommunications
NCSR Demokritos, 15310 Agia Paraskevi, Greece
3Department of Physics
National University of Athens, 15784 Athens, Greece
Abstract — Building detection is a prominent topic in image
classification. Most research effort is adapted to the specific
application requirements and available datasets. In this
paper we present a comparative analysis of different classification
techniques for building block extraction. Our dataset includes aerial
orthophotos (with spatial resolution 20cm), a DSM generated from
LiDAR (with spatial resolution 1m and elevation resolution 20 cm)
and DTM (spatial resolution 2m) from an area of Athens, Greece.
The classification methods tested are unsupervised (K-Means, Mean
Shift), and supervised (Feed Forward Neural Net, Radial-Basis
Functions, Support Vector Machines). We evaluated the performance
of each method using a subset of the test area. We present the
classified images and statistical measures (confusion matrix, kappa
coefficient and overall accuracy). Our results demonstrate that Mean
Shift is the best unsupervised method, performing on a par with the
best supervised methods.
Keywords — remote sensing, image classification algorithms,
LiDAR.
I. INTRODUCTION
Recently, there has been an increasing demand for detailed 3D
models of buildings, monuments, urban planning units and
cities from elevation (x, y, z) data such as those acquired from
LiDAR airborne scanners [1]. A critical step towards 3D
modeling is the segmentation of the spatial data into
homogeneous regions (e.g. buildings, parts of urban planning
units, roads, etc.) and then the extraction of their boundaries.
The algorithms and methods of image segmentation constitute,
even nowadays, an open research domain in the fields of image
analysis, computer vision and digital remote sensing. The
various techniques that have been developed for this purpose
are based either exclusively on LiDAR data [2] or combine
LiDAR with other complementary data such as digital maps
[3], high resolution satellite images [3] or aerial
orthophotographs [4].
Following the progressive increase in sensor accuracy over
the last decades, small- and medium-scale research in the
mature fields of land-use and land-cover applications has been
expanded to include large-scale detection of buildings and
other urban objects that can be used for the generation of 3D
models. For example, conventional multispectral unsupervised
classification methods followed by co-occurrence-matrix-based
filtering have been used for building classification in urban
regions [5]. Because the low-resolution TM-SPOT satellite
images represented medium-sized buildings (12-20 m in
width) with fewer than five pixels, the aim of this semi-automatic
classification method was to manually determine rough
building classes following a fairly fine-grained clustering. On the
other hand, 3D building reconstruction in [6] is assisted by
accurate 2D digital maps used to locate buildings in laser
scanning data, thus, bypassing the need for an automatic
building detection phase. Building roofs can then be
reconstructed from point clouds through a model based or data
driven approach.
Ahlberg et al. [7] present a method using high resolution
LiDAR and image data for 3D building reconstruction which is
based on a series of preprocessing steps that include generation
of DTM using active contours, ground classification via simple
height thresholding, segmentation and building versus non-
building classification of each segment using an artificial
neural network. The classification is based on measures of
shape, curvature and maximum slope. The 3D reconstruction is
completed by extracting planar roof faces from the elevation
data.
Chen et al. [8] also proposed a scheme towards building
detection and 3D building reconstruction by integrating
LiDAR data, multispectral satellite images and aerial
photography. Building detection is performed through region-
based segmentation and knowledge-based classification. Then,
3D reconstruction is achieved in four steps: (1) 3D planar patch
forming, (2) initial building edge detection, (3) straight line
extraction, and (4) split-merge-shape patented method for
building modeling.
In our case, the available LiDAR data are not in the form of
point clouds. Instead, they have been provided in the form of a
Digital Surface Model (DSM) and a Digital Terrain Model (DTM)
at relatively low resolutions. On the other hand, the high resolution
aerial imagery of the same urban region has a spatial
resolution five and ten times higher than the DSM and DTM
respectively. The inaccuracies of the low resolution LiDAR
data, as well as the arrangement and layout of the buildings in
the pilot area of Kallithea (characteristic of most of Athens's
urban areas), make automatic building detection a challenging
task. It is our belief that an attempt to automatically
detect buildings under these circumstances should follow
careful and well designed steps, similar to [7]. To this end, a
number of unsupervised and supervised classification methods
are implemented and tested for building block segmentation
using a fusion of aerial and LiDAR data. In particular, all
classifiers have been trained in a 4-D input space (i.e. the RGB
bands of the optical image augmented by an upsampled
LiDAR depth map).
Sections II and III present the unsupervised and supervised
classification methods tested in this work, respectively. The
data are described in Section IV, classification results are presented
in Section V and, finally, Section VI presents the conclusions
and future work.
II. UNSUPERVISED IMAGE CLASSIFICATION
TECHNIQUES
Although in most cases the accuracy of the 3D building
models is determined by the spatial resolution of the LiDAR data, a
challenge often encountered with multiple independent
data sources and preprocessing steps is the fusion of low resolution
LiDAR data with high resolution aerial images so as to improve the
accuracy of 3D building reconstruction.
To take into account the added problem complexity due to
touching buildings, which in many cases form block-scale
connected regions, we suggest first performing a building block
image segmentation through building vs. non-building pixel
classification prior to the detection of building footprints.
Since the aim is to develop an automatic system for 3D
building reconstruction, it is our intention to select an
unsupervised technique with a classification accuracy
comparable to that of well-known supervised classifiers. In this
respect, two of the most popular unsupervised classification
techniques have been examined in this work and are presented
in the sequel.
A. K-means
K-means is an unsupervised classification technique in
which the user initiates the algorithm by specifying the desired
number of classes. The algorithm starts with a set of clusters in
the feature space, each defined by its center. Each pixel in the
image is first associated with the nearest centroid; the mean
values of the pixels in each cluster are then computed and the
centroids are replaced by them. These steps are repeated
iteratively until the assignment of pixels to clusters no longer
changes [10].
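As an illustration only, the following minimal Python sketch clusters the 4-D pixel vectors into two classes. It assumes scikit-learn and a fused 4-channel image array; the array name `fused` and the cluster-labeling heuristic are ours, not from the original experiments, which used Matlab and Monteverdi.

```python
# Minimal K-means sketch for two-class (building / non-building)
# clustering of 4-D pixel vectors (R, G, B, n-DSM); `fused` and the
# labeling heuristic are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_building_mask(fused):
    """fused: (H, W, 4) array of normalized R, G, B, n-DSM values."""
    h, w, c = fused.shape
    pixels = fused.reshape(-1, c)                 # one 4-D sample per pixel
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(pixels)
    mask = labels.reshape(h, w)
    # Heuristic: the cluster with the higher mean n-DSM is "building".
    if fused[..., 3][mask == 0].mean() > fused[..., 3][mask == 1].mean():
        mask = 1 - mask
    return mask.astype(np.uint8)                  # 1 = building, 0 = other
```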
B. Mean Shift
Mean Shift was first proposed by Fukunaga and Hostetler
[11], later adapted by Cheng [12] for the purpose of image
analysis, and more recently extended by Comaniciu, Meer and
Ramesh to low-level vision problems, including segmentation
[13], adaptive smoothing [13] and tracking [14].
The main idea behind mean shift is to treat the points in a
multidimensional feature space as an empirical probability
density function where dense regions in the feature space
correspond to the local maxima or modes of the underlying
distribution. For each data point in the feature space, one
performs a gradient ascent procedure on the local estimated
density until convergence. The stationary points of this
procedure represent the modes of the distribution. The data
points associated with the same stationary point are considered
members of the same cluster.
For a given pixel, the mean shift algorithm builds a set of
neighboring pixels within a given spatial radius and a color
range. The spatial and color center of this set is then computed
and the algorithm iterates with this new spatial and color
center.
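As a rough illustration of this mode-seeking view, the sketch below clusters the 4-D pixel features with scikit-learn's flat-kernel mean shift. Note the single `bandwidth` parameter only loosely corresponds to the separate spatial and color radii described above, and its value here is a placeholder rather than a parameter from our experiments.

```python
# Illustrative mean-shift clustering of 4-D pixel features with
# scikit-learn; `bandwidth` is a placeholder value.
import numpy as np
from sklearn.cluster import MeanShift

def mean_shift_labels(fused, bandwidth=15.0):
    """fused: (H, W, 4) array of normalized R, G, B, n-DSM values."""
    h, w, c = fused.shape
    pixels = fused.reshape(-1, c)
    # bin_seeding quantizes the initial seeds, keeping the per-point
    # gradient-ascent mode search tractable on image-sized data.
    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    return ms.fit_predict(pixels).reshape(h, w)
```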
III. SUPERVISED IMAGE CLASSIFICATION
TECHNIQUES
A. Feed Forward Neural Net (FFNN)
Artificial neural networks (ANNs) are connectionist
systems consisting of many primitive units (artificial neurons)
which are working in parallel and are connected via directed
links. The general neural unit has several inputs and each input
is weighted with a weight factor. The main processing principle
of these units is the distribution of activation patterns across the
links similarly to the basic mechanism of a biological neural
network. The knowledge is stored in the structure of the links,
their topology and weights which are organized by training
procedures. The link connecting two units is directed, fixing a
source and a target unit. The weight attributed to a link
transforms the output of the source unit into an input of the
target unit; adjusting these weights from labeled examples is
what makes the learning supervised. Depending on the weight,
the transmitted signal can range from strongly activating to
strongly inhibiting.
The basic function of a unit is to accept inputs from units
acting as sources, to activate itself, and to produce one output
that is directed to units-targets. Based on their topology and
functionality, the units are arranged in layers. The layers can be
generally divided into three types: input, hidden, and output.
The input layer consists of units that are directly activated by
the input pattern, while the output layer consists of the units that
produce the output pattern of the network. All other layers
are hidden and not directly accessible. Supervised learning
proceeds by minimizing a cost (or error) function with respect
to all of the network weights. The activation function of each
unit is the sigmoid function [15].
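For concreteness, the sketch below shows the forward pass of such a network with sigmoid units. The weights are random placeholders rather than trained values; the 4-20-10-1 topology anticipates the architecture used in Section V.

```python
# Schematic forward pass of a feed-forward network with sigmoid units;
# weights are random placeholders, not trained values.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(x, weights, biases):
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)   # weighted sum of inputs, then activation
    return a

rng = np.random.default_rng(0)
sizes = [4, 20, 10, 1]           # input, two hidden layers, output
weights = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes, sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]
score = forward(rng.random(4), weights, biases)   # network output
```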
B. Radial-Basis Functions
Radial-basis functions (RBFs) were first introduced for the
solution of real multivariate interpolation problems; the early
work on this subject is surveyed by Powell [16]. In RBF neural
networks, radial basis functions are embedded into a two-layer
feed-forward neural network. The network has a set of inputs
and a set of outputs, and between them lies a layer of
processing units referred to as hidden units, each implemented
with a radial basis function. The nodes of the hidden layer
generate a localized response to the input through the radial
basis functions, and the output layer realizes a linear weighted
combination of the outputs of the hidden basis functions. A
large class of radial-basis functions exists; the Gaussian is a
common choice.
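A minimal sketch of this structure follows, assuming Gaussian basis functions, randomly chosen centers and a least-squares fit of the linear output weights; these are all illustrative choices, not the training procedure of our experiments.

```python
# RBF network sketch: Gaussian hidden units around fixed centers, plus
# a linear output layer fitted by least squares; data, centers and
# spread are illustrative.
import numpy as np

def rbf_design(X, centers, spread):
    # Gaussian response of every hidden unit to every input sample.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * spread ** 2))

rng = np.random.default_rng(0)
X = rng.random((200, 4))                          # toy 4-D pixel features
t = (X[:, 3] > 0.5).astype(float)                 # toy target labels
centers = X[rng.choice(len(X), 60, replace=False)]
Phi = rbf_design(X, centers, spread=0.7)
w, *_ = np.linalg.lstsq(Phi, t, rcond=None)       # linear output weights
pred = Phi @ w                                    # weighted combination
```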
C. Support Vector Machines
Support Vector Machines (SVMs) were introduced within
the framework of Statistical Learning Theory [17]-[19]
developed by V. Vapnik and co-workers. The approach
consists of searching for the separating surface between two
classes by determining the subset of training samples that best
describes the boundary between them. These samples are
called support vectors and completely define the classification
system. When the two classes are not linearly separable, the
method uses a kernel expansion to project the feature space
onto a higher-dimensional space in which the separation of the
classes becomes linear.
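The sketch below, assuming scikit-learn and toy data, shows a two-class SVM with an RBF kernel expansion on the 4-D pixel features; the hyperparameters are placeholders, not the values used in our experiments.

```python
# Hedged two-class SVM sketch with an RBF kernel expansion; training
# data and hyperparameters are placeholders.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.random((500, 4))                    # R, G, B, n-DSM samples
y_train = (X_train[:, 3] > 0.5).astype(int)       # toy building labels
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
labels = clf.predict(rng.random((10, 4)))         # 1 = building, 0 = other
```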
IV. DATA DESCRIPTION
For our experiments we used the following dataset from the
Kallithea neighborhood of Athens, Greece: orthophotos from
color (R, G, B channels) aerial imagery acquired in 2007 by
the National Cadastre and Mapping Agency of Greece, with a
spatial resolution of 20 cm (Fig. 1); LiDAR data (DSM)
acquired in 2003 by GeoIntelligence SA over the same area,
with a spatial resolution of 1 m and a vertical resolution of
20 cm; and a DTM of the same area, also from GeoIntelligence
SA, with a spatial resolution of 2 m.
By combining the LiDAR (DSM) and DTM datasets we
produced the normalized DSM (n-DSM) of the above area
(Fig. 2). The n-DSM is the difference between the DSM and
the DTM and represents net building heights rather than
absolute elevations. In our experiments we used the three
channels of the orthophoto augmented by the n-DSM as one
additional channel. The values of all four channels were
normalized to the same range [0, 255].
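The following sketch outlines this fusion step, assuming SciPy for the resampling; the array names, interpolation order and exact cropping are illustrative assumptions rather than the exact preprocessing used here.

```python
# n-DSM construction sketch: bring the 2 m DTM to the 1 m DSM grid,
# subtract, upsample the result to the 20 cm orthophoto grid, rescale
# to [0, 255] and stack with the RGB bands. Names are illustrative.
import numpy as np
from scipy.ndimage import zoom

def build_feature_stack(rgb, dsm, dtm):
    """rgb: (H, W, 3) at 20 cm; dsm: 1 m grid; dtm: 2 m grid."""
    ndsm = dsm - zoom(dtm, 2, order=1)            # DTM: 2 m -> 1 m grid
    ndsm = zoom(ndsm, 5, order=1)                 # n-DSM: 1 m -> 20 cm grid
    ndsm = ndsm[: rgb.shape[0], : rgb.shape[1]]   # crop to orthophoto size
    ndsm = 255.0 * (ndsm - ndsm.min()) / (np.ptp(ndsm) + 1e-9)
    return np.dstack([rgb.astype(np.float64), ndsm])
```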
V. CLASSIFICATION RESULTS
All experiments were performed using the Matlab and
Monteverdi environments. Following training of the supervised
or unsupervised classifiers, test results were obtained for the
central urban block of Fig. 1 (including patio pixels in the
middle of the block). The corresponding prototype building
block mask, to which all results are compared, is shown in
Fig. 4.
The number of classes for the K-Means algorithm was set
to 2. The training set included the 4-D points from all pixels of
the whole region and the algorithm converged in fewer than 10
iterations. The corresponding classification result is shown in
Fig. 5.
As far as the Mean Shift algorithm is concerned, the spatial
radius was set to 5 pixels and the spectral range to 15
(expressed in radiometric units). The resulting amplitude image
was then thresholded using Otsu's optimal threshold selection
method in order to produce the final binary image. The Mean
Shift classification result is shown in Fig. 6.
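A minimal OpenCV sketch of this filtering-plus-thresholding pipeline is given below; sp and sr follow the spatial radius (5 pixels) and spectral range (15) stated above, while the input file name is hypothetical and, since OpenCV's implementation accepts only 8-bit 3-channel images, the sketch operates on the RGB bands alone.

```python
# Mean shift filtering followed by Otsu thresholding (sketch); the
# file name is hypothetical, and only the RGB bands are used because
# cv2.pyrMeanShiftFiltering requires an 8-bit 3-channel input.
import cv2

rgb = cv2.imread("block_orthophoto.png")               # hypothetical input
smoothed = cv2.pyrMeanShiftFiltering(rgb, sp=5, sr=15)
gray = cv2.cvtColor(smoothed, cv2.COLOR_BGR2GRAY)
# Otsu picks the threshold minimizing intra-class intensity variance.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
```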
The FFNN classifier consists of four layers: an input layer
with four nodes, a first hidden layer with 20 nodes, a second
hidden layer with 10 nodes and an output layer with one node.
Our classifier uses the Levenberg-Marquardt training
algorithm and converges in fewer than 30 epochs with a training
set consisting of pixels from representative areas of a broad
region, balanced in terms of category representation and
corresponding to roughly 1.5% of the overall image. The
FFNN classification results are shown in Fig. 7.
The radial-basis function used in the RBF classifier is the
Gaussian. The spread constant was set to 70 and the number of
hidden nodes to 60. Using the same training set as before, the
classification results are shown in Fig. 8. Finally, the
classification results for the SVM are shown in Fig. 9.
In all classification results, buildings are shown as white
pixels and non-buildings as black pixels. For the evaluation and
quantitative comparison of all classification methods, we
compared all results with the prototype mask of Fig. 4. In
particular, in order to quantify the quality of the classification
results, we computed the following statistical measures on
the above subsets:
a) Confusion matrix: A = [aij], where aij is the number of
pixels from the jth class that have been classified as belonging
to the ith class, divided by arj, the overall number of pixels
from class j.
b) Overall accuracy: P0 = ∑aii / at, where at is the total
number of pixels of the evaluation subset.
c) Kappa coefficient: κ = (P0 − Pe) / (1 − Pe), where P0 is
the probability of correct pixel classification, i.e. P0 = ∑aii / at,
and Pe = ∑ari aci / at², with ari the overall number of pixels
from class i and aci the overall number of pixels assigned to
class i. Values of κ exceeding 0.75 suggest strong, non-accidental
classification performance [16].
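These measures can be computed directly from a predicted binary mask and the reference mask; the sketch below follows the definitions above, with `pred` and `ref` as illustrative names.

```python
# Column-normalized confusion matrix, overall accuracy and kappa,
# following the definitions above; `pred` and `ref` are illustrative
# names for the predicted and reference class masks.
import numpy as np

def evaluate(pred, ref, n_classes=2):
    a_t = ref.size                                # total evaluated pixels
    counts = np.zeros((n_classes, n_classes))
    for i in range(n_classes):
        for j in range(n_classes):
            counts[i, j] = np.sum((pred == i) & (ref == j))
    p0 = np.trace(counts) / a_t                   # overall accuracy
    pe = (counts.sum(axis=1) * counts.sum(axis=0)).sum() / a_t ** 2
    kappa = (p0 - pe) / (1.0 - pe)
    cm = counts / counts.sum(axis=0, keepdims=True)   # a_ij / a_rj
    return cm, p0, kappa
```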
The corresponding confusion matrices, Cohen’s kappa
coefficient and overall accuracy are shown in Table I.
TABLE I. CLASSIFICATION RESULTS

Unsupervised Method | Confusion Matrix  | Kappa Coefficient | Overall Accuracy
K-Means             | 0.8191  0.1809    | 0.6400            | 0.8298
                    | 0.1498  0.8502    |                   |
Mean Shift          | 0.9166  0.0834    | 0.7529            | 0.8885
                    | 0.1650  0.8350    |                   |

Supervised Method   | Confusion Matrix  | Kappa Coefficient | Overall Accuracy
FFNN                | 0.9230  0.0770    | 0.7770            | 0.8992
                    | 0.1460  0.8540    |                   |
RBF                 | 0.9166  0.0834    | 0.7829            | 0.9012
                    | 0.1281  0.8719    |                   |
SVM                 | 0.8192  0.1808    | 0.7151            | 0.8625
                    | 0.0554  0.9446    |                   |
VI. CONCLUSIONS AND FUTURE RESEARCH
Comparing the measures in Table I, we conclude that Mean
Shift is the best unsupervised method, while the FFNN and
RBF supervised methods yield the best, and very similar,
results overall. We intend to use the Mean Shift classification
algorithm as a step prior to building detection in densely built
areas such as the one in Fig. 1. Accurate building detection is a
necessary step towards the development of a system for
automatic 3D reconstruction of buildings.
ACKNOWLEDGMENT
This research has been co-funded by the European Union
(European Social Fund) and Greek national resources under the
framework of the Archimedes III: Funding of Research Groups
in T.E.I. of Athens project of the Education & Lifelong
Learning Operational Programme. We would also like to thank
GeoIntelligence for providing the DSM and DTM elevation
data as well as the National Cadastre & Mapping Agency of
Greece for providing us with the high resolution aerial
photographs.
REFERENCES
[1] C. Poullis, S. You, and U. Neumann, “Rapid creation of large-scale
photorealistic virtual environments”, IEEE Virtual Reality, pp. 153–160,
2008.
[2] J. Giglierano, Lidar Basics for Mapping Applications. US Geological
Survey, http://pubs.usgs.gov/of/2007/1285/pdf/Giglierano.pdf, Open-
File Report 2007-1285.
[3] Y. Zhang, Z. Zhang, J. Zhang and J. Wu, “3D Building Modeling With
Digital Map, LiDAR Data and Video Image Sequences”, The
Photogrammetric Record, Vol. 20(111), pp. 285-302, 2005.
[4] G. Sohn and I. Dowman, “Data fusion of high resolution satellite
imagery and LiDAR data for automatic building extraction”, ISPRS
Journal of Photogrammetry and Remote Sensing, Vol. 62, pp. 43–63,
2007.
[5] Y. Zhang, “Optimisation of building detection in satellite images by
combining multispectral classification and texture filtering”, ISPRS
Journal of Photogrammetry and Remote Sensing, Vol. 54, pp. 50–60,
1999.
[6] G. Vosselman, “Fusion of Laser Scanning Data, Maps, and Aerial
Photographs for Building Reconstruction”, in Proc. IEEE International
Geoscience and Remote Sensing Symposium and the 24th Canadian
Symposium on Remote Sensing, IGARSS'02, Toronto, Canada, 2002.
[7] S. Ahlberg, U. Söderman, M. Elmqvist and Å. Persson, “On Modelling
and Visualisation of High Resolution Virtual Environments Using
LiDAR Data”, in Proc. 12th International Conference on
Geoinformatics − Geospatial Information Research: Bridging the
Pacific and Atlantic, Geoinformatics 2004, University of Gävle,
Sweden, 7-9 June 2004.
[8] L.-C. Chen , T.-A. Teo, Y.-C. Shao, Y.-C. Lai, and J.-Y. Rau, “Fusion of
LiDAR Data and Optical Imagery for Building Modeling”, International
Archives of Photogrammetry, Remote Sensing and Spatial Information
Sciences, Vol. 35, pp.732–737, 2004.
[9] H. Park and S. Lim. “Data Fusion of Laser Scanned Data and Aerial
Ortho-Imagery for Digital Surface Modeling”, in Proc. 3rd Int.
Workshop on 3D Geo-Information, Seoul, Korea, pp. 65-72, 13-14
November, 2008.
[10] R. Duda, P. Hart, and D. Stork, Pattern Classification, Wiley, pp. 526-
527, 2001.
[11] K. Fukunaga and L. Hostetler, “The estimation of the gradient of a
density function, with applications in pattern recognition”, IEEE
Transactions on Information Theory, 21(1), pp. 32–40, 1975.
[12] Y. Cheng, “Mean shift, mode seeking, and clustering”, IEEE
Transactions on Pattern Analysis and Machine Intelligence, 17(8), pp.
790–799, 1995.
[13] D. Comaniciu, and P. Meer, “Mean shift: A robust approach toward
feature space analysis”, IEEE Transactions on Pattern Analysis and
Machine Intelligence, 24(5), pp. 603–619, 2002.
[14] D. Comaniciu, V. Ramesh, and P. Meer, “Kernel-based object tracking”,
IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.
25(5), pp. 564–577, 2003.
[15] J.D. Paola, and R.A. Schowengerdt, “A Detailed Comparison of
Backpropagation Neural Network and Maximum – Likelihood
Classifiers for Urban Land Use Classification”, IEEE Transactions
Geosci. Remote Sens., Vol. 33, No. 4, pp. 981-996, July 1995.
[16] M.J.D. Powell, “Radial basis functions for multivariable interpolation: A
review”, in Proc. IMA Conference on Algorithms for the Approximation
of Functions and Data, pp. 143-167, RCMS, Shrivenham, England,
1985.
[17] V.N. Vapnik, The Nature of Statistical Learning Theory, Springer Verlag,
New York, 1995.
[18] V.N. Vapnik, Statistical Learning Theory, Wiley, New York, 1998.
[19] C. Cortes, and V.N. Vapnik, “Support vector networks”, Machine
Learning, Vol. 20, pp. 273-297, 1995.
[20] T. Kubik, W. Paluszynski, A. Iwaniak and P. Tymkow, “Supervised
Classification of Multi-Spectral Satellite Images Using Neural
Networks”, in Proc. of the 10th IEEE International Conference on
Methods and Models in Automation and Robotics (MMAR 2004), Eds.
S. Domek, R. Kaszyński , pp. 769-744, Miedzyzdroje, Poland, 2004.
Fig. 1. Orthophoto image of building block for test.
Fig. 2. The n-DSM elevation data of the same region as that of Fig. 1.
Fig. 3. A 3-D representation of n-DSM.
Fig. 4. Binary image subset for evaluation of classification.
Fig. 5. Classification result of K-Means.
Fig. 6. Classification result of Mean Shift.
Fig. 7. Classification result of FFNN.
Fig. 8. Classification result of RBF.
Fig. 9. Classification result of SVM.