Scalable Algorithms for Large High-Resolution Terrain Data∗
Pankaj K. Agarwal
In this paper we demonstrate that the technology required to per-
models has matured enough to be ready for use by practitioners.
We also demonstrate the impact that high-resolution data has on
common problems. To our knowledge, some of the computations
we present have never before been carried out by standard desktop
computers on data sets of comparable size.
Categories and Subject Descriptors: D.2 [Software]: Software
Engineering; F.2.2 [Nonnumerical Algorithms and Problems]: Ge-
ometrical problems and computations; H.2.8 [Database Manage-
ment]: Database Applications—Data Mining, Image Databases,
Spatial Databases and GIS
General Terms: Performance, Algorithms
Keywords: LIDAR, Massive data, GIS
The revolution in sensing and mapping technologies is providing
an unprecedented opportunity to characterize and understand the
earth’s surface, its dynamics, and its properties. For instance, sec-
ond generation airborne LIDAR technology can map the earth’s
surface at a 15-20cm horizontal resolution, and the future generation
of LIDAR scanners is expected to generate high-resolution
maps of other planets. Capitalizing on these opportunities and
transforming these massive amounts of topographic data into use-
ful information for vastly different types of users requires solving
several challenging algorithmic problems. GIS, geometric comput-
ing and other disciplines have made great strides, during the last
few years, in providing theoretical insights, algorithmic tools, and
software for meeting many of the challenges that arise when these
large data sets have to be processed.
∗Work in this paper was supported by ARO grant W911NF-04-1-0278.
Pankaj K. Agarwal and Thomas Mølhave are also supported by NSF un-
der grants CCR-00-86013, CCR-02-04118, and DEB-04-25465, and by a
grant from the U.S.–Israel Binational Science Foundation. Lars Arge and
Morten Revsbæk are also supported by a NABIIT grant from the Danish
Strategic Research Council and by MADALGO - Center for Massive Data
Algorithmics - a Center of the Danish National Research Foundation.
COM.Geo 2010 June 21-23, 2010, Washington, DC, USA.
Copyright 2010 ACM 978-1-4503-0031-5 ...$10.00.
Figure 1. The coast line of Denmark, the main study area for the
experiments in this paper.
A major challenge is the sheer size of the gathered topographic
data which is exposing serious scalability problems with existing
GIS systems. The main reason for these problems is that algo-
rithms in current systems often assume the data to fit in the main
memory of the computer, which is not the case for the large data
sets provided by modern LIDAR scanners. A main memory access
is around 10^6 times faster than a disk access, and programs that do
not use the disk efficiently are essentially slowed down by that factor.
Thus it is key to use algorithms that try to minimize disk accesses
when handling massive data sets. Developing such Input/Output-
efficient (or just I/O-efficient) algorithms has been a flourishing
research area for many years.
There are a number of ways traditional GIS applications attempt
to deal with the massive terrain elevation data sets. The simplest
common technique is to thin the point cloud by more or less arbi-
trarily discarding a significant fraction of the points. A similar form
of thinning is often applied to gridded terrain models where neigh-
boring grid cells are averaged (or discarded) to produce a grid with
a much larger cell size and correspondingly lower spatial fidelity.
These methods, although obviously very effective at reducing the
size of the point cloud or grid, are not ideal since important topo-
graphic features will be lost in such an operation. This can signif-
icantly reduce the accuracy and effectiveness of many topographic
analysis techniques. Finding a way to drastically reduce the data
set without invariably creating problems with non-trivial computa-
tions would in many cases likely not be much easier than solving
the original problem directly.
Another practical and popular method is to bin the data into tiles
of some fixed small size, transforming a very big data set into a list of
manageable units. This can be done directly on the original point
cloud, but also on gridded terrain models. However this approach
is not feasible for a lot of non-trivial computations where interac-
complexity of the method.
There has also been a growing amount of research into theo-
retically efficient algorithms for solving GIS-related problems for
massive data sets [1, 2, 3, 10, 11, 19]. Many of the algorithmic de-
velopments have been mostly theoretical or their implementations
have not been ready to be presented to a wider non-technical audi-
ence in a user-friendly manner. In 2007 some of the present authors
presented a preliminary version of a tool, TerraSTREAM, that spe-
cializes in handling large terrain models  for solving a range of
However, despite the recent research, most popular GIS soft-
ware packages remain unable to handle truly large elevation mod-
els. Thus users usually resort to crude approximations like the ones
mentioned above, and do not routinely work with the full detail
provided by their data sets. This decreases the value of the de-
rived products and is unfortunate given the considerable investment
a large-scale LIDAR survey represents.
The goal of our research is to make theoretical work practical
and thereby make practical algorithms more theoretically grounded
than the existing algorithms. This enables us to prove guarantees
on the performance of the practical algorithms we propose. In
this paper we demonstrate the value of these algorithms to the GIS
on country-sized high-resolution terrain models is now feasible on
standard desktop computers and without any complicated trickery
required by the user. We also demonstrate why it is important to
use the full high-resolution data. Our main examples in this paper
will be two hydrology-related problems; flow modeling and flood
The main data set used in this paper is a massive LIDAR point
cloud of the country of Denmark (with 26 billion points). It is so
big that most users are unable to perform non-trivial computations
on the entire model at once. To our knowledge most of the compu-
tations in this paper have never before been performed on data sets
of comparable size using standard desktop computers.
data sets by showing examples of how flow modeling and flood
mapping computations are affected by changes in topography. We
will then, in Sections 3 and 4, give an overview of the software and
the process we can use to perform computations on very large data
sets. Finally in Section 5 we will give more detail on the experi-
ments and on how we visualize the results.
In this section we illustrate our system by describing algorithms for
two widely used hydrological problems. These problems illustrate
how the high spatial fidelity of modern data sets improves our ability
to reason about the terrain.
of water on the surface of the terrain. Such modeling can reveal
how water accumulates into creeks that converge and form streams,
and later rivers, and can be used to extract the watersheds of the
terrain. Intuitively, a river network is a collection of paths that
indicate where large amounts of water, or rivers, are likely to flow
on the terrain.
Another popular product of elevation models is the computa-
tion of flood risk information. Computing highly accurate flood
Importance of High-Resolution Data sets
Figure 2. Flood map computation, the protected area behind the
dike is not flooded when the ocean level reaches h.
risk information is a very complex task and involves many types
of fine-grained information about the study area. However, with
high-resolution terrain models that contain geometrically small but
important features (such as dikes), it is possible to get good initial
flood risk estimates using only topography and ignoring the effect
of e.g. sewers and the groundwater. In this type of computation,
water rises from the oceans but is blocked by the terrain as represented
by the model; an example can be seen in Figure 2.
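The idea of water rising from the ocean and being blocked by the terrain can be illustrated with a small in-memory sketch. It is not the I/O-efficient algorithm used in TerraSTREAM; the function name, the 4-neighbor connectivity, and the list of ocean seed cells are assumptions made for illustration. A cell is flooded only if it lies below the flood height h and is connected to the ocean through other cells below h.

```python
from collections import deque

def flood_from_ocean(heights, ocean_cells, h):
    """Return the set of cells below height h that are connected to the ocean.

    heights: 2D list of elevations; ocean_cells: list of (row, col) seeds.
    Uses 4-neighbor connectivity (an assumption of this sketch).
    """
    rows, cols = len(heights), len(heights[0])
    flooded = set()
    queue = deque()
    for r, c in ocean_cells:
        if heights[r][c] < h and (r, c) not in flooded:
            flooded.add((r, c))
            queue.append((r, c))
    # Breadth-first search: water spreads only through cells lower than h.
    while queue:
        r, c = queue.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if (0 <= nr < rows and 0 <= nc < cols
                    and (nr, nc) not in flooded and heights[nr][nc] < h):
                flooded.add((nr, nc))
                queue.append((nr, nc))
    return flooded

# One-row profile: ocean (0 m), a 5 m dike, then low land (1 m) behind it.
profile = [[0.0, 5.0, 1.0, 1.0]]
behind_dike_dry = flood_from_ocean(profile, [(0, 0)], h=2.0)
dike_overtopped = flood_from_ocean(profile, [(0, 0)], h=6.0)
```

With h = 2 the low land behind the dike stays dry even though its elevation is below h; only when h exceeds the dike height does the water reach it.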
It is important to note that both the flood risk estimation and the
flow modeling problems are “non-local” in the sense that a small
local change in topography can affect the result of the computation
for the entire terrain model. It is this property that makes the prob-
lems hard to solve for large data sets, and is also why the quality
and resolution of the elevation model is vital.
Our main study area for this paper is the country of Denmark,
which is one of the growing list of countries that have performed a
nationwide high-resolution LIDAR scanning. Through a company
(COWI A/S) we have received a complete high-resolution LIDAR
point cloud for Denmark. The point cloud is stored in 14063 LAS-
files, each containing the points covering a four square kilometer
area. The files themselves take up about 1.7 terabytes. There are
a total of 25,887,357,931 (nearly 26 billion) bare earth points in
the data set; this is a bit less than one point per square meter for
the entire country. We constructed a triangulation of the entire
Denmark point cloud and used this triangulation to create
a high-resolution country-wide grid model with a 2m horizontal
resolution. We then algorithmically extracted river networks and
computed detailed flood risk assessments. Before computing river
networks we prepared the grid by removing spurious sinks in the
terrain. We also used the data set generated by the Shuttle Radar
Topography Mission (SRTM) . The SRTM data set provides a
90m (at the equator) grid that covers most of the inhabited parts of
the world. Using the same software tools we were able to compute
flood risk mapping and river networks for the entire SRTM grid.
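As a rough sanity check on the figures quoted above, the following back-of-the-envelope sketch derives the point density and storage cost per point from the stated file count, tile size, and point count. It assumes a decimal terabyte (10^12 bytes) and treats every tile as fully covered, which is not true for coastal tiles that include sea, so the density over land is somewhat higher than the value computed here and the grid cell count is an upper-bound estimate.

```python
# Figures quoted in the text.
POINTS = 25_887_357_931      # bare earth points
TILES = 14_063               # LAS files
TILE_AREA_KM2 = 4            # area covered by each file
DATA_BYTES = 1.7e12          # assumption: 1 terabyte = 10**12 bytes

tiled_area_m2 = TILES * TILE_AREA_KM2 * 1_000_000

# Average density over the tiled area (about 0.46 points per square meter).
density = POINTS / tiled_area_m2

# Average storage per point (roughly 66 bytes).
bytes_per_point = DATA_BYTES / POINTS

# Upper bound on the number of cells in a 2 m grid over the tiled area.
grid_cells = tiled_area_m2 // (2 * 2)
```

The result is consistent with the statement that the data set holds somewhat less than one point per square meter, and it shows that the derived 2 m grid has on the order of ten billion cells.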
Figure 3(a,b) shows the result of the flood risk mapping for the
island of “Mandø” in the Wadden Sea off the west coast of Denmark.
This tiny island, with an area of approximately seven square kilometers,
has been hit by storm floods on multiple occasions. As a
result of these incidents, the island has an approximately five meter
tall perimeter dike which protects it from the sea. Due to the small
horizontal extent of the perimeter dike, this feature is not present
in the SRTM data, or in most “mid”-resolution data sets. Thus,
when the flooding computation described above is performed for a
water level of 2 meters, it looks as if most of the island will be under
water. This is visualized in Figure 3(a). The same computation
performed on the 2m-grid, shown in Figure 3(b), correctly finds that
the dikes present in the terrain model block the water from entering
the lower-lying areas inside the perimeter. Examples of this kind
are easy to find, including near major cities, wherever dikes have been
built. The relevance of the flooding simulation as a resource for
an initial flood risk assessment is severely impeded if these dikes and
similar natural features are not present in the elevation model.
As in the flood mapping example, the omission of a relatively
small feature in a low-resolution data set can have a global impact
(a) SRTM grid (b) 2m grid
(c) SRTM grid (d) 2m grid
Figure 3. (a,b) A flood risk mapping of the island of Mandø in Denmark, using the SRTM (a) and high-resolution grid (b). (c,d) A river network
showing cells with an upstream area greater than 50 km² in blue. The first figure (c) shows the river network on the SRTM model, the other (d)
shows the river network computed on the high-resolution grid. All the figures are screenshots of our custom map application building on Google
Figure 4. The modules of TerraSTREAM. The modules marked in
gray are used in this paper to demonstrate the pipeline and are de-
scribed in more detail in Section 4.
on the computed river network. An example of this is provided
in Figure 3(c,d) which shows a section of the north-eastern part of
the main Danish peninsula of Jutland. The networks derived from
the SRTM grid and the high-resolution 2m terrain model diverge
significantly at a point in the terrain. At the point of divergence a
fields to south-west of the divergence and ensures that the excess
water from the fields makes it to the ocean via the river. In the
computation from the SRTM data set the small drainage ditch is
marked as the main river and the remainder of the original river
becomes a small tributary to this river. This means that the entire
watershed upstream of the point of divergence is routed in the wrong
direction, and this makes the result misleading for most of the area
downstream of this point.
components, where each component uses I/O-efficient algorithms,
called TerraSTREAM. Thus, each algorithm in the pipeline scales
to massive data sets. TerraSTREAM consists of a number of dif-
ferent modules which can be divided into two main components:
DEM construction: This component constructs triangular irreg-
ular network (TIN) and grid DEMs from a (LIDAR) point
cloud. It checks the quality of a grid DEM by analyzing the
distribution of the point cloud. Finally it also contains a mod-
ule to construct contour DEMs.
upstream area contributions from which river networks can
be derived. It also partitions the terrain into watersheds and
performs a simple flooding simulation. Finally it also contains
a module to “hydrologically” condition a terrain to make it
more suitable for the flow computations, by removing spuri-
We will discuss in more detail (Section 4) some of the modules
of these two components that are relevant for the results described
here. Figure 4 illustrates the overall structure of the pipeline and the
outputs of its several stages. The individual stages build upon new
I/O-efficient algorithms as well as extensions to a number of pre-
viously developed algorithms. In addition, a considerable amount
of engineering effort is devoted to making TerraSTREAM efficient
and practical. Where it makes sense, the modules of TerraSTREAM
work on both TIN and grid DEMs.
The most important feature of all the modules is their scalabil-
ity. The user supplies an upper bound on the amount of memory
that can be used during the computation and TerraSTREAM works
with the disk in an efficient way to ensure this limit is not violated.
1Project site: http://madalgo.au.dk/Trac-TerraSTREAM
TerraSTREAM can efficiently handle arbitrarily large data sets, re-
gardless of the amount of memory available.
TerraSTREAM has now matured to a point where it is being used
by a large number of users worldwide, many of them with little
background in algorithms or programming. It includes extensions
to some of the most popular commercial GIS software (ArcGIS and
MapInfo), as well as a simple standalone application that does not
require any other GIS software. We refer to [15, 4, 8] for more in-
paper we will only briefly describe the actual stages of the pipeline
used in the experiments.
In this section we begin by giving a brief overview of the DEM con-
struction component and then discuss the hydrology component in
more detail. High-resolution data and the scalability of our system
are crucial for the hydrology component.
cell size from a point cloud is to compute the height value at each
grid point by using an interpolation or approximation scheme (refer
to e.g.,  and the references therein). This method is, however,
impractical even for a point cloud of a few thousand points because
of the computational complexity of solving large systems of lin-
ear equations. Using I/O-efficient techniques , TerraSTREAM
breaks the large systems into smaller pieces and extends many dif-
ferent interpolation routines to these smaller pieces.
The benefits of sophisticated interpolation schemes (e.g. )
for high-resolution data sets are often outweighed by their high
time complexity. TerraSTREAM therefore also provides a triangulation-based
linear interpolation scheme.
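To make the triangulation-based linear interpolation concrete, the sketch below evaluates the height at a grid point that falls inside one triangle of a TIN, using barycentric coordinates. It only covers the per-triangle step; locating the containing triangle and doing so I/O-efficiently are outside its scope. The function name and the small tolerance for points on an edge are assumptions made for illustration.

```python
def interpolate_in_triangle(p, a, b, c, eps=1e-12):
    """Linearly interpolate the height at p = (x, y) inside triangle a, b, c.

    a, b, c are (x, y, z) vertices. Returns the interpolated z, or None
    if p lies outside the triangle.
    """
    (ax, ay, az), (bx, by, bz), (cx, cy, cz) = a, b, c
    px, py = p
    det = (by - cy) * (ax - cx) + (cx - bx) * (ay - cy)
    if det == 0:
        raise ValueError("degenerate triangle")
    # Barycentric weights of p with respect to a, b and c.
    wa = ((by - cy) * (px - cx) + (cx - bx) * (py - cy)) / det
    wb = ((cy - ay) * (px - cx) + (ax - cx) * (py - cy)) / det
    wc = 1.0 - wa - wb
    if min(wa, wb, wc) < -eps:
        return None  # p is outside the triangle
    return wa * az + wb * bz + wc * cz

# Triangle lying on the plane z = x + 2y.
tri = ((0.0, 0.0, 0.0), (2.0, 0.0, 2.0), (0.0, 2.0, 4.0))
z_inside = interpolate_in_triangle((1.0, 0.5), *tri)
z_outside = interpolate_in_triangle((3.0, 3.0), *tri)
```

Because the weights are linear in p, the interpolated surface is exactly the plane through the three vertices, which is why this scheme is cheap compared with methods that solve large linear systems.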
Most flow modeling algorithms assume water flows downhill until
it reaches a local minimum or sink. In practice, however, local minima
in DEMs fall into two primary categories: significant and insignificant,
or spurious, sinks. Significant sinks correspond to large
real geographic features such as quarries, sinkholes or large natu-
ral closed basins with no drainage outlet. The insignificant sinks
may be due to noise in the input data or correspond to small natu-
ral features that flood easily. When modeling water flow these in-
significant sinks impede flow and result in artificially disconnected
hydrological networks. We “hydrologically condition” the DEM to
solve this problem, by removing insignificant sinks, while preserv-
ing significant sinks.
Many popular hydrological conditioning algorithms remove all
sinks using a so-called flooding approach, which simulates the pro-
cess of uniformly pouring water on the terrain until a steady-state
is reached. A weakness of this approach is that it removes even
significant sinks. See Figures 5(a) and (b). We use a partial flood-
ing algorithm, based on topological persistence [7, 6], that detects
and removes only insignificant sinks, as indicated in Figure 5(c).
This leads to a more realistic flow network. We briefly describe
topological persistence and then present our algorithm.
In the context of a terrain T repre-
sented by a grid, topological persistence [6, 7] matches each local
minimum (sink) cell v of T to a higher “saddle” cell w (see  for
the precise definition of a saddle) and assigns a persistence value,
denoted by π(v), to v; π(v) is defined to be the difference in the
(a) (b) (c)
Figure 5. (a) Original terrain. (b) Terrain flooded with τ = ∞. (c) Terrain partially flooded with persistence threshold τ = 10.
heights of v and w, i.e., π(v) = h(w) − h(v) . The persistence
π(v) denotes the significance of the sink v. Intuitively, the saddle
cell w is a cell at which two distinct connected components of the
portion of T lying strictly below w merge. Each connected compo-
nent is represented by the lowest vertex in the component. Suppose
v is the highest representative of the two connected components
merged by w and let u denote the representative of the other com-
Partial flooding. We use topological persistence as a measure
of the significance of a sink. Given a user-specified threshold τ, we
declare all sinks with persistence less than τ to be the insignificant
sinks and remove all such sinks using a partial flooding method
described below. The user can change the threshold to control the
smallest feature size to be preserved.
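The persistence values used for this thresholding can be computed, for a grid that fits in memory, with a standard union-find sweep over the cells in order of increasing height. The sketch below is one such in-memory formulation, not the I/O-efficient algorithm of TerraSTREAM; the function name and the 4-neighbor connectivity are assumptions. When a cell joins two components, the component whose lowest cell is higher is merged away and that lowest cell receives persistence π(v) = h(w) − h(v).

```python
def sink_persistence(heights):
    """Map each sink (local minimum) to its persistence.

    Cells are swept by increasing height; ties are broken by position,
    so flat areas may produce extra sinks with zero persistence.
    """
    rows, cols = len(heights), len(heights[0])
    order = sorted((heights[r][c], r, c)
                   for r in range(rows) for c in range(cols))
    parent = {}       # union-find parent pointers
    lowest = {}       # component root -> lowest cell of the component
    persistence = {}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for h, r, c in order:
        cell = (r, c)
        parent[cell] = cell
        lowest[cell] = cell
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nb = (r + dr, c + dc)
            if nb not in parent:
                continue  # outside the grid or not yet swept (higher)
            ra, rb = find(cell), find(nb)
            if ra == rb:
                continue
            sa, sb = lowest[ra], lowest[rb]
            # Arrange that sa is the lower representative and survives.
            if (heights[sa[0]][sa[1]], sa) > (heights[sb[0]][sb[1]], sb):
                ra, rb, sa, sb = rb, ra, sb, sa
            if sb != cell:
                # Two existing components merge at `cell`, the saddle.
                persistence[sb] = h - heights[sb[0]][sb[1]]
            parent[rb] = ra
    # The global minimum is never merged away.
    _, r0, c0 = order[0]
    persistence[(r0, c0)] = float("inf")
    return persistence

profile = [[1, 4, 2, 6, 0]]
pers = sink_persistence(profile)
tau = 3
insignificant = {v for v, p in pers.items() if p < tau}
```

On this one-row profile the sink of height 2 is separated from its lower neighbor by a saddle of height 4, so its persistence is 2 and it is declared insignificant for τ = 3.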
Let ζ1, ..., ζk be the significant sinks in T. We construct a graph
on T by connecting each grid vertex to its neighbors. Let the height
of a path in this graph be the height of the highest vertex on the
path, and let the raise elevation of a vertex v be the minimum height
of all paths from v to ζi for any 1 ≤ i ≤ k. In partial flooding,
we change the height of each vertex to its raise elevation. Partial
flooding produces a terrain containing only significant sinks whose
persistence value is greater than τ. Note that if τ = ∞, partial
flooding is the same as the original definition of flooding. Thus,
partial flooding is a tunable way to condition the terrain for the
purpose of flow modeling.
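The raise elevation defined above is a minimax path value, so for a grid that fits in memory it can be computed with a Dijkstra-style search started simultaneously from all significant sinks, where the cost of a path is the height of its highest vertex. The following is a minimal sketch under that assumption; the function name and the 4-neighbor graph are illustrative choices, and the search is not I/O-efficient.

```python
import heapq

def partial_flood(heights, significant_sinks):
    """Return the grid in which every cell is set to its raise elevation.

    significant_sinks: list of (row, col) cells to be preserved.
    """
    rows, cols = len(heights), len(heights[0])
    raise_elev = [[float("inf")] * cols for _ in range(rows)]
    heap = []
    for r, c in significant_sinks:
        raise_elev[r][c] = heights[r][c]
        heapq.heappush(heap, (heights[r][c], r, c))
    while heap:
        level, r, c = heapq.heappop(heap)
        if level > raise_elev[r][c]:
            continue  # stale queue entry
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                # Height of the best path to a sink that passes through (r, c).
                cand = max(level, heights[nr][nc])
                if cand < raise_elev[nr][nc]:
                    raise_elev[nr][nc] = cand
                    heapq.heappush(heap, (cand, nr, nc))
    return raise_elev

# Sinks of height 1 and 0 are significant; the sink of height 2 is not.
conditioned = partial_flood([[1, 4, 2, 6, 0]], [(0, 0), (0, 4)])
```

In the example the insignificant sink of height 2 is raised to 4, the height of the lowest saddle leading to a significant sink, while all other cells keep their original heights.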
We also have the ability to compute the volume and area of the
sinks in the terrain . This allows for more flexibility in defining
the significance of a sink so that e.g. sinks with high volume but
low persistence (“height”) are not marked as insignificant.
In the third stage of our experiment we model the flow of water
on the hydrologically conditioned DEM. It consists of two phases.
In the first, flow-routing phase, we compute a flow direction for each
cell in the grid that intuitively indicates the direction water will
flow on that particular cell. In the second flow-accumulation phase,
we intuitively compute the area of all the cells upstream of each
cell. These flow accumulations can then be used to derive a river network.
Flow routing. The flow-routing phase computes for each cell,
c, a list of those of its neighbors that water reaching c can flow
to. These lists are constructed by looking at c and its neighbors in
the grid and then applying a flow-direction model. We have imple-
mented two popular flow-direction models:
• Single-flow-direction (SFD) model: The water for each cell
flows to the steepest-descent down-slope neighbor.
• Multi-flow-directions (MFD) model: The water for each cell
flows to all down-slope neighbors.
Several other flow-direction models have also been proposed (e.g.,
), and most of them can be incorporated in TerraSTREAM.
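The two flow-direction models can be sketched for a single cell as follows. The sketch assumes an 8-neighbor grid with unit cell size, so that the slope to a diagonal neighbor is the height drop divided by the square root of two; the function and constant names are hypothetical. Cells without a down-slope neighbor (sinks and flat areas) get an empty list and need separate treatment.

```python
import math

NEIGHBORS = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
             (0, 1), (1, -1), (1, 0), (1, 1)]

def flow_directions(heights, r, c, model="SFD"):
    """Return the neighbors of cell (r, c) that receive its water.

    model="SFD": only the steepest-descent down-slope neighbor.
    model="MFD": all down-slope neighbors.
    """
    rows, cols = len(heights), len(heights[0])
    h = heights[r][c]
    downslope = []
    for dr, dc in NEIGHBORS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and heights[nr][nc] < h:
            slope = (h - heights[nr][nc]) / math.hypot(dr, dc)
            downslope.append((slope, (nr, nc)))
    if not downslope:
        return []
    if model == "MFD":
        return [cell for _, cell in downslope]
    return [max(downslope)[1]]  # steepest descent

grid = [[9, 8, 7],
        [8, 5, 3],
        [7, 4, 1]]
sfd = flow_directions(grid, 1, 1, "SFD")
mfd = flow_directions(grid, 1, 1, "MFD")
```

For the center cell the diagonal neighbor of height 1 has the steepest slope, so SFD selects it alone, whereas MFD returns all three lower neighbors.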
Flat areas. The traditional flow-direction models are only de-
fined for cells with at least one down-slope neighbor. However,
realistic terrains can have large flat areas of same-height cells. Flat
areas can be natural plateaus in the terrain, or they can appear as
by-products of the hydrological conditioning stage. If a cell on a
flat area has a down-slope neighbor it is called a spill point. When
at least one spill point and those that do not; in the first case water
should be able to flow out of the flat area through the spill points,
while in the second case water is simply absorbed into the extended
In many early flat area routing approaches flow directions were
assigned in a simple way such that each cell v was assigned a flow
direction to the adjacent neighbor that was along the shortest Eu-
clidean path along grid edges from v to the closest spill point.
However, these approaches are not hydrologically realistic and tend
to create many parallel flow lines . TerraSTREAM circum-
vents this problem by using a more sophisticated method that uses
geodesic distance on T, based on the approach described in .
been assigned flow directions, the flow accumulation  phase
computes the amount of water that reaches each grid cell of T.
More precisely, each cell c is assigned some initial flow; c then re-
ceives incoming flow from up-slope neighbors and distributes all
incoming and initial flow to one or more of its down-slope neigh-
bors. The flow accumulation of c is the sum of its initial flow and
incoming flow from up-slope neighbors. Given the flow accumu-
lations for all vertices, we can extract river networks  simply
by extracting edges incident to vertices whose flow accumulation
exceeds a given threshold.
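Flow accumulation can be sketched in memory by visiting the cells in order of decreasing height, so that every cell has received all of its incoming flow before it passes its own total downstream. The sketch below assumes single-flow-direction routing on an 8-neighbor grid with unit cell size and an initial flow of one unit per cell; these choices and the function names are illustrative. It also returns river cells rather than river edges, which is a simplification of the extraction step described above.

```python
import math

def flow_accumulation(heights, initial_flow=1.0):
    """Return the flow accumulation of every cell under SFD routing."""
    rows, cols = len(heights), len(heights[0])
    acc = [[initial_flow] * cols for _ in range(rows)]
    cells = sorted(((heights[r][c], r, c)
                    for r in range(rows) for c in range(cols)), reverse=True)
    for h, r, c in cells:
        best, target = 0.0, None
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if (dr, dc) == (0, 0) or not (0 <= nr < rows and 0 <= nc < cols):
                    continue
                slope = (h - heights[nr][nc]) / math.hypot(dr, dc)
                if slope > best:
                    best, target = slope, (nr, nc)
        if target is not None:
            # Pass initial plus incoming flow to the steepest neighbor.
            acc[target[0]][target[1]] += acc[r][c]
    return acc

def river_cells(acc, threshold):
    """Cells whose flow accumulation exceeds the threshold."""
    return {(r, c) for r, row in enumerate(acc)
            for c, value in enumerate(row) if value > threshold}

grid = [[9, 8, 7],
        [8, 5, 3],
        [7, 4, 1]]
acc = flow_accumulation(grid)
rivers = river_cells(acc, threshold=2.5)
```

With unit initial flow the accumulation of a cell equals the number of cells draining through it, so the outlet in the lower right corner collects the flow of the whole grid.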
As discussed in Section 2, computing highly accurate flood risk
information is a very complex task and involves many types of
fine-grained information about the study area. However, with mod-
ern high-resolution DEMs that contain small but important features
(e.g. dikes), it is possible to get good initial flood risk estimates us-
and the groundwater.
Given a flood height h, a simple way of computing the flooded
area is to mark all vertices of elevation less than h as flooded. This
naive approach is problematic because it ignores the effect of dikes