A columnar architecture for modern risk management systems

Romulo Goncalves¹, Sisi Zlatanova², Kostis Kyzirakos³, Pirouz Nourian², Foteini Alvanaki³, Willem van Hage¹
¹ Netherlands eScience Center, The Netherlands
{r.goncalves,w.vanhage}@esciencecenter.nl
² TU Delft, The Netherlands
{s.zlatanova, p.nourian}@tudelft.nl
³ CWI, The Netherlands
{kostis.kyzirakos,f.alvanaki}@cwi.nl
Abstract: 3D digital city models form the basis for flow simulations
(e.g. wind flow and water runoff), urban planning, and under- and
over-ground formation analysis, and they are very important for
automated anomaly detection on man-made structures. They consist of
large collections of semantically rich objects which have many
properties such as material and color. This user-oriented view of the
data structure leads to complex storage schemas. The drawbacks of a
large number of table relations to manage and a large data storage
footprint are compounded by the fact that not all systems have a
"real" 3D data type.
In this work we present our efforts to develop a new kind of Spatial
Database Management System (SDBMS) in which topological and geometric
functionality for 3D raster manipulation becomes part of the
relational kernel and not an add-on. With it, spatial analysis
tailored to different use case scenarios is done on demand and fast
enough to support real-time interaction in modern risk management
systems.
I. INTRODUCTION
Digital 3D city models play a crucial role in research on urban
phenomena; they form the basis for flow simulations (e.g. wind
streams and water runoff) and for the analysis of underground
formations and man-made structures, which provides crucial
information for effective risk management systems.
An urban scene, represented as a 3D city model, consists of large
collections of semantically rich objects which have a number of
properties such as use, function, and year. They are commonly
reconstructed by segmenting and triangulating a point cloud, thereby
creating a surface representation. Representing urban objects (e.g.
buildings, roads, trees, etc.) as surfaces has drawbacks: calculating
intersections and volumes and creating cross-sections are complex
operations. Furthermore, modeling volumetric objects, such as walls,
water, and the underground, requires the deployment of complex
shapes [18].
This user-oriented view of the data structure leads to complex
storage schemes. The storage scheme designed for systems like Oracle
Spatial, GRASS, and PostGIS has limitations, such as the need to
manage many tables when the selection predicate is on the 3D city
model semantics. The drawbacks of the number of table relations to
manage and the large data storage footprint are compounded by the
fact that not all systems have a "real" 3D data type. PostGIS, widely
adopted in eScience projects, is a clear example.
We have tackled all these issues by re-designing the conceptual model
and the storage model. For the conceptual model we have adopted a
voxel-based city model, a path considered novel and promising [18].
Voxels are the volumetric counterpart of pixels. Alongside a length
and a width, voxels also have a height, thereby forming a cube in 3D
space. Voxel storage offers a number of interesting simplifications
and use cases, but also challenges. One of the major challenges is
the storage and efficient handling of voxels by Spatial Database
Management Systems (SDBMSs). With models at different semantic levels
of detail (e.g., LOD in CityGML [16]) and coverage of inside and
outside empty spaces, the voxelization of an entire city generates a
massive 3D grid of voxels at different resolutions with a large
number of semantic attributes attached [18].
Clearly, a dense flat relational table is not ideal for storing such
a massive 3D grid. The holy grail is an architecture which allows
effective compression to reduce the storage footprint, and efficient
data retrieval to access only the attributes of interest at a
specific resolution. These key features are what distinguish a
column-oriented architecture from a record-oriented architecture and
the reason for its efficiency on analytic workloads [5].
Although column-oriented architectures emerge as the right candidate,
and despite the efforts to extend them for spatiotemporal analysis
over large data sets [8], [6], [12], [13], their flat storage model
is not yet suitable for storing a large 3D city model. We therefore
extended a column-store to also support a nested column-oriented
storage for 3D city models. The chosen format is Parquet [1]. It is
an effective storage model for sparse data sets with a nested
structure (the different LODs). Its flat columnar format fits the
column-oriented programming model well.
With our contribution, spatial analysis tailored to different use
case scenarios is done on demand and fast enough to be used by modern
risk management systems. The adopted storage model, Parquet [1], also
opens the door to state-of-the-art processing technologies, such as
Spark, to scale out to country size. Furthermore, the simplicity of
the conceptual model gives the opportunity to use interactive
front-ends borrowed from gaming for real-time interaction with the
surroundings.
The remainder of the paper is organized as follows. Section II
describes the storage strategies and their challenges. Section III
presents the general architecture. Section IV shows the steps already
taken to put the vision into action. The article ends with future
plans in Section V and a summary in Section VI.
II. BACKGROUND
In this section we give a top-down description of our solution, i.e.,
from the conceptual model to the storage model. For the conceptual
model we first identify its advantages, followed by the challenges in
supporting it on current SDBMSs. For the storage model we describe
the challenges in mapping a voxel-based conceptual model into a flat
and nested column-oriented storage.
A. Voxels
Our world can be represented in voxels by gridding the 3D
space and specifying what each cell represents by semantically
"attaching" every cell/voxel to a real world object. Storing
volumetric spaces such as air, water and underground is
possible.
Every object is defined by a set of voxels, with the size of the set
depending on the level of detail (LOD). The base storage unit is a 3D
voxel of a certain size, and each voxel's characteristics, e.g. type
(wall, glass, roof, door, etc.), color, density, etc., are stored as
semantic properties. Such data type atomicity avoids the use of a set
of multiple geometries, the approach currently used in other spatial
RDBMSs to store 3D city models [18].
Representing real world objects by a single geometry type (3D cube)
instead of a collection of polygons/polyhedra greatly simplifies a
range of geometric operations: volumes and areas are calculated by
simply counting the number of voxels that form an object; 3D
bisections become simple selection operations; and dynamic
Levels-of-Detail (LOD) become possible, as objects can be resampled
with larger cubes [18].
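The three simplifications listed above can be illustrated with a minimal sketch. It assumes that an object is held as a set of integer voxel indices (x, y, z) with a uniform edge length; the function names and the example block are hypothetical and are not taken from the system described here.

```python
def volume(voxels, edge):
    """Volume of an object: number of voxels times the volume of one cube."""
    return len(voxels) * edge ** 3


def bisect(voxels, axis, threshold):
    """A 3D bisection is a plain selection on one index coordinate."""
    return {v for v in voxels if v[axis] < threshold}


def resample(voxels, factor):
    """Coarser LOD: every factor^3 block of child cubes maps to one parent cube."""
    return {(x // factor, y // factor, z // factor) for x, y, z in voxels}


# Hypothetical object: a 4 x 4 x 2 block of voxels with 0.5 m edges.
block = {(x, y, z) for x in range(4) for y in range(4) for z in range(2)}

total_volume = volume(block, 0.5)   # 32 cubes of 0.125 m^3 each
lower_half = bisect(block, 0, 2)    # keep voxels with x index below 2
coarse = resample(block, 2)         # same object described with 1.0 m cubes
```

Because the block is completely filled, the coarse model with 1.0 m cubes has the same volume as the fine one; for partially filled parents the coarse volume would overestimate it.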
B. Storage challenges
For the storage and indexing of 3D voxels linked with properties,
such as voxels created to simplify a point cloud, two approaches can
be considered: a homogeneous voxel grid or a heterogeneous voxel
collection. The former allows for factorization of invariant
properties from the data structures, while the latter is better
suited to sparse models such as a 3D city model with different LODs.
A homogeneous voxel grid is easy to define using a flat relational
schema, i.e., real-world objects are formed by semantically grouping
voxels together via foreign key relations and relational views.
Schema normalization is used to reduce the storage footprint at the
cost of expensive spatial joins. The storage footprint of the
normalized schema depends on the size of each voxel. Hence, efficient
data access becomes dependent on efficient column compression
techniques and effective storage of geometric empty spaces. The
latter is very important because it strongly affects the data set
size. If empty spaces were also materialized in the storage scheme,
storing the whole of the Netherlands as, e.g., 10 cm blocks would
result in many petabytes of data.
Fig. 1: "Record-wise versus columnar representation of nested data" [14]
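The order of magnitude of a fully materialized grid can be checked with a back-of-the-envelope calculation. The sketch below assumes an area of roughly 41,500 km² for the Netherlands, a vertical extent of 100 m, and a single byte per voxel; the vertical extent and the byte count are illustrative assumptions and not figures from this work.

```python
def dense_voxel_count(area_km2, height_m, cells_per_metre):
    """Number of cubes in a dense grid covering area x height."""
    area_m2 = area_km2 * 1_000_000
    return area_m2 * height_m * cells_per_metre ** 3


# Assumed inputs: ~41,500 km^2, 100 m vertical extent, 10 cm cubes.
voxels = dense_voxel_count(41_500, 100, 10)

# Assumed payload: one byte per voxel, before any semantic attribute.
petabytes = voxels / 10 ** 15
```

Under these assumptions the grid already needs about four petabytes for a single one-byte attribute, and each extra attribute adds to that, which is why empty space should not be materialized.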
A heterogeneous voxel grid poses extra challenges compared to a
homogeneous voxel grid because the geometry semantics must be
preserved when converting vector to raster data. An object's
semantics depend on the semantic level of detail (LOD) [18]. For
example, at LOD1 buildings are simply buildings; at LOD2 the
semantics are extended with ground surface, wall surface, and roof
surface; LOD3 adds window and door to LOD2; and LOD4 adds room,
ceiling surface, interior wall surface, floor surface, closure
surface, door, window, building furniture, and building installation.
Hence, depending on the LOD, a voxel can have different semantic
tags, e.g. (building, roof), (building, wall), (building, wall,
window), etc. The LOD has a clear nested data organization and a
sparse structure because of the inside and outside empty spaces.
Furthermore, not all the levels in the nested structure are defined,
due to incompleteness or absence of vector information for a specific
LOD.
C. Nested column-oriented storage
For efficient storage and data retrieval at different resolu-
tions we embraced a column-oriented format for voxel-based
3D city models. Columnar formats have several advantages.
Organization by column allows better compression, as data is
more homogeneous. For large data sets the I/O is improved
since it is possible to efficiently scan a subset of the columns
while reading the data. Of course, better compression also
reduces the bandwidth to read input data [4]. By storing
together values of the same primitive type, a columnar format
provides more efficient encoding and decoding.
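Why homogeneous columns compress well can be seen with run-length encoding, one common columnar technique. The text above does not say which encodings are used, so the sketch below is only an illustration; the column of semantic types is made-up data.

```python
from itertools import groupby


def rle_encode(column):
    """Collapse each run of equal values into a (value, run_length) pair."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(column)]


def rle_decode(pairs):
    """Expand (value, run_length) pairs back into the original column."""
    return [value for value, length in pairs for _ in range(length)]


# Hypothetical column holding the semantic type of nine neighbouring voxels.
semantic_type = ["wall"] * 4 + ["roof"] * 3 + ["wall"] * 2
encoded = rle_encode(semantic_type)
```

Nine values shrink to three pairs here; in a record-oriented layout the same values would be interleaved with coordinates and other attributes, and the runs would be lost.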
To store nested data structures in a flat columnar format, the schema
is mapped to a list of columns in such a way that records can be
written and read back to their original nested data structure
efficiently. Figure 1 illustrates the record-wise versus columnar
representation of nested data. In the columnar representation all the
values of a nested field are stored contiguously. For example, A.B.C
can be retrieved without reading A.E, A.B.D, etc. [14].
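The mapping from nested records to one column per leaf path can be sketched as follows. The example reuses the field names A.B.C, A.B.D and A.E from the text; the record values are made up, and the sketch deliberately ignores optional and repeated fields, which need the repetition and definition levels described in the next subsection.

```python
def stripe(records):
    """Split nested dict records into one value list per leaf path."""
    columns = {}

    def walk(node, path):
        for key, value in node.items():
            if isinstance(value, dict):
                walk(value, path + (key,))
            else:
                columns.setdefault(".".join(path + (key,)), []).append(value)

    for record in records:
        walk(record, ())
    return columns


# Two hypothetical records with the same fully defined structure.
records = [
    {"A": {"B": {"C": 1, "D": "x"}, "E": 10}},
    {"A": {"B": {"C": 2, "D": "y"}, "E": 20}},
]
columns = stripe(records)
abc = columns["A.B.C"]  # read without touching A.B.D or A.E
```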
D. Parquet
In our work we use Parquet [1], a well-known format from the Hadoop
ecosystem. It stores nested data structures in a flat columnar format
using a technique outlined in the Dremel paper from Google [14]. The
Parquet file layout is represented in Figure 2. Its internal
structure is designed for efficient data skipping during query
processing. The metadata stored at the file level and in the
column-page headers allows a kernel to skip data blocks during
predicate evaluation and to evaluate predicates lazily over
compressed data. At the same time, this metadata is used in our
in-situ data access strategy, which is explained in Section III-B, to
reduce the amount of data imported during query execution.
Parquet captures the record structure of each value through two
integers called repetition level and definition level. During query
processing they are used to fully reconstruct the nested structure¹.
Fig. 2: Parquet file layout
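Data skipping based on per-block metadata can be sketched as below. The sketch keeps a (min, max) pair per block, which is the kind of statistic a columnar file can store next to its pages; it is a plain-Python illustration, not the Parquet API and not the kernel described here, and the block size and height values are made up.

```python
def build_block_stats(values, block_size):
    """Split a column into blocks and keep (min, max) per block as metadata."""
    blocks = [values[i:i + block_size] for i in range(0, len(values), block_size)]
    stats = [(min(block), max(block)) for block in blocks]
    return blocks, stats


def select_range(blocks, stats, low, high):
    """Return values in [low, high], reading only blocks whose stats overlap."""
    hits, blocks_read = [], 0
    for block, (block_min, block_max) in zip(blocks, stats):
        if block_max < low or block_min > high:
            continue  # skipped using the metadata alone
        blocks_read += 1
        hits.extend(v for v in block if low <= v <= high)
    return hits, blocks_read


# Hypothetical column of voxel heights, stored in blocks of four values.
heights = [1, 2, 3, 4, 10, 11, 12, 13, 20, 21, 22, 23]
blocks, stats = build_block_stats(heights, 4)
hits, blocks_read = select_range(blocks, stats, 10, 12)
```

Only one of the three blocks is read for this predicate; the benefit depends on how well the values are clustered within blocks.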
Definition level. It records the level at which a field is NULL, from
0 at the root of the schema up to the maximum level for the column.
When a field is defined, all its parents are also defined. The
definition level thus records at which level the path started being
null.
Repetition level. It defines when a new list starts in a column of
values. It marks the level at which a new list has to be created for
the current value.
Storing definition levels and repetition levels efficiently. For each
primitive type it is necessary to store three sub-columns: the
repetition levels, the definition levels, and the values. Due to the
columnar representation the storage overhead is low. The depth of the
schema defines the number of levels. For instance, 3 bits are enough
to store up to 7 levels of nesting. Required fields do not need a
definition level, and fields that are not repeated do not need a
repetition level.
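The two kinds of level can be made concrete with a small sketch. It assumes a chain of three optional fields (a.b.c) for the definition level and a two-level nested list for the repetition level; the field names and data are hypothetical, and empty lists, which additionally need a definition level, are left out.

```python
def definition_level(record, path):
    """Count how many optional fields along `path` are defined (not None)."""
    level, node = 0, record
    for name in path:
        if not isinstance(node, dict) or node.get(name) is None:
            break
        node = node[name]
        level += 1
    return level


def repetition_levels(records):
    """Emit (level, value) pairs for records shaped as lists of lists."""
    out = []
    for record in records:
        first_in_record = True
        for inner in record:
            first_in_inner = True
            for value in inner:
                if first_in_record:
                    level = 0   # new record
                elif first_in_inner:
                    level = 1   # new inner list in the same record
                else:
                    level = 2   # next value in the same inner list
                out.append((level, value))
                first_in_record = False
                first_in_inner = False
    return out


def bits_needed(max_level):
    """Bits required to store any level between 0 and max_level."""
    return max_level.bit_length()


path = ["a", "b", "c"]
samples = [{}, {"a": {}}, {"a": {"b": {}}}, {"a": {"b": {"c": "x"}}}]
def_levels = [definition_level(r, path) for r in samples]

nested = [[["a", "b", "c"], ["d", "e", "f", "g"]], [["h"], ["i", "j"]]]
rep_levels = [level for level, _ in repetition_levels(nested)]
```

Here `bits_needed(7)` is 3, matching the statement that 3 bits cover 7 levels of nesting, and `bits_needed(0)` is 0, matching the remark that required or non-repeated fields need no level at all.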
Figure 3 represents the nested structure for a voxel-based city
model. The LOD is used for the definition level. It is assumed that
if an object has LOD2 semantics, it also has LOD1 semantics, i.e.,
all the voxels inherit the semantics from the parent. The repetition
level is the number of sub-divisions a parent voxel has. As an
example, an object is semantically identified as a building in LOD1,
while in LOD2 it might be composed of a set of sub-voxels that define
walls, floor surface, etc.
Fig. 3: LOD in Parquet
¹ For a detailed explanation with examples we recommend reading:
https://blog.twitter.com/2013/dremel-made-simple-with-parquet
III. A 3D RASTER SDBMS
A voxel-based 3D city model is best managed in a spatial DBMS, as
each voxel has a semantic relation to a real world object and various
attributes (e.g. color, material, porosity, reflection properties,
etc.). Furthermore, a single spatial DBMS offers all functionality in
one place and avoids the need for multiple software tools with the
associated high-volume data transfers and format transformations.
During the last decade many DBMSs have been successfully extended
with support for spatial and geo-spatial applications. For instance,
the OGC implementation specification, which defines basic geometry
types like points and polygons, is followed in PostGIS, Oracle,
MySQL, Microsoft SQL Server, and MonetDB. These systems implement it
through user-defined functions (UDFs), augmented in some cases with
spatial search accelerators. However, contemporary DBMSs still lack
the advanced functionality and efficient implementations needed for
the analysis of voxel-based models.
One might argue that Oracle Spatial and Graph 12c and PostgreSQL 9.2
are developing extensions to support 3D geometries. Among GIS
packages, only GRASS has support for voxels, but it still stores them
as flat files. These systems are still in their infancy and offer
limited functionality. Due to the complexity of their software stack,
deep integration with the database engine is even further away.
A. Column-oriented architecture
For our work we have extended a modern column-store, MonetDB [9],
which steps away from traditional SDBMSs, all of which are
record-oriented architectures. Through vertical partitioning of
relational tables, column-stores significantly reduce data access. In
our case, vertical partitioning is exploited to reduce the number of
columns to be imported, as explained in Section III-B. Such data
organization improves data compression, simplifies data skipping
strategies, and is well suited to vector processing [4].
Through the works [8], [6], [12], [13], MonetDB's spatial features
have matured to provide core technology components for geo-spatial
big data analytics. Atomic spatial types and their operations are
becoming part of the relational kernel and not an add-on. All the
operations are available to spatial applications through integrated
environments, such as R and Python, and a SQL front-end.
Fig. 4: MonetDB’s spatially enabled architecture
Currently the system is equipped with SQL primitives for building
complex spatial analysis pipelines over 3D point clouds, more
concretely: geometry-based selections (rectangular query window, 3D
bounding box), attribute-based selections (point intensity, RGB,
multi-spectral properties), conversion to a Triangulated Irregular
Network (TIN), and triangulation using constrained Delaunay. It also
provides the option to export the results into a pre-defined format,
such as X3D, GeoJSON, and LAS/LAZ, to be loaded into visualization
tools.
In the context of 3D city models, MonetDB is currently being extended
with SQL operations to manipulate voxel attributes: 3D selections
(contains, within, intersects); re-gridding of homogeneous voxel
grids; semantic categorization; volume-based aggregation; and also
rendering for interactive visualization tools such as Cubiquity [2]
(more details in Section V-B).
Our architecture, represented in Figure 4, is an attempt to bring
different kinds of spatial data, such as point clouds, vector data,
and semantically enriched 3D rasters, under the same storage. It
lays the groundwork for direct, on-the-fly conversion to a data type
tailored to the type of user interaction.
B. Dynamic data access
The need for large area coverage and up-to-date information to
support near real-time decisions was the reason for us to explore
in-situ data access, i.e., data is kept in its original format while
scalable and distributed processing functionality is offered through
a DBMS.
Our work adopts the strategy defined in [8], where the authors
presented a solution for in-situ data access to large NetCDF data
repositories. That work builds on previous work called data
vaults [10]. In this article we have extended it to support the
Shapefile, Parquet, and LAS/LAZ file formats.
The in-situ data access is possible due to the large amount of
metadata (data about data) present in file formats such as Parquet
and LAS/LAZ. Such metadata is used for effective data skipping, but
also to collect data insights, e.g. summaries and samples, without
having to process the entire data set.
The dynamic data loading comprises three phases: the attachment of a
file, the import of the file's content, and the collection of
statistics to boost query optimization. During the attachment, the
file's metadata is loaded into a special DBMS catalog. At query time,
this catalog is inspected to decide whether the file has information
relevant to the query. Only in that case is the file's content
imported into the database.
The data import happens in one of two ways. If the file format stores
each attribute sequentially, the import memory-maps each attribute as
a column; otherwise, the data is converted and loaded into the
database as temporary data. In the latter case, cache policies, such
as Least Recently Used (LRU), are used for data eviction.
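The attachment and import phases, together with LRU eviction, can be sketched as follows. This is a toy illustration under stated assumptions: the class name `DataVault`, the `loader` callback, the bounding-box metadata, and the file names are all hypothetical, and the third phase (statistics collection) is omitted.

```python
from collections import OrderedDict


class DataVault:
    """Toy catalog of attached files with lazy import and LRU eviction."""

    def __init__(self, loader, cache_size=2):
        self.loader = loader        # hypothetical callback: file name -> rows
        self.catalog = {}           # file name -> bounding box metadata
        self.cache = OrderedDict()  # imported contents, least recent first
        self.cache_size = cache_size

    def attach(self, name, bbox):
        """Attachment: register only the metadata (xmin, ymin, xmax, ymax)."""
        self.catalog[name] = bbox

    def _import(self, name):
        """Import on first use; evict the least recently used file if full."""
        if name in self.cache:
            self.cache.move_to_end(name)
            return self.cache[name]
        rows = self.loader(name)
        self.cache[name] = rows
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)
        return rows

    def query(self, window):
        """Select points in the window, importing only overlapping files."""
        qx0, qy0, qx1, qy1 = window
        result = []
        for name, (x0, y0, x1, y1) in self.catalog.items():
            if x1 < qx0 or x0 > qx1 or y1 < qy0 or y0 > qy1:
                continue  # the catalog shows the file is irrelevant
            rows = self._import(name)
            result.extend((x, y) for x, y in rows
                          if qx0 <= x <= qx1 and qy0 <= y <= qy1)
        return result


# Hypothetical repository of two files and a loader that records its calls.
files = {"a.laz": [(1, 1), (2, 2)], "b.laz": [(11, 11)]}
loaded = []


def loader(name):
    loaded.append(name)
    return files[name]


vault = DataVault(loader)
vault.attach("a.laz", (0, 0, 5, 5))
vault.attach("b.laz", (10, 10, 15, 15))
points = vault.query((0, 0, 3, 3))
```

After the query only `a.laz` has been imported; `b.laz` stays on disk because its bounding box does not overlap the query window.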
IV. VISION IN ACTION
Our 3D raster SDBMS emerges from efforts to provide a scalable and
generic solution for eScience projects with spatio-temporal data
analysis. It is ongoing work that builds on [8], [6], [12], [13]. In
this section we summarize the steps taken towards a fully functional
solution. The complete system evaluation is out of the scope of this
article. An extended version with such an evaluation will be
submitted to a refereed journal.
A. Voxel data Management
A 3D raster is commonly obtained from an existing 3D vector model or
from 3D discrete measurements such as point clouds. The vector model
contains structured data and is represented according to the rules of
either GIS or BIM models. GIS models are used for modeling natural
phenomena and man-made objects, while newly constructed man-made
objects such as buildings, bridges, etc. are typically available in
BIM models.
Our work supports 3D vector-raster conversion of vector models stored
in CityGML² and it uses the voxelization principles defined in [15].
The voxelization of surfaces and curves is a customization of the
Topological Voxelization approach presented in [11] and ensures
correct representation of geometries, topology, and semantics [15].
Objects are voxelized one at a time and the results are saved into a
Parquet file. Instead of voxelizing an entire city sequentially, the
voxelization is done in parallel by tiling the city. For each tile a
Parquet file is created. For efficient data access, the size of each
Parquet file is kept above 1GB to maximize the length of the stored
columns. Because of this optimization, and because objects are not
uniformly distributed, the tiling is not uniform.
B. Point cloud voxelization
The voxelization library [15] additionally provides an extension for
the voxelization of point clouds. The methods provide easy management
of connectivity levels in the resulting voxels and do not depend on
any external library beyond primitive types and constructs; they are
therefore easy to integrate into a DBMS.
² http://www.citygml.org
The in-situ data access combined with efficient spatial selections
allows on-demand, near real-time voxelization of point clouds. It
allows us to extract a series of point clouds of a certain region to
determine volume differences of natural objects, or to quickly
visualize the impact of introducing or removing a man-made
structure [15], [18].
Fig. 5: Semantically enriched voxels from a point-cloud [7]
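Determining a volume difference from two point clouds of the same region can be sketched with simple binning. Note that this replaces the topological voxelization of [15] with plain floor division of the coordinates, so it only illustrates the idea; the two scans and the 1 m edge length are made-up inputs.

```python
import math


def voxelize(points, edge):
    """Map each (x, y, z) point to the index of the cube that contains it."""
    return {tuple(math.floor(c / edge) for c in point) for point in points}


def volume_change(before, after, edge):
    """Volume difference between two voxel sets, obtained by counting cubes."""
    return (len(after) - len(before)) * edge ** 3


# Two hypothetical scans of the same region; the second has one extra point
# in a cube that was empty before.
scan_a = [(0.1, 0.1, 0.1), (0.2, 0.3, 0.1), (1.2, 0.1, 0.1)]
scan_b = scan_a + [(1.5, 0.5, 1.5)]

voxels_a = voxelize(scan_a, 1.0)
voxels_b = voxelize(scan_b, 1.0)
delta = volume_change(voxels_a, voxels_b, 1.0)
```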
For risk management, such flexibility and efficiency allow rescue
teams to study a building in a matter of seconds. As an example,
Figure 5 illustrates a large building of the Technical University of
Delft (TUDelft). It maps the building into a series of points (red:
stairs, yellow: floor, and black: walls), while the voxels map the
empty spaces above the floor between objects using a color gradient:
orange means objects are close by, while blue means they are far
away. The compact representation of each voxel allows analysis of the
possible routes and the available space to define escape routes.
C. Efficient spatial selections
For performance profiling we have used the benchmark defined in [17].
The results are compared with the most efficient solution in [17],
LAStools from Rapidlasso³.
1) Setup: The experiments are conducted on a server with double the
capacity of the one used in [17]: instead of 16 Intel Xeon CPUs, it
has 32 CPUs; instead of 128GB of main memory, it has 256GB; and
instead of 2 x 41TB SATA in RAID 5 configuration, it has SAS
(Thunderbolt) with 24 x 2TB disks. Despite the difference, input and
intermediate data fit in memory. MonetDB was set to use only 16
cores. Regarding the software, we used the JUN2016 branch of MonetDB
(M) and the latest version of LAStools (L).
LAStools is assisted by a DBMS (LD) which stores each file's bounding
box in order to avoid inspecting each file header. It also required
us to run lassort and lasindex to boost query performance. This
pre-query stage had the same cost, around 18 hours, as a complete
data import for MonetDB.
³ http://rapidlasso.com
2) Data: Massive 3D discrete measurements have been obtained through
airborne LiDAR (Light Detection and Ranging) or terrestrial scanning
campaigns. As an example, the height map of the Netherlands, the
Actueel Hoogtebestand Nederland 2 (AHN2)⁴, which is stored and
distributed in more than 60,000 LAZ files, contains 640 billion
points. The data is composed only of X, Y and Z coordinates, i.e.,
there is no extra benefit for column-oriented architectures compared
to record-oriented architectures.
3) Queries: The queries are a series of small (s) and large (l) space
selections using simple (S) or complex (C) polygons. The first line
of Table I gives the type of each query, the second line the query
number, and the remaining lines the hot-run execution times, in
seconds, for LAStools (L), LAStools combined with a DBMS (LD), and
MonetDB (M). Due to space constraints, we have omitted all the
queries for which it was not possible to obtain a result in less than
1000 seconds or for which the result was 10 times worse.
4) Results: The 18 hours reported for pre-query data loading and
preparation are automatically reduced thanks to our in-situ data
access approach, i.e., only data relevant for the queries is
imported, using the file metadata loaded during the attachment phase.
Regarding query performance, LAStools combined with a DBMS performs
better than our solution only for very small selections or simple
polygons. For all other queries, with the exception of queries 24 and
27, our solution outperforms LAStools. It is important to note that
selections or aggregations on other attributes would be hard to
express for the file-based solution and would require the inspection
of all file headers. Furthermore, the numbers obtained by our
open-source solution are comparable with the best commercial solution
with customized hardware [17]. Overall, our solution stands as the
best DBMS solution.
D. Query execution
The column-store used, MonetDB, follows the operator-at-a-time
paradigm, with the output of each operator being materialized before
being passed as an argument to the next operator. This feature
simplifies the integration of a nested column-oriented format, such
as Parquet, since it allows us to un-nest a column and materialize it
when needed.
One of the advantages of columnar processing is late materialization,
i.e., tuples are re-constructed as late as possible in the query
plan. It allows a column-oriented architecture to have a low memory
footprint during query execution, especially during the filtering
phase. The extended operators, regardless of whether they are
processing nested or un-nested data, continue to support late
materialization.
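Late materialization can be sketched with two small operators. Each selection is fully evaluated before the next one starts and passes on only a list of qualifying row positions; tuples are built at the very end for the requested columns. The operator names and the voxel table are hypothetical, and this is a plain-Python sketch rather than MonetDB's actual operators.

```python
def select_positions(column, predicate, candidates=None):
    """Filter one column and return the row positions that qualify."""
    positions = range(len(column)) if candidates is None else candidates
    return [i for i in positions if predicate(column[i])]


def materialize(columns, names, positions):
    """Build tuples only for surviving positions and requested columns."""
    return [tuple(columns[name][i] for name in names) for i in positions]


# Hypothetical voxel table stored column by column.
columns = {
    "x": [1, 5, 9, 12],
    "z": [0.5, 3.0, 2.0, 8.0],
    "type": ["wall", "roof", "wall", "roof"],
    "color": ["grey", "red", "grey", "red"],
}

candidates = select_positions(columns["x"], lambda v: v >= 5)
candidates = select_positions(columns["z"], lambda v: v < 5.0, candidates)
rows = materialize(columns, ["x", "type"], candidates)
```

The `color` column is never read, and during filtering only position lists are kept in memory.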
⁴ http://www.ahn.nl
TABLE I: EFFICIENT SPATIAL SELECTIONS OVER A MASSIVE POINT CLOUD DATA SET
Typ  sS     sS     sS     sS     sS     sC     sS     sC     sC     sS     sS     sS     lC     lC     sS     sS     sS     lS     lS     lS     lS     lS     lS     lC     lC
Q    1      2      3      4      5      6      7      8      9      10     11     12     13     14     15     16     17     21     22     23     24     26     27     28     29
LD   .07    .16    .07    .16    .7     1.52   0.55   3.72   2.34   .08    .05    .26    412    102    .49    .04    0.32   2.28   142.0  313    234    282    .13    x      x
L    .9     .87    .78    .92    33.2   32.8   32.29  36.2   34.8   .88    .84    1.03   829    424    1.39   .75    1.18   20.74  x      828    x      x      923    x      x
M    .2     .41    .22    .5     .36    .66    .25    .69    .54    .24    .24    .44    24.9   23.6   .41    .08    .73    1.5    99.9   35.9   363*   17.2   .16    224    108
The scan and aggregation operators were modified to be aware of the
repetition level. The definition level is only used by projection
operators to un-nest the data and do tuple re-construction. The tuple
re-construction happens in the presence of a blocking operator or a
result constructor, or when nested data needs to be combined with
flat data.
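An aggregation that is aware of the repetition level can be sketched as follows. It assumes that level 0 marks the first value of a new record, as in the Dremel encoding [14], and sums the values of each record without rebuilding the nested structure; the function name and the sample column are hypothetical.

```python
def sum_per_record(rep_levels, values):
    """Sum a nested column per record; repetition level 0 starts a record."""
    totals = []
    for level, value in zip(rep_levels, values):
        if level == 0:
            totals.append(value)
        else:
            totals[-1] += value
    return totals


# Hypothetical column: two records holding seven and three values.
rep_levels = [0, 2, 2, 1, 2, 2, 2, 0, 1, 2]
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
totals = sum_per_record(rep_levels, values)
```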
V. FUTURE PLANS
Once the system is fully operational, we will thoroughly study the
robustness of the proposed conceptual model and the efficiency of our
storage model using in-house projects. At the same time, we will
design support for horizontal scalability and for interactive
visualization of voxel-based 3D city models.
A. Horizontal scalability
With voxels stored in Parquet, our current work aligns with ongoing
advances in large-scale spatial processing in the cloud. As future
work we intend to explore the possibility of 3D data manipulation of
large-scale voxel-based 3D city models using GeoSpark [3]. GeoSpark
was built to efficiently exploit the internals of Spark⁵. It extends
Resilient Distributed Datasets (RDDs) to form Spatial RDDs (SRDDs).
For efficient data parallelism it partitions SRDD data elements
across machines and introduces novel parallelized spatial geometric
operations.
GeoSpark is still in an early development stage: it supports only a
few geometries (point, rectangle, and polygon) and two spatial
indexes (R-Tree and Quad-Tree). On top of that, it also supports
spatial queries, e.g., range queries, K nearest neighbor (KNN)
queries, and join queries on large-scale spatial datasets. Its major
advantage is the fact that it is built on top of Spark. By using the
Spark infrastructure, Hadoop-friendly file formats such as Parquet
can be directly ingested and used for large spatial analysis. In
addition to our single-server mode using MonetDB, we will also
provide a cluster mode using Spark (both providing integrated R and
Python environments).
B. Interactive visualization
For interactive visualization we plan to explore Cubiquity [2] as an
extension of Unreal Engine 4⁶. Cubiquity is a voxel engine written in
C++ and released under the terms of the MIT license. It allows the
creation of volumetric (voxel-based) environments which can be
dynamically modified, i.e., it enables dynamic digging, building, and
destruction.
Cubiquity is a flexible and powerful voxel engine which can be used,
e.g., to create terrains with caves or defined environments built
from millions of colored cubes. It supports both smooth terrain and
colored-cube environments, multiple volumes which can exist in
transform hierarchies, and direct voxel access for implementing
procedural generation.
⁵ https://spark.apache.org/
⁶ https://github.com/volumesoffun/cubiquity-for-unreal-engine
VI. SUMMARY
In this work we have presented an architecture for a 3D
column-oriented raster DBMS and our efforts so far on its
implementation. The uniqueness of our solution lies in the
combination of a novel conceptual model for 3D city models, a
voxel-based one, with an efficient nested column-oriented format to
explore the 3D city model at different levels of detail. It is
designed to iteratively load data from different sources, and
topological and geometric functionality for 3D raster manipulation is
part of the relational kernel and not an add-on. It is the first
DBMS-based solution with in-situ access to spatial data repositories
and on-demand voxelization. With it, spatial analysis tailored to
different use case scenarios is done on demand and fast enough to be
used by modern risk management systems where real-time interaction is
a key feature.
REFERENCES
[1] https://parquet.apache.org/.
[2] https://bitbucket.org/volumesoffun/cubiquity.
[3] http://geospark.datasyslab.org/.
[4] D. J. Abadi, S. Madden, and M. Ferreira. Integrating compression and
execution in column-oriented database systems. In Proceedings of the
ACM SIGMOD, 2006.
[5] D. J. Abadi, S. Madden, and N. Hachem. Column-stores vs. row-stores:
how different are they really? In Proceedings of the ACM SIGMOD,
2008.
[6] F. Alvanaki, R. Goncalves, M. Ivanova, et al. GIS navigation boosted
by column stores. PVLDB, 2015.
[7] F. Fichtner. Semantic enrichment of a point cloud based on an octree
for multi-storey path finding. MSc Thesis, TU Delft, 2016.
[8] R. Goncalves, M. Ivanova, F. Alvanaki, J. Maassen, K. Kyzirakos,
O. Martinez-Rubi, and H. Muhleisen. A round table for multi-
disciplinary research on geospatial and climate data. IEEE e-Science,
2015.
[9] S. Idreos, F. Groffen, N. Nes, S. Manegold, et al. MonetDB: Two
decades of research in column-oriented database architectures. IEEE
Data Engineering Bulletin, 2012.
[10] M. Ivanova, Y. Kargin, et al. Data Vaults: A database welcome to
scientific file repositories. SSDBM, 2013.
[11] S. Laine. A topological approach to voxelization. Eurographics
Symposium on Rendering, 2013.
[12] O. Martinez-Rubi et al. Benchmarking and improving point cloud
data management in MonetDB. SIGSPATIAL Special, 2015.
[13] O. Martinez-Rubi, M. L. Kersten, R. Goncalves, and M. Ivanova. A
column-store meets the point cloud. FOSS4G-Europe, 2014.
[14] S. Melnik, A. Gubarev, J. J. Long, G. Romer, S. Shivakumar, M. Tolton,
and T. Vassilakis. Dremel: Interactive analysis of web-scale datasets.
VLDB’10, pages 330–339, 2010.
[15] P. Nourian, R. Gonçalves, et al. Voxelization algorithms for
geospatial applications: Computational methods for voxelating spatial
datasets of 3D city models containing 3D surface, curve and point data
models. MethodsX, pages 69–86, 2016.
[16] A. Stadler and T. H. Kolbe. Spatio-semantic coherence in
the integration of 3D city models. ISPRS Archives,
Volume XXXVI-2/C43, 2007.
[17] P. van Oosterom, O. Martinez-Rubi, et al. Massive point cloud data
management: Design, implementation and execution of a point cloud
benchmark. Computers & Graphics, 2015.
[18] S. Zlatanova, P. Nourian, R. Gonçalves, and A. V. Vo. Towards 3D raster
GIS: On developing a raster engine for spatial DBMS. ISPRS WG IV/2
Workshop: Global Geospatial Information and High Resolution Global
Land Cover/Land Use Mapping, 2016.