THE SIX FACES OF THE DATA CUBE
Peter Strobl1, Peter Baumann2, Adam Lewis3, Zoltan Szantoi1,4, Brian Killough5,
Matthew Purss3, Max Craglia1, Stefano Nativi1,6, Alex Held7, Trevor Dhu3
1European Commission, Joint Research Centre, Ispra, Italy; 2Jacobs University, Bremen, Germany;
3Geoscience Australia, Canberra, Australia; 4Stellenbosch University, South Africa;
5NASA Langley Research Center, Hampton, United States; 6National Research Council of Italy, Rome, Italy;
7Commonwealth Scientific Industrial Research Organisation (CSIRO), Canberra, Australia
Proc. of the 2017 conference on Big Data from Space (BiDS’17), 28–30 November 2017, Toulouse, France. doi: 10.2760/383579
ABSTRACT
This paper provides structure to the recently intensified
discussion around ‘data cubes’ as a means to facilitate the
management and analysis of very large volumes of
structured geospatial data. The goal is to arrive at a widely
agreed and harmonised definition of a ‘data cube’. To this
end, we propose an approach that deconstructs the ‘data
cube’ concept into distinct aspects. We have identified six
such aspects, which we refer to as the 6 faces of the data
cube. More than a pleasing analogy, these 6 faces are fairly
independent, and hence ‘orthogonal’, domains. They should
allow breaking down the description and handling of data
cubes into meaningful and manageable parts which, only
when considered holistically, make it possible to harness the
full potential of this multidisciplinary infrastructure.
Index Terms— data cube, structured data, data
infrastructure, geospatial data, big data, standardisation,
WCS, CIS, INSPIRE, OGC, ISO
1. INTRODUCTION
The term data cube was originally used in Online Analytical
Processing (OLAP) of business and statistics data;
technically speaking, such a data cube represents a multi-
dimensional array together with metadata describing the
semantics of axes, coordinates, and cells. More recently,
data cubes have emerged in a geospatial context [1,2] as an
approach to the management and analysis of these large and
rapidly growing datasets. While the term ‘data cube’ was
used as early as the 1980s, when the first imaging
spectrometers produced ‘hyperspectral data cubes’, the
technology for efficiently storing and serving data cubes was
not yet available. Geospatial data cubes are typically densely
populated, whereas OLAP data cubes typically are sparse.
A generic requirement remains that data can only be
organised as a ‘cube’ if they have inherent attributes
(usually referred to as coordinates) according to which they
can be ordered. A data cube may have horizontal and
vertical spatial axes, temporal axes, or any other application-
dependent dimensions. For geospatial data cubes, at least
one of those should be non-spatial.
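This coordinate-based organisation can be illustrated with a minimal sketch (the axis names, sizes, and lookup helper below are our own illustration, not part of any standard):

```python
import numpy as np

# A minimal geospatial data cube: a dense 3-D array over (time, y, x),
# together with the coordinate values that order each axis.
time = np.array(["2017-01", "2017-02", "2017-03"])  # temporal axis
y = np.linspace(45.0, 44.0, 4)                      # latitude axis
x = np.linspace(9.0, 10.0, 5)                       # longitude axis

cube = np.zeros((time.size, y.size, x.size))        # the cell values

def cell(t_label, lat, lon):
    """Address a cell by its coordinates (nearest match on y/x)."""
    ti = int(np.where(time == t_label)[0][0])
    yi = int(np.abs(y - lat).argmin())
    xi = int(np.abs(x - lon).argmin())
    return cube[ti, yi, xi]
```

Here the time axis is the required non-spatial dimension; libraries such as xarray implement this pattern in full generality.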
2. DEFINING THE CUBE
Similar to the term ‘big data’, for which no consistent
definition has yet emerged, we find the notion of ‘data cube’
still varying across the literature and often dependent on the
context in which it is used. Whilst ‘big data’ is an expression
abstract enough that some room for interpretation is not
problematic, discussions around data cubes will suffer unless
further structure is given to this evolving concept.
For us, a Geospatial Data Cube (GDC) is based on regularly
or irregularly gridded, spatial and/or temporal data with n
dimensions (or axes) and characterised by the presence of
the 6 faces that we explore in this paper. As such, it
complements the conceptual view of the ‘Datacube
Manifesto’ [4] with a holistic system view, whose aim is to
raise awareness for all necessary aspects of such an
infrastructure.
3. DISSECTING THE CUBE
The purpose of a GDC is to allow the ingestion, storage,
provision, and analysis of structured geospatial data, for
which it has to cover several technical aspects, which we
call faces. Individually, each face is a well-established
domain within the data sciences, allowing the respective
experts to enter the discussion at the right point. However, as
an infrastructure, a data cube can unfold its full potential only
if all the following ‘faces’ are comprehensively covered and
well-orchestrated.
3.1. Parameter Model
The semantics of a cube cell value is described by a
parameter model which allows understanding the
information stored in each thematic layer of the cube. This
includes the parameterisation of the property and its quality,
as well as the associated metadata that are necessary for the
analysis. The Open Geospatial Consortium (OGC) Sensor
Web Enablement (SWE) Common Data Model (CDM) [3]
defines important elements of parameter models. Well-
documented implementations of such models for various
themes, such as terrain elevation [5], are given in the
INSPIRE data specifications. However, incorporating data
describing the same parameter (e.g. radiance imagery) but
from various origins into a geospatial data cube remains a
challenge even in cases where such models are applied, due
to the differences among collecting sensors, imagery
processing chains and algorithms used. Thus, such
geospatial (raster) data need either to be pre-processed with
approved algorithms or, preferably, to be produced directly
by the corresponding instrument owner such that they fit
into the data cube structure. The latter, preferred option is
advocated and endorsed by the Committee on Earth
Observation Satellites (CEOS). Such data, called “Analysis
Ready Data” (ARD), would come from CEOS member
space agencies and fulfil a minimum set of criteria, such as
consistent parameter models and approved algorithms, thus
largely facilitating the compilation of data cubes and data
exchange among them. Direct or automatic multi-sensor
data fusion, however, also calls for harmonised sensor
characteristics, such as spectral band definitions, and for the
availability and consistency of ancillary data such as Digital
Elevation Models.
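As a rough illustration of what a parameter model captures (the field names and values below are hypothetical, not taken from SWE-CDM or any CEOS ARD specification):

```python
from dataclasses import dataclass

@dataclass
class ParameterModel:
    """Illustrative semantics of the cell values of one thematic layer."""
    name: str           # observed property, e.g. surface reflectance
    unit: str           # unit of measure ("1" = dimensionless)
    sensor: str         # collecting instrument
    algorithm: str      # processing chain that produced the values
    quality_layer: str  # associated per-cell quality information

layer = ParameterModel(
    name="surface_reflectance",
    unit="1",
    sensor="Sentinel-2 MSI",
    algorithm="Sen2Cor",
    quality_layer="scene_classification",
)

def fusable(a: ParameterModel, b: ParameterModel) -> bool:
    """Two layers can be fused directly only if their parameter
    models describe the same property in the same unit."""
    return a.name == b.name and a.unit == b.unit
```

A check like `fusable` makes explicit why data from different origins cannot simply be stacked: diverging sensors or algorithms surface as mismatching parameter models.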
3.2. Data Representation
Data representation is the way in which a parameter is
discretised and semantically encoded along the different
axes or dimensions of the cube such as space, time, and
thematic properties. A given parameter might be represented
in different ways and the same representation scheme might
be used for different parameters. Depending on the
representation type a specific set of metadata needs to be
supplied including e.g. range, interval, scale, precision, or
reference. The OGC SWE-CDM contains a comprehensive
overview of representation types [3].
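A common concrete instance of such representation metadata is a scale/offset quantisation of a continuous parameter into integer cell values; a sketch (the specific numbers are illustrative, not prescribed by SWE-CDM):

```python
import numpy as np

# Representation metadata: physical value = stored value * scale + offset,
# with stored values confined to a valid integer interval.
scale, offset = 1e-4, 0.0
valid_range = (0, 10000)

def encode(physical):
    """Discretise physical values into the stored integer representation."""
    stored = np.round((physical - offset) / scale).astype(np.int16)
    return np.clip(stored, *valid_range)

def decode(stored):
    """Recover (approximate) physical values from stored ones."""
    return stored.astype(np.float64) * scale + offset

reflectance = np.array([0.0, 0.1234, 1.0])
roundtrip = decode(encode(reflectance))  # differs by at most scale / 2
```

The range, interval, scale, and precision mentioned above are exactly the metadata a consumer needs to invert such an encoding correctly.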
Discretisation in the spatial domain is most familiar in the
form of gridding [6]. ISO and OGC today base most of their
grid definitions on the EPSG catalogue of projections,
which either limits respective grids to regional coverage or
induces considerable spatial distortion. An example of a
common (quasi-)global spatial grid system is WMTS, which
is in fact a mixture of projection, grid definition, and tiling
schema. A relatively new concept is promoted by the
recent OGC standard for Discrete Global Grid Systems
(DGGS), which aim at overcoming limitations of planar
projections by defining hierarchical grids directly on the
ellipsoid.
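The hierarchical-grid idea behind DGGS can be conveyed with a toy latitude/longitude quadtree (a planar simplification for illustration only; the OGC DGGS standard defines its grids on the ellipsoid):

```python
def cell_id(lat: float, lon: float, level: int) -> str:
    """Toy hierarchical cell id: refine a lat/lon bounding box 'level'
    times, emitting one quadrant digit (0-3) per refinement step."""
    lat0, lat1, lon0, lon1 = -90.0, 90.0, -180.0, 180.0
    digits = []
    for _ in range(level):
        mid_lat, mid_lon = (lat0 + lat1) / 2, (lon0 + lon1) / 2
        q = (2 if lat >= mid_lat else 0) + (1 if lon >= mid_lon else 0)
        digits.append(str(q))
        lat0, lat1 = (mid_lat, lat1) if lat >= mid_lat else (lat0, mid_lat)
        lon0, lon1 = (mid_lon, lon1) if lon >= mid_lon else (lon0, mid_lon)
    return "".join(digits)

# The hierarchy shows in the ids: a parent cell's id is a prefix of
# all its children's ids.
assert cell_id(48.2, 16.4, 3).startswith(cell_id(48.2, 16.4, 2))
```

The prefix property is what makes hierarchical grids attractive for multi-resolution indexing of data cubes.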
In other areas standards are often still missing, and the
representation of observation-level metadata such as
measurement quality and uncertainty is in its infancy.
3.3. Data Organisation
The cell values generated by the discretisation of the
parameter need to be physically arranged and stored in a
machine-readable way. This encompasses issues like file
formats, file systems, and database structures. OGC CIS [6]
(also adopted as ISO 19123-2) establishes how a
representation can be based on ASCII (such as GML, JSON,
or RDF), binary (such as GeoTIFF or NetCDF), or a mix of
both embedded in a “container format” (such as zip or
GeoPackage). Furthermore, data cubes representing “Big
Data” typically require data to be partitioned (also called
tiling) and to be amenable to streaming (mainly in the case
of timeseries); both are included in the current version,
CIS 1.1. Furtado [7] performed a general analysis of multi-
dimensional partitioning.
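The partitioning (tiling) mentioned above can be sketched as a mapping from each n-dimensional cell index to a tile index plus an offset within that tile (the tile shape below is an arbitrary illustration):

```python
# Regular tiling of an n-dimensional index space: every cell index maps
# to (tile index, offset within tile) along each axis.
def tile_of(index, tile_shape):
    tiles = tuple(i // t for i, t in zip(index, tile_shape))
    offsets = tuple(i % t for i, t in zip(index, tile_shape))
    return tiles, offsets

# A 3-D cube partitioned into 1 x 256 x 256 tiles (time axis unpartitioned):
tiles, offsets = tile_of((5, 300, 700), (1, 256, 256))
```

Choosing the tile shape trades off spatial against temporal access: per-time-slice tiles favour map-style reads, while tiles spanning many time steps favour timeseries extraction.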
Fig. 1. The data-oriented faces of the Geospatial Data Cube.
3.4. Infrastructure
The data storage units must be hosted by an IT infrastructure
or ‘hardware’ that also allows their handling. This could be
a centralised or distributed setup of storage and processing
devices. Rapid data access and transfer between storage and
processing instances are important criteria [2], particularly
for very large spatio-temporal datasets.
The volume and growth rate of geospatial data require
significant financial and logistical investments to offer
competitive services for attracting and retaining users.
Among the many computing facilities that have started
offering geospatial data and services over recent years are
industrial initiatives such as the Google Earth Engine [8] or
Amazon Web Services. Others are publicly funded and
operated, such
as the Australian Geospatial Data Cube [2], the Technical
University of Vienna’s Earth Observation Data Centre
(EODC) [9] or the JRC Earth Observation Data Processing
Platform (JEODPP) at the European Commission’s (EC) Joint
Research Centre [10]. In the frame of the Copernicus
programme, the EC is about to fund various consortia uniting
public and private entities to serve as ‘Data and Information
Access Services’ (DIAS) [11,12].
While all these initiatives also show commitment to other
aspects of data cubes, their main investments seem to be
directed towards the IT infrastructure. However, the success
of these investments will largely depend on the functionality
of these infrastructures, for which they must also duly cover
the other faces described here.
Fig. 2. The functionality-oriented faces of the Geospatial
Data Cube.
3.5. Access and Analysis
Within the infrastructure a wide range of functionalities
must be implemented through software to access,
manipulate and analyse the stored data (and metadata) and
to ingest new products into the data cube. These
functionalities must be documented and made available to
users by means of APIs and other interactive interfaces
(GUIs). Between the user API (front-end) and the file
manipulation routines (back-end), one or several layers of
software are conceivable.
One of these layers could consist of common GIS tools (e.g.
QGIS, ArcGIS), and OGC Web Coverage Services (WCS)
can be used to connect them to the data cube. A recent
example of an API and GUI has been demonstrated
by the CEOS Open Data Cube initiative
(http://tinyurl.com/datacubeui).
An existing standard defining a GDC analytics language is
the OGC Web Coverage Processing Service (WCPS) [13].
Additional recent attempts to establish such languages are
made by OPeNDAP, Google Earth Engine [8] and others.
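As an illustration of such standards-based access, a WCS 2.0 GetCoverage request can be assembled from key-value pairs (the endpoint and coverage identifier below are placeholders, not a real service):

```python
from urllib.parse import urlencode

endpoint = "https://example.org/wcs"    # hypothetical WCS endpoint
params = {
    "service": "WCS",
    "version": "2.0.1",
    "request": "GetCoverage",
    "coverageId": "S2_NDVI",            # hypothetical coverage id
    "subset": 'time("2017-06-01")',     # trim along the temporal axis
    "format": "image/tiff",
}
url = endpoint + "?" + urlencode(params)
```

A real service may use different axis labels for the subset parameter; these are advertised in its DescribeCoverage response.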
As substantial processing is being shifted to the data cube
host, anticipative cost estimation as well as access rights and
security will also be of high concern when it comes to
granting access to data and to analysis power. Given the size
of data cubes, it will often not be sufficient to grant or deny
access to the whole cube; particular regions, collections, etc.
may need to be guarded separately. The costs of accessing,
processing, and transferring data should be determined prior
to execution so that the host can decide on admissibility,
warn users, or reject disproportionate requests.
3.6. Interoperability
Interoperability and scalable fusion of spatial information
across different data cubes is crucial and highly dependent
on the use of robust international standards governing the
access and transfer protocols for communication between
client and server as well as among different servers.
ISO 19123 (which is identical to OGC Abstract Topic 6)
defines an abstract data cube model as part of the coverage
concept; however, due to its level of abstraction it is not yet
interoperable. Its sister standard, OGC CIS 1.1 / ISO 19123-
2, establishes concrete encodings which allow re-encoding
of coverages from one format into another so that a well-
defined, format-independent data cube exchange is possible,
though at the cost of additional interpolation and
resampling.
The corresponding service model is provided by the OGC
Web Coverage Service (WCS) [14], which has been adopted
by INSPIRE and is on the adoption plan of ISO. A large,
growing number of open-source and proprietary
implementations support WCS so that interoperable access
to data cubes is possible through a wide range of tools
today, including map navigation (like OpenLayers, Leaflet),
Web GIS (like QGIS, ArcGIS), visualization (like NASA
WorldWind, Cesium), and analytics (like Python and R);
see the examples in the Jupyter notebook at [15]. This
allows users to remain in the comfort zone of their tools
while accessing data cubes stored in rasdaman, GeoServer,
MapServer, ArcGIS, and other WCS-enabled engines.
Further, the Web Coverage Processing Service (WCPS) geo
datacube analytics language standard provides a means for
“shipping code to data” in an unambiguous, semantically
well-defined manner [13].
Since 2012, the intercontinental EarthServer initiative
(http://www.earthserver.eu) has been establishing agile
datacube analytics on 3D x/y/t image timeseries and 4D
x/y/z/t weather data, based on the rasdaman Array Database
System (http://www.rasdaman.org). The largest installation,
EO Data Service (www.eodataservice.org), has recently
passed the 1 Petabyte mark; ECMWF in EarthServer is
working on unleashing its 220 PB climate archive. Currently
many more stakeholders, such as the Committee on Earth
Observation Satellites (CEOS) and W3C, have started
working on data cubes. Consistency among these and the
established OGC / ISO / INSPIRE standards will be a key to
success. Barriers to interoperability, on the other hand, will
inevitably lead to silo effects undermining the
multidisciplinary concept and potential of data cubes.
4. OUTLOOK
The future success of (geospatial) data cubes will certainly
not depend on the existence of a widely-agreed definition
alone. But it is likely that a well-structured discussion and a
widespread agreement on key features of data cubes will
enable a much faster convergence, increased interoperability
and more rapid progress at global level.
Valuable technology contributions can be expected from the
field of Array Databases, which is working on flexible,
scalable query services on massive arrays, backed by the
existing OGC Web Coverage Processing Service (WCPS)
[13] and the forthcoming ISO Array SQL [16] standards.
However, users should not need to learn new languages each
time they work on another platform; rather, they should be
able to use their own existing tools and scripts (e.g., Python
and R for analysis), which can be coupled through the
abovementioned languages as hidden, standards-based
client/server APIs.
Ultimately, these efforts should go beyond the mere exchange
of data and move us towards compatibility and consistency
of the available information and of the ways it can be
accessed and analysed.
5. REFERENCES
[1] Salehi, M., Bédard, Y., Mostafavi, M., Brodeur, J., 2007,
“From transactional spatial databases integrity constraints to spatial
data cubes integrity constraints”, Proc. of the 5th International
Symposium on Spatial Data Quality.
[2] Lewis, A., et al., 2017, “The Australian Geoscience Data Cube
— Foundations and lessons learned”, Remote Sensing of
Environment, http://dx.doi.org/10.1016/j.rse.2017.03.015
[3] Robin, A. (Ed.), 2011, SWE CDM Encoding Standard, OGC,
http://www.opengeospatial.org/standards/swecommon
[4] Baumann P., 2017, “The Datacube Manifesto”,
http://www.earthserver.eu/tech/datacube-manifesto
[5] INSPIRE Data Specification on Elevation – Tech. Guidelines
https://inspire.ec.europa.eu/file/1530/download?token=pq85sbLG
[6] Baumann, P., Hirschorn, E., Maso, J., 2017, Coverage
Implementation Schema, version 1.1, OGC,
https://portal.opengeospatial.org/files/?artifact_id=48553
[7] Furtado, P. et al, 1999, “Storage of Multidimensional Arrays
based on Arbitrary Tiling.”, ICDE'99, Sydney, Australia
[8] Gorelick, N., et al., 2017, “Google Earth Engine: Planetary-scale
geospatial analysis for everyone”, Remote Sensing of
Environment, http://dx.doi.org/10.1016/j.rse.2017.06.031
[9] Wagner, W., 2015, Big Data infrastructures for processing
Sentinel data, in Photogrammetric Week 2015, Dieter Fritsch (Ed.),
Wichmann/VDE, Berlin Offenbach, 93-104
[10] Soille, P., et al, 2017, The JRC Earth Observation Data and
Processing Platform, Big Data from Space BiDS’17, this issue
[11] http://copernicus.eu/news/upcoming-copernicus-data-and-
information-access-services-dias
[12] Schick, M., 2017, EUMETSAT, ECMWF & MERCATOR
OCÉAN partners DIAS
[13] Baumann, P., 2009, Web Coverage Processing Service
(WCPS) Language Interface Standard, OGC,
http://www.opengeospatial.org/standards/wcps
[14] Baumann, P., 2012, “OGC Web Coverage Service (WCS)
Core”, OGC, https://portal.opengeospatial.org/files/09-110r4
[15] Clements, O., et al, 2017, “Improving access to big data
through OGC standard interfaces”,
https://nbviewer.jupyter.org/github/earthserver-eu/INSPIRE-
notebooks/blob/master/index.ipynb
[16] Misev, D., et al., 2015, “A Database Language More Suitable
for the Earth System Sciences”. In G. Lohmann et al (eds.):
Towards an Interdisciplinary Approach in Earth System Science
Springer 2015, doi:10.1007/978-3-319-13865-7