Klimatic: A Virtual Data Lake for Harvesting and Distribution of Geospatial Data
Tyler J. Skluzacek1,2, Kyle Chard2, and Ian Foster1,2,3
Abstract—Many interesting geospatial datasets are publicly
accessible on web sites and other online repositories. However,
the sheer number of datasets and locations, plus a lack
of support for cross-repository search, makes it difficult for
researchers to discover and integrate relevant data. We describe
here early results from a system, Klimatic, that aims to
overcome these barriers to discovery and use by automating
the tasks of crawling, indexing, integrating, and distributing
geospatial data. Klimatic implements a scalable crawling and
processing architecture that uses an elastic container-based
model to locate and retrieve relevant datasets and to extract
metadata from headers and within files to build a global index
of known geospatial data. In so doing, we create an expansive
geospatial virtual data lake that records the location, formats,
and other characteristics of large numbers of geospatial datasets
while also caching popular data subsets for rapid access. A
flexible query interface allows users to request data that satisfy
supplied type, spatial, temporal, and provider specifications;
in processing such queries, the system uses interpolation and
aggregation to combine data of different types, data formats,
resolutions, and bounds. Klimatic has so far incorporated more
than 10,000 datasets from over 120 sources and has been
demonstrated to scale well with data size and query complexity.
I. INTRODUCTION

New sensors, simulation models, and observational programs are producing a veritable deluge of high-quality
geospatial data. However, these data are often hard for re-
searchers to access, being stored in independent silos that are
distributed across many locations (e.g., consortium registries,
institutional repositories, and personal computers), accessible
via different protocols, represented in different formats (e.g.,
NetCDF, CSV) and types (e.g., vector, raster), and are in
general, difficult to discover, integrate, and use [1]. These
challenges are nowhere more evident than in environmental and climate science, where vast collections of data are stored in dark, heterogeneous repositories distributed worldwide.
We aspire to make these large quantities of geospatial
data accessible by creating a virtual data lake: a cached subset of a data lake paired with additional metadata for non-cached datasets. A data lake is “a centralized repository
containing virtually inexhaustible amounts of raw (or mini-
mally curated) data that is readily made available anytime to
anyone authorized to perform analytical activities” [2]. Such
a system allows for the local caching of raw data in a stan-
dardized format, making integration and distribution more
efficient at query-time. A geospatial data lake should allow
for the straightforward alignment of spatial and time-based
variables, and be able to manage and integrate heterogeneous data formats. Given the huge quantity of geospatial data, we extend the data lake model to encompass a metadata index of all processed data and to use the virtual lake itself as a cache for popular raw data. This approach allows Klimatic to track less popular datasets without sacrificing performance or storage for frequently accessed data.

1 Department of Computer Science, The University of Chicago, Chicago, IL, USA
2 Computation Institute, The University of Chicago, Chicago, IL, USA
3 Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
To explore these ideas, we have prototyped Klimatic, a
system for the automated collection, indexing, integration,
and distribution of big geospatial data. Although there is
prior research in both geospatial metadata extraction and
data lakes, to the best of our knowledge this is the first
example of a centralized, searchable index across disparate
web-accessible resources, combined with a virtual lake cache
for raw data. We adopt a scalable crawling and metadata
extraction model, using a dynamic pool of Docker contain-
ers [3] to discover and process files. Thus, we pave the way
for creation of a scalable system that has the capacity to scour
an increasing number of available resources for geospatial
data. To further reduce usage barriers, Klimatic supports the
integration of heterogeneous datasets (in both file type and
format) to match users’ queries, while also ensuring data
integrity [4], [5].
The rest of this paper is organized as follows. §II discusses challenges associated with the creation of a geospatial virtual
lake. §III outlines Klimatic’s architecture and implementa-
tion. §IV explores the data collected in Klimatic. §V dis-
cusses related work. Finally, §VI summarizes the impact of
Klimatic while illuminating future research and applications.
II. CHALLENGES

Geospatial data are stored in a variety of repositories,
many accessible via HTTP or Globus GridFTP. Globus
is a service-based research data management system that
provides access to more than 10,000 storage systems (called
“endpoints”), many of which are used for storing scientific
data. Automating the collection and indexing of all geospa-
tial data stored on Globus endpoints and the web would be of
great benefit to researchers. However, this task is not without
significant challenges, as we now discuss.
Discovery: Klimatic needs a way to discover and explore
data stored across an extremely large number of storage
systems. It must do so in such a way that file paths can be
stored for purposes of data provenance and re-examination
at a later time. Klimatic therefore requires a crawler that
can scale to many sites and datasets. It needs to be able
to identify potentially relevant datasets, for example by
looking for relevant file extensions (e.g., .nc and .csv). For each dataset identified, it needs to be able to introspect on its contents, which requires interfaces that support data in different formats accessible via different APIs. It must also be able to determine quickly whether the dataset already exists in the virtual data lake, and decide whether to cache or discard the dataset.

2016 1st Joint International Workshop on Parallel Data Storage and Data-Intensive Scalable Computing Systems (PDSW-DISCS). 978-1-5090-5216-5/16 $31.00 © 2016 IEEE. DOI 10.1109/PDSW-DISCS.2016.9
Indexing: Once Klimatic places a dataset in a Docker container, it needs to acquire descriptive metadata that can
identify datasets satisfying user-specific search criteria. Ad-
ditionally, Klimatic must establish indices that allow users
to quickly filter data by means of metadata. Metadata may
be found in file names, in structured file headers, or within
the file body. Thus, we require a flexible indexing model that
can not only identify these metadata, but also allow for many
geospatial queries while tracking provenance.
Integration: The purpose of our approach is to create a
flexible virtual data lake from which users may retrieve not
only individual datasets but also integrated datasets defined
by a specification such as “all temperature measurements
for the region (30W to 32W, 80N to 82N) for the period
of January, 2016.” It must process such requests efficiently,
while also upholding the data’s integrity. Geospatial data
are particularly complicated to integrate as heterogeneous
collection methods result in different representations (e.g.,
raster vs. vector) and different granularities (e.g., spatial and
temporal). Furthermore, the units used to represent common
data may be different (or even missing). Thus, Klimatic must
effectively manage misalignments between datasets to curate
a new dataset with near-equal integrity to its ancestors.
Ensuring integrity: Given the integrative nature of Kli-
matic, a number of geospatial integrity rules must be fol-
lowed when integrating multiple geo-spatial datasets into
one. These constraints include topological, semantic, and
user-defined integrity constraints [4], [5], [6]. Topological
constraints require that data be divided into mutually ex-
clusive regions with all space covered. Semantic constraints
require that geological relationships are maintained, mean-
ing, for example, that a road cannot exist in the same
space as a building. Finally, user-defined constraints require
that data are minimally affected following post-processing.
Additionally, integrated data should include information that
tracks data lineage. If a dataset cannot fit these constraints,
the user is asked whether to reject the integrated dataset.
III. ARCHITECTURE AND IMPLEMENTATION

The Klimatic architecture implements a three-phase data
ingestion pipeline to populate the virtual data lake: (1)
crawling and scraping publicly accessible data, (2) extracting
metadata and building a discovery index, and (3) loading data
into virtual data lake storage.
A. Crawling and Scraping
The first step in the Klimatic pipeline works to identify
and then download publicly accessible geospatial files. To
provide scalability, we use an elastically scalable pool of
crawler instances, implemented as Docker containers, each repeatedly retrieving a URL from a crawling queue, retrieving and processing any content at that address, and adding any new URLs identified during processing to the queue. Figure 1 illustrates this phase of the workflow.

Fig. 1. Workflow for Klimatic’s metadata extraction and storage.
Klimatic can initially retrieve data via either HTTP or
GridFTP. In each case, our crawler looks for the commonly
used NetCDF (.nc) [7] and CSV formats. The process by
which the crawler discovers these datasets is dependent on
the target repository.
For HTTP-accessible repositories, we seed the crawling
queue with common repositories for geospatial data, such
as the National Oceanic and Atmospheric Administration
(NOAA) and the University Corporation for Atmospheric
Research (UCAR). Using these links as an initial base, the
crawler then explores those web sites and other linked web
sites by scouring the links within pages. As a result of this
crawling process, a list of datasets (with associated URLs)
is appended to a second extraction queue. We have used this
method to discover more than 10,000 climate files.
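The crawl loop can be sketched as follows. This is an illustrative in-memory version: the `site` dict stands in for pages fetched over HTTP (all URLs and links here are hypothetical), while the production crawler pops URLs from a shared queue serviced by a pool of Docker containers.

```python
from collections import deque

GEO_EXTENSIONS = (".nc", ".csv")  # formats the crawler looks for

def crawl(seeds, site):
    """Breadth-first exploration from seed URLs; returns the list of
    candidate dataset URLs appended to the extraction queue."""
    queue, seen, extraction_queue = deque(seeds), set(seeds), []
    while queue:
        url = queue.popleft()
        if url.endswith(GEO_EXTENSIONS):
            extraction_queue.append(url)   # hand off to metadata extractors
            continue
        for link in site.get(url, []):     # links scraped from the page
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return extraction_queue

# Hypothetical link structure standing in for fetched pages.
site = {
    "https://noaa.example/data": ["https://noaa.example/data/t.nc",
                                  "https://noaa.example/data/sub"],
    "https://noaa.example/data/sub": ["https://noaa.example/data/sub/p.csv",
                                      "https://noaa.example/data"],
}
found = crawl(["https://noaa.example/data"], site)
```

Note that the `seen` set prevents the crawler from looping on back-links, such as the one from the subdirectory page back to its parent.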
For GridFTP-accessible data, we use Globus APIs to seed
the crawling queue with endpoints that analysis of access
control lists show to be publicly accessible. The crawler then
explores those endpoints recursively, filtering files by format
and appending matching files to the extraction queue. Our
crawler has so far identified 441 geospatial datasets, mainly
in CSV format, residing on Globus endpoints.
The final challenge associated with the crawling phase is
to determine whether files contain relevant spatial data, as
well as dealing with false-positive datasets (i.e., datasets that
seem to have spatial data during a scan, but do not). As
NetCDF files contain structured headers (with time- and area-
based keys) and raw data in-file, filtering NetCDF files for
relevant metadata is straightforward. However, this task is
more difficult when analyzing CSV files. To test whether
a CSV file contains spatial data, the program checks for
a number of pre-determined geospatial keys (e.g., ‘lon,’ ‘lons,’ ‘long,’ ‘lng,’ and ‘longitude’ for a longitude variable) in the first two rows of each column. If such keys are found, metadata are extracted. We have found that fewer than 10% of CSV files on the sites we visited contain spatial data; moreover, CSV files rarely have informative headers and often require scanning the entire file to create metadata.
Once metadata are stored, Klimatic briefly scans each new dataset’s metadata to ensure that the geospatial values fall within plausible bounds (e.g., that latitudes and longitudes exist and are valid). If the metadata seem unlikely to be correct, the data are flagged for human review.
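The key check can be sketched as follows. The longitude spellings come from the text above; the latitude spellings are analogous assumptions made for this sketch.

```python
import csv
import io

LON_KEYS = {"lon", "lons", "long", "lng", "longitude"}
LAT_KEYS = {"lat", "lats", "latitude"}  # assumed by analogy

def has_spatial_columns(csv_text):
    """True if cells in the first two rows name both a latitude and a
    longitude field: the cheap pre-filter applied before full extraction."""
    rows = list(csv.reader(io.StringIO(csv_text)))[:2]
    cells = {cell.strip().lower() for row in rows for cell in row}
    return bool(cells & LAT_KEYS) and bool(cells & LON_KEYS)

spatial = has_spatial_columns("station,lat,lng\nKORD,41.97,-87.90\n")
plain = has_spatial_columns("name,price\nwidget,3.50\n")
```

Files that pass this filter still undergo the plausibility scan described above before their metadata enter the index.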
B. Extracting Metadata and Indexing
We next process each dataset added to the extraction
queue. This process is performed via an elastically scalable
pool of Docker-based extractor instances. Each such instance
repeatedly downloads datasets via HTTP or GridFTP and
uses a metadata extraction library to complete the Klimatic
metadata model. (We use the UK Gemini 2.2 [8] standard to
represent geospatial metadata.) All metadata are loaded into a
standard PostgreSQL database and indexed via a PostgreSQL
text-search (TS) vector, an alternative to checksums that
creates a unique string out of a dataset’s metadata. This
index allows the crawler to determine whether a dataset is already known to the virtual data lake; if so, the duplicate is recorded in the index to prevent redundant future accesses to the same file. The TS vector index also makes
it easy for users to check for the availability of certain data
parameters, such as lat, long, variables, start date, end date,
and the dataset’s publisher.
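The duplicate check that the TS vector enables can be sketched in plain Python (the field list and the `MetadataIndex` class are illustrative assumptions, not Klimatic's actual schema):

```python
def metadata_fingerprint(meta):
    """Concatenate normalized metadata fields into one string, loosely
    mimicking the text-search vector built over a dataset's metadata."""
    fields = ("provider", "variables", "lat_min", "lat_max",
              "lon_min", "lon_max", "start_date", "end_date")
    return "|".join(str(meta.get(field, "")).lower() for field in fields)

class MetadataIndex:
    """Toy stand-in for the PostgreSQL index keyed on the fingerprint."""
    def __init__(self):
        self._seen = {}  # fingerprint -> list of known locations

    def register(self, meta, location):
        """Record a dataset; return True if it was already indexed, so
        the crawler can skip re-fetching the file from another location."""
        fingerprint = metadata_fingerprint(meta)
        duplicate = fingerprint in self._seen
        self._seen.setdefault(fingerprint, []).append(location)
        return duplicate

index = MetadataIndex()
meta = {"provider": "noaa", "variables": "temperature",
        "lat_min": 30, "lat_max": 32, "lon_min": -80, "lon_max": -78,
        "start_date": "2016-01-01", "end_date": "2016-01-31"}
first = index.register(meta, "https://noaa.example/a.nc")
second = index.register(dict(meta), "https://mirror.example/a.nc")
```

Recording the duplicate's location, rather than discarding it, is what lets the index answer future queries without revisiting either copy.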
C. Data Storage
If a new dataset is not determined to be a duplicate, the
Klimatic system next converts its contents to a relational
format and loads them into a new PostgreSQL table, so as
to accelerate subsequent retrieval and integration operations.
The data are not otherwise modified, although future work
could involve automatic transformation to reference grids,
perhaps based on analysis of user query histories.
Given the virtually unlimited number of geospatial
datasets, it is infeasible to retain the contents of every dataset.
Thus, we operate a caching strategy. Metadata for every
dataset located via crawling are stored in the index, but
dataset contents are stored only if smaller than a predefined
threshold and are subject to ejection via an LRU policy when
the cache is full. Thus, larger and less popular datasets may
need to be re-fetched when requested by a user. (In future
work, we will also explore alternatives to discarding datasets,
such as compression and transfer to slower, cheaper storage.)
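The caching policy can be sketched as follows. The capacity and threshold values are illustrative assumptions; the paper specifies only that contents below a predefined size threshold are cached and ejected via LRU.

```python
from collections import OrderedDict

class DataLakeCache:
    """Size-bounded LRU cache for raw dataset contents. Datasets larger
    than max_item bytes are indexed but never cached."""
    def __init__(self, capacity, max_item):
        self.capacity, self.max_item = capacity, max_item
        self.items = OrderedDict()  # dataset id -> size in bytes
        self.used = 0

    def access(self, dataset_id, size):
        """Return True on a cache hit; on a miss, cache the dataset if it
        fits, evicting least recently used entries as needed."""
        if dataset_id in self.items:
            self.items.move_to_end(dataset_id)   # mark as recently used
            return True
        if size > self.max_item:
            return False                          # too large: metadata only
        while self.used + size > self.capacity:
            _, evicted_size = self.items.popitem(last=False)  # evict LRU
            self.used -= evicted_size
        self.items[dataset_id] = size
        self.used += size
        return False

cache = DataLakeCache(capacity=100, max_item=60)
cache.access("a", 50)
cache.access("b", 50)
hit = cache.access("a", 50)   # "a" becomes most recently used
cache.access("c", 50)         # cache full: evicts "b", the LRU entry
```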
D. Responding to User Queries
Having loaded some number of datasets into the virtual
data lake, we are next concerned with responding to queries.
We show our query model in Figure 3. Our initial query
interface is a simple web GUI built with Flask and Python. With
the goal of making the query interface as simple as possible,
we allow users to query using minimum and maximum
latitudes and longitudes (i.e., a bounding box for their data);
the variable(s) they would like included in their dataset; the
begin and end dates; and (optionally) the data provider(s)
from which data are wanted. Klimatic then estimates the
amount of time required to conduct the join and deliver the
dataset. Many queries require more than two minutes for the
join, as many datasets have upward of 2 million cells.
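The matching step behind such a query can be sketched as follows; the field names are assumptions for illustration, not Klimatic's actual schema, and ISO-8601 date strings compare correctly as plain strings.

```python
def overlaps(a_min, a_max, b_min, b_max):
    """True if intervals [a_min, a_max] and [b_min, b_max] intersect."""
    return a_min <= b_max and b_min <= a_max

def matches(query, meta):
    """A dataset qualifies if its bounding box and time range overlap the
    request and it provides all requested variables."""
    return (overlaps(query["lat_min"], query["lat_max"],
                     meta["lat_min"], meta["lat_max"])
            and overlaps(query["lon_min"], query["lon_max"],
                         meta["lon_min"], meta["lon_max"])
            and overlaps(query["start"], query["end"],
                         meta["start"], meta["end"])
            and set(query["variables"]) <= set(meta["variables"]))

# "All temperature measurements for (30W to 32W, 80N to 82N), January 2016"
query = {"lat_min": 80, "lat_max": 82, "lon_min": -32, "lon_max": -30,
         "start": "2016-01-01", "end": "2016-01-31",
         "variables": ["temperature"]}
arctic = {"lat_min": 75, "lat_max": 85, "lon_min": -40, "lon_max": -20,
          "start": "2015-01-01", "end": "2016-12-31",
          "variables": ["temperature", "precipitation"]}
tropics = dict(arctic, lat_min=0, lat_max=10)
```

Datasets passing this filter are then joined and interpolated as described in the following paragraphs.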
Fig. 2. F1 on matrix M to format a vector as a raster. Black values are original, red are created on the first sweep, and orange on the second.

The multiple possible encodings for climate data, most notably vector and raster, create challenges when attempting to integrate multiple datasets into one. A vector is a data
structure that represents many observations from a single
point, but at different times (e.g., precipitation levels mea-
sured at a fixed weather station). A raster can be represented
by a two dimensional grid, in which each cell is a certain
area identifiable on a map. Each cell contains the value of
some variable: for example, the percentage of pollen in the
air. Thus, to enable users to retrieve integrated datasets we
require a method for integrating these two formats for cross-
format data analysis: an integration that may involve a sparse
set of vectors and a large raster database. (For example,
180,000 weather stations record precipitation in the U.S.,
each with a fixed latitude and longitude, while a complete
radar mapping of the U.S. results in over 760,000 5 km2
raster cells [9].)
We implement this integration via an interpolation from point values to a scalar field (a raster). We use a series of sweeping focal operations over a raster M, where each cell in M represents a region bounded by latitudinal and longitudinal boundaries. A focal operation is defined as an operation on a cell with regard to a small neighborhood around that cell [10]. Our implementation of this algorithm begins with a focal neighborhood of 1, i.e., the eight diagonal or adjacent cells of a selected empty cell. If at least two of these neighbors hold values, the new cell becomes the non-weighted average of all value-bearing cells in the region F1. The center of F1 is moved from cell to cell until either all cells are full or every remaining F1 contains fewer than two value-bearing cells. The algorithm then adds one more ring of neighbors (i.e., neighbors of neighbors), which we call F2, and so on through Fn, where Fn results in a complete matrix.
Figure 2 illustrates this process, where M1 is the original sparse matrix and M2 and M3 are the results of the second and third sweeps. To capture the data’s user-defined, post-processing integrity, we record in Klimatic’s output header the number of sweeps needed to make the vector compatible with rasters; a higher number of sweeps implies less ‘pure’ data. Our interface will also prompt users with information regarding the data’s post-processing integrity, as well as related data that could be selected to increase this integrity.
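The sweep procedure can be sketched in pure Python as follows; this is a minimal reading of the algorithm above, with `None` marking empty cells, and the guard that stops widening once the neighborhood covers the whole grid is our own addition.

```python
def interpolate_sweeps(grid):
    """Fill empty (None) cells of a sparse raster by repeated focal
    averaging. The neighborhood radius starts at 1 (the eight adjacent
    cells, F1) and widens to F2, F3, ... whenever a sweep makes no
    progress, until the matrix is complete."""
    rows, cols = len(grid), len(grid[0])
    g = [row[:] for row in grid]
    radius, sweeps = 1, 0
    while any(v is None for row in g for v in row):
        updates = {}
        for i in range(rows):
            for j in range(cols):
                if g[i][j] is not None:
                    continue
                # value-bearing cells in the focal window around (i, j)
                vals = [g[x][y]
                        for x in range(max(0, i - radius),
                                       min(rows, i + radius + 1))
                        for y in range(max(0, j - radius),
                                       min(cols, j + radius + 1))
                        if g[x][y] is not None]
                if len(vals) >= 2:          # need at least two neighbors
                    updates[(i, j)] = sum(vals) / len(vals)
        if updates:                         # apply one sweep simultaneously
            for (i, j), v in updates.items():
                g[i][j] = v
            sweeps += 1
        elif radius < max(rows, cols):      # stuck: widen the neighborhood
            radius += 1
        else:                               # fewer than two values anywhere
            break
    return g, sweeps

grid = [[1.0, None, None],
        [None, None, None],
        [None, None, 3.0]]
filled, sweeps = interpolate_sweeps(grid)
```

On this 3x3 example the center fills first (the mean of the two originals) and two further sweeps complete the matrix; the sweep count is what Klimatic records in the output header as an integrity indicator.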
Fig. 3. Workflow for Klimatic’s data integration and distribution.

Fig. 4. Distribution of Klimatic’s total datasets by provider type.

Klimatic currently supports the creation of integrated NetCDF and CSV files. NetCDF conventions simplify the
creation of an integrated NetCDF dataset. NetCDF files can
be conceptualized as containers for multiple variables, in which matching indices across the containers refer to the same data point: index 0 in each container refers to the first data point, index 1 to the second, and so on.
If a query response requires integration of both vector
and raster data, Klimatic currently uses the grid dictated by
the raster. Each vector always lies within a raster cell, so
each cell containing one vector becomes the value of the
vector at a given time. If multiple vectors fall within the
same raster cell, we currently choose to average their values.
(Here and elsewhere, we apply one data conversion strategy
automatically in our prototype. Ultimately, we will want to
allow the user to control such actions.) Once a standardized
grid is achieved, the addition of a variable only requires
the addition of another variable container, as long as the
spatial and temporal bounds align. If the resolutions and
time-bounds are different (e.g., if one dataset is measured
in months and the other in years), we aggregate to the larger
period (i.e., years). Future work could involve imputing
values for missing areas and time periods, but this will
require statistical distribution analysis.
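The aggregation to the larger period can be sketched as follows; this is a minimal example for the months-to-years case, and the representation of the series as (month, value) pairs is our own.

```python
from collections import defaultdict

def aggregate_to_years(monthly):
    """monthly: iterable of ("YYYY-MM", value) pairs. Returns the
    non-weighted yearly mean for each year, i.e., the coarser of the
    two temporal resolutions being merged."""
    buckets = defaultdict(list)
    for month, value in monthly:
        buckets[month[:4]].append(value)   # group by the year prefix
    return {year: sum(vals) / len(vals) for year, vals in buckets.items()}

yearly = aggregate_to_years([("2015-01", 2.0), ("2015-02", 4.0),
                             ("2016-01", 1.0)])
```

Once both datasets share the yearly axis, each variable can be added as another container with aligned indices.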
IV. DATA COLLECTED

We aim in Klimatic to include geospatial data that span
all areas and many variables and years, originating from
both large repositories (e.g., UCAR and NOAA) and smaller
private research, educational, and industrial sources. The
importance of considering smaller sources is shown by
the fact that only 19.5% of Klimatic’s data are known to
originate from large sources, as shown in Figure 4. (We
classify providers based on information obtained from the crawled data locations; 28.8% of providers did not supply this information at the time of the tests.)

Fig. 5. Time (minutes) to find, extract and store metadata from, and add to the virtual warehouse, 750 100 MB files, via Globus and HTTP.
Klimatic has so far extracted metadata and constructed a
searchable index for 10,002 datasets (11.5 TB). The area
covered by Klimatic’s collected data is expansive, with the
least-covered regions of the world (e.g., South America, Aus-
tralia, Antarctica) having 1,250–3,350 datasets each and the
most-covered areas (e.g., North America, Europe, Australia,
and Asia) having 8,900–9,500 datasets apiece. The datasets
vary in resolution, from coarse 100 km x 100 km cells to
fine 50 m x 100 m cells. To increase uniformity of coverage
across regions of the globe, Klimatic could prioritize data in
the less-covered areas.
From the perspective of computational efficiency, Klimatic
performs well on dataset ingestion. To test data ingest perfor-
mance, we evaluated the system on 750 randomly selected
datasets averaging 100 MB each (for a total of 75 GB)
stored in remote Globus and web sources, using 1, 2, and 4
crawler instances. As shown in Figure 5, the Globus scraper
outperforms the web scraper due to the overhead inherent
in the web scraper, as it concurrently traverses all links on
a page to find—and explore—applicable paths to datasets
before moving on to the next source.
V. RELATED WORK

Related work encompasses areas such as data lakes and
other integrated data approaches, geospatial data distribution,
and metadata extraction. Klimatic expands upon prior work
by addressing the challenges of collecting, indexing, and
distributing geospatial data from diverse sources via a system
based on data lake concepts [2]. We build upon others’ efforts
to encapsulate the steps necessary to bring raw geospatial
data from its source to a user, including its acquisition,
processing, and distribution [11].
The motivation for our work aligns with other efforts to
scrape scientific data and extract metadata. For example,
similar approaches have been applied to collect scientific
information from papers, including our own work on extract-
ing polymer properties from journal publications [12]. Others
have used metadata extraction and indexing for business and
industrial purposes, as there are noticeable increases in I/O
performance and decreases in required human effort [13].
In biomedicine, data commons are proposed for integrating
genomics data [14]. In the geosciences there is growing
emphasis on making data broadly accessible, as in the
Earth Grid System Federation [15], which links climate
simulation data archives worldwide; NCAR’s Research Data
Archive [16], which provides access to NCAR data; and
DataOne [17], an online service that indexes a large number
of geospatial datasets housed in various repositories. Our
approach is differentiated as Klimatic aims to scrape arbitrary
distributed data, rather than only those housed in repositories.
Other applications such as ESRI’s ArcGIS [18] and Cad-
corp’s SIS [19] allow for metadata collection and dataset
integration, but require significant human input. Users are also limited to the data stored in those systems and their own unindexed data. By removing the human element from
metadata extraction, Klimatic ensures that a source’s original
metadata are cited correctly, leaving no room for human
error [20], [21]. Klimatic follows the standard UK Gemini
metadata storage convention [8], but given the broad scope
of the data that Klimatic processes, some metadata are often
justifiably missing, as when vector data, which correspond
to a single point, lack bounding coordinates.
VI. CONCLUSION AND FUTURE WORK

Klimatic effectively provides an accessible architecture for
the collection and dissemination of large, distributed geospa-
tial data. It is able to automatically crawl huge amounts of
data distributed across various storage systems and accessible
via HTTP and Globus.
With continued work to add additional datasets to Klimatic
and make the indexed data more broadly accessible to
applications, we hope that Klimatic will become a great asset
to the many communities that use geospatial data. In addition
to seeking more data for Klimatic, several software improvements are possible in the short term. First, we can implement smarter metadata extraction. Additions to the current
Klimatic process could include comparing sources that con-
tain conflicting data, and using the geographic distributions
to determine which data better fit a physical phenomenon.
For example, we find that latitude and longitude are often encoded inconsistently, particularly in CSV files: the value 154.3 is used in some files to mean 154.3 decimal degrees (154 degrees plus 0.3 of a degree) and in other files to mean 154 degrees and 3 minutes. Klimatic could look at additional dataset elements (e.g., city names, if available) to determine the intended interpretation of degrees versus minutes, and convert values to the standard convention (degrees, minutes, and seconds).
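A sketch of the normalization step, assuming the intended convention has already been inferred; the function names are illustrative, and negative coordinates would need an explicit sign convention.

```python
def decimal_to_dms(value):
    """Split non-negative decimal degrees into (degrees, minutes, seconds)."""
    degrees = int(value)
    remainder_minutes = (value - degrees) * 60
    minutes = int(remainder_minutes)
    seconds = round((remainder_minutes - minutes) * 60, 6)
    return degrees, minutes, seconds

def dm_to_decimal(degrees, minutes):
    """Interpret a pair read as 'degrees and minutes' (e.g., 154 degrees
    and 3 minutes) as decimal degrees."""
    return degrees + minutes / 60

as_decimal = decimal_to_dms(154.3)  # 154.3 read as decimal degrees
as_dm = dm_to_decimal(154, 3)       # 154.3 read as 154 degrees 3 minutes
```

The two readings differ by more than a quarter of a degree (154.3 versus 154.05 decimal degrees), which is why disambiguating the convention matters before integration.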
Additionally, the user experience can be improved through
an enhanced interface and better caching strategies. UI
enhancements could include allowing users to choose data
from a map, providing better areas to include for a research
study by analyzing the underlying statistics of a dataset
(i.e., “These adjacent datasets share correlations in [selected
variable]”), or allowing users to trace an outline of their
desired data area on a map to receive a specialized dataset in return, perhaps as a shapefile: an area bounded by a connect-the-dots convex hull commonly used in geographic analysis. Shapefiles are helpful for analyzing non-rectangular neighborhoods or odd-shaped natural features,
including lakes and mountains. Furthermore, we plan to
provide support for other, less popular file types to fully
encompass the geospatial data domain. To speed processing across datasets, the data lake’s caching algorithm can better learn which files to hold on local disk so as to minimize the time required for the average user’s queries. We also plan to implement a periodic checker
to search each indexed dataset’s origin for updates.
Other future work will focus on developing collaborative
applications and expanding functionality. Klimatic is built to
support external applications that may access data via APIs.
We will collaborate with diverse disciplines to develop plug-
ins to our system that notify a person or a decision-system
when some threshold is reached, which allows that person
or system to react to changes in data in a timely manner.
ACKNOWLEDGMENTS

We thank Raffaele Montella for discussions on these topics, as well as the Computation Institute at The University of Chicago for providing the resources necessary to undertake this work. This work was supported in part by DOE contract DE-AC02-06CH11357 and by NSF Decision Making Under Uncertainty program award 0951576.
REFERENCES

[1] P. B. Heidorn, “Shedding light on the dark data in the long tail of science,” Library Trends, vol. 57, no. 2, pp. 280–299, 2008.
[2] I. Terrizzano, P. M. Schwarz, M. Roth, and J. E. Colino, “Data
wrangling: The challenging journey from the wild to the lake.” in
Conf. on Innovative Data Systems Research, 2015.
[3] D. Merkel, “Docker: Lightweight Linux containers for consistent development and deployment,” Linux Journal, no. 239, p. 2, 2014.
[4] R. Elmasri and S. Navathe, Fundamentals of Database Systems, 2nd
Edition. Addison-Wesley, 1994.
[5] K. A. Borges, A. H. Laender, and C. A. Davis Jr, “Spatial data
integrity constraints in object oriented geographic data modeling,”
in 7th ACM International Symposium on Advances in Geographic
Information Systems. ACM, 1999, pp. 1–6.
[6] S. Cockcroft, “A taxonomy of spatial data integrity constraints,” GeoInformatica, vol. 1, no. 4, pp. 327–343, 1997.
[7] “netCDF: Network common data form,”
software/netcdf/. Visited August 16, 2016.
[8] Association for Geographic Information, “UK GEMINI v.2.2 specification for discovery metadata for geospatial data resources,” vol. 2.2, pp. 1–61.
[9] “Weather Underground,”
Visited August 25, 2016.
[10] S. Shekhar and S. Chawla, “Spatial databases: A tour,” 2003.
[11] D. J. Maguire and P. A. Longley, “The emergence of geoportals and their role in spatial data infrastructures,” Computers, Environment and Urban Systems, vol. 29, no. 1, pp. 3–14, 2005.
[12] R. Tchoua, K. Chard, D. Audus, J. Qin, J. de Pablo, and I. Foster, “A hybrid human-computer approach to the extraction of scientific facts from the literature,” Procedia Computer Science, vol. 80, pp. 386–397, 2016.
[13] M. Chisholm, How to build a business rules engine: Extending
application functionality through metadata engineering. Morgan
Kaufmann, 2004.
[14] R. L. Grossman, A. Heath, M. Murphy, M. Patterson, and W. Wells, “A
case for data commons: Toward data science as a service,” Computing
in Science & Engineering, vol. 18, no. 5, pp. 10–20, 2016.
[15] D. Bernholdt, S. Bharathi, D. Brown, K. Chanchio, M. Chen, A. Cher-
venak, L. Cinquini, B. Drach, I. Foster, P. Fox et al., “The Earth
System Grid: Supporting the next generation of climate modeling
research,” Proceedings of the IEEE, vol. 93, no. 3, pp. 485–495, 2005.
[16] “CISL Research Data Archive.” Accessed September 1, 2016.
[17] M. B. Strasser Cook, “DataOne,” DataOne Best Practices Primer, pp.
1–11, 2014.
[18] “ESRI ArcGIS,” Visited August
16, 2016.
[19] “Cadcorp SIS,” Visited August 16, 2016.
[20] J. K. Batcheller, B. M. Gittings, and S. Dowers, “The performance of vector oriented data storage strategies in ESRI’s ArcGIS,” Transactions in GIS, vol. 11, no. 1, pp. 47–65, 2007.
[21] N. Kussul, A. Shelestov, M. Korbakov, O. Kravchenko, S. Skakun,
M. Ilin, A. Rudakova, and V. Pasechnik, “XML and grid-based ap-
proach for metadata extraction and geospatial data processing,” in 5th
International Conference on Information Research and Applications,
2007, pp. 1–7.
... In practice, developers are moving the building of their applications from monolithic to loosely coupled solutions to achieve more efficient and easily maintainable applications. For instance, currently there are solutions available that provide processing structures such as pipelines, workflows, and processing patterns to integrate sets of applications into a single solution (Babuji et al., 2019;Montella et al., 2018a;Ferguson, 2011;Taylor et al., 2007;Montella et al., 2015;Skluzacek et al., 2016). In this type of solutions, the outputs of some applications represent the inputs of other, which creates software patterns producing continuous delivery of data/metadata from a data source (e.g. a folder or ...
... This type of solution is only focused on the improvement of the application deployment for avoiding the troubleshooting issues in real-world scenarios. However, in scientific environments, workflows are also required to interconnect different applications for processing models about environment, climate (Skluzacek et al., 2016), etc. ...
This paper presents the design, development, and implementation of Kulla, a virtual container-centric construction model that mixes loosely coupled structures with a parallel programming model for building infrastructure-agnostic distributed and parallel applications. In Kulla, applications, dependencies and environment settings, are mapped with construction units called Kulla-Blocks. A parallel programming model enables developers to couple those interoperable structures for creating constructive structures named Kulla-Bricks. In these structures, continuous dataflow and parallel patterns can be created without modifying the code of applications. Methods such as Divide&Containerize (data parallelism), Pipe&Blocks (streaming), and Manager/Block (task parallelism) were developed to create Kulla-Bricks. Recursive combinations of Kulla instances can be grouped in deployment structures called Kulla-Boxes, which are encapsulated into VCs to create infrastructure-agnostic parallel and/or distributed applications. Deployment strategies were created for Kulla-Boxes to improve the IT resource profitability. To show the feasibility and flexibility of this model, solutions combining real-world applications were implemented by using Kulla instances to compose parallel and/or distributed system deployed on different IT infrastructures. An experimental evaluation based on use cases solving satellite and medical image processing problems revealed the efficiency of Kulla model in comparison with some traditional state-of-the-art solutions.
... Klimatic (Skluzacek et al., 2016) integrates over 10,000 different geo-spatial data sets from numerous online repositories. It accesses these data sets via HTTP or Globus GridFTP. ...
Data lakes are a fundamental building block of many industrial data analysis solutions and are becoming increasingly popular in research. Often associated with big data use cases, data lakes are, for example, used as central data management systems of research institutions or as the core entity of machine learning pipelines. The basic underlying idea of retaining data in its native format within a data lake facilitates a large range of use cases and improves data reusability, especially compared to the schema-on-write approach applied in data warehouses, where data is transformed prior to storage to fit a predefined schema. Storing such massive amounts of raw data, however, has its very own challenges, spanning from general data modeling and indexing for concise querying to the integration of suitable and scalable compute capabilities. In this contribution, influential papers of the last decade have been selected to provide a comprehensive overview of developments and obtained results. The papers are analyzed with regard to the applicability of their input to data lakes that serve as central data management systems of research institutions. To achieve this, contributions to data lake architectures, metadata models, data provenance, workflow support, and FAIR principles are investigated. Last but not least, these capabilities are mapped onto the requirements of two common research personae to identify open challenges. With that, potential research topics are determined that must be tackled to make data lakes applicable as central building blocks for research data management.
... A data lake is seen as an evolution of existing data architectures (e.g., the data warehouse) [17]. It gathers data from various private or public data islands; holds all ingested data, whether structured, semi-structured, or unstructured, in its raw format; and provides a unified interface for query processing and data exploration, thus enabling on-demand processing to meet the requirements of various applications [18][19][20][21]. We first propose a multi-threaded parallel data ingestion method that adopts a thread pool to efficiently ingest IoT data from distributed repositories accessed via web services. ...
Multi-source Internet of Things (IoT) data, archived in institutions’ repositories, are increasingly being open-sourced to make them publicly accessible to scientists, developers, and decision makers via web services, promoting research on geohazard prevention. In this paper, we design and implement a big-data-turbocharged system for effective IoT data management following the data lake architecture. We first propose a multi-threaded parallel data ingestion method to ingest IoT data from institutions’ data repositories in parallel. Next, we design storage strategies for both ingested and processed IoT data to store them in a scalable, reliable storage environment. We also build a distributed cache layer to enable fast access to IoT data. Then, we provide users with a unified, SQL-based interactive environment that enables IoT data exploration by leveraging the processing ability of Apache Spark. In addition, we design a standards-based metadata model to describe ingested IoT data and thus support IoT dataset discovery. Finally, we implement a prototype system and conduct experiments on real IoT data repositories to evaluate the efficiency of the proposed system.
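The thread-pool ingestion pattern this abstract describes can be sketched with Python's standard library. This is a hedged illustration, not the paper's implementation: `fetch_repository` stands in for a real web-service call, and the repository names are invented.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_repository(repo_id):
    # Placeholder for an HTTP request to one institutional repository;
    # here it just fabricates three records per repository.
    return [{"repo": repo_id, "seq": i} for i in range(3)]

def ingest_parallel(repo_ids, max_workers=4):
    # The thread pool overlaps the (normally I/O-bound) fetches;
    # pool.map preserves the input order of repositories.
    ingested = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for batch in pool.map(fetch_repository, repo_ids):
            ingested.extend(batch)
    return ingested

records = ingest_parallel(["repo-a", "repo-b", "repo-c"])
```

Threads (rather than processes) are the natural choice here because web-service ingestion spends most of its time waiting on the network, where the GIL is released.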
... In recent years, several ambitious projects have based their solutions on workflows [18][19][20]. All such solutions have faced one common challenge: simplifying the interaction of domain scientists with computational resources [21]. ...
Workflow engines are commonly used to orchestrate large-scale scientific computations such as, but not limited to, weather, climate, natural disasters, food safety, and territorial management. However, implementing, managing, and executing real-world scientific applications as workflows on multiple infrastructures (servers, clusters, clouds) remains a challenge. In this paper, we present DagOnStar (Directed Acyclic Graph On Anything), a lightweight Python library implementing a workflow paradigm based on parallel patterns that can be executed on any combination of local machines, on-premises high-performance computing clusters, containers, and cloud-based virtual infrastructures. DagOnStar is designed to minimize data movement and so reduce the application storage footprint. A case study based on a real-world application illustrates the use of this novel workflow engine: a containerized weather data collection application deployed on multiple infrastructures. An experimental comparison with other state-of-the-art workflow engines shows that DagOnStar can run workflows on multiple types of infrastructure with a 50.19% improvement in run time when using a parallel pattern with eight task-level workers.
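The core idea behind DAG-based engines like DagOnStar (tasks run once all of their dependencies have produced results) can be sketched with a minimal executor. This is generic illustration code, not the DagOnStar API; the task names and dependency map are invented for the example.

```python
def run_dag(tasks, deps):
    # tasks: name -> callable taking a dict of its dependencies' results
    # deps:  name -> list of dependency names
    results, pending = {}, set(tasks)
    while pending:
        # A task is ready once every dependency has a result.
        ready = [t for t in pending if all(d in results for d in deps[t])]
        if not ready:
            raise ValueError("cycle detected in workflow graph")
        for t in ready:
            results[t] = tasks[t]({d: results[d] for d in deps[t]})
            pending.remove(t)
    return results

tasks = {
    "collect": lambda _: [3, 1, 2],
    "sort": lambda r: sorted(r["collect"]),
    "report": lambda r: f"min={r['sort'][0]}",
}
deps = {"collect": [], "sort": ["collect"], "report": ["sort"]}
out = run_dag(tasks, deps)
```

A real engine would additionally dispatch each `ready` batch in parallel and to different infrastructures; the scheduling skeleton, however, is exactly this topological sweep.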
... Such massive data transfers can be reduced by introducing a data extraction mechanism inside the collaboration file system; Klimatic, VSFS, and TagIt were developed with this same motivation [119,147,115]. Second, it is possible to execute applications directly on remote sites in a collaboration. ...
High-performance computing (HPC) storage systems are one of the critical components of computational, experimental, and observational science today. The ability to selectively access desired information from large volumes of data at very high speed and with minimum overhead is critical to scientific applications. Therefore, several efforts have been made to integrate scientific search and discovery services into HPC storage systems. However, due to the variety of HPC storage architectures, such as federated geo-distributed HPC data centers, distributed and parallel file systems, and high-speed persistent-memory-based storage pools, it is non-trivial to apply a single solution to the multiple storage architectures of the HPC paradigm. Some of the main challenges include minimal performance degradation, effective metadata management, data sharing controls and policies, and awareness of the underlying storage. Therefore, accelerating scientific search and discovery services while addressing the aforementioned challenges is crucial, especially for the upcoming era of exascale storage architectures. This dissertation is focused on solving the above challenges and building a scientific search and discovery service framework targeting each storage layer to accelerate HPC and scientific computing. In the first part of the dissertation (Chapter 3), we build a scientific-collaboration-friendly storage model for a wide-area storage network, i.e., geo-distributed HPC data centers, so that applications and scientists can benefit from discovery services without losing performance, using our proposed multi-mode metadata indexing approach. In the second part of the dissertation (Chapter 4), we present our solution to enable search services for scientists and applications running directly on scalable and distributed file systems by tightly integrating data management into the file system.
Chapter 5 presents a memory-object-based scientific discovery service to fully utilize emerging non-volatile memory pools. This dissertation shows that the proposed scientific metadata search and discovery services inside storage layers highly complement HPC and scientific computing architectures.
... In order to better organize, discover, and act upon distributed big data, we first require automated methods to crawl file systems and extract metadata for each file therein. While others have developed end-to-end automated metadata extraction systems, they require that data be moved to a central service [7][8][9][11] or lack built-in ...
The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally unsearchable, two problems not currently well addressed by distributed file systems. In this work, we propose Xtract, a serverless middleware to extract metadata from files spread across heterogeneous edge computing resources. In future work, we intend to study how Xtract can automatically construct file extraction workflows subject to users' cost, time, security, and compute allocation constraints. To this end, Xtract will enable the creation of a searchable centralized index across distributed data collections.
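The crawl-and-extract step underlying such systems can be sketched with a directory walk that records per-file metadata. This is a deliberately simple illustration, assuming only basic filesystem attributes; systems like Xtract apply much richer, type-specific extractors.

```python
import os
import tempfile

def crawl_metadata(root):
    # Walk the tree and build a path -> metadata index from file attributes.
    index = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.stat(path)
            index[path] = {"size": st.st_size,
                           "ext": os.path.splitext(name)[1]}
    return index

# Build a tiny throwaway tree to demonstrate the crawl.
root = tempfile.mkdtemp()
with open(os.path.join(root, "a.csv"), "wb") as f:
    f.write(b"x,y\n1,2\n")
index = crawl_metadata(root)
```

A searchable index like this is the precondition for the cross-repository discovery that both Klimatic and Xtract aim at; the interesting engineering lies in running the extraction near the data rather than centrally.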
During the world’s challenge to confront the rapidly spreading coronavirus disease (COVID-19) pandemic and the consequent heavy losses and disruption to society, returning to normal life has become a demand. Social distancing, also known as physical distancing, plays a pivotal role in this scenario. Social distancing is a practice to maintain a safe space between a person and others who are not from the same household, preventing the spread of contagious viral diseases. To support this case, several public authorities and governments around the world have proposed social distancing applications (also known as contact-tracing apps). However, the adoption of these applications is arguable because of concerns regarding privacy and user data protection. In this study, we present a comprehensive survey of privacy-preserving techniques for social distancing applications. We provide an extensive background on social distancing applications, including measuring the physical distance between people. We also discuss various privacy-preserving techniques that are used by social distancing applications; specifically, we thoroughly analyze and compare these applications, considering multiple features. Finally, we provide insights and recommendations for designing social distancing applications while reducing the burden of privacy problems.
Big data is usually processed in a decentralized computational environment, with a number of distributed storage systems and processing facilities enabling both online and offline data analysis. In such a context, data access is fundamental to enhancing processing efficiency as well as the user experience when inspecting the data, and caching is a solution widely adopted in many diverse domains. Here, the optimization of cache management plays a central role in sustaining the growing demand for data. In this article, we propose an autonomous approach based on a reinforcement learning technique to implement an agent that manages file storing decisions. Moreover, we test the proposed method in a real context using information on the data analysis workflows of the CMS experiment at CERN.
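A toy version of a learning cache agent helps make the idea concrete: the agent keeps a running value estimate per file, updated from hit/miss rewards, and only stores files whose estimate clears a threshold. The CMS work uses a far richer state and learning algorithm; every name and constant below is illustrative.

```python
class CacheAgent:
    def __init__(self, alpha=0.5, threshold=0.4):
        self.values = {}    # learned per-file value estimates
        self.cache = set()  # files the agent has decided to store
        self.alpha, self.threshold = alpha, threshold

    def request(self, name):
        hit = name in self.cache
        reward = 1.0 if hit else 0.0
        # Temporal-difference-style update toward the observed reward,
        # plus a small bonus (0.5) for having been requested at all.
        old = self.values.get(name, 0.0)
        self.values[name] = old + self.alpha * (reward + 0.5 - old)
        if self.values[name] >= self.threshold:
            self.cache.add(name)  # the agent's file-storing decision
        return hit

agent = CacheAgent()
hits = [agent.request("f1") for _ in range(3)]  # repeated requests raise f1's value
```

After three misses the estimate for `f1` crosses the threshold and the agent admits it, so subsequent requests hit; a real agent would learn the update rule itself rather than use a fixed heuristic.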
Agriculture provides food, raw materials, and employment opportunities for a significant percentage of the world's population. Climate, economic, political, social, and other conditions affect decision making in agricultural processes. In many cases, these conditions mean that many areas are no longer suitable for some traditional crops. In contrast, these areas can produce new crops by taking advantage of the changing conditions. In this sense, having reliable tools and information for decision making is essential for adapting to new agricultural productivity scenarios, which in turn implies having sufficient and relevant data sources to reduce uncertainty in decision-making processes. However, data by nature tend to be diverse in structure, storage format, and access protocol. Data fusion has been applied in a multitude of applications and approached from different points of view when implementing a suitable solution. We propose a multi-domain data fusion strategy to support data analysis tasks in agricultural contexts. We also describe all the data sources collected, which are the main input to the proposed strategy. The combined data sources were also evaluated through a preliminary exploratory analysis in a multi-label learning approach. Finally, the data fusion strategy is explained through an example in agricultural crop production.
An important activity in the design of a particular database application consists in identifying the integrity constraints that must hold on the database, and that are used to detect and evaluate inconsistencies. It is possible to improve data quality by imposing constraints upon data entered into the database. These constraints must be identified and recorded at the database design level. However, it is clear that modeling geographic data requires models which are more specific and capable of capturing the semantics of geographic data. Within a geographic context, topological relations and other spatial relationships are fundamentally important in the definition of spatial integrity rules. This paper discusses the relationship that exists between the nature of spatial information, spatial relationships, and spatial integrity constraints, and proposes the use of OMT-G, an extension of the OMT model for geographic applications, at an early stage in the specification of integrity constraints in spatial databases. OMT-G provides adequate primitives for representing spatial data, supports spatial relationships, and allows topological, semantic, and user integrity rules to be specified in the database schema.
This is the only book that demonstrates how to develop a business rules engine. It covers user requirements, data modeling, metadata, and more. A sample application is used throughout the book to illustrate concepts, and its code is available online. The book includes conceptual overview chapters suitable for management-level readers, covering a general introduction, business justification, development and implementation considerations, and more.
A wealth of valuable data is locked within the millions of research articles published each year. Reading and extracting pertinent information from those articles has become an unmanageable task for scientists. This problem hinders scientific progress by making it hard to build on results buried in literature. Moreover, these data are loosely structured, encoded in manuscripts of various formats, embedded in different content types, and are, in general, not machine accessible. We present a hybrid human-computer solution for semi-automatically extracting scientific facts from literature. This solution combines an automated discovery, download, and extraction phase with a semi-expert crowd assembled from students to extract specific scientific facts. To evaluate our approach we apply it to a particularly challenging molecular engineering scenario, extraction of a polymer property: the Flory-Huggins interaction parameter. We demonstrate useful contributions to a comprehensive database of polymer properties.
As the amount of scientific data continues to grow at ever faster rates, the research community is increasingly in need of flexible computational infrastructure that can support the entirety of the data science lifecycle, including long-term data storage, data exploration and discovery services, and compute capabilities to support data analysis and re-analysis as new data are added and as scientific pipelines are refined. We describe our experience developing data commons, interoperable infrastructure that co-locates data, storage, and compute with common analysis tools, and present several case studies. Across these case studies, several common requirements emerge, including the need for persistent digital identifier and metadata services, APIs, data portability, pay-for-compute capabilities, and data peering agreements between data commons. Though many challenges remain, including sustainability and developing appropriate standards, interoperable data commons bring us one step closer to effective Data Science as a Service for the scientific research community.
Docker promises the ability to package applications and their dependencies into lightweight containers that move easily between different distros, start up quickly and are isolated from each other.
Geoportals are World Wide Web gateways that organize content and services such as directories, search tools, community information, support resources, data, and applications. This paper traces the emergence of geoportals, outlining the significance of developments in enterprise GIS and national spatial data infrastructures (SDIs), with particular reference to the US experience. Our objectives are principally pedagogic: to relate the development of geoportals to SDI initiatives and to review recent technological breakthroughs, specifically the development of direct-access facilities for application services and metadata records, and the facility to use services directly from conventional desktop GIS applications. We also discuss the contributions that geoportals and SDIs have made to simplifying access to GI, and their contribution to diffusing GI concepts, databases, techniques, and models. Finally, the role of geoportals in electronic government (e-Government) is considered.
The emergence of Geographical Information Systems (GIS) as an important tool in the analysis of spatial phenomena has been mirrored by the evolution of the data models underpinning such systems. When considering vector-based solutions, such developments have seen a migration from single-user, file-based, topological hybrid models to multi-user database management system (DBMS)-based integrated formats, often with no inherent topology. With all these solutions still readily available, the decision of which to employ for a given application is a complicated one. This study analyses the performance of a number of vector data storage formats for use with ESRI's ArcGIS, with the aim of facilitating the 'intelligent selection' of an appropriate solution. Such a solution will depend upon the application domain, and both single-user and multi-user (corporate) scenarios are considered. Findings indicate that single-user ESRI coverages and multi-user ArcSDE/Oracle strategies perform better when considering the range of GIS operations used to evaluate data store performance.