Dredging a Data Lake: Decentralized Metadata Extraction
Tyler J. Skluzacek
University of Chicago
skluzacek@uchicago.edu
ABSTRACT
The rapid generation of data from distributed IoT devices, scientific instruments, and compute clusters presents unique data management challenges. The influx of large, heterogeneous, and complex data causes repositories to become siloed or generally unsearchable—both problems not currently well-addressed by distributed file systems. In this work, we propose Xtract, a serverless middleware to extract metadata from files spread across heterogeneous edge computing resources. In future work, we intend to study how Xtract can automatically construct file extraction workflows subject to users' cost, time, security, and compute allocation constraints. To this end, Xtract will enable the creation of a searchable centralized index across distributed data collections.
CCS CONCEPTS
• Information systems → Computing platforms; Search engine indexing; • Applied computing → Document metadata.
KEYWORDS
data lakes, serverless, metadata extraction, file systems
ACM Reference Format:
Tyler J. Skluzacek. 2019. Dredging a Data Lake: Decentralized Metadata
Extraction. In Middleware ’19: 20th International Middleware Conference
Doctoral Symposium (Middleware ’19), December 9–13, 2019, Davis, CA, USA.
ACM, New York, NY, USA, 3 pages. https://doi.org/10.1145/3366624.3368170
1 INTRODUCTION
The rapid generation of data from IoT devices, scientific instruments, simulations, and myriad other sources presents unique data management challenges. Currently data are stored across multiple machines, are often siloed, and require significant manual labor to create metadata that promote usability and searchability. Some [4, 5] have created data catalogs from user-submitted metadata. These, however, do not scale to current and future storage systems, as humans cannot possibly label billions of heterogeneous files. To better organize, discover, and act upon distributed big data, we first require automated methods to crawl file systems and extract metadata for each file therein. While others have developed end-to-end automated metadata extraction systems, they either require that data be moved to a central service [7–9, 11] or lack built-in scaling capabilities [6]. In this work we strive to create a flexible, scalable, and decentralized metadata extraction system that can be employed centrally or at the edge.
We present our prototype and vision for Xtract, a decentralized middleware that provides high-throughput and on-demand metadata extraction, enabling the automated creation of rich, searchable data lakes from unsearchable data silos. We leverage a function-as-a-service (FaaS) model for managing the invocation of many short-running extractors on an arbitrarily large number of files. The current Xtract implementation uses the funcX serverless supercomputing platform [3] to execute functions across diverse and distributed computing infrastructure. The advantage of using funcX is that it allows us to explore a novel distributed FaaS model that overcomes the need to move large amounts of data to the cloud. Instead, Xtract is able to push metadata extractors to the edge systems on which the scientific data reside. We envision that Xtract could also use other edge computing fabrics, such as Amazon Web Services IoT Greengrass or Google Cloud IoT. The primary contributions of Xtract are:
• Scalability across distributed computing resources, including laptops, clusters, and edge devices.
• A flexible extraction model that can be deployed centrally or at the edge, facilitating decentralized metadata extraction.
• Dynamic construction of customized extractor pipelines for diverse file types.
• Intelligent data staging decisions based on user-supplied constraints, including time, cost, compute allocation availability, and security (a toy sketch of such a decision appears below).
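To make the last contribution concrete, the following is a minimal, hypothetical sketch of how a staging decision might be derived from user-supplied constraints. The cost model, thresholds, and field names are invented for illustration and are not Xtract's actual policy.

```python
from dataclasses import dataclass

@dataclass
class Constraints:
    max_transfer_seconds: float   # user's time budget for moving data
    allow_data_movement: bool     # e.g., False for data with residency or security requirements
    has_local_allocation: bool    # whether compute is available at the data's site

def choose_placement(file_size_bytes: int, wan_bandwidth_bps: float, c: Constraints) -> str:
    """Illustrative policy: run the extractor at the edge when possible,
    otherwise stage the file to the central Xtract service."""
    if c.has_local_allocation:
        return "push extractor to edge"
    transfer_time = file_size_bytes * 8 / wan_bandwidth_bps
    if c.allow_data_movement and transfer_time <= c.max_transfer_seconds:
        return "stage file to central service"
    return "defer (no placement satisfies the constraints)"

# Example: a 2 GB file over a 1 Gb/s link, with no local compute allocation.
print(choose_placement(2_000_000_000, 1e9, Constraints(60.0, True, False)))
```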
2 APPROACH
Xtract is a decentralized middleware that provides distributed meta-
data extraction capabilities over heterogeneous compute resources.
The Xtract service plans dynamic metadata extraction pipelines,
composed of specialized metadata extractor functions, and coordinates
the execution of those extractors either locally or at the edge,
subject to various constraints. An example two-site deployment of
Xtract is shown in Figure 1. The remainder of this section details
Xtract’s core components and design goals.
Metadata extractors are functions that input a file or group of files, and output a metadata dictionary. Each metadata extractor runs in a given container runtime with all required dependencies (i.e., files and libraries). Xtract currently provides a number of built-in extractors, including those to identify null values in tabular files, nesting patterns in structured XML files, topics and keywords from free text, and location tags from map images, among others [10]. In future work, we plan to support user-submitted metadata extractors, automatically generate (and potentially share) runtime containers based on inferred dependency requirements, and train Xtract to recognize when it is appropriate to apply user-submitted extractors in each file's metadata extraction workflow.
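As a concrete illustration of the extractor abstraction, the sketch below shows what such a function might look like for the tabular-null-value case. The function name, signature, and counting logic are illustrative stand-ins rather than Xtract's actual implementation; the only fixed contract is "file in, metadata dictionary out."

```python
import csv
from typing import Any, Dict

def extract_tabular_nulls(path: str, null_tokens=("", "NA", "NaN", "null")) -> Dict[str, Any]:
    """Illustrative extractor: count null-like values per column in a CSV file
    and return the result as a metadata dictionary."""
    null_counts: Dict[str, int] = {}
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            for column, value in row.items():
                if value is None or value.strip() in null_tokens:
                    null_counts[column] = null_counts.get(column, 0) + 1
    return {
        "extractor": "tabular-nulls",   # which extractor produced this metadata
        "file": path,
        "null_counts": null_counts,     # per-column counts of null-like entries
    }
```

In Xtract, a function of this shape would be registered once and then invoked many times, one short-running call per file, inside a container that bundles its dependencies.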
Figure 1: Overview of the Xtract architecture. For Site A, extractors are transmitted to the remote resource for execution on local computing resources, returning metadata to the Xtract service. Site B lacks suitable local computing capabilities, requiring data to be staged to Xtract for extraction.
The Xtract service dynamically applies a set of metadata extractors to each file in the repository. First, Xtract sends a crawler function to the data and populates a metadata dictionary for each file containing its file system properties (e.g., path, size, extension). Once the initial dictionary is created, Xtract invokes a file type extractor on each file, the output of which is used to select downstream extractors. We have shown that feeding the first n bytes of a file as features into a trained model can predict the appropriate first extractor functions to apply to a file, in significantly less time than attempting to execute incorrect extractors on each file [11]. Once this first extractor function returns metadata, the Xtract service dynamically selects additional extractors to apply based on the output. For instance, a tabular file with a multi-line free-text header (e.g., describing experimental setup) is identified by Xtract as first requiring a tabular extractor, but the resultant metadata will denote free-text parts of the file that can benefit from keyword and topic extractors. Xtract protects requests to the web service using Globus Auth [13], and stores metadata to a Globus Search index.
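The control flow just described can be summarized in a short sketch. The extractor registries, the `predict_file_type` stand-in for the trained model, and the `follow_on_hints` key are hypothetical names introduced here for illustration only; they do not reflect Xtract's internal interfaces.

```python
import os
from typing import Any, Callable, Dict, List

# Hypothetical registries: the first extractor per predicted file type, and
# follow-on extractors triggered by hints found in earlier metadata.
FIRST_EXTRACTOR: Dict[str, Callable[[str], Dict[str, Any]]] = {}
FOLLOW_ON: Dict[str, List[Callable[[str], Dict[str, Any]]]] = {}

def predict_file_type(head: bytes) -> str:
    """Stand-in for the trained model that infers a file type from its first n bytes."""
    return "tabular" if b"," in head else "unknown"

def extract_all(path: str, n: int = 512) -> Dict[str, Any]:
    # 1. Crawl: record file system properties for the file.
    metadata: Dict[str, Any] = {
        "path": path,
        "size": os.path.getsize(path),
        "extension": os.path.splitext(path)[1],
    }
    # 2. Sample the first n bytes and predict which extractor to run first.
    with open(path, "rb") as f:
        head = f.read(n)
    file_type = predict_file_type(head)
    metadata["type"] = file_type
    # 3. Run the first extractor, then any follow-on extractors its output suggests
    #    (e.g., a tabular extractor flags a free-text header for keyword/topic extraction).
    first = FIRST_EXTRACTOR.get(file_type)
    if first is not None:
        metadata.update(first(path))
        for hint in metadata.get("follow_on_hints", []):
            for extractor in FOLLOW_ON.get(hint, []):
                metadata.update(extractor(path))
    return metadata
```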
Endpoints are the edge computing fabric that provision compute resources and execute functions in container runtimes for files on the local file system. Endpoints can currently be deployed across myriad compute providers such as IoT devices, cloud instances, and clusters. Endpoints deployed in funcX utilize the Parsl [1] parallel programming library to provision compute resources, and to manage the execution of functions in containers on provisioned resources. Endpoints enable Xtract to execute metadata extraction functions at any registered and accessible endpoint. We have shown that deploying funcX endpoints on HPC systems allows Xtract to reliably scale to deploy millions of metadata extraction functions across thousands of nodes spanning multiple compute locations [3].
3 EVALUATION PLAN
In future work we plan to study how Xtract can optimize the creation of extraction workflows and deployment of extractors subject to user-defined constraints. Specifically, we plan to explore how metadata extraction workflows can be augmented to place extractors on, or stage data to, idle or under-utilized resources. We will also investigate globally optimal extraction strategies with respect to diverse user constraints of financial cost, computing time, security, and compute resource allocation availability or volatility.
We intend to evaluate Xtract's performance and optimization strategies across a diverse set of datasets. These include the Carbon Dioxide Information Analysis Center (330+ GB, 10,000+ unique file extensions of carbon dioxide data); the Materials Data Facility [2] (30+ TB, tens of millions of materials science files); Petrel (4+ PB, 50,000+ files of cross-disciplinary data at Argonne National Lab); and Globus-accessible endpoints (20,000+ unique endpoints containing hundreds of billions of files).
4 CONCLUSION
Xtract is a metadata extraction middleware that addresses data locality and scalability challenges by deploying metadata extractors to edge devices and constructing extraction workflows subject to a number of user constraints. Xtract will enable researchers, companies, and individuals alike to more easily discover, organize, and understand increasingly large, complex, and distributed data, leading to enhanced scientific and industrial progress.
ACKNOWLEDGMENTS
This research is conducted under the guidance of Dr. Ian Foster
and Dr. Kyle Chard, and with contributions from Dr. Ryan Chard,
Dr. Zhuozhao Li, Yadu Babuji, and Ryan Wong. We gratefully ac-
knowledge the use of compute resources from the Jetstream cloud
for science and engineering [12].
REFERENCES
[1] Yadu Babuji, Anna Woodard, Zhuozhao Li, Daniel Katz, Ben Clifford, Rohan Kumar, Lukasz Lacinski, Ryan Chard, Justin Wozniak, and Ian Foster. 2019. Parsl: Pervasive parallel programming in Python. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 25–36.
[2] Ben Blaiszik, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. 2019. A Data Ecosystem to Support Machine Learning in Materials Science. (April 2019). arXiv:1904.10423
[3] Ryan Chard, Tyler J. Skluzacek, Zhuozhao Li, Yadu Babuji, Anna Woodard, Ben Blaiszik, Steven Tuecke, Ian Foster, and Kyle Chard. 2019. Serverless Supercomputing: High Performance Function as a Service for Science. arXiv preprint arXiv:1908.04907 (2019).
[4] M. P. Egan, S. D. Price, K. E. Kraemer, D. R. Mizuno, S. J. Carey, C. O. Wright, C. W. Engelke, M. Cohen, and M. G. Gugliotti. 2003. VizieR Online Data Catalog: MSX6C Infrared Point Source Catalog. The Midcourse Space Experiment Point Source Catalog Version 2.3 (October 2003). VizieR Online Data Catalog 5114 (2003).
[5] Gary King. 2007. An introduction to the Dataverse Network as an infrastructure for data sharing.
[6] Chris Mattmann and Jukka Zitting. 2011. Tika in Action. Manning Publications.
[7] Smruti Padhy, Greg Jansen, Jay Alameda, Edgar Black, Liana Diesendruck, Mike Dietze, Praveen Kumar, Rob Kooper, Jong Lee, Rui Liu, et al. 2015. Brown Dog: Leveraging everything towards autocuration. In 2015 IEEE International Conference on Big Data (Big Data). IEEE, 493–500.
[8] Gonzalo P. Rodrigo, Matt Henderson, Gunther H. Weber, Colin Ophus, Katie Antypas, and Lavanya Ramakrishnan. 2018. ScienceSearch: Enabling search through automatic metadata generation. In 2018 IEEE 14th International Conference on e-Science (e-Science). IEEE, 93–104.
[9] Tyler J. Skluzacek, Kyle Chard, and Ian Foster. 2016. Klimatic: A virtual data lake for harvesting and distribution of geospatial data. In 2016 1st Joint International Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS). IEEE, 31–36.
[10] Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, and Ian Foster. 2019. Serverless Workflows for Indexing Large Scientific Data. In 5th Workshop on Serverless Computing (WoSC '19). ACM, New York, NY, USA, 6. https://doi.org/10.1145/3366623.3368140
[11] Tyler J. Skluzacek, Rohan Kumar, Ryan Chard, Galen Harrison, Paul Beckman, Kyle Chard, and Ian Foster. 2018. Skluma: An extensible metadata extraction pipeline for disorganized data. In 2018 IEEE 14th International Conference on e-Science (e-Science). IEEE, 256–266.
[12] Craig A. Stewart, Timothy M. Cockerill, Ian Foster, David Hancock, Nirav Merchant, Edwin Skidmore, Daniel Stanzione, James Taylor, Steven Tuecke, George Turner, et al. 2015. Jetstream: A self-provisioned, scalable science and engineering cloud environment. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. ACM, 29.
[13] Steven Tuecke, Rachana Ananthakrishnan, Kyle Chard, Mattias Lidman, Brendan McCollam, Stephen Rosen, and Ian Foster. 2016. Globus Auth: A research identity and access management platform. In 2016 IEEE 12th International Conference on e-Science (e-Science). IEEE, 203–212.