Serverless Workflows for Indexing Large Scientific Data
Tyler J. Skluzacek
University of Chicago
Ryan Chard
Argonne National Laboratory
Ryan Wong
University of Chicago
Zhuozhao Li
University of Chicago
Yadu N. Babuji
University of Chicago
Logan Ward
Argonne National Laboratory
Ben Blaiszik
Argonne National Laboratory
Kyle Chard
University of Chicago
Ian Foster
Argonne & University of Chicago
The use and reuse of scientic data is ultimately dependent on the
ability to understand what those data represent, how they were
captured, and how they can be used. In many ways, data are only
as useful as the metadata available to describe them. Unfortunately,
due to growing data volumes, large and distributed collaborations,
and a desire to store data for long periods of time, scientic “data
lakes” quickly become disorganized and lack the metadata neces-
sary to be useful to researchers. New automated approaches are
needed to derive metadata from scientic les and to use these
metadata for organization and discovery. Here we describe one
such system, Xtract, a service capable of processing vast collections
of scientic les and automatically extracting metadata from di-
verse le types. Xtract relies on function as a service models to
enable scalable metadata extraction by orchestrating the execution
of many, short-running extractor functions. To reduce data transfer
costs, Xtract can be congured to deploy extractors centrally or
near to the data (i.e., at the edge). We present a prototype imple-
mentation of Xtract and demonstrate that it can derive metadata
from a 7 TB scientic data repository.
CCS Concepts
• Information systems → Computing platforms; Search engine indexing; Document structure; • Applied computing → Document metadata.

Keywords
data lakes, serverless, metadata extraction, file systems, materials
ACM Reference Format:
Tyler J. Skluzacek, Ryan Chard, Ryan Wong, Zhuozhao Li, Yadu N. Babuji, Logan Ward, Ben Blaiszik, Kyle Chard, and Ian Foster. 2019. Serverless Workflows for Indexing Large Scientific Data. In 5th Workshop on Serverless Computing (WOSC '19), December 9–13, 2019, Davis, CA, USA. ACM, New York, NY, USA, 6 pages.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
WOSC '19, December 9–13, 2019, Davis, CA, USA
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-7038-7/19/12. . . $15.00
Advances in scientic instruments, computational capabilities, and
the proliferation of IoT devices have increased the volume, velocity,
and variety of research data. As a result, researchers are increasingly
adopting data-driven research practices, in which data are collected,
stored for long periods of time, shared, and reused. Unfortunately,
these factors create new research data management challenges, not
least of which is the need for data to be adequately described for
them to be generally useful. Without descriptive metadata, data
may become unidentiable, siloed, and in general, not useful to
either the researchers who own the data or the broader scientic
community. Unfortunately, organizing and annotating data is a
time-consuming process and the gulf between data generation rates
and the nite management capabilities of researchers continues to
grow. To increase the value and usefulness of scientic data, new
methods are required to automate the extraction and association
of rich metadata that describe not only the data themselves, but
also their structure and format, provenance, and administrative
Data lakes have become a popular paradigm for managing large and heterogeneous data from various sources. A data lake contains a collection of data in different formats, accompanied by metadata describing those data. Unfortunately, the data deluge and the desire to store all data for eternity can quickly turn a data lake into a "data swamp" [18]—a term used to describe the situation in which the data stored in the data lake lack the metadata necessary to be discoverable, understandable, and usable. Without automated and scalable approaches to derive metadata from scientific data, the utility of these data is reduced. This problem is especially prevalent in science, as datasets can be enormous (many petabytes), are created and shared by dynamic collaborations, are often collected under tight time constraints where data management processes become afterthoughts, and for which preservation is important for purposes of reproducibility and reuse.
Extracting metadata from scientific data is a complex task. Scientific repositories may exceed millions of files and petabytes of data; data are created at different rates, by different people; and there exist an enormous number of data formats and conventions. While metadata can be rapidly extracted from some data types, others, such as images and large hierarchical file formats, can require the use of multiple extraction methods. As a result, the metadata extraction process must be scalable to process large numbers of files, flexible to support different extraction methods, and extensible to be applied to various scientific data types and formats.
Serverless computing, and in particular function as a service (FaaS), provides an ideal model for managing the execution of many short-running extractors on an arbitrarily large number of files. Serverless computing abstracts computing resources from the user, enabling the deployment of applications without consideration for the physical and virtual infrastructure on which they are hosted. FaaS allows users to register programming functions with predefined input signatures. Registered functions can subsequently be invoked many times without the need to provision or scale any underlying infrastructure.
In this paper we propose the use of FaaS for mapping the metadata extraction problem to a collection of granular metadata extractor functions. We describe how such a model can support the flexibility, scalability, and extensibility required for scientific metadata extraction. Rather than rely on commercial FaaS systems, we use a distributed FaaS model that overcomes the limitation of moving large amounts of data to the cloud. Instead, we are able to push metadata extractors to the edge systems on which the scientific data reside.
Our prototype system, Xtract, provides high-throughput and on-demand metadata extraction that enables the automated creation of rich, searchable data lakes from previously unsearchable data swamps. Xtract implements dynamic metadata extraction workflows composed of serverless functions that may be executed in the cloud or at the edge. Xtract runs as a centralized management service that orchestrates the crawling, transfer (when necessary), and execution of metadata extraction functions. Xtract uses the funcX serverless supercomputing platform [ ] to execute functions across diverse and distributed computing infrastructure. We evaluate Xtract's performance by extracting metadata from materials science data stored in a 7 TB subset of the Materials Data Facility (MDF) [3, 4].
The remainder of this paper is organized as follows. §2 outlines example scientific data lakes. §3 describes Xtract's serverless architecture and presents the set of available extractors. §4 provides initial results of system performance on hundreds of thousands of scientific files. Finally, §5 and §6 present related work and concluding remarks, respectively.
2 Example Scientific Data Lakes

Data lakes are schemaless collections of heterogeneous files obtained from different sources. Unlike traditional data warehouses, data lakes do not require upfront schema integration and instead allow users to store data without requiring complex and expensive Extract-Transform-Load (ETL) pipelines. This low barrier to data archival encourages the storage of more bytes and types of data. However, the lack of a well-defined schema shifts responsibility from upfront integration to descriptive metadata that allow users to search, understand, and use the data. To motivate our work we briefly describe three scientific data lakes below.
The Carbon Dioxide Information Analysis Center (CDIAC) collected an emissions dataset from the 1800s through 2017. The dataset contains more than 500 000 files (330+ GB) with over 10 000 unique file extensions. The archive contains little descriptive metadata and includes a number of irrelevant files, such as debug-cycle error logs and Windows desktop shortcuts. The data are currently being moved to the Environmental System Science Data Infrastructure for a Virtual Ecosystem (ESS-DIVE) archive [ ]. In prior work we extracted metadata from files in this repository and created a data lake for users [17, 18].
DataONE [ ] provides access to a distributed network of biological and environmental sciences data repositories. DataONE manages a central index across these distributed repositories, enabling users to discover datasets based on metadata queries. Member data repositories provide dataset- and file-level metadata to DataONE. As of May 2019, DataONE contains over 1.2 million files and 809 000 unique metadata entries.
The Materials Data Facility (MDF) [ ] is a centralized hub for publishing, sharing, and discovering materials science data. The MDF stores many terabytes of data from many different research groups, covering many disciplines of materials science, and with a diverse range of file types. The downside of the expansive range of materials data held by the MDF is that it can be difficult for users to find data relevant to their science. The MDF reduces this "data discovery" challenge by hosting a search index that provides access to metadata from the files (e.g., which material was simulated in a certain calculation). The data published by the MDF are primarily stored at the National Center for Supercomputing Applications (NCSA) and are accessible via Globus.
3 Xtract

Xtract is a metadata extraction system that provides on-demand extraction from heterogeneous scientific file formats. Xtract can operate in one of two modes: centralized or edge metadata extraction. In the centralized mode, Xtract processes files stored on a Globus endpoint by first staging them to a centralized location and then executing metadata extraction pipelines on those files. In the edge mode, Xtract can execute metadata extraction pipelines on edge computers near the data by deploying a collection of metadata extraction functions to distributed FaaS endpoints.

In order to extract metadata, Xtract applies various extractors—functions that take a file as input and generate a JSON document of metadata for that file. In either centralized or edge mode, Xtract assembles a pipeline of extraction functions based on file contents and metadata extracted by other extractors. Xtract begins this process by applying a file type extractor, which informs the selection of subsequent extractors. Subsequent extractors are selected based on their expected metadata yield. This allows Xtract to apply the appropriate extractors to a specific file.

Xtract creates a metadata document for each file it processes. When an extractor is applied to a file, Xtract appends the extractor's output metadata to the file's metadata document. Once all applicable extractors are applied to a file, Xtract loads this metadata document into a Globus Search index [1].
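The extractor abstraction above can be sketched as follows. This is a minimal local sketch, not Xtract's actual API: the function names and merge strategy are illustrative assumptions, and in the real system each extractor runs as a separately registered funcX function with the merged document pushed to Globus Search.

```python
import os

def run_pipeline(path, extractors):
    """Apply a sequence of (name, extractor) pairs to one file and merge
    their outputs into a single per-file metadata document."""
    doc = {"path": path}
    for name, extractor in extractors:
        try:
            metadata = extractor(path)
        except Exception:
            continue  # a failing extractor simply contributes no metadata
        if metadata:
            doc[name] = metadata  # append this extractor's output to the document
    return doc

# Two toy extractors standing in for Xtract's real ones.
def size_extractor(path):
    return {"size_bytes": os.path.getsize(path)}

def first_line_extractor(path):
    with open(path) as f:
        return {"first_line": f.readline().strip()}
```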
3.1 Extractors
Xtract includes extractors for many file types commonly used in science. Each extractor is implemented as either a Python function or Bash script. The remainder of this section outlines Xtract's library of extractors and the data types they have been designed to process.

The universal extractor extracts generic file information such as file extension, path, and size. It also computes an MD5 hash (for duplicate detection). The universal extractor is typically the first applied to a file.
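A minimal sketch of such a universal extractor, with illustrative field names (the actual metadata schema is not specified here):

```python
import hashlib
import os

def universal_extract(path):
    """Generic file metadata: extension, absolute path, size, and an MD5
    hash that can later be used for duplicate detection."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):  # stream in 1 MB chunks
            md5.update(chunk)
    return {
        "path": os.path.abspath(path),
        "extension": os.path.splitext(path)[1],
        "size_bytes": os.path.getsize(path),
        "md5": md5.hexdigest(),
    }
```

Streaming the hash in chunks keeps memory bounded even for multi-gigabyte files.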
The le type extractor
applies a machine learning model to
infer the type of a le. This information is crucial for determin-
ing which downstream extractors could yield metadata from that
le. Users can opt to use a pre-trained model, or to automatically
train one on their data. In the latter case, training commences by
% of the les in the repository at random and running
those les through all other Xtract extractors. We record whether
or not the extractor produces metadata and build a le
label mapping based on those that yield metadata. It then trains
a random forests model using these labels and the rst 512 bytes
of the le as features. The primary goal of this extractor is to save
time—the amount of time to incorrectly apply metadata extractors
to a le can take seconds, whereas predicting which extractors will
likely produce metadata via the byte footprint of a le is tens of
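The label and feature generation steps can be sketched as below. In the real system the 512-byte feature vectors would be fed to a random-forest classifier (e.g., scikit-learn's RandomForestClassifier); this sketch covers only the data-preparation side, and the helper names are illustrative.

```python
def head_features(path, n=512):
    """First n bytes of a file, zero-padded, as an integer feature vector."""
    with open(path, "rb") as f:
        head = f.read(n)
    return list(head.ljust(n, b"\0"))

def label_file(path, extractors):
    """Label a file with the first extractor that yields usable metadata."""
    for name, extractor in extractors:
        try:
            if extractor(path):
                return name
        except Exception:
            pass  # an extractor that errors contributes no label
    return "unknown"
```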
The tabular extractor extracts metadata from files with strict row-column structure, such as .csv and .tsv files. It first extracts the delimiter and the location of the column labels or header, and applies binary search over the file to identify the existence and location of a free-text preamble. The tabular extractor then processes the columns in parallel to collect aggregate information (means, medians, modes). The free-text preamble is re-queued for separate processing by the keyword extractor.
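A simplified version of this flow using Python's standard library is sketched below; the preamble binary search and parallel column processing are omitted, and the output fields are illustrative.

```python
import csv
import statistics

def tabular_extract(path):
    """Infer the delimiter and header, then aggregate each column."""
    with open(path, newline="") as f:
        sample = f.read(4096)
        f.seek(0)
        dialect = csv.Sniffer().sniff(sample)          # delimiter detection
        has_header = csv.Sniffer().has_header(sample)  # header detection
        rows = list(csv.reader(f, dialect))
    header = rows[0] if has_header else None
    body = rows[1:] if has_header else rows
    columns = {}
    for i, col in enumerate(zip(*body)):
        name = header[i] if header else f"col{i}"
        try:
            values = [float(v) for v in col]
            columns[name] = {"mean": statistics.mean(values),
                             "median": statistics.median(values)}
        except ValueError:  # non-numeric column: fall back to cardinality
            columns[name] = {"n_distinct": len(set(col))}
    return {"delimiter": dialect.delimiter,
            "has_header": has_header,
            "columns": columns}
```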
The keyword extractor identifies uniquely descriptive words in unstructured free-text documents such as READMEs, academic papers (e.g., .pdf, .doc), and abstracts. The keyword extractor uses word embeddings to curate a list of the top-n keywords in a file, each with an associated weight corresponding to the relative relevance of that keyword as a descriptor for the document.
The semi-structured extractor takes data or pre-existing metadata in semi-structured formats (e.g., .json or .xml) and returns a metadata summary of the document, such as the maximum nesting depth and the types of data represented at each level (e.g., structured, unstructured, or lists). Furthermore, free-text fields are isolated and re-queued for processing by the keyword extractor.
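The nesting-depth summary and free-text isolation can be sketched as follows on an already-parsed JSON document; the word-count heuristic for "free text" is an assumption, not Xtract's actual rule.

```python
def max_depth(node):
    """Maximum nesting depth of a parsed JSON document."""
    if isinstance(node, dict):
        return 1 + max((max_depth(v) for v in node.values()), default=0)
    if isinstance(node, list):
        return 1 + max((max_depth(v) for v in node), default=0)
    return 0

def free_text_paths(node, path=""):
    """Paths of string fields long enough to re-queue for keyword extraction."""
    paths = []
    if isinstance(node, dict):
        for key, value in node.items():
            paths += free_text_paths(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            paths += free_text_paths(value, f"{path}[{i}]")
    elif isinstance(node, str) and len(node.split()) > 5:  # crude free-text test
        paths.append(path)
    return paths
```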
The hierarchical extractor processes hierarchical HDF5 files (and HDF5-based file formats such as NetCDF) commonly used in science. The hierarchical extractor uses HDF5 libraries to extract both the self-describing, in-file metadata as well as metadata about various dimensions of the data.
The image extractor utilizes an SVM trained on a manually labeled set of over 600 images to derive the class of an image, which is useful for both downstream extractors and general file categorization. The image classes include scientific graphics (e.g., Figure 1), geographic maps, map plots (i.e., geographic maps with an interesting color index), photographs, and scientific plots (e.g., Figure 3). The features for this model include a color histogram, a standard grayscaled version of the original image, and the image size. Further, the Xtract library also contains a downstream extractor that can isolate geographic entities from a photograph of a map.

Figure 1: Overview of the Xtract architecture. For Site A, functions are transmitted to the remote resource and performed on local computing resources, returning metadata to the Xtract service. Site B lacks suitable local computing capabilities, requiring data to be staged to Xtract for analysis.
The materials extractor provides a thin wrapper over MaterialsIO [ ], a metadata extractor designed to process common file formats used in materials science. MaterialsIO contains file and file-group parsers for atomistic simulations, crystal structures, density functional theory (DFT) calculations, electron microscopy outputs, and images. If the Xtract sampler classifies a file as a materials file, the materials extractor is invoked, launching each parser on the contents of the file's directory.
3.2 Prototype Implementation

Xtract is implemented as a service via which users can submit requests to extract metadata from a collection of files. Xtract first crawls the specified files and determines an initial set of extractors to apply to them. As outlined above, the extractors may be executed either centrally on the Xtract server or remotely alongside the data. As processing continues, Xtract assembles a metadata document for each file and dynamically selects other extractors to apply.

Xtract is deployed on Amazon Web Services (AWS) and makes use of various services. The main Xtract service is deployed on an AWS Elastic Compute Cloud (EC2) instance. Xtract manages state in an AWS Relational Database Service (RDS) instance. Each extraction request is stored in the database and its state is updated throughout the extraction process. Xtract is able to send the derived metadata to an external metadata catalog such as a Globus Search index. The Xtract architecture is shown in Figure 1.
3.2.1 Metadata Extractors. Xtract is designed to execute its extractors centrally or on edge storage systems near to where data are stored. Our implementation uses the funcX [ ] FaaS platform to deploy and run extractors. funcX is specifically designed to integrate with research computing cyberinfrastructure and enable a FaaS execution interface. funcX builds upon the Parsl [ ] parallel programming library to manage the execution of functions in containers on arbitrary compute resources. funcX enables Xtract to execute metadata extraction functions at any registered and accessible funcX endpoint. We deploy the prototype with endpoints located both on the central service and at the edge to enable both centralized and edge extraction.
Each metadata extractor and its dependencies are wrapped in a Docker container so that it can be executed on heterogeneous compute environments. We have published each extractor container and registered a funcX function for each extractor. The function is responsible for invoking the extractor and returning the resulting metadata as a JSON dictionary.

funcX enables Xtract to reliably scale to thousands of nodes and deploy metadata extraction tasks on arbitrary computing resources. Xtract can make use of any accessible funcX endpoint to process data at the edge, sending extractor codes and containers to the funcX endpoint for execution. In addition, funcX supports Singularity and Shifter, allowing extractors to be executed on various high performance computing systems.
3.2.2 Data Staging. Irrespective of the extraction mode, either centralized or edge, the data to be processed must be available to the deployed extractor container. In the case where data cannot be directly accessed within the container (e.g., where the container does not mount the local filesystem), data are dynamically staged for processing. Each container includes Xtract tooling to stage data in and out of itself. We use Globus as the basis for data staging, using Globus HTTPS requests to securely download remote data into the container.
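A minimal sketch of HTTPS staging with a bearer token follows; the URL and token handling are assumptions for illustration (the real service uses Globus-issued tokens against Globus HTTPS endpoints).

```python
import urllib.request

def build_staging_request(url, access_token):
    """HTTPS request carrying a bearer token, as used to pull a remote
    file into the extractor container (token flow is assumed)."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"})

def stage_in(url, access_token, dest_path):
    """Download the remote file to a local path inside the container."""
    with urllib.request.urlopen(build_staging_request(url, access_token)) as resp:
        with open(dest_path, "wb") as out:
            out.write(resp.read())
```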
3.2.3 Security Model. Xtract implements a comprehensive security model using Globus Auth [ ]. All interactions with the Xtract Web service are secured with Globus Auth. Users can authenticate with Xtract using one of several hundred identity providers, including many institutions. Xtract uses Globus Auth OAuth 2 flows to stage data on behalf of authenticated users. Xtract first verifies a user identity, requests an access token to perform data transfers on their behalf, and then uses Globus to stage data from remote storage to the Xtract extractor. Finally, the resulting metadata are published into the search index using a Globus Search access token. The search index is configured with a visible_to field, restricting discovery to authenticated users.
4 Evaluation

We evaluate Xtract by extracting metadata from more than 250 000 files stored in MDF. We deployed the Xtract service on an AWS EC2 t2.small instance (Intel Xeon; 1 vCPU; 2 GB memory) and deployed a private funcX endpoint on ANL's PetrelKube—a 14-node Kubernetes cluster. The MDF data are stored on the Petrel data service, a Globus-managed 3 PB data store at ANL. While Petrel and PetrelKube are located within the same network, they do not share a file system. Thus, when executing extractors close to the data we still stage the data from Petrel to PetrelKube for processing.
In this section we first evaluate Xtract's performance by crawling all files on the MDF as a means of initializing the downstream metadata extraction workflow and providing summary statistics about the data. We next profile the downstream metadata extractor functions on hundreds of thousands of heterogeneous files in MDF. Finally, we illustrate the performance of batching multiple files into one extractor function across multiple representative file types.
4.1 Crawling Performance

First we crawl MDF to identify and record file locations, as well as general attributes about each file such as size and extension. Xtract manages the crawling process from its central service, employing a remote breadth-first search algorithm on the directory via Globus. In processing MDF, Xtract crawled each of the 2.2 million files in hours—at an effective rate of files crawled per second. As part of crawling, Xtract generated descriptive treemaps about general attributes of the data. One such treemap, illustrating the proportion of the most common extensions in MDF, is shown in Figure 2. Here we observe that atomistic structure (.xyz), unknown (nan), and image files (.tiff/.tif) are most common in MDF relative to other types.
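The crawl is essentially a breadth-first directory walk. A local-filesystem stand-in might look like the following; Xtract issues the equivalent directory listings remotely via Globus rather than os.scandir.

```python
import os
from collections import deque

def crawl(root):
    """Breadth-first search over a directory tree, recording each file's
    location, size, and extension."""
    queue, records = deque([root]), []
    while queue:
        directory = queue.popleft()
        for entry in os.scandir(directory):
            if entry.is_dir(follow_symlinks=False):
                queue.append(entry.path)  # visit subdirectories level by level
            else:
                records.append({
                    "path": entry.path,
                    "size_bytes": entry.stat().st_size,
                    "extension": os.path.splitext(entry.name)[1],
                })
    return records
```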
4.2 File Type Training

We next evaluate the time taken to perform the optional model training step for the file type extractor. Automated training of this model occurs by trying every extractor on each file in a 5-10% subset of the total data set, denoting the first extractor that returns serviceable metadata without error. This extractor represents the file's label and the first 512 bytes represent the features for the random forests model. Xtract conducted such automated training on 110 900 MDF files. The entire label creation, feature collection, and training workflow took approximately hours. We found that label generation constitutes a majority of this time, as feature generation and model training total just 45 seconds. It is important to note that, in the future, increasing the number of PetrelKube pods concurrently serving feature- and label-collection functions could drastically reduce the time taken to train the model.
4.3 Extractor Performance

We next evaluate the performance of Xtract's extractors by invoking extraction functions on MDF's data. We process a variety of file types including all available tabular and structured files and at least 25 000 of each other type of file, selected randomly from MDF. The performance of each extractor is summarized in Table 1.

We observe that a majority of extractor function invocations finish within milliseconds, and a majority of a file's round-trip processing time occurs due to file staging. This observation exemplifies the need to invoke functions near to data, directly mounted to the file system housing the data, whenever possible. Moreover, we note that a few dozen large hierarchical files, exceeding 10 GB in size, were not processed due to ephemeral storage constraints on PetrelKube.
Figure 2: A treemap of the MDF extension frequency. The area of each box, relative to the entire treemap, is proportional to the frequency of that file extension among the 2.2 million total files in MDF.
Table 1: Extractor performance.

Extractor        #Files    Avg. Size (MB)    Avg. Extract Time (ms)    Avg. Stage Time (ms)
File Type        25,132    1.52              3.48                      714
Images           76,925    4.17              19.30                     1,198
Semi-structured  29,850    0.38              8.97                      412
Keyword          25,997    0.06              0.20                      346
Materials        95,434    0.001             24                        1,760
Hierarchical*    3,855     695               1.90                      9,150
Tabular          1,227     1.03              113                       625
Figure 3: Batching: extraction time per file (ms) on batches sized 1-256 for representative files processed by the file type, image, keyword, and materials extractors.
Finally, we explore the benefits of batching multiple files into one function invocation request. We choose the four most applicable extractors: file type, keyword, images, and materials. We randomly select representative files of all four types, each within 2% of the average file sizes shown in Table 1. We implement batching by aggregating 1-256 files into a single Xtract request and have modified the funcX function to download and process files in parallel. Figure 3 shows that the average time required to process a file decreases as the batch size increases. Thus, batching can increase the performance of metadata extraction workflows, especially those requiring file transfers.
5 Related Work

Xtract builds upon a large body of work in both metadata extraction systems and serverless computing. Xtract is not the first to provide a scalable solution to extract metadata from datasets. Pioneering research on data lakes developed methods for extracting standard metadata from nonstandard file types and formats [ ]. Most data lakes are designed with a specific target domain for which they are optimized, whether they primarily focus on transactional, scientific, or networked data. Xtract is designed to be extensible and can therefore be easily applied to different domains.
We [ ] and others [ ] have created systems to manage metadata catalogs that support the organization and discovery of research data. However, these approaches typically require that users provide metadata and that curators continue to organize data over time. A number of systems exist for automatically extracting metadata from repositories. For example, ScienceSearch [ ] uses machine learning techniques to extract metadata from a dataset served by the National Center for Electron Microscopy (NCEM). Most data in this use case are micrograph images, but additional contextual metadata are derived from file system data and free-text proposals and publications. Like Xtract, ScienceSearch provides a means for users to extensibly switch metadata extractors to suit a given dataset. Brown Dog [ ] is an extensible metadata extraction platform, providing metadata extraction services for a number of disciplines ranging from materials science to social science. Unlike Xtract, Brown Dog requires that files be uploaded for extraction.
The Apache Tika toolkit [ ] is an open-source content and metadata extraction library. Tika has a robust parser interface through which users can create and employ their own parsers in metadata extraction workflows. While Apache Tika has parsers that support thousands of file formats, its automated parser-to-file mapping utilizes MIME types to find suitable parsers for a file, which is often misleading for many scientific data use cases. Xtract could be extended to support Tika extractors and to enable their execution on a serverless platform.
While most related research performs metadata extraction to enable search, Xtract-like systematic sweeps across repositories can also be used for analysis. For example, the Big Data Quality Control (BDQC) framework [ ] sweeps over large collections of biomedical data without regard to their meaning (domain-blind analysis) with the goal of identifying anomalies. BDQC employs a pipeline of extractors to derive properties of imaging, genomic, and clinical data. While BDQC is implemented as a standalone system, the approach taken would be similarly viable in Xtract.
6 Conclusion

The growing volume, velocity, and variety of scientific data are becoming unmanageable. Without proper maintenance and management, data lakes quickly degrade into disorganized data swamps, lacking the metadata necessary for researchers to efficiently discover, use, and repurpose data. The growing size and heterogeneity of scientific data make extracting rich metadata a complex and costly process, requiring a suite of customized extractors and advanced extraction techniques. We have described a serverless-based approach for metadata extraction, called Xtract. Xtract enables the scalable extraction of metadata from large-scale and distributed data lakes, in turn increasing the value of data. We showed that our prototype can crawl and process hundreds of thousands of files from a multi-terabyte repository in hours, and that batching files and parallelizing file staging and extraction tasks can improve the performance of metadata extraction.
In future work [ ] we will focus on scaling the Xtract model and exploring the use of Xtract on larger and globally distributed datasets. We will investigate strategies for guiding extractor placement across scientific repositories, weighing data and extractor transfer costs to optimize placement. Finally, we will extend Xtract to facilitate the integration of custom metadata extractors.
Acknowledgments

This research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357. We gratefully acknowledge the computing resources provided and operated by the Joint Laboratory for System Evaluation (JLSE) at Argonne National Laboratory as well as the Jetstream cloud for science and engineering [19].
REFERENCES
[1] Rachana Ananthakrishnan, Ben Blaiszik, Kyle Chard, Ryan Chard, Brendan McCollam, Jim Pruyne, Stephen Rosen, Steven Tuecke, and Ian Foster. 2018. Globus platform services for data publication. In Proceedings of the Practice and Experience on Advanced Research Computing. ACM, 14.
[2] Yadu Babuji, Anna Woodard, Zhuozhao Li, Daniel S. Katz, Ben Clifford, Rohan Kumar, Lukasz Lacinski, Ryan Chard, Justin M. Wozniak, Ian Foster, et al. 2019. Parsl: Pervasive parallel programming in Python. In Proceedings of the 28th International Symposium on High-Performance Parallel and Distributed Computing. ACM, 25–36.
[3] Ben Blaiszik, Kyle Chard, Jim Pruyne, Rachana Ananthakrishnan, Steven Tuecke, and Ian Foster. 2016. The Materials Data Facility: Data services to advance materials science research. JOM 68, 8 (2016), 2045–2052.
[4] Ben Blaiszik, Logan Ward, Marcus Schwarting, Jonathon Gaff, Ryan Chard, Daniel Pike, Kyle Chard, and Ian Foster. 2019. A Data Ecosystem to Support Machine Learning in Materials Science. (2019). arXiv:1904.10423
[5] Ryan Chard, Tyler J. Skluzacek, Zhuozhao Li, Yadu Babuji, Anna Woodard, Ben Blaiszik, Steven Tuecke, Ian Foster, and Kyle Chard. 2019. Serverless Supercomputing: High Performance Function as a Service for Science. arXiv preprint arXiv:1908.04907 (2019).
[6] Eric Deutsch, Roger Kramer, Joseph Ames, Andrew Bauman, David S. Campbell, Kyle Chard, Kristi Clark, Mike D'Arcy, Ivo Dinov, Rory Donovan, et al. 2018. BDQC: a general-purpose analytics tool for domain-blind validation of Big Data. bioRxiv (2018), 258822.
[7] M. P. Egan, S. D. Price, K. E. Kraemer, D. R. Mizuno, S. J. Carey, C. O. Wright, C. W. Engelke, M. Cohen, and M. G. Gugliotti. 2003. VizieR Online Data Catalog: MSX6C Infrared Point Source Catalog. The Midcourse Space Experiment Point Source Catalog Version 2.3 (October 2003). VizieR Online Data Catalog 5114 (2003).
[8] Materials Data Facility. 2019. MaterialsIO.
[9] Environmental Systems Science Data Infrastructure for a Virtual Ecosystem. 2019.
[10] Gary King. 2007. An introduction to the Dataverse Network as an infrastructure for data sharing.
[11] Chris Mattmann and Jukka Zitting. 2011. Tika in Action. Manning Publications.
[12] William Michener, Dave Vieglais, Todd Vision, John Kunze, Patricia Cruse, and Greg Janée. 2011. DataONE: Data Observation Network for Earth—Preserving data and enabling innovation in the biological and environmental sciences. D-Lib Magazine 17, 1/2 (2011), 12.
[13] Smruti Padhy, Greg Jansen, Jay Alameda, Edgar Black, Liana Diesendruck, Mike Dietze, Praveen Kumar, Rob Kooper, Jong Lee, Rui Liu, et al. 2015. Brown Dog: Leveraging everything towards autocuration. In 2015 IEEE Int'l Conference on Big Data (Big Data). IEEE, 493–500.
[14] Arcot Rajasekar, Reagan Moore, Chien-yi Hou, Christopher A. Lee, Richard Marciano, Antoine de Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, et al. 2010. iRODS Primer: Integrated Rule-Oriented Data System. Synthesis Lectures on Information Concepts, Retrieval, and Services 2, 1 (2010), 1–143.
[15] Gonzalo P. Rodrigo, Matt Henderson, Gunther H. Weber, Colin Ophus, Katie Antypas, and Lavanya Ramakrishnan. 2018. ScienceSearch: Enabling search through automatic metadata generation. In 2018 IEEE 14th Int'l Conference on e-Science (e-Science). IEEE, 93–104.
[16] Tyler J. Skluzacek. 2019. Dredging a Data Lake: Decentralized Metadata Extraction. In Middleware '19: 20th International Middleware Conference Doctoral Symposium (Middleware '19). ACM, New York, NY, USA, 3.
[17] Tyler J. Skluzacek, Kyle Chard, and Ian Foster. 2016. Klimatic: A virtual data lake for harvesting and distribution of geospatial data. In 2016 1st Joint Int'l Workshop on Parallel Data Storage and Data Intensive Scalable Computing Systems (PDSW-DISCS). IEEE, 31–36.
[18] Tyler J. Skluzacek, Rohan Kumar, Ryan Chard, Galen Harrison, Paul Beckman, Kyle Chard, and Ian Foster. 2018. Skluma: An extensible metadata extraction pipeline for disorganized data. In 2018 IEEE 14th Int'l Conference on e-Science (e-Science). IEEE, 256–266.
[19] Craig A. Stewart, Timothy M. Cockerill, Ian Foster, David Hancock, Nirav Merchant, Edwin Skidmore, Daniel Stanzione, James Taylor, Steven Tuecke, George Turner, et al. 2015. Jetstream: A self-provisioned, scalable science and engineering cloud environment. In Proceedings of the 2015 XSEDE Conference: Scientific Advancements Enabled by Enhanced Cyberinfrastructure. ACM, 29.
[20] Ignacio G. Terrizzano, Peter M. Schwarz, Mary Roth, and John E. Colino. 2015. Data Wrangling: The Challenging Journey from the Wild to the Lake. In CIDR.
[21] Steven Tuecke, Rachana Ananthakrishnan, Kyle Chard, Mattias Lidman, Brendan McCollam, Stephen Rosen, and Ian Foster. 2016. Globus Auth: A research identity and access management platform. In 2016 IEEE 12th Int'l Conference on e-Science (e-Science). IEEE, 203–212.
[22] Danielle Welter, Jacqueline MacArthur, Joannella Morales, Tony Burdett, Peggy Hall, Heather Junkins, Alan Klemm, Paul Flicek, Teri Manolio, Lucia Hindorff, et al. 2013. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research 42, D1 (2013), D1001–D1006.
[23] J. M. Wozniak, K. Chard, B. Blaiszik, R. Osborn, M. Wilde, and I. Foster. 2015. Big Data Remote Access Interfaces for Light Source Science. In 2nd IEEE/ACM International Symposium on Big Data Computing (BDC). 51–60.