RIPPLE: Home Automation for Research Data
Management
Ryan Chard1, Kyle Chard2, Jason Alt2, Dilworth Y. Parkinson3, Steve Tuecke2, and Ian Foster2
1Computing, Environment, and Life Sciences, Argonne National Laboratory
2Computation Institute, University of Chicago and Argonne National Laboratory
3Advanced Light Source Division, Lawrence Berkeley National Laboratory
Abstract—Exploding data volumes and acquisition rates, plus
ever more complex research processes, place significant strain on
research data management processes. It is increasingly common
for data to flow through pipelines comprised of dozens of dif-
ferent management, organization, and analysis steps distributed
across multiple institutions and storage systems. To alleviate the
resulting complexity, we propose a home automation approach
to managing data throughout its lifecycle, in which users specify
via high-level rules the actions that should be performed on data
at different times and locations. To this end, we have developed
RIPPLE, a responsive storage architecture that allows users to
express data management tasks via a rules notation. RIPPLE
monitors storage systems for events, evaluates rules, and uses
serverless computing techniques to execute actions in response
to these events. We evaluate our solution by applying RIPPLE
to the data lifecycles of two real-world projects, in astronomy
and light source science, and show that it can automate many
mundane and cumbersome data management processes.
I. INTRODUCTION
Researchers are faced with an increasingly complex data
landscape in which data are obtained from a number of dif-
ferent sources (e.g., instruments, computers, published data),
stored in disjoint storage systems, and analyzed on an array of
high performance and cloud computers. Given the increasing
speed at which data are produced, combined with increasingly
complex scientific processes and the data management,
munging, and organization activities required to make
sense of data, researchers are faced with new bottlenecks in
the discovery process. Improving data lifecycle management
practices is essential to enhancing productivity, facilitating
reproducible research, and encouraging collaboration [1]. Yet
current practices are typically manual and ad hoc, requiring
considerable human effort and ensuring little adherence to
best practices. As a result, researchers struggle to manage,
analyze, and share data reliably and efficiently [2] and research
results are frequently irreproducible. We posit that researchers
require automated methods for managing their data such that
tedious and repetitive tasks (e.g., transferring, archiving, and
analyzing) are accomplished without continuous user input.
Automated approaches have revolutionized many domains
such as machinery use in factories, aircraft flight, and more
recently managing devices within the home. Home automation
in particular shares similar features with research data manage-
ment: it focuses on controlling and automating a range
of devices within a home, such as lighting, heating, security,
and other appliances. The main goal of these systems is to
increase convenience and decrease time spent on mundane
tasks by automating repetitive processes, such as turning on
lights when it gets dark, setting security alarms when leaving
the house, and controlling room temperatures based on weather
conditions. Home automation systems enable users to finely
customize and control their environments by defining policies
that dictate how household appliances should perform under
different circumstances. It is these same types of capabilities
that are needed for managing research data.
In this paper we present a new approach to data management
called RIPPLE. RIPPLE aims to allow researchers, lab man-
agers, and administrators to define data management practices
as a set of simple if-trigger-then-action recipes. Actions, such
as moving data or executing an analysis script, are triggered
in response to events, such as files being created, modified,
or deleted. Filesystem events are captured through various
APIs, including Linux’s inotify and the Globus API [3].
Given the broad range of actions that might be possible,
RIPPLE builds upon serverless computing systems to enable
on-demand processing of recipes. In particular, it uses Amazon
Web Services Lambda as a scalable and low-latency solution
for performing arbitrary, loosely coupled actions.
To guide and evaluate our approach we focus on two
use cases: the data management processes associated with
the Large Synoptic Survey Telescope (LSST) and a multi-
institutional materials science project. We show that RIPPLE
can satisfy the needs of these two projects by automating im-
portant data management tasks. Furthermore, we evaluate the
scalability and performance of our prototype implementation
by analyzing event collection and processing operations.
The rest of this paper is organized as follows. Section II
discusses related work. We describe the LSST and materials
science scenarios in Section III. We present RIPPLE in Sec-
tion IV. We evaluate RIPPLE's performance in Section V and
its ability to meet application requirements in Section VI. We
summarize in Section VII.
II. RELATED WORK
Previous rule-based approaches to data management [4]
are primarily designed for expert administration of large data
stores. Our approach is differentiated by its simple recipe
notation, decoupling of rules from data management and stor-
age technologies, and use of serverless computing to evaluate
rules. Below we discuss several rules engines and how
they relate to our work.
The integrated Rule-Oriented Data System (iRODS) [5] uses
a powerful rules engine to manage the data lifecycle of the
files and stores that it governs. iRODS is a closed system:
data are imported into an iRODS data grid that is managed
entirely by iRODS and administrators use rules to configure
the management policies to be followed within that data grid.
In contrast, RIPPLE is an open system: any authorized user
may associate rules with any storage system. Both approaches
have their place in the data landscape. iRODS has been used
successfully in large projects, whereas RIPPLE aims to benefit
the dynamic, heterogeneous, multi-project environments that
typify many modern research labs.
IOBox [6] is designed to extract, transform, and load data
into a catalog. It is able to crawl and monitor a filesystem,
detect file changes (e.g., creation, modification, deletion), and
apply pattern matching rules against file names to determine
what actions (ETL) should be taken. RIPPLE extends this
model by allowing scalable and distributed event detection,
and supporting an arbitrary range of actions.
The Robinhood Policy Engine [7] is designed to manage
large HPC filesystems. Although it can support any POSIX
filesystem, it implements advanced features for Lustre. Robin-
hood maintains a database of file metadata. It allows bulk
actions to be scheduled for execution against sets of files, for
example migrating or purging stale data. Robinhood provides
routines to manage and monitor filesystems efficiently, such as
those used to find files, determine usage, and produce reports.
It is not the goal of RIPPLE to provide such utilities. Instead,
RIPPLE is designed to empower users to implement simple,
yet effective data management strategies.
SPADE [8] supports automated transfer and transformation
of data. Users configure a SPADE dropbox. If a file is written
to the dropbox, SPADE creates (or detects) an accompanying
semaphore file to signal that the file is complete and that a
transfer should begin. SPADE can also enable data archival
or execution of analysis scripts in response to data arrival.
The SPOT framework [9] is a workflow management solution
developed specifically for the Advanced Light Source (ALS).
SPOT leverages SPADE to automate the analysis, transfer,
storage, and sharing of ALS users' data using HPC resources.
The framework includes a Web interface to provide real-
time feedback. In contrast to SPOT's pre-defined flows, which
handle very large data volumes and numbers of data sets,
RIPPLE empowers non-technical users to define custom recipes
which can be combined into adaptive flows.
III. SCENARIOS
We base the design and implementation of RIPPLE on the
data management requirements of two large research projects:
the LSST and X-ray science at the ALS. Each provides a
unique set of data management requirements and demonstrates
the flexibility of our solution. Here we briefly describe these
projects and their requirements.
A. Large Synoptic Survey Telescope
The LSST is a wide-field telescope currently under con-
struction in Chile. It uses a new telescope design to capture
panoramic, 3.2-gigapixel snapshots of the visible sky every
30 seconds. It is expected to produce more than 30 terabytes
of data every night. Over the LSST’s 10 year lifetime, it will
map billions of stars and galaxies. These data will be used
to explore the structure of the Milky Way and assist in the
exploration of dark energy and dark matter.
The computational and data management requirements of
the LSST are substantial [10]. The project requires near real-
time detection of “interesting” events, such as supernovae and
asteroid detection. In addition, all image data will be analyzed
and made available to the public. To meet the storage needs of
this project, two data centers (called custodial stores), located
in Chile and Illinois, are being readied to reliably store the
data generated by the telescope.
The custodial stores have replication, recovery, and analyti-
cal needs that must be reliably enforced and could be
automated. For example, data must be immediately transferred
from the telescope and archived in both stores to provide fault
tolerance; data must be cataloged and made discoverable once
they enter a store so that scientists can later use the data for
analysis; and corrupted or deleted data must be recovered from
another store.
B. X-ray science at the ALS
The ALS is a DOE-funded synchrotron light source housed
at Lawrence Berkeley National Laboratory. It is one of the
brightest sources of ultraviolet and soft x-ray light in the
world. Given its unique characteristics, scientists from many
fields use the ALS to conduct a wide array of experiments
including spectroscopy, microscopy, and diffraction. The ALS
is comprised of almost 40 beamlines that serve approximately
2,000 researchers each year.
The ALS is representative of the data management lifecycle
of many large instruments and research facilities. The data
generated by a beamline are large, generated frequently, and
require substantial computational resources to analyze in a
timely manner. Moreover, researchers using each beamline are
granted short allocations of time in which to conduct their
experiments. They typically run a number of experiments,
collect a large amount of data, and then analyze those data
at a later point in time using large-scale resources.
In some cases, analyses are conducted throughout the ex-
periment to guide the experimental process. Typically, data are
transferred to a compute resource where a variety of quality
control procedures and analysis algorithms are executed. In
most cases, analysis requires data transformations, parameter
and configuration selection, and creation of a batch submission
file for execution. Upon completion, analysis output is then
transferred back to the researchers.
Fig. 1: RIPPLE architecture. Filesystem events are captured
and evaluated against registered recipes. Lambda functions
evaluate recipes and execute actions.
IV. RIPPLE
RIPPLE is comprised of a cloud-hosted service and a
lightweight agent that can be deployed on various filesystems.
An overview of RIPPLE's architecture is shown in Fig. 1. The
local agent includes a SQLite database, a monitor service, a
message queue, and a processing element for executing actions
(in containers) locally. The SQLite database stores a local copy
of all active recipes. The monitor, based on the Python module
Watchdog, captures local filesystem events. The basic evalua-
tion process compares each event against all filter conditions
defined by active recipes. When an event matches a particular
recipe, the event is passed to the processing component via a
reliable message queue.
The processing component of the agent publishes triggered
events to an Amazon Simple Notification Service (SNS) topic.
The SNS topic triggers the invocation of a Lambda function.
Lambda functions are used to evaluate the event and recipe,
and then to manage the execution of actions.
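For illustration, the following minimal sketch shows how the agent's processing component might publish a matched event to the SNS topic using boto3; the topic ARN and payload layout are assumptions rather than details taken from the RIPPLE implementation.

import json
import boto3

# Hypothetical topic ARN and payload layout; the paper does not specify
# RIPPLE's exact message format.
SNS_TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:ripple-events"
sns = boto3.client("sns")

def publish_matched_event(event_type, path, recipe_id):
    """Publish a filesystem event that matched an active recipe to the
    SNS topic; the subscribed Lambda function then evaluates the recipe."""
    payload = {
        "event": event_type,     # e.g., "FileCreatedEvent"
        "path": path,            # file that raised the event
        "recipe_id": recipe_id,  # identifier of the matching recipe
    }
    sns.publish(TopicArn=SNS_TOPIC_ARN, Message=json.dumps(payload))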
RIPPLE supports three types of actions: 1) execution of
processes on the local node; 2) invocation of external APIs;
and 3) execution of lightweight Lambda functions. In each
case RIPPLE uses a Lambda function to invoke the action.
Execution on the local machine is achieved by launching local
Docker containers or spawning subprocesses.
A. Recipes
The flexibility of RIPPLE recipes enables the expression of
a wide range of custom functions. To make our solution as
accessible as possible, we have employed a simple if-trigger-
then-action style definition of recipes. This representation
allows even non-technical users to create custom programs.
The usability of the trigger-action programming model has
been proven by the If-This-Then-That (IFTTT) [11] service.
Users have created and shared hundreds of thousands of IFTTT
recipes [12] with vastly different functions, such as notifying
of predicted weather events and automatically extracting and
storing images from Facebook.
A RIPPLE recipe is represented as a JSON document com-
prised of a “trigger” event and an “action” component. The
trigger specifies the condition that an event must match in
order to invoke the action. The action describes what function
is to be executed in response to the event. An example recipe
can be seen in Listing 1. In this recipe, the trigger defines
the source of the event as the local filesystem; the type
of event as file creation; and a path and regular expression
describing conditions to execute the action (any file in the
/path/to/monitor/ directory with a .h5 extension). The action
describes the service to invoke (in this case, globus); the
operation to execute (transfer); and arguments for performing
the operation. Target modifiers allow actions to be performed
on files that did not raise the triggering event. In the given
example, the target implies the operation should be performed
on the file raising the event.
Listing 1: An example RIPPLE recipe.
"recipe":{
"trigger":{
"username":"ryan",
"monitor":"filesystem",
"event":"FileCreatedEvent",
"directory":"/path/to/monitor/",
"filename":".*.h5$"
},
"action":{
"service":"globus",
"type":"transfer"
"source_ep":"endpoint1",
"dest_ep":"endpoint2",
"target_name":"$filename",
"target_match":"",
"target_replace":"",
"target_path":"/˜/$filename.h5",
"task":"",
"access_token":"<access token>"
}
}
B. Event detection and evaluation
RIPPLE relies on a flexible event monitoring model that
can be used in different environments. Specifically, it uses
the Python Watchdog module which offers multiple observers
to detect filesystem events. Watchdog’s observers include:
Linux inotify, Windows FSMonitor, Mac OS FSEvents and
kqueue, OS-independent polling (periodic disk snapshots), and
support for custom event sources via third party APIs (e.g., the
Globus Transfer API). The scope for a Watchdog observer is
determined by the paths specified during its creation. RIPPLE
evaluates all active recipes to determine filesystem paths of
interest and collapses these paths to instantiate a number of
observers to monitor them.
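A minimal sketch of this observer setup is shown below. The trigger fields mirror Listing 1, but the handler class and helper names are illustrative and not part of the RIPPLE codebase.

import re
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class RecipeMatchHandler(FileSystemEventHandler):
    """Compare raised filesystem events against active recipe triggers."""
    def __init__(self, recipes, enqueue):
        self.recipes = recipes   # list of recipe dicts (see Listing 1)
        self.enqueue = enqueue   # callback that forwards matches to the queue

    def on_created(self, event):
        for recipe in self.recipes:
            trigger = recipe["trigger"]
            if (trigger["event"] == "FileCreatedEvent"
                    and event.src_path.startswith(trigger["directory"])
                    and re.search(trigger["filename"], event.src_path)):
                self.enqueue(event, recipe)

def start_observers(recipes, enqueue):
    """Collapse recipe directories into a set of paths and watch each one."""
    observer = Observer()
    handler = RecipeMatchHandler(recipes, enqueue)
    for path in {r["trigger"]["directory"] for r in recipes}:
        observer.schedule(handler, path, recursive=True)
    observer.start()
    return observer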
Multiple events can occur in response to an individual action
on a filesystem. For example, the action of creating a file can
cause inotify to raise events concerning the file’s creation,
modification, and closure, as well as the parent directory’s
modification. Thus it is crucial to filter events and minimize
the overhead caused by passing irrelevant events to the RIPPLE
service. The agent’s filtering phase removes superfluous events
by reporting only those that match active recipe conditions.
RIPPLE can also respond to events created by external
services. For example, we have implemented a Watchdog
observer that periodically polls the Globus Transfer service,
checks for successful transfer operations (file movement, cre-
ation, deletion), and raises appropriate events. The events
generated by this observer are modeled in the same way as
filesystem events and are processed by the monitor.
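A simplified stand-in for this observer is sketched below: a polling loop that lists recent Globus Transfer tasks via the globus-sdk and reports newly succeeded ones in the same form as filesystem events. The function name, polling interval, and bookkeeping of seen tasks are illustrative assumptions.

import time
import globus_sdk

def poll_globus_transfers(access_token, report, interval=30):
    """Periodically poll the Globus Transfer service and report newly
    succeeded tasks so they can flow through the same recipe matching
    used for filesystem events."""
    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(access_token))
    seen = set()
    while True:
        for task in tc.task_list():
            if task["status"] == "SUCCEEDED" and task["task_id"] not in seen:
                seen.add(task["task_id"])
                # Hand the task to the monitor's recipe-matching logic.
                report({"event": "GlobusTransferSucceeded",
                        "task_id": task["task_id"],
                        "destination": task["destination_endpoint_id"]})
        time.sleep(interval)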
C. Actions
RIPPLE aims to support an arbitrary set of extensible actions
that may be invoked as a result of an event. The initial set of
actions has been developed to address the requirements of
the scenarios discussed in Section III. Specifically, RIPPLE
currently supports execution of Lambda functions, external
services (via Globus and AWS services), and containers lo-
cated on the storage system. As part of the recipe definition
users are able to specify additional state to be passed to
the action for execution: for example, where data should be
replicated or the HPC queue to use for execution.
Lambda functions provide the ability to execute arbitrary
code, typically for short periods of time. By supporting such
functions, users can define complex actions that are executed
within their own Amazon context as a result of a recipe. This is
achieved by specifying a Lambda function’s Amazon Resource
Number (ARN) as the action of a recipe.
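For example, the dispatching Lambda function might forward the triggering event to the user's function by ARN, as in the following sketch; the asynchronous boto3 invocation is real, while the payload layout is an assumption.

import json
import boto3

lambda_client = boto3.client("lambda")

def invoke_user_function(arn, event, recipe):
    """Invoke a user-defined Lambda function (specified by ARN in the
    recipe's action) with the triggering event as its payload."""
    lambda_client.invoke(
        FunctionName=arn,            # ARN taken from the recipe action
        InvocationType="Event",      # asynchronous invocation
        Payload=json.dumps({"event": event, "recipe": recipe}),
    )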
External services: RIPPLE supports two general services:
Globus and Amazon Simple Messaging Service (SMS).
Globus allows users to transfer, replicate, synchronize, share,
and delete data from arbitrary storage locations. Globus actions
can be configured to use any of these capabilities. To do so,
we have developed a Lambda function that authenticates with
Globus (using a predefined token included in the recipe def-
inition) and executes the appropriate action using the Globus
API. Depending on the operation to be performed, information
such as the source or destination endpoint is included in
the recipe definition. In addition to transferring data, the
same Lambda function is also capable of initiating Globus
delete commands and constructing Globus shared endpoints—
a means to securely share data between collaborators.
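A minimal sketch of such a Lambda function is shown below, assuming the recipe action fields of Listing 1 arrive in the SNS message; target-modifier substitution and error handling are omitted, and the payload shape is an assumption.

import json
import globus_sdk

def handler(event, context):
    """Submit a Globus transfer described by a RIPPLE recipe action."""
    # Assumed payload shape: the SNS message carries the recipe action
    # and the path of the file that raised the event.
    message = json.loads(event["Records"][0]["Sns"]["Message"])
    action = message["action"]

    tc = globus_sdk.TransferClient(
        authorizer=globus_sdk.AccessTokenAuthorizer(action["access_token"]))

    tdata = globus_sdk.TransferData(
        tc, action["source_ep"], action["dest_ep"])
    tdata.add_item(message["path"], action["target_path"])
    task = tc.submit_transfer(tdata)
    return {"task_id": task["task_id"]}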
RIPPLE can send emails to notify users of events such as
new data being placed in an archive or data being deleted.
This capability is based on integration with Amazon’s Simple
Messaging Service (SMS) and again uses a Lambda function
to perform this operation using a customizable destination
email address.
Local execution: There are many scenarios in which the
desired action of a rule is to perform an operation on a
local file directly. RIPPLE provides this support in one of two
ways: either by running a Docker container or by spawning a
subprocess. Docker encapsulates execution so that an action
can be run reliably and securely on arbitrary
endpoints. Examples of supported actions include extracting
metadata, creating a persistent identifier for a dataset, and
compressing data prior to archival. However, Docker is not
always suitable due to its need for administrator privileges.
For example, in situations where users do not have permission
to run Docker containers (such as the HPC resources used
by the ALS scenario), we instead leverage Python subprocess
commands to execute different scripts and applications. Ex-
amples of RIPPLE's subprocess execution include modifying
files using Linux commands, creating batch submission files,
and dispatching jobs to a supercomputer execution queue.
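The sketch below illustrates both local execution paths; the container image, mount point, and the use of sbatch (assuming a Slurm scheduler, as at NERSC) are placeholders rather than details from the RIPPLE implementation.

import subprocess

def run_in_container(image, path):
    """Run an action inside a Docker container, mounting the directory
    that contains the triggering file (requires Docker privileges)."""
    subprocess.run(
        ["docker", "run", "--rm", "-v", f"{path}:/data", image],
        check=True)

def submit_batch_job(script_path):
    """Fallback for systems without Docker (e.g., HPC login nodes):
    dispatch a previously generated batch submission file."""
    subprocess.run(["sbatch", script_path], check=True)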
Fig. 2: Number of events processed per second on different machines with the inotify and polling observers.
V. EVALUATION
We explore the performance and scalability of the RIPPLE
system from three perspectives: event detection, event filtering,
and execution of actions when Amazon Lambda functions are
in different states of readiness.
A. Event Detection
We deployed RIPPLE over three machines: a personal lap-
top, a c4.xlarge AWS EC2 instance, and a supercomputer
login node (NERSC's Edison). RIPPLE's event detection rate
has been determined for each of these machines by timing
how long it takes for 10,000 events to be detected. To min-
imize overhead, we disable RIPPLE's filtering (rule condition
matching) capabilities and simply count the events that are
detected. Fig. 2 shows the performance of two distinct event
observation methods: inotify and polling. In each experiment,
we first established the observer and then created 10,000
files in a monitored directory, touched each file to raise a
modification event, and deleted each file to raise a deletion
event. The results show that the personal laptop and EC2
instance are capable of detecting more than 10,000 events per
second with both the polling and inotify observers. We note
that modification events are not detected as reliably as creation
and deletion events with either observer. This is shown in the
figure: fewer than 1,000 modification events are recorded when
10,000 files were modified in the space of 0.2 seconds. These
results are expected when using the polling observer as it is
configured to take snapshots just once a second. The NERSC
experiments demonstrate the lowest event throughput as they
were conducted on a large networked file system.
B. Filtering Cost
In order to understand the overhead incurred when the RIPPLE
agent filters events, we investigated the rate at which events
Fig. 3: Event filtering overhead.
are detected when filtering is conducted. Fig. 3 shows the
overhead incurred by the filtering process. The figure shows
that the overhead is typically largest for creation events. This
is due to the filters enabled in the experiment. The experiment
employs a two-step filtering process where each event is first
evaluated to see if it is of type create and, if so, the
filename is compared to a condition. The overhead caused by
filtering is negligible for modification events due to the limited
detection rate.
C. Lambda Performance
We explore the performance of four distinct Lambda func-
tions which perform one of the following tasks: initiate a
Globus transfer, send an email, log data in a database, and
query a database. Lambda functions are said to be in a cold
state if they must be started in response to an invocation; fol-
lowing an invocation, the function becomes warm, or cached.
Fig. 4 shows the average execution time for each of the
four Lambda functions in both cold and warm states. The
invocation time is computed as the difference between the
reported time a request is placed in a SNS queue and the
time that the Lambda function starts. The execution time is
the reported duration of the Lambda function’s execution. The
results show a significant overhead incurred by cold functions
and that invoking Globus actions requires substantially more
time than AWS services. This is in part due to the requirement
of the Lambda function to import the globus-sdk module.
VI. USE CASES
To explore RIPPLE's ability to meet application require-
ments we have deployed testbeds of the scenarios discussed
in Section III. In each case we have deployed RIPPLE agents
and developed a suite of recipes to automate their respective
data lifecycles.
Fig. 4: Execution and invocation time of various Lambda functions in warm and cold states.
Fig. 5: LSST testbed and automated data lifecycle.
A. RIPPLE and LSST
To represent the LSST scenario we deployed a testbed
with three AWS instances and Sparrow, a BlackPearl tape
storage system at Argonne National Laboratory. The three
AWS instances represent the observatory and two custodial
stores (Chile and NCSA). Within each instance, we created
multiple Globus shared endpoints to represent different storage
facilities. For example, the instance representing the observa-
tory has shared endpoints mounted at /telescope and /archiver.
The custodial storage instances mount shared endpoints to
represent the landing zone, lower performance storage (mag-
netic), and archival tape storage. The instance representing
the Chilean store uses Argonne's Sparrow to archive data.
Each instance has an individual RIPPLE agent deployed and is
configured with recipes to monitor the local shared endpoints.
Finally, to represent the LSST file catalog we use an AWS
RDS database that manages information about each file stored
in the custodial stores.
Fig. 5 provides an overview of the testbed and the associated
data lifecycle. The data flow is initiated by (1) the telescope
generating an image, typically stored in a FITS format, which
is then placed at the archiver. RIPPLE filters filesystem events
within the archiver directory for those with the FITS extension.
Creation of a file in this directory triggers a recipe to (2)
perform a high-speed HTTPS data upload to the Chilean
custodial store. Once data arrive in the custodial landing zone,
(3, 4) recipes are triggered that launch local Docker containers
to perform metadata extraction and cataloging for each new
file. Metadata regarding the file are placed in a local JSON
file so that Globus’ search capabilities can index the file.
Additionally, a unique identifier for the file is generated using
the Minid [13] service before being (5) inserted into the file
catalog. The data are then (6) automatically synchronized to
the NCSA custodial store where similar metadata processing
occurs. The use of the Minid service allows us to use the same
unique identifier for the file regardless of location, meaning
the file catalog is consistent between stores. As data are
propagated down the storage tiers (7) within each custodial
store, the metadata and file catalogs are dynamically updated.
Prior to archiving the data, a RIPPLE recipe triggers file
compression; creation of the compressed (gzip) file triggers
the final recipe that transfers the file to Sparrow.
Recovery and consistency are of utmost importance for
LSST as files cannot be recreated. The LSST testbed has been
instrumented with RIPPLE recipes to detect the deletion and
modification of FITS files. Any file that is found to be deleted
or corrupted is automatically replaced by a copy from the other
custodial store.
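An illustrative recovery recipe of this kind, written as a Python dictionary in the notation of Listing 1, might look as follows; the directory, endpoint names, and deletion-event type (taken from Watchdog's event names) are placeholders rather than values from the testbed.

# Hypothetical recipe: when a FITS file disappears from the Chilean store,
# transfer a replacement copy from the NCSA custodial store.
recovery_recipe = {
    "trigger": {
        "monitor": "filesystem",
        "event": "FileDeletedEvent",          # Watchdog deletion event
        "directory": "/custodial/chile/",     # placeholder path
        "filename": ".*.fits$",
    },
    "action": {
        "service": "globus",
        "type": "transfer",
        "source_ep": "ncsa-custodial-store",  # placeholder endpoint names
        "dest_ep": "chile-custodial-store",
        "target_name": "$filename",
        "target_path": "/~/$filename",
        "access_token": "<access token>",
    },
}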
B. RIPPLE and ALS
The ALS testbed has been constructed by deploying a
RIPPLE agent on both an ALS machine and a NERSC login
node. For exploratory purposes, we were given access to a
heartbeat application to reproduce data being generated from
an ALS beamline. We implemented recipes to manage the data
lifecycle of beamline datasets from creation, through execu-
tion, to finally sharing the results with specific collaborators.
Once the heartbeat application finishes writing results to an
HDF5 file, it creates a new file that signifies the process's
completion. The system detects this subsequent file creation
and initiates a transfer of the HDF5 output to NERSC. On
arrival at NERSC (detected by the Globus Transfer API
observer) a recipe ensures that a metadata file and a batch
submission file are dynamically created using the input. The
creation of the batch submission file triggers the workload
to be submitted to the Edison supercomputer’s queue. Once
the workload completes, the output file is detected and
transferred back to the ALS machine. On arrival at the ALS,
a shared endpoint is created to expose the resulting
dataset to a set of collaborators. Finally, an email notification
is sent to inform collaborators of the new data and direct
them to the shared endpoint.
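For instance, the recipe that dispatches the generated batch submission file could be expressed as follows. The paper does not show the schema for local-execution actions, so the action field names here are hypothetical, chosen by analogy with Listing 1.

# Hypothetical recipe: when a batch submission file appears at NERSC,
# dispatch it to the Edison queue via a local subprocess action.
# Only the trigger fields follow Listing 1; the action fields are illustrative.
submit_recipe = {
    "trigger": {
        "monitor": "filesystem",
        "event": "FileCreatedEvent",
        "directory": "/scratch/als/",   # placeholder path
        "filename": ".*.sbatch$",
    },
    "action": {
        "service": "local",             # hypothetical local-execution service
        "type": "subprocess",
        "command": "sbatch $filename",  # $filename substituted as in Listing 1
    },
}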
VII. CONCLUSION
RIPPLE aims to simplify the management of complex data
lifecycles. In the same way that home automation systems
simplify and automate tedious tasks associated with managing
a large number of home appliances, RIPPLE supports auto-
mated actions in response to various events. In this paper we
described how RIPPLE can achieve these goals and investi-
gated the scalability challenges associated with deploying such
a service in practice. We showed that RIPPLE can be used
to automate two real-world workflows that manage data in
two large scientific scenarios. These examples used RIPPLE to
automate data transfer, replication, cataloging, recovery, HPC
execution, archival, and sharing.
In future work we aim to apply RIPPLE to additional
scientific use cases to gather requirements and generalize our
model. We ultimately aim to integrate RIPPLE in the Globus
platform, enabling thousands of users to create custom data
flows. We will also investigate developing a programming
model, inspired by IFTTT, to simplify the definition of flows.
ACKNOWLEDGMENTS
This work was supported in part by the U.S. Department of
Energy under contract DE-AC02-06CH11357.
REFERENCES
[1] D. Atkins, T. Hey, and M. Hedstrom, “National Science Foundation
Advisory Committee for Cyberinfrastructure Task Force on Data and
Visualization Final Report,” National Science Foundation, Tech. Rep.,
2011.
[2] J. P. Birnholtz and M. J. Bietz, “Data at work: Supporting sharing in
science and engineering,” in International ACM SIGGROUP Conference
on Supporting Group Work. ACM, 2003, pp. 339–348.
[3] K. Chard, S. Tuecke, and I. Foster, “Efficient and secure transfer,
synchronization, and sharing of big data,” IEEE Cloud Computing,
vol. 1, no. 3, pp. 46–55, 2014.
[4] A. Rajasekar, M. Wan, R. Moore, and W. Schroeder, “A prototype rule-
based distributed data management system,” in HPDC workshop on Next
Generation Distributed Data Management, vol. 102, 2003.
[5] A. Rajasekar, R. Moore, C.-y. Hou, C. A. Lee, R. Marciano, A. de Torcy,
M. Wan, W. Schroeder, S.-Y. Chen, L. Gilbert, P. Tooby, and B. Zhu,
“iRODS Primer: Integrated rule-oriented data system,” Synthesis Lec-
tures on Information Concepts, Retrieval, and Services, vol. 2, no. 1,
pp. 1–143, 2010.
[6] R. Schuler, C. Kesselman, and K. Czajkowski, “Data centric discovery
with a data-oriented architecture,” in 1st Workshop on The Science of
Cyberinfrastructure: Research, Experience, Applications and Models,
ser. SCREAM ’15. New York, NY, USA: ACM, 2015, pp. 37–44.
[7] T. Leibovici, “Taking back control of HPC file systems with Robinhood
Policy Engine,” arXiv preprint arXiv:1505.01448, 2015.
[8] “SPADE.” [Online]. Available: http://nest.lbl.gov/projects/spade/html/.
Visited March, 2017.
[9] J. Deslippe, A. Essiari, S. J. Patton, T. Samak, C. E. Tull, A. Hexemer,
D. Kumar, D. Parkinson, and P. Stewart, “Workflow management for
real-time analysis of lightsource experiments,” in 9th Workshop on
Workflows in Support of Large-Scale Science. IEEE Press, 2014, pp.
31–40.
[10] M. Jurić, J. Kantor, K. Lim, R. H. Lupton, G. Dubois-Felsmann,
T. Jenness, T. S. Axelrod, J. Aleksić, R. A. Allsman, Y. AlSayyad et al.,
“The LSST data management system,” arXiv preprint arXiv:1512.07914,
2015.
[11] “If This Then That,” http://www.ifttt.com. Visited March, 2017.
[12] B. Ur, M. Pak Yong Ho, S. Brawner, J. Lee, S. Mennicken, N. Picard,
D. Schulze, and M. L. Littman, “Trigger-action programming in the
wild: An analysis of 200,000 IFTTT recipes,” in CHI Conference on
Human Factors in Computing Systems. ACM, 2016, pp. 3227–3231.
[13] K. Chard, M. D’Arcy, B. Heavner, I. Foster, C. Kesselman, R. Madduri,
A. Rodriguez, S. Soiland-Reyes, C. Goble, K. Clark, E. W. Deutsch,
I. Dinov, N. Price, and A. Toga, “I’ll take that to go: Big data bags and
minimal identifiers for exchange of large, complex datasets,” in IEEE
International Conference on Big Data, Washington, DC, USA, 2016.