Toward Scalable Monitoring on Large-Scale Storage for
Soware Defined Cyberinfrastructure
Arnab K. Paul
Virginia Tech
Ryan Chard
Argonne National Laboratory
Kyle Chard
University of Chicago
Steven Tuecke
University of Chicago
Ali R. Butt
Virginia Tech
Ian Foster
Argonne and University of Chicago
ABSTRACT
As research processes become yet more collaborative and increas-
ingly data-oriented, new techniques are needed to efficiently man-
age and automate the crucial, yet tedious, aspects of the data life-
cycle. Researchers now spend considerable time replicating, cat-
aloging, sharing, analyzing, and purging large amounts of data,
distributed over vast storage networks. Software Defined Cyberin-
frastructure (SDCI) provides a solution to this problem by enhanc-
ing existing storage systems to enable the automated execution of
actions based on the specification of high-level data management
policies. Our SDCI implementation, called Ripple, relies on agents
being deployed on storage resources to detect and act on data events.
However, current monitoring technologies, such as inotify, are not
generally available on large or parallel file systems, such as Lustre.
We describe here an approach for scalable, lightweight, event detec-
tion on large (multi-petabyte) Lustre file systems. Together, Ripple
and the Lustre monitor enable new types of lifecycle automation
across both personal devices and leadership computing platforms.
ACM Reference format:
Arnab K. Paul, Ryan Chard, Kyle Chard, Steven Tuecke, Ali R. Butt, and Ian
Foster. 2017. Toward Scalable Monitoring on Large-Scale Storage for Soft-
ware Dened Cyberinfrastructure. In Proceedings of PDSW-DISCS’17: Second
Joint International Workshop on Parallel Data Storage & Data Intensive Scal-
able Computing Systems, Denver, CO, USA, November 12–17, 2017 (PDSW-
DISCS’17), 6 pages.
DOI: 10.1145/3149393.3149402
1 INTRODUCTION
The data-driven and distributed nature of modern research means
scientists must manage complex data lifecycles, across large-scale
and distributed storage networks. As data scales increase so too
does the overhead of data management—a collection of tasks and
processes that are often tedious and repetitive, such as replicating,
cataloging, sharing, and purging data. Software Defined Cyberin-
frastructure (SDCI) [5] can drastically lower the cost of performing
many of these tasks by transforming humble storage devices into
“active” environments in which such tasks are automatically exe-
cuted in response to data events. SDCI enables high-level policies
to be dened and applied to storage systems, thereby facilitating
automation throughout the end-to-end data lifecycle. We have pre-
viously presented a prototype SDCI implementation, called Rip-
ple [4], capable of performing various actions in response to file
system events.
Ripple empowers scientists to express and automate mundane
data management tasks. Using a simple If-Trigger-Then-Action rule
notation, users program their storage devices to respond to specific
events and invoke custom actions. For example, one can express
that when les appear in a specic directory of their laboratory
machine they are automatically analyzed and the results replicated
to their personal device. Ripple supports inotify-enabled storage
devices (such as personal laptops); however inotify is not often
supported on large-scale or parallel file systems. To support large-
scale le systems we have developed a scalable monitoring solution
for the Lustre [13] file system. Our monitor exploits Lustre’s internal
metadata capabilities and uses a hierarchical approach to collect,
aggregate, and broadcast data events for even the largest storage
devices. Using this monitor Ripple agents can consume site-wide
events in real time, enabling SDCI over leadership class computing
platforms.
In this paper we present our scalable Lustre monitor. We analyze
the performance of our monitor using two Lustre file systems: an
Amazon Web Service deployment and a high performance deploy-
ment at Argonne National Laboratory’s (ANL) Leadership Comput-
ing Facility (ALCF). We show that our monitor is a scalable, reliable,
and light-weight solution for collecting and aggregating file system
events such that SDCI can be applied to multi-petabyte storage
devices.
The rest of this paper is organized as follows: Section 2 presents
related work. Section 3 discusses the SDCI concept and our imple-
mentation, Ripple. Section 4 describes our scalable monitor. We
evaluate our monitor in Section 5 before presenting concluding
remarks and future research directions in Section 6.
2 RELATED WORK
SDCI and data-driven policy engines are essential for reliably per-
forming data management tasks at scale. A common requirement for
these tools is the reliable detection of trigger events. Prior efforts in
this space have applied various techniques including implementing
data management abstraction layers and reliance on applications
to raise events. For example, the integrated Rule-Oriented Data
System [11] works by ingesting data into a closed data grid such
that it can manage the data and monitor events throughout the data
lifecycle. Other SDCI-like implementations rely on applications to
raise trigger events [1].
Monitoring distributed systems is crucial to their effective op-
eration. Tools such as MonALISA [9] and Nagios [2] have been
developed to provide insight into the health of resources and pro-
vide the necessary information to debug, optimize, and effectively
operate large computing platforms. Although such tools generally
expose le system status, utilization, and performance statistics,
they do not capture and report individual le events. Thus, these
tools cannot be used to enable fine-grained data-driven rule engines,
such as Ripple.
Other data-driven policy engines, such as IOBox [12], also re-
quire individual data events. IOBox is an extract, transform, and
load (ETL) system, designed to crawl and monitor local file systems
to detect le events, apply pattern matching, and invoke actions.
Like the initial implementation of Ripple, IOBox is restricted to us-
ing either inotify or a polling mechanism to detect trigger events. It
therefore cannot be applied at scale to large or parallel file systems,
such as Lustre.
Monitoring of large Lustre file systems requires explicitly de-
signed tools [8]. One policy engine that leverages a custom Lustre
monitor is the Robinhood Policy Engine [7]. Robinhood facilitates
the bulk execution of data management actions over HPC file sys-
tems. Administrators can configure, for example, policies to migrate
and purge stale data. Robinhood maintains a database of file system
events, using it to provide various routines and utilities for Lustre,
such as tools to efficiently find files and produce usage reports.
Robinhood employs a centralized approach to collecting and ag-
gregating data events from Lustre file systems, where metadata is
sequentially extracted from each metadata server by a single client.
Our approach employs a distributed method of collecting, process-
ing, and aggregating these data. In addition, our monitor publishes
events to any subscribed listener, allowing external services to
utilize the data.
3 BACKGROUND: RIPPLE
SDCI relies on programmable agents being deployed across storage
and compute devices. Together, these agents create a fabric of smart,
programmable resources. These agents can be employed to monitor
the underlying infrastructure, detecting and reporting data events
of interest, while also facilitating the remote execution of actions
on behalf of users. SDCI is underpinned by the same concepts as
Software Dened Networking (SDN). A separation of data and
control planes enables the denition of high-level, abstract rules
that can then be distributed to, and enforced by, the storage and
compute devices comprising the system.
Ripple [4] enables users to define custom data management
policies which are then automatically enforced by participating
resources. Management policies are expressed as If-Trigger-Then-
Action style rules. Ripple’s implementation is based on a deployable
agent that captures events and a cloud service that manages the
reliable evaluation of rules and execution of actions. An overview
of Ripple’s architecture is depicted in Figure 1.
Architecture:
Ripple comprises a cloud-based service plus a
light-weight agent that is deployed on target storage systems. The
agent is responsible for detecting data events, filtering them against
active rules, and reporting events to the cloud service. The agent
also provides an execution component, capable of performing local
actions on a user’s behalf, for example running a container or
performing a data transfer with Globus [3].
A scalable cloud service processes events and orchestrates the
execution of actions. Ripple emphasizes reliability, employing mul-
tiple strategies to ensure events are not lost and that actions are
successfully completed. For example, agents repeatedly try to report
events to the service. Once an event is reported it is immediately
placed in a reliable Simple Queue Service (SQS) queue. Serverless
Amazon Lambda functions act on entries in this queue and remove
them once successfully processed. A cleanup function periodically
iterates through the queue and initiates additional processing for
events that were unsuccessfully processed.
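To make this flow concrete, the sketch below (our illustration, not Ripple's actual service code) shows how a reported event might be placed on an SQS queue and removed only after it has been handled successfully; the queue URL, event fields, and handler are assumptions.

```python
import json
import boto3  # AWS SDK for Python

sqs = boto3.client("sqs", region_name="us-east-1")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/ripple-events"  # hypothetical queue

def handle_event(event: dict) -> None:
    """Stand-in for rule evaluation and action dispatch (e.g., a Lambda function)."""
    print("processing:", event)

def enqueue_event(event: dict) -> None:
    """Agent-reported events are placed on the reliable queue as soon as they arrive."""
    sqs.send_message(QueueUrl=QUEUE_URL, MessageBody=json.dumps(event))

def cleanup_pass() -> None:
    """Periodic sweep: anything still on the queue was not processed and is retried."""
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=5)
    for msg in resp.get("Messages", []):
        handle_event(json.loads(msg["Body"]))
        # Delete the message only after it has been handled successfully.
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```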
Rules:
Ripple rules are distributed to agents to inform the event
filtering process and ensure relevant events are reported. A Ripple
rule consists of a trigger and an action. The trigger specifies the con-
ditions under which the action will be invoked. For example, a user
may set a rule to trigger when an image file is created in a specific
directory of their laptop. An action specifies the type of execution
to perform (such as initiating a transfer, sending an email, running
a docker container, or executing a local bash command, to name a
few), the agent on which to perform the action, and any necessary
parameters. These simple rules can be used to implement complex
pipelines whereby the output of one rule triggers a subsequent
action.
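As an illustration (the field names below are ours, not necessarily Ripple's rule schema), such a rule might look like the following.

```python
# A hypothetical If-Trigger-Then-Action rule: when a .tif image is created in a
# watched directory on one agent, transfer it to the user's personal device.
rule = {
    "trigger": {
        "agent": "lab-machine",           # agent on which the event must occur
        "event": "created",               # file-creation event
        "path": "/data/instrument/",      # monitored directory
        "match": "*.tif",                 # filename pattern
    },
    "action": {
        "agent": "lab-machine",           # agent that executes the action
        "type": "globus_transfer",        # e.g., transfer, email, docker, bash
        "parameters": {
            "destination": "personal-laptop:/home/user/results/",
        },
    },
}
```

Chaining such rules, so that the output of one action satisfies the trigger of another, yields the pipelines described above.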
Event Detection:
Ripple uses the Python Watchdog module to
detect events on the local file systems. Using tools such as inotify
and kqueue, Watchdog enables Ripple to function over a wide
range of operating systems. As rules are registered with an agent
users also specify the path to be monitored. The agent employs
“Watchers” on each directory relevant to a rule. As events occur in
a monitored directory, the agent processes them against the active
rules to determine whether the event is relevant and warrants
reporting to the cloud service.
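A minimal sketch of this style of detection with Watchdog is shown below; the monitored directory, the filename filter, and the reporting call are placeholders rather than Ripple's actual agent code.

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

def report_to_cloud(event: dict) -> None:
    print("would report:", event)   # stand-in for the HTTPS call to the cloud service

class RuleFilteringHandler(FileSystemEventHandler):
    """Forward only events that are relevant to an active rule."""

    def on_created(self, event):
        if not event.is_directory and event.src_path.endswith(".tif"):
            report_to_cloud({"event": "created", "path": event.src_path})

observer = Observer()   # Watchdog selects inotify, kqueue, etc. for the platform
observer.schedule(RuleFilteringHandler(), path="/data/instrument", recursive=True)
observer.start()
try:
    observer.join()
except KeyboardInterrupt:
    observer.stop()
    observer.join()
```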
Limitations:
A key limitation of Ripple is its inability to be
applied, at scale, to large storage devices (i.e., those that are not
supported by Watchdog). Further, our approach primarily relies
on targeted monitoring techniques, such as inotify, where specic
directories are monitored. Thus, Ripple cannot enforce rules which
are applied to many directories, such as site-wide purging policies.
Relying on targeted monitors presents a number of limitations.
For example, inotify has a large setup cost due to its need to crawl
the le system to place watchers on each monitored directory. This
is both time consuming and resource intensive, often consuming a
signicant amount of unswappable kernel memory. Each watcher
requires 1Kb of memory on a 64-bit machine, meaning over 512MB
of memory is required to concurrently monitor the default maxi-
mum (524,288) directories.
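The figure follows from a simple worked calculation, assuming the 1 KB-per-watch cost stated above:

```latex
524{,}288\ \text{watches} \times 1\,\text{KB per watch} = 524{,}288\,\text{KB} = 512\,\text{MB}
```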
We have explored an alternative approach using a polling tech-
nique to detect le system changes. However, crawling and record-
ing le system data is prohibitively expensive over large storage
systems.
Figure 1: Ripple architecture. A local agent captures and filters file events before reporting them to the cloud service for
processing. Actions are routed to agents for execution.
4 SCALABLE MONITORING
Ripple requires scalable monitoring techniques in order to be ap-
plied to leadership class storage systems. To address this need we
have developed a light-weight, scalable monitor to detect and report
data events for Lustre file systems. The monitor leverages Lustre’s
internal metadata catalog to detect events in a distributed manner
and aggregates them for evaluation. The monitor produces a com-
plete stream of all file system events to any subscribed device, such
as a Ripple agent. The monitor also maintains a rotating catalog
of events and an API to retrieve recent events in order to provide
fault tolerance.
Like other parallel le systems, Lustre does not support ino-
tify; however, it does maintain an internal metadata catalog, called
“ChangeLog.” An example ChangeLog is depicted in Table 1. Every
entry in a ChangeLog consists of the record number, type of file
event, timestamp, date, flags, target File Identifier (FID), parent FID,
and the target name. Lustre’s ChangeLog is distributed across a set
of Metadata Servers (MDS). Actions which cause changes in the
file system namespace or metadata are recorded in a single MDS
ChangeLog. Thus, to capture all changes made on a file system our
monitor must be applied to all MDS servers.
Our Lustre monitor, depicted in Figure 2, employs a hierarchi-
cal publisher-subscriber model to collect events from each MDS
ChangeLog and report them for aggregation. This model has been
proven to enable scalable data collection solutions, such as those
that monitor performance statistics from distributed Lustre storage
servers [10]. One Collector service is deployed for each MDS. The
Collector is responsible for interacting with the local ChangeLog to
extract new events before processing and reporting them. Events
are reported to a single Aggregator for posterity and publication
to consumers.
File events, such as creation, deletion, renaming, and attribute
changes, are recorded in the ChangeLog as a tuple containing a
timestamp, event type, parent directory identifier, and file name.
Our monitor collects, aggregates, and publishes these events using
three key steps:
(1) Detection:
Events are initially extracted from the ChangeLog
by a Collector. The monitor will deploy multiple Collec-
tors such that each MDS can be monitored concurrently.
Each new event detected by a Collector is required to be
processed prior to being reported.
(2) Processing:
Lustre’s ChangeLog uses parent and target
file identifiers (FIDs) to uniquely represent files and di-
rectories. These FIDs are not useful to external services,
such as Ripple agents, and must be resolved to absolute
path names. Therefore, once a new event is retrieved by
a Collector, it uses the Lustre fid2path tool to resolve FIDs
and establish absolute path names. The raw event tuples
are then refactored to include the user-friendly paths in
place of the FIDs before being reported.
(3) Aggregation:
A publisher-subscriber message queue (Ze-
roMQ [6]) is used to pass messages between the Collectors
and the Aggregator. Once an event is reported to the Aggre-
gator it is immediately placed in a queue to be processed.
The Aggregator is multi-threaded, enabling it to both pub-
lish events to subscribed consumers and store the events
in a local database with minimal overhead. The Aggrega-
tor maintains this database and exposes an API to enable
consumers to retrieve historic events.
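The Aggregator's core loop can be sketched as follows. This is a simplified, single-threaded illustration: the ports, field names, and storage schema are assumptions, and the rotating store and historic-events API are omitted. It assumes Collectors publish JSON event records like those produced in the Collector sketch shown after the next paragraph.

```python
import sqlite3
import zmq  # pyzmq

ctx = zmq.Context()

collectors = ctx.socket(zmq.SUB)
collectors.bind("tcp://*:5556")                    # Collectors connect and publish here
collectors.setsockopt_string(zmq.SUBSCRIBE, "")    # accept every event

consumers = ctx.socket(zmq.PUB)
consumers.bind("tcp://*:5557")                     # subscribed consumers (e.g., Ripple agents)

db = sqlite3.connect("events.db")
db.execute("CREATE TABLE IF NOT EXISTS events (type TEXT, time TEXT, path TEXT, mdt TEXT)")

while True:
    event = collectors.recv_json()                 # event dict reported by a Collector
    consumers.send_json(event)                     # re-publish immediately to consumers
    db.execute("INSERT INTO events VALUES (?, ?, ?, ?)",
               (event.get("type"), event.get("time"), event.get("path"), event.get("mdt")))
    db.commit()
```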
Collectors are also responsible for purging their respective
ChangeLogs. Each Collector maintains a pointer to the most re-
cently extracted event and can therefore clear the ChangeLog of
previously processed events. This ensures that events are not missed
and also means the ChangeLog will not become overburdened with
stale events.
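A simplified Collector might therefore look like the sketch below, which reads new records with the standard lfs changelog command, resolves parent FIDs with lfs fid2path, publishes processed events over ZeroMQ, and clears consumed records with lfs changelog_clear. This is an illustration rather than the production implementation: the mount point, MDT name, ChangeLog user id, Aggregator endpoint, and record parsing are assumptions.

```python
import subprocess
import time

import zmq  # pyzmq

MOUNT = "/mnt/lustre"                       # Lustre client mount point (assumption)
MDT = "lustre-MDT0000"                      # metadata target watched by this Collector
CHANGELOG_USER = "cl1"                      # id returned by `lctl changelog_register` on the MDS
AGGREGATOR = "tcp://aggregator-host:5556"   # hypothetical Aggregator endpoint

def fid2path(fid: str) -> str:
    """Resolve a FID such as '[0x200000007:0x1:0x0]' to an absolute path."""
    out = subprocess.run(["lfs", "fid2path", MOUNT, fid],
                         capture_output=True, text=True, check=False)
    return out.stdout.strip() or fid        # fall back to the raw FID if resolution fails

def read_changelog(start_rec: int):
    """Yield (record number, type, timestamp, date, parent FID, target name) tuples."""
    out = subprocess.run(["lfs", "changelog", MDT, str(start_rec)],
                         capture_output=True, text=True, check=True)
    for line in out.stdout.splitlines():
        fields = line.split()
        if len(fields) < 8:
            continue                        # skip records that carry no target name
        rec, etype, tstamp, date = fields[0], fields[1], fields[2], fields[3]
        parent_fid = fields[6].split("=", 1)[1]   # "p=[...]" -> "[...]"
        yield int(rec), etype, tstamp, date, parent_fid, fields[7]

ctx = zmq.Context()
pub = ctx.socket(zmq.PUB)
pub.connect(AGGREGATOR)                     # several Collectors feed one bound Aggregator

last_rec = 0
while True:
    newest = last_rec
    for rec, etype, tstamp, date, parent_fid, name in read_changelog(last_rec + 1):
        pub.send_json({"type": etype, "time": f"{date} {tstamp}",
                       "path": f"{fid2path(parent_fid)}/{name}", "mdt": MDT})
        newest = rec
    if newest > last_rec:
        # Purge consumed records so the ChangeLog does not accumulate stale entries.
        subprocess.run(["lfs", "changelog_clear", MDT, CHANGELOG_USER, str(newest)], check=False)
        last_rec = newest
    time.sleep(1)
```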
5 EVALUATION
We have deployed our monitor over two Lustre testbeds to analyze
the performance, overheads, and bottlenecks of our solution. Before
investigating the monitor’s performance we first characterize the
capabilities of the testbeds to determine the rate at which events
are generated. Using a specifically built event generation script, we
Table 1: A Sample ChangeLog Record.
Event ID Type Timestamp Datestamp Flags Target FID Parent FID Target Name
13106 01CREAT 20:15:37.1138 2017.09.06 0x0 t=[0x200000402:0xa046:0x0] p=[0x200000007:0x1:0x0] data1.txt
13107 02MKDIR 20:15:37.5097 2017.09.06 0x0 t=[0x200000420:0x3:0x0] p=[0x61b4:0xca2c7dde:0x0] DataDir
13108 06UNLNK 20:15:37.8869 2017.09.06 0x1 t=[0x200000402:0xa048:0x0] p=[0x200000007:0x1:0x0] data1.txt
Figure 2: The scalable Lustre monitor used to collect, aggre-
gate, and publish events to Ripple agents.
apply the monitor under high load to determine maximum through-
put and identify bottlenecks. Finally, we use file system dumps from
a production 7PB storage system to evaluate whether the monitor
is capable of supporting very large-scale storage systems.
5.1 Testbeds
We employ two testbeds to evaluate the monitor’s performance. The
first testbed, referred to as AWS, is a cloud deployment of Lustre
using five Amazon Web Service EC2 instances. The deployment
uses Lustre Intel Cloud Edition, version 1.4, to construct a 20GB
Lustre file system over five, low-performance, t2.micro instances
and an unoptimized EBS volume. The configuration includes two
compute nodes, a single Object Storage Server (OSS), an MGS, and
one MDS.
The second testbed provides a larger, production-quality, storage
system. This testbed, referred to as Iota, uses Argonne National Lab-
oratory’s Iota cluster’s file system. Iota is one of two pre-exascale
systems at Argonne and comprises 44 compute nodes, each with 72
cores and 128GB of memory. Iota’s 897TB Lustre store leverages the
same high performance hardware and configuration (including four
MDS) as the 150PB store planned for deployment with the Aurora
supercomputer. However, it is important to note that at present, the
file system is not yet configured to load balance metadata across
all four MDS, thus these tests were performed with just one MDS.
As a baseline analysis we first compare operation throughput on
each file system. We use a Python script to record the time taken
to create, modify, or delete 10,000 files on each file system. The
performance of these two parallel file systems differs substantially,
as is shown in Table 2. Due to the low-performance nature of the
instances comprising the AWS testbed (t2.micro), just 352 files could
be written per second. A total of 1366 events can be generated per
second. As expected, the performance of the Iota testbed signif-
icantly exceeded this rate. It is able to create over 1300 files per
second and more than 9500 total events per second.
Table 2: Testbed Performance Characteristics.
AWS Iota
Storage Size 20GB 897TB
Files Created (events/s) 352 1389
Files Modied (events/s) 534 2538
Files Deleted (events/s) 832 3442
Total Events (events/s) 1366 9593
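For reference, the baseline rates in Table 2 can be reproduced with a simple timed loop of file operations of the following form (a sketch; the actual script, directory layout, and file sizes may differ).

```python
import os
import time

TEST_DIR = "/mnt/lustre/benchmark"     # directory on the file system under test (assumption)
N = 10_000

os.makedirs(TEST_DIR, exist_ok=True)
paths = [os.path.join(TEST_DIR, f"file_{i:05d}") for i in range(N)]

def timed(label, op):
    start = time.time()
    for p in paths:
        op(p)
    print(f"{label}: {N / (time.time() - start):.0f} events/s")

timed("create", lambda p: open(p, "w").close())        # file-creation events
timed("modify", lambda p: open(p, "a").write("x"))     # modification events
timed("delete", os.remove)                             # unlink events
```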
5.2 Results
To investigate the performance of our monitor we use the Python
script to generate file system events while our monitor extracts
them from an MDS ChangeLog, processes them, and reports them
to a listening Ripple agent. To minimize the overhead caused by
passing messages over the network, we have conducted these tests
on a single node. The node is also instrumented to collect mem-
ory and CPU counters during the tests to determine the resource
utilization of the collection and aggregation processes.
Event Throughput:
Our event generation script combines file
creation, modification, and deletion to generate multiple events for
each file. Using this technique we are able to generate over 1300
events per second on AWS and more than 9500 events per second
on Iota.
When generating 1366 events per second the AWS-based monitor
is capable of detecting, processing, and reporting just 1053 events per second to the
consuming Ripple agent. Analysis of the monitor’s pipeline shows
that the throughput is primarily limited by the preprocessing step
following events being extracted from a ChangeLog. This is due
in part to the low-performance, t2.micro instance types used in
the testbed. When experimenting on the Iota testbed we found the
monitor is able to process and report, on average, 8162 events per
second. This is 14.91% lower than the maximum event generation
rate achieved on the testbed. Although this is an improvement
over the AWS testbed, we found the overhead to be caused by the
repetitive use of the fid2path tool when resolving an event’s absolute
path. To alleviate this problem we plan to process events in batches,
rather than independently, and temporarily cache path mappings
to minimize the number of invocations. Another limitation with
this experimental conguration is the use of a single MDS. If the
fid2path resolutions were distributed across multiple MDS, the
throughput of the monitor would surpass the event generation rate.
It is important to note that there is no loss of events once they
have been processed, meaning the aggregation and reporting steps
introduce no additional overhead.
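One way to realize the planned caching of path mappings (a sketch of the idea, not yet part of the monitor) is to memoize parent-directory FID resolutions, since bursts of events frequently share a parent directory:

```python
import subprocess
from functools import lru_cache

MOUNT = "/mnt/lustre"   # Lustre client mount point (assumption)

@lru_cache(maxsize=65536)
def cached_fid2path(parent_fid: str) -> str:
    """Resolve a parent-directory FID once and reuse the result for later events."""
    out = subprocess.run(["lfs", "fid2path", MOUNT, parent_fid],
                         capture_output=True, text=True, check=False)
    return out.stdout.strip() or parent_fid
```

Because directory renames would invalidate cached paths, such a cache should be short-lived or cleared when rename events are observed, which is why we describe the caching as temporary.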
Monitor Overhead:
We have captured the CPU and memory
utilization of the Collector, Aggregator, and a Ripple agent con-
sumer processes. Table 3 shows the peak resource utilization during
the Iota throughput experiments. These results show the CPU cost
of operating the monitor is small. The memory footprint is due to
the use of a local store that records a list of every event captured by
the monitor. In a production setting we could further limit the size
of this local store, which would in turn reduce the overall resource
usage. We conclude that when using an appropriate maximum store
size, deploying these components on the physical MDS and MGS
servers would induce negligible overhead on their performance.
Table 3: Maximum Monitor Resource Utilization.
CPU (%) Memory (MB)
Collector 6.667 281.6
Aggregator 0.059 217.6
Consumer 0.02 12.8
5.3 Scaling Performance
Understanding the throughput of the monitor only provides value
when put in the context of real-world requirements. Thus, we ana-
lyzed NERSC’s production 7.1PB GPFS file system, called tlproject2.
This system has 16,506 users and over 850 million files. We analyzed
file system dumps from a 36 day period and compared consecutive
days to establish the number of files that are created or changed
each day. It is important to note that this method does not represent
an accurate value for the number of times a file is modified, as only
the most recent file modification is detectable, and also does not
account for short-lived files.
As shown in Figure 3, we found a peak of over 3.6 million differ-
ences between two consecutive days. When distributed over a 24
hour period this equates to just 42 events per second. Assuming a
worst-case scenario where all of these events occur within an eight
hour period results in approximately 127 events per second, still
well within the monitor’s performance range. Although only hy-
pothetical, if we assume events scale linearly with storage size, we
can extrapolate and expect Aurora’s 150PB to generate 25 times as
many events, or 3,178 events per second, which is also well within
the capabilities of the monitor. It should be noted that this estimate
could signicantly underestimate the peak generation of le events.
Further online monitoring of such devices is necessary to account
for short lived les, le modications, and the sporadic nature of
data generation.
6 CONCLUSION
SDCI can resolve many of the challenges associated with routine
data management processes, enabling researchers to automate many
Figure 3: The number of files created and modified on
NERSC’s 7.1PB GPFS file system, tlproject2, over a 35 day
period.
of the tedious tasks they must perform. In prior work we presented
a system for enabling such automation; however, it was designed
using libraries commonly available on personal computers but
not often available on large-scale storage systems. Our scalable
Lustre monitor addresses this shortcoming and enables Ripple
to be used on some of the world’s largest storage systems. Our
results show that the Lustre monitor is able to detect, process, and
report thousands of events per second—a rate sufficient to meet the
predicted needs of the forthcoming 150PB Aurora le system.
Our future research focuses on investigating monitor perfor-
mance when using multiple distributed MDS, exploring and evalu-
ating dierent message passing techniques between the collection
and aggregation points, and comparing performance against Robin-
hood in production settings. We will also further investigate the
behavior of large le systems to more accurately characterize the
requirements of our monitor. Finally, we are actively working to
deploy Ripple on production systems and in real scientific data
management scenarios, thereby demonstrating the value
of SDCI concepts in scientific computing platforms.
ACKNOWLEDGMENTS
This research used resources of the Argonne Leadership Computing
Facility, which is a DOE Office of Science User Facility supported
under Contract DE-AC02-06CH11357. We also acknowledge gener-
ous research credits provided by Amazon Web Services. This work
is also sponsored in part by the NSF under the grants: CNS-1565314,
CNS-1405697, and CNS-1615411.
REFERENCES
[1]
M. AbdelBaky, J. Diaz-Montes, and M. Parashar. Software-defined environments
for science and engineering. The International Journal of High Performance
Computing Applications, page 1094342017710706, 2017.
[2] W. Barth. Nagios: System and network monitoring. No Starch Press, 2008.
[3]
K. Chard, S. Tuecke, and I. Foster. Efficient and secure transfer, synchronization,
and sharing of big data. IEEE Cloud Computing, 1(3):46–55, 2014.
[4]
R. Chard, K. Chard, J. Alt, D. Y. Parkinson, S. Tuecke, and I. Foster. RIPPLE: Home
Automation for Research Data Management. In The 37th IEEE International
Conference on Distributed Computing Systems (ICDCS), 2017.
[5]
I. Foster, B. Blaiszik, K. Chard, and R. Chard. Software Defined Cyberinfrastruc-
ture. In The 37th IEEE International Conference on Distributed Computing Systems
(ICDCS), 2017.
[6]
P. Hintjens. ZeroMQ: messaging for many applications. O’Reilly Media, Inc.,
2013.
[7]
T. Leibovici. Taking back control of HPC file systems with Robinhood Policy
Engine. arXiv preprint arXiv:1505.01448, 2015.
[8] R. Miller, J. Hill, D. A. Dillow, R. Gunasekaran, G. M. Shipman, and D. Maxwell.
Monitoring tools for large scale systems. In Proceedings of Cray User Group
Conference (CUG 2010), 2010.
[9]
H. B. Newman, I. C. Legrand, P. Galvez, R. Voicu, and C. Cirstoiu. Monalisa: A
distributed monitoring service architecture. arXiv preprint cs/0306096, 2003.
[10]
A. K. Paul, A. Goyal, F. Wang, S. Oral, A. R. Butt, M. J. Brim, and S. B. Srinivasa. I/O
load balancing for big data HPC applications. In 5th IEEE International Conference
on Big Data (Big Data), 2017.
[11]
A. Rajasekar, R. Moore, C.-y. Hou, C. A. Lee, R. Marciano, A. de Torcy, M. Wan,
W. Schroeder, S.-Y. Chen, L. Gilbert, P. Tooby, and B. Zhu. iRODS Primer: In-
tegrated rule-oriented data system. Synthesis Lectures on Information Concepts,
Retrieval, and Services, 2(1):1–143, 2010.
[12]
R. Schuler, C. Kesselman, and K. Czajkowski. Data centric discovery with a
data-oriented architecture. In 1st Workshop on The Science of Cyberinfrastructure:
Research, Experience, Applications and Models, SCREAM ’15, pages 37–44, New
York, NY, USA, 2015. ACM.
[13]
P. Schwan et al. Lustre: Building a file system for 1000-node clusters. In Proceed-
ings of the 2003 Linux symposium, volume 2003, pages 380–386, 2003.