Preprint of an article published in Future Generation Computer Systems. DOI: 10.1016/j.future.2018.05.051
Transferring a Petabyte in a Day¹
Rajkumar Kettimuthu (a), Zhengchun Liu (a), David Wheeler (d), Ian Foster (a,b), Katrin Heitmann (a,c), Franck Cappello (a)
(a) Mathematics & Computer Science Division, Argonne National Laboratory, Lemont, IL 60439, USA
(b) Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
(c) High Energy Physics Division, Argonne National Laboratory, Lemont, IL 60439, USA
(d) National Center for Supercomputing Applications, University of Illinois at Urbana–Champaign, Urbana, IL 61801, USA
Abstract
Extreme-scale simulations and experiments can generate large amounts of data, whose volume can exceed
the compute and/or storage capacity at the simulation or experimental facility. With the emergence of
ultra-high-speed networks, researchers are considering pipelined approaches in which data are passed to a
remote facility for analysis. Here we examine an extreme-scale cosmology simulation that, when run on a
large fraction of a leadership computer, generates data at a rate of one petabyte per elapsed day. Writing
those data to disk is inefficient and impractical, and in situ analysis poses its own difficulties. Thus we
implement a pipeline in which data are generated on one supercomputer and then transferred, as they are
generated, to a remote supercomputer for analysis. We use the Swift scripting language to instantiate this
pipeline across Argonne National Laboratory and the National Center for Supercomputing Applications,
which are connected by a 100 Gb/s network; and we demonstrate that by using the Globus transfer service
we can achieve a sustained rate of 93 Gb/s over a 24-hour period, thus attaining our performance goal of
one petabyte moved in 24 hours. This paper describes the methods used and summarizes the lessons learned
in this demonstration.
Keywords: Wide area data transfer, GridFTP, Large data transfer, Cosmology workflow, Pipeline
1. Introduction
Extreme-scale scientific simulations and experiments can generate much more data than can be stored
and analyzed efficiently at a single site. For example, a single trillion-particle simulation with the Hard-
ware/Hybrid Accelerated Cosmology Code (HACC) [1] generates 20 PiB of raw data (500 snapshots, each
40 TiB), which is more than petascale systems such as the Mira system at the Argonne Leadership Comput-
ing Facility (ALCF) and the Blue Waters system at the National Center for Supercomputing Applications
(NCSA) can store in their file systems. Moreover, as scientific instruments are optimized for specific objec-
tives, both the computational infrastructure and the codes become more specialized as we reach the end of
Moore’s law. For example, one version of the HACC is optimized for the Mira supercomputer, on which it
can scale to millions of cores, while the Blue Waters supercomputer is an excellent system for data analysis,
because of its large memory (1.5 PiB) and 4000+ GPU accelerators. To demonstrate how we overcame
the storage limitations and enabled the coordinated use of these two specialized systems, we conducted a
pipelined remote analysis of HACC simulation data, as shown in Figure 1.
¹A petabyte (PiB) = 2^50 bytes
Corresponding author: Zhengchun Liu, Bldg. 240, Argonne National Lab., 9700 S. Cass Avenue, Lemont, IL 60439, USA
Email addresses: kettimut@anl.gov (Rajkumar Kettimuthu), zhengchun.liu@anl.gov (Zhengchun Liu),
dwheeler@illinois.edu (David Wheeler), foster@anl.gov (Ian Foster), heitmann@anl.gov (Katrin Heitmann),
cappello@mcs.anl.gov (Franck Cappello)
© 2018. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
[Figure 1 diagram: a 29-billion-particle cosmology simulation on Mira (ANL) transmits all snapshots at 1 PB/day over the 100 Gb/s ANL–NCSA link to Blue Waters (NCSA) for first-level data analytics and visualization, backed by >1 PB of DDN storage and an archive; second-level data analytics and visualization are streamed at 100 Gb/s to the NCSA booth at SC16 for display (NCSA, EVL), with data pulling and visualization streaming to ORNL and NERSC.]
Figure 1: Pipelined execution of a cosmology workflow that involves a mix of streaming scenarios, ranging from sustained
100 Gb/s over a 24-hour period to high-bandwidth bursts during interactive analysis sessions.
For our demonstration at the NCSA booth at SC16 (the International Conference for High Performance
Computing, Networking, Storage, and Analysis in 2016), a state-of-the-art, 29-billion-particle cosmology
simulation combining high spatial and temporal resolution in a large cosmological volume was run on Mira
at ALCF. As this simulation ran, the Globus [2] transfer service was used to transmit each of its 500 temporal
snapshots to NCSA as it was produced [3]. In total, this workflow moved 1 PiB in 24 hours from the ALCF
to NCSA, requiring an average end-to-end rate of 93 Gb/s. In this paper, we describe how we achieved this
feat, including the experiments performed to gain insight into tuning parameters and data organization, and
the lessons we learned from the demonstration.
As snapshots arrived at NCSA, a first level of data analysis and visualization was performed using
the GPU partition of Blue Waters. We note that the analysis tasks have to be carried out sequentially:
information from the previous time snapshot is captured for the analysis of the next time snapshot in order
to enable detailed tracking of the evolution of structures. The workflow system therefore was carefully
designed to resubmit any unsuccessful analysis job and to wait for an analysis job to finish before starting
the next one. The output data (half the size of the input data) was then sent to the NCSA booth at SC16 to
allow access to and sharing of the resulting data from remote sites. The whole experiment was orchestrated
by the Swift parallel scripting language [4]. In previous simulations, scientists were able to analyze only
100 snapshots because of infrastructure limitations.
This experiment achieved two objectives never accomplished before: (1) running a state-of-the-art cos-
mology simulation and analyzing all snapshots (currently only one in every five or 10 snapshots is stored or
communicated); and (2) combining two different types of systems (simulation on Mira and data analytics
on Blue Waters) that are geographically distributed and belong to different administrative domains to run
an extreme-scale simulation and analyze the output in a pipelined fashion.
The work presented here is also unique in two other respects. First, while many previous studies have
varied transfer parameters such as concurrency and parallelism in order to improve data transfer perfor-
mance [5,6,7], we also demonstrate the value of varying the file size used for data transfer, which provides
additional flexibility for optimization. Second, we demonstrate these methods in the context of dedicated
data transfer nodes and a 100 Gb/s network circuit.
The rest of the paper is organized as follows. In §2 we introduce the science case and the environment in
which we performed these transfers. In §3 we describe the tests to find the optimal transfer parameters. In
§4 we summarize the performance of the transfers during the pipelined simulation and analysis experiments,
and we describe our experiences with checksum-enabled transfers. Based on the demo at SC16, we propose
in §5 an analytical model to identify the optimal file size and show that it can help improve the performance
of checksum-enabled transfers significantly. In §6 we review related work, and in §7 we summarize our work.
2. Science case and demonstration environment
We first present the challenges raised by the science problem and then describe the environment used for
the demonstration.
2.1. Science case
To understand the Universe, cosmologists use large telescopes to conduct observational surveys. These sur-
veys are becoming increasingly complex as telescopes reach deeper into space, mapping out the distributions
of galaxies at farther distances. Cosmological simulations that track the detailed evolution of structure
in the Universe over time are essential for interpreting these surveys. Cosmologists vary the fundamental
physics in the simulations, evaluate resulting measurements in a controlled way, and then predict new phe-
nomena. They also model systematic errors in the data, mimic inaccuracies due to limiting telescope and
sensor artifacts, and determine how these limitations can influence scientific results. In order to achieve
high-quality simulations, high temporal and spatial resolution are critical. Cosmologists need to track the
evolution of small overdensities in detail and follow how they evolve into larger structures. Events early in
the life of such a structure will determine what kind of galaxy it will host later, controlling, for example,
the brightness, color, and morphology of the galaxy. Current and next-generation supercomputers allow
(or will allow) cosmologists to attain high spatial resolution in large cosmological volumes by simulating
trillions of tracer particles. But the supercomputer on which the simulation is carried out might not be (and
usually is not) the optimal system for data analysis, because of storage limitations or supercomputer specialization.
2.2. Demonstration environment
For the demonstration, we ran a large simulation at the ALCF, moved the raw simulation output to NCSA,
and ran an analysis program on the Blue Waters supercomputer. The simulation evolved 3072^3 (≈29 billion)
particles in a simulation box of volume (512 h^-1 Mpc)^3. This led to an approximate mass resolution (mass
of individual particles) of m_p = 3.8 × 10^8 h^-1 M_sun.
Each snapshot holds 1.2 TiB. The source of the data was the GPFS parallel file system on the Mira
supercomputer at Argonne, and the destination was the Lustre parallel file system on the Blue Waters
supercomputer at NCSA. Argonne and NCSA have 12 and 28 data transfer nodes (DTNs) [8], respectively,
dedicated for wide area data transfer. Each DTN runs a Globus GridFTP server. We chose to use Globus
to orchestrate our data transfers in order to get automatic fault recovery and load balancing among the
available GridFTP servers on both ends.
Wide-area data transfer is thus central to distributed science. The data transfer time has a direct
influence on workflow performance, and transfer throughput estimation is crucial for workflow scheduling
and resource allocation [9].
3. Exploration of tunable parameters
Concurrency, parallelism, and pipelining are three key performance optimization parameters for Globus
GridFTP transfers. The effectiveness of these parameters depends on the data transfer nodes, the local
storage systems, and the network; there is no one-size-fits-all setting that is optimal in every case [10]. The
concurrency and parallelism mechanisms are illustrated in Figure 2, and pipelining is illustrated in Figure 3.
3.1. Concurrency
Concurrency uses multiple GridFTP server processes at the source and destination, where each process
transfers a separate file, and thus provides for concurrency at the file system I/O, CPU core, and network
levels.
[Figure 2 diagram: a data transfer node at site A and one at site B, each running two GridFTP server processes (concurrency = 2); each pair of processes is connected by three TCP connections (parallelism = 3), and each process reads from or writes to a parallel file system.]
Figure 2: Illustration of concurrency and parallelism in Globus GridFTP.
[Figure 3 diagram: traditional (sequential) command handling vs. pipelined command handling.]
Figure 3: Illustration of pipelining in Globus GridFTP.
3.2. Parallelism
Parallelism is a network-level optimization that uses multiple socket connections to transfer chunks of a
file in parallel from a single-source GridFTP server process to a single-destination GridFTP server process.
3.3. Pipelining
Pipelining speeds the transfer of many small files by sending multiple FTP commands to a single GridFTP
server process without waiting for the first command's response. This approach reduces the latency between
file transfers handled by a single GridFTP server process.
Thus, in order to find the best application-tunable parameters, we first arbitrarily fixed the average file
size to be 4 GiB and evaluated different combinations of three Globus GridFTP parameters: concurrency,
parallelism, and pipeline depth.
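To make the exploration procedure concrete, the following sketch shows the structure of the parameter sweep. The parameter grids are illustrative, and run_transfer is a hypothetical stub standing in for the submission of a Globus transfer of one snapshot and the recording of its achieved throughput; the actual values we explored are those reported in Figures 4 and 5.

```python
# Minimal sketch of the tuning-parameter sweep, assuming a hypothetical
# run_transfer() helper that submits one snapshot transfer with the given
# settings and returns the achieved throughput in Gb/s.
import itertools

CONCURRENCY = [16, 32, 64, 128]      # GridFTP processes (files in flight); illustrative grid
PARALLELISM = [1, 2, 4, 8]           # TCP streams per file; illustrative grid
PIPELINE_DEPTH = [1, 2, 4, 8, 16]    # outstanding FTP commands per process; illustrative grid

def run_transfer(cc: int, p: int, d: int) -> float:
    """Hypothetical: submit a transfer of one 1.2 TiB snapshot (split into
    ~4 GiB files) with the given tuning parameters and return Gb/s."""
    raise NotImplementedError

results = {}
for cc, p, d in itertools.product(CONCURRENCY, PARALLELISM, PIPELINE_DEPTH):
    results[(cc, p, d)] = run_transfer(cc, p, d)

best = max(results, key=results.get)
print("best (concurrency, parallelism, depth):", best)
```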
Figure 4 shows the achieved throughput as a function of parallelism for different concurrencies and
pipeline depths. Parallelism clearly does not provide any obvious improvement in performance. We con-
jecture that the reason is that the round-trip time between source DTNs and destination DTNs is small
(around 6 ms). Figure 5 shows the achieved throughput as a function of concurrency for different pipeline
depths. We omitted parallelism in this comparison because it does not have much impact (Figure 4).
Figure 4: Data transfer performance versus parallelism with different concurrency (C) and pipeline depth (D).
From Figure 5, we see that the throughput increases with increasing concurrency, especially for small
pipeline depths. Each DTN at ALCF has only a 10 Gb/s network interface card, and a transfer needs to
use at least 10 DTNs to achieve the desired rate of 1 PiB in a day or sustained 93 Gb/s. In order for a
transfer to use 10 DTNs, the concurrency has to be at least 10. In order for each DTN to drive close to
10 Gb/s (read data from the storage at 10 Gb/s and send data on the network at the same rate), many (or
all) DTN cores need to be used. In this case, each DTN has 8 cores, and thus a concurrency of at least 96
is needed to use all cores. This explains why a concurrency of 128 gives the highest performance.
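This reasoning amounts to two lower bounds on the concurrency, one from the per-DTN NIC bandwidth and one from the total core count. The small sketch below restates that arithmetic using the ALCF figures quoted above (10 Gb/s NIC per DTN, 12 DTNs, 8 cores each).

```python
import math

# Lower bounds on concurrency implied by the discussion above.
def min_concurrency(target_gbps: float = 93, nic_gbps: float = 10,
                    n_dtns: int = 12, cores_per_dtn: int = 8) -> int:
    for_bandwidth = math.ceil(target_gbps / nic_gbps)  # at least 10 DTNs must be driven
    for_cores = n_dtns * cores_per_dtn                  # 96 processes to keep every core busy
    return max(for_bandwidth, for_cores)

print(min_concurrency())   # 96, which is why a concurrency of 128 performed best
```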
We also see in Figure 5 that increasing the pipeline depth reduces performance. The reason is that the
Globus policy was designed for regular data transfers (i.e., transfers whose size is not as big as in this case,
and whose endpoints and network are not as powerful). Specifically, at the time of the experiment, Globus
doubled the requested pipeline depth, split multi-file transfers into batches of 1,000 files, and treated each
batch as an independent transfer request. This policy was put in place to control the total number of
concurrent file transfers and thus the memory load on the servers. For example, if we use a pipeline depth
of 8, the maximum concurrency can only be 1000/16 = 62, which is why concurrencies of 64 and 128 achieved
the same performance in Figure 5. After we fed back the findings reported in this paper, Globus optimized
this 1,000-file batch mechanism.
Similarly, when the pipeline depth is 16, the actual concurrency will be 1000/32 = 31, and thus transfers
with concurrency greater than 32 achieve the same performance as those with 32 in Figure 5. Therefore, the
optimal pipeline depth for our use case is 1, because pipelining is good for transferring tiny files (when the
elapsed transfer time for one file is less than the round-trip time between the client and the source endpoint)
but not for larger files.
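The effect of this (since-optimized) batching policy on the usable concurrency can be captured in a few lines; the sketch below simply restates the arithmetic above and reproduces the depth-8 and depth-16 examples.

```python
# Effective concurrency under the Globus batching policy described above:
# the requested pipeline depth is doubled and multi-file transfers are split
# into batches of 1,000 files, so at most 1000 // (2 * depth) files can be
# in flight within a batch.
def effective_concurrency(requested_cc: int, pipeline_depth: int,
                          batch_size: int = 1000) -> int:
    per_batch_limit = batch_size // (2 * pipeline_depth)
    return min(requested_cc, per_batch_limit)

assert effective_concurrency(128, 8) == 62    # depth-8 example above
assert effective_concurrency(64, 16) == 31    # depth-16 example above
assert effective_concurrency(128, 1) == 128   # depth 1 leaves concurrency intact
```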
Although we used the Globus transfer service in this work, we believe that our methods and conclusions
are applicable to other wide-area data transfers, because the three tunable parameters of Globus, namely
concurrency (equivalent to the number of network and disk I/O threads/processes), parallelism (equivalent
to the number of TCP connections), and pipelining (the number of outstanding commands on the control
channel), are also used by other high-performance data transfer tools such as BBCP [11], FDT [12],
dCache [13], FTS [14], and XDD [15].
4. Experiences transferring the data
As mentioned, we split each 1.2 TiB snapshot into 256 files of approximately equal size. We determined
that transferring 64 or 128 files concurrently, with a total of 128 or 256 TCP streams, yielded the maximum
transfer rate. We achieved an average disk-to-disk transfer rate of 92.4 Gb/s (or 1 PiB in 24 hours and
Figure 5: Data transfer performance versus concurrency for different pipeline depth values.
3 minutes): 99.8% of our goal of 1 PiB in 24 hours, when the end-to-end verification of data integrity in
Globus is disabled. In contrast, when the end-to-end verification of data integrity in Globus is enabled, we
achieved an average transfer rate of only 72 Gb/s (or 1 PiB in 30 hours and 52 minutes).
The Globus approach to checksum verification is motivated by the observations that the 16-bit TCP
checksum is inadequate for detecting data corruption during communication [16,17] and that corruption
can occur during file system operations [18]. Globus pipelines the transfer and checksum computation; that
is, the checksum computation of the ith file happens in parallel with the transfer of the (i+1)th file. Data
are read twice at the source storage system (once for transfer and once for checksum) and written once (for
transfer) and read once (for checksum) at the destination storage system. Therefore, in order to achieve the
desired rate of 93 Gb/s for checksum-enabled transfers, in the absence of checksum failures, 186 Gb/s of read
bandwidth from the source storage system and 93 Gb/s write bandwidth and 93 Gb/s of read bandwidth
concurrently from the destination storage system are required. If checksum verification failures occur (i.e.,
one or more files are corrupted during the transfer), even more storage I/O bandwidth, CPU resources, and
network bandwidth are required in order to achieve the desired rate. Figure 6 shows the overall transfer
throughput, as determined via SNMP network monitoring, and the DTN CPU utilization, when performing
transfers using the optimal parameters that we identified.
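As a simple illustration of this accounting, the sketch below computes the storage I/O implied by a target transfer rate when end-to-end verification is enabled (two reads at the source, one write plus one read at the destination), assuming no checksum failures.

```python
# Storage I/O bandwidth implied by checksum-enabled transfers, per the
# accounting above: data are read twice at the source (once for transfer,
# once for checksum) and written once and read once at the destination.
def io_requirements(target_gbps: float) -> dict:
    return {
        "source_read_gbps": 2 * target_gbps,   # transfer read + checksum read
        "dest_write_gbps": target_gbps,        # transfer write
        "dest_read_gbps": target_gbps,         # checksum read
    }

print(io_requirements(93))
# -> {'source_read_gbps': 186, 'dest_write_gbps': 93, 'dest_read_gbps': 93}
```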
We see that transfers without integrity checking (marked by dashed line boxes in Figure 6) can sustain
rates close to our environment’s theoretical bandwidth of 100 Gb/s, with little CPU utilization. If integrity
checking is enabled (solid line boxes in Figure 6), however, the CPU utilization increases significantly, and
it is hard to get close to the theoretical bandwidth continuously. We note three points: (1) the network
is not dedicated to this experiment, and so some network bandwidth was unavoidably consumed by other
programs; (2) we used the same optimal tunable parameters (concurrency of 128 and parallelism of 1) and the
same file size for transfers with and without checksum in order to make sure that the only difference between
the two cases is data verification; and (3) there are transfers with non-optimal parameters, performed as
part of our explorations, running during other times (outside the boxes) in Figure 6.
4.1. Checksum failures
Globus restarts failed or interrupted transfers from the last checkpoint in order to avoid retransmission costs.
In the case of a checksum error, however, it retransmits the entire erroneous file. About 5% of our transfers
experienced checksum failure. Such failures can be caused by network corruption, source storage error, or
destination storage error. Since the data integrity is verified after the whole file has been transferred to
Figure 6: Data transfer throughput vs. DTN CPU usage over eight days. We highlight two periods in which our testing
data transfers were occurring with checksum computations (periods delineated by solid line box) and three periods in which
our testing transfers were occurring without checksum computations (dashed line boxes). The rest of the periods, i.e., not
highlighted, are other users’ regular transfers.
the destination, a retransmission must be performed if the checksums do not match. For a given transfer,
the number of failures equals the number of retransferred files. Obviously, the transfer throughput will go
down if too many failures occur. We show the transfer throughput versus failures in Figure 7.
Assume that a transfer contains N files, each of x bytes, and that there are n checksum verification failures.
Thus, n files are retransferred, and the total bytes transferred will be (N + n)x. If we assume that the
end-to-end throughput is R_e2e, the actual transfer time T_trs will be

T_trs = x(N + n) / R_e2e.  (1)

Thus, the effective throughput R_trs seen by the transfer user, given that it takes T_trs seconds to transfer
Nx bytes, will be

R_trs = Nx / T_trs = N R_e2e / (N + n).  (2)

We note that transfers in Figure 7 have different concurrency, parallelism, pipeline depth, and average
file size, and thus their R_e2e are different. If we look only at the transfers with similar concurrency, the
shape in Figure 7 fits well with Equation 2.
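Equation 2 is straightforward to evaluate; the sketch below does so for an illustrative transfer. The 5% failure rate mirrors the rate we observed, while the other numbers are made-up examples rather than measurements.

```python
# Effective throughput seen by the user (Equation 2): n checksum failures
# inflate the bytes moved from N*x to (N + n)*x at end-to-end rate R_e2e.
def effective_throughput(r_e2e_gbps: float, n_files: int, n_failures: int) -> float:
    return n_files * r_e2e_gbps / (n_files + n_failures)

# Example: a 256-file transfer with an end-to-end rate of 93 Gb/s and a 5%
# failure rate (about 13 retransmitted files) delivers roughly 88.5 Gb/s.
print(round(effective_throughput(93, 256, 13), 1))
```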
5. Retrospective analysis: A model-based approach to finding the optimal number of files
The ability to control the file size (and thus number of files) in the dataset is a key flexibility in this use case.
Thus, while we used a file size of 4 GB in our experiments, based on limited exploration and intuition, we
realized in retrospect that we could have created a model to identify the optimal file size. Here, we present
follow-up work in which we develop and apply such a model.
In developing a model of file transfer plus checksum costs, we start with a simple linear model of transfer
time for a single file:

T_trs = a_trs x + b_trs,  (3)

where a_trs is the unit transfer time, b_trs the transfer startup cost, and x the file size. Similarly, we model
the time to verify file integrity as

T_ck = a_ck x + b_ck,  (4)

where a_ck is the unit checksum time, b_ck the checksum startup cost, and x the file size.
Figure 7: Data transfer performance versus number of checksum failures. Classified by the transfer concurrency (C). Each dot
represents one transfer.
[Figure 8 diagram: after a service startup cost b_srvs, successive file transfers (T_trs) form one pipeline while the corresponding integrity verifications (T_ck) form a second, overlapping pipeline.]
Figure 8: Illustration of file transfer and verification overlap within a single GridFTP process, as modeled by Equation 5. File
transfer and associated file verification operations are managed as independent pipelines, connected only in that a file can be
verified only after it has been transferred, and verification failure causes a file retransmit.
The time to transfer and checksum a file is not the simple sum of these two equations because, as shown
in Figure 8, Globus GridFTP pipelines data transfers and associated checksum computations. Note how the
data transfer and file integrity verification computations overlap. Thus, assuming no file system contention
and that the unit checksum time is less than the unit transfer time (we verified that this is indeed the case
in our environment), the total time T to transfer n files with one GridFTP process, each of size x, is the
time required for n file transfers and one checksum, namely,

T = n T_trs + T_ck + b_srvs = n(x a_trs + b_trs) + x a_ck + b_ck + b_srvs,  (5)

where b_srvs is the transfer service (Globus in this paper) startup cost (e.g., time to establish the Globus
control channel). Let us now assume that the concurrency is cc. Let S denote the total bytes to be
transferred, and let us divide S bytes equally into N files, where N is perfectly divisible by cc. Thus there
are n = N/cc files per concurrent transfer (i.e., per GridFTP process), and each file is of size x = S/N. The
transfer time T as a function of the number of files N is then

T(N) = (S/cc) a_trs + (N/cc) b_trs + (S/N) a_ck + b_ck + b_srvs.  (6)
We use experimental data to estimate the four parameters in Equation 6, a_trs, b_trs, a_ck, and (b_ck + b_srvs),
in our environment and for different concurrency values, cc. Since we previously determined (see Figure 4)
that parallelism makes little difference in our low-RTT environment, we fixed parallelism at four in these
experiments. We note that, for other scenarios with long RTT, the best parallelism should be determined
first, e.g., by iteratively exploring parallelism as shown in Figure 4. For each concurrency value, we fixed
S = 1.2 TiB, used four measured (N, T) points to fit the four parameters, and then used the resulting
model to predict performance for other values of N. Figure 9 shows our results. Here, the lines are model
predictions, stars are measured values used to fit the model, and other dots are other measured values not
used to fit the model.
Figure 9: Evaluation of our transfer performance model when transferring a single 1.2 TiB snapshot. Solid markers are points
that were used to fit the model parameters shown in Equation 6.
Figure 9 shows the accuracy of the performance model. We see that our model does a good job of
predicting throughput as a function of Nand cc. Since the four model parameters in Equation 6 are
independent of source, network, and destination, at least four experiments are needed to fit the model, after
which the model can be used to determine the best file size to split a simulation snapshot. We conclude
that the optimal file size is around 800 MiB (i.e., split the 1.2 TiB snapshot to 1536–2048 files) and that it
can achieve a throughput of 89.7 Gb/s with integrity verification. This throughput represents an increase
of 25% compared with that obtained with the ad hoc approach, when we used a file size of 4 GB.
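To make the fitting and optimization procedure concrete, the sketch below works with the lumped form of Equation 6 for a fixed concurrency and snapshot size, T(N) = c0 + c1 N + c2/N, where c0 = (S/cc) a_trs + b_ck + b_srvs, c1 = b_trs/cc, and c2 = S a_ck. The measured points shown are placeholders, not values from our experiments.

```python
# Minimal sketch, assuming placeholder measurements: fit the lumped
# coefficients of T(N) = c0 + c1*N + c2/N by least squares, then pick the
# file count N that minimizes T(N), i.e., where dT/dN = c1 - c2/N^2 = 0.
import numpy as np

S = 1.2 * 2**40                                       # snapshot size in bytes (1.2 TiB)
measured_N = np.array([256, 512, 1024, 4096])         # hypothetical file counts
measured_T = np.array([150.0, 125.0, 118.0, 140.0])   # hypothetical transfer times (s)

A = np.column_stack([np.ones_like(measured_N, dtype=float),
                     measured_N.astype(float),
                     1.0 / measured_N])
c0, c1, c2 = np.linalg.lstsq(A, measured_T, rcond=None)[0]

N_opt = np.sqrt(c2 / c1)                              # minimizer of the fitted curve
file_size_opt = S / N_opt
print(f"optimal N ≈ {N_opt:.0f}, file size ≈ {file_size_opt / 2**20:.0f} MiB")
```

With real measured (N, T) points in place of the placeholders, the minimizer of the fitted curve gives the recommended file count, and hence the file size, discussed above.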
6. Related work
Elephant flows such as those considered here have been known to account for over 90% of the bytes transferred
on typical networks [19], making their optimization important.
At the 2009 Supercomputing conference, a multidisciplinary team of researchers from DOE national
laboratories and universities demonstrated the seamless, reliable, and rapid transfer of 10 TiB of Earth
System Grid data [20] from three sources—the Argonne Leadership Computing Facility, Lawrence Livermore
National Laboratory, and National Energy Research Scientific Computing Center. The team achieved a
sustained data rate of 15 Gb/s on a 20 Gb/s network provided by DOE’s ESnet. More important, their work
provided critical feedback on how to deploy, tune, and monitor the middleware used to replicate petascale
climate datasets [21]. Their work clearly showed why supercomputer centers need to install dedicated hosts,
referred to as data transfer nodes, for wide area transfers [8].
In another SC experiment, this time in 2011, Balman et al. [22] streamed cosmology simulation data
over a 100 Gb/s network to a remote visualization system, obtaining an average performance of 85 Gb/s.
However, data were communicated with a lossy UDP-based protocol.
Many researchers have studied the impact of parameters such as concurrency and parallelism on data
transfer performance [5,6,7] and have proposed and evaluated alternative transfer protocols [23,24,25]
and implementations [26]. Jung et al. [27] proposed a serverless data movement architecture that bypasses
data transfer nodes, the filesystem stack, and the host system stack and directly moves data from one disk
array controller to another, in order to obtain the highest end-to-end data transfer performance. Newman
et al. [28] summarized the next-generation exascale network integrated architecture project that is designed
to accomplish new levels of network and computing capabilities in support of global science collaborations
through the development of a new class of intelligent, agile networked systems.
Rao et al. [29] studied the performance of TCP variants and their parameters for high-performance trans-
fers over dedicated connections by collecting systematic measurements using physical and emulated dedicated
connections. These experiments revealed important properties such as concave regions and relationships be-
tween dynamics and throughput profiles. Their analyses enable the selection of a high-throughput transport
method and corresponding parameters for a given connection based on round-trip time. Liu et al. [30]
similarly studied UDT [31].
Specifically for bulk wide-area data transfer, Liu et al. [32] analyzed millions of Globus [3] data transfers
involving thousands of DTNs and found that DTN performance has a nonlinear relationship with load. Liu
et al. [33] conducted a systematic examination of a large set of data transfer logs to characterize transfer
behavior, including the nature of the datasets transferred, achieved throughput, user behavior, and resource
usage. Their analysis yields new insights that can help design better data transfer tools, optimize the
networking and edge resources used for transfers, and improve the performance and experience for end users.
Specifically, their analysis shows that most of the datasets, as well as the individual files transferred, are
very small; that data corruption is not negligible for large data transfers; and that data transfer node
utilization is low.
7. Conclusion
We have presented our experiences in transferring one petabyte of science data within one day. We first
described the exploration that we performed to identify parameter values that yield maximum performance
for Globus transfers. We then discussed our experiences in transferring data while the data are produced
by the simulation, both with and without end-to-end integrity verification. We achieved 99.8% of our one
petabyte-per-day goal without integrity verification and 78% with integrity verification. We also used a
model-based approach to identify the optimal file size for transfers; the results suggest that we could
achieve 97% of our goal with integrity verification by choosing the appropriate file size. We believe that our
work serves as a useful lesson in the time-constrained transfer of large datasets.
Acknowledgments
We would like to thank the anonymous FGCS reviewers for their valuable feedback and the good questions
they raised. This work was supported in part by the U.S. Department of Energy under contract number
DE-AC02-06CH11357, National Science Foundation award 1440761, and the Blue Waters sustained-petascale
computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-
1238993) and the state of Illinois.
References
[1] S. Habib, A. Pope, H. Finkel, N. Frontiere, K. Heitmann, D. Daniel, P. Fasel, V. Morozov, G. Zagaris, T. Peterka, V. Vish-
wanath, Z. Luki´c, S. Sehrish, W. Liao, HACC: Simulating sky surveys on state-of-the-art supercomputing architectures,
New Astronomy 42 (2016) 49–65.
[2] Globus, https://www.globus.org (2018, accessed January 3, 2018).
[3] K. Chard, S. Tuecke, I. Foster, Efficient and secure transfer, synchronization, and sharing of big data, IEEE Cloud
Computing 1 (3) (2014) 46–55.
[4] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, I. Foster, Swift: A language for distributed parallel
scripting, Parallel Computing 37 (9) (2011) 633–652.
[5] T. J. Hacker, B. D. Athey, B. Noble, The end-to-end performance effects of parallel TCP sockets on a lossy wide-area
network, in: Parallel and Distributed Processing Symposium, Proceedings International, IPDPS 2002, Abstracts and
CD-ROM, IEEE, 2001, 10 pp.
[6] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster, The Globus striped GridFTP
framework and server, in: ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2005, p. 54.
[7] E. Yildirim, E. Arslan, J. Kim, T. Kosar, Application-level optimization of big data transfers through pipelining, parallelism
and concurrency, IEEE Transactions on Cloud Computing 4 (1) (2016) 63–75.
[8] E. Dart, L. Rotman, B. Tierney, M. Hester, J. Zurawski, The Science DMZ: A network design pattern for data-intensive
science, Scientific Programming 22 (2) (2014) 173–185.
[9] Z. Liu, R. Kettimuthu, S. Leyffer, P. Palkar, I. Foster, A mathematical programming- and simulation-based framework to
evaluate cyberinfrastructure design choices, in: 2017 IEEE 13th International Conference on e-Science, 2017, pp. 148–157.
doi:10.1109/eScience.2017.27.
[10] Z. Liu, R. Kettimuthu, I. Foster, P. H. Beckman, Towards a smart data transfer node, in: 4th International Workshop on
Innovating the Network for Data Intensive Science, 2017, p. 10.
[11] BBCP, http://www.slac.stanford.edu/~abh/bbcp/.
[12] FDT, FDT - Fast Data Transfer, http://monalisa.cern.ch/FDT/ (2018, accessed January 3, 2018).
[13] P. Fuhrmann, V. G¨ulzow, dCache, storage system for the future, in: European Conference on Parallel Processing, Springer,
2006, pp. 1106–1113.
[14] CERN, FTS3: Robust, simplified and high-performance data movement service for WLCG, http://fts3-service.web.cern.ch (2018, accessed January 3, 2018).
[15] B. W. Settlemyer, J. D. Dobson, S. W. Hodson, J. A. Kuehn, S. W. Poole, T. M. Ruwart, A technique for moving large
data sets over high-performance long distance networks, in: 27th Symp. on Mass Storage Systems and Technologies, 2011,
pp. 1–6. doi:10.1109/MSST.2011.5937236.
[16] V. Paxson, End-to-end internet packet dynamics, IEEE/ACM Transactions on Networking 7 (3) (1999) 277–292.
[17] J. Stone, C. Partridge, When the CRC and TCP checksum disagree, in: ACM SIGCOMM Computer Communication
Review, Vol. 30, ACM, 2000, pp. 309–319.
[18] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, G. R. Goodson, B. Schroeder, An analysis of data
corruption in the storage stack, ACM Transactions on Storage 4 (3) (2008) 8.
[19] K. Lan, J. Heidemann, A measurement study of correlations of internet flow characteristics, Computer Networks 50 (1)
(2006) 46–62.
[20] D. N. Williams, R. Drach, R. Ananthakrishnan, I. T. Foster, D. Fraser, F. Siebenlist, D. E. Bernholdt, M. Chen, J. Schwid-
der, S. Bharathi, A. L. Chervenak, R. Schuler, M. Su, D. Brown, L. Cinquini, P. Fox, J. Garcia, D. E. Middleton, W. G.
Strand, N. Wilhelmi, S. Hankin, R. Schweitzer, P. Jones, A. Shoshani, A. Sim, The Earth System Grid: Enabling ac-
cess to multimodel climate simulation data, Bulletin of the American Meteorological Society 90 (2) (2009) 195–205.
doi:10.1175/2008BAMS2459.1.
[21] R. Kettimuthu, A. Sim, D. Gunter, B. Allcock, P.-T. Bremer, J. Bresnahan, A. Cherry, L. Childers, E. Dart, I. Foster,
K. Harms, J. Hick, J. Lee, M. Link, J. Long, K. Miller, V. Natarajan, V. Pascucci, K. Raffenetti, D. Ressman, D. Williams,
L. Wilson, L. Winkler, Lessons learned from moving Earth System Grid data sets over a 20 Gbps wide-area network, in:
19th ACM International Symposium on High Performance Distributed Computing, HPDC ’10, ACM, New York, NY,
USA, 2010, pp. 316–319. doi:10.1145/1851476.1851519.
[22] M. Balman, E. Pouyoul, Y. Yao, E. Bethel, B. Loring, M. Prabhat, J. Shalf, A. Sim, B. L. Tierney, Experiences with
100Gbps network applications, in: 5th International Workshop on Data-Intensive Distributed Computing, ACM, 2012,
pp. 33–42.
[23] E. Kissel, M. Swany, B. Tierney, E. Pouyoul, Efficient wide area data transfer protocols for 100 Gbps networks and beyond,
in: 3rd International Workshop on Network-Aware Data Management, ACM, 2013, p. 3.
[24] D. X. Wei, C. Jin, S. H. Low, S. Hegde, FAST TCP: Motivation, architecture, algorithms, performance, IEEE/ACM
Transactions on Networking 14 (6) (2006) 1246–1259.
[25] L. Zhang, W. Wu, P. DeMar, E. Pouyoul, mdtmFTP and its evaluation on ESNET SDN testbed, Future Generation
Computer Systems.
[26] H. Bullot, R. Les Cottrell, R. Hughes-Jones, Evaluation of advanced TCP stacks on fast long-distance production networks,
Journal of Grid Computing 1 (4) (2003) 345–359.
[27] E. S. Jung, R. Kettimuthu, High-performance serverless data transfer over wide-area networks, in: 2015 IEEE International
Parallel and Distributed Processing Symposium Workshop, 2015, pp. 557–564. doi:10.1109/IPDPSW.2015.69.
[28] H. Newman, M. Spiropulu, J. Balcas, D. Kcira, I. Legrand, A. Mughal, J. Vlimant, R. Voicu, Next-generation exascale
network integrated architecture for global science, Journal of Optical Communications and Networking 9 (2) (2017) A162–
A169.
[29] N. S. Rao, Q. Liu, S. Sen, D. Towlsey, G. Vardoyan, R. Kettimuthu, I. Foster, TCP throughput profiles using measurements
over dedicated connections, in: 26th International Symposium on High-Performance Parallel and Distributed Computing,
ACM, 2017, pp. 193–204.
[30] Q. Liu, N. S. Rao, C. Q. Wu, D. Yun, R. Kettimuthu, I. T. Foster, Measurement-based performance profiles and dynamics
of UDT over dedicated connections, in: 24th International Conference on Network Protocols, IEEE, 2016, pp. 1–10.
[31] Y. Gu, R. L. Grossman, UDT: UDP-based data transfer for high-speed wide area networks, Computer Networks 51 (7)
(2007) 1777–1799.
[32] Z. Liu, P. Balaprakash, R. Kettimuthu, I. Foster, Explaining wide area data transfer performance, in: Proceedings of the
26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’17, ACM, New York,
NY, USA, 2017, pp. 167–178. doi:10.1145/3078597.3078605.
URL http://doi.acm.org/10.1145/3078597.3078605
[33] Z. Liu, R. Kettimuthu, I. Foster, N. S. V. Rao, Cross-geography scientific data transferring trends and behavior, in:
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’18,
ACM, New York, NY, USA, 2018, pp. 267–278. doi:10.1145/3208040.3208053.
URL http://doi.acm.org/10.1145/3208040.3208053