Preprint published in Future Generation Computer Systems. DOI: 10.1016/j.future.2018.05.051
Transferring a Petabyte in a Day¹

Rajkumar Kettimuthu a, Zhengchun Liu a,∗, David Wheeler d, Ian Foster a,b, Katrin Heitmann a,c, Franck Cappello a
a Mathematics & Computer Science Division, Argonne National Laboratory, Lemont, IL 60439, USA
b Department of Computer Science, University of Chicago, Chicago, IL 60637, USA
c High Energy Physics Division, Argonne National Laboratory, Lemont, IL 60439, USA
d National Center for Supercomputing Applications, University of Illinois at Urbana–Champaign, Urbana, IL 61801, USA
Extreme-scale simulations and experiments can generate large amounts of data, whose volume can exceed
the compute and/or storage capacity at the simulation or experimental facility. With the emergence of
ultra-high-speed networks, researchers are considering pipelined approaches in which data are passed to a
remote facility for analysis. Here we examine an extreme-scale cosmology simulation that, when run on a
large fraction of a leadership computer, generates data at a rate of one petabyte per elapsed day. Writing
those data to disk is ineﬃcient and impractical, and in situ analysis poses its own diﬃculties. Thus we
implement a pipeline in which data are generated on one supercomputer and then transferred, as they are
generated, to a remote supercomputer for analysis. We use the Swift scripting language to instantiate this
pipeline across Argonne National Laboratory and the National Center for Supercomputing Applications,
which are connected by a 100 Gb/s network; and we demonstrate that by using the Globus transfer service
we can achieve a sustained rate of 93 Gb/s over a 24-hour period, thus attaining our performance goal of
one petabyte moved in 24 hours. This paper describes the methods used and summarizes the lessons learned
in this demonstration.
Keywords: Wide area data transfer, GridFTP, Large data transfer, Cosmology workﬂow, Pipeline
1. Introduction

Extreme-scale scientific simulations and experiments can generate much more data than can be stored and analyzed efficiently at a single site. For example, a single trillion-particle simulation with the Hardware/Hybrid Accelerated Cosmology Code (HACC) generates 20 PiB of raw data (500 snapshots, each 40 TiB), which is more than petascale systems such as the Mira system at the Argonne Leadership Computing Facility (ALCF) and the Blue Waters system at the National Center for Supercomputing Applications (NCSA) can store in their file systems. Moreover, as scientific instruments are optimized for specific objectives, both the computational infrastructure and the codes become more specialized as we reach the end of Moore's law. For example, one version of HACC is optimized for the Mira supercomputer, on which it can scale to millions of cores, while the Blue Waters supercomputer is an excellent system for data analysis because of its large memory (1.5 PiB) and 4000+ GPU accelerators. To demonstrate how we overcame the storage limitations and enabled the coordinated use of these two specialized systems, we conducted a pipelined remote analysis of HACC simulation data, as shown in Figure 1.
¹ A petabyte (PiB) = 2^50 bytes.
∗Corresponding author: Zhengchun Liu, Bldg. 240, Argonne National Lab., 9700 S. Cass Avenue, Lemont, IL 60439, USA
Email addresses: email@example.com (Rajkumar Kettimuthu), firstname.lastname@example.org (Zhengchun Liu),
email@example.com (David Wheeler), firstname.lastname@example.org (Ian Foster), email@example.com (Katrin Heitmann),
firstname.lastname@example.org (Franck Cappello)
© 2018. This manuscript version is made available under the CC-BY-NC-ND 4.0 license.
Figure 1: Pipelined execution of a cosmology workﬂow that involves a mix of streaming scenarios, ranging from sustained
∼100 Gb/s over a 24-hour period to high-bandwidth bursts during interactive analysis sessions.
In our demonstration at the NCSA booth at SC16 (the International Conference for High Performance Computing, Networking, Storage, and Analysis in 2016), a state-of-the-art, 29-billion-particle cosmology simulation combining high spatial and temporal resolution in a large cosmological volume was performed on Mira at ALCF. As this simulation ran, the Globus transfer service was used to transmit each of 500 temporal snapshots to NCSA as it was produced. In total, this workflow moved 1 PiB from the ALCF to NCSA in 24 hours, requiring an average end-to-end rate of ∼93 Gb/s. In this paper, we describe how we achieved this feat, including the experiments performed to gather insights on tuning parameters and data organization, and the lessons we learned from the demonstration.
As snapshots arrived at NCSA, a first level of data analysis and visualization was performed using the GPU partition of Blue Waters. We note that the analysis tasks have to be carried out sequentially: information from the previous time snapshot is captured for the analysis of the next time snapshot in order to enable detailed tracking of the evolution of structures. The workflow system therefore was carefully designed to resubmit any unsuccessful analysis job and to wait for an analysis job to finish before starting the next one. The output data (half the size of the input data) were then sent to the NCSA booth at SC16 to allow access to and sharing of the resulting data from remote sites. The whole experiment was orchestrated by the Swift parallel scripting language. In previous simulations, scientists were able to analyze only ∼100 snapshots because of infrastructure limitations.
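To make the orchestration concrete, the following is a minimal Python sketch of the control flow described above. It is illustrative only (the actual orchestration used the Swift parallel scripting language), and transfer_snapshot and run_analysis are hypothetical stand-ins for the Globus transfer and the Blue Waters analysis job.

import random

def transfer_snapshot(i: int) -> None:
    # Placeholder: in the demonstration, Globus moved snapshot i from ALCF to NCSA.
    print(f"transferring snapshot {i}")

def run_analysis(i: int) -> bool:
    # Placeholder: in the demonstration, an analysis job ran on the Blue Waters
    # GPU partition; here we pretend a small fraction of jobs fail.
    print(f"analyzing snapshot {i}")
    return random.random() > 0.05

def run_pipeline(num_snapshots: int = 500, max_retries: int = 3) -> None:
    for i in range(num_snapshots):
        transfer_snapshot(i)
        # Analyses are strictly sequential: snapshot i+1 needs state captured
        # while analyzing snapshot i, so we retry failed jobs before moving on.
        for _ in range(max_retries):
            if run_analysis(i):
                break

run_pipeline(num_snapshots=3)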
This experiment achieved two objectives never accomplished before: (1) running a state-of-the-art cos-
mology simulation and analyzing all snapshots (currently only one in every ﬁve or 10 snapshots is stored or
communicated); and (2) combining two diﬀerent types of systems (simulation on Mira and data analytics
on Blue Waters) that are geographically distributed and belong to diﬀerent administrative domains to run
an extreme-scale simulation and analyze the output in a pipelined fashion.
The work presented here is also unique in two other respects. First, while many previous studies have
varied transfer parameters such as concurrency and parallelism in order to improve data transfer perfor-
mance [5,6,7], we also demonstrate the value of varying the ﬁle size used for data transfer, which provides
additional ﬂexibility for optimization. Second, we demonstrate these methods in the context of dedicated
data transfer nodes and a 100 Gb/s network circuit.
The rest of the paper is organized as follows. In §2 we introduce the science case and the environment in which we performed these transfers. In §3 we describe the tests to find the optimal transfer parameters. In §4 we summarize the performance of the transfers during the pipelined simulation and analysis experiments, and we describe our experiences with checksum-enabled transfers. Based on the demo at SC16, we propose in §5 an analytical model to identify the optimal file size and show that it can help improve the performance of checksum-enabled transfers significantly. In §6 we review related work, and in §7 we summarize our work.
2. Science case and demonstration environment
We first present the challenges raised by the science problem and then describe the environment used for the demonstration.

2.1. Science case
To understand the Universe, cosmologists use large telescopes to conduct observational surveys. These sur-
veys are becoming increasingly complex as telescopes reach deeper into space, mapping out the distributions
of galaxies at farther distances. Cosmological simulations that track the detailed evolution of structure
in the Universe over time are essential for interpreting these surveys. Cosmologists vary the fundamental
physics in the simulations, evaluate resulting measurements in a controlled way, and then predict new phenomena. They also model systematic errors in the data, mimic inaccuracies due to limiting telescope and
sensor artifacts, and determine how these limitations can inﬂuence scientiﬁc results. In order to achieve
high-quality simulations, high temporal and spatial resolution are critical. Cosmologists need to track the
evolution of small overdensities in detail and follow how they evolve into larger structures. Events early in
the life of such a structure will determine what kind of galaxy it will host later, controlling, for example, the brightness, color, and morphology of the galaxy. Current and next-generation supercomputers allow, and will allow, cosmologists to attain high spatial resolution in large cosmological volumes by simulating trillions of tracer particles. But the supercomputer on which the simulation is carried out might not be (and usually is not) the optimal system for data analysis, because of storage limitations or supercomputer specialization.
2.2. Demonstration environment
For the demonstration, we ran a large simulation at the ALCF, moved the raw simulation output to NCSA,
and ran an analysis program on the Blue Waters supercomputer. The simulation evolved 3072^3 (≈29 billion) particles in a simulation box of volume (512 h^-1 Mpc)^3. This led to an approximate mass resolution (mass of individual particles) of m_p = 3.8 × 10^8 h^-1 M_sun.
Each snapshot holds 1.2 TiB. The source of the data was the GPFS parallel file system on the Mira supercomputer at Argonne, and the destination was the Lustre parallel file system on the Blue Waters supercomputer at NCSA. Argonne and NCSA have 12 and 28 data transfer nodes (DTNs), respectively, dedicated for wide area data transfer. Each DTN runs a Globus GridFTP server. We chose to use Globus to orchestrate our data transfers in order to get automatic fault recovery and load balancing among the available GridFTP servers on both ends.
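As an illustration of how such transfers are submitted programmatically, the sketch below uses the Globus Python SDK (globus_sdk). It is not the code used in the demonstration; the endpoint IDs and paths would be placeholders, and the authentication setup (obtaining a TransferClient) is omitted.

import globus_sdk

def submit_snapshot_transfer(tc: globus_sdk.TransferClient,
                             src_ep: str, dst_ep: str,
                             src_path: str, dst_path: str,
                             verify: bool = True) -> str:
    # One Globus task per snapshot directory; Globus spreads the files across
    # the available GridFTP servers (DTNs) on both ends, retries failed files,
    # and optionally verifies checksums end to end (see Section 4).
    tdata = globus_sdk.TransferData(tc, src_ep, dst_ep,
                                    label="HACC snapshot",
                                    verify_checksum=verify)
    tdata.add_item(src_path, dst_path, recursive=True)
    task = tc.submit_transfer(tdata)
    return task["task_id"]

A caller would then poll or wait on the returned task ID (for example, with the SDK's task_wait method) before triggering the downstream analysis.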
Wide-area data transfer is central to distributed science. The data transfer time has a direct influence on workflow performance, and transfer throughput estimation is crucial for workflow scheduling and resource allocation.
3. Exploration of tunable parameters
Concurrency, parallelism, and pipelining are three key performance optimization parameters for Globus GridFTP transfers. The effectiveness of these parameters depends on the data transfer nodes, the local storage systems, and the network. There is no one-size-fits-all setting that is optimal in all cases. The concurrency and parallelism mechanisms are illustrated in Figure 2, and pipelining is illustrated in Figure 3.
Concurrency uses multiple GridFTP server processes at the source and destination, where each process transfers a separate file, and thus provides for concurrency at the file system I/O, CPU core, and network levels.
Figure 2: Illustration of concurrency and parallelism in Globus GridFTP.
Figure 3: Illustration of pipelining in Globus GridFTP.
Parallelism is a network-level optimization that uses multiple socket connections to transfer chunks of a file in parallel from a single source GridFTP server process to a single destination GridFTP server process. Pipelining speeds up the transfer of many small files by sending multiple FTP commands to a single GridFTP server process without waiting for the first command's response. This approach reduces the latency between file transfers handled by a single GridFTP server process.
Thus, in order to find the best application-tunable parameters, we first arbitrarily fixed the average file size to be ∼4 GiB and evaluated different combinations of three Globus GridFTP parameters: concurrency, parallelism, and pipeline depth.
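The exploration amounts to a grid sweep over these three knobs. The sketch below shows the shape of that sweep in Python; measure_throughput is a hypothetical placeholder for a routine that submits a transfer with the given settings and reports the achieved rate, and the grid values are illustrative, not the exact set we ran.

import itertools, random

concurrencies = [16, 32, 64, 128]
parallelisms = [1, 2, 4, 8]
pipeline_depths = [1, 4, 8, 16]

def measure_throughput(cc: int, p: int, d: int) -> float:
    # Placeholder: a real implementation would submit a ~1.2 TiB transfer of
    # ~4 GiB files with these settings and return the measured rate in Gb/s.
    return random.uniform(20, 95)

results = {(cc, p, d): measure_throughput(cc, p, d)
           for cc, p, d in itertools.product(concurrencies, parallelisms,
                                             pipeline_depths)}
best_cc, best_p, best_d = max(results, key=results.get)
print(best_cc, best_p, best_d, round(results[(best_cc, best_p, best_d)], 1), "Gb/s")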
Figure 4 shows the achieved throughput as a function of parallelism for different concurrencies and pipeline depths. Parallelism provides no obvious improvement in performance. We conjecture that the reason is that the round-trip time between source DTNs and destination DTNs is small (around 6 ms). Figure 5 shows the achieved throughput as a function of concurrency for different pipeline depths. We omitted parallelism in this comparison because it does not have much impact (Figure 4).
Figure 4: Data transfer performance versus parallelism with different concurrency (C) and pipeline depth (D).
From Figure 5, we see that the throughput increases with increasing concurrency, especially for small pipeline depths. Each DTN at ALCF has only a 10 Gb/s network interface card, and a transfer needs to use at least 10 DTNs to achieve the desired rate of 1 PiB in a day, or sustained ∼93 Gb/s. In order for a transfer to use 10 DTNs, the concurrency has to be at least 10. In order for each DTN to drive close to 10 Gb/s (read data from the storage at 10 Gb/s and send data on the network at the same rate), many (or all) DTN cores need to be used. In this case, each DTN has 8 cores, and thus a concurrency of at least 96 is needed to use all cores across the 12 ALCF DTNs. This explains why a concurrency of 128 gives the highest performance.
We also see in Figure 5 that increasing the pipeline depth reduces performance. The reason is that the Globus policy was designed for regular data transfers (i.e., transfers that are not as large as in this case, and endpoints and networks that are not as powerful). Specifically, at the time of the experiment, Globus doubled the requested pipeline depth, split multi-file transfers into batches of 1,000 files, and treated each batch as an independent transfer request. This policy was put in place to control the total number of concurrent file transfers and thus the memory load on the servers. For example, if we use a pipeline depth of 8, the maximum concurrency can only be 1000/16 = 62, which is why concurrencies of 64 and 128 achieved the same performance in Figure 5.
After we fed back the findings in this paper, Globus optimized this 1,000-file batching mechanism.
Similarly, when the pipeline depth is 16, the actual concurrency will be 1000/32 = 31, and thus transfers with concurrency greater than 32 achieve the same performance as those with 32 in Figure 5. Therefore, the optimal pipeline depth for our use case is 1, because pipelining is good for transferring tiny files (when the elapsed transfer time for one file is less than the round-trip time between the client and the source endpoint) but not for larger files.
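A small calculation, assuming the batching policy described above (pipeline depth doubled, 1,000-file batches), shows how the requested concurrency was being capped:

def effective_concurrency(requested_cc: int, pipeline_depth: int,
                          batch_size: int = 1000) -> int:
    # Each 1,000-file batch was treated as an independent transfer, and the
    # requested pipeline depth was doubled, so at most batch_size // (2 * depth)
    # GridFTP processes could be kept busy per batch.
    return min(requested_cc, batch_size // (2 * pipeline_depth))

print(effective_concurrency(128, 1))   # 128: depth 1 leaves concurrency intact
print(effective_concurrency(128, 8))   # 62: matches the example in the text
print(effective_concurrency(128, 16))  # 31: matches the example in the text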
Although we used the Globus transfer service in this work, we believe that our methods and conclusions are applicable to other wide area data transfers, because the three tunable parameters (concurrency, equivalent to the number of network and disk I/O threads or processes; parallelism, equivalent to the number of TCP connections; and pipelining, the number of outstanding commands on the control channel) are also used by other high-performance data transfer tools such as BBCP, FDT, dCache, FTS, and XDD.
4. Experiences transferring the data
As mentioned, we split each 1.2 TiB snapshot into 256 files of approximately equal size. We determined that transferring 64 or 128 files concurrently, with a total of 128 or 256 TCP streams, yielded the maximum transfer rate. We achieved an average disk-to-disk transfer rate of 92.4 Gb/s (or 1 PiB in 24 hours and 3 minutes), 99.8% of our goal of 1 PiB in 24 hours, when the end-to-end verification of data integrity in Globus is disabled. In contrast, when the end-to-end verification of data integrity in Globus is enabled, we achieved an average transfer rate of only 72 Gb/s (or 1 PiB in 30 hours and 52 minutes).

Figure 5: Data transfer performance versus concurrency for different pipeline depth values.
The Globus approach to checksum verification is motivated by the observations that the 16-bit TCP checksum is inadequate for detecting data corruption during communication [16,17] and that corruption can occur during file system operations. Globus pipelines the transfer and checksum computation; that is, the checksum computation of the ith file happens in parallel with the transfer of the (i+1)th file. Data are read twice at the source storage system (once for transfer and once for checksum), and at the destination storage system they are written once (for transfer) and read once (for checksum). Therefore, in order to achieve the desired rate of 93 Gb/s for checksum-enabled transfers, in the absence of checksum failures, 186 Gb/s of read bandwidth from the source storage system and, concurrently, 93 Gb/s of write bandwidth and 93 Gb/s of read bandwidth from the destination storage system are required. If checksum verification failures occur (i.e., one or more files are corrupted during the transfer), even more storage I/O bandwidth, CPU resources, and network bandwidth are required in order to achieve the desired rate. Figure 6 shows the overall transfer throughput, as determined via SNMP network monitoring, and the DTN CPU utilization, when performing transfers using the optimal parameters that we identified.
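The storage bandwidth figures quoted above follow from simple accounting; a minimal sketch, assuming the read/write pattern just described and no checksum failures:

def storage_bandwidth_requirements(target_gbps: float) -> dict:
    # Checksum-enabled transfers read every byte twice at the source (transfer
    # plus checksum) and write once and read once at the destination.
    return {"source_read_gbps": 2 * target_gbps,
            "dest_write_gbps": target_gbps,
            "dest_read_gbps": target_gbps}

print(storage_bandwidth_requirements(93))  # {'source_read_gbps': 186, ...}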
We see that transfers without integrity checking (marked by dashed-line boxes in Figure 6) can sustain rates close to our environment's theoretical bandwidth of 100 Gb/s, with little CPU utilization. If integrity checking is enabled (solid-line boxes in Figure 6), however, the CPU utilization increases significantly, and it is hard to stay close to the theoretical bandwidth continuously. We note three points: (1) the network is not dedicated to this experiment, and so some network bandwidth was unavoidably consumed by other programs; (2) we used the same optimal tunable parameters (concurrency of 128 and parallelism of 1) and the same file size for transfers with and without checksum in order to make sure that the only difference between the two cases is data verification; and (3) transfers with non-optimal parameters, performed as part of our explorations, were running during other times (outside the boxes) in Figure 6.
4.1. Checksum failures
Globus restarts failed or interrupted transfers from the last checkpoint in order to avoid retransmission costs. In the case of a checksum error, however, it retransmits the entire erroneous file. About 5% of our transfers experienced checksum failure. Such failures can be caused by network corruption, source storage error, or destination storage error. Since data integrity is verified after the whole file has been transferred to the destination, a retransmission must be done if the checksums do not match. For a given transfer, the number of failures equals the number of retransferred files. Obviously the transfer throughput will go down if too many failures occur. We show the transfer throughput versus failures in Figure 7.

Figure 6: Data transfer throughput vs. DTN CPU usage over eight days. We highlight two periods in which our testing data transfers were occurring with checksum computations (periods delineated by solid-line boxes) and three periods in which our testing transfers were occurring without checksum computations (dashed-line boxes). The rest of the periods, i.e., those not highlighted, are other users' regular transfers.
Assume that a transfer contains N files, each of x bytes, and that there are n checksum verification failures. Thus, n files are retransferred, and the total bytes transferred will then be (N + n)x. If we assume that the end-to-end throughput is R_e2e, the actual transfer time T_trs will be

T_trs = (N + n) x / R_e2e. (1)

Thus, the effective throughput R_trs seen by the transfer user, that is, the rate obtained when it takes T_trs seconds to transfer the N x useful bytes, will be

R_trs = N x / T_trs = N / (N + n) × R_e2e. (2)
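Equation 2 is straightforward to apply; the following snippet, with an illustrative failure count (not taken from our logs), shows the throughput penalty:

def effective_throughput(r_e2e_gbps: float, n_files: int, n_failures: int) -> float:
    # Equation 2: only N of the N + n transferred files are useful payload.
    return r_e2e_gbps * n_files / (n_files + n_failures)

# Illustrative numbers: 256 files per snapshot, 13 checksum failures,
# 93 Gb/s end-to-end rate.
print(round(effective_throughput(93.0, 256, 13), 1))  # about 88.5 Gb/s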
We note that the transfers in Figure 7 have different concurrency, parallelism, pipeline depth, and average file size, and thus their R_e2e differ. If we look only at the transfers with similar concurrency, the shape in Figure 7 fits well with Equation 2.
5. Retrospective analysis: A model-based approach to ﬁnding the optimal number of ﬁles
The ability to control the ﬁle size (and thus number of ﬁles) in the dataset is a key ﬂexibility in this use case.
Thus, while we used a file size of 4 GB in our experiments, based on limited exploration and intuition, we
realized in retrospect that we could have created a model to identify the optimal ﬁle size. Here, we present
follow-up work in which we develop and apply such a model.
In developing a model of file transfer plus checksum costs, we start with a simple linear model of transfer time for a single file:

T_trs = a_trs x + b_trs, (3)

where a_trs is the unit transfer time, b_trs the transfer startup cost, and x the file size. Similarly, we model the time to verify file integrity as

T_ck = a_ck x + b_ck, (4)

where a_ck is the unit checksum time, b_ck the checksum startup cost, and x the file size.
Figure 7: Data transfer performance versus number of checksum failures, classified by transfer concurrency (C). Each dot represents one transfer.
Figure 8: Illustration of ﬁle transfer and veriﬁcation overlap within a single GridFTP process, as modeled by Equation 5. File
transfer and associated ﬁle veriﬁcation operations are managed as independent pipelines, connected only in that a ﬁle can be
veriﬁed only after it has been transferred, and veriﬁcation failure causes a ﬁle retransmit.
The time to transfer and checksum a file is not the simple sum of these two equations because, as shown in Figure 8, Globus GridFTP pipelines data transfers and associated checksum computations. Note how the data transfer and file integrity verification computations overlap. Thus, assuming no file system contention and that the unit checksum time is less than the unit transfer time (we verified that this is indeed the case in our environment), the total time T to transfer n files with one GridFTP process, each of size x, is the time required for n file transfers and one checksum, namely,

T = n T_trs + T_ck + b_srvs = n (x a_trs + b_trs) + x a_ck + b_ck + b_srvs, (5)

where b_srvs is the transfer service (Globus in this paper) startup cost (e.g., time to establish the Globus control channel). Let us now assume that the concurrency is cc, that S denotes the total bytes to be transferred, and that we divide the S bytes equally into N files, where N is perfectly divisible by cc. Thus there are n = N/cc files per concurrent transfer (i.e., per GridFTP process), each of size x = S/N. The transfer time T as a function of the number of files N will then be

T(N) = (S/cc) a_trs + (N/cc) b_trs + (S/N) a_ck + b_ck + b_srvs. (6)
We use experimental data to estimate the four parameters in Equation 6, namely a_trs, b_trs, a_ck, and (b_ck + b_srvs), in our environment and for different concurrency values cc. Since we previously determined (see Figure 4) that parallelism makes little difference in our low-RTT environment, we fixed parallelism at four in these experiments. We note that, for other scenarios with long RTT, the best parallelism should be determined first, e.g., by iteratively exploring parallelism as shown in Figure 4. For each concurrency value, we fixed S = 1.2 TiB, used four measured (N, T) points to fit the four parameters, and then used the resulting model to predict performance for other values of N. Figure 9 shows our results. Here, the lines are model predictions, stars are measured values used to fit the model, and other dots are measured values not used to fit the model.
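The fitting procedure can be reproduced with a few lines of Python. In the sketch below the measured (N, T) points are made-up placeholders, and, because S and cc are fixed, the (S/cc)·a_trs term is folded into a single constant together with b_ck + b_srvs; the least-squares fit and the scan over candidate file counts illustrate the approach rather than reproduce our exact results.

import numpy as np

S = 1.2 * 2**40   # bytes in one snapshot (1.2 TiB)
cc = 128          # concurrency (number of GridFTP processes)

# Hypothetical measured points (number of files N, snapshot transfer time T in s).
measured = [(256, 118.0), (512, 112.0), (1024, 109.0), (4096, 121.0)]

# For fixed S and cc, Equation 6 collapses to three identifiable terms:
#   T(N) = c0 + (N / cc) * b_trs + (S / N) * a_ck,
# where c0 = (S / cc) * a_trs + b_ck + b_srvs absorbs the constant costs.
A = np.array([[1.0, N / cc, S / N] for N, _ in measured])
y = np.array([T for _, T in measured])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predicted_time(n_files: int) -> float:
    return float(np.array([1.0, n_files / cc, S / n_files]) @ coef)

# Scan candidate file counts and pick the split that minimizes predicted time.
candidates = [2**k for k in range(8, 14)]   # 256 ... 8192 files per snapshot
best = min(candidates, key=predicted_time)
print(best, "files of about", round(S / best / 2**20), "MiB each")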
Figure 9: Evaluation of our transfer performance model when transferring a single 1.2 TiB snapshot. Solid markers are points
that were used to ﬁt the model parameters shown in Equation 6.
Figure 9 shows the accuracy of the performance model. We see that our model does a good job of predicting throughput as a function of N and cc. Since the four model parameters in Equation 6 are independent of source, network, and destination, at least four experiments are needed to fit the model, after which the model can be used to determine the best file size into which to split a simulation snapshot. We conclude that the optimal file size is around 800 MiB (i.e., split the 1.2 TiB snapshot into 1536–2048 files) and that this choice can achieve a throughput of 89.7 Gb/s with integrity verification. This throughput represents an increase of 25% compared with that obtained with the ad hoc approach, when we used a file size of 4 GB.
6. Related work
Elephant flows such as those considered here have been known to account for over 90% of the bytes transferred on typical networks, making their optimization important.

At the 2009 Supercomputing conference, a multidisciplinary team of researchers from DOE national laboratories and universities demonstrated the seamless, reliable, and rapid transfer of 10 TiB of Earth System Grid data from three sources: the Argonne Leadership Computing Facility, Lawrence Livermore National Laboratory, and the National Energy Research Scientific Computing Center. The team achieved a sustained data rate of 15 Gb/s on a 20 Gb/s network provided by DOE's ESnet. More important, their work provided critical feedback on how to deploy, tune, and monitor the middleware used to replicate petascale climate datasets. Their work clearly showed why supercomputer centers need to install dedicated hosts, referred to as data transfer nodes, for wide area transfers.
In another SC experiment, this time in 2011, Balman et al.  streamed cosmology simulation data
over a 100 Gb/s network to a remote visualization system, obtaining an average performance of 85 Gb/s.
However, data were communicated with a lossy UDP-based protocol.
Many researchers have studied the impact of parameters such as concurrency and parallelism on data transfer performance [5,6,7] and have proposed and evaluated alternative transfer protocols [23,24,25] and implementations. Jung et al. proposed a serverless data movement architecture that bypasses data transfer nodes, the filesystem stack, and the host system stack and directly moves data from one disk array controller to another, in order to obtain the highest end-to-end data transfer performance. Newman et al. summarized the next-generation exascale network integrated architecture project, which is designed to accomplish new levels of network and computing capabilities in support of global science collaborations through the development of a new class of intelligent, agile networked systems.
Rao et al. studied the performance of TCP variants and their parameters for high-performance transfers over dedicated connections by collecting systematic measurements using physical and emulated dedicated connections. These experiments revealed important properties such as concave regions and relationships between dynamics and throughput profiles. Their analyses enable the selection of a high-throughput transport method and corresponding parameters for a given connection based on round-trip time. Liu et al. similarly studied UDT.
Specifically for bulk wide area data transfer, Liu et al. analyzed millions of Globus data transfers involving thousands of DTNs and showed that DTN performance has a nonlinear relationship with load. Liu et al. also conducted a systematic examination of a large set of data transfer logs to characterize transfer characteristics, including the nature of the datasets transferred, achieved throughput, user behavior, and resource usage. Their analysis yields new insights that can help design better data transfer tools, optimize networking and edge resources used for transfers, and improve the performance and experience for end users. Specifically, their analysis shows that most of the datasets, as well as the individual files transferred, are very small; that data corruption is not negligible for large data transfers; and that data transfer node utilization is low.
7. Conclusion

We have presented our experiences in transferring one petabyte of science data within one day. We first described the exploration that we performed to identify parameter values that yield maximum performance for Globus transfers. We then discussed our experiences in transferring data while the data are produced by the simulation, both with and without end-to-end integrity verification. We achieved 99.8% of our one-petabyte-per-day goal without integrity verification and 78% with integrity verification. We also used a model-based approach to identify the optimal file size for transfers; the results suggest that we could achieve 97% of our goal with integrity verification by choosing the appropriate file size. We believe that our work serves as a useful lesson in the time-constrained transfer of large datasets.
Acknowledgments

We would like to thank the anonymous FGCS reviewers for their valuable feedback and questions. This work was supported in part by the U.S. Department of Energy under contract number DE-AC02-06CH11357, National Science Foundation award 1440761, and the Blue Waters sustained-petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993) and the state of Illinois.
References

[1] S. Habib, A. Pope, H. Finkel, N. Frontiere, K. Heitmann, D. Daniel, P. Fasel, V. Morozov, G. Zagaris, T. Peterka, V. Vishwanath, Z. Lukić, S. Sehrish, W. Liao, HACC: Simulating sky surveys on state-of-the-art supercomputing architectures, New Astronomy 42 (2016) 49–65.
[2] Globus, https://www.globus.org (accessed January 3, 2018).
[3] K. Chard, S. Tuecke, I. Foster, Efficient and secure transfer, synchronization, and sharing of big data, IEEE Cloud Computing 1 (3) (2014) 46–55.
[4] M. Wilde, M. Hategan, J. M. Wozniak, B. Clifford, D. S. Katz, I. Foster, Swift: A language for distributed parallel scripting, Parallel Computing 37 (9) (2011) 633–652.
[5] T. J. Hacker, B. D. Athey, B. Noble, The end-to-end performance effects of parallel TCP sockets on a lossy wide-area network, in: International Parallel and Distributed Processing Symposium (IPDPS 2002), IEEE, 2002.
[6] W. Allcock, J. Bresnahan, R. Kettimuthu, M. Link, C. Dumitrescu, I. Raicu, I. Foster, The Globus striped GridFTP framework and server, in: ACM/IEEE Conference on Supercomputing, IEEE Computer Society, 2005, p. 54.
[7] E. Yildirim, E. Arslan, J. Kim, T. Kosar, Application-level optimization of big data transfers through pipelining, parallelism and concurrency, IEEE Transactions on Cloud Computing 4 (1) (2016) 63–75.
[8] E. Dart, L. Rotman, B. Tierney, M. Hester, J. Zurawski, The Science DMZ: A network design pattern for data-intensive science, Scientific Programming 22 (2) (2014) 173–185.
[9] Z. Liu, R. Kettimuthu, S. Leyffer, P. Palkar, I. Foster, A mathematical programming- and simulation-based framework to evaluate cyberinfrastructure design choices, in: 2017 IEEE 13th International Conference on e-Science, 2017, pp. 148–157.
[10] Z. Liu, R. Kettimuthu, I. Foster, P. H. Beckman, Towards a smart data transfer node, in: 4th International Workshop on Innovating the Network for Data Intensive Science, 2017, p. 10.
[11] BBCP, http://www.slac.stanford.edu/~abh/bbcp/.
[12] FDT: Fast Data Transfer, http://monalisa.cern.ch/FDT/ (accessed January 3, 2018).
[13] P. Fuhrmann, V. Gülzow, dCache, storage system for the future, in: European Conference on Parallel Processing, Springer, 2006, pp. 1106–1113.
[14] CERN, FTS3: Robust, simplified and high-performance data movement service for WLCG, http://fts3-service.web.cern.ch (accessed January 3, 2018).
[15] B. W. Settlemyer, J. D. Dobson, S. W. Hodson, J. A. Kuehn, S. W. Poole, T. M. Ruwart, A technique for moving large data sets over high-performance long distance networks, in: 27th Symposium on Mass Storage Systems and Technologies, 2011, pp. 1–6. doi:10.1109/MSST.2011.5937236.
[16] V. Paxson, End-to-end internet packet dynamics, IEEE/ACM Transactions on Networking 7 (3) (1999) 277–292.
[17] J. Stone, C. Partridge, When the CRC and TCP checksum disagree, in: ACM SIGCOMM Computer Communication Review, Vol. 30, ACM, 2000, pp. 309–319.
[18] L. N. Bairavasundaram, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, G. R. Goodson, B. Schroeder, An analysis of data corruption in the storage stack, ACM Transactions on Storage 4 (3) (2008) 8.
[19] K. Lan, J. Heidemann, A measurement study of correlations of internet flow characteristics, Computer Networks 50 (1).
[20] D. N. Williams, R. Drach, R. Ananthakrishnan, I. T. Foster, D. Fraser, F. Siebenlist, D. E. Bernholdt, M. Chen, J. Schwidder, S. Bharathi, A. L. Chervenak, R. Schuler, M. Su, D. Brown, L. Cinquini, P. Fox, J. Garcia, D. E. Middleton, W. G. Strand, N. Wilhelmi, S. Hankin, R. Schweitzer, P. Jones, A. Shoshani, A. Sim, The Earth System Grid: Enabling access to multimodel climate simulation data, Bulletin of the American Meteorological Society 90 (2) (2009) 195–205.
[21] R. Kettimuthu, A. Sim, D. Gunter, B. Allcock, P.-T. Bremer, J. Bresnahan, A. Cherry, L. Childers, E. Dart, I. Foster, K. Harms, J. Hick, J. Lee, M. Link, J. Long, K. Miller, V. Natarajan, V. Pascucci, K. Raffenetti, D. Ressman, D. Williams, L. Wilson, L. Winkler, Lessons learned from moving Earth System Grid data sets over a 20 Gbps wide-area network, in: 19th ACM International Symposium on High Performance Distributed Computing, HPDC '10, ACM, New York, NY, USA, 2010, pp. 316–319. doi:10.1145/1851476.1851519.
[22] M. Balman, E. Pouyoul, Y. Yao, E. Bethel, B. Loring, M. Prabhat, J. Shalf, A. Sim, B. L. Tierney, Experiences with 100Gbps network applications, in: 5th International Workshop on Data-Intensive Distributed Computing, ACM, 2012.
[23] E. Kissel, M. Swany, B. Tierney, E. Pouyoul, Efficient wide area data transfer protocols for 100 Gbps networks and beyond, in: 3rd International Workshop on Network-Aware Data Management, ACM, 2013, p. 3.
[24] D. X. Wei, C. Jin, S. H. Low, S. Hegde, FAST TCP: Motivation, architecture, algorithms, performance, IEEE/ACM Transactions on Networking 14 (6) (2006) 1246–1259.
[25] L. Zhang, W. Wu, P. DeMar, E. Pouyoul, mdtmFTP and its evaluation on ESNET SDN testbed, Future Generation Computer Systems.
[26] H. Bullot, R. Les Cottrell, R. Hughes-Jones, Evaluation of advanced TCP stacks on fast long-distance production networks, Journal of Grid Computing 1 (4) (2003) 345–359.
[27] E. S. Jung, R. Kettimuthu, High-performance serverless data transfer over wide-area networks, in: 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015, pp. 557–564. doi:10.1109/IPDPSW.2015.69.
[28] H. Newman, M. Spiropulu, J. Balcas, D. Kcira, I. Legrand, A. Mughal, J. Vlimant, R. Voicu, Next-generation exascale network integrated architecture for global science, Journal of Optical Communications and Networking 9 (2) (2017) A162–.
[29] N. S. Rao, Q. Liu, S. Sen, D. Towsley, G. Vardoyan, R. Kettimuthu, I. Foster, TCP throughput profiles using measurements over dedicated connections, in: 26th International Symposium on High-Performance Parallel and Distributed Computing, ACM, 2017, pp. 193–204.
[30] Q. Liu, N. S. Rao, C. Q. Wu, D. Yun, R. Kettimuthu, I. T. Foster, Measurement-based performance profiles and dynamics of UDT over dedicated connections, in: 24th International Conference on Network Protocols, IEEE, 2016, pp. 1–10.
[31] Y. Gu, R. L. Grossman, UDT: UDP-based data transfer for high-speed wide area networks, Computer Networks 51 (7).
[32] Z. Liu, P. Balaprakash, R. Kettimuthu, I. Foster, Explaining wide area data transfer performance, in: 26th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '17, ACM, New York, NY, USA, 2017, pp. 167–178. doi:10.1145/3078597.3078605.
[33] Z. Liu, R. Kettimuthu, I. Foster, N. S. V. Rao, Cross-geography scientific data transferring trends and behavior, in: 27th International Symposium on High-Performance Parallel and Distributed Computing, HPDC '18, ACM, New York, NY, USA, 2018, pp. 267–278. doi:10.1145/3208040.3208053.