Conference Paper

Experimental Analysis of File Transfer Rates Over Wide-Area Dedicated Connections

Abstract

File transfers over dedicated connections, supported by large parallel filesystems, have become increasingly important in high-performance computing and big data workflows. It remains a challenge to achieve peak rates for such transfers due to the complexities of file I/O, host, and network transport subsystems, and equally importantly, their interactions. We present extensive measurements of disk-to-disk file transfers using Lustre and XFS filesystems mounted on multi-core servers over a suite of 10 Gbps emulated connections with 0–366 ms round trip times. Our results indicate that large buffer sizes and many parallel flows do not always guarantee high transfer rates. Furthermore, large variations in the measured rates necessitate repeated measurements to ensure confidence in inferences based on them. We propose a new method to efficiently identify the optimal joint file I/O and network transport parameters using a small number of measurements. We show that for XFS and Lustre with direct I/O, this method identifies configurations achieving 97% of the peak transfer rate while probing only 12% of the parameter space.
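
To make the reduced parameter search concrete, the sketch below shows one way such a sweep could be organized: probe a coarse subgrid of (I/O buffer size, flow count) settings, then refine only around the best coarse point. This is an illustrative sketch, not the paper's algorithm; measure_rate is a hypothetical hook that runs one disk-to-disk transfer and returns its rate, and the parameter grids are assumed values.

```python
# Illustrative sketch (not the paper's exact method): coarse probing of a
# (buffer size, flow count) grid followed by local refinement around the
# best coarse point. measure_rate(buffer_mb, flows) is a hypothetical hook
# that runs one transfer and returns its rate in Gbps.
from itertools import product

BUFFER_SIZES_MB = [1, 2, 4, 8, 16, 32, 64, 128]   # assumed I/O request sizes
FLOW_COUNTS     = [1, 2, 4, 8, 16, 32]            # assumed parallel flow counts

def coarse_then_refine(measure_rate, stride=3):
    # Phase 1: sample every `stride`-th point of each parameter axis.
    coarse = list(product(BUFFER_SIZES_MB[::stride], FLOW_COUNTS[::stride]))
    results = {cfg: measure_rate(*cfg) for cfg in coarse}
    best_cfg = max(results, key=results.get)

    # Phase 2: probe only the immediate neighbors of the best coarse point.
    bi = BUFFER_SIZES_MB.index(best_cfg[0])
    fi = FLOW_COUNTS.index(best_cfg[1])
    for b in BUFFER_SIZES_MB[max(0, bi - 1):bi + 2]:
        for f in FLOW_COUNTS[max(0, fi - 1):fi + 2]:
            if (b, f) not in results:
                results[(b, f)] = measure_rate(b, f)

    best = max(results, key=results.get)
    return best, results[best], len(results)   # config, rate, probes used
```

Because the abstract notes large run-to-run variation, each measure_rate call would in practice average several repeated transfers before the comparison is made.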

... We consider transfers similar to those plotted in Fig. 7a but with the Lustre filesystem, where two stripes are used for the default and direct IO options. While performance improves with higher flow counts, as evidenced by the higher C_UC(Θ_E) curves, the default IO Lustre option does not share the same characteristics: using 4 flows yields the best performance, which has also been observed in [30]. Finally, we plot C_UC of the profiles for TCP, end-to-end transfer, and the file transfer mechanism for the eight-stripe LNet Lustre configuration in Fig. 12. ...
... Liu et al. [19] extract features for endpoint CPU load, NIC load and transfer characteristics, and use these features in linear and nonlinear transfer performance models. Some studies have focused on profiling of subsystems [18,29,30] including network, IO, and host systems. In particular, both Rao et al. [29] and Liu et al. [18] have investigated conditions under which the overall memory transfer throughput profile exhibits the desirable concave characteristic, whereas in Rao et al. [30] extensive XDD file transfer throughput performance is discussed. ...
... Some studies have focused on profiling of subsystems [18,29,30] including network, IO, and host systems. In particular, both Rao et al. [29] and Liu et al. [18] have investigated conditions under which the overall memory transfer throughput profile exhibits the desirable concave characteristic, whereas in Rao et al. [30] extensive XDD file transfer throughput performance is discussed. This paper extends the above concavity-convexity analysis to infrastructure data transfers for the first time, whose variations across sites lead to non-smooth profiles. ...
Chapter
To support increasingly distributed scientific and big-data applications, powerful data transfer infrastructures are being built with dedicated networks and software frameworks customized to distributed file systems and data transfer nodes. The data transfer performance of such infrastructures critically depends on the combined choices of file, disk, and host systems as well as network protocols and file transfer software, all of which may vary across sites. The randomness of throughput measurements makes it challenging to assess the impact of these choices on the performance of infrastructure or its parts. We propose regression-based throughput profiles by aggregating measurements from sites of the infrastructure, with RTT as the independent variable. The peak values and convex-concave shape of a profile together determine the overall throughput performance of memory and file transfers, and its variations show the performance differences among the sites. We then present projection and difference operators, and coefficients of throughput profiles to characterize the performance of infrastructure and its parts, including sites and file transfer tools. In particular, the utilization-concavity coefficient provides a value in the range [0, 1] that reflects overall transfer effectiveness. We present results of measurements collected using (i) testbed experiments over dedicated 0–366 ms 10 Gbps connections with combinations of TCP versions, file systems, host systems and transfer tools, and (ii) Globus GridFTP transfers over production infrastructure with varying site configurations.
... The profiles in Figs. 1(a) and 1(b) are quite different not only in their peak throughput values (10 and 4.5 Gbps, respectively) but also in their concave and convex shapes, respectively. The first represents a near-optimal performance achieved by balancing and tuning XFS and TCP parameters in XDD [18], whereas the second represents a performance bottleneck due to LNet credit limits [16]. On the other hand, the third profile, in Fig. 1(c), shows a combination of both concave and convex regions [17], wherein TCP buffers limit throughput values beyond a certain RTT at which the profile switches from concave to convex in shape. ...
... For Lustre filesystems, important parameters are the stripe size and number of stripes for the files, and these are typically specified at creation time; the number of parallel I/O threads for read/write operations is specified at transfer time. To sustain high throughput, I/O buffer size and the number of parallel threads are chosen to be sufficiently large, but this heuristic is not always optimal [18]. For instance, wide-area file transfers over 10 Gbps connections between two Lustre filesystems achieve transfer rates of only 1.5 Gbps, when striped across 8 storage servers, accessed with 8 MB buffers, and with 8 I/O and TCP threads [18], even though peak network memory-transfer rate and local file throughput are each close to 10 Gbps. ...
... To sustain high throughput, I/O buffer size and the number of parallel threads are chosen to be sufficiently large, but this heuristic is not always optimal [18]. For instance, wide-area file transfers over 10 Gbps connections between two Lustre filesystems achieve transfer rates of only 1.5 Gbps, when striped across 8 storage servers, accessed with 8 MB buffers, and with 8 I/O and TCP threads [18], even though peak network memory-transfer rate and local file throughput are each close to 10 Gbps. ...
Conference Paper
Dedicated data transport infrastructures are increasingly being deployed to support distributed big-data and high-performance computing scenarios. These infrastructures employ data transfer nodes that use sophisticated software stacks to support network transport among sites, which often house distributed file and storage systems. Throughput measurements collected over such infrastructures for a range of round trip times (RTTs) reflect the underlying complex end-to-end connections, and have revealed dichotomous throughput profiles as functions of RTT. In particular, concave regions of throughput profiles at lower RTTs indicate near-optimal performance, and convex regions at higher RTTs indicate bottlenecks due to factors such as buffer or credit limits. We present a machine learning method that explicitly infers these concave and convex regions and transitions between them using sigmoid functions. We also provide distribution-free confidence estimates for the generalization error of these concave-convex profile estimates. Throughput profiles for data transfers over 10 Gbps connections with 0-366 ms RTT provide important performance insights, including the near optimality of transfers performed with the XDD tool between XFS filesystems, and the performance limits of wide-area Lustre extensions using LNet routers. A direct application of generic machine learning packages does not adequately highlight these critical performance regions or provide as precise confidence estimates .
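
As a rough illustration of the sigmoid-based idea (not the authors' implementation), the following sketch fits a decreasing sigmoid to throughput-versus-RTT measurements with SciPy and reads off the inflection point that separates the concave (low-RTT) region from the convex (high-RTT) region; the sample data are synthetic.

```python
# Minimal sketch: fit a decreasing sigmoid to a throughput profile and use
# its inflection point to separate concave and convex regions.
import numpy as np
from scipy.optimize import curve_fit

def sigmoid(rtt, a, b, c, d):
    # a: profile height, b: steepness, c: inflection RTT (ms), d: floor
    return a / (1.0 + np.exp(b * (rtt - c))) + d

def fit_profile(rtt_ms, gbps):
    rtt_ms = np.asarray(rtt_ms, dtype=float)
    gbps = np.asarray(gbps, dtype=float)
    p0 = [gbps.max() - gbps.min(), 0.05, np.median(rtt_ms), gbps.min()]
    params, _ = curve_fit(sigmoid, rtt_ms, gbps, p0=p0, maxfev=10000)
    a, b, c, d = params
    # For b > 0 the fitted curve is concave for rtt < c and convex for rtt > c.
    return params, c

# Synthetic measurements over 0-366 ms RTTs (illustrative values only):
rtt = [0, 11, 22, 45, 91, 183, 366]
thr = [9.4, 9.3, 9.1, 8.6, 6.8, 3.1, 1.2]
params, inflection_rtt = fit_profile(rtt, thr)
print("concave/convex transition near %.0f ms RTT" % inflection_rtt)
```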
... Overall, the current and past measurements indicate that Θ_E(τ) is typically concave for smaller RTT and then switches to convex as RTT is increased [15-18]. Consider the memory transfer profile Θ_T(τ) of the underlying TCP transport, which itself exhibits such a dual property, typically with a wider concave region [18]. ...
... In this paper, we show the dependencies of Θ_E(τ) on both network and file IO parameters, highlight the differences among various file transfer methods, and then propose systematic methods to compare them. We mainly focus on analytics of measurements under various configurations that have been described in detail in [14-17,19], which we briefly summarize here in various sections to provide the context and illustrate specific performance parameters. ...
... We summarize various file throughput measurements collected over a suite of dedicated 10 Gbps emulated connections with 0-366 ms round-trip times, using GridFTP [22] and XDD [23] file transfer tools, and the Lustre file system extended over WANs using LNet routers [14]. Their collection spans a five-year period and covers a variety of combinations with large Terabyte datasets, and the individual configurations and tests have been described in detail in [14-17,19]. Together, these test configurations are composed of: (i) three types of host systems, 48- and 32-core data transfer servers, and 32-core cluster nodes; (ii) three file systems, the host file system ext3 on local hard disk, XFS on SSD drive, and Lustre implemented on an IB network; and (iii) two types of network connections, 10 Gbps Ethernet and 9.6 Gbps OC192 SONET wide-area connections. ...
Conference Paper
Distributed scientific and big-data computations are becoming increasingly dependent on access to remote files. Wide-area file transfers are supported by two basic schemes: (i) application-level tools, such as GridFTP, that provide transport services between file systems housed at geographically separated sites, and (ii) file systems mounted over wide-area networks, using mechanisms such as LNet routers that make them transparently available. In both cases, transfer performance depends critically on the configuration of associated host, file, IO, and disk subsystems, each of which is complex by itself, as well as on their complex compositions, implemented using buffers and IO-network data transitions. We present extensive file transfer rate measurements collected over dedicated 10 Gbps connections with 0–366 ms round-trip times, using GridFTP and XDD file transfer tools, and the Lustre file system extended over wide-area networks with LNet routers. Our test configurations are composed of: three types of host systems; XFS, Lustre, and ext3 file systems; and Ethernet and SONET wide-area connections. We present analytics based on the convexity-concavity of throughput profiles which provide insights into throughput and its superior or inferior trend compared to linear interpolations. We propose the utilization-concavity coefficient, a scalar metric that characterizes the overall performance of any file transfer method consisting of specific configuration and scheme. Our results enable performance optimizations by highlighting the significant roles of (i) buffer sizes and parallelism in GridFTP and XDD, and (ii) buffer utilization and credit mechanism in LNet routers.
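
The exact definition of the utilization-concavity coefficient is given in the paper; purely as an illustration of how a score in [0, 1] of this kind can be computed from a measured profile, the sketch below averages a utilization term (mean throughput over capacity) with a concavity term (how much of the profile lies above the chord joining its endpoints). The equal weighting and the 10 Gbps capacity are assumptions for the example, not the paper's formula.

```python
# Illustrative only: one plausible [0, 1] score combining utilization and
# concavity of a throughput-vs-RTT profile (the cited work defines the
# actual utilization-concavity coefficient).
import numpy as np

def utilization_concavity(rtt_ms, gbps, capacity_gbps=10.0):
    # assumes rtt_ms is sorted in increasing order
    rtt = np.asarray(rtt_ms, dtype=float)
    thr = np.asarray(gbps, dtype=float)

    utilization = thr.mean() / capacity_gbps          # in [0, 1]

    # Chord (straight line) between the first and last measured points.
    chord = np.interp(rtt, [rtt[0], rtt[-1]], [thr[0], thr[-1]])
    above = np.clip(thr - chord, 0.0, None).sum()     # concave excess
    below = np.clip(chord - thr, 0.0, None).sum()     # convex deficit
    concavity = above / (above + below) if (above + below) > 0 else 0.5

    return 0.5 * (utilization + concavity)            # in [0, 1]
```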
... However, experimental and analytical studies of such flows are quite limited, since most research has focused on shared network environments [25]. Our work has shown that while parameter optimizations specific to dedicated connections are indeed somewhat simpler than those required for shared connections, they are not simple extensions of the well-studied shared connection solutions [18,10]. This observation led to the establishment of a testbed at Oak Ridge National Laboratory (ORNL) and the development of measurement and analysis methods for these scenarios. ...
... This chapter presents a comprehensive view of measurements and analyses of systematic experiments conducted using our testbed. It unifies the analysis of throughput profiles of TCP [17], UDT [10], XDD [18], and BBR [19], which we have previously treated separately but in more detail. Additionally, we include some preliminary throughput profiles of the Lustre filesystem mounted over long-haul Ethernet connections, and initial GridFTP throughput measurements over dedicated connections with 0-366 ms RTT. ...
... Transfers also involve a range of software components, including filesystem I/O modules for disk access and the TCP/IP stack for network transport. When filesystems are mounted at individual sites, file transfer software such as GridFTP [26] and XDD [28] running on the hosts is used for transport, and detailed measurements and analyses of XDD transfers using our testbed are presented in [18]. Another method is to mount filesystems over wide-area networks [27,15], wherein file transfers are handled by the underlying filesystem and are transparent to the user. ...
Chapter
Dedicated wide-area network connections are employed in big data and high-performance computing scenarios, since the absence of cross-traffic promises to make it easier to analyze and optimize data transfers over them. However, nonlinear transport dynamics and end-system complexity due to multi-core hosts and distributed filesystems make these tasks surprisingly challenging. We present an overview of methods to analyze memory and disk file transfers using extensive measurements over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory transfers, we derive performance profiles of TCP and UDT throughput as a function of RTT, which show concave regions in contrast to entirely convex regions predicted by previous models. These highly desirable concave regions can be expanded by utilizing large buffers and more parallel flows. We also present Poincare maps and Lyapunov exponents of TCP and UDT throughput traces that indicate complex throughput dynamics. For disk file transfers, we show that throughput can be optimized using a combination of parallel I/O and network threads under direct I/O mode. Our initial throughput measurements of Lustre filesystems mounted over long-haul connections using LNet routers show convex profiles indicative of I/O limits.
... Within the DOE HPC infrastructure, special-purpose Data Transfer Nodes (DTNs) [5] are installed to take advantage of dedicated circuits [17] provisioned over ESnet. Furthermore, Lustre over Ethernet enables filesystems to be mounted across long-haul links, thereby overcoming the 2.5 ms latency limitation of Infiniband (IB) [26]; this approach provides wide-area file access without the use of special tools such as GridFTP [29] or XDD [21], [31], or hardware IB range extenders [1], [16]. Transmission Control Protocol (TCP) and to a lesser extent User Datagram Protocol (UDP) (with suitable loss recovery and fairness enhancements) are used for widearea data transfers, including over dedicated connections. ...
... However, experimental and analytical studies of such flows are quite limited, since most research has focused on shared network environments [28]. Our work has shown that while parameter optimizations specific to dedicated connections are indeed somewhat simpler than those required for shared connections, they are not simple extensions of the well-studied shared connection solutions [12], [21]. This observation led to the establishment of a testbed at Oak Ridge National Laboratory (ORNL) and the development of measurement and analysis methods for these scenarios. ...
... This paper presents a comprehensive view of measurements and analyses of systematic experiments conducted using our testbed. It unifies the analysis of throughput profiles of TCP [22], UDT [12], and XDD [21], which we have previously treated separately but in more detail. We also present newer throughput profiles of BBR and compare them with those of HTCP, STCP, and CUBIC in [22]. ...
Conference Paper
Dedicated wide-area network connections are increasingly employed in high-performance computing and big data scenarios. One might expect the performance and dynamics of data transfers over such connections to be easy to analyze due to the lack of competing traffic. However, non-linear transport dynamics and end-system complexities (e.g., multi-core hosts and distributed filesystems) can in fact make analysis surprisingly challenging. We present extensive measurements of memory-to-memory and disk-to-disk file transfers over 10 Gbps physical and emulated connections with 0–366 ms round trip times (RTTs). For memory-to-memory transfers, profiles of both TCP and UDT throughput as a function of RTT show concave and convex regions; large buffer sizes and more parallel flows lead to wider concave regions, which are highly desirable. TCP and UDT both also display complex throughput dynamics, as indicated by their Poincare maps and Lyapunov exponents. For disk-to-disk transfers, we determine that high throughput can be achieved via a combination of parallel I/O threads, parallel network threads, and direct I/O mode. Our measurements also show that Lustre filesystems can be mounted over long-haul connections using LNet routers, although challenges remain in jointly optimizing file I/O and transport method parameters to achieve peak throughput.
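
For readers who want to reproduce the dynamics analysis at a basic level, the sketch below builds a Poincare map from a throughput time series and forms a crude largest-Lyapunov-exponent estimate from the divergence of initially close samples, in the spirit of Rosenstein's method. It assumes a one-dimensional, evenly sampled trace and is not the authors' exact procedure.

```python
# Rough sketch: Poincare map and a simple largest-Lyapunov-exponent
# estimate for a 1-D throughput trace sampled at a fixed interval.
import numpy as np

def poincare_map(x):
    x = np.asarray(x, dtype=float)
    return np.column_stack([x[:-1], x[1:]])          # (x_t, x_{t+1}) pairs

def lyapunov_estimate(x, horizon=10):
    x = np.asarray(x, dtype=float)
    n = len(x) - horizon
    logs = []
    for i in range(n):
        # nearest neighbor of x[i], excluding immediate neighbors in time
        d = np.abs(x[:n] - x[i])
        d[max(0, i - 2):i + 3] = np.inf
        j = int(np.argmin(d))
        d0 = abs(x[i] - x[j])
        dh = abs(x[i + horizon] - x[j + horizon])
        if d0 > 0 and dh > 0:
            logs.append(np.log(dh / d0) / horizon)   # divergence rate
    return float(np.mean(logs)) if logs else float("nan")
```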
... The advantages of using high performance data transfer protocols over traditional protocols (e.g., HTTP, SCP) have been well explored (Mattmann et al., 2006;Rao et al., 2016;Subramoni et al., 2010). The benefits of building upon a professionally managed platform are also well established (Cusumano, 2010). ...
... For larger transfers, we know from other work (Liu et al., 2017;Rao et al., 2016) that excellent performance can be achieved when both the source and destination endpoints and the intervening network(s) are configured appropriately. Figure 8B, which extracts from Fig. 8A the transfers from RDA to NERSC, illustrates this effect. ...
Article
Full-text available
We describe best practices for providing convenient, high-speed, secure access to large data via research data portals. We capture these best practices in a new design pattern, the Modern Research Data Portal, that disaggregates the traditional monolithic web-based data portal to achieve orders-of-magnitude increases in data transfer performance, support new deployment architectures that decouple control logic from data storage, and reduce development and operations costs. We introduce the design pattern; explain how it leverages high-performance data enclaves and cloud-based data management services; review representative examples at research laboratories and universities, including both experimental facilities and supercomputer sites; describe how to leverage Python APIs for authentication, authorization, data transfer, and data sharing; and use coding examples to demonstrate how these APIs can be used to implement a range of research data portal capabilities. Sample code at a companion web site, https://docs.globus.org/mrdp, provides application skeletons that readers can adapt to realize their own research data portals.
... ProbData is able to explore near-optimal configurations through sample transfers, but it takes several hours to converge. Rao et al. [31] presented a stochastic gradient descent based solution to tune the number of parallel flows. HARP [28] models data transfers using historical data and real-time sampling, and uses this model to estimate the application-layer transfer parameters that would maximize the throughput of a given transfer task. ...
Conference Paper
Full-text available
Scientific applications generate large volumes of data that often need to be moved between geographically distributed sites for collaboration or backup, which has led to a significant increase in data transfer rates. As an increasing number of scientific applications are becoming sensitive to silent data corruption, end-to-end integrity verification has been proposed. It minimizes the likelihood of silent data corruption by comparing checksums of files at the source and the destination using secure hash algorithms such as MD5 and SHA1. In this paper, we investigate the robustness of existing end-to-end integrity verification approaches against silent data corruption and propose a Robust Integrity Verification Algorithm (RIVA) to enhance data integrity. Extensive experiments show that, unlike existing solutions, RIVA is able to detect silent disk corruption by invalidating file contents in the page cache and reading them directly from disk. Since RIVA clears the page cache and reads file contents directly from the disk, it incurs a delay in execution time. However, by running transfer, cache invalidation, and checksum operations concurrently, RIVA is able to keep its overhead below 15% in most cases compared to the state-of-the-art solutions, in exchange for increased robustness to silent data corruption. We also implemented dynamic transfer and checksum parallelism to overcome performance bottlenecks and observed a more than 5x increase in RIVA's speed.
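
A minimal sketch of the page-cache-bypass step described above (not the RIVA implementation): it asks the kernel to drop the file's cached pages with posix_fadvise before checksumming, so the SHA-1 digest reflects the bytes on disk rather than stale cached data. It assumes a Unix host where os.posix_fadvise is available.

```python
# Sketch: checksum a file after evicting its pages from the page cache,
# so the digest is computed from data actually read off the disk.
import hashlib
import os

def disk_checksum(path, chunk_size=4 * 1024 * 1024):
    fd = os.open(path, os.O_RDONLY)
    try:
        # Advise the kernel to drop cached pages for the whole file.
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        h = hashlib.sha1()
        while True:
            chunk = os.read(fd, chunk_size)
            if not chunk:
                break
            h.update(chunk)
        return h.hexdigest()
    finally:
        os.close(fd)
```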
... There have been several attempts to tune some of these application-layer parameters to maximize transfer throughput using heuristic [3,6,7], supervised [14,21], semi-supervised [4,5], and unsupervised [20,23,24,28] models. Since heuristic, supervised, and semi-supervised models require significant upfront work and re-adjustments when the system configuration changes, unsupervised methods such as online optimization are favored as they can adapt to changing network conditions by discovering the optimal transfer settings in real time. ...
Conference Paper
Real-time transfer optimization approaches offer promising solutions as they can discover optimal transfer configurations at runtime without requiring upfront work or making assumptions about underlying system architectures. On the other hand, existing implementations suffer from slow convergence due to running many sample transfers with suboptimal configurations. In this work, we evaluate time-series models to minimize the impact of sample transfers with suboptimal configurations by shortening the transfer duration without degrading accuracy. The results gathered in various networks with a rich set of transfer configurations indicate that, in most cases, an autoregressive model can accurately estimate sample transfer throughput in less than 5 seconds, which is up to a 4x improvement over the state-of-the-art solution. We also find that while the most common transfer applications report transfer throughput at most once a second, decreasing the reporting interval is key to further reducing the impact of sample transfers by quickly determining their performance.
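
As a sketch of the time-series idea (the paper's exact model order and estimator may differ), the code below fits a small autoregressive model to the first few per-interval throughput reports of a sample transfer by least squares and uses its unconditional mean as the predicted steady-state throughput, letting a suboptimal sample transfer be cut short once the estimate stabilizes.

```python
# Sketch: AR(p) fit by least squares over early throughput samples; the
# unconditional mean serves as the steady-state throughput prediction.
import numpy as np

def ar_steady_state(samples, p=1):
    x = np.asarray(samples, dtype=float)
    # Design matrix for x_t = c + a1*x_{t-1} + ... + ap*x_{t-p}
    rows = [np.concatenate(([1.0], x[t - p:t][::-1])) for t in range(p, len(x))]
    X = np.array(rows)
    y = x[p:]
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    c, a = coef[0], coef[1:]
    # Unconditional mean of the fitted AR process; for short ramping
    # samples a low order (p=1) keeps this estimate stable.
    return c / (1.0 - a.sum())

# e.g., throughput (Gbps) reported every 0.5 s during a short sample transfer
print(ar_steady_state([1.2, 2.9, 4.8, 6.1, 6.9, 7.3, 7.6], p=1))
```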
... This is one of the most basic tasks as different sites often need to share files that contain important scientific data. XDD is a data transfer tool that has been used in the high-performance computing environment [18]. A single XDD file transfer process spawns a set of threads, called Qthreads, to open a file and perform data transfers between either storage and memory or memory and network. ...
Conference Paper
Recent developments in software-defined infrastructures promise that scientific workflows utilizing supercomputers, instruments, and storage systems will be dynamically composed and orchestrated using software at unprecedented speed and scale in the near future. Testing of the underlying networking software, particularly during initial exploratory stages, remains a challenge due to potential disruptions, and resource allocation and coordination needed over the multi-domain physical infrastructure. To overcome these challenges, we develop the Virtual Science Network Environment (VSNE) that emulates the multi-site host, storage, and network infrastructure using Virtual Machines (VMs), wherein the production and nascent software can be tested. Within each VM, which represents a site, the hosts and local-area networks are emulated using Mininet, and the Software-Defined Network (SDN) controllers and service daemon codes are natively run to support dynamic provisioning of network connections. Additionally, Lustre filesystem support at the sites and an emulation of the long-haul network using Mininet, are provided using separate VMs. As case studies, we describe Lustre file transfers using XDD, Red5 streaming service demonstration, and an emulated experiment with remote monitoring and steering modules, all supported over dynamically configured connections using SDN controllers.
... It is of our future interest to investigate and improve the convergence speed of SA-based profiling. It is also of our future interest to extend the capabilities of ProbData by incorporating various toolkits such as XDD [17] for fast disk-to-disk data transfer profiling based on optimization algorithms such as depth-width [16]. ...
Conference Paper
Full-text available
The network infrastructures have been rapidly upgraded in many high-performance networks (HPNs). However, such infrastructure investment has not led to corresponding performance improvement in big data transfer, especially at the application layer, largely due to the complexity of optimizing transport control on end hosts. We design and implement ProbData, a PRofiling Optimization Based DAta Transfer Advisor, to help users determine the most effective data transfer method with the most appropriate control parameter values to achieve the best data transfer performance. ProbData employs a profiling optimization-based approach to exploit the optimal operational zone of various data transfer methods in support of big data transfer in extreme-scale scientific applications. We present a theoretical framework of the optimized profiling approach employed in ProbData as well as its detailed design and implementation. The advising procedure and performance benefits of ProbData are illustrated and evaluated by proof-of-concept experiments in real-life networks.
... Within the DOE HPC infrastructure, special-purpose Data Transfer Nodes (DTN) [7] are installed to take advantage of the dedicated OSCARS circuits [17] provisioned over ESnet. Furthermore, Lustre over Ethernet enables file systems to be mounted across long-haul links [2], thereby overcoming the 2.5 ms latency limitation of Infiniband [25]; this approach provides file access over the wide area without requiring special transfer tools such as GridFTP [28], XDD [21,30], or hardware IB range extenders [1,16,23]. It is generally expected that the underlying Transmission Control Protocol (TCP) flows over dedicated connections provide peak throughput and stable dynamics that are critical in ensuring predictable transfer performance. ...
Conference Paper
Wide-area data transfers in high-performance computing infrastructures are increasingly being carried over dynamically provisioned dedicated network connections that provide high capacities with no competing traffic. We present extensive TCP throughput measurements and time traces over a suite of physical and emulated 10 Gbps connections with 0-366 ms round-trip times (RTTs). Contrary to the general expectation, they show significant statistical and temporal variations, in addition to the overall dependencies on the congestion control mechanism, buffer size, and the number of parallel streams. We analyze several throughput profiles that have highly desirable concave regions wherein the throughput decreases slowly with RTTs, in stark contrast to the convex profiles predicted by various TCP analytical models. We present a generic throughput model that abstracts the ramp-up and sustainment phases of TCP flows, which provides insights into qualitative trends observed in measurements across TCP variants: (i) slow-start followed by well-sustained throughput leads to concave regions; (ii) large buffers and multiple parallel streams expand the concave regions in addition to improving the throughput; and (iii) stable throughput dynamics, indicated by a smoother Poincare map and smaller Lyapunov exponents, lead to wider concave regions. These measurements and analytical results together enable us to select a TCP variant and its parameters for a given connection to achieve high throughput with statistical guarantees.
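
One simplified way to read the ramp-up/sustainment model described above (the paper's exact formulation may differ): for a transfer of duration T over a connection with RTT τ, the average throughput decomposes as

$$\bar{\Theta}(\tau) \;=\; \frac{t_r(\tau)}{T}\,\bar{\theta}_r(\tau) \;+\; \Bigl(1 - \frac{t_r(\tau)}{T}\Bigr)\,\theta_s(\tau),$$

where t_r(τ) is the ramp-up (slow-start) time, θ̄_r(τ) its average throughput, and θ_s(τ) the sustained throughput. If θ_s stays near the link capacity and t_r grows only mildly with τ, then Θ̄(τ) falls off slowly and the profile is concave; if θ_s(τ) itself is buffer-limited (roughly B/τ for buffer size B), the convex 1/τ decay dominates, which matches trends (i) and (ii) above.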
Article
Many large science projects rely on remote clusters for (near) real-time data processing, thus they demand reliable wide-area data transfer performance for smooth end-to-end workflow executions. However, data transfers are often exposed to performance variations due to changing network (e.g., background traffic) and dataset (e.g., average file size) conditions, necessitating adaptive solutions to meet the stringent performance requirements of delay-sensitive streaming workflows. In this article, we propose FStream++ to provide reliable transfer performance for large streaming science applications by dynamically adjusting transfer settings to adapt to changing transfer conditions. FStream++ combines three optimization methods, dynamic tuning, online profiling, and historical analysis, to swiftly and accurately discover optimal transfer settings that can meet workflow requirements. Dynamic tuning uses a heuristic model to predict the values of transfer parameters based on dataset characteristics and network settings. Since heuristic models fall short of incorporating many important factors such as I/O throughput and resource interference, we complement it with online profiling to execute a real-time search over a subset of transfer settings. Finally, historical analysis takes advantage of the long-running nature of streaming workflows by storing and analyzing previous performance observations to shorten the execution time of online profiling. We evaluate the performance of FStream++ by transferring several synthetic and real-world workloads in high-performance production networks and show that it offers up to 3.6x performance improvement over legacy transfer applications and up to 24% over our previous work FStream.
Article
Reliability in petabyte-scale file transfers is critical for data collected from scientific instruments. We introduce a Multi-Layer Error Detection (MLED) architecture that significantly reduces the Undetected Error Probability (UEP) in file transfer. MLED is parameterized by a number of layers n and a policy P_i for each layer 1 ≤ i ≤ n that describes its operation. MLED generalizes existing error detection approaches used in file transfer. We show conditions under which MLED reduces UEP. Analytical results show that a petabyte-size file transfer in MLED with n = 2 using CRC-32s improves UEP by 2.49E+28 compared to a single-layer CRC-64, when the BER is 10^-10.
Article
Big data transfer in next-generation scientific applications is now commonly carried out over dedicated channels in high-performance networks (HPNs), where transport protocols play a critical role in maximizing application-level throughput. Optimizing the performance of these protocols is challenging: i) transport protocols perform differently in various network environments, and the protocol choice is not straightforward; ii) even for a given protocol in a given environment, different parameter settings of the protocol may lead to significantly different performance and oftentimes the default setting does not yield the best performance. However, it is prohibitively time-consuming to conduct exhaustive transport profiling due to the large parameter space. In this paper, we propose a PRofiling Optimization Based DAta Transfer Advisor (ProbData) to help end users determine the most effective transport method with the most appropriate parameter settings to achieve satisfactory performance for big data transfer over dedicated connections in HPNs. ProbData employs a fast profiling scheme based on the Simultaneous Perturbation Stochastic Approximation algorithm, namely, FastProf, to accelerate the exploration of the optimal operational zones of various transport methods to improve profiling efficiency. We first present a theoretical background of the optimized profiling approach in ProbData and then detail its design and implementation. The advising procedure and performance benefits of FastProf and ProbData are illustrated and evaluated by both extensive emulations based on real-life performance measurements and experiments over various physical connections in existing production HPNs.
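
The simultaneous perturbation idea underlying FastProf can be sketched generically as follows: each iteration perturbs all transfer parameters at once with a random ±1 vector and uses just two throughput measurements to form a gradient estimate. This is the textbook SPSA update, not the ProbData code; measure is a hypothetical function that runs a short sample transfer with the given parameter vector.

```python
# Generic SPSA sketch for maximizing measured throughput over a vector of
# transfer parameters (e.g., stream count, buffer size).
import numpy as np

def spsa_maximize(measure, theta0, iters=20, a=0.5, c=1.0):
    # measure(theta) -> observed throughput for parameter vector theta
    theta = np.asarray(theta0, dtype=float)
    for k in range(1, iters + 1):
        ak = a / k ** 0.602              # standard SPSA gain sequences
        ck = c / k ** 0.101
        delta = np.random.choice([-1.0, 1.0], size=theta.shape)
        y_plus = measure(theta + ck * delta)
        y_minus = measure(theta - ck * delta)
        ghat = (y_plus - y_minus) / (2.0 * ck * delta)   # gradient estimate
        theta = theta + ak * ghat        # ascend the estimated gradient
    return theta
```

In practice the parameter vector would be rounded and clipped to valid settings (e.g., integer stream counts, power-of-two buffer sizes) before each measurement.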
Conference Paper
File transfers between decentralized storage sites over dedicated wide-area connections are becoming increasingly important in high-performance computing and big data scenarios. Designing such scientific workflows for large file transfers is extremely challenging as they depend on the file, I/O, host, and local- and wide-area network subsystems, and their interactions. To gain insights into file-transfer rate profiles, we develop polynomial, bagging, and boosting regression models for Lustre and XFS file transfer measurements, which are collected using XDD over a suite of 10 Gbps connections with 0-366 ms round trip times (RTTs). In addition to overall trends and analytics, these regressions also provide file-transfer rate estimates for RTTs and numbers of parallel flows at which measurements might not have been collected. They show that bagging and boosting techniques provide closer data fits than polynomial regression. We develop probabilistic bounds on the generalization error of these methods, which, combined with the cross-validation error, establish that the former two are more accurate estimators than polynomial regression. In addition, we present a method to efficiently determine the number of parallel flows that achieves a peak file-transfer rate using fewer measurements than a full sweep; in our measurements, the peak is achieved in 96% of cases with 15-25% of the measurements of a full sweep.
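
A hedged scikit-learn sketch of the model comparison described above (the paper's own model choices and hyperparameters are not reproduced): it cross-validates polynomial, bagging, and boosting regressors on (RTT, parallel-flow) features against measured transfer rates.

```python
# Sketch: compare polynomial, bagging, and boosting regressions of
# file-transfer rate on (RTT, number of parallel flows) by cross-validated
# mean squared error.
import numpy as np
from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def compare_models(X, y, cv=5):
    models = {
        "polynomial": make_pipeline(PolynomialFeatures(degree=3), LinearRegression()),
        "bagging": BaggingRegressor(n_estimators=100, random_state=0),
        "boosting": GradientBoostingRegressor(n_estimators=200, random_state=0),
    }
    return {
        name: -cross_val_score(m, X, y, cv=cv,
                               scoring="neg_mean_squared_error").mean()
        for name, m in models.items()
    }

# X: rows of [rtt_ms, n_flows]; y: measured rates (Gbps), e.g.:
# errors = compare_models(np.array(features), np.array(rates))
```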
Article
Full-text available
Two types of sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies. These plans are shown to be improvements over simple random sampling with respect to variance for a class of estimators that includes the sample mean and the empirical distribution function. 6 figures.
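
A minimal Latin hypercube sampler in the spirit of the stratified plans examined above: each of the d dimensions is split into n equal strata and exactly one point falls in each stratum per dimension, which is what yields the variance reduction over simple random sampling for estimators such as the sample mean.

```python
# Minimal Latin hypercube sampling over the unit cube [0, 1)^d.
import numpy as np

def latin_hypercube(n, d, rng=None):
    rng = np.random.default_rng(rng)
    # One uniform draw inside each of the n strata, per dimension.
    samples = (rng.random((n, d)) + np.arange(n)[:, None]) / n
    # Independently shuffle the strata in every dimension.
    for j in range(d):
        samples[:, j] = samples[rng.permutation(n), j]
    return samples

pts = latin_hypercube(10, 2, rng=1)   # 10 stratified points in 2 dimensions
```

The unit-cube points would then be mapped onto the actual parameter ranges of an experiment, for example the buffer sizes and flow counts in the transfer studies above.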
Conference Paper
Full-text available
Exascale supercomputers will have the potential for billion-way parallelism. While physical implementations of these systems are currently not available, HPC system designers can develop models of exascale systems to evaluate system design points. Modeling these systems and associated subsystems is a significant challenge. In this paper, we present the Co-design of Exascale Storage System (CODES) framework for evaluating exascale storage system design points. As part of our early work with CODES, we discuss the use of the CODES framework to simulate leadership-scale storage systems in a tractable amount of time using parallel discrete-event simulation. We describe the current storage system models and protocols included with the CODES framework and demonstrate the use of CODES through simulations of an existing petascale storage system.
Article
Full-text available
Long haul data transfers require the optimization and balancing of the performance of host and storage systems as well as network transport. An assessment of such transport methods requires a systematic generation of throughput profiles from measurements collected over different system parameters and connection lengths. We describe host and storage systems to support wide-area I/O transfers at 10 Gbps, and present measurements of memory and disk transfer throughputs over suites of physical and emulated connections of several thousands of miles. The physical connections are limited by the infrastructure and incur significant costs. The emulated connections could be of arbitrary lengths at significantly lower costs but only approximate the physical connections. We present a differential regression method to estimate the differences between the performance profiles of physical and emulated connections, and then to estimate "physical" profiles from emulated measurements. We present a systematic analysis of wide-area memory and disk transfer throughput measurements, and establish that robust estimates of physical profiles can be generated using much less expensive emulated connections. Index Terms—I/O throughput measurements, disk and host systems, performance analysis, differential and segmented regression, wide-area connections.
Conference Paper
Full-text available
In this paper we look at the performance characteristics of three tools used to move large data sets over dedicated long distance networking infrastructure. Although performance studies of wide area networks have been a frequent topic of interest, performance analyses have tended to focus on network latency characteristics and peak throughput using network traffic generators. In this study we instead perform an end-to-end long distance networking analysis that includes reading large data sets from a source file system and committing the data to a remote destination file system. An evaluation of end-to-end data movement is also an evaluation of the system configurations employed and the tools used to move the data. For this paper, we have built several storage platforms and connected them with a high performance long distance network configuration. We use these systems to analyze the capabilities of three data movement tools: BBcp, GridFTP, and XDD. Our studies demonstrate that existing data movement tools do not provide efficient performance levels or exercise the storage devices in their highest performance modes.
Article
Full-text available
This paper presents a new TCP variant, called CUBIC, for high-speed network environments. CUBIC is an enhanced version of BIC: it simplifies the BIC window control and improves its TCP-friendliness and RTT-fairness. The window growth function of CUBIC is governed by a cubic function in terms of the elapsed time since the last loss event. Our experience indicates that the cubic function provides good stability and scalability. Furthermore, the real-time nature of the protocol keeps the window growth rate independent of RTT, which keeps the protocol TCP-friendly under both short and long RTT paths. Index Terms—Congestion Control, High-Speed TCP, TCP Friendliness
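
For reference, the cubic window growth function described above has the form given in the CUBIC paper:

$$W(t) \;=\; C\,(t - K)^3 + W_{\max}, \qquad K \;=\; \sqrt[3]{\frac{W_{\max}\,\beta}{C}},$$

where t is the time elapsed since the last loss event, W_max is the window size just before that loss, C is a scaling constant, and β is the multiplicative decrease factor. Because growth depends on elapsed time rather than on ACK arrivals, the growth rate is largely independent of RTT, which underlies the RTT-fairness noted above.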
Article
Full-text available
With the advent of computational grids, networking performance over the wide-area network (WAN) has become a critical component in the grid infrastructure. Unfortunately, many high-performance grid applications only use a small fraction of their available bandwidth because operating systems and their associated protocol stacks are still tuned for yesterday's WAN speeds. As a result, network gurus undertake the tedious process of manually tuning system buffers to allow TCP flow control to scale to today's WAN grid environments. And although recent research has shown how to set the size of these system buffers automatically at connection set-up, the buffer sizes are only appropriate at the beginning of the connection's lifetime. To address these problems, we describe an automated and lightweight technique called Dynamic Right-Sizing that can improve throughput by as much as an order of magnitude while still abiding by TCP semantics.
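
To make the buffer-tuning problem concrete, the sketch below shows the manual calculation that Dynamic Right-Sizing automates: size the TCP socket buffers to at least the bandwidth-delay product of the path. The numbers are illustrative, and on Linux the net.core.rmem_max/wmem_max sysctls must also be raised for buffers this large to take effect.

```python
# Sketch: hand-set TCP socket buffers to the bandwidth-delay product (BDP).
import socket

def bdp_bytes(bandwidth_gbps, rtt_ms):
    # bits/s -> bytes/s, times the round-trip time in seconds
    return int(bandwidth_gbps * 1e9 / 8 * rtt_ms / 1e3)

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
bdp = bdp_bytes(10, 100)                     # 10 Gbps x 100 ms = 125 MB
sock.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, bdp)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, bdp)
```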
Article
Full-text available
We show that the dynamics of transmission control protocol (TCP) may often be chaotic via a quasiperiodic route consisting of more than two independent frequencies, by employing a commonly used ns-2 network simulator. To capture the essence of the additive increase and multiplicative decrease mechanism of TCP congestion control, and to qualitatively describe why and when chaos may occur in TCP dynamics, we develop a 1D discrete map. The relevance of these chaotic transport dynamics to real Internet connections is discussed.
Article
Full-text available
We describe work being performed in the Globus project to develop enabling protocols and services for distributed data-intensive science. These services include: (i) high-performance, secure data transfer protocols based on FTP, plus a range of libraries and tools that use these protocols; and (ii) replica catalog services supporting the creation and location of file replicas in distributed systems. These components leverage the substantial body of "Grid" services and protocols developed within the Globus project and by its collaborators, and are being used in a number of data-intensive application projects. We begin with a discussion of the differences between protocols, APIs, and services and why each is important, and then discuss the features of our data transfer technology, GridFTP, and our replica catalog services. ...
Article
Two types of sampling plans are examined as alternatives to simple random sampling in Monte Carlo studies. These plans are shown to be improvements over simple random sampling with respect to variance for a class of estimators which includes the sample mean and the empirical distribution function.
Conference Paper
Wide-area file transfers are an integral part of several High-Performance Computing (HPC) scenarios. Dedicated network connections with high capacity, low loss rate, and low competing traffic are increasingly being provisioned over current HPC infrastructures to support such transfers. To gain insights into these file transfers, we collected transfer rate measurements for Lustre and XFS file systems between dedicated multi-core servers over emulated 10 Gbps connections with round trip times (RTTs) in the 0-366 ms range. Memory transfer throughput over these connections is measured using iperf, and file IO throughput on host systems is measured using xddprof. We consider two file system configurations: Lustre over an IB network, and XFS over an SSD connected to the PCI bus. Files are transferred using XDD across these connections, and the transfer rates are measured, which indicate the need to jointly optimize the connection and host file IO parameters to achieve peak transfer rates. In particular, these measurements indicate that (i) the peak file transfer rate is lower than the peak connection and host IO throughput, in some cases by as much as 50% or lower, (ii) XDD request sizes that achieve peak throughput for host file IO do not necessarily lead to peak file transfer rates, and (iii) parallelism in host IO and TCP transport does not always improve the file transfer rates.
Conference Paper
The contemporary parallel I/O software stack is complex due to a large number of configurations for tuning I/O performance. Without a proper configuration, I/O becomes a performance bottleneck. As high performance computing (HPC) is moving towards exascale, poor I/O performance has a significant impact on the runtime of large-scale simulations producing massive amounts of data. In this paper, we focus on developing a framework for tuning parallel I/O configurations automatically. This auto-tuning framework first traces high-level I/O accesses and analyzes data write patterns. Based on these patterns and historically available tuning parameters for similar patterns, the framework selects best performing configurations at runtime. If previous history for a pattern is unavailable, the framework initiates model-based training to acquire efficient set of tuning parameters. Our framework includes a runtime system to apply the selected configurations using dynamic linking, without the need for changing application source code. In this paper, we describe this framework and evaluate it using multiple I/O kernels extracted from real applications and demonstrate substantial I/O performance improvement.
Article
TCP congestion control can perform badly in highspeed wide area networks because of its slow response with large congestion windows. The challenge for any alternative protocol is to better utilize networks with high bandwidth-delay products in a simple and robust manner without interacting badly with existing traffic. Scalable TCP is a simple sender-side alteration to the TCP congestion window update algorithm. It offers a robust mechanism to improve performance in highspeed wide area networks using traditional TCP receivers. Scalable TCP is designed to be incrementally deployable and behaves identically to traditional TCP stacks when small windows are sufficient. The performance of the scheme is evaluated through experimental results gathered using a Scalable TCP implementation for the Linux operating system and a gigabit transatlantic network. The preliminary results gathered suggest that the deployment of Scalable TCP would have negligible impact on existing network traffic at the same time as improving bulk transfer performance in highspeed wide area networks.
Article
Contents: Background; Regenerative Systems; Optimization with Finite-Difference and Simultaneous Perturbation Gradient Estimators; Common Random Numbers; Selection Methods for Optimization with Discrete-Valued θ; Concluding Remarks
Conference Paper
The trend in parallel computing toward clusters running thousands of cooperating processes per application has led to an I/O bottleneck that has only gotten more severe as the CPU density of clusters has increased. Current parallel file systems provide large amounts of aggregate I/O bandwidth; however, they do not achieve the high degrees of metadata scalability required to manage files distributed across hundreds or thousands of storage nodes. In this paper we examine the use of collective communication between the storage servers to improve the scalability of file metadata operations. In particular, we apply server-to-server communication to simplify consistency checking and improve the performance of file creation, file removal, and file stat. Our results indicate that collective communication is an effective scheme for simplifying consistency checks and significantly improving the performance for several real metadata intensive workloads.
Chapter
Latin squares designs have an equal number of values for three variables. The designs provide information about the variables with fewer observations than three-variable factorial designs, but confound interactions with main effects. Models for Latin squares are displayed and computational examples described, together with information about replicated Latin squares. Keywords: carryover effects; confounding; fixed effects; period effects; random effects; Williams squares
Article
During the last decade, the data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. A more precise analysis uncovers qualitatively different tradeoffs for the case of small-scale and large-scale learning problems. The large-scale case involves the computational complexity of the underlying optimization algorithm in non-trivial ways. Unlikely optimization algorithms such as stochastic gradient descent show amazing performance for large-scale problems. In particular, second order stochastic gradient and averaged stochastic gradient are asymptotically efficient after a single pass on the training set.
Conference Paper
In wide-area grid computing, geographically distributed computational resources are connected for enabling efficient and large-scale scientific/engineering computations. In the wide-area grid computing, a data transfer protocol called GridFTP has been commonly used for large file transfers. GridFTP has the following features for solving problems of the existing TCP. First, for accelerating the start-up in TCP's slow start phase and achieving high throughput in TCP's congestion avoidance phase, multiple TCP connections can be established in parallel. Second, according to the bandwidth-delay product of a network, the TCP socket buffer size can be negotiated between GridFTP server and client. However, in the literature, sufficient investigation has not been performed either on the optimal number of TCP connections or the optimal TCP socket buffer size. In this paper, we therefore quantitatively investigate the optimal parameter configuration of GridFTP in terms of the number of TCP connections and the TCP socket buffer size. We first derive performance metrics of GridFTP in steady state (i.e., goodput and packet loss probability). We then derive the optimal parameter configuration for GridFTP and quantitatively show performance limitations of GridFTP through several numerical examples. We also demonstrate validity of our approximate analysis by comparing simulation results with analytic ones
Article
Efforts are being made to deliver research data-management capabilities to users as a hosted 'software as a service' (SaaS) to address the challenges of a computational crisis in many laboratories and a growing need for far more powerful data-management tools. SaaS is a software-delivery model in which software is hosted centrally and accessed by users using a thin client over the Internet. SaaS leverages intuitive Web 2.0 interfaces, deep domain knowledge, and economies of scale to deliver capabilities that are easier to use, more capable, or more cost-effective than software accessed through other means, as demonstrated in many business and consumer tools. The opportunity for continuous improvement via dynamic deployment of new features and bug fixes is also significant, along with the potential for expert operators to intervene and troubleshoot on the user's behalf.
Article
With the advent of computational grids, networking performance over the wide-area network (WAN) has become a critical component in the grid infrastructure. Unfortunately, many high-performance grid applications only use a small fraction of their available bandwidth because operating systems and their associated protocol stacks are still tuned for yesterday's network speeds. As a result, network gurus undertake the tedious process of manually tuning system buffers to allow TCP flow control to scale to today's WAN environments. And although recent research has shown how to set the size of these system buffers automatically at connection set-up, the buffer sizes are only appropriate at the beginning of the connection's lifetime. To address these problems, we describe an automated and lightweight technique called Dynamic Right-Sizing that can improve throughput by as much as an order of magnitude while still abiding by TCP semantics. We show the performance of two user-space implementations of DRS: drsFTP and DRS-enabled GridFTP. © 2004 Elsevier B.V. All rights reserved.
Conference Paper
GridFTP has been used as a data transfer protocol to effectively transfer a large volume of data in grid computing. GridFTP supports a feature called parallel data transfer that improves throughput by establishing multiple TCP connections in parallel. However, for achieving high GridFTP throughput, the number of TCP connections should be optimized based on the network status. In this paper, we propose an automatic parallelism tuning mechanism called GridFTP-APT (GridFTP with automatic parallelism tuning) that adjusts the number of parallel TCP connections only using information measurable in the grid middleware. Through simulation experiments, we demonstrate that GridFTP-APT significantly improves the performance of GridFTP in various network environments.
Conference Paper
In this paper, we propose an automatic parameter configuration mechanism for GridFTP, which optimizes the number of parallel TCP connections by utilizing analytic results from the work of Ito et al. (2005). The proposed mechanism first measures the network status (e.g., the goodput and the round-trip time of GridFTP data channels) at the GridFTP client. Based on these measurement results, it adjusts the number of parallel TCP connections to maximize the GridFTP goodput. Three operational modes, MI (multiplicative increase), MI+ (multiplicative increase plus), and AIMD (additive increase and multiplicative decrease), are proposed in this paper, each of which takes a different strategy for adjusting the number of parallel TCP connections. We evaluate the performance of the proposed automatic parameter configuration mechanism through simulation experiments and demonstrate that it significantly improves the performance of GridFTP.
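
A hedged sketch of an AIMD-style controller of this kind is shown below; it is not the paper's exact rule set, and measure_goodput is a hypothetical hook that reports the goodput observed with a given number of parallel connections.

```python
# Sketch: AIMD-style adjustment of the number of parallel TCP connections
# based on measured goodput (illustrative rules only).
def aimd_parallelism(measure_goodput, n=1, n_max=32, rounds=10):
    # measure_goodput(n) -> goodput observed with n parallel connections
    best_n, best_g = n, measure_goodput(n)
    for _ in range(rounds):
        candidate = min(n + 1, n_max)            # additive increase
        g = measure_goodput(candidate)
        if g > best_g:
            n, best_n, best_g = candidate, candidate, g
        else:
            n = max(1, n // 2)                   # multiplicative decrease
            g = measure_goodput(n)
            if g > best_g:
                best_n, best_g = n, g
    return best_n
```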
Article
This comprehensive book offers 504 main pages divided into 17 chapters. In addition, five very useful and clearly written appendices are provided, covering multivariate analysis, basic tests in statistics, probability theory and convergence, random number generators and Markov processes. Some of the topics covered in the book include: stochastic approximation in nonlinear search and optimization; evolutionary computations; reinforcement learning via temporal differences; mathematical model selection; and computer-simulation-based optimizations. Over 250 exercises are provided in the book, though only a small number of them have solutions included in the volume. A separate solution manual is available, as is a very informative webpage. The book may serve as either a reference for researchers and practitioners in many fields or as an excellent graduate level textbook.
Article
In this paper we present a congestion control protocol that is suitable for deployment in high-speed and long-distance networks. The new protocol, H-TCP, is shown to be fair when deployed in homogeneous networks, to be friendly when competing with conventional TCP sources, to rapidly respond to bandwidth as it becomes available, and to utilise link bandwidth in an efficient manner.
Article
TCP congestion control can perform badly in highspeed wide area networks because of its slow response with large congestion windows. The challenge for any alternative protocol is to better utilize networks with high bandwidth-delay products in a simple and robust manner without interacting badly with existing traffic. Scalable TCP is a simple sender-side alteration to the TCP congestion window update algorithm. It offers a robust mechanism to improve performance in highspeed wide area networks using traditional TCP receivers. Scalable TCP is designed to be incrementally deployable and behaves identically to traditional TCP stacks when small windows are sufficient. The performance of the scheme is evaluated through experimental results gathered using a Scalable TCP implementation for the Linux operating system and a gigabit transatlantic network. The results gathered suggest that the deployment of Scalable TCP would have negligible impact on existing network traffic at the same time as improving bulk transfer performance in highspeed wide area networks.
Science DMZ: Data Transfer Nodes.

W. Allcock, I. Foster, S. Tuecke, A. Chervenak, and C. Kesselman. Protocols and services for distributed data-intensive science. In AIP Conference Proceedings, pages 161-163. Institute of Physics, 2000.

R. N. Shorten and D. J. Leith. H-TCP: TCP for high-speed and long-distance networks. In 3rd International Workshop on Protocols for Fast Long-Distance Networks, 2004.