Conference Paper

24/7 Characterization of petascale I/O workloads

Math. & Comput. Sci. Div., Argonne Nat. Lab., Argonne, IL, USA
DOI: 10.1109/CLUSTR.2009.5289150 Conference: Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Source: IEEE Xplore

ABSTRACT Developing and tuning computational science applications to run on extreme scale systems are increasingly complicated processes. Challenges such as managing memory access and tuning message-passing behavior are made easier by tools designed specifically to aid in these processes. Tools that can help users better understand the behavior of their application with respect to I/O have not yet reached the level of utility necessary to play a central role in application development and tuning. This deficiency in the tool set means that we have a poor understanding of how specific applications interact with storage. Worse, the community has little knowledge of what sorts of access patterns are common in today's applications, leading to confusion in the storage research community as to the pressing needs of the computational science community. This paper describes the Darshan I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with the minimum possible overhead. This characterization can shed important light on the I/O behavior of applications at extreme scale. Darshan also can enable researchers to gain greater insight into the overall patterns of access exhibited by such applications, helping the storage community to understand how to best serve current computational science applications and better predict the needs of future applications. In this work we demonstrate Darshan's ability to characterize the I/O behavior of four scientific applications and show that it induces negligible overhead for I/O intensive jobs with as many as 65,536 processes.
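
As a rough illustration of how access patterns within files can be characterized with little overhead, the Python sketch below keeps a small, fixed set of counters per file instead of logging every operation. It is not Darshan's implementation: the names `FileCounters` and `record_access`, and the definitions of "consecutive" and "sequential" used here, are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FileCounters:
    """Fixed-size summary of accesses to one file (hypothetical structure)."""
    ops: int = 0                     # total number of accesses
    bytes_moved: int = 0             # total bytes read or written
    consecutive: int = 0             # access starts exactly where the previous one ended
    sequential: int = 0              # access starts at or beyond the previous end
    last_end: Optional[int] = None   # end offset of the previous access


def record_access(table: Dict[str, FileCounters], path: str,
                  offset: int, length: int) -> None:
    """Update the counters for `path`; memory use does not grow with the access count."""
    c = table.setdefault(path, FileCounters())
    c.ops += 1
    c.bytes_moved += length
    if c.last_end is not None:
        if offset == c.last_end:
            c.consecutive += 1
        if offset >= c.last_end:
            c.sequential += 1
    c.last_end = offset + length


# Example: four accesses to one file (offsets and lengths are made up).
table: Dict[str, FileCounters] = {}
for offset, length in [(0, 100), (100, 100), (400, 50), (0, 10)]:
    record_access(table, "output.dat", offset, length)

c = table["output.dat"]
print(c.ops, c.bytes_moved, c.consecutive, c.sequential)  # prints: 4 260 1 2
```

Because each file costs a constant amount of memory regardless of how many operations touch it, a summary of this kind stays cheap even for I/O-intensive jobs, at the price of losing the exact ordering of operations.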

  •
    ABSTRACT: The persistent memory-wall problem, combined with the newly emerged big-data problem, makes data access delay a first-class concern in performance optimization for cluster computing. Reducing data access delay, however, is application dependent: it depends on the data access behaviors of the underlying applications. Therefore, learning and understanding data access behaviors is a must for effective data access optimization. Modern microprocessors are equipped with hardware data prefetchers, which predict data access patterns and prefetch data for the CPU. However, memory systems in design do not have the capability to understand data access behaviors for performance optimizations. In this study, we propose a novel approach, named KNOWAC, to collect I/O information automatically through high-level I/O libraries. KNOWAC accumulates I/O knowledge and reveals data usage patterns by exploring the collected high-level I/O characteristics. The discovered data usage patterns can be used for different I/O optimizations. We apply KNOWAC to I/O prefetching under the framework of PnetCDF in this study. Experimental results on a real-world application show that KNOWAC is promising and has true practical value in mitigating the I/O bottleneck.
    Proceedings of the 2012 IEEE International Conference on Cluster Computing; 09/2012
  •
    ABSTRACT: Input/Output (I/O) operations can represent a significant proportion of the run-time of parallel scientific computing applications. Although there have been several advances in file format libraries, file system design and I/O hardware, a growing divergence exists between the performance of parallel file systems and the compute clusters that they support. In this paper, we document the design and application of the RIOT I/O toolkit (RIOT) being developed at the University of Warwick with our industrial partners at the Atomic Weapons Establishment and Sandia National Laboratories. We use the toolkit to assess the performance of three industry-standard I/O benchmarks on three contrasting supercomputers, ranging from a mid-sized commodity cluster to a large-scale proprietary IBM BlueGene/P system. RIOT provides a powerful framework in which to analyse I/O and parallel file system behaviour—we demonstrate, for example, the large file locking overhead of IBM's General Parallel File System, which can consume nearly 30% of the total write time in the FLASH-IO benchmark. Through I/O trace analysis, we also assess the performance of HDF-5 in its default configuration, identifying a bottleneck created by the use of suboptimal Message Passing Interface hints. Furthermore, we investigate the performance gains attributed to the Parallel Log-structured File System (PLFS) being developed by EMC Corporation and the Los Alamos National Laboratory. Our evaluation of PLFS involves two high-performance computing systems with contrasting I/O backplanes and illustrates the varied improvements to I/O that result from the deployment of PLFS (ranging from up to 25× speed-up in I/O performance on a large I/O installation to 2× speed-up on the much smaller installation at the University of Warwick).
    The Computer Journal 02/2013; 56(2):141-155. · 0.89 Impact Factor
  •
    ABSTRACT: The growing speed gap between CPU and memory makes I/O the main bottleneck of many industrial applications. Some applications frequently need to perform I/O operations on very large volumes of data, which seriously harms performance. This work is motivated by geophysical applications used for oil and gas exploration. These applications process terabyte-size datasets in HPC facilities. The datasets represent subsurface models and field-recorded data. In general terms, these applications read huge amounts of data as input and write huge amounts as intermediate and final results, and the underlying algorithms implement seismic imaging techniques. Traditional sequential I/O, even when coupled with advanced storage systems, cannot complete all I/O operations for such large volumes of data in an acceptable time. Parallel I/O is the general strategy for solving such problems. However, because of the dynamic nature of many of these applications, each parallel process does not know the size of the data it needs to write until its computation is done, and it also cannot identify the position in the file at which to write. In order to write correctly and efficiently, communication and synchronization are required among all processes to fully exploit the parallel I/O paradigm. To tackle these issues, we use a dynamic load balancing framework that is general enough for most of these applications. To reduce the expensive synchronization and communication overhead, we introduce an I/O node that only handles I/O requests and lets compute nodes perform I/O operations in parallel. Using both POSIX I/O and memory-mapping interfaces, our experiments indicate that the approach is scalable. For instance, with 16 processes, the bandwidth of parallel reading can reach the theoretical peak performance (2.5 GB/s) of the storage infrastructure. Also, parallel writing can achieve speedups of up to 4.68x (POSIX I/O) and 7.23x (memory-mapping) over the serial I/O implementation. Since most geophysical applications are I/O bound, these results positively impact the overall performance of the application and confirm the chosen strategy as the path to follow.
    Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), 2013 IEEE 27th International; 01/2013
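
The file-offset problem mentioned in the last abstract above can be illustrated with an exclusive prefix sum: once every process knows its own output size, the starting offset of process i is the total size of the output of processes 0 to i-1. The Python sketch below computes this over a plain list; in an MPI program a collective such as `MPI_Exscan` is commonly used for the same purpose. It only shows why the processes must communicate, and it is not the I/O-node approach the authors propose. The function name `write_offsets` and the sample sizes are hypothetical.

```python
from typing import List


def write_offsets(sizes: List[int]) -> List[int]:
    """Return the starting file offset for each process.

    `sizes[i]` is the number of bytes process i will write. The result is an
    exclusive prefix sum, so offsets[i] == sum(sizes[:i]).
    """
    offsets = []
    total = 0
    for size in sizes:
        offsets.append(total)  # process starts where all earlier output ends
        total += size
    return offsets


# Example: four processes with made-up output sizes, one of them empty.
sizes = [300, 0, 120, 80]
print(write_offsets(sizes))  # prints: [0, 300, 300, 420]
```

Every offset depends on the sizes of all lower-ranked processes, which is why no process can start writing to a shared file until those sizes have been exchanged.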

