Conference Paper

24/7 Characterization of Petascale I/O Workloads

Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, USA
DOI: 10.1109/CLUSTR.2009.5289150 · Conference: 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER '09)
Source: IEEE Xplore

ABSTRACT

Developing and tuning computational science applications to run on extreme scale systems are increasingly complicated processes. Challenges such as managing memory access and tuning message-passing behavior are made easier by tools designed specifically to aid in these processes. Tools that can help users better understand the behavior of their application with respect to I/O have not yet reached the level of utility necessary to play a central role in application development and tuning. This deficiency in the tool set means that we have a poor understanding of how specific applications interact with storage. Worse, the community has little knowledge of what sorts of access patterns are common in today's applications, leading to confusion in the storage research community as to the pressing needs of the computational science community. This paper describes the Darshan I/O characterization tool. Darshan is designed to capture an accurate picture of application I/O behavior, including properties such as patterns of access within files, with the minimum possible overhead. This characterization can shed important light on the I/O behavior of applications at extreme scale. Darshan also can enable researchers to gain greater insight into the overall patterns of access exhibited by such applications, helping the storage community to understand how to best serve current computational science applications and better predict the needs of future applications. In this work we demonstrate Darshan's ability to characterize the I/O behavior of four scientific applications and show that it induces negligible overhead for I/O intensive jobs with as many as 65,536 processes.
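
For context, below is a minimal sketch (not taken from the paper) of the kind of MPI-IO access pattern Darshan is designed to characterize: each process issues one collective write of a fixed-size contiguous block. The file name, block size, and use of a single collective write are illustrative assumptions. Run with Darshan instrumentation enabled (for example through instrumented compiler wrappers or by preloading the Darshan runtime library), such a program yields a compact per-job log of counters such as operation counts, access sizes, and bytes moved per file, rather than a full trace.

    /* Illustrative MPI-IO program (not from the paper): each rank writes one
     * contiguous 1 MiB block at a rank-based offset using a collective call.
     * A Darshan-style profiler would record the operation counts, access
     * sizes, and bytes moved for "checkpoint.dat" (a hypothetical file name). */
    #include <mpi.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const MPI_Offset block = 1 << 20;              /* 1 MiB per process */
        char *buf = calloc(1, (size_t)block);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "checkpoint.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* One collective write per rank at offset rank * 1 MiB. */
        MPI_File_write_at_all(fh, (MPI_Offset)rank * block, buf,
                              (int)block, MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }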

CITED BY

  • Source: "In practice, for the platform to be self-benchmarking, the profiler needs to be deployed by default for all jobs on the platform, so it must add negligible overhead to each job's execution time. In this paper, we use data from the lightweight profiler Darshan [1], which is enabled in the default environment on supercomputers at Argonne National Laboratory (ANL), National Energy Research Scientific Computing Center (NERSC), and the National Center for Supercomputing Applications (NCSA). I/O experts have used Darshan data for application-specific, system-wide, and cross-platform analysis [2], [3], crafting queries and generating visualizations by hand."
    ABSTRACT: Analyzing the I/O performance of high-performance computing applications can provide valuable insights for application developers, users, and platform administrators. However, the analysis is difficult and requires parallel I/O expertise few users possess. Analyzing an entire platform's I/O workload is even harder, as it requires large-scale collection, cleaning and exploration of data. To address this problem, we created a web-based dashboard for interactive analysis and visualization of application I/O behavior, based on data collected by a lightweight I/O profiler that can observe all jobs on a platform at low cost. The dashboard's target audience includes application users and developers who are starting to analyze their application's I/O performance; system administrators who want to look into the usage of their storage system and find potential candidate applications for improvement; and parallel I/O experts who want to understand the behavior of an application or set of applications. The dashboard leverages relational database technology, a portable graphing library, and lightweight I/O profiling to provide I/O behavior insights previously only available with great effort.
    Full-text · Conference Paper · Nov 2015
  • Source: "That said, there are library calls available that can query various aspects of memory usage, such as the size of the text, data, and BSS segments, together with the maximum amount of memory consumed on the heap. We also use Darshan [3], a resource monitoring tool with two major design points. First, it was explicitly aimed at parallel I/O, since there are no well-accepted tools for doing so." (A minimal sketch of such a memory-usage query follows this entry.)
    ABSTRACT: Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underload machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We develop a model to describe various forms of monitoring and map the available mechanisms in contemporary operating systems to that model. Based on this analysis, we present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks. We conclude by describing our experience in collecting large quantities of monitoring data for complex workflows.
    Full-text · Conference Paper · Jan 2015
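
    As a concrete illustration of the memory-usage queries mentioned in the excerpt above (an assumption about one possible interface, not a description of the cited tools), the POSIX getrusage call can report a process's peak resident set size; on Linux, ru_maxrss is given in kilobytes.

        /* Illustrative only: read this process's peak resident set size via
         * getrusage(); the cited monitoring tools may use other mechanisms. */
        #include <stdio.h>
        #include <sys/resource.h>

        int main(void)
        {
            struct rusage usage;
            if (getrusage(RUSAGE_SELF, &usage) != 0) {
                perror("getrusage");
                return 1;
            }
            /* On Linux, ru_maxrss is the peak resident set size in kilobytes. */
            printf("peak RSS: %ld kB\n", usage.ru_maxrss);
            return 0;
        }
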
  • Source: "Darshan's minimal collection of data (1-2% overhead, depending on the app [10], [23]) allows it to be enabled for all jobs by default. This allows us to observe a platform at workload scale and to identify its jobs and apps that can most benefit from follow-up analyses with I/O tracing and other performance analysis tools."
    ABSTRACT: We examine the I/O behavior of thousands of supercomputing applications "in the wild," by analyzing the Darshan logs of over a million jobs representing a combined total of six years of I/O behavior across three leading high-performance computing platforms. We mined these logs to analyze the I/O behavior of applications across all their runs on a platform; the evolution of an application's I/O behavior across time, and across platforms; and the I/O behavior of a platform's entire workload. Our analysis techniques can help developers and platform owners improve I/O performance and I/O system utilization, by quickly identifying underperforming applications and offering early intervention to save system resources. We summarize our observations regarding how jobs perform I/O and the throughput they attain in practice.
    Full-text · Conference Paper · Jan 2015