Conference Paper

Open Science on Trinity's Knights Landing Partition: An Analysis of User Job Data


Abstract

High-performance computing (HPC) systems are critically important to the objectives of universities, national laboratories, and commercial companies. Because of the cost of deploying and maintaining these systems, ensuring their efficient use is imperative. Job scheduling and resource management are central to that efficiency, and significant research has therefore been conducted on how to effectively schedule user jobs on HPC systems. Developing and evaluating job scheduling algorithms, however, requires a detailed understanding of how users request resources on HPC systems. In this paper, we examine a corpus of job data collected on Trinity, a leadership-class supercomputer. During the stabilization period of its Intel Xeon Phi (Knights Landing) partition, that partition was made available to users outside of a classified environment for the Trinity Open Science Phase 2 campaign. We collected information from the resource manager about each user job run during this Open Science period and examine the jobs contained in the resulting dataset. Our analysis reveals several important characteristics of the jobs submitted during the Open Science period and provides critical insight into the use of one of the most powerful supercomputers in existence. Specifically, these data provide important guidance for the design, development, and evaluation of job scheduling and resource management algorithms.
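For readers who want to perform a similar first-pass characterization of a job trace, the sketch below (written for this summary, not taken from the paper; column names such as nodes_requested and runtime_seconds are hypothetical) shows how per-job node counts and runtimes exported from a resource manager might be summarized.

```python
# Illustrative sketch: summarize per-job records exported from a resource
# manager log. The CSV column names are hypothetical placeholders.
import csv
from statistics import mean, median

def summarize_jobs(path):
    """Compute simple job-size and runtime statistics from a CSV job log."""
    nodes, runtimes = [], []
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            nodes.append(int(row["nodes_requested"]))       # hypothetical column
            runtimes.append(float(row["runtime_seconds"]))  # hypothetical column
    return {
        "job_count": len(nodes),
        "median_nodes": median(nodes),
        "max_nodes": max(nodes),
        "mean_runtime_s": mean(runtimes),
    }
```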


Article
Full-text available
Blue Waters is a Petascale-level supercomputer whose mission is to enable the national scientific and research community to solve "grand challenge" problems that are orders of magnitude more complex than can be carried out on other high performance computing systems. Given the important and unique role that Blue Waters plays in the U.S. research portfolio, it is important to have a detailed understanding of its workload in order to guide performance optimization both at the software and system configuration level as well as to inform architectural balance tradeoffs. Furthermore, understanding the computing requirements of the Blue Waters workload (memory access, IO, communication, etc.), which comprises some of the most computationally demanding scientific problems, will help drive changes in future computing architectures, especially at the leading edge. With this objective in mind, the project team carried out a detailed workload analysis of Blue Waters.
Conference Paper
Full-text available
High Performance Computing applications and platforms have typically been designed without regard to power consumption. With increased awareness of energy cost, power management is now an issue even for compute-intensive server clusters. In this work, we investigate the use of power management techniques for high performance applications on modern power-efficient servers with virtualization support. We consider power management techniques such as dynamic consolidation and usage of the dynamic power range enabled by low power states on servers. We identify application performance isolation and virtualization overhead with multiple virtual machines as the key bottlenecks for server consolidation. We perform a comprehensive experimental study to identify the scenarios where applications are isolated from each other. We also establish that the power consumed by HPC applications may be application dependent, non-linear, and have a large dynamic range. We show that for HPC applications, the working set size is a key parameter to consider when placing applications on virtualized servers. We use the insights obtained from our experimental study to present a framework and methodology for power-aware application placement for HPC applications.
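As a rough illustration of placement driven by working set size (a simplified greedy bin-packing sketch written for this summary, not the authors' framework; server memory stands in for the capacity constraint):

```python
# Minimal sketch, assuming a simplified model: consolidate applications onto
# as few servers as possible while the sum of working-set sizes fits each
# server's memory. Names and numbers are illustrative only.
def place_by_working_set(apps, server_mem_gb):
    """apps: list of (name, working_set_gb); returns one list of app names per server."""
    servers = []  # each entry: [free_mem_gb, [app names]]
    for name, ws in sorted(apps, key=lambda a: a[1], reverse=True):
        for s in servers:
            if s[0] >= ws:            # first server with enough free memory
                s[0] -= ws
                s[1].append(name)
                break
        else:
            servers.append([server_mem_gb - ws, [name]])  # power on a new server
    return [names for _, names in servers]

# Example: consolidate four applications onto 64 GB servers.
print(place_by_working_set([("cfd", 48), ("md", 20), ("qcd", 30), ("io", 8)], 64))
```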
Article
Full-text available
Scheduling jobs on the IBM SP2 system and many other distributed-memory MPPs is usually done by giving each job a partition of the machine for its exclusive use. Allocating such partitions in the order in which the jobs arrive (FCFS scheduling) is fair and predictable, but suffers from severe fragmentation, leading to low utilization. This situation led to the development of the EASY scheduler, which uses aggressive backfilling: small jobs are moved ahead to fill in holes in the schedule, provided they do not delay the first job in the queue. We compare this approach with a more conservative approach in which small jobs move ahead only if they do not delay any job in the queue and show that the relative performance of the two schemes depends on the workload. For workloads typical on SP2 systems, the aggressive approach is indeed better, but, for other workloads, both algorithms are similar. In addition, we study the sensitivity of backfilling to the accuracy of the runtime estimates provided by the users and find a surprising result: backfilling actually works better when users overestimate the runtime by a substantial factor.
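The following sketch illustrates the aggressive (EASY-style) backfilling rule described above, under simplifying assumptions: a single node count per job, user-supplied runtime estimates, and a plain Python job list. It illustrates the rule, not the EASY scheduler's actual implementation.

```python
# Minimal sketch of EASY (aggressive) backfilling under simplifying assumptions.
def easy_backfill(queue, free_nodes, running, now):
    """
    queue:   waiting jobs, dicts {'id', 'nodes', 'estimate'}, in arrival order
    running: running jobs, dicts {'nodes', 'end'}
    Returns the list of jobs chosen to start at time `now`.
    """
    started = []
    # 1. Start jobs from the head of the queue while they fit.
    while queue and queue[0]["nodes"] <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job["nodes"]
        started.append(job)
        running = running + [{"nodes": job["nodes"], "end": now + job["estimate"]}]
    if not queue:
        return started
    # 2. Reserve the earliest time at which the blocked head job could start.
    head = queue[0]
    avail, shadow_time = free_nodes, None
    for j in sorted(running, key=lambda j: j["end"]):
        avail += j["nodes"]
        if avail >= head["nodes"]:
            shadow_time = j["end"]
            break
    if shadow_time is None:
        return started                     # head job can never fit in this sketch
    extra = avail - head["nodes"]          # nodes to spare at the shadow time
    # 3. Backfill: a later job may start now if it fits and either finishes
    #    before the reservation or leaves the reserved nodes untouched.
    for job in list(queue[1:]):
        ends_before = now + job["estimate"] <= shadow_time
        usable = free_nodes if ends_before else min(free_nodes, extra)
        if job["nodes"] <= usable:
            queue.remove(job)
            free_nodes -= job["nodes"]
            if not ends_before:
                extra -= job["nodes"]
            started.append(job)
    return started
```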
Chapter
EASY-Backfilling is a popular scheduling heuristic for allocating jobs in large scale High Performance Computing platforms. While its aggressive reservation mechanism is fast and prevents job starvation, it does not try to optimize any scheduling objective per se. We consider in this work the problem of tuning EASY using queue reordering policies. More precisely, we propose to tune the reordering using a simulation-based methodology. For a given system, we choose the policy in order to minimize the average waiting time. This methodology departs from the First-Come, First-Served rule and introduces a risk on the maximum values of the waiting time, which we control using a queue thresholding mechanism. This new approach is evaluated through a comprehensive experimental campaign on five production logs. In particular, we show that the behavior of the systems under study is stable enough to learn a heuristic that generalizes in a train/test fashion. Indeed, the average waiting time can be reduced consistently (between 11% and 42% for the logs used) compared to EASY, with almost no increase in maximum waiting times. This work departs from previous learning-based approaches and shows that scheduling heuristics for HPC can be learned directly in a policy space.
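A minimal sketch of what a queue reordering policy with wait-time thresholding might look like; the heuristic shown (smallest estimated area first) is an assumed example, not the policy selected in the chapter.

```python
# Illustrative sketch: reorder the waiting queue by a simple heuristic before
# running EASY-backfilling, while a wait-time threshold keeps long-waiting
# jobs at the front to bound the risk on maximum waiting time.
def reorder_queue(queue, now, wait_threshold):
    """
    queue: list of dicts {'id', 'nodes', 'estimate', 'submit'}
    Jobs waiting longer than wait_threshold keep FCFS order at the head;
    the rest are ordered by estimated area (nodes * runtime estimate).
    """
    overdue = [j for j in queue if now - j["submit"] > wait_threshold]
    others = [j for j in queue if now - j["submit"] <= wait_threshold]
    overdue.sort(key=lambda j: j["submit"])                # FCFS for protected jobs
    others.sort(key=lambda j: j["nodes"] * j["estimate"])  # smallest area first
    return overdue + others
```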
Chapter
Computing resources in data centers are usually managed by a Resource and Job Management System whose main objective is to complete submitted jobs as soon as possible while maximizing resource usage and ensuring fairness among users. However, some users might not be as hurried as the job scheduler but only interested in having their jobs complete before a given deadline. In this paper, we derive from this initial hypothesis a low-complexity scheduling algorithm, called Deadline-Based Backfilling (DBF), that distinguishes regular jobs that have to complete as early as possible from deadline-driven jobs that come with a deadline by which they have to finish. We also investigate a scenario in which deadline-driven jobs are submitted and evaluate the impact of the proposed algorithm on classical performance metrics with regard to state-of-the-art scheduling algorithms. Experiments conducted on four different workloads show that the proposed algorithm significantly reduces the average wait time and average stretch when compared to Conservative Backfilling.
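A simplified sketch of the idea of separating regular from deadline-driven jobs (written for this summary and not the published DBF algorithm): deadline-driven jobs are deferred while their slack allows and promoted once they must start to meet their deadline.

```python
# Minimal sketch under assumptions: order the queue so that deadline-driven
# jobs whose slack has run out go first, regular jobs follow in FCFS order,
# and deadline-driven jobs with remaining slack are deferred to the back.
def order_with_deadlines(queue, now):
    """queue entries: {'id', 'estimate', 'submit', 'deadline' (None if regular)}."""
    def latest_start(job):
        return job["deadline"] - job["estimate"]

    urgent, regular, deferrable = [], [], []
    for job in queue:
        if job["deadline"] is None:
            regular.append(job)
        elif latest_start(job) <= now:
            urgent.append(job)              # must start now to meet its deadline
        else:
            deferrable.append(job)
    urgent.sort(key=latest_start)
    regular.sort(key=lambda j: j["submit"])  # FCFS among regular jobs
    deferrable.sort(key=latest_start)
    return urgent + regular + deferrable
```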
Article
Increasing scale and the need for rapid response to changing requirements are hard to meet with current monolithic cluster scheduler architectures. This restricts the rate at which new features can be deployed, decreases efficiency and utilization, and will eventually limit cluster growth. We present a novel approach to address these needs using parallelism, shared state, and lock-free optimistic concurrency control. We compare this approach to existing cluster scheduler designs, evaluate how much interference between schedulers occurs and how much it matters in practice, present some techniques to alleviate it, and finally discuss a use case highlighting the advantages of our approach -- all driven by real-life Google production workloads.
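The sketch below illustrates the shared-state, optimistic-concurrency idea in miniature. It is an assumed simplification: a single lock emulates the atomic, version-checked commit that the article implements lock-free, and machines are reduced to free-core counts.

```python
# Illustrative sketch: each scheduler works against a snapshot of the shared
# cluster state and commits its placement with an optimistic, version-checked
# transaction; a conflict means another scheduler won the race, so it retries.
import threading

class SharedCellState:
    def __init__(self, machines):
        self._lock = threading.Lock()   # stands in for a lock-free atomic commit
        self.version = 0
        self.free = dict(machines)      # machine -> free cores

    def snapshot(self):
        with self._lock:
            return self.version, dict(self.free)

    def try_commit(self, seen_version, machine, cores):
        """Atomically claim cores if nobody committed since our snapshot."""
        with self._lock:
            if self.version != seen_version or self.free.get(machine, 0) < cores:
                return False            # conflict: caller must retry
            self.free[machine] -= cores
            self.version += 1
            return True

def schedule(cell, cores_needed, retries=10):
    """One scheduler's placement loop: snapshot, decide, optimistically commit."""
    for _ in range(retries):
        version, free = cell.snapshot()
        candidates = [m for m, c in free.items() if c >= cores_needed]
        if not candidates:
            return None
        if cell.try_commit(version, candidates[0], cores_needed):
            return candidates[0]
    return None
```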
Article
Science is based upon observation. The scientific study of complex computer systems should therefore be based on observation of how they are used in practice, as opposed to how they are assumed to be used or how they were designed to be used. In particular, detailed workload logs from real computer systems are invaluable for research on performance evaluation and for designing new systems. Regrettably, workload data may suffer from quality issues that might distort the study results, just as scientific observations in other fields may suffer from measurement errors. The cumulative experience with the Parallel Workloads Archive, a repository of job-level usage data from large-scale parallel supercomputers, clusters, and grids, has exposed many such issues. Importantly, these issues were not anticipated when the data was collected, and uncovering them was not trivial. As the data in this archive is used in hundreds of studies, it is necessary to describe and debate procedures that may be used to improve its data quality. Specifically, we consider issues like missing data, inconsistent data, erroneous data, system configuration changes during the logging period, and unrepresentative user behavior. Some of these may be countered by filtering out the problematic data items. In other cases, being cognizant of the problems may affect the decision of which datasets to use. While grounded in the specific domain of parallel jobs, our findings and suggested procedures can also inform similar situations in other domains.
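A minimal sketch of the kind of filtering discussed above, assuming job records as dictionaries with hypothetical field names; cleaning real archive logs involves far more domain judgment than these three checks.

```python
# Illustrative sketch: drop job records with missing or inconsistent data and
# optionally exclude users flagged as unrepresentative.
def clean_workload(jobs, excluded_users=frozenset()):
    cleaned = []
    for job in jobs:
        if job.get("runtime") is None or job["runtime"] < 0:
            continue                                  # missing or erroneous runtime
        if job.get("end_time", 0) < job.get("start_time", 0):
            continue                                  # inconsistent timestamps
        if job.get("user") in excluded_users:
            continue                                  # unrepresentative user behavior
        cleaned.append(job)
    return cleaned
```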
Conference Paper
To better understand the challenges in developing effective cloud-based resource schedulers, we analyze the first publicly available trace data from a sizable multi-purpose cluster. The most notable workload characteristic is heterogeneity: in resource types (e.g., cores:RAM per machine) and their usage (e.g., duration and resources needed). Such heterogeneity reduces the effectiveness of traditional slot- and core-based scheduling. Furthermore, some tasks are constrained as to the kind of machine types they can use, increasing the complexity of resource assignment and complicating task migration. The workload is also highly dynamic, varying over time and most workload features, and is driven by many short jobs that demand quick scheduling decisions. While few simplifying assumptions apply, we find that many longer-running jobs have relatively stable resource utilizations, which can help adaptive resource schedulers.
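As a small illustration of how such heterogeneity can be quantified (field names and metrics are assumed for this summary, not the trace's actual schema):

```python
# Illustrative sketch: tabulate machine configurations (cores:RAM) and the
# spread of job durations to expose the heterogeneity described above.
from collections import Counter

def machine_heterogeneity(machines):
    """machines: list of (cores, ram_gb) tuples; returns configuration frequencies."""
    return Counter((cores, ram) for cores, ram in machines)

def duration_spread(durations_s):
    """Rough percentile spread of job durations (assumes a non-empty list)."""
    ordered = sorted(durations_s)
    pick = lambda q: ordered[int(q * (len(ordered) - 1))]
    return {"p10": pick(0.10), "median": pick(0.50), "p90": pick(0.90)}
```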
Article
The k-means algorithm and its variations are known to be fast clustering algorithms. However, they are sensitive to the choice of starting points and inefficient for solving clustering problems in large data sets. Recently, a new version of the k-means algorithm, the global k-means algorithm, has been developed. It is an incremental algorithm that dynamically adds one cluster center at a time and uses each data point as a candidate for the k-th cluster center. Results of numerical experiments show that the global k-means algorithm considerably outperforms the k-means algorithm. In this paper, a new version of the global k-means algorithm is proposed. A starting point for the k-th cluster center in this algorithm is computed by minimizing an auxiliary cluster function. Results of numerical experiments on 14 data sets demonstrate the superiority of the new algorithm; however, it requires more computational time than the global k-means algorithm.
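For concreteness, here is a compact sketch of the baseline incremental global k-means scheme described above, using scikit-learn's KMeans for the inner refinement; it is not the paper's modified variant with the auxiliary cluster function, and it is deliberately unoptimized (every data point is tried as the new center).

```python
# Sketch of the baseline global k-means idea: grow from 1 to k_max clusters,
# trying each data point as the candidate new center and keeping the best fit.
import numpy as np
from sklearn.cluster import KMeans

def global_kmeans(X, k_max):
    centers = X.mean(axis=0, keepdims=True)          # the 1-cluster solution
    for k in range(2, k_max + 1):
        best = None
        for candidate in X:                          # try every point as the k-th center
            init = np.vstack([centers, candidate])
            km = KMeans(n_clusters=k, init=init, n_init=1).fit(X)
            if best is None or km.inertia_ < best.inertia_:
                best = km
        centers = best.cluster_centers_
    return centers
```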
Conference Paper
Scheduling jobs on the IBM SP2 system is usually done by giving each job a partition of the machine for its exclusive use. Allocating such partitions in the order that the jobs arrive (FCFS scheduling) is fair and predictable, but suffers from severe fragmentation, leading to low utilization. An alternative is to use the EASY scheduler, which uses aggressive backfilling: small jobs are moved ahead to fill in holes in the schedule, provided they do not delay the first job in the queue. The authors show that a more conservative approach, in which small jobs move ahead only if they do not delay any job in the queue, produces essentially the same benefits in terms of utilization. The conservative scheme has the added advantage that queueing times can be predicted in advance, whereas in EASY the queueing time is unbounded.
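The sketch below captures the rule that distinguishes the conservative scheme from EASY, under strong simplifications (a single free-node pool and precomputed reservations): a candidate may be moved ahead only if it leaves every queued job's reserved start intact, not just the first one's.

```python
# Minimal sketch of the conservative backfilling admission rule.
def can_backfill_conservative(candidate, free_nodes, reservations, now):
    """
    candidate:    {'nodes', 'estimate'}
    reservations: list of {'start', 'free_at_start'} for every queued job,
                  where 'free_at_start' is the node count still free at that
                  reserved start time if the candidate does NOT run.
    The candidate may start now only if it fits and leaves every existing
    reservation intact (EASY checks only the first reservation).
    """
    if candidate["nodes"] > free_nodes:
        return False
    end = now + candidate["estimate"]
    for r in reservations:
        still_running = end > r["start"]
        if still_running and candidate["nodes"] > r["free_at_start"]:
            return False        # would delay this job's reserved start
    return True
```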
Analysis and lessons from a publicly available Google cluster trace
Yanpei Chen, Archana Sulochana Ganapathi, Rean Griffith, and Randy H. Katz