Conference Paper

Fault-aware, utility-based job scheduling on Blue Gene/P systems

Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
DOI: 10.1109/CLUSTR.2009.5289206
Conference: Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Source: IEEE Xplore

ABSTRACT Job scheduling on large-scale systems is an increasingly complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance various scheduling requirements and priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces collected from the production Blue Gene/P system at Argonne National Laboratory.
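
The idea of changing scheduling policy without changing scheduler code can be illustrated with a minimal sketch. This is not the paper's framework: the abstract does not give its utility functions, so the `Job` fields, the three example policies, and `order_queue` are all hypothetical names chosen for illustration. The sketch assumes only that a policy can be expressed as a function that scores each queued job, with higher scores scheduled first.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    # Hypothetical job attributes; not taken from the paper.
    name: str
    wait_hours: float      # time spent in the queue so far
    walltime_hours: float  # user-requested runtime
    nodes: int             # requested node count


# A policy is a function mapping a job to a numeric utility score.
Policy = Callable[[Job], float]


def fcfs(job: Job) -> float:
    # First-come, first-served: the longest-waiting job scores highest.
    return job.wait_hours


def short_job_first(job: Job) -> float:
    # Shorter requested walltime gives a higher score.
    return -job.walltime_hours


def favor_large(job: Job) -> float:
    # Illustrative mix: waiting time weighted by job size.
    return job.wait_hours * job.nodes


def order_queue(queue: List[Job], policy: Policy) -> List[Job]:
    # The scheduler core never changes; only the policy passed in does.
    return sorted(queue, key=policy, reverse=True)


queue = [
    Job("a", wait_hours=5.0, walltime_hours=12.0, nodes=512),
    Job("b", wait_hours=2.0, walltime_hours=1.0, nodes=64),
    Job("c", wait_hours=1.0, walltime_hours=6.0, nodes=4096),
]

for policy in (fcfs, short_job_first, favor_large):
    print(policy.__name__, [job.name for job in order_queue(queue, policy)])
```

Swapping the function passed to `order_queue` reorders the same queue, which is the sense in which a policy could be customized without touching the scheduling code.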

  • ABSTRACT: The estimate of a parallel job’s running time (walltime) is an important attribute used by resource managers and job schedulers in various scenarios, such as backfilling and short-job-first scheduling. This value is provided by the user, however, and has been repeatedly shown to be inaccurate. We studied the workload characteristic based on a large amount of historical data (over 275,000 jobs in two and a half years) from a production leadership-class computer. Based on that study, we proposed a set of walltime adjustment schemes producing more accurate estimates. To ensure the utility of these schemes on production systems, we analyzed their potential impact in scheduling and evaluated the schemes with an event-driven simulator. Our experimental results show that our method can achieve not only better overall estimation accuracy but also improved overall system performance. Specifically, the average estimation accuracy of the tested workload can be improved by up to 35%, and the system performance in terms of average waiting time and weighted average waiting time can be improved by up to 22% and 28%, respectively.
    Journal of Parallel and Distributed Computing 03/2013; · 1.12 Impact Factor
  • ABSTRACT: Job scheduling is a critical and complex task on large-scale supercomputers where a scheduling policy is expected to fulfill amorphous and sometimes conflicting goals from both users and system owners. Moreover, the effectiveness of a scheduling policy is dependent on workload characteristics which vary from time to time. Thus it is challenging to design a versatile scheduling policy that is effective in all circumstances. To address this issue, we propose an adaptive metric-aware job scheduling strategy. First, we propose metric-aware scheduling which enables the scheduler to balance competing scheduling goals represented by different metrics such as job waiting time, fairness, and system utilization. Second, we enhance the scheduler to adaptively adjust scheduling policies based on feedback information of monitored metrics at runtime. We evaluate our design using real workloads from supercomputer centers and demonstrate that our scheduling mechanism can significantly improve system performance in a balanced, sustainable fashion.
    Parallel Processing Workshops (ICPPW), 2012 41st International Conference on; 09/2012
  • ABSTRACT: In this paper, we presented a new method for job scheduling and modeled the access requests to web pages in computer network or internet with job scheduling in single machine. We simulated this model in two kinds of problems: 1. Small scaled problems (10 users). 2. Large scaled problems (100 users). The purpose of all problems is to find the minimum amount of mean and variance time. Since these problems are NP-hard, we proposed one type of innovative V shaped arrangement for job scheduling. It's possible to find the optimal response for small scaled problems with little spent time, so by examining all possible states the optimal responses (minimum mean and variance) were found and evaluated.
    Communication Systems and Network Technologies (CSNT), 2012 International Conference on; 01/2012
