Conference Paper

Fault-aware, utility-based job scheduling on Blue Gene/P systems

Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
DOI: 10.1109/CLUSTR.2009.5289206 Conference: 2009 IEEE International Conference on Cluster Computing and Workshops (CLUSTER '09)
Source: IEEE Xplore


Job scheduling on large-scale systems is an increasingly complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance various scheduling requirements and priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces collected from the production Blue Gene/P system at Argonne National Laboratory.



Available from: Wei Tang
  • Source
    • "Backfilling [17], for example, needs to know the expected runtime of both running and waiting jobs so that it can fill short jobs into backfilling windows, reducing fragmentation without delaying high-priority jobs. Some schedulers favor short jobs in order to achieve improved average response time [23]; they need to know the runtime estimates of the waiting jobs when sorting the queue. Moreover, job runtime estimates are essential to other resource management strategies, such as advance reservation [13], queuing time prediction [9][21], and walltime-aware job allocation reducing fragmentation on torus-connected systems [25]. "
    ABSTRACT: The estimate of a parallel job’s running time (walltime) is an important attribute used by resource managers and job schedulers in various scenarios, such as backfilling and short-job-first scheduling. This value is provided by the user, however, and has been repeatedly shown to be inaccurate. We studied the workload characteristics based on a large amount of historical data (over 275,000 jobs in two and a half years) from a production leadership-class computer. Based on that study, we proposed a set of walltime adjustment schemes producing more accurate estimates. To ensure the utility of these schemes on production systems, we analyzed their potential impact in scheduling and evaluated the schemes with an event-driven simulator. Our experimental results show that our method can achieve not only better overall estimation accuracy but also improved overall system performance. Specifically, the average estimation accuracy of the tested workload can be improved by up to 35%, and the system performance in terms of average waiting time and weighted average waiting time can be improved by up to 22% and 28%, respectively.
    Journal of Parallel and Distributed Computing 03/2013; 73(7). DOI:10.1016/j.jpdc.2013.02.006 · 1.18 Impact Factor
  • Source
    • "FCFS is the most widely used job scheduling policy [11]. WFP, a variant of the shortest-job-first policy that also favors old and large jobs [24], is the scheduling policy used in production on Intrepid. Specifically, in WFP the queuing priority of a job is calculated as (t_queue / t_req)^3 × n, where t_queue, t_req, and n denote the job's waiting time, user-requested runtime, and number of nodes, respectively. "
    ABSTRACT: Torus-based networks are prevalent on leadership- class petascale systems, providing a good balance between network cost and performance. The major disadvantage of this network architecture is its susceptibility to fragmentation. Many studies have attempted to reduce resource fragmentation in this architecture. Although the approaches suggested can make good allocation decisions reducing fragmentation at job start time, none of them considers a job's walltime, which can cause resource fragmentation when neighboring jobs do not complete closely. In this paper, we propose a walltime- aware job allocation strategy, which adjacently packs jobs that finish around the same time, in order to minimize resource fragmentation caused by job length discrepancy. Event-driven simulations using real job traces from a production Blue Gene/P system at Argonne National Laboratory demonstrate that our walltime-aware strategy can effectively reduce system fragmentation and improve overall system performance.
    25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2011, Anchorage, Alaska, USA, 16-20 May, 2011 - Conference Proceedings; 05/2011
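The WFP priority formula quoted above can be illustrated with a short sketch. The job values and field names here are illustrative, not taken from the paper:

```python
# Sketch of the WFP queuing priority described in the excerpt:
# priority = (t_queue / t_req)^3 * n, which favors jobs that have
# waited long relative to their requested runtime, and large jobs.

def wfp_priority(t_queue: float, t_req: float, n: int) -> float:
    """Queuing priority: (waiting time / requested runtime)^3 * node count."""
    return (t_queue / t_req) ** 3 * n

# Hypothetical jobs: A is large but recently queued; B has waited
# as long as its requested runtime.
jobs = [
    {"name": "A", "t_queue": 600.0, "t_req": 3600.0, "n": 512},
    {"name": "B", "t_queue": 1800.0, "t_req": 1800.0, "n": 64},
]

# Sort the queue by descending priority, as a WFP scheduler would.
queue = sorted(jobs,
               key=lambda j: wfp_priority(j["t_queue"], j["t_req"], j["n"]),
               reverse=True)
```

With these numbers, job B (priority 1^3 × 64 = 64) outranks job A (priority (1/6)^3 × 512 ≈ 2.37) despite A's larger size, showing how the cubed waiting-time ratio dominates.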
  • Source
    • "At each scheduling point, for instance, every 10 seconds, the scheduler sorts the jobs by their utility score and picks the top-score job to run. The partition that blocks the fewest other partitions is selected to allocate the scheduled job [26]. If no suitable partition can run the job, backfilling may occur. "
    ABSTRACT: Backfilling and short-job-first are widely acknowledged enhancements to the simple but popular first-come, first-served job scheduling policy. However, both enhancements depend on user-provided estimates of job runtime, which research has repeatedly shown to be inaccurate. We have investigated the effects of this inaccuracy on backfilling and different queue prioritization policies, determining which part of the scheduling policy is most sensitive. Using these results, we have designed and implemented several estimation-adjusting schemes based on historical data. We have evaluated these schemes using workload traces from the Blue Gene/P system at Argonne National Laboratory. Our experimental results demonstrate that dynamically adjusting job runtime estimates can improve job scheduling performance by up to 20%.
    24th IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010, Atlanta, Georgia, USA, 19-23 April 2010 - Conference Proceedings; 04/2010
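The utility-driven scheduling step quoted above (score jobs, pick the top job, choose the least-blocking partition, otherwise fall back to backfilling) can be sketched as follows. The utility scores, partition model, and helper names are hypothetical stand-ins, not the paper's actual implementation:

```python
# Minimal sketch of one utility-based scheduling step. The scheduler
# picks the top-utility job and, among partitions that can run it,
# the partition that blocks the fewest other partitions.

def schedule_step(jobs, partitions, utility, fits):
    """Return (top job, chosen partition) or (top job, None) when no
    partition is suitable and backfilling would be attempted instead."""
    top = max(jobs, key=utility)
    candidates = [p for p in partitions if fits(top, p)]
    if not candidates:
        return top, None  # backfilling may occur here
    return top, min(candidates, key=lambda p: p["blocks"])

# Hypothetical example data.
jobs = [
    {"name": "J1", "score": 5.0, "nodes": 64},
    {"name": "J2", "score": 9.0, "nodes": 512},
]
partitions = [
    {"name": "P512", "size": 512, "blocks": 3},
    {"name": "P1024", "size": 1024, "blocks": 1},
]

job, part = schedule_step(jobs, partitions,
                          utility=lambda j: j["score"],
                          fits=lambda j, p: p["size"] >= j["nodes"])
```

Here J2 wins on utility score, and P1024 is chosen because allocating it blocks only one other partition, versus three for P512.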