Conference Paper

Fault-aware, utility-based job scheduling on Blue Gene/P systems

Dept. of Comput. Sci., Illinois Inst. of Technol., Chicago, IL, USA
DOI: 10.1109/CLUSTR.2009.5289206
Conference: Cluster Computing and Workshops, 2009. CLUSTER '09. IEEE International Conference on
Source: IEEE Xplore

ABSTRACT Job scheduling on large-scale systems is an increasingly complicated affair, with numerous factors influencing scheduling policy. Addressing these concerns results in sophisticated scheduling policies that can be difficult to reason about. In this paper, we present a general utility-based scheduling framework to balance various scheduling requirements and priorities. It enables system owners to customize scheduling policies under different circumstances without changing the scheduling code. We also develop a fault-aware job allocation strategy for Blue Gene/P systems to address the increasing concern of system failures. We demonstrate the effectiveness of these facilities by means of event-driven simulations with real job traces collected from the production Blue Gene/P system at Argonne National Laboratory.
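The idea of customizing policy without changing the scheduling code can be illustrated with a minimal Python sketch. It is not the paper's implementation: the `Job` fields, the utility functions, and `order_queue` are hypothetical names chosen for illustration, and the weights are arbitrary. The sketch only assumes that a policy can be expressed as a function that scores each queued job, with the scheduler serving jobs in descending score order.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Job:
    # Hypothetical job record; field names are assumptions, not from the paper.
    job_id: str
    nodes: int      # requested node count
    wait: float     # hours spent in the queue so far


# A policy is just a function that maps a job to a numeric score.
UtilityFn = Callable[[Job], float]


def wait_time_utility(job: Job) -> float:
    """Favor jobs that have waited longest (FCFS-like behavior)."""
    return job.wait


def large_job_utility(job: Job) -> float:
    """Favor jobs that request the most nodes."""
    return float(job.nodes)


def make_weighted_utility(w_wait: float, w_nodes: float) -> UtilityFn:
    """Blend both goals; the weights are illustrative tuning knobs."""
    def utility(job: Job) -> float:
        return w_wait * job.wait + w_nodes * job.nodes
    return utility


def order_queue(queue: List[Job], utility: UtilityFn) -> List[Job]:
    """Scheduler core: sort by score, highest first. It never changes;
    only the utility function passed in does."""
    return sorted(queue, key=utility, reverse=True)


queue = [
    Job("a", nodes=512, wait=5.0),
    Job("b", nodes=4096, wait=1.0),
    Job("c", nodes=1024, wait=3.0),
]

by_wait = [j.job_id for j in order_queue(queue, wait_time_utility)]
by_size = [j.job_id for j in order_queue(queue, large_job_utility)]
blended = [j.job_id for j in order_queue(queue, make_weighted_utility(1.0, 0.001))]
```

Swapping `wait_time_utility` for `large_job_utility` changes the queue order while `order_queue` stays untouched, which is the kind of separation between policy and scheduling code that the abstract describes. Fault-aware allocation is not covered by this sketch.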

  • ABSTRACT: Job scheduling is a critical and complex task on large-scale supercomputers where a scheduling policy is expected to fulfill amorphous and sometimes conflicting goals from both users and system owners. Moreover, the effectiveness of a scheduling policy is dependent on workload characteristics which vary from time to time. Thus it is challenging to design a versatile scheduling policy that is effective in all circumstances. To address this issue, we propose an adaptive metric-aware job scheduling strategy. First, we propose metric-aware scheduling which enables the scheduler to balance competing scheduling goals represented by different metrics such as job waiting time, fairness, and system utilization. Second, we enhance the scheduler to adaptively adjust scheduling policies based on feedback information of monitored metrics at runtime. We evaluate our design using real workloads from supercomputer centers and demonstrate that our scheduling mechanism can significantly improve system performance in a balanced, sustainable fashion.
    Parallel Processing Workshops (ICPPW), 2012 41st International Conference on; 09/2012
  • ABSTRACT: In this paper, we present a new method for job scheduling and model access requests to web pages in a computer network or the Internet as job scheduling on a single machine. We simulate this model on two kinds of problems: (1) small-scale problems (10 users) and (2) large-scale problems (100 users). The goal in all problems is to minimize the mean and variance of time. Since these problems are NP-hard, we propose an innovative V-shaped arrangement for job scheduling. For small-scale problems the optimal solution can be found in little time, so the optimal solutions (minimum mean and variance) were found by examining all possible states and then evaluated.
    Communication Systems and Network Technologies (CSNT), 2012 International Conference on; 01/2012
  • ABSTRACT: Cloud computing's service-oriented characteristics advance a new way of service provisioning called utility-based computing. However, toward the practical application of the commercialized Cloud, we encounter two challenges: i) there is no well-defined job scheduling algorithm for the Cloud that considers the system state in the future, particularly under overloading circumstances; ii) the existing job scheduling algorithms under the utility computing paradigm do not take hardware/software failure and recovery in the Cloud into account. In an attempt to address these challenges, we introduce the failure and recovery scenario in the Cloud computing entities and propose a Reinforcement Learning (RL) based algorithm to make job scheduling fault-tolerant while maximizing the utilities attained in the long term. We carry out an experimental comparison with the Resource-constrained Utility Accrual algorithm (RUA), the Utility Accrual Packet scheduling algorithm (UPA), and LBESA to demonstrate the feasibility of our proposed approach.
