Improving Job Scheduling on Production Supercomputers.
ABSTRACT: In parallel computer systems with many processors, external fragmentation is caused by the continuous allocation and deallocation of processors to tasks that require exclusive use of several contiguous processors. Under this condition, the system may be unable to find contiguous processors for an incoming task even when enough processors are free. Relocation alleviates this problem by reassigning running tasks to other processors. In this paper, we examine two relocation schemes for two-dimensional meshes: full relocation and partial relocation. The full relocation scheme is desirable when the system is highly fragmented, while the partial relocation scheme is used to minimize the number of relocated tasks. For the relocation process, we formally define and use two basic submesh movement operations: shifting and rotating. Comprehensive computer simulation reveals that the proposed schemes are beneficial when the relocation overhead, which is machine dependent, is not high.
Journal of Parallel and Distributed Computing. 01/2000;
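The external fragmentation described in this abstract can be illustrated with a small sketch. It is not the paper's algorithm: the mesh layouts, the function names `find_free_submesh` and `count_free`, and the "compacted" layout standing in for the outcome of some relocation are all assumptions made for illustration. The shifting and rotating operations themselves are not implemented here.

```python
from typing import List, Optional, Tuple

Mesh = List[List[bool]]  # True = processor busy, False = processor free


def count_free(busy: Mesh) -> int:
    """Count the free processors in the mesh."""
    return sum(not cell for row in busy for cell in row)


def find_free_submesh(busy: Mesh, height: int, width: int) -> Optional[Tuple[int, int]]:
    """Return the top-left (row, col) of the first fully free
    height x width submesh, or None if no such submesh exists."""
    rows, cols = len(busy), len(busy[0])
    for r in range(rows - height + 1):
        for c in range(cols - width + 1):
            if all(
                not busy[r + dr][c + dc]
                for dr in range(height)
                for dc in range(width)
            ):
                return (r, c)
    return None


# Hypothetical 4x4 mesh: four processors are free, but only at the corners.
fragmented: Mesh = [
    [False, True, True, False],
    [True, True, True, True],
    [True, True, True, True],
    [False, True, True, False],
]

# Hypothetical layout after relocation: the same number of busy processors,
# with the free ones gathered into one contiguous block.
compacted: Mesh = [
    [False, False, True, True],
    [False, False, True, True],
    [True, True, True, True],
    [True, True, True, True],
]

if __name__ == "__main__":
    print(count_free(fragmented), find_free_submesh(fragmented, 2, 2))
    print(count_free(compacted), find_free_submesh(compacted, 2, 2))
```

Both layouts have four free processors, yet only the compacted one can host a task that needs a 2x2 submesh, which is the situation relocation is meant to resolve.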
Conference Proceeding: Prediction Services for Distributed Computing.
ABSTRACT: Users of distributed systems such as the TeraGrid and Open Science Grid can execute their applications on many different systems. We wish to help such users, or the grid schedulers they use, select where to run applications by providing predictions of when tasks will complete if sent to different systems. We make predictions of file transfer times, batch scheduler queue wait times, and application execution times using historical information and instance-based learning techniques. Our prediction errors for data from the TACC lonestar system are 37 percent of mean file transfer time, 115 percent of mean queue wait time, and 72 percent of mean execution time. Our approach achieves significantly lower prediction error on other workloads. We have wrapped these prediction techniques with web services, making predictions available to users of distributed systems as well as tools such as resource brokers and metaschedulers.
21st International Parallel and Distributed Processing Symposium (IPDPS 2007), Proceedings, 26-30 March 2007, Long Beach, California, USA; 01/2007
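As a rough illustration of instance-based learning, the sketch below predicts a queue wait time as the mean of the most similar historical jobs (a k-nearest-neighbor estimate). The abstract does not say which instance-based technique, features, or distance measure the authors use, so the function `knn_predict`, the two features (requested nodes, requested hours), and the history values are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# Each historical instance pairs a feature vector with an observed value.
Instance = Tuple[Sequence[float], float]


def knn_predict(history: List[Instance], query: Sequence[float], k: int = 3) -> float:
    """Predict a value for `query` as the mean of the k nearest
    historical instances, using Euclidean distance on the features."""
    if not history:
        raise ValueError("history must contain at least one instance")
    nearest = sorted(history, key=lambda item: math.dist(item[0], query))[:k]
    return sum(value for _, value in nearest) / len(nearest)


# Hypothetical history: (requested nodes, requested hours) -> wait in minutes.
history: List[Instance] = [
    ((1.0, 1.0), 5.0),
    ((2.0, 1.0), 7.0),
    ((1.0, 2.0), 9.0),
    ((64.0, 24.0), 600.0),
    ((128.0, 48.0), 1500.0),
]

if __name__ == "__main__":
    # A small job resembles the three small historical jobs.
    print(knn_predict(history, (1.0, 1.0), k=3))
```

In practice the features would need scaling so that no single one dominates the distance, and k would be tuned against held-out historical data.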
Conference Proceeding: A comprehensive model of the supercomputer workload
ABSTRACT: As with any computer system, the performance of supercomputers depends upon the workloads that serve as their input. Unfortunately, however, there are many important aspects of the supercomputer workloads that have not been modeled, or that have been modeled only incipiently. This paper attacks this problem by considering requested time (and its relation with execution time) and the possibility of job cancellation, two aspects of the supercomputer workload that have not been modeled yet. Moreover, we also improve upon existing models for the arrival instant and partition size.
Workload Characterization, 2001. WWC-4. 2001 IEEE International Workshop on; 01/2002