Conference Paper

A Dynamic Load Balancing Tool for One and Two Dimensional Parallel Loops

Center for Computational Sci., Mississippi State Univ.
DOI: 10.1109/ISPDC.2006.1 Conference: Parallel and Distributed Computing, 2006. ISPDC '06. The Fifth International Symposium on
Source: IEEE Xplore

ABSTRACT

This paper describes a dynamic load balancing tool intended for computational investigators who have little familiarity with programming for a message-passing environment. Motivated by the PAR DOALL directive available in some compilers for shared-memory systems, the tool is designed to simplify the manual conversion of sequential programs containing computationally intensive loops with independent iterates into parallel programs that execute with high efficiency on general-purpose clusters. The tool implements a dynamic loop scheduling strategy to address load imbalance that may be induced by the non-uniformity of loop iterate times and by the heterogeneity of processors. The tool is based on the Message Passing Interface (MPI) library for wide availability. Timings of a nontrivial application that utilizes the tool on a Linux cluster are presented to demonstrate sample achievable performance.
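The self-scheduling strategy the abstract describes can be sketched in miniature: idle workers repeatedly pull the next chunk of loop iterates from a shared queue, so faster processors naturally receive more work. The sketch below uses Python threads and a fixed chunk size purely for illustration; the paper's actual tool is MPI-based and uses an adaptive chunking rule, and the function names here are assumptions, not the tool's API.

```python
import queue
import threading

def dynamic_parallel_loop(n_iters, n_workers, body, chunk_size=8):
    """Illustrative master-worker self-scheduling over a 1-D loop:
    workers pull chunks on demand from a shared queue, balancing
    load when iterate times or worker speeds are non-uniform."""
    tasks = queue.Queue()
    for start in range(0, n_iters, chunk_size):
        tasks.put((start, min(start + chunk_size, n_iters)))
    results = [None] * n_iters

    def worker():
        while True:
            try:
                lo, hi = tasks.get_nowait()  # claim the next chunk
            except queue.Empty:
                return                        # no work left
            for i in range(lo, hi):
                results[i] = body(i)          # independent iterates

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```

Because iterates are independent, correctness does not depend on which worker executes which chunk; only the completion time does.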

    • "Due to space limitations, the interested reader is referred to the appropriate references for details of the above DLS algorithms. The performance of DLS methods using hierarchical management has been shown to be better than that of the centralized management approach [9][10][11]. Figure 1 illustrates the centralized management approach (left), and the distributed management approach (right). The coordination of and interactions between the processors in the first case are straightforward, whereas for details of the hierarchical management approach, due to space limitations, the interested reader is referred to [9]."
    ABSTRACT: Loops are the richest source of parallelism in scientific applications. A large number of loop scheduling schemes have therefore been devised for loops with and without data dependencies (modeled as dependence distance vectors) on heterogeneous clusters. The loops with data dependencies require synchronization via cross-node communication. Synchronization requires fine-tuning to overcome the communication overhead and to yield the best possible overall performance. In this paper, a theoretical model is presented to determine the granularity of synchronization that minimizes the parallel execution time of loops with data dependencies when these are parallelized on heterogeneous systems using dynamic self-scheduling algorithms. New formulas are proposed for estimating the total number of scheduling steps when a threshold for the minimum work assigned to a processor is assumed. The proposed model uses these formulas to determine the synchronization granularity that minimizes the estimated parallel execution time. The accuracy of the proposed model is verified and validated via extensive experiments on a heterogeneous computing system. The results show that the theoretically optimal synchronization granularity, as determined by the proposed model, is very close to the experimentally observed optimal synchronization granularity, with no deviation in the best case, and within 38.4% in the worst case. Copyright © 2012 John Wiley & Sons, Ltd.
    Full-text · Article · Dec 2012 · Concurrency and Computation Practice and Experience
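The granularity trade-off this abstract analyzes can be illustrated with a toy step count: under a factoring-style self-scheduling rule, imposing a minimum-chunk threshold coarsens the granularity and reduces the number of scheduling (and hence synchronization) steps. The rule below is a simplified stand-in; the article's actual formulas and its model of heterogeneous processor speeds are not reproduced here.

```python
import math

def count_scheduling_steps(n_iters, n_procs, min_chunk=1):
    """Count self-scheduling steps under a simplified factoring-style
    rule with a minimum-chunk threshold (hypothetical illustration):
    each batch splits half the remaining work across n_procs chunks,
    but never below min_chunk iterates per chunk."""
    steps = 0
    remaining = n_iters
    while remaining > 0:
        # chunk size for this batch, floored at the threshold
        size = max(min_chunk, math.ceil(remaining / (2 * n_procs)))
        for _ in range(n_procs):
            if remaining <= 0:
                break
            remaining -= min(size, remaining)  # one scheduling step
            steps += 1
    return steps
```

A larger threshold means fewer, coarser chunks: fewer cross-node synchronizations, at the cost of weaker load balancing near the end of the loop.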
    ABSTRACT: Dynamic loop scheduling (DLS) algorithms provide application-level load balancing of loop iterates, with the goal of maximizing application performance on the underlying system. These methods use run-time information regarding the performance of the application’s execution (for which irregularities change over time). Many DLS methods are based on probabilistic analyses, and therefore account for unpredictable variations of application and system related parameters. Scheduling scientific and engineering applications in large-scale distributed systems (possibly shared with other users) makes the problem of DLS even more challenging. Moreover, the chances of failure, such as processor or link failure, are high in such large-scale systems. In this paper, we employ the hierarchical approach for three DLS methods, and propose metrics for quantifying their robustness with respect to variations of two parameters (load and processor failures), for scheduling irregular applications in large-scale heterogeneous distributed systems.
    Full-text · Conference Paper · Jun 2009
    ABSTRACT: To improve the performance of scientific applications with parallel loops, dynamic loop scheduling methods have been proposed. Such methods address performance degradations due to load imbalance caused by predictable phenomena like nonuniform data distribution or algorithmic variance, and unpredictable phenomena such as data access latency or operating system interference. In particular, methods such as factoring, weighted factoring, adaptive weighted factoring, and adaptive factoring have been developed based on a probabilistic analysis of parallel loop iterates with variable running times. These methods have been successfully implemented in a number of applications such as: N-Body and Monte Carlo simulations, computational fluid dynamics, and radar signal processing. The focus of this paper is on adaptive weighted factoring (AWF), a method that was designed for scheduling parallel loops in time-stepping scientific applications. The main contribution of the paper is to relax the time-stepping requirement, a modification that allows the AWF to be used in any application with a parallel loop. The modification further allows the AWF to adapt to load imbalance that may occur during loop execution. Results of experiments to compare the performance of the modified AWF with the performance of the other loop scheduling methods in the context of three nontrivial applications reveal that the performance of the modified method is comparable to, and in some cases, superior to the performance of the most recently introduced adaptive factoring method.
    Preview · Article · Apr 2008 · The Journal of Supercomputing
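The factoring family of methods this abstract builds on assigns chunks in batches of geometrically decreasing size: large chunks early to keep overhead low, small chunks late to smooth out imbalance. A minimal sketch of the chunk-size schedule, under the simplified rule that each batch distributes half the remaining iterates evenly across processors (the weighted and adaptive variants in the abstract additionally scale chunks by measured processor performance, which is omitted here):

```python
import math

def factoring_chunks(n_iters, n_procs):
    """Chunk sizes under a simplified factoring rule: each batch
    assigns n_procs chunks that together cover roughly half the
    remaining iterates, so sizes decrease geometrically."""
    chunks = []
    remaining = n_iters
    while remaining > 0:
        # each of the n_procs chunks gets ~1/(2*n_procs) of what's left
        size = max(1, math.ceil(remaining / (2 * n_procs)))
        for _ in range(n_procs):
            if remaining <= 0:
                break
            c = min(size, remaining)
            chunks.append(c)
            remaining -= c
    return chunks
```

For example, 1000 iterates on 4 processors start with four chunks of 125, then four of 63, and so on down to single iterates, so late-arriving stragglers receive only small pieces of work.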