Conference Paper

Dynamic Multi-Phase Scheduling for Heterogeneous Clusters

Department of Electrical & Computer Engineering, National Technical University of Athens, Greece
DOI: 10.1109/IPDPS.2006.1639308
Conference: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece
Source: DBLP

ABSTRACT Distributed computing systems are a viable and less expensive alternative to parallel computers. However, concurrent programming methods for distributed systems have not been studied as extensively as those for parallel computers. Among the main research issues is how to deal with scheduling and load balancing in such systems, which may consist of heterogeneous computers. In the past, a variety of dynamic scheduling schemes suitable for parallel loops (with independent iterations) on heterogeneous computer clusters have been proposed and studied. However, no study of dynamic schemes for loops with iteration dependencies has been reported so far. In this work we study the problem of scheduling loops with iteration dependencies on heterogeneous (dedicated and non-dedicated) clusters. The presence of iteration dependencies incurs an extra degree of difficulty and makes the development of such schemes quite a challenge. We extend three well-known dynamic schemes (CSS, TSS and DTSS) by introducing synchronization points at certain intervals so that processors compute in a pipelined fashion. We call the resulting scheme dynamic multi-phase scheduling (DMPS), and we apply it to loops with iteration dependencies. We implemented the new scheme on a network of heterogeneous computers and studied its performance. Through extensive testing on two real-life applications (the heat equation and the Floyd-Steinberg algorithm), we show that the proposed method is efficient for parallelizing nested loops with dependencies on heterogeneous systems.
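To make the pipelining idea concrete, the following is a minimal, illustrative sketch, not the authors' implementation: CSS- or TSS-style chunks over the outer loop are cut by synchronization points along the inner, dependence-carrying dimension, so that a segment of one chunk only waits on the matching segment of the previous chunk. All names and parameters here (total, first, last, sync_interval) are assumptions made for the example.

```python
# Sketch of the DMPS idea from the abstract: a self-scheduling rule (CSS or
# TSS) partitions the outer loop into chunks, and synchronization points
# every `sync_interval` inner iterations cut each chunk into segments.
# Segment (k, j) depends only on segment (k-1, j) of the previous chunk,
# which is what lets processors advance in a pipelined (wavefront) order.

def css_chunks(total, chunk_size):
    """CSS: every request is answered with a constant-size chunk."""
    start = 0
    while start < total:
        size = min(chunk_size, total - start)
        yield start, size
        start += size

def tss_chunks(total, first, last):
    """TSS: chunk sizes decrease linearly from `first` down to `last`."""
    steps = max(1, (2 * total) // (first + last))   # approx. number of chunks
    decr = (first - last) / max(1, steps - 1)       # linear decrement
    start, size = 0, float(first)
    while start < total:
        s = max(1, min(int(round(size)), total - start))
        yield start, s
        start += s
        size = max(size - decr, float(last))

def dmps_segments(chunks, inner_total, sync_interval):
    """Cut every chunk into segments of `sync_interval` inner iterations.

    Yields (chunk_id, segment_id, outer_range, inner_range, depends_on);
    depends_on names the segment that must complete first (the same inner
    segment of the previous chunk), or None for the first chunk."""
    for k, (start, size) in enumerate(chunks):
        for j, lo in enumerate(range(0, inner_total, sync_interval)):
            hi = min(lo + sync_interval, inner_total)
            dep = (k - 1, j) if k > 0 else None
            yield k, j, (start, start + size), (lo, hi), dep

if __name__ == "__main__":
    # e.g. a 1000 x 400 iteration space, TSS chunks from 128 down to 16,
    # and synchronization points every 100 inner iterations
    for unit in dmps_segments(tss_chunks(1000, 128, 16), 400, 100):
        print(unit)
```

Under DTSS, the same trapezoid chunk sizes would additionally be weighted by each processor's relative speed, which is what lets the scheme adapt to heterogeneous and non-dedicated nodes.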

  • ABSTRACT: One of the most significant causes of performance degradation of scientific and engineering applications on high performance computing systems is the uneven distribution of the computational work to the resources of the system. This effect, known as load imbalance, is even more noticeable in the case of irregular applications and heterogeneous distributed systems. This has motivated the parallel and distributed computing research community to focus on methods that provide good load balancing for scientific and engineering applications running on (heterogeneous) distributed systems. Efficient load balancing and scheduling methods are employed for scientific applications from various fields, such as mechanics, materials, physics, chemistry, biology, and applied mathematics. Such applications typically employ a large number of computational methods in order to simulate complex phenomena, on very large scales of time and magnitude. These simulations consist of routines that perform repetitive computations (in the form of DO/FOR loops) over very large data sets which, if not properly implemented and executed, may suffer from poor performance. The number of repetitive computations in the simulation codes is not always constant. Moreover, the computational nature of these simulations may in fact be irregular, leading to the case where one computation takes (unpredictably) more time than others. For successful and timely results, large-scale simulations require the use of large-scale computing systems, which are often widely distributed and highly heterogeneous. Moreover, large-scale computing systems are usually shared among multiple users, which causes the quality and quantity of the available resources to be highly unpredictable. There are numerous load balancing methods in the literature for different parallel architectures. The most recent of these methods typically follow the master-worker paradigm, where a single coordinator (master) is responsible for making all the scheduling decisions based on information provided by the workers. Depending on the application requirements, the scheduling policy and the computational environment, the benefits of this paradigm may be limited as follows: (1) its efficiency may not scale as the number of processors increases, and (2) it is quite probable that the scheduling decisions are made based on outdated information, especially on systems where the workload changes rapidly. In an effort to address these limitations, we propose a distributed (master-less) load balancing scheme, in which the scheduling decisions are made by the workers in a distributed fashion. We implemented this method, along with two other master-worker schemes (a previously existing one and a recently modified one), for three different scientific computational kernels. In order to validate the usefulness and efficiency of the proposed scheme, we conducted a series of comparative performance tests with the two master-worker schemes for each computational kernel. The target system is an SMP cluster, on which we simulated three different patterns of system load fluctuation. The experiments strongly support the belief that the distributed approach offers greater performance and better scalability on such systems, showing an overall improvement ranging from 13% to 24% over the master-worker approaches.
    Parallel Computing, 37:713-729, 2011.
  • ABSTRACT: We propose and analyze threading algorithms for hybrid MPI/OpenMP parallelization of a molecular-dynamics simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation are introduced: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). The algorithms combine fine-grain dynamic load balancing and minimal memory-footprint data privatization threading. We show that the computational costs of CVAS and BFAS are bounded by Θ(n^{5/3} p^{-2/3}) and Θ(n), respectively, for p threads working on n particles on a multicore compute node. Memory consumption per node of both algorithms scales as O(n + n^{2/3} p^{1/3}), but CVAS has smaller prefactors due to a geometric effect. Based on these analyses, we derive the selection criterion between the two algorithms in terms of the granularity, n/p. We observe that memory consumption is reduced by 75% for p = 16 and n = 8,192 compared to a naïve data privatization, while maintaining thread imbalance below 5%. We obtain a strong-scaling speedup of 14.4 with 16-way threading on a four quad-core AMD Opteron node. In addition, our MPI/OpenMP code achieves 2.58× and 2.16× speedups over the MPI-only implementation on 32,768 cores of BlueGene/P for 0.84 and 1.68 million particle systems, respectively.
    The Journal of Supercomputing, 66(1):406-430, 2013.
  • ABSTRACT: Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems are in their infancy. In this article, we propose a new scheduling and workload balancing scheme, HDSS, for execution of loops having dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Different from previous studies, our scheme uniquely considers the runtime effects of block sizes on performance for heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain a balanced workload while keeping accelerator utilization at a maximum. Our algorithm does not require offline training or architecture-specific parameters. We have evaluated our scheme on two different heterogeneous architectures: an AMD 64-core Bulldozer system with an nVidia Fermi C2050 GPU, and an Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex 4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of over 200% when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than current alternative approaches.
    ACM Transactions on Architecture and Code Optimization (TACO), 9(4), 2013. (A minimal sketch of the two-phase idea appears after this list.)
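The adaptive-then-weighted structure described in the last entry can be sketched in a few lines. The following is an illustrative reconstruction, not the published HDSS algorithm: a short adaptive phase measures each worker's throughput on an equal probe block, and the completion phase then hands out geometrically shrinking chunks weighted by the measured speeds. All names and formulas here (probe, shrink, min_chunk) are assumptions made for the example.

```python
# Illustrative two-phase scheduler in the spirit of the HDSS abstract above:
# an adaptive phase estimates per-worker speed, then a weighted, guided-style
# completion phase assigns shrinking chunks proportional to those speeds.
# This is a sketch under stated assumptions, not the published algorithm.

def adaptive_phase(workers, probe=256):
    """Run one equal probe block per worker and return measured speeds
    (iterations per second). `workers` maps a name to a callable that
    executes n iterations and returns the elapsed time in seconds."""
    return {name: probe / run(probe) for name, run in workers.items()}

def completion_phase(remaining, speeds, shrink=0.5, min_chunk=32):
    """Assign the remaining iterations in rounds: each round distributes a
    `shrink` fraction of what is left, split in proportion to speed and
    never below `min_chunk`, so all workers finish at nearly the same time."""
    total_speed = sum(speeds.values())
    schedule = []
    while remaining > 0:
        for name, s in speeds.items():
            if remaining <= 0:
                break
            size = max(min_chunk, int(remaining * shrink * s / total_speed))
            size = min(size, remaining)
            schedule.append((name, size))
            remaining -= size
    return schedule

if __name__ == "__main__":
    import time

    def fake_worker(cost):                 # simulated per-iteration cost
        def run(n):
            t0 = time.perf_counter()
            time.sleep(n * cost)           # stand-in for real computation
            return time.perf_counter() - t0
        return run

    workers = {"cpu": fake_worker(1e-5), "gpu": fake_worker(1e-6)}
    speeds = adaptive_phase(workers)
    # the two probe blocks already did 2 * 256 iterations of the work
    print(completion_phase(10_000 - 2 * 256, speeds))
```

The geometric shrinkage plays the role the abstract attributes to block-size tuning: early chunks are large to amortize dispatch overhead, while late chunks are small so no single processor is left running long after the others finish.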
