Conference Paper

# Dynamic multi phase scheduling for heterogeneous clusters.

Dept. of Electr. & Comput. Eng., Athens Nat. Tech. Univ., Greece

DOI: 10.1109/IPDPS.2006.1639308

Conference: 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), Proceedings, 25-29 April 2006, Rhodes Island, Greece

Source: DBLP


**ABSTRACT:** In this paper we present the theoretical framework for enumerating the solutions of the first-order Diophantine equation with unitary coefficients, which helps us define the sets of points of a nested loop that can be executed in parallel. The analytical expression for finding those points is $i_1 + i_2 + \dots + i_D = c$, where $i_1, i_2, \dots, i_D$, with $0 \le i_p \le L_p$, $\forall p = 1 \dots D$, are the loop indices, $L_p$ is the loop bound, and $c$ defines the time at which all points satisfying this equation will be executed in parallel. Moreover, we present an innovative "refined" algorithm which speeds up the generation of those solutions compared to the traditional "brute-force" approach. Finally, we present a modular hardware implementation of this "refined" algorithm on FPGA platforms, an approach which further increases the algorithm's performance. The presented architecture and theoretical solution are suitable for load-balancing applications consisting of nested for-loops with dependencies, since they allow rapid and dynamic generation of the index points of loop instances that can be executed in parallel. Moreover, this architecture can be easily reconfigured.

Index Terms: FPGA design, Diophantine equation.

The platform-based design methodology has been proven to be an effective approach for reducing the computational complexity involved in the design process of embedded systems (1). Reconfigurable platforms consist of several programmable components (microprocessors) interconnected to hardware and reconfigurable components. Reconfigurable components allow the flexibility of selecting specific computationally intensive parts (mainly nested loops) of the initial application to be implemented in hardware, during hardware/software partitioning (2), in the effort of achieving the best possible increase in performance, while
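To make the enumeration problem in this abstract concrete, the following is a minimal Python sketch, not the paper's "refined" algorithm or its FPGA design. It assumes inclusive bounds ($0 \le i_p \le L_p$), and the function names `brute_force` and `pruned` are hypothetical. The first function scans the whole index space; the second uses ordinary prefix pruning to visit only index prefixes that can still sum to $c$.

```python
from itertools import product


def brute_force(bounds, c):
    """Scan the whole index space and keep the points whose indices sum to c."""
    return [p for p in product(*(range(L + 1) for L in bounds)) if sum(p) == c]


def pruned(bounds, c):
    """Yield the same points in the same order, skipping hopeless prefixes."""
    d = len(bounds)
    # suffix_max[k] = largest sum that indices k..d-1 can still contribute.
    suffix_max = [0] * (d + 1)
    for k in range(d - 1, -1, -1):
        suffix_max[k] = suffix_max[k + 1] + bounds[k]

    def walk(k, remaining, prefix):
        if k == d:
            # The range limits below guarantee remaining == 0 here.
            yield tuple(prefix)
            return
        lo = max(0, remaining - suffix_max[k + 1])  # later indices cannot cover more
        hi = min(bounds[k], remaining)              # never overshoot the target sum
        for i in range(lo, hi + 1):
            yield from walk(k + 1, remaining - i, prefix + [i])

    if 0 <= c <= suffix_max[0]:
        yield from walk(0, c, [])


if __name__ == "__main__":
    # All points of a 3-deep loop nest (assumed bounds 3, 2, 4) on hyperplane c = 5.
    for point in pruned((3, 2, 4), 5):
        print(point)
```

Each value of `c` selects one hyperplane of the loop nest, so calling `pruned` for successive values of `c` lists, step by step, the groups of iterations that the abstract describes as executable in parallel.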

**ABSTRACT:** We propose and analyze threading algorithms for hybrid MPI/OpenMP parallelization of a molecular-dynamics simulation, which are scalable on large multicore clusters. Two data-privatization thread scheduling algorithms via nucleation-growth allocation are introduced: (1) compact-volume allocation scheduling (CVAS); and (2) breadth-first allocation scheduling (BFAS). The algorithms combine fine-grain dynamic load balancing and minimal memory-footprint data privatization threading. We show that the computational costs of CVAS and BFAS are bounded by $\Theta(n^{5/3} p^{-2/3})$ and $\Theta(n)$, respectively, for $p$ threads working on $n$ particles on a multicore compute node. Memory consumption per node of both algorithms scales as $O(n + n^{2/3} p^{1/3})$, but CVAS has smaller prefactors due to a geometric effect. Based on these analyses, we derive the selection criterion between the two algorithms in terms of the granularity, $n/p$. We observe that memory consumption is reduced by 75% for $p = 16$ and $n = 8{,}192$ compared to a naïve data privatization, while maintaining thread imbalance below 5%. We obtain a strong-scaling speedup of 14.4 with 16-way threading on a four quad-core AMD Opteron node. In addition, our MPI/OpenMP code achieves 2.58× and 2.16× speedups over the MPI-only implementation on 32,768 cores of BlueGene/P for 0.84 and 1.68 million particle systems, respectively.

The Journal of Supercomputing 10/2013; 66(1):406-430. · 0.84 Impact Factor

**ABSTRACT:** Today's heterogeneous architectures bring together multiple general-purpose CPUs and multiple domain-specific GPUs and FPGAs to provide dramatic speedup for many applications. However, the challenge lies in utilizing these heterogeneous processors to optimize overall application performance by minimizing workload completion time. Operating system and application development for these systems are in their infancy. In this article, we propose a new scheduling and workload balancing scheme, HDSS, for execution of loops having dependent or independent iterations on heterogeneous multiprocessor systems. The new algorithm dynamically learns the computational power of each processor during an adaptive phase and then schedules the remainder of the workload using a weighted self-scheduling scheme during the completion phase. Unlike previous studies, our scheme considers the runtime effects of block sizes on performance for heterogeneous multiprocessors. It finds the right trade-off between large and small block sizes to maintain a balanced workload while keeping accelerator utilization at its maximum. Our algorithm does not require offline training or architecture-specific parameters. We have evaluated our scheme on two different heterogeneous architectures: an AMD 64-core Bulldozer system with an nVidia Fermi C2050 GPU, and an Intel Xeon 32-core SGI Altix 4700 supercomputer with Xilinx Virtex 4 FPGAs. The experimental results show that our new scheduling algorithm can achieve performance improvements of more than 200% in the best case when compared to the closest existing load balancing scheme. Our algorithm also achieves full processor utilization, with all processors completing at nearly the same time, which is significantly better than current alternative approaches.

ACM Transactions on Architecture and Code Optimization 01/2013; 9(4). · 0.60 Impact Factor
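The two-phase idea in this abstract (measure each processor first, then hand out weighted chunks) can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not the HDSS algorithm itself. The function `two_phase_schedule`, the `probe` and `min_chunk` parameters, and the chunk rule (half of the remaining work, scaled by the worker's measured share) are all hypothetical choices made for illustration.

```python
import heapq


def two_phase_schedule(total_iters, speeds, probe=8, min_chunk=1):
    """Simulate two-phase self-scheduling; return (finish_times, iterations_done).

    speeds[w] is the true rate of worker w (iterations per time unit). The
    scheduler never reads it directly; it only measures completed chunks.
    """
    n = len(speeds)
    remaining = total_iters
    done = [0] * n
    finish = [0.0] * n
    rate = [None] * n  # measured rate per worker, unknown until a chunk completes
    events = []        # min-heap of (finish_time, worker, chunk, start_time)

    def assign(w, now, chunk):
        nonlocal remaining
        chunk = min(chunk, remaining)
        if chunk <= 0:
            return
        remaining -= chunk
        done[w] += chunk
        finish[w] = now + chunk / speeds[w]
        heapq.heappush(events, (finish[w], w, chunk, now))

    # Adaptive phase (assumed form): every worker gets the same small probe chunk.
    for w in range(n):
        assign(w, 0.0, probe)

    # Completion phase: an idle worker receives a chunk proportional to its share
    # of the estimated total rate; unmeasured workers are assumed to be average.
    while remaining > 0:
        now, w, chunk, start = heapq.heappop(events)
        rate[w] = chunk / (now - start)
        known = [r for r in rate if r is not None]
        est_total = sum(known) / len(known) * n
        weight = rate[w] / est_total
        assign(w, now, max(min_chunk, int(remaining * weight / 2)))
    return finish, done


if __name__ == "__main__":
    finish, done = two_phase_schedule(1000, [1.0, 4.0])
    print("finish times:", finish)
    print("iterations per worker:", done)
```

Shrinking the chunk as the remaining work shrinks is what lets fast and slow workers finish close together. A real scheduler would also have to account for per-chunk overheads and for the block-size effects on accelerators that the abstract highlights, which this sketch ignores.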
