Conference Paper

Coloured Petri Net Modelling of Task Scheduling on a Heterogeneous Computational Node


Abstract and Figures

This paper presents the development of a Coloured Petri Net model for a concurrent application running on a heterogeneous multi-/many-core node. The software runtime used (StarPU) allows expressing the application as a DAG (Directed Acyclic Graph) of tasks and partitioning the heterogeneous hardware into worker units. The CPN model allows rapid evaluation of the suitability of the implemented scheduling algorithms for a given problem and supports the design and implementation of new algorithms. The scheduler models were validated through runs on the real architecture.
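The scheduling setting the abstract describes (tasks dispatched to heterogeneous workers as soon as their dependencies are resolved) can be illustrated with a small sketch. This is our own simplified model, not the StarPU API or the paper's scheduler; all names and the earliest-finish-time tie-breaking rule are assumptions.

```python
# Hypothetical sketch: dispatch ready DAG tasks to heterogeneous workers,
# choosing for each ready task the worker with earliest finish time.
from collections import deque

def schedule_dag(tasks, deps, cost):
    """tasks: list of task ids; deps: {task: set of predecessors};
    cost: {(task, worker): execution time}. Returns {task: (worker, start, end)}."""
    workers = sorted({w for (_, w) in cost})
    free_at = {w: 0.0 for w in workers}        # when each worker becomes idle
    done_at = {}                               # finish time of each scheduled task
    schedule = {}
    pending = {t: set(deps.get(t, ())) for t in tasks}
    ready = deque(t for t in tasks if not pending[t])
    while ready:
        t = ready.popleft()
        deps_done = max((done_at[p] for p in deps.get(t, ())), default=0.0)
        # earliest-finish-time choice over the heterogeneous workers
        w = min(workers, key=lambda w: max(free_at[w], deps_done) + cost[(t, w)])
        start = max(free_at[w], deps_done)
        end = start + cost[(t, w)]
        free_at[w], done_at[t], schedule[t] = end, end, (w, start, end)
        for s in tasks:                        # release successors whose deps are met
            if t in pending[s]:
                pending[s].discard(t)
                if not pending[s]:
                    ready.append(s)
    return schedule
```

With two workers and a three-task fork DAG, the two independent successors end up on different workers, which is the behaviour the abstract's worker-unit partitioning aims at.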


... This is accomplished by numerically solving the system of partial differential equations describing the flow. As described in [1], [12], [13], the parallelisation of such an application can be achieved through domain decomposition, alternating parallel independent computation steps on the sub-domains with communication steps. In the communication steps, border zones, called ghost zones, are transferred between neighbours to permit computation in the next step. ...
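The alternation of computation and ghost-zone communication described in the citing text can be sketched for a 1-D domain. This is purely illustrative: the function name, the one-cell ghost width, and the 3-point averaging stencil are our assumptions, not from the paper.

```python
# Illustrative ghost-zone exchange: a 1-D domain split among workers, with
# one-cell border (ghost) zones copied between neighbours before each step.
def step_with_ghost_exchange(subdomains):
    """subdomains: list of lists of cell values, one per worker.
    Returns the next-step subdomains after a 3-point averaging stencil."""
    # communication step: copy border cells into neighbours' ghost zones
    padded = []
    for i, sub in enumerate(subdomains):
        left = subdomains[i - 1][-1] if i > 0 else sub[0]
        right = subdomains[i + 1][0] if i < len(subdomains) - 1 else sub[-1]
        padded.append([left] + sub + [right])
    # independent computation step on each sub-domain
    return [[(p[j - 1] + p[j] + p[j + 1]) / 3.0 for j in range(1, len(p) - 1)]
            for p in padded]
```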
... A scheduler distributes the tasks to the available idle PUs as soon as their dependencies are resolved. We extend here our work presented in [12] with the part responsible for the interaction with the upper level. We used AMPI/Charm++ for the inter-node level [11]. ...
... The Snoopy framework [16] for Petri net modelling and simulation was used. In contrast to the CPN Tools we used in [12], [13], this framework is under active development, supports remote simulation and interactive steering [17], and has an ecosystem of tools for formal analysis (Charlie [18]) and verification (Marcie [19]). From the rich hierarchy of classes provided by Snoopy, the XSPN^C class (Coloured eXtended Stochastic Petri Net) was used. ...
Conference Paper
The increase in demand for High Performance Computing (HPC) scientific applications motivates efforts to reduce the costs of running these applications. The problem to solve is the dynamic multi-criteria optimal scheduling of an application on an HPC platform with a high number of heterogeneous nodes. The solution proposed by the authors is an HPC hardware-software architecture that includes the infrastructure for two-level (node and inter-node) adaptive load balancing. The article presents the development of a Coloured Petri Net (CPN) model for such an architecture. The model was used for the development of a dynamic distributed algorithm for the scheduling problem. The CPN allowed a holistic hardware-software formal verification and analysis. Some simple properties were formally proved. Simulations were performed to assess performance; the results were in the performance range of other load-balancing algorithms, with significant benefits in reducing the optimization's complexity.
Due to their complexity, it is nowadays virtually inconceivable to design and implement large digital systems without the use of computer-aided design tools. Many Petri net extensions have been proposed aiming to describe hardware characteristics as accurately as possible. Among all Petri net extensions developed for use with digital systems, only two have nearly all the characteristics needed to describe such systems in full: place chart nets and Petri nets for embedded systems. Using the latter as an example, we discuss some issues that may improve the capability of Petri nets in dealing with digital systems.
Conference Paper
Iterative stencil loops (ISLs) are used in many applications and tiling is a well-known technique to localize their computation. When ISLs are tiled across a parallel architecture, there are usually halo regions that need to be updated and exchanged among different processing elements (PEs). In addition, synchronization is often used to signal the completion of halo exchanges. Both communication and synchronization may incur significant overhead on parallel architectures with shared memory. This is especially true in the case of graphics processors (GPUs), which do not preserve the state of the per-core L1 storage across global synchronizations. To reduce these overheads, ghost zones can be created to replicate stencil operations, reducing communication and synchronization costs at the expense of redundantly computing some values on multiple PEs. However, the selection of the optimal ghost zone size depends on the characteristics of both the architecture and the application, and it has only been studied for message-passing systems in a grid environment. To automate this process on shared memory systems, we establish a performance model using NVIDIA's Tesla architecture as a case study and propose a framework that uses the performance model to automatically select the ghost zone size that performs best and generate appropriate code. The modeling is validated by four diverse ISL applications, for which the predicted ghost zone configurations are able to achieve a speedup no less than 98% of the optimal speedup.
Conference Paper
Predicting sequential execution blocks of a large scale parallel application is an essential part of accurate prediction of the overall performance of the application. When simulating a future machine, or a prototype system only available at a small scale, it becomes a significant challenge. Using hardware simulators may not be feasible due to excessively slowed down execution times and insufficient resources. The difficulty of these challenges increases proportionally with the scale of the simulation. In this paper, we propose an approach based on statistical models to accurately predict the performance of the sequential execution blocks that comprise a parallel application. We deployed these techniques in a trace-driven simulation framework to capture both the detailed behavior of the application as well as the overall predicted performance. The technique is validated using both synthetic benchmarks and the NAMD application.
Conference Paper
The OpenCL standard allows targeting a large variety of CPU, GPU and accelerator architectures using a single unified programming interface and language. While the standard guarantees portability of functionality for complying applications and platforms, performance portability on such a diverse set of hardware is limited. Devices may vary significantly in memory architecture as well as type, number and complexity of computational units. To characterize and compare the OpenCL performance of existing and future devices we propose a suite of microbenchmarks, uCLbench. We present measurements for eight hardware architectures – four GPUs, three CPUs and one accelerator – and illustrate how the results accurately reflect unique characteristics of the respective platform. In addition to measuring quantities traditionally benchmarked on CPUs like arithmetic throughput or the bandwidth and latency of various address spaces, the suite also includes code designed to determine parameters unique to OpenCL like the dynamic branching penalties prevalent on GPUs. We demonstrate how our results can be used to guide algorithm design and optimization for any given platform on an example kernel that represents the key computation of a linear multigrid solver. Guided manual optimization of this kernel results in an average improvement of 61% across the eight platforms tested.
Open-world software is a paradigm for developing distributed and heterogeneous software systems. Such systems can be built by integrating already developed third-party services, which usually declare QoS values (e.g., related to performance). These QoS values are subject to some uncertainties; consequently, the performance of the systems using these services may unexpectedly decrease. A challenge for this kind of software is to self-adapt its behaviour in response to changes in the availability or performance of the required services. In this paper, we develop an approach to model self-reconfigurable open-world software systems with stochastic Petri nets. Moreover, we develop strategies for a system to reach a new state in which it can recover its availability or even improve its performance. Through an example, we apply these strategies and evaluate them to discover suitable reconfigurations for the system. The results indicate appropriate strategies for system performance enhancement.
In today’s computer architectures the design spaces are huge, making it very difficult to find optimal configurations. One way to cope with this problem is to use Automatic Design Space Exploration (ADSE) techniques. We developed the Framework for Automatic Design Space Exploration (FADSE), which is focused on microarchitectural optimizations. This framework includes several state-of-the-art heuristic algorithms. In this paper we selected three of them, NSGA-II and SPEA2 as genetic algorithms as well as SMPSO as a particle swarm optimization, and compared their performance. As a test case we optimize the parameters of the Grid ALU Processor (GAP) microarchitecture and then GAP together with the post-link code optimizer GAPtimize. An analysis of the simulation results shows a very good performance of all three algorithms. SMPSO reveals the fastest convergence speed. A clear winner between NSGA-II and SPEA2 cannot be determined.
Conference Paper
The paper presents the development of a model and a simulation technique for the prediction of the performance of a concurrent application running on an HPC architecture. Models for the hardware and software were developed using the Coloured Petri Net formalism and then coupled for the simulation. Timed simulations were performed in the Coloured Petri Net Tools environment and in Charm++ with the BigSim simulator, respectively. The results of both simulations for the same optimization problem show similar trends.
Conference Paper
Efficient application scheduling is critical for achieving high performance in heterogeneous computing (HC) environments. Because of its importance, much research has addressed this problem and various algorithms have been proposed. Duplication-based algorithms are a well-known class of scheduling algorithms that achieve high performance in minimizing the overall completion time (makespan) of applications. However, they do not consider energy consumption. With the growing advocacy for green computing systems, energy conservation has become an important issue and gained particular interest. An existing technique to reduce the energy consumption of an application is dynamic voltage/frequency scaling (DVFS), but its efficiency is affected by the time and energy overhead caused by voltage scaling. In this paper, we propose a new energy-aware scheduling algorithm called Energy-Aware Scheduling by Minimizing Duplication (EAMD), which considers the energy consumption as well as the makespan of applications. It adopts a subtle energy-aware method to determine and delete the redundant task copies in the schedules generated by duplication-based algorithms, which is easier to operate than DVFS and produces no extra time and energy consumption. This algorithm can reduce a large amount of energy consumption while keeping the same makespan as duplication-based algorithms without energy awareness. Randomly generated DAGs are tested in our experiments. Experimental results show that EAMD can save up to 15.59% of the energy consumption of existing duplication-based algorithms. Several factors affecting the performance are also analyzed in the paper.
This paper proposes a novel Coloured Petri Net (CPN) based dynamic scheduling scheme, which aims at scheduling real-time tasks on multiprocessor system-on-chip (MPSoC) platforms. Our CPN-based scheme addresses two key issues in task scheduling problems: dependence detection and task dispatching. We model inter-task dependences using CPN, including true dependences, output dependences, anti-dependences and structural dependences. The dependences can be detected automatically during model execution. Additionally, the proposed model takes the checking of real-time constraints into consideration. We evaluated the scheduling scheme on a state-of-the-art FPGA-based multiprocessor hardware system and modelled the system behaviour using CPN Tools. Simulations and state space analyses are conducted on the model. Experimental results demonstrate that our scheme can achieve 98.9% of the ideal speedup on a real FPGA-based hardware prototype.
Coloured Petri Nets (CPN) is a graphical language for modelling and validating concurrent and distributed systems, and other systems in which concurrency plays a major role. The development of such systems is particularly challenging because of inherent intricacies like possible nondeterminism and the immense number of possible execution sequences. In this textbook Jensen and Kristensen introduce the constructs of the CPN modelling language and present the related analysis methods in detail. They also provide a comprehensive road map for the practical use of CPN by showcasing selected industrial case studies that illustrate the practical use of CPN modelling and validation for design, specification, simulation, verification and implementation in various application domains. Their presentation primarily aims at readers interested in the practical use of CPN. Thus all concepts and constructs are first informally introduced through examples and then followed by formal definitions (which may be skipped). The book is ideally suited for a one-semester course at an advanced undergraduate or graduate level, and through its strong application examples can also serve for self-study. An accompanying website offers additional material such as slides, exercises and project proposals.
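The execution semantics underlying CPN models can be illustrated with the basic (uncoloured) Petri net firing rule: a transition is enabled when every input place holds enough tokens, and firing moves tokens from input to output places. This minimal sketch is ours and omits colours, guards and time for brevity.

```python
# Minimal Petri net firing rule (uncoloured, for illustration only).
def enabled(marking, pre):
    """marking: {place: token count}; pre: {place: tokens consumed}."""
    return all(marking.get(p, 0) >= n for p, n in pre.items())

def fire(marking, pre, post):
    """Fire one transition; post: {place: tokens produced}. Returns new marking."""
    if not enabled(marking, pre):
        raise ValueError("transition not enabled")
    m = dict(marking)
    for p, n in pre.items():
        m[p] -= n
    for p, n in post.items():
        m[p] = m.get(p, 0) + n
    return m
```

For example, a "start task" transition consuming a token from an idle-worker place and a task-queue place, and producing one in a busy place, captures the kind of scheduler step modelled in the paper.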
In a heterogeneous system, processor and network failures are inevitable and can have an adverse effect on the complex applications executing on the system. To reduce the rate of these failures, matching and scheduling algorithms should take into account the objectives of minimizing the schedule length (makespan) and reducing the probability of failure. Equitable distribution of workload over resources contributes to reducing the probability of failure. The Heterogeneous Earliest Finish Time (HEFT) algorithm has been proven to be a performance-effective task scheduling algorithm, addressing the objective of minimizing makespan. The Reliable Dynamic Level Scheduling (RDLS) algorithm is a bi-objective scheduling algorithm that maximizes reliability more effectively. Though the reliable version of the HEFT algorithm (RHEFT) considers failure rate in scheduling decisions, its improvement in reliability is small compared to that of RDLS. To overcome this deficiency, we propose to incorporate the task-processor pair finding step of RDLS into the HEFT algorithm, since it meets both the objectives of minimizing the makespan and maximizing the reliability. We define the load on a processor as the amount of time the processor is engaged in completing the scheduled subtasks. In this paper, a modification to HEFT is proposed as a new algorithm called Improved Reliable HEFT (IRHEFT) for minimizing the schedule length, balancing the load and maximizing the reliability of the schedule. The algorithm's performance is compared with the RDLS algorithm for randomly generated task graphs and a real application task graph.
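HEFT's task prioritisation, which the abstract builds on, ranks each task by its average execution cost plus the heaviest downstream path (the "upward rank") and schedules tasks in decreasing rank order. The sketch below shows only that prioritisation step; communication costs are omitted, which is our simplification, and the names are ours.

```python
# Sketch of HEFT's upward-rank prioritisation (communication costs omitted).
def upward_rank(task, succs, avg_cost, memo=None):
    """Rank = average execution cost plus the heaviest downstream path."""
    memo = {} if memo is None else memo
    if task not in memo:
        memo[task] = avg_cost[task] + max(
            (upward_rank(s, succs, avg_cost, memo) for s in succs.get(task, ())),
            default=0.0)
    return memo[task]

def heft_order(tasks, succs, avg_cost):
    """Tasks sorted by decreasing upward rank, HEFT's scheduling priority."""
    memo = {}
    for t in tasks:
        upward_rank(t, succs, avg_cost, memo)
    return sorted(tasks, key=lambda t: -memo[t])
```

In full HEFT, each task in this order is then placed on the processor giving it the earliest finish time; RDLS-style variants change that placement step to also account for failure rates.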
Heterogeneous clusters that include accelerators have become more common in the realm of high performance computing because of the high GFlop/s rates such clusters are capable of achieving. However, heterogeneous clusters are typically considered hard to program as they usually require programmers to interleave architecture-specific code within application code. We have extended the Charm++ programming model and runtime system to support heterogeneous clusters (with host cores that differ in their architecture) that include accelerators. We are currently focusing on clusters that include commodity processors, Cell processors, and Larrabee devices. When our extensions are used to develop code, the resulting code is portable between various homogeneous and heterogeneous clusters that may or may not include accelerators. Using a simple example molecular dynamics (MD) code, we demonstrate our programming model extensions and runtime system modifications on a heterogeneous cluster comprised of Xeon and Cell processors. Even though there is no architecture-specific code in the example MD program, it is able to successfully make use of three core types, each with a different ISA (Xeon, PPE, SPE), three SIMD instruction extensions (SSE, AltiVec/VMX and the SPE's SIMD instructions), and two memory models (cache hierarchies and scratchpad memories) in a single execution. Our programming model extensions abstract away hardware complexities while our runtime system modifications automatically adjust application data to account for architectural differences between the various cores.
Conference Paper
In the field of HPC, the current hardware trend is to design multiprocessor architectures that feature heterogeneous technologies such as specialized coprocessors (e.g. Cell/BE SPUs) or data-parallel accelerators (e.g. GPGPUs). Approaching the theoretical performance of these architectures is a complex issue. Indeed, substantial efforts have already been devoted to efficiently offload parts of the computations. However, designing an execution model that unifies all computing units and associated embedded memory remains a main challenge. We have thus designed StarPU, an original runtime system providing a high-level, unified execution model tightly coupled with an expressive data management library. The main goal of StarPU is to provide numerical kernel designers with a convenient way to generate parallel tasks over heterogeneous hardware on the one hand, and easily develop and tune powerful scheduling algorithms on the other hand. We have developed several strategies that can be selected seamlessly at run time, and we have demonstrated their efficiency by analyzing the impact of those scheduling policies on several classical linear algebra algorithms that take advantage of multiple cores and GPUs at the same time. In addition to substantial improvements regarding execution times, we obtained consistent superlinear parallelism by actually exploiting the heterogeneous nature of the machine.
Conference Paper
In this paper a method for estimating task execution times is presented, in order to facilitate dynamic scheduling in a heterogeneous metacomputing environment. Execution time is treated as a random variable and is statistically estimated from past observations. This method predicts the execution time as a function of several parameters of the input data, and does not require any direct information about the algorithms used by the tasks or the architecture of the machines. Techniques based upon the concept of analytic benchmarking/code profiling are used to accurately determine the performance differences between machines, allowing observations to be shared between machines. Experimental results using real data are presented.
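The statistical idea described above, predicting execution time as a function of input parameters fitted from past observations, can be sketched with a one-parameter least-squares fit. The linear model form and the function names are our assumptions; the paper's method is more general.

```python
# Sketch: fit execution time t ≈ a + b*n from past (input size, time)
# observations by ordinary least squares, then predict for a new size.
def fit_time_model(sizes, times):
    """Returns (a, b) for the linear model t = a + b*n."""
    k = len(sizes)
    mean_n = sum(sizes) / k
    mean_t = sum(times) / k
    b = sum((n - mean_n) * (t - mean_t) for n, t in zip(sizes, times)) / \
        sum((n - mean_n) ** 2 for n in sizes)
    a = mean_t - b * mean_n
    return a, b

def predict_time(model, size):
    a, b = model
    return a + b * size
```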
The issues and problems posed by heterogeneous computing are discussed. They include design of algorithms for applications, partitioning and mapping of application tasks, interconnection requirements, and the design of programming environments. The use of heterogeneous computing in image understanding is reviewed. An example vision task is presented, and the different types of parallelism used in the example are identified.
Optimally Mapping a CFD Application on a HPC Architecture
I. D. Mironescu, L. Vintan, Optimally Mapping a CFD Application on a HPC Architecture, ACACES 2011 Seventh International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Poster Abstracts, pp. 227-230, ISBN 978-90-382-1798-7, FP7 HiPEAC Network of Excellence, 10-16 July 2011, Fiuggi, Italy
StarPU: a unified platform for task scheduling on heterogeneous multicore architectures
C. Augonnet, S. Thibault, R. Namyst, P.-A. Wacrenier, StarPU: a unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice & Experience, vol. 23, no. 2, pp. 187-198, February 2011
Heterogeneous Computing: Challenges and Opportunities
A. A. Khokhar, V. K. Prasanna, M. E. Shaaban, Cho-Li Wang, Heterogeneous Computing: Challenges and Opportunities, IEEE Computer, vol. 26, no. 6, pp. 18-27, June 1993, IEEE Computer Society Press, Los Alamitos, CA, USA
Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design
P. Thoman, K. Kofler, H. Studt, J. Thomson, T. Fahringer, Automatic OpenCL Device Characterization: Guiding Optimized Kernel Design, Euro-Par 2011 Parallel Processing, Lecture Notes in Computer Science 6853, pp. 438-452, Springer Berlin Heidelberg, 2011
Finding Near-Perfect Parameters for Hardware and Code Optimizations with Automatic Multi-Objective Design Space Explorations
R. Jahr, H. Calborean, L. Vintan, T. Ungerer, Finding Near-Perfect Parameters for Hardware and Code Optimizations with Automatic Multi-Objective Design Space Explorations, Concurrency and Computation: Practice and Experience, doi: 10.1002/cpe.2975, John Wiley & Sons, 2012