The emergence of applications that must efficiently handle growing amounts of data has stimulated the development of new computing architectures with several Processing Units (PUs), such as CPU cores, graphics processing units (GPUs) and Intel Xeon Phi (MIC). Aiming to better exploit these architectures, recent works propose novel runtime environments that offer a variety of methods for scheduling tasks dynamically on different PUs. A main limitation of such proposals is the constrained system configurations usually adopted to tune and test them, since setting up more complete and diversified evaluation environments is costly. In this context, we present D-STHARk, a GUI tool for evaluating Dynamic Scheduling of Tasks in Hybrid Simulated ARchitectures. D-STHARk provides a complete simulated execution environment that allows evaluating dynamic scheduling strategies on simulated applications and hybrid architectures. We evaluate our tool by simulating the dynamic scheduling strategies presented in~\cite{sbac2014}, using the same architecture and application. {\it D-STHARk} reached the same conclusions originally reported by the authors. Moreover, we performed an experiment varying the number of coprocessors, which had not previously been verified due to the lack of real architectures, showing that energy consumption may be reduced while keeping the same performance.

Mobile phones have recently evolved from devices for basic communication into platforms for many applications that require increasing performance for a good user experience. Inside today’s mobile phones we find different processing units (PUs) with high computational capacity, such as multicore architectures and co-processors like GPUs. Libraries and run-time environments have been proposed to improve applications’ performance by taking advantage of different PUs in a transparent way. Among these environments we highlight ParallelME. Despite the importance of task scheduling strategies in these environments, ParallelME has implemented only the First Come First Serve (FCFS) strategy. In this paper we extend the ParallelME framework by implementing and evaluating two different dynamic scheduling strategies, Heterogeneous Earliest Finish Time (HEFT) and Performance-Aware Multiqueue Scheduler (PAMS). We evaluate these strategies using synthetic applications and compare them with FCFS. For some scenarios, PAMS proved to be up to 39% more efficient than FCFS. These gains usually imply lower energy consumption, which is very desirable when working with mobile architectures.
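The difference between FCFS and an earliest-finish-time rule can be illustrated with a minimal sketch. It is not ParallelME's implementation: the function names, the cost table, and the restriction to independent tasks are assumptions made for illustration, and the second function covers only the device-selection step of HEFT (full HEFT also orders the tasks of a dependency graph by rank).

```python
def fcfs_makespan(tasks, n_pus):
    """FCFS sketch: each task, in arrival order, goes to the PU that becomes idle first.

    tasks is a list of per-PU cost lists (hypothetical estimates): tasks[i][p] is
    the time task i takes on PU p. The PU's suitability for the task is ignored.
    """
    free_at = [0.0] * n_pus  # time at which each PU becomes idle
    for costs in tasks:
        pu = min(range(n_pus), key=lambda p: free_at[p])
        free_at[pu] += costs[pu]
    return max(free_at)


def earliest_finish_makespan(tasks, n_pus):
    """Earliest-finish-time sketch: each task goes to the PU where it would finish soonest."""
    free_at = [0.0] * n_pus
    for costs in tasks:
        pu = min(range(n_pus), key=lambda p: free_at[p] + costs[p])
        free_at[pu] += costs[pu]
    return max(free_at)


# Three identical tasks; PU 0 is a slow CPU core, PU 1 a fast GPU (assumed costs).
tasks = [[6, 1], [6, 1], [6, 1]]
print(fcfs_makespan(tasks, 2))             # FCFS sends the first task to the slow PU
print(earliest_finish_makespan(tasks, 2))  # all tasks queue on the fast PU
```

In this toy case FCFS places the first task on the idle but slow PU, whereas the earliest-finish-time rule accepts a short wait for the fast PU and finishes sooner overall.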
This paper presents preliminary performance comparisons of parallel applications developed natively for the Intel Xeon Phi accelerator using three different parallel programming environments and their associated runtime systems. We compare Intel OpenMP, Intel CilkPlus and XKaapi on the same benchmark suite, and we provide comparisons between an Intel Xeon Phi coprocessor and a Sandy Bridge Xeon-based machine. Our benchmark suite is composed of three computing kernels: a Fibonacci computation that allows studying the overhead and the scalability of the runtime system, an NQueens application generating irregular and dynamic tasks, and a Cholesky factorization algorithm. We also compare the Cholesky factorization with the parallel algorithm provided by the Intel MKL library for Intel Xeon Phi. Performance evaluation shows that our XKaapi data-flow parallel programming environment exposes the lowest overhead of all and is highly competitive with native OpenMP and CilkPlus environments on Xeon Phi. Moreover, the efficient handling of data-flow dependencies between tasks makes our XKaapi environment exhibit more parallelism for some applications such as the Cholesky factorization. In that case, we observe substantial gains with up to 180 hardware threads over the state-of-the-art MKL, with a 47% performance increase for 60 hardware threads.
High performance computing is experiencing a major paradigm shift with the introduction of accelerators, such as graphics processing units (GPUs) and Intel Xeon Phi (MIC). These processors have made available tremendous computing power at low cost, and are transforming machines into hybrid systems equipped with CPUs and accelerators. Although these systems can deliver very high peak performance, making full use of their resources in real-world applications is a complex problem. Most current applications deployed to these machines are still executed on a single processor, leaving other devices underutilized. In this paper we explore a scenario in which applications are composed of hierarchical dataflow tasks that are allocated to nodes of a distributed memory machine at coarse grain, but each of them may be composed of several finer-grain tasks that can be allocated to different devices within the node. We propose and implement novel performance-aware scheduling techniques that can be used to allocate tasks to devices. We evaluate our techniques using a pathology image analysis application used to investigate brain cancer morphology, and our experimental evaluation shows that the proposed scheduling strategies significantly outperform other efficient scheduling techniques, such as Heterogeneous Earliest Finish Time (HEFT), in cooperative executions using CPUs, GPUs, and MICs. We also experimentally show that our strategies are less sensitive to inaccuracy in the scheduling input data and that the performance gains are maintained as the application scales.
On many computers, a request to run a job is not serviced immediately but instead is placed in a queue and serviced only when resources are released by preceding jobs. In this paper, we build on run-time prediction techniques that we developed in previous research to explore two problems. The first problem is to predict how long applications will wait in a queue until they receive resources. We develop run-time estimates that result in more accurate wait-time predictions than other run-time prediction techniques. The second problem we investigate is improving scheduling performance. We use run-time predictions to improve the performance of the least-work-first and backfill scheduling algorithms. We find that using our run-time predictor results in lower mean wait times for the workloads with higher offered loads and for the backfill scheduling algorithm.
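One way run-time predictions can be turned into wait-time predictions is to simulate the queue forward in time. The sketch below is an illustration under stated assumptions, not the authors' method: it assumes a strict first-come-first-served queue without backfilling, a fixed pool of identical nodes, and a hypothetical function name and input layout.

```python
import heapq


def predict_wait_times(running, queued, total_nodes):
    """Predict the wait of each queued job under strict FCFS (illustrative assumption).

    running: list of (predicted_remaining_time, nodes) for jobs holding nodes now.
    queued:  list of (predicted_run_time, nodes) in queue order.
    Returns the predicted wait, measured from now, for each queued job.
    """
    active = list(running)  # min-heap keyed on predicted finish time
    heapq.heapify(active)
    free = total_nodes - sum(nodes for _, nodes in running)
    now = 0.0
    waits = []
    for run_time, nodes in queued:
        if nodes > total_nodes:
            raise ValueError("job requests more nodes than the machine has")
        # Advance simulated time until enough nodes have been released.
        while free < nodes:
            finish, released = heapq.heappop(active)
            now = max(now, finish)
            free += released
        waits.append(now)
        heapq.heappush(active, (now + run_time, nodes))
        free -= nodes
    return waits


# Four-node machine, fully occupied by two running jobs (assumed predictions).
print(predict_wait_times([(10, 2), (5, 2)], [(3, 2), (4, 4), (1, 1)], 4))
```

Because each start time depends on the predicted finish times of the jobs ahead in the queue, errors in the run-time predictions propagate directly into the predicted waits, which is why more accurate run-time estimates matter for this problem.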