Intel’s Threading Building Blocks (TBB) provide a high-level abstraction for expressing parallelism in applications without
writing explicitly multi-threaded code. However, TBB is only available for shared-memory, homogeneous multicore processors.
Codeplay’s Offload C++ provides a single-source, POSIX threads-like approach to programming heterogeneous multicore devices where cores are equipped with private, local memories—code to move data between memory spaces is generated
automatically. In this paper, we show that the strengths of TBB and Offload C++ can be combined by implementing part of the TBB headers
in Offload C++. This allows applications parallelised using TBB to run, without source-level modifications, across all the
cores of the Cell BE processor. We present experimental results applying our method to a set of TBB programs. To our knowledge,
this work marks the first demonstration of programs parallelised using TBB executing on a heterogeneous multicore architecture.
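To make the claim concrete, the following is a minimal sketch of the kind of TBB code the paper targets; the vector-scaling kernel and all names in it are our own illustration, not taken from the paper's benchmark set. The point of the approach described above is that code written against the standard TBB headers like this would run unchanged once those headers are reimplemented in Offload C++.

```cpp
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Loop body in the classic TBB style: a function object applied to sub-ranges.
struct ScaleBody {
    float* data;
    float  factor;
    void operator()(const tbb::blocked_range<std::size_t>& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i)
            data[i] *= factor;
    }
};

int main() {
    std::vector<float> v(1 << 20, 1.0f);
    ScaleBody body = { &v[0], 2.0f };
    // TBB splits the range and schedules the pieces across worker threads;
    // with an Offload C++ implementation of the headers, the same call would
    // be mapped onto the SPE cores of the Cell BE, as the paper describes.
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, v.size()), body);
    return 0;
}
```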
... Programmers can use portable accessor classes (efficient data access abstractions) and knowledge of their application's access patterns to achieve high performance. Offload provides a number of accessor classes, written using C++ templates and operator overloading [14]. ...
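The excerpt above names the mechanism (C++ templates plus operator overloading) but not the interface itself. The sketch below is a hypothetical rendering of such an accessor class, assuming a simple read-only wrapper; the class name, members, and the host-side fallback are our own assumptions, not Offload's actual API.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical read-only accessor: wraps a pointer into a (possibly remote)
// memory space and overloads operator[] so kernel code reads elements through
// it instead of dereferencing raw pointers. On a host build this is a plain
// load; an accelerator-side specialisation could instead stage the data into
// local memory (e.g. via DMA) behind the same syntax.
template <typename T>
class ReadAccessor {
public:
    explicit ReadAccessor(const T* base) : base_(base) {}
    T operator[](std::size_t i) const {
        // Host fallback: a direct read. A local-memory specialisation would
        // fetch and cache the element here without changing calling code.
        return base_[i];
    }
private:
    const T* base_;
};

// Example use: the loop body is written once against the accessor interface.
float sum(const float* data, std::size_t n) {
    ReadAccessor<float> in(data);
    float total = 0.0f;
    for (std::size_t i = 0; i < n; ++i)
        total += in[i];
    return total;
}

int main() {
    float data[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    std::printf("sum = %f\n", sum(data, 4));
    return 0;
}
```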
Memory architectures need to adapt if software for multicore systems is to achieve performance and scalability. In this paper, we discuss the impact of techniques for scalable memory architectures, especially the use of multiple, non-cache-coherent memory spaces, on the implementation and performance of consumer software. Primarily, we report extensive real-world experience in this area gained by Codeplay Software Ltd., a software tools company working on compilers for video games and GPU software. We discuss the solutions we use to handle variations in memory architecture in consumer software, and the impact such variations have on software development effort and, consequently, development cost. This paper introduces preliminary findings regarding the impact on software, in advance of a larger-scale analysis planned over the next few years. The techniques discussed have been employed successfully in the development and optimisation of a shipping AAA cross-platform video game.
HPC systems are widely used for accelerating calculation-intensive irregular applications, e.g., molecular dynamics (MD) simulations, astrophysics applications, and irregular grid applications. As the scalability and complexity of current HPC systems keep growing, it is difficult to parallelize these applications efficiently due to irregular communication patterns, load imbalance, dynamic behaviour, and other issues. This paper presents a fine-granular programming scheme with which programmers can implement parallel scientific applications in a fine-granular, SPMD (single program, multiple data) fashion. Unlike current programming models, which start from the global data structure, this scheme provides a high-level, object-oriented programming interface that supports writing applications by focusing on the finest-granular elements and their interactions. Its implementation framework takes care of the implementation details, e.g., data partitioning, automatic EP aggregation, memory management, and data communication. Experimental results on SuperMUC show that the OOP implementations of multi-body and irregular applications have little overhead compared to manual implementations using C++ with OpenMP or MPI, while improving programming productivity in terms of source code size, coding method, and implementation difficulty.
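The abstract describes the interface only in general terms. The sketch below is a hypothetical C++ rendering of the idea for an N-body-style computation, focusing on a single finest-granular element and its pairwise interaction; the framework responsibilities (partitioning, aggregation, communication) are reduced to comments, and all names are our own, not the scheme's actual API.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical "finest granular element": the programmer describes one
// particle and how it interacts with another, rather than writing loops over
// a globally partitioned data structure.
struct Particle {
    double x, y, z;     // position
    double fx, fy, fz;  // accumulated force

    // Pairwise interaction written from the viewpoint of a single element.
    void interact(const Particle& other) {
        double dx = other.x - x, dy = other.y - y, dz = other.z - z;
        double r2 = dx * dx + dy * dy + dz * dz + 1e-9;  // softened distance
        double inv_r3 = 1.0 / (r2 * std::sqrt(r2));
        fx += dx * inv_r3;
        fy += dy * inv_r3;
        fz += dz * inv_r3;
    }
};

int main() {
    // In the scheme described above, the framework would partition the
    // elements, aggregate them into execution units, and exchange data
    // between nodes; here a plain sequential double loop stands in for that.
    std::vector<Particle> particles(1000);
    for (std::size_t i = 0; i < particles.size(); ++i)
        for (std::size_t j = 0; j < particles.size(); ++j)
            if (i != j) particles[i].interact(particles[j]);
    return 0;
}
```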
Compiler writers have developed various techniques, such as constant folding, subexpression elimination, loop transformation, and vectorization, to help compilers optimize code for performance. Yet they have been far less successful in developing techniques or cost models that compilers can rely on to simplify parallel programming and tune the performance of parallel applications automatically. This paper is the first part of a two-phase study to develop an analytical model for estimating the cost of sequential and parallel execution of array expressions on multicore architectures. While this paper discusses the possibility of developing a cost model for the sequential execution of array expressions on a single CPU, the second part of the investigation will focus on a model for parallel execution of arrays on multicore platforms. The model presented here is intended to be used by programming language compilers, as a complement to the second model, to estimate and subsequently decide whether or not to parallelize individual array expressions. The preliminary results presented here show that the model gives a satisfactory, high-precision estimate of the cost of executing a regular array expression on a single-core processor.
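As a concrete illustration of what such a cost estimate might look like (the paper's actual model and calibration are not reproduced here), the sketch below treats the sequential cost of an element-wise array expression as a per-element sum of memory and arithmetic costs scaled by the array length; the cost constants and names are hypothetical.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical per-element cost parameters for one CPU core (illustrative
// values only; a real model would calibrate these against the target machine).
struct CostParams {
    double load_cost;   // cost of reading one array element
    double store_cost;  // cost of writing one array element
    double op_cost;     // cost of one arithmetic operation
};

// Estimated sequential cost of evaluating an element-wise array expression
// over n elements: each element needs some number of loads, stores and
// arithmetic operations, so the total cost scales linearly with n.
double sequential_cost(std::size_t n, std::size_t loads_per_elem,
                       std::size_t stores_per_elem, std::size_t ops_per_elem,
                       const CostParams& p) {
    double per_element = loads_per_elem * p.load_cost +
                         stores_per_elem * p.store_cost +
                         ops_per_elem * p.op_cost;
    return n * per_element;
}

int main() {
    CostParams p = { 1.0, 1.0, 0.5 };
    // C = A * B + A: two loads (A, B), one store (C), two arithmetic ops.
    std::printf("estimated cost: %.1f\n", sequential_cost(1u << 20, 2, 1, 2, p));
    return 0;
}
```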
In this survey we give a short introduction to appearance-based object recognition. In general, one distinguishes between two different strategies, namely local and global approaches. Local approaches search for salient regions characterized by, e.g., corners, edges, or entropy. In a later stage, these regions are characterized by a suitable descriptor. For object recognition purposes, the local representations thus obtained for test images are compared to the representations of previously learned training images. In contrast, global approaches model the information of a whole image. In this report we give an overview of well-known and widely used region-of-interest detectors and descriptors (i.e., local approaches) as well as of the most important subspace methods (i.e., global approaches). Note that the discussion is restricted to methods that use only the gray-value information of an image.
Consistency of image edge filtering is of prime importance for 3D interpretation of image sequences using feature tracking algorithms. To cater for image regions containing texture and isolated features, a combined corner and edge detector based on the local auto-correlation function is utilised, and it is shown to perform with good consistency on natural imagery.
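The detector referred to here is usually presented in terms of the local auto-correlation (second-moment) matrix, from which a combined corner/edge response R = det(M) − k·trace(M)² is computed. The sketch below evaluates that response for a single window from pre-summed gradient products; it is our own minimal rendering of the idea, with illustrative input values, not code from the paper.

```cpp
#include <cstdio>

// Response of the combined corner/edge measure for one image window, given
// the windowed sums of gradient products Sxx = sum(Ix*Ix), Syy = sum(Iy*Iy),
// Sxy = sum(Ix*Iy). Large positive R indicates a corner, large negative R an
// edge, small |R| a flat region. k is conventionally around 0.04-0.06.
double corner_response(double Sxx, double Syy, double Sxy, double k = 0.04) {
    double det   = Sxx * Syy - Sxy * Sxy;   // determinant of the 2x2 matrix
    double trace = Sxx + Syy;               // trace of the 2x2 matrix
    return det - k * trace * trace;
}

int main() {
    // Illustrative values: strong gradients in both directions -> corner-like.
    std::printf("R = %f\n", corner_response(900.0, 850.0, 30.0));
    return 0;
}
```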
To improve the utilization of machine resources in superscalar processors, the instructions have to be carefully scheduled by the compiler. As internal parallelism and pipelining increase, it becomes evident that scheduling should be done beyond the basic block level. A scheme for global (intra-loop) scheduling is proposed, which uses the control and data dependence information summarized in a Program Dependence Graph to move instructions well beyond basic block boundaries. This novel scheduling framework is based on a parametric description of the machine architecture, which spans a range of superscalar and VLIW machines, and exploits speculative execution of instructions to further enhance the performance of general code. We have implemented our algorithms in the IBM XL family of compilers and have evaluated them on IBM RISC System/6000 machines.
This paper presents a novel software pipelining approach called Swing Modulo Scheduling (SMS). It generates schedules that are near-optimal in terms of initiation interval, register requirements, and stage count. Swing Modulo Scheduling is a heuristic approach with a low computational cost. The paper describes the technique and evaluates it on the Perfect Club benchmark suite. SMS is compared with other heuristic methods, showing that it outperforms them in terms of the quality of the obtained schedules and compilation time. SMS is also compared with an integer linear programming approach that generates optimum schedules but at a huge computational cost, which makes it feasible only for very small loops. For a set of small loops, SMS obtained the optimum initiation interval in all cases, and its schedules required only 5% more registers and a 1% higher stage count than the optimum.
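SMS itself is not reproduced here, but the quantity it targets, the initiation interval, has a standard resource-constrained lower bound (ResMII). The sketch below computes that bound from per-resource operation counts as a minimal illustration of the metric the scheduler tries to reach; the resource mix in main() is hypothetical, and this is not the SMS heuristic itself.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Resource-constrained lower bound on the initiation interval (ResMII):
// for each resource class, divide the number of loop-body operations that
// need it by the number of available units (rounding up), then take the max.
struct Resource {
    int ops_using;  // operations in the loop body that use this resource
    int units;      // available functional units of this kind
};

int res_mii(const std::vector<Resource>& resources) {
    int mii = 1;
    for (std::size_t i = 0; i < resources.size(); ++i) {
        int need = (resources[i].ops_using + resources[i].units - 1) /
                   resources[i].units;  // ceiling division
        if (need > mii) mii = need;
    }
    return mii;
}

int main() {
    // Hypothetical loop body: 6 ops on 2 FP units, 4 ops on 1 memory port.
    std::vector<Resource> r;
    r.push_back(Resource{6, 2});
    r.push_back(Resource{4, 1});
    std::printf("ResMII = %d\n", res_mii(r));  // max(ceil(6/2), ceil(4/1)) = 4
    return 0;
}
```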
Programs that are not loop-intensive and that have small basic blocks present a challenge to architectures that rely on instruction-level parallelism to deliver high performance. This paper presents a heuristic for moving operations across basic block boundaries based on profile information gathered from previous executions of a program. Architectural support is required for executing these operations speculatively. Performance improvements on a heapsort routine are discussed for a Very Long Instruction Word (VLIW) machine model, using two different sets of operation latencies. Experiments were done both at the assembly code level and on the intermediate representation produced by the compiler. The results show a speedup of about two over the original code after the global code motion heuristic is applied.
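The heuristic itself is not specified in this summary; the sketch below is our own simplified rendering of a profile-guided decision rule in that spirit: move an operation above a branch only if the profiled probability of reaching its home block is high enough and the operation is safe to execute early. The threshold, field names, and examples are assumptions for illustration.

```cpp
#include <cstdio>

// Simplified profile-guided speculation check (illustrative only): an
// operation living in a block reached with probability p may be moved above
// the branch if p exceeds a threshold and executing it early can cause no
// fault or visible side effect. The architectural support mentioned above
// would relax the no-fault requirement.
struct Operation {
    double reach_probability;  // profiled probability its block executes
    bool   may_fault;          // e.g. a load through a possibly bad pointer
    bool   has_side_effects;   // e.g. a store or a call
};

bool should_hoist(const Operation& op, double threshold = 0.8) {
    return op.reach_probability >= threshold &&
           !op.may_fault && !op.has_side_effects;
}

int main() {
    Operation add_op   = { 0.95, false, false };  // cheap ALU op on the hot path
    Operation store_op = { 0.95, false, true  };  // stores are never speculated here
    std::printf("hoist add:   %s\n", should_hoist(add_op)   ? "yes" : "no");
    std::printf("hoist store: %s\n", should_hoist(store_op) ? "yes" : "no");
    return 0;
}
```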