OpenMP-based parallelization on an MPCore multiprocessor platform – A performance and power analysis

Chair for Electrical Engineering and Computer Systems, RWTH Aachen University, Schinkelstraße 2, 52062 Aachen, Germany
Journal of Systems Architecture (Impact Factor: 0.44). 11/2008; 54(11):1019-1029. DOI: 10.1016/j.sysarc.2008.04.001
Source: DBLP


This contribution analyzes the potential of parallelized software that implements digital signal processing algorithms on a multicore processor platform. For this purpose, various digital signal processing tasks were implemented on a prototyping platform, an ARM MPCore featuring four ARM11 processor cores. To analyze the effect of parallelization on the resulting performance-power ratio, influencing parameters such as the number of issued program threads were studied. The OpenMP programming model, which can be applied efficiently at the C level, was used for parallelization. To develop power-efficient code, a functional- and instruction-level power model of the MPCore with high estimation accuracy was also derived. Using this power model and exploiting the capabilities of OpenMP, a variety of exemplary tasks could be parallelized efficiently. On this basis, the general efficiency potential of parallelization for multiprocessor architectures can be assessed.
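As a rough illustration of what C-level OpenMP parallelization of a signal processing task with a configurable thread count can look like, the following is a minimal sketch and not the authors' code. The function name `fir_filter`, its parameters, and the choice of an FIR filter as the example task are assumptions made for illustration; the abstract does not say which kernels were implemented.

```c
#include <stddef.h>

/*
 * Hypothetical FIR filter kernel (illustrative only, not from the paper).
 * Computes y[i] = sum_k h[k] * x[i - k], treating samples before x[0] as zero.
 * num_threads must be positive; it is the kind of parameter whose effect on
 * the performance-power ratio could be studied.
 */
void fir_filter(const float *x, float *y, size_t n,
                const float *h, size_t taps, int num_threads)
{
    long i;

    /* Each output sample is independent, so the outer loop can be split
       across threads. Without OpenMP support the pragma is ignored and the
       loop simply runs serially with the same result. */
    #pragma omp parallel for num_threads(num_threads) schedule(static)
    for (i = 0; i < (long)n; ++i) {
        float acc = 0.0f;
        size_t k;

        for (k = 0; k < taps && k <= (size_t)i; ++k)
            acc += h[k] * x[(size_t)i - k];

        y[i] = acc;
    }
}
```

On a four-core platform, calling this with `num_threads` set to 1, 2, and 4 and timing each run would be one simple way to observe how the thread count affects throughput.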


Available from: Jörg Brakensiek, Mar 14, 2014
  • Source
    • "Also a spinlock for longer duration can reduce the system performance. The authors discuss parallelization efforts using OpenMP-based high-level language on multiprocessor platforms like ARM MPCore in [22]. The authors exploit capabilities of signal processing algorithms and parallel computation capabilities. "
    ABSTRACT: The shift towards multicore architectures poses significant challenges for programmers. Unlike programming for single-core architectures, multicore architectures require the programmer to decide how the work is distributed across multiple processors. In this contribution, we analyze the requirements for a high-level programming model for multicore architectures. We use OpenMP as the high-level programming model to increase programmer productivity and to reduce time to market and development and design costs for these systems. In this work, we explore a medical ultrasound application using OpenMP on a TI-based Tomahawk platform, a six-core, high-performance multicore DSP system. The application relies heavily on image processing, and the goal is to achieve the desired level of image quality. We explore different cache configurations of the system and, in the process, study the performance impact of data locality when data objects are placed in different components of the Tomahawk memory system.
    Full-text · Conference Paper · Nov 2012
  • Source
    • "This trend has been adopted by several players of the consumer electronic industry in order to benefit from the properties of these models [5] [6] [7]. Other approaches are more loose, using subsets of well known parallel programming libraries for which the properties of the programming model is less clear: MPI [8], light versions of Corba [9], OpenMP [10] [11], or even bare shared memory threads."
    ABSTRACT: During the past few years, embedded digital systems have been requested to provide a huge amount of processing power and functionality. A very likely foreseeable step to pursue this computational and flexibility trend is the generalization of on-chip multiprocessor platforms (MPSoC). In that context, choosing a programming model and providing optimized hardware support to it on these platforms is a challenging task. To deal in a portable way with MPSoCs having a different number of processors running possibly at different frequencies, work-stealing (WS) based parallelization is a current research trend. The contribution of this paper is to evaluate the impact of some simple MPSoCs’ architecture characteristics on the performance of WS in the MPSoC context. The previous evaluations of WS, either theoretical or experimental, were done on fixed multicores architectures. This work extends these studies by exploring the use of WS for the codesign of embedded applications on MPSoC platforms with different hardware capabilities, thanks to cycle-accurate measures. We firstly study the architectural choices suited to WS algorithms and measure the benefit of these architectural modifications. To assert whether WS is suited to the MPSoC context, we experimentally measure its intrinsic implementation overhead on the most efficient architectural designs. Finally, we validate the performances of the approach on two real applications: a regular multimedia application (temporal noise reduction) and an irregular computation intensive application (frames of the Mandelbrot set). Our results show that enhancing MPSoC platforms having up to 16 processors with widespread hardware support mechanisms can lead to important performance improvements at acceptable hardware cost for the considered applications.
    Full-text · Article · Aug 2010 · Journal of Systems Architecture
  • Source
    ABSTRACT: The advent of multicore chips raises anew the question of how to write programs, which must now incorporate a high degree of parallelism. We address this question from two orthogonal points of view. The first is the work-stealing paradigm, for which we conduct a study that aims, on the one hand, to identify which simple architectural characteristics give the best performance for an implementation of this paradigm and, on the other hand, to show that the overhead relative to static parallelization is low while dynamic load balancing yields performance gains. The question is, however, mainly addressed through the transaction-based programming paradigm, where a transaction is a set of instructions that executes atomically from the point of view of the other cores. Supporting this abstraction requires implementing a so-called TM system, often complex, which can be software or hardware. The study first compares hardware TM systems based on different architectural choices (cache coherence protocol), and then examines the performance impact of several conflict-resolution policies, that is, the actions to take when two transactions try to access the same data simultaneously.
    Article