Conference Paper

Understanding performance, power and energy behavior in asymmetric multiprocessors

Sch. of Comput. Sci., Georgia Inst. of Technol., Atlanta, GA
DOI: 10.1109/ICCD.2008.4751903 Conference: 26th International Conference on Computer Design, ICCD 2008, 12-15 October 2008, Lake Tahoe, CA, USA, Proceedings
Source: IEEE Xplore


Multiprocessor architectures are becoming popular in both desktop and mobile processors. Among them, asymmetric architectures show promise for saving power and energy. However, the performance and energy-consumption behavior of asymmetric multiprocessors running desktop-oriented multithreaded applications has not been widely studied. In this study, we measure performance and power consumption on real 8- and 16-processor asymmetric and symmetric systems to understand the relationships between thread interactions and performance/power behavior. We find that when the workload is asymmetric, an asymmetric multiprocessor can save energy, but for most symmetric workloads, a symmetric multiprocessor (running at the highest clock frequency) consumes less energy.


Available from:
  • Source
    "In this paper we investigate the practical performance-power-energy balance of ARM's asymmetric big.LITTLE technology, employing as a case study the compute-intensive general matrix multiplication (gemm): C += A · B, where the sizes of A, B, C are respectively m × k, k × n, m × n. Most previous related work targets the parallelization of gemm on i) distributed-memory heterogeneous architectures (see [3] [2] and references therein); or ii) asymmetric multicores, but using trivial (unoptimized) implementations of gemm [11] [10]. Compared with these other efforts, our paper makes the following contributions: First, we leverage a static mapping of threads and propose a workload partitioning strategy for the BLIS implementation of gemm specifically tailored to the Exynos 5422 big.LITTLE architecture, a system-on-chip (SoC) featuring two processing clusters: an ARM Cortex-A15 quad core and a Cortex-A7 quad core. "
    ABSTRACT: Asymmetric processors have emerged as an appealing technology for severely energy-constrained environments, especially in the mobile market, where heterogeneity in applications is mainstream. In addition, given the growing interest in ultra-low-power architectures for high performance computing, this type of platform is also being investigated on the road toward energy-efficient high-performance scientific applications. In this paper, we propose a first step toward a complete implementation of the BLAS interface adapted to asymmetric ARM big.LITTLE processors, analyzing the trade-offs between performance and energy efficiency compared to existing homogeneous (symmetric) multi-threaded BLAS implementations. Our experimental results reveal important gains in performance while maintaining the energy efficiency of homogeneous solutions by efficiently exploiting all the resources of the asymmetric processor.
    Full-text · Article · Jul 2015
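
    The workload partitioning quoted above (splitting C += A · B across a fast and a slow cluster under a static thread mapping) can be sketched as follows. This is a minimal illustration, not the BLIS-based implementation from the paper: `gemm_partition` and its `big_ratio` parameter are hypothetical, with the split ratio standing in for the measured big/LITTLE throughput ratio.

    ```python
    import numpy as np
    from threading import Thread

    def gemm_partition(A, B, C, big_ratio=0.8):
        """Compute C += A @ B with a static column split between two workers.

        big_ratio is an assumed tuning knob: the fraction of C's columns
        given to the faster ("big") cluster; 0.8 would reflect big cores
        roughly 4x faster than the LITTLE cores on this kernel.
        """
        m, k = A.shape
        k2, n = B.shape
        assert k == k2, "inner dimensions must match"
        split = int(n * big_ratio)  # static workload partitioning point
        def worker(lo, hi):
            # Each worker updates a disjoint slab of C, so no locking is needed.
            if hi > lo:
                C[:, lo:hi] += A @ B[:, lo:hi]
        t_big = Thread(target=worker, args=(0, split))
        t_little = Thread(target=worker, args=(split, n))
        t_big.start(); t_little.start()
        t_big.join(); t_little.join()
    ```

    In a real big.LITTLE deployment each worker thread would additionally be pinned to its cluster (e.g. via CPU affinity) so the static split actually lands on the intended cores.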
  • Source
    ABSTRACT: For higher processing and computing power, chip multiprocessors (CMPs) have become the new mainstream architecture. This shift to CMPs has created many challenges for fully utilizing the power of multiple execution cores. One of these challenges is managing contention for shared resources. Most recent research addresses contention for shared resources by single-threaded applications. However, as CMPs scale up to many cores, application design has shifted toward multi-threaded programming and new parallel models that fully utilize the underlying hardware. Single- and multi-threaded applications contend for shared resources differently. Therefore, to develop approaches that reduce shared-resource contention for emerging multi-threaded applications, it is crucial to understand how their performance is affected by contention for a particular shared resource. In this research, we propose and evaluate a general methodology for characterizing multi-threaded applications by determining the effect of shared-resource contention on performance. To demonstrate the methodology, we characterize the applications in the widely used PARSEC benchmark suite for shared-memory resource contention. The characterization reveals several interesting aspects of the benchmark suite. Three of twelve PARSEC benchmarks exhibit no contention for cache resources. Nine of the benchmarks exhibit contention for the L2-cache. Of these nine, only three exhibit contention between their own threads; most contention is because of competition with a co-runner. Interestingly, contention for the Front Side Bus is a major factor for all but two of the benchmarks and degrades performance by more than 11%.
    Full-text · Conference Paper · May 2011
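
    The solo-versus-co-runner methodology described in this abstract reduces to a simple slowdown metric. A minimal sketch (the function name and sample timings below are invented for illustration):

    ```python
    def contention_degradation(t_solo, t_corun):
        # Fractional slowdown attributable to contention with a co-runner
        # for a shared resource (L2 cache, front-side bus, ...).
        return (t_corun - t_solo) / t_solo

    # Hypothetical example: 10.0 s alone vs. 11.2 s beside a co-runner
    # gives a 12% degradation, above the ~11% bus effect reported above.
    slowdown = contention_degradation(10.0, 11.2)
    ```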
  • Source
    ABSTRACT: With the shift to chip multiprocessors, managing shared resources has become a critical issue in realizing their full potential. Previous research has shown that thread mapping is a powerful tool for resource management. However, the difficulty of simultaneously managing multiple hardware resources and the varying nature of the workloads have impeded the efficiency of thread mapping algorithms. To overcome these difficulties, the interaction between various microarchitectural resources and thread characteristics must be well understood. This paper presents an in-depth analysis of PARSEC benchmarks running under different thread mappings to investigate how various thread mappings interact with microarchitectural resources including the L1 I/D-caches, I/D TLBs, L2 caches, hardware prefetchers, off-chip memory interconnects, branch predictors, memory disambiguation units and the cores. For each resource, the analysis provides guidelines for improving its utilization when mapping threads with different characteristics. We also analyze how the relative importance of the resources varies depending on the workload. Our experiments show that when only memory resources are considered, thread mapping improves an application's performance by as much as 14% over the default Linux scheduler. In contrast, when both memory and processor resources are considered, the mapping algorithm achieves performance improvements of as much as 28%. Additionally, we demonstrate that thread mapping should consider L2 caches, prefetchers and off-chip memory interconnects as one resource, and we present a new metric called L2-misses-memory-latency-product (L2MP) for evaluating their aggregated performance impact.
    Full-text · Article · Jan 2012
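
    The L2MP metric introduced above is the product of the L2 miss count and the average memory latency, used to compare candidate thread mappings. A small sketch with invented profile numbers (the mapping names and counter values are assumptions, not data from the paper):

    ```python
    def l2mp(l2_misses, avg_mem_latency_cycles):
        # L2-misses-memory-latency-product: treats the L2 caches, prefetchers
        # and off-chip memory interconnect as one aggregate resource.
        return l2_misses * avg_mem_latency_cycles

    # Hypothetical hardware-counter readings for three candidate mappings:
    candidates = {
        "spread": l2mp(4.0e6, 180),  # threads spread across both packages
        "packed": l2mp(9.5e6, 150),  # threads packed onto one package
        "hybrid": l2mp(5.5e6, 160),
    }
    best = min(candidates, key=candidates.get)  # least aggregate memory pressure
    ```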