Conference Paper

Towards Data Tiling for Whole Programs in Scratchpad Memory Allocation.

DOI: 10.1007/978-3-540-74309-5_8
Conference: Advances in Computer Systems Architecture, 12th Asia-Pacific Conference, ACSAC 2007, Seoul, Korea, August 23-25, 2007, Proceedings
Source: DBLP

ABSTRACT Data tiling is an array layout transformation technique that partitions an array into smaller subarray blocks. It was originally
proposed to improve the cache performance of regular loops. Recently, researchers have applied this technique to scratchpad
memory (SPM) allocation. Arrays whose sizes exceed a given SPM size can be tiled, that is, divided into smaller subarray blocks or
tiles, and program performance can be significantly improved by placing these smaller tiles in SPM. Existing data
tiling techniques are applicable to regularly-accessed arrays in individual loop nests. In embedded applications, arrays are
often accessed in multiple loop nests via possibly aliased pointers. Tiling arrays in a loop nest alone will often affect
the tiling and allocation decisions for arrays accessed in other loop nests. Moreover, tiling arrays accessed via aliased
pointers is difficult since their access patterns are unknown at compile time. This paper presents a new data tiling approach
to address these practical issues. We perform alias profiling to detect the most likely memory access patterns and use an
ILP solver to select the best tiling schemes for all loop nests in the program as a whole. We have integrated data tiling
in an existing SPM allocation framework. Our preliminary experimental results show that our approach can significantly improve
the performance of a set of programs selected from the Mediabench suite.
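
The basic idea of data tiling described in the abstract can be illustrated with a minimal sketch. It is not the paper's method: it involves no alias profiling and no ILP solver, and the names `tile_ranges`, `sum_of_squares_tiled`, and `spm_capacity` are hypothetical. A Python list slice stands in for copying one tile from off-chip memory into the SPM.

```python
def tile_ranges(array_len, tile_len):
    """Yield (start, stop) index pairs that partition range(array_len) into tiles."""
    if tile_len <= 0:
        raise ValueError("tile_len must be positive")
    for start in range(0, array_len, tile_len):
        # The last tile may be shorter than tile_len.
        yield start, min(start + tile_len, array_len)


def sum_of_squares_tiled(data, spm_capacity):
    """Process `data` one tile at a time; each tile fits in `spm_capacity` elements."""
    total = 0
    for start, stop in tile_ranges(len(data), spm_capacity):
        # Assumption: this slice models the transfer of one tile into the SPM.
        spm = data[start:stop]
        # All accesses in the inner loop touch only the small "SPM" copy.
        for x in spm:
            total += x * x
    return total
```

For example, an array of 10 elements with an assumed SPM capacity of 4 elements is split into tiles covering indices 0-3, 4-7, and 8-9, and the tiled loop computes the same result as an untiled one.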

  • ABSTRACT: A new compilation framework enables the execution of numerical-intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases.
    Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2010, Pittsburgh, Pennsylvania, USA, March 14, 2010; 01/2010
  • ABSTRACT: Scratch-pad memory (SPM) is widely used in embedded systems. Reducing power consumption in SPM systems is a topical and crucial problem, since high power consumption can reduce system reliability and increase the cost and size of heat sinks. In this paper, we propose an effective power-reduction approach that scales down voltage and frequency as much as possible. First, we pipeline data transfer and processing. Second, we find the relative time slack between fast data processing and slow data transfer, and then provide both single and dynamic scaling to reduce power consumption. We evaluate our approach on the Trimaran simulator, and the experimental results show that it achieves a significant power reduction while its run-time performance exceeds that of previous work.
    2011 International Conference on Parallel Processing Workshops, ICPPW 2011, Taipei, Taiwan, Sept. 13-16, 2011; 01/2011
  • ABSTRACT: Existing scratchpad memory (SPM) allocation algorithms for arrays, whether they rely on well-crafted heuristics or resort to integer linear programming (ILP) techniques, typically assume that every array is small enough to fit directly into the SPM. As a result, some arrays have to be spilled entirely to the off-chip memory in order to make room for other arrays to stay in the SPM, sometimes resulting in poor SPM utilization. In this paper, we introduce a new comparability graph coloring allocator that integrates, for the first time, data tiling and SPM allocation for arrays by tiling arrays on demand to improve utilization of the SPM. The novelty lies in repeatedly identifying the heaviest path in an array interference graph and then reducing its weight by tiling certain arrays on the path appropriately with respect to the size of the SPM. The effectiveness of our allocator, which is presently restricted to tiling 1-D arrays, is validated using a number of selected benchmarks for which existing allocators are ineffective.
    Proceedings of the 2010 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2010, Scottsdale, AZ, USA, October 24-29, 2010; 01/2010