Conference Paper

Towards Data Tiling for Whole Programs in Scratchpad Memory Allocation

DOI: 10.1007/978-3-540-74309-5_8 Conference: Advances in Computer Systems Architecture, 12th Asia-Pacific Conference, ACSAC 2007, Seoul, Korea, August 23-25, 2007, Proceedings
Source: DBLP


Data tiling is an array layout transformation technique that partitions an array into smaller subarray blocks. It was originally
proposed to improve the cache performance of regular loops. Recently, researchers have applied this technique to scratchpad
memory (SPM) allocation. Arrays whose sizes exceed a given SPM size can be tiled, i.e., divided into smaller subarray blocks (tiles),
and program performance can be significantly improved by placing these tiles in SPM. Existing data
tiling techniques are applicable to regularly-accessed arrays in individual loop nests. In embedded applications, arrays are
often accessed in multiple loop nests via possibly aliased pointers. Tiling arrays in a loop nest alone will often affect
the tiling and allocation decisions for arrays accessed in other loop nests. Moreover, tiling arrays accessed via aliased
pointers is difficult since their access patterns are unknown at compile time. This paper presents a new data tiling approach
to address these practical issues. We perform alias profiling to detect the most likely memory access patterns and use an
ILP solver to select the best tiling schemes for all loop nests in the program as a whole. We have integrated data tiling
in an existing SPM allocation framework. Our preliminary experimental results show that our approach can significantly improve
the performance of a set of programs selected from the Mediabench suite.

Cited By

  • "Kandemir et al. [11], Zhang and Kurdahi [23] and Li et al. [16] apply data tiling [12] to improve utilization of SPM. However, the methods described in [11] [23] are restricted to the matrix multiplication kernel only while [16] relies on ILP to find optimal tile sizes to tile user-specified arrays in an ILP-based allocator. Fabri [6] discovered the connection between interval coloring and compile-time memory allocation."
    ABSTRACT: Existing scratchpad memory (SPM) allocation algorithms for arrays, whether they rely on well-crafted heuristics or resort to integer linear programming (ILP) techniques, typically assume that every array is small enough to fit directly into the SPM. As a result, some arrays have to be spilled entirely to the off-chip memory in order to make room for other arrays to stay in the SPM, resulting in sometimes poor SPM utilization. In this paper, we introduce a new comparability graph coloring allocator that integrates for the first time data tiling and SPM allocation for arrays by tiling arrays on-demand to improve utilization of the SPM. The novelty lies in repeatedly identifying the heaviest path in an array interference graph and then reducing its weight by tiling certain arrays on the path appropriately with respect to the size of the SPM. The effectiveness of our allocator, which is presently restricted to tiling 1-D arrays, is validated by using a number of selected benchmarks for which existing allocators are ineffective.
    Proceedings of the 2010 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, CASES 2010, Scottsdale, AZ, USA, October 24-29, 2010; 01/2010
  • ABSTRACT: A new compilation framework enables the execution of numerically intensive applications, written in Python, on a hybrid execution environment formed by a CPU and a GPU. This compiler automatically computes the set of memory locations that need to be transferred to the GPU, and produces the correct mapping between the CPU and the GPU address spaces. Thus, the programming model implements a virtual shared address space. This framework is implemented as a combination of unPython, an ahead-of-time compiler from Python/NumPy to the C programming language, and jit4GPU, a just-in-time compiler from C to the AMD CAL interface. Experimental evaluation demonstrates that for some benchmarks the generated GPU code is 50 times faster than generated OpenMP code. The GPU performance also compares favorably with optimized CPU BLAS code for single-precision computations in most cases.
    Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU 2010, Pittsburgh, Pennsylvania, USA, March 14, 2010; 01/2010
  • ABSTRACT: In this paper, we propose an effective data pipelining technique, SPDP (scratch-pad data pipelining), for dynamic scratch-pad memory (SPM) management with DMA (Direct Memory Access). In SPDP, we group multiple iterations of a loop into a block for SPM allocation, and implement a data pipeline by overlapping the execution of CPU instructions and DMA operations. We have implemented our SPDP technique in the IMPACT compiler, and conduct experiments using a set of benchmarks from DSPstone, MiBench and Mediabench on the cycle-accurate VLIW simulator of Trimaran. The experimental results show that our technique achieves significant performance improvement compared with the previous work.
    Concurrency and Computation: Practice and Experience 09/2010; 22(13):1874-1892. DOI:10.1109/CSE.2009.295
