M. Hall

University of Southern California, Los Angeles, CA, USA

Publications (5)

  • Conference Proceeding: Model-Guided Empirical Optimization for Multimedia Extension Architectures: A Case Study
    ABSTRACT: Compiler technology for multimedia extensions must effectively utilize not only the SIMD compute engines but also the various levels of the memory hierarchy: superword registers, multi-level caches and TLB. In this paper, we describe a compiler that combines optimization across all levels of the memory hierarchy with automatic generation of SIMD code for multimedia extensions. At the high-level, model-guided empirical optimization is used to transform code to optimize for all levels of the memory hierarchy. This compiler interacts with a backend compiler exploiting superword-level parallelism that takes sequential code as input and produces SIMD code. This paper discusses how we have combined these technologies into a single framework. Through a case study with matrix multiply, we observe performance results that outperform the hand-tuned Intel MKL library, and achieve performance that is within 4% of the ATLAS self-tuning library with architectural defaults and more than 4X faster than the native Intel compiler.
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International; 01/2007
  • Conference Proceeding: An overview of the ECO project
    ABSTRACT: In this paper, we describe a compilation system that automates much of the process of performance tuning that is currently done manually by application programmers interested in high performance. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. The overall approach can be employed to alleviate some of the performance problems that lead to inefficiencies in key applications today: register pressure, cache conflict misses, and the trade-off between synchronization, parallelism and locality in SMPs. The main focus of the paper is an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. We have developed an initial compiler implementation, and present automatically-generated results on matrix multiply. Results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve comparable performance as the ATLAS self-tuning library and the hand-tuned vendor BLAS library. This paper describes other components of the ECO system, including supporting tools and experiments with programmer-guided performance tuning. This approach has provided a foundation for a general framework for systematic optimization of domain-specific applications. Specifically, we are developing an optimization system for signal and image processing that exploits signal properties, and we are using machine learning and a knowledge-rich representation to optimize molecular dynamics simulation.
    Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International; 05/2006
  • Conference Proceeding: Superword-level parallelism in the presence of control flow
    J. Shin, M. Hall, J. Chame
    ABSTRACT: In this paper, we describe how to extend the concept of superword-level parallelization (SLP), used for multimedia extension architectures, so that it can be applied in the presence of control flow constructs. Superword-level parallelization involves identifying scalar instructions in a large basic block that perform the same operation, and, if dependences do not prevent it, combining them into a superword operation on a multi-word object. A key insight is that we can use techniques related to optimizations for architectures supporting predicated execution, even for multimedia ISAs that do not provide hardware predication. We derive large basic blocks with predicated instructions to which SLP can be applied. We describe how to minimize overheads for superword predicates and re-introduce control flow for scalar operations. We discuss other extensions to SLP to address common features of real multimedia codes. We present automatically-generated performance results on 8 multimedia codes to demonstrate the power of this approach. We observe speedups ranging from 1.97X to 15.07X as compared to both sequential execution and SLP alone.
    Code Generation and Optimization, 2005. CGO 2005. International Symposium on; 04/2005
  • Conference Proceeding: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy
    C. Chen, J. Chame, M. Hall
    ABSTRACT: This paper describes an algorithm for simultaneously optimizing across multiple levels of the memory hierarchy for dense-matrix computations. Our approach combines compiler models and heuristics with guided empirical search to take advantage of their complementary strengths. The models and heuristics limit the search to a small number of candidate implementations, and the empirical results provide the most accurate information to the compiler to select among candidates and tune optimization parameter values. We have developed an initial implementation and applied this approach to two case studies, matrix multiply and Jacobi relaxation. For matrix multiply, our results on two architectures, SGI R10000 and Sun UltraSparc IIe, outperform the native compiler, and either outperform or achieve comparable performance as the ATLAS self-tuning library and the hand-tuned vendor BLAS library. Jacobi results also substantially outperform the native compilers.
    Code Generation and Optimization, 2005. CGO 2005. International Symposium on; 04/2005
  • Conference Proceeding: Mapping Irregular Applications to DIVA, a PIM-based Data-Intensive Architecture
    ABSTRACT: Processing-in-memory (PIM) chips that integrate processor logic into memory devices offer a new opportunity for bridging the growing gap between processor and memory speeds, especially for applications with high memory-bandwidth requirements. The Data-IntensiVe Architecture (DIVA) system combines PIM memories with one or more external host processors and a PIM-to-PIM interconnect. DIVA increases memory bandwidth through two mechanisms: (1) performing selected computation in memory, reducing the quantity of data transferred across the processor-memory interface; and (2) providing communication mechanisms called parcels for moving both data and computation throughout memory, further bypassing the processor-memory bus. DIVA uniquely supports acceleration of important irregular applications, including sparse-matrix and pointer-based computations. In this paper, we focus on several aspects of DIVA designed to effectively support such computations at very high performance levels: (1) the memory model and parcel definitions; (2) the PIM-to-PIM interconnect; and, (3) requirements for the processor-to-memory interface. We demonstrate the potential of PIM-based architectures in accelerating the performance of three irregular computations, sparse conjugate gradient, a natural-join database operation and an object-oriented database query.
    Supercomputing, ACM/IEEE 1999 Conference; 12/1999
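
Several of the abstracts above describe combining a compiler model with guided empirical search: the model limits the search to a few candidate implementations, and timing runs select among them. The following is only a minimal sketch of that general idea for tiled matrix multiply, not the authors' algorithm or the ECO implementation. The function names, the power-of-two candidate set, and the "three tiles must fit in cache" footprint rule are all assumptions made for illustration.

```python
import time


def tiled_matmul(a, b, n, tile):
    """Multiply two n x n matrices (lists of lists) using square tiles of size `tile`."""
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                for i in range(ii, min(ii + tile, n)):
                    row_a, row_c = a[i], c[i]
                    for k in range(kk, min(kk + tile, n)):
                        aik, row_b = row_a[k], b[k]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += aik * row_b[j]
    return c


def model_candidates(n, cache_bytes, elem_bytes=8):
    """Model step (hypothetical rule): keep power-of-two tile sizes
    whose three tiles (one each of A, B, C) fit in the given cache size."""
    candidates = []
    tile = 1
    while tile <= n:
        if 3 * tile * tile * elem_bytes <= cache_bytes:
            candidates.append(tile)
        tile *= 2
    return candidates


def empirical_search(n, cache_bytes):
    """Empirical step: time each model-approved candidate and keep the fastest."""
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i - j) for j in range(n)] for i in range(n)]
    timings = {}
    for tile in model_candidates(n, cache_bytes):
        start = time.perf_counter()
        tiled_matmul(a, b, n, tile)
        timings[tile] = time.perf_counter() - start
    best = min(timings, key=timings.get)
    return best, timings
```

Calling `empirical_search(64, 4096)` times only the tile sizes the model admits (1, 2, 4 and 8 under the assumed footprint rule) rather than every size up to 64. Which tile wins depends on the machine, so no particular result is implied here.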