Henk Corporaal

Technische Universiteit Eindhoven, Eindhoven, North Brabant, Netherlands

Are you Henk Corporaal?

Claim your profile

Publications (269)30.41 Total impact

  • Dongrui She, Yifan He, Luc Waeijen, Henk Corporaal
    [Show abstract] [Hide abstract]
    ABSTRACT: Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.
    Journal of Signal Processing Systems 07/2015; 80(1). DOI:10.1007/s11265-014-0957-1 · 0.56 Impact Factor
  • Luc Waeijen, Dongrui She, Henk Corporaal, Yifan He
    [Show abstract] [Hide abstract]
    ABSTRACT: Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.
    Journal of Signal Processing Systems 07/2015; 80(1). DOI:10.1007/s11265-014-0950-8 · 0.56 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or loop tiling. This work makes a case for locality-aware thread scheduling: re-ordering threads automatically for better locality to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering among others cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies the potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. A case-study of a naive matrix multiplication shows for example a 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
    7th International Workshop on Multi-/Many-Core Computing Systems (MuCoCoS); 08/2014
  • [Show abstract] [Hide abstract]
    ABSTRACT: Many applications in important domains, such as communication, multimedia, etc. show a significant data-level parallelism (DLP). A large part of the DLP is usually exploited through application vectorization and implementation of vector operations in processors executing the applications. While the amount of DLP varies between applications of the same domain or even within a single application, processor architectures usually support a single vector width. This may not be optimal and may cause a substantial energy and performance inefficiency. Therefore, an adequate more sophisticated exploitation of DLP is highly relevant. This paper studies the construction and exploitation of VLIW ASIPs with multiple vector widths.
    MECO 2014 - 3rd Mediterranean Conference on Embedded Computing, Budva, Montenegro; 06/2014
  • Roel Jordans, Lech Jozwiak, Henk Corporaal
    [Show abstract] [Hide abstract]
    ABSTRACT: Genetic algorithms are commonly used for automatically solving complex design problem because exploration using genetic algorithms can consistently deliver good results when the algorithm is given a long enough run-time. However, the exploration time for problems with huge design spaces can be very long, often making exploration using a genetic algorithm practically infeasible. In this work, we present a genetic algorithm for exploring the instruction-set architecture of VLIW ASIPs and demonstrate its effectiveness by comparing it to two heuristic algorithms. We present several optimizations to the genetic algorithm configuration, and demonstrate how caching of intermediate compilation and simulation results can reduce the exploration time by an order of magnitude.
    MECO 2014 - 3rd Mediterranean Conference on Embedded Computing, Budva, Montenegro; 06/2014
  • Luc Waeijen, Dongrui She, Henk Corporaal, Yifan He
    [Show abstract] [Hide abstract]
    ABSTRACT: It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. There are many applications that require reduction such as: partial histogram merging, matrix multiplication and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach has much better generality, flexibility and no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
  • Firew Siyoum, Marc Geilen, Henk Corporaal
  • [Show abstract] [Hide abstract]
    ABSTRACT: Numerous applications in important domains, such as communication, multimedia, etc. show a significant data-level parallelism (DLP). A large part of the DLP is usually exploited through application vectorization and implementation of vector operations in processors executing the applications. While the amount of DLP varies between applications of the same domain or even within a single application, processor architectures usually support a single vector width. This may not be optimal and may cause a substantial energy inefficiency. Therefore, an adequate more sophisticated exploitation of DLP is highly relevant. This paper proposes the use of heterogeneous vector widths and a method to explore the heterogeneous vector widths for VLIW ASIPs. In our context, heterogeneity corresponds to the usage of two or more different vector widths in a single ASIP. After a brief explanation of the target ASIP architecture model, the paper describes the vector-width exploration method and explains the associated design automation tools. Subsequently, experimental results are discussed.
    Microprocessors and Microsystems 05/2014; 38(8). DOI:10.1016/j.micpro.2014.05.004 · 0.60 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: In this paper we introduce and discuss the BuildMaster framework. This framework supports the design space exploration of application specific VLIW processors and offers automated caching of intermediate compilation and simulation results. Both the compilation and the simulation cache can greatly help to shorten the exploration time and make it possible to use more realistic data for the evaluation of selected designs. In each of the experiments we performed, we were able to reduce the number of required simulations with over 90% and save up to 50% on the required compilation time.
    DDECS 2014 - 17th International Symposium on Design and Diagnostics of Electronic Circuits and Systems, Warsaw, Poland; 04/2014
  • [Show abstract] [Hide abstract]
    ABSTRACT: As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack distance or reuse distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU’s hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding-registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and even more accurate compared to the GPGPU-Sim simulator
    High Performance Computer Architecture (HPCA), Orlando, FL, USA; 02/2014
  • [Show abstract] [Hide abstract]
    ABSTRACT: Instruction Set Customization is a well-known technique to enhance the performance and efficiency of Application-Specific Processors (ASIPs). An extensive application profiling can indicate which parts of a given application, or class of applications, are most frequently executed, enabling the implementation of such frequently executed parts in hardware as custom instructions. However, a naive ad hoc instruction set customization process may identify and select poor instruction extension candidates, which may not result in a significantly improved performance with low circuit-area and energy footprints. In this paper we propose and discuss an efficient instruction set customization method and automatic tool, which exploit the maximal common subgraphs (common operation patterns) of the most frequently executed basic blocks of a given application. The speed results from our tool for a VLIW ASIP are provided for a set of benchmark applications. The average execution time reduction ranges from 30% to 40%, with only a few custom instructions.
    2014 IEEE 5th Latin American Symposium on Circuits and Systems (LASCAS); 02/2014
  • [Show abstract] [Hide abstract]
    ABSTRACT: Graphics processing units (GPUs) are becoming increasingly popular for compute workloads, mainly because of their large number of processing elements and high-bandwidth to off-chip memory. The roofline model captures the ratio between the two (the compute-memory ratio), an important architectural parameter. This work proposes to change the compute-memory ratio dynamically, scaling the voltage and frequency (DVFS) of 1) memory for compute-intensive workloads and 2) processing elements for memory-intensive workloads. The result is an adaptive roofline-aware GPU that increases energy efficiency (up to 58%) while maintaining performance.
    International Workshop on Adaptive Self-tuning Computing Systems, Vienna, Austria; 01/2014
  • Orlando Moreira, Henk Corporaal
    Embedded Systens edited by Peter Marwedel, 01/2014; Springer., ISBN: 978-3-319-01245-2
  • [Show abstract] [Hide abstract]
    ABSTRACT: In smart environments, the embedded sensing systems should intelligently adapt to the behavior of the users. Many interesting types of behavior are characterized by repetition of actions such as certain activities or movements. A generic methodology to detect and classify repetitions that may occur at different scales is introduced in this paper. The proposed method is called Action History Matrices (AHM). The properties of AHM for detecting repetitive movement behavior are demonstrated in analyzing four customer behavior classes in a shop environment observed by multiple uncalibrated cameras. Two different datasets, video recordings in the shop environment and motion path simulations, are created and used in the experiments. The AHM-based system achieves an accuracy of 97% with most suitable scale and naive Bayesian classifier on the single-view simulated movement data. In addition, the performance of two fusion levels and three fusion methods are compared with AHM method on the multi-view recordings. In our results, fusion at the decision-level offers consistently better accuracy than feature-level, and the coverage-based view-selection fusion method (51%) marginally outperforms the majority method. The upper limit with the recorded data for accuracy by view-selection is found to be 75% .
    Information Fusion 01/2014; DOI:10.1016/j.inffus.2014.01.010 · 3.47 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or the extension of reduced instruction-set architectures with complex custom operations. This paper focuses on very long instruction word (VLIW) architectures and, more specifically, on automating the selection of an application specific VLIW issue-width. The issue-width selection strongly influences all the important processor properties (e.g. processing speed, silicon area, and power consumption). Therefore, an accurate and efficient issue-width estimation and optimization are some of the most important aspects of VLIW ASIP design. In this paper, we first compare different methods for the estimation of required the issue-width, and subsequently introduce a new force-based parallelism estimation method which is capable of estimating the required issue-width with only 3% error on average. Furthermore, we present and compare two techniques for estimating the required issue-width of software pipelined loop kernels and show that a simple utilization-based measure provides an error margin of less than 1% on average.
  • Yifan He, Dongrui She, Sander Stuijk, Henk Corporaal
    [Show abstract] [Hide abstract]
    ABSTRACT: Streaming applications are an important class of applications in emerging embedded systems such as smart camera network, unmanned vehicles, and industrial printing. These applications are usually very computationally intensive and have real-time constraints. To meet the increasing demand for performance and efficiency in these applications, the use of application specific IP cores in heterogeneous Multi-Processor System-on-Chips (MPSoCs) becomes inevitable. However, two of the key challenges in integrating these IP cores into MPSoCs are (i) how to properly handle inter-core communication; (ii) how to map streaming applications in an efficient and predictable way. In this paper, we first present a predictable high-performance communication assist (CA) that helps to tackle these design challenges. The proposed CA has zero throughput overhead, negligible latency overhead, and significantly less resource usage compared to existing CA designs. The proposed CA also provides a unified abstract interface for both processors and accelerator IP cores with flexible data access support. Based on the proposed CA design, we present a predictable heterogeneous multi-processor platform template for streaming applications. The template is used in a predictable design flow that uses Synchronous Data Flow (SDF) graphs for design time analysis. An accurate SDF model of our CA is introduced, enabling the mapping of applications onto heterogeneous MPSoCs in an efficient and predictable way. As a case study, we map the complete high-speed vision processing pipeline of an industrial application, Organic Light Emitting Diode (OLED) screen printing, onto one instance of the proposed platform. The result demonstrates that system design and analysis effort is greatly reduced with the proposed CA-based design flow.
    Journal of Systems Architecture 11/2013; 59(10):878–888. DOI:10.1016/j.sysarc.2013.04.005 · 0.69 Impact Factor
  • Source
    [Show abstract] [Hide abstract]
    ABSTRACT: GPUs are increasingly used as compute accelerators. With a large number of cores executing an even larger number of threads, significant speed-ups can be attained for parallel workloads. Applications that rely on atomic operations, such as histogram and Hough transform, suffer from serialization of threads in case they update the same memory location. Previous work shows that reducing this serialization with software techniques can increase performance by an order of magnitude. We observe, however, that some serialization remains and still slows down these applications. Therefore, this paper proposes to use a hash function in both the addressing of the banks and the locks of the scratchpad memory. To measure the effects of these changes, we first implement a detailed model of atomic operations on scratchpad memory in GPGPU-Sim, and verify its correctness. Second, we test our proposed hardware changes. They result in a speed-up up to 4.9× and 1.8× on implementations utilizing the aforementioned software techniques for histogram and Hough transform applications respectively, with minimum hardware costs.
    Computer Design (ICCD), 2013 IEEE 31st International Conference on, Asheville, NC, USA; 10/2013
  • [Show abstract] [Hide abstract]
    ABSTRACT: Synchronous dataflow graphs (SDFGs) are used extensively to model streaming applications. An SDFG can be extended with scheduling decisions, allowing SDFG analysis to obtain properties, such as throughput or buffer sizes for the scheduled graphs. Analysis times depend strongly on the size of the SDFG. SDFGs can be statically scheduled using static-order schedules. The only generally applicable technique to model a static-order schedule in an SDFG is to convert it to a homogeneous SDFG (HSDFG). This may lead to an exponential increase in the size of the graph and to suboptimal analysis results (e.g., for buffer sizes in multiprocessors). We present techniques to model two types of static-order schedules, i.e., periodic schedules and periodic single appearance schedules, directly in an SDFG. Experiments show that both techniques produce more compact graphs compared to the technique that relies on a conversion to an HSDFG. This results in reduced analysis times for performance properties and tighter resource requirements.
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 10/2013; 32(10):1495-1508. DOI:10.1109/TCAD.2013.2265852 · 1.20 Impact Factor
  • [Show abstract] [Hide abstract]
    ABSTRACT: In the near future, cameras will be used everywhere as flexible sensors for numerous applications. For mobility and privacy reasons, the required image processing should be local on embedded computer platforms with performance requirements and energy constraints. Dedicated acceleration of Convolutional Neural Networks (CNN) can achieve these targets with enough flexibility to perform multiple vision tasks. A challenging problem for the design of efficient accelerators is the limited amount of external memory bandwidth. We show that the effects of the memory bottleneck can be reduced by a flexible memory hierarchy that supports the complex data access patterns in CNN workload. The efficiency of the on-chip memories is maximized by our scheduler that uses tiling to optimize for data locality. Our design flow ensures that on-chip memory size is minimized, which reduces area and energy usage. The design flow is evaluated by a High Level Synthesis implementation on a Virtex 6 FPGA board. Compared to accelerators with standard scratchpad memories the FPGA resources can be reduced up to 13× while maintaining the same performance. Alternatively, when the same amount of FPGA resources is used our accelerators are up to 11× faster.
    2013 IEEE 31st International Conference on Computer Design (ICCD); 10/2013
  • [Show abstract] [Hide abstract]
    ABSTRACT: In recent years accurate algorithms for detecting objects in images have been developed. Among these algorithms, the object detection scheme proposed by Viola and Jones gained great popularity, especially after the release of high-quality face classifiers by the OpenCV group. However, as any other sliding-window based object detector, it is affected by a strong increase in the computational cost as the size of the scene grows. Especially in real-time applications, a search strategy based on a sliding window can be computationally too expensive. In this paper, we propose an efficient approach to adapt at run time the sliding window step size in order to speed-up the detection task without compromising the accuracy. We demonstrate the effectiveness of the proposed Run-time Adaptive Sliding Window (RASW) in improving the performance of Viola-Jones object detection by providing better throughput-accuracy tradeoffs. When comparing our approach with the OpenCV face detection implementation, we obtain up to 2.03x speedup in frames per second without any loss in accuracy.
    2013 Seventh International Conference on Distributed Smart Cameras (ICDSC); 10/2013

Publication Stats

2k Citations
30.41 Total Impact Points


  • 2003–2015
    • Technische Universiteit Eindhoven
      • • Department of Applied Physics
      • • Department of Electrical Engineering
      • • Embedded Systems Institute (ESI)
      Eindhoven, North Brabant, Netherlands
  • 2013
    • University of Cordoba (Spain)
      • Department of Computers Architecture and Technology, Electronics and Electric Technology
      Cordoue, Andalusia, Spain
  • 2008
    • NXP Semiconductors
      Eindhoven, North Brabant, Netherlands
  • 2004–2006
    • imec Belgium
      • Smart Systems and Energy Technology
      Louvain, Flemish, Belgium
  • 1900–2006
    • Delft University of Technology
      • Information- and Communication Technology Section
      Delft, South Holland, Netherlands