Henk Corporaal

Technische Universiteit Eindhoven, Eindhoven, North Brabant, Netherlands

Publications (251) · 7.74 Total Impact Points

  • ABSTRACT: Many applications in important domains, such as communication and multimedia, exhibit significant data-level parallelism (DLP). A large part of this DLP is usually exploited through application vectorization and the implementation of vector operations in the processors executing the applications. While the amount of DLP varies between applications of the same domain, or even within a single application, processor architectures usually support only a single vector width. This may not be optimal and can cause substantial energy and performance inefficiency. Therefore, a more sophisticated exploitation of DLP is highly relevant. This paper studies the construction and exploitation of VLIW ASIPs with multiple vector widths.
    MECO 2014 - 3rd Mediterranean Conference on Embedded Computing, Budva, Montenegro; 06/2014
  • Roel Jordans, Lech Jozwiak, Henk Corporaal
    ABSTRACT: Genetic algorithms are commonly used to automatically solve complex design problems, because exploration with a genetic algorithm can consistently deliver good results when given a long enough run-time. However, the exploration time for problems with huge design spaces can be very long, often making exploration with a genetic algorithm practically infeasible. In this work, we present a genetic algorithm for exploring the instruction-set architecture of VLIW ASIPs and demonstrate its effectiveness by comparing it to two heuristic algorithms. We present several optimizations of the genetic algorithm configuration, and demonstrate how caching of intermediate compilation and simulation results can reduce the exploration time by an order of magnitude.
    MECO 2014 - 3rd Mediterranean Conference on Embedded Computing, Budva, Montenegro; 06/2014
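    A minimal, self-contained sketch of the kind of exploration loop described in the entry above: a genetic algorithm over an integer-vector encoding of the architecture, with a cache that memoises fitness evaluations so that previously seen design points are not compiled and simulated again. The encoding, cost model, and GA parameters are placeholders, not the ones used in the paper.

```cpp
// Hypothetical genome: a handful of integer architecture parameters
// (e.g. issue slots, FU counts, vector widths). Fitness is memoised so a
// re-visited design point never triggers another compile/simulate run.
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <map>
#include <random>
#include <vector>

using Genome = std::vector<int>;

std::map<Genome, double> fitnessCache;   // memoised compile+simulate results
std::mt19937 rng{42};

// Placeholder cost model; in reality this invokes the compiler and a
// cycle-accurate simulator, which is exactly the expensive step being cached.
double compileAndSimulate(const Genome& g) {
  double cost = 1.0;
  for (int v : g) cost += (v - 3) * (v - 3);
  return cost;                            // lower is better (e.g. energy-delay)
}

double fitness(const Genome& g) {
  auto it = fitnessCache.find(g);
  if (it != fitnessCache.end()) return it->second;   // cache hit: no simulation
  return fitnessCache.emplace(g, compileAndSimulate(g)).first->second;
}

Genome crossoverAndMutate(const Genome& a, const Genome& b) {
  std::uniform_int_distribution<int> gene(1, 8), coin(0, 1), mut(0, 9);
  Genome child(a.size());
  for (std::size_t i = 0; i < a.size(); ++i) {
    child[i] = coin(rng) ? a[i] : b[i];          // uniform crossover
    if (mut(rng) == 0) child[i] = gene(rng);     // 10% mutation rate
  }
  return child;
}

int main() {
  std::uniform_int_distribution<int> gene(1, 8);
  std::vector<Genome> pop(16, Genome(4));
  for (auto& g : pop) for (auto& v : g) v = gene(rng);

  for (int generation = 0; generation < 50; ++generation) {
    std::sort(pop.begin(), pop.end(), [](const Genome& a, const Genome& b) {
      return fitness(a) < fitness(b);
    });
    for (std::size_t i = pop.size() / 2; i < pop.size(); ++i)   // replace worst half
      pop[i] = crossoverAndMutate(pop[0], pop[i - pop.size() / 2]);
  }
  std::cout << "best cost: " << fitness(pop[0])
            << ", design points evaluated: " << fitnessCache.size() << "\n";
}
```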
  • ABSTRACT: In this paper we introduce and discuss the BuildMaster framework. This framework supports the design-space exploration of application-specific VLIW processors and offers automated caching of intermediate compilation and simulation results. Both the compilation cache and the simulation cache greatly help to shorten the exploration time and make it possible to use more realistic data for the evaluation of selected designs. In each of our experiments, we were able to reduce the number of required simulations by over 90% and to save up to 50% of the required compilation time.
    DDECS 2014 - 17th International Symposium on Design and Diagnostics of Electronic Circuits and Systems, Warsaw, Poland; 04/2014
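    The caching idea can be sketched as two memoisation tables, one in front of the compiler and one in front of the simulator, so that re-evaluated design points skip the expensive tool runs. The key derivation and the compile/simulate stubs below are assumptions for illustration; the actual BuildMaster framework is not reproduced here.

```cpp
#include <cstddef>
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>

struct Binary  { std::string code; };
struct Metrics { double cycles; double energy; };

std::unordered_map<std::size_t, Binary>  compileCache;   // keyed by (source, arch)
std::unordered_map<std::size_t, Metrics> simCache;       // keyed by (binary, input)

std::size_t keyOf(const std::string& a, const std::string& b) {
  return std::hash<std::string>{}(a + "\x1f" + b);
}

// Stand-ins for the external compiler and cycle-accurate simulator.
Binary compileFor(const std::string& src, const std::string& archCfg) {
  return {src + "|" + archCfg};
}
Metrics simulate(const Binary& bin, const std::string& inputSet) {
  return {static_cast<double>(bin.code.size() + inputSet.size()), 1.0};
}

Metrics evaluate(const std::string& src, const std::string& archCfg,
                 const std::string& inputSet) {
  std::size_t ck = keyOf(src, archCfg);
  auto c = compileCache.find(ck);
  if (c == compileCache.end())                           // compilation cache miss
    c = compileCache.emplace(ck, compileFor(src, archCfg)).first;

  std::size_t sk = keyOf(c->second.code, inputSet);
  auto s = simCache.find(sk);
  if (s == simCache.end())                               // simulation cache miss
    s = simCache.emplace(sk, simulate(c->second, inputSet)).first;
  return s->second;
}

int main() {
  evaluate("ecg_kernel.c", "issue=2 vec=4", "train");    // cold: compile + simulate
  evaluate("ecg_kernel.c", "issue=2 vec=4", "train");    // warm: both caches hit
  std::cout << compileCache.size() << " compilation(s), "
            << simCache.size() << " simulation(s)\n";    // prints 1 of each
}
```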
  • ABSTRACT: As modern GPUs rely partly on their on-chip memories to counter the imminent off-chip memory wall, the efficient use of their caches has become important for performance and energy. However, optimising cache locality systematically requires insight into and prediction of cache behaviour. On sequential processors, stack-distance or reuse-distance theory is a well-known means to model cache behaviour. However, it is not straightforward to apply this theory to GPUs, mainly because of the parallel execution model and fine-grained multi-threading. This work extends reuse distance to GPUs by modelling: 1) the GPU's hierarchy of threads, warps, threadblocks, and sets of active threads, 2) conditional and non-uniform latencies, 3) cache associativity, 4) miss-status holding registers, and 5) warp divergence. We implement the model in C++ and extend the Ocelot GPU emulator to extract lists of memory addresses. We compare our model with measured cache miss rates for the Parboil and PolyBench/GPU benchmark suites, showing a mean absolute error of 6% and 8% for two cache configurations. We show that our model is faster and more accurate than the GPGPU-Sim simulator.
    High Performance Computer Architecture (HPCA), Orlando, FL, USA; 02/2014
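    For reference, classical (sequential) reuse-distance analysis can be sketched as below: an access to a fully associative LRU cache of C lines misses exactly when its reuse distance is at least C. The GPU-specific extensions listed in the entry (thread hierarchy, latencies, associativity, MSHRs, divergence) are the paper's contribution and are not modelled in this sketch.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <limits>
#include <vector>

// Reuse distance of each access: the number of distinct cache lines touched
// since the previous access to the same line (SIZE_MAX marks a cold miss).
std::vector<std::size_t> reuseDistances(const std::vector<std::uint64_t>& lines) {
  std::vector<std::uint64_t> lruStack;                   // most recent at the back
  std::vector<std::size_t> dist;
  dist.reserve(lines.size());
  for (std::uint64_t a : lines) {
    auto it = std::find(lruStack.rbegin(), lruStack.rend(), a);
    if (it == lruStack.rend()) {
      dist.push_back(std::numeric_limits<std::size_t>::max());  // never seen before
    } else {
      dist.push_back(static_cast<std::size_t>(it - lruStack.rbegin()));
      lruStack.erase(std::next(it).base());              // remove the old position
    }
    lruStack.push_back(a);                               // move to the MRU position
  }
  return dist;
}

int main() {
  std::vector<std::uint64_t> trace = {0, 1, 2, 0, 3, 1, 0};   // cache-line addresses
  std::size_t cacheLines = 4, misses = 0;
  for (std::size_t d : reuseDistances(trace))
    if (d >= cacheLines) ++misses;                       // cold or capacity miss
  std::cout << "predicted miss rate: " << double(misses) / trace.size() << "\n";
}
```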
  • ABSTRACT: Graphics processing units (GPUs) are becoming increasingly popular for compute workloads, mainly because of their large number of processing elements and high bandwidth to off-chip memory. The roofline model captures the ratio between the two (the compute-memory ratio), an important architectural parameter. This work proposes to change the compute-memory ratio dynamically, scaling the voltage and frequency (DVFS) of 1) the memory for compute-intensive workloads and 2) the processing elements for memory-intensive workloads. The result is an adaptive, roofline-aware GPU that increases energy efficiency (by up to 58%) while maintaining performance.
    International Workshop on Adaptive Self-tuning Computing Systems, Vienna, Austria; 01/2014
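    The roofline-based decision can be illustrated with a small sketch: a kernel whose operational intensity lies left of the ridge point is memory-bound, so the compute clock can be lowered (or the memory clock raised), and vice versa. The device numbers and thresholds below are illustrative assumptions, not figures from the paper.

```cpp
#include <iostream>

struct GpuState {
  double coreGflops;   // peak compute throughput at the current core clock
  double memGBps;      // peak bandwidth at the current memory clock
};

enum class Scale { LowerCoreClock, LowerMemoryClock, Balanced };

Scale rooflineDecision(const GpuState& g, double opsPerByte) {
  double ridge = g.coreGflops / g.memGBps;   // intensity where both roofs meet
  if (opsPerByte < 0.9 * ridge) return Scale::LowerCoreClock;    // memory-bound
  if (opsPerByte > 1.1 * ridge) return Scale::LowerMemoryClock;  // compute-bound
  return Scale::Balanced;
}

int main() {
  GpuState gpu{3000.0, 200.0};               // 3 TFLOP/s, 200 GB/s (illustrative)
  double intensity = 2.5;                    // operations per byte of some kernel
  switch (rooflineDecision(gpu, intensity)) {
    case Scale::LowerCoreClock:   std::cout << "memory-bound: reduce core V/f\n"; break;
    case Scale::LowerMemoryClock: std::cout << "compute-bound: reduce memory V/f\n"; break;
    case Scale::Balanced:         std::cout << "near the ridge point: keep both\n"; break;
  }
}
```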
  • ABSTRACT: In smart environments, embedded sensing systems should intelligently adapt to the behavior of the users. Many interesting types of behavior are characterized by repetition of actions, such as certain activities or movements. This paper introduces a generic methodology, called Action History Matrices (AHM), to detect and classify repetitions that may occur at different scales. The properties of AHM for detecting repetitive movement behavior are demonstrated by analyzing four customer behavior classes in a shop environment observed by multiple uncalibrated cameras. Two different datasets, video recordings in the shop environment and motion-path simulations, are created and used in the experiments. The AHM-based system achieves an accuracy of 97% with the most suitable scale and a naive Bayesian classifier on the single-view simulated movement data. In addition, the performance of two fusion levels and three fusion methods is compared with the AHM method on the multi-view recordings. In our results, fusion at the decision level consistently offers better accuracy than fusion at the feature level, and the coverage-based view-selection fusion method (51%) marginally outperforms the majority method. The upper limit on accuracy by view selection with the recorded data is found to be 75%.
    Information Fusion 01/2014 · 2.26 Impact Factor
  • ABSTRACT: Numerous applications in important domains, such as communication and multimedia, exhibit significant data-level parallelism (DLP). A large part of this DLP is usually exploited through application vectorization and the implementation of vector operations in the processors executing the applications. While the amount of DLP varies between applications of the same domain, or even within a single application, processor architectures usually support only a single vector width. This may not be optimal and can cause substantial energy inefficiency. Therefore, a more sophisticated exploitation of DLP is highly relevant. This paper proposes the use of heterogeneous vector widths and a method to explore such widths for VLIW ASIPs. In our context, heterogeneity corresponds to the usage of two or more different vector widths in a single ASIP. After a brief explanation of the target ASIP architecture model, the paper describes the vector-width exploration method and explains the associated design automation tools. Subsequently, experimental results are discussed.
    Microprocessors and Microsystems 01/2014 · 0.55 Impact Factor
  • Orlando Moreira, Henk Corporaal
    Embedded Systems, edited by Peter Marwedel, 01/2014; Springer, ISBN: 978-3-319-01245-2
  • ABSTRACT: Automatic optimization of application-specific instruction-set processor (ASIP) architectures mostly focuses on the internal memory hierarchy design, or on the extension of reduced instruction-set architectures with complex custom operations. This paper focuses on very long instruction word (VLIW) architectures and, more specifically, on automating the selection of an application-specific VLIW issue-width. The issue-width selection strongly influences all the important processor properties (e.g. processing speed, silicon area, and power consumption). Therefore, accurate and efficient issue-width estimation and optimization are among the most important aspects of VLIW ASIP design. In this paper, we first compare different methods for the estimation of the required issue-width, and subsequently introduce a new force-based parallelism estimation method which is capable of estimating the required issue-width with only 3% error on average. Furthermore, we present and compare two techniques for estimating the required issue-width of software-pipelined loop kernels and show that a simple utilization-based measure provides an error margin of less than 1% on average.
    International Journal of Microelectronics and Computer Science. 11/2013; 4(2):55-64.
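    The simple utilization-based measure mentioned for software-pipelined loops can be sketched in a few lines: with N operations per loop iteration and an initiation interval of II cycles, at least ceil(N / II) issue slots are busy on average. The numbers are hypothetical, and the force-based estimation method itself is not reproduced here.

```cpp
#include <iostream>

// Smallest issue-width whose average slot utilization can accommodate a
// software-pipelined loop with the given operation count and initiation interval.
unsigned issueWidthLowerBound(unsigned opsPerIteration, unsigned initiationInterval) {
  return (opsPerIteration + initiationInterval - 1) / initiationInterval;  // ceiling
}

int main() {
  // Hypothetical kernel: 22 operations per iteration, II limited to 6 cycles by a
  // recurrence, so at least ceil(22 / 6) = 4 issue slots are required.
  std::cout << issueWidthLowerBound(22, 6) << "\n";
}
```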
  • ABSTRACT: GPUs are increasingly used as compute accelerators. With a large number of cores executing an even larger number of threads, significant speed-ups can be attained for parallel workloads. Applications that rely on atomic operations, such as histogram and Hough transform, suffer from serialization of threads when they update the same memory location. Previous work shows that reducing this serialization with software techniques can increase performance by an order of magnitude. We observe, however, that some serialization remains and still slows down these applications. Therefore, this paper proposes to use a hash function in both the addressing of the banks and the locks of the scratchpad memory. To measure the effects of these changes, we first implement a detailed model of atomic operations on scratchpad memory in GPGPU-Sim and verify its correctness. Second, we test our proposed hardware changes. They result in speed-ups of up to 4.9× and 1.8× on implementations utilizing the aforementioned software techniques for the histogram and Hough transform applications, respectively, with minimal hardware cost.
    Computer Design (ICCD), 2013 IEEE 31st International Conference on, Asheville, NC, USA; 10/2013
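    The bank-hashing idea can be illustrated as below: instead of deriving the bank index from the low address bits only, higher address bits are XOR-ed in, so strided updates (typical for histogram bins) spread over the banks instead of serializing on one. The 32-bank organisation and this particular hash are assumptions for illustration, not the paper's exact hardware hash.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <iostream>

constexpr unsigned kBanks = 32;                       // 32 word-addressed banks

unsigned plainBank(std::uint32_t wordAddr)  { return wordAddr % kBanks; }
unsigned hashedBank(std::uint32_t wordAddr) { return (wordAddr ^ (wordAddr >> 5)) % kBanks; }

// Worst-case number of accesses landing on a single bank (these serialize).
unsigned worstBankLoad(unsigned (*bank)(std::uint32_t),
                       std::uint32_t stride, unsigned accesses) {
  std::array<unsigned, kBanks> load{};
  for (unsigned i = 0; i < accesses; ++i) ++load[bank(i * stride)];
  unsigned worst = 0;
  for (unsigned l : load) worst = std::max(worst, l);
  return worst;
}

int main() {
  // 32 threads updating bins with a stride of 32 words: all hit bank 0 without
  // hashing, but spread over all banks once higher bits are folded in.
  std::cout << "plain : " << worstBankLoad(plainBank, 32, 32)  << " accesses on one bank\n";
  std::cout << "hashed: " << worstBankLoad(hashedBank, 32, 32) << " accesses on one bank\n";
}
```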
  • ABSTRACT: Design-space exploration for ASIP instruction-set design is a very complex problem, involving a large set of architectural choices. Existing methods are usually hand-crafted and time-consuming. In this paper, we propose and investigate a rapid method to estimate the energy consumption of candidate architectures for VLIW ASIP processors. The proposed method avoids the time-consuming simulation of the candidate prototypes, without any loss of accuracy in the predicted energy consumption. We experimentally show the effect of this fast cost-evaluation method when used in an automated instruction-set architecture exploration. In our experiments, we compare three different methods for cost estimation and find that we can accurately predict the energy consumption of proposed architectures while avoiding simulation of the complete system.
    DSD 2013 - 16th Euromicro Conference on Digital System Design, Santander, Spain; 09/2013
  • Dongrui She, Yifan He, Henk Corporaal
    ABSTRACT: In application-specific processor design, a common approach to improve performance and efficiency is to use special instructions that execute complex operation patterns. However, in a generic embedded processor with a compact Instruction Set Architecture (ISA), these special instructions may lead to large overheads: (i) more bits are needed to encode the extra opcodes and operands, resulting in wider instructions; (ii) more Register File (RF) ports are required to provide the extra operands to the function units. Such overhead may increase energy consumption considerably. In this article, we propose to support flexible operation-pair patterns in a processor with a compact 24-bit RISC-like ISA using: (i) a partially reconfigurable decoder that exploits pattern locality to reduce the opcode space requirement; (ii) a software-controlled bypass network to reduce operand encoding bits and RF port requirements. An energy-aware compiler backend is designed for the proposed architecture that performs pattern selection and bypass-aware scheduling to generate energy-efficient code. Though the proposed design imposes extra constraints on the operation patterns, the experimental results show that for benchmark applications from different domains, the average dynamic instruction count is reduced by over 25%, which is only about 2% less than for the architecture without such constraints. The proposed architecture reduces total energy by an average of 15.8% compared to the RISC baseline, while the one without constraints achieves almost no improvement due to its high overhead. When high performance is required, the proposed architecture is able to achieve a speedup of 13.8% with a 13.1% energy reduction compared to the baseline by introducing multicycle SFU operations.
    ACM Transactions on Architecture and Code Optimization (TACO). 09/2013; 10(3).
  • 07/2013: pages 41-60; ISBN: 978-1-4398-8063-0
  • ABSTRACT: Instruction-set architecture exploration for clustered VLIW processors is a very complex problem. Most of the existing exploration methods are hand-crafted and time-consuming. This paper presents and compares several methods for automating this exploration. We propose and discuss a two-phase method which can quickly explore many different architectures, and experimentally demonstrate that this method is capable of automatically achieving a 50% improvement in the energy-delay product of an automatically generated architecture for an ECG detection application, and a 1% energy-delay product improvement compared to a hand-crafted design.
    ECyPS 2013 - EUROMICRO/IEEE Workshop on Embedded and Cyber-Physical Systems, Budva, Montenegro; 06/2013
  • Gert-Jan van den Braak, Henk Corporaal
    ABSTRACT: GPUs have evolved into programmable, energy-efficient compute accelerators for massively parallel applications. Still, compute power is lost in many applications because of cycles spent on data movement and control instead of computations on actual data. Additional cycles can be lost on pipeline stalls due to long-latency operations. To improve performance and energy efficiency, we introduce GPU-CC: a reconfigurable GPU architecture with communicating cores. It is based on a contemporary GPU, which can still be used as such, but also has the ability to reorganize the cores of a GPU into a reconfigurable network. In GPU-CC, data movement and control are implicit in the configuration of the communication network. Additionally, each core executes a fixed instruction, reducing the instruction decode count and increasing energy efficiency. We show a large performance potential for GPU-CC, e.g. 1.9× and 2.4× for 3×3 and 5×5 convolution applications. The hardware cost of GPU-CC is mainly determined by the buffers in the added network, which amount to 12.4% of extra memory space.
    Proceedings of the 16th International Workshop on Software and Compilers for Embedded Systems, St. Goar, Germany; 06/2013
  • ABSTRACT: The dynamic behavior of streaming applications can be effectively modeled by scenario-aware dataflow graphs (SADFs). Many streaming applications must provide timing guarantees (e.g., throughput) to assure their quality-of-service. For instance, a video decoder running on a mobile device is expected to deliver a video stream at a specific frame rate. Moreover, the energy consumption of such applications on handheld devices should be as low as possible. This paper proposes a technique to select a suitable multiprocessor DVFS point for each mode (scenario) of a dynamic application described by an SADF. The technique assures strict timing guarantees while minimizing energy consumption. It is evaluated by applying it to several streaming applications. It solves the problem faster than the state-of-the-art technique for dataflow graphs. Moreover, the DVFS controller devised using the proposed technique is more compact and reduces energy consumption compared to the controller devised using the counterpart technique.
    Proceedings of the 2013 IEEE 19th Real-Time and Embedded Technology and Applications Symposium (RTAS); 04/2013
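    A much-simplified sketch of per-scenario operating-point selection: for each scenario, the lowest-power point whose worst-case frame time still meets the throughput constraint is chosen. The operating points and cycle counts below are invented, and the real technique additionally reasons about SADF scenario transitions and multiprocessor timing, which this sketch ignores.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct OperatingPoint { double freqMHz; double powerW; };

// Worst-case cycles needed per frame in a given scenario (e.g. I vs. P frames).
struct Scenario { std::string name; double worstCaseCycles; };

// Lowest-power point that still finishes a frame within the deadline.
OperatingPoint selectPoint(const Scenario& s, std::vector<OperatingPoint> points,
                           double frameDeadlineMs) {
  std::sort(points.begin(), points.end(),
            [](const OperatingPoint& a, const OperatingPoint& b) {
              return a.powerW < b.powerW;
            });
  for (const auto& p : points) {
    double frameMs = s.worstCaseCycles / (p.freqMHz * 1e3);  // cycles per millisecond
    if (frameMs <= frameDeadlineMs) return p;                // deadline met
  }
  return points.back();                                      // fall back to the fastest point
}

int main() {
  std::vector<OperatingPoint> points = {{200, 0.10}, {400, 0.28}, {800, 0.90}};
  std::vector<Scenario> scenarios = {{"I-frame", 9.0e6}, {"P-frame", 4.5e6}};
  double frameDeadlineMs = 1000.0 / 30.0;                    // 30 frames per second
  for (const auto& s : scenarios) {
    OperatingPoint p = selectPoint(s, points, frameDeadlineMs);
    std::cout << s.name << ": run at " << p.freqMHz << " MHz\n";
  }
}
```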
  • Cedric Nugteren, Gert-Jan van den Braak, Henk Corporaal
    ABSTRACT: As graphics processing units (GPUs) are becoming increasingly popular for general-purpose workloads (GPGPU), the question arises of how such processors will evolve architecturally in the near future. In this work, we identify and discuss trade-offs for three GPU architecture parameters: active thread count, compute-memory ratio, and cluster and warp sizing. For each parameter, we propose changes to improve GPU design, keeping in mind trends such as dark silicon and the increasing popularity of GPGPU architectures. A key enabler is dynamism and workload-adaptiveness, enabling, among others: dynamic register file sizing, latency-aware scheduling, roofline-aware DVFS, runtime cluster fusion, and dynamic warp sizing.
    Proceedings of the Conference on Design, Automation and Test in Europe, Grenoble, France; 01/2013
  • ABSTRACT: Implementing Multi-Processor Systems-on-Chip (MPSoCs) in 3-Dimensional (3D) ICs has many benefits, but the increased power density can cause significant thermal problems, resulting in decreased reliability, lifetime, and performance. This paper presents a fast thermal-aware approach for mapping throughput-constrained streaming applications onto a 3D MPSoC. While there are some published works on thermal-aware mapping of real-time applications, throughput constraints and data dependencies are mostly not considered. Furthermore, conventional approaches have long running times due to slow iterative thermal simulations. In our approach, to avoid slow thermal simulations for every candidate mapping, a thermal model of the 3D IC is used to derive an on-chip power distribution that minimizes the temperature before the actual mapping is done. Next, this distribution is used in a resource allocation algorithm to derive a mapping that meets the throughput constraint while approaching the target power distribution and minimizing energy consumption. This way, in contrast to most existing approaches, a mapping can be derived on the order of minutes. Experiments show a 7% reduction in peak temperature and a 47% reduction in communication energy compared to mappings based on load balancing.
    Embedded Systems for Real-time Multimedia (ESTIMedia), 2013 IEEE 11th Symposium on; 01/2013
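    The idea of deriving a temperature-aware target power distribution before mapping can be illustrated with a strongly simplified steady-state model, dT_i = R_i * P_i per tile: distributing the power budget inversely proportional to each tile's thermal resistance equalises the predicted temperature rise. The real approach uses a full 3D-IC thermal model with inter-tile coupling; the resistances below are invented for illustration.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// Split a total power budget so that the predicted steady-state temperature
// rise R_i * P_i is equal for all tiles (hotter tiles receive less power).
std::vector<double> targetPowerDistribution(const std::vector<double>& thermalResKPerW,
                                            double totalBudgetW) {
  double sumInv = 0.0;
  for (double r : thermalResKPerW) sumInv += 1.0 / r;
  std::vector<double> power;
  for (double r : thermalResKPerW)
    power.push_back(totalBudgetW * (1.0 / r) / sumInv);
  return power;
}

int main() {
  // Two stacked layers of two tiles each; the upper layer sits farther from the
  // heat sink, has a higher thermal resistance, and therefore gets a smaller share.
  std::vector<double> resistance = {1.0, 1.0, 2.5, 2.5};   // K/W per tile
  std::vector<double> power = targetPowerDistribution(resistance, 4.0);
  for (std::size_t i = 0; i < power.size(); ++i)
    std::cout << "tile " << i << ": " << power[i] << " W\n";
}
```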
  • Cedric Nugteren, Pieter Custers, Henk Corporaal
    ABSTRACT: Code generation and programming have become ever more challenging over the last decade due to the shift towards parallel processing. Emerging processor architectures such as multi-cores and GPUs exploit increasing amounts of parallelism, requiring programmers and compilers to deal with aspects such as threading, concurrency, synchronization, and complex memory partitioning. We advocate that programmers and compilers can greatly benefit from a structured classification of program code. Such a classification can help programmers to find opportunities for parallelization, reason about their code, and interact with other programmers. Similarly, parallelising compilers and source-to-source compilers can take threading and optimization decisions based on the same classification. In this work, we introduce algorithmic species, a classification of affine loop nests based on the polyhedral model and targeted at both automatic and manual use. Individual classes capture information such as the structure of parallelism and the data reuse. To make the classification applicable for manual use, a basic vocabulary forms the basis for a set of intuitive classes. To demonstrate the use of algorithmic species, we identify 115 classes in a benchmark set. Additionally, we demonstrate the suitability of algorithmic species for automated use by showing a tool to automatically extract species from program code, a species-based source-to-source compiler, and a species-based performance prediction model.
    ACM Transactions on Architecture and Code Optimization (TACO). 01/2013; 9(4).
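    Two small affine loop nests with indicative species annotations, in the spirit of the classification described in the entry above. The notation and class names in the comments are only approximations; the paper's exact species vocabulary may differ.

```cpp
#include <cstddef>
#include <vector>

// Element-wise addition: each output element reads one element of each input,
// roughly "A[0:N-1]|element ^ B[0:N-1]|element -> C[0:N-1]|element".
void vectorAdd(const std::vector<float>& A, const std::vector<float>& B,
               std::vector<float>& C) {
  for (std::size_t i = 0; i < C.size(); ++i)   // fully data-parallel, no input reuse
    C[i] = A[i] + B[i];
}

// 3-point stencil: each output element reads a neighbourhood of the input,
// roughly "A[0:N-1]|neighbourhood(-1:1) -> B[1:N-2]|element".
void stencil3(const std::vector<float>& A, std::vector<float>& B) {
  for (std::size_t i = 1; i + 1 < A.size(); ++i)   // data-parallel with input reuse
    B[i] = (A[i - 1] + A[i] + A[i + 1]) / 3.0f;
}

int main() {
  std::vector<float> a(8, 1.0f), b(8, 2.0f), c(8, 0.0f);
  vectorAdd(a, b, c);
  stencil3(a, b);
  return 0;
}
```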
  • Yifan He, Dongrui She, Sander Stuijk, Henk Corporaal
    ABSTRACT: Streaming applications are an important class of applications in emerging embedded systems such as smart camera networks, unmanned vehicles, and industrial printing. These applications are usually very computationally intensive and have real-time constraints. To meet the increasing demands for performance and efficiency in these applications, the use of application-specific IP cores in heterogeneous Multi-Processor Systems-on-Chip (MPSoCs) becomes inevitable. However, two of the key challenges in integrating these IP cores into MPSoCs are (i) how to properly handle inter-core communication, and (ii) how to map streaming applications in an efficient and predictable way. In this paper, we first present a predictable high-performance communication assist (CA) that helps to tackle these design challenges. The proposed CA has zero throughput overhead, negligible latency overhead, and significantly lower resource usage compared to existing CA designs. It also provides a unified abstract interface for both processors and accelerator IP cores, with flexible data access support. Based on the proposed CA design, we present a predictable heterogeneous multiprocessor platform template for streaming applications. The template is used in a predictable design flow that uses Synchronous Data Flow (SDF) graphs for design-time analysis. An accurate SDF model of our CA is introduced, enabling the mapping of applications onto heterogeneous MPSoCs in an efficient and predictable way. As a case study, we map the complete high-speed vision processing pipeline of an industrial application, Organic Light Emitting Diode (OLED) screen printing, onto one instance of the proposed platform. The results demonstrate that the system design and analysis effort is greatly reduced with the proposed CA-based design flow.
    Journal of Systems Architecture. 01/2013; 59(10):878–888.

Publication Stats

1k Citations
7.74 Total Impact Points

Institutions

  • 1988–2014
    • Technische Universiteit Eindhoven
      • Department of Electrical Engineering
      Eindhoven, North Brabant, Netherlands
  • 2008
    • Philips
      • Philips Research
      Eindhoven, North Brabant, Netherlands
    • NXP Semiconductors
      Eindhoven, North Brabant, Netherlands
  • 2007
    • Embedded Systems Institute
      Eindhoven, North Brabant, Netherlands
  • 1900–2006
    • Delft University of Technology
      • Information- and Communication Technology Section
      Delft, South Holland, Netherlands
  • 2003
    • imec Belgium
      • Smart Systems and Energy Technology
      Louvain, Flanders, Belgium