Henk Corporaal

Technische Universiteit Eindhoven, Eindhoven, North Brabant, Netherlands

Publications (378) · 57.43 Total impact

  • ABSTRACT: This paper proposes a modeling approach to capture the mapping of an application on a platform. The approach is based on Scenario-Aware Dataflow (SADF) models. In contrast to the related work, we express the complete design-space in a single formal SADF model. This allows us to have a compact and explorable state-space linked with an executable model capable of symbolically analyzing different mappings for their timing behavior. We can model different bindings for application tasks, different static-order schedules for tasks bound to shared resources, as well as naturally capturing resource claiming/unclaiming using SADF semantics. Moreover, by using the inherent properties of dataflow graphs and the dynamic behavior of a Finite-State Machine, we can model different levels of pipelining, such as full application pipelining and interleaved pipelining of consecutive executions of the application. The size of the model is independent of the number of executions of the application. Since we are able to capture all this behavior in a single SADF model, we can use available dataflow analyses, such as worst-case and best-case throughput analysis and deadlock-freedom checking. Furthermore, since the model captures the design-space independently of the analysis technique, one can use different exploration approaches to analyze different sets of requirements.
    No preview · Conference Paper · Sep 2015
  • ABSTRACT: Cyber-physical systems (CPS) play an important role in the modern high-tech industry. Designing such systems is a challenging task due to their multi-disciplinary nature and the range of abstraction levels involved. To facilitate hands-on experience with such systems, we develop a cyber-physical platform that aids in research and education on CPS. This paper describes this platform, which contains all typical CPS components. The platform is used in various research and education projects for bachelor, master, and PhD students. We discuss the platform and a number of projects and the educational opportunities they provide.
    No preview · Conference Paper · Sep 2015
  • Siham Tabik · Maurice Peemen · Nicolas Guil · Henk Corporaal
    ABSTRACT: Stencil computation is of paramount importance in many fields, such as image processing, structural biology and biomedicine. There is a constant demand for maximizing the performance of stencils on state-of-the-art architectures, such as graphics processing units (GPUs). One of the important issues when optimizing these kernels for the GPU is the selection of the thread-block configuration that maximizes the overall performance. Usually, programmers look for the optimal thread-block configuration in a reduced space of square thread-block configurations or simply use the best configuration reported in previous works, which is usually 16 × 16. This paper provides a better understanding of the impact of thread-block configurations on the performance of stencils on the GPU. In particular, we model locality and parallelism and consider that the optimal configurations are within the space that provides: (1) a small number of global memory communications; (2) a good shared memory utilization with a small number of conflicts; (3) a good utilization of the streaming multi-processors; and (4) a high efficiency of the threads within a thread-block. The model determines the set of optimal thread-block configurations without the need to execute the code. We validate the proposed model using six stencils with different halo widths and show that it reduces the optimization space to around 25% of the total valid space. The configurations in this space achieve a throughput of at least 75% of that of the best configuration and are guaranteed to include the best configurations. Copyright © 2015 John Wiley & Sons, Ltd.
    Full-text · Article · Aug 2015 · Concurrency and Computation Practice and Experience
  • Mathias Funk · Piet van der Putten · Henk Corporaal
    ABSTRACT: Nowadays, interactive electronic products offer a huge amount of functionality to prospective customers, but often it is too extensive and complex to be grasped and used successfully. In this case, customers avoid the struggle and return the products to the shop. Also, the variability in scope and features of a product is so large that an up-front specification becomes hard if not impossible. To avoid the problem of an inadequate match between customer expectations and designer assumptions, new sources of product usage information have to be developed. One possibility is to integrate observation functionality into the products, continuously involving real users in the product development process. The integration of such functionality is an often overlooked challenge that should be tackled with an appropriate engineering methodology. This paper presents on-going work on a novel design-for-observation approach that supports early observation integration and enables cooperation with various information stakeholders. We show how observation can be embedded seamlessly in a model-driven development process using UML. An industrial case-study shows the feasibility of the approach.
    Preview · Article · Jul 2015
  • Luc Waeijen · Dongrui She · Henk Corporaal · Yifan He
    ABSTRACT: Energy efficiency has become one of the most important topics in computing. To meet the ever increasing demands of the mobile market, the next generation of processors will have to deliver a high compute performance at an extremely limited energy budget. Wide single instruction, multiple data (SIMD) architectures provide a promising solution, as they have the potential to achieve high compute performance at a low energy cost. We propose a configurable wide SIMD architecture that utilizes explicit datapath techniques to further optimize energy efficiency without sacrificing computational performance. To demonstrate the efficiency of the proposed architecture, multiple instantiations of the proposed wide SIMD architecture and its automatic bypassing counterpart, as well as a baseline RISC processor, are implemented. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve an average of 206 times speed-up and reduces the total energy dissipation by 48.3 % on average and up to 94 %, compared to a reduced instruction set computing (RISC) processor. Compared to the corresponding SIMD architecture with automatic bypassing, an average of 64 % of all register file accesses is avoided by the 128-PE, explicitly bypassed SIMD. For total energy dissipation, an average of 27.5 %, and maximum of 43.0 %, reduction is achieved.
    No preview · Article · Jul 2015 · Journal of Signal Processing Systems
  • Dongrui She · Yifan He · Luc Waeijen · Henk Corporaal
    ABSTRACT: Energy efficiency is one of the most important metrics in embedded processor design. The use of wide SIMD architecture is a promising approach to build energy-efficient high performance embedded processors. In this paper, we propose a design framework for a configurable wide SIMD architecture that utilizes an explicit datapath to achieve high energy efficiency. The framework is able to generate processor instances based on architecture specification files. It includes a compiler to efficiently program the proposed architecture with standard programming languages including OpenCL. This compiler can analyze the static memory access patterns in OpenCL kernels, generate efficient mappings, and schedule the code to fully utilize the explicit datapath. Extensive experimental results show that the proposed architecture is efficient and scalable in terms of area, performance, and energy. In a 128-PE SIMD processor, the proposed architecture is able to achieve up to 200 times speed-up and reduce the total energy consumption by 50 % compared to a basic RISC processor.
    No preview · Article · Jul 2015 · Journal of Signal Processing Systems
  • Roel Jordans · Henk Corporaal
    ABSTRACT: Software-pipelining is an important technique for increasing the instruction level parallelism of loops during compilation. Currently, the LLVM compiler infrastructure does not offer this optimization although some target specific implementations do exist. We have implemented a high-level method for software-pipelining within the LLVM framework. By implementing this within LLVM's optimization layer we have taken the first steps towards a target independent software-pipelining method.
    Full-text · Conference Paper · Jun 2015
  • Ang Li · Akash Kumar · Yajun Ha · Henk Corporaal
    ABSTRACT: Volume image registration remains one of the best candidates for Graphics Processing Unit (GPU) acceleration because of its enormous computation time and plentiful data-level parallelism. However, an efficient GPU implementation for image registration is still challenging due to the heavy utilization of expensive atomic operations for similarity calculations. In this paper, we first propose five GPU-friendly Correlation Ratio (CR) based methods to accelerate the process of image registration. Compared to widely used Mutual Information (MI) based methods, the CR-based approaches require fewer resources for shadow histograms, so faster storage, such as the on-chip scratchpad memory, can be fully exploited to achieve better performance. Second, we perform a design space exploration of the CR-based methods, and study the trade-off of introducing shadow histograms on different storage (shared memory, global memory) by computation units of different granularity (thread, warp, thread block). Third, we exhaustively test the proposed designs on GPUs of different generations (Fermi, Kepler and Maxwell) so that performance variations due to hardware migration are addressed. Finally, we evaluate the performance impact of tuning the concurrency and algorithm settings, as well as the overheads incurred by preprocessing, smoothing and workload imbalance. We highlight our last CR approach, which completely avoids update conflicts in the histogram calculation, leading to substantial performance improvements (up to 55x speedup over a naive CPU implementation). It reduces the registration time from 145s to 2.6s for two typical 256x256x160 volume images on a Kepler GPU.
    No preview · Article · May 2015 · Microprocessors and Microsystems
  • ABSTRACT: A Large Scale Printer (LSP) is a Cyber Physical System (CPS) printing thousands of sheets per day with high quality. The print requests arrive at run-time, requiring online scheduling. We capture the LSP scheduling problem as online scheduling of re-entrant flowshops with sequence dependent setup times and relative due dates, with makespan minimization as the scheduling criterion. Exhaustive approaches like Mixed Integer Programming can be used, but they are compute intensive and not suited for online use. We present a novel heuristic for scheduling of LSPs that on average requires 0.3 seconds per sheet to find schedules for industrial test cases. We compare the schedules to lower bounds, to schedules generated by the current scheduler and to schedules generated by a modified version of the classical NEH (MNEH) heuristic [1], [2]. On average, the proposed heuristic generates schedules that are 40% shorter than those of the current scheduler, differ by 25% from the estimated lower bounds, and have less than 67% of the makespan of schedules generated by the MNEH heuristic.
    No preview · Conference Paper · Mar 2015
  • ABSTRACT: One of the most critical challenges for today’s and future data-intensive and big-data problems is data storage and analysis. This paper first highlights some challenges of the newborn Big Data paradigm and shows that the increase in data size has already surpassed the capabilities of today’s computation architectures, which suffer from limited bandwidth, programmability overhead, energy inefficiency, and limited scalability. Thereafter, the paper introduces a new memristor-based architecture for data-intensive applications. The potential of such an architecture in solving data-intensive problems is illustrated by showing its capability to increase the computation efficiency, solve the communication bottleneck, reduce the leakage currents, etc. Finally, the paper discusses why memristor technology is very suitable for the realization of such an architecture; the use of memristors to implement dual functions (storage and logic) is illustrated.
    Full-text · Conference Paper · Mar 2015
  • Francesco Comaschi · Sander Stuijk · Twan Basten · Henk Corporaal
    ABSTRACT: Object detection and tracking is one of the most important components in computer vision applications. To carefully evaluate the performance of detection and tracking algorithms, it is important to develop benchmark data sets. One of the most tedious and error-prone aspects of developing benchmarks is the generation of the ground truth. This paper presents FAST-GT (FAst Semi-automatic Tool for Ground Truth generation), a new generic framework for the semi-automatic generation of ground truths. FAST-GT reduces the need for manual intervention, thus speeding up the ground-truthing process.
    No preview · Article · Jan 2015

  • No preview · Article · Jan 2015 · IEEE Transactions on Computers
  • Firew Siyoum · Marc Geilen · Henk Corporaal

    No preview · Article · Jan 2015 · IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • Article: Bones
    Cedric Nugteren · Henk Corporaal

    No preview · Article · Dec 2014 · ACM Transactions on Architecture and Code Optimization
  • ABSTRACT: Custom Instruction Identification is an important part of the design of efficient Application-Specific Processors (ASIPs). It consists of profiling a given application to find patterns of basic operations that are frequently executed. Operations of such patterns can be implemented together as a single custom instruction to speed up the execution of the application. Because of the problem's high complexity, several methods have been proposed for specific single-issue (RISC) processors and architectures, limiting the shape and size of custom instructions that can actually be identified and, possibly, implemented. In this paper, we propose and discuss an efficient custom instruction set identification method and corresponding automatic tool for multi-issue VLIW ASIPs, which searches for the common operation patterns of the most frequently executed basic blocks of a given application, with different sizes and shapes. The speedup results for the custom instructions identified by our tool are provided for a set of benchmark applications. The speedup is up to 68%, with only a few custom instructions used.
    No preview · Article · Nov 2014
  • ABSTRACT: For next-generation radio telescopes such as the Square Kilometre Array, seemingly minor changes in scientific constraints can easily push computing requirements into the exascale domain. The authors propose a model for engineers and astronomers to understand these relations and make tradeoffs in future instrument designs.
    Full-text · Article · Sep 2014 · Computer
  • ABSTRACT: Programming models such as CUDA and OpenCL allow the programmer to specify the independence of threads, effectively removing ordering constraints. Still, parallel architectures such as the graphics processing unit (GPU) do not exploit the potential of data-locality enabled by this independence. Therefore, programmers are required to manually perform data-locality optimisations such as memory coalescing or loop tiling. This work makes a case for locality-aware thread scheduling: re-ordering threads automatically for better locality to improve the programmability of multi-threaded processors. In particular, we analyse the potential of locality-aware thread scheduling for GPUs, considering among others cache performance, memory coalescing and bank locality. This work does not present an implementation of a locality-aware thread scheduler, but rather introduces the concept and identifies the potential. We conclude that non-optimised programs have the potential to achieve good cache and memory utilisation when using a smarter thread scheduler. A case-study of a naive matrix multiplication shows, for example, an 87% performance increase, leading to an IPC of 457 on a 512-core GPU.
    No preview · Conference Paper · Aug 2014
  • Erkan Diken · Roel Jordans · Lech Jozwiak · Henk Corporaal
    ABSTRACT: Many applications in important domains, such as communication and multimedia, show significant data-level parallelism (DLP). A large part of the DLP is usually exploited through application vectorization and implementation of vector operations in the processors executing the applications. While the amount of DLP varies between applications of the same domain or even within a single application, processor architectures usually support a single vector width. This may not be optimal and may cause substantial energy and performance inefficiency. Therefore, a more sophisticated exploitation of DLP is highly relevant. This paper studies the construction and exploitation of VLIW ASIPs with multiple vector widths.
    No preview · Conference Paper · Jun 2014
  • Roel Jordans · Lech Jozwiak · Henk Corporaal
    ABSTRACT: Genetic algorithms are commonly used for automatically solving complex design problems because exploration using genetic algorithms can consistently deliver good results when the algorithm is given a long enough run-time. However, the exploration time for problems with huge design spaces can be very long, often making exploration using a genetic algorithm practically infeasible. In this work, we present a genetic algorithm for exploring the instruction-set architecture of VLIW ASIPs and demonstrate its effectiveness by comparing it to two heuristic algorithms. We present several optimizations to the genetic algorithm configuration, and demonstrate how caching of intermediate compilation and simulation results can reduce the exploration time by an order of magnitude.
    No preview · Conference Paper · Jun 2014
  • Luc Waeijen · Dongrui She · Henk Corporaal · Yifan He
    ABSTRACT: It has been shown that wide Single Instruction Multiple Data architectures (wide-SIMDs) can achieve high energy efficiency, especially in domains such as image and vision processing. In these and various other application domains, reduction is a frequently encountered operation, where multiple input elements need to be combined into a single element by an associative operation, e.g. addition or multiplication. Many applications require reduction, such as partial histogram merging, matrix multiplication and min/max-finding. Wide-SIMDs contain a large number of processing elements (PEs), which in general are connected by a minimal form of interconnect for scalability reasons. To efficiently support reduction operations on wide-SIMDs with such a minimal interconnect, we introduce two novel reduction algorithms which do not rely on complex communication networks or any dedicated hardware. The proposed approaches are compared with both dedicated hardware and other software solutions in terms of performance, area, and energy consumption. A practical case study demonstrates that the proposed software approach offers much better generality and flexibility at no additional hardware cost. Compared to a dedicated hardware adder tree, the proposed software approach saves 6.8% area with a performance penalty of only 6.5%.
    No preview · Article · Jun 2014

Publication Stats

3k Citations
57.43 Total Impact Points

Institutions

  • 2003-2015
    • Technische Universiteit Eindhoven
      • Department of Electrical Engineering
      Eindhoven, North Brabant, Netherlands
  • 2013
    • University of Cordoba (Spain)
      • Department of Computers Architecture and Technology, Electronics and Electric Technology
      Córdoba, Andalusia, Spain
  • 2008
    • NXP Semiconductors
      Eindhoven, North Brabant, Netherlands
  • 1991-2008
    • Delft University of Technology
      • Faculty of Applied Sciences (AS)
      • Information- and Communication Technology Section
      Delft, South Holland, Netherlands
  • 2002-2006
    • imec Belgium
      • Smart Systems and Energy Technology
      Leuven, Flanders, Belgium