Ben H. H. Juurlink

Technische Universität Berlin | TUB · Department of Computer Engineering and Microelectronics

PhD

About

233
Publications
35,547
Reads
2,396
Citations
Additional affiliations
January 2010 - present
Technische Universität Berlin
Position
  • Professor (Full)
February 2007 - December 2009
Delft University of Technology
Position
  • Professor (Associate)
August 1998 - January 2007
Delft University of Technology
Position
  • Professor (Assistant)

Publications

Chapter
In many applications there is a gap between the accuracy provided by the platform and the accuracy that is required by the application to produce good-enough results. Exploiting this gap specifically is the concept of Approximate Computing, where a small reduction in accuracy is traded for better performance or a reduction in energy consumption. We...
Article
GPUs are capable of delivering peak performance in the TFLOPS range; however, peak performance is often difficult to achieve due to several performance bottlenecks. Memory divergence is one such bottleneck: it makes locality harder to exploit and causes cache thrashing and high miss rates, thereby impeding GPU performance. As data locality is...
Chapter
Approximate computing comprises a large variety of techniques that trade the accuracy of an application’s output for other metrics such as computing time or energy cost. Many existing approximation techniques focus on loops such as loop perforation, which skips iterations for faster, approximated computation. This paper introduces ALONA, a novel ap...
Article
Agent-based simulations represent an effective scientific tool, with numerous applications from social sciences to biology, which aims to emulate or predict complex phenomena through a set of simple rules performed by multiple agents. To simulate a large number of agents with complex models, practitioners have developed high-performance parallel im...
Conference Paper
GPUs use a large memory access granularity (MAG) that often results in a low effective compression ratio for memory compression techniques. The low effective compression ratio is caused by a significant fraction of compressed blocks that have a few bytes above a multiple of MAG. While MAG-aware selective approximation, based on a tree structure, ha...
Conference Paper
Traditionally, GPUs only had programmer-managed caches. The advent of hardware-managed caches accelerated the use of GPUs for general-purpose computing. However, as GPU caches are shared by thousands of threads, they are usually a victim of contention and can suffer from thrashing and high miss rate, in particular, for memory-divergent workloads. A...
Article
Compiler optimization passes employ cost models to determine if a code transformation will yield performance improvements. When this assessment is inaccurate, compilers apply transformations that are not beneficial, or refrain from applying ones that would have improved the code. We analyze the accuracy of the cost models used in LLVM’s and GCC’s v...
Article
Energy optimization is an increasingly important aspect of today’s high-performance computing applications. In particular, dynamic voltage and frequency scaling (DVFS) has become a widely adopted solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory...
Conference Paper
Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The possibility to manually set these frequencies is a great opportunity for application tuning, which can focus...
Chapter
This paper presents the MEMPower power model. MEMPower is a detailed empirical power model for GPU memory access. It models the data dependent energy consumption as well as individual core specific differences. We explain how the model was calibrated using special micro benchmarks as well as a high-resolution power measurement testbed. A novel tech...
Article
Processor vendors have been expanding Single Instruction Multiple Data (SIMD) extensions to exploit data-level parallelism in their General Purpose Processors (GPPs). Each SIMD technology, such as Streaming SIMD Extensions (SSE) and Advanced Vector eXtensions (AVX), has its own Instruction Set Architecture (ISA), which is equipped with Special Purpose In...
Conference Paper
Agent-based simulations are becoming widespread among scientists from different areas, who use them to model increasingly complex problems. To cope with the growing computational complexity, parallel and distributed implementations have been developed for a wide range of platforms. However, it is difficult to have simulations that are portable to d...
Conference Paper
Single Instruction Multiple Data (SIMD) extensions in processors enable in-core parallelism for operations on vectors of data. From the compiler perspective, SIMD instructions require automatic techniques to determine how and when it is possible to express computations in terms of vector operations. When this is not possible automatically, a user m...
Conference Paper
Many applications provide inherent resilience to some amount of error and can potentially trade accuracy for performance by using approximate computing. Applications running on GPUs often use local memory to minimize the number of global memory accesses and to speed up execution. Local memory can also be very useful to improve the way approximate c...
Article
In the High Efficiency Video Coding (HEVC) standard, multiple decoding modules have been designed to take advantage of parallel processing. In particular, the HEVC in-loop filters (i.e., the deblocking filter and sample adaptive offset) were conceived to be exploited by parallel architectures. However, the type of the offered parallelism mostly sui...
Article
The High Efficiency Video Coding (HEVC) standard provides a higher compression efficiency than other video coding standards, but at the cost of an increased computational load, which makes it hard to achieve real-time encoding/decoding for ultra-high-resolution and high-quality video sequences. Graphics Processing Units (GPUs) are known to provide massive p...
Conference Paper
The increasing performance of today's computer architectures comes with an unprecedented increase in hardware complexity. Unfortunately, this results in difficult-to-tune software and consequently in a gap between the potential peak performance and the actual performance. Automatic tuning is an emerging approach that assists the programmer in manag...
Conference Paper
The LPGPU2 project is a 30-month project (Innovation Action) funded by the European Union. Its overall goal is to develop an analysis and visualization framework that enables GPU application developers to improve the performance and power consumption of their applications. To achieve this overall goal, several key objectives need to be met. Fi...
Conference Paper
In recent years, so-called FPGA-SoCs have been introduced by Intel (formerly Altera) and Xilinx. These devices combine multi-core processors with programmable logic. This paper analyzes the various memory and communication interconnects found in actual devices, particularly the Zynq-7020 and Zynq-7045 from Xilinx and the Cyclone V SE SoC from Intel...
Conference Paper
PHP is a dynamically typed programming language commonly used for the server-side implementation of web applications. Approachability and ease of deployment have made PHP one of the most widely used scripting languages for the web, powering important web applications such as WordPress, Wikipedia, and Facebook. PHP's highly dynamic nature, while pro...
Article
Context-based Adaptive Binary Arithmetic Coding (CABAC) is the entropy coding module in the HEVC/H.265 video coding standard. As in its predecessor, H.264/AVC, CABAC is a well-known throughput bottleneck due to its strong data dependencies. Besides other optimizations, the replacement of the context model memory by a smaller cache has been proposed...
Conference Paper
SIMD extensions were added to microprocessors in the mid '90s to speed up data-parallel code by vectorization. Unfortunately, the SIMD programming model has barely evolved and the most efficient utilization is still obtained with elaborate intrinsics coding. As a consequence, several approaches to write efficient and portable SIMD code have been pr...
Conference Paper
We propose a technique for optimizing the High Efficiency Video Coding (HEVC) encoder by reducing the number of operations performed in the motion estimation stage. The technique is based on the fact that a significant number of motion estimation operations are performed repetitively for the same image samples, but for different block partition siz...
Article
Temporal SIMT (TSIMT) has been suggested as an alternative to conventional (spatial) SIMT for improving GPU performance on branch-intensive code. Although TSIMT has been briefly mentioned before, it was not evaluated. We present a complete design and evaluation of TSIMT GPUs, along with the inclusion of scalarization and a combination of temporal a...
Article
A high-speed image compression architecture with region-of-interest (ROI) support and with flexible access to compressed data based on the Consultative Committee for Space Data Systems 122.0-B-1 image data compression standard is presented. Modifications of the standard permit a change of compression parameters and the reorganization of the bit str...
Conference Paper
Graphics processing units (GPUs) are now widely used as computing architectures for inherently data-parallel signal processing applications. In principle, GPUs can achieve significant speed-ups over central processing units (CPUs), especially for data-parallel applications which...
Article
Single instruction multiple data (SIMD) instructions have been commonly used to accelerate video codecs. The recently introduced High Efficiency Video Coding (HEVC) codec, like its predecessors, is based on the hybrid video codec principle and is therefore also well suited to acceleration with SIMD. In this paper we present the SIMD optimization...
Article
This paper presents different views expressed in a special session on the current state of programming and design tools for multi- and manycores in the embedded domain. Approximately ten years after the advent of multicore architectures, we take a look at the state of the art and trends in model-based programming methodologies from an academic point...
Conference Paper
This paper proposes a hardware architecture based on the object detection system of Viola and Jones using Haar-like features. The proposed design is able to discover faces in real-time with high accuracy. Speed-up is achieved by exploiting the parallelism in the design, where multiple classifier cores can be added. To maintain a flexible design, cl...
Article
Motion compensation is one of the most compute-intensive parts of H.264/AVC video decoding. It exposes massive parallelism, which graphics processing units (GPUs) can exploit. Control and memory divergence, however, may lead to performance penalties on GPUs. In this paper, we propose two GPU motion-compensation kernels, implemented wi...
Conference Paper
Robust and rapid face detection systems are constantly gaining more interest, since they represent the first step for many challenging tasks in the field of computer vision. In this paper, a software-hardware co-design approach is presented that enables the detection of frontal faces in real time. A complete hardware implementation of all componen...
Article
In this article, we investigate how code optimization techniques and low-power states of general-purpose processors improve the power efficiency of HEVC decoding. The power and performance efficiency of the use of SIMD instructions, multicore architectures, and low-power active and idle states are analyzed in detail for offline video decoding. In a...
Conference Paper
Recent research results show that there is a high degree of code sharing between cores in multi-core architectures. In this paper we propose a proximity scheme for the instruction caches, a scheme in which the shared code blocks among the neighbouring L2 caches in tiled multi-core architectures are exploited to reduce the average cache miss penalty...
Article
Predictors are used in many areas of computer architecture to enhance performance. With good estimations of future system behaviour, policies can be developed to improve system performance or reduce power consumption. These policies become more effective if the predictors are implemented in hardware and can provide quantified forecasts and not on...
Conference Paper
Directory coherence protocols have been employed in large-scale many-core architectures to maintain cache coherence as an alternative solution to snoopy coherence protocols. However, these protocols introduce an indirection through the directory to obtain the cache coherence information, thus increasing the cache miss latencies. Recent research res...
Article
While multicore architectures are used in the whole product range from server systems to handheld computers, the deployed software still undergoes the slow transition from sequential to parallel. This transition, however, is gaining more and more momentum due to the increased availability of more sophisticated parallel programming environments. Com...
Conference Paper
The recently finalized High-Efficiency Video Coding (HEVC) standard was jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG) to improve the compression performance of current video coding standards by 50%. Especially when it comes to transmitting high-resolution video like 4K over the inte...
Conference Paper
Over the last decade, multicore architectures have become omnipresent. Today, they are used in the whole product range from server systems to handheld computers. The deployed software still undergoes the slow transition from sequential to parallel. This transition, however, is gaining more and more momentum due to the increased availability of more...
Conference Paper
The new High Efficiency Video Coding Standard (HEVC) was finalized in January 2013. Compared to its predecessor H.264 / MPEG4-AVC, this new international standard is able to reduce the bitrate by 50% for the same subjective video quality. This paper investigates decoder optimizations that are needed to achieve HEVC real-time software decoding on a...
Conference Paper
Modern GPUs are true powerhouses in every sense of the word: while they offer general-purpose (GPGPU) compute performance an order of magnitude higher than that of conventional CPUs, they have also been rapidly approaching the infamous “power wall”, as a single chip sometimes consumes more than 300 W. Thus, the design space of GPGPU microarchitec...
Conference Paper
In this paper we present an implementation of the H.264/AVC Inverse Discrete Cosine Transform (IDCT) optimized for Graphics Processing Units (GPUs) using OpenCL. By exploiting the fact that most of the input data of the IDCT for real videos are zero-valued coefficients, a new compacted data representation is created that allows for several optimizations. Exp...
Article
Unlike H.264/advanced video coding, where parallelism was an afterthought, High Efficiency Video Coding currently contains several proposals aimed at making it more parallel-friendly. A performance comparison of the different proposals, however, has not yet been performed. In this paper, we will fill this gap by presenting efficient implementations...
Conference Paper
The quest for higher performance within a certain power budget in the field of embedded computing demands unconventional architectural approaches. To this end, we present synZEN (sZ): a (micro-)architecture that combines features of very long instruction word (VLIW) and transport triggered architectures (TTAs) to cover the needs of d...
Conference Paper
In this paper, we evaluate the performance and usability of the parallel programming model OpenMP Superscalar (OmpSs), apply it to 10 different benchmarks and compare its performance with corresponding POSIX threads implementations.
Article
In this paper, we evaluate the performance and usability of the parallel programming model OpenMP Superscalar (OmpSs), apply it to 10 different benchmarks and compare its performance with corresponding POSIX threads implementations.