Tomas Karnagel’s research while affiliated with TUD Dresden University of Technology and other places


Publications (15)


Big data causing big (TLB) problems: taming random memory accesses on the GPU
  • Conference Paper

May 2017 · 216 Reads · 19 Citations

Tomas Karnagel · Tal Ben-Nun · Matthias Werner · [...]
GPUs are increasingly adopted for large-scale database processing, where data accesses represent the major part of the computation. If the data accesses are irregular, like hash table accesses or random sampling, the GPU performance can suffer. Especially when scaling such accesses beyond 2GB of data, a performance decrease of an order of magnitude is encountered. This paper analyzes the source of the slowdown through extensive micro-benchmarking, attributing the root cause to the Translation Lookaside Buffer (TLB). Using the micro-benchmarks, the TLB hierarchy and structure are fully analyzed on two different GPU architectures, identifying never-before-published TLB sizes that can be used for efficient large-scale application tuning. Based on the gained knowledge, we propose a TLB-conscious approach to mitigate the slowdown for algorithms with irregular memory access. The proposed approach is applied to two fundamental database operations - random sampling and hash-based grouping - showing that the slowdown can be dramatically reduced, and resulting in a performance increase of up to 13×.


Adaptive work placement for query processing on heterogeneous computing resources

March 2017 · 53 Reads · 44 Citations

Proceedings of the VLDB Endowment

The hardware landscape is currently changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with their own characteristics. This trend is a great opportunity for database systems to increase overall performance if the heterogeneous resources can be utilized efficiently. To achieve this, the main challenge is to place the right work on the right computing unit. Current approaches to this placement for query processing assume that data cardinalities of intermediate results can be correctly estimated. However, this assumption does not hold for complex queries. To overcome this problem, we propose an adaptive placement approach that is independent of cardinality estimation for intermediate results. Our approach is incorporated in a novel adaptive placement sequence. Additionally, we implement our approach as an extensible virtualization layer to demonstrate its broad applicability across multiple database systems. In our evaluation, we clearly show that our approach significantly improves OLAP query processing on heterogeneous hardware, while being adaptive enough to react to changing cardinalities of intermediate query results.


Heterogeneous placement optimization for database query processing

January 2017 · 10 Reads · 4 Citations

it - Information Technology

Computing hardware is constantly evolving and database systems need to adapt to ongoing hardware changes to improve performance. The current hardware trend is heterogeneity, where multiple computing units like CPUs and GPUs are used together in one system. In this paper, we summarize our efforts to use hardware heterogeneity efficiently for query processing. We discuss different approaches to execution and investigate heterogeneous placement in detail by showing how to automatically determine operator placement decisions according to the given hardware environment and query properties.



Limitations of Intra-operator Parallelism Using Heterogeneous Computing Resources

August 2016 · 21 Reads · 3 Citations

Lecture Notes in Computer Science

The hardware landscape is changing from homogeneous multi-core systems towards wildly heterogeneous systems combining different computing units, like CPUs and GPUs. To utilize these heterogeneous environments, database query execution has to adapt to cope with different architectures and computing behaviors. In this paper, we investigate the simple idea of partitioning an operator’s input data and processing all data partitions in parallel, one partition per computing unit. For heterogeneous systems, data has to be partitioned according to the performance of the computing units. We define a way to calculate the partition sizes, analyze the parallel execution exemplarily for two database operators, and present limitations that could hinder significant performance improvements. The findings in this paper can help system developers to assess the possibilities and limitations of intra-operator parallelism in heterogeneous environments, leading to more informed decisions on whether this approach is beneficial for a given workload and hardware environment.



Query processing on low-energy many-core processors

June 2015 · 10 Reads · 10 Citations

Aside from performance, energy efficiency is an increasing challenge in database systems. To tackle both aspects in an integrated fashion, we pursue a hardware/software co-design approach. To meet the energy requirement from the hardware perspective, we utilize a low-energy processor design that offers the possibility of placing hundreds to millions of chips on a single board without any thermal restrictions. Furthermore, we address the performance requirement by developing several database-specific instruction set extensions to customize each core, where no single core carries all extensions. Our hardware foundation is therefore a low-energy processor consisting of a high number of heterogeneous cores. In this paper, we introduce our hardware setup at the system level and present several challenges for query processing. Based on these challenges, we describe two implementation concepts and compare them. Finally, we conclude the paper with some lessons learned and an outlook on our upcoming research directions.


Local vs. Global optimization: Operator placement strategies in heterogeneous environments

January 2015 · 37 Reads · 6 Citations

In several parts of query optimization, like join enumeration or physical operator selection, there is always the question of how much optimization is needed and how large the performance benefits are. In particular, a decision has to be taken for either global optimization (e.g., during query optimization) or local optimization (during query execution). Heterogeneity in the hardware environment adds a further optimization aspect, and it is not yet known how much optimization that aspect actually requires. Several papers have shown that heterogeneous hardware environments can be used efficiently by applying operator placement for OLAP queries. However, whether it is better to apply this placement as a local or a global optimization strategy is still an open question. To tackle this challenge, we examine both strategies for a column-store database system in this paper. Aside from describing local and global placement in detail, we conduct an exhaustive evaluation to draw conclusions. For the global placement strategy, we also propose a novel approach to address the challenge of an exploding search space, together with discussing well-known solutions for improving cardinality estimation.


Heterogeneity-Aware Operator Placement in Column-Store DBMS
  • Article
  • Full-text available

November 2014 · 89 Reads · 14 Citations

Datenbank-Spektrum

Due to the tremendous increase in the amount of data efficiently managed by current database systems, optimization is still one of the most challenging issues in database research. Today’s query optimizers determine the most efficient composition of physical operators to execute a given SQL query, typically assuming that the underlying hardware is a multi-core CPU. However, hardware systems are shifting more and more towards heterogeneity, combining a multi-core CPU with various computing units, e.g., GPU or FPGA cores. In order to efficiently utilize the performance capability of such heterogeneous hardware, the assignment of physical operators to computing units gains importance. In this paper, we propose a heterogeneity-aware physical operator placement strategy (HOP) for in-memory columnar database systems in a heterogeneous environment. Our placement approach takes operators from the physical query execution plan as input and assigns them to computing units using a cost model at runtime. To enable this runtime decision, our cost model uses the characteristics of the computing units, execution properties of the operators, as well as runtime data to estimate execution costs for each unit. We evaluated our approach on full TPC-H queries within a prototype database engine. As we show, operator placement in a heterogeneous hardware system has a high influence on query performance.


Figure 3: Algorithm of CityHash32 Hashing
Figure 4: Hash-specific Processor Model
Figure 5: Processing steps of the integer hash operation
Figure 6: 8-fold SIMD Hash Key Computation

    void int_hash(
        unsigned int*   key,        // input keys
        unsigned short* hashValue,  // output hash values
        int             keySize,    // number of keys
        unsigned int    hashFunc    // hash mask
    ){
        int i;
        init_states(key, hashValue, hashFunc);
        LD_0(); LD_1();
        for (i = 0; i < (keySize / 16); i++) {
            LD_0(); LD_1(); HOP();
            LD_0(); LD_1(); HOP();
            ST_0(); ST_1();
        }
        HOP();
        ST_0(); ST_1();
    }

Figure 9: Sampling Instruction
HASHI: An Application-Specific Instruction Set Extension for Hashing

September 2014 · 452 Reads · 9 Citations

Hashing is one of the most relevant operations within query processing. Almost all core database operators like group-by, selections, or different join implementations rely on highly efficient hash implementations. In this paper, we present a way to significantly improve the performance and energy efficiency of hash operations using specialized instruction set extensions for the Tensilica Xtensa LX5 core. To show the applicability of instruction set extensions, we implemented a bit extraction hashing scheme for 32-bit integer keys as well as the CityHash function for string values. We identify the individual parts of the algorithms that require optimization, describe our hashing-specific instruction set, and finally give a comprehensive experimental evaluation. We observed that the hash implementation using the hashing-specific instruction set (1) is up to two orders of magnitude faster than the basic core without extensions, (2) always exhibits better performance than hand-tuned code running on modern high-end general-purpose CPUs, and (3) has a significantly better footprint with respect to energy consumption as well as chip area. Especially the third observation has the potential for a higher packing density and therefore a significantly better overall system performance.


Citations (14)


... Prior works addressing this issue can be broadly categorized into two approaches: (1) utilizing multiple GPUs [29, 39,45], and (2) leveraging CPU-side memory for data processing [12,16,27,30,33,59]. The first approach employs multiple GPUs to horizontally scale memory capacity. ...

Reference:

Vortex: Overcoming Memory Capacity Limitations in GPU-Accelerated Large-Scale Data Analytics
Adaptive work placement for query processing on heterogeneous computing resources
  • Citing Article
  • March 2017

Proceedings of the VLDB Endowment

... However, as the working sets and irregular memory accesses of applications are increased, the address translation is becoming the bottleneck of GPU performance due to the poor TLB reach [14]- [19]. The address translation overhead due to TLB misses can cause slowdown of up to 13× for irregular GPU applications with divergent memory accesses [15], that is because a single TLB miss can stall hundreds of threads and the performance can be degraded significantly with the reduced thread level parallelism (TLP). Modern processors and operating systems (OSes) use 4KB base page coupling with 2MB or 1GB large pages to improve the reach of each TLB entry dramatically. ...

Big data causing big (TLB) problems: taming random memory accesses on the GPU
  • Citing Conference Paper
  • May 2017

... In addition to that, it is not yet clear how to combine novel research suggestions in a unified system, and how such suggestions may affect or benefit from each other. In particular, our research community shows opportunities and challenges of modern hardware in database systems in isolation, among them the need for analysis of novel adaptive data layouts and data structures for operational and analytical systems [6,7,9,55], novel processing, storage and federation approaches on non-relational data models [11,18,51,52,62], benefits and drawbacks of porting to new compute platforms [12,16,33,63], opportunities and limitations of GPUs and other co-processors as building blocks for storage and querying purposes [8,14,33], novel proposals for main memory databases on modern hardware [3,15,21,53,56], adaptive optimization, and first attempts towards self-managing database systems [17,36,43,46]. ...

Heterogeneous placement optimization for database query processing
  • Citing Article
  • January 2017

it - Information Technology

... While programming the CPU-GPU coupling is one side of the problem, occupying both is the other. Many problems can be functionally decomposed, for instance database operations [9] and molecular dynamics simulations [11]. However, for computational fluid dynamics (CFD), where tightly coupled array operations are prevalent, data parallelism is required and load balancing is of key importance. ...

Limitations of Intra-operator Parallelism Using Heterogeneous Computing Resources
  • Citing Conference Paper
  • August 2016

Lecture Notes in Computer Science

... At a first glance, the low synchronization overheads of Hardware Transactional Memory (HTM) [32] make this technology a promising solution for handling concurrent transactions accessing shared data that is durably stored in PM. However, existing HTM implementations do not guarantee that updates generated by a committed transaction are atomically transposed from the (volatile) CPU cache to PM [9,12,23,24]. ...

Improving in-memory database index performance with Intel® Transactional Synchronization Extensions
  • Citing Conference Paper
  • February 2014

... As fundamental studies for energy efficiency, characteristics of power-performance relationship were analyzed in various aspects such as operator-level analysis [Ungethüm et al. 2015], or workload level analysis such as transactional workloads [Poess and Nambiar 2008], analytical workloads [Lang et al. 2012;Tsirogiannis et al. 2010], etc. ...

Query processing on low-energy many-core processors
  • Citing Article
  • June 2015

... In this cost-based approach, cardinality estimation plays a dominant role with the task to approximate the number of returned tuples for every query operator within a query execution plan [2,19,21]. These estimations are used within different optimization techniques for various decisions such as determining the right join order [3], choosing the optimal operator variant [27], or finding the optimal placement within a heterogeneous hardware environment [9,10]. For this reason, it is important to have cardinality estimations with high accuracy. ...

Local vs. Global optimization: Operator placement strategies in heterogeneous environments
  • Citing Article
  • January 2015

... Another interesting research issue is addressing topics such as enhanced data and operator placement algorithms [27,28], adaptive query execution models [29,30], novel co-processing methodologies [29,31], and query optimizers that take hardware heterogeneity into account [32,33]. Furthermore, the question of whether hardware-conscious or hardware-oblivious algorithm design is the right option for modern hardware environments has also become a hot research subject. ...

Heterogeneity-Aware Operator Placement in Column-Store DBMS

Datenbank-Spektrum