The VLDB Journal (2020) 29:1243–1261
https://doi.org/10.1007/s00778-020-00621-w

SPECIAL ISSUE PAPER

VIP: A SIMD vectorized analytical query engine

Orestis Polychroniou, Amazon Web Services, Palo Alto, USA (orestis@amazon.com)
Kenneth A. Ross, Columbia University, New York, USA (kar@cs.columbia.edu)

Received: 27 January 2020 / Revised: 10 June 2020 / Accepted: 22 June 2020 / Published online: 13 July 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020

This article is an extension of earlier published work [41], done while the first author was affiliated with Columbia University, and supported by NSF Grant IIS-1422488 and an Oracle gift.

Abstract
Query execution engines for analytics are continuously adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs and new processor designs such as the many-core Intel Xeon Phi CPUs that rely on SIMD vectorization to achieve high performance per core while packing a greater number of smaller cores per chip. In the database literature, using SIMD to optimize stand-alone operators with key–rid pairs is common, yet the state-of-the-art query engines rely on compilation of tightly coupled operators where hand-optimized individual operators become impractical. In this article, we extend a state-of-the-art analytical query engine design by combining code generation and operator pipelining with SIMD vectorization and show that the SIMD speedup is diminished when execution is dominated by random memory accesses. To better utilize the hardware features, we introduce VIP, an analytical query engine designed and built bottom up from pre-compiled column-oriented data parallel sub-operators and implemented entirely in SIMD. In our evaluation using synthetic and TPC-H queries on a many-core CPU, we show that VIP outperforms hand-optimized query-specific code without incurring the runtime compilation overhead, and highlight the efficiency of VIP at utilizing the hardware features of many-core CPUs.
Keywords: Query execution · Modern hardware · OLAP · SIMD · Vectorization
1 Introduction
Hardware-conscious database design and implementation are a topic of ongoing research due to the profound impact of modern hardware advances on query execution. Large main memory capacity and multi-core CPUs raised the bar for efficient in-memory execution. Databases diverged to focus on transactional, analytical, scientific, or other workloads. Storage and execution, narrowed down to specific workloads, were redesigned by adapting to the new hardware dynamics.

In analytical databases, columnar storage is now standard, since most queries access a few columns from a large number of tuples, in contrast to transactions that update a small number of tuples. However, analytical query engines are based on multiple distinctive designs, including column-oriented and row-oriented execution, interpretation and runtime compilation, cache-conscious execution, and operator pipelining.
Efficient in-memory execution requires low interpretation cost, optimized memory access, and high CPU efficiency. Low interpretation cost is coupled with high instruction-level parallelism and is achieved by processing entire columns [5,29], batches of tuples per iterator call [6,9,10], or by compiling query-specific code at runtime [15,21,33]. Memory access can be optimized by combining operators to avoid materializing the pipelined stream [14], or by using partitioning to avoid cache and TLB misses [30]. Data parallelism is achieved via SIMD vectorization. Linear access operators such as scans and compression [23,40,54,55] are naturally data parallel and easy to vectorize. For other operators such as sorting, the common approach is to use ad hoc SIMD optimizations [8,16,18,24,37,38,44,45,47]. Recently, we introduced SIMD vectorization for nonlinear access operators, such as hash tables and partitioning [36,39,50].
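To make the contrast concrete, the following sketch (not code from VIP, just a minimal AVX-512 illustration) shows why a selection scan is naturally data parallel: sixteen 32-bit values are compared per instruction and the qualifying ones are written out densely with a compress-store, with no per-tuple branch.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Copies the values of col[0..n) that are less than `bound` into `out` and
// returns how many qualified. Assumes n is a multiple of 16 and that the CPU
// supports AVX-512F; tail handling is omitted for brevity.
size_t select_less_than(const int32_t* col, size_t n, int32_t bound, int32_t* out) {
    const __m512i vbound = _mm512_set1_epi32(bound);
    size_t written = 0;
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(col + i);                // load 16 column values
        __mmask16 m = _mm512_cmplt_epi32_mask(v, vbound);       // which lanes qualify
        _mm512_mask_compressstoreu_epi32(out + written, m, v);  // pack them contiguously
        written += (size_t)_mm_popcnt_u32((unsigned)m);
    }
    return written;
}

A scalar version of the same loop needs a data-dependent branch or a conditional store per value, which is exactly the per-tuple overhead that vectorization removes on such linear access patterns.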
The advent of the many-core platforms known as Intel Xeon Phi shows that the trade-off between many simple cores and fewer complex cores is revisited. Fewer complex cores
... To increase query performance, vectorization is a state-of-the-art optimization technique nowadays, because all recent x86 processors offer powerful SIMD extensions [1]–[5]. To achieve the best performance, explicit vectorization using SIMD intrinsics is still the best way [4], [7], where intrinsics are functions that wrap the underlying machine instructions. However, these SIMD extensions are increasingly diverse in terms of (i) the number of available vector instructions, (ii) the vector length, and (iii) the granularity of the bit-level parallelism, i.e., the data widths on which the vector instructions can operate [7]. ...
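As a purely illustrative aside (not taken from the cited work), the same 32-bit addition written with intrinsics at three vector lengths shows both what an intrinsic is, a thin wrapper over one machine instruction, and how the extensions differ in register width.

#include <immintrin.h>

// Each function compiles to essentially one vector instruction (PADDD / VPADDD).
__m128i add4 (__m128i a, __m128i b) { return _mm_add_epi32(a, b);    }  // SSE2:    4 x 32-bit lanes
__m256i add8 (__m256i a, __m256i b) { return _mm256_add_epi32(a, b); }  // AVX2:    8 x 32-bit lanes
__m512i add16(__m512i a, __m512i b) { return _mm512_add_epi32(a, b); }  // AVX-512: 16 x 32-bit lanes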
Article
SIMD is an instruction-set extension in mainstream processors that provides data-level parallelism to accelerate application performance. However, its advantages diminish when applications suffer from heavy cache misses. To eliminate cache misses in SIMD vectorization, we present interleaved multi-vectorizing (IMV) in this paper. It interleaves multiple execution instances of vectorized code to hide memory access latency with more computation. We also propose residual vectorized states to solve the control-flow divergence in vectorization. IMV can make full use of the data parallelism in SIMD and the memory-level parallelism through prefetching. It reduces cache misses, branch misses, and computation overhead to significantly speed up the performance of pointer-chasing applications, and it can be applied to executing entire query pipelines. As experimental results show, IMV achieves up to 4.23X and 3.17X better performance compared with the pure scalar implementation and the pure SIMD vectorization, respectively.
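The interleaving idea can be sketched in simplified scalar form: keep several lookups in flight so that a software prefetch issued for one lookup overlaps with the computation done for the others. The sketch below is only an illustration under assumptions of its own (multiplicative hash, power-of-two table without collision chains, drain phase omitted); the actual IMV technique additionally vectorizes each in-flight instance and handles control-flow divergence with residual vectorized states.

#include <immintrin.h>   // _mm_prefetch
#include <cstdint>
#include <cstddef>

struct Bucket { uint32_t key; uint32_t payload; };

static inline uint32_t hash32(uint32_t k) { return k * 2654435761u; }  // simple multiplicative hash

// Probes keys[0..n) against a power-of-two table of (mask + 1) buckets with G
// lookups kept in flight. Matching payloads are appended to out; the final G
// in-flight lookups are not drained, to keep the sketch short.
size_t probe_interleaved(const Bucket* table, uint32_t mask,
                         const uint32_t* keys, size_t n, uint32_t* out) {
    constexpr int G = 8;           // number of lookups kept in flight (tuning knob)
    size_t bucket_of[G];           // which bucket each in-flight lookup will read
    size_t produced = 0, next = 0;

    // Start the first G lookups: compute their buckets and prefetch them.
    for (int g = 0; g < G && next < n; ++g, ++next) {
        bucket_of[g] = hash32(keys[next]) & mask;
        _mm_prefetch((const char*)&table[bucket_of[g]], _MM_HINT_T0);
    }
    // Steady state: finish the oldest lookup in a slot, then reuse the slot to
    // start a new lookup, so its prefetch has G iterations to complete.
    for (size_t j = 0; j + G < n; ++j) {
        int g = (int)(j % G);
        const Bucket& b = table[bucket_of[g]];   // likely a cache hit by now
        if (b.key == keys[j]) out[produced++] = b.payload;
        bucket_of[g] = hash32(keys[next]) & mask;
        _mm_prefetch((const char*)&table[bucket_of[g]], _MM_HINT_T0);
        ++next;
    }
    return produced;
}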
Conference Paper
Query execution engines are adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs as well as new processor designs, such as the many-core platforms that rely on data parallelism via SIMD vectorization to pack a larger number of smaller cores per chip. In the database literature, using SIMD to optimize standalone operators with key–rid pairs is common, yet the state-of-the-art query engines rely on compilation of tightly coupled operators where hand-optimized individual operators become impractical. In this paper, we present VIP, an analytical query engine designed and built bottom-up from pre-compiled column-oriented data-parallel sub-operators and implemented entirely in SIMD. In our evaluation derived from the TPC-H workload, VIP outperforms query-specific hand-optimized scalar code.
Article
We define the concept of performance-optimal filtering to indicate the Bloom or Cuckoo filter configuration that best accelerates a particular task. While the space-precision tradeoff of these filters has been well studied, we show how to pick a filter that maximizes the performance for a given workload. This choice might be "suboptimal" relative to traditional space-precision metrics, but it will lead to better performance in practice. In this paper, we focus on high-throughput filter use cases, aimed at avoiding CPU work, e.g., a cache miss, a network message, or a local disk I/O - events that can happen at rates of millions to hundreds per second. Besides the false-positive rate and memory footprint of the filter, performance optimality has to take into account the absolute cost of the filter lookup as well as the saved work per lookup that filtering avoids; while the actual rate of negative lookups in the workload determines whether using a filter improves overall performance at all. In the course of the paper, we introduce new filter variants, namely the register-blocked and cache-sectorized Bloom filters. We present new implementation techniques and perform an extensive evaluation on modern hardware platforms, including the wide-SIMD Skylake-X and Knights Landing. This experimentation shows that in high-throughput situations, the lower lookup cost of blocked Bloom filters allows them to overtake Cuckoo filters.
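The notion of a performance-optimal (rather than space-optimal) filter can be made concrete with a back-of-the-envelope model; the formulation below is a simplification written for this summary, not necessarily the exact cost model of the paper. Let t_f be the cost of one filter lookup, t_w the cost of the work a negative lookup avoids, s the fraction of lookups that are true positives, and epsilon the false-positive rate. Then, per lookup:

E[\text{cost with filter}] \approx t_f + \bigl(s + (1 - s)\,\varepsilon\bigr)\, t_w,
\qquad
E[\text{cost without filter}] = t_w .

Under this model the filter pays off only when t_f < (1 - s)(1 - epsilon) t_w, so once the avoided work is a cache miss rather than a disk I/O, a cheaper lookup can matter more than a lower false-positive rate, which is consistent with the finding that blocked Bloom filters overtake Cuckoo filters in high-throughput situations.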
Chapter
This paper presents the design of a read-optimized relational DBMS that contrasts sharply with most current systems, which are write-optimized. Among the many differences in its design are: storage of data by column rather than by row, careful coding and packing of objects into storage including main memory during query processing, storing an overlapping collection of column-oriented projections, rather than the current fare of tables and indexes, a non-traditional implementation of transactions which includes high availability and snapshot isolation for read-only transactions, and the extensive use of bitmap indexes to complement B-tree structures. We present preliminary performance data on a subset of TPC-H and show that the system we are building, C-Store, is substantially faster than popular commercial products. Hence, the architecture looks very encouraging.
Conference Paper
Increasing single instruction multiple data (SIMD) capabilities in modern hardware allows for compiling efficient data-parallel query pipelines. This means GPU-alike challenges arise: control flow divergence causes underutilization of vector-processing units. In this paper, we present efficient algorithms for the AVX-512 architecture to address this issue. These algorithms allow for fine-grained assignment of new tuples to idle SIMD lanes. Furthermore, we present strategies for their integration with compiled query pipelines without introducing inefficient memory materializations. We evaluate our approach with a high-performance geospatial join query, which shows performance improvements of up to 35%.
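The refill mechanism can be sketched with a few AVX-512 intrinsics; the code below is an illustration written for this summary (a synthetic per-lane workload stands in for real divergent control flow, and tail/drain handling is omitted), not an algorithm from the paper. Finished lanes are flushed contiguously with a compress-store and the freed lanes are immediately refilled from the input with an expand-load, so no lane sits idle while others still have work.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>

// Each input value requires a data-dependent number of steps (1..8) before it
// produces a result, so lanes finish at different times. Requires AVX-512F.
size_t run_with_lane_refill(const int32_t* in, size_t n, int32_t* out) {
    __m512i val   = _mm512_setzero_si512();   // per-lane working value
    __m512i steps = _mm512_setzero_si512();   // per-lane remaining steps
    const __m512i one = _mm512_set1_epi32(1);
    __mmask16 idle = 0xFFFF;                  // all 16 lanes start empty
    size_t read = 0, written = 0;

    while (read + 16 <= n) {                  // tail and final drain omitted
        // Refill the idle lanes with the next popcount(idle) input values.
        val = _mm512_mask_expandloadu_epi32(val, idle, in + read);
        read += (size_t)_mm_popcnt_u32((unsigned)idle);
        // Assign each refilled lane 1..8 steps of work, derived from its value.
        __m512i fresh = _mm512_and_si512(val, _mm512_set1_epi32(7));
        steps = _mm512_mask_add_epi32(steps, idle, fresh, one);

        // One step of "work" on all (now full) lanes; real code would do a
        // pointer chase, probe, or other divergent computation here.
        val   = _mm512_add_epi32(val, one);
        steps = _mm512_sub_epi32(steps, one);

        // Lanes that finished: write their results densely and mark them idle,
        // so the next iteration refills exactly those lanes.
        __mmask16 done = _mm512_cmpeq_epi32_mask(steps, _mm512_setzero_si512());
        _mm512_mask_compressstoreu_epi32(out + written, done, val);
        written += (size_t)_mm_popcnt_u32((unsigned)done);
        idle = done;
    }
    return written;
}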
Conference Paper
Advanced processor architectures have been driving new designs, implementations and optimizations of main-memory hash join algorithms recently. The newly released Intel Xeon Phi many-core processor of the Knights Landing architecture (KNL) embraces interesting hardware features such as many low-frequency out-of-order cores connected on a 2D mesh, and high-bandwidth multi-channel memory (MCDRAM). In this paper, we experimentally revisit the state-of-the-art main-memory hash join algorithms to study how the new hardware features of KNL affect the algorithmic design and tuning as well as to identify the opportunities for further performance improvement on KNL. Our experiments show that, although many existing optimizations are still valid on KNL with proper tuning, even the state-of-the-art algorithms have severely underutilized the memory bandwidth and other hardware resources.
Article
In-memory database management systems (DBMSs) are a key component of modern on-line analytic processing (OLAP) applications, since they provide low-latency access to large volumes of data. Because disk accesses are no longer the principal bottleneck in such systems, the focus in designing query execution engines has shifted to optimizing CPU performance. Recent systems have revived an older technique of using just-in-time (JIT) compilation to execute queries as native code instead of interpreting a plan. The state-of-the-art in query compilation is to fuse operators together in a query plan to minimize materialization overhead by passing tuples efficiently between operators. Our empirical analysis shows, however, that more tactful materialization yields better performance. We present a query processing model called "relaxed operator fusion" that allows the DBMS to introduce staging points in the query plan where intermediate results are temporarily materialized. This allows the DBMS to take advantage of inter-tuple parallelism inherent in the plan using a combination of prefetching and SIMD vectorization to support faster query execution on data sets that exceed the size of CPU-level caches. Our evaluation shows that our approach reduces the execution time of OLAP queries by up to 2.2× and achieves up to 1.8× better performance compared to other in-memory DBMSs.
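The staging idea can be sketched as follows; this is an illustration under simplifying assumptions of its own (the scanned column doubles as the join key, a multiplicative hash, and a hash table without collision chains), not code from the system. The selection materializes qualifying row ids into a small cache-resident buffer and prefetches the hash-table buckets they will touch, and only once the buffer is full does the probe stage run over the whole batch, by which time most prefetches have completed.

#include <immintrin.h>
#include <cstdint>
#include <cstddef>
#include <vector>

struct Bucket { int32_t key; int32_t payload; };

// Selection followed by a hash probe, with a staging point in between.
void scan_stage_probe(const int32_t* col, size_t n, int32_t bound,
                      const Bucket* table, uint32_t mask,
                      std::vector<int64_t>& results) {
    constexpr size_t kStage = 1024;           // staging buffer sized to fit in cache
    uint32_t rid[kStage];
    uint32_t bucket[kStage];
    size_t filled = 0;

    for (size_t i = 0; i < n; ++i) {
        if (col[i] < bound) {                  // operator 1: selection
            uint32_t b = ((uint32_t)col[i] * 2654435761u) & mask;
            rid[filled] = (uint32_t)i;
            bucket[filled] = b;
            _mm_prefetch((const char*)&table[b], _MM_HINT_T0);
            ++filled;
        }
        if (filled == kStage || (i + 1 == n && filled > 0)) {
            // operator 2: probe the whole staged batch; the buckets were
            // prefetched above, so most of these accesses hit the cache.
            for (size_t j = 0; j < filled; ++j) {
                const Bucket& bk = table[bucket[j]];
                if (bk.key == col[rid[j]])
                    results.push_back(((int64_t)rid[j] << 32) | (uint32_t)bk.payload);
            }
            filled = 0;
        }
    }
}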
Article
In-memory databases require careful tuning and many engineering tricks to achieve good performance. Such database performance engineering is hard: a plethora of data and hardware-dependent optimization techniques form a design space that is difficult to navigate for a skilled engineer, and even more so for a query compiler. To facilitate performance-oriented design exploration and query plan compilation, we present Voodoo, a declarative intermediate algebra that abstracts the detailed architectural properties of the hardware, such as multi- or many-core architectures, caches and SIMD registers, without losing the ability to generate highly tuned code. Because it consists of a collection of declarative, vector-oriented operations, Voodoo is easier to reason about and tune than low-level C and related hardware-focused extensions (Intrinsics, OpenCL, CUDA, etc.). This enables our Voodoo compiler to produce (OpenCL) code that rivals and even outperforms the fastest state-of-the-art in-memory databases for both GPUs and CPUs. In addition, Voodoo makes it possible to express techniques as diverse as cache-conscious processing, predication and vectorization (again on both GPUs and CPUs) with just a few lines of code. Central to our approach is a novel idea we termed control vectors, which allows a code-generating frontend to expose parallelism to the Voodoo compiler in an abstract manner, enabling portable performance across hardware platforms. We used Voodoo to build an alternative backend for MonetDB, a popular open-source in-memory database. Our backend allows MonetDB to perform at the same level as highly tuned in-memory databases, including HyPer and Ocelot. We also demonstrate Voodoo's usefulness when investigating hardware-conscious tuning techniques, assessing their performance on different queries, devices and data.
Conference Paper
Relational equi-joins are at the heart of almost every query plan. They have been studied, improved, and reexamined on a regular basis since the existence of the database community. In the past four years several new join algorithms have been proposed and experimentally evaluated. Some of those papers contradict each other in their experimental findings. This makes it surprisingly hard to answer a very simple question: what is the fastest join algorithm in 2015? In this paper we will try to develop an answer. We start with an end-to-end black box comparison of the most important methods. Afterwards, we inspect the internals of these algorithms in a white box comparison. We derive improved variants of state-of-the-art join algorithms by applying optimizations like software write-combine buffers, various hash table implementations, as well as NUMA-awareness in terms of data placement and scheduling. We also inspect various radix partitioning strategies. Eventually, we are in the position to perform a comprehensive comparison of thirteen different join algorithms. We factor in scaling effects in terms of size of the input datasets, the number of threads, different page sizes, and data distributions. Furthermore, we analyze the impact of various joins on an (unchanged) TPC-H query. Finally, we conclude with a list of major lessons learned from our study and a guideline for practitioners implementing massive main-memory joins. As is the case with almost all algorithms in databases, we will learn that there is no single best join algorithm. Each algorithm has its strength and weaknesses and shines in different areas of the parameter space.
Conference Paper
We live in the golden age of distributed computing. Public cloud platforms now offer virtually unlimited compute and storage resources on demand. At the same time, the Software-as-a-Service (SaaS) model brings enterprise-class systems to users who previously could not afford such systems due to their cost and complexity. Alas, traditional data warehousing systems are struggling to fit into this new environment. For one thing, they have been designed for fixed resources and are thus unable to leverage the cloud's elasticity. For another thing, their dependence on complex ETL pipelines and physical tuning is at odds with the flexibility and freshness requirements of the cloud's new types of semi-structured data and rapidly evolving workloads. We decided a fundamental redesign was in order. Our mission was to build an enterprise-ready data warehousing solution for the cloud. The result is the Snowflake Elastic Data Warehouse, or "Snowflake" for short. Snowflake is a multi-tenant, transactional, secure, highly scalable and elastic system with full SQL support and built-in extensions for semi-structured and schema-less data. The system is offered as a pay-as-you-go service in the Amazon cloud. Users upload their data to the cloud and can immediately manage and query it using familiar tools and interfaces. Implementation began in late 2012 and Snowflake has been generally available since June 2015. Today, Snowflake is used in production by a growing number of small and large organizations alike. The system runs several million queries per day over multiple petabytes of data. In this paper, we describe the design of Snowflake and its novel multi-cluster, shared-data architecture. The paper highlights some of the key features of Snowflake: extreme elasticity and availability, semi-structured and schema-less data, time travel, and end-to-end security. It concludes with lessons learned and an outlook on ongoing work.