The VLDB Journal (2020) 29:1243–1261
https://doi.org/10.1007/s00778-020-00621-w
SPECIAL ISSUE PAPER
VIP: A SIMD vectorized analytical query engine
Orestis Polychroniou¹ · Kenneth A. Ross²
Received: 27 January 2020 / Revised: 10 June 2020 / Accepted: 22 June 2020 / Published online: 13 July 2020
© Springer-Verlag GmbH Germany, part of Springer Nature 2020
Abstract
Query execution engines for analytics are continuously adapting to the underlying hardware in order to maximize performance. Wider SIMD registers and more complex SIMD instruction sets are emerging in mainstream CPUs, as well as in new processor designs such as the many-core Intel Xeon Phi CPUs, which rely on SIMD vectorization to achieve high performance per core while packing a greater number of smaller cores per chip. In the database literature, using SIMD to optimize stand-alone operators with key–rid pairs is common, yet state-of-the-art query engines rely on compilation of tightly coupled operators, where hand-optimizing individual operators becomes impractical. In this article, we extend a state-of-the-art analytical query engine design by combining code generation and operator pipelining with SIMD vectorization, and show that the SIMD speedup is diminished when execution is dominated by random memory accesses. To better utilize the hardware features, we introduce VIP, an analytical query engine designed and built bottom up from pre-compiled column-oriented data-parallel sub-operators and implemented entirely in SIMD. In our evaluation using synthetic and TPC-H queries on a many-core CPU, we show that VIP outperforms hand-optimized query-specific code without incurring runtime compilation overhead, and we highlight the efficiency of VIP at utilizing the hardware features of many-core CPUs.
Keywords Query execution · Modern hardware · OLAP · SIMD · Vectorization
1 Introduction
Hardware-conscious database design and implementation are a topic of ongoing research due to the profound impact of modern hardware advances on query execution. Large main memory capacity and multi-core CPUs raised the bar for efficient in-memory execution. Databases diverged to focus on transactional, analytical, scientific, or other workloads. Storage and execution, narrowed down to specific workloads, were redesigned by adapting to the new hardware dynamics.
In analytical databases, columnar storage is now standard, since most queries access a few columns from a large number of tuples, in contrast to transactions that update a small number of tuples. However, analytical query engines are based on multiple distinctive designs, including column-oriented and row-oriented execution, interpretation and runtime compilation, cache-conscious execution, and operator pipelining.

This article is an extension of earlier published work [41], done while the first author was affiliated with Columbia University, and supported by NSF Grant IIS-1422488 and an Oracle gift.

Orestis Polychroniou (corresponding author)
orestis@amazon.com
Kenneth A. Ross
kar@cs.columbia.edu

1 Amazon Web Services, Palo Alto, USA
2 Columbia University, New York, USA
Efficient in-memory execution requires low interpretation cost, optimized memory access, and high CPU efficiency. Low interpretation cost is coupled with high instruction-level parallelism and is achieved by processing entire columns [5,29], batches of tuples per iterator call [6,9,10], or by compiling query-specific code at runtime [15,21,33]. Memory access can be optimized by combining operators to avoid materializing the pipelined stream [14], or by using partitioning to avoid cache and TLB misses [30]. Data parallelism is achieved via SIMD vectorization. Linear access operators such as scans and compression [23,40,54,55] are naturally data parallel and easy to vectorize. For other operators such as sorting, the common approach is to use ad hoc SIMD optimizations [8,16,18,24,37,38,44,45,47]. Recently, we introduced SIMD vectorization for nonlinear access operators, such as hash tables and partitioning [36,39,50].
The advent of the many-core platforms known as Intel Xeon Phi shows that the trade-off between many simple cores and fewer complex cores is being revisited. Fewer complex cores