Vectorwise: Beyond Column Stores
Marcin Zukowski, Actian, Amsterdam, The Netherlands
Peter Boncz, CWI, Amsterdam, The Netherlands
Abstract
This paper tells the story of Vectorwise, a high-performance analytical database system, from multiple
perspectives: its history from academic project to commercial product, the evolution of its technical
architecture, customer reactions to the product and its future research and development roadmap.
One take-away from this story is that the novelty in Vectorwise is much more than just column-storage:
it boasts many query processing innovations in its vectorized execution model, and an adaptive mixed
row/column data storage model with indexing support tailored to analytical workloads.
Another one is that there is a long road from research prototype to commercial product, though database
research continues to achieve a strong innovative influence on product development.
1 Introduction
The history of Vectorwise goes back to 2003 when a group of researchers from CWI in Amsterdam, known for
the MonetDB project [5], invented a new query processing model. This vectorized query processing approach
became the foundation of the X100 project [6]. In the following years, the project served as a platform for
further improvements in query processing [23, 26] and storage [24, 25]. Initial results of the project showed
impressive performance improvements both in decision support workloads [6] as well as in other application
areas like information retrieval [7]. Since the commercial potential of the X100 technology was apparent, CWI
spun out this project and founded Vectorwise BV as a company in 2008. Vectorwise BV decided to combine
the X100 processing and storage components with the mature higher-layer database components and APIs of
the Ingres DBMS, a product of Actian Corp. After two years of cooperation between the developer teams, and
delivery of the first versions of the integrated product aimed at the analytical database market, Vectorwise was
acquired and became a part of Actian Corp.
2 Vectorwise Architecture
The upper layers of the Vectorwise architecture consist of Ingres, providing database administration tools, con-
nectivity APIs, SQL parsing and a cost-based query optimizer based on histogram statistics [13]. The lower
layers come from the X100 project, delivering cutting-edge query execution and data storage [21], outlined in
Figure 1. The details of combining these two platforms are described in [11]. Here we focus
on how the most important feature of Vectorwise, dazzling query execution speed, was carefully preserved and
improved from its inception in an academic prototype into a full-fledged database product.
Figure 1: A simplified architecture of the Vectorwise kernel (vectorized parallel query execution; compressed NSM/DSM storage; scan-optimized buffer manager and I/O; fast transactional updates via PDTs).
Figure 2: Latest official 1TB TPC-H performance results (non-clustered), in publication order, as of March 18, 2012, comparing Oracle, SybaseIQ, SQL Server and Vectorwise systems on composite queries per hour (QphH) and price/QphH. Source: www.tpc.org, March 2012.
Data Storage. While Vectorwise provides great performance for memory-resident data sets, it also enables efficient analysis of much larger, disk-resident datasets when deployed on a high-bandwidth IO subsystem (typically locally attached), often with performance close to that of buffered data. A number of techniques are applied to achieve this.
Vectorwise stores data using a generalized row/column storage based on PAX [2]. A table is stored in
multiple PAX partitions, each of which contains a group of columns. This supports both “DSM/PAX” (each column in a separate PAX group) and “NSM/PAX” (all columns in one PAX group), as well as all options in between. The terminology is chosen from the IO perspective: disk blocks containing data from only one column we call DSM/PAX, and blocks containing all columns we call NSM/PAX (the latter is simply called PAX in [2]).
The exact grouping of a given table in PAX partitions can be controlled by explicit DDL, but in the absence of this, it is self-tuned. One odd-ball example of this is nullable columns. Vectorwise represents nullable columns internally as a column containing the values plus a boolean column that indicates whether each value is NULL or not. The motivation is query processing efficiency: testing each tuple for being NULL slows down predicate evaluation, due to hard-to-predict branching CPU instructions, and because the presence of NULLs prevents the use of SIMD instructions. Often, however, NULL testing can be skipped and queries can be processed by ignoring the NULL column altogether, so separating the two representations makes sense. The boolean NULL columns are one of the ways PAX partitions are used automatically in Vectorwise: each nullable attribute stores its value and NULL columns in the same PAX partition.
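To illustrate why the split representation pays off, the following sketch (our own simplification, not Vectorwise source code; names such as select_less_than are hypothetical) shows a vectorized selection primitive in two variants: one that ignores the NULL column entirely when the optimizer knows no NULLs occur, and one that folds the separate boolean NULL vector into the predicate as a cheap data dependency instead of a per-tuple branch.

#include <cstddef>
#include <cstdint>

// Variant used when the vector is known to contain no NULLs:
// a tight, branch-free loop that compilers can unroll and SIMD-ize.
size_t select_less_than(size_t n, uint32_t* sel,
                        const int64_t* val, int64_t bound) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = (uint32_t)i;          // tentatively append position i
        k += (val[i] < bound);         // keep it only if the predicate holds
    }
    return k;                          // number of qualifying positions
}

// Variant that also consults the separate boolean NULL column: the NULL
// test is a cheap AND on data, not a hard-to-predict branch per tuple.
size_t select_less_than_nullable(size_t n, uint32_t* sel, const int64_t* val,
                                 const uint8_t* is_null, int64_t bound) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = (uint32_t)i;
        k += (is_null[i] == 0) & (val[i] < bound);
    }
    return k;
}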
Another default PAX policy is to store composite (multi-column) primary keys automatically in the same
partition. NSM/PAX storage (i.e. a single PAX partition) is used for small tables, where a DSM representation would waste a lot of space, using one (almost) empty disk block per column. The overhead of such empty blocks is higher in Vectorwise than in traditional systems, since Vectorwise uses relatively large block sizes, typically 512KB on magnetic disks (or 32KB for SSDs [4]).
A more advanced PAX grouping algorithm is used for very wide tables, to automatically cluster certain columns together in PAX partitions. The reason to limit the number of PAX groups in which a table is stored lies in the buffer memory needed for table scans: a scan needs to allocate multiple blocks per physical disk device for each PAX partition, which for the widest tables encountered in practice (hundreds of columns) would otherwise lead to many GBs needed for a single scan.
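As a rough, purely illustrative calculation (the buffering parameters here are assumptions, not published figures): a fully decomposed table of 500 columns scanned over 4 physical disk devices, with 2 in-flight 512KB blocks per device and per PAX group, would already require about 500 x 4 x 2 x 512KB, or roughly 2GB of buffer memory for a single scan; grouping those columns into a few dozen PAX partitions shrinks that requirement proportionally.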
Data on disk is stored in compressed form, using automatically selected compression schemes and automatically tuned parameters. Vectorwise only uses compression schemes that allow very high decompression rates,
with a cost of only a few cycles per tuple [24]. Thanks to the very low overhead of decompression, it is possible to keep data compressed in the buffer pool and decompress it immediately before query processing. This increases the effective size of the buffer pool, further reducing the need for IO. We initially stayed away from compressed execution [1], because it can complicate the query executor, and our extremely high decompression speed and fast vectorized execution limit the benefit of compressed execution in many common cases. However, there are some high-benefit cases for compressed execution, such as aggregation on RLE data (which can be an order of magnitude less effort) or operations on dictionary-compressed strings (which convert a string comparison into a much cheaper integer or even SIMD comparison), so recently we have worked on simple forms of compressed execution [14].
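As an illustration of the RLE case (a conceptual sketch under assumed data structures, not the Vectorwise implementation), summing a run-length-encoded column needs one multiply-add per run instead of one addition per tuple:

#include <cstddef>
#include <cstdint>

struct RleRun {        // a run of 'count' consecutive, identical values
    int64_t  value;
    uint32_t count;
};

// SUM over decompressed data: one addition per tuple.
int64_t sum_decompressed(const int64_t* vals, size_t n) {
    int64_t s = 0;
    for (size_t i = 0; i < n; i++) s += vals[i];
    return s;
}

// SUM directly on RLE runs: one multiply-add per run, so for long runs
// the work drops by roughly the average run length.
int64_t sum_rle(const RleRun* runs, size_t num_runs) {
    int64_t s = 0;
    for (size_t r = 0; r < num_runs; r++)
        s += runs[r].value * (int64_t)runs[r].count;
    return s;
}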
While its strength is in fast scans, Vectorwise allows users in its DDL to declare one index per table; this simply means that the physical tuple order becomes determined by the index keys rather than by insertion order. As Vectorwise keeps a single copy of all data, only one such index declaration is allowed per table, so for users it is similar to a clustered index. The main benefit of a clustered index is push-down of range predicates on the index keys. When an index is declared on a foreign key, treatment is special: the tuple order then gets derived from that of the referenced table, which accelerates foreign key joins between these tables. In the future, we expect to extend this functionality towards multi-dimensional indexing, where tables are co-clustered on multiple dimensions such that multiple kinds of selection predicates get accelerated, as well as foreign key joins between co-clustered tables (tables clustered on a common dimension).
Vectorwise automatically keeps so-called MinMax indices on all columns. MinMax indices, based on the idea of small materialized aggregates [15], store simple metadata about the values in a given range of records, such as Min and Max values. They allow quick elimination of record ranges during scan operations. MinMax indices are heavily consulted during query rewriting, and are effective in eliminating IO if there are correlations between attribute values and tuple position. In data warehouses, fact table order is often time-related, so date-time columns typically have such correlations. The previously mentioned use of sorted tables (clustered indices) is a direct source of correlation between position and column key values. Thanks to MinMax indices, the rewriter will restrict table scans with range-selections on any position-correlated columns.
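A minimal sketch of how such metadata drives scan elimination (data structures and names are our own, chosen for illustration): for a range predicate on a position-correlated column, only record ranges whose min/max interval overlaps the predicate need to be read.

#include <cstdint>
#include <utility>
#include <vector>

struct MinMaxEntry {       // summary of one range of records for one column
    int64_t  min_val;
    int64_t  max_val;
    uint64_t first_row;    // first record position covered
    uint64_t last_row;     // last record position covered
};

// Return the record ranges that may contain rows with lo <= value <= hi;
// all other ranges can be skipped without any IO.
std::vector<std::pair<uint64_t, uint64_t>>
ranges_to_scan(const std::vector<MinMaxEntry>& idx, int64_t lo, int64_t hi) {
    std::vector<std::pair<uint64_t, uint64_t>> out;
    for (const MinMaxEntry& e : idx)
        if (e.max_val >= lo && e.min_val <= hi)
            out.emplace_back(e.first_row, e.last_row);
    return out;
}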
In the IO and buffer pool layers, Vectorwise focuses on providing optimal performance for concurrent scan-intensive queries, typical for analytical workloads. For this purpose, the X100 project originally proposed cooperative scans [25], where table scans accept data out-of-order, and an Active Buffer Manager (ABM) determines the order in which to fetch tuples at runtime, depending on the interest of all concurrent queries, optimizing both average query latency and throughput. The ABM is quite a complex component that influences the system architecture considerably, so in the product version of Vectorwise we have switched to a less radical, but still highly effective, variant of intelligent data buffering that to a large extent achieves the same goals [19].
The final element of the Vectorwise storage layer is high-performance updates, using a differential update mechanism based on Positional Delta Trees (PDTs) [10]. A three-level design of PDTs, with one very small PDT private to the transaction, one shared CPU-cache-resident PDT, and one potentially large RAM-resident PDT, offers snapshot isolation without slowing down read-only queries in any way. The crucial feature of PDTs as a differential structure is that they organize differences by position rather than by key value; therefore the task of merging in differences during a table scan has virtually no cost, as it does not involve costly key comparisons (nor key scans).
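The following sketch (a deliberate simplification that handles only in-place value updates; the real PDTs also cover inserts and deletes and are organized as trees) conveys why positional merging is so cheap: the scan only compares its current position against the next delta position, never a key.

#include <cstddef>
#include <cstdint>
#include <vector>

struct PosDelta {          // simplified positional delta: a value modification
    uint64_t position;     // table position it applies to, in ascending order
    int64_t  new_value;
};

// Produce the visible image of a column slice [start, start + n) by merging
// stable (on-disk) values with positional deltas; 'cursor' advances across
// successive calls of the same scan.
void merge_positional_deltas(uint64_t start, size_t n, const int64_t* stable,
                             const std::vector<PosDelta>& deltas,
                             size_t& cursor, int64_t* out) {
    for (size_t i = 0; i < n; i++) {
        out[i] = stable[i];
        if (cursor < deltas.size() && deltas[cursor].position == start + i)
            out[i] = deltas[cursor++].new_value;  // position match, no key lookup
    }
}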
Query Execution. The core technology behind the high processing speeds of Vectorwise is its vectorized processing model [6]. It dramatically reduces the interpretation overhead typically found in tuple-at-a-time processing systems. Additionally, it opens up possibilities to exploit performance-critical features of modern CPUs such as super-scalar execution and SIMD instructions. Finally, thanks to its focus on keeping data in the CPU cache, main-memory traffic is reduced, which is especially important in modern multi-core systems.
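As a concrete (again, illustrative rather than literal) contrast with tuple-at-a-time interpretation: a vectorized primitive amortizes the interpretation overhead of one function call over an entire cache-resident vector, and its loop body is simple enough for the compiler to unroll and vectorize with SIMD.

#include <cstddef>
#include <cstdint>

// Assumed vector length, chosen so inputs and outputs stay CPU-cache resident.
static const size_t VECTOR_SIZE = 1024;

// res[i] = a[i] + b[i] for n tuples (n <= VECTOR_SIZE): one interpreted
// "instruction" per vector instead of one per tuple.
size_t map_add_int64(size_t n, int64_t* res,
                     const int64_t* a, const int64_t* b) {
    for (size_t i = 0; i < n; i++)
        res[i] = a[i] + b[i];
    return n;
}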
The vectorized execution model was further improved, including (i) lazy vectorized expression evaluation, (ii) choosing different function implementations depending on the environment, (iii) pushing up selections if this
enables more SIMD predicate evaluation [9], and (iv) NULL-processing optimizations. Further, strict adherence to a vertical (columnar) layout in all operations was dropped, and Vectorwise now uses an NSM record layout during execution for (parts of) tuples where the access pattern makes this more beneficial (mostly in hash tables) [26]. Additionally, Volcano-style parallelism based on exchange operators has been added, allowing Vectorwise to efficiently scale to multiple cores [3]. Many internal operations were further improved, e.g. highly efficient Bloom filters were applied to speed up join processing. Another example of such improvements was cooperation with Intel to use new CPU features such as large TLB pages, and to exploit SSE4.2 instructions for optimizing the processing of text data [20].
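To give a flavor of the Bloom-filter improvement (a generic sketch with an assumed hash function and layout, not the actual Vectorwise operator): before probing a hash join, probe keys that the filter proves absent from the build side are compacted away, so the expensive hash-table lookups run only on likely matches.

#include <cstddef>
#include <cstdint>
#include <vector>

struct BloomFilter {
    std::vector<uint64_t> words;   // bit array; total bit count is a power of two
    uint64_t bit_mask;             // words.size() * 64 - 1

    bool maybe_contains(uint64_t key) const {
        uint64_t h  = key * 0x9E3779B97F4A7C15ULL;   // cheap mixing hash
        uint64_t b1 = h & bit_mask;
        uint64_t b2 = (h >> 32) & bit_mask;
        return ((words[b1 >> 6] >> (b1 & 63)) & 1) &&
               ((words[b2 >> 6] >> (b2 & 63)) & 1);
    }
};

// Vectorized pre-filter: rewrite the selection vector to keep only the
// probe keys that may have a match on the build side of the join.
size_t bloom_prefilter(const BloomFilter& bf, size_t n,
                       const uint64_t* keys, uint32_t* sel) {
    size_t k = 0;
    for (size_t i = 0; i < n; i++) {
        sel[k] = (uint32_t)i;
        k += bf.maybe_contains(keys[i]) ? 1 : 0;
    }
    return k;
}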
On the innovation roadmap for query execution are execution on compressed data and the intelligent use of just-in-time (JIT) compilation of complex predicates – only in those situations where this actually brings benefits over vectorized execution [17, 18]. All these innovations are aimed at bolstering the position of Vectorwise as the query execution engine that gets the most work done per CPU cycle. An additional project to introduce MPP cluster capabilities is underway to make Vectorwise available in a scale-out architecture.
3 Vectorwise Experiences
While the history of X100 goes back all the way to 2003, for many years it was only a research prototype offering very low-level interfaces for data storage and processing [22]. After the birth of Vectorwise BV in the summer of 2008, this prototype quickly started evolving into an industry-strength component, combined with the mature Ingres upper layers. Over the course of less than two years the system grew into a full-fledged product, leading to the release of Vectorwise 1.0 in June 2010. This release was met with a highly positive reception thanks to its unparalleled processing performance. Still, it faced a number of challenges typical for young software projects. A number of users missed features, available in other products, that were necessary to migrate to a new system. Also, some features initially did not meet expectations, especially around updates. Finally, exposing the product to a large number of new users revealed a sizable number of stability problems.
Over the following 18 months, a number of updates were released, providing many requested features, optimizing many elements of the system and dramatically increasing system stability. Vectorwise 2.0, released in November 2011, is a solid product providing features like optimized loading, full transactional support, better storage management, parallel execution, temporary tables, major parts of analytical SQL:1999 and disk-spilling operations. It also supports dozens of new functions, both from the SQL standard and some used by other systems, making migrations easier.
To make Vectorwise adoption easier, a lot of effort has been invested to make sure it works well with popular
tools. As a result, it is now certified with products like Pentaho, Jaspersoft, SAP Business Objects, MicroStrategy, IBM Cognos, Tableau and Yellowfin. Another critical milestone was the release of a fully functional Windows version, making Vectorwise one of the very few analytical DBMS systems available for that platform.
To demonstrate the performance and scalability of Vectorwise, a series of record-breaking TPC-H bench-
marks were published – as of March 18, 2012, Vectorwise continues to hold the leadership in the single-node
100GB to 1TB results (see Figure 2 for 1TB results).
Customer Reactions. Since the early releases of Vectorwise, users and analysts have been highly impressed
with its performance. In a number of cases, the improvement was so large that customers believed the system must be doing some sort of query result caching (which it does not). High performance also resulted in customers adopting previously impossible approaches to using their databases. Typical examples include:
Removing indices. The combination of efficient in-memory and on-disk scan performance with optimized filtering delivers performance better than what previous systems achieved using indexing.
Normalizing tables. Thanks to quick data transformations, large-volume data normalizations are now
possible, improving performance and reducing storage volume.
De-normalizing tables. In contrast to the above, some users find Vectorwise performance with de-normalized tables more than sufficient and prefer that approach due to a simplified schema and loading.
Running on the raw data. Many customers now avoid performing expensive data precomputation, as raw-data processing performance is fast enough.
Full data reloads. All of the above, combined with the high loading performance of Vectorwise, make the data loading process faster and simpler. As a result, a full data reload is now a feasible method for many customers.
Improved efficiency, combined with the possibility to simplify data storage and management, translates directly into a reduced need for hardware and human resources.
While the high performance delivered by Vectorwise receives a lot of praise, system adoption would not be possible without the technical and organizational contributions of the much more mature Ingres product. On the technical side, users appreciate a wide range of connectivity options, solid SQL support and an ever-growing number of available tools. Even more importantly, users praise the worldwide, 24/7 support capability and the active involvement of pre-sales, support and engineering teams with their POC and production systems.
Adoption Challenges. While Vectorwise capabilities provide great value to many customers, its adoption also faces a number of challenges. Many issues are related to migrations from older DBMSs. Off-line data migration is relatively easy, either using manual methods or with the support of ETL tools. On-line data transfer from transactional systems poses a bigger challenge, and is discussed below. Migration of different SQL flavors with a plethora of non-standard functions turns out to be a relatively simple, but laborious process. The hardest problem is application logic stored in the DBMS, e.g. PL/SQL – migrating complex systems that use this approach turns out to be extremely labor-intensive.
Vectorwise was originally designed with the idea that data would be loaded relatively rarely. However, once users got accustomed to high processing speeds, they requested the ability to use it on much more up-to-date data, including sub-second data loading latency. To address that, Vectorwise quickly improved its incremental-load as well as data-update capabilities, also providing full ACID properties. Additionally, Actian offers Vectorstream: a separate product/service that enables very low-latency data loads into Vectorwise.
Another set of challenges was related to complex database schemas. Scenarios with hundreds of databases,
many thousands of tables, and tables with many thousands of attributes stressed the system capabilities, calling
for schema reorganizations as well as numerous system improvements.
4 Vectorwise Research Program
Vectorwise has strong roots in academia, and a continued research track is an important part of its technical
innovation roadmap. Vectorwise offers academic institutions source code access under a research license. Apart
from CWI, licensees include the universities of Ilmenau, Tuebingen and Edinburgh, and the Barcelona Super-
computing Center. Actian Corp. is also sponsoring a number of PhD students at these institutes.
In cooperation with the Vrije Universiteit Amsterdam and the University of Warsaw, multiple MSc projects have
been pursued. Completed topics include: Volcano-style multi-core parallelism in Vectorwise [3], just-in-time
compilation of predicates [17, 18], non-intrusive mechanisms for query execution on compressed data [14], ma-
terialization and caching of interesting intermediate query results in order to accelerate a query workload by
re-using results [16] (i.e. adapting the Recycler [12] idea to pipelined query execution), and buffer manage-
ment policies that make concurrent queries cooperate rather than fight for IO [19], using an approach that is
less system-intrusive than so-called Cooperative Scans [25]. There has also been work on XML storage and
processing in Vectorwise [8].
Research activities continue with a number of projects including further improving vectorized execution
performance, accelerating processing with multi-dimensional data organization, and improving performance
and scalability of Vectorwise in an MPP architecture.
5 Conclusion
This paper gave a short overview of the Vectorwise system, focusing on its origins, technology, user reception
and adoption challenges. It shows that achieving really high performance requires much more than just “column
storage”. Additionally, we discuss other elements required to find adoption in the market: functionality, usability,
support capabilities and a strong future roadmap.
Less than two years since its first product release, Vectorwise continues to make rapid progress. This includes
usability advances such as storage management, backup functionality and rich SQL support encompassing func-
tions, data types and analytical features. Internal and external research activities have created a solid innovation
pipeline that will bolster and improve the performance of the product in the future.
References
[1] D. J. Abadi. Query Execution in Column-Oriented Database Systems. PhD thesis, MIT, 2008.
[2] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving Relations for Cache Performance. In VLDB, 2001.
[3] K. Anikiej. Multi-core parallelization of vectorized query execution. MSc thesis, Vrije Universiteit Amsterdam, 2010.
[4] S. Baumann, G. de Nijs, M. Strobel, and K.-U. Sattler. Flashing databases: expectations and limitations. In DaMoN, 2010.
[5] P. Boncz. Monet: a Next-Generation DBMS Kernel For Query-Intensive Applications. PhD thesis, Universiteit van Amsterdam, 2002.
[6] P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: Hyper-pipelining query execution. In CIDR, 2005.
[7] R. Cornacchia, S. Héman, M. Zukowski, A. P. de Vries, and P. Boncz. Flexible and Efficient IR using Array Databases. VLDB Journal, 17(1), 2008.
[8] T. Grust, J. Rittinger, and J. Teubner. Pathfinder: XQuery Off the Relational Shelf. DEBULL, 31(4), 2008.
[9] S. Héman, N. Nes, M. Zukowski, and P. Boncz. Vectorized data processing on the Cell Broadband Engine. In DaMoN, 2007.
[10] S. Héman, M. Zukowski, N. Nes, L. Sidirourgos, and P. Boncz. Positional update handling in column stores. In SIGMOD, 2010.
[11] D. Inkster, M. Zukowski, and P. Boncz. Integration of VectorWise with Ingres. SIGMOD Record, 40(3), 2011.
[12] M. Ivanova, M. Kersten, N. Nes, and R. Gonçalves. An architecture for recycling intermediates in a column-store. In SIGMOD, 2009.
[13] R. Kooi. The Optimization of Queries in Relational Database Systems. PhD thesis, Case Western Reserve University, 1980.
[14] A. Luszczak. Simple Solutions for Compressed Execution in Vectorized Database System. MSc thesis, Vrije Universiteit Amsterdam, 2011.
[15] G. Moerkotte. Small materialized aggregates: A light weight index structure for data warehousing. In VLDB, 1998.
[16] F. Nagel. Recycling Intermediate Results in Pipelined Query Evaluation. MSc thesis, Tuebingen University, 2010.
[17] J. Sompolski. Just-in-time Compilation in Vectorized Query Execution. MSc thesis, Vrije Universiteit Amsterdam, 2011.
[18] J. Sompolski, M. Zukowski, and P. Boncz. Vectorization vs. compilation in query execution. In DaMoN, 2011.
[19] M. Switakowski. Integrating Cooperative Scans in a column-oriented DBMS. MSc thesis, Vrije Universiteit Amsterdam, 2011.
[20] Vectorwise. Ingres/VectorWise Sneak Preview on the Intel Xeon 5500 Platform. Technical report, 2009.
[21] M. Zukowski. Balancing Vectorized Query Execution with Bandwidth-Optimized Storage. PhD thesis, Universiteit van Amsterdam, 2009.
[22] M. Zukowski, P. A. Boncz, N. Nes, and S. Héman. MonetDB/X100 - A DBMS In The CPU Cache. DEBULL, 28(2), 2005.
[23] M. Zukowski, S. Héman, and P. Boncz. Architecture-Conscious Hashing. In DaMoN, 2006.
[24] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Super-Scalar RAM-CPU Cache Compression. In ICDE, 2006.
[25] M. Zukowski, S. Héman, N. Nes, and P. Boncz. Cooperative Scans: Dynamic Bandwidth Sharing in a DBMS. In VLDB, 2007.
[26] M. Zukowski, N. Nes, and P. Boncz. DSM vs. NSM: CPU Performance Tradeoffs in Block-Oriented Query Processing. In DaMoN, 2008.