An Empirical Evaluation of Variable-length Record
B+Trees on a Modern Graph Database System
Georgios Theodorakis
Neo4j, UK
george.theodorakis@neo4j.com
James Clarkson
Neo4j, UK
james.clarkson@neo4j.com
Jim Webber
Neo4j, UK
jim.webber@neo4j.com
Abstract—B+Trees are widely used as persistent index imple-
mentations for databases. They are often implemented in a way
that allows the index to be in main memory while the indexed
data remains on disk. Over the years, multiple optimization
techniques have been proposed to improve the efficiency of
B+Trees by accelerating the key search within a node or com-
pressing data based on common prefixes. This paper describes
our empirical research implementing such optimized B+Trees in
Neo4j, a modern graph database management system (DBMS).
We were able to confirm that the optimized versions lived up to
their performance claims over plain B+Trees when benchmarked
in isolation. However, we also found that incorporating them
into a real DBMS yields marginal improvements only. This is
partly because Neo4j is not index-heavy, typically only using
indexes to find starting points for graph traversals. The other part
is that integrating optimized indexes into the transactions and
page-based storage components of Neo4j incurs a performance
penalty (for reasons of crash-tolerance) compared to the stan-
dalone implementations. Given the additional implementation
and maintenance complexity of optimized B+Trees, our research
suggests that regular B+Trees remain the preferred general-
purpose implementation.
Index Terms—B+Trees, Graph Database Management Systems
I. INTRODUCTION
Database management systems (DBMS) have a long history
of using persistent indexes to improve performance. The
B+Tree is a popular choice for a DBMS because it allows
users to take advantage of the memory hierarchy. The smaller
index structure is held in RAM, the fast (but smaller) part of
the hierarchy, while the data set is held on the much larger (but
slower) disk. This makes index look-ups fast while allowing
for large data sets that exceed main memory capacity.
The workloads processed by a DBMS (including indexes)
are not random. Yet traditional B+Trees treat all accesses
equally without exploiting repetition or access patterns to
similar data. Often, the only commonality between subsequent
accesses is the index entries in an underlying disk cache.
There is scope for disk-based indexes to take advantage of
specific data access patterns and accelerate them to handle
general-purpose transactional processing efficiently. In response,
researchers have proposed novel optimizations for the index
data structures [1]–[5], as well as buffer managers [6], [7], and
caching strategies for hot and cold data blocks [8].
While fixed-length record B+Trees are amenable to clas-
sic database optimizations, such as SIMD parallel execu-
tion [4], [5], accelerating accesses on variable-length records
is more challenging. In this work, we focus on state-of-the-art
optimizations for variable-length record B+Trees (necessary
for indexing unstructured data) and incorporate them into a
production-grade graph database management system called
Neo4j [9], [10]. Our contributions are as follows:
(i) We study the performance of different variable-length
record B+Tree optimization techniques, such as cache-
efficient searches and prefix compression, as reported in recent
literature [1], [2] using a range of synthetic and real-world
datasets. We augment previous analysis and evaluate existing
optimizations under a different scope (e.g., how they affect
B+Tree operations such as node splits or range scans).
(ii) Having validated the performance of the aforementioned
optimizations in isolation, we integrate them into Neo4j.
We then evaluate them in a bottom-up fashion: (a) using
transactions; (b) adding Cypher query language [11] atop
transactions; (c) submitting Cypher queries over TCP. By
measuring B+Tree accesses using different execution layers,
we provide a realistic performance analysis of Neo4j.
Our results show that previous works have overlooked
systemic overheads (e.g., networking and transactions) when
evaluating such optimizations in academic prototypes, leading
to marginal gains in practice. Thus, a general-purpose B+Tree
index may still be a better overall implementation choice.
II. BACKGROUND
This section provides background on persistent indexes
focusing on approaches for variable-length records.
Variable-length record B+Trees. The main indexing data
structure of Neo4j is a B+Tree [10], [12] atop a buffer pool
of pages (also called page cache). B+Trees are optimized for
paged environments, i.e., accessing data at page granularity.
Every node is a page that stores sorted keys for efficient exact-
match search, range scans, and prefix search. Both keys and
values are transformed and stored as opaque byte sequences.
In this work, we focus on variable-length record B+Trees,
which are more challenging to optimize. Operations on fixed-
length records (e.g., storing integers) are usually well de-
fined [1], [4], [5] and involve cheap (memory-aligned) compar-
isons. In contrast, variable-length records (e.g., strings) require
storing additional metadata for disk serialization, introduce
irregular data accesses, and involve costly byte-wise compar-
isons to determine a valid lexicographical order. Next, we will
discuss optimizations found in the contemporary literature.
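To make the cost of such comparisons concrete, the following minimal Java sketch (our illustration, not Neo4j's code) shows the unsigned byte-wise lexicographic comparison that serialized variable-length keys require:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Sketch: variable-length keys are opaque byte sequences, so ordering
// them means an unsigned, byte-wise lexicographic comparison.
final class ByteKeys {
    static int compare(byte[] a, byte[] b) {
        // Java bytes are signed; an unsigned comparison is required
        // for a valid lexicographical order over raw bytes.
        return Arrays.compareUnsigned(a, b);
    }

    public static void main(String[] args) {
        byte[] k1 = "neo4j.com".getBytes(StandardCharsets.UTF_8);
        byte[] k2 = "neo4j.org".getBytes(StandardCharsets.UTF_8);
        System.out.println(compare(k1, k2) < 0); // true: "com" < "org"
    }
}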
Fig. 1: B+Tree page layouts — (a) a page with a header, an indirection
vector, and variable-length records (e.g., google.com, neo4j.com);
(b) the same page with prefix compression, storing the shared prefix
https://www. only once; (c) the same page with a prefix trie over the
shared prefix.
Storing variable-length records. The standard approach
for storing variable-length records within a page is shown
in Fig. 1a: an indirection vector with fixed-size entries contains
the byte offset of the actual record (depicted as a logical
pointer). Usually, the indirection vector grows from left to
right while the records grow from the opposite direction. At
the beginning of the page, the header contains information
about the node’s siblings, type of node (i.e., internal or leaf),
current generation (used for concurrency), or the key count.
Cache-efficient binary search. Performing a binary search
using the indirection vector requires dereferencing the pointer
(i.e., byte offset) to read the key. This operation leads to cache
faults, as the keys are stored in different cache lines than
the indirection vector entries. To avoid fetching the complete
key, a fixed-size prefix of the keys [1] is stored within the
indirection vector.¹ Therefore, we first perform an inexpensive
int comparison and retrieve the full key only if necessary.
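The following sketch illustrates this two-step search. It is our simplified rendering under assumed layout choices (8-byte slots holding a 4-byte big-endian key prefix and a 4-byte record offset; length-prefixed records), not Neo4j's actual page format:

import java.nio.ByteBuffer;
import java.util.Arrays;

// Sketch of a cache-efficient binary search: most comparisons are cheap
// int comparisons on the in-slot prefix; the full key is dereferenced
// only when the prefixes tie.
final class PrefixedSearch {
    static int search(ByteBuffer page, int slotBase, int keyCount, byte[] key) {
        int searchPrefix = prefixOf(key);
        int lo = 0, hi = keyCount - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int slot = slotBase + mid * 8;          // 8-byte slots
            int cmp = Integer.compareUnsigned(page.getInt(slot), searchPrefix);
            if (cmp == 0)                           // tie: fetch the full key
                cmp = Arrays.compareUnsigned(readKey(page, page.getInt(slot + 4)), key);
            if (cmp < 0) lo = mid + 1;
            else if (cmp > 0) hi = mid - 1;
            else return mid;
        }
        return -(lo + 1);                           // insertion point, as in Arrays.binarySearch
    }

    static int prefixOf(byte[] key) {               // first 4 bytes, zero-padded
        int p = 0;
        for (int i = 0; i < 4; i++) p = (p << 8) | (i < key.length ? key[i] & 0xFF : 0);
        return p;
    }

    static byte[] readKey(ByteBuffer page, int offset) {
        int len = page.getInt(offset);              // length precedes the payload
        byte[] key = new byte[len];
        page.get(offset, key, 0, len);              // absolute bulk get (Java 13+)
        return key;
    }
}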
Data compression. Fig. 1b shows a simplified B+Tree page
layout that stores the common prefix of all keys only once (i.e.,
https://www.). Prefix compression is usually derived from
the longest common prefix of the node’s fence keys [13]. This
optimization not only saves storage space but also accelerates
key comparisons, as we can ignore the bytes from the prefix.
Suffix truncation [1] is another common optimization that
stores in inner nodes only the smallest possible separator key,
chosen near the middle of the keys, when splitting a page.
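A minimal sketch of the prefix-compression mechanics just described, with hypothetical helper names (the real GBPTree code differs): the common prefix is derived once from the node's fence keys, and in-node comparisons then skip the prefix bytes entirely.

import java.util.Arrays;

// Sketch: derive the node's common prefix from its fence keys and
// compare search keys against stored suffixes only.
final class PrefixCompression {
    static int commonPrefixLength(byte[] lowFence, byte[] highFence) {
        int n = Math.min(lowFence.length, highFence.length);
        int i = 0;
        while (i < n && lowFence[i] == highFence[i]) i++;
        return i;
    }

    // Compare a search key against a stored suffix, ignoring the shared prefix.
    static int compareSkippingPrefix(byte[] key, int prefixLen, byte[] storedSuffix) {
        return Arrays.compareUnsigned(
                key, prefixLen, key.length,        // suffix of the search key
                storedSuffix, 0, storedSuffix.length);
    }
}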
When prefix compression cannot be applied to all the keys,
a prefix trie structure can be utilized. Such data structures com-
bine the advantages of prefix compression with more efficient
binary search. When searching within a node, we traverse the
prefix trie before reaching the indirection vector (see Fig. 1c).
The trie gives hints to avoid searching the whole key range and
improves cache locality. Finally, subsequent comparisons only
require checking the remaining suffixes. Given the promising
results reported in recent work [2], we analyze them in Sec. III.
¹In Fig. 1a, the records do not start with https://www. to show how
we store data with different prefixes.

Fig. 2: Embedded Tree instance — a span node (SpanNode @ 0x0053)
storing the prefix https://www.imdb.com/title/tt0, a leaf node
(LeafNode @ 0x0068) with decision boundaries 4 and 8, and a range
array (slots 0x0456–0x0460) holding the values 0, 0, 362, 646, 769,
and 812.

Apache Lucene. Next, we briefly describe Apache
Lucene [14], a library used by Neo4j to index string
properties for full-text search capabilities. Lucene uses an
inverted index data structure to efficiently store and retrieve
text-based data. In an inverted index, data is tokenized to
enable matching terms anywhere within the strings. While
other approaches exist for string indexing [15]–[17], we omit
their evaluation as they are not optimal for transactional
workloads, in which reads dominate the execution time.
III. EMBEDDED TREES: INDEX WITHIN B+TREE NODES
Having introduced indexing variable-length records and a
set of optimization techniques, let us now describe how we
created a single-threaded B+Tree prototype to validate their
benefits. We focus on Embedded Trees (ETrees) [2], a trie-
based search tree that accelerates B+Tree operations. First, we
provide an overview of ETrees. Next, we discuss our changes
to the original work before presenting our prototype.
A. Embedded Tree Overview
Performing binary search within a large B+Tree page
(i.e., around 32 KB) is an expensive operation that produces
unfriendly cache accesses and branch mispredictions. The
problem is amplified for variable-length records, which, unlike
ints or floats, cannot be cache-aligned without wasting space.
Therefore, using a secondary index to increase the page
locality of data accesses can lead to more efficient lookups.
An ETree is an immutable prefix trie stored within a B+Tree
node, constituting a structure called a B2Tree. As shown
in Fig. 1c, a lookup starts from the prefix trie before accessing
the indirection vector and finally the records’ suffixes. There-
fore, the trie serves a three-fold purpose: (i) accelerates binary
search by limiting the search space, increasing data locality;
(ii) accelerates key comparisons, as lookups must check only
the remaining suffixes; (iii) compresses the stored keys.
In Fig. 2, we show an instance of an ETree, which consists
of a position vector at the bottom (called a range array [2])
and nodes, i.e., rectangles with pointers to memory locations
(hex numbers). Given that ETrees are immutable, the range
array is used to support updates. It stores the logical position
of indirection vector slots of the B+Tree page. The traversal
Algorithm 1: Constructing an Embedded Tree
 1  Function BuildTree(cursor, ranges, keys, depth, start, end):
 2      nodeOffset ← cursor.getOffset()
 3      prefix ← GetCommonPrefix(keys, depth, start, end)
 4      if prefix != NULL then
 5          ranges.add(ranges.get(ranges.size() − 1))
 6          BuildTree(cursor, ranges, keys, depth + prefix.size(), start, end)
 7          ranges.add(ranges.get(ranges.size() − 1))
 8          SerializeSpanNode(cursor, prefix, ranges)
 9      else if end − start + 1 <= LEAF THRESHOLD then
10          partitions ← Partition(keys, depth, start, end)
11          // Add ranges from the created partitions
12          SerializeLeafNode(cursor, partitions, ranges)
13      else
14          partitions ← Partition(keys, depth, start, end)
15          offsets ← {}
16          for p ← partitions do
17              offset ← BuildTree(cursor, ranges, keys, depth, p.st, p.end)
18              offsets.add(offset)
19          SerializeInnerNode(cursor, partitions, offsets)
20      return nodeOffset
of an ETree returns a range (stored in two consecutive range
array slots), which acts as a binary search hint.
There are three types of nodes: leaf, inner, and span nodes.
The leaf nodes store an array of fixed-sized decision bound-
aries and the byte offsets of range array slots. Similar to a
tree node, the decision boundaries (i.e., single bytes) direct
the traversal to the correct child (i.e., slot). The inner nodes
share the same layout and logic as the leaf nodes, but instead of
storing range array byte offsets, they store offsets that guide
traversals to other nodes. Finally, the span nodes store the
longest common prefix of the subtree rooted at that node. If
the current key comparison matches the prefix, the span node
directs the remaining comparisons to its subtree. Otherwise,
virtual edges are used (depicted as faded arrows on the span
node’s left and right) to handle inequality. At every node, we
compare a node’s bytes with the key at a specific position until
we reach a slot in the range array.²
²Refer to [2] for details on tree traversal and updates.
For example, in Fig. 2, if the beginning of a given string
matches the span node's prefix, then depending on how the
string's next byte compares to 4 and 8, the traversal will return
a range from [0, 362), [362, 646), or [646, 769). If the
string's leading bytes are smaller than the prefix, the traversal
returns an empty range ([0, 0)), which means that there are no
keys to look up. If they are larger, it returns [769, 812).
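The sketch below renders this traversal in Java. The types and virtual-edge representation are our illustration of the structure described above (see [2] for the authoritative design); they mirror the Fig. 2 walk-through, where span nodes match a prefix and leaf nodes compare one decision byte.

// Sketch: an ETree lookup returns two consecutive range-array slots
// bounding the subsequent binary search.
final class ETreeLookup {
    record Range(int start, int end) {}             // [start, end)

    sealed interface Node permits Span, Leaf {}
    record Span(byte[] prefix, Node child, Range below, Range above) implements Node {}
    record Leaf(byte[] boundaries, Range[] ranges) implements Node {}

    static Range lookup(Node node, byte[] key, int depth) {
        while (true) {
            if (node instanceof Span s) {
                int cmp = compareAt(key, depth, s.prefix());
                if (cmp < 0) return s.below();      // virtual edge, e.g. [0, 0)
                if (cmp > 0) return s.above();      // virtual edge, e.g. [769, 812)
                depth += s.prefix().length;         // prefix matched: descend
                node = s.child();
            } else {
                Leaf l = (Leaf) node;
                int b = depth < key.length ? key[depth] & 0xFF : 0;
                int i = 0;                          // boundaries {4, 8} in Fig. 2
                while (i < l.boundaries().length && b >= (l.boundaries()[i] & 0xFF)) i++;
                return l.ranges()[i];               // [0,362), [362,646), or [646,769)
            }
        }
    }

    static int compareAt(byte[] key, int depth, byte[] prefix) {
        for (int i = 0; i < prefix.length; i++) {
            int k = depth + i < key.length ? key[depth + i] & 0xFF : -1;
            int p = prefix[i] & 0xFF;
            if (k != p) return Integer.compare(k, p);
        }
        return 0;
    }
}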
B. Building a Java Prototype
To study the performance of ETrees, we built a single-
threaded prototype in Java. To simplify the page layout
of Fig. 1c, we used fixed byte offsets for storing the ETree,
the indirection vector, and the variable-length records. In the
indirection vector, each slot requires eight bytes: four bytes for
storing the key prefix and four bytes for encoding the offset of
the actual record. While three bytes are enough for encoding
the offset for large pages (64 KB), we decided to use four
to memory-align the slots. We employed the same logic for
storing the records, which were 32-bit aligned. The length
of every record is stored before the actual payload. Since the
prototype supports only lookups without concurrent writes, we
removed all memory copies when comparing keys (i.e., for
optimistic reads), significantly accelerating performance.
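A sketch of this serialization logic, matching the prose above rather than our actual sources: each record stores its length before the payload and is padded to keep the next record 32-bit aligned, while each 8-byte slot packs a 4-byte key prefix and a 4-byte record offset.

import java.nio.ByteBuffer;

// Sketch of the prototype's page-writing helpers (hypothetical names).
final class PageWriter {
    static int writeRecord(ByteBuffer page, int offset, byte[] record) {
        page.putInt(offset, record.length);         // length precedes the payload
        page.put(offset + 4, record, 0, record.length);
        int next = offset + 4 + record.length;
        return (next + 3) & ~3;                     // pad up to 32-bit alignment
    }

    static void writeSlot(ByteBuffer page, int slotOffset, int keyPrefix, int recordOffset) {
        page.putInt(slotOffset, keyPrefix);         // 4-byte prefix for cheap comparisons
        page.putInt(slotOffset + 4, recordOffset);  // 4 bytes (3 would suffice for 64 KB pages)
    }
}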
In the remainder of this section, we provide a new algorithm for
ETree construction with two tuning options to determine the
tree’s shape. Based on these options, we introduce a memory
budget and attempt to optimize comparisons with vectoriza-
tion. Finally, we evaluate the prototype’s performance.
Constructing Embedded Trees. While the original work [2]
describes how to build an ETree at a high level, it does not
provide a comprehensive algorithm explaining the creation of
the range array or the choice between leaf and inner nodes.
Thus, in Alg. 1, we formalize and simplify ETree creation.
An ETree is constructed recursively once, when a page
split or merge occurs or after a predefined number of changes
(i.e., key insertions and deletions). During its construction, the
nodes and range array (called ranges) are serialized to a page
using a page cursor. The ranges array is initialized with a
value of zero (denoting the start of keys), and after serializing
all nodes, it is stored at a predefined offset.
Each node keeps track of its current byte offset (line 2) and
returns it to its parent node, as inner nodes need to store this
information (lines 17-18) for subsequent traversals. First, in
line 3, the algorithm checks whether a common prefix exists
for the keys between the [start,end]range for the given depth
(i.e., byte position from which to start creating nodes). If it
exists, in line 5, the latest range is added to ranges before
creating a subtree in which all keys share the prefix. When
the subtree is constructed, it adds the latest range again and
then serializes the node to the page (line 8). These empty
ranges allow insertions of keys that do not share the prefix.
If no common prefix exists and the current key
range is smaller than a predefined threshold (called
LEAF THRESHOLD), the algorithm partitions that range.
Using the created partitions, it populates the ranges array and
serializes the leaf node using the cursor (line 12). Finally, for
an internal node, the creation process is similar to that of a leaf
node, but instead of storing byte offsets into the range array, it
stores byte offsets to child nodes. An internal node with a
single partition is converted to a leaf.
Optimizing Embedded Trees. The first difference from the
approach in [2] is that we store the virtual pointers (i.e., byte
offsets) from the span nodes to the range array, as shown
in Fig. 2. This simplifies traversals at the expense of using
additional bytes. In addition, we introduce two new tuning
parameters when constructing an ETree: (i) the threshold for
creating leaf nodes (see line 9); and (ii) the maximum number
of partitions allowed when creating a leaf or inner node
(lines 10 and 14). These parameters affect the shape of the
trees: a smaller LEAF THRESHOLD and additional partitions
lead to more expensive traversals but fewer cache misses (up
to 3× due to smaller ranges). Therefore, there is a performance
trade-off between the cost of traversals and cache misses.

Fig. 3: Comparing different index implementations — lookup throughput
(10^6 ops/s) of B+Tree, B2Tree, GBPTree, and Lucene on the urls,
wiki_titles_2M, uk_postcodes, movies, gen_emails, gen_phones, and
random_strings datasets.

Fig. 4: Speedup of GBPTree-IV and GBPTree-ETree over GBPTree on the
same datasets.
Managing Embedded Trees’ size. As the tree size can
become arbitrarily large depending on the partitions created at
runtime, we introduce a memory budget using Java exceptions.
We found it simpler to attempt to reconstruct the tree with
larger parameter values instead of tracking the complete his-
tory and retracting invalid nodes. Even though exceptions are
expensive [3], we did not observe them frequently when setting
the budget to 1 KB, the LEAF THRESHOLD to 128, and the
number of partitions to 4. We leave a more comprehensive
analysis of tuning options as future work.
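A sketch of this retry mechanism, with an illustrative exception type and retry policy (the defaults come from the text above): construction throws when the serialized tree would exceed the budget, and we simply rebuild with coarser parameters instead of retracting nodes.

// Sketch: enforce a memory budget on ETree construction via exceptions.
final class BudgetedBuild {
    static final class BudgetExceededException extends RuntimeException {}

    static boolean tryBuild(byte[][] keys) {
        int leafThreshold = 128, maxPartitions = 4; // defaults from the text
        for (int attempt = 0; attempt < 3; attempt++) {
            try {
                buildTree(keys, leafThreshold, maxPartitions, 1024 /* 1 KB budget */);
                return true;
            } catch (BudgetExceededException e) {
                leafThreshold *= 2;                 // coarser tree: fewer, larger nodes
                maxPartitions = Math.max(2, maxPartitions / 2);
            }
        }
        return false;                               // fall back to plain binary search
    }

    static void buildTree(byte[][] keys, int leafThreshold, int maxPartitions, int budget) {
        // ... serialize nodes, tracking bytes written ...
        // if (bytesWritten > budget) throw new BudgetExceededException();
    }
}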
Vectorizing comparisons. In Fig. 2, notice that the fixed-sized
decision boundaries of the leaf (or inner) nodes expose an
exploitable pattern: since they are stored in consecutive byte
offsets, they are amenable to data-level parallelism. Therefore,
we performed parallel comparisons using the Java Vector
API. Unfortunately, we found that this optimization did not
improve the performance due to: (i) the additional branch
mispredictions for handling corner cases for a different number
of partitions, which is data dependent; (ii) the overhead of
copying the data into registers to perform a simple operation;
and (iii) the overhead of traversals, i.e., dereferencing pointers
for the next node or performing byte-wise comparison for
prefixes, which outweighs the cost of comparisons.
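For reference, a sketch of the vectorized boundary comparison we experimented with (compiled with --add-modules jdk.incubator.vector on JDK 17); the helper name and padding convention are ours:

import jdk.incubator.vector.ByteVector;
import jdk.incubator.vector.VectorMask;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

// Sketch: compare all decision boundaries of a node against the search
// byte in one data-parallel step.
final class VectorBoundaries {
    static final VectorSpecies<Byte> SPECIES = ByteVector.SPECIES_64; // 8 lanes

    // boundaries must be padded to 8 entries; count is the real boundary count.
    static int findChild(byte[] boundaries, int count, byte b) {
        ByteVector bounds = ByteVector.fromArray(SPECIES, boundaries, 0);
        // Signed comparison for brevity; full byte order needs unsigned handling.
        VectorMask<Byte> gt = bounds.compare(VectorOperators.GT, b);
        int first = gt.firstTrue();                 // returns 8 if no lane matches
        return Math.min(first, count);              // clamp away the padding lanes
    }
}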
Validating prototype’s performance. Fig. 3 compares the
performance of our prototype B+Tree, with and without the
ETree (B2Tree and B+Tree respectively), against the B+Tree
implementation of Neo4j (GBPTree) and Apache Lucene. We
perform lookups only, using different datasets (see Sec. V) to
measure the throughput in ops/s. Across all datasets, Apache
Lucene yields the lowest performance as it is optimized for
full-text searches. B2Tree is, on average, 24% better than the
implementation without ETrees (B+Tree). At the same time, it
yields around 2.3× higher throughput compared to GBPTree.
This is caused by the memory alignment optimizations dis-
cussed above and by removing overheads related to copying
data or retrying reads due to optimistic concurrency.
IV. INTEGRATING B+TREE OPTIMIZATIONS IN NEO4J
Fig. 5: Neo4j system architecture — layered components: Bolt Driver,
Cypher, Traversal & Core API, Transaction Manager, Page Cache.

The previous results reveal a trade-off between a special-
ized string-based index (B2Tree) and a fully-fledged imple-
mentation (GBPTree): the first offers more than 2× higher
performance, while the second provides a reliable structure
with additional features (e.g., fixed-size records’ optimizations,
transactional access). To avoid implementing and testing these
functionalities from scratch, we integrated ETrees into Neo4j
native indexes. First, we provide an overview of Neo4j’s
architecture to understand its operational overheads and dis-
cuss GBPTrees. We then measure ETrees’ benefits before
performing an end-to-end system evaluation in Sec. V.
A. Neo4j System Architecture
In Fig. 5, we show a simplified overview of Neo4j’s ar-
chitecture. As a traditional DBMS, Neo4j supports workloads
whose data exceeds main memory and is persisted on disk. To accelerate access to
hot graph data or indexes, Neo4j manages its own page cache
(i.e., buffer manager) over persisted data. All data accesses are
performed in a transaction (using the Transaction Manager)
with (at least) read-committed isolation level based on two-
phase locking at a node or relationship granularity. Finally,
users can access Neo4j in (i) embedded or (ii) server mode.
With the embedded mode, users can bundle the database
with their application and wrap operations in a transaction.
This allows them to utilize the Traversal or Core API directly
or even submit Cypher queries [11], which get compiled to
graph operations. While the Traversal and Core APIs are
less readable and more complex to use, they offer more
expressivity and better performance.
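A sketch of an embedded-mode lookup through the Core API, the lowest execution layer we benchmark; obtaining the GraphDatabaseService (via the DatabaseManagementService) is elided and the label/property names are illustrative:

import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.Transaction;

// Sketch: every Core API access is wrapped in a transaction, which is
// part of the overhead measured in Sec. V.
final class CoreApiLookup {
    static String lookup(GraphDatabaseService db, String value) {
        try (Transaction tx = db.beginTx()) {
            // Uses the index on :Node(property) to find the starting point.
            Node node = tx.findNode(Label.label("Node"), "property", value);
            String result = node == null ? null : (String) node.getProperty("property");
            tx.commit();
            return result;
        }
    }
}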
CREATE INDEX FOR (n:Node) ON n.property
OPTIONS {
  indexProvider: 'b2tree'
}
Fig. 6: CREATE B2Tree command in Cypher

With the server mode, users can communicate with the
graph database over TCP or WebSockets using an efficient
binary communication protocol called Bolt [18]. While this
mode decouples the Neo4j process from the user’s application,
the network stack introduces additional overhead to query
execution. Using query parameters for Cypher queries and
batching operations can increase performance significantly.
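A sketch of a server-mode lookup with the official Java driver; the URI and credentials are placeholders. Using a query parameter ($value) lets the server reuse the cached query plan across calls:

import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Record;
import org.neo4j.driver.Session;
import org.neo4j.driver.Values;

// Sketch: a parameterized Cypher lookup over Bolt.
final class BoltLookup {
    public static void main(String[] args) {
        try (Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            Record rec = session.run(
                    "MATCH (n:Node) WHERE n.property = $value RETURN n.property",
                    Values.parameters("value", "neo4j.com")).single();
            System.out.println(rec.get(0).asString());
        }
    }
}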
Neo4j indexes are implemented between the page cache and
transactions layer of Fig. 5 for indexing nodes and relation-
ships on single or multiple properties. Graph operations use
indexes to find the starting point for updates or traversals. The
default index is GBPTree: a generation-aware B+tree. Apart
from exists and equals predicates, the GBPTree can evalu-
ate one-dimensional range predicates, string prefix predicates
(STARTS WITH), and non-unique keys. In addition, it supports
multiple concurrent reads and writes using optimistic locking.
Every GBPTree node has pointers to the left and right
siblings on the same level, and its inner nodes employ suffix
truncation [1]. GBPTree uses generations to simplify handling
changes and recovering from past checkpoints. Upon changes,
nodes are copied to an unstable generation (i.e., not check-
pointed), while the stable version is kept to handle crashes.
Regarding B+Tree’s node layout, in Neo4j v5.7.0, the max-
imum page size supported for inlined (i.e., stored within the
page) records is 32 KB. As no prefix is stored in the indirection
vector, only two bytes are required for each slot. GBPTrees can
also offload records externally at an additional page indirection
cost. For fixed-size records, GBPTree uses an optimized node
layout. Finally, to choose between different B+Tree implemen-
tations, a user can specify the appropriate indexProvider
with Cypher when creating the index (see Fig. 6).
B. Optimizing GBPTrees
Having validated the potential benefits of ETrees, we shall
now discuss the additional steps required to integrate them into
GBPTrees and our observations regarding their complexity.
Optimizing cache misses. GBPTrees use 2 B slots in the
indirection vector, which can lead to redundant cache misses
compared to first performing a cheap prefix comparison and only
then fetching the full key. Implementing this optimization was straightfor-
ward: storing four additional bytes for each indirection vector
slot and performing a two-step binary search comparison.
Handling optimistic reads. While holding an exclusive lock
over the page simplifies concurrent access and ensures consis-
tency upon updates, there are cases in which optimistic reads
may lead to invalid results. For example, during a node update,
the ETree can be reconstructed or deleted while accessed by a
reader. Similar to Sec. III-B, we handle these cases with Java
exceptions and retry the reads.
Handling range scans with ETrees was not the goal of
the original work. Range scans require checking subsequent
TABLE I: Evaluation Datasets
Dataset            Distinct Values  Avg Length  Min Length  Max Length
URLs³              1.5 M            36          3           721
Wikipedia titles⁴  20 M             20          1           255
UK postcodes⁵      2.5 M            7           6           8
IMDB movies        759 K            25          10          217
Generated emails   811 K            23          15          34
Generated phones   2 M              12          12          12
Random strings     2 M              68          8           128
keys before adding them to the result set. This involves
reconstructing the full key by retrieving: (i) the prefix from
the Embedded Tree; (ii) the four bytes from the indirection
vector; and (iii) the suffix of the key. To rebuild the prefix for
a given slot, we have to perform DFS over the Embedded Tree
and find the prefix that covers that range.
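The sketch below shows this reconstruction as a plain concatenation of the three pieces, in the order listed above; the helpers that walk the Embedded Tree and read the page are hypothetical stand-ins:

import java.io.ByteArrayOutputStream;

// Sketch: rebuild a full key for a range scan from its three fragments.
final class KeyReconstruction {
    static byte[] rebuildKey(int slot) {
        ByteArrayOutputStream key = new ByteArrayOutputStream();
        key.writeBytes(prefixCovering(slot));   // (i) prefix found via DFS over the ETree
        key.writeBytes(slotPrefixBytes(slot));  // (ii) four bytes from the indirection vector
        key.writeBytes(suffixOf(slot));         // (iii) the record's remaining suffix
        return key.toByteArray();
    }

    static byte[] prefixCovering(int slot) { return new byte[0]; }  // stub
    static byte[] slotPrefixBytes(int slot) { return new byte[0]; } // stub
    static byte[] suffixOf(int slot) { return new byte[0]; }        // stub
}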
Handling node splits and merges in GBPTrees is performed
using a single memory copy for all records. With ETrees, full
key reconstruction is required, with keys stored in multiple
places. In addition, estimating the new key sets during splits
becomes more complicated because of prefixes. When copying
records to a node during a split or merge, there are cases where
the new key set may not fit in the node. This happens when the
created ETree does not compress all records enough, leading
to additional code complexity to handle such edge cases.
Performance Results. Next, we compare the speedup for
single-threaded reads of the two approaches integrated into
Neo4j over GBPTree: (i) storing the four bytes prefix of
keys in the indirection vector (GBPTree-IV); (ii) using ETrees
atop the previous optimization (GBPTree-ETree). Fig. 4 shows
that GBPTree-IV performs worse when the first four bytes
cannot differentiate the keys, except for the last dataset,
which consists of randomly generated strings. On the other
hand, GBPTree-ETree achieves, on average, 16% and up to
31% better throughput by improving cache efficiency, which
aligns with the results in [2]. Overall, prefix tries enhance
lookup performance at the expense of increased complexity
and overhead for operations that require key reconstruction.
V. EVALUATION
We evaluate ETree in Neo4j across different execution
layers to determine its impact as part of a complex DBMS. We
use read-only and mixed read/write workloads with varying
levels of concurrency and a range of real and synthetic data.
Experimental setup. All experiments are performed on
an m5.4xlarge AWS EC2 instance with 16 physical cores,
a 35.8 MiB LLC cache, and 64 GiB of memory. We use
Ubuntu 22.04 with kernel 5.19, OpenJDK 17, and Neo4j En-
terprise Edition (v5.7.0) with 32 KB pages.
We compare GBPTrees with the optimizations discussed
in Sec. IV-B, i.e., the indirection vector and ETrees (-IV and
-ETree suffixes), in terms of throughput (i.e., ops/s). These
approaches are compared using the Core API, Cypher, and
Cypher over Bolt (C-API, Cypher-, and Bolt- prefixes).

³www.kaggle.com/datasets/shawon10/url-classification-dataset-dmoz
⁴A sample of Wikipedia titles with date="20221120".
⁵https://www.getthedata.com/open-postcode-geo

Fig. 7: Single-threaded index comparison in Neo4j (read-only) —
throughput (ops/s) of C-API, C-API-IV, C-API-ETree, Cypher,
Cypher-IV, Cypher-ETree, Bolt, Bolt-IV, and Bolt-ETree across all
datasets.

Fig. 8: Multi-threaded index comparison in Neo4j (read-only) —
throughput (ops/s) of C-API, C-API-ETree, Cypher, Cypher-ETree,
Bolt, and Bolt-ETree.

Fig. 9: Multi-threaded index comparison in Neo4j (5% writes) —
throughput (ops/s) of the same configurations.
Datasets. Table I summarizes the datasets used for our evalu-
ation. For synthetic data, we generate emails, phone numbers,
and random strings (as in [2]). In all experiments, the data fits
in memory (i.e., Page Cache) to avoid disk overheads, but it
is still persisted on disk.
Lookup performance. We first compare all implementations
for read-only workloads using a single thread (Fig. 7) before
analyzing their behaviour in a multi-threaded scenario (Fig. 8).
We omit the performance of Apache Lucene, as it consistently
performed worse for lookups.
Fig. 7 shows that with a single thread, when using the
Core API, performance drops 2.5× from the results shown
in Fig. 3. This occurs due to the additional overhead of
wrapping accesses with transactions and tasks. When the index
finds a node, another pointer dereference is required to fetch
its content. Therefore, less time is spent on actual index reads,
which results in only 7% increased performance on average
when using ETrees (C-API-ETree). At the same time, ETrees also
increase construction and maintenance costs by nearly 70%.
The indirection vector optimization performs better only with
the random strings.
We next perform index operations using Cypher, which in-
troduces additional overhead for parsing, compiling, schedul-
ing, and caching queries. This leads to another 80% perfor-
mance degradation compared to the Core API and declining
benefits from ETrees (Cypher-ETree), with a 3% average
speedup. Finally, submitting Cypher queries over Bolt drops
throughput by over 20×, and all approaches exhibit similar per-
formance, as only a fraction of execution is spent on indexing.
In Fig. 8, we observe that concurrent reads provide close to
8× speedup for both the Core API and Cypher across all
implementations. For Bolt, the multi-reader speedup is nearly
5×. On average, all three settings reveal no speedup when
using ETrees due to increased data cache misses.
Performance with inserts. In Fig. 9, we compare regular
GBPTrees with GBPTrees using the ETree optimization in
a concurrent setup with 16 threads. At the beginning of
the experiment, we load 80% of the dataset and leave the
remaining data for the writers to insert. Every thread selects
a read or write operation based on a 95% read and 5% write
ratio to emulate a common graph use case.
Compared to the results of Fig. 8, writes reduce the per-
formance of the Core API, Cypher, and Bolt by 2.8×, 2×,
and 14%, respectively. Only C-API-ETree yields, on average, a
3% improvement compared to regular B+Trees (C-API). This
reduction stems from performing copy-on-write updates on stable-
generation B+Tree nodes, creating and persisting new graph
nodes, and committing transactions. Given that queries over
Bolt are not affected as much by writes, we infer that net-
mutable transactions. Even though transaction and networking
overheads can be reduced with buffering [19], we measured the
throughput of single operations to identify these bottlenecks.
Discussion. As we add more execution layers over the
index accesses (see Fig. 5), we observe an increase in in-
struction cache and ITLB misses.⁶ This indicates that each
layer increases the code footprint, requiring more instructions
per index lookup.
In multi-threaded scenarios with concurrent operations (i.e.,
task creation and scheduling, query compilation, networking),
we observe increased data misses in all cache levels. As
Neo4j is optimized for concurrent users, worker threads are not
pinned to individual CPUs, eventually leading to cache thrash-
ing despite ETrees providing more cache-friendly lookups.
With Neo4j being a representative transactional system that
differs from relational DBMSs in storage format and query
operators (e.g., graph pattern matching), we expect similarly
marginal performance benefits in practice for other graph database
systems [20], [21], especially when performing complex graph
analytics after the index lookup. We are investigating how to exploit
ETrees by improving code quality with code generation [22].
VI. CONCLUSION
In this paper, we present our empirical research on B+Tree
optimizations for variable-length records in a modern graph
DBMS. Our study reveals that while these optimizations offer
performance improvements when tested in isolation, intro-
ducing essential database components reduces their impact
significantly. Thus, considering the maintenance complexity
of these optimizations, regular B+Trees remain a good default
choice for read-heavy workloads.
⁶Measurements taken using Intel VTune 2023.1.0 and m5.metal.
REFERENCES
[1] G. Graefe et al., “Modern b-tree techniques,” Foundations and Trends®
in Databases, 2011.
[2] J. Schmeißer, M. E. Schüle, V. Leis, T. Neumann, and A. Kemper,
“B2-tree: Page-based string indexing in concurrent environments,”
Datenbank-Spektrum, 2022.
[3] A. Alhomssi, M. Haubenschild, and V. Leis, “The evolution of
leanstore,” BTW, 2023.
[4] S. Zeuch, J.-C. Freytag, and F. Huber, “Adapting tree structures for
processing with simd instructions.” in EDBT, 2014.
[5] J. Zhou and K. A. Ross, “Implementing database operations using simd
instructions,” in SIGMOD, 2002.
[6] V. Leis, M. Haubenschild, A. Kemper, and T. Neumann, “Leanstore:
In-memory data management beyond main memory,” in ICDE, 2018.
[7] T. Neumann and M. J. Freitag, “Umbra: A disk-based system with in-
memory performance.” in CIDR, 2020.
[8] X. Zhou, X. Yu, G. Graefe, and M. Stonebraker, “Two is better than
one: The case for 2-tree for skewed data sets,” CIDR, 2023.
[9] Neo4j, “Neo4j Graph Data Platform,” https://neo4j.com/, 2010, last
access: May 13, 2024.
[10] I. Robinson, J. Webber, and E. Eifrem, Graph Databases: New Opportu-
nities for Connected Data. O’Reilly Media, 2015.
[11] N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker,
V. Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor,
“Cypher: An evolving query language for property graphs,” in SIGMOD,
2018.
[12] Neo4j, “Indexes for search performance,” https://neo4j.com/docs/
cypher-manual/current/indexes-for-search-performance/, 2022, last ac-
cess: May 13, 2024.
[13] G. Graefe, “Write-optimized b-trees,” in VLDB, 2004.
[14] M. McCandless, E. Hatcher, and O. Gospodnetić, Lucene in Action.
Manning, 2010.
[15] S. Dong, A. Kryczka, Y. Jin, and M. Stumm, “Evolution of develop-
ment priorities in key-value stores serving large-scale applications: The
rocksdb experience.” in FAST, 2021.
[16] Z. Cao, S. Dong, S. Vemuri, and D. H. Du, “Characterizing, modeling,
and benchmarking RocksDB key-value workloads at Facebook,” in
FAST, 2020.
[17] B. Chandramouli, G. Prasaad, D. Kossmann, J. Levandoski, J. Hunter,
and M. Barnett, “Faster: A concurrent key-value store with in-place
updates,” in SIGMOD, 2018.
[18] Neo4j, “Bolt,” https://neo4j.com/docs/bolt/current/bolt/, 2023, last ac-
cess: May 13, 2024.
[19] G. Graefe, “B-tree indexes for high update rates,” SIGMOD Rec., 2006.
[20] X. Feng, G. Jin, Z. Chen, C. Liu, and S. Salihoğlu, “Kùzu graph
database management system,” in CIDR, 2023.
[21] A. Deutsch, Y. Xu, M. Wu, and V. Lee, “Tigergraph: A native mpp
graph database,” arXiv preprint arXiv:1901.08248, 2019.
[22] J. Clarkson, G. Theodorakis, and J. Webber, “Bifrost: A future graph
database runtime,” in ICDE, 2024.