ArticlePDF Available

Making updates disk-I/O friendly using SSDs


Abstract and Figures

Multiversion databases store both current and historical data. Rows are typically annotated with timestamps representing the period when the row is/was valid. We develop novel techniques for reducing index maintenance in multiversion databases, so that indexes can be used effectively for analytical queries over current data without being a heavy burden on transaction throughput. To achieve this end, we re-design persistent index data structures in the storage hierarchy to employ an extra level of indirection. The indirection level is stored on solid state disks that can support very fast random I/Os, so that traversing the extra level of indirection incurs a relatively small overhead. The extra level of indirection dramatically reduces the number of magnetic disk I/Os that are needed for index updates, and localizes maintenance to indexes on updated attributes. Further, we batch insertions within the indirection layer in order to reduce physical disk I/Os for indexing new records. By reducing the index maintenance overhead on transactions, we enable operational data stores to create more indexes to support queries. We have developed a prototype of our indirection proposal by extending the widely used Generalized Search Tree (GiST) open-source project, which is also employed in PostgreSQL. Our working implementation demonstrates that we can significantly reduce index maintenance and/or query processing cost, by a factor of 3. For insertions of new records, our novel batching technique can save up to 90% of the insertion time.
Content may be subject to copyright.
Making Updates Disk-I/O Friendly Using SSDs
Mohammad Sadoghi1, Kenneth A. Ross1,2, Mustafa Canim1, Bishwaranjan Bhattacharjee1
1IBM T.J. Watson Research Center
2Columbia University
Multiversion databases store both current and historical data. Rows
are typically annotated with timestamps representing the period
when the row is/was valid. We develop novel techniques for re-
ducing index maintenance in multiversion databases, so that in-
dexes can be used effectively for analytical queries over current
data without being a heavy burden on transaction throughput. To
achieve this end, we re-design persistent index data structures in
the storage hierarchy to employ an extra level of indirection. The
indirection level is stored on solid state disks that can support very
fast random I/Os, so that traversing the extra level of indirection
incurs a relatively small overhead.
The extra level of indirection dramatically reduces the number of
magnetic disk I/Os that are needed for index updates, and localizes
maintenance to indexes on updated attributes. Further, we batch in-
sertions within the indirection layer in order to reduce physical disk
I/Os for indexing new records. By reducing the index maintenance
overhead on transactions, we enable operational data stores to cre-
ate more indexes to support queries. We have developed a proto-
type of our indirection proposal by extending the widely used Gen-
eralized Search Tree (GiST) open-source project, which is also em-
ployed in PostgreSQL. Our working implementation demonstrates
that we can significantly reduce index maintenance and/or query
processing cost, by a factor of 3. For insertions of new records, our
novel batching technique can save up to 90% of the insertion time.
In a multiversion database system, new records do not physically
replace old ones. Instead, a new version of the record is created,
which becomes visible to other transactions at commit time. Con-
ceptually, there may be many rows for a record, each corresponding
to the state of the database at some point in the past. Very old ver-
sions may be garbage-collected as the need for old data diminishes,
in order to reclaim space for new data.
When indexing data, one typically indexes only the most recent
version of the data, since that version is most commonly accessed.
In such a setting, record insertions, deletions and updates trigger
I/O to keep the indexes up to date. With a traditional index struc-
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. Articles from this volume were invited to present
their results at The 39th International Conference on Very Large Data Bases,
August 26th - 30th 2013, Riva del Garda, Trento, Italy.
Proceedings of the VLDB Endowment, Vol. 6, No. 11
Copyright 2013 VLDB Endowment 2150-8097/13/09... $10.00.
7 (PKs) 8 9 10 11 12 13 14 15 16 17
TPC-H: all tables; Scale Factor: 20
Query (Base)
Query (Indirection)
Update (Base)
Update (Indirection)
Number of Indexes
Relative Execution Time
Figure 1: Effect of Adding Indexes in Operational Data Store
ture, the deletion of a record requires the traversal of each index and
the removal of the row-identifier (RID) from the leaf node. The up-
date of a record (changing one attribute value to another) creates a
new version, triggering a traversal of all indexes to change the RIDs
to the new version’s RID. (In the case of the modified attribute, the
position of the record in the index may also change.) For a newly
inserted record, the new RID must be inserted into each index. In-
dexes may be large, and in aggregate much too large to fit in the
RAM bufferpool. As a result, all of these index maintenance oper-
ations will incur the overhead of physical I/O on the storage device.
These overheads have historically been problematic for OLTP
workloads that are update-intensive. As a result, OLTP workloads
are often tuned to minimize the number of indexes available. This
choice makes it more difficult to efficiently process queries and to
locate records based on secondary attributes. These capabilities
are often important for operational data stores [35]. For example,
it is not uncommon to find tens of indexes to improve analytical
and decision-making queries even in TPC benchmarks [20, 19] or
enterprise resource planning (ERP) scenarios [14, 13].
Our goal is to reduce the overhead of index updates, so that in-
dexes can be used effectively for analytical query processing with-
out being a heavy burden on transaction throughput. The query vs.
update dilemma is clearly captured in Figure 1, a preview of our
experimental results. The execution time for analytical queries is
reduced as more indexes are added. However, this reduction comes
at the cost of increasing the update time in the Base approach. In
contrast, by employing our technique (the Indirection approach) the
incurred update cost is significantly smaller.
To address this dilemma, we utilize a solid state storage layer.
Based on current technologies, solid state disks (SSDs) are orders
of magnitude faster than magnetic disks (HDDs) for small random
I/Os. However, per gigabyte, SSDs are more expensive than mag-
netic disks. It therefore pays to store the bulk of the data on mag-
netic disk, and reserve the SSD storage for portions of the data that
can benefit the most, typically items that are accessed frequently
and randomly. In this paper we describe a prototype of our In-
direction approach based on the widely used Generalized Search
Tree (GiST) package [31, 29], which is also employed in industrial-
strength open-source database management systems such as Post-
greSQL [6], PostGIS [5], OpenFTS [4], and BioPostgres [1].
Unlike previous approaches [17, 37, 15, 18, 23], we do not pro-
pose to simply store “hot” data on SSDs. Instead, we change the
data structures in the storage hierarchy to employ an extra level
of indirection through solid state storage. Because the solid state
memory is fast, the extra time incurred during index traversal is
small, as we demonstrate experimentally. The extra level of indi-
rection dramatically reduces the amount of magnetic disk I/O that
is needed for index updates. Only SSD I/O is needed for deletions
and updates, with the exception of indexes on changed attributes.
We can also reduce the magnetic disk I/O overhead for insertions.
While we describe our techniques in terms of SSDs, we are not lim-
ited to a disk-like form factor. In fact, alternative form factors (e.g.,
FusionIO auto-commit memory [27]) with smaller I/O granulari-
ties would provide even better performance because our proposed
solid state updates are small.
Another potential advantage of our work – efficient index main-
tenance of multiversion databases – is to support multiversion con-
currency control (MVCC). The MVCC model is reviving [6, 32,
38, 39] mostly due to the increased concurrency available in mod-
ern hardware such as large main memories and multicore proces-
sors. But this increased concurrency comes at the cost of increased
locking contention among concurrent reads/updates queries, which
could be alleviated using optimistic locking over a multiversion
1.1 Multiversion Databases
By keeping old data versions, a system can enable queries about
the state of the database at points in the past. The ability to query
the past has a number of important applications [50], for example,
(1) a financial firm requires to retain any changes made to client
information for up to five years in accordance with auditing reg-
ulations; (2) a retailer ensures to offer only one discount for each
product at any given time; (3) a bank needs to retroactively correct
an error for miscalculating the promised introductory interest rate.
In addition to these business-specific scenarios, there is an inherent
algorithmic benefit for retaining the old versions of the record and
avoiding in-place, that is, to utilize efficient optimistic locking and
latch-free data structures.
A simple implementation of a multiversion database would store
the row-identifier (RID) of the old version within the row of the
new version, defining a linked list of versions. Such an implemen-
tation allows for the easy identification of old versions of each row,
but puts the burden of reconstructing consistent states at particular
times on the application, which would need to keep timing infor-
mation within each row.
To relieve applications of such burdens, a multiversion database
system can maintain explicit timing information for each row. In
a valid time temporal model [28] each row is associated with an
interval [begin-time,end-time) for which it was/is current. Several
implementation choices exist for such a model. One could store
the begin-time with each new row, and infer the end-time as the
begin-time of the next version. Compared with storing both the
begin-time and end-time explicitly for each row, this choice saves
space and also saves some write I/O to update the old version. On
the other hand, queries over historical versions are more complex
because they need to consult more rows to reconstruct validity in-
Abitemporal database maintains two kinds of temporal informa-
tion, the system (i.e., transaction) time, as well as the application
time (sometimes called “business time”).
In this work we do not commit to any one of these implementa-
tion options, each of which might be a valid choice for some work-
loads. For any of these choices, our proposed methods will reduce
the I/O burden of index updates. Some of our techniques, such as
the LIDBlock technique (Section 3.1), apply to both versioned and
non-versioned databases.
1.2 Physical Organization
There are several options for the physical organization of a tem-
poral database. A complete discussion of the alternatives is beyond
the scope of this paper. We highlight two options that have been
used in commercial systems, namely, the history table vs. the sin-
gle table approach.
One organization option appends old versions of records to a his-
tory table and only keeps the most recent version in the main table,
updating it in-place. Commercial systems have implemented this
technique: In IBM DB2 it is called “System-period data version-
ing” [34], and it is used whenever a table employs transaction time
as the temporal attribute. The Oracle Flashback Archive [47] also
uses a history table. Such an organization clusters the history table
by end-time, and does not impose a clustering order on the main ta-
ble. Updates need to read and write the main table, and also write to
the end of the history table. Because updates to the main table are
in-place,1an index needs to be updated only when the correspond-
ing attribute value changes. For insertions and deletions, all in-
dexes need to be updated. In short, using the history table approach
(1) the temporal ordering of the data is lost; (2) additional random
I/Os are required to perform in-place updates of records; (3) the
number of database objects (e.g., tables, indexes, and constraints)
are potentially doubled, which increases the overall management
and maintenance cost of database and slows down the query opti-
mization runtime; and (4) less effective query plan is constructed
for certain temporal range queries that are forced to union the his-
tory and the main tables instead of using range-partitioned table for
maintaining the single table approach.
In this paper, we assume an organization in which there is a sin-
gle table containing both current and historical data. Commercial
systems that implement this technique include Oracle 11g where
the concept is called “version-enabled tables” [46]. IBM’s DB2
also uses this approach for tables whose only temporal attribute is
the application time. The single table approach is central to IBM
DB2 with BLU Acceleration as well [3]. New rows are appended
to the table, so that the entire table is clustered by begin-time. Up-
dates need to read the table once and write a new version of the
record to the end of the table.
We focus on applications that primarily use current data, but oc-
casionally need to access older versions of the data. To support
queries over current data, the most recent data may be extensively
indexed. Older data may be less heavily indexed because it is
queried less frequently, and is often more voluminous. Even within
a single table, the system can offer an implementation option in
which only the most recent version of a record appears in an index.
Traditional index structures directly reference a record via a pointer
known as a physical row-identifier (RID). The RID usually encodes
1If one wanted to cluster the main table by a temporal attribute
to improve temporal locality, then updates would not be in-place
and additional indexes would need to be updated. Our proposed
solution would reduce the burden of such index updates.
a combination of the database partition identifier, the page number
within the partition, and the row number within the page. A RID
index over current HDD-resident data is shown in Figure 2.
RID Index
version ID
RID Index
Figure 2: Traditional RID index structure
The choice of a physical identifier hinders the update perfor-
mance of a multiversion database in which updates result in a new
physical location for the updated record. Changes to the record in-
duce I/O for every index, even indexes on “unaffected” attributes,
i.e., attributes that have not changed. Random I/Os are required to
modify HDD-resident leaf pages.
To avoid HDD I/O for indexes on unaffected attributes, we de-
couple the physical and logical representations of records spanning
many versions. We distinguish between a physical row-identifier
(RID) and a logical record identifier (LID). For any given record,
there may be many RIDs for that record corresponding to the phys-
ical placement of all of the versions of that record. In contrast, the
LID is a reference to the RID representing the most recent version
of the record. For now, one can think of a table LtoR(LID,RID) that
has LID as the primary key. Indexes now contain LIDs rather than
RIDs in their leaves.
Under our proposed Indirection technique, an index traversal must
convert a LID to a RID using the LtoR table. A missing LID, or
a LID with a NULL RID in the LtoR table are treated as deleted
rows, and are ignored during search. By placing the LtoR table on
an SSD, we ensure that the I/O overhead for the extra indirection
is relatively small.2Because the SSD is persistent, index struc-
tures can be recovered after a crash. Because we need only a few
SSD bytes per record, it is possible to handle a large magnetic disk
footprint with a much smaller solid state footprint. The new index
design is demonstrated in Figure 3.
When an existing record is modified, a new version of that record
is created. The LtoR table is updated to associate the new row’s
RID to the existing LID. That way, indexes on unchanged attributes
remain valid. Only for the changed attribute value will index I/O
be required for the indirection layer.
When a record is deleted, the (LID,RID) pair for this record in
the LtoR table is deleted. Index traversals ignore missing LIDs. In-
dexes can lazily update their leaves during traversal, when a read
I/O is performed anyway. At that time, any missing LIDs encoun-
tered lead to the removal of those LIDs from the index leaf page.
After a long period of activity, indexes should be validated off-
line against the LtoR table to remove deleted LIDs that have subse-
quently never been searched for.
When a new record is added, the new record is appended to the
tail of the relation and its RID is fetched and associated to a new
LID. The (LID, RID) pair for the new record is added to the LtoR
2Frequently accessed LIDs would naturally be cached in RAM by
the database bufferpool manager, further reducing the overhead.
RID: Row Identifier
LID: Logical Identifier
Indirection Mapping
(LtoR table)
LID Index LID Index
Figure 3: LID index using the indirection technique
table. All indexes are also updated with the new record LID ac-
cordingly. In Section 3, we discuss how to further improve record
insertion and deletion.
We now develop techniques for improving the index performance
of insertion. We define a batching structure called a LIDBlock, and
employ yet another level of indirection.
3.1 LIDBlocks
To reduce the index overhead for insertions, we propose an SSD-
resident auxiliary LIDBlock structure containing a fixed number of
LIDs. The LIDs in a LIDBlock may be NULL, or may be valid
LIDs from the LtoR table. References to a LIDBlock are mapped
to multiple LIDs through this extra level of indirection. Figure 4
shows the extended structure with LIDBlocks.
HDD Tail
Non-leaf Pages
Leaf Pages
batched LIDs
flushed LIDs
Figure 4: Indirection Technique with LIDBlock
The arrow from index leaf pages to LIDBlocks in Figure 4 could
be implemented by keeping a block identifier (BID) within the leaf
page. The disadvantage of such a choice is that the leaf node of the
index needs to be read from magnetic disk to locate the BID, requir-
ing extra HDD I/O. Instead, we propose to store LIDBlocks within
hash tables on the SSD. In the following sections, we describe more
precisely how index pages refer to and traverse LIDBlocks.
3.2 The Dense Index Case
Consider first the case of a dense index, i.e., a secondary index
where there are many records (more than a leaf-node’s worth) for
each attribute value. For such indexes, we keep a list of LIDs for
each value as before. In addition, we store a collection of LIDBlocks
on the SSD in a hash table, hashed by the attribute value. Initially,
each indexed value has a LIDBlock whose LIDs are all NULL.
When a new record is inserted, we need to modify the indexes to
reflect the insertion. Suppose that the value of an indexed attribute
is v, and that the index is dense. A LID is created for the record, and
a suitable (LID,RID) pair is added to the LtoR table. The LIDBlock
Bfor vis identified by accessing the hash table of LIDBlocks on
the SSD. If there are unused (NULL) slots in B, one of the slots is
overwritten with the LID of the new record. If there are no unused
slots, then all LIDS in Band the LID of the new record are moved
in bulk into the LID list in the index, amortizing the I/O cost.3
In this model, index traversal is slightly more complex: all the
non-NULL LIDs in the LIDBlock for a value also need to be treated
as matches. Deletions and attribute value updates may also need to
traverse and modify a LIDBlock. There is some additional I/O, but
only on solid state storage.
3.3 The Sparse Index Case
When there are few matches per index value, the organization
above would need a very large number of LIDBlocks, most of which
would be underutilized. Instead, for sparse indexes we maintain a
single LIDBlock for an entire leaf node of the index. Rather than
using a hash table hashed by attribute value, we use a hash table
hashed by the address of the index leaf page. This address can be
obtained using a partial traversal of the index, without accessing
the leaf node itself. Since the internal nodes of a tree index occupy
much less space than the leaf nodes, they are much more likely to
be resident in the main memory bufferpool.
Searches have some additional overhead, because the shared LID-
Block would need to be consulted even for records that may not
match the search condition. There would also be overhead during
node splits and merges to maintain the LIDBlock structure.
The overhead of LIDBlocks on searches may be high for sparse
indexes. For example, a unique index search would previously only
have to look up one main file record. With a LIDBlock for a given
key range, a search may need to read bof them, where bis the
LIDBlock capacity. This example suggests an optimization: store
both the LID and the key in the LIDBlock for sparse indexes. This
optimization reduces the capacity of LIDBlocks, but significantly
improves the magnetic disk I/O for narrow searches.
3.4 Revisiting Deletions and Updates
With LIDBlocks, it is now possible that an update or deletion oc-
curs to a record whose LID is in a LIDBlock rather than the LID list.
For deletions, one can simply set the LID to NULL in the LIDBlock.
For updates that change the position of the record in the index, one
needs to nullify the LID in the previous LIDBlock or LID-list, and
insert it into the new LIDBlock.
3One can shift the burden of LIDBlock flushing outside of running
transactions by triggering an asychronous flush once a LIDBlock
becomes full (or nearly full), rather than waiting until it overflows.
In this way, LIDBlocks also improve the I/O behavior of updates
to indexes on changed attributes. In the original scheme, HDD-
resident index leaf pages appropriate to the new attribute value
needed to be updated with the LID of the updated record. In the en-
hanced scheme, writes are needed only on SSD-resident LIDBlocks
most of the time. Magnetic disk writes are amortized over the num-
ber of records per LIDBlock.
In this section, we provide a detailed example to further illustrate
the core of our Indirection proposal and motivate the analysis in
Section 5.
Consider a table R(A, B, C, D )with B-tree indexes on all four
attributes. Ais the primary key, and is, therefore, sparse. Bis also
sparse, while Cand Dare dense. Ris stored on disk in a versioned
fashion. For each row we show the RID of the previous version
(if any) of the row; the previous-RID may or may not be explicitly
stored. Suppose that at a certain point in time, the extension of R
includes the rows given below. A flag indicating whether the row
has been deleted is also included.
RID A B C D Prev-RID Deleted
345 100 3732 3 5 123 0
367 120 4728 3 6 NULL 0
369 130 2351 2 5 NULL 0
501 100 3732 2 5 345 0
Suppose the LtoR table for Ris given by
10 367
11 369
13 501
Indexes use LIDs (e.g., 10, 11, 13) rather than RIDs to refer to
rows, and only the most recent version is indexed. Given the above
database, the immediate and deferred changes to the database in
response to various update requests are described below. The I/O
needed is summarized in square brackets (log I/O and I/O for index
node splits/merges are ignored). Our I/O estimates assume that all
internal index nodes are resident in the RAM bufferpool, and that
all leaf nodes require I/O. These estimates also assume that all ac-
cesses to SSD-resident structures require I/O.4The estimates also
assume direct access to a page given the LID, as might be supported
by a hash table. We assume that LIDBlocks contain bLIDs, and that
for sparse indexes we are storing both the key and the LID in the
LIDBlock. We now describe precisely each key operation, break-
ing down its steps into immediate and deferred actions. Immediate
actions must be completed within the operation itself. Deferred ac-
tions are those that can happen later, such as when an operation
causes a page to become dirty in the bufferpool but the actual I/O
write comes later.
Update. We update the row with key 100 such that attribute D
is changed from 5 to 6. The immediate actions are
The LID of the row in question (i.e., 13) is identified using
the index on A[1 HDD read].
The entry (13,501) in the LtoR mapping is read [1 SSD
The row with RID 501 is read [1 HDD read].
4This might be too pessimistic, particularly for LIDBlocks that
could be small enough to be cached in RAM.
A new version of the record is created at the tail of Rwith a
new value for Dand a new RID (suppose 601).
An index traversal for key 5 is initiated on attribute D. If
LID 13 is present in the corresponding leaf, it is deleted; oth-
erwise, LID 13 is located in the LIDBlock for key 5 and is
removed from the LIDBlock. [1 HDD read, possibly 1 SSD
A partial index traversal for key 6 is initiated on attribute D,
and LID 13 is inserted into the corresponding LIDBlock [1
SSD read].
The required deferred actions are summarized as follows.
The data page containing the row with RID 601 is dirty and
will need to be flushed to the HDD. [1 HDD write amortized
over all modifications to the page5].
The entry (13,501) in the LtoR mapping is changed to (13,601)
[1 SSD write].
The index page containing the key 5 will be dirty if LID 13
was in the leaf. The dirty page will need to be flushed to the
HDD [1 HDD write, amortized over all modifications to the
page]. If LID 13 was in the LIDBlock for key 5 then the dirty
LIDBlock will need to be flushed to the SSD [1 SSD write].
The LIDBlock for key 6 is dirty and will need to be flushed to
the SSD [1 SSD write].
Insertion. Consider the insertion of a new record (140,9278,2,6).
The resulting immediate actions are:
The absence of a previous version of the row is verified using
the index on A, including the LIDBlock [1 HDD read, 1 SSD
A new row is created at the tail of R, with a new RID (sup-
pose 654). The previous-RID field is NULL.
A new LID (suppose 15) is allocated.
For the two dense indexes on Cand D, the LIDBlocks for
keys 2and 6(respectively) are identified, and LID 15 is in-
serted into each [2 SSD reads].
For the two sparse indexes on Aand B, the LIDBlocks for
keys 140 and 9278 (respectively) are identified using partial
index traversals, and LID 15 is inserted (paired with the key)
into each [2 SSD reads].
The deferred actions are as follows.
The data page containing the row with RID 654 is dirty and
will need to be flushed to the HDD [1 HDD write, amortized
over all modifications to the page].
The entry (15,654) is inserted into the LtoR mapping [1 SSD
read, 1 SSD write].
The LIDBlocks for each of the four indexes are dirty, and need
to be flushed [4 SSD writes].
5Amortization is expected to be high on the tail of a table.
In the event that a LIDBlock fills (one time in binsertions),
we need to convert all LIDs in the LIDBlock into regular in-
dex LIDs, and reset the LIDBlock to an empty state [4/b SSD
writes, 3/b HDD reads (the leaf in the index on Ahas already
been read), 4/b HDD writes].
Deletion. Now suppose deleting the row with key 100, which
results in the following immediate actions:
The LID of the row in question (i.e., 13) is identified using
the index on A[1 HD read, possibly 1 SSD read].
The pair (13,501) in the LtoR table is located [1 SSD read].
The row with RID 501 is read and the deleted flag is set to 1
[1 HDD read].
LID 13 is removed from the leaf node of the index on A.
The deferred actions for deleting row with key 100 are
The data page containing the row with RID 501 is dirty and
will need to be flushed to the disk. [1 HDD write, amortized
over all modifications to the page].
The pair (13,501) in the LtoR table is dropped [1 SSD write].
The index leaf page for Acontaining the key 100 is dirty and
will need to be flushed to the HDD [1 HDD write amortized
over all modifications to the page].
Whenever one of the other indexes is traversed and LID 13
is reached, LID 13 will be removed from the corresponding
LID list [1 extra HDD write to modify the leaf, amortized
over all modifications to the page].
Search. Suppose that the search returns mmatches that all fit in
one index leaf page of a sparse index. Since no write is involved,
the search consists of only immediate actions:
Traverse the index [1 HDD read].
Read the LIDBlock for the leaf node [1 SSD read].
Map LIDs to RIDs [mSSD reads].
Read all matching records [mHDD reads (assuming an un-
clustered table)].
We analyze the performance of the Indirection method accord-
ing to three criteria: (a) time for core operations; (b) SSD space
requirements; (c) SSD wear-out thresholds.
5.1 Time Complexity
Table 4 summarizes the I/O complexity of the Base and Indirec-
tion methods. Most I/Os are random, meaning that SSD I/Os are
much faster than HDD I/Os, by about two orders of magnitude on
current devices. It is therefore worth investing a few extra SSD
I/Os to save even one HDD I/O. Note that these estimates are pes-
simistic in that they assume that none of the SSD-resident data is
cached in RAM. If commonly and/or recently accessed data was
cached in RAM, many SSD read I/Os could be avoided, and many
SSD writes could be combined into fewer I/Os.
The most striking observation is that with a small increase in
SSD I/Os the costs of updates, insertions, and deletions are substan-
tially reduced, while the cost of searches increases only slightly.
With the Indirection technique, the immediate HDD cost is inde-
pendent of the number of indexes.
Technique Type Immediate SSD Deferred SSD Immediate HDD Deferred HDD
Base Single-attr. update 0 0 3 + k2 + k
Insertion 0 0 1 + k1 + k
Deletion 0 0 2 + k1 + k
Search Uniq. 0 0 20
Search Mult. 0 0 dm/I e+m0
Indirection Single-attr. update 2 2 3 2
Insertion 1 + k2 + k+k/b 11 + (2k1)/b
Deletion 1 1 2 1 + k
Search Uniq. 2020
Search Mult. dm/I e+m0dm/Ie+m0
Table 1: Base vs. Indirection technique analysis. For single-attribute updates and deletions, we show the case where the LID was
initially in the leaf rather than the LIDBlock.nis the number of attributes, kis the number of B-tree indexes, mis the number of
results returned by a search, and Iis the number of keys in an index leaf node.
5.2 SSD Space Complexity
We now estimate the space needed on the SSD for the LtoR table
and the LIDBlocks. If Nis the number of latest-version records
in a table, then we need N(LID,RID) pairs in the LtoR table. In
a typical scenario, table records may be 150 bytes wide, with 8
byte LIDs and 8 byte RIDs. Assuming a fudge-factor of 1.2 for
representing a hash table, the LtoR table would constitute roughly
13% of the size of the current data in the main table.
We now consider the space consumption of LIDBlocks for sparse
indexes; dense indexes would take less space. The number of LID-
Blocks for a single sparse index is equal to the number of leaves
in the index. With an 8 byte key, an 8 byte LID, an 8KB HDD
page size and a 2/3occupancy factor, a leaf can represent about
340 records. A LIDBlock contains b(LID,key) pairs, leading to a
total space consumption of 16bN ×1.2/340 bytes per index. Even
a small value of b, say 16, is sufficiently large to effectively amor-
tize insertions. At this value of b, the LIDBlocks for a single index
constitute less than 1% of the size of the current data in the main
Thus, even with 7 indexes per table, the SSD space required in
a typical scenario is only about 20% of the HDD space required
for the current data. Taking the historical data into account, the
relative usage of SSD space would be even lower. Given that SSDs
are more expensive per byte than HDDs, it is reassuring that a well-
provisioned system would need much less SSD capacity than HDD
5.3 SSD Life-Expectancy
SSDs based on enterprise-grade SLC flash memory are rated for
about 105erase cycles before a page wears out [40]. SSDs based
on phase-change memory are rated for even higher rates. SSDs in-
ternally implement wear-leveling algorithms that spread the load
among all physical pages, so that no individual page wears out
In flash-based SSDs there is a write-amplification phenomenon
in which internal movement of data to generate new erase units
adds to the application write workload. This amplification factor
has been estimated at about 2 for common scenarios [25].
To estimate the wear-out threshold, suppose that the Indirection
method uses an SSD page size of 512 bytes. Then a device6with a
capacity of 8×107KB can tolerate 8×1012 writes before wear-out,
assuming a write-amplification of 2 and 105erases per page.
Our most write-intensive operation is insertion, with about 1 + k
SSD writes per insertion when there are kindexes.7Assuming nine
indexes, and ten insertions per transaction, we would need 100 SSD
writes per transaction. Even running continuously at the high rate
6Device characteristics are based on an 80GB FusionIO device.
7Again, this estimate is pessimistic because it assumes no RAM
caching of SSD pages.
of 800 transactions/second, the device would last more than three
We present a comprehensive evaluation of our Indirection pro-
totype based on the Generalized Search Tree (GiST) index pack-
age [31, 29]. Additionally, we provide complimentary kinds of
evidence using DB2 [2] in two ways to further support the results
demonstrated using our prototype. Since we are targeting opera-
tional data stores with both transactions and analytic queries, we
base our evaluation on the TPC-H benchmark [54].
First, in Section 6.2 we try to answer the question “How would
the performance of a state-of-the-art database change if it were to
use our methods?” We employ the commercial database system
DB2, but other database systems would have been equally good
choices. Given a workload W, we construct a rewritten workload
W0for the DB2 engine that simulates the I/O behavior of our tech-
nique for W. While the workload W0is not identical to W, we
argue that the performance of DB2 on W0provides a conservative
estimate of the performance of Won a (hypothetical) version of
DB2 that implements the Indirection technique.
Second, in Section 6.3, we evaluate our Indirection technique by
implementing it within the popular GiST index package [31, 29].
The GiST package have successfully been deployed into a num-
ber of well-known open-source projects including PostgreSQL [6],
PostGIS [5], OpenFTS [4], BioPostgres [1], and YAGO2 [7]. All
aspects of the method (insertions, deletions, modifications) have
been implemented; we refer to the resulting prototype as LIBGiSTmv .
We profile the I/O behavior of LIBGiSTmv, and create a detailed
I/O and execution cost model for the Indirection technique.
Finally, in Section 6.4, we shift our focus to TPC-H style ana-
lytical query processing that is geared towards an operational data
store. We provide evidence for the key tenet of this work, namely,
reducing the burden of index maintenance means that the system
can afford more indexes, which in turn improves the performance
of analytical query processing.
6.1 Platform
Experiments were run on two machines. The first machine, ded-
icated to our DB2 experiments, was running Fedora 8 and was
equipped with a Quad-core Intel Xeon CPU E5345 running at 2.33
GHz, having 4GB of RAM, three magnetic disks (7200 RPM SATA),
and one 80GB Single Level Cell (SLC) FusionIO solid state drive.8
The configuration of our second machine running Ubuntu 10.4,
used exclusively for our LIBGiSTmv experiments, was a 6-core In-
tel Xeon X5650 CPU running at 2.67GHz, having 32GB of RAM,
8Our SLC FusionIO can support up to 88,000 I/O operations per
second at 50µs latency.
Base Indirection
VALUES (?,· · · ,?) VALUES (?,· · · ,?)
Table 2: Query re-writing to capture indirection mechanism
one magnetic disk (7200 RPM SATA), and one 80GB SLC Fusio-
nIO solid state drive.9
In our experiments, we used IBM DB2 version 9.7 [2]. We con-
figured DB2 with adequate bufferpool size (warmed up prior to
starting the experiments) to achieve an average 90% hit ratio on
both data and index pages. For the DB2 experiments, we generate
a TPC-H database [54] with scale factor 20. File system caching
was disabled in all experiments.
For LIBGiSTmv, we extended LIBGiST v.1.0 to a multiversion
generalized search tree C++ library that supports our Indirection
techniques including LIDBlocks.
6.2 DB2 Query Re-writing Experiments
The goal of the query re-writing experiment is to study the I/O
pattern for both the unmodified DB2 system (“Base”) and the In-
direction approach. To evaluate Base for a query Qwe simply run
Qin the original schema S. To evaluate Indirection we rewrite Q
into another query Q0.Q0is run in a schema S0containing an ex-
plicit LtoR table representing the LID-to-RID mapping. Ideally,
the LtoR table is physically located on the SSD device; we empir-
ically examine the impact of the location below. In S0, base tables
are augmented with an additional LID column, where the value
of LID is generated randomly. In Swe build as many indexes as
desired on the base tables. In S0we build a single index on the
attribute selected in the query, typically the primary key of the base
6.2.1 Rewriting Queries
For queries, the rewriting simply adds the LtoR table to the
FROM clause, with a LID equijoin condition in the WHERE clause.
An example template of our query re-writing that simulates the in-
direction mechanism is shown in Table 2. The queries are written
over TPC-H LINEITEM table and the indirection table, denoted by
To see why the performance of Q0is a conservative estimate of
the cost of the indirection technique, consider two cases for the
query in Table 2. In the first case, some selection condition on
an indexed attribute (or combination of conditions) is sufficiently
selective that an access plan based on an index lookup is used.
This case includes a point query specified as an equality condition
on a key attribute. The Base plan for Qwould include an index
traversal and a RID-based lookup in the LINEITEM table. For Q0,
we will also have an index traversal and a RID-based lookup of
9While the FusionIO devices are high-end devices that are rela-
tively expensive, we remark that recent SSDs such as the Intel 520
series can store about 500GB, cost about $500, and can support
50,000 I/O operations per second at 85µs latency, more than suffi-
cient for the workloads described here.
LINEITEM, together with a LID-based lookup of the LtoR table.
This is precisely the I/O pattern of the Indirection technique.
In the second case, the selection conditions are not very selec-
tive. In such a case, the system is likely to scan the base table in
answering Q.10 To answer Q0in such a case requires a join of
LINEITEM and LtoR in order to make sure that we only process
the most recent versions of each record. This may actually an over-
estimate of the cost needed by the Indirection technique, because
the Indirection technique can also employ a scan without consulting
the LtoR table.
6.2.2 Rewriting Updates
For updates, an extra UPDATE statement is added to keep the
LtoR table current, as illustrated in Table 2. Since we are simu-
lating a multiversion database, the Base method inserts a new row
rather than updating an existing row. While the Base method pays
the cost of inserting the new row into each index, we are slightly
favoring the Base method by ignoring the cost of deleting the old
row from the indexes. Depending on the implementation technique
used for temporal attributes (Section 1.1) there may be additional
I/O required to update the temporal attributes of the old row, but we
ignore such I/O here.
For updates in which a single attribute value is modified, the
Indirection method incurs just one index update and one update to
the LtoR table. At first it may seem like there is more work done by
the Indirection method simulation for the updates of Table 2, since
the INSERT statements are the same and the Indirection method
has an extra UPDATE statement. This impression is true only for
the case in which there is one base table index in the base schema
S. As soon as there are multiple indexes on base tables in S, the
cost of the INSERT statement for the Base method exceeds that of
the corresponding INSERT in the Indirection method because more
indexes need to be updated.
Our profiling of update statements can easily be extended to
delete statements, but we omit such profiling because the perfor-
mance profile would be similar to that for updates. On the other
hand, our rewriting does not model the LIDBlock technique, and
thus cannot fully capture its performance advantages for insertions.
The benefits of the LIDBlock technique will be evaluated in Sec-
tion 6.3.
6.2.3 Results
All DB2 measurements are averages over 5 million randomly
chosen queries/updates with the exception of our selectivity exper-
10We do not include old versions of records for these experiments;
in a true multiversion database the base tables would contain some
old record versions that would need to be filtered out during the
scan, using the valid time attributes.
Indirection Base (1 x) Base (4x) Base (8x) Base (16x)
TPC-H: Lineitem table; Scale Factor: 20
Number of Indexes
Average Execution Time (ms)
Figure 5: Varying the number of indexes vs. update/query time
iments and analytical queries, in which fewer operations were per-
formed to complete the experiments in a reasonable time.
Effect of Indexes on Execution Time. Our first result confirms
that adding the extra indirection layer has negligible overhead for
query processing as shown in Figure 5. For this experiment, queries
are point queries that are accessed by specifying the value of the
primary key attribute. The query overhead is negligible because
the indirection mapping from LID-to-RID requires only a single
additional random SSD I/O, a delay that is orders of magnitude
faster than the necessary random magnetic disk I/O for retrieving
data pages holding entire records. Figure 5 also shows that the up-
date execution time for the Base technique increases dramatically
as the number of indexes is increased. With 16 indexes, Indirection
outperforms Base by a factor of 3.6.
Indirection Base (1x) Base (4x) Base (8x) Base (16x)
TPC-H: Lineitem table; Scale Factor: 20
Workload Size = 0.5M Updates
BP Data Writes (HDD)
BP Index Writes (HDD)
BP Data Writes (SSD)
Number of Indexes
Number of Pages in Thousads
Figure 6: Varying the number of indexes vs. page writes
Digging deeper into the bufferpool behavior of DB2 also reveals
that with a modest number of SSD page writes, the number of mag-
netic disk index writes are substantially reduced, as shown in Fig-
ure 6. This result demonstrates the effectiveness of Indirection in
reducing the index maintenance cost for multiversion databases.
Effect of Query Selectivity and Index Clustering. Consider a
range query over a single indexed attribute in the LINEITEM table.
By varying the width of the range, we can vary the query selectiv-
ity.11 Figure 7 shows the results for various selectivities, where the
indexed attribute is the key of the LINEITEM table by which the ta-
ble is clustered. On average the query overhead (for consulting the
11The starting value of the range in our range queries are chosen
Point Query 0.03 0.10 0.44 1.7 7.0 27 100
TPC-H: Lineitem table; Scale Factor: 20
Query Selectivity (%)
Average Execution Time (s)
Figure 7: Varying the query selectivity vs. query execution time
RCT on SSD B-Tree on SSD RCT on HDD B-Tree on HDD
TPC-H: Lineitem table; Scale Factor: 20
Implementation Techniques
Average Execution Time (ms)
Figure 8: Varying the indirection implementation techniques
LtoR table) remains under 20% as query selectivity varied. There
is a sudden jump at around 0.22% selectivity, but a closer exami-
nation of the DB2 query plans reveals that the sudden increase in
execution time is attributable to the optimizer incorrectly switching
from a nested-loops to a merge-sort join plan.12
The 20% overhead, while still small, is higher than the negligible
overhead seen for point queries in Figure 5. We also observed that
as the index clustering ratio decreases, the query processing gap be-
tween Indirection and Base decreases. For example, for the lowest
clustering ratio index of the LINEITEM table (on the SUPPKEY
attribute), the overhead drops to only 4% on average. These dif-
ferences can be understood as a consequence of caching. With a
high clustering ratio, one base table disk page I/O will be able to
satisfy many access requests, meaning that each magnetic disk I/O
is amortized over many records. On the other hand, every record
will contribute an SSD I/O for the LtoR table since that table is
not suitably clustered. For point queries and queries over an un-
clustered index, the ratio of SSD I/O operations to magnetic disk
I/Os will be close to 1.
Indirection Mapping Implementation. Finally, we illustrate
that the indirection random read/write access pattern is ideal for
SSDs and not for magnetic disks. We tried two different imple-
mentations of the LID-to-RID mapping using either a DB2 range-
clustered table (RCT) or a traditional B-Tree index hosted on either
SSDs or HDDs. As shown in Figure 8, when an SSD is used, the
12This behavior can be avoided if optimizer hints are provided.
Insert Update Query
TPC-H: Lineitem table; Scale Factor: 1
One Unique Index; Bufferpool Size: Small
Index Operation
Average Execution Time (ms)
Figure 9: Insert/Update/Query execution time
Indirection Base (1x) Base (4x) Base (8x) Base (16x)
TPC-H: Lineitem table; Scale Factor: 1
Bufferpool Size: Small
Number of Indexes
Average Execution Time (ms)
Figure 10: Varying the number of indexes vs. update time
overall query and update cost is 1.9X and 2.7X lower, respectively,
than on an HDD.
6.3 GiST Implementation
Our LIBGiSTmv codebase directly implements all aspects of our
proposed approach, including the LIDBlock technique. We employ
LIBGiSTmv as the basis for a systematic performance study in a
controlled environment. All HDD and SSD reads and writes used
Direct I/O to avoid extra copying of operating-system buffers. No
bufferpool space was allocated for the indirection table, so requests
to the indirection table always incur SSD I/Os.
In our prototype, we also extended the LIBGiST bufferpool in
order to test a variety of memory configurations. We further en-
hanced the LIBGiST library to collect statistics on index opera-
tions, file operations, and a detailed bufferpool snapshots. All pro-
totype experiments used the TPC-H schema with scale factor 1, and
the workload consisted of random point queries and random insert
and update queries (conceptually similar to the workload presented
in Section6.2). We focus on the I/O profile of index traversals
and updates as encountered by queries, updates, and insertions. All
results are averaged over 105random queries/updates/insertions.
Comparison of Average Execution Time. We first isolate the
query, update, and insert execution times for a single index with a
small bufferpool size, enough pages to pin the root of the index and
all pages in one path from the root to a leaf. Figure 9 shows that
the insert and query times are virtually the same due to the neg-
ligible overhead introduced by the additional SSD read and write
I/O operations. The “Update” column for the Indirection method in
[Index+Data+Indirection] [Index+Data] [Index] [Non-leaf Pages] [Small]
TPC-H: Lineitem table; Scale Factor: 1
1 x Unique Index
Bufferpool Size
Average Execution Time (ms)
(a) One Index
[Index+Data+Indirection] [Index+Data] [I ndex] [Non-leaf Pages] [Small]
TPC-H: Lineitem table; Scale Factor: 1
16 x Unique Index
Bufferpool Size
Average Execution Time (ms)
(b) Sixteen Indexes
Figure 11: Varying the bufferpool size vs. update time
Figure 9 reflects a traversal of the index to locate the prior version
of the record, plus an update of the SSD-resident LID-to-RID map-
ping. The base method is more expensive because it has to perform
additional writes to the magnetic disk.
Multiple Indexes. The update time shown in Figure 9 does not
capture the true benefit of the Indirection method when there are
multiple indexes. In such a case, the Indirection method needs to
traverse only one index, and update one LID-to-RID mapping.13
In contrast, the base method needs to traverse and update HDD-
resident pages for every index. Figure 10 shows that the perfor-
mance improvement approaches 20X as the number of indexes in-
creases from 1 to 16.
Varying the Bufferpool Size. We consider five categories of
bufferpool sizes large enough to hold: (1) only few pages (small),
(2) all index non-leaf pages, (3) the entire index, (4) the entire in-
dex and all data pages, or (5) everything including the index, data,
and LtoR table. These sizes reflect possible use cases where the
system has memory ranging from very small to very large. We ex-
plored an update workload under these five settings when having
one (Figure 11(a)) or sixteen (Figure 11(b)) indexes on unaffected
attributes. Note the logarithmic vertical scale.
In Figure 11(a), only when both the index and data pages are
memory-resident, but the LtoR table is not, does the Indirection
method perform poorly compared to the Base approach. This is
not surprising, because even a fast SSD I/O is slower than access
13If an indexed attribute is updated, then extra I/O is needed for that
index to relocate the index entry. Indexes that are neither traversed
nor updated in the Indirection method are said to be “unaffected.
0 4 8 16 32 64 256
TPC-H: Lineitem table; Scale Factor: 1
Non- unique Index on lineitem.suppkey; Average Number of Duplicates: 36
Index Clustering Ratio = 3%; Bufferpool Size: Small
Number Records per LIDBlock
Average Execution Time (ms)
Figure 12: Varying the LIDBlock size vs. insertion time
to RAM. When an update operation results in some HDD I/O (ei-
ther due to updating leaf pages or data pages), then Indirection is
superior by up to 2X. When everything including the LID-to-RID
mapping table is memory-resident, then Indirection continues to be
superior because it does not touch most of the indexes.
Again, the true benefit of the Indirection technique surfaces as
we increase the number of indexes. For example, when scaling the
number of indexes to sixteen in Figure 11(b), Indirection typically
wins by an order of magnitude. These experiments demonstrate
that the Indirection technique provides significant benefits even if
large portions of the database are memory-resident.
Varying the LIDBlock Size. So far our focus has been on im-
proving index maintenance involving updates. We now demon-
strate the power of our LIDBlock approach for insertions. In Fig-
ure 12, we vary the capacity of LIDBlock from no LIDs to 256 LIDs
for the non-unique index defined on the suppkey attribute of the
lineitem table. When increasing the LIDBlock size to around 32
LIDs, we observed that the insertion cost is significantly reduced
by up to 15.2X. This improvement is due to the amortization of
a random leaf page update over many insertions, e.g., a LIDBlock
size of 32 results in batching and flushing to the disk once every 32
insertions on average.
We can demonstrate a similar benefit with a non-unique index
having many more duplicate entries, such as an index on the quan-
tity attribute of the lineitem table, having 50 distinct values and
0.2M records per value. It is beneficial to allow larger LIDBlock
sizes as shown in Figure 13, in which the insertion execution time
is reduced by up to 4.9X. Unlike the previous case, the main reason
for the speedup is not simple amortization of insertions; since there
are so few key values the tree structure of the index is shallow and
its traversal is already cheap due to caching. Instead, the speedup is
due to having batches large enough to be resolved with one or two
I/Os to update the LID lists for a key.
6.4 DB2 Operational Data Store Experiments
In this section, we study the effects of adding indexes in the con-
text of an operational data store, in which small-scale analytical
query processing is competing with transactional throughput. Our
query workload is based on prior work [17] that modifies TPC-
H queries so that they each touch less data. For our index work-
load, we rely on the DB2 Index Advisor recommendation given
our query workloads defined over the entire TPC-H schema.
We first consider only the primary key indexes of the TPC-H re-
lations. Subsequently, we add the remaining indexes recommended
by DB2 Advisor one at a time, starting from the most to least
beneficial index. After each round of index addition, we re-run
0 64 256 512 102 4 2048 4096
TPC-H: Lineitem table; Scale Factor: 1
Non- unique Index on lineitem.quantity; Average Number of Duplicates: 0.2M
Index Clustering Ratio = 25%; Bufferpool Size: Small
NUmber of Records per LIDBlock
Average Execution Time (ms)
Figure 13: Varying the LIDBlock size vs. insertion time
7 (PKs) 8 9 1 0 11 12 13 14 15 16 17
TPC-H: all tables; Scale Factor: 20
Query (Base)
Query (Indirection)
Update (Base)
Update (Indirection)
Number of Indexes
Relative Execution Time
Figure 14: Effects of adding indexes on query/update time
our query workload. Likewise, after adding each index, we com-
pute the update cost for a uniformly generated update workload, in
which each non-primary-key attribute has an equal chance of being
updated. The update cost is a normalized average execution time
of updating indexes on DB2 and our on LIBGiSTmv with buffer-
pool size set to either small and non-leaf pages. The results are
summarized in Figures 14 and 15.
Our first observation is that analytical query time is substantially
reduced (by a factor of 3) as we add more indexes recommended
by DB2 Advisor. More importantly, we observe that the additional
indexes are more “affordable” for updates because our Indirection
technique reduces the index maintenance overhead. In the base
configuration, the index maintenance overhead increases linearly
as more indexes are added, reducing transaction throughput. Our
Indirection technique reduces the update cost by more than 40%.
Figure 15 shows a two dimensional plot of relative query time
qversus relative update time u. On both axes, smaller is better.
Each index configuration determines a point (u, q)on this graph,
and one can choose a suitable configuration to achieve a desired
query/update trade-off. In Figure 15, the Indirection technique dom-
inates14 the Base method. This is a key result: The Indirection tech-
nique makes indexes more affordable, leading to lower query and
update times.
To understand the importance of Figure 15 consider the follow-
ing scenarios.
1. A DBA has an update-time budget of 0.6 units, and within
that budget wants to optimize query processing time. Ac-
cording to Figure 15, the Indirection technique can achieve
query performance of about 0.32 units under such condi-
tions, while the Base method can achieve only 1.0 units,
three times worse.
14Except when a very small weight is assigned to update time.
0 0.1 0.2 0.3 0.4 0.45 0.5 0.55 0.6 0.7 0.75 0.85 0.9 1
TPC-H: all tables; Scale Factor: 20
Relative Update Time
Relative Query Time
Figure 15: Effects of adding indexes on query/update time
2. A DBA has a query-time budget of 0.5 units, and within that
budget wants to optimize update-time. The Indirection tech-
nique can achieve update performance of about 0.5 units un-
der such conditions, while the Base method can achieve only
0.7 units, 40% worse.
3. A DBA wants to minimize the sum of update-time and query-
time. The Indirection method can achieve a sum of about 0.87
at (0.55,0.32), whereas the best configuration in the Base
method is 1.15 at (0.75,0.4), 32% worse.
There has been extensive work on storing high value data on
SSDs. Some of these studies target the problem of data placement
in relational databases to take better advantage of the SSD char-
acteristics. In [17] database objects are either placed on HDDs or
SSDs based on workload characteristics. As opposed to using SSDs
and HDDs at the same level in the storage hierarchy, SSDs are also
used as a second layer cache between main memory and HDDs [37,
15, 18, 23].
The use of an SSD cache with a write-back policy would not
solve our problem as effectively as the Indirection technique. One
would need space to hold entire indexes if one wants to avoid HDD
index I/O for random updates and insertions. In contrast, our method
avoids HDD I/O with a much smaller SSD footprint. The Indirec-
tion technique and SSD caching of HDD pages are complemen-
tary, and can be used in tandem. In fact, using Indirection improves
the cache behaviour by avoiding reading/writing unnecessary pages
(i.e., it avoids polluting the cache).
Adding a level of indirection is a commonly used programming
technique relevant to a variety of systems problems [10]. The kind
of indirection we propose in this paper is used in log-structured
file systems [48, 33] and database indexing [39] but only at page
granularity. We use indirection at record granularity. Furthermore,
in [39], a latch-free B-Tree structure, called Bw-Tree, is proposed
that virtualizes the concept of the page through page-level indirec-
tion (unlike our record-level indirection). But, more importantly,
our Indirection technique is index-agnostic (not tuned for any partic-
ular index such as B-Tree) and addresses the general problem of in-
dex maintenance in multiversion databases, that is, to avoid updat-
ing indexes on unaffected attributes. Therefore, our approach could
reduce the index maintenance of Bw-Tree as well. Another funda-
mental difference between our philosophy and Bw-Tree is that we
consider de-coupling of the data from the index (in order to sim-
plify the support of secondary indexes on the data), while Bw-Tree
assumes a tight coupling of the index and data (i.e., storing the data
as part of the index), essentially Bw-Tree is intended as an atomic
record store similar to key-value stores [39].
The log-structured storage is also exploited in database manage-
ment systems [55, 45, 41, 51], but no indirection layer is used; thus,
the common merging and compaction operations in log-structured
storage result in expensive rebuilding of all indexes. By employing
our Indirection technique this index rebuilding can be avoided.
Page-level indirection tables are also used to improve the lifes-
pan of SSDs. In [21], a system called CAFTL is proposed to elim-
inate duplicate writes on SSDs. By keeping a mapping table of
blocks the redundant block writes on SSDs are eliminated.
In [56], Wu et al. proposes a software layer called BFTL to store
B-tree indexes on flash devices efficiently. IUD operations cause
significant byte-wise operations for B-tree reorganization. The pro-
posed layer reduces the performance overhead of these updates on
the flash device by aggregating the updates on a particular page.
In [24], Dou et al. propose specialized index structures and al-
gorithms that support querying of historical data in flash-equipped
sensor devices. Since the sensor devices have limited memory ca-
pacity (SRAM) and the underlying flash devices have certain limi-
tations, there are challenges in maintaining and querying indexes.
The deferral of index I/Os is used in several recent papers on im-
proving index performance [14, 44, 8]. In those papers, changes
are accumulated at intermediate nodes, and propagated to children
in batches. Searches need to examine buffers for keys that match.
This line of work is complementary but similar to our LIDBlock
method in that both techniques buffer insertions to amortize physi-
cal I/O.
SSDs are used to support online updates in data warehouses [11].
Incoming updates are first cached in SSDs and later merged with
the older records on HDDs to answer queries. In [11], data records
are accessed primarily through table scans rather than indexes.
Many specialized indexes for versioned and temporal data have
been proposed. A comprehensive survey of temporal indexing meth-
ods is provided in [49]. Tree based indexes on temporal data in-
clude the multiversion B-tree [12], Interval B-tree [9], Interval B+-
tree [16], TP-Index [52], Append-only Tree [30] and Monotonic
B+tree [26]. Efficiently indexing data with branched evolution is
discussed by Jouni et al. [36], who build efficient structures to run
queries on both current and historical data.
Specialized transaction time database systems such as Immortal
DB [42, 43] provide high performance for temporal applications.
Lomet et al. [43] describe how a temporal indexing technique, the
TSB-tree, is integrated into SQL Server. The paper also describes
an efficient page layout for multiversion databases.
The multiversion temporal database market is growing [22]. A
temporal database simplifies application development and deploy-
ment by pushing the management of temporal logic into database
engines. By adopting temporal technologies, the development cost
can be reduced by a factor of 10 [22]. This success has led ma-
jor database vendors (including Oracle [46], IBM [34], and Tera-
Data [53]) to provide support for multiversion temporal data.
We tackle a key challenge of multiversion databases: providing
good update performance and good query performance in a single
system. Transaction throughput and analytical query processing
often have conflicting requirements due to the high index mainte-
nance cost for transactions. Our efficient index maintenance us-
ing Indirection makes indexes more “affordable,” substantially im-
proving the available configuration choices. Our evaluation demon-
strates a query cost reduction by a factor of 3 without an increase
in update cost. The batching of insertions using our LIDBlock tech-
nique can save up to 90% of the insertion time.
There remain several open issues that we plan to address in future
work. One primary concern in multiversion databases is space. If
every version of a record is copied in full, the space needed to store
the database will grow dramatically. There is obvious potential to
compress records, taking advantage of commonality between old
and new record versions. Even so, the trade-offs are not straight-
forward. One does not want to burden transactions or queries on
current data with extensive compression/decompression tasks, so
a background process that leaves the current record uncompressed
but incrementally compresses the historical records may be a good
idea. Keeping many versions of a record on a common page may
help compression, but would be incompatible with other clustering
[1] BioPostgres: Data management for computational biology.
[2] IBM DB2 Database for Linux, UNIX, and Windows. windows/.
[3] IBM DB2 with BLU Acceleration.
db2/linux-unix- windows/db2-blu-acceleration/.
[4] OpenFTS: Open source full text search engine.
[5] PostGIS: Geographic information systems.
[6] PostgreSQL: Open source object-relational database system.
[7] YAGO2: High-quality knowledge base. naga/yago/.
[8] D. Agrawal, D. Ganesan, R. K. Sitaraman, Y. Diao, and S. Singh. Lazy-adaptive
tree: An optimized index structure for flash devices. PVLDB, 2(1):361–372,
[9] C.-H. Ang and K.-P. Tan. The interval B-tree. Inf. Process. Lett., 53(2):85–89,
Jan. 1995.
[10] R. Arpaci-Dusseau and A. Arpaci-Dusseau. Operating Systems: Three Easy
Pieces. Arpaci-Dusseau Books, 0.5 edition, 2012.
[11] M. Athanassoulis, S. Chen, A. Ailamaki, P. B. Gibbons, and R. Stoica. MaSM:
efficient online updates in data warehouses. In SIGMOD Conference, pages
865–876, 2011.
[12] B. Becker, S. Gschwind, T. Ohler, B. Seeger, and P. Widmayer. An
asymptotically optimal multiversion B-Tree. VLDB J., 5(4):264–275, 1996.
[13] B. Bhattacharjee, L. Lim, T. Malkemus, G. Mihaila, K. Ross, S. Lau,
C. McArthur, Z. Toth, and R. Sherkat. Efficient index compression in DB2
LUW. Proc. VLDB Endow., 2(2):1462–1473, Aug. 2009.
[14] B. Bhattacharjee, T. Malkemus, S. Lau, S. Mckeough, J.-A. Kirton, R. V.
Boeschoten, and J. Kennedy. Efficient bulk deletes for multi dimensionally
clustered tables in DB2. In VLDB, pages 1197–1206, 2007.
[15] B. Bhattacharjee, K. A. Ross, C. A. Lang, G. A. Mihaila, and M. Banikazemi.
Enhancing recovery using an SSD buffer pool extension. In DaMoN, pages
10–16, 2011.
[16] T. Bozkaya and M. ¨
glu. Indexing valid time intervals. Lecture Notes in
Computer Science, 1460:541–550, 1998.
[17] M. Canim, B. Bhattacharjee, G. A. Mihaila, C. A. Lang, and K. A. Ross. An
object placement advisor for DB2 using solid state storage. PVLDB,
2(2):1318–1329, 2009.
[18] M. Canim, G. A. Mihaila, B. Bhattacharjee, K. A. Ross, and C. A. Lang. SSD
bufferpool extensions for database systems. PVLDB, 3(2):1435–1446, 2010.
[19] S. Chaudhuri and V. Narasayya. Automating statistics management for query
optimizers. IEEE Trans. on Knowl. and Data Eng., 13(1):7–20, Jan. 2001.
[20] S. Chaudhuri and V. R. Narasayya. An efficient cost-driven index selection tool
for microsoft SQL server. In Proceedings of the 23rd International Conference
on Very Large Data Bases, VLDB ’97, pages 146–155, San Francisco, CA,
USA, 1997. Morgan Kaufmann Publishers Inc.
[21] F. Chen, T. Luo, and X. Zhang. CAFTL: A content-aware flash translation layer
enhancing the lifespan of flash memory based solid state drives. In FAST, pages
77–90, 2011.
[22] S. Chen. Time travel query or bi-temporal. In DB2 for z/OS Technical Forum,
[23] J. Do, D. Zhang, J. M. Patel, D. J. DeWitt, J. F. Naughton, and A. Halverson.
Turbocharging DBMS buffer pool using SSDs. In Proceedings of the 2011
ACM SIGMOD International Conference on Management of data, SIGMOD
’11, pages 1113–1124, New York, NY, USA, 2011. ACM.
[24] A. J. Dou, S. Lin, and V. Kalogeraki. Real-time querying of historical data in
flash-equipped sensor devices. In IEEE Real-Time Systems Symposium, pages
335–344, 2008.
[25] G. Drossel. Methodologies for calculating SSD usable life. In Storage
Developer Conference, 2009.
[26] R. Elmasri, G. T. J. Wuu, and V. Kouramajian. The time index and the
monotonic B+-tree. In Temporal Databases, pages 433–456. 1993.
[27] Fusion-io breaks one billion IOPS barrier.
fusion-io- breaks-one-billion- iops-barrier/.
[28] H. Garcia-Molina, J. D. Ullman, and J. Widom. Database Systems: The
Complete Book. Prentice Hall Press, Upper Saddle River, NJ, USA, 2 edition,
[29] The GiST indexing project.
[30] H. Gunadhi and A. Segev. Efficient indexing methods for temporal relations.
IEEE Trans. on Knowledge and Data Eng., 5(3):496, June 1993.
[31] J. M. Hellerstein, J. F. Naughton, and A. Pfeffer. Generalized search trees for
database systems. In Proceedings of the 21th International Conference on Very
Large Data Bases, VLDB ’95, pages 562–573, San Francisco, CA, USA, 1995.
Morgan Kaufmann Publishers Inc.
[32] F. D. Hinshaw, C. S. Harris, and S. K. Sarin. Controlling visibility in
multi-version database systems, 2007. US 7305386 Patent, Netezza
[33] D. Hitz, J. Lau, and M. Malcolm. File system design for an NFS file server
appliance. In Proceedings of the USENIX Winter 1994 Technical Conference on
USENIX Winter 1994 Technical Conference, WTEC’94, pages 19–19, Berkeley,
CA, USA, 1994. USENIX Association.
[34] DB2 10 for z/OS.
[35] W. H. Inmon. Building the Operational Data Store. John Wiley & Sons, Inc.,
New York, NY, USA, 2nd edition, 1999.
[36] K. Jouini and G. Jomier. Indexing multiversion databases. In Proceedings of the
sixteenth ACM conference on Conference on information and knowledge
management, CIKM ’07, pages 915–918, New York, NY, USA, 2007. ACM.
[37] W.-H. Kang, S.-W. Lee, and B. Moon. Flash-based extended cache for higher
throughput and faster recovery. PVLDB, 5(11):1615–1626, 2012.
[38] P.-A. Larson, S. Blanas, C. Diaconu, C. Freedman, J. M. Patel, and M. Zwilling.
High-performance concurrency control mechanisms for main-memory
databases. Proc. VLDB Endow., 5(4):298–309, Dec. 2011.
[39] J. J. Levandoski, D. B. Lomet, and S. Sengupta. The Bw-Tree: A B-tree for new
hardware platforms. In Proceedings of the 2013 IEEE 29th International
Conference on Data Engineering, ICDE ’13, Washington, DC, USA, 2013.
IEEE Computer Society.
[40] A. Leventhal. Flash storage memory. Commun. ACM, 51(7):47–51, July 2008.
[41] Y. Li, B. He, Q. Luo, and K. Yi. Tree indexing on flash disks. In Proceedings of
the 2009 IEEE International Conference on Data Engineering, ICDE ’09,
pages 1303–1306, Washington, DC, USA, 2009. IEEE Computer Society.
[42] D. Lomet, R. Barga, M. F. Mokbel, G. Shegalov, R. Wang, and Y. Zhu.
Immortal DB: transaction time support for SQL server. In Proceedings of the
2005 ACM SIGMOD international conference on Management of data,
SIGMOD ’05, pages 939–941, New York, NY, USA, 2005. ACM.
[43] D. Lomet, M. Hong, R. Nehme, and R. Zhang. Transaction time indexing with
version compression. Proc. VLDB Endow., 1(1):870–881, Aug. 2008.
[44] E. Omiecinski, W. Liu, and I. F. Akyildiz. Analysis of a deferred and
incremental update strategy for secondary indexes. Inf. Syst., 16(3):345–356,
[45] P. E. O’Neil, E. Cheng, D. Gawlick, and E. J. O’Neil. The log-structured
merge-tree (LSM-Tree). Acta Inf., 33(4):351–385, 1996.
[46] Oracle database 11g workspace manager overview.
twp-appdev- workspace-manager-11g- 128289.pdf.
[47] Oracle total recall/flashback data archive.
08-jul/flashback- data-archive-whitepaper- 129145.pdf.
[48] M. Rosenblum and J. K. Ousterhout. The design and implementation of a
log-structured file system. ACM Trans. Comput. Syst., 10(1):26–52, Feb. 1992.
[49] Salzberg and Tsotras. Comparison of access methods for time-evolving data.
CSURV: Computing Surveys, 31, 1999.
[50] C. M. Saracco, M. Nicola, and L. Gandhi. A matter of time: Temporal data
management in DB2 for z/OS, 2010.
[51] R. Sears and R. Ramakrishnan. bLSM: a general purpose log structured merge
tree. In SIGMOD Conference, pages 217–228, 2012.
[52] H. Shen, B. Chin, and O. H. Lu. The TP-Index: A dynamic and efficient
indexing mechanism for temporal databases. In In Proceedings of the Tenth
International Conference on Data Engineering, pages 274–281. IEEE, 1994.
[53] R. T. Snodgrass. A case study of temporal data, 2010. Teradata Corporation.
[54] TPC-H, decision support benchmark.
[55] H. T. Vo, S. Wang, D. Agrawal, G. Chen, and B. C. Ooi. LogBase: A scalable
log-structured database system in the cloud. PVLDB, 5(10):1004–1015, 2012.
[56] C.-H. Wu, T.-W. Kuo, and L.-P. Chang. An efficient B-tree layer
implementation for flash-memory storage systems. ACM Trans. Embedded
Comput. Syst., 6(3), 2007.
... Examples of systems that operate on a single data representation are SAP HANA by Sikka et al. (2012), HyPer by Neumann et al. (2015), and L-Store by Sadoghi et al. (2016aSadoghi et al. ( ,b, 2013Sadoghi et al. ( , 2014. They provide a unified data representation and enable users to analyze the latest (not just fresh data) or any version of data. ...
... Researchers have explored where to put the updates (whether to apply them in-place, append them in a logstructured format, or store them in a delta and periodically merge them with the main), how to best arrange the records (row-store, columnstore, or a PAX format), and how to handle multiple versions and when to do garbage collection. Sadoghi et al. (2016aSadoghi et al. ( ,b, 2013Sadoghi et al. ( , 2014 proposed L-Store (lineage-based data store) that combines the real-time processing of transactional and analytical workloads within a single unified engine. L-Store bridges the gap between managing the data that is being updated at a high velocity and analyzing a large volume of data by introducing a novel update-friendly lineage-based storage architecture (LSA). ...
Full-text available
Hybrid OLTP and OLAP
... As we pointed out earlier, conceptually for every record, we distinguish between base vs. tail records, where each record is assigned a unique RID. But it is important to note that the RID assigned to a base record is stable and remains constant throughout the entire life-cycle of a record, and all indexes only reference base records (base RIDs); consequently, eliminating index maintenance problem associated when update operation results in creation of a new version of the record [33,34]. When a reader performing index lookup, it always lands at a base record, and from the base record it can reach any desired version of the record by following the table-embedded indirection to access the latest (if the base record is out-of-date) or an earlier version of the record. ...
... Furthermore, indexes always point to base records (i.e., base RIDs), and they are never directly point to any tail records (i.e., tail RIDs) in order to avoid the index maintenance cost that arise in the absence of in-place update mechanism [33]. Therefore, when a new version of a record is created (i.e., a new tail record), first, all indexes defined on unaffected columns do not have to be modified and, second, only the affected indexes are modified with the updated values, but they continue to point to base records and not the newly created tail records. ...
Conference Paper
Full-text available
To derive real-time actionable insights from the data, it is important to bridge the gap between managing the data that is being updated at a high velocity (i.e., OLTP) and analyzing a large volume of data (i.e., OLAP). However, there has been a divide where specialized solutions were often deployed to support either OLTP or OLAP workloads but not both; thus, limiting the analysis to stale and possibly irrelevant data. In this paper, we present Lineage-based Data Store (L-Store) that combines the real-time processing of transactional and analytical workloads within a single unified engine by introducing a novel update-friendly lineage-based storage architecture. By exploiting the lineage, we develop a contention-free and lazy staging of columnar data from a write-optimized form (suitable for OLTP) into a read-optimized form (suitable for OLAP) in a transactionally consistent approach that supports querying and retaining the current and historic data.
... Each record is uniquely identified by a logical record ID (RID) that never changes throughout the lifetime of the record, in contrast to physical RIDs in traditional disk-based systems [47]. For each table, we maintain an indirection array [24,50] that maps RID to the virtual memory pointer to the record's latest version which further points to the next older version, and so on. Indexes map keys to RIDs, instead of pointers to record versions. ...
Data stalls are a major overhead in main-memory database engines due to the use of pointer-rich data structures. Lightweight coroutines ease the implementation of software prefetching to hide data stalls by overlapping computation and asynchronous data prefetching. Prior solutions, however, mainly focused on (1) individual components and operations and (2) intra-transaction batching that requires interface changes, breaking backward compatibility. It was not clear how they apply to a full database engine and how much end-to-end benefit they bring under various workloads. This paper presents \corobase, a main-memory database engine that tackles these challenges with a new coroutine-to-transaction paradigm. Coroutine-to-transaction models transactions as coroutines and thus enables inter-transaction batching, avoiding application changes but retaining the benefits of prefetching. We show that on a 48-core server, CoroBase can perform close to 2x better for read-intensive workloads and remain competitive for workloads that inherently do not benefit from software prefetching.
... Large scale distributed databases have been designed and deployed for handling commercial and cloud-based applications [7,17,19,39,47,51,54,[63][64][65]67,75,77]. The common denominator across all these databases is the use of transactions. ...
Full-text available
Large scale distributed databases are designed to support commercial and cloud based applications. The minimal expectation from such systems is that they ensure consistency and reliability in case of node failures. The distributed database guarantees reliability through the use of atomic commitment protocols. Atomic commitment protocols help in ensuring that either all the changes of a transaction are applied or none of them exist. To ensure efficient commitment process, the database community has mainly used the two-phase commit (2PC) protocol. However, the 2PC protocol is blocking under multiple failures. This necessitated the development of non-blocking, three-phase commit (3PC) protocol. However, the database community is still reluctant to use the 3PC protocol, as it acts as a scalability bottleneck in the design of efficient transaction processing systems. In this work, we present EasyCommit protocol which leverages the best of both worlds (2PC and 3PC), that is non-blocking (like 3PC) and requires two phases (like 2PC). EasyCommit achieves these goals by ensuring two key observations: (i) first transmit and then commit, and (ii) message redundancy. We present the design of the EasyCommit protocol and prove that it guarantees both safety and liveness. We also present a detailed evaluation of EC protocol and show that it is nearly as efficient as the 2PC protocol. To cater the needs of geographically large scale distributed systems we also design a topology-aware agreement protocol (Geo-scale EasyCommit) that is non-blocking, safe, live and outperforms 3PC protocol.
... This will result in serious performance degradation when updating the index. In a hybrid NVM environment, an intermediate layer constructed by low-latency NVM can be used to decouple the relationship between the physical representation and the logical representation of the multi-version log [111]. This study that builds a new storage layer to ease the bottleneck of reading and writing is worth learning. ...
Full-text available
Hardware techniques and environments underwent significant transformations in the field of information technology, represented by high-performance processors and hardware accelerators characterized by abundant heterogeneous parallelism, nonvolatile memory with hybrid storage hierarchies, and RDMA-enabled high-speed network. Recent hardware trends in these areas deeply affect data management and analysis applications. In this paper, we first introduce the development trend of the new hardware in computation, storage, and network dimensions. Then, the related research techniques which affect the upper data management system design are reviewed. Finally, challenges and opportunities are addressed for the key technologies of data management and analysis in new hardware environments.
NVMe SSDs based on flash are cheap and offer high throughput. Combining several of these devices into a single server enables 10 million I/O operations per second or more. Our experiments show that existing out-of-memory database systems and storage engines achieve only a fraction of this performance. In this work, we demonstrate that it is possible to close the performance gap between hardware and software through an I/O optimized storage engine design. In a heavy out-of-memory setting, where the dataset is 10 times larger than main memory, our system can achieve more than 1 million TPC-C transactions per second.
Hot standby systems often have to trade safety (i.e., not losing committed work) and freshness (i.e., having access to recent updates) for performance. Guaranteeing safety requires synchronous log shipping that blocks the primary until the log records are durably replicated in one or multiple backups; maintaining freshness necessitates fast log replay on backups, but is often defeated by the dual-copy architecture and serial replay: a backup must generate the "real" data from the log to make recent updates accessible to read-only queries. This paper proposes Query Fresh, a hot standby system that provides both safety and freshness while maintaining high performance on the primary. The crux is an append-only storage architecture used in conjunction with fast networks (e.g., InfiniBand) and byte-addressable, non-volatile memory (NVRAM). Query Fresh avoids the dual-copy design and treats the log as the database, enabling lightweight, parallel log replay that does not block the primary. Experimental results using the TPC-C benchmark show that under Query Fresh, backup servers can replay log records faster than they are generated by the primary server, using one quarter of the available compute resources. With a 56Gbps network, Query Fresh can support up to 4--5 synchronous replicas, each of which receives and replays ∼1.4GB of log records per second, with up to 4--6% overhead on the primary compared to a standalone server that achieves 620kTPS without replication.
The last decade has brought groundbreaking developments in transaction processing. This resurgence of an otherwise mature research area has spurred from the diminishing cost per GB of DRAM that allows many transaction processing workloads to be entirely memory-resident. This shift demanded a pause to fundamentally rethink the architecture of database systems. The data storage lexicon has now expanded beyond spinning disks and RAID levels to include the cache hierarchy, memory consistency models, cache coherence and write invalidation costs, NUMA regions, and coherence domains. New memory technologies promise fast non-volatile storage and expose unchartered trade-offs for transactional durability, such as exploiting byte-addressable hot and cold storage through persistent programming that promotes simpler recovery protocols. In the meantime, the plateauing single-threaded processor performance has brought massive concurrency within a single node, first in the form of multi-core, and now with many-core and heterogeneous processors. The exciting possibility to reshape the storage, transaction, logging, and recovery layers of next-generation systems on emerging hardware have prompted the database research community to vigorously debate the trade-offs between specialized kernels that narrowly focus on transaction processing performance vs. designs that permit transactionally consistent data accesses from decision support and analytical workloads. In this book, we aim to classify and distill the new body of work on transaction processing that has surfaced in the last decade to navigate researchers and practitioners through this intricate research subject.
Full-text available
The emergence of new hardware and platforms has led to reconsideration of how data management systems are designed. However, certain basic functions such as key indexed access to records remain essential. While we exploit the common architectural layering of prior systems, we make radically new design decisions about each layer. Our new form of B-tree, called the Bw-tree achieves its very high performance via a latch-free approach that effectively exploits the processor caches of modern multi-core chips. Our storage manager uses a unique form of log structuring that blurs the distinction between a page and a record store and works well with flash storage. This paper describes the architecture and algorithms for the Bw-tree, focusing on the main memory aspects. The paper includes results of our experiments that demonstrate that this fresh approach produces outstanding performance.
Geographic information systems are information systems which are based on data referenced by geographic coordinates. They are well developed, but are, as yet, little used in the UK (House of Lords 1984). Provides an introduction to the principles and literature of geographic information systems and explores their development alongside the field of remote sensing. -Author
Flash memories are in ubiquitous use for storage on sensor nodes, mobile devices, and enterprise servers. However, they present significant challenges in designing tree indexes due to their fundamentally different read and write characteristics in comparison to magnetic disks. In this paper, we present the Lazy-Adaptive Tree (LA-Tree), a novel index structure that is designed to improve performance by minimizing accesses to flash. The LA-tree has three key features: 1) it amortizes the cost of node reads and writes by performing update operations in a lazy manner using cascaded buffers, 2) it dynamically adapts buffer sizes to workload using an online algorithm, which we prove to be optimal under the cost model for raw NAND flashes, and 3) it optimizes index parameters, memory management, and storage reclamation to address flash constraints. Our performance results on raw NAND flashes show that the LA-Tree achieves 2x to 12x gains over the best of alternate schemes across a range of workloads and memory constraints. Initial results on SSDs are also promising, with 3x to 6x gains in most cases.
Data management workloads are increasingly write-intensive and subject to strict latency SLAs. This presents a dilemma: Update in place systems have unmatched latency but poor write throughput. In contrast, existing log structured techniques improve write throughput but sacrifice read performance and exhibit unacceptable latency spikes. We begin by presenting a new performance metric: read fanout, and argue that, with read and write amplification, it better characterizes real-world indexes than approaches such as asymptotic analysis and price/performance. We then present bLSM, a Log Structured Merge (LSM) tree with the advantages of B-Trees and log structured approaches: (1) Unlike existing log structured trees, bLSM has near-optimal read and scan performance, and (2) its new "spring and gear" merge scheduler bounds write latency without impacting throughput or allowing merges to block writes for extended periods of time. It does this by ensuring merges at each level of the tree make steady progress without resorting to techniques that degrade read performance. We use Bloom filters to improve index performance, and find a number of subtleties arise. First, we ensure reads can stop after finding one version of a record. Otherwise, frequently written items would incur multiple B-Tree lookups. Second, many applications check for existing values at insert. Avoiding the seek performed by the check is crucial.
This paper introduces the Generalized Search Tree (GiST), an index structure supporting an extensible set of queries and data types. The GiST allows new data types to be indexed in a manner supporting queries natural to the types; this is in contrast to previous work on tree extensibility which only supported the traditional set of equality and range predicates. In a single data structure, the GiST provides all the basic search tree logic required by a database system, thereby unifying disparate structures such as B+-trees and R-trees in a single piece of code, and opening the application of search trees to general extensibility. To illustrate the exibility of the GiST, we provide simple method implementations that allow it to behave like a B+-tree, an R-tree, and an RD-tree, a new index for data with set-valued attributes. We also present a preliminary performance analysis of RD-trees, which leads to discussion on the nature of tree indices and how they behave for various datasets.