Instant recovery with write-ahead logging
Goetz Graefe · Caetano Sauer · Wey Guy · Theo Härder
Goetz Graefe, Wey Guy
HP Lab, Palo Alto
E-mail: [goetz.graefe, wey.guy]@hp.com

Caetano Sauer, Theo Härder
TU Kaiserslautern
E-mail: [csauer, haerder]@cs.uni-kl.de
Abstract Instant recovery improves system availability by
reducing the mean time to repair, i.e., the interval during
which a database is not available for queries and updates due
to recovery activities. Variants of instant recovery pertain to
system failures, media failures, node failures, and combi-
nations of multiple failures. After a system failure, instant
restart permits new transactions immediately after log anal-
ysis, before and concurrent to “redo” and “undo” recovery
actions. After a media failure, instant restore permits new
transactions immediately after allocation of a replacement
device, before and concurrent to restoring backups and re-
playing the recovery log.
Write-ahead logging is already ubiquitous in data man-
agement software. The recent definition of single-page fail-
ures and techniques for log-based single-page recovery en-
able immediate, lossless repair after a localized wear-out in
novel or traditional storage hardware. In addition, they form
the backbone of on-demand “redo” in instant restart, instant
restore, and eventually instant failover. Thus, they comple-
ment on-demand invocation of traditional single-transaction
“undo” or rollback.
In addition to these instant recovery techniques, the dis-
cussion introduces self-repairing indexes and much faster
offline restore operations, which impose no slowdown in
backup operations and hardly any slowdown in log archiving
operations. The new restore techniques also render differ-
ential and incremental backups obsolete, complete backup
commands on a database server practically instantly, and
even permit taking full up-to-date backups without impos-
ing any load on the database server.
1 Introduction
Modern hardware differs from hardware of 25 years ago,
when many of the database recovery techniques used to-
day were designed. Current hardware includes high-density
disks with single-page failures due to cross-track effects,
e.g., in shingled or overlapping recording; high-capacity
storage devices with long restore recovery after media fail-
ures; semiconductor storage with single-page failures due
to localized wear-out; and large memory and large buffer
pools with many pages and therefore many dirty pages and
long restart recovery after system failures.
On contemporary hardware, instant recovery¹ tech-
niques seem more appropriate. They employ and build on
many proven techniques, in particular write-ahead logging,
checkpoints, and log archiving. The foundation consists of two new
ideas. First, single-page failures and single-page recovery
[5] enable incremental recovery fast enough to run on de-
mand without imposing major delays in query and transac-
tion processing. Second, log archiving not only compresses
the log records but also partially sorts the log archive, which
enables multiple access patterns, all reasonably efficient.
These foundations are exploited for incremental recovery
actions executing on demand, in particular after system fail-
ures (producing an impression of “instant restart”) and after
media failures (“instant restore”). In addition to incremen-
tal recovery, new techniques speed up offline backup and
offline restore operations. In particular, differential and in-
cremental backups become obsolete and full backups can be
created efficiently without imposing any load on the active
server process.

¹ We use the term “instant” not in an absolute meaning but a relative one, i.e., in comparison to prior techniques. This is like instant coffee, which is not absolutely instantaneous but only relative to traditional techniques of coffee preparation. The reader’s taste and opinion must decide whether instant coffee actually is coffee. Instant recovery, however, is true and reliable recovery from system and media failures, with guarantees as strong as those of traditional recovery techniques.
The problem of out-of-date recovery methods for to-
day’s hardware exists equally for file systems, databases,
key-value stores, and contents indexes in information re-
trieval and internet search. Similarly, the techniques and
solutions discussed below apply not only to databases, even
if they are often discussed using database terms, but also
to file systems, key-value stores, and contents indexes. In
other words, the problems, techniques, and solutions apply
to practically all persistent digital storage technologies that
employ write-ahead logging.
2 Single-page failure and repair
Modern hardware such as flash storage promises higher
performance than traditional hardware such as rotating
magnetic disks. However, it also introduces its own issues
such as relatively high write costs and limited endurance.
Techniques such as log-structured file systems and write-
optimized B-trees [3] might reduce the effects of high write
costs, and wear leveling might delay the onset of reliability
problems. Nonetheless, when failures do occur, they must
be identified and repaired.
2.1 Single-page recovery
Single-page recovery uses a page image in a backup and the
history of the page as captured in the recovery log, specif-
ically the “redo” portions of log records pertaining to the
specific database page. Efficient access to all relevant log
records requires a pointer to the most recent log record and,
within each log record, a pointer to the prior one. In a sys-
tem that ensures exactly-once application of log records to
database pages by means of PageLSN values [11], this is
equivalent to saving, in each log record, the prior PageLSN
value of the affected database page.
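As an illustration, the following sketch shows this chain walk in Python. The callables backup, log, and apply_redo and the field names prev_page_lsn and redo_payload are hypothetical placeholders for the corresponding system components, not interfaces defined in this paper; it is a minimal sketch, not the authors’ implementation.

```python
def single_page_recover(page_id, newest_lsn, backup, log, apply_redo):
    # Start from the page image in the most recent backup.
    page, backup_lsn = backup(page_id)

    # Walk the per-page chain backward, from the most recent log record down
    # to the log record already reflected in the backup image.
    chain, lsn = [], newest_lsn
    while lsn is not None and lsn > backup_lsn:
        rec = log(lsn)                  # fetch one log record by its LSN
        chain.append(rec)
        lsn = rec.prev_page_lsn         # per-page pointer stored in each log record

    # Replay the "redo" portions forward, oldest first, so that PageLSN-guarded
    # exactly-once application holds at every step.
    for rec in reversed(chain):
        page = apply_redo(page, rec.redo_payload)
    return page
```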
Fig. 1 Example log contents

Figure 1 shows a few log records in a recovery log in-
cluding the per-transaction log chains (transactions T1 and
T2) and the per-page log chains (database pages 4 and 7).
Varying from the ARIES design, log records describing
“undo” (rollback log records) point to the original “do”
log records in order to reduce redundant information in
the log. Incidentally, this design permits compensation log
records of uniform size and therefore enables accurate pre-
allocation of log space for an eventual rollback – with that,
a transaction abort cannot fail due to exhausted log space.
In the example shown in Figure 1, both rollback log records
have equal values for the per-transaction pointer and the per-
page pointer, with an obvious opportunity for compression.
The sequence of log records for page 4, slot 6 implies that
transaction T1 released locks incrementally while rolling
back. An aborted transaction ends with a commit record
after it has “updated back” all its changes in the database.
If a transaction ends with no change in the logical database
contents, there is no need to force the commit record to
stable storage – this applies both to system transactions
(similar to “top-level actions” in ARIES) and to aborted
user transactions.
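The log records sketched below (Python dataclasses with illustrative field names, not the authors’ actual record format) capture the pointers discussed above: each update record carries the prior PageLSN of its page, each compensation record points back to the original “do” record, and a commit record may or may not require a log force.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class UpdateLogRecord:                 # a "do" log record
    lsn: int
    txn_id: int
    page_id: int
    prev_page_lsn: Optional[int]       # per-page chain: prior PageLSN of the page
    prev_txn_lsn: Optional[int]        # per-transaction chain; in the proposed
                                       # design kept only in memory, not persisted
    redo: bytes
    undo: bytes

@dataclass
class CompensationLogRecord:           # rollback ("undo") log record
    lsn: int
    txn_id: int
    page_id: int
    do_lsn: int                        # points to the original "do" record; the
                                       # per-page and per-transaction pointers
                                       # coincide here, permitting uniform size

@dataclass
class CommitRecord:
    lsn: int
    txn_id: int
    forced: bool                       # False for aborted transactions and for
                                       # system transactions that leave the
                                       # logical database contents unchanged
```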
While some commercial systems already include the
prior PageLSN in each log record, e.g., Microsoft SQL
Server, others do not. Thus, an argument could be made that
the per-page chain of log records increases individual log
records and thus a system’s overall log volume and band-
width requirements. It turns out, however, that all systems
unnecessarily include per-transaction chains of log records
in the persistent recovery log. Instead, it is sufficient to retain
this per-transaction information in memory. During restart
after a system failure, log analysis can re-create the required
information from checkpoint log records and the individual
log records between checkpoint and system crash.
The original proposal for single-page failures suggests a
“page recovery index” for each database or each table space.
With an index entry for each page in the database or table
space, an entry in the page recovery index points to the most
recent log record for each database page not in the buffer pool. In
other words, each time the buffer pool writes a dirty database
page to storage, an entry in the page recovery index requires
an update with a new LSN value.
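A minimal sketch of such a page recovery index, assuming a simple in-memory map keyed by page identifier (the class and method names are illustrative, not the original design’s data structure):

```python
class PageRecoveryIndex:
    """Per-database (or per-table-space) map from page id to the LSN of the
    most recent log record, maintained for pages not in the buffer pool."""

    def __init__(self):
        self._newest_lsn = {}              # page_id -> LSN

    def on_page_write(self, page_id, page_lsn):
        # Called whenever the buffer pool writes a dirty page to storage.
        self._newest_lsn[page_id] = page_lsn

    def chain_head(self, page_id):
        # Entry point of the per-page chain for single-page recovery.
        return self._newest_lsn.get(page_id)
```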
2.2 Self-repairing B-trees
Self-repairing indexes [6] combine efficient (yet compre-
hensive) detection of single-page faults with immediate
single-page recovery. Comprehensive fault detection re-
quires in-page checks as well as cross-page checks. In a
self-repairing B-tree index, each node includes low and
high fence keys that define the node’s maximal permissi-
ble key range. Along the left and right edges of the B-tree,
these fence keys have values −∞ and +∞, including in the
root node. In all other nodes, a node’s fence keys equal two
keys in the node’s parent, i.e., typically branch keys. A node
and its leftmost child share the same low fence key value;
a node and its rightmost child share the high fence key
value. Moreover, for both fault detection and repair, each
parent-to-child pointer in a self-repairing B-tree carries an
expected PageLSN value for the child page. For simplicity
of maintenance, this requires that there be at all times only
a single pointer to each page as in Foster B-trees [5].
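The following sketch outlines how a parent-to-child traversal might combine these cross-page checks with on-demand repair; field names such as low_fence, high_fence, and expected_lsn are assumptions made for illustration rather than the authors’ interfaces.

```python
def fetch_child(buffer_pool, parent, child_slot, single_page_repair):
    ptr = parent.children[child_slot]      # carries page id, fence keys, and
                                           # the child's expected PageLSN
    child = buffer_pool.fetch(ptr.page_id)

    consistent = (child.low_fence == ptr.low_fence
                  and child.high_fence == ptr.high_fence
                  and child.page_lsn == ptr.expected_lsn)
    if not consistent:
        # A cross-page check failed: rebuild the child from its backup image
        # and its per-page chain of log records, up to the expected PageLSN.
        child = single_page_repair(ptr.page_id, ptr.expected_lsn)
    return child
```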
3 System failure and restart
Database system failures and the subsequent recovery dis-
rupt many transactions and entire applications, usually for
an extended duration. For those failures, new on-demand
“instant” recovery techniques reduce application downtime
from minutes or hours to seconds.
Fig. 2 Restart phases and new transactions

The top of Figure 2 illustrates the three traditional phases
of system recovery and some typical durations. The bottom
of Figure 2 illustrates application availability after a restart
using prior approaches and using the new technique. Top
and bottom share a common timeline. The important obser-
vation is that previous techniques enable query and transac-
tion processing only after the “redo” recovery phase or even
after the “undo” recovery phase, whereas instant recovery
permits new queries and update transactions immediately af-
ter log analysis. If log analysis takes one second and “redo”
and “undo” phases take one minute each, then instant recov-
ery reduces the time from database restart to processing new
transactions by about two orders of magnitude compared to
both traditional implementations and the ARIES design. Re-
ducing the mean time to repair by two orders of magnitude
adds two nines to application availability, e.g., turning a sys-
tem with 99% availability into one with 99.99% availability.
Immediately upon system restart, instant recovery per-
forms log analysis but invokes neither “redo” nor “undo” re-
covery. Log analysis gathers information both about pages
requiring “redo” and about transactions requiring “undo”.
Thus, log analysis restores essential server state lost in the
system failure, i.e., in transaction manager and lock man-
ager. The buffer pool gathers information about dirty pages.
This information does not include images of pages, i.e., ran-
dom I/O in the database is not required. For efficiency of
subsequent recovery, log pages and records should remain
in memory after log analysis.
In preparation of “undo” recovery, log analysis tracks the
set of active transactions and their locks. It initiates this set
from the checkpoint log record. When log analysis is com-
plete, it has identified all transactions active at the time of the
crash and their concurrency control requirements. The lock
manager holds these locks just as if the transactions were
still active. Note that conflict detection is not required dur-
ing log analysis; the recovery process may rely on success-
ful and correct detection of lock conflicts during transaction
processing prior to the crash.
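A sketch of this part of log analysis, under the assumption that lock information is available from the checkpoint record and from individual update log records (the record kinds and the lock_manager.grant call are hypothetical):

```python
def analyze_for_undo(checkpoint, log_records, lock_manager):
    # Active transactions and their locks, initialized from the checkpoint.
    active = {txn_id: set(locks)
              for txn_id, locks in checkpoint.active_transactions.items()}

    for rec in log_records:                    # scan from checkpoint to crash
        if rec.kind == "begin":
            active[rec.txn_id] = set()
        elif rec.kind in ("commit", "rollback-complete"):
            active.pop(rec.txn_id, None)
        elif rec.kind == "update":
            active.setdefault(rec.txn_id, set()).add(rec.lock)

    # Re-acquire the locks of all pre-crash transactions; no conflict detection
    # is needed because these locks were already held before the crash.
    for txn_id, locks in active.items():
        for lock in locks:
            lock_manager.grant(txn_id, lock)
    return active
```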
In preparation of “redo” recovery, log analysis produces
a list of pages that may require “redo” actions. It initiates this
list from the checkpoint log record, specifically the list of
dirty pages. Log analysis registers those pages without I/O
and thus without page images in memory. In other words,
the buffer pool must support allocation of descriptors with-
out page images. While registered for “redo” recovery, a
page must remain in the buffer pool. For each such page,
the registration includes the expected PageLSN value, i.e.,
the LSN of the last log record pertaining to the database page found
during log analysis. During log analysis, i.e., the scan over
all log records between the last checkpoint and the crash,
log records describing page updates (including formatting
of newly allocated pages) add or modify registrations of
database pages. Log records describing completed write op-
erations unregister the appropriate database page.
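The registration logic might look as follows; register_without_image and unregister are hypothetical buffer-pool operations standing in for the descriptor allocation described above.

```python
def analyze_for_redo(checkpoint, log_records, buffer_pool):
    # Seed the registrations from the checkpoint's dirty-page list.
    for page_id, rec_lsn in checkpoint.dirty_pages.items():
        buffer_pool.register_without_image(page_id, expected_lsn=rec_lsn)

    for rec in log_records:                    # scan from checkpoint to crash
        if rec.kind in ("update", "format"):
            # The newest log record seen for a page defines its expected PageLSN.
            buffer_pool.register_without_image(rec.page_id, expected_lsn=rec.lsn)
        elif rec.kind == "page-written":
            # A completed write means the stored image is already up to date.
            buffer_pool.unregister(rec.page_id)
```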
When an application requires one of the registered pages
but the page image in the database is older than the expected
PageLSN included in the registration, the buffer pool in-
vokes single-page “redo” recovery. Once single-page “redo”
recovery is complete, it rescinds the registration, which pre-
vents future “redo” attempts for this page.
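A sketch of the corresponding buffer-pool fix path, again with hypothetical method names:

```python
def fix_page(buffer_pool, page_id, single_page_redo):
    expected = buffer_pool.registration(page_id)   # None if no "redo" is pending
    frame = buffer_pool.read_from_storage(page_id)

    if expected is not None and frame.page_lsn < expected:
        # The stored image is older than the last logged change:
        # replay this page's history up to the expected PageLSN.
        frame = single_page_redo(page_id, frame, up_to_lsn=expected)

    if expected is not None:
        buffer_pool.unregister(page_id)    # prevents future "redo" attempts
    return frame
```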
Upon a lock conflict between new and old (pre-crash)
transactions, the first question is whether the old transac-
tion has participated in a two-phase commit and is waiting
for the global commit decision – in those cases, the new
transaction must wait or abort. Otherwise, the old transac-
tions can roll back using standard techniques, i.e., invoking
“undo” (compensation) actions and logging them. If trans-
action rollback touches a database page registered in the
buffer pool as requiring “redo” recovery, rollback invokes
the appropriate single-page recovery before the transaction
rollback resumes. As usual, when a transaction rollback is
complete, the transaction writes a log record (it “commits
nothing” with no need to force the log record immediately
to stable storage), releases its locks, and frees its in-memory
data structures.
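The conflict-resolution rule can be summarized in a short sketch; rollback, append_commit, and the in_doubt flag are illustrative assumptions, not a prescribed interface.

```python
def on_lock_conflict(new_txn, old_txn, rollback, append_commit):
    if old_txn.in_doubt:
        # Participant in two-phase commit awaiting the global decision:
        # its locks must be preserved, so the new transaction waits or aborts.
        return "wait_or_abort"

    # Otherwise roll back the pre-crash transaction with standard techniques;
    # rollback() logs compensation records and, for any touched page still
    # registered for "redo", invokes single-page recovery first.
    rollback(old_txn)
    append_commit(old_txn, forced=False)   # "commits nothing"; no log force needed
    old_txn.release_locks()
    return "retry_lock_request"
```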
4 Media failure and restore
After detection of a media failure, the first step is provi-
sioning of a replacement device, hopefully a spare that is
formatted but empty. Traditional restore operations require
multiple phases: copying a full backup (perhaps a week old),
adding modified pages from incremental backups (perhaps
from every day since the full backup), log analysis (to de-
termine incomplete transactions that require rollback), log
replay (“redo” of hours of recovery log in order to ensure
durability of committed transactions that modified the failed
media), and finally rollback of incomplete transactions (for
transaction-consistent restore). Optimizations merge pages
from full backup and incremental backups in a single re-
store phase and sort log records by their affected database
page such that log replay and transaction rollback require
only a single sweep over the replacement media.
4.1 Single-pass restore
Our design for single-pass restore goes two steps further.
First, during transaction processing, it writes the standard
recovery log but when archiving the recovery log, it ap-
plies run generation logic (the first phase of external merge
sort). Thus, a log archive is partitioned into epochs (per-
haps one minute to one hour of log records per partition)
and within each partition, log records are sorted by their af-
fected database page. Run generation with replacement se-
lection (a priority queue) is a continuous process built into
the log archiving logic. Second, during media restore opera-
tions, our design merges not only page images from backups
but also the runs in the log archive with each other and with
the backup pages. Thus, each page written to the replace-
ment device is immediately fully up-to-date and recovered.
If run generation during transaction processing uses very
limited memory and CPU power, intermediate merge steps,
e.g., once a day, can reduce the number of runs (partitions)
in the log archive such that a restore operation requires only
a single merge step.
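The following Python sketch illustrates both steps on data sorted by page identifier: a simplified run generation (sorting fixed-size batches rather than true replacement selection) and a single merge of backup pages with all log-archive runs. All names are illustrative assumptions; a real implementation would stream runs from the log archive rather than hold them in memory.

```python
import heapq
from itertools import groupby

def archive_runs(log_records, run_size):
    """Run generation during log archiving: cut the log into runs and sort
    each run by (page_id, lsn)."""
    runs, buf = [], []
    for rec in log_records:
        buf.append(rec)
        if len(buf) >= run_size:
            runs.append(sorted(buf, key=lambda r: (r.page_id, r.lsn)))
            buf = []
    if buf:
        runs.append(sorted(buf, key=lambda r: (r.page_id, r.lsn)))
    return runs

def single_pass_restore(backup_pages, runs, apply_redo, write_page):
    """Merge backup pages (sorted by page id) with all log-archive runs, so
    that every page written to the replacement device is fully up to date."""
    merged = heapq.merge(*runs, key=lambda r: (r.page_id, r.lsn))
    groups = groupby(merged, key=lambda r: r.page_id)
    cur = next(groups, None)

    for page_id, image in backup_pages:
        # Advance the log cursor; pages formatted after the backup are not
        # handled in this simplified sketch.
        while cur is not None and cur[0] < page_id:
            cur = next(groups, None)
        if cur is not None and cur[0] == page_id:
            for rec in cur[1]:
                if rec.lsn > image.page_lsn:   # exactly-once application
                    image = apply_redo(image, rec)
            cur = next(groups, None)
        write_page(page_id, image)
```

With replacement selection, runs tend to average about twice the size of the in-memory workspace, which keeps the number of runs, and thus the merge fan-in during restore, small.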
4.2 Instant restore
For practically immediate availability after a media failure
(assuming immediate availability of an empty replacement
device), the logic of single-pass restore can run on demand.
The required indexes on backups and log archive can be a
side effect of backup and log archiving – note that both pro-
cesses write their output sorted on page identifier, which
permits almost free creation of sorted indexes. Media re-
covery can run for individual pages or, for efficiency and
higher bandwidth, in groups of contiguous pages. A sim-
ple and practical policy recovers contiguous database pages
until it reaches a database page already restored or until an
active transaction requires a database page not yet restored.
In other words, instant restore uses the logic of single-pass
restore but in multiple segments chosen on demand instead
of in a single contiguous run.
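A minimal sketch of this on-demand policy, assuming a set of already-restored page identifiers and a restore_range callable that applies the single-pass restore logic to a contiguous page range (both are illustrative assumptions):

```python
def restore_on_demand(page_id, restored, restore_range, max_run):
    if page_id in restored:
        return
    # Restore contiguous pages starting at the requested one, stopping at a
    # page that is already restored or at a configurable maximal run length.
    run, pid = [], page_id
    while pid not in restored and len(run) < max_run:
        run.append(pid)
        pid += 1
    restore_range(run)         # single-pass restore logic for this page range
    restored.update(run)
```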
5 Multiple failures
The presence of a first failure or inconsistency suggests that
another failure or inconsistency is likely due to a common
underlying cause. For example, if a code defect in low-level
concurrency control (latching) causes an inconsistent page
image during a period of high system load, it is likely to
affect more than a single page. Similarly, if a programming
error (perhaps in an unrelated application) causes a system
crash, running the same applications after a system restart
may cause another crash.
A system failure during system restart (after a prior fail-
ure) requires precisely the same recovery logic as the orig-
inal system failure. The first restart may speed up a sec-
ond restart, should it become necessary, by logging a sys-
tem checkpoint immediately after log analysis and then fre-
quently during restart. Such checkpoint log records reduce
the log analysis effort during restart recovery from a system
failure during a restart.
Similarly, a media failure during recovery from another,
unrelated media failure merely requires running the restore
logic for both failures using, of course, two replacement de-
vices. A media failure of replacement media during restore
merely requires restarting the restore logic with uncompro-
mised replacement media.
Our long paper on instant recovery [4] also covers fur-
ther combined failure modes, e.g., a media failure during
restart or a system failure during restore.
6 Alternatives
The desire for instant recovery after failures is not new. For
system restart, the promise of nonvolatile memory triggered
early designs [7–9] as well as recent ones [12]. In contrast,
all techniques described above rely on write-ahead logging,
which any transactional system needs for cases of trans-
action failure and transaction rollback, and work with all
storage technologies (except tape), from traditional disks
and disk arrays to flash storage and non-volatile memory.
For media restore operations, Gray [2] proposed sorting
and merging log records to turn a “fuzzy dump” into a
“sharp” one, i.e., to turn an online backup into one with
only committed transactions, and some IBM products sort
and aggregate log records prior to log replay [1,10]. In
contrast, single-pass restore divides the sort into run gen-
eration during log archiving and merging during restore
operations, thus achieving high restore bandwidth without
adding phases and delays to the recovery effort. In addition,
inexpensive indexing for the backup and for the log archive
permits the appearance of instant restore operations. Again,
these recovery techniques work with all storage technolo-
gies (except tape), i.e., without reliance on special hardware.
7 Summary
In summary, write-ahead logging readily enables recov-
ery techniques overlooked for decades. The foundation is
on-demand single-page repair using per-page chains of log
records, i.e., efficient access to the history of each database
page. Using on-demand single-page “redo” and on-demand
single-transaction “undo”, instant restart permits new trans-
actions almost immediately after a system failure and re-
boot. External merge sort of log records during log archiving
and media restore operations, with run generation during log
archiving and merging during restore, enables single-pass
restore. Exploiting the order of database pages during back-
ups and of log records during log archiving enables cheap
creation of sorted indexes, which in turn enable on-demand
restore logic for individual database pages or for contiguous
runs of database pages.
References
1. Paolo Bruni, Marcelo Antonelli, Davy Goethals, Armin Kompalka,
Mary Petras: DB2 9 for z/OS: using the utilities suite. IBM Redbooks,
2nd ed., February 2010 – Section 13.10 “Fast log apply”.
2. Jim Gray: Notes on data base operating systems. In R. Bayer, R.
M. Graham, G. Seegmüller (eds): Operating systems – an advanced
course. LNCS 60: 393–481, Springer-Verlag (1978).
3. Goetz Graefe: Write-optimized B-trees. VLDB 2004: 672–683.
4. Goetz Graefe, Wey Guy, Caetano Sauer: Instant recovery with
write-ahead logging: page repair, system restart, and media restore.
Synthesis Lectures on Data Management, Morgan & Claypool Pub-
lishers (2014).
5. Goetz Graefe, Hideaki Kimura, Harumi A. Kuno: Foster B-trees.
ACM TODS 37(3): 17 (2012).
6. Goetz Graefe, Harumi A. Kuno, Bernhard Seeger: Self-diagnosing
and self-healing indexes. DBTest 2012: 8.
7. Eliezer Levy: Incremental restart. ICDE 1991: 640–648.
8. Tobin J. Lehman, Michael J. Carey: A recovery algorithm for
a high-performance memory-resident database system. ACM SIG-
MOD 1987: 104–117.
9. Eliezer Levy, Abraham Silberschatz: Incremental recovery in main
memory database systems. IEEE TKDE 4(6): 529–540 (1992).
10. Rick Long, Mark Harrington, Robert Hain, Geoff Nicholls: IMS
primer. IBM Redbooks, January 2000 – Section 15.4.2 “Database
change accumulation utility (DFSUCUM0)”.
11. C. Mohan, Donald J. Haderle, Bruce G. Lindsay, Hamid Pirahesh,
Peter M. Schwarz: ARIES: a transaction recovery method support-
ing fine-granularity locking and partial rollbacks using write-ahead
logging. ACM TODS 17(1): 94–162 (1992).
12. Ismail Oukid, Wolfgang Lehner, Thomas Kissinger, Thomas Will-
halm, Peter Bumbulis: Instant recovery for main memory databases.
CIDR 2015.