Abstract
BTRFS is a Linux filesystem; it has been adopted as the default filesystem in some popular Linux distributions. It is based on copy-on-write, allowing for efficient snapshots and clones. It uses b-trees as its main on-disk data structure. The design goal is to work well for many use cases and workloads. To this end, much effort has been directed to maintaining even performance as the filesystem ages, rather than trying to support a particular narrow benchmark use case.
Linux filesystems are installed on smartphones as well as enterprise
servers. This entails challenges on many different fronts.
•Scalability: The filesystem must scale in many dimensions: disk
space, memory, and CPUs.
•Data integrity: Losing data is not an option, and much effort
is expended to safeguard the content. This includes checksums,
metadata duplication, and RAID support built into the filesys-
tem.
•Disk diversity: The system should work well with SSDs and hard disks. It is also expected to be able to use an array of different-sized disks, posing challenges to the RAID and striping mechanisms.
This paper describes the core ideas, data-structures, and algorithms
of this filesystem. It sheds light on the challenges posed by defrag-
mentation in the presence of snapshots, and the tradeoffs required to
maintain even performance in the face of a wide spectrum of workloads.
BTRFS: The Linux B-tree Filesystem
Ohad Rodeh, Josef Bacik, Chris Mason
1 Introduction
BTRFS is an open source filesystem that has seen extensive development
since its inception in 2007. It is jointly developed by Fujitsu, Fusion-IO, Intel, Oracle, Red Hat, Strato, SUSE, and many others. It is slated to become the next major Linux filesystem. Its main
features are:
1. CRCs maintained for all metadata and data
2. Efficient writeable snapshots, clones as first class citizens
3. Multi-device support
4. Online resize and defragmentation
5. Compression
6. Efficient storage for small files
7. SSD optimizations and TRIM support
The design goal is to work well for a wide variety of workloads, and to
maintain performance as the filesystem ages. This is in contrast to storage
systems aimed at a particular narrow use case. BTRFS is intended to be
used by mainstream Linux distributions; it is expected to work well on
systems as small as a smartphone, and as large as an enterprise production
server. As such, it must work well on a wide range of hardware.
The filesystem on disk layout is a forest of b-trees, with copy-on-write
(COW) as the update method. Disk blocks are managed in extents, with
checksumming for integrity, and reference counting for space reclamation.
BTRFS is unique among filesystems in its use of COW friendly b-trees [26]
and reference counting.
Filesystem performance relies on the availability of long contiguous ex-
tents. However, as the system ages, space becomes increasingly fragmented,
requiring online defragmentation. Due to snapshots, disk extents are poten-
tially pointed to by multiple filesystem volumes. This makes defragmentation
challenging because (1) extents can only be moved after all source pointers
are updated and (2) file contiguity is desirable for all snapshots.
To make good use of modern CPUs, good concurrency is important.
However, with copy-on-write this is difficult, because all updates ripple up
to the root of the filesystem.
This paper shows how BTRFS addresses these challenges, and achieves
good performance. Compared with conventional filesystems that update
files in place, the main workload effect is to make writes more sequential,
and reads more random.
It is important to understand that BTRFS is a work in progress. Fea-
tures, performance improvements, and bug fixes are contributed with every
minor kernel release. While we believe that it holds great promise, we also
think that a lot remains to be accomplished. There has been a lot of work
on the RAID subsystem, which now supports mirroring, striping, and mir-
roring+striping (RAID levels {0,1,10}). Code to support RAID-5/6 has
just recently been added. Fragmentation is dealt with in a number of ways;
however, we are not yet ready to say that the optimal solution has been
found. Flash optimization is an ongoing project, and we report on some of
this work here; however, we cannot claim that our system is the optimal
flash storage solution.
The approach taken in this paper is to explain the core concepts and
intuitions through examples and diagrams. The reader interested in finer
grain details can find the filesystem code publicly available from the Linux
kernel archives, and low level discussions in the kernel mailing list [7].
This paper is structured as follows: Section 2 describes related filesys-
tems. Section 3 describes basic terminology, presents the b-trees used to hold
metadata, and shows the fundamentals of copy-on-write updates. Section 4
is about the use of multiple devices, striping, mirroring, and RAID. Sec-
tion 5 describes defragmentation, which is important for maintaining even
filesystem performance. Section 6 talks about performance, and Section 7
summarizes.
2 Related work
On Linux, there are three popular filesystems, Ext4 [3], XFS [4, 8], and
BTRFS [9]. In the class of copy-on-write filesystems, two important con-
temporary systems are ZFS [34, 18], and WAFL [12, 20]. In what follows, we
use the term overwrite based filesystem to refer to systems that update files
in place. At the time of writing, this is the prevalent architectural choice.
BTRFS development started in 2007, by C. Mason. He combined ideas
from ReiserFS [15], with COW friendly b-trees suggested by O. Rodeh [26],
to create a new Linux filesystem. Today, this project has many contributors, some of them from commercial companies, and it is on its way to becoming the default Linux filesystem. As development started only in 2007, BTRFS is less mature and stable than the other filesystems listed here.
The Fourth Extended Filesystem (EXT4) [3] is a mostly backward com-
patible extension to the previous general purpose Linux filesystem, Ext3.
It was created to address filesystem and file size limitations, and improve
performance. Initially, Linux kernel developers improved and modified Ext3
itself, however, in 2006, Ext4 was forked in order to segregate development
and changes in an experimental branch. Today, Ext4 is the default Linux
filesystem for some common Linux distributions. As it is an in-place re-
placement for Ext3, older filesystems can seamlessly be upgraded. Ext4 is
an overwrite-based filesystem that manages storage in extents. It uses an
efficient tree-based index to represent files and directories. A write-ahead
journal is used to ensure operation atomicity. Checksumming is performed
on the journal, but not on user data. Snapshots are not supported internally,
rather, the underlying volume-manager provides that functionality.
XFS [8] is a filesystem originally developed by SGI. Development
started in 1993, for the IRIX operating system. In 2000, it was ported to
Linux, and made available on GNU/Linux distributions. The design goal
of XFS is to achieve high scalability in terms of IO threads, number of disks, and file/filesystem size. It is an overwrite-class filesystem that uses b-trees of extents to manage disk space. A journal is used to ensure metadata
operation atomicity. Snapshots are not supported; an underlying volume-
manager is expected to support that operation.
ReiserFS [15] is a general purpose Linux filesystem, which inspired some
of the BTRFS architecture and design. It was built by Hans Reiser and a
team of engineers at Namesys. It was the first journaled filesystem to be
included in the standard Linux kernel, and it was the default filesystem on
many Linux distributions for a number of years. ReiserFS uses a single tree
to hold the entire filesystem, instead of separate trees per file and directory.
In order to reduce internal fragmentation, tail packing is implemented. The
main idea is to pack the tail, the last partial block, of multiple files into a
single block.
LFS [23] introduced the concept of log structured storage in the early
1990s. The main idea is to write all modifications to disk in large chunks,
called segments. The filesystem can be thought of as a large tree that has a
well known root, but whose data is spread across the live segments. When
disk space runs low, a segment cleaner background process kicks in. It frees
segments by merging several partly full segments, into a smaller number of
full segments. In terms of crash recovery, the live segments can be thought
of as a log that could be replayed, reproducing the filesystem state immedi-
ately prior to the crash. In order to reduce recovery time, the system takes
periodic checkpoints. This limits replay to the last checkpoint plus every
new segment since. Generally speaking, as long as overall disk utilization
was below 80-90%, write performance was very good compared to contem-
porary filesystems, while read performance was comparable. The Achilles
heel of LFS is a random write workload, followed by a sequential scan. Note
that BTRFS, in contrast to LFS, can work well even in the absence of long
free segments.
NILFS [31] is a modern implementation of an LFS inside the Linux ker-
nel. It is developed by NTT Japan as an open-source project. The
first version was released in 2005, and development has been ongoing since.
NILFS brings efficient b-tree indexes to the filesystem tree, efficient support for read-only snapshots, and fast recovery from crashes. The snapshot and crash recovery implementation hinges on the idea that the filesystem can be thought of as a long log, stretching from filesystem creation to present
day. Data can be discarded only if it was overwritten, or if the snapshot it
belonged to was discarded. Recovery is implemented by moving to the lat-
est checkpoint, and then rolling forward. The main challenge for NILFS is
writing a good garbage collector that takes into account snapshots. NILFS
is a work in progress; features like fsck, online defragmentation, writeable
snapshots, and quotas remain to be implemented.
WAFL [12] is the filesystem used in the NetApp commercial file
server; development started in the early 1990’s. It is a copy-on-write filesys-
tem that is especially suited for NFS [6, 33] and CIFS [16] workloads. It
uses NVRAM to store an operation log, and supports recovery up to the
last acknowledged operation. This is important for supporting low-latency
file write operations; NFS write semantics are that an operation is persis-
tent once the server sends an acknowledgment to the client. WAFL manages
space in 4KB blocks, and indexes files using a balanced tree structure. Snap-
shots are supported, as well as RAID. Free-space is managed using a form
of bitmaps. WAFL is a mature and feature-rich filesystem.
2.1 ZFS
ZFS [18] is a copy-on-write filesystem originally developed by Sun for
its Solaris operating system. Development started in 2001, with the goal of
replacing UFS, which was reaching its size limitations. ZFS was incorporated
into Solaris in 2005. The ZFS creators, Jeff Bonwick and Matthew Ahrens,
saw several problems in the filesystems that existed on Unix platforms at
the time:
•Data would get corrupted on disk without either the filesystem or the underlying storage system detecting it. This would eventually cause bad data to be delivered to the application.
•Storage was split into many subcomponents that could not be man-
aged as a single unit. For example, to construct a filesystem from
many disks, one had to use a logical volume manager, build a virtual
disk, and then lay a filesystem on top of it.
•There were many limitations on volume size, filesystem size, disk size, maximal number of inodes, etc.
•A growing disparity between disk latency and bandwidth, which was not compensated for by new filesystem algorithms.
They decided to base their new system on the following engineering
principles:
•Pooled storage: all the disks are managed as one large storage pool.
There is no separate volume manager.
•Tree of blocks: the filesystem logically looks like a large tree of blocks. Updates are performed with copy-on-write and ripple up the tree. In order to commit a set of changes, all changes are written off to the side, and finally, the root is overwritten in an atomic step.
•Checksums: the tree has data at its leaves; internal nodes contain pointers. The pointers contain a checksum of the node they point
to. This allows online verification, as one traverses down the tree.
•Simple administration: this can now be done, as volume management
is integrated into the filesystem.
ZFS uses an interesting RAID algorithm called RAID-Z. Traditional RAID-4 [13] uses a configuration like 4 + P: there are four data disks, D1 through D4, and a single parity disk P. Figure 1 shows an example of how pages are laid out. Three rows are presented; each row consists of four data pages and one parity page. The xor of the data pages in a row is equal to the parity page: d_i1 ⊕ d_i2 ⊕ d_i3 ⊕ d_i4 = p_i5.
disk     D1    D2    D3    D4    P
row 1    d11   d12   d13   d14   p15
row 2    d21   d22   d23   d24   p25
row 3    d31   d32   d33   d34   p35

Figure 1: The first three stripes in a RAID-4 configuration with five disks. Each row is a stripe, with the xor of the data pages equal to the parity page.
Each row is also called a full stripe. As long as data is written to disk in
full stripes, performance is good. However, updating a page in place requires
a costly read-modify-write operation. For example, to update d11 we need
to:
1. Read the old values of d11 and p15
2. Compute the new parity: p15' = d11' ⊕ d11 ⊕ p15
3. Write the new values d11' and p15'
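To make the parity arithmetic concrete, here is a minimal sketch of this read-modify-write update for one 4KB page; the function and buffer names are illustrative assumptions, not ZFS or BTRFS code.

#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE 4096

/* Recompute the parity page after overwriting one data page:
 * new_parity = old_parity xor old_data xor new_data, byte by byte. */
static void raid4_update_parity(const uint8_t *old_data,
                                const uint8_t *new_data,
                                uint8_t *parity)
{
    for (size_t i = 0; i < PAGE_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];
}

A full-stripe write avoids the two reads entirely, which is what RAID-Z's variable stripe width is designed to guarantee.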
An additional problem with RAID4 is the write hole problem. If a partial
update is written to disk, and power is lost, then an entire row becomes
garbage. This is resolved with NVRAM in hardware based RAID controllers,
and an additional log in software RAID.
In order to have updates that are always full stripe, RAID-Z uses an
adaptive stripe size. Consider a situation with five disks. In order to tolerate a single disk fault, RAID-Z will always add a parity page.
This is done at different stripe widths, depending on IO size. For example,
Figure 2 shows a series of writes, and Figure 3 shows the resulting disk layout.
IO size   data pages             parity page   configuration
4KB       d11                    p12           1+P
12KB      d13, d14, d15          p21           3+P
8KB       d22, d23               p24           2+P
16KB      d25, d31, d32, d33     p34           4+P

Figure 2: A series of IOs and their RAID-Z stripes. Note that different configurations are created for different IO sizes.
disk     D1    D2    D3    D4    D5
row 1    d11   p12   d13   d14   d15
row 2    p21   d22   d23   p24   d25
row 3    d31   d32   d33   p34

Figure 3: Layout of data and parity on disk, resulting from the IOs listed in Figure 2. Note that data and parity are spread more or less evenly across the disks.
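The adaptive stripe sizing can be sketched as follows; this is only an illustration of the idea under the assumption of single parity and a fixed 4KB page size, not the actual ZFS allocator, which also handles padding and double parity.

#include <stdio.h>
#include <stddef.h>

/* Plan the stripes for an IO of `io_pages` 4KB pages on `ndisks` disks:
 * each stripe gets at most (ndisks - 1) data pages plus one parity page,
 * so small IOs simply produce narrower stripes. */
static void raidz_plan_io(size_t io_pages, size_t ndisks)
{
    while (io_pages > 0) {
        size_t data = io_pages < ndisks - 1 ? io_pages : ndisks - 1;
        printf("stripe: %zu data page(s) + 1 parity (%zu+P)\n", data, data);
        io_pages -= data;
    }
}

int main(void)
{
    /* The IO sizes from Figure 2, in 4KB pages, on five disks. */
    size_t ios[] = { 1, 3, 2, 4 };
    for (size_t i = 0; i < sizeof(ios) / sizeof(ios[0]); i++)
        raidz_plan_io(ios[i], 5);
    return 0;
}

Running this on the IO sizes of Figure 2 reproduces the 1+P, 3+P, 2+P, and 4+P configurations shown there.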
RAID-Z has extensions to support double fault tolerance. In order
to resolve the write-hole problem, ZFS performs only copy-on-write updates; this allows committing a set of changes atomically without using expensive
NVRAM or additional logging.
In order to support ZFS snapshots [5], block-pointers are enhanced with
the child's birth time. This is not an actual wall clock time, but rather the trans-
action ID used to write that block to disk. Having the file-system tree
decorated with birth times allows efficient snapshot creation and deletion.
It also allows creating writeable snapshots (clones). An important limita-
tion, however, is that a clone is not a first class citizen; it cannot be cloned
again.
While ZFS and BTRFS have a lot in common, they also differ substan-
tially.
•Goals: ZFS is intended for an enterprise-class Solaris server, while BTRFS is intended to run on (almost) all things running Linux.
•Filesystem check fsck: ZFS does not have an fsck utility. If the
RAID and checksum algorithms are insufficient, the ZFS user is ex-
pected to restore from external backup. By contrast, Linux is installed
on many desktops, and smartphones, that do not have enterprise grade
backups. Linux users have therefore come to expect fsck utilities. Un-
fortunately, writing a good recovery tool for BTRFS is very hard, as
the data can move around on disk, and there is no fixed location for
metadata (except for the superblock).
•Blocks vs. extents: ZFS uses blocks that are powers of two: 4KB,
8KB, ..., 128KB. Each object has a fixed block size. The RAID-Z code works well with fixed-size blocks; however, since there are several different block sizes, the system could run out of a particular size. This requires
a procedure called ganging where larger blocks are created from smaller
blocks. Since there might not be contiguous sub-blocks, the system
can very well get into trouble. On BTRFS, extents are used instead
of blocks. This eliminates the need for special block sizes. However, if
the system is left with a highly fragmented free space, writes to disk
are still going to be inefficient.
•Checksums: both ZFS and BTRFS calculate checksums. However,
ZFS keeps them in the block-pointers, whereas BTRFS keeps data
checksums in a separate tree. This is necessary for BTRFS, since
extents could be very large, and we would like to be able to validate
each page separately.
•Snapshots vs. clones: ZFS uses birth-times to build snapshots.
BTRFS uses the reference-counting mechanism instead. The result is
very similar, however, the BTRFS mechanism is more general, in that
it supports clones as first class citizens.
•RAID: ZFS uses RAID-Z, and supports several RAID levels includ-
ing single/dual parity (RAID {5,6}). BTRFS uses something closer
to a standard RAID layout; support is currently available only for
mirroring, striping, and mirroring+striping (RAID levels {0,1,10}).
•Back references: ZFS does not support back-references, while BTRFS
does. ZFS is assumed to run in a big server with many disks, and it
can be assumed that redundancy through RAID-Z is implemented.
This means that ZFS can recover bad blocks through reconstruction; failing that, there is backup. BTRFS cannot assume underlying RAID
and multiple devices, hence, it must be able to cope with bad blocks
by other means.
•Deduplication: ZFS supports deduplication, although this comes at
a very significant memory cost. BTRFS does not support this feature
at the moment. Due to the memory requirements, it might be a feature
only fit for high end servers.
2.2 Log Structured Merge Trees
Log-Structured Merge Trees (LSM-trees) [29] are an on-disk data structure designed to support high update rates. An in-memory table maintains the
latest set of updates. When it fills, it is sorted and flushed to disk as a
sub-table. Periodically, the on-disk tables are compacted. This kind of data
structure is now used by Google in its BigTable implementation [14, 19], by Yahoo [32] in its PNUTS platform, and by others. It is useful where the
update rate is high. An additional constraint is that read accesses should
mostly be point accesses, and not range queries. This is because point
accesses to the wrong sub-table can easily be discarded using bloom-filters.
At this point, it is hard to see how LSMs could serve as the main filesys-
tem data structure. A general purpose filesystem must provide excellent
support for random read workloads. In addition, reading from a file maps
to a query for allocated pages in a range, not to a point query. One possibility is auxiliary use; for example, [28] uses LSM trees to track back references
in a filesystem. Another possibility is presented in [22], where only small
files are stored in the LSM.
Recently published work [30] shows how to combine the benefits of b-trees and LSM trees, while avoiding most of the pitfalls.
It remains to be seen what role LSM trees will play in filesystem layouts.
It is a topic that is currently drawing a lot of attention.
3 Fundamentals
Filesystems support a wide range of operations and functionality. A full
description of all the BTRFS options, use cases, and semantics would be
prohibitively long. The focus of this work is to explain the core concepts,
and we limit ourselves to the more basic filesystem operations: file cre-
ate/delete/read/write, directory lookup/iteration, snapshots and clones. We
also discuss data integrity, crash recovery, and RAID. Our description re-
flects the filesystem at the time of writing.
The following terminology is used throughout:
•Page, block: a 4KB contiguous region on disk and in memory. This
is the common page size for Intel architectures. On other architec-
tures, for example PowerPC (PPC), the page size could be 8KB, or
even larger. To simplify the discussion, we assume the page size here
is 4KB.
•Extent: A contiguous on-disk area. It is page aligned, and its length
is a multiple of pages.
•Copy-on-write (COW): creating a new version of an extent or a
page at a different location. Normally, the data is loaded from disk
to memory, modified, and then written elsewhere. The idea is not
to update the original location in place, risking a power failure and
partial update.
3.1 COW Friendly B-trees
COW friendly b-trees are central to the BTRFS data-structure approach.
For completeness, this section provides a recap of how they work. For a
full account, the interested reader is referred to: [26, 24, 25, 27]. The main
idea is to use standard b+-tree construction [11], but (1) employ a top-down
update procedure, (2) remove leaf-chaining, (3) use lazy reference-counting
for space management.
For purposes of this discussion, we use trees with short integer keys, and
no actual data items. The b-tree invariant is that a node can maintain 2 to
5 elements before being split or merged. Tree nodes are assumed to take up
exactly one page. Unmodified pages are rectangle shaped with an orange
border, and COWed pages are hexagon shaped with a green border.
Figure 4(a) shows an initial tree with two levels. Figure 4(b) shows an
insert of new key 19 into the rightmost leaf. A path is traversed down the
tree, and all modified pages are written to new locations, without modifying
the old pages.
Figure 4: (a) A basic b-tree (b) Inserting key 19, and creating a path of
modified pages.
In order to remove a key, copy-on-write is used. Remove operations
do not modify pages in place. For example, Figure 5 shows how key 6 is
removed from a tree. Modifications are written off to the side, creating a
new version of the tree.
Figure 5: (a) A basic tree (b) Deleting key 6.
In order to clone a tree, its root node is copied, and all the child pointers
are duplicated. For example, Figure 6 shows a tree Tp that is cloned to tree
Tq. Tree nodes are denoted by symbols. As modifications are applied to
Tq, sharing will be lost between the trees, and each tree will have its own
view of the data. Cloning a node is very similar to COW, except that the
original node is not marked for erasure.
Figure 6: Cloning tree Tp. A new root Q is created, initially pointing to the same nodes as the original root P. As modifications are applied, the trees will diverge.
Since tree nodes are reachable from multiple roots, garbage collection is
needed for space reclamation. In practice, file systems are directed acyclic
graphs (DAGs). There are multiple trees with shared nodes, but there are no
cycles. Therefore, reference counters (ref-counts) can be and are used to track
how many pointers there are to tree nodes. A ref-count for a node records
how many direct pointers there are to it. Once the counter reaches zero,
a block can be reused. The ref-counts are not stored in the nodes, rather,
the system keeps them stored persistently in a separate data structure (see
the extent tree in Sections 3.4 and 3.5). This allows modifying a node's counter,
without having to modify the node itself. In order to keep track of ref-counts,
the copy-on-write mechanism is modified. Whenever a node is COWed, the
ref-count for the original is decremented, and the ref-counts for the children
are incremented. Figure 7 shows an example of a clone operation, with a
ref-count indication. The convention is that diamond-like nodes with red borders are unchanged except for their ref-count. Tree Tp is cloned, causing
a new copy of the root to be created, and an increment in the ref-counts of
the immediate children {B, C}.
Figure 7: Cloning tree Tp. A new root Q is created, initially pointing to
the same nodes as the original root P. The ref-counts for the immediate
children are incremented. The grandchildren remain untouched.
Figure 8 shows an example of inserting a key into leaf H of tree Tq. The nodes on the path from Q to H are {Q, C, H}; they are all modified and COWed.
(a) Initial trees Tp and Tq; (b) shadow Q; (c) shadow C; (d) shadow H.
Figure 8: Inserting a key into node H of tree Tq. The path from Q to H includes nodes {Q, C, H}; these are all COWed. Sharing is broken for nodes C and H; the ref-count for C is decremented.
Figure 9 shows an example of a tree delete. The algorithm used is a recursive tree traversal, starting at the root. For each node N:
•ref-count(N) > 1: Decrement the ref-count and stop the downward traversal. The node is shared with other trees.
•ref-count(N) == 1: The node belongs only to Tq. Continue the downward traversal and deallocate N.
(a) Initial trees Tp and Tq; (b) deleting Tq; (c) removing unallocated nodes.
Figure 9: Deleting a tree rooted at node Q. Nodes {X, Z}, reachable solely
from Q, are deallocated. Nodes {C, Y }, reachable also through P, have their
ref-count reduced from 2 to 1.
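The deletion walk can be summarized in a short sketch. The in-memory node structure below is a simplified stand-in for illustration; in BTRFS the ref-counts actually live in the extent tree (Section 3.5), not inside the nodes themselves.

#include <stdlib.h>

/* Simplified in-memory node; real BTRFS keeps the ref-count in the extent
 * tree, on disk, rather than inside the node. */
struct node {
    int           refcount;
    int           nr_children;
    struct node **children;    /* NULL for leaf nodes */
};

/* Delete the tree rooted at `n`, as in Figure 9: a shared node only loses
 * one reference; an exclusively owned node is recursed into and freed. */
void tree_delete(struct node *n)
{
    if (n == NULL)
        return;
    if (n->refcount > 1) {
        n->refcount--;                 /* still reachable from another tree */
        return;
    }
    for (int i = 0; i < n->nr_children; i++)
        tree_delete(n->children[i]);
    free(n->children);
    free(n);
}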
3.2 Checkpoints
The filesystem comprises a forest of trees that are all modified using copy-
on-write. Updates are accumulated in memory, and written out in atomic
checkpoints. This process is fully explained in section 3.4. Checkpoints
are labeled with a monotonically increasing generation number, which is
embedded in various on disk data structures. This is a logical timer that can
be used for validation, or difference calculations. Note that some operations,
such as fsync have special handling, so as to avoid full checkpoints; see
Section 3.6.
3.3 Filesystem B-tree
The BTRFS b-tree is a generic data-structure that knows only about three
types of data structures: keys, items, and block headers. The block header
is fixed size and holds fields like checksums, flags, filesystem ids, generation
number, etc. A key describes an object address using the structure:
struct btrfs_key {
    u64 objectid;
    u8  type;
    u64 offset;
};
An item is a btrfs_key with additional offset and size fields:
struct btrfs_item {
    struct btrfs_key key;
    u32 offset;
    u32 size;
};
Internal tree nodes hold only [key, block-pointer] pairs. Leaf nodes
hold arrays of [item, data] pairs. Item data is variable sized. A leaf stores
an array of items in the beginning, and a reverse sorted data array at the
end. These arrays grow towards each other. For example, Figure 10 shows a leaf with three items {I0, I1, I2} and three corresponding data elements {D2, D1, D0}.
block header | I0 | I1 | I2 | free space | D2 | D1 | D0
Figure 10: A leaf node with three items. The items are fixed size, but the
data elements are variable sized.
Item data is variably sized, and various filesystem data structures are
defined as different types of item data. The type field in the key indicates
the type of data stored in the item.
The filesystem is composed of objects, each of which has an abstract
64bit object id. When an object is created, a previously unused object id is
chosen for it. The object id makes up the most significant bits of the key,
allowing all of the items for a given filesystem object to be logically grouped
together in the b-tree. The type field describes the kind of data held by an
item; an object typically comprises several items. The offset field describes
data held in an extent.
Figure 11 shows a more detailed schematic of a leaf node.
header | item 0 | item 1 | ... | item N | free space | data N | ... | data 1 | data 0
(each item holds a btrfs_key plus __le32 offset and __le32 size fields; the key holds __le64 objectid, u8 type, and __le64 offset; the block header holds the checksum, fsid, flags, generation, and other fields)
Figure 11: A detailed look at a generic leaf node holding keys and items.
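As a concrete illustration of this layout, the sketch below locates the data of item i inside a leaf. The structures are simplified stand-ins for the on-disk format (which uses little-endian types and carries more header fields), and the offset is taken relative to the start of the leaf block for simplicity.

#include <stdint.h>

struct btrfs_key  { uint64_t objectid; uint8_t type; uint64_t offset; };
struct btrfs_item { struct btrfs_key key; uint32_t offset; uint32_t size; };

struct leaf {
    /* block header omitted for brevity */
    uint32_t          nritems;
    struct btrfs_item items[];   /* the item array grows forward; the data
                                  * area grows backward from the end of the
                                  * 4KB block, as in Figure 10 */
};

/* Return a pointer to the (variably sized) data of item `i`: the item's
 * offset field locates its data element inside the leaf block. */
static inline void *leaf_item_data(struct leaf *l, uint32_t i)
{
    return (uint8_t *)l + l->items[i].offset;
}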
Inodes are stored in an inode item at offset zero in the key, and have a
type value of one. Inode items are always the lowest valued key for a given
object, and they store the traditional stat data for files and directories. This
includes properties such as size, permissions, flags, and a count of the number
of links to the object. The inode structure is relatively small, and will not
contain embedded file data, extended attribute data, or extent mappings.
These things are stored in other item types.
Small files that occupy less than one leaf block may be packed into the
b-tree inside the extent item. In this case the key offset is the byte offset of
the data in the file, and the size field of the item indicates how much data
is stored. There may be more than one of these per file.
Larger files are stored in extents. These are contiguous on-disk areas
that hold user-data without additional headers or formatting. An extent-
item records a generation number indicating when the extent was created, a [disk block, length] pair to record the area on disk, and a [logical
file offset, length] pair to record the file area. An extent is a mapping
from a logical area in a file, to a physical area on disk. If a file is stored in a
small number of large extents, then a common operation such as a full file
read will be efficiently mapped to a few disk operations. This explains the
attractiveness of extents.
Some filesystems use fixed size blocks instead of extents. Using an extent
representation is much more space efficient, however, this comes at the cost
of more complexity. For example, examine a file with data in the range 0 -
64KB, where the file is laid out optimally as one contiguous disk area. With
a 4KB block representation, we will need 16 pointers to represent it. If we
overwrite the area 8-16KB, we will still need 16 pointers to represent the file.
With an extent representation, the file will initially consist of exactly one
64KB extent. The overwrite operation, however, muddies the water. The 8-
16KB area must be written elsewhere, due to copy-on-write considerations.
This splits the file into three extents: 0 - 8KB, 8-16KB, 16-64KB.
A directory holds elements of type dir item. A dir item is a structure
comprising a file-name (string) and a btrfs key specifying the location of
the object. The directory contains two sorted lists of dir items. The first
list holds all mappings, sorted by filename hash; it is used to satisfy path
lookups. The second list is a copy of the first, but sorted by inode sequence
number. It is used for bulk directory operations. The inode sequence num-
ber is stored in the directory, and is incremented every time a new file or
directory is added. It approximates the on-disk order of the underlying file
inodes, and thus saves disk seeks when accessing them. Bulk performance
is important for operations like backups, copies, and filesystem validation.
For example, fsck uses the iteration index to gain speed improvements as
it needs to scan through the entire metadata with as few seeks as possible.
Figure 12 shows an example directory structure containing a root and
three files: player.c, player and doc.txt. The files have corresponding
inode numbers: 259, 260, and 261. The root directory holds the files sorted
by hash; these are the DIR_ITEMs, whose offset field holds the filename hash. The file names are also sorted by inode creation time; these are the DIR_INDEX items, whose offset field contains the inode sequence number.
The files themselves consist of an inode element, of which we show only
the size attribute, and extents pointing to data. The doc.txt file is small
enough to be allocated inside the metadata node, while the other files have
extents allocated externally. For example, in order to sequentially read file
player.c, the top level directory is searched for the hash of player.c.
This is achieved by a b-tree search for a DIR_ITEM with inode=256 and
offset=5012. The resulting inode is 259, and then the extents for inode
259 can be read sequentially.
inode 256 (the root directory):
    key type        offset   data
    INODE_ITEM      0        size=4KB
    DIR_ITEM        2234     name="doc.txt"
    DIR_ITEM        5012     name="player.c"
    DIR_ITEM        6901     name="player"
    DIR_INDEX       1        name="player.c"
    DIR_INDEX       2        name="player"
    DIR_INDEX       3        name="doc.txt"

inode 259 (player.c):
    INODE_ITEM      0        size=64KB
    EXTENT_DATA     0
    EXTENT_DATA     16KB

inode 260 (player):
    INODE_ITEM      0        size=196KB
    EXTENT_DATA     0
    EXTENT_DATA     32KB
    EXTENT_DATA     128KB

inode 261 (doc.txt):
    INODE_ITEM      0        size=529
    INLINE_EXTENT   0        data="..."
Figure 12: A directory structure with three files: player.c, player, and
doc.txt. The files have corresponding inode numbers: 259, 260, and 261.
The filesystem tree holds a directory with a double mapping, and three file
elements. The doc.txt file is small enough to be inlined, while the other
files have external extents.
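To make the lookup concrete, the sketch below builds the search key used to resolve a name in a directory, as described above: hash the name and look for a DIR_ITEM under the directory's object id. The hash function and the type constant are illustrative stand-ins; BTRFS actually uses a CRC32C-based name hash and its own item type values.

#include <stdint.h>

struct btrfs_key { uint64_t objectid; uint8_t type; uint64_t offset; };

enum { DIR_ITEM = 1 };               /* illustrative type value */

/* Toy name hash standing in for the CRC32C hash the kernel uses. */
static uint64_t name_hash(const char *name)
{
    uint64_t h = 5381;
    while (*name)
        h = h * 33 + (uint8_t)*name++;
    return h;
}

/* Build the key for resolving `name` inside directory inode `dir_ino`,
 * e.g. "player.c" in inode 256 of Figure 12; the filesystem b-tree is then
 * searched for this key, and the matching DIR_ITEM yields the file's inode. */
struct btrfs_key dir_lookup_key(uint64_t dir_ino, const char *name)
{
    struct btrfs_key key = {
        .objectid = dir_ino,          /* the directory's object id      */
        .type     = DIR_ITEM,         /* item type: directory entry     */
        .offset   = name_hash(name),  /* the key offset holds the hash  */
    };
    return key;
}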
3.4 A Forest
A filesystem is constructed from a forest of trees. A superblock located at
a fixed disk location is the anchor. It points to a tree of tree roots, which
indexes the b-trees making up the filesystem. The trees are:
•Sub-volumes: store user visible files and directories. Each sub-
volume is implemented by a separate tree. Sub-volumes can be snap-
shotted and cloned, creating additional b-trees. The roots of all sub-
volumes are indexed by the tree of tree roots.
•Extent allocation tree: tracks allocated extents in extent items,
and serves as an on-disk free-space map. All back-references to an
extent are recorded in the extent item. This allows moving an extent
if needed, or recovering from a damaged disk block. Taken as a whole,
back-references multiply the number of filesystem disk pointers by two.
For each forward pointer, there is exactly one back-pointer. See more
details on this tree below.
•Checksum tree: holds a checksum item per allocated extent. The
item contains a list of checksums per page in the extent.
•Chunk and device trees: indirection layer for handling physical
devices. Allows mirroring/striping and RAID. Section 4 shows how
multiple device support is implemented using these trees.
•Reloc tree: for special operations involving moving extents. Section 5
describes how the reloc-tree is used for defragmentation.
For example, Figure 13(a) shows a high-level view of the structure of
a particular filesystem. The reloc and chunk trees are omitted for simplic-
ity. Figure 13(b) shows the changes that occur after the user wrote to the
filesystem.
Figure 13: (a) A filesystem forest. (b) The changes that occur after modifi-
cation; modified pages are colored green.
Modifying user-visible files and directories causes page and extent up-
dates. These ripple up the sub-volume tree until its root. Changes also occur
to extent allocation, ref-counts, and back-pointers. These ripple through the
extent tree. Data and metadata checksums change; these updates modify the checksum tree leaves, causing modifications to ripple up. All these tree
modifications are captured at the top most level as a new root in the tree of
tree roots. Modifications are accumulated in memory; after a timeout, or once enough pages have changed, they are written in batch to new disk locations, forming a checkpoint. The default timeout is 30 seconds. Once the checkpoint
has been written, the superblock is modified to point to the new checkpoint;
this is the only case where a disk block is modified in place. If a crash occurs,
the filesystem recovers by reading the superblock, and following the point-
ers to the last valid on-disk checkpoint. When a checkpoint is initiated, all
dirty memory pages that are part of it are marked immutable. User updates
received while the checkpoint is in flight cause immutable pages to be re-
COWed. This allows user visible filesystem operations to proceed without
damaging checkpoint integrity. Checkpoints are numbered by a monoton-
ically increasing generation number, which is embedded in various on disk
data-structures.
Sub-volume trees can be snapshotted and cloned, and they are therefore ref-counted. All other trees keep metadata per disk range, and they are never snapshotted. Reference counting is unnecessary for them.
A filesystem update affects many on-disk structures. For example, a
4KB write into a file changes the file i-node, the file-extents, checksums, and
back-references. Each of these changes causes an entire path to change in its
respective tree. If users performed entirely random updates, this would be
very expensive for the filesystem. Fortunately, user behavior normally has
a lot of locality. If a file is updated, it would be updated with lots of new
data; files in the same directory have a high likelihood of co-access. This
allows coalescing modified paths in the trees. Nonetheless, worst cases are
considered in the filesystem code. Tree structure is organized so that file
operations normally modify single paths. Large scale operations are broken
into parts, so that checkpoints never grow too large. Finally, special block
reservations are used so that a checkpoint will always have a home on disk,
guaranteeing forward progress.
Using copy-on-write as the sole update strategy has pros and cons.
The upside is that it is simple to guarantee operation atomicity, and data-
structure integrity. The downside is that performance relies on the ability
to maintain large extents of free contiguous disk areas. In addition, ran-
dom updates to a file tend to fragment it, destroying sequentiality. A good
defragmentation algorithm is required; this is described in section 5.
Checksums are calculated at the point where a block is written to disk.
At the end of a checkpoint, all the checksums match, and the checksum at
the root block reflects the entire tree. Metadata nodes record the generation
number when they were created. This is the serial number of their check-
point. B-tree pointers store the expected target generation number; this allows detection of phantom or misplaced writes on the media. Generation
numbers and checksums serve together to verify on disk block content.
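The combination of checksums and generation numbers can be sketched as a simple validation check; the field names below are illustrative, not the exact on-disk format.

#include <stdint.h>
#include <stdbool.h>

/* A parent's pointer records the generation it expects the child to carry;
 * the child's header records the generation it was actually written with,
 * plus a checksum over its contents. */
struct block_ptr    { uint64_t blocknr; uint64_t expected_gen; };
struct block_header { uint64_t generation; uint32_t csum; };

/* A stale, phantom, or misplaced block shows up as a generation mismatch
 * even when its checksum is internally consistent. */
static bool block_is_valid(const struct block_ptr *ptr,
                           const struct block_header *hdr,
                           uint32_t computed_csum)
{
    return hdr->generation == ptr->expected_gen &&
           hdr->csum == computed_csum;
}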
3.5 Extent allocation tree
The extent allocation tree holds extent-items, each describing a particular
contiguous on-disk area. There could be many references to an extent, each
addressing only part of it. For example, consider file foo that has an on-disk
extent 100KB - 128KB. File foo is cloned creating file bar. Later on, a range
of 10KB is overwritten in bar. This could cause the following situation:
file   on-disk extents
foo    100-128KB
bar    100-110KB, 300-310KB, 120-128KB

Figure 14: Foo and its clone bar share parts of the extent 100-128KB. Bar has an extent in the middle that has been overwritten, and is now located much further away on disk.
There are three pointers into extent 100-128KB, covering different parts
of it. The extent-item keeps track of all such references, to allow moving
the entire extent at a later time. An extent could potentially have a large
number of back references, in which case the extent-item does not fit in a
single b-tree leaf node. In such cases, the item spills and takes up more than
one leaf.
A back reference is a logical hint; it is not a physical address. It is
constructed from the root object id, generation id, tree level, and lowest
object-id in the pointing block. This allows finding the pointer, after a
lookup traversal starting at the root object-id. In this example, extent
100-128KB has three back references, one from foo, and two from bar. The
back references also serve as the ref-count: when it reaches zero, the extent
can be freed.
An important operation on the extent tree is finding all references into
a disk area. The extent items are sorted by start address, and there are
no overlaps between items. This allows doing a range query on the b-tree,
resulting in all extents in the range. We can then extract all the back
references and find all pointers into this disk area. This is a crucial operation
used for moving data out of an area. There could be multiple reasons for
this: garbage collection, device removal, file system resize, RAID rebalance,
etc.
3.6 Fsync
fsync is an operation that flushes all dirty data for a particular file to disk. An important use case is databases, which wish to ensure that the database log is on disk prior to committing a transaction. Latency is important; a
transaction will not commit until the log is fully on disk. A naive fsync
implementation is to checkpoint the entire filesystem. However, that suffers
from high latency. Instead, modified data and metadata related to the
particular file are written to a special log-tree. Should the system crash, the
log-tree will be read as part of the recovery sequence. This ensures that only
minimal and relevant modifications will be part of the fsync code path.
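The latency-sensitive pattern that motivates the log-tree looks roughly like this from the application's side; the function is an illustrative example of a database appending to its log, not part of BTRFS.

#include <stddef.h>
#include <unistd.h>

/* Append a transaction record to the log file and force it to stable
 * storage; the transaction is considered committed only after fsync
 * returns successfully. */
int append_and_commit(int log_fd, const void *record, size_t len)
{
    if (write(log_fd, record, len) != (ssize_t)len)
        return -1;
    return fsync(log_fd);
}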
3.7 Compression
Compression is implemented at the extent level. Two compression algo-
rithms are supported: ZLIB [21], and LZO [2]. ZLIB is slower, but provides
higher compression ratios. LZO is faster, but provides a worse compression
ratio.
3.8 Concurrency
Modern systems have multiple CPUs with many cores. Taking advantage
of this computing power through parallelism is an important consideration.
Old generations are immutable on disk, and their access does not require
locking. The in-memory under-modification pages requires protection. Since
data is organized in trees, the strategy is to use a read/write locking scheme.
Tree traversal is initiated in read mode. When a node that requires update
is encountered, the lock is converted to write mode. If a block B requires COW, the traversal is restarted. The new traversal stops at parent(B), COWs B, modifies the parent pointer, and continues down.
Tree traversals are top-down. They start at the top and walk down the tree; it is never necessary to walk back up.
4 Multiple Device Support
Linux has two device management subsystems: the device mapper (DM),
and the software RAID subsystem (md). The device mapper is a stackable
design starting with raw devices, and building up with additional modules
supporting multipathing, thin provisioning, mirroring, striping, RAID5/6,
etc. The software RAID module supports various fault tolerance levels by
combining lower level raw devices. Both modules' primary function is to take
raw disks, merge them into a virtually contiguous block-address space, and
export that abstraction to higher level kernel layers. An important issue is
that checksums are not supported, which causes a problem for BTRFS. For
example, consider a case where data is stored in RAID-1 form on disk, and
each 4KB block has an additional copy. If the filesystem detects a checksum
error on one copy, it needs to recover from the other copy. DMs hide that
information behind the virtual address space abstraction, and return one
of the copies. To circumvent this problem, BTRFS does its own device
management. It calculates checksums, stores them in a separate tree, and
is then better positioned to recover data when media errors occur.
A machine may be attached to multiple storage devices; BTRFS splits
each device into large chunks. The rule of thumb is that a chunk should
never be more than 10% of the device size. At the time of writing 1GB
chunks are used for data, and 256MB chunks are used for metadata.
A chunk tree maintains a mapping from logical chunks to physical chunks. A device tree maintains the reverse mapping. The rest of the filesystem sees
logical chunks, and all extent references address logical chunks. This allows
moving physical chunks under the covers without the need to backtrace and
fix references. The chunk/device trees are small, and can typically be cached
in memory. This reduces the performance cost of an added indirection layer.
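The indirection can be illustrated with a minimal address translation sketch for a mirrored chunk; the structure layout and the 1GB chunk size are illustrative assumptions, not the on-disk chunk-tree format.

#include <stdint.h>

/* One logical chunk mapped onto two physical chunks (RAID1 mirroring). */
struct chunk_map {
    uint64_t logical_start;        /* start of the logical chunk           */
    uint64_t length;               /* e.g. 1GB for data chunks             */
    int      devid[2];             /* device holding each physical chunk   */
    uint64_t physical_start[2];    /* chunk start address on that device   */
};

/* Translate a logical address to (device, physical address) for mirror
 * copy `copy`; the rest of the filesystem only ever sees logical addresses. */
static void map_logical(const struct chunk_map *map, uint64_t logical,
                        int copy, int *devid, uint64_t *physical)
{
    uint64_t offset = logical - map->logical_start;  /* offset within chunk */
    *devid    = map->devid[copy];
    *physical = map->physical_start[copy] + offset;
}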
Physical chunks are divided into groups according to the required RAID
level of the logical chunk. For mirroring, chunks are divided into pairs. Figure 15 presents an example with three disks and groups of two. For example, logical chunk L1 is made up of physical chunks C11 and C21. Figure 16 shows a case where one disk is larger than the other two.
logical chunk   disk 1   disk 2   disk 3
L1              C11      C21
L2                       C22      C31
L3              C12               C32

Figure 15: To support RAID1 logical chunks, physical chunks are divided into pairs. Here there are three disks, each with two physical chunks, providing three logical chunks. Logical chunk L1 is built out of physical chunks C11 and C21.
logical chunk   disk 1   disk 2   disk 3
L1              C11      C21
L2                       C22      C31
L3              C12      C23
L4                       C24      C32

Figure 16: One large disk and two small disks in a RAID1 configuration.
For striping, groups of n chunks are used, where each physical chunk is on a different disk. For example, Figure 17 shows a stripe width of four (n = 4), with four disks and three logical chunks.
logical chunk   disk 1   disk 2   disk 3   disk 4
L1              C11      C21      C31      C41
L2              C12      C22      C32      C42
L3              C13      C23      C33      C43

Figure 17: Striping with four disks, stripe width n = 4. Three logical chunks are each made up of four physical chunks.
At the time of writing, RAID levels 0, 1, and 10 are supported. Support for RAID5/6 has recently been added (btrfs version 3.9). The core idea in higher RAID levels is to use chunk groups with Reed-Solomon [17] parity relationships. For example, Figure 18 shows a RAID6 configuration where logical chunks L1, L2, and L3 are constructed from doubly protected physical chunks. For example, L1 is constructed from {C11, C21, C31, C41}. Chunks {C11, C21} hold data in the clear, C31 = C21 ⊕ C11, and C41 = Q(C21, C11). The function Q is defined by Reed-Solomon codes such that any double chunk failure combination would be recoverable.
                physical disks
logical chunk   D1      D2      P       Q
L1              C11     C21     C31     C41
L2              C12     C22     C32     C42
L3              C13     C23     C33     C43

Figure 18: A RAID6 example. There are four disks, {D1, D2, P, Q}. Each logical chunk has a physical chunk on each disk. For example, the raw data for L1 is striped on disks D1 and D2; C31 is the parity of C11 and C21, and C41 is the calculated Q of chunks C11 and C21.
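As an illustration of the parity relationships in Figure 18, the sketch below computes P and Q over the two data chunks of a stripe. It assumes the conventional RAID-6 arithmetic over GF(2^8) with generator 2, which is a standard choice; the actual code used by the kernel is more general and heavily optimized.

#include <stdint.h>
#include <stddef.h>

/* Multiply by the generator g = 2 in GF(2^8), reducing by the 0x11d
 * polynomial (the arithmetic conventionally used for RAID-6 Q parity). */
static uint8_t gf_mul2(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1d : 0x00));
}

/* Compute P and Q for a stripe with two data chunks, byte by byte:
 *   P = D1 xor D2
 *   Q = D1 xor g*D2
 * Any two lost chunks can then be reconstructed from the remaining two. */
void raid6_pq(const uint8_t *d1, const uint8_t *d2,
              uint8_t *p, uint8_t *q, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        p[i] = d1[i] ^ d2[i];
        q[i] = d1[i] ^ gf_mul2(d2[i]);
    }
}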
Replicating data and storing parity is costly overhead for a storage sys-
tem. However, it allows recovery from many media error scenarios. The
simplest case is RAID1, where each block has a mirror copy. When the
filesystem tries to read one copy, and discovers an IO or checksum error, it
tries the second copy. If the second copy also has an error, then the data is
lost. Back references have to be traced up the filesystem tree, and the file
has to be marked as damaged. If the second copy is valid, then it can be
returned to the caller. In addition, the first copy can be overwritten with
the valid data. A proactive approach, where a low intensity scrub operation
is continuously run on the data, is also supported.
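The self-healing read path just described can be sketched over in-memory mirrors; the toy checksum and buffer layout are illustrative only (the filesystem uses CRC32C and issues real disk IO).

#include <stdint.h>
#include <string.h>

#define BLK 4096

/* Toy checksum standing in for CRC32C. */
static uint32_t toy_csum(const uint8_t *b)
{
    uint32_t s = 0;
    for (int i = 0; i < BLK; i++)
        s = s * 31 + b[i];
    return s;
}

/* Read a mirrored block: try each copy against the expected checksum,
 * return the good one, and repair the bad copy when possible. Returns -1
 * if both copies are bad, in which case the data is lost and the file has
 * to be marked as damaged. */
int read_mirrored(uint8_t mirror[2][BLK], uint32_t expected, uint8_t *out)
{
    for (int copy = 0; copy < 2; copy++) {
        if (toy_csum(mirror[copy]) != expected)
            continue;
        memcpy(out, mirror[copy], BLK);
        if (toy_csum(mirror[1 - copy]) != expected)
            memcpy(mirror[1 - copy], out, BLK);   /* overwrite the bad copy */
        return 0;
    }
    return -1;
}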
There is flexibility in the RAID configuration of logical chunks. A sin-
gle BTRFS storage pool can have various logical chunks at different RAID
levels. This decouples the top level logical structure from the low-level reli-
ability and striping mechanisms. This is useful for operations such as:
1. Changing RAID levels on the fly, increasing or decreasing reliability
2. Changing stripe width: more width allows better bandwidth
3. Giving different subvolumes different RAID levels. Perhaps some sub-
volumes require higher reliability, while others need more performance
at the cost of less reliability.
The default behavior is to use RAID1 for filesystem metadata, even if
there is only one disk. This gives the filesystem a better chance to recover
when there are media failures.
Common operations that occur in the lifetime of a filesystem are device
addition and removal. This is supported by a general balancing algorithm
that tries to spread allocations evenly on all available disks, even as the
device population changes. This is carries out as a background process,
with minimal interference with user IO. For example, in Table 19(a) the
system has two disks in a RAID1 configuration; each disk holds 1
2of the raw
data. Then, a new disk is added, see Table 19(b). The goal of the balancing
code is to reach the state shown in Table 19(c), where data is spread evenly
on all three disks, and each disk holds 1
3of the raw data.
(a) two disks
logical chunk   disk 1   disk 2
L1              C11      C21
L2              C12      C22
L3              C13      C23

(b) disk added
logical chunk   disk 1   disk 2   disk 3
L1              C11      C21
L2              C12      C22
L3              C13      C23

(c) after rebalance
logical chunk   disk 1   disk 2   disk 3
L1              C11      C21
L2                       C22      C12
L3              C13               C23
Figure 19: Device addition. Initially (a), there are two disks. In state (b),
another disk is added; it is initially empty. State (c) shows the goal: data
spread evenly on all disks. Here, physical chunks C12 and C23 were moved
to disk 3.
When a device is removed, the situation is reversed. From a 1/3 ratio (as in Figure 19(c)), the system has to move back to a 1/2 ratio. If there are
unused chunks on the remaining disks, then the rebalancer can complete
the task autonomously. However, we are not always that fortunate. If data
is spread across all chunks, then trying to evict a chunk requires traversal
through the filesystem, moving extents, and fixing references. This is similar
to defragmentation, which is described in Section 5.
5 Defragmentation and Relocation
Fragmentation occurs when a filesystem runs low on long contiguous disk
extents. This is a problem for all filesystems, but it is especially acute for copy-on-write systems, because they write all new data to fresh disk areas. This
means that even overwrites, IOs that update areas that have already been
allocated, will require new allocation.
In BTRFS, updates are accumulated in memory, and written to disk at
sync time. This allows batching optimizations, such as delayed allocation,
and co-locating data from the same file. Since the sync timer is normally
about 30 seconds, and memory buffer space is limited, the batching process
is helpful, but is not sufficient in itself to prevent fragmentation. There
are several different approaches and algorithms for dealing with disk frag-
mentation. Note that, unlike traditional LFS [23], BTRFS can continue to
function in the absence of long disk extents. It can allocate data and meta-
data on many short extents if it has to. However, performance can get very
bad in these kinds of situations.
The nocow option can be set on an entire filesystem or a particular file.
It cancels copy-on-write for data blocks, unless there is a snapshot. COW
still applies to metadata blocks. Nocow makes sense for workloads where
COW would be very expensive. For example, database workloads that do
random small updates, followed by sequential scans. The normal BTRFS
policy would write the blocks to disk in update order, which would have
very bad performance in the sequential scan. Databases could well be large
enough to overwhelm the in-memory buffers, defeating the attempt to lay out the data in increasing address order.
For applications less demanding than databases, the user can set the
autodefrag mount option for the filesystem. Under the autodefrag policy, the
system continuously looks for files that are good candidates, and schedules
them for defragmentation. In order to defrag a file, it is read, COWed,
and written to disk in the next checkpoint. This is likely to make it much
more sequential, because the allocator will try to write it out in as few
extents as possible. The downside is that sharing with older snapshots
is lost. While this option is simple, and works well for many systems, it
would be better to maintain sharing. This is currently a topic of active
development. Autodefrag is intended to be a default mount option.
The policies used by autodefrag at the time of writing are as follows. If there is a small write (less than 64KB) into the middle of a file, the file is added to an autodefrag list, and it is then defragged whenever the cleaner thread runs. The cleaner thread is woken up at the end of every transaction commit, so every 30 seconds. This works well for workloads like SQLite databases. Chrome or Firefox users will notice a speedup in startup time if they use autodefrag, since both applications use SQLite.
The last important case is shrinking a filesystem, or evicting data from a disk. In this case, a relocator is used. This is an algorithm that scans a chunk
and moves the live data off of it, while maintaining sharing. Relocation is
a complex procedure, however, and disregarding sharing could cause an
increase in space usage, at the point where we were trying to reduce space
consumption.
The relocator works on a chunk by chunk basis. The general scheme is:
1. Move out all live extents in the chunk
2. Find all references into the chunk
3. Fix the references while maintaining sharing
The copy-on-write methodology is used throughout; references are never
updated in place. Figure 20 shows a simple example. Extent K, marked as
an oval with a blue border, needs to be moved out of the chunk. It is copied out to K', which is on a different chunk.
Figure 20: Do a range lookup in the extent tree, find all the extents located
in the chunk. Copy all the live extents to new locations.
In order to speed up some of the reference tracking, we follow back-
references to find all upper level tree blocks that directly or indirectly refer-
ence the chunk. The result is stored in a DAG-like data structure called a backref cache; see Figure 21.
Figure 21: Upper level nodes stored in a backref cache.
A list of sub-volume trees that reference the chunk is calculated from the backref cache; these trees are subsequently cloned, see Figure 22. This operation has two effects: (1) it freezes in time the state of the sub-volumes, and (2) it allows making off-to-the-side modifications to the sub-volumes while affording the user continued operation. The active filesystem continues to
update the sub-volumes; conflicting updates will be merged later on in the
process.
Figure 23: Fix references in the reloc tree using COW.
The last step is to merge the reloc trees with the originals. We traverse
the trees. We find sub-trees that are modified in the reloc tree but where
corresponding parts in the fs tree are not modified. These sub-trees in the
reloc tree are swapped with their older counterparts from the fs tree. The
end result is new fs-trees. Conflicts could arise if an intermediate node N was updated by the active filesystem while a new version (N') was created by the relocation code. We treat this by preferring the version of N from the active filesystem, and looking for shared sub-trees in the children of N. The worst case is where all intermediate nodes have been concurrently modified, leaving the relocated data extents orphaned. In this case, the merge step will COW the immediate parents (and their paths) and fix them. Finally, we drop the reloc trees; they are no longer needed. Figure 24 depicts an example.
Figure 24: Merge reloc trees with corresponding fs trees, then drop the reloc
trees.
Swapping is done using copy-on-write; nodes are never updated in place.
This incurs a cost, because nodes on the path to the root need to be updated,
even if they only indirectly reference the chunk. For example, node Y is COWed only because it references nodes {C, D}, which themselves are not
in the chunk, but merely reference extent K. Nodes that have no path to
the chunk, such as X, do not require modification.
The new filesystem DAG is now in memory, and has the correct sharing
structure. It takes the same amount of space as the original, which means
that filesystem space usage stays the same. Writing out the DAG to disk can
be done onto contiguous extents resulting in improved sequentiality. Once
that is done, the old chunk can be discarded.
6 Performance
There are no agreed-upon standards for testing filesystem performance.
While there are industry benchmarks for the NFS and CIFS protocols, they
cover only a small percentage of the workloads seen in the field. At the end
of the day, what matters to a user is performance for his particular
application. The only realistic way to check which filesystem is the best
match for a particular use case is to try several filesystems and see which one works
best.
As we cannot cover all use cases, we chose several common benchmarks
to show that BTRFS performs comparably with its contemporaries. At
the time of writing, the major Linux filesystems, aside from BTRFS, are
XFS and Ext4. These are significantly more mature systems, and we do
not expect to perform orders of magnitude better. Our contribution is a
filesystem supporting important new features, such as snapshots and data
checksums, while providing reasonable performance under most workloads.
An important filesystem we do not compare against is ZFS. Section 2.1
covers many of its features, and also shows that it differs significantly
from BTRFS. While ZFS has ports to Linux, it is not a native Linux
filesystem, which prevents a real apples-to-apples comparison.
Three storage subsystems were examined: an SSD, a hard disk, and
multiple disks in a RAID10 configuration.
6.1 Hard disk
The tests reported here were run on a single-socket 3.2GHz quad-core x86
processor with 8GB of RAM and a single SATA drive with a 6Gbit/s link.
The Linux version was 3.4.
The first test was a Linux kernel make, starting from a clean tree of
source files. A block trace was collected, starting with the make -j8 com-
mand. This starts eight parallel threads that perform C compilation and
linking with gcc. Figure 25 compares throughput, seek count, and IOps
between the three filesystems. Ext4 has the highest throughput, averaging
a little less than twice that of BTRFS and XFS. All filesystems maintain
about the same seeks per second, with BTRFS on average seeking less. The
spike at the beginning of the run for BTRFS is likely due to the initial bit
of copy-on-writing that bounces between
different block groups. Once additional block groups are allocated to deal
with the meta-data load, everything smooths out. The IO operations per
second are a little closer together, but again Ext4 wins out overall. The
final spike in throughput is due to writing out vmlinux to disk, amounting
to a long sequential write. The compile times are all within a few seconds of
each other. The kernel compile test tends to be a bit meta-data intensive,
and it is a good benchmark for an application that has a heavy meta-data
load. The overall picture is that the filesystems are generally on par.
Figure 25: A Kernel compile, all filesystems exhibit similar performance.
The following micro benchmarks deal with write performance, see Fig-
ure 26. Tiobench writes a given size to a file with a specified number of
threads. We used a 2000MB file and ran with 1, 2, 4, 8, and 16 threads.
Both tests show BTRFS running the fastest overall, and in the random case
dominating the other two file systems. The random case is probably much
better for BTRFS due to its write-anywhere nature, and also because we
use range locking for writes instead of a global inode lock, which makes
parallel operations much faster. Performance declines somewhat with
additional threads due to contention on the shared inode mutex.
Figure 26: TIO benchmark throughput (MB/s) versus number of threads (1,
2, 4, 8, 16) for BTRFS, XFS, and Ext4: (a) sequential, (b) random.
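A rough Python analogue of this kind of threaded write test is sketched below; the file names, block size, and threading model are assumptions, and the actual tiobench is a separate C program with more careful timing.

    import os, random, threading, time

    def writer(path, total_mb, block_kb, sequential, results, idx):
        """Write total_mb megabytes in block_kb-sized blocks, either
        sequentially or at random offsets, and record the achieved MB/s."""
        block = os.urandom(block_kb * 1024)
        n_blocks = (total_mb * 1024) // block_kb
        start = time.time()
        with open(path, "wb") as f:
            f.truncate(total_mb * 1024 * 1024)
            for _ in range(n_blocks):
                if not sequential:
                    f.seek(random.randrange(n_blocks) * block_kb * 1024)
                f.write(block)
            f.flush()
            os.fsync(f.fileno())
        results[idx] = total_mb / (time.time() - start)

    def run(n_threads, total_mb, block_kb=64, sequential=True):
        """Split total_mb across n_threads writers (one file each) and
        return the aggregate throughput in MB/s."""
        results = [0.0] * n_threads
        per_thread = total_mb // n_threads
        threads = [threading.Thread(target=writer,
                                    args=(f"tio_{i}.dat", per_thread, block_kb,
                                          sequential, results, i))
                   for i in range(n_threads)]
        for t in threads:
            t.start()
        for t in threads:
            t.join()
        for i in range(n_threads):
            os.remove(f"tio_{i}.dat")
        return sum(results)

    if __name__ == "__main__":
        # Small sizes so the sketch runs quickly; the benchmark used 2000MB.
        for n in (1, 2, 4, 8, 16):
            print(n, "threads:", round(run(n, total_mb=64), 1), "MB/s")

Passing sequential=False gives the random-write variant of the test.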
6.2 Flash disk
Flash disks are becoming ubiquitous, replacing traditional hard disks in
many modern laptops. Smartphones and embedded devices running Linux
commonly use flash disks as well. This has motivated an ongoing effort to
optimize BTRFS for Solid State Drives (SSDs). Here, we describe some of
this work; the code is now part of Linux kernel 3.4. The hardware used was
a FusionIO device.
Figure 27 depicts performance for the Linux 3.3 kernel, with BTRFS cre-
ating 32 million empty files. The graph was generated by Seekwatcher [10], a
tool that visualizes disk head movement. In the top graph, the X axis is time, the
Y axis is disk head location, reads are colored blue, and writes are colored
green. The bottom graph tracks throughput. The filesystem starts empty,
filling up with empty files as the test progresses. The pattern emerging from
the graph is a continuous progression, writing new files at the end. File meta-
data is batched and written sequentially during checkpoints. Checkpoints
take place every thirty seconds. They incur many random reads, appearing
as a scattering of blue dots. The reads are required to access the free space
allocation tree, a structure too big to fit in memory. The additional disk
seeks slow down the system considerably, as can be seen in the throughput
graph. The reason writes do not progress linearly is that checkpoints, in
addition to writing new data, also free up blocks; these are subsequently
reused.
Figure 27: Performance in the kernel 3.3. File creation rate is unsteady,
throughput is a series of peaks and valleys.
In order to improve performance, the number of reads had to be reduced.
The core problem turned out to be the way the Linux virtual memory (VM)
system was used. The VM assumes that allocated pages will be used for
a while. BTRFS, however, uses many temporary pages due to its copy-on-
write nature, where data can move around on the disk. This mismatch was
causing the VM to hold many out of date pages, reducing cache effectiveness.
This, in turn, caused additional paging in the free space allocation tree,
which was evicted due to cache pressure. The fix was for BTRFS to try
harder to discard pages that had become invalid from the VM. Figure 28
shows the resulting performance improvement. The number of reads is much
reduced, and they are concentrated in the checkpoint intervals. Throughput
is steady at about 125MB/sec, and the rate of file creation is 150,000 files
per second. Note that this optimization is general in nature, and applies
also to disk workloads. The problem was not observed with HDDs, because
the IOps rates are much lower.
Figure 28: Performance in the kernel 3.4, with 4KB metadata pages.
Testing showed that using larger page sizes was beneficial on flash. Fig-
ure 29 shows the effects of using a 16KB metadata page size. The rate of
file creation is 170,000 files per second. Using a larger metadata page size is
also important for RAID5/6 integration, as it allows putting one page on a
single RAID stripe.
Figure 29: Performance in the kernel 3.4, with 16KB metadata pages.
On the same hardware, XFS performed 115,000 files/second, and Ext4
performed 110,000 files/second. We believe this makes the filesystems com-
parable.
6.3 Multiple devices
The machine used in this section was a single socket AMD Phenom II X4
840 Processor with 16GB of RAM, running Fedora 17 with a 3.6 Linux
kernel. It was connected to four 3TB SATA disks, each with 6Gbit/s
connectivity. A RAID10 configuration was created in three ways: (1) native
BTRFS (btrfs), (2) BTRFS over the Linux MD layer (btrfs-md), (3) XFS
over MD (xfs-md). These three configurations were compared using two
experiments.
Figure 30 shows an FIO job which creates two 32GB files one after the
other, amounting to a 64GB sequential write. BTRFS using its native RAID10
is slightly faster, but all of the results are very close together. We believe
this is because BTRFS has lower overhead using its internal RAID10. The
system always has to map its logical disk offsets to the physical disk offsets
to build the block-io structure (bio) to send to the lower level. With native
RAID, the math is done at this point; on top of MD we incur the slight
additional overhead of the lower layer doing the RAID math on our constructed
bio.
Figure 30: RAID-10 create two 32GB files
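Conceptually, the logical-to-physical mapping done when building a bio for a RAID10 layout looks like the sketch below; the stripe size and disk numbering are illustrative assumptions, not the actual BTRFS chunk geometry.

    def raid10_map(logical, n_disks=4, stripe_len=64 * 1024):
        """Map a logical byte offset to a RAID10 location: data is striped
        across n_disks//2 mirror pairs, and each stripe unit is written to
        both disks of its pair.

        Returns (primary_disk, mirror_disk, offset_on_disk).
        """
        pairs = n_disks // 2
        stripe_nr, in_stripe = divmod(logical, stripe_len)
        pair = stripe_nr % pairs                     # which mirror pair
        offset = (stripe_nr // pairs) * stripe_len + in_stripe
        return 2 * pair, 2 * pair + 1, offset

    # Example: the first four 64KB stripe units alternate between the pairs.
    for lba in range(0, 4 * 64 * 1024, 64 * 1024):
        print(lba, raid10_map(lba))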
Figure 31 shows an FIO job that reads both of the files back simulta-
neously and sequentially. This is probably the biggest difference between
native RAID and MD, and it is likely due to the complexity of MD's
read balancing math. Looking at the read patterns, there are sequential
reads across all four disks in the case of BTRFS, but with MD there is a
stair stepping pattern between the devices. This is probably because BTRFS
selects the read mirror based on the process id (PID) of the reader, so one
thread will always read from the same devices in a stripe set. With two
sequential readers, the pattern is that both of them read from different de-
vices in a stripe set. This provides a lot less seeking overall and much higher
throughput. MD applies a sophisticated algorithm that keeps track of head
locations of all the disks within its array to try and make the disks seek as
little as possible. However, with a multi-threaded workload like this, the
cleverness does not pay off. It is likely that in more complex workloads MD
would be closer and might be faster than BTRFS.
Figure 31: RAID-10 read two 32GB files sequentially with two threads.
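A minimal sketch of the PID-based mirror choice described above follows; the modulo rule and helper names are illustrative, not the exact kernel logic.

    import os

    def choose_mirror(pid, n_mirrors=2):
        """Pick the copy to read from as a fixed function of the reader's
        PID, so a single thread keeps hitting the same disks of each
        stripe set."""
        return pid % n_mirrors

    # Two sequential readers (distinct PIDs) tend to land on different
    # mirrors, so the two streams do not compete for the same spindles.
    print(choose_mirror(os.getpid()), choose_mirror(os.getpid() + 1))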
6.4 Macro benchmarks
In this section we run several complex workloads, and see how the filesystems
perform. The FileBench [1] toolkit was used, with the mail, web, file, and
OLTP personalities.
The machine used was an Intel Core i7-2600 3.40GHz CPU: a single
socket with four cores (eight hardware threads) and 16GB of RAM. The memory was limited
to 2GB by setting a Linux boot parameter. The OS was Ubuntu 12.10,
with a 3.6.6 Linux kernel. The hard disk (HDD) was a SATA II, 3Gbit/sec,
7200 RPM drive. The flash disk (SSD) was an Intel SSD-320 Series, 300GB
capacity, with R/W throughput of 270 MB/s / 205 MB/s, and R/W IOps
of 39.5K/23K.
A FileBench run starts by preallocating a filesystem tree. Once that
is done, the chosen workload is executed while carefully measuring perfor-
mance. We looked at the operations per second (ops/s) and the CPU per
operation (cpu/op) metrics. An operation is a benchmark step, such as read-
ing a whole file, appending a certain amount of data to it, etc. Ops/s can
be used to compare relative filesystem performance.
Each workload was run five times against the HDD, and five times
against the SSD. The execution phase of each experiment lasted ten minutes.
We report on the median ops/s, median cpu/op, and the relative standard
error. Generally speaking, results were fairly consistent. We did encounter
a problem with OLTP, which occasionally reported very low ops/s.
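The reported statistics can be reproduced along the lines of the following sketch; the sample values are hypothetical, chosen only to show how a single bad run inflates the relative standard error.

    import statistics, math

    def summarize(ops_per_sec):
        """Median ops/s and relative standard error over repeated runs."""
        n = len(ops_per_sec)
        mean = statistics.mean(ops_per_sec)
        stderr = statistics.stdev(ops_per_sec) / math.sqrt(n)
        return statistics.median(ops_per_sec), stderr / mean

    # Five hypothetical OLTP runs; the outlier run drives up the error.
    runs = [430, 426, 433, 90, 428]
    median, relerr = summarize(runs)
    print(f"median={median} ops/s, relative error={relerr:.2f}")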
Default mount options and mkfs programs were used. Per file system,
the following options were employed:
filesystem       options
BTRFS            relatime,space_cache,rw
BTRFS w. SSD     relatime,space_cache,rw,ssd
Ext4             data=ordered,relatime,rw
XFS              attr2,noquota,relatime,rw
In the OLTP case we also tested BTRFS with the nodatacow option. This
turns off shadowing and checksums for data pages. Copy-on-write is still
used for metadata.
The web workload emulates a web server that serves files to HTTP
clients. It uses 100 threads that read entire files sequentially, as if they
were web pages. The threads, in a 1:10 ratio, append to a common file,
emulating a web log.
The file workload mimics a server that hosts home directories of multiple
users. A thread emulates a user, who is assumed to access only files and
directories in his home directory. Each thread performs a sequence of create,
delete, append, read, write, and stat operations.
The mail workload mimics an electronic mail server. It creates a flat
directory structure with many small files. It then creates 100 threads that
emulate e-mail operations: reading mail, composing, and deleting. This
translates into many small file and metadata operations.
The OLTP workload emulates a database that performs online trans-
action processing. It is based on the IO model used by Oracle 9i. It
creates 200 data reader threads, 10 data writer threads, and one log thread.
The readers perform small random reads, the writers perform small random
writes, and the log thread does 256KB log writes. The threads synchronize
using a set of semaphores, so they all work in concert.
Figure 32 summarizes the test parameters. All tests have a total on-disk
footprint of 10GB or higher, at least a factor of five larger than physical
RAM. This prevents the filesystem from simply caching the entire data set
in memory.
Workload Avg. file size #files Footprint IO size (R/W) # threads R/W ratio
web 32KB 350000 11GB 1MB/16KB 100 10:1
file 256KB 50000 12.5GB 1MB/16KB 50 1:2
mail 16KB 800000 12.5GB 1MB/16KB 100 1:1
OLTP 1GB 10 10GB 2KB/2KB 200+10 20:1
Figure 32: Summary of workload parameters.
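As a quick sanity check of the footprint column, the sketch below multiplies the average file size by the number of files for each workload; the values are transcribed from Figure 32, and the computed totals roughly reproduce the listed footprints.

    workloads = {            # avg file size (bytes), number of files
        "web":  (32 * 1024, 350_000),
        "file": (256 * 1024, 50_000),
        "mail": (16 * 1024, 800_000),
        "oltp": (1 << 30, 10),
    }
    for name, (size, count) in workloads.items():
        gb = size * count / 2**30
        print(f"{name:5s} ~{gb:.1f}GB on disk")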
Figure 33 summarizes the hard disk results, and Figure 34 shows the cor-
responding SSD results. Each line in a table summarizes five experiments.
Three values are presented, the median ops/s, the relative error of ops/s,
and the median cpu/op.
Examining the cpu/op values for both tables, we can see that BTRFS
requires more CPU cycles, although this is generally within a factor of 1.5
of Ext4 and XFS.
The HDD ops/s results are quite similar for the file and OLTP workloads.
The mail server uses fsync quite heavily, and BTRFS suffers there, because
its fsync implementation is about 4x-8x slower than Ext4's. This is a known
problem, and improvements are planned for kernel 3.8. The Ext4 HTree
data structure is a very effective index for the large flat mail directory. It is
two levels deep, while the B-tree indexes used by the other filesystems are
much deeper, requiring more random IO. This is an advantage for Ext4
in the web workload as well.
Workload Filesystem median ops/s relerr(ratio) median cpu/op
file btrfs 432 0.00 579us
ext4 396 0.07 356us
xfs 231 0.00 488us
mail btrfs 132 0.01 903us
ext4 613 0.02 464us
xfs 195 0.02 619us
oltp btrfs 229 0.01 27066us
btrfs nocow 252 0.00 24469us
ext4 209 0.00 30821us
xfs 234 0.02 21327us
web btrfs 370 0.04 796us
ext4 859 0.00 215us
xfs 400 0.00 274us
Figure 33: Hard disk results.
For SSD, the ops/s are reasonably close for the file and web workloads. In
the mail workload, BTRFS lags due to a less efficient fsync implementation.
For OLTP, standard BTRFS does poorly, as this is a workload that does not
work well with copy-on-write. Disabling COW brings results to the same
ballpark as the other filesystems. In addition, OLTP had one run with very
low ops/s, which caused a relative standard error of 50%.
Workload Filesystem median ops/s relerr(ratio) median cpu/op
file btrfs 3524 0.00 834us
ext4 3465 0.00 381us
xfs 3151 0.00 535us
mail btrfs 4007 0.02 929us
ext4 11701 0.11 665us
xfs 7616 0.03 765us
oltp btrfs 426 0.50 17897us
btrfs nocow 7104 0.02 421us
ext4 7349 0.00 385us
xfs 9068 0.01 285us
web btrfs 14945 0.09 549us
ext4 17585 0.00 294us
xfs 17828 0.00 295us
Figure 34: SSD results.
These are just four workloads, which by no means exercise all the various
ways one can use a filesystem. Hopefully, they are representative of the
ways most file systems are used. In most of these cases, BTRFS was in the
same ballpark as its more mature counterparts. In a few cases, low fsync
performance was an issue, and this problem is being addressed at the time
of writing. For heavy database workloads, the nodatacow option will
be available on a per-file basis in kernel 3.7.
7 Summary
BTRFS is a relatively young Linux filesystem, working its way towards
achieving default status on many Linux distributions. It is based on copy-
on-write, and supports efficient snapshots and strong data integrity.
As a general purpose filesystem, BTRFS has to work well on a wide
range of hardware, from enterprise servers to smartphones and embedded
systems. This is a challenging task, as the hardware is diverse in terms of
CPUs, disks, and memory.
This work describes the main algorithms and data layout of BTRFS. It
shows the manner in which copy-on-write b-trees, reference-counting, and
extents are used. It is the first to present a defragmentation algorithm for
COW based filesystems in the presence of snapshots.
8 Acknowledgments
We would like to thank Zheng Yan, for explaining the defragmentation al-
gorithm, and Valerie Aurora, for an illuminating LWN paper on BTRFS.
Vasily Tarasov and Erez Zadok helped us use the FileBench utility, and
gave good advice on filesystem performance testing.
We are indebted to the many BTRFS developers who have been working
hard since 2007 to bring this new filesystem to the point where it can be
used by Linux customers and users.
Finally, we would like to thank friends who reviewed the drafts, catching
errors and significantly improving quality: Gary Weiss and W.W.
References
[1] FileBench. http://sourceforge.net/projects/filebench.
[2] LZO. http://en.wikipedia.org/wiki/Lempel-Ziv-Oberhumer.
[3] A. Mathur, M. Cao, S. Bhattacharya, A. Dilger, A. Tomas, and L.
Vivier. The New Ext4 Filesystem: Current Status and Future Plans.
In Linux Symposium, Ottawa, Canada, 2007.
[4] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G.
Peck. Scalability in the XFS File System. In USENIX Annual Technical
Conference, San Diego, CA, 1996.
[5] M. Ahrens. https://blogs.oracle.com/ahrens/entry/is_it_magic.
[6] B. Callaghan, B. Pawlowski, and P. Staubach. NFS Version 3 Protocol
Specification. RFC 1813, IETF, June 1995.
[7] BTRFS mailing list. http://www.mail-archive.com/linux-btrfs@vger.kernel.org/index.html.
[8] C. Hellwig. XFS: The Big Storage File System for Linux. In Usenix
Login magazine, October 2009.
[9] C. Mason. BTRFS, 2007. http://en.wikipedia.org/wiki/Btrfs.
[10] C. Mason. Seekwatcher, 2008.
[11] D. Comer. The Ubiquitous B-Tree. ACM Comput. Surv., 11(2):121–137,
1979.
[12] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS
File Server Appliance. In USENIX, Berkeley, CA, USA, 1994. USENIX
Association.
[13] D. Patterson, G. Gibson, and R. Katz. A Case for Redundant Arrays
of Inexpensive Disks (RAID). SIGMOD, 17(3):109–116, June 1988.
[14] F. Chang, J. Dean, S. Ghemawat, C. Wilson, D. Wallach, M. Burrows,
T. Chandra, A. Fikes, and R. Gruber. Bigtable: a Distributed Storage
System for Structured Data. In 7th USENIX Symposium on Operating
Systems Design and Implementation, OSDI ’06, pages 15–15, Berkeley,
CA, USA, 2006. USENIX Association.
[15] H. Reiser. ReiserFS, 2001. http://en.wikipedia.org/wiki/ReiserFS.
[16] I. Heizer, P. Leach, and D. Perry. Common Internet File System Pro-
tocol (CIFS/1.0). Draft draft-heizer-cifs-v1-spec-00.txt, IETF, 1996.
[17] I. S. Reed and G. Solomon. Polynomial Codes Over Certain Finite
Fields. Journal of the Society for Industrial and Applied Mathematics,
8:300–304, 1960.
[18] J. Bonwick and B. Moore. ZFS, The last Word In File Systems.
http://hub.opensolaris.org/bin/download/Community+Group+zfs/docs/zfslast.pdf.
[19] J. Dean and S. Ghemawat. LevelDB.
http://code.google.com/p/leveldb.
[20] J. Edwards, D. Ellard, C. Everhart, R. Fair, E. Hamilton, A. Kahn, A.
Kanevsky, J. Lentini, A. Prakash, K. Smith, and E. Zayas. FlexVol:
Flexible, Efficient File Volume Virtualization in WAFL. In USENIX
Annual Technical Conference, pages 129–142, Berkeley, CA, USA, 2008.
USENIX Association.
[21] J. L. Gailly and M. Adler. ZLIB. en.wikipedia.org/wiki/Zlib.
[22] K. Ren and G. Gibson. TABLEFS: Embedding a NoSQL Database
Inside the Local File System. Technical Report CMU-PDL-12-103, May
2012.
[23] M. Rosenblum and J. Ousterhout. The Design and Implementation of a
Log-Structured File System. ACM Transactions on Computer Systems,
10(1):26–52, 1992.
[24] O. Rodeh. B-trees, Shadowing, and Clones. Technical Report H-248,
IBM, November 2006.
[25] O. Rodeh. B-trees, Shadowing, and Range-operations. Technical Re-
port H-248, IBM, November 2006.
[26] O. Rodeh. B-trees, Shadowing, and Clones. ACM Transactions on
Storage, 3(4), Feb 2008.
[27] O. Rodeh. Deferred Reference Counters for Copy-On-Write B-trees.
Technical Report rj10464, IBM, 2010.
[28] P. Macko, M. Seltzer, and K. Smith. Tracking Back References in a
Write-Anywhere File System. In 8th USENIX Conference on File and
Storage Technologies, FAST’10, Berkeley, CA, USA, 2010. USENIX
Association.
[29] P. O’Neil, E. Cheng, D. Gawlick, and E. O’Neil. The Log-Structured
Merge-Tree (LSM-tree). Acta Informatica, 33(4):351–385, June 1996.
[30] P. Shetty, R. Spillane, R. Malpani, B. Andrews, J. Seyster, and E.
Zadok. Building Workload-Independent Storage with VT-trees. In 11th
USENIX Conference on File and Storage Technologies (FAST 2013),
Feb 2013.
[31] R. Konishi, K. Sato, and Y. Amagai. NILFS. www.nilfs.org.
[32] R. Sears and R. Ramakrishnan. bLSM: a General Purpose Log Struc-
tured Merge Tree. In ACM SIGMOD International Conference on Man-
agement of Data, SIGMOD ’12, pages 217–228, New York, NY, USA,
2012. ACM.
[33] S. Shepler, B. Callaghan, D. Robinson, R. Thurlow, C. Beame, M.
Eisler, and D. Noveck. NFS version 4 Protocol. RFC 3010, IETF,
December 2000.
[34] V. Henson, M. Ahrens, and J. Bonwick. Automatic Performance Tuning
in the Zettabyte File System. In File and Storage Technologies (FAST),
work in progress report, Berkeley, CA, USA, 2003. USENIX Association.