Conference Paper

Transactional NVM cache with high performance and crash consistency


Abstract

The byte-addressable non-volatile memory (NVM) is a new and promising storage medium. Compared to NAND flash memory, the next-generation NVM not only preserves the durability of stored data but also has much shorter access latencies. An architect can utilize the fast and persistent NVM as an external disk cache. To guarantee the system's crash consistency, a prevalent journaling file system typically needs to run atop such an NVM disk cache. However, performance is severely impaired by the redundant efforts of achieving crash consistency in both the file system and the disk cache. Therefore, we propose a new mechanism called transactional NVM disk cache (Tinca). In brief, Tinca jointly guarantees the consistency of the file system and the disk cache and removes the performance penalty of file system journaling with a lightweight transaction scheme. Evaluations confirm that Tinca significantly outperforms state-of-the-art designs by up to 2.5X in local and cluster tests without causing any inconsistency issue.
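The abstract does not spell out Tinca's commit protocol, so the following is only a rough, hypothetical sketch of the general pattern such lightweight NVM-cache transactions follow: write the new block copy-on-write, flush it, then publish it with a single 8-byte atomic metadata update, so that a crash leaves either the old or the new version intact. All names and the flush helper are illustrative assumptions, not Tinca's actual code.

```c
/* Hypothetical sketch of a lightweight NVM-cache transaction, NOT
 * Tinca's real implementation: copy-on-write the data block, flush it,
 * then publish it with one 8-byte atomic pointer update. */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 4096
#define CACHELINE  64

typedef struct {
    _Atomic(uint8_t *) current;   /* published block location in NVM */
} cache_slot_t;

static void flush_range(const void *addr, size_t len) {
    const char *p = (const char *)addr;
    for (size_t off = 0; off < len; off += CACHELINE)
        _mm_clflush(p + off);
    _mm_sfence();                 /* order flushes before what follows */
}

/* Commit one block: COW write into a free NVM buffer, then atomically
 * swing the metadata pointer. A crash before the swing leaves the old
 * version intact; a crash after it leaves the new version. */
void commit_block(cache_slot_t *slot, uint8_t *nvm_free_buf,
                  const uint8_t *new_data) {
    memcpy(nvm_free_buf, new_data, BLOCK_SIZE);
    flush_range(nvm_free_buf, BLOCK_SIZE);
    atomic_store(&slot->current, nvm_free_buf);  /* 8B atomic publish */
    flush_range(&slot->current, sizeof(slot->current));
}
```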


... Non-volatile memory (NVM) technologies such as PCM [20] and STT-MRAM [21] are state-of-the-art memory technologies that allow DRAM-like access speed, byte-addressability, and the persistency of a storage device [22]. Due to these characteristics, NVM is used in research as a high-speed storage device [23,24], a persistent cache [25], or a main memory extension for persistent data structures [22,26]. However, it is necessary to support transactional updates for consistency guarantees while managing persistent data in NVM. ...
... NVM can be used as a write-back cache [25] or as a journaling device [24,30]. In the NVM Node Logging, NVM is used as storage only for file and file system metadata. ...
... Write Optimizations: several studies exist that improve the I/O performance of file systems using NVM [24,25,30]. UBJ [30] integrates the page cache and the journal area on NVM. ...
Conference Paper
In a manycore server environment, we observe performance degradation in parallel writes and identify the causes as follows: (i) when multiple threads write to a single file simultaneously, the current POSIX-based F2FS file system does not allow these parallel writes even when the ranges being written are distinct; (ii) the high processing time of Fsync at the file system layer degrades I/O throughput as multiple threads call Fsync simultaneously; (iii) the file system periodically checkpoints to recover from system crashes, and all incoming I/O requests are blocked while the checkpoint is running, which significantly degrades overall file system performance. To solve these problems, we first propose that file systems employ a fine-grained, file-level Range Lock that allows multiple threads to write to mutually exclusive ranges of a file, rather than the coarse-grained inode mutex lock (see the sketch below). Second, we propose NVM Node Logging, which uses NVM as extended storage space to store file metadata and file system metadata at high speed during Fsync and checkpoint operations. In particular, NVM Node Logging consists of (i) a fine-grained inode structure that solves the write amplification problem caused by flushing file metadata in block units and (ii) a Pin Point NAT (Node Address Table) Update, which allows flushing only modified NAT entries. We implemented Range Lock and NVM Node Logging for F2FS in Linux kernel 4.14.11. Our extensive evaluation on two different types of servers (a single-socket 10-core CPU server and a multi-socket 120-core NUMA CPU server) shows significant write throughput improvements in both real and synthetic workloads.
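As a concrete illustration of the fine-grained range-lock idea (not the authors' F2FS implementation), a minimal user-space sketch can keep a mutex-guarded list of busy intervals so that writers to disjoint ranges of a file proceed in parallel. All structures and names below are hypothetical.

```c
/* Minimal illustration of a per-file range lock: a mutex-guarded list
 * of busy intervals; writers to disjoint ranges proceed in parallel. */
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct range {
    size_t start, end;            /* locked interval [start, end) */
    struct range *next;
} range_t;

typedef struct {
    pthread_mutex_t mu;
    pthread_cond_t  cv;
    range_t *busy;                /* currently locked intervals */
} range_lock_t;

#define RANGE_LOCK_INIT \
    { PTHREAD_MUTEX_INITIALIZER, PTHREAD_COND_INITIALIZER, NULL }

static bool overlaps(const range_t *r, size_t s, size_t e) {
    return r->start < e && s < r->end;
}

void range_lock(range_lock_t *rl, size_t start, size_t end) {
    pthread_mutex_lock(&rl->mu);
    for (;;) {
        range_t *r = rl->busy;
        while (r && !overlaps(r, start, end)) r = r->next;
        if (!r) break;            /* no conflict: admit this range */
        pthread_cond_wait(&rl->cv, &rl->mu);
    }
    range_t *n = malloc(sizeof(*n));
    n->start = start; n->end = end; n->next = rl->busy;
    rl->busy = n;
    pthread_mutex_unlock(&rl->mu);
}

void range_unlock(range_lock_t *rl, size_t start, size_t end) {
    pthread_mutex_lock(&rl->mu);
    for (range_t **pp = &rl->busy; *pp; pp = &(*pp)->next) {
        if ((*pp)->start == start && (*pp)->end == end) {
            range_t *dead = *pp;
            *pp = dead->next;
            free(dead);
            break;
        }
    }
    pthread_cond_broadcast(&rl->cv);  /* wake writers waiting on us */
    pthread_mutex_unlock(&rl->mu);
}
```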
... As an example, redo logging first commits modified data to a log and subsequently updates file data in place. In short, logging has to write the same data twice [10, 17-20]. ...
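The double write that this excerpt describes can be made concrete with a small hypothetical sketch: an update is first appended to the redo log and only then applied in place, so every byte travels to persistent media twice. The nvm_write helper and the simulated NVM region are assumptions for illustration only.

```c
/* Sketch of why redo logging doubles writes. Hypothetical helpers:
 * NVM is a simulated region, nvm_write elides the flush/fence a real
 * implementation would need. */
#include <stdint.h>
#include <string.h>

#define BLK 4096

static uint8_t NVM[1 << 20];          /* simulated NVM region */

static void nvm_write(uint64_t dst, const void *src, size_t n) {
    memcpy(NVM + dst, src, n);        /* real code would flush + fence */
}

typedef struct { uint64_t target; uint32_t len; } log_rec_t;

/* Every update is written twice: once into the redo log, once home. */
void redo_update(uint64_t log_tail, uint64_t home_addr,
                 const uint8_t data[BLK]) {
    log_rec_t rec = { .target = home_addr, .len = BLK };
    nvm_write(log_tail, &rec, sizeof(rec));        /* log header      */
    nvm_write(log_tail + sizeof(rec), data, BLK);  /* 1st copy: log   */
    /* a commit record plus an ordering fence would follow here */
    nvm_write(home_addr, data, BLK);               /* 2nd copy: home  */
}
```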
... Initially we preallocate 3% of NVMM space for the use of LAWN. Given an NVM device of 128GB, the entire zone takes 3.8GB, which is sufficiently large compared to the default 128MB log (journal) of Ext4 [17,19]. If the percentage of free slots in a sub-zone drops below a threshold, say 10%, which may entail swapping data with files to release slots, we allocate more NVMM space to expand that sub-zone. ...
... If the crash occurs after setting the descriptor, the newer data has been successfully committed to the zone and will be used as the valid version. This is identical to the data consistency achieved by classic Ext4 in data=journal mode, in which data committed to the log (journal) is valid after a crash [17,19]. ...
Conference Paper
Byte-addressable non-volatile memories can be used with DRAM to build a hybrid volatile/non-volatile main memory (NVMM) system. NVMM file systems demand consistency techniques such as logging and copy-on-write to guarantee data consistency in case of system crashes. However, conventional consistency techniques may incur write amplification that severely degrades file system performance. In this paper, we propose LAWN (logless, alternate writing for NVMM), a novel approach that achieves data consistency and significantly improves performance by reducing write amplification. Our evaluation reveals that LAWN boosts the performance of a state-of-the-art NVMM file system by up to 12.0×.
... However, new challenges emerge when a data structure is ported from hard disk to persistent memory that is directly operated on by a CPU. First, modern CPUs mainly support an atomic write of 8B [4], [8], [21], [22], [23]. In addition, the exchange unit between the CPU cache and memory is a cache line, which typically has a size of 64B or 128B. ...
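A minimal sketch of the two constraints this excerpt cites, under the assumption of an x86-like machine: an aligned 8-byte store is atomic, and persistence is enforced per 64B cache line with a flush plus a fence. The helper name is illustrative.

```c
/* Persist one 8-byte word: the aligned store is atomic, so a crash can
 * never expose a torn value; clflush + sfence push the enclosing 64B
 * cache line toward NVM and order it before later persists. */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdint.h>

static void persist_u64(_Atomic uint64_t *slot, uint64_t val) {
    atomic_store(slot, val);          /* 8B atomic store        */
    _mm_clflush((const void *)slot);  /* evict its cache line   */
    _mm_sfence();                     /* order before next step */
}
```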
... Computer scientists have proposed a number of artifacts for system- and application-level software to utilize persistent memory on the memory bus [2], [3], [4], [6], [7], [12], [20], [22], [24], [25], [26], [27], [28], [29], [30], [31], [32]. In particular, several in-NVM B+-tree variants have been developed [8], [9], [14], [15], [16], [17], [33]. ...
Preprint
Several B+-tree variants have been developed to exploit the performance potential of byte-addressable non-volatile memory (NVM). In this paper, we attentively investigate the properties of B+-tree and find that a conventional B+-tree node is a linear structure in which key-value (KV) pairs are maintained from the zero offset of the node. These pairs are shifted in a unidirectional fashion for insertions and deletions. Inserting or deleting one KV pair may inflict a large amount of write amplification due to shifting KV pairs, which badly impairs the performance of an in-NVM B+-tree. We therefore propose a novel circular design for B+-tree. With regard to NVM's byte-addressability, our Circ-Tree design embraces tree nodes in a circular structure without a fixed base address and bidirectionally shifts KV pairs in a node for insertions and deletions to minimize write amplification. We have implemented a prototype of Circ-Tree and conducted extensive experiments. Experimental results show that Circ-Tree significantly outperforms two state-of-the-art in-NVM B+-tree variants, i.e., NV-tree and FAST+FAIR, by up to 1.6x and 8.6x, respectively, in terms of write performance. An end-to-end comparison, running YCSB against KV store systems built on NV-tree, FAST+FAIR, and Circ-Tree, reveals that Circ-Tree yields up to 29.3% and 47.4% higher write performance than NV-tree and FAST+FAIR, respectively.
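To make the bidirectional-shift idea concrete, here is a much-simplified, keys-only sketch of a circular node (not Circ-Tree's actual layout; NVM flushes and node splits are omitted): an insert shifts whichever side of the insertion point holds fewer entries.

```c
/* Simplified circular-node sketch: sorted keys live in a ring buffer,
 * and an insert shifts the shorter side, roughly halving shift-induced
 * write amplification versus always shifting right. Assumes the node
 * is not full (count < SLOTS); splitting is not shown. */
#include <stdint.h>

#define SLOTS 16

typedef struct {
    uint64_t key[SLOTS];
    int head;                     /* physical index of smallest key */
    int count;
} circ_node_t;

void circ_insert(circ_node_t *n, uint64_t k) {
    int pos = 0;                  /* rank of k among existing keys */
    while (pos < n->count && n->key[(n->head + pos) % SLOTS] < k)
        pos++;
    if (pos < n->count - pos) {   /* left side shorter: shift left  */
        n->head = (n->head + SLOTS - 1) % SLOTS;
        for (int i = 0; i < pos; i++)
            n->key[(n->head + i) % SLOTS] =
                n->key[(n->head + i + 1) % SLOTS];
    } else {                      /* right side shorter: shift right */
        for (int i = n->count; i > pos; i--)
            n->key[(n->head + i) % SLOTS] =
                n->key[(n->head + i - 1) % SLOTS];
    }
    n->key[(n->head + pos) % SLOTS] = k;
    n->count++;
}
```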
... A study [29] on using NVM as an I/O cache for SSDs or HDDs reveals that current I/O caching solutions cannot fully benefit from the low latency and high throughput of NVM. Recent research has tried to overcome the complexity of using NVM as a Direct Access (DAX) storage device and to use it as a cache for SSD/HDD [4,12,24,61]. In recent years, Intel has provided the Persistent Memory Development Kit (PMDK) [1], which provides several APIs to access persistent memory directly from user level. ...
... This technique is also used in many other NVM-based designs [25,69]. The transactional NVM disk cache (Tinca) [61] aims to achieve crash consistency through transactional support while avoiding double writes by exploiting an NVM-based disk cache. Leveraging the byte-addressability of NVM, Tinca maintains fine-grained cache metadata to enable copy-on-write (COW) while writing a data block. ...
Preprint
Although every newly invented storage technology has made a big step towards perfection, none of them is flawless. Essential data store requirements such as performance, availability, and recoverability have not yet been met together in a single, economically affordable medium. One of the most influential factors is price, so there has always been a trade-off between having the desired set of storage capabilities and the costs. To address this issue, a network of various types of storage media is used to deliver the high performance of expensive devices, such as solid state drives and non-volatile memories, along with the high capacity of inexpensive ones, like hard disk drives. In software, caching and tiering are long-established concepts for handling file operations, moving data automatically within such a storage network, and managing data backup on low-cost media. Intelligently moving data between devices based on need is the key insight here. In this survey, we discuss recent research on improving high-performance storage systems with caching and tiering techniques.
... Using a log to record changes made to the data structure is a straightforward approach. Nevertheless, logging incurs writing the same data twice, which is costly in time and particularly harmful for NVM technologies that suffer from write endurance issues [4, 5, 10, 15, 27, 34-36]. On the other hand, devising a logless consistency mechanism is non-trivial, as demonstrated in existing work on NVM-based systems and structures [31,40]. ...
... Require: an RB-tree T and a node z;
1:  if (z has a red sibling) then
2:      Color z's sibling black;
3:      Return;
4:  end if
5:  if (z is its parent's right child) then
6:      Let α be z's parent;
7:      Let β and γ be z's left and right children, respectively;
8:      if (z is red) then
9:          Color z black;
10:         Make z the child of its grandparent (if any);
11:         Make α the left child of z and color α red;
12:         Make β the right child of α;
13:         Delete_Adjust(T, β);
14:     else if (z has no red child) then
15:         Color z red;
16:         if (α is the root of T) then
17:             Color α black;
18:         else
19:             Let x be z's uncle;
20:             Delete_Adjust(T, x);
21:         end if
22:     else if (β is red) then
23:         Color β black and z red;
24:         Make β the right child of α;
25:         Assign β's right child as the left child of z;
26:         Make z the right child of β;
27:         Delete_Adjust(T, β);
28:     else if (γ is red) then
29:         Color γ black and z red;
30:         Make z the child of its grandparent (if any);
31:         Make α the left child of z and color α black;
32:         Make β the right child of α;
33:     end if
34: else
35:     Repeat Lines 2-29 with 'left' and 'right' exchanged.
36: end if ...
Article
Full-text available
Byte-addressable non-volatile memory (NVM) is going to reshape conventional computer systems. With the advantages of low latency, byte-addressability, and non-volatility, NVM can be put directly on the memory bus to replace DRAM. As a result, both system and application software have to be adjusted to the fact that the persistent layer moves up to the memory. However, most current in-memory data structures will suffer consistency issues if not well tuned for NVM. This article places emphasis on an important in-memory structure that is widely used in computer systems, the Red/Black-tree (RB-tree). Since it has a long and complicated update process, the RB-tree is prone to inconsistency problems with NVM. This article presents an NVM-compatible consistent RB-tree built on a new technique named cascade-versioning. The proposed RB-tree (i) is consistent at all times and scalable and (ii) needs no recovery procedure after system crashes. Experimental results show that the RB-tree for NVM not only achieves consistency with insignificant spatial overhead but also yields performance comparable to an ordinary volatile RB-tree.
... Researchers have proposed to place them on the processor-memory bus to build persistent memory that blurs the boundary between memory and storage. A number of system and application software artifacts have been designed to leverage persistent memory in x86-based computing platforms [7-21]. ...
Conference Paper
Byte-addressable non-volatile memory (NVM) promises persistent memory, and ARM processors have incorporated architectural support to utilize NVM. In this paper, we consider tailoring the important B+-tree for NVM operated by a 64-bit ARMv8 processor. We first conduct an empirical study of the performance overheads in writing and reading data for a B+-tree with an ARMv8 processor, including the time cost of cache line flushes and memory fences for crash consistency, as well as the execution time of binary search compared to that of linear search. We hence identify the key weaknesses in the design of a B+-tree on the ARMv8 architecture. Accordingly, we develop a new B+-tree variant, namely, the crash-recoverable ARMv8-oriented B+-tree (Crab-tree). To insert and delete data at runtime, Crab-tree selectively chooses one of two strategies, i.e., copy-on-write or shifting in place, depending on which one incurs less consistency cost. Crab-tree regulates a strict execution order in both strategies and recovers the tree structure in case of crashes. We have evaluated Crab-tree on a Raspberry Pi 3 Model B+ with emulated NVM. Experiments show that Crab-tree significantly outperforms state-of-the-art B+-trees designed for persistent memory by up to 2.6x and 3.2x in write and read performance, respectively, with both consistency and scalability achieved.
... Software technologies that leverage non-volatile memory have also begun to be developed. One study proposed leveraging non-volatile memory by using it to store file system metadata [3,4]. Another study proposed using non-volatile memory as a write cache [5]. ...
... Within the long swapping period, the update operation is infrequent, which has negligible influence on NVM performance. How to ensure crash consistency is an important and challenging problem that has been discussed in [24,38,47]; it is beyond the scope of this paper, and we assume that there is a battery backup in the memory controller to refresh metadata during power failures, as in existing schemes [24,36]. ...
Preprint
In order to meet the needs of high performance computing (HPC) in terms of large memory, high throughput, and energy savings, non-volatile memory (NVM) has been widely studied due to its salient features of high density, near-zero standby power, byte-addressability, and non-volatility. In HPC systems, the multi-level cell (MLC) technique is used to significantly increase device density and decrease cost, which however leads to much weaker endurance than the single-level cell (SLC) counterpart. Although wear-leveling techniques can mitigate this weakness, their improvements on MLC-based NVM are very limited because writes are not uniformly distributed before some cells are truly worn out. To address this problem, our paper proposes a self-adaptive wear-leveling (SAWL) scheme for MLC-based NVM. The idea behind SAWL is to dynamically tune the wear-leveling granularity and balance writes across the cells of the entire memory, thus achieving a suitable tradeoff between lifetime and cache hit rate. Moreover, to reduce the size of the address-mapping table, SAWL maintains a few recently accessed mappings in a small on-chip cache. Experimental results demonstrate that SAWL significantly improves the NVM lifetime and the performance of HPC systems, compared with state-of-the-art schemes.
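SAWL's exact algorithm is not reproduced here; the following is a generic skeleton of the table-based wear-leveling such schemes build on, with a logical-to-physical remap table and periodic migration of hot lines onto the least-worn physical lines. The threshold and structures are illustrative assumptions.

```c
/* Generic table-based wear-leveling skeleton (not SAWL's algorithm):
 * writes go through a logical-to-physical remap table, and a hot
 * logical line is periodically swapped onto the least-worn physical
 * line. Actual data migration between lines is elided. */
#include <stdint.h>

#define LINES 1024

typedef struct {
    uint32_t l2p[LINES];          /* logical -> physical line */
    uint32_t wear[LINES];         /* per-physical-line write count */
} wl_state_t;

uint32_t wl_translate(wl_state_t *s, uint32_t logical) {
    return s->l2p[logical];
}

void wl_on_write(wl_state_t *s, uint32_t logical) {
    uint32_t phys = s->l2p[logical];
    if (++s->wear[phys] % 256 != 0)   /* rebalance every 256 writes */
        return;
    uint32_t coldest = 0;             /* find least-worn line */
    for (uint32_t i = 1; i < LINES; i++)
        if (s->wear[i] < s->wear[coldest]) coldest = i;
    /* swap mappings so the hot logical line lands on the cold line */
    for (uint32_t l = 0; l < LINES; l++)
        if (s->l2p[l] == coldest) { s->l2p[l] = phys; break; }
    s->l2p[logical] = coldest;
}
```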
... Existing implementations and research are designed around SSDs and HDDs, aiming to provide a device with flash-level latencies and a cost-competitive capacity [4,10,12,19,23,25,33]. The advent of non-volatile memory-based storage (NVM storage) [5,9,17,18,27,28], however, fundamentally changes the landscape of the storage stack [3,11,13,20,30,32]. NVM storage offers both persistency and near-DRAM latencies, making it an attractive target as the faster device for block I/O caching. ...
Conference Paper
This paper presents an empirical study on block I/O caches when combining the performance benefits of emerging NVM storages and the cost-effectiveness of secondary storages. Current state-of-the-art I/O caching solutions are designed around the performance characteristics of SSDs, and our study reveals that using them with NVM storages does not fully reap the benefits of I/O devices with near-DRAM latencies. With fast NVM storages, locating data must be handled efficiently, but the sophisticated yet complex data structures used in existing designs impose significant overheads by substantially increasing the hit time. As this design approach is suboptimal for accessing fast I/O devices, we suggest several architectural designs to exploit the performance of NVM storages.
Article
Existing nonvolatile memory (NVM)-based file systems can fully leverage the characteristics of NVM to obtain better performance than traditional disk-based file systems, and they have the potential to efficiently manage metadata and perform fast metadata operations. However, most NVM-based file systems mainly focus on managing file metadata (inodes) while paying little attention to directory metadata (dentries), which also has a noticeable impact on file system performance. Besides, the traditional journaling technique that guarantees metadata consistency may not yield satisfactory performance on NVM-based file systems. To solve these problems, in this article we propose a fast and low-overhead metadata operation mechanism called FLOMO. It first adopts a novel slotted-paging structure in NVM to reorganize dentries for efficient dentry operations, and it utilizes a red-black tree in DRAM to accelerate dentry lookup and the search step of dentry deletion. Moreover, FLOMO presents a selective journaling scheme for metadata updates, which partially logs the changes related to dentries in the proposed slotted page, thereby mitigating redundant journaling overhead. To verify FLOMO, we implement it in a typical NVM-based file system, the persistent memory file system (PMFS). Experimental results show that FLOMO accelerates the metadata operations in PMFS by 34.4%~59% and notably reduces the journaling overhead for metadata, shortening latency by 59% on average. For real-world applications, FLOMO has higher throughput than PMFS, PMFS without journaling, and NOVA, achieving up to 2.1x, 1.1x, and 1.3x performance improvement, respectively.
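The abstract names a slotted-paging structure; the classic slotted-page layout it alludes to can be sketched as below. This is a generic version, not FLOMO's on-NVM format: a slot array grows from the head of the page while variable-length dentries grow from the tail, so a dentry is found with one slot lookup.

```c
/* Generic slotted page: fixed slot array at the head, variable-length
 * directory entries packed downward from the tail. */
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096
#define MAX_SLOTS 64

typedef struct {
    uint16_t nslots;
    uint16_t free_end;                 /* data grows down from here */
    uint16_t slot_off[MAX_SLOTS];      /* offset of each dentry */
    uint8_t  data[PAGE_SIZE - 4 - 2 * MAX_SLOTS];
} slotted_page_t;

void page_init(slotted_page_t *p) {
    p->nslots = 0;
    p->free_end = sizeof(p->data);
}

/* Returns the new slot id, or -1 if the page is full. */
int page_insert(slotted_page_t *p, const void *dentry, uint16_t len) {
    if (p->nslots == MAX_SLOTS || p->free_end < len)
        return -1;
    p->free_end -= len;
    memcpy(p->data + p->free_end, dentry, len);
    p->slot_off[p->nslots] = p->free_end;
    return p->nslots++;
}

void *page_get(slotted_page_t *p, int slot) {
    return slot < p->nslots ? p->data + p->slot_off[slot] : 0;
}
```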
Article
Several B+-tree variants have been developed to exploit byte-addressable non-volatile memory (NVM). We attentively investigate the properties of B+-tree and find that a conventional B+-tree node is a linear structure in which key-value (KV) pairs are maintained from the zero offset of a node. These KV pairs are shifted in a unidirectional fashion for insertions and deletions. Inserting or deleting one KV pair may inflict a large amount of write amplification due to shifting existing KV pairs, which badly impairs the performance of an in-NVM B+-tree. In this paper, we propose a novel circular design for B+-tree. With regard to NVM's byte-addressability, our Circ-Tree embraces tree nodes in a circular structure without a fixed base address and bidirectionally shifts KV pairs for insertions and deletions to minimize write amplification. We have implemented a prototype of Circ-Tree and conducted extensive experiments. Experimental results show that Circ-Tree significantly outperforms two state-of-the-art in-NVM B+-tree variants, i.e., NV-tree and FAST+FAIR, by up to 1.6X and 8.6X, respectively, in terms of write performance. An end-to-end comparison, running YCSB against KV stores built on NV-tree, FAST+FAIR, and Circ-Tree, reveals that Circ-Tree yields up to 29.3% and 47.4% higher write performance than NV-tree and FAST+FAIR, respectively.
Article
In recent years, next-generation non-volatile memory (NVM) technologies have emerged with DRAM-like byte-addressability and disk-like durability. Computer architects have proposed using them to build persistent memory that blurs the conventional boundary between volatile memory and non-volatile storage. Meanwhile, ARM processors, which are widely used in embedded computing systems, have provided architectural support to utilize NVM since ARMv8. In this article, we consider tailoring the B+-tree for NVM operated by a 64-bit ARMv8 processor. We first conduct an empirical study of the performance overhead in writing and reading data for a B+-tree with an ARMv8 processor, including the time cost of cache line flushes and memory fences for crash consistency, as well as the execution time of binary search compared to that of linear search. We hence identify the key weaknesses in the design of a B+-tree on the ARMv8 architecture. Accordingly, we develop a new B+-tree variant, namely, the crash-recoverable ARMv8-oriented B+-tree (Crab-tree). To insert and delete data at runtime, Crab-tree selectively chooses one of two strategies, i.e., copy-on-write or shifting in place, depending on which one incurs less consistency cost. Crab-tree regulates a strict execution order in both strategies and recovers the tree structure in case of crashes. To further improve the performance of Crab-tree, we employ three methods to reduce software overhead, cache misses, and consistency cost, respectively. We have implemented and evaluated Crab-tree on a Raspberry Pi 3 Model B+ with emulated NVM. Experiments show that Crab-tree significantly outperforms state-of-the-art B+-trees designed for persistent memory by up to 2.2× and 3.7× in write and read performance, respectively, with both consistency and scalability achieved.
Article
Modern file systems rely on the journaling mechanism to maintain crash consistency. The use of non-volatile memory (NVM) significantly improves the performance of journaling file systems. However, the superior performance of NVM makes the journal fill up more often, thereby increasing the frequency of checkpointing. Together with the large amount of random checkpointing I/O found in most use cases, the checkpointing process becomes a new performance bottleneck. This paper proposes NV-Journaling, a strategy that reduces the frequency of checkpointing and reshapes the I/O pattern of checkpointing from random I/O toward sequential I/O. NV-Journaling introduces fine-grained commits along with a cache-friendly NVM journaling layout that exploits the idiosyncrasies of NVM technology. Under this scheme, only the modified portion of a block, rather than the entire block, is written into the NVM journal device; doing so significantly reduces checkpoint frequency and achieves better space utilization. NV-Journaling further reshapes the checkpointing I/O pattern with a locality-aware checkpointing process. Checkpointed blocks are classified into hot and cold blocks: NV-Journaling maintains a hot block list to absorb repeated updates and a cold bucket list to group blocks by their proximity on disk. When a checkpoint is required, cold buckets are selected so that blocks are flushed sequentially to the hard disk. We built a prototype of NV-Journaling by modifying the JBD2 layer in the Linux kernel and evaluated it under different workloads. Our experimental results show that NV-Journaling can improve performance by up to 4.3× compared to traditional journaling.
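A hedged sketch of the fine-grained commit idea, in the spirit of (but not copied from) NV-Journaling: diff the old and new images of a block and journal only the dirty extent rather than the whole 4KB block. The record format and helper name are assumptions.

```c
/* Journal only the modified portion of a block: compute the dirty
 * extent between the old and new images and emit a compact record. */
#include <stdint.h>
#include <string.h>

#define BLK 4096

typedef struct {
    uint64_t blk_no;
    uint16_t off, len;   /* dirty extent; delta bytes follow in journal */
} delta_rec_t;

/* Fills hdr, copies the dirty bytes into out, and returns the total
 * record size in bytes (0 means the block is unchanged). */
size_t make_delta(uint64_t blk_no, const uint8_t *oldb,
                  const uint8_t *newb, delta_rec_t *hdr, uint8_t *out) {
    int first = 0, last = BLK - 1;
    while (first < BLK && oldb[first] == newb[first]) first++;
    if (first == BLK) return 0;            /* block unchanged */
    while (oldb[last] == newb[last]) last--;
    hdr->blk_no = blk_no;
    hdr->off = (uint16_t)first;
    hdr->len = (uint16_t)(last - first + 1);
    memcpy(out, newb + first, hdr->len);   /* only the dirty bytes */
    return sizeof *hdr + hdr->len;
}
```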
Chapter
Journaling techniques are widely employed in modern file systems to guarantee crash consistency. However, journaling usually decreases system performance due to the frequent storage accesses it entails. Architects can utilize emerging non-volatile memory (NVM) as a persistent cache or journaling device to reduce the storage accesses of journaling file systems. Yet problems such as double writes, metadata write amplification, and heavy transaction-ordering overhead still exist in current solutions. Therefore, we propose Spindle, a write-optimized NVM cache that addresses these challenges. Spindle decouples data and metadata accesses by processing data in DRAM while pinning metadata in NVM. With a redesigned metadata log and state-switch mechanism, Spindle eliminates double writes and relieves metadata write amplification. Moreover, Spindle adopts a lightweight transaction scheme to guarantee crash consistency and reduce transaction-ordering overhead. Experimental results reveal that Spindle achieves up to 47% throughput improvement compared with a state-of-the-art design.
Article
Modern file systems employ journaling techniques to guarantee data consistency in case of unexpected system crashes or power failures. However, journaling file systems usually suffer a performance decrease due to the extra journal writes. The emerging non-volatile memory technologies (NVMs) have the potential to reduce this journaling overhead when deployed as journaling storage devices, but traditional journaling techniques, which are designed for hard disks, fail to perform efficiently on NVM. To address this problem, we propose an NVM-based journaling scheme called NJS. The basic idea behind NJS is to reduce the journaling overhead of traditional file systems while fully exploiting the byte-accessibility of NVM and alleviating its slow writes and endurance limitation. NJS makes four major contributions: (1) To decrease the amount of journal writes, NJS writes only file system metadata and over-write data to NVM as write-ahead logging, thus alleviating the slow writes and endurance limitation of NVM. (2) NJS adopts a wear-aware strategy for NVM journal block allocation in which blocks are evenly worn out, further extending the lifetime of NVM. (3) We propose a novel journal update scheme in which journal data blocks can be updated at byte granularity based on the difference between the old and new versions of a journal block, fully exploiting the unique byte-accessibility of NVM. (4) NJS includes a garbage collection mechanism that absorbs redundant journal updates and actively delays checkpointing to the file system. Evaluation results show the efficiency and efficacy of NJS: compared with Ext4 using a ramdisk-based journaling device, the throughput of Ext4 with NJS improves by up to 131.4%.
Article
Buffer caching is an effective approach to improve the system performance and extend the lifetime of SSDs. However, the frequent synchronization operations in most real-world applications limit such advantages. This paper proposes to adopt emerging non-volatile main memories (NVMMs) to relieve the above problems while achieving both efficient and consistent cache management. To this end, an adaptive fine-grained cache (AFCM) scheme is proposed, which is motivated by our observation that the file data in many synchronized pages is partially updated for a wide range of workloads, implying that fine-grained cache management can save the NVMM cache space wasted by the clean parts. To reduce the cache index overhead introduced by fine-grained cache management, AFCM employs a Hybrid Cache based on DRAM and NVMM, with which the normal read and write operations are served without performance penalty. We also propose the Transactional Copy-on-Write mechanism to guarantee the crash consistency of both NVMM cache space and file system image. Our experimental results show that AFCM provides up to 84% performance improvement and 63% SSD write reduction on average compared to the conventional coarse-grained cache management scheme.
Article
Full-text available
File system performance is dominated by small and frequent metadata accesses. Metadata is stored as blocks on the hard disk drive, and a partial metadata update results in a whole-block read or write, which significantly amplifies disk I/O. Furthermore, the huge performance gap between the CPU and disk aggravates this problem. In this article, a file system metadata accelerator (referred to as FSMAC) is proposed to optimize metadata access by efficiently exploiting the persistency and byte-addressability of non-volatile memory (NVM). FSMAC decouples the data and metadata access paths, putting data on disk and metadata in byte-addressable NVM at runtime. Thus, data is accessed in blocks from the I/O bus while metadata is accessed byte-addressably from the memory bus. Metadata access is significantly accelerated and metadata I/O is eliminated, because metadata in NVM is no longer periodically flushed back to disk. A lightweight consistency mechanism combining fine-grained versioning and transactions is introduced in FSMAC. FSMAC is implemented on a real NVDIMM platform and intensively evaluated under different workloads. Evaluation results show that FSMAC accelerates the file system by up to 49.2 times for synchronous I/O and 7.22 times for asynchronous I/O. Moreover, it achieves significant performance speedup in network storage and database environments, especially for metadata-intensive or write-dominated workloads.
Conference Paper
Full-text available
The advent of byte-addressable, non-volatile memory (NVM) has initiated the design of new data management strategies that utilize it as persistent memory (PM). One way to manage the PM is via an in-memory file system. The consistency of the in-memory file system may nevertheless be compromised by directly exposing the PM to the CPU, because data is likely to be flushed from the CPU cache to the PM in an order different from the order in which it was programmed. As a result, in spite of classic consistency mechanisms such as journaling and copy-on-write, file systems for the PM have to seek the support of cache line flush and memory fence instructions, e.g., clflush and sfence, to achieve ordered writes. On the other hand, managing the PM as a consistent block device with conventional file systems is also doable. The pros and cons of the two approaches, however, have not been thoroughly investigated yet. We hence do so with extensive evaluations and detailed analyses. Our aim in this paper is to inspire discussion of how the PM shall be managed, especially from the performance perspective.
Conference Paper
Full-text available
With the rapid development of new types of non-volatile memory (NVM), one of these technologies may replace DRAM as the main memory in the near future. Some drawbacks of DRAM, such as data loss due to power failure or a system crash, can be remedied by NVM's non-volatile nature. In the meantime, solid state drives (SSDs) are becoming widely deployed as storage devices for their faster random access compared with traditional hard disk drives (HDDs). For applications demanding higher reliability and better performance, using NVM as the main memory and SSDs as storage devices becomes a promising architecture. Although SSDs have better performance than HDDs, they cannot support in-place updates (an erase operation has to be performed before a page can be updated) and suffer from a low-endurance problem in which each unit wears out after a certain number of erase operations. In an NVM-based main memory, updated (dirty) pages can be kept longer without the urgent need to be flushed to SSDs. This difference opens an opportunity to design new cache policies that help extend the lifespan of SSDs by wisely choosing cache eviction victims to decrease storage write traffic. However, it is very challenging to design a policy that also increases the cache hit ratio for better system performance. Most existing DRAM-based cache policies have concentrated mainly on the recency or frequency status of a page, while most existing NVM-based cache policies have focused mainly on the dirty or clean status of a page. In this paper, by extending the concept of the Adaptive Replacement Cache (ARC), we propose a Hierarchical Adaptive Replacement Cache (H-ARC) policy that considers all four factors of a page's status: dirty, clean, recency, and frequency. Specifically, at the higher level, H-ARC adaptively splits the whole cache space into a dirty-page cache and a clean-page cache. At the lower level, inside the dirty-page cache and the clean-page cache, H-ARC splits each into a recency-page cache and a frequency-page cache. During page eviction, all parts of the cache are balanced towards their desired sizes.
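H-ARC's full eviction logic is more involved than can be shown here; the sketch below only captures the top-level flavor of such an adaptive split, with a target size for the dirty partition that ghost-list hits nudge up or down. It is an illustration, not the paper's algorithm.

```c
/* Top-level flavor of an adaptively split dirty/clean cache: ghost
 * hits adjust the dirty-side target, and eviction picks whichever
 * side exceeds its target. */
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    size_t capacity;       /* total pages */
    size_t dirty, clean;   /* current partition sizes */
    size_t dirty_target;   /* adaptive split point */
} split_cache_t;

/* A hit in a ghost list says "that side was evicted too eagerly". */
void on_ghost_hit(split_cache_t *c, bool ghost_was_dirty) {
    if (ghost_was_dirty && c->dirty_target < c->capacity)
        c->dirty_target++;
    else if (!ghost_was_dirty && c->dirty_target > 0)
        c->dirty_target--;
}

/* true = destage a dirty page, false = drop a clean page. */
bool pick_dirty_victim(const split_cache_t *c) {
    return c->dirty > c->dirty_target;
}
```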
Conference Paper
Full-text available
Modern file systems use ordering points to maintain consistency in the face of system crashes. However, such ordering leads to lower performance, higher complexity, and a strong and perhaps naive dependence on lower layers to correctly enforce the ordering of writes. In this paper, we introduce the No-Order File System (NoFS), a simple, lightweight file system that employs a novel technique called backpointer-based consistency to provide crash consistency without ordering writes as they go to disk. We utilize a formal model to prove that NoFS provides data consistency in the event of system crashes; we show through experiments that NoFS is robust to such crashes, and delivers excellent performance across a range of workloads. Backpointer-based consistency thus allows NoFS to provide crash consistency without resorting to the heavyweight machinery of traditional approaches.
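As a rough illustration of backpointer-based consistency (simplified from what NoFS actually stores on disk), each block can embed the identity of its owner, and a read is trusted only when the inode's forward pointer and the block's backpointer agree. The structures below are hypothetical.

```c
/* Backpointer check in the spirit of NoFS: a block carries its owner's
 * (inode, offset), so consistency is verifiable without ordered writes.
 * A mismatch means the crash fell between the two writes, and the block
 * is treated as unallocated rather than corrupt. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t owner_ino;
    uint64_t file_off;
    uint8_t  payload[4096 - 16];
} bp_block_t;

typedef struct {
    uint64_t ino;
    uint64_t block_addr[16];      /* forward pointers */
} bp_inode_t;

bool block_consistent(const bp_inode_t *ino, uint64_t off_idx,
                      const bp_block_t *blk) {
    return blk->owner_ino == ino->ino && blk->file_off == off_idx;
}
```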
Conference Paper
Full-text available
We introduce optimistic crash consistency, a new approach to crash consistency in journaling file systems. Using an array of novel techniques, we demonstrate how to build an optimistic commit protocol that correctly recovers from crashes and delivers high performance. We implement this optimistic approach within a Linux ext4 variant which we call OptFS. We introduce two new file-system primitives, osync() and dsync(), that decouple ordering of writes from their durability. We show through experiments that OptFS improves performance for many workloads, sometimes by an order of magnitude; we confirm its correctness through a series of robustness tests, showing it recovers to a consistent state after crashes. Finally, we show that osync() and dsync() are useful in atomic file system and database update scenarios, both improving performance and meeting application-level consistency demands.
Conference Paper
Full-text available
Future exascale systems face extreme power challenges. To improve the power efficiency of future HPC systems, non-volatile memory (NVRAM) technologies are being investigated as potential alternatives to existing memory technologies. NVRAMs use extremely low power when in standby mode and have other performance and scaling benefits. Although previous work has explored the integration of NVRAM at various architecture and system levels, an open question remains: do the specific memory workload characteristics of scientific applications map well onto NVRAMs' features when used in a hybrid NVRAM-DRAM memory system? Furthermore, are there common classes of data structures used by scientific applications that should frequently be placed in NVRAM? In this paper, we analyze several mission-critical scientific applications to answer these questions. Specifically, we develop a binary instrumentation tool to statistically report memory access patterns in stack, heap, and global data. We carry out hardware simulation to study the impact of NVRAM on both memory power and system performance. Our study identifies many opportunities for using NVRAM in scientific applications. In two of our applications, 31% and 27% of the memory working sets are suitable for NVRAM. Our simulations suggest at least 27% possible power savings and reveal that the performance of some applications is insensitive to relatively long NVRAM write-access latencies.
Conference Paper
Full-text available
Drawing parallels to the rise of general-purpose graphics processing units (GPGPUs) as accelerators for specific high-performance computing (HPC) workloads, there is a rise in the use of non-volatile memory (NVM) as an accelerator for I/O-intensive scientific applications. However, existing work has explored the use of NVM within dedicated I/O nodes, which are distant from the compute nodes that actually need such acceleration. As NVM bandwidth begins to outpace point-to-point network capacity, we argue for the need to break from the archetype of completely separated storage. Therefore, in this work we investigate co-location of NVM and compute by varying I/O interfaces, file systems, types of NVM, and both current and future SSD architectures, uncovering numerous bottlenecks implicit at these various levels in the I/O stack. We present novel hardware and software solutions, including the new Unified File System (UFS), to enable fuller utilization of the new compute-local NVM storage. Our experimental evaluation, which employs a real-world Out-of-Core (OoC) HPC application, demonstrates throughput increases in excess of an order of magnitude over current approaches.
Conference Paper
Full-text available
Large caches in storage servers have become essential for meeting the service levels required by applications. These caches often need to be warmed with data today due to various scenarios, including the dynamic creation of cache space and server restarts that clear cache contents. When large storage caches are warmed at the rate of application I/O, warmup can take hours or even days, affecting both application performance and server load over a long period of time. We have created Bonfire, a mechanism for accelerating cache warmup. Bonfire monitors storage server workloads, logs important warmup data, and efficiently preloads storage-level caches with warmup data. Bonfire is based on our detailed analysis of block-level data-center traces, which provides insights into heuristics for warmup as well as the potential for efficient mechanisms. We show through both simulation and trace replay that Bonfire reduces both warmup time and backend server load significantly, compared to a cache that is warmed up on demand.
Article
Full-text available
Spin-transfer torque random access memory (STT-RAM) has received increasing attention because of its attractive features: good scalability, zero standby power, non-volatility, and radiation hardness. The use of STT-RAM technology in last-level on-chip caches has been proposed, as it minimizes cache leakage power as technology scales down. Furthermore, the cell area of STT-RAM is only 1/9 to 1/3 that of SRAM. This allows for a much larger cache with the same die footprint, improving overall system performance by reducing cache misses. However, deploying STT-RAM technology in L1 caches is challenging because of the long and power-consuming write operations. In this paper, we propose both L1 and lower-level cache designs that use STT-RAM. In particular, our designs use STT-RAM cells with various data retention times and write performances, made possible by different magnetic tunneling junction (MTJ) designs. For fast STT-RAM bits with reduced data retention time, a counter-controlled dynamic refresh scheme is proposed to maintain data validity. Our dynamic scheme saves more than 80% of refresh energy compared to the simple refresh scheme proposed in previous work. An L1 cache built with ultra-low-retention STT-RAM coupled with our proposed dynamic refresh scheme achieves a 9.2% performance improvement and saves up to 30% of the total energy compared to one that uses traditional SRAM. For lower-level caches with relatively large capacity, we propose a data migration scheme that moves data between portions of the cache with different retention characteristics so as to maximize the performance and power benefits. Our experiments show that, on average, our proposed multi-retention-level STT-RAM cache reduces total energy by 30~70% compared to previous work, while improving IPC performance for both 2-level and 3-level cache hierarchies.
Article
Full-text available
Ext3 has been the most widely used general Linux filesystem for many years. In keeping with increasing disk capacities and state-of-the-art feature requirements, the next generation of the ext3 filesystem, ext4, was created last year. This new filesystem incorporates scalability and performance enhancements for supporting large filesystems, while maintaining reliability and stability. Ext4 will be suitable for a larger variety of workloads and is expected to replace ext3 as the "Linux filesystem." In this paper we first discuss the reasons for starting the ext4 filesystem, then explore the enhanced capabilities currently available and planned for ext4, discuss methods for migrating between ext3 and ext4, and finally compare ext4 and other filesystem performance on three classic filesystem benchmarks.
Article
Full-text available
The availability of high-speed solid-state storage has introduced a new tier into the storage hierarchy. Low-latency and high-IOPS solid-state drives (SSDs) cache data in front of high-capacity disks. However, most existing SSDs are designed to be a drop-in disk replacement, and hence are mismatched for use as a cache. This paper describes FlashTier, a system architecture built upon solid-state cache (SSC), a flash device with an interface designed for caching. Management software at the operating system block layer directs caching. The FlashTier design addresses three limitations of using traditional SSDs for caching. First, FlashTier provides a unified logical address space to reduce the cost of cache block management within both the OS and the SSD. Second, FlashTier provides cache consistency guarantees allowing the cached data to be used following a crash. Finally, FlashTier leverages cache behavior to silently evict data blocks during garbage collection to improve performance of the SSC. We have implemented an SSC simulator and a cache manager in Linux. In trace-based experiments, we show that FlashTier reduces address translation space by 60% and silent eviction improves performance by up to 167%. Furthermore, FlashTier can recover from the crash of a 100GB cache in only 2.4 seconds.
Conference Paper
Full-text available
Being one of the few mechanical components in a typical computer system, hard drives consume a significant amount of the overall power used by a computer. Spinning down a hard drive reduces its power consumption, but only works when no disk accesses occur, limiting overall effectiveness. We have designed and implemented a technique to extend disk spin-down times using a small non-volatile storage cache called NVCache, which contains a combination of caching techniques to service reads and writes while the hard disk is in low-power mode. We show that by combining NVCache with an adaptive disk spin-down algorithm, a hard disk's power consumption can be reduced by up to 90%.
Conference Paper
Full-text available
New storage-class memory (SCM) technologies, such as phase-change memory, STT-RAM, and memristors, promise user-level access to non-volatile storage through regular memory instructions. These memory devices enable fast user-mode access to persistence, allowing regular in-memory data structures to survive system crashes. In this paper, we present Mnemosyne, a simple interface for programming with persistent memory. Mnemosyne addresses two challenges: how to create and manage such memory, and how to ensure consistency in the presence of failures. Without additional mechanisms, a system failure may leave data structures in SCM in an invalid state, crashing the program the next time it starts. In Mnemosyne, programmers declare global persistent data with the keyword "pstatic" or allocate it dynamically. Mnemosyne provides primitives for directly modifying persistent variables and supports consistent updates through a lightweight transaction mechanism. Compared to past work on disk-based persistent memory, Mnemosyne reduces latency to storage by writing data directly to memory at the granularity of an update, rather than writing memory pages back to disk through the file system. In tests emulating the performance characteristics of forthcoming SCMs, we show that Mnemosyne can persist data in as little as 3 microseconds. Furthermore, it provides a 35 percent performance increase when applied to the OpenLDAP directory server. In microbenchmark studies we find that Mnemosyne can be up to 1400% faster than alternative persistence strategies, such as Berkeley DB or Boost serialization, that are designed for disks.
Conference Paper
Full-text available
Modern computer systems have been built around the assumption that persistent storage is accessed via a slow, block-based interface. However, new byte-addressable, persistent memory technologies such as phase change memory (PCM) offer fast, fine-grained access to persistent storage. In this paper, we present a file system and a hardware architecture that are designed around the properties of persistent, byte-addressable memory. Our file system, BPFS, uses a new technique called short-circuit shadow paging to provide atomic, fine-grained updates to persistent storage. As a result, BPFS provides strong reliability guarantees and offers better performance than traditional file systems, even when both are run on top of byte-addressable, persistent memory. Our hardware architecture enforces the atomicity and ordering guarantees required by BPFS while still providing the performance benefits of the L1 and L2 caches. Since these memory technologies are not yet widely available, we evaluate BPFS on DRAM against NTFS on both a RAM disk and a traditional disk. Then, we use microarchitectural simulations to estimate the performance of BPFS on PCM. Despite providing strong safety and consistency guarantees, BPFS on DRAM is typically twice as fast as NTFS on a RAM disk and 4-10 times faster than NTFS on disk. We also show that BPFS on PCM should be significantly faster than a traditional disk-based file system.
Conference Paper
Full-text available
Emerging storage technologies such as flash memories, phase-change memories, and spin-transfer torque memories are poised to close the enormous performance gap between disk-based storage and main memory. We evaluate several approaches to integrating these memories into computer systems by measuring their impact on IO-intensive, database, and memory-intensive applications. We explore several options for connecting solid-state storage to the host system and find that the memories deliver large gains in sequential and random access performance, but that different system organizations lead to different performance trade-offs. The memories provide substantial application-level gains as well, but overheads in the OS, file system, and application can limit performance. As a result, fully exploiting these memories' potential will require substantial changes to application and system software. Finally, paging to fast non-volatile memories is a viable option for some applications, providing an alternative to expensive, powerhungry DRAM for supporting scientific applications with large memory footprints.
Conference Paper
Full-text available
The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system.
Article
Emerging non-volatile memory (NVM) technologies have gained a lot of attention recently. The byte-addressability and high density of NVM enable computer architects to build large-scale main memory systems. NVM has also been shown to be a promising alternative to conventional persistent store. With NVM, programmers can persistently retain in-memory data structures without writing them to disk. Therefore, one can envision that in the future, NVM will play the role of both working memory and persistent store at the same time. Persistent store demands consistency and durability guarantees, thereby imposing new design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write operations. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. Therefore, a unified architecture oblivious to these two use cases would lead to suboptimal design. In this paper, we propose a novel unified working memory and persistent store architecture, NVM Duet, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints if accesses to NVM are for working memory. A cross-layer design approach is adopted to achieve the design goal. Overall, simulation results demonstrate that NVM Duet achieves up to 1.68x (1.32x on average) speedup compared with the baseline design.
Conference Paper
Journaling file systems have been widely used where data consistency must be assured. However, we observed that the overhead of journaling can cause up to a 48.2% performance drop under certain kinds of workloads. On the other hand, the emerging high-performance, byte-addressable non-volatile memory (NVM) has the potential to minimize such overhead when used as the journal device. The traditional journaling mechanism based on block devices is nevertheless unsuitable for NVM, due to the metadata journal write amplification we observed. In this paper, we propose a fine-grained metadata journal mechanism to fully utilize low-latency, byte-addressable NVM so that the overhead of journaling can be significantly reduced. Based on the observation that a conventional block-based metadata journal contains up to 90% clean metadata that does not need to be journaled, we design a fine-grained journal format for byte-addressable NVM that contains only modified metadata. Moreover, we redesign the processes of transaction committing, checkpointing, and recovery in journaling file systems to utilize the new journal format. Thanks to the reduced amount of ordered writes to NVM, the overhead of journaling can be reduced without compromising file system consistency. Experimental results show that our NVM-based fine-grained metadata journaling is up to 15.8x faster than the traditional approach under FileBench workloads.
Conference Paper
SSD-based cache solutions are widely utilized to improve performance in network storage systems. With the goal of providing a cost-effective, high-performing SSD cache solution, we propose a new caching solution called SRC (SSD RAID as a Cache) for an array of commodity SSDs. In designing SRC, we borrow both the well-known RAID technique and the log-structured approach and adopt them at the cache layer. In so doing, we explore a wide variety of design choices, such as flush issue frequency, write units, forming stripes without parity, and garbage collection through copying rather than destaging, that become possible when using RAID and a log-structured approach at the cache level. Using an implementation in Linux under the Device Mapper framework, we quantitatively present and analyze the design space options that we considered. Our experiments using realistic workload traces show that SRC performs at least 2 times better in terms of throughput than existing open source solutions. We also consider the cost-effectiveness of SRC with a variety of SSD products. In particular, we compare SRC configured with MLC and TLC SATA SSDs against a single high-end NVMe SSD. We find that SRC configured as RAID-5 with low-cost MLC and TLC SATA SSDs generally outperforms the single high-end SSD configuration in terms of both performance and lifetime per dollar spent.
Conference Paper
In data centers, caches work both to provide low IO latencies and to reduce the load on the back-end network and storage. But they are not designed for multi-tenancy; system-level caches today cannot be configured to match tenant or provider objectives. Exacerbating the problem is the increasing number of un-coordinated caches on the IO data plane. The lack of global visibility on the control plane to coordinate this distributed set of caches leads to inefficiencies, increasing cloud provider cost. We present Moirai, a tenant- and workload-aware system that allows data center providers to control their distributed caching infrastructure. Moirai can help ease the management of the cache infrastructure and achieve various objectives, such as improving overall resource utilization or providing tenant isolation and QoS guarantees, as we show through several use cases. A key benefit of Moirai is that it is transparent to applications or VMs deployed in data centers. Our prototype runs unmodified OSes and databases, providing immediate benefit to existing applications.
Article
Today's databases and key-value stores commonly keep all their data in main memory. A single server can have over 100 GB of memory, and a cluster of such servers can have 10s to 100s of TB. However, a storage back end is still required for recovery from failures. Recovery can last for minutes for a single server or hours for a whole cluster, causing heavy load on the back end. Nonvolatile main memory (NVRAM) technologies can help by allowing near-instantaneous recovery of in-memory state. However, today's software does not support this well. Block-based approaches such as persistent buffer caches suffer from data duplication and block transfer overheads. Recently, user-level persistent heaps have been shown to have much better performance than these. However they require substantial application modification and still have significant runtime overheads. This paper proposes whole-system persistence (WSP) as an alternative. WSP is aimed at systems where all memory is nonvolatile. It transparently recovers an application's entire state, making a failure appear as a suspend/resume event. Runtime overheads are eliminated by using "flush on fail": transient state in processor registers and caches is flushed to NVRAM only on failure, using the residual energy from the system power supply. Our evaluation shows that this approach has 1.6-13 times better runtime performance than a persistent heap, and that flush-on-fail can complete safely within 2-35% of the residual energy window provided by standard power supplies.
Article
Byte-addressable non-volatile memory may usher in a new era of computing where in-memory data structures are persistent and can be reused directly across machine restarts. In this context, we study the implications of different CPU caching modes and show how they affect both programmability and performance of a program.
Conference Paper
Transaction is a common technique to ensure system consistency but incurs high overhead. Recent flash memory techniques enable efficient embedded transaction support inside solid state drives (SSDs). In this paper, we propose a new embedded transaction mechanism, TxCache, for SSDs with non-volatile disk cache. TxCache revises cache management of disk cache to support transactions using two techniques. First, it persists new-version data in non-volatile disk cache in a shadow way while protecting old-version data from being overwritten. Second, it uses pointers and flags leveraging the byte-addressability to cluster pages of each transaction and manage transaction status. The non-volatility and byte-addressability properties make TxCache an efficient transaction design. Experiments using file system and database workloads show performance improvement up to 46.0% and lifetime extension up to 33.8% compared to a recent transactional SSD design.
Conference Paper
Non-volatile memory (NVM) has DRAM-like performance and disk-like persistency, which make it possible to replace both disk and DRAM to build single-level systems. Keeping data consistent in such systems is non-trivial because memory writes may be reordered by the CPU and the memory controller. In this paper, we study the consistency cost for an important and common data structure, the B+Tree. Although memory fence and CPU cacheline flush instructions can order memory writes to achieve data consistency, they introduce significant overhead (more than 10X slower in performance). Based on our quantitative analysis of consistency cost, we propose NV-Tree, a consistent and cache-optimized B+Tree variant with reduced CPU cacheline flushes. We implement and evaluate NV-Tree and NV-Store, a key-value store based on NV-Tree, on an NVDIMM server. NV-Tree outperforms state-of-the-art consistent tree structures by up to 12X under write-intensive workloads. NV-Store increases throughput by up to 4.8X under YCSB workloads compared to Redis.
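The consistency cost quantified here comes from the standard x86 persist primitive: flush each cache line of a modified range, then fence. A minimal version is sketched below; NV-Tree's speedup comes largely from invoking this primitive less often, for example by leaving leaf entries unsorted so an insert dirties fewer cache lines.

    #include <emmintrin.h>   /* _mm_clflush, _mm_sfence */
    #include <stdint.h>
    #include <stddef.h>

    #define CACHELINE 64

    /* Persist [addr, addr+len): flush every covering cache line, then
     * fence so subsequent stores are ordered after the data is durable. */
    static void persist(const void *addr, size_t len)
    {
        uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
        for (; p < (uintptr_t)addr + len; p += CACHELINE)
            _mm_clflush((const void *)p);
        _mm_sfence();
    }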
Article
Storage systems based on Phase Change Memory (PCM) devices are beginning to generate considerable attention in both industry and academic communities. But whether the technology in its current state will be a commercially and technically viable alternative to entrenched technologies such as flash-based SSDs remains undecided. To address this, it is important to consider PCM SSD devices not just from a device standpoint, but also from a holistic perspective. This article presents the results of our performance study of a recent all-PCM SSD prototype. The average latency for a 4KiB random read is 6.7μs, which is about 16× faster than a comparable eMLC flash SSD. The distribution of I/O response times is also much narrower than flash SSD for both reads and writes. Based on the performance measurements and real-world workload traces, we explore two typical storage use cases: tiering and caching. We report that the IOPS/$ of a tiered storage system can be improved by 12-66% and the aggregate elapsed time of a server-side caching solution can be improved by up to 35% by adding PCM. Our results show that (even at current price points) PCM storage devices show promising performance as a new component in enterprise storage systems.
Article
Next-generation non-volatile memories (NVMs) promise DRAM-like performance, persistence, and high density. They can attach directly to processors to form non-volatile main memory (NVMM) and offer the opportunity to build very low-latency storage systems. These high-performance storage systems would be especially useful in large-scale data center environments where reliability and availability are critical. However, providing reliability and availability to NVMM is challenging, since the latency of data replication can overwhelm the low latency that NVMM should provide. We propose Mojim, a system that provides the reliability and availability that large-scale storage systems require, while preserving the performance of NVMM. Mojim achieves these goals by using a two-tier architecture in which the primary tier contains a mirrored pair of nodes and the secondary tier contains one or more secondary backup nodes with weakly consistent copies of data. Mojim uses highly-optimized replication protocols, software, and networking stacks to minimize replication costs and expose as much of NVMM's performance as possible. We evaluate Mojim using raw DRAM as a proxy for NVMM and using an industrial NVMM emulation system. We find that Mojim provides replicated NVMM with similar or even better performance than un-replicated NVMM (reducing latency by 27% to 63% and delivering between 0.4× and 2.7× the throughput). We demonstrate that replacing MongoDB's built-in replication system with Mojim improves MongoDB's performance by 3.4 to 4×.
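A hedged sketch of the two-tier write path described above; every helper here (pm_write_and_persist, mirror_send_sync, backup_send_async) is a hypothetical stand-in for Mojim's optimized replication and networking stacks.

    #include <stddef.h>

    struct region { void *base; };   /* replicated NVMM region */

    extern void pm_write_and_persist(struct region *r, size_t off,
                                     const void *buf, size_t len);
    extern int  mirror_send_sync(size_t off, const void *buf, size_t len);
    extern void backup_send_async(size_t off, const void *buf, size_t len);

    int replicated_write(struct region *r, size_t off,
                         const void *buf, size_t len)
    {
        pm_write_and_persist(r, off, buf, len);   /* durable on primary */
        if (mirror_send_sync(off, buf, len) != 0) /* strongly consistent tier */
            return -1;
        backup_send_async(off, buf, len);         /* weakly consistent tier */
        return 0;                                 /* ack the application */
    }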
Conference Paper
Flash memory has accelerated the architectural evolution of storage systems with its unique characteristics compared to magnetic disks. The no-overwrite property of flash memory has been leveraged to efficiently support transactions, a commonly used mechanism in systems to provide consistency. However, existing transaction designs embedded in flash-based Solid State Drives (SSDs) have limited support for transaction flexibility, i.e., support for different isolation levels between transactions, which is essential to enable different systems to make tradeoffs between performance and consistency. Since they provide support for only strict isolation between transactions, existing designs lead to a reduced number of on-the-fly requests and therefore cannot exploit the abundant internal parallelism of an SSD. There are two design challenges that need to be overcome to support flexible transactions: (1) enabling a transaction commit protocol that supports parallel execution of transactions; and (2) efficiently tracking the state of transactions that have pages scattered over different locations due to parallel allocation of pages. In this paper, we propose LightTx to address these two challenges. LightTx supports transaction flexibility using a lightweight embedded transaction design. The design of LightTx is based on two key techniques. First, LightTx uses a commit protocol that determines the transaction state solely inside each transaction (as opposed to having dependencies between transactions that complicate state tracking) in order to support parallel transaction execution. Second, LightTx periodically retires the dead transactions to reduce transaction state tracking cost. Experiments show that LightTx provides up to 20.6% performance improvement due to transaction flexibility. LightTx also achieves nearly the lowest overhead in garbage collection and mapping persistence compared to existing embedded transaction designs.
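The commit protocol can be illustrated as follows: if each flash page carries its transaction ID and sequence number in the out-of-band (OOB) area, and the final page also carries the total page count, recovery can classify a transaction without consulting any other transaction's state. The structures below are a hypothetical sketch, not LightTx's exact layout.

    #include <stdint.h>

    /* Per-page out-of-band (OOB) record, written with the page itself. */
    struct oob_rec {
        uint32_t txid;    /* which transaction this page belongs to */
        uint16_t seq;     /* page's position within the transaction */
        uint16_t total;   /* nonzero only on the final page: page count */
    };

    /* Recovery: a transaction is committed iff some page declares a total
     * and exactly that many of its pages were found on flash. */
    int tx_committed(const struct oob_rec *recs, int nfound, uint32_t txid)
    {
        int seen = 0;
        uint16_t total = 0;
        for (int i = 0; i < nfound; i++) {
            if (recs[i].txid != txid) continue;
            seen++;
            if (recs[i].total) total = recs[i].total;
        }
        return total != 0 && seen == total;
    }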
Conference Paper
Persistent Memory (PM) technologies, such as Phase Change Memory, STT-RAM, and memristors, are receiving increasingly high interest in academia and industry. PM provides many attractive features, such as DRAM-like speed and storage-like persistence. Yet, because it draws a blurry line between memory and storage, neither a memory-based nor a storage-based model is a natural fit. How best to integrate PM into existing systems has become a challenging question and a top priority for many. In this paper we share our initial approach to integrating PM into computer systems, with minimal impact to the core operating system. By adopting a hybrid storage model, all of our changes are confined to a block storage driver, called PMBD, which directly accesses PM attached to the memory bus and exposes a logical block I/O interface to users. We explore the design space by examining a variety of options to achieve performance, protection from stray writes, ordered persistence, and compatibility for legacy file systems and applications. All told, we find that by using a combination of existing OS mechanisms (per-core page table mappings, non-temporal store instructions, memory fences, and I/O barriers), we are able to achieve each of these goals with small performance overhead for both micro-benchmarks and real world applications (e.g., file server and database workloads). Our experience suggests that determining the right combination of existing platform and OS mechanisms is a non-trivial exercise. In this paper, we share both our failed and successful attempts. The final solution that we propose represents an evolution of our initial approach. We have also open-sourced our software prototype with all attempted design options to encourage further research in this area.
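Of the OS mechanisms listed, non-temporal stores plus a store fence are the core of ordered persistence. A minimal sketch (alignment handling simplified) follows.

    #include <emmintrin.h>   /* _mm_stream_si64 (x86-64), _mm_sfence */
    #include <stddef.h>

    /* Copy len bytes (8-byte aligned, multiple of 8) into PM with
     * non-temporal stores that bypass the CPU cache, then fence so the
     * block request is not completed before the data leaves the core. */
    static void pm_copy_nt(void *dst, const void *src, size_t len)
    {
        long long *d = dst;
        const long long *s = src;
        for (size_t i = 0; i < len / 8; i++)
            _mm_stream_si64(&d[i], s[i]);   /* cache-bypassing store */
        _mm_sfence();                       /* order before I/O completion */
    }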
Conference Paper
Providing transactional primitives in NAND flash based solid state drives (SSDs) has demonstrated great potential for high-performance transaction processing and for relieving software complexity. As with software solutions like write-ahead logging (WAL) and shadow paging, a transactional SSD incurs two kinds of overhead: 1) write overhead during normal operation, and 2) recovery overhead after power failures. Prior transactional SSD designs utilize the out-of-band (OOB) area in flash pages to store transaction information and reduce the first kind of overhead. However, they must scan a large part of, or even the whole, SSD after power failures to abort unfinished transactions. Another limitation of prior approaches is that each provides only a single type of transactional primitive. In this paper, we propose a new transactional SSD design named Möbius. Möbius provides different types of transactional primitives to support static and dynamic transactions separately. The Möbius flash translation layer (mFTL) combines a normal FTL with transaction processing by storing mapping and transaction information together in a physical flash page, called an atom inode. By amortizing the cost of transaction processing with FTL persistence, mFTL achieves high performance in normal operation without increasing the write amplification ratio. After power failures, Möbius can leverage atom inodes to eliminate unnecessary scanning and recover quickly. We implemented a prototype of Möbius and compared it with other state-of-the-art transactional SSD designs. Experimental results show that Möbius outperforms prior hardware approaches by up to 67% in transaction throughput (TPS) and by up to 29 times in recovery time, while maintaining a similar or even better write amplification ratio.
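As a rough illustration of the atom inode idea, one flash page might carry both mapping deltas and the covering transaction metadata; the field layout below is an assumption, not the paper's format.

    #include <stdint.h>

    #define MAP_ENTRIES_PER_PAGE 480   /* illustrative capacity */

    /* One flash page persisting FTL mapping deltas together with the
     * transaction metadata that covers them ("atom inode"). Recovery
     * reads these pages instead of scanning the whole SSD. */
    struct atom_inode {
        uint64_t txid;            /* transaction these mappings belong to */
        uint32_t nentries;
        uint32_t committed;       /* set in the final atom inode of the tx */
        struct {
            uint32_t lpn;         /* logical page number */
            uint32_t ppn;         /* new physical page number */
        } map[MAP_ENTRIES_PER_PAGE];
    };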
Article
Emerging byte-addressable, non-volatile memory technologies offer performance within an order of magnitude of DRAM, prompting their inclusion in the processor memory subsystem. However, such load/store accessible Persistent Memory (PM) has implications on system design, both hardware and software. In this paper, we explore system software support to enable low-overhead PM access by new and legacy applications. To this end, we implement PMFS, a light-weight POSIX file system that exploits PM's byte-addressability to avoid overheads of block-oriented storage and enable direct PM access by applications (with memory-mapped I/O). PMFS exploits the processor's paging and memory ordering features for optimizations such as fine-grained logging (for consistency) and transparent large page support (for faster memory-mapped I/O). To provide strong consistency guarantees, PMFS requires only a simple hardware primitive that provides software enforceable guarantees of durability and ordering of stores to PM. Finally, PMFS uses the processor's existing features to protect PM from stray writes, thereby improving reliability. Using a hardware emulator, we evaluate PMFS's performance with several workloads over a range of PM performance characteristics. PMFS shows significant (up to an order of magnitude) gains over traditional file systems (such as ext4) on a RAMDISK-like PM block device, demonstrating the benefits of optimizing system software for PM.
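Fine-grained logging can be sketched as an undo log whose records cover only the bytes actually modified, rather than whole blocks. In the sketch below, persist() stands for the durability primitive (flush plus fence) and the record format is hypothetical, not PMFS's actual log layout.

    #include <stdint.h>
    #include <string.h>
    #include <stddef.h>

    extern void persist(const void *p, size_t n);  /* flush + fence */

    struct undo_rec {
        uint64_t addr;       /* PM address being modified */
        uint32_t len;        /* fine-grained: often just a few bytes */
        uint8_t  old[48];    /* prior contents, restored on recovery */
    };

    /* Log the old value, persist the log, then update in place and
     * persist the update (callers ensure len <= sizeof rec->old). */
    void logged_store(struct undo_rec *rec, void *dst,
                      const void *newval, uint32_t len)
    {
        rec->addr = (uint64_t)(uintptr_t)dst;
        rec->len = len;
        memcpy(rec->old, dst, len);
        persist(rec, sizeof *rec);      /* undo info durable first */
        memcpy(dst, newval, len);
        persist(dst, len);              /* then the in-place update */
    }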
Conference Paper
Multilevel-cell (MLC) phase change memory (PCM) may provide both high capacity main memory and faster-than-Flash persistent storage. But slow growth in cell resistance with time, resistance drift, can cause transient errors in MLC-PCM. Drift errors increase with time, and prior work suggests refresh before the cell loses data. The need for refresh makes MLC-PCM volatile, taking away a key advantage. Based on the observation that most drift errors occur in a particular state in four-level-cell PCM, we propose to change from four levels to three levels, eliminating the most vulnerable state. This simple change lowers cell drift error rates by many orders of magnitude: three-level-cell PCM can retain data without power for more than ten years. With optimized encoding/decoding and a wearout tolerance mechanism, we can narrow the capacity gap between three-level and four-level cells. These techniques together enable low-cost, high-performance, genuinely nonvolatile MLC-PCM.
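The capacity cost of dropping a level is modest arithmetic: a four-level cell stores 2 bits, a three-level cell stores log2(3) ≈ 1.58 bits, and grouping cells recovers most of that, e.g., 11 tri-level cells hold 17 bits since 3^11 = 177147 ≥ 2^17 = 131072. The paper's optimized encoding differs; the base-3 codec below only illustrates the math.

    #include <stdint.h>

    /* Encode a 17-bit value into 11 ternary PCM levels (0..2 each).
     * Works because 3^11 = 177147 >= 2^17 = 131072; efficiency is
     * 17/11 ~= 1.55 bits/cell vs. the ideal log2(3) ~= 1.58. */
    void encode3(uint32_t value17, uint8_t levels[11])
    {
        for (int i = 0; i < 11; i++) {
            levels[i] = value17 % 3;
            value17 /= 3;
        }
    }

    uint32_t decode3(const uint8_t levels[11])
    {
        uint32_t v = 0;
        for (int i = 10; i >= 0; i--)
            v = v * 3 + levels[i];
        return v;
    }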
Conference Paper
Journaling techniques are widely used in modern file systems as they provide high reliability and fast recovery from system failures. However, journaling reduces the performance benefit of buffer caching, as it accounts for the bulk of storage writes in real system environments. In this paper, we present a novel buffer cache architecture that subsumes the functionality of caching and journaling by making use of non-volatile memory such as PCM or STT-MRAM. Specifically, our buffer cache supports what we call the in-place commit scheme. This scheme avoids logging, but still provides the same journaling effect by simply altering the state of the cached block to frozen. As a frozen block still performs the function of caching, we show that in-place commit does not degrade cache performance. We implement our scheme on Linux 2.6.38 and measure the throughput and execution time of the scheme with various file I/O benchmarks. The results show that our scheme improves I/O performance by 76% on average, and by up to 240%, compared to the existing Linux buffer cache with ext4, without any loss of reliability.
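The in-place commit scheme can be pictured as a per-block state machine, sketched below with hypothetical structures: commit flips dirty cached blocks to a frozen state instead of writing a log, and a later write to a frozen block is redirected to a fresh copy so the committed version survives until it reaches the disk.

    #include <stdint.h>

    enum blk_state { CLEAN, DIRTY, FROZEN };

    struct nvm_block {
        uint64_t blkno;
        enum blk_state state;
        struct nvm_block *shadow;   /* writable copy of a frozen block */
    };

    /* Commit: no logging, no copying; flip dirty blocks to FROZEN. */
    void commit(struct nvm_block **blocks, int n)
    {
        for (int i = 0; i < n; i++)
            if (blocks[i]->state == DIRTY)
                blocks[i]->state = FROZEN;  /* now acts as the journal copy */
    }

    /* A write hitting a FROZEN block is redirected to a fresh copy,
     * preserving the committed version. */
    struct nvm_block *write_target(struct nvm_block *b,
                                   struct nvm_block *fresh)
    {
        if (b->state != FROZEN)
            return b;
        fresh->blkno = b->blkno;
        fresh->state = DIRTY;
        b->shadow = fresh;
        return fresh;
    }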
Conference Paper
Flash memory has recently become popular as a caching medium. Most uses to date are on the storage server side. We investigate a different structure: flash as a cache on the client side of a networked storage environment. We use trace-driven simulation to explore the design space. We consider a wide range of configurations and policies to determine the potential that client-side caches might offer and how best to arrange them. Our results show that the flash cache writeback policy does not significantly affect performance. Write-through is sufficient; this greatly simplifies cache consistency handling. We also find that the chief benefit of the flash cache is its size, not its persistence. Cache persistence offers additional performance benefits at system restart at essentially no runtime cost. Finally, for some workloads a large flash cache allows using minuscule amounts of RAM for file caching (e.g., 256 KB), leaving more memory available for application use.
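The write-through result has a simple operational reading, sketched below with hypothetical calls: because data is durable at the server before the cache is updated, the flash cache never holds the only copy of dirty data and needs no crash-consistent metadata of its own.

    #include <stdint.h>
    #include <stddef.h>

    extern int  server_write(uint64_t lba, const void *buf, size_t len);
    extern void flash_cache_insert(uint64_t lba, const void *buf, size_t len);

    /* Write-through: the server holds the authoritative copy before the
     * flash cache sees it, so a crashed cache can simply be discarded. */
    int cached_write(uint64_t lba, const void *buf, size_t len)
    {
        int rc = server_write(lba, buf, len);
        if (rc == 0)
            flash_cache_insert(lba, buf, len);
        return rc;
    }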
Conference Paper
Storage systems based on Phase Change Memory (PCM) devices are beginning to generate considerable attention in both industry and academic communities. But whether the technology in its current state will be a commercially and technically viable alternative to entrenched technologies such as flash-based SSDs remains undecided. To address this, it is important to consider PCM SSD devices not just from a device standpoint, but also from a holistic perspective. This paper presents the results of our performance study of a recent all-PCM SSD prototype. The average latency for a 4 KiB random read is 6.7 µs, which is about 16× faster than a comparable eMLC flash SSD. The distribution of I/O response times is also much narrower than flash SSD for both reads and writes. Based on the performance measurements and real-world workload traces, we explore two typical storage use cases: tiering and caching. For tiering, we model a hypothetical storage system that consists of flash, HDD, and PCM to identify the combinations of device types that offer the best performance within cost constraints. For caching, we study whether PCM can improve performance compared to flash in terms of aggregate I/O time and read latency. We report that the IOPS/$ of a tiered storage system can be improved by 12-66% and the aggregate elapsed time of a server-side caching solution can be improved by up to 35% by adding PCM. Our results show that, even at current price points, PCM storage devices show promising performance as a new component in enterprise storage systems.
Conference Paper
Persistent memory is an emerging technology which allows in-memory persistent data objects to be updated at much higher throughput than when using disks as persistent storage. Previous persistent memory designs use logging or copy-on-write mechanisms to update persistent data, which unfortunately reduces the system performance to roughly half that of a native system with no persistence support. One of the great challenges in this application class is therefore how to efficiently enable atomic, consistent, and durable updates to ensure data persistence that survives application and/or system failures. Our goal is to design a persistent memory system with performance very close to that of a native system. We propose Kiln, a persistent memory design that adopts a nonvolatile cache and a nonvolatile main memory to enable atomic in-place updates without logging or copy-on-write. Our evaluation shows that Kiln can achieve 2× performance improvement compared with NVRAM-based persistent memory with write-ahead logging. In addition, our design has numerous practical advantages: a simple and intuitive abstract interface, microarchitecture-level optimizations, fast recovery from failures, and eliminating redundant writes to nonvolatile storage media.
Conference Paper
In the era of smartphones and mobile computing, many popular applications such as Facebook, Twitter, Gmail, and even the Angry Birds game manage their data using SQLite. This is mainly due to its development productivity and solid transactional support. For transactional atomicity, however, SQLite relies on less sophisticated but costlier page-oriented journaling mechanisms. Hence, this is often cited as the main cause of tardy responses in mobile applications. Flash memory does not allow data to be updated in place, so the copy-on-write strategy is adopted by most flash storage devices. In this paper, we propose X-FTL, a transactional flash translation layer (FTL) for SQLite databases. By offloading the burden of guaranteeing transactional atomicity from the host system to flash storage, and by taking advantage of the copy-on-write strategy used in modern FTLs, X-FTL drastically improves transactional throughput almost for free, without resorting to costly journaling schemes. We have implemented X-FTL on an SSD development board called OpenSSD, and minimally modified SQLite and the ext4 file system to make them compatible with the extended abstractions provided by X-FTL. We demonstrate the effectiveness of X-FTL using real and synthetic SQLite workloads for smartphone applications, the TPC-C benchmark for OLTP databases, and the FIO benchmark for file systems.
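A hedged sketch of the mechanism: since the FTL already writes updated data to fresh physical pages, transactional atomicity reduces to deferring the logical-to-physical mapping switch until commit. The structures below are illustrative, not X-FTL's implementation.

    #include <stdint.h>

    #define NPAGES 1024

    uint32_t l2p[NPAGES];          /* committed logical-to-physical map */

    struct pending {               /* per-transaction shadow mappings */
        uint32_t lpn[64], new_ppn[64];
        int n;
    };

    /* Transactional write: the data already sits in a fresh flash page
     * (copy-on-write); just remember the mapping until commit. */
    void xftl_write(struct pending *p, uint32_t lpn, uint32_t new_ppn)
    {
        p->lpn[p->n] = lpn;
        p->new_ppn[p->n] = new_ppn;
        p->n++;
    }

    /* Commit: switch mappings; readers see old pages until this point.
     * (A real device persists this switch atomically inside the SSD.) */
    void xftl_commit(struct pending *p)
    {
        for (int i = 0; i < p->n; i++)
            l2p[p->lpn[i]] = p->new_ppn[i];
        p->n = 0;
    }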
Conference Paper
Emerging non-volatile memory (NVM) technologies have gained a lot of attention recently. The byte-addressability and high density of NVM enable computer architects to build large-scale main memory systems. NVM has also been shown to be a promising alternative to conventional persistent store. With NVM, programmers can persistently retain in-memory data structures without writing them to disk. Therefore, one can envision that in the future, NVM will play the role of both working memory and persistent store at the same time. Persistent store demands consistency and durability guarantees, thereby imposing new design constraints on the memory system. Consistency is achieved at the expense of serializing multiple write operations. Durability requires memory cells to guarantee non-volatility and thus reduces the write speed. Therefore, a unified architecture oblivious to these two use cases would lead to suboptimal design. In this paper, we propose a novel unified working memory and persistent store architecture, NVM Duet, which provides the required consistency and durability guarantees for persistent store while relaxing these constraints if accesses to NVM are for working memory. A cross-layer design approach is adopted to achieve the design goal. Overall, simulation results demonstrate that NVM Duet achieves up to 1.68x (1.32x on average) speedup compared with the baseline design.
Article
We conduct a comprehensive study of file-system code evolution. By analyzing eight years of Linux file-system changes across 5079 patches, we derive numerous new (and sometimes surprising) insights into the file-system development process; our results should be useful for both the development of file systems themselves as well as the improvement of bug-finding tools.
Conference Paper
Resistive Random Access Memory (ReRAM) is one of the most promising emerging memory technologies as a potential replacement for DRAM memory and/or NAND Flash. Multi-level cell (MLC) ReRAM, which can store multiple bits in a single ReRAM cell, can further improve density and reduce cost-per-bit, and therefore has recently been investigated extensively. However, the majority of the prior studies on MLC ReRAM are at the device level. The design implications for MLC ReRAM at the circuit and system levels remain to be explored. This paper aims to provide the first comprehensive investigation of the design trade-offs involved in MLC ReRAM. Our study indicates that different resistance allocation schemes, programming strategies, peripheral designs, and material selections profoundly affect the area, latency, power, and reliability of MLC ReRAM. Based on this analysis, we conduct two case studies: first, we compare MLC ReRAM design against MLC phase-change memory (PCM) and multi-layer cross-point ReRAM design, and point out why multi-level ReRAM is appealing; second, we further explore the design space for MLC ReRAM.
Conference Paper
This paper presents a new disk I/O architecture composed of an array of a flash memory SSD (solid state disk) and a hard disk drive (HDD) that are intelligently coupled by a special algorithm. We call this architecture I-CASH: Intelligently Coupled Array of SSD and HDD. The SSD stores seldom-changed and mostly-read reference data blocks, whereas the HDD stores a log of deltas between currently accessed I/O blocks and their corresponding reference blocks in the SSD, so that random writes are not performed in the SSD during online I/O operations. High-speed delta compression and similarity detection algorithms are developed to control the pair of SSD and HDD. The idea is to exploit the fast read performance of SSDs and the high-speed computation of modern multi-core CPUs to replace and substitute, to a great extent, the mechanical operations of HDDs. At the same time, we avoid runtime SSD writes that are slow and wearing. An experimental prototype of I-CASH has been implemented and is used to evaluate I-CASH performance as compared to existing SSD/HDD I/O architectures. Numerical results on standard benchmarks show that I-CASH reduces the average I/O response time by an order of magnitude compared to existing disk I/O architectures such as RAID and SSD/HDD storage hierarchy, and provides up to 2.8× speedup over state-of-the-art pure SSD storage. Furthermore, I-CASH reduces random writes to the SSD, implying reduced wear and a prolonged SSD lifetime.
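The read path can be sketched as reference-plus-delta reconstruction; all helpers below (ssd_read_ref, hdd_read_delta, delta_apply) are hypothetical stand-ins for I-CASH's delta compression machinery.

    #include <stddef.h>
    #include <stdint.h>

    extern int  ssd_read_ref(uint64_t blkno, void *ref);    /* reference */
    extern int  hdd_read_delta(uint64_t blkno, void *delta,
                               size_t *dlen);               /* latest delta */
    extern void delta_apply(void *block, const void *ref,
                            const void *delta, size_t dlen);

    /* Serve a read: a fast SSD fetch plus CPU-side delta decompression
     * replaces a random HDD read of the full block. */
    int icash_read(uint64_t blkno, void *block, void *ref, void *delta)
    {
        size_t dlen = 0;
        if (ssd_read_ref(blkno, ref) != 0)
            return -1;
        if (hdd_read_delta(blkno, delta, &dlen) != 0)
            dlen = 0;              /* no delta: block equals the reference */
        delta_apply(block, ref, delta, dlen);
        return 0;
    }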
Conference Paper
The predicted shift to non-volatile, byte-addressable memory (e.g., Phase Change Memory and Memristor), the growth of "big data", and the subsequent emergence of frameworks such as memcached and NoSQL systems require us to rethink the design of data stores. To derive the maximum performance from these new memory technologies, this paper proposes the use of single-level data stores. For these systems, where no distinction is made between a volatile and a persistent copy of data, we present Consistent and Durable Data Structures (CDDSs) that, on current hardware, allow programmers to safely exploit the low-latency and non-volatile aspects of new memory technologies. CDDSs use versioning to allow atomic updates without requiring logging. The same versioning scheme also enables rollback for failure recovery. When compared to a memory-backed Berkeley DB B-Tree, our prototype-based results show that a CDDS B-Tree can increase put and get throughput by 74% and 138%. When compared to Cassandra, a two-level data store, Tembo, a CDDS B-Tree enabled distributed key-value system, increases throughput by 250%-286%.
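A minimal sketch of the versioning scheme, under the assumption (not the paper's exact design) that each entry carries a [start, end) version range and that persisting an incremented global version number is the commit point:

    #include <stdint.h>

    #define INF UINT64_MAX

    struct ventry {
        uint64_t key, val;
        uint64_t start, end;     /* live for versions in [start, end) */
    };

    uint64_t current_version = 1;   /* persisted; its increment commits */

    /* An entry is visible at version v if start <= v < end. */
    int visible(const struct ventry *e, uint64_t v)
    {
        return e->start <= v && v < e->end;
    }

    /* Update: retire the old entry and create its successor, both stamped
     * with v+1; persisting current_version = v+1 commits atomically, and
     * recovery ignores entries whose start exceeds current_version. */
    void vupdate(struct ventry *old, struct ventry *new_e, uint64_t newval)
    {
        uint64_t v = current_version;
        new_e->key = old->key;
        new_e->val = newval;
        new_e->start = v + 1;
        new_e->end = INF;
        old->end = v + 1;
        current_version = v + 1;   /* commit point (flush + fence in real code) */
    }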
Conference Paper
We propose a novel method to measure the robustness of journaling file systems under disk write failures. In our approach, we build models of how journaling file systems order disk writes under different journaling modes and use these models to inject write failures during file system updates. Using our technique, we analyze whether journaling file systems maintain on-disk consistency in the presence of disk write failures. We apply our technique to three important Linux journaling file systems: ext3, Reiserfs, and IBM JFS. From our analysis, we identify several design flaws and correctness bugs in these file systems, which can cause serious file system errors ranging from data corruption to unmountable file systems.
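The methodology can be pictured as a shim under the file system that fails one targeted class of writes, for instance the journal commit record, and then lets a consistency checker examine the resulting disk image. The hooks below are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    enum wtype { W_JOURNAL_DESC, W_JOURNAL_DATA, W_COMMIT, W_CHECKPOINT };

    extern enum wtype classify(uint64_t lba, const void *buf); /* from model */
    extern int disk_write(uint64_t lba, const void *buf, size_t len);

    static enum wtype fail_target = W_COMMIT;   /* which write class to fail */

    /* Fault-injecting write path: silently drop the targeted write class,
     * as a failed disk write would, then run the consistency checker. */
    int injected_write(uint64_t lba, const void *buf, size_t len)
    {
        if (classify(lba, buf) == fail_target)
            return 0;    /* report success but write nothing */
        return disk_write(lba, buf, len);
    }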
Conference Paper
We develop and apply two new methods for analyzing file system behavior and evaluating file system changes. First, semantic block-level analysis (SBA) combines knowledge of on-disk data structures with a trace of disk traffic to infer file system behavior; in contrast to standard benchmarking approaches, SBA enables users to understand why the file system behaves as it does. Second, semantic trace playback (STP) enables traces of disk traffic to be easily modified to represent changes in the file system implementation; in contrast to directly modifying the file system, STP enables users to rapidly gauge the benefits of new policies. We use SBA to analyze Linux ext3, ReiserFS, JFS, and Windows NTFS; in the process, we uncover many strengths and weaknesses of these journaling file systems. We also apply STP to evaluate several modifications to ext3, demonstrating the benefits of various optimizations without incurring the costs of a real implementation.
Conference Paper
This paper considers the problem of how to implement a file system on Storage Class Memory (SCM) that is directly connected to the memory bus, byte-addressable, and non-volatile. In this paper, we propose a new file system, called SCMFS, which is implemented on the virtual address space. In SCMFS, we utilize the existing memory management module in the operating system to do the block management and keep the space contiguous for each file. The simplicity of SCMFS not only makes it easy to implement, but also improves the performance. We have implemented a prototype in Linux and evaluated its performance through multiple benchmarks.
Conference Paper
Memory scaling is in jeopardy as charge storage and sensing mechanisms become less reliable for prevalent memory technologies, such as DRAM. In contrast, phase change memory (PCM) storage relies on scalable current and thermal mechanisms. To exploit PCM's scalability as a DRAM alternative, PCM must be architected to address relatively long latencies, high-energy writes, and finite endurance. We propose area-neutral architectural enhancements, crafted from a fundamental understanding of PCM technology parameters, that address these limitations and make PCM competitive with DRAM. A baseline PCM system is 1.6x slower and requires 2.2x more energy than a DRAM system. Buffer reorganizations reduce this delay and energy gap to 1.2x and 1.0x, using narrow rows to mitigate write energy and multiple rows to improve locality and write coalescing. Partial writes enhance memory endurance, providing 5.6 years of lifetime. Process scaling will further reduce PCM energy costs and improve endurance.
Conference Paper
Using nonvolatile memories in the memory hierarchy has been investigated to reduce energy consumption, because nonvolatile memories consume zero leakage power in memory cells. One of the difficulties is, however, that the endurance of most nonvolatile memory technologies is much shorter than that of conventional SRAM and DRAM technology. This has limited their usage to only the low levels of a memory hierarchy, e.g., disks, that are far from the CPU. In this paper, we study the use of a new type of nonvolatile memory, Phase Change Memory (PCM), as the main memory for a 3D stacked chip. The main challenges we face are the limited PCM endurance, longer access latencies, and higher dynamic power compared to the conventional DRAM technology. We propose techniques to extend the endurance of the PCM to an average of 13 years (for MLC PCM cells) to 22 years (for SLC PCM). We also study the design choices of implementing PCM to achieve the best tradeoff between energy and performance. Our design reduced the total energy of an already low-power DRAM main memory of the same capacity by 65%, and the energy-delay-squared (ED²) product by 60%. These results indicate that it is feasible to use PCM technology in place of DRAM in the main memory for better energy efficiency.