Conference Paper
PDF available

DualFS: A new journaling file system without meta-data duplication

Authors: Juan Piernas, Toni Cortes, and José M. García

Abstract

In this paper we introduce DualFS, a new high-performance journaling file system that puts data and meta-data on different devices (usually two partitions on the same disk, or on different disks) and manages them in very different ways. Unlike other journaling file systems, DualFS keeps only one copy of every meta-data block. This copy lives in the meta-data device, a log which DualFS uses both to read and to write meta-data blocks. By avoiding a time-expensive extra copy of meta-data blocks, DualFS achieves good performance compared to other journaling file systems. We have implemented a DualFS prototype and evaluated it with microbenchmarks and macrobenchmarks; we found that DualFS greatly reduces the total I/O time taken by the file system in most cases (by up to 97%), while slightly increasing it in only a few limited cases.
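To make the single-copy design concrete, here is a minimal sketch of the idea as a toy in-memory model: data blocks are updated in place on one device, while the only copy of each meta-data block lives in an append-only log on a second device that serves both reads and writes. All class and method names are illustrative, not DualFS's actual interfaces.

```python
# Toy model of DualFS's core idea: two devices, two managements.

class DualFSModel:
    def __init__(self):
        self.data_device = {}        # block number -> data block (in place)
        self.metadata_log = []       # append-only log of meta-data blocks
        self.metadata_index = {}     # meta-data block id -> log position

    def write_data(self, block_no, payload):
        # Data blocks are updated in place on the data device.
        self.data_device[block_no] = payload

    def write_metadata(self, block_id, payload):
        # The only copy of a meta-data block is appended to the log;
        # the index records where its latest version lives.
        self.metadata_log.append((block_id, payload))
        self.metadata_index[block_id] = len(self.metadata_log) - 1

    def read_metadata(self, block_id):
        # Reads come from the same log -- no second "main" copy exists.
        return self.metadata_log[self.metadata_index[block_id]][1]

fs = DualFSModel()
fs.write_metadata("inode:7", {"size": 4096, "blocks": [42]})
fs.write_data(42, b"file contents")
assert fs.read_metadata("inode:7")["blocks"] == [42]
```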
... The log storage is designed for crash recovery. DualFS is a journaling file system that separates metadata from data using partitions or devices [Piernas et al. 2002]. Update-in-place and update-out-of-place strategies are applied to the data and metadata partitions, respectively. ...
Article
This article presents a framework, Frog, for Context-Based File Systems (CBFSs), which aims at simplifying the development of context-based file systems and applications. Unlike existing informed-based context-aware systems, Frog is a unifying informed-based framework that abstracts context-specific solutions as views, allowing applications to make view selections according to application behaviors. The framework can not only eliminate overheads induced by traditional context analysis, but also simplify the interactions between context-based file systems and applications. Rather than propagating data through solution-specific interfaces, views in Frog can be selected by inserting their names into file path strings. With Frog in place, programmers can migrate an application from one solution to another by switching among views rather than changing programming interfaces. Since data consistency issues are automatically enforced by the framework, file-system developers can focus their attention on context-specific solutions. We implement two prototypes to demonstrate the strengths and overheads of our design. Inspired by the observation that more than 50% of the files in a file system are small (<4 KB), we create a Bi-context Archiving Virtual File System (BAVFS) that utilizes conservative and aggressive prefetching for the contexts of random and sequential reads. To improve the performance of random read-and-write operations, the Bi-context Hybrid Virtual File System (BHVFS) combines the update-in-place and update-out-of-place solutions for read-intensive and write-intensive contexts. Our experimental results show that the benefits of Frog-based CBFSs outweigh the overheads introduced by integrating multiple context-specific solutions.
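As a sketch of how path-embedded view selection might look, the following assumes an invented `@view=` path component; the abstract specifies only that view names are inserted into file path strings, so the exact syntax here is an assumption.

```python
# Sketch: pull a view name out of a view-annotated file path.

def split_view(path, default_view="default"):
    """Return (view_name, cleaned_path) for a view-annotated path."""
    parts = path.split("/")
    views = [p for p in parts if p.startswith("@view=")]
    cleaned = "/".join(p for p in parts if not p.startswith("@view="))
    view = views[0][len("@view="):] if views else default_view
    return view, cleaned

print(split_view("/mnt/cbfs/@view=aggressive-prefetch/logs/a.txt"))
# -> ('aggressive-prefetch', '/mnt/cbfs/logs/a.txt')
print(split_view("/mnt/cbfs/logs/a.txt"))
# -> ('default', '/mnt/cbfs/logs/a.txt')
```

An application switches solutions by changing only the view name in the path, not its programming interface, which is the migration property the abstract describes.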
... BetrFS [47] introduces the Bε-tree as an indexing structure for efficient large scans. DualFS [70], hFS [110], and ext4-lazy [2] abandon the traditional FFS [61] cylinder-group design and aggregate all metadata in one place to achieve significantly faster metadata operations. TableFS [75] and DeltaFS [111] store metadata in LevelDB running atop a file system and achieve orders-of-magnitude faster metadata operations than local file systems. ...
Conference Paper
For a decade, the Ceph distributed file system followed the conventional wisdom of building its storage backend on top of local file systems. This is a preferred choice for most distributed file systems today because it allows them to benefit from the convenience and maturity of battle-tested code. Ceph's experience, however, shows that this comes at a high price. First, developing a zero-overhead transaction mechanism is challenging. Second, metadata performance at the local level can significantly affect performance at the distributed level. Third, supporting emerging storage hardware is painstakingly slow. Ceph addressed these issues with BlueStore, a new backend designed to run directly on raw storage devices. In only two years since its inception, BlueStore outperformed previously established backends and has been adopted by 70% of users in production. By running in user space and fully controlling the I/O stack, it has enabled space-efficient metadata and data checksums, fast overwrites of erasure-coded data, inline compression, and decreased performance variability, while avoiding a series of performance pitfalls of local file systems. Finally, it makes the adoption of backwards-incompatible storage hardware possible, an important trait in a changing storage landscape that is learning to embrace hardware diversity.
... DualFS uses two different file systems for data and metadata, with two different management strategies, respectively [105]. A log-structured file system is used for metadata placement. ...
Article
Full-text available
Provisioning an efficient, ultra-large-scale distributed storage system for expanding cloud applications is a challenging task for researchers in academia and industry. In such an ultra-large-scale storage system, data are distributed across multiple storage nodes for performance, scalability, and availability. Access to this distributed data goes through its metadata, maintained by multiple metadata servers; the metadata carries information about the physical address of data and access privileges. The efficiency of a storage system therefore depends strongly on effective metadata management. This article presents an extensive systematic literature analysis of metadata management techniques in storage systems, helping researchers identify the significance of metadata management and the key parameters of metadata management techniques. It methodically examines metadata management techniques developed by various industry and research groups, derives taxonomies from the different metadata distribution techniques, and further investigates techniques based on distribution structures and key parameters of metadata management. It also presents the strengths and weaknesses of individual techniques, helping researchers select the most appropriate technique for a specific application. Finally, it discusses open challenges and significant research directions in metadata management.
Article
Object-based storage systems have been widely used for various scenarios such as file storage, block storage, and blob (e.g., large video) storage, where the data is placed among a large number of object storage devices (OSDs). Data placement is critical for the scalability of decentralized object-based storage systems. The state-of-the-art CRUSH placement method is a decentralized algorithm that deterministically places object replicas onto storage devices without relying on a central directory. While enjoying the benefits of decentralization such as high scalability, robustness, and performance, CRUSH-based storage systems suffer from uncontrolled data migration when expanding the capacity of the storage clusters (i.e., adding new OSDs), which is determined by the nature of CRUSH and causes significant performance degradation when the expansion is nontrivial. This paper presents MapX, a novel extension to CRUSH that uses an extra time-dimension mapping (from object creation times to cluster expansion times) for controlling data migration after cluster expansions. Each expansion is viewed as a new layer of the CRUSH map, represented by a virtual node beneath the CRUSH root. MapX controls the mapping from objects onto layers by manipulating the timestamps of the intermediate placement groups (PGs). MapX is applicable to a large variety of object-based storage scenarios where object timestamps can be maintained as higher-level metadata. We have applied MapX to the state-of-the-art Ceph-RBD (RADOS Block Device) to implement a migration-controllable, decentralized object-based block store called Oasis. Oasis extends the RBD metadata structure to maintain and retrieve approximate object creation times (for migration control) at the granularity of expansion layers. Experimental results show that the MapX-based Oasis block store outperforms the CRUSH-based Ceph-RBD (which is busy migrating objects after expansions) by 3.17x to 4.31x in tail latency, and by 76.3% (resp. 83.8%) in IOPS for reads (resp. writes).
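A minimal sketch of the time-dimension mapping follows, under invented structures: each expansion opens a new layer, and an object is routed to the newest layer that existed at its creation time, so existing objects never move. The per-layer CRUSH computation is replaced by a simple hash stand-in.

```python
# Sketch: route objects to expansion layers by creation time.

import bisect

class TimeLayeredMap:
    def __init__(self):
        self.expansion_times = [0]     # layer i opened at expansion_times[i]
        self.layers = [["osd.0", "osd.1"]]

    def expand(self, now, new_osds):
        # A new layer of devices becomes the target for newer objects.
        self.expansion_times.append(now)
        self.layers.append(list(new_osds))

    def place(self, object_id, created_at):
        # Newest layer that existed when the object was created;
        # older objects keep mapping to their old layer (no migration).
        i = bisect.bisect_right(self.expansion_times, created_at) - 1
        devices = self.layers[i]
        return devices[hash(object_id) % len(devices)]  # stand-in for CRUSH

m = TimeLayeredMap()
old = m.place("obj-a", created_at=100)
m.expand(now=200, new_osds=["osd.2", "osd.3"])
assert m.place("obj-a", created_at=100) == old   # unchanged after expansion
print(m.place("obj-b", created_at=250))          # lands on the new layer
```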
Article
Two oft-cited file systems, the Fast File System (FFS) and the Log-Structured File System (LFS), adopt two sharply different update strategies: update-in-place and update-out-of-place. This paper introduces the design and implementation of a hybrid file system called hFS, which combines the strengths of FFS and LFS while avoiding their weaknesses. This is accomplished by distributing file system data into two partitions based on their size and type. In hFS, data blocks of large regular files are stored in a data partition arranged in an FFS-like fashion, while metadata and small files are stored in a separate log partition organized in the spirit of LFS, but without incurring any cleaning overhead. This segregation makes it possible to use more appropriate layouts for different data than would otherwise be possible. In particular, hFS has the ability to perform clustered I/O on all kinds of data, including small files, metadata, and large files. We have implemented a prototype of hFS on FreeBSD and have compared its performance against three file systems: FFS with Soft Updates, a port of NetBSD's LFS, and our lightweight journaling file system called yFS. Results on a number of benchmarks show that hFS has excellent small-file and metadata performance. For example, hFS beats FFS with Soft Updates by 53% to 63% in the PostMark benchmark.
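The segregation policy can be sketched as a simple routing function. The 64 KB small-file cutoff below is an assumption chosen for illustration, not hFS's actual threshold.

```python
# Sketch: hFS-style routing of objects to two partitions by size/type.

SMALL_FILE_LIMIT = 64 * 1024   # illustrative cutoff, not hFS's real one

def choose_partition(kind, size=0):
    """Return the partition an object belongs to under the policy."""
    if kind == "metadata":
        return "log-partition"            # metadata always rides the log
    if kind == "regular-file" and size <= SMALL_FILE_LIMIT:
        return "log-partition"            # small files ride the log too
    return "data-partition"               # large files, FFS-like layout

assert choose_partition("metadata") == "log-partition"
assert choose_partition("regular-file", 4096) == "log-partition"
assert choose_partition("regular-file", 10 * 1024 * 1024) == "data-partition"
```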
Conference Paper
This paper presents URSA, a hybrid block store that provides virtual disks for various applications to run efficiently on cloud VMs. Trace analysis shows that the I/O patterns served by block storage have limited locality to exploit. Therefore, instead of using SSDs as a cache layer, URSA proposes an SSD-HDD-hybrid storage structure that directly stores primary replicas on SSDs and replicates backup replicas on HDDs, using journals to bridge the performance gap between SSDs and HDDs. URSA integrates the hybrid structure with designs for high reliability, scalability, and availability. Experiments show that URSA in its hybrid mode achieves almost the same performance as in its SSD-only mode (storing all replicas on SSDs), and outperforms other block stores (Ceph and Sheepdog) even in their SSD-only mode while achieving much higher CPU efficiency (performance per core). We also discuss some practical issues in our deployment.
Article
Full-text available
The digital tachograph is a device that automatically records driving activity such as vehicle speed, driving distance, brake status, transmission, engine RPM, longitude and latitude, and accumulated distance. Digital tachographs have been mandatory for all trucks under European Commission regulation since 2005. In South Korea, digital tachographs have been mandatory for all new commercial vehicles since 2011 and are being installed in a wider range of vehicles every year. The device is used to analyze drivers' driving styles and car accidents. Because an accident makes the reliability of the device unpredictable, it is very important to store driving information reliably and to recover it after an accident. We designed and implemented a practical digital tachograph that stores data reliably and recovers it quickly even after an accident. This paper presents a tiered storage scheme consisting of a first storage device with small capacity and high reliability and a second storage device with large capacity at low cost. This tiered architecture provides high reliability and fast recovery time for embedded storage. In addition, we present a reverse first-page scanning scheme that overcomes the slow scan time of log-structured storage at the boot stage; the scheme reduced the scan time of the first storage device to 1/28 of the original. Our design also includes a scheme that stores data quickly at the moment of an accident, in 1/25 of the transfer time of the normal method.
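A rough sketch of the reverse-scanning idea at boot follows, under an invented page layout and validity check: rather than replaying a log-structured store from the beginning, the scanner walks pages from the end until it finds the newest valid one.

```python
# Sketch: find the newest valid page by scanning the log backwards.

PAGE_SIZE = 256            # illustrative page size
VALID_MAGIC = b"TACH"      # invented validity marker

def last_valid_page(log: bytes):
    """Return the offset of the newest valid page, scanning backwards."""
    n_pages = len(log) // PAGE_SIZE
    for i in range(n_pages - 1, -1, -1):
        off = i * PAGE_SIZE
        if log[off:off + 4] == VALID_MAGIC:
            return off
    return None

log = bytearray(PAGE_SIZE * 4)
log[0:4] = VALID_MAGIC                       # oldest record
log[PAGE_SIZE:PAGE_SIZE + 4] = VALID_MAGIC   # newer record
print(last_valid_page(bytes(log)))           # -> 256 (newest valid page)
```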
Article
Full-text available
Recent developments in computer communications have enabled the deployment of a wide variety of multimedia applications. Among the various media, video is characterized by its stringent requirements in terms of processing power, storage, and bandwidth. In this paper, we study a parallel implementation of a software MPEG-2 encoder on a platform consisting of a cluster of workstations interconnected by an ATM switch. A parallel software encoder offers great flexibility as opposed to a hardware encoder, i.e., flexibility in setting parameters and modifying the various stages of the encoding process, and it should allow us to reduce turnaround times by considerably reducing the time required to encode video with a new set of parameters. However, the effective implementation of such an application on a platform without common memory requires a thorough analysis of the best strategy for distributing the data. This issue is particularly important in video coding due to the large volume of video data to be handled. In this study, we pay particular attention to the allocation of data among the processors in order to improve overall system operation, and we explore several different methods of data distribution. Encoding times using different numbers of processors are provided. Our results show that the underlying middleware plays a major role in the performance of the overall system.
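One possible frame-group distribution strategy can be sketched as follows; the group size, worker count, and stubbed encoder are illustrative assumptions, and the paper compares several such distribution methods rather than prescribing this one.

```python
# Sketch: distribute groups of frames across worker processes.

from concurrent.futures import ProcessPoolExecutor

def encode_group(frames):
    # Stand-in for MPEG-2 encoding of one group of frames.
    return sum(len(f) for f in frames)    # pretend "encoded bytes"

def parallel_encode(frames, workers=4, group_size=12):
    groups = [frames[i:i + group_size]
              for i in range(0, len(frames), group_size)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(encode_group, groups))

if __name__ == "__main__":
    fake_frames = [b"x" * 352 * 288 for _ in range(120)]  # CIF-sized stubs
    print(sum(parallel_encode(fake_frames)))
```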
Conference Paper
Full-text available
Research results [ROSE91] suggest that a log-structured file system (LFS) offers the potential for dramatically improved write performance, faster recovery time, and faster file creation and deletion than traditional UNIX file systems. This paper presents a redesign and implementation of the Sprite [ROSE91] log-structured file system that is more robust and integrated into the vnode interface [KLEI86]. Measurements show its performance to be superior to the 4BSD Fast File System (FFS) in a variety of benchmarks and not significantly less than FFS in any test. Unfortunately, an enhanced version of FFS (with read and write clustering) [MCVO91] provides comparable and sometimes superior performance to our LFS. However, LFS can be extended to provide additional functionality such as embedded transactions and versioning, not easily implemented in traditional file systems.
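The core append-only write path of an LFS can be sketched as follows, with segment sizes shrunk for demonstration and the cleaner and inode map omitted; structure names are illustrative.

```python
# Sketch: LFS-style writes append to the current segment; full
# segments are closed and a new one starts. Nothing is overwritten
# in place, so old block versions persist until cleaned.

SEGMENT_SIZE = 8    # blocks per segment (tiny, for demonstration)

class LogStructuredWriter:
    def __init__(self):
        self.segments = [[]]         # on-"disk" segments, append-only

    def append_block(self, block):
        seg = self.segments[-1]
        if len(seg) == SEGMENT_SIZE:
            self.segments.append([]) # segment full: start a new one
            seg = self.segments[-1]
        seg.append(block)
        # A block's address is (segment number, offset within segment).
        return (len(self.segments) - 1, len(seg) - 1)

log = LogStructuredWriter()
addrs = [log.append_block(f"block-{i}") for i in range(20)]
print(addrs[0], addrs[-1])   # (0, 0) ... (2, 3)
```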
Conference Paper
Full-text available
The UNIX Fast File System (FFS) is probably the most widely-used file system for performance comparisons. However, such comparisons frequently overlook many of the performance enhancements that have been added over the past decade. In this paper, we explore the two most commonly used approaches for improving the performance of meta-data operations and recovery: journaling and Soft Updates. Journaling systems use an auxiliary log to record meta-data operations, and Soft Updates uses ordered writes to ensure meta-data consistency. The commercial sector has moved en masse to journaling file systems, as evidenced by their presence on nearly every server platform available today: Solaris, AIX, Digital UNIX, HP-UX, Irix, and Windows NT. On all but Solaris, the default file system uses journaling. In the meantime, Soft Updates holds the promise of providing stronger reliability guarantees than journaling, with faster recovery and superior performance in certain boundary cases. In this paper, we explore the benefits of Soft Updates and journaling, comparing their behavior on both microbenchmarks and workload-based macrobenchmarks. We find that journaling alone is not sufficient to "solve" the meta-data update problem. If synchronous semantics are required (i.e., meta-data operations are durable once the system call returns), then the journaling systems cannot realize their full potential. Only when this synchronicity requirement is relaxed can journaling systems approach the performance of systems like Soft Updates (which also relaxes this requirement). Our asynchronous journaling and Soft Updates systems perform comparably in most cases. While Soft Updates excels in some meta-data intensive microbenchmarks, the macrobenchmark results are more ambiguous. In three cases Soft Updates and journaling are comparable. In a file-intensive news workload, journaling prevails, and in a small ISP workload, Soft Updates prevails.
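A minimal sketch of the journaling side of this comparison follows, with invented structures: a meta-data change is recorded in the auxiliary log before the in-place structures are updated, so crash recovery replays the log instead of scanning the whole file system.

```python
# Sketch: write-ahead metadata journaling.

class JournalingFS:
    def __init__(self):
        self.journal = []            # auxiliary log (stands in for stable storage)
        self.on_disk = {}            # in-place meta-data structures

    def metadata_update(self, key, value):
        self.journal.append(("update", key, value))  # 1. log first
        self.on_disk[key] = value                    # 2. then apply in place

    def recover(self):
        # After a crash, replaying the log reconstructs meta-data
        # without a full consistency scan (no whole-disk fsck).
        replayed = {}
        for op, key, value in self.journal:
            if op == "update":
                replayed[key] = value
        return replayed

fs = JournalingFS()
fs.metadata_update("dir:/a", {"inode": 12})
assert fs.recover() == {"dir:/a": {"inode": 12}}
```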
Article
In this paper, we describe the history of Linux filesystems. We briefly introduce the fundamental concepts implemented in Unix filesystems. We present the implementation of the Virtual File System layer in Linux and we detail the Second Extended File System kernel code and user mode tools. Last, we present performance measurements made on Linux and BSD filesystems and we conclude with the current status of Ext2fs and the future directions.
Article
Our Netra NFS group at Sun set out to solve the challenging problem of providing remote Network File System (NFS) service with high performance and availability. An NFS server must guarantee the permanence of changes to the file system before acknowledging an NFS request. Thus, the server's underlying local file system must perform update operations synchronously to stable storage, with potentially high latency. Our solution to this problem involves using the Solaris Unix File System (UFS), derived from the Berkeley Fast File System (FFS), in conjunction with nonvolatile RAM (NVRAM) as fast stable storage. We evaluated the system using the LADDIS benchmark and, as a result, developed a caching technique for block-mapping information that gave us a 23% increase in measured server throughput in our standard RAID-5 server configuration. With recent increases in disk capacity and RAID technology, file-system sizes have reached a point not imagined by the FFS designers, requiring an approach to checking file-system consistency that does not grow proportionately with file-system size. We examined several log-based solutions to providing fast crash recovery, but none could use the NVRAM effectively and meet our performance requirements. As an alternative, we developed an approach that uses UFS but maintains file-system working-set information, so that the consistency checker needs to examine only the active portions of a file system. This approach met our performance goals and also reduced file-system consistency-checking times to between 3% and 25% of those in the original UFS implementation.
Article
Existing file system benchmarks are deficient in portraying performance in the ephemeral small-file regime used by Internet software, especially electronic mail, netnews, and web-based commerce. PostMark is a new benchmark to measure performance for this class of application. In this paper, PostMark test results are presented and analyzed for both UNIX and Windows NT application servers. Network Appliance Filers (file server appliances) are shown to provide superior performance (via NFS or CIFS) compared to local disk alternatives, especially at higher loads. Such results are consistent with reports from ISPs (Internet Service Providers) who have deployed NetApp filers to support such applications on a large scale.
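A PostMark-like workload can be approximated in a few lines: create a pool of small files, then run a stream of read and append transactions against them. The parameters below are illustrative defaults, not PostMark's own, and the temporary directory (and its files) is removed automatically on exit.

```python
# Sketch: a PostMark-style small-file transaction workload.

import os
import random
import tempfile
import time

def postmark_like(n_files=500, n_txns=2000, size=(512, 4096)):
    with tempfile.TemporaryDirectory() as d:
        files = []
        for i in range(n_files):                     # creation phase
            p = os.path.join(d, f"f{i}")
            with open(p, "wb") as f:
                f.write(os.urandom(random.randint(*size)))
            files.append(p)
        start = time.time()
        for _ in range(n_txns):                      # transaction phase
            p = random.choice(files)
            if random.random() < 0.5:
                with open(p, "rb") as f:             # read transaction
                    f.read()
            else:
                with open(p, "ab") as f:             # append transaction
                    f.write(os.urandom(256))
        return n_txns / (time.time() - start)        # transactions/sec

print(f"{postmark_like():.0f} txns/sec")
```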
Conference Paper
One of the fundamental limits to high-performance, high-reliability file systems is memory's vulnerability to system crashes. Because memory is viewed as unsafe, systems periodically write data back to disk. The extra disk traffic lowers performance, and the delay period before data is safe lowers reliability. The goal of the Rio (RAM I/O) file cache is to make ordinary main memory safe for persistent storage by enabling memory to survive operating system crashes. Reliable memory enables a system to achieve the best of both worlds: reliability equivalent to a write-through file cache, where every write is instantly safe, and performance equivalent to a pure write-back cache, with no reliability-induced writes to disk. To achieve reliability, we protect memory during a crash and restore it during a reboot (a "warm" reboot). Extensive crash tests show that even without protection, warm reboot enables memory to achieve reliability close to that of a write-through file system. Adding protection makes memory even safer than a write-through file system while adding essentially no overhead. By eliminating reliability-induced disk writes, Rio performs 4-22 times as fast as a write-through file system, 2-14 times as fast as a standard Unix file system, and 1-3 times as fast as an optimized system that risks losing 30 seconds of data and metadata.
Conference Paper
Over the last few years, there have been several efforts to use logging to improve the performance, reliability, and recovery times of file systems. The two major techniques are metadata logging, where the log records metadata changes and is a supplement to the on-disk file system, and log-structured file systems, whose log is their only on-disk representation. When the file system is mainly or wholly accessed through the Network File System (NFS) protocol, new considerations arise for the suitability of the logging technique. NFS requires that all operations be written to stable storage before returning; as a result, file system implementations that were effective for local access may perform poorly on an NFS server. This paper analyzes the issues regarding the use of logging on an NFS server, and describes an implementation of a BSD Fast File System (FFS) with metadata logging that performs effectively for a dedicated NFS server.
Article
A reimplementation of the UNIX™ file system is described. The reimplementation provides substantially higher throughput rates by using more flexible allocation policies that allow better locality of reference and can be adapted to a wide range of peripheral and processor characteristics. The new file system clusters data that is sequentially accessed and provides two block sizes to allow fast access to large files while not wasting large amounts of space for small files. File access rates up to ten times faster than in the traditional UNIX file system are experienced. Long-needed enhancements to the programmers' interface are discussed, including a mechanism to place advisory locks on files, extensions of the name space across file systems, the ability to use long file names, and provisions for administrative control of resource usage.
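The two-block-size scheme is easy to work through numerically. The 8 KB block and 1 KB fragment sizes below are one classic FFS configuration, used here purely for illustration: large files consume full blocks, and the tail of a file goes into smaller fragments so a small file does not waste a whole block.

```python
# Worked example: FFS-style allocation with blocks plus fragments.

BLOCK = 8192   # illustrative full-block size
FRAG = 1024    # illustrative fragment size

def ffs_allocation(file_size):
    """Return (full_blocks, fragments, allocated_bytes) for a file."""
    full_blocks = file_size // BLOCK
    tail = file_size - full_blocks * BLOCK
    fragments = -(-tail // FRAG)      # ceiling division for the tail
    return full_blocks, fragments, full_blocks * BLOCK + fragments * FRAG

print(ffs_allocation(500))      # (0, 1, 1024): a tiny file costs one fragment
print(ffs_allocation(20000))    # (2, 4, 20480): two blocks plus four fragments
```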