Conference Paper

Revitalizing Buffered I/O: Optimizing Page Reclaim and I/O Throttling

References
Article
Flash-based solid-state drives (SSDs) are widely used in industry and academia due to their higher bandwidth and lower latency compared with traditional hard disk drives (HDDs). SSDs with the Non-Volatile Memory Express (NVMe) interface provide even higher performance and ultra-low latency compared with Serial AT Attachment (SATA) SSDs, and are therefore adopted as fast storage devices in many systems. However, the performance of NVMe SSDs can be degraded by unfavorable I/O access patterns: random writes, for example, interact poorly with SSD characteristics such as out-of-place updates and garbage collection. This paper proposes an address remapping scheme that improves the I/O performance of NVMe SSDs by transforming random access patterns into sequential ones inside the NVMe device driver. This allows the scheme to support widely used file systems such as EXT4, XFS, BTRFS, and F2FS without any modification to the device. Experimental results show that the proposed scheme improves NVMe SSD performance by up to 64.1% compared with the existing scheme.
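The core idea described above can be illustrated with a small logical-to-physical remapping table. The user-space C sketch below is only an illustration of the general technique, not the paper's driver code; the remap_write()/remap_read() interface and table size are hypothetical, and garbage collection of stale physical blocks is omitted.

/* Minimal sketch of LBA remapping that turns random writes into
 * a sequential stream; names and sizes are illustrative. */
#include <stdint.h>
#include <string.h>

#define NUM_LBAS (1u << 20)         /* 1M logical blocks (example size) */
#define INVALID  UINT32_MAX

static uint32_t remap[NUM_LBAS];    /* logical -> physical block */
static uint32_t next_pba;           /* sequential write frontier */

void remap_init(void)
{
    memset(remap, 0xff, sizeof(remap));  /* every entry = INVALID */
    next_pba = 0;
}

/* On a write, ignore the LBA's old location and append at the
 * frontier, so the device only ever sees sequential writes. */
uint32_t remap_write(uint32_t lba)
{
    remap[lba] = next_pba++;
    return remap[lba];              /* physical address to issue */
}

/* Reads consult the table to find the block's current location. */
uint32_t remap_read(uint32_t lba)
{
    return remap[lba];              /* INVALID if never written */
}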
Conference Paper
Computer engineers in academia and industry rely on a standardized set of benchmarks to quantitatively evaluate the performance of computer systems and research prototypes. SPEC CPU2017 is the most recent incarnation of standard benchmarks designed to stress a system's processor, memory subsystem, and compiler. This paper describes the results of measurement-based studies focusing on characterization, performance, and energy-efficiency analyses of SPEC CPU2017 on Intel's Core i7-8700K. Intel and GNU compilers are used to create the executables used in the performance studies; the results show that executables produced by the Intel compilers outperform those produced by the GNU compilers. We characterize all the benchmarks, perform a top-down microarchitectural analysis to identify performance bottlenecks, and test benchmark scalability with respect to performance and energy. Findings from these studies can be used to guide future performance evaluations and computer architecture research.
Article
Most Android smartphones use zRAM swapping to increase the effective main memory capacity, but the gain is limited because zRAM itself consumes main memory. Unfortunately, they cannot use secondary storage as swap space due to its long response times and the wear-out problem of flash. In this paper, we propose a hybrid swapping scheme based on per-process reclaim that supports both secondary-storage swapping and zRAM swapping. It attempts to swap out all the pages in the working set of a process to the zRAM swap space, rather than killing the process selected by the low-memory killer (LMK), and to swap out the least recently used pages to the secondary-storage swap space. The rationale is that pages swapped in and out frequently should use zRAM while less frequently used pages should go to secondary storage, reducing the overall page operation cost. Our scheme resolves both the response-time and wear-out problems of secondary-storage swapping and zSWAP, and overcomes the size limitation of the zRAM swap space. According to our measurements, it increases the memory extension ratio by 15~17% and 6~17% and reduces the page operation cost by 9~22% and 18~28%, respectively, compared to zRAM swapping and zSWAP.
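The placement policy the abstract describes can be sketched as a simple decision function. Everything below (pick_swap_target, the hot_window threshold, the page_info fields) is a hypothetical illustration of hot-to-zRAM / cold-to-flash placement, not the paper's implementation.

/* Illustrative placement policy for the hybrid swapping idea:
 * hot (recently referenced) pages go to zRAM, cold pages to flash.
 * All names and thresholds are assumptions, not from the paper. */
#include <stdbool.h>
#include <stdint.h>

enum swap_target { SWAP_ZRAM, SWAP_FLASH };

struct page_info {
    uint64_t last_access;   /* e.g., from the per-process reclaim scan */
    bool     in_working_set;
};

enum swap_target pick_swap_target(const struct page_info *p,
                                  uint64_t now, uint64_t hot_window)
{
    /* Pages referenced within the window are likely to be swapped
     * back in soon: keep them in fast, wear-free zRAM. */
    if (p->in_working_set || now - p->last_access < hot_window)
        return SWAP_ZRAM;

    /* Cold pages go to secondary storage, freeing zRAM capacity
     * and limiting flash wear to infrequently used data. */
    return SWAP_FLASH;
}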
Article
PolarFS is a distributed file system with ultra-low latency and high availability, designed for the POLARDB database service, which is now available on Alibaba Cloud. PolarFS utilizes a lightweight network stack and I/O stack in user space, taking full advantage of emerging techniques such as RDMA, NVMe, and SPDK. As a result, the end-to-end latency of PolarFS has been reduced drastically; our experiments show that its write latency is quite close to that of a local file system on SSD. To keep replicas consistent while maximizing I/O throughput, we developed ParallelRaft, a consensus protocol derived from Raft, which breaks Raft's strict serialization by exploiting the tolerance of databases to out-of-order I/O completion. ParallelRaft inherits the understandability and easy implementation of Raft while providing much better I/O scalability for PolarFS. We also describe the shared-storage architecture of PolarFS, which provides strong support for POLARDB.
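ParallelRaft's key relaxation, acknowledging a log entry before all of its predecessors have arrived as long as no missing predecessor writes an overlapping block range, can be sketched as an overlap check over a look-behind window. The structures and window size below are illustrative assumptions, not PolarFS code.

/* Sketch of out-of-order acceptance with a look-behind window. */
#include <stdbool.h>
#include <stdint.h>

#define LOOKBEHIND 32               /* example window size */

struct log_entry {
    uint64_t index;                 /* log position */
    uint64_t lba, len;              /* target block range of the write */
    bool     received;
};

static bool ranges_overlap(const struct log_entry *a,
                           uint64_t lba, uint64_t len)
{
    return a->lba < lba + len && lba < a->lba + a->len;
}

/* window[] holds metadata for the last LOOKBEHIND entries.
 * Entry e may be applied out of order only if every missing
 * predecessor is known not to touch an overlapping range. */
bool can_apply_out_of_order(const struct log_entry *window,
                            const struct log_entry *e)
{
    uint64_t start = e->index > LOOKBEHIND ? e->index - LOOKBEHIND : 0;
    for (uint64_t i = start; i < e->index; i++) {
        const struct log_entry *prev = &window[i % LOOKBEHIND];
        if (prev->index != i)
            return false;           /* no metadata: be conservative */
        if (!prev->received && ranges_overlap(prev, e->lba, e->len))
            return false;           /* possible write-write conflict */
    }
    return true;
}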
Conference Paper
In this paper we propose a new paradigm and algorithms to address cache writeback performance in servers and storage arrays. As servers and storage processors move to multi-core architectures with ever-increasing memory caches, the cost of flushing these caches to disk has become a problem. Traditional watermark-based algorithms, currently used in many storage arrays and NAS servers, have trouble keeping up with the higher rates of incoming application writes, often resulting in a performance penalty. The server's cache is generally used to hide the high disk latencies associated with file system data; in general, metadata performance was optimized, while application data was considered less latency-sensitive and was given lower priority or written directly to disk. The algorithms proposed here change application-data writeback from watermark-based flushing to flushing at a rate that approximates that of the incoming application I/Os. The problem is most critical for network file systems, where complex client/server protocols can make writeback a serious performance barrier, particularly in light of very large I/Os and the lack of application commits. Our proposed algorithms are applicable to local file systems and remote servers as well as to storage arrays. We show test results based on dynamic traces of real file system dirty pages in the buffer cache and show that rate-based cache writeback algorithms are the most efficient replacement for watermark-based flushing.
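A minimal sketch of the rate-based idea, under assumed names (wb_quota, wb_state) that are not from the paper: each interval, flush roughly as many pages as applications dirtied, plus a small corrective term that drains any backlog gradually instead of in watermark-triggered bursts.

/* Rate-matched writeback sketch: quota follows the incoming dirty
 * rate rather than waiting for a high watermark. Illustrative only. */
#include <stdint.h>

struct wb_state {
    uint64_t dirtied_prev;   /* cumulative pages dirtied, last interval */
    uint64_t backlog_target; /* dirty pages we are willing to hold */
};

/* Called once per interval; returns how many pages the flusher
 * should write back during the next interval. */
uint64_t wb_quota(struct wb_state *s, uint64_t dirtied_total,
                  uint64_t dirty_now)
{
    uint64_t incoming = dirtied_total - s->dirtied_prev;
    s->dirtied_prev = dirtied_total;

    /* Base rate mirrors the application; the correction drains
     * (or rebuilds) the backlog a fraction at a time. */
    int64_t correction = ((int64_t)dirty_now -
                          (int64_t)s->backlog_target) / 8;
    int64_t quota = (int64_t)incoming + correction;
    return quota > 0 ? (uint64_t)quota : 0;
}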
Conference Paper
While the use of MapReduce systems (such as Hadoop) for large scale data analysis has been widely recognized and studied, we have recently seen an explosion in the number of systems developed for cloud data serving. These newer systems address "cloud OLTP" applications, though they typically do not support ACID transactions. Examples of systems proposed for cloud serving use include BigTable, PNUTS, Cassandra, HBase, Azure, CouchDB, SimpleDB, Voldemort, and many others. Further, they are being applied to a diverse range of applications that differ considerably from traditional (e.g., TPC-C like) serving workloads. The number of emerging cloud serving systems and the wide range of proposed applications, coupled with a lack of apples-to-apples performance comparisons, makes it difficult to understand the tradeoffs between systems and the workloads for which they are suited. We present the Yahoo! Cloud Serving Benchmark (YCSB) framework, with the goal of facilitating performance comparisons of the new generation of cloud data serving systems. We define a core set of benchmarks and report results for four widely used systems: Cassandra, HBase, Yahoo!'s PNUTS, and a simple sharded MySQL implementation. We also hope to foster the development of additional cloud benchmark suites that represent other classes of applications by making our benchmark tool available via open source. In this regard, a key feature of the YCSB framework/tool is that it is extensible: it supports easy definition of new workloads, in addition to making it easy to benchmark new systems.
Article
In a path-breaking paper last year, Pat and Betty O'Neil and Gerhard Weikum proposed a self-tuning improvement to the Least Recently Used (LRU) buffer management algorithm [15]. Their improvement is called LRU/k and gives priority to buffer pages based on the kth most recent access. (The standard LRU algorithm is denoted LRU/1 in this terminology.) If P1's kth most recent access is more recent than P2's, then P1 will be replaced after P2. Intuitively, LRU/k for k > 1 is a good strategy, because it gives low priority to pages that have been scanned or that belong to a big randomly accessed file (e.g., the account file in TPC/A). They found that LRU/2 achieves most of the advantage of their method. The one problem of LRU/2 is the processor overhead to implement it: in contrast to LRU, each page access requires log N work to manipulate a priority queue, where N is the number of pages in the buffer. Question: is there a low-overhead way (constant overhead per ...
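The LRU/2 bookkeeping described above can be sketched by tracking each frame's two most recent access times and evicting the frame whose second-most-recent access is oldest. The sketch below uses an O(N) victim scan for clarity (the priority-queue variant is the log N version the abstract mentions); all names and sizes are illustrative.

/* LRU/2 sketch: evict the frame with the oldest second-most-recent
 * access time. Illustrative, with an O(N) scan instead of a heap. */
#include <stdint.h>

#define NPAGES 4096                 /* example buffer size */

struct frame {
    uint64_t last, prev;            /* two most recent access times */
    int      valid;
};

static struct frame buf[NPAGES];

void touch(int slot, uint64_t now)
{
    buf[slot].prev = buf[slot].last;  /* shift the access history */
    buf[slot].last = now;
    buf[slot].valid = 1;
}

/* A frame touched only once has prev == 0 (infinitely old), so it is
 * preferred for eviction, matching LRU/2's treatment of scan pages. */
int pick_victim(void)
{
    int victim = -1;
    uint64_t oldest = UINT64_MAX;
    for (int i = 0; i < NPAGES; i++) {
        if (buf[i].valid && buf[i].prev < oldest) {
            oldest = buf[i].prev;
            victim = i;
        }
    }
    return victim;                  /* -1 if the buffer is empty */
}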
Ziggurat: A tiered file system for non-volatile main memories and disks
  • S Zheng
  • M Hoseinzadeh
  • S Swanson
FIO: Flexible I/O tester
  • Axboe
No-I/O dirty throttling
  • Corbet
Analysis and mitigation of writeback cache lock-ups in Linux
  • Orero
On stacking a persistent memory file system on legacy file systems
  • H Woo
  • D Han
  • S Ha
  • S H Noh
  • B Nam
The new and improved filebench
  • Wilson
Asynchronous I/O stack: A low-latency kernel I/O stack for ultra-low latency SSDs
  • G Lee
  • S Shin
  • W Song
  • T J Ham
  • J W Lee
  • J Jeong