Article

Programmable solid-state storage in future cloud datacenters

Authors: Jaeyoung Do, Sudipta Sengupta, and Steven Swanson

Abstract

Programmable software-defined solid-state drives can move computing functions closer to storage.

... The data have to be read first and then transferred to the host before being processed. Fortunately, Computational Storage [16,70], an emerging technology, allows placing computing capabilities directly on storage (i.e., in-storage processing), enabling data to be processed in the place where they reside. By bringing intelligence to the storage media itself, data reduction and qualification can occur before the data are sent to the host, increasing the system's overall throughput. ...
... Table 1 shows the selected object trackers and the compute resources on which they have been tested. Among the object trackers, it is worth noting that only YOLO and GOTURN are implemented for GPU execution. In addition, YOLO is the only one that supports multithreaded CPU execution. ...
... For more details, see Section 4.2. Note that we used YOLOv3 for the GPU execution and YOLO-Lite for the CPU/CSD execution. ...
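The in-storage filtering idea in the first excerpt above can be made concrete with a minimal back-of-the-envelope sketch in Python. The record size, record count, and predicate below are invented for illustration; only the bytes crossing the storage interconnect differ between the two paths.

```python
# Toy model: filtering on the host vs. inside the storage device.
# Record layout, selectivity, and sizes are illustrative assumptions.

RECORD_SIZE = 4096          # bytes per record (assumption)
NUM_RECORDS = 1_000_000

def predicate(record_id: int) -> bool:
    """Stand-in qualification test; roughly 1% of records qualify."""
    return record_id % 100 == 0

# Host-side filtering: every record is read AND transferred, then tested.
bytes_to_host_baseline = NUM_RECORDS * RECORD_SIZE

# In-storage filtering: records are tested where they reside; only
# qualifying records cross the interconnect to the host.
qualifying = sum(1 for r in range(NUM_RECORDS) if predicate(r))
bytes_to_host_isp = qualifying * RECORD_SIZE

print(f"host-side filter : {bytes_to_host_baseline / 2**30:.1f} GiB transferred")
print(f"in-storage filter: {bytes_to_host_isp / 2**30:.2f} GiB transferred")
```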
Article
Full-text available
The growing volume of data produced continuously in the Cloud and at the Edge poses significant challenges for large-scale AI applications to extract and learn useful information from the data in a timely and efficient way. The goal of this article is to explore the use of computational storage to address such challenges through distributed near-data processing. We describe Newport, a high-performance and energy-efficient computational storage drive developed to realize the full potential of in-storage processing. To the best of our knowledge, Newport is the first commodity SSD that can be configured to run a server-like operating system, greatly minimizing the effort for creating and maintaining applications running inside the storage. We analyze the benefits of using Newport by running complex AI applications such as image similarity search and object tracking on a large visual dataset. The results demonstrate that data-intensive AI workloads can be efficiently parallelized and offloaded, even to a small set of Newport drives, with significant performance gains and energy savings. In addition, we introduce a comprehensive taxonomy of existing computational storage solutions, together with a realistic cost analysis for high-volume production, giving a clear big-picture view of the economic feasibility of computational storage technology.
... Modern SSDs package processing (e.g., storage controller) and storage components (e.g., DRAM, Flash) for routine tasks such as mapping and garbage collection. These computing resources present an opportunity to execute user-defined functions inside the SSDs [8,20,34], which has evolved from the pioneering idea of Jim Gray's active disks [13] to a new generation of SSDs allowing such in situ processing, called "computational SSDs" [10]. Computational SSDs include general-purpose, multi-GHz clock speed, multi-core processors with built-in hardware accelerators (e.g., compression and decompression [3], pattern matching [14], FPGA [37]) to offload compute-intensive tasks from the processors, multiple GBs of DRAM, and tens of independent flash channels to the underlying storage media, allowing GB/s of internal data throughput. ...
... Modern SSDs contain computing components (such as embedded processors and DRAM) to perform various SSD management tasks, providing interesting opportunities to run user-defined programs inside the SSDs. An overview of the concept of programmable SSDs is described in [10]. There is clear industrial interest in exploiting programmable SSDs [30], so research in this area is likely to have a high payoff. ...
Article
Full-text available
Data should be placed at the most cost- and performance-effective tier in the storage hierarchy. While performance and cost decrease with distance from the CPU, the cost/performance trade-off depends on how efficiently data can be moved across tiers. Log structuring improves this cost/performance by writing batches of pages from main memory to secondary storage using a conventional block-at-a-time I/O interface. However, log structuring incurs overhead in the form of recovery and garbage collection. With computational Solid-State Drives, it is now possible to design a storage interface that minimizes this overhead. In this paper, we offload log structuring from the CPU to the SSD. We define a new batch I/O storage interface and we design a Flash Translation Layer that takes care of log structuring on the SSD side. This removes the CPU computational and I/O load associated with recovery and garbage collection. We compare the performance of the Bw-tree key-value store with its LLAMA host-based log structuring to the same key-value software stack executing on a computational SSD equipped with a batch I/O interface. Our experimental results show the benefits of eliminating redundancies, minimizing interactions across storage layers, and avoiding the CPU cost of providing log structuring.
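As a rough illustration of what such a batch I/O interface could look like, the following Python sketch models a device-side FTL that accepts whole batches of pages and performs the log structuring (append, mapping, garbage collection) internally, so the host needs no log-structuring, GC, or recovery logic of its own. The names and structures are invented, not the paper's actual interface.

```python
# Minimal sketch of a "batch I/O" interface with device-side log
# structuring. Everything below is an illustrative assumption.

class BatchLogFTL:
    def __init__(self, gc_threshold: int):
        self.log: list[tuple[int, bytes]] = []   # append-only (page_id, data)
        self.mapping: dict[int, int] = {}        # page_id -> log offset
        self.gc_threshold = gc_threshold

    def write_batch(self, batch: dict[int, bytes]) -> None:
        """One batch call replaces many single-page writes from the host."""
        for page_id, data in batch.items():
            self.mapping[page_id] = len(self.log)
            self.log.append((page_id, data))
        if len(self.log) > self.gc_threshold:
            self._garbage_collect()

    def read(self, page_id: int) -> bytes:
        return self.log[self.mapping[page_id]][1]

    def _garbage_collect(self) -> None:
        """Device-side GC: keep only entries the mapping still points at."""
        live = [(pid, data) for i, (pid, data) in enumerate(self.log)
                if self.mapping[pid] == i]
        self.log = live
        self.mapping = {pid: i for i, (pid, _) in enumerate(live)}

ftl = BatchLogFTL(gc_threshold=4)
ftl.write_batch({1: b"a", 2: b"b"})
ftl.write_batch({1: b"a2", 3: b"c", 4: b"d"})   # supersedes page 1, triggers GC
assert ftl.read(1) == b"a2" and ftl.read(2) == b"b"
```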
... Hence, a system integrating CSA is likely a heterogeneous compute platform. By splitting computation and offloading part of data processing to the storage system, CSA minimizes data movement through the storage interconnect, i.e., the network or a local bus (PCIe) [28,47]. CSA has therefore been adopted by the database community for some time, more recently by pushing down parts of queries to computational storage servers [15,93] or to computational storage devices [59,69]. ...
... Table 1: Related work (§4) in the integration of network (net), storage (sto), and accelerator (accel) devices:
  • Net + Accel: SmartNICs [5,110], AccelNet [53], hXDP [35]
  • Net + GPU: GPUDirect [102], GPUNet [78]
  • Sto + GPU: Donard [22], SPIN [25], GPUfs [124], GPUDirect Storage [103], NVIDIA BaM [113]
  • Net + Sto: iSCSI, NVMe-oF (offload [117], BlueField [5]), i10 [68], ReFlex [80]
  • Sto + Accel: ASIC/CPU [60,83,121], GPUs [25,26,124], FPGA [69,116,119,143], Hayagui [15]
  • Hybrid: systems with ARM SoC [3,47,90], BEE3 [44], hybrid CPU-FPGA systems [39,41]
  • DPUs: Hyperion (stand-alone), Fungible DPU (MIPS64 R6 cores) [54], Pensando (host-attached P4-programmable processor) [108], BlueField (host-attached, with ARM cores) [5] ...
Preprint
Since the inception of computing, we have been reliant on CPU-powered architectures. However, today this reliance is challenged by manufacturing limitations (CMOS scaling), performance expectations (stalled clocks, Turing tax), and security concerns (microarchitectural attacks). To re-imagine our computing architecture, in this work we take a more radical but pragmatic approach and propose to eliminate the CPU with its design baggage, and integrate three primary pillars of computing, i.e., networking, storage, and computing, into a single, self-hosting, unified CPU-free Data Processing Unit (DPU) called Hyperion. In this paper, we present the case for Hyperion, its design choices, initial work-in-progress details, and seek feedback from the systems community.
... Some storage systems provide specific computational capabilities ("active storage"), e.g. for compression and checksumming [5]. Active storage requires tight integration with the I/O application layer to use its capabilities. ...
Preprint
Full-text available
This document discusses the state, roadmap, and risks of the foundational components of ROOT with respect to the experiments at the HL-LHC (Run 4 and beyond). As foundational components, the document considers in particular the ROOT input/output (I/O) subsystem. The current HEP I/O is based on the TFile container file format and the TTree binary event data format. The work going into the new RNTuple event data format aims at superseding TTree, to make RNTuple the production ROOT event data I/O that meets the requirements of Run 4 and beyond.
... The idea of programmable storage is not new, and has been explored previously with HDDs [5,22,31,43], though the low disk bandwidths dominated the data processing costs. With the rise of Non-Volatile Memory (NVM) storage the idea is being revisited as NVM SSDs offer significant device-internal bandwidths [18], and already have an element of programmability to run management code like Flash Translation Layer (FTL) for NAND flash chips. The elimination (or reduction) of data movement due to storage programmability offers energy, cost, and performance benefits. ...
Preprint
The Big Data trend is putting strain on modern storage systems, which have to support high-performance I/O accesses for large quantities of data. With the prevalent Von Neumann computing architecture, this data is constantly moved back and forth between the computing (i.e., CPU) and storage entities (DRAM, Non-Volatile Memory (NVM) storage). Hence, as the data volume grows, this constant data movement between the CPU and storage devices has emerged as a key performance bottleneck. To improve the situation, researchers have advocated leveraging computational storage devices (CSDs), which offer a programmable interface to run user-defined data processing operations close to the storage without excessive data movement, thus offering performance improvements. However, despite its potential, building CSD-aware applications remains a challenging task due to the lack of exploration and experimentation with the right API and abstraction, a consequence of the limited accessibility of the latest CSD/NVM devices, emerging device interfaces, and the closed-source software internals of these devices. To remedy the situation, in this work we present an open-source CSD prototype over emerging NVMe Zoned Namespaces (ZNS) SSDs and an interface that can be used to explore application designs for CSD/NVM storage devices. In this paper we summarize the current state of the practice with CSD devices, make a case for designing a CSD prototype with the ZNS interface and eBPF (ZCSD), and present our initial findings. The prototype is available at https://github.com/Dantali0n/qemu-csd.
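The ZNS-plus-offload combination can be pictured with a small Python analogy. The real prototype uses NVMe Zoned Namespaces commands and eBPF programs; the zone semantics below (sequential appends, device-side kernels that return only results) mirror the idea, but every name is invented for illustration.

```python
# A Python analogy of the ZNS + offloaded-kernel idea; not the real
# qemu-csd interface. Zones allow only sequential appends, and a small
# user-supplied "kernel" (eBPF in the real system) runs device-side.

class Zone:
    """A zone supports sequential appends up to a fixed capacity."""
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.records: list[bytes] = []

    def append(self, record: bytes) -> int:
        if len(self.records) >= self.capacity:
            raise IOError("zone full")
        self.records.append(record)
        return len(self.records) - 1      # position of the appended record

class ZonedCSD:
    def __init__(self, zones: int, zone_cap: int):
        self.zones = [Zone(zone_cap) for _ in range(zones)]

    def run_kernel(self, zone_id: int, kernel) -> list[bytes]:
        """Run a sandboxed filter over a zone's data inside the device,
        returning only the (small) result to the host."""
        return [r for r in self.zones[zone_id].records if kernel(r)]

dev = ZonedCSD(zones=4, zone_cap=1024)
for i in range(100):
    dev.zones[0].append(f"event,{i},{'ERROR' if i % 10 == 0 else 'OK'}".encode())

errors = dev.run_kernel(0, lambda rec: b"ERROR" in rec)
print(len(errors), "matching records returned instead of the whole zone")
```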
... The new SSDs depend on complicated firmware running on sophisticated processing engines, and developers already have customizable firmware builds for cloud operations. By introducing this programmability via user-friendly interfaces, cloud datacenter storage platforms would be able to respond to dynamically evolving requirements on the fly, offering lower latency and throughput higher by orders of magnitude [52], specifically with Non-Volatile Memory Express (NVMe) SSDs. As a result, NVMe SSDs are increasingly replacing HDDs in large data centers [16]. ...
Thesis
Due to the rapid growth of data, DBMS technology is under increasing demand to satisfy throughput and latency requirements. At the same time, exploring new strategies for coping with a large variety of computational environments has become particularly relevant in this era of hardware heterogeneity. Many pioneers in this field have made substantial advances in data processing methodologies on new hardware and have used emerging technology to effectively process a wide range of data over the last few years. However, excessive data movement is a major concern for modern data centers, as it can reduce throughput and increase latency while also increasing the average energy consumption per query. Near-data processing is a framework that brings computing closer to where the data resides, so that operations can be performed there and unnecessary data transfer is prevented. Although several research efforts have explored near-data processing using modern hardware, the challenge of designing a hardware-aware optimizer that can cooperate with a multitude of near-data-processing hardware remains open. ReProVide (Reconfigurable data ProVider) is a novel FPGA-based System-on-Chip (SoC) architecture designed for near-data processing of large data sources. This framework serves both as a storage system and as a reconfigurable data (pre-)processing interface, with its own query processing capabilities, between various data sources and the host systems requesting data. Our contributions to the field of near-data processing on reconfigurable hardware, namely ReProVide, are discussed in this thesis. The following two main topics are addressed in particular:
• The first part discusses how to integrate the underlying hardware's capabilities for query processing, especially during query optimization, into the ReProVide near-data processor. To that end, we propose COPRAO, a hardware-aware optimizer that takes into account the computational capability and compute-resource availability of the ReProVide hardware connected to the host. The optimizer uses different optimization rules and cost models to efficiently offload query operations to the hardware based on its dynamic state and capabilities. It also has the potential to significantly reduce the DBMS and network overhead associated with transferring large amounts of data to computing centers. Our prototype database engine, which is based on Apache Calcite, enables us to include this extension to the optimization methodology and helps demonstrate the concept's feasibility.
• The second part of this thesis explores novel techniques for query-sequence optimization that reap the benefits of COPRAO's capabilities. Client applications send sequences of relational queries to database servers that handle massive amounts of data. The execution of these streams of queries matters because application performance depends on the latency at which results are obtained. Furthermore, for many applications the sequences of submitted queries exhibit patterns that can be used to predict future requests. We therefore explore COPRAO optimization strategies, as well as ReProVide-specific local optimization methods that consider knowledge about upcoming queries, to achieve maximum throughput.
For both topics, we motivate our implementations, present and discuss experimental and analytical evaluations, and outline potential future research directions.
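The core decision a hardware-aware optimizer like COPRAO must make, whether pushing an operator down to near-data hardware pays off, can be sketched with a toy cost model. All constants (link bandwidth, processing rates, selectivities) below are illustrative assumptions, not measurements from the thesis.

```python
# A minimal, hypothetical cost model in the spirit of a hardware-aware
# optimizer: offload a filter to near-data hardware only when that is
# predicted to be faster. All constants are illustrative assumptions.

def time_on_host(rows, row_bytes, link_gbps, host_rows_per_s):
    """Ship all rows to the host, then filter there."""
    transfer = rows * row_bytes * 8 / (link_gbps * 1e9)
    return transfer + rows / host_rows_per_s

def time_offloaded(rows, row_bytes, selectivity, link_gbps, dev_rows_per_s):
    """Filter near the data, then ship only qualifying rows."""
    transfer = rows * selectivity * row_bytes * 8 / (link_gbps * 1e9)
    return rows / dev_rows_per_s + transfer

rows, row_bytes = 100_000_000, 128
for selectivity in (0.5, 0.01):
    host = time_on_host(rows, row_bytes, link_gbps=100, host_rows_per_s=5e8)
    dev = time_offloaded(rows, row_bytes, selectivity,
                         link_gbps=100, dev_rows_per_s=1e8)
    plan = "offload" if dev < host else "keep on host"
    print(f"selectivity={selectivity:>4}: host={host:.2f}s "
          f"offload={dev:.2f}s -> {plan}")
```

With these numbers, a highly selective filter wins by offloading (little data crosses the link), while a non-selective one stays on the faster host cores: exactly the state- and capability-dependent choice described above.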
... In order to enable highly efficient SSD execution, modern storage solutions rely on programmable memory systems [15]. Leveraging this compute capability, there has been much work on both software and hardware solutions for Near Data Processing in SSDs for other datacenter applications [14,18,25,31,35,36,38,39]. Previous works which target more general SSD NDP solutions have relied on hardware modifications, complex programming frameworks, and heavily modified driver subsystems to support the performance requirements of more complex and general computational tasks. ...
Preprint
Full-text available
Neural personalized recommendation models are used across a wide variety of datacenter applications including search, social media, and entertainment. State-of-the-art models comprise large embedding tables that have billions of parameters requiring large memory capacities. Unfortunately, large and fast DRAM-based memories levy high infrastructure costs. Conventional SSD-based storage solutions offer an order of magnitude larger capacity, but have worse read latency and bandwidth, degrading inference performance. RecSSD is a near data processing based SSD memory system customized for neural recommendation inference that reduces end-to-end model inference latency by 2X compared to using COTS SSDs across eight industry-representative models.
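The mechanism behind such an SSD-side recommendation path can be sketched briefly: rather than shipping every embedding row across the storage interface, the device gathers the rows and returns only the pooled vector. The table size, embedding dimension, and sum pooling below are assumptions for illustration, not RecSSD's actual parameters.

```python
# Sketch of near-data embedding lookup: gather rows "inside" the device
# and return one pooled vector. Sizes are illustrative assumptions.

import random

DIM = 64                                  # embedding dimension (assumption)
table = [[random.random() for _ in range(DIM)] for _ in range(10_000)]

def device_side_pooled_lookup(indices: list[int]) -> list[float]:
    """Runs device-side: gather rows, sum-pool, return DIM floats."""
    pooled = [0.0] * DIM
    for i in indices:
        row = table[i]
        for d in range(DIM):
            pooled[d] += row[d]
    return pooled

indices = random.sample(range(10_000), 80)   # one sparse-feature lookup
pooled = device_side_pooled_lookup(indices)

bytes_naive = len(indices) * DIM * 4   # ship 80 rows (hypothetical float32 wire format)
bytes_ndp = DIM * 4                    # ship one pooled vector
print(f"naive: {bytes_naive} B, near-data: {bytes_ndp} B per lookup")
```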
... The recent advancement of reinforcement learning (RL) has paved the way for solving many complicated problems, including but not limited to robotics (Lillicrap et al. 2015; Gu et al. 2017; Hwangbo et al. 2019), video game playing (Jaderberg et al. 2019; Vinyals et al. 2019), autonomous driving (Sallab et al. 2017; Kiran et al. 2020), and neural architecture search (Baker et al. 2017; Tan and Le 2019). Even though there has been a series of investigations applying machine learning to system optimizations (Dean 2017; Lecuyer et al. 2017; Bychkovsky et al. 2018; Song et al. 2020), there is an absence of RL caching systems competitive with existing heuristics, due to two principal unresolved issues: the lack of an adequate Markov property in the problem formulation and the prohibitively large training overhead of the online learning process. First, to achieve good learning efficiency, the problem formulation (especially the state formulation) should exhibit a strong Markov property. ...
Preprint
Full-text available
With data durability, high access speed, low power consumption and byte addressability, NVMe and SSD, which are acknowledged representatives of emerging storage technologies, have been applied broadly in many areas. However, one key issue with high-performance adoption of these technologies is how to properly define intelligent cache layers such that the performance gap between emerging technologies and main memory can be well bridged. To this end, we propose Phoebe, a reuse-aware reinforcement learning framework for optimal online caching that is applicable to a wide range of emerging storage models. By continuously interacting with the cache environment and the data stream, Phoebe is capable of extracting critical temporal data dependency and relative positional information from a single trace, becoming ever smarter over time. To reduce training overhead during online learning, we utilize periodical training to amortize costs. Phoebe is evaluated on a set of Microsoft cloud storage workloads. Experimental results show that Phoebe is able to close the gap in cache miss rate from LRU and a state-of-the-art online-learning-based cache policy to Belady's optimal policy by 70.3% and 52.6%, respectively.
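To ground the two reference points named in that evaluation, the sketch below simulates LRU and Belady's offline-optimal policy (evict the block whose next use is farthest in the future) on a synthetic trace. It is a minimal illustration of the gap an online policy such as Phoebe tries to close, not the paper's workload.

```python
# Tiny cache simulator comparing LRU with Belady's optimal policy.
# The trace is synthetic (80% hot set, 20% cold set); an assumption.

from collections import OrderedDict
import random

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for x in trace:
        if x in cache:
            cache.move_to_end(x)
        else:
            misses += 1
            if len(cache) >= capacity:
                cache.popitem(last=False)   # evict least recently used
            cache[x] = True
    return misses

def belady_misses(trace, capacity):
    cache, misses = set(), 0
    for i, x in enumerate(trace):
        if x in cache:
            continue
        misses += 1
        if len(cache) >= capacity:
            def next_use(b):                # distance to b's next reference
                for j in range(i + 1, len(trace)):
                    if trace[j] == b:
                        return j
                return float("inf")         # never used again
            cache.remove(max(cache, key=next_use))
        cache.add(x)
    return misses

random.seed(0)
trace = [random.randint(0, 49) if random.random() < 0.8 else
         random.randint(50, 499) for _ in range(5_000)]
for cap in (16, 64):
    print(cap, "LRU:", lru_misses(trace, cap),
          "Belady:", belady_misses(trace, cap))
```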
... There have been two emerging trends that bring an optimistic outlook to enforcing privacy rules directly in the storage layer. First, disaggregated architectures in the cloud and datacenters are increasingly offering some form of in-storage processing [33,8] and flexible data management [17,13] with high-speed network connectivity. Second, as an effort to keep the management, configuration and monitoring of a large number of storage nodes scalable, Software-Defined Storage (SDS) has been proposed [39] -albeit the goal of existing SDS systems is almost exclusively to guarantee performance isolation and service levels in distributed multi-tenant settings [27]. ...
Preprint
Full-text available
Enforcing data protection and privacy rules within large data processing applications is becoming increasingly important, especially in the light of GDPR and similar regulatory frameworks. Most modern data processing happens on top of a distributed storage layer, and securing this layer against accidental or malicious misuse is crucial to ensuring global privacy guarantees. However, the performance overhead and the additional complexity of doing so are often assumed to be significant -- in this work we describe a path forward that tackles both challenges. We propose "Software-Defined Data Protection" (SDP), an adoption of the "Software-Defined Storage" approach to non-performance aspects: a trusted controller translates company- and application-specific policies to a set of rules deployed on the storage nodes. These, in turn, apply the rules at line rate but do not take any decisions on their own. Such an approach decouples often-changing policies from request-level enforcement and allows storage nodes to implement the latter more efficiently. Even though in-storage processing brings challenges, mainly because it can jeopardize line-rate processing, we argue that today's Smart Storage solutions can already implement the required functionality, thanks to the separation of concerns introduced by SDP. We highlight the challenges that remain, especially that of trusting the storage nodes. These need to be tackled before we can reach widespread adoption in cloud environments.
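The separation of concerns that SDP proposes, policies compiled by a trusted controller and rules merely matched by storage nodes, can be illustrated with a small sketch. The policy format and field names below are invented; a real deployment would translate frameworks such as GDPR-derived rules into a similar flat representation.

```python
# Minimal sketch of the SDP split: the controller compiles policy into
# flat rules; the node's data path only matches them and decides nothing.
# Policy format and names are invented for illustration.

POLICY = [
    # (principal/role, object prefix, allowed operations)
    ("analytics",  "sales/",   {"read"}),
    ("ingest",     "sales/",   {"write"}),
    ("gdpr-admin", "users/eu", {"read", "delete"}),
]

def compile_rules(policy):
    """Controller side: translate policy into a structure the node can
    evaluate at line rate (here: a simple immutable rule list)."""
    return [(role, prefix, frozenset(ops)) for role, prefix, ops in policy]

class StorageNode:
    def __init__(self, rules):
        self.rules = rules           # pushed down by the trusted controller

    def allow(self, role: str, key: str, op: str) -> bool:
        """Data path: pure matching, no policy logic on the node."""
        return any(role == r and key.startswith(p) and op in ops
                   for r, p, ops in self.rules)

node = StorageNode(compile_rules(POLICY))
assert node.allow("analytics", "sales/2021/q3", "read")
assert not node.allow("analytics", "users/eu/42", "read")
assert node.allow("gdpr-admin", "users/eu/42", "delete")
```

Updating a policy then means recompiling and re-pushing rules from the controller; the nodes' enforcement path never changes.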
... SSDs are designed with flash storage and a firmware controller that manages the memory resource, performing low-level tasks such as garbage collection and wear leveling (writing roughly equally across all storage areas to prolong the life of the device). Do et al. (2019) describe significant disruptive trends in SSD design. There is now a move away from the current bare-bones approach toward a model that incorporates compute and software flexibility in the SSD, in which each storage device has a greater number of cores and increased clock speeds, as well as increased flexibility provided by a more general-purpose embedded operating system. ...
Chapter
Full-text available
Spatial land use allocation is often formulated as a complex multiobjective optimization problem. As effective tools for multiobjective optimization, Pareto-based heuristic optimization algorithms, such as genetic, artificial immune system, particle swarm optimization, and ant colony optimization algorithms, have been introduced to support trade-off analysis and posterior stakeholder involvement in land use decision making. However, these algorithms are extremely time consuming, and minimizing the computational time has become one of the largest challenges in obtaining the Pareto frontier in spatial land use allocation problems. To improve the efficiency of these algorithms and better support multiobjective decision making in land use planning, high-performance Pareto-based optimization algorithms for shared-memory and distributed-memory computing platforms were developed in this study. The OpenMP and Message Passing Interface (MPI) parallel programming technologies were employed to implement the shared-memory and distributed-memory parallel models, respectively, in parallel in the Pareto-based optimization algorithm. Experiments show that both the shared-memory and message-passing parallel models can effectively accelerate multiobjective spatial land use allocation models. The shared-memory model achieves satisfying performance when the number of CPU cores used for computing is less than 8. Conversely, the message-passing model displays better scalability than the shared-memory model when the number of CPU cores used for computing is greater than 8.
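The chapter's shared-memory parallel model can be transliterated into a short process-level sketch in Python: objective evaluation dominates runtime, and independent candidate solutions can be scored in parallel, one worker per CPU core, much as OpenMP parallelizes the evaluation loop. The two objectives below are stand-ins, not the chapter's actual land-use models.

```python
# Parallel evaluation of candidate land-use allocations. The grid size,
# objectives, and population are illustrative assumptions.

from multiprocessing import Pool
import random

GRID = 100                                  # 100x100 land-use grid

def evaluate(solution):
    """Score one candidate allocation on two toy objectives."""
    compactness = sum(1 for a, b in zip(solution, solution[1:]) if a == b)
    suitability = sum(solution)             # pretend higher codes suit better
    return compactness, suitability

def random_solution(rng):
    return [rng.randint(0, 4) for _ in range(GRID * GRID)]

if __name__ == "__main__":
    rng = random.Random(0)
    population = [random_solution(rng) for _ in range(64)]
    with Pool() as pool:                    # one worker per CPU core
        scores = pool.map(evaluate, population)
    print("best compactness:", max(s[0] for s in scores))
```

A message-passing (MPI-style) variant would partition the population across nodes and exchange the non-dominated front between generations, which is where the chapter's scalability difference beyond 8 cores comes from.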
... Modern SSDs contain computing components (such as embedded processors and DRAM) to perform various SSD management tasks, providing interesting opportunities to run user-defined programs inside the SSDs. An overview of the concept of programmable SSDs is described in [9]. There is clear industrial interest in exploiting programmable SSDs [35], so research in this area is likely to have a high payoff. ...
Conference Paper
Exploiting a storage hierarchy is critical to cost-effective data management. One can achieve great performance when working solely on main-memory data, but this comes at a high cost. Systems that use secondary storage as the "home" for data have much lower storage costs, as they can not only make the data durable but reduce its storage cost as well. Performance then becomes the challenge, reflected in an increased execution cost. Log-structured stores, e.g., Deuteronomy, improve I/O cost/performance by batching writes. However, this incurs the cost of host-based garbage collection and recovery, which duplicates SSD flash translation layer (FTL) functionality. This paper describes the design and implementation, in the controller of an Open-Channel SSD, of a new FTL that supports multi-page I/O without host-based log structuring. This both simplifies the host system and improves performance. The new FTL improves I/O cost/performance with only modest change to the current block-at-a-time, update-in-place interface.
Article
Every database engine runs on top of an operating system in the host, strictly separated from the storage. This more-than-half-century-old IHDE (In-Host Database Engine) architecture, however, reveals its limitations when run on fast flash-memory SSDs. In particular, the I/O stacks incur significant run-time overhead and also hinder vertical optimizations between database engines and SSDs. In this paper, we envisage a new database architecture, called SaS (SSD as SQL database engine), where a full-blown SQL database engine runs inside the SSD, tightly integrated with the SSD architecture without intervening kernel stacks. As the I/O stacks are removed, SaS is free from their run-time overhead and can further explore numerous vertical optimizations between the database engine and the SSD. SaS evolves the SSD from a dumb block device into a database server with SQL as its primary interface. The benefit of SaS will be even more pronounced in data centers, where the distance between the database engine and the storage is ever widening because of virtualization, storage disaggregation, and open software stacks. The advent of computational SSDs with more compute resources will make SaS an even more viable and attractive database architecture.
Article
Near-data processing (NDP) architectures promise to break the data-movement bottleneck that limits the efficiency of data processing in many scenarios (e.g., databases and recommendation systems). Unlike a traditional SSD, an NDP-based SSD must handle not only normal I/Os (e.g., reads and writes) but also NDP requests that contain data processing operations. NDP and normal I/O requests share some functional units of the NDP-based SSD, such as flash chips and embedded processors. However, existing works ignore the resource competition between normal I/Os and NDP requests, which drastically degrades performance. In this article, we propose a novel scheduling technique called Horae, which can efficiently schedule hybrid NDP/normal I/O requests in an NDP-based SSD to improve performance. Horae exploits the critical paths on critical resources to maximize the parallelism across the multiple stages of requests. Experimental results on typical workloads show that Horae significantly improves the performance of hybrid NDP/normal I/O workloads over state-of-the-art scheduling algorithms for NDP-based SSDs.
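The starting observation, that an NDP request occupies the flash channel and then the embedded processor while a normal I/O needs only the channel, can be shown with a toy timing model. The numbers and the two-stage pipeline below are illustrative assumptions, not Horae's actual scheduler.

```python
# Toy model: an NDP request = read (channel) then compute (processor);
# a normal I/O = read only. Serializing everything leaves the processor
# idle; overlapping one request's compute with the next request's read
# reclaims that time. Timings are arbitrary units (assumptions).

READ, COMPUTE = 2, 3
requests = ["ndp", "io", "ndp", "ndp", "io"]

# Serialized: run each request to completion before starting the next.
serial = sum(READ + (COMPUTE if r == "ndp" else 0) for r in requests)

# Pipelined: channel and processor are independent stages.
channel_free = proc_free = finish = 0
for r in requests:
    read_done = channel_free + READ        # channel is busy during the read
    channel_free = read_done
    if r == "ndp":
        start = max(read_done, proc_free)  # may wait for the processor
        proc_free = start + COMPUTE
        finish = max(finish, proc_free)
    else:
        finish = max(finish, read_done)

print(f"serialized: {serial} units, pipelined: {finish} units")  # 19 vs 12
```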
Article
Rapid development of resources and rising infrastructure expenses are leading institutions to adopt cloud computing. However, the cloud environment is vulnerable to various sorts of attacks, so recognizing malicious software is one of the principal challenges in cloud security governance. The intrusion detection system (IDS) has become the most widely used element of computer-system security, protecting the cloud from diverse sorts of attacks and threats. Evidently, no systematic literature review exists that focuses on the use of cloud computing within IDS processes, and previous investigations have not considered statistical analysis methods. Hence, this paper examines IDS mechanisms in cloud computing systematically. Twenty-two articles were obtained using defined filters and divided into four sections: hypervisor-based IDS, network-based IDS, machine-learning-based IDS, and hybrid IDS. The comparison is performed on the basis of the outcomes illustrated in the investigations. It demonstrates that IDS precision, inclusiveness, overhead, and reaction time have been discussed in many studies, while less attention has been paid to cost sensitivity, functioning, attack tolerance, and intrusion facing. This paper endeavors to organize literature drawn from multiple sources into a single manuscript.
Article
Most modern data processing pipelines run on top of a distributed storage layer, and securing the whole system, and the storage layer in particular, against accidental or malicious misuse is crucial to ensuring compliance with rules and regulations. Enforcing data protection and privacy rules, however, stands at odds with the requirement to achieve ever higher access bandwidths and processing rates in large data processing pipelines. In this work we describe our proposal for a path forward that reconciles the two goals. We call our approach "Software-Defined Data Protection" (SDP). Its premise is simple, yet powerful: decoupling often-changing policies from request-level enforcement allows distributed smart storage nodes to implement the latter at line rate. Existing and future data protection frameworks can be translated to the same hardware interface, which allows storage nodes to offload enforcement efficiently for both company-specific rules and regulations such as GDPR or CCPA. While SDP is a promising approach, there are several remaining challenges to making this vision reality. As we explain in the paper, overcoming these will require collaboration across several domains, including security, databases and specialized hardware design.
Chapter
The pace of improvement in the performance of conventional computer hardware has slowed significantly during the past decade, largely as a consequence of reaching the physical limits of manufacturing processes. To offset this slowdown, new approaches to HPC are now undergoing rapid development. This chapter describes current work on the development of cutting-edge exascale computing systems that are intended to be in place in 2021 and then turns to address several other important developments in HPC, some of which are only in the early stage of development. Domain-specific heterogeneous processing approaches use hardware that is tailored to specific problem types. Neuromorphic systems are designed to mimic brain function and are well suited to machine learning. And then there is quantum computing, which is the subject of some controversy despite the enormous funding initiatives that are in place to ensure that systems continue to scale-up from current small demonstration systems.
Conference Paper
Full-text available
Solid-State Drives (SSDs) have gained acceptance by providing the same block device abstraction as magnetic hard drives, at the cost of suboptimal resource utilisation and unpredictable performance. Recently, Open-Channel SSDs have emerged as a means to obtain predictably high performance, based on a clean break from the block device abstraction. Open-channel SSDs embed a minimal flash translation layer (FTL) and expose their internals to the host. The Linux open-channel SSD subsystem, LightNVM, lets kernel modules as well as user-space applications control data placement and I/O scheduling. This way, it is the host that is responsible for SSD management. But what kind of performance model should the host rely on to guide the way it manages data placement and I/O scheduling? For addressing this question we have defined uFLIP-OC, a benchmark designed to identify the I/O patterns that are best suited for a given open-channel SSD. Our experiments on a Dragon-Fire Card (DFC) SSD, equipped with the OX controller, illustrate the performance impact of media characteristics and parallelism. We discuss how uFLIP-OC can be used to guide the design of host-based data systems on open-channel SSDs.
Conference Paper
Full-text available
As Solid-State Drives (SSDs) become commonplace in data-centers and storage arrays, there is a growing demand for predictable latency. Traditional SSDs, serving block I/Os, fail to meet this demand. They offer a high-level of abstraction at the cost of unpredictable performance and suboptimal resource utilization. We propose that SSD management trade-offs should be handled through Open-Channel SSDs, a new class of SSDs, that give hosts control over their internals. We present our experience building LightNVM, the Linux Open-Channel SSD subsystem. We introduce a new Physical Page Address I/O interface that exposes SSD parallelism and storage media characteristics. LightNVM integrates into traditional storage stacks, while also enabling storage engines to take advantage of the new I/O interface. Our experimental results demonstrate that LightNVM has modest host overhead, that it can be tuned to limit read latency variability and that it can be customized to achieve predictable I/O latencies.
Article
Full-text available
Complex data queries, because of their need for random accesses, have proven to be slow unless all the data can be accommodated in DRAM. There are many domains, such as genomics, geological data and daily twitter feeds, where the datasets of interest are 5 TB to 20 TB. For such a dataset, one would need a cluster with 100 servers, each with 128 GB to 256 GB of DRAM, to accommodate all the data in DRAM. On the other hand, such datasets could be stored easily in the flash memory of a rack-sized cluster. Flash storage has much better random access performance than hard disks, which makes it desirable for analytics workloads. In this paper we present BlueDBM, a new system architecture which has flash-based storage with in-store processing capability and a low-latency high-throughput inter-controller network. We show that BlueDBM outperforms a flash-based system without these features by a factor of 10 for some important applications. While the performance of a RAM-cloud system falls sharply even if only 5%-10% of the references are to secondary storage, this sharp performance degradation is not an issue in BlueDBM. BlueDBM presents an attractive point in the cost-performance trade-off for Big Data analytics.
Conference Paper
Encryption ransomware is a malicious software that stealthily encrypts user files and demands a ransom to provide access to these files. Several prior studies have developed systems to detect ransomware by monitoring the activities that typically occur during a ransomware attack. Unfortunately, by the time the ransomware is detected, some files already undergo encryption and the user is still required to pay a ransom to access those files. Furthermore, ransomware variants can obtain kernel privilege, which allows them to terminate software-based defense systems, such as anti-virus. While periodic backups have been explored as a means to mitigate ransomware, such backups incur storage overheads and are still vulnerable as ransomware can obtain kernel privilege to stop or destroy backups. Ideally, we would like to defend against ransomware without relying on software-based solutions and without incurring the storage overheads of backups. To that end, this paper proposes FlashGuard, a ransomware tolerant Solid State Drive (SSD) which has a firmware-level recovery system that allows quick and effective recovery from encryption ransomware without relying on explicit backups. FlashGuard leverages the observation that the existing SSD already performs out-of-place writes in order to mitigate the long erase latency of flash memories. Therefore, when a page is updated or deleted, the older copy of that page is anyway present in the SSD. FlashGuard slightly modifies the garbage collection mechanism of the SSD to retain the copies of the data encrypted by ransomware and ensure effective data recovery. Our experiments with 1,447 manually labeled ransomware samples show that FlashGuard can efficiently restore files encrypted by ransomware. In addition, we demonstrate that FlashGuard has a negligible impact on the performance and lifetime of the SSD.
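FlashGuard's key observation translates into a compact sketch: because flash writes are out-of-place, the pre-update copy of a page survives until garbage collection erases it, so firmware can hand back the pre-encryption version. The model below is a minimal illustration of that idea, not FlashGuard's implementation.

```python
# Toy page-mapped flash model: writes go out-of-place, so superseded
# copies remain on flash until GC. Retaining them enables firmware-level
# recovery of pre-encryption data. All details are illustrative.

class OutOfPlaceFlash:
    def __init__(self):
        self.flash: list[tuple[int, bytes]] = []   # physical pages (lpn, data)
        self.l2p: dict[int, int] = {}              # logical page -> physical idx

    def write(self, lpn: int, data: bytes) -> None:
        self.l2p[lpn] = len(self.flash)            # old copy stays on flash
        self.flash.append((lpn, data))

    def read(self, lpn: int) -> bytes:
        return self.flash[self.l2p[lpn]][1]

    def recover_previous(self, lpn: int) -> bytes | None:
        """Firmware-level recovery: most recent superseded copy, if any."""
        current = self.l2p[lpn]
        for idx in range(current - 1, -1, -1):
            if self.flash[idx][0] == lpn:
                return self.flash[idx][1]
        return None

ssd = OutOfPlaceFlash()
ssd.write(7, b"family photo")
ssd.write(7, b"\x9f\x1c...encrypted by ransomware...")
assert ssd.read(7).startswith(b"\x9f")            # host sees the encrypted page
assert ssd.recover_previous(7) == b"family photo" # firmware can roll back
```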
Article
The ever increasing amount of data being handled in data centers causes an intrinsic inefficiency: moving data around is expensive in terms of bandwidth, latency, and power consumption, especially given the low computational complexity of many database operations. In this paper we explore near-data processing in database engines, i.e., the option of offloading part of the computation directly to the storage nodes. We implement our ideas in Caribou, an intelligent distributed storage layer incorporating many of the lessons learned while building systems with specialized hardware. Caribou provides access to DRAM/NVRAM storage over the network through a simple key-value store interface, with each storage node providing high-bandwidth near-data processing at line rate and fault tolerance through replication. The result is a highly efficient, distributed, intelligent data storage that can be used to both boost performance and reduce power consumption and real estate usage in the data center thanks to the micro-server architecture adopted.
Article
Data-intensive queries are common in business intelligence, data warehousing and analytics applications. Typically, processing a query involves full inspection of large in-storage data sets by CPUs. An intuitive way to speed up such queries is to reduce the volume of data transferred over the storage network to a host system. This can be achieved by filtering out extraneous data within the storage, motivating a form of near-data processing. This work presents Biscuit, a novel near-data processing framework designed for modern solid-state drives. It allows programmers to write a data-intensive application to run on the host system and the storage system in a distributed, yet seamless manner. In order to offer a high-level programming model, Biscuit builds on the concept of data flow. Data processing tasks communicate through typed and data-ordered ports. Biscuit does not distinguish tasks that run on the host system and the storage system. As the result, Biscuit has desirable traits like generality and expressiveness, while promoting code reuse and naturally exposing concurrency. We implement Biscuit on a host system that runs the Linux OS and a high-performance solid-state drive. We demonstrate the effectiveness of our approach and implementation with experimental results. When data filtering is done by hardware in the solid-state drive, the average speed-up obtained for the top five queries of TPC-H is over 15×.
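Biscuit's data-flow model, tasks communicating through typed, data-ordered ports with no distinction between host-side and device-side placement, can be approximated with queues and threads. The class and method names below are invented for illustration, not the framework's actual API.

```python
# Data-flow sketch in the spirit described above: tasks read from an
# input port and write to an output port; the wiring is identical no
# matter where a task "runs". Names are illustrative assumptions.

from queue import Queue
from threading import Thread

class Task(Thread):
    """A task: apply fn to each item from in_port, emit to out_port."""
    def __init__(self, fn, in_port: Queue, out_port: Queue):
        super().__init__()
        self.fn, self.inp, self.out = fn, in_port, out_port

    def run(self):
        while (item := self.inp.get()) is not None:   # None = end of stream
            result = self.fn(item)
            if result is not None:
                self.out.put(result)
        self.out.put(None)                            # propagate end of stream

scan_to_filter, filter_to_host = Queue(), Queue()

# "Device-side" filter task and "host-side" consumer, wired identically.
filt = Task(lambda row: row if row % 7 == 0 else None,
            scan_to_filter, filter_to_host)
filt.start()
for row in range(50):
    scan_to_filter.put(row)
scan_to_filter.put(None)

results = []
while (r := filter_to_host.get()) is not None:
    results.append(r)
filt.join()
print(results)            # only rows surviving the near-data filter
```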
Article
This paper presents YourSQL, a database system that accelerates data-intensive queries with the help of additional in-storage computing capabilities. YourSQL realizes very early filtering of data by offloading data scanning of a query to user-programmable solid-state drives. We implement our system on a recent branch of MariaDB (a variant of MySQL). In order to quantify the performance gains of YourSQL, we evaluate SQL queries with varying complexities. Our result shows that YourSQL reduces the execution time of the whole TPC-H queries by 3.6×, compared to a vanilla system. Moreover, the average speed-up of the five TPC-H queries with the largest performance gains reaches over 15×. Thanks to this significant reduction of execution time, we observe sizable energy savings. Our study demonstrates that the YourSQL approach, combining the power of early filtering with end-to-end datapath optimization, can accelerate large-scale analytic queries with lower energy consumption.
Conference Paper
PCIe-based Flash is commonly deployed to provide datacenter applications with high IO rates. However, its capacity and bandwidth are often underutilized as it is difficult to design servers with the right balance of CPU, memory and Flash resources over time and for multiple applications. This work examines Flash disaggregation as a way to deal with Flash overprovisioning. We tune remote access to Flash over commodity networks and analyze its impact on workloads sampled from real datacenter applications. We show that, while remote Flash access introduces a 20% throughput drop at the application level, disaggregation allows us to make up for these overheads through resource-efficient scale-out. Hence, we show that Flash disaggregation allows scaling CPU and Flash resources independently in a cost effective manner. We use our analysis to draw conclusions about data and control plane issues in remote storage.
Article
Modern data appliances face severe bandwidth bottlenecks when moving vast amounts of data from storage to the query processing nodes. A possible solution to mitigate these bottlenecks is query off-loading to an intelligent storage engine, where partial or whole queries are pushed down to the storage engine. In this paper, we present Ibex, a prototype of an intelligent storage engine that supports off-loading of complex query operators. Besides increasing performance, Ibex also reduces energy consumption, as it uses an FPGA rather than conventional CPUs to implement the off-load engine. Ibex is a hybrid engine, with dedicated hardware that evaluates SQL expressions at line-rate and a software fallback for tasks that the hardware engine cannot handle. Ibex supports GROUP BY aggregation, as well as projection- and selection-based filtering. GROUP BY aggregation has a higher impact on performance but is also a more challenging operator to implement on an FPGA.
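Ibex's hybrid structure, pushing down the operators the hardware engine supports and falling back to software for the rest, is easy to picture in miniature. Which operators qualify, and the prefix-only pushdown rule below, are simplifying assumptions for illustration.

```python
# Hybrid hardware/software query sketch: a prefix of the plan made of
# "hardware-supported" operators is pushed down; everything after the
# first unsupported operator runs on the host. Names are assumptions.

HW_SUPPORTED = {"selection", "projection", "groupby_agg"}  # aggregation elided below

def run_query(plan, rows):
    pushed, host_side = [], []
    for op in plan:
        (pushed if op["kind"] in HW_SUPPORTED and not host_side
         else host_side).append(op)        # push down only a plan prefix
    for op in pushed:
        rows = apply_op(op, rows)          # would run on the FPGA at line rate
    for op in host_side:
        rows = apply_op(op, rows)          # software fallback on the host
    return rows, [op["kind"] for op in pushed]

def apply_op(op, rows):
    if op["kind"] == "selection":
        return [r for r in rows if op["pred"](r)]
    if op["kind"] == "projection":
        return [{k: r[k] for k in op["cols"]} for r in rows]
    if op["kind"] == "sort":               # not supported by the "hardware"
        return sorted(rows, key=op["key"])
    raise ValueError(op["kind"])

rows = [{"id": i, "price": i % 10} for i in range(100)]
plan = [{"kind": "selection", "pred": lambda r: r["price"] > 7},
        {"kind": "sort", "key": lambda r: r["id"]},
        {"kind": "projection", "cols": ["id"]}]
out, offloaded = run_query(plan, rows)
print("offloaded:", offloaded, "| result rows:", len(out))
```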
Article
While the ubiquitous SSD shares many features with the hard-disk drive, under the surface they are completely different.
Conference Paper
Datacenters have traditionally been architected as a collection of servers wherein each server aggregates a fixed amount of computing, memory, storage, and communication resources. In this paper, we advocate an alternative construction in which the resources within a server are disaggregated and the datacenter is instead architected as a collection of standalone resources. Disaggregation brings greater modularity to datacenter infrastructure, allowing operators to optimize their deployments for improved efficiency and performance. However, the key enabling or blocking factor for disaggregation will be the network since communication that was previously contained within a single server now traverses the datacenter fabric. This paper thus explores the question of whether we can build networks that enable disaggregation at datacenter scales.
Conference Paper
Data storage devices are getting "smarter." Smart Flash storage devices (a.k.a. "Smart SSD") are on the horizon and will package CPU processing and DRAM storage inside a Smart SSD, and make that available to run user programs inside a Smart SSD. The focus of this paper is on exploring the opportunities and challenges associated with exploiting this functionality of Smart SSDs for relational analytic query processing. We have implemented an initial prototype of Microsoft SQL Server running on a Samsung Smart SSD. Our results demonstrate that significant performance and energy gains can be achieved by pushing selected query processing components inside the Smart SSDs. We also identify various changes that SSD device manufacturers can make to increase the benefits of using Smart SSDs for data processing applications, and also suggest possible research opportunities for the database community.
Conference Paper
In the last several years hundreds of thousands of SSDs have been deployed in the data centers of Baidu, China's largest Internet search company. Currently only 40% or less of the raw bandwidth of the flash memory in the SSDs is delivered by the storage system to the applications. Moreover, because of space over-provisioning in the SSD to accommodate non-sequential or random writes, and additionally, parity coding across flash channels, typically only 50-70% of the raw capacity of a commodity SSD can be used for user data. Given the large scale of Baidu's data center, making the most effective use of its SSDs is of great importance. Specifically, we seek to maximize both bandwidth and usable capacity. To achieve this goal we propose software-defined flash (SDF), a hardware/software co-designed storage system to maximally exploit the performance characteristics of flash memory in the context of our workloads. SDF exposes individual flash channels to the host software and eliminates space over-provisioning. The host software, given direct access to the raw flash channels of the SSD, can effectively organize its data and schedule its data access to better realize the SSD's raw performance potential. Currently more than 3000 SDFs have been deployed in Baidu's storage system that supports its web page and image repository services. Our measurements show that SDF can deliver approximately 95% of the raw flash bandwidth and provide 99% of the flash capacity for user data. SDF increases I/O bandwidth by 300% and reduces per-GB hardware cost by 50% on average compared with the commodity-SSD-based system used at Baidu.
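SDF's central move, exposing individual flash channels to host software, can be pictured with a toy model in which placement and scheduling become the host's job. The channel count, page size, and round-robin placement below are illustrative assumptions.

```python
# Toy picture of the SDF idea: the host sees each flash channel as an
# independent device and decides placement itself, instead of trusting
# an opaque FTL. Constants and the policy are assumptions.

CHANNELS, PAGE = 8, 4096

class RawChannel:
    """One exposed flash channel: the host controls placement on it."""
    def __init__(self):
        self.pages: list[bytes] = []

    def append(self, data: bytes) -> int:
        self.pages.append(data)
        return len(self.pages) - 1

channels = [RawChannel() for _ in range(CHANNELS)]

def host_write(key: int, data: bytes) -> tuple[int, int]:
    """Host-side placement (round-robin here), so sequential batches
    spread across all channels and can proceed in parallel."""
    ch = key % CHANNELS
    return ch, channels[ch].append(data)

locations = {k: host_write(k, bytes(PAGE)) for k in range(64)}
per_channel = [len(c.pages) for c in channels]
print("pages per channel:", per_channel)  # balanced -> full aggregate bandwidth
```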