Article
Full-text available

Abstract

Applications' memory footprints are growing exponentially due to increases in their data sets and additional software layers. Modern RDMA-capable networks such as InfiniBand and Myrinet, with low latency and high bandwidth, give us a new way to utilize remote memory: remote idle memory can be exploited to improve the performance of memory-intensive applications on individual nodes, and network swapping can be faster than traditional swapping to local disk. In this paper, we design a remote memory system for remote memory utilization in InfiniBand clusters. We present the architecture, communication method, and algorithms of the InfiniBand Block Device (IBD), which is implemented as a loadable kernel module for version 3.5.0-45 of the Linux kernel. In particular, we discuss design issues in transferring pages to remote memory. Our experiments show that IBD brings performance gains for applications whose working sets are larger than the local memory on a node but smaller than the idle memory available on the cluster.
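For readers unfamiliar with the mechanism, the sketch below illustrates how a block-device module of this kind registers with the 3.x Linux block layer. It is a minimal, hypothetical skeleton, not IBD's actual code: all identifiers are illustrative, error handling is omitted, and the RDMA transfer path is reduced to a stub that completes every request immediately.

```c
/* Hypothetical sketch of a swap-capable network block device for a
 * 3.x kernel; a real driver would issue RDMA transfers in ibd_request(). */
#include <linux/module.h>
#include <linux/fs.h>
#include <linux/blkdev.h>
#include <linux/genhd.h>

#define IBD_NAME     "ibd"
#define IBD_SECTORS  (512 * 1024 * 2)   /* 512 MiB of remote memory */

static int ibd_major;
static struct gendisk *ibd_disk;
static struct request_queue *ibd_queue;
static DEFINE_SPINLOCK(ibd_lock);

/* A real driver would turn each request into RDMA reads/writes against
 * a remote memory server; here every request completes immediately. */
static void ibd_request(struct request_queue *q)
{
    struct request *req;

    while ((req = blk_fetch_request(q)) != NULL)
        __blk_end_request_all(req, 0);
}

static const struct block_device_operations ibd_fops = {
    .owner = THIS_MODULE,
};

static int __init ibd_init(void)
{
    ibd_major = register_blkdev(0, IBD_NAME);
    if (ibd_major < 0)
        return ibd_major;

    ibd_queue = blk_init_queue(ibd_request, &ibd_lock);
    ibd_disk  = alloc_disk(1);

    ibd_disk->major = ibd_major;
    ibd_disk->first_minor = 0;
    ibd_disk->fops  = &ibd_fops;
    ibd_disk->queue = ibd_queue;
    snprintf(ibd_disk->disk_name, sizeof(ibd_disk->disk_name), IBD_NAME "0");
    set_capacity(ibd_disk, IBD_SECTORS);
    add_disk(ibd_disk);
    return 0;
}

static void __exit ibd_exit(void)
{
    del_gendisk(ibd_disk);
    put_disk(ibd_disk);
    blk_cleanup_queue(ibd_queue);
    unregister_blkdev(ibd_major, IBD_NAME);
}

module_init(ibd_init);
module_exit(ibd_exit);
MODULE_LICENSE("GPL");
```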
Conference Paper
Today's data-intensive applications need to process large amounts of data on cluster systems that consist of commodity computing nodes with high-performance network interconnects. The individual computing nodes in a cluster system have identical resource specifications, although applications vary greatly in their resource needs. Given the resulting imbalance in the amount of memory used on different computing nodes, we can consider using remote idle memory to alleviate the memory pressure on individual nodes. We propose a remote memory block device (RMBD), a remote swapping technique that uses the remote memory access capabilities provided by both InfiniBand and RoCE. We evaluate the I/O performance of RMBD against that of a block device using TCP/IP protocols. Our experiments show that RMBD using RDMA improves throughput by up to 20 times for read operations and 6 times for write operations compared to TCP/IP. We also demonstrate that RMBD used as a swap device can provide up to a 2x speedup over an HDD for quick-sorting.
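The one-sided remote memory access that gives an RDMA block device its advantage over TCP/IP is typically issued through the libibverbs API. The fragment below is a hedged sketch, not RMBD's published implementation: it assumes a connected reliable (RC) queue pair, a locally registered buffer, and an out-of-band exchange of the peer's buffer address and rkey have already taken place.

```c
/* Sketch: posting a one-sided RDMA read with libibverbs. */
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

int rmbd_rdma_read(struct ibv_qp *qp, struct ibv_mr *mr,
                   uint64_t remote_addr, uint32_t rkey, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)mr->addr,   /* local registered buffer */
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;

    memset(&wr, 0, sizeof(wr));
    wr.opcode              = IBV_WR_RDMA_READ;   /* one-sided read       */
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.send_flags          = IBV_SEND_SIGNALED;  /* request a completion */
    wr.wr.rdma.remote_addr = remote_addr;        /* peer's buffer        */
    wr.wr.rdma.rkey        = rkey;               /* peer's memory key    */

    /* The remote CPU is never involved: the HCA services the read. */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```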
Conference Paper
Full-text available
This paper proposes a kernel-to-kernel communication system for use in cluster computers. It is implemented directly on the Ethernet data link layer, which allows use of Ethernet's inherent broadcast capability. The system is implemented and performance tests are run. The results show significant improvement in broadcast performance.
Conference Paper
Full-text available
Emerging 64-bit OSs supply a huge memory address space that is essential for new applications using very large data sets. It is expected that the memory of connected nodes can be used to store swapped pages efficiently, especially in a dedicated cluster with a high-speed network such as 10 GbE or InfiniBand. In this paper, we propose the distributed large memory system (DLM), which provides very large virtual memory by using remote memory distributed over the nodes of a cluster. The performance of DLM programs using remote memory is compared to that of ordinary programs using local memory. The results of the STREAM, NPB, and Himeno benchmarks show that the DLM achieves better performance than other remote paging schemes that use a block swap device to access remote memory. In addition to performance, the DLM offers the advantages of easy availability and high portability, because it is user-level software that requires no special hardware. To obtain high performance, the DLM can tune its parameters independently of the kernel swap parameters. We also found that the DLM's independence from kernel swapping provides more stable behavior.
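Because a system like DLM lives entirely in user space, it has to intercept memory accesses without kernel support. One classic way to do this (which may or may not match DLM's internals) is to reserve address space with PROT_NONE and service the resulting SIGSEGV faults by pulling the page from a remote server. The sketch below illustrates the trick; fetch_remote_page() is a hypothetical stand-in for the real network transport.

```c
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static long page_size;

/* Stand-in for the transport: a real system would fetch the page
 * contents from a remote memory server here (e.g. over 10 GbE). */
static void fetch_remote_page(void *page)
{
    memset(page, 0, page_size);
}

/* On first touch of a protected page: make it accessible and fill it. */
static void fault_handler(int sig, siginfo_t *si, void *ctx)
{
    void *page = (void *)((uintptr_t)si->si_addr & ~(page_size - 1));
    mprotect(page, page_size, PROT_READ | PROT_WRITE);
    fetch_remote_page(page);
}

int main(void)
{
    struct sigaction sa;
    char *mem;

    page_size = sysconf(_SC_PAGESIZE);

    memset(&sa, 0, sizeof(sa));
    sa.sa_flags = SA_SIGINFO;
    sa.sa_sigaction = fault_handler;
    sigaction(SIGSEGV, &sa, NULL);

    /* Reserve 1 GiB of "remote" virtual memory, initially inaccessible. */
    mem = mmap(NULL, 1UL << 30, PROT_NONE,
               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    mem[12345] = 42;   /* faults once, is paged in, then succeeds */
    return 0;
}
```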
Article
Full-text available
The explosion of data and transactions demands a creative approach for data processing in a variety of applications. Research on remote memory systems (RMSs), so as to exploit the superior characteristics of dynamic random access memory (DRAM), has been performed for many decades, and today's information explosion galvanizes researchers into shedding new light on the technology. Prior studies have mainly focused on architectural suggestions for such systems, highlighting different design rationales. These studies have shown that choosing the appropriate applications to run on an RMS is important in fully utilizing the advantages of remote memory. This article provides an extensive performance evaluation of various types of data processing applications so as to address the efficacy of an RMS, by means of a prototype RMS with reliability functionality. The prototype is a practical kernel-level RMS that renders large-memory data processing feasible. The abstract concept of remote memory was materialized by borrowing unused local memory from commodity PCs via a high-speed network capable of Remote Direct Memory Access (RDMA) operations. The prototype RMS uses only the memory of remote computers, not their computation power. Our experimental results suggest that an RMS can be practical in supporting the rigorous demands of commercial in-memory database systems that have high data access locality. Our evaluation also convinces us of the possibility that a reliable RMS can satisfy both the high degree of reliability and the efficiency needed for large-memory data processing applications whose data access pattern has high locality.
Conference Paper
Full-text available
With the development of network technologies, many mechanisms have been introduced to improve system performance in cluster systems by exploiting remote idle memory. However, none of them satisfies the requirements of different applications: most methods improve the performance of only a particular type of application, in large part because they fail to provide unified interfaces. In this paper, we propose the Collaborative Memory Pool (CMP) to solve this problem. CMP brings scalability and high performance, and has five features: 1) it provides malloc-like interfaces, block device interfaces, and a kernel API for different applications, benefiting both user-level and kernel-level applications (see the sketch below); 2) it retains the traditional VM mechanism, so programmers and users are free to use CMP or not; 3) it improves kernel application performance by eliminating remote swapping; 4) it avoids the "loan while in debt" problem under dynamic workloads; 5) it provides optional memory servers to further improve performance. In our testbed with CMP-based swap devices, Qsort achieves an 83.28% improvement compared with disk-based swap devices.
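As a concrete illustration of feature 1, the fragment below sketches what a malloc-like pool interface could look like. The names, the pool primitives, and the local-memory fallback are hypothetical, not CMP's published API.

```c
#include <stdlib.h>

/* Hypothetical remote-pool primitives; a real system would implement
 * them against its memory servers. pool_alloc() returns NULL when no
 * donor node has room. */
extern void *pool_alloc(size_t len);
extern int   pool_owns(void *ptr);
extern void  pool_free(void *ptr);

/* malloc-like entry point: prefer the collaborative pool, but fall
 * back to ordinary local memory so applications keep working. */
void *cmp_malloc(size_t len)
{
    void *p = pool_alloc(len);
    return p ? p : malloc(len);
}

void cmp_free(void *ptr)
{
    if (pool_owns(ptr))
        pool_free(ptr);
    else
        free(ptr);
}
```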
Conference Paper
With the fast development of highly integrated distributed systems (cluster systems), especially those encapsulated within a single platform [28, 9], designers face interesting memory hierarchy design choices that attempt to avoid disk storage swapping. Disk swapping drastically slows down application execution. Leveraging remote free memory through memory collaboration has demonstrated its cost-effectiveness compared to overprovisioning for peak load requirements. Recent studies propose several ways of accessing under-utilized remote memory in static system configurations, without detailed exploration of dynamic memory collaboration. Dynamic collaboration is important given the run-time memory usage fluctuations in clustered systems. In this paper, we propose an Autonomous Collaborative Memory System (ACMS) that manages memory resources dynamically at run time to optimize performance and provide QoS measures for nodes engaging in the system. We implement a prototype realizing the proposed ACMS, experiment with a wide range of real-world applications, and show up to a 3x performance speedup compared to a non-collaborative memory system, without perceivable performance impact on the nodes that provide memory. Based on our experiments, we conduct a detailed analysis of the remote memory access overhead and provide insights for future optimizations.
Conference Paper
Traditionally, operations on memory located on other nodes (remote memory) in cluster environments interconnected with technologies like Gigabit Ethernet have been expensive, with latencies several orders of magnitude slower than local memory accesses. Modern RDMA-capable networks such as InfiniBand and Quadrics provide low latencies of a few microseconds and high bandwidth of up to 10 Gbps. This has significantly reduced the latency gap between access to local memory and remote memory in modern clusters. Remote idle memory can be exploited to reduce the memory pressure on individual nodes. This is akin to adding an additional level in the memory hierarchy between local memory and the disk, with potentially dramatic performance improvements, especially for memory-intensive applications. In this paper, we take on the challenge of designing a remote paging system for remote memory utilization in InfiniBand clusters. We present the design and implementation of a high performance networking block device (HPBD) over the InfiniBand fabric, which serves as a swap device for the kernel Virtual Memory (VM) system for efficient page transfer to and from remote memory servers. Our experiments show that using HPBD, quick sort performs only 1.45 times slower than a local memory system, and up to 21 times faster than local disk, and our design is completely transparent to user applications. To the best of our knowledge, this is the first remote pager design using InfiniBand for remote memory utilization.
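On the consumer side, wiring such a block device into the VM system requires no application changes: the device is initialized with mkswap(8) and then enabled as swap space. A minimal sketch using the swapon(2) system call follows; the device path /dev/hpbd0 is illustrative, not a name the paper specifies.

```c
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    /* After this call, pages evicted by the kernel VM system flow to
     * the network block device, i.e. to remote memory. */
    if (swapon("/dev/hpbd0", 0) != 0) {
        perror("swapon");
        return 1;
    }
    return 0;
}
```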
Conference Paper
Cluster applications that process large amounts of data, such as parallel scientific or multimedia applications, are likely to cause swapping on individual cluster nodes. These applications will perform better on clusters with network swapping support. Network swapping allows any cluster node with over-committed memory to use idle memory of a remote node as its backing store and to "swap" its pages over the network. As the disparity between network speeds and disk speeds continues to grow, network swapping will be faster than traditional swapping to local disk. We present Nswap, a network swapping system for heterogeneous Linux clusters and networks of Linux machines. Nswap is implemented as a loadable kernel module for version 2.4 of the Linux kernel. It is a space-efficient and time-efficient implementation that transparently performs network swapping. Nswap scales to larger clusters, supports migration of remotely swapped pages, and supports dynamic growing and shrinking of Nswap cache (the amount of RAM available to store remote pages) in response to a node's local memory needs. Results comparing Nswap running on an eight node Linux cluster with 100BaseT Ethernet interconnect and faster disk show that Nswap is comparable to swapping to local, faster disk; depending on the workload, Nswap's performance is up to 1.7 times faster than disk to between 1.3 and 4.6 times slower than disk for most workloads. We show that with faster networking technology, Nswap will outperform swapping to disk.
Conference Paper
This paper describes the use of remote memory for virtual memory swapping in a cluster computer. Our design uses a lightweight kernel-to-kernel communication channel for fast, efficient data transfer. Performance tests compare our system to normal hard disk swapping and show significantly improved performance when data access is random.
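A kernel-to-kernel channel of this kind bypasses the TCP/IP stack by building Ethernet frames directly. The sketch below shows the general shape of such a send path using the kernel's sk_buff API; the EtherType 0x88B5 (reserved for local experiments) and all identifiers are illustrative, not the paper's actual code.

```c
#include <linux/netdevice.h>
#include <linux/etherdevice.h>
#include <linux/skbuff.h>
#include <linux/string.h>

/* Broadcast one raw data-link-layer frame on the given interface. */
static int k2k_broadcast(struct net_device *dev, const void *data,
                         size_t len)
{
    struct sk_buff *skb;
    struct ethhdr *eth;

    skb = alloc_skb(ETH_HLEN + len, GFP_ATOMIC);
    if (!skb)
        return -ENOMEM;

    skb_reserve(skb, ETH_HLEN);              /* room for the header    */
    memcpy(skb_put(skb, len), data, len);    /* payload                */

    eth = (struct ethhdr *)skb_push(skb, ETH_HLEN);
    memset(eth->h_dest, 0xff, ETH_ALEN);     /* Ethernet broadcast     */
    memcpy(eth->h_source, dev->dev_addr, ETH_ALEN);
    eth->h_proto = cpu_to_be16(0x88B5);      /* experimental EtherType */

    skb->dev = dev;
    skb->protocol = eth->h_proto;

    return dev_queue_xmit(skb);              /* hand off to the NIC    */
}
```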
Conference Paper
In this paper, we present the design and implementation of Dodo, an efficient user-level system for harvesting idle memory in off-the-shelf clusters of workstations. Dodo enables data-intensive applications to use remote memory in a cluster as an intermediate cache between local memory and disk. It requires no modifications to the operating system or processor firmware and is hence portable to multiple platforms. Further, the memory recruitment policy used by Dodo is designed to minimize any delays experienced by the owners of the desktop machines whose memory is harvested. Our implementation of Dodo is operational and currently runs on Linux 2.0.35. For communication, Dodo can use either UDP/IP or U-Net, the low-latency user-level network architecture developed by von Eicken et al. (1995). We evaluated the performance improvements that can be achieved by using Dodo for two real applications and three synthetic benchmarks. Our results show that the speedups obtained for an application are highly dependent on its I/O access pattern and data set sizes. Significant speedups (between 2 and 3) were obtained for applications whose working sets are larger than the local memory on a workstation but smaller than the aggregate memory available on the cluster, and for applications that can benefit from the zero-seek nature of remote memory.
Collaborative memory pool in cluster system
N. Wang, X. Liu, J. He, J. Han, L. Zhang, and Z. Xu, "Collaborative memory pool in cluster system," in Proc. of the International Conference on Parallel Processing, 2007, p. 17.