Ranjit Noronha’s research while affiliated with The Ohio State University and other places


Publications (27)


Exploiting Remote Memory in InfiniBand Clusters using a High Performance Network Block Device (HPBD)
  • Article

April 2013 · 75 Reads · 12 Citations

Shuang Liang · Ranjit Noronha · [...]

Traditionally, remote memory accesses in cluster systems are very expensive operations, performing 20-100 times slower than local memory accesses. Modern RDMA-capable networks such as InfiniBand and Quadrics provide low latency of a few microseconds and high bandwidth of up to 10 Gbps. This has brought remote memory much closer to the local memory system. Using remote idle memory to enhance the local memory hierarchy thus becomes an attractive choice, especially for data-intensive applications in cluster environments. In this paper, we take on the challenge of designing a remote paging system for remote memory utilization in InfiniBand clusters. We present the design and implementation of a high performance networking block device (HPBD), which serves as a swap device for the kernel Virtual Memory (VM) system, enabling efficient page transfer to/from remote memory servers. Our experiments show that using HPBD, quick sort performs only 1.45 times slower than with the local memory system, and up to 21 times faster than with local disk. Our design is completely transparent to user applications. To the best of our knowledge, this is the first remote pager design using InfiniBand for remote memory utilization.
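For readers unfamiliar with how a block device becomes a swap target, the following is a minimal user-space sketch (not the authors' code) of attaching an HPBD-style network block device as Linux swap. The device path /dev/hpbd0 is a hypothetical name; the device would first need to be initialized with mkswap, and the program must run as root.

```c
/* Minimal sketch: register a hypothetical HPBD-style network block
 * device as a Linux swap device, so kernel page-outs land in remote
 * memory instead of on local disk. /dev/hpbd0 is an assumed name. */
#include <stdio.h>
#include <sys/swap.h>

int main(void)
{
    const char *dev = "/dev/hpbd0";   /* hypothetical HPBD block device */

    /* swapon(2) hands the device to the kernel VM; from this point the
     * pager transparently swaps pages over the network. */
    if (swapon(dev, 0) != 0) {
        perror("swapon");
        return 1;
    }
    printf("swapping to %s enabled\n", dev);
    return 0;
}
```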


Fig. 1. pNFS high-level architecture resilient to network-level faults. Traditionally, non-idempotent requests are handled through the Duplicate Request Cache (DRC) at the server. The DRC has a limited number of entries, and these entries are shared among all the clients, so eventually some entries will be evicted from the cache. In the face of network-level partitions, duplicate requests whose entries have been evicted from the DRC will be re-executed. Sessions solve this problem by requiring that each connection be allotted a fixed number of RPC slots in the DRC. The client is only allowed to issue requests up to the number of slots in the connection. Because of this reservation policy, duplicate requests from the client to the server in the face of network-level partitions will not be re-executed. We will consider design issues with sessions and RPC/RDMA in the following section. RPC/RDMA for NFS: The existing RPC/RDMA design for Linux and OpenSolaris is based on the Read-Write design [17]. It consists of two protocols: the inline protocol for small requests and the bulk data transfer protocol for large operations. The inline protocol on Linux is enabled through a set of persistent buffers (on Linux, 32 buffers of 1 KB each for sends and 32 buffers of 1 KB each for receives). RPC requests and replies are sent and received using the persistent inline buffers. The responses for some NFS procedures such as READ and READDIR might be quite large. These responses may be sent to the user via the bulk data transfer protocol, which uses RDMA Write to send large responses from the server to the clients without a copy, and RDMA Read to pull data in from the client for procedures such as WRITE. The design trade-offs for RPC/RDMA are discussed further in [17, 18].
Fig. 2. pNFS detailed design  
Fig. 3. Latency for small operations (GETFILELAYOUT)  
Fig. 4. Latency for 1 and 2 processes  
Fig. 5. Latency for 4,8 and 16 processes  


Designing a High-Performance Clustered NAS: A Case Study with pNFS over RDMA on InfiniBand
  • Conference Paper
  • Full-text available

December 2008 · 1,460 Reads · 9 Citations

Lecture Notes in Computer Science

Large scale scientific and commercial applications consume and produce petabytes of data. This data needs to be safely stored, cataloged and reproduced with high performance. The current generation of single-headed NAS (Network Attached Storage) systems such as NFS is not able to provide an acceptable level of performance to these demanding applications. Clustered NAS solutions have evolved to meet their storage demands. However, the performance of these Clustered NAS solutions is limited by the communication protocol being used, usually TCP/IP. In this paper, we propose, design and evaluate a clustered NAS: pNFS over RDMA on InfiniBand. Our results show that for a sequential workload on 8 data servers, the pNFS over RDMA design can achieve a peak aggregate Read throughput of up to 5,029 MB/s, a maximum improvement of 188% over the TCP/IP transport, and a Write throughput of 1,872 MB/s, a maximum improvement of 150% over the corresponding TCP/IP transport throughput. Evaluations with other types of workloads and traces show an improvement in performance of up to 27%. Finally, our design of pNFS over RDMA also improves the performance of the scientific application BTIO.
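The figure notes above describe the RPC/RDMA split between a small-message inline protocol (persistent 1 KB buffers) and a bulk path that uses RDMA Write/Read for large payloads. Below is a minimal sketch of that decision logic with illustrative constants and names; it is not the Linux or OpenSolaris implementation.

```c
/* Sketch of the inline-vs-bulk decision described above; constants and
 * names are illustrative, not the actual RPC/RDMA code. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INLINE_BUF_SIZE   1024   /* 1 KB persistent inline buffers */
#define INLINE_BUF_COUNT  32     /* 32 send + 32 receive buffers    */

enum xfer_path { XFER_INLINE, XFER_RDMA_WRITE, XFER_RDMA_READ };

/* Pick a transfer path for an RPC message of 'len' bytes: small messages
 * ride in the persistent inline buffers; large READ/READDIR replies are
 * pushed with RDMA Write, and large WRITE payloads are pulled from the
 * client with RDMA Read. */
static enum xfer_path choose_path(size_t len, bool server_to_client)
{
    if (len <= INLINE_BUF_SIZE)
        return XFER_INLINE;
    return server_to_client ? XFER_RDMA_WRITE : XFER_RDMA_READ;
}

int main(void)
{
    printf("512 B request : %d\n", choose_path(512, false));   /* inline     */
    printf("64 KB READ    : %d\n", choose_path(65536, true));  /* RDMA Write */
    printf("64 KB WRITE   : %d\n", choose_path(65536, false)); /* RDMA Read  */
    return 0;
}
```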


IMCa: A High Performance Caching Front-end for GlusterFS on InfiniBand

October 2008 · 277 Reads · 20 Citations

With the rapid advances in computing technology, there is an explosion in media that needs to be collected, cataloged, stored and accessed. With the speed of disks not keeping pace with the improvements in processor and network speed, the ability of network file systems to provide data to demanding applications at an appropriate rate is diminishing. In this paper, we propose to enhance the performance of network file systems by providing an InterMediate bank of Cache servers (IMCa) between the client and server. Whenever possible, file system operations from the client are serviced from the cache bank. We evaluate IMCa with a number of different benchmarks. The results of these experiments demonstrate that the intermediate cache architecture can reduce the latency of certain operations by up to 82% over the native implementation and up to 86% compared with the Lustre file system. In addition, we also see an improvement in the performance of data transfer operations in most cases and for most scenarios. Finally, the caching hierarchy helps us achieve better scalability of file system operations.
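As a rough illustration of the "serve from the cache bank when possible, otherwise fall through to the file server" behavior described in the abstract, here is a small self-contained sketch; the structures and function names are hypothetical stand-ins for the IMCa cache servers and the backing file server.

```c
/* Sketch of a check-the-cache-bank-then-fall-through read path.
 * All names are hypothetical; this is not the IMCa API. */
#include <stdio.h>
#include <string.h>

#define BANK_SLOTS 4

struct cache_entry {
    char path[128];
    char data[256];
    int  valid;
};

static struct cache_entry bank[BANK_SLOTS];   /* stand-in for the cache servers */

/* Pretend fetch from the backing file server. */
static void server_read(const char *path, char *out, size_t len)
{
    snprintf(out, len, "<contents of %s from server>", path);
}

/* Look up 'path' in the cache bank; on a miss, fetch from the server
 * and install the result so later reads hit in the cache. */
static void imca_read(const char *path, char *out, size_t len)
{
    for (int i = 0; i < BANK_SLOTS; i++) {
        if (bank[i].valid && strcmp(bank[i].path, path) == 0) {
            snprintf(out, len, "%s", bank[i].data);   /* cache hit */
            return;
        }
    }
    server_read(path, out, len);                      /* cache miss */
    for (int i = 0; i < BANK_SLOTS; i++) {
        if (!bank[i].valid) {
            snprintf(bank[i].path, sizeof bank[i].path, "%s", path);
            snprintf(bank[i].data, sizeof bank[i].data, "%s", out);
            bank[i].valid = 1;
            break;
        }
    }
}

int main(void)
{
    char buf[256];
    imca_read("/data/a.txt", buf, sizeof buf);   /* miss: goes to server   */
    imca_read("/data/a.txt", buf, sizeof buf);   /* hit: served from bank  */
    printf("%s\n", buf);
    return 0;
}
```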


Figure 1. A Cluster-of-Clusters Scenario
Table 1. Delay Overhead corresponding to Wire Length
Figure 2. Cluster-of-Clusters Connected with Obsidian Longbow XRs 
Performance of HPC Middleware over InfiniBand WAN

October 2008 · 271 Reads · 15 Citations

S. Narravula · [...]

High performance interconnects such as InfiniBand (IB) have enabled large scale deployments of High Performance Computing (HPC) systems. High performance communication and I/O middleware such as MPI and NFS over RDMA have also been redesigned to leverage the performance of these modern interconnects. With the advent of long-haul InfiniBand (IB WAN), IB applications now have inter-cluster reach. While this technology is intended to enable high performance network connectivity across WAN links, it is important to study and characterize the actual performance that existing IB middleware achieves in these emerging IB WAN scenarios. In this paper, we study and analyze the performance characteristics of the following three HPC middleware layers: (i) IPoIB (IP traffic over IB), (ii) MPI and (iii) NFS over RDMA. We utilize the Obsidian IB WAN routers for inter-cluster connectivity. Our results show that many of the applications absorb smaller network delays fairly well. However, most approaches are severely impacted in high delay scenarios, and communication protocols need to be optimized for higher delays to improve performance. In this paper, we propose several such optimizations. Our experimental results show that techniques such as WAN-aware protocols, transferring data using large messages (message coalescing) and using parallel data streams can improve communication performance (up to 50%) in high delay scenarios. Overall, these results demonstrate that IB WAN technologies can make the cluster-of-clusters architecture a feasible platform for HPC systems.
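One of the optimizations mentioned above is message coalescing: packing many small messages into one large transfer to amortize WAN latency. The sketch below shows the idea at the MPI application level with assumed message counts and sizes; the paper's optimizations live inside the middleware, so this is only an approximation of the technique.

```c
/* Sketch of message coalescing: instead of many small sends over a
 * high-delay WAN link, pack them into one large message.
 * Illustrative only. Run with: mpirun -np 2 ./coalesce */
#include <mpi.h>
#include <string.h>

#define NMSG  64
#define MSGSZ 256

int main(int argc, char **argv)
{
    int rank;
    char small[NMSG][MSGSZ];
    char packed[NMSG * MSGSZ];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Coalesce: copy the small payloads into one contiguous buffer
         * and issue a single large send, amortizing per-message latency. */
        for (int i = 0; i < NMSG; i++) {
            memset(small[i], 'a' + (i % 26), MSGSZ);
            memcpy(packed + i * MSGSZ, small[i], MSGSZ);
        }
        MPI_Send(packed, NMSG * MSGSZ, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(packed, NMSG * MSGSZ, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```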


pNFS/PVFS2 over InfiniBand: early experiences

November 2007 · 177 Reads · 9 Citations

The computing power of clusters has been rapidly growing toward petascale capability, which requires petascale I/O systems to provide data in a sustained high-throughput manner. Network File System (NFS), a ubiquitous standard used in most existing clusters, has shown a performance bottleneck associated with its single server model. pNFS, a parallel version of NFS, has been proposed in this context to eliminate the performance bottleneck while maintaining the ease of management and interoperability features of NFS. With InfiniBand being one of the most popular high speed networks for clusters, whether pNFS can pick up the advantages of InfiniBand is an interesting and important question. It is also important to quantify and understand the potential benefits of using pNFS compared with single server NFS, and the possible overhead associated with pNFS. However, since pNFS is relatively new, few such studies have been carried out in an InfiniBand cluster environment. In this paper we have designed and carried out a set of experiments to study the performance and scalability of pNFS, using PVFS2 as the backend file system. The aim is to understand the characteristics of pNFS, and its feasibility as the parallel file system solution for clusters. From our experimental results we observe that pNFS can take advantage of high speed networks such as InfiniBand, achieving up to 480% improvement in throughput compared with using GigE as the transport. pNFS can eliminate the single server bottleneck associated with NFS. pNFS/PVFS2 shows significantly higher throughput and better scalability compared with NFS/PVFS2. pNFS/PVFS2 achieves peak write


Fig. 1. Architecture of the NFS/RDMA stack in OpenSolaris
Fig. 3. RPC/RDMA protocol design
Enhancing the performance of NFSv4 with RDMA

October 2007 · 2,223 Reads · 6 Citations

NFS is a widely deployed storage technology that has gone through several revisions. One of the latest revisions, v4, has started to be deployed. NFSv4 on OpenSolaris uses TCP as the underlying transport, which has limited its performance. In this paper, we take on the challenge of designing an RDMA transport for NFSv4. Challenges include COMPOUND procedures, which might potentially have unbounded request and reply sizes. Performance evaluation shows that NFSv4 can achieve an IOzone Read throughput of over 700 MB/s and an IOzone Write bandwidth of over 500 MB/s. It also significantly outperforms NFSv4/TCP.
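To make the COMPOUND sizing challenge concrete, here is a hedged sketch of estimating a worst-case reply size by walking the operation list and deciding whether a separate RDMA reply chunk is needed. The structure, the per-operation estimates and the 1 KB threshold are assumptions for illustration, not the OpenSolaris design.

```c
/* Sketch of the sizing problem COMPOUND procedures pose for an RDMA
 * transport: the reply size is only bounded by walking the operation
 * list. Names, estimates and threshold are hypothetical. */
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

#define INLINE_REPLY_MAX 1024   /* assumed inline reply buffer size */

struct compound_op {
    const char *name;
    size_t max_reply_bytes;     /* worst-case reply size for this op */
};

/* Decide whether a COMPOUND reply can ride inline or needs a
 * client-provided RDMA reply chunk. */
static bool needs_reply_chunk(const struct compound_op *ops, int n)
{
    size_t total = 0;
    for (int i = 0; i < n; i++)
        total += ops[i].max_reply_bytes;
    return total > INLINE_REPLY_MAX;
}

int main(void)
{
    struct compound_op ops[] = {
        { "PUTFH",   128 },
        { "READ",    65536 },   /* large data-bearing reply */
        { "GETATTR", 256 },
    };
    printf("reply chunk needed: %s\n",
           needs_reply_chunk(ops, 3) ? "yes" : "no");
    return 0;
}
```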


Designing NFS with RDMA for Security, Performance and Scalability

October 2007 · 94 Reads · 18 Citations

NFS has traditionally used TCP or UDP as the underlying transport. However, the overhead of these stacks has limited both the performance and scalability of NFS. Recently, high-performance networks such as InfiniBand have been deployed. These networks provide low latency of a few microseconds and high bandwidth of up to 20 Gbps for large messages. Because of the unique characteristics of NFS protocols, previous designs of NFS with RDMA were unable to exploit the improved bandwidth of networks such as InfiniBand. They also leave the server open to attacks from malicious clients. In this paper, we discuss the design principles for implementing NFS/RDMA protocols. We propose, implement and evaluate an alternate design for NFS/RDMA on InfiniBand, which can significantly improve the security of the server compared to the previous design. In addition, we evaluate the performance bottlenecks of using RDMA operations in NFS protocols and propose strategies and designs that tackle these overheads. With the best of these strategies and designs, we demonstrate throughput of 700 MB/s on the OpenSolaris NFS/RDMA design and 900 MB/s on the Linux design, and an application-level improvement in performance of up to 50%. We also evaluate the scalability of the RDMA transport in a multi-client setting with a RAID array of disks. Our design has been integrated into the OpenSolaris kernel.


Fig. 1. OpenMP fork-join model 
Fig. 3. Aggregate Instruction TLB misses per second of application run time with the application binary placed in 4K pages. 
Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support

April 2007 · 199 Reads · 16 Citations

Modern multi-core architectures have become popular because of the limitations of deep pipelines and concerns over heat and power. Some of these multi-core architectures, such as the Intel Xeon, have the ability to run several threads on a single core. The OpenMP standard for compiler-directive-based shared memory programming gives the developer an easy path to writing multi-threaded programs and is a natural fit for multi-core architectures. The OpenMP standard uses loop parallelism as a basis for work division among multiple threads. These loops usually use arrays in their computation, with different data distributions and access patterns. The performance of accesses to these arrays may be impacted by the underlying page size, depending on the frequency and strides of these accesses. In this paper, we discuss the issues and potential benefits of using large pages for OpenMP applications. We design an OpenMP implementation capable of using large pages and evaluate the impact of the large page support available in most modern processors on the performance and scalability of parallel OpenMP applications. Results show an improvement in performance of up to 25% for some applications. Large pages also help improve the scalability of these applications.
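A minimal sketch of the idea, assuming Linux MAP_HUGETLB support and reserved huge pages: back the OpenMP work array with large pages so the parallel loop touches fewer TLB entries. The array size and the fallback policy are illustrative, not the paper's implementation.

```c
/* Sketch: back an OpenMP work array with huge pages to cut TLB misses.
 * Requires reserved huge pages (e.g. /proc/sys/vm/nr_hugepages).
 * Build with: gcc -fopenmp hugepages.c */
#define _GNU_SOURCE
#include <omp.h>
#include <stdio.h>
#include <sys/mman.h>

#define N (1UL << 21)   /* 2M doubles, 16 MB */

int main(void)
{
    size_t bytes = N * sizeof(double);

    /* Ask for huge-page-backed anonymous memory; fall back to normal
     * pages if no huge pages are reserved. */
    double *a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (a == MAP_FAILED)
        a = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (a == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    /* Typical OpenMP loop parallelism over the large array. */
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++)
        a[i] = (double)i * 0.5;

    printf("a[last] = %f (threads: %d)\n", a[N - 1], omp_get_max_threads());
    munmap(a, bytes);
    return 0;
}
```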


Better NFS through RDMA and Efficient Memory Registration

February 2007 · 54 Reads

NFS over RDMA implementations on two operating systems are shown to achieve 10-gigabit InfiniBand wire saturation with careful management of memory registration. While NFS is highly desirable in grid computing environments for its familiar file API and ease of management, performance issues in its implementations over Ethernet protocol stacks have impeded its use there. We show that NFS version 3 clients and servers on two architecturally distinct operating systems can be layered atop the new RPC/RDMA protocol and achieve outstanding performance through the use of appropriate memory registration techniques. With NFS/RDMA and InfiniBand, we demonstrate performance in excess of 700 MB/s on OpenSolaris and 900 MB/s on Linux, at CPU utilizations of only 10%.
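The key point above is careful management of memory registration. The sketch below shows the register-once, reuse-many-times step using the libibverbs API (link with -libverbs); it sets up no connection and trims error handling, and is only meant to illustrate where the registration cost sits, not how the NFS/RDMA code manages its buffers.

```c
/* Sketch of "register once, reuse many times" buffer registration with
 * libibverbs. Not the NFS/RDMA buffer manager; illustration only. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

#define BUF_SIZE (1 << 20)   /* 1 MB I/O buffer */

int main(void)
{
    struct ibv_device **devs = ibv_get_device_list(NULL);
    if (!devs || !devs[0]) {
        fprintf(stderr, "no RDMA devices found\n");
        return 1;
    }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ctx ? ibv_alloc_pd(ctx) : NULL;
    if (!pd) {
        fprintf(stderr, "failed to open device or allocate PD\n");
        return 1;
    }

    /* Register the buffer once; the resulting lkey/rkey can be reused
     * for every subsequent RDMA operation on this buffer, avoiding the
     * cost of per-request registration. */
    void *buf = malloc(BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        perror("ibv_reg_mr");
        return 1;
    }
    printf("registered %d bytes, lkey=0x%x rkey=0x%x\n",
           BUF_SIZE, mr->lkey, mr->rkey);

    ibv_dereg_mr(mr);
    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```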


Figure 2. Overall Design for MVAPICH-uDAPL
Figure 5. Time Spent on Connection Establishment
MPI over uDAPL: Can High Performance and Portability Exist Across Architectures?

June 2006 · 108 Reads · 8 Citations

Looking at the TOP500 list of supercomputers, we can see that different architectures and networking technologies appear on the scene from time to time. The networking technologies are also changing along with the advances in processor technologies. While the hardware has been constantly changing, parallel applications written in different paradigms have remained largely unchanged. With MPI being the most popular parallel computing standard, it is crucial to have an MPI implementation that is portable across different networks and architectures. It is also desirable to have such an MPI deliver high performance. In this paper we take on this challenge. We have designed an MPI with both portability and portable high performance using the emerging uDAPL interface. We present the design alternatives and a comprehensive performance evaluation of this new design. The results show that this design can improve the startup time and communication performance by 30% compared with our previous work. It also delivers the same good performance as MPI implemented over the native APIs of the underlying interconnect. We also present a multi-stream MPI design which aims to achieve high bandwidth across networks and operating systems. Experimental results on Solaris show that the multi-stream design can improve bandwidth over InfiniBand by 30%, and improve application performance by up to 11%.
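The multi-stream idea can be approximated at the application level by splitting one large message across several concurrent non-blocking transfers, as in the hedged sketch below; the paper's design operates inside the MPI library over uDAPL, so this illustrates the concept rather than that implementation.

```c
/* Sketch of a multi-stream transfer: split one large message into
 * several chunks posted as concurrent non-blocking sends/receives.
 * Illustrative only. Run with: mpirun -np 2 ./streams */
#include <mpi.h>
#include <stdlib.h>
#include <string.h>

#define NSTREAMS 4
#define TOTAL    (4 * 1024 * 1024)   /* 4 MB payload */

int main(int argc, char **argv)
{
    int rank;
    char *buf = malloc(TOTAL);
    MPI_Request req[NSTREAMS];
    int chunk = TOTAL / NSTREAMS;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        memset(buf, 'x', TOTAL);
        /* Post one non-blocking send per "stream"; the chunks can move
         * through the network concurrently. */
        for (int s = 0; s < NSTREAMS; s++)
            MPI_Isend(buf + s * chunk, chunk, MPI_CHAR, 1, s,
                      MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSTREAMS, req, MPI_STATUSES_IGNORE);
    } else if (rank == 1) {
        for (int s = 0; s < NSTREAMS; s++)
            MPI_Irecv(buf + s * chunk, chunk, MPI_CHAR, 0, s,
                      MPI_COMM_WORLD, &req[s]);
        MPI_Waitall(NSTREAMS, req, MPI_STATUSES_IGNORE);
    }

    MPI_Finalize();
    free(buf);
    return 0;
}
```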


Citations (23)


... In the area of logic simulation, the use of FPGA offers opportunities for performance since the design being simulated can simply be emulated on the FPGA [38]. Noronha et al. explore the use of a programmable network interface card to accelerate GVT computation and direct message cancellation [24]. Our work differs in the way that a general optimistic PDES is implemented completely in hardware. ...

Reference:

PDES-A: Accelerators for parallel discrete event simulation implemented on FPGAs
Early Cancellation: An Active NIC Optimization for Time Warp
  • Citing Article
  • January 2002

... In addition to one backend swap device, several frameworks are designed in the form of hybrid swap. For example, the reports in [16,25] utilize remote memory and a hard disk together for enhanced reliability. However, constant reading/writing on a disk slows down the performance of other I/O operations and consumes network bandwidth if the disk is connected via a network such as iSCSI. ...

Exploiting Remote Memory in InfiniBand Clusters using a High Performance Network Block Device (HPBD)
  • Citing Article
  • April 2013

... To our knowledge, only few attempts have been made to use RDMA in a DSM protocol. In each of [7, 2, 5] a page oriented Home-based Lazy Release Consistency protocol (HLRC) is optimized with RDMA. In [7] multiple page diffs (the changes made by the local processor) are sent to their home via RDMA. ...

Reducing diff overhead in software DSM systems using RDMA operations in infiniband
  • Citing Article
  • August 2004

... In order to reduce the communication overhead of numerical models when using massive cores, it is necessary to use a method that overlaps computation with communication. Noronha et al. (2005) proposed a non-blocking method using MPI for the medium-range atmospheric circulation model MM5, which overlaps the exchange of adjacent data in the X-direction by synchronizing data received from neighboring processors during computation. The European Centre for Medium-Range Weather Forecasts (ECMWF) (Molteni et al. 1996), which is state-of-the-art, also uses overlapping techniques to reduce the computation overhead. ...

Performance Evaluation of MM5 on Clusters with Modern Interconnects: Scalability and Impact
  • Citing Conference Paper
  • August 2005

Lecture Notes in Computer Science

... Due to the needs of both planeness and flexibility of the data organization of the object storage system, the efforts of storing those files in an orderly manner grow significantly [1][2][3][4][5]. Just a few massive sets of examples dealing with those issues are Amazon Simple Storage Service (Amazon S3) [6], Lustre [7,8], Ceph [9], Sheepdog [10], and OpenStack Object Storage (Swift) [11]. ...

Benefits of high speed interconnects to cluster file systems: a case study with Lustre

... Various groups from industry and academia are working on MPI implementations. Several freely available implementations exist and, further, so called vendor MPI implementations exist, which are tuned for special hardware (see for example [9] [16] [5] [20] [3] [4]). MPI specifies primitives for sending and receiving of messages. ...

Designing a Portable MPI-2 over Modern Interconnects Using uDAPL Interface

Lecture Notes in Computer Science

... Hardware-level solutions can elevate performance issues, e.g., by providing a dedicated high-speed network [12] and using dedicated memory blades [11]. However, many methods have not been adopted because of the major investments needed [29], such as changes in the OS and hypervisor, explicit memory management, or hardware support [11,11,22,31]. ...

Swapping to Remote Memory over InfiniBand: An Approach using a High Performance Network Block Device
  • Citing Conference Paper
  • September 2005

... When even more exotic networks are available, end-system, edge, and host integrated network stack APIs are available, such as LibFabric (Grun et al. 2015), and its underlying UCX framework (Shamis et al. 2015), and Portals (Barrett et al. 2017). Detailed performance characterization of many of these paradigms can be found in Jose et al. (2011), Balaji (2004), Lu et al. (2013), and Balaji et al. (2005a). In BXI (Derradji et al. 2015), the Portals 4 userspace API is leveraged to provide the application developer OS bypass capabilities to a high-throughput, low-latency systems interconnect, such as Cray's Aries network. ...

Head-to-TOE Evaluation of High-Performance Sockets over Protocol Offload Engines

... We intend to extend the design of this NIC-based allto-all broadcast to MPI [7] layer and study its benefits to applications. In addition, we intend to study its benefits in other programming models, such as distributed shared-memory [8], and their applications. ...

Implementing TreadMarks over GM on Myrinet: Challenges, Design Experience, and Performance Evaluation.
  • Citing Conference Paper
  • January 2003