ABSTRACT: As the types of problems we solve in high-performance computing and other areas become more complex, the amount of data generated and used is growing at a rapid rate. Today many terabytes of data are common; tomorrow petabytes of data will be the norm. Much work has been put into increasing capacity and I/O performance for large-scale storage systems. However, one often ignored area is metadata management. Metadata can have a significant impact on the performance of a system. Past approaches have moved metadata activities to a separate server in order to avoid potential interference with data operations. However, with the advent of object-based storage technology, there is a compelling argument to re-couple metadata and data. In this paper we present two metadata management schemes, both of which remove the need for a separate metadata server and replace it with object-based storage.
ABSTRACT: Active measurements on network paths provide end-to-end network health status in terms of metrics such as bandwidth, delay, jitter and loss. Hence, they are increasingly being used for various network control and management functions on the Internet. For purposes of network health anomaly detection and forecasting involved in these functions, it is important to accurately model the time-series process of active measurements. In this paper, we describe our time-series analysis of two typical active measurement data sets collected over several months: (i) routine, and (ii) event-laden. Our analysis suggests that active network measurements follow a moving average process. Specifically, they possess ARIMA(0,1,q) model characteristics with low q values, across multi-resolution timescales. We validate our model selection accuracy by comparing the model's predicted values against the actual measurements.
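The ARIMA(0,1,q) structure the abstract identifies can be illustrated with a small NumPy sketch on synthetic data (not the paper's measurement sets): simulate an ARIMA(0,1,1) series and produce one-step forecasts with the model's own recursion, checking that the forecast errors collapse back to the innovation noise.

```python
import numpy as np

# Illustrative sketch on synthetic data, not the paper's datasets.
# ARIMA(0,1,1): x_t = x_{t-1} + e_t + theta*e_{t-1}
rng = np.random.default_rng(0)
theta, n = 0.4, 500
e = rng.normal(size=n)
diffs = e[1:] + theta * e[:-1]                 # MA(1) increments
x = np.concatenate([[0.0], np.cumsum(diffs)])  # integrate to get the series

# One-step-ahead forecast recursion for ARIMA(0,1,1):
#   xhat_t = x_{t-1} + theta * (x_{t-1} - xhat_{t-1})
forecast = np.empty(n)
forecast[0] = x[0]
for t in range(1, n):
    forecast[t] = x[t - 1] + theta * (x[t - 1] - forecast[t - 1])

# If the model matches, forecast errors are close to the white-noise
# innovations e_t (variance near 1 here).
err = x[1:] - forecast[1:]
print(float(np.var(err)))
```

The same recursion, with q estimated from the data rather than fixed, is what validating an ARIMA(0,1,q) fit against actual measurements amounts to.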
ABSTRACT: Distributed file systems that use multiple servers to store data in parallel are becoming commonplace. Much work has already gone into such systems to maximize data throughput. However, metadata management has historically been treated as an afterthought. In previous work we focused on improving metadata management techniques by placing file metadata along with data on object-based storage devices (OSDs). However, we did not investigate directory operations. This work looks at the possibility of designing directory structures directly on OSDs, without the need for intervening servers. In particular, the need for atomicity is a fundamental requirement that we explore in depth. Through performance results of benchmarks and applications we show the feasibility of using OSDs directly for metadata, including directory operations.
ABSTRACT: Managing concurrency is a fundamental requirement for any multi-threaded system, frequently implemented by serializing critical code regions or using object locks on shared resources. Storage systems are one case of this, where multiple clients may wish to access or modify on-disk objects concurrently yet safely. Data consistency may be provided by an inter-client protocol, or it can be implemented in the file system server or storage device. In this work we demonstrate ways of enabling atomic operations on object-based storage devices (OSDs), in particular, the compare-and-swap and fetch-and-add atomic primitives. With examples from basic disk resident data structures to higher level applications like file systems, we show how atomics-capable storage devices can be used to solve consistency requirements of distributed algorithms. Offloading consistency management to storage devices obviates the need for dedicated lock manager servers.
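The two primitives named above can be sketched with a hypothetical in-memory model of an atomics-capable OSD (the class and attribute names are illustrative, not the OSD command set): the device serializes compare-and-swap and fetch-and-add on named attributes, so clients coordinate without a lock-manager server.

```python
import threading

# Hypothetical sketch: model an atomics-capable OSD that serializes
# compare-and-swap and fetch-and-add on named attributes, standing in
# for primitives executed on the storage device itself.
class AtomicOSD:
    def __init__(self):
        self._attrs = {}
        self._lock = threading.Lock()  # stands in for on-device serialization

    def compare_and_swap(self, key, expected, new):
        """Set attrs[key] = new iff it equals expected; return prior value."""
        with self._lock:
            cur = self._attrs.get(key, 0)
            if cur == expected:
                self._attrs[key] = new
            return cur

    def fetch_and_add(self, key, delta):
        """Atomically add delta to attrs[key]; return the prior value."""
        with self._lock:
            cur = self._attrs.get(key, 0)
            self._attrs[key] = cur + delta
            return cur

# Example: concurrent clients reserve unique 512-byte append regions
# with fetch-and-add, with no lock-manager server involved.
osd = AtomicOSD()

def client(n_appends=1000):
    for _ in range(n_appends):
        osd.fetch_and_add("eof", 512)

threads = [threading.Thread(target=client) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(osd._attrs["eof"])  # 4 * 1000 * 512 = 2048000
```

Every client observes a distinct prior value from fetch-and-add, which is exactly the property that lets the device replace a lock service for operations like log appends or reference counting.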
ABSTRACT: The access patterns performed by disk-intensive applications vary widely, from simple contiguous reads or writes through an entire file to completely unpredictable random access. Often, applications will be able to access multiple disconnected sections of a file in a single operation. Application programming interfaces such as POSIX and MPI encourage the use of non-contiguous access with calls that process I/O vectors. Below the programming interface, however, most storage protocols do not implement I/O vector operations (also known as scatter/gather). These protocols, including NFSv3 and block-based SCSI devices, must instead issue multiple independent operations to complete the single I/O vector operation specified by the application, at the cost of a much slower overall transfer time. Scatter/gather I/O is critical to the performance of many parallel applications, hence protocols designed for this area do tend to support I/O vectors. The Parallel Virtual File System (PVFS) in particular does so; however, a recent specification for object-based storage devices (OSD) does not. Using a software implementation of an OSD as the storage device in a PVFS framework, we show the advantages of providing direct support for non-contiguous data transfers. We also implement the feature in OSDs in a way that is both efficient for performance and appropriate for inclusion in future specification documents.
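The POSIX-style vectored calls the abstract refers to can be demonstrated with Python's bindings for `pwritev`/`preadv` (available on Linux): several discontiguous memory buffers move in a single system call, which is the capability the paper argues belongs in the OSD command set as well.

```python
import os
import tempfile

# Minimal sketch of vectored (scatter/gather) I/O via POSIX pwritev/preadv,
# as exposed by Python's os module on Linux.
fd, path = tempfile.mkstemp()
try:
    # Gather: write three separate buffers in one system call.
    written = os.pwritev(fd, [b"alpha-", b"beta-", b"gamma"], 0)
    assert written == 16

    # Scatter: read back into three separate buffers in one system call.
    bufs = [bytearray(6), bytearray(5), bytearray(5)]
    got = os.preadv(fd, bufs, 0)
    assert got == 16
    print(bytes(bufs[0]), bytes(bufs[1]), bytes(bufs[2]))
finally:
    os.close(fd)
    os.unlink(path)
```

A protocol without scatter/gather support would need one round trip per buffer here; collapsing them into one operation is the transfer-time saving the paper measures.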
ABSTRACT: As storage systems evolve, the block-based design of today's disks is becoming inadequate. As an alternative, object-based storage devices (OSDs) offer a view where the disk manages data layout and keeps track of various attributes about data objects. By moving functionality that is traditionally the responsibility of the host OS to the disk, it is possible to improve overall performance and simplify management of a storage system. The capabilities of OSDs will also permit performance improvements in parallel file systems, such as further decoupling metadata operations and thus reducing metadata server bottlenecks. In this work we present an implementation of the Parallel Virtual File System (PVFS) integrated with a software emulator of an OSD and describe an infrastructure for client access. Even with the overhead of emulation, performance is comparable to a traditional server-fronted implementation, demonstrating that serverless parallel file systems using OSDs are an achievable goal.
ABSTRACT: In order to increase client capacity and performance, storage systems have begun to utilize the advantages offered by modern interconnects. Previously, storage traffic has been transported over costly Fibre Channel networks or ubiquitous but low-performance Ethernet networks. However, with the adoption of the iSCSI extensions for RDMA (iSER) it is now possible to use RDMA-based interconnects for storage while leveraging existing iSCSI tools and deployments. Building on previous work with an object-based storage device emulator using the iSCSI transport, we extend its functionality to include iSER. Using an RDMA transport brings with it many design issues, including the need to register memory used by the network and the need to bridge the quite different RDMA completion semantics with existing event management based on file descriptors. Experiments demonstrate reduced latency and greatly increased throughput compared to iSCSI implementations both on gigabit Ethernet and on IP over InfiniBand.
ABSTRACT: As storage systems grow larger and more complex, the traditional block-based design of current disks can no longer satisfy workloads that are increasingly metadata intensive. A new approach is offered by object-based storage devices (OSDs). By moving more of the responsibility of storage management onto each OSD, improvements in performance, scalability and manageability are possible. Since this technology is new, no physical object-based storage device is currently available. In this work we describe a software emulator for an object-based disk. We focus on the design of the attribute storage, which is used to hold metadata associated with particular data objects. Alternative designs are discussed, and performance results for an SQL implementation are presented.
ABSTRACT: As computing breaches petascale limits both in processor performance and storage capacity, the only way that current and future gains in performance can be achieved is by increasing the parallelism of the system. Gains in storage performance remain low due to the use of traditional distributed file systems such as NFS, where although multiple clients can access files at the same time, only one node can serve files to the clients. New file systems that distribute load across multiple data servers are being developed; however, most implementations still concentrate all the metadata load at a single server. Distributing metadata load is important to accommodate growing numbers of more powerful clients. Scaling metadata performance is more complex than scaling raw I/O performance, and with distributed metadata the complexity increases further. In this paper we present strategies for file creation in distributed metadata file systems. Using the PVFS distributed file system as our testbed, we present designs that are able to reduce the message complexity of the create operation and increase performance. Compared to the base-case create protocol implemented in PVFS, our design delivers near-constant operation latency as the system scales, does not degenerate under high-contention situations, and increases throughput linearly as the number of metadata servers increases. The design schemes are applicable to any distributed file system implementation.
ABSTRACT: With rapid advances in VLSI technology, field programmable gate arrays (FPGAs) are receiving the attention of the parallel and high performance computing community. In this paper, we propose a highly parallel FPGA design for the Floyd-Warshall algorithm to solve the all-pairs shortest-paths problem in a directed graph. Our work is motivated by a computationally intensive bio-informatics application that employs this algorithm. The design we propose makes efficient and maximal utilization of the large amount of resources available on an FPGA to maximize parallelism in the presence of significant data dependences. Experimental results from a working FPGA implementation on the Cray XD1 show a speedup of 22 over execution on the XD1's processor.
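For reference, the recurrence the FPGA design parallelizes is the classic Floyd-Warshall update, shown here as a plain software version (the example graph is illustrative, not from the paper's bio-informatics workload):

```python
# Software reference for the Floyd-Warshall all-pairs shortest-paths
# recurrence: dist[i][j] = min(dist[i][j], dist[i][k] + dist[k][j])
# for each pivot k. The FPGA design parallelizes the i,j updates per pivot.
INF = float("inf")

def floyd_warshall(w):
    n = len(w)
    dist = [row[:] for row in w]   # copy the weight matrix
    for k in range(n):
        dk = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == INF:
                continue
            di = dist[i]
            for j in range(n):
                alt = dik + dk[j]
                if alt < di[j]:
                    di[j] = alt
    return dist

# 4-node example graph as an adjacency matrix of edge weights.
w = [[0, 3, INF, 7],
     [8, 0, 2, INF],
     [5, INF, 0, 1],
     [2, INF, INF, 0]]
d = floyd_warshall(w)
print(d[0][2])  # shortest 0 -> 2 path is 0 -> 1 -> 2, cost 5
print(d[1][3])  # shortest 1 -> 3 path is 1 -> 2 -> 3, cost 3
```

The triple loop carries a dependence on the pivot row and column at each step k, which is the "significant data dependence" the FPGA design must work around to extract parallelism.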
ABSTRACT: Field-Programmable Gate Arrays (FPGAs) are being employed in high performance computing systems owing to their potential to accelerate a wide variety of long-running routines. Parallel FPGA-based designs often yield a very high speedup. Applications using these designs on reconfigurable supercomputers involve software on the system managing computation on the FPGA. To extract maximum performance from an FPGA design at the application level, it becomes necessary to minimize associated data movement costs on the system. We address this hardware/software integration challenge in the context of the All-Pairs Shortest-Paths (APSP) problem in a directed graph. We employ a parallel FPGA-based design using a blocked algorithm to solve large instances of APSP. With appropriate design choices and optimizations, experimental results on the Cray XD1 show that the FPGA-based implementation sustains an application-level speedup of 15 over an optimized CPU-based implementation.
ABSTRACT: Zero-copy, RDMA, and protocol offload are three very important characteristics of high performance interconnects. Previous networks that made use of these techniques were built upon proprietary, and often expensive, hardware. With the introduction of iWarp, it is now possible to achieve all three over existing low-cost TCP/IP networks. iWarp is a step in the right direction, but requires an expensive RNIC to enable zero-copy, RDMA, and protocol offload. While the hardware is expensive at present, prices should fall given that iWarp is based on a commodity interconnect. In the meantime only the most critical of servers make use of iWarp, but in order to take advantage of the RNIC both sides must be so equipped. It is for this reason that we have implemented the iWarp protocol in software. This allows a server equipped with an RNIC to exploit its advantages even if the client does not have an RNIC. While throughput and latency do not improve by doing this, the server with the RNIC does experience a dramatic reduction in system load. This means that the server is much more scalable, and can handle many more clients than would otherwise be possible with the usual sockets/TCP/IP protocol stack.
ABSTRACT: A distributed lock manager (DLM) provides advisory locking services to applications such as databases and file systems that run on distributed systems. Lock management at the server is implemented using first-in-first-out (FIFO) queues. In this paper, we demonstrate a novel way of delegating the lock management to the participating lock-requesting nodes, using advanced network primitives such as remote direct memory access (RDMA) and atomic operations. This nicely complements the original idea of DLM, where management of the lock space is distributed. Our implementation achieves better load balancing, reduction in server load and improved throughput over traditional designs.
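The delegation idea can be sketched as a ticket lock: instead of the server queueing requests in FIFO order, each client orders itself by performing a fetch-and-add on a shared counter, as an RDMA atomic would on the server's memory. This is an illustrative local model (threads standing in for nodes), not the paper's wire protocol.

```python
import threading

# Illustrative sketch: clients order themselves with a fetch-and-add on
# a ticket counter (as an RDMA atomic would), replacing the server-side
# FIFO queue of a traditional DLM.
class TicketLock:
    def __init__(self):
        self.next_ticket = 0   # advanced by fetch-and-add at acquire time
        self.now_serving = 0   # advanced at release time
        self._cv = threading.Condition()  # stands in for the atomic unit

    def acquire(self):
        with self._cv:
            my = self.next_ticket          # fetch-and-add on remote counter
            self.next_ticket += 1
            while self.now_serving != my:  # wait until our ticket is served
                self._cv.wait()

    def release(self):
        with self._cv:
            self.now_serving += 1
            self._cv.notify_all()

lock, counter = TicketLock(), [0]

def worker():
    for _ in range(1000):
        lock.acquire()
        counter[0] += 1                    # critical section
        lock.release()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter[0])  # 4000: mutual exclusion preserved, FIFO order by ticket
```

Because tickets are handed out atomically, fairness (FIFO by ticket number) is preserved without the server ever inspecting a queue, which is the source of the reduced server load the abstract reports.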
ABSTRACT: As data becomes increasingly accessible, encryption is necessary to protect its integrity and security. Due to the highly parallel nature of the AES encryption algorithm, an FPGA-based approach provides the potential for up to an order of magnitude increase in performance over a traditional CPU. Our effort will showcase the capability of FPGA-based encryption on the Cray XD1 as well as other FPGA-related efforts at OSC's center for data intensive computing.
ABSTRACT: The term iWarp indicates a set of published protocol specifications that provide remote read and write access to user applications, without operating system intervention or intermediate data copies. The iWarp protocol provides for higher bandwidth and lower latency transfers over existing, widely deployed TCP/IP networks. While hardware implementations of iWarp are starting to emerge, there is a need for software implementations to enable offload on servers as a transition mechanism, for protocol testing, and for future protocol research. The work presented here allows a server with an iWarp network card to utilize it fully by implementing the iWarp protocol in software on the non-accelerated clients. While throughput does not improve, the true benefit of reduced load on the server machine is realized. Experiments show that sender system load is reduced from 35% to 5% and receiver load is reduced from 90% to under 5%. These gains allow a server to scale to handle many more simultaneous client connections.
ABSTRACT: Intelligent storage systems were an active area of research in the latter half of the last decade. The idea was to improve the throughput of data-intensive applications from the database and image processing domains by offloading computation onto the active storage elements, thereby reducing unnecessary data traffic between data sources and compute nodes. With the advent of Object-based Storage Devices (OSD) we feel that the time is appropriate to revisit the notion of intelligent storage. OSDs provide a higher-level object abstraction that makes it easier to describe the computation to the devices, and they provide the right platform to ask several questions in the context of intelligent storage devices. This is a working document capturing our experience in designing an intelligent OSD (iOSD) and using it to build scalable systems. In this preliminary document we describe the design goals for an intelligent OSD. We explain the implementation details of iOSD and the challenges faced in integrating the offload engine into the OSD ecosystem. We describe how offloaded computation is expressed, executed and managed in a storage system made up of iOSDs. Finally, we quantify the performance and overhead of iOSD.
ABSTRACT: Rapid advances in VLSI technology have led to Field-Programmable Gate Arrays (FPGAs) being employed in High Performance Computing systems. Applications using FPGAs on reconfigurable supercomputers involve software on the system managing computation on the reconfigurable hardware. To extract maximum benefits from a parallel FPGA kernel at the application level, it becomes crucial to minimize data movement costs on the system. We address this challenge in the context of the All-Pairs Shortest-Paths (APSP) problem in a directed graph. Employing a parallel FPGA-based APSP kernel with a blocked algorithm, with appropriate optimizations, the application-level speedup of the FPGA-based implementation over a modern general-purpose microprocessor is increased from 4x to 15x for all problem sizes.