P. Balaji

Virginia Polytechnic Institute and State University, Blacksburg, VA, USA

Publications (43), 1.78 Total impact

  • Article: ProOnE: a general-purpose protocol onload engine for multi- and many-core architectures
    ABSTRACT: Modern high-end computing systems utilize specialized offload engines to enhance various aspects of their processing. For example, high-speed networks such as InfiniBand, Quadrics and Myrinet utilize specialized hardware to offload network processing to help improve performance. However, such hardware units are expensive, and their manufacturing complexity increases exponentially depending on the number and complexity of tasks they offload. On the other hand, the proliferation of multi- and many-core processors into the general desktop and laptop markets is increasingly driving their cost down due to the economies of scale. To take advantage of the obvious benefits of multi/many-core architectures, we propose, design and evaluate ProOnE, a general-purpose Protocol Onload Engine. ProOnE utilizes a small subset of the available cores on a multi-core CPU to "onload" various tasks in a dedicated manner instead of "offloading" them to specialized hardware. The general-purpose processing capabilities of multi-core architectures allow ProOnE to be designed in a flexible, extensible and scalable manner, while benefiting from the falling costs of general-purpose CPUs. In this paper, we onload onto ProOnE several tasks relevant to communication sub-systems such as MPI that are too complex for current hardware offload engines to support, and demonstrate significant benefits in terms of overlap of computation and communication and improved application performance.
    Computer Science - Research and Development 04/2012; 23(3):133-142.
  • Conference Proceeding: Designing High-End Computing Systems with InfiniBand and High-Speed Ethernet
    D.K. Panda, S. Sur, P. Balaji
    ABSTRACT: InfiniBand (IB) and High-speed Ethernet (HSE) technologies are generating a lot of excitement towards building next generation High-End Computing (HEC) systems. This tutorial will provide an overview of these emerging technologies, their offered features, their current market standing, and their suitability for prime-time HEC. It will start with a brief overview of IB, HSE, and their architectural features. An overview of the emerging OpenFabrics stack which encapsulates both IB and HSE in a unified manner will be presented. IB and HSE hardware/software solutions and the market trends will be highlighted. Finally, sample performance numbers highlighting the performance these technologies can achieve in different environments will be shown.
    High Performance Interconnects (HOTI), 2010 IEEE 18th Annual Symposium on; 09/2010
  • Article: Global-scale distributed I/O with ParaMEDIC
    ABSTRACT: Achieving high performance for distributed I/O on a wide-area network continues to be an elusive holy grail. Despite enhancements in network hardware as well as software stacks, achieving high performance remains a challenge. In this paper, our worldwide team took a completely new and non-traditional approach to distributed I/O, called ParaMEDIC: Parallel Metadata Environment for Distributed I/O and Computing, by utilizing application-specific transformation of data to orders-of-magnitude smaller metadata before performing the actual I/O. Specifically, this paper details our experiences in deploying a large-scale system to facilitate the discovery of missing genes and constructing a genome similarity tree by encapsulating the mpiBLAST sequence-search algorithm into ParaMEDIC. The overall project involved nine computational sites spread across the U.S. and generated more than a petabyte of data that was 'teleported' to a large-scale facility in Tokyo for storage.
    Concurrency and Computation: Practice and Experience 01/2010;
  • Conference Proceeding: Are nonblocking networks really needed for high-end-computing workloads?
    ABSTRACT: High-speed interconnects are frequently used to provide scalable communication on increasingly large high-end computing systems. Often, these networks are nonblocking, where there exist independent paths between all pairs of nodes in the system allowing for simultaneous communication with zero network contention. This performance, however, comes at a heavy cost as the number of components needed (and hence cost) increases superlinearly with the number of nodes in the system. In this paper, we study the behavior of real and synthetic supercomputer workloads to understand the impact of the network's nonblocking capability on overall performance. Starting from a fully nonblocking network, we begin by assessing the worst-case performance degradation caused by removing interstage communication links, resulting in over provisioning and hence potentially blocking in the communication network. We also study the impact of several factors on this behavior, including system workloads, multicore processors, and switch crossbar sizes. Our observations show that a significant reduction in the number of interstage links can be tolerated on all of the workloads analyzed, causing less than 5% overall loss of performance.
    Cluster Computing, 2008 IEEE International Conference on; 11/2008
  • Conference Proceeding: Impact of Network Sharing in Multi-Core Architectures
    ABSTRACT: As commodity components continue to dominate the realm of high-end computing, two hardware trends have emerged as major contributors: high-speed networking technologies and multi-core architectures. Communication middleware such as the Message Passing Interface (MPI) uses the network technology for communicating between processes that reside on different physical nodes, while using shared memory for communicating between processes on different cores within the same node. Thus, two conflicting possibilities arise: (i) with the advent of multi-core architectures, the number of processes that reside on the same physical node and hence share the same physical network can potentially increase significantly, resulting in increased network usage, and (ii) given the increase in intra-node shared-memory communication for processes residing on the same node, the network usage can potentially decrease significantly. In this paper, we address these two conflicting possibilities and study the behavior of network usage in multi-core environments with sample scientific applications. Specifically, we analyze trends that result in increase or decrease of network usage, and we derive insights into application performance based on these. We also study the sharing of different resources in the system in multi-core environments and identify the contribution of the network in this mix. In addition, we study different process allocation strategies and analyze their impact on such network sharing.
    Computer Communications and Networks, 2008. ICCCN '08. Proceedings of 17th International Conference on; 09/2008
  • Conference Proceeding: Distributed I/O with ParaMEDIC: Experiences with a Worldwide Supercomputer
    Proceedings of 23rd Annual International Supercomputing Conference (ISC); 01/2008
  • Conference Proceeding: Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling
    ABSTRACT: Quality of service (QoS) mechanisms allowing users to request turnaround-time guarantees for their jobs have recently generated much interest. In our previous work we designed a framework, QoPS, to allow for such QoS. This framework provides an admission control mechanism that only accepts jobs whose requested deadlines can be met and, once accepted, guarantees these deadlines. However, the framework is completely blind to the revenue these jobs can fetch for the supercomputer center. By accepting a job, the supercomputer center might relinquish its capability to accept some future arriving (and potentially more expensive) jobs. In other words, while each job pays an explicit price to the system for running it, the system may also be viewed as paying an implicit opportunity cost by accepting the job. Thus, accepting a job is profitable only when the job's price is higher than its opportunity cost. In this paper we analyze the impact such opportunity cost can have on the overall revenue of the supercomputer center and attempt to minimize it through predictive techniques. Specifically, we propose two extensions to QoPS, Value-aware QoPS (VQoPS) and Dynamic Value-aware QoPS (DVQoPS), to provide such capabilities. We present detailed analysis of these schemes and demonstrate using simulation that they not only achieve a several-fold improvement in system revenue, but also good service differentiation as a much desired side-effect.
    Parallel Processing, 2007. ICPP 2007. International Conference on; 10/2007
  • Conference Proceeding: Advanced Flow-control Mechanisms for the Sockets Direct Protocol over InfiniBand
    ABSTRACT: The Sockets Direct Protocol (SDP) is an industry standard to allow existing TCP/IP applications to be executed on high-speed networks such as InfiniBand (IB). Like many other high-speed networks, IB requires the receiver process to inform the network interface card (NIC), before the data arrives, about buffers in which incoming data has to be placed. To ensure that the receiver process is ready to receive data, the sender process typically performs flow-control on the data transmission. Existing designs of SDP flow-control are naive and do not take advantage of several interesting features provided by IB. Specifically, features such as RDMA are only used for performing zero-copy communication, although RDMA has more capabilities such as sender-side buffer management (where a sender process can manage SDP resources for the sender as well as the receiver). Similarly, IB also provides hardware flow-control capabilities that have not been studied in previous literature. In this paper, we utilize these capabilities to improve the SDP flow-control over IB using two designs: RDMA-based flow-control and NIC-assisted RDMA-based flow-control. We evaluate the designs using micro-benchmarks and real applications. Our evaluations reveal that these designs can improve the resource usage of SDP and consequently its performance by an order of magnitude in some cases. Moreover, we can achieve 10-20% improvement in performance for various applications.
    Parallel Processing, 2007. ICPP 2007. International Conference on; 10/2007
  • Conference Proceeding: An Analysis of 10-Gigabit Ethernet Protocol Stacks in Multicore Environments
    ABSTRACT: This paper analyzes the interactions between the protocol stack (TCP/IP or iWARP over 10-Gigabit Ethernet) and its multicore environment. Specifically, for host-based protocols such as TCP/IP, we notice that a significant amount of processing is statically assigned to a single core, resulting in an imbalance of load on the different cores of the system and adversely impacting the performance of many applications. For host-offloaded protocols such as iWARP, on the other hand, the portions of the communication stack that are performed on the host, such as buffering of messages and memory copies, are closely tied with the associated process, and hence do not create such load imbalances. Thus, in this paper, we demonstrate that by intelligently mapping different processes of an application to specific cores, the imbalance created by the TCP/IP protocol stack can be largely countered and application performance significantly improved. At the same time, since the load is better balanced in host-offloaded protocols such as iWARP, such mapping does not adversely affect their performance, thus keeping the mapping generic enough to be used with multiple protocol stacks.
    High-Performance Interconnects, 2007. HOTI 2007. 15th Annual IEEE Symposium on; 09/2007
  • Conference Proceeding: Nonuniformly Communicating Noncontiguous Data: A Case Study with PETSc and MPI
    ABSTRACT: Due to the complexity associated with developing parallel applications, scientists and engineers rely on high-level software libraries such as PETSc, ScaLAPACK and PESSL to ease this task. Such libraries assist developers by providing abstractions for mathematical operations, data representation and management of parallel layouts of the data, while internally using communication libraries such as MPI and PVM. With high-level libraries managing data layout and communication internally, it can be expected that they organize application data suitably for performing the library operations optimally. However, this places additional overhead on the underlying communication library by making the data layout noncontiguous in memory and communication volumes (data transferred by a process to each of the other processes) nonuniform. In this paper, we analyze the overheads associated with these two aspects (noncontiguous data layouts and nonuniform communication volumes) in the context of the PETSc software toolkit over the MPI communication library. We describe the issues with the current approaches used by MPICH2 (an implementation of MPI), propose different approaches to handle these issues and evaluate these approaches with micro-benchmarks as well as an application over the PETSc software library. Our experimental results demonstrate close to an order of magnitude improvement in the performance of a 3-D Laplacian multi-grid solver application when evaluated on a 128 processor cluster.
    Parallel and Distributed Processing Symposium, 2007. IPDPS 2007. IEEE International; 04/2007
  • Article: Bridging the Ethernet-Ethernot Performance Gap
    P. Balaji, W. Feng, D.K. Panda
    ABSTRACT: Ethernet seeks to reach the system-area-network performance of Ethernot networks by introducing 10-gigabit Ethernet and adding the hardware-offloaded protocol stacks that give Ethernots their performance edge. As this study shows, such moves might indeed be closing the gap. Recent developments could also empower Ethernet to penetrate the SAN arena, where it has struggled against the Ethernots' serious performance advantage. Despite that, network providers have been reluctant to view it as a serious SAN interconnect because of Ethernet's formidable performance lag relative to mainstream SAN technologies.
    IEEE Micro 06/2006; · 1.78 Impact Factor
  • Article: A case for UDP offload engines in LambdaGrids
    ABSTRACT: Though TCP/IP is considered the de facto standard for Internet-related wide-area computing, its failure for LambdaGrids is well documented. On the other hand, rate-controlled UDP/IP-based protocols are strongly emerging as a feasible solution for meeting the performance goals in such environments. While such protocols have been able to avoid most of the drawbacks of TCP/IP, they are still plagued by the drawbacks of UDP/IP, such as a host-based implementation, limiting their performance on high-speed networks. On the other hand, researchers have attempted to fix this drawback in TCP/IP using hardware-offloaded TCP/IP implementations such as the TCP Offload Engines (TOEs). Given these two orthogonal developments, it is not completely clear which is the better solution, i.e., a rate-controlled UDP/IP-based protocol that is implemented in the host or a hardware-offloaded TCP/IP solution such as the TOE. In this paper, we combine the benefits of both these solutions to design and develop an emulated UDP Offload Engine (UOE) based on the Chelsio T110 TOE, a solution which would transparently improve the performance for existing UDP/IP-based applications. Evaluations of our emulated UOE stack show that our design can achieve up to a 35% improvement in the performance while maintaining a significantly lower CPU usage.
    05/2006;
  • Article: Asynchronous zero-copy communication for synchronous sockets in the Sockets Direct Protocol (SDP) over InfiniBand
    ABSTRACT: Sockets Direct Protocol (SDP) is an industry standard pseudo sockets-like implementation to allow existing sockets applications to directly and transparently take advantage of the advanced features of current generation networks such as InfiniBand. The SDP standard supports two kinds of sockets semantics, viz., Synchronous sockets (e.g., used by Linux, BSD, Windows) and Asynchronous sockets (e.g., used by Windows, upcoming support in Linux). Due to the inherent benefits of asynchronous sockets, the SDP standard allows several intelligent approaches such as source-avail and sink-avail based zero-copy for these sockets. Unfortunately, most of these approaches are not beneficial for the synchronous sockets interface. Further, due to its portability, ease of use and support on a wider set of platforms, the synchronous sockets interface is the one used by most sockets applications today. Thus, a mechanism by which the approaches proposed for asynchronous sockets can be used for synchronous sockets is highly desirable. In this paper, we propose one such mechanism, termed AZ-SDP (Asynchronous Zero-Copy SDP), where we memory-protect application buffers and carry out communication asynchronously while maintaining the synchronous sockets semantics. We present our detailed design in this paper and evaluate the stack with an extensive set of benchmarks. The experimental results demonstrate that our approach can provide an improvement of close to 35% for medium-message uni-directional throughput and up to a factor of 2 benefit for computation-communication overlap tests and multi-connection benchmarks.
    05/2006;
  • Article: Head-to-toe evaluation of high-performance sockets over protocol offload engines
    ABSTRACT: Despite the performance drawbacks of Ethernet, it still possesses a sizable footprint in cluster computing because of its low cost and backward compatibility to existing Ethernet infrastructure. In this paper, we demonstrate that these performance drawbacks can be reduced (and in some cases, arguably eliminated) by coupling TCP offload engines (TOEs) with 10-Gigabit Ethernet (10GigE). Although there exists significant research on individual network technologies such as 10GigE, InfiniBand (IBA), and Myrinet, to the best of our knowledge, there has been no work that compares the capabilities and limitations of these technologies with the recently introduced 10GigE TOEs in a homogeneous experimental testbed. Therefore, we present performance evaluations across 10GigE, IBA, and Myrinet (with identical cluster-compute nodes) in order to enable a coherent comparison with respect to the sockets interface. Specifically, we evaluate the network technologies at two levels: (i) a detailed micro-benchmark evaluation and (ii) an application-level evaluation with sample applications from different domains, including a bio-medical image visualization tool known as the Virtual Microscope, an iso-surface oil reservoir simulator, a cluster file-system known as the Parallel Virtual File-System (PVFS), and a popular cluster management tool known as Ganglia. In addition to 10GigE's advantage with respect to compatibility to wide-area network infrastructures, e.g., in support of grids, our results show that 10GigE also delivers performance that is comparable to traditional high-speed network technologies such as IBA and Myrinet in a system-area network environment to support clusters and that 10GigE is particularly well-suited for sockets-based applications.
    10/2005;
  • Conference Proceeding: Performance characterization of a 10-Gigabit Ethernet TOE
    ABSTRACT: Though traditional Ethernet based network architectures such as Gigabit Ethernet have suffered from a huge performance difference as compared to other high performance networks (e.g., InfiniBand, Quadrics, Myrinet), Ethernet has continued to be the most widely used network architecture today. This trend is mainly attributed to the low cost of the network components and their backward compatibility with the existing Ethernet infrastructure. With the advent of 10-Gigabit Ethernet and TCP offload engines (TOEs), whether this performance gap can be bridged is an open question. In this paper, we present a detailed performance evaluation of the Chelsio T110 10-Gigabit Ethernet adapter with TOE. We have done performance evaluations in three broad categories: (i) detailed micro-benchmark performance evaluation at the sockets layer, (ii) performance evaluation of the message passing interface (MPI) stack atop the sockets interface, and (iii) application-level evaluations using the Apache Web server. Our experimental results demonstrate latency as low as 8.9 μs and throughput of nearly 7.6 Gbps for these adapters. Further, we see an order-of-magnitude improvement in the performance of the Apache Web server while utilizing the TOE as compared to the basic 10-Gigabit Ethernet adapter without TOE.
    High Performance Interconnects, 2005. Proceedings. 13th Symposium on; 09/2005
  • Conference Proceeding: Architecture for caching responses with multiple dynamic dependencies in multi-tier data-centers over InfiniBand
    ABSTRACT: It has been well acknowledged in the research community that in order to design a data-center environment which is efficient and offers high performance, one of the critical issues that needs to be addressed is the effective reuse of cache content stored away from the origin server. However, for caching dynamically changing content (e.g., content involved in online banking, Internet auctions, etc.), consistency and coherency issues need to be addressed. In addition, most current real world requests have multiple dynamic dependencies, i.e., these requests might depend on multiple data objects. Further, these requests are not entirely independent; several requests might have common dependencies. While there have been previous research solutions on maintaining coherent caches for dynamic content, these solutions have several shortcomings including inability to adapt to server load or handle multiple dynamic dependencies. In this paper, we propose a load resilient architecture using one sided operations supported by several high performance interconnects such as InfiniBand, while maintaining multiple dynamic dependencies per response. Our experimental results show that our schemes tackle the multi-dependency issue efficiently and significantly outperform the existing approaches. Further, our results demonstrate that the proposed load resilient architecture can possibly improve the performance of loaded data-centers by over an order of magnitude.
    Cluster Computing and the Grid, 2005. CCGrid 2005. IEEE International Symposium on; 06/2005
  • Conference Proceeding: On the provision of prioritization and soft QoS in dynamically reconfigurable shared data-centers over InfiniBand
    ABSTRACT: In the past few years several researchers have proposed and configured data-centers providing multiple independent services, known as shared data-centers. For example, several ISPs and other Web service providers host multiple unrelated Web sites on their data-centers allowing potential differentiation in the service provided to each of them. Such differentiation becomes essential in several scenarios in a shared data-center environment. In this paper, we extend our previously proposed scheme on dynamic reconfigurability to allow service differentiation in the shared data-center environment. In particular, we point out the issues associated with the basic dynamic reconfigurability scheme and propose two extensions to it, namely (i) dynamic reconfiguration with prioritization and (ii) dynamic reconfiguration with prioritization and QoS. Our experimental results show that our extensions can allow the dynamic reconfigurability scheme to attain a performance improvement of up to five times for high priority Web sites irrespective of any background low priority requests. Also, these extensions are able to significantly improve the performance of low priority requests when there are minimal or no high priority requests in the system. Further, they can achieve performance similar to a static scheme with up to 43% fewer nodes in some cases.
    Performance Analysis of Systems and Software, 2005. ISPASS 2005. IEEE International Symposium on; 04/2005
  • Conference Proceeding: Towards provision of quality of service guarantees in job scheduling
    ABSTRACT: Considerable research has focused on the problem of scheduling dynamically arriving independent parallel jobs on a given set of resources. There has also been some recent work in the direction of providing differentiated service to different classes of jobs using statically or dynamically calculated priorities assigned to the jobs. However, the potential and usability of a quality of service based scheme has not been much studied. In this work, we extend a previously proposed scheme (QoPS) to provide quality of service to submitted jobs; we propose extensions to the algorithm in multiple aspects: (i) studying the effect of user tolerance towards missed deadlines on the overall profit attainable by the supercomputer center, (ii) providing artificial slack to some jobs to maximize the overall profit and (iii) utilizing a kill-and-restart mechanism to further improve the profit attainable.
    Cluster Computing, 2004 IEEE International Conference on; 10/2004
  • Conference Proceeding: Sockets Direct Protocol over InfiniBand in clusters: is it beneficial?
    ABSTRACT: The Sockets Direct Protocol (SDP) was proposed recently in order to enable sockets based applications to take advantage of the enhanced features provided by InfiniBand architecture. In this paper, we study the benefits and limitations of an implementation of SDP. We first analyze the performance of SDP based on a detailed suite of micro-benchmarks. Next, we evaluate it on two different real application domains: (1) a multitier data-center environment and (2) a Parallel Virtual File System (PVFS). Our micro-benchmark results show that SDP is able to provide up to 2.7 times better bandwidth as compared to the native sockets implementation over InfiniBand (IPoIB) and significantly better latency for large message sizes. Our experimental results also show that SDP is able to achieve a considerably higher performance (improvement of up to 2.4 times) as compared to IPoIB in the PVFS environment. In the data-center environment, SDP outperforms IPoIB for large file transfers in spite of currently being limited by a high connection setup time. However, this limitation is entirely implementation specific and as the InfiniBand software and hardware products are rapidly maturing, we expect this limitation to be overcome soon. Based on this, we have shown that the projected performance for SDP, without the connection setup time, can outperform IPoIB for small message transfers as well.
    Performance Analysis of Systems and Software, 2004 IEEE International Symposium on - ISPASS; 02/2004
  • Conference Proceeding: Impact of high performance sockets on data intensive applications
    ABSTRACT: The challenging issues in supporting data intensive applications on clusters include efficient movement of large volumes of data between processor memories and efficient coordination of data movement and processing by a runtime support to achieve high performance. Such applications have several requirements such as guarantees in performance, scalability with these guarantees and adaptability to heterogeneous environments. With the advent of user-level protocols like the Virtual Interface Architecture (VIA) and the modern InfiniBand Architecture, the latency and bandwidth experienced by applications have approached those of the physical network on clusters. In order to enable applications written on top of TCP/IP to take advantage of the high performance of these user-level protocols, researchers have come up with a number of techniques including User Level Sockets Layers over high performance protocols. In this paper, we study the performance and limitations of such a substrate, referred to here as SocketVIA, using a component framework designed to provide runtime support for data intensive applications. The experimental results show that by reorganizing certain components of an application (in our case, the partitioning of a dataset into smaller data chunks), we can make significant improvements in application performance. This leads to a higher scalability of applications with performance guarantees. It also allows fine grained load balancing, hence making applications more adaptable to heterogeneity in resource availability. The experimental results also show that the different performance characteristics of SocketVIA allow a more efficient partitioning of data at the source nodes, thus improving the performance of the application up to an order of magnitude in some cases.
    High Performance Distributed Computing, 2003. Proceedings. 12th IEEE International Symposium on; 07/2003

Institutions

  • 2008
    • Virginia Polytechnic Institute and State University
      • Department of Computer Science
      Blacksburg, VA, USA
  • 2007
    • Argonne National Laboratory
      Downers Grove, IL, USA
  • 2002–2006
    • The Ohio State University
      • Department of Computer Science and Engineering
      Columbus, OH, USA
  • 2005
    • Los Alamos National Laboratory
      Los Alamos, NM, USA