D.K. Panda
The Ohio State University | OSU · Department of Computer Science and Engineering

PhD

About

726
Publications
88,101
Reads
13,018
Citations
Since 2017: 86 research items, 3,757 citations
[Chart: citations per year, 2017–2023]

Publications (726)
Article
The importance of GPUs in accelerating HPC applications is evident from the fact that a large number of supercomputing clusters are GPU-enabled. Many of these HPC applications use MPI as their programming model. These MPI applications frequently exchange data that is non-contiguous in GPU memory. MPI provides Derived Datatypes (DDTs) to represent suc...
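A minimal sketch (not from the paper) of the MPI Derived Datatype idea mentioned above: describing one column of a row-major matrix with MPI_Type_vector so a non-contiguous region can be sent without manual packing. The matrix dimensions and rank pairing are illustrative.

```c
/* Illustrative sketch: a strided (non-contiguous) column of a row-major
 * matrix described with an MPI Derived Datatype instead of manual packing.
 * Assumes at least 2 ranks; buffer sizes are arbitrary. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int rows = 1024, cols = 1024;
    double *matrix = malloc(rows * cols * sizeof(double));

    /* One column: 'rows' blocks of 1 double, separated by a stride of 'cols'. */
    MPI_Datatype column_t;
    MPI_Type_vector(rows, 1, cols, MPI_DOUBLE, &column_t);
    MPI_Type_commit(&column_t);

    if (rank == 0)
        MPI_Send(matrix, 1, column_t, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(matrix, 1, column_t, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column_t);
    free(matrix);
    MPI_Finalize();
    return 0;
}
```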
Book
An in-depth overview of an emerging field that brings together high-performance computing, big data processing, and deep learning. Over the last decade, the exponential explosion of data known as big data has changed the way we understand and harness the power of data. The emerging field of high-performance big data computing, which brings together...
Chapter
As more High-Performance Computing (HPC) and Deep Learning (DL) applications are adapting to scale using GPUs, the communication of GPU-resident data is becoming vital to end-to-end application performance. Among the available MPI operations in such applications, All-to-All is one of the most communication-intensive operations that becomes the bott...
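For reference, a bare MPI_Alltoall exchange (an illustrative sketch, not the chapter's optimized design). The block size and payload are arbitrary; in a GPU-aware MPI library the same call could be issued on device buffers.

```c
/* Illustrative sketch: every rank contributes one block per peer and
 * receives one block from every peer via MPI_Alltoall. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int block = 4;                      /* elements exchanged per peer */
    int *sendbuf = malloc(size * block * sizeof(int));
    int *recvbuf = malloc(size * block * sizeof(int));
    for (int i = 0; i < size * block; i++)
        sendbuf[i] = rank;                    /* tag each block with the sender */

    /* Rank r's block i goes to rank i; rank r receives block j from rank j. */
    MPI_Alltoall(sendbuf, block, MPI_INT, recvbuf, block, MPI_INT, MPI_COMM_WORLD);

    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}
```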
Article
The Deep Learning (DL) training process consists of multiple phases: data augmentation, training, and validation of the trained model. Traditionally, these phases are executed either on CPUs or GPUs in a serial fashion due to a lack of additional computing resources to offload independent phases of DL training. Recently, Mellanox/NVIDIA introduce...
Preprint
Python has become a dominant programming language for emerging areas like Machine Learning (ML), Deep Learning (DL), and Data Science (DS). An attractive feature of Python is that it provides an easy-to-use programming interface while allowing library developers to enhance the performance of their applications by harnessing the computing power offered by...
Preprint
Understanding and visualizing the full-stack performance trade-offs and interplay between HPC applications, MPI libraries, the communication fabric, and the file system is a challenging endeavor. Designing a holistic profiling and visualization method for HPC communication networks is challenging since different levels of communication coexist and...
Chapter
In state-of-the-art, production-quality MPI (Message Passing Interface) libraries, communication progress is performed either by the main thread or by a separate communication progress thread. Taking advantage of separate communication threads can lead to a higher overlap of communication and computation as well as reduced total application executi...
Chapter
Due to the emergence of AMD GPUs and their adoption in upcoming exascale systems (e.g. Frontier), it is pertinent to have scientific applications and communication middlewares ported and optimized for these systems. The Radeon Open Compute (ROCm) platform is an open-source suite of libraries tailored towards writing high-performance software for AMD GP...
Conference Paper
Full-text available
Transformer models have revolutionized the field of Natural Language Processing (NLP) and they achieve state-of-the-art performance in applications like machine translation, question answering, regression, and summarization. However, training Transformers is challenging because of their large memory and compute requirements. The literature contains...
Conference Paper
The MPI-3.0 standard introduced neighborhood collective to support sparse communication patterns used in many applications. In this paper, we propose a hierarchical and distributed graph topology that considers the physical topology of the system and the virtual communication pattern of processes to improve the performance of large message neighbor...
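A minimal sketch of an MPI-3 neighborhood collective on a simple ring-shaped virtual topology (illustrative only; the paper's hierarchical, distributed graph design is not reproduced here).

```c
/* Illustrative sketch: a distributed graph topology where each rank talks
 * only to its left and right neighbours, followed by a neighborhood
 * collective. Assumes at least 3 ranks so the two neighbours are distinct. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int neighbors[2] = { (rank - 1 + size) % size, (rank + 1) % size };

    MPI_Comm ring;
    MPI_Dist_graph_create_adjacent(MPI_COMM_WORLD,
                                   2, neighbors, MPI_UNWEIGHTED,
                                   2, neighbors, MPI_UNWEIGHTED,
                                   MPI_INFO_NULL, 0, &ring);

    int sendbuf[2] = { rank, rank };          /* one element per neighbour */
    int recvbuf[2];
    MPI_Neighbor_alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, ring);

    MPI_Comm_free(&ring);
    MPI_Finalize();
    return 0;
}
```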
Chapter
To reduce the training time of large-scale Deep Neural Networks (DNNs), Deep Learning (DL) scientists have started to explore parallelization strategies like data-parallelism, model-parallelism, and hybrid-parallelism. While data-parallelism has been extensively studied and developed, several problems exist in realizing model-parallelism and hybrid...
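As a point of reference for the data-parallel strategy mentioned above, a toy sketch of gradient averaging with MPI_Allreduce; the parameter count and "gradients" are placeholders, not the chapter's model- or hybrid-parallel designs.

```c
/* Toy sketch of data parallelism: each rank computes gradients on its own
 * data shard, then gradients are summed across ranks and averaged. */
#include <mpi.h>

#define NPARAMS 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    float grads[NPARAMS];
    for (int i = 0; i < NPARAMS; i++)
        grads[i] = (float)rank;               /* stand-in for a local backward pass */

    /* Sum gradients across ranks in place, then divide to get the average. */
    MPI_Allreduce(MPI_IN_PLACE, grads, NPARAMS, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);
    for (int i = 0; i < NPARAMS; i++)
        grads[i] /= size;

    MPI_Finalize();
    return 0;
}
```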
Chapter
Overlap of computation and communication is critical for good application-level performance. Modern high-performance networks offer hardware-assisted tag matching and rendezvous offload to enable communication progress without involving the host CPU. However, hardware-based offload cannot be used in many situations due to various hardware limitatio...
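A hedged sketch of the software-level overlap pattern these hardware features target: pre-post a receive, start a nonblocking send, compute independent work, then complete both. compute_independent_work and the pairing of ranks are hypothetical placeholders.

```c
/* Illustrative sketch: overlap of communication and computation with
 * nonblocking point-to-point operations. Assumes an even number of ranks. */
#include <mpi.h>

#define N 1048576
static double sbuf[N], rbuf[N];

static void compute_independent_work(void) { /* application kernel goes here */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;                      /* pair ranks 0-1, 2-3, ... */

    MPI_Request reqs[2];
    MPI_Irecv(rbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_independent_work();               /* overlapped with the transfer */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    MPI_Finalize();
    return 0;
}
```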
Article
This paper addresses the challenges of MPI derived datatype processing and proposes FALCON-X — A Fast and Low-overhead Communication framework for optimized zero-copy intra-node derived datatype communication on emerging CPU/GPU architectures. We quantify various performance bottlenecks such as memory layout translation and copy overheads for highl...
Conference Paper
Full-text available
Communication interfaces of High Performance Computing (HPC) systems, Cloud middleware, and Deep Learning (DL) frameworks have been continually evolving to meet the ever-increasing communication demands being placed on them by HPC, Cloud, and DL applications. Modern high performance interconnects like InfiniBand EDR 100 Gbps, InfiniBand HDR 200 Gbp...
Chapter
Full-text available
The advent of Graphics Processing Unit (GPU)-enabled OpenPOWER architectures is empowering the advancement of various High-Performance Computing (HPC) applications from dynamic modular simulation to deep learning training. GPU-aware Message Passing Interface (MPI) is one of the most efficient libraries used to exploit the computing power on GPU-en...
Conference Paper
The Message Passing Interface has been the dominant programming model for developing scalable and high-performance parallel applications. Collective operations empower group communication operations in a portable and efficient manner and are used by a large number of applications across different domains. Optimization of collective operations is...
Conference Paper
Frontera is the largest NSF-funded cluster in the US and comprises 8,008 nodes equipped with the latest Intel Xeon processors (Cascade Lake). In this paper, we explore the potential of Frontera for training state-of-the-art Deep Learning (DL) models at scale. Most DL studies present performance data from large-scale GPU clusters that are equippe...
Article
Heterogeneous HPC systems with GPUs are equipped with high-performance interconnects like InfiniBand, Omni-Path, PCIe, and NVLink. However, little exists in the literature that captures the performance impact of these interconnects on distributed Deep Learning (DL). In this paper, we choose Horovod, a distributed training middleware, to analyze and...
Chapter
Various Erasure Coding (EC) schemes based on hardware accelerations have been proposed in the community to leverage the advanced compute capabilities of modern data centers, such as Intel ISA-L Onload EC coders and Mellanox InfiniBand Offload EC coders. These EC coders can play a vital role in designing next-generation distributed storage systems....
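As a toy illustration of the erasure-coding idea (not the ISA-L or InfiniBand offload APIs), a single XOR parity block over k data blocks, i.e. a (k+1, k) code that tolerates the loss of any one data block; real coders use Reed-Solomon codes over general k+m layouts.

```c
/* Toy erasure-coding sketch: K data blocks protected by one XOR parity
 * block; any single lost data block can be rebuilt from the survivors. */
#include <stdint.h>
#include <string.h>

#define K 4            /* data blocks      */
#define BLOCK 4096     /* bytes per block  */

void encode_parity(uint8_t data[K][BLOCK], uint8_t parity[BLOCK]) {
    memset(parity, 0, BLOCK);
    for (int b = 0; b < K; b++)
        for (int i = 0; i < BLOCK; i++)
            parity[i] ^= data[b][i];
}

/* Rebuild block 'lost' by XOR-ing the parity with the surviving blocks. */
void reconstruct(uint8_t data[K][BLOCK], const uint8_t parity[BLOCK], int lost) {
    memcpy(data[lost], parity, BLOCK);
    for (int b = 0; b < K; b++)
        if (b != lost)
            for (int i = 0; i < BLOCK; i++)
                data[lost][i] ^= data[b][i];
}

int main(void) {
    static uint8_t data[K][BLOCK], parity[BLOCK];
    memset(data[2], 0xAB, BLOCK);             /* recognisable content        */
    encode_parity(data, parity);
    memset(data[2], 0, BLOCK);                /* simulate losing block 2     */
    reconstruct(data, parity, 2);             /* data[2] holds 0xAB again    */
    return 0;
}
```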
Conference Paper
The recent surge of Deep Learning (DL) models and applications can be attributed to the rise in computational resources, availability of large-scale datasets, and accessible DL frameworks such as TensorFlow and PyTorch. Because these frameworks have been heavily optimized for NVIDIA GPUs, several performance characterization studies exist for GPU-b...
Conference Paper
Distributed storage systems typically need data to be stored redundantly to guarantee data durability and reliability. While the conventional approach towards this objective is to store multiple replicas, today's unprecedented data growth rates encourage modern distributed storage systems to employ Erasure Coding (EC) techniques, which can achieve...
Conference Paper
Full-text available
The current wave of advances in Deep Learning (DL) have been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and development of software frameworks like TensorFlow (TF). However, little exists in the literature that addresses TensorFlow's distributed training capabilities. In this paper, we provide an in-depth per...
Article
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and GPU clusters with a relatively smaller number of nodes, efficient communication schemes need to be designed for such systems. This coupled with new application workloads brought forward by Deep Learning (DL...
Conference Paper
Full-text available
NVMe-based SSDs are in huge demand for Big Data analytics owing to their extremely low latency and high throughput for both read and write operations. Their inherent parallelism in request processing makes them ideal to be used in virtualized environments, where sharing of resources is a given. Given the shared resource-driven ideology of cloud env...
Preprint
Full-text available
TensorFlow has been the most widely adopted Machine/Deep Learning framework. However, little exists in the literature that provides a thorough understanding of the capabilities which TensorFlow offers for the distributed training of large ML/DL models that need computation and communication at scale. Most commonly used distributed training approach...
Conference Paper
Full-text available
Traditionally, MPI runtimes have been designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and dense multi-GPU systems, it has become important to design efficient communication schemes. This coupled with new application workloads brought forward by Deep Learning frameworks like Caffe and Microsoft C...
Conference Paper
Intel Knights Landing (KNL) and IBM POWER architectures are becoming widely deployed on modern supercomputing systems due to their powerful components. The MPI Remote Memory Access (RMA) model, which provides one-sided communication semantics, has been seen as an attractive approach for developing High-Performance Data Analytics (HPDA) applications such as...
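A minimal sketch of the MPI RMA one-sided model referenced above: rank 0 writes into a window exposed by rank 1 with MPI_Put inside a fence epoch; buffer contents and ranks are illustrative.

```c
/* Illustrative sketch of one-sided communication: rank 0 writes into the
 * memory window of rank 1 without rank 1 posting a receive. Assumes at
 * least 2 ranks. */
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int local = rank * 100;                   /* data to publish              */
    int window_buf = -1;                      /* exposed memory on every rank */

    MPI_Win win;
    MPI_Win_create(&window_buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);
    if (rank == 0)
        MPI_Put(&local, 1, MPI_INT, 1 /* target rank */, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                    /* completes the access epoch   */

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```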
Conference Paper
The overlap of computation and communication is critical for good performance of many HPC applications. State-of-the-art designs for the asynchronous progress require specially designed hardware resources (advanced switches or network interface cards), dedicated processor cores or application modification (e.g. use of MPI_Test). These techniques su...
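A sketch of the application-modification approach the paper mentions (use of MPI_Test): calling MPI_Test from inside the compute loop so the library can progress an outstanding nonblocking operation without dedicated hardware or a progress thread. The chunking and rank pairing are illustrative.

```c
/* Illustrative sketch: periodic MPI_Test calls inside the compute loop
 * drive progress on an outstanding nonblocking operation. Assumes an even
 * number of ranks. */
#include <mpi.h>

#define N 1048576
#define CHUNKS 64
static double buf[N];

static void compute_chunk(int c) { (void)c; /* slice of application work */ }

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    int peer = rank ^ 1;

    MPI_Request req;
    if (rank % 2 == 0)
        MPI_Isend(buf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);

    int done = 0;
    for (int c = 0; c < CHUNKS; c++) {
        compute_chunk(c);
        if (!done)
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poke progress */
    }
    if (!done)
        MPI_Wait(&req, MPI_STATUS_IGNORE);

    MPI_Finalize();
    return 0;
}
```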
Article
Broadcast is a widely used operation in many streaming and deep learning applications to disseminate large amounts of data on emerging heterogeneous High-Performance Computing (HPC) systems. However, traditional broadcast schemes do not fully utilize hardware features for Graphics Processing Unit (GPU)-based applications. In this paper, a model-ori...
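A hedged sketch of a GPU-aware broadcast, assuming a CUDA-aware MPI library that accepts device pointers (e.g. MVAPICH2-GDR or CUDA-aware Open MPI); the message size is illustrative, and none of the paper's model-oriented tuning is shown.

```c
/* Illustrative sketch: broadcasting a GPU-resident buffer by passing the
 * device pointer directly to MPI_Bcast, with no explicit host staging.
 * Requires a CUDA-aware MPI build. */
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const size_t n = 1 << 20;                 /* illustrative message size */
    float *d_buf;
    cudaMalloc((void **)&d_buf, n * sizeof(float));
    if (rank == 0)
        cudaMemset(d_buf, 0, n * sizeof(float));   /* root fills the payload */

    /* With a CUDA-aware MPI, device buffers are legal arguments here. */
    MPI_Bcast(d_buf, (int)n, MPI_FLOAT, 0, MPI_COMM_WORLD);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}
```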
Article
Scientists from many different fields have been developing Bulk‐Synchronous MPI applications to simulate and study a wide variety of scientific phenomena. Since failure rates are expected to increase in larger‐scale future HPC systems, providing efficient fault‐tolerance mechanisms for this class of applications is paramount. The global‐restart mod...
Article
Deep Learning over Big Data (DLoBD) is an emerging paradigm to mine value from the massive amount of gathered data. Many Deep Learning frameworks, like Caffe, TensorFlow, etc., start running over Big Data stacks, such as Apache Hadoop a...
Article
Full-text available
Remote procedure call (RPC) is the backbone of many modern distributed systems. Google's gRPC is one of the most popular open source RPC frameworks available in the community. gRPC is the main communication engine for Google's Deep Learning framework TensorFlow. TensorFlow primarily uses gRPC for communicating tensors and administrative tasks among...
Chapter
Single Root I/O Virtualization (SR-IOV) technology has been steadily gaining momentum for high-speed interconnects such as InfiniBand. SR-IOV-enabled InfiniBand has been widely used in modern HPC clouds with virtual machines and containers. While SR-IOV can deliver near-native I/O performance, recent studies have shown that locality-aware communicati...
Conference Paper
Significant growth has been witnessed during the last few years in HPC clusters with multi-/many-core processors, accelerators, and high-performance interconnects (such as InfiniBand, Omni-Path, iWARP, and RoCE). To alleviate the cost burden, sharing HPC cluster resources to end users through virtualization for both scientific computing and Big Dat...
Conference Paper
The Message Passing Interface (MPI) standard has become the de facto programming model for parallel computing through 25 years of continuous community effort. With the development of efficient HPC clouds, more and more MPI-based HPC applications are starting to run in cloud-based environments. Singularity is one of the most attractive containe...
Conference Paper
Full-text available
In this paper, we combine high-performance computing science with computational neuroscience methods to show how to speed-up cutting-edge methods for mapping and evaluation of the large-scale network of brain connections. More specifically, we use a recent factorization method of the Linear Fascicle Evaluation model (i.e., LiFE [1], [2]) that allow...
Conference Paper
Full-text available
Traditionally, Deep Learning (DL) frameworks like Caffe, TensorFlow, and Cognitive Toolkit exploited GPUs to accelerate the training process. This has been primarily achieved by aggressive improvements in parallel hardware as well as through sophisticated software frameworks like cuDNN and cuBLAS. However, recent enhancements to CPU-based hardware...
Article
Full-text available
Dense Multi-GPU systems have recently gained a lot of attention in the HPC arena. Traditionally, MPI runtimes have been primarily designed for clusters with a large number of nodes. However, with the advent of MPI+CUDA applications and CUDA-Aware MPI runtimes like MVAPICH2 and OpenMPI, it has become important to address efficient communication sche...
Conference Paper
Broadly, there exist two protocols for point-to-point data transfer in the Message Passing Interface (MPI) programming model - Eager and Rendezvous. State-of-the-art MPI libraries decide the switch point between these protocols based on the trade-off between memory footprint of the MPI library and communication performance without considering the o...
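An illustrative contrast of the two protocols (a sketch, not the paper's design): small messages are normally sent eagerly via pre-allocated library buffers, while large ones use a rendezvous handshake with zero-copy transfer. The sizes below are arbitrary, and the actual switch point is a library-tunable threshold.

```c
/* Illustrative sketch: two sends of very different sizes. A typical MPI
 * library sends the small one eagerly and uses rendezvous for the large
 * one; the threshold between them is a runtime-tunable parameter.
 * Assumes at least 2 ranks. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char small[64];                            /* likely below the eager threshold     */
    char *large = malloc(8 * 1024 * 1024);     /* likely above it: rendezvous path     */

    if (rank == 0) {
        MPI_Send(small, sizeof(small), MPI_CHAR, 1, 0, MPI_COMM_WORLD);
        MPI_Send(large, 8 * 1024 * 1024, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(small, sizeof(small), MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(large, 8 * 1024 * 1024, MPI_CHAR, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    free(large);
    MPI_Finalize();
    return 0;
}
```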
Conference Paper
Availability of large data sets like ImageNet and massively parallel computation support in modern HPC devices like NVIDIA GPUs have fueled a renewed interest in Deep Learning (DL) algorithms. This has triggered the development of DL frameworks like Caffe, Torch, TensorFlow, and CNTK. However, most DL frameworks have been limited to a single node....
Conference Paper
Full-text available
Big Data Systems are becoming increasingly complex and generally have very high operational costs. Cloud computing offers attractive solutions for managing large scale systems. However, one of the major bottlenecks in VM performance is virtualized I/O. Since Big Data applications and middleware rely heavily on high performance interconnects such as...
Article
Full-text available
With the emergence of high-performance data analytics, the Hadoop platform is being increasingly used to process data stored on high-performance computing clusters. While there is immense scope for improving the performance of Hadoop MapReduce (including the network-intensive shuffle phase) over these modern clusters, which are equipped with high-sp...
Conference Paper
Full-text available
Hadoop is gaining more and more popularity in virtualized environments because of the flexibility and elasticity offered by cloud-based systems. Hadoop supports topology-awareness through topology-aware designs in all of its major components. However, there exists no service that can automatically detect the underlying network topology in a scalabl...