Laxmikant V. Kalé’s research while affiliated with University of Illinois Urbana-Champaign and other places


Publications (81)


Scalable molecular dynamics on CPU and GPU architectures with NAMD
  • Article

July 2020 · 414 Reads · 2,304 Citations

NAMD is a molecular dynamics program designed for high-performance simulations of very large biological objects on CPU- and GPU-based architectures. NAMD offers scalable performance on petascale parallel supercomputers consisting of hundreds of thousands of cores, as well as on inexpensive commodity clusters commonly found in academic environments. It is written in C++ and relies on Charm++ parallel objects for optimal performance on low-latency architectures. NAMD is a versatile, multipurpose code that gathers state-of-the-art algorithms to carry out simulations in appropriate thermodynamic ensembles, using the widely popular CHARMM, AMBER, OPLS, and GROMOS biomolecular force fields. Here, we review the main features of NAMD that allow both equilibrium and enhanced-sampling molecular dynamics simulations with numerical efficiency. We describe the underlying concepts utilized by NAMD and their implementation, most notably for handling long-range electrostatics; controlling the temperature, pressure, and pH; applying external potentials on tailored grids; leveraging massively parallel resources in multiple-copy simulations; and providing hybrid quantum-mechanical/molecular-mechanical descriptions. We detail the variety of options offered by NAMD for enhanced-sampling simulations aimed at determining free-energy differences of either alchemical or geometrical transformations and outline their applicability to specific problems. Last, we discuss the roadmap for the development of NAMD and our current efforts toward achieving optimal performance on GPU-based architectures, pushing back the limitations that have prevented biologically realistic billion-atom objects from being fruitfully simulated, and making large-scale simulations less expensive and easier to set up, run, and analyze. NAMD is distributed free of charge with its source code at www.ks.uiuc.edu.
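As a concrete illustration of the equilibrium MD integration the abstract refers to, the following C++ sketch shows one velocity Verlet time step. This is not NAMD code: the force evaluation (bonded terms, short-range nonbonded terms, PME long-range electrostatics) is stubbed out, and the thermostat and barostat machinery NAMD provides is omitted.

```cpp
#include <vector>

struct Atom { double x[3], v[3], f[3], mass; };

// Placeholder for the force evaluation performed each step (bonded terms,
// short-range nonbonded terms, PME long-range electrostatics); here it simply
// zeroes the forces so the sketch is self-contained.
void compute_forces(std::vector<Atom>& atoms) {
    for (auto& a : atoms) a.f[0] = a.f[1] = a.f[2] = 0.0;
}

// One velocity Verlet step of size dt (thermostat and barostat omitted).
void velocity_verlet_step(std::vector<Atom>& atoms, double dt) {
    for (auto& a : atoms)
        for (int d = 0; d < 3; ++d) {
            a.v[d] += 0.5 * dt * a.f[d] / a.mass;  // half-kick
            a.x[d] += dt * a.v[d];                 // drift
        }
    compute_forces(atoms);                          // forces at new positions
    for (auto& a : atoms)
        for (int d = 0; d < 3; ++d)
            a.v[d] += 0.5 * dt * a.f[d] / a.mass;  // second half-kick
}
```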


Improving the memory access locality of hybrid MPI applications

September 2017 · 209 Reads · 16 Citations

Maintaining memory access locality continues to be a challenge for parallel applications and their runtime environments. By exploiting locality, application performance, resource usage, and performance portability can be improved. The main challenge is to detect and fix memory locality issues for applications that use shared-memory programming models for intra-node parallelization. In this paper, we investigate improving the memory access locality of a hybrid MPI+OpenMP application in two different ways: by manually fixing locality issues in its source code and by employing the Adaptive MPI (AMPI) runtime environment. Results show that AMPI can deliver locality improvements similar to those of manual source code changes, leading to substantial performance and scalability gains compared to the unoptimized version and to a pure MPI runtime. Compared to the hybrid MPI+OpenMP baseline, our optimizations improved performance by 1.8x on a single cluster node and by 1.4x on 32 nodes, with a speedup of 2.4x compared to a pure MPI execution on 32 nodes. In addition to performance, we also evaluate the impact of memory locality on the load balance within a node.
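A minimal sketch (not the paper's code) of the "manual source-code fix" category mentioned above, for the intra-node OpenMP portion of a hybrid application: on NUMA nodes with first-touch page placement, initializing data with the same static OpenMP schedule as the compute loop keeps each thread's pages on its local memory node.

```cpp
// Compile with OpenMP enabled (e.g. -fopenmp).
int main() {
    const long n = 1L << 26;
    double* a = new double[n];   // uninitialized: pages are not yet placed
    double* b = new double[n];

    // Parallel first touch: each page lands on the NUMA node of the thread
    // that writes it first.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) { a[i] = 0.0; b[i] = 1.0; }

    // Compute loop with the same static schedule, so accesses stay node-local.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) a[i] += 2.0 * b[i];

    delete[] a;
    delete[] b;
    return 0;
}
```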


Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing

January 2015 · 13 Reads · 33 Citations

We present an approach to improving data locality across different phases of fork/join programs scheduled using work stealing. The approach consists of (1) user-specified and automated methods for constructing a steal tree, the schedule of steal operations, and (2) constrained work-stealing algorithms that restrict the actions of the scheduler to mirror a given steal tree. These are combined to construct work-stealing schedules that maximize data locality across computation phases while ensuring load balance within each phase. These algorithms are also used to demonstrate dynamic coarsening, an optimization that improves spatial locality and reduces sequential overheads by combining many finer-grained tasks into coarser tasks while ensuring sufficient concurrency for locality-optimized load balance. Implementation and evaluation in Cilk demonstrate performance improvements of up to 2.5x on 80 cores. We also demonstrate that dynamic coarsening can combine the performance benefits of coarse task specification with the adaptability of finer tasks.
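The paper's algorithms constrain Cilk's work-stealing scheduler to mirror a recorded steal tree. The C++ sketch below (not the paper's implementation) strips that idea to its core: phase 1 records which worker executed each task (here with a simple round-robin assignment standing in for actual stealing), and phase 2 is constrained to mirror that recorded schedule so each task runs on the worker whose cache already holds its data.

```cpp
#include <functional>
#include <thread>
#include <vector>

struct Schedule { std::vector<int> owner; };  // owner[task] = worker id

// Phase 1: run tasks under some assignment (blocked round-robin here) and
// record which worker executed each task.
Schedule run_and_record(int ntasks, int nworkers,
                        const std::function<void(int)>& task) {
    Schedule s{std::vector<int>(ntasks, 0)};
    std::vector<std::thread> workers;
    for (int w = 0; w < nworkers; ++w)
        workers.emplace_back([&, w] {
            for (int t = w; t < ntasks; t += nworkers) {
                s.owner[t] = w;   // record who ran task t
                task(t);
            }
        });
    for (auto& th : workers) th.join();
    return s;
}

// Phase 2: constrained execution that mirrors the recorded schedule.
void run_constrained(const Schedule& s, int nworkers,
                     const std::function<void(int)>& task) {
    std::vector<std::thread> workers;
    for (int w = 0; w < nworkers; ++w)
        workers.emplace_back([&, w] {
            for (int t = 0; t < (int)s.owner.size(); ++t)
                if (s.owner[t] == w) task(t);
        });
    for (auto& th : workers) th.join();
}
```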


Scalable replay with partial-order dependencies for message-logging fault tolerance

November 2014 · 27 Reads · 10 Citations

Deterministic replay of a parallel application is commonly used for discovering bugs or to recover from a hard fault with message-logging fault tolerance. For message passing programs, a major source of overhead during forward execution is recording the order in which messages are sent and received. During replay, this ordering must be used to deterministically reproduce the execution. Previous work in replay algorithms often makes minimal assumptions about the programming model and application to maintain generality. However, in many applications, only a partial order must be recorded due to determinism intrinsic in the program, ordering constraints imposed by the execution model, and events that are commutative (their relative execution order during replay does not need to be reproduced exactly). In this paper, we present a novel algebraic framework for reasoning about the minimum dependencies required to represent the partial order for different orderings and interleavings. By exploiting this framework, we improve on an existing scalable message-logging fault tolerance scheme that uses a total order. The improved scheme scales to 131,072 cores on an IBM BlueGene/P with up to 2× lower overhead.
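A minimal sketch (not the paper's protocol) of the partial-order idea described above: messages known to be commutative need no ordering record at all, while other receives log only a (sender, per-sender sequence number) pair, which is then used during replay to gate delivery.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

struct Message { int sender; std::uint64_t seq; bool commutative; };

struct ReceiveLog {
    // One entry per non-commutative receive, pinning it to its position in the
    // receiver's delivery order; commutative receives need no ordering record.
    std::vector<std::pair<int, std::uint64_t>> order;

    // Forward execution: record only what replay needs to be deterministic.
    void on_receive(const Message& m) {
        if (!m.commutative) order.emplace_back(m.sender, m.seq);
    }

    // Replay: a pending non-commutative message may be delivered only when it
    // matches the next logged (sender, sequence-number) pair; commutative
    // messages may be delivered in any order.
    bool may_deliver(const Message& m, std::size_t next) const {
        if (m.commutative) return true;
        return next < order.size() &&
               order[next].first == m.sender && order[next].second == m.seq;
    }
};
```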


Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems

October 2014 · 51 Reads · 34 Citations · Parallel Computing

Solving sparse triangular systems of linear equations is a performance bottleneck in many methods for solving more general sparse systems. Both for direct methods and for many iterative preconditioners, it is used to solve the system or improve an approximate solution, often across many iterations. Solving triangular systems is notoriously resistant to parallelism, however, and existing parallel linear algebra packages appear to be ineffective in exploiting significant parallelism for this problem. We develop a novel parallel algorithm based on various heuristics that adapt to the structure of the matrix and extract parallelism that is unexploited by conventional methods. By analyzing and reordering operations, our algorithm can often extract parallelism even for cases where most of the nonzero matrix entries are near the diagonal. Our main parallelism strategies are: (1) identify independent rows, (2) send data earlier to achieve greater overlap, and (3) process dense off-diagonal regions in parallel. We describe the implementation of our algorithm in Charm++ and MPI and present promising experimental results on up to 512 cores of BlueGene/P, using numerous sparse matrices from real applications.
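As an illustration of the "identify independent rows" strategy, the C++ sketch below (not the paper's algorithm) computes a level schedule for a lower-triangular matrix stored in CSR form: rows sharing a level value have no mutual dependencies, so each level forms one parallel wavefront of the forward substitution.

```cpp
#include <algorithm>
#include <vector>

// rowptr/colind give the strictly lower-triangular nonzero pattern of L in
// CSR form (row i depends on the unknowns x[colind[k]] for
// k in [rowptr[i], rowptr[i+1])).
std::vector<int> compute_levels(const std::vector<int>& rowptr,
                                const std::vector<int>& colind) {
    const int n = (int)rowptr.size() - 1;
    std::vector<int> level(n, 0);
    for (int i = 0; i < n; ++i)
        for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
            level[i] = std::max(level[i], level[colind[k]] + 1);
    return level;  // rows with equal level values can be solved concurrently
}
```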


FIG. 2: The absolute error of the energy after roughly 40 iterations of the spectral projection method for different water clusters in RHF/6-31G**.

Solvers for O(N) Electronic Structure in the Strong Scaling Limit
  • Article
  • Full-text available

March 2014 · 130 Reads · 14 Citations · SIAM Journal on Scientific Computing

We present two parallel implementations, within the OpenMP and Charm++ frameworks, of the important spectral projection O(N) quantum chemical solver, with our recently introduced Sparse Approximate Matrix Multiply (SpAMM) as kernel. We find that the error in the energy under a finite SpAMM tolerance is well controlled while at the same time reducing the computational complexity of the solver to O(n lg n). We present parallel scaling studies of water cluster systems on up to 24,576 CPU cores. We find that the standard load balancing strategies of Charm++ lead to impressive fine-grained parallelism at this scale, approaching 1,000 CPU cores per water molecule for the smaller systems.
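The SpAMM kernel mentioned above skips sub-products whose contribution falls below a tolerance. The C++ sketch below (not the paper's kernel) shows the idea on a dense power-of-two matrix: recurse on quadrants and prune any sub-product whose block-norm bound is below tol. A real SpAMM implementation stores block norms in a quadtree rather than recomputing them at every level.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

struct Mat {                       // dense n x n matrix, n a power of two
    int n;
    std::vector<double> a;
    double& at(int i, int j)       { return a[(std::size_t)i * n + j]; }
    double  at(int i, int j) const { return a[(std::size_t)i * n + j]; }
};

// Frobenius norm of the s x s block of m whose top-left corner is (r, c).
static double block_norm(const Mat& m, int r, int c, int s) {
    double sum = 0.0;
    for (int i = 0; i < s; ++i)
        for (int j = 0; j < s; ++j) {
            double v = m.at(r + i, c + j);
            sum += v * v;
        }
    return std::sqrt(sum);
}

// C_block += A_block * B_block for s x s blocks, skipping any sub-product
// whose norm bound falls below tol.
void spamm(const Mat& A, int ra, int ca, const Mat& B, int rb, int cb,
           Mat& C, int rc, int cc, int s, double tol) {
    if (block_norm(A, ra, ca, s) * block_norm(B, rb, cb, s) < tol) return;
    if (s == 1) { C.at(rc, cc) += A.at(ra, ca) * B.at(rb, cb); return; }
    const int h = s / 2;
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            for (int k = 0; k < 2; ++k)
                spamm(A, ra + i * h, ca + k * h,
                      B, rb + k * h, cb + j * h,
                      C, rc + i * h, cc + j * h, h, tol);
}
```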


G-Charm: An adaptive runtime system for message-driven parallel applications on hybrid systems

June 2013 · 23 Reads · 22 Citations

The effective use of GPUs for accelerating applications depends on a number of factors, including effective asynchronous use of heterogeneous resources, reducing memory transfer between CPU and GPU, increasing the occupancy of GPU kernels, overlapping data transfers with computations, reducing GPU idling, and kernel optimizations. Overcoming these challenges requires considerable effort on the part of application developers, and most optimization strategies are proposed and tuned specifically for individual applications. In this paper, we present G-Charm, a generic framework with an adaptive runtime system for efficient execution of message-driven parallel applications on hybrid systems. The framework is based on Charm++, a message-driven programming environment and runtime for parallel applications. The techniques in our framework include dynamic scheduling of work on CPU and GPU cores, maximizing reuse of data present in GPU memory, data management in GPU memory, and combining multiple kernels. We present results using our framework on Tesla S1070 and Fermi C2070 systems with three classes of applications: a highly regular and parallel 2D Jacobi solver, a regular dense matrix Cholesky factorization representing linear algebra computations with dependencies among parallel computations, and highly irregular molecular dynamics simulations. With our generic framework, we obtain 1.5 to 15 times improvement over the previous GPU-based implementation of Charm++. We also obtain about 14% improvement over an implementation of Cholesky factorization with a static work-distribution scheme.
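One of the techniques named above is maximizing reuse of data already present in GPU memory. The C++ sketch below is not G-Charm's implementation; device_upload is a hypothetical stand-in for the actual CUDA transfer, defined here as a host-side copy so the sketch builds without a GPU. It shows the shape of such a software reuse cache.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <unordered_map>

using DeviceHandle = void*;  // opaque handle to a device-resident buffer

// Hypothetical stand-in for a cudaMalloc + cudaMemcpy upload.
DeviceHandle device_upload(const void* src, std::size_t bytes) {
    void* d = std::malloc(bytes);
    std::memcpy(d, src, bytes);
    return d;
}

class GpuBufferCache {
    std::unordered_map<const void*, DeviceHandle> resident_;
public:
    // Return the device copy of 'host', uploading only on a cache miss so
    // kernels launched repeatedly on the same data avoid redundant transfers.
    DeviceHandle get(const void* host, std::size_t bytes) {
        auto it = resident_.find(host);
        if (it != resident_.end()) return it->second;   // reuse, no transfer
        DeviceHandle dev = device_upload(host, bytes);
        resident_.emplace(host, dev);
        return dev;
    }
    // Drop the cached copy when the host buffer is modified or freed.
    void invalidate(const void* host) {
        auto it = resident_.find(host);
        if (it != resident_.end()) { std::free(it->second); resident_.erase(it); }
    }
};
```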


Characteristics of adaptive runtime systems in HPC

June 2013 · 14 Reads · 1 Citation

The phrase "Runtime System" is somewhat broad and is used with differing meanings in differing contexts. The Java runtime and most of the MPI runtimes are focused on providing mechanisms. In contrast, adaptive runtime systems emphasize strategies, in addition to providing mechanisms. This talk will look at some characteristics that make HPC RTSs adaptive. These include dynamic load balancing, exploitation of the "principle of persistence" to learn from recent data, automatic allocation to heterogeneous processors, automatic optimization of communication, application reconfiguration via control-points, automated control and optimization of temperature/power/energy/execution-time, automated tolerance of component failures so as to maintain the rate of computational progress in presence of such failures, and adapting to memory availability. The talk will examine these characteristics, and what features are necessary and/or desirable to empower the runtime system. I will illustrate it using examples from the runtime system underlying Charm++ and Adaptive MPI.


Fig. 1: Experimental setup and CPU timelines of 8 VMs showing one iteration of Stencil2D (white = idle time, colored portions = application functions). The timelines illustrate two sources of interference in a cloud: CPU sharing with an interfering task on one VM's core, and contention for the shared cache and memory-controller subsystem when VMs share a multicore node. Because the application is tightly coupled, a single slowed VM delays every iteration, and the distribution of such interference is fairly random and unpredictable, motivating a mechanism that adapts to dynamic variation in the execution environment.
Improving HPC Application Performance in Cloud through Dynamic Load Balancing

May 2013 · 251 Reads · 52 Citations

Driven by the benefits of elasticity and the pay-as-you-go model, cloud computing is emerging as an attractive alternative and addition to in-house clusters and supercomputers for some High Performance Computing (HPC) applications. However, poor interconnect performance, a heterogeneous and dynamic environment, and interference by other virtual machines (VMs) are bottlenecks for efficient HPC in the cloud. For tightly-coupled iterative applications, one slow processor slows down the entire application, resulting in poor CPU utilisation. In this paper, we present a dynamic load balancer for tightly-coupled iterative HPC applications in the cloud. It infers the static hardware heterogeneity in virtualized environments, and also adapts to the dynamic heterogeneity caused by the interference arising from multi-tenancy. Through continuous live monitoring, instrumentation, and periodic refinement of the task distribution to VMs, our load balancer adapts to the dynamic variations in cloud resources. Through experimental evaluation on a private cloud with 64 VMs using benchmarks and a real science application, we demonstrate performance benefits of up to 45%. Finally, we analyse the effect of load balancing frequency, problem size, and computational granularity (problem decomposition) on the performance and scalability of our techniques.
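A minimal sketch (not the paper's balancer) of the heterogeneity-aware redistribution idea: estimate each VM's effective speed from the last iteration's measurements, which already reflect both static hardware heterogeneity and multi-tenancy interference, then hand out the next iteration's tasks in proportion to those speeds.

```cpp
#include <cstddef>
#include <vector>

// tasks_done[i] and time_taken[i] are measurements for VM i from the last
// iteration (time_taken[i] is assumed to be positive).
// Returns share[vm] = number of tasks to assign next iteration.
std::vector<int> tasks_per_vm(const std::vector<double>& tasks_done,
                              const std::vector<double>& time_taken,
                              int total_tasks) {
    const std::size_t nvms = time_taken.size();
    std::vector<double> speed(nvms);
    double total_speed = 0.0;
    for (std::size_t i = 0; i < nvms; ++i) {
        speed[i] = tasks_done[i] / time_taken[i];  // tasks per second
        total_speed += speed[i];
    }
    std::vector<int> share(nvms);
    int assigned = 0;
    for (std::size_t i = 0; i < nvms; ++i) {
        share[i] = (int)(total_tasks * speed[i] / total_speed);
        assigned += share[i];
    }
    share[0] += total_tasks - assigned;  // hand the rounding remainder to VM 0
    return share;
}
```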


Fig. 5: % improvement achieved using HPC awareness (homogeneity) compared to the case where 2 VMs were on slower processors and the rest on faster processors.
Fig. 6: CPU timelines of 8 VMs running Jacobi2D (white = idle time).
HPC-Aware VM Placement in Infrastructure Clouds

March 2013 · 383 Reads · 106 Citations

Cloud offerings are increasingly serving workloads with a large variability in terms of compute, storage, and networking resources. Computing requirements (all the way to High Performance Computing or HPC), criticality, communication intensity, memory requirements, and scale can vary widely. Virtual Machine (VM) placement and consolidation for effective utilization of a common pool of resources for efficient execution of such a diverse class of applications in the cloud is challenging, resulting in higher cost and missed Service Level Agreements (SLAs). For HPC, current cloud providers either offer dedicated clouds with dedicated nodes, losing out on the consolidation benefits of virtualization, or use HPC-agnostic cloud scheduling, resulting in poor HPC performance. In this work, we address application-aware allocation of n VM instances (comprising a single job request) to physical hosts from a single pool. We design and implement an HPC-aware scheduler on top of OpenStack Compute (Nova) and also incorporate it in a simulator (CloudSim). Through various optimizations, specifically topology- and hardware-awareness, cross-VM interference accounting, and application-aware consolidation, we demonstrate enhanced VM placements which achieve up to 45% improvement in HPC performance and/or 32% increase in job throughput while limiting the effect of jitter (or noise) to 8%.
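A minimal sketch (not the paper's OpenStack Nova extension) of one element of topology-aware placement: prefer a single host that can hold the whole job, and otherwise fill hosts rack by rack so the job's VMs span as few racks, and hence network hops, as possible. Host, rack, and slot names here are illustrative.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

struct Host { std::string rack; int free_slots; };

// Returns one host index per VM of the job, or an empty vector if the job
// does not fit (in which case hosts is left unmodified).
std::vector<int> place_job(std::vector<Host>& hosts, int nvms) {
    // Best case: one host fits the entire job (no inter-host traffic).
    for (std::size_t h = 0; h < hosts.size(); ++h)
        if (hosts[h].free_slots >= nvms) {
            hosts[h].free_slots -= nvms;
            return std::vector<int>(nvms, (int)h);
        }
    // Otherwise, visit hosts rack by rack (hosts sharing a rack are adjacent
    // after sorting) and fill each until the job fits.
    std::vector<int> order(hosts.size());
    for (std::size_t i = 0; i < hosts.size(); ++i) order[i] = (int)i;
    std::sort(order.begin(), order.end(), [&](int a, int b) {
        return hosts[a].rack < hosts[b].rack;
    });
    std::vector<int> placement;
    std::vector<Host> snapshot = hosts;  // roll back if the job does not fit
    for (int h : order) {
        while (hosts[h].free_slots > 0 && (int)placement.size() < nvms) {
            --hosts[h].free_slots;
            placement.push_back(h);
        }
        if ((int)placement.size() == nvms) return placement;
    }
    hosts = snapshot;
    return {};
}
```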


Citations (70)


... All-atom MD simulations have been performed via the NAMD 3.0 software [54,55] to study the structure and properties of FR-α in complex with the native FA as well as the Pyro-PEG-FA drug. The systems are described with standard amber force field parameters, namely ff14SB [56] is used to represent the protein and TIP3P [57] for water. ...

Reference:

In Silico Study of Active Delivery of a Photodynamic Therapy Drug Targeting the Folate Receptor
Scalable molecular dynamics on CPU and GPU architectures with NAMD
  • Citing Article
  • July 2020

... The computational domain of the STN jet considered is shown in Fig. 1. The equations of mass, momentum, and energy conservation are solved using our in-house code PlasCom2 ([21][22][23]), for a compressible, viscous fluid with an ideal gas equation of state (p = ρRT), Fourier's law of heat conduction (q = −κ∇T), and Newtonian viscous stresses τ = μ[∇u + (∇u)ᵀ] + λ(∇ · u)I, where I is the identity tensor and u is the velocity field. For the jet flow solution shown in Fig. 1, the equations are expressed in computational coordinates ξ = ξ(x), whose mapping to physical coordinates x = x(ξ, t) is one-to-one and onto [24]. ...

Improving the memory access locality of hybrid MPI applications
  • Citing Conference Paper
  • September 2017

... It also leverages the reordering to provide some basic debugging features such as deterministic replay with step-in, step-over, and step-back. MReplayer is a first step in our ongoing efforts to provide bug detection and correction support for models of distributed systems, and realize more sophisticated debugging services as described in [11,20,59]. MReplayer uses existing model-to-model transformation techniques [47] to implement the instrumentation and ensure that the generated code emits the expected trace information at runtime. ...

Scalable replay with partial-order dependencies for message-logging fault tolerance
  • Citing Article
  • November 2014

... Ding et al. [105] proposed a cache hierarchy-aware loop-iterationsto-core mapping strategy by exploiting data reuse and minimizing data dependencies, which results in improved data locality. Lifflander et al. [106] proposed locality-aware optimization at different phases of fork/join programs with optimal load balance based on Cilk and also provides programmatic support for work-stealing schedules which helps in user guidance on data locality. ...

Optimizing Data Locality for Fork/Join Programs Using Constrained Work Stealing
  • Citing Article
  • January 2015

... However, distributed-memory parallel SpTRSV algorithms are even more challenging as communication quickly becomes dominant as the number of processors increases. Existing works include supernodal representation with a 2D process layout [12,13,22,29], non-blocked representation with a 1D process layout [41], multifrontal representation with sparse RHSs [34], and selective inversion-based algorithms for dense systems [32,43]. Among them, communication optimization techniques such as customized communication trees [29], one-sided MPI communication [13] and GPU-initiated communication [12] have been exploited and performance prediction studies such as critical path analysis [12,13], roofline modeling [44] and machine learningbased performance tuning [1,15] have been considered. ...

Structure-Adaptive Parallel Solution of Sparse Triangular Linear Systems
  • Citing Article
  • October 2014

Parallel Computing

... It includes explicit synchronization mechanisms, including a queue abstraction for data transfers. Other examples in this category include dCUDA [16], Groute [5], BlasX [48], G-Charm [44] or Executors [7]. ...

G-Charm: An adaptive runtime system for message-driven parallel applications on hybrid systems
  • Citing Conference Paper
  • June 2013

... In HPC, coordinator/worker is commonly used to coordinate the concurrent execution of processes and tasks across multiple resources and compute nodes. At programming level, coordinator/worker is used with MPI libraries [25], [26] and language extensions like Charm++ [27], [28] or COMPSs [29], to implement large-scale, single-executable applications. At task-level, diverse frameworks use workers coordinated by a coordinator to distribute and then execute tasks across HPC resources. ...

Performance Optimization of a Parallel, Two Stage Stochastic Linear Program

... Temperature-aware workload balancing strategies are mostly focused on computing resources and CPU utilization [31,32]. It has become a traditional wisdom to save energy costs by keeping CPU temperatures under a certain threshold through the dynamic CPU voltage/frequency scaling technique. ...

A ‘Cool’ Load Balancer for Parallel Applications
  • Citing Conference Paper
  • January 2011

... For example, skipping the calculation of small enough elements of near-sparse matrices is a profitable way for performance acceleration. Based on such an idea, Sparse Approximate Matrix Multiply (SpAMM) [27,28] has been proposed for accelerating the decay matrix multiplication. For matrices with exponential decay, existing research has demonstrated the absolute error of SpAMM can be controlled reliably [29]. ...

Solvers for O(N) Electronic Structure in the Strong Scaling Limit

SIAM Journal on Scientific Computing