# Ravi Reddy ManumachuUniversity College Dublin | UCD · School of Computer Science

Ravi Reddy Manumachu

Doctor of Philosophy

Models, algorithms, and tools for performance and energy optimization in heterogeneous clusters, clouds and data centers

## About

72

Publications

10,706

Reads

**How we measure 'reads'**

A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more

888

Citations

Introduction

Ravi Reddy Manumachu received his Bachelor of Technology from I.I.T, Madras in 1997 and Ph.D. from UCD in Computer Science (high performance heterogeneous computing) in 2005. Ravi does research in high performance heterogeneous computing, distributed computing, energy-efficient computing, and sparse matrix computations.

Additional affiliations

Education

January 2001 - June 2005

June 1993 - May 1997

## Publications

Publications (72)

Power and energy efficiency are now critical concerns in extreme-scale high performance scientific comput- ing. Many extreme-scale computing systems today (For example: Top500) have tight integration of multicore CPU processors and accelerators (mix of GPUs, Intel Xeon Phis, or FPGAs) empowering them to provide not just unprecedented computational...

Modern homogeneous parallel platforms are composed of tightly integrated multicore CPUs. This tight integration has resulted in the cores contending for various shared on-chip resources such as Last Level Cache (LLC) and interconnect, leading to resource contention and non-uniform memory access (NUMA). Due to these newly introduced complexities, th...

We study a bi-objective optimization problem, which for a given positive real number $n$ aims to find a vector $X = \{x_0,\cdots,x_{k-1}\} \in \mathbb{R}^{k}_{\ge 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the maximum of $k$ functions of objective type one, $\max_{i=0}^{k-1} f_i(x_i)$, and the sum of $k$ functions of objective type two, $...

Performance and energy are the two most important objectives for optimization on heterogeneous high performance computing platforms. This work studies a mathematical problem motivated by the bi‐objective optimization of data‐parallel applications on such platforms for performance and energy. First, we formulate the problem and present an exact algo...

Energy proportionality (EP) means designing a system that consumes energy proportional to the amount of work it performs. For an EP system, optimizing an application for performance also optimizes the application for total energy. Energy-proportional multicore CPUs and graphics processing units (GPUs) are fundamental to addressing the grand technol...

Performance and energy are the two most important objectives for optimization on heterogeneous HPC platforms. This work studies a mathematical problem motivated by the bi-objective optimization of a matrix multiplication application on such platforms for performance and energy. We formulate the problem and propose an algorithm of polynomial complex...

The energy efficiency in ICT is becoming a grand technological challenge and is now a first-class design constraint in all computing settings. Energy predictive modelling based on performance monitoring counters (PMCs) is the leading method for application-level energy optimization. However, a sound theoretical framework to understand the fundament...

Performance and energy are the two most important objectives for optimization on modern parallel platforms. In this article, we show that moving from single-objective optimization for performance or energy to their bi-objective optimization on heterogeneous processors results in a tremendous increase in the number of optimal solutions (workload dis...

Energy predictive modelling is the leading method for determining the energy consumption of an application. Performance monitoring counters (PMCs) and resource utilizations have been the principal source of model variables primarily due to their high positive correlation with energy consumption. Performance events, however, have come to dominate th...

Accurate and reliable measurement of energy consumption is essential to energy optimization at an application level. Energy predictive modelling using performance monitoring counters (PMCs) emerged as a promising approach, one of the main drivers being its capacity to provide fine-grained component-level breakdown of energy consumption. In this wor...

Accurate energy profiles are essential to the optimization of parallel applications for energy through workload distribution. Since there are many model-based methods available for efficient construction of energy profiles, we need an approach to measure the goodness of the profiles compared with the ground-truth profile, which is usually built by...

Energy is one of the most important objectives for optimization on modern heterogeneous high‐performance computing (HPC) platforms. The tight integration of multicore CPUs with accelerators such as graphical processing units (GPUs) and Xeon Phi coprocessors in these platforms presents several challenges to the optimization of multithreaded data‐par...

Energy is one of the most important objectives for optimization on modern heterogeneous high performance computing (HPC) platforms. The tight integration of multicore CPUs with accelerators in these platforms present several challenges to optimization of multithreaded data-parallel applications for dynamic energy. In this work, we formulate the opt...

Modern high-performance computing platforms, cloud computing systems, and data centers are highly heterogeneous containing nodes where a multicore CPU is tightly integrated with accelerators. An important challenge for energy optimization of hybrid parallel applications on such platforms is how to accurately estimate the energy consumption of appli...

Energy proportionality is the key design goal followed by architects of multicore processors. One of its implications is that optimization of an application for performance will also optimize it for energy.
In this work, we show that energy proportionality does not hold true for multicore processors. This finding creates the opportunity for bi-obj...

Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend f...

Chapter "In Situ Visualization of Performance-Related Data in Parallel CFD Applications" is available open access under a Creative Commons Attribution 4.0 International License via link.springer.com.

Energy proportionality is the key design goal followed by architects of modern multicore CPUs. One of its implications is that optimization of an application for performance will also optimize it for energy. In this work, we show that energy proportionality does not hold true for multicore CPUs. This finding creates the opportunity for bi-objective...

Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in high performance computing (HPC) domain. The problem of finding the optimal shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising of two distinct threads. The first...

Energy predictive modelling using performance monitoring counters (PMCs) has emerged as the leading mainstream approach for modelling the energy consumption of an application. Modern computing platforms such as multicore CPUs provide a large set of PMCs. The programmers , however, can obtain only a small number of PMCs (typically 3-4) during an app...

Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for...

Energy is now a first-class design constraint along with performance in all computing settings. Energy predictive modelling based on performance monitoring counts (PMCs) is the leading method used for prediction of energy consumption during an application execution. We use a model-theoretic approach to formulate the assumed properties of existing m...

Energy of computing is a serious environmental concern and mitigating it is an important technological challenge. Accurate measurement of energy consumption during an application execution is key to application-level energy minimization techniques. There are three popular approaches to providing it: (a) System-level physical measurements using exte...

This survey aims to present the state of the art in analytic communication performance models, providing sufficiently detailed descriptions of particularly noteworthy efforts. Modeling the cost of communications in computer clusters is an important and challenging problem. It provides insights into the design of the communication pattern of paralle...

Many classical methods and algorithms developed when single-core CPUs dominated the parallel computing landscape, are still widely used in the changed multicore world. Two prominent examples are load balancing, which has been one of the main techniques for minimization of the computation time of parallel applications since the beginning of parallel...

Heterogeneity is emerging as one of the most profound and challenging characteristics of today's parallel environments. From the macro level, where networks of distributed computers composed of diverse node architectures are interconnected with potentially heterogeneous networks, to the micro level, where deeper memory hierarchies and various accel...

Data partitioning algorithms aiming to minimize the execution time and the energy of computations in self-adaptable data-parallel applications on modern extreme-scale multicore platforms must address two critical challenges. First, they must take into account the new complexities inherent in these platforms such as severe resource contention and no...

Fast Fourier transform (FFT) is a key routine employed in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on modern multicore platforms is therefore of paramount concern to the high performance computing community. The inherent compl...

Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements...

In this paper, we use multithreaded fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to present a novel model-based parallel computing technique as a very effective and portable method for optimization of scientific multithreaded routines for performance, especially in the current multi...

Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data platforms to facilitate execution of workloads that demand high energy efficiency. They present unique interfaces an...

Affinity-aware thread mapping is a method to effectively exploit cache resources in multicore processors.We propose an affinity and architecture-aware thread mapping technique which maximises data reuse and minimises remote communications and cache coherency costs of multi-threaded applications. It consists of three main components: Data Sharing Es...

Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate easier migration to them of HPC workloads. While virtualization of accelerators in clo...

Performance events or performance monitoring counters (PMCs) have been originally conceived, and widely used to aid low-level performance analysis and tuning. Nevertheless, they were opportunistically adopted for energy predictive modeling owing to lack of a precise energy measurement mechanism in processors, and to address the need of determining...

Performance events or performance monitoring counters (PMCs) are now the dominant predictor variables for modeling energy consumption. Modern hardware processors provide a large set of PMCs. Determination of the best subset of PMCs for energy predictive modeling is a non-trivial task given the fact that all the PMCs can not be determined using a si...

Traditional heterogeneous parallel algorithms, designed for heterogeneous
clusters of workstations, are based on the assumption that the absolute speed
of the processors does not depend on the size of the computational task. This
assumption proved inaccurate for modern and perspective highly heterogeneous
HPC platforms. New class of algorithms base...

Two strategies of distribution of computations can be used to implement parallel solvers for dense linear algebra problems for Heterogeneous Computational Clusters of Multicore Processors (HCoMs). These strategies are called Heterogeneous Process Distribution Strategy (HPS) and Heterogeneous Data Distribution Strategy (HDS). They are not novel and...

The paper presents a new data partitioning algorithm for parallel computing on heterogeneous processors. Like traditional functional partitioning algorithms, the algorithm assumes that the speed of the processors is character- ized by speed functions rather than speed constants. Unlike the traditional algo- rithms, it does not assume the speed func...

The functional performance model (FPM) of heterogeneous proces- sors has proven to be more realistic than the traditional models because it integrates many important features of heterogeneous processors such as the processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. Optimal 1D matrix partitioning algorithms em...

This paper presents a software library, called Heterogeneous PBLAS (HeteroPBLAS), which provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. This library is written on the top of HeteroMPI and PBLAS whose building blocks, the de facto standard kernels for matrix and vector operations (BLAS) and mess...

This paper describes the design and the implementation of parallel routines in the heterogeneous ScaLAPACK library that solve a dense system of linear equations. This library is written on top of HeteroMPI and ScaLAPACK whose building blocks, the de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) an...

This paper presents a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for heterogeneous computational clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step toward...

We present a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the dev...

This paper discusses the design and the implementation of the LU factorization routines included in the Heterogeneous ScaLAPACK library, which is built on top of ScaLAPACK. These routines are used in the factorization and solution of a dense system of linear equations. They are implemented using optimized PBLAS, BLACS and BLAS libraries for heterog...

In this paper, we study the problem of optimal matrix partitioning for parallel dense factorization on heterogeneous processors. First, we outline existing algorithms solving the problem that use a constant performance model of processors, when the relative speed of each processor is represented by a positive constant. We also propose a new efficie...

In this paper, we present a novel algorithm of optimal matrix partitioning for parallel dense matrix factorization on heterogeneous processors based on their constant performance model. We prove the correctness of the algorithm and estimate its complexity. We demonstrate that this algorithm better suits extensions to more complicated, non-constant,...

In this paper, we address the problem of optimal distribu- tion of computational tasks on a network of heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size. We propose a functional perform- ance model of heterogeneous processors that integrates many esse...

The paper presents a tool that ports ScaLAPACK programs designed to run on massively parallel processors to Heterogeneous Networks of Computers. The tool converts ScaLAPACK programs to HeteroMPI programs. The resulting HeteroMPI programs do not aim to extract the maximum performance from a Heterogeneous Networks of Computers but provide an easy and...

The paper presents Heterogeneous MPI (HeteroMPI), an extension of MPI for programming high-performance computations on heterogeneous networks of computers. It allows the application programmer to describe the performance model of the implemented algorithm in a generic form. This model allows the specification of all the main features of the underly...

In this paper, we present an efficient procedure for building a piecewise linear function approximation of the speed function of a processor with hierarchical memory structure. The procedure tries to minimize the experimental time used for building the speed function approximation. We demonstrate the efficiency of our procedure by performing experi...

In this paper, we present a static data distribution strategy called Vari- able Group Block distribution to optimize the execution of factorization of a dense matrix on a network of heterogeneous computers. The distribution is based on a functional performance model of computers, which tries to capture differ- ent aspects of heterogeneity of the co...

The paper presents a performance model that can be used to optimally distribute computations over heterogeneous computers. This model is application-centric representing the speed of each computer by a function of the problem size. This way it takes into account the processor heterogeneity, the heterogeneity of memory structure, and the memory limi...

The paper presents an approach to performance analysis of heterogeneous parallel algorithms. As a typical heterogeneous parallel algorithm is just a modification of some homogeneous one, the idea is to compare the heterogeneous algorithm with its homogeneous prototype, and to assess the heterogeneous modification rather than analyse the algorithm a...

The paper presents a performance model that can be used to optimally schedule arbitrary tasks on a network of heterogeneous computers when there is an upper bound on the size of the task that can be solved by each computer. We formulate a problem of partitioning of an n-element set over p heterogeneous processors using this advanced performance mod...

Summary form only given. The article presents a performance model of a network of heterogeneous computers that takes account of the heterogeneity of memory structure and other architectural differences. Under this model, the speed of each processor is represented by a function of the size of the problem whereas standard models use single numbers to...

The paper presents an approach to the performance analysis of heterogeneous parallel algorithms. As a typical heterogeneous
parallel algorithm is just a modification of some homogeneous one, the idea is to compare the heterogeneous algorithm with
its homogeneous prototype, and to assess the heterogeneous modification rather than to analyse the algo...

The paper presents a classification of mathematical problems encoun- tered during partitioning of data when designing parallel algorithms on networks of heterogeneous computers. We specify problems with known efficient solutions and open problems. Based on this classification, we suggest an API for partition- ing mathematical objects commonly used...

The paper presents Heterogeneous MPI (HMPI), an extension of MPI for programming high-performance computations on heterogeneous networks of computers. It allows the application programmer to describe the performance model of the implemented algorithm. This model allows for all the main features of the underlying parallel algorithm, which have an im...