Ravi Reddy Manumachu

University College Dublin · School of Computer Science

Doctor of Philosophy
Models, algorithms, and tools for performance and energy optimization in heterogeneous clusters, clouds and data centers

About

Publications: 72
Reads: 10,706
Citations: 888
Introduction
Ravi Reddy Manumachu received his Bachelor of Technology from IIT Madras in 1997 and his Ph.D. in Computer Science (high performance heterogeneous computing) from UCD in 2005. His research interests include high performance heterogeneous computing, distributed computing, energy-efficient computing, and sparse matrix computations.
Additional affiliations
January 2022 - present
University College Dublin
Position
  • SEAI Research Fellow (Level 3)
Description
  • Software tools and solutions to improve the energy efficiency of servers and data centers.
Education
January 2001 - June 2005
University College Dublin
Field of study
  • High Performance Heterogeneous Computing, Computer Science
June 1993 - May 1997
Indian Institute of Technology Madras
Field of study
  • Civil Engineering

Publications

Article
Full-text available
Power and energy efficiency are now critical concerns in extreme-scale high performance scientific computing. Many extreme-scale computing systems today (for example, those in the Top500 list) tightly integrate multicore CPU processors and accelerators (a mix of GPUs, Intel Xeon Phis, or FPGAs), empowering them to provide not just unprecedented computational...
Article
Modern homogeneous parallel platforms are composed of tightly integrated multicore CPUs. This tight integration has led the cores to contend for shared on-chip resources such as the last-level cache (LLC) and the interconnect, resulting in resource contention and non-uniform memory access (NUMA). Due to these newly introduced complexities, th...
Preprint
Full-text available
We study a bi-objective optimization problem, which for a given positive real number $n$ aims to find a vector $X = \{x_0,\cdots,x_{k-1}\} \in \mathbb{R}^{k}_{\ge 0}$ such that $\sum_{i=0}^{k-1} x_i = n$, minimizing the maximum of $k$ functions of objective type one, $\max_{i=0}^{k-1} f_i(x_i)$, and the sum of $k$ functions of objective type two, $...
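The formulation in this abstract can be made concrete with a small sketch. The code below is a brute-force illustration (not the paper's algorithm): for k = 2 it enumerates splits x0 + x1 = n on a grid and keeps the Pareto-optimal points of (max of the f_i, sum of the g_i). The function `pareto_front` and the models `f` and `g` are hypothetical stand-ins, not anything from the publication.

```python
# Illustrative brute-force sketch (not the paper's exact algorithm) of the
# bi-objective problem: find x0 + x1 = n minimising both max_i f_i(x_i)
# (objective type one) and sum_i g_i(x_i) (objective type two).
# pareto_front and the functions f, g are hypothetical stand-ins.

def pareto_front(n, f, g, steps=100):
    """Enumerate splits on a grid and keep the non-dominated (t, e) points."""
    candidates = []
    for i in range(steps + 1):
        x0 = n * i / steps
        x1 = n - x0
        t = max(f[0](x0), f[1](x1))   # objective type one: max of the f_i
        e = g[0](x0) + g[1](x1)       # objective type two: sum of the g_i
        candidates.append((t, e, (x0, x1)))
    # Keep a point only if no other point is at least as good in both
    # objectives and strictly different.
    return sorted(c for c in candidates
                  if not any(o[0] <= c[0] and o[1] <= c[1] and o[:2] != c[:2]
                             for o in candidates))

# Hypothetical per-processor time (f) and energy (g) models.
f = [lambda x: 2.0 * x, lambda x: 3.0 * x]
g = [lambda x: 1.0 * x ** 2, lambda x: 0.5 * x]

front = pareto_front(10.0, f, g)
```

With these toy models the time-optimal and energy-optimal splits differ, so the front contains a range of trade-off solutions rather than a single point.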
Article
Performance and energy are the two most important objectives for optimization on heterogeneous high performance computing platforms. This work studies a mathematical problem motivated by the bi‐objective optimization of data‐parallel applications on such platforms for performance and energy. First, we formulate the problem and present an exact algo...
Conference Paper
Energy proportionality (EP) means designing a system that consumes energy proportional to the amount of work it performs. For an EP system, optimizing an application for performance also optimizes the application for total energy. Energy-proportional multicore CPUs and graphics processing units (GPUs) are fundamental to addressing the grand technol...
Article
Performance and energy are the two most important objectives for optimization on heterogeneous HPC platforms. This work studies a mathematical problem motivated by the bi-objective optimization of a matrix multiplication application on such platforms for performance and energy. We formulate the problem and propose an algorithm of polynomial complex...
Article
Full-text available
The energy efficiency in ICT is becoming a grand technological challenge and is now a first-class design constraint in all computing settings. Energy predictive modelling based on performance monitoring counters (PMCs) is the leading method for application-level energy optimization. However, a sound theoretical framework to understand the fundament...
Article
Performance and energy are the two most important objectives for optimization on modern parallel platforms. In this article, we show that moving from single-objective optimization for performance or energy to their bi-objective optimization on heterogeneous processors results in a tremendous increase in the number of optimal solutions (workload dis...
Article
Full-text available
Energy predictive modelling is the leading method for determining the energy consumption of an application. Performance monitoring counters (PMCs) and resource utilizations have been the principal source of model variables primarily due to their high positive correlation with energy consumption. Performance events, however, have come to dominate th...
Article
Full-text available
Accurate and reliable measurement of energy consumption is essential to energy optimization at an application level. Energy predictive modelling using performance monitoring counters (PMCs) emerged as a promising approach, one of the main drivers being its capacity to provide fine-grained component-level breakdown of energy consumption. In this wor...
Article
Full-text available
Accurate energy profiles are essential to the optimization of parallel applications for energy through workload distribution. Since there are many model-based methods available for efficient construction of energy profiles, we need an approach to measure the goodness of the profiles compared with the ground-truth profile, which is usually built by...
Article
Energy is one of the most important objectives for optimization on modern heterogeneous high‐performance computing (HPC) platforms. The tight integration of multicore CPUs with accelerators such as graphical processing units (GPUs) and Xeon Phi coprocessors in these platforms presents several challenges to the optimization of multithreaded data‐par...
Article
Energy is one of the most important objectives for optimization on modern heterogeneous high performance computing (HPC) platforms. The tight integration of multicore CPUs with accelerators in these platforms present several challenges to optimization of multithreaded data-parallel applications for dynamic energy. In this work, we formulate the opt...
Article
Full-text available
Modern high-performance computing platforms, cloud computing systems, and data centers are highly heterogeneous containing nodes where a multicore CPU is tightly integrated with accelerators. An important challenge for energy optimization of hybrid parallel applications on such platforms is how to accurately estimate the energy consumption of appli...
Article
Full-text available
Energy proportionality is the key design goal followed by architects of multicore processors. One of its implications is that optimization of an application for performance will also optimize it for energy. In this work, we show that energy proportionality does not hold true for multicore processors. This finding creates the opportunity for bi-obj...
Article
Full-text available
Modern HPC platforms are highly heterogeneous with tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to address the twin critical concerns of performance and energy efficiency. Due to this inherent characteristic, processing elements contend f...
Preprint
Full-text available
Energy proportionality is the key design goal followed by architects of modern multicore CPUs. One of its implications is that optimization of an application for performance will also optimize it for energy. In this work, we show that energy proportionality does not hold true for multicore CPUs. This finding creates the opportunity for bi-objective...
Conference Paper
Parallel matrix-matrix multiplication (PMM) of dense matrices is a foundational kernel of parallel linear algebra libraries in the high performance computing (HPC) domain. The problem of finding the optimal shape of matrices for efficient execution of PMM on heterogeneous platforms has an engrossing history comprising two distinct threads. The first...
Article
Full-text available
Energy predictive modelling using performance monitoring counters (PMCs) has emerged as the leading mainstream approach for modelling the energy consumption of an application. Modern computing platforms such as multicore CPUs provide a large set of PMCs. Programmers, however, can obtain only a small number of PMCs (typically 3-4) during an app...
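The constraint mentioned here (only 3-4 PMCs obtainable per application run) is easy to illustrate. The sketch below batches a larger counter set into groups that fit the available hardware slots; `batch_pmcs` and the counter names are illustrative, not a real profiling tool or a real counter list.

```python
# Sketch of working around the constraint in the abstract: hardware exposes
# only a few counter registers, so a larger PMC set must be collected over
# multiple application runs. batch_pmcs and the counter names below are
# illustrative, not a real tool or a real processor's counter list.

def batch_pmcs(pmcs, slots=4):
    """Split the requested counters into groups that fit the counter slots."""
    return [pmcs[i:i + slots] for i in range(0, len(pmcs), slots)]

pmcs = ["INSTRUCTIONS", "CYCLES", "LLC_MISSES", "BRANCH_MISSES",
        "L1D_LOADS", "DTLB_MISSES", "STALLED_CYCLES", "CACHE_REFS",
        "PAGE_FAULTS"]
groups = batch_pmcs(pmcs)   # each group requires a separate application run
```

Nine requested counters with four slots yield three groups, i.e. three repeated runs of the same application to assemble one full set of model variables.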
Preprint
Full-text available
Performance and energy are the two most important objectives for optimisation on modern parallel platforms. Latest research demonstrated the importance of workload distribution as a decision variable in the bi-objective optimisation for performance and energy on homogeneous multicore clusters. We show in this work that bi-objective optimisation for...
Preprint
Full-text available
Energy is now a first-class design constraint along with performance in all computing settings. Energy predictive modelling based on performance monitoring counters (PMCs) is the leading method used for prediction of energy consumption during an application execution. We use a model-theoretic approach to formulate the assumed properties of existing m...
Article
Full-text available
Energy of computing is a serious environmental concern and mitigating it is an important technological challenge. Accurate measurement of energy consumption during an application execution is key to application-level energy minimization techniques. There are three popular approaches to providing it: (a) System-level physical measurements using exte...
Article
This survey aims to present the state of the art in analytic communication performance models, providing sufficiently detailed descriptions of particularly noteworthy efforts. Modeling the cost of communications in computer clusters is an important and challenging problem. It provides insights into the design of the communication pattern of paralle...
Article
Many classical methods and algorithms developed when single-core CPUs dominated the parallel computing landscape are still widely used in the changed multicore world. Two prominent examples are load balancing, which has been one of the main techniques for minimization of the computation time of parallel applications since the beginning of parallel...
Article
Heterogeneity is emerging as one of the most profound and challenging characteristics of today's parallel environments. From the macro level, where networks of distributed computers composed of diverse node architectures are interconnected with potentially heterogeneous networks, to the micro level, where deeper memory hierarchies and various accel...
Article
Data partitioning algorithms aiming to minimize the execution time and the energy of computations in self-adaptable data-parallel applications on modern extreme-scale multicore platforms must address two critical challenges. First, they must take into account the new complexities inherent in these platforms such as severe resource contention and no...
Article
Fast Fourier transform (FFT) is a key routine employed in application domains such as molecular dynamics, computational fluid dynamics, signal processing, image processing, and condition monitoring systems. Its performance on modern multicore platforms is therefore of paramount concern to the high performance computing community. The inherent compl...
Article
Modern HPC platforms have become highly heterogeneous owing to tight integration of multicore CPUs and accelerators (such as Graphics Processing Units, Intel Xeon Phis, or Field-Programmable Gate Arrays) empowering them to maximize the dominant objectives of performance and energy efficiency. Due to this inherent characteristic, processing elements...
Preprint
Full-text available
In this paper, we use multithreaded fast Fourier transforms provided in three highly optimized packages, FFTW-2.1.5, FFTW-3.3.7, and Intel MKL FFT, to present a novel model-based parallel computing technique as a very effective and portable method for optimization of scientific multithreaded routines for performance, especially in the current multi...
Preprint
Full-text available
Hardware accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors (PHIs), and Field-Programmable Gate Arrays (FPGAs) are now ubiquitous in extreme-scale high performance computing (HPC), cloud, and Big data platforms to facilitate execution of workloads that demand high energy efficiency. They present unique interfaces an...
Article
Full-text available
Affinity-aware thread mapping is a method to effectively exploit cache resources in multicore processors. We propose an affinity and architecture-aware thread mapping technique which maximises data reuse and minimises remote communications and cache coherency costs of multi-threaded applications. It consists of three main components: Data Sharing Es...
Article
Full-text available
Cloud environments today are increasingly featuring hybrid nodes containing multicore CPU processors and a diverse mix of accelerators such as Graphics Processing Units (GPUs), Intel Xeon Phi co-processors, and Field-Programmable Gate Arrays (FPGAs) to facilitate easier migration to them of HPC workloads. While virtualization of accelerators in clo...
Presentation
Full-text available
Performance events or performance monitoring counters (PMCs) were originally conceived, and are widely used, to aid low-level performance analysis and tuning. Nevertheless, they were opportunistically adopted for energy predictive modeling owing to the lack of a precise energy measurement mechanism in processors, and to address the need of determining...
Article
Full-text available
Performance events or performance monitoring counters (PMCs) are now the dominant predictor variables for modeling energy consumption. Modern hardware processors provide a large set of PMCs. Determination of the best subset of PMCs for energy predictive modeling is a non-trivial task given the fact that all the PMCs cannot be determined using a si...
Article
Traditional heterogeneous parallel algorithms, designed for heterogeneous clusters of workstations, are based on the assumption that the absolute speed of the processors does not depend on the size of the computational task. This assumption proved inaccurate for modern and prospective highly heterogeneous HPC platforms. A new class of algorithms base...
Conference Paper
Full-text available
Two strategies of distribution of computations can be used to implement parallel solvers for dense linear algebra problems for Heterogeneous Computational Clusters of Multicore Processors (HCoMs). These strategies are called Heterogeneous Process Distribution Strategy (HPS) and Heterogeneous Data Distribution Strategy (HDS). They are not novel and...
Conference Paper
The paper presents a new data partitioning algorithm for parallel computing on heterogeneous processors. Like traditional functional partitioning algorithms, the algorithm assumes that the speed of the processors is characterized by speed functions rather than speed constants. Unlike the traditional algorithms, it does not assume the speed func...
Conference Paper
The functional performance model (FPM) of heterogeneous processors has proven to be more realistic than the traditional models because it integrates many important features of heterogeneous processors such as the processor heterogeneity, the heterogeneity of memory structure, and the effects of paging. Optimal 1D matrix partitioning algorithms em...
Article
Full-text available
This paper presents a software library, called Heterogeneous PBLAS (HeteroPBLAS), which provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. This library is written on top of HeteroMPI and PBLAS whose building blocks, the de facto standard kernels for matrix and vector operations (BLAS) and mess...
Conference Paper
Full-text available
This paper describes the design and the implementation of parallel routines in the heterogeneous ScaLAPACK library that solve a dense system of linear equations. This library is written on top of HeteroMPI and ScaLAPACK whose building blocks, the de facto standard kernels for matrix and vector operations (BLAS and its parallel counterpart PBLAS) an...
Conference Paper
Full-text available
This paper presents a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for heterogeneous computational clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step toward...
Technical Report
Full-text available
We present a package, called Heterogeneous PBLAS (HeteroPBLAS), which is built on top of PBLAS and provides optimized parallel basic linear algebra subprograms for Heterogeneous Computational Clusters. We present the user interface and the software hierarchy of the first research implementation of HeteroPBLAS. This is the first step towards the dev...
Conference Paper
Full-text available
This paper discusses the design and the implementation of the LU factorization routines included in the Heterogeneous ScaLAPACK library, which is built on top of ScaLAPACK. These routines are used in the factorization and solution of a dense system of linear equations. They are implemented using optimized PBLAS, BLACS and BLAS libraries for heterog...
Article
In this paper, we study the problem of optimal matrix partitioning for parallel dense factorization on heterogeneous processors. First, we outline existing algorithms solving the problem that use a constant performance model of processors, when the relative speed of each processor is represented by a positive constant. We also propose a new efficie...
Conference Paper
In this paper, we present a novel algorithm of optimal matrix partitioning for parallel dense matrix factorization on heterogeneous processors based on their constant performance model. We prove the correctness of the algorithm and estimate its complexity. We demonstrate that this algorithm better suits extensions to more complicated, non-constant,...
Article
In this paper, we address the problem of optimal distribution of computational tasks on a network of heterogeneous computers when one or more tasks do not fit into the main memory of the processors and when relative speeds vary with the problem size. We propose a functional performance model of heterogeneous processors that integrates many esse...
Conference Paper
The paper presents a tool that ports ScaLAPACK programs designed to run on massively parallel processors to Heterogeneous Networks of Computers. The tool converts ScaLAPACK programs to HeteroMPI programs. The resulting HeteroMPI programs do not aim to extract the maximum performance from a Heterogeneous Network of Computers but provide an easy and...
Article
The paper presents Heterogeneous MPI (HeteroMPI), an extension of MPI for programming high-performance computations on heterogeneous networks of computers. It allows the application programmer to describe the performance model of the implemented algorithm in a generic form. This model allows the specification of all the main features of the underly...
Conference Paper
Full-text available
In this paper, we present an efficient procedure for building a piecewise linear function approximation of the speed function of a processor with hierarchical memory structure. The procedure tries to minimize the experimental time used for building the speed function approximation. We demonstrate the efficiency of our procedure by performing experi...
Conference Paper
In this paper, we present a static data distribution strategy called Variable Group Block distribution to optimize the execution of factorization of a dense matrix on a network of heterogeneous computers. The distribution is based on a functional performance model of computers, which tries to capture different aspects of heterogeneity of the co...
Article
Full-text available
The paper presents a performance model that can be used to optimally distribute computations over heterogeneous computers. This model is application-centric representing the speed of each computer by a function of the problem size. This way it takes into account the processor heterogeneity, the heterogeneity of memory structure, and the memory limi...
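The idea behind this functional performance model can be sketched in a few lines: represent each computer's speed as a function of problem size and pick the partition that equalises the per-computer execution times. The function `balance_two` and the speed functions below are hypothetical examples under simplifying assumptions (two processors, monotone time functions), not the paper's algorithm or measured profiles.

```python
# Hedged sketch of the functional performance model idea: each processor's
# speed is a function s_i(x) of problem size x rather than a constant, and a
# balanced partition equalises the execution times x_i / s_i(x_i).
# balance_two and the speed functions below are hypothetical, not measured.

def balance_two(n, s0, s1, iters=60):
    """Bisection on x0 in [0, n] so that x0/s0(x0) == (n - x0)/s1(n - x0)."""
    lo, hi = 0.0, n
    mid = n / 2
    for _ in range(iters):
        mid = (lo + hi) / 2
        t0 = mid / s0(mid)              # time on processor 0
        t1 = (n - mid) / s1(n - mid)    # time on processor 1
        if t0 < t1:
            lo = mid    # processor 0 finishes earlier: give it more work
        else:
            hi = mid
    return mid, n - mid

# Hypothetical speeds (elements/second): s0 degrades once the workload
# exceeds its memory capacity, capturing the paging effect; s1 is constant.
s0 = lambda x: 100.0 if x < 600 else 100.0 * 600 / x
s1 = lambda x: 50.0

x0, x1 = balance_two(1000.0, s0, s1)
```

Note that with a constant-speed model processor 0 would simply get two thirds of the work; the size-dependent speed function shifts the balanced point because its speed drops on large workloads.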
Article
The paper presents an approach to performance analysis of heterogeneous parallel algorithms. As a typical heterogeneous parallel algorithm is just a modification of some homogeneous one, the idea is to compare the heterogeneous algorithm with its homogeneous prototype, and to assess the heterogeneous modification rather than analyse the algorithm a...
Conference Paper
The paper presents a performance model that can be used to optimally schedule arbitrary tasks on a network of heterogeneous computers when there is an upper bound on the size of the task that can be solved by each computer. We formulate a problem of partitioning of an n-element set over p heterogeneous processors using this advanced performance mod...
Conference Paper
The article presents a performance model of a network of heterogeneous computers that takes account of the heterogeneity of memory structure and other architectural differences. Under this model, the speed of each processor is represented by a function of the size of the problem whereas standard models use single numbers to...
Conference Paper
The paper presents an approach to the performance analysis of heterogeneous parallel algorithms. As a typical heterogeneous parallel algorithm is just a modification of some homogeneous one, the idea is to compare the heterogeneous algorithm with its homogeneous prototype, and to assess the heterogeneous modification rather than to analyse the algo...
Conference Paper
The paper presents a classification of mathematical problems encountered during partitioning of data when designing parallel algorithms on networks of heterogeneous computers. We specify problems with known efficient solutions and open problems. Based on this classification, we suggest an API for partitioning mathematical objects commonly used...
Conference Paper
The paper presents Heterogeneous MPI (HMPI), an extension of MPI for programming high-performance computations on heterogeneous networks of computers. It allows the application programmer to describe the performance model of the implemented algorithm. This model allows for all the main features of the underlying parallel algorithm, which have an im...