Aleksandar Ilic
Instituto Superior Técnico (IST), University of Lisbon · Department of Electrical and Computer Engineering (DEEC)
PhD in Electrical and Computer Engineering
About
85 Publications
11,107 Reads
893 Citations
Additional affiliations
January 2008 - present
December 2014 - present
Instituto Superior Técnico (IST), University of Lisbon
Position: Professor (Assistant)
Education
October 2008 - February 2014
Publications (85)
The growing need for inference on edge devices brings with it a necessity for efficient hardware, optimized for particular computational kernels, such as Sparse Matrix-Vector Multiplication (SpMV). With the RISC-V Instruction Set Architecture (ISA) providing unprecedented freedom to hardware designers, there is now a greater opportunity to tailor t...
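Since this entry centers on the SpMV kernel, a minimal sketch of SpMV in the common Compressed Sparse Row (CSR) format may help fix ideas; the function and array names below are illustrative, not taken from the paper.

```c
#include <stddef.h>

/* Minimal sketch of Sparse Matrix-Vector Multiplication (y = A*x)
 * in Compressed Sparse Row (CSR) format:
 *   row_ptr[i]..row_ptr[i+1] indexes the nonzeros of row i,
 *   col_idx[k] is the column of the k-th nonzero, val[k] its value. */
void spmv_csr(size_t n_rows,
              const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < n_rows; i++) {
        double sum = 0.0;
        /* The irregular, data-dependent accesses to x in this loop
         * are what make SpMV a natural target for ISA and hardware
         * specialization. */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];
        y[i] = sum;
    }
}
```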
The minimum feedback vertex set problem consists of finding the minimum set of vertices whose removal makes the graph acyclic. This is a well-known NP-hard optimization problem with applications in various fields, such as VLSI chip design, bioinformatics and transaction processing. In this paper, we explore the complementary pro...
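For concreteness, a feedback vertex set is exactly a vertex set whose removal leaves the directed graph without cycles; the sketch below shows that feasibility check via DFS three-coloring. It is illustrative background, not the paper's search strategy, and all identifiers are hypothetical.

```c
#include <stdbool.h>

#define MAXV 64

/* DFS three-coloring cycle check on the graph that remains after
 * deleting the candidate set: 0 = unvisited, 1 = on the current
 * DFS path, 2 = finished. */
static bool has_cycle(int n, const bool adj[MAXV][MAXV],
                      const bool removed[MAXV], int color[MAXV], int u)
{
    color[u] = 1;
    for (int v = 0; v < n; v++) {
        if (!adj[u][v] || removed[v])
            continue;
        if (color[v] == 1)              /* back edge: cycle found */
            return true;
        if (color[v] == 0 && has_cycle(n, adj, removed, color, v))
            return true;
    }
    color[u] = 2;
    return false;
}

/* A set S is a feedback vertex set iff no cycle survives its removal. */
bool is_feedback_vertex_set(int n, const bool adj[MAXV][MAXV],
                            const bool removed[MAXV])
{
    int color[MAXV] = {0};
    for (int u = 0; u < n; u++)
        if (!removed[u] && color[u] == 0 &&
            has_cycle(n, adj, removed, color, u))
            return false;
    return true;
}
```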
Background
In the pursuit of a better understanding of biodiversity, evolutionary biologists rely on the study of phylogenetic relationships to illustrate the course of evolution. The relationships among natural organisms, depicted in the shape of phylogenetic trees, not only help to understand evolutionary history but also have a wide range of add...
Developments in Genome-Wide Association Studies have led to the increasing notion that future healthcare techniques will be personalized to the patient, by relying on genetic tests to determine the risk of developing a disease. To this end, the detection of gene interactions that cause complex diseases constitutes an important application. Similarl...
Large-scale hydrological models simulate watershed processes with applications in water resources, climate change, land use, and forecast systems. The quality of the simulations mainly depends on calibrating optimal sets of watershed parameters, a time-consuming task that demands significant computational resources due to repeated simulations. This work a...
The substitution of nucleotides at specific positions in the genome of a population, known as single-nucleotide polymorphisms (SNPs), has been correlated with a number of important diseases. Complex conditions such as Alzheimer's disease or Crohn's disease are significantly linked to genetics when the impact of multiple SNPs is considered. SNPs oft...
A Single Nucleotide Polymorphism (SNP) is a DNA variation occurring when a single nucleotide differs between individuals of a species. Some conditions can be explained with a single SNP. However, the combined effect of multiple SNPs, known as epistasis, makes it possible to better correlate genotype with a number of complex traits. We propose a highly optimiz...
Epistasis detection represents a fundamental problem in biomedicine for understanding the reasons behind the occurrence of complex phenotypic traits (diseases) across a population of individuals. Exhaustively examining all possible interactions of multiple Single-Nucleotide Polymorphisms provides the most reliable way to identify accurate solutions, but it i...
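At its core, the exhaustive strategy referred to in these entries reduces to scoring every SNP pair from genotype/phenotype contingency counts. Below is a minimal sketch of that counting step, assuming the conventional 0/1/2 genotype encoding; it is not the optimized kernel of any of the papers above.

```c
#include <stddef.h>
#include <string.h>

/* Counting step of an exhaustive pairwise epistasis search
 * (illustrative sketch). Genotypes use the conventional 0/1/2
 * encoding, so each SNP pair yields one 3x3 contingency table
 * per phenotype class. */
void count_pair(size_t n_samples,
                const unsigned char *snp_a,  /* genotypes of SNP A (0/1/2) */
                const unsigned char *snp_b,  /* genotypes of SNP B (0/1/2) */
                const unsigned char *pheno,  /* 0 = control, 1 = case */
                size_t table[2][3][3])
{
    memset(table, 0, 2 * 3 * 3 * sizeof(size_t));
    for (size_t s = 0; s < n_samples; s++)
        table[pheno[s]][snp_a[s]][snp_b[s]]++;
    /* A score (e.g., chi-squared or mutual information) is then
     * computed from the table; the exhaustive search repeats this
     * for all n*(n-1)/2 SNP pairs, hence the need for accelerators. */
}
```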
In the coming exascale era, the complexity of modern applications and hardware resources imposes significant challenges for boosting the efficiency via execution fine-tuning. To abstract this complexity in an intuitive way, recent application analysis tools rely on insightful modeling, e.g., Intel® Advisor with Cache-aware Roofline Model. However,...
Multiple studies provide evidence of the impact of certain gene interactions on the occurrence of diseases. Due to the complexity of genotype–phenotype relationships, highly efficient algorithmic strategies are required to successfully identify high-order interactions according to different evaluation criteria. This work inve...
In the quest for exascale computing, energy-efficiency is a fundamental goal in high-performance computing systems, typically achieved via dynamic voltage and frequency scaling (DVFS). However, this type of mechanism relies on having accurate methods of predicting the performance and power/energy consumption of such systems. Unlike previous works i...
Dynamic voltage and frequency scaling (DVFS) is a popular technique to improve the energy-efficiency of high-performance computing systems. It allows placing the devices into lower performance states when the computational demands are lower, opening the possibility for significant power/energy savings. This work presents a GPU power consumption mod...
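As background for why DVFS saves power, the first-order CMOS relation below is commonly assumed; it is a textbook approximation, not the fitted GPU model these works propose:

```latex
% First-order CMOS power model that motivates DVFS (textbook
% approximation, not the papers' fitted GPU model):
P \;\approx\; \underbrace{\alpha\, C\, V^{2} f}_{\text{dynamic}}
          \;+\; \underbrace{V \cdot I_{\mathrm{leak}}}_{\text{static}},
\qquad f \propto V \;\Rightarrow\; P_{\mathrm{dyn}} \propto f^{3}
```

Since the achievable frequency scales roughly with supply voltage, dynamic power grows close to cubically with frequency, which is the headroom DVFS exploits when computational demands are low.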
Modern heterogeneous computing architectures, which couple multi-core CPUs with discrete many-core GPUs (or other specialized hardware accelerators), enable unprecedented peak performance and energy efficiency levels. Unfortunately, though, developing applications that can take full advantage of the potential of heterogeneous systems is a notorious...
The Hungarian algorithm solves the linear assignment problem in polynomial time. A GPU/CUDA implementation of this algorithm is proposed. GPUs are massively parallel machines. In this implementation, the alternating path search phase of the algorithm is distributed across several blocks in a way that minimizes global device synchronization. This phase is ve...
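To pin down the problem being solved: the linear assignment problem seeks a one-to-one row-to-column mapping of minimum total cost. The brute-force baseline below is purely illustrative, O(n!) versus the Hungarian algorithm's polynomial time, and uses hypothetical names.

```c
#include <stdbool.h>
#include <float.h>

#define N 4  /* illustrative problem size */

/* Try every permutation of columns over rows and keep the
 * cheapest total cost. This only pins down what is computed;
 * the Hungarian algorithm solves the same problem in O(n^3). */
static void search(const double cost[N][N], bool used[N],
                   int row, double acc, double *best)
{
    if (row == N) {
        if (acc < *best)
            *best = acc;
        return;
    }
    for (int col = 0; col < N; col++) {
        if (used[col])
            continue;
        used[col] = true;
        search(cost, used, row + 1, acc + cost[row][col], best);
        used[col] = false;
    }
}

double assignment_bruteforce(const double cost[N][N])
{
    bool used[N] = {false};
    double best = DBL_MAX;
    search(cost, used, 0, 0.0, &best);
    return best;
}
```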
General-purpose graphics processing units (GPGPUs) have gained much popularity in scientific computing to speed up computationally intensive workloads. Resource allocation, in terms of power and subcarrier assignment, is one of the challenging problems in current wireless standards due to its high computational complexity. The Hungarian a...
NUMA platforms and emerging memory architectures with on-package high-bandwidth memories bring new opportunities and challenges to bridge the gap between computing power and memory performance. Heterogeneous memory machines feature several performance trade-offs, depending on the kind of memory used, when writing or reading it. Finding memory performa...
The interest in developing cognitive-aware systems, especially for vision applications based on artificial neural networks, has grown exponentially in recent years. While high-performance systems are key to the success of current Convolutional Neural Network (CNN) implementations, there is a trend to bring these capabilities to embedded real-time...
When optimizing or porting applications to new architectures, a preliminary characterization is necessary to exploit the maximum computing power of the employed devices. Profiling tools are available for numerous architectures and programming models, making it easier to spot possible bottlenecks. However, for a better interpretation of the collecte...
The increasing importance of GPUs as high-performance accelerators, together with the power and energy constraints of computing systems, makes it fundamental to develop techniques for maximizing the energy efficiency of GPGPU applications. Among several potential techniques, dynamic voltage and frequency scaling (DVFS) stands out as one of the most promising app...
In order to fulfill the needs of modern applications, computing systems are becoming more powerful, heterogeneous and complex. NUMA platforms and emerging high-bandwidth memories offer new opportunities for performance improvements. However, they also increase hardware and software complexity, thus making application performance analysis and optimization an even...
In the High Efficiency Video Coding (HEVC) standard, multiple decoding modules have been designed to take advantage of parallel processing. In particular, the HEVC in-loop filters (i.e., the deblocking filter and sample adaptive offset) were conceived to be exploited by parallel architectures. However, the type of the offered parallelism mostly sui...
The High Efficiency Video Coding (HEVC) standard provides a higher compression efficiency than other video coding standards, but at the cost of an increased computational load, which makes it hard to achieve real-time encoding/decoding of ultra-high-resolution and high-quality video sequences. Graphics Processing Units (GPUs) are known to provide massive p...
In this paper, a novel game-theory-based approach for task scheduling on emerging heterogeneous embedded systems is proposed. It relies on the auction concept to assign tasks to players, where players compete against each other by bidding for the tasks in order to acquire them. To ensure the feasibility of game rounds, a set of different utility fu...
To foster energy efficiency in current and future multi-core processors, the benefits and trade-offs of a large set of optimization solutions must be evaluated. For this purpose, it is often crucial to consider how key micro-architecture aspects, such as accessing different memory levels and functional units, affect the attainable power and ene...
The availability of heterogeneous CPU+GPU systems has opened the door to new opportunities for the development of parallel solutions to tackle complex biological problems. The reconstruction of evolutionary histories among species represents a grand computational challenge, which can be addressed by exploiting this kind of hardware design. In this...
The high compression efficiency that is provided by the High Efficiency Video Coding (HEVC) standard comes at the cost of a significant increase of the computational load at the decoder. Such an increased burden is a limiting factor for accomplishing real-time decoding, especially for high-definition video sequences (e.g., Ultra HD 4K). In this scenario,...
In this paper, we propose an energy-aware task management mechanism designed for forward-in-time algorithms running on multicore central processing units (CPUs), where the stencil-based multidimensional positive definite advection transport algorithm is one representative example. This mechanism is based on the dynamic voltage...
The increased adoption of Graphics Processing Units (GPUs) to accelerate modern computational intensive applications, together with the strict power and energy constraints of many computing systems, has pushed for the development of efficient procedures to exploit dynamic voltage and frequency scaling (DVFS) techniques in GPUs. Although previous wo...
The added encoding efficiency and visual quality offered by the High Efficiency Video Coding (HEVC) standard is attained at the cost of a significant computational complexity of both the encoder and the decoder. In particular, the considerable amount of intra prediction modes that are now considered by this standard, together with the increased com...
In order to tackle real-time encoding of high-definition video sequences on heterogeneous desktop systems, a collaborative CPU+GPU framework for inter-loop video encoding is proposed herein. The proposed framework considers the overall complexity of the collaborative inter-loop encoding as a unified optimization problem. Several functional blocks...
In this article, we propose a general framework for fine-grain application-aware task management in heterogeneous embedded platforms, which allows integration of different mechanisms for an efficient resource utilization, frequency scaling, and task migration. The proposed framework incorporates several components for accurate runtime monitoring by...
The inter prediction decoding is one of the most time consuming modules in modern video decoders, which may significantly limit their real-time capabilities. To circumvent this issue, an efficient acceleration of the HEVC inter prediction decoding module is proposed, by offloading the involved workload to GPU devices. The proposed approach aims at...
In this paper we propose an efficient method for collaborative H.264/AVC inter-prediction in heterogeneous CPU+GPU systems. In order to minimize the overall encoding time, the proposed method provides stable and balanced load distribution of the most computationally demanding video encoding modules, by relying on accurate and dynamically built func...
Due to the dissemination of smartphones and tablets, a constant complexity growth can be observed in both embedded systems and mobile applications. However, this results in an increase in energy consumption. To guarantee longer battery life cycles, it is fundamental to develop system-level strategies that guarantee the applications' requi...
Led by the high-performance computing potential of modern heterogeneous desktop systems and the predominance of video content in general applications, we propose herein an autonomous unified video encoding framework for hybrid multi-core CPU and multi-GPU platforms. To fully exploit the capabilities of these platforms, the proposed framework integrates si...
Accurate characterization of modern systems and applications requires run-time and simultaneous assessment of several execution-related parameters. Although hardware monitoring facilities in modern multi-cores allow low-level profiling, it is not always easy to convert the acquired data into insightful information. For this, a low-overhead monitori...
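As a concrete example of such a built-in facility, Linux exposes hardware counters through the perf_event_open system call; the minimal single-counter sketch below uses that standard interface and is not the low-overhead monitoring tool the paper proposes.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Count retired instructions of the calling thread via the Linux
 * perf_event_open interface: the kind of raw hardware-monitoring
 * facility the text refers to. */
int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.type = PERF_TYPE_HARDWARE;
    attr.size = sizeof(attr);
    attr.config = PERF_COUNT_HW_INSTRUCTIONS;
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    /* pid = 0, cpu = -1: measure this process on any CPU */
    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    volatile double x = 0.0;            /* workload to measure */
    for (int i = 0; i < 1000000; i++) x += i * 0.5;

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count = 0;
    read(fd, &count, sizeof(count));
    printf("instructions: %llu\n", (unsigned long long)count);
    close(fd);
    return 0;
}
```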
HPC platforms are getting increasingly heterogeneous and hierarchical. The main source of heterogeneity in many individual computing nodes is due to the utilization of specialized accelerators such as GPUs alongside general purpose CPUs. Heterogeneous many-core processors will be another source of intra-node heterogeneity in the near future. As...
This chapter proposes several algorithms for efficient balancing of divisible load applications in order to fully exploit the capabilities of heterogeneous multicore CPU and multi-graphics processing unit (GPU) environments for collaborative processing. It focuses on efficient load-balancing and scheduling algorithms for discretely divisible load (...
The Roofline model graphically represents the attainable upper bound performance of a computer architecture. This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for applicati...
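For reference, the bound the original Roofline model draws, and which the cache-aware extension refines, is:

```latex
% Attainable performance F_a at arithmetic intensity I
% (flops per byte of memory traffic), given peak compute F_peak
% and peak memory bandwidth B_peak:
F_a(I) \;=\; \min\bigl(F_{\mathrm{peak}},\; B_{\mathrm{peak}} \cdot I\bigr)
```

The cache-aware variant introduces one such bandwidth roof per level of the memory hierarchy instead of a single DRAM roof.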
The high computational demands and overall encoding complexity make the processing of high-definition video sequences hard to achieve in real time. In this manuscript, we target an efficient parallelization and RD performance analysis of H.264/AVC inter-loop modules and their collaborative execution in hybrid multi-core CPU and multi-GPU system...
Transparent application acceleration in heterogeneous systems can be performed by automatically intercepting shared libraries calls and by efficiently orchestrating the execution across all processing devices. To fully exploit the available computing power, the intercepted calls must be replaced with faster accelerator-based implementations and int...
Accurate on-the-fly characterization of application behaviour requires assessing a set of execution-related parameters at runtime, including performance, power and energy consumption. These parameters can be obtained by relying on hardware measurement facilities built into modern multi-core architectures, such as performance and energy counters. Howe...
Hierarchical levels of heterogeneity exist in many modern high-performance clusters, in the form of heterogeneity between computing nodes and within a node with the addition of specialized accelerators, such as GPUs. To achieve high performance of scientific applications on these platforms, it is necessary to perform load balancing. In this paper we...
In this paper, we propose an algorithm for efficient divisible load balancing across all processing devices available in a heterogeneous desktop system. The proposed algorithm achieves simultaneous load balancing at different execution levels, namely between execution subdomains defined by several processing devices, and between devices...
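As background, the classical divisible load split that such algorithms start from assigns work in proportion to device throughput; this is textbook material, not the multi-level algorithm proposed here.

```latex
% Proportional split of a divisible load of size N over p devices
% with sustained throughputs s_1, ..., s_p: device i receives
\alpha_i \;=\; \frac{s_i}{\sum_{j=1}^{p} s_j},
\qquad n_i \;=\; \alpha_i N,
\qquad \frac{n_i}{s_i} \;=\; \frac{N}{\sum_{j=1}^{p} s_j}
\;\;\text{for every device } i
```

Equalizing per-device completion times in this way is the balance condition that the hierarchical, memory-aware schemes in these papers refine.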
This paper investigates the problem of scheduling discretely divisible applications in highly heterogeneous distributed platforms which deploy modern desktop systems with limited memory as computing nodes. We propose an algorithm for hierarchical load balancing at both inter- and intra-node platform levels which relies on realistic performance mode...
This paper addresses the problem of scheduling discretely divisible applications in heterogeneous desktop systems with limited memory by relying on realistic performance models for computation and communication, through bidirectional asymmetric full-duplex buses. We propose an algorithm for multi-installment processing with multi-distributions that...
Modern commodity desktop computers equipped with multi-core Central Processing Units (CPUs) and specialized but programmable co-processors are capable of providing a remarkable computational performance. However, approaching this performance is not a trivial task as it requires the coordination of architecturally different devices for cooperative...
Current desktop computers are heterogeneous systems that integrate different types of processors. For example, general-purpose processors and GPUs not only have different characteristics but also adopt diverse programming models. Despite these differences, data parallelism is exploited for both types of processors, by using application processin...
When analyzing the neuronal code, neuroscientists usually perform extra-cellular recordings of neuronal responses (spikes). Since the size of the microelectrodes used to perform these recordings is much larger than the size of the cells, responses from multiple neurons are recorded by each microelectrode. Thus, the obtained response must be classi...
Nowadays, commodity computers are complex heterogeneous systems that provide a huge amount of computational power. However, to take advantage of this power we have to orchestrate the use of processing units with different characteristics. Such distributed memory systems make use of relatively slow interconnection networks, such as system buses. The...
Computer architecture simulation and modeling require a huge amount of time and resources, not only for the simulation itself but also regarding the configuration and submission procedures. A quite common simulation toolset (SimpleScalar) has been used to model a variety of platforms ranging from simple unpipelined processors to detailed dynamica...