Christos Kotselidis
The University of Manchester · School of Computer Science
94 Publications
20,361 Reads
837 Citations
Publications (94)
This chapter aims to familiarize readers with the definition of a programming model for heterogeneous hardware, while also diving into the architecture of existing programming models. OpenCL and CUDA are used as drivers to comprehend the fundamentals of the programming models, while also mentioning the similarities and differences that they present...
This chapter provides the necessary background on computer architecture in order to understand how hardware accelerators are programmed and execute code. While this book focuses on high-level programming languages and managed runtime systems, programming accelerators usually requires deep understanding of the architecture underneath. This chapter f...
The last chapter described the challenges posed by programming hardware accelerators from managed programming languages, such as Java, C#, Python, or JavaScript. Those challenges have been attributed to the differences in memory management, concurrency models, and the need to interface with native code for utilizing the heterogeneous programming mo...
This chapter has two objectives. The first one is to provide an overview of managed runtime environments, their architecture, and main functionalities. The second one is to provide a discussion regarding the challenges that heterogeneous programming models for hardware acceleration pose to managed runtime environments.
In this article, we present TornadoQSim, an open-source quantum circuit simulation framework implemented in Java. The proposed framework has been designed to be modular and easily expandable for accommodating different user-defined simulation backends, such as the unitary matrix simulation technique. Furthermore, TornadoQSim features the ability to...
This paper presents the Beehive SPIR-V Toolkit; a framework that can automatically generate a Java composable and functional library for dynamically building SPIR-V binary modules. The Beehive SPIR-V Toolkit can be used by optimizing compilers and runtime systems to generate and validate SPIR-V binary modules from managed runtime systems, such as t...
Ray tracing has been typically known as a graphics rendering method capable of producing highly realistic imagery and visual effects generated by computers. More recently the performance improvements in Graphics Processing Units (GPUs) have enabled developers to exploit sufficient computing power to build a fair amount of ray tracing applications w...
The ever-increasing demand for high-performance Big Data analytics and data processing has paved the way for heterogeneous hardware accelerators, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs), to be integrated into modern Big Data platforms. Currently, this integration comes at the cost of programmability sinc...
During the last decade, managed runtime systems have been constantly evolving to become capable of exploiting underlying hardware accelerators, such as GPUs and FPGAs. Regardless of the programming language and their corresponding runtime systems, the majority of the work has been focusing on the compiler front trying to tackle the challenging task...
Cache-coherent non-uniform memory access (ccNUMA) systems enable parallel applications to scale-up to thousands of cores and many terabytes of main memory. However, since remote accesses come at an increased cost, extra measures are necessitated to scale the applications to high core-counts and process far greater amounts of data than a typical ser...
In this article, we present FastPath_MP, a novel low-overhead and energy-efficient storage multi-path architecture that leverages FPGAs to operate transparently to the main processor and improve the performance and energy efficiency of accessing storage devices. We prototyped FastPath_MP on both Arm-FPGA Zynq 7000 SoC and Zynq UltraScale+ MPSoC and...
In recent years, heterogeneous computing has emerged as the vital way to increase computers' performance and energy efficiency by combining diverse hardware devices, such as Graphics Processing Units (GPUs) and Field Programmable Gate Arrays (FPGAs). The rationale behind this trend is that different parts of an application can be offloaded from the...
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has increased the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. In recent years, hardware accelerators have been employed as a means t...
This paper addresses the fundamental question of how modern Big Data frameworks can dynamically and transparently exploit heterogeneous hardware accelerators. After presenting the major challenges that have to be addressed towards this goal, we describe our proposed architecture for automatic and transparent hardware acceleration of Big Data framew...
Since the early conception of managed runtime systems with tiered JIT compilation, several research attempts have been made to accelerate the bytecode execution. In this paper, we extend prior attempts by performing an initial analysis of whether heterogeneous hardware accelerators in the form of Graphics Processing Units (GPUs) and Field Programma...
The advent of modern cloud services, along with the huge volume of data produced on a daily basis, has set the demand for fast and efficient data processing. This demand is common among numerous application domains, such as deep learning, data mining, and computer vision. Prior research has focused on employing hardware accelerators as a means to o...
With micro-services continuously gaining popularity and low-power processors making their way into data centers, efficient execution of managed runtime systems on low-power architectures is also gaining interest. Apart from the inherent performance differences between high and low power processors, porting a managed runtime system to a low-power ar...
As the silicon industry moves into deep nanoscale technologies, preserving Mean Time to Failure at acceptable levels becomes a first-order challenge. The operational stress, along with the inefficient power dissipation and the unsustainable thermal thresholds increase the wear-induced failures. As a result, faster wear-out leads to earlier performa...
Blockchain technology has become extremely popular during the last decade, mainly due to its successful application in the cryptocurrency domain. Following the explosion of Bitcoin and other cryptocurrencies, blockchain solutions are being deployed to almost every aspect of transactional operations as a means to exchange safely and secure digita...
By utilizing diverse heterogeneous hardware resources, developers can significantly improve the performance of their applications. Currently, in order to determine which parts of an application suit a particular type of hardware accelerator better, an offline analysis that uses a priori knowledge of the target hardware configuration is necessary. T...
In this work, we propose an approach for transparent compilation and execution of Java programs onto Intel FPGA devices. In detail, we showcase how a managed runtime environment can leverage Intel OpenCL SDK to generate specialized FPGA code, enabling prototyping and acceleration of Java Programs onto FPGAs. Finally, we describe our implementation...
Blockchain technology has become extremely popular during the last decade, mainly due to its successful application in the cryptocurrency domain. Following the explosion of Bitcoin and other cryptocurrencies, blockchain solutions are being deployed in almost every aspect of transactional operations as a means to safely exchange digital assets bet...
Parallel skeletons are essential structured design patterns for efficient heterogeneous and parallel programming. They allow programmers to express common algorithms in such a way that it is much easier to read, maintain, debug and implement for different parallel programming models and parallel architectures. Reductions are one of the most common...
The efficient execution of Big Data applications requires a large quantity of compute and memory resources. Typically, these resources are in the form of data centres with numerous processing elements connected through a computer network. Although initially the majority of data centers were utilizing only CPU resources, nowadays we can find heterog...
The proliferation of heterogeneous hardware in recent years means that every system we program is likely to include a mix of compute elements; each with different characteristics. By utilizing these available hardware resources, developers can improve the performance and energy efficiency of their applications. However, existing tools for heterogen...
Visual understanding of 3D environments in real-time, at low power, is a huge computational challenge. Often referred to as SLAM (Simultaneous Localisation and Mapping), it is central to applications spanning domestic and industrial robotics, autonomous vehicles, virtual and augmented reality. This paper describes the results of a major research ef...
The constant growth of data and its importance to drive Machine Learning and Big Data is pushing storage systems towards ever increasing I/O bandwidth and lower latency requirements. In recent years, the Non Volatile Memory Express (NVMe) standard has enabled SSD drives to deliver high I/O rates by allowing the storage to be connected directly via...
The rapid development of data-intensive applications, such as social networks, IoT, etc., has caused an explosion in Big Data processing. Soon, homogeneous CPU clusters reached their limits, as they could no longer satisfy their clients' requirements, which were to execute their applications within a strict time and budget limit. Consequently, clo...
The slowdown of Moore's law along with the end of Dennard's scaling and the ever-increasing demand for computing power have shown the performance limitations of homogeneous systems. To address this issue, computer architects have taken advantage of the recent technological advancements in order to come up with heterogeneous solutions where the exte...
In this paper we outline the current state of language Virtual Machines (VMs) running on RISC-V as well as our initiatives in augmenting the existing ecosystem with Maxine VM, a state-of-the-art open source research Virtual Machine (VM). Maxine VM is a metacircular VM for Java and is currently part of the Beehive ecosystem that provides a unified f...
Heterogeneous computing has emerged as a means to achieve high performance and energy efficiency. Naturally, this trend has been accompanied by changes in software development norms that do not necessarily favor programmers. A prime example is the two most popular heterogeneous programming languages, CUDA and OpenCL, which expose several low-level...
In recent years, we have witnessed an explosion in the use of Virtual Machines (VMs), which are currently found in desktops, smartphones, and cloud deployments. These recent developments create new research opportunities in the VM domain, extending from performance to energy efficiency and scalability studies. Research into these directions n...
This paper describes our experiences creating Tornado: a practical and efficient heterogeneous programming framework for managed languages. The novel aspect of Tornado is that it turns the programming of heterogeneous systems from an activity predominantly based on a priori knowledge into one based on a posteriori knowledge. Alternatively put, it s...
Extending current Virtual Machine implementations to new Instruction Set Architectures entails a significant programming and debugging effort. Meta-circular VMs add another level of complexity towards this aim since they have to compile themselves with the same compiler that is being extended. Therefore, having low-level debugging tools is of vital...
In this paper, we describe our experiences in co-designing a domain-specific compilation stack. Our motivation stems from the missed optimization opportunities we observed while implementing a computer vision library in Java. To tackle the performance shortcomings, we developed Indigo, a computer vision API co-designed with a compilation plugin for...
Implementations of object-oriented programming languages associate type information with each object to perform various runtime tasks such as dynamic dispatch, type introspection, and reflection. A common means of storing such relation is by inserting a pointer to the associated type information into every object. Such an approach, however, introdu...
Despite their proliferation as a dominant computing paradigm, cloud computing systems lack effective mechanisms to manage their vast amounts of resources efficiently. Resources are stranded and fragmented, ultimately limiting cloud systems' applicability to large classes of critical applications that pose non-moderate resource demands. Eliminating...
Real-time 3D space understanding is becoming prevalent across a wide range of applications and hardware platforms. To meet the desired Quality of Service (QoS), computer vision applications tend to be heavily parallelized and exploit any available hardware accelerators. Current approaches to achieving real-time computer vision revolve around progra...
Managed applications, written in programming languages such as Java, C# and others, represent a significant share of workloads in the mobile, desktop, and server domains. Microarchitectural timing simulation of such workloads is useful for characterization and performance analysis, of both hardware and software, as well as for research and developm...
In this paper we describe Jacc, an experimental framework which allows developers to program GPGPUs directly from Java. The goal of Jacc is to allow developers to benefit from using heterogeneous hardware whilst minimizing the amount of code refactoring required. Jacc utilizes two key abstractions: tasks which encapsulate all the information neede...
System designers typically use well-studied benchmarks to evaluate and improve new architectures and compilers. We design tomorrow's systems based on yesterday's applications. In this paper we investigate an emerging application, 3D scene understanding, likely to be significant in the mobile space in the near future. Until now, this application cou...
The explosion of Big Data was followed by the proliferation of numerous complex parallel software stacks whose aim is to tackle the challenges of data deluge. A drawback of such a multi-layered hierarchical deployment is the inability to maintain and delegate vital semantic information between layers in the stack. Software abstractions increase the...
The end of Dennard scaling combined with stagnation in architectural and compiler optimizations makes it challenging to achieve significant performance deltas. Solutions based solely in hardware or software are no longer sufficient to maintain the pace of improvements seen during the past few decades. In hardware, the end of single-core scaling res...
Heterogeneous programming has started becoming the norm in order to achieve better performance by running portions of code on the most appropriate hardware resource. Currently, significant engineering efforts are undertaken in order to enable existing programming languages to perform heterogeneous execution mainly on GPUs. In this paper we describe...
Techniques for implementing identification and management of unsafe optimizations are disclosed. A method of the disclosure includes receiving, by a managed runtime environment (MRE) executed by a processing device, a notice of misprediction of optimized code, the misprediction occurring during a runtime of the optimized code, determining, by the M...
Applications using transactional memory may exhibit fluctuating (dynamic) available parallelism, i.e. the maximum number of transactions that can be committed concurrently may change over time. Executing large numbers of transactions concurrently in phases with low available parallelism will waste processor resources in aborted transactions, while...
In transactional memory, conflicts between two concurrently executing transactions reduce performance, reduce scalability, and may lead to aborts, which waste computing resources. Ideally, concurrent execution of transactions would be ordered to minimise conflicts, but such an ordering is often complex, or unfeasible, to obtain. This paper identifi...
Affordable transparent clustering solutions to scale non-HPC applications on commodity clusters (such as Terracotta) are emerging for Java Virtual Machines (JVMs). Working in this direction, we propose the Anaconda framework as a research platform to investigate the role Transactional Memory (TM) can play in this domain. Anaconda is a software tran...