Georgios Goumas’s research while affiliated with National Technical University of Athens and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (120)


Scaling Serverless Functions: Horizontal or Vertical? Both!
  • Conference Paper

March 2025


Figure 4: This figure visualizes the initial segment of a quantum circuit that can be used for the classification of the functions in F₂.
Figure 12: This figure shows the implementation of the BFPQC algorithm for the classification of the Boolean functions in F₄, assuming Bob has chosen the oracle for the function f₃ and Alice has employed the classifier Q₄.
Figure 14: This is the measurement outcome of the quantum circuit of Figure 12 for 2048 runs.
A Quantum Algorithm for the Classification of Patterns of Boolean Functions
  • Preprint
  • File available

March 2025


This paper introduces a novel quantum algorithm that classifies a hierarchy of classes of imbalanced Boolean functions. The defining characteristic of an imbalanced Boolean function is that the proportion of elements in its domain that take the value 0 differs from the proportion that take the value 1. For every positive integer n, the hierarchy contains a class of Boolean functions defined by their behavioral pattern; the common trait of all functions in the same class is that they share the same imbalance ratio. Our algorithm achieves classification in a straightforward manner: the final measurement reveals the unknown function with probability 1. The proposed algorithm is also an optimal oracular algorithm, because it classifies the aforementioned functions with a single query to the oracle. Finally, we explain in detail the methodology we followed to design this algorithm, in the hope that it will prove general and fruitful, since it can easily be modified and extended to address other classes of imbalanced Boolean functions that exhibit different behavioral patterns.
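The quantum algorithm itself requires an oracle and a measurement circuit, but the property it classifies by is purely combinatorial. As a classical illustration only (the function names below are hypothetical examples, not from the paper), the imbalance ratio of a Boolean function can be computed by exhaustive evaluation:

```python
from itertools import product

def imbalance_ratio(f, n):
    """Fraction of the 2**n inputs on which f evaluates to 1.
    Functions with the same ratio fall into the same class."""
    ones = sum(f(*x) for x in product((0, 1), repeat=n))
    return ones / 2 ** n

# Hypothetical members of two different imbalance classes.
f_and = lambda a, b: a & b   # imbalanced: one quarter of inputs map to 1
f_or  = lambda a, b: a | b   # imbalanced: three quarters of inputs map to 1

print(imbalance_ratio(f_and, 2))  # 0.25
print(imbalance_ratio(f_or, 2))   # 0.75
```

The quantum algorithm recovers this class membership with a single oracle query, whereas the exhaustive check above needs all 2^n evaluations.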


Figure 1. The 1:1 and N:1 serverless models. N:1 fragments and ties down idle memory resources.
Figure 4. HotMem integration into a serverless runtime
Fast and Efficient Memory Reclamation For Serverless MicroVMs

November 2024


Resource elasticity is one of the key defining characteristics of the Function-as-a-Service (FaaS) serverless computing paradigm. To provide strong multi-tenant isolation, FaaS providers commonly sandbox functions inside virtual machines (VMs or microVMs). While compute resources assigned to VM-sandboxed functions can be seamlessly adjusted on the fly, memory elasticity remains challenging, especially when scaling down. State-of-the-art mechanisms for VM memory elasticity suffer from increased reclaim latency when memory needs to be released, compounded by CPU and memory bandwidth overheads. We identify the Linux memory manager's obliviousness to virtually hotplugged memory as the key issue hindering hot-unplug performance, and design HotMem, a novel approach for fast and efficient VM memory hot(un)plug, targeting VM-sandboxed serverless functions. Our key insight is that by segregating virtually hotplugged memory regions from regular VM memory, we can bound the lifetimes of allocations within these regions, enabling their fast and efficient reclamation. We implement HotMem in Linux v6.6, and our evaluation shows that it is an order of magnitude faster than the state of practice at reclaiming VM memory, while matching the P99 function latency of a model that statically over-provisions VMs.
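The key insight above can be sketched as a toy model (the class and method names are hypothetical, not HotMem's actual kernel interface): by keeping allocations in a hotplugged region segregated and bounded, the region becomes reclaimable as a unit the moment it is empty, with no page-by-page migration out of it first.

```python
class HotplugRegion:
    """Toy model of a virtually hotplugged memory region whose
    allocations are segregated from regular VM memory."""

    def __init__(self, size):
        self.size = size
        self.used = 0

    def alloc(self, n):
        # Allocations are confined to this region, bounding their lifetime.
        if self.used + n > self.size:
            raise MemoryError("region full")
        self.used += n

    def free(self, n):
        self.used -= n

    def reclaimable(self):
        # Regular VM memory would first need its pages migrated away;
        # a segregated region can be unplugged as soon as it is empty.
        return self.used == 0

r = HotplugRegion(1024)
r.alloc(512)
print(r.reclaimable())  # False
r.free(512)
print(r.reclaimable())  # True
```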




Figure 1. eBPF-mm workflow
eBPF-mm: Userspace-guided memory management in Linux with eBPF

September 2024


We leverage eBPF to implement custom policies in the Linux memory subsystem. Inspired by CBMM, we create a mechanism that provides the kernel with hints about the benefit of promoting a page to a specific size. We introduce a new hook point in the Linux page-fault handling path for eBPF programs, providing them with the necessary context to determine the page size to be used. We then develop a framework that allows users to define profiles for their applications and load them into the kernel. A profile consists of memory regions of interest and their expected benefit from being backed by 4KB, 64KB, and 2MB pages. In our evaluation, we profiled our workloads with DAMON to identify hot memory regions.
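A minimal sketch of the decision such a hook could make, assuming a hypothetical profile format (region ranges mapped to per-page-size benefit scores; the addresses and scores below are illustrative, not from the paper):

```python
# Supported backing page sizes: 4KB, 64KB, 2MB.
PAGE_SIZES = (4096, 65536, 2 * 1024 * 1024)

# Hypothetical profile: (region start, region end, {page size: benefit}).
profile = [
    (0x7f0000000000, 0x7f0000200000,
     {4096: 0.1, 65536: 0.4, 2 * 1024 * 1024: 0.9}),
]

def pick_page_size(fault_addr, profile):
    """On a page fault inside a profiled region, choose the page size
    with the highest expected benefit; otherwise use the 4KB base page."""
    for start, end, benefit in profile:
        if start <= fault_addr < end:
            return max(benefit, key=benefit.get)
    return 4096

print(pick_page_size(0x7f0000001000, profile))  # 2097152 (2MB)
print(pick_page_size(0x500000000000, profile))  # 4096
```

In the actual system this decision runs as an eBPF program at the new page-fault hook; the sketch only shows the policy shape, not the kernel mechanism.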


Figure 1: Throughput achieved by a NUMA-oblivious [2, 34] and a NUMA-aware [65] priority queue, both initialized with 1024 keys. We use 64 threads that perform a mix of insert and deleteMin operations in parallel, and the key range is set to 2048 keys. We use all NUMA nodes of a 4-node NUMA system, the characteristics of which are presented in Section 4.
Figure 2: High-level overview of SmartPQ. SmartPQ dynamically adapts its algorithm to the contention levels of the workload based on the prediction of a simple classifier.
SmartPQ: An Adaptive Concurrent Priority Queue for NUMA Architectures

June 2024


Concurrent priority queues are widely used in important workloads, such as graph applications and discrete event simulations. However, designing scalable concurrent priority queues for NUMA architectures is challenging. Even though several NUMA-oblivious implementations can scale up to a high number of threads by exploiting the potential parallelism of insert operations, NUMA-oblivious implementations scale poorly in deleteMin-dominated workloads. This is because all threads compete to access the same memory locations, i.e., the highest-priority element of the queue, thus incurring excessive cache coherence traffic and non-uniform memory accesses between the nodes of a NUMA system. In such scenarios, NUMA-aware implementations are typically used to improve performance. In this work, we propose an adaptive priority queue, called SmartPQ. SmartPQ tunes itself by switching between a NUMA-oblivious and a NUMA-aware algorithmic mode to achieve high performance under varying contention scenarios. SmartPQ has two key components. First, it is built on top of NUMA Node Delegation (Nuddle), a generic low-overhead technique for constructing efficient NUMA-aware data structures using any arbitrary concurrent NUMA-oblivious implementation as a backbone. Second, SmartPQ integrates a lightweight decision-making mechanism to decide when to switch between the NUMA-oblivious and NUMA-aware algorithmic modes. Our evaluation shows that, on NUMA systems, SmartPQ performs best across varying contention scenarios with an 87.9% success rate, and dynamically adapts between the NUMA-aware and NUMA-oblivious algorithmic modes with negligible performance overhead. SmartPQ improves performance by 1.87x on average over SprayList, the state-of-the-art NUMA-oblivious priority queue.
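The mode-switching idea can be sketched as follows. SmartPQ uses a trained classifier; the hand-written thresholds and function name below are a hypothetical stand-in that only illustrates the shape of the decision, not the paper's actual predictor:

```python
def choose_mode(insert_ratio, thread_count):
    """Stand-in for SmartPQ's lightweight classifier.

    deleteMin-heavy workloads at high thread counts contend on the
    queue head (the highest-priority element), so the NUMA-aware
    mode wins there; insert-heavy or lightly threaded workloads
    keep the NUMA-oblivious mode's scalability.
    """
    if insert_ratio < 0.5 and thread_count > 16:
        return "numa-aware"
    return "numa-oblivious"

print(choose_mode(0.2, 64))  # numa-aware
print(choose_mode(0.9, 8))   # numa-oblivious
```

Because Nuddle wraps an arbitrary NUMA-oblivious queue as its backbone, switching modes does not require copying the data structure, which is what keeps the adaptation overhead negligible.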




Open-Source SpMV Multiplication Hardware Accelerator for FPGA-Based HPC Systems

March 2024


Lecture Notes in Computer Science

The Sparse Matrix Vector (SpMV) multiplication kernel is a key component of many high-performance computing applications, but at the same time one of the most challenging to optimize, primarily due to its low flop-per-byte ratio and irregular memory accesses. As such, modern FPGAs, combined with High-Bandwidth Memory (HBM) modules, are much better suited to the memory-bound nature of this kernel than general-purpose CPUs. Current FPGA-based approaches to SpMV support only single-precision floating-point arithmetic. Moreover, they target highly streamed implementations that, although they enhance performance, rely on custom matrix storage formats, which (i) can increase the matrix footprint by up to 3x, and (ii) place the burden of input matrix transformation on developers. Towards widening the spectrum of FPGA-supported floating-point formats for sparse algebra, this paper presents a first set of effective optimizations for double-precision SpMV hardware kernels using High-Level Synthesis (HLS) tools on HBM-equipped FPGAs. Results show that our work provides 52.4x better performance on average than state-of-practice double-precision SpMV implementations on FPGAs for applications with volatile matrices, and up to 5.1x better performance-per-Watt than server-class CPUs.
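For reference, the kernel being accelerated is the standard SpMV over a compressed sparse row (CSR) matrix. A minimal Python version makes the bottleneck visible: each nonzero is read exactly once (low flop-per-byte ratio), and the gather from x through col_idx is irregular:

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """y = A @ x for A stored in CSR form.

    row_ptr[i]:row_ptr[i+1] delimits the nonzeros of row i;
    col_idx holds their column positions, values their entries.
    """
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Irregular, data-dependent access into x via col_idx.
            y[i] += values[k] * x[col_idx[k]]
    return y

# 2x2 matrix [[2, 0], [1, 3]] times the vector [1, 1].
print(spmv_csr([0, 1, 3], [0, 0, 1], [2.0, 1.0, 3.0], [1.0, 1.0]))
# [2.0, 4.0]
```

Working directly on CSR, as this paper's kernels do, avoids the up-to-3x footprint blowup of the custom streaming formats mentioned above.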


Citations (66)


... The use of Transformer-based encoders transforms historical data into retrievable knowledge, effectively coupling this data with user-item-context features for recommender systems [25]. Additionally, architectural advancements such as DaeMon significantly mitigate data movement overheads in disaggregated systems through adaptive granularity selection and synergy in data movement techniques [26]. Furthermore, influence functions in data valuation are optimized for scalability with advanced strategies that enhance gradient utilization during model training [27]. ...

Reference:

Scalable Architectures for Data Processing in High-Volume Gig Economy Transactions
Architectural Support for Efficient Data Movement in Fully Disaggregated Systems
  • Citing Article
  • June 2023

ACM SIGMETRICS Performance Evaluation Review

... Prior work has explored hardware-assisted [37,45,47,54] and application-guided [18,23,67] page migration for tiered memory systems, which may not be practical as they require specialized hardware support or application redesign from the ground up. A large body of work exists in page migration for disaggregated and tiered memory that is independent of the application and hardware support [10,17,22,28,29,41,43,44,48,56,68]. We discuss the limitations of these approaches in Section 2.5. ...

Architectural Support for Efficient Data Movement in Fully Disaggregated Systems
  • Citing Conference Paper
  • June 2023

... Compute Express Link (CXL) has emerged as a key enabling technology for memory expansion and pooling in modern datacenters. Extensive research has examined the potential of CXL-based memory for disaggregated architectures [47], [67], [85], [48], [1], [133], [45]. Other research has investigated the role of CXL in tiered memory systems [96], [151], [154], [76], [120]. ...

DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Systems
  • Citing Article
  • March 2023

Proceedings of the ACM on Measurement and Analysis of Computing Systems

... OpenCL-based CSR SpMV is also evaluated in literature [5] [6]. Modifications to the CSR format that allow data streaming, thereby improving memory bandwidth utilization, have also been proposed in [8] [15]. In [13] [16], matrix reordering on the host side is utilized for better x-vector reuse on the FPGA. ...

On the Performance and Energy Efficiency of Sparse Matrix-Vector Multiplication on FPGAs

... Speedup is provided by reducing the communication between processors. Giannoula et al. (2023) proposed the ColorTM algorithm, which detects coloring inconsistencies between vertices. ColorTM uses speculative synchronization to minimize costs and improve parallelism. ...

High-performance and balanced parallel graph coloring on multicore platforms

The Journal of Supercomputing

... PIM-specific code optimizations. Prior work has shown that host data distribution and kernel multi-level tiling are essential for achieving high performance in DRAM-PIM systems [13,21,55]. Studies such as [13,22] demonstrated that efficient data distribution can reduce data movement and enhance DPU parallelism, improving UPMEM performance. ...

SparseP: Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures
  • Citing Conference Paper
  • July 2022

... The fundamental concept of PIM revolves around integrating computing circuits within or near memory arrays to process data directly on stored information, especially vector-matrix multiplication [2], [3]. However, deep neural network (DNN) models, following the deep learning scaling law, are scaling up at an exponential rate [4]-[8], posing unprecedented challenges for the limited on-chip PIM capacity: most conventional PIM architectures presume that the weights (parameters) of deep learning models are loaded only once before repetitive computation, following a weight-stationary parallelism scheme [9]. ...

DaxVM: Stressing the Limits of Memory as a File Interface
  • Citing Conference Paper
  • October 2022

... Limitation 2: Sparsity. In recent years, sparsity has become a key feature in state-of-the-art AI models [39]-[41], enabling a suite of software optimizations [42], [43] and hardware accelerators [41], [44]-[46]. [The remainder of this snippet is extraction residue from a table comparing simulators: Cacti-6.0, CactiDRAM, STONNE, Timeloop v4, Accelergy, and SCALE-Sim v2.] ...

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures
  • Citing Article
  • June 2022

ACM SIGMETRICS Performance Evaluation Review

... [46] showcases how VM-based sandboxing can outperform container-based sandboxing via lean virtual machines. Recently, microVM snapshotting has been proposed [20,25,35,61] in order to accelerate function cold-starts. HotMem unlocks increased performance by using lean microVMs, which can rapidly scale to accommodate multiple concurrent function invocations, thus avoiding the microVM initialization tax, and enabling functions to share the warmed-up state of other concurrently running functions on the same microVM. ...

FaaS in the age of (sub-) μs I/O: a performance analysis of snapshotting
  • Citing Conference Paper
  • June 2022

... The design of Attn-PIM has already been demonstrated to be implementable in industry prototypes and products, such as UPMEM [66][67][68][69][70][71][72][73][74] and HBM-PIM [30,75]. Its integration into our system ensures efficient processing of memory-bound tasks, making it a suitable solution for LLM workloads. ...

Towards Efficient Sparse Matrix Vector Multiplication on Real Processing-In-Memory Architectures
  • Citing Conference Paper
  • June 2022