Gustavo Alonso’s research while affiliated with ETH Zurich and other places


Publications (405)


[Figures/tables: limitations of existing shell interfaces and Coyote v2's proposed interfaces; data-transfer throughput scaling with the number of HBM channels in one vFPGA; synthesis and implementation times for the shell and app flows targeting an Alveo U250; reconfiguration throughput comparison; reconfiguration latency for various Coyote v2 shell configurations (mean ± std over 5 trials).]
Coyote v2: Raising the Level of Abstraction for Data Center FPGAs
  • Preprint
  • File available

April 2025 · 6 Reads

Benjamin Ramhorst · Dario Korolija · Maximilian Jakob Heer · [...] · Gustavo Alonso

In the trend towards hardware specialization, FPGAs play a dual role: as accelerators for offloading, e.g., network virtualization, and as a vehicle for prototyping and exploring hardware designs. While FPGAs offer versatility and performance, integrating them into larger systems remains challenging. Thus, recent efforts have focused on raising the level of abstraction through better interfaces and high-level programming languages. Yet there remains considerable room for improvement. In this paper, we present Coyote v2, an open-source FPGA shell built with a novel, three-layer hierarchical design supporting dynamic partial reconfiguration of services and user logic, a unified logic interface, and high-level software abstractions such as support for multithreading and multitenancy. Experimental results indicate that Coyote v2 reduces synthesis times by 15% to 20% and run-time reconfiguration times by an order of magnitude compared to existing systems. We also demonstrate the advantages of Coyote v2 by deploying several realistic applications, including HyperLogLog cardinality estimation, AES encryption, and neural network inference. Finally, Coyote v2 places great emphasis on integration with real systems through reusable and reconfigurable services, including a fully RoCE v2-compliant networking stack, a shared virtual memory model with the host, and a DMA engine between FPGAs and GPUs. We demonstrate these features by, e.g., seamlessly deploying an FPGA-accelerated neural network from Python.
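
The shell's software abstractions are easiest to picture from the host side. The sketch below is a toy Python model, not Coyote v2's actual API: every name (VirtualFPGA, reconfigure, invoke) is invented to illustrate vFPGA slots, run-time partial reconfiguration, and multithreaded, multitenant access.

```python
# Conceptual model only (not Coyote v2's real API): a shell exposing vFPGA
# slots that can be reconfigured at run time and shared by host threads.
from concurrent.futures import ThreadPoolExecutor
from threading import Lock

class VirtualFPGA:
    """One reconfigurable user-logic slot behind the shell (illustrative)."""
    def __init__(self, slot: int):
        self.slot, self.bitstream, self._lock = slot, None, Lock()

    def reconfigure(self, bitstream: str) -> None:
        with self._lock:              # partial reconfiguration serialized per slot
            self.bitstream = bitstream

    def invoke(self, data: bytes) -> bytes:
        with self._lock:              # multitenancy: callers share the slot safely
            assert self.bitstream, "slot not configured"
            return data[::-1]         # stand-in for the accelerator's result

shell = [VirtualFPGA(i) for i in range(2)]
shell[0].reconfigure("hyperloglog.bin")   # hypothetical bitstream names
shell[1].reconfigure("aes.bin")
with ThreadPoolExecutor() as pool:        # multithreaded submission from the host
    results = list(pool.map(shell[0].invoke, [b"abc", b"def"]))
print(results)
```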



[Figures/tables: RAGO for systematic RAG serving optimization; describing general RAG pipelines with RAGSchema; RAGSchema of the workloads used in the case studies.]
RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving

March 2025 · 38 Reads

Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
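
The paper's RAGSchema abstraction is described only at a high level here; as a rough illustration of what a structured description of a RAG pipeline could look like, the sketch below uses invented field names (the actual RAGSchema fields may differ):

```python
# Illustrative only: a structured description of a RAG pipeline in the
# spirit of RAGSchema. All field names are invented, not from the paper.
from dataclasses import dataclass

@dataclass(frozen=True)
class RAGPipelineSpec:
    retriever_index_size: int          # number of vectors in the knowledge DB
    vector_dim: int
    docs_per_query: int                # top-k passages fed to the LLM
    rewrite_model_params: int | None   # optional query-rewriter size
    generator_params: int              # main LLM size
    iterative_retrieval: bool          # retrieve again during decoding?

hyperscale_qa = RAGPipelineSpec(
    retriever_index_size=10**10, vector_dim=768, docs_per_query=5,
    rewrite_model_params=None, generator_params=70 * 10**9,
    iterative_retrieval=False,
)
# A serving optimizer can enumerate placement/batching policies per spec:
print(hyperscale_qa)
```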


FpgaHub: Fpga-centric Hyper-heterogeneous Computing Platform for Big Data Analytics

March 2025 · 10 Reads

Modern data analytics requires a huge amount of computing power and processes massive amounts of data. At the same time, the underlying computing platform is becoming much more heterogeneous in both hardware and software. Even though specialized hardware, e.g., FPGA-, GPU-, or TPU-based systems, often achieves better performance than a CPU-only system given the slowing of Moore's law, such systems are limited in what they can do. For example, GPU-only approaches suffer from severe IO limitations. To truly exploit the potential of hardware heterogeneity, we present FpgaHub, an FPGA-centric hyper-heterogeneous computing platform for big data analytics. The key idea of FpgaHub is to use reconfigurable computing to implement a versatile hub complementing other processors (CPUs, GPUs, DPUs, programmable switches, computational storage, etc.). Using an FPGA as the basis, we can take advantage of its highly reconfigurable nature and rich IO interfaces, such as PCIe, networking, and on-board memory, to place it at the center of the architecture and use it as a data and control plane for data movement, scheduling, pre-processing, etc. FpgaHub enables the architectural flexibility to explore the rich design space of heterogeneous computing platforms.
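
As a loose illustration of the hub idea (not FpgaHub's interface; the placement policy and all names are invented), a data/control plane that routes work among heterogeneous processors might look like:

```python
# Conceptual sketch only: an FPGA-resident hub acting as a data/control
# plane that routes tasks between heterogeneous processors.
from queue import Queue

PROCESSORS = {"cpu": Queue(), "gpu": Queue(), "storage": Queue()}

def route(task: dict) -> str:
    """Toy placement policy a hub's control plane might apply."""
    if task["kind"] == "scan":          # IO-bound: keep near storage
        return "storage"
    if task["size"] > 1 << 20:          # large batch: ship to the GPU
        return "gpu"
    return "cpu"

for task in [{"kind": "scan", "size": 10}, {"kind": "join", "size": 1 << 22}]:
    PROCESSORS[route(task)].put(task)   # pre-processing could happen in-flight here
print({name: q.qsize() for name, q in PROCESSORS.items()})
```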


Cracking Vector Search Indexes

March 2025 · 7 Reads

Retrieval Augmented Generation (RAG) uses vector databases to expand the expertise of an LLM without having to retrain it. This idea can be applied over data lakes, leading to the notion of embeddings data lakes, i.e., a pool of vector databases ready to be used by RAGs. The key component in these systems is the index enabling Approximate Nearest Neighbor Search (ANNS). However, in data lakes, one cannot realistically expect to build indexes for every possible dataset. In this paper, we propose an adaptive, partition-based index, CrackIVF, that performs much better than up-front index building. CrackIVF starts answering queries with near-brute-force search and only expands as it sees enough queries, progressively adapting the index to the query workload. That way, queries can be answered right away without having to build a full index first. After seeing enough queries, CrackIVF will produce an index comparable to the best of those built using conventional techniques. As the experimental evaluation shows, CrackIVF can often answer more than 1 million queries before other approaches have even built the index, and it can start answering queries immediately, achieving 10-1000x faster initialization times. This makes it ideal when working with cold or infrequently used data, or as a way to bootstrap access to unseen datasets.
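
The abstract describes the mechanism precisely enough to sketch: begin with a single partition (near brute force) and split hot partitions as queries accumulate. The NumPy toy below follows that spirit only; the splitting heuristic and thresholds are invented and are not CrackIVF's actual algorithm.

```python
# Minimal sketch of query-driven "cracking" of an IVF-style ANN index.
# Details (thresholds, centroid choice) are invented for illustration.
import numpy as np

rng = np.random.default_rng(0)
data = rng.standard_normal((20_000, 64)).astype(np.float32)

centroids = data[:1].copy()                # start near brute force: one partition
assign = np.zeros(len(data), dtype=np.int64)

def search(q, nprobe=1):
    order = np.argsort(np.linalg.norm(centroids - q, axis=1))[:nprobe]
    cand = np.flatnonzero(np.isin(assign, order))
    return cand[np.argmin(np.linalg.norm(data[cand] - q, axis=1))]

hits = np.zeros(len(centroids), dtype=np.int64)
for step in range(500):                    # queries arrive; hot partitions split
    q = rng.standard_normal(64).astype(np.float32)
    part = assign[search(q)]
    hits[part] += 1
    if hits[part] > 50 and (assign == part).sum() > 1_000:  # crack a hot partition
        members = np.flatnonzero(assign == part)
        new_c = data[rng.choice(members)]                   # crude new centroid
        centroids = np.vstack([centroids, new_c])
        hits = np.append(hits, 0)
        moved = members[np.linalg.norm(data[members] - new_c, axis=1)
                        < np.linalg.norm(data[members] - centroids[part], axis=1)]
        assign[moved] = len(centroids) - 1
        hits[part] = 0
print(f"partitions after 500 queries: {len(centroids)}")
```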


Efficiently Processing Joins and Grouped Aggregations on GPUs

February 2025 · 3 Reads · 1 Citation

Proceedings of the ACM on Management of Data

There is a growing interest in leveraging GPUs for tasks beyond ML, especially in database systems. Despite the existing extensive work on GPU-based database operators, several questions are still open. For instance, the performance of almost all operators suffers from random accesses, which can account for up to 75% of the runtime. In addition, the group-by operator, which is widely used in combination with joins, has not been fully explored for GPU acceleration. Furthermore, existing work often uses limited and unrepresentative workloads for evaluation and does not explore the query optimization aspect, i.e., how to choose the most efficient implementation based on the workload. In this paper, we revisit the state-of-the-art GPU-based join and group-by implementations. We identify their inefficiencies and propose several optimizations. We introduce GFTR, a novel technique to reduce random accesses, leading to speedups of up to 2.3x. We further optimize existing hash-based and sort-based group-by implementations, achieving significant speedups (19.4x and 1.7x, respectively). We also present a new partition-based group-by algorithm ideal for high group cardinalities. We analyze the optimizations with cost models, allowing us to predict the speedup. Finally, we conduct a performance evaluation to analyze each implementation. We conclude by providing practical heuristics to guide query optimizers in selecting the most efficient implementation for a given workload.
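
The partition-based group-by idea can be illustrated on the CPU: partition keys first so that each partition's aggregation state stays small and local. This NumPy sketch only conveys the structure; the paper's contribution is the GPU kernels and cost models, which are not reproduced here.

```python
# Sketch of a partition-based group-by (sum aggregation). The paper targets
# GPU kernels; this NumPy version only illustrates the partition-first idea.
import numpy as np

def partitioned_groupby_sum(keys, vals, bits=8):
    n_parts = 1 << bits
    part = keys & (n_parts - 1)            # radix partition on the low bits
    out_keys, out_sums = [], []
    for p in range(n_parts):               # each partition is small enough to
        m = part == p                      # aggregate with good locality
        if not m.any():
            continue
        uk, inv = np.unique(keys[m], return_inverse=True)
        out_keys.append(uk)
        out_sums.append(np.bincount(inv, weights=vals[m]))
    return np.concatenate(out_keys), np.concatenate(out_sums)

rng = np.random.default_rng(1)
keys = rng.integers(0, 1_000_000, size=1_000_000)   # high group cardinality
vals = rng.random(1_000_000)
k, s = partitioned_groupby_sum(keys, vals)
print(len(k), s.sum().round(3), vals.sum().round(3))  # totals should match
```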


Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models

February 2025 · 13 Reads · 11 Citations

Proceedings of the VLDB Endowment

A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge during text generation. This strategy facilitates impressive generation quality even with smaller models, thus reducing computational demands by orders of magnitude. To serve RALMs efficiently and flexibly, we propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators in a disaggregated architecture. The heterogeneity ensures efficient serving for both inference and retrieval, while the disaggregation allows independent scaling of LLM and vector search accelerators to fulfill diverse RALM requirements. Our Chameleon prototype implements vector search accelerators on FPGAs and assigns LLM inference to GPUs, with CPUs as cluster coordinators. Evaluated on various RALMs, Chameleon exhibits up to 2.16× reduction in latency and 3.18× speedup in throughput compared to the hybrid CPU-GPU architecture. The promising results pave the way for adopting heterogeneous accelerators for not only LLM inference but also vector search in future RALM systems.
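
Disaggregation here means retrieval and generation scale as independent pools behind a coordinator. A toy asyncio coordinator (invented names; stand-in delays instead of real FPGA/GPU calls) conveys the request flow:

```python
# Conceptual coordinator loop for a disaggregated RALM (not Chameleon's code):
# vector-search workers and LLM workers scale independently; the CPU
# coordinator stitches retrieve -> generate per request.
import asyncio, random

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(random.uniform(0.001, 0.005))   # stand-in for FPGA ANN
    return [f"doc-{hash(query) % 100}"]

async def llm_generate(query: str, ctx: list[str]) -> str:
    await asyncio.sleep(random.uniform(0.01, 0.02))     # stand-in for GPU inference
    return f"answer({query}|{','.join(ctx)})"

async def serve(queries: list[str]) -> list[str]:
    async def one(q):
        return await llm_generate(q, await vector_search(q))
    return await asyncio.gather(*(one(q) for q in queries))

print(asyncio.run(serve(["q1", "q2"])))
```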


In-Network Preprocessing of Recommender Systems on Multi-Tenant SmartNICs

January 2025 · 6 Reads

Keeping ML-based recommender models up-to-date as data drifts and evolves is essential to maintain accuracy. As a result, online data preprocessing plays an increasingly important role in serving recommender systems. Existing solutions employ multiple CPU workers to saturate the input bandwidth of a single training node. Such an approach results in high deployment costs and energy consumption. For instance, a recent report from industrial deployments shows that data storage and ingestion pipelines can account for over 60% of the power consumption in a recommender system. In this paper, we tackle the issue from a hardware perspective by introducing Piper, a flexible and network-attached accelerator that executes data loading and preprocessing pipelines in a streaming fashion. As part of the design, we define MiniPipe, the smallest pipeline unit, enabling multi-pipeline implementation by executing various data preprocessing tasks across a single board and giving Piper the ability to be reconfigured at runtime. Our results, using publicly released commercial pipelines, show that Piper, prototyped on a power-efficient FPGA, achieves a 39-105× speedup over a server-grade, 128-core CPU and a 3-17× speedup over GPUs like the RTX 3090 and A100 across multiple pipelines. The experimental analysis demonstrates that Piper provides advantages in both latency and energy efficiency for preprocessing tasks in recommender systems, providing an alternative design point for systems that today are in very high demand.
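
A MiniPipe, as described, is the smallest schedulable pipeline unit, with per-tenant pipelines reconfigurable at run time. The host-side Python sketch below is purely illustrative of that composition idea, not Piper's hardware design:

```python
# Sketch of the MiniPipe idea as host-side composition: small pipeline units
# chained per tenant and swappable at run time. All names are illustrative.
from typing import Callable, Iterable

Stage = Callable[[dict], dict]

def mini_pipe(stages: list[Stage]) -> Callable[[Iterable[dict]], Iterable[dict]]:
    def run(stream):
        for rec in stream:                 # streaming: records flow stage by stage
            for stage in stages:
                rec = stage(rec)
            yield rec
    return run

decode = lambda r: {**r, "feat": int(r["raw"], 16)}
bucket = lambda r: {**r, "feat": r["feat"] % 10}
# Two tenants with different pipelines on one device; "reconfiguration"
# here is just swapping the stage list.
tenant_a = mini_pipe([decode, bucket])
tenant_b = mini_pipe([decode])
print(list(tenant_a([{"raw": "ff"}])), list(tenant_b([{"raw": "ff"}])))
```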


Efficient Tabular Data Preprocessing of ML Pipelines

September 2024 · 30 Reads

Data preprocessing pipelines, which include data decoding, cleaning, and transforming, are a crucial component of Machine Learning (ML) training. They are computationally intensive and often become a major bottleneck due to the increasing performance gap between the CPUs used for preprocessing and the GPUs used for model training. Recent studies show that a significant number of CPUs across several machines are required to achieve sufficient throughput to saturate the GPUs, leading to increased resource and energy consumption. When the pipeline involves vocabulary generation, preprocessing performance scales poorly due to significant row-wise synchronization overhead between different CPU cores and servers. To address this limitation, in this paper we present the design of Piper, a hardware accelerator for tabular data preprocessing, prototype it on FPGAs, and demonstrate its potential for training pipelines of commercial recommender systems. Piper achieves a 4.7-71.3× speedup in latency over a 128-core CPU server and outperforms a data-center GPU by 4.8-20.3× when using binary input. This impressive performance showcases Piper's potential to increase the efficiency of data preprocessing pipelines and significantly reduce their resource consumption.
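
The vocabulary-generation bottleneck mentioned above comes from all workers having to agree on a single value-to-id mapping. A toy version (illustrative only, not Piper's mechanism) makes the row-wise synchronization point visible:

```python
# Why vocabulary generation serializes poorly: every worker must agree on one
# global value -> id mapping. A toy sketch with a shared dict and a lock.
from threading import Lock
from concurrent.futures import ThreadPoolExecutor

vocab: dict[str, int] = {}
lock = Lock()

def encode(value: str) -> int:
    with lock:                          # the row-wise synchronization point
        if value not in vocab:
            vocab[value] = len(vocab)   # assign the next dense id
        return vocab[value]

rows = ["ad42", "user7", "ad42", "item9"] * 1000
with ThreadPoolExecutor(max_workers=8) as pool:
    ids = list(pool.map(encode, rows))
print(len(vocab), ids[:4])   # 3 distinct values -> ids 0..2
```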


Imaginary Machines: A Serverless Model for Cloud Applications

June 2024 · 40 Reads

Serverless Function-as-a-Service (FaaS) platforms provide applications with resources that are highly elastic, quick to instantiate, accounted at fine granularity, and free of explicit runtime resource orchestration. This combination of core properties underpins the success and popularity of the serverless FaaS paradigm. However, these benefits are not available to most cloud applications because they are designed for networked virtual machine/container environments. Since such cloud applications cannot take advantage of the highly elastic resources of serverless platforms and require run-time orchestration systems to operate, they suffer from lower resource utilization, additional management complexity, and higher costs relative to their serverless FaaS counterparts. We propose Imaginary Machines, a new serverless model for cloud applications. This model (1) exposes the highly elastic resources of serverless platforms as the traditional network-of-hosts model that cloud applications expect, and (2) eliminates the need for explicit run-time orchestration by transparently managing application resources based on signals generated during cloud application executions. With the Imaginary Machines model, unmodified cloud applications become serverless applications. While still based on the network-of-hosts model, they benefit from highly elastic resources and do not require runtime orchestration, just like their specialized serverless FaaS counterparts, promising increased resource utilization while reducing management costs.
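
As a toy rendering of the model (the class and method names are invented, not the paper's API), the platform could grow and shrink the set of "hosts" an unmodified application sees, driven by signals from the application's own execution:

```python
# Toy rendering of the Imaginary Machines idea (invented API): the platform
# adds/removes "hosts" from the picture an app sees, driven by execution
# signals rather than by explicit orchestration.
class ImaginaryCluster:
    def __init__(self):
        self.hosts = ["host-0"]          # what the unmodified app believes exists

    def signal(self, queue_depth: int) -> None:
        """Scale the visible network-of-hosts from an execution signal."""
        want = max(1, queue_depth // 10)
        while len(self.hosts) < want:    # instantiate quickly, like FaaS sandboxes
            self.hosts.append(f"host-{len(self.hosts)}")
        del self.hosts[want:]            # release idle capacity at fine granularity

cluster = ImaginaryCluster()
for depth in (5, 42, 8):
    cluster.signal(depth)
    print(depth, cluster.hosts)
```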


Citations (58)


... ANNS methods include graph-based and cluster-based approaches [4], [8]. The former links similar vectors and searches greedily along edges, but can suffer from irregular memory access. ...

Reference:

Cosmos: A CXL-Based Full In-Memory System for Approximate Nearest Neighbor Search
Chameleon: A Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models
  • Citing Article
  • February 2025

Proceedings of the VLDB Endowment

... Similar to data and control plane separation in modern networks, in OffRAC, data transfer happens in a highly efficient streaming manner to FPGAs hosting multiple accelerators, without explicit coordination from a host CPU. Offloading to FPGAs in OffRAC is a "bump in the wire" operation, that could take place in SmartNICs equipped with FPGAs [60,61,82], programmable switches with FPGAs [1,64], or network-accessible disaggregated FPGAs [57,68]. After studying the state of the art in acceleration design, we found that when deployed in the data path, stateless accelerators can be highly effective across a range of applications. ...

Strega: An HTTP Server for FPGAs
  • Citing Article
  • October 2023

ACM Transactions on Reconfigurable Technology and Systems

... This work covers 53 recent studies and summarises the overall trend in this research area. Jiang et al. [32] present a systematic comparative study of serverless and serverful systems for distributed ML training. While conducting our study, we also observed a growing research interest in the application of a serverless approach to support ML systems. ...

A systematic evaluation of machine learning on serverless infrastructure

The VLDB Journal

... Graph databases are non-relational database systems specifically designed to efficiently represent and model intricate relationships between entities. This characteristic makes them particularly well-suited for datasets characterized by significant interdependencies ([Besta et al. 2023]). Unlike other data models, the graph database model is founded on three fundamental properties ([Latif et al. 2023]): the manner in which data is organized and stored within the database; the means by which data within the database can be manipulated and queried; and a set of constraints that ensure data integrity and consistency. ...

Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries
  • Citing Article
  • June 2023

ACM Computing Surveys

... Lee et al. [23] introduce a modular Solid-State Drive (SSD) architecture and the concept of Database Kernels (DBKs) using CXL technology, but practical implementation of DBKs requires more research. Meanwhile, Maschi and Alonso [30] conducted an in-depth study on using FPGA-based accelerators on commercial search engines, revealing that the imbalance between system CPUs limits improvements and could make deployments economically less attractive. ...

The Difficult Balance Between Modern Hardware and Conventional CPUs
  • Citing Conference Paper
  • June 2023

... Recent works on FPGA-based data center accelerators to accelerate workloads show promise [15,32,33], but their applications to network security and large flow detection remain underexplored. Data center accelerator card-based smart NICs [34,35] and 100Gbps network stacks [22,36] primarily target data analytics, machine learning, and cryptography [37][38][39][40]. Nevertheless, these architectures offer valuable insights and can be adapted for large flow detection, significantly reducing design time. ...

Data Processing with FPGAs on Modern Architectures
  • Citing Conference Paper
  • June 2023

... As regards the support for hardware accelerators, the integration of FPGAs and serverless computing is attracting interest (e.g., [38][39][40]), but we are still in a very early phase of exploration. Conversely, the ability of offloading computation to GPUs has been recognized as essential for serverless functions implementing ML inference (or, more rarely, training) tasks and, thus, has been the subject of various research works (e.g., [41][42][43][44][45]). ...

Serverless FPGA: Work-In-Progress
  • Citing Conference Paper
  • May 2023

... There is also new movement on data query languages that are more suited to data exploration of hierarchical data than SQL [50], e.g., Malloy [51] (see also [52]) which improves composability of joins and aggregations. The combination of these trends could provide significant opportunities on how the HEP community manages data in the future, with significant savings in storage space (due to views and dataset versioning using abstractions similar to git) and network bandwidth due to higher utility of cached data and the ability to push down operators into network and storage layers. ...

Evaluating query languages and systems for high-energy physics data

Journal of Physics Conference Series

... We have included three such tasks: compression with the DEFLATE algorithm [21], correspondingly decompression, and regular expression (RegEx) matching. These tasks are frequently invoked on the data path of cloud data systems [19]. Exploring optimizable tasks gives us a glimpse into the performance gains with different levels of possible compute optimizations, especially the hardware accelerators that are yet to be fully exploited by today's data systems. ...

Hardware acceleration of compression and encryption in SAP HANA
  • Citing Article
  • August 2022

Proceedings of the VLDB Endowment