Daniele De Sensi
  • PhD
  • Professor (Assistant) at Sapienza University of Rome

About

  • 63 Publications
  • 13,417 Reads
  • 684 Citations
Introduction
Daniele De Sensi is an Assistant Professor at Sapienza University of Rome; he previously held positions at ETH Zurich and the University of Pisa. His research focuses on power-aware computing and high-level parallel programming models.
Current institution
Sapienza University of Rome
Current position
  • Professor (Assistant)
Additional affiliations
March 2020 - August 2022
ETH Zurich
Position
  • PostDoc Position
May 2018 - February 2020
University of Pisa
Position
  • PostDoc Position
October 2014 - April 2018
University of Pisa
Position
  • PhD Student

Publications (63)
Article
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the ongoing AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we...
Preprint
Full-text available
In recent years, GPUs have become the preferred accelerators for HPC and ML applications due to their parallelism and fast memory bandwidth. While GPUs boost computation, inter-GPU communication can create scalability bottlenecks, especially as the number of GPUs per node and cluster grows. Traditionally, the CPU managed multi-GPU communication, but...
Preprint
Full-text available
Multi-GPU nodes are increasingly common in the rapidly evolving landscape of exascale supercomputers. On these systems, GPUs on the same node are connected through dedicated networks, with bandwidths up to a few terabits per second. However, gauging performance expectations and maximizing system efficiency is challenging due to different technologi...
Preprint
Full-text available
Most existing datacenter transport protocols rely on in-order packet delivery, a design choice rooted in legacy systems and simplicity. However, advancements in technology, such as RDMA, have made it feasible to relax this requirement, allowing for more effective use of modern datacenter topologies like FatTree and Dragonfly. The rise of AI/ML work...
Preprint
Full-text available
Multi-tenancy is essential for unleashing SmartNIC's potential in datacenters. Our systematic analysis in this work shows that existing on-path SmartNICs have resource multiplexing limitations. For example, existing solutions lack multi-tenancy capabilities such as performance isolation and QoS provisioning for compute and IO resources. Compared to...
Article
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network...
Preprint
Full-text available
Cloud computing represents an appealing opportunity for cost-effective deployment of HPC workloads on the best-fitting hardware. However, although cloud and on-premise HPC systems offer similar computational resources, their network architecture and performance may differ significantly. For example, these systems use fundamentally different network...
Preprint
Full-text available
Numerous microarchitectural optimizations unlocked tremendous processing power for deep neural networks that in turn fueled the AI revolution. With the exhaustion of such optimizations, the growth of modern AI is now gated by the performance of training systems, especially their data movement. Instead of focusing on single accelerators, we investig...
Preprint
Full-text available
High-performance clusters and datacenters pose increasingly demanding requirements on storage systems. If these systems do not operate at scale, applications are doomed to become I/O bound and waste compute cycles. To accelerate the data path to remote storage nodes, remote direct memory access (RDMA) has been embraced by storage systems to let dat...
Preprint
Full-text available
This paper presents a security analysis of the InfiniBand architecture, a prevalent RDMA standard, and NVMe-over-Fabrics (NVMe-oF), a prominent protocol for industrial disaggregated storage that exploits RDMA protocols to achieve low-latency and high-bandwidth access to remote solid-state devices. Our work, NeVerMore, discovers new vulnerabilities...
Article
In fault tolerance for parallel and distributed systems, message logging protocols have played a prominent role in the last three decades. Such protocols enable local rollback to provide recovery from fail-stop errors. Global rollback techniques can be straightforward to implement but at times lead to slower recovery than local rollback. Local roll...
Preprint
Full-text available
The allreduce operation is one of the most commonly used communication routines in distributed applications. To improve its bandwidth and to reduce network traffic, this operation can be accelerated by offloading it to network switches, that aggregate the data received from the hosts, and send them back the aggregated result. However, existing solu...
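For readers unfamiliar with the operation, below is a minimal host-based allreduce sketch in Python/mpi4py. It is illustrative only: it shows the semantics of the routine (every rank ends up with the element-wise sum of all contributions), not the in-network, switch-offloaded aggregation that this preprint is about. Run with e.g. `mpirun -n 4 python allreduce.py`.

```python
# Minimal host-based allreduce sketch (illustrative only; the preprint above
# discusses offloading this operation to network switches, which is not shown here).
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Each rank contributes a vector; every rank receives the element-wise sum.
local = np.full(4, rank, dtype=np.float64)
result = np.empty(4, dtype=np.float64)
comm.Allreduce(local, result, op=MPI.SUM)

print(f"rank {rank}: {result}")  # all ranks print the same aggregated vector
```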
Conference Paper
Full-text available
Structured parallel programming models based on parallel design patterns are gaining more and more importance. Several state-of-the-art industrial frameworks build on the parallel design pattern concept, including Intel TBB and Microsoft PPL. In these frameworks, the explicit exposition of parallel structure of the application favours the identific...
Preprint
Full-text available
The interconnect is one of the most critical components in large scale computing systems, and its impact on the performance of applications is going to increase with the system size. In this paper, we will describe Slingshot, an interconnection network for large scale computing systems. Slingshot is based on high-radix switches, which allow buildin...
Article
Full-text available
The Actor-based programming model is largely used in the context of distributed systems for its message-passing semantics and neat separation between the concurrency model and the underlying hardware platform. However, in the context of a single multi-core node where the performance metric is the primary optimization objective, the “pure” Actor Mod...
Article
Full-text available
The notion of k-truss has been introduced a decade ago in social network analysis and security for community detection, as a form of cohesive subgraphs less stringent than a clique (set of pairwise linked nodes), and more selective than a k-core (induced subgraph with minimum degree k). A k-truss is an inclusion-maximal subgraph H in which each edg...
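As a quick illustration of the notion (not the algorithm proposed in the article), the sketch below extracts a k-truss from a small example graph with networkx; note that library conventions differ on whether the parameter counts required triangles per edge directly or via k-2.

```python
# Illustrative k-truss extraction on a toy graph (not the article's method).
import networkx as nx

G = nx.karate_club_graph()      # small, well-known example graph
T = nx.k_truss(G, k=4)          # maximal subgraph satisfying the truss condition
print(T.number_of_nodes(), "nodes,", T.number_of_edges(), "edges in the 4-truss")
```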
Article
Full-text available
Increasing attention has been given to providing service level objectives (SLOs) in stream processing applications, due to performance and energy requirements and because of the need to impose limits on resource usage while improving system utilization. Since current and next-generation computing systems are intrinsically offe...
Chapter
One of the key needs of an autonomic computing system is the ability to monitor the application performance with minimal intrusiveness and performance overhead. Several solutions have been proposed, differing in terms of effort required by the application programmers to add autonomic capabilities to their applications. In this work we extend the No...
Article
Full-text available
This work proposes a methodology to find performance and energy trade-offs for parallel applications running on Heterogeneous Multi-Processing systems with a single instruction-set architecture. These offer flexibility in the form of different core types and voltage and frequency pairings, defining a vast design space to explore. Therefore, for a g...
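To illustrate the kind of trade-off space involved (this is not the paper's methodology, and all configuration names and numbers below are made up), the following sketch keeps only the Pareto-optimal points among a few hypothetical (execution time, power) measurements taken at different core-type/frequency configurations.

```python
# Hypothetical sketch: Pareto-optimal performance/power configurations on an
# HMP system. Configurations and measurements are invented for illustration.
configs = {
    ("big", 2.0e9):    (10.0, 12.0),   # (seconds, watts)
    ("big", 1.2e9):    (15.0,  7.0),
    ("little", 1.4e9): (22.0,  3.5),
    ("little", 0.8e9): (30.0,  3.8),   # dominated: slower and more power-hungry
}

def pareto(points):
    """Keep configurations not dominated in both time and power."""
    return {
        c: tp for c, tp in points.items()
        if not any(o[0] <= tp[0] and o[1] <= tp[1] and o != tp for o in points.values())
    }

for cfg, (t, p) in sorted(pareto(configs).items()):
    print(cfg, f"time={t}s power={p}W")
```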
Preprint
Full-text available
This work proposes a methodology to find performance and energy trade-offs for parallel applications running on Heterogeneous Multi-Processing systems with a single instruction-set architecture. These offer flexibility in the form of different core types and voltage and frequency pairings, defining a vast design space to explore. Therefore, for a g...
Chapter
Power consumption of IT infrastructure is a major concern for data centre operators. Since a data centre's power supply is usually dimensioned for an average-case scenario, uncorrelated and simultaneous power spikes in multiple servers could lead to catastrophic effects such as power outages. To avoid such situations, power capping solutions are usual...
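As background only (not the solution described in this chapter), the sketch below inspects the Linux powercap/RAPL interface, one common mechanism that software power-capping tools build on. It assumes a Linux system exposing /sys/class/powercap; paths and permissions vary, and writing a lower limit requires root.

```python
# Illustrative Linux-only sketch: read the long-term RAPL power limits.
from pathlib import Path

for zone in sorted(Path("/sys/class/powercap").glob("intel-rapl:*")):
    name = (zone / "name").read_text().strip()
    limit_uw = (zone / "constraint_0_power_limit_uw").read_text().strip()
    print(f"{zone.name} ({name}): power limit = {int(limit_uw) / 1e6:.1f} W")

# Writing a lower value to constraint_0_power_limit_uw (as root) enforces a cap.
```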
Conference Paper
Full-text available
System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates mor...
Preprint
Full-text available
System noise can negatively impact the performance of HPC systems, and the interconnection network is one of the main factors contributing to this problem. To mitigate this effect, adaptive routing sends packets on non-minimal paths if they are less congested. However, while this may mitigate interference caused by congestion, it also generates mor...
Conference Paper
Full-text available
The Actor-based parallel programming model is gaining momentum in the context of IoT and Cloud computing thanks to a clear separation between the concurrency model and the underlying HW platform. The message-driven style of non-blocking and asynchronous interactions via immutable messages among Actors fits well with the complexity of modern hete...
Article
Full-text available
We discuss the extended parallel pattern set identified within the EU-funded project RePhrase as a candidate pattern set to support data intensive applications targeting heterogeneous architectures. The set has been designed to include three classes of pattern, namely i) core patterns, modelling common, not necessarily data intensive parallelism ex...
Chapter
Full-text available
Stream processing applications became a representative workload in current computing systems. A significant part of these applications demands parallelism to increase performance. However, programmers are often facing a trade-off between coding productivity and performance when introducing parallelism. SPar was created for balancing this trade-off...
Chapter
In recent years, increasing attention has been given to the possibility of guaranteeing Service Level Objectives (SLOs) to users about their applications, either regarding performance or power consumption. SLO can be implemented for parallel applications since they can provide many control knobs (e.g., the number of threads to use, the clock freque...
Article
Full-text available
Today’s stream processing systems handle high-volume data streams in an efficient manner. To achieve this goal, they are designed to scale out on large clusters of commodity machines. However, despite the efficient use of distributed architectures, they lack support for co-processors such as graphics processing units (GPUs) ready to accelerate data-...
Conference Paper
Full-text available
This paper studies k-plexes, a well known pseudo-clique model for network communities. In a k-plex, each node can miss at most k-1 links. Our goal is to detect large communities in today's real-world graphs which can have hundreds of millions of edges. While many have tried, this task has been elusive so far due to its computationally challenging n...
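To make the definition concrete (this is only a toy check, not the paper's large-scale detection algorithm), the sketch below verifies that a node set S is a k-plex, i.e. that each node of S is adjacent to at least |S| - k other nodes of S, equivalently missing at most k - 1 of the possible links.

```python
# Toy k-plex membership check on a small graph (illustrative only).
import networkx as nx

def is_k_plex(G, S, k):
    S = set(S)
    return all(len(set(G[v]) & S) >= len(S) - k for v in S)

G = nx.Graph([(1, 2), (1, 3), (2, 3), (3, 4), (2, 4)])
print(is_k_plex(G, {1, 2, 3, 4}, k=2))   # True: node 1 misses only the link to 4
print(is_k_plex(G, {1, 2, 3, 4}, k=1))   # False: a 1-plex is a clique
```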
Article
Full-text available
Continuous streaming computations are usually composed of different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e. the concurrency control) is a critical aspect both for performance and power consumption. In this paper we describe the design of automatic concurrency control al...
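To illustrate why the concurrency control matters for both performance and power (this is a generic sketch, not the automatic algorithms designed in the article), the example below contrasts a blocking consumer, which sleeps and saves power, with a busy-waiting consumer, which keeps a core active for lower latency.

```python
# Blocking vs. busy-waiting access to a shared message queue (illustrative only).
import queue, threading, time

q = queue.Queue()

def consumer_blocking():
    item = q.get()                 # thread sleeps until an item arrives (low power)
    print("blocking consumer got", item)

def consumer_spinning():
    while True:                    # busy-waits, keeping a core active (low latency)
        try:
            item = q.get_nowait()
            print("spinning consumer got", item)
            return
        except queue.Empty:
            pass

threading.Thread(target=consumer_blocking).start()
threading.Thread(target=consumer_spinning).start()
time.sleep(0.1)
q.put("a"); q.put("b")             # each consumer eventually receives one item
```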
Article
Full-text available
Self-adaptation is an emerging requirement in parallel computing. It enables the dynamic selection of resources to allocate to the application in order to meet performance and power consumption requirements. This is particularly relevant in Fog Applications, where data is generated by a number of devices at a varying rate, according to users' activ...
Chapter
Full-text available
A desirable characteristic of modern parallel applications is the ability to dynamically select the amount of resources to be used to meet requirements on performance or power consumption. In many cases, providing explicit guarantees on performance is of paramount importance. In streaming applications, this is related with the concept of elasticity...
Article
Full-text available
Parallel design patterns can be fruitfully combined to develop parallel software applications. Different combinations of patterns can feature different QoS while being functionally equivalent. To support application developers in selecting the best combinations of patterns to develop their applications, we hereby propose a probabilistic approach th...
Conference Paper
Full-text available
In this work, we consider the C++ Actor Framework (CAF), a recent proposal that revamped the interest in building concurrent and distributed applications using the actor programming model in C++. CAF has been optimized for high-throughput computing, whereas message latency between actors is greatly influenced by the message data rate: at l...
Article
Managing low-level architectural features to control performance and power consumption is a growing need in the parallel computing community. Such features include, but are not limited to: energy profiling, platform topology analysis, CPU core disabling and frequency scaling. However, these low-level mechanisms are usually managed by specif...
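For context only (not the library described in this article), the sketch below touches the raw Linux sysfs knobs that such tools typically wrap, e.g. per-core frequency scaling and core off-lining. It assumes a Linux kernel with cpufreq support; paths differ across systems.

```python
# Illustrative Linux-only sketch of low-level frequency-scaling knobs.
from pathlib import Path

cpu0 = Path("/sys/devices/system/cpu/cpu0/cpufreq")
print("governor:", (cpu0 / "scaling_governor").read_text().strip())
print("current frequency (kHz):", (cpu0 / "scaling_cur_freq").read_text().strip())

# Disabling a core (as root): echo 0 > /sys/devices/system/cpu/cpu1/online
```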
Article
Full-text available
High-level parallel programming is an active research topic aimed at promoting parallel programming methodologies that provide the programmer with high-level abstractions to develop complex parallel software with reduced time-to-solution. Pattern-based parallel programming is based on a set of composable and customizable parallel patterns used as b...
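As a bare-bones illustration of the pattern idea (not the pattern set proposed in the article), the sketch below composes two classic patterns, map and reduce, using only the Python standard library; real pattern-based frameworks offer far richer, customizable building blocks.

```python
# Map + reduce composition as a minimal parallel-pattern example (illustrative).
from multiprocessing import Pool
from functools import reduce
import operator

def square(x):                  # the "worker" function plugged into the map pattern
    return x * x

if __name__ == "__main__":
    with Pool(4) as pool:
        mapped = pool.map(square, range(10))    # map pattern: data parallelism
    total = reduce(operator.add, mapped, 0)     # reduce pattern: aggregation
    print(total)                                # 285
```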
Conference Paper
Full-text available
Power consumption management has become a major concern in software development. Continuous streaming computations are usually composed by different modules, exchanging data through shared message queues. The selection of the algorithm used to access such queues (i.e., the concurrency control) is a critical aspect for both performance and power con...
Conference Paper
Full-text available
High-level parallel programming is a de-facto standard approach to develop parallel software with reduced time to development. High-level abstractions are provided by existing frameworks as pragma-based annotations in the source code, or through pre-built parallel patterns that recur frequently in parallel algorithms, and that can be easily instant...
Conference Paper
Full-text available
Power-aware computing is gaining increasing attention in both academic and industrial settings. The problem of guaranteeing a given QoS requirement (either in terms of performance or power consumption) can be faced by selecting and dynamically adapting the amount of physical and logical resources used by the application. In this study, we consid...
Article
Full-text available
The dataflow programming model has been extensively used as an effective solution to implement efficient parallel programming frameworks. However, the amount of resources allocated to the runtime support is usually fixed once by the programmer or the runtime, and kept static during the entire execution. While there are cases where such a static cho...
Article
Full-text available
In current computing systems, many applications require guarantees on their maximum power consumption to not exceed the available power budget. On the other hand, for some applications, it could be possible to decrease their performance, yet maintain an acceptable level, in order to reduce their power consumption. To provide such guarantees, a poss...
Poster
Full-text available
Current architectures provide many possibilities for the reduction of power consumption of applications, such as reducing the number of used cores or scaling down their frequency. However, the amount of resources allocated to an application is usually static and fixed by the programmer or by the runtime. While there are cases where such a static ch...
Conference Paper
Full-text available
Current architectures provide many control knobs for the reduction of power consumption of applications, like reducing the number of used cores or scaling down their frequency. However, choosing the right values for these knobs in order to satisfy requirements on performance and/or power consumption is a complex task and trying all the possible com...
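To give a feel for why exhaustive profiling is impractical (this is not the paper's approach; the knob ranges below are invented), the sketch enumerates the configuration space spanned by just two knobs, core count and DVFS frequency.

```python
# The configuration space grows as the product of the knob ranges (illustrative).
from itertools import product

cores = range(1, 25)                      # 1..24 worker threads (hypothetical)
freqs = [1.2, 1.6, 2.0, 2.4, 2.8, 3.2]    # GHz steps exposed by DVFS (hypothetical)

configs = list(product(cores, freqs))
print(len(configs), "configurations to profile exhaustively")   # 144
# A predictive/online approach instead samples a few configurations and
# estimates the performance and power of the remaining ones.
```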
Conference Paper
Full-text available
Determining the right amount of resources needed for a given computation is a critical problem. In many cases, computing systems are configured to use an amount of resources sufficient to manage high load peaks, even though this causes energy waste when the resources are not fully utilised. To avoid this problem, adaptive approaches are used to dynamically incr...
Conference Paper
Full-text available
The analysis of packet payloads is mandatory for network security and traffic monitoring applications. The computational cost of this activity pushed the industry towards hardware-assisted deep packet inspection (DPI), which has the disadvantage of being more expensive and less flexible. This paper covers the design and implementation of a new DPI fr...
Conference Paper
Full-text available
Monitoring network traffic on 10 Gbit networks requires very efficient tools capable of exploiting modern multicore computing architectures. Specialized network cards can accelerate packet capture and thus reduce the processing overhead, but they cannot achieve adequate packet analysis performance. For this reason most monitoring tools cannot co...
