José Duato’s research while affiliated with Polytechnic University of Valencia and other places


Publications (586)


[Figure previews: "FIGURE 1: Congestion Point in DCQCN"; "Evaluated Network Configurations"]
ECP: Improving the Accuracy of Congesting-Packets Identification in High-Performance Interconnection Networks
  • Article
  • Full-text available

January 2025 · 8 Reads · IEEE Micro

Pedro Javier Garcia · [...] · Jose Duato

Interconnection networks are crucial in data centers and supercomputers, ensuring high communication bandwidth and low latency under demanding traffic patterns from data-intensive applications. These patterns can cause congestion, affecting system performance if not addressed efficiently. Current congestion control techniques, like DCQCN, struggle to precisely identify which packets cause congestion, leading to false positives. To address this, we propose the Enhanced Congestion Point (ECP) mechanism, which accurately identifies congesting packets. ECP monitors packets at the head of switch ingress queues, flagging them as congesting when queue occupancy exceeds a threshold and packet requests are rejected. Additionally, ECP introduces a re-evaluation mechanism to cancel the identification of congesting packets if they no longer contribute to congestion after rerouting. We evaluated ECP using a network simulator modeling various configurations and realistic traffic patterns. Results show that ECP significantly improves congestion detection accuracy with a low error margin, enhancing DCQCN performance.
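The flagging rule described in the abstract (mark the packet at the head of an ingress queue as congesting only when queue occupancy exceeds a threshold *and* its forwarding request is rejected, with a later re-evaluation that can cancel the mark) can be sketched as follows. This is a minimal toy model: the queue structure, threshold value, and method names are illustrative assumptions, not details from the paper.

```python
from collections import deque

OCCUPANCY_THRESHOLD = 8  # illustrative threshold, not a value from the paper

class IngressQueue:
    """Toy model of a switch ingress queue for the ECP flagging rule."""

    def __init__(self):
        self.packets = deque()

    def push(self, packet):
        self.packets.append(packet)

    def evaluate_head(self, request_rejected):
        """Flag the head packet as congesting only when BOTH conditions
        hold: occupancy above the threshold AND its forwarding request
        was rejected (e.g. the requested output port is busy)."""
        if not self.packets:
            return None
        head = self.packets[0]
        head["congesting"] = (
            len(self.packets) > OCCUPANCY_THRESHOLD and request_rejected
        )
        return head

    def reevaluate(self, head, still_contributing):
        """ECP's re-evaluation step: cancel the congesting mark if, after
        rerouting, the packet no longer contributes to congestion."""
        if head and head.get("congesting") and not still_contributing:
            head["congesting"] = False
        return head

q = IngressQueue()
for i in range(10):
    q.push({"id": i, "congesting": False})

head = q.evaluate_head(request_rejected=True)
assert head["congesting"] is True       # both conditions hold: flagged
head = q.reevaluate(head, still_contributing=False)
assert head["congesting"] is False      # mark cancelled after rerouting
```

Requiring both conditions, rather than occupancy alone, is what reduces the false positives the abstract attributes to DCQCN-style detection.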


Analyzing the impact of the MPI allreduce in distributed training of convolutional neural networks

January 2022 · 258 Reads · 4 Citations · Computing

For many distributed applications, data communication poses an important bottleneck from the points of view of performance and energy consumption. As more cores are integrated per node, in general the global performance of the system increases yet eventually becomes limited by the interconnection network. This is the case for distributed data-parallel training of convolutional neural networks (CNNs), which usually proceeds on a cluster with a small to moderate number of nodes. In this paper, we analyze the performance of the Allreduce collective communication primitive, a key to the efficient data-parallel distributed training of CNNs. Our study targets the distinct realizations of this primitive in three high performance instances of Message Passing Interface (MPI), namely MPICH, OpenMPI, and IntelMPI, and employs a cluster equipped with state-of-the-art processor and network technologies. In addition, we apply the insights gained from the experimental analysis to the optimization of the TensorFlow framework when running on top of Horovod. Our study reveals that a careful selection of the most convenient MPI library and Allreduce (ARD) realization accelerates the training throughput by a factor of 1.2× compared with the default algorithm in the same MPI library, and up to 2.8× when comparing distinct MPI libraries in a number of relevant combinations of CNN model+dataset.
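To illustrate why the choice of Allreduce realization matters, here is a minimal single-process simulation of the classic ring algorithm (reduce-scatter followed by allgather), one of the realizations MPI libraries such as MPICH and OpenMPI can select for large messages. The pure-Python `ring_allreduce` helper is a sketch of the textbook algorithm, not library code.

```python
def ring_allreduce(buffers):
    """Single-process simulation of the ring Allreduce algorithm
    (reduce-scatter followed by allgather). `buffers` holds one vector
    per simulated rank; every rank ends with the elementwise sum."""
    p = len(buffers)
    n = len(buffers[0])
    assert n % p == 0, "for simplicity, length must be divisible by rank count"
    chunk = n // p
    bufs = [list(b) for b in buffers]

    # Reduce-scatter: in step s, rank r forwards chunk (r - s) mod p to its
    # ring neighbour, which accumulates it; after p - 1 steps, rank r owns
    # the fully reduced chunk (r + 1) mod p.
    for s in range(p - 1):
        for r in range(p):
            c = (r - s) % p
            dst = (r + 1) % p
            for i in range(c * chunk, (c + 1) * chunk):
                bufs[dst][i] += bufs[r][i]

    # Allgather: circulate the fully reduced chunks around the ring so
    # every rank ends up with the complete result.
    for s in range(p - 1):
        for r in range(p):
            c = (r + 1 - s) % p
            dst = (r + 1) % p
            for i in range(c * chunk, (c + 1) * chunk):
                bufs[dst][i] = bufs[r][i]
    return bufs

out = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
assert all(b == [12, 15, 18] for b in out)
```

The ring algorithm is bandwidth-optimal for large messages but latency-bound for small ones, which is precisely why the "most convenient realization" varies with message size and library, as the study above measures.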


Accelerating distributed deep neural network training with pipelined MPI allreduce

December 2021 · 1,269 Reads · 10 Citations · Cluster Computing

TensorFlow (TF) is usually combined with the Horovod (HVD) workload distribution package to obtain a parallel tool to train deep neural networks on clusters of computers. HVD in turn utilizes a blocking Allreduce primitive to share information among processes, combined with a communication thread to overlap communication with computation. In this work, we perform a thorough experimental analysis to expose (1) the importance of selecting the best algorithm in MPI libraries to realize the Allreduce operation; and (2) the performance acceleration that can be attained when replacing a blocking Allreduce with its non-blocking counterpart (while maintaining the blocking behaviour via the appropriate synchronization mechanism). Furthermore, (3) we explore the benefits of applying pipelining to the communication exchange, demonstrating that these improvements carry over to distributed training via TF+HVD. Finally, (4) we show that pipelining can also boost performance for applications that make heavy use of other collectives, such as Broadcast and Reduce-Scatter.
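The pipelining idea in point (3), splitting the message into segments so that per-segment exchanges can progress concurrently, can be sketched in plain Python. A thread pool stands in for non-blocking MPI_Iallreduce requests and the in-order waits stand in for MPI_Wait; the segment size and helper names are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor

def allreduce_segment(segments):
    """Stand-in for one non-blocking Allreduce on a message segment:
    elementwise sum of the same segment across all simulated ranks."""
    return [sum(vals) for vals in zip(*segments)]

def pipelined_allreduce(buffers, seg_len):
    """Split each rank's buffer into seg_len-element segments and issue
    one simulated non-blocking Allreduce per segment; completions are
    consumed in issue order, so later segments can make progress while
    earlier results are being handled."""
    n = len(buffers[0])
    with ThreadPoolExecutor(max_workers=4) as pool:
        pending = [
            pool.submit(allreduce_segment, [b[i:i + seg_len] for b in buffers])
            for i in range(0, n, seg_len)
        ]
        out = []
        for fut in pending:  # analogous to waiting on each request in order
            out.extend(fut.result())
    return out

# Two simulated ranks, four elements each, pipelined in two segments.
assert pipelined_allreduce([[1, 2, 3, 4], [5, 6, 7, 8]], seg_len=2) == [6, 8, 10, 12]
```

In a real MPI setting, the segment size trades per-message latency overhead against overlap, which is why the paper evaluates pipelining experimentally rather than assuming a single best granularity.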


UPR: deadlock-free dynamic network reconfiguration by exploiting channel dependency graph compatibility

November 2021 · 128 Reads · 5 Citations · The Journal of Supercomputing

The deadlock-free dynamic network reconfiguration process is usually studied from the perspective of routing-algorithm restrictions and resource reservation. The dynamic nature of the transition from one routing function to another is often managed by restricting resource usage in a static, predefined manner, which limits the supported routing algorithms and/or inactive link patterns, or requires additional resources such as virtual channels. Exploiting compatibility between routing functions by exploring their associated channel dependency graphs (CDG) leads to a better reconfiguration process given its dynamic nature. In this paper, we propose a new dynamic reconfiguration process called Upstream Progressive Reconfiguration (UPR). Our algorithm progressively performs dependency addition/removal on a per-channel basis, relying on the information provided by the CDG while the reconfiguration takes place. This gives us the opportunity to foresee compatible scenarios where both routing functions coexist, reducing the required amount of resource drainage as well as packet-injection halting.
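The per-channel dependency handling described above rests on keeping the channel dependency graph acyclic, the classical sufficient condition for deadlock freedom in wormhole-routed networks. A minimal sketch of that check, assuming a CDG stored as an adjacency map; the function names are illustrative, not from the paper.

```python
def reachable(cdg, start, target):
    """Iterative DFS over the channel dependency graph: channels are
    nodes, edges are dependencies induced by the routing function."""
    stack, seen = [start], set()
    while stack:
        c = stack.pop()
        if c == target:
            return True
        if c in seen:
            continue
        seen.add(c)
        stack.extend(cdg.get(c, ()))
    return False

def try_add_dependency(cdg, u, v):
    """Progressive UPR-style step: add dependency u -> v only if the CDG
    stays acyclic, i.e. v cannot already reach u. Otherwise the addition
    is deferred until conflicting dependencies are removed."""
    if reachable(cdg, v, u):
        return False  # would close a cycle: defer this addition
    cdg.setdefault(u, set()).add(v)
    return True

cdg = {}
assert try_add_dependency(cdg, "c0", "c1")
assert try_add_dependency(cdg, "c1", "c2")
assert not try_add_dependency(cdg, "c2", "c0")  # c0 -> c1 -> c2 -> c0 refused
```

Checking each addition against the current CDG, rather than draining all traffic up front, is what lets the two routing functions coexist during reconfiguration.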




Citations (69)


... • We have described the additional re-evaluation mechanism. • We have evaluated the ECP and Re-evaluation mechanism on a lossless CLOS topology [8]. ...

Reference:

ECP: Improving the Accuracy of Congesting-Packets Identification in High-Performance Interconnection Networks
A New Mechanism to Identify Congesting Packets in High-Performance Interconnection Networks
  • Citing Conference Paper
  • August 2024

... Optimizing collective operations has been an active area of research [4,14,18,19,25,32,33,36,39]. Studies have shown that optimizing reduction collective operations, specifically MPI_Allreduce, the most exploited collective in DL applications [5], can significantly improve the performance of distributed DL frameworks [8][9][10][37], such as Horovod [35], TensorFlow [3] and CNTK [34]. Unfortunately, however, similar to other collective operations, reduction algorithms have been optimized mainly under the premise that all processes start the operation at the same time. ...

Accelerating distributed deep neural network training with pipelined MPI allreduce

Cluster Computing

... Several analytical performance models have been introduced to explore the complexity of deep neural networks [36], [37]. Paleo [38] separated training time into computation and communication times to capture the complexity of deep neural network architectures based on their input size, number of FLOPS, parallelization strategies, and bandwidth. ...

Performance Modeling for Distributed Training of Convolutional Neural Networks
  • Citing Conference Paper
  • March 2021

... In this paper, we extend our previous work in [4] with a complete evaluation of MPI_Allreduce for three popular instances of MPI, analyzing the impact of this primitive on the distributed training of CNNs, using a top-of-the-shelf cluster with nodes connected via an EDR Infiniband interconnection network. In addition, we complete this study by targeting a variety of scenarios including four CNN models and two datasets with distinct batch sizes. ...

Evaluation of MPI Allreduce for Distributed Training of Convolutional Neural Networks
  • Citing Conference Paper
  • March 2021

... Vehicle ad-hoc networks might experience deadlocks during routing. Nodes can't forward RREQ packets because they are stuck in a deadlock [13]. ...

UPR: deadlock-free dynamic network reconfiguration by exploiting channel dependency graph compatibility

The Journal of Supercomputing

... However, the PPFC protocol needs to define the priority according to the value in the IPV6 protocol flow label field, which has great limitations. Olmedilla et al. proposed DVL-Lossy [21,22] which is a traffic scheduling mechanism. However, DVL-Lossy algorithm does not achieve lossless transmission, and the congested flow (usually an elephant flow) is transferred to a dedicated queue (usually with a lower priority and forwarding rate) for transmission, which is more likely to cause packet dropping. ...

DVL-Lossy: Isolating Congesting Flows to Optimize Packet Dropping in Lossy Data-Center Networks
  • Citing Article
  • December 2020

IEEE Micro

... On the one hand, there are proposals for modeling DCN workloads which use information publicly released by some data center owners. For instance, recent proposals model the network traffic observed in some of Facebook's data centers [9]. They assume a workload is a set of traffic flows from different applications and services generated within a given time fraction. ...

Modeling Traffic Workloads in Data-center Network Simulation Tools
  • Citing Conference Paper
  • July 2019

... When the traffic bursts instantaneously, it is easy to cause congestion, resulting in path asymmetry. In addition, due to link failures and the heterogeneity of network equipment, path asymmetry generally exists in data centers [5,6,17-20]. The main difference between symmetric topology and asymmetric topology is whether the delay and bandwidth of multiple paths between any pair of communication hosts are consistent. ...

Optimizing Packet Dropping by Efficient Congesting-Flow Isolation in Lossy Data-Center Networks
  • Citing Conference Paper
  • August 2020

... Advocates of patient-centered health have long argued that individuals who are engaged and informed have better health outcomes. Patients have identified that high quality services be accessible, efficient, and effective (Coulshed & Mullender, 2006;Grimaldo et al., 2013), and the use of ICT during treatment has been found to result in higher quality care (Nalin, Verga, Sanna, & Saranummi, 2013). In this way, ICT could support increased patient engagement and empowerment in a personalized care approach. ...

Design of an ICT Tool for Decision Making in Social and Health Policies
  • Citing Chapter
  • January 2015

... PolarFly is one of a recent wave of mathematically designed low-diameter networks, including Slim Fly [5], Bundlefly [39] and others. The success here of a mathematical approach to Allreduce on PolarFly suggests that the mathematical structure of other networks may similarly be used to generate optimal Allreduce solutions. ...

Bundlefly: a low-diameter topology for multicore fiber
  • Citing Conference Paper
  • June 2020