Pedro Javier Garcia’s research while affiliated with University of Castilla-La Mancha and other places


Publications (88)


[Figure and table previews: Figure 1, node configuration in the CELLIA cluster; Figure 2, PCIe cluster maximum payload size (MPS); Figure 3, SAURON simulator host architecture; tables of bandwidth (GiB/s) and latency (µs) results when communicating two nodes in the real cluster.]
Understanding intra-node communication in HPC systems and Datacenters
  • Preprint
  • File available

February 2025 · 17 Reads

Joaquin Tarraga-Moreno · [...] · Pedro Javier Garcia · Francisco J. Quiles

Over the past decade, specialized computing and storage devices, such as GPUs, TPUs, and high-speed storage, have been increasingly integrated into server nodes within Supercomputers and Data Centers. The advent of high-bandwidth memory (HBM) has facilitated a more compact design for these components, enabling multiple units to be interconnected within a single server node through intra-node networks like PCIe, NVLink, or Ethernet. These networks allow for scaling up the number of dedicated computing and storage devices per node. Additionally, inter-node networks link these devices across thousands of server nodes in large-scale computing systems. However, as communication demands among accelerators grow, especially in workloads like generative AI, both intra- and inter-node networks risk becoming critical bottlenecks. Although modern intra-node network architectures attempt to mitigate this issue by boosting bandwidth, we demonstrate in this paper that such an approach can inadvertently degrade inter-node communication. This occurs when high-bandwidth intra-node traffic interferes with incoming traffic from external nodes, leading to congestion. To evaluate this phenomenon, we analyze the communication behavior of realistic traffic patterns commonly found in generative AI applications. Using OMNeT++, we developed a general simulation model that captures both intra- and inter-node network interactions. Through extensive simulations, our findings reveal that increasing intra-node bandwidth and the number of accelerators per node can actually hinder overall inter-node communication performance rather than improve it.
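The interference effect described in the abstract can be illustrated with a toy fluid model of a single shared ingress port (a rough sketch under assumed rates, not the paper's OMNeT++ model): intra-node and inter-node traffic share the same drain capacity, so raising the intra-node injection rate shrinks the share of the port left for traffic arriving from other nodes.

# Toy fluid model (not the paper's OMNeT++ simulator): one accelerator ingress
# port is shared by intra-node traffic (e.g., PCIe/NVLink peers) and inter-node
# traffic arriving through the NIC. All rates are hypothetical, in GB/s.

def simulate(intra_rate, inter_rate, drain_rate, steps=1000, dt=1e-3):
    """Return (final backlog in GB, fraction of drained data that was inter-node)."""
    backlog_intra = backlog_inter = 0.0
    drained_inter = drained_total = 1e-12
    for _ in range(steps):
        backlog_intra += intra_rate * dt
        backlog_inter += inter_rate * dt
        budget = drain_rate * dt
        total = backlog_intra + backlog_inter
        # The shared port drains both classes in proportion to their backlog.
        d_intra = min(budget * backlog_intra / total, backlog_intra)
        d_inter = min(budget * backlog_inter / total, backlog_inter)
        backlog_intra -= d_intra
        backlog_inter -= d_inter
        drained_inter += d_inter
        drained_total += d_intra + d_inter
    return backlog_intra + backlog_inter, drained_inter / drained_total

for intra in (50, 100, 200, 400):   # progressively higher intra-node bandwidth
    backlog, share = simulate(intra_rate=intra, inter_rate=25, drain_rate=100)
    print(f"intra={intra:3d} GB/s  backlog={backlog:7.2f} GB  inter-node share={share:.1%}")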


Leveraging InfiniBand Controller to Configure Deadlock-Free Routing Engines for Dragonflies

February 2025 · 11 Reads

[...] · Pedro Javier Garcia · [...]

The Dragonfly topology is currently one of the most popular network topologies in high-performance parallel systems. The interconnection networks of many of these systems are built from components based on the InfiniBand specification. However, due to some constraints in this specification, the available versions of the InfiniBand network controller (OpenSM) do not include routing engines based on some popular deadlock-free routing algorithms proposed theoretically for Dragonflies, such as the one proposed by Kim and Dally based on Virtual-Channel shifting. In this paper we propose a straightforward method to integrate this routing algorithm in OpenSM as a routing engine, explaining in detail the configuration required to support it. We also provide experiment results, obtained both from a real InfiniBand-based cluster and from simulation, to validate the new routing engine and to compare its performance and requirements against other routing engines currently available in OpenSM.
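As a rough illustration of the Kim-Dally scheme the routing engine implements, the sketch below (hypothetical names and packet layout, not the OpenSM code) shows the virtual-channel shift along a minimal local-global-local path: the local hop taken inside the destination group uses a higher VC than the one taken in the source group, which breaks channel-dependency cycles.

from dataclasses import dataclass

@dataclass
class Switch:
    group: int
    id: int
    global_peers: set      # groups reachable through this switch's global ports

@dataclass
class Packet:
    dst_group: int
    dst_switch: int

def next_hop(sw: Switch, pkt: Packet):
    """Return (channel type, virtual channel) for a minimally routed packet."""
    if sw.group == pkt.dst_group:
        if sw.id == pkt.dst_switch:
            return ("terminal", None)   # deliver to the attached end-node
        return ("local", 1)             # local hop in the destination group: VC shifted to 1
    if pkt.dst_group in sw.global_peers:
        return ("global", 0)            # cross to the destination group on VC0
    return ("local", 0)                 # local hop towards a global port: VC0

# Example: a packet headed to switch 7 of group 3, seen at switch 2 of group 0.
print(next_hop(Switch(group=0, id=2, global_peers={1, 3}), Packet(dst_group=3, dst_switch=7)))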


[Figure previews: a 3-stage RLFT built from 8-port switches interconnecting 128 end-nodes, where each group interconnects 16 end-nodes through 1st- and 2nd-stage switches and connects to other groups through 3rd-stage switches; an 8-node Fat-Tree using D-mod-K routing and different queuing schemes (Flow2SL, DBBM and vFtree) with 2 VCs per input-port buffer; the possible combinations of the proposed restrictions, ordered from less to more aggressive; the NED diagram of the switch model used in the OMNeT++-based simulator.]
Towards an Efficient Combination of Adaptive Routing and Queuing Schemes in Fat-Tree Topologies

February 2025 · 9 Reads

The interconnection network is a key element in High-Performance Computing (HPC) and Datacenter (DC) systems, whose performance depends on several design parameters, such as the topology, the switch architecture, and the routing algorithm. Among the most common topologies in HPC systems, the Fat-Tree offers several shortest-path routes between any pair of end-nodes, which allows multi-path routing schemes to balance traffic flows among the available links, thus reducing congestion probability. However, traffic balancing alone cannot solve some congestion situations that may still degrade network performance. Another approach to reducing congestion is queue-based flow separation, but our previous work shows that multi-path routing may spread congested flows across several queues, thus being counterproductive. In this paper, we propose a set of restrictions to improve the selection of alternative routes by multi-path routing algorithms in Fat-Tree networks, so that they can be positively combined with queuing schemes.
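A minimal sketch of the kind of restriction the abstract refers to (toy port-to-queue mapping and load values, not the actual schemes evaluated in the paper): the adaptive choice is limited to the upward ports whose packets would land in the queue already assigned to that destination, so a congested flow stays in a single queue instead of spreading across several.

# Hypothetical sketch: restrict the adaptive (multi-path) choice so that every
# packet of a given destination keeps using the same virtual channel (queue).
# The port-to-VL mapping and the load values are toy assumptions.

def restricted_adaptive_port(dst, up_ports, port_load, num_vls=2):
    wanted_vl = dst % num_vls                                  # queue assigned to this destination
    candidates = [p for p in up_ports if p % num_vls == wanted_vl]
    if not candidates:                                         # fall back to full adaptivity
        candidates = up_ports
    return min(candidates, key=lambda p: port_load[p])         # least-loaded within the restriction

up_ports = [4, 5, 6, 7]
port_load = {4: 12, 5: 3, 6: 9, 7: 0}
print(restricted_adaptive_port(dst=10, up_ports=up_ports, port_load=port_load))  # dst 10 -> VL 0 -> ports 4, 6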


Congestion Management in High-Performance Interconnection Networks Using Adaptive Routing Notifications

February 2025 · 15 Reads

The interconnection network is a crucial subsystem in High-Performance Computing clusters and Data Centers, guaranteeing high bandwidth and low latency to the applications' communication operations. Unfortunately, congestion situations may spoil network performance unless the network design applies specific countermeasures. Adaptive routing algorithms are a traditional approach to dealing with congestion since they provide traffic flows with alternative routes that bypass congested areas. However, adaptive routing decisions at switches are typically based on local information without a global network traffic perspective, leading to congestion spreading throughout the network beyond the original congested areas. In this paper, we propose a new congestion management strategy that leverages the adaptive routing notifications currently available in some interconnect technologies and efficiently isolates congesting flows in reserved spaces at switch buffers. The experimental results, based on simulations of realistic traffic scenarios, show that our proposal eliminates the impact of congestion.
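A sketch of the control flow suggested by the abstract (field names, the notification format, and the quarantine timeout are assumptions, not the actual mechanism's parameters): when an adaptive-routing notification arrives, the destinations it reports are marked as congesting, and packets towards them are parked in a reserved queue so they stop blocking regular traffic.

import time

class IngressPort:
    def __init__(self, quarantine_s=0.005):
        self.regular_q, self.reserved_q = [], []
        self.congesting = {}                     # destination -> time the mark expires
        self.quarantine_s = quarantine_s

    def on_arn(self, congested_destinations):
        expiry = time.monotonic() + self.quarantine_s
        for dst in congested_destinations:
            self.congesting[dst] = expiry        # (re)mark the destination as congesting

    def enqueue(self, packet):
        expiry = self.congesting.get(packet["dst"])
        if expiry is not None and time.monotonic() < expiry:
            self.reserved_q.append(packet)       # isolate the congesting flow
        else:
            self.congesting.pop(packet["dst"], None)
            self.regular_q.append(packet)        # normal traffic keeps its queue

port = IngressPort()
port.on_arn(congested_destinations=[42])
port.enqueue({"dst": 42, "payload": b"..."})
port.enqueue({"dst": 7, "payload": b"..."})
print(len(port.regular_q), len(port.reserved_q))   # -> 1 1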


[Figure preview: Figure 1, congestion point in DCQCN. Table: evaluated network configurations.]
ECP: Improving the Accuracy of Congesting-Packets Identification in High-Performance Interconnection Networks

January 2025 · 13 Reads · IEEE Micro

Interconnection networks are crucial in data centers and supercomputers, ensuring high communication bandwidth and low latency under demanding traffic patterns from data-intensive applications. These patterns can cause congestion, affecting system performance if not addressed efficiently. Current congestion control techniques, like DCQCN, struggle to precisely identify which packets cause congestion, leading to false positives. To address this, we propose the Enhanced Congestion Point (ECP) mechanism, which accurately identifies congesting packets. ECP monitors packets at the head of switch ingress queues, flagging them as congesting when queue occupancy exceeds a threshold and packet requests are rejected. Additionally, ECP introduces a re-evaluation mechanism to cancel the identification of congesting packets if they no longer contribute to congestion after rerouting. We evaluated ECP using a network simulator modeling various configurations and realistic traffic patterns. Results show that ECP significantly improves congestion detection accuracy with a low error margin, enhancing DCQCN performance.
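The two-step decision described above can be sketched as follows (thresholds and field names are illustrative assumptions, not the actual ECP parameters): a head-of-queue packet is flagged only when both conditions hold, and the flag is withdrawn if, after rerouting, its forwarding request is no longer rejected.

# Hypothetical sketch of the ECP decision: a packet at the head of an ingress
# queue is flagged as congesting only if queue occupancy exceeds a threshold AND
# its forwarding request was rejected; a later re-evaluation clears the flag.

OCCUPANCY_THRESHOLD = 0.8    # assumed fraction of the ingress queue in use

def evaluate_head_packet(queue_occupancy, request_rejected, packet):
    if queue_occupancy > OCCUPANCY_THRESHOLD and request_rejected:
        packet["congesting"] = True              # candidate for congestion marking downstream
    return packet

def reevaluate_after_reroute(request_rejected, packet):
    if packet.get("congesting") and not request_rejected:
        packet["congesting"] = False             # no longer contributing to congestion
    return packet

pkt = {"dst": 42}
pkt = evaluate_head_packet(queue_occupancy=0.9, request_rejected=True, packet=pkt)
pkt = reevaluate_after_reroute(request_rejected=False, packet=pkt)
print(pkt)    # -> {'dst': 42, 'congesting': False}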





Implementation and testing of a KNS topology in an InfiniBand cluster

June 2024 · 46 Reads · The Journal of Supercomputing

The InfiniBand (IB) interconnection technology is widely used in the networks of modern supercomputers and data centers. Among other advantages, IB-based network devices allow for building multiple network topologies, and the IB control software (subnet manager) supports several routing engines suitable for the most common topologies. However, the implementation of some novel topologies in IB-based networks may be difficult if suitable routing algorithms are not supported, or if the IB switch or NIC architectures are not directly applicable to that topology. This work describes the implementation of the network topology known as KNS in a real HPC cluster using an IB network. As far as we know, this is the first implementation of this topology in an IB-based system. In more detail, we have implemented the KNS routing algorithm in OpenSM, the subnet manager software distribution, and we have adapted the available IB-based switches to the particular structure of this topology. We have evaluated the correctness of our implementation through experiments in the real cluster, using well-known benchmarks. The obtained results, which match the expected performance for the KNS topology, show that this topology can be implemented in IB-based clusters as an alternative to other interconnection patterns.
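For context, route computation in a KNS (k-ary n-direct s-indirect) network follows a dimension-order pattern: every dimension in which the source and destination coordinates differ is corrected by crossing that dimension's small indirect subnetwork. The sketch below is only an illustration of that idea under assumed coordinates, not the routing engine added to OpenSM.

def kns_route(src, dst):
    """src/dst are coordinate tuples, one coordinate per dimension; returns the hop list."""
    hops, current = [], list(src)
    for dim, (s, d) in enumerate(zip(src, dst)):
        if s != d:
            hops.append(("enter indirect subnetwork of dimension", dim, tuple(current)))
            current[dim] = d                    # the indirect stage delivers us at coordinate d
            hops.append(("exit at", dim, tuple(current)))
    return hops

for hop in kns_route(src=(0, 2, 1), dst=(3, 2, 4)):
    print(hop)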



Citations (53)


... • We have described the additional re-evaluation mechanism. • We have evaluated the ECP and Re-evaluation mechanism on a lossless CLOS topology [8]. ...

Reference:

ECP: Improving the Accuracy of Congesting-Packets Identification in High-Performance Interconnection Networks
A New Mechanism to Identify Congesting Packets in High-Performance Interconnection Networks
  • Citing Conference Paper
  • August 2024

... an HPC cluster. Consequently, network utilization and congestion level emerge as critical metrics in HPC networks [1], attracting research attention to monitor and analyze these parameters [2,7]. In this section, we discuss the available approaches to monitor these metrics through programmable network devices since there are few studies in this field [4]. ...

Monitoring InfiniBand Networks to React Efficiently to Congestion
  • Citing Article
  • March 2023

IEEE Micro

... The next-generation BXI (from now on, BXIv3 [23]) is an Ethernet-based technology under development, which is expected to be used in future European HPC systems. Several projects have recently performed efforts to develop this technology, such as the RED-SEA project [24]. Follow-up projects such as Net4EXA and UltraEthernet aim to continue with the development of post-Exascale era interconnects. ...

RED-SEA: Network Solution for Exascale Architectures

... - Wire Routing: Definition of the path of signal wires to minimize delay, power consumption, and congestion [293,294]. - Congestion Management: Preventing over-congested regions in the design by optimizing component placement and routing [295,296]. - Thermal-aware Layout: Ensure that heat-generating components are placed strategically for heat dissipation [297,298]. - Signal Integrity Optimization: Preventing issues such as crosstalk and electromagnetic interference in wire routing [299,300]. ...

Congestion management in high-performance interconnection networks using adaptive routing notifications

The Journal of Supercomputing

... Lossless RDMA networks offer several flow-control mechanisms, such as Priority Flow Control (PFC) [89] for RoCE networks and credit-based flow control [90] for IB networks. In network routing, static routing algorithms in IB or ECMP (Equal-Cost Multi-Path) [91] and AR (Adaptive Routing) [92] effectively handle routing issues. However, congestion can still occur when multiple servers send data to a single receiver, potentially blocking the entire network. ...

Adaptive Routing in InfiniBand Hardware
  • Citing Conference Paper
  • May 2022

... On the one hand, there are proposals for modeling DCN workloads which use information publicly released by some data center owners. For instance, recent proposals model the network traffic observed in some of Facebook's data centers [9]. They assume a workload is a set of traffic flows from different applications and services generated within a given time fraction. ...

Modeling Traffic Workloads in Data-center Network Simulation Tools
  • Citing Conference Paper
  • July 2019

... When the traffic bursts instantaneously, it is easy to cause congestion, resulting in path asymmetry. In addition, due to link failures and the heterogeneity of network equipment, path asymmetry generally exists in data centers [5, 6, 17-20]. The main difference between symmetric topology and asymmetric topology is whether the delay and bandwidth of multiple paths between any pair of communication hosts are consistent. ...

Optimizing Packet Dropping by Efficient Congesting-Flow Isolation in Lossy Data-Center Networks
  • Citing Conference Paper
  • August 2020

... Another routing algorithm named wFatTree [78] is an adaptation of Ftree routing with advanced load-balancing. The authors showed that wFatTree distributes the congestion on the links more evenly than Ftree, but congestion is still present, while FORS is a congestion-free solution. ...

Towards an efficient combination of adaptive routing and queuing schemes in Fat-Tree topologies
  • Citing Article
  • August 2020

Journal of Parallel and Distributed Computing

... To overcome those limitations we proposed Path2SL [8], a SQS that allows using any number of VLs without producing destination overlapping. In practice, this number can be any divisor of the number of flow destinations to improve the mapping balance. ...

Path2SL: Optimizing Head-of-Line Blocking Reduction in InfiniBand-Based Fat-Tree Networks
  • Citing Conference Paper
  • August 2019

... Deterministic routing itself also suffers from poor load balance. Although some topology-specific fault-tolerant deterministic routing algorithms [10] exploit the tree properties, the performance is still constrained and they are fragile when confronted with failures. Adaptive routing is a more proper approach to provide fault-tolerance, as it exploits multipathing to attain higher performance [4]. ...

High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)