Citation: Luo, J.; Yu, F.; Li, W.; Xing, Q. A Novel Switch Architecture for Multi-Die Optimization with Efficient Connections. Electronics 2024, 13, 3205. https://doi.org/10.3390/electronics13163205
Academic Editor: Maciej Ławryńczuk
Received: 11 July 2024; Revised: 10 August 2024; Accepted: 12 August 2024; Published: 13 August 2024
Copyright: © 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
A Novel Switch Architecture for Multi-Die Optimization with
Efficient Connections
Jifeng Luo 1, Feng Yu 1, Weijun Li 2 and Qianjian Xing 1,*
1 College of Biomedical Engineering and Instrument Science, Zhejiang University, Hangzhou 310027, China; luojf9@zju.edu.cn (J.L.); osfengyu@zju.edu.cn (F.Y.)
2 School of Information Science and Engineering, NingboTech University, Ningbo 315000, China; zhuxinlwj@zju.edu.cn
* Correspondence: xingqianjian@zju.edu.cn
Abstract: Switches play a critical role as core components in data center networks. The advent of
multi-die chiplet packaging as a prevailing trend in complex chip development presents challenges
in designing the multi-die packaging of switch chips. With limited inter-die connections in mind,
we propose a scalable, unified switch architecture optimized for efficient connectivity. This archi-
tecture includes the strategic mapping of data queues, meticulous planning of data paths, and the
integration of a unified interface, all aiming to facilitate efficient switch operations within constrained
connectivity environments. Our optimization efforts encompass various areas, including refining
arbitration strategies, managing mixed unicast and multicast transmissions, and mitigating network
congestion to alleviate bottlenecks in data flow. These enhancements contribute to heightened levels
of performance and robustness in the switching process. During the validation phase, the structure
we propose reduced interconnection usage between dies by 25%, while supporting functions such as
unicast and multicast transmissions.
Keywords: switch architecture; multi-die packaging; efficient connectivity
1. Introduction
In the swiftly evolving landscape of networking technology, network switches require
substantial computing power and programmability, imposing stringent demands on the
design of switch chips. Although significant progress has been made in chip design and
manufacturing, Moore’s Law has exhibited signs of limitations in recent years. Addressing
this challenge, advanced packaging [1–3] has emerged as a pivotal strategy for increasing the number of cores more economically. Over the past few years, multi-die packaging has emerged as a leading technique and has been widely embraced by major companies driving innovations in chip performance [4–8]. This strategic approach involves breaking down large-scale system-on-chip (SoC) architectures into smaller dies and connecting them through advanced packaging [9–11].
To cater to diverse application needs, multi-die architectures incorporate a wide range
of topologies. These topologies are derived and enhanced from fundamental connection
layouts to optimize performance. Figure 1 presents a cross-sectional view of a multi-die chip, emphasizing basic connectivity; it showcases how each die is interconnected through a silicon interposer, underscoring the sophisticated design and engineering efforts that facilitate these connections [12–14]. This approach facilitates the interconnection of small dies to simulate the functionality of a single, larger chip. Such a multi-die architecture enables the development of switch chips with enhanced port capacities and improved performance, and its adoption has paved the way for designing cutting-edge, high-efficiency switches. Nevertheless, the design of switches employing multi-die packaging [15–17] faces notable challenges. In multi-die packaging, the inter-die connection density is significantly lower than the intra-die connection density. Since switch applications require all ports to
Electronics 2024,13, 3205. https://doi.org/10.3390/electronics13163205 https://www.mdpi.com/journal/electronics
be connected, the inter-die connections become the primary bottleneck as the number of ports increases. This limitation poses a critical challenge, necessitating innovative solutions to optimize data transfer and ensure the seamless operation of the switch.
Figure 1. Multi-die interconnection.
The crossbar switch architecture [18,19], renowned for its non-blocking nature, exceptional scalability, and straightforward implementation, serves as the foundational model in switch design. Figure 2 illustrates a typical crossbar-based switch architecture, which can be roughly divided into the following parts: input ports and associated buffers, crossbar, scheduling unit, and output ports and associated buffers [20,21]. In a network switch,
the input and output ports usually handle tasks such as protocol parsing and route lookup.
The scheduling unit is typically located at the output port, with the main control unit being
the arbiter. The arbiter handles signals related to data arbitration and directs the crossbar
to complete data scheduling.
Figure 2. Normal switch architecture.
The crossbar architecture requires the creation of comprehensive connections between input and output ports [22]. In a multi-die packaged chip, the full connectivity of the crossbar significantly impacts the relatively limited inter-die connections [23]. Specifically,
distributing ports across various dies complicates the task of establishing extensive inter-
connections among these ports, necessitating innovative approaches to maintain efficient
communication and data transfer across the system. Moreover, the distribution of diverse
ports across multiple dies introduces inherent challenges in both inter-die and intra-die com-
munication. Given the equality of each port, maintaining fairness in data transfer between
ports becomes paramount, and the doubled inter-die latency adds complexity to timing
convergence, particularly for logic circuits spanning multiple dies. The interconnections in
the multi-die architecture are thus susceptible to routing congestion [24,25].
In this study, we introduce a scalable, unified switch architecture engineered for
optimal connectivity. The core contributions of this study are as follows:
• A switch architecture that incorporates queue mapping to suit the intricacies of a complex multi-die architecture.
• A scalable single-chip switch structure featuring a unified interface.
• Optimizations in scheduling that enhance deadlock resolution in multicast traffic and enable fair arbitration in case of output blocking.
This manuscript is organized as follows: Section 2 describes the key methods for designing a switch architecture with multi-die packaging, and is divided into three subsections discussing the detailed architecture, improvements in multicast arbitration, and optimizations for congestion handling, respectively. Section 3 provides experiments related to resource usage and the consistency of the switch. Finally, we conclude this manuscript in Section 4.
2. Proposed Architecture
Recognizing the intricate challenges posed by advanced packaging structures and their
influence on implementing crossbar architectures, in this study, we have embarked on an
extensive overhaul of the crossbar switch framework. Furthermore, we have implemented
strategic improvements aimed at preventing multicast deadlocks and efficiently managing
network congestion.
2.1. Switch Architecture
The crossbar switch architecture relies on a scheduler to establish conflict-free matches between connecting ports and create transmission channels. Each input port must be connected to the scheduler, which determines which input port can transmit data to an output port. For centralized scheduling, each port needs to be connected to the scheduling center [26]. As the number of ports connected through the crossbar increases, the complexity will rapidly escalate; this complexity is heightened in a multi-chip architecture, where a centralized crossbar distribution introduces a significant number of inter-die logic lines between the dies.
Given the requirement to implement an N-port switch on a chip featuring a multi-die architecture with P dies, we specify k as the total number of data and control buses connecting a port to the scheduler. Assuming that all ports are evenly distributed across the dies, the interaction between die_{i−1} and die_i can be considered as the interaction between Ni/P ports and N(P − i)/P ports. Taking into account both sending and receiving scenarios, due to the need to establish a full connection, the number of inter-die interconnections between die_{i−1} and die_i with a centralized crossbar, L_cent, is as follows:

L_cent = 2 × (N/P)(P − i) × (N/P)i × k. (1)
Undoubtedly, this scenario depicts the theoretical worst-case for a centralized crossbar.
In a multi-die architecture, effective data exchange mandates the transmission of information from the local die to the others. Without data compression or aggregation, a data bus provides the theoretical minimum number of connections required for inter-die data exchange [27].
Figure 3 demonstrates that each die is outfitted with a data bus extending across the chip's entirety. This configuration allows every port on the die to communicate with other dies via distinct array buses. This design ensures that all of the data are transmitted between the dies exactly once, enhancing efficiency and reducing redundancy. The optimization strategy employed involves connecting the dies using a bus system, which then fans out upon reaching the targeted die. This approach not only streamlines data transmission across the multi-die system, but also minimizes latency and maximizes throughput by optimizing the path data travel through the interconnected landscape. Following the optimization, the connection count is

L_bus = 2 × (N/P) × P × k = 2Nk. (2)
Through a theoretical analysis, we have innovatively adjusted the centralized switch
architecture and proposed a distributed crossbar-based switch architecture to ensure effi-
cient connectivity that is compatible with the multi-die architecture.
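Equations (1) and (2) can be checked numerically. The short Python sketch below is our own illustration (the function names are ours, not part of the design's Verilog implementation); it evaluates both counts under the even-distribution assumption stated above.

```python
def l_cent(n, p, i, k):
    """Eq. (1): inter-die lines between die_{i-1} and die_i for a centralized crossbar."""
    per_die = n // p                          # ports per die (even distribution assumed)
    return 2 * (per_die * (p - i)) * (per_die * i) * k

def l_bus(n, p, k):
    """Eq. (2): inter-die lines with one fan-out bus per die; independent of the cut i."""
    per_die = n // p
    return 2 * per_die * p * k                # simplifies to 2*N*k
```

For the 18-port, 3-die configuration implemented later in the paper with k = 1, `l_bus(18, 3, 1)` yields 36 lines at any cut, while `l_cent(18, 3, 1, 1)` yields 144 at the cut between die_0 and die_1, illustrating why the centralized crossbar dominates inter-die wiring.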
Figure 3. Multi-die bus connection.
Figure 4 demonstrates the segmentation of the entire switch architecture, tailored to
the distribution of the dies. Within each bare die, an independent crossbar structure enables
efficient data transmission, ensuring localized processing and exchange. Conversely, inter-
die data transfer is managed through bus systems, facilitating communication between
separate dies. This dual-layered approach optimizes both intra- and inter-die data flow,
ensuring efficient data transmission across the multi-die system.
Transitioning from a centralized to a distributed switch system is not a matter of
simple division. It primarily involves non-uniform crossbar segmentation, buffer mapping,
and the design of cross-die data flow buses, among other methods. Figure 5 illustrates the unified logic architecture of a single die and the inter-die interface. The architecture divides the full N × N switch into P sets of N/P × N switches, with each die having a small section of the crossbar and a scheduler. Due to the fully connected nature of the crossbar, it would occupy a significant amount of routing resources; through this approach, dies communicate with each other through data buses and control buses, and the crossbar does not occupy the inter-die routing resources. Additionally, we employed a buffer mapping approach to ensure fairness in both intra-die and inter-die data transfer processes. Each input port is required to map its ingress buffers to different dies, serving as the cache for cross-die bus data. This strategic implementation guarantees that data transmission, whether intra-die or inter-die, traverses the crossbar only once, balancing data transmission latency. Data transfer between dies is exclusively conducted between the input ports and the corresponding mapped buffers, and this exchange is facilitated by the crossbar situated
at the destination port. Consequently, the usage of inter-die connections is significantly minimized, and the number of data buses between die_{i−1} and die_i is calculated as follows:

N_dist = 2 × [(N/P)(P − i) + (N/P)i] × k = 2Nk. (3)
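As a sanity check on Equation (3), the count is the same wherever the cut between dies falls, since the (P − i) and i terms always sum to P. A minimal sketch (our own helper, not the paper's code):

```python
def n_dist(n, p, i, k):
    """Eq. (3): data buses crossing the cut between die_{i-1} and die_i
    in the distributed scheme; equals 2*N*k for every cut position i."""
    per_die = n // p
    return 2 * (per_die * (p - i) + per_die * i) * k
```

For N = 18, P = 3, and k = 1, `n_dist` returns 36 for both i = 1 and i = 2, matching the 2Nk bound of the bus scheme.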
Figure 4. Multi-die switch architecture.
The data flow within the switch architecture is depicted in Figure 5. In this architecture, incoming data packets enter the system at the line rate through the input port and are then processed by the input scheduling module. The input scheduling module divides the input queue into P parts, placing one part on the local die and P − 1 parts on other dies. After passing through the input port scheduler, data are directed to queues on different dies based on their destination addresses. If the destination address points to a port on the northern die, the data are sent to the northern die, and the same applies to the southern die. When data need to be transmitted across multiple dies, for example from die_{i−1} to die_{i+1} with no direct connection between them, the data must be routed through die_i. Upon reaching die_i, the data undergo a demux with the destination address as the select signal. This demux determines whether the data are directly placed into a virtual queue or sent on to die_{i+1}. This approach ensures that input and output queues are implemented within the same die, maintaining a consistent distance between the input and output queues for each packet. Subsequently, the data in the queue issue a request to the arbiter, and the arbiter's scheduling directs the packet to the output queue through the crossbar. Finally, the output scheduling processes and transmits the packet to the output port. The key characteristics of the switch architecture based on multi-die design are outlined below:
• The structure is partitioned into P segments, implemented on P dies, with each die having N/P ports.
• Each die is equipped with N queues, allocating N/P for input queues dedicated to local die ports, and the remaining N − N/P assigned as virtual queues mapped to ports from other dies within the local die.
• Three types of interfaces are present on each die: SerDes for external port data transfer, and southern and northern interfaces for inter-die data transfer.
• Every input port comprises P input queues, denoted as IQ_{ij}, where i represents the input port number and j is the destination die number (1 ≤ i ≤ N, 1 ≤ j ≤ P).
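The per-die demux described above can be modelled in a few lines of Python. This is a sketch under our own assumptions: ports are numbered consecutively by die, and "north" is taken to mean the neighbour with the lower die index.

```python
def route_step(local_die, dest_port, ports_per_die):
    """Model of the demux on one die: enqueue the flit locally into the mapped
    virtual queue, or forward it over the north/south inter-die interface."""
    dest_die = dest_port // ports_per_die     # consecutive port numbering assumed
    if dest_die == local_die:
        return "enqueue"                      # destination lives on this die
    return "north" if dest_die < local_die else "south"
```

On die 1 of a 3-die, 18-port device (6 ports per die), a flit for port 0 is forwarded north, one for port 7 is enqueued locally, and one for port 13 continues south, so each flit crosses every intermediate die exactly once.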
Figure 5. Switch architecture with unified interface.
Iterative advancements in multi-die technologies may necessitate designs featuring
multiple dies, emphasizing the structure’s scalability. As shown in Figure 5, the proposed
architecture adopts a unified interface design. Each die can be considered an independent
module, with external interfaces consisting only of I/O ports and a southbound data bus
and a northbound data bus. The scalability of the architecture accommodates any number of
dies, achieved through the sequential connection of multiple single-die structures, thereby
forming a high-performance network-on-chip (NoC) architecture [28].
Ultimately, we introduced strategies to improve layout and routing efficiency, aiming for quicker and more efficient designs [29]. By adopting a uniform architecture across all dies, we not only standardized the design but also ensured consistent resource distribution. To address potential inter-die issues, we applied specific constraints within
each die, and to apply tighter constraints, we divided the switch into distinct sections:
ports, queues, crossbar, and schedulers. For ports closely integrated with a predefined IP,
we limited their placement to areas near the IP to reduce system interference and enhance
efficiency. Each die has its own crossbar and scheduler, making it optimal to restrict
their interactions to within ports. Queues, especially virtual ones crucial for inter-die
communication, were strategically placed to support efficient design and routing.
2.2. High-Performance Arbiter
In a high-performance switch architecture, the arbiter significantly influences the overall switch performance [30,31]. The application of arbiters extends beyond the central switch core, proving essential for structures such as virtual channels (VCs) [32]. A high-performance arbiter adeptly, precisely, and fairly schedules transmissions, mitigating port congestion and averting instances of starvation. Its crucial role extends to supporting Quality of Service (QoS) in the switch [33].
Within the scheduling system, there is a network of connections encompassing all input ports and arbiters. Input ports transmit request signals r_i to the arbiter, indicating their data transmission requirements. The arbiter, in turn, evaluates these requests and issues grant signals g_i, authorizing the transmission of data to the output ports. Ensuring port fairness in the switch, especially without priorities, is crucial. This approach somewhat guarantees each port's bandwidth and prevents port starvation scenarios.
The round-robin arbiter (RRA) [34,35], a scheduling algorithm designed for resource fairness, finds extensive application across a multitude of systems, with a notable emphasis on its use in switches. It forms the cornerstone for a variety of arbiters. In this study, the RRA is adopted as the primary scheduling mechanism. Given that g_j was assigned a value of 1 in the preceding arbitration cycle, the grant signal can be articulated as follows:

g_i = 1 if i = max{(j − a) | r_{j−a} = 1, 1 ≤ a ≤ j}, and g_i = 0 otherwise. (4)
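Equation (4) amounts to a programmed-priority scan. The Python model below is our illustration, not the paper's RTL; we add wrap-around so the scan is cyclic, which the bound 1 ≤ a ≤ j leaves implicit.

```python
def rr_grant(requests, last_grant):
    """Round-robin grant per Eq. (4): scan downward from the previously granted
    index j and grant the first asserted request; return None if all idle."""
    n = len(requests)
    for a in range(1, n + 1):
        i = (last_grant - a) % n              # the (j - a) term, made cyclic
        if requests[i]:
            return i
    return None
```

Because the starting point rotates with each grant, no requesting port can be passed over indefinitely, which is the fairness property the text relies on.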
Considering the arbiter's pivotal role on the switch's critical path, it must operate with minimal latency. To address this, we have implemented a decentralized, high-performance arbiter leveraging fair round-robin arbitration to achieve low-latency operation [36]. However, a single arbiter encounters difficulties in handling intricate traffic scenarios, among which multicast and mixed traffic stand out as significant challenges.
With the proliferation of multicast applications, a significant portion of network data
are attributed to multicast traffic, and to effectively handle this traffic, a switch fabric
must be well equipped. The crossbar architecture inherently supports multicast traffic by
manipulating the status of crosspoints, and it can open multiple crosspoints to facilitate
multicast packet replication. However, efficiently managing both unicast and multicast
traffic poses a considerable challenge [37,38].
Multicast traffic differs from unicast by targeting multiple destination addresses, ne-
cessitating the replication of data packets for transmission to various destinations. Treating
multicast data as if they were unicast would require the inefficient approach of time-division
multiplexing for sending unicast transmissions to each destination address. To efficiently
utilize the crossbar’s replication feature for multicast services, it is necessary to initiate
requests to all destination ports simultaneously. However, this approach can lead to
arbitration deadlock.
Figure 6 depicts a deadlock phenomenon in a 4 × 4 switch arbitration structure under mixed unicast and multicast traffic conditions. Each output port features an independent arbiter that performs fair round-robin arbitration. The arrows on the arbiter wheels in the figure indicate which input ports are allowed to send data to the output ports. In the first arbitration cycle, as illustrated, input port 2 issues a unicast request to output port 3, while
input port 3 issues multicast requests to both output port 0 and output port 2. Since there is
no contention among the ports at this stage, the arbiters complete the authorization process
smoothly and update the priorities accordingly. In the second arbitration cycle, input port
0 sends a unicast request to output port 2, input port 1 sends multicast requests to output
ports 0 and 3, and input port 2 sends multicast requests to output ports 0, 1, and 3. In this
round of arbitration, output ports 0 and 3 receive requests from both input ports 1 and
2, which is a typical port competition phenomenon. However, due to the changes in the
arbiters’ priorities from the previous arbitration cycle, output port 0 grants authorization
to input port 1, while output port 3 grants authorization to input port 2. Consequently,
neither input port 1 nor input port 2 receives authorization for all its requests. The red lines
in the figure indicate the unauthorized requests. In such a situation, without additional
measures, input ports 1 and 2 will remain in a waiting state, leading to a deadlock.
Figure 6. Multicast deadlock.
We have developed a two-stage multicast arbitration framework to reduce the risk of deadlock; the detailed algorithm can be found in Appendix A. At the egress of each port,
there is an independent fair round-robin arbiter to handle requests sent from the ingress.
Additionally, there is a shared fair round-robin arbiter for all ports to handle multicast
requests. This framework mandates that a multicast data packet at the queue’s forefront
must initially request multicast transmission authorization from a dedicated multicast
arbiter. Following this approval, the ports are then eligible to request access to multiple
destination ports in the second phase of arbitration. In the context of multi-die architectures,
each die incorporates its own independent switch system, within which a multicast arbiter
plays a crucial role in orchestrating the dispatch of multicast data. While this architecture
might marginally slow down multicast data transmission, it significantly curtails the
likelihood of deadlocks during multicast operations. Furthermore, the introduction of an
overlapping port detection mechanism in the preliminary arbitration stage ensures that
any potential performance impact is minimal.
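The two-stage scheme can be illustrated with a toy model. This is a sketch under our own simplifications: stage 2 uses fixed priority for brevity, whereas the design uses per-output fair round-robin arbiters, and all names are ours.

```python
def two_stage_arbitrate(multicast_reqs, unicast_reqs, mc_ptr, n_inputs):
    """multicast_reqs: {input: set_of_outputs}; unicast_reqs: {input: output}.
    Returns (multicast winner, {output: granted input})."""
    # Stage 1: one shared round-robin arbiter admits at most one multicast sender.
    winner = None
    for a in range(1, n_inputs + 1):
        cand = (mc_ptr - a) % n_inputs
        if cand in multicast_reqs:
            winner = cand
            break
    granted = {}
    if winner is not None:
        for out in multicast_reqs[winner]:
            granted[out] = winner             # all of the winner's crosspoints open at once
    # Stage 2: unicast requests compete for the remaining outputs (fixed priority
    # here for brevity; the real design uses per-output round-robin arbiters).
    for inp, out in sorted(unicast_reqs.items()):
        granted.setdefault(out, inp)
    return winner, granted
```

Replaying the second cycle of Figure 6 with this model, the shared stage-1 arbiter admits only one multicast sender per cycle, so its grants can never be split across outputs; the losing multicast input simply retries in the next cycle instead of deadlocking.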
2.3. Schedule Algorithm Optimization
Schedulers play a crucial role as the primary processing units within switches, handling complex traffic by managing network flow. We have optimized them to address issues like
egress congestion, a common challenge in modern networks. Congestion often arises from
varying processing capabilities across network endpoints, requiring switches to effectively
manage the congestion, especially during traffic spikes. In the combined input and output queue (CIOQ) [39] architecture, output queues buffer data for congested ports. These buffers mitigate congestion, but blocking the transmission channel is crucial to avoid data loss when buffers approach full capacity [40].
We use the almost full signal as the trigger to disable the output port arbiter, rather than backpressuring the flow directly. When the output buffer is nearing full capacity, the arbiter halts arbitration, preventing the scheduler from sending data to the congested port:

grant = arbiter_grant · ¬almost_full. (5)
Through this method, packets destined for a blocked port will be back-pressured at the
ingress. Since the egress arbiter is disabled, the priority order remains unchanged, ensuring
port fairness. Once the blockage is alleviated, authorization can proceed according to the
previous priority. Additionally, mechanisms such as timeouts at the ingress can prevent
prolonged head-of-line blocking.
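Equation (5) reduces to a one-line mask; the Boolean model below (signal names follow the equation) shows that the grant is simply suppressed, not rearbitrated, while the buffer is nearly full.

```python
def gated_grant(arbiter_grant, almost_full):
    """Eq. (5): mask the egress grant while the output buffer is almost full;
    the arbiter pointer is frozen, so round-robin fairness is preserved."""
    return arbiter_grant and not almost_full
```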
In multicast scenarios, the backpressure mechanism can lead to additional issues
when one of the output ports is blocked, and it is crucial for all requested ports to receive
authorization before data transmission. If certain ports are already in use, the authorized
output port may become blocked. These blocked data remain idle until the slowest port
becomes available and grants multicast authorization, introducing inefficiencies when
handling both unicast and multicast transmissions concurrently. To enhance forwarding
efficiency, especially for most multicast packets that do not require high synchronization,
we implement a timeout scheduling mechanism. The timer starts when the input port
sends the multicast request; if the timer exceeds a specified duration, the arbiter sends the
data packet to the authorized output port. Subsequently, the multicast packet is queued,
awaiting the next scheduling cycle. However, for packets requiring a complete multicast
transfer method, the timer is set to infinity. This scheduling approach minimizes the time
during which the output port is blocked.
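The timeout dispatch described above can be sketched as a behavioural model (parameter names are hypothetical); setting the timeout to infinity recovers the strict all-or-nothing multicast transfer.

```python
def multicast_dispatch(granted_outputs, requested_outputs, elapsed_cycles, timeout):
    """Return (outputs to send to now, outputs to requeue for the next cycle)."""
    if granted_outputs == requested_outputs:
        return granted_outputs, set()                 # full grant: replicate at once
    if elapsed_cycles >= timeout:
        # Timer expired: send to the already-authorized outputs, requeue the rest.
        return granted_outputs, requested_outputs - granted_outputs
    return set(), requested_outputs                   # keep waiting for all grants
```

This way, a single slow output port only delays the flits destined for itself rather than stalling the whole multicast group.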
With these optimizations, the structure effectively manages mixed unicast and multi-
cast transmissions while providing robust mitigation for blockages resulting from sudden
traffic surges.
3. Implementation and Experiment
To evaluate the performance of the proposed structure, we implemented the design
in Verilog HDL. All experiments were validated through implementation on the VU9P
from AMD/Xilinx’s Virtex UltraScale Plus series (San Jose, CA, USA), which is an FPGA
with multi-die packaging, comprising three dies interconnected through Super Long Line
(SLL) routing [41]. Each die within an FPGA is referred to as a Super Logic Region (SLR).
The entire process, including simulation, synthesis, and implementation, was carried out
using the AMD/Xilinx Vivado Design Suite.
Within the context of multi-die architecture, SLL connections are essential for inter-
die communication. To evaluate the proposed architecture’s efficiency in utilizing SLL
resources, we implemented and compared both centralized and distributed switch ar-
chitectures. For each architecture, we conducted separate tests to measure SLL resource
consumption across various ports, providing a comprehensive analysis of how each architecture manages SLL connectivity. Figure 7 illustrates the SLL resource consumption for both architectures across a spectrum of port quantities.
Figure 7. Super Long Line utilization (our design: 3025, 5968, 12,020, and 17,421 SLLs versus the centralized design's 4485, 8782, 15,973, and 23,354 SLLs as the port count grows).
An analysis of the data reveals that the distributed switch architecture markedly
lowers SLL usage. As the number of ports increases, the SLL usage can be kept relatively
low compared to centralized switch architectures, with a 25% reduction in SLL utilization.
The results for the distributed architecture are consistent with theoretical expectations, as detailed in Section 2. In contrast, the outcomes for the centralized architecture notably deviate from what Equation (1) anticipated. Examining the FPGA synthesis and
implementation results, it was discovered that Vivado executed targeted optimizations on
the centralized switch structure during the layout and routing phases. These enhancements,
which mirror the principles of a bus architecture, were strategically designed to minimize
SLL resource consumption. Even so, the distributed architecture can still maintain a
relatively low usage of inter-die connections, laying a solid foundation for the design of
larger-scale switch chips.
The power consumption of switches with different port counts is shown in Table 1.
It can be seen that the proposed architecture performs slightly better in terms of power
consumption compared to the centralized architecture. A power distribution analysis shows
that the power consumption of high-speed interfaces accounts for 58%, and the power
consumption of the clock reaches 20%. Since the port configurations of both architectures
are identical, the difference in power consumption is minimal.
Due to the constraints imposed by the SLL resources and the layout and routing capa-
bilities within the VU9P, we successfully developed a switch featuring 18 ports, with each
port capable of supporting a line rate of 100 Gbps. Table 2 offers an in-depth analysis of the resource allocation for various components within the switch design: ‘single port’
corresponds to the resources necessary for the protocol encapsulation and parsing required
by a 100 Gbps port; ‘switch core’ indicates the resources allocated for the crossbar-based
switch mechanism within each die, as part of a distributed switch architecture; ‘single die’
details the resources needed for a configuration comprising six ports and their associated
switch structure within a single die; and ‘total design’ provides a comprehensive overview
of the resource deployment throughout the entire distributed switch architecture.
Table 1. Power result of switch (W).
Design N = 3 N = 6 N = 12 N = 18
Centralized design 7.184 11.159 18.881 26.897
Proposed design 7.022 10.705 18.483 25.981
Table 2. Switch resource usage.
Design LUT Register DSP U/BRAM
single port 22,014 34,072 5 8/48
switch core 727 3720 684 0/0
single die 136,938 216,131 714 48/288
total design 421,247 661,943 2142 144/864
To thoroughly evaluate the distributed switch architecture’s performance impact and
ensure data transmission consistency between dies, we embarked on a comprehensive
series of experiments. Our initial phase aimed to determine the switch’s influence on
network performance. To achieve this, we set up two distinct test environments: one
where network cards were directly interconnected, and the other where connections were
facilitated through a switch. The findings, as depicted in Figure 8, reveal a noteworthy
consistency in throughput for both scenarios when transmitting different sizes of data
packets. These results reinforce the conclusion that the inclusion of a switch exerts a
minimal impact on the network’s bandwidth, thereby underscoring the distributed switch
architecture’s capability to sustain robust data transmission with high efficiency.
Figure 8. Bandwidth testing.
Because the switch adopts a distributed architecture, we also assessed its internal
consistency. We configured the switch's routing table to compare bandwidth across
transmissions between different dies. The symmetrical design of the VU9P chip allowed us
to limit testing to transmissions from SLR0 to SLR1 and SLR2. By computing the bandwidth
difference between data transmissions that cross dies and those that do not, we obtained
the variations depicted in Figure 9. The curve in the graph represents the data-forwarding
bandwidth when both the source and destination ports are within SLR0, while the
accompanying bar chart shows the bandwidth difference between transmissions that span
SLRs and those confined to a single SLR. Within the switch, this difference remains below
0.1%, which is consistent with normal network fluctuation and essentially negligible.
This result underscores the consistency of internal data transmission in a switch with a
distributed architecture.
Figure 9. Inter-die throughput testing: the curve plots SLR0-to-SLR0 throughput (MB/s)
versus packet size (bytes); the bars plot the inter-die throughput discrepancy (%) for
SLR0-to-SLR1 and SLR0-to-SLR2 transfers.
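The per-size discrepancy shown in the bar chart of Figure 9 is a relative bandwidth difference. A minimal sketch of that computation follows; the function name and the sample values are our own, hypothetical:

```python
def inter_die_discrepancy_pct(intra_die_mb_s: float, cross_die_mb_s: float) -> float:
    """Relative throughput difference (%) between an intra-die baseline
    (e.g., SLR0 to SLR0) and a cross-die transfer (e.g., SLR0 to SLR1)."""
    return abs(intra_die_mb_s - cross_die_mb_s) / intra_die_mb_s * 100.0

# A 10 MB/s drop against an ~11,000 MB/s baseline stays under the 0.1% bound.
d = inter_die_discrepancy_pct(11000.0, 10990.0)
```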
We further employed simulation tests and counting tests to assess the delays associated
with inter-die data transmission in the switch network. Under blocking-free conditions,
inter-die transmission delays were only slightly higher than those within a single die.
Given the implementation's 300 MHz clock frequency, such differences in delay are
insignificant, ensuring a high level of consistency in data latency within the switch.
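At 300 MHz, each cycle of extra inter-die delay corresponds to only a few nanoseconds. A small conversion sketch (the 3-cycle penalty below is a hypothetical example, not a measured result):

```python
CLOCK_HZ = 300e6  # implementation clock frequency stated in the paper

def cycles_to_ns(cycles: int, clock_hz: float = CLOCK_HZ) -> float:
    """Convert a cycle count from the counting tests into nanoseconds."""
    return cycles / clock_hz * 1e9

# e.g., a hypothetical 3-cycle inter-die penalty amounts to 10 ns
extra_ns = cycles_to_ns(3)
```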
Ultimately, we carried out tests on a mix of unicast and multicast data transmissions
to assess the efficacy of the proposed two-stage multicast strategy. Deadlocks induced
by congestion and those arising from multicast arbitration processes share superficial
similarities; however, they can be readily differentiated by analyzing simulation waveforms.
To facilitate this distinction, we utilized a simulation-based testing framework.
In our simulation, eight ports sent unicast and multicast packets with random destination
addresses. Without the two-stage multicast strategy, deadlock occurred in 100% of runs.
With the strategy enabled, data transmission latency increases at higher multicast
bandwidths, but the multicast deadlock is eliminated.
4. Conclusions
We propose a switch architecture that introduces several innovative strategies to ad-
dress the challenges of multi-die chip connectivity. By implementing strategic approaches
like buffer mapping, distributed crossbar utilization, and a unified interface, we have
successfully tackled several challenges associated with the design of intricate switch chips,
particularly those related to interconnection limitations. Notably, the introduction of a two-
stage arbitration structure has addressed multicast transmission deadlocks, improving the
overall efficiency of mixed unicast and multicast transmissions. The proposed methodology
sets up a scalable logical framework essential for developing high-performance switch
networks in multi-die designs. Our work establishes the basis for a flexible switch architec-
ture, suggesting that future research could explore the integration of advanced algorithms
to further enhance the performance and efficiency of the proposed switch architecture.
Another potential direction is to examine the scalability of the architecture with emerging
technologies and larger multi-die systems. These efforts could significantly contribute
to the evolution of high-performance switch networks and their applications in diverse
domains.
Author Contributions: Conceptualization, J.L., W.L. and Q.X.; methodology, J.L., F.Y., W.L. and Q.X.;
software, J.L. and W.L.; validation, J.L., F.Y. and Q.X.; formal analysis, J.L., F.Y. and Q.X.; investigation,
J.L. and F.Y.; data curation, J.L.; writing—original draft preparation, J.L. and F.Y.; writing—review
and editing, J.L., F.Y., W.L. and Q.X.; visualization, J.L.; supervision, Q.X.; project administration, Q.X.
All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.
Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.
Data Availability Statement: The data presented in this study are available on request from the
corresponding author.
Conflicts of Interest: The authors declare no conflicts of interest.
Abbreviations
The following abbreviations are used in this manuscript:
SoC System-on-Chip
FPGA Field Programmable Gate Array
RRA Round-robin arbiter
QoS Quality of Service
SLR Super Logic Region
Appendix A
Algorithm A1 Two-Stage Multicast Processing Function
1: procedure HandleMulticastPacket(packet)
2:     if IsMulticast(packet) then
3:         MulticastReq ← RequestVirtualMulticastPort
4:         MulticastGrant ← WaitGrantSignal(MulticastReq)
5:         while not MulticastGrant do
6:             wait some time before the next request
7:             MulticastGrant ← WaitGrantSignal(MulticastReq)
8:         end while
9:         Ports ← GetMulticastDestPorts(packet)
10:        for each Port in Ports do
11:            SendUnicastRequest(Port, packet)
12:        end for
13:    else
14:        SendUnicastRequest(GetDestPort(packet), packet)
15:    end if
16: end procedure
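The two-stage flow of Algorithm A1 can be sketched in software as follows. This is a behavioral illustration only: the arbiter interface (request_virtual_multicast_port, wait_grant, send_unicast) and the MockArbiter are our assumptions, not the paper's hardware design.

```python
def handle_multicast_packet(packet, arbiter, max_retries=100):
    """Behavioral sketch of Algorithm A1: stage 1 acquires the virtual
    multicast port; stage 2 fans the packet out as unicast requests."""
    if not packet.get("multicast"):
        arbiter.send_unicast(packet["dest_port"], packet)
        return True
    # Stage 1: request the single virtual multicast port
    req = arbiter.request_virtual_multicast_port()
    for _ in range(max_retries):
        if arbiter.wait_grant(req):
            # Stage 2: fan out to each destination as a plain unicast request
            for port in packet["dest_ports"]:
                arbiter.send_unicast(port, packet)
            return True
        # in hardware, the requester would back off before retrying
    return False

class MockArbiter:
    """Toy arbiter that always grants; for demonstration only."""
    def __init__(self):
        self.sent = []
    def request_virtual_multicast_port(self):
        return object()
    def wait_grant(self, req):
        return True
    def send_unicast(self, port, packet):
        self.sent.append(port)

arb = MockArbiter()
ok = handle_multicast_packet({"multicast": True, "dest_ports": [1, 2, 5]}, arb)
```

Because the grant is acquired before any fan-out begins, the multicast packet can never hold some destination ports while waiting on others, which is the condition that produced deadlock in the tests above.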
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.