Express virtual channels: towards the ideal interconnection fabric.
Express Virtual Channels: Towards the Ideal
Interconnection Fabric
Amit Kumar†, Li-Shiuan Peh†, Partha Kundu‡ and Niraj K. Jha†
†Dept. of Electrical Engineering, Princeton University, Princeton, NJ 08544
‡Microprocessor Technology Labs, Intel Corp., Santa Clara, CA 95052
†{amitk, peh, jha}@princeton.edu,
‡partha.kundu@intel.com
ABSTRACT
Due to wire delay scalability and bandwidth limitations in-
herent in shared buses and dedicated links, packet-switched
on-chip interconnection networks are fast emerging as the
pervasive communication fabric to connect different processing
elements in many-core chips. However, current state-of-the-art
packet-switched networks rely on complex routers, which increase
communication overhead and energy consumption compared to the
ideal interconnection fabric.
In this paper, we try to close the gap between the state-
of-the-art packet-switched network and the ideal intercon-
nect by proposing express virtual channels (EVCs), a novel
flow control mechanism which allows packets to virtually by-
pass intermediate routers along their path in a completely
non-speculative fashion, thereby lowering the energy/delay
towards that of a dedicated wire while simultaneously ap-
proaching ideal throughput with a practical design suitable
for on-chip networks.
Our evaluation results using a detailed cycle-accurate simulator
on a range of synthetic traffic and SPLASH-2 benchmark traces
show up to 84% reduction in packet latency and up to 23%
improvement in throughput while reducing the average router
energy consumption by up to 38% over an existing state-of-the-art
packet-switched design. When compared to the ideal interconnect,
EVCs add just two cycles to the no-load latency, and are within
14% of the ideal throughput.
Moreover, we show that the proposed design incurs a mini-
mal hardware overhead while exhibiting excellent scalability
with increasing network sizes.
Categories and Subject Descriptors: C.2.1 [Computer
Systems Organization]: Network Architecture and De-
sign - Packet-switching
General Terms: Design, Management, Performance
Keywords: Flow control, Packet-switching, Router design
1. INTRODUCTION
Driven by technology limitations to wire scaling and in-
creasing bandwidth demands [1,2], packet-switched on-chip
networks are fast replacing shared buses and dedicated wires
as the de facto interconnection fabric in general-purpose
chip multi-processors (CMPs) [3,4] and application-specific
systems-on-a-chip (SoCs) [5–8]. While there has been sig-
nificant work on interconnection networks for multiproces-
sors, design of on-chip networks, which face unique design
constraints, is a relatively new research area. Ultra-low latency
and scalable, high-bandwidth communication is critical in on-chip
communication fabrics in order to support a wide range of
applications with diverse traffic characteristics. Moreover, these
fabrics need to adhere to tight area and power budgets with
tractable hardware complexity.

Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
ISCA’07, June 9–13, 2007, San Diego, California, USA.
Copyright 2007 ACM 978-1-59593-706-3/07/0006 ...$5.00.

Modern state-of-the-art on-chip network designs use a mod-
ular packet-switched fabric in which network channels are
shared over multiple packet flows. Even though this enables
high bandwidth, it comes with a significant delay, energy and
area overhead due to the need for complex routers. Packets need
to compete for resources on a hop-by-hop basis while going
through a complex router pipeline before traversing the output
link at each intermediate node along their path. Thus, the packet
energy/delay in such networks is dominated largely by contention at intermediate
routers, resulting in a high router-to-link energy/delay ra-
tio. In other words, the gap between current state-of-the-art
networks and the ideal interconnect, in which all nodes are
connected by pair-wise dedicated wires, is quite large.
In this work, we propose express virtual channels (EVCs),
a novel flow control and router microarchitecture design
which tries to close the performance and energy gaps be-
tween the state-of-the-art packetized on-chip network and
the ideal interconnection fabric. EVC-based flow control
approaches the delay and energy of a dedicated link by allowing
packets to virtually bypass intermediate routers along
pre-defined virtual express paths between pairs of nodes.
Thus, EVCs allow packets to skip the entire router pipeline
at intermediate nodes and approach the energy/delay of a
dedicated wire interconnect. Intuitively, this is achieved by
statically designating a set of EVCs at each router that al-
ways connect nodes A and B that are k hops away, and
prioritizing EVCs over normal virtual channels (NVCs) [9]
at the intermediate nodes. For instance, Fig. 1(a) shows a
7×7 2D mesh network with three-hop EVCs (k = 3), with
the dotted lines depicting EVCs, where EVCs are not addi-
tional physical channels, but virtual channels (VCs) [9] that
share existing physical links. Traveling from node 00 to node
03 can then be done through an EVC which virtually skips
the router pipelines at nodes 01 and 02. Fig. 1(b) shows
an example of a typical packet route using EVCs where a
packet traveling from node 01 to node 56 skips the router
pipeline at nodes 04, 05, 16 and 26. Moving further, this
work also proposes dynamic EVCs which use EVCs of vary-
ing lengths. With dynamic EVCs, packets at any node are
allowed to choose among a range of EVCs and hop onto the
one which is most suitable along their route towards their
destination, thereby maximizing the use of EVCs.
In addition to lowering latency towards that of a dedicated
link, EVCs also reduce the amount of buffering and average
router switching activity, which makes them energy- and
area-efficient. Moreover, as EVCs skip arbitration at
intermediate nodes, they reduce contention and improve
throughput. Hence, EVCs improve latency, throughput and energy
together, unlike prior work which tends to trade off one for the
other (see Section 6 for a detailed
discussion). Circuit switching [10] allows communications
to approach the latency of dedicated wires (if the costly
setup delay can be amortized) but, by dedicating physical
bandwidth to a message flow, suffers in throughput. Similarly,
express cubes [11], in using physical express channels, trade
off throughput when the express channels are not highly
utilized. While recent work has proposed speculation [12, 13]
to cut down the router critical path delay by
parallelizing multiple pipeline stages, such techniques show
diminishing returns with increasing network traffic when
the speculation failure rate becomes high. EVCs, by virtu-
ally bypassing nodes in a non-speculative fashion, overcome
these problems and are able to simultaneously approach the
energy/delay/throughput of the ideal interconnection fabric.
In this paper, we first motivate the need for EVCs by
highlighting the existing gap between current state-of-the-
art packetized networks and the ideal network, both in terms
of performance and energy consumption (Section 2). We
then explain the details of EVCs and their dynamic variant in
Section 3, before presenting the detailed microarchitecture
and circuit schematics of the EVC router in Section 4. We
evaluate EVCs using a cycle-accurate network simulator
considering both synthetic traffic and traffic traces gathered
from the execution of the SPLASH-2 benchmark suite [14].
Our results in Section 5 show up to 84% reduction in packet
latency and up to 23% improvement in throughput while
reducing the average router energy consumption by up to 38%
as compared to an existing state-of-the-art packet-switched
design. Section 6 contrasts EVCs with prior related work
while Section 7 concludes the paper.
2. MOTIVATION
In this section, we present a motivating case study that
highlights the latency-throughput performance and energy
gap between the ideal interconnection fabric and an exist-
ing baseline design that incorporates several state-of-the-art
router microarchitectural features that were recently pro-
posed to tackle network latency, throughput and energy.
2.1 Baseline state-of-the-art router
Fig. 2(a) shows the microarchitecture of our baseline
state-of-the-art VC router. For simplicity, we assume a
two-dimensional mesh topology throughout this paper, though
the router microarchitectures presented readily extend to
other topologies. Thus, the router has five input and out-
put ports corresponding to the four neighboring directions
and the local processing element (PE) port. The major com-
ponents, which constitute the router, are the input buffers,
route computation logic, VC allocator, switch allocator and
crossbar switch.
Fig. 3(a) shows the base router pipeline on which we
will progressively add state-of-the-art router microarchitectural
features. Since on-chip designs need to adhere to tight area
budgets and low router footprints, we assume flit-level buffering
(see footnote 1) and credit-based VC flow control [9] at
every router, as opposed to packet-level buffering. A head
flit, on arriving at an input port, first gets decoded and
buffered according to its input VC in the buffer write (BW)
pipeline stage. In the next stage, the routing logic performs
route computation (RC) to figure out the output port for the
packet. The header then arbitrates for a VC corresponding
to its output port in the VC allocation (VA) stage. Upon
successful allocation of a VC, it proceeds to the switch al-
location (SA) stage where it arbitrates for the switch input
and output ports. On winning the output port, the flit then
proceeds to the switch traversal (ST) stage, where it tra-
verses the crossbar. This is followed by link traversal (LT)
to travel to the next node. Body and tail flits follow a sim-
ilar pipeline except that they do not go through RC and
VA stages, instead inheriting the VC allocated by the head
flit. The tail flit, on leaving the router, deallocates the VC
reserved by the header.
To remove the serialization delay due to routing, prior
work has proposed lookahead routing (LA) [15] where the
route of a packet is determined one hop in advance, thereby
enabling flits to compete for VCs immediately after the BW
stage. Fig. 3(b) shows the router pipeline with lookahead routing.

Footnote 1: A flit is part of a packet, and the smallest unit of flow
control. A packet consists of a head flit, followed by body flits,
and ends with a tail flit.

Pipeline bypassing (BY) is another technique that
is commonly used to further shorten the router critical path
by allowing a flit to speculatively enter the ST stage if there
are no flits ahead of it in the input buffer queue. Fig. 3(c)
shows the pipeline where a flit goes through a single stage
of switch setup, during which the crossbar is set up for flit
traversal in the next cycle while simultaneously allocating a
free VC corresponding to the desired output port, followed
by ST and LT. The allocation is aborted upon a port conflict.
When the router ports are busy, thereby disallowing pipeline
bypassing, aggressive speculation (SP) can be used to cut
down the critical path [12,13,16]. Fig. 3(d) shows the router
pipeline where VA and SA take place in parallel in the same
cycle. If the speculation succeeds, the flit directly enters
the ST pipestage. However, when speculation fails, the flit
needs to go through these pipestages again depending on
where the speculation failed.
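As a rough illustration of the four pipeline variants described above, the following Python sketch lists the stages a head flit passes through in each case and counts one cycle per entry. The names `HEAD_PIPELINES`, `head_cycles` and `body_stages` are hypothetical, and the one-cycle-per-entry rule is an assumption made for illustration; stages that the text describes as running in parallel share one entry.

```python
# Head-flit stage sequences for the pipeline variants of Fig. 3, as described
# in the text. Assumption: each list entry takes one cycle, and stages grouped
# in one tuple run in parallel in the same cycle.
HEAD_PIPELINES = {
    "BASE": [("BW",), ("RC",), ("VA",), ("SA",), ("ST",), ("LT",)],
    # Lookahead routing: the route is known one hop in advance, so no RC stage.
    "BASE+LA": [("BW",), ("VA",), ("SA",), ("ST",), ("LT",)],
    # Bypassing: only possible when no flits are queued ahead in the input buffer.
    "BASE+LA+BY": [("Setup",), ("ST",), ("LT",)],
    # Speculation: VA and SA in the same cycle (case where speculation succeeds).
    "BASE+LA+BY+SP": [("BW",), ("VA", "SA"), ("ST",), ("LT",)],
}


def head_cycles(variant):
    """Cycles per hop for a head flit, under the one-cycle-per-entry assumption."""
    return len(HEAD_PIPELINES[variant])


def body_stages(head_stages):
    """Stages performed by body/tail flits: they skip RC and VA and inherit the VC."""
    kept = []
    for group in head_stages:
        remaining = tuple(s for s in group if s not in ("RC", "VA"))
        if remaining:
            kept.append(remaining)
    return kept


if __name__ == "__main__":
    for name in HEAD_PIPELINES:
        print(name, head_cycles(name), body_stages(HEAD_PIPELINES[name]))
```

The sketch only models the stage order. It does not model the fallback path taken when a bypass is aborted or a speculation fails.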
The baseline router used in this study incorporates all the
above-mentioned state-of-the-art techniques.
Router microarchitectural components: We next de-
tail the microarchitectures we assumed for the different com-
ponents within the baseline router. In order to make the
design area- and energy-efficient, we assume single-ported
buffers and a single shared port into the crossbar from each
input. Separable VC and switch allocators modeled closely
after the designs in [13] are assumed as they are fast and of
low complexity, while still providing a reasonable through-
put, making them suitable for the high clock frequencies and
tight area budgets of on-chip networks. We also incorporate
router microarchitectural optimizations for energy into our
baseline design. First, write-through input buffers [17] are
used, which save buffer read energy whenever a flit is able
to directly bypass to the ST stage. Second, we adopt the
cut-through crossbar design [17], which sacrifices the full con-
nectivity provided by a matrix crossbar to reduce the area
and energy overhead. In this work, we assume dimension-
ordered routing for which a cut-through design shows no
performance degradation due to reduced connectivity.
2.2 Ideal interconnection fabric
We next discuss the characteristics of the ideal intercon-
nection fabric.
Ideal latency: A network with an ideal latency is one in
which data travel on dedicated pipelined wires directly
connecting their source and destination. The packet latency,
$T_{ideal}$, in such a network is governed only by the average
wire length D (the Manhattan distance) between the source
and destination, packet size L, channel bandwidth b, and
propagation velocity v:
$$T_{ideal} = \frac{D}{v} + \frac{L}{b} \qquad (1)$$
The first term corresponds to the time of flight which is
the time spent traversing the interconnect, and the second to
the serialization latency which is the time taken by a packet
of length L to cross a channel with bandwidth b.
In a packet-switched network, the sharing and multiplex-
ing of links between multiple source-destination flows result
in increased packet transmission latency T, which is defined
as the time elapsed between the first flit of the packet being
injected at the source node to the last flit being ejected at
the destination:
$$T = \frac{D}{v} + \frac{L}{b} + H \cdot T_{router} + T_c \qquad (2)$$
where H is the average hop-count, $T_{router}$ the delay through
a single router, and $T_c$ the delay due to contention [10]. The
third term in Equation (2) corresponds to the time spent in
the router coordinating the multiplexing of packets, while
the fourth term is the contention delay spent waiting for
resources. While a packet-switched network adds router pipeline
and contention delay, the ideal network assumes that every tile
is directly connected to every other tile. This requires an
enormous amount of global interconnect, which has a detrimental
effect on overall chip dimension, and thus on D. In such a
scenario, the wiring along the network bisection plane, which has
the maximum number of wire tracks, forms the limiting factor. The
chip edge length required to accommodate the total bisection
wiring can be calculated as
$$L_{edge} = 2 \cdot \frac{N}{2} \cdot \frac{N}{2} \cdot W_{pitch} \cdot c_{width} \qquad (3)$$
where $L_{edge}$ is the chip edge length, N the number of tiles,
$W_{pitch}$ the wire pitch and $c_{width}$ the channel width.

[Figure 1: EVC network (solid lines are NVCs, dotted ones are EVCs). (a) An EVC network. (b) VCs acquired from nodes 01 to 56. Nodes of the 7×7 mesh are labeled 00 to 66.]

[Figure 2: Baseline router microarchitecture, design and layout. (a) Baseline router microarchitecture. (b) Five-port router layout (input buffers are housed within the respective input ports, while VC, SA and control logic are grouped within Control).]

We use the data from [18] to calculate the maximum
wire length which can be driven in a cycle assuming delay-
optimal insertion of repeaters and flip-flops. Assuming uni-
form placement of tiles, the average wire length D can then
be derived as:
$$D = H \cdot \frac{L_{edge}}{\sqrt{N}} \qquad (4)$$
It should be pointed out that such an ideal interconnection
fabric, where each node has a dedicated interconnect to every
other node, is not feasible in practice: a 7×7 network would
require a chip size of 4760 mm^2 ($L_{edge}$ = 69 mm), far beyond
the ITRS projection [1] of a chip size of 310 mm^2 for
high-performance chips.
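The chip-size argument can be checked numerically. The Python sketch below reads the edge-length expression as 2 · (N/2) · (N/2) · W_pitch · c_width and divides the edge by sqrt(N) to obtain a per-hop tile edge for the average wire length. Both readings, the function names, and all parameter values are assumptions: the wire pitch of 0.449 um is a hypothetical value back-computed so that the result lands near the quoted 69 mm, the 128-bit channel width is taken from the labels in Fig. 2(b), and the average hop count of 4 is purely illustrative.

```python
import math


def chip_edge_mm(n_tiles, wire_pitch_um, channel_width_bits):
    """Chip edge needed for the bisection wiring of a fully connected fabric."""
    # 2 directions * (N/2 tiles on one side) * (N/2 tiles on the other side),
    # each pair needing channel_width_bits wire tracks.
    tracks = 2 * (n_tiles / 2) * (n_tiles / 2) * channel_width_bits
    return tracks * wire_pitch_um * 1e-3  # micrometres to millimetres


def avg_wire_length_mm(avg_hops, edge_mm, n_tiles):
    """Average wire length: hop count times the edge of one tile."""
    return avg_hops * edge_mm / math.sqrt(n_tiles)


if __name__ == "__main__":
    edge = chip_edge_mm(n_tiles=49, wire_pitch_um=0.449, channel_width_bits=128)
    print(f"edge = {edge:.1f} mm, area = {edge ** 2:.0f} mm^2")
    print(f"D = {avg_wire_length_mm(4.0, edge, 49):.1f} mm for a hypothetical H = 4")
```

With these assumed inputs the edge comes out at roughly 69 mm, more than an order of magnitude larger in area than the 310 mm^2 projection cited in the text.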
Ideal energy: The energy consumption $E_{ideal}$ of a packet
in an ideal network is given by
$$E_{ideal} = \frac{L}{b} \cdot D \cdot P_{wire} \qquad (5)$$
where D is the Manhattan distance between source and destination
and $P_{wire}$ the interconnect transmission power per unit length.
Again, multiplexing of packets in networks leads to additional
energy consumption, with the energy E required to transmit a
packet given by
$$E = \frac{L}{b} \cdot (D \cdot P_{wire} + H \cdot P_{router}) \qquad (6)$$
where $P_{router}$ is the average router power. $P_{router}$ is
composed of buffer read/write power, power spent in arbitrating
for VCs and switch ports, and the crossbar traversal power [19].
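The latency and energy expressions of this subsection, for the ideal fabric and for a packet-switched network, can be written down directly. The following Python sketch does so; the function and parameter names are hypothetical, and the numbers in the usage example are made-up values in arbitrary but consistent units, chosen only to show how the router terms add to the ideal figures.

```python
def ideal_latency(D, v, L, b):
    """Time of flight plus serialization latency on a dedicated wire."""
    return D / v + L / b


def network_latency(D, v, L, b, H, t_router, t_c):
    """Ideal latency plus per-hop router delay and contention delay."""
    return ideal_latency(D, v, L, b) + H * t_router + t_c


def ideal_energy(L, b, D, p_wire):
    """Energy spent driving the wire for the duration of the packet."""
    return (L / b) * D * p_wire


def network_energy(L, b, D, p_wire, H, p_router):
    """Wire energy plus the energy of H routers along the path."""
    return (L / b) * (D * p_wire + H * p_router)


if __name__ == "__main__":
    # Hypothetical inputs: distance 8, velocity 2, 512-bit packet, 128-bit channel,
    # 4 hops, 3 cycles per router, 2 cycles of contention.
    t_ideal = ideal_latency(D=8, v=2, L=512, b=128)
    t_net = network_latency(D=8, v=2, L=512, b=128, H=4, t_router=3, t_c=2)
    print(t_ideal, t_net)  # 8.0 22.0
    print(ideal_energy(512, 128, 8, 1.0), network_energy(512, 128, 8, 1.0, 4, 2.5))  # 32.0 72.0
```

In this form the gap to the ideal fabric is exactly the sum of the H-dependent router terms and the contention term, which are the quantities the rest of the paper sets out to reduce.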
Ideal throughput: Network throughput, which is defined
as the data rate in bits per second that the network can ac-
cept per input port before saturation, is largely determined
by the topology, flow control and routing mechanism. Given
a particular topology, the ideal-throughput network is one
which employs perfect flow control and routing to balance
the network traffic over alternative paths while leaving no
idle cycles on the bottleneck channels.
To study the ideal throughput, we simulated a network
with unconstrained buffer capacity, unlimited VCs (see footnote 2), and
perfect switch allocation. Since this paper targets microar-
chitectural optimizations, we keep the topology and routing
algorithm consistent between the baseline, ideal and pro-
posed designs.
2.3 Existing gap
Here, we compare the state-of-the-art baseline design and
ideal network in terms of latency, energy and throughput,
noting the significant gap that still exists and motivating
the need for further router microarchitectural innovations
to bridge this gap.
Simulation methodology: Table 1 in Section 5.1 shows
the network parameters assumed for this study. Microarchi-
tectural parameters, such as the number of VCs and buffers,
were experimentally obtained by ensuring that they led to
the best performance for the specific traffic pattern. Details
of the simulation infrastructure are discussed in Section 5.
Design and pipeline sizing methodology: Fig. 2(b)
shows the router layout we assumed throughout this paper.
The buffers are laid out on the two sides of the router with
the crossbar/switch in the center. In this layout, the height
of the router is determined by the buffer height while the
width of the router is determined by the wire pitch.
For each microarchitectural component, we design the cir-
cuit schematics to estimate the pipeline delay of each logi-
cal pipeline stage and size the router pipeline appropriately.
The clock rate (3GHz frequency) of the router, correspond-
ing to a cycle time of 18 fanout-of-four (FO4) gate delays
(excluding clock set up), was chosen based on being able to
switch through the crossbar in a single cycle. The BW stage
is used to write the flit into the buffers as well as set up all
control signals in preparation for the next pipe stage (VA).
It is also in the BW stage that we pre-compute the route
(RC). The VA stage presents the longest critical path in our
design and our modeling ascertains that the entire VA stage
can be accommodated within a single 18FO4 cycle. The
physical pipeline thus corresponds to that shown in Fig. 3.
Footnote 2: In actual simulations, this is mimicked by having very
large numbers of buffers and VCs and verifying that a further
increase in both resources does not lead to better performance.

[Figure 3: Router pipeline (BW: Buffer Write, RC: Route Computation, VA: Virtual Channel Allocation, SA: Switch Allocation, ST: Switch Traversal, LT: Link Traversal). Panels: (a) Base router pipeline (BASE). (b) Lookahead routing (BASE+LA). (c) Bypass pipeline (BASE+LA+BY). Each panel shows the stage sequence for head flits and for body/tail flits.]

Results: Fig. 4(a) compares the latency of the ideal network
with that of the baseline network as a function of increasing
network traffic. The latency of the ideal network, which uses
dedicated links between any pair of tiles, remains constant
irrespective of the network load. For the
baseline, we show the impact on network latency as we pro-
gressively incorporate lookahead routing, pipeline bypassing
and aggressive speculation. Under very low network load
(< 15% capacity), when most router ports are free, bypass-
ing is able to significantly reduce packet latencies. Using
speculation along with bypassing lowers the latency further
with increasing load (< 30% capacity). However, as traffic
increases, contention for network resources becomes higher,
which leads to a higher probability of failed speculations and
results in increasing latency until the network reaches the
saturation point. Hence, it can be seen that router overhead
leads to a significant latency gap between packet-switched
networks and the ideal network.
Fig. 4(a) also highlights the significant throughput gap
between the baseline and ideal-throughput network, with
baseline’s separable allocators only achieving 70% of the ca-
pacity of the ideal-throughput fabric.
Fig. 4(b) shows the gap between the energy consumption
of the ideal network compared to the baseline network. De-
spite the baseline incorporating energy-efficient microarchi-
tectural features, there still exists a substantial energy gap
due to the additional buffering, switching and arbitration
energy consumed in a router, which increases with network
load until saturation is reached.
3. EXPRESS VIRTUAL CHANNELS:
TOWARDS THE IDEAL
INTERCONNECTION FABRIC
As explained in Section 2, the sharing and multiplexing of
data on links in interconnection networks come at the cost
of complex routers which contribute additional overhead in
terms of packet latency and energy due to the router pipeline
and resource contention, while degrading throughput as a
result of imperfect allocation of the limited bandwidth.
By virtually bypassing routers, EVCs, as proposed in this
work, remove Trouter and Erouter at bypass hops, thus low-
ering energy/delay. As EVCs do not participate in arbitra-
tions, they also lower Tc and improve allocation efficiency,
thereby pushing throughput. Moreover, since the express
paths created by EVCs are virtual as opposed to physical
links, numerous EVCs of varying lengths can be used to
connect nodes without the proportionate wiring area over-
head. This allows packets to use EVCs which are tailored
to their specific route and, hence, facilitates dynamic adap-
tation to different traffic patterns which further improves
network performance and energy.
In the rest of this section, we explain the details of EVCs
while Section 4 delves into the detailed microarchitectural
design of the EVC-based router.
3.1 Static EVCs
[Figure 3(d): Speculative pipeline (BASE+LA+BY+SP), showing the stage sequence for head flits and for body/tail flits.]

Here, we first present the details of EVCs using a static
design which uses express paths of uniform lengths. All
nodes in a static EVC network are distinguished as either
an EVC source/sink node or a bypass node. A node is an
EVC source/sink along a specific dimension if an EVC along
that dimension originates/terminates at that node. Bypass
nodes, on the other hand, do not act as sources and sinks
of EVCs and are the ones which are virtually bypassed by
packets traveling on EVCs. For example, in Fig. 1(a), node
00 is an EVC source/sink for both the X and Y dimensions
while node 13 (04) is an EVC source/sink node along the X
(Y) dimension. Nodes 01 and 02 (10 and 20) are examples
of bypass nodes along the X (Y) dimension.
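The node classification in the examples above is consistent with a simple modulo rule, sketched below in Python. The rule itself is an inference from those examples rather than something the text states: it assumes that a two-digit node label is (row, column), that the X dimension runs along columns and the Y dimension along rows, and that source/sink nodes sit at every k-th index starting from 0. The function names are hypothetical.

```python
def parse_label(label):
    """Turn a two-digit node label such as '13' into (row, column)."""
    return int(label[0]), int(label[1])


def is_source_sink(label, dim, k=3):
    """True if the node is an EVC source/sink along dim ('X' or 'Y')."""
    row, col = parse_label(label)
    if dim == "X":
        coord = col  # moving along X changes the column index
    elif dim == "Y":
        coord = row  # moving along Y changes the row index
    else:
        raise ValueError("dim must be 'X' or 'Y'")
    return coord % k == 0


if __name__ == "__main__":
    for node in ("00", "13", "04", "01", "02", "10", "20"):
        print(node, "X:", is_source_sink(node, "X"), "Y:", is_source_sink(node, "Y"))
```

A node that is not a source/sink along a dimension is a bypass node along that dimension, so node 04 is a bypass node along X while being a source/sink along Y.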
The entire set of VCs is divided into two types:
• NVCs: these are VCs which are allocated just like in
traditional VC flow control [9] and are responsible for
carrying a packet through a single hop.
• k-hop EVCs: these are VCs which carry the packet
through k consecutive hops (where k is the fixed length
of the EVC and is uniform throughout the network).
Bypass nodes only support the allocation of NVCs, not
EVCs, with EVCs bypassing their router pipelines. There-
fore, packets can acquire EVCs along a particular dimension
only at EVC source/sink nodes. When a packet traveling on
an EVC reaches a bypass node, it bypasses the entire router
pipeline, skipping VC allocation as it continues on the same
EVC it currently holds. It does not need to go through
switch allocation as EVCs are prioritized over NVCs and
are thus able to gain automatic passage through the switch
without any contention. In other words, a packet travel-
ing on a k-hop EVC traverses the next k − 1 nodes without
having to go through the router pipeline. Thus, a packet
tries to traverse as many EVCs as possible along its route
from the source to destination. NVCs are only used to reach
an EVC source/sink in order to hop onto an EVC or when
the hop-count in a dimension is less than k, the EVC length.
Fig. 1(b) shown earlier depicts the VCs acquired by a packet
traveling from node 01 to node 56 using deterministic XY
routing. Here, the packet travels on NVCs from node 01 to
03, EVCs from nodes 03 to 06 and then 06 to 36, and finally
NVCs from node 36 to 56. While connecting nodes using the
virtual express lanes provided by EVCs, it should be noted
that EVCs are restricted to connect nodes only along a sin-
gle dimension and cannot be allowed to turn. Thus, packets
are required to go through the router pipeline and change
VCs when turning to a different dimension. This restriction
is required to avoid conflicts between multiple EVC paths.
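The route of Fig. 1(b) can be reproduced with a short greedy procedure. The Python sketch below is one way to express it under stated assumptions: nodes are (row, column) pairs, XY routing moves along the row first and then along the column, EVC source/sink nodes sit at every k-th index starting from 0, and EVCs exist in both directions of a dimension. The function names are hypothetical and the procedure is an illustration, not the router's allocation logic.

```python
def segments_1d(start, end, k):
    """Split a move along one dimension into ('EVC' | 'NVC', from, to) segments."""
    step = 1 if end >= start else -1
    pos, out = start, []
    while pos != end:
        # An EVC can only be acquired at a source/sink node, and only if at
        # least k hops remain in this dimension (EVCs cannot turn).
        if pos % k == 0 and abs(end - pos) >= k:
            nxt = pos + step * k
            out.append(("EVC", pos, nxt))
        else:
            nxt = pos + step
            out.append(("NVC", pos, nxt))
        pos = nxt
    return out


def static_evc_route(src, dst, k=3):
    """Segments of a deterministic XY route between (row, col) nodes."""
    (r0, c0), (r1, c1) = src, dst
    route = [(kind, (r0, a), (r0, b)) for kind, a, b in segments_1d(c0, c1, k)]
    route += [(kind, (a, c1), (b, c1)) for kind, a, b in segments_1d(r0, r1, k)]
    return route


def bypassed_nodes(route):
    """Intermediate nodes whose router pipeline is skipped on EVC segments."""
    skipped = []
    for kind, (ra, ca), (rb, cb) in route:
        if kind != "EVC":
            continue
        if ra == rb:  # segment along X
            step = 1 if cb > ca else -1
            skipped += [(ra, c) for c in range(ca + step, cb, step)]
        else:  # segment along Y
            step = 1 if rb > ra else -1
            skipped += [(r, ca) for r in range(ra + step, rb, step)]
    return skipped


if __name__ == "__main__":
    route = static_evc_route((0, 1), (5, 6))
    for segment in route:
        print(segment)
    print("bypassed:", bypassed_nodes(route))
```

For the packet from node 01 to node 56 this yields NVC hops from 01 to 03, EVCs from 03 to 06 and from 06 to 36, and NVC hops from 36 to 56, with nodes 04, 05, 16 and 26 bypassed, matching the description in the text.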
Impact on latency: Fig. 5(a) shows the non-express router
pipeline for head, body and tail flits. Flits go through this
pipeline at an EVC source/sink node or when they arrive
at a bypass node on an NVC. The number of logical stages
in this pipeline is identical to that in the baseline router
(BASE+LA+BY+SP) described in Section 2. Fig. 5(b)
shows the express router pipeline. A flit goes through this
pipeline whenever it bypasses a node virtually, which hap-
pens when it arrives at a bypass node on an EVC. As a flit
traveling on an EVC will continue on that EVC, there is no
need to go through VA. It can also skip SA as EVCs are
granted higher priority over NVCs and are automatically
granted switch passage. Since no allocation is needed, an
EVC flit can bypass BW and head directly to ST, followed
by LT, to the next node at the end of which the flit gets
latched. This pipeline, however, requires the switch to be set
up a priori. We do this by sending a lookahead signal over
a single-bit wire, which goes one cycle ahead to set up the
switch at every intermediate hop which is bypassed, before
the actual flit starts traveling on an EVC. Aggressive tailoring
of the switch for EVCs can further shorten the express
pipeline by removing the ST stage and allowing EVC flits to
bypass the crossbar as well. Fig. 5(c) shows this pipeline. As
can be seen, the pipeline is now reduced to just link traversal,
approaching that of the ideal interconnect. Note that
EVCs enable bypassing of the router pipeline at all levels
of network loading, unlike prior techniques like bypassing or
speculation which are effective only under low network load.

[Figure 4: Existing gap between the state-of-the-art baseline and ideal interconnection fabric. (a) Latency-throughput gap: latency (cycles) versus injected load (fraction of capacity) for ideal latency, BASE+LA, BASE+LA+BY and BASE+LA+BY+SP, with the ideal throughput and the throughput gap marked.]
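To give a feel for the latency effect of the express pipelines, the Python sketch below counts zero-load cycles for a head flit along a path. It is a back-of-the-envelope model under loud assumptions: a non-bypassed router costs three stages (BW, parallel VA/SA, ST) as in the speculative baseline pipeline, a bypassed router costs one stage (ST) or zero in the aggressive pipeline, every link takes one cycle, and contention, serialization and ejection at the destination are ignored. The function name and default values are hypothetical.

```python
def path_cycles(routers, bypassed, router_stages=3, express_stages=1, link_cycles=1):
    """Zero-load head-flit cycles over a path.

    routers:  routers traversed before the destination (source included).
    bypassed: how many of those routers are bypassed on an EVC.
    """
    if bypassed < 0 or bypassed > routers:
        raise ValueError("bypassed must be between 0 and routers")
    normal = routers - bypassed
    return normal * (router_stages + link_cycles) + bypassed * (express_stages + link_cycles)


if __name__ == "__main__":
    # The 10-hop route from node 01 to node 56 in Fig. 1(b) bypasses 4 routers.
    print("no bypassing:      ", path_cycles(10, 0))                     # 40
    print("express pipeline:  ", path_cycles(10, 4))                     # 32
    print("aggressive express:", path_cycles(10, 4, express_stages=0))   # 28
```

The absolute numbers depend entirely on the assumed stage counts; the point of the sketch is only that each bypassed router removes its pipeline stages from the path regardless of load.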
Impact on energy: Unlike speculation-based techniques,
EVCs also lead to a significant reduction in network energy
consumption. They do so by targeting the per-hop router
energy Erouter, which is given as:
$$E_{router} = E_{buffer\_write} + E_{buffer\_read} + E_{vc\_arb} + E_{sw\_arb} + E_{xb} \qquad (7)$$
where $E_{buffer\_write}$ and $E_{buffer\_read}$ are the buffer write and
read energy, $E_{vc\_arb}$ is the VC arbitration energy, $E_{sw\_arb}$ is
the switch arbitration energy, and $E_{xb}$ is the energy required
to traverse the crossbar switch.
While traveling on an EVC, a packet is able to bypass
the router pipeline of intermediate nodes, without the need
for getting buffered or having to arbitrate for a VC or the
switch port. This in effect saves $E_{buffer\_write}$, $E_{buffer\_read}$,
$E_{vc\_arb}$ and $E_{sw\_arb}$, thereby significantly reducing $E_{router}$
and approaching ideal energy. Moreover, since bypass nodes
support only NVCs and do not buffer EVC flits, the total
amount of buffering is reduced, which leads to a corresponding
reduction in energy consumption and area. The
aggressive express pipeline removes $E_{xb}$ as well, though wire
energy $E_{wire}$ increases slightly because of higher load (see
Section 4).
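The per-hop energy breakdown and the terms removed by each express pipeline can be sketched as follows. The component names mirror the five terms of the router energy expression; the function name, the mode names, and the energy values in the usage example are hypothetical placeholders in arbitrary units, not measured data. The slight increase in wire energy mentioned for the aggressive pipeline is not modeled.

```python
COMPONENTS = ("buffer_write", "buffer_read", "vc_arb", "sw_arb", "xb")


def router_energy(energy, mode="normal"):
    """Per-hop router energy for a flit, given per-component energies.

    mode 'normal':     full router pipeline, all five components.
    mode 'express':    bypass on an EVC, only the crossbar traversal remains.
    mode 'aggressive': crossbar bypassed as well, no router energy terms left.
    """
    if mode == "normal":
        kept = COMPONENTS
    elif mode == "express":
        kept = ("xb",)
    elif mode == "aggressive":
        kept = ()
    else:
        raise ValueError("unknown mode: " + mode)
    return sum(energy[name] for name in kept)


if __name__ == "__main__":
    # Hypothetical per-component energies in arbitrary units.
    example = {"buffer_write": 4, "buffer_read": 3, "vc_arb": 1, "sw_arb": 1, "xb": 6}
    for mode in ("normal", "express", "aggressive"):
        print(mode, router_energy(example, mode))
```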
Impact on throughput: Given a particular topology and
routing strategy, network throughput is largely determined
by the flow control mechanism. A perfect flow control is
one which makes efficient use of network resources, leaving
no idle cycles on the bottleneck channels. Using virtual ex-
press lanes, which effectively act as dedicated wires between
pairs of nodes, EVC-based flow control is able to create par-
tial communication flows in the network, thereby improving
resource utilization and reducing contention $T_c$ at individual
routers. Thus, packets spend less time waiting for resources
at each router which lowers the average queuing delay, al-
lowing the network to push through more packets before sat-
uration and hence approach ideal throughput.
3.2 Dynamic EVCs
Using static EVCs of a fixed uniform length to connect
source/sink nodes throughout the network, as discussed in
the previous section, results in a constrained and asymmet-
ric design. Firstly, the classification of all nodes as bypass
and source/sink leads to an asymmetry in the design. Pack-
ets originating at bypass nodes are forced to first travel on
NVCs before they can acquire an EVC at a source/sink
node. Moreover, static EVCs lead to a non-optimal usage
of express paths when packet hop-counts do not match the
static EVC length k. In this case, packets end up bypass-
ing fewer nodes along their route. In other words, static
EVCs are biased towards traffic originating at source/sink
nodes and with hop-count equal to or a multiple of the EVC
length. Dynamic EVCs overcome these problems by: (a)
making every node in the network a source/sink, and (b)
allowing EVCs of varying lengths to originate from a node.
[Figure panel (b) Energy gap: network energy (mJ) vs. injected load (fraction of capacity); curves: baseline, ideal]
By making all nodes identical, dynamic EVCs lead to a sym-
metric design ensuring fairness among nodes. Moreover, in-
stead of having a fixed length, EVC lengths are allowed to
vary from two hops up to a maximum of $l_{max}$ hops. This
improves adaptivity by allowing packets to pick the EVC
lengths that best match their route. Fig. 6(a)
shows dynamic EVCs along a particular dimension with $l_{max}$
= 3. As can be seen, each node acts as a source/sink of
EVCs of lengths two and three hops. Fig. 6(b) shows the
VCs acquired by a packet traveling from node 01 to node
56 using XY routing in a network with $l_{max}$ = 3. Since
all nodes are source/sink nodes, the packet can acquire an
EVC at its source node 01, taking the longest possible EVC
(three-hop) to reach node 04. It then takes a two-hop EVC
to reach node 06, followed by traversing a three-hop EVC in
the Y dimension to reach node 36. Finally, the packet takes
a two-hop EVC to reach its destination node 56. It can be
seen that as compared to static EVCs (Fig. 1(b)), dynamic
EVCs can adapt to the exact route of the packet, thereby
allowing it to bypass more nodes and, hence, achieve better
performance and energy characteristics.
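The route in the example above can be reproduced with a short sketch. It assumes that a node label such as 01 denotes (row 0, column 1), that X is the column dimension, and that a packet greedily takes the longest EVC that fits the remaining distance in the current dimension. The helper names `split_hops` and `xy_route` are hypothetical, and a segment of length 1 stands for a single NVC hop.

```python
def split_hops(hops, l_max):
    """Greedily split a straight-line distance into VC segments.

    Each segment is at most l_max hops long. A segment of length 1
    stands for a normal (NVC) hop; longer segments stand for EVCs.
    """
    if hops < 0 or l_max < 2:
        raise ValueError("need hops >= 0 and l_max >= 2")
    segments = []
    while hops > 0:
        step = min(hops, l_max)
        segments.append(step)
        hops -= step
    return segments


def xy_route(src, dst, l_max):
    """Return the nodes at which a packet switches VCs under XY routing.

    src and dst are (row, column) pairs; the X (column) dimension is
    traversed first, then the Y (row) dimension.
    """
    (row, col), (dst_row, dst_col) = src, dst
    stops = [(row, col)]
    col_dir = 1 if dst_col >= col else -1
    for seg in split_hops(abs(dst_col - col), l_max):
        col += col_dir * seg
        stops.append((row, col))
    row_dir = 1 if dst_row >= row else -1
    for seg in split_hops(abs(dst_row - row), l_max):
        row += row_dir * seg
        stops.append((row, col))
    return stops


print(xy_route((0, 1), (5, 6), l_max=3))
# [(0, 1), (0, 4), (0, 6), (3, 6), (5, 6)]
```

The printed stops correspond to nodes 01, 04, 06, 36 and 56, i.e. a three-hop and a two-hop EVC in each dimension.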
Similar to static EVCs, dynamic EVCs are restricted to
be along a single dimension to prevent conflicts. It should
be noted that overlapping of multiple EVC paths along the
same dimension is allowed since all overlapping EVCs use
network links in a sequentially ordered fashion. From a
node’s perspective, it always prioritizes any EVC flits it sees
over locally-buffered flits.
Implementation of dynamic EVCs requires partitioning
the entire set of VCs at any router port between NVCs and
EVCs of lengths from two through $l_{max}$ hops. Thus, unlike
static EVCs, which partition all VCs between two bins of
NVCs and uniform-length EVCs, dynamic EVCs divide the
VCs into a total of $l_{max}$ bins. A simple scheme is to
partition all VCs uniformly, allocating an equal number of
them to each bin.
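A minimal sketch of the uniform scheme follows. The function name `partition_vcs` is hypothetical, bin 1 stands for NVCs, and bins 2 through `l_max` stand for EVCs of that length. How leftover VCs are handled when the count is not divisible by the number of bins is not specified in the text; giving them to the shortest paths first is an assumption made here.

```python
def partition_vcs(num_vcs, l_max):
    """Uniformly partition the VCs at a router port into l_max bins.

    Returns a dict mapping path length -> number of VCs, where
    length 1 is the NVC bin and lengths 2..l_max are EVC bins.
    """
    if l_max < 2 or num_vcs < l_max:
        raise ValueError("need l_max >= 2 and at least one VC per bin")
    base, extra = divmod(num_vcs, l_max)
    # Assumption: any leftover VCs go to the shortest paths first.
    return {
        length: base + (1 if length <= extra else 0)
        for length in range(1, l_max + 1)
    }


print(partition_vcs(8, 4))  # {1: 2, 2: 2, 3: 2, 4: 2}
print(partition_vcs(8, 3))  # {1: 3, 2: 3, 3: 2}
```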
Routing flexibility using dynamic EVCs: The perfor-
mance of dynamic EVCs can be further improved by al-
lowing for flexibility in EVC traversals. While, normally, a
packet tries to acquire the longest possible EVC along its
path in order to bypass the maximum number of nodes, this
scheme can be relaxed under certain scenarios when con-
tention for EVCs of a particular length is high. For instance,
consider a scenario where a packet has to travel p hops in a
particular dimension, where $p \le l_{max}$. In this case, if a p-hop
EVC is not available but a smaller-length EVC is free, the
packet can choose to switch to a smaller-length EVC. Effec-
tively, this implies that longer EVC traversals can be broken
down into a combination of shorter EVC/NVC traversals in
order to spread the traffic between all virtual paths and,
hence, reduce contention.
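The fallback described above might be sketched as follows. The function `pick_vc` and the `free` map (path length to number of free VCs at the output port, with length 1 meaning an NVC) are hypothetical names for illustration. The sketch falls back whenever the preferred length has no free VC, which is a simplification of the "high contention" condition in the text.

```python
def pick_vc(hops_left, l_max, free):
    """Choose the path length to request at the current node.

    Prefers the longest EVC that fits the remaining distance and
    falls back to shorter EVCs, then to an NVC (length 1).
    Returns None if nothing is free, i.e. the packet must wait.
    """
    longest = min(hops_left, l_max)
    for length in range(longest, 0, -1):
        if free.get(length, 0) > 0:
            return length
    return None


# A packet with 3 hops left finds no free 3-hop EVC and takes a 2-hop one.
print(pick_vc(3, 3, {1: 1, 2: 1, 3: 0}))  # 2
# With a free 3-hop EVC, the longest path is still preferred.
print(pick_vc(3, 3, {1: 1, 2: 1, 3: 1}))  # 3
```

Note that an EVC longer than the remaining distance is never chosen, so a free three-hop EVC does not help a packet with only two hops left.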
In order to make effective use of the routing flexibility
in EVCs, the VC pool partitioning should be made non-
uniform, with more VCs allocated to virtual paths of smaller
lengths. This is because longer EVCs can only be used by
packets with larger hop-counts and will remain unutilized
if the traffic pattern has shorter distances to travel. On
the other hand, by allocating more VCs to shorter paths,