All content in this area was uploaded by Alexander Fell on Jan 11, 2016
RECONNECT: A flexible Router Architecture for Network-on-Chips
A Thesis
Submitted for the Degree of
Doctor of Philosophy
in the Faculty of Engineering
by
Alexander Fell
Supercomputer Education and Research Centre
INDIAN INSTITUTE OF SCIENCE
BANGALORE – 560 012, INDIA
NOVEMBER 2012
Acknowledgements
I would like to take the opportunity to thank my advisor, Prof. S.K. Nandy,
whose insights, encouragement and suggestions helped tremendously to finish
this work successfully. Further, I am grateful for the patience and support
he showed towards me during my administrative and legal obligations, such
as visa applications and extensions.
Secondly, I express my gratitude towards Dr. Ranjani Narayan, CTO of
Morphing Machines, for her moral and technical support during my stay in
India. I am sure this work would not have been possible without her help.
Besides, I would like to thank my lab mates, especially Mythri Alle,
Keshavan Varadarajan, Ganesha K. Garga, S. Balakrishnan and many more, for
testing my implementations and reporting bugs. They always had time to
listen to problems that occurred. Their (sometimes unconventional)
suggestions from a different point of view helped to solve problems on many
occasions. This thesis would not have been possible without the help of
Niraj Sharma of Bluespec, Inc. During the realization of RECONNECT he
always patiently listened to big and small implementation problems alike.
Some of his suggestions for workarounds can still be found in the code, and
his forwarding of the bugs reported to him improved the Bluespec System
Verilog (BSV) compiler.
This acknowledgement would not be complete without mentioning the
International Relation Cell (IRC) of IISc, whose guidance and experience
in visa related issues and whose organizational skills made my stay a very
pleasant one. I was able to learn about many interesting aspects of the
Indian culture and I am sure that this exposure changed my perception of
life.
Lastly, I would like to thank all my friends from within and outside the
institute for welcoming me into their families and giving me an inside view
of their lives. Further, they were always present for an exhausting game of
basketball to release excess energy.
Abstract
In this thesis a Network on Chip (NoC) router implementation called
RECONNECT, realized in BSV, is presented. It is highly configurable in terms
of flit size, the number of provided Input Port (IP)/Output Port (OP) pairs
and support for configuration during runtime, to name a few. Depending on
the number of available IP/OP pairs, the router can be integrated into
different topologies. Due to its ability to be configured during runtime,
the router can even support multiple topologies. A developer is then able to
choose, among the available topologies, the one that promises the highest
performance for an application. However, this work concentrates only on
tessellations such as toroidal mesh, honeycomb and hexagonal.
Routing algorithms that needed to be developed or adapted are presented.
In addition, a step-by-step example of the routing algorithm development
for the honeycomb topology is included in this thesis. This enables a
system designer who wishes to use RECONNECT for any other topology that is
not discussed, such as a hypercube or a ring, to develop the required
routing algorithm easily and quickly.
The impact of the chosen topology on the execution time of several
real-life algorithms has been analyzed by executing these algorithms on a
target architecture called REDEFINE, a dataflow multi-processor consisting
of Compute Elements (CEs) and Support Logic (SL). For this purpose an NoC
comprising RECONNECT routers, which establish communication links among the
CEs, has been integrated into REDEFINE. It was found that for very small
algorithms the execution time does not depend on the choice of topology,
whereas for larger applications such as Advanced Encryption Standard (AES)
encryption and decryption it becomes evident that the honeycomb topology
performs worst and the hexagonal one best. However, it is observed that in
many cases the additional links provided by the hexagonal topology, when
compared with the mesh, are not utilized because the REDEFINE SL is unaware
of the topology. Hence the algorithm execution time for the mesh topology
is often on par with that for the hexagonal one.
In addition to the chosen topology, it is investigated how the size of
the flit affects these algorithms. As expected, the performance of the NoC
decreases if the flit size is reduced so that the packets have to be
segmented into more flits. Further, it is analyzed whether the NoC
performance is sufficient to support high-level algorithms through which
data is streamed, such as the H.264 decoder. These algorithms not only have
to perform the necessary computations within a time constraint, but the
data also needs to be fed to the Processing Elements (PEs) fast enough. In
H.264 the time constraint is the frame rate, meaning that each frame needs
to be processed in a specified fraction of a second. The current RECONNECT
implementation cannot deliver the data within this requirement. As a
result, the necessity for a pipelined router version is presented.
To allow a fair comparison of network performance with implementations
found in current literature and to validate this approach, the NoC has been
put under stress by artificial traffic generators which can be configured to
generate uniform and self-similar traffic patterns. Further, different
destination address generation algorithms, such as normal (randomly
selecting a destination located anywhere in the network), close neighbor
communication, bit complement and tornado, have been developed for each of
these traffic patterns. It could be observed that in general the honeycomb
topology performs worst, followed by the mesh and topped by the hexagonal
topology. From the artificial traffic generators it can be concluded that
the richer the topology, the higher the throughput.
The different router designs have been synthesized to obtain approximate
area and power consumption figures. Depending on the flit size, the single
cycle router, which is able to forward an incoming flit in the next clock
cycle if no congestion occurs, dissipates between 13 and 35 mW for the
honeycomb topology when operating at a frequency of 450 MHz. The power
increases by approximately 25% for each IP/OP pair that is added to the
router integrated in a honeycomb topology. The area required for a router
in a honeycomb network was found to be between 96167 and 301339 cells
depending on the flit size. A router supporting a mesh or a hexagonal
topology needs 50% or 91% more area, respectively, than the honeycomb
router.
Depending on the flit size, the pipelined version of the router dissipates
between 70 and 270 mW, 75 and 294 mW, and 85 and 337 mW for the honeycomb,
mesh and hexagonal topologies respectively. The area required for a single
router is between 213898 and 839334 cells for honeycomb, 238139 and 957548
cells for mesh, and 286328 and 1182129 cells for hexagonal router
configurations. The tremendous increase in both power dissipation and area
consumption is caused by the additional buffers that are required for each
stage. The maximum clock frequency of the pipelined version has reached
1.4 GHz.
Contents
Acknowledgements
Abstract
List of Figures
List of Tables
List of Algorithms
List of Acronyms
1 Introduction
1.1 Motivation
1.2 Definition of commonly used Terms
1.3 A short Introduction into Bluespec System Verilog (BSV)
1.3.1 Implicit and Explicit Conditions
1.4 Organization
1.5 Summary
2 Related Work
2.1 Topologies
2.2 Routing Algorithms
2.2.1 Deterministic Routing Algorithms
2.2.2 Adaptive Routing Algorithms
2.3 Flow Control
2.4 Impact on RECONNECT
2.5 Summary
3 Architecture
3.1 Architectural Overview of the Router
3.2 Configuration Parameters
3.2.1 DEBUG_NOC[1-3]
3.2.2 ASMUNIT_DEBUG
3.2.3 MULTIFLITSUPPORT
3.2.4 STAGES_OF_ROUTER
3.2.5 IP_VC, FIFO_DEPTH_VC_IP
3.2.6 PORTS, PORTS_MESH, PORTS_HC
3.2.7 NWFLITSIZE
3.2.8 NUM_ADDRESSES_ROUTER
3.2.9 OPSLENGTH
3.2.10 HEADERSIZE, BITS_SPACE_IN_HEADFLIT
3.2.11 MULTITOPOLOGY
3.2.12 XBAR
3.2.13 EJECTPORT and (.*)(_FABRIC)?
3.2.14 FLITNW_FIFOSIZE_ASMUNIT
3.2.15 CE_CLK and SUPPORTLOGIC_CLK
3.3 Input Port (IP) (InputPort.bsv)
3.3.1 Assembly Unit (AU) (AssemblyUnit.bsv)
3.3.2 IP connected to OP
3.3.2.1 Matrix Arbitrators (Arbiter([0-9]+).bsv)
3.4 Crossbar
3.5 Output Port (OP) (OutputPort.bsv)
3.6 Pipelined Routers
3.6.1 Pipelining of the Single Cycle Router Implementation
3.6.2 Changes in the Implementation
3.7 Summary
4 Routing Algorithms (Routing(.*).bsv) and Topologies
4.1 Preliminaries
4.1.1 Example
4.2 Topologies
4.2.1 Flattened Butterfly
4.2.2 Spidergon and Stargon Topology
4.3 Virtual Channels (VCs)
4.4 Honeycomb Topology
4.4.1 Algorithm
4.4.1.1 Behavioral Observations
4.4.1.2 if Branch Aggregation
4.4.1.3 Virtual Channel Optimization
4.4.1.4 Input Port Optimization
4.4.2 Limitations of the Routing Algorithm
4.5 Mesh Topology
4.6 Hexagonal Topology
4.6.1 Proof
4.7 Summary
5 Test Case: REDEFINE
5.1 RETARGET
5.2 Support Logic (SL)
5.3 Fabric
5.4 Minimum Flit Sizes in Multiflit Environments
5.5 Fabric Execution Time
5.5.1 Cyclic Redundancy Check (CRC)
5.5.2 AES Decryption
5.5.3 SOBEL Edge Detection
5.5.4 Further Application Examples
5.5.5 Flattened Butterfly
5.5.6 Spidergon and Stargon Topology
5.6 Summary
6 Artificial Traffic Generators
6.1 Test Environment
6.2 Uniform Traffic Pattern
6.3 Self Similar Traffic Pattern
6.4 Summary
7 Synthesis Results
7.1 Synthesis Tool Parameters
7.2 Area and Power Consumption of the Single Cycle Router
7.3 VC Depth
7.4 Area and Power Consumption of the Pipelined Routers
7.5 Maximum Clock Frequency
7.6 Summary
8 Conclusion
8.1 Conclusion regarding REDEFINE
8.2 Future Work
A Fabric Execution Time
B List of Files
References
List of Figures
1.1 A point-to-point connection always between a pair of PEs.
1.2 A pipelined bus system
1.3 Logic block to calculate b = a × 4 + 3.
3.1 The processing steps a flit encounters while traversing a router.
3.2 An architectural overview of the several modules of a router.
3.3 An example of a packet. Here the packet type is an Instruction
    Packet of REDEFINE. The payload bits at the LSB and the template bits
    at the MSB are separated by unused bits (gray field). Bluespec
    initializes unused bits with 0xa in the simulator.
3.4 In this example multiple flits are generated from an Instruction
    Packet type. The flit size is set to 18 bits and the routers are
    operating in an environment in which `NUM_ADDRESSES_ROUTER is set to 2
    and the routing algorithm does not depend on VCs.
3.5 3-way matrix arbiter: The figure shows how the grant for request0 is
    calculated.
3.6 How a deadlock situation can occur if the arbiters strictly serve the
    oldest VC first. Although the original implementation seems to be
    fairer, and certainly looks so locally, from a global point of view
    unfortunate situations such as this one occur. Additional hardware,
    such as fixing the VC for a specific number of cycles, increases
    hardware complexity. This needs to be avoided, since the arbitration
    is in the critical path.
3.7 The muxes and demuxes of a stateless crossbar [60].
3.8 The internal structure of the butterfly crossbar allowing the
    implementation of pipelined routers.
3.9 While flit A is in transit, flit B is sent by another IP to the same
    OP and VC as flit A. Assuming the VCs can store one flit only, flit B
    has to wait in the last pipeline stage after flit A is stored, until
    flit A can proceed further.
4.1 Different layouts of the honeycomb and hexagonal topology.
4.2 The mapping of a honeycomb and mesh topology into a hexagonal one.
    The thick lines are the ones used for the honeycomb topology. The gray
    nodes are the Access Routers (ARs) providing connectivity to the
    Fabric.
4.3 A 4×4 flattened butterfly topology in which all nodes are fully
    connected row and column wise.
4.4 The logical and physical layouts of Spidergon and Stargon topologies.
4.5 Two examples for cyclic dependencies that can occur in honeycomb
    topologies.
4.6 Prohibited turns according to the Turn-Model.
4.7 Two examples how the Turn-Model increases the latency of a message
    significantly.
4.8 Two layers of the network, each of them with a different routing
    algorithm.
4.9 Bidirectional links for the toroidal structure create 2 additional
    cyclic dependencies which are broken by increasing the number of VCs
    and by introducing a date line at linkThorizontal.
4.10 A 4×4 non-toroidal honeycomb topology. Some of the honeycombs are
    incomplete.
4.11 The turns that are forbidden in the mesh topology are marked. This
    results in routing rules in which the west direction has to be
    considered first. The dotted lines show the location of the datelines.
4.12 Multiple possibilities are provided to describe the position of P
5.1 Overview of the flow for compiling an application written in C for
    REDEFINE by generating Hyper Operations (HyperOps)
5.2 An overview of REDEFINE and its major modules and their relationship
    among each other
5.3 The design of the topology in the early stages of REDEFINE.
5.4 The impact on the number of flits and the amount of unused bits in
    the payload field, if the flit size varies. Here the number of flits
    that traverse the network during the execution of the CRC application
    is shown.
5.5 The mapping of the HyperOps of the CRC application onto the Fabric.
    CRC consists of 3 HyperOps: 2 of them are mapped onto one CE only
    (striped area around (0,0)) whereas the larger one occupies 2 CEs
    (grayed out area). The thick links exist for the honeycomb topology,
    the mesh topology consists of the thick links and the thin link,
    whereas hexagonal also includes the dotted link.
5.6 The Fabric Execution Time against the maximum flit size for CRC in
    various topologies.
5.7 A snapshot of the location of multiple HyperOps of the AES-D on the
    Fabric. As in figure 5.5, the format of the interlinks represents the
    different topologies they belong to.
5.8 The Fabric Execution Time against the maximum flit size for AES
    decryption in various topologies.
5.9 The Fabric Execution Time against the maximum flit size for SOBEL
    edge detection in various topologies.
5.10 The row and column wise distances of nodes as seen from the black
    one. In a toroidal 6×6 Fabric the distances for each dimension cannot
    exceed three hops.
6.1 Latency for uniform traffic patterns and various address generation
    methods
6.2 Throughput for uniform traffic patterns and various address
    generation methods
6.3 Latency for self similar traffic patterns and various address
    generation methods
6.4 Throughput for self similar traffic patterns and various address
    generation methods
7.1 Power and area consumption of a single router for various flit sizes.
    The drop at 116 bit is caused by the removal of the AU whose
    functionality of segmenting packets into multiple flits is not
    required at these sizes anymore.
7.2 Power and area consumption, if the VC has to hold one packet possibly
    comprising multiple flits.
7.3 Power and area consumption of a single pipelined router using a
    butterfly as a crossbar for various flit sizes. The drop at 116 bit is
    caused by the removal of the AU whose functionality of segmenting
    packets into multiple flits is not required at these sizes anymore.
A.1 AES Encryption algorithm
A.2 Elliptic Curve Point (ECP) Addition (ECPA)
A.3 Elliptic Curve Point (ECP) Doubling (ECPD)
A.4 GIVENS algorithm
List of Tables
1.1 Properties of commonly used topologies
4.1 Routing function naming convention.
5.1 The size of the different packet types used by REDEFINE in the
    Fabric. Displayed is the required bus width to transfer the packet as
    a whole, which apart from the payload itself also includes the address
    tuples (3×4 bits), VC number (2 bits) and template/union bits stating
    which packet type is valid (4 bits).
5.2 Occurrences of the various packet types during the execution of a CRC
    against the sizes of the packets.
5.3 Occurrences of the various packet types during the execution of an
    AES decryption against the sizes of the packets.
5.4 Distances the flits have to travel during the execution of various
    applications.
7.1 Maximum clock frequency of the router reported by the design compiler
    in pre-synthesis phase.
List of Acronyms
ACK Acknowledgement
AES Advanced Encryption Standard
ALU Arithmetic Logical Unit
AR Access Router
ASIC Application Specific Integrated Circuit
AU Assembly Unit
BB Basic Block
BSV Bluespec System Verilog
CE Compute Element
CRC Cyclic Redundancy Check
DFG data flow graph
degree The degree of a router is defined as the number of links it can
establish to its neighbors; it excludes the injection and ejection ports.
ECP Elliptic Curve Point
ECPA Elliptic Curve Point (ECP) Addition
ECPD Elliptic Curve Point (ECP) Doubling
FIFO first-in-first-out
flit A flit is the largest amount of data that can be transmitted between
two routers in a clock cycle. The size of the flit depends on the number of
available wires laid out between the routers.
HL HyperOp Launcher
HyperOp Hyper Operation
IHDF Inter HyperOp Data Forwarder
IP Input Port
IPC Intellectual Property Controller
LIFO last-in-first-out
LSU Load/Store Unit
NoC Network on Chip
OP Output Port
PE Processing Element
pHyperOp partial HyperOp
radix The radix is the total number of IPs and OPs a router provides, not
only to its neighbors but also to the attached units. Hence it also
includes the injection and ejection ports.
RB Resource Binder
SHA Secure Hash Algorithm
SL Support Logic
union A union in BSV is comparable to a union in the programming language
C. It can contain several structures such as packet structures, but only
one is valid at any point in time. To distinguish which packet type is
valid, the BSV compiler automatically adds template bits according to the
entries in the union.
VC Virtual Channel
VLSI Very Large Scale Integration
Chapter 1
Introduction
In this chapter the evolution of Networks on Chip (NoCs), originating from
point-to-point connections and bus systems, is described together with the
motivation for this work. After common terms that are used throughout this
thesis have been defined, the reader is given a short introduction to
Bluespec System Verilog (BSV), in which RECONNECT has been implemented.
1.1 Motivation
Within a chip, complex systems and multi-core architectures consist of many
units called Processing Elements (PEs) that are either highly specific,
performing a single task very efficiently, or comprise Arithmetic Logical
Units (ALUs) for generic operations. The high level of integration of
multiple (on the order of tens or even more) PEs promises to satisfy
today's demand for computation power. On the other hand, these systems and
architectures require a high speed communication system to exchange data
among the PEs within the chip. One method is to analyze the traffic
patterns of the executed application and to build a communication system
exactly matching these patterns by directly connecting the data exchanging
PEs as shown in figure 1.1. While this point-to-point communication system
is considered to be the fastest, it restricts the system designer to a few
applications only. In addition, depending on the richness of the system,
the wiring requirements explode for n-to-n connections.
Figure 1.1: A point-to-point connection always between a pair of PEs.

A solution to this dilemma is to rearrange the PEs so that they are
connected via a cost efficient bus system, a shared resource in which only
one PE can transmit data at a time. A control logic manages the access
granted to the bus. While this communication system gives the desired
flexibility, it lacks scalability. With on the order of a thousand PEs,
which can be integrated easily in modern VLSI technology, the wires of the
bus become very long and have a high capacitance, resulting in long delays
and high power consumption [39]. Since only a single PE is allowed access
at a time, the throughput is very low. Thus bus systems are only used if a
few tens of PEs need to communicate.
To increase the utilization level, bridges, basically consisting of a set
of first-in-first-out (FIFO) buffers, are inserted into the bus system,
dividing the bus into several subsets (refer to figure 1.2a). If two PEs
within a subset communicate, they can do so provided no other PE of the
same group uses the bus at the same moment. With these bridges, multiple
PEs in different subsets are able to transmit data at the same time, since
the bridges are opaque and the traffic cannot cross them. If the
destination of the data resides in a subset outside of the current one, the
bridge turns transparent and lets the traffic pass through it. If the bus
contains multiple bridges, it can also be considered a pipelined bus with
each bridge representing one pipeline stage, as shown in figure 1.2b. Each
subset has its own arbitration control granting access to the bus segment
either to one PE or to one bridge.
However, the time a message requires to cross the bus system from one side
to the logically opposite one is immense. Depending on the number of
bridges and the size of the system, it can be on the order of thousands of
clock cycles. In addition, orchestrating the access to each subset and
keeping track of the traffic crossing several bus segments increases the
complexity of the global arbitration control.
By rearranging the PEs into a more beneficial pattern such as a grid to
shorten the long distances (refer to figure 1.2c), and by localizing the
access control, the disadvantages of pipelined buses can be avoided. The
access control now not only regulates the grants to the shared bus
resource, but also forwards the traffic into one of the multiple directions
available.
Figure 1.2: A pipelined bus system
(a) The bus is divided into two subsets separated by a bridge which is
opaque, if the source and destination of a communication pair are within
the same subset. Otherwise the bridge is transparent, allowing the traffic
to pass through it.
(b) The bus is further divided by using more bridges.
(c) Rearrangement of the PEs to shorten the distances and decentralizing
the access control logic.
1.2 Definition of commonly used Terms
The pattern in which the bridges and their connections among each other
are arranged is henceforth called topology. The combination of the access
control and the direction decision logic is called a router. Topology and
routers form a Network on Chip (NoC). The NoC, including the PEs among
which the connectivity is established, is referred to as Fabric.
An encapsulated piece of information outside the Fabric is called a packet.
When it is inserted into the NoC, it is converted into a protocol that the
router understands. This might include the necessity to divide the packet
into several flits. A flit represents the largest number of bits that can
be transmitted between two routers at one instant. For instance, a packet
of 80 bits is divided into 4 flits of 20 bits each. To transmit a flit, the
connection between two routers needs to consist of at least 20 wires.
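The segmentation described above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the packet is treated as a plain bit vector, the least significant bits are sent first, and the per-flit header fields that a real router needs are ignored. The function names segment_packet and reassemble_packet are hypothetical and do not appear in the RECONNECT sources.

```python
def segment_packet(packet, packet_bits, flit_bits):
    """Split a packet (an integer holding packet_bits bits) into flits.

    Assumption for this sketch: the least significant flit_bits bits form
    the first flit, and the last flit is zero-padded if packet_bits is not
    a multiple of flit_bits.
    """
    if flit_bits <= 0 or packet_bits <= 0:
        raise ValueError("sizes must be positive")
    if packet < 0 or packet >> packet_bits:
        raise ValueError("packet does not fit into packet_bits")
    num_flits = -(-packet_bits // flit_bits)  # ceiling division
    mask = (1 << flit_bits) - 1
    return [(packet >> (i * flit_bits)) & mask for i in range(num_flits)]


def reassemble_packet(flits, flit_bits):
    """Inverse of segment_packet: concatenate the flits again."""
    packet = 0
    for i, flit in enumerate(flits):
        packet |= flit << (i * flit_bits)
    return packet


packet = 0x0123456789ABCDEF0123  # an arbitrary 80-bit value
flits = segment_packet(packet, packet_bits=80, flit_bits=20)
print(len(flits))  # → 4
```

With a flit size that does not divide the packet size, for example 18 bits, the same 80-bit packet needs five flits, and the last one carries padding bits.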
The number of connections that are established from one router to all
its neighbors is referred to as degree. In figure 1.2c the routers located
in the corners have a degree of two, whereas the routers in the middle have
a degree of three. Usually only the maximum degree is given when listing
the specifications of the NoC. The degree depends on the topology that a
router is integrated into, and it differs from the term radix, which
represents the total number of connections a router provides including the
links to all connected PEs. Again referring to the same figure, the radix
of the corner routers is three whereas the radix of the routers in the
middle is four. Assuming each router has 4 PEs connected to it, the radix
becomes 6 and 7 respectively.
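As a small illustration of the two terms, the following Python sketch counts the degree of every router in a grid and derives the radix from it. It assumes that the arrangement of figure 1.2c can be modelled as a non-toroidal 2×3 grid with one PE per router; the helper names grid_degrees and radix are hypothetical.

```python
def grid_degrees(rows, cols):
    """Return {(row, col): degree} for a non-toroidal rows x cols grid."""
    degrees = {}
    for r in range(rows):
        for c in range(cols):
            neighbours = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # Count only the neighbours that lie inside the grid.
            degrees[(r, c)] = sum(
                0 <= nr < rows and 0 <= nc < cols for nr, nc in neighbours
            )
    return degrees


def radix(degree, pes_per_router=1):
    """Radix = links to neighbouring routers plus links to attached PEs."""
    return degree + pes_per_router


degrees = grid_degrees(2, 3)  # assumed model of figure 1.2c
for node, deg in sorted(degrees.items()):
    print(node, "degree", deg, "radix", radix(deg))
```

The four corner routers report a degree of two and a radix of three, the two routers in the middle a degree of three and a radix of four.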
Besides the characteristics of a router, the NoC can be described by
the bisection bandwidth, which is obtained by dividing the network into two
disjoint sets of nearly equal size. The partition with the lowest number of
connections originating in one set and ending in the other one determines
the bisection bandwidth.
For the system designer deciding on the topology, it is important to know
the maximum number of hops a flit has to travel between the source and
destination that are farthest apart. This characteristic is called the
diameter of the network. Table 1.1 compares these characteristics in an
overview.
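Both characteristics can be checked by brute force on a small network. The Python sketch below is an illustration only: it assumes that the bisection bandwidth is counted as a number of links, it handles only networks with an even number of nodes, and the exhaustive search over all balanced partitions is feasible only for small n. The helper names grid_links, diameter and bisection_width are hypothetical.

```python
from collections import deque
from itertools import combinations


def grid_links(k, toroidal=False):
    """Undirected links of a k x k mesh; wrap-around links if toroidal (k >= 3)."""
    links = set()
    for r in range(k):
        for c in range(k):
            for dr, dc in ((0, 1), (1, 0)):
                nr, nc = r + dr, c + dc
                if toroidal:
                    nr, nc = nr % k, nc % k
                if nr < k and nc < k:
                    links.add(frozenset(((r, c), (nr, nc))))
    return links


def diameter(nodes, links):
    """Largest hop count between any two nodes (breadth-first search)."""
    adjacent = {n: set() for n in nodes}
    for link in links:
        a, b = tuple(link)
        adjacent[a].add(b)
        adjacent[b].add(a)
    longest = 0
    for start in nodes:
        hops = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in adjacent[node]:
                if nxt not in hops:
                    hops[nxt] = hops[node] + 1
                    queue.append(nxt)
        longest = max(longest, max(hops.values()))
    return longest


def bisection_width(nodes, links):
    """Fewest links cut by any split into two equally sized halves."""
    nodes = sorted(nodes)
    best = len(links)
    # Fixing nodes[0] in the first half avoids testing mirrored splits twice.
    for rest in combinations(nodes[1:], len(nodes) // 2 - 1):
        half = set(rest) | {nodes[0]}
        cut = sum(1 for link in links if len(link & half) == 1)
        best = min(best, cut)
    return best


k = 4
nodes = [(r, c) for r in range(k) for c in range(k)]
for name, toroidal in (("mesh", False), ("toroidal mesh", True)):
    links = grid_links(k, toroidal)
    print(name, diameter(nodes, links), bisection_width(nodes, links))
# → mesh 6 4
# → toroidal mesh 4 8
```

For n = 16 the toroidal mesh yields a diameter of √n = 4 and a bisection of 2√n = 8 links, and the mesh a bisection of √n = 4 links, in line with the corresponding rows of Table 1.1. The exact mesh diameter of 6 equals 2(√n − 1), slightly below the 2√n listed in the table.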
The time a flit needs to traverse the network is referred to as latency.
Usually system designers are interested not only in when a flit arrives,
but also in when the packet consisting of multiple flits is ejected from
the network and available for processing. Hence the latency also includes
the time needed for segmentation and reassembly of the packet
(serialization latency). The higher the number of flits, and hence the
higher the network load, the higher the latency due to congestion, in which
flits have to share a common path. The network load can reach a point at
which the throughput saturates. Once the saturation point has been reached,
the router no longer always accepts injected packets, and these packets
need to be queued. Throughput depends on the underlying topology, the clock
frequency of the router and the flit size.

Table 1.1: Properties of commonly used topologies

Topology                   Degree   Diameter   Bisection
                           (deg)    (dia)      Bandwidth
Mesh                       4        2√n        √n
Honeycomb                  3        1.16√n     0.82√n
Honeycomb (rectangular)    3        2√n        0.5√n
Hexagonal                  6        1.16√n     2.31√n
Hexagonal (rectangular)    6        2√n        2√n−1
Toroidal Topologies
Mesh                       4        √n         2√n
Honeycomb                  3        0.81√n     2.04√n
Honeycomb (rectangular)    3        √n         √n
Hexagonal                  6        0.58√n     4.61√n
Hexagonal (rectangular)    6        √n         4√n−2
Hypercube                  log n    log n      0.5×n
1.3 A short Introduction into Bluespec System Verilog (BSV)
RECONNECT is implemented in the high-level language BSV [6, 12, 11], which
can be compiled into a clock accurate simulator that can be executed on
ordinary end user PCs, and also into Verilog files for further processing
by e.g. synthesis tools. By using one code base for simulation and
synthesis, code maintenance is minimized, since consistency does not need
to be ensured among several implementations. In addition, BSV allows the
use of certain constructs which are hard to understand for developers
coming from a hardware design background, but are well known to software
programmers. On the other hand, software programmers often lack the
experience
and knowledge of hardware programming. As an example, if
b=a×4+3 (1.1)
needs to be calculated, a software developer will most likely write the
equation directly into the program. However the hardware engineer knows
that the multiplication is a simply shift of
a
by 2 bits to the left and the
addition by 3 means that the last 2 bits of
b
are set after assignment of
a
to
b
.
Equation 1.1 is equivalent to
b= (a << 2) | 0x3 (1.2)
6 Introduction
VDD
0
1
2
3
ab
0
1
2
3
Figure 1.3: Logic block to calculate b=a×4 + 3.
avoiding the synthesis of power intensive multiplication and addition
logic potentially requiring multiple clock cycles for calculation (refer to figure
1.3). BSV tries to fill the gap between hardware and software developers by
providing very abstract high level language constructs, but also allows the
hardware engineers to do the operations that they are used to. The imple-
mentation of the router makes use of these high level abstractions. Hence
this section introduces and explains some of them to ease the understanding
of the code of the routers.
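
The equivalence of equations 1.1 and 1.2 can be checked with a few lines of software. The following is a minimal sketch in Python rather than BSV; the function names and the 8-bit input range are assumptions made purely for illustration.

```python
def multiply_add(a: int) -> int:
    """Reference form of equation 1.1: b = a * 4 + 3."""
    return a * 4 + 3


def shift_or(a: int) -> int:
    """Shift form of equation 1.2: b = (a << 2) | 0x3."""
    # Shifting left by 2 leaves the two lowest bits at zero, so OR-ing in
    # 0b11 has the same effect as adding 3.
    return (a << 2) | 0x3


# Exhaustive check over all 8-bit unsigned inputs (width chosen for illustration).
assert all(multiply_add(a) == shift_or(a) for a in range(256))
```

The check relies only on the fact that the two lowest bits of a << 2 are always zero, which is why the OR can replace the addition.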
As described in chapter 3 the router needs differently behaving routing
algorithms depending on its location in the Fabric. It would be programming
overhead and cause a considerable effort spent for code maintenance, if the
code of the routers is duplicated. Hence the routing algorithms as functions
are given to the router module as parameters. The following listing shows
the module declaration of the router:
module mkNoCRouterXBar#(
    function
      Tuple2#(Bit#(`STAGES_OF_ROUTER), UInt#(TLog#(`IP_VC)))
      getRouteHoneycomb(
        Vector#(`NUM_ADDRESSES_ROUTER, Int#(`ADDRLENGTH)) addr,
        UInt#(TLog#(`IP_VC)) _vcNo,
        Integer number),
    parameter int routerNumber, Bool northHoneycomb)
    (RouterXBar_IFC);
In this example three parameters in total are given to the module
mkNoCRouterXBar: a function which can be invoked by the name
getRouteHoneycomb from within the module, and two further parameters,
routerNumber and northHoneycomb. The list of arguments given to the function
(addr, _vcNo and number) is a prototype that tells the BSV compiler how to
handle that function. Since passing a function as a parameter to a module is
not a Verilog construct, this module cannot be synthesized on its own. That
is why it is not recommended to instantiate mkNoCRouterXBar directly, but
through a wrapper that makes it synthesizable. BSV allows a parent module to
instantiate such a module and return it, so that the parent assumes the same
functionality:
(* synthesize *)
module mkNoCRouterXBarNormalS#(parameter int routerNumber)
(RouterXBar_IFC);
let router <- mkNoCRouterXBar(
getRouteHoneycombS2DVC,
routerNumber,
False);
return router;
endmodule
As can be observed, the function that is used as getRouteHoneycomb inside
the module mkNoCRouterXBar is named getRouteHoneycombS2DVC. Any function can
be passed to change the behavior of the module, as long as it follows the
argument list and return type of the prototype. Since there is no longer a
construct unknown to Verilog in the header of mkNoCRouterXBarNormalS, it can
be synthesized by inserting (* synthesize *) before the module keyword.
Not only functions can be given as parameters to a module, but also modules
themselves. This functionality is used for instantiating the crossbar
(mkXBar):
XBAR_IFC xbar <- mkXBar(
    `STAGES_OF_ROUTER,
    mkMerge2x1_multipleFlit,
    routerNumber
);
The module header of mkXBar is similar to the one of mkNoCRouterXBar:
module mkXBar#(Integer logn,
module #(Merge2x1_IFC) mkMerge2x1, parameter int id)
(XBAR_IFC);
In that case mkMerge2x1_multipleFlit is a module which instantiates one node
of the crossbar. Depending on the required functionality,
mkMerge2x1_multipleFlit can be replaced. Illustrations are given in the file
XBar.bsv.
XBar.bsv is an example of one more specialty of BSV: Depending on the number
of input and output ports, multiple stages need to be instantiated (refer to
figure 3.8 on page 35). To generate one more stage, the previous stages are
instantiated twice; the upper copy is connected to the lower nodes of the
newly created stage, whereas the lower copy is connected to the upper ones.
This is repeated for every stage. Hence, depending on the number of required
stages, mkXBar instantiates itself recursively ⌈log2 n⌉ times, with n being
the number of required ports.
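
The recursion can be mimicked in software to see how quickly the structure grows. The sketch below is written in Python, not BSV, and is not taken from XBar.bsv: the function names are hypothetical, and the assumption that every new stage contributes one 2x1 merge node per port is made only for illustration.

```python
def stages_for_ports(num_ports: int) -> int:
    """Number of stages needed for num_ports ports, i.e. ceil(log2(num_ports))."""
    if num_ports < 1:
        raise ValueError("at least one port is required")
    # For n >= 1, (n - 1).bit_length() equals ceil(log2(n)) without floating point.
    return (num_ports - 1).bit_length()


def count_merge_nodes(logn: int) -> int:
    """Recursively 'instantiate' a crossbar with 2**logn ports and count its nodes."""
    if logn == 0:
        return 0  # base case: a single port needs no merge node
    # Two copies of the previous crossbar (upper and lower) ...
    previous = 2 * count_merge_nodes(logn - 1)
    # ... plus one new stage with one 2x1 merge node per port (assumption).
    return previous + 2 ** logn
```

Under these assumptions, 7 ports need stages_for_ports(7) = 3 stages, and a three-stage crossbar contains count_merge_nodes(3) = 24 merge nodes.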
Apart from the well-known constructs of software engineering, BSV also
provides functionality such as linked lists whose size is not yet known at
development time. It is possible to walk through those lists, append
elements and remove them. However, at compile time everything needs to be
fixed, since hardware obviously cannot be instantiated during run time.
1.3.1 Implicit and Explicit Conditions
The functionality a BSV module has to provide is implemented in rules and
methods. Rules are executed whenever conditions allow, whereas methods are
more comparable to function calls in C, taking parameters as inputs and
returning values.
For instance, the following rule is automatically executed every clock
cycle, since there is no condition preventing the execution:
rule incrementCounter;
counter <= counter + 1;
endrule
with counter being a register. If desired, the rule can be extended to
execute only if certain conditions are met. As an example, the rule above
has been extended to increase the counter value only if a register called
start is set to true.
rule incrementCounter(start);
// or rule incrementCounter(start == True);
counter <= counter + 1;
endrule
The added condition is called an explicit condition and is enforced by the
system designer. Often more conditions are required which are not
immediately visible to the designer. For instance, a rule that writes to a
FIFO buffer cannot do so if the buffer is already full. These conditions are
automatically inserted by the BSV compiler and are called implicit
conditions.
While these implicit conditions are very helpful and can for instance be
used to check whether an Input Port (IP) is ready to receive a flit, they
sometimes create misunderstandings resulting in long debug sessions.
Consider the following rule as an example:
rule getData; // get data and store in register reg
if(getDataFromOne) begin
reg <= fifo1.deq(); // dequeue the data from fifo 1
end else begin
reg <= fifo2.deq(); // dequeue the data from fifo 2
end
endrule
Somewhere else in the code, getDataFromOne is calculated, and the first
value from either fifo1 or fifo2 shall be stored in the register reg.
However, in many cases the rule is not executed at all. The reason is the
implicit condition that a FIFO must not be empty when data is dequeued from
it. Since the rule dequeues from both FIFOs, its implicit condition is that
fifo1 and, at the same time, fifo2 hold some data. If only one FIFO contains
data, this condition is not met and the rule does not fire.
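
The effect can be reproduced with a small behavioural model. The sketch below is Python, not BSV, and only models when a rule is allowed to fire; the function names are hypothetical. The second function models a common remedy that is not described in the text above and is therefore an assumption: splitting the rule into two rules, one guarded by getDataFromOne and one by its negation, so that each rule only inherits the implicit condition of the FIFO it actually uses.

```python
from collections import deque


def can_fire_single_rule(fifo1: deque, fifo2: deque) -> bool:
    """One rule containing both deq calls: the implicit conditions are AND-ed."""
    return len(fifo1) > 0 and len(fifo2) > 0


def can_fire_split_rules(get_data_from_one: bool, fifo1: deque, fifo2: deque) -> bool:
    """Two rules, one per FIFO: only the selected FIFO has to hold data."""
    if get_data_from_one:
        return len(fifo1) > 0  # rule guarded by getDataFromOne
    return len(fifo2) > 0      # rule guarded by !getDataFromOne


# fifo1 holds data, fifo2 is empty, and data is requested from fifo1.
f1, f2 = deque([42]), deque()
assert not can_fire_single_rule(f1, f2)     # combined rule is blocked
assert can_fire_split_rules(True, f1, f2)   # split rule for fifo1 may fire
```

In this model the combined rule is blocked although the requested FIFO holds data, which corresponds to the situation described above.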
1.4 Organization
This thesis describes the development and architecture of the routers called
RECONNECT. The next chapter (chapter 2) contains a literature survey which
introduces breakthrough and recent developments regarding NoCs. As discussed
there, the major difference between current publications and RECONNECT is
that RECONNECT can be easily adapted and configured to be integrated into
various Fabrics. The motivation for this work originates in the surveyed
implementations, which until now are either too rigid because they serve
only a very particular purpose, or closed source and commercialized.
Nevertheless, some of the techniques mentioned in these papers have been
implemented in RECONNECT. The chapter gives an overview of these ideas and
states which of them are used and which, despite being interesting, are
discarded.
The routers can be easily adapted to provide different configurations or
even functions, often by just defining a constant. In chapter 3 these
configuration parameters and their effects are explained. Further, the
currently implemented functionality of the router modules, such as Input
Port and Crossbar, is elaborated from the point of view of a flit as it
traverses the router. In many cases the section title includes a filename
stating in which file the described functionality is encapsulated. The
purpose of each file that is used in RECONNECT is tabulated in appendix B.
The routing algorithms and the topologies they are tightly connected to are
explained in chapter 4. Three popular topologies, viz. honeycomb, mesh and
hexagonal, are introduced and their routing algorithms explained. The
honeycomb topology in particular required the development of a new routing
algorithm. How this task has been accomplished is demonstrated via a
step-by-step description, which is also applicable to developing routing
algorithms for other topologies not considered in this work, such as the
Flattened Butterfly, Spidergon and Stargon, which are only briefly analyzed.
In chapter 5 the performance of several router configurations is presented
by executing applications on REDEFINE, a dataflow-oriented, reconfigurable
architecture which is also briefly described in that chapter. In addition,
the design of RECONNECT has been validated and compared by running
artificial traffic generators under different traffic loads (refer to
chapter 6). The router has been compiled into Verilog code, and the findings
of pre-synthesis are tabulated in chapter 7 before the thesis is concluded
in chapter 8.
1.5 Summary
In this chapter the motivation for an NoC has been explained by pointing out
the disadvantages of point-to-point communication systems and buses. Clearly
an NoC is not always the ultimate choice. Especially in small systems, a bus
is cheaper and easier to implement. However, as systems are scaled up, a bus
soon becomes a bottleneck and an NoC promises higher performance.
Further, the reader has been exposed to some of the common terms that are
used in the literature regarding NoCs. This minimal set of terms is required
to understand the basics of NoCs. If specifications and requirements make it
necessary to alter the code in which the routers are implemented, novices to
BSV will find section 1.3 most interesting. The strength of the rule
management found in BSV is explained with the help of code snippets taken
directly from the code itself.
Chapter 2
Related Work
Networks on Chip (NoCs) are actively investigated by numerous research
groups throughout the world, with many dedicated conferences and journals
every year. Some of the results have been commercialized into products such
as Arteris [38, 5], the Teraflops Research Chip by Intel [29, 28] or
Æthereal by Philips Research [21]. While these implementations are
intellectual property and hence detailed information is rare or patented,
NoCs have also been developed by academically oriented research groups,
resulting for instance in Nostrum [42], MANGO [10] or CLICHÉ [37].
Currently research regarding NoCs can be roughly divided into three
different categories, each handling different aspects:
1. Topologies: Arrangements of the nodes and the interconnects among them.
2. Routing Algorithms: Algorithms that determine how flits are forwarded.
3. Flow Control: How the router forwards flits internally, for instance by
applying novel arbitration rules or controlling the buffer management.
In the following sections, details of breakthrough and recent research work,
sorted into these categories, are summarized, and finally their impact on
RECONNECT is discussed.
2.1 Topologies
Since the invention of NoC around two decades ago [16, 24, 50], numerous
suggestions for topologies were introduced, ranging from fully connected
topologies to arrangements of nodes in three and more dimensions [9]. A lot
of attention has been given to the mesh topology and variants thereof,
because arranging nodes in a grid is easy and natural to understand and
eases layout efforts at the same time. In recent years the topology is often
tailored to specific designs in which large Processing Elements (PEs) are
interconnected. In [25] the authors describe an NoC in which each node is
not connected to its direct neighbors, but to the neighbors next to them.
Hollis and Jackson observed that in many cases the source and destination
are two hops apart in their application-specific traffic patterns, and that
establishing this kind of connectivity results in a latency decrease.
Assuming a predictable traffic pattern, the authors in [55, 44] describe a
method to estimate the traffic demands between a communication pair. In a
second step the throughput capacity of the links is adapted to these
requirements, resulting in an irregular NoC in which the bandwidth between
two routers differs. In both publications the NoC is customized for a
specific set of traffic traces and synthesized.
If, however, the application is not fixed or not yet known, it could be
beneficial to be able to change the topology according to the requirements.
Different traffic patterns can be accommodated by such topologies, which are
also called Polymorphic On-Chip Networks [36]. While this approach seems
interesting, problems such as limited scalability and extremely long wiring
occur, apart from a significant overhead in area consumption.
Similar to Polymorphic On-Chip Networks the routers of RECONNECT
have the ability to be integrated into one topology such as mesh and to
mimic other topologies. However in RECONNECT the Input Port (IP)/Output
Port (OP) pairs of the routers are deactivated for power conservation limiting
the number of supported topologies. Thus for example a mesh can be
transformed into a ring, but not into a fully connected network in which
the distance from any source to any destination is one hop only, because
of the unavailability of required connections. By limiting the number of
supported topologies, RECONNECT can be simplified considerably compared
to Polymorphic On-Chip Networks.
The primary purpose of this ability of RECONNECT is to prevent the topology
from being dictated to the system designer. The difference between
Polymorphic On-Chip Networks and RECONNECT is that the designer is not
confronted with a fixed, fully connected network which can then be
configured to mimic other topologies. Instead RECONNECT offers freedom of
choice by letting the designer pick the desired topologies selectively,
while maintaining its scalability and generality. If the topologies are to
be fixed to e.g. a ring and a mesh, a fully connected network, whose unused
links only contribute to a tremendous area and power consumption, is not
required.
2.2 Routing Algorithms
Tightly bound to the topology are the routing algorithms which can be further
divided into two categories: deterministic and adaptive algorithms.
2.2.1 Deterministic Routing Algorithms
Deterministic or oblivious algorithms always choose the same route between
any source and any destination. Although they are comparatively easy to make
deadlock free, they do not consider the current state of the router. For
instance, a particular link can be used heavily, whereas the other links are
underutilized.
An example of a deterministic routing algorithm is a look-up routing table
in each source for any destination. The information about the direction in
which the router needs to forward the flit is stored in the flit's header.
These look-up tables do not scale efficiently, nullifying the very purpose
of using an NoC, and are usually not used. Deterministic routing algorithms
can easily be made deadlock free, since all situations are predictable [17].
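
To make the notion of a deterministic route concrete, the following Python sketch shows dimension-order (XY) routing on a two-dimensional mesh. It is a textbook example chosen here for illustration and is not one of the routing algorithms of RECONNECT; the function name and the coordinate convention are assumptions.

```python
def xy_route(src: tuple, dst: tuple) -> list:
    """Return the nodes visited from src to dst: first along x, then along y."""
    x, y = src
    path = [src]
    while x != dst[0]:           # resolve the x distance completely first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:           # then resolve the y distance
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


# The same source/destination pair always yields the same path.
assert xy_route((0, 0), (2, 1)) == [(0, 0), (1, 0), (2, 0), (2, 1)]
```

Because the path depends only on the source and destination coordinates, the load on the links along the way has no influence on the decision.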
2.2.2 Adaptive Routing Algorithms
Compared to deterministic routing algorithms, adaptive algorithms take
several factors into consideration while calculating the further path, such
as link utilization but also thermal information [52]. In addition, they can
avoid a particular route if faults occur and a link or node becomes
dysfunctional [18].
Adaptive routing algorithms do not need to send flits onto a path that leads
them one hop closer to their destination. In case of a heavily utilized
link, flits can get deflected and might reach the destination after
additional hops, but in a shorter time by avoiding regions of heavy traffic
congestion [46]. The algorithms distribute the load in the NoC automatically
so that no hot spots occur. In [8] the authors describe an adaptive
algorithm called "Hot Potato Routing". Like a hot potato, the aim of this
algorithm is to pass the flit on, preferably closer to its destination. In
case of congestion, if multiple flits desire to use the same shared link,
the flits have to be forwarded nevertheless. A flit is then deflected into
an arbitrary direction, except back into the direction it just came from.
Thus storing the flit in a buffer and keeping it waiting until resources
become available is not intended by this routing algorithm.
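
The deflection decision described above can be sketched as follows. This is a simplified Python illustration and not the algorithm of [8]: the port names are hypothetical, and picking the first free port stands in for the arbitrary direction.

```python
def deflect(preferred: str, arrived_from: str, free_ports: list) -> str:
    """Choose an output port for a flit that must leave the router this cycle."""
    if preferred in free_ports and preferred != arrived_from:
        return preferred  # the productive direction is available
    # Otherwise take any free port, except the one the flit arrived on.
    candidates = [p for p in free_ports if p != arrived_from]
    if not candidates:
        raise ValueError("no admissible output port available")
    return candidates[0]  # 'arbitrary' choice, here simply the first one


assert deflect("east", "west", ["east", "north"]) == "east"
assert deflect("east", "west", ["north", "south"]) == "north"
```

The flit is never buffered in this model; it always leaves in the same cycle, possibly on a port that does not bring it closer to its destination.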
However, adaptive routing algorithms are difficult to prove deadlock free
due to their unpredictability. A flit that is continuously deflected does
not reach its destination, although it is not stuck and still traverses the
network. If this situation occurs, the flit is in a livelock.
Adaptive routing algorithms can be combined with look-ahead routing and
speculative allocation [35]. In look-ahead routing the path a flit needs to
take is calculated in the previous hop. Depending on the factors the
algorithm relies on, this assumes that the previous hop is aware of the
situation in the current router, such as the link utilization. In
speculative allocation the router tries to forward a flit regardless of
whether there is a competitor for the same resource. The router assumes that
no congestion occurs within it and that all flits requesting resources can
be granted them. This skips arbitration steps and hence shortens the
critical path. In case of congestion, no flit can proceed and the
arbitration step is executed.
Both adaptive and deterministic routing algorithms have advantages.
Especially in heavily congested regions an adaptive routing algorithm will
perform better. In periods in which the NoC is not highly utilized, the
deterministic routing algorithm performs on par with an adaptive
counterpart, but requires less complexity, such as a history of link
utilization. In [26] a router is introduced that provides both an adaptive
and a deterministic routing algorithm. Depending on the congestion
conditions in the network, one of the algorithms is chosen to forward the
flits.
Although adaptive routing algorithms sound promising, only a deterministic
approach is implemented in RECONNECT. Apart from a higher complexity,
adaptivity comes with the disadvantage that packets can arrive out of order
and need to be put back into the right order before they can be passed on to
the sink at the destination. Especially in applications in which data is
continuously streamed into the target architecture, such as decoders, the
order becomes important and restoring it a tremendous task, because a method
to distinguish packets is required. Finding a compromise between the
speed-up promised by adaptive routing algorithms and the hardware complexity
with its inherent power and area consumption depends on the applications
that are executed on the architecture which the NoC is integrated into. It
might as well happen that the required serialization latency nullifies all
advantages of the adaptivity.
2.3 Flow Control
Flow control defines how the flits are passed on to the next hop. For
instance, in [33] a crossbar has been implemented that is able to
temporarily boost the bandwidth between an Input Port (IP) and an Output
Port (OP) by integrating a bus connecting all IPs and OPs together. A
similar approach has been introduced in [41], in which the authors observed
a performance increase at the cost of a higher power dissipation and area
consumption by increasing the radix of the router. It can provide multiple
links to the attached sink, so that packets can be ejected faster, avoiding
competition for resources.
On the other hand, Virtual Channels (VCs) [14] are an important mechanism to
prevent deadlocks of flits and to achieve a higher throughput. The routing
algorithms presented in section 4.2 depend on VCs to break up cyclic
dependencies. In section 4.3 the theoretical background is explained in
detail.
In case of packets which are divided into an unequal number of flits due to
differences in their payload sizes, a Virtual Channel Regulator called
ViChaR [45] can adapt the required channel input buffer sizes according to
the sizes of the incoming packets. In wormhole switching, however, there is
no need to store a packet completely before it can be forwarded to the next
hop [15]. The header flit is followed by multiple flits carrying additional
payload, and the whole packet traverses the network like a train. A packet
segmented into multiple flits can therefore be spread over multiple routers.
In contrast to wormhole switching, in which only a fraction of the packet is
stored in the traversed routers, virtual cut-through techniques [32] require
that a router is able to store a packet as a whole. Although flits are also
forwarded immediately after arrival if the availability of resources
permits, the flits belonging to a packet are accumulated in a buffer if
congestion occurs somewhere on the path. The advantage of virtual
cut-through is that in case of congestion the packet does not occupy
valuable resources of several routers as it does in wormhole switching.
However, the power and area consumption is comparable to store-and-forward
switching, in which the packet is always stored in the buffer completely
before it is considered for forwarding, hence resulting in large buffers. A
comparison between wormhole switching and virtual cut-through is summarized
in [54].
The problem of identifying flits which belong together and form one packet
is solved in [51] by adding a locally unique identity tag to each flit. Each
flit can then be considered separately, and flits can be intermixed whenever
a higher link utilization can be achieved. Consider the example in which one
IP sends multiple flits to a specific OP, but after some time the IP runs
out of data before the tail flit arrives due to some congestion in the
previous routers. In the meantime this OP, which is currently not utilized,
could be assigned to another IP until the congestion is resolved and the
first IP is able to send data again. Flits from both IPs can be stored in
the same VC, which makes it necessary to distinguish them by a unique ID.
However, during the execution of real-life applications on the test
architecture introduced in chapter 5, only very brief congestion was
observed. Thus the additional hardware for flit distinction cannot be
justified currently.
A hot topic in NoC research are optical links, which promise a higher
bandwidth and longer wires without a delay penalty. Research is concentrated
on optical networks (ONoC), such as [58], which explains routing of packets
without the need to convert them into electrical signals first. However, no
real-life implementation of an ONoC has been observed yet, due to the
complexity of optical links within the chip and a high entrance barrier
because of its novelty in chip design.
2.4 Impact on RECONNECT
From most of the mentioned publications it is evident that the underlying
communication and its requirements were known before an NoC implementation
was considered. Extensive traffic trace analyses have been performed and the
best compromise between power, area and performance calculated, before an
NoC was developed that exactly fits the specifications. Although some
research work has been done that gives the flexibility of e.g. changing the
topology even during runtime, this flexibility comes at huge power and area
costs and is not considered for production architectures.
However, one goal of RECONNECT is not to restrict its use unreasonably,
resulting in a flexible router architecture that can either be tailored
towards specific demands or kept flexible to run under several conditions.
The aim is to give the system designer a ready-made solution that can be
adapted to a variety of domains by plugging in or stripping off various
functions, which are either customized or readily taken out of a pool. As
mentioned in the next chapter, many ideas which were introduced in the
previous sections are currently implemented in RECONNECT, such as support
for wormhole as well as virtual cut-through switching, VCs, and routing
algorithms which were adapted to support various and multiple topologies
(even post-synthesis). It is the responsibility and the freedom of the
system designer to use the functionality that is appropriate.
2.5 Summary
In this chapter some of the research work published by other research groups
has been introduced. Usually a bottom-up approach has been taken by
extensively analyzing the traffic patterns and developing an NoC meeting the
specifications. In this work another approach is taken (top-down): A
ready-made NoC called RECONNECT is provided and stripped of functionality
until it meets the specifications. Since RECONNECT is generic, many
published methods are implemented in RECONNECT. Moreover, due to the
flexible nature of RECONNECT, the system designer is able to extend the
functionality of the routers with techniques which are not mentioned in the
sections above, or even with yet unpublished work.
Chapter 3
Architecture
This chapter describes the architecture of a RECONNECT router and its
configurations. Multiple configuration parameters provide an easy way to
alter the behavior and functionality of RECONNECT by either changing values
of constants (configuration flexibility) or by replacing modules (module
flexibility). The effects of some of the parameters are not restricted to
RECONNECT or the Fabric, but also depend on other modules of the
architecture into which the Fabric is integrated. Since this architecture is
not known at the time of writing, it is henceforth referred to as the target
architecture.
3.1 Architectural Overview of the Router
The task a router has to fulfill can be easily described: Take the data from
the incoming Input Port (IP), check in which direction it needs to be
forwarded, and pass it on to the next router. A more detailed description is
depicted in figure 3.1.
Figure 3.1: The processing steps a flit encounters while traversing a router.
(Blocks shown between the previous router and the next router: Storing of
incoming Flit in VC, Route Calculation, Collection Data from neighboring
IPs, VC Arbitration, IP/OP Arbitration, Crossbar Traversal, Relative Address
Update, VC Selection for Storage.)
1. The incoming data is stored in one of the Virtual Channels (VCs). The VC
has been calculated by the routing algorithm in the previous router. It is
important to note that the VC cannot be chosen arbitrarily, as described in
detail in chapter 4. Before the flit is stored, it is checked whether the
new flit is a header flit. In this case, the next Output Port (OP) and VC
number are calculated by the routing algorithm and stored along with the
flit.
2. In the next step the neighboring routers report the states of their VCs.
This step ensures that the IP only considers flits which have a chance to be
forwarded. For instance, if IP 2 of the next router reports that all its VCs
are full, all flits which want to be routed towards IP 2 do not need to be
considered. This step creates a bit array of VCs that contain data and have
a chance to be forwarded.
3. An arbiter residing inside the IP chooses one of the requesting VCs, and
the IP reports the desired OP to the router.
4. Multiple IPs might request the same route, hence another arbitration step
(IP/OP arbitration step) is required to resolve this conflict.
5. The IP that won the IP/OP arbitration transmits its flit and deletes it
from the chosen VC. In a multiflit environment the OP is bound to the IP for
the entire duration of the transmission of the packet to prevent
interleaving of flits from other IPs to the same OP.
6. The flit traverses the crossbar and is received at the OP, in which its
relative address is updated in case it is a header flit. In the relative
addressing scheme, the address tuple represents the distance from the
current node to the destination. Since the distance changes when the flit
traverses the Network on Chip (NoC), it needs to be updated at every node
the flit passes. If all elements of the address tuple equal 0, the flit has
reached its destination and is ejected from the network.
After the address has been updated, the flit is passed on towards the next
router.
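
Steps 2 and 6 above can be sketched in a few lines. The following is a behavioural illustration in Python, not the BSV implementation; the function names, the two-element address tuple and the sign convention (the hop just taken is subtracted from the remaining distance) are assumptions made for this example.

```python
def eligible_vcs(vc_has_data: list, downstream_full: list) -> list:
    """Step 2: one bit per VC, set if it holds a flit that has a chance to move on."""
    return [has and not full for has, full in zip(vc_has_data, downstream_full)]


def update_relative_address(addr: tuple, hop: tuple) -> tuple:
    """Step 6: subtract the hop just taken from the remaining distance."""
    return tuple(a - h for a, h in zip(addr, hop))


def reached_destination(addr: tuple) -> bool:
    """The flit is ejected once every element of the address tuple equals 0."""
    return all(a == 0 for a in addr)


# VC 1 holds data, but its target IP in the next router reports full VCs.
assert eligible_vcs([True, True, False], [False, True, False]) == [True, False, False]

# A header flit two hops east and one hop south of its destination.
addr = (2, -1)
addr = update_relative_address(addr, (1, 0))   # one hop east
addr = update_relative_address(addr, (1, 0))   # another hop east
addr = update_relative_address(addr, (0, -1))  # one hop south
assert reached_destination(addr)
```

Since the address is relative, a router only needs the tuple carried by the header flit to decide about ejection; no absolute node coordinates are required in this model.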
In the following section the parameters for the configuration flexibility
are described. From section 3.3 onwards the point of view of the flit is
taken, and the modules which a router consists of are described in detail as
the flit traverses the router.
3.2 Configuration Parameters
This section describes the constants (`defines mentioned in
./trunk/includes/define.bsv) which are evaluated by the Bluespec System
Verilog (BSV) precompiler. With corresponding `ifdef and `else branches the
precompiler removes paths which are never executed. However, in some cases,
such as the evaluation of the number of flits a packet comprises, the value
of a defined constant cannot be calculated at precompilation time and the
if-else branches are not removed. In the next compilation step the Bluespec
compiler evaluates static variables and removes the part of the branch that
is never true.
3.2.1 DEBUG_NOC[1-3]
If one of the DEBUG_NOC defines is set, the routers will print out debugging
messages during simulation. This is achieved by the $display command.
Obviously it is only relevant during simulations and does not have an
equivalent component in hardware. The higher the number of this define, the
more messages are printed. During simulations, printing messages has a
serious impact on performance. Hence the system designer can shorten the
execution time by disabling and commenting out messages which are of no
interest.
3.2.2 ASMUNIT_DEBUG
Similar to DEBUG_NOC[1-3], this define is used to print out debugging
messages for the Assembly Unit (AU) (refer to section 3.3.1). Together with
the provided test bench, it is particularly useful for checking whether the
conversion from the union representing different packet types into the bit
structure of flits and back has been implemented correctly. One of the major
causes of bugs and errors is the miscalculation of the bit field sizes of
the structures within the union. If this define is not set, the AU is
silent.
3.2.3 MULTIFLITSUPPORT
If the routers are integrated in an environment that splits the packet into
multiple flits, this define needs to be set. By doing so, the AU is
automatically added to the injection and ejection ports of the router,
ensuring that the packets are divided into appropriate sizes of NWFLITSIZE
bits (see section 3.2.7). Depending on this setting, the buffers and the
logic controlling the flow of packets comprising multiple flits are
adjusted.
Currently the router supports two flow control mechanisms:
1. The wormhole flow control mechanism means that a header flit determines
the path and can be forwarded immediately without waiting for the tail
flit [15]. The flits of the corresponding packet traverse the routers
similar to a train. The advantage of wormhole routing is that the buffers
do not need to hold the complete packet once congestion occurs. A packet
comprising several flits can therefore be spread over multiple routers.
Since the destination address is carried only in the header flit, the
routing algorithm can calculate the required OP only for that flit. The
information where the following flits need to be sent must therefore be
retained. Thus the VC that has just been selected to serve the head flit
must be bound to the output of the IP module. Further, the IP needs to be
fixed to the OP so that all successive flits can follow their head flit.
However, this binding cannot be broken if the channel runs out of data due
to e.g. congestion in previous hops. The route from the VC to the OP within
the router could not be restored once data is available again. If the
channel does not contain data, the resources are blocked and other flits
are not able to utilize them until a tail flit terminates the binding.
2. Virtual cut-through [32] solves this problem by reassigning unused
resources, resulting in a higher utilization of the available bandwidth.
However, this might lead to packets whose flits are interleaved. The
destination is currently not able to distinguish flits belonging to
different packets. Secondly, the buffer sizes required for virtual
cut-through will be an order of magnitude larger compared to wormhole
routing [17].
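The binding behaviour described for wormhole flow control can be illustrated with a small Python model. This is only a sketch and not the Bluespec implementation: the class name OutputPortBinding, the method try_forward and the representation of flit kinds as strings are assumptions made for this example.

```python
class OutputPortBinding:
    """Toy model (hypothetical) of one OP that is bound to a single IP/VC pair."""

    def __init__(self):
        self.owner = None  # (ip, vc) currently bound to this OP, or None

    def try_forward(self, ip, vc, flit_kind):
        """Return True if the flit may pass the OP.

        flit_kind is one of 'head', 'headtail', 'payload' or 'tail'.
        """
        source = (ip, vc)
        if self.owner is None:
            # Only a flit that carries the destination address can open a route.
            if flit_kind not in ("head", "headtail"):
                return False
            self.owner = source
        elif self.owner != source:
            # The OP stays blocked for all other sources, even if the owner
            # currently has no data to send.
            return False
        if flit_kind in ("tail", "headtail"):
            self.owner = None  # the tail flit terminates the binding
        return True
```

In this model a head flit from a second IP is rejected until the tail flit of the first packet has passed, which is exactly the blocked-resource situation described in item 1.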
If MULTIFLITSUPPORT is set, the format of the flit is converted into a union
that provides header, headtail, payload and tail flits. Headtail flits are
used if the header provides some bits for payload and the whole payload of
the packet fits into them. If MULTIFLITSUPPORT is not set, all packets become
headtail flits and hence the union becomes superfluous.
In theory and for later requirements this mechanism of binding IP/OP
pairs could be exploited to provide virtual circuits between a source and
destination, since the channel is never terminated as long as no tail flit is
sent.
3.2.4 STAGES_OF_ROUTER
The given value defines the number of stages in the butterfly crossbar (see
section 3.6). The number of stages determines how many input and output
ports the crossbar provides. In the current implementation, which supports
topologies up to hexagonal structures, the number of required ports is 6 plus
1 for the attached sink/generator. Hence the number of stages is
$\lceil \log_2 7 \rceil = 3$, which also defines how many bits are required to
determine the route of a flit through the crossbar as described in section
3.6.1.
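As a quick cross-check of this calculation, the following Python sketch computes the number of stages for a given number of neighbours. It is not part of the thesis code; the function name crossbar_stages and its parameters are hypothetical.

```python
def crossbar_stages(neighbours, local_ports=1):
    """Stages of a butterfly crossbar: ceil(log2(total number of ports))."""
    ports = neighbours + local_ports  # e.g. 6 neighbours + 1 sink/generator
    # For ports >= 1, (ports - 1).bit_length() equals ceil(log2(ports)).
    return (ports - 1).bit_length()
```

For the hexagonal case, crossbar_stages(6) yields 3. A fourth stage would only become necessary once the total number of ports exceeds 8.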
3.2.5 IP_VC, FIFO_DEPTH_VC_IP
Here the total number of VCs required for each IP is defined (refer to
section 4.3). This number is closely related to the routing algorithm (refer
to section 4), which needs VCs to provide a deadlock-free traversal of flits
through the network. In case of a routing algorithm that does not depend on
VCs to be deadlock free, this define can be commented out and the provision
of VCs is completely deactivated. Since the VC number is transmitted in the
flit header, these bits are saved and can be used for other purposes such as
an increased payload field in the header flit or a reduced flit size. The IP
that is connected to the sink/generator does not provide any VCs by default,
since a flit is always injected in VC0.
Since the Fabric and its routers are modularized, modules can be replaced
easily as long as the interface format is compatible. In an alternative
implementation the define SIZE_VIRTUAL_CHANNEL was used for the same
purpose. So as not to break the compilation of this earlier implementation,
the defines are kept in place.
Another value related to VCs is the depth of each channel, i.e. how many
flits it can hold. To ease the analysis of performance impacts for various
buffer sizes, the depth can be changed by altering the value of
FIFO_DEPTH_VC_IP.
3.2.6 PORTS, PORTS_MESH, PORTS_HC
This define determines the number of ports that the router has. It allows an
efficient programming style, e.g. by looping through the ports arranged in an
array for initialization purposes. The system developer defines the topology
and the degree of the router here. However, it has to be kept in mind that
with each port added, the complexity of the router implementation increases.
In the implementation described in this work, only PORTS is used and the
remaining defines are ignored. They become important for alternative router
implementations.
3.2.7 NWFLITSIZE
The value given for this define determines the size in bits of the flits sent
through the network and becomes important in multiflit environments (see
section 3.2.3). If MULTIFLITSUPPORT is not defined, NWFLITSIZE is ignored,
since the flit size will be equal to the packet size. It is advantageous to
choose a flit size that allows a few bits of payload to be transmitted in the
flit header itself, so that very small packets such as the Topology
Configuration Packet (refer to section 3.2.11) can be transmitted in a
single flit. A method for determining the appropriate flit size is presented
in section 5.4.
3.2.8 NUM_ADDRESSES_ROUTER
In the environment outside the Fabric, the coordinates of each node within it
comprise two components, since the nodes are arranged in a two-dimensional
grid. However, other topologies with different addressing schemes may
have other requirements. This value reserves space in the header flit for
the number of dimensions as depicted in figure 3.3. A function to convert
the 2D address coming from the target architecture into the appropriate
format is necessary and can easily be included in the AU (refer to section
3.3.1). A conversion back into a 2D address is not required, since a flit is
ejected from the network only if its relative address is (0,0). The sink
does not need to examine this address.
3.2.9 OPSLENGTH
The type of a packet is stored within a union. Bluespec automatically
determines the number of bits required to distinguish the packet types. If
the packet is stored in registers, the size will be
$\text{size} = \text{max\_size}(\text{packet\_type}) + \lceil \log_2(\text{number\_of\_packet\_types}) \rceil$.
Currently there is no method to extract this number of additional bits and
make it available to the system designer. So when it becomes necessary to
convert the packet from the bit field back into a union, the number of bits
needed to distinguish the packet type must be known. This value is defined
in OPSLENGTH and needs to be recalculated when the number of packet types
in the union changes.
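The recalculation of OPSLENGTH can be checked with a few lines of Python that apply the size formula above. The helper name tagged_union_bits and the member widths used in the usage note are made up for illustration and do not reflect the actual packet definitions.

```python
def tagged_union_bits(member_sizes):
    """Return (tag_bits, total_bits) for a union of members with the given widths.

    tag_bits corresponds to the value that would be entered as OPSLENGTH.
    """
    # ceil(log2(number_of_packet_types)), computed without floating point
    tag_bits = (len(member_sizes) - 1).bit_length()
    total_bits = max(member_sizes) + tag_bits
    return tag_bits, total_bits
```

With five hypothetical packet types of 96, 40, 72, 17 and 64 bits, tagged_union_bits([96, 40, 72, 17, 64]) returns (3, 99). Removing one packet type reduces the tag to 2 bits, which is why the define has to be revisited whenever the union changes.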
3.2.10 HEADERSIZE, BITS_SPACE_IN_HEADFLIT
Whereas HEADERSIZE accumulates the size of the address tuples and possibly
also the number of bits needed to store the VC number (refer to section
3.2.5), BITS_SPACE_IN_HEADFLIT returns the number of bits that are left to
store payload in the header flit.
3.2.11 MULTITOPOLOGY
The router supports multiple topologies and can change between them according
to the desired configuration. After a reset all routers examine incoming
packets for the Topology Configuration Packet. Once they are configured, or
if the first packet that arrives is not a Topology Configuration Packet, they
lock themselves and hence cannot change the topology configuration again
until the next reset. The reason is that the unused ports are clock gated and
hence switched off. In case of a fully dynamic topology change during
runtime, a switched-off IP/OP pair would neither accept any data nor forward
data that may still be stored in its buffers. Waiting applications could not
continue with their execution and would stay on the Fabric, occupying
computation resources.
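The configure-once behaviour can be summarised as a small state machine. The following Python sketch is a simplified model under stated assumptions: the class TopologyConfig, the dictionary encoding of packets and the topology names passed in are hypothetical, and it is assumed that a router which locks without being configured stays in the default honeycomb topology.

```python
DEFAULT_TOPOLOGY = "honeycomb"


class TopologyConfig:
    """Toy model (hypothetical) of the configure-once behaviour of a router."""

    def __init__(self):
        self.reset()

    def reset(self):
        # After a reset the router is unlocked and uses the default topology.
        self.topology = DEFAULT_TOPOLOGY
        self.locked = False

    def receive(self, packet):
        """Process one incoming packet, given as a dict (an assumed encoding)."""
        if not self.locked:
            if packet.get("type") == "topology_config":
                self.topology = packet["topology"]
            # Whether configured or not, the first packet locks the router.
            self.locked = True
        return self.topology
```

In this model a second Topology Configuration Packet is ignored until reset() is called, mirroring the lock described above.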
The current implementation does not expect to run multiple different
topologies on the Fabric at the same time. This restriction can be relaxed
if it is guaranteed that traffic does not cross from one topology to another.
The Fabric is then logically divided into subsets and the traffic caused by
an application is restricted to a particular area of the Fabric. It needs to
be kept in mind that the supporting logic responsible for launching
applications onto the Fabric has to be aware of the topology to maximize its
utilization. Otherwise the impact of a richer topology might be negligible,
as observed in the test case REDEFINE, in which the Resource Binder (RB) is
not topology aware so as not to increase its complexity (refer to section
5.2).
In addition, the NoC does not have the notion of a broadcast. Hence,
after the arrival of a Topology Configuration Packet, the routers change
their topology, but at the same time also route the packet according to the
routing algorithm of the honeycomb topology, which is the default one.
If the Fabric has n columns, only n packets with the destination address
(x, y) = (number_cols, 0) that are injected at the Access Routers (ARs) are
required to configure all routers. So at present at most a row-wise and
not a section-wise configuration is allowed.
If MULTITOPOLOGY is not set, the logic necessary to change the topology
is not synthesized and the Topology Configuration Packet does not exist.
The router falls back to the default topology, currently honeycomb. The
main purpose of supporting multiple topologies is to be able to easily check
the impact of a topology on the execution time of an application without
having to recompile the whole architecture. Although it is expected that in
the final design the routers serve only in a single topology, the
functionality has been preserved to allow system designers to test future
topologies and applications.
3.2.12 XBAR
There are different implementations of NoC routers available from several
authors. If the routers described in this work are to be used, this define
needs to be set. It ensures that the Python script multiTopo.py is executed,
generating the Bluespec code which in turn establishes all links among the
routers and Processing Elements (PEs)
(./bluespec/Fabric/Fabric/Fabric.bsv). A look into multiTopo.py reveals that
the Fabric can be configured further:
• Generate a Fabric that allows artificial traffic generators to be connected.
The Fabric becomes a self-sustaining stand-alone module allowing new
modules such as routing algorithms to be tested. Artificial traffic
generators written in C and compiled into the Bluespec code allow a
fine-grained configuration of test cases. These traffic generators read the
traffic patterns from configuration files, hence avoiding a time-consuming
recompilation of the whole Fabric for other test cases. Due to the C code
generators, this Fabric cannot be synthesized into Verilog and runs only
as a Bluesim simulator (i.e. an executable for an i686 or x64 PC
architecture).
• Usage of routers which comprise multiple pipelined stages as described
in section 3.6.
• Single cycle routers running at the same clock speed as the remaining
modules of the target architecture.
3.2.13 EJECTPORT and (.*)(_FABRIC)?
These defines provide a human-readable representation of the direction
numbering to ease the implementation and avoid programming mistakes.
There are two different sets of defines:
1. Defines that represent the direction within the router.
2. Defines ending with _FABRIC, which are only used in the Fabric and which
are decremented by one compared to the corresponding direction definition
for the router.
Within the router the routing algorithm determines where flits have
to be forwarded to. To accomplish this, the algorithm returns a number
which represents the heading of the OP. All OPs that are connected to
neighbors are stored in an array. Logically, the number returned by the
routing algorithm represents the index into this array. However, the OP to
which the sink/generator is connected differs and is not included in that
array. This OP is exported by the router through a separate interface. Hence
the index of a particular OP outside the router does not match the array
index for the same OP used inside the router and its routing algorithm.
Defines ending with _FABRIC need to be used to connect the routers to form a
specific topology, whereas defines not ending with _FABRIC need to be used
to implement a new routing algorithm.
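The relation between the two sets of defines can be sketched as a simple index conversion. The following Python fragment is only an illustration: it assumes that the eject port carries the router-internal number 0 and that the neighbor directions follow from 1 upwards, neither of which is stated explicitly here, and the function name fabric_index is hypothetical.

```python
EJECTPORT = 0  # assumption: router-internal number of the sink/generator port


def fabric_index(router_direction):
    """Convert a router-internal direction into its _FABRIC counterpart.

    The _FABRIC value is the router-internal value decremented by one.
    """
    if router_direction == EJECTPORT:
        # The OP of the sink/generator is exported through a separate
        # interface and therefore has no index on the Fabric side.
        raise ValueError("the eject port has no _FABRIC index")
    return router_direction - 1
```

Under these assumptions the first neighbor direction inside the router (1) maps to index 0 on the Fabric side, and the sixth (6) maps to index 5.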
3.2.14 FLITNW_FIFOSIZE_ASMUNIT
This define determines the depth of the first-in-first-out (FIFO) buffer
that stores the flits of the segmented packet in the AU. The larger the
buffer, the better congestion in the NoC can be compensated for, since the
attached generator can continue sending packets. A smaller FIFO results in
lower area and power figures for the AU.
3.2.15 CE_CLK and SUPPORTLOGIC_CLK
Since the routers are comparatively small logical units, they can probably
run at higher clock frequencies than the remaining modules of the target
architecture. Higher frequencies help to hide the latency that is experienced
between the submission of a packet into the NoC and its arrival at the
destination.
Both defines represent clock dividers, meaning that the given value is
used as a divisor to derive a new clock from the original one. This setting
highly depends on the target architecture. In the case of REDEFINE, CE_CLK is
the clock for the Compute Elements (CEs), whereas SUPPORTLOGIC_CLK is the
one for the Support Logic (SL). The original clock, which has the highest
frequency, is given to the NoC. However, synchronizers are not yet
implemented at the boundaries of the modules, so that setting these two
defines currently does not have any effect.
3.3 Input Port (IP) (InputPort.bsv)
[Figure 3.2: block diagram of a router showing the modules Assembly Unit, IP,
VC Arbiter, Routing, Addr. Update, IP/OP Arbiter, Control and OP.]
Figure 3.2: An architectural overview of the several modules of a router.
As can be observed in figure 3.2, the first module a flit encounters is an
IP. There are two different kinds of IPs that provide connectivity to the
router:
1. One is designed to be connected to the sink/generator, which can
be a PE or another module of the target architecture. It includes the
segmentation of packets into flits and their reassembly in multiflit
environments. This functionality has been implemented in a separate
module called AU (refer to section 3.3.1).
2. The other kind of IP is directly connected to the OPs of the neighboring
routers (see section 3.3.2). It provides the functionality of VCs
including the arbitration by Matrix Arbiters. If VCs are not required,
each IP is reduced to a simple buffer of a predefined depth and
a routing algorithm. Since there are no VCs anymore, the VC
arbitration step also becomes superfluous.
3.3.1 Assembly Unit (AU) (AssemblyUnit.bsv)
The AU is always instantiated, regardless of the configuration given by
the defines (refer to section 3.2). However, the provided functionality
differs if the router is not used in a multiflit environment. In that case
the AU basically only provides an interface that is compatible with the
target architecture. As mentioned in section 1.3, the NoC relies on the
property that a rule does not fire if any of the implicit or explicit
conditions that lead to the firing of the rule is false. For example, if a
rule could fire and it contains a component that writes into a FIFO, but at
the same time the FIFO is full, the rule will not fire. The readiness of the
FIFO buffer becomes an implicit condition for firing this particular rule.
This mechanism is used e.g. to transfer data from the OP to the IP. The rule
representing the OP will not transmit data if the IP cannot accept it. Thus
sending Acknowledgements (ACKs) back and forth can be omitted.
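The effect of such an implicit condition can be imitated in software. The following Python sketch models a rule that moves one flit from an OP into the FIFO of the next IP and simply does not fire when the FIFO is full. It is an analogy only: BoundedFifo and op_rule are hypothetical names, and in Bluespec the compiler derives the condition from the FIFO interface instead of an explicit check.

```python
from collections import deque


class BoundedFifo:
    """Minimal FIFO with a fixed depth (stand-in for the buffer of an IP)."""

    def __init__(self, depth):
        self.depth = depth
        self.items = deque()

    def can_enq(self):
        return len(self.items) < self.depth

    def enq(self, item):
        if not self.can_enq():
            raise RuntimeError("enq on a full FIFO")
        self.items.append(item)


def op_rule(out_buffer, next_ip_fifo):
    """Try to fire the rule once; return True if a flit was transferred."""
    # Explicit condition: the OP has a flit to send.
    # Implicit condition: the FIFO of the receiving IP can accept data.
    if not out_buffer or not next_ip_fifo.can_enq():
        return False  # the rule does not fire, no ACK/NACK is exchanged
    next_ip_fifo.enq(out_buffer.popleft())
    return True
```

Calling op_rule repeatedly with a FIFO of depth 2 and three waiting flits transfers two flits and then keeps returning False until space becomes available, without any flit being lost.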
As described earlier, the IP has an additional mechanism to choose only
those flits in the VCs that can be routed (i.e. space is available in the
receiving IP of the next router). Hence the situation in which a rule would
transmit a flit to a full IP does not occur in the first place.
However it might be the case that the surrounding support logic or the
attached PEs do not follow this protocol. For instance in REDEFINE (refer to
chapter 5) the SL and CEs write to wires first, which are always writable and
cannot block rules. Hence the SL expects an ACK, if the storage of data was
successful in the intended module. The AU translates the protocol supported
by the SL and CEs into a compatible protocol for the router and vice versa.
If the router is embedded in a multiflit environment, the task of the AU is
extended to segmenting a packet into flits. Depending on the available bus
width and the size of the packet, multiple flits are generated. As long as
flits are being generated, the AU marks itself as busy and does not send back
ACKs if new data is to be stored in the FIFO at its input port.
As can be observed in figure 3.3, the packet structure is not ideally suited
to being split into multiple flits: the header is aligned to the MSB of the
packet whereas the payload is aligned to the LSB, leaving unused bits right
in the middle. Hence converting the whole payload of the packet into flits is
not a simple shift operation. This format is forced by the Bluespec compiler.
Bluespec is type sensitive, meaning that a variable declared as int (a signed
integer of 32 bits width) cannot be used as an Int#(4) (a signed integer of
4 bits width). The type int needs to be cast into Int#(4) by converting it
into a bit field of 32 bits (pack function), truncating it (truncate) and
casting it back into the Int#(4) type (unpack function). This is merely a
change of data representation that does not affect any hardware eventually
generated. Bluespec also allows the pack/unpack functions to be overloaded by
self-defined functions, e.g. to align the flit header and the payload to the
MSB. However, it has been observed that in that case additional hardware is
indeed generated. To avoid this unnecessary hardware, the default pack/unpack
functions are used, leaving unused bits inconveniently in the middle.
[Figure 3.3: bit-level layout of an example packet, labeled Packet Header and
Packet Payload = Instruction Packet, with the field labels x Address,
y Address, slotNo, opCode, unused, template bits, Data Type, No of
Destinations, x Address of Destination, y Address of Destination, OPS, Signed,
Predicate Expected, SlotNo of Destination, Destination 2 and Destination 3.]
Figure 3.3: An example of a packet. Here the packet type is an Instruction
Packet of REDEFINE. The payload bits at the LSB and the template bits at the
MSB are separated by unused bits (gray field). Bluespec initializes unused
bits with 0xa in the simulator.
The AU consists of a state machine that processes incoming packets; the
procedure is depicted in figure 3.4. After a packet has been accepted
by the AU, the number of flits to be generated is first chosen from a look-up
table depending on the type of the packet. This value is stored in counter.
The look-up table is static and is precalculated during compilation of the
target architecture such as REDEFINE (refer to section 5) in
FlitStructureSizes.bsv. If the packet format changes and the architecture
is recompiled, the number of required flits is adapted automatically.
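How such a look-up table might be precalculated can be sketched in Python. The sketch is an assumption-laden illustration and not the content of FlitStructureSizes.bsv: it assumes that the head flit of a multiflit packet also carries BITS_SPACE_IN_HEADFLIT bits of payload, that every following flit carries a fixed number of payload bits, and the packet names and sizes are invented.

```python
def flits_needed(payload_bits, bits_space_in_headflit, payload_bits_per_flit):
    """Number of flits for one packet under the assumptions stated above."""
    if payload_bits <= bits_space_in_headflit:
        return 1  # the whole payload fits into a single headTail flit
    remaining = payload_bits - bits_space_in_headflit
    # Head flit plus ceil(remaining / payload_bits_per_flit) further flits.
    return 1 + -(-remaining // payload_bits_per_flit)


# Hypothetical packet types with made-up payload sizes in bits.
PACKET_PAYLOAD_BITS = {"topology_config": 4, "instruction": 70, "data": 130}

# Static table, computed once (here with 8 payload bits in the head flit
# and 32 payload bits in every following flit).
FLIT_COUNT_LUT = {
    name: flits_needed(bits, 8, 32)
    for name, bits in PACKET_PAYLOAD_BITS.items()
}
```

With these invented numbers the table maps topology_config to 1 flit, instruction to 3 flits and data to 5 flits. Changing a payload size and rebuilding the table corresponds to the automatic adaptation after recompilation mentioned above.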
In the second stage the head flit is generated, or a headTail flit if the
payload of the packet fits completely into the payload field of the header flit.
If
‘NUM_ADDRESSES_ROUTER
(refer to section 3.2.8) is greater than the di-
mension of the address array that is used by the target archite