Network-on-Chip Service Guarantees
on the Kalray MPPA-256 Bostan Processor
Benoît Dupont de Dinechin and Amaury Graillat
Kalray SA
445 rue Lavoisier
F-38330 Montbonnot, France
The Kalray MPPA-256 Bostan manycore processor imple-
ments a clustered architecture, where clusters of cores share
a local memory, and a DMA-capable network-on-chip (NoC)
connects the clusters. The NoC implements wormhole switch-
ing without virtual channels, with source routing, and can be
configured for maximum flow rate and burstiness at ingress.
We describe and illustrate the techniques used to configure
the MPPA NoC for guaranteed services. Our approach is
based on three steps: global selection of routes between end-
points and computation of flow rates, by solving the max-
min fairness with unsplittable path problem; configuration
of the flow burstiness parameters at ingress, by solving an
acyclic set of linear inequalities; and end-to-end latency up-
per bound computation, based on the principles of separated
flow analysis (SFA). In this paper, we develop the two last
steps, taking advantage of the effects of NoC link shaping
on the leaky-bucket arrival curves of flows.
Keywords: Network-on-Chip, Deterministic Network Calculus,
Link Traffic Shaping, Separated Flow Analysis
1.1 Motivation and Problem
The Kalray MPPA-256 Bostan manycore processor is de-
signed for timing predictability and energy efficiency [1]. Its
cores are clustered into compute units sharing a local mem-
ory, and these clusters are connected through a network-on-
chip (NoC). The MPPA NoC implements wormhole switch-
ing with source routing and supports guaranteed services
through the configuration of flow injection parameters at the
NoC interface: the maximum rate ρ; the maximum bursti-
ness σ; the minimum and the maximum packet sizes.
The MPPA-256 processors are used in time-critical appli-
cations [2], where a set of cyclic or sporadic tasks interact
through a static communication graph. When application
tasks are mapped to the cores, tasks communicate either
through the local memory if co-located on the same cluster,
or through the NoC that supports remote DMA. Ensuring
the worst-case response times (WCRT) of time-critical appli-
cations requires upper approximations of worst-case execu-
tion times (WCET) for the computations and of worst-case
traversal times (WCTT) for the remote communications.

AISTECS 2017, Stockholm, Sweden. ACM ISBN 978-1-4503-2138-9. DOI: 10.1145/1235
In this work, we address the problem of computing end-
to-end latency upper bounds for on-chip communications,
assuming that the tasks have been assigned to cores inside
clusters, and thus to NoC nodes. We base our approach on
Deterministic Network Calculus (DNC), which was originally
developed for the performance evaluation of Internet and of
ATM networks [3]. Since then, DNC has been applied to
flow regulation of NoCs [4][5], and is used to guarantee the
services of the AFDX Ethernet-based avionics networks [6].
Application of DNC to a NoC assumes that routes be-
tween endpoints are predetermined, and that flow injection
parameters are known or enforced. Moreover, key DNC re-
sults only apply to feed-forward networks, that is, networks
where the graph of unidirectional links traversed by the set
of flows has no cycles [3]. A related assumption is that the
hop-to-hop flow control of the NoC is not active, that is, the
traffic is such that the router queues always accept incoming
flits. Figuratively speaking, queues must not overflow.
1.2 Contributions and Related Work
Our contributions are as follows. At the time scale of NoC
operations, there are no instantaneous packet arrivals, so the
(σ, ρ) characterization of flows used in macro networks must
be refined by taking into account the flit serialization effects
of links, both inside and between the NoC routers. These ef-
fects appear significant for the increase of the flow burstiness
between hops and for the computation of the end-to-end la-
tency bound. A related approach is to shape individual flows
at ingress, then propagate the resulting arrival curves across
the network elements [6]. However, this approach is more
relevant for an AFDX network than for a NoC.
For the service offered to a queue by the link arbiter, we
select either the rate and latency ensured by the round-
robin packet scheduling, or the residual service guaranteed
by blind multiplexing when the former does not apply. Previous
works use only blind multiplexing [7] or only FIFO
multiplexing [8]; however, the latter does not apply to the
MPPA NoC link arbiters, since they do not keep the FIFO
service order between packets from different queues.
Compared to earlier work of applying DNC to the NoC of
the first-generation MPPA Andey processor [9], we decouple
the computation of the flow rates and burstiness and obtain
significantly better rates and end-to-end latency bounds.
2.1 Overview of MPPA NoC Management
There are two objectives of the MPPA NoC management.
First, ensure deadlock-free operations, which is a concern
with wormhole switched networks. Second, in case the com-
munication flows are static, configure the routes and packet
injection parameters so that minimum flow rates and maxi-
mum end-to-end latencies can be guaranteed. These two ob-
jectives are connected by the following insight: on a worm-
hole switched network, a static deadlock-free routing algo-
rithm ensures that the flows are feed-forward. Indeed, avoid-
ing deadlocks can be ensured by using a routing scheme
that prevents circuits in the resource dependence graph [10].
With wormhole switching, the resources are the router internal
or external links, whose dependences follow the packet paths.
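This connection between deadlock freedom and the feed-forward property can be checked mechanically. A minimal sketch (illustrative link names, not the actual MPPA configuration tooling): build the dependence graph between consecutive links of every routed path and verify it has no circuit:

```python
def is_feed_forward(paths):
    """paths: one sequence of unidirectional links per flow, e.g.
    ("0E", "2S", "10L").  Builds the link dependence graph (an edge
    between consecutive links of any path) and checks it is acyclic."""
    succ = {}
    for path in paths:
        for a, b in zip(path, path[1:]):
            succ.setdefault(a, set()).add(b)
    WHITE, GREY, BLACK = 0, 1, 2
    color = {}
    def dfs(u):
        color[u] = GREY
        for v in succ.get(u, ()):
            c = color.get(v, WHITE)
            if c == GREY:                 # back edge: a circuit exists
                return False
            if c == WHITE and not dfs(v):
                return False
        color[u] = BLACK
        return True
    return all(dfs(u) for u in list(succ) if color.get(u, WHITE) == WHITE)
```

A routing scheme that prevents circuits in this graph thus guarantees both deadlock freedom and the feed-forward assumption required by the DNC results used below.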
We assume that application tasks have been assigned end-
points on the NoC. A static deadlock-free routing algorithm
with path diversity, in our case the Hamiltonian Odd-Even
turn model [11], is used to compute a set of paths between
the endpoints. In a first optimization problem, we glob-
ally select a single path for each endpoint with the max-min
fairness objective [12]. This objective corresponds to maxi-
mizing the lowest flow rate, then the next lowest flow rate,
etc. Solving this optimization problem yields the flow rates.
We then formulate a second problem, which is the topic
of this paper. The objective is to compute the flow bursti-
ness parameters at ingress, in order to ensure that the NoC
router queues do not overflow. This entails propagating the
individual and aggregate flow parameters across the relevant
network elements, which are the outgoing link arbiters that
serve the packets in round-robin across the adjacent ’turn’
queues. Replacing the flow rates by their values from the
first optimization problem yields a system of linear inequal-
ities with the flow burstiness variables. Thanks to feed-
forward flows, the system is acyclic and solved in one pass.
Finally, we obtain upper bounds on the end-to-end laten-
cies between endpoints by applying the principles of sepa-
rated flow analysis (SFA) [7], where a left-over service curve
is computed at each network element for the flow of inter-
est. These left-over service curves are combined by (min,+)
convolution, yielding the end-to-end service curve provided
by the network for this flow. Computing the maximum hor-
izontal deviation between the flow arrival curve at ingress
and this end-to-end service curve yields the upper bound on
the end-to-end latency for the flow.
2.2 The MPPA-256 Bostan Processor
The Kalray MPPA-256 Bostan processor is the second im-
plementation of the MPPA (Massively Parallel Processing
Array) architecture, manufactured with the TSMC CMOS
28HP process. The MPPA architecture belongs to the ‘many-
core’ family, characterized by a large number of cores and
the architecturally visible aggregation of resources into com-
pute tiles (DSP ‘core packs’, GPU ‘streaming multiproces-
sors’, OpenCL ‘compute units’) with local memory, and the
exploitation of a network-on-chip for their interaction.
The MPPA-256 Bostan processor integrates 256 process-
ing cores and 32 management cores on a single chip, all based
on the same 32-bit/64-bit VLIW core architecture. Its ar-
chitecture is clustered with 16 compute clusters and 2 I/O
clusters, where each cluster is built around a multi-banked
local static memory shared by the 16+1 (compute cluster)
or 4+4 (I/O cluster) processing + management cores.

Figure 1: MPPA-256 Bostan clusters and NoC.
Figure 2: Structure of a MPPA NoC router.

Fig. 1
illustrates the MPPA-256 Bostan clusters, with I/O clusters
in light blue and the compute clusters in dark blue. The
clusters communicate through the NoC, with one node per
compute cluster and 8 nodes per I/O cluster.
The MPPA-256 Bostan NoC is a direct network with a 2D
torus topology extended with extra links connected to the
otherwise unused links of the NoC nodes on the I/O clusters.
The NoC implements wormhole switching with source rout-
ing and without virtual channels. With wormhole switching,
a packet is decomposed into flits (32-bit), which travel in a
pipelined fashion across the network elements, with buffering
applied at the flit level. The packet follows a route deter-
mined by a bit string in the header.
The key network elements are the link arbiters inside the
routers (Fig. 2), which schedule whole packets for transmis-
sion by applying a round-robin selection on adjacent queues.
There is one queue per incoming direction or ’turn’. We call
a queue active if there is some flow passing through it and
there is another queue in the same link arbiter with some
flow. A non-active queue either has no flow passing through
it, or is the only one in the link arbiter with some flow. If so,
there are no effects on the packets beyond a constant delay,
and the queue can be ignored except for this constant delay
that needs to be added to the end-to-end latency bound.

Figure 3: Flow arrival and departure as cumulative data functions over time.
Figure 4: Arrival curve α for function A.
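The active-queue distinction can be sketched as follows (hypothetical queue and arbiter names; flows are given as the sequence of queues they traverse):

```python
from collections import defaultdict

def active_queues(flows, arbiter_of):
    """flows: {flow: [queue, ...]}; arbiter_of: queue -> link arbiter id.
    A queue is active if some flow passes through it AND another queue
    of the same link arbiter also carries some flow."""
    load = defaultdict(set)            # queue -> flows through it
    for f, queues in flows.items():
        for q in queues:
            load[q].add(f)
    by_arbiter = defaultdict(set)      # arbiter -> queues carrying flow
    for q in load:
        by_arbiter[arbiter_of[q]].add(q)
    return {q for q in load if len(by_arbiter[arbiter_of[q]]) > 1}
```

Only the active queues need a service-curve analysis; the others contribute a constant delay term.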
2.3 Deterministic Network Calculus (DNC)
Network Calculus [3] is a framework for performance anal-
ysis of networks based on the representation of flows as cu-
mulative data over time. Given a network element with
cumulative arrival and departure functions A and A′, the
delay for traversing the network element and the backlog of
data it has to serve correspond respectively to the horizontal
and the vertical deviations between A and A′ (Fig. 3).
Network Calculus exploits the properties of the (min,+)
algebra; in particular, it introduces the following operations:

convolution: (f ⊗ g)(t) ≜ inf_{0≤s≤t} ( f(s) + g(t−s) )
deconvolution: (f ⊘ g)(t) ≜ sup_{s≥0} ( f(t+s) − g(s) )

Let A be a cumulative data function. An arrival curve
α is a constraint on A defined by ∀ 0 ≤ s ≤ t : A(t) −
A(s) ≤ α(t−s), which is equivalent to A ≤ A ⊗ α. Fig. 4
illustrates the smoothing constraint of the arrival curve α on
the cumulative function A. This particular type of arrival
curve α(t) = (ρt + σ)1_{t>0}, known as affine or leaky-bucket,
is denoted γ_{ρ,σ}, with ρ the rate and σ the burstiness.
Let A, A′ be the cumulative arrival and departure func-
tions of a network element. This element has β as a service
curve iff ∀t ≥ 0 : A′(t) ≥ inf_{0≤s≤t} ( A(s) + β(t−s) ), which is
equivalent to A′ ≥ A ⊗ β. Fig. 5 illustrates the guarantee of-
fered by the service curve β. This particular type of service
curve β(t) = R[t − T]^+, known as rate-latency, is denoted
β_{R,T}, with R the rate and T the latency.

Figure 5: Service curve β for a server A → A′.

Key results of Deterministic Network Calculus include:
- The arrival curve of the aggregate of two flows is the
sum of their arrival curves.
- A flow A(t) with arrival curve α(t) that traverses a
server with service curve β(t) results in a flow A′(t)
constrained by the arrival curve (α ⊘ β)(t).
- The service curve of a tandem of two servers with
respective service curves β1(t) and β2(t) is (β1 ⊗ β2)(t).
- If a flow has arrival curve α(t) on a node that offers
service curve β(t), the backlog bound is max_{t≥0}( α(t) −
β(t) ) and the delay bound is max_{t≥0}{ inf { s ≥ 0 : α(t) ≤
β(t+s) } }. These two bounds are respectively the max-
imum vertical deviation and maximum horizontal de-
viation between α(t) and β(t).
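For intuition, the two bounds of the last item can be evaluated numerically for a leaky-bucket arrival curve γ_{ρ,σ} and a rate-latency service curve β_{R,T}; the following sketch samples the time axis rather than using the closed forms derived later:

```python
def gamma(rho, sigma):
    """Leaky-bucket arrival curve gamma_{rho,sigma}."""
    return lambda t: rho * t + sigma if t > 0 else 0.0

def beta(R, T):
    """Rate-latency service curve beta_{R,T}."""
    return lambda t: R * max(t - T, 0.0)

def backlog_bound(alpha, srv, horizon=50.0, step=0.01):
    """Maximum vertical deviation between alpha and srv (sampled)."""
    ts = [i * step for i in range(1, int(horizon / step))]
    return max(alpha(t) - srv(t) for t in ts)

def delay_bound(alpha, srv, horizon=50.0, step=0.01):
    """Maximum horizontal deviation: for each t, the smallest s >= 0
    such that srv(t + s) >= alpha(t)."""
    ts = [i * step for i in range(1, int(horizon / step))]
    best = 0.0
    for t in ts:
        a, s = alpha(t), 0.0
        while srv(t + s) < a:
            s += step
        best = max(best, s)
    return best
```

For ρ < R, the sampled bounds converge to the closed forms σ + ρT (backlog) and T + σ/R (delay).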
For any flow f_i, the solution of the routing problem yields
its unique path and its rate ρ_i. We now consider the for-
mulation of the second problem: the computation of the flow
burstiness parameter σ_i at ingress for each flow. Not only
is this problem linear, thanks to the now-determined values
of the flow rates, but we also account for the effects of link
shaping: due to the NoC router architecture, any internal
or external link has a maximum rate r = 1 flit/cycle.
3.1 Effects of Link Shaping on Queues
Let Fjdenote the set of flows passing through an active
queue qj. Assume that the arrival curve of each flow fi
Fjis of leaky-bucket type with rate ρj
iand burstiness σj
Then qjreceives a leaky-bucket shaped flow γρjjwith ρj=
iand σj=PfiFjσj
i. However, the maximum
queue filling rate is r. So the arrival curve for qjis the
convolution of curves λ(t),rt and α(t),(σj+ρjt)1t>0,
which is their minimum as both curves are concave and pass
through the origin [7]. The time where the line ρt meets the
affine curve (σj+ρj)1t>0is τj,σj
rρj(Fig. 6).
Further assume that the link arbiter offers a rate-latency
service curve β_{R_j,T_j} to queue q_j, with R_j ≤ r. The backlog b_j
of q_j is the maximum vertical deviation between the arrival
curve and the service curve. It has two values, depending
on the comparison between τ_j and T_j. If τ_j ≤ T_j then
b_j = σ^j + ρ^j T_j (Fig. 6 left), else b_j = rτ_j − (τ_j − T_j)R_j
(Fig. 6 center). As τ_j ≥ T_j ⟺ σ^j ≥ (r − ρ^j)T_j:

b_j = σ^j + ρ^j T_j                           if σ^j ≤ (r − ρ^j)T_j    (1)
b_j = ((r − R_j)/(r − ρ^j)) σ^j + R_j T_j     if σ^j ≥ (r − ρ^j)T_j    (2)

The delay d_j for queue q_j is the maximum horizontal de-
viation between the arrival curve and the service curve. As
r ≥ R_j, this maximum is reached when t = τ_j on the arrival
curve (Fig. 6 right). Let δ_j be the time when the service curve
reaches the height of the arrival curve at time τ_j. This implies
rτ_j = (δ_j − T_j)R_j. As the delay d_j = δ_j − τ_j, we get:

d_j = T_j + σ^j (r − R_j) / (R_j (r − ρ^j))    (3)
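Equations (1)-(3) translate directly into a small helper (a sketch; parameter names mirror the text):

```python
def shaped_backlog_delay(sigma_j, rho_j, R_j, T_j, r=1.0):
    """Backlog (Eqs. (1)-(2)) and delay (Eq. (3)) bounds of queue q_j,
    which receives gamma_{rho^j,sigma^j} shaped by a link of peak rate r
    and is served by a rate-latency curve beta_{R_j,T_j}, rho^j < R_j <= r."""
    assert rho_j < R_j <= r
    if sigma_j <= (r - rho_j) * T_j:                         # tau_j <= T_j
        b = sigma_j + rho_j * T_j                            # Eq. (1)
    else:                                                    # tau_j >= T_j
        b = (r - R_j) / (r - rho_j) * sigma_j + R_j * T_j    # Eq. (2)
    d = T_j + sigma_j * (r - R_j) / (R_j * (r - rho_j))      # Eq. (3)
    return b, d
```

At the boundary σ^j = (r − ρ^j)T_j, Eqs. (1) and (2) coincide, so the branch choice is immaterial there.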
Figure 6: Effects of link shaping on backlog (left, center) and delay (right).
Figure 7: Shaping constraint on ingress burstiness.

3.2 Link Arbiter Service Curves
Let n_j be the number of active queues in the link arbiter,
including q_j. From the point of view of an active queue q_j,
the round-robin scheduler of the link arbiter ensures that one
packet is transmitted every round. With l_max the maximum
packet size and r the link rate, the maximum latency seen
by q_j is (n_j − 1) l_max / r, while the long-term rate for q_j is
r / n_j [7]. This yields a first service curve β_j = β_{R_j,T_j} for q_j:

R_j = r / n_j and T_j = (n_j − 1) l_max / r    (4)

However, this service curve may be overly constraining.
To see how, consider a case where n_j = 2 and there are
three flows with rates ρ_a = ρ_b = ρ_c = r/3. There must be
two flows passing through one queue and one flow through
the other queue. As a result, there is one queue with service
rate r/2 which is traversed by a total flow rate of ρ_a + ρ_b = 2r/3.
Another approach to derive a service curve for queue q_j
is to consider that the round-robin scheduler serves packets
at peak rate r according to a blind multiplexing strategy
across the queues. In that case, we may apply Theorem
6.2.1 (Blind Multiplexing) [3]:

Theorem. Consider a node serving two flows, 1 and 2, with
some unknown arbitration between the two flows. Assume
that the node guarantees a strict service curve β to the ag-
gregate of the two flows. Assume that flow 2 is α2-smooth.
Define β1(t) ≜ [β(t) − α2(t)]^+. If β1 is wide-sense increas-
ing, then it is a service curve for flow 1.

In particular, if β = β_{R,T} and α2 = γ_{ρ2,σ2}, the first flow
is guaranteed a service curve β_{R′,T′} with R′ = R − ρ2 and
T′ = T + (σ2 + T ρ2)/(R − ρ2) [3]. Let A_j be the set of active
queues in the link arbiter of q_j, and B_j ≜ A_j − {q_j}. Application
with ρ2 = Σ_{k∈B_j} ρ^k, σ2 = Σ_{k∈B_j} σ^k, R = r and T = 0 yields
another service curve β_j = β_{R_j,T_j} for q_j:

R_j = r − Σ_{k∈B_j} ρ^k and T_j = (Σ_{k∈B_j} σ^k) / (r − Σ_{k∈B_j} ρ^k)    (5)

In principle, the choice between equations (4) or (5) at
each server should be deferred until evaluation. In practice,
we heuristically select Eq. (4) in case ρ^j ≤ r/n_j, with ρ^j =
Σ_{f_i∈F_j} ρ_i the sum of flow rates inside q_j.
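The resulting selection rule can be sketched as follows, assuming each link arbiter is described by the aggregate (ρ, σ) of each of its active queues (illustrative data layout, not the MPPA configuration API):

```python
def link_arbiter_service(q, queues, l_max=17.0, r=1.0):
    """Rate-latency service curve (R_j, T_j) offered to active queue q.
    queues: {name: (rho_sum, sigma_sum)} for the active queues of the
    same link arbiter (at least two, since q is active).  Heuristic from
    the text: round-robin curve (Eq. 4) when q's aggregate rate fits its
    round-robin share, else blind-multiplexing residual curve (Eq. 5)."""
    n = len(queues)                              # n_j active queues
    rho_q, _ = queues[q]
    if rho_q <= r / n:                           # Eq. (4): round-robin
        return r / n, (n - 1) * l_max / r
    others = [queues[k] for k in queues if k != q]
    rho2 = sum(p for p, _ in others)             # Eq. (5): blind mux
    sig2 = sum(s for _, s in others)
    return r - rho2, sig2 / (r - rho2)
```

The deferred alternative mentioned in the text would keep both curves per queue and pick the better one during evaluation.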
3.3 Flow Burstiness Parameters
Let lmax be the maximum packet size, which we assume
the same for all flows. At ingress, whole packets are atomi-
cally injected at rate r. Call θthe date when injection ends
(Fig. 7). We have =lmax and lmax σi+ρiθ, so:
i,lmax rρi
Then, consider the constraints that the router queues must
not overflow, given bmax the size in flits of any queue. Queue
qjreceives a leaky-bucket shaped flow γρjjwith ρj=
iand σj=PfiFjσj
i, shaped by the link at rate
r, so equations (1) and (2) apply for the backlog bj. Router
queue do not overflow if qjQ:bjbmax, so:
bmax σj+ρjTjif σj(rρj)Tj(7)
bmax rRj
rρjσj+RjTjif σj(rρj)Tj(8)
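Equations (6)-(8) likewise translate into two small helpers (a sketch, with b_max the queue capacity in flits):

```python
def sigma_min(rho_i, l_max=17.0, r=1.0):
    """Eq. (6): minimum ingress burstiness so that a whole packet of
    l_max flits can be injected atomically at link rate r."""
    return l_max * (r - rho_i) / r

def fits_in_queue(sigma_j, rho_j, R_j, T_j, b_max, r=1.0):
    """Eqs. (7)-(8): the backlog bound of q_j must not exceed b_max."""
    if sigma_j <= (r - rho_j) * T_j:
        return sigma_j + rho_j * T_j <= b_max                      # Eq. (7)
    return (r - R_j) / (r - rho_j) * sigma_j + R_j * T_j <= b_max  # Eq. (8)
```

Substituting the known flow rates leaves Eqs. (7)-(8) linear in the burstiness variables, which is what makes the overall system solvable in one pass.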
We now express the values ρ_i^j and σ_i^j for all flows f_i ∈ F_j,
for an active queue q_j. Obviously, ρ_i^j = ρ_i, while σ_i^j = σ_i
if q_j is the first active queue traversed by the flow. Else,
let q_k be the predecessor of q_j in the sequence of active queues
traversed by flow f_i, with β_{R_k,T_k} its service curve. When
flow f_i traverses queue q_k, its burstiness increases differently
whether it is alone or aggregated with other flows in q_k.

If the flow is alone in queue q_k, we apply the classic result
of the effects of a rate-latency service curve β_{R,T} on a flow
constrained by an affine arrival curve γ_{ρ,σ}. The result is
another affine arrival curve γ_{ρ,σ+ρT} [3], so:

σ_i^j = σ_i^k + ρ_i T_k    (9)
Else, we apply Theorem 6.2.2 (Burstiness Increase
Due to FIFO Multiplexing, General Case) [3]:

Theorem. Consider a node serving two flows, 1 and 2, in
FIFO order. Assume that flow 1 is constrained by one leaky
bucket with rate ρ1 and burstiness σ1, and flow 2 is con-
strained by a sub-additive arrival curve α2. Assume that
the node guarantees to the aggregate of the two flows a rate-
latency service curve β_{R,T}. Call ρ2 = inf_{t>0} α2(t)/t the max-
imum sustainable rate for flow 2. If ρ1 + ρ2 < R, then
at the output, flow 1 is constrained by γ_{ρ1,b1} with b1 =
σ1 + ρ1 (T + B/R) and B = sup_{t≥0} [α2(t) + ρ1 t − Rt].

For application of this theorem, flow 1 is f_i, the flow of
interest, and flow 2 is the aggregate F_k − {f_i} of the other flows in
q_k, with R = R_k and T = T_k. Let ρ1 = ρ_i, σ1 = σ_i^k,
ρ2 = Σ_{l∈F_k,l≠i} ρ_l, and σ2 = Σ_{l∈F_k,l≠i} σ_l^k. Because of link
shaping in q_k, α2(t) = min(rt, ρ2 t + σ2)1_{t>0}. Let τ2 ≜ σ2 / (r − ρ2),
so that rτ2 = ρ2 τ2 + σ2. If 0 ≤ t ≤ τ2 then α2(t) = rt, else if
t ≥ τ2 then α2(t) = ρ2 t + σ2. From the definition of B:

B = sup( sup_{0≤t≤τ2} (rt + ρ1 t − Rt), sup_{t≥τ2} (ρ2 t + σ2 + ρ1 t − Rt) )
  = (r + ρ1 − R) τ2 = σ2 (r + ρ1 − R) / (r − ρ2)

As a result, b1 = σ1 + ρ1 (T + σ2 (r + ρ1 − R) / (R (r − ρ2))), yielding:

σ_i^j = σ_i^k + ρ_i ( T_k + (Σ_{l∈F_k,l≠i} σ_l^k)(r + ρ_i − R_k) / (R_k (r − Σ_{l∈F_k,l≠i} ρ_l)) )    (10)

By comparison, Corollary 6.2.3 (Burstiness Increase
due to FIFO) [3] (Section 3.4) yields b1 = σ1 + ρ1 (T + σ2/R),
which is worse because ρ1 + ρ2 < R ⟹ r + ρ1 − R < r − ρ2.
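The two propagation rules, Eq. (9) for a flow alone in its predecessor queue and Eq. (10) for FIFO aggregation with link shaping, can be sketched as one helper (parameter names mirror the text; `others` aggregates the competing flows' rate and burstiness in q_k):

```python
def burstiness_next(sigma_ik, rho_i, T_k, others=None, R_k=None, r=1.0):
    """Burstiness of flow f_i at the queue following q_k.
    Alone in q_k: Eq. (9).  Aggregated in FIFO with other flows whose
    aggregate is others = (rho2, sigma2): Eq. (10)."""
    if others is None:
        return sigma_ik + rho_i * T_k                      # Eq. (9)
    rho2, sigma2 = others
    assert R_k is not None and rho_i + rho2 < R_k <= r
    return sigma_ik + rho_i * (T_k + sigma2 * (r + rho_i - R_k)
                               / (R_k * (r - rho2)))       # Eq. (10)
```

Because the flows are feed-forward, these relations can be evaluated queue by queue in topological order, which is the one-pass solution of the acyclic system of inequalities.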
3.4 End-to-End Latency Bound
For computing an upper bound on the end-to-end latency
of any particular flow f_i, we proceed in three steps. First,
compute the left-over (or residual) service curve β_i^j of each
active queue q_j traversed by f_i. Second, find the equiva-
lent service curve β_i^* offered by the NoC to f_i through the
convolution of the left-over service curves β_i^j. Last, find the
end-to-end latency bound by computing d_i^*, the delay be-
tween α_i, the arrival curve of flow f_i, and β_i^*. Adding d_i^*
to the constant delays of flow f_i, such as the traversal of
non-active queues and other logic and wiring pipeline, yields
the upper bound. This approach is similar in principle to
the Separated Flow Analysis (SFA) [7], even though the lat-
ter is formulated in the setting of aggregation under blind
multiplexing, while we use FIFO multiplexing.
For the first step, we have two cases to consider at each
active queue q_j. Either f_i is the only flow traversing q_j, and
β_i^j = β_{R_j,T_j} from equations (4) or (5). Or, f_i is aggregated
in q_j with other flows in F_j. Packets from the flow aggregate
F_j are served in FIFO order, so we may apply Corollary
6.2.3 (Burstiness Increase due to FIFO) [3]:

Corollary. Consider a node serving two flows, 1 and 2, in
FIFO order. Assume that flow i is constrained by one leaky
bucket with rate ρ_i and burstiness σ_i. Assume that the node
guarantees to the aggregate of the two flows a rate-latency
service curve β_{R,T}. If ρ1 + ρ2 < R, then flow 1 has a service
curve equal to the rate-latency function with rate R − ρ2 and
latency T + σ2/R, and at the output, flow 1 is constrained by one
leaky bucket with rate ρ1 and burstiness b1 = σ1 + ρ1 (T + σ2/R).

For application of this corollary, flow 1 is f_i, the flow of
interest, and flow 2 is the aggregate F_j − {f_i} of the other flows
in q_j, with R = R_j and T = T_j. Let ρ2 = Σ_{l∈F_j,l≠i} ρ_l and
σ2 = Σ_{l∈F_j,l≠i} σ_l^j. This yields the left-over service curve
β_{R_i^j,T_i^j} for an active queue q_j traversed by f_i:

R_i^j = R_j − Σ_{l∈F_j,l≠i} ρ_l and T_i^j = T_j + (Σ_{l∈F_j,l≠i} σ_l^j) / R_j    (11)
For the second step, we compute the convolution β_i^* of the
left-over service curves β_i^j. Let Q_i denote the set of active
queues traversed by flow f_i. Thanks to the properties of
rate-latency curves [3], β_i^* is a rate-latency curve whose rate
R_i^* is the minimum of the rates and whose latency T_i^* is the
sum of the latencies of the left-over service curves β_i^j:

R_i^* = min_{j∈Q_i} R_i^j and T_i^* = Σ_{j∈Q_i} T_i^j    (12)

Figure 8: Example of a NoC flow problem.
For the last step, we compute the delay d_i^* between α_i,
the arrival curve of flow f_i at ingress, and β_i^*. This flow is
injected at rate ρ_i and burstiness σ_i; however, it is subjected
to link shaping at rate r as it enters the network. As a result,
α_i = min(rt, σ_i + ρ_i t)1_{t>0} and we may apply Eq. (3):

d_i^* = T_i^* + σ_i (r − R_i^*) / (R_i^* (r − ρ_i))    (13)
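The three steps, left-over curves (Eq. 11), convolution (Eq. 12), and horizontal deviation (Eq. 13), chain into a short sketch:

```python
def left_over(R_j, T_j, rho2, sigma2):
    """Eq. (11): residual rate-latency curve left to the flow of interest
    in q_j, after the aggregate (rho2, sigma2) of the other flows."""
    return R_j - rho2, T_j + sigma2 / R_j

def end_to_end_delay(sigma_i, rho_i, left_overs, r=1.0):
    """Eq. (12): the convolution of rate-latency curves is the pair
    (min of rates, sum of latencies); Eq. (13): horizontal deviation
    with the link-shaped ingress arrival curve min(rt, sigma_i + rho_i t)."""
    R_star = min(R for R, _ in left_overs)
    T_star = sum(T for _, T in left_overs)
    return T_star + sigma_i * (r - R_star) / (R_star * (r - rho_i))
```

The constant delays of non-active queues and of the router pipeline are then added to this value to obtain the final WCTT bound.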
For the application of the DNC equations, we use the
flow problem example illustrated in Fig. 8. There are four
flows, f1, f2, f3, f4, with f4 a loop-back flow. Computing
rates using the max-min fairness objective [12] yields ρ1 = 2/3
and ρ2 = ρ3 = ρ4 = 1/3. The maximum packet size l_max is
set to 17 flits, corresponding to one for the header and the
others for the payload. Computing the σ_i^min values according
to Eq. (6) yields σ_1^min = 17/3 and σ_2^min = σ_3^min = σ_4^min = 34/3.
In the following table, we identify the queues based on
the node number and the turn corresponding to the queue.
For instance, q_{2WS} identifies the queue in router 2 that
buffers packets from link W to link S. We also consider
the queues that buffer traffic back to the compute clusters:
q_{10LC} and q_{8LC}. Queues that share the same link arbiter
are {q_{2WS}, q_{2LS}} for link 2S, {q_{10NW}, q_{10LW}} for link 10W,
and {q_{8EL}, q_{8LL}} for link 8L.

      q_{0LE}  q_{2WS}     q_{2LS}  q_{10NL}     q_{10NW}     q_{10LW}  q_{8EL}     q_{8LL}
f1    σ_1      σ_1^{2WS}   -        σ_1^{10NL}   -            -         -           -
f2    -        -           σ_2      -            σ_2^{10NW}   -         σ_2^{8EL}   -
f3    -        -           -        -            -            σ_3       σ_3^{8EL}   -
f4    -        -           -        -            -            -         -           σ_4

Queues q_{0LE} and q_{10NL} are not active, so they can be
ignored. For the other queues, we compute their service
curves. Queue q_{2WS} is traversed by a rate ρ1 = 2/3 > r/n_{2WS} =
1/2, so Eq. (5) for blind multiplexing applies to q_{2WS}. Like-
wise, queue q_{8EL} is traversed by a rate ρ2 + ρ3 = 2/3 > r/n_{8EL} =
1/2, so Eq. (5) applies to q_{8EL}. Conversely, Eq. (4)
applies to {q_{2LS}, q_{10NW}, q_{10LW}, q_{8LL}}. This yields:

R_{2WS} = 1 − ρ2    T_{2WS} = σ_2^{2LS} / (1 − ρ2)
R_{2LS} = 1/2       T_{2LS} = l_max
R_{10NW} = 1/2      T_{10NW} = l_max
R_{10LW} = 1/2      T_{10LW} = l_max
R_{8EL} = 1 − ρ4    T_{8EL} = σ_4^{8LL} / (1 − ρ4)
R_{8LL} = 1/2       T_{8LL} = l_max
Next, we express the flow burstiness constraints and rela-
tions. In most cases, Eq. (9) applies. The only cases where
we need to apply Eq. (10) are for the flows passing through q_{8EL}:

σ_1^{10NL} = σ_1^{2WS} + ρ1 T_{2WS}    Eq. (9)
σ_2^{10NW} = σ_2^{2LS} + ρ2 T_{2LS}    Eq. (9)
σ_2^{8EL} = σ_2^{10NW} + ρ2 T_{10NW}    Eq. (9)
σ_2^{8LC} = σ_2^{8EL} + ρ2 ( T_{8EL} + σ_3^{8EL} (1 + ρ2 − R_{8EL}) / (R_{8EL} (1 − ρ3)) )    Eq. (10)
σ_3^{8EL} = σ_3 + ρ3 T_{10LW}    Eq. (9)
σ_3^{8LC} = σ_3^{8EL} + ρ3 ( T_{8EL} + σ_2^{8EL} (1 + ρ3 − R_{8EL}) / (R_{8EL} (1 − ρ2)) )    Eq. (10)
σ_4^{8LC} = σ_4 + ρ4 T_{8LL}    Eq. (9)
For the delays, we first compute the parameters of the
left-over service curves according to Eq. (12):

R_1^* = R_{2WS} = 2/3
R_2^* = min(R_{2LS}, R_{10NW}, R_{8EL} − ρ3) = 1/3
R_3^* = min(R_{10LW}, R_{8EL}) = 1/2
R_4^* = R_{8LL} = 1/2

T_1^* = T_{2WS} = 17
T_2^* = T_{2LS} + T_{10NW} + (T_{8EL} + σ_3^{8EL}/R_{8EL}) = 76.5
T_3^* = T_{10LW} + T_{8EL} = 42.5
T_4^* = T_{8LL} = 17

Here we have R_i^j = R_j, except for R_2^{8EL} = R_{8EL} − ρ3. Like-
wise, T_i^j = T_j, except for T_2^{8EL} = T_{8EL} + σ_3^{8EL}/R_{8EL}.

Finally, we apply Eq. (13), which yields the delay values:

d_1^* = 25.5    d_2^* = 110.5    d_3^* = 102    d_4^* = 34
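As an illustrative spot check (assuming the ingress burstiness values are set to σ_i^min from Eq. (6), with r = 1 flit/cycle; the helper d_star is just Eq. (13)), applying Eq. (13) to the parameters derived above reproduces the delay values of f1, f2 and f4:

```python
r = 1.0
l_max = 17.0
rho1, rho2, rho4 = 2/3, 1/3, 1/3
# Eq. (6): sigma_i^min = l_max (r - rho_i) / r
s1, s2, s4 = (l_max * (r - p) / r for p in (rho1, rho2, rho4))

def d_star(T_star, R_star, sigma, rho):
    # Eq. (13): horizontal deviation between the shaped ingress arrival
    # curve and the end-to-end rate-latency curve beta_{R*, T*}
    return T_star + sigma * (r - R_star) / (R_star * (r - rho))

print(d_star(17.0, 2/3, s1, rho1))   # f1: ≈ 25.5
print(d_star(76.5, 1/3, s2, rho2))   # f2: ≈ 110.5
print(d_star(17.0, 1/2, s4, rho4))   # f4: ≈ 34.0
```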
We apply Deterministic Network Calculus (DNC) to the
network-on-chip (NoC) of the Kalray MPPA-256 Bostan
processor, in order to ensure quality of service to the end-
point tasks. The MPPA NoC is a RDMA-capable network,
which can be configured for bounding each flow injection rate
and burstiness, that is, the parameters of a leaky-bucket ar-
rival curve. We assume that flow paths between endpoints
have been selected, and that flow rates are set to the max-
imum given the link capacity bounds. This starting point
can be obtained as the solution of a max-min fairness routing
problem with unsplittable path, which is a standard tech-
nique for engineering elastic traffic in macro-networks.
Based on classic DNC results, we develop our contribu-
tions in four areas: modeling the effects of traffic shaping
by the peak rate of links of the NoC; refinement of the ser-
vice curves of the router queues, using either round-robin
scheduling or blind multiplexing; formulation of the bursti-
ness constraints and propagation across the network; and
computation of upper bounds on the flow end-to-end la-
tencies, based on the principles of separated flow analysis.
Because the flow rates are computed beforehand, the problem
of determining the burstiness parameters can be formulated
and solved as an acyclic set of linear inequalities.
This work was supported by the French DGE and Bpifrance
through the "Investissements d'Avenir" program CAPACITES.
[1] S. Saidi, R. Ernst, S. Uhrig, H. Theiling, and B. D.
de Dinechin, “The Shift to Multicores in Real-time
and Safety-critical Systems,” in Proc. of the 10th
Inter. Conference on Hardware/Software Codesign and
System Synthesis, ser. CODES ’15, 2015, pp. 220–229.
[2] Q. Perret, P. Maurère, É. Noulard, C. Pagetti,
P. Sainrat, and B. Triquet, “Mapping hard real-time
applications on many-core processors,” in Proc. of the
24th Inter. Conference on Real-Time Networks and
Systems, RTNS 2016, Brest, France, October 19-21,
2016, 2016, pp. 235–244.
[3] J.-Y. Le Boudec and P. Thiran, Network Calculus: A
Theory of Deterministic Queuing Systems for the
Internet. Berlin, Heidelberg: Springer-Verlag, 2012.
[4] Z. Lu, M. Millberg, A. Jantsch, A. Bruce, P. van der
Wolf, and T. Henriksson, “Flow Regulation for
On-Chip Communication,” in Proc. of the Conference
on Design, Automation and Test in Europe, ser.
DATE ’09, 2009, pp. 578–581.
[5] Y. Durand, C. Bernard, and F. Clermidy, “Distributed
Dynamic Rate Adaptation on a Network on Chip with
Traffic Distortion,” in 10th IEEE Inter. Symposium on
Embedded Multicore/Many-core Systems-on-Chip,
MCSOC 2016, Lyon, France, September 21-23, 2016,
2016, pp. 225–232.
[6] M. Boyer and C. Fraboul, “Tightening end to end
delay upper bound for AFDX network calculus with
rate latency FCFS servers using network calculus,” in
IEEE Inter. Workshop on Factory Communication
Systems (WFCS), Dresden, Germany. IEEE, may
2008, pp. 11–20.
[7] A. Bouillard and G. Stea, “Worst-Case Analysis of
Tandem Queueing Systems Using Network Calculus,”
in Quantitative Assessments of Distributed Systems,
Bruneo and Distefano, Eds., 2015.
[8] L. Lenzini, L. Martorini, E. Mingozzi, and G. Stea,
“Tight end-to-end per-flow delay bounds in FIFO
multiplexing sink-tree networks,” Perform. Eval.,
vol. 63, no. 9-10, pp. 956–987, 2006.
[9] B. Dupont de Dinechin, Y. Durand, D. van Amstel,
and A. Ghiti, “Guaranteed Services of the NoC of a
Manycore Processor,” in Proc. of the 2014 Inter.
Workshop on Network on Chip Architectures, ser.
NoCArc ’14, 2014, pp. 11–16.
[10] W. J. Dally and C. L. Seitz, “Deadlock-Free Message
Routing in Multiprocessor Interconnection Networks,”
IEEE Trans. Comput., vol. 36, no. 5, pp. 547–553,
May 1987.
[11] P. Bahrebar and D. Stroobandt, “The
Hamiltonian-based Odd-even Turn Model for
Maximally Adaptive Routing in 2D Mesh
Networks-on-chip,” Comput. Electr. Eng., vol. 45,
no. C, pp. 386–401, Jul. 2015.
[12] S. Chen and K. Nahrstedt, “Maxmin Fair Routing in
Connection-Oriented Networks,” in Proc. of
Euro-Parallel and Distributed Systems Conference
(Euro-PDS ’98), 1998, pp. 163–168.
... The NoC is of wormhole switching type, designed primarily to serve the two 100 Gbps Ethernet controllers located on the chip. It is more suited to carry out asynchronous RDMA operations targeting high-performance use, even though works [52] have shown that it can be used for time-critical purposes. ...
... On the cluster level, each core has a private bus to the shared cluster memory. On the chip level, the NoC offers service guarantees, such as minimum bandwidth and maximum latency [dDin15;dDG17]. Finally, on the I/O level, the DDR controller offers service guarantees and special predictability configurations as well [dDin16]. ...
Full-text available
This dissertation presents symbolic loop compilation, the first full-fledged approach to symbolically map loop nests onto tightly coupled processor arrays (TCPAs), a class of loop accelerators that consist of a grid of processing elements (PEs). It is: Full-fledged because it covers all steps of compilation, including space-time mapping, code generation, and generation of configuration data for all involved hardware components. A full-fledged compiler is paramount because manual mapping for accelerators, such as TCPAs, is difficult, tedious, and, most of all, error-prone. Symbolic because symbolic loop compilation assumes the loop bounds and number of allocated PEs to be unknown during compile time, thus allowing them to be chosen at run time.This flexibility benefits resource-aware applications where the number of PEs is known only at run time. Symbolic loop compilation is a hybrid static/dynamic approach with two phases: At compile time, all involved NP-hard problems (such as resource-constrained modulo scheduling) are solved symbolically, resulting in a so-called symbolic configuration, which is a space-efficient intermediate representation parameterized in the loop bounds and number of PEs. This phase is called symbolic mapping. Because it takes place at compile time, there is ample time to solve the involved NP-hard problems. At run time, for each requested accelerated execution of a loop program with given loop bounds and number of allocated PEs, concrete PE programs and configuration data are generated from the symbolic configuration according to these parameter values. This phase is called instantiation. In the context of these two phases, this dissertation presents the following contributions: 1. Symbolic modulo scheduling is a technique for solving resource-constrained modulo scheduling for multi-dimensional loop nests when the loop bounds and number of available PEs are unknown. 
We show that a latency-minimal solution can be found if the number of PEs is known beforehand and a near latency-minimal solution if it is not. 2. Polyhedral syntax trees are a space-efficient, parameterized representation of a set of PE program variants from which the necessary concrete PE programs are generated at run time. 3. Instantiation includes methods to generate concrete programs and configuration data from a symbolic configuration in a manner whose time complexity is not proportional to the loop bounds or number of allocated PEs. 4. Run-time enforcement for loops is a technique that utilizes the flexibility granted by symbolic loop compilation to enforce requirements on non-functional properties by dynamically adapting the mapping before execution. An example is to allocate a number of PEs that satisfies a given latency bound. In summary, the methods presented in this dissertation enable, for the first time, the full-fledged symbolic compilation of loop nests onto TCPAs. Without these methods, a given loop nest would need to be fully recompiled each time the loop bounds or number of available PEs change, which would render run-time mapping impractical and even conventional compilation overly time- and space-consuming.
... BLAST+ has been designed to be quickly executed by standard multicore microprocessors in a hardware environment, where cores are very powerful, memory accesses are extremely fast, and there is no contention among the different cores. Unfortunately, this is not the hardware scenario for the recent manycore central processing unit (CPU) coprocessors, like the Intel Xeon Phi (Jeffers et al., 2016), TILE-Gx (Schooler, 2010) or massively parallel processor array (MPPA)-256 (De Dinechin and Graillat, 2017), where cores are less powerful, coordination among huge numbers of threads is slow, cache sizes are small, and cache faults are heavily penalized. As an example, Intel recognizes that executing BLAST+ directly on Xeon Phi in native mode is up to 3-fold slower than using a standard Xeon E5 processor (Albert, 2015). ...
Full-text available
New High-Performance Computing architectures have recently been developed for commercial central processing units (CPUs). Yet, that has not improved the execution time of widely used bioinformatics applications, like BLAST+. This is due to a lack of optimization between the bases of the existing algorithms and the internals of the hardware that would allow taking full advantage of the available CPU cores. To optimize for the new architectures, algorithms must be revised and redesigned; usually rewritten from scratch. BLVector adapts the high-level concepts of BLAST+ to the x86 architectures with AVX-512, to harness their capabilities. A deep comprehensive study has been carried out to optimize the approach, with a significant reduction in execution time. BLVector reduces the execution time of BLAST+ when aligning up to mid-size protein sequences (∼750 amino acids). The gain in real-world scenarios is 3.2-fold. When applied to longer proteins, BLVector consumes more time than BLAST+, but retrieves a much larger set of results. BLVector and BLAST+ are fine-tuned heuristics. Therefore, the relevant results returned by both are the same, although they behave differently, especially when performing alignments with low scores. Hence, they can be considered complementary bioinformatics tools.
... It also requires fewer wires than a shared bus, and its power consumption is linear in the number of cores. Other NoC topologies have been proposed in other architectures, like the torus of Kalray [6] or the ring used by Intel [12]. The many-core mesh connects the tiles using the typical configuration shown in Figure 1, with the core, its local memory and a router connecting the tile's core with the neighboring routers in the mesh. ...
Conference Paper
Full-text available
A current trend in industrial systems is reducing space, weight and power (SWaP) through the allocation of different applications on a single chip. This is enabled by the continued improvement of semiconductor technology, which allows the integration of multiple cores in a single processor chip, as processors are prevented from continuing to increase their clock rate by the "power wall". The use of Commercial-Off-The-Shelf (COTS) multi-core processors for real-time purposes presents issues due to the shared bus used to access the shared memory. An alternative to multi-core processors are many-core processors, with tens to hundreds of processors on the same chip, using different scalable ways to interconnect their cores. This paper presents the adaptation of the M2OS Real-Time Operating System (RTOS) and its simplified Ada run-time to mesh-based many-core processors. This RTOS is called M2OS-mc and has been tested on the Epiphany III many-core processor (referred to in this paper simply as Epiphany), a many-core which has 16 cores connected by a Network-on-Chip (NoC) consisting of a 4x4 2D mesh. In order to have a synchronized way to send messages between tasks through the NoC, independently of the core where they are executed, we provide sampling-port communication primitives.
... The results on the explicit linear method have been obtained using a tool developed by Kalray, presented in Dupont de Dinechin and Graillat (2017). The results on the LP method have been obtained using the NetCalBounds tool from Bouillard (2017), with the same tandem topology as the SFA. ...
Full-text available
The Kalray MPPA2-256 processor integrates 256 processing cores and 32 management cores on a chip. These cores are grouped into clusters, and clusters are connected by a high-performance network on chip (NoC). This NoC provides hardware mechanisms (ingress traffic limiters) that can be configured to offer service guarantees. This paper introduces a network calculus formulation, designed to configure the NoC traffic limiters, that also computes guaranteed upper bounds on the NoC traversal latencies. This network calculus formulation accounts for the traffic shaping performed by the NoC links, and can be solved using linear programming. This paper then shows how existing network calculus approaches (the Separated Flow Analysis – SFA; the Total Flow Analysis – TFA; the Linear Programming approach – LP) can be adapted to analyze this NoC. The delay bounds obtained by the four approaches are then compared on two case studies: a small configuration coming from a previous study, and a realistic configuration with 128 or 256 flows. From these case studies, it appears that modeling the shaping introduced by NoC links is of major importance to get accurate bounds. When all packets have the same size, modeling it reduces the bound by 20%–25% on average.
... For example, the KALRAY MPPA2-256 is made of up to 288 cores: 256 computing cores, 16 management cores, and four quad cores (see Figure 1). Kalray's MPPA [35] technology addresses these challenges by combining high-performance cores with low-power processors [36] (the MPPA2®-256 Bostan2 processor [36]). Mixed-Integer Quadratic Programming (MIQP): [37] mentions the use of MIQP for solving scheduling problems. ...
Conference Paper
Complex electronic systems are used in safety-critical applications (e.g., aerospace, nuclear stations), for which the certification standards demand the use of assured design methods and tools. Meta-scheduling is a way to manage the complexity of adaptive systems via predictable behavioural patterns established by static scheduling algorithms. This paper proposes a meta-scheduling algorithm for adaptive time-triggered systems based on Networks-on-a-Chip (NoCs). The meta-scheduling algorithm computes an individual schedule for each dynamic event of slack occurrence. Each dynamic slack occurrence triggers the shift to a more energy-efficient schedule. Dynamic frequency scaling of cores and routers is used to improve the energy efficiency, while preserving the temporal correctness of time-triggered computation and communication activities (e.g., collision avoidance, timeliness). Mixed-Integer Quadratic Programming (MIQP) is used to optimise the schedules. Experimental results for an example scenario demonstrate that the presented meta-scheduling algorithm provides on average a power reduction of 34%. Our approach was able to deploy 93 dynamic slack schedules, compared to the single slack schedule of static slack scheduling.
... Prior work has been carried out to compute response-time analysis within a cluster [1,25] or a worst-case traversal time for the NoC [8]. However, there is little research on computing a worst-case response time that considers applications mapped to more than one cluster with a complete model of the NoC interference on local memories (see Section 5). ...
Conference Paper
We consider hard real-time applications running on a many-core processor containing several clusters of cores linked by a Network-on-Chip (NoC). Communications are done via shared memory within a cluster and through the NoC for inter-cluster communication. We adopt the time-triggered paradigm, which is well-suited for hard real-time applications, and we consider data-flow applications, where communications are explicit. We extend the AER (Acquisition/Execution/Restitution) execution model to account for all delays and interferences linked to communications, including the interference between the NoC interface and the memory. Indeed, for NoC communications, data is first read from the initiator's local memory, then sent over the NoC, and finally written to the local memory of the target cluster. Read and write accesses to transfer data between local memories may interfere with shared-memory communication inside a cluster, and, as far as we know, previous work did not take these interferences into account. Building on previous work on deterministic network calculus and shared-memory interference analysis, our method computes a static, time-triggered schedule for an application mapped on several clusters. This schedule guarantees that deadlines are met, and therefore provides a safe upper bound on the global worst-case response time.
... Clusters access neither the memory of other clusters nor the host memory. Communication occurs by means of two NoCs (control and data), organized in a 2D torus topology [16]. Hence, we establish the requirements regarding the aforementioned characteristics of the MPPA-256. ...
Conference Paper
Full-text available
Performance of parallel scientific applications on many-core processor architectures is a challenge that grows every day, especially when energy efficiency is concerned. To address it, it is necessary to explore architectures with high processing power, which use a network-on-chip to integrate many processing cores and other components. In this context, this paper presents a design space exploration over NoC-based many-core processor architectures with distributed and shared caches, using full-system simulations. We evaluate bottlenecks in such architectures with regard to energy efficiency, using different parallel scientific applications and considering aspects of caches and NoCs jointly. Five applications from the NAS Parallel Benchmarks were executed over the proposed architectures, which vary in number of cores, in L2 cache size, and in 12 types of NoC topologies. A clustered topology was set up, in which we obtain performance gains of up to 30.56% and a reduction in energy consumption of up to 38.53%, compared to a traditional one.
The computing capacity demanded by embedded systems is on the rise as software implements more functionalities, ranging from best-effort entertainment functions to performance-guaranteed safety-related functions. Heterogeneous manycore processors, using wormhole mesh (wmesh) Network-on-Chips (NoCs) as the main communication means, on which applications contend with each other, are increasingly considered to deliver the required computing performance. Most research efforts on software timing analysis have focused on deriving bounds (estimates) on the contention that tasks can suffer when accessing wmesh NoCs. However, less effort has been devoted to an equally important problem, namely, accurately measuring the actual contention tasks generate for each other on the wmesh, which is instrumental during system validation to diagnose any software timing misbehavior and determine which tasks are particularly affected by contention on specific wmesh routers. In this paper, we work on the foundations of contention measurement in wmesh NoCs and propose and explain the rationale of a golden metric, called task PairWise Contention (PWC). PWC allows ascribing the actual share of the contention a given task suffers in the wmesh to each of its co-runner tasks at packet level. We also introduce and formalize a Golden Reference Value (GRV) for PWC that specifically defines a criterion to fairly break down the contention suffered by a task among its co-runner tasks in the wmesh. Our evaluation shows that GRV effectively captures how contention occurs by identifying the actual core (task) causing contention and whether contention is caused by local or remote interference in the wmesh.
Conference Paper
Full-text available
In real-time and safety-critical systems, the move towards multicores is becoming unavoidable in order to keep pace with increasing processing requirements and to meet the high integration trend while maintaining a reasonable power consumption. However, standard multicore systems are mainly designed to increase average performance, whereas embedded systems have additional requirements with respect to safety, reliability and real-time behavior. Therefore, the shift to multicores raises several challenges the embedded systems community has to face. These challenges involve the design of certifiable multicore platforms, the management of shared resources and the development/integration of parallel software. New issues are encountered at different steps of application development, from modeling and design to software implementation and hardware deployment. Therefore, both multicore/semiconductor manufacturers and the real-time community have to meet the challenges imposed by multicores. The goal of this paper is to trigger such a discussion as an attempt to bridge the gap between the two worlds and to raise awareness about the hurdles and challenges that need to be tackled.
Full-text available
The Kalray MPPA®-256 processor (Multi-Purpose Processing Array) integrates 256 processing engine (PE) cores and 32 resource management (RM) cores on a single 28nm CMOS chip. These cores are distributed across 16 compute clusters and 4 I/O subsystems. On-chip communications and synchronization are supported by an explicitly routed dual data & control network-on-chip (NoC), with one node per compute cluster and 4 nodes per I/O subsystem, for a total of 32 nodes. The data NoC is dedicated to streaming data transfers and may operate with guaranteed services, thanks to non-blocking routers and flow regulation at the source node. Its architecture has been designed so that (σ, ρ) network calculus applies with minimal approximations. Given a set of flows across this data NoC with predetermined routes, we formulate the problem of guaranteeing fair allocation of bandwidth across flows and we present bounds on the maximum transfer latency. By considering the architecture of the data NoC and by introducing conservative approximations, we show how this formulation can be transformed into a linear program. Solving this linear program is efficient and the quality of its solutions appears comparable to those of the original formulation, based on problem instances obtained from the cyclostatic dataflow compilation toolchain of the Kalray MPPA®-256 processor.
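The fair bandwidth allocation problem formulated above is closely related to max-min fairness over fixed (unsplittable) routes. A minimal progressive-filling sketch of that idea, with illustrative names and not Kalray's actual toolchain, increases all flow rates equally until some link saturates, freezes the flows crossing it, and repeats:

```python
def max_min_fair_rates(routes, capacity):
    """Max-min fair rates for flows with fixed routes.

    routes:   flow name -> list of links it traverses
    capacity: link name -> link capacity
    """
    rates = {f: 0.0 for f in routes}
    active = set(routes)            # flows not yet limited by a bottleneck
    residual = dict(capacity)       # remaining capacity per link
    while active:
        # Largest equal increment all active flows can still receive,
        # limited by the most contended link any of them crosses.
        incr = min(residual[l] / sum(1 for g in active if l in routes[g])
                   for f in active for l in routes[f])
        for f in active:
            rates[f] += incr
        for l in residual:
            residual[l] -= incr * sum(1 for f in active if l in routes[f])
        # Freeze every flow that now crosses a saturated link.
        active = {f for f in active
                  if all(residual[l] > 1e-12 for l in routes[f])}
    return rates
```

For instance, with flows f1 on link A, f2 on links A and B, and f3 on link B, capacities A=1 and B=2, the bottleneck A pins f1 and f2 at 0.5 while f3 grows to 1.5. The paper's linear-programming formulation additionally folds in the NoC architecture and conservative approximations.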
Conference Paper
Full-text available
This paper presents some new results in network calculus designed to improve the computation of end-to-end bounds for an AFDX network, under the FIFO assumption. The formal contribution is to allow handling shaped leaky-bucket flows traversing simple rate-latency network elements. Two new results are presented for the case where a sum of shaped flows crosses a single network element: the first one considers the global aggregated flow, and the second one considers each individual flow, assuming a FIFO policy. Some configurations are tested, and the first results obtained are in almost all cases better than those of already known methods.
Conference Paper
Full-text available
We propose (sigma, rho)-based flow regulation as a design instrument for System-on-Chip (SoC) architects to control quality-of-service and achieve cost-effective communication, where sigma bounds the traffic burstiness and rho the traffic rate. This regulation changes the burstiness and timing of traffic flows, and can be used to decrease delay and reduce buffer requirements in the SoC infrastructure. In this paper, we define and analyze the regulation spectrum, which bounds the upper and lower limits of regulation. Experiments on a Network-on-Chip (NoC) with guaranteed service demonstrate the benefits of regulation. We conclude that flow regulation may exert significant positive impact on communication performance and buffer requirements.
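The (sigma, rho) regulation described above can be pictured as a token bucket: sigma is the bucket depth (maximum burst) and rho the refill rate. A hypothetical software sketch of such a regulator, with assumed names and units (flits, cycles) and not the actual NoC hardware interface:

```python
class SigmaRhoRegulator:
    """Token-bucket model of a (sigma, rho) ingress regulator."""

    def __init__(self, sigma, rho):
        self.sigma = sigma    # maximum burstiness (bucket depth)
        self.rho = rho        # sustained rate (tokens per cycle)
        self.tokens = sigma   # bucket starts full
        self.last = 0         # time of the last update

    def try_send(self, now, size):
        """Admit a packet of `size` units at time `now` if tokens allow."""
        # Refill tokens for the elapsed time, capped at the bucket depth.
        self.tokens = min(self.sigma,
                          self.tokens + (now - self.last) * self.rho)
        self.last = now
        if size <= self.tokens:
            self.tokens -= size
            return True       # output conforms to the curve sigma + rho*t
        return False          # would exceed the contracted burstiness
```

Admitted traffic is then upper-bounded by the leaky-bucket arrival curve sigma + rho*t, which is what makes downstream delay and buffer bounds computable.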
Conference Paper
A NoC-based system subject to real-time constraints requires hard bounds on end-to-end data transfer latencies. Regulating the channel injection rates solves the problem by suppressing link congestion and router queue saturation, provided that the channel rates guarantee fairness in the data distribution. We propose a distributed algorithm for the computation of a channel rate vector solution, suitable for runtime execution on the system. The algorithm takes into account the capacity constraints on every link, but also fulfills the distortion constraints. It leads to a near-optimal solution in a few iterations, with less than 10% of vector distance from the optimal solution. The algorithm is valid for software implementation on manycore systems and applicable to any Network-On-Chip based system. Our hardware implementation is distributed into the network infrastructure and converges in around 500 clock cycles on a 4x4 network configuration.
Conference Paper
Many-core processors are interesting candidates for the design of modern avionics computers. Indeed, the computational power offered by such platforms opens new horizons to design more demanding systems and to integrate more applications on a single target. However, they also bring challenging research topics because of their lack of predictability and their programming complexity. In this paper, we focus on the problem of mapping large applications on a complex platform such as the KALRAY MPPA®-256 while maintaining a strong temporal isolation from co-running applications. We propose a constraint programming formulation of the mapping problem that enables an efficient parallelization and we demonstrate the ability of our approach to deal with large problems using a real world case study.
In this chapter we show how to derive performance bounds for tandem queueing systems using Network Calculus, a deterministic theory for performance analysis. We introduce the basic concepts of Network Calculus, namely arrival and service curves, and we show how to use them to compute performance bounds in an end-to-end perspective. As an application of the above theory, we evaluate tandems of network nodes with well-known service policies. We present the theory for two different settings: a simpler one, called "per-flow scheduling", where service policies at each node discriminate traffic coming from different flows and buffer it separately, and "per-aggregate scheduling", where schedulers manage a small number of traffic aggregates, and traffic of several flows may end up in the same queue. We show that, in the latter case, methodologies based on equivalent service curves cannot compute tight delay bounds, and we present a different methodology that relies on input-output relationships and uses mathematical programming techniques.
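For the per-flow setting, the classic single-node bounds can be stated concretely: a flow with leaky-bucket arrival curve sigma + rho*t crossing a rate-latency server R*(t - T)+ with rho <= R has delay bound T + sigma/R (the horizontal deviation of the two curves) and backlog bound sigma + rho*T (the vertical deviation). A small sketch with illustrative parameter values:

```python
def delay_bound(sigma, rho, R, T):
    """Worst-case delay for a (sigma, rho) flow at a rate-latency node."""
    assert rho <= R, "stability requires rho <= R"
    return T + sigma / R          # horizontal deviation of the curves

def backlog_bound(sigma, rho, R, T):
    """Worst-case backlog for the same flow and node."""
    assert rho <= R, "stability requires rho <= R"
    return sigma + rho * T        # vertical deviation of the curves

# Example (assumed values): sigma = 8 flits, rho = 0.25 flit/cycle,
# R = 1 flit/cycle, T = 12 cycles of node latency.
# delay_bound(8, 0.25, 1.0, 12) gives 20.0 cycles.
```

End-to-end analyses such as SFA chain these per-node bounds by convolving the service curves of the nodes along a flow's path.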
Aggregate scheduling has been proposed as a solution for achieving scalability in large-size networks. However, in order to enable the provisioning of real-time services, such as video delivery or voice conversations, in aggregate scheduling networks, end-to-end delay bounds for single flows are required. In this paper, we derive per-flow end-to-end delay bounds in aggregate scheduling networks in which per-egress (or sink-tree) aggregation is in place, and flows' traffic is aggregated according to a FIFO policy. The derivation process is based on Network Calculus, which is suitably extended for this purpose. We show that the bound is tight by deriving the scenario in which it is attained. A tight delay bound can be employed for a variety of purposes: for example, devising optimal aggregation criteria and rate provisioning policies based on pre-specified flow delay bounds.