Network-on-Chip Service Guarantees
on the Kalray MPPA-256 Bostan Processor
Benoît Dupont de Dinechin and Amaury Graillat
Kalray SA
445 rue Lavoisier
F-38330 Montbonnot, France
ABSTRACT
The Kalray MPPA-256 Bostan manycore processor imple-
ments a clustered architecture, where clusters of cores share
a local memory, and a DMA-capable network-on-chip (NoC)
connects the clusters. The NoC implements wormhole switch-
ing without virtual channels, with source routing, and can be
configured for maximum flow rate and burstiness at ingress.
We describe and illustrate the techniques used to configure
the MPPA NoC for guaranteed services. Our approach is
based on three steps: global selection of routes between end-
points and computation of flow rates, by solving the max-
min fairness with unsplittable path problem; configuration
of the flow burstiness parameters at ingress, by solving an
acyclic set of linear inequalities; and end-to-end latency up-
per bound computation, based on the principles of separated
flow analysis (SFA). In this paper, we develop the last two
steps, taking advantage of the effects of NoC link shaping
on the leaky-bucket arrival curves of flows.
Keywords
Network-on-Chip, Deterministic Network Calculus, Link Traf-
fic Shaping, Separated Flow Analysis
1. INTRODUCTION
1.1 Motivation and Problem
The Kalray MPPA-256 Bostan manycore processor is de-
signed for timing predictability and energy efficiency [1]. Its
cores are clustered into compute units sharing a local mem-
ory, and these clusters are connected through a network-on-
chip (NoC). The MPPA NoC implements wormhole switch-
ing with source routing and supports guaranteed services
through the configuration of flow injection parameters at the
NoC interface: the maximum rate ρ; the maximum bursti-
ness σ; the minimum and the maximum packet sizes.
The MPPA-256 processors are used in time-critical appli-
cations [2], where a set of cyclic or sporadic tasks interact
through a static communication graph. When application
tasks are mapped to the cores, tasks communicate either
through the local memory if co-located on the same cluster,
or through the NoC that supports remote DMA. Ensuring
the worst-case response times (WCRT) of time-critical
applications requires upper approximations of worst-case
execution times (WCET) for the computations and of worst-case
traversal times (WCTT) for the remote communications.
AISTECS 2017, Stockholm, Sweden
ACM ISBN 978-1-4503-2138-9.
DOI: 10.1145/1235
In this work, we address the problem of computing end-
to-end latency upper bounds for on-chip communications,
assuming that the tasks have been assigned to cores inside
clusters, and thus to NoC nodes. We base our approach on
Deterministic Network Calculus (DNC), which was originally
developed for the performance evaluation of Internet and of
ATM networks [3]. Since then, DNC has been applied to
flow regulation of NoCs [4][5], and is used to guarantee the
services of the AFDX Ethernet-based avionics networks [6].
Application of DNC to a NoC assumes that routes be-
tween endpoints are predetermined, and that flow injection
parameters are known or enforced. Moreover, key DNC re-
sults only apply to feed-forward networks, that is, networks
where the graph of unidirectional links traversed by the set
of flows has no cycles [3]. A related assumption is that the
hop-to-hop flow control of the NoC is not active, that is, the
traffic is such that the router queues always accept incoming
flits. Figuratively speaking, queues must not overflow.
1.2 Contributions and Related Work
Our contributions are as follows. At the time scale of NoC
operations, there are no instantaneous packet arrivals, so the
(σ, ρ) characterization of flows used in macro networks must
be refined by taking into account the flit serialization effects
of links, both inside and between the NoC routers. These ef-
fects appear significant for the increase of the flow burstiness
between hops and for the computation of the end-to-end la-
tency bound. A related approach is to shape individual flows
at ingress, then propagate the resulting arrival curves across
the network elements [6]. However, this approach is more
relevant for an AFDX network than for a NoC.
For the service offered to a queue by the link arbiter, we
select either the rate and latency ensured by the round-
robin packet scheduling, or the residual service guaranteed
by blind multiplexing when the former does not apply. Pre-
vious works use only blind multiplexing [7] or only FIFO
multiplexing [8], however the latter does not apply to the
MPPA NoC link arbiters, since they do not keep the FIFO
service order between packets from different queues.
Compared to earlier work of applying DNC to the NoC of
the first-generation MPPA Andey processor [9], we decouple
the computation of the flow rates and burstiness and obtain
significantly better rates and end-to-end latency bounds.
2. BACKGROUND
2.1 Overview of MPPA NoC Management
There are two objectives of the MPPA NoC management.
First, ensure deadlock-free operations, which is a concern
with wormhole switched networks. Second, in case the com-
munication flows are static, configure the routes and packet
injection parameters so that minimum flow rates and maxi-
mum end-to-end latencies can be guaranteed. These two ob-
jectives are connected by the following insight: on a worm-
hole switched network, a static deadlock-free routing algo-
rithm ensures that the flows are feed-forward. Indeed, avoid-
ing deadlocks can be ensured by using a routing scheme
that prevents circuits in the resource dependence graph [10].
With wormhole switching, resources are the router internal
or external links, and depend through the packet paths.
We assume that application tasks have been assigned end-
points on the NoC. A static deadlock-free routing algorithm
with path diversity, in our case the Hamiltonian Odd-Even
turn model [11], is used to compute a set of paths between
the endpoints. In a first optimization problem, we glob-
ally select a single path for each endpoint with the max-min
fairness objective [12]. This objective corresponds to maxi-
mizing the lowest flow rate, then the next lowest flow rate,
etc. Solving this optimization problem yields the flow rates.
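The first optimization problem is not developed in this paper, but its outcome can be illustrated with a progressive-filling (water-filling) sketch over fixed unsplittable paths. The function and the link identifiers below are our own illustration, not the solver used for the MPPA NoC:

```python
from fractions import Fraction

def max_min_rates(paths, capacity):
    """Max-min fair rates by progressive filling, for flows already pinned
    to unsplittable paths. `paths` maps a flow to the list of links it
    crosses; `capacity` maps a link to its capacity. Illustrative sketch."""
    rates = {f: Fraction(0) for f in paths}
    active = set(paths)
    residual = {l: Fraction(c) for l, c in capacity.items()}
    while active:
        # largest uniform rate increment before some link saturates
        inc = {l: c / sum(1 for f in active if l in paths[f])
               for l, c in residual.items()
               if any(l in paths[f] for f in active)}
        bottleneck = min(inc, key=inc.get)
        eps = inc[bottleneck]
        for f in active:
            rates[f] += eps
            for l in paths[f]:
                residual[l] -= eps
        # flows crossing the saturated link are frozen at their current rate
        active -= {f for f in active if bottleneck in paths[f]}
    return rates
```

On a toy instance with unit-capacity links where one flow crosses a link shared by three flows, this reproduces the kind of rate assignment used in the example of Section 4.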
We then formulate a second problem, which is the topic
of this paper. The objective is to compute the flow bursti-
ness parameters at ingress, in order to ensure that the NoC
router queues do not overflow. This entails propagating the
individual and aggregate flow parameters across the relevant
network elements, which are the outgoing link arbiters that
serve the packets in round-robin across the adjacent 'turn'
queues. Replacing the flow rates by their values from the
first optimization problem yields a system of linear inequal-
ities with the flow burstiness variables. Thanks to feed-
forward flows, the system is acyclic and solved in one pass.
Finally, we obtain upper bounds on the end-to-end laten-
cies between endpoints by applying the principles of sepa-
rated flow analysis (SFA) [7], where a left-over service curve
is computed at each network element for the flow of inter-
est. These left-over service curves are combined by (min,+)
convolution, yielding the end-to-end service curve provided
by the network for this flow. Computing the maximum hor-
izontal deviation between the flow arrival curve at ingress
and this end-to-end service curve yields the delay for the
upper bound on the end-to-end latency for the flow.
2.2 The MPPA-256 Bostan Processor
The Kalray MPPA-256 Bostan processor is the second im-
plementation of the MPPA (Massively Parallel Processing
Array) architecture, manufactured with the TSMC CMOS
28HP process. The MPPA architecture belongs to the ‘many-
core’ family, characterized by a large number of cores and
the architecturally visible aggregation of resources into com-
pute tiles (DSP ‘core packs’, GPU ‘streaming multiproces-
sors’, OpenCL ‘compute units’) with local memory, and the
exploitation of a network-on-chip for their interaction.
The MPPA-256 Bostan processor integrates 256 process-
ing cores and 32 management cores on a single chip, all based
on the same 32-bit/64-bit VLIW core architecture. Its ar-
chitecture is clustered with 16 compute clusters and 2 I/O
clusters, where each cluster is built around a multi-banked
local static memory shared by the 16+1 (compute cluster)
or 4+4 (I/O cluster) processing + management cores.
Figure 1: MPPA-256 Bostan clusters and NoC.
Figure 2: Structure of an MPPA NoC router.
Fig. 1
illustrates the MPPA-256 Bostan clusters, with I/O clusters
in light blue and the compute clusters in dark blue. The
clusters communicate through the NoC, with one node per
compute cluster and 8 nodes per I/O cluster.
The MPPA-256 Bostan NoC is a direct network with a 2D
torus topology extended with extra links connected to the
otherwise unused links of the NoC nodes on the I/O clusters.
The NoC implements wormhole switching with source rout-
ing and without virtual channels. With wormhole switching,
a packet is decomposed into flits (32-bit), which travel in a
pipelined fashion across the network elements, with buffering
applied at the flit level. The packet follows a route deter-
mined by a bit string in the header.
The key network elements are the link arbiters inside the
routers (Fig. 2), which schedule whole packets for transmis-
sion by applying a round-robin selection on adjacent queues.
There is one queue per incoming direction or 'turn'. We call
a queue active if there is some flow passing through it and
there is another queue in the same link arbiter with some
flow. A non-active queue either has no flow passing through
it, or is the only one in the link arbiter with some flow. If so,
there are no effects on the packets beyond a constant delay,
and the queue can be ignored except for that constant delay,
which needs to be added to the end-to-end latency bound.
Figure 3: Flow arrival and departure as cumulative
data functions over time.
Figure 4: Arrival curve $\alpha$ for function $A$.
2.3 Deterministic Network Calculus (DNC)
Network Calculus [3] is a framework for performance analysis
of networks based on the representation of flows as cumulative
data over time. Given a network element with cumulative
arrival and departure functions $A$ and $A'$, the delay for
traversing the network element and the backlog of data it has
to serve correspond respectively to the horizontal and the
vertical deviations between $A$ and $A'$ (Fig. 3).
Network Calculus exploits the properties of the (min,+)
algebra; in particular, it introduces the following operations:
$$\text{convolution: } (f \otimes g)(t) \triangleq \inf_{0 \le s \le t} f(s) + g(t-s)$$
$$\text{deconvolution: } (f \oslash g)(t) \triangleq \sup_{s \ge 0} f(t+s) - g(s)$$
Let $A$ be a cumulative data function. An arrival curve
$\alpha$ is a constraint on $A$ defined by $\forall\, 0 \le s \le t: A(t) - A(s) \le \alpha(t-s)$,
which is equivalent to $A \le A \otimes \alpha$. Fig. 4
illustrates the smoothing constraint of the arrival curve $\alpha$ on
the cumulative function $A$. The particular type of arrival
curve $\alpha(t) = (\rho t + \sigma)1_{t>0}$, known as affine or leaky-bucket,
is denoted $\gamma_{\rho,\sigma}$, with $\rho$ the rate and $\sigma$ the burstiness.
Let $A, A'$ be the cumulative arrival and departure functions
of a network element. This element has $\beta$ as a service
curve iff $\forall t \ge 0: A'(t) \ge \inf_{0 \le s \le t}(A(s) + \beta(t-s))$, which is
equivalent to $A' \ge A \otimes \beta$. Fig. 5 illustrates the guarantee
offered by the service curve $\beta$. The particular type of service
curve $\beta(t) = R[t-T]^+$, known as rate-latency, is denoted
$\beta_{R,T}$, with $R$ the rate and $T$ the latency.
Key results of Deterministic Network Calculus include:
Figure 5: Service curve $\beta$ for a server $A \to A'$.
• The arrival curve of the aggregate of two flows is the
sum of their arrival curves.
• A flow $A(t)$ with arrival curve $\alpha(t)$ that traverses a
server with service curve $\beta(t)$ results in a flow $A'(t)$
constrained by the arrival curve $(\alpha \oslash \beta)(t)$.
• The service curve of a tandem of two servers with
respective service curves $\beta_1(t)$ and $\beta_2(t)$ is $(\beta_1 \otimes \beta_2)(t)$.
• If a flow has arrival curve $\alpha(t)$ on a node that offers
service curve $\beta(t)$, the backlog bound is $\max_{t \ge 0}(\alpha(t) - \beta(t))$
and the delay bound is $\max_{t \ge 0}\{\inf\{s \ge 0 : \alpha(t) \le \beta(t+s)\}\}$.
These two bounds are respectively the maximum vertical
deviation and maximum horizontal deviation between $\alpha(t)$ and $\beta(t)$.
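For the affine/rate-latency pair used throughout this paper, these bounds admit closed forms. The following helper is our own minimal sketch with exact rationals, assuming the stability condition $\rho \le R$:

```python
from fractions import Fraction as F

def dnc_bounds(rho, sigma, R, T):
    """Closed-form DNC bounds for an affine arrival curve gamma_{rho,sigma}
    against a rate-latency service curve beta_{R,T}, assuming rho <= R:
    backlog sigma + rho*T (vertical deviation, reached at t = T), delay
    T + sigma/R (horizontal deviation), and output burstiness sigma + rho*T
    (the deconvolution alpha oslash beta is again affine with rate rho)."""
    assert rho <= R, "stability requires rho <= R"
    backlog = sigma + rho * T
    delay = T + F(sigma) / R
    out_burstiness = sigma + rho * T
    return backlog, delay, out_burstiness
```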
3. DNC APPLICATION TO THE MPPA NOC
For any flow $f_i$, the solution of the routing problem yields
its unique path and its rate $\rho_i$. We now consider the
formulation of the second problem: the computation of the flow
burstiness parameter at ingress $\sigma_i$ for each flow. Not only
is this problem linear, thanks to the now determined values of
the flow rates, but we can also account for the effects of link
shaping: due to the NoC router architecture, any internal
or external link has a maximum rate $r = 1$ flit/cycle.
3.1 Effects of Link Shaping on Queues
Let $F^j$ denote the set of flows passing through an active
queue $q_j$. Assume that the arrival curve of each flow $f_i \in F^j$
is of leaky-bucket type with rate $\rho_i^j$ and burstiness $\sigma_i^j$.
Then $q_j$ receives a leaky-bucket shaped flow $\gamma_{\rho^j,\sigma^j}$ with
$\rho^j = \sum_{f_i \in F^j} \rho_i^j$ and $\sigma^j = \sum_{f_i \in F^j} \sigma_i^j$. However, the maximum
queue filling rate is $r$. So the arrival curve for $q_j$ is the
convolution of the curves $\lambda(t) \triangleq rt$ and $\alpha(t) \triangleq (\sigma^j + \rho^j t)1_{t>0}$,
which is their minimum, as both curves are concave and pass
through the origin [7]. The time where the line $rt$ meets the
affine curve $(\sigma^j + \rho^j t)1_{t>0}$ is $\tau^j \triangleq \frac{\sigma^j}{r - \rho^j}$ (Fig. 6).
Further assume that the link arbiter offers a rate-latency
service curve $\beta_{R^j,T^j}$ to queue $q_j$, with $R^j \le r$. The backlog $b^j$
of $q_j$ is the maximum vertical deviation between the arrival
curve and the service curve. It has two values, depending
on the comparison between $\tau^j$ and $T^j$. If $\tau^j \le T^j$ then
$b^j = \sigma^j + \rho^j T^j$ (Fig. 6 left), else $b^j = r\tau^j - (\tau^j - T^j)R^j$
(Fig. 6 center). As $\tau^j \le T^j \Leftrightarrow \sigma^j \le (r - \rho^j)T^j$:
$$b^j = \sigma^j + \rho^j T^j \quad \text{if } \sigma^j \le (r - \rho^j)T^j \quad (1)$$
$$b^j = \frac{r - R^j}{r - \rho^j}\sigma^j + R^j T^j \quad \text{if } \sigma^j \ge (r - \rho^j)T^j \quad (2)$$
The delay $d^j$ for queue $q_j$ is the maximum horizontal
deviation between the arrival curve and the service curve. As
$r \ge R^j$, this maximum is reached at $t = \tau^j$ on the arrival
curve (Fig. 6 right). Let $\delta^j$ be the time when the service curve
reaches the height of the arrival curve at time $\tau^j$. This implies
$r\tau^j = (\delta^j - T^j)R^j$. As the delay $d^j = \delta^j - \tau^j$, we get:
$$d^j = T^j + \frac{\sigma^j(r - R^j)}{R^j(r - \rho^j)} \quad (3)$$
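Equations (1)-(3) can be transcribed directly; the following is a sketch for checking the case analysis with exact rationals, with argument names of our choosing:

```python
from fractions import Fraction as F

def queue_backlog(r, rho_j, sigma_j, R_j, T_j):
    """Backlog bound of Eq. (1)/(2) for an aggregate (rho_j, sigma_j)
    shaped by a link of peak rate r and served by beta_{R_j,T_j}."""
    if sigma_j <= (r - rho_j) * T_j:          # tau_j <= T_j: Eq. (1)
        return sigma_j + rho_j * T_j
    # tau_j > T_j: Eq. (2)
    return (r - R_j) * sigma_j / (r - rho_j) + R_j * T_j

def queue_delay(r, rho_j, sigma_j, R_j, T_j):
    """Delay bound of Eq. (3), reached at tau_j = sigma_j / (r - rho_j)."""
    return T_j + sigma_j * (r - R_j) / (R_j * (r - rho_j))
```

Both backlog branches agree on the boundary $\sigma^j = (r - \rho^j)T^j$, where each reduces to $rT^j$.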
3.2 Link Arbiter Service Curves
Figure 6: Effects of link shaping on backlog (left, center) and delay (right).
Figure 7: Shaping constraint on ingress burstiness.
Let $n^j$ be the number of active queues in the link arbiter,
including $q_j$. From the point of view of an active queue $q_j$,
the round-robin scheduler of the link arbiter ensures that one
packet is transmitted every round. With $l_{max}$ the maximum
packet size and $r$ the link rate, the maximum latency seen
by $q_j$ is $(n^j - 1)\frac{l_{max}}{r}$, while the long-term rate for $q_j$ is $\frac{r}{n^j}$
[7]. This yields a first service curve $\beta^j = \beta_{R^j,T^j}$ for $q_j$:
$$R^j = \frac{r}{n^j} \quad \text{and} \quad T^j = (n^j - 1)\frac{l_{max}}{r} \quad (4)$$
However, this service curve may be overly constraining.
To see how, consider a case where $n^j = 2$ and there are
three flows with rates $\rho_a = \rho_b = \rho_c = \frac{r}{3}$. There must be
two flows passing through one queue and one flow through
the other queue. As a result, there is one queue with service
rate $\frac{r}{2}$ which is traversed by a total flow rate of $\rho_a + \rho_b = \frac{2r}{3}$.
Another approach to derive a service curve for queue $q_j$
is to consider that the round-robin scheduler serves packets
at peak rate $r$ according to a blind multiplexing strategy
across the queues. In that case, we may apply Theorem
6.2.1 (Blind Multiplexing) [3]:
Theorem. Consider a node serving two flows, 1 and 2, with
some unknown arbitration between the two flows. Assume
that the node guarantees a strict service curve $\beta$ to the
aggregate of the two flows. Assume that flow 2 is $\alpha_2$-smooth.
Define $\beta_1(t) \triangleq [\beta(t) - \alpha_2(t)]^+$. If $\beta_1$ is wide-sense
increasing, then it is a service curve for flow 1.
In particular, if $\beta = \beta_{R,T}$ and $\alpha_2 = \gamma_{\rho_2,\sigma_2}$, the first flow
is guaranteed a service curve $\beta_{R',T'}$ with $R' = R - \rho_2$ and
$T' = T + \frac{\sigma_2 + T\rho_2}{R - \rho_2}$ [3]. Let $A^j$ be the set of active queues in the
link arbiter of $q_j$, and $B^j \triangleq A^j - \{q_j\}$. Application with
$\rho_2 = \sum_{k \in B^j} \rho^k$, $\sigma_2 = \sum_{k \in B^j} \sigma^k$, $R = r$ and $T = 0$ yields
another service curve $\beta^j = \beta_{R^j,T^j}$ for $q_j$:
$$R^j = r - \sum_{k \in B^j} \rho^k \quad \text{and} \quad T^j = \frac{\sum_{k \in B^j} \sigma^k}{r - \sum_{k \in B^j} \rho^k} \quad (5)$$
In principle, the choice between equations (4) or (5) at
each server should be deferred until evaluation. In practice,
we heuristically select Eq. (4) in case $\rho^j \le \frac{r}{n^j}$, with
$\rho^j = \sum_{f_i \in F^j} \rho_i$ the sum of the flow rates inside $q_j$.
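The selection between Eq. (4) and Eq. (5) can be sketched as follows; the heuristic test is the one stated above, while the function and argument names are our own:

```python
from fractions import Fraction as F

def service_curve(r, l_max, n_active, rho_own, others):
    """Rate-latency pair (R, T) offered to a queue by its link arbiter.
    Uses the round-robin bound (Eq. (4)) when the queue's aggregate rate
    rho_own fits its fair share r/n_active, else the blind-multiplexing
    residual (Eq. (5)). `others` lists (rho_k, sigma_k) of the sibling
    active queues of the same link arbiter."""
    if rho_own <= F(r) / n_active:
        # Eq. (4): round-robin rate and latency
        return F(r) / n_active, (n_active - 1) * F(l_max) / r
    # Eq. (5): residual service under blind multiplexing
    rho2 = sum(rho for rho, _ in others)
    sigma2 = sum(sigma for _, sigma in others)
    return r - rho2, F(sigma2) / (r - rho2)
```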
3.3 Flow Burstiness Parameters
Let $l_{max}$ be the maximum packet size, which we assume
the same for all flows. At ingress, whole packets are atomically
injected at rate $r$. Call $\theta$ the date when injection ends
(Fig. 7). We have $r\theta = l_{max}$ and $l_{max} \le \sigma_i + \rho_i\theta$, so:
$$\forall f_i \in F: \sigma_i \ge \sigma_i^{min} \triangleq l_{max}\frac{r - \rho_i}{r} \quad (6)$$
Then, consider the constraints that the router queues must
not overflow, given $b_{max}$ the size in flits of any queue. Queue
$q_j$ receives a leaky-bucket shaped flow $\gamma_{\rho^j,\sigma^j}$ with
$\rho^j = \sum_{f_i \in F^j} \rho_i^j$ and $\sigma^j = \sum_{f_i \in F^j} \sigma_i^j$, shaped by the link at rate
$r$, so equations (1) and (2) apply for the backlog $b^j$. Router
queues do not overflow if $\forall q_j \in Q: b^j \le b_{max}$, so:
$$b_{max} \ge \sigma^j + \rho^j T^j \quad \text{if } \sigma^j \le (r - \rho^j)T^j \quad (7)$$
$$b_{max} \ge \frac{r - R^j}{r - \rho^j}\sigma^j + R^j T^j \quad \text{if } \sigma^j \ge (r - \rho^j)T^j \quad (8)$$
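A sketch of the ingress bound (6) and of the overflow check (7)/(8); the queue size passed to the check is a parameter, not a documented MPPA value:

```python
from fractions import Fraction as F

def sigma_min(r, l_max, rho_i):
    """Minimum ingress burstiness (Eq. (6)): a whole packet of l_max flits
    is injected atomically at link rate r."""
    return F(l_max) * (r - rho_i) / r

def fits_in_queue(b_max, r, rho_j, sigma_j, R_j, T_j):
    """Overflow constraints (7)/(8) on the aggregate (rho_j, sigma_j) of a
    queue served by beta_{R_j,T_j}, shaped by the link at rate r."""
    if sigma_j <= (r - rho_j) * T_j:
        return sigma_j + rho_j * T_j <= b_max                     # Eq. (7)
    return (r - R_j) * sigma_j / (r - rho_j) + R_j * T_j <= b_max  # Eq. (8)
```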
We now express the values $\rho_i^j$ and $\sigma_i^j$ for all flows $f_i \in F^j$
of an active queue $q_j$. Obviously, $\rho_i^j = \rho_i$, while $\sigma_i^j = \sigma_i$
if $q_j$ is the first active queue traversed by the flow. Else,
let $q_k$ be the predecessor of $q_j$ in the sequence of active queues
traversed by flow $f_i$, with $\beta_{R^k,T^k}$ its service curve. When
flow $f_i$ traverses queue $q_k$, its burstiness increases differently
whether it is alone or aggregated with other flows in $q_k$.
If the flow is alone in queue $q_k$, we apply the classic result
of the effects of a rate-latency service curve $\beta_{R,T}$ on a flow
constrained by an affine arrival curve $\gamma_{\rho,\sigma}$. The result is
another affine arrival curve $\gamma_{\rho,\sigma+\rho T}$ [3], so:
$$\sigma_i^j = \sigma_i^k + \rho_i T^k \quad (9)$$
Else, we apply Theorem 6.2.2 (Burstiness Increase
Due to FIFO Multiplexing, General Case) [3]:
Theorem. Consider a node serving two flows, 1 and 2, in
FIFO order. Assume that flow 1 is constrained by one leaky
bucket with rate $\rho_1$ and burstiness $\sigma_1$, and flow 2 is
constrained by a sub-additive arrival curve $\alpha_2$. Assume that
the node guarantees to the aggregate of the two flows a rate-latency
service curve $\beta_{R,T}$. Call $\rho_2 = \inf_{t>0} \frac{\alpha_2(t)}{t}$ the
maximum sustainable rate for flow 2. If $\rho_1 + \rho_2 < R$, then
at the output, flow 1 is constrained by $\gamma_{\rho_1,b_1}$ with
$b_1 = \sigma_1 + \rho_1(T + \frac{B}{R})$ and $B = \sup_{t \ge 0}[\alpha_2(t) + \rho_1 t - Rt]$.
For the application of this theorem, flow 1 is $f_i$, the flow of
interest, and flow 2 is the aggregate $F^k - \{f_i\}$ of other flows in
$q_k$, with $R = R^k$ and $T = T^k$. Let $\rho_1 = \rho_i$, $\sigma_1 = \sigma_i^k$, $b_1 = \sigma_i^j$,
$\rho_2 = \sum_{l \in F^k, l \ne i} \rho_l$, and $\sigma_2 = \sum_{l \in F^k, l \ne i} \sigma_l^k$. Because of link
shaping in $q_k$, $\alpha_2(t) = \min(rt, \rho_2 t + \sigma_2)1_{t>0}$. Let $\tau_2 \triangleq \frac{\sigma_2}{r - \rho_2}$,
so that $r\tau_2 = \rho_2\tau_2 + \sigma_2$. If $0 \le t \le \tau_2$ then $\alpha_2(t) = rt$, else if
$t \ge \tau_2$ then $\alpha_2(t) = \rho_2 t + \sigma_2$. From the definition of $B$:
$$B = \sup\Big(\sup_{0 \le t \le \tau_2}(rt + \rho_1 t - Rt),\ \sup_{\tau_2 \le t}(\rho_2 t + \sigma_2 + \rho_1 t - Rt)\Big) = \frac{r + \rho_1 - R}{r - \rho_2}\sigma_2$$
As a result, $b_1 = \sigma_1 + \rho_1\big(T + \frac{\sigma_2(r + \rho_1 - R)}{R(r - \rho_2)}\big)$, yielding:
$$\sigma_i^j = \sigma_i^k + \rho_i\left(T^k + \frac{(\sum_{l \in F^k, l \ne i} \sigma_l^k)(r + \rho_i - R^k)}{R^k(r - \sum_{l \in F^k, l \ne i} \rho_l)}\right) \quad (10)$$
By comparison, Corollary 6.2.3 (Burstiness Increase
due to FIFO) [3] (Section 3.4) yields $b_1 = \sigma_1 + \rho_1(T + \frac{\sigma_2}{R})$,
which is worse because $\rho_1 + \rho_2 < R \Rightarrow r + \rho_1 - R < r - \rho_2$.
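The burstiness propagation of equations (9) and (10) can be sketched as a single helper; naming is ours, and `others` carries the competing flows of the predecessor queue:

```python
from fractions import Fraction as F

def burstiness_out(sigma_in, rho_i, R_k, T_k, r=1, others=()):
    """Burstiness of flow i after traversing active queue k: Eq. (9) when
    the flow is alone in the queue, Eq. (10) under FIFO aggregation with
    link shaping. `others` lists (rho_l, sigma_l^k) of the competing flows
    in the same queue."""
    if not others:
        return sigma_in + rho_i * T_k                              # Eq. (9)
    rho2 = sum(rho for rho, _ in others)
    sigma2 = sum(sig for _, sig in others)
    # B/R term of Theorem 6.2.2 with the link-shaped arrival curve
    b_over_r = sigma2 * (r + rho_i - R_k) / (R_k * (r - rho2))
    return sigma_in + rho_i * (T_k + b_over_r)                     # Eq. (10)
```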
3.4 End-to-End Latency Bound
For computing an upper bound on the end-to-end latency
of any particular flow $f_i$, we proceed in three steps. First,
compute the left-over (or residual) service curve $\beta_i^j$ of each
active queue $q_j$ traversed by $f_i$. Second, find the equivalent
service curve $\beta_i^*$ offered by the NoC to $f_i$ through the
convolution of the left-over service curves $\beta_i^j$. Last, find the
end-to-end latency bound by computing $d_i^*$, the delay between
$\alpha_i$, the arrival curve of flow $f_i$, and $\beta_i^*$. Adding $d_i^*$
to the constant delays of flow $f_i$, such as the traversal of
non-active queues and other logic and wiring pipeline, yields
the upper bound. This approach is similar in principle to
the Separated Flow Analysis (SFA) [7], even though the latter
is formulated in the setting of aggregation under blind
multiplexing, while we use FIFO multiplexing.
For the first step, we have two cases to consider at each
active queue $q_j$. Either $f_i$ is the only flow traversing $q_j$, and
$\beta_i^j = \beta_{R^j,T^j}$ from equations (4) or (5). Or, $f_i$ is aggregated
in $q_j$ with other flows in $F^j$. Packets from the flow aggregate
$F^j$ are served in FIFO order, so we may apply Corollary
6.2.3 (Burstiness Increase due to FIFO) [3]:
Corollary. Consider a node serving two flows, 1 and 2, in
FIFO order. Assume that flow 2 is constrained by one leaky
bucket with rate $\rho_2$ and burstiness $\sigma_2$. Assume that the node
guarantees to the aggregate of the two flows a rate-latency
service curve $\beta_{R,T}$. If $\rho_1 + \rho_2 < R$, then flow 1 has a service
curve equal to the rate-latency function with rate $R - \rho_2$ and
latency $T + \frac{\sigma_2}{R}$, and at the output, flow 1 is constrained by one
leaky bucket with rate $\rho_1$ and burstiness $b_1 = \sigma_1 + \rho_1(T + \frac{\sigma_2}{R})$.
For the application of this corollary, flow 1 is $f_i$, the flow of
interest, and flow 2 is the aggregate $F^j - \{f_i\}$ of other flows
in $q_j$, with $R = R^j$ and $T = T^j$. Let $\rho_2 = \sum_{l \in F^j, l \ne i} \rho_l$ and
$\sigma_2 = \sum_{l \in F^j, l \ne i} \sigma_l^j$. This yields the left-over service curve
$\beta_i^j = \beta_{R_i^j,T_i^j}$ for an active queue $q_j$ traversed by $f_i$:
$$R_i^j = R^j - \sum_{l \in F^j, l \ne i} \rho_l \quad \text{and} \quad T_i^j = T^j + \frac{\sum_{l \in F^j, l \ne i} \sigma_l^j}{R^j} \quad (11)$$
For the second step, we compute the convolution $\beta_i^*$ of the
left-over service curves $\beta_i^j$. Let $Q_i$ denote the set of active
queues traversed by flow $f_i$. Thanks to the properties of
rate-latency curves [3], $\beta_i^*$ is a rate-latency curve whose rate
$R_i^*$ is the minimum of the rates and whose latency $T_i^*$ is the
sum of the latencies of the left-over service curves $\beta_i^j$:
$$R_i^* = \min_{j \in Q_i} R_i^j \quad \text{and} \quad T_i^* = \sum_{j \in Q_i} T_i^j \quad (12)$$
Figure 8: Example of a NoC flow problem.
For the last step, we compute the delay $d_i^*$ between $\alpha_i$,
the arrival curve of flow $f_i$ at ingress, and $\beta_i^*$. This flow is
injected at rate $\rho_i$ and burstiness $\sigma_i$, however it is subjected
to link shaping at rate $r$ as it enters the network. As a result,
$\alpha_i = \min(rt, \sigma_i + \rho_i t)1_{t>0}$ and we may apply Eq. (3):
$$d_i^* = T_i^* + \frac{\sigma_i(r - R_i^*)}{R_i^*(r - \rho_i)} \quad (13)$$
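The second and third steps combine into a few lines; this is our own transcription of equations (12) and (13), where `leftovers` is the list of per-queue left-over $(R_i^j, T_i^j)$ pairs:

```python
from fractions import Fraction as F

def end_to_end_delay(rho_i, sigma_i, leftovers, r=1):
    """End-to-end delay bound for flow i: convolve the per-queue left-over
    rate-latency curves (Eq. (12): min of rates, sum of latencies), then
    take the horizontal deviation with the link-shaped ingress arrival
    curve (Eq. (13))."""
    R_star = min(R for R, _ in leftovers)
    T_star = sum(T for _, T in leftovers)
    return T_star + sigma_i * (r - R_star) / (R_star * (r - rho_i))
```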
4. APPLICATION EXAMPLE
For the application of the DNC equations, we use the
flow problem example illustrated in Fig. 8. There are four
flows, f1, f2, f3, f4, with f4a loop-back flow. Computing
rates using the max-min fairness objective [12] yields ρ1=2
3
and ρ2=ρ3=ρ4=1
3. The maximum packet size lmax is
set to 17 flits, corresponding to one for the header and the
others for the payload. Computing the σmin
ivalues according
to Eq. (6) yields σmin
1=17
3and σmin
2=σmin
3=σmin
3=34
3.
In the following table, we identify the queues based on
the node number and the turn corresponding to the queue.
For instance, $q_{2WS}$ identifies the queue in router 2 that
buffers packets from link $W$ to link $S$. We also consider
the queues that buffer traffic back to the compute clusters:
$q_{10LC}$ and $q_{8LC}$. Queues that share the same link arbiter
are $\{q_{2WS}, q_{2LS}\}$ for link 2S, $\{q_{10NW}, q_{10LW}\}$ for link 10W,
and $\{q_{8EL}, q_{8LL}\}$ for link 8L.

     q0LE | q2WS    | q2LS    | q10NL    | q10NW    | q10LW    | q8EL    | q8LL
f1        | σ1^2WS  |         | σ1^10NL  |          |          |         |
f2        |         | σ2^2LS  |          | σ2^10NW  |          | σ2^8EL  |
f3        |         |         |          |          | σ3^10LW  | σ3^8EL  |
f4        |         |         |          |          |          |         | σ4^8LL
Queues $q_{0LE}$ and $q_{10NL}$ are not active, so they can be
ignored. For the other queues, we compute their service
curves. Queue $q_{2WS}$ is traversed by rate $\rho_1 = \frac{2}{3} > \frac{r}{n^{2WS}} = \frac{1}{2}$,
so Eq. (5) for blind multiplexing applies to $q_{2WS}$. Likewise,
queue $q_{8EL}$ is traversed by rate $\rho_2 + \rho_3 = \frac{1}{3} + \frac{1}{3} > \frac{r}{n^{8EL}} = \frac{1}{2}$,
so Eq. (5) applies to $q_{8EL}$. Conversely, Eq. (4)
applies to $\{q_{2LS}, q_{10NW}, q_{10LW}, q_{8LL}\}$. This yields:
$$R^{2WS} = 1 - \rho_2 \qquad T^{2WS} = \frac{\sigma_2}{1 - \rho_2}$$
$$R^{2LS} = \tfrac{1}{2} \qquad T^{2LS} = l_{max}$$
$$R^{10NW} = \tfrac{1}{2} \qquad T^{10NW} = l_{max}$$
$$R^{10LW} = \tfrac{1}{2} \qquad T^{10LW} = l_{max}$$
$$R^{8EL} = 1 - \rho_4 \qquad T^{8EL} = \frac{\sigma_4}{1 - \rho_4}$$
$$R^{8LL} = \tfrac{1}{2} \qquad T^{8LL} = l_{max}$$
Next, we express the flow burstiness constraints and relations.
In most cases, Eq. (9) applies. The only cases where we
need to apply Eq. (10) are for the flows passing through $q_{8EL}$:
$$\sigma_1 \ge \sigma_1^{min}$$
$$\sigma_1^{2WS} = \sigma_1$$
$$\sigma_1^{10NL} = \sigma_1^{2WS} + \rho_1 T^{2WS} \quad \text{Eq. (9)}$$
$$\sigma_1^{10LC} = \sigma_1^{10NL}$$
$$\sigma_2 \ge \sigma_2^{min}$$
$$\sigma_2^{2LS} = \sigma_2$$
$$\sigma_2^{10NW} = \sigma_2^{2LS} + \rho_2 T^{2LS} \quad \text{Eq. (9)}$$
$$\sigma_2^{8EL} = \sigma_2^{10NW} + \rho_2 T^{10NW} \quad \text{Eq. (9)}$$
$$\sigma_2^{8LC} = \sigma_2^{8EL} + \rho_2\Big(T^{8EL} + \frac{\sigma_3^{8EL}(1 + \rho_2 - R^{8EL})}{R^{8EL}(1 - \rho_3)}\Big) \quad \text{Eq. (10)}$$
$$\sigma_3 \ge \sigma_3^{min}$$
$$\sigma_3^{8EL} = \sigma_3 + \rho_3 T^{10LW} \quad \text{Eq. (9)}$$
$$\sigma_3^{8LC} = \sigma_3^{8EL} + \rho_3\Big(T^{8EL} + \frac{\sigma_2^{8EL}(1 + \rho_3 - R^{8EL})}{R^{8EL}(1 - \rho_2)}\Big) \quad \text{Eq. (10)}$$
$$\sigma_4 \ge \sigma_4^{min}$$
$$\sigma_4^{8LC} = \sigma_4 + \rho_4 T^{8LL} \quad \text{Eq. (9)}$$
For the delays, we first compute the parameters of the
left-over service curves according to Eq. (12):
$$R_1^* = R^{2WS} = \tfrac{2}{3}$$
$$R_2^* = \min(R^{2LS}, R^{10NW}, R^{8EL} - \rho_3) = \tfrac{1}{3}$$
$$R_3^* = \min(R^{10LW}, R^{8EL} - \rho_2) = \tfrac{1}{3}$$
$$R_4^* = R^{8LL} = \tfrac{1}{2}$$
$$T_1^* = T^{2WS} = 17$$
$$T_2^* = T^{2LS} + T^{10NW} + \Big(T^{8EL} + \frac{\sigma_3^{8EL}}{R^{8EL}}\Big) = 76.5$$
$$T_3^* = T^{10LW} + \Big(T^{8EL} + \frac{\sigma_2^{8EL}}{R^{8EL}}\Big) = 68$$
$$T_4^* = T^{8LL} = 17$$
Here we have $R_i^j = R^j$ except for the two flows aggregated
in $q_{8EL}$, where $R_2^{8EL} = R^{8EL} - \rho_3$ and $R_3^{8EL} = R^{8EL} - \rho_2$.
Likewise, $T_i^j = T^j$ except for $T_2^{8EL} = T^{8EL} + \frac{\sigma_3^{8EL}}{R^{8EL}}$
and $T_3^{8EL} = T^{8EL} + \frac{\sigma_2^{8EL}}{R^{8EL}}$.
Finally, we apply Eq. (13), which yields the delay values:

        f_1    f_2     f_3    f_4
d_i^*   25.5   110.5   102    34
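Putting the pieces together, the whole example can be replayed numerically with exact rationals. This is our own transcription of the equations above, with the left-over curve of Eq. (11) at $q_{8EL}$ applied symmetrically to $f_2$ and $f_3$, which reproduces the printed delay table:

```python
from fractions import Fraction as F

r, l_max = 1, 17
rho1, rho2, rho3, rho4 = F(2, 3), F(1, 3), F(1, 3), F(1, 3)

# minimum ingress burstiness, Eq. (6)
s1 = l_max * (r - rho1) / r              # 17/3
s2 = s3 = s4 = l_max * (r - rho2) / r    # 34/3

# service curves: Eq. (5) for q2WS and q8EL, Eq. (4) for the others
R2WS, T2WS = r - rho2, s2 / (r - rho2)   # 2/3, 17
R2LS, T2LS = F(1, 2), l_max
R10NW, T10NW = F(1, 2), l_max
R10LW, T10LW = F(1, 2), l_max
R8EL, T8EL = r - rho4, s4 / (r - rho4)   # 2/3, 17
R8LL, T8LL = F(1, 2), l_max

# burstiness propagation of f2 and f3 up to q8EL, Eq. (9)
s2_8EL = (s2 + rho2 * T2LS) + rho2 * T10NW   # 68/3
s3_8EL = s3 + rho3 * T10LW                   # 17

# end-to-end delays, Eq. (11)-(13)
def delay(T_star, sigma, R_star, rho):
    return T_star + sigma * (r - R_star) / (R_star * (r - rho))

d1 = delay(T2WS, s1, R2WS, rho1)
d2 = delay(T2LS + T10NW + T8EL + s3_8EL / R8EL, s2,
           min(R2LS, R10NW, R8EL - rho3), rho2)
d3 = delay(T10LW + T8EL + s2_8EL / R8EL, s3,
           min(R10LW, R8EL - rho2), rho3)
d4 = delay(T8LL, s4, R8LL, rho4)
print(d1, d2, d3, d4)   # prints: 51/2 221/2 102 34
```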
5. SUMMARY AND CONCLUSIONS
We apply Deterministic Network Calculus (DNC) to the
network-on-chip (NoC) of the Kalray MPPA-256 Bostan
processor, in order to ensure quality of service to the end-
point tasks. The MPPA NoC is an RDMA-capable network,
which can be configured for bounding each flow injection rate
and burstiness, that is, the parameters of a leaky-bucket ar-
rival curve. We assume that flow paths between endpoints
have been selected, and that flow rates are set to the max-
imum given the link capacity bounds. This starting point
can be obtained as the solution of a max-min fairness routing
problem with unsplittable path, which is a standard tech-
nique for engineering elastic traffic in macro-networks.
Based on classic DNC results, we develop our contribu-
tions in four areas: modeling the effects of traffic shaping
by the peak rate of links of the NoC; refinement of the ser-
vice curves of the router queues, using either round-robin
scheduling or blind multiplexing; formulation of the bursti-
ness constraints and propagation across the network; and
computation of upper bounds on the flow end-to-end la-
tencies, based on the principles of separated flow analysis.
Thanks to the fact that flow rates are computed, the problem
of determining the burstiness parameters can be formulated
and solved as an acyclic set of linear inequalities.
6. ACKNOWLEDGMENT
This work was supported by the French DGE and Bpifrance
through the "Investissements d'Avenir" program CAPACITES.
7. REFERENCES
[1] S. Saidi, R. Ernst, S. Uhrig, H. Theiling, and B. D.
de Dinechin, “The Shift to Multicores in Real-time
and Safety-critical Systems,” in Proc. of the 10th
Inter. Conference on Hardware/Software Codesign and
System Synthesis, ser. CODES ’15, 2015, pp. 220–229.
[2] Q. Perret, P. Maurère, É. Noulard, C. Pagetti,
P. Sainrat, and B. Triquet, “Mapping hard real-time
applications on many-core processors,” in Proc. of the
24th Inter. Conference on Real-Time Networks and
Systems, RTNS 2016, Brest, France, October 19-21,
2016, 2016, pp. 235–244.
[3] J.-Y. Le Boudec and P. Thiran, Network Calculus: A
Theory of Deterministic Queuing Systems for the
Internet. Berlin, Heidelberg: Springer-Verlag, 2012.
[4] Z. Lu, M. Millberg, A. Jantsch, A. Bruce, P. van der
Wolf, and T. Henriksson, “Flow Regulation for
On-Chip Communication,” in Proc. of the Conference
on Design, Automation and Test in Europe, ser.
DATE ’09, 2009, pp. 578–581.
[5] Y. Durand, C. Bernard, and F. Clermidy, “Distributed
Dynamic Rate Adaptation on a Network on Chip with
Traffic Distortion,” in 10th IEEE Inter. Symposium on
Embedded Multicore/Many-core Systems-on-Chip,
MCSOC 2016, Lyon, France, September 21-23, 2016,
2016, pp. 225–232.
[6] M. Boyer and C. Fraboul, “Tightening end to end
delay upper bound for AFDX network calculus with
rate latency FCFS servers using network calculus,” in
IEEE Inter. Workshop on Factory Communication
Systems (WFCS), Dresden, Germany. IEEE, may
2008, pp. 11–20.
[7] A. Bouillard and G. Stea, “Worst-Case Analysis of
Tandem Queueing Systems Using Network Calculus,”
in Quantitative Assessments of Distributed Systems,
Bruneo and Distefano, Eds., 2015.
[8] L. Lenzini, L. Martorini, E. Mingozzi, and G. Stea,
“Tight end-to-end per-flow delay bounds in FIFO
multiplexing sink-tree networks,” Perform. Eval.,
vol. 63, no. 9-10, pp. 956–987, 2006.
[9] B. Dupont de Dinechin, Y. Durand, D. van Amstel,
and A. Ghiti, “Guaranteed Services of the NoC of a
Manycore Processor,” in Proc. of the 2014 Inter.
Workshop on Network on Chip Architectures, ser.
NoCArc ’14, 2014, pp. 11–16.
[10] W. J. Dally and C. L. Seitz, “Deadlock-Free Message
Routing in Multiprocessor Interconnection Networks,”
IEEE Trans. Comput., vol. 36, no. 5, pp. 547–553,
May 1987.
[11] P. Bahrebar and D. Stroobandt, “The
Hamiltonian-based Odd-even Turn Model for
Maximally Adaptive Routing in 2D Mesh
Networks-on-chip,” Comput. Electr. Eng., vol. 45,
no. C, pp. 386–401, Jul. 2015.
[12] S. Chen and K. Nahrstedt, “Maxmin Fair Routing in
Connection-Oriented Networks,” in Proc. of
Euro-Parallel and Distributed Systems Conference
(Euro-PDS ’98), 1998, pp. 163–168.