
Network-on-Chip Service Guarantees

on the Kalray MPPA-256 Bostan Processor

Benoît Dupont de Dinechin and Amaury Graillat

Kalray SA

445 rue Lavoisier

F-38330 Montbonnot, France

ABSTRACT

The Kalray MPPA-256 Bostan manycore processor implements a clustered architecture, where clusters of cores share a local memory, and a DMA-capable network-on-chip (NoC) connects the clusters. The NoC implements wormhole switching without virtual channels, with source routing, and can be configured for maximum flow rate and burstiness at ingress. We describe and illustrate the techniques used to configure the MPPA NoC for guaranteed services. Our approach is based on three steps: global selection of routes between endpoints and computation of flow rates, by solving the max-min fairness with unsplittable path problem; configuration of the flow burstiness parameters at ingress, by solving an acyclic set of linear inequalities; and end-to-end latency upper bound computation, based on the principles of separated flow analysis (SFA). In this paper, we develop the last two steps, taking advantage of the effects of NoC link shaping on the leaky-bucket arrival curves of flows.

Keywords

Network-on-Chip, Deterministic Network Calculus, Link Traffic Shaping, Separated Flow Analysis

1. INTRODUCTION

1.1 Motivation and Problem

The Kalray MPPA-256 Bostan manycore processor is designed for timing predictability and energy efficiency [1]. Its cores are clustered into compute units sharing a local memory, and these clusters are connected through a network-on-chip (NoC). The MPPA NoC implements wormhole switching with source routing and supports guaranteed services through the configuration of flow injection parameters at the NoC interface: the maximum rate ρ; the maximum burstiness σ; the minimum and the maximum packet sizes.

The MPPA-256 processors are used in time-critical applications [2], where a set of cyclic or sporadic tasks interact through a static communication graph. When application tasks are mapped to the cores, tasks communicate either through the local memory if co-located on the same cluster, or through the NoC that supports remote DMA. Ensuring the worst-case response times (WCRT) of time-critical applications requires upper approximations of worst-case execution times (WCET) for the computations and of worst-case traversal times (WCTT) for the remote communications.

AISTECS 2017, Stockholm, Sweden

ACM ISBN 978-1-4503-2138-9.

DOI: 10.1145/1235

In this work, we address the problem of computing end-to-end latency upper bounds for on-chip communications, assuming that the tasks have been assigned to cores inside clusters, and thus to NoC nodes. We base our approach on Deterministic Network Calculus (DNC), which was originally developed for the performance evaluation of Internet and ATM networks [3]. Since then, DNC has been applied to flow regulation of NoCs [4][5], and is used to guarantee the services of the AFDX Ethernet-based avionics networks [6].

Application of DNC to a NoC assumes that routes between endpoints are predetermined, and that flow injection parameters are known or enforced. Moreover, key DNC results only apply to feed-forward networks, that is, networks where the graph of unidirectional links traversed by the set of flows has no cycles [3]. A related assumption is that the hop-to-hop flow control of the NoC is not active, that is, the traffic is such that the router queues always accept incoming flits. Figuratively speaking, queues must not overflow.

1.2 Contributions and Related Work

Our contributions are as follows. At the time scale of NoC

operations, there are no instantaneous packet arrivals, so the

(σ, ρ) characterization of ﬂows used in macro networks must

be reﬁned by taking into account the ﬂit serialization eﬀects

of links, both inside and between the NoC routers. These ef-

fects appear signiﬁcant for the increase of the ﬂow burstiness

between hops and for the computation of the end-to-end la-

tency bound. A related approach is to shape individual ﬂows

at ingress, then propagate the resulting arrival curves across

the network elements [6]. However, this approach is more

relevant for an AFDX network than for a NoC.

For the service oﬀered to a queue by the link arbiter, we

select either the rate and latency ensured by the round-

robin packet scheduling, or the residual service guaranteed

by blind multiplexing when the former does not apply. Pre-

vious works use only blind multiplexing [7] or only FIFO

multiplexing [8], however the latter does not apply to the

MPPA NoC link arbiters, since they do not keep the FIFO

service order between packets from diﬀerent queues.

Compared to earlier work of applying DNC to the NoC of

the ﬁrst-generation MPPA Andey processor [9], we decouple

the computation of the ﬂow rates and burstiness and obtain

signiﬁcantly better rates and end-to-end latency bounds.

2. BACKGROUND

2.1 Overview of MPPA NoC Management

There are two objectives of the MPPA NoC management. First, ensure deadlock-free operations, which is a concern with wormhole-switched networks. Second, in case the communication flows are static, configure the routes and packet injection parameters so that minimum flow rates and maximum end-to-end latencies can be guaranteed. These two objectives are connected by the following insight: on a wormhole-switched network, a static deadlock-free routing algorithm ensures that the flows are feed-forward. Indeed, deadlocks can be avoided by using a routing scheme that prevents circuits in the resource dependence graph [10]. With wormhole switching, the resources are the router internal or external links, and their dependencies are induced by the packet paths.

We assume that application tasks have been assigned endpoints on the NoC. A static deadlock-free routing algorithm with path diversity, in our case the Hamiltonian Odd-Even turn model [11], is used to compute a set of paths between the endpoints. In a first optimization problem, we globally select a single path for each endpoint pair with the max-min fairness objective [12]. This objective corresponds to maximizing the lowest flow rate, then the next lowest flow rate, etc. Solving this optimization problem yields the flow rates.

We then formulate a second problem, which is the topic of this paper. The objective is to compute the flow burstiness parameters at ingress, in order to ensure that the NoC router queues do not overflow. This entails propagating the individual and aggregate flow parameters across the relevant network elements, which are the outgoing link arbiters that serve the packets in round-robin across the adjacent 'turn' queues. Replacing the flow rates by their values from the first optimization problem yields a system of linear inequalities in the flow burstiness variables. Thanks to the feed-forward flows, the system is acyclic and solved in one pass.

Finally, we obtain upper bounds on the end-to-end latencies between endpoints by applying the principles of separated flow analysis (SFA) [7], where a left-over service curve is computed at each network element for the flow of interest. These left-over service curves are combined by (min,+) convolution, yielding the end-to-end service curve provided by the network for this flow. Computing the maximum horizontal deviation between the flow arrival curve at ingress and this end-to-end service curve yields the upper bound on the end-to-end latency for the flow.

2.2 The MPPA-256 Bostan Processor

The Kalray MPPA-256 Bostan processor is the second implementation of the MPPA (Massively Parallel Processing Array) architecture, manufactured with the TSMC CMOS 28HP process. The MPPA architecture belongs to the 'manycore' family, characterized by a large number of cores, the architecturally visible aggregation of resources into compute tiles (DSP 'core packs', GPU 'streaming multiprocessors', OpenCL 'compute units') with local memory, and the exploitation of a network-on-chip for their interaction.

The MPPA-256 Bostan processor integrates 256 processing cores and 32 management cores on a single chip, all based on the same 32-bit/64-bit VLIW core architecture. Its architecture is clustered with 16 compute clusters and 2 I/O clusters, where each cluster is built around a multi-banked local static memory shared by the 16+1 (compute cluster) or 4+4 (I/O cluster) processing + management cores. Fig. 1 illustrates the MPPA-256 Bostan clusters, with the I/O clusters in light blue and the compute clusters in dark blue. The clusters communicate through the NoC, with one node per compute cluster and 8 nodes per I/O cluster.

Figure 1: MPPA-256 Bostan clusters and NoC.

Figure 2: Structure of a MPPA NoC router.

The MPPA-256 Bostan NoC is a direct network with a 2D torus topology, extended with extra links connected to the otherwise unused links of the NoC nodes on the I/O clusters. The NoC implements wormhole switching with source routing and without virtual channels. With wormhole switching, a packet is decomposed into 32-bit flits, which travel in a pipelined fashion across the network elements, with buffering applied at the flit level. The packet follows a route determined by a bit string in the header.

The key network elements are the link arbiters inside the routers (Fig. 2), which schedule whole packets for transmission by applying a round-robin selection on adjacent queues. There is one queue per incoming direction or 'turn'. We call a queue active if there is some flow passing through it and there is another queue in the same link arbiter with some flow. A non-active queue either has no flow passing through it, or is the only one in the link arbiter with some flow. In the latter case, the queue has no effect on the packets beyond a constant delay, so it can be ignored except for this constant delay, which needs to be added to the end-to-end latency bound.

Figure 3: Flow arrival and departure as cumulative data functions over time.

Figure 4: Arrival curve α for function A.

2.3 Deterministic Network Calculus (DNC)

Network Calculus [3] is a framework for the performance analysis of networks based on the representation of flows as cumulative data over time. Given a network element with cumulative arrival and departure functions A and A′, the delay for traversing the network element and the backlog of data it has to serve correspond respectively to the horizontal and the vertical deviations between A and A′ (Fig. 3).

Network Calculus exploits the properties of the (min,+) algebra; in particular, it introduces the following operations:

convolution: (f ⊗ g)(t) ≜ inf_{0≤s≤t} (f(s) + g(t−s))

deconvolution: (f ⊘ g)(t) ≜ sup_{s≥0} (f(t+s) − g(s))

Let A be a cumulative data function. An arrival curve α is a constraint on A defined by ∀ 0 ≤ s ≤ t: A(t) − A(s) ≤ α(t−s), which is equivalent to A ≤ A ⊗ α. Fig. 4 illustrates the smoothing constraint of the arrival curve α on the cumulative function A. The particular type of arrival curve α(t) = (ρt + σ)1_{t>0}, known as affine or leaky-bucket, is denoted γ_{ρ,σ}, with ρ the rate and σ the burstiness.

Let A, A′ be the cumulative arrival and departure functions of a network element. This element has β as a service curve iff ∀t ≥ 0: A′(t) ≥ inf_{0≤s≤t} (A(s) + β(t−s)), which is equivalent to A′ ≥ A ⊗ β. Fig. 5 illustrates the guarantee offered by the service curve β. The particular type of service curve β(t) = R[t−T]^+, known as rate-latency, is denoted β_{R,T}, with R the rate and T the latency.

Key results of Deterministic Network Calculus include:

Figure 5: Service curve β for a server A → A′.

• The arrival curve of the aggregate of two flows is the sum of their arrival curves.

• A flow A(t) with arrival curve α(t) that traverses a server with service curve β(t) results in a departure flow A′(t) constrained by the arrival curve (α ⊘ β)(t).

• The service curve of a tandem of two servers with respective service curves β1(t) and β2(t) is (β1 ⊗ β2)(t).

• If a flow has arrival curve α(t) on a node that offers service curve β(t), the backlog bound is max_{t≥0} (α(t) − β(t)) and the delay bound is max_{t≥0} {inf{s ≥ 0 : α(t) ≤ β(t+s)}}. These two bounds are respectively the maximum vertical deviation and the maximum horizontal deviation between α(t) and β(t).
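For a leaky-bucket arrival curve γ_{ρ,σ} served by a rate-latency curve β_{R,T} with ρ ≤ R, these two deviations have well-known closed forms. The sketch below illustrates them with exact rational arithmetic; the helper names are ours, not from the paper:

```python
from fractions import Fraction as F

def dnc_backlog(sigma, rho, R, T):
    """Maximum vertical deviation between gamma_{rho,sigma} and beta_{R,T}:
    reached at t = T, giving sigma + rho*T (requires rho <= R)."""
    assert rho <= R, "stability requires rho <= R"
    return sigma + rho * T

def dnc_delay(sigma, rho, R, T):
    """Maximum horizontal deviation: the initial burst sigma is served
    after the latency T at rate R, giving T + sigma/R."""
    assert rho <= R, "stability requires rho <= R"
    return T + F(sigma) / R

# A flow (sigma = 10 flits, rho = 1/3 flit/cycle) through beta_{1/2, 17}:
print(dnc_backlog(10, F(1, 3), F(1, 2), 17))  # 47/3 flits
print(dnc_delay(10, F(1, 3), F(1, 2), 17))    # 37 cycles
```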

3. DNC APPLICATION TO THE MPPA NOC

For any flow f_i, the solution of the routing problem yields its unique path and its rate ρ_i. We now consider the formulation of the second problem: the computation of the flow burstiness parameter at ingress σ_i for each flow. Not only is this problem linear, thanks to the now determined values of the flow rates, but we also account for the effects of link shaping: due to the NoC router architecture, any internal or external link has a maximum rate r = 1 flit/cycle.

3.1 Effects of Link Shaping on Queues

Let F_j denote the set of flows passing through an active queue q_j. Assume that the arrival curve of each flow f_i ∈ F_j is of leaky-bucket type with rate ρ_i^j and burstiness σ_i^j. Then q_j receives a leaky-bucket shaped flow γ_{ρ^j,σ^j} with ρ^j = Σ_{f_i∈F_j} ρ_i^j and σ^j = Σ_{f_i∈F_j} σ_i^j. However, the maximum queue filling rate is r. So the arrival curve for q_j is the convolution of the curves λ(t) ≜ rt and α(t) ≜ (σ^j + ρ^j t)1_{t>0}, which is their minimum as both curves are concave and pass through the origin [7]. The time where the line rt meets the affine curve (σ^j + ρ^j t)1_{t>0} is τ^j ≜ σ^j / (r − ρ^j) (Fig. 6).

Further assume that the link arbiter offers a rate-latency service curve β_{R^j,T^j} to queue q_j with R^j ≤ r. The backlog b^j of q_j is the maximum vertical deviation between the arrival curve and the service curve. It has two values, depending on the comparison between τ^j and T^j. If τ^j ≤ T^j then b^j = σ^j + ρ^j T^j (Fig. 6 left), else b^j = rτ^j − (τ^j − T^j)R^j (Fig. 6 center). As τ^j ≤ T^j ⇔ σ^j ≤ (r − ρ^j)T^j:

b^j = σ^j + ρ^j T^j   if σ^j ≤ (r − ρ^j)T^j    (1)

b^j = ((r − R^j)/(r − ρ^j)) σ^j + R^j T^j   if σ^j ≥ (r − ρ^j)T^j    (2)

The delay d^j for queue q_j is the maximum horizontal deviation between the arrival curve and the service curve. As r ≥ R^j, this maximum is reached at t = τ^j on the arrival curve (Fig. 6 right). Let δ^j be the time when the service curve reaches the height of the arrival curve at time τ^j. This implies rτ^j = (δ^j − T^j)R^j. As the delay d^j = δ^j − τ^j, we get:

d^j = T^j + σ^j (r − R^j) / (R^j (r − ρ^j))    (3)
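Equations (1)-(3) can be evaluated mechanically once the per-queue aggregates are known. The following is a minimal sketch, assuming exact rational arithmetic and r = 1 flit/cycle; the function names are ours:

```python
from fractions import Fraction as F

def shaped_backlog(sigma_j, rho_j, R_j, T_j, r=F(1)):
    """Backlog bound of queue q_j under link shaping, Eqs. (1)-(2)."""
    if sigma_j <= (r - rho_j) * T_j:                       # tau^j <= T^j
        return sigma_j + rho_j * T_j                       # Eq. (1)
    return (r - R_j) / (r - rho_j) * sigma_j + R_j * T_j   # Eq. (2)

def shaped_delay(sigma_j, rho_j, R_j, T_j, r=F(1)):
    """Delay bound of queue q_j under link shaping, Eq. (3)."""
    return T_j + sigma_j * (r - R_j) / (R_j * (r - rho_j))
```

For instance, a queue with aggregate (σ^j, ρ^j) = (34/3, 1/3) behind a β_{1/2,17} service curve has a delay bound of 34 cycles.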

3.2 Link Arbiter Service Curves

Figure 6: Effects of link shaping on backlog (left, center) and delay (right).

Figure 7: Shaping constraint on ingress burstiness.

Let n^j be the number of active queues in the link arbiter, including q_j. From the point of view of an active queue q_j, the round-robin scheduler of the link arbiter ensures that one packet is transmitted every round. With lmax the maximum packet size and r the link rate, the maximum latency seen by q_j is (n^j − 1) lmax / r, while the long-term rate for q_j is r / n^j [7]. This yields a first service curve β^j = β_{R^j,T^j} for q_j:

R^j = r / n^j   and   T^j = (n^j − 1) lmax / r    (4)

However, this service curve may be overly constraining. To see why, consider a case where n^j = 2 and there are three flows with rates ρ_a = ρ_b = ρ_c = r/3. There must be two flows passing through one queue and one flow through the other queue. As a result, there is one queue with service rate r/2 which is traversed by a total flow rate of ρ_a + ρ_b = 2r/3.

Another approach to derive a service curve for queue q_j is to consider that the round-robin scheduler serves packets at peak rate r according to a blind multiplexing strategy across the queues. In that case, we may apply Theorem 6.2.1 (Blind Multiplexing) [3]:

Theorem. Consider a node serving two flows, 1 and 2, with some unknown arbitration between the two flows. Assume that the node guarantees a strict service curve β to the aggregate of the two flows. Assume that flow 2 is α2-smooth. Define β1(t) ≜ [β(t) − α2(t)]^+. If β1 is wide-sense increasing, then it is a service curve for flow 1.

In particular, if β = β_{R,T} and α2 = γ_{ρ2,σ2}, the first flow is guaranteed a service curve β_{R′,T′} with R′ = R − ρ2 and T′ = T + (σ2 + Tρ2)/(R − ρ2) [3]. Let A_j be the set of active queues in the link arbiter of q_j, and B_j ≜ A_j − {q_j}. Application with ρ2 = Σ_{k∈B_j} ρ^k, σ2 = Σ_{k∈B_j} σ^k, R = r and T = 0 yields another service curve β^j = β_{R^j,T^j} for q_j:

R^j = r − Σ_{k∈B_j} ρ^k   and   T^j = (Σ_{k∈B_j} σ^k) / (r − Σ_{k∈B_j} ρ^k)    (5)

In principle, the choice between equations (4) and (5) at each server should be deferred until evaluation. In practice, we heuristically select Eq. (4) when ρ^j ≤ r/n^j, with ρ^j = Σ_{f_i∈F_j} ρ_i the sum of the flow rates inside q_j.
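The heuristic choice between Eq. (4) and Eq. (5) can be sketched as follows; the function and parameter names are ours, with `rho_others`/`sigma_others` holding the aggregates (ρ^k, σ^k) of the other active queues B_j:

```python
from fractions import Fraction as F

def link_arbiter_service(r, lmax, n_active, rho_q, rho_others, sigma_others):
    """Service curve (R^j, T^j) of an active queue q_j: round-robin
    Eq. (4) when the aggregate rate rho_q of q_j fits within r/n^j,
    else the blind-multiplexing residual service of Eq. (5)."""
    if rho_q <= F(r) / n_active:
        return F(r) / n_active, (n_active - 1) * F(lmax) / r  # Eq. (4)
    R = r - sum(rho_others)                                   # Eq. (5)
    return R, sum(sigma_others) / R
```

With r = 1, lmax = 17 and n^j = 2, a queue carrying rate 1/3 gets (1/2, 17) from Eq. (4), while a queue carrying rate 2/3 falls back to the residual service of Eq. (5).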

3.3 Flow Burstiness Parameters

Let lmax be the maximum packet size, which we assume to be the same for all flows. At ingress, whole packets are atomically injected at rate r. Call θ the date when the injection ends (Fig. 7). We have rθ = lmax and lmax ≤ σ_i + ρ_i θ, so:

∀f_i ∈ F: σ_i ≥ σ_i^min ≜ lmax (r − ρ_i) / r    (6)

Then, consider the constraints that the router queues must not overflow, given bmax the size in flits of any queue. Queue q_j receives a leaky-bucket shaped flow γ_{ρ^j,σ^j} with ρ^j = Σ_{f_i∈F_j} ρ_i^j and σ^j = Σ_{f_i∈F_j} σ_i^j, shaped by the link at rate r, so equations (1) and (2) apply for the backlog b^j. The router queues do not overflow if ∀q_j ∈ Q: b^j ≤ bmax, so:

bmax ≥ σ^j + ρ^j T^j   if σ^j ≤ (r − ρ^j)T^j    (7)

bmax ≥ ((r − R^j)/(r − ρ^j)) σ^j + R^j T^j   if σ^j ≥ (r − ρ^j)T^j    (8)
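The ingress bound of Eq. (6) and the overflow check of Eqs. (7)-(8) are direct to evaluate; a minimal sketch, with helper names of our choosing and r = 1 flit/cycle by default:

```python
from fractions import Fraction as F

def sigma_min(lmax, rho_i, r=F(1)):
    """Minimum ingress burstiness of Eq. (6): a whole packet of
    lmax flits is injected atomically at link rate r."""
    return lmax * (r - rho_i) / r

def queue_fits(sigma_j, rho_j, R_j, T_j, bmax, r=F(1)):
    """Overflow check of Eqs. (7)-(8) for a queue of bmax flits."""
    if sigma_j <= (r - rho_j) * T_j:
        backlog = sigma_j + rho_j * T_j                          # Eq. (7)
    else:
        backlog = (r - R_j) / (r - rho_j) * sigma_j + R_j * T_j  # Eq. (8)
    return backlog <= bmax

print(sigma_min(17, F(1, 3)))  # 34/3, the value used in Section 4
```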

We now express the values ρ_i^j and σ_i^j for all flows f_i ∈ F_j of an active queue q_j. Obviously, ρ_i^j = ρ_i, while σ_i^j = σ_i if q_j is the first active queue traversed by the flow. Else, let q_k be the predecessor of q_j in the sequence of active queues traversed by flow f_i, with β_{R^k,T^k} its service curve. When flow f_i traverses queue q_k, its burstiness increases differently depending on whether it is alone or aggregated with other flows in q_k.

If the flow is alone in queue q_k, we apply the classic result of the effects of a rate-latency service curve β_{R,T} on a flow constrained by an affine arrival curve γ_{ρ,σ}. The result is another affine arrival curve γ_{ρ,σ+ρT} [3], so:

σ_i^j = σ_i^k + ρ_i T^k    (9)

Else, we apply Theorem 6.2.2 (Burstiness Increase Due to FIFO Multiplexing, General Case) [3]:

Theorem. Consider a node serving two flows, 1 and 2, in FIFO order. Assume that flow 1 is constrained by one leaky bucket with rate ρ1 and burstiness σ1, and flow 2 is constrained by a sub-additive arrival curve α2. Assume that the node guarantees to the aggregate of the two flows a rate-latency service curve β_{R,T}. Call ρ2 = inf_{t>0} α2(t)/t the maximum sustainable rate for flow 2. If ρ1 + ρ2 < R, then at the output, flow 1 is constrained by γ_{ρ1,b1} with b1 = σ1 + ρ1(T + B/R) and B = sup_{t≥0} [α2(t) + ρ1 t − Rt].

For the application of this theorem, flow 1 is f_i, the flow of interest, and flow 2 is the aggregate F_k − {f_i} of the other flows in q_k, with R = R^k and T = T^k. Let ρ1 = ρ_i, σ1 = σ_i^k, b1 = σ_i^j, ρ2 = Σ_{l∈F_k,l≠i} ρ_l, and σ2 = Σ_{l∈F_k,l≠i} σ_l^k. Because of link shaping in q_k, α2(t) = min(rt, ρ2 t + σ2)1_{t>0}. Let τ2 ≜ σ2/(r − ρ2), so that rτ2 = ρ2 τ2 + σ2. If 0 ≤ t ≤ τ2 then α2(t) = rt, else if t ≥ τ2 then α2(t) = ρ2 t + σ2. From the definition of B:

B = sup( sup_{0≤t≤τ2} (rt + ρ1 t − Rt), sup_{τ2≤t} (ρ2 t + σ2 + ρ1 t − Rt) ) = ((r + ρ1 − R)/(r − ρ2)) σ2

As a result, b1 = σ1 + ρ1(T + σ2(r + ρ1 − R)/(R(r − ρ2))), yielding:

σ_i^j = σ_i^k + ρ_i ( T^k + (Σ_{l∈F_k,l≠i} σ_l^k)(r + ρ_i − R^k) / (R^k (r − Σ_{l∈F_k,l≠i} ρ_l)) )    (10)

By comparison, Corollary 6.2.3 (Burstiness Increase due to FIFO) [3] (Section 3.4) yields b1 = σ1 + ρ1(T + σ2/R), which is worse because ρ1 + ρ2 < R ⇒ r + ρ1 − R < r − ρ2.
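The burstiness propagation rules of Eqs. (9)-(10) can be sketched as a single step function; the names are ours, and `sigma_others`/`rho_others` describe the other flows aggregated in the predecessor queue q_k:

```python
from fractions import Fraction as F

def next_burstiness(sigma_ik, rho_i, R_k, T_k,
                    sigma_others=(), rho_others=(), r=F(1)):
    """Burstiness of flow f_i after traversing queue q_k: Eq. (9)
    when f_i is alone in q_k, Eq. (10) when it is aggregated in
    FIFO with the other flows of F_k."""
    if not sigma_others:
        return sigma_ik + rho_i * T_k                          # Eq. (9)
    sigma2 = sum(sigma_others)
    rho2 = sum(rho_others)
    return sigma_ik + rho_i * (T_k + sigma2 * (r + rho_i - R_k)
                               / (R_k * (r - rho2)))           # Eq. (10)
```

Chaining this function along the sequence of active queues of a flow, in topological order of the feed-forward network, solves the acyclic system of Section 2.1 in one pass.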

3.4 End-to-End Latency Bound

For computing an upper bound on the end-to-end latency of any particular flow f_i, we proceed in three steps. First, compute the left-over (or residual) service curve β_i^j of each active queue q_j traversed by f_i. Second, find the equivalent service curve β_i^* offered by the NoC to f_i through the convolution of the left-over service curves β_i^j. Last, find the end-to-end latency bound by computing d_i^*, the delay between α_i, the arrival curve of flow f_i, and β_i^*. Adding d_i^* to the constant delays of flow f_i, such as the traversal of non-active queues and of other logic and wiring pipeline stages, yields the upper bound. This approach is similar in principle to the Separated Flow Analysis (SFA) [7], even though the latter is formulated in the setting of aggregation under blind multiplexing, while we use FIFO multiplexing.

For the first step, we have two cases to consider at each active queue q_j. Either f_i is the only flow traversing q_j, and β_i^j = β_{R^j,T^j} from equations (4) or (5). Or, f_i is aggregated in q_j with the other flows of F_j. Packets from the flow aggregate F_j are served in FIFO order, so we may apply Corollary 6.2.3 (Burstiness Increase due to FIFO) [3]:

Corollary. Consider a node serving two flows, 1 and 2, in FIFO order. Assume that flow i, i = 1, 2, is constrained by one leaky bucket with rate ρ_i and burstiness σ_i. Assume that the node guarantees to the aggregate of the two flows a rate-latency service curve β_{R,T}. If ρ1 + ρ2 < R, then flow 1 has a service curve equal to the rate-latency function with rate R − ρ2 and latency T + σ2/R, and at the output, flow 1 is constrained by one leaky bucket with rate ρ1 and burstiness b1 = σ1 + ρ1(T + σ2/R).

For the application of this corollary, flow 1 is f_i, the flow of interest, and flow 2 is the aggregate F_j − {f_i} of the other flows in q_j, with R = R^j and T = T^j. Let ρ2 = Σ_{l∈F_j,l≠i} ρ_l, and σ2 = Σ_{l∈F_j,l≠i} σ_l^j. This yields the left-over service curve β_i^j = β_{R_i^j,T_i^j} for an active queue q_j traversed by f_i:

R_i^j = R^j − Σ_{l∈F_j,l≠i} ρ_l   and   T_i^j = T^j + (Σ_{l∈F_j,l≠i} σ_l^j) / R^j    (11)

For the second step, we compute the convolution β_i^* of the left-over service curves β_i^j. Let Q_i denote the set of active queues traversed by flow f_i. Thanks to the properties of rate-latency curves [3], β_i^* is a rate-latency curve whose rate R_i^* is the minimum of the rates and whose latency T_i^* is the sum of the latencies of the left-over service curves β_i^j:

R_i^* = min_{j∈Q_i} R_i^j   and   T_i^* = Σ_{j∈Q_i} T_i^j    (12)

Figure 8: Example of a NoC flow problem.

For the last step, we compute the delay d_i^* between α_i, the arrival curve of flow f_i at ingress, and β_i^*. This flow is injected with rate ρ_i and burstiness σ_i; however, it is subjected to link shaping at rate r as it enters the network. As a result, α_i = min(rt, σ_i + ρ_i t)1_{t>0} and we may apply Eq. (3):

d_i^* = T_i^* + σ_i (r − R_i^*) / (R_i^* (r − ρ_i))    (13)
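The three steps above can be sketched with three small helpers, one per equation; the names are ours, and the curves are represented as (rate, latency) pairs:

```python
from fractions import Fraction as F

def left_over(R_j, T_j, rho_others, sigma_others):
    """Left-over rate-latency curve of an active queue, Eq. (11)."""
    return R_j - sum(rho_others), T_j + sum(sigma_others) / R_j

def end_to_end(curves):
    """(min,+) convolution of rate-latency curves, Eq. (12):
    minimum of the rates, sum of the latencies."""
    return min(R for R, _ in curves), sum(T for _, T in curves)

def flow_delay(sigma_i, rho_i, R_star, T_star, r=F(1)):
    """End-to-end delay bound with ingress link shaping, Eq. (13)."""
    return T_star + sigma_i * (r - R_star) / (R_star * (r - rho_i))
```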

4. APPLICATION EXAMPLE

For the application of the DNC equations, we use the flow problem example illustrated in Fig. 8. There are four flows, f1, f2, f3, f4, with f4 a loop-back flow. Computing the rates using the max-min fairness objective [12] yields ρ1 = 2/3 and ρ2 = ρ3 = ρ4 = 1/3. The maximum packet size lmax is set to 17 flits, corresponding to one flit for the header and the others for the payload. Computing the σ_i^min values according to Eq. (6) yields σ1^min = 17/3 and σ2^min = σ3^min = σ4^min = 34/3.

In the following table, we identify the queues based on the node number and the turn corresponding to the queue. For instance, q2WS identifies the queue in router 2 that buffers packets from link W to link S. We also consider the queues that buffer traffic back to the compute clusters: q10LC and q8LC. Queues that share the same link arbiter are {q2WS, q2LS} for link 2S, {q10NW, q10LW} for link 10W, and {q8EL, q8LL} for link 8L.

     q0LE  q2WS      q2LS      q10NL      q10NW      q10LW      q8EL      q8LL
f1         σ1^{2WS}            σ1^{10NL}
f2                   σ2^{2LS}             σ2^{10NW}             σ2^{8EL}
f3                                                   σ3^{10LW}  σ3^{8EL}
f4                                                                        σ4^{8LL}

Queues q0LE and q10NL are not active, so they can be ignored. For the other queues, we compute their service curves. Queue q2WS is traversed by rate ρ1 = 2/3 > r/n^{2WS} = 1/2, so Eq. (5) for blind multiplexing applies to q2WS. Likewise, queue q8EL is traversed by rate ρ2 + ρ3 = 1/3 + 1/3 = 2/3 > r/n^{8EL} = 1/2, so Eq. (5) applies to q8EL. Conversely, Eq. (4) applies to {q2LS, q10NW, q10LW, q8LL}. This yields:

R^{2WS} = 1 − ρ2      T^{2WS} = σ2 / (1 − ρ2)
R^{2LS} = 1/2         T^{2LS} = lmax
R^{10NW} = 1/2        T^{10NW} = lmax
R^{10LW} = 1/2        T^{10LW} = lmax
R^{8EL} = 1 − ρ4      T^{8EL} = σ4 / (1 − ρ4)
R^{8LL} = 1/2         T^{8LL} = lmax

Next, we express the flow burstiness constraints and relations. In most cases, Eq. (9) applies. The only cases where we need to apply Eq. (10) are for the flows passing through q8EL:

σ1 ≥ σ1^min
σ1^{2WS} = σ1
σ1^{10NL} = σ1^{2WS} + ρ1 T^{2WS}    Eq. (9)
σ1^{10LC} = σ1^{10NL}

σ2 ≥ σ2^min
σ2^{2LS} = σ2
σ2^{10NW} = σ2^{2LS} + ρ2 T^{2LS}    Eq. (9)
σ2^{8EL} = σ2^{10NW} + ρ2 T^{10NW}    Eq. (9)
σ2^{8LC} = σ2^{8EL} + ρ2 (T^{8EL} + σ3^{8EL}(1 + ρ2 − R^{8EL}) / (R^{8EL}(1 − ρ3)))    Eq. (10)

σ3 ≥ σ3^min
σ3^{8EL} = σ3 + ρ3 T^{10LW}    Eq. (9)
σ3^{8LC} = σ3^{8EL} + ρ3 (T^{8EL} + σ2^{8EL}(1 + ρ3 − R^{8EL}) / (R^{8EL}(1 − ρ2)))    Eq. (10)

σ4 ≥ σ4^min
σ4^{8LC} = σ4 + ρ4 T^{8LL}    Eq. (9)

For the delays, we first compute the parameters of the left-over service curves according to Eq. (12):

R1^* = R^{2WS} = 2/3
R2^* = min(R^{2LS}, R^{10NW}, R^{8EL} − ρ3) = 1/3
R3^* = min(R^{10LW}, R^{8EL}) = 1/2
R4^* = R^{8LL} = 1/2

T1^* = T^{2WS} = 17
T2^* = T^{2LS} + T^{10NW} + (T^{8EL} + σ3^{8EL}/R^{8EL}) = 76.5
T3^* = T^{10LW} + T^{8EL} = 42.5
T4^* = T^{8LL} = 17

Here we have R_i^j = R^j except for R_2^{8EL} = R^{8EL} − ρ3. Likewise, T_i^j = T^j except for T_2^{8EL} = T^{8EL} + σ3^{8EL}/R^{8EL}.

Finally, we apply Eq. (13), which yields the delay values:

       f1     f2     f3     f4
d_i^*  25.5   110.5  102    34
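As a cross-check, Eq. (13) can be evaluated directly on the parameters listed above; a sketch for f1 and f2 with σ_i set to σ_i^min (the helper name is ours):

```python
from fractions import Fraction as F

def flow_delay(sigma_i, rho_i, R_star, T_star, r=F(1)):
    """End-to-end delay bound of Eq. (13)."""
    return T_star + sigma_i * (r - R_star) / (R_star * (r - rho_i))

# f1: sigma1 = 17/3, rho1 = 2/3, R1* = 2/3, T1* = 17
print(flow_delay(F(17, 3), F(2, 3), F(2, 3), F(17)))      # 51/2 = 25.5
# f2: sigma2 = 34/3, rho2 = 1/3, R2* = 1/3, T2* = 76.5
print(flow_delay(F(34, 3), F(1, 3), F(1, 3), F(153, 2)))  # 221/2 = 110.5
```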

5. SUMMARY AND CONCLUSIONS

We apply Deterministic Network Calculus (DNC) to the network-on-chip (NoC) of the Kalray MPPA-256 Bostan processor, in order to ensure quality of service to the endpoint tasks. The MPPA NoC is an RDMA-capable network, which can be configured to bound each flow injection rate and burstiness, that is, the parameters of a leaky-bucket arrival curve. We assume that flow paths between endpoints have been selected, and that flow rates are set to the maximum given the link capacity bounds. This starting point can be obtained as the solution of a max-min fairness routing problem with unsplittable paths, which is a standard technique for engineering elastic traffic in macro-networks.

Based on classic DNC results, we develop our contributions in four areas: modeling the effects of traffic shaping by the peak rate of the NoC links; refinement of the service curves of the router queues, using either round-robin scheduling or blind multiplexing; formulation of the burstiness constraints and their propagation across the network; and computation of upper bounds on the flow end-to-end latencies, based on the principles of separated flow analysis. Thanks to the fact that the flow rates are already computed, the problem of determining the burstiness parameters can be formulated and solved as an acyclic set of linear inequalities.

6. ACKNOWLEDGMENT

This work was supported by the French DGE and Bpifrance through the "Investissements d'Avenir" program CAPACITES.

7. REFERENCES

[1] S. Saidi, R. Ernst, S. Uhrig, H. Theiling, and B. D. de Dinechin, "The Shift to Multicores in Real-time and Safety-critical Systems," in Proc. of the 10th Inter. Conference on Hardware/Software Codesign and System Synthesis, ser. CODES '15, 2015, pp. 220-229.

[2] Q. Perret, P. Maurère, É. Noulard, C. Pagetti, P. Sainrat, and B. Triquet, "Mapping hard real-time applications on many-core processors," in Proc. of the 24th Inter. Conference on Real-Time Networks and Systems, RTNS 2016, Brest, France, October 19-21, 2016, pp. 235-244.

[3] J.-Y. Le Boudec and P. Thiran, Network Calculus: A Theory of Deterministic Queuing Systems for the Internet. Berlin, Heidelberg: Springer-Verlag, 2012.

[4] Z. Lu, M. Millberg, A. Jantsch, A. Bruce, P. van der Wolf, and T. Henriksson, "Flow Regulation for On-Chip Communication," in Proc. of the Conference on Design, Automation and Test in Europe, ser. DATE '09, 2009, pp. 578-581.

[5] Y. Durand, C. Bernard, and F. Clermidy, "Distributed Dynamic Rate Adaptation on a Network on Chip with Traffic Distortion," in 10th IEEE Inter. Symposium on Embedded Multicore/Many-core Systems-on-Chip, MCSOC 2016, Lyon, France, September 21-23, 2016, pp. 225-232.

[6] M. Boyer and C. Fraboul, "Tightening end to end delay upper bound for AFDX network calculus with rate latency FCFS servers using network calculus," in IEEE Inter. Workshop on Factory Communication Systems (WFCS), Dresden, Germany. IEEE, May 2008, pp. 11-20.

[7] A. Bouillard and G. Stea, "Worst-Case Analysis of Tandem Queueing Systems Using Network Calculus," in Quantitative Assessments of Distributed Systems, Bruneo and Distefano, Eds., 2015.

[8] L. Lenzini, L. Martorini, E. Mingozzi, and G. Stea, "Tight end-to-end per-flow delay bounds in FIFO multiplexing sink-tree networks," Perform. Eval., vol. 63, no. 9-10, pp. 956-987, 2006.

[9] B. Dupont de Dinechin, Y. Durand, D. van Amstel, and A. Ghiti, "Guaranteed Services of the NoC of a Manycore Processor," in Proc. of the 2014 Inter. Workshop on Network on Chip Architectures, ser. NoCArc '14, 2014, pp. 11-16.

[10] W. J. Dally and C. L. Seitz, "Deadlock-Free Message Routing in Multiprocessor Interconnection Networks," IEEE Trans. Comput., vol. 36, no. 5, pp. 547-553, May 1987.

[11] P. Bahrebar and D. Stroobandt, "The Hamiltonian-based Odd-even Turn Model for Maximally Adaptive Routing in 2D Mesh Networks-on-chip," Comput. Electr. Eng., vol. 45, no. C, pp. 386-401, Jul. 2015.

[12] S. Chen and K. Nahrstedt, "Maxmin Fair Routing in Connection-Oriented Networks," in Proc. of Euro-Parallel and Distributed Systems Conference (Euro-PDS '98), 1998, pp. 163-168.