In Proceedings of Asia South Pacific Design Automation Conference 2004

Yokohama, Japan, January 2004

Rate Analysis for Streaming Applications with On-chip Buffer Constraints

Alexander Maxiaguine

ETH Zürich

maxiagui@tik.ee.ethz.ch

Simon Künzli

ETH Zürich

kuenzli@tik.ee.ethz.ch

Samarjit Chakraborty

National University of Singapore

samarjit@comp.nus.edu.sg

Lothar Thiele

ETH Zürich

thiele@tik.ee.ethz.ch

Abstract— While mapping a streaming (such as multimedia or

network packet processing) application onto a specified architec-

ture, an important issue is to determine the input stream rates

that can be supported by the architecture for any given map-

ping. This is subject to typical constraints such as on-chip buffers

should not overflow, and specified playout buffers (which feed au-

dio or video devices) should not underflow, so that the quality

of the audio/video output is maintained. The main difficulty in

this problem arises from the high variability in execution times of

stream processing algorithms, coupled with the bursty nature of

the streams to be processed. In this paper we present a mathe-

matical framework for such a rate analysis for streaming applica-

tions, and illustrate its feasibility through a detailed case study of

an MPEG-2 decoder application. When integrated into a tool for

automated design-space exploration, such an analysis can be used

for fast performance evaluation of different stream processing ar-

chitectures.

I. INTRODUCTION

Lately, there has been a tremendous increase in portable and

mobile devices running algorithms for processing streams of

audio and video data, and sometimes network packets. These

include hand-held computers and mobile phones, and it is

expected that their usage will increase even more in the fu-

ture. Such devices typically have very stringent constraints

pertaining to cost, size, and power consumption, and have

posed several challenges towards developing appropriate models,

methodologies, languages and tools for designing them

(for example, see [10, 19, 20]).

The architecture of such devices typically consists of mul-

tiple processing elements (PEs) onto which parts of an appli-

cation are mapped, and they are integrated on a single chip

following a system-on-a-chip (SoC) design paradigm. In this

setup, a system-level view of stream processing is as follows:

the input stream enters a PE, gets processed by a function or

algorithm implemented on this PE, and the processed stream

enters another PE for further processing. Between two such

PEs there is a buffer which stores the intermediate stream. Fi-

nally, the fully processed stream emerges out of a PE and gets

stored in a playout buffer which feeds some real-time client

such as an audio or video output device. The process of map-

ping a stream processing application onto such a target archi-

tecture gives rise to the problem of determining the range of

input stream rates that can be supported by the architecture for

a given mapping. Any feasible implementation, or mapping, of

an algorithm onto an architecture is subject to constraints such

as (i) the buffers between any two PEs should not overflow,

and (ii) the playout buffer, which is read by the real-time client

at some specified rate (depending on the quality of the audio

or video output required) should not underflow at any point

in time. Determining the range of feasible input stream rates,

subject to the above constraints, is difficult because of two main

reasons. Firstly, there is a high data-dependent variability in

the execution time of many stream processing algorithms, because

it depends on the properties of the particular audio/video

sample being processed. Secondly, the input streams them-

selves tend to be bursty in nature. These two factors coupled

together can result in increasing the burstiness of the stream

coming out of a PE, thereby necessitating a large amount of

on-chip buffer space for its storage. Here, it may be noted that

in contrast to the simple setup described above, there might be

multiple streams being processed by a PE, where the different

streams are processed by different functions—all of which are

implemented on the same PE. The burstiness of the outgoing

processed streams in such cases would additionally depend on

the scheduling policy used to schedule them on the PE [16].

The importance of the above problem of rate analysis stems

from the fact that on-chip buffers are available only at a

premium, because of their large area requirements (see [9]).

Therefore, when mapping a streaming application onto a spec-

ified architecture, it is necessary to accurately identify the fea-

sible range of input stream rates (and bursts) that can be sup-

ported by the available on-chip buffers. This also includes the

minimum rate that should be maintained to ensure the quality

of the audio/video output.

A. Our results and relation to previous work

There is a large body of work on on-chip traffic analysis and

SoC communication architecture design (see [13, 14] and the

references therein) which is relevant to the problem addressed

in this paper. However, most of this relies on simulation based

approaches. In the context of our problem, it typically requires

several hours to simulate a few minutes of audio/video data for

any reasonably detailed processor model. Further, simulation

based approaches often fail to accurately characterize the al-

lowed input rates and burst ranges, and strongly depend on the

audio or video data used, which itself is difficult to select.

In this paper we present a mathematical framework for rate

analysis for streaming applications, with the aim of overcom-

ing the main problems associated with simulation based meth-

ods. For the sake of generality, we consider any stream to be

made up of a potentially infinite sequence of stream objects,

where a stream object might be a macroblock, a video frame,

an audio sample, or a network packet, depending on the ap-

plication in question. Given a specification of the architecture

along with the different buffer sizes, the scheduling policies

implemented at the different PEs, the execution requirement

per stream object of each processing function implemented on

a PE, and the output rate required to drive the real-time client

(such as the audio or video terminal), the proposed framework

can be used to compute the minimum and maximum rate of

the input stream. Here, rate refers to the precise characteri-

zation of the stream—including the allowed burst range/jitter

and the long-term arrival rate—which are described more pre-

cisely in Section II. Any input stream whose rate is in between

the computed minimum and maximum rates is guaranteed to

satisfy all the constraints pertaining to buffer overflow and un-

derflow. We also substantiate these theoretical results through

a detailed case study of mapping an MPEG-2 decoder application

on a specified architecture.

The proposed framework can be used to drive a system-level

design space exploration where different possible mappings of

a streaming application onto an architecture with fixed buffers

need to be evaluated. It can also be used for optimal buffer

sizing in an architecture, which is of crucial importance due to

the high space requirements of on-chip buffers.

Our framework is based on the theory of Network Calculus [4], which extends the concept of service curves proposed in [7, 8] by placing it in an algebraic setting developed in [2]. Although Network Calculus was originally developed, and is still being largely used, in the domain of communication networks, very recently it was used to analyse SoC architectures in the context of network processors [6, 18]. This work was further extended in [5, 11]. Our work in this paper follows this line of development. On an abstract level, all the previous papers considered the problem: given an input stream and the scheduling policy at each PE, what is the worst-case buffer requirement and what is the nature of the output stream? However, the problem addressed in this paper is the “reverse problem”, where the output stream and the buffer size are given and the nature of the input stream needs to be computed. It turns out that none of the previous results can be extended to address this question, and a more elaborate theory based on [2] is required. Our work is also related to the problem of multimedia smoothing in the domain of communication networks [4], which addresses the problems of shaping an input stream to meet buffer constraints and that of computing the optimal playback delay or buffering time to maintain quality of service. It may be noted here that the main differences between the domains of communication networks and on-chip communication in SoC architectures are that in the former case (i) buffers are more readily available since there are no space constraints, (ii) packet dropping is a feasible option, which might not be possible in the latter case due to power/performance constraints, and (iii) shaping can be employed to reduce buffer consumption, which might be too costly to employ in the latter case. Lastly, related to this paper is also the work in [20], which proposes that on-chip traffic for multimedia applications exhibits self-similarity, and uses this property for optimal buffer sizing.

The problem is formally defined in the next section, fol-

lowed by the details of the proposed framework. In Section III

we present the case study of mapping an MPEG-2 decoder on a

specified architecture. Finally, in Section IV we outline some

of the possible directions in which this work may be extended.

Fig. 1. A node with a processing element and an internal buffer of size b,

processing an input stream and feeding the processed stream into a playout

buffer of size B.

II. RATE ANALYSIS WITH BUFFER CONSTRAINTS

In this section we first state the problem definition, followed

by some notation and then the case of a single PE with a play-

out buffer. This is then extended to consider the case of a

stream passing through multiple PEs.

As mentioned in the last section, a stream is processed by

multiple nodes, where each node consists of a PE and an in-

ternal buffer. Let us first consider the last node in the path

of a stream, which feeds the processed stream into a playout

buffer of size B, as shown in Figure 1. Let this node consist

of a PE and an internal buffer of size b. The playout buffer

is read by a real-time client such as an audio/video output de-

vice, at some specified rate. Let the input stream entering the

node be denoted by x(t), where x(t) denotes the number of

stream objects that arrived during the time interval [0,t]. The

PE provides a guaranteed service β(∆) of the following form:

within any time interval of length ∆, it will be able to process

at least β(∆) number of input stream objects. The function

β therefore provides a lower bound on the service provided

by the PE, and is determined by the time required to process

each stream object and the scheduling policy implemented at

the PE (in case multiple streams or other tasks are also being

processed by it). Let us denote the processed output stream

entering the playout buffer by y(t), which (like x(t)) denotes

the number of stream objects coming out during the time interval

[0,t]. The real-time client consumes stream objects from

the playout buffer at a rate C(t), which denotes the number of

stream objects consumed within the time interval [0,t].

Therefore, x, y and C are functions denoting cumulative

values, while the function β denotes values over time interval

lengths and is referred to as a service curve [7]. Throughout

this paper, we assume all functions f to be wide-sense increas-

ing (which means f(s) ≤ f(t), ∀s ≤ t), and f(t) = 0 for

t ≤ 0. Now, given β, C, and the buffer sizes b and B, the

problem is to compute the function (or set of possible func-

tions) x(t), such that (i) the playout buffer does not overflow,

(ii) it does not underflow, and (iii) the internal buffer at the

node does not overflow. These constraints are subject to the

real-time client consuming stream objects at the specified rate

C(t) and the processing element providing a guaranteed ser-

vice β. The version of the problem with multiple PEs is a

simple extension of this, and is stated later.

As mentioned before, the constraint on playout buffer un-

derflow is to maintain the quality of the audio/video output.

The constraints on buffer overflow are motivated by the fact

that typically static scheduling policies are implemented on the

PEs (for simplicity), and hence checking buffer fill-levels and

stalling a processor in case an output buffer is full cannot be

easily implemented.

A. Notation

For any two functions f and g, the min-plus convolution of f and g is given by:

$$(f \otimes g)(t) = \inf_{s:\, 0 \le s \le t} \{ f(t - s) + g(s) \}.$$

The min-plus deconvolution of f and g is given by:

$$(f \oslash g)(t) = \sup_{u:\, u \ge 0} \{ f(t + u) - g(u) \}.$$

We use $f \wedge g$ to denote the infimum of f and g, or the minimum if it exists, and $f \vee g$ to denote the supremum of f and g, or the maximum if it exists.
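On sampled data the two operators can be evaluated directly. The following is a minimal Python sketch, assuming that each function is stored as a list of samples at t = 0, 1, ..., T-1 and that the supremum in the deconvolution is truncated to the sampled horizon (an approximation that the text itself does not make). The function names are illustrative only.

```python
def minplus_conv(f, g):
    """Min-plus convolution: min over 0 <= s <= t of f[t - s] + g[s]."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def minplus_deconv(f, g):
    """Min-plus deconvolution: max over u >= 0 of f[t + u] - g[u].

    Assumption: u is limited to the sampled horizon, so values near the
    end of the list are only lower estimates of the true supremum.
    """
    n = min(len(f), len(g))
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]


f = [0, 2, 4, 6]
g = [0, 1, 2, 3]
print(minplus_conv(f, g))    # [0, 1, 2, 3]
print(minplus_deconv(f, g))  # [3, 4, 5, 6]
```

Both routines are quadratic in the number of samples, which is usually acceptable for short horizons.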

B. Buffer underflow and overflow constraints

Following our problem description, the constraint on the playout buffer underflow can be stated as:

$$y(t) \ge C(t) \quad \forall t \ge 0 \tag{1}$$

Since the PE provides a service guarantee of $\beta$, it can be shown that $y(t) \ge (x \otimes \beta)(t)$ (see [4] for details). Hence, the minimum value of $y(t)$ is equal to $(x \otimes \beta)(t)$ and the constraint given by Eqn. (1) can be reformulated as:

$$(x \otimes \beta)(t) \ge C(t) \quad \forall t \ge 0 \tag{2}$$

It can be shown that for any functions f, g and h, $g \otimes h \ge f$ if and only if $h \ge f \oslash g$. Using this, Eqn. (2) can be stated as:

$$x(t) \ge (C \oslash \beta)(t) \quad \forall t \ge 0 \tag{3}$$

Similarly, the constraint on the playout buffer overflow can be stated as: $y(t) - C(t) \le B$, $\forall t \ge 0$. Since $y(t)$ can be as large as $x(t)$ but not larger, this constraint can be reformulated as:

$$x(t) \le C(t) + B \quad \forall t \ge 0 \tag{4}$$

Finally, the constraint that the internal buffer at the node should not overflow is given by: $x(t) - y(t) \le b$, $\forall t \ge 0$. Since $y(t) \ge (x \otimes \beta)(t)$, the minimum value of $y(t)$ is $(x \otimes \beta)(t)$ and the above constraint, as before, can be formulated as:

$$x(t) \le (x \otimes \beta)(t) + b \quad \forall t \ge 0 \tag{5}$$

Eqns. (3), (4) and (5) therefore state all the constraints that the input stream x(t) is required to satisfy.
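The constraints of this subsection can be checked numerically for a candidate input stream. The sketch below is illustrative and assumes that x, β and C are sampled at t = 0, 1, ..., T-1. It tests the underflow condition in the form of Eqn. (2) rather than Eqn. (3), because the convolution is exact on a finite horizon while the deconvolution would have to be truncated. The function and key names are hypothetical.

```python
def minplus_conv(f, g):
    """Min-plus convolution on sampled functions."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def check_constraints(x, beta, C, B, b):
    """Check Eqns. (2), (4) and (5) at every sample of the horizon."""
    worst_case_y = minplus_conv(x, beta)  # guaranteed lower bound on y(t)
    return {
        # Eqn. (2): the guaranteed output keeps up with the client.
        "no_playout_underflow": all(y >= c for y, c in zip(worst_case_y, C)),
        # Eqn. (4): even if y(t) = x(t), the playout buffer holds the backlog.
        "no_playout_overflow": all(xi <= c + B for xi, c in zip(x, C)),
        # Eqn. (5): the internal buffer holds the unprocessed objects.
        "no_internal_overflow": all(
            xi <= y + b for xi, y in zip(x, worst_case_y)
        ),
    }


beta = [0, 2, 4, 6, 8, 10]   # service of 2 objects per tick
C = [0, 0, 0, 1, 2, 3]       # client starts after 2 ticks, 1 object per tick
x = [0, 1, 2, 3, 4, 5]       # candidate input, 1 object per tick
print(check_constraints(x, beta, C, B=3, b=1))
```

With these toy values all three checks pass; reducing B to 1 makes the playout overflow check fail at t = 2.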

C. Computing bounds on x(t)

Eqns. (4) and (5) can be combined and stated as follows:

$$x(t) \le (C(t) + B) \wedge ((x \otimes \beta)(t) + b) \quad \forall t \ge 0$$

Let $x_{\max}(t)$ be the maximum value of $x(t)$ which satisfies the above inequality. This inequality is of the form:

$$x \le (C + B) \wedge \Pi(x) \tag{6}$$

where x and C are functions and $\Pi$ is an operator given by $\Pi(x) = (x \otimes \beta) + b$. It follows from [4] (see also [2] for further details) that the maximum solution which satisfies Eqn. (6) is given by $x_{\max}(t) = \overline{\Pi}(C(t) + B)$, where $\overline{\Pi}$ is the sub-additive closure of $\Pi$ and is defined as

$$\overline{\Pi}(x) = x \wedge \Pi(x) \wedge \Pi(\Pi(x)) \wedge \ldots$$

Since $\Pi(x) = (x \otimes \beta) + b$, it follows that:

$$\overline{\Pi}(x) = x \wedge (x \otimes \beta + b) \wedge (x \otimes \beta \otimes \beta + 2b) \wedge \ldots$$

or, $\overline{\Pi}(x) = x \wedge \inf_{n \ge 1} \{ x \otimes \beta^{(n)} + nb \}$, where $\beta^{(n)}$ is the n-th self-convolution of $\beta$. Now, it is known that for any functions f, g and h, $(f \wedge g) \otimes h = (f \otimes h) \wedge (g \otimes h)$ [4]. Using this result, it follows that $\overline{\Pi}(x) = (x \otimes \delta_0) \wedge (x \otimes \inf_{n \ge 1} \{ \beta^{(n)} + nb \})$, where $\delta_0$ is a function defined as $\delta_0(t) = +\infty$ for all $t > 0$, and $\delta_0(t) = 0$ for all $t \le 0$. Hence, $\overline{\Pi}(x) = x \otimes \inf_{n \ge 0} \{ \beta^{(n)} + nb \}$ since, for any function f, by convention $f^{(0)} = \delta_0$ (see [4]).

The sub-additive closure of any function f, denoted by $\overline{f}$, is defined as $\overline{f} = \inf_{n \ge 0} \{ f^{(n)} \}$ (which is similar to the sub-additive closure of an operator as described above). Hence, it follows that $\overline{\Pi}(x) = x \otimes \overline{(\beta + b)}$ and therefore,

$$x_{\max}(t) = (C(t) + B) \otimes \overline{(\beta(t) + b)} \tag{7}$$

Similarly, to obtain a lower bound on $x(t)$, we recast Eqn. (5) as follows: $x(t) \ge (x(t) - b) \oslash \beta(t)$. By combining this with Eqn. (3), we obtain: $x(t) \ge (C \oslash \beta)(t) \vee (x(t) - b) \oslash \beta(t)$. This is of the form:

$$x \ge (C \oslash \beta) \vee \Gamma(x) \tag{8}$$

where $\Gamma$ is an operator given by $\Gamma(x) = (x - b) \oslash \beta$. Using a result [4] analogous to the existence of the maximum solution to Eqn. (6), it follows that Eqn. (8) has one minimum solution which is given by:

$$x_{\min}(t) = \overline{\Gamma}(C(t) \oslash \beta(t)) \tag{9}$$

where $\overline{\Gamma}$ is the super-additive closure of $\Gamma$ and is defined as

$$\overline{\Gamma}(x) = x \vee \Gamma(x) \vee \Gamma(\Gamma(x)) \vee \ldots$$

Unlike Eqn. (7), Eqn. (9) unfortunately does not give a closed-form solution for $x_{\min}(t)$, which must be computed iteratively for any given problem instance. Eqns. (7) and (9) therefore give upper and lower bounds on the function x(t), and this is summarized in the following theorem, which is the main result of this paper.
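The upper bound of Eqn. (7) can be evaluated on sampled data by building the sub-additive closure of β + b from repeated self-convolutions. The following is a sketch under stated assumptions, not the authors' Mathematica implementation: functions are lists of samples at t = 0, 1, ..., T-1, the offsets B and b are added at every sample including t = 0, and b > 0 so that the iteration settles after finitely many steps. All names are illustrative.

```python
INF = float("inf")


def minplus_conv(f, g):
    """Min-plus convolution on sampled functions."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]


def subadditive_closure(g, max_iter=10_000):
    """Pointwise minimum of delta_0, g, g (x) g, ... on the sampled horizon."""
    delta0 = [0] + [INF] * (len(g) - 1)   # g^(0) = delta_0 by convention
    h = delta0
    for _ in range(max_iter):
        # h_{k+1} = delta_0 ^ (h_k (x) g) accumulates one more self-convolution.
        nxt = [min(d, s) for d, s in zip(delta0, minplus_conv(h, g))]
        if nxt == h:                      # fixed point reached
            return h
        h = nxt
    raise RuntimeError("closure did not converge within max_iter")


def x_max(C, beta, B, b):
    """Eqn. (7): (C + B) convolved with the closure of (beta + b)."""
    closure = subadditive_closure([v + b for v in beta])
    return minplus_conv([c + B for c in C], closure)


beta = [0, 1, 2, 3, 4]     # service of 1 object per tick
C = [0, 0, 4, 8, 12]       # toy consumption curve
print(x_max(C, beta, B=2, b=1))  # [2, 2, 4, 5, 6]
```

In this toy instance the bound follows C(t) + B at first and is then limited by the internal buffer term, which is the interplay that Eqn. (7) captures.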

Theorem 1 Any non-decreasing function x(t) which satisfies the inequality xmin(t) ≤ x(t) ≤ xmax(t), ∀t ≥ 0, respects both the buffer overflow and the playout buffer underflow constraints, where xmax and xmin are computed using Eqns. (7) and (9), respectively.
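Since Eqn. (9) has no closed form, the lower bound has to be obtained by iterating the operator Γ until the result stops changing. The sketch below illustrates one way to do this on sampled data. It assumes samples at t = 0, 1, ..., T-1, a deconvolution truncated to that horizon (so values near the end of the horizon are only estimates), and b > 0. The names are hypothetical and the code is not the authors' implementation.

```python
def minplus_deconv(f, g):
    """Min-plus deconvolution, truncated to the sampled horizon."""
    n = min(len(f), len(g))
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]


def gamma(x, beta, b):
    """Operator Gamma(x) = (x - b) deconvolved with beta."""
    return minplus_deconv([v - b for v in x], beta)


def x_min(C, beta, b, max_iter=10_000):
    """Eqn. (9): super-additive closure of Gamma applied to C (/) beta."""
    x0 = minplus_deconv(C, beta)
    z = x0
    for _ in range(max_iter):
        # z_{k+1} = x0 v Gamma(z_k) accumulates x0, Gamma(x0), Gamma(Gamma(x0)), ...
        nxt = [max(a, c) for a, c in zip(x0, gamma(z, beta, b))]
        if nxt == z:                      # fixed point reached
            return z
        z = nxt
    raise RuntimeError("x_min did not converge within max_iter")


beta = [0, 0, 1, 2, 3]     # one tick of latency, then 1 object per tick
C = [0, 0, 1, 2, 3]        # client starts after one tick
print(x_min(C, beta, b=1))  # [0, 1, 2, 3, 3]
```

The result shows the input running one tick ahead of the consumption curve to cover the service latency; the last sample is flattened only because of the truncated horizon.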

D. The case of multiple processing elements

PEs in the path of a stream, other than those considered in the preceding subsections (i.e. those which do not feed their output into a playout buffer), process an input stream x'(t), and the output y'(t) is fed into another PE for further processing. The only constraint for any such PE is that the associated internal buffer should not overflow. The required output y'(t) for such a PE is determined from the input x(t) of the PE that it feeds. Following this composition scheme, we first fix an input x(t) of the PE which feeds the playout buffer (where xmin(t) ≤ x(t) ≤ xmax(t) from the previous subsection). This x(t) is the required output y'(t) of the PE immediately upstream, for which we compute bounds x'max and x'min (using similar techniques as described above) and choose some x'(t) lying in between these bounds. This is the required output of the next PE upstream, and this process is followed until we compute x'(t) (or bounds on it) for the input stream entering the first PE in the path of the stream. Any input stream conforming to this computed value is guaranteed to respect all the buffer constraints.

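[Figure 2 diagram: tasks VLD and IQ mapped onto PE1; tasks IDCT, MP and MC mapped onto PE2; buffers B1, B2 and Bout; video output port VOUT; stream labels x(t) and C(t).]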

Fig. 2. Mapping the MPEG-2 decoder application onto a multiprocessor SoC.

III. RATE ANALYSIS FOR AN MPEG-2 DECODER

In this section we apply the rate analysis methodology developed in the previous section to study the mapping of an MPEG-2 decoder [3] application onto a multiprocessor SoC architecture with fixed buffers. By comparing the results from our mathematical framework with those obtained from a system simulator, we show that our framework is able to provide useful bounds on the allowed rates of the input stream. For any input sequence x(t) and the computed bounds xmax and xmin, the results obtained by our framework were in conformance with the simulation results in terms of predicting buffer overflow, underflow, and all cases where the buffer constraints are satisfied.

A. The MPEG-2 decoder application

Our target architecture consists of a number of PEs interconnected

by an on-chip communication network. This network

can be seen as a system of point-to-point buffered FIFO chan-

nels of limited capacity. The PEs exchange packetized data by

writing/reading to/from these channels.

The part of the architecture onto which the MPEG-2 decoder application is mapped, along with the mapping of the application's task graph on it, is shown in Figure 2. In this figure, PE1 and PE2 are programmable processors and VOUT denotes a video output port.

The task graph of the MPEG-2 decoder includes several

tasks such as variable length decoding (VLD), inverse quan-

tization (IQ), inverse discrete cosine transform (IDCT), for-

mation of motion predictors (MP) and motion compensation

(MC). Based on profiling information, we partitioned this set

of tasks into two subsets, with one being executed on PE1,

and the other on PE2.

A compressed video bit stream arrives into a buffer B1 (as shown in Figure 2) and is processed by the VLD and the IQ tasks running on PE1. Decompressed macroblocks are written into the buffer B2, which is read by PE2 for further processing. Note that the data exchange between PE1 and PE2 can be seen as a single stream of packets with each packet encapsulating IDCT coefficients and motion vectors for one macroblock. This information is processed by the IDCT, MP and MC tasks mapped onto PE2. Finally, the decoded video samples are written (one macroblock at a time) into the playout buffer Bout.

Bout is read by an output process located in the video port, which reads one macroblock at a time, at a constant rate that is determined by the frame rate and the resolution of the decoded video stream. The constraints associated with Bout are that it should never be empty when the output process attempts to read from it, and it should also not overflow when PE2 writes into it. We do not want to adopt the option of stalling PE2 when Bout is full, so that simple static scheduling algorithms can be used on PE2 when multiple streams are processed by it. Additionally, we also require the buffers B1 and B2 not to overflow.

Satisfying all the above conditions simultaneously is difficult, because tasks executing on PE1 produce bursty, time-varying traffic on its output. This is mainly due to the fact that the execution time of the VLD task exhibits high variability, which depends on the structure of the compressed stream and the properties of the encoded video information (see also [20]).

Using the proposed rate analysis technique, we can compute upper and lower bounds on the macroblock output rate that is to be satisfied by the processor PE1. Any macroblock output stream from PE1 which conforms to these bounds is guaranteed not to overflow the buffers B2 and Bout and also not to underflow the buffer Bout. Depending on such a chosen output rate of PE1, upper and lower bounds on the input rate to the buffer B1 can also be computed, as discussed in the previous section.

B. Analytical results

Due to space restrictions, we will only concentrate on buffer overflow and underflow constraints associated with B2 and Bout and compute bounds that the macroblock output stream coming out of PE1 is required to satisfy. Following the notation of Section II (see also Figure 1), the output process located in the video port is a real-time client characterized by a cumulative rate function C(t). Bout is the playout buffer of size B. The processing capacity of PE2 is characterized by a guaranteed service of at least β(t). The buffer B2 corresponds to the internal buffer of PE2 with size b. Let x(t) be the cumulative rate function of any macroblock stream on the output of PE1. The curves denoted by xmax(t) and xmin(t), which are computed analytically using the framework developed in Section II, are the bounds to which the macroblock stream x(t) on PE1's output has to conform in order to guarantee that buffers B2 and Bout do not overflow and Bout does not underflow.

C(t) can be derived from the parameters of the video stream to be decoded, and is given as follows:

$$C(t) = \begin{cases} 0 & \text{if } t \le t_0 \\ N F t & \text{if } t > t_0 \end{cases}$$

where N is the number of macroblocks per video frame, F is the video frame rate and t0 is the time, starting from which the real-time client starts reading data out of the playout buffer. t0 can be referred to as the playback delay or buffering time. All video streams used in our experiments have F = 25 frames per second and N = 1620. The playback delay t0, in general, can be chosen arbitrarily. We have set t0 to be equal to the time required by the real-time client to read half a frame from the playout buffer.
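The numbers quoted above fix the consumption rate and the playback delay. The short Python sketch below works them out; the cross-check of N assumes 16x16-pixel macroblocks and uses the 720x576 resolution stated in the simulation setup, and the variable names are illustrative.

```python
N = 1620   # macroblocks per video frame (from the text)
F = 25     # video frame rate in frames per second (from the text)

rate = N * F                 # macroblocks consumed per second by the client
t0 = (N / 2) / rate          # time to read half a frame, in seconds

# Cross-check of N, assuming 16x16-pixel macroblocks at 720x576 pixels.
mb_per_frame = (720 // 16) * (576 // 16)

print(rate)          # 40500
print(t0)            # playback delay in seconds (about 20 ms)
print(mb_per_frame)  # 1620
```

The playback delay therefore equals half a frame period, 1/(2F), independently of N.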

[Figure 3 plot: curves xmax(t) and xmin(t); horizontal axis t [ms], vertical axis # macroblocks.]

Fig. 3. The analytical bounds xmin(t) and xmax(t) computed for a system configuration with B = 2430 macroblocks.

A system configuration is defined by a set of parameters for β(t), B, and b. In our experiments we varied the size of the playout buffer B. Assuming that the processor PE2 is unloaded and hence its entire capacity is available for processing the IDCT, MP and MC tasks mapped onto it, the service curve β(t) was modeled by a straight line. The slope of this line was set to be equal to the long-term average macroblock production rate on the output of PE1 (which was measured using the system simulator described in the next subsection). This was based on the assumption that in the long term, PE2 has sufficient capacity to process all the incoming macroblocks. B2 was always set to a fixed size of b = 500 macroblocks. Figure 3 shows the resulting bounds xmin(t) and xmax(t) computed by implementing Eqns. (7) and (9) of Section II in Mathematica (from Wolfram Research), with the system configuration and β(t) described above and the playout buffer size B set to 2430 macroblocks.
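For a straight-line service curve, the sub-additive closure in Eqn. (7) can be written out explicitly. The derivation below is a sketch: the symbol r for the slope of β is introduced here for illustration only (the text gives no numerical value for it), and b > 0 is assumed.

```latex
\begin{align*}
\beta(t) &= r\,t, \qquad t \ge 0 \\
(\beta + b)^{(n)}(t) &= r\,t + n\,b, \qquad n \ge 1 \\
\overline{(\beta + b)}(t) &= \inf_{n \ge 0} (\beta + b)^{(n)}(t)
  = \begin{cases} 0 & \text{if } t \le 0 \\ r\,t + b & \text{if } t > 0 \end{cases} \\
x_{\max}(t) &= \bigl(C(t) + B\bigr) \wedge
  \inf_{0 < s \le t} \bigl\{ C(t - s) + B + r\,s + b \bigr\}
\end{align*}
```

The second line holds because convolving two lines of equal slope adds their offsets, and the case t ≤ 0 in the third line comes from the convention that the zeroth self-convolution is δ0.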

C. Simulation Setup

We performed simulations of the MPEG-2 decoder applica-

tion (shown in Figure 2) using a transaction level model of the

system architecture (see [12]). The system model was writ-

ten in SystemC [17], and the models of the programmable

PEs were based on the Sim-Profile configuration of the SimpleScalar

[1] instruction set simulator. Both PE1 and PE2

had a RISC core (similar to the MIPS3000 processor) augmented

with MPEG-2 specific hardware accelerators. PE1

was enhanced with bit-stream access operations, while PE2

had special support for application kernels such as IDCT,

Add Clip and Block Average and could prefetch memory in a

special video-block mode. Floating point operations were not

used on either of the PEs. The implementation of the MPEG-2

decoder was based on the source code available from [15].

We simulated the decoding of several MPEG–2 video clips.

All video clips had parameters as described in the previous

subsection. The clips were encoded using a constant bit rate of

9.78 Mb/s and a resolution of 720x576 pixels (typically used

in DVD applications). A selection of the simulated scenarios

(with each scenario being a combination of a video clip and

the playout buffer size), representing corner cases for our ex-

periments, is summarized in Table I. Video sequence A cor-

responds to a video clip with global motion, whereas video

sequence B corresponds to a video clip with moving objects

and still background and video sequence C represents a still

picture.

For all the simulation scenarios listed in Table I, using our

simulation setup we measured x(t) at the output of PE1 and

the maximum and minimum fill levels of B2 and the playout

buffer Bout.

D. Comparing the analytical bounds with simulation

In this subsection we evaluate the usefulness of the analytical bounds on the input rate x(t), by comparing the predictions on buffer overflow/underflow/conformance deduced from the analytical framework, with the results obtained by simulation.

TABLE I
SIMULATION SCENARIOS

Scenario   Video Clip   Buffer Size B (# macroblocks)
1          Sequence A   1620
2          Sequence A   2430
3          Sequence B   2430
4          Sequence C   2430
5          Sequence C   3240

[Figure 4: four plots, panels (a) to (d).]

Fig. 4. (a) The difference plot for a macroblock stream x(t) which is compliant with the computed upper and lower bounds; (b) Corresponding playout buffer fill levels; (c) The difference plot for a non-compliant stream x(t); (d) Corresponding playout buffer fill levels indicating buffer overflow. The horizontal-axis shows time in ns and the vertical-axis shows the number of processed macroblocks in the playout buffer.

For ease of interpretation of the simulation results, we always show a difference plot. Such a plot does not show the absolute values of x(t) (obtained from the simulation) versus xmin(t) and xmax(t) (which are computed following Theorem 1 of Section II). Instead, it shows the curves corresponding to the differences xmax(t) − xmin(t) and x(t) − xmin(t). From such a plot it is possible to detect when an input stream x(t) (where x(t) is measured from the input stream resulting from simulating the decoding algorithm on a video clip) violates the computed bounds. A violation occurs whenever the curve representing x(t) − xmin(t) crosses the curve xmax(t) − xmin(t), or goes below 0.
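The violation test described here is simple to automate. The following Python sketch assumes that x, xmin and xmax are available as equally long lists of samples at the same time instants; the function name and return format are hypothetical.

```python
def find_violations(x, x_min, x_max):
    """Return sample indices where x leaves the band [x_min, x_max]."""
    band = [hi - lo for hi, lo in zip(x_max, x_min)]   # x_max(t) - x_min(t)
    offset = [v - lo for v, lo in zip(x, x_min)]       # x(t) - x_min(t)
    # Upper bound violated: the offset curve rises above the band curve.
    above = [t for t, (o, w) in enumerate(zip(offset, band)) if o > w]
    # Lower bound violated: the offset curve drops below zero.
    below = [t for t, o in enumerate(offset) if o < 0]
    return above, below


x_min = [0, 1, 2, 3]
x_max = [2, 4, 6, 8]
x = [1, 5, 1, 3]
print(find_violations(x, x_min, x_max))  # ([1], [2])
```

In the toy data the stream exceeds the upper bound at sample 1 and falls below the lower bound at sample 2.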

Figure 4(a) shows an example where x(t) resulting from a video clip is compliant with the bounds xmin(t) and xmax(t). In Figure 4(b) the corresponding playout buffer fill level (as measured from simulation) is shown, which confirms that no buffer overflow or underflow occurs. Figure 4(c) depicts an example of a sequence x(t) (obtained from the simulation Scenario 1), which violates the upper bound. The corresponding buffer fill level plot in Figure 4(d) shows that the playout buffer overflows.

In Figure 5 we show the difference plots corresponding to the Scenarios 2–5 which are outlined in Table I. All the figures in this subsection show an excerpt of 1.36 seconds (corresponding to 34 frames) of video sequences from some simulation scenario (1 to 5). The left bar plot in Figure 6 shows normalized values of largest buffer fill levels observed at the playout buffer. On the right, the normalized “emptiness” values E (or smallest buffer fill levels) are shown for the simulation runs. We define normalized emptiness as

$$E = \sup_{\forall t > t_0} \left\{ \frac{B - \mathit{fill}(t)}{B} \right\},$$

where fill(t) denotes the actual buffer fill level at time t and t0 is the playback delay.

[Figure 5: four difference plots, panels Scenario 2, Scenario 3, Scenario 4 and Scenario 5.]

Fig. 5. Difference plots (xmax(t) − xmin(t) and x(t) − xmin(t)) for the simulation Scenarios 2–5, with the corresponding normalized buffer fill and emptiness levels of the playout buffer shown in Figure 6. The horizontal-axis shows time in ns and the vertical-axis shows the number of processed macroblocks in the playout buffer.

[Figure 6: two bar charts over Scenarios 1 to 5, vertical axes from 0 to 1.]

Fig. 6. Normalized buffer fill levels (on the left) and emptiness levels (on the right) for the simulation Scenarios 1 to 5. The bars coloured black indicate buffer overflows and underflows in the respective bar graphs.

For Scenarios 1 and 4, we can observe that x(t) measured from simulations is greater than the computed xmax(t) for some values of t. The measured buffer fill levels for both these scenarios show overflow, which is in agreement with the analytical framework. Similarly for Scenario 3, we see that x(t) is less than xmin(t) for some values of t, indicating a possibility of the playout buffer underflow, which is confirmed in Figure 6. For the remaining two scenarios (i.e. 2 and 5), x(t) was measured to be within the computed bounds and no buffer overflow or underflow was observed.

These experiments therefore indicate that the proposed analytical framework provides meaningful bounds on the rate of the input stream, which are in conformance with detailed simulation results.

IV. CONCLUDING REMARKS

We presented a novel framework for rate analysis for streaming applications, and showed its feasibility through a case study of mapping an MPEG-2 decoder application onto a multiprocessor SoC architecture. This framework can be used to determine the admissible input stream rates with which a system running a stream processing application can be loaded. The main application of such an analysis is in system-level performance evaluation of hardware-software architectures for stream processing, and it can be used to evaluate different possible mappings of an application onto a fixed architecture. In contrast to simulation based approaches, our framework can be used to meaningfully evaluate a large number of such mappings within a very short time and can therefore be used for automated design space exploration techniques. We plan to integrate this framework into appropriate tools, and also explore its use in on-chip buffer sizing and scheduling for streaming applications.

Acknowledgements: The first two authors are partly supported

by the Swiss Innovation Promotion Agency (KTI/CTI) through the

projects KTI 5845.1 and KTI 5500.2. The work of the third author is

partly supported by the NUS ARF grants R-252-000-169-101/112.

REFERENCES

[1] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2):59–67, 2002.

[2] F.L. Baccelli, G. Cohen, G.J. Olsder, and J.-P. Quadrat. Synchronization and Linearity: An Algebra for Discrete Event Systems. John Wiley & Sons, 1992.

[3] V. Bhaskaran and K. Konstantinides. Image and Video Compression Standards: Algorithms and Architectures. Kluwer Academic Publishers, 1997.

[4] J.-Y. Le Boudec and P. Thiran. Network Calculus - A Theory of Deterministic Queuing Systems for the Internet. LNCS 2050, Springer, 2001.

[5] S. Chakraborty, S. Künzli, and L. Thiele. A general framework for analysing system properties in platform-based embedded system designs. In 6th Design, Automation and Test in Europe (DATE), 2003.

[6] S. Chakraborty, S. Künzli, L. Thiele, A. Herkersdorf, and P. Sagmeister. Performance evaluation of network processor architectures: Combining simulation with analytical estimation. Computer Networks, 41(5), 2003.

[7] R. Cruz. A calculus for network delay, Parts 1 & 2. IEEE Transactions on Information Theory, 37(1), 1991.

[8] R. Cruz. Quality of service guarantees in virtual circuit switched networks. IEEE Journal of Selected Areas in Communication, 13(6), 1995.

[9] W. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In 38th Design Automation Conference (DAC), 2001.

[10] M.I. Gordon et al. A stream compiler for communication-exposed architectures. In 10th Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 291–303, 2002.

[11] M. Gries, C. Kulkarni, C. Sauer, and K. Keutzer. Comparing analytical modeling with simulation for network processors: A case study. In Proc. of the Designer's Forum at DATE, 2003.

[12] A. Haverinen, M. Leclercq, N. Weyrich, and D. Wingard. SystemC based SoC Communication Modeling for the OCP Protocol. http://www.ocpip.org, October 2002.

[13] K. Lahiri, A. Raghunathan, and S. Dey. Evaluation of the traffic-performance characteristics of system-on-chip communication architectures. In Intl. Conf. on VLSI Design, pages 21–35, 2001.

[14] K. Lahiri, A. Raghunathan, and S. Dey. System level performance analysis for designing on-chip communication architectures. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 20(6):768–783, 2001.

[15] MPEG Software Simulation Group. http://www.mpeg.org.

[16] K. Richter, M. Jersak, and R. Ernst. A formal approach to MpSoC performance verification. IEEE Computer, 36(4), 2003.

[17] Open SystemC Initiative. http://www.systemc.org.

[18] L. Thiele, S. Chakraborty, M. Gries, and S. Künzli. A framework for evaluating design tradeoffs in packet processing architectures. In 39th Design Automation Conference (DAC), 2002.

[19] W. Thies, M. Karczmarek, and S. Amarasinghe. StreamIt: A language for streaming applications. In 11th Conference on Compiler Construction (CC), LNCS 2304, pages 179–196, 2002.

[20] G. Varatkar and R. Marculescu. Traffic analysis for on-chip networks design of multimedia applications. In 39th Design Automation Conference (DAC), 2002.