Rate analysis for streaming applications with on-chip buffer constraints.
ABSTRACT While mapping a streaming application (such as multimedia or network packet processing) onto a specified architecture, an important issue is to determine the input stream rates that can be supported by the architecture for any given mapping. This is subject to typical constraints: on-chip buffers should not overflow, and specified playout buffers (which feed audio or video devices) should not underflow, so that the quality of the audio/video output is maintained. The main difficulty in this problem arises from the high variability in execution times of stream processing algorithms, coupled with the bursty nature of the streams to be processed. We present a mathematical framework for such a rate analysis for streaming applications, and illustrate its feasibility through a detailed case study of an MPEG-2 decoder application. When integrated into a tool for automated design-space exploration, such an analysis can be used for fast performance evaluation of different stream processing architectures.

In Proceedings of Asia South Pacific Design Automation Conference 2004
Yokohama, Japan, January 2004
Rate Analysis for Streaming Applications with On-chip Buffer Constraints
Alexander Maxiaguine
ETH Zürich
maxiagui@tik.ee.ethz.ch
Simon Künzli
ETH Zürich
kuenzli@tik.ee.ethz.ch
Samarjit Chakraborty
National University of Singapore
samarjit@comp.nus.edu.sg
Lothar Thiele
ETH Zürich
thiele@tik.ee.ethz.ch
Abstract: While mapping a streaming application (such as multimedia or network packet processing) onto a specified architecture, an important issue is to determine the input stream rates that can be supported by the architecture for any given mapping. This is subject to typical constraints: on-chip buffers should not overflow, and specified playout buffers (which feed audio or video devices) should not underflow, so that the quality of the audio/video output is maintained. The main difficulty in this problem arises from the high variability in execution times of stream processing algorithms, coupled with the bursty nature of the streams to be processed. In this paper we present a mathematical framework for such a rate analysis for streaming applications, and illustrate its feasibility through a detailed case study of an MPEG-2 decoder application. When integrated into a tool for automated design-space exploration, such an analysis can be used for fast performance evaluation of different stream processing architectures.
I. INTRODUCTION
Lately, there has been a tremendous increase in portable and mobile devices running algorithms for processing streams of audio and video data, and sometimes network packets. These include handheld computers and mobile phones, and it is expected that their usage will increase even more in the future. Such devices typically have very stringent constraints pertaining to cost, size, and power consumption, and have posed several challenges towards developing appropriate models, methodologies, languages and tools for designing them (for example, see [10, 19, 20]).
The architecture of such devices typically consists of multiple processing elements (PEs) onto which parts of an application are mapped, and they are integrated on a single chip following a system-on-a-chip (SoC) design paradigm. In this setup, a system-level view of stream processing is as follows: the input stream enters a PE, gets processed by a function or algorithm implemented on this PE, and the processed stream enters another PE for further processing. Between two such PEs there is a buffer which stores the intermediate stream. Finally, the fully processed stream emerges out of a PE and gets stored in a playout buffer which feeds some real-time client such as an audio or video output device. The process of mapping a stream processing application onto such a target architecture gives rise to the problem of determining the range of input stream rates that can be supported by the architecture for a given mapping. Any feasible implementation, or mapping, of an algorithm onto an architecture is subject to constraints such as (i) the buffers between any two PEs should not overflow, and (ii) the playout buffer, which is read by the real-time client at some specified rate (depending on the quality of the audio or video output required), should not underflow at any point in time. Determining the range of feasible input stream rates subject to the above constraints is difficult for two main reasons. Firstly, there is a high data-dependent variability in the execution time of many stream processing algorithms, because it depends on the properties of the particular audio/video sample being processed. Secondly, the input streams themselves tend to be bursty in nature. These two factors coupled together can increase the burstiness of the stream coming out of a PE, thereby necessitating a large amount of on-chip buffer space for its storage. Here, it may be noted that in contrast to the simple setup described above, there might be multiple streams being processed by a PE, where the different streams are processed by different functions, all of which are implemented on the same PE. The burstiness of the outgoing processed streams in such cases would additionally depend on the scheduling policy used to schedule them on the PE [16].
The importance of the above problem of rate analysis stems from the fact that on-chip buffers are available only at a premium, because of their large area requirements (see [9]). Therefore, when mapping a streaming application onto a specified architecture, it is necessary to accurately identify the feasible range of input stream rates (and bursts) that can be supported by the available on-chip buffers. This also includes the minimum rate that should be maintained to ensure the quality of the audio/video output.
A. Our results and relation to previous work
There is a large body of work on on-chip traffic analysis and SoC communication architecture design (see [13, 14] and the references therein) which is relevant to the problem addressed in this paper. However, most of this relies on simulation based approaches. In the context of our problem, it typically requires several hours to simulate a few minutes of audio/video data for any reasonably detailed processor model. Further, simulation based approaches often fail to accurately characterize the allowed input rates and burst ranges, and strongly depend on the audio or video data used, which itself is difficult to select.
In this paper we present a mathematical framework for rate analysis for streaming applications, with the aim of overcoming the main problems associated with simulation based methods. For the sake of generality, we consider any stream to be made up of a potentially infinite sequence of stream objects, where a stream object might be a macroblock, a video frame, an audio sample, or a network packet, depending on the application in question. Given a specification of the architecture along with the different buffer sizes, the scheduling policies implemented at the different PEs, the execution requirement per stream object of each processing function implemented on a PE, and the output rate required to drive the real-time client (such as the audio or video terminal), the proposed framework can be used to compute the minimum and maximum rate of the input stream. Here, rate refers to the precise characterization of the stream, including the allowed burst range/jitter and the long-term arrival rate, which are described more precisely in Section II. Any input stream whose rate lies between the computed minimum and maximum rates is guaranteed to satisfy all the constraints pertaining to buffer overflow and underflow. We also substantiate these theoretical results through a detailed case study of mapping an MPEG-2 decoder application on a specified architecture.
The proposed framework can be used to drive a system-level design space exploration where different possible mappings of a streaming application onto an architecture with fixed buffers need to be evaluated. It can also be used for optimal buffer sizing in an architecture, which is of crucial importance due to the high space requirements of on-chip buffers.
Our framework is based on the theory of Network Calculus [4], which extends the concept of service curves proposed in [7, 8] by placing it in an algebraic setting developed in [2]. Although Network Calculus was originally developed, and is still being largely used, in the domain of communication networks, very recently it was used to analyse SoC architectures in the context of network processors [6, 18]. This work was further extended in [5, 11]. Our work in this paper follows this line of development. On an abstract level, all the previous papers considered the problem: given an input stream and the scheduling policy at each PE, what is the worst-case buffer requirement and what is the nature of the output stream? However, the problem addressed in this paper is the “reverse problem”, where the output stream and the buffer size are given and the nature of the input stream needs to be computed. It turns out that none of the previous results can be extended to address this question, and a more elaborate theory based on [2] is required. Our work is also related to the problem of multimedia smoothing in the domain of communication networks [4], which addresses the problems of shaping an input stream to meet buffer constraints and that of computing the optimal playback delay or buffering time to maintain quality of service. It may be noted here that the main differences between the domains of communication networks and on-chip communication in SoC architectures are that in the former case (i) buffers are more readily available since there are no space constraints, (ii) packet dropping is a feasible option, which might not be possible in the latter case due to power/performance constraints, and (iii) shaping can be employed to reduce buffer consumption, which might be too costly to employ in the latter case. Lastly, related to this paper is also the work in [20], which proposes that on-chip traffic for multimedia applications exhibits self-similarity, and uses this property for optimal buffer sizing.
The problem is formally defined in the next section, followed by the details of the proposed framework. In Section III we present the case study of mapping an MPEG-2 decoder on a specified architecture. Finally, in Section IV we outline some of the possible directions in which this work may be extended.
Fig. 1. A node with a processing element and an internal buffer of size b,
processing an input stream and feeding the processed stream into a playout
buffer of size B.
II. RATE ANALYSIS WITH BUFFER CONSTRAINTS
In this section we first state the problem definition, followed by some notation and then the case of a single PE with a playout buffer. This is then extended to consider the case of a stream passing through multiple PEs.
As mentioned in the last section, a stream is processed by multiple nodes, where each node consists of a PE and an internal buffer. Let us first consider the last node in the path of a stream, which feeds the processed stream into a playout buffer of size B, as shown in Figure 1. Let this node consist of a PE and an internal buffer of size b. The playout buffer is read by a real-time client such as an audio/video output device, at some specified rate. Let the input stream entering the node be denoted by x(t), where x(t) denotes the number of stream objects that arrived during the time interval [0,t]. The PE provides a guaranteed service β(∆) of the following form: within any time interval of length ∆, it will be able to process at least β(∆) input stream objects. The function β therefore provides a lower bound on the service provided by the PE, and is determined by the time required to process each stream object and the scheduling policy implemented at the PE (in case multiple streams or other tasks are also being processed by it). Let us denote the processed output stream entering the playout buffer by y(t), which (like x(t)) denotes the number of stream objects coming out during the time interval [0,t]. The real-time client consumes stream objects from the playout buffer at a rate C(t), which denotes the number of stream objects consumed within the time interval [0,t].
Therefore, x, y and C are functions denoting cumulative values, while the function β denotes values over time interval lengths and is referred to as a service curve [7]. Throughout this paper, we assume all functions f to be wide-sense increasing (which means f(s) ≤ f(t), ∀s ≤ t), and f(t) = 0 for t ≤ 0. Now, given β, C, and the buffer sizes b and B, the problem is to compute the function (or set of possible functions) x(t), such that (i) the playout buffer does not overflow, (ii) it does not underflow, and (iii) the internal buffer at the node does not overflow. These constraints are subject to the real-time client consuming stream objects at the specified rate C(t) and the processing element providing a guaranteed service β. The version of the problem with multiple PEs is a simple extension of this, and is stated later.
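The cumulative functions defined above can be illustrated with a minimal Python sketch. It assumes that arrivals, PE outputs and client reads are available as sorted lists of event timestamps; the helper names `cumulative` and `fill_levels` and the sample timestamps are hypothetical and only serve the illustration. The fill level of the internal buffer is x(t) − y(t) and that of the playout buffer is y(t) − C(t).

```python
from bisect import bisect_right

def cumulative(event_times, t):
    """Number of events in the closed interval [0, t]; event_times must be sorted."""
    return bisect_right(event_times, t)

def fill_levels(arrivals, departures, reads, t):
    """Return (internal buffer fill, playout buffer fill) at time t."""
    x = cumulative(arrivals, t)    # x(t): stream objects that entered the node
    y = cumulative(departures, t)  # y(t): stream objects written to the playout buffer
    c = cumulative(reads, t)       # C(t): stream objects consumed by the client
    return x - y, y - c

# Made-up timestamps, purely for illustration.
arrivals = [0.0, 1.0, 1.5, 4.0]
departures = [0.5, 2.0, 2.5]
reads = [3.0]
internal, playout = fill_levels(arrivals, departures, reads, 2.0)
```

The constraints of the problem then amount to keeping the first value at or below b and the second value between 0 and B for all t.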
As mentioned before, the constraint on playout buffer underflow is to maintain the quality of the audio/video output. The constraints on buffer overflow are motivated by the fact that typically static scheduling policies are implemented on the PEs (for simplicity), and hence checking buffer fill levels and stalling a processor in case an output buffer is full cannot be easily implemented.
A. Notation
For any two functions f and g, the min-plus convolution of f and g is given by: (f ⊗ g)(t) = inf_{s: 0 ≤ s ≤ t} {f(t − s) + g(s)}. The min-plus deconvolution of f and g is given by: (f ⊘ g)(t) = sup_{u: u ≥ 0} {f(t + u) − g(u)}. We use f ∧ g to denote the infimum of f and g, or the minimum if it exists, and f ∨ g to denote the supremum of f and g, or the maximum if it exists.
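As an illustration of these two operators, the following Python sketch evaluates them on functions sampled at the integer time points t = 0, 1, ..., T−1. The function names are hypothetical, and the discretization is an assumption of this sketch: in particular, the supremum in the deconvolution is truncated to the sampled horizon, so values close to the end of the horizon may be underestimated.

```python
def min_plus_conv(f, g):
    """(f ⊗ g)(t) = min over 0 <= s <= t of f(t - s) + g(s), on a common grid."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]

def min_plus_deconv(f, g):
    """(f ⊘ g)(t) = max over u >= 0 of f(t + u) - g(u), truncated to the grid."""
    n = min(len(f), len(g))
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

# Two sampled wide-sense increasing functions (made-up values).
f = [0, 2, 4, 6]
g = [0, 1, 2, 3]
conv_fg = min_plus_conv(f, g)      # [0, 1, 2, 3]
deconv_fg = min_plus_deconv(f, g)  # [3, 4, 5, 6]
```

Both loops are quadratic in the number of samples, which is acceptable for a sketch but not tuned for long traces.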
B. Buffer underflow and overflow constraints
Following our problem description, the constraint on the playout buffer underflow can be stated as:
    y(t) ≥ C(t)   ∀t ≥ 0                    (1)
Since the PE provides a service guarantee of β, it can be shown that y(t) ≥ (x ⊗ β)(t) (see [4] for details). Hence, the minimum value of y(t) is equal to (x ⊗ β)(t) and the constraint given by Eqn.(1) can be reformulated as:
    (x ⊗ β)(t) ≥ C(t)   ∀t ≥ 0              (2)
It can be shown that for any functions f, g and h, g ⊗ h ≥ f if and only if h ≥ f ⊘ g. Using this, Eqn.(2) can be stated as:
    x(t) ≥ (C ⊘ β)(t)   ∀t ≥ 0              (3)
Similarly, the constraint on the playout buffer overflow can be stated as: y(t) − C(t) ≤ B, ∀t ≥ 0. Since y(t) can be as large as x(t) but not larger, this constraint can be reformulated as:
    x(t) ≤ C(t) + B   ∀t ≥ 0                (4)
Finally, the constraint that the internal buffer at the node should not overflow is given by: x(t) − y(t) ≤ b, ∀t ≥ 0. Since y(t) ≥ (x ⊗ β)(t), the minimum value of y(t) is (x ⊗ β)(t) and the above constraint, as before, can be formulated as:
    x(t) ≤ (x ⊗ β)(t) + b   ∀t ≥ 0          (5)
Eqns.(3), (4) and (5) therefore state all the constraints that the input stream x(t) is required to satisfy.
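The three constraints can be checked numerically for a candidate input stream. The sketch below is only an illustration under stated assumptions: all functions are sampled on a common integer time grid, the worst-case output y = x ⊗ β is used for the underflow and internal-buffer checks (the forms of Eqns.(2) and (5)), and the helper name `check_constraints` as well as all numeric values are hypothetical.

```python
def min_plus_conv(f, g):
    """(f ⊗ g)(t) = min over 0 <= s <= t of f(t - s) + g(s), on a common grid."""
    n = min(len(f), len(g))
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(n)]

def check_constraints(x, beta, C, B, b):
    """Check Eqns. (2), (4) and (5) pointwise on the sampled horizon."""
    y_low = min_plus_conv(x, beta)  # guaranteed (worst-case) output of the PE
    return {
        "no_playout_underflow": all(y >= c for y, c in zip(y_low, C)),        # Eqn. (2)
        "no_playout_overflow": all(xt <= c + B for xt, c in zip(x, C)),       # Eqn. (4)
        "no_internal_overflow": all(xt <= y + b for xt, y in zip(x, y_low)),  # Eqn. (5)
    }

beta = [0, 2, 4, 6, 8, 10]  # service of 2 objects per tick (made-up)
C = [0, 0, 1, 2, 3, 4]      # client starts reading after a short delay (made-up)
ok = check_constraints([0, 1, 2, 3, 4, 5], beta, C, B=2, b=1)
bad = check_constraints([0, 4, 8, 12, 16, 20], beta, C, B=2, b=1)
```

In this toy instance the first stream satisfies all three checks, while the second arrives too fast and fails both overflow checks.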
C. Computing bounds on x(t)
Eqns.(4) and (5) can be combined and stated as follows:
    x(t) ≤ (C(t) + B) ∧ ((x ⊗ β)(t) + b)   ∀t ≥ 0
Let x_max(t) be the maximum value of x(t) which satisfies the above inequality. This inequality is of the form:
    x ≤ (C + B) ∧ Π(x)                      (6)
where x and C are functions and Π is an operator given by Π(x) = (x ⊗ β) + b. It follows from [4] (see also [2] for further details) that the maximum solution which satisfies Eqn.(6) is given by x_max(t) = \overline{Π}(C(t) + B), where \overline{Π} is the subadditive closure of Π and is defined as
    \overline{Π}(x) = x ∧ Π(x) ∧ Π(Π(x)) ∧ ...
Since Π(x) = (x ⊗ β) + b, it follows that:
    \overline{Π}(x) = x ∧ (x ⊗ β + b) ∧ (x ⊗ β ⊗ β + 2b) ∧ ...
or, \overline{Π}(x) = x ∧ inf_{n ≥ 1} {x ⊗ β^(n) + nb}, where β^(n) is the n-th self-convolution of β. Now, it is known that for any functions f, g and h, (f ∧ g) ⊗ h = (f ⊗ h) ∧ (g ⊗ h) [4]. Using this result, it follows that: \overline{Π}(x) = (x ⊗ δ_0) ∧ (x ⊗ inf_{n ≥ 1} {β^(n) + nb}), where δ_0 is a function defined as δ_0(t) = +∞ for all t > 0, and δ_0(t) = 0 for all t ≤ 0. Hence, \overline{Π}(x) = x ⊗ inf_{n ≥ 0} {β^(n) + nb} since, for any function f, by convention f^(0) = δ_0 (see [4]).
The subadditive closure of any function f, denoted by \overline{f}, is defined as \overline{f} = inf_{n ≥ 0} {f^(n)} (which is similar to the subadditive closure of an operator as described above). Hence, it follows that \overline{Π}(x) = x ⊗ \overline{(β + b)} and therefore,
    x_max(t) = ((C + B) ⊗ \overline{(β + b)})(t)          (7)
Similarly, to obtain a lower bound on x(t), we recast Eqn.(5) as follows: x(t) ≥ ((x − b) ⊘ β)(t). By combining this with Eqn.(3), we obtain: x(t) ≥ (C ⊘ β)(t) ∨ ((x − b) ⊘ β)(t). This is of the form:
    x ≥ (C ⊘ β) ∨ Γ(x)                      (8)
where Γ is an operator given by Γ(x) = (x − b) ⊘ β. Using a result [4] analogous to the existence of the maximum solution to Eqn.(6), it follows that Eqn.(8) has one minimum solution which is given by:
    x_min(t) = \overline{Γ}(C ⊘ β)(t)       (9)
where \overline{Γ} is the superadditive closure of Γ and is defined as
    \overline{Γ}(x) = x ∨ Γ(x) ∨ Γ(Γ(x)) ∨ ...
Unlike Eqn.(7), Eqn.(9) unfortunately does not give a closed-form solution to x_min(t) and must be iteratively computed for any given problem instance. Eqns.(7) and (9) therefore give upper and lower bounds on the function x(t), and this is summarized in the following theorem, which is the main result of this paper.
Theorem 1 Any non-decreasing function x(t) which satisfies the inequality x_min(t) ≤ x(t) ≤ x_max(t), ∀t ≥ 0, respects both the buffer overflow and the playout buffer underflow constraints, where x_max and x_min are computed using Eqns.(7) and (9), respectively.
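A minimal numerical sketch of the two bounds is given below. It is not the Mathematica implementation used later in the case study; it is a Python illustration that assumes all functions are sampled on a common integer time grid, that b > 0 (so the closure iteration terminates), and that the supremum in the deconvolution is truncated to the sampled horizon. All function names and numeric values are hypothetical.

```python
import math

def conv(f, g):
    """Min-plus convolution on a common grid."""
    return [min(f[t - s] + g[s] for s in range(t + 1)) for t in range(len(f))]

def deconv(f, g):
    """Min-plus deconvolution, with the supremum truncated to the grid."""
    n = len(f)
    return [max(f[t + u] - g[u] for u in range(n - t)) for t in range(n)]

def subadditive_closure(f, max_iter=10_000):
    """inf over n >= 0 of the n-fold self-convolution of f, with f^(0) = delta_0."""
    g = [0] + [math.inf] * (len(f) - 1)  # delta_0 on the grid
    for _ in range(max_iter):
        nxt = [min(a, c) for a, c in zip(g, conv(g, f))]
        if nxt == g:
            return g
        g = nxt
    raise RuntimeError("closure did not converge")

def x_max(C, beta, B, b):
    """Closed form of Eqn. (7): (C + B) ⊗ closure(beta + b)."""
    return conv([c + B for c in C], subadditive_closure([v + b for v in beta]))

def x_min(C, beta, b, max_iter=10_000):
    """Fixed-point iteration for Eqn. (9): x <- (C ⊘ beta) ∨ ((x - b) ⊘ beta)."""
    base = deconv(C, beta)
    x = base
    for _ in range(max_iter):
        gamma = deconv([v - b for v in x], beta)
        nxt = [max(p, q) for p, q in zip(base, gamma)]
        if nxt == x:
            return x
        x = nxt
    raise RuntimeError("x_min iteration did not converge")

beta = [3 * t for t in range(8)]   # straight-line service curve (made-up slope)
C = [0, 0, 0, 2, 4, 6, 8, 10]      # made-up consumption curve
hi = x_max(C, beta, B=6, b=4)
lo = x_min(C, beta, b=4)
```

In this toy instance the service curve is steeper than C everywhere, so the bounds degenerate to x_min = C and x_max = C + B; tighter service curves or smaller internal buffers make the closure terms matter.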
D. The case of multiple processing elements
PEs in the path of a stream other than those considered in the preceding subsections (i.e., those which do not feed their output into a playout buffer) process an input stream x'(t), and the output y'(t) is fed into another PE for further processing. The only constraint for any such PE is that the associated internal buffer should not overflow. The required output y'(t) for such a PE is determined from the input x(t) of the PE analyzed before it, i.e., the PE that consumes its output. Following this composition scheme, we first fix an input x(t) of the PE which feeds the playout buffer (where x_min(t) ≤ x(t) ≤ x_max(t) from the previous subsection). This x(t) is the required output y'(t) of the PE immediately upstream, for which we compute bounds x'_max and x'_min (using similar techniques as described above) and choose some x'(t) lying in between these bounds. This is in turn the required output of the next PE upstream, and this process is followed until we compute x'(t) (or bounds on it) for the input stream entering the first PE in the path of the stream. Any input stream conforming to this computed value is guaranteed to respect all the buffer constraints.
Fig. 2. Mapping the MPEG-2 decoder application onto a multiprocessor SoC.
III. RATE ANALYSIS FOR AN MPEG-2 DECODER
In this section we apply the rate analysis methodology developed in the previous section to study the mapping of an MPEG-2 decoder [3] application onto a multiprocessor SoC architecture with fixed buffers. By comparing the results from our mathematical framework with those obtained from a system simulator, we show that our framework is able to provide useful bounds on the allowed rates of the input stream. For any input sequence x(t) and the computed bounds x_max and x_min, the results obtained by our framework were in conformance with the simulation results in terms of predicting buffer overflow, underflow, and all cases where the buffer constraints are satisfied.
A. The MPEG-2 decoder application
Our target architecture consists of a number of PEs interconnected by an on-chip communication network. This network can be seen as a system of point-to-point buffered FIFO channels of limited capacity. The PEs exchange packetized data by writing/reading to/from these channels.
The part of the architecture onto which the MPEG-2 decoder application is mapped, along with the mapping of the application's task graph on it, is shown in Figure 2. In this figure, PE1 and PE2 are programmable processors and VOUT denotes a video output port.
The task graph of the MPEG-2 decoder includes several tasks such as variable length decoding (VLD), inverse quantization (IQ), inverse discrete cosine transform (IDCT), formation of motion predictors (MP) and motion compensation (MC). Based on profiling information, we partitioned this set of tasks into two subsets, with one being executed on PE1, and the other on PE2.
A compressed video bit stream arrives into a buffer B1 (as shown in Figure 2) and is processed by the VLD and the IQ tasks running on PE1. Decompressed macroblocks are written into the buffer B2, which is read by PE2 for further processing. Note that the data exchange between PE1 and PE2 can be seen as a single stream of packets with each packet encapsulating IDCT coefficients and motion vectors for one macroblock. This information is processed by the IDCT, MP and MC tasks mapped onto PE2. Finally, the decoded video samples are written (one macroblock at a time) into the playout buffer Bout.
Bout is read by an output process located in the video port, which reads one macroblock at a time, at a constant rate that is determined by the frame rate and the resolution of the decoded video stream. The constraints associated with Bout are that it should never be empty when the output process attempts to read from it, and it should also not overflow when PE2 writes into it. We do not want to adopt the option of stalling PE2 when Bout is full, so that simple static scheduling algorithms can be used on PE2 when multiple streams are processed by it. Additionally, we also require the buffers B1 and B2 not to overflow.
Satisfying all the above conditions simultaneously is difficult, because tasks executing on PE1 produce a bursty time-varying traffic on its output. This is mainly due to the fact that the execution time of the VLD task exhibits high variability, which depends on the structure of the compressed stream and the properties of the encoded video information (see also [20]).
Using the proposed rate analysis technique, we can compute upper and lower bounds on the macroblock output rate that is to be satisfied by the processor PE1. Any macroblock output stream from PE1 which conforms to these bounds is guaranteed not to overflow the buffers B2 and Bout and also not to underflow the buffer Bout. Depending on such a chosen output rate of PE1, upper and lower bounds on the input rate to the buffer B1 can also be computed, as discussed in the previous section.
B. Analytical results
Due to space restrictions, we will only concentrate on buffer overflow and underflow constraints associated with B2 and Bout and compute bounds that the macroblock output stream coming out of PE1 is required to satisfy. Following the notation of Section II (see also Figure 1), the output process located in the video port is a real-time client characterized by a cumulative rate function C(t). Bout is the playout buffer of size B. The processing capacity of PE2 is characterized by a guaranteed service of at least β(t). The buffer B2 corresponds to the internal buffer of PE2 with size b. Let x(t) be the cumulative rate function of any macroblock stream on the output of PE1. The curves denoted by x_max(t) and x_min(t), which are computed analytically using the framework developed in Section II, are the bounds to which the macroblock stream x(t) on PE1's output has to conform in order to guarantee that buffers B2 and Bout do not overflow and Bout does not underflow.
C(t) can be derived from the parameters of the video stream to be decoded, and is given as follows:
    C(t) = 0       if t ≤ t0
    C(t) = N F t   if t > t0
where N is the number of macroblocks per video frame, F is the video frame rate and t0 is the time starting from which the real-time client starts reading data out of the playout buffer. t0 can be referred to as the playback delay or buffering time. All video streams used in our experiments have F = 25 frames per second and N = 1620. The playback delay t0, in general, can be chosen arbitrarily. We have set t0 to be equal to the time required by the real-time client to read half a frame from the playout buffer.
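The parameter values stated above can be turned into concrete numbers with a few lines of Python. This is a sketch only: the function name `C` and its keyword arguments are hypothetical, the piecewise expression follows the definition of C(t) as it appears in the text, and the macroblock size of 16x16 pixels is a standard MPEG-2 fact used here to cross-check N against the 720x576 resolution reported in the simulation setup.

```python
N = 1620  # macroblocks per frame (from the text)
F = 25    # frames per second (from the text)

rate = N * F  # long-term consumption rate in macroblocks per second
t0 = 0.5 / F  # time to read half a frame, in seconds (20 ms)

def C(t, n=N, f=F, delay=t0):
    """Cumulative number of macroblocks read by the client up to time t (seconds)."""
    return 0 if t <= delay else n * f * t

# Cross-check of N: a 720x576 picture split into 16x16 macroblocks.
macroblocks_per_frame = (720 // 16) * (576 // 16)
```

With these values the client consumes 40500 macroblocks per second once playback has started, and the playback delay corresponds to 810 macroblocks, i.e. half a frame.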
A system configuration is defined by a set of parameters for β(t), B, and b. In our experiments we varied the size of the playout buffer B. Assuming that the processor PE2 is unloaded and hence its entire capacity is available for processing the IDCT, MP and MC tasks mapped onto it, the service curve β(t) was modeled by a straight line. The slope of this line was set to be equal to the long-term average macroblock production rate on the output of PE1 (which was measured using the system simulator described in the next subsection). This was based on the assumption that in the long term, PE2 has sufficient capacity to process all the incoming macroblocks. B2 was always set to a fixed size of b = 500 macroblocks. Figure 3 shows the resulting bounds x_min(t) and x_max(t) computed by implementing Eqns.(7) and (9) of Section II in Mathematica (from Wolfram Research), with the system configuration and β(t) described above and the playout buffer size B set to 2430 macroblocks.
Fig. 3. The analytical bounds x_min(t) and x_max(t) computed for a system configuration with B = 2430 macroblocks. (Horizontal axis: t [ms]; vertical axis: # macroblocks.)
C. Simulation Setup
We performed simulations of the MPEG-2 decoder application (shown in Figure 2) using a transaction level model of the system architecture (see [12]). The system model was written in SystemC [17], and the models of the programmable PEs were based on the SimProfile configuration of the SimpleScalar [1] instruction set simulator. Both PE1 and PE2 had a RISC core (similar to the MIPS3000 processor) augmented with MPEG-2 specific hardware accelerators. PE1 was enhanced with bitstream access operations, while PE2 had special support for application kernels such as IDCT, Add Clip and Block Average and could prefetch memory in a special video-block mode. Floating point operations were not used on either of the PEs. The implementation of the MPEG-2 decoder was based on the source code available from [15].
We simulated the decoding of several MPEG-2 video clips. All video clips had parameters as described in the previous subsection. The clips were encoded using a constant bit rate of 9.78 Mb/s and a resolution of 720x576 pixels (typically used in DVD applications). A selection of the simulated scenarios (with each scenario being a combination of a video clip and the playout buffer size), representing corner cases for our experiments, is summarized in Table I. Video sequence A corresponds to a video clip with global motion, video sequence B corresponds to a video clip with moving objects and still background, and video sequence C represents a still picture.
For all the simulation scenarios listed in Table I, using our simulation setup we measured x(t) at the output of PE1 and the maximum and minimum fill levels of B2 and the playout buffer Bout.
D. Comparing the analytical bounds with simulation
In this subsection we evaluate the usefulness of the analytical bounds on the input rate x(t), by comparing the predictions on buffer overflow/underflow/conformance deduced from the analytical framework with the results obtained by simulation.

TABLE I
SIMULATION SCENARIOS
Scenario   Buffer Size B (# macroblocks)   Video Clip
1          1620                            Sequence A
2          2430                            Sequence A
3          2430                            Sequence B
4          2430                            Sequence C
5          3240                            Sequence C

Fig. 4. (a) The difference plot for a macroblock stream x(t) which is compliant with the computed upper and lower bounds; (b) Corresponding playout buffer fill levels; (c) The difference plot for a non-compliant stream x(t); (d) Corresponding playout buffer fill levels indicating buffer overflow. The horizontal axis shows time in ns and the vertical axis shows the number of processed macroblocks in the playout buffer.
For ease of interpretation of the simulation results, we always show a difference plot. Such a plot does not show the absolute values of x(t) (obtained from the simulation) versus x_min(t) and x_max(t) (which are computed following Theorem 1 of Section II). Instead, it shows the curves corresponding to the differences x_max(t) − x_min(t) and x(t) − x_min(t). From such a plot it is possible to detect when an input stream x(t) (where x(t) is measured from the input stream resulting from simulating the decoding algorithm on a video clip) violates the computed bounds. A violation occurs whenever the curve representing x(t) − x_min(t) crosses the curve x_max(t) − x_min(t), or goes below 0.
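The violation test behind a difference plot can be written down in a few lines. The following Python sketch assumes that x(t), x_min(t) and x_max(t) are available as equally long lists sampled at the same time points; the function name `find_violations` and the sample values are hypothetical.

```python
def find_violations(x, x_min, x_max):
    """Return (sample index, kind) for every point where x leaves [x_min, x_max]."""
    violations = []
    for i, (xi, lo, hi) in enumerate(zip(x, x_min, x_max)):
        diff = xi - lo  # curve x(t) - x_min(t)
        band = hi - lo  # curve x_max(t) - x_min(t)
        if diff < 0:
            violations.append((i, "below lower bound"))
        elif diff > band:
            violations.append((i, "above upper bound"))
    return violations

# Made-up sampled curves, purely for illustration.
x_min = [0, 0, 2, 4]
x_max = [6, 6, 8, 10]
x = [0, 3, 9, 3]
result = find_violations(x, x_min, x_max)
# result == [(2, "above upper bound"), (3, "below lower bound")]
```

Touching a bound is treated as compliant here; only strictly crossing it is reported.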
Figure 4(a) shows an example where x(t) resulting from a video clip is compliant with the bounds x_min(t) and x_max(t). In Figure 4(b) the corresponding playout buffer fill level (as measured from simulation) is shown, which confirms that no buffer overflow or underflow occurs. Figure 4(c) depicts an example of a sequence x(t) (obtained from the simulation Scenario 1), which violates the upper bound. The corresponding buffer fill level plot in Figure 4(d) shows that the playout buffer overflows.
In Figure 5 we show the difference plots corresponding to Scenarios 2–5, which are outlined in Table I. All the figures in this subsection show an excerpt of 1.36 seconds (corresponding to 34 frames) of video sequences from some simulation scenario (1 to 5). The left bar plot in Figure 6 shows normalized values of largest buffer fill levels observed at the play