SCALABLE VIDEO CODING BASED ON MOTION-COMPENSATED TEMPORAL
FILTERING: COMPLEXITY AND FUNCTIONALITY ANALYSIS
Fabio Verdicchio*, Yiannis Andreopoulos, Tom Clerckx, Joeri Barbarien,
Adrian Munteanu, Jan Cornelis and Peter Schelkens
Vrije Universiteit Brussel (VUB)
Department of Electronics and Information Processing (ETRO-IRIS)
Pleinlaan 2, B-1050, Brussels, Belgium
ABSTRACT

Video coding techniques yielding state-of-the-art compression performance require large amounts of computational resources; hence, practical implementations, which target a broad market, often trade off coding efficiency and flexibility for reduced complexity. Scalable video coding, instead, not only provides seamless adaptation to bit-rate variations, but also allows the end user to trim down the resources needed to perform real-time decoding by limiting the process to a subset of the original content. Hence, by choosing the quality, frame-rate and/or resolution of the reconstructed sequence, each decoder can meet its hardware limitations without affecting the encoding process of the media provider. This paper proposes a preliminary analysis of the memory-access behavior of a fully scalable video decoder and investigates the capability of selecting the operational settings in order to adapt to the available hardware resources on the target device.

1. INTRODUCTION
Streaming of multimedia content over heterogeneous
networks, e.g. the Internet, where a variety of end-users
may request the same material while experiencing
different available bandwidths, is the natural environment
for scalable video coding (SVC). The media provider,
using SVC techniques, generates a single compressed bit-
stream, from which appropriate subsets, providing
different visual quality, frame-rate and resolution, can be
extracted to meet the bit-rate requirements of a broad
range of clients without the need for low-level transcoding (i.e. full decoding and re-encoding); with SVC, a simple code-stream parser suffices.
Previous works [1], [2] have focused on standardized non-scalable video codecs, such as MPEG-4 AVC, and analyzed the application from the data-transfer and storage perspective, measuring complexity in terms of memory-access frequency along with execution time. The authors of [1] profiled both encoder and decoder, reported the trade-off between coding performance and complexity for a number of configurations supported by the standard, and suggested an a-priori selection of the tools employed at encoding time as a way to control complexity. Once a set of features is retained and the algorithm is fully specified, source-code transformation and data-flow optimization techniques [2] can be applied to achieve significant speed-ups of the application and/or a reduction of the resources required, such as memory and clock frequency, thus decreasing the overall cost of the implementation.
The main contribution of this paper is to provide initial insights into the requirements of the fully scalable coding scheme proposed in [3] and to report evidence of its capability to steer the computational and memory complexity of the decoder by simply adjusting the client settings (frame-rate, resolution, and quality) independently of the encoding process. We believe this to be a necessary kick-off step of a broader feasibility study targeting efficient implementations of fully scalable video coding methods, which have recently attracted the attention of the MPEG standardization committee [4].
The remainder of this paper is structured as follows: in section 2 we give a brief overview of the motion-compensated temporal filtering (MCTF) video coding scheme. In section 3 we describe the configuration used to perform the experiments, while the results are reported and discussed in section 4. Finally, conclusions are drawn in section 5.
2. SPATIAL-DOMAIN MOTION-COMPENSATED TEMPORAL FILTERING
The basic architecture analyzed in this paper is the open-loop scheme depicted in Figure 1, which performs a temporal and then a spatial decomposition (T+2D) of the video prior to embedded compression, thus avoiding the closed prediction loop used in conventional hybrid coders.
Figure 1. Spatial-domain (SD) MCTF encoder architecture. (MCTF: Motion-Compensated Temporal Filtering; DWT: Discrete Wavelet Transform; RSC: Resolution Scalable Coder; EC: Entropy Coder; MVC: Motion Vector Coder; L, H: low-pass and high-pass temporal frames.)
The core of the encoding procedure is the initial removal of temporal correlation within the sequence (MCTF); this task is basically accomplished by low- and high-pass filtering of the input frames along the temporal direction. To achieve efficient decorrelation, the input samples need to be aligned along the motion trajectories. Hence, two additional stages are incorporated into the temporal filtering: first, a motion estimation (ME) step determines the displacement information (i.e. the motion vectors (MVs)) identifying the corresponding blocks in the temporal direction; then, in the motion compensation (MC) phase, each macroblock is positioned according to the corresponding MVs before temporal filtering is performed. MCTF is efficiently implemented using the lifting framework [7], as depicted in Figure 2.
Figure 2. MCTF: one-level decomposition with lifting.
The P operator generates a prediction of each odd
frame (current frame) based upon the even frames
(reference frames) using the above-mentioned ME/MC
stages. The output of P is then subtracted from the
current frame to obtain the residual error frame (high-pass
temporal frame) or H-frame. This information is also
added back to the reference frame by the U operator,
which performs an additional MC stage using the reversed
motion field, to generate a set of L-frames (or low-pass
temporal frames). These frames represent a temporally
smooth version of the input sequence, sampled at half of
the original frame-rate. The temporal filtering process
continues recursively to obtain subsequent temporal levels.
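As an illustration of the lifting implementation of one MCTF level, the following Python sketch computes the H- and L-frames from a pair of input frames. It is only a sketch under simplifying assumptions: the block-based MC helper uses integer-pel vectors, the update weight follows a Haar-like lifting, and all names are illustrative rather than the codec's actual implementation.

```python
import numpy as np

def mc(frame, mvs, block=16):
    """Toy block-based motion compensation with integer-pel vectors (illustrative only)."""
    h, w = frame.shape
    out = np.empty_like(frame)
    for by in range(0, h, block):
        for bx in range(0, w, block):
            dy, dx = mvs[by // block, bx // block]
            ys = np.clip(np.arange(by, by + block) + dy, 0, h - 1)
            xs = np.clip(np.arange(bx, bx + block) + dx, 0, w - 1)
            out[by:by + block, bx:bx + block] = frame[np.ix_(ys, xs)]
    return out

def mctf_level(frames, mvs_fwd, mvs_bwd, use_update=True):
    """One temporal decomposition level: returns the lists of L- and H-frames."""
    L, H = [], []
    for t in range(0, len(frames) - 1, 2):
        even = frames[t].astype(float)
        odd = frames[t + 1].astype(float)
        h = odd - mc(even, mvs_fwd[t // 2])     # P: predict the odd frame from the even one
        l = even + 0.5 * mc(h, mvs_bwd[t // 2]) if use_update else even  # U: update step
        L.append(l)
        H.append(h)
    return L, H
```

Setting use_update=False corresponds to the prediction-only configuration adopted later in section 3.1.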
Finally, as shown in Figure 1, a 2D discrete wavelet
transform (DWT) is performed on the MCTF output,
yielding a multiresolution spatio-temporal representation
of the input sequence. The result of a three-level
decomposition performed both in the temporal and spatial
direction is illustrated in Figure 3. Each wavelet frame is
subsequently encoded using an embedded intra-band
compression algorithm [8], which progressively encodes
the coefficients of each spatial resolution, enabling the
video reconstruction at a dyadic set of resolution levels.
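To make the dyadic resolution scalability of the spatial stage concrete, the small sketch below uses PyWavelets purely as a stand-in for the codec's own DWT and coding chain (the wavelet choice and the random frame content are placeholder assumptions): dropping the finest decomposition level directly yields a half-resolution reconstruction.

```python
import numpy as np
import pywt

frame = np.random.rand(288, 352)  # placeholder CIF-sized L- or H-frame
coeffs = pywt.wavedec2(frame, 'bior2.2', mode='periodization', level=3)

full = pywt.waverec2(coeffs, 'bior2.2', mode='periodization')       # all subbands: 288x352
half = pywt.waverec2(coeffs[:-1], 'bior2.2', mode='periodization')  # drop finest level: 144x176
print(full.shape, half.shape)
```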
Figure 3. Output of the three-level T+2D decomposition of the input sequence (temporal levels 1, 2 and 3).
The decoding process basically consists of inverting the encoding steps. The user receives only the data belonging to those temporal or spatial levels that are necessary to reach the requested frame-rate or spatial resolution, thus saving bandwidth and computations. Moreover, due to embedded coding, the
visual quality of the decoded video sequence can be
progressively refined for any target resolution-level or
frame-rate. Hence, the user can trade-off the overall visual
quality for bandwidth and computational cost.
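A hedged sketch of this decoder-side selection follows. The granularity (one temporal level per halving of the frame-rate, one spatial level per halving of the resolution) mirrors the text above, while the function name, parameters and defaults are purely illustrative assumptions.

```python
def select_substream(target_fps, target_width, target_bitrate_kbps,
                     full_fps=30, full_width=352,
                     temporal_levels=4, spatial_levels=3):
    """Decide how many temporal/spatial levels to decode and where to truncate the embedded stream."""
    t_kept, fps = temporal_levels, full_fps
    while fps > target_fps and t_kept > 1:      # dropping one temporal level halves the frame-rate
        fps /= 2
        t_kept -= 1
    s_kept, width = spatial_levels, full_width
    while width > target_width and s_kept > 1:  # dropping one spatial level halves the resolution
        width //= 2
        s_kept -= 1
    return {"temporal_levels": t_kept, "fps": fps,
            "spatial_levels": s_kept, "width": width,
            "truncate_at_kbps": target_bitrate_kbps}  # embedded layers can stop at any bit-rate

print(select_substream(target_fps=15, target_width=176, target_bitrate_kbps=750))
```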
3. SELECTED CONFIGURATION AND COMPLEXITY METRIC
Since multimedia applications such as video coding are clearly data-dominated, the main complexity indicator used in this paper is the memory-access frequency, i.e. the number of memory accesses the application has to perform each second in order to operate in real time. A thorough study of the typical size of the necessary memory buffers for scalable MCTF-based codecs (see e.g. [9]) is deferred to a later stage, in which the possible exploitation of hierarchical (multi-level) memory architectures will be investigated. To evaluate the complexity of the decoder in the scalable framework, we select a single set of coding features and vary the scalability parameters within the given ranges. Thereafter, we use the ATOMIUM Analyzer tool to measure, for each setting, the access rate required to achieve real-time decoding. Experiments are carried out using three sequences, "Bus", "Canoa" and "Football", which have the same spatial resolution (CIF) and frame rate (30 fps). These sequences are considered to typically contain medium- to complex-motion scenes. Given the original frame-rate and resolution, we limit our analysis to video reconstruction at full or half frame-rate, and at full or half resolution.
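As a back-of-the-envelope illustration of this metric (all figures below are hypothetical, not measured values from our profiling), real-time operation requires that the accesses per decoded frame times the frame-rate stay within what the target memory subsystem can sustain:

```python
accesses_per_frame = 4.0e6   # hypothetical profiler output for one decoded frame
frame_rate = 30              # frames/second required for real-time playback
budget = 1.5e8               # hypothetical accesses/second the target memory system sustains

required = accesses_per_frame * frame_rate
print(f"required access rate: {required:.2e}/s, real-time feasible: {required <= budget}")
```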
3.1. Temporal filtering
With respect to the general scheme of Figure 2, we disable the U operator; hence, at any temporal level i, the L-frames are simply the even-indexed frames of the level below (A_t, the input sequence, being level 0). This choice is motivated by the presence of visual artifacts in the updated L-frames. Thus, to preserve visual quality when a lower frame-rate is targeted, we choose to sacrifice the performance gain provided by the update step. On the other hand, this choice yields a reduction in the complexity of the decoder (the MC stages are halved) and, more importantly, allows for parallel reconstruction of different temporal levels. In fact, since the L-frames at any level j with j>0 are unmodified frames of the previous level, once a frame has been compensated and corrected it can be immediately used as a reference in the reconstruction of any upper temporal level. In all experiments, we use 4 temporal levels, and the sequence is decoded at 30 fps, or at 15 fps when the data from temporal level 1 is not received or processed. No attempt is made to replace the dropped frames using temporal interpolation.
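The consequence for the decoder can be sketched as follows, reusing the hypothetical mc helper and motion fields of the earlier MCTF sketch (again an illustration, not the codec's actual implementation): with the update step disabled, the inverse lifting reduces to re-adding the prediction to each H-frame, and dropping the finest temporal level simply halves the output frame-rate.

```python
def inverse_mctf_level(L, H, mvs_fwd):
    """Invert one temporal level when the update step is disabled (prediction-only lifting)."""
    frames = []
    for t, (l, h) in enumerate(zip(L, H)):
        even = l                          # U disabled: the L-frame *is* the even input frame
        odd = h + mc(even, mvs_fwd[t])    # undo the prediction step
        frames.extend([even, odd])
    return frames

def decode(L_coarsest, H_by_level, mvs_by_level, keep_levels):
    """H_by_level[0] is the coarsest temporal level; decoding fewer levels lowers the frame-rate."""
    frames = L_coarsest
    for lvl in range(keep_levels):        # e.g. keep 3 of 4 levels -> 15 fps instead of 30 fps
        frames = inverse_mctf_level(frames, H_by_level[lvl], mvs_by_level[lvl])
    return frames
```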
3.2. Motion Compensation
Motion estimation (ME) – performed only at encoding
time – and MC employ block-based techniques with
fractional pixel accuracy (1/4 pel in our tests), whose
efficiency is enhanced using multi-hypothesis and
bidirectional prediction with multiple references [10].
Additionally, each macroblock (MB) can be split into 4
children blocks predicted independently. The splitting can
be iteratively refined, but we restrict the experiment to
MBs of 16x16 or 8x8 pixels. When the resolution is
halved, the decoder reconstructs both the error-frames and
the reference-frames at half resolution, and MC occurs
using properly scaled motion vectors, yielding approximately a four-fold reduction in memory accesses. This reduction needs to be evaluated experimentally, since samples predicted at full resolution without interpolation (MBs with integer displacement) may require interpolation when the displacement is halved, thus causing some additional operations. As in the frame-rate case, no attempt is made to interpolate (upsample) frames decoded at lower resolutions. In fact, due to the use of orthogonal or bi-orthogonal wavelet filters, the information that is absent when a DWT level is not decoded cannot be recovered via interpolation.
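The interpolation overhead mentioned above can be illustrated with a small, purely synthetic experiment (random quarter-pel vectors, not data from the tested sequences): halving the displacements turns some integer-pel blocks into fractional-pel ones that now require interpolation.

```python
import numpy as np

def needs_interpolation(mv_quarter_pel):
    """True for blocks whose displacement is not an integer number of pixels."""
    return (mv_quarter_pel % 4 != 0).any(axis=-1)

rng = np.random.default_rng(0)
mvs = rng.integers(-32, 33, size=(18, 22, 2))  # quarter-pel MVs, one per 16x16 MB of a CIF frame

at_full = needs_interpolation(mvs)
at_half = needs_interpolation(mvs // 2)        # halved displacements for half-resolution MC
print(f"blocks needing interpolation: {at_full.mean():.0%} (full res) vs {at_half.mean():.0%} (half res)")
```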
3.3. Motion vector coding
The MV coding engine used for these experiments does
not support quality scalability; savings in accesses therefore occur only when a temporal level is dropped. We did not
profile this component.
3.4. Wavelet Transform
The literature contains a large body of research on 2-D DWT implementations (e.g. [11]). The spatial wavelet engine is not the focus of our profiling experiments and is not discussed in the next section. However, this codec component also benefits from the scalable approach, as its memory cost scales with the selected temporal and spatial settings.
3.5. Texture Coding
The wavelet coefficients of each resolution level are
compressed in an embedded manner using the QT-L
algorithm of [8]. This component is our main focus in the
profiling stage, because its behavior is affected by each
operational parameter (quality, resolution or frame-rate),
as various spatial and temporal levels contribute
differently to the compressed stream, hence to the amount
of operations the decoder performs at any target bit-rate.
4. MEMORY PROFILING RESULTS
Figure 4 (a) reports the access rate caused by the QT-L module alone when decoding three different sequences at the original frame-rate and resolution for a wide set of bit-rates. It is noticeable that the access rate depends approximately linearly on the target bit-rate, and that this behavior is not strongly content-dependent. A similar linear behavior is observed when decoding at lower resolutions and/or frame-rates, as shown in Figure 4 (b) for the "Bus" sequence.
From the access-rate perspective, the option providing the
largest gain is resolution scaling: accesses performed by
QT-L are significantly reduced, especially at high bit-
rates, while those caused by the MC should diminish,
independently of the bit-rate, to 25% of the amount needed
at full resolution. With respect to the latter, the experiments report a figure of 33.2%, confirming the expected overhead due to MV scaling and additional interpolations.
These results show that the access rate can be decreased in
a fine grain manner by varying the target bit-rate until real-
time decoding is achieved. To avoid reducing excessively
the visual quality, the user can switch to a less demanding
configuration, decoding lower-resolution versions of the
input video, or fewer frames, as shown in Figure 4 (c).
Thus, when one resolution or temporal level (or both) is not processed, the decoder switches to an operational point which lies on a curve positioned below the one corresponding to full-resolution, full-frame-rate decoding.
These results also show that the relationship between the
access rate and target bit-rate for decoding at different
resolutions and frame-rates can be “learned” using
appropriate training on large datasets. In this way, the
decoder can estimate the optimum bit-rate, given a
resolution and frame-rate. Conversely, for bandwidth-
limited applications, the decoder can estimate the optimum
operational settings in terms of resolution and frame-rate,
for given access rate and channel-bandwidth.
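A minimal sketch of this idea, with made-up profiling points (the numbers are illustrative, not measured data): fit a linear access-rate-versus-bit-rate model for one (resolution, frame-rate) configuration and invert it to find the largest bit-rate that fits a given access-rate budget.

```python
import numpy as np

bitrate_kbps = np.array([250, 500, 1000, 1500, 2000, 2500, 3000])
access_rate = np.array([0.4, 0.7, 1.3, 1.9, 2.5, 3.1, 3.7]) * 1e8  # hypothetical profile

a, b = np.polyfit(bitrate_kbps, access_rate, deg=1)  # least-squares linear fit

budget = 2.0e8                                       # accesses/sec the device can sustain
max_bitrate = (budget - b) / a
print(f"estimated maximum decodable bit-rate: {max_bitrate:.0f} kbps")
```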
Figure 4. Access rate (10^8 accesses/sec) versus bit-rate for: (a) the QT-L module for three different sequences decoded at full frame-rate (30 fps) and full resolution (352x288); (b) the QT-L module for the "Bus" sequence decoded at full/half frame-rate (FF/HF) and full/half resolution (FR/HR); (c) both the QT-L and MC components on the "Bus" sequence.
5. CONCLUSIONS

This paper proposes a preliminary analysis of the memory-access behavior of a fully scalable video decoder and investigates the impact of varying the operational settings on the memory-access rate. It is shown that, by choosing the quality, frame-rate and/or resolution of the reconstructed sequence, each decoder can meet its hardware limitations without requiring transcoding, and hence without affecting the encoding process of the media provider.
REFERENCES

[1] S. Saponara, C. Blanch, K. Denolf, and J. Bormans, "The JVT Advanced Video Coding Standard: Complexity and Performance Analysis on a Tool-by-Tool Basis," IEEE Workshop on Packet Video (PV'03), Nantes, France, April 2003.

[2] K. Denolf, P. Vos, J. Bormans, and I. Bolsens, "Cost-efficient C-Level Design of an MPEG-4 Video Decoder," Workshop on Power and Timing Modeling, Optimization and Simulation, Goettingen, Germany, Sept. 2000.

[3] I. Andreopoulos, J. Barbarien, F. Verdicchio, A. Munteanu, M. van der Schaar, J. Cornelis, and P. Schelkens, "Response to Call for Evidence on Scalable Video Coding," ISO/IEC JTC1/SC29/WG11, M9911, Trondheim, Norway, July 2003.

[4] "Call for Evidence on Scalable Video Coding Advances," ISO/IEC JTC1/SC29/WG11 (MPEG), MPEG Report W5559, Pattaya, Thailand, March 2003.

[5] J.-R. Ohm, "Three-dimensional subband coding with motion compensation," IEEE Trans. Image Processing, vol. 3, no. 5, pp. 559-571, Sept. 1994.

[6] B. Girod, "The efficiency of motion-compensating prediction for hybrid coding of video sequences," IEEE J. Select. Areas Commun., vol. SAC-5, pp. 1140-1154, Aug. 1987.

[7] M. Flierl and B. Girod, "Video Coding with Motion-Compensated Lifted Wavelet Transforms," Signal Processing: Image Communication, Special Issue on Subband/Wavelet Video Coding, submitted.

[8] P. Schelkens, A. Munteanu, J. Barbarien, M. Galca, X. Giro-Nieto, and J. Cornelis, "Wavelet coding of volumetric medical datasets," IEEE Trans. Medical Imaging, vol. 22, no. 3, pp. 441-458, March 2003.

[9] H. Devos, H. Eeckhaut, M. Christiaens, F. Verdicchio, D. Stroobandt, and P. Schelkens, "Performance requirements for reconfigurable hardware for a scalable wavelet video decoder," Proc. IEEE ProRISC 2003, Veldhoven, The Netherlands, pp. 56-63, Nov. 2003.

[10] M. Flierl, T. Wiegand, and B. Girod, "Rate-constrained multihypothesis prediction for motion-compensated video compression," IEEE Trans. Circuits and Systems for Video Technology, vol. 12, no. 11, pp. 957-969, Nov. 2002.

[11] Y. Andreopoulos, P. Schelkens, G. Lafruit, K. Masselos, and J. Cornelis, "High-level cache modeling for 2-D discrete wavelet transform implementations," Journal of VLSI Signal Processing Systems, vol. 34, no. 3, pp. 209-226, July 2003.