AN EFFICIENT FINE-GRAIN SCALABLE COMPRESSION SCHEME FOR SPARSE DATA

Stefan Strahl∗ and Alfred Mertins

Signal Processing Group

Department of Physics, University of Oldenburg

26111 Oldenburg, Germany

alfred.mertins@uni-oldenburg.de

stefan.strahl@mail.uni-oldenburg.de

ABSTRACT

A fine-grain scalable and efficient compression scheme for sparse data based on adaptive significance-trees is presented. Common approaches for 2-D image compression like EZW (embedded zerotree wavelet) and SPIHT (set partitioning in hierarchical trees) use a fixed significance-tree that captures the inter- and intraband correlations of wavelet coefficients well. For most 1-D signals like audio, such rigid coefficient correlations are not present. We address this problem by dynamically selecting an optimal significance-tree for the actual data frame from a given set of possible trees. Experimental results on sparse representations of audio signals are given, showing that this coding scheme outperforms single-type tree coding schemes and performs comparably to the MPEG AAC coder while additionally achieving fine-grain scalability.

1. INTRODUCTION

Recent advances in sparse signal representation ([1, 2, 3]) have increased the interest in applying these methods to audio data ([4, 5]) and have led to the demand for an efficient compression scheme for sparse audio representations. Moreover, the growth of heterogeneous networks like the Internet has introduced problems such as bitrate fluctuation, different target channel capacities, and storage costs for multi-bitrate files. Storing the data in an embedded manner using significance-trees can address these issues in a generic manner.

Bitplane coding and significance-trees have been successfully applied to image coding ([6], [7]). Such coding schemes capture the structure of the wavelet-based image representation, making very efficient sorting passes and a low number of sorting bits possible. Such natural rigid correlations cannot be found in audio signal representations such as the MDCT, necessitating the derivation of optimal significance-trees in a data-dependent manner.

Section 2 discusses how to generate significance trees that capture the varying spectral distribution of audio data, and describes the principle of our progressive compression scheme based on these trees, referred to as significance tree coding (STC). In Section 3, we present experimental results on sparse audio representations, including subjective listening tests.

∗This work was partly funded by the DFG through the International Graduate School for Neurosensory Science and Systems at the University of Oldenburg.

2. BASIC CONCEPTS

2.1. Significance Trees

Significance-tree coding algorithms like EZW [6] or SPIHT [7] exploit the fact that it can be beneficial, especially for sparse data, to describe significant coefficients of a bitplane via their position and value information instead of transmitting all values one by one. These spatial orientation trees can be represented mathematically using parent-children coefficient coordinate relationships. Fig. 1a shows the case of image compression, where the offspring $O(i, j)$ of the wavelet parent coefficient at position $(i, j)$, except for the highest and lowest pyramid level, has been defined as

$O(i, j) = \{(2i, 2j), (2i, 2j+1), (2i+1, 2j), (2i+1, 2j+1)\}.$

Because the 2-dimensional wavelet transform has a typical inter- and intra-band coefficient correlation [8], this rigid tree structure can capture the correlation with reasonable computational complexity, giving an efficient compression scheme.

Figure 1: Parent-offspring dependencies in SPIHT with different styles. (a) 2-D tree. (b) 1-D tree over the coefficient index $i$, following the offspring rule $O(i) = iN + \{0, \ldots, N-1\}$.

For 1-dimensional signals like audio data, the problem of selecting the optimal tree structures remains unsolved despite considerable efforts. Most existing algorithms use a single type of tree, as shown in Fig. 1b, with the fixed parent-children relationship $O(i) = iN + \{0, 1, \ldots, N-1\}$ for different positive integers $N$. For the MDCT, $N = 4$ was adopted in [9, 10, 11, 12], and the wavelet packet transform was encoded using $N = 2$ in [13, 14]. This type of tree is referred to in the following as a SPIHT-style significance tree.
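The two offspring rules of this section can be written down directly. The following is a minimal sketch, not the authors' implementation; the function names and the handling of index 0 and of the frame boundary are assumptions made for illustration.

```python
def offspring_2d(i, j):
    """Offspring O(i, j) of a wavelet parent coefficient at position (i, j)."""
    return [(2 * i, 2 * j), (2 * i, 2 * j + 1),
            (2 * i + 1, 2 * j), (2 * i + 1, 2 * j + 1)]


def offspring_1d(i, n, m):
    """Offspring O(i) = i*N + {0, ..., N-1} for tree order n and frame length m.

    Assumptions of this sketch: children beyond the frame are dropped, and
    index 0 is not listed as its own child (the rule maps 0 to itself).
    """
    return [c for c in range(i * n, i * n + n) if c != i and c < m]


# Example with the MDCT setting N = 4 and a frame of 1024 coefficients.
print(offspring_1d(1, 4, 1024))   # children of coefficient 1
print(offspring_2d(1, 2))         # children of the 2-D parent (1, 2)
```

Coefficients in the upper part of the frame have no offspring under the 1-D rule, so they form the leaves of the tree.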

2.2. Bitplane coding using Significance-Trees

The set of $M$ transform coefficients to be encoded for an audio frame is denoted by the vector $X = (X_1, X_2, \ldots, X_M)$, and the corresponding coordinate set is denoted by $\mathcal{M} = (1, 2, \ldots, M)$. The algorithm starts with the most significant bitplane $n_{\max}$, which can be easily computed as

$n_{\max} = \lfloor \log_2 ( \max_{i \in \mathcal{M}} \{ |X_i| \} ) \rfloor.$

A coefficient $X_i$ can then be expressed as

$X_i = s \sum_{k=n_{\min}}^{n_{\max}} b_{i,k} 2^k$

with $b_{i,k} \in \{0, 1\}$ and $s \in \{\pm 1\}$ being the sign. If $X_i$ is an integer value, then $n_{\min} = 0$. To encode real-valued coefficients, $n_{\min}$ can be negative.
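As an illustration of this sign-magnitude bitplane representation, the sketch below computes the most significant bitplane and the bits $b_{i,k}$ of a coefficient, and rebuilds the value from them. The function names are hypothetical, and values below bitplane $n_{\min}$ are simply truncated, which is an assumption of this sketch.

```python
import math


def max_bitplane(coeffs):
    """n_max = floor(log2(max |X_i|)); needs at least one nonzero coefficient."""
    peak = max(abs(x) for x in coeffs)
    return math.floor(math.log2(peak))


def bitplane_bits(x, n_max, n_min=0):
    """Return (sign, bits), where bits[k - n_min] = b_{i,k} for k = n_min..n_max."""
    sign = -1 if x < 0 else 1
    scaled = int(abs(x) / 2 ** n_min)          # truncate below bitplane n_min
    bits = [(scaled >> (k - n_min)) & 1 for k in range(n_min, n_max + 1)]
    return sign, bits


def reconstruct(sign, bits, n_min=0):
    """Evaluate X_i = s * sum_k b_{i,k} * 2**k from the stored bits."""
    return sign * sum(b * 2 ** (k + n_min) for k, b in enumerate(bits))


frame = [5, -12, 3, 0]
n_max = max_bitplane(frame)                    # bitplane of the largest magnitude
sign, bits = bitplane_bits(-12, n_max)
print(n_max, sign, bits, reconstruct(sign, bits))
```

For real-valued coefficients the same functions can be called with a negative `n_min`, for example `bitplane_bits(0.75, -1, n_min=-2)`.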

During the bitplane-coding process, all bitplanes $n \le n_{\max}$ are processed iteratively (i.e., the bits $b_{i,n}$, $i = 1, 2, \ldots, M$ are transmitted) in so-called sorting and refinement passes [7]. In a sorting pass, all coefficients that become significant with respect to the current bitplane $n$ are found by employing tests on the coefficient absolute values, and these test results are written to the output bitstream. For coefficients that are found to be significant, a sign bit is also transmitted. During the refinement passes, the lower bitplanes of already identified significant coefficients are transmitted.

The sequence of the coefficient sorting is defined by the significance-tree, so that all elements in the coefficient set $X$ are uniquely mapped into nodes in the trees. Each significance tree $T$ is composed of several nodes that link coefficient coordinates $i$ (position information) of scalars $X_i$ in a hierarchical manner. A tree $T$ is said to be significant with respect to bitplane $n$ if any scalar inside the tree is significant, that is, if the magnitude of at least one coefficient in the set is greater than or equal to $2^n$. The pseudocode of the sorting pass is as follows:

TreeSignificance (current tree $T$, current threshold $2^n$)
• If $T$ is insignificant with respect to $2^n$, emit '0' and return;
• If $T$ is significant with respect to $2^n$, emit '1';
• If root node $N(T)$ is significant with respect to $2^n$, emit '1', otherwise emit '0';
• Call TreeSignificance() for each subtree with root node as offspring of $N(T)$ with threshold $2^n$;
• Return;
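The pseudocode above can be transcribed almost literally into Python. This is only a sketch under assumptions: the tree is stored as a hypothetical `children` dictionary mapping a coefficient index to its offspring indices, and the sign bits and the bookkeeping of already significant coefficients are left out.

```python
def subtree_indices(children, root):
    """All coefficient indices in the tree rooted at `root` (root included)."""
    stack, found = [root], []
    while stack:
        node = stack.pop()
        found.append(node)
        stack.extend(children.get(node, []))
    return found


def tree_significance(coeffs, children, root, n, out):
    """Append the sorting-pass bits for the tree at `root`, threshold 2**n."""
    threshold = 2 ** n
    if all(abs(coeffs[i]) < threshold for i in subtree_indices(children, root)):
        out.append(0)                      # whole tree insignificant: one bit
        return
    out.append(1)                          # tree is significant
    out.append(1 if abs(coeffs[root]) >= threshold else 0)   # root node test
    for child in children.get(root, []):   # recurse into each offspring subtree
        tree_significance(coeffs, children, child, n, out)


# Small binary tree over seven coefficients (illustrative values).
coeffs = [9, 1, 0, 0, 5, 0, 0]
children = {0: [1, 2], 1: [3, 4], 2: [5, 6]}
bits = []
tree_significance(coeffs, children, 0, 2, bits)
print(bits)
```

The example shows why large insignificant subtrees are cheap: the subtree rooted at coefficient 2 costs a single '0' bit at this threshold.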

2.3. Proposed Adaptive Significance-Tree Selection

The SPIHT-style significance trees proposed for one-dimensional data so far are rather arbitrary. They are simply derived by projecting the known 2-D trees into the vector notation of 1-D structures. To establish better tree structures and, in the case of audio data, to capture the dynamically varying spectral behavior, we predefine a set of significance-trees and dynamically select the locally optimal ones for each data frame.

For tree construction, in general, it is important to recall that trees should be built in such a way that the coefficients that are most likely to be large in magnitude are located close to the roots of the trees, whereas the small coefficients should be located at the outer leaves. The larger the (sub)trees that contain small coefficients are, the more efficient the coding will be. In contrast to [15], we used non-complete significance trees by placing remaining nodes at the last tree level.

In this paper we design the set of $\mu$ possible significance-trees by constructing these trees out of $m$ subtrees with different roots and different sorting orders. The coding cost to encode the tree selection information is $\log_2(\mu)$ bits per frame. We considered $m = 8$ with equally sized subtrees and $m = 10$ with logarithmically sized subtrees. See Fig. 2 for an illustration of the trees. Each subtree was selected from four different types of trees (ascending, descending, concave, or convex), yielding $\mu = 4^8 = 65{,}536$ possible trees (tree selection needs 16 bits per frame) for the equally sized subtrees and $\mu = 4^{10} = 1{,}048{,}576$ (bit cost of 20 bits per frame) for the logarithmically sized subtrees.
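The number of candidate trees and the side-information cost follow directly from the number of subtrees and the four tree types. A short check, with a hypothetical helper name:

```python
import math


def tree_selection_cost(m, types_per_subtree=4):
    """Return (mu, bits): number of candidate trees and bits to signal one."""
    mu = types_per_subtree ** m            # independent choice per subtree
    bits = math.ceil(math.log2(mu))        # log2(mu) bits per frame
    return mu, bits


print(tree_selection_cost(8))    # equally sized subtrees
print(tree_selection_cost(10))   # logarithmically sized subtrees
```

At a frame budget of 2048 bits, as used in Section 3.1, this side information amounts to less than 1% of the frame.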

Figure 2: Examples of possible significance-trees with tree order $N = 2$ and frame length $M = 64$. (a) $m = 4$ (equally sized trees). (b) $m = 6$ (log-sized trees).

For a given data frame to be encoded, we select the tree that allows us to encode the largest number of high-magnitude coefficients within the first $\nu$ tree levels. In the experiments, $\nu$ was set to 3.
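One way to read this selection rule is sketched below. The paper does not spell out how "high-magnitude" is decided, so this sketch assumes it means the `top_k` largest magnitudes in the frame; `top_k`, the function name, and the representation of a tree as a list of levels (each a list of coefficient indices) are all hypothetical.

```python
def select_tree(coeffs, candidate_trees, nu=3, top_k=8):
    """Return the index of the candidate tree that places the most
    high-magnitude coefficients within its first `nu` levels."""
    ranked = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    large = set(ranked[:top_k])            # assumed notion of "high magnitude"

    def score(levels):
        early = {i for level in levels[:nu] for i in level}
        return len(large & early)

    return max(range(len(candidate_trees)), key=lambda t: score(candidate_trees[t]))


# Two toy candidates over eight coefficients: ascending and descending order.
ascending = [[0], [1, 2], [3, 4], [5, 6, 7]]
descending = [[7], [6, 5], [4, 3], [2, 1, 0]]
low_heavy = [0, 9, 8, 0, 0, 0, 7, 1]
high_heavy = [1, 7, 0, 0, 0, 8, 9, 0]
print(select_tree(low_heavy, [ascending, descending], nu=2, top_k=3))
print(select_tree(high_heavy, [ascending, descending], nu=2, top_k=3))
```

In the full scheme the same choice would be made independently for each of the $m$ subtrees, so the search does not have to enumerate all $\mu$ combinations.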

2.4. STC Algorithm

Let us assume that a set of optimal local significance trees for transmitting a coefficient set $X$ has been found, for example, by testing the efficiencies of various possible trees as mentioned above. The compression scheme then operates as follows. Iteratively, all bitplanes $n = n_{\max}, n_{\max}-1, n_{\max}-2, \ldots, n_{\min}$ are processed in sorting and refinement passes. In a sorting pass, all coefficients that become significant for the first time (i.e., their magnitude reaches or exceeds the current threshold $2^n$) are logged to a list of significant coefficients (LSC) and their signs are encoded. This means that, at any point in the encoding process, the LSC contains the coordinates of all coefficients that have been found to reach or exceed the current test threshold of $2^n$. When all significant coefficients with respect to the current threshold $2^n$ have been identified and their coordinates have been moved to the LSC, the refinement pass stores the bitplane information for the significant coefficients by processing the LSC, except for the coefficients that were included in the last sorting pass. The overall algorithm is as follows.

STC Algorithm:
1. Tree Generation: select one of the $\mu$ possible significance-trees, containing $m$ local subtrees;
2. Initialization: output $n = \lfloor \log_2 ( \max_{i \in \mathcal{M}} \{ |X_i| \} ) \rfloor$; output the selected significance-tree; sequentially set each LSC (list of significant coefficients) to an empty list.
3. Sorting Pass: sequentially call TreeSignificance, move all significant coefficients into the corresponding LSC, and output their signs.
4. Refinement Pass: sequentially, for each coefficient in the corresponding LSCs, except those included in the last sorting pass, output the $n$th most significant bit of $X_i$.
5. Quantization-Step Update: decrement $n$ by 1 and go to Step 3.

The process is repeated until the desired bit budget is achieved, or, in the case of lossless compression, all bits in all coefficients have been encoded.
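The interplay of sorting pass, refinement pass, LSC and bit budget can be illustrated with a deliberately simplified encoder and decoder. This is a sketch, not the STC coder: a flat per-coefficient significance test is swapped in for the tree-based TreeSignificance test, a single LSC is used, the coefficients are assumed to be integers ($n_{\min} = 0$) with at least one nonzero value, and all names are hypothetical.

```python
import math


def encode_frame(coeffs, bit_budget=None):
    """Return (n_max, bits) for one frame of integer coefficients."""
    n_max = math.floor(math.log2(max(abs(x) for x in coeffs)))
    bits, lsc = [], []                     # LSC: list of significant coefficients
    for n in range(n_max, -1, -1):
        new = []
        for i, x in enumerate(coeffs):     # sorting pass (flat test, no tree)
            if i in lsc:
                continue
            if abs(x) >= 2 ** n:
                bits += [1, 1 if x < 0 else 0]   # significance bit, sign bit
                new.append(i)
            else:
                bits.append(0)
        for i in lsc:                      # refinement pass: earlier planes only
            bits.append((abs(coeffs[i]) >> n) & 1)
        lsc += new
    if bit_budget is not None:
        bits = bits[:bit_budget]           # embedded stream: cut anywhere
    return n_max, bits


def decode_frame(n_max, bits, m):
    """Rebuild m coefficients from a (possibly truncated) bit list."""
    values, signs, lsc = [0] * m, [1] * m, []
    stream = iter(bits)
    try:
        for n in range(n_max, -1, -1):
            new = []
            for i in range(m):
                if i in lsc:
                    continue
                if next(stream):
                    signs[i] = -1 if next(stream) else 1
                    values[i] = 2 ** n
                    new.append(i)
            for i in lsc:
                values[i] += next(stream) << n
            lsc += new
    except StopIteration:
        pass                               # stream ended: keep the approximation
    return [s * v for s, v in zip(signs, values)]


frame = [5, -12, 0, 3]
n_max, bits = encode_frame(frame)
print(decode_frame(n_max, bits, len(frame)))       # lossless when all bits arrive
print(decode_frame(n_max, bits[:5], len(frame)))   # coarse version from 5 bits
```

Because the most significant information is written first, any prefix of the bit list decodes to a coarser approximation of the frame, which is the property that gives the scheme its bit-level granularity.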

3. EXPERIMENTAL RESULTS

3.1. Comparison of compression schemes on sparse audio data

In this section, we compare the performance of run-length coding, Huffman coding, arithmetic coding, and adaptively selected and fixed significance trees on sparse audio representations. The parameters for the Huffman coding were set to 8 levels of allowed splitting. For our STC algorithm the number of possible trees was set to $\mu = 65{,}536$ (equally sized) and $\mu = 1{,}048{,}576$ (logarithmically sized), respectively, as described in Section 2.3.

The audio signal was the cha2.wav file [16] (mono, 16 bits, 48 kHz), and the bitrate was set to $R = 96$ kbps. To obtain a sparse representation of the audio signal, an MDCT with a frame size of $M = 1024$ was used. The MDCT of audio signals already results in a sparse signal representation ([17]), but to increase the sparseness we also used the Basis Pursuit algorithm ([1]) with an overcomplete MDCT basis, where the subband signals were oversampled by a factor of two. The frame bit budget $R_f$ was computed as $R_f = \lfloor R \cdot M / F_s \rfloor$, where $F_s$ is the sampling rate in Hz, yielding $R_f = 2048$ bits per frame for 96 kbps. For the compression schemes to achieve the desired bitrate, a linear quantization of the MDCT coefficients was applied. The tree order of the significance trees was set to $N = 4$. As a performance measure, the frame-wise signal-to-noise ratio (SNR) was used, computed as the ratio of a frame's energy to the energy of the reconstruction error in that frame. For the two scenarios, we obtained the results listed in Table 1.

Table 1: Average frame-wise SNRs (segSNR) in dB for the cha2.wav signal coded at 96 kbps, using different algorithms.

Algorithm    MDCT     Basis Pursuit (overMDCT)
RLE           4.94    13.74
Huffman      27.01    25.12
Arith        28.91    26.34
SPIHT        32.99    30.19
STC-lin      34.27    31.47
STC-log      34.56    31.51

From Table 1 it can be seen that an adaptive significance-tree selection benefits from the varying spectral distribution of audio data compared to a fixed significance tree. The representation using critical sampling achieves better results than the oversampled case, because in the latter twice as many coefficients need to be encoded.
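The frame bit budget and the frame-wise SNR used in this section are simple to compute. The sketch below uses hypothetical function names; the dB conversion is the usual $10 \log_{10}$ of the energy ratio, and a frame with zero reconstruction error is not handled.

```python
import math


def frame_bit_budget(bitrate_bps, frame_size, sample_rate_hz):
    """R_f = floor(R * M / F_s), in bits per frame."""
    return bitrate_bps * frame_size // sample_rate_hz


def frame_snr_db(original, reconstructed):
    """Frame energy over reconstruction-error energy, in dB.

    Assumes the reconstruction is not perfect (nonzero error energy).
    """
    signal = sum(x * x for x in original)
    noise = sum((x - y) ** 2 for x, y in zip(original, reconstructed))
    return 10.0 * math.log10(signal / noise)


print(frame_bit_budget(96000, 1024, 48000))   # setting of Section 3.1
print(round(frame_snr_db([3.0, 4.0], [3.0, 3.0]), 2))
```

The average of `frame_snr_db` over all frames gives the segmental SNR figure reported per algorithm.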

3.2. Combination with the MPEG AAC Standard

In this experiment, we use the state-of-the-art MPEG AAC compression scheme and combine it with our STC algorithm in order to achieve progressive coding. For this, we keep the AAC scheme unchanged up to the point where Huffman coding is employed, then apply the STC algorithm to carry out the compression of the quantizer indices. The quantizer indices form a sparse representation of the audio signal. In the experiment, the reference software of [18] was used.

The compression of quantizer indices can either be lossless or lossy, depending on the number of bits transmitted. On the decoder side, the received quantizer indices (either exact values or approximations, depending on the bitrate) are injected into the standard AAC decoder. All other side information is transmitted as produced by the AAC coder.

Note that the results for the AAC coder were produced by encoding the signal individually for each bitrate. For STC, the encoding was done once at 64 kbps, and then lower rates were realized by truncating the frame-wise embedded bitstream produced by the STC algorithm.
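Deriving a lower rate from the single 64 kbps encoding amounts to cutting each frame's embedded bitstream at the smaller budget. A minimal sketch, assuming each frame is held as a list of bits and ignoring the AAC side information; the function name, the frame size of 1024 and the 48 kHz sampling rate in the example are illustrative assumptions.

```python
def truncate_frames(frame_bitstreams, target_bps, frame_size, sample_rate_hz):
    """Cut every frame's embedded bitstream to the budget of the target rate."""
    budget = target_bps * frame_size // sample_rate_hz   # bits per frame
    return [bits[:budget] for bits in frame_bitstreams]


# Two dummy frames encoded at 64 kbps (1365 bits each), reduced to 32 kbps.
frames_64k = [[1] * 1365, [0] * 1365]
frames_32k = truncate_frames(frames_64k, 32000, 1024, 48000)
print([len(f) for f in frames_32k])
```

No re-encoding is involved, so any intermediate rate can be served from the same stored file.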

To measure the subjective quality, we carried out listening tests comparing the STC scheme with the MPEG-2 AAC standard and the MPEG-4 AAC-BSAC standard, which is currently the only standardized fine-grain progressive audio compression scheme. Twenty test subjects took part, for the scenario with eight equally sized subtrees per frame, using signals from the sound quality assessment material (SQAM) [19] and from the 1990 MPEG evaluation [16].

The measurement procedure was set up according to the ITU recommendation BS.1116-1 [20]. The quality ratings between one (very annoying) and five (indistinguishable from the original) were translated into the subjective difference grade, which is the difference between the rating for the encoded test item and the rating for the hidden reference, and ranges from zero (equal quality) down to $-4$ (the lowest grade). The results for three different test signals are depicted in Fig. 3. As one can see, the performance of STC is almost equal to that of AAC, and it is significantly better than that of BSAC.
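The conversion from ratings to difference grades is a plain subtraction per listener and item. The following sketch uses hypothetical names and made-up ratings purely to show the arithmetic; it does not reproduce any data from the listening test.

```python
def subjective_difference_grade(test_rating, reference_rating):
    """SDG = rating of the coded item minus rating of the hidden reference."""
    return test_rating - reference_rating


def mean_sdg(rating_pairs):
    """Average SDG over (test_rating, reference_rating) pairs."""
    grades = [subjective_difference_grade(t, r) for t, r in rating_pairs]
    return sum(grades) / len(grades)


# Illustrative ratings on the 1-to-5 scale (not measured values).
print(subjective_difference_grade(3.5, 5.0))
print(mean_sdg([(4.0, 5.0), (5.0, 5.0), (3.0, 5.0)]))
```

A grade slightly above zero is possible when a listener rates the hidden reference below the coded item.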

4. CONCLUSIONS

The problem of fine-grain scalable compression of sparse audio representations has been addressed in this study. While almost all existing algorithms adopt a single type of significance-tree for sorting significant coefficients and transmitting position information, we have proposed a novel adaptive significance-tree technique. Such a tree is generated dynamically to suit the varying spectral behavior from frame to frame. Based on the dynamic tree selection, a compression scheme has been proposed which provides both high compression quality and fine-grain bitrate scalability. Experimental results demonstrate the following properties: the method outperforms the existing SPIHT-like algorithms and yields quality competitive with the nonscalable AAC audio compression scheme, yet with fine scalability of one-bit granularity per frame.

Figure 3: Subjective difference grades for different codecs (AAC, STC, BSAC) at bitrates between 16 and 64 kbps for one mono channel. Horizontal axis: bitrate (16, 32, 48, 64 kbps); vertical axis: difference grade (from -4 to 1).

5. REFERENCES

[1] S. S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM Journal on Scientific Computing, vol. 20, no. 1, pp. 33–61, 1999. [Online]. Available: citeseer.ist.psu.edu/chen98atomic.html

[2] M. S. Lewicki and T. J. Sejnowski, “Learning overcomplete representations,” Neural Computation, vol. 12, no. 2, pp. 337–365, 2000. [Online]. Available: citeseer.ist.psu.edu/lewicki98learning.html

[3] J. Fuchs, “Sparsity and uniqueness for some specific underdetermined linear systems,” in Proc. IEEE ICASSP 2005, Philadelphia, USA, March 2005.

[4] M. Davies and L. Daudet, “Sparse audio representations using the MCLT,” in press, 2005.

[5] R. Gribonval, “Sparse decomposition of stereo signals with matching pursuit and application to blind separation of more than two sources from a stereo mixture,” in ICASSP02, Orlando, Florida, USA, May 2002.

[6] J. M. Shapiro, “Embedded image coding using zerotrees of wavelet coefficients,” IEEE Trans. on Signal Processing, vol. 41, no. 12, pp. 3445–3462, 1993.

[7] A. Said and W. A. Pearlman, “A new, fast and efficient image codec based on set partitioning in hierarchical trees,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 6, no. 3, pp. 243–250, 1996.

[8] Z. Liu and L. J. Karam, “Quantifying the intra and inter subband correlations in the zerotree-based wavelet image coders,” in Conf. Record of the 36th Asilomar Conf. on Signals, Systems and Computers, Sep. 2002, pp. 1730–1734.

[9] C. Dunn, “Efficient audio coding with fine-grain scalability,” in AES 111th Convention. NY, USA: preprint 5492, Sep. 2001.

[10] M. Raad, A. Mertins, and R. Burnett, “Audio coding based on the modulated lapped transform (MLT) and set partitioning in hierarchical trees,” in Proc. 6th World Multiconference on Systemics, Cybernetics and Informatics, Orlando, USA, Jul. 2002, pp. 303–306.

[11] M. Raad and A. Mertins, “From lossy to lossless audio coding using SPIHT,” in Proc. of the 5th Int. Conf. on Digital Audio Effects, Hamburg, Germany, Sep. 2002, pp. 245–250.

[12] M. Raad, A. Mertins, and R. Burnett, “Scalable to lossless audio compression based on perceptual set partitioning in hierarchical trees (PSPIHT),” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing, Apr. 2003, pp. V624–627.

[13] Z. Lu and W. A. Pearlman, “An efficient, low-complexity audio coder delivering multiple levels of quality for interactive applications,” in Proc. IEEE Signal Processing Society Workshop on Multimedia Signal Processing, Dec. 1998, pp. 529–534.

[14] Z. Lu and W. A. Pearlman, “High quality scalable stereo audio coding,” 1999. [Online]. Available: http://www.cipr.rpi.edu/ pearlman/papers/scal audio.ps.gz

[15] S. S. H. Zhou, A. Mertins, “An efficient, fine-grain scalable audio compression scheme,” in Proc. AES 118th Convention, Barcelona, Spain, May 2005.

[16] ISO/MPEG, “Audio test report. ISO/IEC/JTC 1/SC 2/WG 11 MPEG MPEG90/N0030,” International Organization for Standardization, 1990.

[17] M. Davies and N. Mitianoudis, “Simple mixture model for sparse overcomplete ICA,” in IEE Proc.-Vis. Image Signal Process., vol. 151, no. 1, February 2004.

[18] “MPEG-4 audio reference software.” [Online]. Available: http://www.iso.ch/iso/en/ittf/PubliclyAvailableStandards/ISO IEC 14496-5 2001 Software Reference/

[19] “Sound quality assessment material.” [Online]. Available: http://sound.media.mit.edu/mpeg4/audio/sqam/

[20] ITU-R Recommendation BS.1116-1, “Methods for the subjective assessment of small impairments in audio systems including multichannel sound systems,” International Telecommunication Union, Geneva, Dec. 1997.