Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008
DETECTION AND IDENTIFICATION OF SPARSE AUDIO TAMPERING USING
DISTRIBUTED SOURCE CODING AND COMPRESSIVE SENSING TECHNIQUES
Giorgio Prandi
Dipartimento di Elettronica e Informazione,
Politecnico di Milano, Italy
prandi@elet.polimi.it
Giuseppe Valenzise
Dipartimento di Elettronica e Informazione,
Politecnico di Milano, Italy
valenzise@elet.polimi.it
Marco Tagliasacchi
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
tagliasa@elet.polimi.it
Augusto Sarti
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
sarti@elet.polimi.it
ABSTRACT
In most practical applications, for the sake of information integrity,
it is useful not only to detect whether multimedia content has been
modified, but also to identify which kind of attack has been carried
out. In the case of audio streams, for example, it may be useful to
localize the tamper in the time and/or frequency domain. In this paper
we devise a hash-based tampering detection and localization system
exploiting compressive sensing principles. The multimedia content
provider produces a small hash signature using a limited number of
random projections of a time-frequency representation of the original
audio stream. At the content user side, the hash signature is used to
estimate the distortion between the original and the received stream
and, provided that the tamper is sufficiently sparse or sparsifiable in
some orthonormal basis expansion or redundant dictionary (e.g. DCT or
wavelet), to identify the time-frequency portion of the stream that has
been manipulated. In order to keep the hash length small, the algorithm
exploits distributed source coding techniques.
1. INTRODUCTION
The delivery of multimedia content in peer-to-peer networks might give
rise to different versions of the same multimedia object at different
nodes. In the case of audio files, some versions might differ from the
original because of processing due, for instance, to transcoding or
bitstream truncation. In other cases, malicious attacks might occur by
tampering with part of the audio stream and possibly affecting its
semantic content. In this paper we propose to add a small hash to the
audio stream to detect and identify audio tampering. At the content
user side, the information contained in the hash makes it possible to
estimate the distortion of the received audio stream with respect to
the original version, and to localize the tampering, if the attack is
sparse in one of the analyzed basis expansions.
In the past few years, hashes have been used for the protection of
multimedia content, especially in the field of image authentication.
In [1], the authors propose a system that performs image authentication
using distributed source codes. The hash consists of syndrome bits
computed from quantized random projections of the original image. To
perform authentication, a Slepian-Wolf decoder receives as input the
hash and the (possibly tampered) image, which serves as side
information. If decoding succeeds, the image is declared authentic. The
work has been extended in [2] to perform tampering localization, at the
cost of extra syndrome bits. The authors of [3] present an algorithm
that produces hashes robust to some legitimate image manipulations
(like cropping, scaling, rotation) and sensitive to illegal
manipulations (like image tampering).
Watermarking has been used to solve the problem of tampering
localization [4][5]. A fragile watermark is inserted into the image
when it is created, and extracted during the authentication phase.
Tampering can be localized by identifying the damage to the watermark.
Watermarking techniques have also been used to solve the problem of
content authentication and tampering localization in audio data. In [6]
two complementary watermarks are embedded in audio signals to achieve
audio protection and tampering localization. However, to the authors'
knowledge, in the literature there are no previous works addressing the
problem of identifying tampering in audio streams. In general,
watermarking-based schemes suffer from the following disadvantages:
1) watermarking authentication is not backward compatible with
previously encoded content (unmarked content cannot be authenticated
later by just retrieving the corresponding hash); 2) the original
content is distorted by the watermark; 3) the bitrate required to
compress multimedia content might increase due to the embedded
watermark. Conversely, content hashing embeds a signature of the
original content as part of the header information, or can provide a
hash separately from the content upon a user's request. In order to
limit the rate overhead, the size of the hash needs to be as small as
possible. At the same time, the goal of tampering localization calls
for increasing the hash size, in order to capture as much as possible
about the original multimedia object. In this paper we explicitly
target these conflicting requirements by proposing a hashing technique
based on compressive sensing principles. The key tenet is that, if the
tampering is sparse enough (or it can be sparsified in some orthonormal
basis or redundant dictionary), it can be identified by means of a
limited number of random projections of the original signal. In
addition, in order to keep the size of the hash as small as possible,
the hash information is encoded by exploiting distributed source coding
tools.
The rest of the paper is organized as follows. Section 2 provides
background information about compressive sensing and distributed
source coding; Section 3 gives a detailed description of
the system. The results of tampering estimation are depicted in
Section 4. Finally, Section 5 gives some concluding remarks.
2. BACKGROUND
2.1. Compressive sensing
In our system, compressive sensing principles are used to build the
hash signature of the audio stream. Compressive sensing makes it
possible to capture and represent signals at rates below the Nyquist
rate [7]. In fact, it has been proved that a signal can be
reconstructed from a limited number of non-adaptive linear random
projections that preserve the original structure of the signal, by
solving a specific optimization problem. The main constraint is that
the signal has to be sparse or, at least, compressible, i.e. it can be
represented in some basis expansion using only a few large-magnitude
coefficients. A more detailed discussion about compressive sensing can
be found in [7].
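The measurement model just described can be restated compactly. The symbols below anticipate the notation of Section 3, while the sparsity level $k$ and the bound on $m$ are assumptions added here: the bound is the standard result for Gaussian measurement matrices from the compressive sensing literature, quoted as background rather than as a claim of this paper.

```latex
\begin{align*}
  y &= A x, \qquad A \in \mathbb{R}^{m \times n}, \quad m < n, \\
  x &= \Phi \alpha, \qquad \|\alpha\|_0 \le k \ll n, \\
  m &= O\bigl(k \log (n/k)\bigr) \quad \text{projections typically suffice to recover } x.
\end{align*}
```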
2.2. Distributed source coding
As mentioned in the Introduction, our system uses a distributed source
coding technique to reconstruct the hash signature of the audio stream
at the content user side. Distributed source coding has been widely
applied to video coding [8] in order to move the computational
complexity from the encoder to the decoder side. According to the
distributed source coding principles stated by the Wyner-Ziv theorem,
it is possible to perform lossy encoding with side information at the
decoder. The side information represents a distorted version of the
source, which is made available at the decoder side only. In our
approach, the original information is the hash computed by the content
provider, and the side information consists of the hash signature
computed from the audio stream received at the user side (which may be
modified with respect to the original). By requesting syndrome bits
from the encoder, the decoder is able to correct the possibly distorted
side information and finally reconstruct the original hash signature.
The more the side information is distorted, the more syndrome bits are
needed to reconstruct the original hash; if the number of requested
bits exceeds some pre-specified threshold, we may consider the received
stream too distorted and completely unauthentic. Under normal
conditions, the number of requested syndrome bits is significantly
smaller than the number of bits of the original information, so the
hash reconstruction approach based on distributed source coding saves
bits with respect to the direct transmission of the original hash from
the content provider to the user.
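The idea of correcting side information with syndrome bits can be illustrated with a toy example. The sketch below swaps in a (7,4) Hamming code for the LDPC codes used in the paper, purely because it fits in a few lines; the bit patterns and the function names `syndrome` and `decode` are hypothetical. The encoder transmits 3 syndrome bits instead of the 7 source bits, and the decoder recovers the source as long as its side information differs in at most one position.

```python
# Toy syndrome coding with the (7,4) Hamming code (an assumption made for
# illustration; the paper uses LDPC codes on quantized projections).

# Parity-check matrix H: column i (1-based) is the binary expansion of i.
H = [
    [0, 0, 0, 1, 1, 1, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [1, 0, 1, 0, 1, 0, 1],
]

def syndrome(bits):
    """Return the 3-bit syndrome H * bits (mod 2)."""
    return [sum(h * b for h, b in zip(row, bits)) % 2 for row in H]

def decode(side_info, received_syndrome):
    """Recover the source word from side information and the syndrome.

    Works when side_info differs from the source in at most one position.
    """
    # Syndrome of the difference pattern (source XOR side_info).
    diff = [a ^ b for a, b in zip(syndrome(side_info), received_syndrome)]
    position = diff[0] * 4 + diff[1] * 2 + diff[2]  # 0 means "no difference"
    corrected = list(side_info)
    if position:
        corrected[position - 1] ^= 1
    return corrected

source = [1, 0, 1, 1, 0, 0, 1]   # original hash bits (encoder side)
side_info = list(source)
side_info[4] ^= 1                # user-side version with one flipped bit

sent = syndrome(source)          # only 3 bits are transmitted, not 7
recovered = decode(side_info, sent)
```

When the side information is too distorted (here, more than one flipped bit), decoding returns a wrong word, which mirrors the failure case in which the received stream is declared unauthentic.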
3. DESCRIPTION OF THE SYSTEM
The proposed tampering detection and localization scheme is depicted
in Figure 1. The producer of the original audio stream builds a small
hash signature of the audio signal $X \in \mathbb{R}^N$, where $N$ is
the total number of audio samples of the signal. The audio content can
be distributed over a network through untrusted nodes where a
modification of the signal may occur. The user receives the audio
stream $\tilde{X}$. In order to perform content authentication, the
user sends a request for the hash signature to the content provider. By
exploiting the hash, the user can estimate the distortion of the
received content $\tilde{X}$ with respect to the original $X$.
Furthermore, if the tampering is sparse in some basis expansion, the
system produces a tamper estimation $\hat{e}$ which identifies the
attack in the time-frequency domain. At the content producer side, the
encoder generates the hash signature $H(X,S)$ as follows:
1. Frame-based subband log-energy extraction: the original
single-channel audio stream $X$ is partitioned into non-overlapping
frames of size $F$. The power spectrum of each frame is subdivided
into $U$ Mel frequency subbands, and for each subband the related
spectral log-energy is extracted. Let $h_{f,u}$ denote the energy
value for the $u$-th band at frame $f$. The corresponding log-energy
value is computed as follows:
$$x_{f,u} = \log(1 + h_{f,u}) \qquad (1)$$
The log-energy values are stored in a vector $x \in \mathbb{R}^n$,
where $n = UN/F$ is the number of log-energy values extracted from the
audio stream. Note that using Mel frequency subbands and log-energy
values gives tamper detection and identification an immediate
perceptual meaning.
2. Random projections: A number of linear random projections
$y \in \mathbb{R}^m$, $m < n$, is produced as $y = Ax$. The entries of
the matrix $A \in \mathbb{R}^{m \times n}$ are sampled from a Gaussian
distribution $N(0, 1/n)$, using some random seed $S$, which will be
sent as part of the hash to the user.
3. Wyner-Ziv encoding: The random projections $y$ are quantized with a
uniform scalar quantizer with step size $\Delta$. Bitplane extraction
is performed on the quantization bin indexes. Each bitplane is encoded
by sending syndrome bits generated by means of an LDPC code. The rate
allocated to the hash depends on the expected distortion between the
original and the tampered audio stream.
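The first stages of the encoder can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the subbands are equal-width slices of the power spectrum rather than a Mel filter bank, the LDPC syndrome generation is omitted, the bin indexes are mapped to bitplanes with an offset-binary convention chosen here for convenience, and all function names and toy sizes are hypothetical.

```python
import numpy as np

def log_energy_features(signal, frame_size, num_bands):
    """Per-frame subband log-energies x_{f,u} = log(1 + h_{f,u}).

    Assumption: bands are equal-width slices of the power spectrum,
    standing in for the Mel filter bank used in the paper.
    """
    num_frames = len(signal) // frame_size
    frames = signal[: num_frames * frame_size].reshape(num_frames, frame_size)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, num_bands, axis=1)
    energies = np.stack([band.sum(axis=1) for band in bands], axis=1)
    return np.log1p(energies).ravel()   # length n = U * N / F

def random_projections(x, m, seed):
    """y = A x with A_ij ~ N(0, 1/n), regenerated from the seed S."""
    n = x.size
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
    return A @ x, A

def quantize(y, step):
    """Uniform scalar quantizer: bin indexes for step size Delta."""
    return np.round(y / step).astype(np.int64)

def bitplanes(indexes, num_bits):
    """Bitplanes (most significant first) of offset-binary bin indexes."""
    offset = indexes + (1 << (num_bits - 1))   # shift to a non-negative range
    return np.array([(offset >> b) & 1 for b in range(num_bits - 1, -1, -1)])

# Toy usage on synthetic noise (sizes chosen only for illustration).
rng = np.random.default_rng(0)
audio = rng.standard_normal(8 * 1024)
x = log_energy_features(audio, frame_size=1024, num_bands=4)
y, A = random_projections(x, m=16, seed=1234)
q = quantize(y, step=0.5)
planes = bitplanes(q, num_bits=12)
```

Because the matrix is regenerated from the seed alone, only the seed and the syndrome bits of the bitplanes need to travel with the hash.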
The content user receives the (possibly tampered) audio stream
$\tilde{X}$ and requests the syndrome bits and the random seed of the
hash $H(X,S)$ from the authentication server at the content producer
side. On each user's request, a different seed $S$ is used in order to
prevent a malicious attacker from exploiting the knowledge of the null
space of $A$.
1. Frame-based subband log-energy extraction: computed on the signal
$\tilde{X}$ using the same algorithm described above for the content
producer side. At this step, the vector $\tilde{x}$ is produced.
2. Random projections: $\tilde{y} = A\tilde{x}$.
3. Wyner-Ziv decoding: A quantized version $\hat{y}$ is obtained using
the hash syndrome bits and $\tilde{y}$ as side information. LDPC
decoding is performed starting from the most significant bitplane. If
the actual distortion between the original and the tampered audio
stream is higher than the maximum distortion expected by the original
content producer (determined by the rate allocated to the hash
signature), decoding might fail. In this case, the audio stream is
declared to be completely unauthentic and no tampering localization
can be provided.
4. Distortion estimation: If Wyner-Ziv decoding succeeds, an estimate
of the distortion in terms of a perceptual SNR measure is computed
using the projections of the subsampled energy spectrum of the tamper.
Let $\hat{b} = \hat{y} - \tilde{y}$ be the projections of the
subsampled energy spectrum of the tamper; we compute an estimate of
the perceptual SNR of the received audio stream as
$$\mathrm{SNR}_P = 10 \log_{10} \left( \frac{\sum_{j=1}^{m} e^{\hat{y}_j}}{\sum_{j=1}^{m} e^{\hat{b}_j}} \right), \quad \hat{y}_j \in \hat{y}, \ \hat{b}_j \in \hat{b} \qquad (2)$$

[Figure 1 blocks. Original content producer: frame-based subband
log-energy extraction, random projections $y = Ax$, Wyner-Ziv encoding,
random seed $S$; hash $H(X,S)$ delivered over a trusted network.
Content user: stream $\tilde{X}$ received over the untrusted network,
frame-based subband log-energy extraction, random projections
$\tilde{y} = A\tilde{x}$, Wyner-Ziv decoding, distortion estimation,
tampering estimation.]
Figure 1: Block diagram of the proposed tampering localization scheme
5. Tampering estimation: An estimate of the tampering
$e = x - \tilde{x}$ can be obtained by solving the following
underdetermined system of linear equations:
$$\hat{b} = \hat{y} - \tilde{y} = A(x - \tilde{x}) + z = Ae + z \qquad (3)$$
where $z$ is the hash quantization noise.
There exists an infinite number of solutions to (3); however, under
the hypothesis that $e$ is sparse, the optimal way of recovering $e$
is to seek the sparsest solution of (3), i.e. the one that minimizes
$\|e\|_0$, where the $\ell_0$ norm $\|\cdot\|_0$ simply counts the
number of nonzero entries of $e$ [9]. Unfortunately, such a problem is
NP-hard and it is difficult to solve in practice. Nonetheless, recent
literature about compressive sensing [9] has shown that, if $e$ is
sufficiently sparse, an approximation of it can be recovered by
solving the following $\ell_1$ minimization problem:
$$\hat{e} = \arg\min_{e} \|e\|_1 \quad \text{s.t.} \quad \|\hat{b} - Ae\|_2 \le \epsilon \qquad (4)$$
where $\epsilon$ is chosen such that $\|z\|_2 \le \epsilon$. Problem
(4) is a special instance of a second order cone program (SOCP) [9]
and can be solved in $O(n^3)$ time. Nevertheless, several fast
algorithms have been proposed in the literature that attempt to find
the sparsest $e$ satisfying the constraint
$\|\hat{b} - Ae\|_2 \le \epsilon$. In our experiments, we adopt the
SPGL1 algorithm [10], which is specifically designed for large-scale
sparse reconstruction problems. If $e$ is not sparse enough with
respect to the number of projections $m$, the solution found does not
fulfil the constraint. In such cases, it is not possible to perform
tampering localization in the original log-energy domain. However, it
is possible to perform the analysis in
other domains in which the tamper may be sparse. In our scheme, we
assume that the tamper is sparse in some orthonormal basis $\Phi$, so
that:
$$e = \Phi\alpha \qquad (5)$$
where $\alpha$ are the coefficients of the expansion of $e$ in the
basis $\Phi$. In this case, instead of equation (3) we use the
following one:
$$\hat{b} = A\Phi_D\alpha + z_D \qquad (6)$$
Since the basis $\Phi$ is not known in advance, one can try to sparsify
the tamper in different bases $\Phi_D$. In our system we have
implemented the expansion of the tamper in the DCT, DCT 2D and Haar
wavelet bases. If the tampering is sufficiently sparse in the basis
$\Phi_D$, finding the minimum $\ell_1$ norm solution of (6) yields a
tampering estimation $\hat{\alpha}$. Then, we can transform the result
back to the original log-energy domain:
$$\hat{e} = \Phi_D\hat{\alpha} \qquad (7)$$
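The recovery in a transformed domain can be sketched as follows. Note the substitutions: instead of SPGL1 and the constrained problem (4), this example uses plain iterative soft thresholding (ISTA) on the Lagrangian form $\frac{1}{2}\|\hat{b} - A\Phi\alpha\|_2^2 + \lambda\|\alpha\|_1$, chosen only because it needs nothing beyond NumPy. The 1-D DCT basis, the problem sizes, the value of $\lambda$ and the synthetic tamper are all assumptions made for illustration.

```python
import numpy as np

def dct_basis(n):
    """Orthonormal basis Phi whose columns are DCT-II atoms (e = Phi @ alpha)."""
    k = np.arange(n)[:, None]          # frequency index
    t = np.arange(n)[None, :]          # sample index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C.T                         # C is the analysis matrix, Phi = C^T

def ista(M, b, lam, num_iters=2000):
    """Minimize 0.5*||b - M a||_2^2 + lam*||a||_1 by iterative soft thresholding."""
    L = np.linalg.norm(M, 2) ** 2      # Lipschitz constant of the gradient
    a = np.zeros(M.shape[1])
    for _ in range(num_iters):
        g = a + M.T @ (b - M @ a) / L  # gradient step
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return a

# Synthetic tamper that is sparse in the DCT domain (illustrative sizes).
n, m, k = 128, 48, 4
rng = np.random.default_rng(0)
Phi = dct_basis(n)
alpha = np.zeros(n)
alpha[rng.choice(n, size=k, replace=False)] = rng.uniform(2.0, 4.0, size=k)
e = Phi @ alpha                        # equation (5)

A = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
z = 1e-3 * rng.standard_normal(m)      # stand-in for quantization noise
b_hat = A @ e + z                      # equation (3)

alpha_hat = ista(A @ Phi, b_hat, lam=0.05)
e_hat = Phi @ alpha_hat                # equation (7)
nmse = np.sum((e_hat - e) ** 2) / np.sum(e ** 2)
```

The tamper built here is dense in the log-energy domain but has only four nonzero DCT coefficients, which is the situation where switching basis is expected to help; trying several candidate bases amounts to repeating the last three lines with a different `Phi`.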
4. EXPERIMENTAL RESULTS
We have carried out some experiments on the first 32 seconds of Etta
James' song "At last", sampled at 44100 Hz with 16 bits per sample.
The size of the audio frame has been set to F = 11025 samples (0.25
seconds), and the number of Mel frequency bands has been set to
U = 32, obtaining a total of 128 audio frames corresponding to
n = 4096 log-energy coefficients. The testbed has been built
considering 3 kinds of tampering:
• Time-localized tampering (T): a time-limited audio fragment, taken
from another portion of the whole song, is mixed with the original
audio stream;
• Frequency-localized tampering (F): a low-pass phone-band filter
(cutoff frequency at 3400 Hz and stop frequency at 4000 Hz) is applied
to the entire original audio stream;
• Time-frequency localized tampering (TF): a low-pass and a band-stop
filter are applied to two different portions of the original audio
stream (see Figure 2(b)).
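The frame and coefficient counts quoted for this setup follow directly from $n = UN/F$, and the same numbers give the total hash budget for the two bit rates used in Tables 1 and 2. The short check below only restates that arithmetic; the variable names are illustrative.

```python
sample_rate = 44100      # Hz
duration = 32            # seconds
frame_size = 11025       # F, samples per frame (0.25 s)
num_bands = 32           # U, Mel subbands per frame

num_samples = sample_rate * duration          # N
num_frames = num_samples // frame_size        # N / F
n = num_bands * num_frames                    # n = U * N / F

# Total hash size over the excerpt for the two tested bit rates (bps).
hash_bits = {rate: rate * duration for rate in (200, 400)}
raw_bits = 16 * num_samples                   # uncompressed 16-bit audio
```

At 200 bps the hash for the whole excerpt is 6400 bits, against roughly 22.6 million bits of raw 16-bit audio.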
[Figure 2 panels, each showing log energy over subbands (vertical
axis) and time in seconds (horizontal axis):
(a) Log-energy spectrum of the original audio signal
(b) Log-energy spectrum of the tamper
(c) Reconstructed tamper in the log-energy domain. In this case the
estimation reaches a normalized MSE of 6.52 · 10^-2
(d) Reconstructed tamper in the Haar wavelet domain. The normalized
MSE value is 3.01 · 10^-3]
Figure 2: Example of time-frequency tampering consisting of a
low-pass filter and a stop-band filter.
We evaluate the quality of the tampering estimation by calculating the
normalized MSE between the log-energy spectrum of the original tamper
and the log-energy spectrum of the estimated one:
$$\mathrm{MSE}_N = \frac{\sum_{j=1}^{n} (\hat{e}_j - e_j)^2}{\sum_{j=1}^{n} e_j^2}, \quad e_j \in e, \ \hat{e}_j \in \hat{e} \qquad (8)$$
Results related to the normalized MSE obtained with a fixed bit rate
for the hash are shown in Tables 1 (for 200 bps) and 2 (for 400 bps).
From the tables it is clear that, by looking for a sparse tamper in
other bases besides the canonical one, better results can be achieved
using the same hash length, as highlighted by the bold numbers in the
tables.
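For reference, the normalized MSE of equation (8) is a one-line computation. The sketch below assumes the true and estimated tampers are available as arrays of equal length; the function name and the toy vectors are hypothetical.

```python
import numpy as np

def normalized_mse(e_hat, e):
    """Equation (8): squared estimation error divided by the tamper energy."""
    e_hat = np.asarray(e_hat, dtype=float)
    e = np.asarray(e, dtype=float)
    return np.sum((e_hat - e) ** 2) / np.sum(e ** 2)

# Toy vectors: errors of 0.5 on two entries, tamper energy 2^2 + 1^2 = 5.
e = np.array([0.0, 2.0, 0.0, -1.0])
e_hat = np.array([0.0, 1.5, 0.5, -1.0])
value = normalized_mse(e_hat, e)
```

Because the error is normalized by the energy of the tamper itself, values obtained for the T, F and TF attacks can be compared even though the tampers differ in magnitude.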
Log-energy      1.78 · 10^-2   7.80 · 10^-2   6.52 · 10^-2
DCT  DCT 2D     2.88 · 10^-3   2.88 · 10^-3   1.09 · 10^-2
Haar Wavelet    8.22 · 10^-4   4.95 · 10^-3   3.01 · 10^-3
T  F  TF        5.03 · 10^-4   5.57 · 10^-2   4.78 · 10^-2
Table 1: Distortion of tamper estimation MSE_N using a fixed bit rate
for the hash signature of 200 bps.
Log-energy      1.27 · 10^-3   6.71 · 10^-2   6.47 · 10^-3
DCT  DCT 2D     1.42 · 10^-3   1.22 · 10^-3   4.84 · 10^-3
Haar Wavelet    5.95 · 10^-5   1.95 · 10^-3   2.19 · 10^-4
T  F  TF        4.27 · 10^-5   3.23 · 10^-2   1.20 · 10^-2
Table 2: Distortion of tamper estimation MSE_N using a fixed bit rate
for the hash signature of 400 bps.
5. CONCLUSIONS
In this paper, a novel algorithm to detect and identify audio
tampering by means of the recent compressive sensing framework has
been described. Using the distributed source coding paradigm, we have
shown how to produce very small yet effective hash signatures; in
addition, looking for a sparse tamper in some transformed domain
enables a further reduction of the hash payload overhead.
6. REFERENCES
[1] Y.-C. Lin, D. Varodayan, and B. Girod, “Image authentication
based on distributed source coding,” in IEEE International Conference
on Image Processing, S. Antonio, TX, Sept. 2007, vol. 3.
[2] Y.-C. Lin, D. Varodayan, and B. Girod, “Spatial Models for
Localization of Image Tampering Using Distributed Source Codes,” in
Picture Coding Symposium (PCS), Lisbon, Portugal, Nov. 2007.
[3] S. Roy and Q. Sun, “Robust Hash for Detecting and Localizing Image
Tampering,” in IEEE International Conference on Image Processing,
S. Antonio, TX, 2007, vol. 6.
[4] J. Fridrich, “Image watermarking for tamper detection,” in IEEE
International Conference on Image Processing, Chicago, Oct. 1998,
vol. 2.
[5] J.J. Eggers and B. Girod, “Blind watermarking applied to image
authentication,” in IEEE International Conference on Acoustics,
Speech, and Signal Processing, Salt Lake City, 2001, vol. 3.
[6] C.-S. Lu, H.-Y.M. Liao, and L.-H. Chen, “Multipurpose audio
watermarking,” in Proc. 15th Int. Conf. on Pattern Recognition, 2000.
[7] R.G. Baraniuk, “Compressive Sensing,” Signal Processing Magazine,
IEEE, vol. 24, no. 4, pp. 118–121, 2007.
[8] B. Girod, A.M. Aaron, S. Rane, and D. Rebollo-Monedero,
“Distributed video coding,” Proceedings of the IEEE, vol. 93, no. 1,
pp. 71–83, 2005.
[9] E. Candes, “Compressive sampling,” in International Congress of
Mathematicians, Madrid, Spain, 2006.
[10] E. van den Berg and M. P. Friedlander, “In pursuit of a root,”
Tech. Rep. TR-2007-19, Department of Computer Science, University of
British Columbia, June 2007, Preprint available at
http://www.optimization-online.org/DB_HTML/2007/06/1708.html.