Identification of Sparse Audio Tampering Using Distributed Source Coding and Compressive Sensing Techniques.
Proc. of the 11th Int. Conference on Digital Audio Effects (DAFx-08), Espoo, Finland, September 1-4, 2008
DETECTION AND IDENTIFICATION OF SPARSE AUDIO TAMPERING USING
DISTRIBUTED SOURCE CODING AND COMPRESSIVE SENSING TECHNIQUES
Giorgio Prandi
Dipartimento di Elettronica e Informazione,
Politecnico di Milano, Italy
prandi@elet.polimi.it
Giuseppe Valenzise
Dipartimento di Elettronica e Informazione,
Politecnico di Milano, Italy
valenzise@elet.polimi.it
Marco Tagliasacchi
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
tagliasa@elet.polimi.it
Augusto Sarti
Dipartimento di Elettronica e Informazione
Politecnico di Milano, Italy
sarti@elet.polimi.it
ABSTRACT
In most practical applications, for the sake of information integrity
it is useful not only to detect whether a multimedia content has
been modified, but also to identify which kind of attack has
been carried out. In the case of audio streams, for example, it
may be useful to localize the tamper in the time and/or frequency
domain. In this paper we devise a hash-based tampering detection
and localization system exploiting compressive sensing principles.
The multimedia content provider produces a small hash signature
using a limited number of random projections of a time-frequency
representation of the original audio stream. At the content user
side, the hash signature is used to estimate the distortion between
the original and the received stream and, provided that the tamper
is sufficiently sparse or sparsifiable in some orthonormal basis ex-
pansion or redundant dictionary (e.g. DCT or wavelet), to identify
the time-frequency portion of the stream that has been manipu-
lated. In order to keep the hash length small, the algorithm exploits
distributed source coding techniques.
1. INTRODUCTION
The delivery of multimedia contents in peer-to-peer networks might
give rise to different versions of the same multimedia object at different
nodes. In the case of audio files, some versions might differ
from the original because of processing due, for instance, to
transcoding or bitstream truncation. In other cases, malicious attacks
might occur by tampering with part of the audio stream and possibly
affecting its semantic content. In this paper we propose to
add a small hash to the audio stream to detect and identify audio
tampers. At the content user side, the information contained in the
hash makes it possible to estimate the distortion of the received audio
stream with respect to the original version, and to localize the tampering,
if the attack is sparse in one of the analyzed basis expansions.
In the past few years, hashes have been used for the protection
of multimedia contents, especially in the field of image authenti-
cation. In [1], the authors propose a system that performs image
authentication using distributed source codes. The hash consists
of syndrome bits applied to quantized random projections of the
original image. To perform authentication, a Slepian-Wolf de-
coder receives in input the hash and the (possibly tampered) im-
age, which serves as side information. If decoding succeeds, the
image is declared authentic. The work has been extended in [2] to
perform tampering localization, at the cost of extra syndrome bits.
The authors of [3] present an interesting algorithm that produces
hashes robust to some legitimate image manipulations (like crop-
ping, scaling, rotation) and sensitive to illegal manipulations (like
image tampering).
Watermarking has been used to solve the problem of tamper-
ing localization [4][5]. A fragile watermark is inserted into the
image when it is created, and extracted during the authentication
phase. Tampering can be localized by identifying the damage to
the watermark. Watermarking techniques have also been used to
solve the problem of content authentication and tampering local-
ization in audio data. In [6] two complementary watermarks are
embedded in audio signals to achieve audio protection and tamper-
ing localization. However, to the authors’ knowledge, in the litera-
ture there are no previous works addressing the problem of identi-
fying tampering in audio streams. In general, watermarking based
schemes suffer from the following disadvantages: 1) watermark-
ing authentication is not backward compatible with previously en-
coded contents (unmarked contents cannot be authenticated later
by just retrieving the corresponding hash); 2) the original content
is distorted by the watermark; 3) the bit-rate required to compress
a multimedia content might increase due to the embedded water-
mark. Conversely, content hashing embeds a signature of the origi-
nal content as part of the header information, or can provide a hash
separately from the content upon a user’s request. In order to limit
the rate overhead, the size of the hash needs to be as small as possi-
ble. At the same time, the goal of tampering localization calls for
increasing the hash size, in order to capture as much as possible
about the original multimedia object. In this paper we explicitly
target these conflicting requirements by proposing a hashing tech-
nique based on compressive sensing principles. The key tenet is
that, if the tampering is sparse enough (or it can be sparsified in
some orthonormal basis or redundant dictionary), it can be iden-
tified by means of a limited number of random projections of the
original signal. In addition, in order to keep the size of the hash
as small as possible, the hash information is encoded by exploiting
distributed source coding tools.
The rest of the paper is organized as follows. Section 2 pro-
vides background information about compressive sensing and dis-
tributed source coding; Section 3 gives a detailed description of
the system. The results of tampering estimation are presented in
Section 4. Finally, Section 5 gives some concluding remarks.
2. BACKGROUND
2.1. Compressive sensing
In our system, compressive sensing principles are used to build the
hash signature of the audio stream. Compressive sensing makes it possible to
capture and represent signals at rates below the Nyquist rate
[7]. In fact, it has been proved that it is possible to reconstruct a
signal using a limited number of non-adaptive linear random pro-
jections that preserve the original structure of the signal, by solv-
ing a specific optimization problem. The main constraint which
must be satisfied is that the signal has to be sparse or, at least,
compressible, i.e. the signal can be represented in some basis ex-
pansion using only a few large magnitude coefficients. A more
detailed discussion about compressive sensing can be found in [7].
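In symbols, the setting described above can be summarized as follows. This is only a compact restatement of the standard formulation surveyed in [7]; the sparsifying basis $\Psi$, the coefficient vector $s$, the sparsity level $k$ and the constant $c$ are notation introduced here for illustration and do not appear elsewhere in the paper.

```latex
% Signal x of length n, k-sparse in the basis \Psi, observed through m < n random projections
x = \Psi s, \qquad \|s\|_0 = k \ll n, \qquad y = A x \in \mathbb{R}^m, \quad m < n
% Recovery by \ell_1 minimization; m on the order of k \log(n/k) projections is typically sufficient
\hat{s} = \arg\min_{s} \|s\|_1 \quad \text{s.t.} \quad y = A \Psi s, \qquad m \ge c \, k \log(n/k)
```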
2.2. Distributed source coding
As mentioned in previous sections, in our system we use a dis-
tributed source coding technique to reconstruct the hash signature
of the audio stream at the content user side. Distributed source
coding has been widely applied to video coding [8] in order to
move the computational complexity from the encoder to the decoder
side. According to distributed source coding principles stated
by the Wyner-Ziv theorem, it is possible to perform lossy encoding
with side information at the decoder. The side information repre-
sents a distorted version of the source, which is made available at
the decoder side only. In our approach, the original information is
the hash computed by the content provider, and the side information
consists of the hash signature computed from the audio stream
received at the user side (which may be modified with respect to
the original). By requesting syndrome bits from the encoder, the
decoder is able to correct the possibly distorted side information
and to finally reconstruct the original hash signature. The more the
side information is distorted, the more syndrome bits are needed
to reconstruct the original hash; if the number of requested bits ex-
ceeds some pre-specified threshold, we may consider the received
stream too distorted and completely unauthentic. Under normal
conditions, the number of requested syndrome bits is significantly
less than the number of bits of the original information, so the
hash reconstruction approach based on distributed source coding
saves bits with respect to the direct transmission
of the original hash from the content provider to the user.
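The syndrome mechanism can be illustrated with a toy example. The sketch below is not the LDPC-based scheme used in the system: it swaps in a (7,4) Hamming code and assumes that the side information differs from the source word in at most one bit. It only shows why sending 3 syndrome bits can be enough to recover a 7-bit word when the decoder already holds a correlated copy. All function and variable names are hypothetical.

```python
import numpy as np

# Parity-check matrix of the (7,4) Hamming code: column j (1-based) holds the
# binary digits of j, so a single-bit difference at position j gives syndrome j.
H = np.array([[(j >> k) & 1 for j in range(1, 8)] for k in range(3)], dtype=np.uint8)

def syndrome(bits):
    """Return the 3 syndrome bits H @ bits (mod 2) of a 7-bit word."""
    return (H @ bits) % 2

def decode_with_side_info(s, side_info):
    """Recover the source word from its syndrome and a noisy copy.

    ASSUMPTION: side_info differs from the source in at most one bit.
    """
    diff = (syndrome(side_info) + s) % 2             # equals H @ (source XOR side_info)
    pos = int(diff[0] + 2 * diff[1] + 4 * diff[2])   # 0 means "no difference"
    recovered = side_info.copy()
    if pos:
        recovered[pos - 1] ^= 1                      # flip the bit that differs
    return recovered

source = np.array([1, 0, 1, 1, 0, 0, 1], dtype=np.uint8)  # provider-side bits
side_info = source.copy()
side_info[4] ^= 1                                          # user-side bits, one bit off
s = syndrome(source)                                       # only these 3 bits are sent
recovered = decode_with_side_info(s, side_info)
```

If the side information were more than one bit away from the source, this toy decoder would return a wrong word, which mirrors the situation in which more syndrome bits must be requested.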
3. DESCRIPTION OF THE SYSTEM
The proposed tampering detection and localization scheme is depicted
in Figure 1. The producer of the original audio stream builds
a small hash signature of the audio signal $X \in \mathbb{R}^N$, where $N$ is
the total number of audio samples of the signal. The audio content
can be distributed over a network through untrusted nodes where
a modification of the signal may occur. The user receives the audio
stream $\tilde{X}$. In order to perform content authentication, the user
sends a request for the hash signature to the content provider. By
exploiting the hash, the user can estimate the distortion of the received
content $\tilde{X}$ with respect to the original $X$. Furthermore, if
the tampering is sparse in some basis expansion, the system produces
a tamper estimation $\hat{e}$ which identifies the attack in the time-frequency
domain. At the content producer side, the encoder generates
the hash signature $H(X,S)$ as follows:
1. Frame-based subband log-energy extraction: the original
single-channel audio stream $X$ is partitioned into non-overlapping
frames of size $F$. The power spectrum of each
frame is subdivided into $U$ Mel frequency subbands, and
for each subband the related spectral log-energy is extracted.
Let $h_{f,u}$ denote the energy value for the $u$-th band at frame
$f$. The corresponding log-energy value is computed as follows:
$$x_{f,u} = \log(1 + h_{f,u}) \tag{1}$$
The log-energy values are stored in a vector $x \in \mathbb{R}^n$, where
$n = UN/F$ is the number of log-energy values extracted
from the audio stream. Note that using Mel frequency subbands
and log-energy values gives tamper detection and
identification an immediate perceptual meaning.
2. Random projections: A number of linear random projections
$y \in \mathbb{R}^m$, $m < n$, is produced as $y = Ax$. The
entries of the matrix $A \in \mathbb{R}^{m \times n}$ are sampled from a Gaussian
distribution $N(0, 1/n)$, using some random seed $S$,
which will be sent as part of the hash to the user.
3. Wyner-Ziv encoding: The random projections y are quan-
tized with a uniform scalar quantizer with step size ∆. Bit-
plane extraction is performed on the quantization bin in-
dexes. Each bitplane is encoded by sending syndrome bits
generated by means of an LDPC code. The rate allocated
to the hash depends on the expected distortion between the
original and the tampered audio stream.
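The three producer-side steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the subbands are equal-width slices of the power spectrum rather than Mel-spaced bands, $1/n$ in $N(0, 1/n)$ is read as a variance, and bitplane extraction and LDPC syndrome generation are omitted. The function names, the toy signal and the parameter values are hypothetical.

```python
import numpy as np

def log_energy_features(x, frame_size, n_bands):
    """Log-energies x_{f,u} = log(1 + h_{f,u}) over non-overlapping frames.

    ASSUMPTION: bands are equal-width slices of the power spectrum; the paper
    uses Mel frequency subbands, which would require a Mel filterbank instead.
    """
    n_frames = len(x) // frame_size
    frames = x[:n_frames * frame_size].reshape(n_frames, frame_size)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    bands = np.array_split(power, n_bands, axis=1)          # list of (n_frames, width)
    h = np.stack([b.sum(axis=1) for b in bands], axis=1)    # (n_frames, n_bands)
    return np.log1p(h).ravel()                              # vector of length n

def random_projections(x_feat, m, seed):
    """y = A x with A_ij ~ N(0, 1/n); A is regenerated from the shared seed S."""
    n = x_feat.size
    rng = np.random.default_rng(seed)
    A = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))      # 1/n read as the variance
    return A @ x_feat, A

def quantize(y, step):
    """Uniform scalar quantizer: bin indexes and reconstruction values."""
    idx = np.round(y / step).astype(int)
    return idx, idx * step

rng = np.random.default_rng(0)
audio = rng.standard_normal(8 * 1024)                       # toy signal: 8 frames
x_feat = log_energy_features(audio, frame_size=1024, n_bands=16)   # n = 8 * 16
y, A = random_projections(x_feat, m=40, seed=1234)
idx, y_q = quantize(y, step=0.5)
```

Because the matrix is derived from the seed alone, the user can rebuild the same $A$ from $S$ without it ever being transmitted.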
The content user receives the (possibly tampered) audio stream $\tilde{X}$
and requests the syndrome bits and the random seed of the hash
$H(X,S)$ from the authentication server at the content producer side. On
each user's request, a different seed $S$ is used, so that
a malicious attacker cannot exploit knowledge of the nullspace of
$A$.
1. Frame-based subband log-energy extraction: computed on
signal $\tilde{X}$ using the same algorithm described above for the
content producer side. At this step, the vector $\tilde{x}$ is produced.
2. Random projections: $\tilde{y} = A\tilde{x}$.
3. Wyner-Ziv decoding: A quantized version $\hat{y}$ is obtained
using the hash syndrome bits and $\tilde{y}$ as side information.
LDPC decoding is performed starting from the most significant
bitplane. If the actual distortion between the original
and the tampered audio stream is higher than the maximum
distortion expected by the original content producer (determined
by the rate allocated to the hash signature), decoding
might fail. In this case, the audio stream is declared to be
completely unauthentic and no tampering localization can
be provided.
4. Distortion estimation: If Wyner-Ziv decoding succeeds, an
estimate of the distortion in terms of a perceptual SNR measure
is computed using the projections of the subsampled
energy spectrum of the tamper. Let $\hat{b} = \hat{y} - \tilde{y}$ be the projections
of the subsampled energy spectrum of the tamper; we
compute an estimate of the perceptual SNR of the received
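[Figure 1 (image not reproduced). Content producer side: frame-based subband log-energy extraction, random projections $y = Ax$, Wyner-Ziv encoding and random seed $S$; the hash $H(X,S)$ travels over a trusted network. Content user side, receiving $\tilde{X}$ over an untrusted network where the tamper is applied: frame-based subband log-energy extraction, random projections $\tilde{y} = A\tilde{x}$, Wyner-Ziv decoding giving $\hat{y}$, difference $\hat{b} = \hat{y} - \tilde{y}$, distortion estimation $\mathrm{SNR}_P(X, \tilde{X})$, and tampering estimation from $\hat{b} = A\Phi_1\alpha + z_1$, $\hat{b} = A\Phi_2\alpha + z_2$, ..., $\hat{b} = A\Phi_D\alpha + z_D$, giving $\hat{e} = \Phi\alpha$.]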
Figure 1: Block diagram of the proposed tampering localization scheme
audio stream as
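$$\mathrm{SNR}_P = 10 \log_{10} \left( \frac{\sum_{j=1}^{m} e^{\hat{y}_j}}{\sum_{j=1}^{m} e^{\hat{b}_j}} \right), \quad \hat{y}_j \in \hat{y}, \; \hat{b}_j \in \hat{b} \tag{2}$$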
5. Tampering estimation: An estimate of the tampering $e = x - \tilde{x}$
can be obtained by solving the following underdetermined
system of linear equations:
$$\hat{b} = \hat{y} - \tilde{y} = A(x - \tilde{x}) + z = Ae + z \tag{3}$$
where $z$ is the hash quantization noise.
There exists an infinite number of solutions to (3); however,
under the hypothesis that $e$ is sparse, the optimal way of
recovering $e$ is to seek the sparsest solution of (3), i.e. the
one that minimizes $\|e\|_0$, where the $\ell_0$ norm $\|\cdot\|_0$ simply
counts the number of nonzero entries of $e$ [9]. Unfortunately,
such a problem is NP-hard and it is difficult to solve
in practice. Nonetheless, recent literature about compressive
sensing [9] has shown that, if $e$ is sufficiently sparse,
an approximation of it can be recovered by solving the following
$\ell_1$ minimization problem:
$$\hat{e} = \arg\min_{e} \|e\|_1 \quad \text{s.t.} \quad \|\hat{b} - Ae\|_2 \le \epsilon \tag{4}$$
where $\epsilon$ is chosen such that $\|z\|_2 \le \epsilon$. Problem (4) is a special
instance of a second order cone program (SOCP) [9]
and can be solved in $O(n^3)$ time. Nevertheless, several fast
algorithms have been proposed in the literature that attempt
to find the sparsest $e$ satisfying the constraint $\|\hat{b} - Ae\|_2 \le \epsilon$.
In our experiments, we adopt the SPGL1 algorithm [10],
which is specifically designed for large scale sparse reconstruction
problems. If $e$ is not sparse enough with respect
to the number of projections $m$, the solution found does
not fulfil the constraint. In such cases, it is not possible to
perform tampering localization in the original log-energy
domain. However, it is possible to perform the analysis in
other domains in which the tamper may be sparse. In our
scheme, we assume that the tamper is sparse in some orthonormal
basis $\Phi$, so that:
$$e = \Phi\alpha \tag{5}$$
where $\alpha$ are the coefficients of the expansion of $e$ in the
basis $\Phi$. In this case, instead of equation (3) we use the
following one:
$$\hat{b} = A\Phi_D\alpha + z_D \tag{6}$$
Since the basis $\Phi$ is not known in advance, one can try to
sparsify the tamper in different bases $\Phi_D$. In our system
we have implemented the expansion of the tamper in the
DCT, DCT 2D and Haar wavelet bases. If the tampering in
the basis $\Phi_D$ is sufficiently sparse, finding the minimum $\ell_1$-norm
solution of (6) yields a tampering estimation
$\hat{\alpha}$. Then, we can transform the result back to the original
log-energy domain:
$$\hat{e} = \Phi_D\hat{\alpha} \tag{7}$$
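The user-side distortion and tampering estimation steps can be sketched on synthetic data as follows. Several assumptions apply. SPGL1 is not used: plain iterative soft thresholding (ISTA) on the penalized form $\tfrac{1}{2}\|\hat{b} - A\Phi_D\alpha\|_2^2 + \lambda\|\alpha\|_1$ is swapped in as a simple $\ell_1$ solver. The basis is a 1D DCT. The perceptual SNR reads equation (2) as a ratio of sums of exponentials. The tamper is taken with the sign for which $\hat{b} = Ae + z$ holds in (3). All names and parameter values (including $\lambda$ and the step size) are hypothetical.

```python
import numpy as np

def perceptual_snr(y_hat, b_hat):
    """SNR_P in dB, reading equation (2) as a ratio of sums of exponentials."""
    return 10.0 * np.log10(np.sum(np.exp(y_hat)) / np.sum(np.exp(b_hat)))

def dct_basis(n):
    """Orthonormal DCT-II basis Phi (columns are basis vectors): e = Phi @ alpha."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * t + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)
    return C.T      # C analyses (alpha = C @ e); its transpose synthesizes

def ista(M, b, lam, n_iter=3000):
    """Minimize 0.5*||b - M a||_2^2 + lam*||a||_1 by iterative soft thresholding."""
    L = np.linalg.norm(M, 2) ** 2            # Lipschitz constant of the gradient
    a = np.zeros(M.shape[1])
    for _ in range(n_iter):
        g = a + M.T @ (b - M @ a) / L        # gradient step on the quadratic term
        a = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)   # soft threshold
    return a

rng = np.random.default_rng(0)
n, m, step = 128, 48, 0.01
Phi = dct_basis(n)
alpha = np.zeros(n)
alpha[[3, 17, 40, 90]] = [2.0, -1.5, 1.0, -2.5]   # tamper is 4-sparse in the DCT basis
e = Phi @ alpha
x = rng.uniform(0.0, 5.0, n)                       # synthetic original log-energies
x_tilde = x - e                                    # received features, e = x - x_tilde
A = rng.normal(0.0, np.sqrt(1.0 / n), size=(m, n))
y_hat = np.round(A @ x / step) * step              # decoded, quantized projections
b_hat = y_hat - A @ x_tilde                        # b_hat = A e + z
alpha_hat = ista(A @ Phi, b_hat, lam=0.02)
e_hat = Phi @ alpha_hat                            # back to the log-energy domain
nmse = np.sum((e_hat - e) ** 2) / np.sum(e ** 2)
snr_p = perceptual_snr(y_hat, b_hat)
```

Trying several candidate bases amounts to repeating the last steps with a different `Phi` and keeping the estimate whose residual satisfies the constraint.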
4. EXPERIMENTAL RESULTS
We have carried out some experiments on the first 32 seconds of
Etta James’ song “At last”, sampled at 44100 Hz with 16 bits per
sample. The size of the audio frame has been set to F = 11025
samples (0.25 seconds), and the number of Mel frequency bands
has been set to U = 32, obtaining a total of 128 audio frames
corresponding to n = 4096 log-energy coefficients. The testbed
has been built considering 3 kinds of tampering:
• Time localized tampering (T): a time-limited audio fragment,
taken from another portion of the whole song, is mixed
with the original audio stream;
• Frequency localized tampering (F): a low-pass phone-band
filter (cutoff frequency at 3400 Hz and stop frequency at 4000
Hz) is applied to the entire original audio stream;
• Time-frequency localized tampering (TF): a low-pass and a
band-stop filter are applied to two different portions of the
original audio stream (see Figure 2(b)).
[Figure 2 panels (images not reproduced); plot labels: Time [s], Subbands, Log Energy]
(a) Log-energy spectrum of the original audio signal
(b) Log-energy spectrum of the tamper
(c) Reconstructed tamper in log-energy domain. In this case the estimation reaches a Normalized MSE of 6.52 · 10^-2
(d) Reconstructed tamper in Haar wavelet domain. The Normalized MSE value is 3.01 · 10^-3
Figure 2: Example of time-frequency tampering consisting of a
low-pass filter and a stop-band filter.
We evaluate the goodness of the tampering estimation by cal-
culating the normalized MSE between the log-energy spectrum of
the original tamper and the log-energy spectrum of the estimated
one:
$$\mathrm{MSE}_N = \frac{\sum_{j=1}^{n} (\hat{e}_j - e_j)^2}{\sum_{j=1}^{n} e_j^2}, \quad e_j \in e, \; \hat{e}_j \in \hat{e} \tag{8}$$
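As a small illustration, the normalized MSE of equation (8) can be computed as below. The function name and the example vectors are hypothetical and are not taken from the experiments.

```python
import numpy as np

def normalized_mse(e_hat, e):
    """Normalized MSE between the estimated and the true tamper (equation (8))."""
    e_hat = np.asarray(e_hat, dtype=float)
    e = np.asarray(e, dtype=float)
    return np.sum((e_hat - e) ** 2) / np.sum(e ** 2)

e_true = [0.0, 2.0, 0.0, -1.0]     # made-up tamper in the log-energy domain
e_est = [0.0, 1.8, 0.1, -1.0]      # made-up estimate
value = normalized_mse(e_est, e_true)
```

The normalization by the tamper energy makes the figure comparable across tampers of different strength.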
Results related to Normalized MSE obtained with a fixed bit rate
for the hash are shown in Tables 1 (for 200 bps) and 2 (for 400
bps). From the tables it is clear that, by looking for a sparse tam-
per in other bases besides the canonical one, better results can be
achieved using the same hash length, as highlighted by the bold
numbers in the tables.
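To put the hash rates in perspective, the following arithmetic relates them to the number of log-energy coefficients in the test excerpt. It assumes that "bps" means hash bits per second of audio and ignores the bits needed for the seed $S$; both are assumptions, and the variable names are hypothetical.

```python
fs = 44100          # sampling rate [Hz]
duration_s = 32     # analysed excerpt [s]
F = 11025           # frame size [samples]
U = 32              # Mel frequency subbands per frame

N = fs * duration_s             # total number of audio samples
n_frames = N // F               # non-overlapping frames
n = U * n_frames                # log-energy coefficients, n = U*N/F

for rate_bps in (200, 400):
    hash_bits = rate_bps * duration_s       # ASSUMPTION: bits per second of audio
    print(rate_bps, "bps:", hash_bits, "bits,", hash_bits / n, "bits per coefficient")
```

Under these assumptions the hash spends only a few bits per log-energy coefficient, far fewer than a direct encoding of $x$ would need.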
[In Tables 1 and 2, the extracted layout does not show which of the two rows labelled "DCT / DCT 2D" is DCT and which is DCT 2D; they are listed in extraction order.]

Basis           T              F              TF
Log-energy      1.78 · 10^-2   7.80 · 10^-2   6.52 · 10^-2
DCT / DCT 2D    2.88 · 10^-3   2.88 · 10^-3   1.09 · 10^-2
DCT / DCT 2D    5.03 · 10^-4   5.57 · 10^-2   4.78 · 10^-2
Haar Wavelet    8.22 · 10^-4   4.95 · 10^-3   3.01 · 10^-3
Table 1: Distortion of tamper estimation MSEN using a fixed bit
rate for the hash signature of 200 bps.

Basis           T              F              TF
Log-energy      1.27 · 10^-3   6.71 · 10^-2   6.47 · 10^-3
DCT / DCT 2D    1.42 · 10^-3   1.22 · 10^-3   4.84 · 10^-3
DCT / DCT 2D    4.27 · 10^-5   3.23 · 10^-2   1.20 · 10^-2
Haar Wavelet    5.95 · 10^-5   1.95 · 10^-3   2.19 · 10^-4
Table 2: Distortion of tamper estimation MSEN using a fixed bit
rate for the hash signature of 400 bps.

5. CONCLUSIONS
In this paper, a novel algorithm to detect and identify audio tampering
by means of the recent compressive sensing framework has
been described. Using the distributed source coding paradigm, we
have shown how to produce very small yet effective hash signatures;
in addition, looking for a sparse tamper in some transformed
domain enables a further reduction of the hash payload overhead.
6. REFERENCES
[1] Y.C. Lin, D. Varodayan, and B. Girod, “Image authentication
based on distributed source coding,” in IEEE International
Conference on Image Processing, S.Antonio, TX,
Sept. 2007, vol. 3.
[2] Y.C. Lin, D. Varodayan, and B. Girod, “Spatial Models for
Localization of Image Tampering Using Distributed Source
Codes,” in Picture Coding Symposium (PCS), Lisbon, Portugal,
Nov. 2007.
[3] S. Roy and Q. Sun, “Robust Hash for Detecting and Localizing
Image Tampering,” in IEEE International Conference
on Image Processing, S.Antonio, TX, 2007, vol. 6.
[4] J. Fridrich, “Image watermarking for tamper detection,”
in IEEE International Conference on Image Processing,
Chicago, Oct. 1998, vol. 2.
[5] J.J. Eggers and B. Girod, “Blind watermarking applied to
image authentication,” in IEEE International Conference on
Acoustics, Speech, and Signal Processing, Salt Lake City,
2001, vol. 3.
[6] C.S. Lu, H.Y.M. Liao, and L.H. Chen, “Multipurpose audio
watermarking,” in Proc. 15th Int. Conf. on Pattern Recognition,
2000.
[7] R.G. Baraniuk, “Compressive Sensing,” Signal Processing
Magazine, IEEE, vol. 24, no. 4, pp. 118–121, 2007.
[8] B. Girod, AM Aaron, S. Rane, and D. Rebollo-Monedero,
“Distributed video coding,” Proceedings of the IEEE, vol.
93, no. 1, pp. 71–83, 2005.
[9] E. Candes, “Compressive sampling,” in International
Congress of Mathematicians, Madrid, Spain, 2006.
[10] E. van den Berg and M. P. Friedlander, “In pursuit of a root,”
Tech. Rep. TR-2007-19, Department of Computer Science,
University of British Columbia, June 2007, Preprint available
at http://www.optimization-online.org/DB_HTML/2007/06/1708.html.