
IET Signal Processing

Research Article

Secure echo-hiding audio watermarking method based on improved PN sequence and robust principal component analysis

ISSN 1751-9675

Received on 15th August 2019

Revised 26th December 2019

Accepted on 18th February 2020

E-First on 27th March 2020

doi: 10.1049/iet-spr.2019.0376

www.ietdl.org

Shengbei Wang¹, Chao Wang¹, Weitao Yuan¹, Lin Wang², Jianming Wang¹

¹ Tianjin Key Laboratory of Autonomous Intelligence Technology and Systems, Tianjin Polytechnic University, 300387 Tianjin, People's Republic of China

² Techfantasy. Co. Ltd., 300387 Tianjin, People's Republic of China

E-mail: wangjianming@tjpu.edu.cn

Abstract: Echo-hiding has been widely studied for audio watermarking. This study proposes a more secure echo-hiding method

based on modified pseudo-noise (PN) sequence and robust principal component analysis (RPCA). In the proposed method, the

RPCA is used to decompose the original audio signal into low-rank and sparse parts and then a pair of opposite modified PN

sequences is employed to embed watermarks. The modified PN sequence improves the robustness of watermark detection by

providing additional correlation peaks. Meanwhile, benefiting from the RPCA and the opposite PN sequences, the security of the

proposed method is improved since watermarks cannot be detected from the whole signal even if the PN sequence is known,

which is an obvious improvement compared with the previous PN-based echo-hiding methods. In the watermark detection

process, the authors make use of the low-rank and sparse characteristics of the watermarked signal to detect watermarks from

the low-rank and sparse parts, respectively. Based on this basic framework, they also propose a multi-bit embedding scheme,

which obtains a doubled embedding capacity compared with the previous PN-based echo-hiding methods. The proposed

method was evaluated with respect to inaudibility, security, and robustness. The experiment results verified the effectiveness of

the proposed method.

1 Introduction

With the rapid development of digital media and network

technology, audio transmission has become more and more

convenient. However, illegal dissemination behaviours that ignore

copyright seriously harm the interests of audio authors. As a result,

copyright protection has lodged itself in the public mind. In order

to solve this problem, scholars want to add tags to the audio to

prove the ownership and therefore the audio watermarking

technology has been proposed [1, 2]. After many years of exploration, watermarking technology has advanced significantly.

Audio watermarking [3] has been considered as an effective

technique to prevent audio from unauthorised operations. In

general, there are several requirements for audio watermarking,

e.g. inaudibility, blindness, robustness, capacity, and security [3–5].

Inaudibility requires that the embedded watermarks should not

degrade the audio quality [6]. Blindness suggests that the

watermarks can be detected without the original audio. At present,

more and more scholars pay attention to blind watermarking.

Robustness guarantees that the embedded watermarks cannot be

destroyed by allowable audio operations such as compression.

Embedding capacity is also necessary to evaluate the watermarking

method since the copyright of the audio signal can be better

protected when more watermarks are embedded. The last and most

recent concern on watermarking is that the copyright information

contained in the watermarked signal should not be easily

discovered by the attackers, which is called security. In general, the

five requirements described above are the main criteria for

evaluating a watermarking method [7].

In the past, many audio watermarking methods have been

proposed, e.g. support vector [8, 9], phase coding [10], spread

spectrum [11, 12], techniques based on masking [13, 14],

patchwork [15, 16], echo-hiding [17–23], and so on. As one typical

audio watermarking method, echo-hiding has a simple and easy-to-

operate embedding and detection process. Besides, the watermark

detection of echo-hiding does not need the original signal (i.e. it is

a blind method). The echo-hiding method was first proposed by

Gruhl et al. [17], who described how to embed the watermarks

using one backward echo kernel and how to detect it using

cepstrum operations. As a critical factor for echo-hiding, the echo kernel plays a vital role and greatly affects the performance of

the watermarking method. Therefore, many echo-hiding algorithms

have been proposed to design more effective kernels, such as the

dual-kernel [24–26], backward–forward echo kernels [27] etc.

Although the above echo-hiding methods have a simple

embedding and detection process, they have a fatal security flaw,

since the watermarks can be easily obtained with a cepstrum

analysis of the watermarked signal. In order to overcome this

drawback, the security algorithm [18, 19] using pseudo-noise (PN)

sequence has been proposed. The PN sequence is used as a secret

key to embed multiple echoes into an original signal. Owing to

the very small amplitude of each echo, there will be no obvious peak in

the power spectrum after cepstrum analysis, that is, its power

spectrum becomes nearly smooth in the mean time sense.

Therefore, it is impossible to detect the watermarks directly

through cepstrum analysis in these watermarking schemes. In order

to obtain the watermarks, the corresponding PN sequence must be

used for correlation operation after the cepstrum analysis, which

greatly increases the security of the basic echo-hiding methods [17,

24–27]. In [19], the PN sequence-based echo-hiding method was

further improved. During the watermark detection process, the

authors used real cepstrum instead of the complex cepstrum to

make the cepstrum peak more obvious. In recent years, some

modified PN sequences have been proposed for echo-hiding

methods to enhance robustness and inaudibility performance. In

[28], the PN sequence was modified to a new sequence denoted by

q(n) and three peaks were obtained after correlation, which

improved the accuracy of watermark detection. In [21], the PN

sequence was improved and more large peaks were produced after

correlation. These two schemes improved the performance of

watermarking methods by modifying the PN sequence and taking

advantage of the correlation operation.

However, the algorithms mentioned above still have the

following shortcomings:

IET Signal Process., 2020, Vol. 14 Iss. 4, pp. 229-242

© The Institution of Engineering and Technology 2020


(i) Low security: In these methods, the PN sequence is used as a

security key for watermark detection. If the PN sequence is leaked,

the watermarks will be easily detected by the attackers.

Furthermore, many methods, e.g. [21, 28], try to obtain better robustness by modifying the PN sequence. However, such modifications also make the PN sequence more regular, so its complexity is reduced, which makes it easier to crack.

(ii) Low capacity: For most PN sequence-based echo-hiding

methods, the PN sequence should be relatively long to achieve

good inaudibility and robustness performance. However, if the

length of the PN sequence is guaranteed to be long enough, the

number of watermarks that can be embedded will be reduced.

To solve the above issues, this paper proposes a more secure echo-

hiding method based on improved PN sequence and the robust

principal component analysis (RPCA) [29]. In our view, the

root cause of the security problem in PN-based echo-hiding

methods is that these methods directly add the echoes on the

original whole signal. Therefore, watermarks can be easily detected

after cepstrum and correlation analysis of the watermarked signal

when the PN sequence is known. In this paper, the RPCA is

employed to improve the PN-based echo-hiding method. The

original signal is first decomposed into two parts, i.e. low-rank and

sparse parts, using RPCA. Watermarks are separately embedded

into them using a pair of opposite PN sequences. Accordingly, the

cepstrum and correlation analysis of whole watermarked signal can

no longer produce any obvious peaks for watermark detection even

using the correct PN sequences, since there is an opposite effect

between the correlation results of the two parts in the watermarked

signal (low-rank and sparse parts). To correctly detect the

watermarks, the proposed method takes advantage of the low-rank

and sparse characteristics of the embedded echoes. Watermarks can

be separately detected from the low-rank and sparse parts using the

correct decomposition parameter of RPCA and the corresponding

PN sequence. In particular, the PN sequence employed above is

designed by considering the correlation property of the PN

sequence and it provides two extra correlation peaks to effectively

improve the robustness of watermark detection. Based on the

above framework, this paper also implements a multi-bit

embedding scheme which achieves a doubled embedding capacity

compared with the previous PN-based echo-hiding methods.

This paper is organised as follows. Section 2 reviews typical

echo-hiding methods. Section 3 describes our proposed method

based on the improved PN sequence and the RPCA. Section 4

evaluates the inaudibility, security, and robustness of the proposed

method with a series of experiments and compares it with the

previous echo-hiding method and the PN sequence-based echo-

hiding method. In the last section, we give a summary of our work.

2 Review of typical echo-hiding

The widely accepted model of echo-hiding is

$$ y(n) = x(n) * h(n), \tag{1} $$

where $x(n)$ is the original audio signal, $h(n)$ is the echo kernel, $y(n)$ is the watermarked signal, and the operation symbol $*$ stands for convolution. The backward echo kernel is defined as

$$ h(n) = \delta(n) + \alpha\,\delta(n-d), \tag{2} $$

where $\delta(\cdot)$ is a Dirac delta function, $\alpha$ denotes the attenuation amplitude of the echo, and $d$ is the delay of the echo. To improve the security, the PN sequence-based echo kernel [18, 19] was proposed, which is given by

$$ h(n) = \delta(n) + \alpha \sum_{i=0}^{L-1} p(i)\,\delta(n-d-i), \tag{3} $$

where $p(i) \in \{-1, +1\}$, $0 \le i \le L-1$, is the PN sequence of length $L$. The pulse representation of the PN-based echo kernel is

shown in Fig. 1. In the watermark detection process, the

watermarked signal is analysed by real cepstrum analysis of (1),

i.e.

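$$ c_y(n) = c_x(n) + c_h(n), \tag{4} $$

where $c_x(n) = \mathcal{F}^{-1}\{\log|\mathcal{F}(x(n))|\}$, $c_h(n) = \mathcal{F}^{-1}\{\log|\mathcal{F}(h(n))|\}$, $|\cdot|$ is the absolute value operation, and $\mathcal{F}(\cdot)$ and $\mathcal{F}^{-1}(\cdot)$ are the Fourier transform and inverse Fourier transform, respectively. For PN-based echo-hiding, $c_h(n)$ can be calculated in more detail [19]

$$ c_h(n) \approx \frac{\alpha}{2}\bigl(p(-n+d) + p(n-d)\bigr). \tag{5} $$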

The PN sequence is a necessary condition for obtaining the

watermarks. The cross-correlation operation of (4) is carried out

with the PN sequence

$$ \begin{aligned} d(\tau) &= E\bigl(c_y(n)\,p(n-\tau)\bigr) \\ &\approx E\bigl(c_x(n)\,p(n-\tau)\bigr) + \frac{\alpha}{2}E\bigl(p(-n+d)\,p(n-\tau)\bigr) + \frac{\alpha}{2}E\bigl(p(n-d)\,p(n-\tau)\bigr), \end{aligned} \tag{6} $$

where $E(\cdot)$ calculates the mathematical expectation. We know from (6) that $(\alpha/2)E(p(-n+d)p(n-\tau))$ is small and negligible, and when $\tau = d$, the term $(\alpha/2)E(p(n-d)p(n-\tau))$ has a maximum value of $\alpha/2$. Hence, we can detect the watermarks by detecting the maximum value (peak value).
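The embedding and detection chain of (1) to (6) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation evaluated in the paper: white noise stands in for an audio frame, a random ±1 sequence stands in for a true PN sequence, and the values of `alpha`, `d`, and the sequence length are chosen only for the demonstration. All function names are hypothetical.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """Real cepstrum: inverse FFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x, nfft)
    # The tiny constant only guards against log(0).
    return np.real(np.fft.ifft(np.log(np.abs(spectrum) + 1e-12)))

def pn_echo_kernel(p, alpha, d):
    """Kernel of (3): unit impulse plus alpha-scaled echoes at delays d, d+1, ..."""
    h = np.zeros(d + len(p))
    h[0] = 1.0
    h[d:d + len(p)] += alpha * p
    return h

def correlate_with_pn(c, p, max_lag):
    """Estimate d(tau) of (6) as the mean of c(n) p(n - tau) over the PN support."""
    L = len(p)
    return np.array([np.mean(c[tau:tau + L] * p) for tau in range(max_lag)])

rng = np.random.default_rng(0)
x = rng.standard_normal(8192)            # stand-in for an audio frame (assumption)
p = rng.choice([-1.0, 1.0], size=255)    # random +/-1 stand-in for a PN sequence
alpha, d = 0.02, 100                     # illustrative values only

y = np.convolve(x, pn_echo_kernel(p, alpha, d))   # (1): y = x * h
nfft = 1 << int(np.ceil(np.log2(len(y))))         # long enough to avoid wrap-around
c_y = real_cepstrum(y, nfft)
d_tau = correlate_with_pn(c_y, p, max_lag=300)
detected_delay = int(np.argmax(d_tau[1:]) + 1)    # skip tau = 0, which contains c(0)
```

Because the FFT length covers the whole linear convolution, the cepstrum of `y` is the sum of the cepstra of `x` and the kernel, which is the additivity stated in (4).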

In order to further improve the above algorithms, some

modified PN sequences were proposed. In [28], the modified PN

sequence was proposed, which is defined by

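$$ q(i) = \begin{cases} p(i), & \text{if } i = 0 \text{ or } i = L-1 \\ (-1)^{y(i)}\,p(i), & \text{if } 0 < i < L-1, \end{cases} \tag{7} $$

where

$$ y(i) = \operatorname{fix}\!\left(\frac{q(i-1) + p(i-1) + p(i) + p(i+1)}{4}\right) \tag{8} $$

and the $\operatorname{fix}(\cdot)$ function rounds its argument to the nearest integer in the direction of zero. The correlation operation of $q(i)$ is

$$ r(\tau) = E\bigl(q(i-d)\,q(i-\tau)\bigr). \tag{9} $$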

The calculation result is shown in the top two panels of Fig. 2. We

can find that there are three peaks at τ=d, τ=d+ 1, and

τ=d− 1. By detecting these three peaks, the robustness of

watermark detection is effectively improved.
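The recursion in (7) and (8) can be written out directly. The sketch below is a literal transcription of those two equations as they appear in this section, not code from [28]; the function name `modified_pn` is hypothetical.

```python
import numpy as np

def modified_pn(p):
    """Build q(i) from p(i) following (7)-(8); np.fix rounds toward zero."""
    p = np.asarray(p, dtype=int)
    q = p.copy()
    for i in range(1, len(p) - 1):
        y = int(np.fix((q[i - 1] + p[i - 1] + p[i] + p[i + 1]) / 4))
        sign = -1 if y % 2 else 1      # (-1)**y for y in {-1, 0, 1}
        q[i] = sign * p[i]
    return q
```

With values in {−1, +1}, the argument of fix is ±1 only when the four terms agree, so an interior element is flipped exactly when q(i−1), p(i−1), p(i), and p(i+1) all share the same sign.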

Fig. 1 Echo kernels based on PN sequence

In [21], the PN sequence is modified to

$$ \bar{p}(i) = [\underbrace{p(0), p(0), \ldots, p(0)}_{n_r\ \text{elements}},\ \underbrace{p(1), p(1), \ldots, p(1)}_{n_r\ \text{elements}},\ \ldots,\ \underbrace{p(L-1), p(L-1), \ldots, p(L-1)}_{n_r\ \text{elements}}], \tag{10} $$

where $n_r$ is defined as the number of repetitions. The correlation result of this sequence can be seen in the bottom two panels of Fig. 2. We can obtain $(2n_r - 1)$ peaks for watermark detection. These peaks are located at $\tau = \{d-(n_r-1), \ldots, d-1, d, d+1, \ldots, d+(n_r-1)\}$.
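The repetition in (10) and the resulting cluster of peaks can be checked numerically. This is a small sketch assuming a random ±1 sequence in place of a true PN sequence; `repeat_pn` and `normalised_autocorr` are hypothetical helper names.

```python
import numpy as np

def repeat_pn(p, n_r):
    """Sequence of (10): every element of p repeated n_r times."""
    return np.repeat(p, n_r)

def normalised_autocorr(s, max_lag):
    """Autocorrelation at lags 0..max_lag, normalised by the sequence length."""
    n = len(s)
    return np.array([np.dot(s[:n - k], s[k:]) / n for k in range(max_lag + 1)])

rng = np.random.default_rng(1)
p = rng.choice([-1.0, 1.0], size=1023)   # random stand-in for a PN sequence
p_bar = repeat_pn(p, 3)
r = normalised_autocorr(p_bar, 6)        # r[k] is the value at tau = d + k (and d - k)
```

For `n_r = 3` the values at lags 0, 1, and 2 stand well above the rest, which by symmetry gives the 2n_r − 1 = 5 peaks around τ = d.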

The above two methods indeed improve the robustness of

watermark detection compared with the original PN sequence.

However, such modifications to the PN sequence reduce its

complexity and therefore make it easier to crack, i.e. the

watermarking method will be less secure. In addition, to ensure the

performance of the watermarking method, the PN sequence should

be as long as possible. However, this will reduce the embedding

capacity of the echo-hiding method. In the next section, we will

introduce the proposed echo-hiding method based on the improved

PN sequence and RPCA.

3 Proposed watermarking scheme

3.1 RPCA decomposition

Although the PN sequence increases the security of the echo-hiding

method, the watermarks can be easily obtained by the attackers

when the PN sequence is known. The key reason for this is that

most of the previous echo-hiding methods (including the PN

sequence-based methods) directly add the echoes to the original

signal. Accordingly, the cepstrum of the watermarked signal can be

regarded as the addition of the cepstrum of the original signal and

that of the echo kernel, which makes watermark detection easy for attackers. We reasoned that if the echoes were not added directly to the original signal, the watermarks would not be easily obtained even if the PN sequence is known. As a result, the

security of the typical echo-hiding methods and the PN-based

echo-hiding methods will be further improved.

This paper introduces RPCA [29, 30] to the original PN-based

echo-hiding methods. In the proposed method, the original audio

signal is first decomposed into two parts, i.e. low-rank and sparse

parts, and then watermarks are separately embedded into them

using the proposed kernels. That is, the echoes are not added

directly to the original signal but to its sub-signals.

The process of audio decomposition based on RPCA is as

follows. Since the audio signal (denoted as x(n)) is a one-dimensional signal, we first compute the short-time Fourier transform (STFT) representation of x(n) in the time–frequency (T–F) domain. The obtained T–F representation is an F × T matrix denoted by M, where F is the number of frequency bins and T is the number of time bins. Note that, to obtain a good decomposition effect, the window size and hop size (half of the window size) change with the frame length so that the obtained T–F representation is approximately a square matrix. For example, for a frame length of 5512 samples (i.e. 8 bps for a 44.1 kHz sampled signal), we use a window size of 144 and a hop size of 72. The obtained M is nearly a square matrix, with F = 73 and T = 75. The window size and hop size for other frame lengths can be calculated in the same way. The magnitude and phase spectrograms of M are the F × T matrices M0 and P, respectively.

The decomposition operation of the RPCA is performed on M0 by

solving the following convex optimisation problem using principal

component pursuit:

$$ \begin{aligned} &\text{minimise} && \|L_0\|_* + \lambda\|S_0\|_1 \\ &\text{subject to} && M_0 = L_0 + S_0, \end{aligned} \tag{11} $$

where $L_0$ is a low-rank matrix, $S_0$ is a sparse matrix, and $\lambda$ ($\lambda > 0$) is a positive parameter that controls the decomposition balance between the low-rank and the sparse parts. Here $\|\cdot\|_*$ is the nuclear norm and $\|\cdot\|_1$ is the 1-norm. In (11), $\|L_0\|_*$ and $\|S_0\|_1$ are defined as

$$ \|L_0\|_* = \sum_{i=1}^{\min(F,T)} \sigma_i, \tag{12} $$

$$ \|S_0\|_1 = \sum_{f,t} |S_{f,t}|, \tag{13} $$

where $\sigma_i$ is the $i$th singular value of $L_0$, $S_{f,t}$ is the entry of $S_0$ at frequency bin $f$ and time bin $t$, $f$ is the index of the frequency bin ($1 \le f \le F$), and $t$ is the index of the time bin ($1 \le t \le T$). When considering the constraint condition $M_0 = L_0 + S_0$, the objective function in (11) can be set as follows

$$ \min_{L_0, S_0}\ \|L_0\|_* + \lambda\|S_0\|_1 + \frac{\mu}{2}\|M_0 - L_0 - S_0\|_F^2, \tag{14} $$

where $\mu$ is a positive scalar and $\|\cdot\|_F$ is the Frobenius norm. $L_0$ and $S_0$ can be obtained by solving (14) with the proximal gradient method.
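To make the relaxed objective (14) concrete, the sketch below minimises it by alternating two exact proximal updates: singular value thresholding for the low-rank part and soft thresholding for the sparse part. This is a plain block coordinate descent chosen for brevity, not the proximal gradient solver used in the paper, and the function names, iteration count, and parameter values are assumptions.

```python
import numpy as np

def soft_threshold(a, t):
    """Element-wise shrinkage: proximal operator of t * ||.||_1."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def singular_value_threshold(a, t):
    """Shrink singular values by t: proximal operator of t * ||.||_*."""
    u, s, vt = np.linalg.svd(a, full_matrices=False)
    return (u * soft_threshold(s, t)) @ vt

def objective(m0, l0, s0, lam, mu):
    """Value of (14)."""
    nuclear = np.linalg.svd(l0, compute_uv=False).sum()
    residual = np.linalg.norm(m0 - l0 - s0, "fro") ** 2
    return nuclear + lam * np.abs(s0).sum() + 0.5 * mu * residual

def rpca_relaxed(m0, lam, mu, n_iter=50):
    """Alternate exact minimisation of (14) over L0 and then over S0."""
    l0 = np.zeros_like(m0)
    s0 = np.zeros_like(m0)
    history = []
    for _ in range(n_iter):
        l0 = singular_value_threshold(m0 - s0, 1.0 / mu)
        s0 = soft_threshold(m0 - l0, lam / mu)
        history.append(objective(m0, l0, s0, lam, mu))
    return l0, s0, history
```

Each update solves its sub-problem exactly, so the objective never increases, and after the sparse update every entry of the residual M0 − L0 − S0 is bounded in magnitude by λ/μ.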

The obtained $L_0$ and $S_0$ are then synthesised with the previously obtained phase spectrogram $P$, and we can obtain the time-domain low-rank signal $x_l(n)$ and sparse signal $x_s(n)$, respectively, using the inverse STFT. Then $x(n)$ is decomposed as

$$ x(n) = x_l(n) + x_s(n). \tag{15} $$
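The T–F sizes quoted above and the additive split behind (15) can be illustrated with a few lines of NumPy. The sketch assumes a Hann window and no padding at the frame edges, neither of which is specified in the text, and it uses a simple magnitude threshold as a stand-in for the RPCA output; the `stft` helper is hypothetical.

```python
import numpy as np

def stft(x, win_size, hop):
    """Windowed FFT frames stacked as columns; result has shape (F, T)."""
    window = np.hanning(win_size)
    n_frames = 1 + (len(x) - win_size) // hop
    frames = np.stack(
        [x[t * hop:t * hop + win_size] * window for t in range(n_frames)], axis=1
    )
    return np.fft.rfft(frames, axis=0)   # F = win_size // 2 + 1 frequency bins

rng = np.random.default_rng(2)
x = rng.standard_normal(5512)            # one frame at 8 bps for 44.1 kHz audio
M = stft(x, win_size=144, hop=72)
M0, P = np.abs(M), np.angle(M)           # magnitude and phase spectrograms

# Stand-in split of the magnitude; a real system would use the RPCA result.
S0 = np.where(M0 > np.percentile(M0, 95), M0, 0.0)
L0 = M0 - S0

M_low = L0 * np.exp(1j * P)              # low-rank part with the original phase
M_sparse = S0 * np.exp(1j * P)           # sparse part with the original phase
```

Since both parts reuse the same phase and their magnitudes sum to M0, the two complex spectrograms sum to M, and the linearity of the inverse STFT then gives the time-domain relation in (15).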

Watermarks are embedded into xl(n) and xs(n), respectively. Note

that, since different λ will produce different decomposition results,

the λ can be also used as a secret key for the proposed method. In

the next section, we will describe how the proposed method

improves security during the watermark embedding and detection

process and how to increase the embedding capacity of the

proposed method with a multi-bit embedding scheme.

3.2 Watermark embedding process

When xl(n) and xs(n) are obtained, the one-bit watermark w is

duplicated as wl and ws and then separately embedded into them.

The embedding process is shown in the top panel of Fig. 3.

3.2.1 Design of the echo kernel: Before watermark embedding,

the echo kernels for low-rank and sparse parts need to be designed.

To ensure the robustness of the proposed method, a new echo kernel is designed. In general, the cross-correlation describes the degree to which two functions (sequences) match each other at different relative positions. This degree can be calculated by mathematical expectation, i.e. $E(\cdot) \in [-1, 1]$. The cross-correlation turns into the autocorrelation when these two functions are identical, and when they are completely matched, a positive peak will occur, i.e. $E(\cdot) = 1$. Conversely, when they are completely mismatched, a negative peak will occur, i.e. $E(\cdot) = -1$. Inspired by [21, 28], we propose a new modified PN sequence

$$ \tilde{p}(n) = \{p(i), -p(i)\}, \tag{16} $$

where $p(i)$, $0 \le i \le L-1$, can be any PN sequence, $0 \le n \le 2L-1$, and $2L$ is determined by the frame length and is less than the frame length. For example, when $p(i) = \{-1, 1, 1, -1\}$, then $\tilde{p}(n) = \{-1, 1, 1, -1, 1, -1, -1, 1\}$.

Fig. 2 Top two panels: left panel: the result of $E(q(i-d)q(i-\tau))$ and right panel: the zoom-in observation of the left panel, where the length of $q(i)$ is $L = 1023$ and $d = 154$. Bottom two panels: left panel: the result of $E(\bar{p}(i-d)\bar{p}(i-\tau))$ and right panel: the zoom-in observation of the left panel, where the length of $\bar{p}(i)$ is $L = 1023$, $d = 154$, and $n_r = 3$

When calculating $E(\tilde{p}(n-d)\tilde{p}(n-\tau))$, in addition to the positive peak at $\tau = d$, two negative peaks appear at $\tau = d-L$ and $\tau = d+L$ with a peak value of $-0.5$ (i.e. $E(\tilde{p}(n-d)\tilde{p}(n-\tau)) \approx -0.5$), since $E(p(i-d)(-p(i-\tau))) = -1$. As shown in Fig. 4, we can clearly find three peaks in the correlation result. Therefore, detection of the watermarks using these three peaks can considerably improve the robustness.
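The three peaks can be reproduced with a short numerical check. The sketch below builds the sequence of (16) and evaluates its correlation with a shifted copy of itself, normalised by the full length 2L; the helper names are hypothetical and a random ±1 sequence would serve equally well as input.

```python
import numpy as np

def proposed_pn(p):
    """Sequence of (16): p followed by its negation."""
    p = np.asarray(p, dtype=float)
    return np.concatenate([p, -p])

def shift_correlation(seq):
    """Correlation of seq with shifted copies of itself, normalised by len(seq).

    Returns the values and the index that corresponds to zero shift (tau = d).
    """
    full = np.correlate(seq, seq, mode="full") / len(seq)
    return full, len(seq) - 1

p = np.array([-1.0, 1.0, 1.0, -1.0])     # the example sequence from the text
p_tilde = proposed_pn(p)
r, zero = shift_correlation(p_tilde)
L = len(p)
peaks = (r[zero - L], r[zero], r[zero + L])   # values at tau = d - L, d, d + L
```

At a shift of L only half of the samples overlap, and each overlapping product pairs p(i) with −p(i), which is why the side peaks have half the magnitude of the main peak.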

It should be noted that the proposed PN sequence in (16) can also be designed as

$$ \tilde{p}(n) = \{p(i), p(i)\}. \tag{17} $$

Similar to the above analysis, in addition to the positive peak at $\tau = d$, two positive peaks appear at $\tau = d-L$ and $\tau = d+L$ with a peak value of 0.5 (i.e. $E(\tilde{p}(n-d)\tilde{p}(n-\tau)) \approx 0.5$). The result of $E(\tilde{p}(n-d)\tilde{p}(n-\tau))$ for this design is shown in Fig. 5. By comparing Figs. 4 and 5, we can see that the detection performance of the two designs in (16) and (17) is the same. In the experiment, we use $\tilde{p}(n)$ as defined in (16) to evaluate the proposed method.

Compared with the previous PN sequences, the proposed PN

sequence is also regularised after modification. However, as the

proposed PN sequence is not modified in adjacent (two or three

adjacent) positions [21, 28], but over a wide range (i.e. L), it is less

regular than the previous modified PN sequences.

3.2.2 Secure embedding with opposite kernels: To improve

the security of the proposed method, we use $\tilde{p}(n)$ and $-\tilde{p}(n)$ to design the echo kernels for the low-rank and sparse parts.

In addition, we use the forward and backward echo kernels [27]

to strengthen the kernels and increase the robustness of watermark

detection. The kernels for the low-rank part and sparse part are

defined as

$$ h_l(n) = \delta(n) + \alpha\left[\sum_{j=0}^{2L-1} \tilde{p}(j)\,\delta(n-d-j) + \sum_{j=0}^{2L-1} \tilde{p}(j)\,\delta(n+d+j)\right], \tag{18} $$

$$ h_s(n) = \delta(n) + \alpha\left[\sum_{j=0}^{2L-1} (-\tilde{p}(j))\,\delta(n-d-j) + \sum_{j=0}^{2L-1} (-\tilde{p}(j))\,\delta(n+d+j)\right], \tag{19} $$

where $h_l(n)$ is the kernel for the low-rank part, $h_s(n)$ is the kernel for the sparse part, $j$ is the index of the proposed PN sequence, and the delay $d \in \{d_0, d_1\}$ in (18) and (19) is set the same for the low-rank part and the sparse part. These two echo kernels are separately applied to the low-rank signal $x_l(n)$ and the sparse signal $x_s(n)$ by convolution to obtain the watermarked low-rank signal $y_l(n)$ and the watermarked sparse signal $y_s(n)$. The final watermarked signal is obtained by

$$ \begin{aligned} y(n) &= y_l(n) + y_s(n) \\ &= x_l(n) * h_l(n) + x_s(n) * h_s(n). \end{aligned} \tag{20} $$
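A compact way to see the effect of the opposite kernels is to build both two-sided kernels and apply them to two sub-signals. The sketch below assumes that α scales both the backward and the forward echoes, truncates the output to the frame length, and uses hypothetical helper names; it is an illustration of (18) to (20), not the authors' implementation.

```python
import numpy as np

def two_sided_kernel(p_tilde, alpha, d, sign):
    """Kernel of (18) (sign = +1) or (19) (sign = -1) on a symmetric support.

    Returns the kernel array and the index that corresponds to n = 0.
    """
    half = d + len(p_tilde) - 1            # largest echo offset
    k = np.zeros(2 * half + 1)
    centre = half
    k[centre] = 1.0
    for j, v in enumerate(p_tilde):
        k[centre + d + j] += sign * alpha * v   # backward echo at n = d + j
        k[centre - d - j] += sign * alpha * v   # forward echo at n = -(d + j)
    return k, centre

def apply_kernel(x, k, centre):
    """Convolve and keep the samples aligned with the input frame."""
    return np.convolve(x, k)[centre:centre + len(x)]

def embed(x_l, x_s, p_tilde, alpha, d):
    """Watermarked frame of (20): opposite kernels on the two sub-signals."""
    k_l, centre = two_sided_kernel(p_tilde, alpha, d, +1.0)
    k_s, _ = two_sided_kernel(p_tilde, alpha, d, -1.0)
    return apply_kernel(x_l, k_l, centre) + apply_kernel(x_s, k_s, centre)
```

By linearity, the embedded echo equals α times the echo pattern convolved with x_l(n) − x_s(n), so in the degenerate case of identical sub-signals the echoes cancel completely. This is the opposing effect that hides the watermark when the whole signal is analysed.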

Fig. 3 Process of watermark embedding and detection

Fig. 4 Results of $E(\tilde{p}(n-d)\tilde{p}(n-\tau))$ for the sequence in (16), where the length of $\tilde{p}(n)$ is $2L = 1024$ and $d = 154$

Fig. 5 Results of $E(\tilde{p}(n-d)\tilde{p}(n-\tau))$ for the sequence in (17), where the length of $\tilde{p}(n)$ is $2L = 1024$ and $d = 154$

3.3 Watermark detection process

3.3.1 Detection from the low-rank part and the sparse part: The watermark detection process is shown in the bottom panel of Fig. 3. Similar to the embedding process, the watermarked signal $y(n)$ is first decomposed into the low-rank part $\hat{y}_l(n)$ and the sparse part $\hat{y}_s(n)$ using RPCA

$$ y(n) = \hat{y}_l(n) + \hat{y}_s(n), \tag{21} $$

where we use the same $\lambda$ to decompose the watermarked signal. Recalling the watermark embedding process, since the embedded echoes are separately generated by the low-rank part $x_l(n)$ and the sparse part $x_s(n)$, they should have similar low-rank and sparse properties to $x_l(n)$ and $x_s(n)$. As a result, the echoes generated by $x_l(n)$ will be assigned to $\hat{y}_l(n)$ and the echoes generated by $x_s(n)$ will be assigned to $\hat{y}_s(n)$ in the watermark detection process if we use the same $\lambda$ to decompose $y(n)$. Accordingly, we will have

$$ \hat{y}_l(n) \approx x_l(n) * h_l(n), \tag{22} $$

$$ \hat{y}_s(n) \approx x_s(n) * h_s(n). \tag{23} $$

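The real cepstrum analysis can be separately performed on the low-rank part $\hat{y}_l(n)$ and the sparse part $\hat{y}_s(n)$ for watermark detection, i.e.

$$ \begin{aligned} c_{\hat{y}_l}(n) &\approx c_{x_l}(n) + c_{h_l}(n) \\ &= \mathcal{F}^{-1}\{\log|\mathcal{F}(x_l(n))|\} + \mathcal{F}^{-1}\{\log|\mathcal{F}(h_l(n))|\}, \end{aligned} \tag{24} $$

$$ \begin{aligned} c_{\hat{y}_s}(n) &\approx c_{x_s}(n) + c_{h_s}(n) \\ &= \mathcal{F}^{-1}\{\log|\mathcal{F}(x_s(n))|\} + \mathcal{F}^{-1}\{\log|\mathcal{F}(h_s(n))|\}. \end{aligned} \tag{25} $$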

Since the forward and backward echo kernels are employed in the proposed method, by referring to [27] and the derivation of (5), the following results can be obtained

$$ c_{h_l}(n) \approx \frac{\alpha}{2(1-\alpha^2)}\bigl(\tilde{p}(-n+d) + \tilde{p}(n-d)\bigr), \tag{26} $$

$$ c_{h_s}(n) \approx -\frac{\alpha}{2(1-\alpha^2)}\bigl(\tilde{p}(-n+d) + \tilde{p}(n-d)\bigr). \tag{27} $$

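The cross-correlation is then performed on $c_{\hat{y}_l}(n)$ and $c_{\hat{y}_s}(n)$ using the proposed sequences $\tilde{p}(n)$ and $-\tilde{p}(n)$, respectively (see (6))

$$ \begin{aligned} d_l(\tau) &= E\bigl(c_{\hat{y}_l}(n)\,\tilde{p}(n-\tau)\bigr) \\ &\approx E\bigl(c_{x_l}(n)\,\tilde{p}(n-\tau)\bigr) + \frac{\alpha}{2(1-\alpha^2)}E\bigl(\tilde{p}(-n+d)\,\tilde{p}(n-\tau)\bigr) + \frac{\alpha}{2(1-\alpha^2)}E\bigl(\tilde{p}(n-d)\,\tilde{p}(n-\tau)\bigr), \end{aligned} \tag{28} $$

$$ \begin{aligned} d_s(\tau) &= E\bigl(c_{\hat{y}_s}(n)\,(-\tilde{p}(n-\tau))\bigr) \\ &\approx E\bigl(c_{x_s}(n)\,(-\tilde{p}(n-\tau))\bigr) + \frac{\alpha}{2(1-\alpha^2)}E\bigl(\tilde{p}(-n+d)\,\tilde{p}(n-\tau)\bigr) + \frac{\alpha}{2(1-\alpha^2)}E\bigl(\tilde{p}(n-d)\,\tilde{p}(n-\tau)\bigr). \end{aligned} \tag{29} $$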

According to the analysis in Section 3.2, there is one positive peak at $\tau = d$ and two negative peaks at $\tau = d-L$ and $\tau = d+L$ in the correlation results of the low-rank part and the sparse part, so we use the following equations to detect the watermarks from them

$$ \bar{d}_l(\tau) = d_l(\tau) - d_l(\tau-L) - d_l(\tau+L), \tag{30} $$

$$ \bar{d}_s(\tau) = d_s(\tau) - d_s(\tau-L) - d_s(\tau+L). \tag{31} $$

Since delta functions with different delays ($d_0$ and $d_1$) are used to embed ‘0’ and ‘1’, the watermark bit can be detected by comparing the peak values at these two positions. For the low-rank part, if $\bar{d}_l(d_0) > \bar{d}_l(d_1)$, the watermark bit is detected as $\hat{w}_l = 0$, otherwise as $\hat{w}_l = 1$. The detection process of the sparse part is performed in the same way, i.e. if $\bar{d}_s(d_0) > \bar{d}_s(d_1)$, the watermark bit $\hat{w}_s$ for the sparse part is detected as $\hat{w}_s = 0$, otherwise as $\hat{w}_s = 1$. In addition, we use a small trick to improve watermark detection performance. Here, we set the values before the delay $d$ ($d_0$ or $d_1$) of the watermarked signal to 0 to reduce their impact on the correlation operation. This will make the peaks at the delay position more pronounced and will improve the detection rate (DR).
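The peak combination in (30) and (31) and the per-part bit decision can be sketched as below. To keep the example self-contained, the correlation d(τ) is replaced by an idealised, noise-free stand-in: the shifted autocorrelation of the sequence in (16) centred at the embedded delay. That stand-in, the delays, and all function names are assumptions made for illustration.

```python
import numpy as np

def ideal_correlation(p_tilde, d_embedded):
    """Noise-free stand-in for d(tau): autocorrelation of p_tilde centred at d_embedded."""
    n = len(p_tilde)
    full = np.correlate(p_tilde, p_tilde, mode="full") / n
    zero = n - 1

    def corr(tau):
        k = tau - d_embedded
        return float(full[zero + k]) if abs(k) < n else 0.0

    return corr

def combine_peaks(corr, tau, L):
    """d_bar(tau) of (30)/(31): main peak minus the two negative side peaks."""
    return corr(tau) - corr(tau - L) - corr(tau + L)

def detect_bit(corr, d0, d1, L):
    """Bit 0 if the combined peak at d0 exceeds the one at d1, otherwise bit 1."""
    return 0 if combine_peaks(corr, d0, L) > combine_peaks(corr, d1, L) else 1
```

In the idealised case the combined value at the embedded delay is 1 + 0.5 + 0.5 = 2, twice the height of the main peak alone, while the value at the other candidate delay stays near zero.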

3.3.2 Final decision on the watermark bit: We can separately detect a one-bit watermark from the low-rank part and the sparse part of one audio frame. In the general case, if the two detected watermark bits are the same, then the final watermark bit $\hat{w}$ of the current frame is determined as the same bit, i.e.

$$ \hat{w} = \hat{w}_l = \hat{w}_s. \tag{32} $$

However, if the two detected bits differ, i.e. one of the following two cases happens: (i) $\bar{d}_l(d_0) > \bar{d}_l(d_1)$ and $\bar{d}_s(d_0) < \bar{d}_s(d_1)$ (i.e. $\hat{w}_l = 0$ and $\hat{w}_s = 1$) or (ii) $\bar{d}_l(d_0) < \bar{d}_l(d_1)$ and $\bar{d}_s(d_0) > \bar{d}_s(d_1)$ (i.e. $\hat{w}_l = 1$ and $\hat{w}_s = 0$), then the final watermark bit of the current frame is determined as follows:

(i) when $\bar{d}_l(d_0) > \bar{d}_l(d_1)$ and $\bar{d}_s(d_0) < \bar{d}_s(d_1)$:

$$ \hat{w} = \begin{cases} \hat{w}_l, & \bar{d}_l(d_0) - \bar{d}_l(d_1) > \bar{d}_s(d_1) - \bar{d}_s(d_0) \\ \hat{w}_s, & \text{otherwise;} \end{cases} \tag{33} $$

(ii) when $\bar{d}_l(d_0) < \bar{d}_l(d_1)$ and $\bar{d}_s(d_0) > \bar{d}_s(d_1)$:

$$ \hat{w} = \begin{cases} \hat{w}_s, & \bar{d}_s(d_0) - \bar{d}_s(d_1) > \bar{d}_l(d_1) - \bar{d}_l(d_0) \\ \hat{w}_l, & \text{otherwise.} \end{cases} \tag{34} $$

That is, we take into account the reliabilities of the detected

watermark bits from two parts to determine the final watermark bit.
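The decision rule of (32) to (34) amounts to trusting the part whose two combined peaks are further apart. A minimal sketch follows, where the four arguments are the combined peak values of the low-rank and sparse parts at d0 and d1; the function name is hypothetical.

```python
def final_bit(dl0, dl1, ds0, ds1):
    """Fuse the low-rank and sparse decisions following (32)-(34)."""
    w_l = 0 if dl0 > dl1 else 1
    w_s = 0 if ds0 > ds1 else 1
    if w_l == w_s:                     # (32): both parts agree
        return w_l
    if w_l == 0:                       # case (i): w_l = 0, w_s = 1
        return w_l if (dl0 - dl1) > (ds1 - ds0) else w_s
    # case (ii): w_l = 1, w_s = 0
    return w_s if (ds0 - ds1) > (dl1 - dl0) else w_l
```

For instance, `final_bit(2.0, 0.1, 0.4, 0.5)` follows the low-rank part, because its margin of 1.9 is larger than the sparse margin of 0.1.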

Note that, similar to other previous PN sequence-based echo-hiding methods, the proposed $\tilde{p}(n)$ is also required to be as long as possible to increase the inaudibility and robustness of the proposed method, which will reduce the embedding capacity. To

address this problem, we make use of the obtained two sub-signals,

i.e. the low-rank and sparse parts, of each audio frame, to embed

more bits. We propose a multi-bit embedding scheme based on the

above basic framework. This scheme is covered in more detail in

the last part of this section.

3.4 Security analysis of the proposed method

The security of the proposed method is ensured in a way that the

watermarks cannot be detected from the whole watermarked signal.

According to (20), the cepstrum analysis of the whole watermarked

signal y(n) can be written as

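$$ \begin{aligned} c_y(n) &= \mathcal{F}^{-1}\{\log|\mathcal{F}(y_l(n) + y_s(n))|\} \\ &= \mathcal{F}^{-1}\{\log|\mathcal{F}(y_l(n)) + \mathcal{F}(y_s(n))|\} \\ &= \mathcal{F}^{-1}\left\{\log\left|\mathcal{F}(y_l(n))\left(1 + \frac{\mathcal{F}(y_s(n))}{\mathcal{F}(y_l(n))}\right)\right|\right\} \\ &= \mathcal{F}^{-1}\left\{\log|\mathcal{F}(y_l(n))| + \log\left|1 + \frac{\mathcal{F}(y_s(n))}{\mathcal{F}(y_l(n))}\right|\right\}, \end{aligned} \tag{35} $$

or

$$ \begin{aligned} c_y(n) &= \mathcal{F}^{-1}\{\log|\mathcal{F}(y_l(n)) + \mathcal{F}(y_s(n))|\} \\ &= \mathcal{F}^{-1}\left\{\log\left|\mathcal{F}(y_s(n))\left(\frac{\mathcal{F}(y_l(n))}{\mathcal{F}(y_s(n))} + 1\right)\right|\right\} \\ &= \mathcal{F}^{-1}\left\{\log|\mathcal{F}(y_s(n))| + \log\left|1 + \frac{\mathcal{F}(y_l(n))}{\mathcal{F}(y_s(n))}\right|\right\}. \end{aligned} \tag{36} $$

Then $c_y(n)$ can be expressed as half of the sum of (35) and (36)

$$ \begin{aligned} c_y(n) = \frac{1}{2}\Bigl\{ &\mathcal{F}^{-1}\bigl(\log|\mathcal{F}(y_l(n))|\bigr) + \mathcal{F}^{-1}\bigl(\log|\mathcal{F}(y_s(n))|\bigr) \\ &+ \mathcal{F}^{-1}\left(\log\left|1 + \frac{\mathcal{F}(y_s(n))}{\mathcal{F}(y_l(n))}\right|\right) + \mathcal{F}^{-1}\left(\log\left|1 + \frac{\mathcal{F}(y_l(n))}{\mathcal{F}(y_s(n))}\right|\right) \Bigr\}. \end{aligned} \tag{37} $$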

Since each audio frame has its own low-rank and sparse

characteristics, we cannot confirm the analytical expressions of the

low-rank part xl(n) and the sparse part xs(n) for each frame, and

consequently, the last two terms in (37) cannot be theoretically

analysed. Therefore, we experimentally investigated whether the

last two terms can produce correlation peaks for watermark

detection when applying the correlation operation to them.

Here, we tested two cases, i.e. watermarks were separately embedded into the low-rank and sparse parts with: (i) different delays ($d_l = 10$ and $d_s = 15$); and (ii) the same delay ($d_l = d_s = 10$). The length of the PN sequence $\tilde{p}(n)$ (or $-\tilde{p}(n)$) was set as 60% of the frame length, the embedding capacity was 16 bps, $\alpha = 0.008$ and $\lambda = 0.8$. We calculated the correlation between the last two terms of (37) and the PN sequence. To obtain

the statistical results, we randomly selected 35 frames and

performed correlation on each of them. Fig. 6 shows the averaged

correlation result calculated on all 35 frames. It can be seen that there was no peak at the delay positions, regardless of whether different delays or the same delay were used. Therefore, the last two terms of (37) have an almost negligible effect on the final correlation results.

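According to the above analysis, $c_y(n)$ in (37) can be approximately written as

$$ \begin{aligned} c_y(n) &\approx \frac{1}{2}\Bigl\{\mathcal{F}^{-1}\bigl(\log|\mathcal{F}(y_l(n))|\bigr) + \mathcal{F}^{-1}\bigl(\log|\mathcal{F}(y_s(n))|\bigr)\Bigr\} \\ &= \frac{1}{2}\bigl(c_{y_l}(n) + c_{y_s}(n)\bigr). \end{aligned} \tag{38} $$

We calculated the cross-correlation between (38) and the PN sequences ($\tilde{p}(n)$ and $-\tilde{p}(n)$) to observe if there are correlation peaks. The cross-correlation between (38) and $\tilde{p}(n)$ is

$$ \begin{aligned} d(\tau) &= E\bigl(c_y(n)\,\tilde{p}(n-\tau)\bigr) \\ &\approx \frac{1}{2}E\bigl(c_{y_l}(n)\,\tilde{p}(n-\tau)\bigr) + \frac{1}{2}E\bigl(c_{y_s}(n)\,\tilde{p}(n-\tau)\bigr) \\ &\approx \frac{1}{2}E\bigl(c_{x_l}(n)\,\tilde{p}(n-\tau)\bigr) + \frac{1}{2}E\bigl(c_{x_s}(n)\,\tilde{p}(n-\tau)\bigr) \\ &\quad + \frac{\alpha}{4(1-\alpha^2)}E\bigl(\tilde{p}(-n+d)\,\tilde{p}(n-\tau)\bigr) + \frac{\alpha}{4(1-\alpha^2)}E\bigl(\tilde{p}(n-d)\,\tilde{p}(n-\tau)\bigr) \\ &\quad - \frac{\alpha}{4(1-\alpha^2)}E\bigl(\tilde{p}(-n+d)\,\tilde{p}(n-\tau)\bigr) - \frac{\alpha}{4(1-\alpha^2)}E\bigl(\tilde{p}(n-d)\,\tilde{p}(n-\tau)\bigr). \end{aligned} \tag{39} $$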

We can see from the above equation that the first two terms in (37) always produce opposite peaks at the same delay positions when they are correlated with $\tilde{p}(n)$. These peaks cancel each other in the correlation result. When using $-\tilde{p}(n)$ for correlation, we get the same result. Therefore, the correlation peaks cannot be discovered from the whole watermarked signal even if the PN sequences of the low-rank and sparse parts are known, which means the watermarks cannot be detected from the whole watermarked signal. The above analysis suggests that the proposed method is more secure compared with the previous PN sequence-based echo-hiding methods.
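The cancellation argued above can be illustrated on the kernels alone. The sketch below computes the real cepstra of the two opposite two-sided kernels, correlates each with its own sequence, and then correlates their average, as in (38), with the positive sequence. It deliberately ignores the signal-dependent terms, which the paper examines experimentally, and the parameter values and function names are illustrative assumptions.

```python
import numpy as np

def kernel_cepstrum(p_tilde, alpha, d, sign, nfft):
    """Real cepstrum of the two-sided kernel of (18) (sign=+1) or (19) (sign=-1)."""
    h = np.zeros(nfft)
    h[0] = 1.0
    for j, v in enumerate(p_tilde):
        h[d + j] += sign * alpha * v        # backward echo
        h[-(d + j)] += sign * alpha * v     # forward echo, stored circularly
    return np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(h)))))

def correlate_at(c, seq, tau):
    """Mean of c(n) seq(n - tau) over the support of seq."""
    return float(np.mean(c[tau:tau + len(seq)] * seq))

rng = np.random.default_rng(7)
p = rng.choice([-1.0, 1.0], size=128)       # random stand-in for a PN sequence
p_tilde = np.concatenate([p, -p])
alpha, d, nfft = 0.002, 20, 4096            # illustrative values only

c_hl = kernel_cepstrum(p_tilde, alpha, d, +1.0, nfft)
c_hs = kernel_cepstrum(p_tilde, alpha, d, -1.0, nfft)

peak_low = correlate_at(c_hl, p_tilde, d)                    # low-rank part with +p_tilde
peak_sparse = correlate_at(c_hs, -p_tilde, d)                # sparse part with -p_tilde
peak_whole = correlate_at(0.5 * (c_hl + c_hs), p_tilde, d)   # averaged cepstrum, as in (38)
```

Each part on its own shows a clear positive peak at τ = d, whereas the averaged cepstrum retains only small higher-order residue at that position.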

3.5 Multi-bit embedding scheme

In the above framework, the same watermark is embedded into

low-rank and sparse parts, so this does not really increase the

embedding capacity. This section describes how to embed two bits

into one frame under the proposed framework.

In general, when we embed two bits into one frame, it turns out

as the following two cases:

Case 1: Two bits are the same. In this case, the two watermark bits can be ‘0’ & ‘0’ or ‘1’ & ‘1’. In the embedding process, we use opposite PN sequences to separately embed them (‘0’ & ‘0’ or ‘1’ & ‘1’) into the low-rank part and the sparse part. In particular, the delay for both the low-rank and the sparse parts is set the same, that is, we set the delay as $d_0$ for the ‘0’ & ‘0’ case and $d_1$ for the ‘1’ & ‘1’ case. An example is shown in Figs. 7a and b. As explained in Section 3.4, the correlation peaks cannot be observed when the PN sequences (positive $\tilde{p}(n)$ or negative $-\tilde{p}(n)$) are correlated with the whole watermarked signal, since we use the same delay but opposite PN sequences. Therefore, this watermark embedding method is also secure. To correctly detect the watermark bits, we apply RPCA decomposition to the watermarked signal, and the watermarks (‘0’ & ‘0’ or ‘1’ & ‘1’) can be detected by finding the correlation peaks from the low-rank and sparse parts, respectively.

Fig. 6 Cross-correlation results between the last two terms of (37) and the PN sequence
(a) The cross-correlation results when embedding watermarks using different delays (the delays are 10 and 15, respectively), (b) The cross-correlation results when embedding watermarks using the same delay (delay is 10)

Fig. 7 Illustration of the cross-correlation results between low-rank and sparse parts with PN sequence in multi-bit embedding scheme
(a), (b) The same two watermark bits are embedded into low-rank and sparse parts, (c), (d) The different watermark bits are embedded into low-rank and sparse parts

Case 2: Two bits are different. In this case, the two watermark bits

can be ‘0’ & ‘1’ or ‘1’ & ‘0’. Unlike Case 1, we use the same PN sequence (i.e. positive $\tilde{p}(n)$) but different delays to embed watermarks into the low-rank part and the sparse part. As

shown in Figs. 7c and d, the delays for bit ‘0’ and bit ‘1’ are set as

d0 and d1, respectively. For both ‘0’ & ‘1’ and ‘1’ & ‘0’, the

correlation between the whole watermarked signal and the PN

sequence can be expressed as

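$$
\begin{aligned}
d(\tau) &= E\big(c_y(n)\,\tilde{p}(n-\tau)\big) \\
&\approx \tfrac{1}{2}E\big(c_{y_l}(n)\,\tilde{p}(n-\tau)\big) + \tfrac{1}{2}E\big(c_{y_s}(n)\,\tilde{p}(n-\tau)\big) \\
&= \tfrac{1}{2}E\big(c_{x_l}(n)\,\tilde{p}(n-\tau)\big) + \tfrac{1}{2}E\big(c_{x_s}(n)\,\tilde{p}(n-\tau)\big) \\
&\quad + \frac{\alpha}{4(1-\alpha^{2})}E\big(\tilde{p}(-n+d_0)\,\tilde{p}(n-\tau)\big) + \frac{\alpha}{4(1-\alpha^{2})}E\big(\tilde{p}(n-d_0)\,\tilde{p}(n-\tau)\big) \\
&\quad + \frac{\alpha}{4(1-\alpha^{2})}E\big(\tilde{p}(-n+d_1)\,\tilde{p}(n-\tau)\big) + \frac{\alpha}{4(1-\alpha^{2})}E\big(\tilde{p}(n-d_1)\,\tilde{p}(n-\tau)\big),
\end{aligned}
\tag{40}
$$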

i.e. there are always two peaks of similar value at τ = d0 and τ = d1 in d(τ). Therefore, it is difficult to detect the watermarks from the whole watermarked signal by comparing the peaks at d0 and d1. As in Case 1, we can correctly detect the watermarks (‘0’ & ‘1’ or ‘1’ & ‘0’) by comparing the peak positions in the low-rank and sparse parts after RPCA decomposition. Using the above scheme, we can both increase the embedding capacity and preserve the security of the multi-bit embedding method.
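The two cases can be illustrated with a small toy sketch in plain Python. It is not the authors' implementation: it ignores the host signal, the cepstrum, and the backward part of the kernel, and simply treats the echo contributed by each part as a delayed, scaled copy of a random ±1 sequence. The function names, the sequence length of 1023, and the use of a random sequence in place of the modified PN sequence are all assumptions made for illustration.

```python
import random

def choose_kernels(bit_low, bit_sparse, d0=10, d1=20):
    """Map a bit pair to (PN sign, delay) for the low-rank and sparse parts."""
    delay = (d0, d1)
    if bit_low == bit_sparse:
        # Case 1: same bits -> opposite PN sequences, same delay.
        return (+1, delay[bit_low]), (-1, delay[bit_sparse])
    # Case 2: different bits -> same PN sequence, different delays.
    return (+1, delay[bit_low]), (+1, delay[bit_sparse])

def delayed_pn(pn, sign, delay, alpha, length):
    """Toy echo term: sign * alpha * p(n - delay)."""
    out = [0.0] * length
    for n, value in enumerate(pn):
        out[n + delay] = sign * alpha * value
    return out

def correlation_at(signal, pn, lag):
    """Average of signal(n + lag) * p(n) over the PN support."""
    return sum(signal[n + lag] * value for n, value in enumerate(pn)) / len(pn)

def whole_signal_peaks(bit_low, bit_sparse, pn, d0=10, d1=20, alpha=0.008):
    """Correlation of the summed (whole-signal) echo terms at d0 and d1."""
    (sign_l, delay_l), (sign_s, delay_s) = choose_kernels(bit_low, bit_sparse, d0, d1)
    length = len(pn) + max(d0, d1)
    low = delayed_pn(pn, sign_l, delay_l, alpha, length)
    sparse = delayed_pn(pn, sign_s, delay_s, alpha, length)
    whole = [a + b for a, b in zip(low, sparse)]
    return correlation_at(whole, pn, d0), correlation_at(whole, pn, d1)

rng = random.Random(0)
pn = [rng.choice((-1.0, 1.0)) for _ in range(1023)]  # stand-in for the PN sequence
for bits in [(0, 0), (1, 1), (0, 1), (1, 0)]:
    print(bits, whole_signal_peaks(bits[0], bits[1], pn))
```

In this toy model the Case 1 echoes cancel exactly in the whole signal, so both correlation values are zero, while in Case 2 the values at d0 and d1 are equal, so the bit order cannot be inferred without separating the two parts.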

3.6 Frame synchronisation

The watermark detection process works with an assumption that

each frame is synchronised. In the proposed method, frame

synchronisation is realised as follows. Starting from the first audio

sample of the received watermarked signal, we take a frame-length

segment as a calculation unit (indexed by k) and apply RPCA

decomposition, cepstrum calculation, and correlation analysis. The

correlation values at the delay positions (d0 and d1) of the low-rank signal (denoted by $P_{l0}^{k}$ and $P_{l1}^{k}$) and those of the sparse signal (denoted by $P_{s0}^{k}$ and $P_{s1}^{k}$) are recorded. The synchronisation coefficient $c_k$ for the kth unit is calculated by

$$
c_k = \max\big(P_{l0}^{k}, P_{l1}^{k}\big) \times \max\big(P_{s0}^{k}, P_{s1}^{k}\big). \tag{41}
$$

Then we move to the next audio sample and repeat this process until the end of the signal. A synchronisation vector c is obtained as

$$
\mathbf{c} = \{c_1, c_2, \ldots, c_k, \ldots, c_K\}, \quad K = L - F + 1, \tag{42}
$$

where K is the number of units, L is the length of the watermarked signal, and F is the frame length. If a calculation unit perfectly matches a correct frame, its synchronisation coefficient will be a local maximum in c. We extract the local-maximum coefficients of c using multi-scale searching and reset the non-maximum coefficients to 0. The final synchronisation vector $\hat{\mathbf{c}}$ is normalised to [0, 1] and indicates the position of each frame.
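The post-processing of (41) and (42) can be sketched as follows, assuming the four correlation values per unit have already been computed (the RPCA, cepstrum, and correlation steps are not reproduced here). The multi-scale search is not specified in this section, so the sketch substitutes a simple fixed-window local-maximum test; that substitution, the function names, and the half-frame window are assumptions.

```python
def sync_coefficients(p_l0, p_l1, p_s0, p_s1):
    """Eq. (41) for every unit k: larger low-rank peak times larger sparse peak."""
    return [max(a, b) * max(c, d) for a, b, c, d in zip(p_l0, p_l1, p_s0, p_s1)]

def keep_local_maxima(c, half_window):
    """Keep c[k] only if it strictly exceeds all neighbours within half_window.

    A fixed window stands in for the paper's multi-scale search (assumption).
    """
    kept = [0.0] * len(c)
    for k, value in enumerate(c):
        lo = max(0, k - half_window)
        hi = min(len(c), k + half_window + 1)
        neighbours = c[lo:k] + c[k + 1:hi]
        if value > 0 and all(value > v for v in neighbours):
            kept[k] = value
    return kept

def normalise(c):
    """Scale the vector to [0, 1]; an all-zero vector is returned unchanged."""
    peak = max(c) if c else 0.0
    return [v / peak for v in c] if peak > 0 else list(c)

def frame_positions(p_l0, p_l1, p_s0, p_s1, frame_length):
    """Return the detected frame start indices and the normalised vector."""
    c = sync_coefficients(p_l0, p_l1, p_s0, p_s1)
    c_hat = normalise(keep_local_maxima(c, frame_length // 2))
    return [k for k, v in enumerate(c_hat) if v > 0], c_hat

# Synthetic example: 30 units, frames of length 10 starting at k = 0, 10, 20.
K = 30
p_l0, p_l1, p_s0, p_s1 = ([0.1] * K for _ in range(4))
p_l0[0], p_s0[0] = 0.9, 0.8      # frame carrying bit '0'
p_l1[10], p_s1[10] = 1.0, 1.0    # frame carrying bit '1'
p_l0[20], p_s0[20] = 0.7, 0.9    # frame carrying bit '0'
positions, c_hat = frame_positions(p_l0, p_l1, p_s0, p_s1, frame_length=10)
print(positions)
```

Taking the maximum over the two delays makes the coefficient independent of which bit a frame carries, which is why the same test works for every frame.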

We calculate frame synchronisation results for three typical

cases: (i) normal watermarked signal; (ii) watermarked signal with

insertion between the 14th and 15th frames; and (iii) watermarked

signal with cropping the 15th frame. The synchronisation results

are shown in Fig. 8. One can see that each frame can be correctly

synchronised for case (i) (see Fig. 8b). For cases (ii) and (iii) all the

intact frames are correctly synchronised while the inserted or

cropped segments cannot be synchronised (see Figs. 8c and d).

4 Evaluations

4.1 Database

The experimental database we used was RWC Music Database

[31], which contains 102 dual-channel audio files sampled at the

rate of 44.1 kHz and quantised with 16 bits. We took 10 s from

each audio file as experimental material and all watermarking

operations were performed on one channel only. There are many

parameters in the proposed method that affect its performance, e.g.

the length of the PN sequence, the delays, the attenuation

amplitude, the embedding capacity (bps), the RPCA decomposition

parameter etc. In the experiments, the length of the PN sequences (i.e. $\tilde{p}(n)$ and $-\tilde{p}(n)$) was fixed at 60% of the frame length. The

delays used for the low-rank and sparse parts were fixed at

d= [d0,d1] = [10, 20]. We adjusted one parameter each time and

then observed how the experimental results change with it.
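To make these settings concrete, the following sketch derives the frame length and PN-sequence length for a given embedding capacity. It assumes one watermark bit per frame in the basic framework, so that the frame length is the sampling rate divided by the capacity; this relation is not stated in this section and is an assumption, as are the function names.

```python
def frame_and_pn_length(sample_rate_hz, capacity_bps, pn_percent=60):
    """Frame length (samples) and PN length as a percentage of the frame.

    Assumes one bit per frame, i.e. frame length = sample rate / capacity.
    """
    frame_length = sample_rate_hz // capacity_bps
    pn_length = frame_length * pn_percent // 100
    return frame_length, pn_length

def sync_unit_count(signal_length, frame_length):
    """K = L - F + 1 from (42)."""
    return signal_length - frame_length + 1

for bps in (4, 8, 16, 32, 64, 128):
    print(bps, frame_and_pn_length(44100, bps))
```

Under this assumption, a capacity of 16 bps at 44.1 kHz gives frames of 2756 samples and a PN sequence of 1653 samples, and doubling the capacity halves both, which is consistent with the shorter PN sequences reported at higher capacities.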

4.2 Evaluations for inaudibility

The signal-to-noise ratio (SNR), the log-spectrum distortion (LSD)

[32], and the perceptual evaluation of audio quality (PEAQ) [33,

34] were used to measure the inaudibility of the proposed method.

The calculation of SNR is as follows:

$$
\mathrm{SNR}\,(\mathrm{dB}) = 10 \log_{10} \frac{\sum_{n} x(n)^{2}}{\sum_{n} \big(y(n) - x(n)\big)^{2}}, \tag{43}
$$

where x(n) stands for the original signal and y(n) stands for the

watermarked signal. The higher the value of SNR, the better the

audio quality.
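A direct transcription of (43) in plain Python is given below as a minimal sketch; the function name and the handling of identical signals are assumptions.

```python
import math

def snr_db(original, watermarked):
    """SNR in dB between the original x(n) and the watermarked y(n), as in (43)."""
    if len(original) != len(watermarked):
        raise ValueError("signals must have the same length")
    signal_power = sum(x * x for x in original)
    noise_power = sum((y - x) ** 2 for x, y in zip(original, watermarked))
    if noise_power == 0:
        return math.inf  # identical signals: no distortion at all
    return 10.0 * math.log10(signal_power / noise_power)

# A distortion of amplitude 0.1 on a unit-amplitude signal gives about 20 dB.
print(snr_db([1.0, 1.0, 1.0, 1.0], [1.1, 0.9, 1.1, 0.9]))
```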

Fig. 8 Illustration of frame synchronisation
(a) Watermarked signal; frame synchronisation results for (b) Normal watermarked signal, (c) Watermarked signal with insertion, (d) Watermarked signal with cropping

The LSD measures the distance between the log spectra of the original signal and the watermarked signal, and is defined as

$$
D_{\mathrm{LS}}\,(\mathrm{dB}) = \sqrt{\frac{1}{2\pi}\int_{-\pi}^{\pi}\left[10\log_{10}\frac{P_y(\omega)}{P_x(\omega)}\right]^{2}\mathrm{d}\omega}, \tag{44}
$$

where Px(ω) is the spectrum of the original signal and Py(ω) is the spectrum of the watermarked signal. An LSD below 1.0 dB indicates low distortion of the watermarked signal.
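A discrete approximation of the log-spectrum distortion can be sketched as follows. The sketch assumes that Px and Py are power spectra obtained from a DFT of a single frame, that the integral is replaced by an average over the DFT bins, and that the distortion is the root of the mean squared log-ratio; the small constant eps that guards against empty bins, the naive DFT, and the function names are likewise assumptions.

```python
import cmath
import math

def power_spectrum(frame):
    """Naive DFT power spectrum |X(k)|^2 (fine for short illustrative frames)."""
    n = len(frame)
    spectrum = []
    for k in range(n):
        acc = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        spectrum.append(abs(acc) ** 2)
    return spectrum

def lsd_db(original, watermarked, eps=1e-12):
    """Root-mean-square log-spectral difference in dB over all DFT bins."""
    px = power_spectrum(original)
    py = power_spectrum(watermarked)
    squared = [(10.0 * math.log10((b + eps) / (a + eps))) ** 2 for a, b in zip(px, py)]
    return math.sqrt(sum(squared) / len(squared))

frame = [1.0, 0.5, -0.25, 0.125, 0.3, -0.7, 0.2, 0.9]
print(lsd_db(frame, frame))                    # identical signals
print(lsd_db(frame, [2.0 * v for v in frame]))  # amplitude doubled
```

Doubling the amplitude multiplies every power-spectrum bin by four, so the second value is 10 log10(4), about 6.02 dB, which gives a quick sanity check of the implementation.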

The PEAQ measure follows ITU-R BS.1387, which covers high-quality audio signals with sampling frequencies of 44.1–48 kHz. It computes a value called the objective difference grade (ODG) by comparing the perceptual quality of the original audio signal with that of the watermarked audio signal. The ODG ranges from −4 (very annoying) to 0 (imperceptible). The acceptance threshold for PEAQ was set at an ODG of −1. We evaluated the inaudibility of the proposed method in terms of SNR, LSD, and PEAQ under different parameter settings.

4.2.1 Attenuation amplitude: For echo-hiding based methods, the

attenuation amplitude has a strong effect on inaudibility. Therefore,

we evaluated the inaudibility performance of the proposed method

as a function of attenuation amplitude. In this experiment, the

embedding capacity was fixed at 16 bps, the RPCA decomposition

parameter was fixed at λ = 0.8, and the attenuation amplitudes were

set as [0.002, 0.005, 0.008, 0.01, 0.04].

Since the kernels for both low-rank and sparse parts have

attenuation amplitudes (see (18) and (19)), we conducted three

experiments. In the first experiment, the attenuation amplitude of

the low-rank part (denoted as αl) was changed while the attenuation

amplitude of the sparse part (denoted as αs) was fixed at

αs= 0.008. In the second experiment, the αs was changed while the

αl was fixed at αl= 0.008. In the third experiment, the αl and αs

were changed simultaneously. The results of the first experiment are shown in Table 1 (I). The LSD rose steadily as αl increased, while the SNR and PEAQ varied little (and even improved slightly) for αl ≤ 0.01 but dropped clearly at αl = 0.04, that is, the sound quality of the watermarked signal was degraded at large αl. Table 1 (II) shows the results of the second experiment, in which the SNR and PEAQ decreased and the LSD increased monotonically with αs. Finally, the results of the third experiment in Table 1 (III) show that the quality of the watermarked signal decreased when αl and αs increased together. Overall, the sound quality of the watermarked signals degraded when the attenuation amplitude increased.

4.2.2 Decomposition parameter: The decomposition parameter

λ controls the balance between the low-rank part and the sparse

part. When λ becomes larger, more information will be assigned to

the low-rank part and vice versa. We checked the inaudibility of the

proposed method using λ= [0.2, 0.4, 0.6, 0.8, 1.0], where the

embedding capacity was 16 bps and the attenuation amplitudes for

the low-rank and sparse parts were αl=αs= 0.008. The

experimental results are shown in Table 2. When λ increased, the SNR, LSD, and PEAQ all improved, with the SNR and PEAQ showing the larger changes.

4.2.3 Embedding capacity (bps): In general, when more

watermarks are embedded in a signal, the inaudibility will be

degraded. Therefore, we observed the inaudibility performance of

the proposed method under different embedding capacities. In the

experiment, we set the embedding capacity as 4, 8, 16, 32, 64, and

128 bps, αl=αs= 0.008, and λ= 0.8. According to Table 3, one

can see that the sound quality of the watermarked signal was

improved as embedding capacity increased, which was very

different from most general watermarking methods. The reason for

this was that when the embedding capacity increased, the PN

sequence became shorter, that is, the number of added echoes was

reduced, leading to better inaudibility. However, as the embedding capacity increased, the detection rate (DR) decreased, as shown in the experiments of Section 4.4.1.

We also evaluated the inaudibility of the multi-bit embedding

scheme under varied embedding capacity, where αl=αs= 0.008

and λ = 0.8. The results are shown in Table 4. Compared with Table 3, the inaudibility of the multi-bit embedding scheme was only slightly lower at each embedding capacity (the SNR dropped by about 1.4–1.7 dB).

According to all these results, the proposed method satisfied inaudibility when the attenuation amplitudes of the low-rank and sparse parts satisfied αl = αs ≤ 0.01, the decomposition parameter satisfied λ ≥ 0.6, and the embedding capacity was higher than 8 bps. For the multi-bit embedding scheme, good inaudibility was obtained when the embedding capacity was higher than 16 × 2 bps.

Table 1 Inaudibility under different attenuation amplitude settings: (I) inaudibility affected by αl when αs = 0.008, (II) inaudibility affected by αs when αl = 0.008, and (III) inaudibility affected by αl and αs (αl = αs)

| αl / αs / αl & αs | (I) SNR, dB | (I) LSD, dB | (I) PEAQ (ODG) | (II) SNR, dB | (II) LSD, dB | (II) PEAQ (ODG) | (III) SNR, dB | (III) LSD, dB | (III) PEAQ (ODG) |
|---|---|---|---|---|---|---|---|---|---|
| 0.002 | 13.4466 | 0.4052 | −0.2843 | 26.1629 | 0.3313 | 0.0885 | 26.5489 | 0.1637 | 0.1363 |
| 0.005 | 14.2300 | 0.4062 | −0.2504 | 19.8797 | 0.3678 | −0.0440 | 18.9918 | 0.3164 | −0.0455 |
| 0.008 | 14.9641 | 0.4317 | −0.2307 | 14.9641 | 0.4317 | −0.2307 | 14.9641 | 0.4317 | −0.2307 |
| 0.01 | 15.3917 | 0.4566 | −0.2249 | 12.6564 | 0.4774 | −0.3493 | 13.0391 | 0.4950 | −0.3397 |
| 0.04 | 12.7118 | 0.8463 | −0.6378 | −0.5949 | 0.8947 | −1.7957 | 1.0339 | 0.9056 | −1.5929 |

Table 2 Inaudibility under different decomposition parameters λ

| Decomposition parameter λ | SNR, dB | LSD, dB | PEAQ (ODG) |
|---|---|---|---|
| 0.2 | 10.9375 | 0.5409 | −0.4695 |
| 0.4 | 11.1205 | 0.4783 | −0.4498 |
| 0.6 | 12.1600 | 0.4455 | −0.3694 |
| 0.8 | 14.9641 | 0.4317 | −0.2307 |
| 1.0 | 19.2594 | 0.4088 | −0.1256 |

Table 3 Inaudibility under different embedding capacities (bps)

| Embedding capacity, bps | SNR, dB | LSD, dB | PEAQ (ODG) |
|---|---|---|---|
| 4 | 8.6697 | 0.7356 | −0.6610 |
| 8 | 13.2167 | 0.5480 | −0.3297 |
| 16 | 14.9641 | 0.4317 | −0.2307 |
| 32 | 16.0787 | 0.3729 | −0.1580 |
| 64 | 19.9342 | 0.3161 | 0.0009 |
| 128 | 27.4793 | 0.2319 | 0.0938 |

4.3 Evaluations for security

4.3.1 Security of the proposed method: The DR is often used to

evaluate whether the watermarks can be detected and thus it can

verify the security of the proposed method. The DR is defined as

$$
\mathrm{DR} = \frac{\text{Number of correctly detected watermarks}}{\text{Number of total watermarks}} \times 100\%. \tag{45}
$$
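As a minimal sketch, (45) can be computed from the embedded and detected bit sequences as follows; the function name and the input checks are assumptions.

```python
def detection_rate(embedded_bits, detected_bits):
    """Percentage of watermark bits detected correctly, as in (45)."""
    if not embedded_bits or len(embedded_bits) != len(detected_bits):
        raise ValueError("bit sequences must be non-empty and of equal length")
    correct = sum(1 for e, d in zip(embedded_bits, detected_bits) if e == d)
    return 100.0 * correct / len(embedded_bits)

print(detection_rate([0, 1, 1, 0], [0, 1, 0, 0]))  # three of four bits correct
```

For equiprobable binary watermarks, random guessing yields a DR of about 50%, so a whole-signal DR close to 50% is what one would expect from a scheme whose watermarks cannot be read without the decomposition.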

To prove the security of the proposed method, three different

embedding schemes were compared to the proposed method. In

these experiments, we set αl=αs= 0.008 and λ= 0.8. The

embedding capacities were set as 4, 8, 16, 32, 64, and 128 bps. The

other key parameters, i.e. the delays and the PN sequences, were

taken as variables to implement different embedding schemes.

In the embedding process, we decomposed the original audio

signal into low-rank and sparse parts and then separately embedded

watermarks into them using the following four schemes:

• Scheme I: opposite PN sequences and different delays for the low-rank and sparse parts (dl = [10, 20] and ds = [15, 25]);
• Scheme II: the same PN sequence and different delays for the low-rank and sparse parts (dl = [10, 20] and ds = [15, 25]);
• Scheme III: the same PN sequence and the same delays for the low-rank and sparse parts (dl = ds = [10, 20]);
• Scheme IV: opposite PN sequences and the same delays for the low-rank and sparse parts (dl = ds = [10, 20]), i.e. the proposed method.

We detected the watermarks from the whole watermarked signals and calculated the DRs for each embedding scheme. The experimental results are compared in Fig. 9. Figs. 9a–c show that the watermarks could be correctly detected from the whole watermarked signal with the first three embedding schemes, indicating that these schemes were not secure. In particular, the DRs of scheme I and scheme II were very high at low embedding capacities. These results suggest that obvious peaks appeared in the correlation between the whole watermarked signal and the PN sequences (opposite or the same) whenever the delays for the low-rank part and the sparse part were different. The DRs of scheme III were even higher than those of scheme I and scheme II, because using the same PN sequence and the same delays strengthened the embedding. In contrast, in scheme IV the watermarks could not be detected from the whole watermarked signal with either the PN sequence of the low-rank part or that of the sparse part, as shown in Fig. 9d. These results suggest that the proposed method was more secure.
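The qualitative behaviour of the four schemes can be reproduced with a toy model in plain Python. As a simplification that is not part of the paper, each part contributes only a delayed, scaled copy of a random ±1 sequence (no host signal, no cepstrum, no backward kernel), and an attacker correlates the summed signal with the low-rank PN sequence at the low-rank delays. The names, the sequence length, and the decision rule are assumptions.

```python
import random

# (sign of low-rank PN, sign of sparse PN), low-rank delays, sparse delays
SCHEMES = {
    "I":   {"signs": (+1, -1), "d_low": (10, 20), "d_sparse": (15, 25)},
    "II":  {"signs": (+1, +1), "d_low": (10, 20), "d_sparse": (15, 25)},
    "III": {"signs": (+1, +1), "d_low": (10, 20), "d_sparse": (10, 20)},
    "IV":  {"signs": (+1, -1), "d_low": (10, 20), "d_sparse": (10, 20)},
}

def whole_signal_echo(pn, scheme, bit, alpha=0.008):
    """Sum of the toy echo terms of both parts for one watermark bit."""
    cfg = SCHEMES[scheme]
    whole = [0.0] * (len(pn) + 25)
    for sign, delays in zip(cfg["signs"], (cfg["d_low"], cfg["d_sparse"])):
        delay = delays[bit]
        for n, value in enumerate(pn):
            whole[n + delay] += sign * alpha * value
    return whole

def peak_at(whole, pn, lag):
    """Average of whole(n + lag) * p(n) over the PN support."""
    return sum(whole[n + lag] * value for n, value in enumerate(pn)) / len(pn)

def detect_from_whole(pn, scheme, bit):
    """Attacker's guess from the whole signal; None if no peak can be compared."""
    whole = whole_signal_echo(pn, scheme, bit)
    d0, d1 = SCHEMES[scheme]["d_low"]
    p0, p1 = peak_at(whole, pn, d0), peak_at(whole, pn, d1)
    if p0 == p1:
        return None
    return 0 if p0 > p1 else 1

rng = random.Random(0)
pn = [rng.choice((-1.0, 1.0)) for _ in range(1023)]  # stand-in for the PN sequence
for name in SCHEMES:
    print(name, [detect_from_whole(pn, name, bit) for bit in (0, 1)])
```

In this toy model the bit is recovered from the whole signal for schemes I to III, whereas the two echo terms of scheme IV cancel and leave nothing to compare, which mirrors the trend reported in Fig. 9.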

4.3.2 Security of the multi-bit embedding scheme: We also

observed the watermark detection performance of the multi-bit

embedding scheme. In the experiment, we set αl=αs= 0.008 and

λ = 0.8. The embedding capacities were twice those of the above experiments, i.e. 4 × 2, 8 × 2, 16 × 2, 32 × 2, 64 × 2, and

128 × 2 bps. The results in Fig. 10 demonstrate that the watermarks

could not be detected from the whole watermarked signal.

Therefore, the multi-bit embedding scheme did not compromise the

security of the proposed method.

Table 4 Inaudibility of the multi-bit embedding scheme under different embedding capacities (bps)

| Embedding capacity, bps | SNR, dB | LSD, dB | PEAQ (ODG) |
|---|---|---|---|
| 4 × 2 | 6.9664 | 0.7595 | −0.7829 |
| 8 × 2 | 11.7453 | 0.5775 | −0.3957 |
| 16 × 2 | 13.4406 | 0.4587 | −0.2959 |
| 32 × 2 | 14.5109 | 0.3917 | −0.2212 |
| 64 × 2 | 18.3256 | 0.3278 | −0.0338 |
| 128 × 2 | 26.0395 | 0.2516 | 0.0762 |

Fig. 9 DRs of four different embedding schemes (Scheme I–Scheme IV) when applying correlation to the whole watermarked signal, where the black bar shows the DRs when using the PN sequence of the low-rank part for correlation and the white bar shows the DRs when using the PN sequence of the sparse part for correlation. For Scheme II and Scheme III, the PN sequences for the low-rank part and the sparse part were the same
(a) Scheme I, (b) Scheme II, (c) Scheme III, (d) Scheme IV

Fig. 10 DRs of the multi-bit embedding scheme when applying correlation to the whole watermarked signal, where the black bar shows the DRs of the low-rank part and the white bar shows the DRs of the sparse part

We also examined the cross-correlation results for the two embedding cases, i.e. Case 1 and Case 2, where the correlation was performed between the whole watermarked signal and the PN sequences. According to Figs. 11a and b, there were no obvious peaks in the correlation results for Case 1, i.e. the watermark bits could not be detected from the whole watermarked signal. The correlation results for Case 2 are shown in Figs. 11c and d. As predicted, there were always two peaks at the delay positions, and it was difficult to infer the watermarks from such results. Therefore, the multi-bit embedding scheme was also secure and the watermarks could not be detected from the whole watermarked signal.

4.4 Evaluations for robustness

4.4.1 Evaluations for normal detection: This section evaluated

the normal detection ability of the proposed method under different

parameter settings. Watermarks were embedded into the low-rank

part and the sparse part and then detected from them, respectively.

(i) The attenuation amplitude: As in the inaudibility experiments in Section 4.2.1, three sub-experiments were conducted, in which the attenuation amplitude of the low-rank part, of the sparse part, and of both parts together was varied. Here, we set the attenuation amplitudes as [0.002, 0.005, 0.008, 0.01, 0.04], the embedding capacity as 16 bps, and λ as 0.8. The results are shown in Fig. 12. Figs. 12a and b show that when the amplitude of the low-rank part (or the sparse part) increased, its own DRs increased while the other part's DRs decreased. When αl and αs were set as 0.008 and 0.01, respectively, both the low-rank part and the sparse part had good DRs. The overall DRs, obtained by considering the reliabilities of the watermark bits detected from the two parts (see (32) to (34)), were always close to 100%, which is an advantage of the proposed method. Fig. 12c shows that when the attenuation amplitudes of both parts increased simultaneously, their DRs improved; however, when the attenuation amplitudes became too large, their DRs started to decline. In the ‘Overall’ case, the DRs in Fig. 12c were always satisfactory, except at the extremely small amplitude (0.002). Furthermore, these results also suggest that the proposed method is feasible: the low-rank part and the sparse part could be successfully separated from the whole watermarked signal, and the watermarks could be correctly detected from them, by taking advantage of the low-rank and sparse characteristics of the embedded echoes (see (22) and (23)).

Fig. 11 Correlation results of two different embedding cases: (a) and (b) are the results when embedding the same bits (‘0’ & ‘0’ and ‘1’ & ‘1’) to low-rank part and sparse part, (c) and (d) are the results when embedding different bits (‘0’ & ‘1’ and ‘1’ & ‘0’) to low-rank part and sparse part
(a) Low-rank: ‘0’ bit, Sparse: ‘0’ bit, (b) Low-rank: ‘1’ bit, Sparse: ‘1’ bit, (c) Low-rank: ‘0’ bit, Sparse: ‘1’ bit, (d) Low-rank: ‘1’ bit, Sparse: ‘0’ bit

Fig. 12 DRs under different attenuation amplitude settings, where ‘Low-rank’ denotes the DRs calculated from the low-rank part, ‘Sparse’ denotes the DRs calculated from the sparse part, and ‘Overall’ denotes the DRs calculated by considering the reliabilities of the detected watermark bits from two parts using (32) to (34)
(a) The attenuation amplitude of the sparse part is 0.008, (b) The attenuation amplitude of the low-rank part is 0.008, (c) The attenuation amplitudes of the low-rank and sparse parts are set the same, which are 0.002, 0.005, 0.008, 0.01, 0.04

Fig. 13 DRs under different λ, where ‘Low-rank’ denotes the DRs calculated from the low-rank part, ‘Sparse’ denotes the DRs calculated from the sparse part, and ‘Overall’ denotes the DRs calculated by considering the reliabilities of the detected watermark bits from two parts using (32) to (34)

(ii) The decomposition parameter: In this experiment, different values of λ were used to embed and detect the watermarks. λ was set to [0.2, 0.4, 0.6, 0.8, 1.0], the embedding capacity was 16 bps, and αl = αs = 0.008. According to Fig. 13, the DRs of the low-rank part increased while the DRs of the sparse part decreased as λ increased. This was because the low-rank part contained more information when λ became larger, which led to better robustness for that part. As a result, there is a DR trade-off between the low-rank part and the sparse part. According to Fig. 13, well-balanced DRs were obtained for both parts when λ was between about 0.6 and 1.0. In addition, the ‘Overall’ DRs were always close to 100% for all λ.

(iii) The embedding capacity: We evaluated the watermark detection performance of the proposed method under different embedding capacities. As in the previous evaluations, the embedding capacities were set as 4, 8, 16, 32, 64, and 128 bps, αl = αs = 0.008, and λ = 0.8. The results are shown in Fig. 14. The DRs decreased as the embedding capacity increased, mainly because a higher embedding capacity shortens the PN sequence, which affected the watermark detection. Comparing these results with the inaudibility results in Table 3, the proposed method achieved satisfactory robustness and inaudibility when the embedding capacity was lower than 64 bps.

We also observed the DRs of the multi-bit embedding method under different embedding capacities. The results are shown in Fig. 15. As in Fig. 14, the DRs of the multi-bit embedding method decreased as the embedding capacity increased. Moreover, since the multi-bit embedding method does not use the detection optimisation of the basic framework ((32) to (34)), it did not show better robustness than the proposed method. Optimised detection for the multi-bit embedding method will be explored in our future work.

4.4.2 Robustness against attacks: We evaluated the robustness of the proposed method against the following attacks: Gaussian-noise addition at an SNR of 20 dB (GNA), band-pass filtering (BPF), re-sampling at 16 and 22 kHz (Res. 16 and Res. 22), re-quantisation with 8 and 32 bits (Req. 8 and Req. 32), and MP3 compression at 128 and 64 kbps mono (MP3. 128 and MP3. 64). The experimental results are shown in Table 5. The proposed method was largely robust against these attacks at 4, 8, and 16 bps. However, the robustness against MP3 compression at 64 kbps mono, GNA, and BPF at high embedding capacities still needs to be improved.
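Two of these attacks are simple enough to sketch in plain Python. The exact attack implementations used in the experiments are not described in this section, so the sketch below is only one plausible reading: noise is scaled to reach a target SNR relative to the signal power, and re-quantisation assumes samples normalised to [−1, 1) with uniform rounding. Function names and the seed parameter are assumptions.

```python
import math
import random

def add_gaussian_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise so that the signal-to-noise ratio is about snr_db."""
    rng = random.Random(seed)
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_std = math.sqrt(signal_power / (10.0 ** (snr_db / 10.0)))
    return [x + rng.gauss(0.0, noise_std) for x in signal]

def requantise(signal, bits):
    """Uniformly re-quantise samples in [-1, 1) to the given bit depth."""
    levels = 2 ** (bits - 1)
    quantised = []
    for x in signal:
        q = max(-levels, min(levels - 1, round(x * levels)))  # clip to valid codes
        quantised.append(q / levels)
    return quantised

clean = [math.sin(0.01 * n) for n in range(20000)]
noisy = add_gaussian_noise(clean, snr_db=20.0)
coarse = requantise(clean, bits=8)
print(max(abs(a - b) for a, b in zip(clean, coarse)))  # at most half a step, 1/256
```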

4.5 Comparison with previous echo-hiding methods

Finally, we compared the proposed method with two typical

previous echo-hiding methods: the backward echo-hiding method

and the PN sequence-based method [18, 19].

Fig. 14 DRs under different embedding capacities, where ‘Low-rank’ denoted the DRs calculated from the low-rank part, ‘Sparse’ denoted the DRs

calculated from the sparse part, and ‘Overall’ denoted the DRs calculated by considering the reliabilities of the detected watermark bits from two parts

Fig. 15 DRs of the multi-bit embedding method under different embedding capacities


4.5.1 Comparison with backward echo-hiding method: For a fair comparison, we adjusted the attenuation amplitudes of the two methods so that their DRs were similar and then compared their inaudibility. In the experiment, both methods used the same delays (d = [10, 20]), the λ of the proposed method was set as 0.8, and the embedding capacities were 4, 8, and 16 bps.

First, we compared the watermark detection and security performance of the proposed method and the backward echo-hiding method. The DRs of the two methods were controlled to be similar in normal detection, as shown in Fig. 16a. Fig. 16b shows that the DRs of the proposed method were much lower than those of the backward echo-hiding method when the watermarks were detected from the whole watermarked signal. These results suggest that the proposed method was more secure than the backward echo-hiding method.

With the DRs of the two methods similar (see Fig. 16a), we compared their inaudibility, as shown in Fig. 17. The proposed method had SNR and LSD results similar to those of the backward echo-hiding method, while its PEAQ results were clearly better. Since PEAQ is based on a perceptual model and therefore reflects perceived audio quality more closely than SNR and LSD, we conclude that the proposed method provided better inaudibility than the backward echo-hiding method while maintaining similar DRs.

4.5.2 Comparison with previous PN-based echo-hiding method: Since both the proposed method and the original PN-based method [18, 19] are implemented with a PN sequence, we could easily keep their inaudibility similar and then compare their watermark detection and security performance. In the experiment, both methods were implemented using PN sequences of the same length (60% of the frame length) and the same delays (d = [10, 20]). The λ of the proposed method was set as 0.8. We compared the results at embedding capacities of 4, 8, and 16 bps. The inaudibility results of the two methods are shown in Fig. 18 and the DRs under similar inaudibility are shown in Fig. 19a. The proposed method and the previous PN-based echo-hiding method had similar inaudibility and DR results.

Moreover, we compared the security performance of these two methods. The DRs are shown in Fig. 19b. For the previous PN-based echo-hiding method, the watermarks could be correctly detected from the whole watermarked signal using the correct PN sequence (i.e. the PN sequence used for watermark embedding). In contrast, the watermarks could not be detected from the whole watermarked signal generated by the proposed method, even when the correct PN sequences of the low-rank part and the sparse part were used. These results suggest that the proposed method was more secure than the previous PN-based echo-hiding method.

Finally, we compared the proposed method with four typical PN-based echo-hiding methods [35–38], each at the embedding capacity of its original implementation. The robustness of these methods against several common attacks is shown in Table 6. These results suggest that the proposed method, which greatly improved the security performance, also achieved similar or even better robustness than the other PN-based echo-hiding methods.

Table 5 Robustness of the proposed method against different attacks, where inaudibility at each embedding capacity was given

Inaudibility of the watermarked signal

| Capacity | SNR (dB) | LSD (dB) | PEAQ (ODG) |
|---|---|---|---|
| 4 bps | 8.67 | 0.74 | −0.66 |
| 8 bps | 13.22 | 0.55 | −0.33 |
| 16 bps | 14.96 | 0.43 | −0.23 |
| 32 bps | 16.08 | 0.37 | −0.16 |
| 64 bps | 19.93 | 0.32 | 0.00 |
| 128 bps | 27.48 | 0.23 | 0.09 |

Detection rate, %

| Attacks | 4 bps Low-rank | 4 bps Sparse | 4 bps Overall | 8 bps Low-rank | 8 bps Sparse | 8 bps Overall | 16 bps Low-rank | 16 bps Sparse | 16 bps Overall | 32 bps Low-rank | 32 bps Sparse | 32 bps Overall | 64 bps Low-rank | 64 bps Sparse | 64 bps Overall | 128 bps Low-rank | 128 bps Sparse | 128 bps Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| GNA | 59.39 | 92.75 | 89.93 | 74.61 | 87.10 | 84.52 | 76.83 | 80.55 | 80.97 | 78.06 | 76.12 | 78.62 | 73.87 | 62.93 | 68.24 | 59.35 | 59.74 | 55.07 |
| BPF | 59.49 | 96.18 | 97.60 | 76.89 | 87.72 | 93.86 | 78.11 | 73.46 | 81.05 | 67.95 | 65.31 | 63.94 | 63.57 | 54.01 | 55.03 | 53.88 | 55.02 | 50.41 |
| res. 16 | 69.02 | 99.02 | 98.50 | 84.30 | 95.21 | 93.95 | 76.42 | 85.23 | 80.80 | 67.69 | 76.61 | 65.18 | 61.04 | 59.24 | 51.71 | 54.90 | 60.71 | 51.49 |
| res. 22 | 74.22 | 99.44 | 99.61 | 89.56 | 97.10 | 99.18 | 87.10 | 87.80 | 91.52 | 78.92 | 79.55 | 76.43 | 68.77 | 62.51 | 55.86 | 58.14 | 61.54 | 52.52 |
| req. 8 | 64.88 | 98.53 | 98.19 | 81.83 | 96.00 | 96.24 | 82.37 | 86.76 | 88.66 | 75.17 | 77.43 | 73.91 | 66.25 | 60.23 | 57.51 | 57.43 | 59.97 | 52.24 |
| req. 32 | 73.97 | 98.87 | 100.0 | 90.80 | 98.92 | 100.0 | 95.95 | 94.96 | 99.78 | 94.87 | 85.89 | 95.51 | 85.54 | 67.45 | 77.66 | 65.39 | 62.86 | 57.51 |
| MP3. 128 | 70.51 | 98.53 | 99.71 | 86.73 | 95.25 | 98.93 | 87.41 | 82.71 | 88.87 | 73.66 | 73.39 | 70.19 | 66.42 | 58.47 | 55.50 | 57.04 | 58.58 | 51.71 |
| MP3. 64 | 65.83 | 93.73 | 94.78 | 74.36 | 78.65 | 80.4 | 68.82 | 65.88 | 62.55 | 60.44 | 59.46 | 54.18 | 57.01 | 52.24 | 50.80 | 52.45 | 53.08 | 50.28 |

Fig. 16 Comparison of watermark detection and security between the proposed method and the backward echo-hiding method
(a) The DRs of the proposed method and the backward echo-hiding method were controlled to be similar, (b) The DRs of the proposed method were much lower than those of the backward echo-hiding method when detecting the watermarks over the whole watermarked signal, which suggested that the proposed method was more secure

Fig. 17 Inaudibility comparison between the proposed method and the backward echo-hiding method, where the proposed method outperformed the backward echo-hiding method on inaudibility while maintaining similar DRs (see Fig. 16a)


5 Conclusions

This paper proposed a more secure echo-hiding audio watermarking method based on an improved PN sequence and RPCA. In the proposed method, RPCA was used to decompose the original audio signal into low-rank and sparse parts, and a pair of opposite PN sequences was then employed to embed watermarks into them, which greatly improved the security of the method. In the watermark detection process, we took advantage of the low-rank and sparse characteristics of the embedded echoes to detect the watermarks separately from the two parts. Experimental results showed that the proposed method had good inaudibility and provided better security than the backward echo-hiding method and the previous PN-based echo-hiding method. The overall DR of the proposed method was always satisfactory when the reliabilities of the watermark bits detected from the low-rank and sparse parts were considered. Furthermore, the embedding capacity of the proposed method could be considerably increased compared with the previous PN-based methods by using the multi-bit embedding scheme.

6 Acknowledgments

This work was supported by the National Natural Science

Foundation of China (61902280 and 61373104), the Natural

Science Foundation of Tianjin (19JCYBJC15600 and

18JCYBJC15300), and the Scientific Research Project of Tianjin

Education Commission (19PTZWHZ00020). It was also supported

by the Program for Innovative Research Team in University of

Tianjin (TD13-5032) and Tianjin Major Project for Civil-Military

Integration of Science and Technology (18ZXJMTG00260).

7 References

[1] Diego, R., Ballesteros, L.D.M., Camilo, L.: ‘Authenticity verification of audio

signals based on fragile watermarking for audio forensics’, Expert Syst. Appl.,

2018, 91, pp. 211–222

[2] Liu, Z., Huang, Y., Huang, J.: ‘Patchwork-based audio watermarking robust

against de-synchronization and recapturing attacks’, IEEE Trans. Inf.

Forensics Secur., 2019, 14, (5), pp. 1171–1180

[3] Hua, G., Huang, J., Shi, Y.Q., et al.: ‘Twenty years of digital audio

watermarking – a comprehensive review’, Signal Process., 2016, 128, pp.

222–242

[4] Kondo, K.: ‘Towards estimation of quality of watermarked audio signal using

objective measures’. Proc. Int. Conf. Intelligent Information Hiding and

Multimedia Signal Processing, Beijing, People's Republic of China, 2013, pp.

279–282

[5] Subhashini, R., Bagan, K.B.: ‘Robust audio watermarking for monitoring and

information embedding’. Proc. Int. Conf. Signal Processing, Communication

and Networking (ICSCN), Chennai, India, 2017, pp. 1–4

[6] Mustapha, H., Bachir, B., David, M., et al.: ‘Adjustable audio watermarking algorithm based on DWPT and psychoacoustic modeling’, Multimed. Tools Appl., 2018, 77, (10), pp. 11693–11725

[7] Kanhe, A., Gnanasekaran, A.: ‘Robust image-in-audio watermarking

technique based on DCT-SVD transform’, EURASIP J. Audio Speech Music

Process., 2018, 2018, p. 16

[8] Verma, V.S., Bhardwaj, A., Jha, R.K.: ‘A new scheme for watermark

extraction using combined noise-induced resonance and support vector

machine with PCA based feature reduction’, Multimed. Tools Appl., 2019, 78,

pp. 23203–23224

[9] Kanhe, A., Gnanasekaran, A.: ‘Security of electronic patient record using

imperceptible DCT-SVD based audio watermarking technique’, Int. J.

Electron. Telecommun., 2019, 65, (1), pp. 19–24

Fig. 18 Inaudibility comparison between the proposed method and the previous PN-based echo-hiding method

Fig. 19 Comparison on watermark detection and security between the proposed method and the previous PN-based echo-hiding method

(a) The proposed method and the previous PN-based echo-hiding method had similar DRs under similar inaudibility (see Fig. 18), (b) The DRs of the proposed method were much

lower than those of the previous PN-based echo-hiding method when watermarks were detected over the whole watermarked signal using the corresponding PN sequences, i.e. the proposed method

was more secure than the previous PN-based echo-hiding method

Table 6 Comparison of the proposed method with other PN-based echo-hiding methods, where ‘–’ means the corresponding results were not provided by the method

Detection rate, %

Method      MPN [35]   DUAL [36]   OPT [37]   HRS [38]   Proposed
capacity    1 bps      1 bps       5 bps      10 bps     4 bps
normal      97.23      99.57       99.97      96.5       99.26
req. 8      88.36      91.46       96.73      85.8       98.19
GNA         74.34      82.67       93.94      76.0       89.93
MP3. 128    95.14      97.03       93.41      90.8       99.71
BPF         –          –           –          91.8       97.60

IET Signal Process., 2020, Vol. 14 Iss. 4, pp. 229-242

© The Institution of Engineering and Technology 2020


[10] Ngo, N.M., Kurkoski, B.M., Unoki, M.: ‘Robust and reliable audio

watermarking based on dynamic phase coding and error control coding’. Proc.

European Signal Processing Conf. (EUSIPCO), Nice, France, 2015,

pp. 2276–2280

[11] Xiang, Y., Natgunanathan, I., Peng, D., et al.: ‘Spread spectrum audio

watermarking using multiple orthogonal PN sequences and variable

embedding strengths and polarities’, IEEE/ACM Trans. Audio Speech Lang.

Process., 2018, 26, (3), pp. 529–539

[12] Galajit, K., Karnjana, J., Aimmanee, P., et al.: ‘Digital audio watermarking

method based on singular spectrum analysis with automatic parameter

estimation using a convolutional neural network’. Proc. Int. Conf. Intelligent

Information Hiding and Multimedia Signal Processing, Sendai, Japan, 2018,

pp. 63–73

[13] Tiwari, A., Jain, L.: ‘Digital audio watermarking using frequency masking

technique’, Int. J. Comput. Appl., 2015, 126, (4), pp. 1–7

[14] Zebbiche, K., Khelifi, F., Loukhaoukha, K.: ‘Robust additive watermarking in

the DTCWT domain based on perceptual masking’, Multimed. Tools Appl.,

2018, 77, (16), pp. 21281–21304

[15] Natgunanathan, I., Xiang, Y., Rong, Y., et al.: ‘Robust patchwork-based

embedding and decoding scheme for digital audio watermarking’, IEEE

Trans. Audio Speech Lang. Process., 2012, 20, (8), pp. 2232–2239

[16] Natgunanathan, I., Xiang, Y., Hua, G., et al.: ‘Patchwork-based multilayer

audio watermarking’, IEEE/ACM Trans. Audio Speech Lang. Process., 2017, 25,

(11), pp. 2176–2187

[17] Gruhl, D., Lu, A., Bender, W.: ‘Echo hiding’. Information Hiding, First Int.

Workshop, Cambridge, UK, 1996, pp. 293–315

[18] Ko, B.-S., Nishimura, R., Suzuki, Y.: ‘Time-spread echo method for digital

audio watermarking using PN sequences’. Proc. Int. Conf. Acoustics, Speech,

and Signal Processing, ICASSP, Orlando, FL, USA, 2002, pp. 2001–2004

[19] Ko, B.-S., Nishimura, R., Suzuki, Y.: ‘Time-spread echo method for digital

audio watermarking’, IEEE Trans. Multimed., 2005, 7, (2), pp. 212–221

[20] Hu, P., Peng, D., Yi, Z., et al.: ‘Robust time-spread echo watermarking using

characteristics of host signals’, Electron. Lett., 2016, 52, (1), pp. 5–6

[21] Natgunanathan, I., Xiang, Y., Pan, L., et al.: ‘Robustness and embedding

capacity enhancement in time-spread echo-based audio watermarking’. Proc.

Int. Conf. Industrial Electronics and Applications (ICIEA), Hefei, People's

Republic of China, 2016, pp. 1536–1541

[22] Shu, N., Kotaro, S., Senya, K.: ‘Audio watermark sharing based on time

spread echo method’, IEICE Tech. Rep., 2017, 117,

(282), pp. 23–26

[23] Shoya, O., Shunsuke, A., Takeru, M., et al.: ‘A study on the system of

detecting falsification for conference records using echo spread method and

octave similarity’, IEICE Tech. Rep., 2019, 118,

(478), pp. 57–64

[24] Oh, H., Seok, J.W., Hong, J.W., et al.: ‘New echo embedding technique for

robust and imperceptible audio watermarking’. Proc. Int. Conf. Acoustics,

Speech, and Signal Processing, ICASSP, Salt Lake City, UT, USA, 2001, pp.

1341–1344

[25] Xu, C., Wu, J., Sun, Q., et al.: ‘Applications of digital watermarking

technology in audio signals’, J. Audio Eng. Soc., 1999, 47, (10), pp. 805–812

[26] Oh, H., Kim, H.W., Seok, J.W., et al.: ‘Transparent and robust audio

watermarking with a new echo embedding technique’. Proc. Int. Conf.

Multimedia and Expo, ICME, Tokyo, Japan, 2001

[27] Kim, H.J., Choi, Y.H.: ‘A novel echo-hiding scheme with backward and

forward kernels’, IEEE Trans. Circuits Syst. Video Techn., 2003, 13, (8), pp.

885–889

[28] Xiang, Y., Peng, D., Natgunanathan, I., et al.: ‘Effective pseudo noise

sequence and decoding function for imperceptibility and robustness

enhancement in time-spread echo-based audio watermarking’, IEEE Trans.

Multimed., 2011, 13, (1), pp. 2–13

[29] Candès, E.J., Li, X., Ma, Y., et al.: ‘Robust principal component analysis?’, J.

ACM, 2011, 58, (3), pp. 1–37

[30] Huang, P.-S., Chen, S.D., Smaragdis, P., et al.: ‘Singing-voice separation from

monaural recordings using robust principal component analysis’. Proc. Int.

Conf. Acoustics, Speech and Signal Processing ICASSP, Kyoto, Japan, 2012,

pp. 57–60

[31] Goto, M., Hashiguchi, H., Nishimura, T., et al.: ‘RWC music database: music

genre database and musical instrument sound database’. Proc. Int. Conf.

Music Information Retrieval, Baltimore, MD, USA, 2003

[32] Hoffmann, E., Kolossa, D., Köhler, B.-U., et al.: ‘Using information theoretic

distance measures for solving the permutation problem of blind source

separation of speech signals’, EURASIP J. Audio Speech Music Process.,

2012, 2012, (1), pp. 1–14

[33] ITU-R: ‘Method for objective measurements of perceived audio quality’,

BS.1387, 2001

[34] Lin, Y., Waleed, H.A.: ‘Perceptual evaluation of audio watermarking using

objective quality measures’. Proc. Int. Conf. Acoustics, Speech, and Signal

Processing, ICASSP, Las Vegas, NV, USA, 2008, pp. 1745–1748

[35] Xiang, Y., Peng, D., Natgunanathan, I., et al.: ‘Effective pseudonoise

sequence and decoding function for imperceptibility and robustness

enhancement in time-spread echo-based audio watermarking’, IEEE Trans.

Multimed., 2011, 13, (1), pp. 2–13

[36] Xiang, Y., Natgunanathan, I., Peng, D., et al.: ‘A dual-channel time-spread

echo method for audio watermarking’, IEEE Trans. Inf. Forensics Secur.,

2012, 7, (2), pp. 383–392

[37] Hua, G., Goh, J., Thing, V.L.L.: ‘Time-spread echo-based audio watermarking

with optimized imperceptibility and robustness’, IEEE/ACM Trans. Audio

Speech Lang. Process., 2015, 23, (2), pp. 227–239

[38] Chen, O.T.-C., Wu, W.-C.: ‘Highly robust, secure, and perceptual-quality

echo hiding scheme’, IEEE Trans. Audio Speech Lang. Process., 2008, 16,

(3), pp. 629–638
