IMPROVING SPEECH PRIVACY IN PERSONAL SOUND ZONES
Jacob Donley, Christian Ritz and W. Bastiaan Kleijn
School of Electrical, Computer and Telecommunications Engineering, University of Wollongong, Australia
School of Engineering and Computer Science, Victoria University of Wellington, New Zealand
ABSTRACT
This paper proposes two methods for providing speech privacy be-
tween spatial zones in anechoic and reverberant environments. The
methods are based on masking the content leaked between regions.
The masking is optimised to maximise the speech intelligibility con-
trast (SIC) between the zones. The first method uses a uniform
masker signal that is combined with desired multizone loudspeaker
signals and requires acoustic contrast between zones. The second
method computes a space-time domain masker signal in parallel with
the loudspeaker signals so that the combination of the two empha-
sises the spectral masking in the targeted quiet zone. Simulations
show that it is possible to achieve a significant SIC in anechoic envi-
ronments whilst maintaining speech quality in the bright zone.
Index Terms: multizone soundfield reproduction, personal sound zones, speech privacy, speech intelligibility
1. INTRODUCTION
Using an array of loudspeakers, multizone soundfield reproduction
[1] aims to provide listeners in a target zone with their own indi-
vidual soundfield that does not interfere with other zones within the
reproduction region. In some cases, it is desirable to create zones of
quiet, where audio from neighbouring zones is suppressed or can-
celled [1, 2, 3]. The multizone approach can be used for applications
such as the creation of personal sound zones [4] in multi-participant
teleconferencing, restaurants/cafés, entertainment/cinema, vehicle
cabins and public announcement locations where the reproduction
can be optimised to provide private quiet zones.
In order to keep the sound zones personal it is necessary to min-
imise the interzone audio interference (leakage) to maximise the in-
dividual experience. The existence of leakage means that the re-
production of speech in a particular zone may be intelligible in other
zones, deviating from the desired personal sound zones. Some of the
earlier methods treat the leakage with hard constraints and attempt to
completely remove it [1, 2]. This results in zones that are mostly free
of the interference but this is difficult to achieve in situations where
a desired soundfield in the bright zone is obscured by or directed
to another zone, as the system requires reproduction signals many
times the amplitude of what is reproduced within any zone. This
is known as the multizone occlusion problem [1, 4, 5] and has been
dealt with in various ways such as the control of planarity [6], or-
thogonal basis planewaves [3] and alleviated zone constraints [3, 7].
Reproduction in reverberant rooms has also been accomplished with
enhanced acoustic contrast using sparse methods [8].
More recent work has focused on alleviating the constraint so
that the amount of leakage is controlled by a weighting function [3,
7]. Allowing the sound to leak into other zones can improve the
practicality of the system but decreases the individuality of zones.
Existing methods focus on single frequency soundfields, al-
though there has been work attempting to create multizone sound-
fields for wideband speech [9]. More recently, work has been done
[10] to extend a method [3] to the reproduction of weighted wide-
band speech soundfields by using the spatial weighting function.
This is shown in [11] to allow each zone’s acoustic content to be
controlled by dynamic space-time-frequency weighting.
To maintain speech privacy amongst the zones it is necessary to
keep the leaked speech unintelligible [12]. If the leaked speech is
at a level below the threshold of hearing then it may be expected to
start becoming inaudible and/or masked. To reproduce clear speech
in a weighted multizone soundfield at a level of 60 dBA in a zone,
known as the ‘bright’ zone, the level of leaked speech in the quiet
zone could be reduced to around 30 dBA to 35 dBA [8, 11] which
is still well above the threshold of hearing (0 dBA).
In this work we show, for the first time as far as the authors are aware, a difference, or contrast, in intelligibility across the personal sound zones, which corresponds to private sound zones. Contribu-
tions are made by evaluating the objective intelligibility of repro-
duced speech and providing methods of control for increased pri-
vacy between zones as a baseline study. A method is provided and
evaluated for increasing privacy in multizone speech soundfields in
anechoic and reverberant environments by using noise to mask the
leaked spectrum into the target quiet zone so that it becomes unin-
telligible. A third contribution is the description and analysis of an
enhanced method for increasing privacy and at the same time im-
proving perceived quality in reproductions, analysed using objective
(instrumental) measures. This is achieved by performing a weighted
multizone reproduction on the noise masker so that it has more in-
fluence in the target quiet zone and less in the target bright zone.
This paper begins with an explanation of the weighted multizone
speech soundfield method used in this work in Section 2. Noise
masking and its relation to speech intelligibility and speech privacy
are explained in Section 3. Results of the noise masking methods
and conclusions are given in Section 4 and Section 5, respectively.
2. WEIGHTED MULTIZONE SPEECH SOUNDFIELDS
The following section provides an overview of the weighted orthog-
onal basis expansion synthesis [3] and the cylindrical harmonic ex-
pansion reproduction [2] used in this work to reproduce speech in
one zone and suppress it in another. This initial step creates a wide-
band controllable contrast in the level between zones which is then
used to reduce leakage between zones.
A multizone soundfield reproduction is depicted in Fig. 1. The circular reproduction region, $D$, of radius $R$, contains three sub-regions called the bright, quiet and unattended zones, denoted by $D_b$, $D_q$ and $D \setminus (D_b \cup D_q)$, respectively. The radius of $D_b$ and $D_q$ is $r$ and their centres are located on a circle of radius $r_z$ concentric with $D$. The angle of the desired planewave in $D_b$ is $\theta$ and it is reproduced by loudspeakers positioned on an arc of angle $\phi_L$, radius $R_l$, concentric with $D$ and with the first loudspeaker at angle $\phi$.

Fig. 1. A weighted multizone soundfield reproduction layout is shown. The shading depicts the desired bright zone soundfield partially directed towards the quiet zone, causing the occlusion problem.
Any arbitrary soundfield, including the reproduction of planewave
speech, can be described by an infinite set of planewaves arriving
from all angles [13]. In the orthogonal basis expansion approach to
multizone soundfield reproduction [3] it is shown that a soundfield function, $S(\mathbf{x}, k)$, that fulfils the wave equation, where $\mathbf{x} \in D$ is an arbitrary spatial sampling point and $k$ is the wavenumber of the soundfield, can be described with an additional weighting function, $w(\mathbf{x})$. This weighting function provides relative importance to the reproduction in different zones and the weighted soundfield function used throughout this work can be written as

$$S(\mathbf{x}, k) = \sum_{j} P_j(k)\, F_j(\mathbf{x}, k), \tag{1}$$

where the coefficients for the orthogonal wavefields, $F_j(\mathbf{x}, k)$, for a given weighting function are $P_j(k)$ and $j \in \{1, \ldots, N\}$, where $N$ is the number of basis planewaves [3].
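As a concrete illustration of (1) (not part of the original paper), the following Python sketch evaluates a two-dimensional soundfield as a finite superposition of unit-amplitude planewave basis functions, assuming $F_j(\mathbf{x}, k) = e^{ik\hat{\mathbf{n}}_j \cdot \mathbf{x}}$ with arrival angles $\phi_p = (j-1)\Delta\phi$; the coefficient vector P is a placeholder rather than the optimised least-squares solution of [3]:

import numpy as np

def planewave_basis(x, k, phi_p):
    # F_j(x, k): unit-amplitude planewave arriving from angle phi_p
    n_hat = np.array([np.cos(phi_p), np.sin(phi_p)])
    return np.exp(1j * k * (x @ n_hat))

def soundfield(x, k, P, basis_angles):
    # S(x, k) = sum_j P_j(k) F_j(x, k), evaluated at points x with shape (..., 2)
    S = np.zeros(x.shape[:-1], dtype=complex)
    for P_j, phi_p in zip(P, basis_angles):
        S += P_j * planewave_basis(x, k, phi_p)
    return S

# Example: N = 16 basis planewaves, f = 1 kHz, c = 343 m/s, placeholder coefficients.
N, c, f = 16, 343.0, 1000.0
k = 2 * np.pi * f / c
basis_angles = np.arange(N) * 2 * np.pi / N          # phi_p = (j - 1) * delta_phi
P = np.zeros(N, dtype=complex)
P[0] = 1.0                                           # single planewave as a toy example
grid = np.stack(np.meshgrid(np.linspace(-1, 1, 64),
                            np.linspace(-1, 1, 64)), axis=-1)
S = soundfield(grid, k, P, basis_angles)             # complex pressure over a 2 m x 2 m grid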
The complex loudspeaker weights used to reproduce the soundfield in the time-frequency domain are defined as [14]

$$\tilde{Q}_l(k) = \sum_{m=-M}^{M} \frac{2\, e^{im\phi_l}\, \phi_s \sum_{j} P_j(k)\, i^{m} e^{-im\phi_p}}{i\pi H_m^{(1)}(kR_l)}, \tag{2}$$

where $M = kR$ is the truncation length [3], $i = \sqrt{-1}$, $R$ and $R_l$ are from Fig. 1, $\phi_p = (j-1)\Delta\phi$ are the wavefield angles, $\Delta\phi = 2\pi/N$, $\phi_l$ is the angle of the $l$th loudspeaker from the horizontal axis and $\phi_s$ is the angular spacing of the loudspeakers. Here, $P_j$ is chosen to minimise the difference between the desired soundfield and the actual soundfield [3]. In this work the frequency is $f = kc/2\pi$ [13] and $c = 343\ \mathrm{m\,s^{-1}}$ is the speed of sound.
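A minimal sketch of how (2) could be computed numerically is given below, assuming the sign conventions recovered above and rounding the truncation length up to an integer for the summation; the Hankel function comes from SciPy and the coefficients P are again placeholders:

import numpy as np
from scipy.special import hankel1

def loudspeaker_weight(k, P, basis_angles, phi_l, phi_s, R, R_l):
    # Q_l(k) for one loudspeaker at angle phi_l, following the form of (2).
    M = int(np.ceil(k * R))                  # truncation length, rounded up to an integer here
    m = np.arange(-M, M + 1)
    # inner sum over basis planewaves: sum_j P_j(k) i^m e^{-i m phi_p}
    inner = np.zeros(m.shape, dtype=complex)
    for P_j, phi_p in zip(P, basis_angles):
        inner += P_j * (1j ** m) * np.exp(-1j * m * phi_p)
    numerator = 2.0 * np.exp(1j * m * phi_l) * phi_s * inner
    denominator = 1j * np.pi * hankel1(m, k * R_l)
    return np.sum(numerator / denominator)

Evaluating this for each loudspeaker angle $\phi_l = \phi + (l-1)\phi_s$ gives the full set of weights used in (3).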
In order to reproduce planewave speech soundfields, $\tilde{Q}_l(k)$ must be applied to the speech in the time-frequency domain and inverse transformed back to the time domain to obtain the set of loudspeaker signals. This can be done by means of a Gabor transform or any unitary time-frequency transformation as

$$\tilde{q}_{al}(n) = \frac{1}{2K} \sum_{m=0}^{K-1} \tilde{Q}_l(mk)\, \tilde{Y}_a(mk)\, e^{i\pi mn/K}, \tag{3}$$

where $\tilde{Y}_a(k)$ is the discrete Fourier transform of the $a$th overlapping windowed frame of the input speech signal, $y(n)$. Each loudspeaker signal, $q_l(n)$, is reconstructed by performing overlap-add reconstruction with the synthesis window. This results in the loudspeaker signals which will reproduce the multiple zones.
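The frame-wise filtering and overlap-add reconstruction described around (3) could be implemented along the following lines; the square-root Hann window, the 64 ms frame length with 50% overlap (taken from Section 4) and the skipping of the DC bin are implementation choices assumed here, not specified by the paper:

import numpy as np

def loudspeaker_signal(y, fs, Q_l_of_k, c=343.0, frame_len=1024, hop=512):
    # Filter speech y(n) with per-frequency weights Q_l(k) via STFT and overlap-add, cf. (3).
    win = np.sqrt(np.hanning(frame_len))               # sqrt-Hann analysis/synthesis pair
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    k_bins = 2 * np.pi * freqs / c                     # wavenumber of each frequency bin
    Q = np.array([Q_l_of_k(k) if k > 0 else 0.0 for k in k_bins], dtype=complex)
    q = np.zeros(len(y))
    n_frames = 1 + (len(y) - frame_len) // hop
    for a in range(n_frames):
        seg = y[a * hop:a * hop + frame_len] * win     # a-th windowed frame
        Y = np.fft.rfft(seg)                           # Y_a(k)
        frame = np.fft.irfft(Q * Y, n=frame_len)       # weighted frame back in the time domain
        q[a * hop:a * hop + frame_len] += frame * win  # overlap-add with the synthesis window
    return q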
The observed signals, $p(\mathbf{x}, n)$, can be found at any arbitrary point in the soundfield by convolving each of the loudspeaker signals with the transfer function, $H(\mathbf{x}, \mathbf{x}_l, k)$, and summing, as

$$p(\mathbf{x}, n) = \frac{1}{2K} \sum_{l} \sum_{m=0}^{K-1} Q_l(mk)\, H(\mathbf{x}, \mathbf{x}_l, mk)\, e^{i\pi mn/K}, \tag{4}$$

where $\mathbf{x}_l$ is the position of the $l$th loudspeaker and $Q_l(k)$ is the time-frequency transform of $q_l(n)$. The soundfield can now be evaluated at any given point in the reproduction region for different input signals, and the resulting pressure, $p(\mathbf{x}, n)$, can be observed in the bright zone and the quiet zone. From this it is possible to analyse the speech intelligibility in each zone, as presented in the following section.
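A sketch of (4) for a free-field (anechoic) simulation is shown below; the two-dimensional line-source Green's function $H(\mathbf{x}, \mathbf{x}_l, k) = \frac{i}{4} H_0^{(1)}(k\|\mathbf{x}-\mathbf{x}_l\|)$ is an assumption consistent with the cylindrical-harmonic model of Section 2, and the framing parameters mirror the previous sketch:

import numpy as np
from scipy.special import hankel1

def freefield_tf(x, x_l, k):
    # Assumed 2-D free-field transfer function: H = (i/4) H_0^(1)(k ||x - x_l||).
    r = np.linalg.norm(np.asarray(x, dtype=float) - np.asarray(x_l, dtype=float))
    return 0.25j * hankel1(0, k * r) if k > 0 else 0.0

def observed_signal(q_signals, speaker_pos, x, fs, c=343.0, frame_len=1024, hop=512):
    # Approximate p(x, n) of (4): filter each loudspeaker signal with H(x, x_l, k) and sum.
    win = np.sqrt(np.hanning(frame_len))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
    k_bins = 2 * np.pi * freqs / c
    p = np.zeros(len(q_signals[0]))
    for q_l, x_l in zip(q_signals, speaker_pos):
        H = np.array([freefield_tf(x, x_l, k) for k in k_bins], dtype=complex)
        n_frames = 1 + (len(q_l) - frame_len) // hop
        for a in range(n_frames):
            seg = q_l[a * hop:a * hop + frame_len] * win
            frame = np.fft.irfft(np.fft.rfft(seg) * H, n=frame_len)
            p[a * hop:a * hop + frame_len] += frame * win
    return p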
3. PRIVATE SOUND ZONES
This section discusses the relationship between speech privacy and
intelligibility and how they are affected in a multizone soundfield
reproduction scenario. The use of the Speech Intelligibility Contrast
(SIC) is proposed for improving the privacy in personal sound zones.
3.1. Speech Privacy and Intelligibility Contrast
A measure is required to optimally design and evaluate the per-
formance of a method to control privacy in the multizone sound-
field reproduction. Speech intelligibility and privacy are highly correlated. Two measurement standards cur-
rently published for assessing speech privacy in closed and open plan
spaces are ASTM E2638 [15] and ASTM E1130 [16], respectively.
These standards are based on two different measures, which are the
Speech Privacy Class (SPC) and the Articulation Index (AI). Both
are highly correlated to speech intelligibility and the SPC has been
shown to be a better measure for higher privacy situations [12] mak-
ing it reasonable to maximise a measure of intelligibility contrast
between zones to obtain privacy.
It has been shown that objective intelligibility measures are highly correlated with subjective measures and are based on analysing spectral band powers. High mutual information between the clean speech (talker), $y(n)$, and the degraded speech (listener), $p(\mathbf{x}, n)$ from (4), is attained at high signal-to-noise ratio (SNR) [17], hence indicating that reducing the SNR, for example by adding noise, reduces intelligibility. In this work the intelligibility for two signals $x_1(n)$ and $x_2(n)$ is denoted as $I(x_1; x_2)$. The particular measure $M$ can be the mutual information, such as that provided by the Short-Time Objective Intelligibility (STOI) [18] or Speech Transmission Index (STI) [19]. The intelligibility of the pressure signal at a spatial point $\mathbf{x}$ and the signal $y(n)$ is then $I_M(p(\mathbf{x}, \cdot); y)$.
In this work the SIC is defined as

$$\mathrm{SIC}_M = \frac{1}{\|D_b\|} \int_{D_b} I_M \, d\mathbf{x} \;-\; \frac{1}{\|D_q\|} \int_{D_q} I_M \, d\mathbf{x}, \tag{5}$$

where $\|D_b\|$ and $\|D_q\|$ are the sizes of $D_b$ and $D_q$, respectively, and the domain is restricted such that $I_M$ for any $\mathbf{x} \in D_b$ is greater than or equal to $I_M$ for any $\mathbf{x} \in D_q$. The following two subsections provide two methods to maximise $\mathrm{SIC}_M$.
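In practice the zone integrals in (5) can be approximated by averaging an objective score over sampled receiver positions in each zone, as is done with the 32 receivers per zone in Section 4. The sketch below takes the intelligibility measure and the field-observation function as placeholder callables and does not enforce the domain restriction mentioned above:

import numpy as np

def speech_intelligibility_contrast(intelligibility, clean, observe, bright_points, quiet_points):
    # Approximate SIC_M of (5) as the difference of zone-averaged intelligibility scores.
    #   intelligibility(clean, degraded) -> score in [0, 1], e.g. STOI (placeholder callable)
    #   observe(x) -> degraded signal p(x, n) at position x, e.g. via the sketch after (4)
    I_bright = np.mean([intelligibility(clean, observe(x)) for x in bright_points])
    I_quiet = np.mean([intelligibility(clean, observe(x)) for x in quiet_points])
    return I_bright - I_quiet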
3.2. Improving Multizone Privacy
To maximise the SIC, $I_M$ must be zero at all points in $D_q$ whilst maintaining maximum $I_M$ at all points in $D_b$. Ideally, the mean SNR of $p(\mathbf{x}, n)$ over $D_b$ should be maintained as high as possible, so to increase $\mathrm{SIC}_M$ the mean SNR of $p(\mathbf{x}, n)$ over $D_q$ should be reduced. To maximise the SIC, noise is added to $q_l(n)$ under the constraint that the mean amplitude of $p(\mathbf{x}, n)$ over $D_q$ is less than that of $p(\mathbf{x}, n)$ over $D_b$. This then becomes a constrained optimisation dependent on the reproduced signals in the bright and quiet zones as

$$\max_{G \in \mathbb{R}} \ \mathrm{SIC}_M, \tag{6}$$

where the noise levels, $G_{\mathrm{dB}}$, of $q_l(n)$ are optimised.

To increase the SIC a time-domain noise mask, $u(n)$, is added to each loudspeaker signal, $q_l(n)$, which is derived from its time-frequency domain representation from (3). Noise is added at different gain, $G_{\mathrm{dB}}$, relative to the maximum amplitude among the $L$ loudspeaker signals, $A = \max(\{q_l(n) : l = 1, \ldots, L\})$. The noise mask is added as

$$q'_l(n) = q_l(n) + u(n)\, A\, 10^{\frac{G_{\mathrm{dB}}}{20\,\mathrm{dB}}}, \tag{7}$$

where the new loudspeaker signals are $q'_l(n)$. In this work $u(n)$ is chosen to be uniform white noise with no directivity and this method is referred to as the ‘Flat Mask’ due to its spatial and spectral uniformity.

Then, by transforming $q'_l(n)$ for use in (4), $\mathrm{SIC}_M$ is obtained from (4) and (5). Now $\mathrm{SIC}_M$ can be optimised with (6) using $G_{\mathrm{dB}}$ in (7). However, this method does not control $u(n)$ in the spatial domain and so the mean $I_M$ over $D_b$ is also reduced even though the SIC is maintained.
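A sketch of the ‘Flat Mask’ construction in (7), together with a simple grid search over $G_{\mathrm{dB}}$ standing in for the optimisation in (6), is given below; the uniform white-noise masker, the gain grid and the use of the maximum absolute amplitude for A are illustrative assumptions:

import numpy as np

def add_flat_mask(q_signals, G_dB, rng=None):
    # 'Flat Mask' of (7): add the same white-noise masker u(n) to every loudspeaker signal.
    rng = np.random.default_rng(0) if rng is None else rng
    q = np.asarray(q_signals, dtype=float)             # shape (L, n_samples)
    A = np.max(np.abs(q))                              # maximum amplitude among the L signals
    u = rng.uniform(-1.0, 1.0, size=q.shape[1])        # uniform white noise, no directivity
    return q + u[None, :] * A * 10.0 ** (G_dB / 20.0)

def best_flat_mask_gain(q_signals, sic_of_signals, gains_dB=range(-40, 25, 5)):
    # Stand-in for (6): grid search for the G_dB that maximises the measured SIC.
    scores = {g: sic_of_signals(add_flat_mask(q_signals, g)) for g in gains_dB}
    return max(scores, key=scores.get)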
3.3. Improving Multizone Privacy and Quality
Ideally a private personal sound zone system would have a maximum SIC whilst maintaining high perceptual quality in the bright zone. Adding $u(n)$ to $q_l(n)$ adds error to $p(\mathbf{x}, n)$ for all $\mathbf{x}$, which as a side-effect reduces the quality of $p(\mathbf{x}, n)$ for any $\mathbf{x} \in D_b$, and a trade-off between target quality and privacy becomes necessary. Following a similar notation to $I_M$, the quality of $p(\mathbf{x}, n)$, $\mathbf{x} \in D_b$, degraded from $y(n)$ is any speech quality assessment model of measure, $\acute{M}$, denoted by $B_{\acute{M}}(p(\mathbf{x}, \cdot); y) \in \{0, \ldots, 1\}$, scaled to match that of $I_M$. Now a new optimisation can be defined as

$$\max_{G_{\mathrm{dB}} \in \mathbb{R}} \ \mathrm{SIC}_M + \frac{\lambda}{\|D_b\|} \int_{D_b} B_{\acute{M}} \, d\mathbf{x}, \tag{8}$$

where the noise levels, $G_{\mathrm{dB}}$, are defined below, $\lambda$ is a weighting parameter for the importance of quality in the optimisation and $I_M \geq B_{\acute{M}}$ for $\mathbf{x} \in D_b$. This optimisation also requires minimum mean SNR of $p(\mathbf{x}, n)$ over $D_q$ and maximum mean SNR of $p(\mathbf{x}, n)$ over $D_b$, achieved here by applying zone weighting to $u(n)$.

To simplify the optimisation of (8) in this work, constraints are applied to the multizone reproduction of $u(n)$, which is a planewave field in $D_q$ and quiet in $D_b$. The constraints are $\theta = 0°$, so that the masker source is collocated with the leakage, and a new weighting function, $\bar{w}(\mathbf{x})$, is constrained to an importance in $D_q$ of unity, $10^{-4}$ in $D_b$ and 0.05 in the unattended zone. The remainder of the multizone reproduction is the same as used to generate $q_l(n)$ for $y(n)$.

The goal is to find another set of loudspeaker signals that would reproduce $u(n)$ in $D_q$ to control the mean SNR of $p(\mathbf{x}, n)$ over $D_q$, therefore solving (8). To do this, $u(n)$ is transformed to the time-frequency domain as $\tilde{U}_a(k)$ and used as the input signal in (3). New loudspeaker weights, $\tilde{Q}_l(k)$, are derived from (2). Then, from (3), the loudspeaker signals, $\hat{q}_l(n)$, are reconstructed and these become the new noise mask signals as

$$q''_l(n) = q_l(n) + \hat{q}_l(n)\, A\, 10^{\frac{G_{\mathrm{dB}}}{20\,\mathrm{dB}}}, \tag{9}$$

where the new loudspeaker signals are $q''_l(n)$ with noise levels, $G_{\mathrm{dB}}$. In this work this method is referred to as the ‘Zone Weighted Mask’ due to the masker signal being dependent on the multizone scenario.

Then, by transforming $q''_l(n)$ for use in (4), $\mathrm{SIC}_M$ is obtained from (4) and (5). Now $\mathrm{SIC}_M$ can be optimised with (8) using $G_{\mathrm{dB}}$ in (9). The optimisation problem can now be analysed by measuring $I_M$ for $\mathbf{x} \in D_b \cup D_q$ and $B_{\acute{M}}$ for $\mathbf{x} \in D_b$, for various $G_{\mathrm{dB}}$.
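The ‘Zone Weighted Mask’ combination in (9) and the weighted objective of (8) can be sketched as follows; the rendering of $u(n)$ into the per-loudspeaker masker signals $\hat{q}_l(n)$ is assumed to be done by the pipeline of Section 2 with the swapped zone weighting $\bar{w}(\mathbf{x})$, and the inputs here are placeholders:

import numpy as np

def add_zone_weighted_mask(q_signals, masker_signals, G_dB):
    # 'Zone Weighted Mask' of (9): add the rendered masker signals q_hat_l(n) at gain G_dB.
    q = np.asarray(q_signals, dtype=float)             # q_l(n) for the speech reproduction
    q_hat = np.asarray(masker_signals, dtype=float)    # q_hat_l(n): u(n) rendered into D_q
    A = np.max(np.abs(q))
    return q + q_hat * A * 10.0 ** (G_dB / 20.0)

def privacy_quality_objective(sic, mean_bright_quality, lam=1.0):
    # Objective of (8): SIC plus lambda-weighted mean bright-zone quality, both scaled to [0, 1].
    return sic + lam * mean_bright_quality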
4. RESULTS
This section presents objective intelligibility results for the bright
and quiet zones in anechoic and reverberant reproduction environ-
ments and discusses the SIC and quality trade-off.
4.1. Multizone Reproduction Evaluation
The layout of Fig. 1 is evaluated, where $r = 0.3$ m, $r_z = 0.6$ m, $R = 1$ m and $R_l = 1.5$ m. The value of $\theta \in \{0°, 15°, 90°\}$ is the angle of the desired planewave virtual source in the bright zone. These angles are chosen to represent multizone occlusion scenarios. Input speech signals sampled at 16 kHz are transformed to the frequency domain using an FFT and 64 ms windows with 50% overlap. The loudspeaker signals, $q'_l(n)$ and $q''_l(n)$, are generated using the methods outlined in Sections 2 and 3. The reproduction is performed for $L = 295$ and $\phi_L = 2\pi$ which, for the cases in this work, is free of aliasing problems below 8 kHz [2, 3].

The zone weights are constant and are chosen so that the bright zone weight is unity, the unattended zone weight is 0.05 times the reproduction importance of the bright zone following [3, 10] and the weight of the quiet zone is set to $10^{-4}$. Frequency-dependent zone weighting and signal filtering may give further improvements. The noise masking methods, ‘Flat Mask’ and ‘Zone Weighted Mask’, are applied with $G_{\mathrm{dB}}$ ranging from $-40$ dB to 20 dB in (7) and (9).
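For reference, the evaluation parameters stated above can be collected into a single configuration, as in the sketch below; the values are those given in the text, while the dictionary layout and the 5 dB step of the gain grid (13 levels from -40 dB to 20 dB) are assumptions:

import numpy as np

EVAL_CONFIG = {
    "zone_radius_r_m": 0.3,              # radius of D_b and D_q
    "zone_centre_radius_rz_m": 0.6,      # circle on which the zone centres lie
    "region_radius_R_m": 1.0,            # radius of D
    "array_radius_Rl_m": 1.5,            # loudspeaker arc radius
    "bright_angles_deg": (0, 15, 90),    # desired planewave angles theta
    "num_loudspeakers_L": 295,
    "array_aperture_phi_L_rad": 2 * np.pi,
    "fs_hz": 16000,
    "frame_ms": 64,
    "overlap": 0.5,
    "zone_weights": {"bright": 1.0, "quiet": 1e-4, "unattended": 0.05},
    "mask_gains_dB": list(range(-40, 25, 5)),   # 13 levels from -40 dB to 20 dB
}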
Speech files for the evaluation were taken from the TIMIT cor-
pus [20]. Twenty files were randomly selected such that the selection
was constrained to have a male to female speaker ratio of 50 : 50.
Three reverberant rooms and one anechoic environment are evaluated. The room walls have an absorption coefficient of 0.3 and the rooms are 4 m × 9 m × 3 m, 8 m × 10 m × 3 m and 9 m × 14 m × 3 m, sizes that were selected to match a small office, medium office and restaurant/café, respectively. The multizone setup is placed in the centre of the rooms and recordings are analysed from both zones, where 32 receivers are positioned randomly in each zone. Room reflections are simulated using the image method [21] with approximately $446 \times 10^3$, $206 \times 10^3$ and $149 \times 10^3$ images for each of the respective rooms, with impulse responses 0.5 s in length at a sampling frequency of 16 kHz.
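The shoebox-room simulation could be reproduced with an off-the-shelf image-method implementation; the sketch below uses the third-party pyroomacoustics package as one possibility (the package, its API and the reflection order are assumptions, and the paper's own image-method implementation [21] with 0.5 s responses is not reproduced exactly):

import numpy as np
import pyroomacoustics as pra      # third-party package; API assumed

def simulate_room(loudspeaker_signals, speaker_xyz, receiver_xyz,
                  fs=16000, room_dim=(4.0, 9.0, 3.0), max_order=20):
    # Shoebox room with uniform absorption 0.3, simulated with the image method [21].
    room = pra.ShoeBox(list(room_dim), fs=fs,
                       materials=pra.Material(0.3),    # energy absorption coefficient
                       max_order=max_order)            # reflection order (placeholder value)
    for signal, pos in zip(loudspeaker_signals, speaker_xyz):
        room.add_source(list(pos), signal=signal)
    room.add_microphone_array(pra.MicrophoneArray(np.array(receiver_xyz).T, fs))
    room.simulate()
    return room.mic_array.signals                      # one row of p(x, n) per receiver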
The reproductions are analysed using the STOI, STI and Perceptual Evaluation of Speech Quality (PESQ) [22] measures to evaluate the performance with $\mathrm{SIC}_{\mathrm{STOI}}$ and $\mathrm{SIC}_{\mathrm{STI}}$ in anechoic and reverberant environments, respectively. The STOI measure is designed for the prediction of time-frequency weighted noisy speech like the simulated recordings in this work. The STI measure is currently the only choice for a reverberant objective intelligibility measure. A good objective measure for speech quality is the PESQ measure.

Fig. 2. Mean STOI and PESQ are shown for the anechoic environment and 95% confidence intervals are indicated. BZ and QZ are the bright and quiet zone, respectively. Black dashed lines indicate optimum $G_{\mathrm{dB}}$ and $\lambda = 1$. (Panels: ‘Flat Mask’ and ‘Zone Weighted Mask’ for $\theta$ = 0°, 15° and 90°; axes: STOI (%WC) and PESQ (MOS) versus Noise Mask gain $G_{\mathrm{dB}}$; curves: BZ STOI, QZ STOI, BZ PESQ.)
The STOI and PESQ are measured in this work with the clean, $y(n)$, and degraded, $p(\mathbf{x}, n)$, speech for each file and receiver combination. The STI is measured for each receiver using the system's impulse response found with a logarithmic sine sweep. The intelligibility and quality results are then averaged over each zone as in (5) and (8). This results in three objective measures, two weighting methods, four rooms, 13 levels of added noise, 20 speech files and 64 receiver positions, totalling 332,800 data points.
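The per-receiver scoring and zone averaging could be carried out as in the sketch below, which assumes the third-party pystoi and pesq packages as stand-ins for the STOI and PESQ implementations used in the paper (whose exact tooling is not stated); the STI computed from a swept-sine impulse response is omitted here:

import numpy as np
from pystoi import stoi            # third-party STOI implementation (assumed)
from pesq import pesq              # third-party PESQ implementation (assumed)

def zone_scores(clean, zone_recordings, fs=16000):
    # Average STOI and PESQ over all receiver recordings of one zone, cf. (5) and (8).
    stoi_scores = [stoi(clean, rec, fs, extended=False) for rec in zone_recordings]
    pesq_scores = [pesq(fs, clean, rec, 'wb') for rec in zone_recordings]
    return float(np.mean(stoi_scores)), float(np.mean(pesq_scores))

# SIC_STOI for one condition is the bright-zone mean minus the quiet-zone mean, e.g.:
#   sic_stoi = zone_scores(y, bright_recordings)[0] - zone_scores(y, quiet_recordings)[0]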
4.2. Intelligibility Contrast from Noise-Based Sound Masking
Fig. 2 shows that by using the ‘Flat Mask’ method to obtain privacy between zones it is possible to obtain upwards of 85% $\mathrm{SIC}_{\mathrm{STOI}}$, but this is only possible within a small range of $G_{\mathrm{dB}}$ ($-25$ dB to $-20$ dB). The range remains the same size as the angle is increased, but the $G_{\mathrm{dB}}$ which is required to maintain $\mathrm{SIC}_{\mathrm{STOI}}$ is increased to approximately $-15$ dB. In each case of the ‘Flat Mask’ method the signal in the bright zone is of poor quality, as shown by the corresponding PESQ curve, which is undesirable.

It can be seen that $\theta$ has a small impact on the range of increased $\mathrm{SIC}_{\mathrm{STOI}}$ and it is possible to maintain above 80% $\mathrm{SIC}_{\mathrm{STOI}}$ for different angles. The effect of $\theta$ is only minor due to the large zone weighting used in the reproduction process. Fig. 2 shows that with a small change in angle, to 15°, $\mathrm{SIC}_{\mathrm{STOI}}$ remains the same and the PESQ curve starts to rise.
The effect of the spatially weighted noise maskers can be clearly seen in Fig. 2, where the use of a ‘Zone Weighted Mask’ improves $\mathrm{SIC}_{\mathrm{STOI}}$ across all scenarios. The maximum improvement occurs when $G_{\mathrm{dB}}$ is between 5 dB and 20 dB and provides a $\mathrm{SIC}_{\mathrm{STOI}}$ of greater than 95% for every scenario. Even when the occlusion problem is present it is still possible to obtain privacy with greater than 95% $\mathrm{SIC}_{\mathrm{STOI}}$ when $G_{\mathrm{dB}}$ is between 0 dB and 15 dB.

Fig. 3. Mean STI and PESQ are shown for the small office, medium office and restaurant/café, labelled as Room 1, Room 2 and Room 3, respectively. BZ and QZ are the bright and quiet zones, respectively. Vertical black lines indicate optimum $G_{\mathrm{dB}}$ and $\lambda = 1$. (Panels: ‘Flat Mask’ and ‘Zone Weighted Mask’ for $\theta$ = 0°, 15° and 90°; axes: STI and PESQ (MOS) versus Noise Mask gain $G_{\mathrm{dB}}$; curves: BZ STI, QZ STI, BZ PESQ.)
Another benefit of using the ‘Zone Weighted Mask’ is that the quality of the bright zone reproduction is increased within the region where $\mathrm{SIC}_{\mathrm{STOI}}$ is significantly large. With a $\mathrm{SIC}_{\mathrm{STOI}}$ of greater than 70% it is also possible to obtain a PESQ of greater than 3.4, reducing to 3.2 and 2.8 for a $\mathrm{SIC}_{\mathrm{STOI}}$ of 80% and 90%, respectively. This shows the trade-off between reproduction quality and zone privacy, which is controlled using $\lambda$ and may depend on the application of the private multizone system.
With the multizone reproduction in different reverberant rooms, it can be seen in Fig. 3 that a contrast in intelligibility is still possible without room equalisation. The quality is reduced, most likely due to uncontrolled early reflections inhibiting the bright zone; however, the $\mathrm{SIC}_{\mathrm{STI}}$ still remains high at various $G_{\mathrm{dB}}$, albeit reduced from an ideal anechoic environment. The maximum $\mathrm{SIC}_{\mathrm{STI}}$ is 40% and occurs with the ‘Zone Weighted Mask’ for Room 3.
5. CONCLUSIONS
This paper has investigated speech privacy between bright and quiet
zones in multizone reproduction scenarios. Methods have been pro-
posed and evaluated for increasing the speech intelligibility contrast
(SIC) in anechoic and reverberant environments showing that added
noise can be used to mask the leaked spectrum to provide a sig-
nificant SIC of higher than 95%. It has also been shown that it is
possible to maintain quality in the bright zone with a PESQ MOS of
3.2 whilst providing a SIC above 80% by using space-time domain
masker signals and that speech privacy can be achieved in reverber-
ant rooms using the methods outlined in this paper. Future work will
look into further improvement of the quality and privacy in reverber-
ant environments as well as a reduction in the number of required
loudspeakers.
6. REFERENCES
[1] M. Poletti, “An investigation of 2-D multizone surround sound systems,” in Proc. 125th Conv. Audio Eng. Soc., 2008, Audio Eng. Soc.
[2] Y. J. Wu and T. D. Abhayapala, “Spatial multizone soundfield reproduction: Theory and design,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, pp. 1711–1720, 2011.
[3] W. Jin, W. B. Kleijn, and D. Virette, “Multizone soundfield reproduction using orthogonal basis expansion,” in Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), 2013, pp. 311–315, IEEE.
[4] T. Betlehem, W. Zhang, M. Poletti, and T. D. Abhayapala, “Personal Sound Zones: Delivering interface-free audio to multiple listeners,” IEEE Signal Process. Mag., vol. 32, pp. 81–91, 2015.
[5] T. Betlehem and P. D. Teal, “A constrained optimization approach for multi-zone surround sound,” in Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), 2011, pp. 437–440, IEEE.
[6] P. Coleman, P. Jackson, M. Olik, and J. A. Pedersen, “Personal audio with a planar bright zone,” J. Acoust. Soc. of Am., vol. 136, pp. 1725–1735, 2014.
[7] H. Chen, T. D. Abhayapala, and W. Zhang, “Enhanced sound field reproduction within prioritized control region,” in INTER-NOISE and NOISE-CON Congr. and Conf. Proc., 2014, vol. 249, pp. 4055–4064, Inst. of Noise Control Eng.
[8] W. Jin and W. B. Kleijn, “Theory and design of multizone soundfield reproduction using sparse methods,” IEEE Trans. Audio, Speech, Lang. Process., vol. 23, pp. 2343–2355, 2015.
[9] N. Radmanesh and I. S. Burnett, “Generation of isolated wideband sound fields using a combined two-stage Lasso-LS algorithm,” IEEE Trans. Audio, Speech, Lang. Process., vol. 21, pp. 378–387, 2013.
[10] J. Donley and C. Ritz, “An efficient approach to dynamically weighted multizone wideband reproduction of speech soundfields,” in China Summit & Int. Conf. Signal and Inform. Process. (ChinaSIP), 2015, pp. 60–64, IEEE.
[11] J. Donley and C. Ritz, “Multizone reproduction of speech soundfields: A perceptually weighted approach,” in Asia-Pacific Signal & Inform. Process. Assoc. Annu. Summit and Conf. (APSIPA ASC), 2015, pp. 342–345, IEEE.
[12] B. N. Gover and J. S. Bradley, “ASTM metrics for rating speech privacy of closed rooms and open plan spaces,” Canadian Acoust., vol. 39, pp. 50–51, 2011.
[13] E. G. Williams, Fourier Acoustics: Sound Radiation and Nearfield Acoustical Holography, Academic Press, 1999.
[14] Y. J. Wu and T. D. Abhayapala, “Theory and design of soundfield reproduction using continuous loudspeaker concept,” IEEE Trans. Audio, Speech, Lang. Process., vol. 17, pp. 107–116, 2009.
[15] Standard test method for objective measurement of the speech privacy provided by a closed room, ASTM Int. E2638-10, 2010.
[16] Standard test method for objective measurement of speech privacy in open plan spaces using articulation index, ASTM Int. E1130-08, 2008.
[17] W. B. Kleijn, J. B. Crespo, R. C. Hendriks, P. Petkov, B. Sauert, and P. Vary, “Optimizing speech intelligibility in a noisy environment: A unified view,” IEEE Signal Process. Mag., vol. 32, pp. 43–54, 2015.
[18] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of time-frequency weighted noisy speech,” IEEE Trans. Audio, Speech, Lang. Process., vol. 19, pp. 2125–2136, 2011.
[19] Sound system equipment - Part 16: Objective rating of speech intelligibility by speech transmission index, IEC 60268-16, 2003.
[20] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, N. Dahlgren, and V. Zue, “TIMIT acoustic-phonetic continuous speech corpus,” Linguistic Data Consortium, 1993.
[21] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoust. Soc. of Am., vol. 65, pp. 943–950, 1979.
[22] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Int. Conf. on Acoust., Speech and Signal Process. (ICASSP), 2001, pp. 749–752, IEEE.