... When personal audio systems are used for entertainment purposes, e.g. [5][6][7], poor separation between the listening zones can be regarded as an inconvenience, but certain applications of personal audio technology in public spaces could have more significant consequences regarding listeners' privacy [8,9]. For example, a personal audio system could be used to transmit sensitive conversations between staff and customers through security partitions in places such as banks, surgeries or pharmacy counters. ...
... As well as the standard considerations of reverberation [11,12], background noise [13], filter design [5,14-17], loudspeaker selection [6,18] and zonal geometry [19,20], speech privacy control requires an understanding of human speech perception to be integrated into the design. A successful system must ensure that speech intended for the target listener is clear and intelligible, whilst also guaranteeing that listeners outside of the target region cannot understand these messages [8]. ...
... However, this measure does not capture the essence of what listeners expect a speech privacy control system to deliver: a significant difference in the intelligibility of a speech signal between the zones. This ratio has been described as the speech intelligibility contrast [8]. ...
Multi-zone sound field control allows individuals to listen to personalised audio content whilst sharing a physical space. Applications of this technology include home entertainment, audio reproduction in public spaces such as museums, shops or exhibitions, and providing areas where the privacy of sensitive communication can be safeguarded without the need for physical barriers. The problem of transmitting a speech signal to a single listener and reducing the intelligibility of that signal elsewhere is the focus of the present thesis. The motivation behind the presented experiments and simulations is to identify the practical trade-offs that must be considered in the design of these "Speech Privacy Control" systems.
Conventional personal audio systems use loudspeaker array processing to produce a bright zone for the intended user of the system and a dark zone where silence is desired. However, established performance metrics and system optimisation techniques do not necessarily yield privacy for the target listener, as attenuated speech may remain intelligible within the dark zone. A system is proposed that focusses a synthetic masking signal into the dark zone to selectively reduce the intelligibility of the leaked speech. Privacy is ensured by adjusting the masker to meet pre-defined constraints on the speech intelligibility in each zone. This design methodology utilises information from speech intelligibility tests and subjective preference evaluations in order to improve the utility and acceptability of such systems for all nearby listeners.
In addition to the design of the masking signal, the performance of a speech privacy control system is affected by the loudspeaker array design and the location of the listening zones. These effects are explored using experimental measurements of a loudspeaker array in a room, and the results are used to select two system configurations for additional evaluation using listening tests. The perceived performance of a system is also affected by the surrounding acoustic environment, notably due to reverberation and background noise, which may change over time. The effects of room reverberation are investigated using image source simulations and acoustical measurements within a room, and the performance is evaluated in terms of the achievable level of acoustic contrast, the difference in speech intelligibility between zones, and the masking signal levels that are required to achieve privacy. A proposal is made to further enhance privacy by combining the effects of background noise and artificial masking signals. This method reduces the level of acoustic contrast that is required to achieve a given level of privacy, compared to the case where the masking is provided by the background noise alone.
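The acoustic contrast referred to above is commonly defined as the ratio of mean squared pressure in the bright zone to that in the dark zone, expressed in dB. The following is a minimal sketch of that metric, together with one standard way of maximising it (acoustic contrast control via the dominant eigenvector of a regularised matrix). It is not taken from the thesis: the free-field monopole model, the eight-element line array, the zone positions, the 1 kHz test frequency and all function names are assumptions made for illustration.

```python
import numpy as np

def monopole_tf(sources, points, freq, c=343.0):
    """Free-field monopole transfer functions, shape (n_points, n_sources)."""
    k = 2 * np.pi * freq / c
    d = np.linalg.norm(points[:, None, :] - sources[None, :, :], axis=-1)
    return np.exp(-1j * k * d) / (4 * np.pi * d)

def acoustic_contrast_db(G_bright, G_dark, w):
    """Ratio of mean squared pressure in the bright and dark zones, in dB."""
    e_bright = np.mean(np.abs(G_bright @ w) ** 2)
    e_dark = np.mean(np.abs(G_dark @ w) ** 2)
    return 10 * np.log10(e_bright / e_dark)

def contrast_control_weights(G_bright, G_dark, rel_reg=1e-6):
    """Loudspeaker weights maximising bright/dark energy (regularised)."""
    R_b = G_bright.conj().T @ G_bright / G_bright.shape[0]
    R_d = G_dark.conj().T @ G_dark / G_dark.shape[0]
    n = R_d.shape[0]
    # Regularisation scaled to the mean eigenvalue of R_d (assumed choice).
    reg = rel_reg * np.trace(R_d).real / n
    vals, vecs = np.linalg.eig(np.linalg.solve(R_d + reg * np.eye(n), R_b))
    w = vecs[:, np.argmax(vals.real)]
    return w / np.linalg.norm(w)

def square_zone(cx, cy, half=0.1, n=3):
    """n x n grid of control points centred on (cx, cy) in the plane z = 0."""
    ax = np.linspace(-half, half, n)
    gx, gy = np.meshgrid(ax, ax)
    return np.column_stack([cx + gx.ravel(), cy + gy.ravel(), np.zeros(n * n)])

# Hypothetical geometry: 8-element line array, zones 1.5 m in front of it.
speakers = np.column_stack([np.linspace(-0.35, 0.35, 8), np.zeros(8), np.zeros(8)])
G_b = monopole_tf(speakers, square_zone(-0.5, 1.5), freq=1000.0)
G_d = monopole_tf(speakers, square_zone(0.5, 1.5), freq=1000.0)

uniform_db = acoustic_contrast_db(G_b, G_d, np.ones(8) / np.sqrt(8))
acc_db = acoustic_contrast_db(G_b, G_d, contrast_control_weights(G_b, G_d))
```

A high value of `acc_db` says only that the dark zone is quieter than the bright zone. As the abstract notes, attenuated speech may still be intelligible there, which is why the thesis evaluates intelligibility separately from acoustic contrast.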
Voice interaction, as an emerging human-computer interaction method, has gained great popularity, especially on smart devices. However, due to the open nature of voice signals, voice interaction may cause privacy leakage. In this paper, we propose a novel scheme, called SeVI, to protect voice interaction from being deliberately or unintentionally eavesdropped. SeVI actively generates jamming noise of superior characteristics while a user is performing voice interaction with his/her device, so that attackers cannot obtain the voice contents of the user. Meanwhile, the device leverages its prior knowledge of the generated noise to adaptively cancel the received noise, even when the device usage environment is changing due to movement, so that the user's voice interactions are unaffected. SeVI relies only on an ordinary microphone and speakers and can be implemented as light-weight software. We have implemented SeVI on a commercial off-the-shelf (COTS) smartphone and conducted extensive real-world experiments. The results demonstrate that SeVI can defend against both online eavesdropping attacks and offline digital signal processing (DSP) analysis attacks.
Reproducing zones of personal sound is a challenging signal processing problem which has garnered considerable research interest in recent years. We introduce in this work an extended method for multizone soundfield reproduction which overcomes issues with speech privacy and quality. Measures of Speech Intelligibility Contrast (SIC) and speech quality are used as cost functions in an optimisation of speech privacy and quality. Novel spatial and (temporal) frequency domain speech masker filter designs are proposed to accompany the optimisation process. Spatial masking filters are designed using multizone soundfield algorithms which are dependent on the target speech multizone reproduction. Combinations of estimates of acoustic contrast and long term average speech spectra are proposed to provide equal masking influence on speech privacy and quality. Spatial aliasing specific to multizone soundfield reproduction geometry is further considered in analytically derived low-pass filters. Simulated and real-world experiments are conducted to verify the performance of the proposed method using semi-circular and linear loudspeaker arrays. Simulated implementations of the proposed method show that significant speech intelligibility contrast and speech quality are achievable between zones. A range of Perceptual Evaluation of Speech Quality (PESQ) Mean Opinion Scores (MOS) that indicate good quality are obtained while at the same time providing confidential privacy as indicated by SIC. The simulations also show that the method is robust to variations in the speech, virtual source location, array geometry and number of loudspeakers. Real-world experiments confirm the practicality of the proposed methods by showing that good quality and confidential privacy are achievable.
This paper proposes and evaluates an efficient approach for practical reproduction of multizone soundfields for speech sources. The reproduction method, based on a previously proposed approach, utilises weighting parameters to control the soundfield reproduced in each zone whilst minimising the number of loudspeakers required. Proposed here is an interpolation scheme for predicting the weighting parameter values of the multizone soundfield model that otherwise requires significant computational effort. It is shown that initial computation time can be reduced by a factor of 1024 with only -85 dB of error in the reproduced soundfield relative to reproduction without interpolated weighting parameters. The perceptual impact on the quality of the speech reproduced using the method is also shown to be negligible. By using pre-saved soundfields determined using the proposed approach, practical reproduction of dynamically weighted multizone soundfields of wideband speech could be achieved in real-time. Index Terms: multizone soundfield reproduction, wideband multizone soundfield, weighted multizone soundfield, look-up tables (LUT), interpolation, sound field synthesis (SFS)
Sound rendering is increasingly being required to extend over certain regions of space for multiple listeners, known as personal sound zones, with minimum interference to listeners in other regions. In this article, we present a systematic overview of the major challenges that have to be dealt with for multizone sound control in a room. Sound control over multiple zones is formulated as an optimization problem, and a unified framework is presented to compare two state-of-the-art sound control techniques. While conventional techniques have been focusing on point-to-point audio processing, we introduce a wave-domain sound field representation and active room compensation for sound pressure control over a region of space. The design of directional loudspeakers is presented and the advantages of using arrays of directional sources are illustrated for sound reproduction, such as better control of sound fields over wide areas and reduced total number of loudspeaker units, thus making it particularly suitable for establishing personal sound zones.
Reproduction of multiple sound zones, in which personal audio programs may be consumed without the need for headphones, is an active topic in acoustical signal processing. Many approaches to sound zone reproduction do not consider control of the bright zone phase, which may lead to self-cancellation problems if the loudspeakers surround the zones. Conversely, control of the phase in a least-squares sense comes at a cost of decreased level difference between the zones and frequency range of cancellation. Single-zone approaches have considered plane wave reproduction by focusing the sound energy into a point in the wavenumber domain. In this article, a planar bright zone is reproduced via planarity control, which constrains the bright zone energy to impinge from a narrow range of angles via projection into a spatial domain. Simulation results using a circular array surrounding two zones show the method to produce superior contrast to the least-squares approach, and superior planarity to the contrast maximization approach. Practical performance measurements obtained in an acoustically treated room verify the conclusions drawn under free-field conditions.
A recent approach to surround sound is to perform exact control of the sound field over a region of space. Here, the driving signals for an array of loudspeakers are chosen to create a desired sound field over an extended area. An interesting subtopic is multi-zone surround sound, where two or more listeners can experience totally independent sound fields. However, multi-zone surround sound is a challenge because implementation can be very non-robust. We formulate multi-zone sound reproduction as a convex optimization problem, where the sound energy leakage into other listener zones is limited to fixed levels, and a constraint is placed on the loudspeaker weights to improve the robustness. An interior point algorithm is devised for computing the loudspeaker weights, and its performance is compared with least squares approaches of multi-zone reproduction in typical two-zone cases.
The prohibitive number of speakers required for the reproduction of isolated soundfields is the major limitation preventing solution deployment. This paper addresses the provision of personal soundfields (zones) to multiple listeners using a limited number of speakers with an underlying assumption of fixed virtual sources. For such multizone systems, optimization of speaker positions and weightings is important to reduce the number of active speakers. Typically, single stage optimization is performed, but in this paper a new two-stage pressure matching optimization is proposed for wideband sound sources. In the first stage, the least-absolute shrinkage and selection operator (Lasso) is used to select the speakers' positions for all sources and frequency bands. A second stage then optimizes reproduction using all selected speakers on the basis of a regularized least-squares (LS) algorithm. The performance of the new, two-stage approach is investigated for different reproduction angles, frequency range and variable total speaker weight powers. The results demonstrate that using two-stage Lasso-LS optimization can give up to 69 dB improvement in the mean squared error (MSE) over a single-stage LS in the reproduction of two isolated audio signals within control zones using e.g. 84 speakers.
In the development process of noise-reduction algorithms, an objective machine-driven intelligibility measure which shows high correlation with speech intelligibility is of great interest. Besides reducing time and costs compared to real listening experiments, an objective intelligibility measure could also help provide answers on how to improve the intelligibility of noisy unprocessed speech. In this paper, a short-time objective intelligibility measure (STOI) is presented, which shows high correlation with the intelligibility of noisy and time-frequency weighted noisy speech (e.g., resulting from noise reduction) of three different listening experiments. In general, STOI showed better correlation with speech intelligibility compared to five other reference objective intelligibility models. In contrast to other conventional intelligibility models which tend to rely on global statistics across entire sentences, STOI is based on shorter time segments (386 ms). Experiments indeed show that it is beneficial to take segment lengths of this order into account. In addition, a free Matlab implementation is provided.
Image methods are commonly used for the analysis of the acoustic properties of enclosures. In this paper we discuss the theoretical and practical use of image techniques for simulating, on a digital computer, the impulse response between two points in a small rectangular room. The resulting impulse response, when convolved with any desired input signal, such as speech, simulates room reverberation of the input signal. This technique is useful in signal processing or psychoacoustic studies. The entire process is carried out on a digital computer so that a wide range of room parameters can be studied with accurate control over the experimental conditions. A FORTRAN implementation of this model has been included.
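The image technique described in this abstract can be sketched in a few lines of Python. This is an illustrative simplification and not the FORTRAN implementation mentioned above. It assumes a single reflection coefficient `beta` shared by all six walls, rounds each arrival to the nearest sample, and truncates the image lattice by index rather than by time. The function name and all parameter names are hypothetical.

```python
import itertools
import numpy as np

def image_source_rir(room, src, mic, beta=0.8, max_index=3,
                     fs=16000, c=343.0, length=4096):
    """Impulse response between src and mic in a rectangular room.

    room, src, mic: (x, y, z) in metres; beta: wall reflection coefficient
    (assumed equal for all walls); max_index: range of the image lattice.
    """
    room, src, mic = (np.asarray(v, dtype=float) for v in (room, src, mic))
    h = np.zeros(length)
    lattice = range(-max_index, max_index + 1)
    for n in itertools.product(lattice, repeat=3):
        for q in itertools.product((0, 1), repeat=3):
            n_arr, q_arr = np.array(n), np.array(q)
            # Image position: mirror the source (q = 1) and shift by 2 n L.
            image = (1 - 2 * q_arr) * src + 2 * n_arr * room
            # Number of wall reflections along this image path.
            reflections = int(np.sum(np.abs(n_arr - q_arr) + np.abs(n_arr)))
            dist = np.linalg.norm(image - mic)
            sample = int(round(dist / c * fs))
            if sample < length:
                # Spherical spreading and wall losses; nearest-sample delay.
                h[sample] += beta ** reflections / (4 * np.pi * dist)
    return h

# Hypothetical 5 m x 4 m x 3 m room with the source and microphone at 1.5 m height.
h = image_source_rir((5.0, 4.0, 3.0), (1.0, 1.0, 1.5), (3.0, 2.0, 1.5))
```

Convolving a dry signal with the result, for example `np.convolve(speech, h)`, gives a reverberant version of that signal, as the abstract describes. Because the lattice is truncated by index, the late part of `h` is incomplete for small values of `max_index`.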
Surround sound systems can produce a desired sound field over an extended region of space by using higher order Ambisonics. One application of this capability is the production of multiple independent soundfields in separate zones. This paper investigates multi-zone surround systems for the case of two dimensional reproduction. A least squares approach is used for deriving the loudspeaker weights for producing a desired single frequency wave field in one of N zones, while producing silence in the other N-1 zones. It is shown that reproduction in the active zone is more difficult when an inactive zone is in-line with the virtual sound source and the active zone. Methods for controlling this problem are discussed.
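A least squares design of this kind can be illustrated with a small numerical sketch. The paper derives its weights in a higher order Ambisonics setting; the code below instead uses regularised pressure matching at sampled control points, which is a swapped-in and simpler formulation of the same least squares idea. The free-field monopole model, the 32-element circular array of radius 2 m, the zone positions, the 500 Hz frequency and the regularisation value are all assumptions for illustration.

```python
import numpy as np

def green(sources, points, k):
    """Free-field monopole transfer matrix, shape (n_points, n_sources)."""
    d = np.linalg.norm(points[:, None, :] - sources[None, :, :], axis=-1)
    return np.exp(-1j * k * d) / (4 * np.pi * d)

def disc_points(cx, cy, radius=0.3, n=5):
    """Grid points inside a disc centred on (cx, cy) in the plane z = 0."""
    ax = np.linspace(-radius, radius, n)
    gx, gy = np.meshgrid(ax, ax)
    keep = gx ** 2 + gy ** 2 <= radius ** 2 + 1e-12
    return np.column_stack([cx + gx[keep], cy + gy[keep], np.zeros(keep.sum())])

def ls_weights(G, p_des, rel_reg=1e-6):
    """Regularised least squares: minimise |G w - p_des|^2 + reg |w|^2."""
    n = G.shape[1]
    A = G.conj().T @ G
    reg = rel_reg * np.trace(A).real / n  # assumed scaling of the penalty
    return np.linalg.solve(A + reg * np.eye(n), G.conj().T @ p_des)

k = 2 * np.pi * 500.0 / 343.0  # wavenumber at 500 Hz
phi = np.linspace(0, 2 * np.pi, 32, endpoint=False)
speakers = 2.0 * np.column_stack([np.cos(phi), np.sin(phi), np.zeros(32)])
active, silent = disc_points(-0.6, 0.0), disc_points(0.6, 0.0)
G_a, G_s = green(speakers, active, k), green(speakers, silent, k)

# Desired field: unit plane wave travelling along +y in the active zone,
# silence in the other zone.
p_a = np.exp(-1j * k * active[:, 1])
p_des = np.concatenate([p_a, np.zeros(len(silent))])
w = ls_weights(np.vstack([G_a, G_s]), p_des)

# Normalised reproduction error in the active zone and level difference (dB).
err_db = 10 * np.log10(np.sum(np.abs(G_a @ w - p_a) ** 2) / np.sum(np.abs(p_a) ** 2))
level_diff_db = 10 * np.log10(np.mean(np.abs(G_a @ w) ** 2) / np.mean(np.abs(G_s @ w) ** 2))
```

Here the plane wave travels perpendicular to the line joining the two zones. Changing its direction to lie along the x axis places the silent zone in line with the virtual source and the active zone, which is the configuration the abstract identifies as more difficult.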
This article discusses the relationship between the two metrics, and their suitability for use in any type of space, including spaces not fitting the definition of either open or closed. The E2638 method provides a rating of the average performance of a closed room - without any assumptions as to talker location - to each of a number of listener positions outside the room, close to the room boundaries. E2638 includes a table of categories that identifies the frequency with which speech sounds would be audible or intelligible for various SPC values. The high correlation for both metrics implies both are useful for rating intelligibility over a wide range. The two current ASTM metrics for rating speech privacy of building spaces are highly correlated, and both seem well suited for use in conditions where speech is intelligible, such as in open plan spaces.
Higher-order ambisonics has been identified as a robust technique for synthesizing a desired sound field. However, the synthesis algorithm requires a large number of secondary sources to derive the optimal results for large reproduction regions and over high operating frequencies. This paper proposes an enhanced method for synthesizing the sound field using a relatively small number of secondary sources which allows improved synthesizing accuracy for certain subregions of the zone of interest. This method introduces the spherical harmonic translation into the mode matching algorithm to acquire a uniform modal-domain representation of the sound fields within different sub-regions. Then by changing the weighting of each region, the least mean squares solution can be easily controlled to cater for certain prioritized reproduction requirements. Simulations show that this technique can effectively improve the matching accuracy of a given sub-region, while only slightly increasing the global reproduction error. This method is shown to be especially effective in situations where the number of secondary sources is limited.
Multizone soundfield reproduction over an extended spatial region is a challenging problem in acoustic signal processing. We introduce a method of reproducing a multizone soundfield within a desired region in reverberant environments. It is based on the identification of the acoustic transfer function (ATF) from the loudspeaker over the desired reproduction region using a limited number of microphone measurements. We assume that the soundfield is sparse in the domain of planewave decomposition and identify the ATF using sparse methods. The estimates of the ATFs are then used to derive the optimal least-squares solution for the loudspeaker filters that minimize the reproduction error over the entire reproduction region. Simulations confirm that the method leads to a significantly reduced number of required microphones for accurate multizone sound reproduction, while it also facilitates the reproduction over a wide frequency range. Practical experiments are used to verify the sparse planewave representation of the reverberant soundfield in a real-world listening environment.
Modern communication technology facilitates communication from anywhere to anywhere. As a result, low speech intelligibility has become a common problem, which is exacerbated by the lack of feedback to the talker about the rendering environment. In recent years, a range of algorithms has been developed to enhance the intelligibility of speech rendered in a noisy environment. We describe methods for intelligibility enhancement from a unified vantage point. Before one defines a measure of intelligibility, the level of abstraction of the representation must be selected. For example, intelligibility can be measured on the message, the sequence of words spoken, the sequence of sounds, or a sequence of states of the auditory system. Natural measures of intelligibility defined at the message level are mutual information and the hit-or-miss criterion. The direct evaluation of high-level measures requires quantitative knowledge of human cognitive processing. Lower-level measures can be derived from higher-level measures by making restrictive assumptions. We discuss the implementation and performance of some specific enhancement systems in detail, including speech intelligibility index (SII)-based systems and systems aimed at enhancing the sound-field where it is perceived by the listener. We conclude with a discussion of the current state of the field and open problems.
We introduce a method for 2-D spatial multizone soundfield reproduction based on describing the desired multizone soundfield as an orthogonal expansion of basis functions over the desired reproduction region. This approach finds the solution to the Helmholtz equation that is closest to the desired soundfield in a weighted least squares sense. The orthogonal basis set is formed using QR factorization, taking a suitable set of solutions of the Helmholtz equation as input. The coefficients of the Helmholtz solution wavefields can then be calculated, reducing the multizone sound reproduction problem to the reconstruction of a set of basis wavefields over the desired region. The method facilitates its application with a more practical loudspeaker configuration. The approach is shown to be effective for both accurately reproducing sound in the selected bright zone and minimizing sound leakage into the predefined quiet zone.
Spatial multizone soundfield reproduction over an extended region of open space is a complex and challenging problem in acoustic signal processing. In this paper, we provide a framework to recreate 2-D spatial multizone soundfields using a single array of loudspeakers which encompasses all spatial regions of interest. The reproduction is based on the derivation of an equivalent global soundfield consisting of a number of individual multizone soundfields. This is achieved by using spatial harmonic coefficients translation between coordinate systems. A multizone soundfield reproduction problem is then reduced to the reproduction over the entire region. An important advantage of this approach is the full use of the available dimensionality of the soundfield. This paper provides quantitative performances of a 2-D multizone system and reveals some fundamental limits on 2-D multizone soundfield reproduction. The extensions of the multizone soundfield reproduction design in reverberant rooms are also included.
Intended for use as both a textbook and a reference, "Fourier Acoustics" develops the theory of sound radiation uniquely from the viewpoint of Fourier Analysis. This powerful perspective of sound radiation provides the reader with a comprehensive and practical understanding which will enable him or her to diagnose and solve sound and vibration problems in the 21st Century. As a result of this perspective, "Fourier Acoustics" is able to present thoroughly and simply, for the first time in book form, the theory of nearfield acoustical holography, an important technique which has revolutionised the measurement of sound. Relying little on material outside the book, "Fourier Acoustics" will be invaluable as a graduate level text as well as a reference for researchers in academia and industry. It covers the physics of wave propagation and sound vibration in homogeneous media. It deals with acoustics, such as radiation of sound and radiation from vibrating surfaces; inverse problems, such as the theory of nearfield acoustical holography; and the mathematics of specialized functions, such as spherical harmonics.
Reproduction of a soundfield is a fundamental problem in acoustic signal processing. A common approach is to use an array of loudspeakers to reproduce the desired field where the least-squares method is used to calculate the loudspeaker weights. However, the least-squares method involves matrix inversion which may lead to errors if the matrix is poorly conditioned. In this paper, we use the concept of theoretical continuous loudspeaker on a circle to derive the discrete loudspeaker aperture functions by avoiding matrix inversion. In addition, the aperture function obtained through continuous loudspeaker method reveals the underlying structure of the solution as a function of the desired soundfield, the loudspeaker positions, and the frequency. This concept can also be applied for the 3-D soundfield reproduction using spherical harmonics analysis with a spherical array. Results are verified through computer simulations.
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862, replacing P.861 which specified PSQM and MNB.
Multizone reproduction of speech soundfields: A perceptually weighted approach