- Pasi Pertilä
The work in this project has two main objectives. The first is to localize the sound event in terms of azimuth and elevation with respect to the microphone. The second is to recognize the type of sound event (e.g., car, bus, train, or speech) and identify its temporal onset and offset. These objectives will be pursued using machine learning methods on real-life multichannel audio recordings in which more than one sound can occur simultaneously.
Mean square error (MSE) has been the preferred loss function in current deep neural network (DNN) based speech separation techniques. In this paper, we propose a new cost function aimed at optimizing the extended short-time objective intelligibility (ESTOI) measure. We focus on applications where low algorithmic latency ($\leq 10$ ms) is important. We use long short-term memory (LSTM) networks and evaluate the proposed approach on four sets of two-speaker mixtures from the extended Danish hearing in noise test (HINT) dataset. We show that the proposed loss function offers improved or on-par objective intelligibility (in terms of ESTOI) compared to an MSE-optimized baseline, while resulting in lower objective separation performance (in terms of the source-to-distortion ratio, SDR). We then propose an approach in which the network is first initialized with weights optimized for the MSE criterion and then trained with the proposed ESTOI loss criterion. This approach mitigates some of the loss in objective separation performance while preserving the gains in objective intelligibility.
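The two-stage training described above can be sketched as follows. This is a minimal PyTorch sketch under stated assumptions: the network size, the tensor shapes, and `envelope_correlation_loss` (a crude differentiable proxy for ESTOI, which is built on correlations of normalized spectro-temporal envelope segments) are illustrative stand-ins, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MaskLSTM(nn.Module):
    """Unidirectional LSTM mask estimator (unidirectional keeps latency low)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(n_freq, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag):                 # (batch, frames, freq)
        h, _ = self.lstm(mix_mag)
        return torch.sigmoid(self.out(h))       # T-F mask in [0, 1]

def envelope_correlation_loss(est, ref, eps=1e-8):
    """Negative linear correlation of temporal envelopes per frequency band,
    a simplified proxy for the envelope correlations underlying ESTOI."""
    est = est - est.mean(dim=1, keepdim=True)
    ref = ref - ref.mean(dim=1, keepdim=True)
    corr = (est * ref).sum(dim=1) / (est.norm(dim=1) * ref.norm(dim=1) + eps)
    return 1.0 - corr.mean()

model = MaskLSTM()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
mix, target = torch.rand(8, 100, 257), torch.rand(8, 100, 257)  # dummy data

# Stage 1: MSE pretraining; stage 2: fine-tuning with the proxy-ESTOI loss.
for loss_fn in (nn.MSELoss(), envelope_correlation_loss):
    for _ in range(10):
        est = model(mix) * mix                  # masked mixture magnitude
        loss = loss_fn(est, target)
        opt.zero_grad(); loss.backward(); opt.step()
```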
In this paper, we propose a convolutional recurrent neural network for joint sound event localization and detection (SELD) of multiple overlapping sound events in three-dimensional (3D) space. The proposed network takes a sequence of consecutive spectrogram time-frames as input and maps it to two outputs in parallel. As the first output, sound event detection (SED) is performed as a multi-label multi-class classification task on each time-frame, producing temporal activity for all sound event classes. As the second output, localization is performed by estimating the 3D Cartesian coordinates of the direction of arrival (DOA) for each sound event class using multi-class regression. The proposed method is able to associate multiple DOAs with the respective sound event labels and to track this association over time. The method uses the phase and magnitude components of the spectrogram computed on each audio channel as separate features, thereby avoiding any method- and array-specific feature extraction. The method is evaluated on five Ambisonic and two circular-array format datasets with different overlapping sound events in anechoic, reverberant, and real-life scenarios, and is compared against two SED, three DOA estimation, and one SELD baseline. The results show that the proposed method is generic to array structures and robust to unseen DOA labels, reverberation, and low-SNR scenarios. Compared to the respective standalone baselines, the proposed joint estimation of DOA and SED resulted in a consistently higher recall of the estimated number of DOAs across datasets.
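A hedged sketch of the two-branch architecture follows; the layer sizes, feature dimensions, and class count are illustrative rather than the paper's exact configuration, and the input is assumed to be stacked per-channel magnitude and phase spectrograms.

```python
import torch
import torch.nn as nn

class SELDNetSketch(nn.Module):
    def __init__(self, n_in=8, n_classes=11, n_freq=64):  # e.g. 4 mics x (mag, phase)
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(n_in, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d((1, 8)),                          # pool frequency only,
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),    # keep time resolution
            nn.MaxPool2d((1, 8)),
        )
        self.gru = nn.GRU(64 * (n_freq // 64), 128,        # freq pooled 8*8-fold
                          batch_first=True, bidirectional=True)
        self.sed = nn.Linear(256, n_classes)               # activity per class
        self.doa = nn.Linear(256, 3 * n_classes)           # (x, y, z) per class

    def forward(self, x):                                  # (batch, ch, frames, freq)
        h = self.conv(x)
        b, c, t, f = h.shape
        h = h.permute(0, 2, 1, 3).reshape(b, t, c * f)     # to (batch, frames, feat)
        h, _ = self.gru(h)
        # SED: multi-label class activity; DOA: Cartesian coordinates per class.
        return torch.sigmoid(self.sed(h)), torch.tanh(self.doa(h))
```

The sigmoid branch gives frame-wise multi-label activity, while the tanh branch regresses Cartesian DOA coordinates per class; pairing the two outputs class-by-class is what lets the model track which direction belongs to which event.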
- Apr 2018
This paper presents an algorithm for multichannel sound source separation using explicit modeling of level and time differences in the source spatial covariance matrices (SCM). We propose a novel SCM model in which the spatial properties are modeled by a weighted sum of direction-of-arrival (DOA) kernels. The DOA kernels are obtained by combining phase- and level-difference covariance matrices, representing both time and level differences between microphones for a grid of predefined source directions. The proposed SCM model is combined with a non-negative matrix factorization (NMF) model for the magnitude spectrograms. In contrast to other SCM models in the literature, source localization is implicit in the model and is estimated during the signal factorization, so no localization pre-processing is required. Parameters are estimated using complex-valued non-negative matrix factorization (CNMF) with both the Euclidean distance and the Itakura-Saito divergence. The separation performance of the proposed system is evaluated using the two-channel SiSEC development dataset and four-channel signals recorded in a regular room with moderate reverberation. Finally, a comparison to other state-of-the-art methods shows that the proposed method achieves better separation performance in terms of SIR and perceptual measures.
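To make the DOA-kernel construction concrete, the following NumPy sketch builds rank-one covariance kernels for a grid of candidate directions. The array geometry is hypothetical, and the flat `gains` vector is a placeholder where a level-difference model between microphones would enter.

```python
import numpy as np

def doa_kernels(mic_pos, doa_grid, freqs, c=343.0):
    """Kernels W_o(f) = a_o(f) a_o(f)^H for each candidate DOA o, where the
    steering vector a_o(f) encodes inter-microphone time differences (phase)
    and, via `gains`, level differences (left flat here for simplicity)."""
    M = len(mic_pos)
    kernels = np.zeros((len(doa_grid), len(freqs), M, M), dtype=complex)
    for o, u in enumerate(doa_grid):           # u: unit vector toward source
        delays = mic_pos @ u / c               # relative delay per mic (s)
        gains = np.ones(M)                     # level-difference model goes here
        for i, f in enumerate(freqs):
            a = gains * np.exp(-2j * np.pi * f * delays)
            kernels[o, i] = np.outer(a, a.conj())
    return kernels

# Example: 4-microphone square array and a 5-degree azimuth grid.
mics = np.array([[0.05, 0.05, 0], [-0.05, 0.05, 0],
                 [-0.05, -0.05, 0], [0.05, -0.05, 0]])
az = np.deg2rad(np.arange(0, 360, 5))
grid = np.stack([np.cos(az), np.sin(az), np.zeros_like(az)], axis=1)
W = doa_kernels(mics, grid, freqs=np.linspace(0, 8000, 257))
```

A source SCM is then modeled as a non-negative weighted sum over these kernels, so estimating the weights amounts to localizing the source during factorization.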
In this paper we propose a method for the separation of moving sound sources. The method first tracks the sources, then estimates the source spectrograms using multichannel non-negative matrix factorization (NMF), and finally extracts the sources from the mixture by single-channel Wiener filtering. We propose a novel multichannel NMF model with time-varying mixing of the sources, represented by spatial covariance matrices (SCM), and provide update equations that optimize the model parameters by minimizing the squared Frobenius norm. The SCMs of the model are obtained from the estimated directions of arrival of the tracked sources at each time frame. The evaluation is based on established objective separation criteria and uses real recordings of two and three simultaneously moving sound sources. The compared methods include conventional beamforming and ideal ratio mask separation. The proposed method is shown to exceed the separation quality of the other evaluated blind approaches on all measured quantities. Additionally, we evaluate the method's sensitivity to tracking errors by comparing against the separation quality achieved with annotated ground-truth source trajectories.
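The final extraction step, single-channel Wiener filtering driven by the NMF magnitude estimates, might look like this sketch; the STFT parameters and the shapes of the magnitude estimates are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def extract_source(mix_ref, v_target, v_sum, fs=16000, nperseg=1024):
    """Extract one source from a reference channel: v_target is the target's
    estimated magnitude spectrogram from the multichannel NMF model, v_sum
    the sum over all sources; both must match the STFT grid of mix_ref."""
    _, _, X = stft(mix_ref, fs=fs, nperseg=nperseg)
    mask = v_target ** 2 / np.maximum(v_sum ** 2, 1e-12)  # Wiener-type mask
    _, y = istft(mask * X, fs=fs, nperseg=nperseg)
    return y
```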
- Oct 2017
- Parametric Time-Frequency Domain Spatial Audio
This chapter introduces methods for factorizing the spectrogram of multichannel audio into repetitive spectral objects and applies the introduced models to the analysis of spatial audio and to the modification of spatial sound through source separation. The purpose of decomposing an audio spectrogram using spectral templates is to learn the underlying structures (audio objects) from the observed data. The chapter discusses two main scenarios: parameterization of multichannel surround sound and parameterization of microphone array signals. It explains the principles of source separation by time-frequency filtering using separation masks constructed from the spectrogram models. The chapter introduces a spatial covariance matrix model based on the directions of arrival of sound events and spectral templates, and discusses its relationship to conventional spatial audio signal processing. Source separation using spectrogram factorization models is achieved via time-frequency filtering of the original observation short-time Fourier transform (STFT) with a generalized Wiener filter obtained from the spectrogram model parameters.
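In generic notation (symbols ours, not necessarily the chapter's), this separation step reads: each source magnitude spectrogram is modeled by spectral templates $b_j(f,k)$ with time-varying gains $g_j(k,t)$, and the generalized Wiener mask formed from these models filters the observation STFT $\mathbf{x}(f,t)$:

$$ \hat{v}_j(f,t) = \sum_{k} b_j(f,k)\, g_j(k,t), \qquad \hat{\mathbf{s}}_j(f,t) = \frac{\hat{v}_j(f,t)}{\sum_{j'} \hat{v}_{j'}(f,t)}\, \mathbf{x}(f,t). $$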
- Aug 2017
- 2017 25th European Signal Processing Conference (EUSIPCO)
- Sep 2015
This paper proposes a method for the binaural reconstruction of a sound scene captured with a portable-sized array consisting of several microphones. The proposed processing separates the scene into a sum of a small number of sources, and the spectrogram of each is in turn represented by a small number of latent components. The direction of arrival (DOA) of each source is estimated, followed by binaural rendering of each source at its estimated direction. For representing the sources, the proposed method uses low-rank complex-valued non-negative matrix factorization combined with a DOA-based spatial covariance matrix model. The binaural reconstruction is achieved by applying the binaural cues (head-related transfer function) associated with the estimated source DOA to the separated source signals. The binaural rendering quality of the proposed method was evaluated with a speech intelligibility test. The results indicated that the proposed binaural rendering improved the intelligibility of speech over stereo recordings and over separation by a minimum variance distortionless response beamformer with the same binaural synthesis in a three-speaker scenario. An additional listening test evaluating the subjective quality of the rendered output indicated that the proposed method adds no processing artifacts in comparison to the unprocessed stereo recording.
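The rendering stage might be sketched as below: each separated time-domain source is convolved with the head-related impulse response (HRIR) pair nearest its estimated DOA, and the binaural images are summed. The `hrir_bank` lookup (quantized azimuth in degrees mapped to a (2, taps) left/right HRIR array) is a hypothetical structure for illustration.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(sources, doas_deg, hrir_bank):
    """sources: list of equal-length 1-D separated source signals;
    doas_deg: estimated azimuth per source;
    hrir_bank: dict {azimuth_deg: (2, taps) left/right HRIR pair}."""
    out = None
    for s, az in zip(sources, doas_deg):
        hrir = hrir_bank[min(hrir_bank, key=lambda a: abs(a - az))]  # nearest DOA
        y = np.stack([fftconvolve(s, hrir[0]), fftconvolve(s, hrir[1])])
        out = y if out is None else out + y
    return out                    # (2, samples): left/right binaural output
```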
- Apr 2015
- Pasi Pertilä
- Joonas Nikunen
Speech separation algorithms face the difficult task of producing a high degree of separation without introducing unwanted artifacts. The time–frequency (T–F) masking technique applies a real-valued (or binary) mask to the signal's spectrum to filter out unwanted components. The practical difficulty lies in estimating the mask. Masks engineered purely for separation performance often introduce musical-noise artifacts into the separated signal, lowering the perceptual quality and intelligibility of the output. Microphone arrays have long been studied for the processing of distant speech. This work uses a feed-forward neural network to map a microphone array's spatial features into a T–F mask. A Wiener filter serves as the target mask for training the neural network on speech examples in a simulated setting. The T–F masks predicted by the neural network are combined to obtain an enhanced separation mask that exploits the information about interference between all sources. The final mask is applied to the delay-and-sum beamformer (DSB) output. The algorithm's objective separation capability, together with the intelligibility of the separated speech, is tested with speech recorded from distant talkers in two rooms at two distances. The results show improvements in an instrumental intelligibility measure and in frequency-weighted SNR over a complex-valued non-negative matrix factorization (CNMF) source separation approach, spatial sound source separation, and conventional beamforming methods such as the DSB and the minimum variance distortionless response (MVDR) beamformer.
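A minimal sketch of the pipeline follows: a feed-forward network maps per-bin spatial features to a Wiener-like mask value, the per-source masks are normalized against each other so that sources compete for each T–F bin, and the target's mask is applied to the DSB output. The feature dimension, layer sizes, and the combination rule are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn

mask_net = nn.Sequential(
    nn.Linear(24, 128), nn.ReLU(),      # 24 = assumed spatial-feature dimension
    nn.Linear(128, 1), nn.Sigmoid(),    # Wiener-like mask value in [0, 1]
)

def separate(dsb_stft, feats, eps=1e-6):
    """dsb_stft: (freq, frames) complex DSB output for the target direction;
    feats: (sources, freq, frames, feat_dim) spatial features per source."""
    masks = torch.stack([mask_net(f).squeeze(-1) for f in feats])
    masks = masks / masks.sum(dim=0).clamp_min(eps)   # exploit inter-source info
    return masks[0] * dsb_stft                        # enhance the target (index 0)
```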
- Sep 2014
- 15th Annual Conference of the International Speech Communication Association (Interspeech'14)
- Pasi Pertilä
- Joonas Nikunen
A high level of noise reduces the perceptual quality and intelligibility of speech. Enhancing the captured speech signal is therefore important in everyday applications such as telephony and teleconferencing. Microphone arrays are typically placed at a distance from the speaker, so the captured signal requires processing to be enhanced. Beamforming provides directional gain towards the source of interest and attenuates interference; it is often followed by a single-channel post-filter to further enhance the signal. Non-linear spatial post-filters can provide high noise suppression but may produce unwanted musical noise that lowers the perceptual quality of the output. This work proposes an artificial neural network (ANN) that learns the structure of naturally occurring post-filters for enhancing speech corrupted by interfering noise. The ANN takes phase-based features obtained from a multichannel array as input. Simulations are used to train the ANN in a supervised manner, and performance is measured with objective scores on speech recorded in an office environment. The post-filters predicted by the ANN are found to improve perceptual quality over delay-and-sum beamforming while maintaining the high noise suppression characteristic of spatial post-filters.
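One plausible form of the phase-based input features, in the spirit of this line of work but with details that are our assumptions, compares the observed inter-channel phase differences against those expected for sound arriving from the beamformer's look direction:

```python
import numpy as np

def phase_features(X, mic_pos, look_dir, freqs, c=343.0):
    """X: (mics, freq, frames) multichannel STFT; look_dir: unit vector of the
    beamformer's look direction. Returns one feature map per microphone pair:
    cos(observed phase difference - expected phase difference), in [-1, 1],
    close to 1 in T-F bins dominated by the look direction."""
    M = X.shape[0]
    feats = []
    for i in range(M):
        for j in range(i + 1, M):
            observed = np.angle(X[i] * np.conj(X[j]))        # (freq, frames)
            tdoa = (mic_pos[i] - mic_pos[j]) @ look_dir / c  # expected delay
            expected = 2 * np.pi * freqs[:, None] * tdoa
            feats.append(np.cos(observed - expected))
    return np.stack(feats)               # (pairs, freq, frames)
```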
- May 2014
- ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
This paper studies multichannel audio separation using non-negative matrix factorization (NMF) combined with a new model for spatial covariance matrices (SCM). The proposed SCM model is parameterized by the source direction of arrival (DoA), and its parameters can be optimized to yield a spatially coherent solution over frequencies, thus avoiding permutation ambiguity and spatial aliasing. The model constrains the estimation of the SCMs to a set of geometrically possible solutions. Additionally, we present a method for initializing the model parameters with a priori DoA information extracted blindly from the mixture. Simulations show that the proposed algorithm exceeds the separation quality of existing spatial separation methods.
- Mar 2014
This paper addresses the problem of sound source separation from a multichannel microphone array capture via estimation of the source spatial covariance matrices (SCM) of a short-time Fourier transformed mixture signal. In many conventional audio separation algorithms, the source mixing parameters are estimated separately for each frequency, making them prone to errors and leading to suboptimal source estimates. In this paper we propose an SCM model that consists of a weighted sum of direction-of-arrival (DoA) kernels, so that only the weights dependent on the source directions need to be estimated. In the proposed algorithm, the spatial properties of the sources are jointly optimized over all frequencies, leading to more coherent source estimates and mitigating the effect of spatial aliasing at high frequencies. The proposed SCM model is combined with a linear model for the magnitudes, and the parameter estimation is formulated in a complex-valued non-negative matrix factorization (CNMF) framework. The simulations consist of recordings made with a hand-held-device-sized array with multiple microphones embedded inside the device casing. The separation quality of the proposed algorithm is shown to exceed that of existing state-of-the-art separation methods for two sources when evaluated with objective separation quality metrics.
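In generic notation (ours), the model can be summarized as

$$ \hat{\mathbf{X}}(f,t) = \sum_{s} \hat{v}_s(f,t)\, \mathbf{H}_s(f), \qquad \mathbf{H}_s(f) = \sum_{o} z_{s,o}\, \mathbf{W}_o(f), \quad z_{s,o} \geq 0, $$

where $\mathbf{W}_o(f)$ are fixed DoA kernels on a direction grid, $\hat{v}_s(f,t)$ is the NMF magnitude model of source $s$, and the direction weights $z_{s,o}$ are shared across all frequencies, which is precisely what enforces a spatially coherent solution.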
- Oct 2012
This article proposes a new spatial audio coding (SAC) method based on the parametrization of multichannel audio by sound objects using non-negative tensor factorization (NTF). The NTF model represents the multichannel audio signal as a linear combination of objects, each composed of fixed spectral bases with a time-varying gain and a channel-dependent spatial gain. The parameters of the model are estimated using a perceptually motivated NTF model and are used for upmixing a downmixed and encoded mixture signal in a Wiener filtering manner. The performance of the proposed coding is evaluated with listening tests, which show the coding performance to be almost equal to that of conventional SAC methods. Additionally, the proposed coding enables controlling the upmix content through meaningful objects, and the source separation capability of the encoding scheme is demonstrated with examples.
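In simplified notation (ours, with one spectral basis per object), the NTF model described above is

$$ \hat{x}_{c,f,t} = \sum_{o} a_{c,o}\, b_{f,o}\, g_{t,o}, $$

where $b_{f,o}$ is the fixed spectral basis of object $o$, $g_{t,o}$ its time-varying gain, and $a_{c,o}$ its channel-dependent spatial gain.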
- Jan 2012
- Proceedings of the 20th European Signal Processing Conference (EUSIPCO), 2012
This paper presents a novel method for solving the permutation ambiguity of frequency-domain independent component analysis, based on maximizing source signal envelope correlation. The proposed method is developed for blind source separation at high sampling frequencies with significant spatial aliasing. We analyze the source envelope using a rank-one singular value decomposition (SVD) applied to an initial source magnitude spectrogram obtained with a time difference of arrival (TDoA) based permutation alignment method. The permutation at frequencies with incoherent TDoA is corrected by maximizing the cross-correlation between the SVD-derived source activation vector and each independent component's magnitude envelope. We evaluate the separation quality using real high-sampling-frequency speech recordings, and the proposed method is found to improve separation over the baseline algorithm.
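A hedged NumPy sketch of the correction step follows; names and shapes are illustrative. The activation vector of each source would come from a rank-one SVD of its initial magnitude spectrogram (e.g. `U, s, Vt = np.linalg.svd(V, full_matrices=False)` with `Vt[0]` as the activation), and at each flagged frequency the component-to-source permutation maximizing the summed envelope correlation is kept.

```python
import itertools
import numpy as np

def align_permutations(comp_env, activation, incoherent_freqs):
    """comp_env: (freqs, sources, frames) component magnitude envelopes;
    activation: (sources, frames) rank-1 SVD activation vectors."""
    S = activation.shape[0]

    def corr(a, b, eps=1e-12):
        a, b = a - a.mean(), b - b.mean()
        return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)

    perms = {}
    for f in incoherent_freqs:        # brute force is fine for small S
        perms[f] = max(itertools.permutations(range(S)),
                       key=lambda p: sum(corr(comp_env[f, p[s]], activation[s])
                                         for s in range(S)))
    return perms                      # frequency -> component reordering
```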
- Oct 2011
- IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2011, New Paltz, NY, USA, October 16-19, 2011
This paper proposes a new spatial audio coding (SAC) method based on the parametrization of multichannel audio by sound objects using non-negative tensor factorization (NTF). The spatial parameters are estimated using a perceptually motivated NTF model and are used for upmixing a downmixed and encoded mixture signal. The performance of the proposed coding is evaluated with listening tests, which show the coding performance to be on a par with conventional SAC methods. The novelty of the proposed coding is that it enables controlling the upmix content through meaningful objects.
- May 2010
- Proceedings of the Audio Engineering Society Convention
This paper proposes a new object-based audio coding algorithm that uses non-negative matrix factorization (NMF) for the magnitude spectrogram representation and codes the phase information separately. The magnitude model is obtained with a perceptually weighted NMF algorithm, which minimizes the noise-to-mask ratio (NMR) of the decomposition and is able to exploit long-term redundancy through an object-based representation. Methods for the quantization and entropy coding of the NMF representation parameters are proposed, and the quality loss is evaluated using the NMR measure. The quantization of the phase information is also studied. Additionally, we propose a sparseness criterion for the NMF algorithm that favors the gain values with the highest probability and thus the shortest entropy-coding word lengths, resulting in a reduced bitrate.
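The perceptually weighted magnitude model can be sketched as weighted-Euclidean NMF: with per-bin weights set to the reciprocal of a masking threshold, the squared error behaves like a noise-to-mask ratio. The masking model that produces `W` is outside this sketch, and these standard multiplicative updates are an illustration rather than the paper's exact algorithm.

```python
import numpy as np

def perceptual_nmf(X, W, rank=20, iters=200, eps=1e-12):
    """Minimize sum_ft W[f,t] * (X[f,t] - (B @ G)[f,t])**2 with multiplicative
    updates; X: (freq, frames) magnitude spectrogram, W: same-shaped weights
    (e.g. 1 / masking threshold for an NMR-like criterion)."""
    F, T = X.shape
    rng = np.random.default_rng(0)
    B, G = rng.random((F, rank)) + eps, rng.random((rank, T)) + eps
    for _ in range(iters):
        V = B @ G
        B *= ((W * X) @ G.T) / ((W * V) @ G.T + eps)   # update spectral bases
        V = B @ G
        G *= (B.T @ (W * X)) / (B.T @ (W * V) + eps)   # update gains
    return B, G
```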
This paper proposes a novel algorithm for minimizing the perceptual distortion in non-negative matrix factorization (NMF) based audio representation. We formulate the noise-to-mask ratio audio quality criterion in a form that can be used in NMF and propose an algorithm for optimizing the criterion. We also propose a method for compensating for the spreading of the representation error in the synthesis filterbank. The objective perceptual quality produced by the proposed method is found to outperform all the reference methods. We also study the trade-off between the window length and the rank of the factorization at a fixed data rate, and find that the best performance is obtained with window lengths between 10 and 30 ms.
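For reference, the noise-to-mask ratio criterion has the generic form (our notation)

$$ \mathrm{NMR} = \frac{1}{TB} \sum_{t=1}^{T} \sum_{b=1}^{B} \frac{E(t,b)}{M(t,b)}, $$

where $E(t,b)$ is the energy of the representation error in auditory band $b$ at frame $t$ and $M(t,b)$ is the corresponding masking threshold; values are typically reported in decibels, with lower being better.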