Conference Paper · PDF available

Speaker localization by humanoid robots in reverberant environments

Abstract and Figures

One of the important tasks of a humanoid-robot auditory system is speaker localization. It is used to construct the surrounding acoustic scene and serves as an input to further processing methods. Localization is usually required to operate indoors under high reverberation levels. Recently, an algorithm for speaker localization under these conditions was proposed. The algorithm uses a spherical microphone array, with processing performed in the spherical harmonics domain, and therefore requires a relatively large number of microphones to efficiently cover the entire frequency range of speech. However, the number of microphones in the auditory system of a humanoid robot is usually limited. The current paper proposes an improvement to the previously published algorithm, aiming to overcome the frequency limitations imposed by the insufficient number of microphones. The improvement is achieved by a novel space-domain distance algorithm that does not require a transformation to the spherical harmonics domain, thereby avoiding the frequency-range limitations. A numerical study shows two important results. First, the improved algorithm significantly extends the operating frequency range. Second, because higher frequencies carry more detailed information about the surrounding sound field, the additional higher frequencies lead to improved localization accuracy.
... Another way to estimate a single DOA in two dimensions is presented in [104], called the space-domain distance (SDD) method. It relies on a distance metric applied between the captured time-frequency (TF) bin and a calculated TF bin that estimates what would have been captured had the signal arrived from a given direction. ...
... The SFT coefficients are estimated via measurements or simulations and can be used to pick the TF bins that contain information about the direct-path signal for a pre-defined DOA. The SDD metric proposed in [104] measures the distance between the direct-path TF bins calculated from the input signals and the same TF bins predicted from the SFT coefficients for a pre-defined DOA. Given this metric, a grid search is carried out to find the direction that minimizes it. ...
... A reverberation-robust approach is presented in [104] and discussed in Section 4.1. It carries out the SDD method and defines the signal subspace solely as the eigenvector with the largest eigenvalue. ...
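The SDD idea in the excerpts above — compare the captured TF bin with the bin predicted for each candidate direction, then grid-search for the minimiser — can be sketched roughly as follows. This is an illustrative sketch, not the algorithm of [104]: the normalised-correlation distance, the `steering` callable, and the free-field model used below are all assumptions.

```python
import numpy as np

def sdd_doa(a_meas, steering, grid):
    """Illustrative space-domain-distance style grid search.

    a_meas   : (M,) measured microphone signals at one time-frequency bin
    steering : callable mapping a candidate direction to the (M,) array
               response (e.g. from measured or simulated transfer functions)
    grid     : iterable of candidate directions, e.g. (azimuth, elevation)
    """
    best_dir, best_dist = None, np.inf
    a_meas = a_meas / np.linalg.norm(a_meas)        # remove gain ambiguity
    for d in grid:
        a_hat = steering(d)
        a_hat = a_hat / np.linalg.norm(a_hat)
        # distance between the captured bin and the bin predicted for d;
        # zero when the two match up to a complex scale factor
        dist = 1.0 - np.abs(np.vdot(a_hat, a_meas))
        if dist < best_dist:
            best_dir, best_dist = d, dist
    return best_dir
```

Because the search is over a discrete grid, resolution is set by the grid spacing; a finer local search around the coarse minimum is a common refinement.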
Article
Full-text available
Sound source localization (SSL) in a robotic platform has been essential in the overall scheme of robot audition. It allows a robot to locate a sound source by sound alone. It has an important impact on other robot audition modules, such as source separation, and it enriches human–robot interaction by complementing the robot’s perceptual capabilities. The main objective of this review is to thoroughly map the current state of the SSL field for the reader and provide a starting point to SSL in robotics. To this effect, we present: the evolution and historical context of SSL in robotics; an extensive review and classification of SSL techniques and popular tracking methodologies; different facets of SSL as well as its state-of-the-art; evaluation methodologies used for SSL; and a set of challenges and research motivations.
... The functionality of SSL on a robotic platform could be useful in several situations, for instance, locating human speakers without visual contact and mapping an unknown acoustic environment [4]. It has been achieved by various methodologies, such as head-related transfer function (HRTF) based time-difference-of-arrival (TDOA), space-domain distance (SDD), acoustic beamforming, etc. [5], [6], [7], [8], [9]. Among these techniques, acoustic beamforming is widely used to obtain the sound map of a measured field in industrial applications such as transport pass-by noise localisation and machine fault detection [4], [10], [11], [12], [13]. ...
Article
Constrained by the physical geometry, the lower and upper frequency bounds and the scale of the scanning area of a microphone array are limited. Owing to its mobility, a service robot can achieve a wider working frequency range with a global view by forming a virtually larger and denser array, which can be realised using non-synchronous measurements beamforming with a movable microphone array prototype. However, even with the state-of-the-art method, it is challenging to localise multiple broadband sources, owing to the difficulty of selecting an appropriate operating frequency without any prior information about the target signal. Therefore, this letter proposes a tensor-completion-based non-synchronous measurements method for broadband multiple-sound-source localisation. The tensor data structure of the broadband signal is analysed, and an alternating direction method of multipliers optimisation with a tensor multi-norm constraint is proposed. This algorithm can provide a sound map with a distinct global view of three different speech signal sources with high accuracy. Compared with the matrix-based optimisation method, the proposed method can significantly reduce the mean square error of the estimated source location.
... The LSDD-DPD test was implemented according to [14]. The DPD test method was implemented according to [30], with the spherical harmonics coefficients of the plane-wave density estimated up to spherical harmonics order N = 1 and with averaging over 3 time frames and 15 frequencies to construct the correlation matrix. For all tested methods the minimal operating frequency was limited to 1 kHz due to the array aperture. ...
Article
Full-text available
The coherent signal subspace method (CSSM) enables the direction-of-arrival (DoA) estimation of coherent sources with subspace localization methods. The focusing process that aligns the signal subspaces within a frequency band to its central frequency is central to the CSSM. Within current focusing approaches, a direction-independent focusing approach may be more suitable for reverberant environments since no initial estimation of the sources' DoAs is required. However, these methods use integrals over the steering function, and cannot be directly applied to arrays around complex scattering structures, such as robot heads. In this paper, current direction-independent focusing methods are extended to arrays for which the steering function is available only for selected directions, typically in a numerical form. Spherical harmonics decomposition of the steering function is then employed to formulate several aspects of the focusing error. A case of two coherent sources is studied and guidelines for the selection of the frequency smoothing bandwidth are suggested. The performance of the proposed methods is then investigated for an array that is mounted on a robot head. The focusing process is integrated within the direct-path dominance (DPD) test method for speaker localization, originally designed for spherical arrays, extending its application to arrays with arbitrary configurations. Finally, experiments with real data verify the feasibility of the proposed method to successfully estimate the DoAs of multiple speakers under real-world conditions.
... Combination of EB-MUSIC with the latter as a preprocessing stage for source counting was also shown to work well [9]. Similar bin selection approaches have also been proposed [10,11,12]. ...
... The DPD test shows robustness to both reverberant and noisy environments [4]. Several variants of the method were proposed, including alternative methods for bin selection [4], different DOA estimation algorithms [5], [6], and various methods for fusing the estimates from different bins [7]- [10]. Although DPD test based methods show good performance under reverberation, they have been developed for processing in the spherical harmonics domain and are restricted to microphone arrays with a spherical configuration. ...
Conference Paper
Algorithms for acoustic source localization and tracking are essential for a wide range of applications such as personal assistants, smart homes, tele-conferencing systems, hearing aids, or autonomous systems. Numerous algorithms have been proposed for this purpose; however, so far they have not been evaluated and compared against each other using a common database. The IEEE-AASP Challenge on sound source localization and tracking (LOCATA) provides a novel, comprehensive data corpus for the objective benchmarking of state-of-the-art algorithms on sound source localization and tracking. The data corpus comprises six tasks, ranging from the localization of a single static sound source with a static microphone array to the tracking of multiple moving speakers with a moving microphone array. It contains real-world multichannel audio recordings, obtained by hearing aids, microphones integrated in a robot head, and a planar and a spherical microphone array in an enclosed acoustic environment, as well as positional information about the involved arrays and sound sources, represented by moving human talkers or static loudspeakers.
Conference Paper
Auditory systems of humanoid robots usually acquire the surrounding sound field by means of microphone arrays. These arrays can undergo motion related to the robot’s activity. The conventional approach to dealing with this motion is to stop the robot during sound acquisition. This approach avoids changing the positions of the microphones during the acquisition and reduces the robot’s ego-noise. However, stopping the robot can interfere with the naturalness of its behaviour. Moreover, the potential performance improvement due to motion of the sound-acquiring system cannot be attained. This potential is analysed in the current paper. The analysis considers two different types of motion: (i) rotation of the robot’s head and (ii) limb gestures. The study presented here combines both theoretical and numerical simulation approaches. The results show that rotation of the head improves the high-frequency performance of the microphone array positioned on the head of the robot. This is complemented by the limb gestures, which improve the low-frequency performance of the array positioned on the torso and limbs of the robot.
Article
Full-text available
In this work, a multiple sound source localization and counting method is presented, that imposes relaxed sparsity constraints on the source signals. A uniform circular microphone array is used to overcome the ambiguities of linear arrays, however the underlying concepts (sparse component analysis and matching pursuit-based operation on the histogram of estimates) are applicable to any microphone array topology. Our method is based on detecting time-frequency (TF) zones where one source is dominant over the others. Using appropriately selected TF components in these “single-source” zones, the proposed method jointly estimates the number of active sources and their corresponding directions of arrival (DOAs) by applying a matching pursuit-based approach to the histogram of DOA estimates. The method is shown to have excellent performance for DOA estimation and source counting, and to be highly suitable for real-time applications due to its low complexity. Through simulations (in various signal-to-noise ratio conditions and reverberant environments) and real environment experiments, we indicate that our method outperforms other state-of-the-art DOA and source counting methods in terms of accuracy, while being significantly more efficient in terms of computational complexity.
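The matching-pursuit-on-a-histogram idea described above can be sketched roughly as follows: pick the histogram peak as a source DOA, subtract a local window around it, and repeat until the residual peak is insignificant. The bin width, window size, stopping threshold, and function name are illustrative assumptions, not the parameters of the cited method.

```python
import numpy as np

def count_and_localize(doa_estimates, n_bins=72, width=3, thresh=0.1):
    """Illustrative source counting by greedy peak removal on a
    histogram of per-TF-bin DOA estimates (angles in degrees, 0-360).
    """
    hist, edges = np.histogram(doa_estimates, bins=n_bins, range=(0, 360))
    hist = hist.astype(float)
    ref = hist.max()                      # mass of the strongest peak
    sources = []
    while ref > 0 and hist.max() >= thresh * ref:
        p = int(np.argmax(hist))
        sources.append(0.5 * (edges[p] + edges[p + 1]))   # bin centre
        for j in range(p - width, p + width + 1):          # zero a window
            hist[j % n_bins] = 0.0                         # (wraps at 360)
    return sources
```

The returned list length is the estimated source count; its entries are the estimated DOAs. Zeroing a window around each peak prevents one broad cluster from being counted twice.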
Conference Paper
Full-text available
A recent and fast evolving application for microphone arrays is the auditory systems of humanoid robots. These arrays, in contrast to conventional arrays, are not fixed in a given position, but move together with the robot. While imposing a challenge to most conventional array processing algorithms, this movement offers an opportunity to enhance performance if utilized in an appropriate manner. The array movement can increase the amount of information gathered and, therefore, improve various aspects of array processing. This paper presents a theoretical framework for the processing of moving microphone arrays for humanoid robot audition based on a representation of the surrounding sound field in the spherical harmonics domain. A simulation study is provided, illustrating the use and the potential advantage of the proposed framework.
Article
Full-text available
An important aspect of a humanoid robot is audition. Previous work has presented robot systems capable of sound localization and source segregation based on microphone arrays with various configurations. However, no theoretical framework for the design of these arrays has been presented. In the current paper, a design framework is proposed based on a novel array quality measure. The measure is based on the effective rank of a matrix composed of the generalized head related transfer functions (GHRTFs) that account for microphone positions other than the ears. The measure is shown to be theoretically related to standard array performance measures such as beamforming robustness and DOA estimation accuracy. Then, the measure is applied to produce sample designs of microphone arrays. Their performance is investigated numerically, verifying the advantages of array design based on the proposed theoretical framework.
Conference Paper
Full-text available
We propose an adaptive blind source separation algorithm in the context of robot audition using a microphone array. Our algorithm comprises two steps: a fixed beamforming step to reduce the reverberation and the background noise, and a source separation step. In the fixed beamforming preprocessing, we build the beamforming filters using the Head Related Transfer Functions (HRTFs), which allows us to take into consideration the effect of the robot's head on the near acoustic field. In the source separation step, we use a separation algorithm based on l1-norm minimization. We evaluate the performance of the proposed algorithm in a fully adaptive way with real data and a varying number of sources, and show good separation and source-number estimation results.
Article
One of the major challenges encountered when localizing multiple speakers in real-world environments is the need to overcome the effect of multipath distortion due to room reverberation. A wide range of methods has been proposed for speaker localization, many based on microphone array processing. Some of these methods are designed for the localization of coherent sources, typical of multipath environments, and some have even reported limited robustness to reverberation. Nevertheless, speaker localization under conditions of high reverberation still remains a challenging task. This paper proposes a novel multiple-speaker localization technique suitable for environments with high reverberation, based on a spherical microphone array and processing in the spherical harmonics (SH) domain. The non-stationarity and sparsity of speech, as well as frequency smoothing in the SH domain, are exploited in the development of a direct-path dominance test. This test can identify time-frequency (TF) bins that contain contributions from only one significant source and no significant contribution from room reflections, such that localization based on these selected TF bins is performed accurately, avoiding the potential distortion due to other sources and reverberation. Computer simulations and an experiment in a real reverberant room validate the robustness of the proposed method in the presence of high reverberation.
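The direct-path dominance test described above selects TF bins whose locally averaged correlation matrix is effectively rank one, indicating a single dominant direct-path contribution. A rough sketch, operating on raw microphone STFTs rather than in the spherical harmonics domain, and with an assumed eigenvalue-ratio threshold and averaging window:

```python
import numpy as np

def dpd_select(X, jt=3, jf=15, ratio_thresh=10.0):
    """Illustrative direct-path-dominance style TF-bin selection.

    X is an (M, T, F) STFT tensor (M channels). For each bin, a local
    correlation matrix is averaged over jt time frames and jf frequencies;
    the bin passes if the largest eigenvalue dominates the second one,
    i.e. the matrix is effectively rank one.
    """
    M, T, F = X.shape
    selected = []
    for t in range(T - jt + 1):
        for f in range(F - jf + 1):
            patch = X[:, t:t + jt, f:f + jf].reshape(M, -1)
            R = patch @ patch.conj().T / patch.shape[1]   # averaged correlation
            ev = np.linalg.eigvalsh(R)                    # ascending order
            if ev[-1] > ratio_thresh * max(ev[-2], 1e-12):
                selected.append((t, f))
    return selected
```

DOA estimation is then carried out only on the bins in `selected`, so reflections and competing sources in the rejected bins do not bias the result.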
Article
This study addresses a framework for a robot audition system, including sound source localization (SSL) and sound source separation (SSS), that can robustly recognize simultaneous speech in a real environment. Because SSL estimates not only the location of speakers but also the number of speakers, such a robust framework is essential for simultaneous speech recognition. Moreover, improvement in the performance of SSS is crucial, because the robot has to recognize each speech source individually. For simultaneous speech recognition, current robot audition systems mainly require noise-robustness, high resolution, and real-time implementation. Multiple signal classification (MUSIC) based on standard eigenvalue decomposition (SEVD) and geometric-constrained high-order decorrelation-based source separation (GHDSS) are microphone-array-processing techniques used for SSL and SSS, respectively. To enhance SSL robustness against noise while detecting simultaneous speech, we improved SEVD-MUSIC by incorporating generalized eigenvalue decomposition (GEVD). However, GEVD-based MUSIC (GEVD-MUSIC) and GHDSS have two main issues: (1) the resolution of pre-measured transfer functions (TFs) determines the resolution of SSL and SSS, and (2) their computational cost is too expensive for real-time processing. For the first issue, we propose a TF-interpolation method integrating time-domain and frequency-domain interpolation. The interpolation achieves super-resolution robot audition, with a higher resolution than that of the pre-measured TFs. For the second issue, we propose two methods for SSL: MUSIC based on generalized singular value decomposition (GSVD-MUSIC) and hierarchical SSL (H-SSL). GSVD-MUSIC drastically reduces the computational cost while maintaining noise-robustness for localization.
In addition, H-SSL reduces the computational cost by introducing a hierarchical search algorithm instead of a greedy search for localization. These techniques are integrated into a robot audition system using a robot-embedded microphone array. Preliminary experiments for each technique showed the following: (1) the proposed interpolation achieved approximately 1-degree resolution in both SSL and SSS, although the TFs were measured only at 30-degree intervals; (2) GSVD-MUSIC required only 46.4% and 40.6% of the computational cost of SEVD-MUSIC and GEVD-MUSIC, respectively; (3) H-SSL reduced the computational cost of localizing a single speaker by 71.7%. Finally, the robot audition system, including super-resolution SSL and SSS, is applied to robustly recognize four simultaneous speech sources in a real environment. The proposed system showed performance improvements of up to 7% in the average word correct rate for simultaneous speech recognition, especially when the TFs were at more than 30-degree intervals.
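The MUSIC variants above all rest on the same pseudo-spectrum: project candidate steering vectors onto the noise subspace of the spatial correlation matrix and pick the peaks. A minimal sketch of the SEVD form; the half-wavelength linear-array model used to exercise it stands in for the pre-measured robot transfer functions and is purely an assumption.

```python
import numpy as np

def music_spectrum(R, steering_vectors, n_src):
    """Illustrative MUSIC pseudo-spectrum via a standard eigenvalue
    decomposition (SEVD) of the spatial correlation matrix.

    R                : (M, M) correlation matrix of the array signals
    steering_vectors : (G, M) array responses for G candidate directions
    n_src            : assumed number of sources
    """
    w, V = np.linalg.eigh(R)                  # eigenvalues ascending
    En = V[:, : R.shape[0] - n_src]           # noise subspace
    num = np.einsum('gm,gm->g', steering_vectors.conj(), steering_vectors).real
    proj = steering_vectors.conj() @ En       # a^H En for each direction
    denom = np.einsum('gk,gk->g', proj, proj.conj()).real
    return num / np.maximum(denom, 1e-12)     # peaks where a ⊥ noise subspace
```

Since the denominator vanishes when a steering vector lies in the signal subspace, the spectrum exhibits sharp peaks at the source directions rather than a smooth beam pattern.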
Article
Speech signals recorded in real environments may be corrupted by ambient noise and reverberation. Therefore, noise reduction and dereverberation algorithms for speech enhancement are typically employed in speech communication systems. Although microphone arrays are useful in reducing the effect of noise and reverberation, existing methods have limited success in significantly removing both reverberation and noise in real environments. This paper presents a method for noise reduction and dereverberation that overcomes some of the limitations of previous methods. The method uses a spherical microphone array to achieve plane-wave decomposition (PWD) of the sound field, based on direction-of-arrival (DOA) estimation of the desired signal and its reflections. A multi-channel linearly-constrained minimum-variance (LCMV) filter is introduced to achieve further noise reduction. The PWD beamformer achieves dereverberation while the LCMV filter reduces the uncorrelated noise with a controllable dereverberation constraint. In contrast to other methods, the proposed method employs DOA estimation, rather than room impulse response identification, to achieve dereverberation, and relative transfer function (RTF) estimation between the source reflections to achieve noise reduction while avoiding signal cancellation. The paper includes a simulation investigation and an experimental study, comparing the proposed method to currently available methods.
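The multi-channel LCMV filter mentioned above admits a standard closed form: minimise the output power w^H R w subject to linear constraints C^H w = f. The sketch below shows only that generic closed form; the construction of the constraint matrix (e.g. from the steering vectors or relative transfer functions of the desired signal and its reflections, as in the paper) is left as an assumption.

```python
import numpy as np

def lcmv_weights(R, C, f):
    """Illustrative closed-form LCMV beamformer:
        w = R^{-1} C (C^H R^{-1} C)^{-1} f

    R : (M, M) Hermitian positive-definite covariance matrix
    C : (M, K) constraint matrix (K constraint directions)
    f : (K,)  desired response for each constraint
    """
    Ri_C = np.linalg.solve(R, C)                       # R^{-1} C
    w = Ri_C @ np.linalg.solve(C.conj().T @ Ri_C, f)   # satisfy C^H w = f
    return w
```

By construction the filter passes the constrained directions with the prescribed responses exactly, while minimising the power contributed by everything else, which is how the controllable dereverberation constraint described above can coexist with noise reduction.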