Evaluation of generalized cross-correlation methods for direction of arrival estimation using two microphones in real environments

Multimedia & Multimodal Processing Research Group, Telecommunication Engineering Department, Polytechnic School, University of Jaén, Spain
Applied Acoustics (Impact Factor: 1.07). 08/2012; 73(8). DOI: 10.1016/j.apacoust.2012.02.002

ABSTRACT The localization of sound sources, and particularly of speech, has numerous industrial applications. This has motivated a continuous effort to develop robust direction-of-arrival detection algorithms that overcome the limitations imposed by real scenarios, such as multiple reflections and undesirable noise sources. Time-difference-of-arrival-based methods, and in particular generalized cross-correlation approaches, have been widely investigated in acoustic signal processing, but there is a considerable gap in the technical literature regarding their evaluation in real environments when only two microphones are used. In this work, four generalized cross-correlation methods for the localization of speech sources with two microphones have been analyzed in different real scenarios with a stationary noise source. Furthermore, these scenarios have been acoustically characterized in order to relate the behavior of these cross-correlation methods to the acoustic properties of noisy scenarios. The scope of this study is not only to assess the accuracy and reliability of a set of well-known localization algorithms, but also to determine how the acoustic properties of the room under analysis influence the final results, by incorporating into the analysis factors beyond reverberation time and signal-to-noise ratio. The results highlight the influence of the analyzed acoustic properties on the performance of these methods.
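As background on how a generalized cross-correlation method estimates the time difference of arrival between two microphones, here is a minimal GCC-PHAT sketch. This is an illustrative implementation, not the code evaluated in the paper; the function name and parameters are assumptions.

```python
import numpy as np

def gcc_phat(sig, ref, fs, max_tau=None):
    """Estimate the time delay (seconds) of `sig` relative to `ref`.

    The PHAT weighting whitens the cross-power spectrum so that only
    phase information remains, which sharpens the correlation peak and
    improves robustness to reverberation.
    """
    n = sig.shape[0] + ref.shape[0]
    # Cross-power spectrum (zero-padded to avoid circular wrap-around)
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    # PHAT weighting: normalize out the magnitude, keep the phase
    R /= np.abs(R) + 1e-15
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    # Re-center so index `max_shift` corresponds to zero lag
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)
```

With two microphones, the sign of the returned delay indicates which side of the array the source lies on, and its magnitude maps to an angle through the microphone spacing.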

Available from: José Escolano, Jun 23, 2015
  • ABSTRACT: One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to the people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those in video-conference rooms, and may therefore run into difficulties when constrained to the sensors a robot can carry. Moreover, within the scope of interactive autonomous robots, the benefits of audio-visual attention mechanisms have rarely been evaluated against audio-only or visual-only approaches in real scenarios: most tests have been conducted in controlled environments, at short distances, and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information through Bayesian inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. The performance of this system is evaluated and compared by considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interaction framework.
    Sensors 06/2014; 14(6):9522-9545. DOI:10.3390/s140609522 · 2.05 Impact Factor
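The fusion described in that abstract can be pictured as a discrete Bayes update over candidate directions: the posterior is proportional to the prior times the product of the per-sensor likelihoods. The sketch below is an illustrative toy, not the paper's actual model; the grid, observations, and standard deviations are assumptions.

```python
import numpy as np

def gaussian_likelihood(angles, obs, sigma):
    """Likelihood of each candidate direction given one sensor reading."""
    return np.exp(-0.5 * ((angles - obs) / sigma) ** 2)

def fuse(prior, *likelihoods):
    """Bayes fusion: posterior ∝ prior × product of sensor likelihoods."""
    post = prior.copy()
    for lik in likelihoods:
        post *= lik
    return post / post.sum()

angles = np.arange(-90, 91, 1.0)                  # candidate DOAs (degrees)
prior = np.full(angles.size, 1.0 / angles.size)   # uniform prior
audio = gaussian_likelihood(angles, 12.0, sigma=15.0)   # broad audio cue
visual = gaussian_likelihood(angles, 18.0, sigma=5.0)   # sharper visual cue
posterior = fuse(prior, audio, visual)
estimate = angles[np.argmax(posterior)]
```

The fused estimate lands between the two cues, pulled toward the sharper (lower-variance) visual observation, which is the usual benefit claimed for audio-visual fusion.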
  • ABSTRACT: Estimating the direction of a sound source is an important technique in various engineering fields, including intelligent robots and surveillance systems. In a household where a user's voice and the noise emitted by electric appliances originate from arbitrary directions in 3-D space, robots need to recognize the directions of multiple sound sources in order to interact effectively with the user. This paper proposes an ear-based localization system using two artificial robot ears, each consisting of a spiral-shaped pinna and two microphones, for application in humanoid robots. The four microphones are placed asymmetrically on the left and right sides of the head. The proposed localization algorithm is based on a spatially mapped generalized cross-correlation function, which is transformed from the time domain to the space domain using a measured inter-channel time difference map. To validate the proposed method, two experiments (single- and multiple-source cases) were conducted using male speech. In the single-source case, with the exception of laterally biased sources, localization was achieved with an error of less than 10°. In the multiple-source case, one source was fixed at the front while the other changed direction; the localization error rates for the fixed and moving sources were 0% and 36.9%, respectively, within an error bound of 15°.
    Applied Acoustics 03/2014; 77:49–58. DOI:10.1016/j.apacoust.2013.10.001 · 1.07 Impact Factor
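For a plain two-microphone pair (as in the main article), the simplest time-difference-to-space mapping is the far-field model τ = (d/c)·sin θ, rather than a measured inter-channel map as in the robot-ear system above. A minimal sketch, with illustrative names and values:

```python
import numpy as np

def tdoa_to_doa(tau, mic_distance, c=343.0):
    """Map a time difference of arrival to a broadside DOA angle (degrees).

    Far-field model for a two-microphone pair: tau = (d / c) * sin(theta).
    The arcsin argument is clipped so that measurement noise pushing it
    outside [-1, 1] does not raise a domain error.
    """
    x = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(x))
```

Note the front-back ambiguity inherent to two microphones: a pair alone cannot distinguish θ from 180° − θ, which is one motivation for the pinna shapes and asymmetric placement used above.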
  • ABSTRACT: We present an automatic speech recognition system that uses a missing-data approach to compensate for challenging environmental noise containing both additive and convolutive components. The unreliable, noise-corrupted ("missing") components are identified by a Gaussian mixture model (GMM) classifier based on a diverse range of acoustic features. To perform speech recognition on the partially observed data, the missing components are substituted with clean-speech estimates computed using both sparse imputation and cluster-based GMM imputation. Compared to two reference mask-estimation techniques based on interaural level and time difference pairs, the proposed missing-data approach significantly improved keyword accuracy rates in all signal-to-noise ratio conditions when evaluated on the CHiME reverberant multisource environment corpus. Of the imputation methods, cluster-based imputation outperformed sparse imputation. The highest keyword accuracy was achieved when the system was trained on imputed data, which made it more robust to possible imputation errors.
    Computer Speech & Language 01/2012; 27(3):2219-2231. DOI:10.1016/j.csl.2012.06.005 · 1.81 Impact Factor
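The cluster-based imputation idea can be illustrated with a heavily simplified stand-in: match the reliable components of a feature frame against a codebook of clean-speech cluster means, then fill the unreliable components from the best-matching cluster. A full GMM would use posterior-weighted expectations; the hard-assignment version below, with made-up data, is only a sketch of the principle.

```python
import numpy as np

def cluster_impute(frame, reliable, means):
    """Fill unreliable feature components from the best-matching cluster.

    frame    : observed feature vector (corrupted where `reliable` is False)
    reliable : boolean mask marking components judged uncorrupted
    means    : (K, D) array of clean-speech cluster means
    """
    # Choose the cluster using only the reliable components
    diffs = means[:, reliable] - frame[reliable]
    k = np.argmin((diffs ** 2).sum(axis=1))
    # Keep reliable observations, substitute the rest from cluster k
    imputed = frame.copy()
    imputed[~reliable] = means[k, ~reliable]
    return imputed
```

Training the recognizer on imputed rather than clean features, as the abstract notes, lets the acoustic models absorb the systematic errors this substitution introduces.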