Evaluation of generalized cross-correlation methods for direction of arrival estimation using two microphones in real environments

Multimedia & Multimodal Processing Research Group, Telecommunication Engineering Department, Polytechnic School, University of Jaén, Spain
Applied Acoustics (Impact Factor: 1.1). 08/2012; 73(8). DOI: 10.1016/j.apacoust.2012.02.002

ABSTRACT The localization of sound sources, and particularly speech, has numerous applications in industry. This has motivated a continuous effort to develop robust direction-of-arrival detection algorithms that overcome the limitations imposed by real scenarios, such as multiple reflections and undesirable noise sources. Time difference of arrival-based methods, and particularly generalized cross-correlation approaches, have been widely investigated in acoustic signal processing, but the technical literature offers little evaluation of them in real environments when only two microphones are used. In this work, four generalized cross-correlation methods for localization of speech sources with two microphones are analyzed in different real scenarios with a stationary noise source. Furthermore, these scenarios have been acoustically characterized in order to relate the behavior of these cross-correlation methods to the acoustic properties of noisy scenarios. The scope of this study is not only to assess the accuracy and reliability of a set of well-known localization algorithms, but also to determine how the acoustic properties of the room under analysis influence the final results, by incorporating into the analysis factors beyond the reverberation time and signal-to-noise ratio. The results outline the influence of the analyzed acoustic properties on the performance of these methods.
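
As a rough illustration of the kind of method the abstract refers to, the sketch below estimates a time difference of arrival with one common generalized cross-correlation weighting (PHAT) and converts it to an angle. The abstract does not say which four weightings were evaluated, so the choice of PHAT, the function names `gcc_phat` and `doa_from_tdoa`, the far-field plane-wave model, and the default speed of sound of 343 m/s are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def gcc_phat(x1, x2, fs, max_tau=None):
    """Estimate the delay in seconds of x2 relative to x1 (GCC with PHAT weighting).

    A positive result means x2 lags x1. Illustrative sketch only.
    """
    n = len(x1) + len(x2)                    # zero-pad to avoid circular wrap-around
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)
    cross = X2 * np.conj(X1)                 # cross-power spectrum
    cross /= np.abs(cross) + 1e-12           # PHAT: keep phase, discard magnitude
    cc = np.fft.irfft(cross, n=n)            # generalized cross-correlation
    max_shift = n // 2
    if max_tau is not None:                  # restrict search to physically possible lags
        max_shift = max(1, min(int(fs * max_tau), max_shift))
    # Reorder so that index 0 corresponds to lag -max_shift
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


def doa_from_tdoa(tau, mic_distance, c=343.0):
    """Far-field angle in degrees from broadside for a two-microphone pair.

    Assumes a plane wave and a speed of sound c in m/s (assumed value).
    """
    s = np.clip(c * tau / mic_distance, -1.0, 1.0)  # guard against |sin| > 1
    return float(np.degrees(np.arcsin(s)))
```

With only two microphones, the angle obtained this way is ambiguous between the front and back half-planes, and the resolution is limited by the sampling rate and the microphone spacing.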

  • ABSTRACT: Sound source localization using a two-microphone array is an active area of research, with considerable potential for use with video conferencing, mobile devices, and robotics. Based on the observed time-differences of arrival between sound signals, a probability distribution of the location of the sources is considered to estimate the actual source positions. However, these algorithms assume a given number of sound sources. This paper describes an updated research account on the solution presented in Escolano et al. [J. Acoust. Soc. Am. 132(3), 1257-1260 (2012)], where nested sampling is used to explore a probability distribution of the source position using a Laplacian mixture model, which allows both the number and position of speech sources to be inferred. This paper presents different experimental setups and scenarios to demonstrate the viability of the proposed method, which is compared with some of the most popular sampling methods, demonstrating that nested sampling is an accurate tool for speech localization.
    The Journal of the Acoustical Society of America 02/2014; 135(2):742-753. · 1.65 Impact Factor
  • ABSTRACT: The propagation speed of waves depends on the physical properties of the transmitting material. Since these properties can vary along the propagation path, they cannot be determined from local measurements. However, mean values of the propagation speed can be obtained from time measurements, either between distributed sources and sensors (Time Of Arrival, TOA) if both are synchronized, or otherwise from time differences between distributed sensors (Time Difference Of Arrival, TDOA). This contribution investigates the required assumptions for speed estimation from time measurements and provides closed-form solutions for the synchronized and unsynchronized cases. Furthermore, the achievable accuracy is determined in terms of Cramér-Rao bounds. The analysis is carried out for the propagation of sound waves in air, where the propagation speed varies with the air temperature. Example results from loudspeaker-microphone recordings are provided. However, the closed-form relations also apply to the propagation of other types of waves in linear regimes. This manuscript extends previous work by the authors by providing closed-form solutions and by a parallel treatment of the TOA and the TDOA measurements.
    Multidimensional Systems and Signal Processing 04/2014; · 0.86 Impact Factor
  • ABSTRACT: One of the main issues within the field of social robotics is to endow robots with the ability to direct attention to people with whom they are interacting. Different approaches follow bio-inspired mechanisms, merging audio and visual cues to localize a person using multiple sensors. However, most of these fusion mechanisms have been used in fixed systems, such as those used in video-conference rooms, and thus they may run into difficulties when constrained to the sensors with which a robot can be equipped. Besides, within the scope of interactive autonomous robots, there has been little evaluation of the benefits of audio-visual attention mechanisms, compared to audio-only or visual-only approaches, in real scenarios. Most of the tests conducted have been within controlled environments, at short distances and/or with off-line performance measurements. With the goal of demonstrating the benefit of fusing sensory information with Bayesian inference for interactive robotics, this paper presents a system for localizing a person by processing visual and audio data. Moreover, the performance of this system is evaluated and compared by considering the technical limitations of unimodal systems. The experiments show the promise of the proposed approach for the proactive detection and tracking of speakers in a human-robot interactive framework.
    Sensors 01/2014; 14(6):9522-9545. · 2.05 Impact Factor

