Evaluation of generalized cross-correlation methods for direction of arrival estimation using two microphones in real environments

Multimedia & Multimodal Processing Research Group, Telecommunication Engineering Department, Polytechnic School, University of Jaén, Spain
Applied Acoustics (Impact Factor: 1.02). 08/2012; 73(8). DOI: 10.1016/j.apacoust.2012.02.002


The localization of sound sources, and particularly speech, has numerous applications in industry. This has motivated a continuous effort to develop robust direction-of-arrival detection algorithms that overcome the limitations imposed by real scenarios, such as multiple reflections and undesirable noise sources. Time difference of arrival-based methods, and particularly generalized cross-correlation approaches, have been widely investigated in acoustic signal processing, but the technical literature says little about their evaluation in real environments when only two microphones are used. In this work, four generalized cross-correlation methods for localization of speech sources with two microphones are analyzed in different real scenarios with a stationary noise source. Furthermore, these scenarios are acoustically characterized in order to relate the behavior of these cross-correlation methods to the acoustic properties of noisy scenarios. The aim of this study is not only to assess the accuracy and reliability of a set of well-known localization algorithms, but also to determine how the acoustic properties of the room under analysis influence the final results, by incorporating into the analysis factors in addition to the reverberation time and signal-to-noise ratio. The results outline the influence of the analyzed acoustic properties on the performance of these methods.
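As a rough illustration of the kind of method the abstract refers to, the following is a minimal sketch of a generalized cross-correlation delay estimator for two microphones, followed by the far-field conversion from time difference of arrival to angle. It is not the authors' implementation: the abstract does not name the four weightings that were evaluated, so the sketch uses a phase-transform style weighting with an exponent `gamma` (`gamma=1` gives full phase-transform normalization, `gamma=0` gives plain cross-correlation). The function names `gcc_tdoa` and `tdoa_to_angle`, the parameter names, and the default speed of sound of 343 m/s are all assumptions made for this example.

```python
import numpy as np


def gcc_tdoa(x1, x2, fs, gamma=1.0, max_tau=None):
    """Estimate the delay of x2 relative to x1, in seconds.

    Hypothetical helper: gamma is the exponent of the magnitude
    normalization (1.0 = phase transform, 0.0 = no weighting).
    A positive result means x2 lags x1.
    """
    n = len(x1) + len(x2)  # zero-pad so the correlation is linear, not circular
    X1 = np.fft.rfft(x1, n=n)
    X2 = np.fft.rfft(x2, n=n)

    # Cross-power spectrum, then magnitude normalization controlled by gamma.
    cross = X2 * np.conj(X1)
    weight = np.maximum(np.abs(cross) ** gamma, 1e-12)  # avoid division by zero
    cc = np.fft.irfft(cross / weight, n=n)

    # Restrict the search to physically possible lags if max_tau is given.
    max_shift = n // 2
    if max_tau is not None:
        max_shift = max(1, min(int(fs * max_tau), max_shift))

    # Reorder so that index 0 corresponds to lag -max_shift.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = int(np.argmax(cc)) - max_shift
    return shift / fs


def tdoa_to_angle(tau, mic_distance, c=343.0):
    """Convert a delay (s) to an angle from broadside (degrees).

    Assumes a far-field source and a speed of sound c in m/s.
    """
    ratio = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))
```

A typical call would pass `max_tau=mic_distance / c`, since no physical delay between the two microphones can exceed the time sound needs to travel the distance between them. With only two microphones, the angle is ambiguous between the front and back half-planes.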

  • "Here, the generalized cross-correlation method, which applies a phase transform (GCC-PHAT) (Knapp and Carter, 1976) with a parameter γ for changing the level of normalization (Tikander et al., 2003) is used. Compared to a conventional cross-correlation, GCC-PHAT suppresses secondary peaks in the cross-correlation and has been shown to produce better target localization accuracy (Perez-Lorenzo et al., 2012)."
    ABSTRACT: We present an automatic speech recognition system that uses a missing data approach to compensate for challenging environmental noise containing both additive and convolutive components. The unreliable and noise-corrupted (“missing”) components are identified using a Gaussian mixture model (GMM) classifier based on a diverse range of acoustic features. To perform speech recognition using the partially observed data, the missing components are substituted with clean speech estimates computed using both sparse imputation and cluster-based GMM imputation. Compared to two reference mask estimation techniques based on interaural level and time difference-pairs, the proposed missing data approach significantly improved the keyword accuracy rates in all signal-to-noise ratio conditions when evaluated on the CHiME reverberant multisource environment corpus. Of the imputation methods, cluster-based imputation was found to outperform sparse imputation. The highest keyword accuracy was achieved when the system was trained on imputed data, which made it more robust to possible imputation errors.
    Article · Jan 2012 · Computer Speech & Language
  • ABSTRACT: The propagation speed of waves depends on the physical properties of the transmitting material. Since these properties can vary along the propagation path, they cannot be determined from local measurements. However, mean values of the propagation speed can be obtained from time measurements, either between distributed sources and sensors (Time Of Arrival, TOA) if both are synchronized, or otherwise from time differences between distributed sensors (Time Difference Of Arrival, TDOA). This contribution investigates the required assumptions for speed estimation from time measurements and provides closed-form solutions for the synchronized and unsynchronized cases. Furthermore, the achievable accuracy is determined in terms of Cramér-Rao bounds. The analysis is carried out for the propagation of sound waves in air, where the propagation speed varies with the air temperature. Example results from loudspeaker-microphone recordings are provided. However, the closed-form relations also apply to the propagation of other types of waves in linear regimes. This manuscript extends previous work by the authors by providing closed-form solutions and by a parallel treatment of the TOA and the TDOA measurements.
    Article · Apr 2014 · Multidimensional Systems and Signal Processing
  • ABSTRACT: Sound source localization using a two-microphone array is an active area of research, with considerable potential for use in video conferencing, mobile devices, and robotics. Based on the observed time differences of arrival between sound signals, a probability distribution of the location of the sources is used to estimate the actual source positions. However, these algorithms assume a given number of sound sources. This paper gives an updated account of the solution presented in Escolano et al. [J. Acoust. Soc. Am. 132(3), 1257-1260 (2012)], where nested sampling is used to explore a probability distribution of the source position using a Laplacian mixture model, which allows both the number and position of speech sources to be inferred. This paper presents different experimental setups and scenarios to demonstrate the viability of the proposed method, which is compared with some of the most popular sampling methods, demonstrating that nested sampling is an accurate tool for speech localization.
    Article · Feb 2014 · The Journal of the Acoustical Society of America