Source publication
The aim of most microphone array applications is to localize sound sources in a noisy and reverberant environment. For that purpose, many different sound source localization (SSL) algorithms have been proposed, among which the SRP-PHAT (steered response power using the phase transform) is known as one of the state-of-the-art methods. Its original f...
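As a rough illustration of the SRP-PHAT functional described above, the sketch below evaluates the PHAT-weighted steered response power over a grid of candidate source positions using pairwise cross-spectra; the microphone geometry, search grid and frame length are illustrative assumptions, not the paper's configuration.

```python
# Minimal SRP-PHAT sketch (NumPy only). Geometry, grid and sampling rate
# are illustrative assumptions, not values from the paper.
import numpy as np

def srp_phat_map(frames, mic_pos, grid, fs, c=343.0):
    """frames: (M, N) array of time-aligned microphone frames.
    mic_pos: (M, 3) microphone coordinates [m].
    grid: (Q, 3) candidate source positions [m].
    Returns the SRP-PHAT power for each grid point."""
    M, N = frames.shape
    X = np.fft.rfft(frames, axis=1)                 # per-channel spectra
    freqs = np.fft.rfftfreq(N, d=1.0 / fs)          # frequency bins [Hz]
    power = np.zeros(len(grid))
    for q, p in enumerate(grid):
        # Propagation delay from the candidate point to every microphone.
        tau = np.linalg.norm(mic_pos - p, axis=1) / c
        acc = 0.0
        for i in range(M):
            for j in range(i + 1, M):
                # PHAT weighting: keep only the phase of the cross-spectrum.
                cross = X[i] * np.conj(X[j])
                cross /= np.abs(cross) + 1e-12
                # Steer the pair to the candidate TDOA and sum over frequency.
                tdoa = tau[i] - tau[j]
                acc += np.real(np.sum(cross * np.exp(2j * np.pi * freqs * tdoa)))
        power[q] = acc
    return power

# The location estimate is the grid point maximizing the map:
# p_hat = grid[np.argmax(srp_phat_map(frames, mic_pos, grid, fs))]
```

The per-grid-point evaluation is independent across candidate positions, which is the part of the computation that GPU implementations typically parallelize.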
Similar publications
Sound Source Localization (SSL) is a key task in audio signal processing, focusing on estimating the position of sound sources relative to a reference point, typically a microphone array. In this paper, we introduce a novel dataset named SoS, specifically designed for indoor SSL scenarios, containing real-life recordings augmented by background noi...
Human emotions are expressed through body gestures, voice variations and facial expressions. Research in the area of facial expression recognition has been active for the last 20 years, aiming to improve system performance. This work proposes a novel feature extraction technique for emotion recognition based on geometrical modeling of facial regions. Most...
Nowadays, more and more object recognition tasks are being solved with Convolutional Neural Networks (CNNs). Due to their high recognition rate and fast execution, convolutional neural networks have enhanced most computer vision tasks, both existing and new ones. In this article, we propose an implementation of a traffic sign recognition algorit...
Genetic algorithms are effective in solving many optimization tasks. However, the long execution time associated with them prevents their use in many domains. In this paper, we propose a new approach for the parallel implementation of genetic algorithms on graphics processing units (GPUs) using the CUDA programming model. We exploit the parallelism within a c...
Most image processing algorithms are inherently parallel, so multithreaded processors are well suited to such applications. In huge image databases, image processing takes a very long time on a single-core processor because of the single-threaded execution of the algorithms. Graphics Processing Units (GPUs) are increasingly common in most image processing applica...
Citations
... 15,17 In the audible acoustic range, recent work using the Generalized Cross Correlation and Steered Response Power algorithms shows that the use of the Phase Transform results in increased resolution and dynamic range, at the cost of losing the information on the magnitude of the signals normally used for absolute sound pressure level reconstruction. 18,19 The few acoustic imaging algorithms in the literature that use phase filtering have either been developed in the frequency domain, or are not well suited for time-domain reconstruction of transient or unsteady events. 20,21 The present work proposes a time-domain imaging algorithm that allows for vibration reconstruction of an extended source based on the concept of PC. ...
An acoustic imaging algorithm is proposed herein for time reconstruction of transient noise sources. Time-domain formulations are not well suited for acoustic imaging because of the size of the resulting system to be inverted. Based on the phase coherence principle widely used in ultrasound imaging and image processing, the first step of the algorithm consists in a phase coherence metric used to reject pixels that are unlikely to contribute to the radiated sound field. This translates into a reduction of the domain size and of the ill-posedness of the problem. In the second step, the inverse problem is solved using Tikhonov regularization and generalized cross-validation to extract the vibration field on the imaging domain. Two test cases are considered: a simulated baffled piston and a panel subjected to a mechanical impact in anechoic conditions. The actual vibration field of the panel is measured with an optical technique for reference. In both the numerical and the experimental case, the vibration field reconstructed with the proposed approach compares well with its respective reference. The results confirm that transient excitations can be localized and quantified with the proposed approach, in contrast with classical time-domain beamforming, which dramatically overestimates their magnitude.
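As a hedged illustration of the first (pixel-rejection) step, the sketch below uses the classical coherence factor from ultrasound imaging, CF = |Σ_m s_m|² / (M Σ_m |s_m|²), evaluated on channel signals already delayed to a given pixel; the paper's exact phase coherence metric and rejection threshold may differ.

```python
# Hedged sketch of a per-pixel coherence test in the spirit of the
# pixel-rejection step described above; the metric shown is the classical
# coherence factor, not necessarily the paper's exact formulation.
import numpy as np

def coherence_factor(delayed, eps=1e-12):
    """delayed: (M, N) channel signals delayed to focus on one pixel.
    Returns a coherence value per sample in [0, 1]: 1 when the channels
    are perfectly aligned, close to 0 for incoherent noise."""
    M = delayed.shape[0]
    num = np.abs(delayed.sum(axis=0)) ** 2
    den = M * (np.abs(delayed) ** 2).sum(axis=0) + eps
    return num / den

def keep_pixel(delayed, threshold=0.5):
    """Reject pixels whose peak coherence stays below the threshold, so
    that only pixels likely to contribute to the radiated field enter the
    regularized inverse problem."""
    return coherence_factor(delayed).max() > threshold
```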
... However, when using PHAT filtering, the magnitude of the signals used for absolute sound pressure level reconstruction is lost [10]. Due to their high degree of parallelizability on graphics processing units, these formulations are well suited for real-time imaging [11]. However, the few acoustic imaging algorithms proposed in the literature that use phase filtering have either been developed in the frequency domain, or are not well suited for time-domain reconstruction of transient or unsteady events because of the loss of information on the source magnitude [12]. ...
The characterization of vibroacoustic sources using microphone arrays in the time domain is still challenging because of the poor conditioning and the extensive computational resources required to solve the associated ill-posed problem. The Near-field Acoustical Holography (NAH) framework and Time-Reversal techniques are among the approaches proposed to solve this problem. However, such techniques involve either dense microphone arrays in the vicinity of the sources to be characterized or high computational complexity. This work proposes a new Time-Domain Phase Coherence algorithm (TD-PCa) based on the phase coherence principle widely used in the fields of image processing and ultrasound imaging. The proposed TD-PCa is numerically and experimentally validated using a regular 121-microphone array located in front of an impacted, baffled, simply supported plate in an anechoic chamber. The resulting vibration field is reconstructed with the proposed TD-PCa and compared with the vibration field estimated with the standard Delay-and-Sum (DAS) approach. Moreover, the imaging results are compared with vibration field measurements conducted on the plate using the deflectometry technique with a high-speed camera. The results show that the acceleration field reconstructed with the proposed TD-PCa is in good agreement with the vibration field measured optically.
... There exist plenty of CUDA-based implementations in the field of audio signal processing [5][6][7], including sound source localization algorithms [8,9], which are constrained to NVIDIA platforms. OpenMP-based implementations were also used in [10] for audio systems on multi-core platforms based on Beamforming and Wave Field Synthesis. ...
... The platform also includes 132 GB of main memory. Up to 8,192 OpenCL work-items can be launched in parallel on this platform. -p100 is an NVIDIA Tesla P100 GPU accelerator implementing the Pascal micro-architecture. ...
The Steered Response Power with Phase Transform (SRP-PHAT) algorithm is a well-known method for sound source localization due to its robust performance in noisy and reverberant environments. This algorithm is used in a large number of acoustic applications such as automatic camera steering systems, human–machine interaction, video gaming and audio surveillance. SRP-PHAT implementations must handle a large number of signals coming from a microphone array and a huge search grid that influences the localization accuracy of the system. In this context, high performance in the localization process can only be achieved by using massively parallel computational resources. Different types of multi-core machines based either on multiple CPUs or on GPUs are commonly employed in diverse fields of science for accelerating a number of applications, mainly using OpenMP and CUDA as programming frameworks, respectively. This implies the development of multiple source codes, which limits portability and application possibilities. In contrast, OpenCL has emerged as an open standard for parallel programming that is nowadays supported by a wide range of architectures. In this work, we evaluate an OpenCL-based implementation of the SRP-PHAT algorithm on two state-of-the-art CPU and GPU platforms. Results demonstrate that OpenCL achieves close-to-CUDA performance on the GPU (considered as an upper bound) and outperforms OpenMP in most of the CPU configurations.
... This arises from the directly proportional dependency between the sampling rate and the angular resolution, represented by the following equation. Consequently, the transfer to the frequency domain alone is not sufficient to solve the computational-effort problem. On this account, the algorithms have been realized on specialized hardware such as Field Programmable Gate Arrays (FPGAs) [12] or Graphics Processing Units (GPUs) [13], besides powerful PCs. ...
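For a single microphone pair with spacing d and sound speed c, the time delay is quantized in steps of 1/f_s, which bounds the achievable angular step; one standard form of this dependency (given here as a generic illustration, not necessarily the exact equation omitted from the excerpt) is:

```latex
% TDOA of a plane wave arriving at angle \theta on a pair with spacing d,
% and the angular step implied by a delay quantization of \Delta\tau = 1/f_s.
\tau(\theta) = \frac{d \sin\theta}{c},
\qquad
\Delta\theta \approx \frac{\Delta\tau}{\mathrm{d}\tau/\mathrm{d}\theta}
             = \frac{c}{f_s\, d \cos\theta}
```

Improving the angular resolution therefore requires a higher sampling rate (or interpolation), which is what drives the computational effort mentioned above.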
Beamforming techniques are widely used in many fields of research such as sonar, radar, wireless communication and speech processing applications. Beamforming algorithms are mainly used for signal enhancement and direction-of-arrival estimation. In applications for tracking mobile communication partners such as speakers or aircraft, beamforming algorithms are employed to estimate the direction of arrival. Beamforming processing is always accompanied by high computational costs, which are challenging for embedded devices. Recent approaches perform the calculation in the frequency domain to reduce the processing time. However, in many cases the processing is still very slow and cannot be used for real-time processing. Alternative real-time solutions based on FPGAs face the drawbacks of long development processes and restricted communication interfaces. This paper introduces a novel implementation approach based on System on Chip (SoC) technology with optimized hardware/software partitioning for real-time delay-and-sum beamforming. The basis for evaluation is a runtime measurement of software implementations to determine computation steps with high processing time and parallelization potential. The extracted computation steps are implemented on an associated FPGA with a fully pipelined architecture for high data throughput and fast processing speed. The complexity of the deduced architecture is evaluated regarding data length and data width with respect to computational accuracy.
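To make the parallelizable computation steps concrete, a minimal time-domain delay-and-sum sketch follows; it uses integer-sample delays and a brute-force scan over candidate directions, and the geometry, steering set and rounding strategy are illustrative assumptions rather than the SoC design described in the abstract.

```python
# Minimal time-domain delay-and-sum (DAS) beamformer sketch with delays
# rounded to whole samples; geometry and steering directions are
# illustrative assumptions.
import numpy as np

def delay_and_sum(signals, mic_pos, direction, fs, c=343.0):
    """signals: (M, N) microphone signals; mic_pos: (M, 3) positions [m];
    direction: unit vector of the assumed plane-wave arrival direction.
    Returns the beamformed output steered towards 'direction'."""
    M, N = signals.shape
    # Relative delays (in samples) for a plane wave from 'direction'.
    delays = (mic_pos @ direction) / c * fs
    delays -= delays.min()                      # make all delays non-negative
    out = np.zeros(N)
    for m in range(M):
        shift = int(round(delays[m]))
        out[shift:] += signals[m, :N - shift]   # integer-sample alignment
    return out / M

def estimate_doa(signals, mic_pos, candidate_dirs, fs):
    """Pick the steering direction that maximizes the output power."""
    powers = [np.sum(delay_and_sum(signals, mic_pos, d, fs) ** 2)
              for d in candidate_dirs]
    return candidate_dirs[int(np.argmax(powers))]
```

The per-direction delay-and-accumulate loop is the kind of step that maps naturally onto a fully pipelined FPGA datapath, while the direction scan parallelizes across processing elements.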
... Indeed, when multiple microphones are considered, a sound wave reaches each sensor with a different energy and at different time instants, which are observable as phase shifts in the captured signals. In general, the main categories into which the algorithms can be grouped (Cobos et al., 2017; Meng and Xiao, 2017; Dibiase et al., 2001) are: signal-energy-based locators, Time Difference of Arrival (TDOA) based locators, locators based on the Steered Response Power (SRP) of a beamformer, and high-resolution spectral estimation based locators. ...
... Energy-based approaches localize speakers by using energy measures of the signals acquired from the acoustic sensors (Cobos et al., 2017; Meng and Xiao, 2017). In more detail, these techniques are based on the average of the signal energy computed over several windows of the signals themselves. ...
... Additionally, the source position is estimated by using an averaged measure, instead of sample-by-sample or frame-by-frame information, thus making these techniques intrinsically less precise (Cobos et al., 2017). For a survey on energy-based localization methods, the interested reader can refer to Cobos et al. (2017) and Meng and Xiao (2017). ...
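As a hedged sketch of the energy-based family summarized in these excerpts, the code below averages the signal energy over windows and matches pairwise energy ratios against an inverse-square decay model on a position grid; the decay exponent, window length and grid are illustrative assumptions, not the specific methods surveyed in the cited works.

```python
# Hedged sketch of energy-based localization via window-averaged energies
# and a grid search over an inverse-square propagation model.
import numpy as np

def windowed_energy(signal, win=1024):
    """Average energy of a signal over non-overlapping windows."""
    n = (len(signal) // win) * win
    frames = signal[:n].reshape(-1, win)
    return np.mean(np.sum(frames ** 2, axis=1))

def energy_localize(signals, sensor_pos, grid):
    """signals: list of 1-D arrays; sensor_pos: (M, 3); grid: (Q, 3).
    Matches measured pairwise energy ratios against the ratios predicted
    by an inverse-square distance model and returns the best grid point."""
    E = np.array([windowed_energy(s) for s in signals])      # (M,)
    best, best_cost = None, np.inf
    for p in grid:
        d2 = np.sum((sensor_pos - p) ** 2, axis=1)           # squared distances
        cost = 0.0
        for i in range(len(E)):
            for j in range(i + 1, len(E)):
                # Model: E_i / E_j ~ d_j^2 / d_i^2 (log-domain mismatch).
                cost += (np.log(E[i] / E[j]) - np.log(d2[j] / d2[i])) ** 2
        if cost < best_cost:
            best, best_cost = p, cost
    return best
```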
In the field of human speech capturing systems, a fundamental role is played by source localization algorithms. In this paper, a Speaker Localization algorithm (SLOC) based on Deep Neural Networks (DNN) is evaluated and compared with state-of-the-art approaches. The speaker position in the room under analysis is directly determined by the DNN, making the proposed algorithm fully data-driven. Two different neural network architectures are investigated: the Multi Layer Perceptron (MLP) and Convolutional Neural Networks (CNN). GCC-PHAT (Generalized Cross Correlation - PHAse Transform) Patterns, computed from the audio signals captured by the microphones, are used as input features for the DNN. In particular, a multi-room case study is addressed, where the acoustic scene of each room is influenced by sounds emitted in the other rooms. The algorithm is tested by means of the home-recorded DIRHA dataset, characterized by multiple wall and ceiling microphone signals for each room. In detail, the focus is on the speaker localization task in two distinct neighboring rooms.
For comparison, two algorithms proposed in the literature for the addressed applicative context are evaluated: the Crosspower Spectrum Phase Speaker Localization (CSP-SLOC) and the Steered Response Power with Phase Transform speaker localization (SRP-SLOC). Besides providing an extensive analysis of the proposed method, the article shows how the DNN-based algorithm significantly outperforms the state-of-the-art approaches evaluated on the DIRHA dataset, providing an average localization error, expressed in terms of Root Mean Square Error (RMSE), equal to 324 mm and 367 mm for the Simulated and Real subsets, respectively.
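The GCC-PHAT Patterns used as DNN input features above are, for each microphone pair, the PHAT-weighted cross-correlation evaluated over a range of lags; a minimal sketch of that computation is given below (frame length, interpolation factor and lag range are illustrative assumptions).

```python
# Minimal GCC-PHAT sketch: the returned lag vector, stacked across
# microphone pairs, is the kind of "pattern" fed to an MLP/CNN.
import numpy as np

def gcc_phat(x, y, fs, max_tau=None, interp=1):
    """Return the PHAT-weighted cross-correlation of frames x and y,
    restricted to lags within +/- max_tau seconds."""
    n = len(x) + len(y)
    X = np.fft.rfft(x, n=n)
    Y = np.fft.rfft(y, n=n)
    R = X * np.conj(Y)
    R /= np.abs(R) + 1e-12                       # phase transform weighting
    cc = np.fft.irfft(R, n=interp * n)
    max_shift = interp * n // 2
    if max_tau is not None:
        max_shift = min(int(interp * fs * max_tau), max_shift)
    # Center the correlation so index max_shift corresponds to zero lag.
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    return cc                                    # the GCC-PHAT "pattern"
```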
... They have a relative advantage in real-time applications in which performance is more important. The steered response power with phase transform (SRP-PHAT) provides an effective approach for robust signal processing in sound localization [9], and GPU-based acceleration has also been proposed for TDOA measurement using SRP-PHAT [10]. Our paper builds on an initial approach [11] that uses GCC-PHAT-based TDOA measurement, and specifically presents our experience in implementing a GPU-based acceleration approach to guarantee real-time performance. ...
A particle filter (PF) has been introduced for effective position estimation of moving targets in non-Gaussian and nonlinear systems. The time difference of arrival (TDOA) method using an acoustic sensor array has normally been used to estimate the location of a moving target that conceals its position, especially underwater. In this paper, we propose a GPU-based acceleration of target position estimation using a PF and propose an efficient system and software architecture. The proposed graphics processing unit (GPU)-based algorithm offers advantages when applying PF signal processing to a target system consisting of large-scale Internet of Things (IoT)-driven sensors, because its parallelization is scalable. For the TDOA measurement from the acoustic sensor array, we use the generalized cross correlation phase transform (GCC-PHAT) method to obtain the correlation coefficient of the signal using the Fast Fourier Transform (FFT), and we accelerate the calculation of GCC-PHAT-based TDOA measurements using the FFT with the GPU compute unified device architecture (CUDA). The proposed approach utilizes a parallelization method in the target position estimation algorithm using GPU-based PF processing. In addition, it can efficiently estimate sudden changes in the target's movement using GPU-based parallel computing, which can also be used for multiple-target tracking. It also provides scalability, allowing the detection algorithm to be extended as the number of sensors increases. Therefore, the proposed architecture can be applied in IoT sensing applications with a large number of sensors. The target estimation algorithm was verified using MATLAB and implemented using GPU CUDA. We implemented the proposed signal processing acceleration system on the target GPU and analyzed it in terms of execution time. The execution time of the algorithm is reduced by 55% compared with standalone CPU operation on the target embedded board, an NVIDIA Jetson TX1. Also, to address large-scale IoT sensing applications, we use an NVIDIA Tesla K40c as the target GPU. The execution time of the proposed multi-state-space model-based algorithm is similar to that of the one-state-space model algorithm because of GPU-based parallel computing. Experimental results show that the proposed architecture is a feasible solution in terms of high performance and area efficiency.
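To make the estimation loop being accelerated more concrete, the sketch below shows one predict/update/resample cycle of a bootstrap particle filter driven by TDOA measurements; the motion model, noise levels and underwater sound speed are illustrative assumptions, and this CPU/NumPy version deliberately ignores the CUDA mapping discussed in the paper.

```python
# Compact bootstrap particle filter sketch with a TDOA measurement model.
import numpy as np

def tdoa_model(pos, sensor_pos, c=1500.0):
    """Predicted TDOAs of a source at 'pos' w.r.t. the first sensor
    (c = 1500 m/s as an assumed underwater sound speed)."""
    d = np.linalg.norm(sensor_pos - pos, axis=1)
    return (d[1:] - d[0]) / c

def particle_filter_step(particles, weights, z, sensor_pos,
                         q_std=0.5, r_std=1e-4):
    """One predict/update/resample cycle.
    particles: (P, 3) positions; weights: (P,); z: measured TDOA vector."""
    P = len(particles)
    # Predict: random-walk motion model.
    particles = particles + np.random.normal(0.0, q_std, particles.shape)
    # Update: Gaussian likelihood of the TDOA measurement.
    for k in range(P):
        err = z - tdoa_model(particles[k], sensor_pos)
        weights[k] *= np.exp(-0.5 * np.sum((err / r_std) ** 2))
    weights += 1e-300
    weights /= weights.sum()
    # Resample (multinomial) to avoid degeneracy.
    idx = np.random.choice(P, size=P, p=weights)
    particles, weights = particles[idx], np.full(P, 1.0 / P)
    estimate = particles.mean(axis=0)            # posterior mean position
    return particles, weights, estimate
```

The per-particle prediction and likelihood evaluation are independent, which is what makes this loop a natural fit for GPU parallelization.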
... Finally, recent works are also focusing on hardware aspects of the nodes with the aim of efficiently computing the SRP. In this context, the use of graphics processing units (GPUs) for implementing SRP-based approaches is especially promising [96,97]. In [98] the performance of SRP-PHAT is analyzed over a massive multichannel processing framework in a multi-GPU system, analyzing its performance as a function of the number of microphones and the available computational resources in the system. ...
Wireless acoustic sensor networks (WASNs) are formed by a distributed group of acoustic-sensing devices featuring audio playing and recording capabilities. Current mobile computing platforms offer great possibilities for the design of audio-related applications involving acoustic-sensing nodes. In this context, acoustic source localization is one of the application domains that have attracted the most attention from the research community over the last decades. In general terms, the localization of acoustic sources can be achieved by studying energy, temporal and/or directional features of the incoming sound at different microphones and using a suitable model that relates those features to the spatial location of the source (or sources) of interest. This paper reviews common approaches for source localization in WASNs that are focused on different types of acoustic features, namely, the energy of the incoming signals, their time of arrival (TOA) or time difference of arrival (TDOA), the direction of arrival (DOA), and the steered response power (SRP) resulting from combining multiple microphone signals. Additionally, we discuss methods not only aimed at localizing acoustic sources but also designed to locate the nodes themselves in the network. Finally, we discuss current challenges and frontiers in this field.
... Among prominent optimizations are Stochastic Region Contraction (Do, Silverman, & Yu, 2007), Coarse-to-Fine Region Contraction and Stochastic Particle Filtering (Do & Silverman, 2009). A comparative analysis between multithreaded and GPU implementations of the algorithm can be found in da Peruffo Minotto, Rosito Jung, da Silveira, and Lee (2013). Test scenarios include a 4-core CPU executing 1 and 8 threads, respectively, and CUDA code running on two different video cards. ...
... 50 ms). While the SRP-PHAT may be implemented in different ways [20], Eq. (4) in particular has been shown to be suitable for GPU implementation [26]. In general, one drawback of using the SRP-PHAT's global maxima to localize potential speakers is that the precision of such approaches tends to drop as the number of simultaneous speech sources increases. ...
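One common way to obtain several candidate speaker locations from an SRP map, rather than a single global maximum, is iterative peak picking with local suppression; a short sketch follows (the suppression radius and number of peaks are illustrative assumptions, not the method of the cited works).

```python
# Iterative peak picking on an SRP power map: repeatedly take the strongest
# remaining grid point and suppress its neighborhood, yielding candidate
# locations for multiple simultaneous speakers.
import numpy as np

def pick_srp_peaks(power, grid, n_peaks=3, min_dist=0.5):
    """power: (Q,) SRP values; grid: (Q, 3) candidate positions [m]."""
    power = power.copy()
    peaks = []
    for _ in range(n_peaks):
        q = int(np.argmax(power))
        if not np.isfinite(power[q]):
            break
        peaks.append(grid[q])
        # Suppress every grid point within min_dist metres of the peak.
        too_close = np.linalg.norm(grid - grid[q], axis=1) < min_dist
        power[too_close] = -np.inf
    return np.array(peaks)
```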
Humans can extract speech signals that they need to understand from a mixture of background noise, interfering sound sources, and reverberation for effective communication. Voice Activity Detection (VAD) and Sound Source Localization (SSL) are the key signal processing components that humans perform by processing sound signals received at both ears, sometimes with the help of visual cues by locating and observing the lip movements of the speaker. Both VAD and SSL serve as the crucial design elements for building applications involving human speech. For example, systems with microphone arrays can benefit from these for robust speech capture in video conferencing applications, or for speaker identification and speech recognition in Human Computer Interfaces (HCIs). The design and implementation of robust VAD and SSL algorithms in practical acoustic environments are still challenging problems, particularly when multiple simultaneous speakers exist in the same audiovisual scene. In this work we propose a multimodal approach that uses Support Vector Machines (SVMs) and Hidden Markov Models (HMMs) for assessing the video and audio modalities through an RGB camera and a microphone array. By analyzing the individual speakers’ spatio-temporal activities and mouth movements, we propose a mid-fusion approach to perform both VAD and SSL for multiple active and inactive speakers. We tested the proposed algorithm in scenarios with up to three simultaneous speakers, showing an average VAD accuracy of 95.06% with an average error of 10.9 cm when estimating the three-dimensional locations of the speakers.
In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), due to its satisfactory localization performance on moderately reverberant and noisy scenarios. Many works have analysed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm which allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.