About
95
Publications
12,986
Reads
1,557
Citations
Publications (95)
The training of deep learning-based multichannel speech enhancement and source localization systems relies heavily on the simulation of room impulse response and multichannel diffuse noise, due to the lack of large-scale real-recorded datasets. However, the acoustic mismatch between simulated and real-world data could degrade the model performance...
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior...
This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band bloc...
In this work, we extend our previously proposed offline SpatialNet for long-term streaming multichannel speech enhancement in both static and moving speaker scenarios. SpatialNet exploits spatial information, such as the spatial/steering direction of speech, for discriminating between target speech and interferences, and achieved outstanding perfor...
Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the...
In indoor scenes, reverberation is a crucial factor in degrading the perceived quality and intelligibility of speech. In this work, we propose a generative dereverberation method. Our approach is based on a probabilistic model utilizing a recurrent variational auto-encoder (RVAE) network and the convolutive transfer function (CTF) approximation. Di...
Sound event detection (SED) often suffers from the data deficiency problem. The recent baseline system in the DCASE2023 challenge task 4 leverages large pretrained self-supervised learning (SelfSL) models to mitigate this restriction, where the pretrained models help to produce more discriminative features for SED. However, the pretrained mode...
This work proposes a neural network to extensively exploit spatial information for multichannel joint speech separation, denoising and dereverberation, named SpatialNet. In the short-time Fourier transform (STFT) domain, the proposed network performs end-to-end speech enhancement. It is mainly composed of interleaved narrow-band and cross-band block...
In recent years, self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. The ultimate goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, generally including clip-level and frame-level tasks. Clip-level tasks classify the scene or sound of an entire audio clip,...
Moving humans, agents, and subjects bring many challenges to robot self‐localisation and environment perception. To adapt to dynamic environments, SLAM researchers typically apply several deep learning image segmentation models to eliminate these moving obstacles. However, these moving obstacle segmentation methods cost too much computation resourc...
This work proposes a multichannel narrow-band speech separation network. In the short-time Fourier transform (STFT) domain, the proposed network processes each frequency independently, and all frequencies use a shared network. For each frequency, the network performs end-to-end speech separation, namely taking as input the STFT coefficients of micr...
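The narrow-band idea in the abstract above (each frequency processed independently, with one network shared across all frequencies) can be sketched as follows. This is a minimal illustration only: `process_narrow_band` and `toy_band_fn` are hypothetical names, and the toy function that averages the microphones stands in for the shared network, which the abstract does not specify here.

```python
def process_narrow_band(stft, band_fn):
    """Apply the same function to every frequency band independently.

    stft[f][t][m] is the complex STFT coefficient of frequency f,
    frame t and microphone m.
    """
    return [band_fn(band) for band in stft]


def toy_band_fn(band):
    # Stand-in for the shared network (an assumption for illustration):
    # average the microphone channels in each frame of one band.
    return [sum(frame) / len(frame) for frame in band]


# Hypothetical mixture: 2 frequencies, 2 frames, 2 microphones.
mixture = [
    [[1 + 1j, 3 + 1j], [0j, 2 + 0j]],
    [[2j, 4j], [1 + 0j, 1 + 0j]],
]
output = process_narrow_band(mixture, toy_band_fn)
```

Because the function only ever sees one band at a time, changing the coefficients of one frequency leaves the output of every other frequency unchanged.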
In multichannel speech enhancement, both spectral and spatial information are vital for discriminating between speech and noise. How to fully exploit these two types of information and their temporal dynamics remains an interesting research problem. As a solution to this problem, this paper proposes a multi-cue fusion network named McNet, which cas...
This work proposes a new learning target based on reverberation time shortening (RTS) for speech dereverberation. The learning target for dereverberation is usually set as the direct-path speech or optionally with some early reflections. This type of target suddenly truncates the reverberation, and thus it may not be suitable for network training....
Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem. This paper investigates several core modules of SSL, and introduces a random consistency training (RCT) strategy. First, a hard mixup data a...
The artificial intelligence (AI) community has recently made tremendous progress in developing self-supervised learning (SSL) algorithms that can learn high-quality data representations from massive amounts of unlabeled data. These methods brought great results even to the fields outside of AI. Due to the joint efforts of researchers in various are...
Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL, and proposes a new transformer-based teacher-s...
This work proposes a subband network for single-channel speech dereverberation, and also a new learning target based on reverberation time shortening (RTS). In the time-frequency domain, we propose to use a subband network to perform dereverberation for different frequency bands independently. The time-domain convolution can be well decomposed to s...
This work proposes a multichannel speech separation method with narrow-band Conformer (named NBC). The network is trained to learn to automatically exploit narrow-band speech separation information, such as spatial vector clustering of multiple speakers. Specifically, in the short-time Fourier transform (STFT) domain, the network processes each fre...
Multiple moving sound source localization in real-world scenarios remains a challenging issue due to interaction between sources, time-varying trajectories, distorted spatial cues, etc. In this work, we propose to use deep learning techniques to learn competing and time-varying direct-path phase differences for localizing multiple moving sound sour...
Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two microphone channels. Though DP-RTF fully encodes the sound spatial cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes to lea...
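To make the definition above concrete, here is a minimal sketch that assumes an idealized free-field model in which each direct-path acoustic transfer function is a pure delay with a gain. The function name `dp_rtf` and all numeric values are hypothetical and are not taken from the paper. Under this simplified model, the phase of the DP-RTF encodes the inter-channel time difference and its magnitude encodes the level difference.

```python
import cmath
import math


def dp_rtf(freq_hz, delay1_s, delay2_s, gain1=1.0, gain2=1.0):
    """Ratio of two direct-path transfer functions under a free-field model.

    Each direct-path transfer function is modelled as
    gain * exp(-j * 2 * pi * f * delay). This is an illustrative
    assumption, not the estimator proposed in the paper.
    """
    h1 = gain1 * cmath.exp(-2j * math.pi * freq_hz * delay1_s)
    h2 = gain2 * cmath.exp(-2j * math.pi * freq_hz * delay2_s)
    return h2 / h1


# Hypothetical example: channel 2 receives the direct sound
# 0.25 ms later and 20% weaker than channel 1.
ratio = dp_rtf(freq_hz=1000.0, delay1_s=0.0, delay2_s=0.25e-3,
               gain1=1.0, gain2=0.8)
level_difference = abs(ratio)            # inter-channel level ratio
phase_difference = cmath.phase(ratio)    # inter-channel phase in radians
```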
This paper addresses the problem of multi-channel multi-speech separation based on deep learning techniques. In the short time Fourier transform domain, we propose an end-to-end narrow-band network that directly takes as input the multi-channel mixture signals of one frequency, and outputs the separated signals of this frequency. In narrow-band, th...
Sound event detection (SED), as a core module of acoustic environmental analysis, suffers from the problem of data deficiency. The integration of semi-supervised learning (SSL) largely mitigates this problem without requiring any extra annotation budget. This paper investigates several core modules of SSL, and introduces a random consistency training...
Dynamic objects in the environment, such as people and other agents, lead to challenges for existing simultaneous localization and mapping (SLAM) approaches. To deal with dynamic environments, computer vision researchers usually apply some learning-based object detectors to remove these dynamic objects. However, these object detectors are computati...
This paper addresses the problem of microphone array generalization for deep-learning-based end-to-end multichannel speech enhancement. We aim to train a unique deep neural network (DNN) potentially performing well on unseen microphone arrays. The microphone array geometry shapes the network's parameters when training on a fixed microphone array, a...
This paper proposes a deep neural network (DNN) based direct-path relative transfer function (DP-RTF) enhancement method for robust direction of arrival (DOA) estimation in noisy and reverberant environments. The DP-RTF refers to the ratio between the direct-path acoustic transfer functions of the two microphone channels. First, the complex-value D...
Direct-path relative transfer function (DP-RTF) refers to the ratio between the direct-path acoustic transfer functions of two channels. Though DP-RTF fully encodes the sound directional cues and serves as a reliable localization feature, it is often erroneously estimated in the presence of noise and reverberation. This paper proposes a supervised...
Sound event detection is a core module for acoustic environmental analysis. Semi-supervised learning makes it possible to scale up the dataset considerably without increasing the annotation budget, and has recently attracted much research attention. In this work, we study two advanced semi-supervised learning techniques for sound event detection. Data a...
This paper addresses the problem of sound-source localization (SSL) with a robot head, which remains a challenge in real-world environments. In particular we are interested in locating speech sources, as they are of high interest for human-robot interaction. The microphone-pair response corresponding to the direct-path sound propagation is a functi...
This paper proposes a full-band and sub-band fusion model, named FullSubNet, for single-channel real-time speech enhancement. Full-band and sub-band refer to models that take full-band and sub-band noisy spectral features as input and output full-band and sub-band speech targets, respectively. The sub-band model processes each frequency independently. It...
In single-channel speech enhancement, methods based on full-band spectral features have been widely studied. However, only a few methods pay attention to non-full-band spectral features. In this paper, we explore a knowledge distillation framework based on sub-band spectral mapping for single-channel speech enhancement. Specifically, we divide the...
This paper proposes a delayed subband LSTM network for online monaural (single-channel) speech enhancement. The proposed method is developed in the short time Fourier transform (STFT) domain. Online processing requires frame-by-frame signal reception and processing. A paramount feature of the proposed method is that the same LSTM is used across fre...
In this paper we address the problem of multichannel speech enhancement in the short-time Fourier transform (STFT) domain and in the framework of sequence-to-sequence deep learning. A long short-term memory (LSTM) network takes as input a sequence of STFT coefficients associated with a frequency bin of multichannel noisy-speech signals. The network...
Robust multi-person tracking with robots opens the door to analysing engagement and social signals in real-world environments. Multi-person scenarios are characterised by (i) a time-varying number of people, (ii) intermittent auditory (e.g. speech turns) and visual cues (e.g. person appearing/disappearing) and (iii) impact of the robot actions in perc...
This paper addresses the problem of multichannel online dereverberation. The proposed method is carried out in the short-time Fourier transform (STFT) domain, and for each frequency band independently. In the STFT domain, the time-domain room impulse response is approximately represented by the convolutive transfer function (CTF). The multichannel...
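The CTF approximation mentioned above models, within each frequency band, the reverberant STFT coefficients as a convolution along the frame axis between the source STFT coefficients and a short band-wise filter. A minimal sketch for a single channel and a single band follows. The function name `apply_ctf` and the numeric values are hypothetical, and the sketch shows only the forward model, not the dereverberation method of the paper.

```python
def apply_ctf(source_frames, ctf_taps):
    """Convolve one frequency band of STFT coefficients with its CTF.

    Implements x[n] = sum_l a[l] * s[n - l] along the frame index n,
    truncated to the length of the source sequence.
    """
    output = []
    for n in range(len(source_frames)):
        acc = 0j
        for l, tap in enumerate(ctf_taps):
            if n - l >= 0:
                acc += tap * source_frames[n - l]
        output.append(acc)
    return output


# Hypothetical values for one frequency band: 4 frames, 2 CTF taps.
source = [1 + 0j, 0j, 2 + 1j, 0j]
taps = [1 + 0j, 0.5 + 0j]
reverberant = apply_ctf(source, taps)
```

Each output frame mixes the current source frame with attenuated earlier frames, which is how reverberation smears energy across time within a band.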
We propose a method using a long short-term memory (LSTM) network to estimate the noise power spectral density (PSD) of single-channel audio signals represented in the short time Fourier transform (STFT) domain. An LSTM network common to all frequency bands is trained, which processes each frequency band individually by mapping the noisy STFT magni...
This paper addresses the problem of under-determined speech source separation from multichannel microphone signals, i.e. the convolutive mixtures of multiple sources. The time-domain signals are first transformed to the short-time Fourier transform (STFT) domain. To represent the room filters in the STFT domain, instead of the widely-used narrowba...
Sound source localization (SSL) is an important technique for many audio processing systems, such as speech enhancement/recognition and human-robot interaction. Although many methods have been proposed for SSL, it still remains a challenging task to achieve accurate localization under adverse acoustic scenarios. In this paper, a novel binaural SSL met...
We address the problem of online localization and tracking of multiple moving speakers in reverberant environments. The paper has the following contributions. We use the direct-path relative transfer function (DP-RTF), an inter-channel feature that encodes acoustic information robust against reverberation, and we propose an online algorithm well s...
This paper presents an online multiple-speaker localization and tracking method, as the INRIA-Perception contribution to the LOCATA Challenge 2018. First, the recursive least-square method is used to adaptively estimate the direct-path relative transfer function as an interchannel localization feature. The feature is assumed to associate with a sin...
This paper addresses the problem of online multiple-speaker localization and tracking in reverberant environments. We propose to use the direct-path relative transfer function (DP-RTF) -- a feature that encodes the inter-channel direct-path information robust against reverberation, hence well suited for reliable localization. A complex Gaussian mix...
This paper addresses the problem of audio source recovery from multichannel noisy convolutive mixture for source separation and speech enhancement, assuming known mixing filters. We propose to conduct the source recovery in the short-time Fourier transform domain, and based on the convolutive transfer function (CTF) approximation. Compared to the t...
This paper addresses the problem of speech separation and enhancement from multichannel convolutive and noisy mixtures, assuming known mixing filters. We propose to perform the speech separation and enhancement task in the short-time Fourier transform domain, using the convolutive transfer function (CTF) approximation. Compared to time-domai...
This paper addresses the problems of blind multichannel identification and equalization for joint speech dereverberation and noise reduction. The time-domain cross-relation method is hardly applicable for blind room impulse response identification due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the...
This paper addresses the problems of blind channel identification and multichannel equalization for speech dereverberation and noise reduction. The time-domain cross-relation method is not suitable for blind room impulse response identification, due to the near-common zeros of the long impulse responses. We extend the cross-relation method to the s...
Binaural sound source localization is an important technique for applications such as speech enhancement, video conferencing and human-robot interaction. However, in realistic scenarios, reverberation and environmental noise degrade the precision of sound direction estimation. Therefore, reliable sound localization is essential to practical applications....
This paper addresses the problem of multiple-speaker localization in noisy and reverberant environments, using binaural recordings of an acoustic scene. A Gaussian mixture model (GMM) is adopted, whose components correspond to all the possible candidate source locations defined on a grid. After optimizing the GMM-based objective function, given an...
The statistical likelihood ratio test is a widely used voice activity detection (VAD) method, in which the likelihood ratio of the current temporal frame is compared with a threshold. A fixed threshold is always used, but this is not suitable for various types of noise. In this paper, an adaptive threshold is proposed as a function of the local statist...
This paper addresses the problem of sound-source localization of a single speech source in noisy and reverberant environments. For a given binaural microphone setup, the binaural response corresponding to the direct-path propagation of a single source is a function of the source direction. In practice, this response is contaminated by noise and rev...
Speaker diarization consists of assigning speech signals to speakers engaged in dialog. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party dialog while they move around and turn their heads towards the other participants rather...
Speaker diarization consists of assigning speech signals to people engaged in a dialogue. An audio-visual spatiotemporal diarization model is proposed. The model is well suited for challenging scenarios that consist of several participants engaged in multi-party interaction while they move around and turn their heads towards the other participants...
Keyword spotting remains a challenge when applied to real-world environments with dramatically changing noise. In recent studies, audio-visual integration methods have demonstrated superiority since visual speech is not influenced by acoustic noise. However, for visual speech recognition, individual utterance mannerisms can lead to confusion and f...
One of the main applications of the humanoid robot NAO - a small robot companion - is human-robot interaction (HRI). NAO is particularly well suited for HRI applications because of its design, hardware specifications, programming capabilities, and affordable cost. Indeed, NAO can stand up, walk, wander, dance, play soccer, sit down, recognize and g...