
Tomohiro Nakatani
- Dr. of Informatics
- Researcher at Nippon Telegraph and Telephone
About
- Publications: 420
- Reads: 50,129
- Citations: 11,079
Publications (420)
In this paper, the latest advancements in source separation and target speech extraction technologies are reviewed. The former separates individual sounds from an acoustic signal recorded with multiple voices and other sounds, while the latter extracts only the speech of the desired speaker. These technologies make speech more understa...
This paper overviews neural target sound information extraction (TSIE), which consists of extracting the desired information about a sound source in an observed sound mixture given clues about the target source. TSIE is a general framework, which covers various applications, such as target speech/sound extraction (TSE), personalized voice activity...
In this paper, we introduce a multi-talker distant automatic speech recognition (DASR) system we designed for the DASR task 1 of the CHiME-8 challenge. Our system performs speaker counting, diarization, and ASR. It handles various recording conditions, from diner parties to professional meetings and from two to eight speakers. We perform diarizatio...
Multi-channel acoustic signal processing is a well-established and powerful tool to exploit the spatial diversity between a target signal and non-target or noise sources for signal enhancement. However, the textbook solutions for optimal data-dependent spatial filtering rest on the knowledge of second-order statistical moments of the signals, which...
Multichannel acoustic signal processing is a well-established and powerful tool to exploit the spatial diversity between a target signal and nontarget or noise sources for signal enhancement. However, the textbook solutions for optimal data-dependent spatial filtering rest on the knowledge of second-order statistical moments of the signals, which h...
This paper proposes novel methods for extracting a single Speech signal of Interest (SOI) from a multichannel observed signal in underdetermined situations, i.e., when the observed signal contains more speech signals than microphones. It focuses on extracting the SOI using prior knowledge of the SOI’s Direction of Arrival (DOA). Conventional beamfo...
Target sound extraction (TSE) consists of isolating a desired sound from a mixture of arbitrary sounds using clues to identify it. A TSE system requires solving two problems at once, identifying the target source and extracting the target signal from the mixture. For increased practicability, the same system should work with various types of sound....
We present a distant automatic speech recognition (DASR) system developed for the CHiME-8 DASR track. It consists of a diarization first pipeline. For diarization, we use end-to-end diarization with vector clustering (EEND-VC) followed by target speaker voice activity detection (TS-VAD) refinement. To deal with various numbers of speakers, we devel...
Binaural target sound extraction (TSE) aims to extract a desired sound from a binaural mixture of arbitrary sounds while preserving the spatial cues of the desired sound. Indeed, for many applications, the target sound signal and its spatial cues carry important information about the sound source. Binaural TSE can be realized with a neural network...
This paper proposes a computationally efficient joint optimization algorithm that performs online source separation, dereverberation, and noise reduction based on blind and spatially-regularized processing. When applying such online Blind Source Separation (BSS) as online Independent Vector Extraction (IVE) to a speech application, we must focus on...
We propose computationally efficient algorithms for extracting a single source of interest (SOI) using geometrically-regularized independent vector extraction (GR-IVE). Conventional GR-IVE relies on a block majorization-minimization (block MM) algorithm, which successively optimizes each part (block) of the separation matrix based on the minimizati...
In this paper, we address the multichannel blind source extraction (BSE) of a single source in diffuse noise environments. To solve this problem even faster than by fast multichannel nonnegative matrix factorization (FastMNMF) and its variant, we propose a BSE method called NoisyILRMA, which is a modification of independent low-rank matrix analysis...
Combining end-to-end neural speaker diarization (EEND) with vector clustering (VC), known as EEND-VC, has gained interest for leveraging the strengths of both methods. EEND-VC estimates activities and speaker embeddings for all speakers within an audio chunk and uses VC to associate these activities with speaker identities across different chunks....
Full-rank spatial covariance analysis (FCA) is a technique for blind source separation (BSS), and can be applied to underdetermined situations where the sources outnumber the microphones. This paper proposes multi-frame FCA as an extension of FCA to improve the BSS performance when the room reverberations are not so short that multiple time frames...
Beamforming is a powerful tool designed to enhance speech signals from the direction of a target source. Computing the beamforming filter requires estimating spatial covariance matrices (SCMs) of the source and noise signals. Time-frequency masks are often used to compute these SCMs. Most studies of mask-based beamforming have assumed that the sour...
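The mask-based SCM estimation described above can be sketched in a few lines. This is a minimal illustration, not the paper's exact method: the function and variable names are my own, and the Souden-style MVDR formulation used here is one common choice among several.

```python
import numpy as np

def mask_based_mvdr(Y, speech_mask, noise_mask, ref_ch=0):
    """MVDR beamformer with mask-based SCM estimation (illustrative sketch).

    Y           : (F, T, M) STFT of the multichannel mixture
    speech_mask : (F, T) time-frequency mask for the target speech
    noise_mask  : (F, T) time-frequency mask for the noise
    Returns     : (F, T) enhanced STFT at the reference channel
    """
    F, T, M = Y.shape
    X_hat = np.zeros((F, T), dtype=complex)
    for f in range(F):
        Yf = Y[f]  # (T, M)
        # Mask-weighted spatial covariance matrices (SCMs)
        Phi_s = (speech_mask[f, :, None, None] *
                 Yf[:, :, None] * Yf[:, None, :].conj()).sum(0)
        Phi_s /= speech_mask[f].sum() + 1e-10
        Phi_n = (noise_mask[f, :, None, None] *
                 Yf[:, :, None] * Yf[:, None, :].conj()).sum(0)
        Phi_n /= noise_mask[f].sum() + 1e-10
        # Diagonal loading keeps the noise SCM invertible
        Phi_n += 1e-6 * np.trace(Phi_n).real / M * np.eye(M)
        # Souden-style MVDR: w = (Phi_n^{-1} Phi_s e_ref) / tr(Phi_n^{-1} Phi_s)
        num = np.linalg.solve(Phi_n, Phi_s)
        w = num[:, ref_ch] / (np.trace(num) + 1e-10)
        X_hat[f] = Yf @ w.conj()
    return X_hat
```

The masks themselves would typically come from a neural network or a spatial clustering model; here they are simply inputs.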
Target speech extraction (TSE) extracts the speech of a target speaker in a mixture given auxiliary clues characterizing the speaker, such as an enrollment utterance. TSE thus addresses the challenging problem of simultaneously performing separation and speaker identification. There has been much progress in extraction performance following the rec...
It is essential to perform speech intelligibility (SI) experiments with human listeners to evaluate the effectiveness of objective intelligibility measures. Recently, crowdsourced remote testing has become popular for collecting a massive amount and variety of data at relatively low cost and in a short time. However, careful data screening is essentia...
A majorization-minimization (MM) algorithm for independent vector analysis optimizes a separation matrix $W = [w_1, \ldots, w_m]^h \in \mathbb{C}^{m \times m}$ by minimizing a surrogate function of the form $\mathcal{L}(W) = \sum_{i = 1}^m w_i^h V_i w_i - \log | \det W |^2$, where $m \in \mathbb{N}$ is the number of sensors and positive definite ma...
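A surrogate of this form is commonly minimized row by row with the iterative-projection (IP) rule known from AuxIVA: w_i ← (W V_i)^{-1} e_i followed by the normalization w_i ← w_i / sqrt(w_i^h V_i w_i), each step being the exact minimizer for that row with the others fixed. A minimal sketch under that assumption (names are illustrative):

```python
import numpy as np

def ip_update(W, V):
    """One sweep of iterative-projection (IP) updates minimizing
    L(W) = sum_i w_i^H V_i w_i - log|det W|^2.

    W : (m, m) current separation matrix, rows are w_i^H
    V : (m, m, m) positive-definite weighted covariance matrices V_i
    """
    m = W.shape[0]
    for i in range(m):
        # w_i <- (W V_i)^{-1} e_i
        w = np.linalg.solve(W @ V[i], np.eye(m)[:, i])
        # Normalize so that w_i^H V_i w_i = 1
        w /= np.sqrt((w.conj() @ V[i] @ w).real)
        W[i] = w.conj()  # store the row w_i^H
    return W
```

Because each row update is an exact coordinate-wise minimizer, the surrogate is non-increasing over a sweep.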
This paper develops a framework that can accurately perform denoising, dereverberation, and source separation using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate $N$ sources from their sound mixture even with diffuse noise when a sufficiently large number (...
Far-field multi-speaker automatic speech recognition (ASR) has drawn increasing attention in recent years. Most existing methods feature a signal processing frontend and an ASR backend. In realistic scenarios, these modules are usually trained separately or progressively, which suffers from either inter-module mismatch or a complicated training pro...
This paper develops a framework that can perform denoising, dereverberation, and source separation accurately by using a relatively small number of microphones. It has been empirically confirmed that Independent Vector Analysis (IVA) can blindly separate $N$ sources from their sound mixture even with diffuse noise when a sufficiently large number (...
This paper proposes an approach for optimizing a Convolutional BeamFormer (CBF) that can jointly perform denoising (DN), dereverberation (DR), and source separation (SS). First, we develop a blind CBF optimization algorithm that requires no prior information on the sources or the room acoustics, by extending a conventional joint DR and SS method. F...
We extend the state-of-the-art online dereverberation method, online weighted prediction error (WPE), which predicts late reverberation components using a multichannel linear prediction (MCLP) filter. The multi-input/output inverse theorem states that in general such an MCLP filter for WPE exists only if the number of sources is less than that of m...
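The MCLP filter in WPE is obtained by alternating between estimating the desired signal's time-varying power and solving a weighted least-squares problem per frequency bin. Below is a rough offline sketch under simplifying assumptions (single output channel, fixed tap and delay values chosen only for illustration); it is not the online variant the paper extends:

```python
import numpy as np

def wpe_one_freq(Y, taps=8, delay=3, iters=3, eps=1e-10):
    """Offline WPE dereverberation for one frequency bin (channel 0 output).

    Y : (M, T) multichannel STFT at one frequency.
    Predicts the late reverberation of channel 0 from delayed multichannel
    frames and subtracts it.
    """
    M, T = Y.shape
    # Delayed, stacked observation matrix Ytilde: (M*taps, T)
    Ytilde = np.zeros((M * taps, T), dtype=complex)
    for k in range(taps):
        d = delay + k
        Ytilde[k * M:(k + 1) * M, d:] = Y[:, :T - d]
    X = Y[0].copy()
    for _ in range(iters):
        lam = np.abs(X) ** 2 + eps               # time-varying power estimate
        R = (Ytilde / lam) @ Ytilde.conj().T     # weighted correlation matrix
        p = (Ytilde / lam) @ Y[0].conj()         # weighted cross-correlation
        g = np.linalg.solve(R + eps * np.eye(M * taps), p)
        X = Y[0] - g.conj() @ Ytilde             # subtract predicted late reverb
    return X
```

The prediction delay keeps the early reflections and direct path in the output; only components predictable from frames at least `delay` steps in the past are removed.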
We address the determined audio source separation problem in the time-frequency domain. In independent deeply learned matrix analysis (IDLMA), it is assumed that the inter-frequency correlation of each source spectrum is zero, which is inappropriate for modeling nonstationary signals such as music signals. To account for the correlation between fre...
Sound event localization aims at estimating the positions of sound sources in the environment with respect to an acoustic receiver (e.g. a microphone array). Recent advances in this domain most prominently focused on utilizing deep recurrent neural networks. Inspired by the success of transformer architectures as a suitable alternative to classical...
Time-domain training criteria have proven to be very effective for the separation of single-channel non-reverberant speech mixtures. Likewise, mask-based beamforming has shown impressive performance in multi-channel reverberant speech enhancement and source separation. Here, we propose to combine neural network supported multi-channel source separa...
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant conditions...
Blind source separation (BSS) of audio signals aims to separate original source signals from their mixtures recorded by microphones. The applications include automatic speech recognition in a noisy/multi-speaker environment, hearing aids, and music analysis. Independent component analysis (ICA) can perform BSS efficiently, but it is basically inapp...
In this paper, we address the problem of extracting all super-Gaussian source signals from a linear mixture in which (i) the number of super-Gaussian sources $K$ is less than that of sensors $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that do not need to be extracted. To solve this problem, independent vector extraction (IVE...
We address a blind source separation (BSS) problem in a noisy reverberant environment in which the number of microphones M is greater than the number of sources of interest, and the other noise components can be approximated as stationary and Gaussian distributed. Conventional BSS algorithms for the optimization of a multi-input multi-output convol...
Many subjective experiments have been performed to develop objective speech intelligibility measures, but the novel coronavirus outbreak has made it very difficult to conduct experiments in a laboratory. One solution is to perform remote testing using crowdsourcing; however, because we cannot control the listening conditions, it is unclear whether...
Sound event localization frameworks based on deep neural networks have shown increased robustness with respect to reverberation and noise in comparison to classical parametric approaches. In particular, recurrent architectures that incorporate temporal context into the estimation process seem to be well-suited for this task. This paper proposes a n...
Recently, the end-to-end approach has been successfully applied to multi-speaker speech separation and recognition in both single-channel and multichannel conditions. However, severe performance degradation is still observed in the reverberant and noisy scenarios, and there is still a large performance gap between anechoic and reverberant condition...
We address a blind source separation (BSS) problem in a noisy reverberant environment in which the number of microphones $M$ is greater than the number of sources of interest, and the other noise components can be approximated as stationary and Gaussian distributed. Conventional BSS algorithms for the optimization of a multi-input multi-output conv...
Target speaker extraction, which aims at extracting a target speaker's voice from a mixture of voices using audio, visual or locational clues, has received much interest. Recently an audio-visual target speaker extraction has been proposed that extracts target speech by using complementary audio and visual clues. Although audio-visual target speake...
We extend the linear prediction-based dereverberation method called weighted prediction error (WPE). WPE optimizes a causal finite impulse response (FIR) filter that predicts the late reverberation components of an observed signal. However, by the multi-input/output inverse (MINT) theorem, in general, such FIR filters exist only when the number of...
This paper presents a computationally efficient approach to blind source separation (BSS) of audio signals, applicable even when there are more sources than microphones (i.e., the underdetermined case). When there are as many sources as microphones (i.e., the determined case), BSS can be performed computationally efficiently by independent componen...
Target speech extraction, which extracts the speech of a target speaker in a mixture given auxiliary speaker clues, has recently received increased interest. Various clues have been investigated such as pre-recorded enrollment utterances, direction information, or video of the target speaker. In this paper, we explore the use of speaker activity in...
Developing microphone array technologies for a small number of microphones is important due to the constraints of many devices. One direction to address this situation consists of virtually augmenting the number of microphone signals, e.g., based on several physical model assumptions. However, such assumptions are not necessarily met in realistic c...
In this paper, we propose a method combining variational autoencoder model of speech with a spatial clustering approach for multi-channel speech separation. The advantage of integrating spatial clustering with a spectral model was shown in several works. As the spectral model, previous works used either factorial generative models of the mixed spee...
This paper proposes new blind signal processing techniques for optimizing a multi-input multi-output (MIMO) convolutional beamformer (CBF) in a computationally efficient way to simultaneously perform dereverberation and source separation. For effective CBF optimization, a conventional technique factorizes it into a multiple-target weighted predicti...
In this paper, we address the problem of extracting all super-Gaussian source signals from a linear mixture in which (i) the number of super-Gaussian sources $K$ is less than that of sensors $M$, and (ii) there are up to $M - K$ stationary Gaussian noises that do not need to be extracted. To solve this problem, independent vector extraction (IVE) u...
The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market with digi...
The machine recognition of speech spoken at a distance from the microphones, known as far-field automatic speech recognition (ASR), has received a significant increase in attention in science and industry, which caused or was caused by an equally significant improvement in recognition accuracy. Meanwhile, it has entered the consumer market with dig...
Recently, the source separation performance was greatly improved by time-domain audio source separation based on the dual-path recurrent neural network (DPRNN). DPRNN is a simple but effective model for long sequential data. While DPRNN is quite efficient at modeling sequential data of utterance length, i.e., about 5 to 10 seconds of data, it...
Most approaches to multi-talker overlapped speech separation and recognition assume that the number of simultaneously active speakers is given, but in realistic situations, it is typically unknown. To cope with this, we extend an iterative speech extraction system with mechanisms to count the number of sources and combine it with a single-talker sp...
In this study, we propose a new concept, the gammachirp envelope distortion index (GEDI), based on the signal-to-distortion ratio in the auditory envelope, SDRenv, to predict the intelligibility of speech enhanced by nonlinear algorithms. The objective of GEDI is to calculate the distortion between enhanced and clean-speech representations in the d...
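The SDRenv idea above, a signal-to-distortion ratio computed between envelope representations of clean and enhanced speech, can be illustrated with a toy computation. Note that this is not the GEDI implementation: GEDI derives the envelope through a gammachirp auditory filterbank, whereas this sketch substitutes a simple frame-RMS envelope to show only the SDR-in-the-envelope-domain idea.

```python
import numpy as np

def sdr_env(clean, enhanced, frame=256):
    """Toy envelope-domain SDR in dB (frame-RMS envelope stands in for
    the auditory envelope used by GEDI)."""
    def envelope(x):
        n = len(x) // frame * frame
        return np.sqrt(np.mean(x[:n].reshape(-1, frame) ** 2, axis=1))
    e_c = envelope(clean)
    e_e = envelope(enhanced)
    # SDRenv-style ratio: envelope energy over envelope distortion energy
    return 10 * np.log10(np.sum(e_c ** 2) /
                         (np.sum((e_c - e_e) ** 2) + 1e-12))
```

An undistorted signal yields a very high value, and the value drops as the enhanced envelope deviates from the clean one.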
This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for performing denoising, dereverberation, and source separation (DN+DR+SS) at the same time. Conventionally, cascade configuration composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (M...
The performance of speech enhancement algorithms in a multi-speaker scenario depends on correctly identifying the target speaker to be enhanced. Auditory attention decoding (AAD) methods make it possible to identify the target speaker whom the listener is attending to from single-trial EEG recordings. Aiming at enhancing the target speaker and suppressing in...