60 Publications
8,278 Reads
595 Citations
Publications (60)
Although end-to-end (E2E) text-to-speech (TTS) models with a HiFi-GAN-based neural vocoder (e.g. VITS and JETS) can achieve human-like speech quality with fast inference speed, there is still room to improve their inference speed on a CPU for practical implementations because the HiFi-GAN-based neural vocoder unit is a bottleneck. Additi...
There is a need to improve the synthesis quality of HiFi-GAN-based real-time neural speech waveform generative models on CPUs while preserving the controllability of fundamental frequency ($f_{\mathrm{o}}$) and speech rate (SR). For this purpose, we propose Harmonic-Net and Harmonic-Net+, which introduce two extended functions into the HiFi-GAN...
Speech-rate conversion technology, which can expand or compress speech waveforms while preserving the pitch of the sound, is traditionally realized by signal-processing-based approaches. To improve the synthesis quality, this paper proposes a machine-learning-based approach using neural vocoders, to perform neural speech-rate conversion. The propos...
This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms to cover the entire frequency range audible by human beings. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, they ha...
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real t...
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generation speed is much faster than real-time. While PWG is taken as a vocoder to generate speech on the basis of aco...
Recent progress in text-to-speech synthesis (TTS) technology has allowed computers to read any written text aloud with a voice that is artificial but almost indistinguishable from real human speech. This improvement in the quality of synthetic speech has expanded the applications of TTS technology. This chapter will explain the mechanism of a...
A sound field control approach is investigated for recording a primary sound field and synthesizing it at a secondary field without exterior radiation using circular double-layer arrays of microphones and loudspeakers. Although the conventional least-squares (LS) and generalized singular value decomposition (GSVD) approaches are based on numerical...
This paper provides two methods for realizing three-dimensional localized sound zone generation based on external radiation cancelling using multiple loudspeakers for personal sound systems. The radiation property produced by a spherical or circular loudspeaker array outside the sphere or the circle is different from that inside them. The sound pre...
Generating acoustically bright and dark zones using loudspeakers is gaining attention as one of the most important acoustic communication techniques for such uses as personal sound systems and multilingual guide services. Although most conventional methods are based on numerical solutions, an analytical approach based on the spatial Fourier transfor...
A study revisits large vocabulary continuous speech recognition (LVCSR)-based spoken language identification (LID) along with recent technological advancements for multilingual spoken language applications. The study shows that recent LVCSR-based LID can determine the language of an utterance by selecting the hypothesis with the highest likelihood...
This paper provides an analytical method for realizing local sound field propagation that can generate audible acoustic signals close to loudspeakers in the horizontal plane, but very low amplitudes at and beyond the reference distance. The proposed method is based on dimension mismatches between a 3-dimensional sound field propagated from a point...
Sensing of high-definition three-dimensional (3D) sound-space information is of crucial importance for realizing total 3D spatial sound technology. We have proposed a sensing method for 3D sound-space information using symmetrically and densely arranged microphones. This method is called SENZI (Symmetrical object with ENchased Zillion microphones)....
A novel signal processing method is proposed for sound field recording and reproduction using multiple parallel linear microphone and loudspeaker arrays. In sound field recording and reproduction, the problem is how to calculate the transfer filters that transform the signals recorded by microphones into the driving signals of the loudspeakers. The...
Sound field reproduction systems seek to realistically convey 3D spatial audio by re-creating the sound pressure inside a region enclosing the listener. High-order Ambisonics (HOA), a sound field reproduction technology, is notable for defining a scalable encoding format that characterizes the sound field in a system-independent way. Sound fields s...
Novel signal processing is proposed for generating acoustically bright and dark zones at arbitrary horizontal positions using a linear array of loudspeakers. Most conventional methods are based on the numerical calculation of the inverse of the spatial correlation matrix between control points and the positions of the loudspeakers. However, such me...
High-order Ambisonics (HOA) is a sound field reproduction technique that defines a scalable and system-independent encoding of spatial sound information. Decoding of HOA signals for reproduction using loudspeaker arrays can be a difficult task if the angular spacing between adjacent loudspeakers, as observed from the listening position, is not unif...
Ambisonics, a sound field reproduction technique, can present spatial audio with high accuracy. However, it has not been widely adopted due to the hardware requirements it imposes. Consequently, very little Ambisonics-encoded content is available. End-users will find it more attractive to invest in Ambisonics systems if they can be made backwards...
The adoption rate of multi-channel audio systems has dramatically increased in recent years. It is common to find 5.1- or 7.1-channel systems in typical home theaters. However, most users do not set up the satellite loudspeakers at the prescribed positions for aesthetic reasons or due to space constraints. Recently, we introduced a technique to opti...
We proposed a sensing method for 3D sound-space information based on symmetrically and densely arranged microphones mounted on a solid sphere. We call this method SENZI [Sakamoto et al., ISUC2008 (2008)]. In SENZI, the sensed signals from each of the microphones are simply weighted and summed to synthesize a listener's HRTF, reflecting the listener's...
Loudspeaker distributions resulting in crosstalk cancellers robust to head rotation are identified. A series of computer simulations were conducted to evaluate the resulting binaural signals of crosstalk cancellers for various loudspeaker configurations. 52,650 different two-channel loudspeaker arrangements were considered and the crosstalk cancell...
Sensing of high-definition 3D sound-space information is important to realize total 3D spatial sound technology. Nevertheless, conventional methods cannot sense comprehensive 3D sound-space information at a listening point properly and precisely so that the information can be reproduced simultaneously for many individual remote listeners facing in...
We propose a new speech privacy technique based on simple summation of numerous signals using N-channel loudspeakers. A speech signal mixed with high-level white or pink noise and with a delay set appropriately for each channel is reproduced by each loudspeaker to be synchronized at a specified sweet spot. At the sweet spot, SNR increases proportio...
Several dereverberation algorithms have been studied. The sampling frequencies used in conventional studies are typically 8–16 kHz because their main purpose is preprocessing for improving the intelligibility of speech communication and articulation for automatic speech recognition. However, in next-generation communication systems, techniques to a...
Three-dimensional (3D) radiated sound field display systems are important for realizing ultra-realistic communications systems such as 3D television. In this paper, a 3D radiated sound field display system using directional loudspeakers and wave field synthesis is proposed. The proposed system is based on the Fresnel-Kirchhoff diffraction formul...
Near 3D sound field display systems are important for realizing ultra-realistic communications systems such as 3D television. We have proposed near 3D sound field reproduction systems using directional loudspeakers and wave field synthesis and developed a real system by constructing the surrounding microphone array and the radiated loudspe...
It is very important to develop near 3D sound field reproduction techniques in order to realize ultra-realistic communications such as 3D television and 3D teleconferencing. In this paper, a near 3D sound field reproduction system using directional loudspeakers and wave field synthesis is developed by constructing the surrounding microphone ar...
We have developed the field recording, recognition and reproduction (FIR3) system to record a sound field for later reproduction with the goal of reconstructing the sound information of a room in another space at another time. In this system, a surrounding microphone array is used to record a sound field. A method for detecting sound source positio...
Ambisonics, a sound field synthesis and reproduction technique, has shown promising results in conveying three-dimensional spatialized sound. Ambisonic encodings directly describe the spatial properties of sound fields without reference to the reproduction system. Precise regeneration of a sound field requires a large number of loudspeakers arrange...