Takuma Okamoto
National Institute of Information and Communications Technology | NICT · Advanced Speech Translation Research and Development Promotion Center

About

53
Publications
6,493
Reads
450
Citations
Citations since 2017
29 Research Items
381 Citations

Publications (53)
Article
Speech-rate conversion technology, which can expand or compress speech waveforms while preserving the pitch of the sound, is traditionally realized by signal-processing-based approaches. To improve the synthesis quality, this paper proposes a machine-learning-based approach using neural vocoders, to perform neural speech-rate conversion. The propos...
Article
Full-text available
This paper investigates a real-time neural speech synthesis system on CPUs that can synthesize high-fidelity 48 kHz speech waveforms to cover the entire frequency range audible by human beings. Although most previous studies on 48 kHz speech synthesis have used traditional source-filter vocoders or a WaveNet vocoder for waveform generation, they ha...
Article
Full-text available
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real t...
Preprint
In this paper, we propose a quasi-periodic parallel WaveGAN (QPPWG) waveform generative model, which applies a quasi-periodic (QP) structure to a parallel WaveGAN (PWG) model using pitch-dependent dilated convolution networks (PDCNNs). PWG is a small-footprint GAN-based raw waveform generative model, whose generation time is much faster than real t...
Preprint
In this paper, we propose a parallel WaveGAN (PWG)-like neural vocoder with a quasi-periodic (QP) architecture to improve the pitch controllability of PWG. PWG is a compact non-autoregressive (non-AR) speech generation model, whose generation speed is much faster than real-time. While PWG is taken as a vocoder to generate speech on the basis of aco...
Chapter
Recent progress in text-to-speech synthesis (TTS) technology has allowed computers to read any written text aloud with a voice that is artificial but almost indistinguishable from real human speech. Such improvement in the quality of synthetic speech has expanded the application of the TTS technology. This chapter will explain the mechanism of a...
Article
Full-text available
A sound field control approach is investigated for recording a primary sound field and synthesizing it at a secondary field without exterior radiation using circular double-layer arrays of microphones and loudspeakers. Although the conventional least-squares (LS) and generalized singular value decomposition (GSVD) approaches are based on numerical...
Article
This paper provides two methods for realizing three-dimensional localized sound zone generation based on external radiation cancelling using multiple loudspeakers for personal sound systems. The radiation property produced by a spherical or circular loudspeaker array outside the sphere or the circle is different from that inside them. The sound pre...
Article
Full-text available
Generating acoustically bright and dark zones using loudspeakers is gaining attention as one of the most important acoustic communication techniques for such uses as personal sound systems and multilingual guide services. Although most conventional methods are based on numerical solutions, an analytical approach based on the spatial Fourier transfor...
Article
Full-text available
A study revisits large vocabulary continuous speech recognition (LVCSR)-based spoken language identification (LID) along with recent technological advancements for multilingual spoken language applications. The study shows that recent LVCSR-based LID can determine the language of an utterance by selecting the hypothesis with the highest likelihood...
Article
This paper provides an analytical method for realizing local sound field propagation that can generate audible acoustic signals close to loudspeakers in the horizontal plane, but very low amplitudes at and beyond the reference distance. The proposed method is based on dimension mismatches between a 3-dimensional sound field propagated from a point...
Article
Full-text available
Sensing of high-definition three-dimensional (3D) sound-space information is of crucial importance for realizing total 3D spatial sound technology. We have proposed a sensing method for 3D sound-space information using symmetrically and densely arranged microphones. This method is called SENZI (Symmetrical object with ENchased Zillion microphones)....
Article
A novel signal processing method is proposed for sound field recording and reproduction using multiple parallel linear microphone and loudspeaker arrays. In sound field recording and reproduction, the problem is how to calculate the transfer filters that transform the signals recorded by microphones into the driving signals of the loudspeakers. The...
Article
Sound field reproduction systems seek to realistically convey 3D spatial audio by re-creating the sound pressure inside a region enclosing the listener. High-order Ambisonics (HOA), a sound field reproduction technology, is notable for defining a scalable encoding format that characterizes the sound field in a system-independent way. Sound fields s...
Conference Paper
Novel signal processing is proposed for generating acoustically bright and dark zones at arbitrary horizontal positions using a linear array of loudspeakers. Most conventional methods are based on the numerical calculation of the inverse of the spatial correlation matrix between control points and the positions of the loudspeakers. However, such me...
Article
High-order Ambisonics (HOA) is a sound field reproduction technique that defines a scalable and system-independent encoding of spatial sound information. Decoding of HOA signals for reproduction using loudspeaker arrays can be a difficult task if the angular spacing between adjacent loudspeakers, as observed from the listening position, is not unif...
Conference Paper
Ambisonics, a sound field reproduction technique, can present spatial audio with high accuracy. However, it has not been widely adopted due to the hardware requirements it imposes. As a consequence, very little Ambisonics-encoded content is available. End-users will find it more attractive to invest in Ambisonics systems if they can be made backwards...
Article
The adoption rate of multi-channel audio systems has dramatically increased in recent years. It is common to find 5.1- or 7.1-channel systems in typical home theaters. However, most users do not set up the satellite loudspeakers at the prescribed positions for aesthetic reasons or due to space constraints. Recently, we introduced a technique to opti...
Article
Full-text available
We proposed a sensing method of 3D sound-space information based on symmetrically and densely arranged microphones mounted on a solid sphere. We call this method SENZI [Sakamoto et al., ISUC2008 (2008)]. In SENZI, the sensed signals from each of the microphones are simply weighted and summed to synthesize a listener's HRTF, reflecting the listener's...
Article
Full-text available
Loudspeaker distributions resulting in crosstalk cancellers robust to head rotation are identified. A series of computer simulations were conducted to evaluate the resulting binaural signals of crosstalk cancellers for various loudspeaker configurations. 52,650 different two-channel loudspeaker arrangements were considered and the crosstalk cancell...
Article
Sensing of high-definition 3D sound-space information is important to realize total 3D spatial sound technology. Nevertheless, conventional methods cannot sense comprehensive 3D sound-space information at a listening point properly and precisely so that the information can be reproduced simultaneously for many individual remote listeners facing in...
Article
We propose a new speech privacy technique based on simple summation of numerous signals using N-channel loudspeakers. A speech signal mixed with high-level white or pink noise and with a delay set appropriately for each channel is reproduced by each loudspeaker to be synchronized at a specified sweet spot. At the sweet spot, SNR increases proportio...
Article
Several dereverberation algorithms have been studied. The sampling frequencies used in conventional studies are typically 8–16 kHz because their main purpose is preprocessing for improving the intelligibility of speech communication and articulation for automatic speech recognition. However, in next-generation communication systems, techniques to a...
Article
Full-text available
Three-dimensional (3D) radiated sound field display systems are important toward realizing ultra-realistic communications systems such as 3D television. In this paper, a 3D radiated sound field display system using directional loudspeakers and wave field synthesis is proposed. The proposed system is based on the Fresnel-Kirchhoff diffraction formul...
Conference Paper
Full-text available
Near 3D sound field display systems are important for realizing ultra-realistic communications systems such as 3D television. We have proposed a near 3D sound field reproduction system using directional loudspeakers and wave field synthesis and developed a real system by constructing the surrounding microphone array and the radiated loudspe...
Conference Paper
Full-text available
It is very important to develop near 3D sound field reproduction techniques in order to realize ultra-realistic communications such as 3D television and 3D teleconferencing. In this paper, a near 3D sound field reproduction system using directional loudspeakers and wave field synthesis is developed by constructing the surrounding microphone ar...
Article
We have developed the field recording, recognition and reproduction (FIR3) system to record a sound field for later reproduction with the goal of reconstructing the sound information of a room in another space at another time. In this system, a surrounding microphone array is used to record a sound field. A method for detecting sound source positio...
Article
Full-text available
Ambisonics, a sound field synthesis and reproduction technique, has shown promising results in conveying three-dimensional spatialized sound. Ambisonic encodings directly describe the spatial properties of sound fields without reference to the reproduction system. Precise regeneration of a sound field requires a large number of loudspeakers arrange...