
Hideki Kawahara
Wakayama University · Center for Joint Research and Development
PhD
268 publications · 34,667 reads · 8,412 citations
Introduction
Hideki Kawahara currently works at the Center for Joint Research and Development, Wakayama University. Kawahara does research in speech processing, acoustic engineering, and electrical engineering. The current project is a speech analysis, modification, and resynthesis framework.
Publications (268)
The purpose of this paper is to make easily available to the scientific community an efficient voice morphing tool called STRAIGHTMORPH and provide a short tutorial on its use with examples. STRAIGHTMORPH consists of a set of Matlab functions allowing the generation of high-quality, parametrically-controlled morphs of an arbitrary number of voice s...
We propose protocols for acquiring speech materials, making them reusable for future investigations, and presenting them for subjective experiments. We also provide means to evaluate existing speech materials' compatibility with target applications. We built these protocols and tools based on structured test signals and analysis methods, including...
Better communication with older people requires not only improving speech intelligibility but also understanding how well emotions can be conveyed and the effect of age and hearing loss (HL) on emotion perception. In this paper, emotion discrimination experiments were conducted using a vocal morphing method and an HL simulator in young normal heari...
We generalized a voice morphing algorithm so that it can handle temporally variable morphing of multiple attributes and multiple instances. The generalized morphing provides a new strategy for investigating speech diversity. However, excessive complexity and the difficulty of preparation have prevented researchers and students from enjoying its benefits. To ad...
We propose to apply the frequency modulation transfer function as a supplemental measure for evaluating pitch extractors' performance. We introduced a new family of time-stretched pulses for acoustic system analysis (Kawahara and Yatabe, ICASSP2021, called CAPRICEP). Composing a test signal from CAPRICEP units using a Walsh-Hadamard sequence enables...
We introduce a general framework for measuring acoustic properties such as linear time-invariant (LTI) response, signal-dependent time-invariant (SDTI) component, and random and time-varying (RTV) component simultaneously using structured periodic test signals. The framework also enables music pieces and other sound materials as test signals by "saf...
This article focuses on the research tool for investigating the fundamental frequencies of voiced sounds. We introduce an objective and informative measurement method of pitch extractors' response to frequency-modulated tones. The method uses a new test signal for acoustic system analysis. The test signal enables simultaneous measurement of the ext...
We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. It enables us to evaluate different pitch extractors with unified criteria. The method uses extended time-stretched pulses combined by binary orthogonal sequences. It provides simultaneous measurement results consisting of the linear and the n...
A perception experiment and analyses were conducted to clarify the acoustic features of pop-out voice. Speech items pronounced by 779 native Japanese speakers were prepared for stimuli by mixing them with a babble noise that consisted of overlapping short sentences spoken by 10 Japanese speakers. Using a 5-point scale, 12 Japanese participants rate...
We propose a simple method to measure acoustic responses using any sounds by converting them into a form suitable for measurement. This method enables us to use music pieces for measuring acoustic conditions. It is advantageous to measure such conditions without exposing listeners to annoying test sounds. In addition, applying the underlying idea of simultaneous measur...
We propose an objective measurement method for pitch extractors' responses to frequency-modulated signals. The method simultaneously measures the linear and the non-linear time-invariant responses and random and time-varying responses. It uses extended time-stretched pulses combined by binary orthogonal sequences. Our recent finding of involuntary...
We can estimate the size of a speaker solely from their speech sounds, regardless of whether the sounds are voiced or unvoiced. In this study, we developed a size perception model based on the computational theory of the stabilised wavelet transform (SWT) to explain a variety of size discrimination data. We also conducted extended experiments to ev...
We introduced a measurement procedure for the involuntary response of voice fundamental frequency to frequency-modulated auditory stimulation. This involuntary response plays an essential role in voice fundamental frequency control but has been less investigated because of technical difficulties. This article introduces an interactive and real-time tool for i...
Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary...
We introduce a new member of TSP (Time Stretched Pulse) for acoustic and speech measurement infrastructure, based on a simple all-pass filter and systematic randomization. This new infrastructure fundamentally upgrades our previous measurement procedure, which enables simultaneous measurement of multiple attributes, including non-linear ones withou...
We introduce a method that enables the simultaneous measurement of system attributes. The attributes are the linear time-invariant, nonlinear time-invariant, and random and extra responses without introducing additional equipment and post-processing. This new procedure uses a new member of time stretched pulses called FVN...
We introduce a new acoustic measurement method that can measure the linear time-invariant response, the nonlinear time-invariant response, and random and time-varying responses simultaneously. The method uses a set of orthogonal sequences made from a set of unit FVNs (Frequency domain variant of Velvet Noise), a new member of the TSP (Time Stretche...
We propose a new family of test signals for acoustic measurements such as impulse response, nonlinearity, and the effects of background noise. The proposed family complements difficulties in existing families: the Swept-Sine (SS) and pseudo-random noise such as the maximum length sequence (MLS). The proposed family uses the frequency domain variant of...
We introduce real-time and interactive tools for assisting vocal training. In this presentation, we mainly demonstrate a tool based on a real-time visualizer of fundamental frequency candidates to provide information-rich feedback to learners. The visualizer uses an efficient algorithm using analytic signals for deriving phase-based attributes. We st...
Voice morphing is a framework to generate a new sound which has the mixed attribute of given voice examples. It provides a flexible tool for investigating perceptual attributes in voice communication, especially for quantifying para- and extra-linguistic cues. Recent advances in parametric representation of speech sounds made the morphing-based app...
The past decades have seen an explosion of research into the psychological, cognitive, neural, biological, and technical mechanisms of voice perception. These mechanisms refer to the general ability to extract information from voices expressed by other living beings or by technical systems. Voice perception research is now a lively area of research...
We propose a new excitation source signal for VOCODERs and an all-pass impulse response for post-processing of synthetic sounds and pre-processing of natural sounds for data-augmentation. The proposed signals are variants of velvet noise, which is a sparse discrete signal consisting of a few non-zero (1 or -1) elements and sounds smoother than Gaus...
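As a rough illustration of the sparse signal described in the abstract above, the following is a minimal sketch of basic velvet noise, not of the variants the paper proposes. It assumes the common construction in which the signal is split into fixed-length segments and each segment receives exactly one pulse of value 1 or -1 at a random position. The function name `velvet_noise` and the parameters `n_samples`, `segment_len`, and `seed` are hypothetical names chosen for this example.

```python
import random


def velvet_noise(n_samples, segment_len, seed=0):
    """Return a basic velvet-noise sequence as a list of -1, 0, and 1.

    Assumption: one pulse per segment, with a uniformly random position
    inside the segment and a random sign. This is a generic textbook
    construction, not the variant proposed in the abstract.
    """
    rng = random.Random(seed)
    signal = [0] * n_samples
    for start in range(0, n_samples, segment_len):
        end = min(start + segment_len, n_samples)
        position = rng.randrange(start, end)  # random location in this segment
        signal[position] = rng.choice((1, -1))  # random sign
    return signal


noise = velvet_noise(n_samples=1000, segment_len=100)
nonzero = [value for value in noise if value != 0]
print(len(nonzero))  # 10 pulses: one per 100-sample segment
```

With these settings only 1% of the samples are non-zero, which is the sparsity the abstract refers to; a shorter `segment_len` gives a denser sequence.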
The "cocktail party problem" requires us to discern individual sound sources from mixtures of sources. The brain must use knowledge of natural sound regularities for this purpose. One much-discussed regularity is the tendency for frequencies to be harmonically related (integer multiples of a fundamental frequency). To test the role of harmonicity i...
We introduce a simple and linear SNR (strictly speaking, periodic to random power ratio) estimator (0dB to 80dB without additional calibration/linearization) for providing reliable descriptions of aperiodicity in speech corpora. The main idea of this method is to estimate the background random noise level without directly extracting the background n...
We formulated and implemented a procedure to generate aliasing-free excitation source signals based on the Fujisaki-Ljungqvist model. It uses a new antialiasing filter in the continuous time domain followed by an IIR digital filter for response equalization. We introduced a general designing procedure of cosine series to design the new antialias...
We applied a novel aperiodicity analysis method (Kawahara et al. 2016, SSW9) to the CSJ database (Maekawa, 2003, ISCA & IEEE Workshop). The applied method derives the amount of aperiodicity as a time-frequency map, using a three-stage procedure. The first stage derives a probability map of the fundamental component without prior information. The second...
Recent advances in computational power and software foundations make it possible to introduce interactive and real-time tools in speech and hearing science education at relatively low cost. The tool, SparkNG (Speech Production and Auditory perception Research Kernel, Next Generation), consists of four applications. They are (a) real-time FFT analyz...
This paper introduces a general and flexible framework for F0 and aperiodicity (additive non periodic component) analysis, specifically intended for high-quality speech synthesis and modification applications. The proposed framework consists of three subsystems: instantaneous frequency estimator and initial aperiodicity detector, F0 trajectory trac...
Hearing impaired (HI) people often have difficulty understanding speech in multi-speaker or noisy environments. With HI listeners, however, it is often difficult to specify which stage, or stages, of auditory processing are responsible for the deficit. There might also be cognitive problems associated with age. In this paper, an HI simulator, based...
A closed-form representation of the anti-aliased L-F model is derived for an LPF function family based on cosine series. The Matlab-based implementation of the derived form provides a virtually aliasing-free source signal, which is applicable to speech synthesis and F0 extractor evaluation. This aliasing-free representation is also suitable for testing pe...
Morphing provides a flexible research strategy for non- and paralinguistic aspects of speech. A recent extension of the morphing procedure has made it possible to interpolate and extrapolate physical attributes of arbitrarily many utterance examples. By using utterances representing typical instantiations of the non- and paralinguistic information i...
This paper describes a simulator for presenting normal hearing (NH) listeners with the experience of a hearing impaired (HI) listener. The simulator is based on the compressive gammachirp (cGC) filter used to derive level-dependent filter shapes and the cochlear compression function from notched-noise masking data. The level dependence of the cG...
A new group delay representation is introduced, which yields zero for periodic signals irrespective of the initial phase and the relative level of each harmonic component. This new group delay representation provides a unified basis for defining 'aperiodicity' in speech sounds. For example, the periodic to noise ratio or harmonic to noise ratio is directly...
In 1988, Roy Patterson introduced the “Pulse ribbon model of pitch perception,” a static representation of periodic signals, which inspired ongoing investigations on the underlying principles of STRAIGHT, a speech analysis, modification, and synthesis framework. It led to temporally static representations of power spectra, instantaneous frequency and group de...
The invention relates to a periodic signal processing method, a periodic signal conversion method, and a periodic signal processing device capable of reducing the influence of periodicity without using a spectral model. Time windows are arranged such that a center of each of the time windows is at a division position which divides a fundamental fre...
Our study introduces a mobile navigation system with a sound input interface. To realize a high-performance environmental sound recognition system using Android devices, we organized a database of environmental sounds collected in our daily lives. Crowdsourcing is a useful approach for organizing a database based on collaborative works of people....
Our study introduces an interactive 3D sound playback interface system that is controlled by the user’s behavior. It consists of an Android terminal, stereo headphones, and a Nintendo Wii Balance Board. Traditional binaural audio systems can only deal with simple fixed playback conditions. On the other hand, our system assumes that the user is continu...
While humans use their voice mainly for communicating information about the world, paralinguistic cues in the voice signal convey rich dynamic information about a speaker's arousal and emotional state, and extralinguistic cues reflect more stable speaker characteristics including identity, biological sex and social gender, socioeconomic or regional...
A group delay-based excitation source analysis and design method is introduced as an extension of TANDEM-STRAIGHT, a speech analysis, modification and synthesis system. With this addition, all components of the system are based on interference-free representations. They are power spectrum, instantaneous frequency and group delay representations. T...
A highly reproducible estimation method for vocal tract length (VTL) and a text-independent VTL estimation method are proposed, based on a Japanese vowel database spoken by 385 male and female speakers ranging in age from 6 to 56 and another vowel database with MRI-based vocal tract shape information. The proposed methods are based on interference-free power sp...
Another simple and high-speed F0 extractor with high temporal resolution based on our previous proposal has been developed by adding a higher-order symmetry measure. This extension made the proposed method significantly more robust than the previous one. The proposed method is a detector of the lowest prominent sinusoidal component. It can use seve...
Voice morphing is a powerful tool for exploratory research and various applications. A temporally variable multi-aspect morphing is extended to enable morphing of arbitrarily many voices in a single step procedure. The proposed method is implemented based on interference-free representations of periodic signals and found to yield highly-naturally s...
In this paper, we demonstrate an auditory spectrogram based on a dynamic compressive gammachirp filterbank (GCFB) that enables accurate and robust estimation of vocal tract length (VTL) for both voiced and whispered speech. Normalized VTLs of 21 speakers were derived by using least-squares analysis of their VTL ratios (for all permutations, 420...
A study was conducted to propose a new cross-synthesis framework based on an interference-free representation of a power spectrum combined with normalization and modulation transfer function design for spectral envelope preprocessing of speech sounds. The proposed cross-synthesis enabled control of the linguistic information and the timbre identity...
This chapter presents a unified gammachirp framework for estimating cochlear compression and synthesizing sounds with inverse compression that cancels the compression of a normal-hearing (NH) listener to simulate the experience of a hearing-impaired (HI) listener. The compressive gammachirp (cGC) filter was fitted to notched-noise masking data...
It is important for the development of hearing aids and other audio devices to make accurate estimates of the frequency selectivity and compression of the auditory filter. Previously, we reported a technique for estimating the compression of the auditory filter that combined data from a simultaneous notched-noise experiment and a temporal masking c...
A Japanese vowel database of male, female, and child speakers (385 speakers in total) along with relevant physical data [Deguchi et al. (2011)] was analyzed using a set of F0 adaptive procedures, which were developed for a speech analysis, modification and synthesis framework TANDEM-STRAIGHT [Kawahara et al. (2008)] and its extensions. By restr...
We have developed a method to build a Japanese automatic speech recognition (ASR) system based on 3-gram language model expansion with the Google database. Our aim is to enhance the recognition accuracy of ASR systems based on the 3-gram language model, even in cases where the language model is trained using short text segments. We investigate a pr...
A new spectral envelope estimation procedure is proposed to recover details beyond the band limitation imposed by Shannon's sampling theory when interpreting periodic excitation of voiced sounds as the sampling operation in the frequency domain. The proposed procedure is a hybrid of STRAIGHT, an F0-adaptive spectral envelope estimation and the auto...
A periodicity extraction method is introduced to analyze voiced sounds with complex excitation behavior. Although voiced sounds generally have only one periodicity, some voiced sounds such as the pathological voice and the singing voice often have multiple periodicities. A method for estimating multiple periodicities from voiced sounds to deal with t...
There has recently been a series of studies concerning the interaction of glottal pulse rate (GPR) and mean-formant-frequency (MFF) in the perception of speaker characteristics and speech recognition. This paper extends the research by comparing the recognition and discrimination performance achieved with voiced words to that achieved with whispere...
A simple model for generating aperiodic components in synthetic speech is introduced by modifying lower frequency representation for improving voice quality of resynthesized or morphed speech. The new representation is simple enough to allow intuitive manipulation of this quality-related attribute. The model represents aperiodic component using a...
A new set of voice excitation source analysis methods is applied to study Japanese traditional singing voices, especially Noh. The first method, XSX (excitation Structure extractor), is capable of visualizing the detailed structure of subharmonic periodicity by using multiple dedicated periodicity detectors. The second one analyzes symmetry of the fundame...
Realistic reconstruction and manipulation of strong vocal expressions found in singing voices is a challenging and exciting topic. A speech analysis, modification and resynthesis framework based on interference-free power spectral and instantaneous frequency representations for periodic sounds is extended for handling such voices. Strong expression...
We introduce novel auditory features in the hidden Markov model (HMM) system for detecting child speakers. The features derived by the gammachirp auditory filterbank (GCFB) have been demonstrated to be suitable for vocal tract length (VTL) estimation, both theoretically and experimentally. We performed numerical experiments to distinguish between c...