
Minoru Tsuzaki
Kyoto City University of Arts · Faculty of Music
131 Publications · 5,351 Reads
784 Citations
Publications (131)
The perceptual simultaneity range for two diotically presented tones increases with decreasing frequency separation of the two tones from approximately 0.5 Bark. As the present study of two frequency regions shows, this effect is not observed when the two tones are not presented to the same ear, i.e., presented dichotically. Since the increase in s...
Auditory feedback plays an essential role in the regulation of the fundamental frequency of voiced sounds. The fundamental frequency also responds to auditory stimulation other than the speaker's voice. We propose to use this response of the fundamental frequency of sustained vowels to frequency-modulated test signals for investigating involuntary...
Recently, we found that the range of perceptual simultaneity, within which two asynchronous pure tones are perceived to start simultaneously, shows a V-shaped curve as a function of frequency separation with the breakpoint at 0.5 Bark. This study aimed to test whether such a law of perceptual simultaneity range is applicab...
The perceptual simultaneity range (PSR) for two pure tones follows a V-shape curve as a function of frequency separation between two tones. This study was conducted to test whether such a V-shape rule of the PSR for two tones is applicable to the PSR for two complex tones. A psychophysical experiment was conducted to measure the PSR for two two-ton...
When two tones start with a small onset asynchrony, one might perceive them as starting simultaneously. The range of this perceptual synchrony is defined as a perceptual simultaneity range (PSR), within which two tones are perceived as simultaneous. Our earlier study found V-shaped behavior of the PSR as a function of frequency separation (Δ f) for...
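The frequency separation in these studies is expressed in Bark. As a rough illustration of what a separation near the 0.5 Bark breakpoint means in hertz, here is a minimal sketch using the Zwicker and Terhardt (1980) approximation of the Bark scale. Which Bark formula the authors used is not stated here, so the formula choice and the function names `hz_to_bark` and `bark_separation` are assumptions made for illustration only.

```python
import math


def hz_to_bark(frequency_hz):
    """Approximate critical-band rate in Bark (Zwicker & Terhardt, 1980).

    Assumption: this particular approximation is chosen for illustration;
    the listed abstracts do not say which Bark formula was used.
    """
    return (13.0 * math.atan(0.00076 * frequency_hz)
            + 3.5 * math.atan((frequency_hz / 7500.0) ** 2))


def bark_separation(f1_hz, f2_hz):
    """Absolute separation of two tone frequencies on the Bark scale."""
    return abs(hz_to_bark(f1_hz) - hz_to_bark(f2_hz))


if __name__ == "__main__":
    # Print the separation of a few tone pairs anchored at 1000 Hz.
    for f2 in (1020.0, 1080.0, 1200.0):
        sep = bark_separation(1000.0, f2)
        print(f"1000 Hz vs {f2:.0f} Hz: {sep:.2f} Bark")
```

Running the script shows how many hertz correspond to a given Bark separation around 1 kHz; the mapping is nonlinear, so the same Bark separation spans more hertz at higher frequencies.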
Absolute pitch (AP)—an ability to identify an isolated pitch without musical context—is commonly believed to be a valuable ability for musicians. However, relative pitch (RP)—an ability to perceive pitch relations—is more important in most musical contexts. In this study, music students in East Asian and Western countries (Japan, China, Poland, Ger...
Many temporal models for pitch perception have adopted a configuration of "delay-lines and coincidence detectors" after the cochlear filtering. Autocorrelation functions are a common way of implementing it. However, a series of experiments by the authors' group have revealed that the perceived pitch would shift upwards by the effect of aging. Bec...
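To make the "delay-lines and coincidence detectors" idea concrete, the following is a minimal textbook-style sketch of autocorrelation pitch estimation. It is not the authors' model: there is no cochlear filterbank stage, and the function names, the search range (`f_min`, `f_max`), and the test signal are all assumptions introduced for illustration.

```python
import math


def harmonic_complex(f0_hz, n_harmonics, duration_s, sample_rate):
    """Synthesize an equal-amplitude harmonic complex tone (illustrative helper)."""
    n_samples = round(duration_s * sample_rate)
    return [
        sum(math.sin(2.0 * math.pi * k * f0_hz * n / sample_rate)
            for k in range(1, n_harmonics + 1))
        for n in range(n_samples)
    ]


def autocorrelation_pitch(signal, sample_rate, f_min=50.0, f_max=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    lag_min = int(sample_rate / f_max)  # shortest candidate period, in samples
    lag_max = int(sample_rate / f_min)  # longest candidate period, in samples
    best_lag, best_value = lag_min, float("-inf")
    for lag in range(lag_min, lag_max + 1):
        # Each lag acts like one delay line; the sum of products acts like
        # the coincidence detector attached to that delay.
        value = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


if __name__ == "__main__":
    tone = harmonic_complex(200.0, 5, 0.1, 8000)
    print(autocorrelation_pitch(tone, 8000))
```

The unnormalized sum uses fewer terms at longer lags, which slightly favors the shortest period among equally good candidates; a production estimator would normally add normalization and peak interpolation.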
We previously found that the spatial memory of the piano keyboard was not accurate enough to play without external spatial cues, even in trained pianists. Therefore, pianists need real-time acquisition of sensory information on the target key position or some reference points. Furthermore, we observed that amateur pianists made errors wit...
Absolute pitch (AP) possessors can name the pitch class of a note simply by hearing a periodic tone. It has been reported that AP judgment can shift by one or two semitones as AP possessors age. We confirmed this age-related AP shift by a series of psychophysical experiments with piano sounds as well as synthesized complex tones. AP p...
We previously found that the spatial memory of piano keyboard was not accurate enough to play without external spatial cues even in trained pianists. Therefore, the real-time acquisition of spatial information should be essential for piano performance. The aim of the present study was to test how and when the online acquisition of the visual and au...
Previous studies have indicated that extended exposure to a high level of sound might increase the risk of hearing loss among professional symphony orchestra musicians. One of the major problems associated with musicians' hearing loss is difficulty in estimating its risk simply on the basis of the physical amount of exposure, i.e. the exposure leve...
SAWS are acoustic stimuli in which an impulse response of the vocal tract and its scaled version are alternately placed in the time domain at a constant periodic rate. When the scale factor is close to unity (1.0), the perceived pitch corresponded to the original periodicity. As the difference in the scaling became large, the pitch tended to be match...
A study was conducted to demonstrate independence of mental representations for tonotopic and periodic scales in perceptual judgment of vowel-like sounds. The researchers tested the hypothesis that mean formant frequency (MFF) of a vowel and F0 information was represented on a two-dimensional representational plane in the auditory mental representa...
Musicians are sensitive to the synchrony of multiple tone onsets. However, even when several sounds have a simultaneous onset, their temporal relationship might not be preserved at the cochlear level because of "cochlear delays" in perception. The purpose of this study was to investigate whether cochlear delay significantly affects synchrony judgm...
Accuracy of spatial memory of a piano key was investigated using 14 players with long term (> 15 years) training (LT-group), and 13 players with short-term (< 13 years) training (ST-group). The experimental task was to move his/her left or right index finger on the target key (C2, C3, E3, A4, C5 or C6 for each hand) position after touching the refe...
The purpose of this study is to investigate the sufficient "similarity" between consecutive auditory events for the auditory system to define the fundamental period for pitch perception. It is possible to contaminate the periodicity of harmonic complex tones by scaling the impulse response in the time domain at every other cycle. Scale-alternating...
It has been modeled that the auditory system encodes acoustic signals into two independent kinds of information. One is tonotopic information reflecting the frequency response characteristics of the basilar membrane, and the other is periodicity information reflecting the temporal patterns of phase-locked auditory nerve firing. Based on the previous study ab...
The cochlear delay shifts the arrival of lower-frequency components of an auditory signal slightly but systematically behind that of higher-frequency components. Therefore, even if all of the components of a complex tone physically begin simultaneously, their temporal relation is not preserved at the cochlear level. In our previous study, the accur...
Timbre provided by the resonant characteristics of the vibrating body can be represented as spectral envelope patterns and can contribute as one of the important cues for sound source identification. However, its concept is not strictly established, while those of loudness and pitch are well known. Recently, the fact that the spectral pattern can...
Previous studies have suggested that professional musicians comprehend features of music-derived sound even if the sound sequence lacks the traditional temporal structure of music. We tested this hypothesis through behavioral and functional brain imaging experiments. Musicians were better than nonmusicians at identifying scrambled pieces of piano m...
Synchrony judgment is one of the most important abilities for musicians. Only a few milliseconds of onset asynchrony result in a significant difference in musical expression. Using behavioural responses and Auditory Brainstem Responses (ABR), this study investigates whether synchrony judgment accuracy improves with training and, if so, whether phys...
This paper reconfirms that talker identity can be transmitted across languages. Talker discrimination was examined in the ABX paradigm, where the stimuli A and B were utterances by different talkers in the same language and the stimulus X was an utterance by either of A or B in the different language. The average hit rate of this discrimination tas...
Vowels are produced as a sequence of vocal tract impulse responses which are periodically excited by glottal pulses. Each impulse response reflects the shape and the size of the vocal tract. The size, i.e., the resonance scale, is kept almost constant in normal speech sounds. It has been found that we can sensitively “hear” such size variations...
This paper describes the NICT speech synthesis system submitted to the Blizzard Challenge 2009: a hidden Markov model (HMM)-based synthesizer constructed by training trajectory HMMs considering global variance. To improve naturalness of the synthesized speech a mixed excitation approach based on closed-loop residual modeling through the training o...
Speech sounds convey information about the size of the speaker. Several studies have demonstrated that human vowel recognition is possible even for an unnatural size range, and have revealed that size factor normalization can be achieved automatically in the auditory system. In this study, we further investigated the characteristics of the size nor...
This paper details a speech synthesis system developed at NICT for the Blizzard Challenge 2010. The system depends on an HMM-based speech synthesis technique that possesses two distinctive features: HMM training under global-variance constraint on the parameter trajectory and trainable mixed excitation for source-filter vocoding. For this year's...
We already examined language independent control characteristics of the communicative prosody generation using multi-dimensional impressions of input lexicons. In this paper, we synthesized English single phrase utterances using prosodic characteristics of Japanese speech aiming at language independent applications. The reading-style speech prosodi...
Several experimental studies have shown that the human auditory system has a mechanism for extracting speaker-size information, using sufficiently long sounds. This paper investigated the influence of vowel duration on the processing for size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the size (heig...
In this paper, we introduce Japanese segmental duration characteristics and computational modeling that we have been studying for around three decades in speech synthesis. A series of experimental results are also shown on loudness dependence in the duration perception. These computational duration modeling and perceptual studies on duration error...
A multi-dimensional perceptual space for communicative speech prosodies was derived using a psychometric method from multi-dimensional expressions of impressions to characterize paralinguistic information conveyed by prosody in communication. Single word utterances of “n” were employed to allow freedom from lexical effects and to cover communicativ...
An empirical study is carried out to achieve a computer-based methodology for evaluating a speaker's accent in a second language as an alternative to a native-speaker tutor. Its primary target is the disfluency in the temporal aspects of an English learner's speech. Conventional approaches commonly use measures based solely on the acoustic features...
Automatic evaluation of English timing control proficiency is carried out by comparing segmental duration differences between learners and reference native speakers. To obtain an objective measure matched to human subjective evaluation, we introduced a measure reflecting perceptual characteristics. The proposed measure evaluates duration difference...
Changes in vocal tract size vary the formant frequencies, even when the shape of vocal tracts is the same and the spoken vowels are categorized to be the same. Several studies have demonstrated that the normalization of vocal tract size can be achieved in a bottom-up manner. To investigate how fast this process works, the identification of vowel se...
The occurrence of a sound event in a sound stream can be perceived without difficulty even when it is difficult to define the acoustic boundary. This study was designed to investigate which acoustical feature functions as an effective cue to “mark” the occurrence of a new event. When two steady sounds are connected by a short frequency glide, at...
This study investigates whether the delay caused in the course of wave propagation along the basilar membrane (BM) of the cochlea (i.e., the cochlear delay) affects the perceptual judgment of the synchronization of two sounds. An experiment was conducted to examine the detection of asynchrony using two types of chirps (compensatory and enhanced c...
To recognize a transposed melody, two properties can be hypothesized to function as a significant invariant feature: a melodic contour, an up-down movement; and melodic intervals, the distances between consecutive pitches. Both properties are realized by changing the frequency of notes. Frequency changes are coded in two different ways in the early...
This paper introduces a large-scale phonetically-balanced English speech corpus developed at ATR for corpus-based speech synthesis. This corpus includes a 16-hour American English speech data spoken by a professional male narrator in "reading style." The contents of prompt sentences concern basically news articles, travel conversations, and n...
To investigate the features of proficiency in musical performance, we focused on the role of auditory feedback in piano performance and measured its effects in both highly and less-trained pianist groups. In the first experiments, two groups played well-learned pieces under an auditory-feedback condition (performing with sound) as well as no-audito...
Keywords: Size extraction, Resonance characteristics, Size modulation detection, Timbre perception. Experiments were performed with listeners to detect the STSM in a vowel sequence. The measured characteristics appeared to be high-pass. The observed high-pass tendency suggested that a more efficient cue was available based on the differences in fine temporal...
A study was conducted to investigate whether the delay caused in the course of wave propagation along the basilar membrane of the cochlea has any significant effect on perceptual judgment of synchronization of two complex tones. Onset synchrony of sinusoidal components is widely assumed to be an important cue for their perceptual unificat...
To simulate the perceptual extraction of temporal structures of speech, the authors have been proposing an event-plausibility model that detects the occurrence of subevents in continuous speech signals based on auditory processing. One of its core components is the filterbank module that simulates the mechanical frequency analysis of the basilar m...
When a receiver of acoustic signals is surrounded by several vibrating bodies, it becomes important to “sort out” sound energies into subparts appropriately to represent the original sources. This issue is called a problem of source segregation, and has been investigated in several ways as a core of the auditory scene analysis. Pitch, or a perceptu...
We have been studying temporal characteristics of Japanese for decades. Not only acoustic measurements but also perceptual studies on temporal modification have revealed the control principles lying behind manifestation of segmental duration characteristics. This talk tries to introduce familiar generation principles such as “mora-timing” and phr...
When we hear continuous sound signals, we extract the timing between each segment although each segmental boundary is acoustically unclear. In order to investigate on which points we perceive the temporal structures in sound signals, perceptual experiments were conducted to measure the sensitivity in detecting the temporal deviation by using sound...
Some studies in the field of developmental psychology have suggested that a complex task executed by an expert includes the automated and sequential execution of some subordinate skills. In this study, we regard the development of musical performance as the length of motor control of a musical instrument carried out with a feedforward process. We t...
To investigate the perceptual cues that detect arrivals of new auditory events, the detectability of deviation applied to an isochronous temporal structure was measured. The stimuli were sequences of complex tones whose formant frequencies alternate between two steady values. Five steady parts were connected by four formant glides. The isochronous...
It is known that the tonotopic and periodic aspects of sounds are separated in the auditory sense. To investigate the manner of representing these two aspects, a psychological experiment was designed. The stimuli are sequences of vowels, each of which changes its fundamental frequency and “size,” i.e., the centroid of its formants, using a STRAIG...
We can identify vowels pronounced by speakers with a vocal tract of any size. At the same time, we can discriminate the different sizes of vocal tracts. To simulate these abilities, a computational model has been proposed in which size information is extracted and separated from the shape information. It is important to investigate temporal characteristics of t...
Onset synchrony of components is widely assumed to be an important cue for perceptual unification as a single tone. However, even if the components begin simultaneously, their temporal relation might not be preserved at the cochlear level. The cochlear delay shifts the arrival of a lower component slightly but systematically behind a higher compone...
In this paper, we evaluate various cost functions for selecting a segment sequence in terms of the correspondence between the cost and perceptual scores to the naturalness of synthetic speech. The results demonstrate that the conventional average cost, which shows the degradation of naturalness over the entire synthetic utterance, has better corres...
Aiming at prosody control for conversational speech synthesis, communicative prosodies were generated based on the prosodic characteristics derived from one word utterance "n". The grouping of F0 patterns using VQ revealed four F0 dynamic patterns (rise, gradual fall, fall, and rise&fall) for large amounts of one-word utterance "n" in...
We introduce a new computational model, the "event-plausibility" model, as an extension of the loudness-jump model, which has been proposed with the aim of extracting temporal structures in speech based on simulated auditory processing. The main characteristic of the new model is to overcome a drawback of the loudness jump model, i.e., insensitivit...
In order to study the importance of auditory feedback in musical performance, the effectiveness of piano practice for performance was compared between two cases: one in which auditory feedback was provided and one in which it was not. As participants, the experiment investigated both highly trained performers and less-trained performers. Half o...
The acceptability of changes in segment duration at different speaking rates is studied to find useful perceptual characteristics for designing an objective naturalness measure in speech synthesis. Based on a series of previous studies on the intra-phrase positional dependency of perceptual acceptability, we investigate three factors: (1) speaking...
To provide an appropriate model for perception of temporal structures of speech, we applied a comprehensive computational model of the human auditory peripherals to detect changes in speech signals that potentially indicate arrivals of new events. In each tonotopic sub-band, an increase in the activation level was taken into account for the plausib...
For use as a naturalness criterion for duration rules in speech synthesis, human acceptability of change in segment duration is investigated with regard to the temporal position within a phrase. Three perceptual experiments are carried out to introduce variations in the attribute and context of a phrase in sentence speech: (1) the length of a phras...
We investigated the temporal dynamics of auditory normalization and size perception by measuring vowel recognition performance using sequences of vowels in which vocal tract length was modulated during the sequence. The modulation of speaker size was achieved by scaling the frequency axis of the transfer function of vocal tract. The temporal modula...
Aiming at prosody control for speech synthesis expressing speaking attitudes, F0 shapes were characterized by their perceptual impressions. To directly correlate F0 shapes with perceptual impressions, single word utterances "n" extracted from daily conversations were employed. The analysis showed that speaking attitudes were manifested in the globa...
When a portion of a sound is replaced by a noise burst, its duration is perceived to be shorter than that of its intact counterpart. To test the robustness of this shrinking effect by noise replacement and to validate the hypothesis that duration can be estimated as a function of accumulated perceptual evidence for the target sound, the shrinking e...
The duration of sounds generally tends to be perceived as shorter when a portion is replaced by a noise burst. However, a reversal/prolongation tendency can occur if a compelling isochronous context is functioning. To test the robustness of the durational shrinkage as well as to investigate what aspect is the core feature providing the isochronism,...
This paper describes a new concatenative TTS system under development at ATR. The system, named XIMERA, is based on corpus-based technologies, as was the case for the preceding TTS systems from ATR, namely ν-talk and CHATR. The prominent features of XIMERA are (1) large corpora (a 110-hours corpus of a Japanese male, a 60-hours corpus of a Japanese f...
Correction and detection of the time-dependent channel variability in Mandarin speech corpus were carried out. The effect of the corrective filter on the performance of channel equalization was examined, using four different ways to design the corrective filter. The channel equalization technique, which was based on the long-term power spectral den...
In concatenative speech synthesis, various factors affect the naturalness of synthetic speech. A cost for segment selection is calculated by integrating some sub-costs capturing the degradation of naturalness caused by such factors. In this paper, we optimize each sub-cost function for converting a linguistic feature or an acoustic parameter into a...
This paper describes a new concatenative TTS system under development at ATR. The system, named XIMERA, is based on corpus-based technologies, as was the case for the preceding TTS systems from ATR, namely ν-talk and CHATR. The prominent features of XIMERA are (1) large corpora (a 110-hours corpus of a Japanese male, a 60-hours corpus of a Japanese...
This paper describes optimizing a cost function for segment selection in concatenative Text-to-Speech based on perceptual characteristics. We use the norm of a local cost for each segment as an integrated cost function for a segment sequence to consider both the degradation of naturalness over the entire synthetic speech and the local degradati...
To investigate mechanisms for perceiving the duration of an auditory event, an effect of perceptual grouping upon perceived duration was studied psychophysically. In the first experiment, the perceived duration of a spoken word was measured under three conditions of acoustic continuity (i.e., (a) intact, (b) noise-replaced, and (c) gap-replaced) as...
To provide a perceptual framework for the objective evaluation of durational rules in speech synthesis, two experiments were conducted to investigate the differences between vowel (V) onsets and V-offsets in their functions of marking the perceived temporal structure of speech. The first experiment measured the detectability of temporal modificatio...
In this paper, we investigate the effect of using a novel cost, RMS (root mean square) cost, for segment selection for concatenative text-to-speech synthesis. The RMS cost is affected not only by the total degradation of naturalness but also by the local degradation of naturalness. From the results of experiments comparing this approach with segmen...
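The contrast between an average cost and an RMS cost can be illustrated with a small sketch. The local cost values below are invented for illustration and the function names are hypothetical; this is not the cost function used in the cited system, only a demonstration of why a root-mean-square integration is more sensitive to a single badly degraded segment than a plain average.

```python
import math


def average_cost(local_costs):
    """Mean of the local costs: reflects total degradation only."""
    return sum(local_costs) / len(local_costs)


def rms_cost(local_costs):
    """Root mean square of the local costs: large local costs weigh more."""
    return math.sqrt(sum(c * c for c in local_costs) / len(local_costs))


# Two hypothetical segment sequences with the same average cost.
even_sequence = [0.5, 0.5, 0.5, 0.5]    # degradation spread evenly
spiky_sequence = [0.1, 0.1, 0.1, 1.7]   # one severely degraded segment

if __name__ == "__main__":
    for name, costs in (("even", even_sequence), ("spiky", spiky_sequence)):
        print(name, average_cost(costs), rms_cost(costs))
```

Both sequences have the same average cost, but the RMS cost is higher for the sequence with one poor segment, so a selection procedure that minimizes RMS cost would prefer the evenly degraded sequence.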
The paper studies voice quality variation in a large-scale single speaker corpus used in recent corpus-based speech synthesis. First, a perceptual experiment is conducted to obtain scores for voice quality difference in a stimulus made by concatenating phrases collected from separate recording sessions. Second, acoustic measures are examined on the...
In segment selection for concatenative text-to-speech (TTS), it is important to utilize a cost that corresponds to the perceptual characteristics. We clarify correspondence to the perceptual scores of the cost, and then various functions to integrate the costs are evaluated. The perceptual scores are determined from results of perceptual experiment...
We propose a method for integrating speech recognition and generation within a unified framework. The method consists of STRAIGHT, warped-frequency DCT, and an HMM engine. The warped-frequency DCT is used to derive a kind of mel-cepstral coefficient from the smoothed spectrum of STRAIGHT, which is known as a high-quality vocoder. This analysis/syn...
Human subjective acceptability of durational distortions in speech segments or portions is significantly affected by various segmental and sequential properties, e.g., the vowel color and temporal position in a word [Kato et al., J. Acoust. Soc. Am. 101, 2311-2322 (1997); 104, 540-549 (1998)]. The current study focused on the effects of phoneme cla...
This paper proposes a novel unit selection algorithm for Japanese text-to-speech (TTS) systems. Since Japanese syllables consist of CV (C: consonant, V: vowel) or V, except when a vowel is devoiced, CV units are basic to concatenative TTS systems for Japanese. However, speech synthesized with CV units sometimes have discontinuities due to V-V conca...
To contribute to the naturalness criteria of speech synthesis, acceptability of changes in segment duration has been investigated. Previous studies showed context dependency of the acceptability evaluation such as intraphrase positional effect, where listeners were more sensitive to the phrase-initial segment duration than the phrase-final one. Suc...
It is important to design an appropriate cost function to improve the quality of speech produced by a corpus-based text-to-speech (TTS) synthesis. Although the final product of the TTS system is evaluated perceptually, the definition of cost functions has to be based on the physical parameters of speech signal. And a cost function will become inapp...