ABSTRACT: Aiming to build an objective evaluation of second language (L2) learners' timing control in Japanese, this study proposes an objective measure that simulates the subjective goodness-of-production scores that native Japanese evaluators give to L2 speakers' productions. The focus is on phonemic length contrasts, e.g., /kako/ "the past" versus /kakko/ "parentheses" and /kaze/ "wind" versus /kaze:/ "taxation", which are difficult for L2 learners, particularly when speaking rate varies. The proposed objective measure uses a vowel onset time marker as a key perceptual and psychoacoustic marker to normalize speaking rate variations. The proposed new measure reflects tempo normalization between L2 learners by dividing the proficiency of mora-timing control in production. Results show that, for both vowel and consonant length contrasts, the proposed measure correlates significantly more highly with the subjective evaluation scores, with a stable coefficient, than a simple duration difference measure does. These results suggest that applying psychoacoustic parameters would be effective for building an objective evaluation of L2 learners. [Work supported by JSPS.]
The Journal of the Acoustical Society of America 05/2013; 133(5):3611. · 1.65 Impact Factor
ABSTRACT: Aiming at communicative speech synthesis, prosody control using impressions has been proposed, exploiting the correlation between the impressions of input lexicons and prosody. In this paper, as the first step toward computing communicative prosody, we attempt to predict the F0 generation model parameters by estimating the impressions of an input sentence from its constituent lexicons. To obtain an impression vector consisting of three-dimensional factors (positive-negative, confident-doubtful, and allowable-unacceptable) for a given input utterance, we propose a computational scheme that calculates impression vectors from the impression scores of constituent words. Using the obtained sentence impression vectors, F0 control parameters are predicted by three-layered feed-forward neural networks. To evaluate the effectiveness of the proposed computational framework, we experimentally confirmed that F0 parameters of communicative speech could be generated from the impressions of input lexicons.
Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE), 2013 International Conference; 01/2013
ABSTRACT: To establish an effective training paradigm for a computer-assisted language learning system that enables non-native learners to exploit native-like perceptual cues for identifying length contrasts, e.g., consonant length, in the Japanese language (hereafter, Japanese length contrast), two experiments were conducted to investigate the identification accuracies of native Korean listeners before and after intensive perceptual training. The first experiment assessed the perceptual characteristics that native Korean listeners show when identifying the length contrast in the face of variation in speaking rate. The results suggested that native Korean listeners identify length contrasts by relying on a fixed-length criterion instead of adapting to changes in speaking rate. Also, Korean listeners showed a noticeable bias toward geminate responses for all speaking rates, with a stronger bias for slower speaking rates. The second experiment was based on these perceptual characteristics of Korean listeners. It investigated the extent to which intensive training helps improve listeners’ ability to identify Japanese length contrast and examined whether greater stimulus variability during training leads to more robust training effects. The overall results showed that the perceptual training improved Korean listeners’ ability to perceive consonant length contrast. Moreover, misidentification of geminates decreased at post-test. However, the effect of training did not differ greatly between conditions of high stimulus variability during training (training with three speaking rates) and low stimulus variability (training with a single speaking rate). Training also did not generalize to perception of the untrained contrast type (vowel length contrast).
These results suggest that while perceptual training helps Korean learners learn a new, difficult-to-learn L2 phonemic length contrast, it is still difficult for learners to acquire native-like perceptual criteria.
Journal of East Asian Linguistics 01/2013; · 0.28 Impact Factor
ABSTRACT: Japanese phonemic length contrasts are difficult to perceive for native-Korean listeners learning Japanese as a second language (L2). Aiming at an effective L2 training method for L2 learners, two experiments were conducted. Experiment 1 evaluated which acoustic cues Korean listeners rely on when categorizing the phonemic length contrast. Experiment 2 examined how differences in speaking rate variation (slow, normal, and fast) and contrast type (vowel contrast vs. consonant contrast) would affect the effectiveness of perceptual training, using a minimal-pair identification task with words embedded in carrier sentences. There were four training conditions comprising combinations of two contrast types (vowel or consonant length) and two speaking-rate variations (single rate or three different rates). Results show that L2 listeners exploit absolute segmental duration to identify phonemic length contrast rather than durational criteria that vary according to speaking rate. Moreover, the trained groups significantly improved in their overall accuracy after training. However, none of the training groups showed significant generalization to untrained contrast types. These results suggest that the effect of training is limited and does not generalize to untrained contrast types even when speaking rate variation is incorporated during training. [Work supported in part by the Grant-in-Aid for Scientific Research (B), JSPS.]
The Journal of the Acoustical Society of America 10/2011; 130(4):2575. · 1.65 Impact Factor
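The difference between an absolute-duration criterion and a criterion that varies with speaking rate, as described in the abstract above, can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration only: the function names, the 150 ms fixed threshold, the 0.8 ratio threshold, and the use of mean mora duration as the speaking-rate estimate do not come from the study.

```python
def classify_fixed(duration_ms: float, threshold_ms: float = 150.0) -> str:
    """Label a segment 'long' or 'short' using an absolute duration threshold.

    The 150 ms default is a hypothetical value, not one reported in the study.
    """
    return "long" if duration_ms >= threshold_ms else "short"


def classify_rate_normalized(duration_ms: float,
                             mean_mora_ms: float,
                             ratio_threshold: float = 0.8) -> str:
    """Label a segment relative to the local speaking rate.

    mean_mora_ms stands in for speaking rate (assumed estimate); the 0.8
    ratio threshold is likewise hypothetical.
    """
    if mean_mora_ms <= 0:
        raise ValueError("mean_mora_ms must be positive")
    return "long" if duration_ms / mean_mora_ms >= ratio_threshold else "short"


if __name__ == "__main__":
    # The same 160 ms segment heard at a slow rate and at a fast rate.
    for rate_name, mean_mora_ms in [("slow", 250.0), ("fast", 120.0)]:
        print(rate_name,
              classify_fixed(160.0),
              classify_rate_normalized(160.0, mean_mora_ms))
```

Under these assumed values, the fixed criterion labels the 160 ms segment "long" at both rates, while the rate-normalized criterion labels it "short" in slow speech. That is the kind of over-reporting of long segments at slow rates that a fixed criterion would be expected to produce.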
ABSTRACT: An empirical study is carried out to achieve a computer-based methodology for evaluating a speaker's accent in a second language as an alternative to a native-speaker tutor. Its primary target is the disfluency in the temporal aspects of an English learner's speech. Conventional approaches commonly use measures based solely on the acoustic features of given speech, such as segmental duration differences between learners and reference native speakers. However, our auditory system, unlike a microphone, is not transparent: it does not send incoming acoustic signals to the brain without any processing. Therefore, this study uses auditory perceptual characteristics as weighting factors on the conventional measure. These are the loudness of the corresponding speech segment and the magnitude of the jump in loudness between this target segment and each of the two adjacent speech segments. These factors were originally found through general psychoacoustical procedures [H. Kato et al., JASA, 101, 2311-2322 (1997); 104, 540-549 (1998); 111, 387-400 (2002)], so they are applicable to any speech signal regardless of language. Experiments show that these weightings dramatically improve evaluation performance. The contribution of psychoacoustics to the evaluation methodology of second language speech is also discussed. [Work supported by RISE project, Waseda Univ. and KAKENHI 20300069, JSPS.]
The Journal of the Acoustical Society of America 05/2009; 125(4):2752. · 1.65 Impact Factor
ABSTRACT: A multi-dimensional perceptual space for communicative speech prosodies was derived using a psychometric method from multi-dimensional expressions of impressions to characterize paralinguistic information conveyed by prosody in communication. Single-word utterances of “n” were employed to allow freedom from lexical effects and to cover communicative prosodic variations as much as possible. The analysis of daily conversations showed that conversational speech impressions were manifested in the global F0 control of “n” as differences of average height (high–low) and dynamic patterns (rise, fall, gradual fall, and rise&fall). Using controlled single utterances of “n”, multidimensional scaling analysis was applied to a mutual distance matrix obtained from 26-dimensional vectors expressing perceptual impressions. The result showed the three-dimensional structure of a perceptual impression space, and each dimension corresponded to different F0 control characteristics. The positive–negative impression can be controlled by average F0 height, while confident–doubtful or allowable–unacceptable impressions can be controlled by F0 dynamic patterns. Unlike the conventional categorical classification of prosodic patterns frequently observed in studies of emotional prosody, this control characterization enables us to flexibly and quantitatively describe prosodic impressions. These experimental results allow the possibility of input specifications for communicative prosody generation using impression vectors and control through average F0 height and F0 dynamic patterns. Instead of the generation of speech with categorical prototypical prosody, more adequate communicative speech synthesis can be approached through input specification and its correspondence with control characteristics.
ABSTRACT: In this paper, we introduce Japanese segmental duration characteristics and the computational modeling that we have been studying for around three decades in speech synthesis. A series of experimental results on loudness dependence in duration perception is also shown. This computational duration modeling and these perceptual studies on the sensitivity of duration error perception to loudness give some insights for computational human modeling of spoken language capability. As a first trial to figure out how these findings could be efficiently employed in other fields such as language learning, we introduce our current efforts on the objective evaluation of second language speaking skill and the research consortium of AESOP (Asian English Speech cOrpus Project), where researchers in Asian countries have started to work together.
ABSTRACT: Automatic evaluation of English timing control proficiency is carried out by comparing segmental duration differences between learners and reference native speakers. To obtain an objective measure matched to human subjective evaluation, we introduced a measure reflecting perceptual characteristics. The proposed measure evaluates duration differences weighted by the loudness of the corresponding speech segment and the differences or jumps in loudness from the two adjacent speech segments. Experiments showed that estimated scores using the new perception-based measure provided a correlation coefficient of 0.72 with subjective evaluation scores given by native English speakers on the basis of naturalness in timing control. This correlation turned out to be significantly higher than that of 0.54 obtained when using a simple duration difference measure.
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2009, 19-24 April 2009, Taipei, Taiwan; 01/2009 · 4.63 Impact Factor
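The idea of weighting duration differences by segment loudness and by loudness jumps to neighbouring segments, as described in the abstract above, might be sketched in Python as follows. This is not the authors' formula, which the abstract does not give. The linear weight form, the `alpha` and `beta` parameters, the use of absolute duration differences, and the function name are all assumptions made for illustration.

```python
from typing import Sequence


def weighted_duration_score(learner_ms: Sequence[float],
                            native_ms: Sequence[float],
                            loudness: Sequence[float],
                            alpha: float = 1.0,
                            beta: float = 1.0) -> float:
    """Weighted mean of absolute segment duration differences.

    Assumed (hypothetical) weight for segment i:
        w_i = alpha * loudness_i + beta * (jump to left + jump to right)
    where a jump is the absolute loudness difference to the adjacent segment.
    Lower scores mean the learner's timing is closer to the reference.
    """
    if not (len(learner_ms) == len(native_ms) == len(loudness)):
        raise ValueError("all sequences must have the same length")

    n = len(loudness)
    weighted_sum = 0.0
    weight_total = 0.0
    for i in range(n):
        # Edge segments have only one neighbour; the missing jump counts as 0.
        left_jump = abs(loudness[i] - loudness[i - 1]) if i > 0 else 0.0
        right_jump = abs(loudness[i + 1] - loudness[i]) if i < n - 1 else 0.0
        weight = alpha * loudness[i] + beta * (left_jump + right_jump)
        weighted_sum += weight * abs(learner_ms[i] - native_ms[i])
        weight_total += weight

    return weighted_sum / weight_total if weight_total > 0 else 0.0


if __name__ == "__main__":
    learner = [100.0, 120.0, 80.0]   # hypothetical segment durations (ms)
    native = [90.0, 120.0, 100.0]    # hypothetical reference durations (ms)
    loud = [1.0, 3.0, 1.0]           # hypothetical per-segment loudness
    print(weighted_duration_score(learner, native, loud))
```

With `beta=0.0` and uniform loudness, the score reduces to the plain mean absolute duration difference, which corresponds to the simple duration difference baseline the abstract compares against.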
ABSTRACT: We have previously examined language-independent control characteristics of communicative prosody generation using multi-dimensional impressions of input lexicons. In this paper, aiming at language-independent applications, we synthesized English single-phrase utterances using prosodic characteristics of Japanese speech. The reading-style speech prosodies of English phrases were modified by prosodic characteristics derived from the Japanese one-word utterance "n". Modifications were carried out based on lexical impressions corresponding to six impressions: confident, doubtful, allowable, unacceptable, positive, and negative. The perceptual evaluation experiment showed the naturalness of speech with communicative prosody modified by the impression of input lexicons. These experimental results support the usefulness of communicative prosody control based on the impression of input lexicons and suggest possibilities for language-independent applications.
ABSTRACT: Aiming at prosody control for conversational speech synthesis, communicative prosodies were generated based on the prosodic characteristics derived from the one-word utterance "n". Grouping F0 patterns using VQ revealed four F0 dynamic patterns (rise, gradual fall, fall, and rise&fall) in a large number of one-word utterances "n" in daily conversations. Through analysis using an F0 generation model, different control characteristics were found for these patterns. A communicative prosody control scheme is proposed for short utterances, reflecting these control characteristics for three-dimensional representative perceptual impressions (confident-doubtful, allowable-unacceptable, and positive-negative) previously obtained by MDS analysis. Naturalness evaluation tests for synthesized conversational speech showed that the proposed prosody control is superior in naturalness. These results indicate the possibility of communicative prosody generation for conversational speech synthesis through perceptual impression expressions using a corpus-based approach.