Keikichi Hirose

  • Professor Emeritus at The University of Tokyo

About

484
Publications
57,838
Reads
4,544
Citations
Current institution
The University of Tokyo
Current position
  • Professor Emeritus

Publications (484)
Article
This paper describes a novel approach to flexible control of speaker characteristics using tensor representation of multiple Gaussian mixture models (GMM). In voice conversion studies, realization of conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoi...
Article
This paper develops an online and freely available framework to aid teaching and learning the prosodic control of Tokyo Japanese: how to generate its adequate word accent and phrase intonation. This framework is called OJAD (Online Japanese Accent Dictionary) [1] and it provides three features. 1) Visual, auditory, systematic, and comprehensive ill...
Article
Full-text available
When synthesizing speech from Japanese text, correct assignment of accent nuclei for input text with arbitrary contents is indispensable in obtaining natural-sounding synthetic speech. A phenomenon called accent sandhi occurs in utterances of Japanese; when a word is uttered in a sentence, its accent nucleus may change depending on the contexts o...
Conference Paper
Full-text available
Statistical parametric speech synthesis technologies, such as HMM-based and DNN-based ones, have gained special attention from researchers because of their ability to generate speech in various voice qualities and styles. In these methods, all acoustic parameters (except durational ones) are handled in a frame-by-frame manner, which is not appropriate f...
Conference Paper
Full-text available
Article
This paper provides an analysis of several practical issues related to the theory and implementation of Grapheme-to-Phoneme (G2P) conversion systems utilizing the Weighted Finite-State Transducer paradigm. The paper addresses issues related to system accuracy, training time and practical implementation. The focus is on joint n-gram models which hav...
Conference Paper
Full-text available
This work investigated crosslinguistic perception of Mandarin utterances conveying six classes of attitudes, i.e., dominant/submissive, friendly/hostile, polite/rude, serious/joking, praising/blaming, and sincere/insincere. Five groups of subjects were tested: native Mandarin speakers, Japanese L2 learners of Mandarin, French L2 learners of Mandar...
Article
Automatic recognition of vowel length in Japanese has several applications in speech processing such as for computer assisted language learning (CALL) systems. Standard automatic speech recognition (ASR) systems make use of hidden Markov models (HMMs) to carry out the recognition. However, HMMs are not particularly well-suited for this problem sinc...
Article
Expressive speech synthesis has received increased attention in recent times. Stress (or pitch accent) is the perceptual prominence within words or utterances, which contributes to the expressivity of speech. This paper summarizes our contribution to Mandarin expressive speech synthesis. A novel hierarchical stress modeling and generation method fo...
Article
Full-text available
A data adaptive approach to spectral analysis of audio signals is implemented in this paper. The audio signals are non-stationary as well as non-linear in nature and the traditional Fourier based spectral representation is not effective. The Hilbert spectral analysis implemented by noise assisted bivariate empirical mode decomposition (NA-BEMD) is...
Chapter
The generation process model of fundamental frequency contours is ideal to represent the global features of prosody. It is a command response model, where the commands have clear relations with linguistic and para/nonlinguistic information conveyed by the utterance. By handling fundamental frequency contours in the framework of the generation proce...
Article
Full-text available
This paper introduces a robust voiced/non-voiced (VnV) speech classification method using bivariate empirical mode decomposition (bEMD). Fractional Gaussian noise (fGn) is employed as the reference signal to derive a data adaptive threshold for VnV discrimination. The analyzing speech signal and fGn are combined to generate a complex signal which i...
Article
Generation process model of fundamental frequency (F0) contours is known to represent global movements of F0's keeping a clear relation with linguistic information of utterances. While HMM-based speech synthesis can generate a good quality of speech, problems, which arise from frame-by-frame processing, are pointed out. These problems are expected...
Article
This paper describes a novel approach to construct a mapping function between a given speaker pair using probability density functions (PDF) of matrix variate. In voice conversion studies, two important functions should be realized: 1) precise modeling of both the source and target feature spaces, and 2) construction of a proper transform function...
Book
The volume addresses issues concerning prosody generation in speech synthesis, including prosody modeling, how we can convey para- and non-linguistic information in speech synthesis, and prosody control in speech synthesis (including prosody conversions). A high level of quality has already been achieved in speech synthesis by using selection-based...
Conference Paper
English is the only language available for global communication and is known to have a large diversity of pronunciations due to the influence of speakers' mother tongue, called accents. Our previous studies [1], [2] made an attempt to do speaker-basis clustering of those pronunciations, where every speaker was assumed to speak with his own accent....
Conference Paper
Full-text available
Rhythm plays an important role in the naturalness of speech. This study compared rhythmic patterns of Mandarin speech between native speakers and two groups of L2 speakers whose first languages were Cantonese and English, respectively. The study started from isolated words, but focused on continuous speech, for which eleven durational metrics were...
Article
Full-text available
This paper presents a two-stage soft thresholding algorithm based on discrete cosine transform (DCT) and empirical mode decomposition (EMD). In the first stage, noisy speech is decomposed into eight frequency bands and a specific noise variance is calculated for each one. Based on this variance, each band is denoised using soft thresholding in DCT...
Conference Paper
Full-text available
This paper presents an efficient pitch estimation algorithm for noisy speech signal using ensemble empirical mode decomposition (EEMD) based time domain filtering. The dominant harmonic of noisy speech is enhanced to make pitch period more prominent. The normalized autocorrelation function (NACF) of the modified signal is then decomposed into time...
Conference Paper
Full-text available
This paper introduces a hierarchical stress generation method for expressive speech synthesis. In a previous study, we proposed a novel hierarchical Mandarin stress modeling method, and text-based stress prediction experiments demonstrated that reliable stress assignment can be obtained from textual features. However, the stress model should be further...
Conference Paper
Exemplar-based approaches, which model signals as a sparse linear combination of signal exemplars, have been shown to achieve state-of-the-art performance in noise-robust ASR, especially at low SNRs. However, since both the speech exemplars and noise exemplars are built from training data and are fixed throughout the process of enhancing speech fea...
Article
Full-text available
The 15 papers in this special issue focus on statistical parametric speech synthesis.
Conference Paper
English is the only language available for international communication and is used by approximately 1.5 billion speakers. It is also known to have a large diversity of pronunciation partly due to the influence of the speakers' mother tongue, called accents. Our project aims at creating a global and individual-basis map of English pronunciations...
Article
For Japanese speech processing, being able to automatically distinguish between geminate and singleton consonants can have many benefits. In standard recognition methods, hidden Markov models (HMMs) are used. However, HMMs are not good at differentiating between items that are distinguished primarily by temporal differences rather than spectral diffe...
Article
Generation process model of fundamental frequency (F0) contours is ideal to represent global movements of F0's keeping a clear relation with back-grounding linguistic information of utterances. Using the model, improvements of HMM-based speech synthesis are expected. A new method is developed to cope with erroneous F0's of utterances included in HM...
Article
This paper describes a novel approach to construct a mapping function between a given speaker pair using probability density functions (PDF) of matrix variate. In voice conversion studies, two important functions should be realized: 1) precise modeling of both the source and target feature spaces, and 2) construction of a proper transform function...
Conference Paper
In this paper, we propose the use of deep neural networks to expand conventional methods of statistical feature enhancement based on piecewise linear transformation. Stereo-based piecewise linear compensation for environments (SPLICE), which is a powerful statistical approach for feature enhancement, models the probabilistic distribution of input n...
Article
Full-text available
Empirical mode decomposition (EMD) is a newly developed tool to analyze nonlinear and non-stationary signals. It is used to decompose any signal into a finite number of time varying subband signals termed as intrinsic mode functions (IMFs). Such data adaptive decomposition is recently used in speech enhancement. This study presents the concept of E...
Article
For non-native learners of Japanese, the pitch accent can be cumbersome to acquire without proper instruction. A Computer Assisted Language Learning (CALL) system could aid these learners in this acquisition provided that it can generate helpful feedback based on automatic analysis of the learner's utterance. For this, it is necessary to consider t...
Conference Paper
This paper proposes a new method of estimating perceptual femininity (PF) of an input utterance using Gaussian Mixture Model (GMM) supervectors and support vector regression (SVR). The method is used to develop a femininity estimation tool, which is introduced to voice therapy of Gender Identity Disorder (GID) clients, especially MtF (Male to Femal...
Article
Full-text available
This paper proposes a feature enhancement method that can achieve high speech recognition performance in a variety of noise environments with feasible computational cost. Like the well-known Stereo-based Piecewise Linear Compensation for Environments (SPLICE) algorithm, the proposed method learns piecewise linear transformation to map corrupted featu...
Conference Paper
It is well-known that human speech recognition (HSR) is much more robust than automatic speech recognition (ASR) [1], [2]. Given that HSR's robustness to large acoustic variability is extremely high, it is reasonable for researchers to assume that humans are able to extract invariant patterns underlying input utterances [3]. Recently in development...
Article
As classic and intrinsic requirements, synthetic speech needs to convey correct information with good quality of naturalness to listeners. Fundamental frequency (F0) contours need to be controlled to meet these requirements. Additional challenges have been introduced to tonal languages because the F0 contour reflects both intelligibility and natural...
Article
Artificial Bandwidth Extension (ABE) is a technique that attempts to regenerate the wideband signal (0-8 kHz) from a narrowband signal (300 Hz-3.4 kHz) in order to improve the speech quality of today's telephone systems. By employing the well-known source-filter model, ABE can be divided into 2 sub-problems, namely extension of the excitation sig...
Article
We describe a new method of voice conversion aimed at character conversion by the eigenvoice Gaussian mixture model (EV-GMM) approach. Using an eigenvoice space built from 273 speakers and speech samples of three different characters created by a single skilled voice actor/actress, the conversion can generate the voices of the three characters from...
Article
Full-text available
Brain-computer interface is a communication system that connects the brain with computer (or other devices) but is not dependent on the normal output of the brain (i.e., peripheral nerve and muscle). Electro-oculogram is a dominant artifact which has a significant negative influence on further analysis of real electroencephalography data. This pape...
Article
This paper presents a novel audio discrimination algorithm using spatial features in time-frequency (TF) space. Three types of audio signals - speech, music without vocal and music with background vocal are taken into consideration for classification. The audio segment is transformed into TF domain yielding the spatial illustration of energy. Nonne...
Article
Artificial Bandwidth Extension (ABE) has been introduced to improve perceived speech quality and intelligibility of narrowband telephone speech. Most of the existing algorithms divided ABE into 2 sub-problems, namely extension of the excitation signal and that of the spectral envelope. In this paper, we propose a new method for spectral envelop...
Article
This paper introduces the first online and free framework for teaching and learning Japanese prosody including word accent and phrase intonation. This framework is called OJAD (Online Japanese Accent Dictionary) [1] and it provides three functions. 1) Visual, auditory, systematic, and comprehensive illustration of patterns of accent change (accent...
Conference Paper
Full-text available
This study first examines the differences in the gross features of the fundamental frequency contour (the F0 contour) responsible for discriminating utterances of three sentence types, namely declarative, imperative and interrogative, in Bangla. In order to realize these differences in speech synthesis, these differences are then interpreted in ter...
Conference Paper
In recent years Computer-Assisted Language Learning (CALL) systems have been widely used in foreign language education. Some systems use automatic speech recognition (ASR) technologies to detect pronunciation errors and estimate the proficiency level of individual students. When speech recording is done in a CALL classroom, however, utterances of a...
Conference Paper
Pronunciation errors are often made by learners of a foreign language. To build a Computer-Assisted Language Learning (CALL) system to support them, automatic error detection is essential. This study focuses on Japanese learners of Chinese. We investigated automatic detection of their typical and frequent phoneme production errors. For t...
Conference Paper
Multimodal speech recognition is a promising approach to realize noise robust automatic speech recognition (ASR), and is currently gathering the attention of many researchers. Multimodal ASR utilizes not only audio features, which are sensitive to background noises, but also non-audio features such as lip shapes to achieve noise robustness. Althoug...
Article
A new method was proposed for synthesizing sentence fundamental frequency (F0) contours of Mandarin speech. The method is based on representing a sentence logarithmic F0 contour as a superposition of tone components on phrase components, as in the case of the generation process model (F0 model). However, the method is not fully depending on the mod...
Conference Paper
Generation process model of fundamental frequency contours is ideal to represent global features of prosody. It is a command response model, where the commands have clear relations with linguistic and para/non linguistic information conveyed by the utterance. Therefore, by handling fundamental frequency contours in the framework of the generation p...
Conference Paper
This paper describes a novel approach to flexible control of speaker characteristics using tensor representation of speaker space. In voice conversion studies, realization of conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC) based on an eigenvoice Gaussian mixture model...
Conference Paper
In previous work we have defined a pseudosyllable unit for an English read speech recognition task. In this study we investigate the robustness of extraction of the pseudosyllable units and investigate how such units can be integrated into speech recognition systems. An evaluation method that maps hypothesis phonemes to reference phonemes is propos...
Article
Full-text available
This paper presents a data-adaptive technique of cardiovascular disease diagnosis by analyzing electrocardiogram (ECG) signals. The separation of high-frequency (HF) and low-frequency (LF) components is performed by employing empirical mode decomposition (EMD) designed for analyzing nonstationary and non-linear signals. The EMD is used to decompos...
Article
Full-text available
Speech synthesis based on hidden Markov models (HMMs) processes both segmental and prosodic features of speech together in a frame-by-frame manner. One benefit of this method is that time alignment of both features is kept automatically. However, when the training data are limited, frame-by-frame representation is not appropriate for prosodic featu...
Article
Full-text available
This paper presents a novel algorithm of speech enhancement using data adaptive soft-thresholding technique. The noisy speech signal is decomposed into a finite set of band limited signals called intrinsic mode functions (IMFs) using empirical mode decomposition (EMD). Each IMF is divided into fixed length subframes. On the basis of noise contaminat...
Article
Full-text available
In the modern world where people are usually busier than ever, family members are geographically relocated due to globalization of companies and humans are inundated with more information than they can process, ambient communications through mobile media or internet based communication can provide rich social connections to friends and family. Peop...
Conference Paper
Full-text available
One of the most effective approaches to noise robust speech recognition is to remove the noise effect directly from corrupted MFCC vectors. However, VTS enhancement, which is a typical method for performing MFCC enhancement, provides limited improvement when the noise is highly non-stationary. This is because the VTS enhancement method cannot use a...
Conference Paper
Full-text available
SPLICE is one of the speech enhancement methods based on feature conversion, which shows a high performance with a relatively small amount of calculation. After modeling noisy speech features as GMM, conversion functions are obtained for individual GMM components. The original SPLICE estimates clean feature vectors as a weighted summation of the co...
Conference Paper
Full-text available
In this paper, we demonstrate the potential of incorporating syllable-level information in acoustic modeling. The unit of syllable is not rigorously defined, which leads to a problem for its use. In this study, we derive syllable structures from the sonorant-band intensity profile of speech signal. We analyze the error statistics of a phone-based c...
Conference Paper
Full-text available
This paper presents a multiple kernel learning (MKL) approach to speech/music discrimination (SMD). The time-frequency representation (spectrogram) implemented by short-time Fourier transform (STFT) of audio segment is decomposed by wavelet packet transform into different subband levels. The subbands, which contain rich texture information, are use...
Article
Full-text available
It is well-known that the performance of automatic speech recognition (ASR) systems is easily affected by acoustic mismatch between training and testing conditions. This mismatch is often caused by various kinds of environmental noise or distortion. To reduce the effect of mismatch, feature normalization, feature enhancement, model adaptation, et...
Article
Full-text available
The GMM-based spectral conversion techniques were applied to emotion conversion but it was found that spectral transformation alone is not sufficient for conveying the required target emotion. In this paper, we adopt the tone nucleus model to carry the most important information of tones and represent F0 contour for Mandarin speech. And then tone...
Article
Full-text available
Frame-by-frame representation is not appropriate for prosodic features, which are tightly related to speech units spreading a wide time span, such as words, phrases and so on. This causes an inherent problem in fundamental frequency (F0) contour generation by HMM-based speech synthesis. Our formerly-developed method, which modifies generated F0 cont...
Article
Full-text available
For language learners it can be very difficult to speak without a non-native accent. This is due to the phenomenon called language transfer. At times, this can decrease the intelligibility of the speech and make it difficult to convey the correct message to the listener. Thus, many learners have a desire to speak more naturally in order not to be...
Article
Full-text available
It is difficult to demonstrate the effectiveness of prosodic features in automatic word recognition. Recently, we applied the suprasegmental concept and proposed an extra layer of acoustic modeling with syllables. Nevertheless, there is a mismatch between the syllable and the word units and that makes subsequent steps after acoustic modeling diffi...
Article
Full-text available
This work introduces a modified WFST-based multiple to multiple EM-driven alignment algorithm for Grapheme-to-Phoneme (G2P) conversion, and preliminary experimental results applying a Recurrent Neural Network Language Model (RNNLM) as an N-best rescoring mechanism for G2P conversion. The alignment algorithm leverages the WFST framework and intr...
Article
One of the biggest difficulties in automatic speech recognition (ASR) is how to deal with variations of speech signals caused by non-linguistic information, such as age, gender, etc. Various methods have been proposed to compensate for the variations and one of them is speech structure [1]. Speech structure, which extracts only contrastive features...
Article
This paper introduces speaker adaptive training techniques to tensor-based arbitrary speaker conversion. In voice conversion studies, realization of conversion from/to an arbitrary speaker's voice is one of the important objectives. For this purpose, eigenvoice conversion (EVC), which is based on an eigenvoice Gaussian mixture model (EV-GMM), was p...
Article
Generation process model of fundamental frequency (F0) contours can well represent F0 movements of speech keeping a clear relation with linguistic information of utterances. Therefore, by using the model, improvement of HMM-based speech synthesis is expected. One of major problems preventing the use of the model is that the performance of automatic...
Article
Full-text available
A novel and robust pitch estimation method is presented in this paper. The basic idea is to reshape the speech signal using a combination of the dominant harmonic modification (DHM) and data adaptive time domain filtering techniques. The noisy speech signal is filtered within the ranges of fundamental frequencies to obtain the pre-filtered signal (...
Article
We proposed a structural representation of speech that is robust to speaker difference due to its transformation-invariant property in previous works, where we compared two speech structures by calculating the distance between two structural vectors, each composed of the lengths of a structure's edges. However, this distance cannot yield matching s...
Conference Paper
Pause and F0 play important roles in the process of understanding a spoken message. Up to the present, pause insertion and F0 contour generation have been modeled separately and evaluated independently from each other. However, the occurrence of a pause has a direct influence on the F0 contour of the portion of an utterance immediately after the pa...
Article
Full-text available
This paper presents a comparative study of prosodic features of utterances of two types of Bangla sentences: the declarative type and the interrogative (‘yes-no’ question) type whose textual contents are identical except for the punctuation marks in Bangla. The study is based on the analysis of 44 utterances each of the declarative type and the int...
Article
Frame-by-frame representation is not appropriate for prosodic features, which are tightly related to speech units spreading a wide time span, such as words, phrases and so on. This causes an inherent problem in fundamental frequency (F0) contour generation by HMM-based speech synthesis. A method is developed to modify F0 contours in the framework of...
