Chiori Hori

National Institute of Information and Communications Technology, Tokyo, Japan

Publications (118) · 58.6 Total Impact

  • ABSTRACT: Recently, a novel speaker adaptation method was proposed that applied the Speaker Adaptive Training (SAT) concept to a speech recognizer consisting of a Deep Neural Network (DNN) and a Hidden Markov Model (HMM), and its utility was demonstrated. This method implements the SAT scheme by allocating one Speaker Dependent (SD) module for each training speaker to one of the intermediate layers of the front-end DNN. It then jointly optimizes the SD modules and the other part of the network, which is shared by all the speakers. In this paper, we propose an improved version of the above SAT-based adaptation scheme for a DNN-HMM recognizer. Our new training adopts a Linear Transformation Network (LTN) for the SD module, and this LTN leads to more appropriate regularization in both the SAT and adaptation stages by replacing an empirically selected network anchorage for regularization in the preceding SAT-DNN-HMM with a SAT-optimized anchorage. We demonstrate the effectiveness of our proposed method on TED Talks corpus data. Our experimental results show that a speaker-adapted recognizer using our method achieves a significant word error rate reduction of 9.2 points from a baseline SI-DNN recognizer and also steadily outperforms speaker-adapted recognizers, each of which originates from the preceding SAT-based DNN-HMM.
    No preview · Article · Aug 2015
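    A minimal sketch of the speaker-dependent LTN idea described above, assuming PyTorch and illustrative layer sizes rather than the paper's exact configuration: one square linear transform per training speaker is inserted at an intermediate layer, while the surrounding layers are shared by all speakers.
      # Sketch only: per-speaker linear transformation network (LTN) inserted
      # into an intermediate DNN layer, SAT-style. Layer sizes, activations,
      # and speaker handling are illustrative assumptions.
      import torch.nn as nn

      class SATDNN(nn.Module):
          def __init__(self, n_speakers, feat_dim=440, hidden=2048, n_states=3000):
              super().__init__()
              self.lower = nn.Sequential(nn.Linear(feat_dim, hidden), nn.Sigmoid(),
                                         nn.Linear(hidden, hidden), nn.Sigmoid())
              # one LTN (square linear transform) per training speaker
              self.ltn = nn.ModuleList([nn.Linear(hidden, hidden)
                                        for _ in range(n_speakers)])
              self.upper = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid(),
                                         nn.Linear(hidden, n_states))

          def forward(self, x, speaker_id):
              h = self.lower(x)
              h = self.ltn[speaker_id](h)   # speaker-dependent module
              return self.upper(h)          # layers shared across speakers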
  • Jinfu Ni · Yoshinori Shiga · Chiori Hori
    ABSTRACT: This paper addresses intonation synthesis combining both statistical and generative models to manipulate fundamental frequency (F0) contours in the framework of HMM-based speech synthesis. An F0 contour is represented as a superposition of micro, accent, and register components on a logarithmic scale, in light of the Fujisaki model. The three component sets are extracted from a speech corpus by a pitch decomposition algorithm based on a functional F0 model, and a separate context-dependent (CD) HMM is trained for each component. At the synthesis stage, the CDHMM-generated micro, accent, and register components are superimposed to form F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, this method demonstrates improved naturalness by achieving better local and global F0 behavior, and exhibits a link between phonology and phonetics, making it possible to flexibly control intonation on the fly using given marking information to manipulate the parameters of the functional F0 model.
    No preview · Article · May 2015 · Journal of Signal Processing Systems
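    As a rough illustration of the superposition described above (notation of my own choosing, following the Fujisaki-style formulation the abstract cites), the generated components are summed in the log domain:
      \log F_0(t) \approx M(t) + A(t) + R(t)
    where M(t), A(t), and R(t) denote the CDHMM-generated micro, accent, and register components, respectively.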
  • M. Saiko · H. Yamamoto · R. Isotani · C. Hori
    ABSTRACT: We propose a new multi-lingual unsupervised acoustic model (AM) training method for low-resourced languages under mismatch conditions. In those languages, there is very limited or no transcribed speech. Thus, unsupervised acoustic modeling using AMs of different (non-low-resourced) languages has been proposed. The conventional method has been shown to be effective when the acoustic conditions, such as speaking style, are similar between the low-resourced language and the different languages. However, since it is not easy to prepare matched AMs of different languages, a mismatch between each AM and the speech of the low-resourced language practically occurs in unsupervised acoustic modeling. In this paper, we deal with this mismatch problem. To generate more accurate automatic transcriptions under mismatch conditions, we introduce two techniques: (1) initial AMs trained with speech of different languages mapped to the phonemes of the low-resourced language, and (2) an iterative process that switches back and forth between AM training and adaptation of the initial AMs. The proposed method, without any transcriptions, achieved a word error rate of 32.1% on the IWSLT2011 evaluation set, while the word error rates of the conventional method and the supervised training method were 39.3% and 22.7%, respectively.
    No preview · Article · Apr 2015
  • ABSTRACT: Robot utterances generally sound monotonous, unnatural and unfriendly because their Text-to-Speech systems are not optimized for communication but for text reading. Here, we present a non-monologue speech synthesis for robots. The key novelty lies in speech synthesis based on Hidden Markov models (HMMs) using a non-monologue corpus: we collected a speech corpus in a non-monologue style in which two professional voice talents read scripted dialogues, and HMMs were then trained with the corpus and used for speech synthesis. We conducted experiments in which the proposed method was evaluated by 24 subjects in three scenarios: text reading, dialogue and domestic service robot (DSR) scenarios. In the DSR scenario, we used a physical robot and compared our proposed method with a baseline method using the standard Mean Opinion Score criterion. Our experimental results showed that our proposed method’s performance was (1) at the same level as the baseline method in the text-reading scenario and (2) exceeded it in the DSR scenario. We deployed our proposed system as a cloud-based speech synthesis service so that it can be used without any cost.
    No preview · Article · Mar 2015 · Advanced Robotics
  • X. Hu · M. Saiko · C. Hori
    ABSTRACT: Tone plays an important role in distinguishing lexical meaning in tonal languages, such as Mandarin and Thai. It has been shown that tone information helps improve automatic speech recognition (ASR) for these languages. In this study, we incorporate tone features derived from the fundamental frequency (F0) and fundamental frequency variation (FFV) into the convolutional neural network (CNN), a state-of-the-art acoustic modeling approach, for the acoustic models of our ASR systems. Due to its ability to reduce spectral variations and model spectral correlations existing in speech signals, the CNN is expected to model tone patterns well, since they mainly manifest in the frequency domain as the F0 contour. We conduct ASR experiments on Mandarin and Thai to evaluate the effectiveness of the proposed approaches. With the help of tone features, the character error rates (CERs) for Mandarin achieve relative reductions of 4.3-7.1%, and the word error rates (WERs) for Thai achieve relative reductions of 0.41-6.26%. The CNN shows its clear superiority to the deep neural network (DNN), with relative CER reductions of 5.4-13.1% for Mandarin and relative WER reductions of 0.5-5.6% for Thai.
    No preview · Article · Feb 2015
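    A minimal sketch, assuming NumPy, of how frame-level tone features might be appended to filter-bank features before CNN acoustic-model training, as described above; the feature dimensions are illustrative assumptions, not the paper's exact front end.
      # Sketch: concatenate per-frame F0 and FFV tone features onto Mel
      # filter-bank features and apply utterance-level mean/variance
      # normalization. Dimensions are illustrative.
      import numpy as np

      def build_cnn_input(fbank, f0, ffv):
          # fbank: (frames, 40) log Mel filter banks
          # f0:    (frames, 1)  fundamental frequency track
          # ffv:   (frames, 7)  fundamental frequency variation features
          feats = np.hstack([fbank, f0, ffv])   # feature-axis concatenation
          return (feats - feats.mean(0)) / (feats.std(0) + 1e-8)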
  • J. Ni · Y. Shiga · C. Hori
    ABSTRACT: Expressive intonation uses focal prominence to give emphasis that highlights the focus of speech. This paper describes a method for improving the expressiveness of HMM-based voices, particularly by putting focal prominence on a word. Different from previous methods, our method exploits a speech corpus available for model training, without needing to record additional emphasis speech. The method employs a functional F0 model to decompose the pitch accents of utterances into lexical accent and pitch register components. The two components are anchored by a limited number of target points to establish the topological relations between the prosodic and linguistic features of the utterances. The F0 model is further used to adjust the gradient prominence levels of pitch accents to realize focal prominence under the constraint of the topological relations. In this way, the need to record emphasis speech samples is significantly reduced. Moreover, the emphases with focal prominence can be contextually labeled for training context-dependent models. Experiments are conducted on a neutral speech corpus in Japanese, particularly on expansion of the local pitch range of the nuclear pitch accents (the most prominent accents) of utterances. The results demonstrate that the proposed method gracefully puts focal prominence on specific words while keeping a high degree of naturalness in the synthetic speech.
    No preview · Article · Feb 2015
  • X. Hu · X. Lu · C. Hori
    ABSTRACT: Due to its ability to reduce spectral variations and model spectral correlations existing in speech signals, the convolutional neural network (CNN) has been shown to be effective in modeling speech compared to the deep neural network (DNN). In this study, we explore applying the CNN to Mandarin speech recognition. Besides exploring an appropriate CNN architecture for recognition performance, we focus on investigating effective acoustic features and the effectiveness of applying tonal information, which has been verified to be helpful in other types of acoustic models, to the acoustic features of the CNN. We conduct experiments on Mandarin broadcast speech recognition to test the effectiveness of the proposed approaches. The CNN shows its clear superiority to the DNN, with relative reductions of character error rate (CER) of 7.7-13.1% for broadcast news speech (BN) and 5.4-9.9% for broadcast conversation speech (BC). As in the Gaussian Mixture Model (GMM) and DNN systems, the tonal information characterized by the fundamental frequency (F0) and fundamental frequency variations (FFV) is found to still be helpful in CNN models, achieving relative CER reductions of over 6.7% for BN and 4.3% for BC, respectively, when compared with the baseline Mel-filter bank features.
    No preview · Article · Oct 2014
  • X. Lu · Y. Tsao · P. Shen · C. Hori
    ABSTRACT: In most algorithms for acoustic event detection (AED), frame-based acoustic representations are used in acoustic modeling. Due to the lack of context information in the feature representation, large model confusions may occur during modeling. We have proposed a feature learning and representation algorithm to explore context information from temporal-frequency patches of the signal for AED. With this algorithm, a sparse feature is extracted based on an acoustic dictionary composed of a bag of spectral patches. In our previous algorithm, the feature was obtained based on the Euclidean distance between the input signal and the acoustic dictionary. In this study, we formulate the sparse feature extraction as l1 regularization in signal reconstruction. The sparsity of the representation is efficiently controlled by varying a regularization parameter. A support vector machine (SVM) classifier is built on the extracted sparse feature for AED. Our experimental results showed that the spectral-patch-based sparse representation effectively improved the performance by incorporating temporal-frequency context information in modeling.
    No preview · Article · Oct 2014
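    A minimal sketch, assuming scikit-learn and a random dictionary purely for illustration, of the l1-regularized sparse coding step described above; in the paper the dictionary is a learned bag of spectral patches, and the regularization weight controls the sparsity of the representation.
      # Sketch: l1-regularized sparse coding of flattened temporal-frequency
      # patches against a patch dictionary (random here, for illustration).
      import numpy as np
      from sklearn.decomposition import SparseCoder

      rng = np.random.default_rng(0)
      D = rng.standard_normal((256, 400))              # 256 atoms x 400-dim patches
      D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm atoms
      patches = rng.standard_normal((10, 400))         # 10 flattened patches

      coder = SparseCoder(dictionary=D, transform_algorithm='lasso_lars',
                          transform_alpha=0.1)         # alpha = l1 (sparsity) weight
      sparse_feats = coder.transform(patches)          # (10, 256) features for an SVM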
  • Youzheng Wu · Xinhu Hu · Chiori Hori
    ABSTRACT: This paper presents our recent progress on translating TED speeches, a collection of public lectures covering a variety of topics. Specifically, we use word-to-word alignment to compose translation units of bilingual tuples and present a recurrent neural network-based translation model (RNNTM) to capture long-span context when estimating translation probabilities of bilingual tuples. However, this RNNTM has a severe data sparsity problem due to the large tuple vocabulary and limited training data. Therefore, a factored RNNTM, which takes the source and target phrases of the tuples as input features in addition to the bilingual tuples themselves, is proposed to partially address the problem. Our experimental results on the IWSLT2012 test sets show that the proposed models significantly improve the translation quality over state-of-the-art phrase-based translation systems.
    No preview · Conference Paper · May 2014
  • ABSTRACT: Robot utterances generally sound monotonous, unnatural, and unfriendly because their Text-to-Speech (TTS) systems are not optimized for communication but for text-reading. Here we present a non-monologue speech synthesis for robots. We collected a speech corpus in a non-monologue style in which two professional voice talents read scripted dialogues. Hidden Markov models (HMMs) were then trained with the corpus and used for speech synthesis. We conducted experiments in which the proposed method was evaluated by 24 subjects in three scenarios: text-reading, dialogue, and domestic service robot (DSR) scenarios. In the DSR scenario, we used a physical robot and compared our proposed method with a baseline method using the standard Mean Opinion Score (MOS) criterion. Our experimental results showed that our proposed method's performance was (1) at the same level as the baseline method in the text-reading scenario and (2) exceeded it in the DSR scenario. We deployed our proposed system as a cloud-based speech synthesis service so that it can be used without any cost.
    No preview · Conference Paper · May 2014
  • Chien-Lin Huang · Chiori Hori
    ABSTRACT: This study presents a novel approach to semantic context inference based on term association matrices for spoken document retrieval. Each recognized term in a spoken document infers a semantic vector containing a bag of semantic terms from a term association matrix. Such semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing. We consider both words and syllables in the term association matrices for semantic context inference. Syllable lattice bigrams, instead of single-best speech recognition results, and various term weighting schemes have been studied for semantic context inference. Experiments were conducted on Mandarin Chinese broadcast news. The results indicate that the proposed approach offers a significant performance improvement in spoken document retrieval.
    Full-text · Conference Paper · May 2014
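    A minimal sketch (NumPy, with a made-up association matrix) of the semantic context inference step described above: the recognized-term vector of a spoken document is expanded through the term association matrix and re-weighted to form the semantic indexing vector.
      # Sketch: semantic term expansion via a term association matrix.
      # The matrix and term counts are illustrative placeholders.
      import numpy as np

      vocab_size = 1000
      rng = np.random.default_rng(0)
      A = rng.random((vocab_size, vocab_size))        # term association matrix
      A /= A.sum(axis=1, keepdims=True)               # row-normalize associations

      doc_terms = np.zeros(vocab_size)
      doc_terms[[12, 57, 340]] = [3, 1, 2]            # recognized term counts

      semantic_vec = doc_terms @ A                    # infer associated semantic terms
      semantic_vec /= np.linalg.norm(semantic_vec)    # re-weight for indexing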
  • ABSTRACT: Among many speaker adaptation embodiments, Speaker Adaptive Training (SAT) has been successfully applied to a standard Hidden Markov Model (HMM) speech recognizer, whose states are associated with Gaussian Mixture Models (GMMs). On the other hand, recent studies on Speaker-Independent (SI) recognizer development have reported that a new type of HMM speech recognizer, which replaces GMMs with Deep Neural Networks (DNNs), outperforms GMM-HMM recognizers. Along these two lines, it is natural to conceive of further improvement to a DNN-HMM recognizer by employing SAT. In this paper, we propose a novel training scheme that applies SAT to an SI DNN-HMM recognizer. We then implement the SAT scheme by allocating a Speaker-Dependent (SD) module to one of the intermediate layers of a seven-layer DNN, and evaluate its utility on TED Talks corpus data. Experimental results show that our speaker-adapted SAT-based DNN-HMM recognizer reduces the word error rate by 8.4% compared with a baseline SI DNN-HMM recognizer and, regardless of the SD module allocation, outperforms the conventional speaker adaptation scheme. The results also show that the inner layers of the DNN are more suitable for the SD module than the outer layers.
    No preview · Conference Paper · May 2014
  • Xugang Lu · Yu Tsao · Shigeki Matsuda · Chiori Hori
    ABSTRACT: Acoustic event detection is an important step for audio content analysis and retrieval. Traditional detection techniques model acoustic events on frame-based spectral features. Considering that the temporal-frequency structures of acoustic events may be distributed over time scales beyond frames, we propose to represent those structures as a bag of spectral patch exemplars. In order to learn representative exemplars, k-means clustering based vector quantization (VQ) was applied to the whitened spectral patches, which makes the learned exemplars focus on high-order statistical structure. With the learned spectral exemplars, a sparse feature representation is extracted based on the similarity to the learned exemplars. A support vector machine (SVM) classifier was built on the sparse representation for acoustic event detection. Our experimental results showed that the sparse representation based on the patch-based exemplars significantly improved the performance compared with traditional frame-based representations.
    No preview · Conference Paper · May 2014
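    A minimal sketch, assuming scikit-learn and synthetic data, of the exemplar pipeline described above: whiten spectral patches, learn exemplars with k-means, represent each patch by its similarity to the exemplars, and feed the sparsified representation to an SVM. Shapes, the whitening step, and the thresholding rule are illustrative simplifications, and classification is done per patch rather than per segment.
      # Sketch: k-means exemplars on whitened spectral patches and a
      # similarity-based sparse feature for an SVM event classifier.
      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      patches = rng.standard_normal((2000, 200))      # flattened spectral patches
      labels = rng.integers(0, 5, 2000)               # event labels (illustrative)

      white = PCA(n_components=80, whiten=True).fit_transform(patches)
      km = KMeans(n_clusters=128, n_init=5, random_state=0).fit(white)  # exemplars

      sim = -km.transform(white)                      # negative distance as similarity
      feats = np.maximum(sim - sim.mean(1, keepdims=True), 0)  # crude sparsification
      SVC(kernel='linear').fit(feats, labels)         # per-patch event classifier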
  • ABSTRACT: An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. This ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve the speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of priors are derived; they are the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP) densities. Since each prior density is able to characterize specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Based on the experimental results on the Aurora-2 task, we verify that using the MAP-based online mapping function estimation can enable ESSEM to achieve better performance than using the maximum-likelihood (ML) based counterpart. Moreover, by using an integration of the online environment structure adaptation and mapping function estimation, the proposed MAP-based ESSEM framework is found to provide the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average word error rate reduction of 15.53% (5.41 to 4.57%) under 50 testing conditions at a signal-to-noise ratio (SNR) of 0 to 20 dB over the three standardized testing sets.
    No preview · Article · Feb 2014 · IEEE/ACM Transactions on Audio, Speech, and Language Processing
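    For reference, the MAP criterion underlying the online estimation above has the general form (notation mine, not the paper's): \hat{\theta}_{MAP} = \arg\max_{\theta} \, p(X \mid \theta)\, p(\theta), where X is the adaptation data, \theta the mapping-function parameters, and p(\theta) the clustered, sequential, or hierarchical prior density (or their combination).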
  • X. Lu · Y. Tsao · S. Matsuda · C. Hori
    ABSTRACT: The denoising autoencoder (DAE) is effective in restoring clean speech from noisy observations. In addition, it can easily be stacked into a deep denoising autoencoder (DDAE) architecture to further improve the performance. In most studies, it is assumed that the DAE or DDAE can learn any complex transform function to approximate the transform relation between noisy and clean speech. However, for large variations of speech patterns and noisy environments, the learned model lacks focus on local transformations. In this study, we propose an ensemble modeling of DAEs to learn both the global and local transform functions. In the ensemble modeling, local transform functions are learned by several DAEs using data sets obtained from unsupervised data clustering and partitioning. The final transform function used for speech restoration is a combination of all the learned local transform functions. Speech denoising experiments were carried out to examine the performance of the proposed method. Experimental results showed that the proposed ensemble DAE model provided superior restoration accuracy compared with traditional DAE models.
    No preview · Article · Jan 2014
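    A minimal sketch, assuming scikit-learn and synthetic data, of the ensemble idea described above: partition noisy/clean training pairs by unsupervised clustering, train one local denoising model per partition (MLPRegressor stands in for a DAE here, purely for brevity), and combine the local restorations at test time.
      # Sketch: ensemble of local denoisers, one per k-means partition,
      # combined by soft cluster responsibility. Data are synthetic.
      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.neural_network import MLPRegressor

      rng = np.random.default_rng(0)
      clean = rng.standard_normal((3000, 40))         # clean feature frames
      noisy = clean + 0.3 * rng.standard_normal(clean.shape)

      K = 4
      km = KMeans(n_clusters=K, n_init=5, random_state=0).fit(noisy)
      local = [MLPRegressor(hidden_layer_sizes=(128,), max_iter=300)
               .fit(noisy[km.labels_ == k], clean[km.labels_ == k])
               for k in range(K)]

      def restore(x):
          # weight each local restoration by the frame's cluster responsibility
          d = km.transform(x)                                  # distances to centers
          w = np.exp(-d) / np.exp(-d).sum(1, keepdims=True)    # (frames, K)
          outs = np.stack([m.predict(x) for m in local])       # (K, frames, 40)
          return np.einsum('fk,kfd->fd', w, outs)

      denoised = restore(noisy[:5])                   # example restoration call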
  • ABSTRACT: This paper regards social question-and-answer (Q&A) collections such as Yahoo! Answers as knowledge repositories and investigates techniques to mine knowledge from them to improve sentence-based complex question answering (QA) systems. Specifically, we present a question-type-specific method (QTSM) that extracts question-type-dependent cue expressions from social Q&A pairs in which the question types are the same as the submitted questions. We compare our approach with the question-specific and monolingual translation-based methods presented in previous works. The question-specific method (QSM) extracts question-dependent answer words from social Q&A pairs in which the questions resemble the submitted question. The monolingual translation-based method (MTM) learns word-to-word translation probabilities from all of the social Q&A pairs without considering the question or its type. Experiments on the extension of the NTCIR 2008 Chinese test data set demonstrate that our models that exploit social Q&A collections are significantly more effective than baseline methods such as LexRank. The performance ranking of these methods is QTSM > {QSM, MTM}. The largest F3 improvements in our proposed QTSM over QSM and MTM reach 6.0% and 5.8%, respectively.
    No preview · Article · Jan 2014 · Computer Speech & Language
  • ABSTRACT: Speaker clustering has been widely adopted to cluster speech data based on acoustic characteristics so that unsupervised speaker normalization and speaker adaptive training can be applied for better speech recognition performance. In this study, we present a vector space speaker clustering approach with long-term feature analysis. A supervector based on the GMM mean vectors is adopted to represent the characteristics of speakers. To achieve a robust representation, total variability subspace modeling, which has been successfully applied in speaker recognition to compensate for channel and session variability over the GMM mean supervector, is used for speaker clustering. We apply a long-term feature analysis strategy to average short-time spectral features over a period of time to capture speaker traits that are manifested over a speech segment longer than a spectral frame. Experiments conducted on lecture-style speech show that this speaker clustering approach offers better speech recognition performance.
    Full-text · Conference Paper · Oct 2013
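    The total variability modeling mentioned above is commonly written (notation mine, following the standard i-vector formulation rather than the paper's exact equations) as M = m + Tw, where M is the utterance-dependent GMM mean supervector, m the universal background model supervector, T the low-rank total variability matrix, and w the low-dimensional factor used here to represent and cluster speakers.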
  • ABSTRACT: This paper presents a joint analysis approach to acoustic feature normalization for robust speech recognition. Variations in acoustic environments and speakers are the major challenge for speech recognition. Conventionally, these two variations are normalized separately: speaker normalization is applied under the assumption of a noise-free condition, and noise compensation is applied under the assumption of speaker independence, resulting in suboptimal performance. The proposed joint analysis approach simultaneously considers vocal tract length normalization and the averaged temporal information of cepstral features. In a data-driven manner, a Gaussian mixture model is used to estimate the conditional parameters in the joint analysis. Experimental results show that the proposed approach achieves a substantial improvement.
    Full-text · Conference Paper · Oct 2013
  • ABSTRACT: This study presents a novel approach to spoken document retrieval based on neural probabilistic language modeling for semantic inference. The neural network based language model is applied to estimate word associations in a continuous space. Different weighting schemes are investigated to represent the recognized words of a spoken document as an indexing vector. The indexing vector is transformed into a semantic indexing vector through the neural probabilistic language model. Such semantic word inference and re-weighting make the semantic indexing vector a suitable representation for speech indexing. Experimental results on Mandarin Chinese broadcast news show that the proposed approach achieves a substantial and consistent improvement in spoken document retrieval.
    Full-text · Conference Paper · Oct 2013
  • Chien-Lin Huang · Chiori Hori
    ABSTRACT: This paper presents deep neural networks for the classification of children with voice impairments from speech signals. In the analysis of the speech signals, 6,373 static acoustic features are extracted from many kinds of low-level descriptors and functionals. To reduce the variability of the extracted features, two-dimensional normalization is applied to smooth the inter-speaker and inter-feature mismatch using the feature warping approach. Then, feature selection is used to explore a discriminative and low-dimensional representation based on principal component analysis and linear discriminant analysis. In this representation, robust features are obtained by eliminating noise features via subspace projection. Finally, deep neural networks are adopted to classify the children with voice impairments. We conclude that deep neural networks with the proposed feature normalization and selection can significantly contribute to the robustness of recognition in practical application scenarios. We achieved a UAR of 60.9% for the four-way diagnosis classification on the development set, a relative improvement of 16.2% over the official baseline using our single system.
    Full-text · Conference Paper · Oct 2013
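    A minimal sketch, assuming scikit-learn and synthetic data, of the normalization-and-selection pipeline described above: per-feature standardization (only a loose stand-in for feature warping), PCA and LDA for a low-dimensional discriminative representation, and a small feed-forward network as the classifier. All data and dimensions other than the 6,373-dimensional feature vector mentioned in the abstract are illustrative placeholders.
      # Sketch: standardize -> PCA -> LDA -> neural-network classifier for
      # the four-way diagnosis task. Data are synthetic placeholders.
      import numpy as np
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import StandardScaler
      from sklearn.decomposition import PCA
      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
      from sklearn.neural_network import MLPClassifier

      rng = np.random.default_rng(0)
      X = rng.standard_normal((500, 6373))            # static acoustic features
      y = rng.integers(0, 4, 500)                     # four diagnosis classes

      clf = make_pipeline(
          StandardScaler(),
          PCA(n_components=100),                      # discard noisy directions
          LinearDiscriminantAnalysis(n_components=3), # at most n_classes - 1 dims
          MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300),
      )
      clf.fit(X, y)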

Publication Stats

847 Citations
58.60 Total Impact Points

Institutions

  • 2009-2015
    • National Institute of Information and Communications Technology
      • Spoken Language Communication Laboratory
      Tokyo, Japan
  • 2014
    • Academia Sinica
      • Research Center for Information Technology Innovation
      Taipei, Taiwan
  • 2012
    • Kyoto Institute of Technology
      Kyoto, Japan
  • 2005-2007
    • Carnegie Mellon University
      • Language Technologies Institute
      Pittsburgh, Pennsylvania, United States
  • 2003-2004
    • NTT Communication Science Laboratories
      Kyoto, Japan
    • Nippon Telegraph and Telephone
      Tokyo, Japan
  • 2000-2004
    • Tokyo Institute of Technology
      • Graduate School of Science and Engineering
      • Computer Science Department
      Tokyo, Japan
  • 2001
    • Yamagata University
      • Faculty of Engineering
      Yamagata, Japan