Chiori Hori

Academia Sinica, Taipei, Taiwan

Publications (98) · 58.5 Total Impact Points

  • ABSTRACT: Robot utterances generally sound monotonous, unnatural and unfriendly because their Text-to-Speech systems are not optimized for communication but for text reading. Here, we present a non-monologue speech synthesis method for robots. The key novelty lies in speech synthesis based on Hidden Markov models (HMMs) using a non-monologue corpus: we collected a speech corpus in a non-monologue style, in which two professional voice talents read scripted dialogues, and HMMs were then trained with the corpus and used for speech synthesis. We conducted experiments in which the proposed method was evaluated by 24 subjects in three scenarios: text reading, dialogue, and domestic service robot (DSR) scenarios. In the DSR scenario, we used a physical robot and compared our proposed method with a baseline method using the standard Mean Opinion Score criterion. Our experimental results showed that our proposed method’s performance was (1) at the same level as the baseline method in the text-reading scenario and (2) exceeded it in the DSR scenario. We deployed our proposed system as a cloud-based speech synthesis service so that it can be used free of charge.
    Advanced Robotics 03/2015; 29(7):449-456. DOI:10.1080/01691864.2015.1009164
  • Jinfu Ni, Yoshinori Shiga, Chiori Hori
    ABSTRACT: This paper addresses intonation synthesis combining statistical and generative models to manipulate fundamental frequency (F0) contours in the framework of HMM-based speech synthesis. An F0 contour is represented as a superposition of micro, accent, and register components on a logarithmic scale, in light of the Fujisaki model. Three component sets are extracted from a speech corpus by a pitch decomposition algorithm based on a functional F0 model, and a separate context-dependent (CD) HMM is trained for each component. At synthesis time, the CDHMM-generated micro, accent, and register components are superimposed to form F0 contours for the input text. Objective and subjective evaluations are carried out on a Japanese speech corpus. Compared with the conventional approach, this method demonstrates improved naturalness by achieving better local and global F0 behavior, and it exhibits a link between phonology and phonetics, making it possible to flexibly control intonation on the fly using given marking information to manipulate the parameters of the functional F0 model.
    Journal of Signal Processing Systems 01/2015; DOI:10.1007/s11265-015-1011-7
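    A rough illustration of the superposition described above (an editorial sketch, not the authors' code; all component shapes below are invented): micro, accent, and register components are added on the log scale and converted back to Hz.

      import numpy as np

      t = np.linspace(0.0, 2.0, 200)                    # time axis in seconds
      register = np.full_like(t, np.log(120.0))         # slowly varying register (phrase) level
      accent = 0.3 * np.exp(-((t - 0.8) ** 2) / 0.02)   # a single toy accent hump
      micro = 0.02 * np.random.randn(t.size)            # small segmental perturbations

      log_f0 = register + accent + micro                # superposition on the logarithmic scale
      f0_hz = np.exp(log_f0)                            # back to Hz
      print(f0_hz[:5])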
  • ABSTRACT: Among many speaker adaptation embodiments, Speaker Adaptive Training (SAT) has been successfully applied to standard Hidden Markov Model (HMM) speech recognizers whose states are associated with Gaussian Mixture Models (GMMs). On the other hand, recent studies on Speaker-Independent (SI) recognizer development have reported that a new type of HMM speech recognizer, which replaces GMMs with Deep Neural Networks (DNNs), outperforms GMM-HMM recognizers. Along these two lines, it is natural to conceive of further improving a DNN-HMM recognizer by employing SAT. In this paper, we propose a novel training scheme that applies SAT to an SI DNN-HMM recognizer. We implement the SAT scheme by allocating a Speaker-Dependent (SD) module to one of the intermediate layers of a seven-layer DNN, and evaluate its utility on TED Talks corpus data. Experimental results show that our speaker-adapted SAT-based DNN-HMM recognizer reduces the word error rate by 8.4% relative to a baseline SI DNN-HMM recognizer and, regardless of the SD module allocation, outperforms the conventional speaker adaptation scheme. The results also show that the inner layers of the DNN are more suitable for the SD module than the outer layers.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
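    A minimal sketch of the SD-module idea (toy dimensions and random weights, not the paper's implementation): one intermediate layer of an otherwise shared DNN is swapped per speaker.

      import numpy as np

      def relu(x):
          return np.maximum(x, 0.0)

      rng = np.random.default_rng(0)
      dims = [40, 256, 256, 256, 1000]                   # toy layer sizes (input ... HMM-state outputs)
      shared = [rng.standard_normal((i, o)) * 0.01 for i, o in zip(dims[:-1], dims[1:])]
      sd_layer = 1                                       # index of the speaker-dependent layer
      sd_modules = {spk: rng.standard_normal(shared[sd_layer].shape) * 0.01
                    for spk in ["spk_A", "spk_B"]}       # one SD weight matrix per speaker

      def forward(x, speaker):
          h = x
          for i, w in enumerate(shared):
              w_eff = sd_modules[speaker] if i == sd_layer else w   # swap in the SD module
              h = h @ w_eff
              if i < len(shared) - 1:
                  h = relu(h)
          return h                                       # unnormalized state scores

      print(forward(rng.standard_normal(40), "spk_A").shape)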
  • ABSTRACT: Acoustic event detection is an important step for audio content analysis and retrieval. Traditional detection techniques model acoustic events on frame-based spectral features. Considering that the time-frequency structures of acoustic events may span time scales beyond single frames, we propose to represent those structures as a bag of spectral patch exemplars. To learn representative exemplars, k-means clustering based vector quantization (VQ) was applied to the whitened spectral patches, which makes the learned exemplars focus on high-order statistical structure. With the learned spectral exemplars, a sparse feature representation is extracted based on similarity to the learned exemplars. A support vector machine (SVM) classifier was built on the sparse representation for acoustic event detection. Our experimental results showed that the sparse representation based on the patch exemplars significantly improved performance compared with traditional frame-based representations.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
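    A minimal sketch of this pipeline under invented data shapes (not the paper's code): whiten patches, learn exemplars with k-means, encode each clip by similarity to the exemplars, and train an SVM.

      import numpy as np
      from sklearn.cluster import KMeans
      from sklearn.decomposition import PCA
      from sklearn.svm import SVC

      rng = np.random.default_rng(0)
      train_patches = rng.standard_normal((500, 64))     # toy flattened spectral patches
      whitener = PCA(whiten=True).fit(train_patches)     # PCA whitening of the patches
      exemplars = KMeans(n_clusters=32, n_init=10, random_state=0).fit(
          whitener.transform(train_patches))             # k-means VQ exemplars

      def encode(clip_patches):
          # similarity of each patch to each exemplar, max-pooled over the clip
          z = whitener.transform(clip_patches)
          d = np.linalg.norm(z[:, None, :] - exemplars.cluster_centers_[None, :, :], axis=-1)
          return np.exp(-d).max(axis=0)

      X = np.stack([encode(rng.standard_normal((20, 64))) for _ in range(40)])
      y = rng.integers(0, 2, size=40)                    # toy event labels
      print(SVC().fit(X, y).score(X, y))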
  • Youzheng Wu, Xinhu Hu, Chiori Hori
    ABSTRACT: This paper presents our recent progress on translating TED speeches, a collection of public lectures covering a variety of topics. Specifically, we use word-to-word alignment to compose translation units of bilingual tuples and present a recurrent neural network-based translation model (RNNTM) to capture long-span context when estimating translation probabilities of bilingual tuples. However, this RNNTM has a severe data sparsity problem due to the large tuple vocabulary and limited training data. Therefore, a factored RNNTM, which takes the source and target phrases of the tuples as input features in addition to the bilingual tuples themselves, is proposed to partially address the problem. Our experimental results on the IWSLT 2012 test sets show that the proposed models significantly improve translation quality over state-of-the-art phrase-based translation systems.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
  • Chien-Lin Huang, Chiori Hori
    ABSTRACT: This study presents a novel approach to semantic context inference based on term association matrices for spoken document retrieval. Each recognized term in a spoken document infers a semantic vector containing a bag of semantic terms from a term association matrix. Such semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing. We consider both words and syllables in the term association matrices for semantic context inference. Syllable lattice bigrams, rather than single-best speech recognition results, and various term weighting schemes were studied for semantic context inference. Experiments were conducted on Mandarin Chinese broadcast news. The results indicate that the proposed approach offers a significant performance improvement in spoken document retrieval.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
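    A toy sketch of the inference step described above (invented vocabulary and association weights, not the authors' code): a recognized-term vector is expanded and re-weighted through a term association matrix to obtain a semantic indexing vector.

      import numpy as np

      vocab = ["taiwan", "election", "typhoon", "weather", "vote"]
      # toy symmetric association matrix A[i, j]: association strength between terms i and j
      A = np.array([
          [1.0, 0.6, 0.1, 0.1, 0.5],
          [0.6, 1.0, 0.0, 0.0, 0.8],
          [0.1, 0.0, 1.0, 0.7, 0.0],
          [0.1, 0.0, 0.7, 1.0, 0.0],
          [0.5, 0.8, 0.0, 0.0, 1.0],
      ])
      doc_terms = np.array([0.0, 2.0, 0.0, 0.0, 1.0])    # recognized-term weights for one document
      semantic_vec = A @ doc_terms                       # expansion and re-weighting
      semantic_vec /= np.linalg.norm(semantic_vec)       # normalize for cosine-style matching
      print(dict(zip(vocab, semantic_vec.round(2))))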
  • ABSTRACT: An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. The ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of priors are derived: the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP) densities. Since each prior density is able to characterize specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Based on the experimental results on the Aurora-2 task, we verify that MAP-based online mapping function estimation enables ESSEM to achieve better performance than its maximum-likelihood (ML) based counterpart. Moreover, by integrating online environment structure adaptation and mapping function estimation, the proposed MAP-based ESSEM framework provides the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average word error rate reduction of 15.53% (from 5.41% to 4.57%) over the 50 testing conditions at signal-to-noise ratios (SNRs) of 0 to 20 dB across the three standardized testing sets.
    IEEE/ACM Transactions on Audio, Speech, and Language Processing 02/2014; 22(2):403-416. DOI:10.1109/TASLP.2013.2292362
  • ABSTRACT: This paper regards social question-and-answer (Q&A) collections such as Yahoo! Answers as knowledge repositories and investigates techniques to mine knowledge from them to improve sentence-based complex question answering (QA) systems. Specifically, we present a question-type-specific method (QTSM) that extracts question-type-dependent cue expressions from social Q&A pairs in which the question types are the same as that of the submitted question. We compare our approach with the question-specific and monolingual translation-based methods presented in previous works. The question-specific method (QSM) extracts question-dependent answer words from social Q&A pairs in which the questions resemble the submitted question. The monolingual translation-based method (MTM) learns word-to-word translation probabilities from all of the social Q&A pairs without considering the question or its type. Experiments on an extension of the NTCIR 2008 Chinese test data set demonstrate that our models that exploit social Q&A collections are significantly more effective than baseline methods such as LexRank. The performance ranking of these methods is QTSM > {QSM, MTM}. The largest F3 improvements of our proposed QTSM over QSM and MTM reach 6.0% and 5.8%, respectively.
    Computer Speech & Language 01/2014; 29(1). DOI:10.1016/j.csl.2014.06.001
  • ABSTRACT: This paper presents a new system for automatic transcription of lectures. The system combines a number of novel features, including deep neural network acoustic models using multi-level adaptive networks to incorporate out-of-domain information, and factored recurrent neural network language models. We demonstrate that the system achieves large improvements on the TED lecture transcription task from the 2012 IWSLT evaluation; our results are currently the best reported on this task, showing a relative WER reduction of more than 16% compared to the closest competing system from the evaluation.
    Proc. Interspeech; 08/2013
  • ABSTRACT: This study presents an overview of VoiceTra, which was developed by NICT and released as the world's first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, multilingual translation, and multilingual speech synthesis with regard to field experiments. We show the effects of system updates that used the data collected from field experiments to improve our acoustic and language models.
    Proceedings of the 2013 IEEE 14th International Conference on Mobile Data Management - Volume 02; 06/2013
  • ABSTRACT: Noise reduction algorithms are widely used to mitigate noise effects on speech and to improve the robustness of speech technology applications. However, they inevitably cause speech distortion. The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. This study proposes a novel framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a function approximation problem in a regularized reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function by controlling the tradeoff between approximation accuracy and function complexity. For noisy observations, this is equivalent to controlling the tradeoff between noise reduction and speech distortion. Since the target function is approximated in an RKHS, either a linear or a nonlinear mapping function can be naturally incorporated in the estimation via the “kernel trick”. Traditional signal subspace and Wiener filtering based noise reduction can be derived as special cases of this framework when a linear kernel function is applied. We first provided a theoretical analysis of the tradeoff property of the framework in noise reduction. Then we applied our proposed noise reduction method in speech enhancement and noise-robust speech recognition experiments. Compared to several classical noise reduction methods, our proposed method showed promising advantages.
    IEEE Transactions on Signal Processing 02/2013; 61(3):601-610. DOI:10.1109/TSP.2012.2229991
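    A minimal sketch of regularized estimation in an RKHS, assuming paired noisy/clean training frames (an illustration only, not the paper's exact formulation): kernel ridge regression, where the regularization weight alpha controls the accuracy/complexity tradeoff discussed above.

      import numpy as np
      from sklearn.kernel_ridge import KernelRidge

      rng = np.random.default_rng(0)
      clean = rng.standard_normal((200, 16))             # toy clean feature frames
      noisy = clean + 0.5 * rng.standard_normal(clean.shape)

      # alpha trades approximation accuracy against function complexity
      estimator = KernelRidge(alpha=1.0, kernel="rbf", gamma=0.05).fit(noisy, clean)
      denoised = estimator.predict(noisy)
      print("MSE after:", np.mean((denoised - clean) ** 2),
            "MSE before:", np.mean((noisy - clean) ** 2))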
  • ABSTRACT: This paper outlines the first Asian network-based speech-to-speech translation system, developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies through Web servers: automatic speech recognition, machine translation, and text-to-speech. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues of developing spoken language technologies for Asian languages and discuss the difficulties involved in connecting heterogeneous spoken language translation systems through Web servers. The paper also presents speech-translation results, including a subjective evaluation, from the first A-STAR field test, which was carried out in July 2009.
    Computer Speech & Language 02/2013; 27(2). DOI:10.1016/j.csl.2011.07.001
  • ABSTRACT: This study presents a noise-robust front-end post-processing technology. After cepstral feature analysis, feature normalization is usually applied for noise reduction in spoken language recognition. We investigate a highly effective MVAW processing chain, based on standard MFCC and SDC features, on the NIST LRE 2007 tasks. The procedure includes mean subtraction, variance normalization, autoregressive moving-average (ARMA) filtering, and feature warping. Experiments were conducted on a common GMM-UBM system. The results indicate significant improvements in recognition accuracy.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
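    A rough sketch of an MVAW-style post-processing chain (assumed window sizes, with a plain moving average standing in for the ARMA filter): mean subtraction, variance normalization, smoothing, and feature warping to a Gaussian target.

      import numpy as np
      from scipy.stats import norm, rankdata

      def mvaw(features, half_win=2):
          x = features - features.mean(axis=0)           # M: mean subtraction
          x = x / (x.std(axis=0) + 1e-8)                 # V: variance normalization
          kernel = np.ones(2 * half_win + 1) / (2 * half_win + 1)
          x = np.stack([np.convolve(x[:, d], kernel, mode="same")
                        for d in range(x.shape[1])], axis=1)   # A: moving-average smoothing
          ranks = np.apply_along_axis(rankdata, 0, x)    # W: warp each dimension's ranks
          return norm.ppf((ranks - 0.5) / x.shape[0])    #    onto a standard normal

      feats = np.random.default_rng(0).standard_normal((300, 13))   # toy cepstral frames
      print(mvaw(feats).shape)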
  • ABSTRACT: The maximum a posteriori (MAP) criterion is popularly used for feature compensation (FC) and acoustic model adaptation (MA) to reduce the mismatch between training and testing data sets. MAP-based FC and MA require prior densities of mapping function parameters, and designing suitable prior densities plays an important role in obtaining satisfactory performance. In this paper, we propose to use an environment structuring framework to provide suitable prior densities for facilitating MAP-based FC and MA for robust speech recognition. The framework is constructed in a two-stage hierarchical tree structure using environment clustering and partitioning processes. The constructed framework is highly capable of characterizing local information about complex speaker and speaking acoustic conditions. The local information was utilized to specify hyper-parameters in prior densities, which were then used in MAP-based FC and MA to handle the mismatch issue. We evaluated the proposed framework on Aurora-2, a connected digit recognition task, and Aurora-4, a large vocabulary continuous speech recognition (LVCSR) task. On both tasks, experimental results showed that with the prepared environment structuring framework, we could obtain suitable prior densities for enhancing the performance of MAP-based FC and MA.
    Computer Speech & Language 01/2013; 28(3). DOI:10.1016/j.csl.2013.11.005
  • ABSTRACT: This study presents a novel approach to spoken document retrieval based on neural probabilistic language modeling for semantic inference. The neural network based language model is applied to estimate word associations in a continuous space. Different weighting schemes are investigated to represent the recognized words of a spoken document as an indexing vector. The indexing vector is then transformed into a semantic indexing vector through the neural probabilistic language model. Such semantic word inference and re-weighting make the semantic indexing vector a suitable representation for speech indexing. Experimental results on Mandarin Chinese broadcast news show that the proposed approach achieves a substantial and consistent improvement in spoken document retrieval.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • E. Mizukami, T. Misu, C. Hori
    ABSTRACT: We have proposed WFSTDM, an expandable and adaptable dialogue management platform based on weighted finite-state transducers (WFSTs). WFSTDM combines various WFSTs and enables us to develop the new dialogue management WFSTs necessary for rapid prototyping of spoken dialogue systems. In this paper, we outline WFSTDM and introduce the WFSTDM builder, a network-based spoken dialogue system development tool. In addition, we describe in detail AssisTra, a spoken dialogue system for iPhone that we developed as an example of implementing WFSTDM on smartphones. We also discuss spoken dialogue systems on smartphones as tools for collecting field data.
    Mobile Data Management (MDM), 2013 IEEE 14th International Conference on; 01/2013
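    A toy, unweighted sketch of finite-state dialogue management in the spirit described above (not the WFSTDM implementation; states, intents, and prompts are invented): each arc consumes a recognized user intent and emits a system action.

      transitions = {
          ("start", "greet"): ("ask_destination", "Where would you like to go?"),
          ("ask_destination", "give_place"): ("confirm", "Shall I search for routes there?"),
          ("confirm", "yes"): ("done", "Here are the routes."),
          ("confirm", "no"): ("ask_destination", "Where would you like to go instead?"),
      }

      state = "start"
      for intent in ["greet", "give_place", "yes"]:      # toy recognized intents
          state, action = transitions[(state, intent)]
          print(f"intent={intent!r} -> state={state!r}, system: {action}")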
  • ABSTRACT: This paper presents a joint analysis approach to acoustic feature normalization for robust speech recognition. Variations in acoustic environments and speakers are major challenges for speech recognition. Conventionally, these two variations are normalized separately: speaker normalization is applied under the assumption of a noise-free condition, and noise compensation is applied under the assumption of speaker independence, resulting in suboptimal performance. The proposed joint analysis approach simultaneously considers vocal tract length normalization and the averaged temporal information of cepstral features. In a data-driven manner, a Gaussian mixture model is used to estimate the conditional parameters in the joint analysis. Experimental results show that the proposed approach achieves a substantial improvement.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: Speaker clustering is widely adopted to group speech data by acoustic characteristics so that unsupervised speaker normalization and speaker adaptive training can be applied for better speech recognition performance. In this study, we present a vector space speaker clustering approach with long-term feature analysis. A supervector based on the GMM mean vectors is adopted to represent the characteristics of speakers. To achieve a robust representation, total variability subspace modeling, which has been successfully applied in speaker recognition to compensate for channel and session variability over the GMM mean supervector, is used for speaker clustering. We apply a long-term feature analysis strategy that averages short-time spectral features over a period of time to capture speaker traits that are manifested over speech segments longer than a spectral frame. Experiments conducted on lecture-style speech show that this speaker clustering approach yields better speech recognition performance.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
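    An illustrative sketch only (invented data): each segment is represented by long-term averaged features, a simple low-dimensional projection stands in for total variability modeling (which is considerably more involved), and the segments are then clustered into speaker groups.

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.cluster import KMeans

      rng = np.random.default_rng(0)
      # toy data: 60 segments, each a (frames x dims) matrix of short-time spectral features
      segments = [rng.standard_normal((int(rng.integers(100, 300)), 20)) + rng.standard_normal(20)
                  for _ in range(60)]
      long_term = np.stack([seg.mean(axis=0) for seg in segments])   # long-term feature averaging
      embeddings = PCA(n_components=10).fit_transform(long_term)     # stand-in for total variability projection
      clusters = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
      print(np.bincount(clusters))                                   # segments per speaker cluster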
  • Chien-Lin Huang, C. Hori
    ABSTRACT: This paper presents deep neural networks for classifying children with voice impairments from speech signals. In the analysis of speech signals, 6,373 static acoustic features are extracted from many kinds of low-level descriptors and functionals. To reduce the variability of the extracted features, two-dimensional normalization is applied to smooth the inter-speaker and inter-feature mismatch using the feature warping approach. Feature selection is then used to explore a discriminative and low-dimensional representation based on principal component analysis and linear discriminant analysis. In this representation, robust features are obtained by eliminating noise features via subspace projection. Finally, deep neural networks are adopted to classify the children with voice impairments. We conclude that deep neural networks with the proposed feature normalization and selection can significantly contribute to the robustness of recognition in practical application scenarios. We achieved an unweighted average recall (UAR) of 60.9% for the four-way diagnosis classification on the development set, a relative improvement of 16.2% over the official baseline using our single system.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
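    A minimal sketch of this processing chain (invented data, a quantile transform approximating feature warping, and a small MLP standing in for the paper's deep neural network):

      import numpy as np
      from sklearn.pipeline import make_pipeline
      from sklearn.preprocessing import QuantileTransformer
      from sklearn.decomposition import PCA
      from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
      from sklearn.neural_network import MLPClassifier

      rng = np.random.default_rng(0)
      X = rng.standard_normal((120, 300))                # toy stand-in for the 6,373 functionals
      y = rng.integers(0, 4, size=120)                   # four-way diagnosis labels (toy)

      pipeline = make_pipeline(
          QuantileTransformer(output_distribution="normal", n_quantiles=100),  # warping-like normalization
          PCA(n_components=30),                          # discard low-variance directions
          LinearDiscriminantAnalysis(n_components=3),    # at most n_classes - 1 components
          MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
      )
      print(pipeline.fit(X, y).score(X, y))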
  • Proceedings of COLING 2012; 12/2012

Publication Stats

738 Citations
58.50 Total Impact Points

Institutions

  • 2014
    • Academia Sinica
      • Research Center for Information Technology Innovation
      Taipei, Taiwan
  • 2010–2014
    • National Institute of Information and Communications Technology
      • Spoken Language Communication Laboratory
      Tokyo, Japan
  • 2003–2004
    • NTT Communication Science Laboratories
      Kyoto, Japan
    • Nippon Telegraph and Telephone
      Tokyo, Japan
  • 2000–2003
    • Tokyo Institute of Technology
      • Computer Science Department
      Tokyo, Japan
  • 2001
    • Yamagata University
      • Faculty of Engineering
      Yamagata, Japan