Chiori Hori

Academia Sinica, Taipei, Taiwan

Publications (96) · 57.41 Total Impact Points

  • ABSTRACT: Among many speaker adaptation embodiments, Speaker Adaptive Training (SAT) has been successfully applied to standard Hidden Markov Model (HMM) speech recognizers whose states are associated with Gaussian Mixture Models (GMMs). On the other hand, recent studies on Speaker-Independent (SI) recognizer development have reported that a new type of HMM speech recognizer, which replaces GMMs with Deep Neural Networks (DNNs), outperforms GMM-HMM recognizers. Along these two lines, it is natural to conceive of further improving a preset DNN-HMM recognizer by employing SAT. In this paper, we propose a novel training scheme that applies SAT to an SI DNN-HMM recognizer. We then implement the SAT scheme by allocating a Speaker-Dependent (SD) module to one of the intermediate layers of a seven-layer DNN, and evaluate its utility on TED Talks corpus data. Experimental results show that our speaker-adapted SAT-based DNN-HMM recognizer reduces the word error rate by 8.4% compared with a baseline SI DNN-HMM recognizer and, regardless of the SD module allocation, outperforms the conventional speaker adaptation scheme. The results also show that the inner layers of the DNN are more suitable for the SD module than the outer layers.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
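A minimal sketch of the idea in the entry above, not the authors' implementation: a feed-forward DNN acoustic model in which one intermediate layer serves as the speaker-dependent (SD) module, so that adaptation updates only that layer for a target speaker while the shared layers stay frozen. The layer sizes, activation choice, and use of PyTorch are assumptions.

```python
import torch
import torch.nn as nn

class SATDNN(nn.Module):
    """Seven-layer DNN acoustic model with one speaker-dependent (SD) layer."""
    def __init__(self, feat_dim=440, hidden_dim=512, num_states=3000, sd_layer_index=3):
        super().__init__()
        dims = [feat_dim] + [hidden_dim] * 6 + [num_states]
        self.layers = nn.ModuleList(
            nn.Linear(dims[i], dims[i + 1]) for i in range(7)
        )
        self.sd_layer_index = sd_layer_index  # which intermediate layer acts as the SD module

    def forward(self, x):
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i < len(self.layers) - 1:
                x = torch.sigmoid(x)
        return x  # HMM-state scores; a softmax is applied in the training loss

def adapt_to_speaker(model: SATDNN):
    """Freeze all shared layers; only the SD layer is updated with the target speaker's data."""
    for i, layer in enumerate(model.layers):
        requires_grad = (i == model.sd_layer_index)
        for p in layer.parameters():
            p.requires_grad = requires_grad
    sd_params = model.layers[model.sd_layer_index].parameters()
    return torch.optim.SGD(sd_params, lr=1e-3)
```

Moving `sd_layer_index` toward the middle of the network mirrors the abstract's observation that inner layers suit the SD module better than outer ones.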
  • ABSTRACT: Acoustic event detection is an important step for audio content analysis and retrieval. Traditional detection techniques model acoustic events on frame-based spectral features. Considering that the temporal-frequency structures of acoustic events may be distributed over time scales beyond single frames, we propose to represent those structures as a bag of spectral patch exemplars. In order to learn representative exemplars, k-means clustering based vector quantization (VQ) is applied to whitened spectral patches, which makes the learned exemplars focus on high-order statistical structure. With the learned spectral exemplars, a sparse feature representation is extracted based on similarity to the learned exemplars. A support vector machine (SVM) classifier is built on the sparse representation for acoustic event detection. Our experimental results show that the sparse representation based on patch exemplars significantly improves performance compared with traditional frame-based representations.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
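An illustrative sketch of that pipeline under assumed details (the patch width, number of exemplars, similarity function, and use of scikit-learn are my choices, not the paper's): PCA-whitened spectrogram patches are clustered with k-means to learn exemplars, each patch is encoded by its similarity to the exemplars, and an SVM is trained on the pooled encoding.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def extract_patches(spectrogram, patch_width=8):
    """Slice a (freq_bins, frames) spectrogram into flattened patches of patch_width frames."""
    n = spectrogram.shape[1] - patch_width + 1
    return np.stack([spectrogram[:, t:t + patch_width].ravel() for t in range(n)])

def learn_exemplars(train_patches, n_exemplars=256):
    whitener = PCA(whiten=True).fit(train_patches)   # whitening emphasizes higher-order structure
    kmeans = KMeans(n_clusters=n_exemplars, n_init=10).fit(whitener.transform(train_patches))
    return whitener, kmeans

def encode(spectrogram, whitener, kmeans):
    """Bag-of-exemplars encoding: similarity of each patch to every exemplar, max-pooled over time."""
    patches = whitener.transform(extract_patches(spectrogram))
    dists = kmeans.transform(patches)                 # (n_patches, n_exemplars) distances
    sims = np.exp(-dists)                             # convert distances to similarities
    return sims.max(axis=0)                           # one vector per recording

# Clip-level training (sketch):
# X = np.stack([encode(s, whitener, kmeans) for s in spectrograms]); clf = SVC().fit(X, labels)
```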
  • Youzheng Wu, Xinhu Hu, Chiori Hori
    ABSTRACT: This paper presents our recent progress on translating TED speeches, a collection of public lectures covering a variety of topics. Specifically, we use word-to-word alignment to compose translation units of bilingual tuples and present a recurrent neural network-based translation model (RNNTM) to capture long-span context when estimating translation probabilities of bilingual tuples. However, this RNNTM suffers from a severe data sparsity problem due to the large tuple vocabulary and limited training data. Therefore, a factored RNNTM, which takes the source and target phrases of the tuples as input features in addition to the bilingual tuples themselves, is proposed to partially address the problem. Our experimental results on the IWSLT 2012 test sets show that the proposed models significantly improve translation quality over state-of-the-art phrase-based translation systems.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
  • Chien-Lin Huang, Chiori Hori
    ABSTRACT: This study presents a novel approach to semantic context inference based on term association matrices for spoken document retrieval. Each recognized term in a spoken document infers a semantic vector containing a bag of semantic terms from a term association matrix. Such semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing. We consider both words and syllables in the term association matrices for semantic context inference. Syllable lattice bigrams, rather than single-best speech recognition results, and various term weighting schemes are studied for semantic context inference. Experiments were conducted on Mandarin Chinese broadcast news. The results indicate that the proposed approach offers a significant performance improvement in spoken document retrieval.
    ICASSP 2014 - 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP); 05/2014
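A toy sketch of the expansion step described in the entry above; the co-occurrence-based association matrix and the cosine ranking are plausible stand-ins chosen by me, not the paper's exact weighting scheme. A document's term vector is multiplied by a term association matrix to produce an expanded, re-weighted semantic indexing vector, which is then compared with queries by cosine similarity.

```python
import numpy as np

def term_association_matrix(doc_term_counts):
    """Build a (vocab x vocab) association matrix from row-normalized term co-occurrence counts."""
    cooc = doc_term_counts.T @ doc_term_counts            # co-occurrence across documents
    norms = np.linalg.norm(cooc, axis=1, keepdims=True) + 1e-12
    return cooc / norms

def semantic_index(term_vector, assoc):
    """Expand a document's term vector into a semantic context inference vector."""
    expanded = assoc @ term_vector                         # each term "infers" related terms
    return expanded / (np.linalg.norm(expanded) + 1e-12)

def retrieve(query_vec, doc_term_matrix, assoc):
    """Rank documents by cosine similarity between semantic indexing vectors."""
    q = semantic_index(query_vec, assoc)
    docs = np.stack([semantic_index(d, assoc) for d in doc_term_matrix])
    return np.argsort(-docs @ q)

# doc_term_matrix: (n_docs, vocab) term-frequency matrix built from ASR output
```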
  • ABSTRACT: An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. The ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of priors are derived: the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP) densities. Since each prior density is able to characterize specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Based on the experimental results on the Aurora-2 task, we verify that MAP-based online mapping function estimation enables ESSEM to achieve better performance than its maximum-likelihood (ML) based counterpart. Moreover, by integrating the online environment structure adaptation and mapping function estimation, the proposed MAP-based ESSEM framework provides the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average word error rate reduction of 15.53% (from 5.41% to 4.57%) over 50 testing conditions at signal-to-noise ratios (SNRs) of 0 to 20 dB across the three standardized testing sets.
    02/2014; 22(2):403-416. DOI:10.1109/TASLP.2013.2292362
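For reference, the general shape of a MAP point estimate that such methods build on. This is the textbook criterion and the standard MAP update of a Gaussian mean (interpolating a prior mean with observed data), not the paper's specific mapping-function estimator:

```latex
\hat{\theta}_{\mathrm{MAP}} = \arg\max_{\theta}\; p(X \mid \theta)\, p(\theta),
\qquad
\hat{\mu} = \frac{\tau\,\mu_0 + \sum_{t=1}^{T}\gamma_t\, x_t}{\tau + \sum_{t=1}^{T}\gamma_t},
```

where \(\mu_0\) is the prior mean, \(\tau\) controls the weight of the prior relative to the data, and \(\gamma_t\) is the occupancy of frame \(x_t\). The clustered, sequential, and hierarchical priors in the entry above differ in how \(\mu_0\) and \(\tau\) (the hyper-parameters) are specified.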
  • ABSTRACT: This paper regards social question-and-answer (Q&A) collections such as Yahoo! Answers as knowledge repositories and investigates techniques to mine knowledge from them to improve sentence-based complex question answering (QA) systems. Specifically, we present a question-type-specific method (QTSM) that extracts question-type-dependent cue expressions from social Q&A pairs in which the question types are the same as the submitted questions. We compare our approach with the question-specific and monolingual translation-based methods presented in previous works. The question-specific method (QSM) extracts question-dependent answer words from social Q&A pairs in which the questions resemble the submitted question. The monolingual translation-based method (MTM) learns word-to-word translation probabilities from all of the social Q&A pairs without considering the question or its type. Experiments on the extension of the NTCIR 2008 Chinese test data set demonstrate that our models that exploit social Q&A collections are significantly more effective than baseline methods such as LexRank. The performance ranking of these methods is QTSM > {QSM, MTM}. The largest F3 improvements of our proposed QTSM over QSM and MTM reach 6.0% and 5.8%, respectively.
    Computer Speech & Language 01/2014; DOI:10.1016/j.csl.2014.06.001 · 1.81 Impact Factor
  • ABSTRACT: This paper presents a new system for automatic transcription of lectures. The system combines a number of novel features, including deep neural network acoustic models using multi-level adaptive networks to incorporate out-of-domain information, and factored recurrent neural network language models. We demonstrate that the system achieves large improvements on the TED lecture transcription task from the 2012 IWSLT evaluation; our results are currently the best reported on this task, showing a relative WER reduction of more than 16% compared to the closest competing system from the evaluation.
    Proc. Interspeech; 08/2013
  • ABSTRACT: This study presents an overview of VoiceTra, which was developed by NICT and released as the world's first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, multilingual translation, and multilingual speech synthesis in the context of field experiments. We show the effects of system updates, using the data collected from field experiments to improve our acoustic and language models.
    Proceedings of the 2013 IEEE 14th International Conference on Mobile Data Management - Volume 02; 06/2013
  • ABSTRACT: Noise reduction algorithms are widely used to mitigate noise effects on speech and improve the robustness of speech technology applications. However, they inevitably cause speech distortion. The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. This study proposes a novel framework for noise reduction that considers this tradeoff. We regard speech estimation as a function approximation problem in a regularized reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function by controlling the tradeoff between approximation accuracy and function complexity. For noisy observations, this is equivalent to controlling the tradeoff between noise reduction and speech distortion. Since the target function is approximated in an RKHS, either a linear or nonlinear mapping function can be naturally incorporated in the estimation via the “kernel trick”. Traditional signal subspace and Wiener filtering based noise reduction can be derived as special cases when a linear kernel function is applied in this framework. We first provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We then apply the proposed noise reduction method in speech enhancement and noise-robust speech recognition experiments. Compared to several classical noise reduction methods, our proposed method showed promising advantages.
    IEEE Transactions on Signal Processing 02/2013; 61(3):601-610. DOI:10.1109/TSP.2012.2229991 · 3.20 Impact Factor
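A compact illustration of the regularized RKHS estimation idea in the entry above, in the generic form of kernel ridge regression; the Gaussian kernel and regularization weight are my assumptions, and this is not the paper's exact estimator. The regularizer `lam` controls the balance between fitting the noisy observation (less noise reduction) and keeping the estimated function smooth (less distortion of what is kept).

```python
import numpy as np

def gaussian_kernel(A, B, gamma=0.1):
    """K(a, b) = exp(-gamma * ||a - b||^2); a linear kernel instead recovers Wiener-like solutions."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam=1.0, gamma=0.1):
    """Solve alpha = (K + lam*I)^-1 y; lam trades approximation accuracy against function complexity."""
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(X_train, alpha, X_new, gamma=0.1):
    return gaussian_kernel(X_new, X_train, gamma) @ alpha

# Example usage (sketch): X holds noisy context features per frame, y the target values to estimate.
```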
  • ABSTRACT: This paper outlines the first Asian network-based speech-to-speech translation system developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages, i.e., eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies: automatic speech recognition, machine translation, and text-to-speech, through Web servers. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues of developing spoken language technologies for Asian languages, and discuss the difficulties involved in connecting heterogeneous spoken language translation systems through Web servers. This paper also presents speech translation results, including subjective evaluations, from the first A-STAR field test, which was carried out in July 2009.
    Computer Speech & Language 02/2013; 27(2). DOI:10.1016/j.csl.2011.07.001 · 1.81 Impact Factor
  • ABSTRACT: The maximum a posteriori (MAP) criterion is widely used for feature compensation (FC) and acoustic model adaptation (MA) to reduce the mismatch between training and testing data sets. MAP-based FC and MA require prior densities of mapping function parameters, and designing suitable prior densities plays an important role in obtaining satisfactory performance. In this paper, we propose an environment structuring framework to provide suitable prior densities for facilitating MAP-based FC and MA for robust speech recognition. The framework is constructed as a two-stage hierarchical tree structure using environment clustering and partitioning processes. The constructed framework is highly capable of characterizing local information about complex speaker and speaking acoustic conditions. This local information is utilized to specify hyper-parameters in prior densities, which are then used in MAP-based FC and MA to handle the mismatch issue. We evaluated the proposed framework on Aurora-2, a connected digit recognition task, and Aurora-4, a large vocabulary continuous speech recognition (LVCSR) task. On both tasks, experimental results showed that with the prepared environment structuring framework, we could obtain suitable prior densities for enhancing the performance of MAP-based FC and MA.
    Computer Speech & Language 01/2013; DOI:10.1016/j.csl.2013.11.005 · 1.81 Impact Factor
  • ABSTRACT: This study presents a noise-robust front-end post-processing technique. After cepstral feature analysis, feature normalization is usually applied for noise reduction in spoken language recognition. We investigate highly effective MVAW processing based on standard MFCC and SDC features on NIST-LRE 2007 tasks. The procedure includes mean subtraction, variance normalization, auto-regressive moving-average (ARMA) filtering, and feature warping. Experiments were conducted on a common GMM-UBM system. The results indicate significant improvements in recognition accuracy.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
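A rough sketch of that post-processing chain as described in the entry above; the window length, smoothing filter, and the choice of a standard normal warping target are assumptions rather than the paper's exact settings. Per-utterance mean subtraction and variance normalization are followed by ARMA-style smoothing along time and rank-based warping of each feature dimension to a Gaussian.

```python
import numpy as np
from scipy.stats import norm, rankdata

def mvaw(features, arma_order=2):
    """features: (n_frames, dim) cepstral features (e.g., MFCC + SDC). Returns MVAW-processed features."""
    # M + V: per-utterance mean subtraction and variance normalization
    x = (features - features.mean(0)) / (features.std(0) + 1e-12)

    # A: ARMA-style smoothing along the time axis (moving average over 2*arma_order+1 frames)
    kernel = np.ones(2 * arma_order + 1) / (2 * arma_order + 1)
    x = np.stack([np.convolve(x[:, d], kernel, mode="same") for d in range(x.shape[1])], axis=1)

    # W: feature warping - map the empirical ranks of each dimension onto a standard normal
    ranks = np.apply_along_axis(rankdata, 0, x)
    warped = norm.ppf((ranks - 0.5) / x.shape[0])
    return warped
```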
  • ABSTRACT: This study presents a novel approach to spoken document retrieval based on neural probabilistic language modeling for semantic inference. The neural network based language model is applied to estimate word associations in a continuous space. Different weighting schemes are investigated to represent the recognized words of a spoken document as an indexing vector. The indexing vector is transformed into a semantic indexing vector through the neural probabilistic language model. Such semantic word inference and re-weighting make the semantic indexing vector a suitable representation for speech indexing. Experimental results on Mandarin Chinese broadcast news show that the proposed approach achieves a substantial and consistent improvement in spoken document retrieval.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • E. Mizukami, T. Misu, C. Hori
    ABSTRACT: We propose WFSTDM, an expandable and adaptable dialogue management platform based on weighted finite-state transducers (WFSTs). WFSTDM combines various WFSTs and enables us to develop new dialogue management WFSTs for rapid prototyping of spoken dialogue systems. In this paper, we outline WFSTDM and introduce the WFSTDM builder, a network-based spoken dialogue system development tool. In addition, we describe in detail AssisTra, a spoken dialogue system for the iPhone that we developed as an example of implementing WFSTDM on smartphones. We also discuss spoken dialogue systems on smartphones as tools for collecting field data.
    Mobile Data Management (MDM), 2013 IEEE 14th International Conference on; 01/2013
  • ABSTRACT: This paper presents a joint analysis approach to acoustic feature normalization for robust speech recognition. Variation in acoustic environments and speakers is a major challenge for speech recognition. Conventionally, these two sources of variation are normalized separately: speaker normalization assumes a noise-free condition, while noise compensation assumes speaker independence, resulting in suboptimal performance. The proposed joint analysis approach simultaneously considers vocal tract length normalization and averaged temporal information of cepstral features. In a data-driven manner, a Gaussian mixture model is used to estimate the conditional parameters in the joint analysis. Experimental results show that the proposed approach achieves a substantial improvement.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: Speaker clustering is widely adopted to group speech data by acoustic characteristics so that unsupervised speaker normalization and speaker adaptive training can be applied for better speech recognition performance. In this study, we present a vector space speaker clustering approach with long-term feature analysis. A supervector based on the GMM mean vectors is adopted to represent the characteristics of speakers. To achieve a robust representation, total variability subspace modeling, which has been successfully applied in speaker recognition to compensate for channel and session variability over the GMM mean supervector, is used for speaker clustering. We apply a long-term feature analysis strategy that averages short-time spectral features over a period of time to capture speaker traits manifested over speech segments longer than a spectral frame. Experiments conducted on lecture-style speech show that this speaker clustering approach offers better speech recognition performance.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
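A simplified sketch of the clustering pipeline in the entry above, for structure only: PCA stands in for a proper total variability (i-vector) extractor, and the supervector is a crude posterior-weighted mean rather than a MAP-adapted UBM supervector, so this is an illustration of the flow rather than the paper's method.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def supervector(utterance_feats, ubm: GaussianMixture):
    """Stack per-component means of frames weighted by UBM posteriors into one long vector."""
    post = ubm.predict_proba(utterance_feats)               # (frames, components)
    counts = post.sum(0, keepdims=True).T + 1e-6            # (components, 1)
    means = (post.T @ utterance_feats) / counts              # (components, dim)
    return means.ravel()

def cluster_speakers(utterances, n_components=64, subspace_dim=50, n_clusters=10):
    """utterances: list of (n_frames, dim) feature arrays. Returns a cluster label per utterance."""
    ubm = GaussianMixture(n_components=n_components).fit(np.vstack(utterances))
    sv = np.stack([supervector(u, ubm) for u in utterances])
    tv = PCA(n_components=subspace_dim).fit_transform(sv)    # stand-in for total variability modeling
    return KMeans(n_clusters=n_clusters, n_init=10).fit_predict(tv)
```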
  • Chien-Lin Huang, C. Hori
    ABSTRACT: This paper presents a deep neural network approach to classifying children with voice impairments from speech signals. In the analysis of the speech signals, 6,373 static acoustic features are extracted from many kinds of low-level descriptors and functionals. To reduce the variability of the extracted features, a two-dimensional normalization is applied to smooth the inter-speaker and inter-feature mismatch using the feature warping approach. Then, feature selection is used to explore a discriminative and low-dimensional representation based on principal component analysis and linear discriminant analysis. In this representation, robust features are obtained by eliminating noise features via subspace projection. Finally, deep neural networks are adopted to classify the children with voice impairments. We conclude that deep neural networks with the proposed feature normalization and selection can significantly contribute to the robustness of recognition in practical application scenarios. We achieved a UAR of 60.9% for the four-way diagnosis classification on the development set, a relative improvement of 16.2% over the official baseline using our single system.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
  • Proceedings of COLING 2012; 12/2012
  • ABSTRACT: The performance of English automatic speech recognition systems decreases when recognizing spontaneous speech, mainly due to multiple pronunciation variants in the utterances. Previous approaches address this problem by modeling the alteration of the pronunciation at a phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this article, sequence-based pronunciation variation is modeled using a noisy channel approach, where the spontaneous phoneme sequence is considered a “noisy” string and the goal is to recover the “clean” string of the word sequence. Thereby, the whole word sequence and its effect on the alternation of the phonemes are taken into consideration. Moreover, the system not only learns the phoneme transformation but also the mapping from phonemes to words directly. In this study, the phonemes are first recognized with the existing recognition system, and the pronunciation variation model based on the noisy channel approach then maps from the phoneme to the word level. Two well-known natural language processing approaches are adopted and derived from noisy channel model theory: joint-sequence models and statistical machine translation. Both are applied, and various experiments are conducted using microphone and telephone recordings of spontaneous speech.
    IEICE Transactions on Information and Systems 08/2012; E95.D(8):2084-2093. DOI:10.1587/transinf.E95.D.2084 · 0.22 Impact Factor
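The noisy channel decomposition underlying that formulation, written out for clarity (this is the standard Bayes decomposition; the notation is mine):

```latex
\hat{W} = \arg\max_{W} P(W \mid \Phi)
        = \arg\max_{W} P(\Phi \mid W)\, P(W),
```

where \(\Phi\) is the recognized (spontaneous) phoneme sequence, \(P(\Phi \mid W)\) is the pronunciation variation (channel) model, and \(P(W)\) is the language model over word sequences. The joint-sequence and statistical machine translation approaches in the entry above are two different ways of realizing \(P(\Phi \mid W)\) and the phoneme-to-word mapping.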
  • ABSTRACT: Developing a multilingual speech translation system requires efforts in constructing automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components for all possible source and target languages. If the numerous ASR, MT, and TTS systems for different language pairs developed independently in different parts of the world could be connected, multilingual speech translation systems for a multitude of language pairs could be achieved. Yet, there is currently no common, flexible framework that can provide an entire speech translation process by bringing together heterogeneous speech translation components. In this article we therefore propose a distributed architecture framework for multilingual speech translation in which all speech translation components are provided on distributed servers and cooperate over a network. This framework can facilitate the connection of different components and functions. To show the overall mechanism, we first present our state-of-the-art technologies for multilingual ASR, MT, and TTS components, and then describe how to combine those systems into the proposed network-based framework. The client applications are implemented on a handheld mobile terminal device, and all data exchanges among client users and spoken language technology servers are managed through a Web protocol. To support multiparty communication, an additional communication server is provided for simultaneously distributing the speech translation results from one user to multiple users. Field testing shows that the system is capable of realizing multiparty multilingual speech translation for real-time and location-independent communication.
    ACM Transactions on Speech and Language Processing 07/2012; DOI:10.1145/2287710.2287712
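A purely illustrative sketch of how a client might chain distributed ASR, MT, and TTS servers as described in the entry above. The endpoint URLs, parameter names, and JSON fields below are hypothetical placeholders, not the actual A-STAR or NICT interfaces.

```python
import requests  # assumes the ASR, MT, and TTS components are exposed as simple HTTP/JSON services

# Hypothetical server locations; the real system routes requests through Web servers per component.
ASR_URL = "http://asr.example.org/recognize"
MT_URL = "http://mt.example.org/translate"
TTS_URL = "http://tts.example.org/synthesize"

def speech_translate(wav_bytes, src_lang, tgt_lang):
    """Chain ASR -> MT -> TTS across distributed servers (all field names are placeholders)."""
    text = requests.post(ASR_URL, files={"audio": wav_bytes},
                         data={"lang": src_lang}).json()["text"]
    translation = requests.post(MT_URL, json={"text": text, "src": src_lang,
                                              "tgt": tgt_lang}).json()["text"]
    audio = requests.post(TTS_URL, json={"text": translation, "lang": tgt_lang}).content
    return text, translation, audio
```

In the article's framework, the same chaining is coordinated over a Web protocol, with an additional communication server fanning results out to multiple users for multiparty conversations.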

Publication Stats

731 Citations
57.41 Total Impact Points

Institutions

  • 2014
    • Academia Sinica
      • Research Center for Information Technology Innovation
      Taipei, Taiwan
  • 2010–2014
    • National Institute of Information and Communications Technology
      • Spoken Language Communication Laboratory
      Tokyo, Japan
  • 2003–2004
    • NTT Communication Science Laboratories
      Kyoto, Japan
    • Nippon Telegraph and Telephone
      Tokyo, Japan
  • 2000–2003
    • Tokyo Institute of Technology
      • Computer Science Department
      Tokyo, Japan
  • 2001
    • Yamagata University
      • Faculty of Engineering
      Yamagata, Japan