Chiori Hori

National Institute of Information and Communications Technology, Tokyo, Japan

Publications (89) · 50.18 Total impact

  • ABSTRACT: An ensemble speaker and speaking environment modeling (ESSEM) approach was recently developed. This ESSEM process consists of offline and online phases. The offline phase establishes an environment structure using speech data collected under a wide range of acoustic conditions, whereas the online phase estimates a set of acoustic models that matches the testing environment based on the established environment structure. Since the estimated acoustic models accurately characterize particular testing conditions, ESSEM can improve speech recognition performance under adverse conditions. In this work, we propose two maximum a posteriori (MAP) based algorithms to improve the online estimation part of the original ESSEM framework. We first develop MAP-based environment structure adaptation to refine the original environment structure. Next, we propose to utilize the MAP criterion to estimate the mapping function of ESSEM and enhance the environment modeling capability. For the MAP estimation, three types of prior densities are derived: the clustered prior (CP), the sequential prior (SP), and the hierarchical prior (HP). Since each prior density characterizes specific acoustic knowledge, we further derive a combination mechanism to integrate the three priors. Based on the experimental results on the Aurora-2 task, we verify that MAP-based online mapping function estimation enables ESSEM to achieve better performance than its maximum-likelihood (ML) based counterpart. Moreover, by integrating online environment structure adaptation and mapping function estimation, the proposed MAP-based ESSEM framework provides the best performance. Compared with our baseline results, MAP-based ESSEM achieves an average word error rate reduction of 15.53% (from 5.41% to 4.57%) under 50 testing conditions at signal-to-noise ratios (SNRs) of 0 to 20 dB over the three standardized testing sets.
    Audio, Speech, and Language Processing, IEEE/ACM Transactions on. 01/2014; 22(2):403-416.
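    A minimal sketch of the MAP criterion that underlies the online estimation step, in generic notation of my own (not the paper's):

    ```latex
    \hat{\lambda}_{\mathrm{MAP}} = \arg\max_{\lambda}\; p(X \mid \lambda)\, p(\lambda)
    ```

    Here λ stands for the mapping-function parameters, X for the test data, and p(λ) for one of the CP, SP, or HP prior densities; the ML-based counterpart simply drops the prior term p(λ).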
  • ABSTRACT: This paper regards social question-and-answer (Q&A) collections such as Yahoo! Answers as knowledge repositories and investigates techniques to mine knowledge from them to improve sentence-based complex question answering (QA) systems. Specifically, we present a question-type-specific method (QTSM) that extracts question-type-dependent cue expressions from social Q&A pairs in which the question types are the same as the submitted questions. We compare our approach with the question-specific and monolingual translation-based methods presented in previous works. The question-specific method (QSM) extracts question-dependent answer words from social Q&A pairs in which the questions resemble the submitted question. The monolingual translation-based method (MTM) learns word-to-word translation probabilities from all of the social Q&A pairs without considering the question or its type. Experiments on the extension of the NTCIR 2008 Chinese test data set demonstrate that our models that exploit social Q&A collections are significantly more effective than baseline methods such as LexRank. The performance ranking of these methods is QTSM > {QSM, MTM}. The largest F3 improvements in our proposed QTSM over QSM and MTM reach 6.0% and 5.8%, respectively.
    Computer Speech & Language 01/2014 · 1.46 Impact Factor
  • ABSTRACT: This paper presents a new system for automatic transcription of lectures. The system combines a number of novel features, including deep neural network acoustic models using multi-level adaptive networks to incorporate out-of-domain information, and factored recurrent neural network language models. We demonstrate that the system achieves large improvements on the TED lecture transcription task from the 2012 IWSLT evaluation; our results are currently the best reported on this task, showing a relative WER reduction of more than 16% compared to the closest competing system from the evaluation.
    Proc. Interspeech; 08/2013
  • ABSTRACT: This study presents an overview of VoiceTra, which was developed by NICT and released as the world's first network-based multilingual speech-to-speech translation system for smartphones, and describes in detail its multilingual speech recognition, multilingual translation, and multilingual speech synthesis with regard to field experiments. We show the effects of system updates using the data collected from field experiments to improve our acoustic and language models.
    Proceedings of the 2013 IEEE 14th International Conference on Mobile Data Management - Volume 02; 06/2013
  • ABSTRACT: This paper outlines the first Asian network-based speech-to-speech translation system developed by the Asian Speech Translation Advanced Research (A-STAR) consortium. Eight research groups comprising the A-STAR members participated in the experiments, covering nine languages: eight Asian languages (Hindi, Indonesian, Japanese, Korean, Malay, Thai, Vietnamese, and Chinese) and English. Each A-STAR member contributed one or more of the following spoken language technologies: automatic speech recognition, machine translation, and text-to-speech, through Web servers. The system was designed to translate common spoken utterances of travel conversations from a given source language into multiple target languages in order to facilitate multiparty travel conversations between people speaking different Asian languages. It covers travel expressions, including proper nouns that are names of famous places or attractions in Asian countries. In this paper, we describe the issues of developing spoken language technologies for Asian languages and discuss the difficulties involved in connecting heterogeneous spoken language translation systems through Web servers. This paper also presents speech-translation results, including a subjective evaluation, from the first A-STAR field test, which was carried out in July 2009.
    Computer Speech & Language 02/2013 · 1.46 Impact Factor
  • ABSTRACT: Noise reduction algorithms are widely used to mitigate noise effects on speech to improve the robustness of speech technology applications. However, they inevitably cause speech distortion. The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. This study proposes a novel framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a function approximation problem in a regularized reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function by controlling the tradeoff between approximation accuracy and function complexity. For noisy observations, this is equivalent to controlling the tradeoff between noise reduction and speech distortion. Since the target function is approximated in an RKHS, either a linear or nonlinear mapping function can be naturally incorporated in the estimation via the “kernel trick”. Traditional signal subspace and Wiener filtering based noise reduction can be derived as special cases when a linear kernel function is applied in this framework. We first provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We then apply the proposed noise reduction method in speech enhancement and noise-robust speech recognition experiments. Compared to several classical noise reduction methods, the proposed method shows promising advantages.
    IEEE Transactions on Signal Processing 02/2013; 61(3):601-610. · 2.81 Impact Factor
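    A generic form of the regularized estimation problem described above; the notation is mine, and the closed-form solution shown is the standard representer-theorem result for such objectives, not necessarily the paper's exact formulation:

    ```latex
    \hat{f} = \arg\min_{f \in \mathcal{H}_k} \sum_{i=1}^{N} \bigl(y_i - f(x_i)\bigr)^2 + \lambda \lVert f \rVert_{\mathcal{H}_k}^2,
    \qquad
    \hat{f}(x) = \sum_{i=1}^{N} \alpha_i\, k(x, x_i),
    \quad
    \boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}
    ```

    The regularization weight λ controls the tradeoff between approximation accuracy (noise reduction) and function complexity (speech distortion), and the kernel k may be linear or nonlinear; a linear kernel recovers Wiener-type filtering as a special case.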
  • ABSTRACT: This study presents a novel approach to spoken document retrieval based on neural probabilistic language modeling for semantic inference. The neural network based language model is applied to estimate word associations in a continuous space. Different weighting schemes are investigated to represent the recognized words of a spoken document as an indexing vector. The indexing vector is then transformed into a semantic indexing vector through the neural probabilistic language model. This semantic word inference and re-weighting make the semantic indexing vector a suitable representation for speech indexing. Experimental results on Mandarin Chinese broadcast news show that the proposed approach achieves a substantial and consistent improvement in spoken document retrieval.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
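    A minimal sketch of the semantic re-weighting step as described above. The word-association matrix and all names here are illustrative assumptions, not the paper's code:

    ```python
    import numpy as np

    def semantic_indexing_vector(term_weights: np.ndarray,
                                 word_association: np.ndarray) -> np.ndarray:
        """Map a term-weighted indexing vector to a semantic indexing vector.

        term_weights     : (V,) weights (e.g. TF-IDF) over the vocabulary for
                           the recognized words of one spoken document.
        word_association : (V, V) word-association scores, assumed here to be
                           estimated by a neural probabilistic language model.
        """
        v = word_association @ term_weights      # spread weight to related words
        norm = np.linalg.norm(v)
        return v / norm if norm > 0 else v       # length-normalize for retrieval

    # Illustrative use: cosine match between a query and a document index.
    rng = np.random.default_rng(0)
    W = rng.random((5, 5))
    doc = semantic_indexing_vector(rng.random(5), W)
    query = semantic_indexing_vector(rng.random(5), W)
    print(float(doc @ query))                    # cosine similarity of unit vectors
    ```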
  • ABSTRACT: This paper presents a joint analysis approach to acoustic feature normalization for robust speech recognition. Variations in acoustic environments and speakers are the major challenge for speech recognition. Conventionally, these two variations are normalized separately: speaker normalization is applied under the assumption of a noise-free condition, and noise compensation is applied under the assumption of speaker independence, resulting in suboptimal performance. The proposed joint analysis approach simultaneously considers vocal tract length normalization and the averaged temporal information of cepstral features. In a data-driven manner, a Gaussian mixture model is used to estimate the conditional parameters in the joint analysis. Experimental results show that the proposed approach achieves a substantial improvement.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
  • ABSTRACT: The maximum a posteriori (MAP) criterion is popularly used for feature compensation (FC) and acoustic model adaptation (MA) to reduce the mismatch between training and testing data sets. MAP-based FC and MA require prior densities of mapping function parameters, and designing suitable prior densities plays an important role in obtaining satisfactory performance. In this paper, we propose to use an environment structuring framework to provide suitable prior densities for facilitating MAP-based FC and MA for robust speech recognition. The framework is constructed in a two-stage hierarchical tree structure using environment clustering and partitioning processes. The constructed framework is highly capable of characterizing local information about complex speaker and speaking acoustic conditions. The local information was utilized to specify hyper-parameters in prior densities, which were then used in MAP-based FC and MA to handle the mismatch issue. We evaluated the proposed framework on Aurora-2, a connected digit recognition task, and Aurora-4, a large vocabulary continuous speech recognition (LVCSR) task. On both tasks, experimental results showed that with the prepared environment structuring framework, we could obtain suitable prior densities for enhancing the performance of MAP-based FC and MA.
    Computer Speech & Language 01/2013 · 1.46 Impact Factor
  • Chien-Lin Huang, C. Hori
    ABSTRACT: This paper presents deep neural networks for the classification of children with voice impairments from speech signals. In the analysis of speech signals, 6,373 static acoustic features are extracted from many kinds of low-level descriptors and functionals. To reduce the variability of the extracted features, two-dimensional normalization is applied to smooth inter-speaker and inter-feature mismatch using the feature-warping approach. Feature selection is then used to explore a discriminative, low-dimensional representation based on principal component analysis and linear discriminant analysis. In this representation, robust features are obtained by eliminating noise features via subspace projection. Finally, deep neural networks are adopted to classify the children with voice impairments. We conclude that deep neural networks with the proposed feature normalization and selection can significantly contribute to the robustness of recognition in practical application scenarios. We achieved a UAR of 60.9% for the four-way diagnosis classification on the development set, a relative improvement of 16.2% over the official baseline using our single system.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
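    A rough sketch of the normalization-plus-selection pipeline described above, using scikit-learn. The class choices, dimensions, and data are illustrative assumptions (StandardScaler stands in for the two-dimensional normalization and feature warping, and the MLP for the deep neural network); this is not the authors' system:

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
    from sklearn.neural_network import MLPClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    # Illustrative data: 200 children x 6373 static acoustic features, 4 classes.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((200, 6373))
    y = rng.integers(0, 4, size=200)

    # StandardScaler stands in for the paper's normalization / feature warping;
    # PCA -> LDA gives the discriminative, low-dimensional feature selection;
    # the MLP stands in for the deep neural network classifier.
    clf = make_pipeline(
        StandardScaler(),
        PCA(n_components=50),
        LinearDiscriminantAnalysis(n_components=3),   # at most n_classes - 1
        MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=300, random_state=0),
    )
    clf.fit(X, y)
    print(clf.score(X, y))   # training accuracy on the toy data
    ```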
  • ABSTRACT: This study presents a noise-robust front-end post-processing technology. After cepstral feature analysis, feature normalization is usually applied for noise reduction in spoken language recognition. We investigate a highly effective MVAW procedure, based on standard MFCC and SDC features, on NIST LRE 2007 tasks. The procedure includes mean subtraction, variance normalization, auto-regression moving-average filtering, and feature warping. Experiments were conducted on a common GMM-UBM system. The results indicate significant improvements in recognition accuracy.
    Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2013 Asia-Pacific; 01/2013
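    A rough sketch of MVAW post-processing under my own assumptions; the window size, filter order, and exact warping recipe are illustrative and may differ from the paper's settings:

    ```python
    import numpy as np
    from scipy.stats import norm

    def mvaw(feats: np.ndarray, arma_order: int = 2, warp_win: int = 301) -> np.ndarray:
        """MVAW post-processing sketch: Mean subtraction, Variance normalization,
        ARMA-style filtering, and feature Warping, applied per utterance.

        feats: (T, D) array of frame-level features (e.g. MFCC + SDC).
        """
        # M + V: per-dimension mean/variance normalization over the utterance.
        x = (feats - feats.mean(axis=0)) / (feats.std(axis=0) + 1e-8)

        # A: simple moving-average smoothing along time (an ARMA-style filter).
        m = arma_order
        kernel = np.ones(2 * m + 1) / (2 * m + 1)
        x = np.apply_along_axis(lambda c: np.convolve(c, kernel, mode="same"), 0, x)

        # W: feature warping -- map each value's rank within a sliding window
        # onto the quantiles of a standard normal distribution.
        T, D = x.shape
        half = warp_win // 2
        out = np.empty_like(x)
        for t in range(T):
            lo, hi = max(0, t - half), min(T, t + half + 1)
            win = x[lo:hi]
            rank = (win < x[t]).sum(axis=0) + 0.5        # mid-rank per dimension
            out[t] = norm.ppf(rank / win.shape[0])
        return out

    # Illustrative use on random "features" (200 frames, 56 dims, MFCC+SDC-like).
    print(mvaw(np.random.default_rng(0).standard_normal((200, 56))).shape)
    ```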
  • ABSTRACT: Speaker clustering is widely adopted to group speech data by acoustic characteristics so that unsupervised speaker normalization and speaker-adaptive training can be applied for better speech recognition performance. In this study, we present a vector space speaker clustering approach with long-term feature analysis. A supervector based on the GMM mean vectors is adopted to represent the characteristics of speakers. To achieve a robust representation, total variability subspace modeling, which has been successfully applied in speaker recognition to compensate for channel and session variability over the GMM mean supervector, is used for speaker clustering. We apply a long-term feature analysis strategy that averages short-time spectral features over a period of time to capture speaker traits manifested over a speech segment longer than a spectral frame. Experiments conducted on lecture-style speech show that this speaker clustering approach offers better speech recognition performance.
    Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on; 01/2013
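    A minimal sketch of the clustering idea, assuming scikit-learn. PCA stands in for total-variability subspace modeling, and a crude per-segment GMM re-fit stands in for proper MAP adaptation of a UBM; all names and dimensions are illustrative:

    ```python
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.mixture import GaussianMixture

    # Illustrative data: 40 speech segments of 300 frames x 20 dims each.
    rng = np.random.default_rng(0)
    segments = [rng.standard_normal((300, 20)) for _ in range(40)]

    # UBM-like GMM trained on pooled frames.
    ubm = GaussianMixture(n_components=8, covariance_type="diag",
                          random_state=0).fit(np.vstack(segments))

    def supervector(frames: np.ndarray) -> np.ndarray:
        # Crudely re-fit the GMM per segment (a stand-in for MAP adaptation),
        # then stack its mean vectors into one supervector.
        gmm = GaussianMixture(n_components=8, covariance_type="diag",
                              random_state=0, means_init=ubm.means_).fit(frames)
        return gmm.means_.ravel()                      # (8 * 20,) supervector

    sv = np.array([supervector(s) for s in segments])

    # Low-dimensional factor space over the supervectors, then k-means clustering.
    factors = PCA(n_components=10).fit_transform(sv)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(factors)
    print(labels)
    ```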
  • E. Mizukami, T. Misu, C. Hori
    ABSTRACT: We propose WFSTDM, an expandable and adaptable dialogue management platform based on weighted finite-state transducers (WFSTs). WFSTDM combines various WFSTs and enables the rapid prototyping of spoken dialogue systems by developing new dialogue management WFSTs. In this paper, we outline WFSTDM and introduce the WFSTDM builder, a network-based spoken dialogue system development tool. In addition, we describe in detail AssisTra, a spoken dialogue system for iPhone that we developed as an example of implementing WFSTDM on smartphones. We also discuss spoken dialogue systems on smartphones as tools for collecting field data.
    Mobile Data Management (MDM), 2013 IEEE 14th International Conference on; 01/2013
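    A toy sketch of WFST-style dialogue management, where each arc maps a (state, user concept) pair to a (next state, system action) pair. This is entirely illustrative and is not the WFSTDM API:

    ```python
    # Toy transducer: arcs from (state, input concept) to (next state, action).
    ARCS = {
        ("start",     "greet"): ("ask_place", "Where would you like to go?"),
        ("ask_place", "place"): ("ask_time",  "When do you want to leave?"),
        ("ask_time",  "time"):  ("confirm",   "Shall I search for routes?"),
        ("confirm",   "yes"):   ("done",      "Here are the routes."),
        ("confirm",   "no"):    ("ask_place", "Where would you like to go?"),
    }

    def run_dialogue(concepts: list[str]) -> None:
        state = "start"
        for c in concepts:
            # Unknown input keeps the state and asks the user to rephrase.
            state, action = ARCS.get((state, c), (state, "Sorry, please rephrase."))
            print(f"user: {c:6s} -> system: {action}")

    run_dialogue(["greet", "place", "time", "yes"])
    ```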
  • ABSTRACT: Developing a multilingual speech translation system requires efforts in constructing automatic speech recognition (ASR), machine translation (MT), and text-to-speech synthesis (TTS) components for all possible source and target languages. If the numerous ASR, MT, and TTS systems for different language pairs developed independently in different parts of the world could be connected, multilingual speech translation systems for a multitude of language pairs could be achieved. Yet, there is currently no common, flexible framework that can provide an entire speech translation process by bringing together heterogeneous speech translation components. In this article we therefore propose a distributed architecture framework for multilingual speech translation in which all speech translation components are provided on distributed servers and cooperate over a network. This framework can facilitate the connection of different components and functions. To show the overall mechanism, we first present our state-of-the-art technologies for multilingual ASR, MT, and TTS components, and then describe how to combine those systems into the proposed network-based framework. The client applications are implemented on a handheld mobile terminal device, and all data exchanges among client users and spoken language technology servers are managed through a Web protocol. To support multiparty communication, an additional communication server is provided for simultaneously distributing the speech translation results from one user to multiple users. Field testing shows that the system is capable of realizing multiparty multilingual speech translation for real-time and location-independent communication.
    ACM Transactions on Speech and Language Processing (TSLP). 07/2012.
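    A minimal sketch of chaining distributed ASR, MT, and TTS servers over HTTP in the spirit of this architecture. All URLs, payload fields, and parameter names are invented for illustration; the article's actual Web protocol is not reproduced here:

    ```python
    # Sketch of chaining distributed ASR -> MT -> TTS servers over HTTP.
    # URLs and payload shapes below are invented for illustration only.
    import json
    from urllib import request

    def call(url: str, payload: dict) -> dict:
        req = request.Request(url, data=json.dumps(payload).encode("utf-8"),
                              headers={"Content-Type": "application/json"})
        with request.urlopen(req) as resp:
            return json.loads(resp.read())

    def speech_translate(audio_b64: str, src: str, tgt: str) -> dict:
        # Each stage lives on its own (hypothetical) server; the client only
        # forwards the previous stage's output to the next stage.
        text = call("http://asr.example/recognize",
                    {"audio": audio_b64, "lang": src})["text"]
        translation = call("http://mt.example/translate",
                           {"text": text, "src": src, "tgt": tgt})["text"]
        return call("http://tts.example/synthesize",
                    {"text": translation, "lang": tgt})  # audio for playback
    ```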
  • ABSTRACT: The tradeoff between noise reduction and speech distortion is a key concern in designing noise reduction algorithms. We have proposed a regularization framework for noise reduction that takes this tradeoff into account. We regard speech estimation as a function approximation problem in a reproducing kernel Hilbert space (RKHS). In the estimation, the objective function is formulated to find an approximation function that gives a good tradeoff between the approximation accuracy and the complexity of the function. Using a regularization method, the approximation function can be estimated from noisy observations. In this paper, we further provide a theoretical analysis of the tradeoff property of the framework in noise reduction. We apply the framework to speech enhancement experiments in real applications. Compared with several classical noise reduction methods, the proposed framework shows promising advantages.
    Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on; 01/2012
  • ABSTRACT: In this paper we perform a comparison of lookahead composition and on-the-fly hypothesis rescoring using a common decoder. The results on a large vocabulary speech recognition task illustrate the differences in the behaviour of these algorithms in terms of error rate, real-time factor, memory usage, and internal statistics of the decoder. The evaluations were performed with the decoder operating at either the state or the arc level. The results show that the dynamic approaches also work well at the state level, even though there is a greater dynamic construction cost.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
  • ABSTRACT: Ensemble acoustic modeling can be used to model different factors that cause variability in the acoustic space and to provide different combinations that improve the performance of automatic speech recognition (ASR). One of the main concerns is how to partition the training data set into the subsets on which ensemble models are trained. In this study, we focus on ensemble acoustic modeling for acoustic variability caused by gender and accent in Chinese large vocabulary continuous speech recognition (LVCSR). Considering that gender and accent information may be encoded in local acoustic realizations of a few specific phonetic classes rather than in a global acoustic distribution, we propose an acoustic space partitioning method based on broad phonetic class (BPC) modeling of speakers for ensemble acoustic modeling. With a principal component analysis (PCA) of the BPC-based speaker representation, we designed two-level hierarchical data partitions in the low-dimensional speaker factor space related to gender and accent information. Ensemble acoustic models were trained on the partitioned data sets at both levels. Speech recognition results showed that acoustic models trained on the first-level and second-level partitions achieved relative character error rate reductions of 9.73% and 32.29%, respectively.
    Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on; 01/2012
  • ABSTRACT: The performance of English automatic speech recognition systems decreases when recognizing spontaneous speech, mainly due to multiple pronunciation variants in the utterances. Previous approaches address this problem by modeling the alteration of pronunciation at the phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not yet been considered. In this article, sequence-based pronunciation variation is modeled using a noisy channel approach in which the spontaneous phoneme sequence is considered a “noisy” string and the goal is to recover the “clean” string of the word sequence. In this way, the whole word sequence and its effect on the alteration of the phonemes is taken into consideration. Moreover, the system learns not only the phoneme transformation but also the mapping from phonemes to words directly. In this study, the phonemes are first recognized by the present recognition system, and the pronunciation variation model based on the noisy channel approach then maps from the phoneme to the word level. Two well-known natural language processing approaches are adopted and derived from noisy channel model theory: joint-sequence models and statistical machine translation. Both are applied, and various experiments are conducted using microphone and telephone recordings of spontaneous speech.
    IEICE Transactions on Information and Systems 01/2012; E95.D(8):2084-2093. · 0.22 Impact Factor
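    The noisy channel decomposition described above, written out in generic notation (symbols mine):

    ```latex
    \hat{W} = \arg\max_{W} P(W \mid \Phi) = \arg\max_{W} P(\Phi \mid W)\, P(W)
    ```

    Here Φ is the recognized (“noisy”) spontaneous phoneme sequence, W the (“clean”) word sequence to recover, P(Φ | W) the sequence-level pronunciation variation model (realized via joint-sequence models or statistical machine translation), and P(W) the language model.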
  • ABSTRACT: Use of a linear projection (LP) function to transform multiple sets of acoustic models into a single set of acoustic models is proposed for characterizing testing environments in robust automatic speech recognition. The LP function extends the linear regression (LR) function used in maximum likelihood linear regression (MLLR) and maximum a posteriori linear regression (MAPLR) by incorporating local information in the ensemble acoustic space to enhance the environment modeling capacity. To estimate the parameters of the LP function, we developed maximum likelihood LP (MLLP) and maximum a posteriori LP (MAPLP), and derived a set of integrated prior (IP) densities for MAPLP. The IP densities integrate multiple knowledge sources from the training set, previously seen speech data, the current utterance, and a prepared tree structure. We evaluated the proposed MLLP and MAPLP on the Aurora-2 database in an unsupervised model adaptation manner. Experimental results show that the LP function outperforms the LR function with both ML- and MAP-based estimates over different test conditions. Moreover, because the MAP-based estimate handles over-fitting well, MAPLP shows clear improvements over MLLP. Compared to the baseline result, MAPLP provides a significant 10.99% word error rate reduction.
    Acoustics, Speech and Signal Processing (ICASSP), 2012 IEEE International Conference on; 01/2012
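    My reading of the difference between the LR and LP mapping functions, in generic notation (the paper's exact parameterization may differ):

    ```latex
    \text{LR:}\quad \hat{\mu} = A\,\mu + b
    \qquad\qquad
    \text{LP:}\quad \hat{\mu} = \sum_{k=1}^{K} A_k\,\mu^{(k)} + b
    ```

    The LR function transforms a single Gaussian mean μ, while the LP function projects the corresponding means μ^(k) from the K acoustic model sets of the ensemble into one adapted mean, which is how it incorporates local information in the ensemble acoustic space.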
  • ABSTRACT: In this paper, we present our work on collecting spontaneous texts from the Web for constructing a language model for a Chinese speech recognition system. The selection of spontaneous-like texts involves two steps. First, word-segmented web texts are selected using a perplexity-based approach in which style-related words are strengthened by omitting infrequent topic words from the similarity measurements. Second, the selected texts are clustered based on non-noun part-of-speech (POS) words, and optimal clusters are chosen by reference to a set of spontaneous seed sentences. Using a language model that interpolates the model trained on the selected sentences with a baseline model, speech recognition evaluations were conducted on an open-domain spontaneous test set. We reduced the character error rate (CER) by 1.64% absolute (6.5% relative) compared with the baseline model. We also verified that the proposed method is superior to the conventional perplexity-based approach, with about 1% absolute (4.0% relative) reduction in CER.
    Chinese Spoken Language Processing (ISCSLP), 2012 8th International Symposium on; 01/2012
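    A minimal sketch of perplexity-based text selection with a unigram “style” model, assuming add-one smoothing and a hand-picked threshold; the seed corpus and all settings are illustrative, and the paper's method additionally de-emphasizes infrequent topic words and adds POS-based clustering:

    ```python
    import math
    from collections import Counter

    # Tiny spontaneous "seed" corpus defining the target style (illustrative).
    seed_sentences = [["well", "I", "think", "it", "is", "fine"],
                      ["you", "know", "that", "was", "really", "fun"]]

    counts = Counter(w for s in seed_sentences for w in s)
    total = sum(counts.values())
    vocab = len(counts) + 1

    def perplexity(sentence: list[str]) -> float:
        # Add-one smoothed unigram perplexity under the style model.
        logp = sum(math.log((counts[w] + 1) / (total + vocab)) for w in sentence)
        return math.exp(-logp / len(sentence))

    # Keep web sentences whose style perplexity falls under a threshold.
    web_sentences = [["I", "think", "that", "was", "fun"],
                     ["quarterly", "revenue", "grew", "rapidly"]]
    selected = [s for s in web_sentences if perplexity(s) < 20.0]
    print(selected)   # keeps the sentence closest in style to the seed set
    ```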

Publication Stats

629 Citations
50.18 Total Impact Points

Institutions

  • 2010–2014
    • National Institute of Information and Communications Technology
      • Spoken Language Communication Laboratory
      Tokyo, Japan
  • 2000–2004
    • Tokyo Institute of Technology
      • Computer Science Department
      Tokyo, Tokyo-to, Japan
  • 2003
    • Nippon Telegraph and Telephone
      Tokyo, Japan
    • NTT Communication Science Laboratories
      Kyoto, Japan
  • 2001
    • Yamagata University
      • Faculty of Engineering
      Yamagata, Japan