Hirokazu Masataki’s research while affiliated with NTT Communication Science Laboratories and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (48)


Viterbi Approximation of Latent Words Language Models for Automatic Speech Recognition
  • Article

February 2019 · 18 Reads · 7 Citations

Journal of Information Processing

Ryo Masumura · Taichi Asami · Takanobu Oba · [...]

This paper presents a Viterbi approximation of latent words language models (LWLMs) for automatic speech recognition (ASR). LWLMs are effective against data sparseness because of their soft-decision clustering structure and Bayesian modeling, so they can perform robustly across multiple ASR tasks. Unfortunately, applying an LWLM to ASR is difficult because of its computational complexity. In our previous work, we implemented an n-gram approximation of the LWLM for ASR by sampling words according to a stochastic process and training word n-gram LMs. However, that approach cannot take into account the latent word sequence behind a recognition hypothesis. Our solution is a Viterbi approximation that simultaneously decodes both the recognition hypothesis and the latent word sequence. The Viterbi approximation is implemented as two-pass ASR decoding in which the latent word sequence is estimated from a decoded recognition hypothesis using Gibbs sampling. Experiments show the effectiveness of the Viterbi approximation in an n-best rescoring framework. In addition, we investigate the relationship between the n-gram approximation and the Viterbi approximation.
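The abstract describes a two-pass scheme: a first pass produces an n-best list, and a second pass estimates the latent word sequence behind each hypothesis with Gibbs sampling, then adds the resulting LWLM score during rescoring. The following is a minimal, self-contained sketch of that flow, not the authors' implementation; the toy latent vocabulary, transition and emission tables, weights, and function names are all illustrative assumptions.

```python
import math
import random

# Toy LWLM parameters (illustrative assumptions, not a trained model):
# a bigram model over latent words plus an emission model P(observed | latent).
LATENT_VOCAB = ["<s>", "eat", "consume", "food", "meal", "</s>"]
TRANS = {  # P(h_t | h_{t-1})
    ("<s>", "eat"): 0.6, ("<s>", "consume"): 0.4,
    ("eat", "food"): 0.5, ("eat", "meal"): 0.5,
    ("consume", "food"): 0.7, ("consume", "meal"): 0.3,
    ("food", "</s>"): 1.0, ("meal", "</s>"): 1.0,
}
EMIT = {  # P(w_t | h_t): a latent word emits itself or a near-synonym
    ("eat", "eat"): 0.8, ("eat", "have"): 0.2,
    ("consume", "eat"): 0.3, ("consume", "have"): 0.7,
    ("food", "lunch"): 0.6, ("food", "dinner"): 0.4,
    ("meal", "lunch"): 0.5, ("meal", "dinner"): 0.5,
}


def trans(prev, cur):
    return TRANS.get((prev, cur), 1e-6)  # tiny floor instead of real smoothing


def emit(latent, observed):
    return EMIT.get((latent, observed), 1e-6)


def gibbs_latent_sequence(words, n_iter=50, seed=0):
    """Estimate the latent word sequence behind a hypothesis by Gibbs sampling."""
    rng = random.Random(seed)
    candidates = LATENT_VOCAB[1:-1]
    latent = [rng.choice(candidates) for _ in words]
    for _ in range(n_iter):
        for t, w in enumerate(words):
            prev_h = latent[t - 1] if t > 0 else "<s>"
            next_h = latent[t + 1] if t + 1 < len(latent) else "</s>"
            # P(h_t | rest) is proportional to
            # P(h_t | h_{t-1}) * P(h_{t+1} | h_t) * P(w_t | h_t)
            weights = [trans(prev_h, h) * trans(h, next_h) * emit(h, w)
                       for h in candidates]
            r, acc = rng.random() * sum(weights), 0.0
            for h, wgt in zip(candidates, weights):
                acc += wgt
                if r <= acc:
                    latent[t] = h
                    break
    return latent


def lwlm_log_score(words, latent):
    """log P(w, h) for a single latent path (the Viterbi-style approximation)."""
    seq = ["<s>"] + latent + ["</s>"]
    score = sum(math.log(trans(seq[i], seq[i + 1])) for i in range(len(seq) - 1))
    return score + sum(math.log(emit(h, w)) for h, w in zip(latent, words))


def rescore_nbest(nbest, lm_weight=0.5):
    """Second pass: add the approximated LWLM score to each first-pass score."""
    rescored = []
    for words, first_pass_score in nbest:
        latent = gibbs_latent_sequence(words)
        rescored.append(
            (words, first_pass_score + lm_weight * lwlm_log_score(words, latent)))
    return max(rescored, key=lambda item: item[1])


# Hypothetical first-pass n-best list: (word sequence, acoustic + LM score).
nbest = [(["eat", "lunch"], -4.0), (["have", "dinner"], -3.8)]
print(rescore_nbest(nbest))
```

In a real system the transition and emission distributions would come from a trained LWLM and the sampler would be run well past burn-in, but the structure above mirrors the two-pass decoding described in the abstract.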




Figures and tables (preview): Fig. 1 Model structure of LWLMs · Fig. 2 Model structure of n-gram mixture models · Fig. 3 Model structure of LWLM mixture models · Table 2 Out-of-vocabulary rate [%] in Experiment 1 · Table 4 Experimental data set in Experiment 2
Domain Adaptation Based on Mixture of Latent Words Language Models for Automatic Speech Recognition
  • Article
  • Full-text available

June 2018 · 144 Reads · 6 Citations

IEICE Transactions on Information and Systems

This paper proposes a novel domain adaptation method that can utilize out-of-domain text resources and partially domain-matched text resources in language modeling. A major problem in domain adaptation is that it is hard to obtain adequate adaptation effects from out-of-domain text resources. To tackle this problem, our idea is to carry out model merging in a latent variable space created from latent words language models (LWLMs). The latent variables in LWLMs are represented as specific words selected from the observed word space, so LWLMs can share a common latent variable space. This enables us to perform flexible mixture modeling that takes the latent variable space into account. This paper presents two types of mixture modeling, i.e., LWLM mixture models and LWLM cross-mixture models. The LWLM mixture models perform mixture modeling in the latent word space to mitigate the domain mismatch problem. Furthermore, in the LWLM cross-mixture models, LMs individually constructed from partially matched text resources are split into two element models, each of which can be subjected to mixture modeling. For both approaches, this paper also describes methods to optimize the mixture weights using a validation data set. Experiments show that mixing in the latent word space achieves performance improvements over mixing in the observed word space for both the target domain and out-of-domain tasks. Copyright © 2018 The Institute of Electronics, Information and Communication Engineers.
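The abstract mentions optimizing mixture weights on a validation data set. A standard way to do this for a linear mixture of component LMs is expectation-maximization over held-out tokens; the sketch below illustrates that generic procedure and is an assumption, not the authors' code. The toy per-token probability matrix stands in for probabilities produced by real component LMs, whichever space (observed or latent word) the mixing happens in.

```python
def em_mixture_weights(component_probs, n_iter=100):
    """component_probs[t][k] is P_k(w_t | history) from component LM k on
    validation token t. Returns weights lambda_k maximizing the likelihood of
    the interpolated model P(w_t | history) = sum_k lambda_k * P_k(w_t | history)."""
    n_comp = len(component_probs[0])
    weights = [1.0 / n_comp] * n_comp
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each token.
        counts = [0.0] * n_comp
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            for k in range(n_comp):
                counts[k] += weights[k] * probs[k] / mix
        # M-step: renormalize the responsibilities into new mixture weights.
        total = sum(counts)
        weights = [c / total for c in counts]
    return weights


# Toy validation data: 4 tokens scored by 2 component LMs
# (e.g., an in-domain model and an out-of-domain model).
component_probs = [
    [0.10, 0.02],
    [0.05, 0.01],
    [0.01, 0.20],
    [0.08, 0.03],
]
print(em_mixture_weights(component_probs))
```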







Tables (preview): Table 2 Experimental data set · Table 3 Experimental results on validation set: PPL, WER [%], OOV rate [%], and RTF; (a)-(q) are limited-vocabulary setups and (r)-(z) are expanded-vocabulary setups (PPL is not comparable across setups with different vocabulary sizes)
Investigation of Combining Various Major Language Model Technologies including Data Expansion and Adaptation

October 2016 · 117 Reads · 7 Citations

IEICE Transactions on Information and Systems

This paper investigates the performance improvements made possible by combining various major language model (LM) technologies, and reveals the interactions between LM technologies in spontaneous automatic speech recognition tasks. While it is clear that recent practical LMs have several problems, isolated use of major LM technologies does not appear to offer sufficient performance. In light of this, combining various LM technologies has also been examined. However, previous work focused only on modeling technologies with limited text resources, and did not consider other technologies important in practical language modeling, i.e., the use of external text resources and unsupervised adaptation. This paper therefore employs not only manual transcriptions of the target speech recognition tasks but also external text resources. In addition, unsupervised LM adaptation based on multi-pass decoding is added to the combination. We divide LM technologies into three categories and employ key ones, including recurrent neural network LMs and discriminative LMs. Our experiments show the effectiveness of combining various LM technologies not only in in-domain tasks, the subject of our previous work, but also in out-of-domain tasks. Furthermore, we also reveal the relationships between the technologies in both tasks.
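One ingredient added to the combination is unsupervised LM adaptation based on multi-pass decoding. As a rough illustration of that idea (an assumption, not the paper's system), the sketch below builds a smoothed unigram cache from first-pass 1-best hypotheses and interpolates its score with the first-pass score when rescoring n-best lists in a second pass; the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def build_cache(first_pass_hypotheses):
    """Unigram counts collected from first-pass recognition results."""
    counts = Counter(w for hyp in first_pass_hypotheses for w in hyp)
    return counts, sum(counts.values())


def cache_logprob(words, counts, total, vocab_size, alpha=1.0):
    """Add-alpha smoothed unigram log-probability under the adapted cache."""
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab_size))
               for w in words)


def second_pass_rescore(nbest, counts, total, vocab_size, cache_weight=0.3):
    """Combine the first-pass (acoustic + background LM) score with the
    adapted cache score and pick the best hypothesis."""
    return max(nbest, key=lambda h: h[1] + cache_weight *
               cache_logprob(h[0], counts, total, vocab_size))


# Toy example: hypothetical first-pass 1-best outputs from the same talk.
first_pass = [["latent", "words", "model"], ["language", "model", "adaptation"]]
counts, total = build_cache(first_pass)
nbest = [(["latent", "words", "language", "model"], -12.0),
         (["late", "words", "language", "mode"], -11.8)]
print(second_pass_rescore(nbest, counts, total, vocab_size=10000))
```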


Citations (33)


... context information) as well as fixed-length output. It is now possible to treat a discrete symbolic series of variable length as a continuous vector of fixed length [5,6,7,8,9,10]. ...

Reference:

A Proposal for a Method of Determining Contextual Semantic Frames by Understanding the Mutual Objectives and Situations Between Speech Recognition and Interlocutors
Combinations of various language model technologies including data expansion and adaptation in spontaneous speech recognition
  • Citing Conference Paper
  • September 2015

... The authors in [2] discuss the possibility of differentiation between read and spontaneous speech by just looking at the intonation or prosody. Read and spontaneous speech classification based on variance of GMM supervectors has been studied in [1]. From a speaker role characterization perspective, in [6] the authors use acoustic and linguistic features derived from an automatic speech recognition system to characterize and detect spontaneous speech. ...

Read and spontaneous speech classification based on variance of GMM supervectors
  • Citing Conference Paper
  • September 2014

... In contrast, latent words LMs (LWLMs) (Deschacht et al., 2012) are clearly effective for out-of-domain tasks. We applied the LWLM to speech recognition and the resulting performance was significantly superior in out-of-domain tasks, while the performance was comparable to conventional LMs in the domain-matched task (Masumura et al., 2013a; Masumura et al., 2013b). LWLMs are generative models that employ a latent word space. ...

Viterbi decoding for latent words language models using Gibbs sampling
  • Citing Conference Paper
  • August 2013

... Using linguistic features, such as part-of-speech tags and semantic role labels, they proposed a method to estimate these intentions through logistic regression assuming that users' utterances are given in the form of explicit sentences. Conventionally, supplemental prosodic features such as F0 have been used for intention recognition [7,8,9,10]. Fujie et al. proposed a method for estimating whether the user's attitude to the system is positive or negative via Bayesian discrimination using para-linguistic information such as fundamental frequency (F0) [4]. ...

Agreement and disagreement utterance detection in conversational speech by extracting and integrating local features
  • Citing Conference Paper
  • September 2015

... Wei et al. (2014a,b, 2013) use submodular function-based subset selection on generated transcripts to find a minimal set of ASR training data and Wu et al. (2007) use an entropy measure for the same. Asami et al. (2015) employ a joint Kullback-Leibler divergence-based subset selection on out-of-domain samples for ASR adaptation across acoustic characteristics such as speaker, noise and recording devices. Similarly, Liu et al. (2015) study subset selection to obtain low-vocabulary speech corpora for ASR, while Kirchhoff and Bilmes (2014) use a submodular approach for data selection in machine translation. ...

Training data selection for acoustic modeling via submodular optimization of joint Kullback-Leibler divergence
  • Citing Conference Paper
  • September 2015

... In addition, the h-LWLMs are related to other extended modeling of LWLMs. One related model is the latent words RNN LM (LWRNNLM) [23], [24], which uses RNN modeling for the latent variables instead of n-gram modeling. The h-LWLMs differ from these extended models by taking into account the hierarchical structure of latent words. ...

Latent words recurrent neural network language models
  • Citing Conference Paper
  • September 2015

... They achieved 0.29% and 0.51% reductions in WERs. A Viterbi approximation of latent word language models (LWLMs) for ASR was proposed by the authors of [32]; they concluded that the combination of an n-gram approximation method and the Viterbi approximation method improved ASR performance. Confusing words is another factor that affects the understanding of speech. ...

Viterbi Approximation of Latent Words Language Models for Automatic Speech Recognition
  • Citing Article
  • February 2019

Journal of Information Processing

... As a special case of SLU, spoken utterance classification (SUC) aims at classifying the observed utterance into one of the predefined semantic classes L = {l_1, ..., l_k} (Masumura et al., 2018). Thus, a semantic classifier is trained to maximize the class-posterior probability for a given observation, W = {w_1, w_2, ..., w_j}, representing a sequence of tokens. ...

Neural Confnet Classification: Fully Neural Network Based Spoken Utterance Classification Using Word Confusion Networks
  • Citing Conference Paper
  • April 2018

... In the past, statistical machine translation (Cucu et al. 2013; D'Haro and Banchs 2016) was used for this purpose. With the development of neural network-based language models, autoregressive sequence-to-sequence models are used for error correction (Tanaka et al. 2018; Liao et al. 2023), like neural machine translation. Moreover, with the advancement of attention mechanisms (Chan et al. 2016; Inaguma and Kawahara 2023), research utilizing the Transformer architecture (Vaswani et al. 2017) for error correction has demonstrated strong performance (Mani et al. 2020; Leng et al. 2021b, 2023). ...

Neural Error Corrective Language Models for Automatic Speech Recognition
  • Citing Conference Paper
  • September 2018

... The original SR system was used in a preliminary study with an endoscopic simulator (10). In the present study, we upgraded the VoiceRex system by introducing the latest technology (11,12). We also upgraded the text-processing method and vocabulary database of the system. ...

Role Play Dialogue Aware Language Models Based on Conditional Hierarchical Recurrent Encoder-Decoder
  • Citing Conference Paper
  • September 2018