About
407 Publications
52,583 Reads
10,563 Citations
Publications (407)
Eye motion-based human-machine interfaces are used to provide a means of communication for those who can move nothing but their eyes because of injury or disease. To detect eye motions, electrooculography (EOG) is used. For efficient communication, the input speed is critical. However, it is difficult for conventional EOG recognition methods to acc...
With the recent progress in computer hardware and computer graphics (CG) techniques, applications using 3D virtual space are becoming popular. To date, a mouse and a keyboard have generally been used in these applications. While a mouse is a very successful input device for continuously controlling 2D objects, it is not necessarily intuitive for controllin...
We propose a person verification method based on behavioral patterns from complex human movements. Behavioral patterns are represented by anthropometric and kinematic features of human body motion acquired by a Kinect RGBD sensor. We focus on complex movements to demonstrate that independent and rhythmic movement of body parts carries a significant...
We propose a person verification method using behavioral patterns of human upper body motion. Behavioral patterns are represented by three-dimensional features obtained from a time-of-flight camera. We take a statistical approach to model the behavioral patterns using Gaussian mixture models (GMM) and support vector machines. We employ the maximum...
In noisy environments, speech recognition decoders often incorrectly produce speech hypotheses for non-speech periods, and non-speech hypotheses, such as silence or a short pause, for speech periods. It is crucial to reduce such errors to improve the performance of speech recognition systems. This paper proposes an approach using normalized speech/...
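The decoding approach above weights hypothesis scores by speech/non-speech confidence measures obtained from two GMMs. A minimal numpy sketch of that confidence term, assuming diagonal-covariance GMMs and a simple additive weighting (the function names, GMM parameterization, and weight are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log-sum-exp along the given axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def gmm_loglik(x, weights, means, variances):
    """Per-frame log-likelihood of frames x (T, D) under a diagonal-covariance GMM.

    weights: (M,), means: (M, D), variances: (M, D)."""
    diff = x[:, None, :] - means[None, :, :]                  # (T, M, D)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    return logsumexp(log_comp + np.log(weights), axis=1)      # (T,)

def vad_weighted_score(frames, acoustic_score, speech_gmm, nonspeech_gmm, weight=0.1):
    """Add a VAD confidence term to a hypothesis score: the summed frame-level
    log-likelihood ratio between the speech GMM and the non-speech GMM
    (the additive weighting is a hypothetical choice)."""
    llr = gmm_loglik(frames, *speech_gmm) - gmm_loglik(frames, *nonspeech_gmm)
    return acoustic_score + weight * float(np.sum(llr))
```

Frames that look speech-like raise the score of speech hypotheses, so speech hypotheses over non-speech periods get penalized relative to silence hypotheses.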
The articles in this special issue bring together leading experts from various disciplines to explore the impact of new approaches to automatic speech recognition (ASR).
We propose an active learning framework for speech recognition that reduces the amount of data required for acoustic modeling. This framework consists of two steps. We first obtain a phone-error distribution using an acoustic model estimated from transcribed speech data. Then, from a text corpus we select a sentence whose phone-occurrence distribut...
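The second step above — selecting, from a text corpus, the sentence whose phone-occurrence distribution is closest to the phone-error distribution — can be sketched as a minimum-divergence search over candidates. A toy sketch; the phone set, the smoothing constant, and the use of KL divergence as the closeness measure are illustrative assumptions:

```python
import numpy as np

# hypothetical miniature phone inventory for the sketch
PHONES = ["a", "i", "u", "k", "s"]

def phone_distribution(phones, eps=1e-3):
    """Smoothed phone-occurrence distribution of a phone sequence."""
    counts = np.array([phones.count(p) for p in PHONES], dtype=float) + eps
    return counts / counts.sum()

def kl(p, q):
    """KL divergence KL(p || q) between two distributions over PHONES."""
    return float(np.sum(p * np.log(p / q)))

def select_sentence(error_dist, candidates):
    """Pick the candidate sentence (a phone sequence) whose distribution
    is closest in KL divergence to the phone-error distribution."""
    return min(candidates, key=lambda s: kl(error_dist, phone_distribution(s)))
```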
To provide an efficient means of communication for those who cannot move any muscles except the eyes due to amyotrophic lateral sclerosis (ALS), we are developing a speech synthesis interface that is based on electrooculogram (EOG) input. EOG is an electrical signal that is observed through electrodes attached to the skin around the eyes and
In this paper we present a statistical approach to question answering (QA). Our motivation is to build robust systems for many languages without the need for highly tuned linguistic modules. Consequently, word tokens and web data are used extensively but neither explicit linguistic knowledge nor annotated data is incorporated. A mathematical model...
We present our unified approach to question answering in different languages and describe our experiments on the Japanese language NTCIR-3 Question Answering Challenge (QAC-1) tasks 1 and 2. The model we use for Japanese language question answering (QA) is identical to the one we have applied successfully on the English language TREC QA tasks,...
This paper presents our recent work in regard to building Large Vocabulary Continuous Speech Recognition (LVCSR) systems for the Thai, Indonesian, and Chinese languages. For Thai, since there is no word boundary in the written form, we have proposed a new method for automatically creating word-like units from a text corpus, and applied topic and sp...
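For the Thai case above, where written text carries no word boundaries, word-like units can be grown by greedily merging frequent adjacent units, BPE-style. This is a generic stand-in sketch, not the paper's actual algorithm; the merge criterion and thresholds are assumptions:

```python
from collections import Counter

def merge_frequent_pairs(corpus, num_merges=10, min_count=2):
    """Greedy sketch: repeatedly merge the most frequent adjacent pair of
    units into a single word-like unit (BPE-style; the paper's actual
    criterion may differ)."""
    seqs = [list(line) for line in corpus]   # start from single characters
    for _ in range(num_merges):
        pairs = Counter()
        for s in seqs:
            pairs.update(zip(s, s[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < min_count:
            break
        merged = a + b
        new_seqs = []
        for s in seqs:
            out, i = [], 0
            while i < len(s):
                if i + 1 < len(s) and s[i] == a and s[i + 1] == b:
                    out.append(merged)
                    i += 2
                else:
                    out.append(s[i])
                    i += 1
            new_seqs.append(out)
        seqs = new_seqs
    return seqs
```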
We focus on the problem of speech recognition in the presence of nonstationary sudden noise, which is very likely to happen in home environments. As a model compensation method for this problem, we investigated the use of factorial hidden Markov model (FHMM) architecture developed from a clean-speech hidden Markov model (HMM) and a sudden-noise HMM...
Variations in walking speed have a strong impact on gait-based person identification. We propose a method that is robust against walking-speed variations. It is based on a combination of cubic higher-order local auto-correlation (CHLAC), gait silhouette-based principal component analysis (GSP), and a statistical framework using hidden Markov models...
The 13 papers in this special section focus on new frontiers in rich transcription. The papers concentrate mainly on three areas: speaker diarization approaches, error analysis, techniques, and features; capitalization and punctuation; and descriptions of complete rich transcription systems.
For large vocabulary continuous speech recognition, speech decoders process time sequences with context information using large probabilistic models. The software for such speech decoders tends to be large and complex, since it has to handle both the relationships of its component functions and the timing of computation at the same time. In the traditional signa...
In gait-based person identification, statistical methods such as hidden Markov models (HMMs) have proved to be effective. Their performance often degrades, however, when the amount of training data for each walker is insufficient. In this paper, we propose walker adaptation and walker adaptive training, where the data from the other walkers ar...
Three-dimensional structure prediction of a molecule can be modeled as a minimum energy search problem in a potential landscape. Popular ab initio structure prediction approaches based on this formalization are the Monte Carlo methods represented by the Metropolis method. However, their prediction performance degrades for larger molecules such as p...
This paper presents a novel approach to identify and/or verify persons by using three-dimensional dynamic and structural features extracted from human motion depicted on image streams. These features are extracted from body landmarks which are detected and tracked when the person is asked to perform specific movements, representing the dynamics of...
It is expensive to prepare a sufficient amount of training data for acoustic modeling for developing large vocabulary continuous speech recognition systems. This is a serious problem especially for resource-deficient languages. We propose an active learning method that effectively reduces the amount of training data without any degradation in recog...
We propose a committee-based method of active learning for large vocabulary continuous speech recognition. Multiple recognizers are trained in this approach, and the recognition results obtained from these are used for selecting utterances. Those utterances whose recognition results differ the most among recognizers are selected and transcribed. Pr...
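The selection criterion above can be sketched with voting entropy over the committee's recognition results: the utterances on which the recognizers disagree most are selected for transcription. A minimal sketch with hypothetical names; a real system compares word sequences via alignment rather than exact-string votes as done here:

```python
import math
from collections import Counter

def vote_entropy(hypotheses):
    """Entropy of the committee's votes for one utterance; higher means the
    recognizers disagree more, so the utterance is more informative."""
    counts = Counter(hypotheses)
    n = len(hypotheses)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def select_utterances(committee_outputs, k):
    """committee_outputs: {utt_id: [hypothesis from each recognizer]}.
    Return the k utterance ids with the highest voting entropy."""
    ranked = sorted(committee_outputs,
                    key=lambda u: vote_entropy(committee_outputs[u]),
                    reverse=True)
    return ranked[:k]
```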
A general framework of language model task adaptation is to select documents in a large training set based on a language model estimated on a development data. However, this strategy has a deficiency that the selected documents are biased to the most frequent patterns in the development data. To address this problem, a new task adaptation method is...
Research and development in the field of spoken language depends critically on the existence of software tools. A large range of excellent tools have been developed and are widely used today. Most tools were developed by individuals who recognized the need for a given tool, had the necessary conceptual and programming skills, and were deeply rooted...
In recent years, adaptation techniques have been given special focus in speaker recognition tasks, mainly targeting speaker and session variation disentangling under the Maximum a Posteriori (MAP) criterion. For these techniques, unseen mixtures are usually adapted in a global manner, if ever. In this paper, we explore Structural MAP (SMAP), Maximu...
We propose Cross-Channel Spectral Subtraction (CCSS), a source separation method for recognizing meeting speech where one microphone is prepared for each speaker. The method quickly adapts to changes in transfer functions and uses spectral subtraction to suppress the speech of other speakers. Compared with conventional source separation methods bas...
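The suppression step of such a method can be sketched as power-spectral subtraction with a floor: the interfering speaker's spectrum, estimated on that speaker's own channel, is subtracted from the target channel. The subtraction factor and flooring constant below are illustrative assumptions, and the transfer-function adaptation that CCSS performs is not shown:

```python
import numpy as np

def spectral_subtract(target_power, interference_power, alpha=1.0, beta=0.01):
    """Power-spectral subtraction with flooring.

    target_power:       power spectrum of the target speaker's microphone
    interference_power: power spectrum estimated on the other speaker's channel
    alpha:              over-subtraction factor (assumed value)
    beta:               spectral floor as a fraction of the target power
    """
    cleaned = target_power - alpha * interference_power
    floor = beta * target_power
    # never let a bin go below the floor (avoids negative power / musical noise)
    return np.maximum(cleaned, floor)
```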
This paper proposes new interfaces using semi-synchronous speech and pen input for mobile environments. A user speaks while writing, and the pen input complements the speech so that recognition performance will be higher than with speech alone. Since the input speed and input information are different between the two modes, speaking and writing, a...
We propose unsupervised cross-validation (CV) and aggregated (Ag) adaptation algorithms that integrate the ideas of ensemble methods, such as CV and bagging, in the iterative unsupervised batch-mode adaptation framework. These algorithms are used to reduce overtraining problems and to improve speech recognition performance. The algorithms are const...
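The cross-validation idea above can be illustrated with a toy nearest-mean "recognizer": in each iteration, the hypotheses used to label a given fold are produced by a model adapted on the remaining folds, so a model's own labeling errors are not fed straight back into it. Everything here (the scalar model, fold count, and update rule) is an illustrative assumption, not the paper's acoustic-model machinery:

```python
import numpy as np

def decode(x, means):
    """Toy recognizer: label each 1-D sample by the nearest class mean."""
    return np.argmin(np.abs(x[:, None] - means[None, :]), axis=1)

def adapt(x, labels, means):
    """Re-estimate each class mean from the samples hypothesized as that class."""
    new = means.copy()
    for c in range(len(means)):
        sel = x[labels == c]
        if len(sel):
            new[c] = sel.mean()
    return new

def cv_adapt(x, means, n_folds=3, n_iter=3):
    """Unsupervised CV adaptation: each fold is labeled by a model adapted
    on the other folds, then all labels update the model jointly."""
    folds = np.array_split(np.arange(len(x)), n_folds)
    for _ in range(n_iter):
        labels = np.empty(len(x), dtype=int)
        for i, fold in enumerate(folds):
            rest = np.concatenate([f for j, f in enumerate(folds) if j != i])
            fold_model = adapt(x[rest], decode(x[rest], means), means)
            labels[fold] = decode(x[fold], fold_model)
        means = adapt(x, labels, means)
    return means
```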
In speech recognition systems, decoding or data evaluation processes are sometimes performed as a part of a model estimation process. For example, when unsupervised adaptation is performed, parameters are adapted using decoding hypotheses generated by an initial model. In these model estimation frameworks, how to design dependency structures betwe...
This paper proposes a new hybrid method for machine transliteration. Our method is based on combining a newly proposed two-step conditional random field (CRF) method and the well-known joint source channel model (JSCM). The contributions of this paper are as follows: (1) a two-step CRF model for machine transliteration is proposed. The first CRF segmen...
The 12 papers in this special issue focus on speech processing for natural interaction with intelligent environments.
One of the main problems in developing a text-to-speech (TTS) synthesizer for French lies in grapheme-to-phoneme conversion. Automatic converters still produce too many errors in their phoneme sequences to be helpful for people learning French as a foreign language. The prediction of the phonetic realizations of word-final consonants (WFCs) in gen...
French is known to be a language with major pronunciation irregularities at word endings with consonants. Particularly, the well-known phonetic phenomenon called Liaison is one of the major issues for French phonetizers. Rule-based methods have been used to solve these issues. Yet, the current models still produce a great number of pronunciation er...
We previously proposed a decoding method for automatic speech recognition utilizing hypothesis scores weighted by voice activity detection (VAD)-measures. This method uses two Gaussian mixture models (GMMs) to obtain confidence measures: one for speech, the other for non-speech. To achieve good search performance, we need to adapt the GMMs properly...
The amount of available Thai broadcast news transcribed text for training a language model is still very limited compared to other major languages. Since the construction of a broadcast news corpus is very costly and time-consuming, newspaper text is often used to increase the size of training text data. This paper proposes a language model topic...
Recognition errors of proper nouns and foreign words significantly decrease the performance of ASR-based speech applications such as voice dialing systems, speech summarization, spoken document retrieval, and spoken query-based information retrieval (IR). The reason is that proper nouns and words that come from other languages are usually the most...
We propose a statistical framework for high-level feature extraction that uses SIFT Gaussian mixture models (GMMs) and audio models. SIFT features were extracted from all the image frames and modeled by a GMM. In addition, we used mel-frequency cepstral coefficients and ergodic hidden Markov models to detect high-level features in audio streams. Th...
We have previously proposed unsupervised cross-validation (CV) adaptation that introduces CV into an iterative unsupervised batch mode adaptation framework to suppress the influence of errors in an internally generated recognition hypothesis and have shown that it improves recognition performance. However, a limitation was that the experiments were...
We propose a committee-based active learning method for large vocabulary continuous speech recognition. In this approach, multiple recognizers are prepared beforehand, and the recognition results obtained from them are used for selecting utterances. Here, a progressive search method is used for aligning sentences, and voting entropy is used as a me...
We propose a novel approach to identify users by comparing features extracted from image streams acquired from a Time-of-Flight camera. These features represent body landmarks which are detected and tracked over a small period of time in which the user is asked to perform specific movements. This information is later matched against a previously tr...
Since speech is highly variable, even if we have a fairly large-scale database, we cannot avoid the data sparseness problem in constructing automatic speech recognition (ASR) systems. How to train and adapt statistical models using limited amounts of data is one of the most important research issues in ASR. This paper summarizes major techniques th...
Speech is the primary means of communication between humans. For reasons ranging from technological curiosity about the mechanisms for mechanical realization of human speech capabilities to the desire to automate simple tasks which necessitate human–machine interactions, research in automatic speech recognition by machines has attracted a great dea...
This paper presents a joint optimization method of a two-step conditional random field (CRF) model for machine transliteration and a fast decoding algorithm for the proposed method. Our method lies in the category of direct orthographical mapping (DOM) between two languages without using any intermediate phonemic mapping. In the two-step CRF model,...
In this paper we perform a cross-comparison of the T3 WFST decoder against three different speech recognition decoders on three separate tasks of variable difficulty. We show that the T3 decoder performs favorably against several established veterans in the field, including the Juicer WFST decoder, Sphinx3, and HDecode in terms of RTF versus Word...
In this study we focus on robust speech recognition in car environments. For this purpose we used weighted finite-state transducers (WFSTs) because they provide an elegant, uniform, and flexible way of integrating various knowledge sources into a single search network. To improve the robustness of the WFST speech recognition system, we performed no...
In large vocabulary continuous speech recognition (LVCSR) the acoustic model computations often account for the largest processing overhead. Our weighted finite state transducer (WFST) based decoding engine can utilize a commodity graphics processing unit (GPU) to perform the acoustic computations to move this burden off the main processor. In this...
We propose a two-step active learning method for supervised speaker adaptation. In the first step, the initial adaptation data is collected to obtain a phone error distribution. In the second step, those sentences whose phone distributions are close to the error distribution are selected, and their utterances are collected as the additional adaptat...
This paper summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods us...
This paper describes our system for the "NEWS 2009 Machine Transliteration Shared Task" (NEWS 2009). We only participated in the standard run, which is a direct orthographical mapping (DOP) between two languages without using any intermediate phonemic mapping. We propose a new two-step conditional random field (CRF) model for DOP machine translit...
We propose an automatic utterance type recognizer that distinguishes declarative questions from statements in Indonesian speech. Since utterances in these two types have the same words with the same order and differ only in their intonations, their classification requires not only a speech recognizer, but also an intonation recognizer. In this pape...
Research in automatic speaker recognition has now spanned four decades. This paper surveys the major themes and advances made in the past 40 years of research so as to provide a technological perspective and an appreciation of the fundamental progress that has been accomplished in this important area of speech-based human biometrics. Although many...
In the Weighted Finite State Transducer (WFST) framework for speech recognition, we can reduce memory usage and increase flexibility by using on-the-fly composition which generates the search network dynamically during decoding. Methods have also been proposed for optimizing WFSTs in on-the-fly composition, however, these operations place restricti...
Traditional language models rely on lexical units that are defined as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative definitions of lexical units have to be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and...
In this paper we present a fast method for computing acoustic likelihoods that makes use of a graphics processing unit (GPU). After enabling the GPU acceleration the main processor runtime dedicated to acoustic scoring tasks is reduced from the largest consumer to just a few percent even when using mixture models with a large number of Gaussian com...
Independent component analysis (ICA) is not only popular for blind source separation but also for unsupervised learning when the observations can be decomposed into some independent components. These components represent the specific speaker, gender, accent, noise or environment, and act as the basis functions to span the vector space of the human...
An unsupervised cross-validation adaptation algorithm and its variation are proposed that introduce the idea of cross-validation in the unsupervised batch-mode adaptation framework to improve the adaptation performance. The first algorithm is constructed on a general adaptation technique such as MLLR and can be used in combination with any adaptati...
This paper presents a new method for automatically generating abbreviations for Chinese organization names. Abbreviations are commonly used in spoken Chinese, especially for organization names. The generation of Chinese abbreviations is much more complex than that of English abbreviations, most of which are acronyms and truncations. The abbreviation gen...
While OOV is always a problem for most languages in ASR, in the Chinese case the problem can be avoided by utilizing character n-grams and moderate performances can be obtained. However, character n-gram has its own limitation and proper addition of new words can increase the ASR performance. Here we propose a discriminative lexicon adaptation appr...
Long organization names are often abbreviated in spoken Chinese, and abbreviated utterances cannot be recognized correctly if the abbreviations are not included in the recognition vocabulary. Therefore, it is very important to automatically generate and add abbreviations for organization names to the vocabulary. Generation of Chinese abbreviations...
Text corpus size is an important issue when building a language model (LM). This is a particularly important issue for languages where little data is available. This paper introduces an LM adaptation technique to improve an LM built using a small amount of task-dependent text with the help of a machine-translated text corpus. Icelandic speech recog...
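A common form of such LM adaptation is linear interpolation between the small task-dependent model and the model built on the auxiliary (here, machine-translated) corpus, with the weight tuned to minimize development-set perplexity. A unigram sketch under that assumption, not necessarily the paper's exact scheme:

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, eps=1e-6):
    """Lightly smoothed unigram language model over a fixed vocabulary."""
    counts = Counter(tokens)
    total = len(tokens) + eps * len(vocab)
    return {w: (counts[w] + eps) / total for w in vocab}

def interpolate(lm_a, lm_b, lam):
    """Linear interpolation: lam * lm_a + (1 - lam) * lm_b."""
    return {w: lam * lm_a[w] + (1 - lam) * lm_b[w] for w in lm_a}

def perplexity(lm, tokens):
    return math.exp(-sum(math.log(lm[t]) for t in tokens) / len(tokens))

def tune_lambda(lm_task, lm_mt, dev_tokens, grid=11):
    """Pick the interpolation weight minimizing dev-set perplexity
    over a uniform grid of candidate weights."""
    return min((i / (grid - 1) for i in range(grid)),
               key=lambda lam: perplexity(interpolate(lm_task, lm_mt, lam),
                                          dev_tokens))
```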
While OOV is always a problem for most languages in ASR, in the Chinese case the problem can be avoided by utilizing character n-grams and moderate performances can be obtained. However, character n-gram has its own limitation and proper addition of new words can increase the ASR performance. Here we propose a discriminative lexicon adaptation appr...
APSIPA ASC 2009: Asia-Pacific Signal and Information Processing Association, 2009 Annual Summit and Conference. 4-7 October 2009. Sapporo, Japan. Oral session: Speech and Music Processing (5 October 2009). Text corpus size is an important issue when building a language model (LM) in particular where insufficient training and evaluation data are ava...
APSIPA ASC 2009, Oral session: Infrastructure Software for Speech Processing (5 October 2009). In this paper we present an overview of the Tokyo Tech Transducer-based Decoder T3 (pronounced tee-cubed). There is a high lev...
APSIPA ASC 2009, Poster session: Automatic Speech Recognition (6 October 2009). We propose a noise robust speech recognition method based on combining novel features extracted from fundamental frequency (F0) information a...
APSIPA ASC 2009, Poster session: Automatic Speech Recognition (6 October 2009). Query term misrecognition caused by the speech recognizer is one of the important issues in spoken query information retrieval. The misre...
APSIPA ASC 2009, Poster session: Automatic Speech Recognition (6 October 2009). This paper focuses on the description and evaluation of our new prototype Weighted Finite State Transducer (WFST) based ASR system for access...
This paper describes our video summarization system using a model selection technique to estimate the optimal number of scenes for a summary. It uses a minimum description length as a model selection criterion and carries out two-stage estimation. First, we estimate the number of scenes in each shot, and then we estimate the number of scenes in a...
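The MDL criterion above trades data fit against model cost. As an illustrative stand-in for scene-count estimation, the sketch below picks the number of 1-D clusters minimizing negative log-likelihood plus 0.5 · (#parameters) · log N; the clustering method, parameter count, and variance floor are assumptions, not the paper's model:

```python
import numpy as np

def kmeans_1d(x, k, n_iter=20):
    """Tiny 1-D k-means with evenly spaced initial centers."""
    centers = np.linspace(x.min(), x.max(), k)
    for _ in range(n_iter):
        labels = np.argmin(np.abs(x[:, None] - centers[None, :]), axis=1)
        for c in range(k):
            if np.any(labels == c):
                centers[c] = x[labels == c].mean()
    return centers, labels

def mdl_num_clusters(x, k_max=5):
    """Choose k minimizing a two-part description length: Gaussian
    negative log-likelihood (shared variance with a floor) plus
    0.5 * (#parameters) * log N."""
    n = len(x)
    best_k, best_mdl = 1, np.inf
    for k in range(1, k_max + 1):
        centers, labels = kmeans_1d(x, k)
        resid = x - centers[labels]
        var = max(resid.var(), 1e-4)               # variance floor
        nll = 0.5 * n * (np.log(2 * np.pi * var) + 1)
        mdl = nll + 0.5 * (2 * k) * np.log(n)      # k means + k weights (rough count)
        if mdl < best_mdl:
            best_k, best_mdl = k, mdl
    return best_k
```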
We have previously proposed a cross-validation (CV) based Gaussian mixture optimization method that efficiently optimizes the model structure based on CV likelihood. In this study, we propose aggregated cross-validation (AgCV) that introduces a bagging-like approach in the CV framework to reinforce the model selection ability. While a single mode...
The segmental eigenvoice method has been proposed to provide rapid speaker adaptation with limited amounts of adaptation data. In this method, the speaker-vector space is clustered to several subspaces and PCA is applied to each of the resulting subspaces. In this paper, we propose two new techniques to improve the performance of this segmental eig...
Although speech derived from read texts, news broadcasts, and other similar prepared contexts can be recognized with high accuracy, recognition performance drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different acoustically as well as linguistically. This paper stati...
We have previously proposed a noise-robust speaker verification method using fundamental frequency (F0) extracted using the Hough transform. The method also incorporates an automatic stream-weight and decision threshold estimation technique. It has been confirmed that the proposed method is effective for white noise at various SNR conditions. Thi...
Independent component analysis (ICA) is a popular approach for blind source separation (BSS). In this study, we develop a new mutual information measure for BSS and unsupervised learning of acoustic models. The underlying concept of ICA unsupervised learning algorithm is to demix the observations vectors and identify the corresponding mixture sourc...
We have previously proposed a noise-robust speaker verification method using fundamental frequency (F0) extracted using the Hough transform. The method also incorporates an automatic stream-weight and decision threshold estimation technique. It has been confirmed that the proposed method is effective for white noise at various SNR conditions. This...
In this paper we present evaluations on the large vocabulary speech decoder we are currently developing at Tokyo Institute of Technology. Our goal is to build a fast, scalable, flexible decoder to operate on weighted finite state transducer (WFST) search spaces. Even though the development of the decoder is still in its infancy we have already impl...
Summary form only given. Waseda University, Tokyo Institute of Technology, and six companies, Asahi-kasei, Hitachi, Mitsubishi, NEC, Oki and Toshiba, initiated a three-year project in 2006 supported by the Ministry of Economy, Trade and Industry (METI), Japan, for jointly developing fundamental automatic speech recognition (ASR) technology. The pro...
This paper presents a language modeling approach to sentence retrieval for Question Answering (QA) that we used in Question Answering on speech transcripts (QAst), a pilot task at the Cross Language Evaluation Forum (CLEF) evaluations 2007. A language model (LM) is generated for each sentence and these models are combined with document LMs to take...
Large speech and text corpora are crucial to the development of a state-of-the-art speech recognition system. This paper reports on the construction and evaluation of the first Thai broadcast news speech and text corpora. Specifications and conventions used in the transcription process are described in the paper. The speech corpus contains about 1...
The Tokyo Institute of Technology team participated in the high-level feature extraction, surveillance event detection pilot and Rushes summarization tasks for TRECVID2008. In the high-level feature (HLF) extraction task, we employed a framework using a tree-structured codebook and a node selection technique last year. This year we focused on the p...
Question answering research has only recently started to spread from short factoid questions to more complex ones. One significant challenge is the evaluation: manual evaluation is a difficult, time-consuming process and not applicable within efficient development of systems. Automatic evaluation requires a corpus of questions and answers,...
We propose a robust score scene detection method for baseball broadcast videos. This method is based on the data-driven approach which has been successful in statistical speech recognition. Audio and video feature streams are integrated by a multi-stream hidden Markov model to model each scene. The proposed method was evaluated in score scene detec...