ABSTRACT: We describe our recent effort implementing SRI's UMPC-based Pashto speech-to-speech (S2S) translation system on a smart phone running the Android operating system. In order to maintain very low latencies of system response on computationally limited smart phone platforms, we developed efficient algorithms and data structures and optimized model sizes for various system components. Compared to a laptop-based platform, our current Android-based S2S system requires less than one-fourth the system memory and significantly lower processor speed, at the cost of a 15% relative loss in system accuracy.
Spoken Language Technology Workshop (SLT), 2010 IEEE; 01/2011
ABSTRACT: In this work, we compare several known approaches for multilingual acoustic modeling for three languages, Dari, Farsi and Pashto, which are of recent geo-political interest. We demonstrate that we can train a single multilingual acoustic model for these languages and achieve recognition accuracy close to that of monolingual (or language-dependent) models. When only a small amount of training data is available for each of these languages, the multilingual model may even outperform the monolingual ones. We also explore adapting the multilingual model to target-language data, which improves automatic speech recognition (ASR) performance over the monolingual models by 3% relative word error rate (WER) for both large and small amounts of training data.
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
ABSTRACT: Research in multilingual speech recognition has shown that current speech recognition technology generalizes across different languages, and that similar modeling assumptions hold, provided that linguistic knowledge (e.g., phone inventory, pronunciation dictionary, etc.) and transcribed speech data are available for the target language. Linguists make a very conservative estimate that 4000 languages are spoken today in the world, and in many of these languages, very limited linguistic knowledge and speech data/resources are available. Rapid transition to a new target language becomes a practical concern within the concept of tiered resources (e.g., different amounts of acoustically matched/mismatched data). In this paper, we present our research efforts towards multilingual spoken information retrieval with limitations in acoustic training data. We propose different retrieval algorithms to leverage existing resources from resource-rich languages as well as the target language. Proposed algorithms employ confusion-embedded hybrid pronunciation networks, and lattice-based phonetic search within a proper name retrieval task. We use Latin-American Spanish as the target language by intentionally limiting available resources for this language. After searching for queries consisting of Spanish proper names in Spanish Broadcast News data, we demonstrate that retrieval performance degradations (due to data sparseness during automatic speech recognition (ASR) deployment in the target language) are compensated by employing English acoustic models. It is shown that the proposed algorithms for developing rapid transition of rich languages to underrepresented languages are able to achieve comparable retrieval performance using 25% of the available training data.
IEEE Transactions on Audio, Speech, and Language Processing; 09/2010 · 1.68 Impact Factor
ABSTRACT: Speech systems work reasonably well under homogeneous acoustic environmental conditions but become fragile in practical applications involving real-world environments (e.g., in-car, broadcast news, digital archives, etc.) where the audio stream contains multi-environment characteristics. To date, most approaches dealing with environmental noise in speech systems are based on assumptions concerning the noise, rather than exploring and characterizing the nature of the noise. In this chapter, we present our recent advances in the formulation and development of an in-vehicle environmental sniffing framework previously presented in [1,2,3,4]. The system is comprised of different components to detect, classify and track acoustic environmental conditions. The first goal of the framework is to seek out detailed information about the environmental characteristics instead of just detecting environmental change points. The second goal is to organize this knowledge in an effective manner to allow intelligent decisions to direct subsequent speech processing systems. After presenting our proposed in-vehicle environmental sniffing framework, we consider future directions and present discussion on supervised versus unsupervised noise clustering, and closed-set versus open-set noise classification.
Key words: Automatic speech recognition, robustness, environmental sniffing, multimodal, speech enhancement, model adaptation, dialog management, mobile, route navigation, in-vehicle
ABSTRACT: We summarize recent progress on SRI's IraqComm™ Iraqi Arabic-English two-way speech-to-speech translation system. In the past year we made substantial developments in our speech recognition and machine translation technology, leading to significant improvements in both accuracy and speed of the IraqComm system. On the 2008 NIST evaluation dataset, our two-way speech-to-text (S2T) system achieved 6% to 8% absolute improvement in BLEU in both directions, compared to our previous year's system.
Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on; 05/2009
ABSTRACT: We address the problem of retrieving out-of-vocabulary (OOV) words/queries from audio archives for the spoken term detection (STD) task. Many STD systems use the output of an automatic speech recognition (ASR) system which has a limited and fixed vocabulary, and are not capable of detecting rare words of high information content, such as named entities. Since such words are often of great interest for a retrieval task, it is important to index spoken archives in a way that allows a user to search an OOV query/term. In this work, we employ hybrid recognition systems which contain both words and subword units (graphones) to generate hybrid lattice indexes. We use a word-based STD system as our baseline, and present improvements by employing our proposed hybrid STD system that uses words plus graphones on the English broadcast news genre of the 2006 NIST STD task.
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2008, March 30 - April 4, 2008, Caesars Palace, Las Vegas, Nevada, USA; 01/2008
ABSTRACT: In this study, we focus on the problem of removing/normalizing the impact of spoken language variation in bilingual speaker recognition (BSR) systems. In addition to environment, recording, and channel mismatches, spoken language mismatch is an additional factor resulting in performance degradation in speaker recognition systems. In today's world, the number of bilingual speakers is increasing with English becoming the universal second language. Data sparseness is becoming an important research issue in deploying speaker recognition systems with limited resources (e.g., short train/test durations). Therefore, leveraging existing resources from different languages becomes a practical concern in limited-resource BSR applications, and effective language normalization schemes are required to achieve more robust speaker recognition systems. Here, we propose two novel algorithms to address the spoken language mismatch problem: normalization at the utterance-level via language identification (LID), and normalization at the segment-level via multilingual phone recognition (PR). We evaluated our algorithms using a bilingual (Spanish-English) speaker set of 80 speakers. Experimental results show improvements over a baseline system which employs fusion of language-dependent speaker models with fixed weights.
Acoustics, Speech and Signal Processing, 2007. ICASSP 2007. IEEE International Conference on; 05/2007
ABSTRACT: Automatic speech recognition systems work reasonably well under clean conditions but become fragile in practical applications involving real-world environments. To date, most approaches dealing with environmental noise in speech systems are based on assumptions concerning the noise, or differences in collecting and training on a specific noise condition, rather than exploring the nature of the noise. As such, speech recognition, speaker ID, or coding systems are typically retrained when new acoustic conditions are to be encountered. In this paper, we propose a new framework entitled Environmental Sniffing to detect, classify, and track acoustic environmental conditions. The first goal of the framework is to seek out detailed information about the environmental characteristics instead of just detecting environmental changes. The second goal is to organize this knowledge in an effective manner to allow smart decisions to direct subsequent speech processing systems. Our current framework uses a number of speech processing modules including a hybrid algorithm with T²-BIC segmentation, Gaussian mixture model/hidden Markov model (GMM/HMM)-based classification and noise language modeling to achieve effective noise knowledge estimation. We define a new information criterion that incorporates the impact of noise into Environmental Sniffing performance. We use an in-vehicle speech and noise environment as a test platform for our evaluations and investigate the integration of Environmental Sniffing for automatic speech recognition (ASR) in this environment. Noise sniffing experiments show that our proposed hybrid algorithm achieves a classification error rate of 25.51%, outperforming our baseline system by 7.08%. The sniffing framework is compared to a ROVER solution for automatic speech recognition (ASR) using different noise conditioned recognizers in terms of word error rate (WER) and CPU usage. Results show that the model matching scheme using the knowledge extracted from the audio stream by Environmental Sniffing achieves better performance than a ROVER solution both in accuracy and computation. A relative 11.1% WER improvement is achieved with a relative 75% reduction in CPU resources.
IEEE Transactions on Audio, Speech, and Language Processing; 03/2007 · 1.68 Impact Factor
ABSTRACT: This paper describes the system developed jointly at SRI and OGI for participation in the 2006 NIST Spoken Term Detection (STD) evaluation. We participated in the three genres of the English track: Broadcast News (BN), Conversational Telephone Speech (CTS), and Conference Meetings (MTG). The system consists of two phases. First, audio indexing, an offline phase, converts the input speech waveform into a searchable index. Second, term retrieval, possibly an online phase, returns a ranked list of occurrences for each search term. We used a word-based indexing approach, obtained with SRI's large vocabulary Speech-to-Text (STT) system. Apart from describing the submitted system and its performance on the NIST evaluation metric, we study the trade-offs between performance and system design. We examine performance versus indexing speed, effectiveness of different index ranking schemes on the NIST score, and the utility of approaches to deal with out-of-vocabulary (OOV) terms. Index Terms: spoken term detection, audio indexing
INTERSPEECH 2007, 8th Annual Conference of the International Speech Communication Association, Antwerp, Belgium, August 27-31, 2007; 01/2007
ABSTRACT: Research in multilingual speech recognition has shown that current speech recognition technology generalizes across different languages, and that similar modeling assumptions hold, provided that linguistic knowledge (e.g., phoneme inventory, pronunciation dictionary, etc.) and transcribed speech data are available for the target language. Linguists make a very conservative estimate that 4000 languages are spoken today in the world, and in many of these languages, very limited linguistic knowledge and speech data/resources are available. Rapid transition to a new target language becomes a practical concern within the concept of tiered resources. In this study, we present our research efforts towards multilingual spoken information retrieval with limitations in acoustic training data. We propose different retrieval algorithms to leverage existing resources from resource-rich languages as well as the target language using a lattice-based search. We use Latin-American Spanish as the target language. After searching for queries consisting of Spanish proper names in Spanish Broadcast News data, we obtain performance (max-F value of 28.3%) close to that of a Spanish-based system (trained on speech data from 36 speakers) using only 25% of all the available speech data from the original target language.
Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on; 06/2006
ABSTRACT: In this chapter, we present our recent advances in the formulation and development of an in-vehicle hands-free route navigation system. The system is comprised of a multi-microphone array processing front-end, environmental sniffer (for noise analysis), robust speech recognition system, and dialog manager and information servers. We also present our recently completed speech corpus for in-vehicle interactive speech systems for route planning and navigation. The corpus consists of five domains which include: digit strings, route navigation expressions, street and location sentences, phonetically balanced sentences, and a route navigation dialog in a human Wizard-of-Oz-like scenario. Speech from a total of 500 speakers was collected from across the United States of America during a six-month period from April to September 2001. While previous attempts at in-vehicle speech systems have generally focused on isolated command words to set radio frequencies, temperature control, etc., the CU-Move system is focused on natural conversational interaction between the user and in-vehicle system. After presenting our proposed in-vehicle speech system, we consider advances in multi-channel array processing, environmental noise sniffing and tracking, new and more robust acoustic front-end representations and built-in speaker normalization for robust ASR, and our back-end dialog navigation information retrieval sub-system connected to the WWW. Results are presented in each sub-section with a discussion at the end of the chapter.
ABSTRACT: In this paper, we introduce the problem of audio stream phrase recognition for information retrieval for a new National Gallery of the Spoken Word (NGSW). This will be the first large-scale repository of its kind, consisting of speeches, news broadcasts, and recordings that are of historical content from the 20th Century. We propose a system diagram and discuss critical processing tasks such as: an environment classifier, recognizer model adaptation for acoustic background noise, restricted channels, and speaker variability, natural language processor, and speech enhancement/feature processing. A probe NGSW data set is used to perform experiments using SPHINX-III LVCSR and a previously formulated RSPL-keyword spotting system. Results are reported for WSJ, BN, and NGSW corpora. Results from sub-system evaluations are reported for (i) model adaptation based on mixture weight adjustment with MLLR (reduces WER by 2.6% over a baseline BN trained model), speaker and environ...
ABSTRACT: Detecting whether a talker is speaking his native language is useful for speaker recognition, speech recognition, and intelligence applications. We study the problem of detecting nonnative speakers of American English, using two standard speech corpora. We apply approaches effective in speaker verification to this task, including systems based on MLLR, phone N-gram, prosodic, and word N-gram features. Results show equal error rates between 12% and 20%, depending on the system, test data, and choice of training data. Asymmetries in performance are most likely explained by differences in native language distributions in the corpora. Model combination yields substantial improvements over individual models, with the best result being around 8.6% EER. While phone N-grams are widely used in related tasks (e.g., language and dialect ID), we find them to be the least effective model in combination; MLLR, prosody, and word N-gram systems play stronger roles. Overall, results suggest that individual systems and system combinations found useful for speaker ID also offer promise for nonnativeness detection, and that further efforts are warranted in this area.
ABSTRACT: We investigate a variety of methods for improving language recognition accuracy based on techniques in speech recognition, and in some cases borrowed from speaker recognition. First, we look at the question of language-dependent versus language-independent phone recognition for phonotactic (PRLM) language recognizers, and find that language-independent recognizers give superior performance in both PRLM and PPRLM systems. We then investigate ways to use speaker adaptation (MLLR) transforms as a complementary feature for language characterization. Borrowing from speech recognition, we find that both PRLM and MLLR systems can be improved with the inclusion of discriminatively trained multilayer perceptrons as front ends. Finally, we compare language models to support vector machines as a modeling approach for phonotactic language recognition, and find them to be potentially superior, and surprisingly complementary.