-
ABSTRACT: This paper summarizes our recent efforts for building a Turkish Broadcast News transcription and retrieval system. The agglutinative nature of Turkish leads to a high number of out-of-vocabulary (OOV) words which in turn lower automatic speech recognition (ASR) accuracy. This situation compromises the performance of speech retrieval systems based on ASR output. Therefore using a word-based ASR is not adequate for transcribing speech in Turkish. To alleviate this problem, various sub-word-based recognition units are utilized. These units solve the OOV problem with moderate size vocabularies and perform even better than a 500 K word vocabulary as far as recognition accuracy is concerned. As a novel approach, the interaction between recognition units, words and sub-words, and discriminative training is explored. Sub-word models benefit from discriminative training more than word models do, especially in the discriminative language modeling framework. For speech retrieval, a spoken term detection system based on automata indexation is utilized. As with transcription, retrieval performance is measured under various schemes incorporating words and sub-words. Best results are obtained using a cascade of word and sub-word indexes together with term-specific thresholding.
IEEE Transactions on Audio Speech and Language Processing 08/2009; · 1.50 Impact Factor
-
ABSTRACT: This paper presents the analysis of recognition errors in large vocabulary continuous speech recognition (LVCSR) of Turkish. This analysis aims to learn the source of the recognition errors and investigate useful features to rectify them. These features will be used in corrective language models. First, recognition experiments were performed using word and sub-word (morph) language models. Morphs outperformed words for out-of-vocabulary words and achieved 1.5% absolute significant improvements over words. Then, the errors in the recognition output of the morph model were manually labeled according to the predefined error classes. This subjective labeling revealed that errors due to incorrect syntax can be corrected. Therefore, using syntactic dependency relations as features in the corrective language models is expected to yield higher accuracies.
Signal Processing and Communications Applications Conference, 2009. SIU 2009. IEEE 17th; 05/2009
-
ABSTRACT: This paper presents two-pass speech recognition techniques to handle the out-of-vocabulary (OOV) problem in Turkish newspaper content transcription. OOV words are assumed to be replaced by acoustically "similar" in-vocabulary (IV) words during decoding. Therefore, the first pass recognition lattice is used as the prior knowledge to adapt the vocabulary and the search space for the second pass. Vocabulary adaptation and lattice extension are performed with words similar to the hypothesis lattice words. These words are selected from a fallback vocabulary using distance functions that take the agglutinative language characteristics of Turkish into account. Morphology-based and phonetic-distance-based similarity functions respectively yield 1.9% and 4.6% absolute accuracy improvements. Statistical sub-word units are also utilized to handle the OOV problem encountered in the word-based system. Using sub-words alleviates the OOV problem and improves the recognition accuracy - OOV accuracy improved from 0% to 60.2%. However, this introduces ungrammatical items to the recognition output. Since automatically derived sub-word units do not provide explicit morphological features, the lattice extension strategy is modified to correct these ungrammatical items. Lattice extension for sub-words reduces the word error rate to 32.3% from 33.9%. This improvement is statistically significant at p=0.002 as measured by the NIST MAPSSWE significance test.
IEEE Transactions on Audio Speech and Language Processing 02/2009; · 1.50 Impact Factor
-
ABSTRACT: Ever-increasing computing power and connectivity bandwidth, together with falling storage costs, are resulting in an overwhelming amount of data of various types being produced, exchanged, and stored. Consequently, information search and retrieval has emerged as a key application area. Text-based search is the most active area, with applications that range from Web and local network search to searching for personal information residing on one's own hard-drive. Speech search has received less attention perhaps because large collections of spoken material have previously not been available. However, with cheaper storage and increased broadband access, there has been a subsequent increase in the availability of online spoken audio content such as news broadcasts, podcasts, and academic lectures. A variety of personal and commercial uses also exist. As data availability increases, the lack of adequate technology for processing spoken documents becomes the limiting factor to large-scale access to spoken content. In this article, we strive to discuss the technical issues involved in the development of information retrieval systems for spoken audio documents, concentrating on the issue of handling the errorful or incomplete output provided by ASR systems. We focus on the usage case where a user enters search terms into a search engine and is returned a collection of spoken document hits.
IEEE Signal Processing Magazine 06/2008; · 4.07 Impact Factor
-
ABSTRACT: In this paper, we address the problem of how to improve the automatic speech recognition (ASR) performance on audio conference data by speaker segmentation and speaker adaptation. A new speaker segmentation method is proposed, where the speaker turns and speaker labels are automatically determined. For speaker adaptation, we use Vocal Tract Length Normalization and Maximum Likelihood Linear Regression. On a corpus of multi-speaker teleconferences, the word error rate of the ASR system improves over 4% absolute.
Multimedia and Expo, 2007 IEEE International Conference on; 08/2007
-
ABSTRACT: In this study, we investigated word error rate performance of Turkish continuous speech recognition system with sparse packet losses in a distributed architecture. In this distributed architecture, speech feature vectors consisting of MFCCs and logarithmic power are transmitted with UDP protocol. A special UDP header is defined to be in the distributed system. Sparse packet losses are artificially generated by considering different scenarios. Two packet loss concealment methods, Lagrange and spline interpolation, are used as a front-end process in the recognition system. In the experimental study, speech feature vectors are obtained by using HTK. The SRI language modelling toolkit is used to generate statistical language models. Acoustic modeling and recognition are performed using AT&T software. The word error rate (WER) of the baseline system is 32.1%. This error rate is increased up to 34.2% with the sparse packet losses. In our study, we have seen that the packet concealment methods reduce the WER of the speech recognition system to 32.4%.
Signal Processing and Communications Applications, 2007. SIU 2007. IEEE 15th; 07/2007
-
ABSTRACT: Speech recognition and language processing systems require large amounts of transcribed speech corpora. Manual transcription is expensive and slow. Computers may do the same task faster but with more errors. Computer aided transcription is a compromise between these two methods. The output lattices of an ASR engine are manipulated to be used as language models in combination with a letter-based N-gram language model. The combined model is used as the language model of the open source Dasher application. The resulting application allows easy transcription of speech data thanks to the combination of both models at letter level. It is shown that the combined model performs better than both a letter-based N-gram model and models combined at sentence level.
Signal Processing and Communications Applications, 2007. SIU 2007. IEEE 15th; 07/2007
-
ABSTRACT: It is widely acknowledged that pronunciations in spontaneous speech differ significantly from citation form. For this reason, pronunciation modeling has received considerable attention in recent automatic speech recognition literature. Most of the attention however has focused on describing an alternate pronunciation as a different sequence of phonetic units using the same inventory of phones which describe canonical pronunciations. Analysis of manual phonetic transcription of conversational speech reveals a large number (>20%) of cases of genuine ambiguity: instances where human labelers disagree on the identity of the surface form. The authors investigate and characterize the acoustic evidence in the context of this ambiguity. They show that when a pronunciation change occurs, it is often the case that neither the canonical nor the alternate phone represent the acoustics very well. Based on this analysis, two methods for accommodating pronunciation ambiguity are developed. The first method attempts to resolve the ambiguity by separately modeling each baseform/surface-form pair. The second method treats the surface form as a hidden variable and “averages out” the ambiguity.
Acoustics, Speech, and Signal Processing, 2000. ICASSP '00. Proceedings. 2000 IEEE International Conference on; 02/2000
-
ABSTRACT: Accurately modelling pronunciation variability in conversational
speech is an important component of an automatic speech recognition
system. We describe some of the projects undertaken in this direction
during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns
Hopkins University, Baltimore, in July-August, 1997. We first illustrate
a use of hand-labelled phonetic transcriptions of a portion of the
Switchboard corpus, in conjunction with statistical techniques, to learn
alternatives to canonical pronunciations of words. We then describe the
use of these alternate pronunciations in an automatic speech recognition
system. We demonstrate that the improvement in recognition performance
from pronunciation modelling persists as the system is enhanced with
better acoustic and language models.
Acoustics, Speech and Signal Processing, 1998. Proceedings of the 1998 IEEE International Conference on; 06/1998 · 4.63 Impact Factor
-
ABSTRACT: In the early '90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task. More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (1) one analogous to the AT&T approach, (2) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words (`multiwords') in the corpus (with corpus-derived probabilities) into the ASR lexi...
05/1998;
-
ABSTRACT: Pronunciations in spontaneous, conversational speech tend to be much more variable than in careful read speech where pronunciations of words are more likely to adhere to their citation forms. Most speech recognition systems, however, rely on pronouncing dictionaries which contain few alternate pronunciations for most words. This limitation in capturing an important source of variability is potentially a significant cause for the relatively poor performance of recognition systems on large vocabulary conversational speech recognition (LVCSR) tasks. We report some of the methods investigated to address this issue at WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. As a first step towards alleviating this problem, we identified a systematic way of generating alternate pronunciations of words by using a phonetically labelled portion of the Switchboard corpus [1]. One viewpoint we explored was that pronunciation v
03/1998;
-
ABSTRACT: Accurate modelling of pronunciation variability in
conversational speech is an important component for automatic speech
recognition. We describe some of the projects undertaken in this
direction at WS97 [the Fifth LVCSR (large-vocabulary conversational
speech recognition) Summer Workshop], held at Johns Hopkins University,
Baltimore, in July-August 1997. We first illustrate a use of
hand-labelled phonetic transcriptions of a portion of the Switchboard
corpus, in conjunction with statistical techniques, to learn
alternatives to canonical pronunciations of words. We then describe the
use of these alternative pronunciations in a recognition experiment as
well as in the acoustic training of an automatic speech recognition
system. Our results show a reduction of the word error rate in both
cases: 0.9% without acoustic retraining and 2.2% with acoustic retraining.
Automatic Speech Recognition and Understanding, 1997. Proceedings., 1997 IEEE Workshop on; 01/1998
-
ABSTRACT: It is widely acknowledged that pronunciations in spontaneous speech differ significantly from citation form. For this reason, pronunciation modeling has received considerable attention in recent automatic speech recognition literature. Most of the attention however has focused on describing an alternate pronunciation as a different sequence of phonetic units using the same inventory of phones which describe canonical pronunciations. Analysis of manual phonetic transcription of conversational speech reveals a large number (>20%) of cases of genuine ambiguity: instances where human labelers disagree on the identity of the surface form. The authors investigate and characterize the acoustic evidence in the context of this ambiguity. They show that when a pronunciation change occurs, it is often the case that neither the canonical nor the alternate phone represent the acoustics very well. Based on this analysis, two methods for accommodating pronunciation ambiguity are developed. The first method attempts to resolve the ambiguity by separately modeling each baseform/surface-form pair. The second method treats the surface form as a hidden variable and "averages out" the ambiguity.
Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing 3:1679-1682.