Publications (28)
-
Article: Development of the 2003 CU-HTK Conversational Telephone Speech Transcription System
ABSTRACT: This paper describes the development of the 2003 CU-HTK large vocabulary speech recognition system for Conversational Telephone Speech (CTS). The system was designed based on a multipass, multi-branch structure where the output of all branches is combined using system combination. A number of advanced modelling techniques such as Speaker Adaptive Training, Heteroscedastic Linear Discriminant Analysis, Minimum Phone Error estimation and specially constructed Single Pronunciation dictionaries were employed. The effectiveness of each of these techniques and their potential contribution to the result of system combination was evaluated in the framework of a state-of-the-art LVCSR system with sophisticated adaptation. The final 2003 CU-HTK CTS system constructed from some of these models is described and its performance on the DARPA/NIST 2003 Rich Transcription (RT-03) evaluation test set is discussed.06/2004; -
Article: Generating And Evaluating Segmentations For Automatic Speech
ABSTRACT: Speech recognition systems for conversational telephone speech require the audio data to be automatically divided into regions of speech and non-speech. The quality of this audio segmentation affects the recognition accuracy. This paper describes several approaches to segmentation and compares the resulting recogniser performance. It is shown that using Gaussian Mixture Models outperforms an energy-detection method and using the output from the speech recogniser itself increases performance further. An upper bound on possible performance was obtained when deriving a segmentation from a forced alignment of the reference words and this outperformed using manually marked word times. Finally the correlation between an appropriately defined segmentation score and WER is shown to be over 0.95 across three data sets, suggesting that segmentations can be evaluated directly without the need for full decoding runs.06/2004; -
Article: Recent Advances In Broadcast News Transcription
ABSTRACT: This paper describes recent advances in the CU-HTK Broadcast News English (BN-E) transcription system and its performance in the DARPA/NIST Rich Transcription 2003 Speech-to-Text (RT03) evaluation. Heteroscedastic linear discriminant analysis (HLDA) and discriminative training, which were previously developed in the context of the recognition of conversational telephone speech, have been successfully applied to the BN-E task for the first time. A number of new features have also been added. These include gender-dependent (GD) discriminative training; and modified discriminative training using lattice re-generation and combination. On the 2003 evaluation set the system gave an overall word error rate of 10.7% in less than 10 times real time (10RT).10/2003; -
Article: Automatic Complexity Control For HLDA Systems
ABSTRACT: Designing a state-of-the-art large vocabulary speech recognition system is a highly complex problem. A wide range of techniques are available that affect the performance and number of free parameters. Selecting the appropriate system complexity is time-consuming, and only a limited number of possible systems can be examined. This paper presents initial results on automatic system selection when both the number of dimensions and the number of components vary. Various complexity control schemes are discussed and evaluated. Limitations of schemes based on predicting held-out data log-likelihoods are described. In addition, problems of standard approximations for this task are detailed.10/2003; -
Article: New Features In The CU-HTK System For Transcription Of Conversational Telephone Speech
ABSTRACT: This paper discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved...12/2000; -
Article: Efficient Class-Based Language Modelling For Very Large
ABSTRACT: This paper investigates the perplexity and word error rate performance of two different forms of class model and the respective data-driven algorithms for obtaining automatic word classifications. The computational complexity of the algorithm for the 'conventional' two-sided class model is found to be unsuitable for very large vocabularies ( 100k) or large numbers of classes ( 2000). A one-sided class model is therefore investigated and the complexity of its algorithm is found to be substantially less in such situations. Perplexity results are reported on both English and Russian data. For the latter both 65k and 430k vocabularies are used. Lattice rescoring experiments are also performed on an English language broadcast news task. These experimental results show that both models, when interpolated with a word model, perform similarly well. Moreover, classifications are obtained for the one-sided model in a fraction of the time required by the two-sided model, especially for very large vocabularies.12/2000; -
Article: New Features In The CU-HTK System For Transcription Of
ABSTRACT: This paper discusses new features integrated into the Cambridge University HTK (CU-HTK) system for the transcription of conversational telephone speech. Major improvements have been achieved by the use of maximum mutual information estimation in training as well as maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. Improvements are demonstrated via performance on the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). In this evaluation the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin.12/2000; -
Article: The CU-HTK March 2000 Hub5E Transcription System
ABSTRACT: This paper describes the Cambridge University HTK (CU-HTK) system developed for the NIST March 2000 evaluation of English conversational telephone speech transcription (Hub5E). A range of new features have been added to the HTK system used in the 1998 Hub5 evaluation, and the changes taken together have resulted in an 11% relative decrease in word error rate on the 1998 evaluation test set. Major changes include the use of maximum mutual information estimation in training as well as conventional maximum likelihood estimation; the use of a full variance transform for adaptation; the inclusion of unigram pronunciation probabilities; and word-level posterior probability estimation using confusion networks for use in minimum word error rate decoding, confidence score estimation and system combination. On the March 2000 Hub5 evaluation set the CU-HTK system gave an overall word error rate of 25.4%, which was the best performance by a statistically significant margin. This paper describes th...11/2000; -
Article: Broadcast News Transcription Using HTK
ABSTRACT: This paper examines the issues in extending a large vocabulary speech recognition system designed for clean and noisy read speech tasks to handle broadcast news transcription. Results using the 1995 DARPA H4 evaluation data set are presented for different front-end analyses and use of unsupervised model adaptation using maximum likelihood linear regression (MLLR). The HTK system for the 1996 H4 evaluation is then described. It includes a number of new features over previous HTK large vocabulary systems including decoder-guided segmentation, segment clustering, cache-based language modelling, and combined MAP and MLLR adaptation. The system runs in multiple passes through the data and the detailed results of each pass are given.11/2000; -
Article: Segment Generation and Clustering in the HTK Broadcast News Transcription System
ABSTRACT: This paper describes the segmentation, gender detection and segment clustering scheme used in the 1997 HTK broadcast news evaluation system and presents results on both the unpartitioned 1996 development and the 1997 evaluation sets. The stages of our approach are presented, namely classification, segmentation and gender detection, gender relabelling, and clustering of speech segments. The evaluation audio stream has been segmented according to audio type with a frame accuracy up to 95%. Further segmentation and gender labelling gave up to 99% frame accuracy with 127 multiple speaker segments. Experiments using two different segmentation approaches and three clustering schemes are presented. 1. Introduction The transcription of broadcast news requires techniques to deal with the large variety of data types present. Of particular importance is the presence of varying channel types (wide-band and telephone); data portions containing speech and/or music often simultaneously and a wide v...08/2000; -
Article: The 1997 HTK Broadcast News Transcription System
ABSTRACT: This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data type specific modelling based on adapted Wall Street Journal trained HMMs. However, we are now using data for which no manual preclassification or segmentation is available and therefore automatic techniques are required and compatible acoustic modelling strategies must be adopted. A number of recognition experiments are presented that compare data-type specific and non-specific models; differing amounts of training data; the use of gender-dependent modelling and the effects of automatic data-type classification. Based on these experiments, the HTK system for the 1997 broadcast news evaluation was designed. A detailed description of this system is given which includes a class-based language modelling component. The complete system yields an overall word error rate of 22.0% on the 1996 unpartitioned broadcast news development test data and just 15.8% on the 1997 evalua...08/2000; -
Article: The 1998 HTK Broadcast News Transcription System: Development and Results
ABSTRACT: This paper presents the development of the HTK broadcast news transcription system for the November 1998 Hub4 evaluation. Relative to the previous year's system, a number of features were added, including vocal tract length normalisation; cluster-based variance normalisation; double the quantity of acoustic training data; interpolated word level language models to combine text sources; increased broadcast news language model training data; and an extra adaptation stage using a full-variance transform. Overall these changes to the system reduced the error rate by 13% on the 1997 evaluation data and the final system had an overall word error rate of 13.8% for the 1998 evaluation data sets. 1. Introduction Significant progress in the accurate transcription of broadcast news data has been made over the last few years so that we are now at a point where such systems can be used for a variety of tasks such as audio indexing and retrieval. However there is still much interest in re...08/2000; -
Article: The 1998 HTK System For Transcription Of Conversational Telephone Speech
ABSTRACT: This paper describes the 1998 HTK large vocabulary speech recognition system for conversational telephone speech as used in the NIST 1998 Hub5E evaluation. Front-end and language modelling experiments conducted using various training and test sets from both the Switchboard and Callhome English corpora are presented. Our complete system includes reduced bandwidth analysis, side-based cepstral feature normalisation, vocal tract length normalisation (VTLN), triphone and quinphone hidden Markov models (HMMs) built using speaker adaptive training (SAT), maximum likelihood linear regression (MLLR) speaker adaptation and a confidence score based system combination. A detailed description of the complete system together with experimental results for each stage of our multi-pass decoding scheme is presented. The word error rate obtained is almost 20% better than our 1997 system on the development set. 1. INTRODUCTION Transcription of conversational telephone speech is a complex task, which has...05/1999; -
Article: The HTK Large Vocabulary Recognition System For The 1995 ARPA H3 Task
ABSTRACT: The HTK large vocabulary speech recognition system has previously shown very good performance for clean speech. This paper describes developments of the system aimed at recognition of speech from the ARPA H3 task which contains data of a relatively low signal-to-noise ratio from unknown microphones. It is shown that a two-phase approach can be effective. The first phase is to derive an initial set of models that are more appropriate for the current conditions than using models trained on clean speech. This is done using either single-pass retraining with multiple microphone data or parallel model combination which combines HMMs trained on clean data with estimates of convolutional and additive noise. The second stage provides more detailed environmental and speaker adaptation using maximum likelihood linear regression which estimates a set of linear transformations of the model parameters to the current conditions. Experiments are reported on both the 1994 ARPA CSR S5 (alternate microphones) and S10 (additive noise) spoke tasks as well as the 1995 ARPA CSR H3 task. The HTK system yielded the lowest error rates in both the H3-P0 and H3-C0 tests.04/1998; -
Article: The Development Of The 1994 HTK Large Vocabulary Speech Recognition System
ABSTRACT: This paper describes recent developments of the HTK large vocabulary continuous speech recognition system. The system uses tied-state cross-word context-dependent mixture Gaussian HMMs and a dynamic network decoder that can operate in a single pass. In the last year the decoder has been extended to produce word lattices to allow flexible and efficient system development, as well as multi-pass operation for use with computationally expensive acoustic and/or language models. The system vocabulary can now be up to 65k words, the final acoustic models have been extended to be sensitive to more acoustic context (quinphones), a 4-gram language model has been used and unsupervised incremental speaker adaptation incorporated. The resulting system gave the lowest error rates on both the H1-P0 and H1-C1 hub tasks in the November 1994 ARPA CSR evaluation.04/1998; -
Article: Experiments In Broadcast News Transcription
ABSTRACT: This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data type specific modelling based on adapted Wall Street Journal trained HMMs. However, we are now experimenting with data for which no manual pre-classification or segmentation is available and therefore automatic techniques are required and compatible acoustic modelling strategies adopted. An approach for automatic audio segmentation and classification is described and evaluated as well as extensions to our previous work on segment clustering. A number of recognition experiments are presented that compare data-type specific and non-specific models; differing amounts of training data; the use of gender-dependent modelling and the effects of automatic data-type classification. It is shown that robust segmentation into a small number of audio types is possible and that models trained on a wide variety of data types can yield good performance.04/1998; -
Article: Comparison Of Part-Of-Speech And Automatically Derived Category-Based Language Models For Speech Recognition
ABSTRACT: This paper compares various category-based language models when used in conjunction with a word-based trigram by means of linear interpolation. Categories corresponding to parts-of-speech as well as automatically clustered groupings are considered. The category-based model employs variable-length n-grams and permits each word to belong to multiple categories. Relative word error rate reductions of between 2 and 7% over the baseline are achieved in N-best rescoring experiments on the Wall Street Journal corpus. The largest improvement is obtained with a model using automatically determined categories. Perplexities continue to decrease as the number of different categories is increased, but improvements in the word error rate reach an optimum.04/1998; -
Article: Modelling Word-Pair Relations In A Category-Based Language Model
ABSTRACT: A new technique for modelling word occurrence correlations within a word-category based language model is presented. Empirical observations indicate that the conditional probability of a word given its category, rather than maintaining the constant value normally assumed, exhibits an exponential decay towards a constant as a function of an appropriately defined measure of separation between the correlated words. Consequently a functional dependence of the probability upon this separation is postulated, and methods for determining both the related word pairs as well as the function parameters are developed. Experiments using the LOB, Switchboard and Wall Street Journal corpora indicate that this formulation captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers), and the reductions increase with the size of the training corpus.03/1998; -
Article: Word-Pair Relations for Category-Based Language Models
ABSTRACT: A new technique for modelling word occurrence correlations within a word-category based language model is presented. Empirical observations indicate that the conditional probability of a word given its category, rather than maintaining the constant value normally assumed, exhibits an exponential decay towards a constant as a function of an appropriately defined measure of separation between the correlated words. Consequently a functional dependence of the probability upon this separation is postulated, and methods for determining both the related word pairs as well as the function parameters are developed. Experiments using the LOB, Switchboard and Wall Street Journal corpora indicate that this formulation captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers), and the reductions increase with the size of the training corpus.06/1997; -
Article: The Development Of The 1996 HTK Broadcast News Transcription System
ABSTRACT: This paper describes our efforts in extending a large vocabulary speech recognition system to handle broadcast news transcription. Results using the 1995 DARPA H4 evaluation data set are presented for different front-end analyses and for the use of unsupervised model adaptation using maximum likelihood linear regression (MLLR). The HTK system for the 1996 H4 evaluation is then described. It includes a number of new features compared to previous HTK large vocabulary systems including decoder-guided segmentation, segment clustering, cache-based language modelling, and combined MAP and MLLR adaptation. The system makes multiple passes through the data and the detailed results of each pass are given. The overall word error rate obtained by the 1996 evaluation system was 27.5%, and a bug-fixed version reduced this to 26.6%. 1. INTRODUCTION Large vocabulary continuous speech recognition (LVCSR) systems have traditionally been developed for read speech with a close talking microphone. Recen...05/1997;
Institutions
-
2000–2004
-
University of Cambridge
- Department of Engineering
Cambridge, ENG, United Kingdom
-