Tony Robinson
Speechmatics

PhD

About

127
Publications
16,683
Reads
4,941
Citations
Citations since 2016
0 Research Items
1383 Citations
Additional affiliations
October 1985 - September 2000
University of Cambridge
Position
  • Lecturer

Publications (127)
Article
Full-text available
This paper investigates the scaling properties of Recurrent Neural Network Language Models (RNNLMs). We discuss how to train very large RNNs on GPUs and address the questions of how RNNLMs scale with respect to model size, training-set size, computational costs and memory. Our analysis shows that despite being more costly to train, RNNLMs obtain mu...
Article
This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to posterior probabilities has enabled us to develop a nu...
Article
It is well known that recognition performance degrades significantly when moving from a speaker-dependent to a speaker-independent system. Traditional hidden Markov model (HMM) systems have successfully applied speaker-adaptation approaches to reduce this degradation. In this paper we present and evaluate some techniques for speaker-adaptation of a hy...
Article
Full-text available
The client-server model is being advocated for speech recognition over networks, where the acoustic features are calculated by the client, compressed and transmitted to the server. This has provoked a number of papers claiming that as recognition accuracy and perceptual quality are different goals, a new compression approach is needed. This is veri...
Article
This paper explores the interaction between a language model’s perplexity and its effect on the word error rate of a speech recognition system. Much recent research has indicated that these two measures are not as well correlated as was once thought, and many examples exist of models which have a much lower perplexity than the equivalent N-gram mo...
Article
This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval system. We discuss the development of a realtime Broadcast News speech recognizer, and its integration into an SDR system. Tw...
Article
Full-text available
ABBOT is a hybrid connectionist-HMM large vocabulary continuous speech recognition system developed at the Cambridge University Engineering Department. This uses a recurrent neural network acoustic model to map acoustic features into posterior phone probabilities. These posterior probabilities are then converted to scaled likelihoods and used as ob...
Article
Full-text available
This paper describes the development of a connectionist-hidden Markov model (HMM) system for the 1997 DARPA Hub-4E CSR evaluations. We describe both system development and the enhancements designed to improve performance on broadcast news data. Both multilayer perceptron (MLP) and recurrent neural network acoustic models have been investigated. We...
Article
The speech waveform can be modelled as a piecewise-stationary linear stochastic state space system, and its parameters can be estimated using an expectation-maximisation (EM) algorithm. One problem is the initialisation of the EM algorithm. Standard initialisation schemes can lead to poor formant trajectories. These trajectories, however, are imp...
Article
Full-text available
This paper describes the participation of the THISL group at the TREC-8 Spoken Document Retrieval (SDR) track. The THISL SDR system consists of the realtime version of the ABBOT large vocabulary speech recognition system and the THISLIR text retrieval system. The TREC-8 evaluation assessed SDR performance on a corpus of 500 hours of broadcast news...
Article
Full-text available
This paper concerns modelling speech using a piecewise stationary linear stochastic state space model. The purpose of the paper is to compare two algorithms for speech model parameter estimation: subspace state space system identification (4SID) and Expectation-Maximisation (EM). The 4SID and EM methods are similar in that they both estimate a state...
Conference Paper
This paper concerns speech enhancement. Speech is first modelled and then model parameters used to design a Kalman smoother to reduce the background noise. The 4SID (subspace state space system identification), DOA (direction-of-arrival) and polynomial techniques are described within a common subspace state space framework, and assumptions for each...
Article
Full-text available
This paper describes the SPRACH system developed for the 1998 Hub-4E broadcast news evaluation. The system is based on the connectionist-HMM framework and uses both recurrent neural network and multi-layer perceptron acoustic models. We describe both a system designed for the primary transcription hub, and a system for the less-than 10 times real-t...
Article
Full-text available
Adaptive language models have consistently been shown to lead to a significant reduction in language model perplexity compared to the equivalent static trigram model on many data sets. When these language models have been applied to speech recognition, however, they have seldom resulted in a corresponding reduction in word error rate. This paper wi...
Article
Full-text available
This paper presents two techniques for language model adaptation. The first is based on the use of mixtures of language models: the training text is partitioned according to topic, a language model is constructed for each component, and at recognition time appropriate weightings are assigned to each component to model the observed style of language...
Article
Full-text available
Much recent research has demonstrated that the correlation between a language model's perplexity and its effect on the word error rate of a speech recognition system is not as strong as was once thought. This represents a major problem for those involved in developing language models. This paper describes the development of new measures of language...
Article
Full-text available
This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and its integration into a spoken document retri...
Article
Full-text available
This paper describes the THISL spoken document retrieval system for British and North American Broadcast News. The system is based on the ABBOT large vocabulary speech recognizer, using a recurrent network acoustic model, and a probabilistic text retrieval system. We discuss the development of a realtime British English Broadcast News system, and i...
Article
This paper describes the THISL system that participated in the TREC-7 evaluation, Spoken Document Retrieval (SDR) Track, and presents the results obtained, together with some analysis. The THISL system is based on the ABBOT speech recognition system and the thislIR text retrieval system. In this evaluation we were concerned with investigating the s...
Article
Automatic summarisation of spoken audio is a fairly new research pursuit, in large part due to the relative novelty of technology for accurately decoding audio into text. Techniques that account for the peculiarities and potential ambiguities of decoded audio (high error rates, lack of syntactic boundaries) appear promising for culling summary info...
Article
Full-text available
Two important components of a speech archiving system are the compression scheme and the search facility. We investigate two ways of providing these components. The first is to run the recogniser directly from the compressed speech -- we show how even with a 2.4kbit/sec codec it is possible to produce good recognition results; but the search is slo...
Article
Full-text available
It is well known that recognition performance degrades significantly when moving from a speaker-dependent to a speaker-independent system. Traditional hidden Markov model (HMM) systems have successfully applied speaker-adaptation approaches to reduce this degradation. In this paper we present and evaluate some techniques for speaker-adaptation of a...
Article
This paper describes the THISL system that participated in the TREC-7 evaluation, Spoken Document Retrieval (SDR) Track, and presents the results obtained, together with some analysis. The THISL system is based on the ABBOT speech recognition system and the thislIR text retrieval system. In this evaluation we were concerned with investigating the s...
Conference Paper
This paper describes the THISL news retrieval system which maintains an archive of BBC radio and television news recordings. The system uses the ABBOT large vocabulary continuous speech recognition system to transcribe news broadcasts, and the thislIR text retrieval system to index and access the transcripts. Decoding and indexing is performed auto...
Article
Full-text available
This paper describes the participation of the THISL group at the TREC-8 Spoken Document Retrieval (SDR) track. The THISL SDR system consists of the realtime version of the ABBOT large vocabulary speech recognition system and the THISLIR text retrieval system. The TREC-8 evaluation assessed SDR performance on a corpus of 500 hours of broadcast n...
Article
We investigate the enhancement of speech corrupted by unknown independent additive noise when only a single microphone is available. We present adaptive enhancement systems based on an existing non-adaptive technique [Ephraim, Y., 1992a. IEEE Transactions on Signal Processing 40 (4), 725-735]. This approach models the speech and noise statistics u...
Article
Full-text available
This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The probability framework developed is then used to reestimate the noise models fr...
Article
Full-text available
A new method to improve the accuracy of Autoregressive Hidden Markov Model (AR-HMM) based recognition systems is proposed. The technique uses the bilinear transform to warp the frequency scale of the observation vectors, hence it uses a better perceptual measure to compare the observation vectors to the trained models. Results presented for the E-s...
Article
Full-text available
We have previously developed a speech enhancement scheme which can adapt to unknown additive noise. We model speech and noise using perceptual frequency or 'warped' autoregressive HMMs (AR-HMMs) and estimate the clean speech and noise parameters within this framework. In this current work, we investigate the use of our system as a front end to a MF...
Conference Paper
Full-text available
This paper describes a new search technique for large vocabulary speech recognition based on a stack decoder. Considerable memory savings are achieved with the combination of a tree based lexicon and a new search technique. The search proceeds time-first, that is partial path hypotheses are extended into the future in the inner loop and a tree walk...
Article
Full-text available
Contents: 1 Introduction; 1.1 What is Speech Analysis?; 1.1.1 So what is an acoustic vector?; 1.2 Why Speech Analysis?; 1.3 The problems of speech analysis; 1.4 Standard references for this course ...
Article
Describes a complete system for the recognition of off-line handwriting. Preprocessing techniques are described, including segmentation and normalization of word images to give invariance to scale, slant, slope and stroke thickness. Representation of the image is discussed and the skeleton and stroke features used are described. A recurrent neural...
Article
Full-text available
Recent DARPA CSR evaluations have focused on the transcription of broadcast news from both television and radio programmes [17]. This is a challenging task because the data includes a variety of speaking styles and channel conditions. This paper describes the development of a connectionist-hidden Markov model (HMM) system, and the enhancements desi...
Article
This paper describes a new search technique for large vocabulary speech recognition based on a stack decoder. Considerable memory savings are achieved with the combination of a tree based lexicon and a new search technique. The search proceeds time-first, that is partial path hypotheses are extended into the future in the inner loop and a tree walk...
Article
Full-text available
This paper describes a low bit-rate segmental formant vocoder. The formants are estimated using mixture of Gaussians whose means are constrained to vary linearly with time within a segment. A new method of smoothing the power spectrum has been used in order to improve modelling with mixtures of Gaussians. Pitch is estimated using the autocorrelatio...
Article
Full-text available
This paper describes a new search technique for large vocabulary speech recognition based on a stack decoder. Considerable memory savings are achieved with the combination of a tree based lexicon and a new search technique. The search proceeds time-first, that is partial path hypotheses are extended into the future in the inner loop and a tree walk...
Conference Paper
Full-text available
In this paper we investigate a number of ensemble methods for improving the performance of connectionist acoustic models for large vocabulary continuous speech recognition. We discuss boosting, a data selection technique which results in an ensemble of models, and mixtures-of-experts. These techniques have been applied to multilayer perceptron acous...
Article
Full-text available
ABBOT is the hybrid connectionist-hidden Markov model large-vocabulary speech recognition system developed at Cambridge University. In this system, a recurrent network maps each acoustic vector to an estimate of the posterior probabilities of the phone classes. The maximum likelihood word string is then extracted using Markov models. As in tradition...
Conference Paper
This paper describes a new low bit-rate formant vocoder. The formant parameters are represented by Gaussian mixture distributions, which are estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. A voiced/unvoiced classification mechanism has been developed based on the harmonic nature of each formant in the DF...
Conference Paper
Full-text available
Presents two techniques for language model adaptation. The first is based on the use of mixtures of language models: the training text is partitioned according to topic, a language model is constructed for each component and, at recognition time, appropriate weightings are assigned to each component to model the observed style of language. The seco...
Article
We present a Bayesian framework for inferring the parameters of a mixture of experts model based on ensemble learning by variational free energy minimisation. The Bayesian approach avoids the over-fitting and noise level under-estimation problems of traditional maximum likelihood inference. We demonstrate these methods on artificial problems and su...
Article
Full-text available
This report describes a program that performs compression of waveform files such as audio data. A simple predictive model of the waveform is used followed by Huffman coding of the prediction residuals. This is both fast and near optimal for many commonly occurring waveform signals. This framework is then extended to lossy coding under the conditions...
Article
Full-text available
This paper presents an application of recurrent networks for phone probability estimation in large vocabulary speech recognition. The need for efficient exploitation of context information is discussed
Article
Full-text available
This document is the UK English equivalent of a subset of the US American English WSJ0 database [1]
Article
This paper describes the Sqale project in which the ARPA large vocabulary evaluation paradigm was adapted to meet the needs of European multilingual speech recognition development. It involved establishing a framework for sharing training and test materials, defining common protocols for training and testing systems, developing systems, running an e...
Preprint
Full-text available
This paper presents a joint prediction/vector quantisation scheme, where each codebook element contains a predictive component associated with a vector known by both encoder and decoder. A method for training the code-book is proposed, and the application to feedback VQ is described. Experiments are performed to demonstrate the scheme on synthetic...
Article
Full-text available
ABBOT is the hybrid connectionist-hidden Markov model large-vocabulary speech recognition system developed at Cambridge University. In this system, a recurrent network maps each acoustic vector to an estimate of the posterior probabilities of the phone classes, which are used as observation probabilities within an HMM. This paper describes the syste...
Article
Full-text available
Hybrid connectionist-hidden Markov model large vocabulary speech recognition has, in recent years, been shown to be competitive with more traditional HMM systems [4]. Connectionist acoustic models generally use considerably fewer parameters than HMMs, allowing real-time operation without significant degradation of performance. However, the small nu...
Conference Paper
Full-text available
The paper describes a new formant analysis technique whereby the formant parameters are represented in the form of Gaussian mixture distributions. These are estimated from the discrete Fourier transform (DFT) magnitude spectrum of the speech signal. The parameters obtained are the means, variances and the masses of the density functions, which are...
Conference Paper
Full-text available
ABBOT is the hybrid connectionist hidden Markov model (HMM) large vocabulary continuous speech recognition system developed at Cambridge University Engineering Department. ABBOT makes effective use of the linear input network (LIN) adaptation technique to achieve speaker and channel adaptation. Although the LIN is effective at adapting to new speak...
Conference Paper
Full-text available
ABBOT is a hybrid (connectionist-hidden Markov model) large-vocabulary speech recognition (LVCSR) system, developed at Cambridge University. In this system, a recurrent network maps each acoustic vector to an estimate of the posterior probabilities of the phone classes, which are used as observation probabilities within an HMM. This paper describes...
Conference Paper
Hybrid connectionist-hidden Markov model large vocabulary speech recognition has been shown to be competitive with more traditional HMM systems. Connectionist acoustic models generally use considerably fewer parameters than HMMs, allowing real-time operation without significant degradation of performance. However, the small number of parameters in...
Article
This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The combined model is used to determine the likelihood of each observation being j...
Article
Full-text available
This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise and combines these to form a model for the noisy speech. The combined model is used to determine the likelihood of each observation being j...
Article
Full-text available
Abbot is the hybrid connectionist-hidden Markov model large-vocabulary speech recognition system developed at Cambridge University. In this system, a recurrent network maps each acoustic vector to an estimate of the posterior probabilities of the phone classes. This paper describes the system which participated in the November 1995 ARPA H3 Multiple...
Conference Paper
Full-text available
This paper presents a real-time speech recognition system used to transcribe broadcast radio speech. The system is based on ABBOT, the hybrid connectionist-HMM large vocabulary continuous speech recognition system developed at the Cambridge University Engineering Department. Developments designed to make the system more robust to acoustic variabili...
Article
Full-text available
A method for incorporating context-dependent phone classes in a connectionist-HMM hybrid speech recognition system is introduced. A modular approach is adopted, where single-layer networks discriminate between different context classes given the phone class and the acoustic data. The context networks are combined with a context-independent (CI) net...
Article
Full-text available
This paper describes the training of a recurrent neural network
Conference Paper
Full-text available
Conventional speaker independent speech recognition systems are trained using data from many different speakers. Inter-speaker variability is a major problem because parametric representations of speech are highly speaker dependent. This paper describes a technique which allows speaker dependent parameters to be considered when building a speaker i...
Conference Paper
It is well known that recognition performance degrades significantly when moving from a speaker-dependent to a speaker-independent system. Traditional hidden Markov model (HMM) systems have successfully applied speaker-adaptation approaches to reduce this degradation. In this paper we present and evaluate some techniques for speaker-adaptation of a...