Computer Speech & Language

Published by Elsevier
Online ISSN: 1095-8363
Print ISSN: 0885-2308
Artificial talkers and speech synthesis systems have long been used as a means of understanding both speech production and speech perception. The development of an airway modulation model is described that simulates the time-varying changes of the glottis and vocal tract, as well as acoustic wave propagation, during speech production. The result is a type of artificial talker that can be used to study various aspects of how sound is generated by humans and how that sound is perceived by a listener. The primary components of the model are introduced and simulation of words and phrases are demonstrated.
A sentence verification task (SVT) was used to study the effects of sentence predictability on comprehension of natural speech and synthetic speech that was controlled for intelligibility. Sentences generated using synthetic speech were matched on intelligibility with natural speech using results obtained from a separate sentence transcription task. In the main experiment, the sentence verification task included both true and false sentences that varied in predictability. Results showed differences in verification speed between natural and synthetic sentences, despite the fact that these materials were equated for intelligibility. This finding suggests that the differences in perception and comprehension between natural and synthetic speech go beyond segmental intelligibility as measured by transcription accuracy. The observed differences in response times appear to be related to the cognitive processes involved in understanding and verifying the truth value of short sentences. Reliable effects of predictability on error rates and response latencies were also observed. High-predictability sentences displayed lower error rates and faster response times than low-predictability sentences. However, predictability did not have differential effects on the processing of synthetic speech as expected. The results demonstrate the need to develop new measures of sentence comprehension that can be used to study speech communication at processing levels above and beyond those indexed through transcription tasks, or forced-choice intelligibility tests such as the Modified Rhyme Test (MRT) or the Diagnostic Rhyme Test (DRT).
Language is being increasingly harnessed to not only create natural human-machine interfaces but also to infer social behaviors and interactions. In the same vein, we investigate a novel spoken language task, of inferring social relationships in two-party conversations: whether the two parties are related as family, strangers or are involved in business transactions. For our study, we created a corpus of all incoming and outgoing calls from a few homes over the span of a year. On this unique naturalistic corpus of everyday telephone conversations, which is unlike Switchboard or any other public domain corpora, we demonstrate that standard natural language processing techniques can achieve accuracies of about 88%, 82%, 74% and 80% in differentiating business from personal calls, family from non-family calls, familiar from unfamiliar calls and family from other personal calls respectively. Through a series of experiments with our classifiers, we characterize the properties of telephone conversations and find: (a) that 30 words of openings (beginnings) are sufficient to predict business from personal calls, which could potentially be exploited in designing context sensitive interfaces in smart phones; (b) our corpus-based analysis does not support Schegloff and Sack's manual analysis of exemplars in which they conclude that pre-closings differ significantly between business and personal calls - closing fared no better than a random segment; and (c) the distribution of different types of calls are stable over durations as short as 1-2 months. In summary, our results show that social relationships can be inferred automatically in two-party conversations with sufficient accuracy to support practical applications.
Pictorial description of hierarchical feature computation. 
Segmental and suprasegmental speech signal modulations offer information about paralinguistic content such as affect, age and gender, pathology, and speaker state. Speaker state encompasses medium-term, temporary physiological phenomena influenced by internal or external biochemical actions (e.g., sleepiness, alcohol intoxication). Perceptual and computational research indicates that detecting speaker state from speech is a challenging task. In this paper, we present a system constructed with multiple representations of prosodic and spectral features that provided the best result at the Intoxication Subchallenge of Interspeech 2011 on the Alcohol Language Corpus. We discuss the details of each classifier and show that fusion improves performance. We additionally address the question of how best to construct a speaker state detection system in terms of robust and practical marginalization of associated variability such as through modeling speakers, utterance type, gender, and utterance length. As is the case in human perception, speaker normalization provides significant improvements to our system. We show that a held-out set of baseline (sober) data can be used to achieve comparable gains to other speaker normalization techniques. Our fused frame-level statistic-functional systems, fused GMM systems, and final combined system achieve unweighted average recalls (UARs) of 69.7%, 65.1%, and 68.8%, respectively, on the test set. More consistent numbers compared to development set results occur with matched-prompt training, where the UARs are 70.4%, 66.2%, and 71.4%, respectively. The combined system improves over the Challenge baseline by 5.5% absolute (8.4% relative), also improving upon our previously best result.
A vector quantization based talker recognition system is described and evaluated. The system is based on constructing highly efficient short-term spectral representations of individual talkers using vector quantization codebook construction techniques. Although the approach is intrinsically text-independent, the system can be easily extended to text-dependent operation for improved performance and security by encoding specified training word utterances to form word prototypes. The system has been evaluated using a 100-talker database of 20,000 spoken digits. In a talker verification mode, average equal-error rate performance of 2.2% for text-independent operation and 0.3% for text-dependent operation is obtained for 7-digit long test utterances.
This paper describes a statistically motivated framework for performing real-time dialogue state updates and policy learning in a spoken dialogue system. The framework is based on the partially observable Markov decision process (POMDP), which provides a well-founded, statistical model of spoken dialogue management. However, exact belief state updates in a POMDP model are computationally intractable so approximate methods must be used. This paper presents a tractable method based on the loopy belief propagation algorithm. Various simplifications are made, which improve the efficiency significantly compared to the original algorithm as well as compared to other POMDP-based dialogue state updating approaches. A second contribution of this paper is a method for learning in spoken dialogue systems which uses a component-based policy with the episodic Natural Actor Critic algorithm.The framework proposed in this paper was tested on both simulations and in a user trial. Both indicated that using Bayesian updates of the dialogue state significantly outperforms traditional definitions of the dialogue state. Policy learning worked effectively and the learned policy outperformed all others on simulations. In user trials the learned policy was also competitive, although its optimality was less conclusive. Overall, the Bayesian update of dialogue state framework was shown to be a feasible and effective approach to building real-world POMDP-based dialogue systems.
Higher quality speech synthesis is required to make text-to-speech technologies useful in more applications, and prosody is one component of synthesis technology with the greatest need for improvement. This paper describes computational models for the prediction of abstract prosodic labels for synthesis—accent location, symbolic tones and relative prominence level—from text that is tagged with part-of-speech labels and marked for prosodic constituent structure. Specifically, the model uses multiple levels of a prosodic hierarchy and at each level combines decision tree probability functions with Markov sequence assumptions. An advantage of decision trees is the ability to incorporate linguistic knowledge in an automatic training framework, which is needed for building systems that reflect particular speaking styles. Studies of accent and tone variability across speakers are reported and used to motivate new evaluation metrics. Prediction experiments show an improvement in accuracy of prominence location prediction over simple decision trees, with accuracy similar to the level of variability observed across speakers.
We investigate whether accent identification is more effective for English utterances embedded in a different language as part of a mixed code than for English utterances that are part of a monolingual dialogue. Our focus is on Xhosa and Zulu, two South African languages for which code-mixing with English is very common. In order to carry out our investigation, we extract English utterances from mixed-code Xhosa and Zulu speech corpora, as well as comparable utterances from an English-only corpus by Xhosa and Zulu mother-tongue speakers. Experiments using automatic accent identification systems show that identification is substantially more accurate for the utterances originating from the mixed-code speech. These findings are supported by a corresponding set of perceptual experiments in which human subjects were asked to identify the accents of recorded utterances. We conclude that accent identification is more successful for these utterances because accents are more pronounced for English embedded in mother-tongue speech than for English spoken as part of a monolingual dialogue by non-native speakers. Furthermore we find that this is true for human listeners as well as for automatic identification systems.
This paper investigates the unique pharyngeal and uvular consonants of Arabic from the point of view of automatic speech recognition (ASR). Comparisons of the recognition error rates for these phonemes are analyzed in five experiments that involve different combinations of native and non-native Arabic speakers. The most three confusing consonants for every investigated consonant are discussed. All experiments use the Hidden Markov Model Toolkit (HTK) and the Language Data Consortium (LDC) WestPoint Modern Standard Arabic (MSA) database. Results confirm that these Arabic distinct consonants are a major source of difficulty for Arabic ASR. While the recognition rate for certain of these unique consonants such as /ℏ/ can drop below 35% when uttered by non-native speakers, there is advantage to include non-native speakers in ASR. Besides, regional differences in pronunciation of MSA by native Arabic speakers require the attention of Arabic ASR research.
For large vocabulary recognition, partial phonetic information can be used to reduce the number of likely match candidates before some detailed analysis is performed. This paper reports experimental results for lexical access via broad acoustic-phonetic feature representation. A manageable subset of the lexicon is retrieved using a speaker-independent word representation indicative of manner of articulation, stress, and fricative location. The synchronized electroglottographic (EGG) signal is used as a second channel of data to ensure reliable first pass utterance representation. The glottal sensing characteristics of the EGG aid endpoint detection and the voiced/unvoiced/mixed/silent classifications.
Recent studies of English vocabulary have suggested that much of the linguistic content of the speech signal resides in stressed syllables and in broad phonetic classes corresponding to manner of articulation, both of which are comparatively easy to recognize. The implication is drawn that a promising strategy for speech recognition is to concentrate initially on these aspects of the signal, using phonotactic, lexical and (if available) higher level constraints to reduce the need for more detailed analysis. This paper argues that the evaluation criteria used to date in such studies are inappropriate, and, using a more appropriate information-theoretic approach, shows, by repeating a representative experiment, that many of the resulting claims are misleading and that there is in fact no reason to expect a recognition strategy of the type suggested to be particularly fruitful.
When users access information from text, they engage in strategic fixation, visually scanning the text to focus on regions of interest. However, because speech is both serial and ephemeral, it does not readily support strategic fixation. This paper describes two design principles, indexing and transcript-centric access that address the problem of speech access by supporting strategic fixation. Indexing involves users constructing external visual indices into speech. Users visually scan these indices to find information-rich regions of speech for more detailed processing and playback. Transcription involves transcribing speech using automatic speech recognition (ASR) and enriching that transcription with visual cues. The resulting enriched transcript is time-aligned to the original speech, allowing users to scan the transcript as a whole or the additional visual cues present in the transcript, to fixate and play regions of interest.
A two-stage approach to phoneme label alignment is presented. A self-organizing neural network is employed in the first stage. The second stage performs the label alignment of an independently given input phoneme string to the corresponding speech signal. The first stage transforms signal parameters into a set of continuously valued acoustic-phonetic features. The second stage uses the Viterbi decoding/level building technique to position the label boundaries.The validity of the feature transformation approach in stage one is demonstrated in a detailed experimental analysis, the results of which are used to derive a multi-dimensional probability density model for all individual phonemes. These models are used in the second stage label alignment process.Results are given in two parts. The first provides the experimental evidence to support the use of probability density functions based on acoustic-phonetic features, in the form of histograms for a number of vocalic and consonantal Danish and British English phonemes. The second gives the results from the label alignment process. Here, differences between reference time boundaries from a manually labelled test speech corpus and time boundaries from the alignment process are presented in histograms showing the label alignment time differences for a number of selected phoneme paris for Danish and British English. The results show an overall accuracy of the label alignment of 85% and 43% for Danish and British English, respectively.
Although speech derived from read texts, news broadcasts, and other similar prepared contexts can be recognized with high accuracy, recognition performance drastically decreases for spontaneous speech. This is due to the fact that spontaneous speech and read speech are significantly different acoustically as well as linguistically. This paper statistically and quantitatively analyzes differences in acoustic features between spontaneous and read speech using two large-scale speech corpora, “Corpus of Spontaneous Japanese (CSJ)” and “Japanese Newspaper Article Sentences (JNAS)”. Experimental results show that spontaneous speech can be characterized by reduced spectral space in comparison with that of read speech, and that the more spontaneous, the more the spectral space shrinks. This paper also clarifies that reduction in the spectral space leads to reduction in phoneme recognition accuracy. This result indicates that spectral reduction is one major reason for the decrease of recognition accuracy in spontaneous speech.
We report on some recent improvements to an HMM-based, continuous speech recognition system which is being developed at AT&T Bell Laboratories. These advances, which include the incorporation of inter-word, context-dependent units and an improved feature analysis, lead to a recognition system which gives a 95% word accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word-pair grammar (with a perplexity of about 60). It will be shown that the incorporation of inter-word units into training results in better acoustic models of word juncture coarticulation and gives a 20% reduction in error rate. The effect of an improved set of spectral and log-energy, features is further to reduce word error rate by about 30%. Since we use a continuous density HMM to characterize each subword unit, it is simple and straightforward to add new features to the feature vector (initially a 24-element vector, consisting of 12 cepstral and 12 delta cepstral coefficients). We investigate augmenting the feature vector with 12 second difference (delta-delta) cepstral coefficients and with first (delta) and second difference (delta-delta) log energies, thereby giving a 38-element feature vector. Additional error rate reductions of 11% and 18% were achieved, respectively. With the improved acoustic modeling of subword units, the overall error rate reduction was over 42%. We also found that the spectral vectors, corresponding to the same speech unit, behave differently statistically, depending on whether they are at word boundaries or within a word. The results suggest that intra-word and inter-word units should be modeled independently, even when they appear in the same context. Using a set of subword units which included variants for intra-word and inter-word, context-dependent phones, an additional decrease of about 6–10% in word error rate resulted.
The majority of previous studies on vocal expression have been conducted on posed expressions. In contrast, we utilized a large corpus of authentic affective speech recorded from real-life voice controlled telephone services. Listeners rated a selection of 200 utterances from this corpus with regard to level of perceived irritation, resignation, neutrality, and emotion intensity. The selected utterances came from 64 different speakers who each provided both neutral and affective stimuli. All utterances were further automatically analyzed regarding a comprehensive set of acoustic measures related to F0, intensity, formants, voice source, and temporal characteristics of speech. Results first showed that several significant acoustic differences were found between utterances classified as neutral and utterances classified as irritated or resigned using a within-persons design. Second, listeners’ ratings on each scale were associated with several acoustic measures. In general the acoustic correlates of irritation, resignation, and emotion intensity were similar to previous findings obtained with posed expressions, though the effect sizes were smaller for the authentic expressions. Third, automatic classification (using LDA classifiers both with and without speaker adaptation) of irritation, resignation, and neutral performed at a level comparable to human performance, though human listeners and machines did not necessarily classify individual utterances similarly. Fourth, clearly perceived exemplars of irritation and resignation were rare in our corpus. These findings were discussed in relation to future research.
In this paper we define two alternatives to the familiar perplexity statistic (hereafter lexical perplexity), which is widely applied both as a figure of merit and as an objective function for training language models. These alternatives, respectively acoustic perplexity and the synthetic acoustic word error rate, fuse information from both the language model and the acoustic model. We show how to compute these statistics by effectively synthesizing a large acoustic corpus, demonstrate their superiority (on a modest collection of models and test sets) to lexical perplexity as predictors of language model performance, and investigate their use as objective functions for training language models. We develop an efficient algorithm for training such models, and present results from a simple speech recognition experiment, in which we achieved a small reduction in word error rate by interpolating a language model trained by synthetic acoustic word error rate with a unigram model.
In large vocabulary continuous speech recognition (LVCSR) the acoustic model computations often account for the largest processing overhead. Our weighted finite state transducer (WFST) based decoding engine can utilize a commodity graphics processing unit (GPU) to perform the acoustic computations to move this burden off the main processor. In this paper we describe our new GPU scheme that can achieve a very substantial improvement in recognition speed whilst incurring no reduction in recognition accuracy. We evaluate the GPU technique on a large vocabulary spontaneous speech recognition task using a set of acoustic models with varying complexity and the results consistently show by using the GPU it is possible to reduce the recognition time with largest improvements occurring in systems with large numbers of Gaussians. For the systems which achieve the best accuracy we obtained between 2.5 and 3 times speed-ups. The faster decoding times translate to reductions in space, power and hardware costs by only requiring standard hardware that is already widely installed.
Current speech recognition systems perform poorly on conversational speech as compared to read speech, arguably due to the large acoustic variability inherent in conversational speech. Our hypothesis is that there are systematic effects in local context, associated with syllabic structure, that are not being captured in the current acoustic models. Such variation may be modeled using a broader definition of context than in traditional systems which restrict context to be the neighboring phonemes. In this paper, we study the use of word- and syllable-level context conditioning in recognizing conversational speech. We describe a method to extend standard tree-based clustering to incorporate a large number of features, and we report results on the Switchboard task which indicate that syllable structure outperforms pentaphones and incurs less computational cost. It has been hypothesized that previous work in using syllable models for recognition of English was limited because of ignoring the phenomenon of resyllabification (change of syllable structure at word boundaries), but our analysis shows that accounting for resyllabification does not impact recognition performance.
The last decade has witnessed substantial progress in speech recognition technology, with today’s state-of-the-art systems being able to transcribe unrestricted broadcast news audio data with a word error of about 20%. However, acoustic model development for these recognizers relies on the availability of large amounts of manually transcribed training data. Obtaining such data is both time-consuming and expensive, requiring trained human annotators and substantial amounts of supervision. This paper describes some recent experiments using lightly supervised and unsupervised techniques for acoustic model training in order to reduce the system development cost. The approach uses a speech recognizer to transcribe unannotated broadcast news data from the Darpa TDT-2 corpus. The hypothesized transcription is optionally aligned with closed-captions or transcripts to create labels for the training data. Experiments providing supervision only via the language model training materials show that including texts which are contemporaneous with the audio data is not crucial for success of the approach, and that the acoustic models can be initialized with as little as 10 min of manually annotated data. These experiments demonstrate that light or no supervision can dramatically reduce the cost of building acoustic models.
This work deals with automatic lexical acquisition and topic discovery from a speech stream. The proposed algorithm builds a lexicon enriched with topic information in three steps: transcription of an audio stream into phone sequences with a speaker- and task-independent phone recogniser, automatic lexical acquisition based on approximate string matching, and hierarchical topic clustering of the lexical entries based on a knowledge-poor co-occurrence approach. The resulting semantic lexicon is then used to automatically cluster the incoming speech stream into topics. The main advantages of this algorithm are its very low computational requirements and its independence to pre-defined linguistic resources, which makes it easy to port to new languages and to adapt to new tasks. It is evaluated both qualitatively and quantitatively on two corpora and on two tasks related to topic clustering. The results of these evaluations are encouraging and outline future directions of research for the proposed algorithm, such as building automatic orthographic labels of the lexical items.
In recent years there has been an upsurge in interest in approaches to speech pattern processing which go beyond the conventional hidden Markov model (HMM) framework. Current HMM-based models are fragile in noise, limited in their ability to handle pronunciation variation, and costly for large vocabulary spontaneous speech transcription. Their ability to represent dynamic behaviour is limited, and they are incompatible with modern, non-linear theories of phonology. This special issue of Computer Speech and Language on new computational paradigms for acoustic modeling in speech recognition brings together nine papers which are representative of current research in acoustic modeling which seeks to overcome these limitations.
This paper presents an experimental comparison of the performance of the multilayer perceptron (MLP) with that of the mixture density network (MDN) for an acoustic-to-articulatory mapping task. A corpus of acoustic-articulatory data recorded by electromagnetic articulography (EMA) for a single speaker was used as training and test data for this purpose. In theory, the MDN is able to provide a richer, more flexible description of the target variables in response to a given input vector than the least-squares trained MLP. Our results show that the mean likelihoods of the target articulatory parameters for an unseen test set were indeed consistently higher with the MDN than with the MLP. The increase ranged from approximately 3% to 22%, depending on the articulatory channel in question. On the basis of these results, we argue that using a more flexible description of the target domain, such as that offered by the MDN, can prove beneficial when modelling the acoustic-to-articulatory mapping.
This paper proposes a range of techniques for extracting English verb-particle constructions from raw text corpora, complete with valence information. We propose four basic methods, based on the output of a POS tagger, chunker, chunk grammar and dependency parser, respectively. We then present a combined classifier which we show to consolidate the strengths of the component methods.
Conversational speech exhibits considerable pronunciation variability, which has been shown to have a detrimental effect on the accuracy of automatic speech recognition. There have been many attempts to model pronunciation variation, including the use of decision trees to generate alternate word pronunciations from phonemic baseforms. Use of pronunciation models during recognition is known to improve accuracy. This paper describes the incorporation of pronunciation models into acoustic model training in addition to recognition. Subtle difficulties in the straightforward use of alternatives to canonical pronunciations are first illustrated: it is shown that simply improving the accuracy of the phonetic transcription used for acoustic model training is of little benefit. Acoustic models trained on the most accurate phonetic transcriptions result in worse recognition than acoustic models trained on canonical baseforms. Analysis of this counterintuitive result leads to a new method of accommodating nonstandard pronunciations: rather than allowing a phoneme in the canonical pronunciation to be realized as one of a few distinct alternate phones, the hidden Markov model (HMM) states of the phoneme’s model are instead allowed to share Gaussian mixture components with the HMM states of the model(s) of the alternate realization(s). Qualitatively, this amounts to making a soft decision about which surface form is realized. Quantitatively, experiments show that this method is particularly well suited for acoustic model training for spontaneous speech: a 1.7 %(absolute) improvement in recognition accuracy on the Switchboard corpus is presented.
There are many speech and language processing problems which require cascaded classification tasks. While model adaptation has been shown to be useful in isolated speech and language processing tasks, it is not clear what constitutes system adaptation for such complex systems. This paper studies the following questions: In cases where a sequence of classification tasks is employed, how important is to adapt the earlier or latter systems? Is the performance improvement obtained in the earlier stages via adaptation carried on to later stages in cases where the later stages perform adaptation using similar data and/or methods? In this study, as part of a larger scale multiparty meeting understanding system, we analyze various methods for adapting dialog act segmentation and tagging models trained on conversational telephone speech (CTS) to meeting style conversations. We investigate the effect of using adapted and unadapted models for dialog act segmentation with those of tagging, showing the effect of model adaptation for cascaded classification tasks. Our results indicate that we can achieve significantly better dialog act segmentation and tagging by adapting the out-of-domain models, especially when the amount of in-domain data is limited. Experimental results show that it is more effective to adapt the models in the latter classification tasks, in our case dialog act tagging, when dealing with a sequence of cascaded classification tasks.
The automatic recognition of dialogue act is a task of crucial importance for the processing of natural language dialogue at discourse level. It is also one of the most challenging problems as most often the dialogue act is not expressed directly in speaker’s utterance. In this paper, a new cue-based model for dialogue act recognition is presented. The model is, essentially, a dynamic Bayesian network induced from manually annotated dialogue corpus via dynamic Bayesian machine learning algorithms. Furthermore, the dynamic Bayesian network’s random variables are constituted from sets of lexical cues selected automatically by means of a variable length genetic algorithm, developed specifically for this purpose. To evaluate the proposed approaches of design, three stages of experiments have been conducted. In the initial stage, the dynamic Bayesian network model is constructed using sets of lexical cues selected manually from the dialogue corpus. The model is evaluated against two previously proposed models and the results confirm the potentiality of dynamic Bayesian networks for dialogue act recognition. In the second stage, the developed variable length genetic algorithm is used to select different sets of lexical cues to constitute the dynamic Bayesian networks’ random variables. The developed approach is evaluated against some of the previously used ranking approaches and the results provide experimental evidences on its ability to avoid the drawbacks of the ranking approaches. In the third stage, the dynamic Bayesian networks model is constructed using random variables constituted from the sets of lexical cues generated in the second stage and the results confirm the effectiveness of the proposed approaches for designing dialogue act recognition model.
Prosody is an important cue for identifying dialog acts. In this paper, we show that modeling the sequence of acoustic–prosodic values as n-gram features with a maximum entropy model for dialog act (DA) tagging can perform better than conventional approaches that use coarse representation of the prosodic contour through summative statistics of the prosodic contour. The proposed scheme for exploiting prosody results in an absolute improvement of 8.7% over the use of most other widely used representations of acoustic correlates of prosody. The proposed scheme is discriminative and exploits context in the form of lexical, syntactic and prosodic cues from preceding discourse segments. Such a decoding scheme facilitates online DA tagging and offers robustness in the decoding process, unlike greedy decoding schemes that can potentially propagate errors. Our approach is different from traditional DA systems that use the entire conversation for offline dialog act decoding with the aid of a discourse model. In contrast, we use only static features and approximate the previous dialog act tags in terms of lexical, syntactic and prosodic information extracted from previous utterances. Experiments on the Switchboard-DAMSL corpus, using only lexical, syntactic and prosodic cues from three previous utterances, yield a DA tagging accuracy of 72% compared to the best case scenario with accurate knowledge of previous DA tags (oracle), which results in 74% accuracy.
All speech produced by humans includes information about the speaker, including conveying the emotional state of the speaker. It is thus desirable to include vocal affect in any synthetic speech where improving the naturalness of the speech produced is important. However, the speech factors which convey affect are poorly understood, and their implementation in synthetic speech systems is not yet commonplace. A prototype system for the production of emotional synthetic speech using a commercial formant synthesiser was developed based on vocal emotion descriptions given in the literature. This paper describes work to improve and augment this system, based on a detailed investigation of emotive material spoken by two actors (one amateur, one professional). The results of this analysis are summarised, and were used to enhance the existing emotion rules used in the speech synthesis system. The enhanced system was evaluated by naive listeners in a perception experiment, and the simulated emotions were found to be more realistic than in the original version of the system.
This paper presents empirical results of an analysis on the role of prosody in the recognition of dialogue acts and utterance mood in a practical dialogue corpus in Mexican Spanish. The work is configured as a series of machine-learning experimental conditions in which models are created by using intonational and other data as predictors and dialogue act tagging data as targets. We show that utterance mood can be predicted from intonational information, and that this mood information can then be used to recognize the dialogue act.
Speaker adaptation is recognized as an essential part of today’s large-vocabulary automatic speech recognition systems. A family of techniques that has been extensively applied for limited adaptation data is transformation-based adaptation. In transformation-based adaptation we partition our parameter space in a set of classes, estimate a transform (usually linear) for each class and apply the same transform to all the components of the class. It is known, however, that additional gains can be made if we do not constrain the components of each class to use the same transform. In this paper two speaker adaptation algorithms are described. First, instead of estimating one linear transform for each class (as maximum likelihood linear regression (MLLR) does, for example) we estimate multiple linear transforms per class of models and a transform weights vector which is specific to each component (Gaussians in our case). This in effect means that each component receives its own transform without having to estimate each one of them independently. This scheme, termed maximum likelihood stochastic transformation (MLST) achieves a good trade-off between robustness and acoustic resolution. MLST is evaluated on the Wall Street Journal(WSJ) corpus for non-native speakers and it is shown that in the case of 40 adaptation sentences the algorithm outperforms MLLR by more than 13%. In the second half of this paper, we introduce a variant of the MLST designed to operate under sparsity of data. Since the majority of the adaptation parameters are the transformations, we estimate them on the training speakers and adapt to a new speaker by estimating the transform weights only. First we cluster the speakers in a number of sets and estimate the transformations on each cluster. The new speaker will use transformations from all clusters to perform adaptation. This method, termed basis transformation, can be seen as a speaker similarity scheme. Experimental results on the WSJ show that when basis transformation is cascaded with MLLR marginal gains can be obtained from MLLR only, for adaptation of native speakers.
This paper investigates supervised and unsupervised adaptation of stochastic grammars, including n-gram language models and probabilistic context-free grammars (PCFGs), to a new domain. It is shown that the commonly used approaches of count merging and model interpolation are special cases of a more general maximum a posteriori (MAP) framework, which additionally allows for alternate adaptation approaches. This paper investigates the effectiveness of different adaptation strategies, and, in particular, focuses on the need for supervision in the adaptation process. We show that n-gram models as well as PCFGs benefit from either supervised or unsupervised MAP adaptation in various tasks. For n-gram models, we compare the benefit from supervised adaptation with that of unsupervised adaptation on a speech recognition task with an adaptation sample of limited size (about 17 h), and show that unsupervised adaptation can obtain 51% of the 7.7% adaptation gain obtained by supervised adaptation. We also investigate the benefit of using multiple word hypotheses (in the form of a word lattice) for unsupervised adaptation on a speech recognition task for which there was a much larger adaptation sample available. The use of word lattices for adaptation required the derivation of a generalization of the well-known Good-Turing estimate.
Transformation-based model adaptation techniques have been used for many years to improve robustness of speech recognition systems. While the estimation criterion used to estimate transformation parameters has been mainly based on maximum likelihood estimation (MLE), Bayesian versions of some of the most popular transformation-based adaptation methods have been recently introduced, like MAPLR, a maximum a posteriori(MAP) based version of the well-known maximum likelihood linear regression (MLLR) algorithm. This is in fact an attempt to constraint parameter estimation in order to obtain reliable estimation with a limited amount of data, not only to prevent overfitting the adaptation data but also to allow integration of prior knowledge into transformation-based adaptation techniques. Since such techniques require the estimation of a large number of transformation parameters when the amount of adaptation data is large, it is also required to define a large number of prior densities for these parameters. Robust estimation of these prior densities is therefore a crucial issue that directly affects the efficiency and effectiveness of the Bayesian techniques. This paper proposes to estimate these priors using the notion of hierarchical priors, embedded into the tree structure used to control transformation complexity. The proposed algorithm, called structural MAPLR (SMAPLR), has been evaluated on the Spoke3 1993 test set of the WSJ task. It is shown that SMAPLR reduces the risk of overtraining and exploits the adaptation data much more efficiently than MLLR, leading to a significant reduction of the word error rate for any amount of adaptation data.
In this paper, the use of discriminative linear transforms (DLT) is investigated to construct speaker adaptive speech recognition systems, where a discriminative criterion rather than ML is used for transform parameter estimation. The minimum phone error (MPE) criterion is investigated for DLT estimation, by making use of a so-called weak-sense auxiliary function to derive the estimation formulae. An implementation based on lattices is used for DLT statistics accumulation, where the use of a weakened language model allows more confusion data to be included. To improve DLT estimation for unsupervised adaptation, a method of incorporating word correctness information of the supervision into transform estimation is developed. The confidence scores calculated by confusion network decoding are used to represent the word correctness and weight the numerator statistics during DLT estimation. This makes the DLT estimation less sensitive to errors in the supervision. Experiments on transcription of read newspaper data and on conversational telephone speech transcription have shown the improvements of DLT over MLLR for both supervised and unsupervised adaptation, and the effectiveness of confidence scores for improving both normal and DLT-based MLLR adaptation.
A novel technique for maximum “a posteriori” (MAP) adaptation of maximum entropy (MaxEnt) and maximum entropy Markov models (MEMM) is presented.The technique is applied to the problem of automatically capitalizing uniformly cased text. Automatic capitalization is a practically relevant problem: speech recognition output needs to be capitalized; also, modern word processors perform capitalization among other text proofing algorithms such as spelling correction and grammar checking. Capitalization can be also used as a preprocessing step in named entity extraction or machine translation.A “background” capitalizer trained on 20 M words of Wall Street Journal (WSJ) text from 1987 is adapted to two Broadcast News (BN) test sets – one containing ABC Primetime Live text and the other NPR Morning News/CNN Morning Edition text – from 1996.The “in-domain” performance of the WSJ capitalizer is 45% better relative to the 1-gram baseline, when evaluated on a test set drawn from WSJ 1994. When evaluating on the mismatched “out-of-domain” test data, the 1-gram baseline is outperformed by 60% relative; the improvement brought by the adaptation technique using a very small amount of matched BN data – 25–70k words – is about 20–25% relative. Overall, automatic capitalization error rate of 1.4% is achieved on BN data.The performance gain obtained by employing our adaptation technique using a tiny amount of out-of-domain training data on top of the background data is striking: as little as 0.14 M words of in-domain data brings more improvement than using 10 times more background training data (from 2 M words to 20 M words).
This paper describes various speaker normalization and adaptation techniques of a knowledge data base or reference templates to new speakers in automatic speech recognition (ASR). It focuses on a technique for learning spectral transformations, based on a statistical-analysis tool (canonical correlation analysis), to adapt a standard dictionary to arbitrary speakers. The proposed method should permit to improve speaker independence in large vocabulary ASR. Application to an isolated word recognizer improved a 70% correct score to 87%.A dynamic aspect of the speaker adaptation procedure is introduced and evaluated in a particular strategy.
In speaker verification over public telephone networks, utterances can be obtained from different types of handsets. Different handsets may introduce different degrees of distortion to the speech signals. This paper attempts to combine a handset selector with (1) handset-specific transformations, (2) reinforced learning, and (3) stochastic feature transformation to reduce the effect caused by the acoustic distortion. Specifically, during training, the clean speaker models and background models are firstly transformed by MLLR-based handset-specific transformations using a small amount of distorted speech data. Then reinforced learning is applied to adapt the transformed models to handset-dependent speaker models and handset-dependent background models using stochastically transformed speaker patterns. During a verification session, a GMM-based handset classifier is used to identify the most likely handset used by the claimant; then the corresponding handset-dependent speaker and background model pairs are used for verification. Experimental results based on 150 speakers of the HTIMIT corpus show that environment adaptation based on the combination of MLLR, reinforced learning and feature transformation outperforms CMS, Hnorm, Tnorm, and speaker model synthesis.
This paper proposes an instantaneous speaker adaptation method that uses N-best decoding for continuous mixture-density hidden-Markov-model-based speech-recognition systems. This method is effective even for speakers whose decoding using speaker-independent (SI) models are error-prone and for whom speaker adaptation techniques are truly needed. In addition, smoothed estimation and utterance verification are introduced into this method. The smoothed estimation is based on the likelihood values for adapted models of word sequences obtained by N-best decoding and improves the performance of error-prone speakers, and the utterance verification technique reduces the amount of calculation required. Performance evaluation using connected-digit (four-digit strings) recognition experiments performed over actual telephone lines showed a reduction of 36·4% in the error rates of speakers whose decoding using SI models are error-prone.
This paper presents an extended study on the implementation of support vector machine (SVM) based speaker verification in systems that employ continuous progressive model adaptation using the weight-based factor analysis model. The weight-based factor analysis model compensates for session variations in unsupervised scenarios by incorporating trial confidence measures in the general statistics used in the inter-session variability modelling process. Employing weight-based factor analysis in Gaussian mixture models (GMMs) was recently found to provide significant performance gains to unsupervised classification. Further improvements in performance were found through the integration of SVM-based classification in the system by means of GMM supervectors.This study focuses particularly on the way in which a client is represented in the SVM kernel space using single and multiple target supervectors. Experimental results indicate that training client SVMs using a single target supervector maximises performance while exhibiting a certain robustness to the inclusion of impostor training data in the model. Furthermore, the inclusion of low-scoring target trials in the adaptation process is investigated where they were found to significantly aid performance.
This paper investigates the problem of updating over time the statistical language model (LM) of an Italian broadcast news transcription system. Statistical adaptation methods are proposed which try to cope with the complex dynamics of news by exploiting newswire texts daily available on the Internet. In particular, contemporary news reports are used to extend the lexicon of the LM, to minimize the out-of-vocabulary (OOV) word rate, and to adapt the n-gram probabilities. Experiments performed on 19 news shows, spanning a period of one month, showed relative reductions of 58% in OOV word rate, 16% in perplexity, and 4% in word error rate (WER).
Vocal tract length normalization (VTLN) for standard filterbank-based Mel frequency cepstral coefficient (MFCC) features is usually implemented by warping the center frequencies of the Mel filterbank, and the warping factor is estimated using the maximum likelihood score (MLS) criterion. A linear transform (LT) equivalent for frequency warping (FW) would enable more efficient MLS estimation. We recently proposed a novel LT to perform FW for VTLN and model adaptation with standard MFCC features. In this paper, we present the mathematical derivation of the LT and give a compact formula to calculate it for any FW function. We also show that our LT is closely related to different LTs previously proposed for FW with cepstral features, and these LTs for FW are all shown to be numerically almost identical for the sine-log all-pass transform (SLAPT) warping functions. Our formula for the transformation matrix is, however, computationally simpler and, unlike other previous LT approaches to VTLN with MFCC features, no modification of the standard MFCC feature extraction scheme is required. In VTLN and speaker adaptive modeling (SAM) experiments with the DARPA resource management (RM1) database, the performance of the new LT was comparable to that of regular VTLN implemented by warping the Mel filterbank, when the MLS criterion was used for FW estimation. This demonstrates that the approximations involved do not lead to any performance degradation. Performance comparable to front end VTLN was also obtained with LT adaptation of HMM means in the back end, combined with mean bias and variance adaptation according to the maximum likelihood linear regression (MLLR) framework. The FW methods performed significantly better than standard MLLR for very limited adaptation data (1 utterance), and were equally effective with unsupervised parameter estimation. We also performed speaker adaptive training (SAT) with feature space LT denoted CLTFW. Global CLTFW SAT gave results comparable to SAM and VTLN. By estimating multiple CLTFW transforms using a regression tree, and including an additive bias, we obtained significantly improved results compared to VTLN, with increasing adaptation data.
We introduce a strategy for modeling speaker variability in speaker adaptation based on maximum likelihood linear regression (MLLR). The approach uses a speaker-clustering procedure that models speaker variability by partitioning a large corpus of speakers in the eigenspace of their MLLR transformations and learning cluster-specific regression class tree structures. We present experiments showing that choosing the appropriate regression class tree structure for speakers leads to a significant reduction in overall word error rates in automatic speech recognition systems. To realize these gains in unsupervised adaptation, we describe an algorithm that produces a linear combination of MLLR transformations from cluster-specific trees using weights estimated by maximizing the likelihood of a speaker’s adaptation data. This algorithm produces small improvements in overall recognition performance across a range of tasks for both English and Mandarin. More significantly, distributional analysis shows that it reduces the number of speakers with performance loss due to adaptation across a range of adaptation data sizes and word error rates.
In this paper, we present our recent development of a model-domain environment robust adaptation algorithm, which demonstrates high performance in the standard Aurora 2 speech recognition task. The algorithm consists of two main steps. First, the noise and channel parameters are estimated using multi-sources of information including a nonlinear environment-distortion model in the cepstral domain, the posterior probabilities of all the Gaussians in speech recognizer, and truncated vector Taylor series (VTS) approximation. Second, the estimated noise and channel parameters are used to adapt the static and dynamic portions (delta and delta–delta) of the HMM means and variances. This two-step algorithm enables joint compensation of both additive and convolutive distortions (JAC). The hallmark of our new approach is the use of a nonlinear, phase-sensitive model of acoustic distortion that captures phase asynchrony between clean speech and the mixing noise.In the experimental evaluation using the standard Aurora 2 task, the proposed Phase-JAC/VTS algorithm achieves 93.32% word accuracy using the clean-trained complex HMM backend as the baseline system for the unsupervised model adaptation. This represents high recognition performance on this task without discriminative training of the HMM system. The experimental results show that the phase term, which was missing in all previous HMM adaptation work, contributes significantly to the achieved high recognition accuracy.
In this paper, an improved method of model complexity selection for nonnative speech recognition is proposed by using maximum a posteriori (MAP) estimation of bias distributions. An algorithm is described for estimating hyper-parameters of the priors of the bias distributions, and an automatic accent classification algorithm is also proposed for integration with dynamic model selection and adaptation. Experiments were performed on the WSJ1 task with American English speech, British accented speech, and mandarin Chinese accented speech. Results show that the use of prior knowledge of accents enabled more reliable estimation of bias distributions with very small amounts of adaptation speech, or without adaptation speech. Recognition results show that the new approach is superior to the previous maximum expected likelihood (MEL) method, especially when adaptation data are very limited.
Several ways for making the signal processing in an isolated word speech recognition system more robust against large variations in the background noise level are presented. Isolated word recognition systems are sensitive to accurate silence detection, and are easily overtrained on the specific noise circumstances of the training environment. Spectral subtraction provides good noise immunity in the cases where the noise level is lower or slightly higher in the testing environment than during training. Differences in residual noise energy after spectral subtraction between a clean training and noisy testing environment can still cause severe problems. The usability of spectral subtraction is largely increased if complemented with some extra noise immunity processing. This is achieved by the addition of artificial noise after spectral subtraction or by adaptively re-estimating the noise statistics during a training session. Both techniques are almost equally successful in dealing with the noise. Noise addition achieves the additional robustness that the system will never be allowed to learn about low amplitude events that might not be observable in all environments; this is achieved, however, at a cost that some information is consistently thrown away in the most favorable noise situations.
The Gaussian mixture model – Universal background model (GMM–UBM) system is one of the predominant approaches for text-independent speaker verification, because both the target speaker model and the impostor model (UBM) have generalization ability to handle “unseen” acoustic patterns. However, since GMM–UBM uses a common anti-model, namely UBM, for all target speakers, it tends to be weak in rejecting impostors’ voices that are similar to the target speaker’s voice. To overcome this limitation, we propose a discriminative feedback adaptation (DFA) framework that reinforces the discriminability between the target speaker model and the anti-model, while preserving the generalization ability of the GMM–UBM approach. This is achieved by adapting the UBM to a target speaker dependent anti-model based on a minimum verification squared-error criterion, rather than estimating the model from scratch by applying the conventional discriminative training schemes. The results of experiments conducted on the NIST2001-SRE database show that DFA substantially improves the performance of the conventional GMM–UBM approach.
In this work, we combine maximum mutual information parameter estimation with speaker-adapted training (SAT). As will be shown, this can be achieved by performing unsupervised estimation of speaker adaptation parameters on the test data, a distinct advantage for many recognition tasks involving conversational speech. We derive re-estimation formulae for the basic speaker-independent means and variances, the optimal regression class for each Gaussian component when multiple speaker-dependent linear transforms are used for adaptation, as well as the optimal feature-space transformation matrix for use with semi-tied covariance matrices. We also propose an approximation to the maximum likelihood and maximum mutual information SAT re-estimation formulae that greatly reduces the amount of disk space required to conduct training on corpora which contain speech from hundreds or thousands of speakers. We also present empirical evidence of the importance of combination speaker adaptation with discriminative training. In particular, on a subset of the data used for the NIST RT05 evaluation, we show that including maximum likelihood linear regression transformations in the MMI re-estimation formulae provides a WER of 35.2% compared with 39.1% obtained when speaker adaptation is ignored during discriminative training.
We present a system for model-based source separation for use on single channel speech mixtures where the precise source characteristics are not known a priori. The sources are modeled using hidden Markov models (HMM) and separated using factorial HMM methods. Without prior speaker models for the sources in the mixture it is difficult to exactly resolve the individual sources because there is no way to determine which state corresponds to which source at any point in time. This is solved to a small extent by the temporal constraints provided by the Markov models, but permutations between sources remains a significant problem. We overcome this by adapting the models to match the sources in the mixture. We do this by representing the space of speaker variation with a parametric signal model-based on the eigenvoice technique for rapid speaker adaptation. We present an algorithm to infer the characteristics of the sources present in a mixture, allowing for significantly improved separation performance over that obtained using unadapted source models. The algorithm is evaluated on the task defined in the 2006 Speech Separation Challenge [Cooke, M.P., Lee, T.-W., 2008. The 2006 Speech Separation Challenge. Computer Speech and Language] and compared with separation using source-dependent models. Although performance is not as good as with speaker-dependent models, we show that the system based on model adaptation is able to generalize better to held out speakers.
A novel speaker-adaptive learning algorithm is developed and evaluated for a hidden trajectory model of speech coarticulation and reduction. Central to this model is the process of bi-directional (forward and backward) filtering of the vocal tract resonance (VTR) target sequence. The VTR targets are key parameters of the model that control the hidden VTR’s dynamic behavior and the subsequent acoustic properties (those of the cepstral vector sequence). We describe two techniques for training these target parameters: (1) speaker-independent training that averages out the target variability over all speakers in the training set; and (2) speaker-adaptive training that takes into account the variability in the target values among individual speakers. The adaptive learning is applied also to adjust each unknown test speaker’s target values towards their true values. All the learning algorithms make use of the results of accurate VTR tracking as developed in our earlier work. In this paper, we present details of the learning algorithms and the analysis results comparing speaker-independent and speaker-adaptive learning. We also describe TIMIT phone recognition experiments and results, demonstrating consistent superiority of speaker adaptive learning over speaker-independent one measured by the phonetic recognition performance.
Discriminative classifiers are a popular approach to solving classification problems. However, one of the problems with these approaches, in particular kernel based classifiers such as support vector machines (SVMs), is that they are hard to adapt to mismatches between the training and test data. This paper describes a scheme for overcoming this problem for speech recognition in noise by adapting the kernel rather than the SVM decision boundary. Generative kernels, defined using generative models, are one type of kernel that allows SVMs to handle sequence data. By compensating the parameters of the generative models for each noise condition noise-specific generative kernels can be obtained. These can be used to train a noise-independent SVM on a range of noise conditions, which can then be used with a test-set noise kernel for classification. The noise-specific kernels used in this paper are based on Vector Taylor Series (VTS) model-based compensation. VTS allows all the model parameters to be compensated and the background noise to be estimated in a maximum likelihood fashion. A brief discussion of VTS, and the optimisation of the mismatch function representing the impact of noise on the clean speech, is also included. Experiments using these VTS-based test-set noise kernels were run on the AURORA 2 continuous digit task. The proposed SVM rescoring scheme yields large gains in performance over the VTS compensated models.
Top-cited authors
Philip C. Woodland
  • University of Cambridge
Hermann Ney
  • RWTH Aachen University
Shrikanth S Narayanan
  • University of Southern California
Thomas Kisler
  • Ludwig-Maximilians-University of Munich
Uwe Reichel
  • audEERING GmbH