P. V. de Souza’s research while affiliated with IBM Research and other places


Publications (59)


Decision-Tree Based Quantization Of The Feature Space Of A Speech Recognizer
[Figure 1: Reduction in entropy: a possible gain of 1 bit, diminishing as the tree is grown deeper.]
  • Article
  • Full-text available

September 2000 · 127 Reads · 10 Citations
L.R. Bahl · [...] · P. De Souza

We present a decision-tree based procedure to quantize the feature space of a speech recognizer, with the motivation of reducing the computation time required for evaluating Gaussians in a speech recognition system. The entire feature space is quantized into non-overlapping regions, where each region is bounded by a number of hyperplanes. Further, each region is characterized by the occurrence of only a small number of the total alphabet of allophones (sub-phonetic speech units); by identifying the region in which a test feature vector lies, only the Gaussians that model the density of allophones that exist in that region need be evaluated. The quantization of the feature space is done in a hierarchical manner using a binary decision tree. Each node of the decision tree represents a region of the feature space, and is further characterized by a hyperplane (a vector v_n and a scalar threshold value h_n) that subdivides the region corresponding to the current node into two non-overlapping...
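
The region lookup described above reduces to a chain of signed hyperplane tests. Below is a minimal sketch of that lookup (hypothetical structures, not the authors' code), assuming each internal node stores a normal vector v_n and threshold h_n, and each leaf stores the short list of allophone Gaussians present in its region:

```python
import numpy as np

class Node:
    """One region of the feature space; internal nodes split it with a hyperplane."""
    def __init__(self, v=None, h=None, left=None, right=None, gaussians=None):
        self.v, self.h = v, h            # hyperplane test: go left if v . x <= h
        self.left, self.right = left, right
        self.gaussians = gaussians       # at a leaf: indices of Gaussians to evaluate

def shortlist(root, x):
    """Descend the tree to the region containing x; only its Gaussians are scored."""
    node = root
    while node.gaussians is None:
        node = node.left if node.v @ x <= node.h else node.right
    return node.gaussians

# Toy 2-D example with a single split into two non-overlapping regions.
root = Node(v=np.array([1.0, -1.0]), h=0.0,
            left=Node(gaussians=[0, 3]),
            right=Node(gaussians=[1, 2, 4]))
print(shortlist(root, np.array([0.2, 0.9])))  # -> [0, 3]
```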



Experiments using data augmentation for speaker adaptation

January 1995 · 27 Reads · 8 Citations
Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing
P.V. de Souza · [...] · L.R. Bahl

Speaker adaptation typically involves customizing some existing (reference) models in order to account for the characteristics of a new speaker. This work considers the slightly different paradigm of customizing some reference data for the purpose of populating the new speaker's space, and then using the resulting (augmented) data to derive the customized models. The data augmentation technique is based on the metamorphic algorithm first proposed in Bellegarda et al. [1992], assuming that a relatively modest amount of data (100 sentences) is available from each new speaker. This constraint requires that reference speakers be selected with some care. The performance of this method is illustrated on a portion of the Wall Street Journal task.
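
In outline, the paradigm swaps "adapt the models" for "adapt the data": map previously acquired reference frames into the new speaker's space and pool them with the small enrolment set before training. A schematic sketch with hypothetical names (`speaker_map` stands in for the metamorphic mapping detailed in the next entry):

```python
import numpy as np

def augment_enrolment(new_frames, ref_frames, speaker_map):
    """Pool the new speaker's enrolment frames with reference frames that
    have been transformed into the new speaker's space; speaker-dependent
    models are then trained on the pooled (augmented) data."""
    return np.vstack([new_frames, speaker_map(ref_frames)])

# Toy usage with an identity mapping standing in for the real one.
pooled = augment_enrolment(np.zeros((100, 13)), np.ones((5000, 13)), lambda r: r)
print(pooled.shape)  # (5100, 13)
```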


The Metamorphic Algorithm: A Speaker Mapping Approach to Data Augmentation

August 1994 · 34 Reads · 43 Citations
IEEE Transactions on Speech and Audio Processing

Large vocabulary speaker-dependent speech recognition systems adjust to the acoustic peculiarities of each new speaker based on some enrolment data provided by this speaker. As the amount of data required increases with the sophistication of the underlying acoustic models, the enrolment may get lengthy. To streamline it, it is therefore desirable to make use of previously acquired speech data. The authors describe a data augmentation strategy based on a piecewise linear mapping between the feature space of a new speaker and that of a reference speaker. This speaker-normalizing mapping is used to transform the previously acquired data of the reference speaker onto the space of the new speaker. The performance of the resulting procedure, dubbed the metamorphic algorithm, is illustrated on an isolated-utterance speech recognition task with a vocabulary of 20,000 words. Results show that the metamorphic algorithm can substantially reduce the word error rate when only a limited amount of enrolment data is available. Alternatively, it leads to a level of performance comparable to that obtained when a much greater amount of enrolment data is required from the new speaker. In addition, it can also be used for tracking spectral evolution over time, thus providing a possible means for robust speaker self-adaptation.
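
The following is a minimal illustration of the piecewise linear idea, not the published algorithm: partition the reference speaker's feature space into regions (here with a toy nearest-centroid quantizer) and fit one least-squares affine map per region from time-aligned frame pairs, e.g. obtained by aligning common enrolment utterances.

```python
import numpy as np

def fit_piecewise_linear(ref, new, n_regions=4, seed=0):
    """Fit one affine map per region of the reference space from aligned
    frame pairs ref, new (both N x d)."""
    rng = np.random.default_rng(seed)
    centroids = ref[rng.choice(len(ref), n_regions, replace=False)]
    region = np.argmin(((ref[:, None] - centroids) ** 2).sum(-1), axis=1)
    maps = []
    for r in range(n_regions):
        X = np.hstack([ref[region == r], np.ones((np.sum(region == r), 1))])
        W, *_ = np.linalg.lstsq(X, new[region == r], rcond=None)
        maps.append(W)
    return centroids, maps

def apply_map(frames, centroids, maps):
    """Transform reference frames onto the new speaker's space, region by region."""
    region = np.argmin(((frames[:, None] - centroids) ** 2).sum(-1), axis=1)
    X = np.hstack([frames, np.ones((len(frames), 1))])
    return np.vstack([X[i] @ maps[region[i]] for i in range(len(frames))])
```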


Robust methods for using context-dependent features and models in a continuous speech recognizer

May 1994 · 52 Reads · 88 Citations

In this paper we describe the method we use to derive acoustic features that reflect some of the dynamics of frame-based parameter vectors. Models for such observations must be context-dependent. Such models were outlined in an earlier paper. Here we describe a method for using these models in a recognition system. The method is more robust than using continuous-parameter models in recognition. At the same time, it does not suffer from the possible information loss in vector-quantization-based systems.
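
The abstract does not spell out the feature construction here; purely for illustration, one standard way to capture the dynamics of frame-based parameter vectors is regression-based delta coefficients over a sliding window:

```python
import numpy as np

def delta(features, window=2):
    """Delta (difference) coefficients for a T x d feature matrix; a common
    textbook construction, not necessarily the one used in the paper."""
    T = len(features)
    pad = np.pad(features, ((window, window), (0, 0)), mode="edge")
    num = sum(k * (pad[window + k:T + window + k] - pad[window - k:T + window - k])
              for k in range(1, window + 1))
    return num / (2 * sum(k * k for k in range(1, window + 1)))
```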


A Method for the Construction of Acoustic Markov Models for Words

November 1993 · 309 Reads · 59 Citations
IEEE Transactions on Speech and Audio Processing

A technique for constructing Markov models for the acoustic representation of words is described. Word models are constructed from models of subword units called fenones. Fenones represent very short speech events and are obtained automatically through the use of a vector quantizer. The fenonic baseform for a word, i.e., the sequence of fenones used to represent the word, is derived automatically from one or more utterances of that word. Since the word models are all composed from a small inventory of subword models, training for large-vocabulary speech recognition systems can be accomplished with a small training script. A method for combining phonetic and fenonic models is presented. Results of experiments with speaker-dependent and speaker-independent models on several isolated-word recognition tasks are reported. The results are compared with those for phonetics-based Markov models and template-based dynamic programming (DP) matching.
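
As a toy illustration of the idea (the paper's actual procedure is more careful), a fenonic baseform can be read off an utterance by vector-quantizing each frame and collapsing consecutive repeats:

```python
import numpy as np

def fenonic_baseform(frames, codebook):
    """Label each frame with its nearest codebook entry (a fenone), then
    collapse runs to get the word's fenone sequence."""
    labels = np.argmin(((frames[:, None] - codebook) ** 2).sum(-1), axis=1)
    return [int(l) for i, l in enumerate(labels) if i == 0 or l != labels[i - 1]]

codebook = np.array([[0.0], [1.0]])
frames = np.array([[0.1], [0.0], [0.9], [1.1], [0.05]])
print(fenonic_baseform(frames, codebook))  # -> [0, 1, 0]
```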



Multonic Markov Word Models for Large Vocabulary Continuous Speech Recognition

August 1993 · 12 Reads · 13 Citations
IEEE Transactions on Speech and Audio Processing

A new class of hidden Markov models is proposed for the acoustic representation of words in an automatic speech recognition system. The models, built from combinations of acoustically based sub-word units called fenones, are derived automatically from one or more sample utterances of a word. Because they are more flexible than previously reported fenone-based word models, they lead to an improved capability of modeling variations in pronunciation. They are therefore particularly useful in the recognition of continuous speech. In addition, their construction is relatively simple, because it can be done using the well-known forward-backward algorithm for parameter estimation of hidden Markov models. Appropriate reestimation formulas are derived for this purpose. Experimental results obtained on a 5000-word vocabulary natural language continuous speech recognition task are presented to illustrate the enhanced power of discrimination of the new models.
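
For reference, the forward recursion at the heart of forward-backward training, in its generic textbook form (the paper derives its own reestimation formulas on top of this machinery):

```python
import numpy as np

def forward(pi, A, B, obs):
    """alpha[j] after step t is the probability of the first t+1 observations
    ending in state j; the sum over states gives the sequence likelihood."""
    alpha = pi * B[:, obs[0]]
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]
    return alpha.sum()

pi = np.array([1.0, 0.0])                  # initial state distribution
A = np.array([[0.9, 0.1], [0.0, 1.0]])     # transition probabilities
B = np.array([[0.8, 0.2], [0.3, 0.7]])     # emission probabilities
print(forward(pi, A, B, [0, 1, 1]))
```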


Estimating Hidden Markov Model Parameters So As To Maximize Speech Recognition Accuracy

February 1993 · 137 Reads · 70 Citations
IEEE Transactions on Speech and Audio Processing

The problem of estimating the parameter values of hidden Markov word models for speech recognition is addressed. It is argued that maximum-likelihood estimation of the parameters via the forward-backward algorithm may not lead to values which maximize recognition accuracy. An alternative estimation procedure called corrective training, which is aimed at minimizing the number of recognition errors, is described. Corrective training is similar to a well-known error-correcting training procedure for linear classifiers and works by iteratively adjusting the parameter values so as to make correct words more probable and incorrect words less probable. There are strong parallels between corrective training and maximum mutual information estimation; the relationship of these two techniques is discussed and a comparison is made of their performance. Although it has not been proved that the corrective training algorithm converges, experimental evidence suggests that it does, and that it leads to fewer recognition errors than can be obtained with conventional training methods.
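
Schematically, a corrective-training step looks like the error-correcting updates used for linear classifiers: parameters active in the correct word's decoding are boosted and those active in a close, incorrect competitor are penalized. The sketch below is a paraphrase of that idea, not the paper's exact update:

```python
def corrective_update(params, correct_counts, incorrect_counts, step=0.1):
    """One corrective step over sparse feature counts (dicts mapping a
    parameter id to how often it was used in the alignment)."""
    for f, c in correct_counts.items():
        params[f] = params.get(f, 0.0) + step * c   # make the correct word more probable
    for f, c in incorrect_counts.items():
        params[f] = params.get(f, 0.0) - step * c   # make the competitor less probable
    return params

print(corrective_update({}, {"p1": 2, "p2": 1}, {"p2": 1, "p3": 2}))
```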


Context dependent vector quantization for continuous speech recognition

January 1993 · 11 Reads · 43 Citations
Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

The authors present a method for designing a vector quantizer for speech recognition that uses decision networks constructed by examining the phonetic context to obtain models for classes in the quantizer. Diagonal Gaussian models are constructed for the vector quantizer classes at each terminal node of the network and are used to label speech parameter vectors during recognition. Experimental results indicate that this method leads to superior vector quantizers for continuous speech.
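
Labeling with such a quantizer means scoring a parameter vector under each class's diagonal Gaussian and picking the best class. A minimal sketch (inputs hypothetical; in the paper the means and variances come from the leaves of the phonetic decision network):

```python
import numpy as np

def vq_label(frame, means, variances):
    """Return the index of the diagonal Gaussian class with the highest
    log-likelihood for this frame (means, variances: K x d arrays)."""
    ll = -0.5 * (np.log(2 * np.pi * variances)
                 + (frame - means) ** 2 / variances).sum(axis=1)
    return int(np.argmax(ll))

means = np.array([[0.0, 0.0], [1.0, 1.0]])
variances = np.ones((2, 2))
print(vq_label(np.array([0.9, 1.2]), means, variances))  # -> 1
```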


Citations (30)


... Thus our proposed methods follow the filter-based approach. The most conventional filter-based approaches are Gini Index (GI) [15], Mutual Information (MI) [16], Chi-square Test (X2) [15] and Information Gain (IG) [17]. ...

Reference: Dimensionality Reduction for Sentiment Classification: Evolving for the Most Prominent and Separable Features
Maximum mutual information estimation of hidden Markov parameters for speech recognition
  • Citing Article
  • January 1988

... The smoothing we employ in the link grammar is motivated by the smoothing typically used for the trigram language model [BBdSM91]. The idea is to linearly combine the 3-gram estimators T_3, L_3 and D_3 with corresponding 2-gram, 1-gram and uniform estimators to obtain smooth distributions T_λ, L_λ and D_λ. ...

Tree-Based Smoothing Algorithm for a Trigram Language Speech Recognition Model
  • Citing Article

... Various cepstral mean and variance normalization techniques were introduced to reduce the channel variability in case of MFCC [83][84][85] and PLP [86]. One of the reasons for the longevity of HMM-GMM is the efficient introduction of discriminative training criteria in model training (maximum mutual information [87,88], minimum classification error [89] and minimum phoneme error [90]). However, discriminative models gain accuracy if the number of observations per parameter is sufficiently large [91,92]. ...

Maximum mutual information estimation of Hidden Markov model parameters for speech recognition
  • Citing Article
  • January 1986

Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

... Quantization approaches are mostly applied to hidden layer parameter compression and few works discuss its application to embedding matrices. Vector quantization has also been successfully used in speech recognition [29] [30] and computer vision [31]. However, the naive vector quantization method requires a global structure in high dimension space for good performance [10]. ...

Context dependent vector quantization for continuous speech recognition
  • Citing Article
  • January 1993

Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

... Thus, with an objective of addressing these issues, numerous efforts have been discussed by researchers through effective utilization in speech applications. These applications further employ standard statistical approaches including maximum A posteriori (MAP) estimation [12] and data augmentation [13]. However, the MAP estimation approach is not capable of building a statistical model where there does not exist enough training data or prior information distribution. ...

Experiments using data augmentation for speaker adaptation
  • Citing Article
  • January 1995

Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

... words. As we will see in Section 2, computation with SCRFs involves summing and/or maximizing over all possible segmentations of the observations into words. Therefore, it is convenient to divide the computation between a "fast match" that quickly identifies possible words and their boundaries, and a "detailed match" that applies the full model [6]. In SCARF, the fast match may be done externally with an HMM system, and provided in the form of a lattice. Alternatively, SCARF implements a TF-IDF based fast match that finds potential words based on the TF-IDF similarity between the observations in a detection stream and those expected on the basis of dictionary pronunciations. This ...

A Fast Match for Continuous Speech Recognition Using Allophonic Models
  • Citing Conference Paper
  • January 1992

... Speech recognition systems (Averbuch, Bahl 1987; Goffin, Allauzen 2005; Huang, Acero 1995; Paul 1989; Velkei and Vicsi 2004) convert the human voice into a code that a machine can interpret. The simplest speech recognizers transform the acoustic output into written text without being able to understand the meaning it carries. ...

Experiments with the Tangora 20,000 word speech recognizer
  • Citing Conference Paper
  • May 1987

... As a result, in an uncomfortable iterative process, network retraining and HMM realignments are alternated to provide targets that are more exact. Direct training of HMM-neural network hybrids has been done using full-sequence training techniques like Maximum Mutual Information to increase the likelihood of accurate transcription (Bahl et al., 1986). However, these methods can only be used to retrain a system that has already been trained at the frame level, and they necessitate the careful adjustment of several hyperparameters, often many more than for deep neural networks. ...

Maximum mutual information estimation of hidden Markov parameters for speech recognition

Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing