G. Zavaliagkos's research while affiliated with Raytheon BBN Technologies and other places

Publications (34)

Article
Until recently, state-of-the-art large-vocabulary continuous speech recognition (CSR) had employed hidden Markov modeling (HMM) to model speech sounds. In an attempt to improve over HMM, we developed a hybrid system that integrates HMM technology with neural networks. We present the concept of a Segmental Neural Net (SNN) for phonetic modeling i...
Article
In the past few years, the Large Vocabulary Conversational Speech Recognition (LVCSR) community has attempted to address the problem of speech recognition on languages other than English. Work on the CallHome Corpora has verified that current technology is largely language independent, and that the dominant factor with regard to performance on a c...
Article
According to discourse theories in linguistics, conversational utterances possess an informational structure. That is, each sentence consists of two components: the given and the new. The given refers to information that has previously been conveyed in the conversation such as that in That's interesting. The new section of a sentence introduces add...
Article
In the early 1990s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech...
Conference Paper
In continuous density hidden Markov models (HMMs) for speech recognition, the probability density function (PDF) for each state is usually expressed as a mixture of Gaussians. We present a model in which the PDF is expressed as the convolution of two densities. We focus on the special case where one of the convolved densities is a M-Gaussian mixtur...
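The convolution model admits a closed form in the simplest case: convolving an M-Gaussian mixture with another Gaussian yields an M-Gaussian mixture whose component means and variances are sums of the originals. A minimal one-dimensional sketch (the parameter values are illustrative, not from the paper):

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Evaluate a 1-D Gaussian density."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)

def convolved_mixture_pdf(x, weights, means, variances, conv_mean, conv_var):
    """Density of the sum of a mixture-distributed variable and an
    independent Gaussian: each component picks up the convolving
    density's mean and variance; the weights are unchanged."""
    return sum(w * gauss_pdf(x, m + conv_mean, v + conv_var)
               for w, m, v in zip(weights, means, variances))

# Example: a 2-component mixture convolved with N(0, 0.5)
x = np.linspace(-5.0, 5.0, 11)
p = convolved_mixture_pdf(x, [0.4, 0.6], [-1.0, 2.0], [1.0, 0.5], 0.0, 0.5)
```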
Article
This paper describes the improvements that resulted in the 1998 Byblos large vocabulary conversational speech recognition (LVCSR) system. Salient among these improvements are: improved signal processing, improved hidden Markov model (HMM) topology, use of quinphone context, introduction of diagonal speaker adapted training (DSAT), incorporation of...
Conference Paper
Full-text available
Accurate modelling of pronunciation variability in conversational speech is an important component of an automatic speech recognition system. We describe some of the projects undertaken in this direction during and after WS97, the Fifth LVCSR Summer Workshop, held at Johns Hopkins University, Baltimore, in July-August, 1997. We first illustrate a us...
Conference Paper
This paper presents the 1997 BBN Byblos large vocabulary speech recognition (LVCSR) system. We give an outline of the algorithms and procedures used to train the system, describe the recognizer configuration and present the major technological innovations that led to the performance improvements. The major testbed we present our results for is the...
Conference Paper
According to discourse theories in linguistics, conversational utterances possess an informational structure that partitions each sentence into two portions: a “given” and “new”. We explore this idea by building sub-sentence discourse language models for conversational speech recognition. The internal sentence structure is captured in statistical l...
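As a toy illustration of the given/new decomposition (the split point and the unigram models below are invented for the sketch; the paper's sub-sentence models are richer):

```python
import math
from collections import Counter

# Toy "given"/"new" sub-sentence language models: one unigram model per
# discourse region, estimated from hand-split training text. The split
# point is assumed known here, purely for illustration.
given_counts = Counter("that is interesting you know".split())
new_counts   = Counter("we bought a new car yesterday".split())

def unigram_logprob(words, counts, alpha=1.0, vocab=50):
    """Add-alpha smoothed unigram log-probability of a word sequence."""
    total = sum(counts.values())
    return sum(math.log((counts[w] + alpha) / (total + alpha * vocab))
               for w in words)

def score(sentence, boundary):
    """Score the given portion with the given-model and the remainder
    with the new-model, as in a sub-sentence discourse LM."""
    words = sentence.split()
    return (unigram_logprob(words[:boundary], given_counts)
            + unigram_logprob(words[boundary:], new_counts))

print(score("that is interesting we bought a car", 3))
```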
Article
In the early '90s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech...
Article
INTRODUCTION Pronunciations in spontaneous, conversational speech tend to be much more variable than in careful read speech where pronunciations of words are more likely to adhere to their citation forms. Most speech recognition systems, however, rely on pronouncing dictionaries which contain few alternate pronunciations for most words. This limita...
Conference Paper
Accurate modelling of pronunciation variability in conversational speech is an important component of automatic speech recognition. We describe some of the projects undertaken in this direction at WS97 [the Fifth LVCSR (large-vocabulary conversational speech recognition) Summer Workshop], held at Johns Hopkins University, Baltimore, in July-Augu...
Conference Paper
This paper explores techniques for utilizing untranscribed training data pools to increase the available training data for automatic speech recognition systems. It has been well established that current speech recognition technology, especially in Large Vocabulary Conversational Speech Recognition (LVCSR), is largely language independent, and tha...
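The general shape of such a scheme is a confidence-filtered self-training loop. The sketch below is a generic outline, not the paper's procedure; decode(), retrain(), and the confidence threshold are hypothetical stand-ins for a real recognizer's API:

```python
# Hypothetical self-training loop for folding untranscribed audio into
# acoustic-model training. All model methods here are assumed, not real.

def self_train(model, labeled_data, unlabeled_audio, threshold=0.8, rounds=2):
    train_set = list(labeled_data)
    for _ in range(rounds):
        for utt in unlabeled_audio:
            hyp, conf = model.decode(utt)     # 1-best transcript + confidence
            if conf >= threshold:             # keep only confident decodes
                train_set.append((utt, hyp))
        model = model.retrain(train_set)      # re-estimate on the enlarged set
    return model
```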
Conference Paper
Speaker adaptation is the process of transforming some speaker-independent acoustic model in such a way as to more closely match the characteristics of a particular speaker. It has been shown by several researchers to be an effective means of improving the performance of large vocabulary continuous speech recognition systems. Until very recently sp...
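One common instance of such a transform is an MLLR-style affine map of the Gaussian means. The sketch below fits a single global transform by least squares and is a simplification: real estimation weights each frame by its state posterior, and the data here are synthetic.

```python
import numpy as np

def estimate_mean_transform(si_means, adapt_means):
    """Least-squares fit of adapt ~= W @ [1; mu] over all Gaussians,
    giving a d x (d+1) affine transform of the SI means."""
    ext = np.hstack([np.ones((len(si_means), 1)), si_means])  # prepend bias
    W, *_ = np.linalg.lstsq(ext, adapt_means, rcond=None)     # (d+1) x d
    return W.T                                                # d x (d+1)

si = np.random.randn(100, 13)            # 100 Gaussian means, 13-dim cepstra
target = si @ np.diag(np.linspace(0.9, 1.1, 13)) + 0.3  # synthetic speaker shift
W = estimate_mean_transform(si, target)
adapted = (W @ np.hstack([np.ones((len(si), 1)), si]).T).T
```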
Conference Paper
We formulate a novel approach to speaker adaptation. It is predicated upon the fact that the cepstral coefficients used as feature vectors in most state-of-the-art speech recognition systems are coefficients of a Laurent series, and hence represent an analytic function of a complex-valued argument. This analytic function can be characterized by sev...
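To make the premise concrete: on the unit circle the cepstra are the Fourier-series coefficients of the log-spectrum, and the same coefficients define a Laurent series that is analytic in an annulus. This is the standard identity behind the claim, sketched here rather than the paper's own derivation:

```latex
\[
  \log S(z) = \sum_{n=-\infty}^{\infty} c_n\, z^{-n},
  \qquad
  c_n = \frac{1}{2\pi} \int_{-\pi}^{\pi}
        \log S\!\left(e^{j\omega}\right) e^{j\omega n}\, d\omega .
\]
```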
Conference Paper
We present a framework for maximum a posteriori (MAP) adaptation of large-scale HMM recognizers. First we review the standard MAP adaptation for Gaussian mixtures. We then show how MAP can be used to estimate transformations which are shared across many parameters. Finally, we combine both techniques: each of the HMM models is adapted based on an...
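The standard MAP re-estimate of a Gaussian mean that such a review covers interpolates between the prior mean and the adaptation data, with the prior acting as tau virtual frames. A minimal sketch (tau and the hard alignment are illustrative choices):

```python
import numpy as np

def map_adapt_mean(mu0, frames, gammas, tau=10.0):
    """mu_hat = (tau * mu0 + sum_t gamma_t x_t) / (tau + sum_t gamma_t):
    with little data the estimate stays near the prior mean mu0, and it
    approaches the sample mean as the posterior counts grow."""
    gammas = np.asarray(gammas)[:, None]      # per-frame state posteriors
    num = tau * mu0 + (gammas * frames).sum(axis=0)
    return num / (tau + gammas.sum())

mu0 = np.zeros(13)
frames = np.random.randn(50, 13) + 0.5        # adaptation data
gammas = np.ones(50)                          # hard alignment for the sketch
print(map_adapt_mean(mu0, frames, gammas)[:3])
```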
Article
We present a framework for maximum a posteriori adaptation of large scale HMM speech recognizers. In this framework, we introduce mechanisms that take advantage of correlations present among HMM parameters in order to maximize the number of parameters that can be adapted by a limited number of observations. We are also separately exploring the feas...
Conference Paper
Full-text available
We attempted to improve recognition accuracy by reducing the inadequacies of the lexicon and language model. Specifically, we address the following three problems: (1) the best size for the lexicon, (2) conditioning written text for spoken language recognition, and (3) using additional training outside the text distribution. We found that increasing...
Article
The current state-of-the-art in large-vocabulary, continuous speech recognition is based on the use of hidden Markov models (HMM). In an attempt to improve over HMM performance, the authors developed a hybrid system that combines the advantages of neural networks and HMM using a multiple hypothesis (or N-best) paradigm. The connectionist component...
Conference Paper
Full-text available
We developed a faster search algorithm that avoids the use of the N-Best paradigm until after more powerful knowledge sources have been used. We found, however, that there was little or no decrease in word errors. We then showed that the use of the N-Best paradigm is still essential for the use of still more powerful knowledge sources, and for seve...
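An N-best rescoring pass of the kind described has a simple core: re-rank the list under a weighted combination of knowledge-source scores. A sketch with illustrative weights, not BBN's actual configuration:

```python
# N-best rescoring: the recognizer emits an N-best list with acoustic
# scores; stronger (slower) knowledge sources such as a bigger language
# model then re-rank it.

def rescore_nbest(nbest, lm_score, lm_weight=12.0, word_penalty=-0.5):
    """nbest: list of (words, acoustic_logprob); lm_score: words -> logprob."""
    rescored = []
    for words, ac in nbest:
        total = ac + lm_weight * lm_score(words) + word_penalty * len(words)
        rescored.append((total, words))
    return max(rescored)[1]        # hypothesis with the best combined score

nbest = [(["the", "cat", "sat"], -120.0), (["a", "cat", "sat"], -119.5)]
best = rescore_nbest(nbest, lm_score=lambda ws: -2.0 * len(ws))
```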
Conference Paper
The authors present the concept of a `segmental neural net' (SNN) for phonetic modeling in continuous speech recognition (CSR) and demonstrate how this can be used with a multiple hypothesis (or N-Best) paradigm to combine different CSR systems. In particular, they have developed a system that combines the SNN with a hidden Markov model (HMM) syste...
Conference Paper
Until recently, state-of-the-art, large-vocabulary, continuous speech recognition has employed hidden Markov modeling (HMM) to model speech sounds. The authors previously (ICASSP-92 p.625-8) presented the concept of a segmental neural network (SNN) for phonetic modeling in continuous speech recognition and demonstrated that a feedforward neural net...
Conference Paper
Full-text available
The authors present the concept of a `segmental neural net' (SNN) for phonetic modeling in continuous speech recognition (CSR) and show how this can be used, together with a hidden Markov model (HMM) system, to improve CSR performance. The SNN is a segment-based model that uses a neural network to correlate features of the speec...
Conference Paper
The authors present the concept of a segmental neural net (SNN) for phonetic modeling in continuous speech recognition (CSR) and demonstrate how this can be used with a multiple hypothesis (or N-Best) paradigm to combine different CSR systems. In particular, the authors developed a system that combines the SNN with a hidden Markov model (HMM) syst...
Article
The authors describe four different ways in which they used the N-Best paradigm within the BYBLOS system. The most obvious use is for the efficient integration of speech recognition with a linguistic natural language understanding module. However, the authors have extended this principle to several other acoustic knowledge sources. They also descri...
Article
Full-text available
In an effort to advance the state of the art in continuous speech recognition employing hidden Markov models (HMM), Segmental Neural Nets (SNN) were introduced recently to ameliorate the well-known limitations of HMMs, namely, the conditional-independence limitation and the relative difficulty with which HMMs can handle segmental features. We descr...

Citations

... This improves the performance of the identification system in the presence of convolutive and additive noise [7]. Cepstral mean normalization minimizes the degradation in perceived quality of speech by channel equalization [8]. ...
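Cepstral mean normalization itself is a one-line operation: a fixed channel filter adds a constant to the log-spectral, hence cepstral, features, so subtracting each utterance's cepstral mean removes it. A minimal sketch:

```python
import numpy as np

def cepstral_mean_normalize(cepstra):
    """cepstra: (num_frames, num_coeffs) array of cepstral features.
    Subtracting the per-utterance mean of each coefficient cancels any
    fixed convolutional channel."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

frames = np.random.randn(200, 13) + 3.0        # fake channel offset of +3
normalized = cepstral_mean_normalize(frames)   # each column now has mean ~0
```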
... In both of these applications, the amount of a priori knowledge available for the best human-to-machine communication is much stronger on both sides than, for example, in the case of brain-computer interfacing. Nonetheless, fine-tuning is needed in a coadaptive manner for optimal results (Zavaliagkos et al., 1995). ...
... For small and medium vocabulary systems, the most widely used approach is time-synchronous Viterbi beam search, which is based on a dynamic programming algorithm (Ney 1984). It can be extended to large vocabularies by adding strategies such as dynamic decoding, multi-pass search (Murveit et al. 1993) and N-best rescoring (Schwartz et al. 1992). ...
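A minimal rendering of time-synchronous Viterbi beam search: advance all active states one frame at a time and prune hypotheses falling more than a fixed beam below the frame's best score. The dict-based HMM encoding is an illustrative choice:

```python
import math

def viterbi_beam(obs_loglik, trans_logprob, init_logprob, beam=10.0):
    """obs_loglik: list of {state: frame log-likelihood};
    trans_logprob: {state: {next_state: log-prob}};
    init_logprob: {state: log-prob}. Returns the best final state."""
    active = dict(init_logprob)
    for frame in obs_loglik:
        nxt = {}
        for s, score in active.items():
            for s2, tp in trans_logprob.get(s, {}).items():
                cand = score + tp + frame[s2]
                if cand > nxt.get(s2, -math.inf):
                    nxt[s2] = cand                 # best predecessor wins
        best = max(nxt.values())
        # beam pruning: drop hypotheses far below the frame's best score
        active = {s: v for s, v in nxt.items() if v >= best - beam}
    return max(active, key=active.get)
```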
... Numerous studies have investigated the automation of pronunciation variations. Statistical decision trees to generate alternate word pronunciations were used in (Riley et al., 1999). A phonetic-feature-based prediction model is presented in (Bates et al., 2007). ...
... In short, the hidden-event model does not support our principles 1 and 2 above. Ma and colleagues (Ma et al., 2000), noting that one can build language models specific for certain domains or for certain dialog acts, proposed an even finer-grained decomposition, breaking each utterance into given and new parts, and using a separate language model for each. This technique gave a 0.3% decrease in word error rate. ...
... Mel Frequency Cepstral Coefficients (MFCC) provide a relatively low-dimensional representation via the Mel scale filter bank (Graciarena et al. 2010), which consists of linearly (below 1 kHz) and logarithmically (above 1 kHz) spaced Mel scale filters. MFCC have been useful for human speech recognition (Makhoul and Schwartz 1995, Muda et al. 2010, Priyadarshani et al. 2012), and extended to animal vocalisations (Kogan and Margoliash 1998, Clemins and Johnson 2003, Fox et al. 2006, Lee et al. 2006, Briggs et al. 2009, Stattner et al. 2013). Mostly, MFCC are used with their first- and second-order derivatives in order to capture dynamic features of the vocal tract. ...
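A typical 39-dimensional front end of this kind, sketched with librosa (assumed available; the file name and parameter values are illustrative):

```python
import numpy as np
import librosa  # any MFCC front end works the same way

# 13 MFCCs plus first- and second-order derivatives, the dynamic
# features the citing passage describes. Shapes: (n_mfcc, frames) each.
y, sr = librosa.load("utterance.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta1 = librosa.feature.delta(mfcc)             # first-order derivative
delta2 = librosa.feature.delta(mfcc, order=2)    # second-order derivative
features = np.vstack([mfcc, delta1, delta2])     # 39-dim feature vectors
```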
... PSOLA), reliable segmentation and labeling of large speech databases is required. Also, as ASR increasingly uses pronunciation modeling [10,11,6,9,8,7] the demand for statistically based pronunciation models in different languages is growing. To segment or recognize databases of a language, we usually have to (mostly statistically) train the system on the relevant acoustic models, which is an expensive and time-consuming process, because a huge amount of data is needed. ...
... We used the ADI-5 [81] and CALLHOME [82,83] datasets to probe the channel classification task. Our multi-class classifier uses labels that indicate the input signal quality as satellite recording (SatQ), high-quality archived video (HQ), or telephony data (TF). ...
... It is well known (e.g. [4,8,14]) that the performance of these systems improves dramatically if the model parameters are suitably adapted to test conditions, particularly when there is a mismatch between the acoustic environment or the speaker characteristics in the training and test speech. However, due to the large number of the parameters to be adjusted in comparison with the amount of realistically available adaptation data, one usually takes recourse to tying (perhaps hierarchically) the adjustments of individual acoustic models. ...
... In order to take into account the coexistence of such different modes, a simple and efficient solution is to use several HMM models, where each one represents one possible mode, and then combine them into a unified model. This idea has been introduced and investigated in the handwriting, speech and gesture recognition literature [66,82,81,72,61,51,98]. For example, Iyer et al. combined several individual HMM models to form a so-called parallel-path HMM model and applied it to modeling speech trajectories. ...