K. Livescu
Toyota Technological Institute at Chicago (TTIC)

About

70 Publications
14,026 Reads
4,676 Citations
Additional affiliations
February 2008 - present
Toyota Technological Institute at Chicago
Position
  • Assistant Professor
September 2005 - December 2007
Massachusetts Institute of Technology
Position
  • Postdoctoral Lecturer
September 1997 - August 2005
Massachusetts Institute of Technology
Position
  • PhD Student

Publications (70)
Article
Full-text available
Aspects of speech production have provided inspiration for ideas in speech technologies throughout the history of speech processing research. This special issue was inspired by the 2013 Workshop on Speech Production in Automatic Speech Recognition in Lyon, France, and this introduction provides an overview of the included papers in the context of th...
Article
Full-text available
In this paper, we show how to create paraphrastic sentence embeddings using the Paraphrase Database (Ganitkevitch et al., 2013), an extensive semantic resource with millions of phrase pairs. We consider several compositional architectures and evaluate them on 24 textual similarity datasets encompassing domains such as news, tweets, web forums, news...
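The simplest of the compositional architectures considered in this line of work is word averaging. A minimal sketch, assuming hypothetical pre-trained vectors rather than the paper's trained embeddings:

```python
import numpy as np

def embed_sentence(tokens, word_vectors, dim=300):
    """Average the word vectors of in-vocabulary tokens (hypothetical data)."""
    vecs = [word_vectors[t] for t in tokens if t in word_vectors]
    return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

def cosine(u, v):
    """Cosine similarity, the usual textual-similarity score."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))
```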
Article
Canonical correlation analysis (CCA) is a fundamental technique in multi-view data analysis and representation learning. Several nonlinear extensions of the classical linear CCA method have been proposed, including kernel and deep neural network methods. These approaches restrict attention to certain families of nonlinear projections, which the use...
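For reference, the classical linear CCA that these methods extend has a closed-form solution via a whitened cross-covariance SVD. A minimal NumPy sketch; the regularizer r and the interface are illustrative assumptions:

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def linear_cca(X, Y, k=2, r=1e-4):
    """Top-k canonical projections of views X (n x dx) and Y (n x dy)."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + r * np.eye(X.shape[1])   # regularized covariances
    Syy = Y.T @ Y / n + r * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    Lx = cholesky(Sxx, lower=True)
    Ly = cholesky(Syy, lower=True)
    # Whitened cross-covariance T = Lx^{-1} Sxy Ly^{-T}; its singular
    # values are the canonical correlations.
    T = solve_triangular(Lx, Sxy, lower=True)
    T = solve_triangular(Ly, T.T, lower=True).T
    U, s, Vt = np.linalg.svd(T)
    A = solve_triangular(Lx.T, U[:, :k], lower=False)     # x-view projection
    B = solve_triangular(Ly.T, Vt.T[:, :k], lower=False)  # y-view projection
    return A, B, s[:k]
```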
Article
Kernel canonical correlation analysis (KCCA) is a fundamental method with broad applicability in statistics and machine learning. Although there exists a closed-form solution to the KCCA objective, obtained by solving an $N\times N$ eigenvalue system where $N$ is the training set size, the computational requirements of this approach in both memory and time proh...
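For context, the closed-form solution mentioned comes from a generalized eigenvalue problem in the $N\times N$ kernel matrices. One common regularized formulation (the exact regularization shown here is an assumption; variants exist) is:

```latex
% One common regularized KCCA objective; K_x, K_y are the N x N kernel
% matrices of the two views and r > 0 is a regularization constant.
\max_{\alpha,\beta\in\mathbb{R}^N}\;
\frac{\alpha^\top K_x K_y\,\beta}
     {\sqrt{\alpha^\top (K_x + rI)^2\,\alpha}\;
      \sqrt{\beta^\top (K_y + rI)^2\,\beta}}
```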
Article
Full-text available
We consider the setting in which we train a supervised model that learns task-specific word representations. We assume that we have access to some initial word representations (e.g., unsupervised embeddings), and that the supervised learning procedure updates them to task-specific representations for words contained in the training data. But what a...
Article
Deep CCA is a recently proposed deep neural network extension to the traditional canonical correlation analysis (CCA), and has been successful for multi-view representation learning in several domains. However, stochastic optimization of the deep CCA objective is not straightforward, because it does not decouple over training examples. Previous opt...
Article
Recent studies have been revisiting whole words as the basic modelling unit in speech recognition and query applications, instead of phonetic units. Such whole-word segmental systems rely on a function that maps a variable-length speech segment to a vector in a fixed-dimensional space; the resulting acoustic word embeddings need to allow for accura...
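To make the mapping concrete, here is a minimal sketch of a simple fixed-dimensional acoustic word embedding via uniform downsampling of frames, a common baseline in this literature rather than necessarily the paper's method:

```python
import numpy as np

def embed_segment(frames, k=10):
    """frames: (T, d) acoustic frames; returns a fixed (k*d,) embedding."""
    idx = np.linspace(0, len(frames) - 1, k).round().astype(int)
    return frames[idx].reshape(-1)   # concatenate k uniformly spaced frames
```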
Article
Full-text available
Discriminative segmental models, such as segmental conditional random fields (SCRFs) and segmental structured support vector machines (SSVMs), have had success in speech recognition via both lattice rescoring and first-pass decoding. However, such models suffer from slow decoding, hampering the use of computationally expensive features, such as seg...
Article
Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes. This is one factor in the poor performance of automatic speech recognizers on conversational speech, and it has been very difficult to mitigate in traditional phone-b...
Article
Full-text available
The Paraphrase Database (PPDB; Ganitkevitch et al., 2013) is an extensive semantic resource, consisting of a list of phrase pairs with (heuristic) confidence estimates. However, it is still unclear how it can best be used, due to the heuristic nature of the confidences and its necessarily incomplete coverage. We propose models to leverage the phras...
Conference Paper
Full-text available
Word representations have proven useful for many NLP tasks, e.g., Brown clusters as features in dependency parsing (Koo et al., 2008). In this paper, we investigate the use of continuous word representations as features for dependency parsing. We compare several popular embeddings to Brown clusters, via multiple types of features, in both news and...
Conference Paper
Full-text available
Previous work has shown that acoustic features can be improved by unsupervised learning of transformations based on canonical correlation analysis (CCA) using articulatory measurements that are available at training time. In this paper, we investigate whether this second view (articulatory data) still helps even when labels are also available at tr...
Conference Paper
Full-text available
Segmental models such as segmental conditional random fields have had some recent success in lattice rescoring for speech recognition. They provide a flexible framework for incorporating a wide range of features across different levels of units, such as phones and words. However, such models have mainly been trained by maximizing conditional likel...
Conference Paper
Measures of acoustic similarity between words or other units are critical for segmental exemplar-based acoustic models, spoken term discovery, and query-by-example search. Dynamic time warping (DTW) alignment cost has been the most commonly used measure, but it has well-known inadequacies. Some recently proposed alternatives require large amounts o...
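For context, the DTW alignment cost used as the baseline measure here can be sketched as follows; the Euclidean local distance and length normalization are common choices, assumed for illustration:

```python
import numpy as np

def dtw_cost(X, Y):
    """X: (n, d), Y: (m, d) frame sequences; returns alignment cost."""
    n, m = len(X), len(Y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])  # local frame distance
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalized total cost
```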
Conference Paper
Recognition of gesture sequences is in general a very difficult problem, but in certain domains the difficulty may be mitigated by exploiting the domain's "grammar". One such grammatically constrained gesture sequence domain is sign language. In this paper we investigate the case of finger spelling recognition, which can be very challenging due t...
Conference Paper
Full-text available
One of the most popular speech recognition architectures consists of multiple components (like the acoustic, pronunciation and language models) that are modeled as weighted finite state transducer (WFST) factors in a cascade. These factor WFSTs are typically trained in isolation and combined efficiently for decoding. Recent work has explored joint...
Conference Paper
Full-text available
We introduce Deep Canonical Correlation Analysis (DCCA), a method to learn complex nonlinear transformations of two views of data such that the resulting representations are highly linearly correlated. Parameters of both transformations are jointly learned to maximize the (regularized) total correlation. It can be viewed as a nonlinear extension o...
Conference Paper
Full-text available
We study spoken term detection (STD) – the task of determining whether and where a given word or phrase appears in a given segment of speech – using articulatory feature-based pronunciation models. The models are motivated by the requirements of STD in low-resource settings, in which it may not be feasible to train a large-vocabulary continuous s...
Conference Paper
Full-text available
Canonical correlation analysis (CCA) and kernel CCA can be used for unsupervised learning of acoustic features when a second view (e.g., articulatory measurements) is available for some training data, and such projections have been used to improve phonetic frame classification. Here we study the behavior of CCA-based acoustic features on the task...
Conference Paper
Full-text available
We study the recognition of fingerspelling sequences in American Sign Language from video using tandem-style models, in which the outputs of multilayer perceptron (MLP) classifiers are used as observations in a hidden Markov model (HMM)-based recognizer. We compare a baseline HMM-based recognizer, a tandem recognizer using MLP letter classifiers,...
Article
Full-text available
Modern automatic speech recognition systems handle large vocabularies of words, making it infeasible to collect enough repetitions of each word to train individual word models. Instead, large-vocabulary recognizers represent each word in terms of sub-word units. Typically the sub-word unit is the phone, a basic speech sound such as a single conson...
Article
Modern automatic speech recognition systems handle large vocabularies of words, making it infeasible to collect enough repetitions of each word to train individual word models. Instead, large-vocabulary recognizers represent each word in terms of subword units. Typically the subword unit is the phone, a basic speech sound such as a single consonant...
Conference Paper
Full-text available
We study PCA, PLS, and CCA as stochastic optimization problems, of optimizing a population objective based on a sample. We suggest several stochastic approximation (SA) methods for PCA and PLS, and investigate their empirical performance.
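As one concrete instance of the stochastic-approximation methods studied, Oja's rule estimates the top principal component from one sample at a time. A minimal sketch; the step-size schedule and interface are assumptions:

```python
import numpy as np

def oja_top_component(samples, dim, eta0=0.1):
    """Estimate the top principal component from a stream of samples."""
    w = np.random.randn(dim)
    w /= np.linalg.norm(w)
    for t, x in enumerate(samples, start=1):
        w += (eta0 / t) * (x @ w) * x   # stochastic update toward the top PC
        w /= np.linalg.norm(w)          # project back to the unit sphere
    return w
```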
Conference Paper
Full-text available
We address the problem of learning the mapping between words and their possible pronunciations in terms of sub-word units. Most previous approaches have involved generative modeling of the distribution of pronunciations, usually trained to maximize likelihood. We propose a discriminative, feature-rich approach using large-margin learning....
Conference Paper
Full-text available
We study spoken term detection—the task of determining whether and where a given word or phrase appears in a given segment of speech—in the setting of limited training data. This setting is becoming increasingly important as interest grows in porting spoken term detection to multiple low-resource languages and acoustic environments. We propose a di...
Conference Paper
Full-text available
This paper describes an approach to efficiently construct, and discriminatively train, a weighted finite state transducer (WFST) representation for an articulatory feature-based model of pronunciation. This model was originally implemented as a dynamic Bayesian network (DBN). The work is motivated by a desire to (1) incorporate such a pronunciation...
Conference Paper
Full-text available
We consider the problem of learning a linear transformation of acoustic feature vectors for phonetic frame classification, in a setting where articulatory measurements are available at training time. We use the acoustic and articulatory data together in a multi-view learning approach, in particular using canonical correlation analysis to learn lin...
Article
Full-text available
We investigate joint models of articulatory features and apply these models to the problem of automatically generating articulatory transcriptions of spoken utterances given their word transcriptions. The task is motivated by the need for larger amounts of labeled articulatory data for both speech recognition and linguistics research, which is cost...
Article
The automatic speech recognition research community has experimented with models of speech articulation for several decades, but such models have not yet made it into mainstream recognition systems. The difficulties of adopting articulatory models include their relative complexity and dearth of data, compared to traditional phone-based models and d...
Conference Paper
Full-text available
We address the problem of pronunciation variation in conversational speech with a context-dependent articulatory feature-based model. The model is an extension of previous work using dynamic Bayesian networks, which allow for easy factorization of a state into multiple variables representing the articulatory features. We build context-dependent deci...
Conference Paper
Full-text available
Recognizing aspects of articulation from audio recordings of speech is an important problem, either as an end in itself or as part of an articulatory approach to automatic speech recognition. In this paper we study the frame-level classification of a set of articulatory features (AFs) inspired by the vocal tract variables of articulatory phonology....
Article
Full-text available
We investigate the classification of utterances into high-level dialog act categories using word-based features, under conditions where the train and test data differ by genre and/or language. We handle the cross-language cases with machine translation of the test utterances. We analyze and compare two feature-based approaches to using unlabeled da...
Conference Paper
Full-text available
We consider learning acoustic feature transformations using an additional view of the data, in this case video of the speaker's face. Specifically, we consider a scenario in which clean audio and video is available at training time, while at test time only noisy audio is available. We use canonical correlation analysis (CCA) to learn linear project...
Conference Paper
Full-text available
We consider the problem of predicting the surface pronunciations of a word in conversational speech, using a model of pronunciation variation based on articulatory features. We build context-dependent decision trees for both phone-based and feature-based models, and compare their perplexities on conversational data from the Switchboard Transcriptio...
Article
Full-text available
We consider methods for training a prosodic classifier using labeled training data from a different genre than the one on which the system will be deployed. Two binary tasks are considered: word-level pitch accent and phrase boundary detection. Using radio news and conversational telephone speech, we consider cross-genre training using acoustic...
Article
The phenomenon of anticipatory coarticulation provides a basis for the observed asynchrony between the acoustic and visual onsets of phones in certain linguistic contexts. This type of asynchrony is typically not explicitly modeled in audio-visual speech models. In this work, we study within-word audiovisual asynchrony using manual labels of words...
Article
Full-text available
We study the problem of automatic visual speech recognition (VSR) using dynamic Bayesian network (DBN)-based models consisting of multiple sequences of hidden states, each corresponding to an articulatory feature (AF) such as lip opening (LO) or lip rounding (LR). A bank of discriminative articulatory feature classifiers provides input to the DBN,...
Conference Paper
Full-text available
Clustering algorithms such as k-means perform poorly when the data is high-dimensional. A number of efficient clustering algorithms developed in recent years address this problem by projecting the data into a lower-dimensional subspace, e.g. via principal components analysis (PCA) or random projections, before clustering. Such techniques typical...
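A minimal sketch of the project-then-cluster pipeline described here, using PCA followed by k-means; the data and parameters are placeholders for illustration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 100))       # hypothetical high-dimensional data

Z = PCA(n_components=10).fit_transform(X)  # project to a low-dim subspace
labels = KMeans(n_clusters=5, n_init=10).fit_predict(Z)  # cluster there
```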
Conference Paper
Full-text available
We study the phonetic information in the signal from an ultrasonic "microphone", a device that emits an ultrasonic wave toward a speaker and receives the reflected, Doppler-shifted signal. This can be used in addition to audio to improve automatic speech recognition. This work is an effort to better understand the ultrasonic signal, and potential...
Conference Paper
Full-text available
The features derived from posteriors of a multilayer perceptron (MLP), known as tandem features, have proven to be very effective for automatic speech recognition. Most tandem features to date have relied on MLPs trained for phone classification. We recently showed on a relatively small data set that MLPs trained for articulatory feature classifica...
Conference Paper
Full-text available
We report on investigations, conducted at the 2006 Johns Hopkins Workshop, into the use of articulatory features (AFs) for observation and pronunciation models in speech recognition. In the area of observation modeling, we use the outputs of AF classifiers both directly, in an extension of hybrid HMM/neural network models, and as part of the observ...
Conference Paper
Full-text available
The so-called tandem approach, where the posteriors of a multilayer perceptron (MLP) classifier are used as features in an automatic speech recognition (ASR) system, has proven to be a very effective method. Most tandem approaches to date have relied on MLPs trained for phone classification, and appended the posterior features to some standard fe...
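A minimal sketch of the tandem recipe described: train an MLP on frame labels, then append its log posteriors to the standard features. The data here is synthetic and hypothetical; real systems use phone or articulatory-feature targets from forced alignments:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
mfcc_train = rng.standard_normal((2000, 39))  # hypothetical MFCC frames
frame_labels = rng.integers(0, 8, size=2000)  # hypothetical frame targets
mfcc_test = rng.standard_normal((500, 39))

mlp = MLPClassifier(hidden_layer_sizes=(512,), max_iter=50)
mlp.fit(mfcc_train, frame_labels)             # frame-level classifier

post = mlp.predict_proba(mfcc_test)           # frame-level posteriors
tandem = np.hstack([mfcc_test, np.log(post + 1e-8)])  # tandem features
```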
Conference Paper
Full-text available
We present an approach for the manual labeling of speech at the articulatory feature level, and a new set of labeled conversational speech collected using this approach. A detailed transcription, including overlapping or reduced gestures, is useful for studying the great pronunciation variability in conversational speech. It also facilitates the te...
Article
Full-text available
Although much is known about how speech is produced, and research into speech production has resulted in measured articulatory data, feature systems of different kinds, and numerous models, speech production knowledge is almost totally ignored in current mainstream approaches to automatic speech recognition. Representations of speech production all...
Conference Paper
Full-text available
We investigate an asynchronous two-stream dynamic Bayesian network-based model for audio-visual speech recognition. The model allows the audio and visual streams to de-synchronize within the boundaries of each word. The probability of de-synchronization by a given number of states is learned during training. This type of asynchrony has been previou...
Conference Paper
Full-text available
This paper is intended to advertise the public availability of the articulatory feature (AF) classification multi-layer perceptrons (MLPs) which were used in the Johns Hopkins 2006 summer workshop. We describe the design choices, data preparation, AF label generation, and the training of MLPs for feature classification on close to 2000 hours of t...
Article
Full-text available
Edited portions of this report have appeared or will appear in references [56, 55, 15, 26, 40]. We gratefully acknowledge a number of remote collaborators who contributed to this project before and during the workshop: Xuemin Chi, Joe Frankel, Lisa Lavoie, Mathew Magimai-Doss, and Kate Saenko. In addition, we are grateful for assistance and supp...
Article
This study investigates the manual labeling of speech, and in particular conversational speech, at the articulatory feature level. A detailed transcription, including subtleties such as overlapping or reduced gestures, is useful for studying the great pronunciation variability in conversational speech. This type of labeling also facilitates the tes...
Conference Paper
Full-text available
Articulatory feature models have been proposed in the automatic speech recognition community as an alternative to phone-based models of speech. In this paper, we extend this approach to the visual modality. Specifically, we adapt a recently proposed feature-based model of pronunciation variation to visual speech recognition (VSR) using a set of vis...
Article
Full-text available
Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically support vector machines, dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an automatic speech recognizer, current theories of human speech perce...
Conference Paper
Full-text available
We present an approach to detecting and recognizing spoken isolated phrases based solely on visual input. We adopt an architecture that first employs discriminative detection of visual speech and articulatory features, and then performs recognition using a model that accounts for the loose synchronization of the feature streams. Discriminative classi...
Conference Paper
Full-text available
We report on ongoing work on a pronunciation model based on explicit representation of the evolution of multiple linguistic feature streams. In this type of model, most pronunciation variation is viewed as the result of asynchrony between features and changes in feature values. We have implemented such a model using dynamic Bayesian networks. I...
Article
Full-text available
We present an approach to pronunciation modeling in which the evolution of multiple linguistic feature streams is explicitly represented. This differs from phone-based models in that pronunciation variation is viewed as the result of feature asynchrony and changes in feature values, rather than phone substitutions, insertions, and deletions....
Article
Full-text available
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information training, these methods assume a fixed statistical modeling structure, and then...
Article
The MIT SUMMIT speech recognition system models pronunciation using a phonemic baseform dictionary along with rewrite rules for modeling phonological variation and multi-word reductions. Each pronunciation component is encoded within a finite-state transducer (FST) representation whose transition weights can be probabilistically trained using a modi...
Conference Paper
Full-text available
In this paper, we investigate the use of dynamic Bayesian networks (DBNs) to explicitly represent models of hidden features, such as articulatory or other phonological features, for automatic speech recognition. In previous work using the idea of hidden features, the representation has typically been implicit, relying on a single hidden state t...
Conference Paper
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information training, these methods assume a fixed statistical modeling structure, and then...
Article
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information training, these methods assume a fixed statistical modeling structure, and then...
Conference Paper
Full-text available
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by maximum mutual information training, these methods assume a fixed statistical modeling structure, and then...
Article
Full-text available
This paper describes preliminary recognition experiments on PhoneBook [1], a corpus of isolated, telephone-bandwidth, read words from a large (almost 8,000-word) vocabulary. We have chosen this corpus as a testbed for experiments on the language model-independent parts of a segment-based recognizer. We present results showing that a segment-based r...
Article
Full-text available
The paper examines the recognition of non-native speech in JUPITER, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-native data with a single model. In particular, the...
Article
Full-text available
The performance of automatic speech recognizers has been observed to be dramatically worse for speakers with non-native accents than for native speakers. This poses a problem for many speech recognition systems, which need to handle both native and non-native speech. The problem is further complicated by the large number of non-native accents, whic...
Conference Paper
The paper examines the recognition of non-native speech in JUPITER, a speaker-independent, spontaneous-speech conversational system. Because the non-native speech in this domain is limited and varied, speaker- and accent-specific methods are impractical. We therefore chose to model all of the non-native data with a single model. In particular, the...
Article
Full-text available
Speech recognition, by both humans and machines, benefits from visual observation of the face, especially at low signal-to-noise ratios (SNRs). It has often been noticed, however, that the audible and visible correlates of a phoneme may be asynchronous; perhaps for this reason, automatic speech recognition structures that allow asynchrony betwe...
Article
Full-text available
In recent years there has been growing interest in discriminative parameter training techniques, resulting from notable improvements in speech recognition performance on tasks ranging in size from digit recognition to Switchboard. Typified by Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) training, these methods assume a f...
Article
Full-text available
We consider the problem of learning transformations of acoustic feature vectors for phonetic frame classification, in a multi-view setting where articulatory measurements are available at training time but not at test time. Canonical correlation analysis (CCA) has previously been used to learn linear transformations of the acoustic features that...
Article
Full-text available
Thesis (Ph. D.)--Massachusetts Institute of Technology, Dept. of Electrical Engineering and Computer Science, 2005. Includes bibliographical references (p. 131-140). Spoken language, especially conversational speech, is characterized by great variability in word pronunciation, including many variants that differ grossly from dictionary prototypes....
