Bryan L. Pellom

University of Colorado, Denver, Colorado, United States

Publications (71) · 48.76 Total Impact

  • Vesa Siivola, Bryan L. Pellom, Meagan Sills
    INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
  • Speech Prosody; 05/2010
  • Andreas Hagen, Bryan L. Pellom, Kadri Hacioglu
    ABSTRACT: This work focuses on generating children's HMM-based acoustic models for speech recognition from adult acoustic models. Collecting children's speech data is more costly than collecting adult speech. The patent-pending method developed in this work requires only adult data to estimate synthetic children's acoustic models in any language and works as follows: for a new language where only adult data is available, an adult male model and an adult female model are trained. A linear transformation from each male HMM mean vector to its closest female mean vector is estimated. This transform is then scaled to a certain power and applied to the female model to obtain a synthetic children's model. In a pronunciation verification task the method yields 19% and 3.7% relative improvement on native English and Spanish children's data, respectively, compared to the best adult model. For Spanish data, the new model outperforms the model trained on the available real children's data by 13% relative.
    Human Language Technologies: Conference of the North American Chapter of the Association of Computational Linguistics, Proceedings, May 31 - June 5, 2009, Boulder, Colorado, USA, Short Papers; 01/2009
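The transform-scaling idea described in the abstract above lends itself to a compact illustration. The sketch below is only a rough interpretation, not the patent-pending method itself: it assumes the male and female HMM mean vectors are already paired (the paper instead pairs each male mean with its closest female mean), fits a single affine map by least squares, and then pushes the female means further along the male-to-female direction by an illustrative factor alpha.

```python
# A rough sketch (not the paper's patent-pending method) of synthesizing
# child-like HMM mean vectors from adult models. Assumes paired male/female
# mean vectors; all names and the factor `alpha` are illustrative.
import numpy as np

def fit_affine_transform(male_means: np.ndarray, female_means: np.ndarray):
    """Least-squares affine map (W, b) such that W @ m + b ~ f for paired means."""
    X = np.hstack([male_means, np.ones((male_means.shape[0], 1))])
    coeffs, *_ = np.linalg.lstsq(X, female_means, rcond=None)
    return coeffs[:-1].T, coeffs[-1]            # W, b

def synthesize_child_means(female_means: np.ndarray, W: np.ndarray, b: np.ndarray,
                           alpha: float = 0.5) -> np.ndarray:
    """Push female means further along the male-to-female direction by `alpha`."""
    shifted = female_means @ W.T + b            # male-to-female transform applied once
    return female_means + alpha * (shifted - female_means)

# Toy usage with random vectors standing in for paired Gaussian means.
rng = np.random.default_rng(0)
male = rng.normal(size=(200, 39))               # e.g. 39-dimensional acoustic means
female = male * 1.1 + 0.05                      # fake male-to-female relationship
W, b = fit_affine_transform(male, female)
child = synthesize_child_means(female, W, b, alpha=0.7)
print(child.shape)                              # (200, 39)
```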
  • ABSTRACT: This paper presents work on developing speech corpora and recognition tools for Turkish by porting SONIC, a speech recognition tool developed initially for English at the Center for Spoken Language Research of the University of Colorado at Boulder. The work presented in this paper had two objectives: The first one is to collect a standard phonetically-balanced Turkish microphone speech corpus for general research use. A 193-speaker triphone-balanced audio corpus and a pronunciation lexicon for Turkish have been developed. The corpus has been accepted for distribution by the Linguistic Data Consortium (LDC) of the University of Pennsylvania in October 2005, and it will serve as a standard corpus for Turkish speech researchers. The second objective was to develop speech recognition tools (a phonetic aligner and a phone recognizer) for Turkish, which provided a starting point for obtaining a multilingual speech recognizer by porting SONIC to Turkish. This part of the work was the first port of this particular recognizer to a language other than English; subsequently, SONIC has been ported to over 15 languages. Using the phonetic aligner developed, the audio corpus has been provided with word, phone and HMM-state level alignments. For the phonetic aligner, it is shown that 92.6% of the automatically labeled phone boundaries are placed within 20ms of manually labeled locations for the Turkish audio corpus. Finally, a phone recognition error rate of 29.2% is demonstrated for the phone recognizer.
    Computer Speech & Language. 01/2007; 21:580-593.
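The 92.6% figure quoted in the abstract above is a boundary-tolerance metric: the fraction of automatic phone boundaries falling within 20 ms of the manual labels. A minimal sketch of such a computation follows; the function name, the assumption of 1:1 aligned boundary lists, and the toy numbers are illustrative and not taken from the paper's tooling.

```python
# A small sketch of the boundary-accuracy metric quoted above (percentage of
# automatic phone boundaries within a tolerance of manually labeled ones).
# Function and variable names are illustrative, not from the paper's tools.
def boundary_accuracy(auto_boundaries, manual_boundaries, tolerance_s=0.020):
    """Fraction of automatic boundaries within `tolerance_s` seconds of the
    corresponding manual boundary; assumes the two lists are aligned 1:1."""
    assert len(auto_boundaries) == len(manual_boundaries)
    hits = sum(1 for a, m in zip(auto_boundaries, manual_boundaries)
               if abs(a - m) <= tolerance_s)
    return hits / len(auto_boundaries)

# Toy usage: three of four boundaries fall within 20 ms -> 0.75
print(boundary_accuracy([0.10, 0.25, 0.43, 0.71], [0.11, 0.24, 0.48, 0.70]))
```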
  • Andreas Hagen, Bryan Pellom, Ronald Cole
    ABSTRACT: Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications speech recognition can be used to track the reading position of the child, to detect oral reading miscues, to assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or to engage the child in interactive dialogs that assess and train comprehension. Despite such promise, speech recognition systems exhibit higher error rates for children due to variability in vocal tract length, formant frequencies, pronunciation, and grammar. In the context of recognizing speech while children are reading out loud, these problems are compounded by speech production behaviors triggered by difficulty in recognizing printed words, which causes pauses, repeated syllables and other phenomena. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques which can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read aloud can be reduced by more than 50% through a combination of advances in both statistical language and acoustic modeling. Next, we propose extending our baseline system by introducing a novel token-passing search architecture targeting subword unit based speech recognition. The proposed subword unit based speech recognition framework is shown to provide equivalent accuracy to a whole-word based speech recognizer while enabling detection of oral reading events and finer grained speech analysis during recognition. The efficacy of the approach is demonstrated using data collected from children in grades 3–5: 34.6% of partial words with reasonable evidence in the speech signal are detected at a low false alarm rate of 0.5%.
    Speech Communication. 01/2007;
  • ABSTRACT: In this chapter, we present our recent advances in the formulation and development of an in-vehicle hands-free route navigation system. The system comprises a multi-microphone array processing front-end, an environmental sniffer (for noise analysis), a robust speech recognition system, and a dialog manager with information servers. We also present our recently completed speech corpus for in-vehicle interactive speech systems for route planning and navigation. The corpus consists of five domains: digit strings, route navigation expressions, street and location sentences, phonetically balanced sentences, and a route navigation dialog in a human Wizard-of-Oz like scenario. A total of 500 speakers were recorded across the United States of America during a six-month period from April to September 2001. While previous attempts at in-vehicle speech systems have generally focused on isolated command words to set radio frequencies, temperature control, etc., the CU-Move system is focused on natural conversational interaction between the user and the in-vehicle system. After presenting our proposed in-vehicle speech system, we consider advances in multi-channel array processing, environmental noise sniffing and tracking, new and more robust acoustic front-end representations, built-in speaker normalization for robust ASR, and our back-end dialog navigation information retrieval sub-system connected to the WWW. Results are presented in each sub-section with a discussion at the end of the chapter.
    01/2006: pages 19-45;
  • ABSTRACT: We present a novel approach to synthesizing accurate visible speech based on searching and concatenating optimal variable-length units in a large corpus of motion capture data. Based on a set of visual prototypes selected on a source face and a corresponding set designated for a target face, we propose a machine learning technique to automatically map the facial motions observed on the source face to the target face. In order to model the long-distance coarticulation effects in visible speech, a large-scale corpus that covers the most common syllables in English was collected, annotated and analyzed. For any input text, a search algorithm to locate the optimal sequences of concatenated units for synthesis is described. A new algorithm to adapt lip motions from a generic 3D face model to a specific 3D face model is also proposed. A complete, end-to-end visible speech animation system is implemented based on the approach. This system is currently used in more than 60 kindergarten through third grade classrooms to teach students to read using a lifelike conversational animated agent. To evaluate the quality of the visible speech produced by the animation system, both subjective and objective evaluations were conducted. The evaluation results show that the proposed approach is accurate and powerful for visible speech synthesis.
    IEEE Transactions on Visualization and Computer Graphics 01/2006; 12(2):266-76. · 1.90 Impact Factor
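The optimal-unit search mentioned in the abstract above belongs to the general family of unit-selection searches. The sketch below shows a generic dynamic-programming (Viterbi-style) selection over candidate units with target and join costs; the cost functions, data layout, and fixed-length positions are simplifying assumptions and do not reproduce the paper's variable-length search over motion-capture units.

```python
# Generic unit-selection sketch: pick one candidate unit per position so that
# the sum of target costs and join costs is minimized. Illustrative only.
def select_units(candidates, target_cost, join_cost):
    """candidates: list (one entry per position) of lists of candidate units.
    Returns the minimum-cost unit sequence via dynamic programming."""
    # best[i][j]: best total cost ending with candidate j at position i
    best = [[target_cost(0, c) for c in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for c in candidates[i]:
            costs = [best[i - 1][k] + join_cost(p, c)
                     for k, p in enumerate(candidates[i - 1])]
            k_best = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k_best] + target_cost(i, c))
            ptr.append(k_best)
        best.append(row)
        back.append(ptr)
    # Backtrack from the best final candidate.
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(len(candidates) - 1, 0, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Toy usage: "units" are numbers; prefer values near the position index and
# smooth transitions between consecutive units.  -> [0.0, 0.5, 1.5]
cands = [[0.0, 1.0], [0.5, 2.0], [1.5, 3.0]]
print(select_units(cands,
                   target_cost=lambda i, c: abs(c - i),
                   join_cost=lambda p, c: abs(c - p)))
```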
  • ABSTRACT: To date, studies of deceptive speech have largely been confined to descriptive studies and observations from subjects, researchers, or practitioners, with few empirical studies of the specific lexical or acoustic/prosodic features which may characterize deceptive speech. We present results from a study seeking to distinguish deceptive from non-deceptive speech using machine learning techniques on features extracted from a large corpus of deceptive and non-deceptive speech. This corpus employs an interview paradigm that includes subject reports of truth vs. lie at multiple temporal scales. We present current results comparing the performance of acoustic/prosodic, lexical, and speaker-dependent features and discuss future research directions.
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
  • Vesa Siivola, Bryan L. Pellom
    ABSTRACT: Traditionally, when building an n-gram model, we decide the span of the model history, collect the relevant statistics and estimate the model. The model can be pruned down to a smaller size by manipulating the statistics or the estimated model. This paper shows how an n-gram model can be built by adding suitable sets of n-grams to a unigram model until the desired complexity is reached. Very high order n-grams can be used in the model, since the need for handling the full unpruned model is eliminated by the proposed technique. We compare our growing method to entropy-based pruning. In Finnish speech recognition tests, the models trained by the growing method outperform the entropy-pruned models of similar size. The growing criterion minimizes the number of bits required to send a model of the data together with the data given the model, i.e., the sum of the (scaled) model size and the negative log likelihood of the training data. We start with an initial model M_old and build the model by drawing an (m-1)-gram g_1^(m-1) from some distribution; we find all the m-grams g_1^m in the training data that start with the prefix g_1^(m-1) and add this set G to the model if the change in data coding length is negative: Δ = (αS_new − log L_new) − (αS_old − log L_old).
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
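The growth criterion quoted at the end of the abstract above reduces to a simple accept/reject test on each candidate set of n-grams. The sketch below illustrates that test; the weight alpha, the helper names, and the data structure are illustrative assumptions rather than the paper's implementation.

```python
# A minimal sketch of the growth criterion described above; `alpha`, the helper
# names, and the data structure are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class CandidateSet:
    ngrams: list            # the m-grams sharing one (m-1)-gram prefix
    size_increase: float    # added model size (e.g. number of parameters)
    loglik_gain: float      # log-likelihood improvement on the training data

def should_add(candidate: CandidateSet, alpha: float) -> bool:
    """Accept the candidate set if it lowers the total coding length:
    delta = (alpha*S_new - logL_new) - (alpha*S_old - logL_old)
          =  alpha*size_increase - loglik_gain."""
    delta = alpha * candidate.size_increase - candidate.loglik_gain
    return delta < 0.0

# Toy usage: a set of 4-grams improving log likelihood by 12 nats while adding
# 30 parameters is accepted only when alpha is small enough.
cand = CandidateSet(ngrams=[("in", "the", "new", "model")] * 3,
                    size_increase=30.0, loglik_gain=12.0)
print(should_add(cand, alpha=0.1))   # True
print(should_add(cand, alpha=1.0))   # False
```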
  • Piero Cosi, Bryan L. Pellom
    ABSTRACT: This work was conducted with the specific goals of developing improved recognition of children's speech in Italian and integrating the children's speech recognition models into the Italian version of the Colorado Literacy Tutor platform. Specifically, children's speech recognition research for Italian was conducted using the ITC-irst Children's Speech Corpus. Using the University of Colorado SONIC LVCSR system, we demonstrate a phonetic recognition error rate of 16.0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, and unsupervised Structural MAP Linear Regression (SMAPLR). These new acoustic models have been incorporated into an Italian version of SONIC, the ASR system of the Italian Literacy Tutor program.
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
  • Andreas Hagen, Bryan L. Pellom
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
  • ABSTRACT: The Colorado Literacy Tutor (CLT) is a technology-based literacy program, designed on the basis of cognitive theory and scientifically motivated reading research, which aims to improve literacy and student achievement in public schools. One of the critical components of the CLT is a speech recognition system which is used to track the child's progress during oral reading and to provide sufficient information to detect reading miscues. In this paper, we extend prior work by examining a novel labeling of children's oral reading audio data in order to better understand the factors that contribute most significantly to speech recognition errors. While the labeled events make up nearly 8% of the data, they are shown to account for approximately 30% of the word errors in a state-of-the-art speech recognizer. Next, we consider the problem of detecting miscues during oral reading. Using features derived from the speech recognizer, we demonstrate that 67% of reading miscues can be detected at a false alarm rate of 3%.
    08/2004;
  • K. Hacioglu, B. Pellom, W. Ward
    ABSTRACT: In this paper, the states in the speech production process are defined by a number of categorical articulatory features. We describe a detector that outputs a stream (sequence of classes) for each articulatory feature given the Mel frequency cepstral coefficient (MFCC) representation of the input speech. The detector consists of a bank of recurrent neural network (RNN) classifiers, a variable depth lattice generator and Viterbi decoder. A bank of classifiers has been previously used for articulatory feature detection by many researchers. We extend their work first by creating variable depth lattices for each feature and then by combining them into product lattices for rescoring using the Viterbi algorithm. During the rescoring we incorporate language and duration constraints along with the posterior probabilities of classes provided by the RNN classifiers. We present our results for the place and manner features using TIMIT data, and compare the results to a baseline system. We report performance improvements both at the frame and segment levels.
    Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP '04); 06/2004 · 4.63 Impact Factor
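The detector described in the abstract above combines classifier outputs with language and duration constraints over lattices. The much-simplified sketch below shows only the core idea for a single feature stream: per-frame log posteriors (as an RNN classifier might produce) decoded with a Viterbi pass that penalizes class changes. The switch_penalty parameter and all names are illustrative assumptions; the variable-depth and product lattices are not reproduced.

```python
# Simplified decoding of one articulatory-feature stream: frame-level log
# posteriors combined with a constant class-change penalty via Viterbi.
import numpy as np

def viterbi_feature_stream(log_posteriors: np.ndarray, switch_penalty: float = 2.0):
    """log_posteriors: (T, C) per-frame log class posteriors.
    Returns the best class sequence when changing class costs `switch_penalty`."""
    T, C = log_posteriors.shape
    score = log_posteriors[0].copy()
    back = np.zeros((T, C), dtype=int)
    penalty = switch_penalty * (1 - np.eye(C))      # zero cost on the diagonal
    for t in range(1, T):
        trans = score[:, None] - penalty            # score of arriving from each class
        back[t] = trans.argmax(axis=0)
        score = trans.max(axis=0) + log_posteriors[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy usage with random "posteriors" over 4 manner classes and 50 frames.
rng = np.random.default_rng(1)
logp = np.log(rng.dirichlet(np.ones(4), size=50))
print(viterbi_feature_stream(logp))
```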
  • ABSTRACT: In this paper we present recent advances in acoustic and language modeling that improve recognition performance when children read out loud within digital books. First we extend previous work by incorporating cross-utterance word history information and dynamic n-gram language modeling. By additionally incorporating Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Training (SAT) and iterative unsupervised structural maximum a posteriori linear regression (SMAPLR) adaptation we demonstrate a 54% reduction in word error rate. Next, we show how data from children's read-aloud sessions can be utilized to improve accuracy in a spontaneous story summarization task. An error reduction of 15% over previously published results is shown. Finally we describe a novel real-time implementation of our research system that incorporates time-adaptive acoustic and language modeling.
    05/2004;
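Of the techniques listed in the abstract above, VTLN is the easiest to sketch: per speaker, a small grid of frequency-warp factors is scored against the acoustic model and the best-scoring factor is kept. The snippet below shows that grid search with a stand-in scoring function; the warp range and function names are conventional assumptions, not details from the paper.

```python
# A minimal sketch of the grid search typically used for Vocal Tract Length
# Normalization (VTLN). `score_utterances` stands in for a real likelihood
# computation (e.g. forced-alignment log likelihood) and is an assumption here.
import numpy as np

def pick_warp_factor(utterances, score_utterances,
                     warps=np.arange(0.88, 1.13, 0.02)):
    """Return the warp factor maximizing total acoustic log likelihood."""
    best_warp, best_score = None, -np.inf
    for warp in warps:
        score = score_utterances(utterances, warp)
        if score > best_score:
            best_warp, best_score = warp, score
    return best_warp

# Toy usage: a fake scorer that peaks at warp = 0.96.
fake_scorer = lambda utts, w: -((w - 0.96) ** 2)
print(round(pick_warp_factor(["utt1", "utt2"], fake_scorer), 2))   # 0.96
```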
  • ABSTRACT: We present a technique for accurate automatic visible speech synthesis from textual input. When provided with a speech waveform and the text of a spoken sentence, the system produces accurate visible speech synchronized with the audio signal. To develop the system, we collected motion capture data from a speaker's face during production of a set of words containing all diviseme sequences in English. The motion capture points from the speaker's face are retargeted to the vertices of the polygons of a 3D face model. When synthesizing a new utterance, the system locates the required sequence of divisemes, shrinks or expands each diviseme based on the desired phoneme segment durations in the target utterance, then moves the polygons in the regions of the lips and lower face to correspond to the spatial coordinates of the motion capture data. The motion mapping is realized by a key-shape mapping function learned by a set of viseme examples in the source and target faces. A well-posed numerical algorithm estimates the shape blending coefficients. Time warping and motion vector blending at the juncture of two divisemes and the algorithm to search the optimal concatenated visible speech are also developed to provide the final concatenative motion sequence.
    Computer Animation and Virtual Worlds 01/2004; 15:485-500. · 0.44 Impact Factor
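Two of the ingredients mentioned in the abstract above, duration-based time warping of a diviseme and estimation of shape-blending coefficients, can be sketched under simplifying assumptions. Neither piece below is the paper's exact "well-posed numerical algorithm"; all names, array shapes, and the ridge term are illustrative.

```python
# (1) Linearly time-warp a motion clip to a target number of frames.
# (2) Fit blend-shape coefficients for a captured frame by regularized least
#     squares against a set of key shapes. Both are illustrative sketches.
import numpy as np

def time_warp(motion: np.ndarray, target_frames: int) -> np.ndarray:
    """motion: (T, D) marker trajectories; resample to target_frames frames."""
    t_src = np.linspace(0.0, 1.0, motion.shape[0])
    t_dst = np.linspace(0.0, 1.0, target_frames)
    return np.stack([np.interp(t_dst, t_src, motion[:, d])
                     for d in range(motion.shape[1])], axis=1)

def blend_coefficients(frame: np.ndarray, key_shapes: np.ndarray,
                       ridge: float = 1e-3) -> np.ndarray:
    """Solve frame ~ key_shapes.T @ w with a small ridge term for stability."""
    K = key_shapes @ key_shapes.T + ridge * np.eye(key_shapes.shape[0])
    return np.linalg.solve(K, key_shapes @ frame)

# Toy usage: warp a 30-frame clip to 45 frames, then fit weights for one frame.
clip = np.random.default_rng(2).normal(size=(30, 6))
warped = time_warp(clip, 45)
keys = np.random.default_rng(3).normal(size=(4, 6))     # 4 key shapes, 6 dims
w = blend_coefficients(warped[0], keys)
print(warped.shape, w.shape)    # (45, 6) (4,)
```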
  • A. Hagen, D.A. Connors, B.L. Pellom
    ABSTRACT: Growing demand for high performance in embedded systems is creating new opportunities to use speech recognition systems. In several ways, the needs of embedded computing differ from those of more traditional general-purpose systems. Embedded systems have more stringent constraints on cost and power consumption that lead to design bottlenecks for many computationally-intensive applications. This paper characterizes the speech recognition process on handheld mobile devices and evaluates the use of modern architecture features and compiler techniques for performing real-time speech recognition. We evaluate the University of Colorado SONIC speech recognition software on the IMPACT architectural simulator and compiler framework. Experimental results show that by using a strategic set of compiler optimizations, a 500 MHz processor with moderate levels of instruction-level parallelism and cache resources can meet the real-time computing and power constraints of an advanced speech recognition application.
    First IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, 2003; 11/2003
  • ABSTRACT: This paper presents recent improvements in the development of the University of Colorado "CU Communicator" and "CU Move" spoken dialog systems. First, we describe the CU Communicator system that integrates speech recognition, synthesis and natural language understanding technologies using the DARPA Hub Architecture. Users are able to converse with an automated travel agent over the phone to retrieve up-to-date travel information such as flight schedules and pricing, along with hotel and rental car availability. The CU Communicator has been under development since April of 1999 and represents our test-bed system for developing robust human-computer interactions, where reusability and dialogue system portability serve as two main goals of our work. Next, we describe our more recent work on the CU Move dialog system for in-vehicle route planning and guidance. This work is in joint collaboration with HRL and is sponsored as part of the DARPA Communicator program. Specifically, we provide an overview of the task, describe the data collection environment for in-vehicle systems development, and describe our initial dialog system constructed for route planning.
    10/2003;
  • ABSTRACT: This paper presents a vision of the near future in which computer interaction is characterized by natural face-to-face conversations with lifelike characters that speak, emote, and gesture. These animated agents will converse with people much like people converse effectively with assistants in a variety of focused applications. Despite the research advances required to realize this vision, and the lack of strong experimental evidence that animated agents improve human-computer interaction, we argue that initial prototypes of perceptive animated interfaces can be developed today, and that the resulting systems will provide more effective and engaging communication experiences than existing systems. In support of this hypothesis, we first describe initial experiments using an animated character to teach speech and language skills to children with hearing problems, and classroom subjects and social skills to children with autistic spectrum disorder. We then show how existing dialogue system architectures can be transformed into perceptive animated interfaces by integrating computer vision and animation capabilities. We conclude by describing the Colorado Literacy Tutor, a computer-based literacy program that provides an ideal testbed for research and development of perceptive animated interfaces, and consider next steps required to realize the vision.
    Proceedings of the IEEE 10/2003; · 6.91 Impact Factor
  • ABSTRACT: We describe a recognition experiment and two analytic experiments on a database of strongly Hispanic-accented English. We show the crucial importance of training on the Hispanic-accented data for acoustic model performance, and describe the tendency of Spanish-accented speakers to use longer, and presumably less-reduced, schwa vowels than native-English speakers.
    05/2003;

Publication Stats

1k Citations
48.76 Total Impact Points

Institutions

  • 2005
    • University of Colorado
      Denver, Colorado, United States
  • 1999–2005
    • University of Colorado at Boulder
      Boulder, Colorado, United States
  • 1996–1999
    • Duke University
      • Department of Electrical and Computer Engineering (ECE)
      Durham, North Carolina, United States