Barbara Peskin

Massachusetts Institute of Technology, Cambridge, Massachusetts, United States

Publications (46) · 58.36 Total Impact Points

  • Source
    ABSTRACT: We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
    Full-text · Chapter · Feb 2006
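The delay-and-sum processing of distant microphone channels mentioned above is easy to illustrate. The following is a minimal toy sketch, not the evaluation system's implementation: it estimates each channel's delay against a reference channel by simple cross-correlation, aligns the channels, and averages; all names are illustrative.

```python
import numpy as np

def delay_and_sum(channels, ref=0):
    """Toy delay-and-sum beamformer: align every channel to a
    reference channel via cross-correlation, then average.
    channels: list of equal-length 1-D arrays, one per microphone."""
    ref_sig = channels[ref]
    n = len(ref_sig)
    out = np.zeros(n)
    for sig in channels:
        # Lag of the cross-correlation peak = estimated channel delay.
        corr = np.correlate(sig, ref_sig, mode="full")
        delay = int(np.argmax(corr)) - (n - 1)
        # Advance the channel by its delay so all channels line up.
        out += np.roll(sig, -delay)
    return out / len(channels)
```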
  • Source
    Xavier Anguera · Chuck Wooters · Barbara Peskin · Mateu Aguiló
    ABSTRACT: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meeting tasks in which we participated. Our base system uses agglomerative clustering with a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a "purification" module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.
    Full-text · Conference Paper · Jul 2005
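The BIC-like merge measure described in this abstract can be sketched as follows: two clusters merge when one Gaussian over the pooled data scores at least as well as two separate Gaussians once the saved model parameters are credited. This is a generic delta-BIC sketch with single full-covariance Gaussians, not the ICSI-SRI system's exact formulation.

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of X (n x d, n > d) under a single
    maximum-likelihood full-covariance Gaussian; for the ML fit
    the Mahalanobis terms sum to n*d, giving this closed form."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.linalg.slogdet(cov)[1] + d)

def merge_score(X, Y, lam=1.0):
    """BIC-like merge score for clusters X and Y: positive means the
    parameters saved by merging outweigh the lost likelihood."""
    n, d = X.shape[0] + Y.shape[0], X.shape[1]
    gain = gauss_loglik(np.vstack([X, Y])) - gauss_loglik(X) - gauss_loglik(Y)
    saved_params = d + d * (d + 1) / 2  # one Gaussian fewer after merging
    return gain + 0.5 * lam * saved_params * np.log(n)
```

Agglomerative clustering would merge the highest-scoring pair and stop when no pair scores above zero, which is what removes the need for pre-trained models or an external stopping threshold.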
  • Source
    ABSTRACT: Both human and automatic processing of speech require recognition of more than just words. In this paper we provide a brief overview of research on structural metadata extraction in the DARPA EARS rich transcription program. Tasks include detection of sentence boundaries, filler words, and disfluencies. Modeling approaches combine lexical, prosodic, and syntactic information, using various modeling techniques for knowledge source integration. The performance of these methods is evaluated by task, by data source (broadcast news versus spontaneous telephone conversations) and by whether transcriptions come from humans or from an (errorful) automatic speech recognizer. A representative sample of results shows that combining multiple knowledge sources (words, prosody, syntactic information) is helpful, that prosody is more helpful for news speech than for conversational speech, that word errors significantly impact performance, and that discriminative models generally provide benefit over maximum likelihood models. Important remaining issues, both technical and programmatic, are also discussed.
    Full-text · Conference Paper · Apr 2005
  • Source
    Andrew O. Hatch · Barbara Peskin · Andreas Stolcke
    ABSTRACT: The current "state-of-the-art" in phonetic speaker recognition uses relative frequencies of phone n-grams as features for training speaker models and for scoring test-target pairs. Typically, these relative frequencies are computed from a simple 1-best phone decoding of the input speech. In this paper, we present results on the Switchboard-2 corpus, where we compare 1-best phone decodings versus lattice phone decodings for the purposes of performing phonetic speaker recognition. The phone decodings are used to compute relative frequencies of phone bigrams, which are then used as inputs for two standard phonetic speaker recognition systems: a system based on log-likelihood ratios (LLRs) [1, 2], and a system based on support vector machines (SVMs) [3]. In each experiment, the lattice phone decodings achieve relative reductions in equal-error rate (EER) of between 31% and 66% below the EERs of the 1-best phone decodings. Our best phonetic system achieves an EER of 2.0% on 8-conversation training and 1.4% when combined with a GMM-based system.
    Preview · Conference Paper · Feb 2005
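The feature extraction underlying both the LLR and SVM systems, relative frequencies of phone bigrams, is straightforward to sketch from a 1-best decoding; the lattice variant would instead accumulate posterior-weighted expected counts. A minimal illustration:

```python
from collections import Counter

def phone_bigram_frequencies(phones):
    """Relative frequencies of phone bigrams from a 1-best phone
    decoding, e.g. phones = ["sil", "h", "ae", "l", "ow", "sil"]."""
    bigrams = Counter(zip(phones, phones[1:]))
    total = sum(bigrams.values()) or 1  # guard against empty input
    return {bg: count / total for bg, count in bigrams.items()}
```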
  • ABSTRACT: The paper describes our system devised for recognizing speech in meetings, which was an entry in the NIST Spring 2004 Meeting Recognition Evaluation. This system was developed as a collaborative effort between ICSI, SRI, and UW and was based on SRI’s 5xRT Conversational Telephone Speech (CTS) recognizer. The CTS system was adapted to the Meeting domain by adapting the CTS acoustic and language models, adding noise reduction and delay-sum array processing for far-field recognition, and adding postprocessing for cross-talk suppression for close-talking microphones. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    No preview · Chapter · Jan 2005
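The MAP adaptation used to port CTS models to the meeting domain can be illustrated with the textbook relevance-MAP update of GMM means; the modified procedure for MMIE-trained priors described in the abstract is more involved than this sketch.

```python
import numpy as np

def map_adapt_means(means, weights, covars, X, relevance=16.0):
    """Relevance-MAP update of diagonal-covariance GMM means.
    means (K, d), weights (K,), covars (K, d) diagonal variances,
    X (n, d) adaptation frames. Returns the adapted means."""
    K, d = means.shape
    # Posterior of each mixture component for each frame.
    logp = -0.5 * (((X[:, None, :] - means[None]) ** 2 / covars[None]).sum(-1)
                   + np.log(covars).sum(-1) + d * np.log(2 * np.pi))
    logp += np.log(weights)
    post = np.exp(logp - logp.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)
    # Sufficient statistics, then interpolate with the prior means.
    n_k = post.sum(axis=0)                               # soft counts
    ex_k = post.T @ X / np.maximum(n_k, 1e-10)[:, None]  # data means
    alpha = (n_k / (n_k + relevance))[:, None]           # adaptation weight
    return alpha * ex_k + (1 - alpha) * means
```

Components with plenty of adaptation data move toward the domain statistics; rarely observed components stay near their priors.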
  • Source
    Daniel Gillick · Stephen Stafford · Barbara Peskin
    ABSTRACT: In order to capture sequential information and to take advantage of extended training data conditions, we developed an algorithm for speaker detection that scores a test segment by comparing it directly to similar instances of that speech in the training data. This non-parametric technique, though at an early stage in its development, achieves error rates close to 1% on the NIST 2001 Extended Data task and performs extremely well in combination with a standard Gaussian Mixture Model system. We also present a new scoring method that significantly improves performance by capturing only positive evidence.
    Preview · Conference Paper · Jan 2005 · IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP)
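One way to render the idea of scoring a test segment by direct comparison with similar training instances is a nearest-neighbor search under a dynamic-time-warping distance. The paper's actual algorithm and features are not spelled out here; this sketch is only illustrative.

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic-time-warping distance between feature sequences
    A (m x d) and B (n x d) with Euclidean local cost."""
    m, n = len(A), len(B)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

def nearest_instance_score(test_seq, train_seqs):
    """Score a test segment by its best match among training
    instances of comparable speech; lower = stronger match."""
    return min(dtw_distance(test_seq, t) for t in train_seqs)
```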
  • Source
    A.O. Hatch · A. Stolcke · B. Peskin
    ABSTRACT: In this paper, we describe a general technique for optimizing the relative weights of feature sets in a support vector machine (SVM) and show how it can be applied to the field of speaker recognition. Our training procedure uses an objective function that maps the relative weights of the feature sets directly to a classification metric (e.g. equal-error rate (EER)) measured on a set of training data. The objective function is optimized in an iterative fashion with respect to both the feature weights and the SVM parameters (i.e. the support vector weights and the bias values). In this paper, we use this procedure to optimize the relative weights of various subsets of features in two SVM-based speaker recognition systems: a system that uses transform coefficients obtained from maximum likelihood linear regression (MLLR) as features (A. Stolcke, et al., 2005) and another that uses relative frequencies of phone n-grams (W. M. Campbell, et al., 2003), (A. Hatch, et al., 2005). In all cases, the training procedure yields significant improvements in both EER and minimum DCF (i.e. decision cost function), as measured on various test corpora.
    Preview · Conference Paper · Jan 2005
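A minimal sketch of the outer loop: scale each feature subset by candidate weights, retrain the SVM, and keep the weights that minimize EER on held-out data. The exhaustive grid below stands in for the paper's iterative optimization, and scikit-learn's LinearSVC is used purely for illustration.

```python
import numpy as np
from itertools import product
from sklearn.svm import LinearSVC

def weighted_concat(feature_sets, weights):
    """Concatenate feature subsets, each scaled by its relative weight."""
    return np.hstack([w * F for w, F in zip(weights, feature_sets)])

def equal_error_rate(scores, labels):
    """Approximate EER: sweep thresholds to the point where the
    false-accept and false-reject rates cross."""
    scores, labels = np.asarray(scores), np.asarray(labels)
    gap, eer = np.inf, 1.0
    for t in np.sort(scores):
        far = np.mean(scores[labels == 0] >= t)  # impostors accepted
        frr = np.mean(scores[labels == 1] < t)   # targets rejected
        if abs(far - frr) < gap:
            gap, eer = abs(far - frr), (far + frr) / 2
    return eer

def search_weights(train_sets, y_tr, dev_sets, y_dev,
                   grid=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Exhaustive grid search over per-subset weights (fine for two
    or three subsets); returns (best EER, best weights)."""
    best_eer, best_w = 1.0, None
    for w in product(grid, repeat=len(train_sets)):
        svm = LinearSVC().fit(weighted_concat(train_sets, w), y_tr)
        scores = svm.decision_function(weighted_concat(dev_sets, w))
        eer = equal_error_rate(scores, y_dev)
        if eer < best_eer:
            best_eer, best_w = eer, w
    return best_eer, best_w
```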
  • Source
    ABSTRACT: This paper describes ICSI's 2005 speaker recognition system, which was one of the top performing systems in the NIST 2005 speaker recognition evaluation. The system is a combination of four sub-systems: 1) a keyword conditional HMM system, 2) an SVM-based lattice phone n-gram system, 3) a sequential nonparametric system, and 4) a traditional cepstral GMM system, developed by SRI. The first three systems are designed to take advantage of higher-level and long-term information. We observe that their performance is significantly improved when there is more training data. In this paper, we describe these sub-systems and present results for each system alone and in combination on the speaker recognition evaluation (SRE) 2005 development and evaluation data sets.
    Full-text · Conference Paper · Jan 2005
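Combining sub-systems of this kind is typically done at the score level. A minimal sketch, assuming per-trial scores from each sub-system and development trials to fit the combiner (logistic regression here is an assumption; the abstract does not specify the combiner):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse_scores(dev_scores, dev_labels, eval_scores):
    """Score-level fusion: stack per-sub-system scores as features,
    fit a combiner on development trials, apply to evaluation trials.
    dev_scores, eval_scores: (n_trials, n_subsystems) arrays."""
    combiner = LogisticRegression().fit(dev_scores, dev_labels)
    # The combined log-odds of the target class is the fused score.
    return combiner.decision_function(eval_scores)
```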
  • Source
    ABSTRACT: We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year's system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year's evaluation system. Results on lecture data are comparable to the best reported results for that task.
    Full-text · Conference Paper · Jan 2005
  • Source
    ABSTRACT: We describe the ICSI-SRI-UW team's entry in the Spring 2004 NIST Meeting Recognition Evaluation. The system was derived from SRI's 5xRT Conversational Telephone Speech (CTS) recognizer by adapting CTS acoustic and language models to the Meeting domain, adding noise reduction and delay-sum array processing for far-field recognition, and postprocessing for cross-talk suppression. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    Full-text · Article · Jun 2004
  • Source
    Kofi Boakye · Barbara Peskin
    ABSTRACT: We present an approach to speaker recognition in the text-independent domain of conversational telephone speech using a text-constrained system designed to employ select high-frequency keywords in the speech stream. The system uses speaker word models generated via Hidden Markov Models (HMMs), a departure from the traditional Gaussian Mixture Model (GMM) approach dominant in text-independent work but commonly employed in text-dependent systems, with the expectation that HMMs take greater advantage of sequential information and support more detailed modeling which could be used to aid recognition. Even with a keyword inventory that covers a mere 10% of the word tokens and a system that does not yet incorporate many standard speaker recognition normalization schemes, this approach is already achieving equal error rates of 1% on NIST's 2001 Extended Data task.
    Preview · Article · Jan 2004
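The scoring logic of such a text-constrained system can be sketched independently of the HMM internals: each detected keyword token is scored under the speaker's word model and a background model of the same word, and log-likelihood ratios are accumulated. The hmm_loglik argument below is a placeholder for real HMM forward-algorithm scoring; all names are illustrative.

```python
def keyword_llr_score(keyword_tokens, speaker_models, background_models,
                      hmm_loglik):
    """Accumulate log-likelihood ratios over detected keyword tokens.
    keyword_tokens: (word, features) pairs located by the recognizer;
    hmm_loglik(model, features) -> log-likelihood, standing in for a
    real HMM forward-algorithm implementation."""
    total, n = 0.0, 0
    for word, feats in keyword_tokens:
        if word not in speaker_models:
            continue  # only the selected high-frequency keywords count
        total += (hmm_loglik(speaker_models[word], feats)
                  - hmm_loglik(background_models[word], feats))
        n += 1
    return total / max(n, 1)  # count-normalized detection score
```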

  • Source
    ABSTRACT: This paper provides a progress report on ICSI's Meeting Project, including both the data collected and annotated as part of the project, as well as the research lines such materials support. We include a general description of the official "ICSI Meeting Corpus", as currently available through the Linguistic Data Consortium, discuss some of the existing and planned annotations which augment the basic transcripts provided there, and describe several research efforts that make use of these materials. The corpus supports wide-ranging efforts, from low-level processing of the audio signal (including automatic speech transcription, speaker tracking, and work on far-field acoustics) to higher-level analyses of meeting structure, content, and interactions (such as topic and sentence segmentation, and automatic detection of dialogue acts and meeting "hot spots").
    Full-text · Article · Jan 2004
  • Source
    ABSTRACT: Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of "metadata" (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include sentence boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the sentence boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.
    Full-text · Conference Paper · Jan 2004
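The combination of word-based and prosodic knowledge sources at each candidate boundary can be sketched as a log-linear interpolation of the two posterior streams; the system described above embeds this in an HMM over boundary events, so the snippet shows only the per-boundary combination step, with illustrative names.

```python
import numpy as np

def detect_boundaries(lm_posts, pros_posts, lam=0.7, threshold=0.5):
    """Log-linear combination of per-boundary posteriors from a
    word-based model and a prosodic classifier; lam weights the
    word stream. Returns a boolean boundary decision per position."""
    p1, p2 = np.asarray(lm_posts), np.asarray(pros_posts)
    yes = p1 ** lam * p2 ** (1 - lam)
    no = (1 - p1) ** lam * (1 - p2) ** (1 - lam)
    return yes / (yes + no) > threshold  # renormalized combined posterior
```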
  • Source
    ABSTRACT: We describe the ICSI-SRI-UW team's entry in the Spring 2004 NIST Meeting Recognition Evaluation. The system was derived from SRI's 5xRT Conversational Telephone Speech (CTS) recognizer by adapting CTS acoustic and language models to the Meeting domain, adding noise reduction and delay-sum array processing for far-field recognition, and postprocessing for cross-talk suppression. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    Full-text · Conference Paper · Jan 2004
  • ABSTRACT: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. This paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. In this paper we show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2%, a 71% relative reduction in error over the previous state of the art.
    No preview · Article · Aug 2003
  • ABSTRACT: While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed NIST's 2001 Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by k nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.
    No preview · Article · Aug 2003
  • Source
    ABSTRACT: While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed the 2001 NIST Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by kth nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.
    Preview · Conference Paper · May 2003
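The "simple vectors of summary statistics per conversation side" scored by nearest-neighbor classifiers lend themselves to a compact sketch. The feature choices below (pitch and pause statistics) and the scoring rule are illustrative, not the workshop system's exact recipe.

```python
import numpy as np

def summary_vector(f0_values, pause_durs):
    """Per-conversation-side summary statistics of prosodic
    measurements (voiced-frame pitch, pause durations)."""
    return np.array([np.mean(f0_values), np.std(f0_values),
                     np.mean(pause_durs), np.std(pause_durs)])

def knn_score(test_vec, target_vecs, impostor_vecs, k=3):
    """Distance to the k-th nearest impostor side minus distance to
    the k-th nearest target side; higher favors the target speaker."""
    d_t = np.sort([np.linalg.norm(test_vec - v) for v in target_vecs])
    d_i = np.sort([np.linalg.norm(test_vec - v) for v in impostor_vecs])
    return d_i[min(k, len(d_i)) - 1] - d_t[min(k, len(d_t)) - 1]
```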
  • Source
    ABSTRACT: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2% - a 71% relative reduction in error over the previous state of the art.
    Full-text · Conference Paper · May 2003
  • Source
    ABSTRACT: In early 2001, we reported (at the Human Language Technology meeting) the early stages of an ICSI (International Computer Science Institute) project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). We report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
    Full-text · Conference Paper · May 2003

Publication Stats

1k Citations
58.36 Total Impact Points

Institutions

  • 2003
    • Massachusetts Institute of Technology
      Cambridge, Massachusetts, United States
    • CUNY Graduate Center
      New York City, New York, United States
  • 2002
    • SRI International
      Menlo Park, California, United States