Barbara Peskin

CUNY Graduate Center, New York City, New York, United States

Publications (45) · 46.3 total impact

  •
    ABSTRACT: We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
    02/2006: pages 463-475;
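    The delay-sum front end mentioned above can be illustrated with a minimal sketch, assuming one global integer-sample delay per distant channel estimated by cross-correlation against a reference channel. Function names are ours, and the actual RT-05S preprocessing is considerably more elaborate (per-segment delay tracking, channel weighting, noise reduction).

```python
import numpy as np

def estimate_delay(ref, sig, max_lag=800):
    """Integer-sample delay of `sig` relative to `ref`, found at the
    peak of their cross-correlation within +/- max_lag samples."""
    corr = np.correlate(sig, ref, mode="full")
    center = len(ref) - 1  # index of zero lag
    window = corr[center - max_lag:center + max_lag + 1]
    return int(np.argmax(window)) - max_lag

def delay_and_sum(channels, max_lag=800):
    """Toy delay-and-sum beamformer: align every distant-microphone
    channel to channel 0 and average. np.roll wraps at the edges,
    which is acceptable only for this illustration."""
    out = np.array(channels[0], dtype=float)
    for ch in channels[1:]:
        d = estimate_delay(channels[0], ch, max_lag)
        out += np.roll(ch, -d)  # undo the estimated propagation delay
    return out / len(channels)
```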
  •
    ABSTRACT: Both human and automatic processing of speech require recognition of more than just words. In this paper we provide a brief overview of research on structural metadata extraction in the DARPA EARS rich transcription program. Tasks include detection of sentence boundaries, filler words, and disfluencies. Modeling approaches combine lexical, prosodic, and syntactic information, using various modeling techniques for knowledge source integration. The performance of these methods is evaluated by task, by data source (broadcast news versus spontaneous telephone conversations) and by whether transcriptions come from humans or from an (errorful) automatic speech recognizer. A representative sample of results shows that combining multiple knowledge sources (words, prosody, syntactic information) is helpful, that prosody is more helpful for news speech than for conversational speech, that word errors significantly impact performance, and that discriminative models generally provide benefit over maximum likelihood models. Important remaining issues, both technical and programmatic, are also discussed.
    Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on; 04/2005
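    As a rough sketch of the knowledge-source combination idea, the snippet below fuses a language-model posterior with a prosodic-classifier posterior for each interword boundary by log-linear interpolation. This is a simplified stand-in for the sequence models used in the EARS systems; the weight and the example posteriors are invented.

```python
import math

def combine_boundary_scores(lm_prob, prosody_prob, lm_weight=0.6):
    """Log-linear (geometric) interpolation of two posterior estimates
    for a sentence-boundary event at one interword position. The
    weight is illustrative; real systems decode whole sequences."""
    log_p = lm_weight * math.log(lm_prob) + (1 - lm_weight) * math.log(prosody_prob)
    return math.exp(log_p)

def detect_boundaries(positions, threshold=0.5):
    """positions: list of (lm_prob, prosody_prob) pairs, one per
    interword position. Returns boolean boundary decisions."""
    return [combine_boundary_scores(lm, pr) >= threshold
            for lm, pr in positions]

# Toy usage: three interword positions with invented posteriors.
print(detect_boundaries([(0.9, 0.8), (0.2, 0.1), (0.6, 0.7)]))
```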
  • A.O. Hatch, B. Peskin, A. Stolcke
    ABSTRACT: Not Available
    Acoustics, Speech, and Signal Processing, 2005. Proceedings. (ICASSP '05). IEEE International Conference on; 02/2005
  •
    ABSTRACT: The paper describes our system for recognizing speech in meetings, which was an entry in the NIST Spring 2004 Meeting Recognition Evaluation. This system was developed as a collaborative effort between ICSI, SRI, and UW and was based on SRI’s 5xRT Conversational Telephone Speech (CTS) recognizer. The CTS system was adapted to the meeting domain by adapting its acoustic and language models, adding noise reduction and delay-sum array processing for far-field recognition, and adding postprocessing for cross-talk suppression on close-talking microphones. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    01/2005: pages 196-208;
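    The MAP adaptation at the heart of the port can be sketched for a single Gaussian mean with the standard relevance-MAP update; the paper's modified procedure for MMIE-trained priors differs in details not reproduced here, and all data below are invented.

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, relevance=16.0):
    """Relevance-MAP update of one Gaussian mean.
    prior_mean: (d,) mean from the prior (e.g. MMIE-trained CTS) model.
    frames:     (n, d) meeting-domain frames assigned to this Gaussian.
    relevance:  higher values keep the adapted mean closer to the prior."""
    n = len(frames)
    data_mean = frames.mean(axis=0) if n else prior_mean
    alpha = n / (n + relevance)  # adaptation weight grows with data count
    return alpha * data_mean + (1.0 - alpha) * prior_mean

# Toy usage with invented 2-D statistics.
prior = np.array([0.0, 0.0])
meeting_frames = np.random.default_rng(0).normal(1.0, 0.5, size=(50, 2))
print(map_adapt_mean(prior, meeting_frames))
```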
  •
    ABSTRACT: In this paper we describe the ICSI-SRI entry in the Rich Transcription 2005 Spring Meeting Recognition Evaluation. The current system is based on the ICSI-SRI clustering system for Broadcast News (BN), with extra modules to process the different meetings tasks in which we participated. Our base system uses agglomerative clustering with a BIC-like measure to determine when to stop merging clusters and to decide which pairs of clusters to merge. This approach does not require any pre-trained models, thus increasing robustness and simplifying the port from BN to the meetings domain. For the meetings domain, we have added several features to our baseline clustering system, including a "purification" module that tries to keep the clusters acoustically homogeneous throughout the clustering process, and a delay&sum beamforming algorithm which enhances signal quality for the multiple distant microphones (MDM) sub-task. In post-evaluation work we further improved the delay&sum algorithm, experimented with a new speech/non-speech detector and proposed a new system for the lecture room environment.
    Machine Learning for Multimodal Interaction, Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers; 01/2005
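    A minimal sketch of the BIC-style merge test, assuming full-covariance Gaussian cluster models. Note that the ICSI system actually holds model complexity constant across the comparison so that the penalty term cancels; the classic penalized form is shown here.

```python
import numpy as np

def gauss_loglik(X):
    """Log-likelihood of frames X under the single full-covariance
    Gaussian fit to X by maximum likelihood."""
    n, d = X.shape
    cov = np.cov(X, rowvar=False, bias=True) + 1e-6 * np.eye(d)
    _, logdet = np.linalg.slogdet(cov)
    # With the ML covariance, the Mahalanobis term sums to n * d.
    return -0.5 * n * (d * np.log(2.0 * np.pi) + logdet + d)

def delta_bic(X, Y, lam=1.0):
    """Delta-BIC between modeling clusters X and Y separately vs.
    merged. Negative values favor merging; lam scales the penalty."""
    n, d = len(X) + len(Y), X.shape[1]
    n_params = d + 0.5 * d * (d + 1)           # mean + covariance
    penalty = 0.5 * lam * n_params * np.log(n)
    separate = gauss_loglik(X) + gauss_loglik(Y)
    merged = gauss_loglik(np.vstack([X, Y]))
    return separate - merged - penalty
```

    Agglomerative clustering then repeatedly merges the pair with the most negative delta_bic and stops once every remaining pair scores positive.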
  • Machine Learning for Multimodal Interaction, Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers; 01/2005
  • Daniel Gillick, Stephen Stafford, Barbara Peskin
    ABSTRACT: In order to capture sequential information and to take advantage of extended training data conditions, we developed an algorithm for speaker detection that scores a test segment by comparing it directly to similar instances of that speech in the training data. This non-parametric technique, though at an early stage in its development, achieves error rates close to 1% on the NIST 2001 Extended Data task and performs extremely well in combination with a standard Gaussian Mixture Model system. We also present a new scoring method that significantly improves performance by capturing only positive evidence.
    01/2005; 1.
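    A minimal sketch of the sequential nonparametric idea, assuming word-level cepstral segments and a plain DTW distance with Euclidean frame costs; the actual scoring, and the positive-evidence variant in the paper, are more refined.

```python
import numpy as np

def dtw_distance(A, B):
    """Dynamic-time-warping distance between two frame sequences of
    shapes (n, d) and (m, d), using Euclidean frame distances."""
    n, m = len(A), len(B)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # length-normalize

def score_test_instance(test_seq, train_seqs):
    """Score = negative distance to the nearest training instance of
    the same word from the target speaker (higher = more similar)."""
    return -min(dtw_distance(test_seq, t) for t in train_seqs)
```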
  • A.O. Hatch, A. Stolcke, B. Peskin
    ABSTRACT: In this paper, we describe a general technique for optimizing the relative weights of feature sets in a support vector machine (SVM) and show how it can be applied to the field of speaker recognition. Our training procedure uses an objective function that maps the relative weights of the feature sets directly to a classification metric (e.g. equal-error rate (EER)) measured on a set of training data. The objective function is optimized in an iterative fashion with respect to both the feature weights and the SVM parameters (i.e. the support vector weights and the bias values). In this paper, we use this procedure to optimize the relative weights of various subsets of features in two SVM-based speaker recognition systems: a system that uses transform coefficients obtained from maximum likelihood linear regression (MLLR) as features (A. Stolcke, et al., 2005) and another that uses relative frequencies of phone n-grams (W. M. Campbell, et al., 2003), (A. Hatch, et al., 2005). In all cases, the training procedure yields significant improvements in both EER and minimum DCF (i.e. decision cost function), as measured on various test corpora.
    Automatic Speech Recognition and Understanding, 2005 IEEE Workshop on; 01/2005
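    The use of a classification metric as the training objective can be sketched at the score level: compute EER over labeled trials and grid-search the relative weight of two feature-set scores. The paper optimizes the weights jointly with the SVM parameters, so the score-level search below is only a simplified stand-in.

```python
import numpy as np

def equal_error_rate(scores, labels):
    """EER of a detection task: the error rate at the threshold where
    the miss rate and false-alarm rate are (approximately) equal.
    labels: 1 for target trials, 0 for impostor trials."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    tar, non = scores[labels == 1], scores[labels == 0]
    best_gap, eer = float("inf"), 1.0
    for t in np.sort(scores):
        miss = np.mean(tar < t)   # targets rejected at threshold t
        fa = np.mean(non >= t)    # impostors accepted at threshold t
        if abs(miss - fa) < best_gap:
            best_gap, eer = abs(miss - fa), 0.5 * (miss + fa)
    return eer

def tune_relative_weight(scores_a, scores_b, labels):
    """Grid-search the relative weight of two feature-set scores,
    using EER on labeled training trials as the objective."""
    grid = np.linspace(0.0, 1.0, 101)
    return min(grid, key=lambda w: equal_error_rate(
        w * scores_a + (1.0 - w) * scores_b, labels))
```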
  •
    ABSTRACT: This paper describes ICSI's 2005 speaker recognition system, which was one of the top performing systems in the NIST 2005 speaker recognition evaluation. The system is a combination of four sub-systems: 1) a keyword conditional HMM system, 2) an SVM-based lattice phone n-gram system, 3) a sequential nonparametric system, and 4) a traditional cepstral GMM system, developed by SRI. The first three systems are designed to take advantage of higher-level and long-term information. We observe that their performance is significantly improved when there is more training data. In this paper, we describe these sub-systems and present results for each system alone and in combination on the speaker recognition evaluation (SRE) 2005 development and evaluation data sets.
    Automatic Speech Recognition and Understanding, 2005 IEEE Workshop on; 01/2005
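    The abstract does not spell out the combiner, so as an assumed-for-illustration fusion the sketch below trains a logistic regression over the four subsystem scores of each development trial; all data are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Invented development scores: one row per trial, one column per
# subsystem (keyword HMM, phone n-gram SVM, nonparametric, GMM).
rng = np.random.default_rng(1)
labels = rng.integers(0, 2, size=200)            # 1 = target trial
dev_scores = rng.normal(labels[:, None], 1.0, size=(200, 4))

fuser = LogisticRegression()
fuser.fit(dev_scores, labels)

# Fused score for a new trial = calibrated log-odds of "target".
trial = rng.normal(0.5, 1.0, size=(1, 4))
print(fuser.decision_function(trial))
```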
  •
    ABSTRACT: We describe the ICSI-SRI-UW team's entry in the Spring 2004 NIST Meeting Recognition Evaluation. The system was derived from SRI's 5xRT Conversational Telephone Speech (CTS) recognizer by adapting CTS acoustic and language models to the Meeting domain, adding noise reduction and delay-sum array processing for far-field recognition, and postprocessing for cross-talk suppression. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    06/2004;
  •
    ABSTRACT: We describe the ICSI-SRI-UW team's entry in the Spring 2004 NIST Meeting Recognition Evaluation. The system was derived from SRI's 5xRT Conversational Telephone Speech (CTS) recognizer by adapting CTS acoustic and language models to the Meeting domain, adding noise reduction and delay-sum array processing for far-field recognition, and postprocessing for cross-talk suppression. A modified MAP adaptation procedure was developed to make best use of discriminatively trained (MMIE) prior models. These meeting-specific changes yielded an overall 9% and 22% relative improvement as compared to the original CTS system, and 16% and 29% relative improvement as compared to our 2002 Meeting Evaluation system, for the individual-headset and multiple-distant microphones conditions, respectively.
    INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004; 01/2004
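    The cross-talk suppression postprocessing can be sketched as a frame-energy comparison across close-talking channels: frames in which a channel is much quieter than the loudest channel are likely picking up someone else's speech and are zeroed. The frame size and margin are invented, and the evaluated system's detector is more sophisticated.

```python
import numpy as np

def suppress_crosstalk(channels, frame=400, margin_db=6.0):
    """channels: (n_mics, n_samples) close-talking recordings.
    Zero frames of each channel whose energy falls at least
    margin_db below the loudest channel in that frame."""
    out = channels.astype(float).copy()
    margin = 10.0 ** (margin_db / 10.0)
    for start in range(0, out.shape[1] - frame + 1, frame):
        seg = out[:, start:start + frame]        # view into `out`
        energy = (seg ** 2).sum(axis=1) + 1e-12
        quiet = energy * margin < energy.max()   # likely crosstalk
        seg[quiet] = 0.0
    return out
```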
  •
    ABSTRACT: Both human and automatic processing of speech require recognizing more than just the words. We describe a state-of-the-art system for automatic detection of "metadata" (information beyond the words) in both broadcast news and spontaneous telephone conversations, developed as part of the DARPA EARS Rich Transcription program. System tasks include sentence boundary detection, filler word detection, and detection/correction of disfluencies. To achieve best performance, we combine information from different types of language models (based on words, part-of-speech classes, and automatically induced classes) with information from a prosodic classifier. The prosodic classifier employs bagging and ensemble approaches to better estimate posterior probabilities. We use confusion networks to improve robustness to speech recognition errors. Most recently, we have investigated a maximum entropy approach for the sentence boundary detection task, yielding a gain over our standard HMM approach. We report results for these techniques on the official NIST Rich Transcription metadata tasks.
    INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004; 01/2004
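    The bagging used by the prosodic classifier can be sketched with bootstrap-aggregated decision trees, whose averaged votes give smoother posterior estimates than a single tree; the three prosodic features and all data below are invented.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

# Invented prosodic features per interword boundary, e.g. pause
# duration, pitch reset, final-vowel lengthening. 1 = boundary.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0.8).astype(int)

# Bagged decision trees: averaging over bootstrap-trained trees
# smooths the posterior estimates relative to a single tree.
clf = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        random_state=0)
clf.fit(X, y)
print(clf.predict_proba(X[:3]))  # smoothed P(boundary | prosody)
```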
  • Machine Learning for Multimodal Interaction, First International Workshop, MLMI 2004, Martigny, Switzerland, June 21-23, 2004, Revised Selected Papers; 01/2004
  •
    ABSTRACT: While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed NIST's 2001 Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by k-nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.
    08/2003;
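    A minimal sketch of scoring per-side summary-statistic vectors with a k-nearest-neighbor rule, as in one of the models above; the feature set and the target-vs-impostor score normalization are illustrative assumptions.

```python
import numpy as np

def knn_speaker_score(test_vec, target_vecs, impostor_vecs, k=3):
    """Score a test conversation side by its mean distance to the k
    nearest target-speaker sides vs. the k nearest impostor sides.
    Vectors hold per-side summary statistics (e.g. mean and variance
    of pitch and pause durations)."""
    def k_nearest_mean(pool):
        d = np.linalg.norm(np.asarray(pool) - test_vec, axis=1)
        return np.sort(d)[:k].mean()
    # Higher score = closer to the target speaker than to impostors.
    return k_nearest_mean(impostor_vecs) - k_nearest_mean(target_vecs)
```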
  •
    ABSTRACT: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. This paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. In this paper we show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2%, a 71% relative reduction in error over the previous state of the art.
    08/2003;
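    Among the fused knowledge sources are phone streams; here is a minimal sketch of turning phone-recognizer output into relative-frequency n-gram features, the representation later consumed by SVM or likelihood-ratio models. The phone labels are illustrative.

```python
from collections import Counter

def phone_ngram_features(phones, n=2):
    """Relative frequencies of phone n-grams in a phone-recognizer
    output stream, usable as speaker features."""
    grams = Counter(tuple(phones[i:i + n])
                    for i in range(len(phones) - n + 1))
    total = sum(grams.values())
    return {g: c / total for g, c in grams.items()}

# Toy usage on an invented phone stream.
print(phone_ngram_features(["ae", "b", "ae", "k", "ae", "b"]))
```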
  •
    ABSTRACT: In early 2001, we reported (at the Human Language Technology meeting) the early stages of an ICSI (International Computer Science Institute) project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). We report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on; 05/2003
  •
    ABSTRACT: The area of automatic speaker recognition has been dominated by systems using only short-term, low-level acoustic information, such as cepstral features. While these systems have indeed produced very low error rates, they ignore other levels of information beyond low-level acoustics that convey speaker information. Recently published work has shown examples that such high-level information can be used successfully in automatic speaker recognition systems and has the potential to improve accuracy and add robustness. For the 2002 JHU CLSP summer workshop, the SuperSID project (http://www.clsp.jhu.edu/ws2002/groups/supersid/) was undertaken to exploit these high-level information sources and dramatically increase speaker recognition accuracy on a defined NIST evaluation corpus and task. The paper provides an overview of the structure, data, task, tools, and accomplishments of this project. Wide ranging approaches using pronunciation models, prosodic dynamics, pitch and duration features, phone streams, and conversational interactions were explored and developed. We show how these novel features and classifiers indeed provide complementary information and can be fused together to drive down the equal error rate on the 2001 NIST extended data task to 0.2%, a 71% relative reduction in error over the previous state of the art.
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on; 05/2003
  •
    ABSTRACT: While there has been a long tradition of research seeking to use prosodic features, especially pitch, in speaker recognition systems, results have generally been disappointing when such features are used in isolation and only modest improvements have been seen when used in conjunction with traditional cepstral GMM systems. In contrast, we report here on work from the JHU 2002 Summer Workshop exploring a range of prosodic features, using as testbed the 2001 NIST Extended Data task. We examined a variety of modeling techniques, such as n-gram models of turn-level prosodic features and simple vectors of summary statistics per conversation side scored by kth nearest-neighbor classifiers. We found that purely prosodic models were able to achieve equal error rates of under 10%, and yielded significant gains when combined with more traditional systems. We also report on exploratory work on "conversational" features, capturing properties of the interaction across conversation sides, such as turn-taking patterns.
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on; 05/2003
  • Conference Paper: The ICSI Meeting Corpus
    ABSTRACT: We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. We present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus was delivered to the Linguistic Data Consortium (LDC).
    Acoustics, Speech, and Signal Processing, 2003. Proceedings. (ICASSP '03). 2003 IEEE International Conference on; 05/2003
  •
    ABSTRACT: We investigate the incorporation of larger time-scale information, such as prosody, into standard speaker ID systems. Our study is based on the Extended Data Task of the NIST 2001 Speaker ID evaluation, which provides much more test and training data than has traditionally been available to similar speaker ID investigations. In addition, we have had access to a detailed prosodic feature database of Switchboard-I conversations, including data not previously applied to speaker ID. We describe two baseline acoustic systems: an approach using Gaussian Mixture Models, and an LVCSR-based speaker ID system. These results are compared to and combined with two larger time-scale systems: a system based on an "idiolect" language model, and a system making use of the contents of the prosody database. We find that, with sufficient test and training data, suprasegmental information can significantly enhance the performance of traditional speaker ID systems.
    08/2002;
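    The GMM baseline can be sketched as a likelihood-ratio detector: score test frames against a target-speaker model and a background model. A real GMM-UBM system MAP-adapts the speaker model from the background model; the two independent fits below are a simplification, and all data are invented.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Invented cepstral-like frames; real systems use MFCCs plus deltas.
rng = np.random.default_rng(3)
ubm_frames = rng.normal(0.0, 1.0, size=(2000, 12))
spk_frames = rng.normal(0.3, 1.0, size=(500, 12))
test_frames = rng.normal(0.3, 1.0, size=(300, 12))

# Background model and target-speaker model.
ubm = GaussianMixture(n_components=8, random_state=0).fit(ubm_frames)
spk = GaussianMixture(n_components=8, random_state=0).fit(spk_frames)

# Detection score: average per-frame log-likelihood ratio.
llr = spk.score(test_frames) - ubm.score(test_frames)
print(llr)  # positive favors the target-speaker hypothesis
```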