-
Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2011, May 22-27, 2011, Prague Congress Center, Prague, Czech Republic; 01/2011
-
Multimodal Technologies for Perception of Humans, International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, May 8-11, 2007, Revised Selected Papers; 01/2007
-
ABSTRACT: We describe the development of our speech recognition system for the National Institute of Standards and Technology (NIST) Spring 2005 Meeting Rich Transcription (RT-05S) evaluation, highlighting improvements made since last year [1]. The system is based on the SRI-ICSI-UW RT-04F conversational telephone speech (CTS) recognition system, with meeting-adapted models and various audio preprocessing steps. This year’s system features better delay-sum processing of distant microphone channels and energy-based crosstalk suppression for close-talking microphones. Acoustic modeling is improved by virtue of various enhancements to the background (CTS) models, including added training data, decision-tree based state tying, and the inclusion of discriminatively trained phone posterior features estimated by multilayer perceptrons. In particular, we make use of adaptation of both acoustic models and MLP features to the meeting domain. For distant microphone recognition we obtained considerable gains by combining and cross-adapting narrow-band (telephone) acoustic models with broadband (broadcast news) models. Language models (LMs) were improved with the inclusion of new meeting and web data. In spite of a lack of training data, we created effective LMs for the CHIL lecture domain. Results are reported on RT-04S and RT-05S meeting data. Measured on RT-04S conference data, we achieved an overall improvement of 17% relative in both MDM and IHM conditions compared to last year’s evaluation system. Results on lecture data are comparable to the best reported results for that task.
02/2006: pages 463-475;
-
Machine Learning for Multimodal Interaction, Second International Workshop, MLMI 2005, Edinburgh, UK, July 11-13, 2005, Revised Selected Papers; 01/2005
-
ABSTRACT: In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a significant challenge to current HLT capabilities, while offering the promise of a wide range of potential applications. In this paper, we give our vision of the task, the challenges it represents, and the current state of our development, with particular attention to automatic transcription.
10/2003;
-
ABSTRACT: We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. In this paper, we present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus will be delivered to the Linguistic Data Consortium (LDC) [1] by December, 2002, and we expect it to be available through the LDC by the summer of 2003.
03/2003;
-
Nelson Morgan,
Don Baron,
Sonali Bhagat,
Hannah Carvey,
Rajdip Dhillon,
Jane Edwards,
David Gelbart,
Adam Janin,
Ashley Krupski,
Barbara Peskin,
Thilo Pfau,
Elizabeth Shriberg,
Andreas Stolcke
ABSTRACT: In early 2001 we reported (at the Human Language Technology meeting) the early stages of an ICSI project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). In this paper we report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
01/2003;
-
Adam Janin
ABSTRACT: Despite recent advances in automatic speech recognition (ASR) technology, successful ASR applications tend to be limited to telephony, command-and-control functions, head-worn microphones, small vocabularies, and/or simple grammars. Recording and processing spontaneous, human-to-human conversations in natural settings is a far more challenging task. Face-to-face meetings are not only challenging because of acoustic uncertainty, but they also contain very rich content. In contrast to typical ASR corpora, meetings contain a vast amount of overlapping speech, interrupts, false starts, laughter, and other interesting linguistic, acoustic, and prosodic features. To address some of these issues, we have begun research on the recording and processing of meetings. In this paper, we will describe the Meeting Recorder project, including possible applications, corpus collection, and progress to date. 1 Why Record Meetings? 1.1 Challenging Speech Recognition Firstly, natural human-to-human me...
07/2001;
-
ABSTRACT: Multi-stream and multi-band methods can improve the accuracy of speech recognition systems without overly increasing the complexity. However, they cannot be applied blindly. In this paper, we review our experience applying multi-stream and multi-band methods to the Broadcast News corpus. We found that multi-stream systems using different acoustic front-ends provide a significant improvement over single-stream systems. However, although multi-band methods have been successful on smaller tasks, we have not yet been able to show any improvement using them. We report various insights gained from the experience in applying these methods in a large-vocabulary task.
09/1999;
-
ABSTRACT: We describe some aspects of a Broadcast News recognition system based on hybrid HMM/MLP acoustic modeling. These include the use of novel 'modulation spectrogram' features which are combined with conventional models at the posterior probability level, some experiments with nonlinear segment normalization, and an investigation of the interaction of model size and training set size for a multilayer perceptron (MLP) acoustic classifier. We also report preliminary results of incorporating gender-dependence into this system. 1. Background In recent years, we and our colleagues have promoted the exploration of novel, poorly understood, but promising approaches to speech recognition [2]. While such deviations from incremental improvements might initially hurt performance, the subset of the new methods that would ultimately prove useful would not be found without such explorations. This past year, we attempted to follow this advice, while still developing a system with reasonable performanc...
05/1999;
-
ABSTRACT: We describe the latest version of the SRI-ICSI meeting and lecture recognition system, as was used in the NIST RT-07 evaluations, highlighting improvements made over the last year. Changes in the acoustic preprocessing include updated beamforming software for processing of multiple distant microphones, and various adjustments to the speech segmenter for close-talking microphones. Acoustic models were improved by the combined use of neural-net-estimated phone posterior features, discriminative feature transforms trained with fMPE-MAP, and discriminative Gaussian estimation using MPE-MAP, as well as model adaptation specifically to nonnative and non-American speakers. The net effect of these enhancements was a 14-16% relative error reduction on distant microphones, and a 16-17% error reduction on close-talking microphones. Also, for the first time, we report results on a new "coffee break" meeting genre, and on a new NIST metric designed to evaluate combined speech diarization and recognition.