-
INTERSPEECH 2009, 10th Annual Conference of the International Speech Communication Association, Brighton, United Kingdom, September 6-10, 2009; 01/2009
-
ABSTRACT: This chapter describes the English-language SmartKom-Mobile system and related research. We explain the work required to support a second language in SmartKom and the design of the English speech recognizer. We then discuss research carried out on signal processing methods for robust speech recognition and on language analysis using the Embodied Construction Grammar formalism. Finally, the results of human-subject experiments using a novel Wizard and Operator model are analyzed with an eye to creating more felicitous interaction in dialogue systems.
12/2005: pages 453-470;
-
Machine Learning for Multimodal Interaction, First International Workshop, MLMI 2004, Martigny, Switzerland, June 21-23, 2004, Revised Selected Papers; 01/2004
-
INTERSPEECH 2004 - ICSLP, 8th International Conference on Spoken Language Processing, Jeju Island, Korea, October 4-8, 2004; 01/2004
-
ABSTRACT: In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a significant challenge to current HLT capabilities, while offering the promise of a wide range of potential applications. In this paper, we give our vision of the task, the challenges it represents, and the current state of our development, with particular attention to automatic transcription.
10/2003;
-
ABSTRACT: For a connected digits speech recognition task, we have compared the performance of two inexpensive electret microphones with that of a single high quality PZM microphone. Recognition error rates were measured both with and without compensation techniques, where both single-channel and two-channel approaches were used. In all cases the task was recognition at a significant distance (2-6 feet) from the talker's mouth. The results suggest that the wide variability in characteristics among inexpensive electret microphones can be compensated for without explicit quality control, and that this is particularly effective when both single-channel and two-channel techniques are used. In particular, the resulting performance for the inexpensive microphones used together is essentially equivalent to the expensive microphone, and better than for either inexpensive microphone used alone.
04/2003;
-
ABSTRACT: We have collected a corpus of data from natural meetings that occurred at the International Computer Science Institute (ICSI) in Berkeley, California over the last three years. The corpus contains audio recorded simultaneously from head-worn and table-top microphones, word-level transcripts of meetings, and various metadata on participants, meetings, and hardware. Such a corpus supports work in automatic speech recognition, noise robustness, dialog modeling, prosody, rich transcription, information retrieval, and more. In this paper, we present details on the contents of the corpus, as well as rationales for the decisions that led to its configuration. The corpus will be delivered to the Linguistic Data Consortium (LDC) [1] by December, 2002, and we expect it to be available through the LDC by the summer of 2003.
03/2003;
-
Nelson Morgan,
Don Baron,
Sonali Bhagat,
Hannah Carvey,
Rajdip Dhillon,
Jane Edwards,
David Gelbart,
Adam Janin,
Ashley Krupski,
Barbara Peskin,
Thilo Pfau,
Elizabeth Shriberg,
Andreas Stolcke
ABSTRACT: In early 2001 we reported (at the Human Language Technology meeting) the early stages of an ICSI project on processing speech from meetings (in collaboration with other sites, principally SRI, Columbia, and UW). In this paper we report our progress from the first few years of this effort, including: the collection and subsequent release of a 75-meeting corpus (over 70 meeting-hours and up to 16 channels for each meeting); the development of a prosodic database for a large subset of these meetings, and its subsequent use for punctuation and disfluency detection; the development of a dialog annotation scheme and its implementation for a large subset of the meetings; and the improvement of both near-mic and far-mic speech recognition results for meeting speech test sets.
01/2003;
-
8th European Conference on Speech Communication and Technology, EUROSPEECH 2003 - INTERSPEECH 2003, Geneva, Switzerland, September 1-4, 2003; 01/2003
-
ABSTRACT: Far-field microphone speech signals cause high error rates for automatic speech recognition systems, due to room reverberation and lower signal-to-noise ratios. We have observed large increases in speech recognition word error rates when using a far-field (3-6 feet) microphone in a conference room, in comparison with recordings from close-talking microphones. In an earlier paper, we showed improvements in far-field speech recognition performance using a long-term log spectral subtraction method to combat reverberation. This method is based on a principle similar to cepstral mean subtraction but uses a much longer analysis window (e.g., 1 s) in order to deal with reverberation. Here we show that a combination of short-term noise filtering and long-term log spectral subtraction can further reduce recognition word error rates.
07/2002;
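The principle that the abstract above attributes to both cepstral mean subtraction and long-term log spectral subtraction can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the toy spectra, and the fixed per-bin channel gain are all assumptions made for illustration. The idea shown is that a fixed channel multiplies each magnitude spectrum, so it becomes an additive constant in the log domain, and subtracting the per-bin mean over the frames removes it.

```python
import math


def log_spectral_mean_subtraction(log_spectra):
    """Subtract, for every frequency bin, the mean log magnitude over all frames.

    log_spectra is a list of frames; each frame is a list of log magnitudes,
    one per frequency bin. (Hypothetical helper, for illustration only.)
    """
    n_frames = len(log_spectra)
    n_bins = len(log_spectra[0])
    means = [sum(frame[k] for frame in log_spectra) / n_frames
             for k in range(n_bins)]
    return [[frame[k] - means[k] for k in range(n_bins)]
            for frame in log_spectra]


# Toy magnitude spectra: 4 frames, 3 frequency bins (made-up values).
clean = [[1.0, 2.0, 4.0],
         [2.0, 1.0, 8.0],
         [4.0, 4.0, 2.0],
         [0.5, 2.0, 1.0]]

# Assumed fixed channel: one gain per frequency bin, identical in every frame.
channel = [0.5, 3.0, 0.25]

log_clean = [[math.log(m) for m in frame] for frame in clean]
log_filtered = [[math.log(m * h) for m, h in zip(frame, channel)]
                for frame in clean]

norm_clean = log_spectral_mean_subtraction(log_clean)
norm_filtered = log_spectral_mean_subtraction(log_filtered)

# Largest difference between the two normalized versions.
max_diff = max(abs(a - b)
               for fa, fb in zip(norm_clean, norm_filtered)
               for a, b in zip(fa, fb))
print(max_diff < 1e-9)
```

In this idealized setting the normalized clean and filtered spectra coincide. The sketch works on log magnitudes only and leaves out the framing, the long analysis window, and any resynthesis of a waveform.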
-
ABSTRACT: Even a modest degree of room reverberation can greatly increase the difficulty of Automatic Speech Recognition. We have observed large increases in speech recognition word error rates when using a far-field (3-6 feet) mic in a conference room, in comparison with recordings from head-mounted mics. In this paper, we describe experiments with a proposed remedy based on the subtraction of an estimate of the log spectrum from a long-term (e.g., 2 s) analysis window, followed by overlap-add resynthesis. Since the technique is essentially one of enhancement, the processed signal it generates can be used as input for complete speech recognition systems. Here we report results with both HTK and the SRI Hub-5 recognizer. For simpler recognizer configurations and/or moderate-sized training, the improvements are huge, while moderate improvements are still observed for more complex configurations under a number of conditions.
06/2002;
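The abstracts above report results as word error rates. As a reminder of how that figure is commonly defined, the following minimal sketch computes it as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name and the whitespace tokenization are assumptions for illustration; the scoring tools actually used with HTK or the SRI recognizer may normalize and align text differently.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper: both arguments are plain strings split on whitespace.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("too" for "two") and one deletion ("four") in 4 words.
print(word_error_rate("one two three four", "one too three"))  # 0.5
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.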