Thomas Schaaf

Carnegie Mellon University, Pittsburgh, Pennsylvania, United States

Publications (30) · 9.26 Total impact

  • INTERSPEECH 2011, 12th Annual Conference of the International Speech Communication Association, Florence, Italy, August 27-31, 2011; 01/2011
  • Thomas Schaaf, Florian Metze
    INTERSPEECH 2010, 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, September 26-30, 2010; 01/2010
  • ABSTRACT: Acquiring knowledge about persons is a key capability for humanoid robots. In a natural environment, the robot does not only interact with people it already recognizes and knows; it will also have to interact with unknown persons, and by acquiring information about them it can memorize these persons and provide extended personalized services. Today, researchers build systems to recognize a person's face, voice, and other features, but most of these systems depend on pre-collected data. We think that with the given technology it is about time to build a system that collects data autonomously and thus gets to know and learns to recognize persons completely on its own. This paper describes the integration of different perceptual and dialog components, and their individual functionality, into a robot that can contact persons, learn their names, and learn to recognize them in future encounters.
    08/2007: pages 302-316;
  • Mohamed Noamany, Thomas Schaaf, Tanja Schultz
    ABSTRACT: This paper describes the CMU/InterACT effort in developing an Arabic Automatic Speech Recognition (ASR) system for broadcast news and conversations within the GALE 2006 evaluation. Over the nine months of preparation for this evaluation we improved our system by 40% relative to our legacy system. These improvements were achieved in several steps, such as developing a vowelized system, combining it with a non-vowelized one, harvesting transcripts of TV shows from the web for lightly supervised training of acoustic models, language model adaptation, and finally fine-tuning the overall ASR system.
    01/2007;
  • ABSTRACT: This paper describes the Ephyra question answering engine, a modular and extensible framework that allows multiple approaches to question answering to be integrated in one system. Our framework can be adapted to languages other than English by replacing language-specific components. It supports the two major approaches to question answering, knowledge annotation and knowledge mining. Ephyra uses the web as a data resource, but could also work with smaller corpora. In addition, we propose a novel approach to question interpretation which abstracts from the original formulation of the question. Text patterns are used to interpret a question and to extract answers from text snippets. Our system automatically learns the patterns for answer extraction, using question-answer pairs as training data. Experimental results revealed the potential of this approach.
    Text, Speech and Dialogue, 9th International Conference, TSD 2006, Brno, Czech Republic, September 11-15, 2006, Proceedings; 01/2006
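The pattern-based question interpretation and answer extraction described above can be illustrated with a small sketch. The patterns and helper names here are invented for illustration; Ephyra itself learns its answer-extraction patterns from question-answer pairs rather than using hand-written ones.

```python
import re

# Hypothetical question pattern (interprets the question) and answer
# template (instantiated per question, then matched against snippets).
QUESTION_PATTERN = re.compile(r"[Ww]ho invented (?P<thing>.+)\?")
ANSWER_TEMPLATE = r"{thing} was invented by (?P<answer>[A-Z][a-z]+ [A-Z][a-z]+)"

def extract_answer(question, snippet):
    # Interpret the question with a question pattern, then instantiate an
    # answer pattern and search the text snippet for a match.
    m = QUESTION_PATTERN.match(question)
    if not m:
        return None
    answer_re = re.compile(ANSWER_TEMPLATE.format(thing=re.escape(m["thing"])))
    hit = answer_re.search(snippet)
    return hit["answer"] if hit else None

ans = extract_answer("Who invented the phonograph?",
                     "Most sources agree the phonograph was invented by Thomas Edison in 1877.")
```

Abstracting from the original formulation means the same instantiated answer pattern can serve many surface variants of the question.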
  • ABSTRACT: Acquiring knowledge about persons is a key capability for humanoid robots. In a natural environment, the robot does not only interact with people it already recognizes and knows; it will also have to interact with unknown persons, and by acquiring information about them it can memorize these persons and provide extended personalized services. Today, researchers build systems to recognize a person's face, voice, and other features, but most of these systems depend on pre-collected data. We think that with the given technology it is about time to build a system that collects data autonomously and thus gets to know and learns to recognize persons completely on its own. This paper describes the integration of different perceptual and dialog components, and their individual functionality, into a robot that can contact persons, learn their names, and learn to recognize them in future encounters.
    KI 2006: Advances in Artificial Intelligence, 29th Annual German Conference on AI, KI 2006, Bremen, Germany, June 14-17, 2006, Proceedings; 01/2006
  • Dirk Gehrig, Thomas Schaaf
    ABSTRACT: Gaussian mixture models are the most popular probability density used in automatic speech recognition. During decoding, many Gaussians are often evaluated, yet only a small number of them contributes significantly to the probability. Several promising methods to select the relevant Gaussians are known; these methods differ in required memory, overhead, and quality of the selected Gaussians. Projection search, bucket box intersection, and Gaussian clustering are investigated in a broadcast news system with a focus on adaptation (MLLR). Index Terms: speech recognition, LVCSR, Gaussian selection, speaker adaptation, MLLR.
    INTERSPEECH 2006 - ICSLP, Ninth International Conference on Spoken Language Processing, Pittsburgh, PA, USA, September 17-21, 2006; 01/2006
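The clustering-based Gaussian selection investigated above can be sketched as follows. This is a toy illustration with invented dimensions and a unit-variance mixture, not the paper's system: Gaussians are grouped by their means, and at decode time only the Gaussians in the clusters nearest the observation are evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (invented for this sketch): 64 unit-variance diagonal Gaussians
# in 4 dimensions with uniform mixture weights.
means = rng.normal(size=(64, 4))
log_weights = np.full(64, -np.log(64))

def log_gauss(x, mu):
    # Log density of a unit-variance diagonal Gaussian.
    d = x - mu
    return -0.5 * (d @ d + len(d) * np.log(2 * np.pi))

def cluster(points, k=8, iters=5):
    # Simple k-means-style clustering of the Gaussian means.
    centroids = points[:k].copy()
    for _ in range(iters):
        assign = np.argmin(((points[:, None, :] - centroids[None]) ** 2).sum(-1), axis=1)
        for c in range(k):
            if np.any(assign == c):
                centroids[c] = points[assign == c].mean(0)
    return centroids, assign

centroids, assign = cluster(means)

def mixture_loglik(x, idx):
    return np.logaddexp.reduce([log_weights[i] + log_gauss(x, means[i]) for i in idx])

x = rng.normal(size=4)

# Gaussian selection: evaluate only Gaussians whose cluster centroid is among
# the 2 closest to the observation; all others are skipped.
near = set(np.argsort(((centroids - x) ** 2).sum(-1))[:2])
shortlist = [i for i in range(len(means)) if assign[i] in near]
if not shortlist:  # guard: fall back to evaluating everything
    shortlist = list(range(len(means)))

full = mixture_loglik(x, range(len(means)))
fast = mixture_loglik(x, shortlist)  # lower bound on the full log-likelihood
```

The shortlist likelihood can only under-shoot the full one, which is why selection quality matters: skipped Gaussians that would have contributed noticeably degrade the score.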
  • ABSTRACT: We describe a feature extraction method for general audio modeling using a temporal extension of Independent Component Analysis (ICA) and demonstrate its utility in the context of a sound classification task in a kitchen environment. Our approach accounts for temporal dependencies over multiple analysis frames, much like the standard audio modeling technique of adding first and second temporal derivatives to the feature set. Using a real-world dataset of kitchen sounds, we show that our approach outperforms a canonical version of this standard front end, the mel-frequency cepstral coefficients (MFCCs), which have found successful application in automatic speech recognition tasks.
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
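The baseline front end referred to above, cepstra extended with first and second temporal derivatives, can be sketched as below. The regression window of +/- 2 frames is a common choice assumed here, not taken from the paper.

```python
import numpy as np

def delta(feats, n=2):
    # Regression-based temporal derivative over a window of +/- n frames,
    # with edge padding at the sequence boundaries.
    padded = np.pad(feats, ((n, n), (0, 0)), mode="edge")
    num = sum(k * (padded[n + k:len(feats) + n + k] - padded[n - k:len(feats) + n - k])
              for k in range(1, n + 1))
    return num / (2 * sum(k * k for k in range(1, n + 1)))

# Toy "cepstra": 10 frames of 13 coefficients, rising linearly over time.
feats = np.arange(130, dtype=float).reshape(10, 13)
d1 = delta(feats)                      # first temporal derivative
d2 = delta(d1)                         # second temporal derivative
extended = np.hstack([feats, d1, d2])  # the common 39-dimensional front end
```

The temporal ICA approach in the paper replaces this fixed derivative stacking with basis functions learned from multi-frame context.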
  • ABSTRACT: In human-mediated translation scenarios a human interpreter translates between a source and a target language using either a spoken or a written representation of the source language. In this paper we improve the recognition performance on the speech of the human translator in the target language by taking advantage of the source language representations. We use machine translation techniques to translate between the source and target language resources and then bias the target language speech recognizer towards the gained knowledge, hence the name Machine Translation Enhanced Automatic Speech Recognition. We investigate several different techniques, among them restricting the search vocabulary, selecting hypotheses from n-best lists, applying cache and interpolation schemes to language modeling, and combining the most successful techniques into our final, iterative system. Overall we outperform the baseline system by a relative word error rate reduction of 37.6%.
    INTERSPEECH 2005 - Eurospeech, 9th European Conference on Speech Communication and Technology, Lisbon, Portugal, September 4-8, 2005; 01/2005
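One of the language-model biasing schemes named above, interpolating a background model with a cache built from the machine-translation output, can be sketched as follows. This is a unigram toy example with invented vocabulary and probabilities, not the paper's models.

```python
from collections import Counter

# Hypothetical background unigram model (probabilities sum to 1).
background = {"the": 0.4, "house": 0.1, "boat": 0.1, "is": 0.2, "red": 0.2}

def biased_lm(mt_hypothesis, lam=0.3):
    # Linear interpolation of the background model with a cache model
    # estimated from the machine-translation hypothesis:
    #   P(w) = (1 - lam) * P_background(w) + lam * P_cache(w)
    counts = Counter(mt_hypothesis)
    total = sum(counts.values())
    cache = {w: c / total for w, c in counts.items()}
    return {w: (1 - lam) * background.get(w, 0.0) + lam * cache.get(w, 0.0)
            for w in set(background) | set(cache)}

lm = biased_lm(["the", "house", "is", "red"])  # words seen in the MT output get boosted
```

Words predicted by the machine translation gain probability mass, which is what biases the target-language recognizer towards the source-language knowledge.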
  • ABSTRACT: Official documents must nowadays be made available in many languages, as for example in the EU with its 20 official languages. The need for effective tools to aid the multitude of human translators in their work is therefore easily apparent. An ASR system that enables the human translator to speak the translation in an unrestricted manner, instead of typing it, constitutes such a tool. In this work we improve the recognition performance of such an ASR system on the target language of the human translator by taking advantage of either a written or a spoken source language representation. To do so, machine translation techniques are used to translate between the different languages, and the involved ASR systems are then biased towards the gained knowledge. We present an iterative approach for ASR improvement and outperform our baseline system by a relative word error rate reduction of 35.8%/29.9% in the case of a written/spoken source language representation. Further, we show how multiple target languages, as for example provided by different simultaneous translators during European Parliament debates, can be incorporated into our system design for an improvement of all involved ASR systems.
    Automatic Speech Recognition and Understanding, 2005 IEEE Workshop on; 01/2005
  • Thomas Schaaf
    ABSTRACT: Dissertation, Universität Karlsruhe, 2004 (not for exchange).
    01/2004;
  • ABSTRACT: This paper describes our effort in developing a Mandarin Broadcast News system for the RT-04f (Rich Transcription) evaluation. Starting from a legacy system, we revisited all the issues, including partitioning, acoustic modeling, language modeling, decoding, and system combination strategies. Over a period of three months we achieved a sizable improvement: from 21.2% to 5.2% on the development set, and from 42.7% to 22.4% measured on the RT-04f evaluation set.
    01/2004;
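The gains reported above follow the usual definition of relative error-rate reduction, which can be computed directly from the absolute numbers:

```python
def relative_reduction(baseline_wer, improved_wer):
    # Relative improvement = absolute reduction divided by the baseline rate.
    return (baseline_wer - improved_wer) / baseline_wer

dev_gain = relative_reduction(21.2, 5.2)    # development set, roughly 75% relative
eval_gain = relative_reduction(42.7, 22.4)  # RT-04f evaluation set, roughly 48% relative
```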
  • John McDonough, Thomas Schaaf, Alex Waibel
    ABSTRACT: Modern speech recognition systems are based on the hidden Markov model (HMM) and employ cepstral features to represent input speech. In speaker normalization, the cepstral features of speech from a given speaker are transformed to match the speaker independent HMM. In speaker adaptation, the means of the HMM are transformed to match the input speech. Vocal tract length normalization (VTLN) is a popular normalization scheme wherein the frequency axis of the short-time spectrum is rescaled prior to the extraction of cepstral features. In this work, we develop novel speaker adaptation schemes by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We describe two classes of such maps: rational all-pass transforms (RAPTs) which are well-known in the signal processing literature, and sine-log all-pass transforms (SLAPTs) which are novel in this work. For both classes of maps, we develop the relations necessary to perform maximum likelihood estimation of the relevant transform parameters using enrollment data from a new speaker. We also propose the means by which an HMM may be trained specifically for use with this type of adaptation. Finally, in a set of recognition experiments conducted on conversational speech material from the Switchboard Corpus as well as the English Spontaneous Scheduling Task, we demonstrate the capacity of APT-based speaker adaptation to achieve word error rate reductions superior to those obtained with other popular adaptation techniques, and moreover, reductions that are additive with those provided by VTLN.
    Speech Communication. 01/2004;
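The VTLN-style frequency warping that the all-pass transforms above generalize can be sketched via the phase of a first-order all-pass (bilinear) map, a standard form in the signal processing literature. This is a sketch of the warping function only, not the paper's RAPT/SLAPT parameter estimation.

```python
import numpy as np

def bilinear_warp(omega, alpha):
    # Warped frequency axis induced by a first-order all-pass (bilinear)
    # transform; alpha in (-1, 1), with alpha = 0 leaving the axis unchanged.
    # The endpoints 0 and pi are fixed, and the map is monotonic.
    return omega + 2 * np.arctan(alpha * np.sin(omega) / (1 - alpha * np.cos(omega)))

omega = np.linspace(0, np.pi, 9)
warped = bilinear_warp(omega, 0.2)  # one invented warp factor for illustration
```

Because such a map acts as a conformal transform, the equivalent rescaling can be applied entirely in the cepstral domain, which is the property the paper exploits for adaptation.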
  • ABSTRACT: Oral communication is transient, but many important decisions, social contracts, and fact findings are first carried out in an oral setup, documented in written form, and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. This paper summarizes part of the progress that we have made in this test bed, specifically on the questions of automatic transcription using LVCSR, information access using non-keyword-based methods, summarization, and user interfaces. The system is capable of automatically constructing a searchable and browsable audiovisual database of meetings and providing access to these records.
    08/2002;
  • ABSTRACT: This paper describes the 2000 ISL large vocabulary speech recognition system for fast decoding of conversational speech, which was used in the German Verbmobil-II project. The challenge of this task is to build robust acoustic models that handle different dialects, spontaneous effects, and crosstalk as they occur in conversational speech. We present speaker-incremental normalization and adaptation experiments close to real-time constraints. To reduce the number of consequential errors caused by out-of-vocabulary (OOV) words, we conducted filler-model experiments to handle unknown proper names. The overall improvements from 1998 to 2000 resulted in a word error reduction from 40% to 17% on our development test set.
    04/2002;
  • I. Rogina, T. Schaaf
    ABSTRACT: Archiving, indexing, and later browsing through stored presentations and lectures is becoming increasingly common. We have investigated the special problems and advantages of lectures and propose the design and adaptation of a speech recognizer to a lecture such that the recognition accuracy can be significantly improved by prior analysis of the presented documents using a special class-based language model. We define a tracking accuracy measure, which measures how well a system can automatically align recognized words with parts of a presentation, and show that the tracking accuracy can be improved by prior exploitation of the presented documents. The system described in this paper is part of an intelligent meeting room developed in the European Union-sponsored project FAME (Facilitating Agent for Multicultural Exchange).
    Multimodal Interfaces, 2002. Proceedings. Fourth IEEE International Conference on; 02/2002
  • ABSTRACT: A subset of three meetings was chosen as the test set.
    05/2001;
  • ABSTRACT: Describes the 2000 ISL large vocabulary speech recognition system for fast decoding of conversational speech, which was used in the German Verbmobil-II project. The challenge of this task is to build robust acoustic models that handle different dialects, spontaneous effects, and crosstalk as they occur in conversational speech. We present speaker-incremental normalization and adaptation experiments close to real-time constraints. To reduce the number of consequential errors caused by out-of-vocabulary words, we conducted filler-model experiments to handle unknown proper names. The overall improvements from 1998 to 2000 resulted in a word error reduction from 40% to 17% on our development test set.
    Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on; 02/2001
  • ABSTRACT: Oral communication is transient, but many important decisions, social contracts, and fact findings are first carried out in an oral setup, documented in written form, and later retrieved. At Carnegie Mellon University's Interactive Systems Laboratories we have been experimenting with the documentation of meetings. The paper summarizes part of the progress that we have made in this test bed, specifically on the questions of automatic transcription using large vocabulary continuous speech recognition, information access using non-keyword-based methods, summarization, and user interfaces. The system is capable of automatically constructing a searchable and browsable audio-visual database of meetings and providing access to these records.
    Acoustics, Speech, and Signal Processing, 2001. Proceedings. (ICASSP '01). 2001 IEEE International Conference on; 02/2001
  • ABSTRACT: The speech-to-speech translation system Verbmobil requires a multilingual setting. This consists of recognition engines for the three languages German, English, and Japanese that run in one common framework, together with a language identification component that is able to switch between these recognizers. This article describes the challenges of multilingual speech recognition and presents different solutions to the automatic language identification task. The combination of the described components results in a flexible and user-friendly multilingual spoken dialog system.
    01/2001;