Zoltán Tüske
RWTH Aachen University · Human Language Technology and Pattern Recognition Group

About

55 Publications · 15,964 Reads · 1,267 Citations

Publications (55)
Preprint
We investigate the impact of aggressive low-precision representations of weights and activations in two families of large LSTM-based architectures for Automatic Speech Recognition (ASR): hybrid Deep Bidirectional LSTM - Hidden Markov Models (DBLSTM-HMMs) and Recurrent Neural Network - Transducers (RNN-Ts). Using a 4-bit integer representation, a na...
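The abstract above describes representing weights and activations with 4-bit integers. A minimal per-tensor symmetric fake-quantization sketch conveys the core operation; the range calibration and scaling scheme here are assumptions for illustration, not the papers' actual quantizer:

```python
import numpy as np

def fake_quant_int4(x):
    """Symmetric per-tensor fake quantization to a 4-bit grid.

    Illustrative sketch only: real int4 schemes differ in range
    calibration, per-channel scales, and activation handling.
    """
    levels = 7                                  # symmetric int4 range [-7, 7]
    scale = np.max(np.abs(x)) / levels
    if scale == 0:
        return x
    q = np.clip(np.round(x / scale), -levels, levels)
    return q * scale                            # de-quantized 4-bit values

w = np.linspace(-1.0, 1.0, 100)
w_q = fake_quant_int4(w)
```

After quantization the tensor takes at most 15 distinct values, and the per-element rounding error is bounded by half the scale.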
Preprint
End-to-end spoken language understanding (SLU) systems that process human-human or human-computer interactions are often context independent and process each turn of a conversation independently. Spoken conversations on the other hand, are very much context dependent, and dialog history contains useful information that can improve the processing of...
Preprint
In our previous work we demonstrated that a single headed attention encoder-decoder model is able to reach state-of-the-art results in conversational speech recognition. In this paper, we further improve the results for both Switchboard 300 and 2000. Through use of an improved optimizer, speaker vector embeddings, and alternative speech representat...
Preprint
We present a comprehensive study on building and adapting RNN transducer (RNN-T) models for spoken language understanding (SLU). These end-to-end (E2E) models are constructed in three practical settings: a case where verbatim transcripts are available, a constrained case where the only available annotations are SLU labels and their values, and a mor...
Preprint
We investigate a set of techniques for RNN Transducers (RNN-Ts) that were instrumental in lowering the word error rate on three different tasks (Switchboard 300 hours, conversational Spanish 780 hours and conversational Italian 900 hours). The techniques pertain to architectural changes, speaker adaptation, language model fusion, model combination...
Preprint
An essential component of spoken language understanding (SLU) is slot filling: representing the meaning of a spoken utterance using semantic entity labels. In this paper, we develop end-to-end (E2E) spoken language understanding systems that directly convert speech input to semantic entities and investigate if these E2E SLU models can be trained so...
Preprint
It is generally believed that direct sequence-to-sequence (seq2seq) speech recognition models are competitive with hybrid models only when a large amount of data, at least a thousand hours, is available for training. In this paper, we show that state-of-the-art recognition performance can be achieved on the Switchboard-300 database using a single h...
Preprint
There has been huge progress in speech recognition over the last several years. Tasks once thought extremely difficult, such as SWITCHBOARD, now approach levels of human performance. The MALACH corpus (LDC catalog LDC2012S05), a 375-hour subset of a large archive of Holocaust testimonies collected by the Survivors of the Shoah Visual History Founda...
Preprint
We propose a population-based Evolutionary Stochastic Gradient Descent (ESGD) framework for optimizing deep neural networks. ESGD combines SGD and gradient-free evolutionary algorithms as complementary algorithms in one framework in which the optimization alternates between the SGD step and evolution step to improve the average fitness of the popul...
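The ESGD abstract describes alternating an SGD phase with an evolution phase over a population. A toy scalar sketch of that alternation, with assumed operators (Gaussian mutation, keep-the-fitter-half selection) standing in for the paper's richer evolution step over deep networks:

```python
import random

def esgd(init_pop, loss, grad, generations=10, sgd_steps=5, lr=0.1, seed=0):
    """Toy Evolutionary SGD on scalar parameters.

    Sketch of the SGD/evolution alternation only; the actual framework
    evolves deep networks with per-individual hyper-parameters.
    """
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(generations):
        # SGD phase: every individual takes a few gradient steps.
        for _ in range(sgd_steps):
            pop = [x - lr * grad(x) for x in pop]
        # Evolution phase: keep the fitter half, refill with mutated copies.
        pop.sort(key=loss)
        elites = pop[: len(pop) // 2]
        pop = elites + [x + rng.gauss(0.0, 0.1) for x in elites]
    return min(pop, key=loss)

# Minimize (x - 3)^2 from a random population (demo objective, not ASR).
rng0 = random.Random(1)
best = esgd([rng0.uniform(-5.0, 5.0) for _ in range(8)],
            loss=lambda x: (x - 3.0) ** 2,
            grad=lambda x: 2.0 * (x - 3.0))
```

The selection step never mutates the surviving elites, so the best individual's fitness improves monotonically across generations.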
Conference Paper
Full-text available
In this paper we describe the RWTH Aachen keyword search (KWS) system developed in the course of the IARPA Babel program. We put focus on acoustic modeling with neural networks and evaluate the full pipeline with respect to the KWS performance. At the core of this study lie multilingual bottleneck features extracted from a deep neural network train...
Conference Paper
In automatic speech recognition, as in many areas of machine learning, stochastic modeling relies on neural networks more and more. Both in acoustic and language modeling, neural networks today mark the state of the art for large vocabulary continuous speech recognition, providing huge improvements over former approaches that were solely based on G...
Conference Paper
Full-text available
In this work, Portuguese, Polish, English, Urdu, and Arabic automatic speech recognition evaluation systems developed by the RWTH Aachen University are presented. Our LVCSR systems focus on various domains like broadcast news, spontaneous speech, and podcasts. All of these systems except Urdu are used for the Euronews and Skynews evaluations as part of the E...
Conference Paper
Full-text available
In this paper we continue to investigate how the deep neural network (DNN) based acoustic models for automatic speech recognition can be trained without hand-crafted feature extraction. Previously, we have shown that a simple fully connected feed-forward DNN performs surprisingly well when trained directly on the raw time signal. The analysis of th...
Conference Paper
Full-text available
In this paper, German, Polish, Spanish, and Portuguese large vocabulary continuous speech recognition (LVCSR) systems developed by the RWTH Aachen University are presented. All of these systems are used for the Quaero and EU-Bridge project evaluations. The LVCSR systems developed for these competitive e...
Conference Paper
Full-text available
In this paper we investigate how much feature extraction is required by a deep neural network (DNN) based acoustic model for automatic speech recognition (ASR). We decompose the feature extraction pipeline of a state-of-the-art ASR system step by step and evaluate acoustic models trained on standard MFCC features, critical band energies (CRBE), FFT...
Conference Paper
Full-text available
This paper presents the progress of acoustic models for low-resourced languages (Assamese, Bengali, Haitian Creole, Lao, Zulu) developed within the second evaluation campaign of the IARPA Babel project. This year, the main focus of the project is on training high-performing automatic speech recognition (ASR) and keyword search (KWS) systems from...
Conference Paper
This paper investigates the application of hierarchical MRASTA bottleneck (BN) features for under-resourced languages within the IARPA Babel project. Through multilingual training of Multilayer Perceptron (MLP) BN features on five languages (Cantonese, Pashto, Tagalog, Turkish, and Vietnamese), we arrived at a single feature stream which is mo...
Conference Paper
In this paper, we describe the RWTH speech recognition system for English lectures developed within the Translectures project. A difficulty in the development of an English lectures recognition system is the high ratio of non-native speakers. We address this problem by using very effective deep bottleneck features trained on multilingual data. The...
Article
With long-span neural network language models, considerable improvements have been obtained in speech recognition. However, it is difficult to apply these models if the underlying search space is large. In this paper, we combine previous work on lattice decoding with long short-term memory (LSTM) neural network language models. By adding refined pr...
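The abstract above concerns rescoring a large first-pass search space with an LSTM language model. A simplified n-best view of that idea (real lattice decoding additionally merges and prunes partial hypotheses; the weights and the toy unigram "LM" below are assumptions for the demo, not values from the paper) can be sketched as a log-linear combination of scores:

```python
import math

def rescore(nbest, lm_logprob, lm_scale=0.7, word_penalty=-0.5):
    """Pick the best hypothesis after adding a second-pass LM score.

    Simplified n-best stand-in for lattice rescoring; lm_scale and
    word_penalty are illustrative, not tuned values.
    """
    def total(hyp):
        words, first_pass_score = hyp
        return (first_pass_score
                + lm_scale * lm_logprob(words)
                + word_penalty * len(words))
    return max(nbest, key=total)

# Toy unigram "LM" standing in for an LSTM LM (assumption for the demo).
logp = {"hello": math.log(0.6), "world": math.log(0.3), "word": math.log(0.1)}
lm = lambda words: sum(logp.get(w, math.log(1e-4)) for w in words)

best_words, _ = rescore([(["hello", "world"], -10.0),
                         (["hello", "word"], -9.8)], lm)
```

Here the second hypothesis wins the first pass, but the LM score flips the decision toward "hello world".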
Conference Paper
Full-text available
In this paper, German and English large vocabulary continuous speech recognition (LVCSR) systems developed by the RWTH Aachen University for the IWSLT-2013 evaluation campaign are presented. Good improvements are obtained with state-of-the-art monolingual and multilingual bottleneck features. In addition, an open vocabulary approach using morphemi...
Conference Paper
Full-text available
In this paper we describe the RWTH automatic speech recognition system for Slovenian developed within the transLectures project. The project aims at supporting the transcription and translation of video lectures freely available on the web. Difficulties arise on all levels of modeling: Slovenian is a morphologically rich language with a high level...
Conference Paper
Recently, a multilingual Multi Layer Perceptron (MLP) training method was introduced without having to explicitly map the phonetic units of multiple languages to a common set. This paper further investigates this method using bottleneck (BN) tandem connectionist acoustic modeling for four high-resourced languages - English, French, German, and Poli...
Article
The most widely used acoustic feature extraction methods of current automatic speech recognition (ASR) systems are based on the assumption of stationarity. In this paper we extensively evaluate a recently introduced filter-stable, non-stationary signal processing method, which relies on an adaptive part-tone decomposition of voiced speech to obta...
Conference Paper
Full-text available
Gaussian Mixture Model (GMM) and Multi Layer Perceptron (MLP) based acoustic models are compared on a French large vocabulary continuous speech recognition (LVCSR) task. In addition to optimizing the output layer size of the MLP, the effect of the deep neural network structure is also investigated. Moreover, using different linear transformations...
Conference Paper
Full-text available
We recently discovered novel discriminative training criteria following a principled approach. In this approach training criteria are developed from error bounds on the global error for pattern classification tasks that depend on non-trivial loss functions. Automatic speech recognition (ASR) is a prominent example for such a task depending on the n...
Conference Paper
Different normalization methods are applied in recent Large Vocabulary Continuous Speech Recognition Systems (LVCSR) to reduce the influence of speaker variability on the acoustic models. In this paper we investigate the use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT) in Multi Layer Perceptron (MLP) feature extrac...
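VTLN, mentioned above, warps the frequency axis per speaker before feature extraction. One common piecewise-linear formulation (an assumption here; the systems in the paper may use a different warping function) can be sketched as:

```python
def vtln_warp(f, alpha, f_nyq=8000.0, knee_frac=0.875):
    """Piecewise-linear VTLN frequency warping.

    Below the knee frequency, frequencies are scaled by the warp factor
    alpha; above it, a second linear segment maps the Nyquist frequency
    onto itself so the warped axis stays inside the analysis band.
    One common formulation, used here as an assumption; exact variants
    differ between systems. Sensible for typical warp factors
    (roughly 0.88 <= alpha <= 1.12).
    """
    knee = knee_frac * f_nyq
    if f <= knee:
        return alpha * f
    # Linear segment from (knee, alpha * knee) to (f_nyq, f_nyq).
    slope = (f_nyq - alpha * knee) / (f_nyq - knee)
    return alpha * knee + slope * (f - knee)
```

For example, with alpha = 0.9 a 1 kHz component maps to 900 Hz, while the band edge is left fixed: vtln_warp(8000.0, alpha) returns 8000.0 for any valid warp factor.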
Conference Paper
We consider morpheme-based speech recognizers to be those that use a language model built on sub-word, morpheme-like units. In our experiments, we compared the performance of morpheme-based recognizers built with five different segmentation methods to that of a standard word-based system on a read broadcast-news task with planned speech....
Conference Paper
The improvement achieved by changing the basis of speech recognition from words to morphs (various sub-word units) varies greatly across tasks and languages. We make an attempt to explore the source of this variability by the investigation of three LVCSR tasks corresponding to three speech genres of a highly agglutinative language. Novel, press con...
Conference Paper
Full-text available
The paper describes automatic speech recognition experiments and results on the spontaneous Hungarian MALACH speech corpus. A novel morph-based lexical modeling approach is compared to the traditional word-based one and to another, previously best performing morph-based one in terms of word and letter error rates. The applied language and acoustic...
Conference Paper
Full-text available
A coupled acoustic- and language-modeling approach is presented for the recognition of spontaneous speech primarily in agglutinative languages. The effectiveness of the approach in large vocabulary spontaneous speech recognition is demonstrated on the Hungarian MALACH corpus. The derivation of morphs from word forms is based on a statistical morpho...
