Martin Wöllmer’s research while affiliated with Technical University of Munich and other places


Publications (79)


Emotion Recognition in Naturalistic Speech and Language—A Survey
  • Chapter

January 2015 · 138 Reads · 34 Citations · Martin Wöllmer

This chapter provides an overview of recent developments in naturalistic emotion recognition based on acoustic and linguistic cues. It discusses a variety of use cases in which emotion recognition can improve quality of service and quality of life. The chapter describes the existing corpora of emotional speech data relating to such scenarios, the underlying theory of emotion modeling, and the need for an optimal unit of analysis. It focuses on the challenges that have become evident for real-life applications: non-prototypicality; lack of solid ground truth and data sparsity; generalization across application scenarios, languages, and cultures; requirements of real-time and incremental processing; robustness with respect to acoustic conditions; and appropriate evaluation measures that reflect real-life settings. The chapter concludes by giving further directions for the field, including novel strategies to augment training data by synthesis and (semi-)unsupervised learning, as well as joint learning of other paralinguistic features by exploiting mutual information.


Figure 1: Hierarchical structure of the XML format. It allows flexible annotation and segment extraction from audio files (Episode) by providing Sections corresponding, e.g., to speech or music parts, subdivided into Turns of a certain speaker, which consist of a sequence of utterances for which time alignments (Sync) are provided. Changes in background (BG), such as noise conditions, are also annotated.
Table 3: OOV rates over the BCN test set when using only the n most frequent words in the texts for building the language model.
A Broadcast News Corpus for Evaluation and Tuning of German LVCSR Systems
  • Article
  • Full-text available

December 2014 · 164 Reads · 10 Citations · [...] · Gerhard Rigoll

Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus comprising over 160 hours of German broadcast news and propose it as an evaluation framework for LVCSR systems. We report our own experimental results on the corpus, achieved with a state-of-the-art LVCSR decoder, measuring the effect of different feature sets and decoding parameters, and thereby demonstrate that real-time decoding of our test set is feasible on a desktop PC at a word error rate of 9.2%.
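The 9.2% figure is a standard word error rate (WER), i.e. the number of substitutions, deletions, and insertions divided by the reference length. As a reminder of how such scores are computed, here is a minimal Python sketch using word-level Levenshtein distance; the function and the example sentences are illustrative and not part of the paper's evaluation tooling.

    # Minimal word error rate (WER) computation via word-level Levenshtein distance.
    def wer(reference: str, hypothesis: str) -> float:
        """WER = (substitutions + deletions + insertions) / reference length."""
        ref = reference.split()
        hyp = hypothesis.split()
        # Dynamic-programming edit distance over words.
        d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            d[i][0] = i
        for j in range(len(hyp) + 1):
            d[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution / match
        return d[len(ref)][len(hyp)] / max(len(ref), 1)

    # Illustrative sentences only: one deleted word out of five -> WER = 0.2
    print(wer("der bundestag hat heute beschlossen",
              "der bundestag heute beschlossen"))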


Feature Enhancement by Deep LSTM Networks for ASR in Reverberant Multisource Environments

July 2014 · 268 Reads · 67 Citations · Computer Speech & Language

This article investigates speech feature enhancement based on deep bidirectional recurrent neural networks. The Long Short-Term Memory (LSTM) architecture is used to exploit a self-learnt amount of temporal context in learning the correspondence between noisy, reverberant speech features and their undistorted counterparts. The resulting networks are applied to feature enhancement in the context of the 2013 2nd Computational Hearing in Multisource Environments (CHiME) Challenge Track 2 task, which consists of the Wall Street Journal (WSJ-0) corpus distorted by highly non-stationary, convolutive noise. In extensive test runs, different feature front-ends, network training targets, and network topologies are evaluated in terms of frame-wise regression error and speech recognition performance. Furthermore, we consider gradually refined speech recognition back-ends, from baseline ‘out-of-the-box’ clean models to discriminatively trained multi-condition models adapted to the enhanced features. As a result, deep bidirectional LSTM networks processing log Mel filterbank outputs deliver the best results with clean models, reaching down to 42% word error rate (WER) at signal-to-noise ratios ranging from -6 to 9 dB (multi-condition CHiME Challenge baseline: 55% WER). Discriminative training of the back-end using LSTM-enhanced features is shown to further decrease the WER to 22%. To our knowledge, this is the best result reported for the 2nd CHiME Challenge WSJ-0 task yet.
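As a rough illustration of this kind of enhancement front-end, the following sketch (assuming PyTorch) sets up a bidirectional LSTM that regresses clean log-Mel features from noisy, reverberant ones with a frame-wise MSE objective. Layer sizes, the 40-dimensional feature space, and the random stand-in tensors are assumptions for illustration, not the exact configuration evaluated in the article.

    # Sketch of a bidirectional LSTM feature-enhancement front-end (assumed setup).
    import torch
    import torch.nn as nn

    class BLSTMEnhancer(nn.Module):
        def __init__(self, n_mel: int = 40, hidden: int = 256, layers: int = 2):
            super().__init__()
            self.blstm = nn.LSTM(input_size=n_mel, hidden_size=hidden,
                                 num_layers=layers, batch_first=True,
                                 bidirectional=True)
            # Project concatenated forward/backward states back to feature space.
            self.proj = nn.Linear(2 * hidden, n_mel)

        def forward(self, noisy):          # noisy: (batch, frames, n_mel)
            context, _ = self.blstm(noisy)
            return self.proj(context)      # enhanced: (batch, frames, n_mel)

    model = BLSTMEnhancer()
    optim = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.MSELoss()

    noisy = torch.randn(8, 200, 40)        # stand-in for noisy log-Mel features
    clean = torch.randn(8, 200, 40)        # stand-in for parallel clean features
    enhanced = model(noisy)
    loss = loss_fn(enhanced, clean)        # frame-wise regression objective
    loss.backward()
    optim.step()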


Probabilistic speech feature extraction with context-sensitive Bottleneck neural networks

May 2014 · 157 Reads · 4 Citations · Neurocomputing

We introduce a novel context-sensitive feature extraction approach for spontaneous speech recognition. As bidirectional Long Short-Term Memory (BLSTM) networks are known to enable improved phoneme recognition accuracies by incorporating long-range contextual information into speech decoding, we integrate the BLSTM principle into a Tandem front-end for probabilistic feature extraction. Unlike the previously proposed approaches which exploit BLSTM modeling by generating a discrete phoneme prediction feature, our feature extractor merges continuous high-level probabilistic BLSTM features with low-level features. By combining BLSTM modeling and Bottleneck (BN) feature generation, we propose a novel front-end that allows us to produce context-sensitive probabilistic feature vectors of arbitrary size, independent of the network training targets. Evaluations on challenging spontaneous, conversational speech recognition tasks show that this concept prevails over recently published architectures for feature-level context modeling.
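A minimal sketch of the bottleneck idea, assuming PyTorch: the network is trained against phoneme targets, but the narrow bottleneck layer, whose width is independent of the number of training targets, supplies the probabilistic features that are concatenated with the low-level MFCCs in the Tandem front-end. The layer sizes and the 41-phoneme target inventory are illustrative assumptions.

    # Sketch of a context-sensitive bottleneck (BN) feature extractor (assumed setup).
    import torch
    import torch.nn as nn

    class BottleneckBLSTM(nn.Module):
        def __init__(self, n_in=39, hidden=128, bn_size=30, n_phonemes=41):
            super().__init__()
            self.blstm = nn.LSTM(n_in, hidden, batch_first=True, bidirectional=True)
            self.bottleneck = nn.Linear(2 * hidden, bn_size)   # feature size is free
            self.classifier = nn.Linear(bn_size, n_phonemes)   # training targets only

        def forward(self, mfcc):                   # (batch, frames, n_in)
            h, _ = self.blstm(mfcc)
            bn = torch.tanh(self.bottleneck(h))    # probabilistic BN features
            return self.classifier(bn), bn

    net = BottleneckBLSTM()
    mfcc = torch.randn(4, 300, 39)                 # stand-in MFCC sequences
    logits, bn_feat = net(mfcc)

    # Training objective: frame-wise phoneme classification (targets are stand-ins).
    targets = torch.randint(0, 41, (4, 300))
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 41), targets.reshape(-1))

    # Tandem front-end: concatenate BN features with the original low-level features.
    tandem = torch.cat([mfcc, bn_feat], dim=-1)    # (4, 300, 39 + 30)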


Fig. 1: Block diagram of the proposed system: The central component is a multi-stream HMM fusing MFCC with optional word predictions by NSC (operating on Mel frequency bands, MFB) and/or the BLSTM-RNN (processing MFCC features). The MFCC feature extraction can optionally be performed on an enhanced speech signal, applying convolutive NMF as pre-processing.
The TUM+TUT+KUL approach to the CHiME Challenge 2013: Multi-stream ASR exploiting BLSTM networks and sparse NMF

June 2013 · 81 Reads · 26 Citations

We present our joint contribution to the 2nd CHiME Speech Separation and Recognition Challenge. Our system combines speech enhancement by supervised sparse non-negative matrix factorisation (NMF) with a multi-stream speech recognition system. In addition to a conventional MFCC HMM recogniser, predictions by a bidirectional Long Short-Term Memory recurrent neural network (BLSTM-RNN) and from non-negative sparse classification (NSC) are integrated into a triple-stream recogniser. Experiments are carried out on the small vocabulary and the medium vocabulary recognition tasks of the Challenge. Consistent improvements over the Challenge baselines demonstrate the efficacy of the proposed system, resulting in an average word accuracy of 92.8 % in the small vocabulary task and an average word error rate of 41.42 % in the medium vocabulary task.
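To make the enhancement stage more concrete, the following NumPy sketch performs supervised NMF on a noisy magnitude spectrogram with fixed, pre-trained speech and noise dictionaries and reconstructs the speech estimate through a Wiener-style mask. The sparsity penalty used in the actual system is omitted for brevity, and all matrices below are random stand-ins.

    # Sketch of supervised (dictionary-based) NMF speech enhancement with NumPy.
    import numpy as np

    def enhance(V, W_s, W_n, n_iter=100, eps=1e-10):
        """V: noisy magnitude spectrogram; W_s, W_n: speech/noise dictionaries."""
        W = np.concatenate([W_s, W_n], axis=1)            # (freq, k_s + k_n)
        H = np.abs(np.random.rand(W.shape[1], V.shape[1]))
        for _ in range(n_iter):
            # Multiplicative update minimising KL divergence D(V || WH),
            # with the dictionaries held fixed.
            WH = W @ H + eps
            H *= (W.T @ (V / WH)) / (W.T @ np.ones_like(V) + eps)
        # Wiener-style mask from the speech part of the factorisation.
        S = W_s @ H[:W_s.shape[1]]
        return V * S / (W @ H + eps)                      # enhanced magnitudes

    rng = np.random.default_rng(0)
    V = np.abs(rng.standard_normal((257, 120)))           # stand-in spectrogram
    W_s = np.abs(rng.standard_normal((257, 30)))          # speech dictionary
    W_n = np.abs(rng.standard_normal((257, 10)))          # noise dictionary
    V_hat = enhance(V, W_s, W_n)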


Noise robust ASR in reverberated multisource environments applying convolutive NMF and Long Short-Term Memory

May 2013 · 46 Reads · 16 Citations · Computer Speech & Language

This article proposes and evaluates various methods to integrate the concept of bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end for context-sensitive Tandem feature extraction and show how the Connectionist Temporal Classification approach can be used as a BLSTM-based back-end, alternatively to Hidden Markov Models (HMM). We combine context-sensitive BLSTM-based feature generation and speech decoding techniques with source separation by convolutive non-negative matrix factorization. Applying our speaker adapted multi-stream HMM framework that processes MFCC features from NMF-enhanced speech as well as word predictions obtained via BLSTM networks and non-negative sparse classification (NSC), we obtain an average accuracy of 91.86% on the PASCAL CHiME Challenge task at signal-to-noise ratios ranging from −6 to 9 dB. To our knowledge, this is the best result ever reported for the CHiME Challenge task.
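The multi-stream combination itself amounts to weighting the per-stream log-likelihoods of each HMM state with stream exponents; the small NumPy sketch below illustrates that standard formulation. The stream weights and likelihood values are made up for illustration and do not reflect the tuned system.

    # Sketch of multi-stream HMM observation scoring with stream exponents.
    import numpy as np

    # Per-state log-likelihoods of one observation from three streams:
    # MFCC acoustic model, BLSTM word predictions, and NSC predictions.
    log_b = {
        "mfcc":  np.array([-12.3, -10.1, -15.7]),   # states s1..s3
        "blstm": np.array([-2.1,  -0.4,  -3.0]),
        "nsc":   np.array([-1.8,  -0.9,  -2.5]),
    }
    weights = {"mfcc": 1.0, "blstm": 0.5, "nsc": 0.5}  # assumed stream exponents

    # Combined observation log-likelihood: sum of weighted per-stream scores.
    combined = sum(w * log_b[s] for s, w in weights.items())
    best_state = int(np.argmax(combined))
    print(combined, best_state)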


Keyword spotting exploiting Long Short-Term Memory

February 2013 · 190 Reads · 59 Citations · Speech Communication

We investigate various techniques for keyword spotting which are exclusively based on acoustic modeling and do not presume the existence of an in-domain language model. Since adequate context modeling is nevertheless necessary for word spotting, we show how the principle of Long Short-Term Memory (LSTM) can be incorporated into the decoding process. We propose a novel technique that exploits LSTM in combination with Connectionist Temporal Classification in order to improve performance by using a self-learned amount of contextual information. All considered approaches are evaluated on read speech as contained in the TIMIT corpus as well as on the SEMAINE database which consists of spontaneous and emotionally colored speech. As further evidence for the effectiveness of LSTM modeling for keyword spotting, results on the CHiME task are shown.
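For the CTC-based variant, the PyTorch sketch below shows the training objective: a bidirectional LSTM emits posteriors over keyword labels plus a CTC blank, and spotting then amounts to scanning the resulting label posteriors for high-scoring keyword paths. The network sizes, label inventory, and random feature tensors are placeholders rather than the configuration used in the paper.

    # Sketch of CTC training for an LSTM-based keyword spotter (assumed setup).
    import torch
    import torch.nn as nn

    n_keywords, blank = 10, 0                     # label 0 reserved for CTC blank
    lstm = nn.LSTM(39, 128, batch_first=True, bidirectional=True)
    out = nn.Linear(256, n_keywords + 1)
    ctc = nn.CTCLoss(blank=blank)

    feats = torch.randn(4, 200, 39)               # stand-in acoustic features
    targets = torch.randint(1, n_keywords + 1, (4, 3))   # 3 keywords per utterance
    h, _ = lstm(feats)
    log_probs = out(h).log_softmax(-1).transpose(0, 1)   # (frames, batch, labels)
    loss = ctc(log_probs, targets,
               input_lengths=torch.full((4,), 200),
               target_lengths=torch.full((4,), 3))
    loss.backward()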


LSTM-Modeling of Continuous Emotions in an Audiovisual Affect Recognition Framework

February 2013 · 493 Reads · 300 Citations · Image and Vision Computing

Automatically recognizing human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This article presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual human-computer interaction scenario. Building on previous studies which demonstrate that long-range context modeling tends to increase accuracies of emotion recognition, we propose a fully automatic audiovisual recognition approach based on Long Short-Term Memory (LSTM) modeling of word-level audio and video features. LSTM networks are able to incorporate knowledge about how emotions typically evolve over time, so that the inferred emotion estimates are produced under consideration of an optimal amount of context. Extensive evaluations on the Audiovisual Sub-Challenge of the 2011 Audio/Visual Emotion Challenge show how acoustic, linguistic, and visual features contribute to the recognition of different affective dimensions as annotated in the SEMAINE database. We apply the same acoustic features as used in the challenge baseline system, whereas visual features are computed via a novel facial movement feature extractor. Comparing our results with the recognition scores of all Audiovisual Sub-Challenge participants, we find that the proposed LSTM-based technique leads to the best average recognition performance that has been reported for this task so far.
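A minimal sketch of the regression setup, assuming PyTorch: word-level audiovisual feature vectors are fed to an LSTM (unidirectional here for brevity) that outputs one estimate per affective dimension and word, trained with a mean squared error loss. Feature and hidden sizes are illustrative, and the random tensors merely stand in for SEMAINE features and annotations.

    # Sketch of LSTM-based continuous emotion regression (assumed setup).
    import torch
    import torch.nn as nn

    class EmotionLSTM(nn.Module):
        def __init__(self, n_feat=120, hidden=64, n_dims=4):
            super().__init__()
            self.lstm = nn.LSTM(n_feat, hidden, batch_first=True)
            self.head = nn.Linear(hidden, n_dims)

        def forward(self, x):              # x: (batch, words, n_feat)
            h, _ = self.lstm(x)
            return self.head(h)            # per-word estimates of the 4 dimensions

    model = EmotionLSTM()
    feats = torch.randn(2, 50, 120)        # stand-in word-level audio+video features
    labels = torch.randn(2, 50, 4)         # stand-in continuous annotations per word
    loss = nn.MSELoss()(model(feats), labels)
    loss.backward()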


Fig. 1 Block diagram of the proposed framework.
Fig. 3 LSTM memory block consisting of one memory cell: The input, output, and forget gates collect activations from inside and outside the block that control the cell through multiplicative units (depicted as small circles); the input, output, and forget gates scale the input, output, and internal state, respectively; a_i and a_o denote activation functions; the recurrent connection of fixed weight 1.0 maintains the internal state.
Table 4 NSegSRR values for processed audio files of meeting IS1009b. 
Table 5 NSegSRR values for non-processed audio files of meeting IS1009b. 
Fig. 5 NPM for all the RIRs relative to each source.
Real-Time Activity Detection in a Multi-Talker Reverberated Environment

December 2012 · 118 Reads · 4 Citations · Cognitive Computation

This paper proposes a real-time person activity detection framework operating in the presence of multiple sources in reverberated environments. The framework is composed of two main parts: the speech enhancement front-end and the activity detector. The aim of the former is to automatically reduce the distortions introduced by room reverberation in the available distant speech signals and thus to achieve a significant improvement of speech quality for each speaker. The overall front-end is composed of three cooperating blocks, each one fulfilling a specific task: speaker diarization, room impulse response identification, and speech dereverberation. In particular, the speaker diarization algorithm is essential to pilot the operations performed in the other two stages in accordance with the speakers’ activity in the room. The activity estimation algorithm is based on bidirectional Long Short-Term Memory networks, which allow for context-sensitive activity classification from audio feature functionals extracted via the real-time speech feature extraction toolkit openSMILE. Extensive computer simulations have been performed using a subset of the AMI database for activity evaluation in meetings; the obtained results confirm the effectiveness of the approach.
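As a rough illustration of the activity classification stage only (not the complete dereverberation front-end), the sketch below, assuming PyTorch, feeds per-window feature functionals such as those produced by openSMILE to a bidirectional LSTM that labels each window as active or inactive. The 384-dimensional feature size and the random tensors are assumptions for illustration.

    # Sketch of a BLSTM activity classifier over per-window feature functionals.
    import torch
    import torch.nn as nn

    class ActivityBLSTM(nn.Module):
        def __init__(self, n_feat=384, hidden=64, n_classes=2):
            super().__init__()
            self.blstm = nn.LSTM(n_feat, hidden, batch_first=True, bidirectional=True)
            self.head = nn.Linear(2 * hidden, n_classes)

        def forward(self, x):                      # x: (batch, windows, n_feat)
            h, _ = self.blstm(x)
            return self.head(h)                    # per-window activity logits

    model = ActivityBLSTM()
    functionals = torch.randn(2, 100, 384)         # stand-in feature functionals
    labels = torch.randint(0, 2, (2, 100))         # 1 = speaker active, 0 = inactive
    logits = model(functionals)
    loss = nn.CrossEntropyLoss()(logits.reshape(-1, 2), labels.reshape(-1))
    loss.backward()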



Citations (71)


... Well-studied aspects of source separation are speech denoising and speech enhancement. Previous research on speech denoising comprises NMF (Weninger et al., 2012), deep NMF (Le , recurrent neural network (RNN)-based discriminate training (Weninger et al., 2014b), long short-term memory-RNNs , memory-enhanced RNNs (Weninger et al., 2014a), and deep recurrent autoencoders (Weninger et al., 2014c). Latest approaches to speech source separation also employ different DNN types, such as feed-forward neural networks (FFNNs) (Naithani et al., 2016), RNNs (Huang et al., 2015;Sun et al., 2017) or end-to-end learning using a CNN-or RNN-autoencoder instead of the usual spectral features (Venkataramani et al., 2017). ...

Reference:

New Avenues in Audio Intelligence: Towards Holistic Real-life Audio Understanding
Combining bottleneck-BLSTM and semi-supervised sparse NMF for recognition of conversational speech in highly instationary noise
  • Citing Conference Paper
  • September 2012

... Possible solutions are the calculation of speaker-independent features, such as the changes instead of the absolute values [16], or different normalisation methods [3]. Some research questions have already been answered: For example, it was shown that suprasegmental features perform better than segmental ones [21] or that features are not language-independent [25]. ...

The role of prosody in affective speech, linguistic insights, studies in language and communication
  • Citing Book
  • January 2009

... There are only a few approaches, yet, considering this mutual dependency of speaker characteristics, mostly based on multi-task learning with neural networks. Examples in acoustic speech information exploitation include simultaneous assessment of age, gender, height, and race recognition [45], age, height, weight, and smoking habits recognition at the same time [38], emotion, likability, and personality assessment in one pass [66], commonly targeting deception and sincerity [64] or drowsiness and alcohol intoxication [65] in the recognition, as well as assessment of several emotion dimensions or representations in parallel [14,60,61,63], and aiming at speaker verification [6] co-learning other aspects. Similar approaches can be found in text-based information exploitation [25]. ...

Semantic speech tagging: towards combined analysis
  • Citing Book
  • January 2011

... In the study of identifying singer traits, there has been limited research focusing on traits beyond distinguishing between sexes. Weninger et al. trained models to identify a singer's sex, race, age, and height and obtained an accuracy of 89.6% for SSC at the beat level using a Bidirectional Long Short-Term Memory (BLSTM) network; however, the dataset only contains pop tracks that have quarter beats [6]. Shi trained a model for sex classification and age detection that achieved a performance of 91% and 36%, respectively, using an internal dataset [7]. ...

Automatic assessment of singer traits in popular music: gender, age, height and race
  • Citing Book
  • January 2011

... • Spectral centroid: Higher spectral centroid values indicate emotions positioned in the upper-right quadrant of the valence-arousal 2D plane, such as excited or happy [26]. Lower values indicate subdued emotions, such as sad. ...

Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs
  • Citing Book
  • January 2009

... Another facet of SER is the semantic component, wherein the content of speech is analyzed. Here, the occurrence of certain words with an emotional reference is counted [12]. Technologies that are used to enable voice-based emotion recognition are machine- and deep-learning methods, such as convolutional or recurrent neural networks, K-Nearest Neighbor, Support Vector Machines, or the Hidden Markov Model. ...

Emotion Recognition in Naturalistic Speech and Language—A Survey
  • Citing Chapter
  • January 2015

... Sun et al. [27] proposed an improved seven-part detection method for faces by combining a domain-adaptive method and a fully convolutional network, which improves generalization across domains. Eyben et al. [28] used a feature-level fusion approach to fuse audio and text information for emotion recognition and obtained high accuracy. Poria et al. [29] applied attention for the first time within the field of emotion recognition, where the computation of attention was used to model the relationship between different modalities and to fuse different input information. ...

On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues
  • Citing Article
  • January 2009

... It is able to compensate for the inability of standard RNNs to propagate information across long time series. LSTM networks have shown outstanding performance in a wide variety of pattern recognition applications, including language translation, image analysis, voice recognition, defect detection, and text recognition [15][16][17][18]. ...

Robust Discriminative Keyword Spotting for Emotionally Colored Spontaneous Speech using Bidirectional LSTM Networks

... D'Mello et al. (2011) introduced AutoTutor, an agent that adapts feedback based on learners' cognitive and emotional states, promoting engagement. Bevacqua et al. (2012) describe systems that select and analyze feedback based on verbal (like tone of voice and word choice) and non-verbal (such as facial expressions, gestures, and posture) cues by leveraging emotion recognition techniques and embodied conversational agents. For instance, if a learner shows signs of frustration (e.g., frowning, slumped posture, or using negative language), the system might adapt its response by offering empathetic and supportive feedback. ...

Interacting with Emotional Virtual Agents

Lecture Notes of the Institute for Computer Sciences

... Recently, the term Relations has been used to refer to the relation between the individual and the social context in terms of perceived involvement [50] and to the changes detected in a group's involvement in a multiparty interaction [51]. Thus, it is clear that understanding the multiparty human communicative behavior implies an understanding of the modifications of the social structure and dynamics of small [52] and large groups (friends, colleagues, families, students, etc.) [53] and the changes in individuals' behaviors and attitudes that occur because of their membership in social and situational settings [54]. For a more broad overview of how the term context has been defined, the reader is directed to [55]. ...

Temporal and situational context modeling for improved dominance recognition in meetings
  • Citing Article
  • January 2012