Martin Wöllmer’s research while affiliated with Technical University of Munich and other places


Publications (79)


Figure 1: Distribution of gender, race, and height among 516 singers in the UltraStar Singer Trait Database. Distribution of age is shown on beat level, since it is dependent on recording date.
Automatic Assessment of Singer Traits in Popular Music: Gender, Age, Height and Race
  • Conference Paper
  • Full-text available

January 2011 · 408 Reads · 28 Citations · Martin Wöllmer

We investigate fully automatic recognition of singer traits, i.e., gender, age, height and 'race' of the main performing artist(s) in recorded popular music. Monaural source separation techniques are combined to simultaneously enhance harmonic parts and extract the leading voice. For evaluation, the UltraStar database of 581 pop music songs with 516 distinct singers is chosen. Extensive test runs with Long Short-Term Memory sequence classification reveal that binary classification of gender, height, race and age reaches up to 89.6, 72.1, 63.3 and 57.6% unweighted accuracy on beat level in unseen test data.
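
As an illustration of the beat-level sequence classification described above, the following is a minimal sketch of a bidirectional LSTM classifier producing one binary trait decision (e.g., gender) per beat. The feature dimensionality, layer sizes, and training settings are illustrative assumptions, not the configuration used in the paper.

import numpy as np
import tensorflow as tf

# Hypothetical beat-synchronous feature sequences: one 39-dimensional vector per beat.
n_beats, n_features = 200, 39

inputs = tf.keras.Input(shape=(n_beats, n_features))
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64, return_sequences=True))(inputs)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(1, activation="sigmoid"))(x)
model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# X: (songs, beats, features); y: one binary label per beat, e.g. the lead singer's
# gender propagated to every beat of the song (dummy data here).
X = np.random.randn(8, n_beats, n_features).astype("float32")
y = np.random.randint(0, 2, size=(8, n_beats, 1)).astype("float32")
model.fit(X, y, epochs=1, batch_size=4)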




Fig. 1. Turnwise annotations of the SAL database.
Fig. 2. Histogram for the turnwise annotations of activation (top) and valence (bottom) in the SAL database.
Fig. 3. Architecture of the acoustic-linguistic affect recognition system.
Fig. 5. Structure of a bidirectional network with input i, output o, and two hidden layers (h_f and h_b) for forward and backward processing.
Fig. 7. Prediction of activation (black) using a Regression-LSTM and ground truth (grey) over all turns of the test set (only acoustic features used).
Combining Long Short-Term Memory and Dynamic Bayesian Networks for Incremental Emotion-Sensitive Artificial Listening

November 2010 · 563 Reads · 180 Citations · IEEE Journal of Selected Topics in Signal Processing

The automatic estimation of human affect from the speech signal is an important step towards making virtual agents more natural and human-like. In this paper, we present a novel technique for incremental recognition of the user's emotional state as it is applied in a sensitive artificial listener (SAL) system designed for socially competent human-machine communication. Our method is capable of using acoustic, linguistic, as well as long-range contextual information in order to continuously predict the current quadrant in a two-dimensional emotional space spanned by the dimensions valence and activation. The main system components are a hierarchical dynamic Bayesian network (DBN) for detecting linguistic keyword features and long short-term memory (LSTM) recurrent neural networks which model phoneme context and emotional history to predict the affective state of the user. Experimental evaluations on the SAL corpus of non-prototypical real-life emotional speech data consider a number of variants of our recognition framework: continuous emotion estimation from low-level feature frames is evaluated as a new alternative to the common approach of computing statistical functionals of given speech turns. Further performance gains are achieved by discriminatively training LSTM networks and by using bidirectional context information, leading to a quadrant prediction F1-measure of up to 51.3 %, which is only 7.6 % below the average inter-labeler consistency.
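
To make the contrast between frame-level estimation and the conventional turn-level approach concrete, the sketch below computes statistical functionals of low-level descriptor (LLD) frames for one speech turn. The chosen descriptors and functionals are examples only, not the paper's exact feature set.

import numpy as np

def turn_functionals(lld_frames: np.ndarray) -> np.ndarray:
    """lld_frames: (n_frames, n_llds) -> one fixed-length functional vector per turn."""
    stats = [
        lld_frames.mean(axis=0),
        lld_frames.std(axis=0),
        lld_frames.min(axis=0),
        lld_frames.max(axis=0),
        np.percentile(lld_frames, 50, axis=0),
    ]
    return np.concatenate(stats)

frames = np.random.randn(500, 13)        # e.g. 5 s of 13 LLDs at 100 frames per second (dummy)
print(turn_functionals(frames).shape)    # (65,) -- one vector for the whole turn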



Figure 1: LSTM memory block consisting of one memory cell: input, output, and forget gate collect activations from inside and outside the block which control the cell through multiplicative units (depicted as small circles); input, output, and forget gate scale input, output, and internal state respectively; a_i and a_o denote activation functions; the recurrent connection of fixed weight 1.0 maintains the internal state.
Figure 2: Structure of a bidirectional network with input i, output o, as well as two hidden layers (h_f and h_b).
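
The memory block described in the caption above corresponds to the standard LSTM update equations. A common (peephole-free) formulation is given below as a reference sketch, with input, forget, and output gates i_t, f_t, o_t; the fixed recurrent connection of weight 1.0 mentioned in the caption is the path from c_{t-1} to c_t, modulated only by the forget gate.

\begin{align}
  i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
  f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
  o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
  c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
  h_t &= o_t \odot \tanh(c_t)
\end{align}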
Recognition of Spontaneous Conversational Speech using Long Short-Term Memory Phoneme Predictions

September 2010 · 262 Reads · 21 Citations

We present a novel continuous speech recognition framework designed to unite the principles of triphone and Long Short-Term Memory (LSTM) modeling. The LSTM principle allows a recurrent neural network to store and to retrieve information over long time periods, which was shown to be well-suited for the modeling of co-articulation effects in human speech. Our system uses a bidirectional LSTM network to generate a phoneme prediction feature that is observed by a triphone-based large-vocabulary continuous speech recognition (LVCSR) decoder, together with conventional MFCC features. We evaluate both the phoneme prediction error rates of various network architectures and the word recognition performance of our Tandem approach on the COSINE database, a large corpus of conversational and noisy speech, and show that incorporating LSTM phoneme predictions into an LVCSR system leads to significantly higher word accuracies.
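
A minimal sketch of the Tandem idea follows: a framewise phoneme prediction derived from BLSTM posteriors is appended to conventional MFCC features before decoding. The posterior matrix, phoneme inventory size, and MFCC configuration are placeholder assumptions, and whether the full posterior vector or only the most likely phoneme index is appended is shown as two generic variants, not as a statement about the paper's exact setup.

import numpy as np

n_frames, n_mfcc, n_phonemes = 300, 39, 41

mfcc = np.random.randn(n_frames, n_mfcc)                 # conventional acoustic features (dummy)
posteriors = np.random.rand(n_frames, n_phonemes)        # framewise BLSTM phoneme posteriors (dummy)
posteriors /= posteriors.sum(axis=1, keepdims=True)

# Variant A: append the full posterior vector to each MFCC frame.
tandem_full = np.hstack([mfcc, posteriors])              # shape (300, 80)

# Variant B: append only the index of the most likely phoneme per frame.
phoneme_index = posteriors.argmax(axis=1)[:, None].astype(float)
tandem_discrete = np.hstack([mfcc, phoneme_index])       # shape (300, 40)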


Table 1: Distribution of the features selected via CFS for the classification of valence (VAL) and activation (ACT) as well as for the discrimination of 3, 4, and 5 clusters in emotional space (see section 3.4). 
Context-sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling

September 2010 · 2,307 Reads · 209 Citations

In this paper, we apply a context-sensitive technique for multimodal emotion recognition based on feature-level fusion of acoustic and visual cues. We use bidirectional Long Short-Term Memory (BLSTM) networks which, unlike most other emotion recognition approaches, exploit long-range contextual information for modeling the evolution of emotion within a conversation. We focus on recognizing dimensional emotional labels, which enables us to classify both prototypical and non-prototypical emotional expressions contained in a large audiovisual database. Subject-independent experiments on various classification tasks reveal that the BLSTM network approach generally prevails over standard classification techniques such as Hidden Markov Models or Support Vector Machines, and achieves F1-measures of the order of 72%, 65%, and 55% for the discrimination of three clusters in emotional space and the distinction between three levels of valence and activation, respectively.
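
Feature-level fusion can be pictured as frame-wise concatenation of the acoustic and visual descriptors after bringing them to a common frame rate. The frame rates, feature dimensions, and nearest-neighbour alignment below are assumptions made for illustration, not necessarily the paper's exact pipeline.

import numpy as np

audio = np.random.randn(500, 39)   # e.g. 100 fps acoustic LLDs over a 5 s segment (dummy)
video = np.random.randn(125, 20)   # e.g. 25 fps facial-expression features (dummy)

# Repeat each video frame up to the audio frame rate (nearest-neighbour upsampling),
# then concatenate per frame to obtain one fused feature vector per audio frame.
idx = np.minimum((np.arange(len(audio)) * len(video)) // len(audio), len(video) - 1)
fused = np.hstack([audio, video[idx]])
print(fused.shape)                 # (500, 59)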



Bidirectional LSTM Networks for Context-Sensitive Keyword Detection in a Cognitive Virtual Agent Framework

September 2010 · 1,403 Reads · 87 Citations · Cognitive Computation

Robustly detecting keywords in human speech is an important precondition for cognitive systems, which aim at intelligently interacting with users. Conventional techniques for keyword spotting usually show good performance when evaluated on well articulated read speech. However, modeling natural, spontaneous, and emotionally colored speech is challenging for today’s speech recognition systems and thus requires novel approaches with enhanced robustness. In this article, we propose a new architecture for vocabulary independent keyword detection as needed for cognitive virtual agents such as the SEMAINE system. Our word spotting model is composed of a Dynamic Bayesian Network (DBN) and a bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The BLSTM network uses a self-learned amount of contextual information to provide a discrete phoneme prediction feature for the DBN, which is able to distinguish between keywords and arbitrary speech. We evaluate our Tandem BLSTM-DBN technique on both read speech and spontaneous emotional speech and show that our method significantly outperforms conventional Hidden Markov Model-based approaches for both application scenarios.
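
To illustrate the role of the discrete phoneme prediction feature, the sketch below collapses framewise phoneme predictions (as a BLSTM would deliver them) and applies a plain subsequence test for one keyword. This is a grossly simplified stand-in for the probabilistic DBN decision between keyword and garbage speech, intended only to show the data flow; the toy phoneme strings are assumptions.

def collapse(frames):
    """Collapse consecutive repeated framewise phoneme predictions into single symbols."""
    out = []
    for p in frames:
        if not out or out[-1] != p:
            out.append(p)
    return out

def contains_keyword(frame_predictions, keyword_phonemes):
    """Naive keyword-vs-garbage decision: is the keyword a subsequence of the phoneme stream?"""
    seq = iter(collapse(frame_predictions))
    return all(p in seq for p in keyword_phonemes)

frames = ["sil", "sil", "hh", "hh", "ax", "l", "l", "ow", "sil"]
print(contains_keyword(frames, ["hh", "ax", "l", "ow"]))   # True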


Fig. 1. (a) Unweighted and (b) weighted average recall (UAR/WAR) in percentage of within-corpus evaluations on all six corpora using corpus normalization (CN). Results for all emotion categories present in the particular corpus, binary arousal, and binary valence.
Fig. 2. Box-plots for unweighted average recall (UA) in percentage for cross-corpora testing on four test corpora. Results obtained for varying number of classes (2-6) and for classes mapped to high/low arousal (A) and positive/negative valence (V). (a) DES, UAR. (b) EMO-DB, UAR. (c) eNTERFACE, UAR. (d) SMARTKOM, UAR.
Cross-Corpus Acoustic Emotion Recognition: Variances and Strategies

July 2010 · 5,360 Reads · 380 Citations · IEEE Transactions on Affective Computing

As the recognition of emotion from speech has matured to a degree where it becomes applicable in real-life settings, it is time for a realistic view on obtainable performances. Most studies tend to overestimation in this respect: Acted data is often used rather than spontaneous data, results are reported on preselected prototypical data, and true speaker disjunctive partitioning is still less common than simple cross-validation. Even speaker disjunctive evaluation can give only limited insight into the generalization ability of today's emotion recognition engines since training and test data used for system development usually tend to be similar as far as recording conditions, noise overlay, language, and types of emotions are concerned. A considerably more realistic impression can be gathered by interset evaluation: We therefore show results employing six standard databases in a cross-corpora evaluation experiment which could also be helpful for learning about chances to add resources for training and overcoming the typical sparseness in the field. To better cope with the observed high variances, different types of normalization are investigated. In total, 1.8 k individual evaluations indicate the crucial performance inferiority of inter-corpus compared to intra-corpus testing.
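
One of the normalization strategies investigated, corpus normalization, can be sketched as per-corpus z-normalization of every feature. The feature matrices and dimensions below are dummies, and the exact set of normalization variants compared in the paper may differ.

import numpy as np

def corpus_normalize(features_by_corpus):
    """Z-normalize each feature separately within every corpus (corpus normalization, CN)."""
    normalized = {}
    for name, X in features_by_corpus.items():      # X: (n_instances, n_features)
        mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-12
        normalized[name] = (X - mu) / sigma
    return normalized

# Dummy feature matrices standing in for two of the six corpora.
corpora = {"EMO-DB": np.random.randn(500, 384), "eNTERFACE": np.random.randn(1200, 384)}
corpora = corpus_normalize(corpora)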


Citations (71)


... Well-studied aspects of source separation are speech denoising and speech enhancement. Previous research on speech denoising comprises NMF (Weninger et al., 2012), deep NMF (Le , recurrent neural network (RNN)-based discriminative training (Weninger et al., 2014b), long short-term memory-RNNs , memory-enhanced RNNs (Weninger et al., 2014a), and deep recurrent autoencoders (Weninger et al., 2014c). Latest approaches to speech source separation also employ different DNN types, such as feed-forward neural networks (FFNNs) (Naithani et al., 2016), RNNs (Huang et al., 2015; Sun et al., 2017) or end-to-end learning using a CNN- or RNN-autoencoder instead of the usual spectral features (Venkataramani et al., 2017). ...

Reference:

New Avenues in Audio Intelligence: Towards Holistic Real-life Audio Understanding
Combining bottleneck-BLSTM and semi-supervised sparse NMF for recognition of conversational speech in highly instationary noise
  • Citing Conference Paper
  • September 2012
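
Since the context above lists NMF-based speech denoising, here is a minimal sketch of supervised NMF separation: speech and noise spectral bases are assumed to be pre-trained, only the activations are estimated on the noisy magnitude spectrogram, and a Wiener-like soft mask reconstructs the speech part. The dictionary sizes and the Euclidean multiplicative update are illustrative choices, not the exact recipe of the cited papers (which use semi-supervised and sparse variants).

import numpy as np

def nmf_activations(V, W, n_iter=100, eps=1e-12):
    """Estimate H >= 0 such that V ~ W @ H, with the bases W held fixed."""
    H = np.random.rand(W.shape[1], V.shape[1])
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)      # multiplicative update (Euclidean cost)
    return H

n_bins, n_frames = 257, 400
V = np.abs(np.random.randn(n_bins, n_frames))     # noisy magnitude spectrogram (dummy)
W_speech = np.abs(np.random.randn(n_bins, 32))    # pre-trained speech bases (dummy)
W_noise = np.abs(np.random.randn(n_bins, 16))     # pre-trained noise bases (dummy)

W = np.hstack([W_speech, W_noise])
H = nmf_activations(V, W)
speech_part = W_speech @ H[:32]
mask = speech_part / (W @ H + 1e-12)              # Wiener-like soft mask
speech_estimate = mask * V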

... Possible solutions are the calculation of speaker-independent features, such as the changes instead of the absolute values [16], or different normalisation methods [3]. Some research questions have already been answered: For example, it was shown that suprasegmental features perform better than segmental ones [21] or that features are not language-independent [25]. ...

The role of prosody in affective speech, linguistic insights, studies in language and communication
  • Citing Book
  • January 2009

... There are only a few approaches, yet, considering this mutual dependency of speaker characteristics, mostly based on multi-task learning with neural networks. Examples in acoustic speech information exploitation include simultaneous assessment of age, gender, height, and race recognition [45], age, height, weight, and smoking habits recognition at the same time [38], emotion, likability, and personality assessment in one pass [66], commonly targeting deception and sincerity [64] or drowsiness and alcohol intoxication [65] in the recognition, as well as assessment of several emotion dimensions or representations in parallel [14,60,61,63], and aiming at speaker verification [6] co-learning other aspects. Similar approaches can be found in text-based information exploitation [25]. ...

Semantic speech tagging: towards combined analysis
  • Citing Book
  • January 2011
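
The multi-task setup mentioned in the context above can be sketched as one shared trunk over acoustic features with a separate output head per speaker trait. The tasks, layer sizes, and loss weights below are illustrative assumptions, not a description of any particular cited system.

import tensorflow as tf

inputs = tf.keras.Input(shape=(384,))                     # e.g. utterance-level functionals (dummy size)
shared = tf.keras.layers.Dense(128, activation="relu")(inputs)
shared = tf.keras.layers.Dense(64, activation="relu")(shared)

age = tf.keras.layers.Dense(1, name="age")(shared)                               # regression head
gender = tf.keras.layers.Dense(1, activation="sigmoid", name="gender")(shared)   # binary head
height = tf.keras.layers.Dense(1, name="height")(shared)                         # regression head

model = tf.keras.Model(inputs, [age, gender, height])
model.compile(
    optimizer="adam",
    loss={"age": "mse", "gender": "binary_crossentropy", "height": "mse"},
    loss_weights={"age": 1.0, "gender": 1.0, "height": 1.0},
)
model.summary()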

... In the study of identifying singer traits, there has been limited research focusing on traits beyond distinguishing between sexes. Weninger et al. trained models to identify a singer's sex, race, age and height and obtained an accuracy of 89.6% for SSC on beat level using a Bidirectional Long Short-Term Memory (BLSTM); however, the dataset only contains pop tracks that have quarter beats [6]. Shi trained a model for sex classification and age detection that achieved performances of 91% and 36% respectively using an internal dataset [7]. ...

Automatic assessment of singer traits in popular music: gender, age, height and race
  • Citing Book
  • January 2011

... • Spectral centroid: Higher spectral centroid values indicate emotions positioned in the upper-right quadrant of the valence-arousal 2D plane, such as excited or happy [26]. Lower values indicate subdued emotions, such as sad. ...

Prosodic, spectral or voice quality? Feature type relevance for the discrimination of emotion pairs
  • Citing Book
  • January 2009
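
For reference, the spectral centroid mentioned in the context above is the magnitude-weighted mean frequency of a short-time spectrum. The frame length and sampling rate in this sketch are arbitrary example values.

import numpy as np

def spectral_centroid(frame, sample_rate):
    """Magnitude-weighted mean frequency of one short-time frame."""
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    return (freqs * spectrum).sum() / (spectrum.sum() + 1e-12)

sr = 16000
t = np.arange(400) / sr                          # one 25 ms frame
frame = np.sin(2 * np.pi * 440 * t)              # pure 440 Hz tone
print(spectral_centroid(frame, sr))              # close to 440 Hz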

... Another facet of SER is the semantic component, wherein the content of speech is analyzed. Here, the occurrence of certain words with an emotional reference is counted [12]. Technologies that are used to enable voice-based emotion recognition are machine- and deep-learning methods, such as convolutional or recurrent neural networks, K-Nearest Neighbor, Support Vector Machines or the Hidden Markov Model. ...

Emotion Recognition in Naturalistic Speech and Language—A Survey
  • Citing Chapter
  • January 2015
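
The semantic component described above, counting words with an emotional reference, can be sketched as a simple lexicon lookup over the recognized transcript. The lexicon and its two categories are toy examples.

from collections import Counter

# Toy emotion lexicon mapping words to a coarse valence category.
EMOTION_LEXICON = {
    "happy": "positive", "great": "positive", "love": "positive",
    "sad": "negative", "awful": "negative", "hate": "negative",
}

def emotion_word_counts(transcript: str) -> Counter:
    """Count occurrences of lexicon words per category in a transcript."""
    counts = Counter()
    for word in transcript.lower().split():
        word = word.strip(".,!?")
        if word in EMOTION_LEXICON:
            counts[EMOTION_LEXICON[word]] += 1
    return counts

print(emotion_word_counts("I love this, it is great, not sad at all!"))
# Counter({'positive': 2, 'negative': 1})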

... Sun et al. [27] proposed an improved seven-part detection method for faces by combining a domain adaptive method and a fully convolutional network, which improves the generalization across domains. Eyben et al. [28] used a feature-level fusion approach to fuse audio and text information for emotion recognition work and obtained high accuracy. Poria et al. [29] applied attention for the first time within the field of emotion recognition, where the computation of attention was used to model the relationship between different modalities to perform a fusion of different input information. ...

On-line emotion recognition in a 3-D activation-valence-time continuum using acoustic and linguistic cues
  • Citing Article
  • January 2009

... It is able to compensate for the ineffectiveness of standard RNNs in transmitting information over long time series. LSTM networks have shown outstanding performance in a wide variety of pattern recognition applications, including language translation, picture analysis, voice recognition, defect detection, and text recognition [15][16][17][18]. ...

Robust Discriminative Keyword Spotting for Emotionally Colored Spontaneous Speech using Bidirectional LSTM Networks

... D'Mello et al. (2011) introduced AutoTutor, an agent that adapts feedback based on learners' cognitive and emotional states, promoting engagement. Bevacqua et al. (2012) describe systems that select and analyze feedback based on verbal (like tone of voice and word choice) and non-verbal (such as facial expressions, gestures, and posture) cues by leveraging emotion recognition techniques and embodied conversational agents. For instance, if a learner shows signs of frustration (e.g., frowning, slumped posture, or using negative language), the system might adapt its response by offering empathetic and supportive feedback. ...

Interacting with Emotional Virtual Agents

Lecture Notes of the Institute for Computer Sciences

... Recently, the term Relations has been used to refer to the relation between the individual and the social context in terms of perceived involvement [50] and to the changes detected in a group's involvement in a multiparty interaction [51]. Thus, it is clear that understanding multiparty human communicative behavior implies an understanding of the modifications of the social structure and dynamics of small [52] and large groups (friends, colleagues, families, students, etc.) [53] and the changes in individuals' behaviors and attitudes that occur because of their membership in social and situational settings [54]. For a broader overview of how the term context has been defined, the reader is directed to [55]. ...

Temporal and situational context modeling for improved dominance recognition in meetings
  • Citing Article
  • January 2012