Felix Weninger
  • Senior Researcher at Nuance Communications

About

102 Publications
44,475 Reads
7,276 Citations
Current institution
Nuance Communications
Current position
  • Senior Researcher
Additional affiliations
January 2010 - December 2014
Technical University of Munich
Position
  • PhD Student

Publications

Publications (102)
Article
Full-text available
A 2-year-old has heard approximately 1,000 h of speech; at the age of ten, around ten thousand. Similarly, automatic speech recognisers are often trained on data of these dimensions. In stark contrast, however, only a few databases for training a speaker analysis system contain more than 10 h of speech, and hardly ever more than 100 h. Yet, these systems...
Chapter
Methods for single-channel source separation can be roughly grouped into two categories: clustering and classification/regression. Clustering algorithms are based on grouping similar time-frequency bins. These include computational auditory scene analysis approaches, which rely on psychoacoustic cues, and spectral clustering based approaches, in...
Article
Full-text available
In this article, we review the INTERSPEECH 2013 Computational Paralinguistics ChallengE (ComParE) – the first of its kind – in light of the recent developments in affective and behavioural computing. The impact of the first ComParE instalment is manifold: first, it featured various new recognition tasks including social signals such as laughter and...
Conference Paper
In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact th...
Conference Paper
In this work, we present an in-depth analysis of the interdependency between the non-native prosody and the native language (L1) of English L2 speakers, as separately investigated in the Degree of Nativeness Task and the Native Language Task of the INTERSPEECH 2015 and 2016 Computational Paralinguistics ChallengE (ComParE). To this end, we propose...
Article
In this article, an approach is presented to predict the route and stopping intent of human-driven vehicles at urban intersections using a selection of distinctive features observed on the vehicle state (position, heading, acceleration, velocity). For potential future advanced driver assistance systems, this can facilitate the situation analysis an...
Article
Full-text available
We propose a new recognition task in the area of computational paralinguistics: automatic recognition of eating conditions in speech, i.e., whether people are eating while speaking, and what they are eating. To this end, we introduce the audio-visual iHEARu-EAT database featuring 1.6 k utterances of 30 subjects (mean age: 26.1 years, standard devi...
Chapter
Automatic human affect recognition aims to automatically predict affect-related information from humans observed for a certain time span. Such computer assessment of human emotion is described for audio-based methods. This includes first acoustic analysis with suited features and segmentation of the speech signal. Then follows linguistic analysis i...
Conference Paper
Full-text available
We evaluate some recent developments in recurrent neural network (RNN) based speech enhancement in the light of noise-robust automatic speech recognition (ASR). The proposed framework is based on Long Short-Term Memory (LSTM) RNNs which are discriminatively trained according to an optimal speech reconstruction objective. We demonstrate that LSTM sp...
Conference Paper
Full-text available
Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel approach based on non-linear predictive denoising autoencoders. In our approach, auditory spectral features of the next short-term frame are predicted from th...
Chapter
This chapter provides an overview over recent developments in naturalistic emotion recognition based on acoustic and linguistic cues. It discusses a variety of use-cases where emotion recognition can improve quality of service and quality of life. The chapter describes the existing corpora of emotional speech data relating to such scenarios, the un...
Article
Full-text available
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus including over 160 hours of German broadcast news, and propose it as an evaluation framework of LVCSR systems. We show our own exper...
Conference Paper
Full-text available
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model, th...
Article
Full-text available
Model-based methods and deep neural networks have both been tremendously successful paradigms in machine learning. In model-based methods, we can easily express our problem domain knowledge in the constraints of the model at the expense of difficulties during inference. Deterministic deep neural networks are constructed in such a way that inference...
Article
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fi...
Article
This article investigates speech feature enhancement based on deep bidirectional recurrent neural networks. The Long Short-Term Memory (LSTM) architecture is used to exploit a self-learnt amount of temporal context in learning the correspondences of noisy and reverberant with undistorted speech features. The resulting networks are applied to featur...
Article
In this article we address the problem of distant speech recognition in reverberant noisy environments. Speech enhancement methods, e.g., using non-negative matrix factorization (NMF), are successful in improving the robustness of ASR systems. Furthermore, discriminative training and feature transformations are employed to increase the robustness...
Conference Paper
Full-text available
This paper describes our joint efforts to provide robust automatic speech recognition (ASR) for reverberated environments, such as in hands-free human-machine interaction. We investigate blind feature space de-reverberation and deep recurrent de-noising auto-encoders (DAE) in an early fusion scheme. Results on the 2014 REVERB Challenge development...
Conference Paper
Full-text available
In this paper we propose the use of Long Short-Term Memory recurrent neural networks for speech enhancement. Networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features. Extensive tests are run on 73 k noisy and reverberated utterances from th...
Conference Paper
Full-text available
This paper proposes a novel machine learning approach for the task of on-line continuous-time music mood regression, i.e., low-latency prediction of the time-varying arousal and valence in musical pieces. On the front-end, a large set of segmental acoustic features is extracted to model short-term variations. Then, multi-variate regression is perfo...
Article
Full-text available
In the emerging field of computational paralinguistics, most research efforts are devoted to either short-term speaker states such as emotions, or long-term traits such as personality, gender, or age. To bridge this gap on the time axis, and hence broaden the scope of the field, the INTERSPEECH 2011 Speaker State Challenge addressed the algorithmic...
Article
We present a comprehensive evaluation of the influence of “harmonic” and rhythmic sections contained in an audio file on automatic music genre classification. The study is performed using the ISMIS database composed of music files, which are represented by vectors of acoustic parameters describing low-level music features. Non-negative Matrix Facto...
Article
Full-text available
Without doubt, general video and sound, as found in large multimedia archives, carry emotional information. Thus, audio and video retrieval by certain emotional categories or dimensions could play a central role for tomorrow's intelligent systems, enabling search for movies with a particular mood, computer-aided scene and sound design in order to el...
Conference Paper
Full-text available
An important aspect in short dialogues is attention as is manifested by eye-contact between subjects. In this study we provide a first analysis whether such visual attention is evident in the acoustic properties of a speaker's voice. We thereby introduce the multi-modal GRAS2 corpus, which was recorded for analysing attention in human-to-human inte...
Conference Paper
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parame...
Conference Paper
We introduce a novel method for the transcription of polyphonic piano music by discriminative training of support vector machines (SVMs). As features, we use pitch activations computed by supervised non-negative matrix factorization from low-level spectral features. Different approaches to low-level feature extraction, NMF dictionary learning and a...
Conference Paper
We present a multi-modal approach to speaker characterization using acoustic, visual and linguistic features. Full realism is provided by evaluation on a database of real-life web videos and automatic feature extraction including face and eye detection, and automatic speech recognition. Different segmentations are evaluated for the audio and video...
Conference Paper
In this work, we study the usefulness of several types of sparsity penalties in the task of speech separation using supervised and semi-supervised Nonnegative Matrix Factorization (NMF). We compare different criteria from the literature to two novel penalty functions based on Wiener Entropy, in a large-scale evaluation on spontaneous speech overlai...
Conference Paper
We present a novel method to integrate noise estimates by unsupervised speech enhancement algorithms into a semi-supervised non-negative matrix factorization framework. A multiplicative update algorithm is derived to estimate a non-negative noise dictionary given a time-varying background noise estimate with a stationarity constraint. A large-scale...
Conference Paper
Full-text available
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of ove...
Conference Paper
Full-text available
Recently, the automatic analysis of likability of a voice has become popular. This work follows up on our original work in this field and provides an in-depth discussion of the matter and an analysis of the acoustic parameters. We investigate the automatic analysis of voice likability in a continuous label space with neural networks as regressors a...
Article
Full-text available
Serious gaming guides targeted behavior change to improve behaviors in everyday living. This survey of the field focuses on two case studies: one game aims to improve the social behavior of autistic children, and the other helps migrants interact with locals.
Conference Paper
Full-text available
We present our joint contribution to the 2nd CHiME Speech Separation and Recognition Challenge. Our system combines speech enhancement by supervised sparse non-negative matrix factorisation (NMF) with a multi-stream speech recognition system. In addition to a conventional MFCC HMM recogniser, predictions by a bidirectional Long Short-Term Memory re...
Conference Paper
A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye...
Article
Full-text available
Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring...
Article
This article proposes and evaluates various methods to integrate the concept of bidirectional Long Short-Term Memory (BLSTM) temporal context modeling into a system for automatic speech recognition (ASR) in noisy and reverberated environments. Building on recent advances in Long Short-Term Memory architectures for ASR, we design a novel front-end f...
Article
Full-text available
This work focuses on automatically analyzing a speaker's sentiment in online videos containing movie reviews. In addition to textual information, this approach considers adding audio features as typically used in speech-based emotion recognition as well as video features encoding valuable valence information conveyed by the speaker. Experimental re...
Conference Paper
Full-text available
The recognition of spontaneous speech in highly variable noise is known to be a challenge, especially at low signal-to-noise ratios (SNR). In this paper, we investigate the effect of applying bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks for speech feature enhancement in noisy conditions. BLSTM networks tend to prevail over...
Article
Full-text available
We describe the implementation of monaural audio source separation algorithms in our toolkit openBliSSART (Blind Source Separation for Audio Recognition Tasks). To our knowledge, it provides the first freely available C++ implementation of Non-Negative Matrix Factorization (NMF) supporting the Compute Unified Device Architecture (CUDA) for fast pa...
Article
Full-text available
We introduce the automatic determination of leadership emergence by acoustic and linguistic features in on-line speeches. Full realism is provided by the varying and challenging acoustic conditions of the presented YouTube corpus of on-line available speeches labeled by ten raters and by processing that includes Long Short-Term Memory based robust...
Conference Paper
Full-text available
The INTERSPEECH 2012 Speaker Trait Challenge provides for the first time a unified test-bed for 'perceived' speaker traits: Personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In this paper, we describe these three Sub-Challenges, Challenge conditions, baselines, and a new featur...
Article
Full-text available
Recognizing speakers in emotional conditions remains a challenging issue, since speaker states such as emotion affect the acoustic parameters used in typical speaker recognition systems. Thus, it is believed that knowledge of the current speaker emotion can improve speaker recognition in real life conditions. Conversely, speech emotion recognition...
Conference Paper
Full-text available
We address the fully automatic recognition of intoxication, sleepiness, age and gender from speech in medium-term observation intervals of up to several minutes. The nature of these speaker states and traits as being medium-term or long-term, as opposed to short-term states such as emotion, makes it possible to collect cumulative evidence in the fo...
Article
Full-text available
This paper introduces an approach for performing distributed speech emotion recognition in a client-server architecture. In this architecture, the client side deals only with feature extraction, compression and bit-stream formatting, while the server side performs bit-stream decoding, feature decompression and emotion recognition, which requires mo...
Conference Paper
Full-text available
This paper proposes a multi-stream speech recognition system that combines information from three complementary analysis methods in order to improve automatic speech recognition in highly noisy and reverberant environments, as featured in the 2011 PASCAL CHiME Challenge. We integrate word predictions by a bidirectional Long Short-Term Memory recurr...
Conference Paper
Full-text available
In this paper, we present an on-line semi-supervised algorithm for real-time separation of speech and background noise. The proposed system is based on Nonnegative Matrix Factorization (NMF), where fixed speech bases are learned from training data whereas the noise components are estimated in real-time on the recent past. Experiments with spontaneo...
Conference Paper
Full-text available
In this paper, we propose a semi-supervised algorithm based on sparse non-negative matrix factorization (NMF) to improve separation of speech from background music in monaural signals. In our approach, fixed speech basis vectors are obtained from training data whereas music bases are estimated on-the-fly to cope with spectral variability while pres...
Conference Paper
Full-text available
We address the robustness of features for fully automatic recognition of vibrato, which is usually defined as a periodic oscillation of the pitch (F0) of the singing voice, in recorded polyphonic music. Using an evaluation database covering jazz, pop and opera music, we show that the extraction of pitch is challenging in the presence of instrumenta...
Conference Paper
Full-text available
Without a doubt there is emotion in sound. So far, however, research efforts have focused on emotion in speech and music despite many applications in emotion-sensitive sound retrieval. This paper is an attempt at automatic emotion recognition of general sounds. We selected sound clips from different areas of the daily human environment and model th...
Chapter
Full-text available
The field of computational paralinguistics is currently emerging from loosely connected research on speaker states, traits, and vocal behaviour. Starting from a broad perspective on the state-of-the-art in this field, we combine these facts with a bit of ‘tea leaf reading’ to identify ten currently dominant trends that might also characterise the n...
Conference Paper
In-car intoxication detection from speech is a highly promising non-intrusive method to reduce the accident risk associated with drunk driving. However, in-car noise significantly influences the recognition performance and needs to be addressed in practical applications. In this paper, we investigate how seriously the intrinsic in-car noise and bac...
Conference Paper
We address the learning of noise bases in a monaural speaker-independent speech enhancement framework based on non-negative matrix factorization. Bases are estimated from training data in batch processing by means of hierarchical and non-hierarchical sparse coding, or determined during the speech enhancement process based on the divergence of the o...
Conference Paper
Full-text available
The recognition of human emotions from spontaneous and non-prototypical real-life data is currently one of the most challenging tasks in the field of affective computing. This contribution presents our recent advances in assessing dimensional representations of emotion, such as arousal, expectation, power, and valence, in an audiovisual humancomput...
Conference Paper
Full-text available
One of the ever-present bottlenecks in Automatic Emotion Recognition is data sparseness. We therefore investigate the suitability of unsupervised learning in cross-corpus acoustic emotion recognition through a large-scale study with six commonly used databases, including acted and natural emotion speech, and covering a variety of application scenar...
Conference Paper
Full-text available
We present a study on the effect of reverberation on acoustic-linguistic recognition of non-prototypical emotions during child-robot interaction. Investigating the well-defined Interspeech 2009 Emotion Challenge task of recognizing negative emotions in children's speech, we focus on the impact of artificial and real reverberation conditions on the q...
Conference Paper
Full-text available
We present an extensive study on the performance of data agglomeration and decision-level fusion for robust cross-corpus emotion recognition. We compare joint training with multiple databases and late fusion of classifiers trained on single databases, employing six frequently used corpora of natural or elicited emotion, namely ABC, AVIC, DES, eNTER...
Article
Automatic detection of a speaker’s level of interest is of high relevance for many applications, such as automatic customer care, tutoring systems, or affective agents. However, as the latest Interspeech 2010 Paralinguistic Challenge has shown, reliable estimation of non-prototypical natural interest in spontaneous conversations independent of the...
Conference Paper
Full-text available
We present a study on purely data-based recognition of animal sounds, performing evaluation on a real-world database obtained from the Humboldt-University Animal Sound Archive. As we avoid a preselection of friendly cases, the challenge for the classifiers is to discriminate between species regardless of the age or stance of the animal. We define c...
Conference Paper
Full-text available
We describe and evaluate our toolkit openBliSSART (open-source Blind Source Separation for Audio Recognition Tasks), which is the C++ framework and toolbox that we have successfully used in a multiplicity of research on blind audio source separation and feature extraction. To our knowledge, it provides the first open-source implementation of a wide...
Conference Paper
Full-text available
Features generated by Non-Negative Matrix Factorization (NMF) have successfully been introduced into robust speech processing, including noise-robust speech recognition and detection of nonlinguistic vocalizations. In this study, we introduce a novel tandem approach by integrating likelihood features derived from NMF into Bidirectional Long Short-T...
Article
Full-text available
We present a comprehensive study on the effect of reverberation and background noise on the recognition of nonprototypical emotions from speech. We carry out our evaluation on a single, well-defined task based on the FAU Aibo Emotion Corpus consisting of spontaneous children's speech, which was used in the INTERSPEECH 2009 Emotion Challenge, the fi...
Conference Paper
Full-text available
We researched how "likable" or "pleasant" a speaker appears based on a subset of the "Agender" database, which was recently introduced at the 2010 Interspeech Paralinguistic Challenge. 32 participants rated the stimuli according to their likability on a seven-point scale. An ANOVA showed that the samples rated are significantly different, although th...
Conference Paper
Full-text available
We investigate fully automatic recognition of singer traits, i.e., gender, age, height and 'race' of the main performing artist(s) in recorded popular music. Monaural source separation techniques are combined to simultaneously enhance harmonic parts and extract the leading voice. For evaluation the UltraStar database of 581 pop music songs with 51...
Conference Paper
Full-text available
We introduce features based on Non-Negative Matrix Factorization (NMF) for discrimination of speech and non-linguistic vocalizations such as laughter or breathing, which is a crucial task in recognition of spontaneous speech. NMF has been successfully used in speech-related tasks such as de-noising and speaker separation. While existing approaches...
Conference Paper
Full-text available
We introduce a novel approach for noise-robust feature extraction in speech recognition, based on non-negative matrix factorization (NMF). While NMF has previously been used for speech denoising and speaker separation, we directly extract time-varying features from the NMF output. To this end we extend basic unsupervised NMF to a hybrid supervised/...
Conference Paper
Full-text available
We introduce the task of vocalist gender recognition in popular music and evaluate the benefit of Non-Negative Matrix Factorization based enhancement of melodic components to this aim. The underlying automatic separation of drum beats is described in detail, and the obtained significant gain by its use is verified in extensive test-runs on a novel...
Conference Paper
Full-text available
Non-Negative Matrix Factorization is well known to lead to considerable successes in the blind separation of drums and melodic parts of music recordings. Such splitting may well serve as enhancement when it comes to typical...
Article
The development of diagnostic procedures based on microarray analysis confronts the bioinformatician and the biomedical researcher with a variety of challenges. Microarrays generate a huge amount of data. There are many, not yet clearly defined, data processing steps and many clinical response variables which may not match gene expression patterns....
Article
Background: The development of diagnostic procedures based on microarray analysis confronts the bioinformatician and the biomedical researcher with a variety of challenges. Microarrays generate a huge amount of data. There are many, not yet clearly defined, data processing steps and many clinical response variables which may not match gene expressi...
Article
Microarray technology has been proposed as an addition to the methods in current use for diagnosing leukemia. Before a new technology can be used in a diagnostic setting, the method has to be shown to produce robust results. It is known that, given the technical aspects of specimen sampling and target preparation, global gene expression patterns ca...
