
Florian Eyben
Technical University of Munich
About
195 Publications · 101,530 Reads · 17,957 Citations
Publications (195)
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some error...
Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today, SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of...
Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from a particularly large number of uncertainty sources, such as the ambiguity of emotions, Out-of-Distribution (OOD) d...
We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotation...
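The prosody rules themselves are not reproduced here; as a rough illustration of the SSML mechanism the abstract refers to, the sketch below wraps text in a prosody element with hypothetical pitch, rate, and volume offsets per target emotion.

```python
# Minimal sketch (not the paper's actual rules): wrap text in an SSML
# <prosody> element so any SSML-capable synthesizer shifts pitch, rate,
# and volume towards a target emotion. The offset values are hypothetical.

EMOTION_PROSODY = {
    # emotion: (pitch offset, speaking rate, volume) -- illustrative only
    "happy":   ("+15%", "110%", "+2dB"),
    "sad":     ("-10%", "85%",  "-3dB"),
    "neutral": ("+0%",  "100%", "+0dB"),
}

def to_emotional_ssml(text: str, emotion: str = "neutral") -> str:
    """Return an SSML string with prosody modulated towards `emotion`."""
    pitch, rate, volume = EMOTION_PROSODY[emotion]
    return (
        "<speak>"
        f'<prosody pitch="{pitch}" rate="{rate}" volume="{volume}">'
        f"{text}"
        "</prosody>"
        "</speak>"
    )

print(to_emotional_ssml("The results are in.", "happy"))
```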
We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). C...
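As a hedged sketch of the general recipe (a pre-trained wav2vec 2.0 backbone with small prediction heads for age and gender), the following uses the generic public facebook/wav2vec2-base checkpoint from the transformers library; the paper's actual checkpoints, pooling, and head layout may differ.

```python
# Hedged sketch: pre-trained wav2vec 2.0 backbone plus small heads for
# age regression and gender classification (female / male / child).
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AgeGenderModel(nn.Module):
    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.backbone = Wav2Vec2Model.from_pretrained(backbone)
        hidden = self.backbone.config.hidden_size
        self.age_head = nn.Linear(hidden, 1)      # age in years (regression)
        self.gender_head = nn.Linear(hidden, 3)   # female / male / child

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) of 16 kHz mono audio
        hidden_states = self.backbone(waveform).last_hidden_state  # (B, T, H)
        pooled = hidden_states.mean(dim=1)                         # average over time
        return self.age_head(pooled).squeeze(-1), self.gender_head(pooled)

model = AgeGenderModel()
age, gender_logits = model(torch.randn(1, 16000))
```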
Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance,...
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user-interface to publish, mainta...
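A minimal usage sketch of audb, assuming its publicly documented Python API; the dataset name below is a placeholder, and omitting the version argument loads the latest published version.

```python
# Minimal audb usage sketch; "emodb" is a placeholder dataset name.
import audb

# List the datasets published on the configured repositories.
print(audb.available())

# Load a dataset into the local cache; omitting `version` fetches the
# latest published version. The result follows the audformat specification.
db = audb.load("emodb")
print(db.tables.keys())
```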
Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two. Methods: In this contribution, we investigate the effectiveness of deep fusion of text...
Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects to effectively obtain relevant information from participants. Therefore, we performed a systematic review to provide...
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal to improve automatic speech recognition performance -- and thus, to understand ling...
Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of mode...
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one whe...
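A schematic PyTorch sketch of the multistage idea described above, in which the text stream is re-injected at several layers rather than concatenated only once before the classifier; layer sizes, depth, and output dimensionality are illustrative, not the paper's configuration.

```python
# Illustrative multistage audio-text fusion; all dimensions are placeholders.
import torch
import torch.nn as nn

class MultistageFusion(nn.Module):
    def __init__(self, audio_dim=128, text_dim=300, hidden=256, n_stages=3, n_classes=4):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, hidden)
        self.text_proj = nn.Linear(text_dim, hidden)
        # one fusion layer per stage, each mixing the running representation
        # with the projected text stream again
        self.stages = nn.ModuleList(
            [nn.Linear(2 * hidden, hidden) for _ in range(n_stages)]
        )
        self.classifier = nn.Linear(hidden, n_classes)

    def forward(self, audio_feat, text_feat):
        a = torch.relu(self.audio_proj(audio_feat))
        t = torch.relu(self.text_proj(text_feat))
        fused = a
        for stage in self.stages:
            fused = torch.relu(stage(torch.cat([fused, t], dim=-1)))
        return self.classifier(fused)

logits = MultistageFusion()(torch.randn(2, 128), torch.randn(2, 300))
```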
With the COVID-19 pandemic, several research teams have reported successful advances in automated recognition of COVID-19 by voice. Resulting voice-based screening tools for COVID-19 could support large-scale testing efforts. While capabilities of machines on this task are progressing, we approach the so far unexplored aspect of whether human raters c...
COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the prese...
COVID-19 is a global health crisis that has been affecting many aspects of our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. A considerable proportion of symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice p...
Perceptual audio coding is heavily and successfully applied for audio compression. However, perceptual audio coders may inject audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem, which tries to recover a high-quality audio sample close to the uncompressed original from a low-quality...
In this article, we study laughter found in child-robot interaction where it had not been prompted intentionally. Different types of laughter and speech-laugh are annotated and processed. In a descriptive part, we report on the position of laughter and speech-laugh in syntax and dialogue structure, and on communicative functions. In a second part,...
EVA¹ describes a new class of emotion-aware autonomous systems delivering intelligent personal assistant functionalities. EVA requires a multi-disciplinary approach, combining a number of critical building blocks into a cybernetics systems/software architecture: emotion-aware systems and algorithms, multimodal interaction design, cognitive mode...
This paper describes audEERING's submissions as well as additional evaluations for the One-Minute-Gradual (OMG) emotion recognition challenge. We provide the results for audio and video processing on subject (in)dependent evaluations. On the provided Development set, we achieved 0.343 Concordance Correlation Coefficient (CCC) for arousal (from audi...
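For reference, the Concordance Correlation Coefficient (CCC) reported above can be computed from its standard definition; the snippet below is a generic implementation on toy data, not challenge code.

```python
# Lin's concordance correlation coefficient (standard definition).
import numpy as np

def ccc(x: np.ndarray, y: np.ndarray) -> float:
    """Concordance correlation coefficient between two sequences."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

gold = np.array([0.1, 0.3, 0.5, 0.4, 0.2])   # toy gold-standard arousal trace
pred = np.array([0.2, 0.25, 0.45, 0.5, 0.1]) # toy predictions
print(round(ccc(gold, pred), 3))
```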
To build a noise-robust online-capable laughter detector for behavioural monitoring on wearables, we incorporate context-sensitive Long Short-Term Memory Deep Neural Networks. We show our solution's improvements over a laughter detection baseline by integrating intelligent noise-robust voice activity detection (VAD) into the same model. To this end...
In this article, we review the INTERSPEECH 2013 Computational Paralinguistics ChallengE (ComParE) – the first of its kind – in light of the recent developments in affective and behavioural computing. The impact of the first ComParE instalment is manifold: first, it featured various new recognition tasks including social signals such as laughter and...
We describe a data collection for vocal expression of ironic utterances and anger based on an Android app that was specifically developed for this study. The main aim of the investigation is to find evidence for a non-verbal expression of irony. A data set of 937 utterances was collected and labeled by six listeners for irony and anger. The automat...
In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact th...
There has been little research on the acoustic correlates of emotional expression in the singing voice. In this study, two pertinent questions are addressed: How does a singer's emotional interpretation of a musical piece affect acoustic parameters in the sung vocalizations? Are these patterns specific enough to allow statistical discrimination of...
This chapter gives an overview of the methods for speech and music analysis implemented by the author in the openSMILE toolkit. The methods described include all the relevant processing steps from an audio signal to a classification result. These steps include pre-processing and segmentation of the input, feature extraction (i.e., computation of a...
This chapter summarises the methods presented for automatic speech and music analysis and the results obtained for speech emotion analytics and music genre identification with the openSMILE toolkit developed by the author. Further, it discusses whether and how the aims defined upfront were achieved and outlines open issues for future work...
With rapidly growing interest in and market value of social signal and media analysis, a large demand has been created for robust technology that works in adverse situations with poor audio quality and high levels of background noise and reverberation. Application areas include, e.g., interactive speech systems on mobile devices, multi-moda...
A central aim of this thesis was to define standard acoustic feature sets for both speech and music, which contain a large and comprehensive set of acoustic descriptors. Based on previous efforts to combine features and the author's experience from evaluations across several databases and tasks, 12 standard acoustic parameter sets have been proposed...
The features and the modelling methods used in this thesis have been selected with on-line processing in mind; however, most of them are general methods that are suitable both for on-line and off-line processing. This section deals specifically with the issues encountered in on-line (aka incremental) processing, such as segmentation...
The baseline acoustic feature sets and the methods for robust and incremental audio analysis have been evaluated extensively by the author of this thesis. In this chapter, first, a set of 12 affective speech databases and two music style data-sets is introduced, which are used for a systematic evaluation of the proposed methods and baseline acousti...
This book reports on an outstanding thesis that has significantly advanced the state-of-the-art in the automated analysis and classification of speech and music. It defines several standard acoustic parameter sets and describes their implementation in a novel, open-source, audio analysis framework called openSMILE, which has been accepted and inten...
We introduce iHEARu-PLAY, a web-based multi-player game for crowdsourced database collection and – most importantly – labelling. Existing databases (with speech and video content) can be added to the game and labelling tasks can be defined via a web-interface. The primary purpose of iHEARu-PLAY is multi-label, holistic annotation of multi-modal affec...
Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel approach based on non-linear predictive denoising autoencoders. In our approach, auditory spectral features of the next short-term frame are predicted from th...
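A schematic stand-in for the predictive scheme sketched above: a small network is trained only on normal material to predict the next spectral frame, and a large prediction error at test time flags a frame as novel. A plain feed-forward predictor replaces the paper's recurrent denoising autoencoder here.

```python
# Schematic novelty detection by next-frame prediction error (not the
# paper's architecture): train on "normal" frames only, score new frames
# by how badly the predictor anticipates them.
import torch
import torch.nn as nn

n_mels = 40
predictor = nn.Sequential(nn.Linear(n_mels, 64), nn.ReLU(), nn.Linear(64, n_mels))
optimizer = torch.optim.Adam(predictor.parameters(), lr=1e-3)

def train_step(frames: torch.Tensor) -> float:
    """One training step on frames (T, n_mels) taken from normal data."""
    pred = predictor(frames[:-1])                  # predict frame t+1 from frame t
    loss = nn.functional.mse_loss(pred, frames[1:])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

def novelty_score(frames: torch.Tensor) -> torch.Tensor:
    """Per-frame prediction error; higher values indicate novel events."""
    with torch.no_grad():
        return (predictor(frames[:-1]) - frames[1:]).pow(2).mean(dim=1)

normal = torch.randn(200, n_mels)                  # stand-in for normal training data
for _ in range(50):
    train_step(normal)
print(novelty_score(torch.randn(20, n_mels)))
```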
We investigate the automatic recognition of emotions in the singing voice and study the worth and role of a variety of relevant acoustic parameters. The data set contains phrases and vocalises sung by eight renowned professional opera singers in ten different emotions and a neutral state. The states are mapped to ternary arousal and valence labels....
Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel unsupervised approach based on a denoising autoencoder. In our approach auditory spectral features are processed by a denoising autoencoder with bidirectional L...
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predicti...
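A minimal usage sketch with the opensmile Python package that wraps the toolkit (the original tool also ships as a command-line extractor); the feature set and file path below are placeholders.

```python
# Minimal opensmile usage sketch; "audio.wav" is a placeholder path.
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
# One feature vector (functionals) per file, returned as a pandas DataFrame.
features = smile.process_file("audio.wav")
print(features.shape)
```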
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing ap...
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus including over 160 hours of German broadcast news, and propose it as an evaluation framework of LVCSR systems. We show our own exper...
The Audio/Visual Mapping Personality Challenge and Workshop (MAPTRAITS) is a competition event that is organised to facilitate the development of signal processing and machine learning techniques for the automatic analysis of personality traits and social dimensions. MAPTRAITS includes two sub-challenges, the continuous space-time sub-challenge and...
In this paper, we investigate the relevance of using voice and lip activity to improve performance of audiovisual emotion recognition in unconstrained settings, as part of the 2014 Emotion Recognition in the Wild Challenge (EmotiW14). Indeed, the dataset provided by the organisers contains movie excerpts with highly challenging variability in terms...
The Audio/Visual Mapping Personality Challenge and Workshop (MAPTRAITS) is a competition event aimed at the comparison of signal processing and machine learning methods for automatic visual, vocal and/or audio-visual analysis of personality traits and social dimensions, namely, extroversion, agreeableness, conscientiousness, neuroticism, openness,...
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists use in their evalua...
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model, th...
Automatic emotion recognition systems based on supervised machine learning require reliable annotation of affective behaviours to build useful models. Whereas the dimensional approach is getting more and more popular for rating affective behaviours in continuous time domains, e.g., arousal and valence, methodologies to take into account reaction la...
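As an illustration of reaction-lag compensation, the sketch below shifts an annotation trace by candidate delays and keeps the shift that best aligns it with a reference signal; plain Pearson correlation is used here as a stand-in for whichever agreement criterion a given study adopts.

```python
# Illustrative reaction-lag search on synthetic, delayed annotation data.
import numpy as np

def best_lag(annotation: np.ndarray, reference: np.ndarray, max_lag: int = 50):
    """Return the frame shift that best aligns annotation to reference."""
    best, best_corr = 0, -np.inf
    for lag in range(0, max_lag + 1):
        a = annotation[lag:]                 # assume the annotator reacts `lag` frames late
        r = reference[: len(a)]
        corr = np.corrcoef(a, r)[0, 1]
        if corr > best_corr:
            best, best_corr = lag, corr
    return best, best_corr

t = np.linspace(0, 10, 500)
reference = np.sin(t)
annotation = np.roll(reference, 30) + 0.05 * np.random.randn(500)  # delayed, noisy copy
print(best_lag(annotation, reference))      # should recover a lag near 30
```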
The INTERSPEECH 2014 Computational Paralinguistics Challenge provides for the first time a unified test-bed for the automatic recognition of speakers' cognitive and physical load in speech. In this paper, we describe these two Sub-Challenges, their conditions, baseline results and experimental procedures, as well as the COMPARE baseline features ge...
With the availability of speech data obtained from different devices and varied acquisition conditions, we are often faced with scenarios, where the intrinsic discrepancy between the training and the test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an Adaptive Denoising Autoencoder based on...
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fi...
This paper concerns the exploitation of multi-resolution time-frequency features via Wavelet Packet Transform to improve audio onset detection. In our approach, Wavelet Packet Energy Coefficients (WPEC) and Auditory Spectral Features (ASF) are processed by Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network that yields the onsets...
A plethora of different onset detection methods have been proposed in the recent years. However, few attempts have been made with respect to widely-applicable approaches in order to achieve superior performances over different types of music and with considerable temporal precision. In this paper, we present a multi-resolution approach based on dis...
In this study we make use of Canonical Correlation Analysis (CCA) based feature selection for continuous depression recognition from speech. Besides its common use in multi-modal/multi-view feature extraction, CCA can be easily employed as a feature selector. We introduce several novel ways of CCA based filter (ranking) methods, showing their rel...
In this paper we propose the use of Long Short-Term Memory recurrent neural networks for speech enhancement. Networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features. Extensive tests are run on 73 k noisy and reverberated utterances from th...
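A compact PyTorch sketch of the masking idea described above: a recurrent network predicts speech and noise magnitudes from the noisy input, and the enhanced magnitude is obtained through their ratio; network size and feature dimensionality are illustrative, not the paper's setup.

```python
# Magnitude-domain soft masking from predicted speech and noise magnitudes.
import torch
import torch.nn as nn

n_bins = 129  # illustrative number of spectral bins

class MaskEstimator(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(n_bins, hidden, batch_first=True)
        self.speech = nn.Linear(hidden, n_bins)
        self.noise = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag):                     # (batch, frames, bins)
        h, _ = self.lstm(noisy_mag)
        s = torch.relu(self.speech(h))                # predicted speech magnitude
        n = torch.relu(self.noise(h))                 # predicted noise magnitude
        mask = s / (s + n + 1e-8)                     # magnitude-domain soft mask
        return mask * noisy_mag                       # enhanced magnitude

enhanced = MaskEstimator()(torch.rand(1, 100, n_bins))
```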
Over-smoothing is one of the major sources of quality degradation in statistical parametric speech synthesis. Many methods have been proposed to compensate over-smoothing, with the speech parameter generation algorithm considering Global Variance (GV) being one of the most successful. This paper models over-smoothing as a radial relocation of poles...
This paper proposes a novel machine learning approach for the task of on-line continuous-time music mood regression, i.e., low-latency prediction of the time-varying arousal and valence in musical pieces. On the front-end, a large set of segmental acoustic features is extracted to model short-term variations. Then, multi-variate regression is perfo...
In the emerging field of computational paralinguistics, most research efforts are devoted to either short-term speaker states such as emotions, or long-term traits such as personality, gender, or age. To bridge this gap on the time axis, and hence broaden the scope of the field, the INTERSPEECH 2011 Speaker State Challenge addressed the algorithmic...
Without doubt general video and sound, as found in large multimedia archives, carry emotional information. Thus, audio and video retrieval by certain emotional categories or dimensions could play a central role for tomorrow's intelligent systems, enabling search for movies with a particular mood, computer aided scene and sound design in order to el...
An important aspect in short dialogues is attention as is manifested by eye-contact between subjects. In this study we provide a first analysis whether such visual attention is evident in the acoustic properties of a speaker's voice. We thereby introduce the multi-modal GRAS2 corpus, which was recorded for analysing attention in human-to-human inte...
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence and arousal. In addition, psychologists and psychiatrists take the observation of expressive facial and vocal cues into account w...
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parame...
We show that high pulse/low pulse, heart rate and skin conductance recognition can reach good accuracies using classification on a large group of 4k audio features extracted from sustained vowels and breathing periods. A database containing audio, heart rate and skin conductance recordings from 19 subjects is established for evaluation of audio-bas...
Overlapping speech is still a major cause of error in many speech processing applications, currently without any satisfactory solution. This paper considers the problem of detecting segments of overlapping speech within meeting recordings. Using an HMM-based framework recordings are segmented into intervals containing non-speech, speech and overlap...
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of ove...
Recently, the automatic analysis of likability of a voice has become popular. This work follows up on our original work in this field and provides an in-depth discussion of the matter and an analysis of the acoustic parameters. We investigate the automatic analysis of voice likability in a continuous label space with neural networks as regressors a...
A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye...
Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring...