Florian Eyben
  • Technical University of Munich

About

195
Publications
101,530
Reads
17,957
Citations
Current institution
Technical University of Munich

Publications (195)
Article
Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated based on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. Those models are mainly evaluated in terms of correlation or recall, and always show some error...
Preprint
Full-text available
Speech Emotion Recognition (SER) needs high computational resources to overcome the challenge of substantial annotator disagreement. Today, SER is shifting towards dimensional annotations of arousal, dominance, and valence (A/D/V). Universal metrics such as the L2 distance prove unsuitable for evaluating A/D/V accuracy due to the non-converging consensus of...
Preprint
Full-text available
Uncertainty Quantification (UQ) is an important building block for the reliable use of neural networks in real-world scenarios, as it can be a useful tool in identifying faulty predictions. Speech emotion recognition (SER) models can suffer from particularly many sources of uncertainty, such as the ambiguity of emotions, Out-of-Distribution (OOD) d...
Preprint
Full-text available
We introduce two rule-based models to modify the prosody of speech synthesis in order to modulate the emotion to be expressed. The prosody modulation is based on speech synthesis markup language (SSML) and can be used with any commercial speech synthesizer. The models as well as the optimization result are evaluated against human emotion annotation...
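The SSML hook that this abstract refers to can be illustrated with a small snippet. The prosody attributes (rate, pitch, volume) are standard SSML; the concrete values below are placeholders for illustration, not the rules optimized in the paper, and the exact value formats accepted vary by synthesizer.

```python
def emotional_ssml(text: str, rate: str = "+10%", pitch: str = "+15%",
                   volume: str = "+2dB") -> str:
    """Wrap text in an SSML <prosody> tag; the values here are placeholders."""
    return (
        "<speak>"
        f'<prosody rate="{rate}" pitch="{pitch}" volume="{volume}">{text}</prosody>'
        "</speak>"
    )

# A brighter, faster rendering that many synthesizers map to higher arousal.
print(emotional_ssml("I am so glad you came!", rate="+20%", pitch="+20%"))
```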
Preprint
Full-text available
We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). C...
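As a rough illustration of such a setup (an assumption for clarity, not the authors' released model or exact head design), a pre-trained wav2vec 2.0 encoder can be combined with mean pooling and two small output heads for age regression and three-class gender classification:

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

class AgeGenderModel(nn.Module):
    """Sketch: wav2vec 2.0 backbone, mean pooling, age and gender heads."""
    def __init__(self, backbone: str = "facebook/wav2vec2-base"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(backbone)
        hidden = self.encoder.config.hidden_size
        self.age = nn.Linear(hidden, 1)     # age regression (e.g. scaled years)
        self.gender = nn.Linear(hidden, 3)  # female / male / child logits

    def forward(self, waveform: torch.Tensor):
        # waveform: (batch, samples) of 16 kHz mono audio
        states = self.encoder(waveform).last_hidden_state  # (batch, frames, hidden)
        pooled = states.mean(dim=1)                         # average pool over time
        return self.age(pooled), self.gender(pooled)
```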
Article
Full-text available
Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance,...
Preprint
Full-text available
Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user-interface to publish, mainta...
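A minimal sketch of the intended audb workflow is shown below; the dataset name and version string are illustrative, and `audb.load` resolves them against the configured repository.

```python
import audb

# List datasets published on the configured repositories.
print(audb.available())

# Load a pinned version so experiments stay reproducible.
db = audb.load("emodb", version="1.4.1", verbose=False)
print(db.tables)      # annotation tables shipped with the dataset
print(db.files[:5])   # first few audio files in the local cache
```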
Article
Full-text available
Introduction: The effective fusion of text and audio information for categorical and dimensional speech emotion recognition (SER) remains an open issue, especially given the vast potential of deep neural networks (DNNs) to provide a tighter integration of the two. Methods: In this contribution, we investigate the effectiveness of deep fusion of text...
Article
Full-text available
Quantifying neurological disorders from voice is a rapidly growing field of research and holds promise for unobtrusive and large-scale disorder monitoring. The data recording setup and data analysis pipelines are both crucial aspects to effectively obtain relevant information from participants. Therefore, we performed a systematic review to provide...
Preprint
Full-text available
Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance -- and thus, of understanding ling...
Preprint
Full-text available
Recent advances in transformer-based architectures that are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of mode...
Preprint
Full-text available
In this contribution, we investigate the effectiveness of deep fusion of text and audio features for categorical and dimensional speech emotion recognition (SER). We propose a novel, multistage fusion method where the two information streams are integrated in several layers of a deep neural network (DNN), and contrast it with a single-stage one whe...
Conference Paper
Full-text available
With the COVID-19 pandemic, several research teams have reported successful advances in automated recognition of COVID-19 by voice. Resulting voice-based screening tools for COVID-19 could support large-scale testing efforts. While capabilities of machines on this task are progressing, we approach the so far unexplored aspect of whether human raters c...
Article
Full-text available
COVID-19 is a global health crisis that has been affecting our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. Many symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice production. For the first time, the prese...
Preprint
Full-text available
COVID-19 is a global health crisis that has been affecting many aspects of our daily lives throughout the past year. The symptomatology of COVID-19 is heterogeneous with a severity continuum. A considerable proportion of symptoms are related to pathological changes in the vocal system, leading to the assumption that COVID-19 may also affect voice p...
Article
Full-text available
Perceptual audio coding is heavily and successfully applied for audio compression. However, perceptual audio coders may inject audible coding artifacts when encoding audio at low bitrates. Low-bitrate audio restoration is a challenging problem, which tries to recover a high-quality audio sample close to the uncompressed original from a low-quality...
Preprint
Full-text available
In this article, we study laughter found in child-robot interaction where it had not been prompted intentionally. Different types of laughter and speech-laugh are annotated and processed. In a descriptive part, we report on the position of laughter and speech-laugh in syntax and dialogue structure, and on communicative functions. In a second part,...
Conference Paper
Full-text available
EVA describes a new class of emotion-aware autonomous systems delivering intelligent personal assistant functionalities. EVA requires a multi-disciplinary approach, combining a number of critical building blocks into a cybernetics systems/software architecture: emotion-aware systems and algorithms, multimodal interaction design, cognitive mode...
Preprint
Full-text available
This paper describes audEERING's submissions as well as additional evaluations for the One-Minute-Gradual (OMG) emotion recognition challenge. We provide the results for audio and video processing on subject (in)dependent evaluations. On the provided Development set, we achieved 0.343 Concordance Correlation Coefficient (CCC) for arousal (from audi...
Conference Paper
To build a noise-robust online-capable laughter detector for behavioural monitoring on wearables, we incorporate context-sensitive Long Short-Term Memory Deep Neural Networks. We show our solution's improvements over a laughter detection baseline by integrating intelligent noise-robust voice activity detection (VAD) into the same model. To this end...
Article
Full-text available
In this article, we review the INTERSPEECH 2013 Computational Paralinguistics ChallengE (ComParE) – the first of its kind – in light of the recent developments in affective and behavioural computing. The impact of the first ComParE instalment is manifold: first, it featured various new recognition tasks including social signals such as laughter and...
Chapter
Full-text available
We describe a data collection for vocal expression of ironic utterances and anger based on an Android app that was specifically developed for this study. The main aim of the investigation is to find evidence for a non-verbal expression of irony. A data set of 937 utterances was collected and labeled by six listeners for irony and anger. The automat...
Conference Paper
In this work, we present a new view on automatic speaker diarisation, i.e., assessing "who speaks when", based on the recognition of speaker traits such as age, gender, voice likability, and personality. Traditionally, speaker diarisation is accomplished using low-level audio descriptors (e.g., cepstral or spectral features), neglecting the fact th...
Article
Full-text available
There has been little research on the acoustic correlates of emotional expression in the singing voice. In this study, two pertinent questions are addressed: How does a singer's emotional interpretation of a musical piece affect acoustic parameters in the sung vocalizations? Are these patterns specific enough to allow statistical discrimination of...
Chapter
This chapter gives an overview of the methods for speech and music analysis implemented by the author in the openSMILE toolkit. The methods described include all the relevant processing steps from an audio signal to a classification result. These steps include pre-processing and segmentation of the input, feature extraction (i.e., computation of a...
Chapter
This chapter summarises the methods presented for automatic speech and music analysis and the results obtained for speech emotion analytics and music genre identification with the openSMILE toolkit developed by the author. Further, it discusses whether and how the aims defined upfront were achieved, and outlines open issues for future work...
Chapter
With rapidly growing interest in, and market value of, social signal and media analysis, a large demand has been created for robust technology that works in adverse situations with poor audio quality and high levels of background noise and reverberation. Application areas include, e.g., interactive speech systems on mobile devices, multi-moda...
Chapter
A central aim of this thesis was to define standard acoustic feature sets for both speech and music, which contain a large and comprehensive set of acoustic descriptors. Based on previous efforts to combine features and the author's experience from evaluations across several databases and tasks, 12 standard acoustic parameter sets have been proposed...
Chapter
The features and the modelling methods used in this thesis were selected with on-line processing in mind; however, most of them are general methods suitable for both on-line and off-line processing. This section deals specifically with the issues encountered in on-line (aka incremental) processing, such as segmentation...
Chapter
The baseline acoustic feature sets and the methods for robust and incremental audio analysis have been evaluated extensively by the author of this thesis. In this chapter, first, a set of 12 affective speech databases and two music style data-sets is introduced, which are used for a systematic evaluation of the proposed methods and baseline acousti...
Book
This book reports on an outstanding thesis that has significantly advanced the state-of-the-art in the automated analysis and classification of speech and music. It defines several standard acoustic parameter sets and describes their implementation in a novel, open-source, audio analysis framework called openSMILE, which has been accepted and inten...
Conference Paper
Full-text available
We introduce iHEARu-PLAY, a web-based multi-player game for crowdsourced database collection and – most important – labelling. Existing databases (with speech and video content) can be added to the game and labelling tasks can be defined via a web-interface. The primary purpose of iHEARu-PLAY is multi-label, holistic annotation of multi-modal affec...
Conference Paper
Full-text available
Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel approach based on non-linear predictive denoising autoencoders. In our approach, auditory spectral features of the next short-term frame are predicted from th...
Article
Full-text available
We investigate the automatic recognition of emotions in the singing voice and study the worth and role of a variety of relevant acoustic parameters. The data set contains phrases and vocalises sung by eight renowned professional opera singers in ten different emotions and a neutral state. The states are mapped to ternary arousal and valence labels....
Article
Full-text available
Acoustic novelty detection aims at identifying abnormal/novel acoustic signals which differ from the reference/normal data that the system was trained with. In this paper we present a novel unsupervised approach based on a denoising autoencoder. In our approach auditory spectral features are processed by a denoising autoencoder with bidirectional L...
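The core idea, scoring each frame by how badly a model trained only on normal material reconstructs it, can be sketched as follows; the tiny architecture and the threshold choice here are simplified assumptions, not the published bidirectional LSTM model.

```python
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    """Bottleneck autoencoder over per-frame spectral features."""
    def __init__(self, n_bins: int = 64, n_hidden: int = 16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(n_bins, n_hidden), nn.Tanh())
        self.dec = nn.Linear(n_hidden, n_bins)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.dec(self.enc(x))

def novelty_scores(model: nn.Module, frames: torch.Tensor) -> torch.Tensor:
    """Per-frame mean squared reconstruction error, used as the novelty score."""
    with torch.no_grad():
        recon = model(frames)
    return ((frames - recon) ** 2).mean(dim=1)

# After training on normal data only, a threshold can be set, e.g. at the
# 99th percentile of training-set scores; frames scoring above it are flagged.
```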
Article
We introduce the openSMILE feature extraction toolkit, which unites feature extraction algorithms from the speech processing and the Music Information Retrieval communities. Audio low-level descriptors such as CHROMA and CENS features, loudness, Mel-frequency cepstral coefficients, perceptual linear predictive cepstral coefficients, linear predicti...
Article
Full-text available
Work on voice sciences over recent decades has led to a proliferation of acoustic parameters that are used quite selectively and are not always extracted in a similar fashion. With many independent teams working in different research areas, shared standards become an essential safeguard to ensure compliance with state-of-the-art methods allowing ap...
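Both the openSMILE toolkit and the standardized parameter sets described above are accessible from Python via the opensmile package; a minimal sketch for extracting eGeMAPS functionals from a file (the file name is a placeholder):

```python
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,       # extended Geneva minimalistic set
    feature_level=opensmile.FeatureLevel.Functionals,  # one 88-dimensional vector per file
)
features = smile.process_file("speech.wav")  # pandas DataFrame of functionals
print(features.shape)
```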
Article
Full-text available
Transcription of broadcast news is an interesting and challenging application for large-vocabulary continuous speech recognition (LVCSR). We present in detail the structure of a manually segmented and annotated corpus including over 160 hours of German broadcast news, and propose it as an evaluation framework of LVCSR systems. We show our own exper...
Conference Paper
Full-text available
The Audio/Visual Mapping Personality Challenge and Workshop (MAPTRAITS) is a competition event that is organised to facilitate the development of signal processing and machine learning techniques for the automatic analysis of personality traits and social dimensions. MAPTRAITS includes two sub-challenges, the continuous space-time sub-challenge and...
Conference Paper
Full-text available
In this paper, we investigate the relevance of using voice and lip activity to improve performance of audiovisual emotion recognition in unconstrained settings, as part of the 2014 Emotion Recognition in the Wild Challenge (EmotiW14). Indeed, the dataset provided by the organisers contains movie excerpts with highly challenging variability in terms...
Conference Paper
The Audio/Visual Mapping Personality Challenge and Workshop (MAPTRAITS) is a competition event aimed at the comparison of signal processing and machine learning methods for automatic visual, vocal and/or audio-visual analysis of personality traits and social dimensions, namely, extroversion, agreeableness, conscientiousness, neuroticism, openness,...
Article
Full-text available
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence, arousal and dominance. In addition to structured self-report questionnaires, psychologists and psychiatrists use in their evalua...
Conference Paper
Full-text available
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model, th...
Conference Paper
Full-text available
Music as a form of art is intentionally composed to be emotionally expressive. The emotional features of music are invaluable for music indexing and recommendation. In this paper we present a cross-comparison of automatic emotional analysis of music. We created a public dataset of Creative Commons licensed songs. Using valence and arousal model,...
Article
Full-text available
Automatic emotion recognition systems based on supervised machine learning require reliable annotation of affective behaviours to build useful models. Whereas the dimensional approach is getting more and more popular for rating affective behaviours in continuous time domains, e.g., arousal and valence, methodologies to take into account reaction la...
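One simple way to picture reaction-lag compensation (an illustrative sketch, not the paper's exact methodology) is to search for the time shift that maximizes the correlation between a reference signal and the delayed annotation, then shift the annotation accordingly:

```python
import numpy as np

def estimate_lag(reference: np.ndarray, annotation: np.ndarray, max_lag: int) -> int:
    """Return the shift (in frames) that best aligns the annotation to the reference."""
    best_lag, best_corr = 0, -np.inf
    for lag in range(max_lag + 1):
        shifted = annotation[lag:]
        ref = reference[: len(shifted)]
        corr = np.corrcoef(ref, shifted)[0, 1]
        if corr > best_corr:
            best_lag, best_corr = lag, corr
    return best_lag

# Simulated example: an annotation that trails the reference by 40 frames.
rng = np.random.default_rng(0)
ref = rng.standard_normal(1000)
ann = np.concatenate([np.zeros(40), ref])[:1000]
print(estimate_lag(ref, ann, max_lag=100))  # prints 40
```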
Conference Paper
Full-text available
The INTERSPEECH 2014 Computational Paralinguistics Challenge provides for the first time a unified test-bed for the automatic recognition of speakers' cognitive and physical load in speech. In this paper, we describe these two Sub-Challenges, their conditions, baseline results and experimental procedures, as well as the COMPARE baseline features ge...
Article
With the availability of speech data obtained from different devices and varied acquisition conditions, we are often faced with scenarios, where the intrinsic discrepancy between the training and the test data has an adverse impact on affective speech analysis. To address this issue, this letter introduces an Adaptive Denoising Autoencoder based on...
Article
The INTERSPEECH 2012 Speaker Trait Challenge aimed at a unified test-bed for perceived speaker traits – the first challenge of this kind: personality in the five OCEAN personality dimensions, likability of speakers, and intelligibility of pathologic speakers. In the present article, we give a brief overview of the state-of-the-art in these three fi...
Article
Full-text available
This paper concerns the exploitation of multi-resolution time-frequency features via Wavelet Packet Transform to improve audio onset detection. In our approach, Wavelet Packet Energy Coefficients (WPEC) and Auditory Spectral Features (ASF) are processed by Bidirectional Long Short-Term Memory (BLSTM) recurrent neural network that yields the onsets...
Article
Full-text available
A plethora of different onset detection methods has been proposed in recent years. However, few attempts have been made at widely applicable approaches that achieve superior performance across different types of music with considerable temporal precision. In this paper, we present a multi-resolution approach based on dis...
Conference Paper
Full-text available
In this study we make use of Canonical Correlation Analysis (CCA) based feature selection for continuous depression recognition from speech. Besides its common use in multi-modal/multi-view feature extraction, CCA can be easily employed as a feature selector. We introduce several novel ways of CCA based filter (ranking) methods, showing their rel...
Conference Paper
Full-text available
In this paper we propose the use of Long Short-Term Memory recurrent neural networks for speech enhancement. Networks are trained to predict clean speech as well as noise features from noisy speech features, and a magnitude domain soft mask is constructed from these features. Extensive tests are run on 73 k noisy and reverberated utterances from th...
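The mask construction step mentioned in this abstract can be sketched as follows; the Wiener-like form of the mask is an assumption for illustration, and the arrays stand in for network estimates of the clean-speech and noise magnitudes:

```python
import numpy as np

def soft_mask(clean_est: np.ndarray, noise_est: np.ndarray,
              eps: float = 1e-8) -> np.ndarray:
    """Magnitude-domain mask in [0, 1]; inputs are (frames, bins) estimates."""
    return clean_est / (clean_est + noise_est + eps)

def enhance(noisy_mag: np.ndarray, clean_est: np.ndarray,
            noise_est: np.ndarray) -> np.ndarray:
    """Apply the mask to the noisy magnitude spectrogram."""
    return soft_mask(clean_est, noise_est) * noisy_mag
```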
Conference Paper
Over-smoothing is one of the major sources of quality degradation in statistical parametric speech synthesis. Many methods have been proposed to compensate for over-smoothing, with the speech parameter generation algorithm considering Global Variance (GV) being one of the most successful. This paper models over-smoothing as a radial relocation of poles...
Conference Paper
Full-text available
This paper proposes a novel machine learning approach for the task of on-line continuous-time music mood regression, i.e., low-latency prediction of the time-varying arousal and valence in musical pieces. On the front-end, a large set of segmental acoustic features is extracted to model short-term variations. Then, multi-variate regression is perfo...
Article
Full-text available
In the emerging field of computational paralinguistics, most research efforts are devoted to either short-term speaker states such as emotions, or long-term traits such as personality, gender, or age. To bridge this gap on the time axis, and hence broaden the scope of the field, the INTERSPEECH 2011 Speaker State Challenge addressed the algorithmic...
Article
Full-text available
Without doubt general video and sound, as found in large multimedia archives, carry emotional information. Thus, audio and video retrieval by certain emotional categories or dimensions could play a central role for tomorrow's intelligent systems, enabling search for movies with a particular mood, computer aided scene and sound design in order to el...
Conference Paper
Full-text available
An important aspect in short dialogues is attention, as manifested by eye-contact between subjects. In this study we provide a first analysis of whether such visual attention is evident in the acoustic properties of a speaker's voice. We thereby introduce the multi-modal GRAS2 corpus, which was recorded for analysing attention in human-to-human inte...
Conference Paper
Full-text available
Mood disorders are inherently related to emotion. In particular, the behaviour of people suffering from mood disorders such as unipolar depression shows a strong temporal correlation with the affective dimensions valence and arousal. In addition, psychologists and psychiatrists take the observation of expressive facial and vocal cues into account w...
Conference Paper
We present recent developments in the openSMILE feature extraction toolkit. Version 2.0 now unites feature extraction paradigms from speech, music, and general sound events with basic video features for multi-modal processing. Descriptors from audio and video can be processed jointly in a single framework allowing for time synchronization of parame...
Conference Paper
We show that high pulse/low pulse, heart rate and skin conductance recognition can reach good accuracies using classification on a large group of 4k audio features extracted from sustained vowels and breathing periods. A database containing audio, heart rate and skin conductance recordings from 19 subjects is established for evaluation of audio-bas...
Conference Paper
Full-text available
Overlapping speech is still a major cause of error in many speech processing applications, currently without any satisfactory solution. This paper considers the problem of detecting segments of overlapping speech within meeting recordings. Using an HMM-based framework, recordings are segmented into intervals containing non-speech, speech and overlap...
Conference Paper
Full-text available
The INTERSPEECH 2013 Computational Paralinguistics Challenge provides for the first time a unified test-bed for Social Signals such as laughter in speech. It further introduces conflict in group discussions as a new task and deals with autism and its manifestations in speech. Finally, emotion is revisited as task, albeit with a broader range of ove...
Conference Paper
Full-text available
Recently, the automatic analysis of likability of a voice has become popular. This work follows up on our original work in this field and provides an in-depth discussion of the matter and an analysis of the acoustic parameters. We investigate the automatic analysis of voice likability in a continuous label space with neural networks as regressors a...
Conference Paper
A novel, data-driven approach to voice activity detection is presented. The approach is based on Long Short-Term Memory Recurrent Neural Networks trained on standard RASTA-PLP frontend features. To approximate real-life scenarios, large amounts of noisy speech instances are mixed by using both read and spontaneous speech from the TIMIT and Buckeye...
Article
Full-text available
Without doubt, there is emotional information in almost any kind of sound received by humans every day: be it the affective state of a person transmitted by means of speech; the emotion intended by a composer while writing a musical piece, or conveyed by a musician while performing it; or the affective state connected to an acoustic event occurring...
