Johannes Wagner’s research while affiliated with University of Augsburg and other places


Publications (94)


Figure 1: Distribution of speaker age (#samples) in the datasets for the three splits (CommonVoice age in mid-decades).
Table: Overview of the datasets: #samples and #speakers (in parentheses). https://github.com/audeering/w2v2-age-gender-how-to
Speech-based Age and Gender Prediction with Transformers
  • Preprint
  • File available

June 2023 · 481 Reads · 1 Citation

Johannes Wagner · [...]

We report on the curation of several publicly available datasets for age and gender prediction. Furthermore, we present experiments to predict age and gender with models based on a pre-trained wav2vec 2.0. Depending on the dataset, we achieve an MAE between 7.1 years and 10.8 years for age, and at least 91.1% ACC for gender (female, male, child). Compared to a modelling approach built on handcrafted features, our proposed system shows an improvement of 9% UAR for age and 4% UAR for gender. To make our findings reproducible, we release the best performing model to the community as well as the sample lists of the data splits.
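As a rough illustration of the measures reported above, MAE, ACC, and UAR can be computed from model outputs with scikit-learn, where UAR corresponds to macro-averaged recall. The arrays below are placeholders, not results from the paper:

# Sketch: evaluation measures for age (MAE) and gender (ACC, UAR) prediction.
import numpy as np
from sklearn.metrics import accuracy_score, mean_absolute_error, recall_score

# Hypothetical ground truth and predictions (illustrative only).
age_true = np.array([23, 45, 31, 60, 12])
age_pred = np.array([27, 41, 35, 55, 15])
gender_true = ["female", "male", "child", "female", "child"]
gender_pred = ["female", "male", "female", "female", "child"]

mae = mean_absolute_error(age_true, age_pred)        # mean absolute error in years
acc = accuracy_score(gender_true, gender_pred)       # overall accuracy
uar = recall_score(gender_true, gender_pred, average="macro")  # unweighted average recall

print(f"MAE: {mae:.1f} years, ACC: {acc:.1%}, UAR: {uar:.1%}")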


Fig. 1: Proposed architecture built on wav2vec 2.0 / HuBERT.
Fig. 5: Difference in CCC performance between fine-tuned (ft) and frozen (frz) models for arousal, dominance, and valence prediction on MSP-Podcast. The fine-tuned results are from Figure 2, where transformer and output layers are jointly trained. For the frozen results, we keep all transformer layers frozen and simply train the output head. The results show that fine-tuning the transformer layers is worth the computational cost it incurs.
Fig. 10: Visualisation of embeddings extracted with different models, overlaid with meta information, for a combined dataset of MSP-Podcast and IEMOCAP. We observe that the latent space of wav2vec 2.0 offers a better abstraction from domain, gender, and speaker compared to the CNN14 baseline, even without pre-training. However, only a pre-trained model is able to separate low from high valence. To reduce the dimensionality of the latent space, we applied t-SNE [63].
Fig. 11: Mean and standard deviation of development set performance on MSP-Podcast across three training runs. Compared to CNN14, w2v2-b converges earlier and shows less fluctuation.
Fig. 13: CCC scores for arousal, dominance, and valence / sentiment for w2v2-L-robust on sparse training data. The legend shows the fraction of data used for fine-tuning. Please note that steps are not linear.
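Fig. 10 above refers to projecting model embeddings to two dimensions with t-SNE. A minimal sketch of that kind of visualisation with scikit-learn and matplotlib, assuming embeddings have already been extracted into an array (all data here is illustrative):

# Sketch: 2-D t-SNE projection of speech embeddings, coloured by metadata.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(500, 768)                              # placeholder embeddings
datasets = np.random.choice(["MSP-Podcast", "IEMOCAP"], size=500)  # placeholder metadata

points = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)

for name in np.unique(datasets):
    mask = datasets == name
    plt.scatter(points[mask, 0], points[mask, 1], s=5, label=name)
plt.legend()
plt.title("t-SNE of embeddings, coloured by dataset")
plt.show()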
Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

March 2023 · 338 Reads · 302 Citations

IEEE Transactions on Pattern Analysis and Machine Intelligence

Recent advances in transformer-based architectures have shown promise in several machine learning tasks. In the audio domain, such architectures have been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Our investigations reveal that transformer-based architectures are more robust compared to a CNN-based baseline and fair with respect to gender groups, but not towards individual speakers. Finally, we show that their success on valence is based on implicit linguistic information, which explains why they perform on-par with recent multimodal approaches that explicitly utilise textual information. To make our findings reproducible, we release the best performing model to the community.
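The concordance correlation coefficient used as the evaluation measure has a simple closed form combining correlation with agreement in mean and variance; a short sketch on illustrative arrays (not actual MSP-Podcast annotations):

# Sketch: concordance correlation coefficient (CCC), the measure behind the
# reported .638 valence result.
# CCC = 2 * cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)
import numpy as np

def ccc(gold: np.ndarray, pred: np.ndarray) -> float:
    gold_mean, pred_mean = gold.mean(), pred.mean()
    covariance = np.mean((gold - gold_mean) * (pred - pred_mean))
    return 2 * covariance / (gold.var() + pred.var() + (gold_mean - pred_mean) ** 2)

gold = np.array([0.2, 0.5, 0.7, 0.3, 0.9])     # illustrative gold-standard values
pred = np.array([0.25, 0.45, 0.65, 0.4, 0.8])  # illustrative predictions
print(f"CCC: {ccc(gold, pred):.3f}")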


audb -- Sharing and Versioning of Audio and Annotation Data in Python

March 2023 · 140 Reads

Driven by the need for larger and more diverse datasets to pre-train and fine-tune increasingly complex machine learning models, the number of datasets is rapidly growing. audb is an open-source Python library that supports versioning and documentation of audio datasets. It aims to provide a standardized and simple user-interface to publish, maintain, and access the annotations and audio files of a dataset. To efficiently store the data on a server, audb automatically resolves dependencies between versions of a dataset and only uploads newly added or altered files when a new version is published. The library supports partial loading of a dataset and local caching for fast access to already downloaded data. audb is a lightweight library and can be interfaced with any machine learning library. It supports the management of datasets on a single PC, within a university or company, or within a whole research community.
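As a rough illustration of the workflow described in the abstract, loading a versioned dataset with audb might look as follows; the dataset name, version, and flavour arguments are examples and depend on which repositories are configured:

# Sketch: loading a versioned dataset with audb (name and version are examples).
import audb

print(audb.available())      # datasets published in the configured repositories

db = audb.load(
    "emodb",                 # example dataset name
    version="1.4.1",         # example version
    sampling_rate=16000,     # request a 16 kHz flavour of the audio
    mixdown=True,            # downmix to mono
)

df = db["emotion"].get()     # annotations are returned as pandas DataFrames
print(df.head())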




Probing Speech Emotion Recognition Transformers for Linguistic Knowledge

April 2022 · 247 Reads

Large, pre-trained neural networks consisting of self-attention layers (transformers) have recently achieved state-of-the-art results on several speech emotion recognition (SER) datasets. These models are typically pre-trained in a self-supervised manner with the goal of improving automatic speech recognition performance -- and thus, of understanding linguistic information. In this work, we investigate the extent to which this information is exploited during SER fine-tuning. Using a reproducible methodology based on open-source tools, we synthesise prosodically neutral speech utterances while varying the sentiment of the text. Valence predictions of the transformer model are very reactive to positive and negative sentiment content, as well as negations, but not to intensifiers or reducers, while none of those linguistic features impact arousal or dominance. These findings show that transformers can successfully leverage linguistic information to improve their valence predictions, and that linguistic analysis should be included in their testing.
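The probing protocol itself can be sketched in a few lines: synthesise prosodically neutral utterances whose text varies only in sentiment, run them through the fine-tuned model, and compare the valence predictions. The helpers synthesise and predict_valence below are hypothetical placeholders for a TTS front-end and the SER model, not functions of a specific library:

# Sketch of the probing protocol: identical (neutral) prosody, varying text sentiment.
from statistics import mean

sentences = {
    "positive": ["This is a wonderful day.", "I really love this film."],
    "negative": ["This is a terrible day.", "I really hate this film."],
    "negated":  ["This is not a wonderful day.", "I do not love this film."],
}

def probe(synthesise, predict_valence):
    # `synthesise` and `predict_valence` are hypothetical placeholders.
    results = {}
    for condition, texts in sentences.items():
        valences = [predict_valence(synthesise(text)) for text in texts]
        results[condition] = mean(valences)
    return results

# A model that exploits linguistic information should predict higher mean valence
# for the positive condition than for the negative one, with negation shifting
# predictions accordingly.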


Figure 1: Proposed architecture built on wav2vec 2.0 / HuBERT.
Dawn of the transformer era in speech emotion recognition: closing the valence gap

March 2022 · 721 Reads · 3 Citations

Recent advances in transformer-based architectures which are pre-trained in a self-supervised manner have shown great promise in several machine learning tasks. In the audio domain, such architectures have also been successfully utilised in the field of speech emotion recognition (SER). However, existing works have not evaluated the influence of model size and pre-training data on downstream performance, and have paid limited attention to generalisation, robustness, fairness, and efficiency. The present contribution conducts a thorough analysis of these aspects on several pre-trained variants of wav2vec 2.0 and HuBERT that we fine-tuned on the dimensions arousal, dominance, and valence of MSP-Podcast, while additionally using IEMOCAP and MOSI to test cross-corpus generalisation. To the best of our knowledge, we obtain the top performance for valence prediction without use of explicit linguistic information, with a concordance correlation coefficient (CCC) of .638 on MSP-Podcast. Furthermore, our investigations reveal that transformer-based architectures are more robust to small perturbations compared to a CNN-based baseline and fair with respect to biological sex groups, but not towards individual speakers. Finally, we are the first to show that their extraordinary success on valence is based on implicit linguistic information learnt during fine-tuning of the transformer layers, which explains why they perform on-par with recent multimodal approaches that explicitly utilise textual information. Our findings collectively paint the following picture: transformer-based architectures constitute the new state-of-the-art in SER, but further advances are needed to mitigate remaining robustness and individual speaker issues. To make our findings reproducible, we release the best performing model to the community.
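Fig. 5 further up compares jointly fine-tuning the transformer layers with training only the output head on frozen features. A minimal sketch of the frozen variant with Hugging Face transformers and PyTorch, using an illustrative checkpoint and a simple mean-pooled regression head for arousal, dominance, and valence (an assumption about the setup, not the authors' released code):

# Sketch: frozen wav2vec 2.0 backbone + trainable three-dimensional regression head.
import torch
from transformers import Wav2Vec2Model

backbone = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")  # example checkpoint
for param in backbone.parameters():
    param.requires_grad = False          # freeze CNN and transformer layers

head = torch.nn.Linear(backbone.config.hidden_size, 3)  # arousal, dominance, valence

def predict(waveform: torch.Tensor) -> torch.Tensor:
    # waveform: (batch, samples) at 16 kHz
    with torch.no_grad():
        hidden = backbone(waveform).last_hidden_state   # (batch, frames, hidden)
    pooled = hidden.mean(dim=1)                         # average pooling over time
    return head(pooled)                                 # (batch, 3)

# In this frozen variant only head.parameters() would be optimised; the fine-tuned
# variant additionally unfreezes the transformer layers.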


eXplainable Cooperative Machine Learning with NOVA

January 2020 · 378 Reads · 33 Citations

KI - Künstliche Intelligenz

In the following article, we introduce a novel workflow, which we subsume under the term “explainable cooperative machine learning”, and show its practical application in a data annotation and model training tool called NOVA. The main idea of our approach is to interactively incorporate the ‘human in the loop’ when training classification models from annotated data. In particular, NOVA offers a collaborative annotation backend where multiple annotators can join forces. A main aspect is the possibility of applying semi-supervised active learning techniques already during the annotation process: data can be pre-labelled automatically, which drastically accelerates annotation. Furthermore, the user interface implements recent eXplainable AI techniques to provide users with both a confidence value for the automatically predicted annotations and a visual explanation. We show in a use-case evaluation that our workflow is able to speed up the annotation process, and further argue that the additional visual explanations help annotators understand the decision-making process as well as the trustworthiness of their trained machine learning models.
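The pre-labelling idea, accepting automatic predictions only when the model is sufficiently confident and routing the rest back to human annotators, can be illustrated with a generic classifier; the following is a schematic sketch, not NOVA's implementation:

# Sketch: confidence-gated pre-labelling in a cooperative machine learning loop.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_labelled = rng.normal(size=(100, 10))        # features annotated so far
y_labelled = rng.integers(0, 2, size=100)      # their labels
X_unlabelled = rng.normal(size=(50, 10))       # remaining, unannotated data

model = LogisticRegression().fit(X_labelled, y_labelled)

proba = model.predict_proba(X_unlabelled)      # per-class confidence values
confidence = proba.max(axis=1)
threshold = 0.9

pre_labels = proba.argmax(axis=1)[confidence >= threshold]  # shown to annotators as pre-labels
needs_review = np.where(confidence < threshold)[0]          # left for manual annotation
print(f"pre-labelled: {len(pre_labels)}, for manual annotation: {len(needs_review)}")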




Citations (79)


... Fine-tuned wav2vec has proven efficient across various speech recognition tasks and languages [14]. The recent application of the transformer-based wav2vec 2.0 showcased its utility in developing speech-based age and gender prediction models, including cross-corpus evaluation, with significant improvements in recall compared to a classic modeling approach based on hand-crafted features [47]. Additionally, wav2vec 2.0 representations of speech were found to be more effective in distinguishing between PD and HC subjects compared to language representations, including word-embedding models [48]. ...

Reference:

Analyzing wav2vec embedding in Parkinson’s disease speech: A study on cross-database classification and regression tasks
Speech-based Age and Gender Prediction with Transformers

... However, most multimodal SER designs focus on simple structures such as bimodal or trimodal architectures. Moreover, the invention of the Transformer sheds new light on multimodal SER [4,5]. This sequence-to-sequence architecture outperforms the traditional Long Short-Term Memory (LSTM), with a faster forward pass and better Graphics Processing Unit (GPU) compatibility, offering the potential for more complex model engineering and feature processing. ...

Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Many approaches have been presented to determine one's cognitive load, including the use of speech [20], vision [5], and bio-signals such as EEG [3]. This paper further explores the use of EEG to detect cognitive load. ...

Quantifying Cognitive Load from Voice using Transformer-Based Models and a Cross-Dataset Evaluation
  • Citing Conference Paper
  • December 2022

... Kinect™ motion capture technology was used to capture facial features, gaze direction and depth information (Figure 8.1). Synchronisation was achieved using the Social Signal Interpretation framework (SSI) (Wagner et al. [169]). The participants took turns at telling personal stories which they associated with an enjoyable emotion. ...

Using phonetic patterns for detecting social cues in natural conversations
  • Citing Conference Paper
  • August 2013

... Locating such "hot" spots could help build more coherent models. In an earlier work, we investigated this within the context of personality trait detection [15]. We proposed a cluster-based approach, which aims at identifying frames that will likely carry cues about the personality. ...

A frame pruning approach for paralinguistic recognition tasks
  • Citing Conference Paper
  • September 2012

... One recent study probed transformer-based audio models for emotion recognition content to understand how much information related to emotions is contained in different models and layers [8], but did not probe for specific acoustic information. Another study fine-tuned pre-trained models to detect emotional properties (a multitask output: arousal, valence, and dominance) [9]. They then probed these models for a set of acoustic features, comparing a pre-trained Wav2Vec 2.0 [10] model fine-tuned with an added output head versus additionally fine-tuning the transformer layers. ...

Probing speech emotion recognition transformers for linguistic knowledge

... The evaluation is performed for the entire 30-minute video. Our qualitative judgment indicates that the model performs best for arousal, consistent with the results by Wagner et al. [49]. However, not all identified scenes show a significant increase or decrease in emotion. ...

Dawn of the transformer era in speech emotion recognition: closing the valence gap

... A possible solution is to use all available features. This is referred to as a "brute-force method" (Schuller et al., 2007). Indeed, some feature sets offered in openSMILE contain over 6,000 features and the use of collections as large as 50,000 is also reported (Schuller, Steidl and Batliner, 2009). ...

The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals

... The training corpus consists of a public speech delivered by Donald J. Trump in 2019 [10], approximately 50 minutes long. The data were annotated using NOVA (Baur et al., 2020), an annotation tool for annotating and analyzing behavior in social interactions. The NOVA user interface was designed to annotate continuous recordings with multiple modalities and subjects. ...

eXplainable Cooperative Machine Learning with NOVA

KI - Künstliche Intelligenz

... In the early 1970s, Ekman found evidence that humans share six basic emotions: happiness, sadness, fear, anger, disgust, and surprise (see Table 1, a survey of multimodal emotion analysis datasets; legend: Data Type: act = Acted, ind = Induced, nat = Natural; Modality: V = Video, A = Audio, T = Text; Fusion Type: feat = Feature, dec = Decision). ...

Patterns, prototypes, performance: classifying emotional user states
  • Citing Book
  • January 2008