Johannes Wagner’s research while affiliated with University of Augsburg and other places

What is this page?


This page lists works of an author who doesn't have a ResearchGate profile or hasn't added the works to their profile yet. It is automatically generated from public (personal) data to further our legitimate goal of comprehensive and accurate scientific recordkeeping. If you are this author and want this page removed, please let us know.

Publications (94)


Figure 3: SSI framework: Offline trained recognizers applied in a pipeline for event fusion in real-time. 
Figure 5: Influence of audio and video decay speed on vector fusion performance. 
Figure 6: Influence of audio and video weights on vector fusion performance. Stable performance is observed if audio and video events are weighted in a ratio of 8 to 10. 
Figure 7: Frequency of correctly classified frames according to laughter / smile confidence. Similar prediction behaviour allows confidence values to be combined directly during the fusion process. 
An Event Driven Fusion Approach for Enjoyment Recognition in Real-time
  • Conference Paper
  • Full-text available

November 2014 · 207 Reads · 35 Citations

Johannes Wagner · [...]

Social signals, and the interpretation of the information they carry, are of high importance in Human-Computer Interaction. Often used for affect recognition, the cues within these signals are displayed in various modalities. Fusing multi-modal signals is a natural and promising way to improve the automatic classification of emotions transported in social signals. In most present studies of uni-modal affect recognition as well as multi-modal fusion, decisions are forced onto fixed annotation segments across all modalities. In this paper, we investigate the less prevalent approach of event-driven fusion, which indirectly accumulates asynchronous events in all modalities for final predictions. We present a fusion approach that handles short-timed events in a vector space, which is of special interest for real-time applications. We compare the results of segmentation-based uni-modal classification and fusion schemes to the event-driven fusion approach. The evaluation is carried out by detecting enjoyment episodes within the audiovisual Belfast Story-Telling Corpus.
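
To make the event-driven idea concrete, the sketch below accumulates asynchronous audio (laughter) and video (smile) events, lets each contribution decay over time, and combines them with per-modality weights, roughly in the 8:10 audio-to-video ratio reported in Figure 6. It is a minimal illustration with assumed linear decay and invented class, parameter, and event names; it is not the implementation used in the paper.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    modality: str      # e.g. "audio" (laughter) or "video" (smile)
    confidence: float  # detector confidence in [0, 1]
    timestamp: float   # seconds since the session started

@dataclass
class VectorFusion:
    """Accumulate asynchronous events and let their contributions decay over time."""
    weights: dict = field(default_factory=lambda: {"audio": 0.8, "video": 1.0})
    decay: dict = field(default_factory=lambda: {"audio": 0.5, "video": 0.5})  # confidence lost per second
    events: list = field(default_factory=list)

    def push(self, event: Event) -> None:
        self.events.append(event)

    def fused_score(self, now: float) -> float:
        """Weighted sum of decayed confidences, clipped to [0, 1]."""
        score = 0.0
        alive = []
        for ev in self.events:
            decayed = ev.confidence - self.decay[ev.modality] * (now - ev.timestamp)
            if decayed > 0.0:
                score += self.weights[ev.modality] * decayed
                alive.append(ev)          # keep only events that still contribute
        self.events = alive
        return min(1.0, score)

# usage: feed detector outputs as they arrive and query the fused enjoyment estimate
fusion = VectorFusion()
fusion.push(Event("audio", confidence=0.7, timestamp=0.0))   # laughter detected
fusion.push(Event("video", confidence=0.6, timestamp=0.3))   # smile detected
print(fusion.fused_score(now=1.0))                           # ≈ 0.41
```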


Exploring Interaction Strategies for Virtual Characters to Induce Stress in Simulated Job Interviews

May 2014 · 1,178 Reads · 46 Citations

Job interviews come with a number of challenges, especially for young people who are not in employment, education, or training (NEETs). This paper presents an approach to a job-training simulation environment that employs two virtual characters and social cue recognition techniques to create an immersive interactive job interview. The two virtual characters are created with different social behavior profiles, understanding and demanding, which consequently influence the level of difficulty of the simulation as well as its impact on the user. Finally, we present a user study which investigates the feasibility of the proposed approach by measuring the effect the different virtual characters have on the users.


Laugh When You're Winning

April 2014 · 405 Reads · 13 Citations

IFIP Advances in Information and Communication Technology

Developing virtual characters with naturalistic game-playing capabilities is an increasingly researched topic in Human-Computer Interaction. Possible roles for such characters include virtual teachers, personal care assistants, and companions for children. Laughter is an under-investigated emotional expression both in Human-Human and Human-Computer Interaction. The EU project ILHAIRE aims to study this phenomenon and endow machines with laughter detection and synthesis capabilities. The Laugh When You're Winning project, developed during the eNTERFACE 2013 Workshop in Lisbon, Portugal, aimed to set up and test a game scenario involving two human participants and one such virtual character. The game chosen, the yes/no game, induces natural verbal and non-verbal interaction between participants, including frequent hilarious events, e.g., one of the participants saying "yes" or "no" and thereby losing the game. The setup includes software platforms, developed by the ILHAIRE partners, allowing automatic analysis and fusion of the human participants' multimodal data (voice, facial expression, body movements, respiration) in real-time to detect laughter. Further, virtual characters endowed with multimodal skills were synthesised in order to interact with the participants by producing laughter in a natural way.


Figure 1: Plot of normalized raw data obtained from different sensors. The bottom part of the graph shows how the signals are aligned with events in the audiovisual content generated by the 3D engine (Unity). The data was collected by sensors and stored in the URDB in real-time during an experiment in XIM. 
XIM-engine: A Software Framework to Support the Development of Interactive Applications That Uses Conscious and Unconscious Reactions in Immersive Mixed Reality

April 2014 · 114 Reads · 19 Citations

The development of systems that allow multimodal interpretation of human-machine interaction is crucial to advance our understanding and validation of theoretical models of user behavior. In particular, a system capable of collecting, perceiving and interpreting unconscious behavior can provide rich contextual information for an interactive system. One possible application for such a system is the exploration of complex data through immersion: massive amounts of data are generated every day, both by humans and by computer processes that digitize information at different scales and resolutions, thus exceeding our processing capacity. We need tools that accelerate our understanding and generation of hypotheses over these datasets, guide our searches, and prevent data overload. We describe XIM-engine, a bio-inspired software framework designed to capture and analyze multi-modal human behavior in an immersive environment. The framework allows performing studies that can advance our understanding of the use of conscious and unconscious reactions in interactive systems.
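
Figure 1 shows sensor streams aligned with events emitted by the 3D engine. The snippet below sketches one way such timestamp-based alignment could be done; the function, its tolerance parameter, and the sample data are illustrative assumptions and not part of the XIM-engine API.

```python
from bisect import bisect_left

def align_events(samples, engine_events, tolerance=0.05):
    """Pair each engine event with the closest sensor sample in time.

    samples:       list of (timestamp, value) tuples, sorted by timestamp
    engine_events: list of (timestamp, label) tuples from the 3D engine
    tolerance:     maximum allowed time difference in seconds
    """
    timestamps = [t for t, _ in samples]
    aligned = []
    for ev_time, label in engine_events:
        i = bisect_left(timestamps, ev_time)
        # candidates: the sample just before and just after the event time
        candidates = [j for j in (i - 1, i) if 0 <= j < len(samples)]
        if not candidates:
            continue
        best = min(candidates, key=lambda j: abs(timestamps[j] - ev_time))
        if abs(timestamps[best] - ev_time) <= tolerance:
            aligned.append((label, samples[best]))
    return aligned

# usage with made-up numbers: heart-rate samples aligned to two engine events
hr = [(0.00, 72), (0.02, 73), (0.98, 80), (1.01, 81)]
events = [(0.01, "scene_start"), (1.00, "stimulus_shown")]
print(align_events(hr, events))
```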



Figure 2. Schematic illustration showing the experimental setting in the eXperience Induction Machine (XIM). 
Figure 3. Schematic timeline of a single experimental session. 
Interpreting Psychophysiological States Using Unobtrusive Wearable Sensors in Virtual Reality

January 2014 · 404 Reads · 11 Citations

One of the main challenges in the study of human behavior is to quantitatively assess participants' affective states by measuring their psychophysiological signals in ecologically valid conditions. The quality of the acquired data, in fact, is often poor due to artifacts generated by natural interactions such as full-body movements and gestures. We created a technology to address this problem: we enhanced the eXperience Induction Machine (XIM), an immersive space we built to conduct experiments on human behavior, with unobtrusive wearable sensors that measure electrocardiogram, breathing rate and electrodermal response. We conducted an empirical validation in which participants wearing these sensors were free to move in the XIM space while exposed to a series of visual stimuli taken from the International Affective Picture System (IAPS). Our main result is the quantitative estimation of the arousal range of the affective stimuli through the analysis of participants' psychophysiological states. Taken together, our findings show that the XIM constitutes a novel tool to study human behavior in life-like conditions.
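
As a rough illustration of how arousal-related indices can be derived from such recordings, the sketch below computes a tonic skin-conductance level, a crude count of phasic rises, and mean heart rate for one stimulus window. The feature choices, threshold, and names are assumptions made for illustration, not the analysis actually used in the study.

```python
import statistics

def arousal_features(eda, rr_intervals):
    """Compute simple arousal-related indices for one stimulus window.

    eda:          skin-conductance samples (microsiemens) during the stimulus
    rr_intervals: inter-beat intervals (seconds) from the ECG in the same window
    """
    mean_scl = statistics.mean(eda)                                   # tonic skin-conductance level
    scr_like = sum(1 for a, b in zip(eda, eda[1:]) if b - a > 0.05)   # crude count of phasic rises
    mean_hr = 60.0 / statistics.mean(rr_intervals)                    # mean heart rate in bpm
    return {"mean_scl": mean_scl, "scr_like_rises": scr_like, "mean_hr": mean_hr}

# usage with made-up samples for a single IAPS picture presentation
print(arousal_features(
    eda=[2.1, 2.1, 2.2, 2.4, 2.3, 2.3],
    rr_intervals=[0.82, 0.80, 0.79, 0.81],
))
```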




Fig. 1. The recording process with skeleton and face tracking, audio graph, audio pitch information, as well as the event board that shows detected social cues, such as gestures, head poses, voice activity detection, etc. 
Fig. 2. A simplified Bayesian network to determine Engagement 
Fig. 3. NovA's graphical user interface. In this instance, data for two users have been loaded. It shows both videos (with and without skeleton, Figure 3 A), pie charts for expressivity features (Figure 3 B), heatmaps (Figure 3 C), a waveform graph with voice activity detection events (Figure 3 G), the timeline graph showing automatically created annotations (Figure 3 F), and the hands height graph (Figure 3 E).
Fig. 4. Comparison of detected cues for high (a), medium (b) and low engagement (c).
Fig. 5. NovA serves to analyze the learner’s social cues when interacting with a virtual recruiter during a job interview simulation in the TARDIS project 
NovA: Automated Analysis of Nonverbal Signals in Social Interactions

October 2013 · 1,122 Reads · 44 Citations

Lecture Notes in Computer Science

Previous studies have shown that the success of interpersonal interaction depends not only on the contents we communicate explicitly, but also on the social signals that are conveyed implicitly. In this paper, we present NovA (NOnVerbal behavior Analyzer), a system that automatically analyzes and facilitates the interpretation of social signals conveyed by gestures, facial expressions and other cues as a basis for computer-enhanced social coaching. NovA records data of human interactions, automatically detects relevant behavioral cues as a measure of the quality of an interaction, and creates descriptive statistics for the recorded data. This enables us to give a user online-generated feedback on strengths and weaknesses concerning their social behavior, as well as elaborate tools for offline analysis and annotation.
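
The cue-aggregation step can be pictured with a small sketch that turns a stream of detected cues into descriptive statistics and a naive engagement score. The cue names, weights, and the logistic scoring rule are invented for illustration; NovA itself uses a Bayesian network for engagement (see Fig. 2).

```python
import math
from collections import Counter

# hypothetical cue weights: positive cues raise, negative cues lower the score
CUE_WEIGHTS = {"smile": 1.0, "nod": 0.5, "gesture": 0.5,
               "gaze_away": -0.5, "arms_crossed": -1.0}

def summarize_cues(cues, duration_s):
    """cues: list of (timestamp, cue_name); returns counts, rates, and a toy engagement score."""
    counts = Counter(name for _, name in cues)
    per_minute = {name: n / (duration_s / 60.0) for name, n in counts.items()}
    raw = sum(CUE_WEIGHTS.get(name, 0.0) * n for name, n in counts.items())
    # squash into [0, 1] as a crude stand-in for a probabilistic engagement model
    engagement = 1.0 / (1.0 + math.exp(-raw / max(1, len(cues))))
    return {"counts": dict(counts), "per_minute": per_minute, "engagement": engagement}

# usage with an invented one-minute session
session = [(1.2, "smile"), (3.4, "nod"), (7.9, "gaze_away"), (12.5, "smile")]
print(summarize_cues(session, duration_s=60.0))
```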


Figure 2: Ways of fusing information in SSI. (FE=feature extraction, CL=classifier) 
Figure 3: Output of example pipeline. 
The Social Signal Interpretation (SSI) Framework: Multimodal Signal Processing and Recognition in Real-Time

October 2013 · 1,179 Reads · 206 Citations

Automatic detection and interpretation of social signals carried by voice, gestures, mimics, etc. will play a key role for next-generation interfaces, as it paves the way towards a more intuitive and natural human-computer interaction. The paper at hand introduces Social Signal Interpretation (SSI), a framework for the real-time recognition of social signals. SSI supports a large range of sensor devices, filter and feature algorithms, as well as machine learning and pattern recognition tools. It encourages developers to add new components using SSI's C++ API, but also addresses front-end users by offering an XML interface to build pipelines with a text editor. SSI is freely available under the GPL at http://openssi.net.
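
SSI itself is a C++ framework configured through XML pipelines, so the following Python mock-up is purely conceptual: it imitates the sensor → transformer (filter/feature extraction) → consumer (classifier) chain from Figure 2 with invented names and toy components, and uses none of SSI's actual API.

```python
from typing import Callable, Iterable, List

class Pipeline:
    """Toy sensor -> transformer -> consumer chain (conceptual only, not SSI code)."""
    def __init__(self, source: Callable[[], Iterable[list]]):
        self.source = source            # a "sensor": yields frames (lists of samples)
        self.stages: List[Callable] = []

    def add(self, stage: Callable) -> "Pipeline":
        self.stages.append(stage)       # a "transformer" or "consumer"
        return self                     # allow chaining

    def run(self):
        for frame in self.source():
            x = frame
            for stage in self.stages:
                x = stage(x)
            yield x

def fake_microphone():
    yield [0.1, 0.3, -0.2, 0.4]         # one pretend audio frame

def energy(frame):                      # toy feature extraction (FE in Figure 2)
    return sum(s * s for s in frame) / len(frame)

def threshold_classifier(feature):      # toy classifier (CL in Figure 2)
    return "voice activity" if feature > 0.05 else "silence"

pipe = Pipeline(fake_microphone).add(energy).add(threshold_classifier)
print(list(pipe.run()))                 # -> ['voice activity']
```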


Citations (79)


... Fine-tuned wav2vec has proven efficient across various speech recognition tasks and languages [14]. The recent application of the transformer-based wav2vec 2.0 showcased its utility in developing speech-based age and gender prediction models, including cross-corpus evaluation, with significant improvements in recall compared to a classic modeling approach based on hand-crafted features [47]. Additionally, wav2vec 2.0 representations of speech were found to be more effective in distinguishing between PD and HC subjects compared to language representations, including word-embedding models [48]. ...

Reference:

Analyzing wav2vec embedding in Parkinson’s disease speech: A study on cross-database classification and regression tasks
Speech-based Age and Gender Prediction with Transformers

... In addition, Sun et al. (2023) propose an automatic audio augmentation method, incorporating both waveform-level and spectrogram-level augmentation to enhance generalisation. As an abstract from data, foundation models can leverage the rich information in external data to capture universal representations (Zhu & Sato, 2023) and be fine-tuned to adapt to downstream tasks (Wagner et al., 2023). In this way, Ristea and Ionescu (2023) fine-tune a foundation model through appending an average pooling layer and a fully connected layer, whereas Porjazovski et al. (2023) utilise a Bayesian output layer for fine-tuning, with the aim of addressing parameter forgetting. ...

Dawn of the Transformer Era in Speech Emotion Recognition: Closing the Valence Gap

IEEE Transactions on Pattern Analysis and Machine Intelligence

... Many approaches have been presented to determine one's cognitive load including the use of speech [20], vision [5], and bio-signals such as EEG [3]. This paper further explores the use of EEG to detect cognitive load. ...

Quantifying Cognitive Load from Voice using Transformer-Based Models and a Cross-Dataset Evaluation
  • Citing Conference Paper
  • December 2022

... Kinect TM motion capture technology was used to capture facial features, gaze direction and depth information (Figure 8.1). Synchronisation was achieved using the Social Signal Interpretation framework (SSI) (Wagner et al. [169]). The participants took turns at telling personal stories which they associated with an enjoyable emotion. ...

Using phonetic patterns for detecting social cues in natural conversations
  • Citing Conference Paper
  • August 2013

... Locating such "hot" spots could help building more coherent models. In an earlier work we have investigated this within the context of personality trait detection [15]. We proposed a cluster-based approach, which aims at identifying frames that will likely carry cues about the personality. ...

A frame pruning approach for paralinguistic recognition tasks
  • Citing Conference Paper
  • September 2012

... One recent study probed transformer-based audio models for emotion recognition content to understand how much information related to emotions is contained in different models and layers [8], but did not probe for specific acoustic information. Another study fine-tuned pre-trained models to detect emotional properties (a multitask output: arousal, valence, and dominance) [9]. They then probed these models for a set of acoustic features, comparing a pre-trained Wav2Vec 2.0 [10] model fine-tuned with an added output head versus additionally fine-tuning the transformer layers. ...

Probing speech emotion recognition transformers for linguistic knowledge

... The evaluation is performed for the entire 30-minute video. Our qualitative judgment indicates that the model performs best for arousal, consistent with the results by Wagner et al. [49]. However, not all identified scenes show a significant increase or decrease in emotion. ...

Dawn of the transformer era in speech emotion recognition: closing the valence gap

... A possible solution is to use all available features. This is referred to as a "brute-force method" (Schuller et al., 2007). Indeed, some feature sets offered in openSMILE contain over 6,000 features and the use of collections as large as 50,000 is also reported (Schuller, Steidl and Batliner, 2009). ...

The relevance of feature type for the automatic classification of emotional user states: low level descriptors and functionals

... The training corpus consists of a public speech delivered by Donald J. Trump in 2019 [10], approximately 50 minutes. The data were annotated using NOVA (Baur et al., 2020), an annotation tool for annotating and analyzing behavior in social interactions. The NOVA user interface was designed to annotate continuous recordings with multiple modalities and subjects. ...

eXplainable Cooperative Machine Learning with NOVA

KI - Künstliche Intelligenz

... In the early 1970s, Ekman found evidence that humans share six basic emotions: happiness, sadness, fear, anger, disgust, and surprise. Table 1: Multimodal emotion analysis datasets.

Study                          Ref.   Type  Modality  Fusion
—                              [192]  act   A+T       dec
Forbes-Riley & Litman (2004)   [193]  nat   A+T       feat
Litman & Forbes-Riley (2004)   [194]  nat   A+T       feat
—                              [195]  act   A+T       dec
Litman & Forbes-Riley (2006)   [196]  nat   A+T       feat
Seppi et al. (2008)            [197]  ind   A+T       feat
—                              [121]  ind   A+T       model
Schuller (2011)                [198]  nat   A+T       feat
Wu & Liang (2011)              [97]   act   A+T       dec
Rozgic et al. (2012)           [199]  act   A+T+V     feat
Savran et al. (2012)           [200]  ind   A+T+V     model
—                              [4]    nat   A+T+V     feat
Wollmer et al. (2013)          [27]   nat   A+T+V     hybrid
Sarkar et al. (2014)           [201]  nat   A+T+V     feat
Alam et al. (2014)             [202]  nat   A+T+V     dec
Ellis et al. (2014)            [203]  nat   A+T+V     dec
Poria et al. (2014)            [202]  act   A+T+V     feat
Siddiquie et al. (2015)        [204]  nat   A+T+V     hybrid
—                              [205]  nat   A+T+V     dec
—                              [189]  nat   A+T+V     feat
Cai et al. (2015)              [206]  nat   T+V       dec
Ji et al. (2015)               [207]  nat   T+V       model
Yamasaki et al. (2015)         [208]  nat   A+T       model
—                              [94]   nat   A+T+V     feat

Legend: Data type (act = acted, ind = induced, nat = natural); Modality (A = audio, T = text, V = video); Fusion type (feat = feature, dec = decision). ...

Patterns, prototypes, performance: classifying emotional user states
  • Citing Book
  • January 2008