Ingo Siegert

Otto-von-Guericke-Universität Magdeburg | OvGU · Faculty of Electrical Engineering and Information Technology

Jun.-Prof. Dr.-Ing.

About

130
Publications
57,440
Reads
773
Citations
Introduction
Ingo Siegert is currently Assistant Professor for Mobile Dialog Systems at the Faculty of Electrical Engineering and Information Technology, Otto-von-Guericke-Universität Magdeburg. His research covers pattern recognition, speech-based emotion recognition, signal processing, artificial neural networks, and human-computer interaction.
Additional affiliations
November 2018 - present
Otto-von-Guericke-Universität Magdeburg
Position
  • Professor (Assistant)
April 2015 - October 2018
Otto-von-Guericke-Universität Magdeburg
Position
  • PostDoc Position
August 2009 - March 2015
Otto-von-Guericke-Universität Magdeburg
Position
  • Research Assistant
Education
July 2009 - March 2015
Otto-von-Guericke-Universität Magdeburg
Field of study
  • Electrical Engineering
October 2003 - May 2009
Otto-von-Guericke-Universität Magdeburg
Field of study
  • Information Technology

Publications (130)
Article
To enable naturalistic human–computer interaction, the recognition of emotions and intentions has received increased attention, and several modalities are combined to cover all human communication abilities. For this reason, naturalistic material is recorded, where the subjects are guided through an interaction with crucial points, but with the fre...
Article
Full-text available
For successful human-machine interaction (HCI), not only the pure textual information but also the individual skills, preferences, and affective states of the user must be known. Therefore, as a starting point, the user's actual affective state has to be recognized. In this work we investigated how additional knowledge, for example the age and gender of the user, can...
Conference Paper
Full-text available
A new conversation corpus in the area of human-computer interaction is introduced. It consists of conversations between one and two interaction partners with a commercial voice assistant system (Amazon’s ALEXA) in two different settings. The fundamental aim for building up this corpus is to investigate how humans address technical systems. Thereby,...
Chapter
Full-text available
Datasets featuring modern voice assistants such as Alexa, Siri, Cortana and others allow an easy study of human-machine interactions. But data collections offering an unconstrained, unscripted public interaction are quite rare. Many studies so far have focused on private usage, short pre-defined task or specific domains. This contribution presents...
Article
Remote meetings via Zoom, Skype, or Teams limit the range and richness of nonverbal communication signals. Not just because of the typically sub-optimal light, posture, and gaze conditions, but also because of the reduced speaker visibility. Consequently, the speaker's voice becomes immensely important, especially when it comes to being persuasive...
Article
Full-text available
Objective Acoustic addressee detection is a challenge that arises in human group interactions, as well as in interactions with technical systems. The research domain is relatively new, and no structured review is available. Especially due to the recent growth of usage of voice assistants, this topic received increased attention. To allow a natural...
Conference Paper
Far-field speech recognition has gained a lot of attention in recent years. In particular, the appearance of commercial voice assistants has taken research to a new level in terms of recognition, understanding, and applications. This technology has become one of the mainstay products, with well-known examples like ALEXA, Siri, or Cortana from Amazon, A...
Conference Paper
The present study investigates how the properties of room acoustics affect the production and, in particular, the acoustic analysis of charismatic prosodic parameters. A re-recorded version of emoDB was used. The room acoustics were varied in two ways: the environment in that the recordings took place (studio conditions, hallway, lecture hall) and...
Conference Paper
A rapid increase in the use of voice assistants has been observed in recent years due to the convenience of their usage across the age spectrum. Typically, the use of voice assistants is limited to private users, as public use of voice assistants and the recording of interactions poses a threat to the user's identification. This creates a lack of avail...
Conference Paper
Full-text available
Emotions are an integral part of a speaker's charismatic impact. Previous studies started from this impact and examined the associated emotional features on the part of the speaker and the recipient. We start here from the emotions themselves and test, with a view to, e.g., everyday business communication and based on isolated, enacted stimulus sen...
Chapter
Full-text available
Background: In recent years, the market for commercial voice assistants has been continuously rising. While voice assistants are increasingly popular in daily usage, the user input data (speech data) is stored and processed on cloud platforms, which raises data-privacy concerns for many. In a 2019 Voice report by Microsoft, 41% of...
Chapter
Full-text available
Voice assistants are increasingly dominating everyday life and represent an easy way to perform various tasks with minimal effort. The areas of application for voice assistants are diverse and range from answering simple information questions to processing complex topics and controlling various tasks. However, current voice assistants very quickly...
Conference Paper
Full-text available
The central issue for the wider use of speech-based technical systems is the proper recognition of speech. But as spontaneous human speech has a lot of disfluencies and variations, even state-of-the-art ASR engines are posed with difficulties. One possibility to overcome this issue is the combination of different ASR outputs. In this paper ROVER,...
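The combination idea behind ROVER can be sketched in a few lines. Real ROVER first aligns the hypotheses into a word transition network via dynamic programming; the simplified sketch below (function name and example utterances are hypothetical) assumes the hypotheses are already aligned word-by-word and only performs the voting step:

```python
from collections import Counter

def rover_vote(hypotheses):
    """Simplified ROVER-style combination: assumes the ASR hypotheses
    are already aligned word-by-word (real ROVER builds the alignment
    first). Picks the majority word at each slot; '@' marks a null arc."""
    combined = []
    for words in zip(*hypotheses):
        word, _count = Counter(words).most_common(1)[0]
        if word != "@":          # drop null transitions
            combined.append(word)
    return " ".join(combined)

# Three hypothetical engine outputs for the same utterance:
hyps = [
    ["turn", "on", "the", "light", "@"],
    ["turn", "on", "a",   "light", "@"],
    ["turn", "on", "the", "light", "please"],
]
print(rover_vote(hyps))  # -> turn on the light
```

Because two of the three engines agree on "the" and on the null arc, the combined output can be more reliable than any single engine's hypothesis.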
Conference Paper
Full-text available
This contribution summarizes some experiences with a summative assessment or preliminary examination in a two-semester engineering course. An online assessment with numerical or multiple-choice questions was chosen, which required an elaborate preparation, especially in question design, but reduced the correction effort enormously. This investment...
Conference Paper
Full-text available
The use of voice assistants has grown rapidly, and they can be found in millions of households. Researchers have made a lot of effort to improve the usage of these systems. One issue that remains open is the usage of voice assistants and the recording of interactions for research purposes in public environments, due to privacy concerns. Althou...
Article
The main promise of voice assistants is their ability to correctly interpret and learn from user input as well as the ability to utilize this knowledge to achieve specific goals and tasks. These systems need predetermined activation actions to start a conversation. Unfortunately, the typically used solution, wake-words, force an unnatural interacti...
Article
Full-text available
Despite the growing importance of Automatic Speech Recognition (ASR), its application is still challenging, limited, language-dependent, and requires considerable resources. The resources required for ASR are not only technical, they also need to reflect technological trends and cultural diversity. The purpose of this research is to explore ASR per...
Chapter
Full-text available
Previous research by the authors showed that signal compression codecs used in remote meetings and mobile communications have a substantial negative effect on perceived speaker charisma. Moreover, this effect size varied as a function of speaker gender. Following up from this previous study, we conducted a multiparametric acoustic analysis of a se...
Chapter
Full-text available
The phenomenon of considerable level differences in the audio track is ubiquitous. It occurs not only in regular television or DVD playback, but increasingly also when using streaming services. It is often impossible to find a tolerable volume setting at which all dialogues can be understood and...
Chapter
Full-text available
The article summarizes selected results of audio and video signal processing in a joint research project on agricultural mission data (HARMONIC) from our previous publications. We compare the results of audio-processing tasks, based on single-channel recordings directly at a small unmanned aerial vehicle (UAV, drone) with the improvements using a l...
Chapter
Nowadays, a diverse set of addressee detection methods is discussed. Typically, wake words are used. But these force an unnatural interaction and are error-prone, especially in case of false positive classification (user says the wake up word without intending to interact with the device). Therefore, technical systems should be enabled to perform a...
Chapter
Full-text available
Within the last five years, the availability and usability of interactive voice assistants have grown. Thereby, the development benefits mostly from the rapid improvement of cloud-based speech recognition systems. Furthermore, many cloud-based services, such as Google Speech API, IBM Watson, and Wit.ai, can be used for personal applications and transcr...
Conference Paper
Full-text available
The civilian and military use of drones (unmanned aerial vehicles, UAVs) for surveillance tasks, for inspection of industrial structures, for monitoring in agriculture and science-data collection is steadily growing. A sound or speech signal processing directly at drones or at the presence of drones nearby is challenging because of the significant...
Conference Paper
Training end-to-end automatic speech recognition models requires a large amount of labeled speech data. This goal is challenging for languages with fewer resources. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set by using different audio codecs at the data level. The augmentation method co...
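The core of such data-level augmentation is passing each clean recording through a lossy codec and adding the distorted copy to the training set. As a minimal, self-contained illustration (the abstract does not name the specific codecs used), the sketch below runs a signal through an 8-bit G.711-style mu-law encode/decode round trip, which introduces codec-like quantisation noise:

```python
import math

MU = 255.0  # mu-law companding constant (as in G.711)

def mulaw_roundtrip(samples, bits=8):
    """Encode-decode a float signal (range [-1, 1]) through mu-law
    companding and quantisation, a cheap stand-in for one lossy codec
    pass. The round trip adds the codec-like distortion that the
    data-level augmentation exploits."""
    levels = 2 ** (bits - 1) - 1
    out = []
    for x in samples:
        # compress: y = sign(x) * ln(1 + mu*|x|) / ln(1 + mu)
        y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
        # quantise the companded value to 'bits' of resolution
        q = round(y * levels) / levels
        # expand: x' = sign(q) * ((1 + mu)^|q| - 1) / mu
        out.append(math.copysign(math.expm1(abs(q) * math.log1p(MU)) / MU, q))
    return out

# A 440 Hz test tone at 16 kHz; the augmented copy differs slightly:
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(160)]
coded = mulaw_roundtrip(clean)
err = max(abs(a - b) for a, b in zip(clean, coded))
print(f"max round-trip error: {err:.4f}")  # small but non-zero distortion
```

In practice one would transcode with real codecs (e.g. via an external encoder) at several bitrates and train on the union of clean and transcoded audio.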
Conference Paper
We analyze the addressee detection task for complexity-identical dialog for both human conversation and device-directed speech. Our recurrent neural model performs at least as well as humans, who have problems with this task, even native speakers, who profit from the relevant linguistic skills. We perform ablation experiments on the features used b...
Conference Paper
Industry 4.0 (I4.0) aims to enable intelligent production by connecting and evaluating data. The asset administration shell, the Industry 4.0 specification of a digital twin, describes various concepts to realize this data exchange. One part of the asset administration shell is the I4.0-language, which intends to standardize complex interactions be...
Article
Full-text available
The European Union (EU) General Data Protection Regulations (GDPR) has a direct impact on research activities, as it raises the awareness of personal rights not only among the scientists but also among the data-subjects scientists process information from. This paper presents the dilemma related to the privacy of audio and video data, compliance wi...
Article
Full-text available
Human-machine addressee detection (H-M AD) is a modern paralinguistics and dialogue challenge that arises in multiparty conversations between several people and a spoken dialogue system (SDS) since the users may also talk to each other and even to themselves while interacting with the system. The SDS is supposed to determine whether it is being add...
Conference Paper
Full-text available
Human interaction analyses are essential to study social interaction, conversational rules, and affective signals. These analyses are also used to improve models for human-machine interaction. Besides the pure acoustic signal and its transcripts, the use of contextual information is essential. Since the enforcement of the GDPR in the EU in 2018, t...
Chapter
In interactions with speech-based dialog systems, users tend to adapt their speech behavior to their technical counterpart by taking into account the abilities and characteristics they ascribe to the system. Hence, it can be supposed that different systems may evoke different speech behavior according to the users' evaluation of the system. In order to...
Conference Paper
Full-text available
This study examines how the presence of other speakers affects the interaction with a spoken dialogue system. We analyze participants’ speech regarding several phonetic features, viz., fundamental frequency, intensity, and articulation rate, in two conditions: with and without additional speech input from a human confederate as a third interlocutor...
Chapter
Contemporary technical devices obey the paradigm of naturalistic multimodal interaction and user-centric individualisation. Users expect devices to interact intelligently, to anticipate their needs, and to adapt to their behaviour. To do so, companion-like solutions have to take into account the affective and dispositional state of the user, and th...
Conference Paper
Full-text available
Common applications of an unmanned aerial vehicle (UAV, aerial drone) utilize the capabilities of mobile image or video capturing, whereas our article deals with acoustic-related scenarios. Especially for surveillance tasks, e.g. in disaster management or measurement of artificial environmental noise in large industrial areas, an UAV-based acoustic...
Article
Today, multiple solutions are implemented to detect if a system should react to an uttered speech command. Common solutions are push-to-talk and activation words. But both are disadvantageous as their interaction initiation is quite unnatural. Furthermore, relying on an activation word is error-prone, especially when the activation word has been sa...
Chapter
Full-text available
A new dataset, the Restaurant Booking Corpus (RBC), is introduced, comprising 90 telephone dialogs of 30 German-speaking students (10 male, 20 female) interacting either with one of two different technical dialogue systems or with a human conversational partner. The aim of the participants was to reserve a table each at three different restaur...
Chapter
Full-text available
This paper presents a study that examines the difference of certain phonetic features between human-directed speech (HDS) and device-directed speech (DDS) in human-human-computer interactions. The corpus used consists of tasks in which participants perform a task with a confederate and a computer, which is used for the analyses. This includes distributiona...
Chapter
Usually, compression methods are avoided for emotion recognition problems, as it is feared that compression degrades the acoustic characteristics needed for an accurate recognition. By contrast, we assume that the psychoacoustic modeling used for transparent music compression could actually improve speech-based emotion recognition, as it removes ce...
Chapter
Emotion recognition from speech receives ever-growing attention, since the systems around us aim to enable natural communication. One important question still remains unresolved: the definition of the most suitable features across different data types. In the present paper, we employ a random-forest-based feature selection known from other rese...
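Random-forest-based feature selection ranks features by how much the trees that use them improve classification. A heavily simplified, stdlib-only sketch of the idea (all names and the synthetic data are hypothetical, not from the paper) uses a "forest" of one-split decision stumps and credits each feature with the accuracy gain over chance of the stumps built on it:

```python
import random

def stump_forest_importance(X, y, n_stumps=300, seed=0):
    """Toy random-forest-style feature ranking: many one-split trees,
    each fit on a bootstrap sample with a randomly chosen feature and
    threshold. A feature's importance is the accumulated accuracy gain
    over chance of the stumps that use it -- a simplified stand-in for
    the importances used in random-forest feature selection."""
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    importance = [0.0] * d
    for _ in range(n_stumps):
        idx = [rng.randrange(n) for _ in range(n)]   # bootstrap sample
        f = rng.randrange(d)                         # random feature
        thr = X[rng.choice(idx)][f]                  # random split point
        preds = [1 if X[i][f] > thr else 0 for i in idx]
        acc = sum(p == y[i] for p, i in zip(preds, idx)) / n
        acc = max(acc, 1 - acc)                      # allow flipped stump
        importance[f] += acc - 0.5
    return importance

# Synthetic data: feature 0 determines the label, feature 1 is noise.
rng = random.Random(1)
X = [[rng.random(), rng.random()] for _ in range(200)]
y = [1 if x[0] > 0.5 else 0 for x in X]
imp = stump_forest_importance(X, y)
print(imp)  # importance of feature 0 clearly dominates
```

Features are then selected by keeping the top-ranked ones; real implementations (e.g. impurity-based or permutation importances of a full random forest) refine the same principle.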
Conference Paper
Empathic vehicles are a promising concept to increase the safety and acceptance of automated vehicles. However, on the way towards empathic vehicles a lot of research in the area of automated emotion recognition is necessary. Successful methods to detect emotions need to be trained on realistic data that contain the target emotion and come from a s...
Conference Paper
Full-text available
Certain emotions can have a negative effect on the driver's capability of safely operating the vehicle and can ultimately lead to accidents. Therefore, it would be beneficial if the vehicle was able to detect the emotional state of the driver and provide appropriate assistance to mitigate these effects. This study investigates the influence of in-c...
Chapter
Full-text available
Today, in technical dialog-systems diverse solutions are implemented to detect if a system should react to an uttered speech command. Typically used solutions are push-to-talk and keywords. Unfortunately, these solutions constitute an unnatural interaction to overcome the problem that the system is not able to detect when it is addressed. Moreover,...
Article
In emotion recognition from speech, huge amounts of training material are needed for the development of classification engines. As most current corpora do not supply enough material, a combination of different datasets is advisable. Unfortunately, data recording is done differently and various emotion elicitation and emotion annotation methods are...
Article
Full-text available
Empathic vehicles are a promising concept to increase the safety and acceptance of automated vehicles. However, on the way towards empathic vehicles a lot of research in the area of automated emotion recognition is necessary. Successful methods to detect emotions need to be trained on realistic data that contain the target emotion and come from a s...
Chapter
During system interaction, the user’s emotions and intentions shall be adequately determined and predicted to recognize tendencies in his or her interests and dispositions. This allows for the design of an evolving search user interface (ESUI) which adapts to changes in the user’s emotional reaction and the users’ needs and claims.
Chapter
We demonstrate a successful multimodal dynamic human-computer interaction (HCI) in which the system adapts to the current situation and the user’s state is provided using the scenario of purchasing a train ticket. This scenario demonstrates that Companion Systems are facing the challenge of analyzing and interpreting explicit and implicit observati...
Chapter
In general, humans interact with each other using multiple modalities. The main channels are speech, facial expressions, and gesture. But also bio-physiological data such as biopotentials can convey valuable information which can be used to interpret the communication in a dedicated way. A Companion-System can use these modalities to perform an eff...
Chapter
The LAST MINUTE Corpus (LMC) is one of the rare examples of a corpus with naturalistic human-computer interactions. It offers richly annotated data from N_total = 130 experiments in a number of modalities. In this paper we present results from various investigations with data from the LMC using several primary modalities, e.g. transcripts, audio, qu...
Chapter
Spoken language is one of the main interaction patterns in human-human as well as in natural, companion-like human-machine interactions. Speech conveys content, but also emotions and interaction patterns determining the nature and quality of the user’s relationship to his counterpart. Hence, we consider emotion recognition from speech in the wider...
Conference Paper
Full-text available
The recognition performance of a classifier is affected by various aspects. A huge influence is exerted by the input data pre-processing. In the current paper we analysed the relation between different normalisation methods for emotionally coloured speech samples, deriving general trends to be considered during data pre-processing. To the best of ou...
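A typical pre-processing choice of this kind is whether features are normalised over the whole corpus or per speaker. The sketch below (the pitch values and speaker names are hypothetical, purely for illustration) shows the difference with plain z-score normalisation: per-speaker normalisation removes speaker-specific baselines, so identical contours from different voices align:

```python
from statistics import mean, stdev

def z_norm(values):
    """Standard z-score normalisation: zero mean, unit variance."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

# Hypothetical pitch features (Hz) from two speakers with different baselines:
speaker_a = [180.0, 200.0, 220.0]   # higher-pitched voice
speaker_b = [ 90.0, 100.0, 110.0]   # lower-pitched voice

# Corpus-level normalisation keeps the speaker offset ...
corpus = z_norm(speaker_a + speaker_b)
# ... per-speaker normalisation removes it, so identical contours coincide:
per_spk = z_norm(speaker_a) + z_norm(speaker_b)
print(per_spk[:3] == per_spk[3:])  # True
```

Which variant helps depends on the task: speaker-level normalisation discards speaker identity, which can be desirable for emotion recognition but harmful when speaker traits carry information.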
Article
Full-text available
User satisfaction is an important aspect of human-computer interaction (HCI) – if a user is not satisfied, he or she might not be willing to use such a system. Therefore, it is crucial to HCI applications to be able to recognise the user satisfaction level in order to react in an appropriate way. For such recognition tasks, data-driven methods have...
Conference Paper
Full-text available
Speech and audio codecs are implemented in a variety of multimedia applications, and multichannel sound is offered by the first streaming or cloud-based services. Besides the objective of perceptual quality, coding-related research is focused on low bitrate and minimal latency. The IETF-standardized Opus codec provides a high perceptual quality, low lat...
Conference Paper
Most technical communication systems use speech compression codecs to save transmission bandwidth. A lot of development effort has gone into guaranteeing high speech intelligibility, resulting in different compression techniques: analysis-by-synthesis, psychoacoustic modeling, and a hybrid mode of both. Our first assumption is that the hybrid mode improves the...
Conference Paper
Full-text available
One objective of affective computing is the automatic processing of human emotions. Considering human speech, filled pauses are one of the cues giving insight into the emotional state of a human being. Filled pauses are short speech events without a specified semantic meaning, but they have a variety of communicative and affective functions. The de...
Conference Paper
Full-text available
For emotional analyses of interactions, a qualitatively high transcription and annotation of the given material is important. The textual transcription can be conducted with several available tools, e.g., Folker or ANVIL. But tools for the annotation of emotions are quite rare. Furthermore, existing tools only allow selecting an emotion term from a l...
Conference Paper
Full-text available
Enabling a natural (human-like) spoken conversation with technical systems requires affective information, contained in spoken language, to be intelligibly transmitted. This study investigates the role of speech and music codecs for affect intelligibility. A decoding and encoding of affective speech was employed from the well-known EMO-DB corpus. U...