
Ingo Siegert
Jun.-Prof. Dr.-Ing.
Otto-von-Guericke-Universität Magdeburg (OvGU) · Faculty of Electrical Engineering and Information Technology
About
130
Publications
57,440
Reads
773
Citations
Introduction
Ingo Siegert is currently Assistant Professor for Mobile Dialog Systems at the Faculty of Electrical Engineering and Information Technology, Otto-von-Guericke-Universität Magdeburg. His research covers pattern recognition, speech-based emotion recognition, signal processing, artificial neural networks, and human-computer interaction.
Additional affiliations
November 2018 - present
April 2015 - October 2018
August 2009 - March 2015
Education
July 2009 - March 2015
October 2003 - May 2009
Publications (130)
To enable naturalistic human–computer interaction, the recognition of emotions and intentions has received increased attention, and several modalities are combined to cover all human communication abilities. For this reason, naturalistic material is recorded, in which the subjects are guided through an interaction with crucial points, but with the fre...
For successful human-machine interaction (HCI), not only the pure textual information but also the individual skills, preferences, and affective states of the user must be known. Therefore, as a starting point, the user's actual affective state has to be recognized. In this work we investigated how additional knowledge, for example the age and gender of the user, can...
A new conversation corpus in the area of human-computer interaction is introduced. It consists of conversations between one or two interaction partners and a commercial voice assistant system (Amazon's ALEXA) in two different settings. The fundamental aim of building up this corpus is to investigate how humans address technical systems. Thereby,...
Datasets featuring modern voice assistants such as Alexa, Siri, Cortana, and others allow an easy study of human-machine interactions. But data collections offering unconstrained, unscripted public interaction are quite rare. Many studies so far have focused on private usage, short pre-defined tasks, or specific domains. This contribution presents...
Remote meetings via Zoom, Skype, or Teams limit the range and richness of nonverbal communication signals. Not just because of the typically sub-optimal light, posture, and gaze conditions, but also because of the reduced speaker visibility. Consequently, the speaker's voice becomes immensely important, especially when it comes to being persuasive...
Objective: Acoustic addressee detection is a challenge that arises in human group interactions, as well as in interactions with technical systems. The research domain is relatively new, and no structured review is available. Especially due to the recent growth in usage of voice assistants, this topic has received increased attention. To allow a natural...
Far-field speech recognition has gained a lot of attention in recent years. In particular, the appearance of commercial voice assistants has taken research to a new level in terms of recognition, understanding, and applications. This technology has become one of the mainstay products, with well-known examples like ALEXA, Siri, or Cortana from Amazon, A...
The present study investigates how the properties of room acoustics affect the production and, in particular, the acoustic analysis of charismatic prosodic parameters. A re-recorded version of emoDB was used. The room acoustics were varied in two ways: the environment in which the recordings took place (studio conditions, hallway, lecture hall) and...
A rapid increase in the use of voice assistants has been observed in recent years due to the convenience of their usage across the age spectrum. Typically, the use of voice assistants is limited to private users, as the public use of voice assistants and the recording of interactions poses a threat of user identification. This creates a lack of avail...
Emotions are an integral part of a speaker's charismatic impact. Previous studies started from this impact and examined the associated emotional features on the part of the speaker and the recipient. We start here from the emotions themselves and test, with a view to, e.g., everyday business communication and based on isolated, enacted stimulus sen...
Background: In recent years, the market for commercial voice assistants has been continuously rising. While voice assistants are increasingly popular in daily usage, the user input data (speech data) is stored and processed on cloud platforms, which raises data privacy concerns for many. In a 2019 Voice report by Microsoft, 41% of...
Voice assistants are increasingly dominating everyday life and represent an easy way to perform various tasks with minimal effort. The areas of application for voice assistants are diverse and range from answering simple information questions to processing complex topics and controlling various tasks. However, current voice assistants very quickly...
The central issue for the wider use of speech-based technical systems is the proper recognition of speech. But as spontaneous human speech contains many disfluencies and variations, even state-of-the-art ASR engines face difficulties. One possibility to overcome this issue is the combination of different ASR outputs. In this paper ROVER,...
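The ROVER approach can be sketched as a word-level vote over aligned recognizer outputs. The following is a simplified illustration, not the paper's implementation: real ROVER first aligns the hypotheses into a word transition network via dynamic programming, while this sketch assumes the hypotheses are already aligned slot by slot.

```python
from collections import Counter

def rover_vote(hypotheses):
    """Majority vote over the word slots of several ASR hypotheses.

    Simplified ROVER sketch: the real system builds a word transition
    network by dynamic-programming alignment first; here the hypotheses
    are assumed pre-aligned, with None marking a deletion.
    """
    voted = []
    for slot in zip(*hypotheses):
        counts = Counter(w for w in slot if w is not None)
        if counts:
            voted.append(counts.most_common(1)[0][0])
    return " ".join(voted)

# Three hypothetical recognizer outputs for the same utterance:
hyps = [
    ["turn", "on", "the", "light"],
    ["turn", "of", "the", "light"],
    ["turn", "on", "the", "night"],
]
print(rover_vote(hyps))  # -> "turn on the light"
```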
This contribution summarizes some experiences with a summative assessment or preliminary examination in a two-semester engineering course. An online assessment with numerical or multiple-choice questions was chosen, which required an elaborate preparation, especially in question design, but reduced the correction effort enormously. This investment...
The use of voice assistants has grown rapidly, and they can be found in millions of households. Researchers have put a lot of effort into improving the usage of these systems. One issue that remains open is the usage of voice assistants and the recording of interactions for research purposes in public environments, due to privacy concerns. Althou...
The main promise of voice assistants is their ability to correctly interpret and learn from user input, as well as the ability to utilize this knowledge to achieve specific goals and tasks. These systems need predetermined activation actions to start a conversation. Unfortunately, the typically used solution, wake words, forces an unnatural interacti...
Despite the growing importance of Automatic Speech Recognition (ASR), its application is still challenging, limited, language-dependent, and requires considerable resources. The resources required for ASR are not only technical; they also need to reflect technological trends and cultural diversity. The purpose of this research is to explore ASR per...
Previous research by the authors showed that signal compression codecs used in remote meetings and mobile communications have a substantial negative effect on perceived speaker charisma. Moreover, this effect size varied as a function of speaker gender. Following up from this previous study, we conducted a multiparametric acoustic analysis of a se...
The phenomenon of considerable level differences in the audio track is ubiquitous. It occurs not only with regular television or DVD playback, but increasingly also when using streaming services. It is often impossible to find a tolerable volume setting at which all dialogue can be understood and...
The article summarizes selected results of audio and video signal processing in a joint research project on agricultural mission data (HARMONIC) from our previous publications. We compare the results of audio-processing tasks, based on single-channel recordings directly at a small unmanned aerial vehicle (UAV, drone) with the improvements using a l...
Nowadays, a diverse set of addressee detection methods is discussed. Typically, wake words are used. But these force an unnatural interaction and are error-prone, especially in the case of a false positive classification (the user says the wake-up word without intending to interact with the device). Therefore, technical systems should be enabled to perform a...
Within the last five years, the availability and usability of interactive voice assistants have grown. The development benefits mostly from rapid advances in cloud-based speech recognition systems. Furthermore, many cloud-based services, such as Google Speech API, IBM Watson, and Wit.ai, can be used for personal applications and transcr...
The civilian and military use of drones (unmanned aerial vehicles, UAVs) for surveillance tasks, for inspection of industrial structures, for monitoring in agriculture, and for scientific data collection is steadily growing. Sound or speech signal processing directly at a drone, or in the presence of drones nearby, is challenging because of the significant...
Training end-to-end automatic speech recognition models requires a large amount of labeled speech data. This goal is challenging for languages with fewer resources. In contrast to the commonly used feature-level data augmentation, we propose to expand the training set by using different audio codecs at the data level. The augmentation method co...
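Such data-level augmentation can be approximated by re-encoding each training file with several codecs, for example via the ffmpeg CLI. The codec and bitrate choices below are illustrative assumptions, not the configuration from the paper:

```python
import subprocess
from pathlib import Path

# Illustrative codec/bitrate pairs; degraded copies are usually decoded
# back to PCM before feature extraction and training.
CODECS = [
    ("libopus", "16k", ".opus"),
    ("libmp3lame", "32k", ".mp3"),
]

def augment_with_codecs(wav_path: str, out_dir: str) -> list[Path]:
    """Create codec-degraded copies of one utterance using ffmpeg."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    created = []
    for codec, bitrate, ext in CODECS:
        target = out / (Path(wav_path).stem + f"_{codec}{ext}")
        subprocess.run(
            ["ffmpeg", "-y", "-i", wav_path, "-c:a", codec,
             "-b:a", bitrate, str(target)],
            check=True,
        )
        created.append(target)
    return created
```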
We analyze the addressee detection task for complexity-identical dialog for both human conversation and device-directed speech. Our recurrent neural model performs at least as well as humans, who have problems with this task, even native speakers, who profit from the relevant linguistic skills. We perform ablation experiments on the features used b...
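A recurrent addressee classifier of this kind could look like the following PyTorch sketch; the layer sizes and the 40-dimensional acoustic input are assumptions for illustration, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class AddresseeLSTM(nn.Module):
    """Binary device-directed vs. human-directed classifier over
    sequences of acoustic feature vectors (illustrative sizes)."""

    def __init__(self, n_features: int = 40, hidden: int = 64):
        super().__init__()
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features); the final hidden state
        # summarizes the utterance before the sigmoid decision.
        _, (h, _) = self.rnn(x)
        return torch.sigmoid(self.head(h[-1])).squeeze(-1)

model = AddresseeLSTM()
scores = model(torch.randn(8, 100, 40))  # 8 utterances, 100 frames each
```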
Industry 4.0 (I4.0) aims to enable intelligent production by connecting and evaluating data. The asset administration shell, the Industry 4.0 specification of a digital twin, describes various concepts to realize this data exchange. One part of the asset administration shell is the I4.0-language, which intends to standardize complex interactions be...
The European Union (EU) General Data Protection Regulation (GDPR) has a direct impact on research activities, as it raises the awareness of personal rights not only among scientists but also among the data subjects whose information scientists process. This paper presents the dilemma related to the privacy of audio and video data, compliance wi...
Human-machine addressee detection (H-M AD) is a modern paralinguistics and dialogue challenge that arises in multiparty conversations between several people and a spoken dialogue system (SDS) since the users may also talk to each other and even to themselves while interacting with the system. The SDS is supposed to determine whether it is being add...
Analyses of human interaction are essential to study social interaction, conversational rules, and affective signals. These analyses are also used to improve models for human-machine interaction. Besides the pure acoustic signal and its transcripts, the use of contextual information is essential. Since the enforcement of the GDPR in the EU in 2018, t...
In interactions with speech-based dialog systems, users tend to adapt their speech behavior to their technical counterpart, taking into account the abilities and characteristics they ascribe to the system. Hence, it can be supposed that different systems may evoke different speech behavior according to the users' evaluation of the system. In order to...
This study examines how the presence of other speakers affects the interaction with a spoken dialogue system. We analyze participants’ speech regarding several phonetic features, viz., fundamental frequency, intensity, and articulation rate, in two conditions: with and without additional speech input from a human confederate as a third interlocutor...
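Extracting two of the named features, fundamental frequency and intensity, could look like the librosa-based sketch below. The tooling is an assumption (the study does not name it), and articulation rate is omitted because it additionally requires a syllable count from a transcript or detector:

```python
import librosa
import numpy as np

def prosodic_profile(wav_path: str) -> dict:
    """Per-utterance summary of F0 and intensity (illustrative)."""
    y, sr = librosa.load(wav_path, sr=None)
    # Frame-wise F0 via the pYIN tracker; unvoiced frames are NaN.
    f0, voiced, _ = librosa.pyin(
        y, fmin=librosa.note_to_hz("C2"),
        fmax=librosa.note_to_hz("C7"), sr=sr,
    )
    # Root-mean-square energy as a simple intensity proxy.
    rms = librosa.feature.rms(y=y)[0]
    return {
        "median_f0_hz": float(np.nanmedian(f0[voiced])),
        "mean_intensity_db": float(np.mean(librosa.amplitude_to_db(rms))),
    }
```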
Contemporary technical devices obey the paradigm of naturalistic multimodal interaction and user-centric individualisation. Users expect devices to interact intelligently, to anticipate their needs, and to adapt to their behaviour. To do so, companion-like solutions have to take into account the affective and dispositional state of the user, and th...
Common applications of an unmanned aerial vehicle (UAV, aerial drone) utilize the capabilities of mobile image or video capturing, whereas our article deals with acoustic-related scenarios. Especially for surveillance tasks, e.g. in disaster management or the measurement of artificial environmental noise in large industrial areas, a UAV-based acoustic...
Today, multiple solutions are implemented to detect whether a system should react to an uttered speech command. Common solutions are push-to-talk and activation words. But both are disadvantageous, as their interaction initiation is quite unnatural. Furthermore, relying on an activation word is error-prone, especially when the activation word has been sa...
A new dataset, the Restaurant Booking Corpus (RBC), is introduced, comprising 90 telephone dialogs of 30 German-speaking students (10 male, 20 female) interacting either with one of two different technical dialogue systems or with a human conversational partner. The aim of the participants was to reserve a table at each of three different restaur...
This paper presents a study that examines the difference in certain phonetic features between human-directed speech (HDS) and device-directed speech (DDS) in human-human-computer interactions. The corpus used consists of tasks in which participants interact with a confederate and with a computer, and it is used for the analyses. This includes distributiona...
Usually, compression methods are avoided for emotion recognition problems, as it is feared that compression degrades the acoustic characteristics needed for an accurate recognition. By contrast, we assume that the psychoacoustic modeling used for transparent music compression could actually improve speech-based emotion recognition, as it removes ce...
Emotion recognition from speech receives ever-growing attention, since the systems around us aim to enable natural communication. One important question still remains unresolved: the definition of the most suitable features across different data types. In the present paper, we employ a random-forest-based feature selection known from other rese...
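A generic version of such random-forest feature selection with scikit-learn; the forest size and the cut-off are illustrative values, not those from the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def top_features(X, y, feature_names, keep=10):
    """Rank features by random-forest importance, keep the best ones."""
    forest = RandomForestClassifier(n_estimators=500, random_state=0)
    forest.fit(X, y)                       # X: samples x features, y: labels
    order = np.argsort(forest.feature_importances_)[::-1]
    return [feature_names[i] for i in order[:keep]]
```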
Empathic vehicles are a promising concept to increase the safety and acceptance of automated vehicles. However, on the way towards empathic vehicles a lot of research in the area of automated emotion recognition is necessary. Successful methods to detect emotions need to be trained on realistic data that contain the target emotion and come from a s...
Certain emotions can have a negative effect on the driver's capability of safely operating the vehicle and can ultimately lead to accidents. Therefore, it would be beneficial if the vehicle was able to detect the emotional state of the driver and provide appropriate assistance to mitigate these effects. This study investigates the influence of in-c...
Today, in technical dialog systems, diverse solutions are implemented to detect whether a system should react to an uttered speech command. Typically used solutions are push-to-talk and keywords. Unfortunately, these solutions constitute an unnatural interaction to overcome the problem that the system is not able to detect when it is addressed. Moreover,...
In emotion recognition from speech, huge amounts of training material are needed for the development of classification engines. As most current corpora do not supply enough material, a combination of different datasets is advisable. Unfortunately, data recording is done differently and various emotion elicitation and emotion annotation methods are...
During system interaction, the user’s emotions and intentions shall be adequately determined and predicted to recognize tendencies in his or her interests and dispositions. This allows for the design of an evolving search user interface (ESUI) which adapts to changes in the user’s emotional reaction and the users’ needs and claims.
We demonstrate a successful multimodal dynamic human-computer interaction (HCI) in which the system adapts to the current situation and the user’s state is provided using the scenario of purchasing a train ticket. This scenario demonstrates that Companion Systems are facing the challenge of analyzing and interpreting explicit and implicit observati...
In general, humans interact with each other using multiple modalities. The main channels are speech, facial expressions, and gesture. But also bio-physiological data such as biopotentials can convey valuable information which can be used to interpret the communication in a dedicated way. A Companion-System can use these modalities to perform an eff...
The LAST MINUTE Corpus (LMC) is one of the rare examples of a corpus with naturalistic human-computer interactions. It offers richly annotated data from N_total = 130 experiments in a number of modalities. In this paper we present results from various investigations with data from the LMC using several primary modalities, e.g. transcripts, audio, qu...
Spoken language is one of the main interaction patterns in human-human as well as in natural, companion-like human-machine interactions. Speech conveys content, but also emotions and interaction patterns determining the nature and quality of the user’s relationship to his counterpart. Hence, we consider emotion recognition from speech in the wider...
The recognition performance of a classifier is affected by various aspects. A huge influence comes from the input data pre-processing. In the current paper we analysed the relation between different normalisation methods for emotionally coloured speech samples, deriving general trends to be considered during data pre-processing. To the best of ou...
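Two of the usual normalisation variants, corpus-level versus speaker-level z-score standardisation, can be sketched as follows; this illustrates the kind of methods compared, not the paper's exact set:

```python
import numpy as np

def zscore(X: np.ndarray) -> np.ndarray:
    """Column-wise standardisation to zero mean and unit variance."""
    std = X.std(axis=0)
    return (X - X.mean(axis=0)) / np.where(std > 0, std, 1.0)

def normalise(X: np.ndarray, spk: np.ndarray, per_speaker: bool = False):
    """Normalise features with corpus-level or speaker-level statistics."""
    if not per_speaker:
        return zscore(X)                       # one statistic for the corpus
    Xn = np.empty_like(X, dtype=float)
    for s in np.unique(spk):
        Xn[spk == s] = zscore(X[spk == s])     # statistics per speaker
    return Xn
```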
User satisfaction is an important aspect of human-computer interaction (HCI) – if a user is not satisfied, he or she might not be willing to use such a system. Therefore, it is crucial to HCI applications to be able to recognise the user satisfaction level in order to react in an appropriate way. For such recognition tasks, data-driven methods have...
Speech and audio codecs are implemented in a variety of multimedia applications, and multichannel sound is offered by the first streaming or cloud-based services. Besides the objective of perceptual quality, coding-related research focuses on low bitrate and minimal latency. The IETF-standardized Opus codec provides high perceptual quality, low lat...
Most technical communication systems use speech compression codecs to save transmission bandwidth. Much development has been done to guarantee high speech intelligibility, resulting in different compression techniques: analysis-by-synthesis, psychoacoustic modeling, and a hybrid mode of both. Our first assumption is that the hybrid mode improves the...
One objective of affective computing is the automatic processing of human emotions. Considering human speech, filled pauses are one of the cues giving insight into the emotional state of a human being. Filled pauses are short speech events without a specified semantic meaning, but they have a variety of communicative and affective functions. The de...
For emotional analyses of interactions, a qualitatively high transcription and annotation of the given material is important. The textual transcription can be conducted with several available tools, e.g. Folker or ANVIL. But tools for the annotation of emotions are quite rare. Furthermore, existing tools only allow selecting an emotion term from a l...
Enabling a natural (human-like) spoken conversation with technical systems requires affective information, contained in spoken language, to be intelligibly transmitted. This study investigates the role of speech and music codecs for affect intelligibility. Affective speech from the well-known EMO-DB corpus was encoded and decoded. U...