Tanja Schultz

University of Bremen · Computer Science

Professor

About

605 Publications · 117,591 Reads · 15,509 Citations
Introduction
Since 2007, I have directed the Cognitive Systems Lab, where our research includes multilingual speech processing and the processing, recognition, and interpretation of biosignals for human-centered technologies and applications. In particular, we apply machine learning methods to process and interpret a variety of speech-related activities, such as muscle and brain activity, with the goal of creating biosignal-based speech processing devices for communication applications in everyday situations and of gaining a deeper understanding of spoken communication. Find out more at http://csl.uni-bremen.de/
Additional affiliations
April 2015 - present
University of Bremen
Position
  • Professor (Full)
March 2007 - April 2015
Karlsruhe Institute of Technology
Position
  • Professor (Full)
August 2000 - May 2007
Carnegie Mellon University
Position
  • Research Scientist, Research Assistant/Associate Professor
Education
May 1995 - June 2000
Karlsruhe Institute of Technology
Field of study
  • Informatics
October 1989 - April 1995
Karlsruhe Institute of Technology
Field of study
  • Informatics
October 1983 - November 1989
Heidelberg University
Field of study
  • Mathematics, Sports, Educational Science

Publications (605)
Book
Full-text available
This edition brings together the hard work of 50 authors from 14 countries across three continents: Oceania (Australia), Europe (France, Greece, Italy, Austria, Germany, Portugal, Slovenia, Sweden, and the UK), and Asia (China, Pakistan, Saudi Arabia, and Thailand). From hardware to software, from pipelines to applications, from handcrafted featu...
Preprint
The amount of articulatory data available for training deep learning models is far smaller than that of acoustic speech data. To improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imagi...
Conference Paper
Full-text available
Silent moments in video meetings are a ubiquitous phenomenon in collaboration. While silence itself is not inherently detrimental, the discomfort it can induce has the potential to significantly affect meeting performance and the overall well-being of the meeting participants. The understanding of silence's impact, particular...
Preprint
Full-text available
Federated Learning (FL) is a privacy-preserving approach that allows servers to aggregate distributed models transmitted from local clients rather than training on user data. More recently, FL has been applied to Speech Emotion Recognition (SER) for secure human-computer interaction applications. Recent research has found that FL is still vulnerabl...
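The aggregation step described above can be sketched as Federated Averaging (FedAvg), in which the server combines client parameters weighted by local dataset size rather than touching raw user data. This is a minimal illustrative sketch, not the paper's method; the flat-list parameter format and the client sizes are assumptions.

```python
# Minimal sketch of Federated Averaging (FedAvg): the server aggregates
# client model parameters instead of raw user data. The flat-list
# parameter format and the example values are illustrative assumptions.

def fed_avg(client_weights, client_sizes):
    """Weighted average of client parameter vectors by local dataset size."""
    total = sum(client_sizes)
    n_params = len(client_weights[0])
    aggregated = [0.0] * n_params
    for weights, size in zip(client_weights, client_sizes):
        for i, w in enumerate(weights):
            aggregated[i] += w * (size / total)
    return aggregated

# Two clients with different amounts of local speech data:
server_model = fed_avg([[1.0, 2.0], [3.0, 4.0]], client_sizes=[1, 3])
# weighted average: [1*0.25 + 3*0.75, 2*0.25 + 4*0.75] = [2.5, 3.5]
```

Weighting by dataset size keeps clients with more local speech data from being drowned out by small ones, which matters when emotional speech corpora are unevenly distributed across devices.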
Conference Paper
Full-text available
Many studies based on symbolic music signals require retaining only melody tracks or accompaniment tracks from musical instrument digital interface (MIDI) files. However, this seemingly simple setting often becomes a stumbling block in the first step because the MIDI format does not have any mandatory regulations for the track numbers of melody/acc...
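Because MIDI reserves no track number for the melody, a common workaround is to score each track with a heuristic and pick the best. The sketch below uses mean note pitch (melody lines tend to sit above the accompaniment); the track structure and example song are illustrative assumptions, not the paper's approach.

```python
# Heuristic sketch for picking a likely melody track from parsed MIDI data.
# Score = mean MIDI pitch per track; the (track_name, [midi_pitches])
# structure and the example song are illustrative assumptions.

def pick_melody_track(tracks):
    """Return the name of the track with the highest mean MIDI pitch."""
    def mean_pitch(notes):
        return sum(notes) / len(notes) if notes else float("-inf")
    return max(tracks, key=lambda t: mean_pitch(t[1]))[0]

song = [
    ("bass",   [36, 38, 41, 36]),
    ("chords", [48, 52, 55, 48]),
    ("lead",   [72, 74, 76, 79]),
]
print(pick_melody_track(song))  # -> lead
```

Real systems combine several such cues (pitch range, note density, polyphony), since a single heuristic fails on songs where the accompaniment crosses above the melody.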
Preprint
Full-text available
Speech is a rich biomarker that encodes substantial information about the health of a speaker, and thus it has been proposed for the detection of numerous diseases, achieving promising results. However, questions remain about what the models trained for the automatic detection of these diseases are actually learning and the basis for their predicti...
Preprint
Studies of auditory attention have revealed a robust correlation between attended speech and the elicited neural responses, which are measurable through electroencephalography (EEG). It is therefore possible to use the attention information within EEG signals to guide the extraction of the target speaker in a cocktail pa...
Article
Full-text available
We introduce the concept of LabLinking: a technology-based interconnection of experimental laboratories across institutions, disciplines, cultures, languages, and time zones - in other words human studies and experiments without borders. In particular, we introduce a theoretical framework of LabLinking, describing multiple dimensions of conceptual,...
Article
Full-text available
The Special Issue Sensors for Human Activity Recognition has received a total of 30 submissions so far, and from these, this new edition will publish 10 academic articles [...]
Preprint
Visual Grounding (VG) in VQA refers to a model's proclivity to infer answers based on question-relevant image regions. Conceptually, VG is an axiomatic requirement of the VQA task. In practice, however, DNN-based VQA models are notorious for bypassing VG by way of shortcut (SC) learning without suffering obvious performance losses in sta...
Preprint
Full-text available
Speech emotion recognition (SER) plays a crucial role in human-computer interaction. The emergence of edge devices in the Internet of Things (IoT) presents challenges in constructing intricate deep learning models due to constraints in memory and computational resources. Moreover, emotional speech data often contains private information, raising co...
Conference Paper
Full-text available
Electromyography-to-Speech (ETS) conversion has demonstrated its potential for silent speech interfaces by generating audible speech from Electromyography (EMG) signals during silent articulations. ETS models usually consist of an EMG encoder which converts EMG signals to acoustic speech features, and a vocoder which then synthesises the speech sig...
Article
In this paper, we present the results of experiments on multilingual acoustic modeling for an Automatic Speech Recognition (ASR) system using speech data of phonetically closely related Ethiopian languages (Amharic, Tigrigna, Oromo, and Wolaytta) with multilingual (ML) mix and multitask approaches. The use of speech data fro...
Article
Full-text available
This article investigates electrocardiogram (ECG) acquisition artifacts often occurring in experiments due to human negligence or environmental influences, such as electrode detachment, misuse of electrodes, and unanticipated magnetic field interference, which are not easily noticeable by humans or software during acquisition. Such artifacts usuall...
Conference Paper
Full-text available
Facial expressions play a crucial role in non-verbal and visual communication, often observed in everyday life. The facial action coding system (FACS) is a prominent framework for categorizing facial expressions as action units (AUs), which reflect the activity of facial muscles. This paper presents a proof-of-concept study for upper face action un...
Article
Full-text available
As an essential task in data mining, outlier detection identifies abnormal patterns in numerous applications, among which clustering-based outlier detection is one of the most popular methods for its effectiveness in detecting cluster-related outliers, especially in medical applications. This article presents an advanced method to extract cluster-b...
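Clustering-based outlier detection of the kind described above typically scores each point by its distance to the nearest cluster centroid. This is a generic sketch of that idea, not the article's method; the centroids are assumed to come from a prior clustering step and are given directly here for illustration.

```python
# Sketch of clustering-based outlier scoring: points far from every
# cluster centroid receive high scores. The 2-D points and centroids
# are illustrative assumptions.
import math

def outlier_scores(points, centroids):
    """Score each point by its distance to the nearest centroid."""
    def dist(a, b):
        return math.hypot(a[0] - b[0], a[1] - b[1])
    return [min(dist(p, c) for c in centroids) for p in points]

points = [(0.1, 0.0), (0.0, 0.2), (5.0, 5.1), (9.0, 9.0)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
scores = outlier_scores(points, centroids)
# (9.0, 9.0) lies far from both centroids -> largest score
```

Thresholding these scores (e.g., at a high percentile) then separates cluster members from outliers, which is why such methods struggle when the threshold requires prior knowledge of the outlier fraction.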
Article
Background: The use of mobile devices to continuously monitor objectively extracted parameters of depressive symptomatology is seen as an important step in the understanding and prevention of upcoming depressive episodes. Speech features such as pitch variability, speech pauses, and speech rate are promising indicators, but empirical evidence is lim...
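Two of the speech markers named above, pause behavior and speech rate, can be sketched from a frame-level voiced/unvoiced sequence. This is an illustrative toy computation, not the study's pipeline; the 10 ms frame length, the voicing sequence, and the syllable count are assumptions.

```python
# Sketch of two mobile-sensing speech markers: speech-pause ratio and
# speech rate from a frame-level voiced/unvoiced sequence. The 10 ms
# frame length and the toy data are illustrative assumptions.

def pause_ratio(voiced):
    """Fraction of the recording spent in unvoiced (pause) frames."""
    return voiced.count(0) / len(voiced)

def speech_rate(n_syllables, voiced, frame_ms=10):
    """Syllables per second over the whole recording."""
    duration_s = len(voiced) * frame_ms / 1000.0
    return n_syllables / duration_s

frames = [1, 1, 1, 0, 0, 1, 1, 0, 1, 1]  # 1 = voiced frame, 0 = pause
print(pause_ratio(frames))               # 3 pause frames / 10 -> 0.3
print(round(speech_rate(4, frames), 3))  # 4 syllables in 0.1 s -> 40.0
```

In practice the voicing sequence would come from an energy or pitch detector, and features would be aggregated over much longer recordings before being related to depressive symptomatology.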
Conference Paper
Full-text available
Motivational dynamics in jogging constitute a pivotal factor influencing a runner's performance, persistence, and overall engagement in the running activity. The manifestation of diminished motivation is concomitant with a cascade of physiological responses, capable of being represented through biological signals, for which biosignal monitoring, a...
Article
Full-text available
Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the brain's elicited neuronal activity. In this work, we...
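A common baseline for exploiting this correlation is envelope-based attention decoding: an envelope reconstructed from EEG is correlated with each candidate speaker's speech envelope, and the speaker with the higher correlation is taken as attended. The sketch below illustrates only this decision rule; the toy envelopes are assumptions, not data from the work.

```python
# Sketch of envelope-correlation auditory attention decoding. The toy
# EEG-reconstructed envelope and speaker envelopes are illustrative.

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

def attended_speaker(eeg_envelope, speaker_envelopes):
    """Index of the speech envelope most correlated with the EEG envelope."""
    scores = [pearson(eeg_envelope, s) for s in speaker_envelopes]
    return scores.index(max(scores))

eeg = [0.1, 0.9, 0.2, 0.8, 0.3]
spk = [[0.0, 1.0, 0.1, 0.9, 0.2],   # speaker 0: tracks the EEG envelope
       [0.9, 0.1, 0.8, 0.2, 0.7]]   # speaker 1: anti-correlated
print(attended_speaker(eeg, spk))   # -> 0
```

The decoded index can then steer a source-separation front end toward the attended speaker, which is the core idea behind neuro-steered hearing aids.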
Article
Speech contains rich information about human emotions, and Speech Emotion Recognition (SER) has been an important topic in the area of human-computer interaction. The robustness of SER models is crucial, particularly in privacy-sensitive and reliability-demanding domains like private healthcare. Recently, the vulnerability of deep neural netwo...
Article
Full-text available
Speech is a rich biomarker that encodes substantial information about the health of a speaker, and thus it has been proposed for the detection of numerous diseases, achieving promising results. However, questions remain about what the models trained for the automatic detection of these diseases are actually learning and the basis for their predicti...
Conference Paper
Full-text available
Perceiving and producing speech and audio signals are the basic ways for humans to communicate with each other and know about the world. Benefiting from the advancement of Big Data, signal processing, and Artificial Intelligence (AI), intelligent machines have been rapidly developed to process speech and audio signals for assisting human life. Deep...
Article
Full-text available
As an important technique for data pre-processing, outlier detection plays a crucial role in various real applications and has gained substantial attention, especially in medical fields. Despite the importance of outlier detection, many existing methods are vulnerable to the distribution of outliers and require prior knowledge, such as the outlier...
Conference Paper
Electroencephalography (EEG) related research faces a significant challenge of subject independence due to the variation in brain signals and responses among individuals. While deep learning models hold promise in addressing this challenge, their effectiveness depends on large datasets for training and generalization across participants. To overcom...
Article
Full-text available
This paper presents a perspective on Biosignal-Adaptive Systems (BAS) which automatically adapt to user needs by continuously interpreting their biosignals and by providing transparent feedback, thereby keeping the user in the loop. The major hallmark of the described BAS is the low latency with which biosignals are processed, interpreted, and appl...
Preprint
Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the corresponding brain's elicited neuronal activities, which the latter can...
Chapter
Full-text available
High-Level Features (HLF) are a novel way of describing and processing human activities. Each feature captures an interpretable aspect of activities, and a unique combination of HLFs defines an activity. In this article, we propose and evaluate a concise set of six HLFs on and across the CSL-SHARE and UniMiB SHAR datasets, showing that HLFs can be...
Article
Full-text available
Objective: Despite recent advances, the decoding of auditory attention from brain signals remains a challenge. A key solution is the extraction of discriminative features from high-dimensional data, such as multi-channel electroencephalography (EEG). However, to our knowledge, topological relationships between individual channels have not yet been...
Article
Background: Recent studies identified linguistic changes as a promising marker of early Alzheimer's disease (AD) and mild cognitive impairment (MCI). Perplexity – i.e., a measure of the complexity and therefore the predictability of speech content – is considered a predictive marker of AD and MCI in otherwise healthy elderly (Frankenberg et al., 2019). A...
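The perplexity measure mentioned above is the exponential of the average negative log-probability a language model assigns to a transcript: predictable speech yields low perplexity, surprising speech high. This is a generic sketch of the measure, not the study's model; the toy per-token probabilities are assumptions.

```python
# Sketch of perplexity over a token sequence, given the probability a
# language model assigns to each token. The toy probabilities below
# are illustrative assumptions.
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log token probability."""
    n = len(token_probs)
    return math.exp(-sum(math.log(p) for p in token_probs) / n)

predictable = [0.5, 0.5, 0.5, 0.5]     # model finds the speech predictable
surprising  = [0.05, 0.1, 0.05, 0.1]   # model finds the speech surprising
print(round(perplexity(predictable), 6))                  # -> 2.0
print(perplexity(surprising) > perplexity(predictable))   # -> True
```

In the clinical setting sketched in the abstract, higher perplexity of a patient's speech under a model trained on healthy speech would signal atypical, less predictable language use.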
Book
Full-text available
Human activity recognition (HAR) and human behavior recognition (HBR) play increasingly important roles in the digital age. High-quality sensory observations applicable to recognizing users’ activities and behaviors, including electrical, magnetic, mechanical (kinetic), optical, acoustic, thermal, and chemical biosignals, are inseparable from senso...
Preprint
Metrics for Visual Grounding (VG) in Visual Question Answering (VQA) systems primarily aim to measure a system's reliance on relevant parts of the image when inferring an answer to the given question. Lack of VG has been a common problem among state-of-the-art VQA systems and can manifest in over-reliance on irrelevant image parts or a disregard fo...
Preprint
Background: The use of mobile devices to continuously monitor objectively extracted parameters of depressive symptomatology is seen as an important step in the understanding and prevention of upcoming depressive episodes. Speech features such as pitch variability, speech pauses, and speech rate are promising indicators, but empirical evidence is lim...
Article
Full-text available
In this paper, we investigate the effect of distractions and hesitations as a scaffolding strategy. Recent research points to the potential beneficial effects of a speaker’s hesitations on the listeners’ comprehension of utterances, although results from studies on this issue indicate that humans do not make strategic use of them. The role of hesit...
Article
Recent studies have demonstrated that it is possible to decode and synthesize various aspects of acoustic speech directly from intracranial measurements of electrophysiological brain activity. In order to continue progressing toward the development of a practical speech neuroprosthesis for individuals with speech impairments, better understandi...
Article
Full-text available
Abstract: Artificial intelligence (AI) is becoming increasingly important in healthcare as well. This development raises serious concerns, which can be summarized in six major "worst-case scenarios". Ranging from an AI-based spread of disinformation and propaganda to a possible military race...
Conference Paper
Full-text available
Many human activity recognition (HAR) systems ultimately target real-time application scenarios, yet most of the literature has limited HAR studies to offline models. Some works mention real-time or online applications, but an investigation into implementing and evaluating a real-time HAR system has been missing. With our years of experience developing and d...
Article
Full-text available
For the development of neuro-steered hearing aids, it is important to study the relationship between a speech stimulus and the elicited EEG response of a human listener. The recent Auditory EEG Decoding Challenge 2023 (Signal Processing Grand Challenge, IEEE International Conference on Acoustics, Speech and Signal Processing) dealt with this relati...
Article
Full-text available
Human activity recognition (HAR) and human behavior recognition (HBR) have been playing increasingly important roles in the digital age. High-quality sensory observations applicable to recognizing users’ activities and behaviors, including electrical, magnetic, mechanical (kinetic), optical, acoustic, thermal, and chemical biosignals, are inseparab...
Article
Full-text available
Biosignal-based technology has been increasingly available in our daily life, being a critical information source. Wearable biosensors have been widely applied in, among others, biometrics, sports, health care, rehabilitation assistance, and edutainment. Continuous data collection from biodevices provides a valuable volume of information, which nee...
Preprint
Visual Grounding (VG) in Visual Question Answering (VQA) systems describes how well a system manages to tie a question and its answer to relevant image regions. Systems with strong VG are considered intuitively interpretable and suggest an improved scene understanding. While VQA accuracy performances have seen impressive gains over the past few yea...
Article
Full-text available
We present three user studies that gradually prepare our prototype system SmartHelm for use in the field, i.e., supporting cargo cyclists on public roads for cargo delivery. SmartHelm is an attention-sensitive smart helmet that integrates non-invasive brain and eye activity detection with hands-free Augmented Reality (AR) components in a speech-ena...
Article
Full-text available
Numerous state-of-the-art solutions for neural speech decoding and synthesis incorporate deep learning into the processing pipeline. These models are typically opaque and can require significant computational resources for training and execution. A deep learning architecture is presented that learns input bandpass filters that capture task-relevant...
Article
Full-text available
As an essential subset of Chinese music, traditional Chinese folk songs frequently apply the anhemitonic pentatonic scale. In music education and demonstration, the Chinese anhemitonic pentatonic mode is usually introduced theoretically, supplemented by music appreciation, and a non-Chinese-speaking audience often lacks a perceptual understanding....
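The anhemitonic pentatonic scale described above uses five pitch classes with no semitone steps (e.g., C D E G A = {0, 2, 4, 7, 9}). A simple membership check under all twelve transpositions can be sketched as follows; the example melody is an illustrative assumption, not material from the paper.

```python
# Sketch: does a melody fit some transposition of the anhemitonic
# pentatonic scale {0, 2, 4, 7, 9} (C D E G A)? The example melody
# is an illustrative assumption.

PENTATONIC = {0, 2, 4, 7, 9}

def is_anhemitonic_pentatonic(midi_notes):
    """True if all notes fit one transposition of the anhemitonic scale."""
    pitch_classes = {n % 12 for n in midi_notes}
    return any(
        all((pc - shift) % 12 in PENTATONIC for pc in pitch_classes)
        for shift in range(12)
    )

melody = [64, 64, 67, 69, 72, 72, 69, 67, 67, 69, 67]  # E G A C phrase
print(is_anhemitonic_pentatonic(melody))       # -> True
print(is_anhemitonic_pentatonic([60, 61, 62])) # chromatic run -> False
```

Because the scale contains no adjacent pitch classes, any melody with a semitone step is rejected for every transposition, which makes the check a quick filter for folk-song corpora.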
Preprint
Recent studies have demonstrated that it is possible to decode and synthesize various aspects of acoustic speech directly from intracranial measurements of electrophysiological brain activity. In order to continue progressing toward the development of a practical speech neuroprosthesis for individuals with speech impairments, better understandi...
Article
Full-text available
Task‐specificity in isolated focal dystonias is a powerful feature that may successfully be targeted with therapeutic brain–computer interfaces. While performing a symptomatic task, the patient actively modulates momentary brain activity (disorder signature) to match activity during an asymptomatic task (target signature), which is expected to tran...
Conference Paper
Recent studies have shown it is possible to decode and synthesize speech directly using brain activity recorded from implanted electrodes. While this activity has been extensively examined using electrocorticographic (ECoG) recordings from cortical surface grey matter, stereotactic electroencephalography (sEEG) provides comparatively broader cover...