Korbinian Riedhammer

Korbinian Riedhammer
Georg Simon Ohm University of Applied Sciences Nuremberg | OHM · Department of Computer Science

Prof. Dr.-Ing.

About

107
Publications
13,706
Reads
How we measure 'reads'
A 'read' is counted each time someone views a publication summary (such as the title, abstract, and list of authors), clicks on a figure, or views or downloads the full-text. Learn more
1,699
Citations

Publications

Publications (107)
Preprint
We present an approach to Audio-Visual Speech Recognition that builds on a pre-trained Whisper model. To infuse visual information into this audio-only model, we extend it with an AV fusion module and LoRa adapters, one of the most up-to-date adapter approaches. One advantage of adapter-based approaches, is that only a relatively small number of pa...
Preprint
Self-supervised learning has been successfully used for various speech related tasks, including automatic speech recognition. BERT-based Speech pre-Training with Random-projection Quantizer (BEST-RQ) has achieved state-of-the-art results in speech recognition. In this work, we further optimize the BEST-RQ approach using Kullback-Leibler divergence...
Preprint
Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia base...
Preprint
With the advancement of multimedia technologies, news documents and user-generated content are often represented as multiple modalities, making Multimedia Event Extraction (MEE) an increasingly important challenge. However, recent MEE methods employ weak alignment strategies and data augmentation with simple classification models, which ignore the...
Preprint
In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matr...
Preprint
This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we de...
Preprint
Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of no...
Article
Automatic summarization of natural and human-made disaster events is an important area to increase situational awareness for human response organizations and disaster management. However, the incorporation of multiple data sources poses a challenge to current summarization systems, as the typically large document collections exceed the input limits...
Preprint
Full-text available
A Note to the Reader This report largely written back in Spring 2021, but unfortunately never submitted to peer review. In the meantime, two extensive reviews of similar nature have been published by Civit et al. (peer reviewed) and Zhao et al. (arxiv). We believe that this manuscript still adds value to the scientific community, since it focuses o...
Conference Paper
We present a learning experience platform that uses machine learning methods to support students and lecturers in self-motivated online learning and teaching processes. The platform is being developed as an agile open-source collaborative project supported by multiple universities and partners. The development is guided didactically, reviewed, and...
Article
Full-text available
Zusammenfassung Im BMBF-Verbundprojekt HAnS entwickeln und implementieren neun Hochschulen sowie drei hochschulübergreifende Einrichtungen ein intelligentes Hochschul-Assistenz-System als Open-Source-Lösung. Videobasierte Lehrmaterialien werden verschriftlicht und durch eine Indexierung Stichwortsuchen ermöglicht; geplant ist, über einen KI-Tutor a...
Article
Speech can contain a variety of diagnostically relevant cues. In this article, it is shown how artificial intelligence, in particular machine learning and speech processing, can be applied to speech signals: to assess intelligibility, to automate standardized tests, and to determine medical scales and diagnoses. We conclude with critical review of...
Preprint
Automated dementia screening enables early detection and intervention, reducing costs to healthcare systems and increasing quality of life for those affected. Depression has shared symptoms with dementia, adding complexity to diagnoses. The research focus so far has been on binary classification of dementia (DEM) and healthy controls (HC) using spe...
Conference Paper
Full-text available
The current shift from in-person to online education, e.g., through video lectures, requires novel techniques for quickly searching for and navigating through media content. At this point, an automatic segmentation of the videos into thematically coherent units can be beneficial. Like in a book, the topics in an educational video are often structur...
Presentation
Mit der Förderinitiative „Künstliche Intelligenz in der Hochschulbildung“ unterstützen Bund und Länder Künstliche Intelligenz an Hochschulen. Das Verbundprojekt „Hochschul-Assistenz-System“, kurz HAnS, mit acht weiteren Hochschulen unter Koordination der Ohm erhält dabei eine Fördersumme von fünf Millionen Euro.
Preprint
Most stuttering detection and classification research has viewed stuttering as a multi-class classification problem or a binary detection task for each dysfluency type; however, this does not match the nature of stuttering, in which one dysfluency seldom comes alone but rather co-occurs with others. This paper explores multi-language and cross-corp...
Preprint
The CrisisFACTS Track aims to tackle challenges such as multi-stream fact-finding in the domain of event tracking; participants' systems extract important facts from several disaster-related events while incorporating the temporal order. We propose a combination of retrieval, reranking, and the well-known Integer Linear Programming (ILP) and Maxima...
Article
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that n...
Preprint
Social media has become an important information source for crisis management and provides quick access to ongoing developments and critical information. However, classification models suffer from event-related biases and highly imbalanced label distributions which still poses a challenging task. To address these challenges, we propose a combinatio...
Preprint
We analyze the impact of speaker adaptation in end-to-end architectures based on transformers and wav2vec 2.0 under different noise conditions. We demonstrate that the proven method of concatenating speaker vectors to the acoustic features and supplying them as an auxiliary model input remains a viable option to increase the robustness of end-to-en...
Preprint
Specially adapted speech recognition models are necessary to handle stuttered speech. For these to be used in a targeted manner, stuttered speech must be reliably detected. Recent works have treated stuttering as a multi-class classification problem or viewed detecting each dysfluency type as an isolated task; that does not capture the nature of st...
Preprint
Current findings show that pre-trained wav2vec 2.0 models can be successfully used as feature extractors to discriminate on speaker-based tasks. We demonstrate that latent representations extracted at different layers of a pre-trained wav2vec 2.0 system can be effectively used for binary classification of various types of pathologic speech. We exam...
Preprint
The detection of pathologies from speech features is usually defined as a binary classification task with one class representing a specific pathology and the other class representing healthy speech. In this work, we train neural networks, large margin classifiers, and tree boosting machines to distinguish between four different pathologies: Parkins...
Conference Paper
For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report a study consisting of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SKT and...
Chapter
This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to...
Chapter
Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological te...
Preprint
Full-text available
We are interested in the problem of conversational analysis and its application to the health domain. Cognitive Behavioral Therapy is a structured approach in psychotherapy, allowing the therapist to help the patient to identify and modify the malicious thoughts, behavior, or actions. This cooperative effort can be evaluated using the Working Allia...
Preprint
For dementia screening and monitoring, standardized tests play a key role in clinical routine since they aim at minimizing subjectivity by measuring performance on a variety of cognitive tasks. In this paper, we report on a study that consists of a semi-standardized history taking followed by two standardized neuropsychological tests, namely the SK...
Preprint
The "Switchboard benchmark" is a very well-known test set in automatic speech recognition (ASR) research, establishing record-setting performance for systems that claim human-level transcription accuracy. This work highlights lesser-known practical considerations of this evaluation, demonstrating major improvements in word error rate (WER) by corre...
Preprint
Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological te...
Preprint
This paper empirically investigates the influence of different data splits and splitting strategies on the performance of dysfluency detection systems. For this, we perform experiments using wav2vec 2.0 models with a classification head as well as support vector machines (SVM) in conjunction with the features extracted from the wav2vec 2.0 model to...
Preprint
Full-text available
The ACM Multimedia 2022 Computational Paralinguistics Challenge addresses four different problems for the first time in a research competition under well-defined conditions: In the Vocalisations and Stuttering Sub-Challenges, a classification on human non-verbal vocalisations and speech has to be made; the Activity Sub-Challenge aims at beyond-audi...
Preprint
Vocal fatigue refers to the feeling of tiredness and weakness of voice due to extended utilization. This paper investigates the effectiveness of neural embeddings for the detection of vocal fatigue. We compare x-vectors, ECAPA-TDNN, and wav2vec 2.0 embeddings on a corpus of academic spoken English. Low-dimensional mappings of the data reveal that n...
Preprint
Stuttering is a varied speech disorder that harms an individual's communication ability. Persons who stutter (PWS) often use speech therapy to cope with their condition. Improving speech recognition systems for people with such non-typical speech or tracking the effectiveness of speech therapy would require systems that can detect dysfluencies whil...
Preprint
Stuttering is a complex speech disorder that negatively affects an individual's ability to communicate effectively. Persons who stutter (PWS) often suffer considerably under the condition and seek help through therapy. Fluency shaping is a therapy approach where PWSs learn to modify their speech to help them to overcome their stutter. Mastering suc...
Preprint
Personal narratives (PN) - spoken or written - are recollections of facts, people, events, and thoughts from one's own experience. Emotion recognition and sentiment analysis tasks are usually defined at the utterance or document level. However, in this work, we focus on Emotion Carriers (EC) defined as the segments (speech or text) that best explai...
Conference Paper
Full-text available
This paper presents an ongoing interdisciplinary research project that deals with free improvisation and human-machine interaction , involving a digital player piano and other musical instruments. Various technical concepts are developed by student participants in the project and continuously evaluated in artistic performances. Our goal is to explo...
Preprint
Full-text available
Stuttering is a complex speech disorder identified by repeti-tions, prolongations of sounds, syllables or words and blockswhile speaking. Specific stuttering behaviour differs strongly,thus needing personalized therapy. Therapy sessions requirea high level of concentration by the therapist. We introduceSTAN, a system to aid speech therapists in stu...
Chapter
Stuttering is a complex speech disorder that can be identified by repetitions, prolongations of sounds, syllables or words and blocks while speaking. Severity assessment is usually done by a speech therapist. While attempts at automated assessment were made, it is rarely used in therapy. Common methods for the assessment of stuttering severity incl...
Preprint
Performing machine learning tasks in mobile applications yields a challenging conflict of interest: highly sensitive client information (e.g., speech data) should remain private while also the intellectual property of service providers (e.g., model parameters) must be protected. Cryptographic techniques offer secure solutions for this, but have an...
Preprint
Stuttering is a complex speech disorder that can be identified by repetitions, prolongations of sounds, syllables or words, and blocks while speaking. Severity assessment is usually done by a speech therapist. While attempts at automated assessment were made, it is rarely used in therapy. Common methods for the assessment of stuttering severity inc...
Preprint
Time series are series of values ordered by time. This kind of data can be found in many real world settings. Classifying time series is a difficult task and an active area of research. This paper investigates the use of transfer learning in Deep Neural Networks and a 2D representation of time series known as Recurrence Plots. In order to utilize t...
Preprint
This paper presents a comparison of a traditional hybrid speech recognition system (kaldi using WFST and TDNN with lattice-free MMI) and a lexicon-free end-to-end (TensorFlow implementation of multi-layer LSTM with CTC training) models for German syllable recognition on the Verbmobil corpus. The results show that explicitly modeling prior knowledge...
Chapter
Time series are series of values ordered by time. This kind of data can be found in many real world settings. Classifying time series is a difficult task and an active area of research. This paper investigates the use of transfer learning in Deep Neural Networks and a 2D representation of time series known as Recurrence Plots. In order to utilize t...
Chapter
This paper presents a comparison of a traditional hybrid speech recognition system (kaldi using WFST and TDNN with lattice-free MMI) and a lexicon-free end-to-end (TensorFlow implementation of multi-layer LSTM with CTC training) models for German syllable recognition on the Verbmobil corpus. The results show that explicitly modeling prior knowledge...
Conference Paper
In this paper we present an algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems. These features give large performance improvements on tonal languages for ASR systems, and even substantial improvements for non-tonal languages. Our method, which we are calling the Kaldi pitch...
Article
In this article, we describe a semi-automatic calibration algorithm for dereverberation by spectral subtraction. We verify the method by a comparison to a manual calibration derived from measured room impulse responses (RIR). We conduct extensive experiments to understand the effect of all involved parameters and to verify values suggested in the l...
Conference Paper
In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search...
Conference Paper
Cleft Lip and Palate (CLP) is among the most frequent congenital abnormalities. The impaired facial development affects the articulation, with different phonemes being impacted inhomogeneously among different patients. This work focuses on automatic phoneme analysis of children with CLP for a detailed diagnosis and therapy control. In clinical rout...
Conference Paper
Full-text available
We describe a state-of-the-art large vocabulary continuous speech recognition (LVCSR) and keyword search (KWS) sys-tem trained on roughly 70 hours of conversational telephone speech. Using the Kaldi speech recognition toolkit, we investi-gate several aspects: for the acoustic front-end, we analyze the use of mel-frequency cepstral coefficients (MFC...
Conference Paper
Full-text available
This paper describes the acquisition, transcription and annotation of a multi-media corpus of academic spoken English, the LMELectures. It consists of two lecture se-ries that were read in the summer term 2009 at the com-puter science department of the University of Erlangen-Nuremberg, covering topics in pattern analysis, machine learning and inter...
Conference Paper
Full-text available
A growing number of universities and other educational institutions provide recordings of lectures and seminars as an additional resource to the students. In contrast to educational films that are scripted, di-rected and often shot by film professionals, these plain recordings are typically not post-processed in an editorial sense. Thus, the videos...
Conference Paper
Full-text available
In earlier studies, we employed a large prosodic feature vec-tor to assess the quality of L2 learner's utterances with respect to sentence melody and rhythm. In this paper, we combine these features with two standard approaches in paralinguistic analysis: (1) features derived from a Gaussian Mixture Model used as Universal Background Model (GMM-UBM...
Conference Paper
Full-text available
Voice scrambling is widely used to add privacy to the radio communication of various authorities – but is also used by criminals to evade prosecution. In this article, we consider various analog voice scrambling techniques such as fixed frequency inver-sion, splitband inversion and rolling code scramblers. We explain how to break them using automat...
Article
Full-text available
The tracheoesophageal (TE) substitute voice is currently state–of–the–art treatment to restore the ability to speak after laryngectomy. The intelligibility while talking over a telephone is an important clinical factor, as it is a crucial part of the patients ’ social life. An objective way to rate the intelligibility of substitute voices when talk...
Article
Full-text available
We describe a lattice generation method that is exact, i.e. it satisfies all the natural properties we would want from a lattice of alterna-tive transcriptions of an utterance. This method does not introduce substantial overhead above one-best decoding. Our method is most directly applicable when using WFST decoders where the WFST is expanded down...
Article
Full-text available
In the past decade, semi-continuous hidden Markov models (SCHMMs) have not attracted much attention in the speech recognition community. Growing amounts of training data and increasing sophistication of model estimation led to the impression that continuous HMMs are the best choice of acoustic model. However, recent work on recognition of under-res...
Conference Paper
Full-text available
A growing number of universities offer recordings of lectures, seminars and talks in an online e-learning portal. However, the user is often not interested in the entire recording, but is looking for parts covering a certain topic. Usually, the user has to either watch the whole video or "zap" through the lecture and risk missing important details....
Conference Paper
Full-text available
This paper focuses on the automatic detection of a person's blood level alcohol based on automatic speech processing approaches. We compare 5 different feature types with different ways of modeling. Experiments are based on the ALC corpus of IS2011 Speaker State Challenge. The classification task is restricted to the detection of a blood alcohol le...
Conference Paper
Full-text available
In this paper, we describe a new Java framework for an easy and efficient way of developing new GUI based speech processing applications. Standard components are provided to display the speech signal, the power plot, and the spectrogram. Furthermore, a component to create a new transcription and to display and manipulate an existing transcription i...
Article
One aspect of voice and speech evaluation after laryngeal cancer is acoustic analysis. Perceptual evaluation by expert raters is a standard in the clinical environment for global criteria such as overall quality or intelligibility. So far, automatic approaches evaluate acoustic properties of pathologic voices based on voiced/unvoiced distinction an...
Conference Paper
Full-text available
In this work we focus on speaker verification on channels of varying quality, namely Skype and high frequency (HF) radio. In our setup, we assume to have telephone recordings of speakers for training, but recordings of different channels for testing with varying (lower) signal quality. Starting from a Gaussian mixture / support vector machine (GMM/...
Article
We analyze and compare two different methods for unsupervised extractive spontaneous speech summarization in the meeting domain. Based on utterance comparison, we introduce an optimal formulation for the widely used greedy maximum marginal relevance (MMR) algorithm. Following the idea that information is spread over the utterances in form of concep...
Article
Full-text available
The CALO Meeting Assistant (MA) provides for distributed meeting capture, annotation, automatic transcription and semantic analysis of multiparty meetings, and is part of the larger CALO personal assistant system. This paper presents the CALO-MA architecture and its speech recognition and understanding components, which include real-time and offlin...