Peter Birkholz
Prof. Dr.-Ing.
TU Dresden | TUD · Institute of Acoustics and Speech Communication
230 Publications
78,202 Reads
2,109 Citations
Publications (230)
Introduction
We investigated the prosodic perception of uncertainty cues in adults with Autism Spectrum Disorder (ASD) compared to neurotypical adults (NTC).
Method
We used articulatory synthetic speech to express uncertainty in a human-machine scenario by varying the three acoustic cues pause, intonation, and hesitation. Twenty-eight adults with...
It has long been a mystery how children learn to speak without formal instructions. Previous research has used computational modelling to help solve the mystery by simulating vocal learning with direct imitation or caregiver feedback, but has encountered difficulty in overcoming the speaker normalisation problem, namely, discrepancies between child...
Within the realm of voice classification, singers could be sub-categorized by the weight of their repertoire, the so-called “singer's Fach.” However, the opposite pole terms “lyric” and “dramatic” singing are not yet well defined by their acoustic and articulatory characteristics. Nine professional singers of different singers' Fach were asked to s...
Articulatory copy synthesis (ACS) refers to the synthetic reproduction of natural utterances. The existing methods of ACS have the limitations of poor generalizability for unknown speakers, high computing costs, the lack of systematic evaluation, etc. Here we propose an ACS method based on the articulatory speech synthesizer VocalTractLab (VTL) and...
Purpose
Breathing is ubiquitous in speech production, crucial for structuring speech, and a potential diagnostic indicator for respiratory diseases. However, the acoustic characteristics of speech breathing remain underresearched. This work aims to characterize the spectral properties of human inhalation noises in a large speaker sample and explore...
Echo State Networks (ESNs) are a special type of Recurrent Neural Networks (RNNs), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. However, recent publications have addressed the problem that a purely random initialization may not be ideal. Instead, a completely determinist...
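The training scheme described in this abstract (random, fixed input and recurrent weights; only the output weights are trained) can be illustrated with a minimal sketch. It is not the authors' implementation, and it does not reproduce the deterministic initialization that the abstract goes on to mention. The reservoir size, the row-sum scaling of the recurrent weights, the ridge parameter, and the toy sine-prediction task are all assumptions chosen for brevity; practical ESNs usually scale the recurrent matrix by its spectral radius instead.

```python
import math
import random

def make_reservoir(n_units, seed=0):
    """Random input and recurrent weights; these stay fixed (untrained)."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]
    w_rec = [[rng.uniform(-1.0, 1.0) for _ in range(n_units)] for _ in range(n_units)]
    # Assumption: rescale so the largest absolute row sum is 0.9, a simple
    # sufficient condition for fading memory with tanh units.
    max_row_sum = max(sum(abs(w) for w in row) for row in w_rec)
    w_rec = [[0.9 * w / max_row_sum for w in row] for row in w_rec]
    return w_in, w_rec

def run_reservoir(inputs, w_in, w_rec):
    """Collect the reservoir state after each input sample."""
    state = [0.0] * len(w_in)
    states = []
    for u in inputs:
        state = [
            math.tanh(w_in[i] * u + sum(w * s for w, s in zip(w_rec[i], state)))
            for i in range(len(state))
        ]
        states.append(state)
    return states

def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (aug[r][n] - sum(aug[r][c] * x[c] for c in range(r + 1, n))) / aug[r][r]
    return x

def train_readout(states, targets, ridge=1e-3):
    """Ridge regression: the output weights are the only trained parameters."""
    n = len(states[0])
    gram = [[sum(s[i] * s[j] for s in states) + (ridge if i == j else 0.0)
             for j in range(n)] for i in range(n)]
    rhs = [sum(s[i] * t for s, t in zip(states, targets)) for i in range(n)]
    return solve_linear(gram, rhs)

def mse(states, targets, w_out):
    """Mean squared error of the linear readout on the collected states."""
    preds = [sum(w * s for w, s in zip(w_out, st)) for st in states]
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)

# Toy task (assumption): predict the next sample of a sine wave.
series = [math.sin(0.2 * n) for n in range(301)]
inputs, targets = series[:-1], series[1:]
w_in, w_rec = make_reservoir(n_units=20)
states = run_reservoir(inputs, w_in, w_rec)
w_out = train_readout(states, targets)
train_error = mse(states, targets, w_out)
baseline_error = mse(states, targets, [0.0] * len(w_out))
```

Comparing `train_error` with `baseline_error` (an all-zero readout) shows the effect of training the readout alone; `w_in` and `w_rec` are never updated.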
Fricatives require precise gestural control to produce a narrow constriction in the vocal tract. This generates turbulent airflow that gives fricatives their distinctive acoustic properties. This study investigated the articulation of the lingual fricatives /s, ʃ, x/ of Upper Sorbian, which is an endangered language spoken in eastern Germany. We us...
This paper introduces VQ-Synth, a prototype system for voice-quality modification, designed to increase breathiness in sustained vowels. The system currently operates offline but is intended to be used for real-time auditory feedback alteration in the near future. Here, we describe VQ-Synth's architecture and operating principles and evaluate its ef...
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we u...
The human voice is a directional sound source. This property has been explored for more than 200 years, mainly using measurements of human participants. Some efforts have been made to understand the anatomical parameters that influence speech directivity, e.g., the mouth opening, diffraction and reflections due to the head and torso, the lips and t...
This study investigated how the bandwidths of resonances simulated by transmission-line models of the vocal tract compare to bandwidths measured from physical three-dimensional printed vowel resonators. Three types of physical resonators were examined: models with realistic vocal tract shapes based on Magnetic Resonance Imaging (MRI) data, straight...
In this study, silicone vocal fold models with different geometries were manufactured using the common silicone brand EcoFlex 00-30 with typical oil mixing ratios. However, the proportions of oil typically used are higher than the manufacturer's recommended limit, in order to attain the softness of human vocal folds. This additional oil causes dire...
The voice resynthesis system VQ-Synth is intended to increase the perceived breathiness (one facet of hoarseness) of speech recordings and thereby offer an approach to the research and therapy of functional dysphonia. The aim of this work was to evaluate VQ-Synth perceptually in a laboratory study for the first time. To this end, in a listening experiment (N = 31), the...
In this study, 23 subjects produced cyclic transitions between rounded vowels and unrounded vowels as in /o-i-o-i-o-…/ at two specific speaking rates. Rounded vowels are typically produced with a lower larynx position than unrounded vowels. This contrast in vertical larynx position was further amplified by producing the unrounded vowels with a high...
Computational approaches have an important role to play in understanding the complex process of speech acquisition, in general, and have recently been popular in studies of vocal learning in particular. In this article we suggest that two significant problems associated with imitative vocal learning of spoken language, the speaker normalisation and...
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the primary measure of learning success. Thereby, a novel approach for artificial vocal learning is presented that utilizes deep neural network-based...
Articulatory synthesis generates speech sounds by simulating the physical phenomena involved in speech production. The accuracy of the physical modelling is expected to affect the naturalness of the synthesis: the more realistic the description is, the greater the naturalness is expected to be. In this work, the accuracy of acoustic wave propagatio...
Evoc-Learn is a system for simulating early vocal learning of spoken language in ways that can overcome some of the major bottlenecks in vocal learning. The system consists of VocalTractLab, a geometrical three-dimensional vocal tract model for simulating aeroacoustics and articulatory dynamics, a coarticulation model for controlling the temporal d...
The finite-difference time-domain (FDTD) method has been widely used for vocal tract acoustic modelling due to its simplicity and low computational cost. Nevertheless, the method suffers from high discretization error while approximating realistic vocal tract geometries using orthogonal grid elements. Alternatively, simplified vocal tract shapes ha...
Silent speech interfaces allow speech communication to take place in the absence of the acoustic speech signal. Radar-based sensing with radio antennas on the speaker's face can be used as a non-invasive modality to measure speech articulation in such applications. One of the major challenges with this approach is the variability between different...
Silent speech interfaces (SSIs) are a subject of growing interest, as they can enable speech communication even in the absence of the acoustic signal. Among sensing techniques used in SSIs, radar sensing has many desirable characteristics, such as non-invasiveness and comfort. Although promising results have been achieved with radar-based SSIs, some...
While numerous studies on automatic speech recognition have been published in recent years describing data augmentation strategies based on time or frequency domain signal processing, few works exist on the artificial extensions of training data sets using purely synthetic speech data. In this work, the German KIEL corpus was augmented with synthet...
High-quality articulatory speech synthesis has many potential applications in speech science and technology. However, developing appropriate mappings from linguistic specification to articulatory gestures is difficult and time consuming. In this paper we construct an optimisation-based framework as a first step towards learning these mappings witho...
The electric power system is undergoing a tremendous transition. Despite increasing energy demands resulting from an "all-electric society", onshore as well as offshore renewable generation units require solutions for bulk energy transportation. Here long and very long HVDC cable systems are a key enabling technology, especially land cables, to inc...
Our research focuses on the acoustic characteristics of inhalation noises in speech and the underlying articulatory mechanisms. Previously, we have shown (Werner et al., 2021) two main similarities of breathing noise to selected speech sounds: a) enhanced amplitudes at frequencies corresponding to low vowel formants and b) spectral characteristics...
This paper introduces a paradigm shift regarding vocal learning simulations, in which the communicative function of speech acquisition determines the learning process and intelligibility is considered the main measure of learning success. Thereby, a novel approach for artificial early vocal learning is presented that utilizes deep neural network-...
Reservoir Computing Networks (RCNs) belong to a group of machine learning techniques that project the input space non-linearly into a high-dimensional feature space, where the underlying task can be solved linearly. Popular variants of RCNs are capable of solving complex tasks equivalently to widely used deep neural networks, but with a substantial...
The nature of English diphthongs has been much disputed. By now, the most influential account argues that diphthongs are phoneme entities rather than vowel combinations. However, mixed results have been reported regarding whether the rate of formant transition is the most reliable attribute in the perception and production of diphthongs. Here, we used c...
Recovering speech in the absence of the acoustic speech signal itself, i.e., silent speech, holds great potential for restoring or enhancing oral communication in those who lost it. Radar is a relatively unexplored silent speech sensing modality, even though it has the advantage of being fully non-invasive. We therefore built a custom stepped frequ...
Echo state networks (ESNs) are a special type of recurrent neural networks (RNNs), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image, and radar recognition, we postulate that a purely random initialization is...
Periodic repetitions of laryngeal adduction and abduction gestures were produced by 16 subjects. The movement of the cuneiform tubercles was tracked over time in the laryngoscopic recordings of these utterances. The adduction velocity and abduction velocity were determined objectively by means of a piecewise linear model fitted to the cuneiform...
This article describes the need for monitoring HVDC cable systems and presents a new approach for monitoring partial discharge at DC voltage on long-length, land-based HVDC cables containing multiple joints.
Acoustic simulation of sound propagation inside the vocal tract is a key element of speech research, especially for articulatory synthesis, which allows one to relate the physics of speech production to other fields of speech science, such as speech perception. Usual methods, such as the transmission line method, have a very low computational cost...
Articulatory synthesis is based on modeling various physical phenomena of speech production, including sound radiation from the mouth. With regard to sound radiation, the most common approach is to approximate it in terms of a simple spherical source of strength equal to the mouth volume velocity. However, because this approximation is only valid a...
Resonance-strategies with respect to vocal registers, i.e., frequency-ranges of uniform, demarcated voice quality, for the highest part of the female voice are still not completely understood. The first and second vocal tract resonances usually determine vowels. If the fundamental frequency exceeds the vowel-shaping resonance frequencies of speech,...
Articulatory synthesis relies on precise, parametric vocal tract shapes to generate natural-sounding speech. In German, a particular challenge is the accurate synthesis of the vocalic /r/ allophones following vowels or in syllable coda position. Using established phonetic conventions, no satisfying results could be achieved so far, implying a possi...
This study investigated the intrinsic velocities of raising and lowering movements of the velum that are related to its biomechanical structure and aerodynamic conditions. To this end, five subjects produced cyclic transitions between nasals and fricatives as in /s-n-s-n-s-.../ with flat intonation and at two specific speaking rates to minimize con...
This video shows the measurement of the volume velocity transfer function of vocal tract tube models with soft versus hard walls.
https://www.youtube.com/watch?v=iD_PxQLEYlQ
Note onset detection – the detection of the beginning of new note events – is a fundamental task for music analysis that can help to improve Automatic Music Transcription (AMT). The method for onset detection always follows a similar outline: An audio signal is transformed into an Onset Detection Function (ODF), which should have rather low values...
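The outline named in this abstract (audio signal, then Onset Detection Function, then onset decisions) can be sketched as follows. The visible text does not specify which ODF the authors use, so this example swaps in a simple energy-difference ODF with fixed-threshold peak picking. The frame length, hop size, threshold, sample rate, and toy signal are all assumptions.

```python
import math

def frame_energies(signal, frame_len, hop):
    """Short-time energy of each frame."""
    return [
        sum(x * x for x in signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]

def onset_detection_function(signal, frame_len=64, hop=32):
    """Half-wave rectified energy difference between consecutive frames."""
    energy = frame_energies(signal, frame_len, hop)
    return [0.0] + [max(0.0, b - a) for a, b in zip(energy, energy[1:])]

def pick_onsets(odf, threshold):
    """Indices of local maxima of the ODF that exceed a fixed threshold."""
    onsets = []
    for i in range(1, len(odf) - 1):
        if odf[i] > threshold and odf[i] >= odf[i - 1] and odf[i] > odf[i + 1]:
            onsets.append(i)
    return onsets

# Toy signal (assumption): 512 samples of silence followed by a 440 Hz tone,
# sampled at 8 kHz, so the only note event starts at sample 512.
SAMPLE_RATE = 8000
signal = [0.0] * 512 + [
    math.sin(2 * math.pi * 440 * n / SAMPLE_RATE) for n in range(512)
]
odf = onset_detection_function(signal)
onsets = pick_onsets(odf, threshold=0.5 * max(odf))
```

On this toy signal the ODF stays near zero during silence and during the steady tone, and rises only around the frames that straddle sample 512, so `onsets` should contain a single frame index close to the start of the tone. Multiplying a frame index by the hop size and dividing by `SAMPLE_RATE` converts it to seconds.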
While the acoustic vowel space has been extensively studied in previous research, little is known about the high-dimensional articulatory space of vowels. The articulatory imaging techniques are limited to tracking only a few key articulators, leaving the rest of the articulators unmonitored. In the present study, we attempted to develop a detailed...
Automatic music transcription (AMT) is one of the challenging problems in Music Information Retrieval with the goal of generating a score-like representation of a polyphonic audio signal. Typically, the starting point of AMT is an acoustic model that computes note likelihoods from feature vectors. In this work, we evaluate the capabilities of Echo...
When pitch is explicitly modelled for parametric speech synthesis, microprosodic variations of the fundamental frequency f0 are usually disregarded by current intonation models. While there are numerous studies dealing with the nature and the origin of microprosody, little research has been done on its audibility and its effect on the naturalness o...
Early detection of malign patterns in patients’ biological signals can save millions of lives. Despite the steady improvement of artificial intelligence–based techniques, the practical clinical application of these methods is mostly constrained to an offline evaluation of the patients’ data. Previous studies have identified organic electrochemical...
Objective:
Stroke survivors commonly suffer from dysphagia, originating from oro-facial impairments which affect swallowing function. Functional therapy often employs tongue exercises that require the patient to perform short motion sequences. Evaluating the patient's performance on those exercises is difficult, because there is no reliable form of...
This study compared the f0 of 14 German vowels in monosyllabic words (/dVt/) embedded in carrier sentences produced by 30 native speakers and 30 Mandarin Chinese learners. Appropriate techniques were employed to robustly measure f0 values and reliably analyze f0 profiles. The results showed that Mandarin learners produced the vowels bearing sentenc...
Acoustic models of the vocal tract for articulatory speech synthesis often neglect a range of acoustic effects that are known to exist in the human vocal tract. Here we extended a basic acoustic vocal tract model by three features: the piriform fossae, transvelar acoustic coupling of the oral and nasal cavities, and sound radiation from the skin of...
The current state of the art for localizing defects in cable systems includes synchronous, multi-channel signal propagation time measurements and the use of time or frequency domain reflectometry. The length of the cable system to be diagnosed limits these technologies. A wave propagating as a result of a dielectric defect is attenuated due to the...
Echo State Networks (ESNs) are a special type of recurrent neural networks (RNNs), in which the input and recurrent connections are traditionally generated randomly, and only the output weights are trained. Despite the recent success of ESNs in various tasks of audio, image and radar recognition, we postulate that a purely random initialization is...
Acoustic and articulatory differences between spoken and shouted vowels were analyzed for two male and two female subjects by means of acoustic recordings and midsagittal magnetic resonance images of the vocal tract. In accordance with previous acoustic findings, the fundamental frequencies, intensities, and formant frequencies were all generally h...
The Times of Excitation (Tx) of speech, also widely known as Glottal Closure Instants (GCIs), denote the points in time at which the vocal folds close during the production of voiced speech. In this paper, we extend a previous approach based on a multilayer perceptron (MLP) using Echo State Networks (ESN), a variant of a Recurrent Neural Network (...
The influence of non-smooth trachea walls on phonation onset and offset pressures and the fundamental frequency of oscillation were experimentally investigated for three different synthetic vocal fold models. Three models of the trachea were compared: a cylindrical tube (smooth walls) and wavy-walled tubes with ripple depths of 1 and 2 mm. Threshol...
In music analysis, one of the most fundamental tasks is note onset detection, that is, detecting the beginning of new note events. As the target function of onset detection is related to other tasks, such as beat tracking or tempo estimation, onset detection is the basis for such related tasks. Furthermore, it can help to improve Automatic Music Transcriptio...
Introduction: Patients with neurological disorders often suffer from fine-motor impairments in the oral region, which can cause dysphagia. The type and severity of these movement restrictions, and their influence on the risk of aspiration, can only be assessed instrumentally by means of videofluoroscopy (VFSS). In the BMBF-funded joint project...
Purpose
Psychoacoustical studies on transmission characteristics related to bone-conducted (BC) speech, perceived by speakers during vocalization, are important for further understanding the relationship between speech production and perception, especially auditory feedback. For exploring how the outer ear part contributes to BC speech transmission...
The complex f0 variations in continuous speech make it rather difficult to perform automatic recognition of tones in a language like Mandarin Chinese. In this study, we tested the use of the target approximation model (TAM) for continuous tone recognition on two datasets. TAM simulates f0 production from the articulatory point of view and so allows to d...
In this study, a state-of-the-art articulatory speech synthesiser was used as the basis for simulating the exploration of CV sounds imitating speech stimuli. By adopting a relevant kinematic model and systematically reducing the search space of consonant articulatory targets, intelligible CV sounds can be found. Derivative-free optimisation strateg...
Currently, convolutional neural networks (CNNs) define the state of the art for multipitch tracking in music signals. Echo State Networks (ESNs), a recently introduced recurrent neural network architecture, achieved results similar to those of CNNs for various tasks, such as phoneme or digit recognition. However, they have not yet received much attention in...