Volker Dellwo
University of Zurich | UZH · Department of Computational Linguistics
PhD MA
About
187
Publications
72,801
Reads
2,089
Citations
Introduction
I am interested in how people communicate with speech. I mainly look at communication via the acoustic channel, but increasingly also at visual information. In acoustics I look at a variety of segmental and supra-segmental phenomena, e.g. speech timing and vowel identification. More information is on my homepage: www.pholab.uzh.ch/leute/dellwo.html
Additional affiliations
June 2000 - February 2002
October 2003 - July 2010
August 2010 - present
Education
April 2002 - January 2009
May 1992 - April 2000
Publications (187)
Speech rhythm in terms of durational variability of different levels of phonetic intervals can vary between speakers. The present article examines the role of syllabic intensity characteristics in rhythmic variability. Mean and peak intensity variability across syllables (stdevM, varcoM, stdevP, varcoP, rPVIm, nPVIm, rPVIp, nPVIp; henceforth: inten...
In a between-subject perception task, listeners either identified full words or vowels isolated from these words at F0s between 220 and 880 Hz. They received two written words as response options (minimal pair with the stimulus vowel in contrastive position). Listeners' sensitivity (A′) was extremely high in both conditions at all F0s, showi...
Between-speaker variability of acoustically measurable speech rhythm [%V, ΔV(ln), ΔC(ln), and Δpeak(ln)] was investigated when within-speaker variability of (a) articulation rate and (b) linguistic structural characteristics was introduced. To study (a), 12 speakers of Standard German read seven lexically identical sentences under five different in...
Listeners are typically able to identify speech as either being produced spontaneously or read from a transcript. In the present research we investigated whether this is true in vernacular speech when typical cues to read and spontaneous speech are either missing and/or ambiguous. In addition it was investigated what the acoustic cues for listeners...
The percentage of vocalic intervals (%V) and the standard deviation of consonantal intervals (deltaC) in a speech signal are two dimensions according to which languages of different rhythm classes (e.g. stress-timed, syllable-timed) seem to be differentiable on an acoustic level (Ramus et al., 1999). In this context it has been found that especiall...
To record clearly defined natural deceptive speech with precise knowledge of the ground truth and immediate consequences for the lying subject we present here the NumberLie game. The NumberLie design enables simultaneous and isolated audio recording of five players in our state-of-the-art laboratory, or adapted to any number of players in an online...
Multimodal learning involves integrating information from various modalities to enhance learning and comprehension. We compare three modality fusion strategies in person identification and verification by processing two modalities: voice and face. In this paper, a one-dimensional convolutional neural network is employed for x-vector extraction from...
Deception, defined as “a deliberate attempt to mislead others” [1], is prevalent in both human and non-human communication [2]. Under the reciprocal altruism theory [3], deception may yield individual gains but can undermine future interactions if the deceiver is recognized. It is plausible that humans seek to be less identifiable when lying. We th...
In automatic speech recognition, any factor that alters the acoustic properties of speech can pose a challenge to the system's performance. This paper presents a novel approach for automatic whispered speech recognition in the Irish dialect using the self-supervised WavLM model. Conventional automatic speech recognition systems often fail to accura...
Deepfakes are viral ingredients of digital environments, and they can trick human cognition into misperceiving the fake as real. Here, we test the neurocognitive sensitivity of 25 participants to accept or reject person identities as recreated in audio deepfakes. We generate high-quality voice identity clones from natural speakers by using advanced...
Human voice recognition over telephone channels typically yields lower accuracy when compared to audio recorded in a studio environment with higher quality. Here, we investigated the extent to which audio in video conferencing, subject to various lossy compression mechanisms, affects human voice recognition performance. Voice recognition performanc...
A random forest based comparison of classification performance of 3 Swiss German dialects using a continuum from most-rhythmic to most-spectral feature-sets (CV-interval-duration, onset strength amplitude envelope descriptors, V-to-V delta MFCCs, and MFCCs). We report a peak test accuracy of 97.8%.
Human listeners have a remarkable ability to recognize speakers by their voice, but within-speaker voice variability through different speaking styles, for example, can reduce recognition performance. In this study, we investigated voice discrimination across speaking styles in Persian. One hundred and forty-three naïve Persian listeners were asked...
Speakers adapt their speech in response to communicative context and listeners' needs. For example, when talking to hard-of-hearing or non-native listeners, or in the presence of background noise, speakers make their speech more intelligible by speaking clearly. However, it is not clear which adaptations speakers will make when the goal is not intel...
Speech adaptations occur frequently in the presence of perceived communication barriers. Modern technological advancements have brought with them new interlocutors for human speakers with the introduction of voice-AI assistants. Findings have shown that voice-AI-directed speech is characterised by an increase in vocal effort resulting from the pres...
Proposals on how to assess the comparability of distances in a multidimensional feature space and perceived similarity by listeners of different voice identities using artificially morphed voice samples for a controlled theoretical distance between voices.
Cooperation, acoustically signaled through vocal convergence, is facilitated when group members are more similar. Excessive vocal convergence may, however, weaken individual recognizability. This study aimed to explore whether constraints to convergence can arise in circumstances where interlocutors need to enhance their vocal individu...
In this study, we examined whether the convergence in interlocutors’ vowel acoustics leads to decreasing discriminability between interlocutors’ voices. Ten pairs of Grison and Zürich German speakers produced lexical items before and after dialogue interactions with evidence of vowel convergence in post-dialogue productions. In Experiment 1, native...
This introductory article for the Special Issue on Vocal Accommodation in Speech Communication provides an overview of prevailing theories of vocal accommodation and summarizes the ten papers in the collection. Communication Accommodation Theory focusses on social factors evoking accent convergence or divergence, while the Interactive Alignment Mod...
Formant dynamics are believed to reflect the characteristic articulatory behavior of a speaker. The present study aims to explore individual articulatory behaviors when producing American English /æ/ and /ɑ/. The two vowels differ in the degree of inherent spectral change, a property believed to carry information about vowel-phoneme identity, which...
Speech signals contain substantial fundamental frequency (f0) variability. Even within a single utterance, speakers modify f0 to create different intonational patterns. Previous studies have identified markers of increased f0 variability, such as the introduction of a new topic or greetings, but these are limited in the scope of their analyses. In...
This project presents a first-of-its-kind experimental technique, combining a VWP and voice recognition task, for exploring decision making in naïve familiar voice recognition. The current experiment design outlined in this poster will become a pilot study which will test the validity of this experimental concept.
It has been demonstrated that stress, experienced outside of a relationship, can spill into a relationship and cross over during interactions from one partner to the other. However, the mechanism of how stress crosses over in real time between partners is still unknown. To overcome this limitation, we invited 189 couples (N = 378 individuals) for two...
This presentation contains findings from a pilot study which aimed to explore if prevalent within-speaker variability is occurring in telephone openings, compared to later in the calls. A speaker discrimination task was used to see if performance differs when listeners are presented with samples taken from call openings, compared to mid-conversatio...
Deception is a common behaviour in not only humans, but also other taxa (Griffin & Ristau, 2013; Whiten & Byrne, 1988). Although deception can include a wide variety of different behaviours, here we only focus on lying by deliberately not telling the truth in human speech communication. We start from the assumption that humans have no interest in b...
This paper reports on the results of research investigating whether rhythmic features, in terms of segmental timing properties, are subject to speakers' adjustments after exposure to a conversational partner. In the context of dialects in contact, this is crucial to understand whether rhythmic attributes may bring about language variation and...
The human auditory system is capable of processing human speech even in situations when it has been heavily degraded, such as during noise-vocoding, when frequency domain-based cues to phonetic content are strongly reduced. This has contributed to arguments that speech processing is highly specialized and likely a de novo evolved trait in humans. P...
Voice timbre – the unique acoustic information in a voice by which its speaker can be recognized – is particularly critical in mother-infant interaction. Correct identification of vocal timbre is necessary in order for infants to recognize their mothers as familiar both before and after birth, providing a basis for social bonding between infant and...
Different methods to acquire a language can contribute differently to learning success. In the present study we tested the success of L2 stress contrasts acquisition, when ab initio learners were taught or not about the theoretic nature of L2 stress contrasts. In two 4-hour perceptual training methods, French-speaking listeners received either (a)...
Foreign-accented speech typically deviates segmentally and suprasegmentally from native-accented speech. Two experiments were conducted to investigate the role of amplitude envelope (ENV), segment duration (DUR), and speech rate (SR) on Italian listeners' ability to identify native-accented Italian in utterances produced by Zurich German speakers....
In everyday communication, the goal of speakers is to communicate their messages in an intelligible manner to their listeners. When they are aware of a speech perception difficulty on the part of the listener due to background noise, a hearing impairment, or a different native language, speakers will naturally and spontaneously modify their speech...
Speech rhythm varies with age. In this paper, we examined the role of mean and peak syllable intensity variability in age-related rhythmic changes. Sixteen younger adults and 10 older speakers read 60 sentences in Zurich German. Results revealed that peak syllable intensity variability is significantly smaller in older compared to younger adults; t...
It is unclear whether word stress in a language is stored as part of the word or whether it is generated by a rule. We test the generativist hypothesis of lexical storage stating that only unpredictable stress is stored in long-term memory against the contrasting usage-based approach assuming that all phonetic information regardless of its (un)pred...
Tones can be realized through fundamental frequency of oscillation (fo) or voice quality correlates (open-quotient, jitter and shimmer). Here we investigated the acoustic correlates of tones across two varieties of Burmese (a) standard Burmese (STB) and (b) southern Burmese (SBM). Historically, Burmese likely had voice-qua...
Listeners in fixed-stress languages are less sensitive in processing stress contrasts in a second language with contrastive stress (stress 'deafness'). We investigated whether native speakers of French (fixed-stress language) can acquire the ability to distinguish stress contrasts in Spanish (free-stress language). In behavioral experiments, we fou...
Databases for studying speech rhythm and tempo exist for numerous languages. The present corpus was built to allow comparisons between Arabic speech rhythm and other languages. 10 Egyptian speakers (gender-balanced) produced speech in two different speaking styles (read and spontaneous). The design of the reading task replicates the methodology use...
Human voices are individual and humans have elaborate skills in recognizing speakers by their voice, phenomena that are deeply rooted in the evolution of human behavior. To date, the mechanisms of speaker recognition are not well understood because of the high variability of the acoustic cues to a speaker’s identity. We wondered what role the speak...
In human-human interactions, the situational context plays a large role in the degree of speakers’ accommodation. In this paper, we investigate whether the degree of accommodation in a human-robot computer game is affected by (a) the duration of the interaction and (b) the success of the players in the game. 30 teams of two players played two card...
In the specialist literature on vowel acoustics, there is an extensive and often controversial debate on whether the primary acoustic cues of vowel quality are contained in the formant patterns or, alternatively, in the spectral shape. Yet, recent studies have shown that neither formant patterns nor spectral shapes are vowel quality-specific but th...
In this study, we investigated the effect of music aptitude on French and German listeners' performance at discriminating stress contrasts in Spanish L2, before and after a 4-hour perceptual training in Spanish. For the French listeners, results showed that the better the music aptitude the better the stress discrimination performance (before and a...
Formant characteristics are most commonly part of forensic speaker comparison (FSC). However, only formants F1 to F3 typically occur in evidence material because it is mostly recorded via telephone. Given recent technological advances in telephony (e.g. WeChat or WhatsApp) higher formants (F4-F5) are becoming increasingly part of evidence material....
Temporal organizations of the speech signal are highly individual among speakers of the same language. In the present study, we looked at speech production of bi-dialectal speakers using two varieties of the same language. We aimed at testing whether speaker-specific temporal features present in one dialect remain in another dialect of the same spe...
Fundamental frequency (F0) has always been reported as a popular and powerful parameter in forensic voice comparison (hereafter, FVC). One reason why F0 is considered forensically useful is that it conforms to many of the desiderata for FVC parameters (Rose, 2002). F0 is easily extracted and quantified and readily available even in a short stretch...
Acoustic correlates of speech which show high between-speaker variability coupled with low within-speaker variability play an essential role in reflecting speaker-specific information encoded in human speech. In an international study conducted by Gold and French (2011), vowels have been reported as one of the most analyzed segments among practitio...
An unsupervised automatic clustering algorithm (k-means) classified 1282 Mel frequency cepstral coefficient (MFCC) representations of isolated steady-state vowel utterances from eight standard German vowel categories with fo between 196 and 698 Hz. Experiment I obtained the number of MFCCs (1–20) in connection with the spectral bandwidth (2–20 kHz)...
Age-related decline in speech perception may result in difficulties partaking in spoken conversation and potentially lead to social isolation and cognitive decline in older adults. It is therefore important to better understand how age-related differences in neurostructural factors such as cortical thickness (CT) and surface area (CSA) are related...
First formant (F1) trajectories of vocalic intervals were divided into positive and negative dynamics. Positive F1 dynamics were defined as the speeds of F1 increases to reach the maxima, and negative F1 dynamics as the speeds of F1 decreases away from the maxima. Mean, standard deviation, and sequential variability were measured for both dynamics....
Adult speakers commonly alter their voices when talking to infants, giving rise to an infant-directed speech (IDS) style. Here we tested the effects of infant-directed speech on the recognizability of a speaker’s voice. Ten Swiss-German mothers were recorded talking to their infants (in IDS) and talking to an adult experimenter (in adult-directed speech,...
With respect to existing evidence of rhythmic adjustments in response to the interlocutor’s idiosyncratic characteristics, in the present study, we test whether interlocutors are likely to mutually adapt their rhythmic characteristics over the course of a conversation or after increased exposure to a dialogue partner. To study rhythmic accommodatio...
In everyday interactions, speakers tend to accommodate to the speech of their conversation partners. There are many studies that measure the acoustic and phonetic properties of accommodation phenomena in different adult communication situations (e.g. interactions between speakers from different dialects or languages). However, few studies have ex...
Acoustic measures of speech rhythm based on the durational characteristics of consonantal and vocalic intervals (henceforth C- or V-intervals) as well as syllabic intensity reveal between-speaker variability. The evidence obtained so far is based on speakers of stress-timed languages, which are assumed to have complex consonant clusters and a hig...
This study investigates the use of long-term formant frequency (LTF) of vowels in discrimination of Persian speakers. LTF is a method by which formant values of all vocalic portions produced by a speaker are averaged, leading to one mean value and a standard deviation (SD) per formant. To explore between- and within-speaker variability, LTF values o...
Existing databases of isolated vowel sounds or vowel sounds embedded in consonantal context generally document only limited variation of basic production parameters. Thus, concerning the possible variation range of vowel and voice quality-related sound characteristics, there is a lack of broad phenomenological and descriptive references that allow...
We tested the influence of fundamental oscillation (fo) on human and machine speaker recognition performance in vocalic test utterances. In experiment I, we trained a Gaussian-Mixture model on 15 speakers (80 multi-word utterances each) and tested it with sustained vowel utterances (/a:/, /i:/ and /u:/) under six fo conditions, three changing (fa...
Speech segmental and suprasegmental characteristics vary considerably across the life span, for example, due to degenerative changes in speech production mechanisms and neuromuscular control. A great deal of research on the acoustic correlates of adult speakers’ voice has focussed on changes in voice quality, vowel formant patterns, f0, amplitude...
In the literature, the recognition of sinewave vowels replicating statistical formant patterns is reported as impaired when compared to natural sounds. However, the corresponding formant simulating sinusoids were harmonically unrelated, with synthesised signals only accidentally being quasi-periodic, and vowel confusion was indicated to relate to v...
When investigating formant pattern and spectral shape ambiguity in Klatt synthesis, an earlier study showed that the perceived vowel quality of Standard German vowel sounds can be changed by varying fundamental frequency only [Maurer et al. (2017). J. Acoust. Soc. Am. 141(5):3469-3470]. In this follow-up study, the previous original synthesis exper...
Voices are highly individual, and this information may be used to recognize people. This chapter provides an overview of how speaker-specific voice information can be used to assist in recognizing unknown speakers in evidential audio recordings in order to assist in progressing criminal investigations or for evidential purposes. While the chapter i...
A recent study involving the perceptual analysis of 24 speakers by two raters (San Segundo & Mompeán, 2017) revealed a slight inter-rater agreement in the assessment of vocal tract tension. In the current investigation several prosodic measures related to intensity and durational variability have been extracted per speaker with the aim of testing w...
A recent study involving the perceptual analysis of 24 speakers by two raters (San Segundo & Mompeán 2017) revealed a slight inter-rater agreement in the assessment of vocal tract tension (VTT). In the current investigation several prosodic measures related to intensity and durational variability have been extracted per speaker with the aim of test...
The perception of stress is highly influenced by listeners' native language. In this research, the authors examined the effect of intonation and talker variability (here: phonetic variability) in the discrimination of Spanish lexical stress contrasts by native Spanish (N = 17), German (N = 21), and French (N = 27) listeners. Participants listened t...
The phonological function of vowels can be maintained at fundamental frequencies (fo) up to 880 Hz [Friedrichs, Maurer, and Dellwo (2015). J. Acoust. Soc. Am. 138, EL36–EL42]. Here, the influence of talker variability and multiple response options on vowel recognition at high fos is assessed. The stimuli (n = 264) consisted of eight isolated vowels (...
Various researchers have shown an interest in the voice similarity of identical twins. However, results across studies are hardly comparable since the number of speakers, gender, speaking style and, most importantly, forensic comparison methods tend to differ. Therefore, it is difficult to assess the relative importance of different systems or the...
We present results of speech rhythm analysis for automatic speaker identification. We expand previous experiments using similar methods for language identification. Features describing the rhythmic properties of salient changes in signal components are extracted and used in a speaker identification task to determine to what extent they are descri...
Intensity contours of speech signals were sub-divided into positive and negative dynamics. Positive dynamics were defined as the speed of increases in intensity from amplitude troughs to subsequent peaks, and negative dynamics as the speed of decreases in intensity from peaks to troughs. Mean, standard deviation, and sequential variability were mea...
The influence of varying fundamental frequency on the perception of vowel quality in synthesized vowels was tested in two experiments. In experiment 1, based on investigations of natural Standard German vowel sounds, various model formant patterns F1’ to F3’ were created and, for each single pattern, sounds were synthesised on two or three fundamental...
We model the amplitude envelope of a speech signal as a kinematic system and calculate its basic parameters: displacement, velocity, and acceleration. Such a system captures the smoothed amplitude fluctuation pattern over time, illustrating how energy is distributed across the signal. Although the pulmonic air pressure is the primary energy source of...
Previous research suggests that the broad-band amplitude envelope (ENV) of speech is crucial for the perception of speech rate and timing. The present experiment tested this claim using non-manipulated and spectrally rotated speech (rotated around 2.5 kHz) with a bandwidth of 5 kHz which both contain identical ENV and reversed speech in which the t...
Age-related changes in speech production influence both speech segmental and suprasegmental characteristics. Previous research focussed on changes in voice quality, vowel formant patterns, f0 and speech rate due to aging, but little attention has been paid to speech rhythm (durational and dynamic). In this study we analyzed the segmental duration...
Speech can be modeled as the sum of high frequency carrier signals (temporal fine structure; TFS) from different frequency bands modulated in amplitude by a low frequency modulator (amplitude envelope; ENV). Previous research revealed that ENV cues alone can be sufficient for speech intelligibility. The present research tested the role of signal pe...
The performance of automatic speaker recognition (ASR) systems decreases when training and test data are produced in different social situations (speech registers). The present research tested ASR performance across adult- and infant-directed speech registers (ADS and IDS respectively). IDS compared to ADS is generally characterized by higher and m...
Rhythmic characteristics of speech vary between native and non-native speakers. Studies comparing the rhythmic properties of L1 and L2 speech based on rhythm metrics have shown that this relationship is far from straightforward. It seems evidently the case that the difference between native and non-native speech is a complex interaction of a variet...
Current formant measurement studies of vowel sounds generally use a Linear Predictive Coding (LPC) algorithm and rely on an interactive method of formant estimation which includes a comparison of measured formant tracks and characteristics of the spectrogram. Thereby, the selection of LPC parameters is based on the assumption that the number of po...