Conference Paper

A landmark-based approach to automatic voice onset time estimation in stop-vowel sequences


Abstract

In the field of phonetics, voice onset time (VOT) is a major parameter of human speech defining linguistic contrasts in voicing. In this article, a landmark-based method of automatic VOT estimation in acoustic signals is presented. The proposed technique is based on a combination of two landmark detection procedures for release burst onset and glottal activity detection. Robust release burst detection is achieved by the use of a plosion index measure. Voice onset and offset landmarks are determined using peak detection on power rate-of-rise. The proposed system for VOT estimation was tested on two voiceless-stop-vowel combinations /ka/, /ta/ spoken by 42 native German speakers.
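The two-landmark scheme described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under assumed parameters (window lengths, frame sizes), not the authors' implementation: the burst landmark is taken at the maximum of a plosion-index measure, and the voice-onset landmark at the largest peak of the frame-power rate-of-rise after the burst.

```python
import numpy as np

def plosion_index(x, n, m1=160, m2=160):
    # Ratio of |x[n]| to the mean |x| over a window of m2 samples
    # ending m1 samples before n (window lengths are illustrative).
    lo, hi = max(n - m1 - m2, 0), max(n - m1, 1)
    return abs(x[n]) / (np.mean(np.abs(x[lo:hi])) + 1e-12)

def power_rate_of_rise(x, fs, win=0.010, hop=0.005):
    # Frame log-power (dB) and its first difference across frames.
    w, h = int(win * fs), int(hop * fs)
    p = [10 * np.log10(np.mean(x[i:i + w] ** 2) + 1e-12)
         for i in range(0, len(x) - w, h)]
    return np.diff(p), h

def estimate_vot(x, fs):
    # Burst landmark: sample with the largest plosion index.
    burst = max(range(400, len(x)), key=lambda n: plosion_index(x, n))
    # Voice-onset landmark: largest power rise after the burst.
    ror, hop = power_rate_of_rise(x, fs)
    j = burst // hop + 1
    onset = (j + int(np.argmax(ror[j:])) + 1) * hop
    return (onset - burst) / fs
```

On a synthetic closure-burst-aspiration-vowel token, the burst landmark lands on the release impulse and the voice-onset landmark on the first high-power vowel frame; real recordings would additionally need thresholding and sanity checks on both landmarks.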


... These differences could go either way, resulting in longer habituation durations with variable stimuli because attention is better maintained, or in shorter habituation durations because, with more focused attention, learning is undistracted and can proceed faster. However, the habituation durations of Experiments 1 and 2 were quite comparable, thus providing no indication that input variability in and of itself creates differences in attention. ...
[Figure 4 caption: Voice onset time (VOT) values for each exemplar in Experiment 2 (multiple speakers, blue dots) and VOT values for the two exemplars (of voiced /b/ and voiceless /p/) of Experiment 1 (single speaker, orange triangles). As expected for a Germanic voicing system, VOT values for /b/ mostly fall within a range of short VOT values (10-25 ms), along with a few negative values indicating occasional voicing during the stop closure (Beckman, Jessen, & Ringen, 2013; Jessen, 1998), and VOT values for /p/ are much higher, with a correspondingly substantial degree of inter-speaker variability (Kuberski et al., 2016; Tobin et al., 2018).]
... VOT distributions are highly speaker-dependent, both in terms of their means and their variances. For example, in a sample of forty-two speakers of German, mean VOTs for the voiceless plosives in /ka/ and /ta/ were found to vary from 44 to 104 ms (Kuberski, Tobin, & Gafos, 2016; Tobin et al., 2018). Moreover, VOT within a single speaker varies with syllable duration (the longer the syllable, the longer the VOT). ...
Article
Seminal work by Werker and colleagues (Stager & Werker, 1997) found that 14-month-old infants do not show evidence of learning minimal pairs in the habituation-switch paradigm. However, when multiple speakers produce the minimal pair in acoustically variable ways, infants' performance improves in comparison to a single-speaker condition (Rost & McMurray, 2009). The current study further extends these results and assesses how different kinds of input variability affect 14-month-olds' minimal pair learning in the habituation-switch paradigm with German-learning infants. The first two experiments investigated word learning when the labels were spoken by a single speaker versus by multiple speakers. In the third experiment we studied whether non-acoustic variability, implemented as visual variability of the objects presented together with the labels, would also affect minimal pair learning. We found enhanced learning in the multiple-speaker compared to the single-speaker condition, confirming previous findings with English-learning infants. In contrast, visual variability of the presented objects did not support learning. These findings both confirm and better delimit the beneficial role of speech-specific variability in minimal pair learning. Finally, we review different proposals on the mechanisms by which variability confers benefits to learning and outline likely principles that underlie this benefit. We highlight among these the multiplicity of acoustic cues signalling phonemic contrasts and the presence of relations among these cues. It is in these relations that we trace part of the source of the apparent paradoxical benefit of variability in learning.
... The author also mentioned that this method might fail on the fricatives /f/ and /h/ because of their transient-like properties. The PI method was also adopted in the study [166] for CBT detection. ...
Thesis
Full-text available
Speech disorder is an early and prominent manifestation of neurological disorders. The breakdown of speech disorders and the detection of their underlying pathophysiology are therefore of invaluable importance to clinical practice. Speech disorder is commonly attributed to the onset of aging; however, the pattern is mostly distinct for neurogenic voice. Parkinsonism is a class of neurological disorders that comprises idiopathic Parkinson's Disease (PD) and Atypical Parkinsonian Syndromes (APS), such as Progressive Supranuclear Palsy (PSP) and Multiple System Atrophy (MSA). Differential diagnosis of the latter disease groups remains a challenging task due to similar symptoms at the early stages, while early diagnostic certainty is essential for the patient because of the diverging prognoses. Indeed, despite recent efforts, no validated objective speech marker is currently available to guide the clinician in the differential diagnosis. This thesis thus aims to design and define speech markers that provide deep insight into speech disorders caused by neurological diseases, with differential diagnosis as the end goal.

Analysis of speech disorders demands at least a speech database from which patterns of speech abnormalities can be assessed. No speech database covering the PD and MSA-P disease groups was available for French. The development of such a database (Voice4PD-MSA) from PD and MSA-P groups was therefore one of the targets of this thesis. While developing the Voice4PD-MSA database, we explored the CzechData database, which comprises speech samples in Czech, for differential diagnosis.

Automatic algorithms are always in demand to quantify perceptual and visual observations and thereby capture particular speech disorders. Clinically interpretable speech components are considered to capture speech abnormalities in respiration, vowel production, articulator movements, and prosody, using objective methods applied to sustained vowels, word-initial consonants, diadochokinetic (DDK) tasks, and continuous speech. Imprecise vowels, comprising deficits in vocal fold opening and closing, involuntary articulator movements, hypernasality, tremor, and changes in vowel space area, are observed to be important for the differential diagnosis of MSA-P and PD patients. Among imprecise obstruents, devoicing of voiced obstruents and bursts in fricatives (anti-spirantization) are identified as distinctive speech markers for MSA-P. In addition, speech indexes related to the subsystems of speech production and dysarthria yield encouraging differentiation and disease specificity across the disease groups. Given the small amount of data, two-dimensional speech features are designed such that one disease group predominates in one speech dimension, consequently discriminating the disease groups with good classification scores.

Early differential diagnosis was another critical objective of the current investigation. The present study observed some encouraging indications for early differential diagnosis by exploring the trend of speech markers with respect to clinical signs. We thus hope that the methodology presented in this thesis will serve as a potential diagnostic tool in clinical practice and further inspire the development of automatic methods to investigate speech disorders in parkinsonism.
... The baseline VOTs were also verified manually after the experiment with a semi-automatic algorithm. All measurements were carried out in software (Kuberski et al., 2016) whose performance with respect to other landmark identification methods has been quantified. Response VOT and syllable duration were computed based on the stop release burst, phonation initiation, and cessation landmarks. ...
Article
Full-text available
During a cue-distractor task, participants repeatedly produce syllables prompted by visual cues. Distractor syllables are presented to participants via headphones 150 ms after the visual cue (before any response). The task has been used to demonstrate perceptuomotor integration effects (perception effects on production): response times (RTs) speed up as the distractor shares more phonetic properties with the response. Here it is demonstrated that perceptuomotor integration is not limited to RTs. Voice Onset Times (VOTs) of the distractor syllables were systematically varied and their impact on responses was measured. Results demonstrate trial-specific convergence of response syllables to VOT values of distractor syllables.
Conference Paper
Full-text available
Past studies have demonstrated that reaction times in producing CV syllables are modulated by audio distractors participants hear while preparing their responses. Galantucci et al. [5] showed that participants respond faster on trials with identical response-distractor pairs than when the response and distractor differ in voicing or articulator. The dynamical model of phonological planning developed in [9] attributes these differences in reaction times to phonological parameters of a distractor exciting or inhibiting the planning of a response. A so far untested prediction of the model is that within-category phonetic variation in the voicing parameter of the distractors still gives rise to congruency effects of the articulator. We report new results which show effects of response-distractor articulator congruency for distractors with varying voice onset times. These results extend previous findings by pursuing predictions concerning within-category variability of distractor stimuli.
Article
Full-text available
Of 58 papers published so far this year in Journal of Phonetics, 16 (28%) feature Voice Onset Time (VOT) or related measurements, confirming that VOT remains a central concern in the field. However, phoneticians' VOT measurements generally continue to rely on human judgment, which requires significant labor, makes even large laboratory experiments onerous, and prevents the field from taking full advantage of the millions of hours of digital speech now becoming available. We present an algorithm for accurate automatic measurement of VOT, combining HMM forced alignment for determining approximate stop boundaries with paired burst and voicing onset detectors. Each detector is a frame-level max margin classifier operating on the scale-space projection of a small number of relevant acoustic features. On a large set of clean lab speech, this system has a mean absolute error (relative to human annotation) of only 2.8 ms, with 98% of errors <10 ms. On a subcorpus independently annotated by two of the authors, the system agreed with the two human annotators as well as they agreed with one another (1.49 vs 1.50 ms). Promising results on other datasets will be reported. The system will be released as open-source software.
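As an illustration of one ingredient, the "scale-space projection" of an acoustic feature can be read as stacking Gaussian-smoothed copies of a feature track at several scales, on which a frame-level classifier then operates. The sketch below is a generic rendering of that idea; the scales and kernel truncation are assumptions, and the paper's detectors use richer features than this.

```python
import numpy as np

def scale_space(feature, scales=(1, 2, 4, 8)):
    # Gaussian-smoothed copies of a 1-D feature track at several
    # scales (in frames); returns a (num_frames, num_scales) array.
    cols = []
    for s in scales:
        radius = 3 * s
        t = np.arange(-radius, radius + 1)
        k = np.exp(-0.5 * (t / s) ** 2)
        k /= k.sum()  # normalize so smoothing preserves the mean level
        cols.append(np.convolve(feature, k, mode="same"))
    return np.stack(cols, axis=1)
```

Coarser scales flatten abrupt events while finer scales keep them sharp, so the stacked projection lets a classifier weigh abruptness against local context.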
Article
Full-text available
A discriminative large-margin algorithm for automatic measurement of voice onset time (VOT) is described, considered as a case of predicting structured output from speech. Manually labeled data are used to train a function that takes as input a speech segment of an arbitrary length containing a voiceless stop, and outputs its VOT. The function is explicitly trained to minimize the difference between predicted and manually measured VOT; it operates on a set of acoustic feature functions designed based on spectral and temporal cues used by human VOT annotators. The algorithm is applied to initial voiceless stops from four corpora, representing different types of speech. Using several evaluation methods, the algorithm's performance is near human intertranscriber reliability, and compares favorably with previous work. Furthermore, the algorithm's performance is minimally affected by training and testing on different corpora, and remains essentially constant as the amount of training data is reduced to 50-250 manually labeled examples, demonstrating the method's practical applicability to new datasets.
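The decoding side of such a structured predictor reduces to scoring every candidate (burst onset, voicing onset) pair and returning the argmax. The sketch below uses a made-up pair feature map phi(tb, tv) = [feats[tb], feats[tv], tv - tb] and a given weight vector; the paper's actual feature functions are richer, and its weights are learned with a large-margin objective rather than hand-set.

```python
import numpy as np

def decode_vot(frame_feats, w):
    # frame_feats: (T, d) per-frame acoustic features; w: weights over
    # the pair feature map phi(tb, tv) = [feats[tb], feats[tv], tv - tb].
    # Brute-force argmax over all burst/voicing-onset frame pairs.
    T = len(frame_feats)
    best, best_pair = -np.inf, None
    for tb in range(T):
        for tv in range(tb + 1, T):
            phi = np.concatenate([frame_feats[tb], frame_feats[tv],
                                  [float(tv - tb)]])
            score = float(w @ phi)
            if score > best:
                best, best_pair = score, (tb, tv)
    return best_pair  # predicted VOT is (tv - tb) frames
```

For realistic utterance lengths the quadratic search is restricted to a window around the forced-alignment stop boundary.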
Article
Full-text available
We examined the voice onset times (VOTs) of monolingual and bilingual speakers of English and French to address the question of whether cross-language phonetic influences occur particularly in simultaneous bilinguals (that is, speakers who learned both languages from birth). Speakers produced sentences containing target words with initial /p/, /t/ or /k/. In French, natively bilingual speakers produced VOTs that were significantly longer than those of monolingual French speakers. French VOTs were even longer in bilingual speakers who learned English before learning French. The outcome was analogous in English speech. Natively bilingual speakers produced shorter English VOTs than monolingual speakers. English VOTs were even shorter in the speech of bilinguals who learned French before English. Bilingual speakers had significantly longer VOTs in their English speech than in their French. Accordingly, the cross-language effects do not occur because natively bilingual speakers adopt voiceless stop categories, intermediate between those of native English and French speakers, that serve both languages. Monolingual speakers of French or English in Montreal had VOTs nearly identical, respectively, to those of monolingual Parisian French speakers and monolingual Connecticut English speakers. These results suggest that mere exposure to a second language does not underlie the cross-language phonetic effect; however, these findings must be reconciled with others that appear to show an effect of overhearing.
Thesis
Thesis (Ph.D.)--Cornell University, August, 1996. Includes bibliographical references (leaves 343-365).
Article
Automatic and accurate detection of the closure-burst transition events of stops and affricates serves many applications in speech processing. A temporal measure named the plosion index is proposed to detect such events, which are characterized by an abrupt increase in energy. Using the maxima of the pitch-synchronous normalized cross correlation as an additional temporal feature, a rule-based algorithm is designed that aims at selecting only those events associated with the closure-burst transitions of stops and affricates. The performance of the algorithm, characterized by receiver operating characteristic curves and temporal accuracy, is evaluated using the labeled closure-burst transitions of stops and affricates of the entire TIMIT test and training databases. The robustness of the algorithm is studied with respect to global white and babble noise as well as local noise using the TIMIT test set and on telephone quality speech using the NTIMIT test set. For these experiments, the proposed algorithm, which does not require explicit statistical training and is based on two one-dimensional temporal measures, gives a performance comparable to or better than the state-of-the-art methods. In addition, to test the scalability, the algorithm is applied on the Buckeye conversational speech corpus and databases of two Indian languages.
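The second temporal measure, the maximum of the normalized cross-correlation, can be approximated by a generic autocorrelation-based voicing cue as below. The lag range (80-400 Hz at an assumed 16 kHz sampling rate) is an assumption, and the paper's version is pitch-synchronous rather than fixed-frame.

```python
import numpy as np

def max_norm_xcorr(frame, lag_min=40, lag_max=200):
    # Maximum normalized cross-correlation of a frame with itself over
    # candidate pitch lags (40-200 samples ~ 80-400 Hz at 16 kHz).
    # Near 1 for periodic (voiced) frames, low for aperiodic ones.
    best = 0.0
    for lag in range(lag_min, lag_max):
        a, b = frame[:-lag], frame[lag:]
        denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-12
        best = max(best, float(np.dot(a, b) / denom))
    return best
```

Pairing this periodicity cue with the plosion index lets a rule-based selector keep only those energy bursts that are followed by voicing, i.e. closure-burst transitions.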
Article
Preliminary results from eight participants in a cross-linguistic investigation of phonetic accommodation in speech production and perception are presented. The finding that synchronous actions are more stable than asynchronous ones has been reported in studies of general [Kelso (1981)] and speech-specific [Browman and Goldstein (1992), Byrd et al. (2009)] motor control. With reference to glottal-oral timing, near-zero VOTs (voice onset times) are representative of near-synchronous timing, whereas long-lag VOTs are representative of asynchronous timing [Sawashima and Hirose (1980), Dixit (1984), Lofqvist and Yoshioka (1989), Fuchs (2005)]. These observations served as a basis for the prediction that native speakers of Korean, with its long-lag aspirated stops (~120 ms), would more readily accommodate to typical English voiceless stop VOT (~70 ms) than native speakers of Spanish, with its short-lag voiceless stops (~20 ms). Spanish-English and Korean-English bilinguals were recorded reading voiceless stop-initial English words, before and during a task in which participants shadowed recorded productions of a native speaker of American English. Preliminary analysis of the production data provides some support for these hypotheses. The results contribute to our understanding of the conditions that promote phonetic accommodation.
Article
We describe an algorithm to automatically estimate the voice onset time (VOT) of plosives. The VOT is the time delay between the burst onset and the start of periodicity when it is followed by a voiced sound. Since the VOT is affected by factors like place of articulation and voicing it can be used for inference of these factors. The algorithm uses the reassignment spectrum of the speech signal, a high resolution time–frequency representation which simplifies the detection of the acoustic events in a plosive. The performance of our algorithm is evaluated on a subset of the TIMIT database by comparison with manual VOT measurements. On average, the difference is smaller than 10 ms for 76.1% and smaller than 20 ms for 91.4% of the plosive segments. We also provide analysis statistics of the VOT of /b/, /d/, /g/, /p/, /t/ and /k/ and experimentally verify some sources of variability. Finally, to illustrate possible applications, we integrate the automatic VOT estimates as an additional feature in an HMM-based speech recognition system and show a small but statistically significant improvement in phone recognition rate.
Article
The voice onset time (VOT) of a stop consonant is the interval between its burst onset and voicing onset. Among a variety of research topics on VOT, one that has been studied for years is how VOTs are efficiently measured. Manual annotation is a feasible way, but it becomes a time-consuming task when the corpus size is large. This paper proposes an automatic VOT estimation method based on an onset detection algorithm. At first, a forced alignment is applied to identify the locations of stop consonants. Then a random forest based onset detector searches each stop segment for its burst and voicing onsets to estimate a VOT. The proposed onset detection can detect the onsets in an efficient and accurate manner with only a small amount of training data. The evaluation data extracted from the TIMIT corpus were 2344 words with a word-initial stop. The experimental results showed that 83.4% of the estimations deviate less than 10 ms from their manually labeled values, and 96.5% of the estimations deviate by less than 20 ms. Some factors that influence the proposed estimation method, such as place of articulation, voicing of a stop consonant, and quality of succeeding vowel, were also investigated.
Article
This work is a component of a proposed knowledge‐based speech recognition system which uses landmarks to guide the search for distinctive features. In the speech signal, landmarks identify times when the acoustic manifestations of the linguistically motivated distinctive features are most salient. This paper describes an algorithm for automatically detecting acoustically abrupt landmarks. Some examples of acoustically abrupt landmarks are stop closures and releases, nasal closures and releases, and the point of cessation of free vocal fold vibration due to a velopharyngeal port closure at a nasal‐to‐obstruent juncture. As a consequence of landmark detection, the algorithm provides estimates of the broad phonetic class (articulator‐free features) of the underlying segment. The algorithm is hierarchically structured, and is rooted in linguistic and speech production theory. It uses several factors to detect landmarks: energy abruptness in five frequency bands and at two levels of temporal resolution, segmental duration, broad phonetic class constraints, and articulatory constraints. Tested on a database of continuous, clean speech of women and men, the landmark detector has detection rates over 90%. A large majority of the detections were within 20 ms of the landmark transcription, and almost all were within 30 ms. The results are analyzed by landmark type and phonetic class.
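The central abruptness measure of such a detector, energy rate-of-rise per frequency band, can be sketched as follows. The band edges, frame sizes, and single temporal resolution here are simplifications of the published algorithm, chosen only for illustration.

```python
import numpy as np

def band_energy_ror(x, fs, bands, win=0.006, hop=0.002):
    # Per-frame log energy (dB) in each frequency band, then the first
    # difference across frames: large positive values mark abrupt
    # onsets, large negative values abrupt offsets.
    w, h = int(win * fs), int(hop * fs)
    freqs = np.fft.rfftfreq(w, 1 / fs)
    logE = []
    for i in range(0, len(x) - w, h):
        spec = np.abs(np.fft.rfft(x[i:i + w] * np.hanning(w))) ** 2
        logE.append([10 * np.log10(spec[(freqs >= lo) & (freqs < hi)].sum()
                                   + 1e-12)
                     for lo, hi in bands])
    return np.diff(np.asarray(logE), axis=0)  # (num_frames - 1, num_bands)
```

A landmark detector would then peak-pick this matrix, requiring coordinated abruptness across bands and applying the duration and phonetic-class constraints described above.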
Article
As a first step toward automatic phonetic analysis of speech, one desires to segment the signal into syllable-sized units. Experiments were conducted in automatic segmentation techniques for continuous, reading-rate speech to derive such units. A new segmentation algorithm is described that allows assessment of the significance of a loudness minimum to be a potential syllabic boundary from the difference between the convex hull of the loudness function and the loudness function itself. Tested on roughly 400 syllables of continuous text, the algorithm results in 6.9% syllables missed and 2.6% extra syllables relative to a nominal, slow-speech syllable count. It is suggested that inclusion of alternative fluent-form syllabifications for multisyllabic words and the use of phonological rules for predicting syllabic contractions can further improve agreement between predicted and experimental syllable counts. Subject Classification: 70.40, 70.60.
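The convex-hull test can be sketched as follows (a generic reimplementation of the idea, not the paper's code): build the upper convex hull of the loudness contour, then recursively split at the deepest minimum whose distance below the hull exceeds a threshold. The 2 dB default and the toy contour in the usage note are assumptions.

```python
import numpy as np

def upper_hull(y):
    # Upper convex hull of the points (i, y[i]), evaluated at every i.
    pts = []  # indices of hull vertices, left to right
    for i in range(len(y)):
        while len(pts) >= 2:
            i0, i1 = pts[-2], pts[-1]
            # drop i1 if it lies on or below the chord from i0 to i
            if (y[i1] - y[i0]) * (i - i0) <= (y[i] - y[i0]) * (i1 - i0):
                pts.pop()
            else:
                break
        pts.append(i)
    h = np.empty(len(y))
    for a, b in zip(pts, pts[1:]):
        h[a:b + 1] = np.linspace(y[a], y[b], b - a + 1)
    return h

def split_points(loud, thresh=2.0):
    # Recursively split at the loudness minimum whose depth below the
    # convex hull exceeds `thresh` (dB); returns boundary indices.
    loud = np.asarray(loud, float)
    if len(loud) < 3:
        return []
    dip = upper_hull(loud) - loud
    i = int(np.argmax(dip))
    if dip[i] < thresh:
        return []
    return (split_points(loud[:i], thresh) + [i]
            + [i + 1 + j for j in split_points(loud[i + 1:], thresh)])
```

On a toy contour with two loudness humps, the single qualifying dip between them is returned as the syllable boundary, while shallow dips inside a hump are ignored.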