Article

MAUS goes iterative

Abstract

In this paper we describe further developments of the MAUS system and announce a freeware software package that may be downloaded from the 'Bavarian Archive for Speech Signals' (BAS) web site. The quality of the MAUS output can be considerably improved by using an iterative technique. In this mode MAUS calculates a first pass through all the target speech material using the standard speaker-independent acoustical models of the target language. The segmented and labelled speech data are then used to re-estimate the acoustical models, and the MAUS procedure is applied to the speech data again using these speaker-dependent models. The last two steps are repeated iteratively until the segmentation converges. The paper describes the general algorithm, the German benchmark for evaluating the method, as well as some experiments on German target speakers.
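The iterative procedure can be sketched as a toy loop. `align` and `reestimate` below are deliberately simplified stand-ins (a single boundary and scalar segment "models") for MAUS's Viterbi alignment and HMM re-estimation; they are not the actual BAS implementation.

```python
def align(signal, means):
    """Toy 'forced alignment': place one boundary so each side best
    matches its model mean (stand-in for the Viterbi alignment pass)."""
    best_b, best_cost = 1, float("inf")
    for b in range(1, len(signal)):
        cost = (sum((x - means[0]) ** 2 for x in signal[:b])
                + sum((x - means[1]) ** 2 for x in signal[b:]))
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b

def reestimate(signal, b):
    """Re-estimate the 'acoustic models' (here: segment means) from the
    current segmentation -- the speaker-dependent adaptation step."""
    left, right = signal[:b], signal[b:]
    return [sum(left) / len(left), sum(right) / len(right)]

def iterative_maus(signal, init_means, max_iter=20):
    """First pass with speaker-independent models, then alternate
    re-estimation and re-alignment until the segmentation converges."""
    b = align(signal, init_means)          # first pass
    for _ in range(max_iter):
        means = reestimate(signal, b)      # speaker-dependent models
        new_b = align(signal, means)       # re-segment
        if new_b == b:                     # segmentation converged
            break
        b = new_b
    return b

signal = [0.1, 0.2, 0.1, 0.9, 1.1, 1.0]
print(iterative_maus(signal, [0.5, 0.6]))  # index of the found boundary
```

On real data the convergence check would compare whole boundary sets rather than a single index, but the structure of the loop is the same.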

Article
The Scottish English phoneme inventory is generally claimed to have a /ʍ/-/w/ contrast, although several studies have suggested that this historical contrast is weakening for Scottish English speakers in the urban areas of Glasgow, Edinburgh and Aberdeen. Little is known about whether the /ʍ/-/w/ contrast is maintained in supraregional Scottish Standard English (SSE). This study sets out to explore, based on the phonemically transcribed ICE-Scotland corpus, the distribution of [ʍ] and [w] in SSE, their acoustic properties and potentially influencing social and language-internal factors. A total of 1,241 tokens were extracted from the corpus, together with a matching number of tokens, and the median of harmonicity was measured. The results show that [ʍ] and [w] produced for words beginning with are acoustically distinct from [w] produced for words beginning with . [ʍ] is relatively frequent in SSE, but most speakers use both [ʍ] and [w] interchangeably for and some never use [ʍ]. The realisation of as [ʍ] is determined by preceding phonetic context and speaker gender.
Article
Purpose Heterogeneous child speech was force-aligned to investigate whether (a) manipulating specific parameters could improve alignment accuracy and (b) forced alignment could be used to replicate published results on acoustic characteristics of /s/ production by children. Method In Part 1, child speech from 2 corpora was force-aligned with a trainable aligner (Prosodylab-Aligner) under different conditions that systematically manipulated input training data and the type of transcription used. Alignment accuracy was determined by comparing hand and automatic alignments as to how often they overlapped (%-Match) and absolute differences in duration and boundary placements. Using mixed-effects regression, accuracy was modeled as a function of alignment conditions, as well as segment and child age. In Part 2, forced alignments derived from a subset of the alignment conditions in Part 1 were used to extract spectral center of gravity of /s/ productions from young children. These findings were compared to published results that used manual alignments of the same data. Results Overall, the results of Part 1 demonstrated that using training data more similar to the data to be aligned as well as phonetic transcription led to improvements in alignment accuracy. Speech from older children was aligned more accurately than younger children. In Part 2, /s/ center of gravity extracted from force-aligned segments was found to diverge in the speech of male and female children, replicating the pattern found in previous work using manually aligned segments. This was true even for the least accurate forced alignment method. Conclusions Alignment accuracy of child speech can be improved by using more specific training and transcription. However, poor alignment accuracy was not found to impede acoustic analysis of /s/ produced by even very young children. Thus, forced alignment presents a useful tool for the analysis of child speech.
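The two accuracy measures from Part 1 can be illustrated with a small sketch; the segment tuples and the 10-ms frame step below are illustrative assumptions, not the paper's exact evaluation code.

```python
def label_at(segments, t):
    """Label of the segment containing time t (segments: (label, start, end))."""
    for label, start, end in segments:
        if start <= t < end:
            return label
    return None

def percent_match(hand, auto, step=0.01):
    """%-Match: share of fixed-step analysis frames on which the hand
    and automatic alignments assign the same label."""
    total = max(end for _, _, end in hand)
    n_frames = round(total / step)
    hits = sum(label_at(hand, i * step) == label_at(auto, i * step)
               for i in range(n_frames))
    return hits / n_frames

def boundary_errors(hand, auto):
    """Absolute differences (s) between corresponding segment end boundaries."""
    return [abs(h[2] - a[2]) for h, a in zip(hand, auto)]

hand = [("s", 0.00, 0.10), ("a", 0.10, 0.30)]
auto = [("s", 0.00, 0.12), ("a", 0.12, 0.30)]
print(percent_match(hand, auto), boundary_errors(hand, auto))
```

Here the automatic /s/ boundary is 20 ms late, so two of the thirty 10-ms frames disagree.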
Article
The present study investigates the role of articulatory and perceptual factors in the change from pre- to post-aspiration in two varieties of Andalusian Spanish. In an acoustic study, the influence of stop type, speaker age, and variety on the production of pre- and post-aspiration was analyzed in isolated words produced by 24 speakers of a Western and 24 of an Eastern variety, both divided into two age groups. The results confirmed previous findings of a sound change from pre- to post-aspiration in both varieties. Velar stops showed the longest, bilabials the shortest, and dental stops intermediate pre- and post-aspiration durations. The observed universal VOT pattern was not found for younger Western Andalusian speakers, who showed a particularly long VOT in /st/-sequences. A perception experiment with the same subjects as listeners showed that post-aspiration was used as a cue for distinguishing the minimal pair /pata/-/pasta/ by almost all listeners. Production-perception comparisons suggested a relationship between production and perception: subjects who produced long post-aspiration were also more sensitive to this cue. In sum, the results suggest that the sound change has first been actuated in the dental context, possibly due to a higher perceptual prominence of post-aspiration in this context, and that post-aspirated stops in Andalusian Spanish are on their way to being phonologized.
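For readers unfamiliar with the measures, pre- and post-aspiration durations are simple differences between acoustic landmarks; the landmark times below are made up for illustration and are not taken from the study.

```python
def aspiration_durations(voicing_offset, closure_onset, burst, voicing_onset):
    """Durations (s) from landmarks of a vowel + stop sequence:
    pre-aspiration  = aspiration noise between the end of vowel voicing
                      and the stop closure;
    post-aspiration = voice onset time (VOT), burst to voicing onset."""
    pre = closure_onset - voicing_offset
    post = voicing_onset - burst
    return pre, post

# Hypothetical landmark times (s) for one /pasta/-like token
pre, post = aspiration_durations(0.142, 0.180, 0.245, 0.268)
print(round(pre * 1000), "ms pre-aspiration,", round(post * 1000), "ms VOT")
```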
Article
The study is concerned with a sound change in progress by which a post-vocalic, pre-consonantal /s-ʃ/ contrast in the standard variety of German (SG) in words such as west/wäscht (/vɛst/~/vɛʃt/, west/washes) is influencing the Augsburg German (AG) variety, in which the two have hitherto been neutralized as /veʃt/. Two of the main issues to be considered are whether the change is necessarily categorical, and the extent to which the change affects speech production and perception equally. For the production experiment, younger and older AG and SG speakers merged syllables of hypothetical town names to create a blend at the potential neutralization site. These results showed a trend for a progressively greater /s-ʃ/ differentiation in the order older AG, younger AG, and SG speakers. For the perception experiment, the same subjects who had participated in the production experiment gave forced-choice responses to a 16-step /s-ʃ/ continuum that was embedded in two contexts: /mIst-mIʃt/, in which /s-ʃ/ are neutralized in AG, and /və'mIsə/-/və'mIʃə/, in which they are not. The results from both experiments are indicative of a sound change in progress such that the neutralization is being undone under the influence of SG, but in such a way that there is a gradual shift between categories. The closer approximation of the groups in perception suggests that the sound change may be more advanced in this modality than in production. Overall, the findings are consistent with the idea that phonological contrasts are experience-based, i.e., a continuous function of the extent to which a subject is exposed to, and makes use of, the distinction, and are thus compatible with exemplar models of speech.
Conference Paper
This article describes a general framework for detecting accident-prone fatigue states based on prosody, articulation and speech-quality related speech characteristics. The advantages of this real-time measurement approach are that obtaining speech data is non-obtrusive and free from sensor application and calibration efforts. The main part of the feature computation is the combination of frame-level speech features and high-level contour descriptors, resulting in over 8,500 features per speech sample. In general the measurement process follows the speech-adapted steps of pattern recognition: (a) recording speech, (b) preprocessing (segmenting speech units of interest), (c) feature computation (using perceptual and signal-processing related features, e.g. fundamental frequency, intensity, pause patterns, formants, cepstral coefficients), (d) dimensionality reduction (filter- and wrapper-based feature subset selection, (un-)supervised feature transformation), (e) classification (e.g. SVM, k-NN classifier), and (f) evaluation (e.g. 10-fold cross validation). The validity of this approach is briefly discussed by summarizing the empirical results of a sleep deprivation study.
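Step (c) and the way the feature count grows into the thousands can be sketched as follows; the single energy feature and the five functionals are illustrative stand-ins for the far richer feature set described above.

```python
import statistics

def frame_features(signal, frame_len=160):
    """Step (c), frame level: one toy feature per frame (here: mean energy);
    a real system would add F0, intensity, formants, cepstra, etc."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, frame_len)]
    return [sum(x * x for x in f) / frame_len for f in frames]

def contour_descriptors(contour):
    """High-level functionals over a per-frame contour. Applying many such
    descriptors to many frame-level contours is how the feature count
    multiplies into the thousands."""
    return {
        "mean": statistics.mean(contour),
        "stdev": statistics.pstdev(contour),
        "min": min(contour),
        "max": max(contour),
        "range": max(contour) - min(contour),
    }

signal = [((i % 50) - 25) / 25.0 for i in range(1600)]  # stand-in samples
energy = frame_features(signal)
print(contour_descriptors(energy))
```

With, say, 100 frame-level contours and 85 functionals each, the 8,500-feature figure quoted above follows directly.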
Article
The main purpose of this study was to compare acoustically the vowel spaces of two groups of cochlear implantees (CI) with two age-matched normal-hearing groups. Five young test persons (15-25 years) and five older test persons (55-70 years) with CI, together with two normal-hearing control groups of the same ages, were recorded. The speech material consisted of five German vowels V = /a, e, i, o, u/ in bilabial and alveolar contexts. The results showed no differences between the groups in Euclidean distances for the first formant frequency. In contrast, Euclidean distances for F2 of the CI group were shorter than those of the control group, causing their overall vowel space to be compressed. The main differences between the groups are interpreted in terms of the extent to which the formants are associated with visual cues to the vowels. The CI speakers also showed partially longer vowel durations.
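The compression effect along F2 can be illustrated with per-formant distances from the vowel-space centroid; the formant values below are hypothetical round numbers, not the study's measurements.

```python
# Hypothetical mean formant values (F1, F2) in Hz -- illustrative only.
normal  = {"a": (750, 1300), "i": (300, 2300), "u": (320, 800)}
implant = {"a": (750, 1350), "i": (300, 2000), "u": (320, 1000)}

def formant_distances(vowels):
    """Per-formant distance of each vowel from the vowel-space centroid;
    shorter F2 distances indicate a space compressed along F2."""
    f1c = sum(f1 for f1, _ in vowels.values()) / len(vowels)
    f2c = sum(f2 for _, f2 in vowels.values()) / len(vowels)
    return {v: (abs(f1 - f1c), abs(f2 - f2c))
            for v, (f1, f2) in vowels.items()}

def mean_f2_distance(vowels):
    """Average F2 distance from the centroid across all vowels."""
    return sum(d2 for _, d2 in formant_distances(vowels).values()) / len(vowels)

print(mean_f2_distance(normal), mean_f2_distance(implant))
```

In this toy example the F1 values are identical across groups while the implant group's F2 values sit closer to the centroid, mirroring the reported pattern.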
Conference Paper
In this paper we present a hybrid statistical and rule-based segmentation system which takes into account phonetic variation of German. Input to the system is the orthographic representation and the speech signal of an utterance to be segmented. The output is the transcription (SAM-PA) with the highest overall likelihood and the corresponding segmentation of the speech signal. The system consists of three main parts: In a first stage the orthographic representation is converted into a linear string of phonetic units by lexicon lookup. Phonetic rules are applied, yielding a graph that contains the canonic form and presumed variations. In a second, HMM-based stage the speech signal of the utterance concerned is time-aligned by a Viterbi search which is constrained by the graph of the first stage. The outcome of this stage is a string of phonetic labels and the corresponding segment boundaries. A rule-based refinement of the segment boundaries using phonetic knowledge takes place in a third stage.
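The first stage's rule expansion can be sketched as follows. This toy flattens the variant graph into an explicit list and omits the HMM/Viterbi stage entirely; the schwa-deletion rule shown is just one example, and the SAM-PA-like symbols are illustrative.

```python
def apply_rules(canonic, rules):
    """Expand the canonic phone string into presumed pronunciation
    variants by optionally applying each rewrite rule at every match
    (a flat stand-in for the variant graph of the first stage)."""
    variants = {tuple(canonic)}
    for target, replacement in rules:
        new = set()
        for var in variants:
            v = list(var)
            for i in range(len(v) - len(target) + 1):
                if v[i:i + len(target)] == list(target):
                    new.add(tuple(v[:i] + list(replacement)
                                  + v[i + len(target):]))
        variants |= new
    return sorted(list(v) for v in variants)

# Canonic form of German "haben" with a schwa-deletion rule /@n/ -> /n/
variants = apply_rules(["h", "a:", "b", "@", "n"], [(["@", "n"], ["n"])])
print(variants)
```

In the real system these variants form a graph that constrains the Viterbi search, so the aligner can only choose among phonetically plausible pronunciations.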
Conference Paper
In very large and diverse scientific projects, where groups as different as linguists and engineers work with different intentions on the same signal data or its orthographic transcript and annotate new valuable information, it is not easy to build a homogeneous corpus. We describe how this can be achieved, considering the fact that some of these annotations have not been updated properly, or are based on erroneous or deliberately changed versions of the basis transcription. We used an algorithm similar to dynamic programming to detect differences between the transcription on which an annotation depends and the reference transcription for the whole corpus. These differences are automatically mapped onto a set of repair operations for the transcriptions, such as splitting compound words and merging neighbouring words. On the basis of these operations the correction process in the annotation is carried out. Whether a correction can be carried out automatically or has to be fixed manually always depends on the type of the annotation as well as on the position and the nature of the difference. Finally, we present an investigation in which we exploit the multi-tier annotations of the Verbmobil corpus to find out how breathing is correlated with prosodic-syntactic boundaries and dialog acts.
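The mapping from detected differences to repair operations can be sketched with Python's difflib as a stand-in for the dynamic-programming comparison; the classification below handles only the split and merge cases named above and flags everything else for manual correction.

```python
import difflib

def repair_ops(annotated, reference):
    """Detect differences between the transcript an annotation is based
    on and the reference transcript, and map them onto repair operations
    ('split' a compound word, 'merge' neighbouring words, or 'manual')."""
    ops = []
    sm = difflib.SequenceMatcher(a=annotated, b=reference)
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag == "equal":
            continue
        old, new = annotated[i1:i2], reference[j1:j2]
        if len(old) == 1 and "".join(new) == old[0]:
            ops.append(("split", old[0], new))      # compound was split
        elif len(new) == 1 and "".join(old) == new[0]:
            ops.append(("merge", old, new[0]))      # neighbours were merged
        else:
            ops.append(("manual", old, new))        # needs hand correction
    return ops

annotated = ["der", "Vormittagstermin", "passt"]
reference = ["der", "Vormittags", "termin", "passt"]
print(repair_ops(annotated, reference))
```

Whether each operation can then be applied automatically would, as described above, still depend on the annotation tier and the position of the difference.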