Conference Paper

Minimum Sample Length for the Estimation of Long-term Speaking Rate


Abstract

In this study, we expand on previous experiments designed with the aim of determining the minimum length that an audio sample should have in order for the speaking rate derived from it to be representative of the sample as a whole. We compare two different approaches to establishing that the time series of the cumulative speaking rate calculated over the audio sample has reached stability. We also compare the effect on stabilization time of four other factors that may affect the way speaking rate is calculated. The results show that all factors tested have significant effects, although of limited practical concern. Overall, average stability time is 12.1 seconds, with the bulk of the distribution lying between 7.9 and 16.2 s.
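The abstract does not spell out the two stability criteria that were compared, but the underlying idea of a cumulative speaking-rate time series settling around its final value can be sketched in a few lines of Python. The sketch below is a minimal illustration under assumed conditions (synthetic unit boundaries, a 5% tolerance band, and placeholder function names), not the paper's actual procedure.

import numpy as np

def cumulative_rate(unit_end_times):
    # Cumulative speaking rate (units per second) after each unit boundary.
    unit_end_times = np.asarray(unit_end_times, dtype=float)
    counts = np.arange(1, len(unit_end_times) + 1)
    return counts / unit_end_times

def stabilization_time(unit_end_times, tol=0.05):
    # First time after which the cumulative rate stays within a
    # +/- tol (relative) band around its final value.
    unit_end_times = np.asarray(unit_end_times, dtype=float)
    rates = cumulative_rate(unit_end_times)
    outside = np.abs(rates - rates[-1]) > tol * rates[-1]
    violations = np.flatnonzero(outside)
    if violations.size == 0:
        return unit_end_times[0]          # stable from the first unit on
    last = violations[-1]
    return unit_end_times[min(last + 1, len(unit_end_times) - 1)]

# Synthetic example: ~80 syllable-sized units of 150-350 ms each
rng = np.random.default_rng(0)
unit_ends = np.cumsum(rng.uniform(0.15, 0.35, size=80))
print(f"stabilization at about {stabilization_time(unit_ends):.1f} s")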
... Such a phonetic unit comprises all the segments uttered between two consecutive vowel onsets, with the onset of the following vowel defining the beginning of a new V-V unit. It has been studied and employed among others by [11][12][13][14], with its application tracing back at least to [15,16]. ...
... Overall, 851 speech chunks were analyzed, an average of 42 chunks and 2.30 min of transcribed material per subject, resulting in an average of 45.5 speech rate and articulation rate data points per subject. Regarding the total length of transcribed material used in the present study, the experiment carried out by [14] with different linguistic units (phone, syllable, V-V unit, and word) suggests an average stabilization time for speaking rate parameters of 12.1 seconds, with vowel-to-vowel (V-V) units yielding the shortest stabilization time (9.44 s). This outcome suggests that the average length of the recordings used in the present study is representative, being at least ten times longer than the minimum recommended in the specialized literature. ...
Article
Full-text available
The purpose of this study was to assess the speaker-discriminatory potential of a set of speech timing parameters while probing their suitability for forensic speaker comparison applications. The recordings comprised spontaneous dialogues between twin pairs over mobile phones, recorded directly with professional headset microphones. Speaker comparisons were performed with twin speakers engaged in a dialogue (i.e., intra-twin pairs) and among all subjects (i.e., cross-twin pairs). The participants were 20 Brazilian Portuguese speakers, ten male identical twin pairs from the same dialectal area. A set of 11 speech timing parameters was extracted and analyzed, including speech rate, articulation rate, syllable duration (V-V unit), vowel duration, and pause duration. Three system performance estimates were considered for assessing the suitability of the parameters for speaker comparison purposes, namely global Cllr, EER, and AUC values. These were interpreted while also taking into consideration the analysis of effect sizes. Overall, speech rate and articulation rate were found to be the most reliable parameters, displaying the largest effect sizes for the factor "speaker" and the best system performance outcomes, namely the lowest Cllr and EER and the highest AUC values. Conversely, smaller effect sizes were found for the other parameters, which is compatible with a lower explanatory potential of speaker identity on the duration of such units and a possibly higher degree of linguistic control over their temporal variation. In addition, there was a tendency for speech timing estimates based on larger temporal intervals to present larger effect sizes and better speaker-discriminatory performance. Finally, identical twin pairs were found to be remarkably similar in their speech temporal patterns at the macro and micro levels while engaging in a dialogue, resulting in poor system discriminatory performance. Possible underlying factors for such a striking convergence in identical twins' speech timing patterns are presented and discussed.
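As a rough illustration of how the system performance estimates named above can be obtained, the sketch below computes AUC and an equal error rate (EER) estimate from hypothetical same-speaker and different-speaker comparison scores. The score distributions and the use of scikit-learn are assumptions made for the example, not the study's actual pipeline; Cllr additionally requires calibrated likelihood ratios and is omitted here.

import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

rng = np.random.default_rng(1)
same_spk = rng.normal(1.0, 1.0, 200)     # placeholder same-speaker scores
diff_spk = rng.normal(-1.0, 1.0, 2000)   # placeholder different-speaker scores

scores = np.concatenate([same_spk, diff_spk])
labels = np.concatenate([np.ones_like(same_spk), np.zeros_like(diff_spk)])

auc = roc_auc_score(labels, scores)
fpr, tpr, _ = roc_curve(labels, scores)
fnr = 1.0 - tpr
eer = fpr[np.nanargmin(np.abs(fnr - fpr))]   # operating point where FPR is closest to FNR
print(f"AUC = {auc:.3f}, EER = {eer:.3f}")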
Conference Paper
Full-text available
We investigated long-term mean, median and base value of F0 to estimate how long it takes their variability to stabilize. Change point analysis was used to locate stabilization points. In one experiment, stabilization points were calculated in recordings of the same text spoken in 26 languages. Average stabilization points are 5 seconds for base value and 10 seconds for mean and median. Variance after the stabilization point was reduced around 40 times for mean and median and more than 100 times for the base value. In another experiment, four speakers read two different texts each. Stabilization points for the same speaker across the texts do not exactly coincide as would be ideally expected. Average change point dislocation is 2.5 seconds for the base value, 3.4 for the median and 9.5 for the mean. After stabilization, individual differences in the three measures obtained from the two texts are 2% on average. Present results show that stabilization points in long-term measures of F0 occur earlier than suggested in the previous literature.
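The abstract does not name the software used for the change point analysis. As an illustration of the idea, the sketch below locates a single change point in the running mean of a synthetic F0 track, using the Python ruptures library as an assumed stand-in for whatever tooling the study actually used.

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(2)
f0 = rng.normal(120.0, 15.0, size=600)   # synthetic F0 track, 10 values per second (~60 s)
running_mean = np.cumsum(f0) / np.arange(1, len(f0) + 1)

# Single change point in the running-mean series under a least-squares cost
algo = rpt.Binseg(model="l2").fit(running_mean)
change_idx, _ = algo.predict(n_bkps=1)   # predict() also returns the series length
print(f"estimated stabilization point at about {change_idx / 10:.1f} s")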
Article
Full-text available
One of the key challenges in changepoint analysis is the ability to detect multiple changes within a given time series or sequence. The changepoint package has been developed to provide users with a choice of multiple changepoint search methods to use in conjunction with a given changepoint method and in particular provides an implementation of the recently proposed PELT algorithm. This article describes the search methods which are implemented in the package as well as some of the available test statistics whilst highlighting their application with simulated and practical examples. Particular emphasis is placed on the PELT algorithm and how results differ from the binary segmentation approach.
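The article describes the R changepoint package. As a hedged sketch of the same PELT idea in Python, the example below uses the ruptures library (an assumed stand-in with a different interface) to segment a piecewise-constant signal under an l2 cost, with a penalty that controls how many change points are retained.

import numpy as np
import ruptures as rpt

rng = np.random.default_rng(3)
# Piecewise-constant mean signal with three true change points
signal = np.concatenate([
    rng.normal(0.0, 1.0, 100),
    rng.normal(3.0, 1.0, 100),
    rng.normal(1.0, 1.0, 100),
    rng.normal(4.0, 1.0, 100),
])

algo = rpt.Pelt(model="l2", min_size=5).fit(signal)
breakpoints = algo.predict(pen=10)   # larger penalty -> fewer detected changes
print(breakpoints)                   # last entry is the series length, e.g. [100, 200, 300, 400]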
Book
This book provides a fascinating account of the psycholinguistic and social factors behind variation in speech timing in US English. With detailed discussions of its methods and data, it also acts as a valuable model for conducting corpus (socio)phonetic research.
Article
Given C samples, with ni observations in the ith sample, a test of the hypothesis that the samples are from the same population may be made by ranking the observations from 1 to Σni (giving each observation in a group of ties the mean of the ranks tied for), finding the C sums of ranks, and computing a statistic H. Under the stated hypothesis, H is distributed approximately as chi-square with C − 1 degrees of freedom, unless the samples are too small, in which case special approximations or exact tables are provided. One of the most important applications of the test is in detecting differences among the population means. (Based in part on research supported by the Office of Naval Research at the Statistical Research Center, University of Chicago.)
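A minimal worked example of the Kruskal-Wallis test described above, using SciPy's implementation on three hypothetical samples (the data below are placeholders, not taken from any of the cited studies):

import numpy as np
from scipy.stats import kruskal

rng = np.random.default_rng(4)
a = rng.normal(5.0, 1.0, 12)    # hypothetical sample 1
b = rng.normal(5.5, 1.0, 15)    # hypothetical sample 2
c = rng.normal(6.2, 1.0, 10)    # hypothetical sample 3

# H = 12 / (N (N + 1)) * sum(R_i**2 / n_i) - 3 (N + 1), corrected for ties,
# is compared against a chi-square distribution with C - 1 = 2 degrees of freedom.
H, p_value = kruskal(a, b, c)
print(f"H = {H:.2f}, p = {p_value:.4f}")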
Article
Speakers vary their speech tempo (speaking rate), and such variations in tempo are quite noticeable. But what is the just noticeable difference (JND) for tempo in speech? The present study aims at providing a realistic and robust estimate, by using multiple speech tokens from multiple speakers. The JND is assessed in two (2IAX and 2IFC) comparison experiments, yielding an estimated JND for speech tempo of about 5%. A control experiment suggests that this finding is not due to acoustic artefacts of the tempo-transformation method used. Tempo variations within speakers typically exceed this JND, which makes such variations relevant in speech communication.
Conference Paper
The most prominent task in forensic speech and audio processing is speaker recognition. A number of approaches to forensic speaker recognition (FSR) have been developed around the world, differing in technical procedures, methodology, instrumentation, and also in the probability scale on which the final conclusion is based. The BKA's approach to speaker recognition is a combination of classical phonetic analysis techniques, including analytical listening by an expert, and the use of signal processing techniques within an acoustic-phonetic framework. This combined auditory-instrumental method includes acoustic measurements of parameters which may be interpreted using statistical information on their distributions, e.g. probability distributions of average fundamental frequency for adult males and females, average syllable rates as indicators of speech rate, etc. In a voice comparison report the final conclusion is determined by a synopsis of the results from auditory and acoustic parameters, amounting to about eight to twelve on average, depending on the nature of the speech material. Results are given in the form of probability statements. The paper gives an overview of current procedures and specific problems of FSR.