Article

A Large-Scale Multilingual Study of Silent Pause Duration

Authors: Estelle Campione, Jean Véronis

Abstract

This paper presents a large-scale study of silent pause duration, based on the analysis of ca. 6000 pauses in 5 hours of read and spontaneous speech in five languages. The distribution of pauses appears to be trimodal, suggesting a categorization into brief (< 200 ms), medium (200-1000 ms) and long (> 1000 ms) pauses, the latter occurring only in spontaneous speech. The study reveals possible methodological flaws in previous research, in which statistical tests that rely on the normality assumption (such as the ANOVA) are routinely applied to non-transformed data, although the distributions are far from normal. It also emphasizes the dangerous effect of thresholds, which are very commonly applied in the literature for practical reasons, but can lead to totally false conclusions when comparing speech styles, languages or speakers.
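The categorization and the caution about normality can be made concrete with a small sketch. This is a minimal illustration with made-up duration samples; only the 200 ms and 1000 ms boundaries come from the abstract, and the log transform before a parametric test is shown as one common remedy rather than as the authors' procedure.

import numpy as np
from scipy import stats

def categorize_pause(duration_ms):
    # Bin a silent pause using the thresholds reported in the abstract.
    if duration_ms < 200:
        return "brief"
    elif duration_ms <= 1000:
        return "medium"
    return "long"

# Hypothetical pause durations (ms) for two speech styles.
read_speech = np.array([85, 120, 310, 450, 620, 240, 980])
spontaneous = np.array([95, 400, 760, 1250, 1900, 330, 2400])

for label, sample in [("read", read_speech), ("spontaneous", spontaneous)]:
    counts = {}
    for d in sample:
        cat = categorize_pause(d)
        counts[cat] = counts.get(cat, 0) + 1
    print(label, counts)

# Raw pause durations are strongly skewed; log-transforming them is one
# common step before tests that assume normality (e.g., a t-test or ANOVA).
_, p_raw = stats.ttest_ind(read_speech, spontaneous)
_, p_log = stats.ttest_ind(np.log(read_speech), np.log(spontaneous))
print(f"raw p={p_raw:.3f}  log p={p_log:.3f}")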


... They can vary in duration. Campione & Véronis (2002b) as well as Goldman et al. (2010) distinguish, for example, brief, medium and long pauses. Grosman et al. (2018) define silent pauses as "an interruption of phonation". ...
... To do so, some studies choose, for example, to discard pauses that are too brief by using predefined thresholds in milliseconds (Grosjean & Deschamps, 1975; Candea, 2000). However, given the variability of speech rates and pause durations, using fixed thresholds can affect the results (Campione & Véronis, 2002b; Grosman et al., 2018). Using punctuation as a marker of pauses and breathing has also been proposed (Wang et al., 2021) to predict where a reader will pause. ...
... This strategy gives us a pause-annotated corpus that takes pause variability into account with respect to the speaker and the corpus (Campione & Véronis, 2002b; Goldman et al., 2010). Just as the threshold that separates a silence from a pause can vary, we believe that the thresholds that distinguish short, medium and long pauses can vary across speakers. ...
Conference Paper
Full-text available
Pauzee: Pause Prediction in Text Reading. Silent pauses play a crucial role in text-to-speech synthesis, where they help make the reading sound more natural. In this work, our goal is to predict these silent pauses from text in order to improve automatic reading systems. As this task has not been extensively studied for French, it is necessary to build training data dedicated to pause prediction. We propose a strategy for inferring pauses, based on temporal information from transcribed speech, in order to obtain such a corpus. We then show that, with a transformer-based model and appropriate data, promising results can be obtained for predicting the pauses a speaker produces when reading a document aloud. Keywords: silent pauses, pause prediction, annotation for speech synthesis.
... For example, Vasilescu and Adda-Decker (2007) compared the duration of intra-lexical lengthened vowels and filled pauses in three languages and found that while the mean duration of lengthened vowels was around 60 ms, with nearly 90% of them shorter than 150 ms, filled pauses ranged from 150 to 250 ms, and only 15% lasted less than 150 ms. With regard to silent pauses, Campione and Véronis (2002) analysed 6000 silent pauses in spontaneous speech in five different languages and concluded that they could be categorised as either short (<200 ms), medium (200-1000 ms), or long (>1000 ms) pauses. Most medium silences were associated with demarcative prosodic functions, whereas short silences were not linked to prosodic-syntactic boundaries, and long silences were only observed in spontaneous speech. ...
... For each type of pause (silent pauses, filled pauses, and lengthened syllables), we computed 9 parameters: (1) mean duration; (2) median duration; (3) standard deviation of the duration; (4) number of each type of pause and all pauses together per (a) second of speech, (b) second of total duration (speech + pauses), (c) total number of syllables, and (d) total number of words; and (5) proportion of pause time over (a) the total duration of speech, and (b) the total duration (speech + pauses). Additionally, we classified every silent pause as long, medium, or short, according to Campione and Véronis (2002) criteria and calculated the percentage of each type of pause (long, medium, and short) over the total number of silent pauses. Then, for each of these three silent pause length categories, we also computed the number of pauses per (a) second of speech, (b) second of total duration (speech + pauses), (c) total number of syllables, and (d) total number of words. ...
... Pause duration is extremely varied in spontaneous speech, and often presents trimodal behaviour (Campione and Véronis 2002), making overall mean and median values of little interest. Therefore, we categorised silent pauses into three length types, following Campione and Véronis' criteria, namely short (<0.2 s), medium (0.2-1.0 s), and long (>1 s). ...
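The parameters listed in the excerpts above can be sketched in a few lines. This is an illustrative implementation under assumed inputs (per-speaker pause durations, speech time, syllable and word counts); only the 0.2 s and 1.0 s category boundaries are taken from Campione and Véronis (2002), and the field names are hypothetical.

from collections import Counter

def pause_parameters(pause_durations_s, speech_time_s, n_syllables, n_words):
    # Categorize each silent pause and normalize the counts the way the
    # excerpt describes: per second of speech, per second of total duration,
    # per syllable, and per word, plus the proportion of time spent pausing.
    def category(d):
        return "short" if d < 0.2 else ("medium" if d <= 1.0 else "long")

    total_time = speech_time_s + sum(pause_durations_s)
    counts = Counter(category(d) for d in pause_durations_s)
    params = {"pause_time_over_total": sum(pause_durations_s) / total_time}
    for cat in ("short", "medium", "long"):
        n = counts[cat]
        params[f"{cat}_per_s_speech"] = n / speech_time_s
        params[f"{cat}_per_s_total"] = n / total_time
        params[f"{cat}_per_syllable"] = n / n_syllables
        params[f"{cat}_per_word"] = n / n_words
    return params

# Toy example: four pauses in 25 s of speech containing 110 syllables / 60 words.
print(pause_parameters([0.12, 0.35, 0.8, 1.4], 25.0, 110, 60))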
Article
Full-text available
Hesitations are often used by speakers in spontaneous speech not only to organise and prepare their speech but also to address any obstacles that may arise during delivery. Given the relationship between hesitation phenomena and motor and/or cognitive–linguistic control deficits, characterising the form of hesitation could be potentially useful in diagnosing specific speech and language disorders, such as primary progressive aphasia (PPA). This work aims to analyse the features of hesitations in patients with PPA compared to healthy speakers, with hesitations understood here as those related to speech planning, that is, silent or empty pauses, filled pauses, and lengthened syllables. Forty-three adults took part in this experiment, of whom thirty-two suffered from some form of PPA: thirteen from logopenic PPA (lvPPA), ten from nonfluent PPA (nfvPPA), and nine from semantic PPA (svPPA). The remaining 11 were healthy speakers who served as a control group. An analysis of audio data recorded when participants produced spontaneous speech for a picture description task showed that the frequency of silent pauses, especially those classified as long (>1000 ms) was particularly useful to distinguish PPA participants from healthy controls and also to differentiate among PPA types. This was also true, albeit to a lesser extent, of the frequency of filled pauses and lengthened syllables.
... Our observations confirm and extend the results reported in the change-point literature on human inference in the context of Poisson statistics, by exploring a more ecological [27,28,29,30,31,32,33,34,35], non-Poisson, temporally structured environment. Likewise, our results come to complement those of a number of studies on perception and decision-making that also investigate inference from stimuli with temporal statistics [36,37,38,10,11,39]. ...
... In the context of human motor behavior, walking is a highly rhythmical natural activity [31,32]. More complex patterns exist (neither clustered nor periodic), such as in human speech which presents a variety of temporal structures, whether at the level of syllables, stresses, or pauses [33,34,35]. In all these examples, natural mechanisms produce series of temporally structured events. ...
... Following [26], we assume that the repetition variable (distributed according to $1 - \Lambda_t(x_{1:t}, \hat{s}_{t-1})$), and the location variable in trials in which the estimate is not repeated (distributed according to $\mu_t(\hat{s}_t \mid x_{1:t})$), each bear a cognitive cost proportional to measures of the amount of information on the sequence of stimuli involved in choosing the repetition variable and the location variable, respectively, defined as $I_1 = \int \cdots \int p(x_{1:t})\, D_{\mathrm{KL}}\!\big(\Lambda_t(x_{1:t}, \hat{s}_{t-1}) \,\big\|\, \bar{\Lambda}\big)\, dx_1 \ldots dx_t$ and $I_2 = \int \cdots \int p(x_{1:t})\, \Lambda_t(x_{1:t}, \hat{s}_{t-1})\, D_{\mathrm{KL}}\!\big(\mu_t(\cdot \mid x_{1:t}) \,\big\|\, \bar{\mu}\big)\, dx_1 \ldots dx_t$, ...
Preprint
Full-text available
To make informed decisions in natural environments that change over time, humans must update their beliefs as new observations are gathered. Studies exploring human inference as a dynamical process that unfolds in time have focused on situations in which the statistics of observations are history-independent. Yet temporal structure is everywhere in nature, and yields history-dependent observations. Do humans modify their inference processes depending on the latent temporal statistics of their observations? We investigate this question experimentally and theoretically using a change-point inference task. We show that humans adapt their inference process to fine aspects of the temporal structure in the statistics of stimuli. As such, humans behave qualitatively in a Bayesian fashion, but, quantitatively, deviate away from optimality. Perhaps more importantly, humans behave suboptimally in that their responses are not deterministic, but variable. We show that this variability itself is modulated by the temporal statistics of stimuli. To elucidate the cognitive algorithm that yields this behavior, we investigate a broad array of existing and new models that characterize different sources of suboptimal deviations away from Bayesian inference. While models with 'output noise' that corrupts the response-selection process are natural candidates, human behavior is best described by sampling-based inference models, in which the main ingredient is a compressed approximation of the posterior, represented through a modest set of random samples and updated over time. This result comes to complement a growing literature on sample-based representation and learning in humans.
... Goldman-Eisler (1968) proposes a threshold of 250 ms to distinguish between 'articulatory' (<250 ms) and 'hesitation' (>250 ms) pauses [1], and this threshold has been followed in research on both L1 and L2 speech. More recently, however, the use of this boundary has been called into question [2, 3, 4]. Most pauses within the 130 ms – 250 ms range cannot be attributed to articulation [2]. ...
... Most pauses within the 130 ms – 250 ms range cannot be attributed to articulation [2]. Pauses as short as 60 ms that are not part of occlusives have been reported [3]. In L2 research, thresholds have likewise been applied when measuring the number and duration of pauses. ...
... In this study, we report on measures of fluency based on pauses within AS-units only. In total, 10668 silent pauses within AS-units were identified. Figure 1 shows the distribution of pause durations after (natural) log transformation. Both [3] and [4] report that most pauses fall in the "short pause" distribution (roughly under 200 ms), whereas in our distribution most pauses are longer. Our participants speak in their L2, which has probably produced a distribution different from those previously found for read and spontaneous L1 speech (as reported in [3, 4]). ...
Conference Paper
Full-text available
Second language (L2) research often involves analyses of acoustic measures of fluency. The studies investigating fluency, however, have been difficult to compare because the measures of fluency that were used differed widely. One of the differences between studies concerns the lower cutoff point for silent pauses, which has been set anywhere between 100 ms and 1000 ms. The goal of this paper is to find an optimal cutoff point. We calculate acoustic measures of fluency using different pause thresholds and then relate these measures to a measure of L2 proficiency and to ratings on fluency. Index Terms: silent pauses, number of pauses, duration of pauses, silent pause threshold, second language speech.
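The procedure described in this abstract, computing fluency measures under different silent-pause cutoffs and relating them to an external criterion, can be sketched as follows. The data are synthetic and the link between proficiency and pausing is assumed purely for illustration; only the idea of sweeping the lower cutoff comes from the abstract.

import numpy as np
from scipy.stats import pearsonr

# Synthetic data: per-speaker silent pause durations (s) and a proficiency
# score per speaker; the relation between the two is assumed for illustration.
rng = np.random.default_rng(0)
n_speakers = 30
proficiency = rng.normal(60, 10, n_speakers)
pauses = [rng.lognormal(mean=-1.2 - 0.01 * (p - 60), sigma=0.8, size=80)
          for p in proficiency]

# Sweep the lower cutoff and check how strongly the resulting fluency
# measure (number of pauses above the cutoff) relates to proficiency.
for cutoff in (0.10, 0.25, 0.50, 1.00):
    n_pauses = np.array([(d >= cutoff).sum() for d in pauses])
    r, p_val = pearsonr(n_pauses, proficiency)
    print(f"cutoff {cutoff:.2f} s: r = {r:+.2f} (p = {p_val:.3f})")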
... • Fluency. Long pauses are perceived as disruptive and hesitation is seen as an indication of inexperience [11,12]; therefore, we programmed our novice speech with pauses that were five times longer [11] than those in our expert's speech. A standard pause was represented by one full stop (".") in text, whereas a longer pause was represented by five full stops ("....."); in our text-to-speech module, the five full stops translated to a pause that was five times longer than a normal pause between sentences. ...
... Our observations confirm and extend the results reported in the change-point literature in the context of Poisson statistics [71,52,53,72], by exploring a more ecological [75,76,77,78,79,80,81,82,83], non-Poisson, temporally structured environment. Likewise, our results broaden those of a number of studies in perception and decision making that also investigate human inference in presence of temporally structured variables [84,85,86,46,47,87]. ...
... Circadian and seasonal cycles are other examples of periodicity. More complex patterns exist (neither clustered nor periodic), such as in human speech which presents a variety of strong temporal structures, whether at the level of syllables, stresses, or pauses [81,82,83]. In all these examples, natural mechanisms produce series of temporally structured events. ...
Thesis
In past decades, the Bayesian paradigm has gained traction as an elegant and mathematically principled account of human behavior in inference tasks. Yet this success is tainted by the sub-optimality, variability, and systematic biases in human behavior. Besides, the brain must sequentially update its belief as new information is received, in natural environments that, usually, change over time and present a temporal structure. We investigate, with a task, the question of human online inference. Our data show that humans can make use of subtle aspects of temporal statistics in online inference; and that the magnitude of the variability found in responses itself depends on the inference. We investigate how a broad family of models, capturing deviations from optimality based on cognitive limitations, can account for human behavior. The variability in responses is reproduced by models approximating the posterior through random sampling during inference, and by models that select responses by sampling the posterior instead of maximizing it. Model fitting supports the former scenario and suggests that the brain approximates the Bayesian posterior using a small number of random samples. In a last part of our work, we turn to "sequential effects", biases in which human subjects form erroneous expectations about a random signal. We assume that subjects are inferring the statistics of the signal, but this inference is hindered by a cognitive cost, leading to non-trivial behaviors. Taken together, our results demonstrate, in the ecological case of online inference, how deviations from the Bayesian model, based on cognitive limitations, can account for sub-optimality, variability, and biases in human behavior.
... From reference [10], it is known that pause duration in speech can be divided into three categories: brief, medium and long. For a brief or short pause, the duration is less than 200 milliseconds. ...
... Although the paper [10] classifies a brief pause as less than 200 ms, it is still unclear what the minimum duration of a silence/pause in utterances is. The paper found that in spontaneous speech the distribution of brief pauses has peaks at 78 ms and 426 ms, while read (acted) speech has a first peak in the distribution at 100-150 ms and a second peak at 500-600 ms. ...
... Pauses control the overall pace of talk delivery and are measured in fractions of a second. Past research [12] observed that speech consists of short (∼0.15 s), medium (∼0.50 s), and long (≥1.50 s) pauses. Read speech tends to produce only short and medium pauses, while spontaneous speech shows more frequent use of medium and long pauses. ...
... The respective changes are within -29% and +47% bounds. The change in the number of clauses correlates best with the rate of the parameter change for the tempo⋆ and speed⋆ settings, respectively, mainly due to the simplicity of the threshold rule used for clause detection, which is based on the length of the intra-word pause made by the speaker. The speaker in the original audio speaks at an average pace of 180 spm. ...
Conference Paper
Great public speakers are made, not born. Practicing a presentation in front of colleagues is common practice and results in a set of subjective judgements what could be improved. In this paper we describe the design and implementation of a mobile app which estimates the quality of speaker's delivery in real time in a fair, repeatable and privacy-preserving way. Quantle estimates the speaker's pace in terms of the number of syllables, words and clauses, computes pitch and duration of pauses. The basic parameters are then used to estimate the talk complexity based on readability scores from the literature to help the speaker adjust his delivery to the target audience. In contrast to speech-to-text-based methods used to implement a digital presentation coach, Quantle does processing locally in real time and works in the flight mode. This design has three implications: (1) Quantle does not interfere with the surrounding hardware, (2) it is power-aware, since 95.2% of the energy used by the app on iPhone 6 is spent to operate the built-in microphone and the screen, and (3) audio data and processing results are not shared with a third party therewith preserving speaker's privacy. We evaluate Quantle on artificial, online and live data. We artificially modify an audio sample by changing the volume, speed, tempo, pitch and noise level to test robustness of Quantle and its performance limits. We then test Quantle on 1017 TED talks held in English and compare computed features to those extracted from the available transcript processed by online text evaluation services. Quantle estimates of syllable and word counts are 85.4% and 82.8% accurate, and pitch is over 90% accurate. We use the outcome of this study to extract typical ranges for each vocal characteristic. We then use Quantle on live data at a social event, and as a tool for speakers to track their delivery when rehearsing a talk. Our results confirm that Quantle is robust to different noise levels, varying distances from the sound source, phone orientation, and achieves comparable performance to speech-to-text methods.
... Following these findings, the authors concluded that dismissing pauses within this time range on articulatory grounds might lead to interesting patterns being ignored. In a more recent cross-linguistic study, Campione and Véronis (2002) reached a similar conclusion. They extracted pause durations from a corpus of read and spontaneous speech in five languages, showing how a simple comparison between spontaneous and read speech could lead to completely different conclusions depending on the threshold applied (Campione & Véronis, 2002). Whereas pausing has received quite a lot of attention in research on adult speech, pausing in the language of children is studied less often (see Sabin, Clemmer, O'Connell, & Kowal, 1979, for a review of early studies). ...
... In the two studies on pauses in child speech, information structure is presented as a possible interpretation of the pausing patterns observed, but without this being empirically investigated. In addition, in the latter two studies, a relatively high threshold for pausing was applied where the pauses occurred simultaneously with plosive closures, despite the fact that this approach runs the risk of dismissing psychologically relevant pauses on somewhat arbitrary grounds (Campione & Véronis, 2002; Hieke et al., 1983). ...
... We split the files based on i) a silence duration threshold between two words ("pause duration"), which we set to 0.3, and ii) a minimum number of words per segment, which we set to two. Campione and Véronis (2002) studied silent pause durations based on the analysis of 5 ½ hours of speech in five Indo-European languages, and categorized silences into brief (< 0.2 s), medium (0.2 - 1 s) and long (> 1 s) pauses. Therefore, we suggest using a pause duration threshold between 0.2 and 1 second to segment the audio. ...
... The annotation of silences is usually carried out perceptually, by trained annotators, as, to this day, no established, objective method for their detection exists. Eklund (2004) observed that duration is a poor cue for analyzing silences, and Campione and Véronis (2002) warn against using duration thresholds as they might skew the results, even though they are often employed in the automatic detection of silences. ...
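The segmentation rule quoted in the excerpts above (cut wherever the silence between two words exceeds a threshold, keep segments with a minimum number of words) can be written compactly, assuming word-level timestamps are available. The 0.3 s threshold and the two-word minimum come from the excerpt; the (label, start, end) data structure is a hypothetical choice for the example.

def split_on_pauses(words, pause_threshold=0.3, min_words=2):
    # Group time-aligned words into segments, cutting where the silence
    # between consecutive words exceeds pause_threshold seconds; segments
    # with fewer than min_words words are dropped.
    segments, current = [], []
    for i, (label, start, end) in enumerate(words):
        if current and start - words[i - 1][2] > pause_threshold:
            if len(current) >= min_words:
                segments.append(current)
            current = []
        current.append(label)
    if len(current) >= min_words:
        segments.append(current)
    return segments

# Toy word-level alignment: (label, start_s, end_s).
words = [("so", 0.00, 0.20), ("we", 0.25, 0.40), ("left", 0.42, 0.70),
         ("early", 1.20, 1.55), ("yes", 3.00, 3.20)]
print(split_on_pauses(words))  # [['so', 'we', 'left']]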
Article
Full-text available
This study investigates the interplay of spoken and gestural hesitations under varying amounts of cognitive load. We argue that not only fillers and silences, as the most common hesitations, are directly related to speech pausing behavior, but that hesitation lengthening is as well. We designed a resource-management card game as a method to elicit ecologically valid pausing behavior while being able to finely control cognitive load via card complexity. The method very successfully elicits large amounts of hesitations. Hesitation frequency increases as a function of cognitive load. This is true for both spoken and gestural hesitations. We conclude that the method presented here is a versatile tool for future research and we present foundational research on the speech-gesture link related to hesitations induced by controllable cognitive load.
... The upper panels demonstrate that raw durations are highly skewed (cf. Campione and Véronis 2002); the lower panels display the data after logarithmic transformation. Both speech genres seem to be bimodal, with one mode around 100-150 ms and another around 500-600 ms. ...
Article
Full-text available
Pauses act as important acoustic cues to prosodic phrase boundaries. However, the distribution and phonetic characteristics of pauses have not yet been fully described either cross-linguistically or in different genres and speech styles within languages. The current study examines the pausal performance of 24 Czech speakers in two genres of read speech: news reading and poetry reciting. The pause rate and pause duration are related to genre differences, overt and covert text organization, and speech tempo. We found a significant effect of several levels of text organization, including a strong effect of punctuation. This was reflected in both measures of pausal performance. A grammatically informed analysis of a subset of pauses within the smallest units revealed a significant contribution for pause rate only. An effect of tempo was found in poetry reciting at a macro level (speaker averages) but not when pauses were observed individually. Genre differences did not manifest consistently and analogically for the two measures. The findings provide evidence that pausing is used systematically by speakers in read speech to convey not only prosodic phrasing but also text structure, among other things.
... For example, pause lengths correlate with the speaker's social characteristics such as location, ethnicity, age, and gender. Campione [52] also found differences in pause length in different European languages. ...
Article
Full-text available
Alzheimer's dementia (AD) is the most common incurable neurodegenerative disease worldwide. Apart from memory loss, AD leads to speech disorders. Timely diagnosis is crucial to halt the progression of the disease. However, current diagnostic procedures are costly, invasive, and distressing. Early-stage AD manifests itself in speech disorders, which motivates examining them. Machine Learning (ML) represents a promising instrument in this context. Nevertheless, no genuine consensus on the language characteristics to be analyzed exists. To counteract this deficit and provide topic-related researchers with a better basis for decision-making, we present, based on a literature review, favourable speech characteristics for application to AD detection via ML. Current research trends toward using spontaneous speech, obtained from image descriptions, as the basis for analysis, and indicates that the combined use of acoustic, linguistic, and demographic features positively influences recognition accuracy. In total, we have identified 97 overarching acoustic, linguistic and demographic features.
... Most studies on silent pauses have treated the threshold as one of the objects of study [19], [20]. Those studies categorized silent pause thresholds into two groups: a low threshold (200 ms) and a high threshold (2000 ms). ...
Preprint
Silence is a part of human-to-human communication and can be a cue for human emotion perception. For automatic emotion recognition by a computer, it is not clear whether silence is useful for determining human emotion within speech. This paper presents an investigation of the effect of using a silence feature in dimensional emotion recognition. As the silence feature is extracted per utterance, we grouped the silence feature with high-level statistical functions from a set of acoustic features. The results reveal that the silence feature affects the arousal dimension more than the other emotion dimensions. A proper choice of a factor in the calculation of the silence feature improves dimensional speech emotion recognition performance in terms of the concordance correlation coefficient, whereas an improper choice of that factor decreases performance with the same architecture.
... On the other hand, considering that over 80% of pauses (i.e. within-speaker gaps between words or phrases) in spontaneous speech are between 200 and 1 000 ms (Campione and Véronis, 2002), values of ΔT greater than 1 s should typically be avoided. In highly fluctuating noise environments, smaller ΔT values will enhance the accuracy of the dosimetry system by helping it to detect ambient noise level variations during speech pauses (see method 2b). ...
Article
While personal noise exposure assessments are necessary to prevent noise-induced hearing loss in the workplace, standard personal noise dosimeters are limited when measuring the noise exposure of individuals wearing hearing protection devices (HPD). To overcome the difficulties in assessing the attenuation provided by HPDs, continuous monitoring systems of an individual’s noise exposure under the HPD show promise. However, these systems can be affected by the noise events induced by the wearer, though research has shown that the risk of hearing loss inherent to self-generated sounds (voice, swallowing, chewing) can be less than for external noise. This paper presents a low computational method to perform in-ear noise dosimetry under an earplug while excluding the noise contributions from the wearer. The method uses a dual-microphone earpiece able to take measurements both under the earplug and outside the ear. A comparison of the two microphones signals, through coherence calculations, provides sufficient information as to whether the protected noise levels originate mainly from the wearer or from external noise sources. Laboratory results collected on human test-subjects suggest that the proposed method is not only valid for a wide variety of self-generated sounds, it is efficient regardless of the amount of attenuation provided by the earplug. Further work involves validating the approach and parameters in occupational settings, and adapting this method to other types of HPDs such as earmuffs or dual hearing protection. (Full-text available until December 31, 2019: https://authors.elsevier.com/a/1a2ZHcd7kGZy9)
... This could be attributed to the selected pause threshold (100 ms), which determines only the count of silent pauses in an utterance, while the length of an utterance in Chinese and English is approximately the same in ST-CE. The study [3] concluded that different silent pause thresholds could lead to contradictory results. We should focus on the discriminating ability instead; from Chinese to English, pr showed a significant decrease for each load level. ...
Conference Paper
Speech-based cognitive load modeling recently proposed in English have enabled objective, quantitative and unobtrusive evaluation of cognitive load without extra equipment. However, no evidence indicates that these techniques could be applied to speech data in other languages without modification. In this study, a modified Stroop Test and a Reading Span Task were conducted to collect speech data in English and Chinese respectively, from which twenty non-linguistic features were extracted to investigate whether they were language dependent. Some discriminating speech features were observed language dependent, which serves as an evidence that there is a necessity to adapt speech-based cognitive load detection techniques to diverse language contexts for a higher performance.
... Pause duration has been reported to correlate with social attributes of the speaker, such as region, ethnicity, age, and gender [3]. A cross-cultural study of silent pauses in selected European languages (Polish was not included) revealed differences in pause durations between languages [4], but their distribution is usually similar and can be well estimated by a bi-Gaussian model [5]. Some medical aspects of different types of pauses have been investigated in the context of the affective state [6] and the physical [7] or mental [8] condition of the speaker, e.g. ...
Article
Full-text available
Statistics of pauses appearing in Polish as a potential source of biometry information for automatic speaker recognition were described. The usage of three main types of acoustic pauses (silent, filled and breath pauses) and syntactic pauses (punctuation marks in speech transcripts) was investigated quantitatively in three types of spontaneous speech (presentations, simultaneous interpretation and radio interviews) and read speech (audio books). Selected parameters of pauses extracted for each speaker separately or for speaker groups were examined statistically to verify usefulness of information on pauses for speaker recognition and speaker profile estimation. Quantity and duration of filled pauses, audible breaths, and correlation between the temporal structure of speech and the syntax structure of the spoken language were the features which characterize speakers most. The experiment of using pauses in speaker biometry system (using Universal Background Model and i-vectors) resulted in 30 % equal error rate. Including pause-related features to the baseline Mel-frequency cepstral coefficient system has not significantly improved its performance. In the experiment with automatic recognition of three types of spontaneous speech, we achieved 78 % accuracy, using GMM classifier. Silent pause-related features allowed distinguishing between read and spontaneous speech by extreme gradient boosting with 75 % accuracy.
... A common practice in the literature is to exclude those silent intervals by choosing a minimum cut-off point somewhere between 100 and 300 milliseconds, although there has been a longstanding debate about the threshold [5, 6, 7]. Campione and Véronis (2002) analyzed pauses in 5½ hours of read and spontaneous speech in five languages [8]. They found that the distribution of pauses appears as trimodal, suggesting a categorization into brief (< 200 ms), medium (200-1000 ms), and long (>1000 ms) pauses. ...
Conference Paper
Full-text available
In this study, we investigate the use of pauses and pause fillers in Mandarin Chinese. Our analysis is based on 267 spoken monologues from a Mandarin proficiency test. We identify two basic pause fillers in Mandarin: e and en. We find that males use more e than females, but there is no difference between them on the frequency of en. Therefore, the proportion of nasal-final pause fillers is higher in female than in male speakers, as was found in the studies of Germanic languages. Proficiency, on the other hand, does not affect the frequency of either e or en. With respect to the use of unfilled pauses, both sex and proficiency have a significant effect. Males and less proficient speakers use more medium and long, but not brief, pauses. Males tend to speak faster than females, they have a shorter en, but there is no difference between the two sexes on the duration of e. Un-proficient speakers produce shorter pause fillers, both e and en, than proficient ones. Finally, en is longer than e, it also precedes and follows a longer pause than e.
... As these discrepancies are likely to result in different distributions of overlap and silence, results of studies using different silence thresholds are not directly comparable. In addition, as demonstrated by Campione and Véronis [14], using thresholds on silence durations (both low and high) can lead to entirely wrong conclusions. When overlaps are added to the equation, the situation becomes even more complex because of interactions between the two categories. ...
Conference Paper
Full-text available
Faced with the lack of objective and easily applicable criteria for segmentation of speech into dialogue turns, many authors resort instead to units defined in terms of stretches of speech minimally bounded by silence of some predefined duration. There is, however, no consensus concerning the silence thresholds employed. While such thresholds can be established on perceptual grounds, in practice a wide range of values is used. As this has a direct impact on the reported frequencies of silences and overlaps, the discrepancies make comparisons of results across different studies difficult. In an attempt to overcome these problems, in the present paper we use the Switchboard corpus to evaluate the expected variability in distributions of inter- and intra-speaker intervals when silence boundary thresholds of inter-pausal units are manipulated.
Article
Full-text available
In this study, we investigate the use of the filler particles (FPs) uh, um, hm, as well as glottal FPs and tongue clicks of 100 male native German speakers in a corpus of spontaneous speech. For this purpose, the frequency distribution, FP duration, duration of pauses surrounding FPs, voice quality of FPs, and their vowel quality are investigated in two conditions, namely, normal speech and Lombard speech. Speaker-specific patterns are investigated on the basis of twelve sample speakers. Our results show that tongue clicks and glottal FPs are as common as typically described FPs, and should be a part of disfluency research. Moreover, the frequency of uh, um, and hm decreases in the Lombard condition while the opposite is found for tongue clicks. Furthermore, along with the usual F1 increase, a considerable reduction in vowel space is found in the Lombard condition for the vowels in uh and um. A high degree of within-and between-speaker variation is found on the individual speaker level.
Article
Full-text available
The purpose of this study was to examine if prosodic patterns in oral reading derived from Recurrence Quan-tification Analysis (RQA) could distinguish between struggling and skilled German readers in Grades 2 (n = 67) and 4 (n = 69). Furthermore, we investigated whether models estimated with RQA measures outperformed models estimated with prosodic features derived from prosodic transcription. According to the findings, struggling second graders appear to have a slower reading rate, longer intervals between pauses, and more repetitions of recurrent amplitudes and pauses, whereas struggling fourth graders appear to have less stable pause patterns over time, more pitch repetitions, more similar amplitude patterns over time, and more repetitions of pauses. Additionally, the models with prosodic patterns outperformed models with prosodic features. These findings suggest that the RQA approach provides additional information about prosody that complements an established approach.
Article
Full-text available
In poetry declamation, the appropriate use of prosody to cause pleasure is essential. Among the prosodic parameters, pause is one of the most effective to engage the listeners and provide them with a pleasant experience. The declamation of three poems in two varieties of Portuguese by ten Brazilian Portuguese (BP) speakers and ten European Portuguese (EP) speakers, balanced for gender, was used as a corpus for evaluating the degree of pleasantness by listeners from the same language variety. The distributions of pause duration and inter-pause interval (IPI) both varied greatly across the subjects, being the main source of variability and strongly right-tailed. The evaluation of the degree of pleasantness revealed that pause duration predicts degree of pleasantness in EP, whereas IPI predicts degree of pleasantness in BP. Reciters perform a kind of complex “dance”, where sonority between pauses is favored in BP and pause duration in EP.
Conference Paper
Full-text available
Synthesized speech for mathematical content still presents challenges for students who use screen readers, among them inadequate pauses and long auditory outputs, which make it difficult to memorize this type of content. In this study we carried out two experiments aimed at identifying and analyzing processes capable of reducing the cognitive overload of synthesized speech for mathematical expressions encoded in MathML. The first experiment sought to verify the difficulties encountered by students with visual impairment and to analyze a proposed pause model. The second experiment sought to understand the cognitive processes involved in memorizing mathematical expressions, through eye tracking of sighted participants. Although some results were not conclusive, the research proved relevant, as it points to directions that can minimize mental load and, consequently, improve the student's cognitive process when reading mathematical expressions.
Article
Purpose Gap duration contributes to the perception of utterances as fluent or disfluent, but few studies have systematically investigated the impact of gap duration on fluency judgments. The purposes of this study were to determine how gaps impact disfluency perception, and how listener background and experience impact these judgments. Methods Sixty participants (20 adults who stutter [AWS], 20 speech-language pathologists [SLPs], and 20 naïve listeners) listened to four tokens of the utterance, “Buy Bobby a puppy,” produced at typical speech rates. The gap duration between “Buy” and “Bobby” was systematically manipulated with gaps ranging from 23.59 ms to 325.44 ms. Participants identified stimuli as fluent or disfluent. Results The disfluency threshold – the point at which 50 % of trials were categorized as disfluent – occurred at a gap duration of 126.46 ms, across all participants and tokens. The SLPs exhibited higher disfluency thresholds than the AWS and the naïve listeners. Conclusion This study determined, based on the specific set of stimuli used, when the perception of utterances tends to shift from fluent to disfluent. Group differences indicated that SLPs are less inclined to identify disfluencies in speech potentially because they aim to be less critical of speech that deviates from “typical”.
Article
The purpose of this study is to describe pauses in the production of Indonesian speech in terms of duration, percentage, and the reasons for pausing. The data were taken from pauses in the speech delivered by the candidates for governor and vice governor in the "2017 Governor Candidate Debate of DKI Jakarta". The data were collected by downloading the full debate broadcast from youtube.com. The speech was then transcribed orthographically, identified with the help of Praat version 5, and classified according to the aims of the study. The results show that pause durations vary widely, ranging from very short (37 ms) to very long (3,633 ms). However, the average pause (499.89 ms) remains within the normal range. The proportion of time the speakers spend pausing can be considered quite large, as it takes up 20.71%, or more than a fifth, of the total speech duration. The reasons for pausing fall into two groups: intentional and unintentional. Intentional pauses arise from respiration, segmentation of lingual units, grammatical pauses, and expression; unintentional pauses occur because of the mental processes speakers go through when planning and producing speech, namely unpreparedness to start the utterance, care in choosing words, errors, pressure, and changes to the content of the speech.
Article
Full-text available
Expressing interpersonal relationships varies from language to language. Numerous studies explore morphological or other means of expressing a tous/vous relationship. To our knowledge, no research has been done on whether the phonic realization has a share in mapping a T or V shade onto an utterance. The present study presents the results of such research. After the corpus was compiled and T and V utterances categorized, we measured pauses and melody contours, and we identified the pitch accent placement. Then, we interpreted the data sociolinguistically. The data point to two areas worth further examination - phonetic and sociological: a) a tendency was observed in T vs. V encounters with regard to the sociological parameter of age; b) the American culture seems to apply the model of "dispersion" rather than bipolarity, which makes it an intricate task to collect a sufficient number of V encounters providing for statistically significant data. © 2018 Slovak Association for the Study of English. All Rights Reserved.
Article
Full-text available
Significance When we speak, we unconsciously pronounce some words more slowly than others and sometimes pause. Such slowdown effects provide key evidence for human cognitive processes, reflecting increased planning load in speech production. Here, we study naturalistic speech from linguistically and culturally diverse populations from around the world. We show a robust tendency for slower speech before nouns as compared with verbs. Even though verbs may be more complex than nouns, nouns thus appear to require more planning, probably due to the new information they usually represent. This finding points to strong universals in how humans process language and manage referential information when communicating linguistically.
Conference Paper
Full-text available
Sentiment analysis is a trendy domain of Machine Learning which has developed considerably in the last several years. Nevertheless, most of the sentiment analysis systems are general. They do not take profit of the interactional context and all of the possibilities that it brings. My PhD thesis focuses on creating a system that uses the information transmitted between two speakers in order to analyze opinion inside a human-human or a human-agent interaction. This paper outlines a research plan for investigating a system that analyzes opinion in speech interactions, using hybrid discriminative models. We present the state of the art in our domain, then we discuss our prior research in the area and the preliminary results we obtained. Finally, we conclude with the future perspectives we want to explore during the rest of this PhD work.
Conference Paper
Full-text available
Pauses in spontaneous speaking constitute a rich source of data for several disciplines. They have been used to enhance automatic segmentation of speech, classification of patients with communication disorders, the design of psycholinguistic models of speaking, and the analysis of psychological disorders. However, although pause data is easy to collect under natural conditions, it has had little impact on recent research in cognitive psychology, involving learning, memory and individual differences for example. This alleged omission probably reflects a general preference for de-contextualised paradigms in that area, the cost in time of speech segmentation, and the number and variety of variables that influence pause duration. But the most challenging issues concern the basic characteristics of pauses. What are they, and how many different types of pauses merit consideration? In this paper we introduce an analytic approach that addresses and answers fundamental questions about pauses. The critical steps were as follows: (a) recognition that pause and speech segment duration distributions are skewed and reflect several component processes, (b) adoption of the assumption that the distributions reflect log-normal variability, and independent confirmation of the discovery that pause distributions include two component processes (Campione and Veronis, 2002), and (c) use of signal detection theory to estimate individual as distinct from universal thresholds to distinguish those contributions.
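The two ingredients mentioned in this abstract, log-normal variability and two component processes in the pause distribution, suggest a simple way to derive a speaker-specific threshold instead of a universal one. The sketch below is an assumed implementation of that idea, not the authors' signal-detection procedure: it fits a two-component Gaussian mixture to log-durations of synthetic pauses and reads off the point where the components cross.

import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic pause durations (s): two log-normal components, echoing the
# two-process structure discussed above.
rng = np.random.default_rng(1)
durations = np.concatenate([rng.lognormal(np.log(0.12), 0.4, 400),
                            rng.lognormal(np.log(0.55), 0.5, 400)])

log_d = np.log(durations).reshape(-1, 1)
gmm = GaussianMixture(n_components=2, random_state=0).fit(log_d)

# Speaker-specific boundary: where posterior responsibility flips from the
# short-pause component to the long-pause component.
grid = np.linspace(log_d.min(), log_d.max(), 2000).reshape(-1, 1)
resp = gmm.predict_proba(grid)
low, high = np.argsort(gmm.means_.ravel())
flip = grid[np.argmax(resp[:, high] > resp[:, low])][0]
print(f"estimated boundary ~ {np.exp(flip) * 1000:.0f} ms")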
Article
The former Apple CEO Steve Jobs was one of the most charismatic speakers of the past decades. However, there is, as yet, no detailed quantitative profile of his way of speaking. We used state-of-the-art computer techniques to acoustically analyze his speech behavior and relate it to reference samples. Our paper provides the first-ever acoustic profile of Steve Jobs, based on about 4000 syllables and 12,000 individual speech sounds from his two most outstanding and well-known product presentations: the introductions of the iPhone 4 and the iPad 2. Our results show that Steve Jobs stands out against our reference samples in almost all key features of charisma, including melody, loudness, tempo and fluency, and he produced significant quantitative differences when addressing customers and investors in his speeches. Against this background, we provide specific advice on how to improve a speaker’s charismatic impact. We conclude with describing how further technological advances will move computers and behavioral characteristics like human speech even closer together.
Chapter
Full-text available
Speech production involves a remarkably complex combination of processes. Prior to articulation, it involves rapid interactions of processes of utterance planning, formulation, and motor planning for execution whose timing requires close coordination. During articulation, motor commands activating several muscle systems need to ensure that respiratory, phonatory, and articulatory gestures are timed in such a way as to produce an acoustic signal that adequately conveys the intended message both quickly and smoothly. Although it seems crucial to successful communication, fluency is an aspect of speech and language that is largely overlooked within mainstream linguistics, where the focus is on language competence, rather than spoken performance, and where fluency is the unmentioned default consequence of following the rules. This study of speech and language from a relatively static viewpoint stands in contrast to the study of the mechanisms of speech production, which presents a dynamic view. The failure to maintain the flow in overt speech, through error and repair and through hesitation, has been the focus of a growing number of studies within speech production. But an overarching definition of fluency (and of disfluency) is hard to come by and there exists confusion in the use of terminology. This chapter explores the notion of fluency in typical speech production and attempts to add some clarity.
Article
A major contribution to speaking style comes from both the location of phrase breaks in an utterance, as well as the duration of these breaks. This paper is about modeling the duration of style specific breaks. We look at six styles of speech here. We present analysis that shows that these styles differ in the duration of pauses in natural speech. We have built CART models to predict the pause duration in these corpora and have integrated them into the Festival speech synthesis system. Our objective results show that if we have sufficient training data, we can build style specific models. Our subjective tests show that people can perceive the difference between different models and that they prefer style specific models over simple pause duration models.
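The CART modeling step mentioned here can be illustrated with a minimal regression-tree sketch. The feature set below is invented for the example and is not the paper's actual feature set; scikit-learn's DecisionTreeRegressor is used as an off-the-shelf CART implementation.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy training data: each phrase break is described by a few contextual
# features; columns (in order): is_sentence_final, punctuation_depth,
# words_since_last_break, style_id. All values are illustrative.
X = np.array([[1, 2, 9, 0], [0, 1, 4, 0], [1, 2, 12, 1],
              [0, 0, 3, 1], [1, 1, 7, 2], [0, 1, 5, 2]])
y = np.array([0.62, 0.21, 0.80, 0.15, 0.45, 0.25])  # break duration (s)

# DecisionTreeRegressor implements the CART algorithm.
model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)
print(model.predict([[1, 2, 10, 1]]))  # predicted pause duration in seconds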
Article
This study explores spoken medical narratives in which dermatologists were shown images of dermatological conditions and asked to explain their reasoning process while working toward a diagnosis. This corpus has been annotated by a domain expert for information-rich conceptual units of thought, providing opportunity for analysis of the link between diagnostic reasoning steps and speech features. We explore these annotations in regards to speech disfluencies, prosody, and type-token ratios, with the finding that speech tagged within thought units is unique from non-tagged speech in each of these aspects. Additionally, we discuss pattern differences in temporal thought unit distribution based on diagnostic correctness.
Article
The aim of this paper is to employ a complex empirical methodology to explore characteristics of intonation units in utterances. In the first phase of the experiment, intonation units were identified by subjects on the basis of the meaning of linguistic expressions plus the way they sounded. In the second phase, participants only had access to the second type of information. Average results of the number of boundary markings exhibited hardly any difference between the two phases (24.3 and 25.8); furthermore, the number of identical boundary markings was high. The results suggest that the concept of intonation units is a better basis for establishing the segmentation of utterances both on the production and the perception side than the concept of sentences, as in earlier similar experiments. The importance of clauses in linguistic interaction is emphasised in the present investigation exactly by our refraining from making that category a point of departure; rather, its characteristic interaction with intonation units is empirically confirmed.
Article
This study aims to accomplish the functional and positional analysis of long silences – single, double or triple – in the oral productions of 20 speakers with high-functioning autism spectrum disorder and 20 typically-developing speakers. Thus, a pausative pattern, which combines different quantitative measures, is proposed for speakers with this disorder. Different comparisons showed homogeneity in the relative average between the word quantity and the number of long pauses in oral interactions of both groups. Finally, the excess of long internal shared pauses that are produced by speakers with high-functioning autism due to comprehension problems or lack of attention during dialogue is significant.
Article
Full-text available
The aim of the present paper is to provide comparative analysis regarding the functions of pauses through exploration of the similarities and differences in semantically identical utterances in micro-textual units in colloquial style produced by L1 and L2 speakers of English and German. The research study illustrates inappropriate segmentation of the discourse, inapt distribution and frequency of pause types in L2 subjects' utterances, which may be due to the fact that L2 speakers apply cognitive activities different from L1 speakers. L1 subjects' productions, on the other hand, indicate that they tend to plan and program their utterances in longer blocks.
Article
Full-text available
The frequency, duration and distribution of pauses in French were investigated acoustically in three types of speech styles: political interviews and casual interviews, which belong to spontaneous speech, and political speeches, which are carefully prepared. The speech samples were subdivided into articulated sequences, silent pauses, and non-silent pauses. The total time of silent pauses was 50% greater in political speeches than in either type of interview. It appears to be one of the characteristics of political speeches. In all three styles, the distribution of silent pauses was generally correlated with the syntactic structure of the sentence. Most of the time, these pauses occurred at clause or phrase boundaries. In political speeches, however, their frequency was greater and their duration longer. Some of these pauses, particularly the long ones, must have a predominantly stylistic function. In interviews, non-silent pauses were frequent and long, particularly in casual interviews, whereas they were almost completely absent in political speeches. These results confirm previous studies that involve other languages as well, and investigate the syntactic distribution of pauses and the importance of hesitation in spontaneous speech; they open onto a new research area concerned with the stylistic function of pauses.
Thesis
Full-text available
This research deals with so-called "hesitation" phenomena in unscripted spoken French, which we prefer to call "markers of the formulation process", namely: euh, significant final lengthening, repetition, and immediate self-correction. Particular attention is paid to how these markers combine with one another and with the silent pause. The study shows that silent pauses following a euh or a lengthening that marks ongoing formulation work, as well as silent pauses inserted between the two terms of a repetition, form a class of their own: we have called them "non-structuring pauses" because they are part of the marker that precedes them and do not contribute to the hierarchisation and demarcation of constituents. Each marker, studied separately, is characterised by its duration and that of the subsequent pause, by its combinations with another marker and its occurrences within mixed sites where markers accumulate, and by its lexical contexts and its (intono)syntactic distribution. The most frequent as well as the rarest configurations are catalogued. The analysis builds on previous studies of these markers in French, which are still rare; it attempts to verify, qualify and enrich certain hypotheses already formulated, on the basis of a corpus of 70 minutes of recordings of unscripted narratives in French class. Through a perception test, as well as through an analysis of the use of "hesitation" markers by playwrights and actors, the study also contributes to a better understanding of ordinary speakers' representations of these phenomena and of their perceptibility.
Article
This paper is concerned with the evaluation of speech rate in French. Usually, this dynamic parameter is described as a quantitative dimension. It is shown that the slowing down of speech also has major qualitative effects that have to be taken into account. The theory of slowing down speech is thus revised.
Article
This second study of the time variables of spoken French unravels the main values of these variables in a forced language activity (the description of cartoons) and compares them with those found in radio interviews. As regards the primary variables, the difference between these two tasks is highly significant. A comparison with English in respect to a similar description task shows a great similarity at the level of the complex variables and the rate of articulation, but a significant difference in the length of the runs and of unfilled pauses. The comparison of the syntactic distribution of unfilled pauses in interviews and in descriptions shows a similarity in the occurrence of pauses at different syntactic positions (in particular finally and medially in sentences), but an increase in the number of pauses and – by about 100% – of their median length. Finally, a very great stability in the classification by order of size of the secondary variables (other hesitation pauses) is noted, which does not prevent these pauses from being distributed differently and two to three times as frequent in the descriptions.