Figure 3 - uploaded by Hans Rutger Bosker
Average categorisation data (in % long /a:/ responses) for the five conditions with different time-compression factors κ from Experiment 2 (error bars show standard errors). Compression of speech carriers by κ = 2, with syllable rates within the theta range, leads to an increase in % /a:/ responses. However, compression of carriers by κ = 4 and κ = 5, with syllable rates outside the theta range, does not lead to an increase in % /a:/ responses (target categorisation comparable to the baseline κ = 1 condition).
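For intuition on the manipulation described in the caption: time compression by a factor κ multiplies the carrier's syllable rate by κ. The sketch below, assuming a purely illustrative baseline rate of 3 syllables/s (the actual stimulus rates are reported in the source publication), shows how the κ values named in the caption map onto the 3–9 Hz theta range.

```python
# Illustrative only: the 3 syllables/s baseline is an assumed value,
# not taken from the source publication.
THETA_RANGE_HZ = (3.0, 9.0)       # theta range referenced in the study
BASELINE_RATE_SPS = 3.0           # hypothetical uncompressed syllable rate

for kappa in (1, 2, 4, 5):        # compression factors named in the caption
    rate = BASELINE_RATE_SPS * kappa   # compressing durations by kappa multiplies the rate
    inside = THETA_RANGE_HZ[0] <= rate <= THETA_RANGE_HZ[1]
    print(f"kappa = {kappa}: {rate:.1f} syllables/s "
          f"({'inside' if inside else 'outside'} the theta range)")
```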
Source publication
This psychoacoustic study provides behavioural evidence that neural entrainment in the theta range (3–9 Hz) causally shapes speech perception. Adopting the “rate normalization” paradigm (presenting compressed carrier sentences followed by uncompressed target words), we show that uniform compression of a speech carrier to syllable rates inside the t...
Context in source publication
Context 1
... with missing categorisation responses (n = 6; <1%) were excluded from analyses. Categorisation data, calculated as the percentage of long /a:/ responses (% /a:/), are presented in Figure 3, and were analyzed by a GLMM with a logistic linking function. The dependent variable was response /a:/ (coded as 1) or /ɑ/ (coded as 0). ...
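The excerpt above describes a generalized linear mixed model with a logistic link on the binary /a:/ vs. /ɑ/ responses. As a rough sketch only, the snippet below fits a plain logistic GLM in Python, omitting the random-effects structure of the original analysis; the file and column names are hypothetical.

```python
# Rough sketch of a logistic model on binary categorisation responses.
# The original analysis was a GLMM with random effects; here only a fixed
# effect of condition is shown, and "responses.csv", "response_aa", and
# "condition" are hypothetical names.
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("responses.csv")          # one row per trial
# response_aa: 1 = long /a:/ response, 0 = short /ɑ/ response
fit = smf.glm("response_aa ~ C(condition)", data=df,
              family=sm.families.Binomial()).fit()   # logistic linking function
print(fit.summary())
```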
Similar publications
In temporal binding, the temporal interval between one event and another, occurring some time later, is subjectively compressed. We discuss two ways in which temporal binding has been conceptualized. In studies showing temporal binding between a voluntary action and its causal consequences, such binding is typically interpreted as providing a measu...
Citations
... Adaptation to prominent frequencies in the preceding acoustic context results in neural contrast when those frequencies change upon introduction of the target sound, producing a neural (and ultimately perceptual) shift. The neural mechanisms underlying TCEs are less clear, but two candidates have been suggested: cortical entrainment to modulations in the amplitude envelope (Bosker and Ghitza, 2018) or evoked responses to rapid increases in speech amplitude, particularly at modulation onset ("acoustic edges") (Oganian and Chang, 2019; Oganian et al., 2023). When the rate of modulations or their onsets changes across context and target stimuli, this similarly produces a contrastive shift whereby a larger change in rate is perceived, resulting in the TCE. ...
... First and foremost, it bears reminding that different types of contrast effects were measured across studies: Assgari and Stilp (2015) analyzed SCEs and the present study analyzed TCEs. Second, each effect is proposed to be subserved by different neural mechanisms: SCEs by neural adaptation (Stilp, 2020a,b) and TCEs by either entrainment to modulations in the amplitude envelope of speech (Bosker and Ghitza, 2018) or evoked responses to rapid increases in speech amplitude (Oganian and Chang, 2019; Oganian et al., 2023). Third, while studies used the same context sentences, the target stimuli differed. ...
... While the One Talker/One Sentence condition had zero variability in mean f0 from trial to trial (because it was the same token spoken at different rates), there was minimal variability in mean f0 for the One Talker/200 Sentences condition, yet TCEs were significantly smaller in this condition in Experiments 1 and 2. This is suggestive of a different type of stimulus variability being responsible for diminishing TCE magnitudes across conditions, most likely one tied to the proposed neural mechanisms underlying TCEs. By presenting a different sentence on each trial, there was variability in the amplitude envelope of each sentence (as suggested by Bosker and Ghitza, 2018, to underlie TCEs) as well as the timing and frequency of rapid increases of signal amplitude (as suggested by Oganian et al., 2023, to underlie TCEs). Targeted experimentation (akin to the experiments reported by Assgari et al., 2019) is needed to identify the specific cause of variation in TCE magnitudes in the present results. ...
Acoustic context influences speech perception, but contextual variability restricts this influence. Assgari and Stilp [J. Acoust. Soc. Am. 138, 3023-3032 (2015)] demonstrated that when categorizing vowels, variability in who spoke the preceding context sentence on each trial but not the sentence contents diminished the resulting spectral contrast effects (perceptual shifts in categorization stemming from spectral differences between sounds). Yet, how such contextual variability affects temporal contrast effects (TCEs) (also known as speaking rate normalization; categorization shifts stemming from temporal differences) is unknown. Here, stimuli were the same context sentences and conditions (one talker saying one sentence, one talker saying 200 sentences, 200 talkers saying 200 sentences) used in Assgari and Stilp [J. Acoust. Soc. Am. 138, 3023-3032 (2015)], but set to fast or slow speaking rates to encourage perception of target words as "tier" or "deer," respectively. In Experiment 1, sentence variability and talker variability each diminished TCE magnitudes; talker variability also produced shallower psychometric function slopes. In Experiment 2, when speaking rates were matched across the 200-sentences conditions, neither TCE magnitudes nor slopes differed across conditions. In Experiment 3, matching slow and fast rates across all conditions failed to produce equal TCEs and slopes everywhere. Results suggest a complex interplay between acoustic, talker, and sentence variability in shaping TCEs in speech perception.
... We then evaluated the consistency of the extracted stimulus features (speech envelopes and kinematic components). In both classes of stimuli, the peaks of the power spectra of the speech envelopes were mostly confined between 4 and 8 Hz, consistent with consolidated evidence (3, 49-52). The first four principal components (PCs) together accounted for 85% of the total variance of the EMA data (Fig. 1C), whereas each of the remaining components explained a negligible amount of variance [variance accounted for (VAF): <5% each]. ...
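The variance-accounted-for (VAF) figures quoted above come straight out of a principal component analysis of the EMA trajectories. A minimal sketch of that computation, with a hypothetical data file, might look like this:

```python
# Minimal sketch of the variance-accounted-for (VAF) computation for
# articulatory (EMA) data; "ema_trajectories.npy" is a hypothetical file
# shaped (n_samples, n_sensor_channels).
import numpy as np
from sklearn.decomposition import PCA

ema = np.load("ema_trajectories.npy")
pca = PCA().fit(ema)
vaf = pca.explained_variance_ratio_                # VAF per principal component
print("VAF of first four PCs:", vaf[:4].sum())     # reported as ~85% in the study
print("Max VAF of remaining PCs:", vaf[4:].max())  # each reported as < 5%
```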
The human brain tracks available speech acoustics and extrapolates missing information such as the speaker's articulatory patterns. However, the extent to which articulatory reconstruction supports speech perception remains unclear. This study explores the relationship between articulatory reconstruction and task difficulty. Participants listened to sentences and performed a speech-rhyming task. Real kinematic data of the speaker's vocal tract were recorded via electromagnetic articulography (EMA) and aligned to corresponding acoustic outputs. We extracted articulatory synergies from the EMA data using Principal Component Analysis (PCA) and employed Partial Information Decomposition (PID) to separate the electroencephalographic (EEG) encoding of acoustic and articulatory features into unique, redundant, and synergistic atoms of information. We median-split sentences into easy (ES) and hard (HS) based on participants' performance and found that greater task difficulty involved greater encoding of unique articulatory information in the theta band. We conclude that fine-grained articulatory reconstruction plays a complementary role in the encoding of speech acoustics, lending further support to the claim that motor processes support speech perception.
... The process by which neural activity in the auditory cortex synchronizes with the amplitude envelope of the speech signal is known as neural entrainment. Neural entrainment captures acoustic and linguistic features such as the syllable at a frequency of around 5 Hz (a syllable lasts approximately 200 milliseconds; Gross et al., 2013b), and plays an important role in comprehension and intelligibility (Doelling et al., 2014; Bosker and Ghitza, 2018; Kösem et al., 2018; Poeppel and Assaneo, 2020). Whether it is caused by intrinsic oscillations (Lakatos et al., 2008; Giraud and Poeppel, 2012; Doelling et al., 2014; Notbohm et al., 2016; Zoefel et al., 2018) or by a sequence of evoked potentials in the theta range (Capilla et al., 2011; Keitel et al., 2014; Obleser and Kayser, 2019) is still under debate; therefore, we refer to this process simply as neural tracking. ...
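The quantity these tracking studies refer to, the slow amplitude envelope of speech and its dominant modulation rate of roughly 4–8 Hz (about one syllable per ~200 ms), can be estimated along the following lines; the audio file name is a placeholder and the filter settings are illustrative.

```python
# Sketch: amplitude envelope of a speech recording and its dominant slow
# modulation rate. "sentence.wav" is a placeholder for a mono recording;
# the 10 Hz low-pass cutoff is illustrative.
import numpy as np
import soundfile as sf
from scipy.signal import butter, filtfilt, hilbert
from scipy.fft import rfft, rfftfreq

audio, fs = sf.read("sentence.wav")
envelope = np.abs(hilbert(audio))                 # broadband amplitude envelope
b, a = butter(4, 10 / (fs / 2), btype="low")      # keep only slow modulations
slow_env = filtfilt(b, a, envelope)

spectrum = np.abs(rfft(slow_env - slow_env.mean()))
freqs = rfftfreq(len(slow_env), d=1 / fs)
band = (freqs > 0.5) & (freqs < 10)               # search below 10 Hz, above near-DC
peak = freqs[band][np.argmax(spectrum[band])]
print(f"Dominant modulation rate: {peak:.1f} Hz")  # typically ~4-8 Hz for speech
```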
The superior temporal and Heschl’s gyri of the human brain play a fundamental role in speech processing. Neurons synchronize their activity with the amplitude envelope of the speech signal to extract acoustic and linguistic features, a process known as neural tracking/entrainment. Electroencephalography has been used extensively in language-related research due to its high temporal resolution and low cost, but it does not allow precise source localization. Motivated by the lack of a unified methodology for the interpretation of source-reconstructed signals, we propose a method based on modularity and signal complexity. The procedure was tested on data from an experiment in which we investigated the impact of native language on tracking of linguistic rhythms in two groups: English natives and Spanish natives. In the experiment, we found no effect of native language but an effect of language rhythm. Here, we compare source-projected signals in the auditory areas of both hemispheres for the different conditions using nonparametric permutation tests, modularity, and a dynamical complexity measure. We found increasing values of complexity for decreased regularity in the stimuli, allowing us to conclude that languages with less complex rhythms are easier for the auditory cortex to track.
... Many empirical works in this domain have focused on the potential role of neural oscillations as a neurophysiological substrate for predictions in the time domain [2-5]. In this view, neural oscillators synchronize their excitability phase with external sequences, thereby reducing internal noise and optimizing the processing of incoming events [6-8]. In this sense, the phase of a neural oscillation can be used as an index for prediction in time, a mechanism that may be considered as constitutive of the inferential process. ...
Humans excel at predictively synchronizing their behavior with external rhythms, as in dance or music performance. The neural processes underlying rhythmic inferences are debated: whether predictive perception relies on high-level generative models or whether it can readily be implemented locally by hard-coded intrinsic oscillators synchronizing to rhythmic input remains unclear, and different underlying computational mechanisms have been proposed. Here we explore human perception of tone sequences with some temporal regularity at varying rates, but with considerable variability. Next, using a dynamical systems perspective, we successfully model the participants' behavior using an adaptive frequency oscillator which adjusts its spontaneous frequency based on the rate of the stimuli. This model reflects human behavior better than a canonical nonlinear oscillator and a predictive ramping model (both widely used for temporal estimation and prediction) and demonstrates that the classical distinction between absolute and relative computational mechanisms can be unified under this framework. In addition, we show that neural oscillators may constitute hard-coded physiological priors (in a Bayesian sense) that reduce temporal uncertainty and facilitate the predictive processing of noisy rhythms. Together, the results show that adaptive oscillators provide an elegant and biologically plausible means to subserve rhythmic inference, reconciling previously incompatible frameworks for temporal inferential processes.
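The cited study fits its own adaptive frequency oscillator to behaviour; as a generic illustration only, the sketch below implements a textbook adaptive-frequency (Hopf-type) oscillator whose intrinsic frequency drifts toward the rate of a periodic input. Parameters and the 3 Hz stimulus are arbitrary and not taken from the study.

```python
# Generic adaptive-frequency (Hopf-type) oscillator: its intrinsic frequency
# omega drifts toward the rate of a periodic input (textbook formulation,
# e.g., Righetti et al.; not the model fitted in the cited study).
# All parameter values are illustrative.
import numpy as np

dt, T = 0.001, 60.0                       # time step and duration (s)
t = np.arange(0.0, T, dt)
stimulus = np.sin(2 * np.pi * 3.0 * t)    # periodic input at 3 Hz

mu, K = 1.0, 2.0                          # limit-cycle amplitude, coupling gain
x, y = 1.0, 0.0                           # oscillator state
omega = 2 * np.pi * 4.0                   # intrinsic frequency, starts at 4 Hz

for F in stimulus:                        # forward-Euler integration
    r2 = x * x + y * y
    dx = (mu - r2) * x - omega * y + K * F
    dy = (mu - r2) * y + omega * x
    domega = -K * F * y / np.sqrt(r2)     # frequency-adaptation rule
    x, y, omega = x + dx * dt, y + dy * dt, omega + domega * dt

print(f"Adapted frequency: {omega / (2 * np.pi):.2f} Hz")  # drifts toward ~3 Hz
```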
... Additionally, Assgari (2019, 2021) measured spectral context effects and not TCEs. These effects are subserved by different neural mechanisms [spectral contrast effects: neural adaptation (e.g., Stilp, 2020a); TCEs: neural oscillatory entrainment (Bosker and Ghitza, 2018) or evoked responses to acoustic edges (Kojima et al., 2021)] and are not obligated to follow the same patterns of results. Future investigation will elucidate which of these (or other) causes explain the lack of support for the second hypothesis. ...
... Stilp's (2020b) review] for which sufficient detail was provided to allow calculation of speaking rates (Pickett and Decker, 1960; Repp et al., 1978; Diehl et al., 1980; Summerfield, 1981; Port and Dalby, 1982; Gordon, 1988; Kidd, 1989; Newman and Sawusch, 2009; Reinisch et al., 2011; Reinisch and Sjerps, 2013; Reinisch, 2016; Bosker, 2017; Bosker and Ghitza, 2018). These speaking rates (mean slow speaking rate = 4.11 syllables/s, mean fast rate = 8.51 syllables/s) differ by larger amounts and extend to faster overall speaking rates than those tested here. ...
When speaking in noisy conditions or to a hearing-impaired listener, talkers often use clear speech, which is typically slower than conversational speech. In other research, changes in speaking rate affect speech perception through speaking rate normalization: Slower context sounds encourage perception of subsequent sounds as faster, and vice versa. Here, on each trial, listeners heard a context sentence before the target word (which varied from "deer" to "tier"). Clear and slowed conversational context sentences elicited more "deer" responses than conversational sentences, consistent with rate normalization. Changing speaking styles aids speech intelligibility but might also produce other outcomes that alter sound/word recognition.
... Furthermore, repackaging the syllables by inserting 100 ms silences resulted in an average syllabic rate of 6.1 sps. This rate is similar to the syllabic rate of the control condition and to the repackaged syllabic rate reported in the literature that used the same experimental paradigm 14,25,33. For more details about the effect of the time compression on the duration and the rate of the different speech segments, we refer to Gransier et al. 17; ...
... Ghitza and colleagues 14,25,33 showed that repackaging TC speech, by inserting silent intervals between speech segments, restores intelligibility. The results of the present study show that this improvement is associated with the amount of TFS available to the listeners. ...
Intelligibility of time-compressed (TC) speech decreases with increasing speech rate. However, intelligibility can be restored by ‘repackaging’ the TC speech by inserting silences between the syllables so that the original ‘rhythm’ is restored. Although restoration of the speech rhythm affects solely the temporal envelope, it is unclear to what extent repackaging also affects the perception of the temporal fine structure (TFS). Here we investigate to what extent TFS contributes to the perception of TC and repackaged TC speech in quiet. Intelligibility of TC sentences with a speech rate of 15.6 syllables per second (sps), and of the repackaged sentences created by adding 100 ms of silence between the syllables of the TC speech (i.e., a speech rate of 6.1 sps), was assessed for three TFS conditions: the original TFS and the TFS conveyed by an 8- and a 16-channel noise vocoder. An overall positive effect on intelligibility of both the repackaging process and the amount of TFS available to the listener was observed. Furthermore, the benefit associated with repackaging TC speech depended on the amount of TFS available. The results show that TFS contributes significantly to the perception of fast speech, even when the overall rhythm/envelope of TC speech is restored.
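The 6.1 sps figure follows directly from the numbers in the abstract: at 15.6 sps each compressed syllable lasts about 64 ms, and adding 100 ms of silence per syllable lengthens the effective syllable period accordingly. A quick check:

```python
# Quick check of the repackaging arithmetic from the abstract above.
tc_rate = 15.6                           # syllables/s after time compression
syllable_dur = 1.0 / tc_rate             # ~0.064 s per compressed syllable
repackaged_dur = syllable_dur + 0.100    # insert 100 ms of silence per syllable
print(f"Repackaged rate: {1.0 / repackaged_dur:.1f} syllables/s")  # ~6.1 sps
```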
... Notably, given the flexibility and adaptability of our brains, we believe that the brain could use different rhythms to parse the temporal perception of speech stimuli, depending on the stimulus and context. For example, the recognition of spoken language at different speaking rates requires us to adapt flexibly, and many studies have shown that speaking rate affects the temporal perception of speech (Bosker and Ghitza, 2018; Kösem et al., 2018). The syllables used in this study are slightly slower than the normal pronunciation rate, and a previous ...
Objective
Perceptual integration and segregation are modulated by the phase of ongoing neural oscillations whose period is longer than the temporal binding window (TBW). Studies have shown that the perception of abstract beep-flash stimuli, with a TBW of about 100 ms, is modulated by the alpha-band phase. We therefore hypothesized that the temporal perception of speech, with a TBW of several hundred milliseconds, might be affected by the delta-theta phase.
Methods
We therefore conducted a speech-stimuli-based audiovisual simultaneity judgment (SJ) experiment. Twenty human participants (12 female) took part in the study while 62-channel EEG was recorded.
Results
Behavioral results showed that visual-leading TBWs were broader than auditory-leading ones [273.37 ± 24.24 ms vs. 198.05 ± 19.28 ms (mean ± SEM)]. We used the Phase Opposition Sum (POS) to quantify differences in mean phase angle and phase concentration between synchronous and asynchronous responses. The POS results indicated that the delta-theta phase differed significantly between synchronous and asynchronous responses in the A50V condition (50% synchronous responses at the auditory-leading SOA). In the V50A condition (50% synchronous responses at the visual-leading SOA), however, we found only a delta-band effect. In neither condition did the post hoc Rayleigh test reveal phase consistency across subjects for either perceptual response (all ps > 0.05). This suggests that the phase effect might not reflect neuronal excitability, which would require the phases for a given perceptual response to concentrate around the same angle across subjects rather than being uniformly distributed. However, the V-test showed that the phase difference between synchronous and asynchronous responses exhibited significant phase opposition across subjects (all ps < 0.05), which is compatible with the POS result.
Conclusion
These results indicate that the temporal perception of speech depends on the alignment of stimulus onset with an optimal phase of a neural oscillation whose period may be longer than the TBW. The role of the oscillatory phase might be to encode temporal information that varies across subjects, rather than to index neuronal excitability. Given the rich temporal structure of spoken-language stimuli, the conclusion that phase encodes temporal information is plausible and valuable for future research.
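The Phase Opposition Sum reported in the Results above is commonly computed from inter-trial phase coherence (ITC) as POS = ITC_sync + ITC_async - 2 * ITC_all (e.g., VanRullen, 2016). A minimal sketch with placeholder phase data:

```python
# Sketch of the phase opposition sum (POS), following the common definition
# POS = ITC_sync + ITC_async - 2 * ITC_all (e.g., VanRullen, 2016), where ITC
# is inter-trial phase coherence. The phase arrays below are placeholders.
import numpy as np

def itc(phases):
    """Inter-trial coherence: resultant vector length of phase angles (radians)."""
    return np.abs(np.mean(np.exp(1j * np.asarray(phases))))

def phase_opposition_sum(phases_sync, phases_async):
    all_phases = np.concatenate([phases_sync, phases_async])
    return itc(phases_sync) + itc(phases_async) - 2 * itc(all_phases)

# Placeholder example: two response groups with roughly opposite phases.
rng = np.random.default_rng(0)
sync_phases = rng.vonmises(0.0, 2.0, size=100)        # concentrated near 0 rad
async_phases = rng.vonmises(np.pi, 2.0, size=100)     # concentrated near pi rad
print(phase_opposition_sum(sync_phases, async_phases))  # > 0: phase opposition
```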
... Removing slow amplitude modulations (e.g., using low-pass filters) strongly reduces speech comprehension [18]. Speech that is time-compressed (e.g., average syllable rate 9 Hz), and therefore unintelligible, can be made intelligible by inserting silent periods so that the overall rhythm is closer to that of typical speech (e.g., average syllable rate 6 Hz [19,20]). ...
Auditory rhythms are ubiquitous in music, speech, and other everyday sounds. Yet, it is unclear how perceived rhythms arise from the repeating structure of sounds. For speech, it is unclear whether rhythm is solely derived from acoustic properties (e.g., rapid amplitude changes), or if it is also influenced by the linguistic units (syllables, words, etc.) that listeners extract from intelligible speech. Here, we present three experiments in which participants were asked to detect an irregularity in rhythmically spoken speech sequences. In each experiment, we reduce the number of possible stimulus properties that differ between intelligible and unintelligible speech sounds and show that these acoustically-matched intelligibility conditions nonetheless lead to differences in rhythm perception. In Experiment 1, we replicate a previous study showing that rhythm perception is improved for intelligible (16-channel vocoded) as compared to unintelligible (1-channel vocoded) speech–despite near-identical broadband amplitude modulations. In Experiment 2, we use spectrally-rotated 16-channel speech to show the effect of intelligibility cannot be explained by differences in spectral complexity. In Experiment 3, we compare rhythm perception for sine-wave speech signals when they are heard as non-speech (for naïve listeners), and subsequent to training, when identical sounds are perceived as speech. In all cases, detection of rhythmic regularity is enhanced when participants perceive the stimulus as speech compared to when they do not. Together, these findings demonstrate that intelligibility enhances the perception of timing changes in speech, which is hence linked to processes that extract abstract linguistic units from sound.
... There are even indications that tACS can serve as an external "pacemaker," guiding the phase and frequency of endogenous oscillations and, in turn, influencing behavioral speech perception (Kösem et al., 2020; Riecke et al., 2018; Zoefel et al., 2018). In line with these neurobiological findings, behavioral rate-dependent effects are observed only for speech rates in the 3-9 Hz range, that is, when the speech rate can be encoded by ongoing theta oscillations (Bosker & Ghitza, 2018). Further behavioral support comes from the observation that special populations known to show impaired neural entrainment, such as individuals with developmental dyslexia (Goswami, 2011; Goswami et al., 2002), also show a reduced rate effect relative to typically developed listeners (Gabay et al., 2019). ...
... This is why we refrain from interpreting a direct comparison between these two context conditions; if one insists on such a comparison, however, the magnitude of the reduction of the rate effect appeared not to differ between the noise and reverberation contexts relative to the clear context in Experiment 1. Future studies may focus on a more thorough exploration of the effects of the level of noise or reverberation, asking at what threshold of degradation generally robust low-level processes such as rate-dependent perception start to lose impact until they disappear completely (cf. Bosker & Ghitza, 2018). The main finding of Experiment 1 of the present study was that signal degradation of a context can lead to a reduction of rate-dependent speech perception and might hence be qualitatively different from listening under taxed cognitive load with a clear speech signal. ...
Temporal contrasts in speech are perceived relative to the speech rate of the surrounding context. That is, following a fast context sentence, listeners interpret a given target sound as longer than following a slow context, and vice versa. This rate effect, often referred to as “rate-dependent speech perception,” has been suggested to be the result of a robust, low-level perceptual process, typically examined in quiet laboratory settings. However, speech perception often occurs in more challenging listening conditions. Therefore, we asked whether rate-dependent perception would be (partially) compromised by signal degradation relative to a clear listening condition. Specifically, we tested effects of white noise and reverberation, with the latter specifically distorting temporal information. We hypothesized that signal degradation would reduce the precision of encoding the speech rate in the context and thereby reduce the rate effect relative to a clear context. This prediction was borne out for both types of degradation in Experiment 1, where the context sentences but not the subsequent target words were degraded. However, in Experiment 2, which compared rate effects when contexts and targets were coherent in terms of signal quality, no reduction of the rate effect was found. This suggests that, when confronted with coherently degraded signals, listeners adapt to challenging listening situations, eliminating the difference between rate-dependent perception in clear and degraded conditions. Overall, the present study contributes towards understanding the consequences of different types of listening environments on the functioning of low-level perceptual processes that listeners use during speech perception.
... The current work contributes new evidence that exposure to speech modulates temporal processing (Bosker & Ghitza, 2018;Kösem et al., 2018), by demonstrating that listeners were substantially more likely to both detect and locate gaps that occurred within an ongoing speech stream, rather than before speech begins. We ensured that the breath sounds had a standardised intensity and even manipulated them directly in Experiment 3, in addition to visually priming participants at the onset of each trial. ...
The effect of non-speech sounds, such as breathing noise, on the perception of speech timing is currently unclear. In this paper we report the results of three studies investigating participants' ability to detect a silent gap located adjacent to breath sounds during naturalistic speech. Experiment 1 (n = 24, in-person) asked whether participants could either detect or locate a silent gap that was added adjacent to breath sounds during speech. In Experiment 2 (n = 182; online), we investigated whether different placements within an utterance were more likely to elicit successful detection of gaps. In Experiment 3 (n = 102; online), we manipulated the breath sounds themselves to examine the effect of breath-specific characteristics on gap identification. Across the study, we document consistent effects of gap duration, as well as gap placement. Moreover, in Experiment 2, whether a gap was positioned before or after an interjected breath significantly predicted accuracy as well as the duration threshold at which gaps were detected, suggesting that nonverbal aspects of audible speech production specifically shape listeners' temporal expectations. We also describe the influences of the breath sounds themselves, as well as the surrounding speech context, that can disrupt objective gap detection performance. We conclude by contextualising our findings within the literature, arguing that the verbal acoustic signal is not "speech itself" per se, but rather one part of an integrated percept that includes speech-related respiration, which could be more fully explored in speech perception studies.