Language, Cognition and Neuroscience, 2018, Vol. 33, No. 8, 955–967
https://doi.org/10.1080/23273798.2018.1439179
REGULAR ARTICLE
Entrained theta oscillations guide perception of subsequent speech: behavioural
evidence from rate normalisation

Hans Rutger Bosker (a,b) and Oded Ghitza (c,d)

(a) Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands; (b) Donders Institute for Brain, Cognition and Behaviour, Radboud University, Nijmegen, The Netherlands; (c) Department of Biomedical Engineering, Hearing Research Center, Boston University, Boston, MA, USA; (d) Neuroscience Department, Max Planck Institute for Empirical Aesthetics, Frankfurt, Germany
ABSTRACT
This psychoacoustic study provides behavioural evidence that neural entrainment in the theta range (3–9 Hz) causally shapes speech perception. Adopting the "rate normalization" paradigm (presenting compressed carrier sentences followed by uncompressed target words), we show that uniform compression of a speech carrier to syllable rates inside the theta range influences perception of subsequent uncompressed targets, but compression outside theta range does not. However, the influence of carriers – compressed outside theta range – on target perception is salvaged when carriers are "repackaged" to have a packet rate inside theta. This suggests that the brain can only successfully entrain to syllable/packet rates within theta range, with a causal influence on the perception of subsequent speech, in line with recent neuroimaging data. Thus, this study points to a central role for sustained theta entrainment in rate normalisation and contributes to our understanding of the functional role of brain oscillations in speech perception.
ARTICLE HISTORY
Received 11 August 2017; Accepted 2 February 2018

KEYWORDS
Neural entrainment; theta oscillations; speech rate; rate normalisation
Introduction
Speech is a communicative signal with inherent slow
amplitude modulations. These amplitude modulations
fluctuate at a rate around 3–9 Hz, in various speech
types and in different languages (Bosker & Cooke, in
press; Ding et al., 2017; Krause & Braida, 2004; Varnet,
Ortiz-Barajas, Erra, Gervain, & Lorenzi, 2017), driven pri-
marily by the syllabic rate of speech. Recently, models
of speech perception have pointed at the remarkable
correspondence between the time scales of phonemic,
syllabic, and phrasal linguistic units, on the one hand,
and the periods of the gamma, theta, and delta oscil-
lations in the brain, on the other (Ghitza, 2011;
Poeppel, 2003). This correspondence has inspired
recent hypotheses on the potential role of neuronal oscil-
lations in speech perception (Ghitza, 2011, 2017; Ghitza &
Greenberg, 2009; Giraud & Poeppel, 2012; Peelle & Davis,
2012; Poeppel, 2003). For instance, evidence has accu-
mulated showing that the listening brain follows the syl-
labic speech rhythm by phase-locking endogenous theta
oscillations (3–9 Hz) to the amplitude envelope of
speech (Doelling, Arnal, Ghitza, & Poeppel, 2014; Gross
et al., 2013; Luo & Poeppel, 2007; Peelle & Davis, 2012;
Peelle, Gross, & Davis, 2013). This process, known as
neural entrainment (speech tracking), has also been
proposed to explain why low-frequency amplitude
modulations in speech play a crucial role in perception
(e.g. literature on locally time-reversed, interrupted, alter-
nated, and filtered speech; Drullman, Festen, & Plomp,
1994a; Drullman, Festen, & Plomp, 1994b; Elliott & Theu-
nissen, 2009; Ghitza, 2012; Peelle & Davis, 2012; Saberi &
Perrott, 1999; Shannon, Zeng, Kamath, Wygonski, &
Ekelid, 1995; Ueda, Nakajima, Ellermeier, & Kattner,
2017). However, the functional role of this neural entrain-
ment in speech perception remains a topic of debate: is
entrainment causally involved in shaping successful
speech perception (Riecke, Formisano, Sorger, Başkent,
& Gaudrain, 2018; Zoefel, Archer-Boyd, & Davis, 2018)
or is it merely a response-driven epiphenomenon of
speech processing (Obleser, Herrmann, & Henry, 2012)?
The present study will put forward psychoacoustic find-
ings suggesting that, not only does neural entrainment
to a particular syllable rate shape the decoding of concur-
rent speech (Experiment 1), but also that neural entrain-
ment might persist when the entraining rhythm has
ceased, influencing the perception of subsequently pre-
sented words (Experiments 2 and 3).
Neural entrainment shapes perception of
concurrent speech
Evidence that neural entrainment shapes the decoding
of concurrent speech (i.e. the speech presented during
the time interval in which entrainment occurs) has been
given by both behavioural and neuroimaging exper-
iments. Behavioural evidence has come from studies
using (highly) compressed (i.e. accelerated) speech,
showing that the greater the compression, the more
intelligibility is impaired (e.g. Dupoux & Green, 1997).
Ghitza (2014) demonstrated that the intelligibility of
compressed speech deteriorates particularly sharply
when syllable rates exceed the upper frequency of the
theta range (>9 Hz; cf. Ghitza & Greenberg, 2009). He
postulated that as long as the syllable rate is inside the
theta frequency range, theta oscillations track the
speech input, aligning neural theta cycles to the input
syllabic rhythm.[1] When theta is in sync with the
sensory syllable rate (i.e. in the 3–9 Hz range), intelligibil-
ity is high. However, when theta is out of sync, for
instance, for syllable rates above 9 Hz, intelligibility
drops. This account can also explain perceptual learning
of compressed speech (i.e. better comprehension of
compressed speech with greater exposure, even across
talker changes; Adank & Janse, 2009; Dupoux & Green,
1997), since greater exposure would presumably allow
for closer alignment of neural oscillations to the speech
input.
This oscillations-based account of compressed speech
perception predicts that the maximum information
transfer rate of syllables – the auditory channel capacity –
is 9 syllables per second, the maximum speech tempo
where the theta oscillations are still in sync with the
speech input. Support for this prediction has been
given in Ghitza (2014) by "repackaging" compressed
speech: dividing a time-compressed waveform into frag-
ments, called packets, and delivering the packets at a
prescribed packet delivery rate by inserting silent inter-
vals in between packets (cf. Figure 1 below). While com-
pressed speech with syllable rates higher than 9 Hz (i.e.
outside theta range) was largely unintelligible, repacka-
ging these compressed signals such that the packet
delivery rate fell below 9 packets per second (i.e. allow-
ing theta oscillations to be in sync with the input
signal) restored intelligibility to a large extent. Note
that the acoustics inside the packet is the compressed
signal; hence the phonetic material available to the lis-
tener is identical in both (compressed and repackaged)
conditions. Only the packet delivery rate is different
between the compressed and repackaged speech con-
ditions, indicating that auditory channel capacity is
defined by information transfer rate, mostly independent
of speech tempo.[2]
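The arithmetic behind this distinction is simple enough to state explicitly (our notation): the packet delivery rate is fixed by the packet duration plus the inserted silence, regardless of how compressed the speech inside each packet is,

$$ r_{\text{packet}} = \frac{1}{T_{\text{packet}} + T_{\text{gap}}} = \frac{1}{0.066\ \text{s} + 0.100\ \text{s}} \approx 6\ \text{Hz}, $$

using the 66 ms packets and 100 ms gaps of the repackaged condition below (cf. Figure 1), while the syllable rate inside each packet remains about 15 Hz.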
Neuroimaging findings support the idea that success-
ful neural entrainment shapes the decoding of concur-
rent speech. For instance, Doelling et al. (2014)
recorded magnetoencephalography (MEG) while partici-
pants listened to speech from which the slow syllabic
amplitude fluctuations had been removed (by filtering;
cf. Ghitza, 2012). The authors observed that neural
entrainment was reduced and intelligibility decreased,
relative to control. However, artificially reinstating the
temporal fluctuations by mixing in "clicks" at syllabic
intervals restored envelope-tracking activity (Doelling
et al., 2014) and hence speech intelligibility (Ghitza,
2012).

Figure 1. Waveforms for one example digit string, compressed at different time-compression factors κ, with syllable rate σ. The bottom waveform shows the repackaged condition, comprised of speech packets 66 ms long (taken from the κ = 5 condition with a syllable rate of about 15 syllables per second), spaced apart by 100 ms, resulting in a 6 Hz packet rate.
Neural entrainment shapes perception of
subsequent speech
It has recently been suggested that stimulus-induced
entrainment may persist even when the driving stimulus
has ceased (Lakatos et al., 2013; Spaak, de Lange, &
Jensen, 2014). Influencing effects of this persisting
neural rhythm on the perception of subsequent speech
would be strong evidence for the notion of neural
entrainment as a causal factor shaping perception (i.e.
rather than being a response-driven epiphenomenon
of language comprehension). That is, because the
entraining rhythm has already ceased, the possibility of
response-driven effects is excluded. For instance,
Hickok, Farahbod, and Saberi (2015) presented listeners
with entraining white noise modulated at 3 Hz, followed
by stationary noise. Participants' task was to detect near-
threshold pure tones that were embedded in the station-
ary noise signal in half of the trials. The authors observed
a perceptual oscillation in participants' tone detection
performance, matching the period of the entraining
stimulus (i.e. 3 Hz) and persisting over several cycles.
This suggests that neural entrainment may be sustained
after rhythmic stimulation, consequently influencing
subsequent auditory perception.
Most studies on speech perception addressing the
issue of persisting entrainment and ensuing effects
thereof have adopted the "rate normalization" paradigm.
That is, the perception of a target speech sound ambig-
uous between two members of a durational contrast (e.g.
in English, short /b/ vs. long /w/; in Dutch, short /ɑ/ vs.
long /a:/) may be biased towards the longer phoneme
(i.e. /w/ in English; /a:/ in Dutch) if presented after a pre-
ceding sentence (hereafter: carrier) produced at a faster
speech rate (Bosker, Reinisch, & Sjerps, 2017; Kidd, 1989;
Pickett & Decker, 1960; Reinisch & Sjerps, 2013;
Toscano & McMurray, 2015). This process, known as
rate normalisation, has been argued to involve general-
auditory processes, since it occurs in human and non-
human species (Dent, Brittan-Powell, Dooling, & Pierce,
1997), is induced by talker-incongruent contexts
(Bosker, 2017b; Newman & Sawusch, 2009), and even
by non-speech (Bosker, 2017a; Gordon, 1988; Wade &
Holt, 2005); in contrast to other rate-dependent percep-
tual effects, such as the Lexical Rate Effect (Dilley & Pitt,
2010; Pitt, Szostak, & Dilley, 2016). Rate normalisation
has been proposed to be related to the cognitive prin-
ciple of durational contrast (Diehl & Walsh, 1989). That
is, the perception of ambiguous speech segments
should be biased towards longer (shorter) percepts in
the context of other shorter (longer) surrounding seg-
ments (Wade & Holt, 2005). Even though this principle
accounts for the general-auditory nature of rate normal-
isation effects, it is as yet unclear how this cognitive prin-
ciple might be neurally implemented.
One potential neurobiologically plausible mechanism
involves sustained neural entrainment to the preceding
sentence. Bosker (2017a) showed that anisochronous
(non-rhythmic) fast and slow carriers do not trigger
rate normalisation, suggesting that rate normalisation
is induced by the periodicity in the carrier. This is in
line with an MEG experiment (Kösem et al., 2017),
where native Dutch participants were presented with
slow and fast carriers (amplitude modulations around 3
and 5.5 Hz, respectively), followed by (uncompressed)
target words containing vowels ambiguous between
short /ɑ/ and long /a:/. Behavioural vowel identification
responses revealed a consistent rate normalisation
effect, with more long /a:/ responses after fast carriers.
The MEG data showed that, during the carrier, neural
theta oscillations efficiently tracked the dynamics of
the slow and fast speech. Moreover, during the target
word time window, theta oscillations in right superior
temporal cortex were observed that corresponded in fre-
quency to the preceding syllable rate, revealing persist-
ing entrainment in the target window. In fact, the
extent to which these theta oscillations persisted into
the target window correlated with the observed behav-
ioural biases: the more evidence for sustained entrain-
ment in the target window, the greater the behavioural
rate normalisation effect (Kösem et al., 2017).
These findings suggest that neural oscillations actively
shape speech perception (Bosker, 2017a; Bosker & Kösem,
2017; Peelle & Davis, 2012). Entrained theta oscillations
would be thought to impose periodic phases of neuronal
excitation and inhibition, thus sampling the input signal at
the appropriate temporal granularity (Bosker, 2017a). In
other words, entrainment at a higher theta frequency
(e.g. 8 Hz) would raise the cortical sampling frequency.
If entrained neural rhythms persist after rhythmic stimu-
lation has ended, this higher cortical sampling frequency
then also affects the perception of subsequent speech.
That is, if a fast sentence is suddenly followed by an
ambiguous target vowel (e.g. ambiguous between short
/ɑ/ and long /a:/ in Dutch), entrainment to the fast sen-
tence would lead to "oversampling" the target vowel,
inducing overestimation of the target's duration (i.e.
more long /a:/ responses). Similarly, if the same target
word is presented after a slow carrier, this would induce
cortical "undersampling", with the target's duration
being underestimated (i.e. fewer long /a:/ responses).
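As a toy numerical illustration of this sampling account (our own sketch, not a computational model from the paper; it simply assumes that a duration estimate scales with the number of entrained theta cycles spanning a segment):

```python
def cycles_spanned(duration_s: float, theta_hz: float) -> float:
    """Toy proxy: number of entrained theta cycles a segment spans."""
    return duration_s * theta_hz

vowel = 0.120            # the 120 ms ambiguous vowel of Experiments 2-3
for rate in (3.0, 6.0):  # entrained frequencies matching slow/fast carriers
    print(rate, cycles_spanned(vowel, rate))
# 3 Hz -> 0.36 cycles: "undersampling", duration underestimated (/ɑ/)
# 6 Hz -> 0.72 cycles: "oversampling", duration overestimated (/a:/)
```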
Contextual effects of syllable rates inside and
outside theta range
Oscillations-based accounts of rate normalisation
(Bosker, 2017a; Kösem et al., 2017; Peelle & Davis, 2012)
would predict that, whenever theta oscillations are in
sync with the syllable rate of the carrier (i.e. ranging
from 39 Hz), they will consistently influence the cortical
sampling frequency, with sustained effects influencing
the perceived duration of following target segments.
However, when carriers are compressed to syllable
rates above 9 Hz, theta oscillations would be out of
sync with the syllabic input rate, thus removing contex-
tual influences on subsequent target perception. That
is, these oscillations-based models predict an upper
limit to the syllable rates that induce rate normalisation:
increasing syllable rates to up to nine syllables per
second should bias perception of following ambiguous
targets more and more towards longer percepts;
however, no consistent contextual influences should be
observed for syllable rates exceeding the theta range.
Interestingly, all studies on rate normalisation in the
literature so far have tested the effects of syllable rates
inside the theta range of 3–9 Hz. As far as we could
establish, the fastest carriers tested had a syllable
rate of 8 Hz (e.g. Newman & Sawusch, 2009). As such, it
is as yet unknown whether there is an upper limit to
the contextual syllable rates that might induce rate nor-
malisation effects.
In this study, we tested the effect of rate normalisation
with contextual syllable rates beyond capacity. In Exper-
iment 1, Ghitza (2014) was replicated with Dutch
materials. In Experiment 2, Dutch participants were pre-
sented with time-compressed carriers followed by uncom-
pressed target words containing vowels ambiguous
between Dutch /ɑ/ and /a:/. The carriers were linearly com-
pressed by various compression factors κ, resulting in
some syllable rates within theta range (i.e. below 9 Hz)
and some outside theta range (i.e. above 9 Hz). Rather
than a gradual increase in long /a:/ responses across all
compression factors, oscillations-based models predict
an increase in the percentage of long /a:/ responses
only for carriers with syllable rates up to 9 Hz.
However, when the carrier has a syllable rate above
9 Hz, cortical theta is assumed to be out of sync with
the syllabic input rate, presumably eliminating any poten-
tial contextual effect on subsequent speech perception.
Restoring rate normalisation through
repackaging
If Experiment 2 would show that rate normalisation
effects are only triggered by syllable rates within theta
range, then this would provide behavioural support for
oscillations-based models of speeded speech compre-
hension. In turn, such a finding would predict that
repackaging compressed speech (Ghitza, 2014; Ghitza
& Greenberg, 2009) might be able to restore rate normal-
isation effects. For instance, compressed speech with a
syllable rate of 15 Hz would not be predicted to bias
the perception of subsequent Dutch target vowels
towards /a:/, because the syllable rate is outside theta.
However, if this compressed speech signal would be
repackaged such that the packet delivery rate would
fall below 9 Hz, rate normalisation effects should be
restored.
Experiment 3 was designed to test the effect of
repackaged speech on the perception of subsequent
speech. Oscillations-based models would predict that
compressed speech with a syllable rate above 9 Hz
would not induce rate normalisation, whereas repacka-
ging this same acoustic signal such that the packet deliv-
ery rate would fall below 9 Hz would induce rate
normalisation (similar to compressed speech with sylla-
ble rates below 9 Hz).
Experiment 1
Experiment 1 served as an attempt to replicate the find-
ings from Ghitza (2014) in Dutch, testing the intelligibility
of Dutch speech recordings compressed with various
compression factors κ such that resulting syllable rates
would fall either within (i.e. below 9 Hz) or outside
theta range (above 9 Hz). Moreover, it tested whether
repackaging speech such that the resulting packet deliv-
ery rate falls within theta would restore intelligibility of
highly compressed speech, as reported in Ghitza
(2014). If Experiment 1, targeting overall intelligibility,
succeeds in replicating the findings from Ghitza (2014),
this allows us to use the Dutch speech materials for the
subsequent experiments, targeting rate normalisation.
Method
Participants
Native Dutch participants (N = 18) with normal hearing
were recruited from the Max Planck Institute's participant
pool. They gave informed consent as approved by the
Ethics Committee of the Social Sciences department of
Radboud University (project code: ECSW2014-1003-196).
Data from one participant were excluded for failing to
understand the experimental task. Data from another
participant were excluded because she requested to
stop the experiment, leaving data from 16 participants
(12 females, 4 males; M_age = 21) for analysis.
Design and materials
Sentence materials were generated after Ghitza (2014). A
male native speaker of Dutch (first author) was recorded,
fluently producing 100 digit strings, followed by Dutch
target words. Experiment 1 only made use of the
recorded digit strings; description of the target words
is provided in the method of Experiment 2.
Each digit string was comprised of 7 digits (e.g. "twee
een vijf, vier zes vijf drie"; "two one five, four six five
three"), approximately 2 s long, with an average syllable
rate around 3 Hz (SD = 0.2). The speaker uttered the
strings as a phone number, using a cluster of 3 digits fol-
lowed by a cluster of 4 digits, without combining digits
(e.g. no "sixteen", "three times five", etc.). All digits
were used (0, 1, 2, 3, 4, 5, 6, 7, 9), except acht /ɑxt/
"eight", because it contained the critical vowel used in
the target words in Experiments 2 and 3.
Digit strings were time-compressed using PSOLA in
Praat (Boersma & Weenink, 2016). This manipulation
alters a signal's duration while preserving spectral prop-
erties (e.g. original formant patterns and fundamental
frequency contours are maintained; cf. Figure 1). The
time-compression factor κ (i.e. the factor by which the
original duration of an utterance was compressed; e.g.
κ = 5 involves compressing speech to 20% of its original
duration) was chosen such that κ ∈ {1, 2, 5}, resulting in
three different experimental conditions. Signals with
κ = 1 and signals with κ = 2 had an average syllable rate
of around 3 and 6 Hz, respectively. As such, their syllable
rates fell below 9 Hz, generally assumed to be the upper
limit of cortical theta and the maximum reliable infor-
mation transfer rate through the auditory channel
(i.e. "auditory channel capacity"; Ghitza, 2014). Signals
with κ = 5 had an average syllable rate of around 15 Hz
(i.e. above 9 Hz), falling outside cortical theta and audi-
tory channel capacity.
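To make the compression step concrete, here is a minimal sketch using the parselmouth Python bindings to Praat (our assumption; the authors worked in Praat directly). Praat's "Lengthen (overlap-add)" command performs the PSOLA-style duration manipulation:

```python
import parselmouth
from parselmouth.praat import call

def compress(sound: parselmouth.Sound, kappa: float) -> parselmouth.Sound:
    """Uniformly time-compress a recording by factor kappa, preserving
    formant patterns and the fundamental frequency contour."""
    # Arguments: minimum pitch (Hz), maximum pitch (Hz), duration factor.
    # Praat restricts how extreme a single pass may be, so very strong
    # compressions (e.g. kappa = 5) may need successive passes.
    return call(sound, "Lengthen (overlap-add)", 75, 600, 1.0 / kappa)

carrier = parselmouth.Sound("digit_string.wav")  # hypothetical file name
carrier_k2 = compress(carrier, 2)  # ~6 Hz syllable rate from a ~3 Hz original
```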
A fourth repackaged condition involved repackaging
the κ = 5 condition (see Figure 1). Intervals of 66 ms
were excised from the signal with κ = 5 and spaced
apart by 100 ms using Praat. This manipulation resulted
in a repackaged condition that contained acoustic-pho-
netic material that was identical to the κ = 5 condition,
only with a significantly lower packet delivery rate of
6 Hz (i.e. within the cortical theta range). Finally, low-
level speech-shaped noise was added to all digit
strings (SNR = 20 dB; not shown in Figure 1).
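The repackaging manipulation itself reduces to a few lines (a minimal sketch assuming a mono waveform held in a NumPy array; the authors performed the equivalent operation in Praat):

```python
import numpy as np

def repackage(signal: np.ndarray, fs: int,
              packet_ms: float = 66.0, gap_ms: float = 100.0) -> np.ndarray:
    """Cut the compressed signal into consecutive 66 ms packets and insert
    100 ms of silence after each, i.e. a packet delivery rate of
    1 / (0.066 + 0.100) ~= 6 Hz."""
    packet_len = int(round(fs * packet_ms / 1000))
    gap = np.zeros(int(round(fs * gap_ms / 1000)), dtype=signal.dtype)
    chunks = []
    for start in range(0, len(signal), packet_len):
        chunks.append(signal[start:start + packet_len])
        chunks.append(gap)
    return np.concatenate(chunks)
```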
Procedure
Stimulus presentation was controlled by Presentation
software (v16.5; Neurobehavioral Systems, Albany, CA,
USA). Each participant was presented with the four
speech conditions (κ=1; κ=2; κ=5; repackaged), with
80 digit strings per condition, chosen at random from
the total set of 100.
Each trial started with a fixation cross presented on
screen. After 500 ms, the auditory stimulus was played.
At stimulus offset, the screen was replaced with a
response screen. Participants were instructed to enter
only the last four digits of the stimulus they had heard
(using digits, e.g. "4653"), and hit Enter to proceed (i.e.
self-paced). Requesting only the last four digits
reduced the bias of memory load on error patterns and
provided an opportunity for the presumed (cortical)
theta oscillator to entrain to the input rhythm of the
first three digits prior to the occurrence of the final
four digits. Participants were instructed to always enter
four digits, in the exact order in which they were
spoken, encouraging participants to guess if they had
missed any digit. After a response had been recorded,
a blank screen was presented for 500 ms, and the next
trial was initiated.
Results
Trials with responses with less than four digits (n=53;
<1%) were excluded from analyses. Proportion correct
scores were calculated as the proportion of digits in a
given string that were correctly registered in the
correct position (i.e. for the digit string "215 4652", the
response "4652" received a proportion correct value of
1.0; the response "4352", 0.75; the response "6452", 0.5;
etc.), and derived percentage correct scores are pre-
sented in Figure 2.
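Expressed as code, the scoring scheme amounts to position-wise matching (a minimal sketch in Python; our illustration, not the authors' analysis script):

```python
def proportion_correct(final_digits: str, response: str) -> float:
    """Score a four-digit response against the final four digits of the
    string, position by position: for final digits '4652', the response
    '4652' scores 1.0, '4352' scores 0.75, and '6452' scores 0.5."""
    assert len(final_digits) == len(response) == 4
    return sum(t == r for t, r in zip(final_digits, response)) / 4
```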
Figure 2. Percentage correct intelligibility scores for the four
conditions with different time-compression factors κ from Exper-
iment 1 (error bars show standard errors). Speech intelligibility is
high for syllable rates within the theta range (i.e. κ = 1 and κ = 2
with syllable rates below 9 Hz) but deteriorates sharply for sylla-
ble rates outside the theta range (i.e. κ = 5 with syllable rates
above 9 Hz). Moreover, when applying "repackaging" such that
the resulting packet delivery rate falls within theta range,
speech intelligibility greatly improves.
Proportion correct scores were entered into a Gener-
alised Linear Mixed Model (GLMM; Quené & Van den
Bergh, 2008) with a logistic linking function, as
implemented in the lme4 library (Bates, Maechler,
Bolker, & Walker, 2015) in R (R Development Core
Team, 2012), with weights specified as the maximum
number of correct digits per trial (i.e. 4). Condition was
entered as predictor (categorical variable, dummy
coded with the κ=1 condition mapped onto the inter-
cept), with Participant and Digit String entered as
random factors with by-participant and by-digit string
random slopes for Condition (Barr, Levy, Scheepers, &
Tily, 2013).
This model revealed significant differences between
the κ = 1 and the κ = 5 condition (β = −9.291, SE = 0.688,
z = −13.499, p < 0.001; lower accuracy in κ = 5 vs. κ = 1),
between the κ = 1 and the repackaged condition (β =
−7.807, SE = 0.699, z = −11.157, p < 0.001; lower accuracy
in repackaged vs. κ = 1), and between the κ = 1 and the
κ = 2 condition (β = −3.997, SE = 0.691, z = −5.781, p <
0.001; slightly lower accuracy in κ = 2 vs. κ = 1).
The categorical predictor Condition can only compare
conditions to its intercept (set to κ = 1). In order to also
gain insight into comparisons between other conditions,
a mathematically equivalent GLMM was built, including
a re-leveled Condition predictor, this time mapping
the repackaged condition onto the intercept. This
analysis revealed significant differences between the
repackaged and the κ = 5 condition (β = −1.484, SE =
0.122, z = −12.147, p < 0.001; lower accuracy in κ = 5
vs. repackaged), and between the repackaged and
the κ = 2 condition (β = 3.811, SE = 0.285, z = 12.147,
p < 0.001; higher accuracy in κ = 2 vs. repackaged).
Interim discussion
Experiment 1 replicated the findings from Ghitza (2014)
using Dutch materials. Speech intelligibility was high
for syllable rates within the theta range (i.e. κ = 1 and
κ = 2 with syllable rates below 9 Hz) but deteriorated
sharply for syllable rates outside the theta range (i.e.
κ = 5 with syllable rates above 9 Hz). Moreover, when
maximally compressed speech (i.e. κ = 5) was repack-
aged such that the resulting packet delivery rate fell
within theta range, the intelligibility of the speech was
restored to a large degree. Thus, Experiment 1 validated
the use of these materials for subsequent experiments.
Experiment 2
Experiment 2 was designed to test whether there is an
upper limit to the contextual syllable rates that may
induce rate normalisation. Instead of a gradual increase
in rate normalisation effects as the compression factor
(hence the syllable rate) is raised, oscillations-based
models predict that rate normalisation effects should
not be induced by syllable rates above 9 Hz, since corti-
cal theta would then be assumed to be out of sync with
the syllabic input rate.
Method
Participants
Native Dutch participants (N = 21; 17 females, 4 males;
M_age = 22), who had not participated in Experiment 1,
with normal hearing were recruited from the Max
Planck Institute's participant pool. They gave informed
consent as approved by the Ethics Committee of the
Social Sciences department of Radboud University
(project code: ECSW2014-1003-196).
Design and materials
A set of 10 digit strings (out of the total set of a hundred)
was adopted for use in Experiment 2. These digit strings
were compressed (PSOLA in Praat) using various time-
compression factors κ, such that κ ∈ {1, 2, 3, 4, 5}, resulting
in five different experimental conditions (see Figure 1).
Signals with κ ∈ {1, 2, 3} had an average syllable rate
around 3, 6, and 9 Hz, respectively, falling just below the
upper limit of cortical theta and the maximum reliable
information transfer rate through the auditory channel
(i.e. "auditory channel capacity"; Ghitza, 2014). Signals with
κ > 3 had a syllable rate above 9 Hz, falling outside cortical
theta and auditory channel capacity.
The male native Dutch speaker who had been
recorded for Experiment 1 had also produced Dutch
target words following each digit string. These target
words were proper names selected from four different
minimal pairs containing either the short vowel /ɑ/ or
the long vowel /a:/: Ad - Aad, /ɑt - a:t/; Bas - Baas, /bɑs
- ba:s/; Dan - Daan, /dɑn - da:n/; Mart - Maart, /mɑrt -
ma:rt/. The /ɑ/-/a:/ vowel contrast in Dutch is cued by
both spectral (lower formant values for /ɑ/, higher
formant values for /a:/) and temporal cues (shorter dur-
ation for /ɑ/, longer duration for /a:/; Escudero,
Benders, & Lipski, 2009). Therefore, one natural /a:/
vowel token was selected, taken from the word Baas,
that had a duration falling in between the speaker's
typical /ɑ/ and /a:/ durations (120 ms; i.e. ambiguous in
duration between /ɑ/ and /a:/). This temporally ambig-
uous /a:/ vowel was manipulated to also be spectrally
ambiguous between /ɑ/ and /a:/ using Burg's LPC
method in Praat. Source and filter models were esti-
mated automatically from the selected vowel. The
formant values in the filter models were inspected and
adjusted (F1 = 820 Hz; F2 = 1150 Hz) to fall in between
the speaker's /ɑ/ and /a:/ formant values. Recombination
of the source and adjusted filter model resulted in a
vowel that was ambiguous between /ɑ/ and /a:/ in
both its temporal and spectral properties (corroborated
by pretesting). Finally, the vowel was spliced into fixed
consonantal frames (/ʔ_t/; /b_s/; /d_n/; /m_rt/) to
create target words.
Note that only using a single ambiguous vowel would
have made the experiment needlessly difficult for partici-
pants, whose task was to categorise the vowel in the
target word (i.e. negatively affecting participants' motiv-
ation). Therefore, filler trials were included that contained
clear (i.e. unambiguous) /ɑ/ and /a:/ vowels.
Procedure
Each participant was presented with the 10 digit strings
in all 5 compression conditions, followed by any of the 4
ambiguous minimal pairs (n = 200; or followed by any of
the 8 unambiguous target words in filler trials; n = 400).
Each trial started with a fixation cross presented on
screen. After 500 ms, a digit string was played, followed
by a 100 ms silent interval and a target stimulus. At
target offset, the fixation cross was replaced by a screen
with two response options, one word on the left,
another on the right (position of /ɑ/-/a:/ words counter-
balanced across participants). Participants entered their
response as to which of the two words they had heard
(Bas or Baas, etc.) by pressing "1" for the option on the
left, or "0" for the option on the right. After their response
(or timeout after 4 s), the screen was replaced by an empty
screen for 500 ms, after which the next trial was initiated.
Results
Trials with missing categorisation responses (n=6; <1%)
were excluded from analyses. Categorisation data, calcu-
lated as the percentage of long /a:/ responses (% /a:/), are
presented in Figure 3, and were analyzed by a GLMM
with a logistic linking function. The dependent variable
was response /a:/ (coded as 1) or /ɑ/ (coded 0). Condition
was entered as predictor (categorical variable, dummy
coded with the κ=1 condition mapped onto the inter-
cept), with Participant and Digit String entered as
random factors with by-participant and by-digit string
random slopes for Condition (Barr et al., 2013).
This model revealed significant differences between
the κ = 1 and the κ = 2 condition (β = 0.500, SE = 0.121,
z = 4.107, p < 0.001; higher percentage of /a:/ responses
in κ = 2); and between the κ = 1 and the κ = 3 condition
(β = 0.353, SE = 0.136, z = 2.601, p = 0.009; higher percen-
tage of /a:/ responses in κ = 3). However, no differences
were observed between the κ = 1, κ = 4, and κ = 5
conditions.
A mathematically equivalent GLMM with a re-leveled
Condition predictor, this time mapping the κ = 3 con-
dition onto the intercept, revealed additional significant
differences between the κ = 3 and the κ = 4 condition
(β = −0.286, SE = 0.112, z = −2.552, p = 0.011; lower per-
centage of /a:/ responses in κ = 4); and between the
κ = 3 and the κ = 5 condition (β = −0.326, SE = 0.117,
z = −2.778, p = 0.005; lower percentage of /a:/ responses
in κ = 5).
Interim discussion
Experiment 2 was designed to test whether there is an
upper limit to the syllable rates that elicit rate normalisa-
tion effects on following ambiguous target words. We
observed that compressing a naturally produced carrier
by a factor of 2 induced a higher percentage of /a:/
responses for following ambiguous target words (relative
to the uncompressed signal), replicating earlier findings
in the literature (Bosker, 2017b; Bosker et al., 2017; Rein-
isch, 2016b; Reinisch & Sjerps, 2013).
A novel finding of Experiment 2 is that when carriers
are compressed to have syllable rates outside the theta
range (i.e. the κ=4 and κ=5 conditions), no difference
in target word categorisation is observed compared to
the uncompressed condition (i.e. κ=1). This finding sup-
ports an oscillations-based mechanism underlying rate
normalisation: only when theta oscillations can optimally
track the rate of the carrier do we find effects on the per-
ception of subsequent target words.
Figure 3. Average categorisation data (in % long /a:/ responses)
for the five conditions with different time-compression factors κ
from Experiment 2 (error bars show standard errors). Com-
pression of speech carriers by κ= 2, with syllable rates within
the theta range, leads to an increase in % /a:/ responses.
However, compression of carriers by κ= 4 and κ= 5, with syllable
rates outside the theta range, does not lead to an increase in %
/a:/ responses (comparable target categorisation as in the base-
line κ= 1 condition).
Note that the categorisation data in the κ = 3 con-
dition fall in between the results for the κ = 2 and the
κ ∈ {1, 4, 5} conditions. This observation may be a conse-
quence of the fact that the average syllable rate in the κ
=3 condition is around 9 Hz, just at the upper border of
the theta range. Recalling the biophysical nature of the
neuronal theta, its frequency range is not precise and
the 9 Hz limit should be considered as an estimated
mean. Hence, a plausible explanation for this finding
may be related to individual differences between partici-
pants in how successful their theta oscillations were in
tracking the speech input at syllable rates near capacity.
Experiment 3
Experiment 3 was designed to test whether repackaged
compressed speech may induce rate normalisation.
Since repackaging of heavily compressed speech can
lower the packet delivery rate to below 9 Hz (i.e. within
theta range), oscillations-based models predict that
repackaging might restore rate normalisation effects.
Method
Participants
Native Dutch participants (N = 29) with normal hearing,
who had not participated in the previous experiments,
were recruited from the Max Planck Institute's participant
pool. They gave informed consent as approved by the
Ethics Committee of the Social Sciences department of
Radboud University (project code: ECSW2014-1003-196).
Data from 6 participants were excluded for reasons of
technical errors (n = 1), illness (n = 1), fatigue (n = 2) or
non-compliance with the experimental task (n = 2),
leaving data from 23 participants (19 females, 4 males;
M_age = 21) for analysis.
Design and materials
Experiment 3 combined the four speech conditions from
Experiment 1 (κ=1; κ=2; κ=5; repackaged) with the
procedure from Experiment 2. That is, the 10 digit
strings used in Experiment 2 were time-compressed
and repackaged, using the method described in Exper-
iment 1, and presented together with the target words,
described in Experiment 2.
Procedure
The experimental procedure in Experiment 3 mirrored
the procedure in Experiment 2. Each participant was pre-
sented with the 10 digit strings in all 4 conditions, fol-
lowed by any of the 4 ambiguous minimal pairs (n =
160; or by any of the 8 unambiguous target words in
filler trials; n = 320). Again, participants' task was to indi-
cate which target word they had heard (Bas or Baas, etc.).
Results
Trials with missing categorisation responses (n=3;
<1%) were excluded from analyses. Categorisation
data, calculated as the percentage of long /a:/
responses (% /a:/), are presented in Figure 4,andwere
analyzed by a GLMM with identical structure as the
one built for Experiment 2.
This GLMM revealed significant differences between
the κ=1 and the κ=2 condition (β= 0.398, SE = 0.101,
z= 3.927, p<0.001; higher percentage of /a:/ responses
in κ=2); and between the κ=1 and the repackaged con-
dition (β= 0.272, SE = 0.100, z= 2.694, p=0.007; higher
percentage of /a:/ responses in repackaged condition).
No difference was found between the κ=1 and κ=5
condition (p>0.7).
A mathematically equivalent GLMM with a re-
leveled Condition predictor, this time mapping the
repackaged condition onto the intercept, revealed an
additional significant difference between the repack-
aged condition and the κ = 5 condition (β = −0.245,
SE = 0.100, z = −2.440, p = 0.015; lower percentage of
/a:/ responses in κ = 5). However, no difference was
found between the repackaged condition and the
κ = 2 condition (p > 0.2).
Figure 4. Average categorisation data (in % long /a:/ responses)
for the four conditions with different time-compression factors κ
from Experiment 3 (error bars show standard errors). Similar to
Experiment 2, compression of speech carriers by κ= 2, with syl-
lable rates within the theta range, leads to an increase in % /a:/
responses. Also, compression of carriers by κ= 5, with syllable
rates outside the theta range, does not lead to an increase in
% /a:/ responses (comparable target categorisation as in the
baseline κ = 1 condition). However, when applying "repacka-
ging" to the κ = 5 condition, rate normalisation is restored: the
repackaged condition induces an increase in % /a:/ responses
comparable to the κ= 2 condition.
Interim discussion
First, results from Experiment 3 replicate the findings for
the κ = 1, κ = 2, and κ = 5 conditions from Experiment
2. That is, rate normalisation was observed only when
syllable rates were inside the theta range.
The novel finding of Experiment 3 is that when the κ
=5 condition was repackaged such that its packet deliv-
ery rate was set to be around 6 Hz (inside theta range),
rate normalisation was restored. In fact, the categoris-
ation responses from the repackaged condition were
comparable to those from the κ=2 condition, with a syl-
lable rate comparable to the packet rate of the repack-
aged condition (around 6 Hz). Note that the acoustics
inside the packets in the compressed and repackaged
condition were identical (i.e. time-compressed by κ=5);
nevertheless, very different contextual effects were
observed on target word perception. Thus, Experiment
3, together with the replicated findings from Experiment
2, supports an oscillations-based account of rate normal-
isation, with a central role for sustained entrainment of
oscillations in the theta range in rate normalisation.
General discussion
The present study contributes psychoacoustic data to
the ongoing debate about the functional role of
entrained neural oscillations in speech comprehension:
does entrainment to a speech rhythm actively shape
the decoding of the spoken input signal (i.e. a causal
factor) or is it merely a manifestation of modulated
evoked responses to the driving auditory stimulus (i.e.
a consequence)? Previous studies have contributed to
this debate by providing empirical evidence that neural
entrainment guides the decoding of concurrent speech:
when the natural speech rhythm is disrupted, entrain-
ment is reduced (Doelling et al., 2014), and intelligibility
suffers (Ghitza, 2012,2014; Ghitza & Greenberg, 2009). A
demonstration that entrainment influences the percep-
tion of subsequent speech (i.e. after the driving stimulus
has ceased) – thus allowing the exclusion of potential
evoked effects from the driving stimulus – could
provide further evidence for the causal influence of
neural entrainment on speech comprehension. Our
study was concerned with providing such evidence by
means of psychoacoustic experimentation.
Experiment 1 showed that compressing Dutch speech
such that syllable rates fall outside the theta range (i.e. >
9 Hz) greatly harms intelligibility, in line with Ahissar et al.
(2001) who showed that MEG neural responses track the
temporal envelope of compressed speech as long as it is
intelligible. This behaviour is explained by the TEMPO
model in Ghitza (2011), suggesting that the decline in
intelligibility with speech speed is dictated by cortical
theta (Ghitza, 2014).[3] If the highly compressed speech
signal is "repackaged" (i.e. delivering compressed
speech packets at a lower rate by inserting silent inter-
vals in between packets; see Figure 1) such that the
packet delivery rate falls within theta range, intelligibility
is much enhanced. Together, these outcomes suggest
that entrainment of theta oscillations supports the
decoding of concurrent speech, extending earlier work
in English (Ghitza, 2014) to a new language: Dutch.
Experiment 2 and 3 targeted potential effects of sus-
tained entrainment on subsequent speech using the
rate normalisation paradigm: participants heard digit
strings compressed by various compression factors, fol-
lowed by (uncompressed) minimal pair target words
ambiguous between the short vowel /ɑ/ and the long
vowel /a:/ (e.g. Dan - Daan, /dɑn - da:n/). Exper-
iment 2 demonstrated that compression of the digit
strings biased listeners towards reporting more long
/a:/ responses, corroborating previous studies on rate
normalisation (e.g. Bosker, 2017b; Newman & Sawusch,
2009; Reinisch & Sjerps, 2013; Toscano & McMurray,
2015). However, this only applied for those compressed
speech conditions with syllable rates safely within theta
range. In fact, the perception of target words preceded
by compressed digit strings with syllable rates outside
the theta range (i.e. well above 9 Hz) was observed to
be comparable to the baseline uncompressed condition.
Thus, the present study is the first to demonstrate that
there is an upper limit to the syllable rates that induce
rate normalisation, namely a limit around 9 Hz.
One study that may seem to be at odds with this sug-
gestion is Wade and Holt (2005), who reported that tone
sequences with a presentation rate of 25 Hz also elicited
rate normalisation. Note, however, that this study used
non-speech carriers to elicit normalisation effects on
the perception of consonantal target segments (i.e.
short /b/ vs. long /w/ in English). It could be argued
that the perception of vowels (relatively long; syllable
nuclei) is governed by slow theta oscillations (corre-
sponding in modulation rate), whereas the perception
of consonants (relatively short; higher modulation rate;
syllable onsets and codas) would be sensitive to faster
gamma oscillations in the 25–40 Hz range (Giraud &
Poeppel, 2012). This, however, remains speculative and
future studies may examine potentially differential nor-
malisation of vowels and consonants.
Experiment 3 also used the rate normalisation para-
digm but this time participants were presented with
repackaged digit strings, followed by the ambiguous
target words. Oscillations-based models predict that
repackaging highly compressed speech would restore
cortical speech tracking, with consequences for the
LANGUAGE, COGNITION AND NEUROSCIENCE 963
temporal sampling of subsequent speech. Results indeed
showed that, whereas highly compressed speech (with
syllable rates > 9 Hz) did not have an influence on listen-
ers' target vowel perception (replicating Experiment 2),
repackaged compressed speech with packet delivery
rates inside theta range did bias listeners to report
more long /a:/ vowels.
Taken together, the present experiments point to a
central role for sustained theta entrainment in rate nor-
malisation. Neural theta oscillations are suggested to
entrain to the rhythmic properties of a driving speech
stimulus (Giraud & Poeppel, 2012; Peelle & Davis, 2012),
imposing an appropriate sampling regime onto the
incoming sensory stimulus (cf. Experiment 1; Ghitza,
2012,2014). These oscillations may persist for several
cycles after stimulation has ceased (Hickok et al., 2015;
Kösem et al., 2017; Lakatos et al., 2013; Spaak et al.,
2014), allowing for the possibility that a preceding
rhythm influences the perception of subsequent
speech by imposing its own cortical sampling fre-
quency. For instance, fast rhythms would induce "over-
sampling" of the following speech content, with listeners
overestimating the duration of subsequent spoken seg-
ments (Bosker, 2017a; Bosker & Kösem, 2017). When syl-
lable rates fall outside the theta range (i.e. > 9 Hz), theta
oscillations cannot effectively entrain to the incoming
speech signal, and consequently the sustained effect of
these theta oscillations on subsequent perception is
reduced (cf. Experiment 2). However, when the infor-
mation transfer rate is tuned back to within the theta
range (via repackaging), theta entrainment is restored,
improving the decoding of the concurrent speech (intel-
ligibility enhancement; cf. Experiment 1) and influencing
subsequent perception (rate normalisation; cf. Exper-
iment 3).
Thus, this study should be viewed as providing a
neural implementation of cognitive principles proposed
to underlie rate normalisation, such as the principle of
durational contrast (Diehl & Walsh, 1989; Oller, Eilers,
Miskiel, Burns, & Urbano, 1991; Wade & Holt, 2005).
This principle was formulated to have the psycholinguis-
tic function of biasing the perception of ambiguous
speech segments towards longer (shorter) percepts
when they occur in the context of other shorter
(longer) surrounding segments. The neural implemen-
tation proposed here refines the description of this cog-
nitive principle by showing that rate normalisation is
governed by the syllable/packet rate of the carrier (in
line with Bosker, 2017a). Furthermore, it provides impor-
tant constraints, derived from current oscillations-based
models of speech perception, revealing an upper limit
to rate normalisation effects (i.e. at the upper frequency
of the theta range; 9 Hz).
Note, however, that we do not claim that this is the
only mechanism driving rate normalisation effects in
speech perception. The perception of an ambiguous
target word can also be biased by rate manipulations
in the following, rather than preceding, context
(although typically with smaller effect sizes; J. L. Miller
& Liberman, 1979; Newman & Sawusch, 1996; Sawusch
& Newman, 2000). Moreover, subjective impressions of
fast speech (i.e. without an acoustic increase in syllable
rate) can induce rate normalisation, such as when listen-
ing to speech with segmental reductions (Reinisch,
2016a), habitually fast talkers (Maslowski, Meyer, &
Bosker, in press; Reinisch, 2016b), an unfamiliar language
(Bosker & Reinisch, 2017), or while dual-tasking (Bosker
et al., 2017). We adopt the recently proposed two-stage
model of normalisation processes in speech perception
(Bosker et al., 2017), in which, at an early stage, an auto-
matic general-auditory mechanism operates indepen-
dently of attentional demands. At a later stage, higher-
level influences, such as subjective impressions, come
into play involving cognitive rather than perceptual
adjustments. The neural mechanism proposed here
would be a candidate mechanism for the first general-
auditory stage of normalisation.
In sum, based on behavioural findings from three
psychoacoustic experiments, the present study (1)
concludes that rate normalisation is governed by the
syllable/packet rate of the carrier; and (2) puts
forward an oscillations-based neurobiological mechan-
ism of rate normalisation. This account is in line with
other behavioural studies (Bosker, 2017a; Bosker &
Kösem, 2017; Ghitza, 2012, 2014; Ghitza & Greenberg,
2009; Hickok et al., 2015), neuroimaging data (Doel-
ling et al., 2014; Kösem et al., 2017), and previously
formulated cognitive principles, such as durational
contrast (Diehl & Walsh, 1989). Thus, it augments
our understanding of the functional role of entrain-
ment in speech comprehension by proposing that
entrainment not only guides the processing of concur-
rent speech but also shapes the perception of follow-
ing speech content.
Notes
1. In order to be able to track the syllabic irregularities of
spontaneous speech (e.g. a stressed syllable followed
by a non-stressed syllable), the theta oscillator belongs
to a special class of oscillators termed "flexible oscillators"
(Ghitza, 2011). Such oscillators are different in important
respects from autonomous, rigid oscillators (cf. Ghitza,
2011).
2. It is important to note the distinction between the inter-
ruption of speech via Gating (G. A. Miller & Licklider,
1950), and the insertion of silent gaps via repackaging:
interruption removes part of the speech signal, while
repackaging maintains all speech information with artifi-
cial distribution of information in time, defined by the
packet rate. While phonemic restoration from inter-
rupted speech (Mattys, Brooks, & Cooke, 2009; Warren,
1970) has been attributed to informational masking,
the insertion of gaps provides additional decoding
time: a gradual change in gap duration should be
viewed as tuning the packet rate in a search for a
better synchronisation between the input information
flow and the capacity of the auditory channel; the
optimal range of the packet rate is dictated by the prop-
erties of cortical theta (Ghitza, 2011,2014).
3. It is important to mention here a recent EEG-based
study (Pefkou, Arnal, Fontolan, & Giraud, 2017)
showing that brain-wave oscillations follow the
rhythm of comprehensible fast speech, with speeds
up to 14 syllables/sec. Does this finding contradict
the 9 syllables/sec limit suggested by Ghitza (2014)?
Interpreting their data, Pefkou et al. (2017) suggested
two distinct functional processes, a bottom-up theta-
driven and a top-down beta-driven process, concur-
rently at play during speech perception. And so,
even when the theta oscillator ceases to accurately
track the input, thus preventing the bottom-up
process from extracting important syllabic infor-
mation from the sensory input, a top-down process
helps to obtain the missing information. Hence, the
brain-wave oscillations above 9 Hz – beyond the bio-
physical range of theta – reflect beta feedback. In
relation to our study, note that the effectiveness of
a top-down process is determined by the amount
of contextual information in the speech stream. In
Pefkou et al. (2017), participants listened to short
stories with high semantic context, feeding beta
oscillations and thus the brain-wave activity above
9 Hz. In contrast, our study (as well as Ghitza, 2014)
used random digit strings that do not allow much
opportunity for top-down processing (random
sequences of digits; relatively short; no coherence
between trials), hence involving mainly the bottom-
up theta-driven function.
Acknowledgements
We would like to thank Lori Holt and two anonymous reviewers
for helpful comments and suggestions on earlier versions of
this paper, Phillip Alday for advice on statistical analyses, and
Milou Huijsmans for her help in testing participants.
Disclosure statement
No potential conflict of interest was reported by the author(s).
Funding
The first author was supported by a Gravitation grant from the
Dutch Government to the Language in Interaction Consortium.
The second author was supported by a research grant from the
Air Force Office of Scientific Research.
References
Adank, P., & Janse, E. (2009). Perceptual learning of time-com-
pressed and natural fast speech. The Journal of the
Acoustical Society of America,126(5), 26492659.
Ahissar, E., Nagarajan, S., Ahissar, M., Protopapas, A., Mahncke,
H., & Merzenich, M. M. (2001). Speech comprehension is cor-
related with temporal response patterns recorded from audi-
tory cortex. Proceedings of the National Academy of Sciences,
98(23), 1336713372.
Barr, D. J., Levy, R., Scheepers, C., & Tily, H. J. (2013). Random
effects structure for confirmatory hypothesis testing: Keep
it maximal. Journal of Memory and Language,68(3), 255278.
Bates, D., Maechler, M., Bolker, B., & Walker, S. (2015). Fitting
linear mixed-effects models using lme4. Journal of
Statistical Software,67(1), 148. doi:10.18637/jss.v067.i01
Boersma, P., & Weenink, D. (2016). Praat: Doing phonetics by
computer [computer program].
Bosker, H. R. (2017a). Accounting for rate-dependent category
boundary shifts in speech perception. Attention, Perception &
Psychophysics,79(1), 333343. doi:10.3758/s13414-016-1206-4
Bosker, H. R. (2017b). How our own speech rate influences our
perception of others. Journal of Experimental Psychology:
Learning, Memory, and Cognition,43(8), 12251238. doi:10.
1037/xlm0000381
Bosker, H. R., & Cooke, M. (in press). Talkers produce more pro-
nounced amplitude modulations when speaking in noise.
Journal of the Acoustical Society of America.doi:10.1121/1.
5024404
Bosker, H. R., & Kösem, A. (2017). An entrained rhythms fre-
quency, not phase, influences temporal sampling of speech.
Proceedings of interspeech 2017, Stockholm. doi:10.21437/
Interspeech.2017-73
Bosker, H. R., & Reinisch, E. (2017). Foreign languages sound fast:
Evidence from implicit rate normalization. Frontiers in
Psychology,8, 1729. doi:10.3389/fpsyg.2017.01063
Bosker, H. R., Reinisch, E., & Sjerps, M. J. (2017). Cognitive load
makes speech sound fast but does not modulate acoustic
context effects. Journal of Memory and Language,94, 166
176. doi:10.1016/j.jml.2016.12.002
Dent, M. L., Brittan-Powell, E. F., Dooling, R. J., & Pierce, A. (1997).
Perception of synthetic /ba/-/wa/ speech continuum by bud-
gerigars (Melopsittacus undulatus). The Journal of the
Acoustical Society of America,102(3), 18911897.
Diehl, R. L., & Walsh, M. A. (1989). An auditory basis for the
stimulus-length effect in the perception of stops and
glides. The Journal of the Acoustical Society of America,85
(5), 21542164.
Dilley, L. C., & Pitt, M. A. (2010). Altering context speech rate can
cause words to appear or disappear. Psychological Science,21
(11), 16641670.
Ding, N., Patel, A., Chen, L., Butler, H., Luo, C., & Poeppel, D.
(2017). Temporal modulations in speech and music.
Neuroscience and Biobehavioral Reviews, Online version.
doi:10.1016/j.neubiorev.2017.02.011
Doelling, K. B., Arnal, L. H., Ghitza, O., & Poeppel, D. (2014).
Acoustic landmarks drive deltatheta oscillations to enable
speech comprehension by facilitating perceptual parsing.
NeuroImage,85, 761768.
Drullman, R., Festen, J. M., & Plomp, R. (1994a). Effect of redu-
cing slow temporal modulations on speech recognition.
The Journal of the Acoustical Society of America,95(5),
26702680.
LANGUAGE, COGNITION AND NEUROSCIENCE 965
Drullman, R., Festen, J. M., & Plomp, R. (1994b). Effect of temporal envelope smearing on speech reception. The Journal of the Acoustical Society of America, 95(2), 1053–1064.
Dupoux, E., & Green, K. (1997). Perceptual adjustment to highly compressed speech: Effects of talker and rate changes. Journal of Experimental Psychology: Human Perception and Performance, 23(3), 914–927.
Elliott, T. M., & Theunissen, F. E. (2009). The modulation transfer function for speech intelligibility. PLoS Computational Biology, 5(3), e1000302.
Escudero, P., Benders, T., & Lipski, S. C. (2009). Native, non-native and L2 perceptual cue weighting for Dutch vowels: The case of Dutch, German, and Spanish listeners. Journal of Phonetics, 37(4), 452–465.
Ghitza, O. (2011). Linking speech perception and neurophysiology: Speech decoding guided by cascaded oscillators locked to the input rhythm. Frontiers in Psychology, 2(130), 1–13.
Ghitza, O. (2012). On the role of theta-driven syllabic parsing in decoding speech: Intelligibility of speech with a manipulated modulation spectrum. Frontiers in Psychology, 3(238), 1–12.
Ghitza, O. (2014). Behavioral evidence for the role of cortical θ oscillations in determining auditory channel capacity for speech. Frontiers in Psychology, 5(751), 1–12.
Ghitza, O. (2017). Acoustic-driven delta rhythms as prosodic markers. Language, Cognition and Neuroscience, 32(5), 545–561. doi:10.1080/23273798.2016.1232419
Ghitza, O., & Greenberg, S. (2009). On the possible role of brain rhythms in speech perception: Intelligibility of time-compressed speech with periodic and aperiodic insertions of silence. Phonetica, 66(1–2), 113–126.
Giraud, A.-L., & Poeppel, D. (2012). Cortical oscillations and speech processing: Emerging computational principles and operations. Nature Neuroscience, 15(4), 511–517.
Gordon, P. C. (1988). Induction of rate-dependent processing by coarse-grained aspects of speech. Perception & Psychophysics, 43(2), 137–146.
Gross, J., Hoogenboom, N., Thut, G., Schyns, P., Panzeri, S., Belin, P., & Garrod, S. (2013). Speech rhythms and multiplexed oscillatory sensory coding in the human brain. PLoS Biology, 11(12), e1001752.
Hickok, G., Farahbod, H., & Saberi, K. (2015). The rhythm of perception: Entrainment to acoustic rhythms induces subsequent perceptual oscillation. Psychological Science, 26(7), 1006–1013. doi:10.1177/0956797615576533
Kidd, G. R. (1989). Articulatory-rate context effects in phoneme identification. Journal of Experimental Psychology: Human Perception and Performance, 15(4), 736–748.
Kösem, A., Bosker, H. R., Takashima, A., Jensen, O., Meyer, A., & Hagoort, P. (2017). Neural entrainment determines the words we hear. bioRxiv. doi:10.1101/175000
Krause, J. C., & Braida, L. D. (2004). Acoustic properties of naturally produced clear speech at normal speaking rates. The Journal of the Acoustical Society of America, 115(1), 362–378.
Lakatos, P., Musacchia, G., O'Connell, M. N., Falchier, A. Y., Javitt, D. C., & Schroeder, C. E. (2013). The spectrotemporal filter mechanism of auditory selective attention. Neuron, 77(4), 750–761.
Luo, H., & Poeppel, D. (2007). Phase patterns of neuronal responses reliably discriminate speech in human auditory cortex. Neuron, 54(6), 1001–1010.
Maslowski, M., Meyer, A. S., & Bosker, H. R. (in press). How the tracking of habitual rate influences speech perception. Journal of Experimental Psychology: Learning, Memory, and Cognition. doi:10.1037/xlm0000579
Mattys, S. L., Brooks, J., & Cooke, M. (2009). Recognizing speech under a processing load: Dissociating energetic from informational factors. Cognitive Psychology, 59(3), 203–243.
Miller, G. A., & Licklider, J. C. (1950). The intelligibility of interrupted speech. The Journal of the Acoustical Society of America, 22(2), 167–173.
Miller, J. L., & Liberman, A. M. (1979). Some effects of later-occurring information on the perception of stop consonant and semivowel. Perception & Psychophysics, 25(6), 457–465.
Newman, R. S., & Sawusch, J. R. (1996). Perceptual normalization for speaking rate: Effects of temporal distance. Perception & Psychophysics, 58(4), 540–560.
Newman, R. S., & Sawusch, J. R. (2009). Perceptual normalization for speaking rate III: Effects of the rate of one voice on perception of another. Journal of Phonetics, 37(1), 46–65.
Obleser, J., Herrmann, B., & Henry, M. J. (2012). Neural oscillations in speech: Don't be enslaved by the envelope. Frontiers in Human Neuroscience, 6(250), 1–4.
Oller, D. K., Eilers, R. E., Miskiel, E., Burns, R., & Urbano, R. (1991). The stop/glide boundary shift: Modelling perceptual data. Phonetica, 48(1), 32–56.
Peelle, J. E., & Davis, M. H. (2012). Neural oscillations carry speech rhythm through to comprehension. Frontiers in Psychology, 3. doi:10.3389/fpsyg.2012.00320
Peelle, J. E., Gross, J., & Davis, M. H. (2013). Phase-locked responses to speech in human auditory cortex are enhanced during comprehension. Cerebral Cortex, 23(6), 1378–1387.
Pefkou, M., Arnal, L. H., Fontolan, L., & Giraud, A.-L. (2017). θ-band and β-band neural activity reflects independent syllable tracking and comprehension of time-compressed speech. The Journal of Neuroscience, 37(33), 7930–7938.
Pickett, J. M., & Decker, L. R. (1960). Time factors in perception of a double consonant. Language and Speech, 3(1), 11–17.
Pitt, M. A., Szostak, C., & Dilley, L. (2016). Rate dependent speech processing can be speech-specific: Evidence from the perceptual disappearance of words under changes in context speech rate. Attention, Perception, & Psychophysics, 78(1), 334–345. doi:10.3758/s13414-015-0981-7
Poeppel, D. (2003). The analysis of speech in different temporal integration windows: Cerebral lateralization as "asymmetric sampling in time". Speech Communication, 41(1), 245–255.
Quené, H., & Van den Bergh, H. (2008). Examples of mixed-effects modeling with crossed random effects and with binomial data. Journal of Memory and Language, 59(4), 413–425.
R Development Core Team. (2012). R: A language and environment for statistical computing [Computer program].
Reinisch, E. (2016a). Natural fast speech is perceived as faster than linearly time-compressed speech. Attention, Perception, & Psychophysics, 78(4), 1203–1217. doi:10.3758/s13414-016-1067-x
Reinisch, E. (2016b). Speaker-specific processing and local context information: The case of speaking rate. Applied Psycholinguistics, 37, 1397–1415. doi:10.1017/S0142716415000612
Reinisch, E., & Sjerps, M. J. (2013). The uptake of spectral and temporal cues in vowel perception is rapidly influenced by context. Journal of Phonetics, 41(2), 101–116.
Riecke, L., Formisano, E., Sorger, B., Başkent, D., & Gaudrain, E. (2018). Neural entrainment to speech modulates speech intelligibility. Current Biology. doi:10.1016/j.cub.2017.11.033
Saberi, K., & Perrott, D. R. (1999). Cognitive restoration of reversed speech. Nature, 398(6730), 760.
Sawusch, J. R., & Newman, R. S. (2000). Perceptual normalization for speaking rate II: Effects of signal discontinuities. Perception & Psychophysics, 62(2), 285–300.
Shannon, R. V., Zeng, F.-G., Kamath, V., Wygonski, J., & Ekelid, M. (1995). Speech recognition with primarily temporal cues. Science, 270(5234), 303–304.
Spaak, E., de Lange, F. P., & Jensen, O. (2014). Local entrainment of alpha oscillations by visual stimuli causes cyclic modulation of perception. Journal of Neuroscience, 34(10), 3536–3544. doi:10.1523/JNEUROSCI.4385-13.2014
Toscano, J. C., & McMurray, B. (2015). The time-course of speaking rate compensation: Effects of sentential rate and vowel length on voicing judgments. Language, Cognition and Neuroscience, 30(5), 529–543.
Ueda, K., Nakajima, Y., Ellermeier, W., & Kattner, F. (2017). Intelligibility of locally time-reversed speech: A multilingual comparison. Scientific Reports, 7, 1–8.
Varnet, L., Ortiz-Barajas, M. C., Erra, R. G., Gervain, J., & Lorenzi, C. (2017). A cross-linguistic study of speech modulation spectra. The Journal of the Acoustical Society of America, 142(4), 1976–1989.
Wade, T., & Holt, L. L. (2005). Perceptual effects of preceding nonspeech rate on temporal properties of speech categories. Perception & Psychophysics, 67(6), 939–950.
Warren, R. M. (1970). Perceptual restoration of missing speech sounds. Science, 167(3917), 392–393.
Zoefel, B., Archer-Boyd, A., & Davis, M. H. (2018). Phase entrainment of brain oscillations causally modulates neural responses to intelligible speech. Current Biology. doi:10.1016/j.cub.2017.11.071