Highlights
- Is language processing compositional or preassembled?
- We reassessed prior evidence for multi-word storage with corpus data
- Prediction and retrodiction appear to be important influences on multi-word processing
- Multi-word units vs. compositional construction is a dual-route process
- Forward and backward predictability are both informative of lexical cohesiveness
Can prediction and retrodiction explain whether
frequent multi-word phrases are accessed ’precompiled’
from memory or compositionally constructed on the fly?
Luca Onnis (a), Falk Huettig (b,c)
(a) School of Social Sciences, University of Genoa, Genoa, Italy
(b) Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands
(c) Centre for Language Studies, Radboud University, Nijmegen, The Netherlands
Abstract
An important debate on the architecture of the language faculty has been the extent to which it relies on a compositional system that constructs larger units from morphemes to words to phrases to utterances on the fly and in real time using grammatical rules; or a system that chunks large preassembled, stored units of language from memory; or some combination of both approaches. Good empirical evidence exists for both 'computed' and 'large stored' forms in language, but little is known about what shapes multi-word storage/access or compositional processing. Here we explored whether predictive and retrodictive processes are a likely determinant of multi-word storage/processing. Our results suggest that forward and backward predictability are independently informative in determining the lexical cohesiveness of multi-word phrases. In addition, our results call for a reevaluation of the role of retrodiction in contemporary language processing accounts (cf. Ferreira and Chantavarin 2018).
Keywords: frequency effects, prediction, postdiction, retrodiction, stored sequences
1. Multi-word storage and compositionality
Are frequent and larger language units (e.g. it was really funny) constructed online using compositional rules, or can they be retrieved as 'preassembled' stored chunks from long-term memory? This question has received much attention recently because it has been thought to adjudicate between competing theoretical accounts of language processing. On one side
of the debate there are several influential theoretical frameworks of human
information processing that claim that linguistic structure is the consequence
of ‘emergent’ processes: 1) usage-based accounts of language processing (e.g.,
Goldberg 2006) according to which whole chunks are taken directly from the
input to be stored in the mind, and 2) exemplar models of stored knowledge
(e.g., Nosofsky 1988) that assume that we store examples in memory rather
than forming abstract generalisations, i.e. linguistic structures ‘emerge’ from
experienced patterns in the input. If frequent multi-word sequences were
represented and used routinely as chunks (rather than compositionally com-
puted online) then this would provide support for notions that argue that
language processing involves the processing of dynamic patterns at different
grain sizes (Elman, 2009) rather than stable lexical (word-like) units.
On the other side of the debate there are approaches that assume an
essential role for the computation of compositional multi-word phrases (e.g.,
Pinker and Ullman 2002). Compositional approaches do not deny that some
longer phrases can occasionally be stored, for example idioms (e.g. kick
the bucket) could be stored as a whole, but the debate is unresolved about
whether a very large number of frequent multi-word phrases (e.g. it was
really funny) are computed in real time from their component words, or are
instead stored and retrieved as a whole chunk.
More and more researchers (e.g., Snider and Arnon 2012, cf. Bod 2006)
have started to question a strict distinction between compositionally con-
structed vs. stored longer phrase units. Jackendoff (e.g., Jackendoff 2002)
for example argues in this regard that the ease or speed with which a rule may be activated relative to stored phrases plays a role in how 'freely productive' it is. Further work is needed to adjudicate among competing accounts
of multi-word processing. The present study aims to contribute to this en-
deavour.
2. Multi-word frequency effects
Frequency effects seem ubiquitous in language (Pfänder and Behrens, 2016): forms and structures that are highly frequent are acquired and processed faster than infrequent ones, both in comprehension and production. Crucially, such a processing advantage is often taken as a signature of the fact
that the language units are accessed as ’precompiled’ from memory, and not
computed on the fly. To the extent that frequency effects apply to the lexicon,
they would be consistent with a division of labour whereby compositional
mechanisms do the independent syntactic work of assembling morphemes
and words (Pinker, 2015; Ullman, 2016, 2004). However, the detection of
so-called frequency effects for larger units of language such as grammatical
phrases that include lexical and syntactic items has been proposed as evi-
dence that language is much less compositional. Consistent with such sug-
gestions, Bannard and Matthews (2008) proposed that children store more
than individual words in memory based on their results that young children
were significantly more likely to repeat frequent sequences such as a drink
of milk correctly than to repeat infrequent sequences such as a drink of tea.
Such a view is consistent with the notion that compositional constructions
only emerge gradually during child development (e.g., Tomasello 2000).
There are however similar data with adult participants. Arnon and Snider
(2010) for example found that adults responded faster to higher frequency
than lower frequency phrases in a phrase-recognition task. Adult speakers’
recognition times for we have to talk for instance were faster than for we have
to sit, with the latter having lower overall frequency as a four-word unit than
the former.
In the following section we first consider some possible conceptual objec-
tions to a theoretical distinction between ‘stored’ and ‘computed’ linguistic
forms. Then, in the next section we ask whether the documented phrase-
frequency effects for multi-word phrases may emerge from dynamic online
processes driven by context predictability rather than phrase frequency per
se. The subsequent corpus analyses indeed support the view that frequency
effects for multi-word sequences are effects of online prediction and retro-
diction in disguise. We find evidence that (forward and backward) transi-
tional probabilities at multiple levels (which may contribute to the overall
high frequency of the entire multi-word sequence) could support sequential,
compositional processing rather than chunk-based processing. In the Discus-
sion section, we then consider which cognitive and neural mechanisms could
give rise to predictability effects on multi-word sequences. This allows us to
reappraise the debate on ‘stored’ versus ‘computed’ forms by proposing an
alternative framework that can account for facilitative processing effects on
combinatoriality. Finally, we discuss some limitations of the present approach
in particular with regard to hierarchical syntactic compositional parsing ap-
proaches.
2.1. Some conceptual inadequacies concerning a strict dichotomy
First, considering conceptual inadequacies, we conjecture that a theory
of language processing that relied on a very large number of memorised pre-
existing chunks would face difficulties accounting for graded effects of lexical
access. Indeed, Arnon and Snider (2010) and Snider and Arnon (2012) showed that their documented frequency effects for four-word sequences occurred across the frequency range and were thus gradient.
Secondly, and consequently, if frequency effects are graded it is difficult
to establish an empirical threshold for which multi-word sequences should
be retrievable whole versus being compositionally computed online. Given
that frequency is a continuous variable in language, and the logarithm of fre-
quency is linearly related to reaction times in various psycholinguistic tasks,
a dichotomous categorization of lexical items in stored versus non-stored /
compositionally computed sequences is hard to achieve.
Third, the frequency distribution of linguistic items – including multi-
word sequences – while being continuous is highly non-linear and skewed
(Zipf, 1949). The vast majority of sequences (or n-grams in technical par-
lance) are positioned in the long tail of infrequent and rare events. This
would practically leave most of the language of interest outside the bene-
fits of mental storage, and would thus be of little theoretical relevance in
explaining how the entirety of language works. A theory of weak memory
storage for such a large number of sequences would have to account for what
else holds language together in processing such sequences besides a weak
frequency effect.
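To illustrate the skew concretely, here is a minimal sketch (our own, not part of the original studies; the toy corpus is invented) that tallies 4-gram types and reports the share occurring only once:

```python
from collections import Counter

# Invented toy corpus; any real corpus shows the long tail far more starkly.
words = ("you know what i mean and then we go out and then we go in "
         "and you know what i mean so it was really funny").split()

# Count every contiguous 4-gram (types vs. tokens).
four_grams = Counter(tuple(words[i:i + 4]) for i in range(len(words) - 3))

hapax = sum(1 for c in four_grams.values() if c == 1)
print(f"{len(four_grams)} 4-gram types; {hapax / len(four_grams):.0%} occur exactly once")
```

Even in this tiny sample most 4-gram types are singletons, which is precisely the long tail that a frequency-based storage account has to contend with.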
A fourth consideration is that while storage of single lexical items is large,
storage of unique 2-, 3-, 4-grams, and so on, is larger by several orders of magnitude, as evidenced by large-scale n-gram corpus analyses, including our own
below. And this state of affairs does not even consider non-adjacent n-grams such as in X opinion, where X can be replaced by a personal pronoun (in my/your/their opinion) or a noun in genitive form (in teachers' opinion). Most language in fact has been characterised in terms of partially matching sequences, which may have gaps or open slots (Kolodny et al., 2015).
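As a minimal illustration of such open-slot patterns (our own toy example, not from the cited work), a pattern like in X opinion can be matched with a regular expression that recovers the slot fillers:

```python
import re

text = "in my opinion it helps, but in their opinion and in the teachers' opinion it does not"

# X may be one or two word tokens, e.g. a pronoun or a determiner plus genitive noun.
slot_x = re.compile(r"\bin ((?:[\w']+ ){1,2}?)opinion\b")
print([filler.strip() for filler in slot_x.findall(text)])
# ['my', 'their', "the teachers'"]
```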
Relatedly, as a fifth consideration most frequent linguistic patterns are
composed of sequences of varying degree of compositionality and abstraction
(e.g., more than Y know*, where Y is an open slot that can be filled by various pronouns and nouns, and the verb stem know* agrees morphologically with Y and can take different tense forms).
As a sixth and final point, phrases can be part of linguistic patterns of
different sizes, just like syllables can be part of different words. For instance,
you know is one of the most frequent interjections in everyday oral communication, but so are the phrases you know what? and what do you know?.
Which of these phrases is a stored sequence in the mind? If the first one is,
then the latter larger phrases must allow a compositional process. If the lat-
ter two are stored sequences, then they must allow a decompositional process
to account for the first phrase.
A similar issue is that chunk-based processing could be seen as akin to
deferring recognition of a spoken word until all its phonemes have occurred.
Such a mechanism would arguably slow processing. Moreover, strong cues
to end-of-sequence may only occur in a few circumscribed contexts. This
raises the issue of how word recognition for word sequences would be de-
ferred. Indeed, unlike spoken words, where sublexical components arguably
remain highly ambiguous at least in some languages, ’sub-sequence’ units in
multi-word sequences are words, each linked to distinct semantic representa-
tions and form classes. In other words, it seems implausible that "it is time to..." in "it is time to talk" would be analogous to hearing "formul..." (all but the final phoneme of 'formula'), where there are arguably no discrete elements that require actual classification (rather than a distribution of activations/probabilities over possible phonemes or syllables at each position).
Clearly, compositionality cannot be disposed of easily even in the case of
frequent multi-word sequences. What could plausibly reconcile the ubiqui-
tous frequency effects for multi-word phrase processing found in the litera-
ture while allowing for an essentially compositional system? And, could this
change the debate over stored versus computed language? In the next sec-
tion we propose that prediction and retrodiction processes (cf. Ferreira and
Chantavarin 2018; Ferreira and Qiu 2021), here formalized as sensitivity to
contextual forward and backward probabilities between words, can account
for facilitative effects in language processing for multi-word expressions of
the kind empirically found in the literature.
3. A role for prediction and retrodiction in multi-word processing
The evidence and theoretical arguments considered above leave open the
crucial question of what determines whether frequent multi-word phrases
become stored in (and accessed online from) memory or are composition-
ally constructed on the fly. In essence we are exploring what determines
whether phrases of various sizes are ’lexically listed’. More specifically we
tested whether dynamic probabilistic online processes are informative in an-
swering this question and investigated whether forward and backward tran-
sitional probabilities can provide important insights about lexical cohesive-
ness, which in turn can affect the online processing of multi-word phrases.
In reanalysing existing four-word phrases from two published studies, we
conducted new corpus analyses (see Method and Results sections) on both
the Arnon and Snider (2010) and corresponding developmental Bannard
and Matthews (2008) studies and found that the last words in the frequent
phrases used in the above studies are also more predictable, both in terms
of forward and backward predictability. This, we contend, suggests that
predictive and ’postdictive’ (or retrodictive) processes may be an important
factor determining multi-word storage and processing. Our analyses cannot
directly reveal whether participants retrieved multi-word phrases from mem-
ory or constructed them online compositionally but they are compatible with
the notion that the processing advantage found in the two ’stored sequences
studies’ may be a consequence of a) pre-activation of the last words in the
multi-word sequences (consistent with forward predictability), and/or b) ease
of integration of the last word (consistent with backward predictability).
4. Method
4.0.1. Dataset
The dataset under scrutiny contained all 122 experimental stimuli used by Bannard and Matthews (2008) (n = 32) and Arnon and Snider (2010) (n = 90). While the two subsets came from separate studies, they were constructed with the same criteria and design in mind, and are thus groupable into a single dataset here. The stimuli were pairs of four-word phrases that differed in the final word. In each pair, the phrases differed in phrase frequency (high vs. low) but were matched for substring frequency (word, bigram, and trigram): the phrases did not differ in the frequency of the final word, bigram, or trigram.
For the Bannard set, the high-frequency repeated 4-word sequences (e.g.,
when we go out) were selected from a naturalistic corpus of about 1.72 million
words of maternal child-directed speech. The Arnon set was selected from a 20-million-word corpus of American English collected from telephone conversations (the Switchboard and Fisher corpora).
The other half of the dataset consisted of low-frequency sequences that the authors matched to the high-frequency ones up to the last word (e.g., when we go in), yielding 61 minimal lexical pairs. Each 4-word sequence had been labelled 'frequent' or 'infrequent' according to the authors' analyses of corpus frequency, and we used this information as the dependent variable in our analyses. For
the Bannard set, the final words of matched sequences were controlled for (a)
the frequency of the final word (e.g., juice and noise were roughly equally
frequent), (b) the frequency of the final bigram (e.g., of juice and of noise
were roughly equally frequent), and (c) the length of the final word in syl-
lables. The Arnon set also controlled for trigram frequency. Six additional
sequences from the Bannard dataset were labelled ’intermediate frequency’
and were not considered in our analysis, because of their insufficient number
to form a third category on their own.
4.0.2. Corpus
To calculate new lexical statistics over the existing dataset, we used two
corpora. To model child language sequences in the Bannard set, we down-
loaded all 1-, 3-, and 4-grams of child-directed speech from an online repository of CHILDES corpora, available at http://www.lucid.ac.uk/resources/for-researchers/toolkit/ as part of the Language Researchers' Toolkit project
(Chang, 2017). This corpus contains 40,507 1-gram types (9,222,801 to-
kens), 1,725,122 3-gram types (5,331,077 tokens), and 2,467,181 4-gram types
(4,062,022 tokens).
To model adult sequences in the Arnon set, we obtained 1-, 3-, and 4-grams based on the Corpus of Contemporary American English (COCA), one of the largest publicly available, genre-balanced corpora of English. The data at the time of compilation contained approximately 430 million word tokens.
4.0.3. Measures
From the corpora we obtained three lexical statistics of cohesion for each sequence in the dataset: 1) the frequency of each sequence on a logarithmic scale; 2) the forward and 3) the backward surprisal of the last word in each sequence. In psycholinguistics, a hypothesis has gained ground that processing
difficulty is proportional to the amount of information conveyed. Surprisal S
is an information-theoretic measure that estimates how unexpected a given
event is. Conceptually, improbable, i.e. ‘surprising’ events carry more infor-
mation than expected ones, so that surprisal is inversely related to probabil-
ity, through a logarithmic function. In the context of language processing,
if $w_1, \ldots, w_{t-1}$ denotes a multi-word sequence, then the cognitive effort required for processing the next word, $w_t$, is assumed to be proportional to its surprisal (Hale, 2006):

$$\mathrm{effort}(t) \propto \mathrm{surprisal}(w_t) = -\log P(w_t \mid w_1, \ldots, w_{t-1}) \qquad (1)$$

where $P(w_t \mid w_1, \ldots, w_{t-1})$ is the forward probability of $w_t$ given the sentence's previous words $w_1, \ldots, w_{t-1}$.
For example, the surprisal of one of the sequences in our dataset, when we go out, is simply the sum of the individual items' surprisals:

$$S(\text{when we go out}) = S(\text{when}) + S(\text{we}) + S(\text{go}) + S(\text{out})$$
$$= -\log P(\text{when} \mid \langle \text{sos} \rangle) - \log P(\text{we} \mid \langle \text{sos} \rangle, \text{when}) - \log P(\text{go} \mid \langle \text{sos} \rangle, \text{when}, \text{we}) - \log P(\text{out} \mid \langle \text{sos} \rangle, \text{when}, \text{we}, \text{go}) \qquad (2)$$
where $\langle \text{sos} \rangle$ denotes a start-of-sentence symbol. The summation is relevant psychologically because surprisal is linearly related to reading times, and the reading time of a sequence of words equals the sum of the reading times of its parts. Hence, the surprisal of a multi-word sequence must equal the sum of the surprisals of its parts. In our case, because the high-frequency and low-frequency sequences differed only in the last word, it was sufficient to measure surprisal at the last word, e.g. comparing
$$-\log P(\text{out} \mid \text{when, we, go}) \qquad (3)$$

and

$$-\log P(\text{in} \mid \text{when, we, go}) \qquad (4)$$
The measure above is forward surprisal, i.e. a function of the probability of a word given its previous context. Backward surprisal can also be
calculated, based on the backward transitional probability, namely the like-
lihood of a context preceding a word. It denotes the frequency of the 4-gram
sequence relative to all instances of the final word in the sequence. Again
using the example above, the relevant comparison of backward surprisal was:
$$-\log P(\text{when, we, go} \mid \text{out}) \qquad (5)$$

and

$$-\log P(\text{when, we, go} \mid \text{in}) \qquad (6)$$
Forward and backward probabilities were calculated using the corpus n-
grams described above.
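The sketch below shows the form these computations take; it is our own reconstruction rather than the original analysis code, and the count tables are hypothetical stand-ins for the CHILDES and COCA n-gram lists (surprisal here in bits):

```python
import math

# Hypothetical counts standing in for the corpus-derived 1-, 3-, and 4-gram tables.
count_4gram = {("when", "we", "go", "out"): 42, ("when", "we", "go", "in"): 3}
count_3gram = {("when", "we", "go"): 110}
count_1gram = {"out": 48_000, "in": 310_000}

def forward_surprisal(w1, w2, w3, w4):
    # -log P(w4 | w1 w2 w3): 4-gram count normalised by the 3-gram context count.
    return -math.log2(count_4gram[(w1, w2, w3, w4)] / count_3gram[(w1, w2, w3)])

def backward_surprisal(w1, w2, w3, w4):
    # -log P(w1 w2 w3 | w4): 4-gram count normalised by all instances of the final word.
    return -math.log2(count_4gram[(w1, w2, w3, w4)] / count_1gram[w4])

for last in ("out", "in"):
    print(last,
          round(forward_surprisal("when", "we", "go", last), 2),
          round(backward_surprisal("when", "we", "go", last), 2))
```

With these invented counts, when we go out receives lower forward and backward surprisal than when we go in, mirroring the frequent/infrequent contrast in the stimuli.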
5. Results
5.0.1. Baseline model
Of the total 122 4-word sequences under scrutiny, 3 from the Arnon and
4 from the Bannard sets were excluded because 4-gram frequencies could
not be calculated from the corpora. To first establish that our analyses
with our corpora were comparable to original analyses, we assessed whether
frequency of 4-gram sequences was a predictor of category assignment. A
baseline logistic regression model included (log)Frequency and Study (Arnon vs. Bannard) as predictors of the category (low-frequency vs. high-frequency sequences, as defined by Bannard and Arnon) of the experimental items (4-gram sequences). In line with the two previous studies, we found that Frequency was a significant predictor for both datasets (β = 0.20, CI = [0.06, 0.34]; see Table 1 and Figure 1).
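A model of this form can be fit as follows; this is a sketch under assumed column names (a data frame items with category, log_freq, and study), not the authors' analysis script:

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical file: one row per 4-gram sequence;
# category is coded 1 = high frequency, 0 = low frequency.
items = pd.read_csv("items.csv")

baseline = smf.logit("category ~ log_freq + study", data=items).fit()
print(baseline.summary())  # coefficients and confidence intervals as in Table 1
```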
5.0.2. Additive model
To assess whether the predictability of the last word of each sequence was informative in distinguishing sequence category, we ran a separate logistic regression adding Forward and Backward surprisal to (log)Frequency and Study. In this model, Backward surprisal (β = −0.40, CI = [−0.61, −0.19]) and Forward surprisal (β = −0.52, CI = [−0.76, −0.27]), but not Frequency or Study, were significant predictors of stimulus category (see Figure 1). The three predictor variables were only weakly to moderately correlated (Forward surprisal and Frequency, r = −0.34; Backward surprisal and Frequency, r = −0.42; Forward surprisal and Backward surprisal, r = 0.18; see Table 2), justifying their inclusion as separate predictors. Furthermore, a test of multicollinearity was negative (the square root of the Variance Inflation Factor was less than two). Finally, when directly comparing the two regression models, the Additive model reduced the deviance by 265.15 − 230.50 = 34.64, which was highly significant (p < .001). Thus, based on these analyses, the two categories of stimuli from the Bannard and Arnon datasets were distinguishable by the predictability of the last word more than by the frequency of the stimuli. Surprisal estimates based on both forward and backward conditional probabilities were predictive of stimulus category, with more surprising sequence endings being categorised as 'low-frequency' items by the logistic regression. These results dovetail with the literature on reading and sentence processing showing that words in more predictable contexts are read more quickly (e.g., Hale, 2006; Frank and Bod, 2011), and suggest that corpus-derived conditional probabilities are a significant predictor of single-word as well as multi-word processing, over and above base frequencies as a covariate.
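For concreteness, here is a sketch of the additive model together with the two checks reported above (the VIF-based multicollinearity test and the deviance comparison between the nested models); the variable names are assumptions, continuing the hypothetical items table from the baseline sketch:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from scipy import stats
from statsmodels.stats.outliers_influence import variance_inflation_factor

items = pd.read_csv("items.csv")  # hypothetical columns as before, plus fwd/bwd surprisal

baseline = smf.logit("category ~ log_freq + study", data=items).fit()
additive = smf.logit(
    "category ~ log_freq + study + fwd_surprisal + bwd_surprisal", data=items).fit()

# Multicollinearity check: the square root of each VIF should be below two.
X = items[["log_freq", "fwd_surprisal", "bwd_surprisal"]].assign(const=1.0).values
for i, name in enumerate(["log_freq", "fwd_surprisal", "bwd_surprisal"]):
    print(name, np.sqrt(variance_inflation_factor(X, i)))

# Likelihood-ratio test: the deviance drop is chi-squared distributed,
# with df = the number of added predictors (here 2).
lr = 2 * (additive.llf - baseline.llf)
print(lr, stats.chi2.sf(lr, df=2))
```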
Table 1: Summary of the logistic regression analyses for variables predicting 4-word sequence category

                        Dependent variable: Sequence category
                        Baseline Model               Additive Model
(log) Frequency         0.200*** (0.063, 0.337)      −0.023 (−0.189, 0.143)
Study                   0.329 (−0.372, 1.031)        0.598 (−0.222, 1.418)
Backward surprisal                                   −0.402*** (−0.613, −0.191)
Forward surprisal                                    −0.516*** (−0.764, −0.268)
Constant                −0.569* (−1.167, 0.029)      5.122*** (2.811, 7.434)
Observations            198                          198
Log Likelihood          −132.573                     −115.252
Akaike Inf. Crit.       271.146                      240.505

Note: *p < 0.1; **p < 0.05; ***p < 0.01
6. Discussion
We conceptually and statistically re-evaluated two widely cited empirical studies that manipulated four-word phrases into frequent and infrequent
Table 2: Correlation matrix for variables predicting 4-word sequence category

                        (log) Frequency    Forward surprisal    Backward surprisal
(log) Frequency         1
Forward surprisal       −0.344             1                    0.176
Backward surprisal      −0.416             0.176                1
[Figure 1: Marginal effects in the Baseline and Additive logistic regressions. Panels show the predicted sequence category (low to high) as a function of (log) Frequency (Baseline model), Forward surprisal (Additive model), and Backward surprisal (Additive model).]
categories, and found facilitative processing effects for the frequent phrases.
Following these studies, frequency effects for multi-word expressions have
been taken as evidence that a larger amount of language than previously
acknowledged may be pre-compiled and stored in the mental lexicon rather
than being processed on the fly by a real-time processor. In new corpus
analyses, we found that the last words in the frequent phrases used in the
above studies are also more predictable than in the infrequent phrases, both
in terms of forward and backward predictability. This suggests an alterna-
tive interpretation of the original studies, namely that multi-word storage
effects are prediction and retrodiction effects in disguise. We now discuss the
implications of the present results.
6.1. Forward and backward looking
First, our results fit well with recent accounts that highlight an
important role for proactive prediction and integrative ‘retrodiction’ in lan-
guage processing and learning (cf. Ferreira and Chantavarin 2018; Ferreira
and Qiu 2021; Huettig and Guerra 2019; Huettig and Mani 2016). A large
body of psycholinguistic evidence suggests that language users frequently
predict upcoming words (e.g., Huettig 2015; Pickering and Gambi 2018, for
review). One type of evidence consistent with such views are findings that
word-to-word statistical information can constrain interpretation in the for-
ward direction, so information from one word yields predictions about prop-
erties of upcoming words. Crucially, in the present study we also found
evidence for the importance of probabilistic processing in the backward di-
rection. Accordingly, our results point to a reevaluation of the role of what
might be called ’probabilistic retrodiction’ in language, which is understud-
ied (or at least currently underappreciated, cf. Ferreira and Chantavarin
2018; Ferreira and Qiu 2021) in the psycholinguistics literature in favour of
forward predictive models. In addition, our results suggest that forward and
backward predictability are independently informative (and perhaps equally
so, as the standardised beta values are of similar magnitude and influence
the dependent variable in the same direction) in determining the storage,
access, and processing of multi-word phrases. These findings also dovetail
with recent evidence that probabilistic integration in the backward direction
explains variance in processing modifier–noun collocation combinations like
vast majority (McConnell and Blumenthal-Dramé, 2019), as well as reading
times of naturally occurring sentences read silently (Onnis et al., 2021), and
aloud – see Moers et al. (2017), although in the latter study the contributions
of forward and backward probabilities were combined in a single predictor,
and could not be disentangled.
6.2. What is retrodiction?
The question of how the past, which has already been observed, can be
a random variable that comprehenders model probabilistically, may raise
thorny questions of interpretability to some. Many current theoretical treat-
ments conceive of predictive processing as involving an explicit representation
of likely future input that is ’compared’ to the actual input to compute an
error signal. Given such accounts, a model that predicts the past may per-
haps not be considered a reasonable account of probabilistic retrodiction. If
we acknowledge that probability theory is just one of several valid levels of
describing processing and change the level of description, then the interpreta-
tion of the present results is simple. One psychological candidate mechanism
is integration, whereby the processing system does not always pre-activate,
or predict, upcoming input but integrates it faster if the preceding context
is a good fit, or to put it probabilistically, is more likely to precede it. This
fits with experimental evidence that suggests that language input is often
fast and sub-optimal and may in a fairly large number of situations ‘afford’
rather limited forward looking (cf. Huettig and Mani 2016).
6.3. Multi-word processing
How then do forward and backward looking processes affect the processing
of multi-word phrases? On the level of the brain, one possibility is that single
words are encoded as populations of neurons that can have different levels
of activation. Such activation is likely highest when the neurons respond to
a perceptual event (such as reading or hearing the word percept itself), or
they might encode a perceptual simulation of that event, via spreading of
activation with related words. If forward and backward conditional proba-
bilities reflect the degree of potential spreading of activation between words,
it is possible to envisage how words in an expression pass recurrent activation
back and forth among each other, thus reinforcing each other with different
degrees of activation. Higher neuronal activations can lead to faster recogni-
tion and thus faster reading or naming times at the behavioural level. Now
to understand how a phrase such as a drink of milk can be read, named or
repeated faster than a drink of tea, imagine a population of interconnected
neurons that functions as a distributed and dynamic (over time) represen-
tation for a drink of .... At time step 1 the population code can spread
activation to various words that might continue the sequence, and quicker
activations are expected for words that have a higher forward probability
(milk versus tea, alcohol, water, soda, etc.). At time step 2, milk or tea is read or heard and thus its percept sends bottom-up activation that adds to the pre-activations that were spread at time step 1. Because the forward
probability of milk is higher than tea, neuronal preactivation was higher for
milk and the word can be recognised faster than tea.
This can be taken as the neural instantiation of the effect of forward
probability on reading the last word on the 4-word phrases contemplated
in this study, and is consistent with recent accounts that explain prediction
in terms of neural pattern completion (Falandays et al., 2021). But how
would backward probability influence processing times? Because the back-
ward probability of a drink of ... is higher given milk than given tea, the
perceptual activation of milk can send stronger feedback signals back and
forth to a drink of ..., which reinforce each other, ultimately producing
higher neuronal activation patterns for the sequence a drink of milk than
for the sequence a drink of tea. We point out here that behaviourally such
a neuronal state of affairs would translate into the stored sequences effects
found in the literature, but crucially without the need for the sequence to be
’unanalyzed’ and stored as a single mental representation. This is because
the underlying neuronal structure of the lexicon can still be instantiated as
a network of more or less loosely connected population codes for word rep-
resentations that spread activation to each other in a web-like fashion. The
strength of activation that flows back and forth from these words determines
how fast these words are processed as a sequence, and is proportional to
word-to-word probabilistic properties such as forward and backward prob-
abilities, frequencies, and numerous potential other factors not considered
here, such as semantic relations, phonological similarity, and grammatical
dependencies (cf. Ferreira and Qiu 2021).
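To make the verbal account concrete, here is a deliberately toy numerical sketch (ours; the probabilities and the gain parameter are invented, and this is the shape of the argument rather than a neural model) in which forward probability sets pre-activation and backward probability scales the feedback a recognised word sends to its context:

```python
# Toy illustration of the spreading-activation account; all numbers invented.
def final_word_activation(fwd_p, bwd_p, bottom_up=1.0, feedback_gain=0.5):
    pre_activation = fwd_p                        # context pre-activates likely continuations
    perceived = bottom_up + pre_activation        # the percept adds to the pre-activation
    feedback = feedback_gain * bwd_p * perceived  # the word reinforces a well-fitting context
    return perceived + feedback

print("milk:", final_word_activation(fwd_p=0.30, bwd_p=0.20))  # higher activation
print("tea: ", final_word_activation(fwd_p=0.05, bwd_p=0.02))  # lower activation
```

On this sketch a drink of milk ends up with higher total activation, and hence faster behavioural responses, than a drink of tea, without the sequence ever being stored as an unanalysed whole.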
We stress here however that our results should not be taken as ruling out
that some multi-word phrases can be stored and retrieved as a whole. We do
interpret our findings however as suggesting that there is most likely a strong
limit to what kind of sequences end up stored as multi-word sequences and
will be retrievable whole versus being compositionally constructed online.
We believe that the present results are most compatible with some form
of a dual-route process, in which compositional construction of multi-word
sequences is akin to a default process but leaving open the possibility for
storage and retrieval of multi-word units.
6.4. Future work and conclusion
Further research is required to explore the circumstances that increase
the likelihood of storage and (preferential) access of multi-word sequences.
Similarly, another important task for future research will be to investigate the
exact mechanisms of how predictive and retrodictive processes determine the
extent to which frequent multi-word phrases are compositionally constructed
on the fly. For example, it may be possible to assess the independent con-
tribution of forward and backward surprisal on different real-time processing
tasks, such as self-paced reading, phrase repetition, phrase recognition, and
phrase naming tasks, by selectively manipulating the informativeness of each
cue (high or low), while maintaining constant the sequence overall frequency.
It is possible to select from a large database of language such as Google
Books multi-word sequences that are matched in forward surprisal but differ
in backward surprisal, and vice versa. Based on our regression analyses, we
predict facilitatory effects of processing (faster reading times, more accurate
repetitions, and faster recognition) for both types of stimuli.
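A sketch of how such stimuli might be selected, under assumptions (a hypothetical table ngram_stats.csv of sequences with their two surprisal values and log frequency; the tolerance and difference thresholds are arbitrary):

```python
import pandas as pd

ngrams = pd.read_csv("ngram_stats.csv")  # hypothetical: sequence, fwd_surprisal, bwd_surprisal, log_freq

def matched_pairs(df, match_col, differ_col, tol=0.1, min_diff=2.0):
    """Pairs matched on frequency and one surprisal measure, differing on the other."""
    rows = df.to_dict("records")
    pairs = []
    for i, a in enumerate(rows):
        for b in rows[i + 1:]:
            if (abs(a["log_freq"] - b["log_freq"]) < tol
                    and abs(a[match_col] - b[match_col]) < tol
                    and abs(a[differ_col] - b[differ_col]) > min_diff):
                pairs.append((a["sequence"], b["sequence"]))
    return pairs

# Matched on forward surprisal but differing in backward surprisal, and vice versa.
print(matched_pairs(ngrams, "fwd_surprisal", "bwd_surprisal")[:5])
print(matched_pairs(ngrams, "bwd_surprisal", "fwd_surprisal")[:5])
```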
Electrophysiological studies may also turn out to be a fruitful avenue for
further work. For example, when considering neural activity, the N400 ERP
component has been studied extensively and taken as a measure of expec-
tation violation, including probabilistic expectations that are measurable in
terms of conditional probabilities between elements. Because the N400 is sen-
sitive to different degrees of probabilistic violations, it is a candidate neural
signature for both forward and backward probabilistic processing. Thus, one
would predict that a stronger N400 ERP component is correlated with higher
levels of multi-word surprisals in both the forward and backward direction,
lending support for a common neural mechanism.
Another direction for future work could be to explore the effect of stored
multi-word sequences on (word) cohort processing in speech processing (cf.
Allopenna et al. 1998). If a multi-word sequence is processed as a chunk,
reduced cohort competition should be observed for words in the sequence
other than the first word (similar to reduced activation of 'bone' in 'trombone' or 'ate' in 'agitate' in spoken word recognition).
Finally, it is important to mention that the focus of the present study has
been on whether people learn and process multi-word phrases as lexical units
rather than as sequential combinations of individual words. In this type of
research, the items under scrutiny are typically fragments of sentences that
occur within phrases and are all syntactically cohesive, such as when we go
out, a lot of noise, I have to pay, etc. Perhaps for this reason, such work
has mostly ignored any hierarchical syntactic analysis of multi-word units.
Further work thus could also usefully ’scale up’ to make more contact with
contemporary hierarchical syntactic compositional parsing approaches (cf.
Ferreira and Qiu 2021).
6.5. Acknowledgments
We would like to thank Jim Magnuson and an anonymous reviewer for their
useful comments on a previous version of this paper.
References
Allopenna, P.D., Magnuson, J.S., Tanenhaus, M.K., 1998. Tracking the
time course of spoken word recognition using eye movements: Evidence
for continuous mapping models. Journal of memory and language 38, 419–
439.
Arnon, I., Snider, N., 2010. More than words: Frequency effects for multi-
word phrases. Journal of memory and language 62, 67–82.
Bannard, C., Matthews, D., 2008. Stored word sequences in language learn-
ing: The effect of familiarity on children’s repetition of four-word combi-
nations. Psychological science 19, 241–248.
Bod, R., 2006. Exemplar-based syntax: How to get productivity from exam-
ples. The linguistic review 23, 291–320.
Chang, F., 2017. The LuCiD language researcher's toolkit [Computer software].
Elman, J.L., 2009. On the meaning of words and dinosaur bones: Lexical
knowledge without a lexicon. Cognitive science 33, 547–582.
Falandays, J.B., Nguyen, B., Spivey, M.J., 2021. Is prediction nothing more
than multi-scale pattern completion of the future? Brain Research ,
147578.
Ferreira, F., Chantavarin, S., 2018. Integration and prediction in language
processing: A synthesis of old and new. Current directions in psychological
science 27, 443–448.
Ferreira, F., Qiu, Z., 2021. Predicting syntactic structure. Brain Research, in press.
Frank, S.L., Bod, R., 2011. Insensitivity of the human sentence-processing
system to hierarchical structure. Psychological science 22, 829–834.
Goldberg, A.E., 2006. Constructions at work: The nature of generalization
in language. Oxford University Press on Demand.
Hale, J., 2006. Uncertainty about the rest of the sentence. Cognitive science
30, 643–672.
Huettig, F., 2015. Four central questions about prediction in language pro-
cessing. Brain research 1626, 118–135.
Huettig, F., Guerra, E., 2019. Effects of speech rate, preview time of visual
context, and participant instructions reveal strong limits on prediction in
language processing. Brain Research 1706, 196–208.
Huettig, F., Mani, N., 2016. Is prediction necessary to understand language?
Probably not. Language, Cognition and Neuroscience 31, 19–31.
Jackendoff, R., 2002. Foundations of language: Brain, meaning, grammar,
evolution. Oxford University Press, USA.
Kolodny, O., Lotem, A., Edelman, S., 2015. Learning a generative probabilis-
tic grammar of experience: A process-level model of language acquisition.
Cognitive Science 39, 227–267.
McConnell, K., Blumenthal-Dramé, A., 2019. Effects of task and corpus-
derived association scores on the online processing of collocations. Corpus
Linguistics and Linguistic Theory .
Moers, C., Meyer, A., Janse, E., 2017. Effects of word frequency and transi-
tional probability on word reading durations of younger and older speakers.
Language and Speech 60, 289–317.
Nosofsky, R.M., 1988. Exemplar-based accounts of relations between classi-
fication, recognition, and typicality. Journal of Experimental Psychology:
learning, memory, and cognition 14, 700.
Onnis, L., Lim, A., Cheung, S., Huettig, F., 2021. What does it mean to say the mind is inherently forward looking? Exploring 'probabilistic retrodiction' in language processing. Manuscript under revision.
Pfänder, S., Behrens, H., 2016. Experience counts: An introduction to fre-
quency effects in language, in: Experience counts: Frequency effects in
language. De Gruyter, pp. 1–20.
Pickering, M.J., Gambi, C., 2018. Predicting while comprehending language:
A theory and review. Psychological bulletin 144, 1002.
Pinker, S., 2015. Words and Rules: The Ingredients Of Language. Hachette
UK.
Pinker, S., Ullman, M.T., 2002. The past and future of the past tense. Trends
in cognitive sciences 6, 456–463.
Snider, N., Arnon, I., 2012. A unified lexicon and grammar? Compositional
and non-compositional phrases in the lexicon, in: Frequency effects in
language representation. Mouton de Gruyter Berlin, pp. 127–163.
Tomasello, M., 2000. The item-based nature of children’s early syntactic
development. Trends in cognitive sciences 4, 156–163.
Ullman, M.T., 2004. Contributions of memory circuits to language: The
declarative/procedural model. Cognition 92, 231–270.
Ullman, M.T., 2016. The declarative/procedural model: A neurobiological
model of language learning, knowledge, and use, in: Neurobiology of lan-
guage. Elsevier, pp. 953–968.
Zipf, G.K., 1949. Human behavior and the principle of least effort. Addison-
Wesley press.