ArticlePDF Available

Statistical Speech Segmentation and Word Learning in Parallel: Scaffolding from Child-Directed Speech

Authors:

Abstract and Figures

In order to acquire their native languages, children must learn richly structured systems with regularities at multiple levels. While structure at different levels could be learned serially, e.g., speech segmentation coming before word-object mapping, redundancies across levels make parallel learning more efficient. For instance, a series of syllables is likely to be a word not only because of high transitional probabilities, but also because of a consistently co-occurring object. But additional statistics require additional processing, and thus might not be useful to cognitively constrained learners. We show that the structure of child-directed speech makes simultaneous speech segmentation and word learning tractable for human learners. First, a corpus of child-directed speech was recorded from parents and children engaged in a naturalistic free-play task. Analyses revealed two consistent regularities in the sentence structure of naming events. These regularities were subsequently encoded in an artificial language to which adult participants were exposed in the context of simultaneous statistical speech segmentation and word learning. Either regularity was independently sufficient to support successful learning, but no learning occurred in the absence of both regularities. Thus, the structure of child-directed speech plays an important role in scaffolding speech segmentation and word learning in parallel.
Content may be subject to copyright.
ORIGINAL RESEARCH ARTICLE
published: 01 October 2012
doi: 10.3389/fpsyg.2012.00374
Statistical speech segmentation and word learning in
parallel: scaffolding from child-directed speech
Daniel Yurovsky 1*, ChenYu2and Linda B. Smith2
1Department of Psychology, Stanford University, Stanford, CA, USA
2Department of Psychological and Brain Sciences and Program in Cognitive Science, Indiana University, Bloomington, IN, USA
Edited by:
Claudia Männel, Max-Planck-Institute
for Human Cognitive and Brain
Sciences, Germany
Reviewed by:
Toni Cunillera, University of Barcelona,
Spain
Elika Bergelson, University of
Pennsylvania, USA
*Correspondence:
Daniel Yurovsky, Department of
Psychology, Jordan Hall, Building
01-420, Stanford University, 450 Serra
Mall, Stanford, CA 94305, USA.
e-mail: yurovsky@stanford.edu
In order to acquire their native languages, children must learn richly structured systems
with regularities at multiple levels. While structure at different levels could be learned seri-
ally, e.g., speech segmentation coming before word-object mapping, redundancies across
levels make parallel learning more efficient. For instance, a series of syllables is likely to
be a word not only because of high transitional probabilities, but also because of a consis-
tently co-occurring object. But additional statistics require additional processing, and thus
might not be useful to cognitively constrained learners.We show that the structure of child-
directed speech makes simultaneous speech segmentation and word learning tractable for
human learners. First, a corpus of child-directed speech was recorded from parents and
children engaged in a naturalistic free-play task. Analyses revealed two consistent regu-
larities in the sentence structure of naming events. These regularities were subsequently
encoded in an artificial language to which adult participants were exposed in the context
of simultaneous statistical speech segmentation and word learning. Either regularity was
independently sufficient to support successful learning, but no learning occurred in the
absence of both regularities.Thus, the structure of child-directed speech plays an important
role in scaffolding speech segmentation and word learning in parallel.
Keywords: statistical learning, speech segmentation, word learning, child-directed speech, frequent frames
INTRODUCTION
Human language is richly structured, with important regulari-
ties to be learned at multiple levels (Kuhl, 2004). For instance, the
human vocal apparatus can produce a staggering variety of sounds
distinguishable from each other by prelinguistic infants (Eimas
et al., 1971). However,only a tiny fraction of these become mean-
ingful units phonemes within a particular language. Similarly,
these phonemes can be strung together into an infinite number of
sequences, but only a tiny fraction of these are words. Thus,infants
must also solve the problem of parsing a continuous sequence of
phonemes into word units. Further, some of these words refer to
objects in the visual world, and so, for these segmented words,
infants must solve the word-world mapping problem. In addition,
speakers may refer to the same object with different words in differ-
ent contexts, and different word orderings and stress patterns can
radically alter an utterance’s meanings, so children must organize
sounds, segments, and meanings at the levels pragmatics, syntax,
and prosody as well.
An emerging theoretical consensus is that many or even all
of these problems may be solved through a process of statisti-
cal learning tracking predictive relationships between elemental
units (although, cf. Marcus, 2000;Waxman and Gelman, 2009).
In order to determine their native language phonemes, infants
may track the distribution of tonal and formant frequencies in
their input (Maye et al., 2002;Pierrehumbert, 2003). Similarly,
infants may learn word boundaries by tracking sequential sylla-
ble statistics (Saffran et al., 1996), learn word-world mappings by
tracking word-object occurrence statistics (Smith and Yu, 2008;
Vouloumanos and Werker, 2009), and learn grammar by tracking
sequential and non-adjacent dependencies between word types
(Gómez and Gerken, 2000;Saffran et al., 2008). Because statis-
tical learning at each level assumes the availability of primitives
at the level below and shows how to arrive at primitives for the
level above, a complete statistical account of language learning
must bridge these levels. Therefore, a critical question for statisti-
cal theories of language acquisition is how learners connect these
primitives.
One possibility is that the infants learn each level sequentially,
proceeding from the bottom up. Learning at each level would build
the units over which the next level operates, and thus higher lev-
els would have to wait until (at least some of) the primitives at
the lower levels had been acquired. This hypothesis is intuitive,
and makes several predictions consistent with the extant liter-
ature. First, it predicts a developmental trajectory in statistical
learning abilities: phoneme learning should come first, followed by
speech segmentation, followed by word-world mapping, followed
by syntax. Indeed, this is the general trend observed in infant statis-
tical learning experiments. At 6 months, infants show sensitivity
to phoneme distributions (Maye et al., 2002), at 8 months they
can segment continual speech into words (Saffran et al., 1996), at
12 months they can map words onto objects using co-occurrence
information (Smith and Yu,2008), and at 18 months they can learn
non-adjacent syntactic dependencies (Gómez, 2002). Second, this
account predicts that infants should be able to extract regularities
at one level, and use them subsequently to learn at the next higher
level. This has been confirmed by recent empirical findings from
www.frontiersin.org October 2012 | Volume 3 | Article 374 | 1
Yurovsky et al. Speech segmentation and word learning in parallel
Saffran and colleagues (Graf Estes et al., 2007;Hay et al., 2011)
showing that statistically coherent word segments extracted from
continuous speech subsequently act as superior labels in subse-
quent word learning. It is also supported by recent computational
models showing that regularities at multiple levels can be learned
serially from child-directed speech (Yu et al., 2005;Christiansen
et al., 2009;Räsänen, 2011).
Alternatively, learners could acquire structure at each level in
parallel. Because regularities at each level are statistically inter-
related, partial acquisition of the structure at any level would
reduce ambiguity at every other level (Feldman et al., 2009;John-
son et al., 2010). However, this aggregate ambiguity reduction
comes at a cost: if units are uncertain at every level, demands
on attention and memory are likely to skyrocket. Thus, an abun-
dance of structure helpful for ideal learners might easily overload
cognitively constrained statistical learners (Fu, 2008;Frank et al.,
2010). This tradeoff is evident in recent experiments investigating
simultaneous statistical speech segmentation and word learning.
In these experiments, adult learners engaged in a standard sta-
tistical speech segmentation task with one addition: word-onsets
occurred in a small window around the onset of visual objects.
Under these conditions, adults succeeded at both segmenting the
speech stream, and mapping the words onto their correct referents
(Cunillera et al., 2010a,b;Thiessen, 2010). However, in identi-
cal experiments, 8-month-olds failed to acquire either regularity
(Thiessen, 2010). Further,when the task is made slightly more dif-
ficult presenting multiple objects at once (as in Yu and Smith,
2007) adults fail to learn word-object mappings from contin-
uous speech (Frank et al., 2007). Thus, while parallel statistical
learning might provide a significant advantage, it could be out-
side the processing limits of human learners (cf. Fiser and Aslin,
2002, for an example of parallel learning in a purely visual task).
However, these demands on cognitive processing could be allevi-
ated in another way: human learners could be scaffolded by other
properties of natural language (Vygotsky, 1978;Mintz, 2003). The
studies in this paper provide evidence for just such a solution in
the context of parallel speech segmentation and word learning.
In typical statistical learning experiments, regularities in the
input are constructed in such a way as to isolate the problem of
interest. For instance,in statistical speech segmentation tasks, each
word typically occurs with equal frequency and is equally likely to
follow each other word (e.g., Saffran et al., 1996;Graf Estes et al.,
2007). In statistical word learning tasks, each word and object typ-
ically occur with equal frequency,and each incorrect mapping has
equal statistical support (e.g., Yu and Smith, 2007;Smith and Yu,
2008;Vouloumanos and Werker, 2009). But this structure differs
in a number of ways from the structure of natural language input,
and these difference are likely to matter (Kurumada et al., 2011;
Vogt, 2012). For instance, referential utterances in child-directed
speech often come from a small set of stereotyped naming frames,
e.g., look at the dog (Cameron-Faulkner et al., 2003). Children
are remarkably sensitive to this structure: 18-month-old infants
orient faster to the referent of a label embedded in such statisti-
cally frequent naming frames than they do to a label uttered in
isolation (Fernald and Hurtado, 2006). Do these frequent frames
help learners segment a stream of sounds into and to map these
words onto referents?
We pursued this question in two steps. First, we sought to
determine the statistical structure of the frames that characterize
naming events to young children. To this end, we analyzed data
from a corpus of child-directed speech recorded during naturalis-
tic free-play interactions to discover the shared structure of com-
mon naming frames. Subsequently, we constructed an artificial
language in which the strings were naming events that maintained
the main regularities found in the natural speech corpus. We then
embedded these naming events in a word-object mapping task
in which each trial contained multiple naming events and mul-
tiple visual referents. Thus, to learn the language, participants
would have to segment labels from continuous speech and map
them to their statistically consistent referents. We then parametri-
cally manipulated the artificial language to determine if and how
the regularities in natural naming frames facilitate simultaneous
speech segmentation and word learning. Our findings illustrate
the importance of understanding the statistical properties of nat-
ural language contexts for drawing conclusions about statistical
learning.
RESULTS
CORPUS ANALYSIS
To capture regularities in naming frame structure, we analyzed
transcripts of child-directed speech from naturalistic free-play
interactions between 17 parent-child dyads (Yu et al., 2008;Yu
and Smith, 2012). This corpus contained 3165 parental speech
utterances, 1624 of which contained the label of one of the toys
in the room. Of these utterances, 672 (20%) were single-word
utterances consisting of only the toy’s label. Because the Exper-
iments investigate the role of naming frames in parallel speech
segmentation and word learning, these utterances were excluded
from further analysis, but we return to them in the Discussion. The
remaining 952 events were analyzed for consistent naming frame
structure.
As shown in Table 1, 21 different naming frames cover more
than 50% of all naming events. Together,these frames contain only
20 unique words and conform to two general regularities. First, in
these frequent frames, the toy’s label always occurs in the final
position (see also Aslin et al., 1996). Second, only a small set of
words mostly articles precede a toy’s label (see also Shafer et al.,
1998). Both regularities are also common in the remaining naming
events, appearing in 50 and 63%, respectively. Because both final
position (Endress et al., 2005) and onset cues (Bortfeld et al., 2005;
Mersad and Nazzi, 2012) have previously been found to facilitate
statistical sequence learning, each regularity could potentially scaf-
fold statistical learners, buttressing them against the combinatorial
explosion of parallel speech segmentation and word learning. Fur-
ther, evidence from other studies suggests that redundant cues
help children learn language (e.g., Gogate et al., 2000;Frank et al.,
2009). Consequently the combination of both position and onset
cues could play an additive role in speech segmentation and word
learning.
EXPERIMENTS
To study joint speech segmentation and word-object mapping, we
exposed adult participants to a series of individually ambiguous
training trials based on the cross-situational learning paradigm (Yu
Frontiers in Psychology | Language Sciences October 2012 | Volume 3 | Article 374 | 2
Yurovsky et al. Speech segmentation and word learning in parallel
and Smith, 2007). On each trial, adults saw two objects and heard
two phrases of continuous speech from an artificial language. In
order to learn word-object mappings, they had to determine which
phrase referred to which object, where the word boundaries were,
and finally which words were Object Labels and which word were
Frame Words. Crucially, the naming frames extracted from the nat-
ural child-directed speech corpus were encoded into the artificial
language presented to participants (Figure 1).
Participants were assigned randomly to one of four language
conditions. In the Full language condition, participants heard
artificial language phrases containing both regularities found in
natural naming frames. In the Onset Only language condition,
Object Labels appeared in the middle of phrases instead of at the
end, but they were always preceded by one of a small set of onset
Table 1 |The 21 most frequent naming frames.
Phrase Pct. of corpus
The OBJ 6.30
That is a OBJ 4.73
And the OBJ 4.31
A OBJ 4.10
It is a OBJ 3.78
This is a OBJ 3.57
And a OBJ 3.26
Can you say OBJ 2.94
Here is the OBJ 2.63
And OBJ 2.42
Where is the OBJ 1.89
That is the OBJ 1.79
Look at the OBJ 1.79
I have the OBJ 1.47
You want the OBJ 1.16
Color is the OBJ 1.16
Is that the OBJ 1.16
there is the OBJ 1.05
You put the OBJ 1.05
To put the OBJ 0.95
One is the OBJ 0.95
Total 52.42%
Two regularities are apparent in the most frequent naming frames. First, Object
Labels occur reliably in final frame position. Second, labels are reliably preceded
by a small set of onset cues (a, the, and, say).
cue words. In the Position Only language condition Object Labels
always appeared in utterance-final position,but were not preceded
by a small set of onset cue words. Finally, in the Control language
condition, neither regularity from the natural naming frames was
provided. After training, participants were tested for their knowl-
edge of both the words of the language (speech segmentation),
and the word-object mappings. Additional details can be found in
the section “Materials and Methods” below.
Speech segmentation
On each segmentation test, participants were asked to indicate
which of two sequences was more likely to be a word of the lan-
guage. Figure 2 shows how participants’ segmentation of both
Object Labels and Frame Words varied across language condi-
tions. Overall, participants successfully segmented Object Labels
only in the Full and Position Only language conditions. They
segmented Frame Words successfully in the Onset Only lan-
guage condition, and to a lesser extent in the Position Only and
Control language conditions. Participants’ segmentation accura-
cies were averaged across all words and submitted to a mixed 4
(Language) ×2 (Word Type) ANOVA. This analysis showed no
main effect of language [F(3,90) =1.40, p=0.25] nor word type
[F(1,90) =0.83, p=0.37], but did show a significant interaction
[F(3,90) =5.39, p<0.01]. All segmentation accuracy were sub-
mitted to the Shapiro–Wilk test of normality (Shapiro and Wilk,
1965). Since none were found to be non-normal (all ps >0.1),
follow up analyses used t-tests. These follow up tests showed
that Object Label segmentation was above chance in the Full
[M=0.59, t(23) =2.69, p<0.05] and Position Only language
conditions [M=0.57, t(21) =2.13, p<0.05], but not in the
Onset Only [M=0.53, t(23) =1.34, p=0.19] or Control lan-
guage conditions [M=0.54, t(23) =1.26, p=0.22]. Frame-word
segmentation was above chance in the Onset Only language
condition [M=0.68, t(23) =5.39, p<0.001], trended toward
significance in the Position Only and Control language condi-
tions [MPositionOnly =0.56, t(21) =1.86, p=0.08; MControl =0.55,
t(21) =1.93, p=0.06] and was indistinguishable from chance in
the Full language condition [M=0.52, t(23) =0.51, p=0.62].
Segmentation of Object Labels and Frame Words was correlated
in Position Only language condition (r=0.48, p<0.05), but not
in any of the other language conditions (rFull = 0.22, p=0.29;
rOnsetOnly =0.19, p=0.39; rControl =0.23, p=0.29). Segmentation
focus and accuracy thus varied across the conditions.
In the Full language condition, participants focused on and
segmented only the Object Labels, learning little about the Frame
FIGURE 1 | An example training trial from the Full language condition. Trials were constructed by encoding naming event patterns from the child-directed
speech corpus into the artificial language.
www.frontiersin.org October 2012 | Volume 3 | Article 374 | 3
Yurovsky et al. Speech segmentation and word learning in parallel
Object Labels Frame Words
0.45
0.5
0.55
0.6
0.65
0.7
Segmentation Accuracy
Segmentation Accuracy by Word Type and Condition
Full
Position Only
Onset Only
Control
FIGURE 2 | Segmentation accuracy in each condition for both Object
Labels and Frame words. Learners successfully segmented Object Labels
in the Full and Position Only language conditions, and segmented Frame
Words in the Onset Only language condition. Error Bars indicate ±1 SE.
Words. In the Onset Only language condition, participants seg-
mented Frame Words very successfully, but failed to successfully
segment the Object Labels. In the Position Only language con-
dition, participants segmented Object Labels successfully and
segmented Frame Words at near-significant levels. Further, seg-
mentation accuracy for the two word types was correlated in this
condition, suggesting that they supported each other. In the Con-
trol language condition, segmentation trended toward accuracy for
the Frame Words and was at chance levels for Object Labels. Fur-
ther,segmentation of the word types was uncorrelated, suggesting
a less integrated segmentation strategy.
Word-object mapping
Participants were subsequently tested on their word-object map-
ping accuracy. On each test trial, they heard one word from
training and were asked to select the most likely referent object
from a set of four alternatives. As shown in Figure 3, par-
ticipants learned a significant proportion of word-object map-
pings in all but the Control language condition, but were most
successful in the Full and Position Only language conditions
the same languages in which they were most successful at
Object Label segmentation. An ANOVA showed significant dif-
ferences in mapping accuracy across conditions [F(3,90) =5.03,
p<0.01]. Additional tests showed that accuracy was signifi-
cantly above chance in all but the Control language condi-
tion [MFull =0.45, t(23) =4.98, p<0.001; MPositionOnly =0.42,
t(21) =4.12, p<0.001; MOnsetOnly =0.34, t(23) =2.99, p<0.01;
MControl =0.29, t(23) =1.78, p=0.09]. Further, accuracy was
similar in the Full and Position Only language conditions
[t(44) =0.57, p=0.57], and accuracy in both was significantly
greater than in the Control language condition [tFull(46) =3.69,
p<0.001; tPositionOnly(44) =2.92, p<0.01].Accuracy was signifi-
cantly greater in the Full language condition than in the Onset
Only language condition [t(46) =2.31, p<0.05], but accuracy
0.2
0.25
0.3
0.35
0.4
0.45
0.5
Mapping Accuracy
Word−Object Mapping Accuracy by Condition
Full
Position Only
Onset Only
Control
FIGURE 3 | Word-object mapping accuracy by condition. Participants
mapped words onto object successfully in all but the Control language
condition. Error Bars indicate ±1 SE.
did not differ between the Position Only and the Onset Only lan-
guage conditions [t(44) =1.65, p=0.11]. Thus, participants were
able to learn word-object mappings from continuous speech as
long as either regularity from natural naming frames was present.
However, the position regularity facilitated learning more than the
onset cue regularity.
Correlations between speech segmentation and word-object
mapping
Did segmentation and word-object mapping interact, bootstrap-
ping each other? Figure 4 shows correlations between each partici-
pant’s average Object Label segmentation and average word-object
mapping in each language condition. The two were positively
correlated in the Full (r=0.51; p<0.05) and the Position Only
language conditions (r=0.62, p<0.01), but were uncorrelated
in the Onset Only (r= 0.09, p=0.67) and Control language
conditions (r= 0.10, p=0.56). Thus, participants in the Onset
Only language condition showed evidence of learning word-object
mappings without fully segmenting the labels from the utterances.
DISCUSSION
Natural languages are richly structured, containing regularities at
multiple hierarchal levels. Statistical learning approaches to lan-
guage acquisition typically focus on one level at a time, showing
how the primitives from the level below can be used to construct
the primitives for the level above. Alternatively,statistical language
learning at every level could proceed in parallel, exploiting statisti-
cal redundancies across levels (Feldman et al., 2009;Johnson et al.,
2010). On this account, a child learning a word-referent mapping
may not need to wait until she has fully learned the word. But
uncertainty at multiple levels imposes significant attention and
memory demands on learners, demands that may prevent learn-
ing altogether (Frank et al., 2007;Thiessen, 2010). In this paper, we
suggest that these demands may be alleviated by other regularities
Frontiers in Psychology | Language Sciences October 2012 | Volume 3 | Article 374 | 4
Yurovsky et al. Speech segmentation and word learning in parallel
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
Mapping Accuracy
Full Language
A
r = 0.51
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
Position Only Language
B
r = 0.62
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
Onset Only Language
Mapping Accuracy
Segmentation Accuracy
C
r = −0.09
0 0.2 0.4 0.6 0.8 1
0
0.2
0.4
0.6
0.8
1
Control Language
Segmentation Accuracy
r = −0.09
D
FIGURE 4 | Correlations between segmentation accuracy and
word-mapping accuracy in each language condition. Learning the two
regularities was positively correlated in the Full (A) and Position Only (B)
language conditions, and uncorrelated in the Onset Only (C) and Control
(D) language conditions.
in natural language input, for instance, frequent naming frames
(Mintz, 2003).
CORPUS ANALYSIS
Analyzing the structure of natural naming events is an important
step toward modeling children’s word learning. Because consis-
tency in naming event structure constrains the space of potential
solutions, the same mechanism that fails in an unstructured envi-
ronment may successfully extract words from fluent speech and
map them to their referent objects when additional regularities
are present. Our analysis showed, first, that a large proportion of
naming events in naturalistic free-play are single-word utterances
(see also Fernald and Morikawa, 1993;Brent and Siskind, 2001).
These utterances could simplify later speech segmentation and give
infants a leg up in later word learning (Brent and Siskind, 2001;
Lew-Williams et al., 2011).
Second, our analysis revealed two regularities common to over
50% of naming events: labels occur in final phrasal position, and
are preceded by an onset cue. We hypothesize that these regu-
larities, like single-word utterances, could also scaffold statistical
learning. Specifically, the information encoded in frequent nam-
ing frames may allow learners to identify the utterances most likely
to be naming events and to spot the label within each frame,
potentially without fully segmenting the other words. That is,
word-referent mapping may begin before children know exact
word boundaries (Yu et al., 2005).
EXPERIMENTS
Encoding these regularities into an artificial language, we tested
this idea empirically. Exposing adult participants to artificial lan-
guages constructed from a corpus of child-directed speech, we
were able to determine the independent and joint contributions
of the two regularities apparent in the corpus. Keeping constant
the words that make up naming phrases, we altered only their
order across conditions. If parallel speech segmentation and word-
object mapping rely on environmental cues to reduce cognitive
load, this should be reflected in the learning rates across our four
conditions.
In the Full language condition, which gave strong cues to the
frame position of Object Labels as well as to their onset, par-
ticipants successfully segmented labels from continuous speech
and mapped them onto their referent objects. This success came
in spite, or perhaps because, of chance-level performance on
Frame Word segmentation. That is, participants were able to focus
their attention on only the relevant portion of the speech steam
(see also Cunillera et al., 2010a). These results, along with the
strong correlation between word segmentation and word-object
mapping, suggest that participants became attuned to the posi-
tional regularity and effectively ignored large portions of the
speech input. This reduction in cognitive load may have supported
learning.
The Position Only language condition, in contrast, removed the
onset cue by moving words in the cue set to the beginning of
each sentence. In this condition,participants also successfully seg-
mented Object Labels from continuous speech, although at slightly
a reduced level. In trade, they performed at a near-significant level
on Frame Word segmentation. Also, unlike in the Full language
condition, segmentation of Object Labels and Frame Words was
highly correlated, suggesting an interaction between the processes.
Nonetheless, despite these differences, participants in the Position
Only language condition performed well on the test of word-object
mapping. Thus, removing the onset cue forced participants to
actively process more of the speech stream, but the presence of
the position cue kept cognitive load low enough to enable learn-
ing. These results are consistent with previous work showing that
utterance-final position facilitates language learning (Echols and
Newport, 1993;Goodsitt et al., 1993;Endress et al., 2005;Frank
et al., 2007).
Removing the position regularity from the Full language
yielded the Onset Only language condition. In this condition,
Object Labels were preceded by a small set of onset cues, but
occurred always in medial phrasal position. Without labels in
final position, participants performed at chance on tests of
Object Label segmentation. However performance on Frame
Word segmentation reached levels unseen in the other condi-
tions. Surprisingly, although participants did not show knowl-
edge of correct Object Label segmentation, they did succeed
in mapping words to objects at above chance (albeit reduced)
levels. Thus, an onset cue alone was sufficient to enable word
learning. This is consonant with other work showing that famil-
iar words can act as onset cues, giving infants a wedge into
speech segmentation (Bortfeld et al., 2005;Mersad and Nazzi,
2012).
Finally, when naming phrases contained all of the same words
but neither of the cues found in the child-directed speech cor-
pus, participants showed poor learning of both kinds of statistics.
Thus, in the Control language condition, participants were unable
to cope with the cognitive load inherent in the simultaneous
segmentation and word learning.
www.frontiersin.org October 2012 | Volume 3 | Article 374 | 5
Yurovsky et al. Speech segmentation and word learning in parallel
CONCLUSION
We began by considering the relationship between statistical
speech segmentation and statistical word learning. While previ-
ous work has demonstrated a serial link (e.g., Graf Estes et al.,
2007;Mirman et al., 2008), in which word candidates generated
via statistical segmentation are privileged in statistical word learn-
ing, a robust parallel demonstration has remained elusive (Frank
et al., 2007;Thiessen, 2010). Perhaps the computational resources
required by the tasks are simply too costly to allow their simul-
taneous resolution. We proposed that construction of previous
artificial languages may have averaged out the very regularities
that support a parallel solution in naturalistic environments. To
borrow from J. J. Gibson, “it’s not [just] what is inside the head
that is important, it’s what the head is inside of.”
Analysis of a corpus of child-directed speech from free-play
found two potential sources of such scaffolding. First, Object
Labels occurred consistently in the final position of naming
phrases. Second, these labels were consistently preceded by one
of a small set of onset cue words, predominantly articles. We con-
structed artificial languages following a 2 ×2 design to produce
all possible presence/absence combinations of these regularities.
Adult participants were exposed to an ambiguous word-object
mapping task in the cross-situational word learning paradigm (Yu
and Smith, 2007) in which labels were embedded within contin-
uous speech phrases. These experiments allowed us to determine
the independent and joint contributions of the two natural nam-
ing regularities. Although these studies use adult language learners
as a proxy for child language learners (Gillette et al., 1999), future
studies will need to ask this question more directly, using infant
participants and measuring learning on-line over the course of
training. This will allow finer-grained analysis of the relative time-
course of acquisition of each regularity, making clearer whether
learning is serial, parallel, or a mixture of both. Further, while the
two major regularities found in the corpus have been observed in
other corpora, further analyses will need to determine how naming
frames change over development,and how these frames contribute
to speech segmentation and word learning. Finally, it is impor-
tant to know to what extent these kinds of frames characterize
other languages. Although surely specific frames will differ from
language to language, there are reasons to expect common regular-
ities to generalize. For instance,Aslin et al. (1996) analyzed Turkish
child-directed speech and found that mothers consistently placed
target objects in final position even though this is ungrammatical.
These results highlight the importance of studying statistical
language learning in the context of real language input. Although
statistical learning is often studies under unbiased” assumptions
about input distributions (e.g., uniform word frequency), these
assumptions can be a poor proxy for real-world input (e.g., Zip-
fian frequency). Sometimes, as in the Full language condition,
natural input distributions facilitate statistical learning (see also,
Johns and Jones, 2010;Kurumada et al., 2011). However, in other
cases, natural input statistics make pure statistical learning diffi-
cult or impossible (e.g., Johnson and Tyler, 2010;Medina et al.,
2011;Vogt, 2012). In such cases, we may be led to understand
how other properties of the environment or of children’s and
adults’ perceptual systems take up the slack. For instance,a num-
ber of previous studies highlight the importance of redundant
information in language learning (e.g., Gogate et al., 2000, 2001;
Frank et al., 2009;Goldstein et al.,2010;Grassmann and Tomasello,
2010;Smith et al., 2010;Riordan and Jones, 2011). In all of these
cases, a difficult statistical language learning problem is made
easier by the addition of redundant information, often informa-
tion from a second sensory modality. For instance, the addition
of a pointing (Grassmann and Tomasello, 2010) or synchronous
motion (Gogate et al., 2000). This redundant information may
make the regularity easier to notice. In other cases, this highlight-
ing is accomplished with a single modality e.g., presenting the
label in a familiar voice (Bergelson and Swingley, 2012) or prosody
(Thiessen et al., 2005;Shukla et al., 2011). Finally, in some cases
this simplification may be accomplished by the child’s own percep-
tion/action system, which may act as a filter on the visual (Yurovsky
et al., 2012;Yu and Smith, 2012).
Language learning is a process of navigating uncertainty, of
leveraging partially learned regularities to learn other regularities
(Gleitman, 1990;Smith, 2000). Consequently, there many many
routes for breaking into language,and the route that learners adopt
is likely to depend on the statistics in their input. For instance,
in the Full language condition, participants learned word-object
mappings by segmenting Object Labels but ignoring Frame Words.
In contrast, participants in the Position Only language condition
segmented both kinds of words, and participants in the Onset Only
language condition learned word-object mappings but segmented
only the Frame Words. In concert with previous research indi-
cating that learners can ignore irrelevant statistical information
(Cunillera et al., 2010a;Weiss et al., 2010), and focus on reliable
statistical information (Smith, 2000;Colunga and Smith, 2005),
these results present a picture of language acquisition as an adap-
tive process in which learners focus on and exploit the regularities
most useful for the task at hand. Thus, the timing with which
different regularities are acquired is likely to vary as a function
of each learner’s input. There may thus be cases, as Peters (1977)
suggested, in which children“learn the tune before the words.”
MATERIALS AND METHODS
All experiments reported in this paper were approved by the
Human Subjects Office at the Indiana University Office of
Research Administration. Informed consent was obtained from
all participants prior to their participation in these experiments.
CORPUS ANALYSIS
Data
Transcripts of child-direct speech for naming frame analysis were
drawn from free-play interactions between 17 mothers and their
17–19-month-old children. These dyads were seated across from
each other and asked to play with three novel toys for 3 min at a
time. They were given three such sets of toys, resulting in nine total
minutes of interaction. Parents were taught labels for each of these
toys (e.g., dax,” toma”) and asked to use these if they wished to
refer to them by name. No other instructions were given.
Audio recordings of each parent’s speech were automatically
partitioned into individual utterances using a threshold of 1 s
of speech silence. This approach provides a consistent, objective
cutoff and obviates the reliability issues involved in human cod-
ing. For the purpose of speech segmentation, the importance of
Frontiers in Psychology | Language Sciences October 2012 | Volume 3 | Article 374 | 6
Yurovsky et al. Speech segmentation and word learning in parallel
utterance boundaries is that they provide salient stops that disam-
biguate word boundaries. Because previous research shows that
pauses on the order of 100 ms (Ettlinger et al., 2011) and 400 ms
(Finn and Hudson Kam, 2008) affect adult speech segmentation,
and pauses on the order of 500 ms (Mattys et al., 1999) affect infant
statistical speech segmentation, 1 s is a conservative estimate of the
length of pauses that would provide disambiguating information
to children.
These utterances were then transcribed by human coders into
English. Naming frame regularities were extracted using a six-
word window made up of three words on either side of a toy’s
label. If fewer than three words preceded or followed a label in any
given utterance, blanks were inserted to fill out the window (e.g.,
“_ _ the toma is blue _”). Next, individual toy labels were replaced
with a common token (OBJ), and the frequency of each resulting
multi-word frame was computed.
EXPERIMENTS
Participants
Ninety-two undergraduate students from Indiana University par-
ticipated in exchange for course credit. All participants were
self-reported native speakers of English. These participants were
divided into four approximately equal groups,each exposed to one
of the artificial languages.
Materials
Stimuli for the experiment consisted of 18 unique objects (from Yu
and Smith, 2007), and 38 unique words. Eighteen of these words
acted as labels for the novel objects, and the other 20 were mapped
onto the words contained in the 21 most frequent frames found
in the corpus analysis. Half of the words of each type were one
syllable (CV) long, and the other half were two syllables (CVCV)
long, necessitating the construction of 57 unique syllables. These
syllables were created by sampling 57 of the 60 possible combi-
nations of 12 constants and 5 vowels. Syllables were assigned to
words randomly, so that nothing about a word’s phonetic prop-
erties could be used to distinguish Object Labels from Frame
Words.
Words were then concatenated together without intervening
pauses to create artificial language equivalents of each of the 21 fre-
quent frames in the corpus. Participants were exposed to synthe-
sized versions of these phrases constructed with MBROLA (Dutoit
et al., 1996). This produced utterances in which no prosodic or
phonetic properties could be used to determine word bound-
aries, forcing participants to rely on statistical information. Speech
was synthesized using the us1 diphone database an Ameri-
can female speaking voice. Each consonant was 94ms long with
a pitch point of 200 Hz at 10 ms. Each vowel was 292 ms long
with a 221 Hz pitch point at 108 ms and a 200 Hz pitch point
at 292 ms. Each syllable was separated from the next by a 1 ms
pause and each utterance ended with a 20 ms pause. These val-
ues were chosen to produce speech with a natural sound and
cadence.
Design and procedure
Participants were told that they would be exposed to scenes con-
sisting of two novel objects, and a phrase referring to each of them.
Table 2 |The 2×2 design of the artificial language experiment.
Final position Middle position
Preceding
cue
Full Language
“Look at the OBJ”
Onset H: 1.45, Offset H: 0
Onset Only Language
At the OBJ look”
Onset H: 1.45, Offset H: 3.50
No cue Position Only Language
The look at OBJ”
Onset H: 2.71, Offset H: 0
Control Language
the look OBJ at”
Onset H: 2.71, Offset H: 3.50
Phrasal position of the Object Label varies along the rows; presence of the onset
cue varies along the columns.
Each phrase would contain exactly one word labeling an on-screen
object, along with several function words corresponding to the
grammar of the artificial language. Participants had to determine
which phrase referred to which object, how the phrases they heard
should be segmented into words, and which of these words referred
to which of the objects. Next, participants observed an exam-
ple trial using English words and familiar objects to demonstrate
the task. Importantly, the example contained both an object-final
phrase (“observe the tractor”) and an object-medial phrase (“and
the dog over there”) to prevent participants from expecting any
particular positional regularity.
After the example, participants observed 108 training trials,
each containing 2 objects and 2 spoken artificial language phrases
(Figure 1). Trials began with 2 s of silence, each phrase was approx-
imately 2 s in length, and 3 s of silence succeeded each phrase,
resulting in trials approximately 12s long. Each object appeared
12 times, and each naming frame occurred a number of times pro-
portional to its appearance in the child-directed speech corpus.
The entire training set ran just over 20 min.
After training, participants were tested first for speech segmen-
tation and then word-object mapping. On each segmentation test
trial, a participant heard 2 two-syllable words: a word from the
experiment and a foil created by concatenating the first syllable
of one word and the second syllable of another (following Fiser
and Aslin, 2002). They were asked to indicate which of the words
was more likely to be part of the artificial language (2AFC Test).
Six correct Object Labels were tested against 6 Object foils, and
6 correct Frame Words were tested against 6 Frame foils, result-
ing in 72 total segmentation trials. Each possible word occurred
an equal number of times in testing, preventing participants from
using test frequency as a cue to correctness. Tests for Object Labels
and Frame words were interspersed in a different random order
for each participant.
Subsequently, participants were tested on their knowledge of
word-object mappings. On each test trial, participants heard one
of the Object Labels and were asked to select its correct referent
from a set of four alternatives (4AFC Test). All of the labels were
tested once in random order.
To assess the independent and joint contribution of both the
final position and onset cue regularities, one group of partici-
pants was exposed to each of the four possible presence/absence
combinations of these cues. Materials and procedure were identi-
cal for each of the groups except for the order of words within
www.frontiersin.org October 2012 | Volume 3 | Article 374 | 7
Yurovsky et al. Speech segmentation and word learning in parallel
each artificial language naming phrase (Table 2). To quantify
the in-principle difficulty of segmenting each language, we com-
pute the binary entropy of the Frame Words in the positions
preceding and following an Object Label in each language con-
dition. Entropy (H) quantifies the variability of a distribution,
integrating both the number of unique alternatives and the rel-
ative frequency of each alternative (Shannon, 1948). When there
is no variability, e.g., when the only possibility is an utterance
boundary, entropy is zero. As the number of alternatives increases
and their frequencies become more uniform, entropy increases.
Onset and Offset entropies for each language are also found in
Table 2.
ACKNOWLEDGMENTS
This research was supported by a National Science Foundation
Graduate Research Fellowship to Daniel Yurovsky and National
Institutes of Health R01HD056029 to Chen Yu. The authors
wish to thank Damian Fricker for collecting much of the data,
and Morten Christiansen, Jenny Saffran, Michael C. Frank, and
Roberta Golinkoff for discussion.
REFERENCES
Aslin, R. N., Woodward, J. Z., LaMen-
dola, N. P., and Bever, T. G. (1996).
“Models of word segmentation in
fluent maternal speech to infants, in
Signal to Syntax: Bootstrapping from
Speech to Grammar in Early Acquisi-
tion, eds J. L. Morganand K. Demuth
(Hillsdale: Erlbaum), 117–134.
Bergelson, E., and Swingley, D. (2012).
At 6–9 months, human infants know
the meanings of many common
nouns. Proc. Natl. Acad. Sci. U.S.A.
109, 3253–3258.
Bortfeld, H., Morgan, J. L., Golinkoff,
R. M., and Rathbun, K. (2005).
Mommy and me: familiar names
help launch babies into speech-
stream segmentation. Psychol. Sci.
16, 298–304.
Brent, M. R., and Siskind, J. M. (2001).
The role of exposure to isolated
words in early vocabulary develop-
ment. Cognition 81, B33–B44.
Cameron-Faulkner, T., Lieven, E., and
Tomasello, M. (2003). A construc-
tion based analysis of child directed
speech. Cogn. Sci. 27, 843–873.
Christiansen, M. H., Onnis, L., and
Hockema, S. A. (2009). The secret
is in the sound: from unsegmented
speech to lexical categories. Dev. Sci.
12, 388–395.
Colunga, E., and Smith, L. B. (2005).
From the lexicon to expectations
about kinds: a role for associative
learning. Psychol. Rev. 112, 347–382.
Cunillera, T., Càmara, E., Laine, M.,
and Rodríguez-Fornells, A. (2010a).
Speech segmentation is facilitated by
visual cues. Q. J. Exp. Psychol. 63,
260–274.
Cunillera, T., Laine, M., Càmara, E.,
and Rodríguez-Fornells, A. (2010b).
Bridging the gap between speech
segmentation and word-to-world
mappings: evidence from an audio-
visual statistical learning task. J.
Mem. Lang . 63, 295–305.
Dutoit, T., Pagel, V., Pierret, N., Bataille,
F., and van der Vrecken, O. (1996).
The MBROLA project: towards a set
of high-quality speech synthesizers
dree of use for non-commercial pur-
poses, in Proceedings of the Fourth
International Conference on Spoken
Language Processing, eds H. T. Bunell
and W. Isardi (Wilmington, DE:
Dupont Institute), 1393–1396.
Echols, C. H.,and Newport, E. L. (1993).
The role of stress and position
in determining first words. Lang.
Acquis. 2, 189–220.
Eimas, P. D., Siqueland, E. R., Jusczyk,
P., and Vigorito, J. (1971). Speech
perception in infants. Science 171,
303–306.
Endress, A. D., Scholl, B. J., and Mehler,
J. (2005). The role of salience in the
extraction of algebraic rules. J. Exp.
Psychol. Gen. 134, 406–419.
Ettlinger, M., Finn, A. S., and Hudson
Kam, C. L. (2011). The effect of
sonority on word segmentation: evi-
dence for the use of a phonological
universal. Cogn. Sci. 36, 655–673.
Feldman, N. H., Griffiths, T. L., and
Morgan, J.L. (2009). “Learning pho-
netic categories by learning a lex-
icon, in Proceedings of the 31st
Annual Conference of the Cognitive
Science Society, eds A. D. De Groot
and G. Heymans (Austin, TX: Cog-
nitive Science Society), 2208–2213.
Fernald, A., and Hurtado, N. (2006).
Names in frames: infants interpret
words in sentence frames faster than
words in isolation. Dev. Sci. 9, F33–
F40.
Fernald, A., and Morikawa, H. (1993).
Common themes and cultural vari-
ations in Japanese and American
mothers’ speech to infants. Child
Dev. 64, 637–656.
Finn, A. S., and Hudson Kam, C. L.
(2008). The curse of knowledge: first
language knowledge impairs adult
learners’ use of novel statistics for
word segmentation. Cognition 108,
477–499.
Fiser, J., and Aslin, R. N. (2002). Statis-
tical learning of higher-order tem-
poral structure from visual shape
sequences. J. Exp. Psychol. Learn.
Mem. Cogn. 28, 458–467.
Frank, M. C., Goldwater, S., Griffiths,
T. L., and Tenenbaum, J. B. (2010).
Modeling human performance in
statistical word segmentation. Cog-
nition 117, 107–125.
Frank, M. C., Mansinghka, V., Gibson,
E., and Tenenbaum, J. B. (2007).
“Word segmentation as word learn-
ing: integrating stress and meaning
with distributional cues,”in Proceed-
ings of the 31st Annual Boston Univer-
sity Conference on Language Devel-
opment, eds H. Caunt-Nulton, S.
Kulatilake, and I. Woo (Boston, MA:
Boston University), 218–229.
Frank, M. C., Slemmer, J. A., Marcus,G.
F., and Johnson, S. P. (2009). Infor-
mation from multiple modalities
helps 5-month-olds learn abstract
rules. Dev. Sci. 12, 504–509.
Fu, W. (2008). Is a single-bladed knife
enough to dissect human cogni-
tion? Commentary on Griffiths et al.
Cogn. Sci. 32, 155–161.
Gillette, J., Gleitman, H., Gleitman,
L., and Lederer, A. (1999). Human
simulations of vocabulary learning.
Cognition 73, 135–176.
Gleitman, L. (1990). The structural
sources of verb meanings. Lang.
Acquis. 1, 3–55.
Gogate, L. J., Bahrick, L. E., and Wat-
son, J. D. (2000). A study of multi-
modal motherese: the role of tempo-
ral synchrony between verbal labels
and gestures. Child Dev. 71,878–894.
Gogate, L. J., Walker-Andrews, A. S.,
and Bahrick, L. E. (2001). The inter-
sensory origins of word comprehen-
sion: an ecological-dynamic systems
view. Dev. Sci. 4, 1–37.
Goldstein, M. H., Waterfall, H. R.,
Lotem, A., Halpern, J. Y., Schwade,
J. A., Onnis, L., et al. (2010). Gen-
eral cognitive principles for learning
structure in time and space. Trends
Cogn. Sci. (Regul. Ed.) 14, 249–258.
Gómez, R. L. (2002). Variability and
detection of invariant structure. Psy-
chol. Sci. 13, 431–436.
Gómez, R. L., and Gerken, L. (2000).
Infant artificial language learn-
ing and language acquisition.
Trends Cogn. Sci. (Regul. Ed.) 4,
178–186.
Goodsitt, J. V., Morgan, J. L., and Kuhl,
P. K. (1993). Perceptual strategies in
prelingual speech segmentation. J.
Child Lang. 20, 229–252.
Graf Estes, K., Evans, J. L., Alibali, M. W.,
and Saffran, J. R. (2007). Can infants
map meaning to newly segmented
words? Statistical segmentation and
word learning. Psychol. Sci. 18,
254–260.
Grassmann, S., and Tomasello, M.
(2010). Young children follow point-
ing over words in interpreting acts of
reference. Dev. Sci. 13, 252–263.
Hay,J. F., Pelucchi, B., Graf, K., and Saf-
fran, J. R. (2011). Linking sounds to
meanings: infant statistical learning
in a natural language. Cogn. Psychol.
63, 93–106.
Johns, B. T., and Jones, M. N. (2010).
Evaluating the random representa-
tion assumption of lexical semantics
in cognitive models. Psychon. Bull.
Rev. 17, 662–672.
Johnson, E. K., and Tyler,M. D. (2010).
Testing the limits of statistical learn-
ing for word segmentation. Dev. Sci.
13, 339–345.
Johnson, M., Frank, M. C., Demuth, K.,
and Jones, B. K. (2010). Synergies
in learning words and their refer-
ents. Adv. Neural Inf. Process Syst. 23,
1018–1026.
Kuhl,P. K. (2004). Early language acqui-
sition: cracking the speech code. Nat.
Rev. Neurosci. 5, 831–843.
Kurumada, C., Meylan, S., and Frank,
M. C. (2011). “Zipfian frequen-
cies support statistical word seg-
mentation, in Proceedings of the
33rd Annual Conference of the Cog-
nitive Science Society, eds L. Carl-
son, C. Hölscher, and T. Shipley
(Austin, TX: Cognitive Science Soci-
ety), 2667–2672.
Lew-Williams, C., Pelucchi, B., and
Saffran, J. R. (2011). Isolated
words enhance statistical language
learning in infancy. Dev. Sci. 14,
1323–1329.
Marcus,G. F. (2000). Pabikuand ga ti g a:
two mechanisms infants use to learn
about the world. Curr. Dir. Psychol.
Sci. 9, 145–147.
Mattys, S. L., Jusczyk, P. W., Luce, P. A.,
and Morgan, J. L. (1999). Phonotac-
tic and prosodic effects on word seg-
mentation in infants. Cogn. Psychol.
38, 465–494.
Maye, J., Werker, J. F., and Gerken, L.
(2002). Infant sensitivity to distrib-
utional information can affect pho-
netic discrimination. Cognition 82,
B101–B111.
Frontiers in Psychology | Language Sciences October 2012 | Volume 3 | Article 374 | 8
Yurovsky et al. Speech segmentation and word learning in parallel
Medina, T. N., Snedeker, J., Trueswell,
J. C., and Gleitman, L. R. (2011).
How words can and cannot be
learned by observation. Proc.
Natl. Acad. Sci. U.S.A. 108,
9014–9019.
Mersad, K., and Nazzi, T. (2012).
When mommy comes to the rescue
of statistics: infants combine top-
down and bottom-up cues to seg-
ment speech. Lang. Learn. Dev. 8,
303–315.
Mintz, T. H. (2003). Frequent frames as
a cue for grammatical categories in
child directed speech. Cognition 90,
91–117.
Mirman, D., Magnuson,J. S., Graf Estes,
K., and Dixon, J. A. (2008). The link
between statistical segmentation and
word learning in adults. Cognition
108, 271–280.
Peters, A. M. (1977). Language learn-
ing strategies: does the whole equal
the sum of the parts? Language 53,
560–573.
Pierrehumbert, J. B. (2003). Phonetic
diversity, statistical learning, and
the acquisition of phonology. Lang.
Speech 46, 115–154.
Räsänen, O. (2011). A computa-
tional model of word segmenta-
tion from continuous speech using
transitional probabilities of atomic
acoustic events. Cognition 120,
149–176.
Riordan, B., and Jones, M. N. (2011).
Redundancy in perceptual and
linguistic experience: comparing
feature-based and distributional
models of semantic representation.
Top. Cogn. Sci. 3, 303–345.
Saffran, J. R., Hauser, M., Seibel,
R., Kapfhamer, J., Tsao, F., and
Cushman, F. (2008). Gram-
matical pattern learning by
human infants and cotton-top
tamarin monkeys. Cognition 107,
479–500.
Saffran, J. R., Newport, E. L., and Aslin,
R. N. (1996). Word segmentation:
the role of distributional cues. J.
Mem. Lang. 621, 606–621.
Shafer, V., Shucard, D., Shucard, J., and
Gerken, L. (1998). An electrophys-
iological study of infants’ sensitiv-
ity to the sound patterns of English
speech. J. Speech Lang. Hear. Res. 41,
874–886.
Shannon, C. E. (1948). A mathematical
theory of communication. Bell Syst.
Tech. J. 27, 379–423.
Shapiro,S. S., and Wilk, M. B.(1965). An
analysis of variance test for normal-
ity (complete samples). Biometrika
52, 591–611.
Shukla, M., White, K. S., and Aslin,
R. N. (2011). Prosody guides the
rapid mapping of auditory word
forms onto visual objects in 6-mo-
old infants. Proc. Natl. Acad. Sci.
U.S.A. 108, 6038–6043.
Smith, L. B. (2000). “How to learn
words: an associative crane, in
Breaking the Word Learning Barrier,
eds R. M. Golinkoff and K. Hirsh-
Pasek (Oxford: Oxford University
Press), 51–80.
Smith, L. B., Colunga, E., and Yoshida,
H. (2010). Knowledge as process:
contextually-cued attention and
early word learning. Cogn. Sci. 34,
1287–1314.
Smith, L. B., and Yu, C. (2008). Infants
rapidly learn word-referent map-
pings via cross-situational statistics.
Cognition 106, 1558–1568.
Thiessen, E. D. (2010). Effects of visual
information on adults’ and infants’
auditory statistical learning. Cogn.
Sci. 34, 1093–1106.
Thiessen, E. D., Hill, E. A., and Saf-
fran, J. R. (2005). Infant-directed
speech facilitates word segmenta-
tion. Infancy 7, 49–67.
Vogt, P. (2012). Exploring the robust-
ness of cross-situational learning
under Zipfian distributions. Cogn.
Sci. 36, 726–739.
Vouloumanos, A., and Werker, J. F.
(2009). Infants’ learning of novel
words in a stochastic environment.
Dev. Psychol. 45, 1611–1167.
Vygotsky, L. (1978). Mind and Soci-
ety: The Development of Higher Psy-
chological Processes. Cambridge, MA:
Harvard University Press.
Waxman, S. R., and Gelman, S. A.
(2009). Early word-learning entails
reference, not merely associations.
Trends Cogn. Sci. (Regul. Ed.) 13,
258–263.
Weiss, D., Gerfen, C., and Mitchel,
A. (2010). Colliding cues in word
segmentation: the role of cue
strength and general cognitive
processes. Lang. Cogn. Process. 25,
402–422.
Yu, C., Ballard, D. H., and Aslin, R.
N. (2005). The role of embodied
intention in early lexical acquisition.
Cogn. Sci. 29, 961–1005.
Yu, C., and Smith, L. B. (2007). Rapid
word learning under uncertainty via
cross-situational statistics. Psychol.
Sci. 18, 414–420.
Yu, C., and Smith, L. B. (2012).
Embodied attention and
word learning by toddlers.
Cognition 125, 244–262.
Yu, C., Smith, L. B., and Pereira, A. F.
(2008). “Grounding word learning
in multimodal sensorimotor inter-
action, in Proceedings of the 30th
Annual Conference of the Cogni-
tive Science Society, eds B. C. Love,
K. McRae, and V. M. Sloutsky
(Austin, TX: Cognitive Science Soci-
ety), 1017–1022.
Yurovsky, D., Smith, L. B., and Yu, C.
(2012). Does Statistical Word Learn-
ing Scale? It’s a Matter of Perspec-
tive. Austin, TX: Cognitive Science
Society, 1209–1213.
Conflict of Interest Statement: The
authors declare that the research was
conducted in the absence of any com-
mercial or financial relationships that
could be construed as a potential con-
flict of interest.
Received: 12 July 2012; accepted: 11 Sep-
tember 2012; published online: 01 Octo-
ber 2012.
Citation: Yurovsky D, Yu C and Smith
LB (2012) Statistical speech segmen-
tation and word learning in par-
allel: scaffolding from child-directed
speech. Front. Psychology 3:374. doi:
10.3389/fpsyg.2012.00374
This article was submitted to Frontiers in
Language Sciences, a specialty of Frontiers
in Psychology.
Copyright © 2012 Yurovsky, Yu and
Smith. This is an open-access article dis-
tributed under the terms of the Creative
Commons Attribution License, which
permits use, distribution and reproduc-
tion in other forums, provided the original
authors and source are credited and sub-
ject to any copyright notices concerning
any third-party graphics etc.
www.frontiersin.org October 2012 | Volume 3 | Article 374 | 9
... Intuitively, adding mapping ambiguity could make the simultaneous task too challenging. However, Yurovsky et al. (2012) have shown that adults can simultaneously segment labels from phrases and map them to objects across ambiguous presentations. Using an adaptation of the cross-situational word learning paradigm ( Yu & Smith, 2007), adults were exposed to scenes with two novel objects. ...
... Not only were participants able to segment and map words simultaneously by tracking conditional probabilities, but their overall performance was stronger in this simultaneous task in comparison to separate tasks of segmentation and cross-situational word learning. This adds to the evidence showing that language learners benefit from combining several sources of linguistic information when learning a new language (Choi et al., 2018;Johnson, 2016;Saffran, 2020;Smith et al., 2018;Tachakourt, 2023;Yurovsky et al., 2012). This combination might be especially useful when dealing with ambiguity, as even a partial solution to one linguistic challenge could reduce ambiguity in other linguistic challenges (for related evidence, see Feldman, Myers et al., 2013). ...
Article
Full-text available
Language learners track conditional probabilities to find words in continuous speech and to map words and objects across ambiguous contexts. It remains unclear, however, whether learners can leverage the structure of the linguistic input to do both tasks at the same time. To explore this question, we combined speech segmentation and cross-situational word learning into a single task. In Experiment 1, when adults (N = 60) simultaneously segmented continuous speech and mapped the newly segmented words to objects, they demonstrated better performance than when either task was performed alone. However, when the speech stream had conflicting statistics, participants were able to correctly map words to objects, but were at chance level on speech segmentation. In Experiment 2, we used a more sensitive speech segmentation measure to find that adults (N = 35), exposed to the same conflicting speech stream, correctly identified non-words as such, but were still unable to discriminate between words and part-words. Again, mapping was above chance. Our study suggests that learners can track multiple sources of statistical information to find and map words to objects in noisy environments. It also prompts questions on how to effectively measure the knowledge arising from these learning experiences.
... In Yu and Smith (2007), a group of adults learnt new words and their referents in a CSSL task, as shown by an alternative forced choice test (AFC), where participants performed above chance and correctly identified up to 88% of the referents in the easiest (2 × 2) experiment and 53% in the hardest experiment (4 × 4). Different studies have since shown successful CSSL in adults (Fitneva & Christiansen, 2011;Yurovsky et al., 2012) and infants (Smith & Yu, 2008). A previous study (Suanda et al., 2014), tested typically developing 5-to 7-year-old children in a CSSL task. ...
... In turn, children with DLD showed a (significantly) larger percentage of 0% competitor selection contrasted with the TD group. Moreover, our behavioural results from the TD group are similar to those previously reported in English-speaking TD school-age children without DLD (see Suanda et al., 2014) and are also in line with CSSL studies conducted on typical populations (Fitneva & Christiansen, 2011;Smith & Yu, 2008;Yurovsky et al., 2012). Thus, we can infer that DLD has an important impact on the capacity of children to retain word-object pairs. ...
Article
Children with Developmental Language Disorder (DLD) need more exposures to learn new words in an unambiguous context compared to children with typical development (TD). However, it remains unclear whether they would be able to learn new words by extracting frequencies over multiple word-object encounters in ambiguous situations. The present study examines this question through a cross-situational statistical-learning task (CSSL). Thirty-eight school-aged children with DLD and thirty-eight age/sex-matched TD children completed a CSSL eye-tracking experiment. Participants' responses show that children with DLD had significantly poorer accuracy compared to TD children. However, both groups performed above chance. While the eye-tracking record evidenced no distinctive pattern between groups as children learnt new words, we observed a larger target visual preference in TD children when they were asked to find the referent for those new words. We discuss these findings in light of existing accounts for memory and language deficits in DLD. ARTICLE HISTORY
... Thus, the goal of this study is to clarify further the relationship between statistical learning and reading by examining visual statistical learning and orthographic awareness in Hong Kong Chinese children with developmental dyslexia and their typically developing peers. Statistical learning is regarded as a powerful mechanism that enables infants and young children to acquire various aspects of spoken language, including word segmentation (e.g., Saffran, Aslin, & Newport, 1996), phoneme learning (e.g., Maye, Werker, & Gerken, 2002), word learning (e.g., Yurovsky, Yu, & Smith, 2012), and syntactic development (e.g., Thompson & Newport, 2007). In recent years, a growing number of studies suggest that both child and adult learners are sensitive to statistical structures of orthographic inputs in alphabetic languages and that statistical learning plays a role in reading and writing development (e.g., Frost, Siegelman, Narkiss, & Afek, 2013;Kaefer, 2009;Samara & Caravolas, 2014;Treiman & Kessler, 2006). ...
... Four hypotheses might explain these patterns. First, these findings may relate to input the participants received during the intervention and generalization sessions (Romberg & Saffran, 2010;Yurovsky et al., 2012). Although the requirement was not explicitly stated in the intervention protocol, perhaps the linguistic structures that the trained clinicians targeted during the sessions followed a developmental trajectory. ...
Article
Full-text available
This study aimed to detect patterns in clause construction structural changes produced by four participants aged 9;5–13;7 (years;months) with motor speech disorders who used speech-generating devices. Sequences of adult–child interactions, drawn from the data of a larger study focused on enhancing vocabulary and grammar skills, were examined. This current study comprises a secondary analysis of a corpus of 29 conversations totalling 808.36 min, analysing clause structures by type, linguistic complexity, and intensity of adult prompts (number of turns). Results show that, over time, the participants’ clause structure complexity increased through addition of phrase-internal elements such as inflections, articles, and prepositions. Use of specific grammatical elements followed the developmental stages observed in children with typical development. For all participants, the personal pronoun I (first-person singular) emerged before she, he (third-person singular), and we or they (plural). Participants with the highest number of adult–child co-constructed clauses also had the highest number of well-formed clauses. The intensity of adult prompts increased as clause structures became more complex and as participants needed more support. Implications for practice and theory are discussed.
Article
Learning a language with complex morphology poses a challenge to language learners, especially adults, who may need to acquire unfamiliar grammatical categories. One possible advantage to languages with complex morphology is that the morphology could provide cues to word meaning. The hypothesis that morphology can bootstrap adult word learning is tested across four cross-situational word learning experiments. Adult learners were exposed to words from a novel language with CVCV stems and -CV suffixes. In the Experimental conditions, the suffixes consistently mapped to semantic categories (e.g. [-ke] for fruits). In the Control condition, the suffixes did not provide any consistent semantic information. Participants in the Experimental conditions outperformed participants in the Control conditions, but only when there were sufficient opportunities to infer the morphology in the initial learning phases. These results highlight adults’ ability to rapidly learn novel morphological information, and use this information in word learning.
Article
Full-text available
Humans communicate with small children in unusual and highly conspicuous ways (child-directed communication (CDC)), which enhance social bonding and facilitate language acquisition. CDC-like inputs are also reported for some vocally learning animals, suggesting similar functions in facilitating communicative competence. However, adult great apes, our closest living relatives, rarely signal to their infants, implicating communication surrounding the infant as the main input for infant great apes and early humans. Given cross-cultural variation in the amount and structure of CDC, we suggest that child-surrounding communication (CSC) provides essential compensatory input when CDC is less prevalent—a paramount topic for future studies.
Article
Young children learn language at an incredible rate. Although children come prepared with powerful statistical-learning mechanisms, the statistics they encounter are also prepared for them: Children learn from caregivers motivated to communicate with them. How precisely do parents tune their speech to their children's individual language knowledge? To answer this question, we asked parent-child pairs (N = 41) to play a reference game in which the parents' goal was to guide their child to select a target animal from a set of three. Parents fine-tuned their referring expressions to their children's knowledge at the lexical level, producing more informative references for animals they thought their children did not know. Further, parents learned about their children's knowledge over the course of the game and tuned their referring expressions accordingly. Child-directed speech may thus support children's learning not because it is uniformly simplified but because it is tuned to individual children's language development.
Article
The ability to extract reliable segments from the sensory data-stream is necessary for learning and for creating a reliable representation of the environment. Here we focused on segmentation in visual learning associated with foraging. Fifty-one pigeons (Columba livia) were trained to recognize two food-predicting signals, one presented against fixed background elements and the other against varying backgrounds. The pigeons were divided into two groups: 32 pigeons were trained with short intervals of 3-5 minutes between sessions and 19 with long intervals of 60 minutes between sessions. When tested on the signal with a plain background, 24 pigeons trained with the short time intervals preferred the signal presented against a varying background during training and eight preferred the signal presented against a fixed background during training. Nine pigeons trained with the long intervals preferred the signal presented against a varying background during training and ten the signal presented against a fixed background during training. Our findings indicate that pigeons possess an ability to use variation-set structure to segment visual data. To date, the effect of variation-set structure has been demonstrated only in the context of human language acquisition. Our current findings suggest that variation sets structure may be linked to other learning abilities too.
Conference Paper
Full-text available
As Voice is the most suitable form of communication, voice-based applications are playing a vital role in modern technology for the last few decades. It is not only the trend of modern and efficient technology but also a new shift of information and technology paradigm. Several research works have been completed on voice-based applications because it has more practical application than any other form of communication. For almost every application, voice signals need to be preprocessed before using it as the input signal. Preprocessing of any data improves the performance of applications. A preprocessing method for feature extraction of continuous Bangla voice has been proposed in this paper. In this method the signal is preprocessed in several steps. In the first step, the noise has been reduced from the signal. Then after balancing the frequency applying pre-emphasis, framing is applied which splits the whole signal into some frames. After that a hamming window and normalization are applied to improve the spectrum and SNR of the preprocessed signal. By following these steps, a clear voice signal, free from noise and unnecessary frequencies has been retrieved.
Article
This study investigates how the fine-grained phonetic realization of tonal cues impacts speech segmentation when the cues signal the same word boundary in the native and unfamiliar languages but do so differently. Korean listeners use the phrase-final high (H) tone and the phrase-initial low (L) tone to segment speech into words (Kim, Broersma, & Cho, 2012; Kim & Cho, 2009), but it is unclear how the alignment of the phrase-final H tone and the scaling of the phrase-initial L tone modulate their speech segmentation. Korean listeners completed three artificial-language (AL) tasks (within-subject): (a) one AL without tonal cues; (b) one AL with later-aligned phrase-final H cues (non-Korean-like); and (c) one AL with earlier-aligned phrase-final H cues (Korean-like). Three groups of Korean listeners heard (b) and (c) in three phrase-initial L scaling conditions (between-subject): high (non-Korean-like), mid (non-Korean-like), or low (Korean-like). Korean listeners’ segmentation improved as the L tone was lowered, and (b) enhanced segmentation more than (c) in the high- and mid-scaling conditions. We propose that Korean listeners tune in to low-level cues (the greater H-to-L slope in [b]) that conform to the Korean intonational grammar when the phrase-initial L tone is not canonical phonologically.
Article
Full-text available
This study examined European American and Hispanic American mothers' multimodal communication to their infants (N= 24). The infants were from three age groups representing three levels of lexical-mapping development: prelexical (5 to 8 months), early-lexical (9 to 17 months), and advanced-lexical (21 to 30 months). Mothers taught their infants four target (novel) words by using distinct objects during a semistructured play episode. Recent research suggests that young infants rely on temporal synchrony to learn syllable–object relations, but later, the role of synchrony diminishes. Thus, mothers' target and nontarget naming were coded for synchrony and other communication styles. The results indicated that mothers used target words more often than nontarget words in synchrony with object motion and sometimes touch. Thus, ‘multimodal motherese’ likely highlights target word-referent relations for infants. Further, mothers tailored their communication to infants' level of lexical-mapping development. Mothers of prelexical infants used target words in synchrony with object motion more often than mothers of early- and advanced-lexical infants. Mothers' decreasing use of synchrony across age parallels infants' decreasing reliance on synchrony, suggesting a dynamical and reciprocal environment–organismic relation.
Article
Full-text available
All computational models of word learning solve the problem of referential ambiguity by integrating information across naming events. This solution is supported by a wealth of empirical evidence from both adults and young children. However, these studies have recently been challenged by new data suggesting that human word learning mechanisms do not scale up to the ambiguity of real naming events. We replicate these experiments, collecting natural naming events both from a tripod-mounted camera and from a head-mounted camera that produced a "child's-eye" view. Although individual naming events were equally ambiguous from both views, significant learning across events occurred only from the child's own view. Thus, statistical word learning scales, but only from the right perspective.
Article
Full-text available
The process of word segmentation is flexible, with many strategies potentially available to learners. This experiment explores how segmentation cues interact, and whether successful resolution of cue competition is related to general executive functioning. Participants listened to artificial speech streams that contained both statistical and pause-defined cues to word boundaries. When these cues ‘collide’ (indicating different locations for word boundaries), cue strength appears to dictate the predominant parsing strategy. When cues are relatively equal in strength, the ability to successfully deploy a segmentation strategy significantly correlates with stronger performance on the Simon task, a non-linguistic cognitive task typically thought to involve executive processes such as inhibitory control and selective attention. These results suggest that general information processing strategies may play a role in solving one of the early challenges for language learners.
Article
Some of the things that adults learn about language, and about the world, are very specific, whereas others are more abstract or rulelike. This article reviews evidence showing that infants, too, can very rapidly acquire both specific and abstract information, and considers the mechanisms that infants might use in doing so.
Article
Two fundamentally different strategies may be employed by very young children learning their first language. The basic assumptions underlying the study of children's language development, however, have provided means for dealing with only one of these strategies: that which proceeds from the parts to the whole (Analytic). This paper reports on a child who evidently proceeded from the whole to the parts (Gestalt) in producing much of his early language. Since further evidence for a Gestalt strategy exists in the literature, albeit implicitly, such a strategy is probably quite widespread, and any theory of language or language acquisition needs to be able to account for it. It is also speculated that there may be neurological bases for the different language learning strategies.
Article
A discussion of English native-language vocabulary acquisition in children takes a closer look at the assumption that vocabulary is learned by common association of word with event, focusing on the acquisition of verb meanings. The intuitive power of the view that words are learned by noticing real-world contingencies for their use is acknowledged, but it is pointed out that such mapping, unaided, is in principle insufficiently constrained to explain how the child maps verbs (as phonological objects) with their meanings. The solution offered is that semantically relevant information in the syntactic structures can rescue observational learning from experiential pitfalls. Evidence is offered that children deduce meanings from their knowledge of structural-semantic relations. Limitations in data, need for further information about cross-linguistic correspondences, and problems occurring in the analysis are briefly addressed. A 59-item bibliography is included. (MSE)