Using automated acoustic analysis to explore the link between
planning and articulation in second language speech production
Matthew Goldrick (a)
Yosi Shrem (b)
Oriana Kilbourn-Ceron (a)
Cristina Baus (c)
Joseph Keshet (b)
(a) Northwestern University
(b) Bar-Ilan University
(c) University Pompeu-Fabra
ARTICLE HISTORY
Compiled July 28, 2020
ABSTRACT
Speakers learning a second language show systematic differences from native speak-
ers in the retrieval, planning, and articulation of speech. A key challenge in examin-
ing the interrelationship between these differences at various stages of production is
the need for manual annotation of fine-grained properties of speech. We introduce a
new method for automatically analyzing voice onset time (VOT), a key phonetic fea-
ture indexing differences in sound systems cross-linguistically. In contrast to previous
approaches, our method allows reliable measurement of prevoicing, a dimension of
VOT variation used by many languages. Analysis of VOTs, word durations, and re-
action times from German-speaking learners of Spanish (Baus et al., 2013) suggest
that while there are links between the factors impacting planning and articulation,
these two processes also exhibit some degree of independence. We discuss the im-
plications of these findings for theories of speech production and future research in
bilingual language processing.
KEYWORDS
Bilingualism, automatic acoustic analysis, voice onset time
1. Introduction
In the early stages of acquiring a new language, adults can face significant challenges
when producing speech. Due to competition between language-specific representations
(Green, 1998; Kroll et al., 2008) and/or decreased experience in retrieval of language-
specific structures (Gollan et al., 2011), second language speakers[1] show difficulties in the retrieval and planning of speech (e.g., in picture naming tasks, second language speakers show longer response times than monolinguals; Gollan et al., 2008, 2011; Ivanova and Costa, 2008). These speakers also exhibit differences in articulation, producing speech at a slower rate (Guion et al., 2000) and producing speech with an “accent” (where the phonetic properties of their productions differ from those of native speakers; Flege, 1991).

[1] We emphasize that our study deals specifically with speakers who are learning a new language in adulthood in an academic context, which is only a subset of the many contexts of bilingual experience. We also acknowledge and challenge the privileged positioning of speech from “native speakers,” and the concomitant marginalization of speakers who deviate from that idealized norm (Aneja, 2016). Our analysis of differences between speakers is intended to elucidate bilingual language processing, not to make value judgments about differences between speech from L1 vs. L2 speakers.

Contact: matt-goldrick@northwestern.edu
Do these effects at multiple levels of language processing arise from a common source?
Examining this issue requires that researchers simultaneously examine measures of
retrieval (e.g., accuracy, reaction time) and phonetics (e.g., durations and spectral
properties of speech sounds) on a by-trial level to gain a picture of how different aspects
of processing are impacted within the same individual under a specific set of conditions.
Such work is hampered by the need for manual annotation of phonetic properties. The
substantial training and time required for accurate annotation limits the number of
observations that can be studied. This is clearly an issue; there is substantial variation in
processing across different second language speakers as well as variation in how speakers
respond to different items and contexts.
In recent years, there has been substantial progress in the accuracy of automated
measurement of phonetic properties of English speech sounds (e.g., Adi et al., 2016;
Goldrick et al., 2016; Sonderegger and Keshet, 2012). However, there has been far
less work on automated measurement across multiple languages – an advance that is
definitionally necessary to investigate bilingual speech production (see Schillingmann
et al., 2018, for a recent exception).
In this work, we propose a new machine learning algorithm for automated measurement
of voice onset time (VOT), an acoustic dimension that serves as an important distinction
between speech sounds across many languages. This algorithm improves on current
state-of-the-art techniques by allowing for accurate measurement of negative VOT, a
type that serves to distinguish speech sound categories in many languages. We use this
algorithm to measure VOT in a subset of speech data from Baus et al. (2013), examining
picture naming in German and Spanish by native German speakers learning Spanish
(along with a set of native Spanish controls). We use these data to assess how difficulties
in lexical retrieval and planning relate to articulation difficulties in non-native speech.
The results reveal key dissociations between effects at these levels of processing. This
shows the potential of automated phonetic analysis to provide objective, replicable data
to constrain theories of language processing.
We begin by reviewing previous work investigating the relationship between difficulties
in speech planning and articulation in second language speakers. We then introduce our
new algorithm, providing a description of its principles and implementation. In section
4, we examine multiple measures of speech in Baus et al. (2013): reaction time, word
duration, and VOT. We then conclude by discussing the implications of these analyses
and our general framework.
2. Links between difficulties in second language speech planning and
articulation
2.1. Evidence from word duration
An extensive body of work has found that second language (L2) speech has longer
durations than monolingual, native (L1) speech (Baker et al., 2011; Guion et al., 2000;
Gustafson and Goldrick, 2018; MacKay and Flege, 2004; Sadat et al., 2012). Do such
effects arise due to difficulties in planning and retrieval of words?
Sadat et al. (2012) examined Spanish picture naming with bare noun (e.g., ‘avión’, Eng. airplane) and noun phrase responses (‘el avión rojo’, Eng. the red airplane). In
noun phrase production, adult Spanish-Catalan bilinguals (Spanish dominant) showed
both longer reaction times (RTs) and longer speech durations than Spanish monolinguals,
suggesting effects in planning and retrieval are linked. However, these group effects on
articulatory durations were larger in noun phrase vs. bare noun naming, whereas group
effects in reaction times were relatively constant. This divergence provides some support
for contrasting effects in latencies and phonetic measures (Sadat et al., 2012, suggest
this may reflect a more incremental planning scope in the noun phrase vs. bare noun
naming conditions).
Other work has found more substantial divergence, where difficulties in retrieval do not
result in changes to speech duration. Duñabeitia and Costa (2015) examined speakers’
productions of true and false descriptions of a picture in their native, dominant language
Spanish and non-native English (e.g., ‘I see a white sheep with four legs’ when the
picture is of a yellow bird with two legs). Consistent with Sadat et al. (2012), both RTs
and description durations were longer in non-native vs. native productions. However,
producing false statements, a task previously shown to tax cognitive resources, resulted
in increased RTs but had no significant effect on durations.
A key limitation of these previous studies is that they do not directly examine the
relationship between RT and duration on individual trials. Such a comparison provides
a stronger test of the (in)dependence of these two measures. If effects in duration
and RT exclusively arise from a single source, then all articulatory effects should be
highly predicted by variation in RT (Buz and Jaeger, 2016). Several recent studies
of monolingual English productions have found this not to be the case; variation in
articulatory durations does not reduce to variation in RT (Buz and Jaeger, 2016; Fink
et al., 2018; Goldrick et al., 2018). Consistent with these findings, Gustafson and Goldrick
(2018) found that adults’ difficulties in L2 articulation do not reduce to difficulties in L2
retrieval. They examined reaction times in English descriptions of simple events (e.g.,
‘the kite’ in ‘The kite rotates’) along with acoustic durations of nouns and preceding
determiners. Adult Mandarin speakers learning English showed longer reaction times and
durations than English monolinguals. Critically, analyses of duration included by-trial
reaction time. While there was a significant positive relationship between the measures
(longer RTs predicted longer durations), there was an independent effect of group on
durations. This suggests that while there is a coupling between retrieval and articulation
processes, there are also independent effects that arise within each processing stage.
2.2. Evidence from sublexical phonetic measurements
When different languages realize a similar sound contrast in phonetically distinct ways,
bilingual speakers’ productions reflect a blend of the properties of these two languages
(see Fowler et al., 2008 for a review). Here, we focus on the contrast between word-initial
3
voiced stops (/b/, /d/, /g/) and voiceless stops (/p/, /t/, /k/). An important phonetic
dimension distinguishing these sounds in this position is voice onset time (VOT), the
length of time between the release of the stop’s constriction and the beginning of vocal fold
vibration. Different languages utilize this phonetic dimension in distinct ways (Lisker and
Abramson, 1964). In some languages (e.g., German, English), initial voiced consonants
are associated with a short positive lag between constriction release and voicing; voiceless
consonants have a longer lag (Jessen and Ringen, 2002). In other languages (e.g., Spanish,
French), periodic vocal fold vibration begins before the release of voiced consonants
(producing prevoicing or negative VOTs); voiceless sounds show short positive lags (Flege,
1991). When adults are learning a language that has a different VOT system from their
L1, this conflict results in non-native accents – systematic deviations away from the
target language and towards the speaker’s other language (e.g., for English learners
of Spanish, less prevoicing for voiced stops and longer positive lags for voiceless stops;
Flege, 1991).
Several studies have examined how the phonetic interaction between two languages is
modulated in difficult processing contexts; specifically, mixing and/or switching between
the two languages. These are well known to increase reaction times (Gollan and Ferreira,
2009; Meuter and Allport, 1999). This work has found that non-native accents increase
in mixed-language contexts (e.g., contexts in which both German and Spanish words are
produced) compared with single-language contexts (e.g., contexts where only German is
produced; Antoniou et al., 2011; Bullock et al., 2006; Flege, 1991). Inconsistent results
have been reported in studies comparing the phonetic properties of switched items,
where participants change the language of production from a preceding word or trial,
relative to stay trials, where participants maintain the same language from trial to trial
within a mixed language context. When reading aloud sentences that contain switches
(which participants know in advance), some studies have found increased non-native
accents (Bullock et al., 2006; Šimáčková and Podlipský, 2015), whereas others have reported no systematic effects, reduction in non-native accents, or consistent changes across all segments (e.g., VOTs are lengthened in all languages; Amengual et al., 2019; Grosjean and Miller, 1994; López, 2012; Šimáčková and Podlipský, 2018). Other studies have found increased non-native accents when participants are forced to unexpectedly switch in picture naming (Goldrick et al., 2014; Olson, 2013; Tsui et al., 2019) and in spontaneous codeswitching (Balukas and Koops, 2015; Fricke et al., 2016; but cf. Piccinini and Arvaniti, 2015).
While the studies above examine contexts where longer RTs have been reported in
other studies, none of these phonetic studies have actually measured RTs within their
experiments. Jacobs et al. (2016) is the lone exception, measuring all 3 of the dependent
measures discussed here (n.b. without by-trial analysis of RT-phonetic relationships).
Jacobs et al. (2016) examined Spanish word naming by three groups of Spanish-English
bilinguals: intermediate-level Spanish learners, studying in classroom-based courses in an
English-dominant environment; advanced learners in the same context; and intermediate-
level Spanish learners participating in a Spanish immersion program. Measures of RT,
word duration, and VOT were gathered for cognates (translation equivalents sharing
form and meaning; e.g., perfecto) and non-cognates beginning with voiceless stops (e.g.,
percance, Eng. mishap). All groups showed significant cognate facilitation in reaction
times; however, only intermediate classroom learners showed significant phonetic effects.
While their word duration effects paralleled reaction time (shorter word durations for
cognates), their VOTs showed an opposite pattern (with a significant interaction between
cognate status and speaker group). Cognates exhibited a higher degree of non-native
4
properties. They were produced with longer VOTs than non-cognates, consistent with
the non-target language, English. Jacobs et al. (2016) suggest that while cross-language
activation may facilitate retrieval of similar phonological targets, it hampers accurate
production, yielding articulations blending properties of the phonetic systems of the two
languages (see Muscalu and Smiley, 2019, for related results in typing).
2.3. The current study
This brief review suggests that while there is some consistency in findings, there is a
good deal of variability in the effects observed across experiments, particularly when
considering fine-grained phonetic measures such as VOT. As suggested by Jacobs
et al. (2016), part of this variability may reflect differences in the bilingual populations
examined in each study. However, there may be less theoretically interesting reasons
behind the variance in results. Many of the studies above rely on relatively small samples
of participants and items, which will necessarily lead to more variability across studies.
Another potential source of variability is differences in phonetic annotation practices by annotators working across different labs. For example, producing prevoicing requires
speakers to maintain vocal fold vibration during closure; this aerodynamically challenging
articulation leads to widespread variation in the amplitude of voicing (Abramson and
Whalen, 2017). The difficulty of reliably detecting and consistently annotating such weak signals may also contribute to variable results.
To address issues with the reliability and sample size of phonetic measurements,
we propose an automated method for multilingual annotation of VOT. This machine
learning algorithm is trained on existing datasets from multiple labs, improving on
previous work by providing a consensus measurement of both positive and negative VOT.
In contrast to subjective measurements by particular human annotators, this allows for
objective, replicable measurement of acoustic properties of speech. Additionally, this
algorithm can drastically reduce the time required for phonetic measurement, allowing
for studies with appropriate levels of statistical power to be conducted. For example,
Goldrick et al. (2016) used automatic phonetic measurements to rapidly collect data
on the phonetics of speech errors. This single experiment produced more data than
existed in the entire extant literature. Here, we combine automatic analysis of VOT
with automatic measures of word duration and reaction time to create a fully objective,
replicable, automatic analysis pipeline of speech production data.
3. Automated analysis of negative and positive voice onset time
Recently, we have developed systems for automatic measurements of VOT based on kernel-
based machine learning methods and deep learning (Adi et al., 2017, 2016; Sonderegger
and Keshet, 2012). While these have met with good success in measuring positive-lag
VOT data from English, they fail to accurately measure negative VOTs. This blocks their
application to languages like Spanish, which utilize prevoicing contrastively, as well as
application to the full range of acoustic environments for English stops. Additionally, we
have observed less-than-ideal levels of generalization to datasets with different sampling
properties (speaker types, recording types, etc.).
The novel approach, introduced below, addresses these issues while maintaining a
high level of accuracy in measurement. Note that the technical part of the model along
with an assessment of its performance on annotated corpora appears in a companion
machine-learning paper (Shrem et al., 2019). Here, we focus on providing an overview of
5
the system’s structure for linguists and other cognitive scientists.
3.1. Problem statement and notation
VOT measurement belongs to the family of segmentation problems: we are provided with a series of time-aligned samples of arbitrary length and are required to predict the boundaries of every segment. In the case of positive lag, where voicing follows the burst (i.e., the acoustic signature of the release of the consonant), the VOT is the time between the onset of the burst and the onset of the vowel. When voicing precedes the burst, the VOT is negative, and we measure the time between the onset of the prevoicing and the onset of the burst. A common approach to a segmentation task is to locate the frames at which new segments start. We formulate the setting as follows:
Let $\bar{x} = (x_1, \ldots, x_T)$ denote the input speech utterance of length $T$, represented as a sequence of acoustic features. Each utterance is associated with a label $\bar{y} = (t_{pv}, t_b, t_v)$ representing the frame indices at which the prevoicing, burst, and vowel start. Hence, $\{t_{pv}, t_b, t_v\} \in [1, \ldots, T]$, s.t. $t_{pv} < t_b < t_v$. For completeness, when the voicing follows the burst (i.e., there is no prevoicing), we set $t_{pv}$ to $-1$. Overall, $VOT_{pos}$ is $t_v - t_b$ and $VOT_{neg}$ is $t_b - t_{pv}$.
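As an illustration of this notation (a minimal sketch, not the released implementation), the following converts a predicted boundary triple into a signed VOT; the 1 msec frame step and the helper name are assumptions made for the example.

```python
# A minimal sketch, assuming 1 msec frames and the -1 "no prevoicing" sentinel
# described above; not the released Dr.VOT implementation.
def vot_from_boundaries(t_pv: int, t_b: int, t_v: int, frame_ms: float = 1.0) -> float:
    """Return the signed VOT in msec from predicted boundary frame indices."""
    if t_pv < 0:                          # no prevoicing: positive lag, burst to vowel
        return (t_v - t_b) * frame_ms     # VOT_pos = t_v - t_b
    return -(t_b - t_pv) * frame_ms       # VOT_neg = t_b - t_pv, reported as a negative lag

# Example: prevoicing at frame 40 and burst at frame 95 yields a VOT of -55 msec.
print(vot_from_boundaries(t_pv=40, t_b=95, t_v=110))  # -55.0
```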
3.2. Model architecture
The architecture is depicted in Fig. 1. We walk through the major components of the
model in turn, starting with the recurrent neural network (RNN). This is a deep neural
network architecture that can model the behavior of dynamic temporal sequences using
an internal state, which can be thought of as a memory that allows the network to be
sensitive to information distributed across time (Hochreiter and Schmidhuber, 1997).
In the context of segmentation, an RNN provides the ability to leverage insights from
previous frames of acoustic features while processing the current frame. Previous studies
have demonstrated the capabilities of such architectures in multiple temporal domains
such as natural language processing (Mikolov et al., 2010) and speech (Graves et al.,
2013). A Bidirectional RNN (BiRNN) (Schuster and Paliwal, 1997) is a model composed
of two RNNs: the first is a standard RNN while the second reads the input backwards.
Such a model can predict the current frame based on both past and future frames.
Using this architecture (shown in Fig. 1, center) allows our model to capture the exact
boundaries between the segments with higher precision.
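To make the role of the bidirectional encoder concrete, here is a minimal PyTorch sketch of a BiRNN over a sequence of acoustic feature frames; the feature and hidden dimensions are illustrative assumptions, not the published configuration.

```python
# Schematic sketch of a bidirectional recurrent encoder; dimensions are illustrative.
import torch
import torch.nn as nn

class BiRNNEncoder(nn.Module):
    def __init__(self, n_features: int = 63, hidden: int = 100):
        super().__init__()
        # Bidirectional LSTM: one RNN reads the frames forward, one backward,
        # so each frame's representation reflects both past and future context.
        self.rnn = nn.LSTM(n_features, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):              # x: (batch, T, n_features)
        states, _ = self.rnn(x)        # states: (batch, T, 2 * hidden)
        return states                  # per-frame acoustic representation

# Example: a 300-frame utterance with 63 acoustic features per frame.
frames = torch.randn(1, 300, 63)
representation = BiRNNEncoder()(frames)   # shape: (1, 300, 200)
```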
3.2.1. Multi Task Learning
As noted above, current algorithms do not perform well when attempting to measure
negative VOT. Our approach to this is based on recognizing the fundamentally different
structure of utterances with positive and negative VOT. Unlike a positive VOT utterance, where no voicing precedes the burst, in a negative VOT utterance the burst follows the onset of voicing.
To tackle this fundamental change, we used a technique called Multi-Task-Learning
(MTL) (Caruana, 1997; Collobert et al., 2011). This enhances performance on the target
task (e.g., measurement of VOT) by leveraging information from a related auxiliary
task that utilizes a similar set of properties (e.g., acoustic features). Here, we use VOT
classification as the auxiliary task – the model explicitly predicts the probability of
whether the VOT is positive or negative, directly encoding the fundamental change in
the structure of the utterance (see Fig. 1, right). We then use this information to allow
the model to draw on different VOT prediction functions when processing negative vs. positive VOT. Specifically, based on the model's prediction for VOT classification for the entire sequence, we use a different set of parameters $w$ to predict the segmentation of the utterance:

$$w := \begin{cases} w_{pos} & \text{if } P(\text{pos} \mid \bar{x}) > P(\text{neg} \mid \bar{x}) \\ w_{neg} & \text{otherwise} \end{cases} \qquad (1)$$
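A minimal sketch of how Eq. (1) can be realized in code, assuming a per-utterance classifier over the pooled acoustic representation; the names (w_pos, w_neg, decode_boundaries) are hypothetical placeholders rather than the released implementation.

```python
import torch

def predict_segmentation(representation, classifier, w_pos, w_neg, decode_boundaries):
    """representation: (1, T, d) BiRNN output for a single utterance."""
    # Auxiliary task: classify the whole utterance as positive- or negative-VOT.
    logits = classifier(representation.mean(dim=1))        # (1, 2)
    p_pos, p_neg = torch.softmax(logits, dim=-1).squeeze(0)

    # Eq. (1): decode with w_pos when P(pos | x) > P(neg | x), otherwise w_neg.
    w = w_pos if p_pos > p_neg else w_neg
    return decode_boundaries(representation, w)
```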
3.2.2. Adversarial Training
As noted above, the lack of robust generalization to novel speech recordings is a key
limitation of existing work. Generalization is generally difficult for these machine learning
systems. During the training phase, a model will learn to map the properties of training
examples to the desired outcome. Typically, the particular distribution of acoustic
features in these training examples will cover only a subset of the entire space of
possible acoustic feature vectors. Similarly, the segmentation annotation conventions
used for these training examples may only provide information about a subset of the
possibly relevant cases for segmentation. If a new set of data lies outside these subsets,
performance will likely degrade.
To develop a model with corpus-invariant performance, we applied a technique
called adversarial training (Ganin et al., 2016; Goodfellow et al., 2014; Madry et al.,
2017). Unlike MTL, where an auxiliary task is used to enhance model performance, in
adversarial training we hinder model performance on a competing task. The idea is that
good performance on the target task is inversely related to good performance on the
competing task. Here, we use corpus classification as the competing task, reasoning that
good generalization on VOT measurement is inversely related to correctly classifying
the source of data. The acoustic representation (the output of the BiRNN) serves as
input to a classifier that attempts to find the corpus an utterance originated from
(see Fig. 1, left). We then train the model to minimize success at this adversarial task
using data from several corpora, gathered across different labs and annotators. This
procedure helps the model discover a clean acoustic representation, which does not
contain the characteristics associated with the originating corpus (similarly to Mor et al.,
2018) – while still preserving the task-relevant information in the signal. Results in
Shrem et al. (2019) suggest that this adversarial training procedure allows the model to
exhibit good generalization. Our complete model, incorporating the bidirectional RNN
acoustic representation (center), multi-task learning with VOT classification (right), and
adversarial training with corpus classification (left) is shown in Fig. 1. It can be accessed at https://github.com/MLSpeech/Dr.VOT.

Figure 1. Our model. A bidirectional RNN processes the utterance to create an acoustic representation and forwards a context-sensitive (‘memory-like’) state to a VOT classifier and an adversarial classifier. The classifier branch gives a probability for two broad classes of VOTs (positive and negative). We combine the classifier decision with the acoustic representation to find the VOT onset and offset. Simultaneously, an adversarial branch aims to identify the corpus the given utterance originated from. By hindering model performance on this adversarial task (shown by dotted lines), we improve generalization on the target task.
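One standard way to implement such an adversarial branch is the gradient reversal layer of Ganin et al. (2016); the sketch below is schematic, with hypothetical names, rather than the released Dr.VOT code.

```python
import torch
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lambda backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def combined_loss(representation, vot_loss, corpus_classifier, corpus_labels):
    # The corpus classifier learns to identify the source corpus, but the reversed
    # gradient pushes the encoder to discard corpus-specific characteristics.
    reversed_rep = GradReverse.apply(representation.mean(dim=1), 1.0)
    corpus_loss = F.cross_entropy(corpus_classifier(reversed_rep), corpus_labels)
    return vot_loss + corpus_loss
```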
4. Results
4.1. Dataset
To provide an initial test of the algorithm, we re-analyze a subset of the picture naming
data gathered by Baus et al. (2013). This study examined German learners of Spanish
(from multiple German universities) who attended the University of La Laguna (Tenerife,
Spain) for one semester as part of the ERASMUS program (n = 27; mean age = 22.8; 20 females). The goal was to examine how their Spanish and German were impacted by this immersion experience. Self-reported proficiency was obtained on a 10-point Likert scale (1: low proficiency, 10: native-like) both for German (L1; mean = 9.7, SD = 0.08) and Spanish (L2; mean = 5.08, SD = 1.7). A group of native Spanish speakers (undergraduate students at the same university) provided a baseline for comparison (n = 19; mean age = 21.1; 14 females).
German participants were tested at two time points during their six-month immersion. The first test was conducted during the first month after arriving at the university, and the second test was conducted during the last month before leaving. The Spanish group was tested in the same testing periods. Participants named different lists of pictures at each time point, with the picture lists counterbalanced across participants. Informed consent was obtained from all participants, and they were compensated for their time (10 €/hour). Data from participants who only participated at one time point were excluded from these analyses.
We extracted responses for stimuli beginning with voiced and voiceless initial stops
(see online supplementary materials for full list of items used in analyses). After excluding
errors, there were 2520 observations available for analysis.
Inspection of initial algorithm performance revealed that estimated VOTs less than or equal to an absolute duration of 5 msec were typically errors. Similarly, for both languages, VOTs for voiceless consonants that were estimated as prevoicing were almost invariably errors. These observations, along with extreme outliers (> 120 msec, < -140 msec), were therefore excluded. The remaining observations (N = 2163, 85.83% of total) were used for analysis.
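For concreteness, a pandas sketch of these exclusion criteria might look as follows; the column names are hypothetical, not those of the released OSF dataset.

```python
import pandas as pd

def apply_exclusions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop trials matching the exclusion criteria described above."""
    keep = (
        (df["vot_ms"].abs() > 5)                                   # |VOT| <= 5 msec: typically errors
        & ~((df["voicing"] == "voiceless") & (df["vot_ms"] < 0))   # prevoiced voiceless: almost always errors
        & (df["vot_ms"] <= 120)                                    # extreme positive outliers (> 120 msec)
        & (df["vot_ms"] >= -140)                                   # extreme negative outliers (< -140 msec)
    )
    return df[keep]
```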
4.2. Validation
To validate our automatic analyses, we compared the results to three previously estab-
lished findings. First, the means for German and Spanish native-language production showed
clear evidence of the contrasting VOT systems of the two languages (Jessen and Ringen,
2002; Lisker and Abramson, 1964). The German speakers contrasted long (voiceless;
mean = 59.5 msec) vs. short (voiced; mean = 14.8 msec) positive lag, while the Spanish
speakers contrasted short positive lag (voiceless; mean = 28.5 msec) vs. prevoiced (voiced;
mean = -55.2 msec).
Second, for native speaker productions, we replicated the standard pattern of variation
in voiceless VOTs as a function of the place in the oral cavity where airflow was restricted
(Cho and Ladefoged, 1999: German native labial mean = 49 msec, alveolar = 60 msec,
velar = 65.9 msec; Spanish native labial mean = 18.9 msec, alveolar = 25.5 msec, velar
= 35.8 msec).
Third, given the contrast between German and Spanish, it is expected that German
native speakers’ Spanish productions will show systematic deviations away from those
of native Spanish speakers, towards the properties of German productions. Consistent
with this, for voiceless stops in Spanish trials, German speakers had longer VOTs (mean
= 44.4 msec) than Spanish speakers (mean = 28.5 msec), and for voiced stops German
speakers producing Spanish had a lower proportion of prevoiced tokens (mean = 0.3)
than Spanish speakers (mean = 0.8).
4.3. Statistical methods
Anonymized phonetic and RT measurements, along with analysis scripts, can be accessed
at https://osf.io/eqhra/. Note that raw acoustic files are unavailable, as participants
did not consent to inclusion of their voice recordings in a public database.
Since these post-hoc analyses only consider a subset of the original data in Baus et al.
(2013), we begin by examining reaction times (RTs) to determine how retrieval of this
subset of German and Spanish words is impacted by cognate status and immersion. RTs
were extracted from each file based on the time of the first acoustic event (voicing for
prevoiced tokens, burst for positive VOT tokens).[2] Two linear mixed effects models of reaction time were constructed using the lme4 package, v. 1.1-21 (Bates et al., 2015b), for R, v. 3.6.0 (R Core Team, 2019). Models fitted to raw vs. log RT yielded similar
dispersion, so we utilized raw RT. We examined reaction times in Spanish for non-native
(German) and native (Spanish) speakers. The contrast-coded fixed effects included
exposure (before or after immersion), cognate status of the picture names (non-cognate
or cognate, as reported in Baus et al., 2013), L1 (native language) of the participant,
along with all interactions of these factors. We also analyzed German productions alone,
including cognate status and exposure as fixed effects.
The random effect structures of the final models were selected by first fitting with the
maximal identifiable random effect structure, including random intercepts for participants
and for picture names (i.e. Spanish tigre and German Tiger were treated as separate
items) as well as random slopes for all fixed effects varying by participant or by item
(respectively). As this did not converge for any model, the random effects were reduced
by removing correlations and then by iteratively removing random slopes terms by
complexity. If the model failed to converge with simple random slopes, these were
iteratively tested to determine if any single effect could be included and allow for
convergence. The maximal identifiable model was then critiqued using the procedure
from Bates et al. (2015a). If the random effects structure was too complex for the
available data, the model was again simplified by removing low-variance random effects
terms (discarding terms that did not significantly reduce model fit). The need for
correlations in the final model was then examined via likelihood ratio tests; correlations
that significantly improved fit were then included.
The parsimonious within-Spanish model of reaction times had random intercepts for participants and items, uncorrelated slopes for exposure and cognate status by speaker, and an uncorrelated slope of L1 by item. The within-German reaction time model had random intercepts for speakers and items.

[2] Similar results are found when RT is measured by the common acoustic landmark, burst onset. Note that all of these acoustic measures are necessarily limited in that they cannot capture pre-phonation articulatory movements (Holbrook et al., 2019).
We follow this with analysis of within-Spanish and within-German word durations. The onset of the word was set to the RT (as described above). We detected word offsets by combining the voice activity detection (VAD) algorithm of the standard audio editing library SoX (Bagwell and SoX Contributors, 2013) with a window-based search for the final frame of the word. Beginning at the VAD boundary, as well as at 100 and 200 msec following this boundary, we examined each preceding 10 msec window to see whether the signal intensity was at least 0.7 times the mean intensity of the signal. If multiple start points converged to a frame, we selected that frame as the word offset. If there was no consensus, the trial
to a frame, we selected that frame as the word offset. If there was no consensus, the trial
was discarded. Many of the discarded trials were for words with many phonemes and
long RT. Production of these words was cut off by a fixed 2 second recording window
for each trial. (Others included trials with high levels of background noise.)
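A simplified sketch of this consensus search is given below, assuming a per-millisecond intensity contour has already been computed and that the SoX VAD endpoint is supplied in milliseconds; the function and variable names are illustrative only.

```python
from typing import Optional
import numpy as np

def find_word_offset(intensity: np.ndarray, vad_end_ms: int, win_ms: int = 10) -> Optional[int]:
    """Search backwards from several start points for the last high-intensity window."""
    threshold = 0.7 * intensity.mean()
    candidates = []
    for start in (vad_end_ms, vad_end_ms + 100, vad_end_ms + 200):
        t = min(start, len(intensity))
        # Step back in 10 msec windows until a window reaches the intensity threshold.
        while t > win_ms and intensity[t - win_ms:t].mean() < threshold:
            t -= win_ms
        candidates.append(t)
    # Keep the trial only if the start points converge on the same frame (one reading
    # of the "multiple start points converged" criterion); otherwise discard it.
    return candidates[0] if len(set(candidates)) == 1 else None
```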
After trimming trials, as well as outliers (the lower and upper 2.5% of observations),
another 15.7% of observations were excluded; a total of 1823 observations were included
in the word duration analysis. Statistical models were constructed following the RT
analysis above, including phoneme length and its interaction with L1 background as
control factors. Models fitted to raw vs. log word duration yielded similar dispersion,
so we utilized raw word duration. The parsimonious random effects structure for the
Spanish analysis included intercepts and uncorrelated random slopes for cognate status
by participant, as well as random intercepts and uncorrelated random slopes for L1
background by item. The German analysis included random intercepts for speaker and
word.
The final two analyses were conducted to examine how the phonetic contrast between
languages changes over time. In the within-speaker analysis, we compare German speakers’
productions in their native (German) and nonnative (Spanish) languages. This allows us
to examine the ability of these adult bilinguals to successfully produce distinct phonetic
properties for their two languages, and to examine whether this distinction changes over
time for both cognates and non-cognates. In the within-language analysis, we compare
Spanish productions by nonnative German speakers and native Spanish speakers. This
allows us to examine how non-native speech deviates from native productions, and to
examine how any differences change over time for both cognates and non-cognates.
Each analysis was conducted using two separate models. For voiceless targets, we built
a linear mixed-effect model predicting VOT. Models fitted to raw vs. log VOT values
yielded similar dispersion, so we utilized raw VOT. For voiced sounds, we utilized a
logistic link function, as these stops are bimodally distributed between short-lag positive
VOT and prevoicing.[3] Fixed effects included exposure, cognate status, and either target language (within-speaker analysis) or L1 (within-language analysis). All interactions between these variables were also included. We also included a control factor for VOT
variation across place of articulation (see above for discussion). Velar consonants were
used as baseline, with labial and alveolar serving as treatment conditions. In all models,
these factors had significant effects in the expected direction; we omit discussion of these
factors to focus on the effects of interest.
The parsimonious random effects structure for the within-speaker analysis, voiceless
targets, included random intercepts for participants and items, and uncorrelated random
slopes for target language and exposure by speaker. Voiced targets included a random
intercept for speaker and a correlated random slope for target language by speaker. For
the within-language analysis, the voiceless targets model included a random intercept
for participants and items, a correlated random slope for exposure by speaker, and a correlated random slope for L1 by item. The voiced targets model in this analysis included a random intercept for participants, an uncorrelated random slope for cognate status by participant, and a random slope for exposure by word.

[3] Given the low number of tokens of prevoicing for L2 Spanish speakers, and the overall lower rate of agreement in the duration of prevoicing, this dataset does not provide sufficient power for analyses contrasting the amount of prevoicing across the two groups. This may be a promising area for future investigations.

Figure 2. By-participant mean RT for Spanish picture naming by German and Spanish speakers, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Error bars show bootstrapped 95% confidence intervals.
Significance for fixed-effects terms in the linear regressions was assessed using the lmerTest package (Kuznetsova et al., 2017) to calculate p-values for t-statistics using the Satterthwaite method for determining denominator degrees of freedom. For the logistic regressions, we use likelihood ratio tests.
4.4. Spanish reaction times, Native (Spanish) vs. Nonnative (German)
speakers
Figure 2 shows the reaction times in Spanish. Overall, German speakers were slower to retrieve targets than Spanish speakers (β = 0.158, s.e. = 0.036, t = 4.38, p < .001), and noncognates were more slowly retrieved than cognates (β = 0.060, s.e. = 0.029, t = 2.05, p = 0.05). These two factors interacted, such that German speakers showed a greater slowdown for accessing noncognates than cognates (β = 0.16, s.e. = 0.042, t = 3.76, p < .001). However, these factors did not interact with immersion. There was no main effect of exposure (β = 0.002, s.e. = 0.017, t = 0.1, p > 0.05), the overall slowdown of German speakers did not significantly change after immersion in the Spanish-dominant environment (β = -0.033, s.e. = 0.035, t = -0.93, p > 0.05), and the overall cognate effect did not change following exposure (β = -0.008, s.e. = 0.026, t = -0.3, p > 0.05). The three-way interaction also failed to reach significance (β = -0.008, s.e. = 0.026, t = -0.3, p > 0.05).
Figure 3.
By-participant mean RT for German picture naming by German speakers, separated by cognates
vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Error bars show bootstrapped
95% confidence intervals.
4.5. German reaction times, native German speakers
Figure 3 shows the reaction times in German. Overall, noncognates were retrieved significantly more slowly than cognates (β = 0.063, s.e. = 0.028, t = 2.22, p = 0.03). There was no main effect of exposure (β = 0.01, s.e. = 0.016, t = 0.62, p > 0.05), and no interaction (β = -0.006, s.e. = 0.031, t = -0.2, p > 0.05).
Baus et al. (2013) also examined effects of target word frequency on RT in German
and Spanish naming by native speakers. We repeated these analyses on the subset of
their data analyzed here and found no significant effects of lexical frequency in German
(given the ubiquity of such effects in picture naming tasks, we suspect this is likely a
power issue). For the Spanish speakers, after exclusion of an outlier item showing a large
slow-down from pre- to post-test (‘dardo’), no significant frequency effects were found.
4.6. Spanish word durations, Native (Spanish) vs. Nonnative (German)
speakers
Figure 4 shows word durations in Spanish. As expected, durations were longer for words with more phonemes (β = 0.042, s.e. = 0.003, t = 12.87, p < .001). This effect was not significantly different across native and non-native speakers (β = -0.001, s.e. = 0.004, t = -0.24, p > 0.05). Durations were shorter for words with longer reaction times (β = -0.247, s.e. = 0.018, t = -13.57, p < .001), a trade-off observed in some previous studies of picture naming (Fink et al., 2018). Over and above these length and reaction time effects, German speakers produced longer word durations in non-native Spanish than native Spanish speakers (β = 0.097, s.e. = 0.021, t = 4.53, p < .001). In contrast to the RT results, the main effect of cognate status was not significant (β = 0.018, s.e. = 0.01), and neither was the interaction of cognate status and L1 (β = 0.005, s.e. = 0.013, t = 0.42, p > 0.05). No other main effects reached significance (exposure: β = 0.015, s.e. = 0.011, t = 1.41, p > 0.05), nor did any interactions (exposure x cognate: β = 0.011, s.e. = 0.012, t = 0.9, p > 0.05; exposure x L1: β = -0.005, s.e. = 0.021, t = -0.24, p > 0.05; exposure x L1 x cognate: β = 0.007, s.e. = 0.024, t = 0.29, p > 0.05).

Figure 4. Word duration by RT in Spanish picture naming by German and Spanish speakers, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). The shaded region shows standard error of the linear regression lines.
4.7. German word durations, native German speakers
Figure 5 shows word durations in German. As expected, durations were longer for words with more phonemes (β = 0.038, s.e. = 0.005, t = 7.98, p < .001). As with Spanish productions, durations were shorter for words with longer reaction times (β = -0.34, s.e. = 0.028, t = -12.29, p < .001). Over and above these length and reaction time effects, there were no significant effects on German durations. In contrast to the RT analyses, there was no main effect of cognate status (β = 0.011, s.e. = 0.015, t = 0.71, p > 0.05). Exposure also failed to reach significance (β = -0.009, s.e. = 0.009, t = -0.95, p > 0.05), and failed to interact with cognate status (β = 0.002, s.e. = 0.018, t = 0.12, p > 0.05).
4.8. German speakers, Native vs. Nonnative Productions
In the first analysis of the sub-lexical phonetic properties of productions, we compare Ger-
man speakers’ productions in their native (German) and nonnative (Spanish) languages.
This allows us to examine the ability of these bilinguals to successfully produce distinct
phonetic properties for their two languages, and to examine whether this distinction
changes over time for both cognates and non-cognates.
Figure 5. Word duration by RT in German picture naming by German speakers, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). The shaded region shows standard error of the linear regression lines.

Figure 6 shows the results for voiceless stops. Speakers successfully distinguished German and Spanish phonetic properties, producing shorter VOTs for Spanish than German. This was reflected by a significant main effect of target language (β = 13.7, s.e. = 3, t = 4.62, p < .001). In contrast to the RT results, the main effect of cognate status was not significant (β = 1.3, s.e. = 2.2, t = 0.56, p > 0.05). The other main effect also failed to reach significance (exposure: β = 0.8, s.e. = 1.6, t = 0.51, p > 0.05). Critically, there were no interactions. The difference between target languages was not significantly different after immersion in the Spanish-dominant environment (β = -1.8, s.e. = 2.4, t = -0.75, p > 0.05). The target language difference was not significantly different for cognates vs. non-cognates (β = -8.8, s.e. = 4.4, t = -2, p = 0.05), and this difference did not vary across time (β = 3.6, s.e. = 4.7, t = 0.77, p > 0.05). The overall cognate effect also failed to interact with time (β = 0.3, s.e. = 2.4, t = 0.15, p > 0.05).
Figure 7 shows the results for voiced stops. Speakers were not able to reliably distinguish German and Spanish phonetic properties; prevoicing rates were not significantly different in Spanish vs. German (main effect of target language: β = -0.9, s.e. = 0.5, χ²(1) = 2.9, p > 0.05). In contrast to the RT results, the effect of cognate status was not significant (β = 0.2, s.e. = 0.3, χ²(1) = 0.28, p > 0.05). Exposure was also not significant (β = -0.2, s.e. = 0.3, χ²(1) = 0.37, p > 0.05). Critically, there were no interactions. The target language effect was not significantly different after immersion in the Spanish-dominant environment (β = -0.5, s.e. = 0.6, χ²(1) = 0.83, p > 0.05). The difference between target languages was not significantly different for cognates vs. non-cognates (β = 0.5, s.e. = 0.5, χ²(1) = 0.91, p > 0.05), and this difference did not vary across time (β = 0.2, s.e. = 1.1, χ²(1) = 0.03, p > 0.05). (The overall cognate effect also failed to interact with time: β = -0.1, s.e. = 0.5, χ²(1) = 0.02, p > 0.05.)
4.9. Spanish productions, Native (Spanish) vs. Nonnative (German)
speakers
In the second analysis, we compare Spanish productions by German and Spanish speakers.
This allows us to examine the deviation of nonnative (German) from native (Spanish)
productions, and to examine whether this deviation changes over time for both cognates
and non-cognates.

Figure 6. Distribution of German speakers’ productions of voiceless stops in German and Spanish, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Dotted lines show means for each condition.

Figure 7. By-participant mean proportion of voiced stops prevoiced in German and Spanish, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Error bars show bootstrapped 95% confidence intervals.

Figure 8. Distribution of German and Spanish speakers’ productions of voiceless stops in Spanish, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Dotted lines show means for each condition.
Figure 8 shows the results for voiceless stops. German speakers’ productions showed significant deviations from native Spanish-speaking norms, with longer VOTs than the productions of Spanish speakers. This was reflected by a significant main effect of L1 background (β = 16.87, s.e. = 2.73, t = 6.19, p < .001). The main effect of cognate status was significant (longer VOTs for noncognates; β = 4.39, s.e. = 1.97, t = 2.23, p = 0.03) but exposure was not (β = 1.1, s.e. = 1.22, t = 0.91, p > 0.05). Critically, there were no interactions. The deviation of German speakers’ productions from native Spanish phonetics was not significantly different after immersion in the Spanish-dominant environment (β = 1.0, s.e. = 2.44, t = 0.41, p > 0.05). In contrast to the RT results, the difference between German and Spanish speakers’ productions was not significantly different for cognates vs. non-cognates (β = 3.09, s.e. = 3.13, t = 0.99, p > 0.05), and this difference did not vary across time (β = -0.94, s.e. = 4.04, t = -0.23, p > 0.05). (The overall cognate effect also failed to interact with time: β = -1.11, s.e. = 2.02, t = -0.55, p > 0.05.)
Figure 9 shows the results for voiced stops. German speakers’ productions showed significant deviations from native Spanish-speaking norms, with fewer prevoiced tokens than the productions of Spanish speakers. This was reflected by a significant main effect of L1 background (β = -4.05, s.e. = 0.75, χ²(1) = 30.62, p < .001). The other main effects were not significant (exposure: β = 0.0, s.e. = 0.36, χ²(1) = 0, p > 0.05; cognate: β = -0.08, s.e. = 0.34, χ²(1) = 0.06, p > 0.05). Critically, there were no interactions. The deviation of German speakers’ productions from native Spanish phonetics was not significantly different after immersion in the Spanish-dominant environment (β = 0.07, s.e. = 0.65, χ²(1) = 0.1, p > 0.05). In contrast to the RT results, the difference between German and Spanish speakers’ productions was not significantly different for cognates vs. non-cognates (β = 0, s.e. = 0.67, χ²(1) = 0, p > 0.05), and this difference did not vary across time (β = 0.64, s.e. = 1.32, χ²(1) = 0.23, p > 0.05). (The overall cognate effect also failed to interact with time: β = -0.33, s.e. = 0.71, χ²(1) = 0.21, p > 0.05.)

Figure 9. By-participant mean proportion of voiced stops prevoiced in Spanish by German (nonnative) and Spanish (native) speakers, separated by cognates vs. noncognates and pretest (before immersion) vs. posttest (after immersion). Error bars show bootstrapped 95% confidence intervals.
4.10. Non-cognate difficulties in lexical access are not matched by
difficulties in realizing L2-appropriate VOT
Non-cognates showed a persistent disadvantage in bilingual German speakers’ lexical
access, in both their L1 and L2. However, German speakers showed no significant
differences in the realization of VOT of cognate and non-cognate forms.
To more directly examine the apparent divergence between retrieval and articulation, we examined the by-trial correlation between RT and phonetic properties for German speakers’ productions of non-cognate forms in Spanish. If the difficulty in retrieving noncognates impacted articulation, we would expect to see a greater degree of non-native accent (more German-like pronunciations: longer voiceless VOTs, a lower probability of prevoicing) when non-cognates had longer vs. shorter RTs. As Figure 10 shows, there was no clear relationship for voiceless non-cognates. A mixed effects regression predicting VOT from RT (including random intercepts for participants and items and uncorrelated random slopes of RT for each) confirmed this; there was no significant relationship (β = -3.58, s.e. = 6.62, t = -0.54, p = .60). Voiced non-cognates also failed to show a significant relationship, although the effect was in the predicted direction. A logistic regression predicting the probability of prevoicing from RT, with random intercepts for participants and items, showed that longer RTs were associated with a lower probability of prevoicing (β = -2.05, s.e. = 1.13, χ²(1) = 3.57, p = .059). Taken together, these results suggest that there is not a strong link between retrieval difficulties and difficulties in producing non-native articulations of initial consonants.
Figure 10.
Relationship between reaction time and voice onset time for German speakers’ production of
non-cognate Spanish words. Note: RT outliers (<500 msec, N = 8) removed.
5. General Discussion
Research into the production of a non-native second language has shown that speakers
have difficulty in both retrieval and articulation of speech. Do these difficulties stem from
a single source, or are there independent contributions to these effects from retrieval vs.
articulatory processes? Answering this question requires the use of reliable, objective
measures of the fine-grained phonetic properties of speech. In this study, we utilize
a novel algorithm for VOT measurement to provide this type of measure. Reanalysis
of data from Baus et al. (2013) replicates standard reaction time and word duration
measures (longer reaction times and longer word durations for non-native talkers). We
also replicate standard phonetic effects on VOT (differences in the realization of voicing
categories across languages, place of articulation effects on VOT, and deviations of L2
speech from L1 norms). These findings suggest the algorithm is a reliable method for
analysis of bilingual speech.
Our analyses bolster previous findings suggesting that the difficulties non-native
speakers face in articulation do not reduce to difficulties these speakers have with
retrieval. We find that longer word durations for non-native speakers do not purely
derive from effects on RT (in fact, we find negative correlations between RT and word
duration). Extending previous work, we provide a closer examination of the relationship
between RT and a sub-lexical phonetic measure. We do not find significant correlations
between VOT and RT; more broadly, we find that factors that pervasively influence
retrieval speed (cognate status) exert no significant effect on VOT.
5.1. Implications for models of bilingual speech production
Consistent with recent work with monolingual English speakers (Buz and Jaeger, 2016;
Fink et al., 2018; Goldrick et al., 2018) and native Mandarin speakers in L2 English
(Gustafson and Goldrick, 2018), these findings are problematic for theories in which articulatory processes are treated as a ‘reflex,’ driven solely by the state of planning processes at the moment speech is initiated.
There are a range of architectures that are consistent with the results. It is possible that
planning continues after speech is initiated, providing another channel for disruptions to
ongoing retrieval and encoding to influence articulation (Goldrick et al., 2018; Sadat et al.,
2012). An alternative, mutually compatible type of account could assume that while
articulatory processes and planning are sensitive to an overlapping set of variables (e.g.,
native language status), the effects at each level are driven by independent mechanisms
(see Buz and Jaeger, 2016, for discussion in the context of monolingual processing). This
latter type of account seems highly plausible for difficulties in non-native speech that
stem, in part, from properties of speech sounds that are found across multiple lexical
items. For example, a native Spanish speaker learning English may have no trouble
recalling the general properties of the high frequency word sport, but they may experience
difficulty speaking it due to the presence of a sound sequence (initial /sp/) not found in
their native language. An important area for future work is to better understand the
relative contribution of both types of mechanisms to bilingual speech production.
5.2. Open science and phonetic studies of bilingual speech
Phonetic studies face many of the same issues that have led to a sense of a ‘replication
crisis’ in quantitative research; if anything, such issues are even more acute, given the high
degrees of freedom phonetic analysis presents to researchers (Roettger, 2019). Automatic
algorithms that are publicly available, such as the one proposed here, provide one
mechanism for addressing this issue. They allow researchers across labs to use identical
procedures to measure acoustic data, eliminating one potential source of variance when
comparing results from different populations of bilingual speakers.
In our experience, developing and utilizing publicly available algorithms fosters a
more general willingness to pursue open research. If the algorithm is being shared, why
not the statistical analysis methods? Why not the data itself? This particular step was
not available to us as the original study did not include permissions to share recordings
in a public database. We urge bilingualism researchers and other cognitive scientists
investigating speech to think proactively and ask participants for permission to share
recordings. This can allow full transparency of the research analysis pipeline in phonetics
research.
6. Conclusions
Automatic phonetic analysis can help research on bilingual speech production link psycholinguistic studies of word retrieval and planning to phonetic studies of speech articulation and
acoustics. This will provide new insights into the cognitive architecture of speech
production and help facilitate open, replicable research in cognitive science.
Acknowledgements
Thanks to Albert Costa for making possible the visit by MG to Barcelona which spawned
this work, and for creating an intellectual environment that inspired us to collaborate.
Funding
Supported in part by NIH grants R21HD077140 and R21MH119677.
References
Abramson, A. S. and Whalen, D. H. (2017). Voice onset time (VOT) at 50: Theoretical and
practical issues in measuring voicing distinctions. Journal of Phonetics, 63:75–86.
Adi, Y., Keshet, J., Cibelli, E., and Goldrick, M. (2017). Sequence segmentation using joint
RNN and structured prediction models. In IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP), pages 2422–2426.
Adi, Y., Keshet, J., Dmitrieva, O., and Goldrick, M. (2016). Automatic measurement of voice
onset time and prevoicing using recurrent neural networks. In Proceedings of INTERSPEECH
2016, pages 3152–3155.
Amengual, M., Meredith, L., and Panelli, T. (2019). Static and dynamic phonetic interactions
in the L2 and L3 acquisition of Japanese velar voiceless stops. In Calhoun, S., Escudero,
P., Tabain, M., and Warren, P., editors, Proceedings of the 19th International Congress of
Phonetic Sciences, pages 964–968, Canberra, Australia. Australasian Speech Science and
Technology Association Inc.
Aneja, G. A. (2016). Rethinking nativeness: Toward a dynamic paradigm of (non) native
speakering. Critical Inquiry in Language Studies, 13(4):351–379.
Antoniou, M., Best, C. T., Tyler, M. D., and Kroos, C. (2011). Inter-language interference
in VOT production by L2-dominant bilinguals: Asymmetries in phonetic code-switching.
Journal of Phonetics, 39(4):558–570.
Bagwell, C. and SoX Contributors (2013). SoX – Sound eXchange, the Swiss Army knife of
audio manipulation. http://sox.sourceforge.net/.
Baker, R. E., Baese-Berk, M., Bonnasse-Gahot, L., Kim, M., Van Engen, K. J., and Bradlow,
A. R. (2011). Word durations in non-native English. Journal of Phonetics, 39(1):1–17.
Balukas, C. and Koops, C. (2015). Spanish-English bilingual voice onset time in spontaneous
code-switching. International Journal of Bilingualism, 19(4):423–443.
Bates, D., Kliegl, R., Vasishth, S., and Baayen, H. (2015a). Parsimonious mixed models. arXiv
preprint arXiv:1506.04967.
Bates, D., Mächler, M., Bolker, B., and Walker, S. (2015b). Fitting linear mixed-effects
models using lme4. Journal of Statistical Software, 67(1):1–48.
Baus, C., Costa, A., and Carreiras, M. (2013). On the effects of second language immersion on
first language production. Acta Psychologica, 142(3):402–409.
Bullock, B. E., Toribio, A. J., González, V., and Dalola, A. (2006). Language dominance
and performance outcomes in bilingual pronunciation. In Proceedings of the 8th generative
approaches to second language acquisition conference, pages 9–16. Cascadilla Proceedings
Project Somerville, MA.
Buz, E. and Jaeger, T. F. (2016). The (in)dependence of articulation and lexical planning
during isolated word production. Language, Cognition and Neuroscience, 31(3):404–424.
Caruana, R. (1997). Multitask learning. Machine Learning, 28(1):41–75.
Cho, T. and Ladefoged, P. (1999). Variation and universals in VOT: Evidence from 18 languages.
Journal of Phonetics, 27(2):207–229.
Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., and Kuksa, P. (2011).
Natural language processing (almost) from scratch. Journal of Machine Learning Research,
12(Aug):2493–2537.
Duñabeitia, J. A. and Costa, A. (2015). Lying in a native and foreign language. Psychonomic
Bulletin & Review, 22(4):1124–1129.
Fink, A., Oppenheim, G. M., and Goldrick, M. (2018). Interactions between lexical access and
articulation. Language, Cognition and Neuroscience, 33(1):12–24.
Flege, J. E. (1991). Age of learning affects the authenticity of voice-onset time (VOT) in stop
consonants produced in a second language. The Journal of the Acoustical Society of America,
89(1):395–411.
Fowler, C. A., Sramko, V., Ostry, D. J., Rowland, S. A., and Hallé, P. (2008). Cross language
phonetic influences on the speech of French–English bilinguals. Journal of Phonetics,
36(4):649–663.
Fricke, M., Kroll, J. F., and Dussias, P. E. (2016). Phonetic variation in bilingual speech: A
lens for studying the production–comprehension link. Journal of Memory and Language,
89:110–137.
Ganin, Y., Ustinova, E., Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., Marchand, M.,
and Lempitsky, V. (2016). Domain-adversarial training of neural networks. The Journal of
Machine Learning Research, 17(1):2096–2030.
Goldrick, M., Keshet, J., Gustafson, E., Heller, J., and Needle, J. (2016). Automatic analysis of
slips of the tongue: Insights into the cognitive architecture of speech production. Cognition,
149:31–39.
Goldrick, M., McClain, R., Cibelli, E., Adi, Y., Gustafson, E., Moers, C., and Keshet, J.
(2018). The influence of lexical selection disruptions on articulation. Journal of Experimental
Psychology: Learning, Memory, and Cognition.
Goldrick, M., Runnqvist, E., and Costa, A. (2014). Language switching makes pronunciation
less nativelike. Psychological Science, 25(4):1031–1036.
Gollan, T. H. and Ferreira, V. S. (2009). Should I stay or should I switch? A cost–benefit analysis
of voluntary language switching in young and aging bilinguals. Journal of Experimental
Psychology: Learning, Memory, and Cognition, 35(3):640.
Gollan, T. H., Montoya, R. I., Cera, C., and Sandoval, T. C. (2008). More use almost always
means a smaller frequency effect: Aging, bilingualism, and the weaker links hypothesis.
Journal of Memory and Language, 58(3):787–814.
Gollan, T. H., Slattery, T. J., Goldenberg, D., Van Assche, E., Duyck, W., and Rayner, K.
(2011). Frequency drives lexical access in reading but not in speaking: The frequency-lag
hypothesis. Journal of Experimental Psychology: General, 140(2):186.
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville,
A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in neural information
processing systems, pages 2672–2680.
Graves, A., Mohamed, A., and Hinton, G. (2013). Speech recognition with deep recurrent neural
networks. In IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP), pages 6645–6649. IEEE.
Green, D. W. (1998). Mental control of the bilingual lexico-semantic system. Bilingualism:
Language and Cognition, 1(2):67–81.
Grosjean, F. and Miller, J. L. (1994). Going in and out of languages: An example of bilingual
flexibility. Psychological Science, 5(4):201–206.
Guion, S. G., Flege, J. E., Liu, S. H., and Yeni-Komshian, G. H. (2000). Age of learning
effects on the duration of sentences produced in a second language. Applied Psycholinguistics,
21(2):205–228.
Gustafson, E. and Goldrick, M. (2018). The role of linguistic experience in the processing of
probabilistic information in production. Language, Cognition and Neuroscience, 33(2):211–
226.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation,
9(8):1735–1780.
Holbrook, B. B., Kawamoto, A. H., and Liu, Q. (2019). Task demands and segment priming
effects in the naming task. Journal of Experimental Psychology: Learning, Memory, and
Cognition, 45(5):807–821.
Ivanova, I. and Costa, A. (2008). Does bilingualism hamper lexical access in speech production?
Acta Psychologica, 127(2):277–288.
Jacobs, A., Fricke, M., and Kroll, J. F. (2016). Cross-language activation begins during speech
planning and extends into second language speech. Language Learning, 66(2):324–353.
Jessen, M. and Ringen, C. (2002). Laryngeal features in German. Phonology, 19(2):189–218.
Kroll, J. F., Bobb, S. C., Misra, M., and Guo, T. (2008). Language selection in bilingual speech:
Evidence for inhibitory processes. Acta Psychologica, 128(3):416–430.
Kuznetsova, A., Brockhoff, P. B., and Christensen, R. H. B. (2017). lmerTest package: Tests in
linear mixed effects models. Journal of Statistical Software, 82(13):1–26.
Lisker, L. and Abramson, A. S. (1964). A cross-language study of voicing in initial stops:
Acoustical measurements. Word, 20(3):384–422.
López, V. G. (2012). Spanish and English word-initial voiceless stop production in code-switched
vs. monolingual structures. Second Language Research, 28(2):243–263.
MacKay, I. R. and Flege, J. E. (2004). Effects of the age of second language learning on
the duration of first and second language sentences: The role of suppression. Applied
Psycholinguistics, 25(3):373–396.
Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards deep learning
models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083.
Meuter, R. F. and Allport, A. (1999). Bilingual language switching in naming: Asymmetrical
costs of language selection. Journal of Memory and Language, 40(1):25–40.
Mikolov, T., Karafiát, M., Burget, L., Černocký, J., and Khudanpur, S. (2010). Recurrent
neural network based language model. In Proceedings of INTERSPEECH 2010.
Mor, N., Wolf, L., Polyak, A., and Taigman, Y. (2018). A universal music translation network.
arXiv preprint arXiv:1805.07848.
Muscalu, L. M. and Smiley, P. A. (2019). The illusory benefit of cognates: Lexical facilitation
followed by sublexical interference in a word typing task. Bilingualism: Language and
Cognition, 22(4):848–865.
Olson, D. J. (2013). Bilingual language switching and selection at the phonetic level: Asym-
metrical transfer in VOT production. Journal of Phonetics, 41(6):407–420.
Piccinini, P. and Arvaniti, A. (2015). Voice onset time in Spanish–English spontaneous
code-switching. Journal of Phonetics, 52:121–137.
R Core Team (2019). R: A Language and Environment for Statistical Computing. R Foundation
for Statistical Computing, Vienna, Austria.
Roettger, T. B. (2019). Researcher degrees of freedom in phonetic research. Laboratory
Phonology, 10.
Sadat, J., Martin, C. D., Alario, F. X., and Costa, A. (2012). Characterizing the bilingual
disadvantage in noun phrase production. Journal of Psycholinguistic Research, 41(3):159–179.
Schillingmann, L., Ernst, J., Keite, V., Wrede, B., Meyer, A. S., and Belke, E. (2018). AlignTool:
The automatic temporal alignment of spoken utterances in German, Dutch, and British
English for psycholinguistic purposes. Behavior Research Methods, 50(2):466–489.
Schuster, M. and Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE
Transactions on Signal Processing, 45(11):2673–2681.
Shrem, Y., Goldrick, M., and Keshet, J. (2019). Dr.VOT: Measuring positive and negative
voice onset time in the wild. In Proceedings of INTERSPEECH 2019, pages 629–633.
Šimáčková, Š. and Podlipský, V. (2018). Patterns of short-term phonetic interference in
bilingual speech. Languages, 3(3):34.
Šimáčková, Š. and Podlipský, V. J. (2015). Immediate phonetic interference in code-switching
and interpreting. In Proceedings of the 18th International Congress of Phonetic Sciences,
Glasgow, UK.
Sonderegger, M. and Keshet, J. (2012). Automatic measurement of voice onset time using
discriminative structured prediction. The Journal of the Acoustical Society of America,
132(6):3965–3979.
Tsui, R. K.-Y., Tong, X., and Chan, C. S. K. (2019). Impact of language dominance on phonetic
transfer in Cantonese–English bilingual language switching. Applied Psycholinguistics,
40(1):29–58.