All content in this area was uploaded by Joseph V. Casillas on Aug 10, 2015
ACOUSTICS OF SPANISH AND ENGLISH CORONAL STOPS
Joseph V. Casillas, Yamile Díaz and Miquel Simonet
University of Arizona, Tucson
{jvcasill, ydiaz44, simonet}@email.arizona.edu
ABSTRACT
This study explores the acoustic correlates that distinguish English and Spanish coronal stops (/t/, /d/). English and Spanish coronal stops are hypothesized to differ in terms of voice-onset time and place of articulation. Because the voice-onset time difference is well known, we are particularly concerned with capturing the place of articulation difference with acoustic data. Specifically, we focus on English /d/ and Spanish /t/, which are both phonetically voiceless stops with a short-lag voice-onset time. Spanish /t/ has been described as dental, whereas English /d/ is described as alveolar. Mixed-effects models of several spectral measurements of the consonant burst found that standard deviation, relative burst intensity, and center of gravity differed as a function of place of articulation (or language).
Keywords: Coronal stops, Spectral moments, VOT,
Spanish, English
1. INTRODUCTION
Spanish and English both have coronal stops (/d, t/); however, their phonetic implementation differs. English /d/ and /t/ are produced with an alveolar place of articulation (POA) [15], whereas Spanish /d/ and /t/ are both produced with a dental POA [9]. These descriptions rest mostly on impressionistic observations. This investigation sets out to explore the acoustic correlates of the place differences between these segments.
An important difference between Spanish and English concerns the stop voicing distinction common to both languages. Both languages distinguish /t/ from /d/ by exploiting voice-onset time (VOT), the acoustic output of the coordination of glottal and supra-glottal gestures, measured as the time difference between the articulatory release and the onset of modal voicing. However, the way in which VOT is exploited differs between the two languages. In English, /d/ has a short-lag VOT and /t/ has a long-lag VOT, whereas in Spanish /d/ has a lead VOT (prevoicing) and /t/ has a short-lag VOT [13, 20].
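To make the three VOT categories concrete, the following Python sketch maps a VOT value (in ms) onto lead, short-lag, or long-lag, and records which category each coronal stop falls into according to the description above. The 0 ms and 35 ms boundaries are illustrative assumptions chosen for this sketch; they are not thresholds used or estimated in this study.

```python
# Illustrative VOT categorization (sketch). The 0 ms and 35 ms boundaries are
# ASSUMED for illustration only; they are not thresholds used in this study.
LEAD_MAX_MS = 0.0        # assumed: negative VOT means voicing precedes release
SHORT_LAG_MAX_MS = 35.0  # assumed upper edge of the short-lag region

# Category of each coronal stop, as described in the text [13, 20].
EXPECTED_CATEGORY = {
    ("English", "/d/"): "short-lag",
    ("English", "/t/"): "long-lag",
    ("Spanish", "/d/"): "lead",
    ("Spanish", "/t/"): "short-lag",
}


def vot_category(vot_ms: float) -> str:
    """Map a VOT value in milliseconds onto lead / short-lag / long-lag."""
    if vot_ms < LEAD_MAX_MS:
        return "lead"
    if vot_ms <= SHORT_LAG_MAX_MS:
        return "short-lag"
    return "long-lag"


if __name__ == "__main__":
    # English /d/ and Spanish /t/ share a category, which is why VOT alone
    # cannot separate them.
    shared = [k for k, v in EXPECTED_CATEGORY.items() if v == "short-lag"]
    print(shared)  # [('English', '/d/'), ('Spanish', '/t/')]
    print(vot_category(-60.0), vot_category(15.0), vot_category(80.0))
```

The overlap in the short-lag cell is the motivation for looking beyond VOT to burst properties.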
The question of how Spanish-English bilinguals use VOT to distinguish voiced and voiceless stops in their two languages has been investigated at length [19, 20]. However, studies to date have overlooked the fact that, in order to avoid merging English and Spanish coronal stops, bilinguals would need to produce a difference in POA in addition to exploiting VOT differently. Few analyses have used acoustic measures to investigate POA differences in coronal stops; two exceptions are [17] and [18].
The fact that both Spanish and English have short-lag, phonetically voiceless stops in their sound inventories (/t/ and /d/, respectively) raises the question of whether these two segments can be distinguished by any acoustic measure. Accounting for POA differences via acoustic metrics would open new areas of research on bilingualism and second-language learning. As in [17], the goal of the present study is to capture the hypothesized place difference between English and Spanish with acoustic data. The present study focuses on monolingual speakers of both languages. In future work, we plan to study Spanish-English bilinguals to determine whether they exploit the place differences between Spanish and English coronal stops, as in [18].
The first four spectral moments, namely center of gravity (COG), standard deviation (SD), skewness (SK), and kurtosis (KT), describe the shape of a spectrum, that is, how energy is distributed across frequency [10]. Various investigators have used spectral moments to distinguish places of articulation in fricatives [7, 11]; however, [17] is one of only a few studies to use spectral moments to analyze place differences in stops. Specifically, Sundara examined coronal stops in Canadian French and Canadian English, which (like Spanish and English) are realized with dental and alveolar place, respectively. She found differences between the two languages in relative burst intensity, COG, SD, and KT, which she attributed to differences in POA. It remains an open question whether place differences between Spanish and English can be captured in the same manner.
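For readers less familiar with spectral moments, the sketch below computes them from a burst's power spectrum by treating the normalized power values as weights of a probability distribution over frequency. It is a minimal illustration, not the Praat procedure used in this study; the power weighting and the use of excess kurtosis (subtracting 3) are assumptions of this sketch, and conventions differ across tools.

```python
import math
from typing import Sequence, Tuple


def spectral_moments(freqs_hz: Sequence[float],
                     power: Sequence[float]) -> Tuple[float, float, float, float]:
    """Return (COG, SD, skewness, excess kurtosis) of a power spectrum.

    Power values are normalized to sum to 1 and used as weights of a
    distribution over frequency (an assumption of this sketch).
    """
    total = sum(power)
    if total <= 0:
        raise ValueError("spectrum must contain positive power")
    weights = [p / total for p in power]

    # 1st moment: center of gravity (weighted mean frequency, Hz).
    cog = sum(w * f for w, f in zip(weights, freqs_hz))

    # Central moments 2-4 around the center of gravity.
    m2 = sum(w * (f - cog) ** 2 for w, f in zip(weights, freqs_hz))
    m3 = sum(w * (f - cog) ** 3 for w, f in zip(weights, freqs_hz))
    m4 = sum(w * (f - cog) ** 4 for w, f in zip(weights, freqs_hz))
    if m2 == 0:
        raise ValueError("all energy at one frequency: shape is undefined")

    sd = math.sqrt(m2)              # 2nd moment: spread (Hz)
    skewness = m3 / m2 ** 1.5       # 3rd moment: asymmetry (unitless)
    kurtosis = m4 / m2 ** 2 - 3.0   # 4th moment: peakedness (excess)
    return cog, sd, skewness, kurtosis
```

A symmetric toy spectrum with power 1, 2, 1 at 1000, 2000, and 3000 Hz yields COG = 2000 Hz, SD of about 707 Hz, skewness 0, and excess kurtosis of −1; shifting energy toward higher frequencies raises COG and makes skewness negative.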
2. METHOD
The goal of this investigation was to explore the acoustic correlates that differentiate Spanish from English coronal stops. We measured VOT, the first four spectral moments, and relative burst intensity (see below). After establishing the expected differences in VOT, we restricted the analysis to the two short-lag stops (English /d/ and Spanish /t/). Of particular interest was the relative importance of each of these burst measures for POA differences across the two languages.
2.1. Speakers
In order to address these issues, we recorded the speech of 16 female participants. Eight were native Spanish speakers between the ages of 18 and 23, all of whom were recruited from the Universitat de les Illes Balears campus community and were born and raised on the island of Majorca, Spain. Eight were native English speakers, undergraduate students at the University of Arizona who were born and raised in the US Southwest. The Spanish speakers had studied some English in Spain, and the English speakers had studied some Spanish in the U.S., but none of the speakers could maintain a basic conversation in their “second” language.
2.2. Materials and Procedure
We devised a list of 48 target words, 24 in English and 24 in Spanish, containing the voiced and voiceless coronal stops of both languages in word-initial position. For each language, 12 words began with /d/ and 12 with /t/, equally divided between stressed and unstressed syllables. All stops were followed by a low vowel (/a/ for Spanish and /æ, ɑ/ for English) (see [17]).
To collect the acoustic data, we used the “delayed repetition technique” widely used in bilingual speech research [5]. The materials were read by 6 male native speakers: 3 native English speakers (recorded in Austin, Texas) and 3 native Spanish speakers (recorded on Majorca, Spain). These recordings served as auditory stimuli, which the 16 female participants whose speech is analyzed here repeated aloud.
The participants produced the target words in the carrier phrase “_ is the word” or its Spanish equivalent (“_ es la palabra”). All words not containing coronal stops were treated as distractors. The computer program Praat [3] presented the sentences aurally in random order; participants were asked to listen to the entire sentence and then, after a beep, repeat it aloud at their own pace. They were not asked to imitate the voices of the male talkers, but to produce the sentences in their “own way.”
The English data were recorded in a sound-attenuated booth on the campus of the University of Arizona. The Spanish data were obtained in a quiet classroom on the campus of the Universitat de les Illes Balears. Recordings were made with a Shure SM10A dynamic head-mounted microphone, a Sound Devices MM-1 microphone pre-amplifier, and a Marantz PMD660 digital speech recorder. The signal was digitized at 44.1 kHz with 16-bit quantization.
Each participant contributed 72 coronal stops (24 target words × 3 repetitions), for a total of 1,152 tokens (24 words × 3 repetitions × 16 participants). Our initial analysis of VOT uses the entire dataset; for the subsequent analyses of burst measurements, however, we took the subset of these data (exactly half) containing only Spanish /t/ and English /d/, as these are the stops that cannot easily be distinguished by VOT. Five tokens were removed due to mispronunciations or extraneous noise, leaving a total of 571 tokens for the burst analyses.
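The token counts above can be checked with a few lines of arithmetic. The sketch assumes, as the per-participant count implies, that each speaker produced only the 24 words of her native language, 12 /d/-initial and 12 /t/-initial.

```python
# Token bookkeeping for the design described in Sections 2.1 and 2.2.
SPEAKERS_PER_LANGUAGE = 8
WORDS_PER_SPEAKER = 24      # 12 /d/-initial + 12 /t/-initial
WORDS_PER_CONSONANT = 12
REPETITIONS = 3
REMOVED_TOKENS = 5          # mispronunciations or extraneous noise

tokens_per_speaker = WORDS_PER_SPEAKER * REPETITIONS                # 72
total_tokens = tokens_per_speaker * SPEAKERS_PER_LANGUAGE * 2       # 1,152

# Burst subset: Spanish /t/ (Spanish speakers) + English /d/ (English speakers).
spanish_t = WORDS_PER_CONSONANT * REPETITIONS * SPEAKERS_PER_LANGUAGE  # 288
english_d = WORDS_PER_CONSONANT * REPETITIONS * SPEAKERS_PER_LANGUAGE  # 288
burst_subset = spanish_t + english_d                                   # 576
analyzed = burst_subset - REMOVED_TOKENS                               # 571

assert burst_subset * 2 == total_tokens  # the subset is exactly half
print(tokens_per_speaker, total_tokens, burst_subset, analyzed)
```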
2.3. Measurements
The digitized sound files were low-pass filtered at 11.025 kHz. For each coronal stop, synchronized waveform and spectrographic displays were used to mark the onset of the burst and the onset of modal voicing. The onset of voicing was taken to be the upward zero-crossing of the first periodic pattern found in the oscillogram [12]. VOT was calculated as the difference (in ms) between the onset of modal voicing and the onset of the burst, so that prevoiced stops have negative values.
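The two VOT landmarks can be sketched as follows. These are hypothetical helpers, not the authors' procedure: they assume the analyst has already hand-marked the burst onset and an approximate start of the first periodic cycle, and the snapping of that mark to the next upward zero-crossing is an illustrative step.

```python
from typing import Optional, Sequence


def next_upward_zero_crossing(samples: Sequence[float],
                              start_index: int) -> Optional[int]:
    """First index i >= start_index (i >= 1) with samples[i-1] < 0 <= samples[i].

    That is an upward (negative to non-negative) zero-crossing; returns None
    if there is no such crossing at or after start_index.
    """
    for i in range(max(start_index, 1), len(samples)):
        if samples[i - 1] < 0 <= samples[i]:
            return i
    return None


def voicing_onset_s(samples: Sequence[float], sample_rate_hz: float,
                    approx_onset_s: float) -> Optional[float]:
    """Snap a hand-marked approximate voicing onset to an upward zero-crossing."""
    start = int(round(approx_onset_s * sample_rate_hz))
    idx = next_upward_zero_crossing(samples, start)
    return None if idx is None else idx / sample_rate_hz


def vot_ms(burst_onset_s: float, voicing_onset_s: float) -> float:
    """VOT in ms: voicing onset minus burst onset (negative means prevoicing)."""
    return (voicing_onset_s - burst_onset_s) * 1000.0
```

For a prevoiced Spanish /d/, the modal-voicing landmark precedes the burst, so `vot_ms` returns a negative value, in line with the lead VOT reported in Section 3.1.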
Unlike in [17], the duration of the burst was determined semi-automatically: the onset of the burst was marked by hand, and its offset was set automatically. For stops with a positive VOT shorter than 25 ms (most short-lag stops), the burst window extended to the onset of voicing, so its duration equaled VOT and therefore varied from token to token. For all other stops, including those with lead and long-lag VOT, the offset was set 25 ms after the onset. Thus, the longest bursts were 25 ms.
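The burst-window rule can be stated compactly. The sketch below encodes our reading of it (window duration equals VOT when 0 < VOT < 25 ms, otherwise 25 ms); the function name and the millisecond time base are illustrative assumptions.

```python
from typing import Tuple

MAX_BURST_MS = 25.0  # longest burst window, from the text


def burst_window_ms(burst_onset_ms: float, vot: float) -> Tuple[float, float]:
    """Return (onset, offset) of the burst analysis window, in ms.

    Positive VOT below 25 ms: the window runs up to the onset of voicing, so
    its duration equals VOT. Lead VOT, long-lag VOT, or VOT >= 25 ms: the
    window ends 25 ms after the hand-marked burst onset.
    """
    duration = vot if 0.0 < vot < MAX_BURST_MS else MAX_BURST_MS
    return burst_onset_ms, burst_onset_ms + duration
```

Capping the window keeps long-lag aspiration and, for prevoiced stops, post-release material beyond 25 ms out of the burst measurements.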
To calculate relative burst intensity (RI), we extracted the intensity of the burst (in dB), as well as the mean intensity of the following vowel, which was also segmented manually. RI was the difference between the intensity of the vowel and the intensity of the burst. All spectral measures (COG, SD, SK, KT) were derived from the spectral envelope, which ranged from 60 Hz to 11.025 kHz. Tokens for which a clear burst could not be identified were removed.
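Relative intensity as defined above can be sketched from raw samples. This sketch assumes intensity is a single RMS-based value per segment, in dB relative to an assumed reference of 2 × 10⁻⁵ (20 micropascals). Because RI is a difference of two dB values, the reference cancels and only the amplitude ratio matters. The sign convention (vowel minus burst) follows the definition in the text.

```python
import math
from typing import Sequence

REFERENCE = 2e-5  # assumed reference amplitude (20 micropascals)


def intensity_db(samples: Sequence[float]) -> float:
    """Intensity in dB of a stretch of samples, from its RMS amplitude."""
    if not samples:
        raise ValueError("empty segment")
    rms = math.sqrt(sum(x * x for x in samples) / len(samples))
    if rms == 0:
        raise ValueError("silent segment has no finite intensity")
    return 20.0 * math.log10(rms / REFERENCE)


def relative_intensity_db(burst: Sequence[float], vowel: Sequence[float]) -> float:
    """RI as defined in the text: vowel intensity minus burst intensity (dB)."""
    return intensity_db(vowel) - intensity_db(burst)
```

A vowel whose RMS amplitude is ten times that of the burst gives RI = 20 dB, whatever reference is chosen.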
2.4. Statistical analyses
A linear mixed-effects model was fit to each of the acoustic measures detailed above (VOT, RI, COG, SD, SK, and KT). The analysis of VOT included the entire dataset, with language (Spanish, English) and consonant (/d/, /t/) as fixed effects. Speakers and word items were included as random effects [1], with by-subject and by-item random slopes for the effect of consonant [2]. The statistical significance of language, consonant, and the language-by-consonant interaction was assessed by hierarchical partitioning of variance via nested model comparisons. All pairwise comparisons were then evaluated as simultaneous tests for general linear hypotheses, using Tukey contrasts with adjusted p-values.
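Nested model comparisons of this kind are commonly carried out as likelihood ratio tests: twice the gain in log-likelihood from adding a term is referred to a χ² distribution whose degrees of freedom equal the number of added parameters. The sketch below shows that computation in plain Python. The log-likelihoods would come from the fitted mixed models (e.g., in R), and the closed-form χ² tail used here is valid only for integer degrees of freedom.

```python
import math
from typing import Tuple


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) for a chi-square with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!
        term, total = 1.0, 1.0
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return math.exp(-half) * total
    # Odd df: erfc(sqrt(x/2)) plus a finite sum of half-integer terms.
    total = math.erfc(math.sqrt(half))
    for k in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * half ** (k - 0.5) / math.gamma(k + 0.5)
    return total


def likelihood_ratio_test(loglik_reduced: float, loglik_full: float,
                          df_added: int) -> Tuple[float, float]:
    """Return (chi-square statistic, p-value) for two nested models."""
    stat = 2.0 * (loglik_full - loglik_reduced)
    return stat, chi2_sf(stat, df_added)
```

For instance, `chi2_sf(10.54, 1)` is about 0.0012, consistent with the p < 0.002 reported for the language-by-consonant interaction in Section 3.1.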
The subsequent analyses of residualized burst measurements included only the Spanish /t/ and English /d/ data. The present study was concerned with the acoustic correlates (aside from VOT) that could account for POA differences. Because all burst metrics are related to VOT (the burst window itself depends on VOT), the effect of this variable was partialed out of the burst measurements in order to reduce multicollinearity between predictors. To this end, a separate model was fit for each burst measurement, regressing it on VOT. The residuals of these models were then used in place of the raw burst measurements in all subsequent analyses.¹ Thus, each omnibus model directly compared Spanish /t/ to English /d/.² In each case, individual speakers and word items were given random intercepts. We report marginal R² and conditional R² as indications of goodness of fit for all models [14]. Marginal R² is the proportion of variance explained by the fixed effects alone, whereas conditional R² is the proportion explained by the fixed and random effects together.
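The residualization step and the two R² measures can be sketched as follows. `residualize_on_vot` fits an ordinary least-squares line of a burst measure on VOT and returns the residuals; this is a simplified, fixed-effects-only version, since the text does not specify whether the residualizing models included random effects. `nakagawa_r2` follows the variance-partitioning definitions of [14], taking variance components from a fitted model as inputs; the example values below are hypothetical.

```python
from typing import List, Sequence, Tuple


def residualize_on_vot(measure: Sequence[float],
                       vot: Sequence[float]) -> List[float]:
    """Residuals of an OLS regression of a burst measure on VOT."""
    n = len(measure)
    if n != len(vot) or n < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    mean_x = sum(vot) / n
    mean_y = sum(measure) / n
    sxx = sum((x - mean_x) ** 2 for x in vot)
    if sxx == 0:
        raise ValueError("VOT has no variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(vot, measure))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # What remains of each measurement once the linear effect of VOT is removed.
    return [y - (intercept + slope * x) for x, y in zip(vot, measure)]


def nakagawa_r2(var_fixed: float, var_random: float,
                var_residual: float) -> Tuple[float, float]:
    """Return (marginal R2, conditional R2) for a linear mixed model.

    var_random is the summed variance of all random effects
    (here, speaker and word-item intercepts).
    """
    total = var_fixed + var_random + var_residual
    return var_fixed / total, (var_fixed + var_random) / total
```

With hypothetical components of 2 (fixed), 3 (random), and 5 (residual), marginal R² is 0.2 and conditional R² is 0.5; the gap between them reflects variance attributable to speakers and items, as in the large conditional-marginal gaps reported in Section 3.2.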
The second analysis explored the extent to which each acoustic measure could provide useful information about the POA of the phonetically voiceless segments. To this end, we divided the Spanish /t/ and English /d/ data into two subsets: a training set comprising 75% of the data and a testing set comprising the remaining 25%. We then used RI, COG, SD, SK, and KT as predictors in a forward-selection logistic regression model with phoneme identity (Spanish /t/, English /d/) as the criterion variable. Priority of entry was given to the correlates found to best predict POA in [17]. After building the model on the training subset, we cross-validated it by predicting the identity of the stops in the held-out testing subset.
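The holdout evaluation can be sketched as below: shuffle the token indices, reserve 25% for testing, fit on the rest, and score predictions on the held-out tokens. The seed, the rounding of the test size (truncation, which yields 142 test tokens from 571), the 0.5 decision threshold, and the helper names are assumptions of this sketch; fitting and forward selection of the logistic model are not shown.

```python
import random
from typing import List, Sequence, Tuple


def holdout_split(n_tokens: int, test_fraction: float = 0.25,
                  seed: int = 1) -> Tuple[List[int], List[int]]:
    """Randomly split token indices into (training, testing) index lists."""
    indices = list(range(n_tokens))
    random.Random(seed).shuffle(indices)
    n_test = int(n_tokens * test_fraction)  # truncation: 142.75 -> 142
    return indices[n_test:], indices[:n_test]


def predict_label(probability_english_d: float, threshold: float = 0.5) -> str:
    """Turn a logistic-model probability into a phoneme label."""
    return "English /d/" if probability_english_d >= threshold else "Spanish /t/"


def accuracy(predicted: Sequence[str], observed: Sequence[str]) -> float:
    """Proportion of held-out tokens whose label was predicted correctly."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("need two non-empty sequences of equal length")
    hits = sum(p == o for p, o in zip(predicted, observed))
    return hits / len(observed)
```

The out-of-sample error is then 1 − accuracy, so an accuracy of 0.87 corresponds to the error of 0.13 reported in Section 3.2.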
3. RESULTS
3.1. VOT
The analysis of the VOT data revealed main effects of language (χ²(2) = 44.07; p < 0.001) and consonant (χ²(2) = 56.39; p < 0.001), as well as a language-by-consonant interaction (χ²(1) = 10.54; p < 0.002). Pairwise comparisons showed that all of the coronal stops differed from each other (p < 0.001), with the exception of the Spanish /t/ vs. English /d/ short-lag stops. The mixed-effects model provided the best fit for the data (conditional R² = 0.87; marginal R² = 0.82). Figure 1A plots VOT as a function of language and consonant. The light gray box shows that Spanish /d/ was produced with lead VOT (x̄ = −64.48 ms, SD = 28.56), and the dark gray box shows that English /t/ was produced with long-lag VOT (x̄ = 77.63 ms, SD = 25.63). The white boxes represent the short-lag stops of English and Spanish. VOT for English /d/ was slightly longer (x̄ = 22.13 ms, SD = 26.69) than for Spanish /t/ (x̄ = 16.18 ms, SD = 5.08); however, this difference was not significant, likely due to the high variability of English /d/. Thus, in these data VOT can account for differences between all coronal stops except those realized with short-lag VOT: English /d/ and Spanish /t/.
3.2. Burst measurements
Figure 1B plots RI of Spanish /t/ and English /d/. The data were best fit using the mixed-effects model (conditional R² = 0.66; marginal R² = 0.20). The analysis revealed that RI for English /d/ was 4.81 dB (SE = 1.77) higher than for Spanish /t/ (t = −2.72; p < 0.02).
The COG data were also best fit using the mixed-effects model (conditional R² = 0.68; marginal R² = 0.19). COG for Spanish /t/ was 1131 Hz (SE = 455) lower than for English /d/ (t = −2.48; p < 0.03; see Figure 1C). For SD, the mixed-effects model accounted for 61% of the variance (conditional R² = 0.61; marginal R² = 0.26). SD values for Spanish /t/ were 763 Hz (SE = 222) lower than for English /d/ (t = −3.44; p < 0.004; see Figure 1D). The analysis of SK (Figure 1E) revealed that Spanish /t/ was 2.10 units (SE = 0.87) higher than English /d/ (t = 2.40; p < 0.03); again, the data were best fit using the mixed-effects model (conditional R² = 0.45; marginal R² = 0.12). Lastly, the model explained the least variance for KT (conditional R² = 0.26; marginal R² = 0.06). KT for Spanish /t/ was 29.02 units (SE = 14.30) higher than for English /d/; however, this difference was not significant at our specified alpha level (t = 2.03; p = 0.06; see Figure 1F). In sum, all of the burst measurements except kurtosis differed as a function of language. This is taken as an indication that these metrics capture place differences between the short-lag stops of English and Spanish.
Figure 1: VOT and burst measures of Spanish and English coronals. Panels: (A) VOT, with lead, short-lag, and long-lag regions; (B) Relative Intensity; (C) Center of Gravity; (D) Standard Deviation; (E) Skewness; (F) Kurtosis; each plotted by consonant (/d/, /t/) and language (English, Spanish).
The next step was to analyze the relative contribution of the burst measurements. The training subset of the data was analyzed via logistic regression, with the burst metrics as predictors of the identity of the short-lag stop (Spanish /t/, English /d/). The model eliminated SK and KT from the analysis. Table 1 summarizes the results. Nagelkerke's pseudo-R² is reported as an indication of the goodness of fit of each predictor as it was entered into the model. SD accounted for the largest share of the variance (35%), followed by RI (5%) and COG (2%).
Table 1: Regression analysis of /d/-/t/.
Metric   R²     R² change   χ² change   p-value
SD       .353   .353        175.39      < 0.001
COG      .376   .023         14.01      < 0.001
RI       .425   .049         30.01      < 0.001
SK       .426   .001          0.08      > 0.05
KT       .432   .006          4.36      < 0.04
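Nagelkerke's pseudo-R² and the χ² change in Table 1 can both be computed from model log-likelihoods. The sketch below assumes the usual definitions: the Cox and Snell R² rescaled by its maximum attainable value, and a likelihood-ratio χ² equal to twice the log-likelihood gain when a predictor is entered. The log-likelihoods themselves would come from the fitted logistic models and are not reproduced here.

```python
import math


def nagelkerke_r2(loglik_null: float, loglik_model: float, n: int) -> float:
    """Nagelkerke pseudo-R2 from null and fitted log-likelihoods (n tokens)."""
    cox_snell = 1.0 - math.exp(2.0 * (loglik_null - loglik_model) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * loglik_null / n)
    return cox_snell / max_cox_snell


def chi2_change(loglik_before: float, loglik_after: float) -> float:
    """Likelihood-ratio chi-square for entering one more predictor."""
    return 2.0 * (loglik_after - loglik_before)
```

For a null model of a balanced binary outcome, the null log-likelihood is n·ln(0.5); a model that improves nothing returns R² = 0, and a perfectly fitting model (log-likelihood 0) returns 1. Each "R² change" in Table 1 is the difference between successive values in the R² column.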
Finally, the best-fit model was used to predict the identity of the short-lag stops in the testing subset. The model performed with 87% accuracy (out-of-sample error = 0.13). That is, given the SD, RI, and COG values, it was possible to predict whether a stop was Spanish /t/ or English /d/ for 87% of the 142 tokens in the testing subset.
4. DISCUSSION AND CONCLUSION
The results of the analyses showed that the phonetically voiceless coronal stops of English and Spanish can be distinguished by RI and by the spectral shape of the stop burst. Importantly, the present study indicates that the place of articulation differences described for Spanish /t/ and English /d/ are best captured by SD, RI, and COG. SK and KT contributed little to predicting the POA of the short-lag stops in our data (pseudo-R² change ≤ .006).
Our results partially corroborate the findings of [17]. Like the Canadian varieties of English and French investigated in [17], the varieties of English and Spanish investigated here differ in the place of articulation of their coronal stops: alveolar (English) versus dental (French, Spanish), according to impressionistic descriptions. As in [17], SD and other burst measures varied across the two languages in our data; unlike in our data, however, Sundara also found significant differences in kurtosis.
The present study contributes language-specific acoustic characteristics of bursts in the short-lag coronal stops of two monolingual varieties of English and Spanish. Among other things, the findings provide baseline acoustic descriptions for future studies of Spanish-English bilinguals.
5. REFERENCES
[1] Baayen, R. H., Davidson, D. J., Bates, D. M. 2008. Mixed-effects modeling with crossed random effects for subjects and items. Journal of Memory and Language 59, 390–412.
[2] Barr, D. J., Levy, R., Scheepers, C., Tily, H. J. 2013. Random effects structure for confirmatory hypothesis testing: Keep it maximal. Journal of Memory and Language 68, 255–278.
[3] Boersma, P., Weenink, D. 2012. Praat: doing phonetics by computer. Glot International 5, 341–345.
[4] Cole, J., McMurray, B., Munson, C., Linebaugh, G. 2010. Unmasking the acoustic effects of vowel-to-vowel coarticulation: A statistical modeling approach. Journal of Phonetics 38, 167–184.
[5] Flege, J. E., Munro, M., MacKay, I. 1995. Effects of age of second-language learning on the production of English consonants. Speech Communication 16, 1–26.
[6] Fowler, C. A. 1984. Segmentation of coarticulated speech in perception. Perception & Psychophysics 36, 359–368.
[7] Gordon, M., Barthmaier, P., Sands, K. 2002. A cross-linguistic acoustic study of voiceless fricatives. Journal of the International Phonetic Association 32, 141–174.
[8] Gow, D. W. 2003. Feature parsing: Feature cue mapping in spoken word recognition. Perception & Psychophysics 65, 575–590.
[9] Hualde, J. I. 2005. The sounds of Spanish. Cambridge University Press.
[10] Jones, M. J., Knight, R.-A. 2013. Bloomsbury Companion to Phonetics. London, UK: A & C Black.
[11] Jongman, A., Wayland, R., Wong, S. 2000. Acoustic characteristics of English fricatives. The Journal of the Acoustical Society of America 108, 1252–1263.
[12] Lieberman, P., Blumstein, S. E. 1988. Speech physiology, speech perception, and acoustic phonetics. Cambridge, UK: Cambridge University Press.
[13] Lisker, L., Abramson, A. S. 1964. A cross-language study of voicing in initial stops: Acoustical measurements. Word 20, 384–422.
[14] Nakagawa, S., Schielzeth, H. 2013. A general and simple method for obtaining R² from generalized linear mixed-effects models. Methods in Ecology and Evolution 4, 133–142.
[15] Picard, M. 1987. An introduction to the comparative phonetics of English and French in North America, volume 7. Amsterdam, The Netherlands: John Benjamins Publishing.
[16] R Core Team. 2014. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing.
[17] Sundara, M. 2005. Acoustic-phonetics of coronal stops: A cross-language study of Canadian English and Canadian French. The Journal of the Acoustical Society of America 118, 1026–1037.
[18] Sundara, M., Polka, L., Baum, S. 2006. Production of coronal stops by adult simultaneous bilinguals. Bilingualism: Language and Cognition 9, 97–114.
[19] Williams, L. 1977. The perception of stop consonant voicing by Spanish-English bilinguals. Perception & Psychophysics 21, 289–297.
[20] Williams, L. 1979. The modification of speech perception and production in second-language learning. Perception & Psychophysics 26, 95–104.
¹ See [6], [8], and [4] for discussion and examples of this approach.
² Degrees of freedom for hypothesis tests were derived using the Satterthwaite approximation as implemented in the lmerTest package in R [16].