Alice Turk and Stefanie Shattuck-Hufnagel (2020). Speech timing: implications for theories of phonology, phonetics, and speech motor control. (Oxford Studies in Phonology and Phonetics 5.) Oxford: Oxford University Press. Pp. xv + 370.

Jason A. Shaw *
Yale University
1 Overview and structure of the volume
There is increasing awareness that the temporal dimension of speech, in particular
the relative timing of speech movements, contains rich information about phonological
structure. Relating abstract phonological structure to the temporal unfolding
of realistically variable speech data remains a major interdisciplinary challenge.
It is this challenge that is taken up in Speech timing: implications for theories of
phonology, phonetics, and speech motor control, henceforth Speech timing.
The book has eleven chapters, including a short introduction and a conclusion.
The main proposal – a sketch of a model mapping phonological representations to
continuous movements of articulators – comes in the second half of the book,
particularly in Chapters 7 and 10. The second half also includes chapters on
optimisation (Ch. 8: Optimization) and general mechanisms for timing (Ch. 9: How do
timing mechanisms work?). These provide a unique synthesis of speech and non-speech
literature, which is highly accessible for linguists, and serves to motivate
aspects of the main proposal.
The first half of the book provides a description and critique of the theory of
Articulatory Phonology, developed in the Task Dynamics framework (AP/TD)
(e.g. Browman & Goldstein 1986, Saltzman & Munhall 1989). On the view of
the authors:
AP/TD currently provides the most comprehensive account of systematic
spatiotemporal variability in speech – it represents the standard which any
alternative theory must match or surpass, and provides a clear advantage over
traditional phonological theories as a model of the speech production process
(pp. 8–9).
The review of AP/TD in Chapter 2 of Speech timing highlights some key characteristics
of the theory, particularly those related to prosodic modulation of timing
and those implemented in the Task Dynamics Application, which simulates
articulatory trajectories and the resulting acoustics from specifications of
phonological representations. The AP/TD overview sets the stage for an exposition of
empirical phenomena in Chapters 3–6 that the authors interpret as a challenge
to AP/TD and as motivation for an alternative approach.
Thus the basic rhetorical strategy is to first problematise an existing theory,
AP/TD, and then to present an alternative approach.
* E-mail: JASON.SHAW@YALE.EDU.
Phonology 38 (2021) 165–171. © The Author(s), 2021. Published by Cambridge University Press
doi:10.1017/S0952675721000099
Downloaded from https://www.cambridge.org/core. Yale University Library, on 23 Aug 2021 at 17:24:56, subject to the
Cambridge Core terms of use, available at https://www.cambridge.org/core/terms. https://doi.org/10.1017/S0952675721000099
2 Major arguments in the volume
Speech timing describes the Phonology-Extrinsic-Timing-Based Three-Component
approach, abbreviated as XT/3C. Although, as described above, this approach is
presented as an alternative to AP/TD, it has many features that may be more
familiar to readers of Phonology than AP/TD. AP/TD represents phonological
contrasts in terms of gestures, dynamically defined speech production tasks.
Gestures have temporal extent and are coordinated in time. The dynamical
approach of AP/TD defines at once discrete phonological tasks and the changes
in articulator position over time that achieve them. From this standpoint, phonology
and phonetics are not clearly distinct – rather they are different levels of
description of the same system. In contrast, XT/3C is characterised as having
three components: (i) phonological planning, (ii) phonetic planning and (iii)
motor-sensory implementation.
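The dynamical definition of a gesture in AP/TD can be made concrete with a small sketch. Task Dynamics is standardly described as modelling a gesture as a critically damped second-order point attractor, and the closed-form trajectory below follows from that description for an articulator that starts at rest. The function name, the parameter names and the numerical values are hypothetical, chosen only for illustration; this is not the Task Dynamics Application itself.

```python
import math


def gesture_position(t, x0, target, omega):
    """Position at time t (s) of a critically damped point attractor.

    Closed-form solution of
        x'' + 2*omega*x' + omega**2 * (x - target) = 0
    with x(0) = x0 and x'(0) = 0, i.e. the articulator starts at rest.
    omega is the natural frequency (rad/s); a larger omega gives a
    stiffer, faster gesture.
    """
    return target + (x0 - target) * (1.0 + omega * t) * math.exp(-omega * t)


# Illustrative values only: a 10 mm constriction closing towards 0 mm.
for ms in range(0, 301, 50):
    position = gesture_position(ms / 1000.0, x0=10.0, target=0.0, omega=40.0)
    print(f"{ms:3d} ms  aperture = {position:.3f} mm")
```

In this sketch a single equation fixes both the discrete task (the target) and the continuous movement towards it, which is the sense in which phonology and phonetics are levels of description of one system. Changing omega alters how the movement unfolds in time without any separate timing component.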
The phonology-extrinsic timing aspect of the framework, the ‘XT’ of XT/3C,
highlights a key contrast with AP/TD – phonological representations in XT/3C do
not specify time beyond the linear order of segments. Moreover, in XT/3C,
phonological representations do not commit to the particular phonetic dimensions
that will serve as targets for the speech production process or how progress
towards those goals will unfold in time. Such specifications come only at later
stages in the XT/3C speech production process, when the phonological representation
is input into the phonetic planning component. Representations in the
phonological planning component of XT/3C consist of phonemes, hierarchical
prosodic structure, metrical prominence and abstract characterisations of speech
rate and style. Fully specified utterances in the phonological planning component
project linearly ordered acoustic cues, which are given quantitative targets in the
phonetic planning module. Also in the phonetic planning module, articulatory
trajectories are computed to achieve the acoustic targets in the linear order specified
by the phonological component. The cost of deviating from acoustic targets is
balanced against the cost of movement (movement effort and movement time),
both of which could be sensitive to any aspect of the phonological representation.
Movement control is proposed to be guided by instantaneous awareness of the
location of articulators and the distance that they would have to move to achieve
acoustic targets, i.e. the movement endpoints; it is suggested that this idea can
be appropriately formalised in tau theory, following Lee (1998). Finally, the
motor-sensory implementation component issues the motor commands at appropriate
times to achieve the planned acoustic cues to phonological structure.
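The idea of guiding a movement by the remaining distance to its endpoint can be sketched in a few lines. In tau theory, tau is the time-to-closure of a gap at its current closure rate, and tau-coupling keeps two taus in a constant ratio. The specific guide function `tau_guide` and the power-law gap profile below are assumptions of this sketch, based on a common formulation of intrinsic tau guidance; they are not taken from Speech timing, and all names and values are hypothetical.

```python
def gap(t, T, x0, k):
    """Remaining gap at time t for a movement of duration T (assumed profile)."""
    return x0 * (1.0 - (t / T) ** 2) ** (1.0 / k)


def gap_rate(t, T, x0, k):
    """Analytic time derivative of gap(); negative while the gap is closing."""
    return x0 * (1.0 / k) * (1.0 - (t / T) ** 2) ** (1.0 / k - 1.0) * (-2.0 * t / T ** 2)


def tau(x, x_dot):
    """Time-to-closure of a gap x at its current rate of change x_dot."""
    return x / x_dot


def tau_guide(t, T):
    """Assumed intrinsic guide: 0.5 * (t - T**2 / t), defined for 0 < t < T."""
    return 0.5 * (t - T ** 2 / t)


# Illustrative values only: a 10 mm gap closed in 200 ms with coupling constant k.
T, x0, k = 0.2, 10.0, 0.4
for t in (0.05, 0.10, 0.15):
    ratio = tau(gap(t, T, x0, k), gap_rate(t, T, x0, k)) / tau_guide(t, T)
    print(f"t = {t:.2f} s  gap = {gap(t, T, x0, k):.3f} mm  tau ratio = {ratio:.3f}")
```

Under these assumptions the ratio of the movement's tau to the guide's tau stays at k throughout the movement, and the gap reaches zero exactly at time T. The sketch therefore shows a movement whose time course is defined relative to its endpoint rather than its onset.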
Some of the key contrasts between AP/TD and XT/3C discussed at length in
Speech timing are: (i) spatio-temporal (AP/TD) vs. symbolic (XT/3C) phonological
representations; (ii) articulatory (AP/TD) vs. acoustic (XT/3C) phonetic
targets; (iii) phonology-intrinsic (AP/TD) vs. phonology-extrinsic (XT/3C)
timing; (iv) onset-triggered (AP/TD) vs. endpoint-triggered (XT/3C) movements.
The XT/3C assumption of symbolic phonological representations, (i),
appears to dictate much of how the rest of the XT/3C architecture, i.e. (ii)–(iv),
differs from AP/TD. In AP/TD, the mapping from phonological representations
to observable speech behaviour is compositional. Since gestures are specifications
of how phonological contrasts shape speech production over time, any contextual
deviation must be attributed to another overlapping gesture. This includes the
π- and μ-gestures involved in prosodic modulation (Byrd & Saltzman 2003,
Katsika et al. 2014). Positing symbolic phonological representations liberates
XT/3C from having to account for contextual variation, including prosodic
effects, compositionally, as spatio-temporal modifications. Rather, in XT/3C,
since phonological representations lack spatio-temporal content, there is no
‘default’ to modify. Any aspect of the phonological representation, including
combinations of segmental contrast and phonological position, can condition
unique, context-specific phonetic targets. Since phonological representations
lack spatio-temporal content, this needs to be provided in other components,
which relate to (ii)–(iv).
The above aspects of XT/3C that differ from AP/TD are each motivated with
some discussion of empirical data. Most of the data that is key to the discussion
is already published. One exception is a study presenting some proof of concept
of the applicability of Lee’s theory to speech, cited as Lee & Turk (in preparation)
and presented briefly in §9.2.1 (pp. 259–261). Although the relevant data is generally
not new, it is uniquely synthesised as motivation for XT/3C.
The empirical motivation for loosening the strict compositionality of AP/TD
comes in part from the observation that phrase-final lengthening in Finnish is
attenuated when a vowel-length contrast is at stake (Nakai et al. 2009, Nakai et al.
2012). The Finnish data is interpreted as an argument against phonology-intrinsic
timing, because the lengthening effect of the prosodic boundary is not uniform
across short and long vowels, an apparent violation of the compositionality inherent
in AP/TD. The XT/3C alternative is that surface durations result from an
optimisation of cues to both phonological contrast and prosodic structure. Since
the symbolic representations of XT/3C are phonetically unconstrained, they can
reflect sensitivity to contrast maintenance that flouts the cues to prosodic boundaries
just when contrast is at stake.
Key evidence for acoustic phonetic targets includes well-known cases in which it
appears that different combinations of articulatory constrictions vary in order to
maintain relatively stable acoustic targets, such as the trade-off between tongue-dorsum
retraction and lip rounding to maintain a relatively stable F2 for /u/ in
American English. In addition to examples from speech, Speech timing also
reviews empirical studies documenting other motor behaviours, such as typing
and tapping, which tend to show greater temporal variability at movement
onsets than at movement endpoints. One speech production study is cited as evidence
for this claim (Perkell & Matties 1992). Another possible line of evidence for
endpoint-triggered movements comes from spatially conditioned gestural timing
(for discussion, see Shaw & Chen 2019), although there may be alternative explanations
for these observations that are also consistent with onset-triggered
movement.
3 Critical discussion of data and claims
Although Speech timing raises some important issues that will guide future
research on the topic, to really adjudicate between approaches it will be necessary
to consider a wider range of data and to work out XT/3C in greater detail, ideally
to the extent that it can make quantitative predictions. I elaborate on these points
in this section.
Speech timing does not review cases in which articulation maintains stability
across contexts even at the expense of acoustic cues to phonemic contrast, or
cases where articulatory differences corresponding to distinct phonemic contrasts
are masked in the acoustics. Extreme examples include gestural hiding and covert
contrasts. In gestural hiding, gestures are produced that have little or no acoustic
consequences, because of how the gestures overlap in time, such as the tongue-tip
gesture for /t/ movement after lip closure for /m/ in the English sequence perfec/t/
memory (Browman & Goldstein 1989). Covert contrasts are phonemic contrasts
that are produced distinctly in articulation but still sound indistinct to listeners,
and are thought by some researchers to be widespread in both typical and atypical
language development. These observations appear to me to present a challenge for
XT/3C or any theory in which articulatory trajectories are computed as optimal
means to achieve acoustic targets. Quite specifically, the challenge situated
within the XT/3C framework is to identify the combinations of costs such that
optimisation will result in a large articulatory movement with no acoustic
consequences.
There are also cases in which the same articulatory posture appears to be reused
across segments. For example, fricative vowels in Suzhou Chinese, including both
alveolar fricative vowels, which are restricted in their distribution to follow
alveolar fricatives, and postalveolar fricative vowels, which are not subject to such
restriction, have the same articulatory posture as the corresponding fricative
consonants in the language (Faytak 2018). Faytak argues that this and other
observations of articulatory uniformity, i.e. reusing articulatory postures across
different contrasting segments, may have their basis in learning – speakers may reuse
already practiced motor routines if they accomplish merely ‘good enough’ rather than
optimal acoustic outcomes (Faytak 2018: 29). This reasoning appears to me to be
incompatible with XT/3C, although, again, the precise predictions depend much on how
the interacting costs function in the optimisation process.
In addition to such cases of spatial uniformity, there are also cases of articulatory
timing uniformity in the literature. For example, Shaw & Davidson (2011)
pursued the hypothesis that native English speakers would produce Russian consonant
clusters, e.g. word-initial /zb dg/, etc., in ways that minimise the perceptual
distance between what they hear and what they produce – this hypothesis is consistent
with XT/3C, but it was not supported by the data. Rather, computational
simulations deriving the acoustic data from the articulatory timing of consonant
gestures suggested that speakers imposed uniformity in timing – producing
fricative–stop clusters like /zb/ with the same pattern of relative timing as stop–stop
clusters like /dg/, even though this results in different acoustic deviations from
Russian: [zb] vs. [dg]. Several studies have reported systematicity in the timing
between articulatory movements such that different consonants enter into the
same temporal relations (despite different spatial targets). Key empirical observations
include: (i) timing is language-specific, in that similar sequences of segments
can show different patterns of articulatory timing across languages (e.g. Hermes
et al. 2017), and (ii) within languages, different sequences of segments, such as
rising vs. falling sonority syllable onsets, can sometimes show similar patterns of
relative timing (e.g. Shaw et al. 2011). These patterns can be derived from
AP/TD straightforwardly, but it remains to be seen if they might also emerge
from surface optimisation of acoustic targets and articulatory costs, as in XT/3C.
Of course, the timing between consonant gestures is not always consistent across
consonants of different identities. For example, initial /kn/ clusters in German
have longer temporal lag than initial /kl/ clusters (Bombien et al. 2013) and
Italian /sC/ clusters pattern together in their timing to the exclusion of rising
sonority clusters (Hermes et al. 2013). In some cases, the differences in
gestural timing correspond to differences in syllable structure, providing some
cross-linguistic support for the observation that higher levels of linguistic
structure, such as syllables, may be found in characteristic patterns of temporal
organisation between articulatory gestures (Browman & Goldstein 1988). This
points to a broader issue – how exactly phonological structure conditions variation
in articulatory coordination. Within the XT/3C framework, any such
correspondence is indirect, in that it is mediated by the linearisation of acoustic
cues to phonological structure, although the precise consequences of this
architecture depend on the cost function at play in optimisation.
The last ten years have produced a wide empirical base of studies reporting high
temporal resolution articulatory data that can constrain theorising about principles
underlying speech timing, within and across languages. Establishing whether
XT/3C can account for this range of patterns in a principled way requires
further development of the model. In my view, two further developments are
required.
One step is to converge on a precise characterisation of how cost enters into the
optimisation of surface timing, including the integration of costs for time,
energy/effort and spatial variation at the endpoint of movement, since, within the XT/3C
framework, optimisation is required to generate quantitative predictions. Speech
timing provides an extensive discussion of the issues involved in treating speech
production as a complex optimisation problem, including relevant non-speech
motor literature, and why it is difficult to identify with precision what the relevant
costs are for speech and how these costs might interact in the optimisation process.
This discussion, mostly localised in Chapter 8, includes insightful critiques of
optimisation approaches that posit durational targets for linguistic units with
penalties for deviating from targets. Speech timing argues that the durations of
linguistic units should not be specified in underlying representations, because
they can emerge instead from optimisation. In XT/3C, phonemes, syllables,
etc., gain temporal extent only through the joint optimisation of surface durations
between acoustic landmarks and the articulatory movements that give rise to them.
One major challenge for XT/3C is therefore to derive the range of temporal
patterns, including cases of apparent systematicity in articulatory timing, from
the proposed optimisation procedure.
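As a toy illustration of why the form of the cost function matters, the sketch below balances three costs for a single movement towards a target: a quadratic penalty for undershooting the target, an effort cost for a constant-velocity movement, and a linear cost of time. All three functional forms, the weights and the function names are assumptions made for illustration; Speech timing does not commit to them, and a model of real speech would need costs defined over acoustic cues rather than a single spatial dimension.

```python
import math


def total_cost(a, T, D, w_acoustic, w_effort, w_time):
    """Cost of moving a distance a in time T towards a target at distance D."""
    undershoot_cost = w_acoustic * (D - a) ** 2  # penalty for missing the target
    effort_cost = w_effort * a ** 2 / T          # integral of velocity squared at constant velocity
    time_cost = w_time * T                       # slower movements cost more
    return undershoot_cost + effort_cost + time_cost


def optimal_movement(D, w_acoustic, w_effort, w_time):
    """Closed-form minimiser (a, T) of total_cost for positive weights.

    For a fixed a the best duration is T = a * sqrt(w_effort / w_time);
    substituting it back leaves a convex function of a whose minimum is
    a = D - sqrt(w_effort * w_time) / w_acoustic.
    """
    a = D - math.sqrt(w_effort * w_time) / w_acoustic
    if a <= 0.0:
        raise ValueError("movement costs outweigh the target: no movement is optimal")
    T = a * math.sqrt(w_effort / w_time)
    return a, T


# Hypothetical weights: the optimum undershoots the target by 2 units.
a, T = optimal_movement(D=10.0, w_acoustic=1.0, w_effort=4.0, w_time=1.0)
print(a, T, total_cost(a, T, 10.0, 1.0, 4.0, 1.0))
```

In this toy setting the duration of the movement is not stored anywhere; it falls out of the optimisation, and it changes whenever any weight changes. Raising w_acoustic shrinks the undershoot, while raising either movement cost enlarges it, so quantitative predictions depend entirely on how the weights are set.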
A second step for XT/3C, in my view, is to develop the theory of how
phonological representations project linearly ordered acoustic/auditory cues.
This may be as simple as adopting existing phonological proposals, or aspects of
them. The mapping from abstract phonemic representations to context-specific
acoustic/auditory cues in XT/3C resembles the mapping from underlying to
surface representations in generative phonology, particularly versions of
generative phonology that offer an optimisation-based unification of phonetics
and phonology (e.g. Boersma 1998, Flemming 2001). Speech timing makes it
clear that the phonological component of XT/3C is not intended to be isomorphic
with the phonological grammar, even though they make use of similar abstract
linguistic units (p. 268). However, it seems to me that there is potential to do a
lot of what phonological grammars are designed to do within the XT/3C framework,
including mappings that can be described in terms of phonological rules.
This is because, unless further constrained, the framework leaves open the possibility
that the same abstract phoneme could be mapped to radically different
acoustic realisations in different phonological contexts. One constraint on
phoneme to acoustic landmark projection proposed in XT/3C is that this
mapping balances language redundancy and acoustic/auditory information so as
to maintain consistent recognition probability over time. It would be interesting
to consider how language redundancy relates to grammatical rules, particularly
in cases in which they make potentially conflicting predictions. Working out the
mapping from phonological representations to acoustic/auditory cues would
reveal the degree to which XT/3C, as a model of speech production, can be
integrated with existing models of phonological grammar.
4 Conclusion
There are a number of properties which make Speech timing essential reading for
students and researchers interested in relating abstract phonological structure to
time-dependent articulatory and acoustic properties. From start to finish, the
book offers a balanced review of a significant amount of relevant research, some
of which is not succinctly reviewed elsewhere. Perhaps even more valuable is
that Speech timing keeps consistent tabs on the empirical facts that motivate
each component of the model, maintaining contact between data and theory.
The book can also be read with an eye to even broader conceptual issues. These
include how speech motor control relates to other aspects of motor control, such
as limb movement – ‘is speech special?’ – and whether speech is better conceptualised
as a complex computation, as in the optimisation problem characterised by
XT/3C, or as the interplay of forces tending toward equilibrium, as in the
dynamical systems of AP/TD.
The general style of the book, reasoning from theoretical predictions
(of AP/TD) to empirical data and back to theoretical alternatives, encourages
situating speech data as evidence for competing theoretical hypotheses. Although the
theoretical dichotomies in (i)–(iv) are possibly too sharp – it seems unlikely that
speech production targets are entirely articulatory or entirely acoustic/auditory,
or that articulatory timing refers only to movement onsets or only to movement
endpoints – they serve to highlight important and unresolved issues, which can
be clarified by further experimental work. Data seemingly supporting one
approach or the other could ultimately be specific to particular language varieties,
phonological contexts, experimental tasks, individual speakers or stages of language
development. For example, in AP/TD it is generally assumed that the production
goals of tone gestures are acoustic in nature (e.g. Gao 2008) and some
gesture-based models consider both movement onsets and offsets (endpoints), as
well as other landmarks, to be available for coordination (Gafos 2002).
Nevertheless, the theoretical issues raised in Speech timing can focus much
future research on computational and experimental aspects of relating phonological
structure to the speech signal, including approaches that hybridise
aspects of AP/TD and aspects of XT/3C (see e.g. Parrell & Lammert 2019).
REFERENCES
Boersma, Paul (1998). Functional phonology: formalizing the interactions between
articulatory and perceptual drives. PhD dissertation, University of Amsterdam.
Bombien, Lasse, Christine Mooshammer & Philip Hoole (2013). Articulatory
coordination in word-initial clusters of German. JPh 41. 546–561.
Browman, Catherine P. & Louis Goldstein (1986). Towards an articulatory phonology.
Phonology Yearbook 3. 219–252.
Browman, Catherine P. & Louis Goldstein (1988). Some notes on syllable structure in
articulatory phonology. Phonetica 45. 140–155.
Browman, Catherine P. & Louis Goldstein (1989). Articulatory gestures as
phonological units. Phonology 6. 201–251.
Byrd, Dani & Elliot Saltzman (2003). The elastic phrase: modeling the dynamics of
boundary-adjacent lengthening. JPh 31. 149–180.
Faytak, Matthew D. (2018). Articulatory uniformity through articulatory reuse: insights
from an ultrasound study of Sūzhōu Chinese. PhD dissertation, University of
California, Berkeley.
Flemming, Edward (2001). Scalar and categorical phenomena in a unified model of
phonetics and phonology. Phonology 18. 7–44.
Gafos, Adamantios I. (2002). A grammar of gestural coordination. NLLT 20. 269–337.
Gao, Man (2008). Mandarin tones: an Articulatory Phonology account. PhD
dissertation, Yale University.
Hermes, Anne, Doris Mücke & Bastian Auris (2017). The variability of syllable
patterns in Tashlhiyt Berber and Polish. JPh 64. 127–144.
Hermes, Anne, Doris Mücke & Martine Grice (2013). Gestural coordination of Italian
word-initial clusters: the case of ‘impure s’. Phonology 30. 1–25.
Katsika, Argyro, Jelena Krivokapić, Christine Mooshammer, Mark Tiede & Louis
Goldstein (2014). The coordination of boundary tones and its interaction with
prominence. JPh 44. 62–82.
Lee, David N. (1998). Guiding movement by coupling taus. Ecological Psychology 10.
221–250.
Lee, David N. & Alice Turk (in preparation). Vocalizing by tauG-guiding articulators.
Nakai, Satsuki, Sari Kunnari, Alice Turk, Kari Suomi & Riikka Ylitalo (2009).
Utterance-final lengthening and quantity in Northern Finnish. JPh 37. 29–45.
Nakai, Satsuki, Alice Turk, Kari Suomi, Sonia Granlund, Riikka Ylitalo & Sari
Kunnari (2012). Quantity constraints on the temporal implementation of phrasal
prosody in Northern Finnish. JPh 40. 796–807.
Parrell, Benjamin & Adam C. Lammert (2019). Bridging dynamical systems and
optimal trajectory approaches to speech motor control with dynamic movement
primitives. Frontiers in Psychology 10:2251. https://doi.org/10.3389/fpsyg.2019.02251.
Perkell, Joseph S. & Melanie L. Matties (1992). Temporal measures of anticipatory
labial coarticulation for the vowel /u/: within- and cross-subject variability. JASA
91. 2911–2925.
Saltzman, Elliot & Kevin G. Munhall (1989). A dynamical approach to gestural
patterning in speech production. Ecological Psychology 1. 333–382.
Shaw, Jason A. & Wei-rong Chen (2019). Spatially conditioned speech timing:
evidence and implications. Frontiers in Psychology 10:2726.
https://doi.org/10.3389/fpsyg.2019.02726.
Shaw, Jason A. & Lisa Davidson (2011). Perceptual similarity in input–output
mappings: a computational/experimental study of non-native speech production.
Lingua 121. 1344–1358.
Shaw, Jason A., Adamantios I. Gafos, Philip Hoole & Chakir Zeroual (2011). Dynamic
invariance in the phonetic expression of syllable structure: a case study of Moroccan
Arabic consonant clusters. Phonology 28. 455–490.
ResearchGate has not been able to resolve any citations for this publication.
Article
Full-text available
Patterns of relative timing between consonants and vowels appear to be conditioned in part by phonological structure, such as syllables, a finding captured naturally by the two-level feedforward model of Articulatory Phonology (AP). In AP, phonological form – gestures and the coordination relations between them – receive an invariant description at the inter-gestural level. The inter-articulator level actuates gestures, receiving activation from the inter-gestural level and resolving competing demands on articulators. Within this architecture, the inter-gestural level is blind to the location of articulators in space. A key prediction is that intergestural timing is stable across variation in the spatial position of articulators. We tested this prediction by conducting an Electromagnetic Articulography (EMA) study of Mandarin speakers producing CV monosyllables, consisting of labial consonants and back vowels in isolation. Across observed variation in the spatial position of the tongue body before each syllable, we investigated whether inter-gestural timing between the lips, for the consonant, and the tongue body, for the vowel, remained stable, as is predicted by feedforward control, or whether timing varied with the spatial position of the tongue at the onset of movement. Results indicated a correlation between the initial position of the tongue gesture for the vowel and C-V timing, indicating that inter-gestural timing is sensitive to the position of the articulators, possibly relying on somatosensory feedback. Implications of these results and possible accounts within the Articulatory Phonology framework are discussed.
Article
Full-text available
Current models of speech motor control rely on either trajectory-based control (DIVA, GEPPETO, ACT) or a dynamical systems approach based on feedback control (Task Dynamics, FACTS). While both approaches have provided insights into the speech motor system, it is difficult to connect these findings across models given the distinct theoretical and computational bases of the two approaches. We propose a new extension of the most widely used dynamical systems approach, Task Dynamics, that incorporates many of the strengths of trajectory-based approaches, providing a way to bridge the theoretical divide between what have been two separate approaches to understanding speech motor control. The Task Dynamics (TD) model posits that speech gestures are governed by point attractor dynamics consistent with a critically damped harmonic oscillator. Kinematic trajectories associated with such gestures should therefore be consistent with a second-order dynamical system, possibly modified by blending with temporally overlapping gestures or altering oscillator parameters. This account of observed kinematics is powerful and theoretically appealing, but may be insufficient to account for deviations from predicted kinematics—i.e., changes produced in response to some external perturbations to the jaw, changes in control during acquisition and development, or effects of word/syllable frequency. Optimization, such as would be needed to minimize articulatory effort, is also incompatible with the current TD model, though the idea that the speech production systems economizes effort has a long history and, importantly, also plays a critical role in current theories of domain-general human motor control. To address these issues, we use Dynamic Movement Primitives (DMPs) to expand a dynamical systems framework for speech motor control to allow modification of kinematic trajectories by incorporating a simple, learnable forcing term into existing point attractor dynamics. 
We show that integration of DMPs with task-based point-attractor dynamics enhances the potential explanatory power of TD in a number of critical ways, including the ability to account for external forces in planning and optimizing both kinematic and dynamic movement costs. At the same time, this approach preserves the successes of Task Dynamics in handling multi-gesture planning and coordination.
Article
Full-text available
Linguistic form is expressed in space, as articulators effectconstrictions at various points in the vocal tract, but also in time, as articulators move. A rather widespread assumption in theories of phonology and phonetics is that the temporal dimension of speech is largely irrelevant to the description and explanation of the higher-level or more qualitative aspects of sound patterns. The argument is presented that any theory of phonology must include a notion of temporal coordination of gestures. Linguistic grammars are constructed in part out of this temporal substance. Language-particular sound patterns are in part patterns of temporal coordination among gestures.1
Article
Full-text available
This paper takes a computational/experimental approach to investigating faithfulness in input–output phonological mappings. We seek to explain the results of a speech production experiment recently reported in Davidson (2010). In that experiment, native English speakers were asked to produce phonotactically unattested consonant clusters. We argue that modifications of the target consonant clusters are best understood by considering both unfaithful phonological mappings and imprecision in the speech production mechanism. To account for the pattern of unfaithful input–output mappings, we consider an extension of the P-map hypothesis (Steriade, 2008) to the production of phonotactically unattested target sequences. Predictions of the P-map for this data were established by a perception experiment in which participants were asked to discriminate between unattested consonant clusters, CC, and attested modifications, including epenthesis, prothesis, C1 change and C1 deletion. To evaluate the effects of motor noise on consonant cluster production, we constructed a computational model that allowed us to simulate consonant cluster productions under different levels of noise. Simulations reveal that, first, a large proportion of cases involving vocoid insertion, CəC, are better accounted for by noisy implementation of the target timing than by phonological epenthesis and, second, that differences in insertion patterns between stop-initial clusters and fricative-initial clusters are due to the internal temporal properties of stops and fricatives. Factoring these cases into the analysis isolates a pattern of unfaithful input–output mappings used to evaluate the P-map hypothesis. On the basis of considerable mismatches between patterns of perceptual similarity and unfaithful input–output mappings, we argue that the P-map theory of faithfulness is too restrictive to extend to the production of non-native speech.
In this study we investigate the timing of word-initial clusters and its relation to distinct phonological syllable parses in Tashlhiyt Berber and Polish. To this end, we use experimental, articulographic data (steps 1 and 2) combined with computer-based simulation (step 3). In step 1, we test how temporal properties of consonantal clusters such as overlap can vary within a single language. In step 2, we relate articulatory coordination patterns to distinct phonological syllable parses, involving simple and complex onsets, in order to calculate stability indices for each language. In step 3, we test the robustness of these stability patterns by adding anchor variability to the system. The analysis reveals that variability plays a different role in the two languages. Tashlhiyt shows a tight cluster timing with low variability in overlap across clusters. The phonetic heuristics for Tashlhiyt reveal a simple onset parse with a phonetic outcome that is strikingly robust against temporally induced variability. In contrast, Polish shows a considerably high variability in overlap between the different cluster types. The phonetic heuristics for Polish reveal a general trend towards a complex onset parse, but this time the picture is less clear. Furthermore, the Polish timing patterns are more sensitive to anchor variability than Tashlhiyt. This difference in the degree of sensitivity to variability is interpreted to be the result of different language-specific regulatory mechanisms mediating between different levels of description, such as segmental context and prosodic marking of different pragmatic functions. Natural human communication requires both stability and variability regulated by different needs and constraints within a given language, leading to differing degrees of flexibility in the hierarchical network of local weights and clocks attached to the different constituents of the prosodic hierarchy.
Studies of sensory guidance of movement in animals show that large nervous systems are not necessary for accurate control, suggesting that guidance may be based on some simple principles. In search of those principles, a theory of guidance of movement is described, which has its roots in Gibson's pathfinding work on visual control of locomotion (J. J. Gibson, 1958/this issue). The theory is based on the use of the simple but powerful variable tau, the time-to-closure of a gap at the current gap closure rate (whatever the gap's dimension: distance, angle, force, etc.); and on the principle of tau-coupling (keeping two taus in constant ratio). In this article, I show how tau-coupling could be used to synchronize movements and regulate their kinematics. Supportive experimental results are reported. I also show theoretically how sensory-taus, defined on sensory input arrays, can specify motion-taus through tau-coupling; how the braking procedure of keeping tau-dot stable is a particular case of tau-coupling; and how tools for steering (e.g., limbs, whole bodies, cars, or aircraft) could be built from tau-couplings, which would enable steering control in a variety of situations, including steering straight and curved courses to goals, steering and controlling speed at the same time, steering around obstacles, and asymptoting on surfaces as when landing. Some movements also involve intrinsic guidance from within, and a hypothesis on intrinsic guidance by tau is introduced, supported by experiments spanning different activities.
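The two definitions in this abstract, tau as the time-to-closure of a gap and tau-coupling as a constant ratio between two taus, can be made concrete with a minimal sketch. The function names, the sign convention (closure rate taken as positive while the gap shrinks), and the constants `k` and `c` are illustrative assumptions, not notation from the article. The sketch relies on one mathematical consequence of the definition: if the tau of gap y stays at k times the tau of gap x, then y is proportional to x raised to the power 1/k.

```python
def tau(gap, closure_rate):
    """Time-to-closure of a gap at its current closure rate.

    Assumed convention: closure_rate is positive while the gap shrinks.
    """
    return gap / closure_rate


def coupled_gap(x, k, c=1.0):
    """Size of a gap y whose tau is held at k times the tau of gap x.

    Solving tau_y = k * tau_x gives y = c * x**(1/k); c is an arbitrary
    scale constant (hypothetical, for illustration).
    """
    return c * x ** (1.0 / k)


def coupled_closure_rate(x, x_rate, k, c=1.0):
    """Closure rate of y = c * x**(1/k), obtained by the chain rule."""
    return (c / k) * x ** (1.0 / k - 1.0) * x_rate


def tau_ratio(x, x_rate, k, c=1.0):
    """Ratio tau_y / tau_x for the coupled gaps; equals k by construction."""
    y = coupled_gap(x, k, c)
    y_rate = coupled_closure_rate(x, x_rate, k, c)
    return tau(y, y_rate) / tau(x, x_rate)


if __name__ == "__main__":
    # A 10-unit gap closing at 2 units per second.
    print(tau(10.0, 2.0))
    # The ratio of the two taus does not depend on the current gap size.
    for x in (8.0, 4.0, 1.0):
        print(x, tau_ratio(x, x_rate=0.5, k=0.4))
```

Because the ratio is the same at every gap size, both gaps reach zero at the same moment, which is the sense in which coupling taus can synchronize two movements.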
This study investigates the coordination of boundary tones as a function of stress and pitch accent. Boundary tone coordination has not been experimentally investigated previously, and the effect of prominence on this coordination, and whether it is lexical (stress-driven) or phrasal (pitch accent-driven) in nature is unclear. We assess these issues using a variety of syntactic constructions to elicit different boundary tones in an Electromagnetic Articulography (EMA) study of Greek. The results indicate that the onset of boundary tones co-occurs with the articulatory target of the final vowel. This timing is further modified by stress, but not by pitch accent: boundary tones are initiated earlier in words with non-final stress than in words with final stress regardless of accentual status. Visual data inspection reveals that phrase-final words are followed by acoustic pauses during which specific articulatory postures occur. Additional analyses show that these postures reach their achievement point at a stable temporal distance from boundary tone onsets regardless of stress position. Based on these results and parallel findings on boundary lengthening reported elsewhere, a novel approach to prosody is proposed within the context of Articulatory Phonology: rather than seeing prosodic (lexical and phrasal) events as independent entities, a set of coordination relations between them is suggested. The implications of this account for prosodic architecture are discussed.
Intra-gestural and inter-gestural coordination in German word-initial consonant clusters /kl, kn, ks, pl, ps/ is investigated in four speakers by means of EMA as a function of segmental make-up and prosodic variation, i.e. prosodic boundary strength and lexical stress. Segmental make-up is shown to determine the extent of articulatory overlap of the clusters, with /kl/ exhibiting the highest degree, followed by /pl/, /ps/, /ks/ and finally /kn/. Prosodic variation does not alter this order. However, overlap is shown to be affected by lexical stress in /kl/ and /ps/ and by boundary strength in /pC/ clusters. This indicates that boundary effects on coordination are stronger for clusters with little inter-articulator dependence (e.g. lips + tongue tip in /pl/ vs. tongue back + tongue tip in /kl/). The results also show that the extent to which prosodic factors affect articulation interacts with the position of the affected segment in the sound sequence: in general, boundary strength strongly affects the cluster's first consonant while lexical stress influences the second consonant. This indicates that prosodic effects are strongest at their source (i.e. the boundary or the stressed nucleus) and decrease in strength with distance from their source. However, prosodic lengthening effects can reach the more distal consonant in clusters with a high degree of overlap and high inter-articulator dependence. Besides these aspects, the discussion covers differences in measures of articulatory coordination.