Stefanie Shattuck-Hufnagel's research while affiliated with Massachusetts Institute of Technology and other places

Publications (205)

Article
Full-text available
The LaMIT database consists of recordings of 100 Italian sentences. The sentences in the database were designed to include all phonemes of the Italian language and also to take into account the typical frequency of each phoneme in written Italian. Four native adult speakers of Standard Italian, raised and living in Rome, Italy, two female and two...
Article
No PDF available
In recent years, speech recognition systems have dramatically improved in performance through the development of general machine learning techniques. However, it is not always straightforward to interpret the mapping from the signal to the detected category. In the present work, we focus on the goal of transparency, specif...
Article
No PDF available
This study investigated how acoustic cues to features in Italian vowels are affected by lexical stress and tonal prominence. One aspect of an analysis within the LaMIT (Lexical Access Model for Italian) project, this study analyzed the impact of prosodic environment on acoustic cues to features. The LaMIT corpus consists of 10...
Article
No PDF available
Detecting and interpreting individual acoustic cues to identify features that distinguish among speech sounds is thought to play a key role in automatic speech processing, modeling human speech perception, detecting and diagnosing speech disabilities (and tracking the effects of treatment for those disabilities), and study...
Article
No PDF available
This study examined how tonal prominence impacts the acoustic cues to features of Italian vowels. It was one aspect of an analysis of the LaMIT (Lexical Access Model for Italian) project [Di Benedetto et al., “Speech recognition of spoken Italian based on detection of landmarks and other acoustic cues to distinctive featur...
Article
No PDF available
A speech annotation system developed for English that identifies landmarks and other acoustic cues to distinctive features (Stevens, 2002; Huilgol et al., 2019) has been extended to Spanish and Korean. This process includes retrieving the (allo)phone sequence for the words of a target utterance from a standard lexicon, con...
Article
No PDF available
Landmarks (Stevens, 2002) are acoustic cues that are correlated with certain changes in speech articulation, and can be used to infer some of the distinctive features useful for speech recognition, largely the manner features. This project identifies and organizes the processing steps involved in extracting eight types of...
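For illustration only, the sketch below shows one common way to localize abrupt spectral changes of the kind landmark detectors look for: peak-picking the rate of rise of band energies. The band edges, frame and hop sizes, and the threshold are assumed values for the sketch, not the processing steps or settings of the project described above.

```python
import numpy as np
from scipy.signal import spectrogram, find_peaks

def abrupt_spectral_changes(x, fs,
                            band_edges=((0, 400), (800, 1500), (1200, 2000),
                                        (2000, 3500), (3500, 5000), (5000, 8000)),
                            frame_ms=16, hop_ms=1, delta_ms=50, threshold_db=9.0):
    """Return times (s) of candidate abrupt spectral changes (landmark candidates).

    Illustrative parameters: the band edges, frame/hop sizes, and the 9 dB
    rate-of-rise threshold are assumptions, not published settings.
    """
    nperseg = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    f, t, S = spectrogram(x, fs, nperseg=nperseg,
                          noverlap=nperseg - hop, mode="magnitude")
    lag = max(1, int(delta_ms / hop_ms))      # frames spanned by the rate-of-rise lag
    candidates = set()
    for lo, hi in band_edges:
        band = S[(f >= lo) & (f < hi), :].sum(axis=0)
        band_db = 20 * np.log10(band + 1e-10)
        # Coarse rate of rise/fall: dB change across the lag, in either direction.
        ror = np.abs(band_db[lag:] - band_db[:-lag])
        peaks, _ = find_peaks(ror, height=threshold_db, distance=lag)
        candidates.update(np.round(t[peaks + lag // 2], 3))
    return sorted(candidates)
```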
Article
The surface phonetic details of an utterance affect how ‘native’ a speaker sounds. However, studies have shown that children’s acquisition of context-appropriate variation (sometimes called allophones) is late. This study’s goal was to understand how caregivers use phonetic variation in the production of American English /t/ in child-directed speec...
Preprint
Full-text available
Modelling the process that a listener actuates in deriving the words intended by a speaker requires setting a hypothesis on how lexical items are stored in memory. This work aims to develop a system that imitates humans when identifying words in running speech and, in this way, to provide a framework to better understand human speech processing. We...
Article
Full-text available
Two types of consonant gemination characterize Italian: lexical and syntactic. Italian lexical gemination is contrastive, so that two words may differ by only one geminated consonant. In contrast, syntactic gemination occurs across word boundaries and affects the initial consonant of a word in specific contexts, such as the presence of a monosyllab...
Article
Full-text available
Two conflicting views have been advanced of what defines ‘default’ high pitch accents in various West Germanic languages, including English: One equates these accents fundamentally with a rise to a high turning point, while the other focuses on the fall from it. Both views arise from the assumption within Autosegmental-Metrical theory that the phon...
Preprint
Full-text available
Two types of consonant gemination characterize Italian: lexical and syntactic. Italian lexical gemination is contrastive, so that two words may differ by only one geminated consonant. In contrast, syntactic gemination occurs across word boundaries, and affects the initial consonant of a word in specific contexts, such as the presence of a monosylla...
Preprint
Full-text available
The purpose of this project was to derive a reliable estimate of the frequency of occurrence of the 30 phonemes - plus their geminated consonant counterparts - of the Italian language, based on four selected written texts. Since no comparable dataset was found in previous literature, the present analysis may serve as a reference in future studies. Four t...
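As a toy illustration of this kind of corpus count (not the procedure used in the study), the sketch below tallies relative phoneme frequencies from phonemically transcribed text. The `transcribe` callable stands in for an Italian grapheme-to-phoneme converter and is a placeholder assumption.

```python
from collections import Counter

def phoneme_frequencies(texts, transcribe):
    """Estimate relative phoneme frequencies from written texts.

    `transcribe` is any callable mapping a text string to a list of phoneme
    symbols (e.g., an Italian grapheme-to-phoneme converter); it is a
    placeholder here, not the tool used in the paper.
    """
    counts = Counter()
    for text in texts:
        counts.update(transcribe(text))
    total = sum(counts.values())
    return {ph: n / total for ph, n in counts.most_common()}

# Toy usage with a mock transcriber that splits space-separated phoneme strings.
if __name__ == "__main__":
    mock = lambda s: s.split()
    print(phoneme_frequencies(["k a z a", "k a n e", "g a t t o"], mock))
```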
Chapter
Our preliminary data and experiments show the potential of artificial intelligence techniques to identify Parkinson’s disease and to assess its extent using information extracted from speech. Our conclusions are based on several prospective studies with different corpora of parkinsonian and control subjects modelled following differe...
Article
No PDF available
The purpose of this project was to derive a reliable estimate of the frequency of occurrence of the 30 phonemes – plus their geminated consonant counterparts – of the Italian language, based on four selected written texts. Since no comparable dataset was found in previous literature, the present analysis may serve as a reference...
Article
No PDF available
The surface phonetic details of an utterance affect how ‘native’ a speaker sounds. However, studies have shown that children's acquisition of context-appropriate variation (sometimes called allophones) is late. This study's goal was to understand how caregivers use phonetic variation in the production of American English /...
Article
No PDF available
Modeling the process that a listener actuates in deriving the words intended by a speaker requires setting a hypothesis on how lexical items are stored in memory. Stevens’ model (2002) postulates that lexical items are stored in memory according to distinctive features, and that these features are hierarchically organized. Th...
Article
No PDF available
Acoustic cues to lexical distinctive features can be found by examining speech waveform and spectrogram measurements, and can provide a more detailed analysis of speech than is currently provided by methods for identifying atypical speech production. Applying this individual-feature-cue framework to the clinical diagnosis...
Article
No PDF available
Several speech databases have been manually annotated for individual acoustic cues to distinctive features. The acoustic cue labels include 8 landmark types (Stevens 2002) related to the manner features, and 32 other types related to place and voicing features. The labeled data include isolated words and syllables, read sp...
Article
No PDF available
Current automatic speech recognition systems have been shown to be quite accurate in handling human language spoken by typical healthy speakers. However, there are still many challenges to overcome in applying these systems to the task of detecting atypically produced speech. Analysis of acousti...
Conference Paper
Full-text available
Gestures can be described in various terms including their form, their relationship to spoken prosody, their semantic relationship with an utterance, or their pragmatic functions (see [1] for a review). However, McNeill's [2] classic descriptive types, with referential categories (iconic, metaphoric and deictic) distinct from a rhythmic category (b...
Article
This study examines the acoustic realizations of American English intervocalic flaps in the TIMIT corpus, using the landmark-critical feature-cue-based framework. Three different acoustic patterns of flaps are described: (i) both closure and release landmarks present, (ii) only the closure landmark present, and (iii) both landmarks deleted. The pat...
Chapter
This chapter introduces a theoretical framework, Optimal Control Theory, which will enable a phonology-extrinsic-timing-based, three-component model to determine values of controlled variables, and to model the influence of multiple factors on these parameter values. Key features of Optimal Control Theory models are discussed, as well as evidence f...
Book
This is a book about the architecture of the speech-production planning process and speech motor control. It is written in reaction to a debate in the literature about the nature of phonological representations, which are proposed to be spatiotemporal by some, and symbolic (atemporal) by others. Making this choice about the nature of phonological r...
Chapter
This chapter begins to motivate the development of an alternative approach to speech production by pointing out three potential difficulties with the highly-successful Articulatory Phonology/Task Dynamics approach. First, it discusses the extensive nature of modifications to AP/TD default specifications required to account for the wide variety of s...
Chapter
Effects of prosodic structure on surface phonetics are modeled in AP/TD in two ways: 1) via a set of PI and Mu T adjustment mechanisms used to model lengthening effects at boundaries and on prominent syllables, and 2) via a hierarchy of coupled syllable, cross-word foot, and phrase oscillators, used to model poly-subconstituent shortening effects,...
Chapter
This chapter addresses the nature of the general-purpose timekeeping mechanisms that are assumed in phonology-extrinsic-timing models of speech production. The first part of the chapter discusses some current questions about the nature of these mechanisms. The second part of the chapter presents Lee’s General Tau theory (Lee 1998, 2009), a theory o...
Chapter
Two key features of the current AP/TD coupled-oscillator approach to movement coordination are that 1) coordination among gestures is treated as relative timing control, accomplished via planning-oscillator phase relationships, rather than coordination based on spatial information or absolute timing, and 2) coordination is based on the (relative) t...
Chapter
This chapter presents evidence that challenges models in which phonological representations are temporal in nature, and where timing mechanisms are phonology-specific and intrinsic to the phonology. For example, evidence for separate representations of 1) movement targets vs. other parts of movement, and 2) spatial vs. temporal aspects, is difficul...
Chapter
Evidence presented in previous chapters suggests consideration of an alternative to the coupled-oscillator approach to modeling human speech planning and production processes. One alternative approach is based on the symbolic phonemic representations of Generative Phonology. This approach requires a separate mechanism to translate these symbolic re...
Chapter
This chapter summarizes the basic mechanisms of the Articulatory Phonology model, currently the most thoroughly worked-out model in the literature, with a focus on its system-intrinsic mechanisms used to account for systematic variation in speech timing. Key features of the model are reviewed, and oscillator-based mechanisms are described for timin...
Chapter
This chapter presents the outline of a model of speech-production planning, based on symbolic phonology and the specification of surface-timing patterns using general-purpose timekeeping mechanisms. This phonology-extrinsic-timing-based, three-component (XT/3C) model includes a Phonological Planning Component, to set and prioritize the goals for an...
Chapter
This volume compares two very different approaches to modeling speech planning: Articulatory Phonology, with quantitative phonological representations and a set of phonology-intrinsic timing mechanisms, and XT/3C, an alternative model with non-quantitative symbolic phonological representations and general-purpose phonology-extrinsic timing mechanis...
Article
Full-text available
The goals of this paper are (1) to discuss the key features of existing articulatory models of speech production that govern their approaches to timing, along with advantages and disadvantages of each, and (2) to evaluate these features in terms of several pieces of evidence from both the speech and nonspeech motor control literature. This evidence...
Article
Full-text available
The literature documents the impact of Parkinson’s Disease (PD) on speech, but no study has analyzed in detail the importance of the distinct phonemic groups for the automatic identification of the disease. This study presents new approaches that are evaluated in three different corpora containing speakers suffering from PD with two main objectives: to...
Article
No PDF available
Acoustic cues are robust elements that can be used to infer information contained in the speech signal, such as underlying linguistic distinctive features and the words intended by the speaker (Stevens JASA 2002). Yet, most current automatic speech recognition systems do not take advantage of a feature-cue-based framework...
Article
No PDF available
Acoustic cues are properties of the speech signal that provide information about the distinctive features of the speaker’s intended words and phonemes. Analysis of acoustic cues can indicate reductions and modifications in speech, in which landmarks and other feature cues are deleted, inserted, or substituted for others, a...
Article
Since the publication of Speaking in 1989, with its extraordinary goal of modelling the entire process of human speech generation from message conceptualisation to articulation, encompassing results from a wide range of empirical studies, much new information has emerged about three aspects of speech production that were not clearly in focus at tha...
Article
Acoustic cues are characteristic patterns in the speech signal that provide lexical, prosodic, or additional information, such as speaker identity. In particular, acoustic cues related to linguistic distinctive features can be extracted and marked from the speech signal. These acoustic cues can be used to infer the intended underlying phoneme seque...
Article
Irregular pitch periods (IPPs) are associated with grammatically, pragmatically, and clinically significant types of nonmodal phonation, but are challenging to identify. Automatic detection of IPPs is desirable because accurately hand-identifying IPPs is time-consuming and requires training. The authors evaluated an algorithm developed for creaky v...
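For illustration, a minimal sketch of one way to flag irregular pitch periods from a sequence of glottal pulse times is given below; the neighborhood comparison and the 25% deviation threshold are assumptions for the sketch, not the evaluated algorithm, and the pitch-mark front end that supplies the pulse times is assumed rather than provided.

```python
import numpy as np

def flag_irregular_periods(pulse_times, rel_threshold=0.25):
    """Flag pitch periods whose duration deviates markedly from their neighbors.

    `pulse_times`: glottal pulse instants in seconds (e.g., from a pitch-mark
    or peak-picking front end, which is assumed here). A period is flagged
    when it differs from the median of the surrounding periods by more than
    `rel_threshold` (an illustrative 25%, not a published criterion).
    """
    periods = np.diff(np.asarray(pulse_times, dtype=float))
    flags = np.zeros(len(periods), dtype=bool)
    for i in range(len(periods)):
        lo, hi = max(0, i - 2), min(len(periods), i + 3)
        # Compare each period to its neighbors, excluding the period itself.
        neighborhood = np.delete(periods[lo:hi], i - lo)
        if neighborhood.size == 0:
            continue
        ref = np.median(neighborhood)
        flags[i] = abs(periods[i] - ref) > rel_threshold * ref
    return flags
```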
Conference Paper
Acoustic cues are properties of the speech signal that provide information about the distinctive features of the speaker’s intended words and phonemes. Analysis of acoustic cues can indicate reductions and modifications in speech, in which landmarks and other feature cues are deleted, inserted, or substituted for others, and can be informative in d...
Article
Sequences of similar (i.e., partially identical) words can be hard to say, as indicated by higher error frequencies and longer reaction and execution times. This study investigates the role of the location of this partial identity and the accompanying differences, i.e., whether errors are more frequent with mismatches in word onsets (top cop), codas (top tock...
Article
Irregular pitch periods (IPPs) occur in a wide variety of speech contexts and can support automatic speech recognition (ASR) systems by signaling word boundaries, phrase endings, and certain prosodic contours. IPPs can also provide information about emotional content, dialect, and speaker identity. The ability to automatically detect IPPs is partic...
Article
Full-text available
Many studies have documented a close timing relationship between speech prosody and co-speech gesture, but some studies have not, and it is unclear whether these differences in speech-gesture alignment are due to different speaking tasks, different target gesture types, different prosodic elements, different definitions of alignment, or even differ...
Article
This talk will present evidence supporting a model of speech production which includes phonology-extrinsic timing, and consists of 1) a Phonological Planning Component to plan the goals for an utterance, including the segmental and prosodic structure for an utterance and non-grammatical goals such as speaking quickly or in a particular style, 2) a...
Preprint
Full-text available
Prosodic categories, like other grammatical categories, are realized with wide variability, yet listeners interpret linguistic meaning with apparent ease. ToBI aims to capture the linguistically meaningful prosodic elements of utterances, but does not capture the variability in acoustic cues that the labeller (and listener) must interpret in order...
Poster
One viewpoint of early prosody is that utterances are highly simplified in comparison to adult models, in part, due to biological constraints such as incomplete control of pitch production (Lieberman, 1967; Snow, 2006). In contrast, research in Catalan and Spanish shows that early utterances (under 2 years) consist of the basic intonational categor...
Article
Irregular pitch periods (IPPs) occur in a wide variety of speech contexts and can support automatic speech recognition systems by signaling word boundaries, phrase endings, and certain prosodic contours. IPPs can also provide information about emotional content, dialect, and speaker identity. The ability to automatically detect IPPs is particularly...
Article
Full-text available
Although a large number of acoustic indicators have already been proposed in the literature to evaluate the hypokinetic dysarthria of people with Parkinson’s Disease, the goal of this work is to identify and interpret new reliable and complementary articulatory biomarkers that could be applied to predict/evaluate Parkinson’s Disease from a diadocho...
Conference Paper
Full-text available
New tools based on speech analysis can improve and accelerate diagnosis of Parkinson's Disease. In this work, specific segments of speech around the so-called acoustic landmarks are used with different families of features, such as acoustic cues or Rasta-PLP, and with GMM-UBM-Blend classification methods to detect Parkinson's Disease. Resul...
Article
The production of speech and music are two human behaviors that involve complex hierarchical structures with implications for timing. Timing constraints may arise from a human proclivity to form 'self-organized' metrical structures for perceived and produced event sequences, especially those that involve repetition. To test whether the propensity t...
Article
Earlier work has shown that speakers of American English often (although not always) produce irregular pitch periods (or other changes in voice quality) at prosodically significant locations, such as the onset of a new intonational phrase or a pitch-accented syllable, when those constituents begin with a [+voiced] phonemic segment (Pierrehumbert...
Article
Acoustic landmarks are abrupt spectral changes that signal the underlying manner features of phonemes (Stevens 2002). Our goal in developing an automatic method to detect these landmarks is to create a robust, knowledge-based approach to phoneme extraction in automatic speech signal processing. One challenge in such an approach is posed by massive...
Article
An algorithm was developed for detecting glides (/w/, /j/, /r/, /l/, or /h/) in spoken English and detecting their place of articulation using an analysis of acoustic landmarks [Stevens 2002]. The system uses Gaussian mixture models (GMMs) trained on a subset of the TIMIT speech database annotated with acoustic landmarks. To characterize the glide...
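The following sketch shows the general shape of a GMM likelihood-ratio detector of the kind described above, using scikit-learn; the frame-level features, component counts, and two-class setup are illustrative assumptions rather than the published system.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_glide_detector(glide_frames, other_frames, n_components=8, seed=0):
    """Train a two-class GMM likelihood-ratio detector on frame-level features.

    Inputs are (n_frames, n_features) arrays of acoustic measurements around
    candidate landmarks; the specific features are assumptions, not the paper's.
    """
    gmm_glide = GaussianMixture(n_components, covariance_type="diag", random_state=seed)
    gmm_other = GaussianMixture(n_components, covariance_type="diag", random_state=seed)
    gmm_glide.fit(glide_frames)
    gmm_other.fit(other_frames)
    return gmm_glide, gmm_other

def glide_log_likelihood_ratio(frames, gmm_glide, gmm_other):
    """Per-frame log-likelihood ratio; positive values favor the glide class."""
    return gmm_glide.score_samples(frames) - gmm_other.score_samples(frames)
```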
Article
Non-native speakers often have difficulty accurately producing and perceiving the syllable structure of a second language. For example, Japanese learners of English often insert epenthetic vowels when producing English words, e.g., stress produced as /sutoresu/. Similarly, when asked to count syllables in spoken English words, they frequently overe...
Article
Stevens (2002) proposes that the distinctive feature [glide] is signaled by an acoustic landmark, i.e., an amplitude/F1 minimum, usually during a phonated region, but this hypothesis has not been tested extensively in languages other than American English. This study analyzes acoustic realizations of tapped /ɾ/ and trilled /r/ sounds in European Sp...
Article
A model of human speech processing based on individual cues to distinctive features of phonemes, such as the acoustic landmarks (abrupt spectral changes) that signal manner features, is proposed to provide a more accurate account of American English flapping of /t/ and /d/ than an allophonic or phone-based model. To test this hypothesis, this study...
Article
Non-word repetition tasks have been used to diagnose children with various developmental difficulties with phonology, but these productions have not been phonetically analyzed to reveal the nature of the modifications produced by children diagnosed with SLI, autism spectrum disorder or dyslexia compared to those produced by typically-developing chi...
Article
Changes in phonation patterns have long been studied as correlates of various linguistic elements, such as the occurrence of irregular pitch periods (IPPs) at significant locations in prosodic structure (in phrase-initial, phrase-final, and pitch accented contexts) and word-final voiceless stops, especially /t/. But less is known about the developm...
Article
This paper describes methods for evaluating automatic speech recognition (ASR) systems in comparison with human perception results, using measures derived from linguistic distinctive features. Error patterns in terms of manner, place and voicing are presented, along with an examination of confusion matrices via a distinctive-feature-distance metric...
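A minimal sketch of a distinctive-feature-distance computation over a confusion table is given below; the toy feature table and the averaging scheme are assumptions, intended only to illustrate the idea of scoring errors by how many features separate the confused phones, not the metric used in the paper.

```python
# Toy feature table: each phone is a dict of (feature -> value).
FEATURES = {
    "p": {"manner": "stop",      "place": "labial",   "voiced": 0},
    "b": {"manner": "stop",      "place": "labial",   "voiced": 1},
    "t": {"manner": "stop",      "place": "alveolar", "voiced": 0},
    "s": {"manner": "fricative", "place": "alveolar", "voiced": 0},
    "m": {"manner": "nasal",     "place": "labial",   "voiced": 1},
}

def feature_distance(a, b):
    """Count the manner/place/voicing features on which two phones differ."""
    fa, fb = FEATURES[a], FEATURES[b]
    return sum(fa[k] != fb[k] for k in fa)

def weighted_error(confusions):
    """Average feature distance of errors in a confusion dict {(ref, hyp): count}."""
    errors = {(r, h): n for (r, h), n in confusions.items() if r != h}
    total = sum(errors.values())
    return sum(feature_distance(r, h) * n for (r, h), n in errors.items()) / total

# Example: /p/ -> /b/ (one feature away) is a milder error than /p/ -> /s/ (two away).
print(weighted_error({("p", "b"): 3, ("p", "s"): 1, ("t", "t"): 10}))
```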
Article
This paper tests the hypothesis that distinctive feature classifiers anchored at phonetic landmarks can be transferred cross-lingually without loss of accuracy. Three consonant voicing classifiers were developed: (1) manually selected acoustic features anchored at a phonetic landmark, (2) MFCCs (either averaged across the segment or anchored at the...
Article
The ability to find the beat in a sequence of auditory events may be linked to the ability to learn vocal communication, raising the question of how beat structure in speech events relates to that in other event sequences. We conducted a series of entrainment experiments designed to compare spoken syllable repetition with tapping. Producing taps to...
Article
Child speakers of American English in the age range 2;6-3;6 often produce vowel final noise (VFN), or preaspiration, for [-voice] labial and velar stops in coda position of monosyllabic words [e.g., Shattuck-Hufnagel, Hanson, and Zhao, “Feature-cue-based processing of speech: A developmental perspective,” in Proc. ICPhS 2015]. In contrast, the fema...
Article
Full-text available
Understanding the role of prosody in encoding linguistic meaning and in shaping phonetic form requires the analysis of prosodically annotated speech drawn from a wide variety of speech materials. Yet obtaining accurate and reliable prosodic annotations for even small datasets is challenging due to the time and expertise required. We discuss several...
Article
A matcher for a distinctive feature-based lexical access system is tested using degraded feature inputs. The input speech comprises 16 conversation files from a map task in American English, spoken by 8 female speakers. A sequence of predicted features is produced from a generation algorithm, and the results are randomly degraded at levels from ze...
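As an illustration of matching degraded feature inputs against a feature-based lexicon (not the system tested in the study), the sketch below deletes detected features at a given rate and scores lexical candidates by the fraction of their features recovered from the input; the toy lexicon and scoring rule are assumptions.

```python
import random

# Toy feature-based lexicon: each word is a sequence of feature bundles.
LEXICON = {
    "pat": [frozenset({"stop", "labial", "-voice"}),
            frozenset({"vowel", "low"}),
            frozenset({"stop", "alveolar", "-voice"})],
    "bat": [frozenset({"stop", "labial", "+voice"}),
            frozenset({"vowel", "low"}),
            frozenset({"stop", "alveolar", "-voice"})],
}

def degrade(feature_seq, rate, rng=random):
    """Randomly delete individual detected features at the given rate."""
    return [frozenset(f for f in bundle if rng.random() > rate) for bundle in feature_seq]

def match(observed, lexicon):
    """Score each word by the fraction of its lexical features found in the input."""
    scores = {}
    for word, segments in lexicon.items():
        if len(segments) != len(observed):
            continue
        hits = sum(len(seg & obs) for seg, obs in zip(segments, observed))
        scores[word] = hits / sum(len(seg) for seg in segments)
    return max(scores, key=scores.get), scores

# Usage: degrade the features of "pat" by 20% and see which word still wins.
observed = degrade(LEXICON["pat"], rate=0.2, rng=random.Random(1))
print(match(observed, LEXICON))
```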
Article
This study examined the emergence of the phonetic variants (often called allophones) of alveolar phonemes in the speech production of 2-year-olds. Our specific question was: Does the child start by producing a "canonical" form of a phoneme (e.g., /t/ with a clear closure and a release burst), only later learning to produce its other phonetic varian...
Chapter
In this chapter we review the evidence that, early on in the planning process, speakers know something about the larger structure of an utterance they plan to produce. In separate sections we address questions like the following: 1) Does the weight of the evidence suggest that a speaker plans ahead? 2) Does this evidence suggest that the speaker ge...
Article
In his JASA (2002) paper, Ken Stevens proposed a model of human speech recognition based on extracting acoustic cues to the distinctive feature contrasts of the speaker's intended words. Following Halle (1995), he distinguished between cues to manner features (e.g., abrupt spectral events associated with constrictions/widenings of the vocal tr...
Article
Full-text available
Prosodic and articulatory factors influence children's production of inflectional morphemes. For example, plural -s is produced more reliably in utterance-final compared to utterance-medial position (i.e., the positional effect), which has been attributed to the increased planning time in utterance-final position. In previous investigations of plur...
Article
In the first part of the paper, we summarize the linguistic factors that shape speech timing patterns, including the prosodic structures which govern them, and suggest that speech timing patterns are used to aid utterance recognition. In the spirit of optimal control theory, we propose that recognition requirements are balanced against requirements...