Article

Differential Auditory and Visual Phase-Locking Are Observed during Audio-Visual Benefit and Silent Lip-Reading for Speech Perception


Abstract

Speech perception in noisy environments is enhanced by seeing the facial movements of communication partners. However, the neural mechanisms by which audio and visual speech are combined are not fully understood. We explored phase locking to auditory and visual signals in MEG recordings from 14 human participants (6 females, 8 males) who reported words from single spoken sentences. We manipulated acoustic clarity and the presence of visual speech such that critical speech information was available in the auditory modality, the visual modality, or both. Coherence analysis revealed that both auditory and visual speech envelopes (auditory amplitude modulations and lip aperture changes) were phase-locked to 2-6 Hz brain responses in auditory and visual cortex, consistent with entrainment to syllable-rate speech components. Partial coherence analysis, used to separate neural responses to the correlated audio-visual signals, showed above-zero phase locking to the auditory envelope in occipital cortex during audio-visual (AV) speech. Furthermore, phase locking to auditory signals in visual cortex was enhanced for AV speech compared with audio-only (AO) speech matched for intelligibility. Conversely, auditory regions of the superior temporal gyrus (STG) did not show above-chance partial coherence with visual speech signals during AV conditions, but did show partial coherence during visual-only (VO) conditions. Hence, visual speech enabled stronger phase locking to auditory signals in visual areas, whereas phase locking to visual speech in auditory regions occurred only during silent lip-reading. We interpret these differences in cross-modal interactions between auditory and visual speech signals in line with cross-modal predictive mechanisms during speech perception.

Significance Statement: Verbal communication in noisy environments is challenging, especially for hearing-impaired individuals. Seeing the facial movements of communication partners improves speech perception when auditory signals are degraded or absent. The neural mechanisms supporting lip-reading and audio-visual benefit are not fully understood. Using MEG recordings and partial coherence analysis, we show that speech information is used differently in brain regions that respond to auditory and visual speech. While visual areas use visual speech to improve phase locking to auditory speech signals, auditory areas do not show phase locking to visual speech unless auditory speech is absent and visual speech substitutes for the missing auditory signals. These findings highlight brain processes that combine visual and auditory signals to support speech understanding.
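The partial-coherence computation at the heart of this analysis can be sketched in a few lines. The code below is a minimal illustration with synthetic stand-ins for a source-localized MEG time course and the two speech envelopes; the names, sampling rate, and toy generative model are ours, not the authors' pipeline.

```python
# Sketch: coherence vs. partial coherence between a neural signal and the
# auditory envelope, with the correlated lip signal partialled out.
import numpy as np
from scipy.signal import csd, welch

fs = 200                 # assumed sampling rate (Hz)
nper = 2 * fs            # 2 s Welch windows -> 0.5 Hz resolution
rng = np.random.default_rng(0)

audio_env = rng.standard_normal(60 * fs)                    # auditory envelope (toy)
lip_env = 0.6 * audio_env + rng.standard_normal(60 * fs)    # correlated lip aperture (toy)
meg = 0.5 * audio_env + 0.3 * lip_env + rng.standard_normal(60 * fs)

def coherency(x, y):
    """Complex coherency R_xy(f) = S_xy / sqrt(S_xx * S_yy)."""
    f, sxy = csd(x, y, fs=fs, nperseg=nper)
    _, sxx = welch(x, fs=fs, nperseg=nper)
    _, syy = welch(y, fs=fs, nperseg=nper)
    return f, sxy / np.sqrt(sxx * syy)

f, r_ma = coherency(meg, audio_env)      # MEG <-> auditory envelope
_, r_ml = coherency(meg, lip_env)        # MEG <-> lip aperture
_, r_la = coherency(lip_env, audio_env)  # lip aperture <-> auditory envelope

coh = np.abs(r_ma) ** 2                  # ordinary coherence
# Partial coherence: remove what the lip signal linearly explains.
r_partial = (r_ma - r_ml * r_la) / np.sqrt(
    (1 - np.abs(r_ml) ** 2) * (1 - np.abs(r_la) ** 2))
pcoh = np.abs(r_partial) ** 2

band = (f >= 2) & (f <= 6)               # syllable-rate band used in the study
print(coh[band].mean(), pcoh[band].mean())
```

Because the auditory and lip envelopes are themselves correlated, ordinary coherence overstates the purely auditory contribution; the partialled estimate keeps only the part of the MEG-audio coupling that the lip signal cannot account for.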

References
Article
Full-text available
The integration of visual and auditory cues is crucial for successful processing of speech, especially under adverse conditions. Recent reports have shown that when participants watch muted videos of speakers, the phonological information about the acoustic speech envelope, which is associated with but independent of the speakers' lip movements, is tracked by the visual cortex. However, the speech signal also carries richer acoustic details, for example, about the fundamental frequency and the resonant frequencies, whose visuo-phonological transformation could aid speech processing. Here, we investigated the neural basis of the visuo-phonological transformation processes of these more fine-grained acoustic details and assessed how they change as a function of age. We recorded whole-head magnetoencephalographic (MEG) data while the participants watched silent normal (i.e., natural) and reversed videos of a speaker and paid attention to their lip movements. We found that the visual cortex is able to track the unheard natural modulations of resonant frequencies (or formants) and the pitch (or fundamental frequency) linked to lip movements. Importantly, only the processing of natural unheard formants decreases significantly with age in the visual and also in the cingulate cortex. This is not the case for the processing of the unheard speech envelope, the fundamental frequency, or the purely visual information carried by lip movements. These results show that unheard spectral fine details (along with the unheard acoustic envelope) are transformed from a mere visual to a phonological representation. Aging affects especially the ability to derive spectral dynamics at formant frequencies. As listening in noisy environments should capitalize on the ability to track spectral fine details, our results provide a novel focus on compensatory processes in such challenging situations.
Article
Full-text available
When we see our interlocutor, our brain seamlessly extracts visual cues from their face and processes them along with the sound of their voice, making speech an intrinsically multimodal signal. Visual cues are especially important in noisy environments, when the auditory signal is less reliable. Neuronal oscillations might be involved in the cortical processing of audiovisual speech by selecting which sensory channel contributes more to perception. To test this, we designed computer-generated naturalistic audiovisual speech stimuli where one mismatched phoneme-viseme pair in a key word of sentences created bistable perception. Neurophysiological recordings (high-density scalp and intracranial electroencephalography) revealed that the precise phase angle of theta-band oscillations in posterior temporal and occipital cortex of the right hemisphere was crucial to select whether the auditory or the visual speech cue drove perception. We demonstrate that the phase of cortical oscillations acts as an instrument for sensory selection in audiovisual speech processing.
Article
Full-text available
Human speech perception can be described as Bayesian perceptual inference but how are these Bayesian computations instantiated neurally? We used magnetoencephalographic recordings of brain responses to degraded spoken words and experimentally manipulated signal quality and prior knowledge. We first demonstrate that spectrotemporal modulations in speech are more strongly represented in neural responses than alternative speech representations (e.g. spectrogram or articulatory features). Critically, we found an interaction between speech signal quality and expectations from prior written text on the quality of neural representations; increased signal quality enhanced neural representations of speech that mismatched with prior expectations, but led to greater suppression of speech that matched prior expectations. This interaction is a unique neural signature of prediction error computations and is apparent in neural responses within 100 ms of speech input. Our findings contribute to the detailed specification of a computational model of speech perception based on predictive coding frameworks.
Article
Full-text available
On-line comprehension of natural speech requires segmenting the acoustic stream into discrete linguistic elements. This process is argued to rely on theta-gamma oscillation coupling, which can parse syllables and encode them in decipherable neural activity. Speech comprehension also strongly depends on contextual cues that help predicting speech structure and content. To explore the effects of theta-gamma coupling on bottom-up/top-down dynamics during on-line syllable identification, we designed a computational model (Precoss—predictive coding and oscillations for speech) that can recognise syllable sequences in continuous speech. The model uses predictions from internal spectro-temporal representations of syllables and theta oscillations to signal syllable onsets and duration. Syllable recognition is best when theta-gamma coupling is used to temporally align spectro-temporal predictions with the acoustic input. This neurocomputational modelling work demonstrates that the notions of predictive coding and neural oscillations can be brought together to account for on-line dynamic sensory processing.
Article
Full-text available
Sufficiently noisy listening conditions can completely mask the acoustic signal of significant parts of a sentence, and yet listeners may still report the perception of hearing the masked speech. This occurs even when the speech signal is removed entirely, if the gap is filled with stationary noise, a phenomenon known as perceptual restoration. At the neural level, however, it is unclear to what extent the neural representation of missing extended speech sequences resembles the dynamic neural representation of ordinary continuous speech. Using auditory magnetoencephalography (MEG), we show that stimulus reconstruction, a technique developed for use with neural representations of ordinary speech, works also for the missing speech segments replaced by noise, even when spanning several phonemes and words. The reconstruction fidelity of the missing speech (up to 25% of what would be attained were the speech present) depends, however, on listeners' familiarity with the missing segment. This same familiarity also speeds up the most prominent stage of the cortical processing of ordinary speech by approximately 5 ms. Both effects disappear when listeners have little or no prior experience with the speech segment. The results are consistent with adaptive expectation mechanisms that consolidate detailed representations about speech sounds as identifiable factors assisting automatic restoration over ecologically relevant timescales.
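Stimulus reconstruction of this kind is typically a regularized backward model: time-lagged neural channels are linearly combined to reconstruct the envelope, and fidelity is the correlation between reconstruction and true envelope. A minimal sketch under toy data; the variable names, lag range, and ridge penalty are illustrative, not the paper's settings.

```python
# Sketch: ridge-regression stimulus reconstruction (backward model).
import numpy as np
from numpy.linalg import solve

fs = 100
n_ch, n_t = 32, 60 * fs
rng = np.random.default_rng(1)
envelope = rng.standard_normal(n_t)                          # speech envelope (toy)
neural = rng.standard_normal((n_ch, n_t)) + 0.3 * envelope   # channels weakly driven by it

lags = np.arange(0, int(0.25 * fs))   # use 0-250 ms of neural activity after the stimulus

# Design matrix: one column per (channel, lag); np.roll(-lag) aligns neural[t + lag] with t.
X = np.stack([np.roll(neural[c], -lag) for c in range(n_ch) for lag in lags], axis=1)
X, y = X[: n_t - lags[-1]], envelope[: n_t - lags[-1]]       # drop wrapped-around rows

lam = 1e2 * n_t                                              # ad hoc ridge penalty
w = solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

recon = X @ w
print("reconstruction r =", np.corrcoef(recon, y)[0, 1])     # reconstruction fidelity
```

In the study, fidelity like this is compared between present and noise-replaced speech segments; here the single correlation simply shows the decoder mechanics.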
Article
Full-text available
During natural speech perception, humans must parse temporally continuous auditory and visual speech signals into sequences of words. However, most studies of speech perception present only single words or syllables. We used electrocorticography (subdural electrodes implanted on the brains of epileptic patients) to investigate the neural mechanisms for processing continuous audiovisual speech signals consisting of individual sentences. Using partial correlation analysis, we found that posterior superior temporal gyrus (pSTG) and medial occipital cortex tracked both the auditory and visual speech envelopes. These same regions, as well as inferior temporal cortex, responded more strongly to a dynamic video of a talking face compared to auditory speech paired with a static face. Occipital cortex and pSTG carry temporal information about both auditory and visual speech dynamics. Visual speech tracking in pSTG may be a mechanism for enhancing perception of degraded auditory speech.
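The partial correlation used here asks how strongly a neural response tracks one speech envelope once the other, correlated envelope has been accounted for. A minimal sketch with synthetic signals; all variable names are illustrative.

```python
# Sketch: why partial correlation matters when audio and lip envelopes correlate.
import numpy as np

def partial_corr(x, y, z):
    """Pearson correlation of x and y with z partialled out of both."""
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

rng = np.random.default_rng(2)
lip = rng.standard_normal(5000)
audio = 0.7 * lip + rng.standard_normal(5000)     # correlated envelopes, as in real speech
neural = 0.5 * audio + rng.standard_normal(5000)  # response driven by audio only

print(np.corrcoef(neural, lip)[0, 1])    # spuriously non-zero via the audio-lip correlation
print(partial_corr(neural, lip, audio))  # near zero once audio is accounted for
```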
Article
Full-text available
Successful lip-reading requires a mapping from visual to phonological information [1]. Recently, visual and motor cortices have been implicated in tracking lip movements (e.g., [2]). It remains unclear, however, whether visuo-phonological mapping occurs already at the level of the visual cortex, that is, whether this structure tracks the acoustic signal in a functionally relevant manner. To elucidate this, we investigated how the cortex tracks (i.e., entrains to) absent acoustic speech signals carried by silent lip movements. Crucially, we contrasted the entrainment to unheard forward (intelligible) and backward (unintelligible) acoustic speech. We observed that the visual cortex exhibited stronger entrainment to the unheard forward acoustic speech envelope compared to the unheard backward acoustic speech envelope. Supporting the notion of a visuo-phonological mapping process, this forward-backward difference of occipital entrainment was not present for actually observed lip movements. Importantly, the respective occipital region received more top-down input, especially from left premotor, primary motor, and somatosensory regions and, to a lesser extent, also from posterior temporal cortex. Strikingly, across participants, the extent of top-down modulation of the visual cortex stemming from these regions partially correlated with the strength of entrainment to the absent acoustic forward speech envelope, but not to present forward lip movements. Our findings demonstrate that a distributed cortical network, including key dorsal stream auditory regions [3-5], influences how the visual cortex shows sensitivity to the intelligibility of speech while tracking silent lip movements.
Article
Full-text available
During online speech processing, our brain tracks the acoustic fluctuations in speech at different timescales. Previous research has focused on generic timescales (for example, delta or theta bands) that are assumed to map onto linguistic features such as prosody or syllables. However, given the high intersubject variability in speaking patterns, such a generic association between the timescales of brain activity and speech properties can be ambiguous. Here, we analyse speech tracking in source-localised magnetoencephalographic data by directly focusing on timescales extracted from statistical regularities in our speech material. This revealed widespread significant tracking at the timescales of phrases (0.6–1.3 Hz), words (1.8–3 Hz), syllables (2.8–4.8 Hz), and phonemes (8–12.4 Hz). Importantly, when examining its perceptual relevance, we found stronger tracking for correctly comprehended trials in the left premotor (PM) cortex at the phrasal scale as well as in left middle temporal cortex at the word scale. Control analyses using generic bands confirmed that these effects were specific to the speech regularities in our stimuli. Furthermore, we found that the phase at the phrasal timescale coupled to power at beta frequency (13–30 Hz) in motor areas. This cross-frequency coupling presumably reflects top-down temporal prediction in ongoing speech perception. Together, our results reveal specific functional and perceptually relevant roles of distinct tracking and cross-frequency processes along the auditory–motor pathway.
Article
Full-text available
Due to their periodic nature, neural oscillations might represent an optimal "tool" for the processing of rhythmic stimulus input [1-3]. Indeed, the alignment of neural oscillations to a rhythmic stimulus, often termed phase entrainment, has been repeatedly demonstrated [4-7]. Phase entrainment is central to current theories of speech processing [8-10] and has been associated with successful speech comprehension [11-17]. However, typical manipulations that reduce speech intelligibility (e.g., addition of noise and time reversal [11, 12, 14, 16, 17]) could destroy critical acoustic cues for entrainment (such as "acoustic edges" [7]). Hence, the association between phase entrainment and speech intelligibility might only be "epiphenomenal"; i.e., both decline due to the same manipulation, without any causal link between the two [18]. Here, we use transcranial alternating current stimulation (tACS [19]) to manipulate the phase lag between neural oscillations and speech rhythm while measuring neural responses to intelligible and unintelligible vocoded stimuli with sparse fMRI. We found that this manipulation significantly modulates the BOLD response to intelligible speech in the superior temporal gyrus, and the strength of BOLD modulation is correlated with a phasic modulation of performance in a behavioral task. Importantly, these findings are absent for unintelligible speech and during sham stimulation; we thus demonstrate that phase entrainment has a specific, causal influence on neural responses to intelligible speech. Our results not only provide an important step toward understanding the neural foundation of human abilities at speech comprehension but also suggest new methods for enhancing speech perception that can be explored in the future.
Article
Full-text available
Speech is a multisensory percept, comprising an auditory and visual component. While the content and processing pathways of audio speech have been well characterized, the visual component is less well understood. In this work, we expand current methodologies using system identification to introduce a framework that facilitates the study of visual speech in its natural, continuous form. Specifically, we use models based on the unheard acoustic envelope (E), the motion signal (M) and categorical visual speech features (V) to predict EEG activity during silent lipreading. Our results show that each of these models performs similarly at predicting EEG in visual regions and that respective combinations of the individual models (EV, MV, EM and EMV) provide an improved prediction of the neural activity over their constituent models. In comparing these different combinations, we find that the model incorporating all three types of features (EMV) outperforms the individual models, as well as both the EV and MV models, while it performs similarly to the EM model. Importantly, EM does not outperform EV and MV, which, considering the higher dimensionality of the V model, suggests that more data is needed to clarify this finding. Nevertheless, the performance of EMV, and comparisons of the subject performances for the three individual models, provides further evidence to suggest that visual regions are involved in both low-level processing of stimulus dynamics and categorical speech perception. This framework may prove useful for investigating modality-specific processing of visual speech under naturalistic conditions.
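The model comparison described here amounts to fitting forward (encoding) models from different feature sets and comparing how well each predicts the neural signal. A minimal sketch with synthetic features and a single EEG channel; real analyses use time-lagged features and cross-validation, omitted here for brevity.

```python
# Sketch: comparing E, M, V and combined forward models by prediction accuracy.
import numpy as np
from numpy.linalg import solve

def fit_predict(X, y, lam=1e3):
    """Ridge regression; returns Pearson r between the model prediction and y."""
    w = solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
    return np.corrcoef(X @ w, y)[0, 1]

rng = np.random.default_rng(3)
n_t = 20000
env = rng.standard_normal((n_t, 1))       # unheard acoustic envelope (E), toy
motion = rng.standard_normal((n_t, 1))    # visual motion signal (M), toy
visemes = rng.standard_normal((n_t, 10))  # categorical visual speech features (V), toy
eeg = (0.4 * env[:, 0] + 0.3 * motion[:, 0]
       + 0.1 * (visemes @ rng.standard_normal(10)) + rng.standard_normal(n_t))

models = {"E": env, "M": motion, "V": visemes,
          "EM": np.hstack([env, motion]),
          "EMV": np.hstack([env, motion, visemes])}
for name, X in models.items():
    print(name, round(fit_predict(X, eeg), 3))
```

With these toy weights the combined EMV model outperforms its constituents, mirroring the qualitative pattern the abstract reports.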
Article
Full-text available
Humans are adept at understanding speech despite the fact that our natural listening environment is often filled with interference. An example of this capacity is phoneme restoration, in which part of a word is completely replaced by noise, yet listeners report hearing the whole word. The neurological basis for this unconscious fill-in phenomenon is unknown, despite being a fundamental characteristic of human hearing. Here, using direct cortical recordings in humans, we demonstrate that missing speech is restored at the acoustic-phonetic level in bilateral auditory cortex, in real-time. This restoration is preceded by specific neural activity patterns in a separate language area, left frontal cortex, which predicts the word that participants later report hearing. These results demonstrate that during speech perception, missing acoustic content is synthesized online from the integration of incoming sensory cues and the internal neural dynamics that bias word-level expectation and prediction.
Article
Full-text available
The role of oscillatory phase for perceptual and cognitive processes is being increasingly acknowledged. To date, little is known about the direct role of phase in categorical perception. Here we show in two separate experiments that the identification of ambiguous syllables that can either be perceived as /da/ or /ga/ is biased by the underlying oscillatory phase as measured with EEG and sensory entrainment to rhythmic stimuli. The measured phase difference in which perception is biased toward /da/ or /ga/ exactly matched the different temporal onset delays in natural audiovisual speech between mouth movements and speech sounds, which last 80 ms longer for /ga/ than for /da/. These results indicate the functional relationship between prestimulus phase and syllable identification, and signify that the origin of this phase relationship could lie in exposure and subsequent learning of unique audiovisual temporal onset differences.
Article
Full-text available
Electroencephalographic (EEG) oscillations are hypothesized to reflect cyclical variation in the excitability of neuronal ensembles [1], with particular frequency bands reflecting differing types [2-4] and spatial scales [5-7] of brain operations. Interdependence between the gamma and theta bands [5, 8] suggests an underlying structure to the EEG spectrum, and there is also evidence that ongoing activity influences sensory responses [9, 10]. However, there is no unifying theory of EEG organization and the role of the ongoing oscillatory activity in sensory processing remains controversial. This study analyzed laminar profiles of synaptic activity and action potentials, both spontaneous and stimulus-driven, in primary auditory cortex [11]. We find that (1) the EEG is hierarchically organized: delta (1-4 Hz) phase modulates theta (4-10 Hz) amplitude, and theta phase modulates gamma (30-50 Hz) amplitude; and (2) this oscillatory hierarchy controls baseline excitability and action potential generation, as well as stimulus-related responses in a neuronal ensemble. We propose that the hierarchical organization of ambient oscillatory activity allows auditory cortex to structure its temporal activity pattern so as to optimize the processing of rhythmic inputs.
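One common way to quantify the phase-amplitude coupling described here (for example, delta phase modulating theta amplitude) is a mean-vector-length measure. The sketch below uses a synthetic coupled signal; it is not the authors' laminar analysis, and the frequency bands and filter settings are illustrative.

```python
# Sketch: mean-vector-length estimate of delta-phase -> theta-amplitude coupling.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert

fs = 500
t = np.arange(0, 60, 1 / fs)
rng = np.random.default_rng(4)
theta_amp = 1 + 0.8 * np.sin(2 * np.pi * 2 * t)    # theta amplitude rides the 2 Hz cycle
signal = (np.sin(2 * np.pi * 2 * t)                # delta component
          + theta_amp * np.sin(2 * np.pi * 7 * t)  # amplitude-modulated theta
          + 0.5 * rng.standard_normal(t.size))

def bandpass(x, lo, hi):
    b, a = butter(3, [lo / (fs / 2), hi / (fs / 2)], btype="band")
    return filtfilt(b, a, x)

phase = np.angle(hilbert(bandpass(signal, 1, 4)))  # delta phase
amp = np.abs(hilbert(bandpass(signal, 4, 10)))     # theta amplitude envelope

# Mean vector length: 0 = no coupling, 1 = amplitude fully clustered on one phase.
mvl = np.abs(np.mean(amp * np.exp(1j * phase))) / np.mean(amp)
print(f"coupling strength = {mvl:.2f}")
```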
Article
Full-text available
Auditory cortical activity is entrained to the temporal envelope of speech, which corresponds to the syllabic rhythm of speech. Such entrained cortical activity can be measured from subjects naturally listening to sentences or spoken passages, providing a reliable neural marker of online speech processing. A central question still remains to be answered about whether cortical entrained activity is more closely related to speech perception or non-speech-specific auditory encoding. Here, we review a few hypotheses about the functional roles of cortical entrainment to speech, e.g., encoding acoustic features, parsing syllabic boundaries, and selecting sensory information in complex listening environments. It is likely that speech entrainment is not a homogeneous response and these hypotheses apply separately for speech entrainment generated from different neural sources. The relationship between entrained activity and speech intelligibility is also discussed. A tentative conclusion is that theta-band entrainment (4-8 Hz) encodes speech features critical for intelligibility while delta-band entrainment (1-4 Hz) is related to the perceived, non-speech-specific acoustic rhythm. To further understand the functional properties of speech entrainment, a splitter's approach will be needed to investigate (1) not just the temporal envelope but what specific acoustic features are encoded and (2) not just speech intelligibility but what specific psycholinguistic processes are encoded by entrained cortical activity. Similarly, the anatomical and spectro-temporal details of entrained activity need to be taken into account when investigating its functional properties.
Article
Full-text available
Magnetoencephalography and electroencephalography (M/EEG) measure the weak electromagnetic signals generated by neuronal activity in the brain. Using these signals to characterize and locate neural activation in the brain is a challenge that requires expertise in physics, signal processing, statistics, and numerical methods. As part of the MNE software suite, MNE-Python is an open-source software package that addresses this challenge by providing state-of-the-art algorithms implemented in Python that cover multiple methods of data preprocessing, source localization, statistical analysis, and estimation of functional connectivity between distributed brain regions. All algorithms and utility functions are implemented in a consistent manner with well-documented interfaces, enabling users to create M/EEG data analysis pipelines by writing Python scripts. Moreover, MNE-Python is tightly integrated with the core Python libraries for scientific computation (NumPy, SciPy) and visualization (matplotlib and Mayavi), as well as the greater neuroimaging ecosystem in Python via the Nibabel package. The code is provided under the new BSD license allowing code reuse, even in commercial products. Although MNE-Python has only been under heavy development for a couple of years, it has rapidly evolved with expanded analysis capabilities and pedagogical tutorials because multiple labs have collaborated during code development to help share best practices. MNE-Python also gives easy access to preprocessed datasets, helping users to get started quickly and facilitating reproducibility of methods by other researchers. Full documentation, including dozens of examples, is available at http://martinos.org/mne.
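A minimal MNE-Python pipeline of the kind the package supports is shown below; the file name, trigger-channel name, and event coding are placeholders that vary across datasets.

```python
# Sketch: a basic evoked-response pipeline in MNE-Python.
import mne

raw = mne.io.read_raw_fif("sample_raw.fif", preload=True)  # placeholder file name
raw.filter(l_freq=1.0, h_freq=40.0)                        # band-pass the continuous data

events = mne.find_events(raw, stim_channel="STI 014")      # trigger channel name varies
epochs = mne.Epochs(raw, events, event_id={"stim": 1},     # event coding is dataset-specific
                    tmin=-0.2, tmax=0.5, baseline=(None, 0))
evoked = epochs.average()                                  # average epochs into an evoked response
evoked.plot()
```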
Article
Full-text available
Cortical oscillations are likely candidates for segmentation and coding of continuous speech. Here, we monitored continuous speech processing with magnetoencephalography (MEG) to unravel the principles of speech segmentation and coding. We demonstrate that speech entrains the phase of low-frequency (delta, theta) and the amplitude of high-frequency (gamma) oscillations in the auditory cortex. Phase entrainment is stronger in the right and amplitude entrainment is stronger in the left auditory cortex. Furthermore, edges in the speech envelope phase reset auditory cortex oscillations thereby enhancing their entrainment to speech. This mechanism adapts to the changing physical features of the speech envelope and enables efficient, stimulus-specific speech sampling. Finally, we show that within the auditory cortex, coupling between delta, theta, and gamma oscillations increases following speech edges. Importantly, all couplings (i.e., brain-speech and also within the cortex) attenuate for backward-presented speech, suggesting top-down control. We conclude that segmentation and coding of speech relies on a nested hierarchy of entrained cortical oscillations.
Article
Full-text available
An unresolved question is how the reported clarity of degraded speech is enhanced when listeners have prior knowledge of speech content. One account of this phenomenon proposes top-down modulation of early acoustic processing by higher-level linguistic knowledge. Alternative, strictly bottom-up accounts argue that acoustic information and higher-level knowledge are combined at a late decision stage without modulating early acoustic processing. Here we tested top-down and bottom-up accounts using written text to manipulate listeners' knowledge of speech content. The effect of written text on the reported clarity of noise-vocoded speech was most pronounced when text was presented before (rather than after) speech (Experiment 1). Fine-grained manipulation of the onset asynchrony between text and speech revealed that this effect declined when text was presented more than 120 ms after speech onset (Experiment 2). Finally, the influence of written text was found to arise from phonological (rather than lexical) correspondence between text and speech (Experiment 3). These results suggest that prior knowledge effects are time-limited by the duration of auditory echoic memory for degraded speech, consistent with top-down modulation of early acoustic processing by linguistic knowledge.
Article
Full-text available
Our ability to selectively attend to one auditory signal amid competing input streams, epitomized by the "Cocktail Party" problem, continues to stimulate research from various approaches. How this demanding perceptual feat is achieved from a neural systems perspective remains unclear and controversial. It is well established that neural responses to attended stimuli are enhanced compared with responses to ignored ones, but responses to ignored stimuli are nonetheless highly significant, leading to interference in performance. We investigated whether congruent visual input of an attended speaker enhances cortical selectivity in auditory cortex, leading to diminished representation of ignored stimuli. We recorded magnetoencephalographic signals from human participants as they attended to segments of natural continuous speech. Using two complementary methods of quantifying the neural response to speech, we found that viewing a speaker's face enhances the capacity of auditory cortex to track the temporal speech envelope of that speaker. This mechanism was most effective in a Cocktail Party setting, promoting preferential tracking of the attended speaker, whereas without visual input no significant attentional modulation was observed. These neurophysiological results underscore the importance of visual input in resolving perceptual ambiguity in a noisy environment. Since visual cues in speech precede the associated auditory signals, they likely serve a predictive role in facilitating auditory processing of speech, perhaps by directing attentional resources to appropriate points in time when to-be-attended acoustic input is expected to arrive.
Article
Full-text available
Speech recognition from visual-only faces is difficult, but can be improved by prior information about what is said. Here, we investigated how the human brain uses prior information from auditory speech to improve visual-speech recognition. In a functional magnetic resonance imaging study, participants performed a visual-speech recognition task, indicating whether the word spoken in visual-only videos matched the preceding auditory-only speech, and a control task (face-identity recognition) containing exactly the same stimuli. We localized a visual-speech processing network by contrasting activity during visual-speech recognition with the control task. Within this network, the left posterior superior temporal sulcus (STS) showed increased activity and interacted with auditory-speech areas if prior information from auditory speech did not match the visual speech. This mismatch-related activity and the functional connectivity to auditory-speech areas were specific for speech, i.e., they were not present in the control task. The mismatch-related activity correlated positively with performance, indicating that posterior STS was behaviorally relevant for visual-speech recognition. In line with predictive coding frameworks, these findings suggest that prediction error signals are produced if visually presented speech does not match the prediction from preceding auditory speech, and that this mechanism plays a role in optimizing visual-speech recognition by prior information.
Article
Full-text available
A key feature of speech is the quasi-regular rhythmic information contained in its slow amplitude modulations. In this article we review the information conveyed by speech rhythm, and the role of ongoing brain oscillations in listeners' processing of this content. Our starting point is the fact that speech is inherently temporal, and that rhythmic information conveyed by the amplitude envelope contains important markers for place and manner of articulation, segmental information, and speech rate. Behavioral studies demonstrate that amplitude envelope information is relied upon by listeners and plays a key role in speech intelligibility. Extending behavioral findings, data from neuroimaging - particularly electroencephalography (EEG) and magnetoencephalography (MEG) - point to phase locking by ongoing cortical oscillations to low-frequency information (~4-8 Hz) in the speech envelope. This phase modulation effectively encodes a prediction of when important events (such as stressed syllables) are likely to occur, and acts to increase sensitivity to these relevant acoustic cues. We suggest a framework through which such neural entrainment to speech rhythm can explain effects of speech rate on word and segment perception (i.e., that the perception of phonemes and words in connected speech is influenced by preceding speech rate). Neuroanatomically, acoustic amplitude modulations are processed largely bilaterally in auditory cortex, with intelligible speech resulting in differential recruitment of left-hemisphere regions. Notable among these is lateral anterior temporal cortex, which we propose functions in a domain-general fashion to support ongoing memory and integration of meaningful input. Together, the reviewed evidence suggests that low-frequency oscillations in the acoustic speech signal form the foundation of a rhythmic hierarchy supporting spoken language, mirrored by phase-locked oscillations in the human brain.
Article
Full-text available
A growing body of evidence shows that ongoing oscillations in auditory cortex modulate their phase to match the rhythm of temporally regular acoustic stimuli, increasing sensitivity to relevant environmental cues and improving detection accuracy. In the current study, we test the hypothesis that nonsensory information provided by linguistic content enhances phase-locked responses to intelligible speech in the human brain. Sixteen adults listened to meaningful sentences while we recorded neural activity using magnetoencephalography. Stimuli were processed using a noise-vocoding technique to vary intelligibility while keeping the temporal acoustic envelope consistent. We show that the acoustic envelopes of sentences contain most power between 4 and 7 Hz and that it is in this frequency band that phase locking between neural activity and envelopes is strongest. Bilateral oscillatory neural activity phase-locked to unintelligible speech, but this cerebro-acoustic phase locking was enhanced when speech was intelligible. This enhanced phase locking was left lateralized and localized to left temporal cortex. Together, our results demonstrate that entrainment to connected speech does not only depend on acoustic characteristics, but is also affected by listeners' ability to extract linguistic information. This suggests a biological framework for speech comprehension in which acoustic and linguistic cues reciprocally aid in stimulus prediction.
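The temporal acoustic envelope analyzed in studies like this one is commonly obtained by taking the magnitude of the analytic signal and low-pass filtering it before downsampling to the neural sampling rate. A minimal sketch; the audio array, cutoff, and rates are placeholders.

```python
# Sketch: extracting the slow amplitude envelope of a speech waveform.
import numpy as np
from scipy.signal import butter, filtfilt, hilbert, resample

fs_audio = 16000
rng = np.random.default_rng(5)
audio = rng.standard_normal(10 * fs_audio)   # stand-in for a 10 s spoken sentence

envelope = np.abs(hilbert(audio))            # broadband amplitude envelope

b, a = butter(3, 10 / (fs_audio / 2), btype="low")
envelope = filtfilt(b, a, envelope)          # keep the slow (< 10 Hz) modulations

fs_neural = 200                              # match the M/EEG sampling rate
envelope = resample(envelope, int(envelope.size * fs_neural / fs_audio))
```

Phase locking between this envelope and band-limited neural activity can then be computed in the 4-7 Hz range where, as the abstract notes, sentence envelopes carry most of their power.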
Article
Full-text available
Neuronal oscillations are ubiquitous in the brain and may contribute to cognition in several ways: for example, by segregating information and organizing spike timing. Recent data show that delta, theta and gamma oscillations are specifically engaged by the multi-timescale, quasi-rhythmic properties of speech and can track its dynamics. We argue that they are foundational in speech and language processing, 'packaging' incoming information into units of the appropriate temporal granularity. Such stimulus-brain alignment arguably results from auditory and motor tuning throughout the evolution of speech and language and constitutes a natural model system allowing auditory research to make a unique contribution to the issue of how neural oscillatory activity affects human cognition.
Article
Full-text available
According to the predictive coding theory, top-down predictions are conveyed by backward connections and prediction errors are propagated forward across the cortical hierarchy. Using MEG in humans, we show that violating multisensory predictions causes a fundamental and qualitative change in both the frequency and spatial distribution of cortical activity. When visual speech input correctly predicted auditory speech signals, a slow delta regime (3-4 Hz) developed in higher-order speech areas. In contrast, when auditory signals invalidated predictions inferred from vision, a low-beta (14-15 Hz) / high-gamma (60-80 Hz) coupling regime appeared locally in a multisensory area (area STS). This frequency shift in oscillatory responses scaled with the degree of audio-visual congruence and was accompanied by increased gamma activity in lower sensory regions. These findings are consistent with the notion that bottom-up prediction errors are communicated in predominantly high (gamma) frequency ranges, whereas top-down predictions are mediated by slower (beta) frequencies.
Article
Full-text available
Humans are thought to have evolved brain regions in the left frontal and temporal cortex that are uniquely capable of language processing. However, congenitally blind individuals also activate the visual cortex in some verbal tasks. We provide evidence that this visual cortex activity in fact reflects language processing. We find that in congenitally blind individuals, the left visual cortex behaves similarly to classic language regions: (i) BOLD signal is higher during sentence comprehension than during linguistically degraded control conditions that are more difficult; (ii) BOLD signal is modulated by phonological information, lexical semantic information, and sentence-level combinatorial structure; and (iii) functional connectivity with language regions in the left prefrontal cortex and thalamus are increased relative to sighted individuals. We conclude that brain regions that are thought to have evolved for vision can take on language processing as a result of early experience. Innate microcircuit properties are not necessary for a brain region to become involved in language processing.
Article
Full-text available
This paper describes FieldTrip, an open source software package that we developed for the analysis of MEG, EEG, and other electrophysiological data. The software is implemented as a MATLAB toolbox and includes a complete set of consistent and user-friendly high-level functions that allow experimental neuroscientists to analyze experimental data. It includes algorithms for simple and advanced analysis, such as time-frequency analysis using multitapers, source reconstruction using dipoles, distributed sources and beamformers, connectivity analysis, and nonparametric statistical permutation tests at the channel and source level. The implementation as toolbox allows the user to perform elaborate and structured analyses of large data sets using the MATLAB command line and batch scripting. Furthermore, users and developers can easily extend the functionality and implement new algorithms. The modular design facilitates the reuse in other software packages.
Article
Full-text available
Author Summary: When faced with ecologically relevant stimuli in natural scenes, our brains need to coordinate information from multiple sensory systems in order to create accurate internal representations of the outside world. Unfortunately, we currently have little information about the neuronal mechanisms for this cross-modal processing during online sensory perception under natural conditions. Neurophysiological and human imaging studies are increasingly exploring the response properties elicited by natural scenes. In this study, we recorded magnetoencephalography (MEG) data from participants viewing audiovisual movie clips. We developed a phase coherence analysis technique that captures—in single trials of watching a movie—how the phase of cortical responses is tightly coupled to key aspects of stimulus dynamics. Remarkably, auditory cortex not only tracks auditory stimulus dynamics but also reflects dynamic aspects of the visual signal. Similarly, visual cortex mainly follows the visual properties of a stimulus, but also shows sensitivity to the auditory aspects of a scene. The critical finding is that cross-modal phase modulation appears to lie at the basis of this integrative processing. Continuous cross-modal phase modulation may permit the internal construction of behaviorally relevant stimuli. Our work therefore contributes to the understanding of how multi-sensory information is analyzed and represented in the human brain.
Article
Full-text available
Humans, like other animals, are exposed to a continuous stream of signals, which are dynamic, multimodal, extended, and time varying in nature. This complex input space must be transduced and sampled by our sensory systems and transmitted to the brain where it can guide the selection of appropriate actions. To simplify this process, it's been suggested that the brain exploits statistical regularities in the stimulus space. Tests of this idea have largely been confined to unimodal signals and natural scenes. One important class of multisensory signals for which a quantitative input space characterization is unavailable is human speech. We do not understand what signals our brain has to actively piece together from an audiovisual speech stream to arrive at a percept versus what is already embedded in the signal structure of the stream itself. In essence, we do not have a clear understanding of the natural statistics of audiovisual speech. In the present study, we identified the following major statistical features of audiovisual speech. First, we observed robust correlations and close temporal correspondence between the area of the mouth opening and the acoustic envelope. Second, we found the strongest correlation between the area of the mouth opening and vocal tract resonances. Third, we observed that both area of the mouth opening and the voice envelope are temporally modulated in the 2-7 Hz frequency range. Finally, we show that the timing of mouth movements relative to the onset of the voice is consistently between 100 and 300 ms. We interpret these data in the context of recent neural theories of speech which suggest that speech communication is a reciprocally coupled, multisensory event, whereby the outputs of the signaler are matched to the neural processes of the receiver.
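The temporal correspondence between mouth opening and the acoustic envelope can be estimated with a lagged correlation. In the sketch below the two signals are synthetic, with the mouth made to lead the voice by 200 ms; all names and values are illustrative.

```python
# Sketch: estimating how far mouth movements lead the acoustic envelope.
import numpy as np

fs = 100
rng = np.random.default_rng(6)
n_t = 60 * fs
mouth_area = rng.standard_normal(n_t)
lead = int(0.2 * fs)   # build in a 200 ms mouth lead (np.roll wraparound is negligible here)
envelope = np.roll(mouth_area, lead) + 0.5 * rng.standard_normal(n_t)

def lagged_corr(x, y, lag):
    """Pearson r between x[t] and y[t + lag]."""
    if lag >= 0:
        x_seg, y_seg = x[: len(x) - lag], y[lag:]
    else:
        x_seg, y_seg = x[-lag:], y[: len(y) + lag]
    return np.corrcoef(x_seg, y_seg)[0, 1]

lags = np.arange(-50, 51)   # -500 ms to +500 ms
r = [lagged_corr(mouth_area, envelope, int(l)) for l in lags]
best = lags[int(np.argmax(r))]
print(f"peak correlation at {best * 1000 / fs:+.0f} ms (mouth leading voice)")
```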
Article
Full-text available
Hearing-impaired persons usually perceive speech by watching the face of the talker while listening through a hearing aid. Normal-hearing persons also tend to rely on visual cues, especially when they communicate in noisy or reverberant environments. Numerous clinical and laboratory studies on the auditory-visual performance of normal-hearing and hearing-impaired children and adults demonstrate that combined auditory-visual perception is superior to perception through either audition or vision alone. This paper reviews these studies and provides a rationale for routine evaluation of auditory-visual speech perception in audiology clinics.
Article
Natural conversation is multisensory: when we can see the speaker's face, visual speech cues improve our comprehension. The neuronal mechanisms underlying this phenomenon remain unclear. The two main alternatives are visually mediated phase modulation of neuronal oscillations (excitability fluctuations) in auditory neurons and visual input-evoked responses in auditory neurons. Investigating this question using naturalistic audiovisual speech with intracranial recordings in humans of both sexes, we find evidence for both mechanisms. Remarkably, auditory cortical neurons track the temporal dynamics of purely visual speech using the phase of their slow oscillations and phase-related modulations in broadband high-frequency activity. Consistent with known perceptual enhancement effects, the visual phase reset amplifies the cortical representation of concomitant auditory speech. In contrast to this, and in line with earlier reports, visual input reduces the amplitude of evoked responses to concomitant auditory input. We interpret the combination of improved phase tracking and reduced response amplitude as evidence for more efficient and reliable stimulus processing in the presence of congruent auditory and visual speech inputs.

Significance Statement: Watching the speaker can facilitate our understanding of what is being said. The mechanisms responsible for this influence of visual cues on the processing of speech remain incompletely understood. We studied these mechanisms by recording the electrical activity of the human brain through electrodes implanted surgically inside the brain. We found that visual inputs can operate by directly activating auditory cortical areas, and also indirectly by modulating the strength of cortical responses to auditory input. Our results help to understand the mechanisms by which the brain merges auditory and visual speech into a unitary perception.
Article
Lip-reading is crucial for understanding speech in challenging conditions. But how the brain extracts meaning from silent visual speech is still under debate. Lip-reading in silence activates the auditory cortices, but it is not known whether such activation reflects immediate synthesis of the corresponding auditory stimulus or imagery of unrelated sounds. To disentangle these possibilities, we used magnetoencephalography to evaluate how cortical activity in 28 healthy adult humans (17 females) entrained to the auditory speech envelope and lip movements (mouth opening) when listening to a spoken story without visual input (audio-only), and when seeing a silent video of a speaker articulating another story (video-only). In video-only, auditory cortical activity entrained to the absent auditory signal at frequencies below 1 Hz more than to the seen lip movements. This entrainment process was characterized by an auditory-speech-to-brain delay of ~70 ms in the left hemisphere, compared to ~20 ms in audio-only. Entrainment to mouth opening was found in the right angular gyrus at below 1 Hz, and in early visual cortices at 1-8 Hz. These findings demonstrate that the brain can use a silent lip-read signal to synthesize a coarse-grained auditory speech representation in early auditory cortices. Our data indicate the following underlying oscillatory mechanism: Seeing lip movements first modulates neuronal activity in early visual cortices at frequencies that match articulatory lip movements; the right angular gyrus then extracts slower features of lip movements, mapping them onto the corresponding speech sound features; this information is fed to auditory cortices, most likely facilitating speech parsing.
Article
Several recent studies have used transcranial alternating current stimulation (tACS) to demonstrate a causal role of neural oscillatory activity in speech processing. In particular, it has been shown that the ability to understand speech in a multi-speaker scenario or background noise depends on the timing of speech presentation relative to simultaneously applied tACS. However, it is possible that tACS did not change actual speech perception but rather auditory stream segregation. In this study, we tested whether the phase relation between tACS and the rhythm of degraded words, presented in silence, modulates word report accuracy. We found strong evidence for a tACS-induced modulation of speech perception, but only if the stimulation was applied bilaterally using ring electrodes (not for unilateral left hemisphere stimulation with square electrodes). These results were only obtained when data were analyzed using a statistical approach that was identified as optimal in a previous simulation study. The effect was driven by a phasic disruption of word report scores. Our results suggest a causal role of neural entrainment for speech perception and emphasize the importance of optimizing stimulation protocols and statistical approaches for brain stimulation research.
Article
The streams of sounds we typically attend to abound in acoustic regularities. Neural entrainment is seen as an important mechanism that the listening brain exploits to attune to these regularities and to enhance the representation of attended sounds. We delineate the neurophysiology underlying this mechanism and review entrainment alongside its more pragmatic signature, often called 'speech tracking'. The latter has become a popular analytical approach to trace the reflection of acoustic and linguistic information at different levels of granularity, from neurophysiology to neuroimaging. As we discuss, the concept of entrainment offers both a putative neurophysiological mechanism for selective listening and a versatile window onto the neural basis of hearing and speech comprehension.
Chapter
Speech perception is multisensory, making use of information from both the auditory modality (the talker’s voice) and the visual modality (the talker’s face). This chapter describes recent advances in our understanding of the neural processing of audiovisual speech, driven by studies using blood-oxygen level-dependent functional magnetic resonance imaging (BOLD fMRI), electrocorticography, causal inference modeling, and transcranial magnetic stimulation. An area of special importance is the posterior superior temporal sulcus and adjacent superior temporal gyrus and middle temporal gyrus. An audiovisual speech illusion known as the McGurk effect, in which incongruent auditory and visual syllables are perceived as a third syllable, has been a useful tool for interrogating the cortical speech network. There is a previously unappreciated level of intersubject and interstimulus variability in the behavioral and neural responses to this illusion.
Article
The McGurk effect is a textbook illustration of the automaticity with which the human brain integrates audio-visual speech. It shows that even incongruent audiovisual (AV) speech stimuli can be combined into percepts that correspond neither to the auditory nor to the visual input, but to a mix of both. Typically, when presented with, e.g., visual /aga/ and acoustic /aba/ we perceive an illusory /ada/. In the inverse situation, however, when acoustic /aga/ is paired with visual /aba/, we perceive a combination of both stimuli, i.e., /abga/ or /agba/. Here we assessed the role of dynamic cross-modal predictions in the outcome of AV speech integration using a computational model that processes continuous audiovisual speech sensory inputs in a predictive coding framework. The model involves three processing levels: sensory units, units that encode the dynamics of stimuli, and multimodal recognition/identity units. The model exhibits a dynamic prediction behavior because evidence about speech tokens can be asynchronous across sensory modality, allowing for updating the activity of the recognition units from one modality while sending top-down predictions to the other modality. We explored the model's response to congruent and incongruent AV stimuli and found that, in the two-dimensional feature space spanned by the speech second formant and lip aperture, fusion stimuli are located in the neighborhood of congruent /ada/, which therefore provides a valid match. Conversely, stimuli that lead to combination percepts do not have a unique valid neighbor. In that case, acoustic and visual cues are both highly salient and generate conflicting predictions in the other modality that cannot be fused, forcing the elaboration of a combinatorial solution. We propose that dynamic predictive mechanisms play a decisive role in the dichotomous perception of incongruent audiovisual inputs.
Article
During face-to-face conversational speech listeners must efficiently process a rapid and complex stream of multisensory information. Visual speech can serve as a critical complement to auditory information because it provides cues to both the timing of the incoming acoustic signal (the amplitude envelope, influencing attention and perceptual sensitivity) and its content (place and manner of articulation, constraining lexical selection). Here we review behavioral and neurophysiological evidence regarding listeners' use of visual speech information. Multisensory integration of audiovisual speech cues improves recognition accuracy, particularly for speech in noise. Even when speech is intelligible based solely on auditory information, adding visual information may reduce the cognitive demands placed on listeners through increasing the precision of prediction. Electrophysiological studies demonstrate that oscillatory cortical entrainment to speech in auditory cortex is enhanced when visual speech is present, increasing sensitivity to important acoustic cues. Neuroimaging studies also suggest increased activity in auditory cortex when congruent visual information is available, but additionally emphasize the involvement of heteromodal regions of posterior superior temporal sulcus as playing a role in integrative processing. We interpret these findings in a framework of temporally-focused lexical competition in which visual speech information affects auditory processing to increase sensitivity to acoustic information through an early integration mechanism, and a late integration stage that incorporates specific information about a speaker's articulators to constrain the number of possible candidates in a spoken utterance. Ultimately it is words compatible with both auditory and visual information that most strongly determine successful speech perception during everyday listening. Thus, audiovisual speech perception is accomplished through multiple stages of integration, supported by distinct neuroanatomical mechanisms.
Article
To form a coherent percept of the environment, the brain needs to bind sensory signals emanating from a common source, but to segregate those from different sources [1]. Temporal correlations and synchrony act as prominent cues for multisensory integration [2-4], but the neural mechanisms by which such cues are identified remain unclear. Predictive coding suggests that the brain iteratively optimizes an internal model of its environment by minimizing the errors between its predictions and the sensory inputs [5,6]. This model enables the brain to predict the temporal evolution of natural audiovisual inputs and their statistical (for example, temporal) relationship. A prediction of this theory is that asynchronous audiovisual signals violating the model's predictions induce an error signal that depends on the directionality of the audiovisual asynchrony. As the visual system generates the dominant temporal predictions for visual leading asynchrony, the delayed auditory inputs are expected to generate a prediction error signal in the auditory system (and vice versa for auditory leading asynchrony). Using functional magnetic resonance imaging (fMRI), we measured participants' brain responses to synchronous, visual leading and auditory leading movies of speech, sinewave speech or music. In line with predictive coding, auditory leading asynchrony elicited a prediction error in visual cortices and visual leading asynchrony in auditory cortices. Our results reveal predictive coding as a generic mechanism to temporally bind signals from multiple senses into a coherent percept.
Article
Previous imaging studies of congenital blindness have studied individuals with heterogeneous causes of blindness, which may influence the nature and extent of cross-modal plasticity. Here, we scanned a homogeneous group of blind people with bilateral congenital anophthalmia, a condition in which both eyes fail to develop, and, as a result, the visual pathway is not stimulated by either light or retinal waves. This model of congenital blindness presents an opportunity to investigate the effects of very early visual deafferentation on the functional organization of the brain. In anophthalmic animals, the occipital cortex receives direct subcortical auditory input. We hypothesized that this pattern of subcortical reorganization ought to result in a topographic mapping of auditory frequency information in the occipital cortex of anophthalmic people. Using functional MRI, we examined auditory-evoked activity to pure tones of high, medium, and low frequencies. Activity in the superior temporal cortex was significantly reduced in anophthalmic compared with sighted participants. In the occipital cortex, a region corresponding to the cytoarchitectural area V5/MT+ was activated in the anophthalmic participants but not in sighted controls. Whereas previous studies in the blind indicate that this cortical area is activated to auditory motion, our data show it is also active for trains of pure tone stimuli and in some anophthalmic participants shows a topographic mapping (tonotopy). Therefore, this region appears to be performing early sensory processing, possibly served by direct subcortical input from the pulvinar to V5/MT+.
Article
This article presents a review of the effects of adverse conditions (ACs) on the perceptual, linguistic, cognitive, and neurophysiological mechanisms underlying speech recognition. The review starts with a classification of ACs based on their origin: degradation at the source (production of a noncanonical signal), degradation during signal transmission (interfering signal or medium-induced impoverishment of the target signal), and receiver limitations (peripheral, linguistic, cognitive). This is followed by a parallel, yet orthogonal, classification of ACs based on the locus of their effect: perceptual processes, mental representations, attention, and memory functions. We then review the added value that ACs provide for theories of speech recognition, with a focus on fundamental themes in psycholinguistics: the content and format of lexical representations, the time-course of lexical access, word segmentation, feedback in speech perception and recognition, lexical-semantic integration, the interface between the speech system and general cognition, and the neuroanatomical organisation of speech processing. We conclude by advocating an approach to speech recognition that includes rather than neutralises complex listening environments and individual differences.
Article
Purpose: In this study, the authors sought to quantify the relationships between speech intelligibility (perception) and gaze patterns under different auditory-visual conditions. Method: Eleven subjects listened to low-context sentences spoken by a single talker while viewing the face of one or more talkers on a computer display. Subjects either maintained their gaze at a specific distance (0°, 2.5°, 5°, 10°, and 15°) from the center of the talker's mouth (CTM) or moved their eyes freely on the computer display. Eye movements were monitored with an eye-tracking system, and speech intelligibility was evaluated by the mean percentage of correctly perceived words. Results: With a single talker and a fixed point of gaze, speech intelligibility was similar for all fixations within 10° of the CTM. With visual cues from two talker faces and a speech signal from one of the talkers, speech intelligibility was similar to that of a single talker for fixations within 2.5° of the CTM. With natural viewing of a single talker, gaze strategy changed with the speech signal-to-noise ratio (SNR). At low speech SNRs, a strategy that brought the point of gaze directly to within 2.5° of the CTM was used in approximately 80% of trials, whereas at high speech SNRs it was used in only approximately 50% of trials. Conclusions: With natural viewing of a single talker and high speech SNR, subjects can shift their gaze between points on the talker's face without compromising speech intelligibility. With low speech SNR, subjects change their gaze patterns to fixate primarily on points in close proximity to the talker's mouth. The latter strategy is essential to optimize speech intelligibility in situations where there are simultaneous visual cues from multiple talkers (i.e., when some of the visual cues are distracters).
Article
"Oral speech intelligibility tests were conducted with, and without, supplementary visual observation of the speaker's facial and lip movements. The difference between these two conditions was examined as a function of the speech-to-noise ratio and of the size of the vocabulary under test. The visual contribution to oral speech intelligibility (relative to its possible contribution) is, to a first approximation, independent of the speech-to-noise ratio under test. However, since there is a much greater opportunity for the visual contribution at low speech-to-noise ratios, its absolute contribution can be exploited most profitably under these conditions." (PsycINFO Database Record (c) 2012 APA, all rights reserved)
Article
FreeSurfer is a suite of tools for the analysis of neuroimaging data that provides an array of algorithms to quantify the functional, connectional and structural properties of the human brain. It has evolved from a package primarily aimed at generating surface representations of the cerebral cortex into one that automatically creates models of most macroscopically visible structures in the human brain given any reasonable T1-weighted input image. It is freely available, runs on a wide variety of hardware and software platforms, and is open source.
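As a concrete illustration of the suite's usual entry point, the sketch below drives the standard cortical-reconstruction pipeline from Python via its command-line interface; FreeSurfer itself must already be installed and configured, and the paths and subject name are hypothetical:

    import os
    import subprocess

    # Point FreeSurfer at a writable subjects directory (hypothetical path).
    os.environ["SUBJECTS_DIR"] = "/data/freesurfer_subjects"

    # Run the full reconstruction for one subject from a T1-weighted image.
    subprocess.run(
        ["recon-all", "-i", "sub01_T1.nii.gz", "-s", "sub01", "-all"],
        check=True,
    )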
Article
Independent component analysis (ICA) is a statistical method for transforming an observed multidimensional random vector into components that are statistically as independent from each other as possible. In this paper, we use a combination of two different approaches for linear ICA: Comon's information-theoretic approach and the projection pursuit approach. Using maximum entropy approximations of differential entropy, we introduce a family of new contrast (objective) functions for ICA. These contrast functions enable both the estimation of the whole decomposition by minimizing mutual information, and estimation of individual independent components as projection pursuit directions. The statistical properties of the estimators based on such contrast functions are analyzed under the assumption of the linear mixture model, and it is shown how to choose contrast functions that are robust and/or of minimum variance. Finally, we introduce simple fixed-point algorithms for practical optimization of the contrast functions. These algorithms optimize the contrast functions very fast and reliably.
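A minimal sketch of the one-unit fixed-point iteration the abstract describes, using the tanh contrast function (illustrative only, not the paper's reference implementation):

    import numpy as np

    def whiten(X):
        """Center and whiten (rows = mixed observations, columns = samples)."""
        X = X - X.mean(axis=1, keepdims=True)
        d, E = np.linalg.eigh(np.cov(X))
        return E @ np.diag(d ** -0.5) @ E.T @ X

    def fastica_one_unit(X, max_iter=200, tol=1e-6, seed=0):
        """Estimate one independent component by the fixed-point iteration."""
        Z = whiten(X)
        rng = np.random.default_rng(seed)
        w = rng.standard_normal(Z.shape[0])
        w /= np.linalg.norm(w)
        for _ in range(max_iter):
            wz = w @ Z                                    # current projection
            g, g_prime = np.tanh(wz), 1.0 - np.tanh(wz) ** 2
            # Fixed-point update: w <- E{z g(w'z)} - E{g'(w'z)} w
            w_new = (Z * g).mean(axis=1) - g_prime.mean() * w
            w_new /= np.linalg.norm(w_new)
            if abs(w_new @ w) > 1 - tol:                  # converged up to sign
                w = w_new
                break
            w = w_new
        return w @ Z                                      # recovered source estimate

Further components can be obtained by repeating the iteration with new weight vectors orthogonalized against those already found (deflation).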
Article
Precise localization of sulco-gyral structures of the human cerebral cortex is important for the interpretation of morpho-functional data, but requires anatomical expertise and is time consuming because of the brain's geometric complexity. Software developed to automatically identify sulco-gyral structures has improved substantially as a result of techniques providing topologically correct reconstructions permitting inflated views of the human brain. Here we describe a complete parcellation of the cortical surface using standard internationally accepted nomenclature and criteria. This parcellation is available in the FreeSurfer package. First, a computer-assisted hand parcellation classified each vertex as sulcal or gyral, and these were then subparcellated into 74 labels per hemisphere. Twelve datasets were used to develop rules and algorithms (reported here) that produced labels consistent with anatomical rules as well as automated computational parcellation. The final parcellation was used to build an atlas for automatically labeling the whole cerebral cortex. This atlas was used to label an additional 12 datasets, which were found to have good concordance with manual labels. This paper presents a precisely defined method for automatically labeling the cortical surface in standard terminology.
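For readers who want to work with the resulting labels programmatically: the parcellation ships with FreeSurfer as the "aparc.a2009s" annotation and can be read with the third-party nibabel package (a sketch; the path is hypothetical):

    import nibabel as nib

    # Per-vertex label indices, a color table, and the label names.
    labels, ctab, names = nib.freesurfer.read_annot(
        "/data/freesurfer_subjects/sub01/label/lh.aparc.a2009s.annot"
    )
    print(len(names))   # roughly 74 labels per hemisphere, plus 'Unknown'
    print(names[:3])    # names are bytes, e.g. b'G_and_S_frontomargin'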
Article
The dip test measures multimodality in a sample by the maximum difference, over all sample points, between the empirical distribution function and the unimodal distribution function that minimizes that maximum difference. The uniform distribution is the asymptotically least favorable unimodal distribution, and the distribution of the test statistic is determined asymptotically and empirically when sampling from the uniform.
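In symbols (our notation): for a sample with empirical distribution function F_n, the dip is

    D_n = \min_{G \in \mathcal{U}} \; \sup_x \left| F_n(x) - G(x) \right|,

where \mathcal{U} is the class of unimodal distribution functions. Large values of D_n are evidence against unimodality, with critical values calibrated against the least favorable uniform null.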
Article
This paper reviews progress in understanding the psychology of lipreading and audio-visual speech perception. It considers four questions. What distinguishes better from poorer lipreaders? What are the effects of introducing a delay between the acoustical and optical speech signals? What have attempts to produce computer animations of talking faces contributed to our understanding of the visual cues that distinguish consonants and vowels? Finally, how should the process of audio-visual integration in speech perception be described; that is, how are the sights and sounds of talking faces represented at their conflux?
Article
The intelligibility of sentences presented in noise improves when the listener can view the talker's face. Our aims were to quantify this benefit, and to relate it to individual differences among subjects in lipreading ability and among sentences in lipreading difficulty. Auditory and audiovisual speech-reception thresholds (SRTs) were measured in 20 listeners with normal hearing. Sixty sentences, selected to range in the difficulty with which they could be lipread (with vision alone) from easy to hard, were presented for identification in white noise. Using the ascending method of limits, the SRT was defined as the lowest signal-to-noise ratio at which all three 'key words' in each sentence could be identified correctly. Measured as the difference in dB between auditory-alone and audiovisual SRTs, 'audiovisual benefit' averaged 11 dB, ranging from 6 to 15 dB among subjects, and from 3 to 22 dB among sentences. As predicted, audiovisual benefit is a measure of lipreading ability. It was highly correlated with visual-alone performance (n = 20, r = 0.86, p < 0.01). Likewise, those sentences which were easiest to lipread gave a higher measure of benefit from vision in audiovisual conditions than did sentences that were hard to lipread (n = 60, r = 0.92, p < 0.01). The results establish the basis of an efficient test of speech-reception disability in which measures are freed from the floor and ceiling effects encountered when percentage correct is used as the dependent variable.
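The benefit measure reduces to a difference of thresholds per listener; a toy sketch in Python (the arrays are invented placeholders, not the study's data):

    import numpy as np
    from scipy.stats import pearsonr

    # Hypothetical SRTs in dB SNR; lower is better.
    srt_auditory = np.array([-2.0, 1.5, -0.5, 0.0])
    srt_audiovisual = np.array([-13.0, -8.5, -11.0, -9.0])
    # Hypothetical visual-alone (lipreading) accuracy per listener.
    lipreading_score = np.array([0.55, 0.20, 0.42, 0.31])

    # Audiovisual benefit: how much further into noise AV listening reaches.
    av_benefit = srt_auditory - srt_audiovisual
    r, p = pearsonr(av_benefit, lipreading_score)
    print(f"mean benefit = {av_benefit.mean():.1f} dB, r = {r:.2f}, p = {p:.3f}")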
Article
Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
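The manipulation is compact enough to sketch in Python; the version below substitutes a Hilbert envelope for the rectify-and-low-pass extraction of the original study, and the number of bands, band edges, and filter order are illustrative assumptions:

    import numpy as np
    from scipy.signal import butter, filtfilt, hilbert

    def noise_vocode(speech, fs, band_edges_hz=(100, 800, 1500, 4000)):
        """Replace spectral detail with envelope-modulated noise per band."""
        rng = np.random.default_rng(0)
        out = np.zeros(len(speech))
        for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
            b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
            band = filtfilt(b, a, speech)              # band-limit the speech
            envelope = np.abs(hilbert(band))           # temporal envelope cue
            noise = filtfilt(b, a, rng.standard_normal(len(speech)))
            out = out + envelope * noise               # modulated noise carrier
        return out

The four edges above yield three analysis bands, the condition at which the abstract reports that high speech recognition performance is already obtained.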