
Iterative alignment discovery of speech-associated neural activity

IOP Publishing
Journal of Neural Engineering


Objective. Brain–computer interfaces (BCIs) have the potential to preserve or restore speech in patients with neurological disorders that weaken the muscles involved in speech production. However, successful training of low-latency speech synthesis and recognition models requires alignment of neural activity with intended phonetic or acoustic output with high temporal precision. This is particularly challenging in patients who cannot produce audible speech, as ground truth with which to pinpoint neural activity synchronized with speech is not available. Approach. In this study, we present a new iterative algorithm for neural voice activity detection (nVAD) called iterative alignment discovery dynamic time warping (IAD-DTW) that integrates DTW into the loss function of a deep neural network (DNN). The algorithm is designed to discover the alignment between a patient’s electrocorticographic (ECoG) neural responses and their attempts to speak during collection of data for training BCI decoders for speech synthesis and recognition. Main results. To demonstrate the effectiveness of the algorithm, we tested its accuracy in predicting the onset and duration of acoustic signals produced by able-bodied patients with intact speech undergoing short-term diagnostic ECoG recordings for epilepsy surgery. We simulated a lack of ground truth by randomly perturbing the temporal correspondence between neural activity and an initial single estimate for all speech onsets and durations. We examined the model’s ability to overcome these perturbations to estimate ground truth. IAD-DTW showed no notable degradation (<1% absolute decrease in accuracy) in performance in these simulations, even in the case of maximal misalignments between speech and silence. Significance. IAD-DTW is computationally inexpensive and can be easily integrated into existing DNN-based nVAD approaches, as it pertains only to the final loss computation. This approach makes it possible to train speech BCI algorithms using ECoG data from patients who are unable to produce audible speech, including those with Locked-In Syndrome.
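As a rough illustration of the dynamic time warping (DTW) step named in the abstract, the sketch below aligns a frame-wise speech/silence prediction with a temporally shifted label sequence. It is a minimal, textbook DTW in plain Python, not the authors' IAD-DTW: the function name `dtw_align`, the absolute-difference frame cost, and the toy sequences are all assumptions made for this example, and the abstract does not specify how DTW enters the DNN loss.

```python
from typing import List, Sequence, Tuple


def dtw_align(a: Sequence[float], b: Sequence[float]) -> Tuple[float, List[Tuple[int, int]]]:
    """Classic DTW between two 1-D sequences (hypothetical helper).

    Returns the total alignment cost and the warping path as (i, j) index pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # assumed frame-level cost
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])

    # Backtrack from the last cell to recover the warping path.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        candidates = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal step
            (cost[i - 1][j], i - 1, j),          # advance in a only
            (cost[i][j - 1], i, j - 1),          # advance in b only
        ]
        _, i, j = min(candidates)
    path.reverse()
    return cost[n][m], path


# Toy example: the same speech burst, shifted by one frame.
predicted = [0, 0, 1, 1, 1, 0]
initial_labels = [0, 1, 1, 1, 0, 0]
total_cost, warping_path = dtw_align(predicted, initial_labels)
print(total_cost)  # 0.0: the warp absorbs the one-frame shift
```

Under these assumptions, a loss computed along the warping path would not penalise a prediction merely for being offset in time from an imprecise initial label, which is the intuition behind tolerating misaligned ground truth.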


Full-text available
Objective. Brain-Computer Interfaces (BCIs) hold significant promise for restoring communication in individuals with partial or complete loss of the ability to speak due to paralysis from amyotrophic lateral sclerosis (ALS), brainstem stroke, and other neurological disorders. Many of the approaches to speech decoding reported in the BCI literature have required time-aligned target representations to allow successful training, a major challenge when translating such approaches to people who have already lost their voice. Approach. In this pilot study, we made a first step toward scenarios in which no ground truth is available. We utilized a graph-based clustering approach to identify temporal segments of speech production from electrocorticographic (ECoG) signals alone. We then used the estimated speech segments to train a voice activity detection (VAD) model using only ECoG signals. We evaluated our approach using held-out open-loop recordings of a single dysarthric clinical trial participant living with ALS, and we compared the resulting performance to previous solutions trained with ground truth acoustic voice recordings. Main results. Our approach achieves a median error rate of around 0.5 seconds with respect to the actual spoken speech. Embedded into a real-time BCI, our approach is capable of providing VAD results with a latency of only 10 ms. Significance. To the best of our knowledge, our results show for the first time that speech activity can be predicted purely from unlabeled ECoG signals, a crucial step toward supporting individuals who can no longer provide this information due to their neurological condition, such as patients with locked-in syndrome. Clinical Trial Information. Registration number NCT03567213.
Full-text available
Brain‐computer interfaces (BCIs) can be used to control assistive devices by patients with neurological disorders like amyotrophic lateral sclerosis (ALS) that limit speech and movement. For assistive control, it is desirable for BCI systems to be accurate and reliable, preferably with minimal setup time. In this study, a participant with severe dysarthria due to ALS operates computer applications with six intuitive speech commands via a chronic electrocorticographic (ECoG) implant over the ventral sensorimotor cortex. Speech commands are accurately detected and decoded (median accuracy: 90.59%) throughout a 3‐month study period without model retraining or recalibration. Use of the BCI does not require exogenous timing cues, enabling the participant to issue self‐paced commands at will. These results demonstrate that a chronically implanted ECoG‐based speech BCI can reliably control assistive devices over long time periods with only initial model training and calibration, supporting the feasibility of unassisted home use.
Full-text available
Speech brain–computer interfaces (BCIs) have the potential to restore rapid communication to people with paralysis by decoding neural activity evoked by attempted speech into text¹,² or sound³,⁴. Early demonstrations, although promising, have not yet achieved accuracies sufficiently high for communication of unconstrained sentences from a large vocabulary¹⁻⁷. Here we demonstrate a speech-to-text BCI that records spiking activity from intracortical microelectrode arrays. Enabled by these high-resolution recordings, our study participant, who can no longer speak intelligibly owing to amyotrophic lateral sclerosis, achieved a 9.1% word error rate on a 50-word vocabulary (2.7 times fewer errors than the previous state-of-the-art speech BCI²) and a 23.8% word error rate on a 125,000-word vocabulary (the first successful demonstration, to our knowledge, of large-vocabulary decoding). Our participant’s attempted speech was decoded at 62 words per minute, which is 3.4 times as fast as the previous record⁸ and begins to approach the speed of natural conversation (160 words per minute⁹). Finally, we highlight two aspects of the neural code for speech that are encouraging for speech BCIs: spatially intermixed tuning to speech articulators that makes accurate decoding possible from only a small region of cortex, and a detailed articulatory representation of phonemes that persists years after paralysis. These results show a feasible path forward for restoring rapid communication to people with paralysis who can no longer speak.
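Several of the abstracts in this list report a word error rate (WER). As a reminder of how that metric is conventionally computed, here is a minimal sketch: the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. The function name `word_error_rate` and the example sentences are assumptions for illustration, and the abstract does not state which scoring tool the study used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{wer:.3f}")  # 0.167
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.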
Full-text available
Speech neuroprostheses have the potential to restore communication to people living with paralysis, but naturalistic speed and expressivity are elusive¹. Here we use high-density surface recordings of the speech cortex in a clinical-trial participant with severe limb and vocal paralysis to achieve high-performance real-time decoding across three complementary speech-related output modalities: text, speech audio and facial-avatar animation. We trained and evaluated deep-learning models using neural data collected as the participant attempted to silently speak sentences. For text, we demonstrate accurate and rapid large-vocabulary decoding with a median rate of 78 words per minute and median word error rate of 25%. For speech audio, we demonstrate intelligible and rapid speech synthesis and personalization to the participant’s pre-injury voice. For facial-avatar animation, we demonstrate the control of virtual orofacial movements for speech and non-speech communicative gestures. The decoders reached high performance with less than two weeks of training. Our findings introduce a multimodal speech-neuroprosthetic approach that has substantial promise to restore full, embodied communication to people living with severe paralysis.
Full-text available
Neuroprostheses have the potential to restore communication to people who cannot speak or type due to paralysis. However, it is unclear if silent attempts to speak can be used to control a communication neuroprosthesis. Here, we translated direct cortical signals in a clinical-trial participant (NCT03698149) with severe limb and vocal-tract paralysis into single letters to spell out full sentences in real time. We used deep-learning and language-modeling techniques to decode letter sequences as the participant attempted to silently spell using code words that represented the 26 English letters (e.g. “alpha” for “a”). We leveraged broad electrode coverage beyond speech-motor cortex to include supplemental control signals from hand cortex and complementary information from low- and high-frequency signal components to improve decoding accuracy. We decoded sentences using words from a 1,152-word vocabulary at a median character error rate of 6.13% and speed of 29.4 characters per minute. In offline simulations, we showed that our approach generalized to large vocabularies containing over 9,000 words (median character error rate of 8.23%). These results illustrate the clinical viability of a silently controlled speech neuroprosthesis to generate sentences from a large vocabulary through a spelling-based approach, complementing previous demonstrations of direct full-word decoding.
Full-text available
Auditory feedback of one’s own speech is used to monitor and adaptively control fluent speech production. A new study in PLOS Biology using electrocorticography (ECoG) in listeners whose speech was artificially delayed identifies regions involved in monitoring speech production.
Full-text available
Reconstructing intended speech from neural activity using brain-computer interfaces holds great promise for people with severe speech production deficits. While decoding overt speech has progressed, decoding imagined speech has met limited success, mainly because the associated neural signals are weak and variable compared to overt speech, and hence difficult for learning algorithms to decode. We obtained three electrocorticography datasets from 13 patients, with electrodes implanted for epilepsy evaluation, who performed overt and imagined speech production tasks. Based on recent theories of speech neural processing, we extracted consistent and specific neural features usable for future brain-computer interfaces, and assessed their performance to discriminate speech items in articulatory, phonetic, and vocalic representation spaces. While high-frequency activity provided the best signal for overt speech, both low- and higher-frequency power and local cross-frequency contributed to imagined speech decoding, in particular in phonetic and vocalic, i.e. perceptual, spaces. These findings show that low-frequency power and cross-frequency dynamics contain key information for imagined speech decoding.
Conference Paper
Full-text available
Recent studies have shown promise for designing Brain-Computer Interfaces (BCIs) to restore speech communication for those suffering from neurological injury or disease. Numerous BCIs have been developed to reconstruct different aspects of speech, such as phonemes and words, from brain activity. However, many challenges remain toward the successful reconstruction of continuous speech from brain activity during speech imagery. Here, we investigate the potential of differentiating speech and non-speech using intracranial brain activity in different frequency bands acquired by stereotactic EEG. The results reveal statistically significant information in the alpha and theta bands for detecting voice activity, and that using a combination of multiple frequency bands further improves performance, with over 92% accuracy. Furthermore, the model is causal and can be implemented with low latency for future closed-loop experiments. These preliminary findings show the potential of cross-frequency brain signal features for detecting speech activity to enhance speech decoding and synthesis models.
Full-text available
Small, variable transmission delays over Zoom disrupt the typical rhythm of conversation, leading to delays in turn initiation. This study compared local and remote (Zoom) turn transition times using both a tightly controlled yes/no Question and Answer (Q&A) paradigm (Corps et al., 2018) and unscripted conversation. In the Q&A paradigm (Experiment 1), participants responded yes/no as quickly as possible to prerecorded questions. Half of the questions were played over Zoom and half were played locally from their own computer. Local responses had an average latency of 297 ms, whereas remote responses averaged 976 ms. These large increases in transition times over Zoom are far greater than the estimated 30-70 ms of audio transmission delay, suggesting disruption of automated mechanisms that normally guide the timing of turn initiation in conversation. In face-to-face conversations (Experiment 2), turn transition times averaged 135 ms, but transition times for the same dyads over Zoom averaged 487 ms. We consider the possibility that electronic transmission delays disrupt neural oscillators that normally synchronize on syllable rate, at around 150-300 ms per cycle (Wilson & Wilson, 2005), and enable interlocutors to effortlessly and precisely time the initiation of their turns.
Full-text available
Brain–computer interfaces (BCIs) can restore communication to people who have lost the ability to move or speak. So far, a major focus of BCI research has been on restoring gross motor skills, such as reaching and grasping¹⁻⁵ or point-and-click typing with a computer cursor⁶,⁷. However, rapid sequences of highly dexterous behaviours, such as handwriting or touch typing, might enable faster rates of communication. Here we developed an intracortical BCI that decodes attempted handwriting movements from neural activity in the motor cortex and translates it to text in real time, using a recurrent neural network decoding approach. With this BCI, our study participant, whose hand was paralysed from spinal cord injury, achieved typing speeds of 90 characters per minute with 94.1% raw accuracy online, and greater than 99% accuracy offline with a general-purpose autocorrect. To our knowledge, these typing speeds exceed those reported for any other BCI, and are comparable to typical smartphone typing speeds of individuals in the age group of our participant (115 characters per minute)⁸. Finally, theoretical considerations explain why temporally complex movements, such as handwriting, may be fundamentally easier to decode than point-to-point movements. Our results open a new approach for BCIs and demonstrate the feasibility of accurately decoding rapid, dexterous movements years after paralysis.
Full-text available
Purpose: Speakers use auditory feedback to guide their speech output, although individuals differ in the magnitude of their compensatory response to perceived errors in feedback. Little is known about the factors that contribute to the compensatory response or how fixed or flexible they are within an individual. Here, we test whether manipulating the perceived reliability of auditory feedback modulates speakers' compensation to auditory perturbations, as predicted by optimal models of sensorimotor control. Method: Forty participants produced monosyllabic words in two separate sessions, which differed in the auditory feedback given during an initial exposure phase. In the veridical session exposure phase, feedback was normal. In the noisy session exposure phase, small, random formant perturbations were applied, reducing reliability of auditory feedback. In each session, a subsequent test phase introduced larger unpredictable formant perturbations. We assessed whether the magnitude of within-trial compensation for these larger perturbations differed across the two sessions. Results: Compensatory responses to downward (though not upward) formant perturbations were larger in the veridical session than the noisy session. However, in post hoc testing, we found the magnitude of this effect is highly dependent on the choice of analysis procedures. Compensation magnitude was not predicted by other production measures, such as formant variability, and was not reliably correlated across sessions. Conclusions: Our results, though mixed, provide tentative support that the feedback control system monitors the reliability of sensory feedback. These results must be interpreted cautiously given the potentially limited stability of auditory feedback compensation measures across analysis choices and across sessions.
Full-text available
Brain-computer interfaces (BCIs) enable control of assistive devices in individuals with severe motor impairments. A limitation of BCIs that has hindered real-world adoption is poor long-term reliability and lengthy daily recalibration times. To develop methods that allow stable performance without recalibration, we used a 128-channel chronic electrocorticography (ECoG) implant in a paralyzed individual, which allowed stable monitoring of signals. We show that long-term closed-loop decoder adaptation, in which decoder weights are carried across sessions over multiple days, results in consolidation of a neural map and 'plug-and-play' control. In contrast, daily reinitialization led to degradation of performance with variable relearning. Consolidation also allowed the addition of control features over days, that is, long-term stacking of dimensions. Our results offer an approach for reliable, stable BCI control by leveraging the stability of ECoG interfaces and neural plasticity.
Full-text available
Overfitting is one of the most challenging problems in deep neural networks with a large number of trainable parameters. To prevent networks from overfitting, the dropout method, which is a strong regularization technique, has been widely used in fully-connected neural networks. In several state-of-the-art convolutional neural network architectures for object classification, however, dropout was partially or not even applied since its accuracy gain was relatively insignificant in most cases. Also, the batch normalization technique reduced the need for the dropout method because of its regularization effect. In this paper, we show that conventional element-wise dropout can be ineffective for convolutional layers. We found that dropout between channels in the CNNs can be functionally similar to dropout in the FCNNs, and spatial dropout can be an effective way to take advantage of the dropout technique for regularizing. To prove our points, we conducted several experiments using the CIFAR-10 and CIFAR-100 databases. For comparison, we only replaced the dropout layers with spatial dropout layers and kept all other hyperparameters and methods intact. DenseNet-BC with spatial dropout showed promising results (3.32% error rates with CIFAR-10, 3.0 M parameters) compared to other existing competitive methods.
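To make the contrast with element-wise dropout concrete, the sketch below shows spatial (channel-wise) dropout as it is commonly defined: each channel's entire feature map is either zeroed or kept and rescaled by 1/(1 - p). This is a plain-Python illustration under stated assumptions, not the paper's implementation: the `FeatureMap` nested-list layout, the function name `spatial_dropout`, and the inverted-dropout rescaling are choices made for this example.

```python
import random
from typing import List

# Assumed layout for this sketch: feature_map[channel][row][col]
FeatureMap = List[List[List[float]]]


def spatial_dropout(x: FeatureMap, p: float, rng: random.Random) -> FeatureMap:
    """Drop whole channels with probability p; rescale survivors by 1/(1-p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 / (1.0 - p)
    out: FeatureMap = []
    for channel in x:
        # One random draw per channel, so every spatial position in that
        # channel shares the same fate (unlike element-wise dropout).
        keep = rng.random() >= p
        factor = scale if keep else 0.0
        out.append([[value * factor for value in row] for row in channel])
    return out


# Four 2x2 channels filled with ones; each output channel is all 0.0 or all 2.0.
features = [[[1.0, 1.0], [1.0, 1.0]] for _ in range(4)]
dropped = spatial_dropout(features, p=0.5, rng=random.Random(0))
for index, channel in enumerate(dropped):
    print(index, channel)
```

Sharing one mask across all positions in a channel matters because neighbouring activations in a convolutional feature map are strongly correlated, so masking them independently removes little information.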
Full-text available
A decade after speech was first decoded from human brain signals, accuracy and speed remain far below that of natural speech. Here we show how to decode the electrocorticogram with high accuracy and at natural-speech rates. Taking a cue from recent advances in machine translation, we train a recurrent neural network to encode each sentence-length sequence of neural activity into an abstract representation, and then to decode this representation, word by word, into an English sentence. For each participant, data consist of several spoken repeats of a set of 30–50 sentences, along with the contemporaneous signals from ~250 electrodes distributed over peri-Sylvian cortices. Average word error rates across a held-out repeat set are as low as 3%. Finally, we show how decoding with limited data can be improved with transfer learning, by training certain layers of the network under multiple participants’ data.
Full-text available
Neural interfaces that directly produce intelligible speech from brain activity would allow people with severe impairment from neurological disorders to communicate more naturally. Here, we record neural population activity in motor, premotor and inferior frontal cortices during speech production using electrocorticography (ECoG) and show that ECoG signals alone can be used to generate intelligible speech output that can preserve conversational cues. To produce speech directly from neural data, we adapted a method from the field of speech synthesis called unit selection, in which units of speech are concatenated to form audible output. In our approach, which we call Brain-To-Speech, we chose subsequent units of speech based on the measured ECoG activity to generate audio waveforms directly from the neural recordings. Brain-To-Speech employed the user's own voice to generate speech that sounded very natural and included features such as prosody and accentuation. By investigating the brain areas involved in speech production separately, we found that speech motor cortex provided more information for the reconstruction process than the other cortical areas.
Full-text available
Technology that translates neural activity into speech would be transformative for people who are unable to communicate as a result of neurological impairments. Decoding speech from neural activity is challenging because speaking requires very precise and rapid multi-dimensional control of vocal tract articulators. Here we designed a neural decoder that explicitly leverages kinematic and sound representations encoded in human cortical activity to synthesize audible speech. Recurrent neural networks first decoded directly recorded cortical activity into representations of articulatory movement, and then transformed these representations into speech acoustics. In closed vocabulary tests, listeners could readily identify and transcribe speech synthesized from cortical activity. Intermediate articulatory dynamics enhanced performance even with limited data. Decoded articulatory representations were highly conserved across speakers, enabling a component of the decoder to be transferrable across participants. Furthermore, the decoder could synthesize speech when a participant silently mimed sentences. These findings advance the clinical viability of using speech neuroprosthetic technology to restore spoken communication.
Full-text available
[Purpose] Here, we evaluated the reaction times of young and middle-aged people in different tasks. [Participants and Methods] The study included 23 young and 28 middle-aged volunteers. Their reaction times were measured in three tasks featuring different symbols (arrow and figure symbols) and spatial attributes (left, right, and ipsilateral choices). [Results] No significant inter-group differences in the reaction times were found for the simple reaction time task. In the choice reaction time and go/no-go reaction time tasks, the middle-aged participants demonstrated significantly slower reaction times. When the correct response was congruous with the direction of an arrow stimulus, the reaction times were shortened significantly among the middle-aged participants. In the go/no-go reaction time task, the reactions were delayed due to an inhibition of responses to upcoming stimuli. [Conclusion] The slower reaction time of the middle-aged participants in the choice reaction time task suggested that their responses were guided by the arrow stimulus to a greater extent compared to that of the younger participants. In the go/no-go reaction time task, the reaction times may have been slower in middle-aged participants because of a non-response possibility, which meant that participants had to first check the stimulus before deciding whether to respond.
Full-text available
Neural keyword spotting could form the basis of a speech brain-computer-interface for menu-navigation if it can be done with low latency and high specificity comparable to the “wake-word” functionality of modern voice-activated AI assistant technologies. This study investigated neural keyword spotting using motor representations of speech via invasively-recorded electrocorticographic signals as a proof-of-concept. Neural matched filters were created from monosyllabic consonant-vowel utterances: one keyword utterance, and 11 similar non-keyword utterances. These filters were used in an analog to the acoustic keyword spotting problem, applied for the first time to neural data. The filter templates were cross-correlated with the neural signal, capturing temporal dynamics of neural activation across cortical sites. Neural vocal activity detection (VAD) was used to identify utterance times and a discriminative classifier was used to determine if these utterances were the keyword or non-keyword speech. Model performance appeared to be highly related to electrode placement and spatial density. Vowel height (/a/ vs /i/) was poorly discriminated in recordings from sensorimotor cortex, but was highly discriminable using neural features from superior temporal gyrus during self-monitoring. The best performing neural keyword detection (5 keyword detections with two false-positives across 60 utterances) and neural VAD (100% sensitivity, ~1 false detection per 10 utterances) came from high-density (2 mm electrode diameter and 5 mm pitch) recordings from ventral sensorimotor cortex, suggesting the spatial fidelity and extent of high-density ECoG arrays may be sufficient for the purpose of speech brain-computer-interfaces.
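The matched-filter idea described above, sliding a template along a signal and looking for the offset with the strongest response, can be sketched in a few lines. This is a single-channel toy in plain Python; the study cross-correlated templates with multi-site neural recordings, and the names `cross_correlate` and `best_offset`, the unnormalised dot-product score, and the toy data are assumptions for illustration only.

```python
from typing import List, Sequence


def cross_correlate(signal: Sequence[float], template: Sequence[float]) -> List[float]:
    """Slide the template over the signal and return one score per offset."""
    n, m = len(signal), len(template)
    if m == 0 or m > n:
        raise ValueError("template must be non-empty and no longer than the signal")
    return [
        sum(signal[offset + i] * template[i] for i in range(m))
        for offset in range(n - m + 1)
    ]


def best_offset(scores: Sequence[float]) -> int:
    """Index of the highest score (first one if there are ties)."""
    return max(range(len(scores)), key=lambda k: scores[k])


# Toy example: the template shape is embedded at offset 2.
signal = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0, 0.0]
template = [1.0, 2.0, 1.0]
scores = cross_correlate(signal, template)
print(scores)               # [1.0, 4.0, 6.0, 4.0, 1.0]
print(best_offset(scores))  # 2
```

In a detection setting, the peak score would typically be compared against a threshold rather than simply taken as the answer, since a maximum always exists even when no keyword is present.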
Full-text available
Auditory stimulus reconstruction is a technique that finds the best approximation of the acoustic stimulus from the population of evoked neural activity. Reconstructing speech from the human auditory cortex creates the possibility of a speech neuroprosthetic to establish a direct communication with the brain and has been shown to be possible in both overt and covert conditions. However, the low quality of the reconstructed speech has severely limited the utility of this method for brain-computer interface (BCI) applications. To advance the state-of-the-art in speech neuroprosthesis, we combined the recent advances in deep learning with the latest innovations in speech synthesis technologies to reconstruct closed-set intelligible speech from the human auditory cortex. We investigated the dependence of reconstruction accuracy on linear and nonlinear (deep neural network) regression methods and the acoustic representation that is used as the target of reconstruction, including auditory spectrogram and speech synthesis parameters. In addition, we compared the reconstruction accuracy from low and high neural frequency ranges. Our results show that a deep neural network model that directly estimates the parameters of a speech synthesizer from all neural frequencies achieves the highest subjective and objective scores on a digit recognition task, improving the intelligibility by 65% over the baseline method which used linear regression to reconstruct the auditory spectrogram. These results demonstrate the efficacy of deep learning and speech synthesis algorithms for designing the next generation of speech BCI systems, which not only can restore communications for paralyzed patients but also have the potential to transform human-computer interaction technologies.
Full-text available
A fundamental challenge in neuroscience is to understand what structure in the world is represented in spatially distributed patterns of neural activity from multiple single-trial measurements. This is often accomplished by learning simple, linear transformations between neural features and features of the sensory stimuli or motor task. While successful in some early sensory processing areas, linear mappings are unlikely to be ideal tools for elucidating nonlinear, hierarchical representations of higher-order brain areas during complex tasks, such as the production of speech by humans. Here, we apply deep networks to predict produced speech syllables from a dataset of high gamma cortical surface electric potentials recorded from human sensorimotor cortex. We find that deep networks had higher decoding prediction accuracy compared to baseline models. Having established that deep networks extract more task-relevant information from neural data sets relative to linear models (i.e., higher predictive accuracy), we next sought to demonstrate their utility as a data analysis tool for neuroscience. We first show that the deep networks' confusions revealed hierarchical latent structure in the neural data, which recapitulated the underlying articulatory nature of speech motor control. We next broadened the frequency features beyond high-gamma and identified a novel high-gamma-to-beta coupling during speech production. Finally, we used deep networks to compare task-relevant information in different neural frequency bands, and found that the high-gamma band contains the vast majority of information relevant for the speech prediction task, with little-to-no additional contribution from lower-frequency amplitudes. Together, these results demonstrate the utility of deep networks as a data analysis tool for basic and applied neuroscience.
Full-text available
In humans, listening to speech evokes neural responses in the motor cortex. This has been controversially interpreted as evidence that speech sounds are processed as articulatory gestures. However, it is unclear what information is actually encoded by such neural activity. We used high-density direct human cortical recordings while participants spoke and listened to speech sounds. Motor cortex neural patterns during listening were substantially different than during articulation of the same sounds. During listening, we observed neural activity in the superior and inferior regions of ventral motor cortex. During speaking, responses were distributed throughout somatotopic representations of speech articulators in motor cortex. The structure of responses in motor cortex during listening was organized along acoustic features similar to auditory cortex, rather than along articulatory features as during speaking. Motor cortex does not contain articulatory representations of perceived actions in speech, but rather, represents auditory vocal information.
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper we embrace this observation and introduce the Dense Convolutional Network (DenseNet), where each layer is directly connected to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections, one between each layer and its subsequent layer (treating the input as layer 0), our network has L(L+1)/2 direct connections. For each layer, the feature maps of all preceding layers are treated as separate inputs whereas its own feature maps are passed on as inputs to all subsequent layers. Our proposed connectivity pattern has several compelling advantages: it alleviates the vanishing gradient problem and strengthens feature propagation; despite the increase in connections, it encourages feature reuse and leads to a substantial reduction of parameters; its models tend to generalize surprisingly well. We evaluate our proposed architecture on five highly competitive object recognition benchmark tasks. The DenseNet obtains significant improvements over the state-of-the-art on all five of them (e.g., yielding 3.74% test error on CIFAR-10, 19.25% on CIFAR-100 and 1.59% on SVHN).
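The L(L+1)/2 figure follows from counting inputs layer by layer: with the network input treated as layer 0, layer k receives direct connections from layers 0 through k-1, that is, k connections, and summing k from 1 to L gives L(L+1)/2. The short sketch below checks this by enumeration; the function names are assumptions for illustration.

```python
def dense_connections(num_layers: int) -> int:
    """Count direct connections when every layer sees all earlier outputs.

    Layer k (1-based) takes inputs from layers 0..k-1, where layer 0 is the
    network input, so it contributes k connections.
    """
    return sum(layer for layer in range(1, num_layers + 1))


def chain_connections(num_layers: int) -> int:
    """A traditional feed-forward stack: one connection into each layer."""
    return num_layers


for depth in (1, 5, 12):
    closed_form = depth * (depth + 1) // 2
    assert dense_connections(depth) == closed_form
    print(depth, chain_connections(depth), dense_connections(depth))
```

For a 12-layer block this gives 78 direct connections against 12 in a plain chain, which is why the connection count grows quadratically with depth.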
Conference Paper
Full-text available
Recent work has shown that convolutional networks can be substantially deeper, more accurate, and efficient to train if they contain shorter connections between layers close to the input and those close to the output. In this paper, we embrace this observation and introduce the Dense Convolutional Network (DenseNet), which connects each layer to every other layer in a feed-forward fashion. Whereas traditional convolutional networks with L layers have L connections - one between each layer and its subsequent layer - our network has L(L+1)/2 direct connections. For each layer, the feature-maps of all preceding layers are used as inputs, and its own feature-maps are used as inputs into all subsequent layers. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters. We evaluate our proposed architecture on four highly competitive object recognition benchmark tasks (CIFAR-10, CIFAR-100, SVHN, and ImageNet). DenseNets obtain significant improvements over the state-of-the-art on most of them, whilst requiring less memory and computation to achieve high performance. Code and models are available at
Brain-Computer Interfaces (BCIs) have the potential to detect intraoperative awareness during general anaesthesia. Traditionally, BCI research is aimed at establishing or improving communication and control for patients with permanent paralysis. Patients experiencing intraoperative awareness also lack the means to communicate after administration of a neuromuscular blocker, but may attempt to move. This study evaluates the principle of detecting attempted movements from the electroencephalogram (EEG) during local temporary neuromuscular blockade. EEG was obtained from four healthy volunteers making 3-second hand movements, both before and after local administration of rocuronium in one isolated forearm. Using offline classification analysis we investigated whether the attempted movements the participants made during paralysis could be distinguished from the periods when they did not move or attempt to move. Attempted movement trials were correctly identified in 81 (68-94)% (mean (95% CI)) and 84 (74-93)% of the cases using 30 and 9 EEG channels, respectively. Similar accuracies were obtained when training the classifier on the participants' actual movements. These results provide proof of the principle that a BCI can detect movement attempts during neuromuscular blockade. Based on this, in the future a BCI may serve as a communication channel between a patient under general anaesthesia and the anaesthesiologist.
It has long been speculated whether communication between humans and machines based on natural speech-related cortical activity is possible. Over the past decade, studies have suggested that it is feasible to recognize isolated aspects of speech from neural signals, such as auditory features, phones or one of a few isolated words. However, until now it remained an unsolved challenge to decode continuously spoken speech from the neural substrate associated with speech and language processing. Here, we show for the first time that continuously spoken speech can be decoded into the expressed words from intracranial electrocorticographic (ECoG) recordings. Specifically, we implemented a system, which we call Brain-To-Text, that models single phones, employs techniques from automatic speech recognition (ASR), and thereby transforms brain activity while speaking into the corresponding textual representation. Our results demonstrate that our system can achieve word error rates as low as 25% and phone error rates below 50%. Additionally, our approach contributes to the current understanding of the neural basis of continuous speech production by identifying those cortical regions that hold substantial information about individual phones. In conclusion, the Brain-To-Text system described in this paper represents an important step toward human-machine communication based on imagined speech.
This chapter discusses the temporal patterns of rapid movement sequences in speech and typewriting and what these patterns might mean in relation to the advance planning or motor programming of such sequences. The chapter discusses response factors that affect the time to initiate a prespecified rapid movement sequence after a signal when the goal is to complete the sequence as quickly as possible as well as how such factors affect the rate at which movements in the sequence are produced. The response factor of central interest is number of elements in the sequence. The effect of the length of a movement sequence on its latency is based partly on the possibility that it reflects a latency component used for advance planning of the entire sequence: The length effect would then measure the extra time required to prepare extra elements. The idea that changes in reaction time might reflect changes in sequence preparation in this way proposed that simple reaction time increased with the number of elements in a sequence of movements made with one arm. A part of the reaction time includes the time to gain access to stored information concerning the whole sequence: a process akin to loading a program into a motor buffer, with sequences containing more elements requiring larger programs, and larger programs requiring more loading time.
Speaking is one of the most complex actions that we perform, but nearly all of us learn to do it effortlessly. Production of fluent speech requires the precise, coordinated movement of multiple articulators (for example, the lips, jaw, tongue and larynx) over rapid time scales. Here we used high-resolution, multi-electrode cortical recordings during the production of consonant-vowel syllables to determine the organization of speech sensorimotor cortex in humans. We found speech-articulator representations that are arranged somatotopically on ventral pre- and post-central gyri, and that partially overlap at individual electrodes. These representations were coordinated temporally as sequences during syllable production. Spatial patterns of cortical activity showed an emergent, population-level representation, which was organized by phonetic features. Over tens of milliseconds, the spatial patterns transitioned between distinct representations for different consonants and vowels. These results reveal the dynamic organization of speech sensorimotor cortex during the generation of multi-articulator movements that underlies our ability to speak.
Conference Paper
Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
We consume video content in a multitude of ways, including in movie theaters, on television, on DVDs and Blu-rays, online, on smart phones, and on portable media players. For quality control purposes, it is important to have a uniform viewing experience across these various platforms. In this work, we focus on voice synchronization, an aspect of video quality that is strongly affected by current post-production and transmission practices. We examined the synchronization of an actor's voice and lip movements in two distinct scenarios. First, we simulated the temporal mismatch between the audio and video tracks that can occur during dubbing or during broadcast. Next, we recreated the pitch changes that result from conversions between formats with different frame rates. We show, for the first time, that these audio visual mismatches affect viewer enjoyment. When temporal synchronization is noticeably absent, there is a decrease in the perceived performance quality and the perceived emotional intensity of a performance. For pitch changes, we find that higher pitch voices are not preferred, especially for male actors. Based on our findings, we advise that mismatched audio and video signals negatively affect viewer experience.
Regularization of (deep) learning models can be realized at the model, loss, or data level. As a technique somewhere in-between loss and data, label smoothing turns deterministic class labels into probability distributions, for example by uniformly distributing a certain part of the probability mass over all classes. A predictive model is then trained on these distributions as targets, using cross-entropy as loss function. While this method has shown improved performance compared to non-smoothed cross-entropy, we argue that the use of a smoothed though still precise probability distribution as a target can be questioned from a theoretical perspective. As an alternative, we propose a generalized technique called label relaxation, in which the target is a set of probabilities represented in terms of an upper probability distribution. This leads to a genuine relaxation of the target instead of a distortion, thereby reducing the risk of incorporating an undesirable bias in the learning process. Methodically, label relaxation leads to the minimization of a novel type of loss function, for which we propose a suitable closed-form expression for model optimization. The effectiveness of the approach is demonstrated in an empirical study on image data.
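As a rough illustration of the baseline this abstract starts from, the sketch below builds a uniformly smoothed target and scores a prediction with cross-entropy. It shows plain label smoothing only, not the label relaxation loss the abstract proposes; the function names and the smoothing weight `alpha` are assumptions made for illustration.

```python
import math


def smooth_labels(true_class, num_classes, alpha=0.1):
    """Spread a share `alpha` of the probability mass uniformly over all
    classes and keep the remaining 1 - alpha on the true class."""
    base = alpha / num_classes
    target = [base] * num_classes
    target[true_class] += 1.0 - alpha
    return target


def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


if __name__ == "__main__":
    target = smooth_labels(true_class=0, num_classes=4, alpha=0.2)
    predicted = [0.7, 0.1, 0.1, 0.1]  # hypothetical model output
    print(target, cross_entropy(target, predicted))
```

With a smoothed target the loss is minimized when the prediction equals the target rather than at a fully confident prediction, and the abstract questions exactly this property.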
Damage or degeneration of motor pathways necessary for speech and other movements, as in brainstem strokes or amyotrophic lateral sclerosis (ALS), can interfere with efficient communication without affecting brain structures responsible for language or cognition. In the worst-case scenario, this can result in the locked in syndrome (LIS), a condition in which individuals cannot initiate communication and can only express themselves by answering yes/no questions with eye blinks or other rudimentary movements. Existing augmentative and alternative communication (AAC) devices that rely on eye tracking can improve the quality of life for people with this condition, but brain-computer interfaces (BCIs) are also increasingly being investigated as AAC devices, particularly when eye tracking is too slow or unreliable. Moreover, with recent and ongoing advances in machine learning and neural recording technologies, BCIs may offer the only means to go beyond cursor control and text generation on a computer, to allow real-time synthesis of speech, which would arguably offer the most efficient and expressive channel for communication. The potential for BCI speech synthesis has only recently been realized because of seminal studies of the neuroanatomical and neurophysiological underpinnings of speech production using intracranial electrocorticographic (ECoG) recordings in patients undergoing epilepsy surgery. These studies have shown that cortical areas responsible for vocalization and articulation are distributed over a large area of ventral sensorimotor cortex, and that it is possible to decode speech and reconstruct its acoustics from ECoG if these areas are recorded with sufficiently dense and comprehensive electrode arrays. In this article, we review these advances, including the latest neural decoding strategies that range from deep learning models to the direct concatenation of speech units. 
We also discuss state-of-the-art vocoders that are integral in constructing natural-sounding audio waveforms for speech BCIs. Finally, this review outlines some of the challenges ahead in directly synthesizing speech for patients with LIS.
Background Technology to restore the ability to communicate in paralyzed persons who cannot speak has the potential to improve autonomy and quality of life. An approach that decodes words and sentences directly from the cerebral cortical activity of such patients may represent an advancement over existing methods for assisted communication. Methods We implanted a subdural, high-density, multielectrode array over the area of the sensorimotor cortex that controls speech in a person with anarthria (the loss of the ability to articulate speech) and spastic quadriparesis caused by a brain-stem stroke. Over the course of 48 sessions, we recorded 22 hours of cortical activity while the participant attempted to say individual words from a vocabulary set of 50 words. We used deep-learning algorithms to create computational models for the detection and classification of words from patterns in the recorded cortical activity. We applied these computational models, as well as a natural-language model that yielded next-word probabilities given the preceding words in a sequence, to decode full sentences as the participant attempted to say them. Results We decoded sentences from the participant’s cortical activity in real time at a median rate of 15.2 words per minute, with a median word error rate of 25.6%. In post hoc analyses, we detected 98% of the attempts by the participant to produce individual words, and we classified words with 47.1% accuracy using cortical signals that were stable throughout the 81-week study period. Conclusions In a person with anarthria and spastic quadriparesis caused by a brain-stem stroke, words and sentences were decoded directly from cortical activity during attempted speech with the use of deep-learning models and a natural-language model. (Funded by Facebook and others; ClinicalTrials.gov number, NCT03698149.)
Objective: Decoding language representations directly from the brain can enable new Brain-Computer Interfaces (BCI) for high bandwidth human-human and human-machine communication. Clinically, such technologies can restore communication in people with neurological conditions affecting their ability to speak. Approach: In this study, we propose a novel deep network architecture, Brain2Char, for directly decoding text (specifically character sequences) from direct brain recordings (called Electrocorticography, ECoG). The Brain2Char framework combines state-of-the-art deep learning modules: 3D Inception layers for multiband spatiotemporal feature extraction from neural data, bidirectional recurrent layers, and dilated convolution layers, followed by language model weighted beam search to decode character sequences, optimizing a connectionist temporal classification (CTC) loss. Additionally, given the highly non-linear transformations that underlie the conversion of cortical function to character sequences, we perform regularizations on the network's latent representations motivated by insights into cortical encoding of speech production and artifactual aspects specific to ECoG data acquisition. To do this, we impose auxiliary losses on latent representations for articulatory movements, speech acoustics and session specific non-linearities. Main results: In 3 (out of 4) participants reported here, Brain2Char achieves 10.6%, 8.5% and 7.0% Word Error Rates (WER) respectively on vocabulary sizes ranging from 1200 to 1900 words. Significance: These results establish a new end-to-end approach on decoding text from brain signals and demonstrate the potential of Brain2Char as a high-performance communication BCI.
Objective. Direct synthesis of speech from neural signals could provide a fast and natural way of communication to people with neurological diseases. Invasively-measured brain activity (electrocorticography; ECoG) supplies the necessary temporal and spatial resolution to decode fast and complex processes such as speech production. A number of impressive advances in speech decoding using neural signals have been achieved in recent years, but the complex dynamics are still not fully understood. However, it is unlikely that simple linear models can capture the relation between neural activity and continuous spoken speech. Approach. Here we show that deep neural networks can be used to map ECoG from speech production areas onto an intermediate representation of speech (logMel spectrogram). The proposed method uses a densely connected convolutional neural network topology which is well-suited to work with the small amount of data available from each participant. Main results. In a study with six participants, we achieved correlations up to r=0.69 between the reconstructed and original logMel spectrograms. We transferred our prediction back into an audible waveform by applying a Wavenet vocoder. The vocoder was conditioned on logMel features that harnessed a much larger, pre-existing data corpus to provide the most natural acoustic output. Significance. To the best of our knowledge, this is the first time that high-quality speech has been reconstructed from neural recordings during speech production using deep neural networks.
A brain–computer interface (BCI) is a technology that uses neural features to restore or augment the capabilities of its user. A BCI for speech would enable communication in real time via neural correlates of attempted or imagined speech. Such a technology would potentially restore communication and improve quality of life for locked-in patients and other patients with severe communication disorders. There have been many recent developments in neural decoders, neural feature extraction, and brain recording modalities facilitating BCI for the control of prosthetics and in automatic speech recognition (ASR). Indeed, ASR and related fields have developed significantly over the past years, and may lend many insights into the requirements, goals, and strategies for speech BCI. Neural speech decoding is a comparatively new field but has shown much promise with recent studies demonstrating semantic, auditory, and articulatory decoding using electrocorticography (ECoG) and other neural recording modalities. Because the neural representations for speech and language are widely distributed over cortical regions spanning the frontal, parietal, and temporal lobes, the mesoscopic scale of population activity captured by ECoG surface electrode arrays may have distinct advantages for speech BCI, in contrast to the advantages of microelectrode arrays for upper-limb BCI. Nevertheless, there remain many challenges for the translation of speech BCIs to clinical populations. This review discusses and outlines the current state-of-the-art for speech BCI and explores what a speech BCI using chronic ECoG might entail.
For people who cannot communicate due to severe paralysis or involuntary movements, technology that decodes intended speech from the brain may offer an alternative means of communication. If decoding proves to be feasible, intracranial Brain-Computer Interface systems can be developed which are designed to translate decoded speech into computer generated speech or to instructions for controlling assistive devices. Recent advances suggest that such decoding may be feasible from sensorimotor cortex, but it is not clear how this challenge can be approached best. One approach is to identify and discriminate elements of spoken language, such as phonemes. We investigated feasibility of decoding four spoken phonemes from the sensorimotor face area, using electrocorticographic signals obtained with high-density electrode grids. Several decoding algorithms including spatiotemporal matched filters, spatial matched filters and support vector machines were compared. Phonemes could be classified correctly at a level of over 75% with spatiotemporal matched filters. Support Vector machine analysis reached a similar level, but spatial matched filters yielded significantly lower scores. The most informative electrodes were clustered along the central sulcus. Highest scores were achieved from time windows centered around voice onset time, but a 500 ms window before onset time could also be classified significantly. The results suggest that phoneme production involves a sequence of robust and reproducible activity patterns on the cortical surface. Importantly, decoding requires inclusion of temporal information to capture the rapid shifts of robust patterns associated with articulator muscle group contraction during production of a phoneme. The high classification scores are likely to be enabled by the use of high density grids, and by the use of discrete phonemes. Implications for use in Brain-Computer Interfaces are discussed.
Just over fifty years ago, Lisker and Abramson proposed a straightforward measure of acoustic differences among stop consonants of different voicing categories, Voice Onset Time (VOT). Since that time, hundreds of studies have used this method. Here, we review the original definition of VOT, propose some extensions to the definition, and discuss some problematic cases. We propose a set of terms for the most important aspects of VOT and a set of Praat labels that could provide some consistency for future cross-study analyses. Although additions of other aspects of realization of voicing distinctions (F0, amplitude, duration of voicelessness) could be considered, they are rejected as adding too much complexity for what has turned out to be one of the most frequently used metrics in phonetics and phonology.
We propose in this paper a differentiable learning loss between time series. Our proposal builds upon the celebrated Dynamic Time Warping (DTW) discrepancy. Unlike the Euclidean distance, DTW is able to compare asynchronous time series of varying size and is robust to elastic transformations in time. To be robust to such invariances, DTW computes a minimal cost alignment between time series using dynamic programming. Our work takes advantage of a smoothed formulation of DTW, called soft-DTW, that computes the soft-minimum of all alignment costs. We show in this paper that soft-DTW is a differentiable loss function, and that both its value and its gradient can be computed with quadratic time/space complexity (DTW has quadratic time and linear space complexity). We show that our regularization is particularly well suited to average and cluster time series under the DTW geometry, a task for which our proposal significantly outperforms existing baselines (Petitjean et al., 2011). Next, we propose to tune the parameters of a machine that outputs time series by minimizing its fit with ground-truth labels in a soft-DTW sense.
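The soft-minimum recursion described in this abstract can be sketched in a few lines. This is a minimal illustration for scalar-valued series with a squared-difference cost, assuming the standard formulation in which the hard minimum of DTW is replaced by -gamma * log(sum(exp(-a / gamma))). It computes only the value, not the gradient, and the function names are illustrative rather than taken from any library.

```python
import math


def soft_min(values, gamma):
    """Smoothed minimum: -gamma * log(sum(exp(-v / gamma)))."""
    m = min(values)
    if math.isinf(m):
        return m
    # Subtract the true minimum before exponentiating for numerical stability.
    return m - gamma * math.log(sum(math.exp(-(v - m) / gamma) for v in values))


def soft_dtw(x, y, gamma=1.0):
    """Soft-DTW value between two scalar sequences (quadratic time and space)."""
    n, m = len(x), len(y)
    inf = float("inf")
    r = [[inf] * (m + 1) for _ in range(n + 1)]
    r[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (x[i - 1] - y[j - 1]) ** 2
            r[i][j] = cost + soft_min(
                [r[i - 1][j], r[i][j - 1], r[i - 1][j - 1]], gamma)
    return r[n][m]


if __name__ == "__main__":
    a = [0.0, 0.0, 1.0, 2.0]
    b = [0.0, 1.0, 2.0]
    # A small gamma approaches classic DTW; a larger gamma smooths over all paths.
    print(soft_dtw(a, b, gamma=0.01), soft_dtw(a, b, gamma=1.0))
```

As gamma shrinks the result approaches the ordinary DTW cost, while larger values blend in the cost of non-optimal alignments, which is what makes the quantity differentiable.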
Objective: The superior temporal gyrus (STG) and neighboring brain regions play a key role in human language processing. Previous studies have attempted to reconstruct speech information from brain activity in the STG, but few of them incorporate the probabilistic framework and engineering methodology used in modern speech recognition systems. In this work, we describe the initial efforts toward the design of a neural speech recognition (NSR) system that performs continuous phoneme recognition on English stimuli with arbitrary vocabulary sizes using the high gamma band power of local field potentials in the STG and neighboring cortical areas obtained via electrocorticography. Approach: The system implements a Viterbi decoder that incorporates phoneme likelihood estimates from a linear discriminant analysis model and transition probabilities from an n-gram phonemic language model. Grid searches were used in an attempt to determine optimal parameterizations of the feature vectors and Viterbi decoder. Main results: The performance of the system was significantly improved by using spatiotemporal representations of the neural activity (as opposed to purely spatial representations) and by including language modeling and Viterbi decoding in the NSR system. Significance: These results emphasize the importance of modeling the temporal dynamics of neural responses when analyzing their variations with respect to varying stimuli and demonstrate that speech recognition techniques can be successfully leveraged when decoding speech from neural signals. Guided by the results detailed in this work, further development of the NSR system could have applications in the fields of automatic speech recognition and neural prosthetics.
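The decoding step named in this abstract can be illustrated with a generic Viterbi search that combines per-frame state likelihoods with transition probabilities. This is a toy sketch under stated assumptions: it uses a first-order (bigram) transition table rather than a general n-gram model, all probabilities in the example are hypothetical, and it is not the authors' NSR implementation.

```python
import math


def viterbi(log_likelihoods, log_trans, log_prior):
    """Return the most probable state sequence.

    log_likelihoods[t][s]: log p(observation at frame t | state s)
    log_trans[p][s]:       log p(state s | previous state p)
    log_prior[s]:          log p(state s at the first frame)
    """
    n_states = len(log_prior)
    score = [log_prior[s] + log_likelihoods[0][s] for s in range(n_states)]
    backptr = []
    for frame in log_likelihoods[1:]:
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for state s at this frame.
            best_prev = max(range(n_states),
                            key=lambda p: score[p] + log_trans[p][s])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + log_trans[best_prev][s] + frame[s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    return path[::-1]


if __name__ == "__main__":
    log = math.log
    # Hypothetical two-phoneme example with "sticky" transitions.
    likelihoods = [[log(0.9), log(0.1)], [log(0.8), log(0.2)],
                   [log(0.2), log(0.8)], [log(0.1), log(0.9)]]
    transitions = [[log(0.7), log(0.3)], [log(0.3), log(0.7)]]
    prior = [log(0.5), log(0.5)]
    print(viterbi(likelihoods, transitions, prior))
```

Working in log probabilities avoids numerical underflow on long sequences, which matters when frames number in the thousands.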
In this article, we investigated the performance of a real-time voice activity detection module exploiting different time-frequency methods for extracting signal features in a subject with implanted electrocorticographic (ECoG) electrodes. We used ECoG signals recorded while the subject performed a syllable repetition task. The voice activity detection module used, as input, ECoG data streams, on which it performed feature extraction and classification. With this approach we were able to detect voice activity (speech onset and offset) from ECoG signals with high accuracy. The results demonstrate that different time-frequency representations carried complementary information about voice activity, with the S-transform achieving 92% accuracy using the 86 best features and support vector machines as the classifier. The proposed real-time voice activity detector may be used as a part of an automated natural speech BMI system for rehabilitating individuals with communication deficits.
Objective: Although brain-computer interfaces (BCIs) can be used in several different ways to restore communication, communicative BCI has not approached the rate or efficiency of natural human speech. Electrocorticography (ECoG) has precise spatiotemporal resolution that enables recording of brain activity distributed over a wide area of cortex, such as during speech production. In this study, we sought to decode elements of speech production using ECoG. Approach: We investigated words that contain the entire set of phonemes in the general American accent using ECoG with four subjects. Using a linear classifier, we evaluated the degree to which individual phonemes within each word could be correctly identified from cortical signal. Main results: We classified phonemes with up to 36% accuracy when classifying all phonemes and up to 63% accuracy for a single phoneme. Further, misclassified phonemes follow articulation organization described in phonology literature, aiding classification of whole words. Precise temporal alignment to phoneme onset was crucial for classification success. Significance: We identified specific spatiotemporal features that aid classification, which could guide future applications. Word identification was equivalent to information transfer rates as high as 3.0 bits/s (33.6 words/min), supporting pursuit of speech articulation for BCI control.
Brain–machine interfaces for speech restoration have been extensively studied for more than two decades. The success of such a system will depend in part on selecting the best brain recording sites and signal features corresponding to speech production. The purpose of this study was to detect speech activity automatically from electrocorticographic signals based on joint spatial-frequency clustering of the ECoG feature space. For this study, the ECoG signals were recorded while a subject performed two different syllable repetition tasks. We found that the optimal frequency resolution to detect speech activity from ECoG signals was 8 Hz, achieving 98.8% accuracy by employing support vector machines as a classifier. We also defined the cortical areas that held the most information about the discrimination of speech and nonspeech time intervals. Additionally, the results shed light on the distinct cortical areas associated with the two syllable repetition tasks and may contribute to the development of portable ECoG-based communication.
Results from chronometric and speech errors studies provide convergent evidence for both lower and upper bounds on interaction within the speech production system. Some degree of cascading activation is required to account for patterns of speech errors in neurologically intact and impaired speakers as well as the results of recent chronometric studies. However, the strength of this form of interaction must be limited to account for the occurrence of selective deficits in the production system and restrictions on the conditions under which interactive effects influence reaction times. Similarly, some amount of feedback from phonological to word-level representations is necessary to account for patterns of speech errors in neurologically intact and impaired individuals as well as the influence of phonological neighbours on response latency. This interactive mechanism must also be limited to account for restrictions on the types of speech errors produced following selective deficits within the production system. Results from a variety of empirical traditions therefore converge on the same conclusion: interaction is present, but it must be crucially limited.
Sensorimotor integration is an active domain of speech research and is characterized by two main ideas, that the auditory system is critically involved in speech production and that the motor system is critically involved in speech perception. Despite the complementarity of these ideas, there is little crosstalk between these literatures. We propose an integrative model of the speech-related "dorsal stream" in which sensorimotor interaction primarily supports speech production, in the form of a state feedback control architecture. A critical component of this control system is forward sensory prediction, which affords a natural mechanism for limited motor influence on perception, as recent perceptual research has suggested. Evidence shows that this influence is modulatory but not necessary for speech perception. The neuroanatomy of the proposed circuit is discussed as well as some probable clinical correlates including conduction aphasia, stuttering, and aspects of schizophrenia.
Residual activation of the cortex was investigated in nine patients with complete spinal cord injury between T6 and L1 by functional magnetic resonance imaging (fMRI). Brain activations were recorded under four conditions: (1) a patient attempting to move his toes with flexion-extension, (2) a patient imagining the same movement, (3) passive proprio-somesthesic stimulation of the big toes without visual control, and (4) passive proprio-somesthesic stimulation of the big toes with visual control by the patient. Passive proprio-somesthesic stimulation of the toes generated activation posterior to the central sulcus in the three patients who also showed a somesthesic evoked potential response to somesthesic stimulation. When performed under visual control, activations were observed in two more patients. In all patients, activations were found in the cortical areas involved in motor control (i.e., primary sensorimotor cortex, premotor regions and supplementary motor area [SMA]) during attempts to move or mental imagery of these tasks. It is concluded that even several years after injury with some local cortical reorganization, activation of lower limb cortical networks can be generated either by the attempt to move, the mental evocation of the action, or the visual feedback of a passive proprio-somesthesic stimulation.
For many years people have speculated that electroencephalographic activity or other electrophysiological measures of brain function might provide a new non-muscular channel for sending messages and commands to the external world – a brain–computer interface (BCI). Over the past 15 years, productive BCI research programs have arisen. Encouraged by new understanding of brain function, by the advent of powerful low-cost computer equipment, and by growing recognition of the needs and potentials of people with disabilities, these programs concentrate on developing new augmentative communication and control technology for those with severe neuromuscular disorders, such as amyotrophic lateral sclerosis, brainstem stroke, and spinal cord injury. The immediate goal is to provide these users, who may be completely paralyzed, or ‘locked in’, with basic communication capabilities so that they can express their wishes to caregivers or even operate word processing programs or neuroprostheses. Present-day BCIs determine the intent of the user from a variety of different electrophysiological signals. These signals include slow cortical potentials, P300 potentials, and mu or beta rhythms recorded from the scalp, and cortical neuronal activity recorded by implanted electrodes. They are translated in real-time into commands that operate a computer display or other device. Successful operation requires that the user encode commands in these signals and that the BCI derive the commands from the signals. Thus, the user and the BCI system need to adapt to each other both initially and continually so as to ensure stable performance. Current BCIs have
Conference Paper
An investigation has been made for individual phonemes focusing mainly on their duration in continuous speech spoken at different rates: fast, normal, and slow. Fifteen short sentences uttered by four male speakers have been used as the speech material which comprises a total of 291 morae. The normal speaking rate (n-speech) is, on average, 150 milliseconds/mora (or 300 morae/minute) and the four speakers were asked to read the sentences twice as fast as (f-speech) and half as slow as (s-speech) the normal speed in reference to n-speech. Among consonants, the greatest influence has been found to occur on the syllabic nasal /N/ and the least on the voiceless stop /t/ in f-speech. For s-speech, /N/ has also been found to be the greatest but the least is voiced stop /d/. The ratio of duration between consonant and vowel of a CV-syllable in f-speech is kept almost the same as that in n-speech while vowel lengthening becomes significantly large in s-speech.