Alan W Black’s research while affiliated with Carnegie Mellon University and other places


Publications (463)


Deep Speech Synthesis from Multimodal Articulatory Representations
  • Preprint

December 2024 · 17 Reads

Peter Wu · [...] · Kevin Scheck · [...]

The amount of articulatory data available for training deep learning models is much less compared to acoustic speech data. In order to improve articulatory-to-acoustic synthesis performance in these low-resource settings, we propose a multimodal pre-training framework. On single-speaker speech synthesis tasks from real-time magnetic resonance imaging and surface electromyography inputs, the intelligibility of synthesized outputs improves noticeably. For example, compared to prior work, utilizing our proposed transfer learning methods improves the MRI-to-speech performance by 36% word error rate. In addition to these intelligibility results, our multimodal pre-trained models consistently outperform unimodal baselines on three objective and subjective synthesis quality metrics.
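
The abstract describes a transfer-learning setup without giving code; as a rough sketch only, the PyTorch-style snippet below illustrates one plausible way to reuse a shared synthesis backbone across articulatory modalities. All module names, dimensions, and the overall architecture are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of cross-modal transfer for articulatory-to-acoustic
# synthesis: pre-train on one articulatory modality, then swap in a new
# input projection and fine-tune on a low-resource modality (e.g. rtMRI, EMG).
import torch
import torch.nn as nn

class ArticulatoryToSpeech(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int = 256, mel_dim: int = 80):
        super().__init__()
        self.project = nn.Linear(input_dim, hidden_dim)        # modality-specific
        self.encoder = nn.GRU(hidden_dim, hidden_dim, num_layers=3,
                              batch_first=True, bidirectional=True)
        self.to_mel = nn.Linear(2 * hidden_dim, mel_dim)       # shared decoder head

    def forward(self, articulatory_feats):
        # articulatory_feats: (batch, frames, input_dim)
        h = self.project(articulatory_feats)
        h, _ = self.encoder(h)
        return self.to_mel(h)                                   # predicted mel frames

# Pre-train on a high-resource articulatory modality (dimensions are placeholders).
source_model = ArticulatoryToSpeech(input_dim=24)
# ... train source_model on (articulatory, mel) pairs ...

# Transfer: reuse encoder/decoder weights, re-initialize only the input projection
# for the new modality, then fine-tune on the small MRI or EMG dataset.
target_model = ArticulatoryToSpeech(input_dim=120)
target_model.encoder.load_state_dict(source_model.encoder.state_dict())
target_model.to_mel.load_state_dict(source_model.to_mel.state_dict())
```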


Sylber: Syllabic Embedding Representation of Speech from Raw Audio

October 2024 · 10 Reads

Syllables are compositional units of spoken language that play a crucial role in human speech perception and production. However, current neural speech representations lack structure, resulting in dense token sequences that are costly to process. To bridge this gap, we propose a new model, Sylber, that produces speech representations with clean and robust syllabic structure. Specifically, we propose a self-supervised model that regresses features on syllabic segments distilled from a teacher model which is an exponential moving average of the model in training. This results in a highly structured representation of speech features, offering three key benefits: 1) a fast, linear-time syllable segmentation algorithm, 2) efficient syllabic tokenization with an average of 4.27 tokens per second, and 3) syllabic units better suited for lexical and syntactic understanding. We also train token-to-speech generative models with our syllabic units and show that fully intelligible speech can be reconstructed from these tokens. Lastly, we observe that categorical perception, a linguistic phenomenon of speech perception, emerges naturally in our model, making the embedding space more categorical and sparse than previous self-supervised learning approaches. Together, we present a novel self-supervised approach for representing speech as syllables, with significant potential for efficient speech tokenization and spoken language modeling.
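
As a minimal sketch of the self-distillation idea described above (an exponential-moving-average teacher plus feature regression over syllabic segments): the snippet assumes PyTorch and placeholder syllable boundaries, and the function names and mean-pooling choice are illustrative rather than the paper's exact objective.

```python
# Illustrative sketch of an EMA teacher update and a segment-level regression
# loss, assuming syllable boundaries (start, end) in frames are given.
# Typically the teacher starts as a deep copy of the student model.
import torch
import torch.nn.functional as F

def ema_update(teacher, student, decay: float = 0.999):
    """Teacher parameters track an exponential moving average of the student."""
    with torch.no_grad():
        for t, s in zip(teacher.parameters(), student.parameters()):
            t.mul_(decay).add_(s, alpha=1.0 - decay)

def segment_regression_loss(student_feats, teacher_feats, segments):
    """Regress student frames onto the teacher's mean feature of each syllable.

    student_feats, teacher_feats: (frames, dim) tensors; segments: [(start, end), ...].
    """
    losses = []
    for start, end in segments:
        target = teacher_feats[start:end].mean(dim=0)          # pooled syllable target
        losses.append(F.mse_loss(student_feats[start:end],
                                 target.expand(end - start, -1)))
    return torch.stack(losses).mean()
```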



Figure 1: (a) Experimental setup including a dry-electrode neckband, baseline monitoring face electrodes, a wet reference electrode behind the right ear, and neck-worn electronics behind the head. (b) Partial photograph of 3D-printed, gold-plated neck electrodes. (c) Sample renders of the experiment GUI's subject and host views. The subject view displays a teleprompter while raw EMG data is live-plotted on the host view. (d) Raw sample EMG from a single utterance of the words 'Heed' and 'Kale'. (e) Sample EMG time-frequency spectrograms (see section 3.2) from a single utterance of the words 'Heed' and 'Kale'.
Figure 2: Classification accuracy for different numbers of neck electrodes. Solid lines are means and opaque regions are 95% confidence intervals.
Figure 3: Confusion matrices using model trained on (a) the 10 neck channels and (b) all 13 channels.
Figure 4: A weighted sum of self-supervised speech features matches EMG spectrogram frequency bins. Here, we plot one EMG channel of a "kale" utterance for bins 90-94 Hz, 102-105 Hz, 238-242 Hz, and 348-352 Hz (top to bottom).
Towards EMG-to-Speech with a Necklace Form Factor
  • Preprint
  • File available

July 2024 · 22 Reads

Electrodes for decoding speech from electromyography (EMG) are typically placed on the face, requiring adhesives that are inconvenient and skin-irritating if used regularly. We explore a different device form factor, where dry electrodes are placed around the neck instead. 11-word, multi-speaker voiced EMG classifiers trained on data recorded with this device achieve 92.7% accuracy. Ablation studies reveal the importance of having more than two electrodes on the neck, and phonological analyses reveal similar classification confusions between neck-only and neck-and-face form factors. Finally, speech-EMG correlation experiments demonstrate a linear relationship between many EMG spectrogram frequency bins and self-supervised speech representation dimensions.
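
The reported linear relationship between EMG spectrogram bins and self-supervised speech features can be probed with an ordinary least-squares fit. The hedged sketch below uses random placeholder arrays standing in for real frame-aligned EMG and SSL features; variable names and shapes are assumptions for illustration.

```python
# Hypothetical check of a linear relationship: regress a single EMG spectrogram
# frequency bin on self-supervised speech features and measure the fit.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
ssl_feats = rng.normal(size=(500, 768))   # (frames, SSL feature dim), placeholder
emg_bin = rng.normal(size=500)            # one EMG spectrogram bin per frame, placeholder

# With real data the two arrays must be aligned frame-by-frame and a held-out
# split should be used to avoid an optimistic fit.
model = LinearRegression().fit(ssl_feats, emg_bin)
predicted = model.predict(ssl_feats)
corr = np.corrcoef(predicted, emg_bin)[0, 1]   # weighted-sum-of-features vs. EMG bin
print(f"Pearson r between weighted SSL features and EMG bin: {corr:.3f}")
```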




Proposed text classification system architecture.
Proposed co-reference resolution evaluation pipeline.
Compressed models for co-reference resolution: enhancing efficiency with debiased word embeddings

October 2023 · 19 Reads · 2 Citations

This work presents a comprehensive approach to reduce bias in word embedding vectors and evaluate the impact on various Natural Language Processing (NLP) tasks. Two GloVe variations (840B and 50) are debiased by identifying the gender direction in the word embedding space and then removing or reducing the gender component from the embeddings of target words, while preserving useful semantic information. Their gender bias is assessed through the Word Embedding Association Test. The performance of co-reference resolution and text classification models trained on both original and debiased embeddings is evaluated in terms of accuracy. A compressed co-reference resolution model is examined to gauge the effectiveness of debiasing techniques on resource-efficient models. To the best of the authors’ knowledge, this is the first attempt to apply compression techniques to debiased models. By analyzing the context preservation of debiased embeddings using a Twitter misinformation dataset, this study contributes valuable insights into the practical implications of debiasing methods for real-world applications such as person profiling.
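
A minimal sketch of the kind of projection-based debiasing described above, assuming a dictionary of GloVe vectors: the definitional pair list and the averaging of difference vectors are simplifications (the paper's exact gender-direction estimation may differ), so treat this as illustrative only.

```python
# Simplified debiasing: estimate a gender direction from definitional word pairs
# and project it out of target word vectors. `embeddings` is a placeholder dict.
import numpy as np

def gender_direction(embeddings, pairs=(("he", "she"), ("man", "woman"))):
    """Average the difference vectors of definitional pairs and normalize."""
    diffs = [embeddings[a] - embeddings[b] for a, b in pairs]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)

def debias(vector, direction):
    """Remove the component of `vector` along the bias direction."""
    return vector - np.dot(vector, direction) * direction

# Usage (assuming `embeddings` maps words to GloVe vectors):
# g = gender_direction(embeddings)
# embeddings["nurse"] = debias(embeddings["nurse"], g)
```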


Deep Speech Synthesis from MRI-Based Articulatory Representations

August 2023 · 114 Reads · 18 Citations

In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.
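
The abstract mentions normalization and denoising of MRI-derived features without detailing them; the snippet below shows generic stand-ins (per-speaker z-scoring and moving-average smoothing) that are assumptions for illustration, not the procedures proposed in the paper.

```python
# Illustrative stand-ins for normalizing and denoising MRI-derived articulatory
# trajectories before training a synthesis model.
import numpy as np
from scipy.ndimage import uniform_filter1d

def normalize_per_speaker(feats):
    """Z-score each feature dimension; feats: (frames, dims) for one speaker."""
    mean = feats.mean(axis=0, keepdims=True)
    std = feats.std(axis=0, keepdims=True) + 1e-8
    return (feats - mean) / std

def denoise(feats, window: int = 5):
    """Simple moving-average smoothing along the time axis."""
    return uniform_filter1d(feats, size=window, axis=0)
```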


Deep Speech Synthesis from MRI-Based Articulatory Representations

July 2023 · 57 Reads · 1 Citation

In this paper, we study articulatory synthesis, a speech synthesis method using human vocal tract information that offers a way to develop efficient, generalizable and interpretable synthesizers. While recent advances have enabled intelligible articulatory synthesis using electromagnetic articulography (EMA), these methods lack critical articulatory information like excitation and nasality, limiting generalization capabilities. To bridge this gap, we propose an alternative MRI-based feature set that covers a much more extensive articulatory space than EMA. We also introduce normalization and denoising procedures to enhance the generalizability of deep learning methods trained on MRI data. Moreover, we propose an MRI-to-speech model that improves both computational efficiency and speech fidelity. Finally, through a series of ablations, we show that the proposed MRI representation is more comprehensive than EMA and identify the most suitable MRI feature subset for articulatory synthesis.


Figure 1. Proposed Sentiment Analysis system architecture
GloVe 840 word embedding vectors
Compressed Models for Co-reference Resolution: Enhancing Efficiency with Debiased Word Embeddings

June 2023 · 23 Reads

This work presents a comprehensive approach to reduce bias in word embedding vectors and evaluate the impact on various Natural Language Processing (NLP) tasks. Two GloVe variations (840B and 50) are debiased by identifying the gender direction in the word embedding space and then removing or reducing the gender component from the embeddings of target words, while preserving useful semantic information. Their gender bias is assessed through the Word Embedding Association Test. The performance of co-reference resolution and sentiment analysis models trained on both original and debiased embeddings is evaluated in terms of accuracy. A compressed co-reference resolution model is examined to gauge the effectiveness of debiasing techniques on resource-efficient models. To the best of the authors' knowledge, this is the first attempt to apply compression techniques to debiased models. By analyzing the context preservation of debiased embeddings using a Twitter misinformation dataset, this study contributes valuable insights into the practical implications of debiasing methods for real-world applications such as human profiling.


Citations (46)


... Syllabic structure in speech SSL. Previous studies have demonstrated that syllabic structure can be induced by SSL (Peng et al., 2023; Cho et al., 2024b; Komatsu & Shinozaki, 2024). Peng et al. (2023) show that syllabic structure in SSL features can be induced by jointly training with images and spoken captions. ...

Reference: Sylber: Syllabic Embedding Representation of Speech from Raw Audio

SD-HuBERT: Sentence-Level Self-Distillation Induces Syllabic Organization in HuBERT
  • Citing Conference Paper
  • April 2024

... In particular, speech tokens obtained by quantizing SSL features are receiving attention for understanding and generating spoken language (Hassid et al., 2024). Substantial evidence suggests that SSL features are highly phonetic (Cho et al., 2023; 2024a), which suggests that these quantized tokens are sub-phonemic units that densely tile the phonetic space (Sicherman & Adi, 2023). While capturing fine-grained speech content, most existing speech tokenization approaches yield high-frequency tokens (25-75 Hz), resulting in a long sequence of tokens to be processed. ...

Self-Supervised Models of Speech Infer Universal Articulatory Kinematics
  • Citing Conference Paper
  • April 2024

... Because of the translating subtitle encoder block, it is also possible to apply an additional CTC loss to the output of the subtitle encoder for subtitle data, although we found only limited benefits in doing this. Previous work [76,77] has argued that a Transformer model with a CTC objective is able to translate an input sequence to a non-monotonic alignment, despite the monotonicity assumptions in the CTC framework (e.g. in machine translation). Applying a subtitle CTC loss was not possible in the previous methods, as a shared CTC module for verbatim and subtitled data, or separate CTC objectives applied to the same encoder output, would lead to contradictory objectives for the encoder layers due to the different alignments between verbatim transcriptions and subtitles. ...

CTC Alignments Improve Autoregressive Translation
  • Citing Conference Paper
  • January 2023

... However, relying on ground-truth mel-spectrograms that include speaker-specific details and ambient noise forces models to learn irrelevant information, which reduces intelligibility (WER up to 102.6% [1]) and hinders generalization. Recent direct single-stage approaches [18] face similar challenges, additionally requiring extensive speaker-specific preprocessing of rtMRI video data. Despite this, error rates remain high (Character Error Rate (CER) at 69.2%), and their effectiveness across multiple speakers and datasets remains to be seen. ...

Deep Speech Synthesis from MRI-Based Articulatory Representations

... [16] combine pretrained self-supervised learning (SSL), ASR, language model (LM) and SLU models to explore the model combination that can achieve the best SLU performance and show that pre-training approaches rooted in self-supervised learning are more potent than those grounded in supervised learning. [17] develop compositional end-to-end SLU systems that initially transform spoken utterances into a series of token representations, followed by token-level classification using an NLU system. [6] suggest a comprehensive approach that combines a multilingual pretrained speech model, a text model, and an adapter to enhance speech understanding. ...

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models
  • Citing Conference Paper
  • January 2022

... MDETR [11], 12-in-1 [12], VLT5 [13] utilize vision-language pretraining [14,15] for enhancing the understanding of multi-modality [16,17]. In addition, although there are some works [18,19,20,21,22,23] that explore self-explaining neural networks on NLP tasks and image captioning, we focus on constraining and improving pre-trained vision-language models for reasoning in compositional VQA. ...

On Advances in Text Generation from Images Beyond Captioning: A Case Study in Self-Rationalization
  • Citing Conference Paper
  • January 2022

... Moreover, datasets are available upon request (Birkholz et al., 2020; Dediu et al., 2022; Isaieva et al., 2023), and the associated protocols, even when the dataset is not immediately available, serve as valuable guidelines for others to gather high-quality data (Lim et al., 2023; Wu et al., 2023). Notably, there is an uptick in the utilisation of machine learning techniques in tasks related to vocal tract MRI (Ribeiro et al., 2022; Ruthven et al., 2023). ...

Deep Speech Synthesis from MRI-Based Articulatory Representations
  • Citing Preprint
  • July 2023

... These analyses have also benefited from direct observation of the dynamics of speech articulation using a variety of instrumental methods, including electromagnetic articulography (Cho et al., 2023;Wu et al., 2022), ultrasound (Palo et al., 2014;Pertti et al., 2015), and real-time magnetic resonance imaging (rtMRI) (Kim et al., 2014;Lim et al., 2021). Recent efforts (e.g., Lian et al., 2023) have begun to examine the connections between SSL speech representations and articulatory movement trajectories, showing that continuous articulations align closely with SSL speech representations. However, detailed analysis involving variations in speech production and their relation to phoneme recognition with the SSL speech model is largely under-investigated. ...

Articulatory Representation Learning via Joint Factor Analysis and Neural Matrix Factorization
  • Citing Conference Paper
  • June 2023

... The second dataset, TORGO [33], contains EMA sensor data but lacks a palate trace. Previous research [34] proposed approximating the missing palate trace using the convex hull of tongue data, but we did not include this method in our experiments. Thus, our experiments included only six of the nine TVs, excluding tongue-related constriction degrees ("CD"). ...

Speaker-Independent Acoustic-to-Articulatory Speech Inversion
  • Citing Conference Paper
  • June 2023

... Conventional F0 estimation approaches have utilized digital speech processing-based (DSP) heuristics. These DSP-based approaches can be divided into 3 main categories: time domain processing [6,7,8,9], frequency domain processing [10,11,12,13], and time-frequency domain processing [14,15]. ...

A Fast and Accurate Pitch Estimation Algorithm Based on the Pseudo Wigner-Ville Distribution