Samuele Cornell’s research while affiliated with Carnegie Mellon University and other places


Publications (79)


Figure 3: Tag occurrences of different samples in the non-blind and blind test sets. The "hard samples" are defined as samples with poor metric scores below pre-defined thresholds. The detailed rule for determining hard samples can be found in Section 2.2.
Figure 4: Correlations (KRCC and LCC) between MOS and other objective metrics on the blind test data.
Lessons Learned from the URGENT 2024 Speech Enhancement Challenge
  • Preprint
  • File available

June 2025 · 17 Reads · Kohei Saijo · Samuele Cornell · [...]

The URGENT 2024 Challenge aims to foster speech enhancement (SE) techniques with great universality, robustness, and generalizability, featuring a broader task definition, large-scale multi-domain data, and comprehensive evaluation metrics. Nourished by the challenge outcomes, this paper presents an in-depth analysis of two key, yet understudied, issues in SE system development: data cleaning and evaluation metrics. We highlight several overlooked problems in traditional SE pipelines: (1) mismatches between declared and effective audio bandwidths, along with label noise even in various "high-quality" speech corpora; (2) the lack of both effective SE systems to conquer the hardest conditions (e.g., speech overlap, strong noise/reverberation) and a reliable measure of speech sample difficulty; (3) the importance of combining multifaceted metrics for a comprehensive evaluation that correlates well with human judgment. We hope that this endeavor can inspire improved SE pipeline designs in the future.
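As an aside on point (1), a minimal sketch of how an effective-bandwidth check could be implemented is shown below; the STFT settings and the -50 dB activity threshold are illustrative assumptions, not the challenge's actual data-cleaning procedure.

```python
# Minimal sketch (not the challenge's exact procedure): estimate the effective
# bandwidth of a signal as the highest frequency whose average power stays
# within `threshold_db` of the most energetic frequency bin.
import numpy as np
from scipy.signal import stft

def effective_bandwidth(x, fs, threshold_db=-50.0, nperseg=2048):
    f, _, Z = stft(x, fs=fs, nperseg=nperseg)
    power = np.mean(np.abs(Z) ** 2, axis=1)        # average power per frequency bin
    power_db = 10 * np.log10(power + 1e-12)
    active = np.where(power_db >= power_db.max() + threshold_db)[0]
    return f[active[-1]]                           # highest "active" frequency, in Hz

# A nominally 48 kHz file that was merely upsampled from 16 kHz material would
# report an effective bandwidth of roughly 8 kHz here, flagging the mismatch.
```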


Figure 4: Average training time per epoch for UniVERSA, UniVERSA-T, and ARECHO on the Base training set. The proposed ARECHO model demonstrates improved training efficiency.
Training data composition across domains.
List of supported non-matching metrics in VERSA. The "Model Based" column indicates metrics that need pre-trained models. The "Target Direction" column indicates which direction is desirable for each metric.
ARECHO: Autoregressive Evaluation via Chain-Based Hypothesis Optimization for Speech Multi-Metric Estimation

May 2025 · 15 Reads

Speech signal analysis poses significant challenges, particularly in tasks such as speech quality evaluation and profiling, where the goal is to predict multiple perceptual and objective metrics. For instance, metrics like PESQ (Perceptual Evaluation of Speech Quality), STOI (Short-Time Objective Intelligibility), and MOS (Mean Opinion Score) each capture different aspects of speech quality. However, these metrics often have different scales, assumptions, and dependencies, making joint estimation non-trivial. To address these issues, we introduce ARECHO (Autoregressive Evaluation via Chain-based Hypothesis Optimization), a chain-based, versatile evaluation system for speech assessment grounded in autoregressive dependency modeling. ARECHO is distinguished by three key innovations: (1) a comprehensive speech information tokenization pipeline; (2) a dynamic classifier chain that explicitly captures inter-metric dependencies; and (3) a two-step confidence-oriented decoding algorithm that enhances inference reliability. Experiments demonstrate that ARECHO significantly outperforms the baseline framework across diverse evaluation scenarios, including enhanced speech analysis, speech generation evaluation, and noisy speech evaluation. Furthermore, its dynamic dependency modeling improves interpretability by capturing inter-metric relationships.
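The dynamic classifier chain at the core of ARECHO builds on the classical chain idea, in which each target is predicted conditioned on the inputs and on the targets predicted earlier in the chain. A toy sketch of that underlying idea (not the paper's autoregressive neural implementation, and with synthetic data) might look as follows:

```python
# Toy sketch of the chain idea only: each metric is predicted from the features
# AND from the metrics predicted earlier in the chain (ARECHO itself is an
# autoregressive neural model over tokenized speech, not a linear chain).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.multioutput import RegressorChain

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))             # stand-in for per-utterance speech features
pesq = X[:, 0] + 0.1 * rng.normal(size=200)
stoi = 0.5 * pesq + X[:, 1]                # deliberately correlated with PESQ
mos = 0.7 * pesq + 0.3 * stoi
Y = np.stack([pesq, stoi, mos], axis=1)    # columns: PESQ, STOI, MOS

# order=[0, 1, 2]: predict PESQ first, then STOI given PESQ, then MOS given both.
chain = RegressorChain(Ridge(), order=[0, 1, 2]).fit(X, Y)
print(chain.predict(X[:2]))
```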


Interspeech 2025 URGENT Speech Enhancement Challenge

May 2025 · 24 Reads

There has been a growing effort to develop universal speech enhancement (SE) to handle inputs with various speech distortions and recording conditions. The URGENT Challenge series aims to foster such universal SE by embracing a broad range of distortion types, increasing data diversity, and incorporating extensive evaluation metrics. This work introduces the Interspeech 2025 URGENT Challenge, the second edition of the series, to explore several aspects that have received limited attention so far: language dependency, universality for more distortion types, data scalability, and the effectiveness of using noisy training data. We received 32 submissions, where the best system uses a discriminative model, while most other competitive ones are hybrid methods. Analysis reveals some key findings: (i) some generative or hybrid approaches are preferred in subjective evaluations over the top discriminative model, and (ii) purely generative SE models can exhibit language dependency.




ESPnet-SpeechLM: An Open Speech Language Model Toolkit

February 2025 · 29 Reads

We present ESPnet-SpeechLM, an open toolkit designed to democratize the development of speech language models (SpeechLMs) and voice-driven agentic applications. The toolkit standardizes speech processing tasks by framing them as universal sequential modeling problems, encompassing a cohesive workflow of data preprocessing, pre-training, inference, and task evaluation. With ESPnet-SpeechLM, users can easily define task templates and configure key settings, enabling seamless and streamlined SpeechLM development. The toolkit ensures flexibility, efficiency, and scalability by offering highly configurable modules for every stage of the workflow. To illustrate its capabilities, we provide multiple use cases demonstrating how competitive SpeechLMs can be constructed with ESPnet-SpeechLM, including a 1.7B-parameter model pre-trained on both text and speech tasks, across diverse benchmarks. The toolkit and its recipes are fully transparent and reproducible at: https://github.com/espnet/espnet/tree/speechlm.
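As a purely hypothetical illustration of what framing speech tasks as universal sequential modeling problems means (the names below are invented and are not ESPnet-SpeechLM's actual task-template schema), a task template can be thought of as declaring which token streams condition the model and which it must generate:

```python
# Hypothetical illustration only (NOT the ESPnet-SpeechLM API): a "task template"
# declares which token streams appear in the training sequence and in what order,
# so that one decoder-only LM can be shared across tasks.
ASR_TEMPLATE = {
    "task": "asr",
    "conditions": ["speech_codec_tokens"],   # what the model is conditioned on
    "targets": ["text_bpe_tokens"],          # what it must generate
}

def to_training_sequence(template, example):
    """Flatten a template plus an example's token streams into one sequence,
    using sentinel tokens to mark the task and each stream boundary."""
    seq = [f"<{template['task']}>"]
    for name in template["conditions"] + template["targets"]:
        seq += [f"<{name}>"] + list(example[name])
    return seq

example = {"speech_codec_tokens": [12, 7, 93], "text_bpe_tokens": [5, 42]}
print(to_training_sequence(ASR_TEMPLATE, example))
# ['<asr>', '<speech_codec_tokens>', 12, 7, 93, '<text_bpe_tokens>', 5, 42]
```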





ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration

September 2024 · 35 Reads

We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the amount of newly written code by 2.7x and the amount of dependent code by 6.7x while dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
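For a sense of what the Hugging Face datasets integration enables, the following is a generic sketch using plain `datasets` calls only; it is not ESPnet-EZ's actual interface, just the style of Bash-free, Python-only data handling the toolkit targets:

```python
# Generic sketch with plain Hugging Face `datasets` (not ESPnet-EZ's API):
# build, preprocess, and split a small corpus entirely from Python.
from datasets import Dataset

ds = Dataset.from_dict({
    "audio_path": ["utt1.wav", "utt2.wav", "utt3.wav", "utt4.wav"],
    "text": ["hello world", "speech processing", "open toolkit", "easy fine-tuning"],
})
ds = ds.map(lambda ex: {"n_words": len(ex["text"].split())})  # lightweight preprocessing
splits = ds.train_test_split(test_size=0.5, seed=0)           # deterministic split
print(splits["train"][0])
```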


Citations (40)


... Recent advancements in text-based generation, particularly with diffusion models [6,9,27,29,30], have significantly improved the generation of high-fidelity images [1,7,19,43,49,57], audio [14,17,18,24,34,45,46], and video [4,11,16,20,28,31,52,53] from textual descriptions, achieving remarkable advancements in visual fidelity and semantic alignment. By iteratively refining a noisy input until it converges to a sample that aligns with the given text prompt, these models capture intricate details and complex compositions previously unattainable with other approaches. ...

Reference:

Accelerating Diffusion Sampling via Exploiting Local Transition Coherence
Diffusion-Based Generative Modeling With Discriminative Guidance for Streamable Speech Enhancement
  • Citing Conference Paper
  • December 2024

... To address these issues, we introduce an additional speech enhancement (SE) model called G-SpatialNet. On the other hand, since reference clean speech is unavailable in real-world scenarios, previous studies [2,8,9] mainly rely on simulated data to train SE models. However, this introduces a domain mismatch between simulated and real-recorded data, which significantly degrades model performance in real-world scenarios [10,11,12]. ...

The CHiME-8 DASR Challenge for Generalizable and Array Agnostic Distant Automatic Speech Recognition and Diarization
  • Citing Conference Paper
  • September 2024

... Speech data synthesis [1] is a promising way to address data scarcity, enabling advancements in domain adaptation, recognition of rare names, numeric transcription, and low-resource languages [2][3][4]. Recent work leverages LLMs for text generation and multi-speaker TTS models for speech synthesis [5]. However, effective integration requires high-quality TTS models that produce naturalistic audio, as excessive synthesized data can degrade ASR performance on spontaneous and conversational speech. ...

Generating Data with Text-to-Speech and Large-Language Models for Conversational Speech Recognition
  • Citing Conference Paper
  • August 2024

... To overcome this, we expand the training set to over 500k utterances from 5,750 source recordings, covering submissions from both the URGENT 2024 [22] and URGENT 2025 [23] challenges. This expansion introduces significantly more content and distortion variability, along with a greater amount of multilingual and MOS-labeled data. ...

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement
  • Citing Conference Paper
  • September 2024

... Each conversation is around 10 minutes long. We use the train, validation, and test split from [41] (11577, 61, and 61 conversations, amounting to 1960 h, 7 h, and 7 h, respectively). Since it consists of telephone speech, it features separate channels for each of the two speakers. ...

End-to-end integration of speech separation and voice activity detection for low-latency diarization of telephone conversations
  • Citing Article
  • May 2024

Speech Communication

... In the domain of AOSS, many effective causal methods have already been proposed [23,16,7]. Unfortunately, straightforward adaptation of these advanced causal AOSS models for AVSS tasks may not yield optimal results. For example, Zhang et al. [36] transformed a causal Conv-TasNet [23] into an audio-visual causal version (Causal AV-ConvTasNet), yet the performance was suboptimal (see Section 5.3). ...

Resource-Efficient Separation Transformer
  • Citing Conference Paper
  • April 2024

... Some systems operate modularly, combining diarization, source separation, speaker clustering, and ASR as separate components. Others follow end-to-end strategies incorporating speaker tokens [16,17] or use multiple decoder heads to generate separate transcripts [18]. ...

One Model to Rule Them All ? Towards End-to-End Joint Speaker Diarization and Speech Recognition
  • Citing Conference Paper
  • April 2024

... The message-passing mechanism allows nodes to exchange and propagate information from their neighbors, effectively capturing dependencies across multiple hops (Dong & Sun, 2024). Additionally, the multi-head self-attention mechanism assigns varying weights to each edge based on its importance, thereby improving the model's ability to learn from complex node relationships (Aironi et al., 2024). By utilizing the message-passing and multi-head self-attention mechanisms, this prediction model might effectively leverage the information from neighboring nodes to enhance model generalization and performance. ...

A Graph-Based Neural Approach to Linear Sum Assignment Problems
  • Citing Article
  • December 2023

International Journal of Neural Systems

... Overall performance is obtained by macro-averaging these dysfluency-class-wise scores to mitigate class imbalance, particularly in FluencyBank++, where interjections occur over twice as often as prolongations. This approach treats all dysfluency classes equally and aligns with prior work on unbalanced datasets [27,28]. Selection of Embeddings and Clustering Method: We explore different combinations of clustering methods and speech embeddings to determine the optimal setup for StutterCut. ...

The CHiME-7 DASR Challenge: Distant Meeting Transcription with Multiple Devices in Diverse Scenarios
  • Citing Conference Paper
  • August 2023

... These applications also require robust and redundant communication networks. To reduce their cost, best-effort delivery networks can be aided by data reconstruction through inpainting with low-dimensionality representations when data packets are lost or arrive too late [97]. ...

A Score-aware Generative Approach for Music Signals Inpainting
  • Citing Conference Paper
  • October 2023