Nicola Pia’s research while affiliated with Fraunhofer Institute for Integrated Circuits and other places


Publications (25)


Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron
  • Preprint

January 2025


In recent years, several text-to-speech systems have been proposed to synthesize natural speech in zero-shot, few-shot, and low-resource scenarios. However, these methods typically require training with data from many different speakers. The speech quality across the speaker set is typically uneven and imposes an upper limit on the quality achievable for the low-resource speaker. In the current work, we achieve high-quality speech synthesis using as little as five minutes of speech from the desired speaker by augmenting the low-resource speaker data with noise and employing multiple sampling techniques during training. Our method requires only four high-quality, high-resource speakers, which are easy to obtain and use in practice. Our low-complexity method achieves improved speaker similarity compared to the state-of-the-art zero-shot method HierSpeech++ and the recent low-resource method AdapterMix while maintaining comparable naturalness. Our proposed approach can also reduce the data requirements for speech synthesis for new speakers and languages.
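The "multiple sampling techniques" mentioned above suggest oversampling the low-resource speaker relative to the high-resource donors. A minimal sketch of weighted per-speaker batch sampling, assuming illustrative speaker names and weights (not the paper's actual implementation):

```python
import random

def sample_batch(speaker_data, weights, batch_size, seed=0):
    """Draw a training batch with fixed per-speaker probabilities.

    Oversampling lets a five-minute low-resource speaker appear as
    often as the high-resource speakers despite having far less data.
    """
    rng = random.Random(seed)
    speakers = list(speaker_data)
    return [
        (spk, rng.choice(speaker_data[spk]))
        for spk in rng.choices(speakers, weights=weights, k=batch_size)
    ]

# Illustrative setup: one low-resource target plus four donor speakers.
data = {"target": ["t0", "t1"], **{f"hr{i}": [f"u{i}"] for i in range(4)}}
batch = sample_batch(data, weights=[0.5, 0.125, 0.125, 0.125, 0.125],
                     batch_size=16)
```

Here the target speaker receives half the sampling mass even though it holds a small fraction of the utterances.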


Neural Speech Coding for Real-Time Communications Using Constant Bitrate Scalar Quantization

December 2024


IEEE Journal of Selected Topics in Signal Processing

Neural audio coding has emerged as a vibrant research direction by promising good audio quality at very low bitrates unachievable by classical coding techniques. Here, end-to-end trainable autoencoder-like models represent the state of the art, where a discrete representation in the bottleneck of the autoencoder is learned. This allows for efficient transmission of the input audio signal. The learned discrete representation of neural codecs is typically generated by applying a quantizer to the output of the neural encoder. In almost all state-of-the-art neural audio coding approaches, this quantizer is realized as a Vector Quantizer (VQ), and much effort has been spent to alleviate the drawbacks of this quantization technique when used together with a neural audio coder. In this paper, we propose and analyze simple alternatives to VQ, which are based on projected Scalar Quantization (SQ). These quantization techniques do not need any additional losses, scheduling parameters, or codebook storage, thereby simplifying the training of neural audio codecs. For real-time speech communication applications, these neural codecs are required to operate at low complexity, low latency, and low bitrates. We address those challenges by proposing a new causal network architecture that is based on SQ and a Short-Time Fourier Transform (STFT) representation. The proposed method performs particularly well in the very low complexity and low bitrate regime.
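The SQ idea can be sketched in a few lines: a minimal uniform scalar quantizer over a bounded latent, assuming each dimension is independently quantized to a fixed number of levels. The paper's projected SQ additionally learns the projection; the function below is a simplified stand-in, not the authors' implementation:

```python
import numpy as np

def scalar_quantize(z, levels=16):
    """Uniform scalar quantization of a bounded latent.

    z is assumed to lie in [-1, 1] (e.g. after a tanh projection);
    each dimension is quantized independently to `levels` values,
    so the bitrate is dim * log2(levels) bits per frame. No codebook
    is stored: indices map back to values by a closed-form rule.
    """
    z = np.clip(z, -1.0, 1.0)
    step = 2.0 / (levels - 1)
    indices = np.round((z + 1.0) / step).astype(int)  # 0 .. levels-1
    z_hat = indices * step - 1.0                      # dequantized value
    return indices, z_hat
```

Unlike VQ, nothing here needs commitment losses or codebook updates, which is the simplification the abstract highlights.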


FlowMAC: Conditional Flow Matching for Audio Coding at Low Bit Rates

September 2024


This paper introduces FlowMAC, a novel neural audio codec for high-quality general audio compression at low bit rates based on conditional flow matching (CFM). FlowMAC jointly learns a mel spectrogram encoder, quantizer, and decoder. At inference time, the decoder integrates a continuous normalizing flow via an ODE solver to generate a high-quality mel spectrogram. This is the first time a CFM-based approach has been applied to general audio coding, enabling scalable, simple, and memory-efficient training. Our subjective evaluations show that FlowMAC at 3 kbps achieves similar quality to state-of-the-art GAN-based and DDPM-based neural audio codecs at double the bit rate. Moreover, FlowMAC offers a tunable inference pipeline, which permits trading off complexity against quality. This enables real-time coding on CPU while maintaining high perceptual quality.
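The "tunable inference pipeline" corresponds to the step count of the ODE solver that integrates the learned flow. A minimal sketch with a forward-Euler integrator and a toy conditional vector field (FlowMAC's actual field is a trained network; `decode_with_ode` and `toy_field` are illustrative names):

```python
import numpy as np

def decode_with_ode(z, vector_field, n_steps=10):
    """Euler-integrate a vector field from t=0 to t=1.

    In a CFM decoder the field is conditioned on the quantized latent
    z; `n_steps` is the complexity/quality knob: fewer steps decode
    faster but follow the flow less accurately.
    """
    x = np.zeros_like(z)            # fixed starting point of the flow
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        x = x + dt * vector_field(x, t, z)   # forward Euler step
    return x

# Toy conditional field whose exact flow reaches z at t=1.
toy_field = lambda x, t, z: (z - x) / max(1.0 - t, 1e-6)
```

With this particular toy field, Euler integration lands exactly on the conditioning vector at t=1; a trained field would only approximate its target, with accuracy improving as `n_steps` grows.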



Figure 2: Listening test results for loss traces in Profile-1, with 95% confidence intervals. CC and EPC stand for Clean Channel and Error-Prone Channel, respectively.
Figure 3: Listening test results for loss traces in Profile-2, with 95% confidence intervals. CC and EPC stand for Clean Channel and Error-Prone Channel, respectively.
On Improving Error Resilience of Neural End-to-End Speech Coders

June 2024


Error resilience tools like Packet Loss Concealment (PLC) and Forward Error Correction (FEC) are essential to maintain reliable speech communication for applications like Voice over Internet Protocol (VoIP), where packets are frequently delayed and lost. End-to-end neural speech codecs have recently seen a significant rise due to their ability to transmit speech signals at low bitrates, but little consideration has been given to their error resilience in a real system. The recently introduced Neural End-to-End Speech Codec (NESC) can reproduce high-quality natural speech at low bitrates. We extend its robustness to packet losses by adding a low-complexity network to predict the codebook indices in latent space. Furthermore, we propose a method to add an in-band FEC at an additional bitrate of 0.8 kbps. Both subjective and objective assessments indicate the effectiveness of the proposed methods and demonstrate that coupling PLC and FEC provides significant robustness against packet losses.
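The in-band FEC idea, piggybacking a redundant (typically coarser) copy of the previous frame on each packet, can be sketched as follows. Payloads are plain strings here and the function names are illustrative, not NESC's API:

```python
def pack_with_fec(frames_main, frames_redundant):
    """Pair each frame's main payload with a redundant copy of the
    *previous* frame, so one packet carries data for two frames."""
    packets = []
    for i, main in enumerate(frames_main):
        red = frames_redundant[i - 1] if i > 0 else None
        packets.append((main, red))
    return packets

def receive(packets, lost):
    """Decode a packet stream: prefer the main payload; if a packet is
    lost, recover its frame from the next packet's FEC copy; if that
    is also lost, fall back to concealment (PLC)."""
    out = []
    for i, (main, red) in enumerate(packets):
        if i not in lost:
            out.append(main)
        elif i + 1 < len(packets) and (i + 1) not in lost:
            out.append(packets[i + 1][1])   # FEC copy of frame i
        else:
            out.append("PLC")               # conceal: nothing arrived
    return out
```

This shows why coupling FEC and PLC helps: FEC recovers isolated losses exactly (at a coarser quality), while PLC covers burst losses that defeat the one-packet redundancy.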




Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests

August 2023


Modern text-to-speech (TTS) models are typically subjectively evaluated using an Absolute Category Rating (ACR) method. This method uses the mean opinion score to rate each model under test. However, if the models are perceptually too similar, assigning absolute ratings to stimuli might be difficult and prone to subjective preference errors. Pairwise comparison tests offer relative comparison and capture some of the subtle differences between the stimuli better. However, pairwise comparisons take more time as the number of tests increases exponentially with the number of models. Alternatively, a ranking-by-elimination (RBE) test can assess multiple models with similar benefits as pairwise comparisons for subtle differences across models without the time penalty. We compared the ACR and RBE tests for TTS evaluation in a controlled experiment. We found that the obtained results were statistically similar even in the presence of perceptually close TTS models.
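For context on the cost argument: exhaustive pairwise comparison needs one test per pair of models, i.e. n(n-1)/2 tests for n models, which grows quadratically. A one-liner makes the numbers concrete:

```python
from math import comb

def n_pairwise_tests(n_models):
    """Comparisons needed to test every pair of models once."""
    return comb(n_models, 2)

# 4 models need 6 comparisons; 10 models already need 45.
```

An RBE test, by contrast, ranks all n models in a single listening session per stimulus, which is the time saving the abstract refers to.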


Fig. 1: Block diagram of Tacotron-2 with the proposed noise augmentation embedding. Here, Bi-LSTM is Bidirectional Long Short-Term Memory, CNN is convolutional neural network (NN), RNN is recurrent NN, and FCN is Fully Connected Network.
Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation

June 2023


Many neural text-to-speech architectures can synthesize nearly natural speech from text inputs. These architectures must be trained with tens of hours of annotated and high-quality speech data. Compiling such large databases for every new voice requires a lot of time and effort. In this paper, we describe a method to extend the popular Tacotron-2 architecture and its training with data augmentation to enable single-speaker synthesis using a limited amount of specific training data. In contrast to elaborate augmentation methods proposed in the literature, we use simple stationary noises for data augmentation. Our extension is easy to implement and adds almost no computational overhead during training and inference. Using only two hours of training data, our approach was rated by human listeners to be on par with the baseline Tacotron-2 trained with 23.5 hours of LJSpeech data. In addition, we tested our model with a semantically unpredictable sentences test, which showed that both models exhibit similar intelligibility levels.
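The "simple stationary noises" used for augmentation amount to mixing noise into the training waveforms at a controlled signal-to-noise ratio. A minimal stand-in with white noise, assuming the function name and interface (the paper additionally conditions the model on an embedding identifying the noise condition):

```python
import numpy as np

def add_stationary_noise(speech, snr_db, seed=0):
    """Mix stationary white noise into a waveform at a target SNR (dB).

    The noise is scaled so that the speech-to-noise power ratio equals
    10**(snr_db / 10), giving a reproducible augmentation level.
    """
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(len(speech))
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

Because the noise is stationary, this adds almost no computational overhead per utterance, matching the abstract's claim of a cheap augmentation.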



Citations (10)


... Recently, Lajszczak et al. [16] proposed an elaborate data reordering strategy for data augmentation. Dai et al. [17] and our previous work [18] used noise augmentation for low-resource TTS. An issue with many of these data augmentation methods is that they are unsuitable in a real low-resource scenario because they require at least one hour or more of the target speaker data. ...

Reference:

Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron
Low-Resource Text-to-Speech Using Specific Data and Noise Augmentation
  • Citing Conference Paper
  • September 2023

... In contrast, generative approaches promise superior performance in these scenarios by learning the distribution of clean speech signals, thereby allowing them to generate the desired speech signal by conditioning on noisy inputs. However, most SOTA generative methods are primarily designed for moderate SNR conditions and can be broadly categorized into two main types: GAN-based methods [8]–[11] and, more recently, diffusion-based techniques [12]–[14]. GAN-based methods dominate practical applications, as they do not impose significant constraints on the design of the model architecture and can be deployed in a manner similar to discriminative methods during inference time. ...

SEFGAN: Harvesting the Power of Normalizing Flows and GANs for Efficient High-Quality Speech Enhancement
  • Citing Conference Paper
  • October 2023

... We use Mel Cepstral Distortion with Dynamic Time Warping (MCD-DTW) [28] as the objective measure to evaluate the synthesis (HierSpeech++: https://github.com/sh-lee-prml/HierSpeechpp; AdapterMix: https://github.com/declare-lab/adapter-mix). ...

Subjective Evaluation of Text-to-Speech Models: Comparing Absolute Category Rating and Ranking by Elimination Tests

... Such systems require further processing, like crossfading or overlap-add, to ensure a seamless transition from decoded to concealed frames and vice versa. With the advent of end-to-end self-supervised neural speech codecs like [16,17,18,19], there is a need for more integrated error resilience tools for concealment. The common architecture of end-to-end codecs includes an encoder, a decoder, and a Vector Quantizer (VQ) consisting of multiple residual stages to calculate a quantized representation of the encoder output, i.e., the latent. ...
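The multi-stage residual VQ described in this excerpt can be sketched compactly: each stage quantizes the residual left by the previous stages, and the decoder sums the selected codewords. The codebooks below are toy arrays, not trained ones:

```python
import numpy as np

def rvq_encode(latent, codebooks):
    """Residual vector quantization of a single latent vector.

    Each codebook (shape (K, D)) quantizes what the earlier stages
    left over; the transmitted payload is one index per stage, and
    the reconstruction is the sum of the chosen codewords.
    """
    residual = latent.astype(float)
    indices = []
    approx = np.zeros_like(residual)
    for cb in codebooks:
        k = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))
        indices.append(k)
        approx += cb[k]
        residual -= cb[k]
    return indices, approx
```

Dropping later stages degrades the reconstruction gracefully, which is why residual VQ pairs naturally with variable-bitrate transmission and with predicting lost indices in the latent space.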

NESC: Robust Neural End-2-End Speech Coding with GANs
  • Citing Conference Paper
  • September 2022

... Recently, deep neural networks (DNNs) have been increasingly preferred by researchers due to their good modeling capabilities and automatic optimization under big data. One type of approach is to add deep neural network-based post-processing modules [14,38] at the end of existing conventional audio codecs to enhance the coding quality. This approach can enhance the performance of existing audio codecs at minimal cost, but still relies on manually designed signal processing pipelines. ...

PostGAN: A GAN-Based Post-Processor to Enhance the Quality of Coded Speech
  • Citing Conference Paper
  • May 2022

... Conversely, this type of codecs often compromises on the decoded audio quality, such as traditional linear predictive coding (LPC) [5]. Subsequently, researchers have tried to integrate traditional parametric codecs with neural vocoders [6,7,8,9]. These neural vocoders are employed to convert discrete tokens discretized by traditional parametric codecs into audio waveforms. ...

A Streamwise Gan Vocoder for Wideband Speech Coding at Very Low Bit Rate
  • Citing Conference Paper
  • October 2021

... The first network, called the acoustic model, converts text to mel-spectrogram sequences, and the second network, referred to as the neural vocoder, converts mel-spectrogram sequences to speech waveforms. In this work, we modify the acoustic model to enable low-resource TTS and use a pre-trained StyleMelGAN [20] vocoder. ...

StyleMelGAN: An Efficient High-Fidelity Adversarial Vocoder with Temporal Adaptive Normalization
  • Citing Conference Paper
  • June 2021

... Two 1-forms α and β are said to be Engel defining forms for a given Engel structure D if D = ker α ∩ ker β and E = [D, D] = ker α. A pair of defining forms determines a complementary distribution R, called the Reeb distribution (see [17]). The conformal class of α is uniquely determined by D, whereas in general the conformal class of β is not. ...

Riemannian Properties of Engel Structures
  • Citing Article
  • August 2020

International Mathematics Research Notices