Yannis Stylianou’s research while affiliated with University of Crete and other places

Publications (264)


Memory Efficient Neural Speech Synthesis Based on FastSpeech2 Using Attention Free Transformer
  • Conference Paper

August 2024

Eirini Sisamaki

·

Vassilis Tsiaras

·

Yannis Stylianou


Figures from the article (captions only; images not reproduced):

Figure 2: Interpretation of cumulant GAN as a weighted variant of SGDA for β, γ > 0. Both real and generated samples for which the discriminator output is close to the decision boundary are assigned larger weights, because these are the samples most likely to confuse the discriminator.
Figure 7: The evolution of the weights $w_i^{\beta} = e^{-\beta D(x_i)} / \sum_{j=1}^{m} e^{-\beta D(x_j)}$ (green triangles) and $w_i^{\gamma} = e^{\gamma D(G(z_i))} / \sum_{j=1}^{m} e^{\gamma D(G(z_j))}$ (red circles), $i = 1, \dots, m$, as training progresses. Here β = γ = 0.5, and one discriminator update is followed by one generator update (i.e., k = 1 in Algorithm 1 of the main text).
Figure 10: Generated samples using the Wasserstein distance with weight clipping (1st row), KL divergence (2nd row), reverse KLD (3rd row), and Hellinger distance (last row). Two hidden layers with 64 units per layer are used. The results are similar to Fig. 3 of the main text.
Figure 11: Same as Fig. 10 but with 16 units per hidden layer. The low capacity of the neural nets results in convergence instabilities; nevertheless, training with the Hellinger distance (last row) produced an accurate and stable solution.
Figure 12: Same as Fig. 10 but with 3 hidden layers and 32 units per layer. Here the capacity of the networks, especially the discriminator, is rather high, resulting in vanishing gradients. Again, the Hellinger distance (last row) produced the most accurate result for the given number of iterations.

Cumulant GAN
  • Article
  • Full-text available

April 2022

·

123 Reads

·

10 Citations

IEEE Transactions on Neural Networks and Learning Systems

In this article, we propose a novel loss function for training generative adversarial networks (GANs), aiming toward a deeper theoretical understanding as well as improved stability and performance for the underlying optimization problem. The new loss function is based on cumulant generating functions (CGFs), giving rise to Cumulant GAN. Relying on a recently derived variational formula, we show that the corresponding optimization problem is equivalent to Rényi divergence minimization, thus offering a (partially) unified perspective on GAN losses: the Rényi family encompasses Kullback-Leibler divergence (KLD), reverse KLD, Hellinger distance, and χ²-divergence. Wasserstein GAN is also a member of the cumulant GAN family. In terms of stability, we rigorously prove the linear convergence of cumulant GAN to the Nash equilibrium for a linear discriminator, Gaussian distributions, and the standard gradient descent ascent algorithm. Finally, we experimentally demonstrate that image generation is more robust than with Wasserstein GAN and substantially improved in terms of both inception score (IS) and Fréchet inception distance (FID) when both weaker and stronger discriminators are considered.
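As a reading aid, here is a minimal PyTorch sketch of a CGF-based GAN objective. The paper's exact loss is not reproduced here; this form is inferred from the sample weights in the Figure 7 caption and from the Wasserstein limit mentioned in the abstract, so treat the signs and scaling as assumptions.

```python
import math
import torch

def log_mean_exp(t: torch.Tensor) -> torch.Tensor:
    # Numerically stable log(mean(exp(t))) over the batch dimension.
    return torch.logsumexp(t, dim=0) - math.log(t.shape[0])

def cumulant_objective(d_real: torch.Tensor, d_fake: torch.Tensor,
                       beta: float = 0.5, gamma: float = 0.5) -> torch.Tensor:
    # Assumed objective (not verbatim from the paper):
    #   F(D, G) = -(1/beta) * log E_P[exp(-beta * D(x))]
    #             -(1/gamma) * log E_Q[exp(gamma * D(G(z)))]
    # The discriminator ascends F and the generator descends it.
    # Differentiating each log-mean-exp term weights the samples by the
    # softmax weights w_i^beta and w_i^gamma from the Figure 7 caption,
    # and as beta, gamma -> 0 the objective reduces to the Wasserstein-style
    # difference E[D(x)] - E[D(G(z))].
    return -log_mean_exp(-beta * d_real) / beta - log_mean_exp(gamma * d_fake) / gamma
```

Under this reading, training alternates one discriminator ascent step with one generator descent step (k = 1, as in the Figure 7 caption).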


Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise

March 2022

·

33 Reads

We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model on normalized spectral tilt among other prosodic factors. Changing the spectral tilt while keeping the other factors fixed enables effective vocal effort control at synthesis time, independent of the other prosodic factors. By extrapolating the spectral tilt values beyond the range seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort under various masking noise conditions, and compare these to well-known speech intelligibility-enhancing algorithms. The evaluations show that the proposed method can improve the intelligibility of synthetic speech with little loss in speech quality.
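The abstract does not specify how spectral tilt is computed; the sketch below shows one common estimator (the slope of a least-squares line fit to the log-magnitude spectrum), with the window and normalization being illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def spectral_tilt(frame: np.ndarray, sr: int = 16000) -> float:
    # Slope (dB/Hz) of a least-squares line fit to the log-magnitude
    # spectrum of one windowed frame; a more negative slope means a
    # steeper tilt, i.e., lower vocal effort. A hypothetical estimator,
    # not the paper's exact measurement.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    log_mag = 20.0 * np.log10(spectrum + 1e-10)
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    slope, _intercept = np.polyfit(freqs, log_mag, deg=1)
    return slope

# Conditioning as described in the abstract: normalize the measured tilt
# over the corpus, then extrapolate beyond the observed range at synthesis
# time to obtain higher vocal effort.
# tilt_norm = (tilt - corpus_mean) / corpus_std
```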


End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement

November 2021

·

16 Reads

·

9 Citations

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Intelligibility of speech can be significantly reduced when it is presented in adverse near-end listening conditions, such as background noise. Multiple approaches have been suggested to improve the perception of speech in such conditions. However, most of these approaches were designed to work with clean input speech, so they have serious limitations when deployed in real-world applications like telephony and hearing aids, where noisy input speech is quite common. In this paper, we present an end-to-end neural network approach to this problem, which effectively reduces the input noise and improves the intelligibility for listeners in adverse conditions. To that end, a convolutional neural network topology with variable dilation factors is proposed and evaluated in both a causal and a non-causal configuration using raw speech as input. A teacher-student training strategy is employed, where the teacher is a well-established speech-in-noise intelligibility enhancer based on spectral shaping followed by dynamic range compression (SSDRC). The evaluation is performed both objectively, using the speech intelligibility in bits (SIIB) metric, and subjectively, on the Greek Harvard corpus. A noise-robust multi-band version of SSDRC was used as a baseline. Compared with the baseline, at 0 dB input SNR, the suggested neural network system achieved about 380% and 230% relative SIIB improvements in fluctuating and stationary backgrounds, respectively. Subjectively, the suggested model increased listeners' keyword correct rate in stationary noise from 25% to 60% at 0 dB input SNR, and from about 52% to 75% at 5 dB input SNR, compared with the baseline.
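As an illustration of the described topology, here is a sketch of a causal dilated Conv1d stack over raw waveform in PyTorch; the depth, channel counts, and activations are assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DilatedEnhancer(nn.Module):
    # Causal variant: each convolution is left-padded so no future samples
    # are used, the constraint relevant to real-time telephony and
    # hearing-aid deployment mentioned in the abstract.
    def __init__(self, channels=64, kernel_size=3, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        layers, in_ch = [], 1
        for d in dilations:
            layers += [nn.ConstantPad1d(((kernel_size - 1) * d, 0), 0.0),
                       nn.Conv1d(in_ch, channels, kernel_size, dilation=d),
                       nn.PReLU()]
            in_ch = channels
        layers += [nn.Conv1d(channels, 1, 1)]  # back to one waveform channel
        self.net = nn.Sequential(*layers)

    def forward(self, noisy):  # noisy: (batch, 1, samples)
        return self.net(noisy)

# Teacher-student training as described: the student sees the noisy input,
# while the target is the teacher's (SSDRC) output computed on clean speech.
# loss = torch.nn.functional.l1_loss(student(noisy), teacher_target)
```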



Assessing Speaker Interpolation in Neural Text-to-Speech

September 2021

·

14 Reads

Lecture Notes in Computer Science

This paper presents a study on voice interpolation in the framework of neural text-to-speech. Two main approaches are considered. The first adds three independent speaker embeddings at three different positions within the model. The second replaces the embedding vectors with convolutional layers whose kernels are computed on the fly from reference spectrograms. Interpolation between speakers is done by linear interpolation between the speaker embeddings in the first case, and between the convolution kernels in the second. Finally, we propose a new method for evaluating interpolation smoothness using the agreement between interpolation weights and objective and subjective speaker similarities. The results indicate that both methods are able to produce smooth interpolation to some extent, with the one based on learned speaker embeddings yielding better results.
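The first approach reduces to straightforward linear mixing of learned embeddings; a minimal sketch (names are illustrative):

```python
import torch

def interpolate_speakers(emb_a: torch.Tensor, emb_b: torch.Tensor,
                         alpha: float) -> torch.Tensor:
    # Linear interpolation between two speaker embeddings:
    # alpha = 0 gives speaker A, alpha = 1 gives speaker B.
    # The second approach interpolates convolution kernels the same way.
    return (1.0 - alpha) * emb_a + alpha * emb_b

# Probing smoothness: sweep alpha and check that similarity to speaker A
# decays monotonically as alpha moves from 0 to 1.
# for alpha in torch.linspace(0.0, 1.0, 11): ...
```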




Combining speakers of multiple languages to improve quality of neural voices

August 2021

·

13 Reads

In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system, with the goals of (a) improving quality when the available data in the target language is limited and (b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most cases while using less than 40% of the speaker's data used to build the single-speaker model. In cross-lingual synthesis, the generated quality is, on average, within 80% of that of native single-speaker models in terms of Mean Opinion Score.
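The abstract does not detail the conditioning mechanism; the sketch below shows one generic pattern for a multi-speaker, multi-lingual model (learned speaker and language embeddings concatenated to every encoder frame). All module names and dimensions are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class MultiCondition(nn.Module):
    # Hypothetical conditioning block: speaker and language embeddings are
    # broadcast over time and fused with the text-encoder output.
    def __init__(self, n_speakers=30, n_langs=8, enc_dim=256,
                 spk_dim=64, lang_dim=16):
        super().__init__()
        self.spk = nn.Embedding(n_speakers, spk_dim)
        self.lang = nn.Embedding(n_langs, lang_dim)
        self.proj = nn.Linear(enc_dim + spk_dim + lang_dim, enc_dim)

    def forward(self, enc_out, spk_id, lang_id):  # enc_out: (B, T, enc_dim)
        B, T, _ = enc_out.shape
        cond = torch.cat([self.spk(spk_id), self.lang(lang_id)], dim=-1)
        cond = cond.unsqueeze(1).expand(B, T, -1)
        return self.proj(torch.cat([enc_out, cond], dim=-1))

# Cross-lingual synthesis pairs a speaker id with a language id the speaker
# was never recorded in; fine-tuning then adapts the model to that speaker.
```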


Citations (71)


... The Lombard effect [7] describes the involuntary voice raising of humans in noisy environments. Several methods have been proposed for mimicking the Lombard effect, including using signal-processing-inspired manipulations [8,9] and neural TTS [10,11,12]. However, previous research [12] confirmed the degradation in the naturalness of manipulated speech by signal processing. ...

Reference:

SaSLaW: Dialogue Speech Corpus with Audio-visual Egocentric Information Toward Environment-adaptive Dialogue Speech Synthesis
Vocal effort modeling in neural TTS for improving the intelligibility of synthetic speech in noise
  • Citing Conference Paper
  • September 2022

... In this study, the GrHarvard speech corpus was employed (Sfakianaki, 2021; Shifas et al., 2022). GrHarvard is a phonemically balanced sentence corpus in Modern Greek which is based on the English Harvard/Institute of Electrical and Electronics Engineers (IEEE) sentences. ...

End-to-End Neural Based Modification of Noisy Speech for Speech-in-Noise Intelligibility Improvement
  • Citing Article
  • November 2021

IEEE/ACM Transactions on Audio, Speech, and Language Processing

... Through this contribution, we explore the potential of the FastSpeech2 architecture to generate audiovisual synthetic speech as a unified process. We evaluate FastLips in comparison with a baseline unified audiovisual Tacotron2 model [10]. Through an ablation study, we highlight the components that contribute most to the observed benefits of the proposed FastLips model. ...

Audiovisual Speech Synthesis using Tacotron2
  • Citing Conference Paper
  • October 2021

... A Mutual-Information Neural Estimator (MINE) is then used to maximize the lower bound on mutual information through the Donsker-Varadhan representation (Hu et al. 2020). Similarly, in (Paul et al. 2021), Rényi divergence is used to create more nuanced speaker representations by establishing a distance between timbre and content information. ...

A Universal Multi-Speaker Multi-Style Text-to-Speech via Disentangled Representation Learning Based on Rényi Divergence Minimization

... In this paper, we also investigate the effectiveness of dynamic range compression (DRC) [16,17], which is used to estimate the envelope of an input signal from its amplitude. Both dynamic and static compression are then applied, based on the estimated envelope. ...

On spectral and time domain energy reallocation for speech-in-noise intelligibility enhancement
  • Citing Conference Paper
  • September 2014
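Since the snippet above describes envelope-based dynamic range compression, here is a toy sketch of the idea (an attack/release envelope follower plus a static compression curve); the constants are illustrative and are not those of SSDRC or the cited paper.

```python
import numpy as np

def drc(x: np.ndarray, sr: int = 16000, attack_ms: float = 5.0,
        release_ms: float = 50.0, ratio: float = 3.0,
        thresh_db: float = -20.0) -> np.ndarray:
    # Track the amplitude envelope with attack/release smoothing (the
    # "dynamic" part), then apply a static gain curve that attenuates
    # the signal above a threshold.
    a_att = np.exp(-1.0 / (sr * attack_ms / 1000.0))
    a_rel = np.exp(-1.0 / (sr * release_ms / 1000.0))
    env = np.zeros_like(x, dtype=float)
    level = 0.0
    for i, s in enumerate(np.abs(x)):  # envelope follower
        a = a_att if s > level else a_rel
        level = a * level + (1.0 - a) * s
        env[i] = level
    env_db = 20.0 * np.log10(env + 1e-10)
    over_db = np.maximum(env_db - thresh_db, 0.0)   # dB above threshold
    gain_db = -over_db * (1.0 - 1.0 / ratio)        # static curve
    return x * 10.0 ** (gain_db / 20.0)
```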

... The phase computation problems are overcome by various methods in a few applications (Bozkurt et al., 2007; Hedelin, 1988; Quatieri, 1979; Sun et al., 1997). Attempts have been made to study and exploit the speech features present in the phase component of the speech spectrum, by deriving phase spectrograms using models of the production system that can provide the phase information (Gerkmann et al., 2015; Mahale et al., 2016; Vijayan et al., 2014a), by unwrapping the phase in specific cases (Drugman and Stylianou, 2015), or by focusing on aspects of phase that are not affected by phase wrapping (Nayak et al., 2017; Vijayan et al., 2014b). A few of these phase processing studies are briefly reviewed in this section. ...

Fast and accurate phase unwrapping
  • Citing Conference Paper
  • September 2015

... In (2), the goal is the removal of the linear phase component of the phase response. In many situations, it is straightforward to obtain pitch marks, i.e., indications of pitch periods, rather than the exact moments where phase information can theoretically be retrieved, such as GCIs [10,11,12] or related instants where the phase can be estimated [13]. Therefore, by assuming initial pitch marks, a simplified version of MSE-CCEP can be used solely for phase estimation. ...

A maximum likelihood approach to the detection of moments of maximum excitation and its application to high-quality speech parameterization
  • Citing Conference Paper
  • September 2015

... These speaking style modifications are typically associated with changes in signal intensity, loudness (which takes spectral balance into account), and pitch, with the aim of maintaining speech intelligibility in different noise conditions or at different spatial distances between the speaker and the listener. Different levels of vocal effort can be seen as forming a continuum consisting of whispered, normal, loud, Lombard, and shouted speech (see also work on clear speech [16], [17]), even though there are several subtle articulatory and acoustic differences between these styles that do not follow simple linear relationships. Since speech recording and reproduction environments (or the original and new target listeners) may differ from each other, and since the different styles on the above-mentioned continuum are directly related to the success and suitability of spoken communication, it would be beneficial to be able to tailor the speech signal along this continuum through the use of SSC technology. ...

Intelligibility enhancement of casual speech for reverberant environments inspired by clear speech properties
  • Citing Conference Paper
  • September 2015

... Multi-Task Learning (MTL) (Liu et al., 2017) is a popular approach for enhancing the performance of a main task by leveraging related subtasks. Motivated by the applications of multi-task learning in areas such as computer vision (Wang et al., 2009; Zhang & Yeung, 2010), speech synthesis (Hu et al., 2015; Wu et al., 2015), and natural language processing (Luong et al., 2016; Zhao et al., 2015), a plethora of current rumor verification studies have used the MTL approach. The first line of work in this direction was proposed by Kochkina et al. (2018) and Ma et al. (2018), who show the utility of auxiliary task labels in veracity classification tasks. ...

Fusion of multiple parameterisations for DNN-based sinusoidal speech synthesis with multi-task learning
  • Citing Conference Paper
  • September 2015

... • Code-switching or multilingual datasets: The practice of alternating between two or more languages within a conversation or text is referred to as code-switching (Bhogale et al., 2023; Chadha et al., 2022; Mowlaee et al., 2014) and (Raval et al., 2021b). Creating resources that capture the code-switching phenomenon in multilingual datasets for Automatic Speech Recognition (ASR) in the Gujarati language. ...

Phase importance in speech processing applications
  • Citing Conference Paper
  • September 2014