Publications (264)
September 2022 · 13 Reads · 5 Citations
April 2022 · 123 Reads · 10 Citations
IEEE Transactions on Neural Networks and Learning Systems
In this article, we propose a novel loss function for training generative adversarial networks (GANs), aiming toward a deeper theoretical understanding as well as improved stability and performance for the underlying optimization problem. The new loss function is based on cumulant generating functions (CGFs), giving rise to the cumulant GAN. Relying on a recently derived variational formula, we show that the corresponding optimization problem is equivalent to Rényi divergence minimization, thus offering a (partially) unified perspective of GAN losses: the Rényi family encompasses Kullback-Leibler divergence (KLD), reverse KLD, Hellinger distance, and χ²-divergence. Wasserstein GAN is also a member of the cumulant GAN family. In terms of stability, we rigorously prove linear convergence of cumulant GAN to the Nash equilibrium for a linear discriminator, Gaussian distributions, and the standard gradient descent ascent algorithm. Finally, we experimentally demonstrate that image generation with cumulant GAN is more robust than with Wasserstein GAN, and that it is substantially improved in terms of both inception score (IS) and Fréchet inception distance (FID) when both weaker and stronger discriminators are considered.
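For orientation, the block below sketches the objects the abstract refers to, using standard textbook definitions; the notation (Λ, R_α, the role of the discriminator D) is ours, and the paper's exact parameterization of the cumulant-based loss may differ.

```latex
% Standard definitions only; the paper's exact loss parameterization may differ.
% Cumulant generating function (CGF) of the discriminator D under distribution P:
\[
  \Lambda_{P}(\gamma; D) = \log \mathbb{E}_{x \sim P}\!\left[ e^{\gamma D(x)} \right].
\]
% Renyi divergence of order alpha between P and Q:
\[
  \mathcal{R}_{\alpha}(P \,\|\, Q)
  = \frac{1}{\alpha - 1}
    \log \mathbb{E}_{x \sim Q}\!\left[ \left( \tfrac{dP}{dQ}(x) \right)^{\alpha} \right],
  \qquad \alpha \in (0, 1) \cup (1, \infty).
\]
% Members of the family mentioned in the abstract:
%   alpha -> 1   recovers KL(P || Q);
%   alpha = 1/2  gives -2 log(1 - H^2(P, Q)), a function of the squared
%                Hellinger distance H^2(P, Q) = 1 - E_Q[sqrt(dP/dQ)];
%   alpha = 2    gives log(1 + chi^2(P || Q));
% reverse KLD corresponds to exchanging the roles of P and Q.
```

With these definitions, minimizing a CGF-based objective over the generator parameters can be read as Rényi divergence minimization, which is the equivalence the abstract invokes.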
March 2022 · 33 Reads
We present a neural text-to-speech (TTS) method that models natural vocal effort variation to improve the intelligibility of synthetic speech in the presence of noise. The method consists of first measuring the spectral tilt of unlabeled conventional speech data, and then conditioning a neural TTS model on normalized spectral tilt among other prosodic factors. Changing the spectral tilt while keeping the other prosodic factors fixed enables effective vocal effort control at synthesis time, independent of those factors. By extrapolating the spectral tilt values beyond the range seen in the original data, we can generate speech with high vocal effort levels, thus improving the intelligibility of speech in the presence of masking noise. We evaluate the intelligibility and quality of normal speech and speech with increased vocal effort in various masking noise conditions, and compare these to well-known speech intelligibility-enhancing algorithms. The evaluations show that the proposed method can improve the intelligibility of synthetic speech with little loss in speech quality.
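The abstract does not spell out the tilt estimator here, so the numpy sketch below shows one common way to measure spectral tilt (a least-squares line fit to the log-magnitude spectrum) and to normalize it as a conditioning feature; the function names, frequency band, and normalization are our own illustrative choices, not necessarily the paper's.

```python
import numpy as np

def spectral_tilt(frame, sr, fmin=100.0, fmax=5000.0):
    """Estimate spectral tilt (dB/octave) of one speech frame by a
    least-squares line fit to the log-magnitude spectrum.
    A minimal sketch; the paper's actual estimator may differ."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    band = (freqs >= fmin) & (freqs <= fmax)
    log_f = np.log2(freqs[band])                    # frequency in octaves
    log_mag = 20.0 * np.log10(spec[band] + 1e-10)   # magnitude in dB
    slope, _ = np.polyfit(log_f, log_mag, 1)        # dB per octave
    return slope

def normalize_tilt(tilts):
    """Z-score normalization over the corpus, yielding the conditioning value."""
    tilts = np.asarray(tilts)
    return (tilts - tilts.mean()) / (tilts.std() + 1e-10)
```

At synthesis time, pushing the normalized tilt value beyond the range observed in training (toward a flatter spectrum) is what the abstract describes as extrapolating to higher vocal effort.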
November 2021 · 16 Reads · 9 Citations
IEEE/ACM Transactions on Audio Speech and Language Processing
Intelligibility of speech can be significantly reduced when it is presented in adverse near-end listening conditions, such as background noise. Multiple approaches have been suggested to improve the perception of speech in such conditions. However, most of these approaches were designed to work with clean input speech. They therefore have serious limitations when deployed in real-world applications like telephony and hearing aids, where noisy input speech is quite common. In this paper, we present an end-to-end neural network approach for the above problem, which effectively reduces the input noise and improves the intelligibility for listeners in adverse conditions. To that end, a convolutional neural network topology with variable dilation factors is proposed and evaluated in both a causal and a non-causal configuration, using raw speech as input. A Teacher-Student training strategy is employed, where the Teacher is a well-established speech-in-noise intelligibility enhancer based on spectral shaping followed by dynamic range compression (SSDRC). The evaluation is performed both objectively, using the speech intelligibility in bits (SIIB) metric, and subjectively, on the Greek Harvard corpus. A noise-robust multi-band version of SSDRC was used as a baseline. Compared with the baseline, at 0 dB input SNR, the suggested neural network system achieved about 380% and 230% relative SIIB improvements in fluctuating and stationary backgrounds, respectively. Subjectively, the suggested model increased listeners' keyword correct rate in stationary noise from 25% to 60% at 0 dB input SNR, and from about 52% to 75% at 5 dB input SNR, compared with the baseline.
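As a rough illustration of the described setup, the sketch below pairs a dilated 1-D convolutional network operating on raw waveforms with a teacher-student regression step. Layer counts, channel widths, the activation, and the L1 loss are illustrative assumptions, not the paper's exact topology or objective; the SSDRC teacher is treated as a precomputed target signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DilatedEnhancer(nn.Module):
    """Waveform-in, waveform-out enhancer built from 1-D convolutions with
    growing dilation factors. Illustrative sketch only."""

    def __init__(self, channels=64, kernel_size=3,
                 dilations=(1, 2, 4, 8, 16), causal=True):
        super().__init__()
        self.causal = causal
        self.convs = nn.ModuleList()
        in_ch = 1
        for d in dilations:
            # Causal mode pads on the left only (done in forward);
            # non-causal mode pads symmetrically here.
            pad = 0 if causal else (kernel_size - 1) * d // 2
            self.convs.append(nn.Conv1d(in_ch, channels, kernel_size,
                                        dilation=d, padding=pad))
            in_ch = channels
        self.act = nn.PReLU()
        self.out = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x):                        # x: (batch, 1, samples)
        for conv in self.convs:
            if self.causal:                      # left-pad: output depends only on past samples
                pad = (conv.kernel_size[0] - 1) * conv.dilation[0]
                x = F.pad(x, (pad, 0))
            x = self.act(conv(x))
        return self.out(x)

def teacher_student_step(student, noisy, teacher_target, optimizer):
    """One training step: regress the student's output on the signal produced
    by the (frozen) SSDRC teacher. L1 loss is an illustrative choice."""
    optimizer.zero_grad()
    loss = F.l1_loss(student(noisy), teacher_target)
    loss.backward()
    optimizer.step()
    return loss.item()
```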
October 2021 · 39 Reads · 11 Citations
September 2021 · 14 Reads
Lecture Notes in Computer Science
This paper presents a study on voice interpolation in the framework of neural text-to-speech. Two main approaches are considered. The first consists of adding three independent speaker embeddings at three different positions within the model. The second replaces the embedding vectors with convolutional layers whose kernels are computed on the fly from reference spectrograms. The interpolation between speakers is done by linear interpolation between the speaker embeddings in the first case, and between the convolution kernels in the second. Finally, we propose a new method for evaluating interpolation smoothness, based on the agreement between the interpolation weights and objective and subjective speaker similarities. The results indicate that both methods are able to produce smooth interpolation to some extent, with the one based on learned speaker embeddings yielding better results.
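The interpolation step itself is simple; the sketch below, with placeholder representations, shows the linear mixing applied either to speaker embedding vectors (first approach) or to flattened convolution kernels (second approach).

```python
import numpy as np

def interpolate_speakers(rep_a, rep_b, alpha):
    """Linear interpolation between two speaker representations.
    `rep_a`/`rep_b` may be embedding vectors or flattened convolution
    kernels; alpha=0 gives speaker A, alpha=1 gives speaker B."""
    return (1.0 - alpha) * np.asarray(rep_a) + alpha * np.asarray(rep_b)

# Placeholder representations; in practice these come from the trained model.
emb_a = np.random.randn(256)
emb_b = np.random.randn(256)

# Sweep alpha to render a (hopefully smooth) morph between the two voices.
for alpha in np.linspace(0.0, 1.0, 5):
    mixed = interpolate_speakers(emb_a, emb_b, alpha)
    # ...condition the TTS model on `mixed` and synthesize...
```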
August 2021 · 233 Reads · 9 Citations
August 2021 · 19 Reads · 10 Citations
August 2021 · 13 Reads
In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system, with the goals of a) improving quality when the available data in the target language is limited and b) enabling cross-lingual synthesis. We report results from a large experiment using 30 speakers in 8 different languages across 15 different locales. The system is trained on the same amount of data per speaker. Compared to a single-speaker model, when the suggested system is fine-tuned to a speaker, it produces significantly better quality in most cases, while using only a fraction of the data used to build the single-speaker model. In cross-lingual synthesis, the generated quality is, on average, close to that of native single-speaker models in terms of Mean Opinion Score.
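As a rough sketch of the fine-tuning recipe described above: start from the shared multi-speaker, multi-lingual checkpoint and continue training on the target speaker's data alone with a small learning rate. The stand-in model, data, loss, and hyperparameters below are all our own placeholders, not the paper's actual system.

```python
import torch
import torch.nn as nn

# Stand-in network; a real TTS model would be loaded from the shared
# multi-speaker, multi-lingual checkpoint instead.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))
# model.load_state_dict(torch.load("multispeaker_multilingual.pt"))  # hypothetical path

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR: adapt, don't retrain
for step in range(100):
    # Placeholder batch; in practice, features/targets for the target speaker.
    x, y = torch.randn(16, 80), torch.randn(16, 80)
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```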
Citations (71)
... The Lombard effect [7] describes the involuntary raising of the voice by humans in noisy environments. Several methods have been proposed for mimicking the Lombard effect, including signal-processing-inspired manipulations [8,9] and neural TTS [10,11,12]. However, previous research [12] confirmed the degradation in naturalness when speech is manipulated by signal processing. ...
- Citing Conference Paper
September 2022
... In this study, the GrHarvard speech corpus was employed (Sfakianaki, 2021; Shifas et al., 2022). GrHarvard is a phonemically balanced sentence corpus in Modern Greek, based on the English Harvard/Institute of Electrical and Electronics Engineers (IEEE) sentences. ...
- Citing Article
November 2021
IEEE/ACM Transactions on Audio Speech and Language Processing
... Through this contribution, we explore the potential of the FastSpeech2 architecture to generate audiovisual synthetic speech as a unified process. We evaluate FastLips in comparison with a baseline unified audiovisual-Tacotron2 model [10]. Through an ablation study, we highlight the components that contribute most to the evaluated benefits of the proposed FastLips model. ...
- Citing Conference Paper
October 2021
... A Mutual-Information Neural Estimator (MINE) is then used to maximize the lower bound on mutual information through the Donsker-Varadhan representation (Hu et al. 2020). Similarly, in (Paul et al. 2021), Rényi divergence is used to create more nuanced speaker representations by establishing a distance between timbre and content information. ...
- Citing Conference Paper
August 2021
... In this paper, we also investigate the effectiveness of dynamic range compression (DRC) [16,17], which is used to estimate the envelope of an input signal from its amplitude. Both dynamic and static compression are then applied, based on the estimated envelope. ...
- Citing Conference Paper
September 2014
... Problems with phase computation have been overcome by various methods in a few applications (Bozkurt et al., 2007; Hedelin, 1988; Quatieri, 1979; Sun et al., 1997). Attempts are being made to study and exploit the speech features present in the phase component of the speech spectrum: by deriving phase spectrograms using models of the production system that can provide the phase information (Gerkmann et al., 2015; Mahale et al., 2016; Vijayan et al., 2014a), by unwrapping the phase in specific cases (Drugman and Stylianou, 2015), or by focusing on aspects of phase that are not affected by phase wrapping (Nayak et al., 2017; Vijayan et al., 2014b). A few of these phase processing studies are briefly reviewed in this section. ...
- Citing Conference Paper
September 2015
... In (2), the goal is the removal of the linear phase component of the phase response. In many situations it is straightforward to obtain pitch marks, i.e., indications of pitch periods, rather than the exact instants from which phase information can theoretically be retrieved, such as GCIs [10,11,12] or related instants where phase can be estimated [13]. Therefore, by assuming initial pitch marks, a simplified version of MSE-CCEP can be used solely for phase estimation. ...
- Citing Conference Paper
September 2015
... These speaking style modifications are typically associated with changes in signal intensity, loudness (which takes spectral balance into account), and pitch, with the aim of maintaining speech intelligibility in different noise conditions or at different spatial distances between the speaker and the listener. Different levels of vocal effort can be seen as forming a continuum consisting of whispered, normal, loud, Lombard, and shouted speech (see also work on clear speech; [16], [17]), even though there are several subtle articulatory and acoustic differences between these styles that do not follow simple linear relationships. Since speech recording and reproduction environments (or the original and new target listeners) may differ from each other, and since the different styles on the above-mentioned continuum are directly related to the success and suitability of spoken communication, it would be beneficial to be able to tailor the speech signal along this continuum through the use of SSC technology. ...
- Citing Conference Paper
September 2015
... Multi-Task Learning (MTL) (Liu et al., 2017) is a popular approach for enhancing the performance of a main task by leveraging related subtasks. Motivated by the applications of multi-task learning in areas such as computer vision (Wang et al., 2009; Zhang & Yeung, 2010), speech synthesis (Hu et al., 2015; Wu et al., 2015), and natural language processing (Luong et al., 2016; Zhao et al., 2015), a plethora of current rumor verification studies have used the MTL approach. The first line of work in this direction was proposed by Kochkina et al. (2018) and Ma et al. (2018), who showed the utility of auxiliary task labels in veracity classification tasks. ...
- Citing Conference Paper
September 2015
... • Code-switching or multilingual datasets: The practice of alternating between two or more languages within a conversation or piece of writing is referred to as code-switching (Bhogale et al., 2023; Chadha et al., 2022; Mowlaee et al., 2014) and (Raval et al., 2021b). Creating resources that capture the code-switching phenomenon in multilingual datasets for Automatic Speech Recognition (ASR) in the Gujarati language. ...
- Citing Conference Paper
September 2014