Conference Paper

Vocoder-Based Speech Synthesis from Silent Videos

... Pioneering lip-to-speech studies have focused either on constrained-vocabulary datasets recorded in controlled environments, such as GRID [4], TCD-TIMIT [5], and OuluVS [6], or on open-vocabulary datasets from real-world scenarios with specific speakers, such as Lip2Wav [7]. Many of these methods [1], [8]-[11] estimate speech parameters from silent video inputs and synthesize speech using statistical parametric vocoders, whereas others [7], [12]-[28] directly predict mel-spectrograms, which are then converted into speech waveforms using pre-trained neural vocoders. These studies have undoubtedly advanced the field by establishing fundamental methodologies. ...
... In essence, the upsample and downsample blocks ensure compatible dimensions for combining information from the DDSP signal and the content embedding within the HiFi-GAN architecture. The hidden dimension, the kernel sizes in the MRF module, the kernel sizes in the transposed convolutions, and the dilation rates in the MRF are set to 512, [3, 7, 11], [11, 8, 4, 4], and [[1, 3, 5], [1, 3, 5], [1, 3, 5]], respectively. For the settings of the Multi-Period Discriminator (MPD) and Multi-Scale Discriminator (MSD), we follow the configurations in [33]. ...
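The generator hyperparameters quoted in this excerpt map naturally onto the field names used in the public HiFi-GAN configuration files. Below is a minimal, hypothetical sketch of such a configuration dictionary; the upsample_rates entry is an assumed placeholder, since the excerpt specifies the transposed-convolution kernel sizes but not the upsampling factors.

```python
# Hypothetical config sketch using the public HiFi-GAN config.json field names.
# upsample_rates values are assumed placeholders; the other values are the ones
# quoted in the excerpt above.
hifigan_generator_config = {
    "upsample_initial_channel": 512,          # hidden dimension
    "upsample_kernel_sizes": [11, 8, 4, 4],   # transposed-convolution kernel sizes
    "upsample_rates": [5, 4, 2, 2],           # ASSUMED upsampling factors (not stated)
    "resblock_kernel_sizes": [3, 7, 11],      # MRF kernel sizes
    "resblock_dilation_sizes": [[1, 3, 5], [1, 3, 5], [1, 3, 5]],  # MRF dilations
}

if __name__ == "__main__":
    cfg = hifigan_generator_config
    # One dilation pattern is expected per MRF kernel size.
    assert len(cfg["resblock_kernel_sizes"]) == len(cfg["resblock_dilation_sizes"])
    print(cfg)
```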
Preprint
Recent advancements in visual speech recognition (VSR) have promoted progress in lip-to-speech synthesis, where pre-trained VSR models enhance the intelligibility of synthesized speech by providing valuable semantic information. The success achieved by cascade frameworks, which combine pseudo-VSR with pseudo-text-to-speech (TTS) or implicitly utilize the transcribed text, highlights the benefits of leveraging VSR models. However, these methods typically rely on mel-spectrograms as an intermediate representation, which may introduce a key bottleneck: the domain gap between synthetic mel-spectrograms, generated from inherently error-prone lip-to-speech mappings, and real mel-spectrograms used to train vocoders. This mismatch inevitably degrades synthesis quality. To bridge this gap, we propose Natural Lip-to-Speech (NaturalL2S), an end-to-end framework integrating acoustic inductive biases with differentiable speech generation components. Specifically, we introduce a fundamental frequency (F0) predictor to capture prosodic variations in synthesized speech. The predicted F0 then drives a Differentiable Digital Signal Processing (DDSP) synthesizer to generate a coarse signal which serves as prior information for subsequent speech synthesis. Additionally, instead of relying on a reference speaker embedding as an auxiliary input, our approach achieves satisfactory performance on speaker similarity without explicitly modelling speaker characteristics. Both objective and subjective evaluation results demonstrate that NaturalL2S can effectively enhance the quality of the synthesized speech when compared to state-of-the-art methods. Our demonstration page is accessible at https://yifan-liang.github.io/NaturalL2S/.
... B. Implementation details 1) Data preparation: For the GRID-4S and TCD-TIMIT-3S datasets, we adhere to the convention of randomly selecting 90% of the data for training, 5% for validation, and 5% for testing, as established in previous works [5], [11], [17], [40]. For Lip2Wav, we adopt the official data split [5]. ...
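For illustration, the 90/5/5 convention mentioned above can be implemented as a simple per-speaker random split; the sketch below is hypothetical (the directory layout, file extension, and helper name are not taken from the paper).

```python
# Hypothetical per-speaker 90% / 5% / 5% random split; paths and the *.mpg
# extension are illustrative assumptions, not the datasets' actual layout.
import random
from pathlib import Path

def split_speaker_files(speaker_dir, seed=0):
    files = sorted(Path(speaker_dir).glob("*.mpg"))
    rng = random.Random(seed)          # fixed seed for a reproducible split
    rng.shuffle(files)
    n_train = int(0.90 * len(files))
    n_val = int(0.05 * len(files))
    return {
        "train": files[:n_train],
        "val": files[n_train:n_train + n_val],
        "test": files[n_train + n_val:],
    }

if __name__ == "__main__":
    splits = split_speaker_files("GRID/s1")   # placeholder speaker directory
    print({k: len(v) for k, v in splits.items()})
```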
... This involved training the network separately with individual speakers. However, for the GRID-4S and TCD-TIMIT-3S datasets, we evaluated RobustL2S in a constrained (seen) speaker setting [5], [17], [40]. In this case, we trained a single speaker model for each dataset. ...
... We report the mean test scores on all four speakers of the GRID-4S dataset and all three speakers of the TCD-TIMIT-3S dataset, as documented in previous works. Remarkably, our RobustL2S approach demonstrates significant improvements in terms of STOI, ESTOI, and WER over prior methods (comparison values, STOI / ESTOI / WER: [40] 0.648 / 0.455 / 23.33%; Ephrat et al. [14] 0.659 / 0.376 / 27.83%; Lip2Wav [5] 0.731 / 0.535 / 14.08%; VAE-based [16] 0.724 / 0.540 / -; VCA-GAN [19] 0.724 / 0.609 / 12.25%; and from a second table, STOI / ESTOI: Ephrat et al. [5] 0.184 / 0.098; GAN-based [47] 0.195 / 0.104; Lip2Wav [5] 0.418 / 0.290). Table IV provides a synopsis of RobustL2S's performance on the Lip2Wav dataset. This dataset includes a significant amount of silences between words, and RobustL2S shows a notable improvement across all metrics. ...
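For context, STOI and its extended variant ESTOI are intrusive intelligibility metrics computed between a reference and a synthesized waveform. A minimal sketch of computing them with the pystoi package is shown below; the file names are placeholders, and this is not the evaluation script used in any of the cited papers.

```python
# Sketch: computing STOI / ESTOI for a reference vs. synthesized waveform pair
# with the pystoi package. File paths are placeholders.
import librosa
from pystoi import stoi

SR = 16000  # both metrics are commonly computed at 16 kHz

ref, _ = librosa.load("reference.wav", sr=SR)
syn, _ = librosa.load("synthesized.wav", sr=SR)

# Truncate to the shorter signal so both inputs have equal length.
n = min(len(ref), len(syn))
ref, syn = ref[:n], syn[:n]

print("STOI :", stoi(ref, syn, SR, extended=False))
print("ESTOI:", stoi(ref, syn, SR, extended=True))
```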
Preprint
Full-text available
Significant progress has been made in speaker-dependent Lip-to-Speech synthesis, which aims to generate speech from silent videos of talking faces. Current state-of-the-art approaches primarily employ non-autoregressive sequence-to-sequence architectures to directly predict mel-spectrograms or audio waveforms from lip representations. We hypothesize that direct mel-prediction hampers training/model efficiency due to the entanglement of speech content with ambient information and speaker characteristics. To this end, we propose RobustL2S, a modularized framework for Lip-to-Speech synthesis. First, a non-autoregressive sequence-to-sequence model maps self-supervised visual features to a representation of disentangled speech content. A vocoder then converts the speech features into raw waveforms. Extensive evaluations confirm the effectiveness of our setup, achieving state-of-the-art performance on the unconstrained Lip2Wav dataset and the constrained GRID and TCD-TIMIT datasets. Speech samples from RobustL2S can be found at https://neha-sherin.github.io/RobustL2S/
... Thus, it is difficult to apply in unseen, even multi-speaker, settings. There has been remarkable progress in video-to-speech synthesis [8,9,17,22,28,33,40], especially with few speakers. While these methods have shown impressive performance, they have not explicitly considered the varying identity characteristics of different speakers, and thus have not been investigated well in the unseen multi-speaker setting. ...
... Video to Speech Synthesis. Speech synthesis from silent talking faces is one of the lip-reading techniques that have been consistently studied [28,43]. The initial approach [9] presented an end-to-end CNN-based model that predicts the speech audio signal from a silent talking face video and significantly improved performance over methods using hand-crafted visual features [29]. ...
... The GRID corpus [7] is the most commonly used dataset for speech reconstruction tasks [1,8,9,28,30,33,40], containing 33 speakers whose utterances are six-word sentences drawn from a fixed dictionary. Since we focus on training with a large number of subjects, we conduct experiments in two different settings: 1) a multi-speaker independent (unseen) setting, where the speakers in the test dataset are unseen, and 2) a multi-speaker dependent (seen) setting, where all 33 speakers are used for training, validation, and evaluation with a 90%-5%-5% split, respectively. ...
Preprint
Full-text available
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which poses a challenge in video-to-speech synthesis and becomes more critical in unseen-speaker settings. Distinct from previous methods, our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain highly intelligible speech from the model even when the input video of an unseen subject is given. To this end, we introduce a speech-visage selection module that separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer, which generates speech by coating the visage-styles while maintaining the speech content. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content even when the silent talking face video of an unseen subject is given. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets. The synthesized speech can be heard in the supplementary materials.
... Following this, [10] (an extension of another early video-to-speech approach [11]) was the first to train and evaluate on multiple speakers (in this case, a 4-speaker subset of GRID), achieving a major leap forward in the realism of its outputs. This method set two trends which are widely adopted in following works: predicting speech features directly from raw video, rather than from manually extracted visual features [2,16,18,23,27,32,37-39], and using mel-frequency spectrograms as an intermediate representation [2,27,32,39], which are then converted into a raw waveform using the Griffin-Lim algorithm [12]. Notable exceptions include [38], which proposes an end-to-end video-to-waveform generative adversarial network (GAN) capable of producing intelligible speech from raw video without the need for a separate spectrogram-to-waveform system, and [23], which uses a traditional vocoder to synthesize speech, rather than a spectrogram-based approach. ...
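The mel-spectrogram followed by Griffin-Lim pipeline mentioned above can be sketched in a few lines with librosa; the STFT/mel parameters below are illustrative defaults rather than the settings of any cited paper, and the test tone merely stands in for a predicted spectrogram.

```python
# Sketch of the mel-spectrogram -> Griffin-Lim waveform pipeline referenced above.
# STFT/mel parameters are illustrative defaults, not taken from a specific paper.
import numpy as np
import librosa
import soundfile as sf

SR, N_FFT, HOP, N_MELS = 16000, 1024, 256, 80

# A one-second test tone stands in for audio whose mel-spectrogram a model predicts.
t = np.linspace(0, 1.0, SR, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * 220 * t).astype(np.float32)

mel = librosa.feature.melspectrogram(y=y, sr=SR, n_fft=N_FFT, hop_length=HOP, n_mels=N_MELS)

# mel_to_audio inverts the mel filterbank and then runs Griffin-Lim to estimate
# phase and recover a waveform from the (predicted) magnitude spectrogram.
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=SR, n_fft=N_FFT, hop_length=HOP, n_iter=60)
sf.write("griffinlim_reconstruction.wav", y_hat, SR)
```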
... Remarkably, most recent works focus on corpora with small pools of speakers, constrained vocabularies, and video recorded in studio conditions (e.g., 4-Speaker GRID and 3-Lipspeaker TCD-TIMIT [14]) [2,16,23,27,37-39], achieving improvements in performance via the use of intricate loss ensembles [18,24,37] and complex architectures [16,32,37,39]. While these developments are meaningful within ideal conditions, they fail to leverage the massive amount of audiovisual data available publicly, and propose training procedures which do not easily scale to very large datasets [18,24]. ...
Preprint
Video-to-speech synthesis (also known as lip-to-speech) refers to the translation of silent lip movements into the corresponding audio. This task has received an increasing amount of attention due to its self-supervised nature (i.e., it can be trained without manual labelling) combined with the ever-growing collection of audio-visual data available online. Despite these strong motivations, contemporary video-to-speech works focus mainly on small- to medium-sized corpora with substantial constraints in both vocabulary and setting. In this work, we introduce a scalable video-to-speech framework consisting of two components: a video-to-spectrogram predictor and a pre-trained neural vocoder, which converts the mel-frequency spectrograms into waveform audio. We achieve state-of-the-art results for GRID and considerably outperform previous approaches on LRW. More importantly, by focusing on spectrogram prediction using a simple feedforward model, we can efficiently and effectively scale our method to very large and unconstrained datasets: to the best of our knowledge, we are the first to show intelligible results on the challenging LRS3 dataset.
... Yadav et al. [11] used a stochastic modelling approach with a variational autoencoder. Michelsanti et al. [12] predicted the vocoder features of [13] and synthesized speech using the vocoder. Different from the previous works, our approach explicitly models the local visual features and the global visual context to synthesize accurate speech. ...
... Furthermore, we investigate the performance of the VCA-GAN on the GRID dataset in the unseen-speaker setting following [10,12], shown in Table 5. We measure the WER using a pre-trained ASR model trained in the unseen-speaker setting of the GRID dataset. ...
... We measure the WER using a pre-trained ASR model trained in the unseen-speaker setting of the GRID dataset. Compared to the previous works [10,12], the VCA-GAN outperforms them in STOI, ESTOI, and PESQ. Since the model cannot access the voice characteristics of unseen speakers during training, the overall performance is lower than in the constrained-speaker (i.e., seen-speaker) setting. ...
Preprint
Full-text available
In this paper, we propose a novel lip-to-speech generative adversarial network, Visual Context Attentional GAN (VCA-GAN), which can jointly model local and global lip movements during speech synthesis. Specifically, the proposed VCA-GAN synthesizes the speech from local lip visual features by finding a mapping function of viseme-to-phoneme, while global visual context is embedded into the intermediate layers of the generator to clarify the ambiguity in the mapping induced by homophenes. To achieve this, a visual context attention module is proposed that encodes global representations from the local visual features and provides the desired global visual context corresponding to the given coarse speech representation to the generator through audio-visual attention. In addition to the explicit modelling of local and global visual representations, synchronization learning is introduced as a form of contrastive learning that guides the generator to synthesize speech in sync with the given input lip movements. Extensive experiments demonstrate that the proposed VCA-GAN outperforms existing state-of-the-art methods and is able to effectively synthesize speech from multiple speakers, which has barely been handled in previous works.
... For the articulatory-to-acoustic conversion task, typically electromagnetic articulography [25], ultrasound tongue imaging [6,7], permanent magnetic articulography [11], surface electromyography [13], magnetic resonance imaging [5], or video of the lip movements [16,10,3,19,17,23,22,24] are used. Lip-to-speech synthesis can be solved in two different ways: 1) the direct approach, meaning that speech is generated from the input signal without an intermediate step [16,10,3,19,17]; and 2) the indirect approach, meaning that lip-to-text recognition is followed by text-to-speech synthesis [23,22,24]. The direct approach has the advantage that it can potentially be faster, as there are no intermediate steps in the processing. ...
... In this paper, we proposed a lip-to-speech system built from a backend for deep neural network training and inference and a frontend in the form of a mobile application. Compared to earlier lip-to-speech and lip reading systems [3,10,16,17,19,22,23,24], the main difference is that here we focus on the practical implementation of the whole system, and not only on the deep learning aspects. A limitation of the current system is its speed, i.e., inference at the server and network delay make real-time communication somewhat inconvenient. ...
Preprint
Full-text available
Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques as input (e.g. ultrasound tongue imaging, MRI, lip video). The advantage of lip video is that it is easily available and affordable: most modern smartphones have a front camera. There are already a few solutions for lip-to-speech synthesis, but they mostly concentrate on offline training and inference. In this paper, we propose a system built from a backend for deep neural network training and inference and a frontend in the form of a mobile application. Our initial evaluation shows that the scenario is feasible: a top-5 classification accuracy of 74% is combined with feedback from the mobile application user, suggesting that speech-impaired users might be able to communicate with this solution.
... These models achieve intelligible results, but are only applied to seen speakers, i.e., there is an exact correspondence between the speakers in the training, validation, and test sets, or they choose to focus on single-speaker speech reconstruction [43]. Recently, [33] has proposed an alternative approach based on predicting WORLD vocoder parameters [34], which generates clear speech for unseen speakers as well. However, the reconstructed speech is still not realistic. ...
... The resulting spectrograms are very close to the original samples, but the reconstructed waveforms sound noticeably robotic. Another recent work [33] uses CNNs+RNNs to predict vocoder parameters (aperiodicity and spectral envelope), rather than spectrograms. Additionally, the model is trained to predict the transcription of the speech, in other words performing speech reconstruction and recognition simultaneously in a multi-task fashion. ...
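The WORLD vocoder parameters mentioned here (fundamental frequency, spectral envelope, aperiodicity) can be illustrated with the pyworld analysis/synthesis API as sketched below; in a video-to-speech model the analysis step would be replaced by the network's predictions. This is an illustrative sketch, not the cited models' code, and the file path is a placeholder.

```python
# Sketch of WORLD vocoder analysis and resynthesis with pyworld. In a
# video-to-speech system, F0, spectral envelope (SP) and aperiodicity (AP)
# would be predicted from video rather than extracted from ground-truth audio.
import numpy as np
import pyworld as pw
import soundfile as sf

x, fs = sf.read("ground_truth.wav")           # placeholder path; mono audio
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs, frame_period=5.0)   # frame-wise fundamental frequency (Hz)
sp = pw.cheaptrick(x, f0, t, fs)              # smoothed spectral envelope
ap = pw.d4c(x, f0, t, fs)                     # aperiodicity

# Resynthesize a waveform from the three parameter streams.
y = pw.synthesize(f0, sp, ap, fs, frame_period=5.0)
sf.write("world_resynthesis.wav", y, fs)
```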
... In this section, we present our experiments for seen speakers. For direct comparison with other works we use the same 4 speakers from GRID (1, 2, 4 and 29) as in [1,33,43,53] and the 3 lipspeakers from TCD-TIMIT as in [43]. In order to investigate the impact of the number of speakers and the amount of training data, we also present results for all 33 speakers from the GRID dataset. ...
Preprint
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video, and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on Generative Adversarial Networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for LRW (Lip Reading in the Wild), featuring hundreds of speakers recorded entirely "in the wild". We evaluate the generated samples in two different scenarios (seen and unseen speakers) using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.
... In its narrow sense, speech synthesis is used to refer to text-to-speech (TTS) [1], which plays an essential role in spoken dialog systems as a way for machines to communicate with humans. In its broader definition, speech synthesis can refer to all kinds of speech generation interfaces, such as voice conversion (VC) [2], video-to-speech [3,4], et cetera [5]. Recent state-of-the-art (SOTA) speech synthesis systems can generate speech with natural sounding quality, some of which is indistinguishable from recorded speech [6]. ...
... The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [17], while the latter sometimes disregards this aspect for naturalness [18]. Given this definition, a voice cloning system can be TTS, VC, or any type of speech generation system [4,5]. ...
... A small-sized neural network, which consists of several feed-forward layers, is used as the acoustic model in Chapters 3, 4, and 5 to test the feasibility and behavior of the focused technique. The neural acoustic model (text encoder and speech decoder) takes aligned linguistic features (dependent on language) and transforms them into acoustic features, which include 60-dimensional mel-cepstral coefficients, 25-dimensional band-limited aperiodicities, interpolated logarithmic fundamental frequencies, and their dynamic counterparts. ...
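The acoustic feature set listed above (mel-cepstral coefficients, band aperiodicities, interpolated log-F0 and their deltas) is commonly extracted with WORLD-style tools; below is a hedged sketch using pyworld and pysptk, where the analysis orders and the frequency-warping constant alpha are illustrative choices rather than the thesis's exact configuration, and the dynamic (delta) features are omitted.

```python
# Sketch of extracting WORLD-style acoustic features (mel-cepstrum, coded band
# aperiodicity, interpolated log-F0) with pyworld + pysptk. Orders and alpha
# are illustrative, not the thesis's settings.
import numpy as np
import pyworld as pw
import pysptk
import soundfile as sf

x, fs = sf.read("utterance.wav")               # placeholder path; mono audio
x = x.astype(np.float64)

f0, t = pw.harvest(x, fs)                      # frame-wise F0 (0 in unvoiced frames)
sp = pw.cheaptrick(x, f0, t, fs)               # spectral envelope
ap = pw.d4c(x, f0, t, fs)                      # aperiodicity

mgc = pysptk.sp2mc(sp, order=59, alpha=0.58)   # 60-dim mel-cepstral coefficients
bap = pw.code_aperiodicity(ap, fs)             # coded band aperiodicities

# Interpolate log-F0 across unvoiced frames (simple linear interpolation).
lf0 = np.full(len(f0), np.nan)
lf0[f0 > 0] = np.log(f0[f0 > 0])
voiced_idx = np.flatnonzero(~np.isnan(lf0))
lf0_interp = np.interp(np.arange(len(lf0)), voiced_idx, lf0[voiced_idx])

print(mgc.shape, bap.shape, lf0_interp.shape)
```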
Thesis
Full-text available
Speech synthesis is the technology of generating speech from an input. While the term is commonly used to refer to text-to-speech (TTS), there are many types of speech synthesis systems which handle different input interfaces, such as voice conversion (VC), which converts speech of a source speaker to the voice of a target, or video-to-speech, which generates speech from an image sequence (video) of facial movements. This thesis focuses on the voice cloning task, which is the development of a speech synthesis system with an emphasis on speaker identity and data efficiency. A voice cloning system is expected to handle the circumstance of having less than ideal data for a particular target speaker; more specifically, when we do not have control over the target speaker, the recording environment, or the quality and quantity of the speech data. Such systems will be useful for many practical applications which involve generating speech with desired voices. However, the technology is also vulnerable to misuse, which can cause significant damage to society when exploited by people with malicious intentions. By first breaking down the structures of conventional TTS and VC systems into common functional modules, we propose a versatile deep-learning-based voice cloning framework which can be used to create a unified speech generation system of TTS and VC with a target voice. Given such a unified system, which is expected to have consistent performance between its TTS and VC modes, we can use it to handle many application scenarios that are difficult to tackle with just one or the other, as TTS and VC have their own strengths and weaknesses. As this thesis deals with two major research subjects, TTS and VC, to provide a comprehensive narrative its content can be considered as comprising two segments which tackle two different issues: (1) developing a versatile speaker adaptation method for neural TTS systems. Unlike VC, for which existing voice cloning methods are capable of producing high-quality generated speech, existing TTS adaptation methods lag behind in performance and scalability. The proposed method is expected to be capable of cloning voices using either transcribed or untranscribed speech with varying amounts of adaptation data, while producing generated speech with high quality and speaker similarity; (2) establishing a unified speech generation system of TTS and VC with highly consistent performance between the two. To achieve this consistency, it is desirable to reduce the differences between the methodologies and use the same framework for both systems. In addition to convenience, such a system also has the ability to solve many unique speech generation tasks, as TTS and VC are operated under different application scenarios and complement each other. On the first issue, by investigating the mechanism of a multi-speaker neural acoustic model, we proposed a novel multimodal neural TTS system with the ability to perform crossmodal adaptation. This ability is fundamental for cloning voices with untranscribed speech on the basis of the backpropagation algorithm. Compared with existing unsupervised speaker adaptation methods, which only involve a forward pass, a backpropagation-based unsupervised adaptation method has significant implications for performance, as it allows us to expand the speaker component to other parts of the neural network besides the speaker bias. This hypothesis is tested by using speaker scaling together with speaker bias, or the entire module, as adaptable components.
The proposed system unites the procedures of supervised and unsupervised speaker adaptation. On the second issue, we test the feasibility of using the multimodal neural TTS system proposed previously to bootstrap a VC system for a particular target speaker. More specifically, the proposed VC system is tested on standard intra-language scenarios and cross-lingual scenarios, with the experimental evaluations showing promising performance in both. Finally, given the proof of concept provided by the earlier experiments, the proposed methodology is incorporated with relevant techniques and components of modern neural speech generation systems to push the performance of the unified TTS/VC system further. The experiments suggest that the proposed unified system has performance comparable with existing state-of-the-art TTS and VC systems at the time this thesis was written, but with higher speaker similarity and better data efficiency. At the end of this thesis, we have successfully created a versatile voice cloning system which can be used for many interesting speech generation scenarios. Moreover, the proposed multimodal system can be extended to other speech generation interfaces or enhanced to provide control over para-linguistic features (e.g., emotions). These are all interesting directions for future work.
... Therefore, AP and F0 were not estimated from the silent video, but artificially produced without taking the visual information into account, while SP was estimated with a Gaussian mixture model (GMM) and FFNN within a regression-based framework. As input to the models, two different visual features were considered, 2-D DCT and AAM, while the explored SP representations were linear predictive coding (LPC) coefficients and mel-filterbank amplitudes. [An overview table interleaved here lists prior systems (2017-2020) by visual features (AAM or raw pixels of the mouth or face, optionally with optical flow), target acoustic representations (codebook entries, LSP of LPC, linear- and mel-scale spectrograms, autoencoder features, WORLD features, raw waveform), and model types (FFNN, RNN, LSTM, GRU, BiGRU, CNN, AE, GAN).] While the choice of visual features did not have a big impact on the results, the use of mel-filterbank amplitudes allowed the corresponding systems to outperform those based on LPC coefficients. ...
... The method proposed in [177] intended to still be able to reconstruct speech in a speaker independent scenario, but also to avoid artefacts similar to the ones introduced by the model in [256]. Therefore, vocoder features were used as training target instead of raw waveforms. ...
... While Uttam et al. [247] decided to work with features extracted by a pre-trained deep AE, similarly to [11], the approach in [146] estimated an LSP representation of LPC coefficients. In addition, Kumar et al. [146] provided a VSR module, as in [177]. However, this module was trained separately from the main system and was designed to provide only one among ten possible sentence transcriptions, making it database-dependent and not feasible for real-time applications. ...
Preprint
Full-text available
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. More recently, visual information from the target speakers, such as lip movements and facial expressions, has been introduced to speech enhancement and speech separation systems, because the visual aspect of speech is essentially unaffected by the acoustic environment. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving state-of-the-art performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: visual features; acoustic features; deep learning methods; fusion techniques; training targets and objective functions. We also survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation.
... In its narrow sense, speech synthesis is used to refer to text-to-speech (TTS) systems [1], which play an essential role in a spoken dialog system as a way for machine-human communication. In its broader definition, speech synthesis can refer to all kinds of speech generation interfaces like voice conversion (VC) [2], video-to-speech [3], [4], and others [5], [6], [7]. Recent state-of-the-art speech synthesis systems can generate speech with natural sounding quality, some of which are indistinguishable from recorded speech [8], [9]. ...
... The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [25], while the latter sometimes disregards this aspect for naturalness [26]. Given this definition, a voice cloning system can be a TTS, a VC, or any other type of speech synthesis system [4], [7]. The NAUTILUS system is designed to be expandable to other input interfaces. ...
... The first scenario focuses more on VC and cloning voices with untranscribed speech, while the second scenario focuses more on TTS and the performance of the supervised and unsupervised speaker adaptation strategies. A chain system based on TDNN-F pretrained on the Librispeech corpus [69] was used for the calculation (http://kaldi-asr.org/models/m13). The generated speech samples of both experiment scenarios are available at https://nii-yamagishilab.github.io/sample-versatile-voice-cloning/ ...
Preprint
Full-text available
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.
... A recently developed model for speech reconstruction is [21], which uses a vocoder for speech synthesis. Their work uses vocoder features such as the spectral envelope, fundamental frequency, and aperiodicity parameters. ...
... The architecture used by Michelsanti et al. [21] consists of a video encoder, a Gated Recurrent Unit (GRU), and five decoders: an SP decoder, an AP decoder, an F0 decoder, a VUV decoder, and a VSR decoder. The model has two outputs: (1) the reconstructed speech and (2) the text output corresponding to the speech. ...
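As a rough structural illustration of the encoder + GRU + five-decoder layout described above, here is a minimal PyTorch sketch; the layer types, dimensions, and the simplistic video encoder are placeholders and do not reproduce the authors' actual architecture.

```python
# Minimal PyTorch sketch of the encoder + GRU + five-decoder layout described
# above. Layer choices and dimensions are placeholders, not the authors' design.
import torch
import torch.nn as nn

class VideoToVocoderNet(nn.Module):
    def __init__(self, feat_dim=512, sp_dim=60, ap_dim=5, n_phones=40):
        super().__init__()
        # Placeholder video encoder: a single 3D conv followed by spatial pooling.
        self.video_encoder = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 1, 1)),  # keep time, pool space
        )
        self.proj = nn.Linear(32, feat_dim)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)
        # Five task-specific decoders.
        self.sp_decoder = nn.Linear(feat_dim, sp_dim)         # spectral envelope
        self.ap_decoder = nn.Linear(feat_dim, ap_dim)         # aperiodicity
        self.f0_decoder = nn.Linear(feat_dim, 1)              # fundamental frequency
        self.vuv_decoder = nn.Linear(feat_dim, 1)             # voiced/unvoiced flag
        self.vsr_decoder = nn.Linear(feat_dim, n_phones + 1)  # phones + CTC blank

    def forward(self, video):                  # video: (B, 3, T, H, W)
        h = self.video_encoder(video)          # (B, 32, T, 1, 1)
        h = h.squeeze(-1).squeeze(-1).transpose(1, 2)  # (B, T, 32)
        h, _ = self.gru(self.proj(h))          # (B, T, feat_dim)
        return {
            "sp": self.sp_decoder(h),
            "ap": self.ap_decoder(h),
            "f0": self.f0_decoder(h),
            "vuv": torch.sigmoid(self.vuv_decoder(h)),
            "phone_logits": self.vsr_decoder(h),
        }

# Smoke test with a dummy 25-frame clip.
out = VideoToVocoderNet()(torch.randn(2, 3, 25, 64, 64))
print({k: v.shape for k, v in out.items()})
```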
Chapter
Generating audio from a visual scene is an extremely challenging yet useful task, as it finds applications in remote surveillance, comprehending speech for hearing-impaired people, and silent speech interfaces (SSI). Due to the recent advancements of deep neural network techniques, there has been considerable research effort toward speech reconstruction from silent videos or visual speech. In this survey paper, we review several recent papers in this area and make a comparative study in terms of their architectural models and the accuracy achieved.
Keywords: Audio generation, Visual speech, Lipreading, Deep learning, CNN, LSTM
... Most existing methods for lip-to-speech synthesis [1,9,10,18,21,22,37,41] focus on constrained settings with small datasets, such as GRID [6] and TCD-TIMIT [14], where each speaker's vocabulary contains fewer than a hundred words and the speakers always face forward with almost no head movement. On the other hand, unconstrained lip-to-speech synthesis uses real-world talking videos which contain a vocabulary of thousands of words and large head movements. ...
... Our model enables the use of a neural vocoder in unconstrained lip-to-speech synthesis, which is not possible for previous works. (Comparison scores: GAN-based [37] 1.684; Ephrat et al. [9] 1.825; Lip2Wav [24] 1.772; VAE-based [41] 1.932; Vocoder-based [21] 1.900; VCA-GAN [18] 2.008.) ...
Preprint
Unconstrained lip-to-speech synthesis aims to generate corresponding speech from silent videos of talking faces with no restriction on head poses or vocabulary. Current works mainly use sequence-to-sequence models to solve this problem, either in an autoregressive architecture or a flow-based non-autoregressive architecture. However, these models suffer from several drawbacks: 1) Instead of directly generating audio, they use a two-stage pipeline that first generates mel-spectrograms and then reconstructs audio from the spectrograms. This causes cumbersome deployment and degradation of speech quality due to error propagation; 2) The audio reconstruction algorithm used by these models limits the inference speed and audio quality, while neural vocoders are not available for these models since their output spectrograms are not accurate enough; 3) The autoregressive model suffers from high inference latency, while the flow-based model has high memory occupancy: neither of them is efficient enough in both time and memory usage. To tackle these problems, we propose FastLTS, a non-autoregressive end-to-end model which can directly synthesize high-quality speech audio from unconstrained talking videos with low latency, and has a relatively small model size. Besides, different from the widely used 3D-CNN visual frontend for lip movement encoding, we for the first time propose a transformer-based visual frontend for this task. Experiments show that our model achieves a 19.76× speedup for audio waveform generation compared with the current autoregressive model on input sequences of 3 seconds, and obtains superior audio quality.
... Previous efforts generally build statistical models to map visual features to acoustic features, then use a vocoder to synthesize the waveform. Typical modelling approaches include Hidden Markov Models [1,2], non-negative matrix factorization [3], maximum likelihood estimation [4] and deep learning methods [1,[5][6][7][8][9][10][11][12][13][14][15][16][17]. Most works [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15] are restricted to small datasets (e.g., GRID [18]) to create single-speaker systems under constrained conditions with limited vocabulary, which hinders their practical deployment. ...
... Typical modelling approaches include Hidden Markov Models [1,2], non-negative matrix factorization [3], maximum likelihood estimation [4] and deep learning methods [1,[5][6][7][8][9][10][11][12][13][14][15][16][17]. Most works [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15] are restricted to small datasets (e.g., GRID [18]) to create single-speaker systems under constrained conditions with limited vocabulary, which hinders their practical deployment. A few studies [16,17] propose to use speaker representations to capture speaker characteristics and control the speaker identity of generated speech, such that multi-speaker VTS can be achieved in a single system. ...
Preprint
Though significant progress has been made for speaker-dependent Video-to-Speech (VTS) synthesis, little attention is devoted to multi-speaker VTS that can map silent video to speech while allowing flexible control of speaker identity, all in a single system. This paper proposes a novel multi-speaker VTS system based on cross-modal knowledge transfer from voice conversion (VC), where vector quantization with contrastive predictive coding (VQCPC) is used for the content encoder of VC to derive discrete phoneme-like acoustic units, which are transferred to a Lip-to-Index (Lip2Ind) network to infer the index sequence of acoustic units. The Lip2Ind network can then substitute the content encoder of VC to form a multi-speaker VTS system that converts silent video to acoustic units for reconstructing accurate spoken content. The VTS system also inherits the advantages of VC by using a speaker encoder to produce speaker representations to effectively control the speaker identity of the generated speech. Extensive evaluations verify the effectiveness of the proposed approach, which can be applied in both constrained-vocabulary and open-vocabulary conditions, achieving state-of-the-art performance in generating high-quality speech with high naturalness, intelligibility, and speaker similarity. Our demo page is released here: https://wendison.github.io/VCVTS-demo/
... In its narrow sense, speech synthesis is used to refer to text-to-speech (TTS) systems [1], which play an essential role in a spoken dialog system as a way for machine-human communication. In its broader definition, speech synthesis can refer to all kinds of speech generation interfaces like voice conversion (VC) [2], video-to-speech [3], [4], and others [5]-[7]. Recent state-of-the-art speech synthesis systems can generate speech with natural sounding quality, some of which is indistinguishable from recorded speech [8], [9]. ...
... The main difference between voice cloning and speech synthesis is that the former puts an emphasis on the identity of the target speaker [25], while the latter sometimes disregards this aspect for naturalness [26]. Given this definition, a voice cloning system can be a TTS, a VC, or any other type of speech synthesis system [4], [7]. The NAUTILUS system is designed to be expandable to other input interfaces. ...
Article
Full-text available
We introduce a novel speech synthesis system, called NAUTILUS, that can generate speech with a target voice either from a text input or a reference utterance of an arbitrary source speaker. By using a multi-speaker speech corpus to train all requisite encoders and decoders in the initial training stage, our system can clone unseen voices using untranscribed speech of target speakers on the basis of the backpropagation algorithm. Moreover, depending on the data circumstance of the target speaker, the cloning strategy can be adjusted to take advantage of additional data and modify the behaviors of text-to-speech (TTS) and/or voice conversion (VC) systems to accommodate the situation. We test the performance of the proposed framework by using deep convolution layers to model the encoders, decoders and WaveNet vocoder. Evaluations show that it achieves comparable quality with state-of-the-art TTS and VC systems when cloning with just five minutes of untranscribed speech. Moreover, it is demonstrated that the proposed framework has the ability to switch between TTS and VC with high speaker consistency, which will be useful for many applications.
... In addition, we present a Multi-Task Learning (MTL) [16] strategy where a phone recognition task is learned together with SI. The motivation of the MTL approach lies in previous work, which showed that speech recognition can improve not only speech enhancement [17] (and vice versa [18,19]), but also speech reconstruction from silent videos [20]. ...
... In addition to the plain AV-SI model, we devised a MTL approach, which attempts to perform SI and phone recognition simultaneously. Our MTL training makes use of a Connectionist Temporal Classification (CTC) loss [23] which is very similar to the one presented in [20] for the task of speech synthesis from silent videos. The phone recognition subtask block in Fig. 1 shows the phone recognition module. ...
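A minimal sketch of how such a CTC-based phone-recognition auxiliary task can be combined with a primary reconstruction loss is given below; the toy network, shapes, and loss weighting are placeholders rather than the training code of either cited work.

```python
# Sketch of multi-task training with a CTC auxiliary loss (phone recognition)
# added to a primary reconstruction loss. Weights, shapes, and the toy model
# are placeholders, not the configuration used in the cited papers.
import torch
import torch.nn as nn

B, T, D, N_PHONES = 4, 50, 80, 40          # batch, frames, feature dim, phone classes
model = nn.GRU(D, 128, batch_first=True)
recon_head = nn.Linear(128, D)             # predicts acoustic features
ctc_head = nn.Linear(128, N_PHONES + 1)    # +1 for the CTC blank symbol

feats = torch.randn(B, T, D)               # stand-in for visual/audio-context features
target_feats = torch.randn(B, T, D)        # stand-in for ground-truth acoustic features
phone_targets = torch.randint(1, N_PHONES + 1, (B, 12))   # dummy phone label sequences

h, _ = model(feats)
recon_loss = nn.functional.l1_loss(recon_head(h), target_feats)

log_probs = ctc_head(h).log_softmax(-1).transpose(0, 1)   # CTCLoss expects (T, B, C)
ctc_loss = nn.CTCLoss(blank=0)(
    log_probs,
    phone_targets,
    input_lengths=torch.full((B,), T, dtype=torch.long),
    target_lengths=torch.full((B,), 12, dtype=torch.long),
)

loss = recon_loss + 0.5 * ctc_loss          # 0.5 is an arbitrary illustrative weight
loss.backward()
print(float(recon_loss), float(ctc_loss))
```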
Preprint
Full-text available
In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e., the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which show highly different structure than speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different duration. We also experiment with a multi-task learning approach where a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
... Lip2AudSpec [10] trains a fully-connected autoencoder on spectrograms and then uses the bottleneck features as training targets for a CNN+RNN lip reading network. In [68] a multi-task model was presented that predicts the spectral envelope, aperiodic parameters and the fundamental frequency as inputs to a vocoder to synthesize the raw waveform. The model also performs lip reading jointly with connectionist temporal classification (CTC) [58]. ...
Article
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Previous approaches train on data from almost exclusively audio-visual datasets, i.e., every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality, such as audiobooks, radio podcasts, and speech recognition datasets. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz, and then use the pretrained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this improves the reconstructed speech and that it is an unexplored way to improve the quality of the generator in a cross-modal task while only requiring samples from one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models with existing work.
... Lip2AudSpec [56] first trains a fullyconnected auto-encoder on spectrograms and then uses the learned bottleneck features as training targets for a CNN+RNN lipreading model. A multi-task model was presented in [57], which predicts the spectral envelope, aperiodic parameters and the fundamental frequency as inputs to a vocoder to synthesize the raw waveform. It is also trained jointly for a lipreading task using a connectionist temporal classification (CTC) [46] loss. ...
Preprint
Video-to-speech synthesis involves reconstructing the speech signal of a speaker from a silent video. The implicit assumption of this task is that the sound signal is either missing or contains a high amount of noise/corruption such that it is not useful for processing. Previous works in the literature either use video inputs only or employ both video and audio inputs during training, and discard the input audio pathway during inference. In this work we investigate the effect of using video and audio inputs for video-to-speech synthesis during both training and inference. In particular, we use pre-trained video-to-speech models to synthesize the missing speech signals and then train an audio-visual-to-speech synthesis model, using both the silent video and the synthesized speech as inputs, to predict the final reconstructed speech. Our experiments demonstrate that this approach is successful with both raw waveforms and mel spectrograms as target outputs.
... Lip2AudSpec [24] initially trains a fully-connected auto-encoder network on spectrograms and then uses the bottleneck features as training targets for a CNN+RNN lip reading network. In [87] a multi-task model was presented that predicts the spectral envelope, aperiodic parameters and the fundamental frequency as inputs to a vocoder to synthesize the raw waveform. The model also performs lip reading jointly using a connectionist temporal classification (CTC) [77] loss. ...
Preprint
Video-to-speech synthesis is the task of reconstructing the speech signal from a silent video of a speaker. Most established approaches to date involve a two-step process, whereby an intermediate representation from the video, such as a spectrogram, is extracted first and then passed to a vocoder to produce the raw audio. Some recent work has focused on end-to-end synthesis, whereby the generation of raw audio and any intermediate representations is performed jointly. All such approaches involve training on data from almost exclusively audio-visual datasets, i.e. every audio sample has a corresponding video sample. This precludes the use of abundant audio-only datasets which may not have a corresponding visual modality (e.g. audiobooks, radio podcasts, speech recognition datasets etc.), as well as audio-only architectures that have been developed by the audio machine learning community over the years. In this paper we propose to train encoder-decoder models on more than 3,500 hours of audio data at 24kHz, and then use the pre-trained decoders to initialize the audio decoders for the video-to-speech synthesis task. The pre-training step uses audio samples only and does not require labels or corresponding samples from other modalities (visual, text). We demonstrate that this pre-training step improves the reconstructed speech and that it is an unexplored way to improve the quality of the generator in a cross-modal task while only requiring samples from one of the modalities. We conduct experiments using both raw audio and mel spectrograms as target outputs and benchmark our models with existing work.
... There have been some studies on the GRID dataset that extracted speaker embeddings from natural reference speech for unseen speakers [3,4,6,23]. Although this task is not the focus of this paper, we also compared the performance of our proposed method when using the speech identity encoder for voice control with these studies. ...
Preprint
Lip-to-Speech (Lip2Speech) synthesis, which predicts corresponding speech from talking face images, has witnessed significant progress with various models and training strategies in a series of independent studies. However, existing studies cannot achieve voice control under zero-shot conditions, because extra speaker embeddings need to be extracted from natural reference speech and are unavailable when only the silent video of an unseen speaker is given. In this paper, we propose a zero-shot personalized Lip2Speech synthesis method in which face images control speaker identities. A variational autoencoder is adopted to disentangle the speaker identity and linguistic content representations, which enables speaker embeddings to control the voice characteristics of synthetic speech for unseen speakers. Furthermore, we propose associated cross-modal representation learning to promote the ability of face-based speaker embeddings (FSE) for voice control. Extensive experiments verify the effectiveness of the proposed method, whose synthetic utterances are more natural and better match the personality of the input video than those of the compared methods. To the best of our knowledge, this paper makes the first attempt at zero-shot personalized Lip2Speech synthesis with a face image rather than reference audio to control voice characteristics.
... For the GRID-4S and TCD-TIMIT-LS datasets, we follow the convention [17,21,27] and randomly select 90% of the data samples from each speaker for training, 5% for validation, and 5% for testing. For Lip2Wav, we adopt the official data split. ...
Preprint
Most lip-to-speech (LTS) synthesis models are trained and evaluated under the assumption that the audio-video pairs in the dataset are perfectly synchronized. In this work, we show that the commonly used audio-visual datasets, such as GRID, TCD-TIMIT, and Lip2Wav, can have data asynchrony issues. Training lip-to-speech with such datasets may further cause the model asynchrony issue -- that is, the generated speech and the input video are out of sync. To address these asynchrony issues, we propose a synchronized lip-to-speech (SLTS) model with an automatic synchronization mechanism (ASM) to correct data asynchrony and penalize model asynchrony. We further demonstrate the limitation of the commonly adopted evaluation metrics for LTS with asynchronous test data and introduce an audio alignment frontend before the metrics sensitive to time alignment for better evaluation. We compare our method with state-of-the-art approaches on conventional and time-aligned metrics to show the benefits of synchronization training.
... However, since they solely depend on visual information (i.e., lip movements), which holds incomplete information about speech [1], these techniques are known to be challenging problems. In particular, video-driven speech reconstruction, also known as lip-to-speech synthesis (Lip2Speech), has shown much lower performance compared to visual speech recognition [2,3,4,5]. Therefore, Lip2Speech is being developed with constrained datasets [6,7] in which the number of speakers is limited or the sentences follow a fixed grammar. In contrast, visual speech recognition [8,9,10] has achieved significant performance improvements on in-the-wild datasets [11,12,13] containing large variations in speakers and utterances. ...
Preprint
Recent studies have shown impressive performance in Lip-to-speech synthesis, which aims to reconstruct speech from visual information alone. However, they have struggled to synthesize accurate speech in the wild due to insufficient supervision for guiding the model to infer the correct content. Distinct from the previous methods, in this paper we develop a powerful Lip2Speech method that can reconstruct speech with the correct content from the input lip movements, even in a wild environment. To this end, we design multi-task learning that guides the model using multimodal supervision, i.e., text and audio, to complement the insufficient word representations of the acoustic feature reconstruction loss. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content for multiple speakers with unconstrained sentences. We verify the effectiveness of the proposed method using the LRS2, LRS3, and LRW datasets.
... In addition to the music audio, another specific generation task seeks to synthesize speech audio from videos of human speaking [63], [64], [102], [139], [142], [157], [171], [209], [240]. One unique aspect about this audio generation task is that the speech largely relies on the movement of lips while speaking. ...
Preprint
Full-text available
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, the multimodal machine learning that incorporates data from various modalities has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the nature of different data modalities. We analyze the commonness and uniqueness of each data format ranging from vision, audio, text and others, and then present the technical development categorized by the combination of Vision+X, where the vision data play a fundamental role in most multimodal learning works. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g., the semantic consistency between image objects and textual descriptions, or the rhythm correspondence between video dance moves and musical beats. The exploitation of the alignment, as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address and solve a specific challenge related to the concrete multimodal task, and to prompt a unified multimodal machine learning framework closer to a real human intelligence system.
... Researchers have also explored various generative approaches [46], [47]; for example, Vougioukas et al. [48] use a 1D GAN model to synthesize the raw waveform from a given video. Similarly, Michelsanti et al. [49] use a deep auto-encoder architecture in which an encoder with a recurrent module maps video frames of the speaker to vocoder features. Five decoders perform multi-task learning, and their outputs are passed to a vocoder to reconstruct speech. ...
Preprint
Full-text available
Understanding lip movement and inferring speech from it is notoriously difficult for the average person. Accurate lip-reading benefits from various cues of the speaker and their contextual or environmental setting. Every speaker has a different accent and speaking style, which can be inferred from their visual and speech features. This work aims to understand the correlation/mapping between speech and the sequence of lip movements of individual speakers over an unconstrained, large vocabulary. We model the frame sequence as a prior to a transformer in an auto-encoder setting and learn a joint embedding that exploits temporal properties of both audio and video. We learn temporal synchronization using deep metric learning, which guides the decoder to generate speech in sync with the input lip movements. The predictive posterior thus gives us the generated speech in the speaker's speaking style. We train our model on the GRID and Lip2Wav Chemistry lecture datasets to evaluate single-speaker natural speech generation from lip movements in an unconstrained natural setting. Extensive evaluation using various qualitative and quantitative metrics, together with human evaluation, shows that our method outperforms the state of the art on the Lip2Wav Chemistry dataset (large vocabulary in an unconstrained setting) by a good margin across almost all evaluation metrics and marginally outperforms the state of the art on the GRID dataset.
... Afterwards, Prajwal et al. [36] improve the model performance with 3D CNN and skip connections. Recently, Michelsanti et al. [37] have presented a multi-task architecture to learn spectral envelope, aperiodic parameters and fundamental frequency separately, which are then fed into a vocoder for waveform synthesis. They integrate a connectionist temporal classification (CTC) [38] loss to jointly perform lip reading, which is capable of further enhancing and constraining the video encoder. ...
Preprint
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2 which consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is firstly pre-trained on ~2400 h of multi-lingual (e.g. English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID, TCD-TIMIT) for English speech reconstruction and achieve a significant improvement on speech quality and intelligibility compared to previous approaches in speaker-dependent and -independent settings. In addition to English, we conduct Chinese speech reconstruction on the CMLR dataset to verify the impact on transferability. Lastly, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve state-of-the-art performance on both English and Chinese benchmark datasets. Paper available from arXiv: https://arxiv.org/abs/2112.04748
... The theoretical background is provided by articulatory-to-acoustic mapping (AAM), where articulatory data is recorded while the subject is speaking, and machine learning methods (typically deep neural networks (DNNs)) are applied to predict the speech signal from the articulatory input. The set of articulatory acquisition devices includes ultrasound tongue imaging (UTI) [4,5,6,7,8], Magnetic Resonance Imaging (MRI) [9], electromagnetic articulography (EMA) [10,11,12], permanent magnetic articulography (PMA) [13,14,15], surface electromyography (sEMG) [16,17,18], electro-optical stomatography (EOS) [19], lip videos [20,21], or a multimodal combination of the above [22]. ...
... At all these production levels, biosignals can be captured and studied to draw conclusions about linguistic and paralinguistic information of spoken communication [1]. Many researchers have taken advantage of biosignals, proposing systems to generate speech features from Electrocorticography (ECoG) [2] [3] [4], Electroencephalography (EEG) [5] [6], Electromyography (EMG) [7] [8] [9], ultrasound [10] [11], and video recordings of speech articulation [12] [13]. ...
... The theoretical background is provided by articulatory-to-acoustic mapping (AAM), where articulatory data is recorded while the subject is speaking, and machine learning methods (typically deep neural networks (DNNs)) are applied to predict the speech signal from the articulatory input. The set of articulatory acquisition devices includes ultrasound tongue imaging (UTI) [4,5,6,7,8], Magnetic Resonance Imaging (MRI) [9], electromagnetic articulography (EMA) [10,11,12], permanent magnetic articulography (PMA) [13,14,15], surface electromyography (sEMG) [16,17,18], electro-optical stomatography (EOS) [19], lip videos [20,21], or a multimodal combination of the above [22]. ...
Preprint
Full-text available
Articulatory-to-acoustic mapping seeks to reconstruct speech from a recording of the articulatory movements, for example, an ultrasound video. Just like speech signals, these recordings represent not only the linguistic content, but are also highly specific to the actual speaker. Hence, due to the lack of multi-speaker data sets, researchers have so far concentrated on speaker-dependent modeling. Here, we present multi-speaker experiments using the recently published TaL80 corpus. To model speaker characteristics, we adjusted the x-vector framework popular in speech processing to operate with ultrasound tongue videos. Next, we performed speaker recognition experiments using 50 speakers from the corpus. Then, we created speaker embedding vectors and evaluated them on the remaining speakers. Finally, we examined how the embedding vector influences the accuracy of our ultrasound-to-speech conversion network in a multi-speaker scenario. In the experiments we attained speaker recognition error rates below 3%, and we also found that the embedding vectors generalize nicely to unseen speakers. Our first attempt to apply them in a multi-speaker silent speech framework brought about a marginal reduction in the error rate of the spectral estimation step.
Article
We are perceiving and communicating with the world in a multisensory manner, where different information sources are sophisticatedly processed and interpreted by separate parts of the human brain to constitute a complex, yet harmonious and unified sensing system. To endow the machines with true intelligence, multimodal machine learning that incorporates data from various sources has become an increasingly popular research area with emerging technical advances in recent years. In this paper, we present a survey on multimodal machine learning from a novel perspective considering not only the purely technical aspects but also the intrinsic nature of different data modalities. We analyze the commonness and uniqueness of each data format mainly ranging from vision, audio, text, and motions, and then present the methodological advancements categorized by the combination of data modalities, such as Vision+Text , with slightly inclined emphasis on the visual data. We investigate the existing literature on multimodal learning from both the representation learning and downstream application levels, and provide an additional comparison in the light of their technical connections with the data nature, e.g. , the semantic consistency between image objects and textual descriptions, and the rhythm correspondence between video dance moves and musical beats. We hope that the exploitation of the alignment as well as the existing gap between the intrinsic nature of data modality and the technical designs, will benefit future research studies to better address a specific challenge related to the concrete multimodal task, prompting a unified multimodal machine learning framework closer to a real human intelligence system.
Chapter
The goal of this work is to reconstruct speech from a silent talking face video. Recent studies have shown impressive performance on synthesizing speech from silent talking face videos. However, they have not explicitly considered the varying identity characteristics of different speakers, which pose a challenge in video-to-speech synthesis, and this becomes more critical in unseen-speaker settings. Our approach is to separate the speech content and the visage-style from a given silent talking face video. By guiding the model to independently focus on modeling the two representations, we can obtain speech of high intelligibility from the model even when the input video of an unseen subject is given. To this end, we introduce speech-visage selection that separates the speech content and the speaker identity from the visual features of the input video. The disentangled representations are jointly incorporated to synthesize speech through a visage-style based synthesizer which generates speech by coating the visage-styles while maintaining the speech content. Thus, the proposed framework brings the advantage of synthesizing speech containing the right content even with the silent talking face video of an unseen subject. We validate the effectiveness of the proposed framework on the GRID, TCD-TIMIT volunteer, and LRW datasets. Keywords: Video-to-speech synthesis; speech-visage selection
Article
Full-text available
The aim of this work is to investigate the impact of crossmodal self-supervised pre-training for speech reconstruction (video-to-audio) by leveraging the natural co-occurrence of audio and visual streams in videos. We propose LipSound2 that consists of an encoder-decoder architecture and location-aware attention mechanism to map face image sequences to mel-scale spectrograms directly without requiring any human annotations. The proposed LipSound2 model is first pre-trained on ∼ 2400-h multilingual (e.g., English and German) audio-visual data (VoxCeleb2). To verify the generalizability of the proposed method, we then fine-tune the pre-trained model on domain-specific datasets (GRID and TCD-TIMIT) for English speech reconstruction and achieve a significant improvement on speech quality and intelligibility compared to previous approaches in speaker-dependent and speaker-independent settings. In addition to English, we conduct Chinese speech reconstruction on the Chinese Mandarin Lip Reading (CMLR) dataset to verify the impact on transferability. Finally, we train the cascaded lip reading (video-to-text) system by fine-tuning the generated audios on a pre-trained speech recognition system and achieve the state-of-the-art performance on both English and Chinese benchmark datasets.
Article
Video-to-speech is the process of reconstructing the audio speech from a video of a spoken utterance. Previous approaches to this task have relied on a two-step process where an intermediate representation is inferred from the video and is then decoded into waveform audio using a vocoder or a waveform reconstruction algorithm. In this work, we propose a new end-to-end video-to-speech model based on generative adversarial networks (GANs) which translates spoken video to waveform end-to-end without using any intermediate representation or separate waveform synthesis algorithm. Our model consists of an encoder-decoder architecture that receives raw video as input and generates speech, which is then fed to a waveform critic and a power critic. The use of an adversarial loss based on these two critics enables the direct synthesis of the raw audio waveform and ensures its realism. In addition, the use of our three comparative losses helps establish direct correspondence between the generated audio and the input video. We show that this model is able to reconstruct speech with remarkable realism for constrained datasets such as GRID, and that it is the first end-to-end model to produce intelligible speech for Lip Reading in the Wild (LRW), featuring hundreds of speakers recorded entirely "in the wild." We evaluate the generated samples in two different scenarios (seen and unseen speakers) using four objective metrics which measure the quality and intelligibility of artificial speech. We demonstrate that the proposed approach outperforms all previous works in most metrics on GRID and LRW.
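As a rough illustration of training with two critics (a waveform critic and a power/spectrogram critic), the sketch below uses least-squares adversarial losses; the exact losses and critic architectures in the work above may differ, and every name here is hypothetical.

```python
# Hedged sketch of an adversarial objective with two critics; the LS-GAN-style
# losses are an assumption, not necessarily what the cited model uses.
import torch
import torch.nn.functional as F

def critic_loss(critic, real, fake):
    # Train one critic to score real samples as 1 and generated samples as 0.
    d_real, d_fake = critic(real), critic(fake.detach())
    return (F.mse_loss(d_real, torch.ones_like(d_real)) +
            F.mse_loss(d_fake, torch.zeros_like(d_fake)))

def generator_adv_loss(wave_critic, power_critic, fake_wave, fake_spec):
    # The generator tries to make both critics score its outputs as real.
    d_w, d_p = wave_critic(fake_wave), power_critic(fake_spec)
    return (F.mse_loss(d_w, torch.ones_like(d_w)) +
            F.mse_loss(d_p, torch.ones_like(d_p)))
```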
Article
Full-text available
Automatic Speaker Verification (ASV) systems are vulnerable to a variety of voice spoofing attacks, e.g., replays, speech synthesis, etc. Imposters/fraudsters often use different voice spoofing attacks to fool ASV systems in order to achieve certain objectives, e.g., bypassing the security of someone's home or stealing money from a bank account. To counter such fraudulent activities on ASV systems, we propose a robust voice spoofing detection system capable of effectively detecting multiple types of spoofing attacks. For this purpose, we propose a novel feature descriptor, Center Lop-Sided Local Binary Patterns (CLS-LBP), for audio representation. CLS-LBP effectively analyzes the audio bidirectionally to better capture the artifacts of synthetic speech, the microphone distortions of replays, and the dynamic speech attributes of the bonafide signal. The proposed CLS-LBP features are used to train a long short-term memory (LSTM) network for detection of both physical-access (replay) and logical-access attacks (speech synthesis, voice conversion). We employ the LSTM due to its effectiveness in processing and learning the internal representation of sequential data. More specifically, we obtained an equal error rate (EER) of 0.06% on logical-access (LA) attacks and 0.58% on physical-access (PA) attacks. Additionally, the proposed system is capable of detecting unseen voice spoofing attacks and is also robust enough to classify among the cloning algorithms used to synthesize the speech. Performance evaluation on the ASVspoof 2019 corpus demonstrates the effectiveness of the proposed system in detecting physical- and logical-access attacks over existing state-of-the-art voice spoofing detection systems.
Article
Speechreading, which infers the spoken message from visually observed articulatory facial movements, is a challenging task. In this paper, we propose an end-to-end ResNet (E2E-ResNet) model for synthesizing speech signals from silent video of a speaking individual. The model is a convolutional encoder-decoder framework that captures the video frames and encodes them into a latent space of visual features. The outputs of the decoder are spectrograms, which are converted into waveforms corresponding to the speech articulated in the input video. The speech waveforms are then fed to a waveform critic used to distinguish real from synthesized speech. The experiments show that the proposed E2E-V2SResNet model is able to synthesize speech with realism and intelligibility/quality for the GRID database. To further demonstrate the potential of the proposed model, we also conduct experiments on the TCD-TIMIT database. We examine the synthesized speech for unseen speakers using three objective metrics that measure the intelligibility, quality, and word error rate (WER) of the synthesized speech. We show that the E2E-V2SResNet model outperforms the competing approaches on most metrics on the GRID and TCD-TIMIT databases. Compared with the baseline, the proposed model achieves a 3.077% improvement in speech quality and a 2.593% improvement in speech intelligibility.
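For the spectrogram-to-waveform step mentioned above, one common choice (an assumption here, not necessarily the cited authors' method) is Griffin-Lim, e.g. via librosa; hop and window sizes below are placeholders.

```python
# Minimal sketch: converting a predicted magnitude spectrogram to a waveform
# with Griffin-Lim (an assumed choice; parameter values are placeholders).
import librosa

def spec_to_wave(mag_spec, hop_length=200, win_length=800, n_iter=60):
    # mag_spec: (1 + n_fft // 2, frames) linear-magnitude spectrogram
    return librosa.griffinlim(mag_spec, n_iter=n_iter,
                              hop_length=hop_length, win_length=win_length)
```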
Article
The goal of this work is to reconstruct speech from silent video, in both speaker dependent and speaker independent ways. Unlike previous works that have been mostly restricted to a speaker dependent setting, we propose Visual Voice memory to restore essential auditory information to generate proper speech from different speakers and even unseen speakers. The proposed memory takes additional auditory information that corresponds to the input face movements and stores the auditory contexts that can be recalled by the given input visual features. Specifically, the Visual Voice memory contains value and key memory slots, where value memory slots are for saving the audio features, and key memory slots are for storing the visual features in the same location of the saved audio features. Guiding each memory to properly save each feature, the model can adequately produce the speech through auxiliary information of audio. Hence, our method employs both video and audio information during training time, but does not require any additional auditory input in the inference time. Our key contributions are: (1) proposing the Visual Voice memory that brings rich information of audio that complements the visual features, thus producing high-quality speech from silent video, and (2) enabling multi-speaker and speaker independent training by memorizing auditory features and the corresponding visual features. We validate the proposed framework on GRID and Lip2Wav datasets and show that our method surpasses the performance of previous works. Moreover, we experiment on both multi-speaker and speaker independent settings and verify the effectiveness of the Visual Voice memory. We also demonstrate that the Visual Voice memory contains meaningful information to reconstruct speech.
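The key-value memory described above can be pictured as a small attention-style read operation: visual features query the key slots, and the resulting weights read out the stored audio (value) slots. The sketch below is an assumption about the general mechanism, not the paper's actual code, and all names are illustrative.

```python
# Hedged sketch of key-value memory addressing: visual queries attend over
# key slots and read out the audio (value) slots stored at the same locations.
import torch
import torch.nn.functional as F

def read_memory(visual_feat, key_mem, value_mem):
    # visual_feat: (B, D), key_mem: (S, D), value_mem: (S, D_audio)
    scores = visual_feat @ key_mem.t() / key_mem.size(-1) ** 0.5   # (B, S)
    attn = F.softmax(scores, dim=-1)                               # addressing weights
    return attn @ value_mem                                        # (B, D_audio) recalled audio features
```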
Chapter
Articulatory-to-acoustic (forward) mapping is a technique to predict speech using various articulatory acquisition techniques as input (e.g. ultrasound tongue imaging, MRI, lip video). The advantage of lip video is that it is easily available and affordable: most modern smartphones have a front camera. There are already a few solutions for lip-to-speech synthesis, but they mostly concentrate on offline training and inference. In this paper, we propose a system built from a backend for deep neural network training and inference and a frontend in the form of a mobile application. Our initial evaluation shows that the scenario is feasible: a top-5 classification accuracy of 74%, combined with feedback from the mobile application user, suggests that speech-impaired users might be able to communicate with this solution.
Article
Speech enhancement and speech separation are two related tasks, whose purpose is to extract either one or more target speech signals, respectively, from a mixture of sounds generated by several sources. Traditionally, these tasks have been tackled using signal processing and machine learning techniques applied to the available acoustic signals. Since the visual aspect of speech is essentially unaffected by the acoustic environment, visual information from the target speakers, such as lip movements and facial expressions, has also been used for speech enhancement and speech separation systems. In order to efficiently fuse acoustic and visual information, researchers have exploited the flexibility of data-driven approaches, specifically deep learning, achieving strong performance. The ceaseless proposal of a large number of techniques to extract features and fuse multimodal information has highlighted the need for an overview that comprehensively describes and discusses audio-visual speech enhancement and separation based on deep learning. In this paper, we provide a systematic survey of this research topic, focusing on the main elements that characterise the systems in the literature: acoustic features; visual features; deep learning methods; fusion techniques; training targets and objective functions. In addition, we review deep-learning-based methods for speech reconstruction from silent videos and audio-visual sound source separation for non-speech signals, since these methods can be more or less directly applied to audio-visual speech enhancement and separation. Finally, we survey commonly employed audio-visual speech datasets, given their central role in the development of data-driven approaches, and evaluation methods, because they are generally used to compare different systems and determine their performance.
Conference Paper
Full-text available
Speech Reconstruction is the task of recreation of speech using silent videos as input. In the literature, it is also referred to as lipreading. In this paper, we design an encoder-decoder architecture which takes silent videos as input and outputs an audio spectrogram of the reconstructed speech. The model, despite being a speaker-independent model, achieves comparable results on speech reconstruction to the current state-of-the-art speaker-dependent model. We also perform user studies to infer speech intelligibility. Additionally, we test the usability of the trained model using bilingual speech.
Article
Full-text available
Lipreading has a lot of potential applications such as in the domain of surveillance and video conferencing. Despite this, most of the work in building lipreading systems has been limited to classifying silent videos into classes representing text phrases. However, there are multiple problems associated with making lipreading a text-based classification task like its dependence on a particular language and vocabulary mapping. Thus, in this paper we propose a multi-view lipreading to audio system, namely Lipper, which models it as a regression task. The model takes silent videos as input and produces speech as the output. With multi-view silent videos, we observe an improvement over single-view speech reconstruction results. We show this by presenting an exhaustive set of experiments for speaker-dependent, out-of-vocabulary and speaker-independent settings. Further, we compare the delay values of Lipper with other speechreading systems in order to show the real-time nature of audio produced. We also perform a user study for the audios produced in order to understand the level of comprehensibility of audios produced using Lipper.
Article
Full-text available
We recently presented a new model for singing synthesis based on a modified version of the WaveNet architecture. Instead of modeling raw waveform, we model features produced by a parametric vocoder that separates the influence of pitch and timbre. This allows conveniently modifying pitch to match any target melody, facilitates training on more modest dataset sizes, and significantly reduces training and generation times. Nonetheless, compared to modeling waveform directly, ways of effectively handling higher-dimensional outputs, multiple feature streams and regularization become more important with our approach. In this work, we extend our proposed system to include additional components for predicting F0 and phonetic timings from a musical score with lyrics. These expression-related features are learned together with timbrical features from a single set of natural songs. We compare our method to existing statistical parametric, concatenative, and neural network-based approaches using quantitative metrics as well as listening tests.
Article
Full-text available
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
Conference Paper
Full-text available
We propose an end-to-end deep learning architecture for word level visual speech recognition. The system is a combination of spatiotemporal convolutional, residual and bidirectional Long Short-Term Memory networks. We trained and evaluated it on the Lipreading In-The-Wild benchmark, a challenging database of 500-size vocabulary consisting of video excerpts from BBC TV broadcasts. The proposed network attains word accuracy equal to 83.0%, yielding 6.8% absolute improvement over the current state-of-the-art.
Conference Paper
Full-text available
This paper investigates how far a very deep neural network is from attaining close to saturating performance on existing 2D and 3D face alignment datasets. To this end, we make the following five contributions: (a) we construct, for the first time, a very strong baseline by combining a state-of-the-art architecture for landmark localization with a state-of-the-art residual block, train it on a very large yet synthetically expanded 2D facial landmark dataset and finally evaluate it on all other 2D facial landmark datasets. (b) We create a guided by 2D landmarks network which converts 2D landmark annotations to 3D and unifies all existing datasets, leading to the creation of LS3D-W, the largest and most challenging 3D facial landmark dataset to date (~230,000 images). (c) Following that, we train a neural network for 3D face alignment and evaluate it on the newly introduced LS3D-W. (d) We further look into the effect of all "traditional" factors affecting face alignment performance like large pose, initialization and resolution, and introduce a "new" one, namely the size of the network. (e) We show that both 2D and 3D face alignment networks achieve performance of remarkable accuracy which is probably close to saturating the datasets used. Demo code and pre-trained models can be downloaded from http://www.cs.nott.ac.uk/~psxab5/face-alignment/
Article
Full-text available
Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). All existing works, however, perform only word classification, not sentence-level sequence prediction. Studies have shown that human lipreading performance increases for longer words (Easton & Basala, 1982), indicating the importance of features capturing temporal context in an ambiguous communication channel. Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, an LSTM recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first lipreading model to operate at sentence-level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model. On the GRID corpus, LipNet achieves 93.4% accuracy, outperforming experienced human lipreaders and the previous 79.6% state-of-the-art accuracy.
Article
Full-text available
An algorithm is proposed for estimating the band aperiodicity of speech signals, where “aperiodicity” is defined as the power ratio between the speech signal and the aperiodic component of the signal. Since this power ratio depends on the frequency band, the aperiodicity should be given for several frequency bands. The proposed D4C (Definitive Decomposition Derived Dirt-Cheap) estimator is based on an extension of a temporally static group delay representation of periodic signals. In this paper, the principle and algorithm of D4C are explained, and its effectiveness is discussed with reference to objective and subjective evaluations. Evaluation results indicate that a speech synthesis system using D4C can synthesize natural speech better than ones using other algorithms.
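For reference, D4C band aperiodicity estimation is exposed in the open-source pyworld bindings; the snippet below assumes that package (and a float64 mono waveform) and is only a usage sketch, not part of the cited work.

```python
# Usage sketch (assumes the pyworld package): D4C band aperiodicity estimation.
import numpy as np
import pyworld as pw

def band_aperiodicity(x, fs):
    x = x.astype(np.float64)
    f0, t = pw.dio(x, fs)              # coarse F0 contour and frame times
    f0 = pw.stonemask(x, f0, t, fs)    # F0 refinement
    return pw.d4c(x, f0, t, fs)        # (frames, fft_bins) aperiodicity estimate
```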
Article
Full-text available
Intelligibility listening tests are necessary during development and evaluation of speech processing algorithms, despite the fact that they are expensive and time consuming. In this paper, we propose a monaural intelligibility prediction algorithm, which has the potential of replacing some of these listening tests. The proposed algorithm shows similarities to the short-time objective intelligibility (STOI) algorithm, but works for a larger range of input signals. In contrast to STOI, extended STOI (ESTOI) does not assume mutual independence between frequency bands. ESTOI also incorporates spectral correlation by comparing complete 400-ms length spectrograms of the noisy/processed speech and the clean speech signals. As a consequence, ESTOI is also able to accurately predict the intelligibility of speech contaminated by temporally highly modulated noise sources in addition to noisy signals processed with time-frequency weighting. We show that ESTOI can be interpreted in terms of an orthogonal decomposition of short-time spectrograms into intelligibility subspaces, i.e., a ranking of spectrogram features according to their importance to intelligibility. A free MATLAB implementation of the algorithm is available for noncommercial use at http://kom.aau.dk/∼jje/.
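Besides the official MATLAB code, ESTOI is also available in third-party Python ports; assuming the pystoi package, a score can be computed as follows (a usage sketch about tooling, not part of the cited work).

```python
# Usage sketch (assumes the third-party pystoi package is installed).
from pystoi import stoi

def estoi_score(clean, processed, fs):
    # clean, processed: 1-D numpy arrays sampled at fs Hz
    return stoi(clean, processed, fs, extended=True)  # extended=True -> ESTOI
```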
Article
Full-text available
A vocoder-based speech synthesis system, named WORLD, was developed in an effort to improve the sound quality of real-time applications using speech. Speech analysis, manipulation, and synthesis on the basis of vocoders are used in various kinds of speech research. Although several high-quality speech synthesis systems have been developed, real-time processing has been difficult with them because of their high computational costs. The new speech synthesis system offers not only high sound quality but also fast processing. It consists of three analysis algorithms and one synthesis algorithm proposed in our previous research. The effectiveness of the system was evaluated by comparing its output against natural speech including consonants. Its processing speed was also compared with those of conventional systems. The results showed that WORLD was superior to the other systems in terms of both sound quality and processing speed. In particular, it was over ten times faster than the conventional systems, and the real time factor (RTF) indicated that it was fast enough for real-time processing.
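A WORLD analysis/re-synthesis round trip can be sketched with the pyworld bindings; the package choice is an assumption about tooling, since the paper describes the algorithms themselves rather than a Python API.

```python
# Usage sketch (assumes the pyworld package): WORLD copy-synthesis.
import numpy as np
import pyworld as pw

def world_copy_synthesis(x, fs):
    x = x.astype(np.float64)
    f0, sp, ap = pw.wav2world(x, fs)      # F0, spectral envelope, aperiodicity
    return pw.synthesize(f0, sp, ap, fs)  # re-synthesized waveform
```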
Article
Full-text available
In this paper, we propose a novel neural network model called RNN Encoder--Decoder that consists of two recurrent neural networks (RNN). One RNN encodes a sequence of symbols into a fixed-length vector representation, and the other decodes the representation into another sequence of symbols. The encoder and decoder of the proposed model are jointly trained to maximize the conditional probability of a target sequence given a source sequence. The performance of a statistical machine translation system is empirically found to improve by using the conditional probabilities of phrase pairs computed by the RNN Encoder--Decoder as an additional feature in the existing log-linear model. Qualitatively, we show that the proposed model learns a semantically and syntactically meaningful representation of linguistic phrases.
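A minimal PyTorch sketch of the encoder-decoder idea (one RNN summarizes the source into a fixed-length vector, another decodes from it) is shown below; dimensions and teacher forcing are simplified and purely illustrative, not the cited model.

```python
# Minimal GRU encoder-decoder sketch (illustrative only).
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, in_dim, hid_dim, out_dim):
        super().__init__()
        self.encoder = nn.GRU(in_dim, hid_dim, batch_first=True)
        self.decoder = nn.GRU(out_dim, hid_dim, batch_first=True)
        self.proj = nn.Linear(hid_dim, out_dim)

    def forward(self, src, tgt):
        _, h = self.encoder(src)           # fixed-length summary of the source sequence
        dec_out, _ = self.decoder(tgt, h)  # decode conditioned on that summary
        return self.proj(dec_out)
```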
Conference Paper
Full-text available
Many real-world sequence learning tasks require the prediction of sequences of labels from noisy, unsegmented input data. In speech recognition, for example, an acoustic signal is transcribed into words or sub-word units. Recurrent neural networks (RNNs) are powerful sequence learners that would seem well suited to such tasks. However, because they require pre-segmented training data, and post-processing to transform their outputs into label sequences, their applicability has so far been limited. This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems. An experiment on the TIMIT speech corpus demonstrates its advantages over both a baseline HMM and a hybrid HMM-RNN.
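In modern toolkits the CTC criterion is available off the shelf; the snippet below shows the PyTorch torch.nn.CTCLoss interface with toy tensors (shapes follow PyTorch's (T, N, C) convention, with label index 0 reserved for the blank).

```python
# CTC loss usage with toy data via torch.nn.CTCLoss.
import torch
import torch.nn as nn

T, N, C, S = 50, 4, 28, 12                         # frames, batch, classes, target length
log_probs = torch.randn(T, N, C).log_softmax(-1)   # per-frame log-probabilities
targets = torch.randint(1, C, (N, S))              # label index 0 is the blank
input_lengths = torch.full((N,), T, dtype=torch.long)
target_lengths = torch.full((N,), S, dtype=torch.long)

loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
```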
Article
Full-text available
A sawtooth waveform inspired pitch estimator (SWIPE) has been developed for speech and music. SWIPE estimates the pitch as the fundamental frequency of the sawtooth waveform whose spectrum best matches the spectrum of the input signal. The comparison of the spectra is done by computing a normalized inner product between the spectrum of the signal and a modified cosine. The size of the analysis window is chosen appropriately to make the width of the main lobes of the spectrum match the width of the positive lobes of the cosine. SWIPE′, a variation of SWIPE, utilizes only the first and prime harmonics of the signal, which significantly reduces subharmonic errors commonly found in other pitch estimation algorithms. The authors' tests indicate that SWIPE and SWIPE′ performed better on two spoken speech and one disordered voice database and one musical instrument database consisting of single notes performed at a variety of pitches.
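An implementation of SWIPE′ is exposed in the pysptk bindings; assuming that package, F0 can be extracted as below (a usage sketch with placeholder search-range parameters, not part of the cited work).

```python
# Usage sketch (assumes the pysptk package): SWIPE' F0 extraction.
import numpy as np
import pysptk

def swipe_f0(x, fs, hop=80):
    # min/max (Hz) bound the pitch search range; the values here are placeholders.
    return pysptk.swipe(x.astype(np.float64), fs=fs, hopsize=hop,
                        min=60, max=400, otype="f0")
```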
Article
Full-text available
A computational model of auditory analysis is described that is inspired by psychoacoustical and neurophysiological findings in early and central stages of the auditory system. The model provides a unified multiresolution representation of the spectral and temporal features likely critical in the perception of sound. Simplified, more specifically tailored versions of this model have already been validated by successful application in the assessment of speech intelligibility [Elhilali et al., Speech Commun. 41(2-3), 331-348 (2003); Chi et al., J. Acoust. Soc. Am. 106, 2719-2732 (1999)] and in explaining the perception of monaural phase sensitivity [R. Carlyon and S. Shamma, J. Acoust. Soc. Am. 114, 333-348 (2003)]. Here we provide a more complete mathematical formulation of the model, illustrating how complex signals are transformed through various stages of the model, and relating it to comparable existing models of auditory processing. Furthermore, we outline several reconstruction algorithms to resynthesize the sound from the model output so as to evaluate the fidelity of the representation and contribution of different features and cues to the sound percept.
Conference Paper
Speechreading or lipreading is the technique of understanding and extracting phonetic features from a speaker's visual cues such as the movement of the lips, face, teeth and tongue. It has a wide range of multimedia applications such as in surveillance, Internet telephony, and as an aid to people with hearing impairments. However, most of the work in speechreading has been limited to text generation from silent videos. Recently, research has started venturing into generating (audio) speech from silent video sequences, but there have been no developments thus far in dealing with divergent views and poses of a speaker. Thus, although multiple camera feeds of a user's speech may be available, they have not been exploited to deal with the different poses. To this end, this paper presents the world's first multi-view speechreading and reconstruction system. This work pushes the boundaries of multimedia research by putting forth a model which leverages silent video feeds from multiple cameras recording the same subject to generate intelligible speech for a speaker. Initial results confirm the usefulness of exploiting multiple camera views in building an efficient speechreading and reconstruction system. They further indicate the optimal placement of cameras that leads to the maximum intelligibility of the reconstructed speech. Next, the paper lays out various innovative applications for the proposed system, focusing on its potentially prodigious impact not just in the security arena but in many other multimedia analytics problems.
Article
This work is concerned with generating intelligible audio speech from a video of a person talking. Regression and classification methods are proposed first to estimate static spectral envelope features from active appearance model (AAM) visual features. Two further methods are then developed to incorporate temporal information into the prediction - a feature-level method using multiple frames and a model-level method based on recurrent neural networks. Speech excitation information is not available from the visual signal, so methods to artificially generate aperiodicity and fundamental frequency are developed. These are combined within the STRAIGHT vocoder to produce a speech signal. The various systems are optimised through objective tests before applying subjective intelligibility tests that determine a word accuracy of 85% from a set of human listeners on the GRID audio-visual speech database. This compares favourably with a previous regression-based system that serves as a baseline which achieved a word accuracy of 33%.
Article
Deep neural nets with a large number of parameters are very powerful machine learning systems. However, overfitting is a serious problem in such networks. Large networks are also slow to use, making it difficult to deal with overfitting by combining the predictions of many different large neural nets at test time. Dropout is a technique for addressing this problem. The key idea is to randomly drop units (along with their connections) from the neural network during training. This prevents units from co-adapting too much. During training, dropout samples from an exponential number of different "thinned" networks. At test time, it is easy to approximate the effect of averaging the predictions of all these thinned networks by simply using a single unthinned network that has smaller weights. This significantly reduces overfitting and gives major improvements over other regularization methods. We show that dropout improves the performance of neural networks on supervised learning tasks in vision, speech recognition, document classification and computational biology, obtaining state-of-the-art results on many benchmark data sets.
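In practice dropout is a one-line layer; the toy snippet below shows units being dropped in training mode and the single rescaled ("unthinned") network being used in evaluation mode.

```python
# Dropout in PyTorch: active in train mode, disabled (rescaled) in eval mode.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5))
layer.train()
y_train = layer(torch.randn(8, 256))   # random units are zeroed each forward pass
layer.eval()
y_test = layer(torch.randn(8, 256))    # single unthinned network at test time
```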
Article
When sound hits an object, it causes small vibrations of the object's surface. We show how, using only high-speed video of the object, we can extract those minute vibrations and partially recover the sound that produced them, allowing us to turn everyday objects---a glass of water, a potted plant, a box of tissues, or a bag of chips---into visual microphones. We recover sounds from high-speed footage of a variety of objects with different properties, and use both real and simulated data to examine some of the factors that affect our ability to visually recover sound. We evaluate the quality of recovered sounds using intelligibility and SNR metrics and provide input and recovered audio samples for direct comparison. We also explore how to leverage the rolling shutter in regular consumer cameras to recover audio from standard frame-rate videos, and use the spatial resolution of our method to visualize how sound-related vibrations vary over an object's surface, which we can use to recover the vibration modes of an object.
Article
Training Deep Neural Networks is complicated by the fact that the distribution of each layer's inputs changes during training, as the parameters of the previous layers change. This slows down the training by requiring lower learning rates and careful parameter initialization, and makes it notoriously hard to train models with saturating nonlinearities. We refer to this phenomenon as internal covariate shift, and address the problem by normalizing layer inputs. Our method draws its strength from making normalization a part of the model architecture and performing the normalization for each training mini-batch. Batch Normalization allows us to use much higher learning rates and be less careful about initialization. It also acts as a regularizer, in some cases eliminating the need for Dropout. Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin. Using an ensemble of batch-normalized networks, we improve upon the best published result on ImageNet classification: reaching 4.9% top-5 validation error (and 4.8% test error), exceeding the accuracy of human raters.
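As a usage illustration, batch normalization is applied per layer and computes its statistics over each mini-batch:

```python
# Batch normalization in PyTorch: inputs normalized per mini-batch, then
# rescaled and shifted by learned parameters.
import torch
import torch.nn as nn

block = nn.Sequential(nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU())
out = block(torch.randn(32, 128))   # statistics are computed over the batch of 32
```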
Article
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions. The method is straightforward to implement and is based on adaptive estimates of lower-order moments of the gradients. The method is computationally efficient, has little memory requirements, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The method exhibits invariance to diagonal rescaling of the gradients by adapting to the geometry of the objective function. The hyper-parameters have intuitive interpretations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical convergence properties of the algorithm and provide a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework. We demonstrate that Adam works well in practice when experimentally compared to other stochastic optimization methods.
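A typical single optimization step with Adam (using the commonly recommended moment-decay rates beta1=0.9 and beta2=0.999) looks as follows on a toy model:

```python
# One Adam update step in PyTorch.
import torch

model = torch.nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.9, 0.999))

loss = model(torch.randn(4, 10)).pow(2).mean()
loss.backward()    # gradients feed the first/second moment estimates
opt.step()         # bias-corrected moments produce the parameter update
opt.zero_grad()
```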
Article
A set of simple new procedures has been developed to enable the real-time manipulation of speech parame- ters. The proposed method uses pitch-adaptive spec- tral analysis combined with a surface reconstruction method in the time-frequency region, and an excita- tion source design based on group delay manipulation. It also consists of a fundamental frequency (F0) ex- traction method using instantaneous frequency calcu- lation based on a new concept called 'fundamental- ness'. The proposed procedures preserve the details of time-frequency surfaces while almost perfectly remov- ing fine structures due to signal periodicity. This close- to-perfect elimination of interferences and smooth F0 trajectory allow for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, while maintaining high reproduction quality.
Article
An audio-visual corpus has been collected to support the use of common material in speech perception and automatic speech recognition studies. The corpus consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers. Sentences are simple, syntactically identical phrases such as "place green at B 4 now". Intelligibility tests using the audio signals suggest that the material is easily identifiable in quiet and low levels of stationary noise. The annotated corpus is available on the web for research use.
Conference Paper
Previous objective speech quality assessment models, such as bark spectral distortion (BSD), the perceptual speech quality measure (PSQM), and measuring normalizing blocks (MNB), have been found to be suitable for assessing only a limited range of distortions. A new model has therefore been developed for use across a wider range of network conditions, including analogue connections, codecs, packet loss and variable delay. Known as perceptual evaluation of speech quality (PESQ), it is the result of integration of the perceptual analysis measurement system (PAMS) and PSQM99, an enhanced version of PSQM. PESQ is expected to become a new ITU-T recommendation P.862, replacing P.861, which specified PSQM and MNB.
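For completeness, PESQ scores can be computed with the third-party `pesq` Python package (an assumption about tooling; the recommendation itself is ITU-T P.862); narrow-band mode corresponds to the original P.862.

```python
# Usage sketch (assumes the third-party `pesq` package is installed).
from pesq import pesq

def pesq_score(ref, deg, fs=16000):
    # ref, deg: 1-D numpy arrays at fs Hz; 'nb' = narrow-band P.862 mode
    return pesq(fs, ref, deg, 'nb')
```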
Char2wav: End-to-end speech synthesis
J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, and Y. Bengio, "Char2wav: End-to-end speech synthesis," in Proc. of ICLR Workshop, 2017.
Pytorch: An imperative style, high-performance deep learning library
A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, "Pytorch: An imperative style, high-performance deep learning library," in Advances in NeurIPS, 2019.