Block diagram of the Tacotron2-based reference baseline, which has three modules: an encoder, an attention-based decoder, and two alternative methods for waveform generation.

Source publication
Article
Full-text available
Neural end-to-end text-to-speech (TTS) is superior to conventional statistical methods in many ways. However, the exposure bias problem, which arises from the mismatch between the training and inference processes in autoregressive models, remains an issue. It often leads to performance degradation in the face of out-of-domain test data. To address this pr...
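To make the train/inference mismatch concrete, the minimal sketch below contrasts teacher forcing (training) with free running (inference) in an autoregressive decoder; the `decoder_step` callable and tensor shapes are illustrative placeholders, not the paper's implementation.

```python
import torch

def rollout(decoder_step, targets, teacher_forcing: bool):
    """Unroll an autoregressive decoder for targets.size(0) steps."""
    prev = torch.zeros_like(targets[0])          # initial <GO> frame
    outputs = []
    for t in range(targets.size(0)):
        frame = decoder_step(prev)               # predict the next frame
        outputs.append(frame)
        # Training feeds the ground-truth frame back in; inference must feed
        # the model's own prediction, so early errors compound at test time.
        prev = targets[t] if teacher_forcing else frame
    return torch.stack(outputs)
```

Scheduled sampling and distillation-based remedies, such as the one proposed in this paper, target exactly this gap between the two rollout modes.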

Context in source publication

Context 1
... encoder-decoder network learns to map an input sequence to an output sequence. The Tacotron2 TTS system is one of the successful encoder-decoder implementations, as illustrated in Fig. 1. It consists of an encoder, an attention-based decoder, and two alternatives for waveform generation, which are described next in ...
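A condensed PyTorch skeleton of the encoder / attention-based decoder structure in Fig. 1 is given below; layer sizes are illustrative assumptions, and standard dot-product attention stands in for Tacotron2's location-sensitive attention.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, n_symbols=148, dim=512):
        super().__init__()
        self.embedding = nn.Embedding(n_symbols, dim)
        self.conv = nn.Sequential(nn.Conv1d(dim, dim, 5, padding=2), nn.ReLU())
        self.lstm = nn.LSTM(dim, dim // 2, batch_first=True, bidirectional=True)

    def forward(self, text_ids):                      # (B, T_text) int64
        x = self.embedding(text_ids).transpose(1, 2)  # (B, dim, T_text)
        x = self.conv(x).transpose(1, 2)              # (B, T_text, dim)
        memory, _ = self.lstm(x)                      # bidirectional states
        return memory                                 # (B, T_text, dim)

class AttentionDecoder(nn.Module):
    def __init__(self, dim=512, n_mels=80):
        super().__init__()
        self.attention = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.rnn = nn.LSTMCell(dim + n_mels, dim)
        self.mel_proj = nn.Linear(dim, n_mels)

    def step(self, prev_mel, memory, state):
        h, c = state                                   # each (B, dim)
        # Attend over encoder states with the previous decoder hidden state.
        context, _ = self.attention(h.unsqueeze(1), memory, memory)
        h, c = self.rnn(torch.cat([context.squeeze(1), prev_mel], -1), (h, c))
        return self.mel_proj(h), (h, c)                # next mel frame, state
```

The predicted mel-spectrogram frames would then be handed to either of the two waveform-generation alternatives shown in the figure.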

Citations

... Finally, the introduction of a multi-teacher knowledge distillation (MT-KD) network for Tacotron2 Text-to-Speech (TTS) [5] models represents a significant advancement in addressing exposure bias, achieving improved naturalness, robustness, and expressiveness in TTS systems. However, while MT-KD shows promise in mitigating exposure bias, challenges may arise in terms of model scalability and generalization to diverse linguistic contexts and speaking styles. ...
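As a rough illustration of the multi-teacher idea, a distillation loss can average the student's distance to several teachers' mel predictions; the uniform weighting and L1 distance below are assumptions for the sketch, not MT-KD's published objective.

```python
import torch
import torch.nn.functional as F

def mt_kd_loss(student_mel, teacher_mels, weights=None):
    """Weighted L1 distance from the student's mels to each teacher's mels."""
    if weights is None:
        weights = [1.0 / len(teacher_mels)] * len(teacher_mels)
    loss = student_mel.new_zeros(())
    for w, t_mel in zip(weights, teacher_mels):
        # Teachers are frozen targets, so gradients do not flow into them.
        loss = loss + w * F.l1_loss(student_mel, t_mel.detach())
    return loss
```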
Article
In the realm of law enforcement, the accurate and timely filing of First Information Reports (FIRs) stands as a crucial step in the initiation of criminal investigations. However, this process is often fraught with challenges, ranging from the complexity of legal frameworks to the potential for inaccuracies in manual documentation. In recent years, the modernization of law enforcement practices has been propelled by advancements in technology, offering promising solutions to address these challenges. This survey paper aims to provide a comprehensive overview of the landscape of technological solutions designed to enhance FIR filing processes, with a specific focus on their application in law enforcement contexts.
... Unlike speech synthesis technology for a single utterance, which either predicts the speaking style from its linguistic content (Wang et al. 2017; Ren et al. 2021; Liu et al. 2021a, b, 2022a, 2024) or attempts to transfer style information from an additional reference speech (Wang et al. 2018; Li et al. 2022c; Huang et al. 2022), CSS methods ...
Article
Conversational Speech Synthesis (CSS) aims to accurately express an utterance with the appropriate prosody and emotional inflection within a conversational setting. While recognising the significance of the CSS task, prior studies have not thoroughly investigated emotional expressiveness, owing to the scarcity of emotional conversational datasets and the difficulty of stateful emotion modeling. In this paper, we propose a novel emotional CSS model, termed ECSS, that includes two main components: 1) to enhance emotion understanding, we introduce a heterogeneous graph-based emotional context modeling mechanism, which takes the multi-source dialogue history as input to model the dialogue context and learn emotion cues from the context; 2) to achieve emotion rendering, we employ a contrastive learning-based emotion renderer module to infer the accurate emotion style for the target utterance. To address the issue of data scarcity, we meticulously create emotional labels in terms of category and intensity, and annotate additional emotional information on the existing conversational dataset (DailyTalk). Both objective and subjective evaluations suggest that our model outperforms the baseline models in understanding and rendering emotions. These evaluations also underscore the importance of comprehensive emotional annotations. Code and audio samples can be found at: https://github.com/walker-hyf/ECSS.
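A schematic of the contrastive rendering idea, under the assumption that each emotion category has a learnable prototype embedding the utterance embedding is pulled toward; ECSS's actual renderer objective is more elaborate than this InfoNCE-style sketch.

```python
import torch
import torch.nn.functional as F

def emotion_contrastive_loss(utt_emb, prototypes, label, tau=0.1):
    """utt_emb: (D,); prototypes: (n_emotions, D); label: int class index."""
    # Similarity to every emotion prototype, temperature-scaled.
    sims = F.cosine_similarity(utt_emb.unsqueeze(0), prototypes) / tau
    # Treat the labelled emotion as the positive and all others as negatives.
    return F.cross_entropy(sims.unsqueeze(0), torch.tensor([label]))
```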
... Furthermore, several studies also highlight that although audiovisual media can increase student motivation and engagement (Nicolini et al., 2023; Nuraini, 2017; Soni, 2018), its impact on academic learning outcomes still needs further investigation. For example, research by Liu et al. (2022) in China showed that while student motivation increased, there was no significant improvement in cognitive learning outcomes. This suggests that although audio-visual media can be an effective tool for increasing student participation (Margiyan & Nuraini, 2022; Rihatno & Nuraini, 2019), there is a need for a more structured and evidence-based approach to ensure that this technology also improves overall learning outcomes. ...
Article
Full-text available
This research focuses on efforts to bridge the gap between physical education theory and teaching practice by implementing audio-visual media in the classroom. This study is a systematic literature review that evaluates the effectiveness of using audio-visual media in physical education teaching from 2020 to 2024. The theories bridged include visual learning theory and learning motivation theory, which are integrated into teaching practice through instructional videos, animations, and multimedia presentations. This research examined 21 studies involving secondary school students from various regions and countries, with a total of 1789 students participating. The study results show that using audio-visual media in physical education can significantly increase student motivation and involvement. However, implementing this media still faces various challenges, such as teachers' lack of understanding of technology, limited access to audio-visual devices, and inadequate training. To address this, the research recommends increasing specific teacher training on using audio-visual media, providing more comprehensive access to such technology, and supporting policies that encourage technology integration in physical education curricula. Thus, this research emphasizes integrating audio-visual media in physical education to bridge theory and practice and improve the quality of learning and student learning outcomes. This research also suggests concrete steps for implementation, such as ongoing training programs for teachers and regular evaluation of the effectiveness of using audio-visual media in the classroom.
... Wutiwiwatchai et al. [12] proposed an accent-level adjustment mechanism for bilingual TTS synthesis, where the accent level is adjusted by interpolating between HMMs of native phones and HMMs of the corresponding foreign phones. This method provides an effective fine-grained accent intensity control scheme, but it cannot be applied to current deep learning TTS models [13][14][15][16], such as Tacotron [9,17] and FastSpeech [18,19] based architectures. In a recent deep learning based multilingual TTS study [1], the authors employed domain adversarial training [20] to disentangle the accent identity from the speaker identity, where the accent level can be controlled by varying the domain adversarial weight [1]. ...
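The interpolation idea can be sketched as a per-state blend of Gaussian output parameters, with a scalar accent level alpha in [0, 1]; the flat per-state mean/variance layout below is an assumption for illustration, not the cited system's model format.

```python
import numpy as np

def interpolate_hmm(native, foreign, alpha):
    """Linearly interpolate per-state Gaussian parameters of two HMMs.

    native/foreign: dicts with 'means' and 'vars' arrays of shape
    (n_states, feat_dim). alpha=0 -> fully native, alpha=1 -> fully foreign.
    """
    return {
        "means": (1 - alpha) * native["means"] + alpha * foreign["means"],
        "vars":  (1 - alpha) * native["vars"]  + alpha * foreign["vars"],
    }
```

Here alpha plays the role of the accent level: intermediate values yield pronunciations between the native-phone and foreign-phone models.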
... With the rapid progress of Text-To-Speech (TTS) [3][4][5] synthesis and Voice Conversion (VC) techniques, one can easily impersonate another's voice. Audio Deepfake Detection (ADD) [6] is the task of determining whether given audio is authentic or counterfeited, and it has attracted increasing attention recently. ...
Preprint
Full-text available
Audio Deepfake Detection (ADD) aims to detect fake audio generated by text-to-speech (TTS), voice conversion (VC), replay, etc., and is an emerging topic. Traditionally, we take the mono signal as input and focus on robust feature extraction and effective classifier design. However, the dual-channel stereo information in the audio signal also carries important cues for deepfake detection, which has not been studied in prior work. In this paper, we propose a novel ADD model, termed M2S-ADD, that attempts to discover audio authenticity cues during the mono-to-stereo conversion process. We first project the mono signal to stereo using a pretrained stereo synthesizer, then employ a dual-branch neural architecture to process the left and right channel signals, respectively. In this way, we effectively reveal the artifacts in fake audio and thus improve ADD performance. Experiments on the ASVspoof2019 database show that M2S-ADD outperforms all baselines that take mono input. We release the source code at \url{https://github.com/AI-S2-Lab/M2S-ADD}.
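A schematic of the mono-to-stereo pipeline described above: synthesize stereo from mono, score each channel with its own branch, then fuse for the real/fake decision. The GRU branches, fusion head, and the `mono_to_stereo` callable are placeholders, not the released M2S-ADD code.

```python
import torch
import torch.nn as nn

class DualBranchADD(nn.Module):
    def __init__(self, mono_to_stereo, feat_dim=128):
        super().__init__()
        self.m2s = mono_to_stereo              # pretrained stereo synthesizer
        self.left = nn.GRU(1, feat_dim, batch_first=True)
        self.right = nn.GRU(1, feat_dim, batch_first=True)
        self.head = nn.Linear(2 * feat_dim, 2)  # bona fide vs spoof logits

    def forward(self, mono):                    # (B, T) mono waveform
        stereo = self.m2s(mono)                 # (B, T, 2), assumed layout
        _, h_l = self.left(stereo[..., :1])     # encode left channel
        _, h_r = self.right(stereo[..., 1:])    # encode right channel
        # Fuse the final hidden states of both branches for classification.
        return self.head(torch.cat([h_l[-1], h_r[-1]], dim=-1))
```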
... With the development of "Empathy AI", research on emotional TTS has attracted more and more attention [30]. Speech synthesis in conversational scenarios and emotional speech synthesis are hot research topics nowadays [31]. ...
Preprint
Full-text available
Text-to-Speech (TTS) synthesis for low-resource languages is an active research topic in academia and industry. Mongolian is the official language of the Inner Mongolia Autonomous Region and a representative low-resource language spoken by over 10 million people worldwide. However, there is a relative lack of open-source datasets for Mongolian TTS. Therefore, we make public an open-source multi-speaker Mongolian TTS dataset, named MnTTS2, for the benefit of related researchers. In this work, we prepare transcriptions covering various topics and invite three professional Mongolian announcers to form a three-speaker TTS dataset, in which each announcer records 10 hours of speech in Mongolian, resulting in 30 hours in total. Furthermore, we build a baseline system based on the state-of-the-art FastSpeech2 model and HiFi-GAN vocoder. The experimental results suggest that the constructed MnTTS2 dataset is sufficient to build robust multi-speaker TTS models for real-world applications. The MnTTS2 dataset, training recipe, and pretrained models are released at: \url{https://github.com/ssmlkl/MnTTS2}
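The two-stage baseline can be sketched as follows; `load_fastspeech2`, `load_hifigan`, the checkpoint names, and the assumption that phoneme IDs are precomputed are all hypothetical placeholders for whatever the released MnTTS2 recipe actually provides.

```python
import torch

def synthesize(phoneme_ids, load_fastspeech2, load_hifigan):
    """Two-stage TTS: acoustic model -> mel-spectrogram -> vocoder -> waveform."""
    acoustic = load_fastspeech2("mntts2_fastspeech2.pt").eval()  # hypothetical
    vocoder = load_hifigan("mntts2_hifigan.pt").eval()           # hypothetical
    with torch.no_grad():
        mel = acoustic(phoneme_ids)   # non-autoregressive: (1, T_mel, n_mels)
        wav = vocoder(mel)            # (1, T_mel * hop_length) waveform
    return wav.squeeze().cpu().numpy()
```

Because FastSpeech2 is non-autoregressive, the acoustic stage predicts all mel frames in parallel, which is what makes the resulting baseline fast and robust at inference time.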
Article
This paper describes an Android mobile phone application designed for blind or visually impaired people. The main aim of the system is to create an automatic text-reading assistant using the hardware capabilities of a mobile phone combined with innovative algorithms. The Android platform was chosen so that people who already have a mobile phone do not need to buy new hardware. Four key technologies are required: camera capture, text detection, speech synthesis, and voice detection. Moreover, a voice recognition subsystem has been created that meets the needs of blind users, allowing them to effectively control the application by voice. It requires three key technologies: voice capture over the embedded microphone, speech-to-text, and user request interpretation. As a result, the application for the Android platform was developed based on these technologies.
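For orientation, the capture-detect-synthesize pipeline can be approximated on a desktop with off-the-shelf Python libraries; pytesseract and pyttsx3 here stand in for the phone-side components and are not the authors' Android stack.

```python
import pytesseract            # OCR wrapper around the Tesseract engine
import pyttsx3                # offline text-to-speech engine
from PIL import Image

def read_image_aloud(image_path: str) -> str:
    """Detect text in a captured image and speak it aloud."""
    text = pytesseract.image_to_string(Image.open(image_path))
    engine = pyttsx3.init()
    engine.say(text)          # queue the detected text for synthesis
    engine.runAndWait()       # block until speech playback finishes
    return text
```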