Speech Synthesis - Science topic
Explore the latest questions and answers in Speech Synthesis, and find Speech Synthesis experts.
Questions related to Speech Synthesis
Hi everyone, I'm coding the Tacotron speech synthesis system from scratch to make sure I understand it. I have implemented the first convolutional filterbank layer and the max-pooling layer, but I don't understand why the authors chose max-pooling over time with stride 1. They claim it's to preserve the temporal resolution, but my problem is that I think using a stride of 1 is equivalent to doing nothing and keeping the data as is.
As an example, say we have a matrix in which every time step corresponds to one column:
A= [1,2,3,4;
5,6,7,8;
1,2,3,4];
If we max pool over time with stride 2, we'll have:
B = [2,4;
6,8;
2,4]
Max-pooling with stride 1 will keep the time resolution, but it will also result in B = A (every column is kept). So what's the point of even saying that max-pooling was applied?
I hope my question was clear enough, thank you for reading.
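For reference, here is a minimal NumPy sketch of max-pooling over time applied to the example matrix above, assuming a pooling width of 2 (the width is not stated in the question, only the stride; with a width greater than 1, the stride-1 case is no longer the identity):

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [1, 2, 3, 4]])

def max_pool_time(x, width=2, stride=2):
    """Max-pool each row over the time (column) axis, with edge padding on the
    right so that stride 1 preserves the number of time steps."""
    n_frames = x.shape[1]
    x_padded = np.pad(x, ((0, 0), (0, width - 1)), mode="edge")
    cols = [x_padded[:, t:t + width].max(axis=1)
            for t in range(0, n_frames, stride)]
    return np.stack(cols, axis=1)

print(max_pool_time(A, width=2, stride=2))
# [[2 4]
#  [6 8]
#  [2 4]]
print(max_pool_time(A, width=2, stride=1))
# [[2 3 4 4]
#  [6 7 8 8]
#  [2 3 4 4]]
```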
Dear ResearchGate community,
For a new study I am looking for a tool or software that would allow me to manipulate formants (i.e., shift the frequencies of F1, F2, and F3) and their transitions (e.g., the onset or slope of a transition), either within a synthesized CVC word or between two synthesized words. It is therefore crucial to be able to control precisely where in the word or sequence the formant manipulation starts and ends.
What I tried so far:
I already tried a tool written for Praat (the Praat Vocal Toolkit), but it can only shift formants over the whole word, not within a specified time window.
Furthermore, I tried TrackDraw (https://github.com/guestdaniel/TrackDraw), which is a very good tool for synthesizing vocalic sounds (Klatt synthesizer) and manipulating their formants. However, CV sequences (and their vocalic transitions) cannot be generated.
I also used an online interface to the Klatt synthesizer (http://www.asel.udel.edu/speech/tutorials/synthesis/Klatt.html), but it is quite complex to generate even simple CV syllables with it and therefore not very user-friendly for my purpose. Furthermore, I don't have reference values for the consonant parameters for German.
What I achieved so far:
I'm able to synthesize German words and phrases that sound quite natural with Python (text-to-speech synthesis).
What I'm looking for:
Ideally, I was hoping to find an application or tool that allows for 1) language-specific (in my case German) text-to-speech synthesis where 2) the formants (and/or their transitions) can be easily manipulated over time, or a tool that takes an already synthesized sound as input and allows formant manipulation.
If you have any ideas, recommendations, or comments I would be very obliged. Thank you!
Stella Krüger
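Not a complete solution, but as an illustration of the kind of time-varying control being asked about, here is a minimal source-filter sketch in Python (NumPy only) that drives a pulse-train source through time-varying second-order (Klatt-style) formant resonators, so that F1/F2/F3 trajectories can be specified sample by sample; all frequencies, bandwidths, and the F0 value are placeholder assumptions:

```python
import numpy as np

fs = 16000                       # sample rate (assumption)
dur = 0.4                        # seconds
n = int(fs * dur)

# Source: a simple impulse train at a fixed F0 (placeholder value)
f0 = 120.0
source = np.zeros(n)
source[(np.arange(0.0, dur, 1.0 / f0) * fs).astype(int)] = 1.0

# Formant trajectories are defined per sample, so any time window can be edited
F1 = np.linspace(300, 700, n)    # e.g. a rising F1 transition (placeholder values)
F2 = np.linspace(2200, 1200, n)  # a falling F2 transition
F3 = np.full(n, 2500.0)
B1, B2, B3 = 80.0, 90.0, 120.0   # formant bandwidths in Hz (assumptions)

def resonate(x, freq, bw, fs):
    """Time-varying second-order resonator, evaluated sample by sample."""
    y = np.zeros_like(x)
    c = -np.exp(-2.0 * np.pi * bw / fs)
    for i in range(len(x)):
        b = 2.0 * np.exp(-np.pi * bw / fs) * np.cos(2.0 * np.pi * freq[i] / fs)
        a = 1.0 - b - c
        y[i] = a * x[i] + b * (y[i - 1] if i >= 1 else 0.0) + c * (y[i - 2] if i >= 2 else 0.0)
    return y

out = resonate(resonate(resonate(source, F1, B1, fs), F2, B2, fs), F3, B3, fs)
out /= np.max(np.abs(out))       # normalize; write to disk with scipy.io.wavfile if needed
```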
I am trying to build a text-to-speech converter from scratch.
For that
Text 'A' should sound Ayyy
Text 'B' should sound Bee
Text 'Ace' should sound 'Ase'
Etc
So how many sounds in total do I need in order to reconstruct all English words?
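For what it's worth, the usual approach works with phones rather than letter sounds: each word is mapped to a sequence drawn from a fixed phone inventory (about 39 ARPAbet phones in CMUdict, around 44 phonemes in many descriptions of English), and one stored unit per phone is concatenated. The mini-lexicon below is just a toy stand-in for a real pronunciation dictionary:

```python
# Toy word-to-phone lookup; a real system would use a full pronunciation
# lexicon (e.g. CMUdict) plus letter-to-sound rules for unknown words.
# Phone symbols follow ARPAbet.
LEXICON = {
    "a":   ["EY"],
    "b":   ["B", "IY"],
    "ace": ["EY", "S"],
}

def to_phones(word):
    """Dictionary lookup only; out-of-vocabulary handling is omitted here."""
    return LEXICON.get(word.lower())

print(to_phones("Ace"))   # ['EY', 'S']
# With one recorded or synthesized unit per phone in the inventory, any word
# in the lexicon can be rendered by concatenating those units.
```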
I am working on statistical parametric speech synthesis. I extracted the fundamental frequency and MFCC from speech waveforms. The next task is to invert MFCC back to speech waveforms. For this, I have read about sinusoidal wave generation methods which need amplitude, phase and frequency values to be determined from extracted speech parameters. How can we determine amplitude and phase information from the MFCC sequence and fundamental frequency?
I have referred to the following research paper. Can anyone please explain how phase synthesis and amplitude generation are done in this paper?
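Not specific to that paper, but a common recipe is: undo the DCT to recover a (smoothed) log-mel envelope, sample that envelope at the harmonics of F0 to get the amplitudes, and obtain each harmonic's phase by integrating its frequency over time. A rough NumPy sketch under those assumptions (the frame shift, mel mapping, unvoiced handling, and phase continuity are all simplified):

```python
import numpy as np
from scipy.fftpack import idct

fs = 16000
n_mels = 40   # size of the mel filterbank assumed when the MFCCs were computed

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def frame_to_harmonics(mfcc_frame, f0):
    """Return harmonic frequencies and amplitudes for one MFCC frame."""
    log_mel = idct(mfcc_frame, n=n_mels, norm="ortho")   # smoothed log-mel envelope
    mel_centres = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_mels)
    harmonics = np.arange(1, int((fs / 2.0) // f0) + 1) * f0
    log_amp = np.interp(hz_to_mel(harmonics), mel_centres, log_mel)
    return harmonics, np.exp(log_amp)

def synthesize(mfccs, f0s, frame_shift=0.005):
    """Sinusoidal synthesis: cos(2*pi*f*n/fs) per harmonic, frame by frame."""
    hop = int(frame_shift * fs)
    out = np.zeros(hop * len(mfccs))
    for k, (mfcc_frame, f0) in enumerate(zip(mfccs, f0s)):
        if f0 <= 0:                      # unvoiced frame: noise excitation omitted
            continue
        freqs, amps = frame_to_harmonics(mfcc_frame, f0)
        n = np.arange(k * hop, (k + 1) * hop)
        for f, a in zip(freqs, amps):
            out[n] += a * np.cos(2 * np.pi * f * n / fs)   # phase continuity ignored
    return out / (np.max(np.abs(out)) + 1e-9)
```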
I am currently working on emotional voice conversion but suffering from a lack of emotional speech data. Is there any emotional speech database that can be downloaded for academic purposes? I have checked a few databases, but they have only limited linguistic content or few utterances per emotion. IEMOCAP has a lot of overlapping speech, which is not suitable for speech synthesis. I would like to know whether there is any database with many utterances of different content for each emotion, with high speech quality and no overlap.
In speech synthesis, the Merlin system needs alignment information between the waveform and the phonemes. Seq2seq methods overcome the alignment problem by introducing attention; however, this increases inference latency. Is there some speech segmentation method other than Kaldi forced alignment?
Can anyone help me with how to build a phoneme embedding? The phonemes have features of different sizes; how can this problem be solved?
thank you
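One common approach, sketched below with PyTorch (the phone inventory, embedding size, and example utterances are placeholders): map each phoneme symbol to an integer ID, learn one fixed-size embedding vector per ID, and pad variable-length sequences so they can be batched.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# Placeholder phone inventory; index 0 is reserved for padding
phones = ["<pad>", "a", "b", "k", "s", "t"]
phone2id = {p: i for i, p in enumerate(phones)}

embedding = nn.Embedding(num_embeddings=len(phones), embedding_dim=64, padding_idx=0)

# Two utterances of different lengths, as phone-ID tensors
utt1 = torch.tensor([phone2id[p] for p in ["k", "a", "t"]])
utt2 = torch.tensor([phone2id[p] for p in ["a", "s"]])

batch = pad_sequence([utt1, utt2], batch_first=True, padding_value=0)  # shape (2, 3)
embedded = embedding(batch)   # shape (2, 3, 64); the padding index maps to a zero vector
```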
Hi everyone. I have been conducting a few experiments with simultaneous speech, but I have been using recorded speech (.wav, .ogg or .mp3 files) in all of them. However, I would like to play the simultaneous speech using Text-to-Speech solutions directly, instead of saving to a file first (mainly to avoid the delay, but also to be used across the OS/device).
All my attempts to play two simultaneous TTS voices (separate threads/processes, ...) have failed, as it seems that speech synthesis / TTS uses a single output channel (resulting in sequential audio).
Do you know any alternatives to make this work (independent of the OS/device - although windows / android are preferred)? Moreover, can you provide me additional information / references on why it doesn't work, so I can try to find a workaround?
Thanks in advance.
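One workaround, if the engine can render to an in-memory buffer or a temporary file rather than straight to the device: mix the two waveforms yourself and play the mix through a single output stream. A sketch with the sounddevice package; the two sine tones are stand-ins for the rendered TTS waveforms:

```python
import numpy as np
import sounddevice as sd   # pip install sounddevice

fs = 22050
t = np.arange(fs * 2) / fs

# Stand-ins for the two synthesized voices; in practice these arrays would come
# from rendering each TTS voice to a buffer (or loading a temporary wav file)
voice_a = 0.3 * np.sin(2 * np.pi * 220 * t)
voice_b = 0.3 * np.sin(2 * np.pi * 330 * t)

# Mix at the sample level and play through ONE stream, so the two signals are
# truly simultaneous instead of being queued one after the other by the engine
n = max(len(voice_a), len(voice_b))
mix = np.zeros(n)
mix[:len(voice_a)] += voice_a
mix[:len(voice_b)] += voice_b
mix /= np.max(np.abs(mix)) + 1e-9   # avoid clipping

sd.play(mix, fs)
sd.wait()
```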
The goal is to localize the starting time and the ending time of each phoneme in the waveform signal. If the code is written in Java, that would be better! Thanks in advance!
I am trying to understand SampleRNN before implementing it myself. However, I am really confused by the model diagram in the original paper. The diagram image is attached below.
I have the following questions:
- What inputs do the horizontal arrows refer to? Take Tier 2 for example. I believe the first horizontal arrow along Tier 2 is the input frame (please correct me if this is wrong), but do the second, third and fourth horizontal arrows represent the output or the state of the RNN cell?
- How long should the input sequence be? From my understanding, Tier 2 in the diagram takes x_{i+12} to x_{i+15}, the samples generated by the lower tier at the previous time step (I am also uncertain about this part, so please correct me if I am wrong), as part of its input. So I assume the input sequence should have the same length, i.e., be a vector of length 4. Is this correct? If it is, which part of the input should be fed into the RNN cell?
- Where should I perform the upsampling mentioned in the original paper? It seems every input to the same cell has the same dimensionality, so why is upsampling necessary?
The original paper can be found at: https://arxiv.org/pdf/1612.07837.pdf
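A heavily simplified two-tier sketch (my own simplification, not the authors' code) of how the frame-level conditioning is usually wired up: the frame RNN emits one vector per frame, and that vector is "upsampled" into one conditioning vector per sample (here via a learned per-position projection) before the sample-level predictor uses it:

```python
import torch
import torch.nn as nn

frame_size, hidden = 4, 128

frame_rnn = nn.GRU(input_size=frame_size, hidden_size=hidden, batch_first=True)
# "Upsampling": one learned projection per position within the frame
upsample = nn.Linear(hidden, hidden * frame_size)
sample_mlp = nn.Sequential(nn.Linear(hidden + 1, hidden), nn.ReLU(),
                           nn.Linear(hidden, 256))   # 256-way quantized-sample output

x = torch.randn(1, 4 * frame_size)                   # 16 past samples = 4 frames
frames = x.view(1, 4, frame_size)                    # frame-level (Tier 2-like) input
cond_per_frame, _ = frame_rnn(frames)                # (1, 4, hidden): one vector per frame
cond_per_sample = upsample(cond_per_frame).view(1, 4 * frame_size, hidden)

# The sample-level tier also sees raw previous samples (only one here, for brevity;
# the paper feeds the previous frame_size samples)
prev_sample = x.unsqueeze(-1)                        # (1, 16, 1)
logits = sample_mlp(torch.cat([cond_per_sample, prev_sample], dim=-1))
print(logits.shape)                                  # torch.Size([1, 16, 256])
```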

I am interested in creating voice-impaired speech samples for a speech perception task. It seems that, to date, there is no speech synthesizer that can create natural sounding speech with typical dysphonic characteristics (e.g. high jitter or shimmer values). But I might be wrong, since I am new to the field of speech synthesis. If you know of a specific software or can recommend related publications, I'd appreciate your help.
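In case it helps as a starting point: jitter and shimmer can be simulated directly at the source by perturbing, cycle by cycle, the period and the amplitude of a pulse train, and then filtering that source through formant resonators. A minimal NumPy sketch; the F0 and perturbation percentages are placeholder values:

```python
import numpy as np

fs = 16000
f0 = 150.0            # nominal fundamental frequency (placeholder)
jitter = 0.03         # ~3% cycle-to-cycle period perturbation (placeholder)
shimmer = 0.10        # ~10% cycle-to-cycle amplitude perturbation (placeholder)
dur = 1.0
rng = np.random.default_rng(0)

source = np.zeros(int(fs * dur))
t = 0.0
while t < dur:
    period = (1.0 / f0) * max(0.5, 1.0 + jitter * rng.standard_normal())
    amp = max(0.1, 1.0 + shimmer * rng.standard_normal())
    idx = int(t * fs)
    if idx < len(source):
        source[idx] = amp          # one glottal pulse per (perturbed) cycle
    t += period

# Passing `source` through a vocal-tract (formant) filter then yields a
# vowel-like stimulus with elevated jitter/shimmer values.
```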
- Quality is bad on new words. How can that be improved?
The goal is to enable blind people to read the text in a document.
I've always assumed that, in order to generate a set of MFCCs for speech synthesis using Hidden Markov Models, there was one HMM per mel coefficient (i.e., 12 HMMs), plus an HMM for pitch and yet another for durations. Apparently people just use one HMM for all the variables, so I wonder whether it is possible to do it as I first described and, if so, whether it is efficient.
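For experimenting with the two setups, here is a small sketch using hmmlearn (the number of states, iterations, and the random data are placeholders; real HTS-style systems additionally use context-dependent models and explicit duration modelling, which this ignores):

```python
import numpy as np
from hmmlearn import hmm   # pip install hmmlearn

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 12))          # stand-in for 500 frames of 12 MFCCs

# (a) The usual setup: ONE HMM whose emissions are 12-dimensional Gaussians,
#     so the coefficients are modelled jointly within each state.
joint = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
joint.fit(X)

# (b) The setup described in the question: one independent HMM per coefficient.
#     Possible, but the streams are then modelled as independent, state sequences
#     are no longer shared, and you pay for 12 separate trainings and decodings.
per_coeff = []
for d in range(X.shape[1]):
    m = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    m.fit(X[:, d:d + 1])                    # an (n_frames, 1) observation matrix
    per_coeff.append(m)
```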
As mentioned in the state of the art, word spotting in manuscript or printed documents may or may not be based on machine learning.
My work proposes a word-spotting system for manuscript documents. The proposed approach is not based on machine learning. So far, my system produces good results compared to different works in the state of the art.
Would using machine learning improve my results? Is it considered the only way to improve them, or do other methods exist?
Best regards
The Arabic speech corpus developed by @Nawar Halabi @MicroLinkPc is a machine-generated voice built from automatically diacritized texts, perhaps with some human correction involved.
Which diacritizer and which TTS tools and algorithms were used in the generation?
I want to compare a speech synthesis model against other competing, well-known models (WaveNet, etc.).
Language recognition uses Shifted Delta Coefficients (SDC) as acoustic features.
Some papers use only SDC (i.e., 49 values per frame), while others use
MFCC (c0-c6) + SDC (a total of 56 values per frame).
My questions are:
1) Are SDC alone (i.e., the 49 values) enough for language modelling?
2) Is MFCC (c0-c6) + SDC much better, and regarding c0, should it be the frame energy or simply c0?
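For reference, a small NumPy sketch of the usual N-d-P-k SDC computation; the common 7-1-3-7 configuration gives 7 x 7 = 49 values per frame, and the MFCC matrix here is random stand-in data:

```python
import numpy as np

def sdc(cepstra, N=7, d=1, P=3, k=7):
    """Shifted Delta Cepstra with the usual N-d-P-k parametrization.
    cepstra: (n_frames, >= N) array of cepstral coefficients (e.g. MFCC c0-c6)."""
    c = cepstra[:, :N]
    n_frames = c.shape[0]
    pad = k * P + d                         # edge-pad so every frame gets k delta blocks
    c = np.pad(c, ((pad, pad), (0, 0)), mode="edge")
    out = np.zeros((n_frames, N * k))
    for t in range(n_frames):
        blocks = [c[pad + t + i * P + d] - c[pad + t + i * P - d] for i in range(k)]
        out[t] = np.concatenate(blocks)
    return out

mfcc = np.random.randn(100, 7)           # stand-in for MFCC c0-c6 over 100 frames
feat_sdc = sdc(mfcc)                     # (100, 49): SDC only
feat_full = np.hstack([mfcc, feat_sdc])  # (100, 56): MFCC + SDC
print(feat_sdc.shape, feat_full.shape)
```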
Can anyone please give me some reference papers in which time-shifted samples of the speech signal have been used for recognition purposes?
We need to classify numbers according to their type.
Some cases:
Date: in "28th December 1999", the year should be pronounced "nineteen ninety-nine".
Currency: in "$1999", the number should be pronounced "one thousand nine hundred ninety-nine (dollars)".
So I want to know how to resolve this issue in a text-to-speech synthesis system.
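This is the classic text normalization step; a minimal rule-based sketch (the regular expressions are simplistic and the num2words package is an assumption on my part, any number-to-words routine would do; a real system needs far more rules or a trained classifier):

```python
import re
from num2words import num2words   # pip install num2words

MONTHS = (r"January|February|March|April|May|June|July|August|"
          r"September|October|November|December")

def expand_year(n):
    """1999 -> 'nineteen ninety-nine' (two-digit grouping; simplistic)."""
    s = str(n)
    if len(s) == 4 and s[2:] != "00":
        return num2words(int(s[:2])) + " " + num2words(int(s[2:]))
    return num2words(n)

def normalize(token, context):
    if token.startswith("$"):                 # currency: read as a cardinal
        return num2words(int(token[1:])) + " dollars"
    if re.search(MONTHS, context):            # a number next to a month name: read as a year
        return expand_year(int(token))
    return num2words(int(token))              # default: plain cardinal

print(normalize("1999", "28th December 1999"))   # -> nineteen ninety-nine
print(normalize("$1999", "$1999"))               # -> one thousand, nine hundred and ninety-nine dollars
```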
Does anyone know where I can find HTS scripts for articulatory movement synthesis, or at least for speech synthesis? I need to run an experiment on acoustic-to-articulatory speech inversion using HTS (the HMM-based Speech Synthesis System).
How can we simulate various spoofing attacks (such as speech synthesis, voice conversion etc.) on speech data for developing a robust Speaker Verification System?
Does there exist any freely available dataset for speaker verification task?
I only know of TIMIT, which is monolingual (English). I don't know whether WordNet contains speech as well.
I am now working on an Indonesian emotional speech synthesis system, but I am confused about which part I must change. I have read many papers, but they do not explain how to obtain emotional speech synthesis from the HTS demo; what I found is just theory. Please help me. Thank you.
Hello, could anybody recommend some publications about errors and acoustic glitches in concatenative speech synthesis? Something about what makes synthesized speech sound unnatural?
Thank you.
I have used the HTK tools to train HMMs and have run everything up to decoding (e.g., HVite with ergodic and bigram models). I want to know how these HMMs can be used for HTS speech synthesis, and in particular which outputs of the HTK-trained system should be fed into the HTS commands.
I have installed hts_engine version 1.08 using the installation instructions provided with the software. Now I cannot find any interface and I am stuck at this point.
I need different methods for text-to-speech synthesis.
I am looking for links to research papers or guidelines with a clear explanation of HMM-based speech synthesis (HTS). I have already completed a speech recognition implementation using HTK, but I do not know how to start with HTS. Thanks in advance.
I work on expressive speech synthesis, and I don't know whether the simple fact of synthesizing expressive speech in a different language is, in and of itself, an original contribution. Suppose I only use existing methods and only synthesize speech in a different language, without bringing anything new at the technical level. What is your opinion?
I'm performing some experiments that require a vocal tract length change, but I need to know the original one.
I'm aware of the formula L = c / (4F), where c is the speed of sound (34029 cm/s) and F is the first formant frequency. I'm also aware that I should use vowels as close as possible to an unconstricted vocal tract.
However, I ran a few experiments with the software program Praat and got rather different and difficult-to-interpret results. Within a single vowel I get a large range of F1 values, so I thought I should use the average: is that correct? Moreover, across different vowels I get very different results. Is that normal?
Thanks in advance!
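A tiny sketch of the quarter-wavelength estimate, assuming a uniform tube closed at the glottis and open at the lips, which is only a rough approximation for any real vowel; the formant values below are placeholders for values measured, e.g., in Praat:

```python
# Quarter-wavelength (uniform closed-open tube) model:
#   F_n = (2n - 1) * c / (4 * L)   =>   L = (2n - 1) * c / (4 * F_n)
c = 34029.0                            # speed of sound in cm/s (value from the question)

formants = [500.0, 1500.0, 2500.0]     # placeholder F1-F3 (Hz) for a neutral-ish vowel

estimates = [(2 * n - 1) * c / (4.0 * f) for n, f in enumerate(formants, start=1)]
print(estimates)                       # per-formant length estimates, in cm
print(sum(estimates) / len(estimates)) # averaging smooths measurement noise, but any
                                       # deviation from a uniform tube still biases it
```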
Automatic dictation challenges text-to-speech synthesis in several aspects: pausing should allow trainees to comfortably write down the text (taking into account orthographic, lexical, morpho-syntactic difficulties, etc.). The prosody of dictation is also very particular: clear articulation and ample prosodic patterns should highlight grammatical issues, etc. I would be pleased to get references and comments.
What is the concept of the center of gravity of a speech signal (in both the time and frequency domains), and how is it useful in removing phase mismatches in concatenative speech synthesis?
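As far as the definition goes, the center of gravity is just an energy-weighted mean; a short NumPy sketch of both the temporal and the spectral version (how it is then exploited at unit joins is a separate question):

```python
import numpy as np

fs = 16000
x = np.random.randn(1024)        # stand-in for one analysis frame of speech

# Time-domain center of gravity: energy-weighted mean time within the frame
t = np.arange(len(x)) / fs
energy = x ** 2
t_cog = np.sum(t * energy) / np.sum(energy)

# Frequency-domain center of gravity (spectral centroid):
# power-weighted mean frequency of the magnitude spectrum
power = np.abs(np.fft.rfft(x)) ** 2
freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
f_cog = np.sum(freqs * power) / np.sum(power)

print(t_cog, f_cog)
# Aligning such centers of gravity (or, more commonly, pitch marks) across the
# two units being joined is one way to reduce phase/waveform mismatches at the
# concatenation point.
```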
Currently, many researchers are interested in statistical models for solving the grapheme-to-phoneme conversion problem. Why not a neural network approach? Are there any reasons?
If not, what kind of models should we use for better performance?
By the way, how can I get the CMU Dictionary that most researchers use, and how should I choose the training and test data properly?
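The CMU Pronouncing Dictionary is freely downloadable and also ships as an NLTK corpus; a small sketch of loading it and making a simple random train/test split (assuming NLTK is installed and its cmudict corpus has been downloaded):

```python
import random
from nltk.corpus import cmudict   # pip install nltk; then nltk.download('cmudict')

entries = list(cmudict.dict().items())   # word -> list of ARPAbet pronunciations
random.seed(0)
random.shuffle(entries)

# A simple 90/10 split; for G2P it is common to also ensure that a word and all
# of its pronunciation variants end up on the same side of the split
split = int(0.9 * len(entries))
train, test = entries[:split], entries[split:]

word, prons = train[0]
print(len(train), len(test))
print(word, prons[0])             # a word and its first pronunciation
```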
We have been working to make all math materials accessible, but the JAWS reader that our school uses has a lot of issues with math symbols, even something as simple as a mixed number is not read correctly. What system do you use for math? There must be something out there that works.
I'm sending two different signals to the left and right channels of TI's C6713 codec. The output is stereo. I want to do some programming that will help me hear one sound in the left headphone and the other in the right headphone. Is that possible?
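Not C6713-specific, but as a quick way to prototype the dichotic presentation itself on a PC, here is a sketch that puts one signal in the left channel and the other in the right (assuming the sounddevice package; the two tones are stand-ins for the actual signals). On the DSP side the same idea applies: write the two sample streams into the left and right slots of each interleaved stereo frame sent to the codec.

```python
import numpy as np
import sounddevice as sd   # pip install sounddevice

fs = 16000
t = np.arange(fs * 2) / fs
left = 0.3 * np.sin(2 * np.pi * 440 * t)    # stand-in for signal 1
right = 0.3 * np.sin(2 * np.pi * 550 * t)   # stand-in for signal 2

# An (n_samples, 2) array: column 0 goes to the left ear, column 1 to the right
stereo = np.column_stack([left, right])
sd.play(stereo, fs)
sd.wait()
```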