Science topic

Speech Synthesis - Science topic

Explore the latest questions and answers in Speech Synthesis, and find Speech Synthesis experts.
Questions related to Speech Synthesis
  • asked a question related to Speech Synthesis
Question
3 answers
Hi everyone, I'm attempting to code the Tacotron speech synthesis system from scratch to make sure I understand it. I'm done implementing the first convolutional filterbank layer and have implemented the max-pooling layer, but I don't understand why the authors chose max-pooling over time with stride 1. They claim it's to keep the temporal resolution, but my problem is that I think using a stride of 1 is equivalent to just doing nothing and keeping the data as is.
As an example, say we have a matrix in which every time step corresponds to one column:
A= [1,2,3,4;
5,6,7,8;
1,2,3,4];
If we max pool over time with stride 2, we'll have:
B = [2,4;
6,8;
2,4]
Max-pooling with stride one will keep the time resolution but also result in B=A (keep every column). So what's the point of even saying that max-pooling was applied?
I hope my question was clear enough, thank you for reading.
Relevant answer
Answer
Typically, the kernel size (the area over which we look for the max value) is equal to the stride (the step by which we move the kernel), but that's not always the case. If the kernel is 1x1 and the stride is 1, then it is indeed an identity transformation. But if the kernel size is not 1x1, then it's not an identity transformation for stride=1. Such a transformation, when applied to an image, makes it brighter and less detailed while keeping roughly the same image size (depending on padding).
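A minimal NumPy sketch of the difference, using the matrix from the question and assuming a kernel of width 2 over the time axis (no padding here; Tacotron would additionally pad so that the output length matches the input):

```python
import numpy as np

A = np.array([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [1, 2, 3, 4]])

def max_pool_time(x, kernel=2, stride=1):
    """Max-pool over the time (column) axis with the given kernel width and stride."""
    cols = [x[:, t:t + kernel].max(axis=1)
            for t in range(0, x.shape[1] - kernel + 1, stride)]
    return np.stack(cols, axis=1)

print(max_pool_time(A, kernel=2, stride=2))  # [[2 4] [6 8] [2 4]]: halves the time axis
print(max_pool_time(A, kernel=2, stride=1))  # [[2 3 4] [6 7 8] [2 3 4]]: keeps the resolution but is not the identity
```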
  • asked a question related to Speech Synthesis
Question
4 answers
Dear research gate community,
for a new study I am looking for a tool or software that would allow me to manipulate formants (i.e. shift frequencies of F1, F2, and F3) and their transition (e.g., start or slope of transition) either within a synthesized CVC word or between two synthesized words. Therefore, it would be crucial to be able to control precisely where in the word or sequence formant manipulation starts and ends.
What I tried so far:
I already tried a tool written for Praat (Praat Vocal Toolkit) but it can only shift formants over the whole word and not for a specified time window.
Furthermore, I tried TrackDraw (https://github.com/guestdaniel/TrackDraw) which is a very good tool to synthesize vocalic sounds (Klatt Synthesizer) and manipulate their formants. However, CV sequences (and their vocalic transition) can not be generated.
I also used an online interface of the Klatt synthesizer (http://www.asel.udel.edu/speech/tutorials/synthesis/Klatt.html) but it is quite complex to even generate simple CV syllables and therefore not very user friendly for my purpose. Furthermore, I don't have reference values for the consonant parameters for German.
What I achieved so far:
I'm able to synthesize German words and phrases that sound quite natural with Python (text-to-speech synthesis).
What I'm looking for:
Ideally, I was hoping to find an application or tool that would allow for 1) language specific (in my case German) text-to-speech synthesis where 2) formants (and/or their transition) can be easily manipulated over time. Or a tool that already takes a synthesized sound as input and allows for formant manipulation.
If you have any ideas, recommendations, or comments I would be very obliged. Thank you!
Stella Krüger
Relevant answer
Answer
I wonder if MBROLA, which uses diphones as speech units (from existing voices in their database, or you can create them on your own using the MBROLATOR), can help here. It can give you a TTS synthesizer if you use it with a text-producing system.
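If Praat itself is acceptable as the back end, one possible workaround is to script it from Python via praat-parselmouth: cut out just the time window of interest, shift its formants, and concatenate the pieces again. Below is a rough sketch under the assumption that Praat's "Extract part", "Change gender", and "Concatenate" commands behave as documented; note that "Change gender" shifts all formants by a common ratio rather than F1/F2/F3 independently, and the file name, window, and ratio are placeholders:

```python
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("cvc_word.wav")   # placeholder input file
t0, t1 = 0.12, 0.25                       # placeholder window in which formants are shifted

# split the sound into before / window / after, preserving original times
before = call(snd, "Extract part", 0.0, t0, "rectangular", 1.0, True)
window = call(snd, "Extract part", t0, t1, "rectangular", 1.0, True)
after = call(snd, "Extract part", t1, snd.xmax, "rectangular", 1.0, True)

# shift all formants in the window by +10%; arguments: pitch floor, pitch ceiling,
# formant shift ratio, new pitch median (0 = keep), pitch range factor, duration factor
shifted = call(window, "Change gender", 75, 600, 1.1, 0, 1, 1)

out = call([before, shifted, after], "Concatenate")
out.save("cvc_word_shifted.wav", "WAV")
```

Per-formant control (e.g. shifting only F2) would need a source-filter resynthesis approach instead, but the windowing trick above at least gives precise control over where the manipulation starts and ends.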
  • asked a question related to Speech Synthesis
Question
3 answers
I am trying to build a text to speech convertor from scratch.
For that
Text 'A' should sound Ayyy
Text 'B' should sound Bee
Text 'Ace' should sound 'Ase'
Etc
So how many total sounds do I need to reconstruct full English-language words?
Relevant answer
Answer
Maybe it will be useful...
  • asked a question related to Speech Synthesis
Question
2 answers
I am working on statistical parametric speech synthesis. I extracted the fundamental frequency and MFCC from speech waveforms. The next task is to invert MFCC back to speech waveforms. For this, I have read about sinusoidal wave generation methods which need amplitude, phase and frequency values to be determined from extracted speech parameters. How can we determine amplitude and phase information from the MFCC sequence and fundamental frequency?
I have referred to the following research paper. Can anyone please tell how phase synthesis and amplitude generation is done in this paper?
Relevant answer
Answer
You can use the librosa library to invert MFCCs to audio: https://librosa.org/doc/latest/feature.html
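A minimal sketch of that inversion, assuming librosa >= 0.7 (where the feature.inverse module was added); the file name and MFCC settings are only illustrative, and the result will sound degraded because the phase is re-estimated with Griffin-Lim:

```python
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

# MFCC -> mel spectrogram -> waveform (Griffin-Lim estimates the missing phase)
y_hat = librosa.feature.inverse.mfcc_to_audio(mfcc, sr=sr)

sf.write("reconstructed.wav", y_hat, sr)
```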
  • asked a question related to Speech Synthesis
Question
4 answers
I am currently doing "emotional voice conversion" but suffering from a lack of emotional speech databases. Is there any emotional speech database that can be downloaded for academic purposes? I have checked a few databases, but they only have limited linguistic content or few utterances for each emotion. IEMOCAP has many overlaps, which are not suitable for speech synthesis... I would like to know if there is any database that has many utterances with different contents for different emotions and with high speech quality / no overlap?
Relevant answer
  • asked a question related to Speech Synthesis
Question
2 answers
In speech synthesis, the Merlin system needs alignment information between the waveform and the phonemes. Seq2seq methods overcome the alignment problem by introducing attention; however, this increases inference latency. Is there a speech segmentation method other than Kaldi forced alignment?
Relevant answer
Answer
Have a look at the link, it may be helpful.
Regards,
Shafagat
  • asked a question related to Speech Synthesis
Question
3 answers
Can anyone help me with how to build a phoneme embedding? The phonemes have different sizes in some features; how can I solve this problem?
thank you
Relevant answer
Answer
Yes, you can use an RNN encoder-decoder to produce the phoneme embeddings; the RNN maps each phoneme to an embedding space.
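A minimal PyTorch sketch of that idea: map each phoneme symbol to an integer ID, learn an nn.Embedding over the inventory, and pad variable-length sequences to a common length so they can be batched. The phoneme inventory and dimensions below are illustrative only:

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence

# illustrative phoneme inventory; index 0 is reserved for padding
phonemes = ["<pad>", "a", "b", "d", "e", "k", "s", "t"]
phone2id = {p: i for i, p in enumerate(phonemes)}

embedding = nn.Embedding(num_embeddings=len(phonemes), embedding_dim=64, padding_idx=0)

# two phoneme sequences of different lengths
seqs = [["k", "a", "t"], ["t", "e", "s", "t", "s"]]
ids = [torch.tensor([phone2id[p] for p in s]) for s in seqs]
ids = pad_sequence(ids, batch_first=True, padding_value=0)  # shape: (batch, max_len)

emb = embedding(ids)                                        # shape: (batch, max_len, 64)
print(emb.shape)
```

The padded, embedded batch can then be fed to an RNN (optionally with pack_padded_sequence so the padding is ignored).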
  • asked a question related to Speech Synthesis
Question
4 answers
Hi everyone. I have been conducting a few experiments with simultaneous speech, but I have been using recorded speech (.wav, .ogg or .mp3 files) in all of them. However, I would like to play the simultaneous speech using Text-to-Speech solutions directly, instead of saving to a file first (mainly to avoid the delay, but also to be used across the OS/device).
All my attempts to play two simultaneous TTS voices (separate threads/processes, ...) have failed, as it seems that speech synthesis / TTS uses a single channel (resulting in sequential audio).
Do you know any alternatives to make this work (independent of the OS/device - although windows / android are preferred)? Moreover, can you provide me additional information / references on why it doesn't work, so I can try to find a workaround?
Thanks in advance.
Relevant answer
Answer
Did you try to use different engines?
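If you can get at the synthesized waveforms as arrays instead of relying on the engine's own playback, one workaround is to mix them yourself and play the result through a single output stream. A minimal sketch, assuming a hypothetical synthesize(text) function that returns a mono NumPy array at a common sample rate (many neural TTS toolkits can do this; the classic SAPI/Android engines generally only render to a file):

```python
import numpy as np
import sounddevice as sd

SR = 22050  # assumed common sample rate

def synthesize(text):
    """Placeholder for a TTS call returning a mono waveform at SR.
    Here it just returns a short tone so the sketch runs end to end."""
    t = np.linspace(0, 1.0, SR, endpoint=False)
    return 0.2 * np.sin(2 * np.pi * (220 + 10 * len(text)) * t).astype(np.float32)

a = synthesize("First message")
b = synthesize("Second message")

# pad the shorter signal so both have the same length, then mix
n = max(len(a), len(b))
a = np.pad(a, (0, n - len(a)))
b = np.pad(b, (0, n - len(b)))
mix = 0.5 * a + 0.5 * b   # simple sum; the two voices could also be panned left/right

sd.play(mix, SR)
sd.wait()
```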
  • asked a question related to Speech Synthesis
Question
6 answers
The goal is to localize the starting time and the ending time of each phoneme in the waveform signal. If the code is written in Java, that would be better! Thanks in advance!
Relevant answer
Answer
If you look on the Carnegie Mellon website, the speech technology group has a lot of free tools available for research to do language processing.
  • asked a question related to Speech Synthesis
Question
2 answers
I am trying to understand SampleRNN before implementing it myself. However, I am really confused by the model diagram in the original paper. The diagram image is attached below.
I have the following questions:
  1. What inputs do the horizontal arrows refer to? Take Tier 2, for example. I believe the first horizontal arrow along Tier 2 is the input frame (please correct me if this is wrong), but do the second, third, and fourth horizontal arrows represent the output or the state of the RNN cell?
  2. How long should the input sequence be? From my understanding, Tier 2 on the diagram takes Xi+12 to Xi+15, samples generated by the layer below at the previous timestep (I am also uncertain about this part, so please correct me if I am wrong), as part of its input. So I assume the input sequence should have the same length, i.e., a vector of length 4. Is this correct? If it is, which part of the input should be fed into the RNN cell?
  3. Where should I perform the upsampling mentioned in the original paper? It seems every input to the same cell has the same dimensionality, so why is upsampling necessary?
The original paper can be found at: https://arxiv.org/pdf/1612.07837.pdf
Relevant answer
Answer
Dear,
The RNN input is the same as for any other neural network, but it is influenced by the historical states. You can think about it as a deep feedforward neural network when you unroll it.
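On question 3, the reason upsampling is needed is that a higher tier runs once per frame (e.g., once every 4 samples) while the tier below needs a conditioning vector for every single sample, so each frame-level output has to be expanded by the frame/sample ratio. A minimal PyTorch sketch of one common way to do this (a learned linear projection that emits `ratio` vectors per frame, versus plain repetition); the dimensions are illustrative, not the paper's:

```python
import torch
import torch.nn as nn

dim, ratio = 256, 4                      # conditioning size, samples per frame (illustrative)
upsample = nn.Linear(dim, dim * ratio)   # learned upsampling

frames = torch.randn(8, 10, dim)         # (batch, n_frames, dim) from the higher tier
cond = upsample(frames)                  # (batch, n_frames, dim * ratio)
cond = cond.view(8, 10 * ratio, dim)     # (batch, n_frames * ratio, dim): one vector per sample

# the simpler alternative is plain repetition of each frame vector
repeated = frames.repeat_interleave(ratio, dim=1)
print(cond.shape, repeated.shape)        # both (8, 40, 256)
```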
  • asked a question related to Speech Synthesis
Question
9 answers
I am interested in creating voice-impaired speech samples for a speech perception task. It seems that, to date, there is no speech synthesizer that can create natural sounding speech with typical dysphonic characteristics (e.g. high jitter or shimmer values). But I might be wrong, since I am new to the field of speech synthesis. If you know of a specific software or can recommend related publications, I'd appreciate your help.
Relevant answer
Answer
Hello, Isabel Schiller. I have been developing a physics-based synthesizer of dysphonic voices. In this link you will find more information and a prototype for download: https://cic.unb.br/~lucero/synthesis_en.html . The synthesizer is able to produce sustained vowels with controlled levels of breathiness, jitter, tremor, and other effects, and I am currently working on consonant-vowel sequences.
  • asked a question related to Speech Synthesis
Question
4 answers
  • Quality is bad on new words. How can that be improved?
Relevant answer
Answer
Quality depends on the synthesis method used. Synthesis that relies on natural speech recordings can be poor for multiple reasons (recording conditions, low-quality equipment, multiple speakers, recording a single speaker at different times, use of non-professional speakers, etc.). Formant synthesis (less used today) was poor because of the absence of F0 (intonation) variation. That improved with the introduction of ToBI.
  • asked a question related to Speech Synthesis
Question
4 answers
In order to enable blind people to read the text in a document.
Relevant answer
Answer
This one converts text into speech.
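As a quick illustration, a minimal offline example in Python using the pyttsx3 library (the text, rate, and output file name are placeholders):

```python
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 150)                               # speaking rate in words per minute

engine.say("This sentence is read aloud.")                    # speak through the sound card
engine.save_to_file("This sentence is saved.", "output.wav")  # or render to a file
engine.runAndWait()
```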
Important references:
  • asked a question related to Speech Synthesis
Question
4 answers
for speech synthesis
Relevant answer
Answer
Hi Wy, you may find the RAVDESS helpful in your work. It's a validated multimodal database of emotional speech and song. It contains 7,356 recordings in English, with 8 emotions: calm, happy, sad, angry, fearful, surprise, disgust, and neutral, each at two emotional intensities. It can be downloaded for free: https://zenodo.org/record/1188976
  • asked a question related to Speech Synthesis
Question
1 answer
I've always assumed in order to generate a set of MFCCs for speech synthesis using Hidden Markov Models, that there was one HMM per Mel Coefficient, that is 12 HMMs, an HMM for the pitch, and yet another for durations. Apparently people just use one HMM for all the variables, so I wonder if it is possible to do as I first described, and if so is it efficient?
Relevant answer
Answer
I have found the answer; basically, the method I've described can only be applied if the variables are independent. If the variables are correlated, it is not possible to generate random vectors that would fit the distribution. A covariance matrix has to be provided with non-zero values for the elements not on the diagonal.
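A small NumPy sketch of that point (the mean and covariance values are made up for illustration): with a diagonal covariance each coefficient can be sampled by its own generator, but with non-zero off-diagonal terms the joint distribution has to be sampled as a whole, otherwise the correlation between coefficients is lost.

```python
import numpy as np

rng = np.random.default_rng(0)
mean = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.8],
                [0.8, 1.0]])    # non-zero off-diagonal term: the two variables are correlated

# sampling the joint distribution with the full covariance matrix
x_joint = rng.multivariate_normal(mean, cov, size=10000)

# sampling each dimension independently (one "generator" per variable)
x_indep = np.column_stack([rng.normal(mean[i], np.sqrt(cov[i, i]), 10000) for i in range(2)])

print(np.corrcoef(x_joint.T)[0, 1])   # ~0.8: matches the target covariance
print(np.corrcoef(x_indep.T)[0, 1])   # ~0.0: the correlation is lost
```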
  • asked a question related to Speech Synthesis
Question
4 answers
As mentioned in the state of the art, the word spotting process, in manuscript or printed documents, may or may not be based on machine learning.
My work proposes a word spotting system for manuscript documents. The proposed approach is not based on machine learning. So far, my system generates good results compared to different works in the state of the art.
Would using machine learning improve my results? Is it considered the only way to increase the results, or do other methods exist for that?
Best regards
Relevant answer
Answer
Dear Brooks, you are asking about the experimental manuscript documents, right?
  • asked a question related to Speech Synthesis
Question
3 answers
The Arabic speech corpus developed by @Nawar Halabi @MicroLinkPc is machine-generated voice from machine auto-diacritized texts, maybe with some human correction involved.
What are the diacritizer and the TTS tools and algorithms used in the generation?
Relevant answer
Answer
This is a speech corpus; where is the text?
I'm asking specifically about the tools and algorithms used in producing the completed corpus.
You seem to have added your answer without reading the question and wasted my time.
  • asked a question related to Speech Synthesis
Question
4 answers
I want to compare a model of speech synthesis to other concurrent and well-known models (WaveNet, etc.).
Relevant answer
Answer
I guess Danila means this one: https://catalog.ldc.upenn.edu/ldc93s1
  • asked a question related to Speech Synthesis
Question
3 answers
Language recognition uses Shifted Delta Coefficients (SDC) as acoustic features.
Some papers use only SDC (i.e., 49 values per frame), while some use
MFCC (c0-c6) + SDC (a total of 56 per frame).
The questions are:
1) Are SDCs alone (i.e., the 49 values) enough for language modeling?
2) Is MFCC (c0-c6) + SDC much better, and regarding c0, should it be the frame energy or simply c0?
Relevant answer
Answer
I use both: MFCCs and SDC. However, I recommend you also try additional features such as RASTA-PLP, Thomson MFCC, and others...
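For reference, a minimal NumPy sketch of how SDCs are typically stacked from a cepstral matrix, using the common N-d-P-k = 7-1-3-7 configuration (which gives the 49 values per frame mentioned in the question). Frame indices that fall outside the utterance are simply clamped here; real implementations may pad differently:

```python
import numpy as np

def sdc(cep, d=1, P=3, k=7):
    """Shifted delta cepstra: for each frame t and block i, c[t + i*P + d] - c[t + i*P - d].
    `cep` has shape (n_frames, n_coeffs); the result has shape (n_frames, n_coeffs * k)."""
    n = cep.shape[0]
    blocks = []
    for i in range(k):
        plus = np.clip(np.arange(n) + i * P + d, 0, n - 1)
        minus = np.clip(np.arange(n) + i * P - d, 0, n - 1)
        blocks.append(cep[plus] - cep[minus])
    return np.hstack(blocks)

cep = np.random.randn(100, 7)   # stand-in for MFCC c0-c6 over 100 frames
feats = sdc(cep)
print(feats.shape)              # (100, 49)
```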
  • asked a question related to Speech Synthesis
Question
7 answers
We need to classify numbers as per the data type.
Some cases:
Date "28th December 1999"  here year can be pronounced  as "Nineteen ninety-nine" 
Currency: $1999 here the number pronounced  as "One  thousand ninety-nine"
So I want to know how to resolve this issue for Text to speech synthesis system?
Relevant answer
Answer
The Sproat et al. (2001) article on non-standard word classification and expansion (please see link) might be helpful.
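A toy sketch of that approach, classify the token from its context and expand it accordingly, using the num2words package (the regular expressions and categories are illustrative and far from complete):

```python
import re
from num2words import num2words

def expand_number(text):
    # currency, e.g. "$1999": read as a cardinal number
    m = re.fullmatch(r"\$(\d+)", text)
    if m:
        return num2words(int(m.group(1))) + " dollars"
    # date with a four-digit year, e.g. "28th December 1999": read the year as a year
    m = re.fullmatch(r"(\d{1,2})(?:st|nd|rd|th) (\w+) (\d{4})", text)
    if m:
        return "{} of {} {}".format(num2words(int(m.group(1)), to="ordinal"),
                                    m.group(2),
                                    num2words(int(m.group(3)), to="year"))
    return text

print(expand_number("$1999"))               # cardinal reading of 1999, plus "dollars"
print(expand_number("28th December 1999"))  # "twenty-eighth of December nineteen ninety-nine"
```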
  • asked a question related to Speech Synthesis
Question
2 answers
Does anyone know where I can find any HTS scripts for articulatory movement synthesis, or at least speech synthesis? I need to do an experiment with acoustic-to-articulatory speech inversion using HTS (the HMM-based Speech Synthesis System).
Relevant answer
Answer
Thank you very much, Jyothirmai, for your advice.
  • asked a question related to Speech Synthesis
Question
4 answers
How can we simulate various spoofing attacks (such as speech synthesis, voice conversion etc.) on speech data for developing a robust Speaker Verification System?
Does there exist any freely available dataset for speaker verification task?
Relevant answer
Answer
Sapan,
The human ear and brain are very good at speech recognition, as are Siri and other computer-based algorithms. To spoof a person's speech, you must first have a good, lengthy sample of the speech and then develop what is essentially a vocal tract model for the human speaker. The model is a transfer function between the vocal cords, air supply, and air flow of the specific human vocal tract and the listener. Of course, the model changes if the speaker is sick, has a cold, a swollen vocal tract, etc. Once you have a vocal tract model and proper excitation sounds (like vocal cords), you can create speech that spoofs the speaker. This is not an easy task because you must learn a lot about how human speech is created.
Please ask another question if you need clarification.
Good luck,
Steve
  • asked a question related to Speech Synthesis
Question
6 answers
I only know of TIMIT, which is monolingual (English). I don't know if WordNet contains speech as well.
Relevant answer
Answer
It is necessary to design a more suitable phoneme set for multilingual speech data collection and labeling.
  • asked a question related to Speech Synthesis
Question
3 answers
Now I am working on an Indonesian emotional speech synthesis system, but I am confused about which part I must change. I've read so many papers, but they didn't tell me how I can make emotional speech synthesis from the HTS demo; what I found is just the theory. Please help me. Thank you.
Relevant answer
Answer
Hi Elok,
your question is not simple.
I have usually seen emotional TTS implemented as separate "voices", one per emotion (as suggested by Sri Harsha), so that an emotion change is actually a voice change. However, it is also possible (but more complex) to implement this as a single voice.
Either way, you could have an analysis module that tries to identify the correct emotion (i.e. sentiment analysis) and triggers an emotion change. Alternatively, you could drive the emotion change via tags in the text...
Cheers!
  • asked a question related to Speech Synthesis
Question
3 answers
Hello, could anybody recommend some publications about errors and acoustic glitches in concatenative speech synthesis? Something about "what makes speech synthesis sound unnatural"?
Thank you.
Relevant answer
Answer
Aryal, Sandesh, and Ricardo Gutierrez-Osuna. "Articulatory inversion and synthesis: towards articulatory-based modification of speech." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
Aryal, Sandesh, and Ricardo Gutierrez-Osuna. "Reduction of non-native accents through statistical parametric articulatory synthesis." The Journal of the Acoustical Society of America 137.1 (2015): 433-446.
  • asked a question related to Speech Synthesis
Question
2 answers
I have used HTK tools to get trained HMMs. I have executed everything up to decoding (e.g., HVite for ergodic and bigram models). I want to know how these HMMs will be used for HTS speech synthesis, especially what inputs from the HTK-trained system will be fed into the HTS commands.
Relevant answer
Answer
Hello
I am facing difficulties using the HMGenS tool.
I have patched HTK and I can see all the tools compiled, but when I try to use the HMGenS tool I get a "Command not found" message.
  • asked a question related to Speech Synthesis
Question
1 answer
I have installed hts_engine version 1.08 using the installation instructions provided along with the software. Now I do not find any interface and I am stuck here.
Relevant answer
Answer
Hi! Have you solved your problem? I imagine there is no graphical user interface, but you must work on the command line. Have a look at the PDF downloadable from here: http://hts.sp.nitech.ac.jp/archives/2.2/HTS-2.2_for_HTK-3.4.1.tar.bz2
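For reference, synthesis is then typically driven from the shell with an HTS voice file and a full-context label file. The sketch below (wrapped in Python) uses the flags as I recall them from hts_engine_API, so treat the names and options as assumptions and check the usage text printed by running hts_engine with no arguments:

```python
import subprocess

# placeholder paths: an .htsvoice model and a full-context label file
cmd = ["hts_engine",
       "-m", "voice.htsvoice",   # HTS voice file (placeholder name)
       "-ow", "output.wav",      # where to write the synthesized waveform
       "input.lab"]              # input full-context label file (placeholder name)

subprocess.run(cmd, check=True)
```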
Good luck!
  • asked a question related to Speech Synthesis
Question
6 answers
I need different methods for text to speech synthesis.
Relevant answer
Answer
The most popular engine used for TTS nowadays is Festvox. It provides different methods of concatenative speech synthesis: diphone synthesis, limited-domain synthesis, and cluster-unit selection synthesis. It also has support for building TTS through the HMM approach mentioned in the answer above (HTS).
  • asked a question related to Speech Synthesis
Question
3 answers
I am looking for some links of research papers or guidelines with clear explanation for HMM based speech synthesis (HTS). I am already done with the speech recognition implementation using HTK. But I do not know how to start for HTS. Thanks in advance.
Relevant answer
Answer
Here is an example of how we started with HMM synthesis for Hungarian:
Bálint Tóth, Géza Németh, Hidden Markov Model Based Speech Synthesis System in Hungarian. INFOCOMMUNICATIONS JOURNAL LXIII:(7) pp. 30-34. (2008)
You may look at the follow ups from my list of publications (not fully up-to-date):
  • asked a question related to Speech Synthesis
Question
5 answers
I work on expressive speech synthesis and I don't know if the simple fact of synthesizing expressive speech in a different language is, in and of itself, an original contribution. Suppose I only use existing methods and only synthesize speech in a different language, but without bringing anything new on a technical level. What is your opinion?
Relevant answer
Answer
I agree with Sascha. There is the phenomenon of phoneme substitution in non-native speakers. One area that I would also look at is phonological feature behavior in foreign speakers.
How these elements play into syllable composition for "accented" voice generation would be interesting to explore.
  • asked a question related to Speech Synthesis
Question
10 answers
I'm performing some experiments that require a vocal tract length change, but I need to know the original one.
I'm aware of the formula L = c / (4F), where "c" is the speed of sound (34029 cm/s) and "F" is the first formant frequency. I'm also aware that I should use vowels as close as possible to an unconstricted vocal tract.
However, I made a few experiments with the software program Praat and I got rather different and difficult-to-interpret results. For a single vowel, I get a large range of frequencies (first-formant ones), so I thought I should focus on the average? Is that correct? Moreover, among different vowels I get very different results. Is that normal?
Thanks in advance!
Relevant answer
Answer
The first and second formants define the vowels and thus vary between them. More revealing are the third and fourth formants. The second formant of the vowel /i/, however, can be approximated as a standing wave in the pharynx and thus informs to some extent about the length of the pharynx. Also, the speed of sound is about 350 m/s in the vocal tract, not 340 m/s, which is valid for a room temperature of about 20 °C.
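As a quick sanity check of the quarter-wavelength formula, using a body-temperature speed of sound of roughly 350 m/s and a schwa-like first formant of about 500 Hz (both values are only nominal):

```python
c = 35000.0   # speed of sound in the vocal tract, cm/s (approx., at body temperature)
F1 = 500.0    # first formant of a neutral, schwa-like vowel, Hz (nominal)

L = c / (4 * F1)
print(L)      # 17.5 cm, in the range of a typical adult male vocal tract
```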
  • asked a question related to Speech Synthesis
Question
1 answer
Automatic dictation challenges text-to-speech synthesis in several aspects: pausing should allow trainees to comfortably write down the text (taking into account orthographic, lexical, morpho-syntactic difficulties, etc.). The prosody of dictation is also very particular: clear articulation and ample prosodic patterns should highlight grammatical issues, etc. I would be pleased to get references and comments.
Relevant answer
Answer
I found the following refs:
Coniam, D. (1996). "Computerized dictation for assessing listening proficiency." Calico Journal 13: 73-86.
Santiago-Oriola, C. (1998). "Système vocal interactif pour l’apprentissage des langues : la synthèse de la parole au service de la dictée." U. de Toulouse le Mirail. Toulouse. IRIT.
Santiago-Oriola, C. (1999). Vocal synthesis in a computerized dictation exercise. Eurospeech, Budapest, Hungary: 191-194.
Ruggia, S. (2000) "La dictée interactive." Alsic 3 DOI: 10.4000/alsic.1820.
Shang-Ming Huang, C.-L. Liu and Z.-M. Gao (2005). Computer-assisted item generation for listening cloze tests and dictation practice in English. Advances in Web-Based Learning - ICWL, Lecture Notes in Computer Science Volume 3583: 197-208.
Beaufort, R. and S. Roekhaut (2011). "Automation of dictation exercises. A working combination of CALL and NLP." Journal of Computational Linguistics in the Netherlands 1: 1-20.
Beaufort, R. and S. Roekhaut (2011). Le TAL au service de l’ALAO/ELAO. L’exemple des exercices de dictée automatisés. Conférence sur le Traitement Automatique des Langues Naturelles (TALN), Montpellier.
Pellegrini, T., A. Costa and I. Trancoso (2012). Less errors with TTS? A dictation experiment with foreign language learners. Interspeech, Portland, OR: 1291-1294.
Roekhaut, S., R. Beaufort and C. Fairon (2013). "PLATON, un outil de dictée automatique au service de l’apprentissage de l’orthographe " Le Langage et l'Homme 48(2): 127-140.
  • asked a question related to Speech Synthesis
Question
1 answer
What is the concept of the center of gravity of a speech signal (in both the time and frequency domains), and how is it useful in removing phase mismatches in concatenative speech synthesis?
Relevant answer
Answer
Dear Kuldeep Dhoot, the center of gravity is a function only of the first derivative of the phase spectrum at the origin.
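For the time-domain side, the temporal center of gravity is usually computed as the energy-weighted mean time of the (windowed) segment, which is the quantity that relates to the derivative of the phase spectrum at the origin mentioned above. A minimal NumPy sketch (the test signal is arbitrary):

```python
import numpy as np

def center_of_gravity(x, fs):
    """Energy-weighted mean time (in seconds) of a signal x sampled at fs."""
    t = np.arange(len(x)) / fs
    e = x.astype(float) ** 2
    return np.sum(t * e) / np.sum(e)

fs = 16000
t = np.arange(0, 0.02, 1 / fs)
x = np.hanning(len(t)) * np.sin(2 * np.pi * 200 * t)   # arbitrary 20 ms windowed test signal

print(center_of_gravity(x, fs))   # ~0.01 s, the middle of the symmetric window
```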
  • asked a question related to Speech Synthesis
Question
8 answers
Currently, many researchers are interested in statistical models for solving the grapheme-to-phoneme conversion problem. Why not a neural network approach? Are there any reasons?
If not, what kind of model should we use for better performance?
By the way, how do we get the CMU Dictionary, which is usually used by most researchers? And how do we choose the training and testing data properly?
Relevant answer
Answer
In fact, we are interested in phonetic transcription because it is useful not only for text-to-speech systems, but also for other systems such as spoken term detection. In recent years, researchers have been trying to improve the prediction quality of grapheme-to-phoneme (or letter-to-phoneme) conversion because it is quite difficult to get the system to pronounce unknown (out-of-vocabulary) words. Moreover, there are many unexpected problems in English words (especially the American English words in the CMUdict pronunciation dictionary, which contains many loanwords). A morpheme-based system is also an interesting solution, but it is language-dependent. Personally, I'm not sure if it could deal with foreign words (the language-independent case).
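On the practical question of obtaining CMUdict, one easy route is the copy shipped with NLTK. A minimal sketch that loads it and makes a simple random train/test split (the 90/10 ratio is just an example; G2P papers often use predefined splits instead):

```python
import random
import nltk

nltk.download("cmudict")                 # one-time download of the CMU Pronouncing Dictionary
from nltk.corpus import cmudict

entries = list(cmudict.dict().items())   # word -> list of possible phoneme sequences
random.seed(0)
random.shuffle(entries)

split = int(0.9 * len(entries))
train, test = entries[:split], entries[split:]
print(len(train), len(test), train[0])
```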
  • asked a question related to Speech Synthesis
Question
9 answers
We have been working to make all math materials accessible, but the JAWS reader that our school uses has a lot of issues with math symbols; even something as simple as a mixed number is not read correctly. What system do you use for math? There must be something out there that works.
Relevant answer
Answer
Hello,
Andras gave the right answer: it totally depends on what you mean by "mathematics". Simple maths can be efficiently processed through a screen reader and speech synthesis, but mathematics has a two-dimensional notation which generates a lot of problems for the visually impaired. A voice (like braille notation) gives a one-dimensional representation of mathematics, as for music or chemistry. There are two main challenges: providing a good spoken sentence as close as possible to natural language, AND supporting interactions with formulas to help visually impaired people UNDERSTAND the formulas, not only to have access to their content. There are several projects in this field, as mentioned by Andras, Gopal, or Alistair. You could additionally have a look at the MAWEN project (D. Archambault, K. Miesenberger, et al.). For math support in braille (but for me the issues are very close), you may check Heumader et al., Mascret et al. (;-)), Archambault et al., ... Look at the proceedings of the last ICCHP conference in Linz (and former ICCHPs).
Best regards,
Bruno
  • asked a question related to Speech Synthesis
Question
1 answer
I'm sending two different signals to the left and right channels of TI's C6713 codec. The output is stereo. I want to do some programming that will help me hear one sound in the left headphone and the other in the right headphone. Is that possible?
Relevant answer
Answer
Yes, it is possible.
You can design an appropriate filter with a cutoff frequency to separate the bass and treble components of the given signal. The outputs of the filters can be given to the processor, which you call stereo. Note that the cutoff frequency Fc for separating bass and treble can only be decided if you know the spectrum of your signal.
It is simple if you do this using the FDA tool in MATLAB and then transfer the design to the 6713.
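As a generic illustration of the idea (not C6713-specific): build a two-column array with a different signal in each column and hand it to a stereo output; on the C6713 the equivalent step is writing the interleaved left/right samples to the codec. The tones and sample rate below are arbitrary.

```python
import numpy as np
import sounddevice as sd

fs = 48000
t = np.arange(0, 2.0, 1 / fs)

left = 0.2 * np.sin(2 * np.pi * 440 * t)    # signal for the left ear
right = 0.2 * np.sin(2 * np.pi * 550 * t)   # a different signal for the right ear

stereo = np.column_stack([left, right])     # shape (n_samples, 2): one channel per column
sd.play(stereo, fs)
sd.wait()
```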