Speech Signal Processing - Science topic
Explore the latest questions and answers in Speech Signal Processing, and find Speech Signal Processing experts.
Questions related to Speech Signal Processing
I am new to machine learning and I am currently doing research on speech emotion recognition (SER) using deep learning. I found that the recent literature mostly uses CNNs and that only a few studies apply RNNs to SER. I also found that most approaches use MFCCs.
My questions are:
- Is it true that CNNs have been shown to outperform RNNs in SER?
- If so, what limitations do RNNs have compared with CNNs?
- Also, what are the limitations of the existing CNN approaches in SER?
- Why are MFCCs used most often in SER? Do MFCCs have any limitations?
Any help or guidance would be appreciated.
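For concreteness, a minimal feature-extraction sketch (assuming the librosa library and a hypothetical file name) showing the two inputs most SER papers start from: the log-mel spectrogram typically fed to a CNN, and the MFCCs derived from the same spectral envelope.

    import librosa
    import numpy as np

    # Hypothetical input file; any mono utterance works.
    y, sr = librosa.load("utterance.wav", sr=16000)

    # Log-mel spectrogram: the 2-D "image" usually fed to a CNN.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=64)
    log_mel = librosa.power_to_db(mel)

    # MFCCs: a compact, decorrelated summary of the spectral envelope.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=400, hop_length=160)

    print(log_mel.shape, mfcc.shape)  # (n_mels, frames), (13, frames)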
Hi everyone. I have been conducting a few experiments with simultaneous speech, but I have been using recorded speech (.wav, .ogg or .mp3 files) in all of them. However, I would like to play the simultaneous speech using Text-to-Speech solutions directly, instead of saving to a file first (mainly to avoid the delay, but also to be used across the OS/device).
All my attempts to play two simultaneous TTS voices (separate threads/processes, ...) have failed, as it seems that speech synthesis / TTS uses a single audio channel (resulting in sequential audio).
Do you know any alternatives to make this work (independent of the OS/device - although windows / android are preferred)? Moreover, can you provide me additional information / references on why it doesn't work, so I can try to find a workaround?
Thanks in advance.
Hello,
I would like an explanation of the MFCC coefficients we get: only the first 12-13 coefficients are usually kept when evaluating the performance of the feature vector. What is the reason for this, and what is the effect of including higher-order coefficients as well? Also, how do we know whether a feature vector is good or bad? For example, for a sound signal, once we compute its feature vectors, how can we analyse whether the sound features are good?
My other question is about the LPC feature extraction method. Since it is based on the order of the coefficients, an LPC order of 10-12 is usually chosen. What is the reason for this, and what is the effect on performance of taking a lower or higher order?
Finally, comparing the MFCC and LPCC methods: one works in the mel-cepstrum domain and the other in the cepstrum domain. What is the benefit of the cepstrum, what is the main difference between the mel cepstrum and the cepstrum, and which one is better?
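As a side note, a minimal sketch (assuming librosa and a hypothetical file name) that extracts both feature types, so the effect of the coefficient count and LPC order can be checked empirically on your own data:

    import librosa

    y, sr = librosa.load("vowel.wav", sr=16000)   # hypothetical file

    # MFCCs: keep 13 vs. 26 coefficients and compare downstream performance.
    mfcc_13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    mfcc_26 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=26)

    # LPC: order 12 is a common rule of thumb (roughly sampling rate in kHz + 2..4).
    a_12 = librosa.lpc(y, order=12)
    a_20 = librosa.lpc(y, order=20)

    print(mfcc_13.shape, mfcc_26.shape, a_12.shape, a_20.shape)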
I am new to the field of speech signal processing. Given a speech signal (a .wav file consisting of a single utterance), how can the fundamental frequency be determined? I am using the Java programming language. Please suggest a book or online article for further reading.
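A minimal autocorrelation-based sketch (written in Python, but the same logic ports directly to Java; the file name, the frame position and the 60-400 Hz search range are assumptions):

    import numpy as np
    import soundfile as sf

    x, fs = sf.read("utterance.wav")   # hypothetical mono file, at least ~0.6 s long
    frame = x[int(0.5 * fs):int(0.5 * fs) + int(0.03 * fs)]  # one 30 ms voiced frame

    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # lags >= 0

    # Pick the strongest autocorrelation peak in a plausible pitch range (60-400 Hz).
    lo, hi = int(fs / 400), int(fs / 60)
    lag = lo + np.argmax(ac[lo:hi])
    f0 = fs / lag
    print("F0 estimate: %.1f Hz" % f0)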
What is the working methodology of the various speech segmentation techniques?
I want to get the Intensity readings of a sustained vowel at even time intervals (i.e. every 0.01 or 0.001 seconds). In Praat when I adjust the time step to "fixed" and change the fixed time to 0.01 or 0.001 it adjusts this for pitch and formants, but not for intensity. Intensity remains at increments of 0.010667 seconds between each time. Is it possible to change the time step for intensity or can it only be changed for the other parameters? Any help is much appreciated!
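One possible workaround (a sketch assuming the parselmouth Python interface to Praat; the file name is hypothetical): Praat's To Intensity command takes an explicit time step, and the 0.010667 s you see appears to be the default of 0.8 / (minimum pitch), so passing the step yourself should override it.

    import parselmouth

    snd = parselmouth.Sound("vowel.wav")   # hypothetical sustained-vowel file

    # minimum_pitch = 100 Hz, time_step = 0.001 s
    intensity = snd.to_intensity(minimum_pitch=100.0, time_step=0.001)

    for t, db in zip(intensity.xs(), intensity.values[0]):
        print("%.3f\t%.2f" % (t, db))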
Language recognition commonly uses Shifted Delta Cepstral (SDC) coefficients as acoustic features.
Some papers use only SDC (i.e., 49 coefficients per frame), while others use
MFCC (c0-c6) + SDC (a total of 56 per frame).
My questions are:
1) Is SDC alone (i.e., 49 coefficients) enough for language modelling?
2) Is MFCC (c0-c6) + SDC much better, and should c0 be the frame energy or simply the plain c0?
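For reference, a minimal sketch of the usual N-d-P-k = 7-1-3-7 SDC configuration (7 x 7 = 49 values per frame), computed from a (frames x 7) cepstral matrix; the input array here is a placeholder.

    import numpy as np

    def sdc(cep, d=1, p=3, k=7):
        # Shifted delta cepstra: stack k deltas spaced p frames apart.
        n_frames, n_cep = cep.shape
        pad = np.pad(cep, ((d + (k - 1) * p, d + (k - 1) * p), (0, 0)), mode="edge")
        out = np.zeros((n_frames, n_cep * k))
        for t in range(n_frames):
            base = t + d + (k - 1) * p      # index of frame t in the padded array
            blocks = [pad[base + j * p + d] - pad[base + j * p - d] for j in range(k)]
            out[t] = np.concatenate(blocks)
        return out

    cep = np.random.randn(200, 7)            # placeholder for MFCC c0-c6 per frame
    print(sdc(cep).shape)                     # (200, 49)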
Hi all,
I would like to ask the experts here, in order to get a better view of the use of cleaned signals from which the echo has already been removed using several types of adaptive algorithms for AEC (acoustic echo cancellation).
How significant are MSE and PSNR for improving the classification process? Normally we evaluate with techniques such as WER, accuracy and perhaps EER too. Is there any connection between MSE and PSNR values and improvements in those classification metrics?
I would appreciate clarification on this.
Thanks very much.
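For what it's worth, a small sketch of how MSE and PSNR between a clean reference and the AEC output are typically computed (the signal names are placeholders; both arrays must be time-aligned and of equal length):

    import numpy as np

    def mse_psnr(reference, processed):
        reference = np.asarray(reference, dtype=float)
        processed = np.asarray(processed, dtype=float)
        mse = np.mean((reference - processed) ** 2)
        peak = np.max(np.abs(reference))          # peak of the reference signal
        psnr = 10.0 * np.log10(peak ** 2 / mse)
        return mse, psnr

    # clean, aec_out: hypothetical time-aligned numpy arrays of equal length
    # print(mse_psnr(clean, aec_out))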
A continuous sequence of images (for example, a conversation in which a deaf person uses an interpreter to converse with someone who does not understand the signs) would be converted to speech, with the system serving as the image-to-speech converter.
I am analyzing recorded speech of sustained vowel phonation and am trying to figure out which filters are necessary for the analysis. Does an A-weighted filter need to be applied to account for the fundamental frequency? And does any de-noising need to be done to the signal?
Can anyone please give me some reference papers where time-shifted samples of speech signals have been used for recognition purposes?
Hi,
I am working on the Principal Component Analysis (PCA) method for speech recognition. The idea here is to identify two sources; the voice can then easily be identified, as speech tends to follow a Laplacian distribution and noise a Gaussian one.
However, are we in the spectral or the temporal domain when making such assumptions, typically in the paper linked? More broadly, what do we mean when talking about the "distribution of speech" in such papers?
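A small sketch of what is usually meant (my reading of the convention, not a definitive answer): the "distribution of speech" refers to the histogram of sample or spectral-coefficient amplitudes, whose heavy tails show up as excess kurtosis compared with Gaussian noise. File name is a placeholder, mono audio assumed.

    import numpy as np
    from scipy.stats import kurtosis
    import soundfile as sf

    speech, fs = sf.read("speech.wav")        # hypothetical mono speech file
    noise = np.random.randn(len(speech))      # synthetic Gaussian noise

    # Fisher kurtosis: ~0 for Gaussian, clearly positive for Laplacian-like speech.
    print("speech kurtosis:", kurtosis(speech))
    print("noise  kurtosis:", kurtosis(noise))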
I'm a beginner in speech signal processing. I see a lot of literature using DET curves to evaluate the performance of speaker recognition algorithms. I want to use convolutional neural networks for speaker recognition. So, how can I draw a DET curve? And what is the difference between speaker identification and speaker verification? Thank you!
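A minimal sketch for plotting a DET curve from verification trial scores (assuming scikit-learn and matplotlib; the labels and scores below are random placeholders, and log-log axes are used as a simple stand-in for the usual probit scale):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.metrics import det_curve

    # y_true: 1 for target (same-speaker) trials, 0 for impostor trials.
    # scores: similarity scores produced by the CNN back-end.
    y_true = np.random.randint(0, 2, 1000)
    scores = np.random.randn(1000) + y_true

    fpr, fnr, thresholds = det_curve(y_true, scores)

    plt.plot(fpr * 100, fnr * 100)
    plt.xscale("log"); plt.yscale("log")
    plt.xlabel("False alarm rate (%)"); plt.ylabel("Miss rate (%)")
    plt.title("DET curve")
    plt.show()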
I want to work on the features that contribute to prosody in speech. To achieve naturalness in the output of a text-to-speech synthesis system we need to handle the factors affecting prosody. Are MFCCs not sufficient to handle this issue? How can we improve prosody in unit-selection as well as HMM-based TTS? Is there any other solution for handling prosody in a text-to-speech synthesis system?
There are differences between the sampling rates used for sounds, e.g. 8000 Hz, 16000 Hz, 44100 Hz, etc.
Why do researchers prefer the higher sampling rates?
To improve the performance of a baseline HMM-based recognizer, I would like to integrate additional external features extracted from the speech signals. My question is about combining the MFCC features extracted with the HTK toolkit and new pitch features extracted with a Praat script, so that the fused vector contains both the MFCC and pitch features frame by frame.
Any help or guidance would be appreciated
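A minimal sketch of the frame-level fusion step (assuming the MFCC and pitch streams have already been read into numpy arrays at the same frame shift; the arrays below are placeholders):

    import numpy as np

    # mfcc:  (n_frames_a, 13) array read from the HTK feature file
    # pitch: (n_frames_b, 1)  array read from the Praat output, same 10 ms shift
    mfcc = np.random.randn(300, 13)       # placeholder
    pitch = np.random.randn(305, 1)       # placeholder (slightly different length)

    # Trim both streams to the common number of frames, then concatenate per frame.
    n = min(len(mfcc), len(pitch))
    fused = np.hstack([mfcc[:n], pitch[:n]])   # (n, 14)
    print(fused.shape)

The fused matrix can then be written back to an HTK feature file (e.g., using the USER parameter kind) for training.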
If we train our PLDA with microphone data only and test with telephone data, will it affect system performance?
And if we train with a large amount of microphone data and only a small amount of telephone data, how much will the accuracy be affected?
Or should there be a balance between the two?
Is there any recent state-of-the-art review paper on language detection?
How can we simulate various spoofing attacks (such as speech synthesis, voice conversion etc.) on speech data for developing a robust Speaker Verification System?
Does there exist any freely available dataset for speaker verification task?
I have samples of the phonemes of the English language. I am looking for the best method of concatenative synthesis, and also the best way to resolve the glitches observed when concatenating small units of speech (phonemes).
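One common remedy for concatenation glitches is a short crossfade at each join, ideally placed at pitch-synchronous points; a minimal sketch (the unit arrays and the 5 ms fade length are assumptions):

    import numpy as np

    def concat_crossfade(units, fs, fade_ms=5.0):
        # Concatenate speech units with a linear crossfade at each boundary.
        fade = int(fs * fade_ms / 1000.0)
        out = units[0].astype(float)
        ramp = np.linspace(0.0, 1.0, fade)
        for u in units[1:]:
            u = u.astype(float)
            out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp  # overlap region
            out = np.concatenate([out, u[fade:]])
        return out

    # units: list of numpy arrays holding the phoneme waveforms (placeholders)
    # y = concat_crossfade(units, fs=16000)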
I am having difficulty defining indicators of whether someone has good stress, rhythm and intonation while they are speaking and reading.
I need to get in touch with someone who has already worked with the Kaldi ASR toolkit for speech recognition.
I want to do multi-speaker speech separation. Can anyone suggest a good paper/algorithm/MATLAB code for multi-pitch tracking?
I need a time-frequency representation that best distinguishes between speech and other auditory events. I have knowledge about spectrograms and scalograms.
I am coding oral speech data and segmenting it by pauses.
Pauses can be divided into two categories: clause-internal and clause-external (e.g., Tavakoli et al., 2015). The more clause-internal pauses L2 learners produced, the more likely they are assessed as less fluent.
Before this pause segmentation, I adopt AS-unit analysis to measure the complexity of the speech data, and I often see this kind of utterance:
| And (1.2) so it is not interesting |
| indicates an AS-unit boundary, and the number in ( ) indicates the pause duration, according to the AS-unit coding system.
In this case, should I regard this pause as a clause-internal one,
simply because this utterance has only one AS-unit and one clause?
But I'm wondering if I should use another clause definition for break-down fluency (i.e., pause) measurements.
I'd be grateful for any advice on this.
Hi all,
How can the quantization error due to truncation, E_T, for negative fractional numbers represented in one's complement be bounded by 0 <= E_T < 2^-b?
For example, with (b+1) bits (b = 3 bits for the magnitude, 1 for the sign):
using one's complement, the number X = -(1/8)_10 is represented as X = (1.110)_2. If it is truncated to 1 fractional bit, then X = (1.1)_2, i.e. X = -(3/8)_10. Therefore the error is given by
E_T = Q(X) - X = -3/8 + 1/8 = -(2/8),
which contradicts the error bound for negative fractional numbers in one's complement representation.
Reference: Digital Signal Processing by Oppenheim & Schafer
Thank you.
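A small numerical sketch related to this question: it enumerates the negative 4-bit fractions, truncates their one's complement codes from b0 = 3 to b = 1 fractional bits by discarding the low-order bits, and checks the bound. It assumes the shortened word is re-interpreted with the new word length b, which may be where the apparent contradiction comes from.

    def ones_comp_code(x, b):
        # (b+1)-bit one's complement code (as an integer) of a fraction |x| < 1.
        mag = round(abs(x) * 2**b)
        return mag if x >= 0 else (2**(b + 1) - 1) - mag   # complement all bits if negative

    def ones_comp_value(code, b):
        # Value of a (b+1)-bit one's complement code.
        if code < 2**b:                                    # sign bit clear
            return code / 2**b
        return -((2**b - 1) - (code - 2**b)) / 2**b

    b0, b = 3, 1
    for k in range(1, 2**b0):                              # negative fractions -k/2^b0
        x = -k / 2**b0
        code = ones_comp_code(x, b0)
        qx = ones_comp_value(code >> (b0 - b), b)          # truncate: drop low-order bits
        et = qx - x
        print("x=%+.3f  Q(x)=%+.3f  E_T=%+.3f" % (x, qx, et))
        assert 0 <= et < 2**-b                             # bound from Oppenheim & Schafer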
What are your considerations when detecting speech? For example: sex (m/f), age, BMI, emotion, pitch, etc.
I'm reading the paper on the Perceptual Magnet Effect and trying to figure out how to calculate this value (sigma of s). Could anyone help me?
What are the ideal features to extract for a voice-based automatic person identification application?
In particular, during voiced speech production? I am looking to understand the process of speech production in detail.
I have implemented an MVDR beamformer for speech signal processing, assuming unit gain in the desired direction (steering vector), but when I check it on speech files the gain seems to be many times larger during speech regions. Because of this, the speech is getting distorted and saturated. I am using a 2-microphone linear array with 6 cm spacing to capture the audio files.
Since the MVDR problem formulation assumes unit gain in the desired direction, do I need to multiply the calculated weights by some small constant (fixed/adaptive), as below, in order to control the gain:
w = 0.05 .* (inv(noise_cor) * c_) / (c_' * inv(noise_cor) * c_);
or is there some implementation mistake?
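For comparison, a minimal per-frequency-bin MVDR sketch in Python (numpy; the variable names, covariance, steering vector and diagonal loading value are all assumptions). If the steering vector and noise covariance are estimated consistently, the weights should not need an extra scaling constant; diagonal loading is the usual way to tame excessive gain from an ill-conditioned covariance.

    import numpy as np

    def mvdr_weights(noise_cov, steering, loading=1e-3):
        # MVDR weights for one frequency bin: w = R^-1 c / (c^H R^-1 c).
        n = noise_cov.shape[0]
        r = noise_cov + loading * np.trace(noise_cov) / n * np.eye(n)  # diagonal loading
        rinv_c = np.linalg.solve(r, steering)
        return rinv_c / (steering.conj().T @ rinv_c)

    # Example for a 2-mic array at one bin (placeholder covariance and steering vector).
    noise_cov = np.array([[1.0, 0.3 + 0.1j], [0.3 - 0.1j, 1.0]])
    steering = np.array([1.0, np.exp(-1j * 0.4)])
    w = mvdr_weights(noise_cov, steering)
    print(w.conj().T @ steering)   # ~1: unit gain in the look direction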
I have two speech signals, recorded using two different recording devices and at two different times. May I know if there is a way to establish that the two signals are really related, irrespective of how they were recorded?
I'm going to process speech signals on a TMS320C6713 DSK board as a real-time system, but I have a problem getting reliable values from the speech signal variable. I'm using a 44 kHz sampling rate (Fs).
I have used a built-in MFCC algorithm to extract features from the speech signal. According to the MFCC algorithm settings, 13 coefficients are returned per frame. Each time I get a 13 x n matrix in return, with a different n for different utterances. How do I select the 13 MFCC coefficients from the returned matrix?
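If a fixed-length vector per utterance is what you need, one common (if simple) approach is to pool statistics over the frame axis; a sketch with a placeholder input matrix:

    import numpy as np

    mfcc = np.random.randn(13, 250)   # placeholder: 13 coefficients x n frames

    # Mean and standard deviation over frames give a fixed 26-dimensional vector
    # regardless of the utterance length.
    utt_vector = np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
    print(utt_vector.shape)           # (26,)

Alternatively, keep the full 13 x n matrix and let the classifier handle the variable length (e.g., with DTW or an HMM).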
I work on independent component analysis (ICA) to separate a speech signal from noise. The problem is computing the signal-to-noise ratio after ICA, because the variance of the original speech signal X is different from the variance of the estimated source signal s, where s = WX. The formula I used is SNR = 10*log10(var(s)/var(s - X)), but the resulting SNR is incorrect.
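Since ICA recovers sources only up to an arbitrary scale (and sign), one common fix is to rescale the estimate onto the reference by least squares before computing the SNR; a minimal sketch (array names are placeholders):

    import numpy as np

    def snr_after_scaling(reference, estimate):
        # SNR in dB after projecting the estimate onto the reference's scale.
        reference = reference - reference.mean()
        estimate = estimate - estimate.mean()
        alpha = np.dot(reference, estimate) / np.dot(estimate, estimate)  # LS scale and sign
        aligned = alpha * estimate
        return 10.0 * np.log10(np.sum(reference**2) / np.sum((reference - aligned)**2))

    # x: original speech signal, s: ICA estimate (placeholders)
    # print(snr_after_scaling(x, s))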
How can each recorded sentence be normalized to a root-mean-square (RMS) level of 70 dB SPL (i.e., recordings normalized for RMS amplitude to 70 dB SPL)? With Praat, Adobe Audition or MATLAB, and how is it done? Thanks.
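In Praat this is essentially what the "Scale intensity..." command does; a numpy sketch of the same idea is below. Note that mapping a digital RMS level to dB SPL only makes sense under a calibration assumption; here I assume Praat's convention that sample values are in pascals with a 2e-5 Pa reference, and the file names are placeholders.

    import numpy as np
    import soundfile as sf

    target_db = 70.0
    ref_pressure = 2e-5                       # 20 micropascal SPL reference

    x, fs = sf.read("sentence.wav")           # hypothetical mono file
    rms = np.sqrt(np.mean(x**2))
    current_db = 20.0 * np.log10(rms / ref_pressure)

    gain = 10.0 ** ((target_db - current_db) / 20.0)
    y = x * gain                              # RMS now corresponds to 70 dB SPL
    # y may exceed [-1, 1]; check for clipping before writing.
    sf.write("sentence_70db.wav", y, fs)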
In speech signal processing, I keep coming across these two terms more and more. What are they, exactly?
Can anyone suggest database sites to download audio files for speech recognition in English, Hindi or Assamese language ? I need a database that is available free of cost. This is for research purposes and I will cite the database site if used in the project work.
Thanks in advance for any help.
I want to extract the pitch of many files (<100) using Wavesurfer and the RAPT method. I know it is possible to generate a file with the pitch information by opening the audio file and choosing Save Data File, but I want to do this automatically. Does anyone know how?
Thank you very much.
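If leaving Wavesurfer itself is an option, one possible sketch is batch extraction with the RAPT implementation in the pysptk package (an assumption on my part that its output is acceptable in place of Wavesurfer's; the folder, hop size and pitch range are placeholders):

    import glob
    import numpy as np
    import soundfile as sf
    import pysptk

    for path in glob.glob("wavs/*.wav"):               # hypothetical folder of mono files
        x, fs = sf.read(path)
        x = np.asarray(x * 32768.0, dtype=np.float32)  # RAPT expects float32 samples
        f0 = pysptk.rapt(x, fs=fs, hopsize=int(0.01 * fs), min=60, max=400, otype="f0")
        np.savetxt(path.replace(".wav", ".f0"), f0, fmt="%.2f")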
I need to assess the effects these channels have on the characteristics of a voice signal, so I need to simulate as best as possible the transmission of the signals (from different databases) through the channel.
What ITU-T recommendations and similar standards should I look for?
What tools would you recommend?
Thank you very much.
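The ITU-T Software Tool Library (Rec. G.191) is the usual reference implementation for this kind of channel simulation (IRS/MSIN filters, codecs, level alignment). As a very crude first approximation, a sketch of a narrowband telephone channel (downsample to 8 kHz plus a 300-3400 Hz band-pass); the file name and filter order are assumptions:

    from math import gcd
    import soundfile as sf
    from scipy.signal import butter, sosfiltfilt, resample_poly

    x, fs = sf.read("speech_16k.wav")                 # hypothetical mono file

    # Downsample to the 8 kHz narrowband telephone rate.
    g = gcd(8000, int(fs))
    x8 = resample_poly(x, up=8000 // g, down=int(fs) // g)

    # 300-3400 Hz band-pass roughly mimicking the classic telephone band.
    sos = butter(4, [300, 3400], btype="bandpass", fs=8000, output="sos")
    y = sosfiltfilt(sos, x8)

    sf.write("speech_telephone.wav", y, 8000)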
The group delay function can be effectively used for various speech processing tasks only when the signal under consideration is a minimum phase signal.
For example, if we have speakers within the same accent category, does that make the decision process easier or harder compared to speakers with different accents?
A theoretical explanation and/or empirical results would be appreciated.
For sure, there is a (well known) theoretical background for pitch estimation, including many interesting academic papers with comparative studies of methods. On the other hand, it is known that reverberant room effects can be handled through signal pre-whitening methods. Nonetheless, my question is to those who, like myself, feel frustrated by the almost erratic performance of pitch estimators on naturally spoken sentences (i.e., at normal rhythm) in small reverberant rooms, even after signal pre-whitening. Thus, I would like to know whether someone has successfully experimented with new pragmatic approaches, possibly unconventional ones.
I am trying to assess the degree of degradation that "musical noise" causes in the low frequency bands of the spectrum of speech signals. Perceptually (playing back the treated signal) this artifact is stronger in mid and high frequencies (over 700 Hz), however I need an objective way to confirm or disprove this.
Does anyone have information on this subject or knows a way to evaluate the amount of musical noise present in a signal?
Thank you very much.
A speech signal has both voiced and unvoiced portions, but focusing on the transitions that occur in the voiced portion alone: in the voiced regions the source is almost constant, and the transitions occur due to the time-varying nature of the system; that is, the source is time-invariant while the vocal tract system is time-variant.
Why is microphone array signal processing more difficult than antenna array signal processing?
Generally speech is created with pulmonary pressure provided by the lungs that generates sound by phonation in the glottis in the larynx, then is modified by the vocal tract into different vowels and consonants.
Is it meaningful to work with the PSFs in the context of sampling? Does it have any interconnection with wavelets?
Using speech processing techniques such as LPC, wavelet transforms, and windowing techniques with fixed and adjustable filters.
(i.e., using the buffer function available in MATLAB, or by multiplying the frames by a specific type of window such as a Hamming window.)
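A minimal framing-and-windowing sketch in Python, equivalent in spirit to MATLAB's buffer followed by a Hamming window (the 25 ms frame length and 10 ms shift are assumptions):

    import numpy as np

    def frame_signal(x, frame_len, hop):
        # Split a 1-D signal into overlapping frames and apply a Hamming window.
        n_frames = 1 + max(0, (len(x) - frame_len) // hop)
        window = np.hamming(frame_len)
        frames = np.stack([x[i * hop:i * hop + frame_len] * window
                           for i in range(n_frames)])
        return frames

    fs = 16000
    x = np.random.randn(fs)                       # placeholder 1-second signal
    frames = frame_signal(x, frame_len=int(0.025 * fs), hop=int(0.010 * fs))
    print(frames.shape)                           # (n_frames, 400)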
What is the concept of the center of gravity of a speech signal (in both the time and frequency domains), and how is it useful in removing phase mismatches in concatenative speech synthesis?
I have a sound sample, and by applying a window length of 0.015 s and a time step of 0.005 s, I have extracted 12 MFCC features for 171 frames directly from the sample using the software tool Praat.
Now I have all 12 MFCC coefficients for each frame. I want to process them further, building a 39-dimensional feature matrix by adding an energy feature plus delta and delta-delta features, and then applying dynamic time warping. How do I handle the coefficients, and how do I compute the delta-delta coefficients?
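A minimal sketch of the stacking step (assuming the 13 static features per frame, i.e. energy/c0 plus the 12 MFCCs, are already in a numpy array; librosa's delta function implements the usual regression formula, and the array below is a placeholder):

    import numpy as np
    import librosa

    static = np.random.randn(13, 171)   # placeholder: (energy + 12 MFCCs) x 171 frames

    delta = librosa.feature.delta(static, order=1)       # first derivative
    delta2 = librosa.feature.delta(static, order=2)      # second derivative (delta-delta)

    features_39 = np.vstack([static, delta, delta2])     # (39, 171)
    print(features_39.shape)

DTW can then be run on these 39-dimensional frame vectors (e.g., with librosa.sequence.dtw).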
I am trying to identify whether signals measured with an accelerometer on the ankle bone, while someone is talking, are actually derived from his or her own voice transmitted through bone conduction. I initially used the FFT to spectrally analyse both the vocal speech and the bone vibration. I want to prove that both signals are derived from the same source. The harmonics in the FFT are mathematically derived values, whereas the standard voice analysis, LPC, gives measured values; they are not the same. Is LPC a better tool to use in this case?
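If you want to try LPC, a minimal sketch (assuming librosa and scipy; the file names and model order are placeholders) that computes the LPC spectral envelope of both signals so their resonance structure can be compared directly:

    import numpy as np
    import soundfile as sf
    import librosa
    from scipy.signal import freqz

    def lpc_envelope(x, fs, order=12, n_freq=512):
        a = librosa.lpc(np.asarray(x, dtype=float), order=order)
        w, h = freqz([1.0], a, worN=n_freq, fs=fs)
        return w, 20.0 * np.log10(np.abs(h) + 1e-12)

    voice, fs1 = sf.read("voice_mic.wav")       # hypothetical microphone recording (mono)
    bone, fs2 = sf.read("ankle_accel.wav")      # hypothetical accelerometer recording (mono)

    f1, env1 = lpc_envelope(voice, fs1)
    f2, env2 = lpc_envelope(bone, fs2)
    # Compare env1 and env2 (e.g., correlate or overlay them) to see whether the
    # bone-conducted signal preserves the same formant/resonance structure.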