Science topic

Speech Signal Processing - Science topic

Explore the latest questions and answers in Speech Signal Processing, and find Speech Signal Processing experts.
Questions related to Speech Signal Processing
  • asked a question related to Speech Signal Processing
Question
4 answers
I am new to machine learning and I am currently doing research on speech emotion recognition (SER) using deep learning. I found that the recent literature mostly uses CNNs and that only a few studies apply RNNs to SER. I also found that most approaches use MFCCs.
My questions are:
- Is it true that CNNs have been shown to outperform RNNs in SER?
- If so, what limitations do RNNs have compared with CNNs?
- Also, what are the limitations of the existing CNN approaches to SER?
- Why is MFCC used most often in SER? Does MFCC have any limitations?
Any help or guidance would be appreciated.
Relevant answer
Answer
Answers to some of your queries are as follows:
- It depends on the network configuration and on how one creates the training examples and datasets. A lack of systematic benchmarking of existing methods, however, creates confusion. Several studies show that LSTMs outperform CNNs for speech emotion recognition; in particular, an LSTM with an attention mechanism helps to boost emotion recognition performance. Other studies report the opposite, and a CNN seems to be the better choice. You can do a quick literature survey of the latest papers published in the last couple of editions of INTERSPEECH, ICASSP and ASRU.
- MFCC is the default choice for most speech processing tasks, including speech emotion recognition. However, MFCC is not optimal, as it lacks prosodic and long-term information. That is why MFCC is often augmented with pitch (more specifically, log F0) and/or shifted delta coefficients; this additional information helps to boost emotion recognition performance. MFCC also lacks phase information, but the role of phase in emotion recognition has not been investigated much. The parameters of the MFCC computation, such as the number of filters and the frequency scale, are chosen experimentally and depend on the dataset and the back-end classifier.
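As a concrete illustration of the MFCC-plus-pitch augmentation described above, here is a minimal Python sketch; it assumes librosa is installed, and the file name, hop size and frequency limits are placeholders of my own:

import numpy as np
import librosa

# Load a mono speech file (placeholder path).
y, sr = librosa.load("utterance.wav", sr=16000)

hop = 160  # 10 ms hop at 16 kHz, so MFCC and F0 share the same frame rate
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)   # shape (13, T)
f0 = librosa.yin(y, fmin=50, fmax=400, sr=sr, hop_length=hop)        # shape (T,)

T = min(mfcc.shape[1], f0.shape[0])
log_f0 = np.log(f0[:T])                      # log F0, the prosodic add-on mentioned above
features = np.vstack([mfcc[:, :T], log_f0])  # (14, T) augmented feature matrix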
  • asked a question related to Speech Signal Processing
Question
4 answers
Hi everyone. I have been conducting a few experiments with simultaneous speech, but I have been using recorded speech (.wav, .ogg or .mp3 files) in all of them. However, I would like to play the simultaneous speech using Text-to-Speech solutions directly, instead of saving to a file first (mainly to avoid the delay, but also to be used across the OS/device).
All my attempts to play two simultaneous TTS voices (separate threads/processes, ...) have failed, as it seems that speech synthesis / TTS uses a unique channel (resulting in sequential audio).
Do you know any alternatives to make this work (independent of the OS/device - although windows / android are preferred)? Moreover, can you provide me additional information / references on why it doesn't work, so I can try to find a workaround?
Thanks in advance.
Relevant answer
Answer
Did you try to use different engines?
  • asked a question related to Speech Signal Processing
Question
11 answers
Hello,
I would like an explanation of the MFCC coefficients we get: only the first 12-13 coefficients are considered when evaluating the performance of the feature vector. What is the reason behind this, and what is the effect if we take higher coefficients as well? Also, how do we know whether our feature vector is good or bad? For example, in the case of a sound signal, if we compute its feature vector, how can we analyze whether the sound features are good?
The other question is about the LPC feature extraction method. Since it is based on the order of the coefficients, an LPC order of 10-12 is mostly used in this scheme. What is the reason behind this, and what is the effect on performance if we take a lower or higher order?
If we compare the MFCC and LPCC methods, one works in the mel-cepstrum domain and the other in the cepstrum domain. What is the benefit of the cepstrum, what is the main difference between the mel cepstrum and the cepstrum, and which one is better?
Relevant answer
Answer
An intuition about the cepstral features can help to figure out what we should look for when we use them in a speech-based system.
- As cepstral features are computed by taking the Fourier transform of the warped logarithmic spectrum, they contain information about the rate of change across the different spectral bands. Cepstral features are favorable due to their ability to separate the influence of the source and the filter in a speech signal. In other words, in the cepstral domain the influence of the vocal cords (source) and the vocal tract (filter) can be separated, since the low-frequency excitation and the formant filtering of the vocal tract are located in different regions of the cepstral domain.
- If a cepstral coefficient has a positive value, it represents a sonorant sound, since the majority of the spectral energy of sonorant sounds is concentrated in the low-frequency regions.
- On the other hand, if a cepstral coefficient has a negative value, it represents a fricative sound, since most of the spectral energy of fricative sounds is concentrated at high frequencies.
- The lower order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function.
- The zero-order coefficient indicates the average power of the input signal.
- The first-order coefficient represents the distribution of spectral energy between the low and high frequencies.
- Even though higher-order coefficients represent increasing levels of spectral detail, 12 to 20 cepstral coefficients are typically optimal for speech analysis, depending on the sampling rate and the estimation method. Selecting a large number of cepstral coefficients results in more complex models. For example, if we intend to model a speech signal with a Gaussian mixture model (GMM) and a large number of cepstral coefficients is used, we typically need more data to estimate the GMM parameters accurately.
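To make the point about low-order coefficients concrete, here is a small numpy sketch (using the plain real cepstrum rather than MFCC, purely as an illustration): keeping only the first dozen or so coefficients and transforming back yields a smoothed spectral envelope, while increasing n_keep adds progressively finer spectral detail.

import numpy as np

def cepstral_envelope(frame, n_keep=13):
    # Real cepstrum of one windowed frame: inverse FFT of the log magnitude spectrum.
    spectrum = np.fft.rfft(frame * np.hamming(len(frame)))
    log_mag = np.log(np.abs(spectrum) + 1e-10)
    cepstrum = np.fft.irfft(log_mag)
    # Keep only the low-order coefficients (liftering) and go back to the
    # log-spectral domain: the result is the overall spectral shape.
    liftered = np.zeros_like(cepstrum)
    liftered[:n_keep] = cepstrum[:n_keep]
    liftered[-(n_keep - 1):] = cepstrum[-(n_keep - 1):]  # mirrored half, keeps the spectrum real
    envelope = np.fft.rfft(liftered).real
    return log_mag, envelope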
  • asked a question related to Speech Signal Processing
Question
3 answers
I am new to the field of speech signal processing. Given a speech signal (a .wav file consisting of a single utterance), how can the fundamental frequency be determined? I am using the Java programming language. Please suggest a book or online article for further reading.
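For orientation, the classical approach is to pick the lag of the strongest autocorrelation peak within the plausible pitch-period range; a minimal sketch follows (shown in Python rather than Java, and the frequency limits are assumptions of my own):

import numpy as np

def estimate_f0(frame, sr, fmin=50.0, fmax=400.0):
    # frame: one voiced segment of roughly 30-40 ms, sr: sampling rate in Hz.
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]  # autocorrelation, lags >= 0
    lag_min = int(sr / fmax)                                       # shortest plausible period
    lag_max = int(sr / fmin)                                       # longest plausible period
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / lag                                                # F0 in Hz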
  • asked a question related to Speech Signal Processing
Question
7 answers
Working methodology of various speech segmentation techniques
  • asked a question related to Speech Signal Processing
Question
3 answers
I want to get the Intensity readings of a sustained vowel at even time intervals (i.e. every 0.01 or 0.001 seconds). In Praat when I adjust the time step to "fixed" and change the fixed time to 0.01 or 0.001 it adjusts this for pitch and formants, but not for intensity. Intensity remains at increments of 0.010667 seconds between each time. Is it possible to change the time step for intensity or can it only be changed for the other parameters? Any help is much appreciated!
Relevant answer
Answer
What you actually want to change is the window size. For that you need to change the pitch settings and give a minimum pitch equal to 4 divided by the window size you want (i.e. 400 Hz or 4000 Hz for a window size of 0.01 s or 0.001 s, respectively). This will, however, not change the overall mean intensity much, and is more relevant to clinical research.
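If you end up doing this outside Praat, a short Python sketch of RMS intensity sampled at a fixed time step is below; the file name is a placeholder, the values are in dB relative to digital full scale rather than SPL, and the soundfile package is assumed to be available:

import numpy as np
import soundfile as sf

y, sr = sf.read("vowel.wav")   # placeholder file name
step = int(0.001 * sr)         # 1 ms time step
win = int(0.010 * sr)          # 10 ms analysis window

times, intensity_db = [], []
for start in range(0, len(y) - win, step):
    frame = y[start:start + win]
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    times.append((start + win / 2) / sr)
    intensity_db.append(20 * np.log10(rms))  # dB re full scale, not dB SPL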
  • asked a question related to Speech Signal Processing
Question
3 answers
Language recognition uses shifted delta coefficients (SDC) as acoustic features.
Some papers use only SDC (i.e. 49 values per frame), while others use
MFCC (c0-c6) + SDC (a total of 56 values per frame).
My questions are:
1) Is SDC alone (i.e. 49 values) enough for language modelling?
2) Is MFCC (c0-c6) + SDC much better, and should c0 be the frame energy or simply c0?
Relevant answer
Answer
I use both MFCCs and SDC. However, I recommend that you also try additional features such as RASTA-PLP, Thomson MFCC and others.
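For reference, a minimal sketch of the usual N-d-P-k shifted delta cepstra computation; with the common 7-1-3-7 configuration it yields 7 x 7 = 49 values per frame, matching the number in the question. The cepstra array is assumed to come from your own front end.

import numpy as np

def sdc(cepstra, d=1, P=3, k=7):
    # cepstra: (T, N) array, N cepstral coefficients per frame (e.g. N = 7).
    # For each frame t, stack k delta vectors taken at t, t+P, ..., t+(k-1)P,
    # each delta computed over a span of +/- d frames.
    T, N = cepstra.shape
    padded = np.pad(cepstra, ((d, d + (k - 1) * P), (0, 0)), mode="edge")
    out = np.zeros((T, N * k))
    for t in range(T):
        for i in range(k):
            centre = t + d + i * P
            out[t, i * N:(i + 1) * N] = padded[centre + d] - padded[centre - d]
    return out  # (T, N*k), e.g. (T, 49) for the 7-1-3-7 setup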
  • asked a question related to Speech Signal Processing
Question
4 answers
Hi all, 
I would like to ask the experts here for a better view on the use of cleaned signals from which the echo has already been removed using a few types of adaptive algorithms for AEC (acoustic echo cancellation).
How significant are MSE and PSNR for the classification process? Normally we evaluate using WER, accuracy and perhaps EER as well. Is there any connection between the MSE and PSNR values and improvements in those classification metrics?
I would appreciate clarification on this.
Thanks much
Relevant answer
Answer
This is one of the old issues in speech recognition research: the relationship between a speech enhancement technique and classification accuracy.
As far as I know, both MSE and PSNR are frequently used as criteria for improving the quality of the input, and they are known to be useful in reducing WER. However, their relationship with recognition accuracy is not directly proportional.
Enhancing a noisy signal in terms of MSE or PSNR means that you may obtain a good-quality input, but there is a risk: speech enhancement techniques sometimes produce unexpected artifacts, and in the worst case the WER can increase.
So, for a phonemic classification task, a matched condition is more crucial. In the case of a mismatched condition between training and test, MSE and PSNR are somewhat related to WER, but not directly; it is a case-by-case matter.
  • asked a question related to Speech Signal Processing
Question
3 answers
Could a continuous sequence of images (for example, of a deaf person signing through an interpreter to someone who does not understand the signs) be converted to speech, with the system serving as an image-to-speech converter?
Relevant answer
Answer
This is a good idea; however, it needs a lot of work. First, we can employ computer vision to identify objects and, in a simple case, just convert the object names into speech. In summary, the components of such a system are available, namely a module for identifying objects in images and a module for speaking words. Only their integration is needed, if it has not been done somewhere already.
  • asked a question related to Speech Signal Processing
Question
2 answers
I am analyzing recorded speech of sustained vowel phonation and am trying to figure out which filters are necessary for the analysis. Does an A-weighted filter need to be applied to account for the fundamental frequency? And does any de-noising need to be done to the signal?
Relevant answer
Answer
If the recorded speech signal is degraded by background noise, wavelet thresholding would be more appropriate than low-pass filtering for vowels.
  • asked a question related to Speech Signal Processing
Question
3 answers
Hi,
I am working on the principal component analysis method for speech recognition. The idea here is to identify two sources; the voice can then easily be identified, as speech tends to follow a Laplacian distribution and noise a Gaussian one.
However, are we in the spectral or the temporal domain when making such assumptions, typically in the paper linked? More broadly, what do we mean when talking about the "distribution of speech" in such papers?
Relevant answer
Answer
You can find measured distribution of the clean speech signal in time and frequency domain plus models in this paper: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/Tashev_Acero_Statistical_Modeling_of_the_Speech_Signal.pdf.
  • asked a question related to Speech Signal Processing
Question
4 answers
I'm a beginner in speech signal processing. I see a lot of literature using DET diagrams to evaluate the performance of speaker recognition algorithms. I want to use a convolutional neural network for speaker recognition. So, how can I draw a DET diagram? And what is the difference between speaker identification and speaker verification? Thank you!
Relevant answer
Answer
Hi Yilun,
DET:
First, you need to compute the true scores from the genuine trials and the false scores from the impostor trials. Then you may use NIST DETware
or the BOSARIS toolkit to get the DET plot.
Check the help sections/readme files for details.
Speaker identification (SI) vs. speaker verification (SV): check page 5 of
Hope this helps.
Sahid
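To illustrate the first step described above, here is a minimal sketch that turns genuine and impostor scores into a DET curve; it assumes scikit-learn and matplotlib are available, and the scores here are synthetic placeholders:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import det_curve

# Placeholder scores: genuine trials labelled 1, impostor trials labelled 0.
genuine = np.random.normal(2.0, 1.0, 1000)
impostor = np.random.normal(0.0, 1.0, 10000)
labels = np.concatenate([np.ones_like(genuine), np.zeros_like(impostor)])
scores = np.concatenate([genuine, impostor])

fpr, fnr, _ = det_curve(labels, scores)
plt.plot(fpr * 100, fnr * 100)
plt.xscale("log"); plt.yscale("log")  # NIST plots use a probit scale; log axes are a rough stand-in
plt.xlabel("False alarm rate (%)")
plt.ylabel("Miss rate (%)")
plt.show()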
  • asked a question related to Speech Signal Processing
Question
9 answers
I want to work on the features that contribute to prosody in speech. To achieve naturalness in the output of a text-to-speech synthesis system, we need to handle the factors affecting prosody. Is MFCC not sufficient to handle this issue? How can we improve prosody in unit-selection as well as HMM-based TTS? Is there any other solution for handling prosody in a text-to-speech synthesis system?
Relevant answer
Answer
Dear Sushanta,
Here's some literature that might be relevant to your interest:
Also you could take a look at the tools we have developed for prosody analysis and synthesis: https://www.researchgate.net/project/Research-Tools-mostly-Praat-based
  • asked a question related to Speech Signal Processing
Question
7 answers
There are different sampling rates for sound, such as 8000 Hz, 16000 Hz, 44100 Hz, etc.
Why do researchers prefer the higher rates?
Relevant answer
Answer
The frequency approach to music, speech and hearing research has never ceased to raise fundamental questions since Ohm's acoustical law (1843) and Helmholtz's resonance theory (1877). Hussein's question above is just one of millions of unanswered questions. Stephen's reference above to the Nyquist theorem provides an adequate answer to the question at hand. If I record musical instruments at, say, an 8 kHz sample rate, the sound quality is relatively poor in comparison to the same signal recorded at a higher sample rate. In speech, you might find that different phonemes (particularly fricatives/sibilants) are not well captured at low sampling rates, as you lose the high-frequency components which are critical distinctive features (acoustically speaking). Thus, if you use a single sample rate, you might limit yourself to a specific sound source. I adopt Xaver's procedure mentioned above; it is one wise way to discover what the sample rate does to the quality of your recordings.
  • asked a question related to Speech Signal Processing
Question
8 answers
For the purpose of improving the performance of a baseline HMM-based recognizer, I would like to integrate additional external features extracted from the speech signals into the recognizer. My question is about combining the MFCC features extracted via the HTK toolkit with the new pitch features extracted via a Praat script (the fused vector would contain both the MFCC and the pitch features, frame by frame).
Any help or guidance would be appreciated
Relevant answer
Answer
I have tried MFCC together with pitch features, and it works well for Chinese ASR, because Chinese is a strongly tonal language. However, for the pitch to be effective, a few things should be handled carefully:
1. The MFCC frame shift and the pitch frame shift should be the same, so that the total numbers of frames match.
2. For tonal-language ASR, the tonal information is mostly carried by the delta pitch and the delta-delta pitch, so the static pitch can be removed.
3. Be careful: for frames where no pitch is found (no voicing), do not try to fill it in by averaging methods; this is harmful for speech recognition. Just set the delta pitch and delta-delta pitch to zero.
That is my experience of making these two features work together.
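A minimal sketch of points 1-3 above, assuming librosa is installed (the file path, F0 range and hop size are placeholders of my own):

import numpy as np
import librosa

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder path
hop = 160                                        # identical 10 ms shift for both streams (point 1)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)
f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr, hop_length=hop)

T = min(mfcc.shape[1], len(f0))
log_f0 = np.log(np.where(voiced[:T], f0[:T], 1.0))  # 0.0 where unvoiced

d_f0 = librosa.feature.delta(log_f0)                # delta pitch (point 2)
dd_f0 = librosa.feature.delta(log_f0, order=2)      # delta-delta pitch
d_f0[~voiced[:T]] = 0.0                             # zero, do not interpolate, where unvoiced (point 3)
dd_f0[~voiced[:T]] = 0.0

fused = np.vstack([mfcc[:, :T], d_f0, dd_f0])       # (15, T) fused feature matrix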
  • asked a question related to Speech Signal Processing
Question
1 answer
If we train our PLDA with microphone data only and test with telephone data, will it affect the system performance?
And if we train with a large amount of microphone data and a small amount of telephone data, how much will the accuracy be affected?
Or should there be a balance between them?
Relevant answer
Answer
1. Recognition always works on the correlation of the data, or you could say the correlation of the features in the data.
2. A larger number of samples will always help you to increase accuracy.
3. The accuracy in your case will depend on what type of features you use.
4. If you want to enhance accuracy, use both time-domain and frequency-domain features; it may slow down your algorithm, but the accuracy will improve. The more features you include, the higher the accuracy.
5. My advice is to work with very simple logic: you need to increase correlation. That can be done by increasing the number of samples; if you cannot increase the samples, you have to increase the number of features for recognition.
  • asked a question related to Speech Signal Processing
Question
2 answers
Any recent state of the art review paper about language detection? 
Relevant answer
Answer
This might be interesting
A hybrid phonotactic language identification system with an SVM back-end for simultaneous lecture translation
  • asked a question related to Speech Signal Processing
Question
4 answers
How can we simulate various spoofing attacks (such as speech synthesis, voice conversion etc.) on speech data for developing a robust Speaker Verification System?
Does there exist any freely available dataset for speaker verification task?
Relevant answer
Answer
Sapan,
The human ear and brain are very good at speech recognition, as are Siri and other computer-based algorithms. To spoof a person's speech, you must first have a good, lengthy sample of the speech and then develop what is essentially a vocal tract model for the human speaker. The model is a transfer function between the vocal cords, air supply, and air flow of the specific human vocal tract and the listener. Of course, the model changes if the speaker is sick, has a cold, a swollen vocal tract, etc. Once you have a vocal tract model and the proper excitation sounds (like the vocal cords), you can create speech that spoofs the speaker. This is not an easy task, because you must learn a lot about how human speech is created.
Please ask another question if you need clarification.
Good luck,
Steve
  • asked a question related to Speech Signal Processing
Question
4 answers
I have samples of the phonemes of the English language. I want the best method of concatenative synthesis, and also the best way to resolve the glitches observed when concatenating small units of speech (phonemes).
Relevant answer
Answer
The way I did it was inspired by an algorithm I found in the book DAFX by Zölzer (2002). Basically, you calculate the cross-correlation between the end of the first phoneme and the beginning of the second and find the point where they correlate the most: this is where you want to concatenate the phonemes/diphones/units. Based on this point, you multiply the end of the first phoneme by a decreasing ramp and the beginning of the second phoneme by an increasing ramp and then just add them up. I hope you understand the idea.
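A rough numpy sketch of that idea; picking the join by a normalized cross-correlation over candidate overlap lengths is my own simplification of the description above, not the exact DAFX algorithm:

import numpy as np

def concatenate_units(a, b, max_overlap):
    # Compare the end of the first unit with the beginning of the second and
    # pick the overlap length at which they match best (normalized correlation).
    scores = []
    for n in range(1, max_overlap + 1):
        tail, head = a[-n:], b[:n]
        denom = np.linalg.norm(tail) * np.linalg.norm(head) + 1e-12
        scores.append(np.dot(tail, head) / denom)
    n = int(np.argmax(scores)) + 1

    # Crossfade: decreasing ramp on the first unit, increasing ramp on the second.
    fade_out = np.linspace(1.0, 0.0, n)
    fade_in = np.linspace(0.0, 1.0, n)
    joined = a[-n:] * fade_out + b[:n] * fade_in
    return np.concatenate([a[:-n], joined, b[n:]])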
  • asked a question related to Speech Signal Processing
Question
3 answers
I am facing some difficulties in constructing an indicator of whether someone has good stress, rhythm and intonation while they are speaking and reading.
Relevant answer
Answer
Dear Suciati Anandes,
First, you have to make an inventory of targeted statements that carefully elicit the suprasegmental features you want to investigate. Subsequently, ask the participants (the selected sample) to read them; naturally, you record their voices at this particular stage. Finally, ask three native speakers of English to rate the recordings based on the accuracy and appropriacy of the production of the suprasegmentals. Good luck with your research.
Best regards,
R.Biria
  • asked a question related to Speech Signal Processing
Question
4 answers
I need to get in touch with someone who has already worked with the Kaldi ASR toolkit for speech recognition.
Relevant answer
Answer
In India, we are working on Indian-language (Tamil and Hindi) speech recognizers based on the Kaldi toolkit.
  • asked a question related to Speech Signal Processing
Question
6 answers
I want to do multi-speaker speech separation. Can anyone suggest a good paper/algorithm/Matlab code for multi-pitch tracking?
Relevant answer
Answer
It is a very challenging task and highly depends upon the number of speakers, the F0 intervals and the dynamics of the speakers.
For example, if there are only one male and one female speaker, you could probably use a band-pass filter over each F0 range and then a standard pitch tracking algorithm (e.g. the Kaldi pitch tracker or SWIPE).
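As a small sketch of the band-pass step (scipy-based; the F0 ranges are only illustrative guesses, and the filtered signals would then be handed to a pitch tracker):

import numpy as np
from scipy.signal import butter, sosfiltfilt

def bandpass(x, sr, f_lo, f_hi, order=4):
    # Zero-phase Butterworth band-pass, e.g. to roughly isolate one speaker's F0 range.
    sos = butter(order, [f_lo, f_hi], btype="bandpass", fs=sr, output="sos")
    return sosfiltfilt(sos, x)

# Hypothetical ranges: lower voice roughly 70-180 Hz, higher voice roughly 160-350 Hz.
# low_voice = bandpass(mixture, sr, 70, 180)
# high_voice = bandpass(mixture, sr, 160, 350)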
  • asked a question related to Speech Signal Processing
Question
6 answers
I need a time-frequency representation that best distinguishes between speech and other auditory events. I have knowledge about spectrograms and scalograms. 
Relevant answer
Answer
I think it may be useful for you.
  • asked a question related to Speech Signal Processing
Question
1 answer
I am coding oral speech data and segment them by pauses.
Pauses can be divided into two categories: clause-internal and clause-external (e.g., Tavakoli et al., 2015). The more clause-internal pauses L2 learners produce, the more likely they are to be assessed as less fluent.
Before this pause segmentation, I adopt AS-unit analysis to measure the complexity of the speech data. And I often see this kind of utterance:
| And (1.2) so it is not interesting |
| indicates AS-unit boundary, and the number in ( ) means a pause duration, according to AS-unit coding system.
In this case, should I regard this pause as a clause-internal one?
Simply because this utterance has only one AS-unit and one clause?
But I'm wondering if I should use another clause definition for break-down fluency (i.e., pause) measurements.
I'd be grateful if you could provide any advice about this.
Relevant answer
Answer
As you know, I’m no expert in phonetics and phonology.  To be honest, I have no idea what the AS-unit coding framework is!
I was just wondering whether the AS-unit coding framework meant the same thing to all researchers.  It appears to me that in our research papers, we often need to explain how we define certain analytical concepts and how we understand certain analytical frameworks, and not only adopt but also adapt these concepts/frameworks.
You asked “if I should use another clause definition for break-down fluency (i.e., pause) measurements”.  The crux of the argument might be what clause definition serves best to answer your research questions.
Hope other people will be involved in our discussion!
  • asked a question related to Speech Signal Processing
Question
2 answers
Hi all,
How can the quantization error due to truncation (ET) for negative fractional numbers represented using ones' complement be bounded by 0 <= ET < 2^-b?
For example, with (b+1) bits (b = 3 for the magnitude, 1 for the sign):
Using ones' complement, the number X = -(1/8)10 is represented by X = (1.110)2. If it is truncated to 1 fractional bit, then X = (1.1)2, i.e. X = -(3/8)10. Therefore the error is given by
ET = Q(X) - X = -3/8 + 1/8 = -(2/8),
which contradicts the error bound for negative fractional numbers in ones' complement representation.
Reference: Digital Signal Processing by Oppenheim & Schafer
Thank you.
Relevant answer
Answer
Thank you, I got it, Rajesh ji. I misinterpreted it. For example, x = (0.100)2 equals (4/8)10; if you truncate it to x = (0.1)2, it still equals (4/8)10 = (1/2)10. This holds for positive numbers, so I assumed it was true for negative numbers too, without giving it a single thought. That was a big mistake, but I have learned from it.
  • asked a question related to Speech Signal Processing
Question
2 answers
What are your considerations when detecting speech? For example: sex (m/f), age, BMI, emotion, pitch, etc.
Relevant answer
Answer
Formant frequencies vary depending on the size and shape of the vocal tract. The first three formants (F1, F2, and F3) are the most important in speech. F1 is correlated with the height of the tongue, i.e. F1 increases as the tongue moves from a high to a low position. F2 is associated with the backness of the sound, i.e. front sounds have higher F2 frequencies than back sounds do. F3 correlates with lip rounding; it decreases as the lips take on a rounded shape.
  • asked a question related to Speech Signal Processing
Question
2 answers
I'm reading the paper on the perceptual magnet effect and trying to figure out how to calculate this value (sigma of s). Could anyone help me?
Relevant answer
Answer
Hi, I've not read the article you refer to, but the variance (2nd moment of a density function or distribution) is usually symbolised as sigma-squared. Standard deviation is often a more useful measure though. If calculated on audio data, SD is referred to as spectral spread which is a perceptually relevant predictor in many contexts. From spectral spread you'll get "spectral variance" by squaring. You can find scripts for Matlab for example in the excellent MIR toolbox (https://www.jyu.fi/hum/laitokset/musiikki/en/research/coe/materials/mirtoolbox)
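For concreteness, a small numpy sketch of the spectral centroid and spread mentioned above, computed on one windowed frame (squaring the spread gives the "spectral variance"):

import numpy as np

def spectral_centroid_and_spread(frame, sr):
    # Treat the normalized magnitude spectrum as a distribution over frequency:
    # the centroid is its mean, the spread its standard deviation.
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spectrum / (spectrum.sum() + 1e-12)
    centroid = np.sum(freqs * p)
    spread = np.sqrt(np.sum(((freqs - centroid) ** 2) * p))
    return centroid, spread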
  • asked a question related to Speech Signal Processing
Question
4 answers
What are the ideal features that are extracted for the voice based automatic person identification application?
Relevant answer
Answer
Dear Yasir Rahmatallah, Alessio Brutti, B. Tomas,
Thank you for the detailed answer. It is exactly what I was trying to do.
  • asked a question related to Speech Signal Processing
Question
5 answers
In particular, during voiced speech production? I am looking to understand the process of speech production in detail.
Relevant answer
Answer
Maybe you can see the chapter:
"Determination of Spectral Parameters of Speech Signal by Goertzel Algorithm", B. Tomas,
Speech Technologies, 01/2011.
 
  • asked a question related to Speech Signal Processing
Question
4 answers
I have implemented an MVDR beamformer for speech signal processing, assuming unity gain in the desired direction (delay vector), but when I check it on speech files the gain seems to increase many-fold during speech regions. Because of this, the speech gets distorted and saturated. I am using a 2-microphone linear array with a 6 cm separation to capture the audio files.
From the problem formulation of MVDR beamforming we assume unit gain in the desired direction. Do I need to multiply the calculated weights by some small constant (fixed/adaptive), as below, in order to control the gain:
w = 0.05 .* ( inv(noise_cor) * c_ ) / ( c_' * inv(noise_cor) * c_ );
or is there some implementation mistake?
Relevant answer
Answer
Hi Arpit,
Assuming that you are using your 2 microphones as a ULA (uniform linear array), the separation of 6 cm / 0.06 m corresponds to a wavelength of 0.12 m. Assuming the speed of sound is 343 m/s, a wavelength of 0.12 m corresponds to a frequency of 2.858 kHz. If the audio signal you are sampling contains frequencies greater than 2.858 kHz, you may want to place the microphones closer to each other (less than 6 cm, maybe 5 cm for up to 3 kHz) so as to prevent signal returns due to grating lobes from interfering with your intended audio signals.
Also, when using MVDR, an array of 2 elements will only provide two degrees of freedom (DOF) for the constraints: (1) preventing distortion of the signal in the desired direction of arrival, and (2) suppressing interference from other directions. Thus, by adding a couple more array elements (microphones), you will be able to obtain better suppression of undesirable signals from other directions.
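For reference, a minimal per-frequency-bin MVDR weight sketch in Python/numpy; the distortionless constraint already fixes the gain toward the look direction to one, so no extra scaling such as the 0.05 in the question should be needed. The geometry, look direction and noise covariance below are placeholders:

import numpy as np

def steering_vector(freq_hz, mic_positions_m, doa_deg, c=343.0):
    # Far-field steering vector for a linear array laid out along one axis.
    delays = mic_positions_m * np.cos(np.deg2rad(doa_deg)) / c
    return np.exp(-2j * np.pi * freq_hz * delays)

def mvdr_weights(noise_cov, steering):
    # w = R^-1 d / (d^H R^-1 d): unit gain toward `steering`, minimum noise power.
    rinv_d = np.linalg.solve(noise_cov, steering)
    return rinv_d / (steering.conj() @ rinv_d)

# Example: 2 mics 6 cm apart, look direction 90 degrees (broadside), 1 kHz bin.
mics = np.array([0.0, 0.06])
d = steering_vector(1000.0, mics, 90.0)
R = np.eye(2)                  # placeholder; use the estimated noise covariance per bin
w = mvdr_weights(R, d)
print(np.abs(w.conj() @ d))    # distortionless check: should print 1.0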
  • asked a question related to Speech Signal Processing
Question
11 answers
I have two speech signals, recorded using two different pieces of recording equipment at two different times. May I know if there is a way to establish that the two signals are really related, irrespective of the way they were recorded?
Relevant answer
Answer
Dear Poornoma,
There are many mathematical measures that capture the relationship between two signals, e.g.
1) Correlation
2) Mutual Information
3) Euclidean distance 
etc...
and some features that help to build a relationship between signals, e.g.
1) Pitch
2) Harmonics
3) Fundamental frequency
4) Energies
and distributions of the above features, e.g. standard deviation, skewness, kurtosis, etc.
  • asked a question related to Speech Signal Processing
Question
3 answers
I'm going to process a speech signal on a TMS320C6713 DSK board as a real-time system, but I have a problem: how do I get reliable values from the speech signal variable? I'm using a sampling frequency (Fs) of 44 kHz.
Relevant answer
Answer
Hi, what exactly is the problem with the reliability? You are using a safe value, 44 kHz.
  • asked a question related to Speech Signal Processing
Question
14 answers
I have used a built-in MFCC algorithm to extract features from the speech signal. According to the MFCC algorithm settings, 13 coefficients should be returned. Each time I get a 13 x n matrix in return, with a different n for different utterances. How do I select 13 MFCC coefficients from the returned matrix?
Relevant answer
Answer
Further to Mikel's answer: 13*(100*s) is a 2D matrix of MFCCs. Typically, researchers take statistics such as the mean, variance, IQR, etc. to reduce the feature dimension. Some researchers model it using multivariate regression, and some fit it with a Gaussian mixture model. It depends on the next step in the processing pipeline.
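A minimal sketch of the statistics-pooling idea just described, which maps a variable-length 13 x n matrix to one fixed-length vector per utterance:

import numpy as np

def pool_statistics(mfcc):
    # mfcc: (13, n) matrix with a variable number of frames n.
    mean = mfcc.mean(axis=1)
    std = mfcc.std(axis=1)
    iqr = np.percentile(mfcc, 75, axis=1) - np.percentile(mfcc, 25, axis=1)
    return np.concatenate([mean, std, iqr])  # 39-dimensional utterance-level vector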
  • asked a question related to Speech Signal Processing
Question
4 answers
I am working on independent component analysis to separate a speech signal from noise. The problem is computing the signal-to-noise ratio after independent component analysis, because the variance of the original speech signal (X) is different from the variance of the estimated source signal (s), where s = WX. The formula for the signal-to-noise ratio that I used is SNR = 10*log10( var(s) / var(s - X) ), but the resulting SNR is incorrect.
Relevant answer
Answer
If s is the noisy estimate and X is the original, clean speech:
SNR = 10*log10( var(X) / var(s - X) ).
Also, I think s = X + W, i.e. the noise is additive.
  • asked a question related to Speech Signal Processing
Question
4 answers
How can each recorded sentence be normalized to a root-mean-square (RMS) level of 70 dB SPL (i.e. the recordings normalized for RMS amplitude to 70 dB SPL)? With Praat, Adobe Audition or Matlab? How can this be done? Thanks.
Relevant answer
Answer
I assume that you are trying to scale different sound files to approximately equal loudness. If that is the case, you can select a Sound object in Praat and, under "Modify", use "Scale intensity" to set a specific dB SPL level (drop me an email if you need a script that normalizes all the files in a directory that way). To determine the actual SPL of the output you can use a sound level meter.
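Outside Praat, a short Python sketch of RMS normalization follows; note that a true dB SPL level requires a calibrated playback chain, so the full-scale reference below is a hypothetical calibration value, and the soundfile package is assumed to be available:

import numpy as np
import soundfile as sf

def normalize_rms(path_in, path_out, target_db=70.0, full_scale_db=91.0):
    # Scale the waveform so that its RMS corresponds to target_db on a chain
    # where a full-scale digital signal plays back at full_scale_db SPL.
    y, sr = sf.read(path_in)
    rms = np.sqrt(np.mean(y ** 2))
    target_rms = 10 ** ((target_db - full_scale_db) / 20.0)
    sf.write(path_out, y * (target_rms / rms), sr)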
  • asked a question related to Speech Signal Processing
Question
8 answers
In speech signal processing, I am coming across these two terms more and more. What are they, actually?
Relevant answer
Answer
There are two types of features of a speech signal:
  • The temporal features (time-domain features), which are simple to extract and have an easy physical interpretation, such as the energy of the signal, zero-crossing rate, maximum amplitude, minimum energy, etc.
  • The spectral features (frequency-based features), which are obtained by converting the time-domain signal into the frequency domain using the Fourier transform, such as the fundamental frequency, frequency components, spectral centroid, spectral flux, spectral density, spectral roll-off, etc. These features can be used to identify notes, pitch, rhythm, and melody.
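A small numpy sketch computing two temporal features and one spectral feature from the lists above, for a single frame:

import numpy as np

def frame_features(frame, sr):
    energy = np.sum(frame ** 2)                                       # temporal: short-time energy
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2                # temporal: zero-crossing rate per sample
    spectrum = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    centroid = np.sum(freqs * spectrum) / (np.sum(spectrum) + 1e-12)  # spectral: centroid in Hz
    return energy, zcr, centroid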
  • asked a question related to Speech Signal Processing
Question
5 answers
Can anyone suggest database sites to download audio files for speech recognition in English, Hindi or Assamese language ? I need a database that is available free of cost. This is for research purposes and I will cite the database site if used in the project work.
Thanks in advance for any help.
Relevant answer
Answer
I haven't worked on speech recognition, but maybe the CMU ARCTIC databases could be useful. 
  • asked a question related to Speech Signal Processing
Question
2 answers
I want to extract the pitch of many files (<100) using Wavesurfer and the RAPT method. I know it is possible to generate a file with the pitch information by opening the audio file and choosing the Save Data File. But I want to perform that automatically. Does anyone know how to perform this?
Thank you very much.
Relevant answer
Answer
Thank you very much for your suggestion mr. Koch. I'm going to try YAAPT as well, but I still need to test RAPT. For now, I'm going to test it in a few signals.
  • asked a question related to Speech Signal Processing
Question
5 answers
I need to assess the effects these channels have on the characteristics of a voice signal, so I need to simulate as best as possible the transmission of the signals (from different databases) through the channel.
What ITU-T recommendations and similar standards should I look for?
What tools would you recommend?
Thank you very much.
Relevant answer
Answer
For TDM you need to simulate the encoding and decoding (G.711 or G.726 for international links). You can learn about acceptable BER from G.821 and G.826. Next you need to model the effects of line echo - see G.168 and the text of the relevant P series documents.
For packet-based (e.g., VoIP) systems, what you need to do is implement the desired compression and decompression (e.g., G.729, G.723, AMR), and then simulate the effects of packet loss on the compressed voice packets. You need to decide whether compressed frames are sent separately or several per packet, and then model the packet loss (the accepted way is to use a Gilbert-Elliott state machine, as detailed in ETSI and ITU-T documents).
If you need to model delay and PDV as well, you could use real traces, or generate traces using a tool like NS2 or NS3.
If you need to simulate acoustic echo, then further work is involved. Read the relevant P series documents.
Y(J)S
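For the packet-loss part, a minimal sketch of a two-state Gilbert-Elliott loss model follows (the transition and loss probabilities are illustrative placeholders, not values taken from any standard):

import numpy as np

def gilbert_elliott_losses(n_packets, p_gb=0.02, p_bg=0.30, loss_good=0.0, loss_bad=0.5, seed=0):
    # Two-state Markov chain: a 'good' and a 'bad' channel state with different loss rates.
    # p_gb: probability of moving good -> bad; p_bg: probability of moving bad -> good.
    rng = np.random.default_rng(seed)
    lost = np.zeros(n_packets, dtype=bool)
    bad = False
    for i in range(n_packets):
        bad = rng.random() < ((1 - p_bg) if bad else p_gb)
        lost[i] = rng.random() < (loss_bad if bad else loss_good)
    return lost  # True where a compressed voice packet should be dropped before decoding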
  • asked a question related to Speech Signal Processing
Question
5 answers
The group delay function can be effectively used for various speech processing tasks only when the signal under consideration is a minimum phase signal.
Relevant answer
Answer
Yes, it is compulsory. The group delay is, to a certain extent, similar to the magnitude spectrum of the signal. Those spikes are due to the wrapped phase and are not genuine, and they have to be avoided.
  • asked a question related to Speech Signal Processing
Question
4 answers
For example if we have speakers within the same accent category does it make the decision process easier or harder compare to the speakers with different accent?
Theoretical explanation and/or empirical result will be appreciated.
  • asked a question related to Speech Signal Processing
Question
4 answers
For sure, there is a well-known theoretical background for pitch estimation, including many interesting academic papers with comparative studies of methods. On the other hand, one knows that reverberant room effects can be handled through signal pre-whitening methods. Nonetheless, my question is addressed to those who, like myself, feel frustrated by the almost erratic performance of pitch estimators on naturally spoken sentences (i.e. normal rhythm) in small reverberant rooms, even after signal pre-whitening. Thus, I would like to know whether someone has successfully experimented with new pragmatic approaches, possibly unconventional ones.
Relevant answer
Answer
You could give ProsodyPro a try to see if it can handle some of your cases: http://www.phon.ucl.ac.uk/home/yi/ProsodyPro/
  • asked a question related to Speech Signal Processing
Question
6 answers
I am trying to assess the degree of degradation that "musical noise" causes in the low frequency bands of the spectrum of speech signals. Perceptually (playing back the treated signal) this artifact is stronger in mid and high frequencies (over 700 Hz), however I need an objective way to confirm or disprove this.
Does anyone have information on this subject or knows a way to evaluate the amount of musical noise present in a signal?
Thank you very much.
Relevant answer
Answer
It is exactly in the frequencies that are weakened by the algorithm, or in the places where the noise has strong peaks and more power.
If the noise is white, then the musical noise is distributed across all frequencies.
  • asked a question related to Speech Signal Processing
Question
2 answers
A speech signal has both voiced and unvoiced portions, but focusing on the transitions that occur in the voiced portion alone: in the voiced regions the source is almost constant and the transitions occur due to the time-varying nature of the system; that is, the source is time-invariant and the vocal-tract system is time-variant.
Relevant answer
Answer
I think you have to detect both the voicing level of the speech signal and the unvoiced portions, as well as the time dependency of the transition.
  • asked a question related to Speech Signal Processing
Question
4 answers
Why is microphone array signal processing more difficult than antenna array signal processing?
Relevant answer
Answer
Compared with antenna array signal processing, microphone array signal processing usually has to deal with near-field, broadband signals, e.g. acoustic signals. In addition, multipath propagation or echo is another difficulty; especially for DOA estimation and beamforming techniques, it is almost a disaster.
  • asked a question related to Speech Signal Processing
Question
3 answers
Generally speech is created with pulmonary pressure provided by the lungs that generates sound by phonation in the glottis in the larynx, then is modified by the vocal tract into different vowels and consonants.
  • asked a question related to Speech Signal Processing
Question
1 answer
Is it meaningful to work with the PSFs in the context of sampling? Does it have any interconnection with wavelets?
Relevant answer
Answer
Robert Marks published the first textbook treatment I used, in 1991 "Introduction to Shannon Sampling and Interpolation Theory". I saw the material in graduate courses in 1981. His references for Fourier methods and the PSF go back to 1961, Landau, "Prolate spheroidal wave functions Fourier analysis and uncertainty", in the Bell Systems Technical Journal. Landau and others in IEEE in the mid 1960's.
Robert Marks' two books: "Introduction to Shannon Sampling and Interpolation Theory", and "Advanced Topics in Shannon Sampling and Interpolation Theory" have a *very* good bibliography.
Wavelet connections, I can't speak to directly. There may be interesting connections to be made, mathematically, but the only one obvious to me is the same advantage you get any time you recognize an orthogonal basis and a transform which provides a simpler analysis in one space than another.
  • asked a question related to Speech Signal Processing
Question
3 answers
With the help of speech processing techniques: LPC, wavelet transforms and windowing techniques with fixed and adjustable filters.
Relevant answer
Answer
I believe the question is much too general to yield reasonable answers.
What is the quality measure being sought?
What is the nature of the underlying sound that requires quality improvement?
For each sound there will be different methods to tackle the problem.
For example, if the sound signal is filled with high-intensity, high-frequency random noise then one might pursue quality improvement in some particular directions.
For a sound that is overcome by periodic sound interference, there would be another direction.
etc.
  • asked a question related to Speech Signal Processing
Question
1 answer
(i.e. using the buffer method that exists in Matlab, or by multiplying the frames by a specific type of window, such as a Hamming window.)
Relevant answer
Answer
Dr. Salih Qaraawi, you will find some aspects of your question addressed in the article accessible via the link below.
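For illustration, a minimal numpy sketch of the windowed-framing step itself (similar in spirit to Matlab's buffer() followed by windowing; the frame length and hop are up to you):

import numpy as np

def frame_signal(x, frame_len, hop, window=np.hamming):
    # Split a 1-D signal into overlapping frames and apply a window to each frame.
    # Returned shape: (n_frames, frame_len).
    n_frames = 1 + (len(x) - frame_len) // hop
    win = window(frame_len)
    return np.stack([x[i * hop:i * hop + frame_len] * win for i in range(n_frames)])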
  • asked a question related to Speech Signal Processing
Question
1 answer
What is the concept of the center of gravity of a speech signal (in both the time and frequency domains), and how is it useful in removing phase mismatches in concatenative speech synthesis?
Relevant answer
Answer
Dear Kuldeep Dhoot, the center of gravity is a function only of the first derivative of the phase spectrum at the origin.
  • asked a question related to Speech Signal Processing
Question
4 answers
I have a sound sample and, by applying a window length of 0.015 s and a time step of 0.005 s, I have extracted 12 MFCC features for 171 frames directly from the sample using a software tool called Praat.
Now I have all 12 MFCC coefficients for each frame. I want to process them further, making a 39-dimensional feature matrix by adding energy features and delta-delta features and applying dynamic time warping. How do I deal with the coefficients, and how do I compute the delta-delta coefficients?
Relevant answer
Answer
It depends on what you need these features for. For example, you can calculate the average value of each MFCC coefficient; then you will have 13 features. To calculate the delta (first derivative) or delta-delta (second derivative) you have to calculate the rate of change, and the acceleration of that change, of the MFCCs over time. Then you can also calculate the average.
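A minimal sketch of the standard regression-based delta computation (appending an energy row to the 12 MFCCs gives 13 base coefficients, and stacking base + delta + delta-delta yields the 39 dimensions mentioned in the question):

import numpy as np

def deltas(features, width=2):
    # d[t] = sum_{k=1..width} k * (c[t+k] - c[t-k]) / (2 * sum_k k^2)
    # features: (n_coeff, n_frames); apply the function twice to get delta-delta.
    padded = np.pad(features, ((0, 0), (width, width)), mode="edge")
    denom = 2 * sum(k * k for k in range(1, width + 1))
    out = np.zeros_like(features, dtype=float)
    for k in range(1, width + 1):
        out += k * (padded[:, width + k:padded.shape[1] - width + k] -
                    padded[:, width - k:padded.shape[1] - width - k])
    return out / denom

# With mfcc_e of shape (13, T) (12 MFCCs plus an energy row):
# feats39 = np.vstack([mfcc_e, deltas(mfcc_e), deltas(deltas(mfcc_e))])  # shape (39, T)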
  • asked a question related to Speech Signal Processing
Question
5 answers
I am trying to identify whether signals measured via an accelerometer on the ankle bone, while someone is talking, are actually derived from his or her own voice transmitted through bone conduction. I initially used the FFT to spectrally analyze both the vocal speech and the bone vibration. I want to prove that both signals are derived from the same source. The harmonics in the FFT are mathematically derived values, whereas standard voice analysis with LPC gives measured values; they are not the same. Is LPC a better tool to use in this case?
Relevant answer
Answer
Dear Abu Dzarr,
why don't you use a coherence analysis? It can be used to estimate the causal relationship between the input and the output.
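A minimal scipy sketch of such a coherence analysis, with synthetic placeholder signals standing in for the microphone and accelerometer recordings (both must be sampled, or resampled, at the same rate):

import numpy as np
from scipy.signal import coherence

fs = 8000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 150 * t) + 0.1 * np.random.randn(fs)              # stand-in for the voice signal
y = 0.3 * np.sin(2 * np.pi * 150 * t + 0.5) + 0.1 * np.random.randn(fs)  # stand-in for the bone-conduction signal

f, Cxy = coherence(x, y, fs=fs, nperseg=1024)
# Cxy close to 1 in the speech band suggests the two signals share a common source.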