Science topic

Speech Acoustics - Science topic

The acoustic aspects of speech in terms of frequency, intensity, and time.
Questions related to Speech Acoustics
  • asked a question related to Speech Acoustics
Question
8 answers
Hi everybody,
Given the different methods of speech feature extraction (IS09, IS10,...), which one do you suggest for EMODB and IEMOCAP datasets?
Relevant answer
Answer
The key to speech emotion recognition is the feature extraction process: the quality of the features directly influences the accuracy of the classification results. If you are interested in typical feature extraction, Mel-frequency cepstral coefficients (MFCCs) are the most widely used representation of the spectral properties of voice signals; you can also try energy, pitch, formant frequencies, Linear Prediction Cepstral Coefficients (LPCC), and modulation spectral features (MSFs).
Regarding your question of whether IS09 or IS10 is better: both work well and there is no big difference between them, but I recommend trying high-level (deep learning) features, which will definitely outperform low-level ones.
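For reference, IS09 and IS10 refer to the openSMILE feature sets of the INTERSPEECH 2009 and 2010 challenges. If you want to prototype simpler low-level features in Python first, here is a minimal sketch assuming librosa is installed; the filename and the utterance-level statistics are placeholders, not part of either standard set:

    import librosa
    import numpy as np

    # Load one utterance at its native sampling rate
    y, sr = librosa.load("emodb_sample.wav", sr=None)

    # Frame-level low-level descriptors: 13 MFCCs, their deltas, and frame energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)
    d_mfcc = librosa.feature.delta(mfcc)                 # first-order deltas
    rms = librosa.feature.rms(y=y)                       # frame energy

    # Crude utterance-level functionals (means and standard deviations)
    features = np.hstack([mfcc.mean(axis=1), mfcc.std(axis=1),
                          d_mfcc.mean(axis=1), [rms.mean()]])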
  • asked a question related to Speech Acoustics
Question
3 answers
I have a project in which I am given a dataset (more than enough) of 10-20 second audio files (singing these "swar"/"ragas": "sa re ga ma pa") without any labels or other annotation in the data. I have to create a deep learning model that recognises which swar is being sung and for how long it is present in the audio clip (the time range of each particular swar: sa, re, ga, ma).
The answers to the questions I am looking for are:
1. How can I achieve my goal? Should I use an RNN, CNN, LSTM or hidden Markov model, or something else such as unsupervised learning for speech recognition?
2. How do I get the correct speech tones for an Indian language, given that most acoustic speech recognition models are tuned for English?
3. How do I find the time range, i.e. for what range a particular sound with a particular swar is present in the music clip, and how do I add that time-range recognition to the speech recognition model?
4. Are there any existing music recognition models that resemble my research topic? If yes, please tag them.
I am looking for a full guide for this project as it is completely new to me, and people who are interested in working with me or guiding me are also welcome.
  • asked a question related to Speech Acoustics
Question
7 answers
Hello everyone,
I am looking for links to audio datasets that can be used for classification tasks in machine learning. Preferably, the datasets should have been described in scientific journals.
Thank you for your attention and valuable support.
Regards,
Cecilia-Irene Loeza-Mejía
Relevant answer
Answer
Hi, I would recommend this website for checking emotional stimuli datasets https://airtable.com/shrnVoUZrwu6riP9b/tbljKUnVvikhzaNvF/viwlo7OvlHBG2q88P?blocks=hide
"KAPODI - the searchable database of free emotional stimuli sets."
Best regards,
Diogo Branco
  • asked a question related to Speech Acoustics
Question
3 answers
Several reports exist on the fundamental frequency of male and female speech. Though they do not all agree, there is a clear trend that the fundamental frequency of men's voices is lower than that of women's. One example: "The voiced speech of a typical adult male will have a fundamental frequency from 85 to 155 Hz, and that of a typical adult female from 165 to 255 Hz."[1]
QUESTION: Is it meaningful to study speech below these frequencies and why?
I am studying speech directivity, and for some reason in the literature the male and female voice seem to be repeatedly compared at 125 Hz, near the male fundamental. This seems nonsensical to me, but maybe there is a good reason for it? I have recorded a fair bit of female speech and I see very little sound energy in this frequency band.
[1] Baken, R. J. (2000). Clinical Measurement of Speech and Voice, 2nd Edition. London: Taylor and Francis Ltd. (pp. 177), ISBN 1-5659-3869-0. That in turn cites Fitch, J.L. and Holbrook, A. (1970). Modal Fundamental Frequency of Young Adults in Archives of Otolaryngology, 92, 379-382, Table 2 (p. 381).
Relevant answer
I find your research interesting. Bear in mind that one objective of conducting research is to support existing theories, and another is to yield results that differ from the available published findings and resources. I believe you would want to choose one of these two options.
  • asked a question related to Speech Acoustics
Question
8 answers
I am searching for studies that investigate speaker normalization in children. For example, I wonder whether children around the age of six years can already normalize acoustic differences between speakers as well as adults do. Any suggestions for literature on this topic?
Looking forward to reading your suggestions.
Relevant answer
Answer
Isabel,
I know of many studies on when phoneme detectors develop but never thought about your question.
Are we even sure that people perform speaker normalization in the same sense that speech recognition systems do (or at least used to before deep learning)?
Since we do know that one of the early layers of human speech recognition processing produces a sequence of best phoneme guesses, and higher layers can force backtracking to try the next best guesses, it is possible to check psychophysically if there is such a normalization at lower layers by checking the processing speed, and if there is a change in processing speed after normalization takes place.
It is also an interesting question how the speaker recognition and speech recognition processes are related. Do we first recognize the speaker and then apply that speaker's phoneme recognizers? I remember reading a paper a few years ago about recognizing accents before recognizing speech (but can't find the reference).
Another clue is Hearing different accents at home impacts language processing in infants from U Buffalo (www.sciencedaily.com/releases/2017/12/171205104127.htm ) which found that infants exposed to multiple accents before 12 months develop different recognition strategies. See also the JASA reference there and
Linguistic processing of accented speech across the lifespan (www.frontiersin.org/articles/10.3389/fpsyg.2012.00479/full).
Finally Language Discrimination by English-Learning 5-Month-Olds: Effects of Rhythm and Familiarity (labfon.letras.ulisboa.pt/personal/sfrota/aeli/Nazzi_Jusczyk_Johnson_2000.pdf) may be of use.
Would love to hear what you learn!
Y(J)S
  • asked a question related to Speech Acoustics
Question
3 answers
I am trying to build a text-to-speech converter from scratch.
For that:
Text 'A' should sound like 'Ayy'
Text 'B' should sound like 'Bee'
Text 'Ace' should sound like 'Ase'
Etc.
So how many sounds in total do I need in order to reconstruct all English words?
Relevant answer
Answer
Maybe this will be useful...
  • asked a question related to Speech Acoustics
Question
5 answers
I conducted a thirty-day experiment with about 10 participants. The experiment was supposed to test whether one can "train" humans to predict the age (a), height (h), and weight (w) of a person from their voice.
I conducted two experiments: 1) regression, where I asked participants to assign a numerical value to each of the 5 speakers' a, h, and w; 2) binary classification, where I asked participants to judge who is taller, heavier, and older from pairs of two speakers.
What kind of interesting analysis can be done from this data?
Relevant answer
Nice Dear Kévin Vervier
  • asked a question related to Speech Acoustics
Question
4 answers
In Python, which is the best way to extract the pitch of speech signals?
I have extracted pitch via "piptrack" in "librosa" and "PitchDetection" in "upitch", but I'm not sure which of these is the more accurate.
Is there a simple real-time, or at least semi-real-time, way of doing this instead?
Relevant answer
Answer
There are many tools for extracting pitch, but none of the fully automatic algorithms I know of can guarantee accuracy and consistency of the extracted f0, especially for continuous f0 trajectories in connected speech. An alternative is to allow human operators to intervene where the automatic algorithms fail. ProsodyPro (http://www.homepages.ucl.ac.uk/~uclyyix/ProsodyPro/) provides such a function. It is a script based on Praat, a program that already includes some of the best pitch extraction algorithms, and it allows human users to handle difficult cases by rectifying the raw vocal pulse markings. It thus maximizes our ability to observe continuous f0 trajectories.
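If you want to stay in Python and can live without manual correction, a minimal sketch using librosa's implementation of the pYIN algorithm follows; the filename and the 65-500 Hz search range are placeholders, and note that pYIN processes whole files, so it is at best semi-real-time:

    import librosa
    import numpy as np

    y, sr = librosa.load("utterance.wav", sr=None)

    # pYIN returns one f0 estimate per frame; unvoiced frames come back as NaN
    f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=500, sr=sr)
    times = librosa.times_like(f0, sr=sr)

    print("mean f0: %.1f Hz" % np.nanmean(f0))
    print("voiced frames: %d of %d" % (np.sum(~np.isnan(f0)), f0.size))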
  • asked a question related to Speech Acoustics
Question
3 answers
Can anyone help me with how to build a phoneme embedding? The phonemes have different sizes in some features; how can I solve this problem?
thank you
Relevant answer
Answer
Yes, you can use an RNN encoder-decoder to produce the phoneme embeddings; that is, the RNN maps each phoneme to the embedding space.
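A minimal sketch of that idea in Python with PyTorch (all layer sizes are arbitrary placeholders): a GRU encoder reads the variable-length frame sequence of one phoneme, and its final hidden state serves as a fixed-size embedding, which sidesteps the different-size problem.

    import torch
    import torch.nn as nn

    class PhonemeEncoder(nn.Module):
        def __init__(self, n_features=13, emb_dim=64):
            super().__init__()
            # GRU over a variable-length sequence of acoustic frames (e.g. MFCCs)
            self.rnn = nn.GRU(n_features, emb_dim, batch_first=True)

        def forward(self, frames):        # frames: (batch, n_frames, n_features)
            _, h = self.rnn(frames)       # h: (1, batch, emb_dim)
            return h.squeeze(0)           # fixed-size embedding per phoneme

    encoder = PhonemeEncoder()
    segment = torch.randn(1, 27, 13)      # one phoneme segment: 27 frames of 13 MFCCs
    embedding = encoder(segment)          # shape (1, 64), regardless of segment length

In a full encoder-decoder, a decoder would be trained to reconstruct the frames (or predict the phoneme label) from this vector, so that the embedding space becomes meaningful.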
  • asked a question related to Speech Acoustics
Question
6 answers
I have trained an isolated spoken-digit model for 0-9. My speech recognition system recognizes isolated digits like 0, 1, 2, ..., 9, but it fails to recognize connected digit strings like 11, 123, 11111, etc. Can anyone please help me extend the system from isolated digits to connected digits?
Relevant answer
Answer
Segmentation of naturally spoken speech into words, even when there is a relatively small dictionary of words, is a harder problem than recognizing isolated digits.
People tend to think of spoken words as somehow isolated but "close" in time. This is not the case, unless you have a cooperating speaker (who helps the detection, or at least monitors it and repeats when it misdetects).
You can easily find in the literature the standard end-point detection mechanisms people use (mostly Viterbi based), and then run the isolated word detectors, but they are computationally expensive and don't really work very well for natural speech (the possible exception was flexible endpoint DTW, but I doubt that you are using DTW as a detector).
Y(J)S
  • asked a question related to Speech Acoustics
Question
6 answers
I have two speech signals coming from two different people. I want to find out whether or not both people are saying the same phrase. Is there anything that I can directly measure between the two signals to know how similar they are?
Relevant answer
Answer
It sounds simple, but unfortunately it is not!
There are many confounding factors that make this process complicated. Here are some examples: suppose you have a recording of your own voice, made in a soundproof room, saying "OPEN THE DOOR", and you would like to use that recording as the reference against which other voice commands are compared in order to take an action, for example opening the door.
  • Now, if you utter the same utterance but in a noisy environment, the two recordings are no longer the same.
  • If you change the room and record it in a reverberant room, the two signals are no longer the same.
  • If you say the same sentence but at a different speed (speech rate) than the reference, the two signals are no longer the same.
  • If you utter the same sentence but with a different rhythm than the reference, again, the two signals are no longer the same.
  • Now, consider that all or some of the above-mentioned factors happen at the same time. Again, the two signals are no longer the same.
  • Now, imagine that you want to compare your reference signal with another person's recording of the same sentence. Even if both recordings are made under similar environmental conditions (same room, same equipment) with the same rhythm and rate, the two recordings are still not the same.
  • Age, gender, and health condition are other confounding factors that influence the signal.
Computing the formants of the two signals and comparing them with some similarity measure would be a very simple and quick solution, but unfortunately it does not give good results: for example, the similarity score of two completely different sentences recorded in the same acoustic environment can be higher than that of two roughly similar sentences recorded in different environments, or of a second recording in which the speaker utters the same words as the reference but in a different order.
To deal with these factors and variabilities, you might need a model (such as a hidden Markov model or a Gaussian mixture model) to capture the acoustic characteristics of the signals (in some relevant feature space such as the cepstral or time-frequency domain) and to relate segments of the signal to language units, and you also need a language model to link those units and recognize the sentence. All of these procedures are covered by the field of speech recognition.
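That said, if a rough first-pass similarity score is still useful, one common baseline is to compare MFCC sequences with dynamic time warping, which at least absorbs differences in speaking rate; a minimal sketch in Python with librosa (filenames, sampling rate and distance metric are placeholders):

    import librosa

    def dtw_distance(path_a, path_b):
        ya, sra = librosa.load(path_a, sr=16000)
        yb, srb = librosa.load(path_b, sr=16000)
        # MFCC sequences, one column per frame
        ma = librosa.feature.mfcc(y=ya, sr=sra, n_mfcc=13)
        mb = librosa.feature.mfcc(y=yb, sr=srb, n_mfcc=13)
        # Accumulated DTW cost; smaller means the two utterances are more alike
        D, wp = librosa.sequence.dtw(X=ma, Y=mb, metric="cosine")
        return D[-1, -1] / len(wp)        # normalise by warping-path length

    print(dtw_distance("speaker1_phrase.wav", "speaker2_phrase.wav"))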
  • asked a question related to Speech Acoustics
Question
8 answers
Hi,
I am looking for free speech databases for speaker recognition (at least more than 50 speakers) Do you have any suggestions?
Relevant answer
Answer
Most speaker verification databases, like the NIST ones, are paid. However, there are a couple of freely available databases, such as VoxCeleb (http://www.robots.ox.ac.uk/~vgg/data/voxceleb/) and SITW (http://www.speech.sri.com/projects/sitw/).
Hope that helps!
  • asked a question related to Speech Acoustics
Question
3 answers
For my bachelor thesis, I would like to analyse the voice streams of a few meetings of 5 to 10 persons.
The goal is to validate some hypotheses linking the distribution of speaking time to workshop creativity. I am looking for a tool that can be implemented easily and without any extensive knowledge of signal processing.
Ideally, I would like to feed the tool an audio input and get the time segments of each speaker, either graphically or in matrix/array form.
- Diarization does not need to be real-time.
- The source can be single- or multi-stream (we could install microphones on each participant).
- The process can be (semi-)supervised if need be; we know the number of participants beforehand.
- The tool can be a Matlab, .exe, Java, or similar file. I am open to suggestions.
Again, I am looking for the simplest, easiest-to-install solution.
Thank you in advance
Basile Verhulst
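For what it is worth, a minimal diarization sketch in Python, assuming the pyannote.audio package and its pretrained pipeline are available (the model name, the possible need for a Hugging Face access token, and "meeting.wav" are assumptions/placeholders):

    from pyannote.audio import Pipeline

    # Pretrained speaker-diarization pipeline (may require a Hugging Face token)
    pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

    diarization = pipeline("meeting.wav")

    # One line per speaker turn: start time, end time, speaker label
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        print(f"{turn.start:7.1f}s  {turn.end:7.1f}s  {speaker}")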
  • asked a question related to Speech Acoustics
Question
5 answers
I am working on emotion detection from speech.
Relevant answer
Answer
Hi Tohidul , there are quite a few:
  1. RAVDESS - 7356 files, English, 8 emotions, two emotional intensities, speech & song, 319 raters.
  2. CREMA-D - 7442 files, English, 6 emotions, speech, 2443 raters. See https://ieeexplore.ieee.org/abstract/document/6849440
  3. The Toronto Emotional Speech Set by Dupuis and Pichora-Fuller - https://tspace.library.utoronto.ca/handle/1807/24487/browse?type=title&submit_browse=Title
  4. Check out this enormous list in "Handling Emotions in Human-Computer Dialogues" - https://link.springer.com/content/pdf/bbm%3A978-90-481-3129-7%2F1.pdf
  • asked a question related to Speech Acoustics
Question
3 answers
A phonetically balanced (PB) word list in Malayalam is available in the ISHA battery; however, the full article in which the list was published isn't available, so we couldn't find out whether or not a psychometric function was established for the list. It would be helpful if someone could suggest another PB word list in Malayalam for which the psychometric function was done, or direct me to the original article in which the list was published.
Thank you.
Relevant answer
Answer
What is the evidence that testing with English word lists is not valid? Patients and testers are naturally reluctant about using English lists with foreign speakers, but in my experience meaningful results can be obtained.
  • asked a question related to Speech Acoustics
Question
2 answers
Because the result of noisy speech filtering strongly depends on how the silence-interval detection problem is solved.
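A minimal energy-based sketch for locating silence intervals in Python (the threshold and frame sizes are arbitrary placeholders; a genuinely noise-robust VAD needs more than this):

    import numpy as np
    import librosa

    y, sr = librosa.load("noisy_speech.wav", sr=None)

    frame, hop = 512, 256
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    times = librosa.frames_to_time(np.arange(len(rms)), sr=sr, hop_length=hop)

    # Frames well below the median energy are treated as silence
    silence = rms < 0.3 * np.median(rms)

    # Collapse consecutive silent frames into (start, end) intervals in seconds
    intervals, start = [], None
    for t, s in zip(times, silence):
        if s and start is None:
            start = t
        elif not s and start is not None:
            intervals.append((start, t))
            start = None
    if start is not None:
        intervals.append((start, times[-1]))

    print(intervals)   # these segments can feed the noise estimate of the filter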
  • asked a question related to Speech Acoustics
Question
3 answers
I am trying to analyze the frequency mean, range, and variability of a speaker reading a passage aloud. I am using Praat and a Matlab script I am writing to analyze these. The common threshold in Praat is 75 Hz to 300 Hz for a male speaking voice and 100 Hz to 500 Hz for a female speaking voice. I want to make sure I am obtaining the most accurate fundamental frequencies of their voices, not higher frequencies from breaths or the ends of words. Does anyone with experience in these analyses have more accurate threshold criteria, or are these Praat thresholds suitable?
Relevant answer
Answer
I personally prefer Praat scripts
Good luck
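If you want to drive the same Praat settings from a script and pull the values into your Matlab/Python pipeline, here is a minimal sketch with the praat-parselmouth Python package, using the male-voice thresholds you mention (the filename is a placeholder):

    import parselmouth

    snd = parselmouth.Sound("male_passage.wav")

    # Same floor/ceiling as the Praat settings quoted for a male speaking voice
    pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=300.0)
    f0 = pitch.selected_array["frequency"]
    f0 = f0[f0 > 0]                 # drop unvoiced frames

    print("mean %.1f Hz, range %.1f-%.1f Hz, sd %.1f Hz"
          % (f0.mean(), f0.min(), f0.max(), f0.std()))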
  • asked a question related to Speech Acoustics
Question
4 answers
Hi all, 
I would like to ask the experts here for a better view of the use of cleaned signals from which the echo has already been removed using a few types of adaptive algorithms for AEC (acoustic echo cancellation).
How significant are MSE and PSNR for improving the classification process? Normally we evaluate using WER, accuracy, and maybe EER too. Is there any connection between MSE and PSNR values and improvements in those classification metrics?
I wish to have clarification on this.
Thanks much
Relevant answer
Answer
This is one of the old issues in the speech recognition research field: the relationship between speech enhancement techniques and classification accuracy.
As far as I know, both MSE and PSNR are frequently used for improving the quality of the input, and they are known to be useful in reducing WER. However, the relationship with recognition accuracy is not directly proportional.
Enhancing a noisy signal in terms of MSE or PSNR means you may obtain a good-quality input, but there is a risk: sometimes the speech enhancement technique produces unexpected artifacts, and in the worst case WER can increase.
So, in a phonemic classification task, a matched condition is more crucial. In the case of a mismatched condition between training and test, MSE and PSNR are somewhat related to WER, but not directly. It is a case-by-case matter.
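For completeness, the two signal-level metrics themselves are straightforward to compute; a minimal sketch in Python, assuming the clean and enhanced signals are NumPy arrays of equal length:

    import numpy as np

    def mse(clean, enhanced):
        clean, enhanced = np.asarray(clean, float), np.asarray(enhanced, float)
        return np.mean((clean - enhanced) ** 2)

    def psnr(clean, enhanced):
        # Peak signal-to-noise ratio in dB, with the clean signal's peak as reference
        peak = np.max(np.abs(np.asarray(clean, float)))
        return 10.0 * np.log10(peak ** 2 / mse(clean, enhanced))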
  • asked a question related to Speech Acoustics
Question
3 answers
I am looking for the average frequency values of RP consonants in order to compare them with the accented English consonants pronounced by the students.
Relevant answer
Answer
A.C.Gimson - An Introduction to the Pronunciation of English
  • asked a question related to Speech Acoustics
Question
1 answer
I am working on voice conversion techniques and I would like to know how to perform dynamic time warping in parallel for three utterances.
Relevant answer
Answer
Since DTW works only on pairs of (potentially multidimensional) time series, if you want all three utterances aligned with DTW, you can only achieve that by aligning two of the signals separately to the third. To choose the best reference signal, you may try each of them and select either the one that leads to the smallest amount of time warping (DTW outputs a cost value measuring the amount of warping) or the one that leads to the smallest error when comparing the aligned signals with the reference.
hope it helps
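A minimal sketch of that reference-selection idea in Python with librosa (filenames, features and the cost normalisation are placeholders):

    import librosa

    def alignment_cost(ref_mfcc, other_mfcc):
        # Accumulated DTW cost of aligning `other` to `ref`, per path step
        D, wp = librosa.sequence.dtw(X=ref_mfcc, Y=other_mfcc)
        return D[-1, -1] / len(wp)

    mfccs = []
    for path in ["utt1.wav", "utt2.wav", "utt3.wav"]:
        y, sr = librosa.load(path, sr=16000)
        mfccs.append(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13))

    # Pick as reference the utterance whose total cost to the other two is smallest
    costs = [sum(alignment_cost(mfccs[i], mfccs[j]) for j in range(3) if j != i)
             for i in range(3)]
    reference = costs.index(min(costs))
    print("reference utterance:", reference, "costs:", costs)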
  • asked a question related to Speech Acoustics
Question
5 answers
Is there any lower limit on the length of a room impulse response for it to degrade the performance of speech recognition?
Usually, for the reverberation to be perceptible by listening, the room impulse response should be around 1000 coefficients long. My question is: if the room impulse response is shorter than that, does it still affect the performance of speech recognition tools such as PocketSphinx?
Is there any literature where such an analysis has been done?
Relevant answer
Answer
Your question is ill-posed.
It depends on a variety of factors: acoustic models, language models, recognition task, ASR front end, ASR back end, etc.
To simplify: if you have close-talking models, a 50 ms RIR (800 samples at 16 kHz) probably won't affect ASR performance much.
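If you would rather measure it than guess, here is a minimal sketch in Python that convolves clean speech with a truncated RIR, so the resulting files can be fed to PocketSphinx (or any recogniser) and the WERs compared; the filenames and the 50 ms cut are placeholders:

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    speech, sr = sf.read("clean_utterance.wav")
    rir, sr_rir = sf.read("room_impulse_response.wav")
    assert sr == sr_rir

    # Truncate the RIR to 50 ms (800 samples at 16 kHz) before convolving
    rir_short = rir[: int(0.050 * sr)]
    reverberated = fftconvolve(speech, rir_short)[: len(speech)]
    reverberated /= np.max(np.abs(reverberated))    # avoid clipping on write

    sf.write("reverberated_050ms.wav", reverberated, sr)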
  • asked a question related to Speech Acoustics
Question
4 answers
There exists more than one type of Mandarin tone sandhi.
The most basic is the Tone 3 possibly changing to Tone 2 under certain conditions.
Also, there is the half tone phenomenon.
Is there some place that assembles descriptions of all these things?
I know of some of the work of Duanmu San, Zhiming Bao, Moira Yip, and Yuenren Chao. Are there other bodies of related work?
Relevant answer
Answer
Dear Jim,
if you're also interested in experimental work on tone sandhi in Mandarin you may want to check out the papers below:
  • asked a question related to Speech Acoustics
Question
3 answers
When you tighten your Thyrovocalis muscle are you changing your harmonic frequency or your formant frequency?
Relevant answer
Answer
I agree with Harm. Vocal cord/fold action determines the fundamental frequency (F0); the F0 action generates the harmonics, whose relative strength is determined by the state of the folds: thicker, shorter folds produce stronger and more numerous harmonics, which are then filtered by the vocal tract.
                               Richard Dean, Ohio University/Stanford U
  • asked a question related to Speech Acoustics
Question
3 answers
How much of the fMRI BOLD signal from a task requiring overt speech is lost because of head movement or articulation artifacts? Is there a risk that too much correction leads to unreliable conclusions?
Relevant answer
Answer
Hi,
A member of my old research group investigated overt speech with fMRI. Check her publications: Brendel, Bettina.
E.g.:
Brendel, B., Hertrich, I., Erb, M., Lindner, A., Riecker, A., Grodd, W., et al. (2010). The contribution of mesiofrontal cortex to the preparation and execution of repetitive syllable productions: An fMRI study. NeuroImage, 50, 1219-1230
  • asked a question related to Speech Acoustics
Question
8 answers
I am planning to record experimental data in a reasonably quiet, but not sound-proof environment. Intensity analysis, especially spectral tilt, should be possible. I have read about the proximity effect that could be a problem. Is it better to use a dynamic or a condenser mic? Should it be unidirectional or omnidirectional?..
Relevant answer
Answer
Hi Dina, 
Have a look at the study by Jan Svec and Svante Granqvist entitled "Guidelines for Selecting Microphones for Human Voice Production Research"
And also Sramkova et al 2015 entitled "The softest sound levels of the human voice in normal subjects" 
I believe these two articles will really help with your selection. 
Good luck
  • asked a question related to Speech Acoustics
Question
12 answers
Much work has been done on ASR, but none (to the best of my knowledge) has been carried out on the Yorùbá language. Yorùbá is a tonal language in which words with the same phonemes can have different meanings, e.g. igbá - "calabash", igba - "200", ìgbà - "period", ìgbá - "garden egg", igbà - "climbing rope". This makes recognition difficult. Please kindly assist if you know of any solution or a guide. Your contributions will be highly appreciated.
Relevant answer
Answer
There have been some recognition experiments on Yoruba e.g. A Review of Yorùbá Automatic Speech Recognition, Yusuf et al, 2013 IEEE 3rd International Conference on System Engineering and Technology, 19 - 20 Aug. 2013, Shah Alam, Malaysia; CROSS-LANGUAGE MAPPING FOR SMALL-VOCABULARY ASR IN UNDER-RESOURCED LANGUAGES: INVESTIGATING THE IMPACT OF SOURCE LANGUAGE CHOICE, Vakil and Palmer, SLTU 2014. Typically the language model is used to help differentiate between acoustically similar but distinct language contexts. The IARPA Babel programme has been looking at recognition and keyword spotting of low resource languages which covers a wide range of types of languages and should provide some useful information for you. There have been a number of papers published at Interspeech, ICASSP, SLTU and ASRU. The Kaldi toolkit provides a recipe to build a state-of-the-art ASR system for Babel languages. Depending on your experience and the data set(s) available for training your acoustic and language models this may be a good place to start.
  • asked a question related to Speech Acoustics
Question
9 answers
Acoustic-phonetic production experiments often report relative segment durations (rather than absolute durations), mostly because relative durations are less prone to influences from speaking rate.
Typical reference units for normalization in the literature are:
1) units that contain the target segment (e.g., the syllable, the word, the phrase)
2) units that are adjacent to the target segment (e.g., sounds or words to the right or left)
3) the average phone duration in the respective phrase
Depending on the structure of the utterance and/or the nature of the target segment (e.g., phonemically long vs. short), differences across experimental conditions may appear larger or smaller (depending on whether the duration of the reference unit is negatively or positively correlated with the duration of the target).
Are there theoretical considerations that speak for (or against) one of those units of reference? Or do we need perception data in order to decide which relative measure participants are sensitive to? Should we always collect recordings in different speech rates in order to identify relative durations that are not (or least) influenced by the speaking rate manipulation?
Relevant answer
Answer
Hi Bettina,
if you have enough data from your subjects, z-scoring, but separately for each phoneme class, may be an option. I've used phoneme-specific z-scores for recognizing prosodic boundaries and pitch accented syllables. This takes into account Hartmut's finding, mentioned by Susanne above, that different phonemes are affected differently. Nick Campbell, by the way, came up with the term elasticity -- phonemes differ in their elasticity, and he introduced the phoneme-specific z-scores to model phoneme durations in synthesis in a paper in 1992.
Antje
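A minimal sketch of phoneme-specific z-scoring in Python with pandas, assuming a table with one row per segment and (placeholder) columns "phoneme" and "duration":

    import pandas as pd

    df = pd.read_csv("segment_durations.csv")    # columns: phoneme, duration

    # z-score each duration against the mean/sd of its own phoneme class
    df["z_duration"] = (df.groupby("phoneme")["duration"]
                          .transform(lambda d: (d - d.mean()) / d.std()))

    print(df.head())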
  • asked a question related to Speech Acoustics
Question
3 answers
Hello everyone! I would like to hear any suggestions you might have regarding the equipment I can use for blind speaker separation experiments in real rooms (real-time as well as offline). Have any of you set up such experiments? What kind of microphones have you used, which acquisition sound card, etc.?
Thanks in advance!
Relevant answer
For a cheap setup for 2-by-2 cases: Behringer ECM8000 measurement microphones and any USB interface; in particular, I have used an Alesis iO2 Express 2-channel USB interface. You will need to adapt a stand to mount the microphones.
Use a notebook for recording, unplugged from the wall to avoid power-line noise. Get a good characterization of the room's acoustics: you can measure impulse responses using the TSP or MLS method and then apply Schroeder backward integration to estimate the power decay curves. I have made some software for this; you can find it here: http://ldipersia.wikidot.com/software#revtime
If you have a soundproof room you can get better recordings. You can add drapery and carpets to reduce the reverberation time, and plywood panels to increase it, so you can produce different environments for the experiments.
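A minimal sketch of the Schroeder backward-integration step in Python, assuming the measured impulse response is already in a NumPy array:

    import numpy as np

    def schroeder_edc_db(h):
        # Energy decay curve in dB from an impulse response, by backward integration
        energy = np.cumsum(h[::-1] ** 2)[::-1]     # integrate the squared IR from the tail
        return 10.0 * np.log10(energy / energy[0])

    # The reverberation time can then be estimated from the decay slope, e.g. by
    # fitting a line between -5 dB and -35 dB and extrapolating to -60 dB (T30).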
  • asked a question related to Speech Acoustics
Question
11 answers
I have two speech signals, recorded with two different pieces of recording equipment and at two different times. May I know if there is a way to establish that the two signals are really related, irrespective of the way they were recorded?
Relevant answer
Answer
Dear Poornoma,
There are many mathematical measures of the relationship between two signals, e.g.:
1) Correlation
2) Mutual Information
3) Euclidean distance 
etc...
and some features that help to build a relationship between signals, e.g.:
1) Pitch
2) Harmonics
3) Fundamental frequency
4) Energies
and the distributions of the above features, e.g. standard deviation, skewness, kurtosis, etc.
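A minimal sketch of the correlation-based measures in Python, assuming both recordings are already loaded as NumPy arrays at the same sampling rate and trimmed to comparable lengths:

    import numpy as np
    from scipy.signal import correlate

    def similarity(x, y):
        n = min(len(x), len(y))
        x = x[:n] - np.mean(x[:n])
        y = y[:n] - np.mean(y[:n])

        pearson = float(np.corrcoef(x, y)[0, 1])          # zero-lag correlation

        xc = correlate(x, y, mode="full")                 # cross-correlation over all lags
        best_lag = int(np.argmax(np.abs(xc))) - (n - 1)   # lag with maximum similarity

        return pearson, best_lag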
  • asked a question related to Speech Acoustics
Question
3 answers
In Ghana, I have observed that the phoneme /j/ is realized as /dz/; /y/ or /Ʒ/ in speeches by individuals. I have also noticed that the difference in the realizations depends on either the absence or the presence of the target phoneme in the learners’ speech (i.e. transfer errors). Where it is present but the realizations are not the same, the learner tries to articulate the phoneme as a phoneme he/ she already knows. Where the target phoneme does not exist in the already known languages of the learner, he or she tries to make a substitution with another phoneme that exists in his or her linguistic repertoire. Can someone share with me some example of phonemic variations that he or she has noticed in their students’ speeches? Are the reasons for the variations different from what I have stated?
Relevant answer
Answer
Thank you Prof. Ivleva and  Prof. Prunescu. Prof. Ivleva, I very much like the historic insight given at the website. Prof. Prunescu, your point is well noted. It has to do with geographical locations. I am thankful to both of you. I am working on variations in Ewe (i.e. a local language) and your points are very useful to me.
  • asked a question related to Speech Acoustics
Question
6 answers
I am able to access the transcripts but I am unable to access the audio files even on free online corpora webpages. Could anyone tell me how to access both transcripts as well as audio files together?
Relevant answer
Answer
Sir, you can write to John M. Swales, who was instrumental in developing MICASE; he responds to queries. Generally we get access to the transcripts only; the audio databases are not shared. There is also Dr Claudia from Germany (Dresden), who collected a lot of samples from Indian users of English. Her contact is also useful.
  • asked a question related to Speech Acoustics
Question
3 answers
In my work, I want to use a Gaussian mixture model for speaker identification. I use Mel-frequency cepstral coefficients (MFCC) as features for the training and testing speech signals, and I use obj = fitgmdist(X, K) to estimate the parameters of the Gaussian mixture model for the training speech. I then use [p, nlogl] = posterior(obj, testdata) and choose the minimum nlogl to indicate the maximum similarity between the reference and testing models, as shown in the attached Matlab file.
The problem with my program is that the minimum nlogl changes, and it recognizes a different speaker even when I use the same testing speech signal. For example, when I run the program the first time, it decides that the first testing speaker has the maximum similarity with the training speech signals (I=1), and when I run the program again on the same testing speech, I get that the fifth testing speaker has the maximum similarity with the training model. I do not know what the problem in the program is, or why it gives a different speaker when I run it three times on the same testing speech signal. Can anyone who specializes in speaker recognition systems and Gaussian mixture models answer my question?
With best regards
Relevant answer
Answer
I would suggest testing the PRTools toolbox for Matlab.
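One likely cause of the behaviour described above is that GMM fitting starts from a random initialisation, so each training run can converge to a different local optimum unless the random seed is fixed (in Matlab, seeding rng or using the 'Replicates' option of fitgmdist serves this purpose). A minimal sketch of the same pipeline in Python with scikit-learn, where random_state pins the initialisation; the file names and component count are placeholders:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    train_mfcc = np.load("speaker1_mfcc.npy")   # (n_frames, n_coeffs) for one speaker
    test_mfcc = np.load("test_mfcc.npy")        # frames of the unknown utterance

    # Fixing random_state makes the fit, and hence the scores, reproducible
    gmm = GaussianMixture(n_components=16, covariance_type="diag",
                          random_state=0).fit(train_mfcc)

    # Average log-likelihood of the test frames under this speaker's model;
    # choose the speaker whose model gives the highest score (smallest nlogl)
    print(gmm.score(test_mfcc))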
  • asked a question related to Speech Acoustics
Question
1 answer
We want to do instantaneous frequency-altered feedback (for stuttering patients) and thought we could implement it using MATLAB. Unfortunately, there is not much precise information out there besides pvoc, which does not give us the expected quality. We have also been looking at GitHub, but the solutions were usually not well documented or not complete (missing subfunctions). Any ideas? A ready-built GUI-based app would be perfect, since student assistants have to use it!
Relevant answer
Answer
For catching a particular frequency component, the Goertzel algorithm or the EPLL technique can be useful. I came across them while doing a literature review.
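For reference, the Goertzel algorithm itself is only a few lines; a minimal sketch in Python that returns the power of a block of samples at one target frequency:

    import math

    def goertzel_power(block, fs, f_target):
        # Power of `block` at the DFT bin nearest to f_target (Hz), sampled at fs
        n = len(block)
        k = int(0.5 + n * f_target / fs)
        w = 2.0 * math.pi * k / n
        coeff = 2.0 * math.cos(w)

        s_prev = s_prev2 = 0.0
        for x in block:
            s = x + coeff * s_prev - s_prev2
            s_prev2, s_prev = s_prev, s

        return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2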
  • asked a question related to Speech Acoustics
Question
6 answers
I am looking for books, which could help me write my BA thesis on differences between Standard Scottish English and Received Pronunciation. Are there any easily accessible books on that topic?
Relevant answer
Answer
Hi Paulina,
Your best bet for a description of "Standard Scottish English" is:
Robinson, C. & Crawford, C. A. (2001). Scotspeak: A guide to the pronunciation of modern urban Scots. Perth: Scots Language Resource Centre.
The comparison with Received Pronunciation is something you'll have to do yourself!
Lachlan
  • asked a question related to Speech Acoustics
Question
2 answers
I want to assess the nasality in two groups of children undergoing cleft palate surgery (2 different techniques) and compare the difference. Are there any simple ways of clinically assessing nasality?
Relevant answer
Dear Kiran,
Check this paper out if it gives you an idea:
Kummer, A. W. (2006, February 7). Resonance Disorders and Nasal Emissions: Evaluation and Treatment using "Low Tech" and "No Tech" Procedures.
  • asked a question related to Speech Acoustics
Question
4 answers
My recent perception experiments (The Tension Theory of Pitch Production and Perception) show that the physical dimension (length of a string or diameter of a membrane) is an inherent source of force for the vibrating body. I am now trying to see how tension meters measure the tension of a string since they do not seem to take all the physical dimensions of the string (or other bodies) into account.
Relevant answer
Answer
Thank you very much, Pooya. I was told the same thing by the Department of Astronomy, but my investigations came up with very different results. I looked at your list of publications and noted that you must be in really advanced physics and mathematics (if I'm not wrong). I am looking at the tension issue from the standpoint of auditory psychophysics and not in terms of pure physics. The document I'm referring to is my paper The Tension Theory of Pitch Production and Perception; this is the work that led to all the questions I am asking. If you are not in hearing research, the paper may not be relevant to your area of expertise...
Thank you once again for your help on the matter.
  • asked a question related to Speech Acoustics
Question
3 answers
I remember having seen some papers discussing a hybrid speech enhancement method. The hybrid method takes the form of a weighted sum of several individual methods; by considering the errors of the different methods to be uncorrelated, better noise reduction performance can be achieved with the hybrid method. If you happen to know such publications, please help me. Thank you very much!
Relevant answer
Answer
I know it is not the same problem, but a similar approach which might be relevant to you is that used to improve sound source separation results in:
Xabier Jaureguiberry, Gaël Richard, P. Leveau, Romain Hennequin and E. Vincent (2013), Introducing a Simple Fusion Framework for Audio Source Separation, Machine Learning for Signal Processing (MLSP), Southampton, UK
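As a worked illustration of the weighted-sum idea: if the residual errors of the individual enhancers are uncorrelated with variances sigma_i^2, then inverse-variance weights minimise the variance of the combined error. A minimal sketch in Python, assuming the enhancer outputs are already time-aligned NumPy arrays and that the error variances can be estimated (e.g. from noise-only segments):

    import numpy as np

    def fuse(outputs, error_vars):
        # Weighted sum of enhancer outputs using inverse-variance weights
        outputs = np.asarray(outputs, float)        # shape: (n_methods, n_samples)
        w = 1.0 / np.asarray(error_vars, float)
        w /= w.sum()                                # weights sum to 1
        return np.tensordot(w, outputs, axes=1)     # weighted sum over methods

    # Example: method 1 judged noisier (var 0.04) than method 2 (var 0.01)
    # fused = fuse([enhanced1, enhanced2], [0.04, 0.01])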
  • asked a question related to Speech Acoustics
Question
9 answers
It's often problematic to squeeze vowels into a 2D vowel space, because the differences caused by F3 and the higher formants are lost (even when using F2'). I'd like something that produces better-looking results than DPlot (used for the attached file), with more freedom in axis orientation.
Relevant answer
Answer
A lot of thanks Marzena! I wish I could help you with something, too! :)
Best,
Juris
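For anyone with the same need, a minimal 3D vowel-space sketch in Python with matplotlib, which allows free rotation of the view; the formant values and labels are placeholder numbers:

    import matplotlib.pyplot as plt

    # Placeholder F1/F2/F3 values (Hz) for a handful of vowel tokens
    f1 = [300, 320, 700, 650, 400]
    f2 = [2300, 2100, 1200, 1000, 800]
    f3 = [3000, 2900, 2600, 2500, 2400]
    labels = ["i", "I", "a", "o", "u"]

    fig = plt.figure()
    ax = fig.add_subplot(projection="3d")
    ax.scatter(f2, f1, f3)
    for x, y, z, lab in zip(f2, f1, f3, labels):
        ax.text(x, y, z, lab)

    ax.set_xlabel("F2 (Hz)")
    ax.set_ylabel("F1 (Hz)")
    ax.set_zlabel("F3 (Hz)")
    ax.invert_xaxis()        # conventional vowel-chart orientation
    ax.invert_yaxis()
    plt.show()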
  • asked a question related to Speech Acoustics
Question
1 answer
I have calculated the fundamental frequency and the first and second formants using the spectral peak picking approach, and I'd like to know if there is a way to measure how much relation there is between the harmonics of f0 and the first formant. How can I know that, with the spectral peak picking approach, harmonics are not mistakenly chosen instead of the correct F1 value?
Thanks
Relevant answer
Answer
Hi Saameh,
It would be good if you checked your formant results (as well as the f0 results) for their fit with the values given in the literature for the respective vowels and the gender of the speaker (f0 for female speakers roughly 150-400/500 Hz, for male speakers roughly 70-150 Hz). This can be done automatically in order to exclude and reanalyse possible outliers. How many files do you have? If it is fewer than, say, a hundred, you could also try Praat's built-in formant and f0 trackers and check those results visually. This also has the additional advantage that you can use Praat's scripting facilities to get your results into e.g. R, Julia or whatever program you are going to use.
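A minimal sketch of one such automatic sanity check in Python: flag frames where the measured F1 lies almost exactly on a harmonic of f0, which is where peak picking is most likely to have grabbed a harmonic instead of the formant (the 3% tolerance is an arbitrary placeholder):

    import numpy as np

    def suspicious_f1(f0, f1, tol=0.03):
        # True where F1 is within tol (relative) of an integer multiple of f0
        f0 = np.asarray(f0, float)
        f1 = np.asarray(f1, float)
        ratio = f1 / f0
        return np.abs(ratio - np.round(ratio)) < tol * ratio

    # Example: f0 = 220 Hz with F1 measured at 660 Hz sits exactly on the 3rd
    # harmonic and is flagged; 580 Hz at f0 = 210 Hz is not.
    print(suspicious_f1([220.0, 210.0], [660.0, 580.0]))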
  • asked a question related to Speech Acoustics
Question
1 answer
During the closing phase of the vocal fold (VF) oscillation cycle, the two VFs tend to collide with each other. This poses a challenge for numerically simulating the coupled fluid-structure problem of the VFs and the glottal jet. Is there a way to impose a condition on the VFs that prevents them from colliding? If yes, how can we implement it numerically?
Relevant answer
Answer
Hello Shakti,
we are running a European project in which this problem is being handled. A paper on a novel solution is in preparation by Johan Jansson, N. Cem Degirmenci and Jeannette Spuehler; the journal to be decided. You can read more about the project at www.eunison.eu.
Best regards,
Sten Ternström
EUNISON Coordinator
  • asked a question related to Speech Acoustics
Question
4 answers
For sure, there is a (well-known) theoretical background for pitch estimation, including many interesting academic papers with comparative studies of methods. On the other hand, one knows that reverberant room effects can be handled through signal pre-whitening methods. Nonetheless, my question is for those who, like myself, feel frustrated by the almost erratic performance of pitch estimators on naturally spoken sentences (i.e. normal rhythm) in small reverberant rooms, even after signal pre-whitening. Thus, I would like to know if someone has successfully experimented with new pragmatic approaches, possibly unconventional ones.
Relevant answer
Answer
You could give ProsodyPro a try to see if it can handle some of your cases: http://www.phon.ucl.ac.uk/home/yi/ProsodyPro/
  • asked a question related to Speech Acoustics
Question
21 answers
Dear sir/mam,
One of my friends is doing Ph.D. in English on pragmatics, which focuses on speech act theory. Could you please suggest some good reading regarding this?
thanking you
yours sincerely
Dr.Kiran R. Ranadive
Relevant answer
Apart from special issues, there are four particularly prominent accounts of 'speech acts':
(1) JL Austin's 'How to Do Things with Words' (1975),
(2) JR Searle's 'Speech Acts' (1969),
(3) Bach & Harnish's 'Linguistic Communication and Speech Acts' (1979) and
(4) WP Alston's 'Illocutionary Acts and Sentence Meaning'.
The notions given of what an 'illocutionary act' (or 'speech act') actually is differ strongly in each case: your friend should be aware of the possibility that these authors are dealing with different phenomena. Thus, Austin is concerned with institutional acts which involve (what he calls) the "securing of uptake"; in the case of Searle it is rather unclear what identity conditions the acts he deals with actually have; Bach and Harnish are dealing primarily with acts of communication ("communicative speech acts"), and secondarily with institutional acts ("conventional speech acts"), while Alston's interest is directed at acts of 'saying something' (in a somewhat peculiar sense of 'saying' not equal to any of the aforementioned kinds of act). One of the important first decisions your friend will thus have to make is: which of the different conceptions of 'speech acts' does he/she intend to deal with in his/her dissertation?
[Note: Austin's and Searle's accounts are not easy to disentangle; my own dissertation, 'Illocutionary Acts -- Austin's Account and What Searle Made Out of It' (2006), sets out to give detailed analyses of both. Searle's account was commonly recognised as the leading account for decades; during the eighties, Bach and Harnish's account became widely 'accepted' too. Most authors endorse the outstanding significance of Austin's work on the matter, but only a few seem to have actually read the relevant texts authored by him (in any meaningful sense of 'read'); here the work of Marina Sbisà contains particularly helpful analyses.]
  • asked a question related to Speech Acoustics
Question
5 answers
I am working on a perception experiment on German stress and would like to measure the syllable duration of my stimuli. I was wondering how to treat ambisyllabic consonants, e.g. [l] in [maˈrɪlə] 'apricot'. My first intention was to split the ambisyllabic consonant and count half of its duration to the second and half of its duration to the third syllable of the word. I would be very glad to receive practical help on this issue.
Relevant answer
Answer
Hi Katharina,
not really practical help, but a few psycholinguistic studies on the role of ambisyllabic consonants in Dutch and English. Zwitserlood et al. (1993) showed that Dutch CVC syllables primed both CVC.CVC syllables and also strings with an ambisyllabic consonant, CV(C)VC. CVC syllables were also detected faster than CV syllables in words with ambisyllabic consonants. In production, the results of syllable-reversal experiments by Schiller et al. (1997) also suggest that ambisyllabic consonants are mostly treated as the coda of the first syllable. Similar findings would probably apply to German.
In work on tonal alignment, Atterer & Ladd (2004) and Ladd, Mennen & Schepman (2000) used target words in which some of the stressed syllables were closed by an ambisyllabic consonant. In these conditions, the ambisyllabic consonant was treated as belonging to the stressed syllable (Atterer & Ladd 2004: 184).
Hope this helps
Dutch:
Schiller, N.O. Meyer, A. & Levelt, W.J.M. 1997. The syllabic structure of spoken words: evidence from the syllabification of intervocalic consonants. Language & Speech 40, 103-140.
Zwitserlood, P., H. Schriefers, A. Lahiri & Wilmaar van Donselaar. 1993. The role of syllables in the perception of spoken Dutch. Journal of Experimental Psychology: Learning Memory, & Cognition 19. 260-271.
English:
Treiman, R. & Danis, C. 1988. Syllabification of intervocalic consonants. Journal of Memory and Language 27, 87-104.
Ladd, D.R., Mennen, I., Schepman, A. 2000. Phonological conditioning of peak alignment in rising pitch accents in Dutch. Journal of the Acoustical Society of America 107, 2685-96.
  • asked a question related to Speech Acoustics
Question
1 answer
I'm wondering if terrestrial animals could possibly produce vocalizations that would create visual warp distortions close to their mouth during production. Would this effect be heavily dependent on environmental temperature and humidity?
Relevant answer
Answer
Well, it would be heavily dependent on the frequency - the lower the frequency, the longer (more visible) the wave length. It would probably be a very low sound (about 2-5 Hz) used by no animal known to man (at least not to me). Another thing is that it would depend heavily on the density of air or any other medium (liquid or solid). If you "talk to water" you can observe some waves that your speech is causing (some of them will be of course caused solely by the expulsion of air but others, the smaller and more numerous ones, will resonate to the vibrations of your voice). I'm not sure about the effect of the temperature or humidity.
All in all, human eye (and brain) can process images at maximum about 15 Hz, whereas the ear starts at at least about 20 Hz. So you will be able to see it but not to hear it. That's for sure. Do such things occur in nature? I don't know and I don't think we can see air distortion (unlike water distortion). If they do, do they depend on the environment? Yes, very much.
  • asked a question related to Speech Acoustics
Question
3 answers
Using jitter and shimmer in speaker identification.
Relevant answer
Answer
Shimmer and jitter give you information about stability in phonation, while MFCCs give information related to the articulators (this is still under discussion in the state of the art). Normally, if you use MFCCs you don't need LPC, because the information provided by LPC is already contained in the MFCCs. As for PLP or RASTA analyses, all of those perform spectral smoothing and are very useful for speaker identification.
For fusion you can use all of the features you want or need. At the feature level it is just a matter of putting all of the matrices together (be careful with the sizes), and depending on the classifier you can also fuse at the "scores" level, i.e. if you use an SVM, you take the distance from the sample to the hyperplane as a feature.
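A minimal sketch of the feature-level fusion step in Python; the arrays are random placeholders, and the only real point is that all matrices must share the same number of rows (one per frame or per utterance) before concatenation:

    import numpy as np

    # Placeholder per-utterance features: 13 MFCC statistics plus jitter and shimmer
    mfcc_stats = np.random.rand(100, 13)     # 100 utterances x 13 coefficients
    jitter = np.random.rand(100, 1)
    shimmer = np.random.rand(100, 1)

    # Feature-level fusion: concatenate along the feature axis
    fused = np.hstack([mfcc_stats, jitter, shimmer])   # shape (100, 15)
    print(fused.shape)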