Science topic
Speech Acoustics - Science topic
The acoustic aspects of speech in terms of frequency, intensity, and time.
Questions related to Speech Acoustics
Hi everybody,
Given the different methods of speech feature extraction (IS09, IS10,...), which one do you suggest for EMODB and IEMOCAP datasets?
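For context, IS09 and IS10 refer to the INTERSPEECH 2009 Emotion Challenge and 2010 Paralinguistic Challenge configuration files shipped with openSMILE. A minimal sketch of extracting utterance-level functionals from either corpus with the opensmile Python package (which exposes related sets such as ComParE_2016 and eGeMAPS); the file name is a placeholder:

```python
# A minimal sketch: extract utterance-level functionals with the opensmile
# Python package (ComParE_2016 shown here; the IS09/IS10 sets are distributed
# as .conf files with the command-line openSMILE tool).
import opensmile

smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.ComParE_2016,    # or e.g. eGeMAPSv02
    feature_level=opensmile.FeatureLevel.Functionals,
)
features = smile.process_file("utterance.wav")         # placeholder path
print(features.shape)                                  # one row of functionals per file
```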
I have a project in which I have been given a dataset (more than enough data) of 10-20 second audio files (singing these "swar"/"ragas": "sa re ga ma pa"), without any labels or other annotation. I have to create a deep learning model that recognises which swar is being sung and for how long it is present in the audio clip (i.e. the time range of a particular swar: sa, re, ga, ma).
The answers I am looking for are:
1. How can I achieve my goal? Should I use an RNN, CNN, LSTM, a hidden Markov model, or something else, such as unsupervised learning for speech recognition?
2. How do I get the correct speech tone for an Indian language, given that most acoustic speech recognition models are tuned for English?
3. How do I find the time range, i.e. over what interval a particular swar is present in the music clip, and how do I add that time-range recognition to the speech recognition model?
4. Are there any existing music recognition models that resemble my research topic? If yes, please tag them.
I am looking for a full guide for this project as it is completely new to me, and people who are interested in working with me or guiding me are also welcome.
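Since the swaras sa, re, ga, ma, pa are pitch classes relative to the singer's tonic, one simple baseline before training any deep model is to compute frame-level chroma and merge consecutive frames of the same pitch class into time segments. A minimal sketch with librosa, under the assumption that the tonic ("sa") is known or estimated separately; the file name is a placeholder:

```python
# A minimal sketch: frame-level chroma with librosa, merged into time segments.
# Mapping a pitch class to a swar still requires knowing the tonic ("sa").
import librosa
import numpy as np

y, sr = librosa.load("raga_clip.wav", sr=None)           # placeholder file
chroma = librosa.feature.chroma_cqt(y=y, sr=sr)           # 12 pitch classes x frames
times = librosa.times_like(chroma, sr=sr)
pitch_class = np.argmax(chroma, axis=0)                   # most active class per frame

# Merge consecutive frames with the same pitch class into (start, end, class).
segments, start = [], 0
for i in range(1, len(pitch_class)):
    if pitch_class[i] != pitch_class[start]:
        segments.append((times[start], times[i], int(pitch_class[start])))
        start = i
segments.append((times[start], times[-1], int(pitch_class[start])))
```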
Hello everyone,
I am looking for links to audio datasets that can be used for classification tasks in machine learning. Preferably, the datasets will have been described in scientific journals.
Thank you for your attention and valuable support.
Regards,
Cecilia-Irene Loeza-Mejía
Several reports exist on the fundamental frequency of male and female speech. Though they do not all agree, there is a clear trend that the fundamental frequency of men's voices is lower than that of women's. One example: "The voiced speech of a typical adult male will have a fundamental frequency from 85 to 155 Hz, and that of a typical adult female from 165 to 255 Hz."[1]
QUESTION: Is it meaningful to study speech below these frequencies and why?
I am studying speech directivity, and for some reason in the literature the male and female voice seem to be repeatedly compared at 125 Hz, near the male fundamental. This seems nonsensical to me, but maybe there is a good reason for it? I have recorded a fair bit of female speech and I see very little sound energy in this frequency band.
[1] Baken, R. J. (2000). Clinical Measurement of Speech and Voice, 2nd Edition. London: Taylor and Francis Ltd. (pp. 177), ISBN 1-5659-3869-0. That in turn cites Fitch, J.L. and Holbrook, A. (1970). Modal Fundamental Frequency of Young Adults in Archives of Otolaryngology, 92, 379-382, Table 2 (p. 381).
I am searching for studies that investigate speaker normalization in children. For example, I wonder whether children around the age of six can already normalize acoustic differences between speakers as well as adults do. Any suggestions for literature on this topic?
Looking forward to reading your suggestions.
I am trying to build a text-to-speech converter from scratch.
For that
Text 'A' should sound Ayyy
Text 'B' should sound Bee
Text 'Ace' should sound 'Ase'
Etc
So how many total sounds do I need in order to reconstruct full English-language words?
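For reference, English is usually described with roughly 39-44 phonemes; CMUdict, for example, uses 39 ARPAbet symbols once stress digits are stripped. A minimal sketch that counts them, assuming NLTK and its cmudict corpus are installed:

```python
# A minimal sketch: count the distinct ARPAbet phonemes in CMUdict
# (stress digits stripped).  Requires nltk and nltk.download("cmudict").
import re
from nltk.corpus import cmudict

pron = cmudict.dict()
phonemes = {re.sub(r"\d", "", p)
            for variants in pron.values()
            for variant in variants
            for p in variant}
print(len(phonemes), sorted(phonemes))   # typically 39 symbols
```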
I conducted a thirty-day experiment with about 10 participants. The experiment was supposed to test whether one can "train" humans to predict the age (a), height (h), and weight (w) of a person from their voice.
I ran two tasks: 1) regression, where I asked participants to assign a numerical value to each of the 5 speakers' a, h, and w; 2) binary classification, where I asked participants to judge who is taller, heavier, and older in pairs of speakers.
What kind of interesting analysis can be done from this data?
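For the regression task, one informative analysis is the per-session correlation (or mean absolute error) between participants' estimates and the true values, tracked over the thirty days to look for a training effect; for the binary task, per-day accuracy tested against chance. A minimal sketch, where all arrays are hypothetical placeholders rather than data from the experiment:

```python
# A minimal sketch of two possible analyses; all inputs are hypothetical
# placeholders, not data from the experiment.
import numpy as np
from scipy import stats

def regression_accuracy(estimates, true_values):
    """Pearson r and mean absolute error between estimates and ground truth,
    e.g. computed per day to look for a training effect."""
    r, p = stats.pearsonr(estimates, true_values)
    mae = np.mean(np.abs(np.asarray(estimates) - np.asarray(true_values)))
    return r, p, mae

def classification_vs_chance(n_correct, n_trials):
    """Binomial test of binary-task accuracy against chance (requires scipy >= 1.7)."""
    return stats.binomtest(n_correct, n_trials, p=0.5)
```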
In Python, which way is best to extract the pitch of the speech signals?
I have extracted pitch via "piptrack" in "librosa" and "PitchDetection" in "upitch", but I'm not sure which of these is the most accurate.
Is there a simple, real-time (or at least semi-real-time) alternative to these?
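Besides piptrack, librosa also ships a pYIN implementation, which tends to be more robust for speech f0; it is offline rather than real-time. A minimal sketch, with a placeholder file path and an assumed speech-appropriate pitch range:

```python
# A minimal sketch: f0 extraction with librosa's pYIN implementation
# (offline; real-time use would need a frame-by-frame YIN/autocorrelation tracker).
import librosa

y, sr = librosa.load("speech.wav", sr=None)              # placeholder path
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, sr=sr, fmin=50, fmax=500)                          # assumed speech pitch range
times = librosa.times_like(f0, sr=sr)
```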
Can anyone help me with how to build a phoneme embedding? The phonemes have different sizes in some features; how can I solve this problem?
thank you
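One common workaround for variable-length per-phoneme feature sequences is to pool the frames (e.g., mean and standard deviation) into a fixed-size vector, or to pad the sequences and let a recurrent or attention layer handle the length. A minimal pooling sketch with placeholder arrays:

```python
# A minimal sketch: turn variable-length per-phoneme frame features into
# fixed-size vectors by mean and standard-deviation pooling.
import numpy as np

def pool_phoneme(frames):
    """frames: (n_frames, n_features) array for one phoneme instance."""
    frames = np.asarray(frames)
    return np.concatenate([frames.mean(axis=0), frames.std(axis=0)])

# Dummy data: two phoneme instances of different lengths, 13 features per frame.
instances = [np.random.randn(7, 13), np.random.randn(12, 13)]
embeddings = np.stack([pool_phoneme(f) for f in instances])   # shape (2, 26)
```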
I have trained an isolated spoken-digit model for 0-9. My speech recognition system recognizes isolated digits like 0, 1, 2, ..., 9, but it fails to recognize continuous digit strings like 11, 123, 11111, etc. Can anyone please help me convert this isolated-digit recognizer into a connected-digit recognizer?
I have two speech signals coming from two different people. I want to find out whether or not both people are saying the same phrase. Is there anything that I can directly measure between the two signals to know how similar they are?
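One direct measure is the Dynamic Time Warping (DTW) cost between the two utterances' MFCC sequences: a low path-normalized cost suggests the same phrase, though the decision threshold has to be tuned on example data. A minimal sketch, with placeholder file names:

```python
# A minimal sketch: compare two utterances with DTW over MFCC sequences.
# File names are placeholders; the decision threshold must be tuned.
import librosa

y1, sr1 = librosa.load("person1.wav", sr=16000)
y2, sr2 = librosa.load("person2.wav", sr=16000)

m1 = librosa.feature.mfcc(y=y1, sr=sr1, n_mfcc=13)
m2 = librosa.feature.mfcc(y=y2, sr=sr2, n_mfcc=13)

D, wp = librosa.sequence.dtw(X=m1, Y=m2, metric="cosine")
normalized_cost = D[-1, -1] / len(wp)    # lower -> more likely the same phrase
```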
Hi,
I am looking for free speech databases for speaker recognition (at least 50 speakers). Do you have any suggestions?
For my bachelor thesis, I would like to analyse the voice stream of a few meetings of 5 to 10 persons.
The goal is to validate some hypotheses linking speech-time repartition to workshop creativity. I am looking for a tool that can be implemented easily and without any extensive knowledge of signal processing.
Ideally, I would like to feed the tool an audio input and get the time segments of each speaker, either graphically or in matrix/array form.
- diarization does not need to be realtime
- source can be single or multi stream (we could install microphones on each participant)
- the process can be (semi-)supervised if need be; we know the number of participants beforehand.
- The tool can be a MATLAB, .exe, Java, or similar file. I am open to suggestions.
Again I am looking for the simplest, easy-to-install solution.
Thank you in advance
Basile Verhulst
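For reference, one off-the-shelf option is the pretrained diarization pipeline in pyannote.audio, which takes an audio file and returns speaker-labelled time segments. A minimal sketch, assuming pyannote.audio is installed (recent versions also require a Hugging Face access token) and the file name is a placeholder:

```python
# A minimal sketch with pyannote.audio's pretrained diarization pipeline.
# Recent versions require a Hugging Face token (use_auth_token=...).
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("meeting.wav")                    # placeholder file

# Speaker turns as (start, end, label), easy to export as a matrix/array.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}\t{turn.end:.2f}\t{speaker}")
```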
I am working on emotion detection from speech.
A phonetically balanced (PB) word list in Malayalam is available in the ISHA battery; however, the full article in which the list was published isn't available, so we couldn't find out whether or not a psychometric function was established for the list. It would be helpful if someone could suggest any other PB word list in Malayalam for which the psychometric function was done, or direct me to the original article in which the list was published.
Thank you.
Because the result of noisy speech filtering strongly depends on how the silence-interval detection problem is solved.
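For reference, a very crude way to locate silence intervals is to threshold the short-time energy (RMS); a minimal sketch with librosa, where the file name and threshold factor are placeholders that would need tuning to the noise level:

```python
# A minimal sketch: mark frames as silence when the short-time RMS energy falls
# below a threshold relative to the median energy (crude, tune per recording).
import librosa
import numpy as np

y, sr = librosa.load("noisy_speech.wav", sr=None)         # placeholder file
rms = librosa.feature.rms(y=y)[0]
times = librosa.times_like(rms, sr=sr)
silence = rms < 0.3 * np.median(rms)                       # assumed threshold factor
```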
I am trying to analyze frequency mean, range, and variability from a speaker reading a passage aloud. I am using Praat and a MATLAB script I am writing to analyze these. The common threshold in Praat is 75 Hz to 300 Hz for a male speaking voice and 100 Hz to 500 Hz for a female speaking voice. I want to make sure I am obtaining the most accurate fundamental frequencies of the voice, not higher frequencies from breaths or ends of words. Does anyone with experience in these analyses have more accurate threshold criteria, or are these Praat thresholds suitable?
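If useful, the same floor/ceiling settings can be applied programmatically through Praat's Python interface (parselmouth), which makes it easy to compare different thresholds on the same recordings; a minimal sketch, with a placeholder file name:

```python
# A minimal sketch: Praat pitch analysis from Python (parselmouth) with the
# floor/ceiling values under discussion.  File path is a placeholder.
import parselmouth

snd = parselmouth.Sound("reading_passage.wav")
pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=300.0)   # e.g. male settings
f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]                                               # drop unvoiced frames
print(f0.mean(), f0.min(), f0.max(), f0.std())                # mean, range, variability
```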
Hi all,
I would like to ask the experts here, in order to get a better view on the use of cleaned signals from which the echo has already been removed using a few types of adaptive algorithms for AEC (acoustic echo cancellation).
How can MSE and PSNR contribute to improving the classification process? Normally we evaluate with WER, accuracy, and maybe EER too. Is there any connection between MSE and PSNR values and improving those classification metrics?
I wish to have clarification on this.
Thanks very much.
I am looking for the average frequency values of RP consonants in order to compare them with the accented English consonants pronounced by the students.
I am working on voice conversion techniques and I would like to know how to perform parallel Dynamic Time Warping for three utterances.
Is there a lower limit on the length of a room impulse response for it to degrade the performance of speech recognition?
Usually, for reverberation to be perceptible by listening, the room impulse response needs to be around 1000 coefficients long. My question is: if the room impulse response is shorter than that, does it still affect the performance of speech recognition tools such as PocketSphinx?
Is there any literature where such an analysis has been done?
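One way to check this empirically is to convolve clean speech with room impulse responses truncated to different lengths and compare the recognizer's error rate on the clean versus convolved audio. A minimal sketch of the convolution step, where the file names and truncation length are placeholders:

```python
# A minimal sketch: simulate reverberation by convolving clean speech with a
# (truncated) room impulse response, then compare ASR error rates on both files.
import numpy as np
import soundfile as sf
from scipy.signal import fftconvolve

speech, sr = sf.read("clean_speech.wav")                  # placeholder files
rir, sr_rir = sf.read("room_ir.wav")
assert sr == sr_rir

rir_short = rir[:500]                                     # assumed truncation length
reverberant = fftconvolve(speech, rir_short)[:len(speech)]
sf.write("reverberant_speech.wav",
         reverberant / np.max(np.abs(reverberant)), sr)
```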
There exists more than one type of Mandarin tone sandhi.
The most basic is the Tone 3 possibly changing to Tone 2 under certain conditions.
Also, there is the half tone phenomenon.
Is there some place that assembles descriptions of all these things?
I know of some of the work of Duanmu San, Zhiming Bao, Moira Yip, and Yuenren Chao. Are there other bodies of related work?
When you tighten your Thyrovocalis muscle are you changing your harmonic frequency or your formant frequency?
How much of the fMRI BOLD signal from a task requiring overt speech is lost because of head movement or articulation artifacts? Is there a risk that too much correction leads to unwarranted conclusions?
I am planning to record experimental data in a reasonably quiet, but not sound-proof, environment. Intensity analysis, especially spectral tilt, should be possible. I have read that the proximity effect could be a problem. Is it better to use a dynamic or a condenser mic? Should it be unidirectional or omnidirectional?
Much work has been done on ASR, but none (to the best of my knowledge) has been carried out on the Yorùbá language. Yorùbá is a tonal language in which words with the same phonemes can have different meanings, e.g. igbá - "calabash", igba - "200", ìgbà - "period", ìgbá - "garden egg", igbà - "climbing rope". This makes recognition difficult. Please kindly assist if you know of any solution or a guide. Your contributions will be highly appreciated.
Acoustic-phonetic production experiments often report relative segment durations (rather than absolute durations), mostly because relative durations are less prone to influences from speaking rate.
Typical reference units for normalization in the literature are:
1) units that contain the target segment (e.g., the syllable, the word, the phrase)
2) units that are adjacent to the target segment (e.g., sounds or words to the right or left)
3) the average phone duration in the respective phrase
Depending on the structure of the utterance and/or the nature of the target segment (e.g., phonemically long vs. short), differences across experimental conditions may appear larger or smaller (depending on whether the duration of the reference unit is negatively or positively correlated with the duration of the target).
Are there theoretical considerations that speak for (or against) one of those units of reference? Or do we need perception data in order to decide which relative measure participants are sensitive to? Should we always collect recordings in different speech rates in order to identify relative durations that are not (or least) influenced by the speaking rate manipulation?
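For concreteness, the three reference units listed above can be computed from the same segment table; a minimal sketch with hypothetical durations:

```python
# A minimal sketch of the three normalizations, with hypothetical durations (s).
target = 0.085                                    # target segment duration
word = 0.420                                      # duration of the containing word
neighbor = 0.110                                  # duration of an adjacent segment
phrase_phones = [0.07, 0.09, 0.05, 0.11, 0.085, 0.10]   # all phones in the phrase

rel_to_word = target / word                               # reference unit 1
rel_to_neighbor = target / neighbor                       # reference unit 2
rel_to_mean_phone = target / (sum(phrase_phones) / len(phrase_phones))  # reference unit 3
```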
Hello everyone! I would like to hear any suggestions you might have regarding the equipment I can use for Blind Speaker Separation experiments in real rooms (real time as well as offline). Has any of you set up such experiments? What kind of microphones have you used, acquisition sound card etc.
Thanks in advance!
I have two speech signals, recorded using two different recording devices and at two different times. May I know if there is a way to establish that the two signals are really related, irrespective of the way they were recorded?
In Ghana, I have observed that the phoneme /j/ is realized as /dz/, /y/ or /Ʒ/ in individuals' speech. I have also noticed that the difference in realization depends on either the absence or the presence of the target phoneme in the learners' speech (i.e. transfer errors). Where it is present but the realizations are not the same, the learner tries to articulate the phoneme as a phoneme he or she already knows. Where the target phoneme does not exist in the already known languages of the learner, he or she tries to substitute another phoneme that exists in his or her linguistic repertoire. Can someone share with me some examples of phonemic variations that they have noticed in their students' speech? Are the reasons for the variations different from what I have stated?
I am able to access the transcripts but I am unable to access the audio files even on free online corpora webpages. Could anyone tell me how to access both transcripts as well as audio files together?
In my work, I want to use a Gaussian mixture model (GMM) for speaker identification. I use Mel-frequency cepstral coefficients (MFCCs) to extract features from the training and testing speech signals, and I use obj = fitgmdist(X, K) to estimate the parameters of the Gaussian mixture model for the training speech. I then use [p, nlogl] = posterior(obj, testdata) and choose the minimum nlogl to indicate the maximum similarity between the reference and testing models, as shown in the attached MATLAB file.
The problem with my program is that the minimum nlogl changes, and it recognizes a different speaker even if I use the same testing speech signal. For example, when I run the program the first time, it decides that the first training model has the maximum similarity with the test speech (I = 1), and if I run the program again for the same test speech, a different speaker (e.g., the fifth) comes out as the best match. I do not know what the problem is, or why the program gives a different speaker each time I run it for the same testing speech signal. Can anyone who specializes in speaker recognition systems and Gaussian mixture models answer my question?
With best regards
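The run-to-run variation is consistent with the random initialization used when fitting the GMM (fitgmdist starts from a randomized initialization unless a fixed 'Start' or random seed is given), so fixing the seed and/or using several replicates usually stabilizes the result. For comparison, a minimal sketch of an equivalent pipeline in Python with a fixed seed; the MFCC arrays are placeholders, not real data:

```python
# A minimal sketch of an equivalent pipeline with scikit-learn, fixing
# random_state so repeated runs give the same speaker decision.
# The MFCC arrays below are placeholders, not real data.
import numpy as np
from sklearn.mixture import GaussianMixture

K = 16                                                    # mixture components
train_mfccs = {"spk1": np.random.randn(500, 13),
               "spk2": np.random.randn(500, 13)}
test_mfcc = np.random.randn(200, 13)

models = {spk: GaussianMixture(n_components=K, covariance_type="diag",
                               random_state=0).fit(X)
          for spk, X in train_mfccs.items()}

# score() is the average log-likelihood; the highest-scoring model is the match
# (equivalent to choosing the minimum negative log-likelihood).
scores = {spk: gmm.score(test_mfcc) for spk, gmm in models.items()}
best_speaker = max(scores, key=scores.get)
```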
We want to do instantaneous frequency-altered feedback (for stuttering patients) and thought we could implement it using MATLAB. Unfortunately, there is not much precise information out there besides pvoc, which does not give us the expected quality. We have also been looking at GitHub, but the solutions were usually not well documented or incomplete (missing subfunctions). Any ideas? A pre-built GUI-based app would be perfect, since student assistants have to use it!
I am looking for books, which could help me write my BA thesis on differences between Standard Scottish English and Received Pronunciation. Are there any easily accessible books on that topic?
I want to assess nasality in two groups of children undergoing cleft palate surgery (two different techniques) and compare the difference. Are there any simple ways of clinically assessing nasality?
My recent perception experiments (The Tension Theory of Pitch Production and Perception) show that the physical dimension (length of a string or diameter of a membrane) is an inherent source of force for the vibrating body. I am now trying to see how tension meters measure the tension of a string since they do not seem to take all the physical dimensions of the string (or other bodies) into account.
I remember having seen some papers on a hybrid speech enhancement method that takes the form of a weighted sum of several individual methods. By assuming the errors of the different methods to be uncorrelated, better noise reduction performance can be achieved with the hybrid method. If you happen to know such publications, please help me. Thank you very much!
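For illustration, if the per-method error variances can be estimated, the uncorrelated-error assumption leads to inverse-variance weights for the combination; a minimal sketch with placeholder inputs:

```python
# A minimal sketch: combine the outputs of several enhancement methods with
# inverse-error-variance weights (optimal if the errors are zero-mean and
# uncorrelated).  Inputs are placeholders.
import numpy as np

def combine(enhanced_signals, error_variances):
    """enhanced_signals: list of equal-length arrays; error_variances: list of floats."""
    w = 1.0 / np.asarray(error_variances, dtype=float)
    w /= w.sum()
    return sum(wi * np.asarray(si) for wi, si in zip(w, enhanced_signals))
```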
It's often problematic to squeeze vowels into a 2D vowel space, because the differences caused by F3 and higher formants are lost (even when using F2'). I'd like something that produces better-looking results than DPlot (used for the attached file), with more freedom in axis orientation.
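One alternative is a rotatable 3D scatter of F1 x F2 x F3 in matplotlib, where axis orientation and inversion can be chosen freely; a minimal sketch with placeholder formant values:

```python
# A minimal sketch: rotatable 3D vowel plot (F2 x F1 x F3) with matplotlib.
# Formant values are placeholders; axes are inverted to follow phonetic convention.
import matplotlib.pyplot as plt

F1 = [300, 700, 500]                      # placeholder measurements (Hz)
F2 = [2300, 1200, 900]
F3 = [3000, 2600, 2500]
labels = ["i", "a", "o"]

fig = plt.figure()
ax = fig.add_subplot(projection="3d")
ax.scatter(F2, F1, F3)
for x, y, z, v in zip(F2, F1, F3, labels):
    ax.text(x, y, z, v)
ax.set_xlabel("F2 (Hz)"); ax.set_ylabel("F1 (Hz)"); ax.set_zlabel("F3 (Hz)")
ax.invert_xaxis(); ax.invert_yaxis()
plt.show()
```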
I calculated the fundamental frequency and the first and second formants using the spectral peak-picking approach, and I'd like to know if there is a way to measure how closely the harmonics of f0 are related to the first formant. How can I know, with the spectral peak-picking approach, that a harmonic has not been mistakenly chosen instead of the correct F1 value?
Thanks
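One practical cross-check is to compare the peak-picked F1 against an independent LPC-based formant estimate (e.g., Praat's Burg tracker via parselmouth); if the peak picker has latched onto the wrong harmonic, the discrepancy will typically be close to a multiple of f0. A minimal sketch, with a placeholder file name and peak-picked value:

```python
# A minimal sketch: cross-check a peak-picked F1 against Praat's LPC (Burg)
# formant estimate via parselmouth.  File path and the peak-picked value are
# placeholders.
import parselmouth

snd = parselmouth.Sound("vowel.wav")
formants = snd.to_formant_burg(max_number_of_formants=5, maximum_formant=5500.0)
t_mid = snd.duration / 2
f1_lpc = formants.get_value_at_time(1, t_mid)      # first formant at mid-point (Hz)

f1_peak = 600.0                                     # value from spectral peak picking
print(f1_lpc, f1_peak, abs(f1_lpc - f1_peak))       # a gap near a multiple of f0 is suspicious
```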
During the closing phase of the vocal fold (VF) oscillation cycle, the two VFs tend to collide with each other. This poses a challenge for numerically simulating the fluid-structure coupled problem of the VFs and the glottal jet. Is there a way to impose a condition on the VFs to prevent them from colliding? If yes, how can we implement it numerically?
For sure, there is a (well-known) theoretical background for pitch estimation, including many interesting academic papers with comparative studies of methods. On the other hand, one knows that reverberant room effects can be handled through signal pre-whitening methods. Nonetheless, my question is addressed to those who, like myself, feel frustrated by the almost erratic performance of pitch estimators on naturally spoken sentences (i.e. normal rhythm) in small reverberant rooms, even after signal pre-whitening. Thus, I would like to know if someone has successfully experimented with new pragmatic approaches, possibly unconventional ones.
Dear Sir/Madam,
One of my friends is doing a Ph.D. in English on pragmatics, focusing on speech act theory. Could you please suggest some good reading on this?
Thanking you,
Yours sincerely,
Dr. Kiran R. Ranadive
I am working on a perception experiment on German stress and would like to measure the syllable duration of my stimuli. I was wondering how to treat ambisyllabic consonants, e.g. [l] in [maˈrɪlə] 'apricot'. My first intention was to split the ambisyllabic consonant and count half of its duration to the second and half of its duration to the third syllable of the word. I would be very glad to receive practical help on this issue.
I'm wondering if terrestrial animals could possibly produce vocalizations that would create visual warp distortions close to their mouth during production. Would this effect be heavily dependent on environmental temperature and humidity?
Using jitter and shimmer in speaker identification.
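For reference, local jitter and shimmer can be extracted per utterance with Praat's algorithms through parselmouth and appended to a speaker-identification feature vector; a minimal sketch, with a placeholder file name and pitch range:

```python
# A minimal sketch: extract local jitter and shimmer with Praat (via parselmouth)
# so they can be appended to a speaker-identification feature vector.
# File path and pitch range are placeholders.
import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("utterance.wav")
point_process = call(snd, "To PointProcess (periodic, cc)", 75, 500)
jitter_local = call(point_process, "Get jitter (local)", 0, 0, 0.0001, 0.02, 1.3)
shimmer_local = call([snd, point_process], "Get shimmer (local)",
                     0, 0, 0.0001, 0.02, 1.3, 1.6)
print(jitter_local, shimmer_local)
```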