Questions related to Speech Recognition
I need to analyze some qualitative data, including both video and voice recordings, for my master's thesis. Before analysing the data, I will need transcripts (written forms) of these recordings. Therefore, I am looking for software or websites that offer automated speech recognition (ASR). I would appreciate it if they were free of charge.
Looking forward to hearing some suggestions from senior researchers.
I have a project in which I have been given a dataset (more than large enough) of 10–20 second audio files (singing these "swar"/"ragas": "sa re ga ma pa") without any labels or other annotation, and I have to create a deep learning model that recognises which swar is sung and for how long it is present in the audio clip (the time range of each particular swar: sa, re, ga, ma).
The answers I am looking for concern the following questions:
1. How can I achieve my goal? Should I use an RNN, CNN, LSTM, a hidden Markov model, or something else such as unsupervised learning for speech recognition?
2. How can I get the correct speech tone for an Indian language, given that most acoustic speech recognition models are tuned for English?
3. How can I find the time range, i.e., over what range a particular sound with a particular swar is present in the music clip, and how can I add that time-range recognition to the speech recognition model? (See the sketch below.)
4. Are there any existing music recognition models that resemble my research topic? If yes, please tag them.
I am looking for a full guide for this project, as it is completely new to me; people who are interested in working with me or guiding me are also welcome.
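Since question 3 asks for time ranges, here is a minimal sketch of one way to get them, assuming the five shuddha swaras and a known tonic (`SA_HZ` and the file name are placeholders; in practice the tonic would need to be estimated per singer/recording): track pitch frame by frame with librosa's pYIN, snap each voiced frame to the nearest swara, and merge consecutive frames into labelled time ranges.

```python
import numpy as np
import librosa

SA_HZ = 240.0                      # hypothetical tonic frequency (Sa)
SWARAS = {0: "sa", 2: "re", 4: "ga", 5: "ma", 7: "pa"}  # shuddha degrees

y, sr = librosa.load("clip.wav", sr=None)
f0, voiced, _ = librosa.pyin(y, fmin=80, fmax=800, sr=sr)
times = librosa.times_like(f0, sr=sr)

labels = []
for hz, v in zip(f0, voiced):
    if not v or np.isnan(hz):
        labels.append(None)
        continue
    semitone = int(round(12 * np.log2(hz / SA_HZ))) % 12
    # snap to the nearest of the five swaras used in the exercise
    nearest = min(SWARAS, key=lambda s: min(abs(semitone - s), 12 - abs(semitone - s)))
    labels.append(SWARAS[nearest])

# collapse consecutive identical frame labels into (swara, start, end) ranges
segments, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        if labels[start] is not None:
            segments.append((labels[start], times[start], times[i - 1]))
        start = i
print(segments)
```

Frame labels produced this way could also bootstrap a supervised model (e.g., an RNN over frame-level features) that smooths the raw per-frame decisions.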
Hello, I am looking for papers about the pros and cons of CNNs and RNNs, and the advantages of a hybrid CNN-RNN model over the two separate models (if indeed there is an advantage) in speech recognition tasks, or in event detection tasks. Can anyone suggest relevant studies?
My teammates and I want to find out whether there is a way to do (remote) scientific collaboration in the field of machine learning/deep learning on speech recognition and audio analysis. The goal is only to learn and to become a member of our project.
Thanks in advance.
We know that the pre-processing for speech recognition includes echo cancellation, de-reverberation, audio enhancement, noise suppression, and so on. Is there a comprehensive toolbox or codebase for this purpose that could be a starting point for a beginner?
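Open toolkits such as SpeechBrain (and, for simple denoising, the noisereduce package) may serve as starting points. For intuition, here is a minimal spectral-subtraction sketch, assuming the first half second of the file is noise only (file names are placeholders):

```python
import numpy as np
import librosa
import soundfile as sf

y, sr = librosa.load("noisy.wav", sr=None)
S = librosa.stft(y)
mag, phase = np.abs(S), np.angle(S)

noise_frames = int(0.5 * sr / 512)           # 512 = librosa's default hop length
noise_mag = mag[:, :noise_frames].mean(axis=1, keepdims=True)

# subtract the noise magnitude estimate, keeping a small spectral floor
clean_mag = np.maximum(mag - noise_mag, 0.05 * noise_mag)
y_clean = librosa.istft(clean_mag * np.exp(1j * phase))
sf.write("enhanced.wav", y_clean, sr)
```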
I have an audio dataset, and I have to distinguish normal from abnormal sounds. I preprocessed the audio data using MFCC and got output of shape (audio files, MFCC coefficients, MFCC vector, 1). When I passed this input to a ConvLSTM, it raised an error saying it requires five-dimensional input, but I am passing four-dimensional input. How can I increase the dimensionality, or is there any way to pass four-dimensional input to a ConvLSTM?
Kindly guide me.
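Assuming Keras's ConvLSTM2D layer (which expects 5-D input of shape (samples, time, rows, cols, channels)), here is a sketch of two ways to add the missing time axis with NumPy; the shapes below are stand-ins for your data:

```python
import numpy as np

X = np.random.rand(32, 13, 100, 1)            # (files, coeffs, frames, 1) stand-in

# Option A: a single time step per utterance (trivial sequence)
X5 = np.expand_dims(X, axis=1)                # (32, 1, 13, 100, 1)

# Option B: split the 100 frames into 10 chunks of 10 -> a real sequence
X5b = X.reshape(32, 13, 10, 10, 1).transpose(0, 2, 1, 3, 4)  # (32, 10, 13, 10, 1)
print(X5.shape, X5b.shape)
```

Option B usually makes more sense for ConvLSTM, since the recurrence then actually runs over time chunks rather than over a single step.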
Hello! I am hoping to find a speech recognition tool that can automatically determine the length of each speaker's speaking turn in seconds, based on an audio or video recording of a conversation. I'd like to know the average length of each person's speaking turn in a given recorded conversation. Also, I'm looking for an accurate measure of how much overlap there was between different speakers. Ideally, the tool would automatically detect that multiple speakers were talking at the same time, and give me either a percentage or number of seconds in the conversation sample that more than one person was talking.
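One option may be speaker diarization. A hedged sketch with the open-source pyannote.audio pipeline (model names and the required Hugging Face token may change between versions; the file name is a placeholder):

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization",
                                    use_auth_token="HF_TOKEN")
diarization = pipeline("conversation.wav")

# collect per-speaker turn durations in seconds
turns = {}
for segment, _, speaker in diarization.itertracks(yield_label=True):
    turns.setdefault(speaker, []).append(segment.end - segment.start)

for speaker, durations in turns.items():
    print(speaker, sum(durations) / len(durations), "s average turn")
```

pyannote also ships an overlapped-speech-detection pipeline, whose output regions can be summed to get the seconds (or percentage) of simultaneous talk in the recording.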
The standard approaches in speech recognition (and A.I. generally) rest on some very interesting assumptions that deserve to be carefully examined...
I am pursuing my master's in data science. I need good research papers on speech recognition that I can refer to for my research on creating neural networks, e.g., recurrent neural networks for speech-to-text.
I need this on an urgent basis: how can I implement Arabic speech to Urdu text using a speech recognition tool in Matlab? If I use a neural network, is it possible to design it according to my requirements?
Speech to text (using Matlab)
Arabic Quranic speech to Urdu text: I have audio datasets and recorded speech, and I want to convert them into Urdu text using Matlab.
Please answer!
My daughter, student paramedic, is proposing a service improvement for UK ambulance service, to add dictation to electronic patient care reporting. If you have done or know of similar service improvements specifically using speech recognition in pre-hospital care, please reply. Thanks, Paul Kompfner
Affective technologies are the interfaces concerning the emotional artificial intelligence branch known as affective computing (Picard, 1997). Applications such as facial emotion recognition technologies, wearables that can measure your emotional and internal states, social robots interacting with the user by extracting and perhaps generating emotions, voice assistants that can detect your emotional states through modalities such as voice pitch and frequency and so on...
Since these technologies are relatively invasive to our private sphere (feelings), I am trying to find influencing factors that might enhance user acceptance of these types of technologies in everyday life (I am measuring the effects with the TAM). Factors such as trust and privacy might be very obvious, but moderating factors such as gender and age are also very interesting. Furthermore, I need relevant literature which I can ground my work on since I am writing a literature review on this topic.
I am thankful for any kind of help!
In Python, which way is best to extract the pitch of the speech signals?
I have extracted pitch via "piptrack" in "librosa" and "PitchDetection" in "upitch", but I'm not sure which of these is the most accurate.
Is there any simple, real-time (or at least semi-real-time) alternative?
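For comparison, librosa's pYIN (probabilistic YIN) is another common choice and tends to be more robust than raw piptrack peaks for speech; a minimal sketch (the file name is a placeholder):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=None)
# pYIN returns frame-wise F0 plus a voiced/unvoiced decision
f0, voiced_flag, voiced_prob = librosa.pyin(y, fmin=65, fmax=400, sr=sr)
times = librosa.times_like(f0, sr=sr)
for t, hz, v in zip(times[:10], f0[:10], voiced_flag[:10]):
    print(round(t, 3), hz if v else None)
```

For (semi-)real-time use, frame-by-frame autocorrelation or Praat's tracker via the praat-parselmouth package are possible alternatives.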
Hello, I work with convolutional neural networks and LSTMs on speech emotion recognition. In my results, the CNN shows better performance than the traditional LSTM on my speech recognition task.
Normally the LSTM should do better in speech recognition, since I am using sequential data.
I have trained an isolated spoken-digit model for 0–9. My speech recognition system recognises isolated digits like 0, 1, 2, ..., 9, but it fails to recognise continuous digit strings like 11, 123, 11111, etc. Can anyone please help me convert the isolated-digit system to a connected-digit one?
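Usually no new acoustic models are needed: connected-digit recognition only requires a looping grammar (or language model) over the same digit models, typically with a short-pause model between digits. As a sketch, in HTK's HParse notation (angle brackets mean one or more repetitions; the word names are assumed to match your digit models):

```
$digit = ZERO | ONE | TWO | THREE | FOUR | FIVE | SIX | SEVEN | EIGHT | NINE;
( SENT-START < $digit > SENT-END )
```

Compiled with HParse into a word network, HVite can then decode arbitrary-length digit strings against the existing isolated-digit models.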
The goal is to localize the starting time and the ending time of each phoneme in the waveform signal. If the code is written in Java, that would be better! Thanks in advance!
I have two speech signals coming from two different people. I want to find out whether or not both people are saying the same phrase. Is there anything that I can directly measure between the two signals to know how similar they are?
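One classic measure is the dynamic time warping (DTW) cost between the two utterances' MFCC sequences: a lower path-normalised cost suggests the same phrase. A minimal sketch (file names, and any decision threshold, are placeholders):

```python
import librosa

y1, sr1 = librosa.load("speaker1.wav", sr=16000)
y2, sr2 = librosa.load("speaker2.wav", sr=16000)

X = librosa.feature.mfcc(y=y1, sr=sr1, n_mfcc=13)
Y = librosa.feature.mfcc(y=y2, sr=sr2, n_mfcc=13)

# accumulated cost matrix D and optimal warping path wp
D, wp = librosa.sequence.dtw(X, Y, metric="euclidean")
cost = D[-1, -1] / len(wp)        # path-normalised alignment cost
print("normalised DTW cost:", cost)
```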
I want to add different feature vectors to improve the recognition rate using HTK. How can I do that?
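If "adding feature vectors" means appending energy, delta and acceleration coefficients, HTK does this through the TARGETKIND qualifiers in the HCopy/HERest configuration file (_E energy, _0 c0, _D deltas, _A accelerations, _Z cepstral mean normalisation); a sketch:

```
# configuration sketch: 13 MFCCs including c0, plus delta and
# acceleration coefficients, giving 39-dimensional feature vectors
TARGETKIND = MFCC_0_D_A
```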
I'm considering writing a "Speech Processing 101" compendium for the course I'm teaching, because I'm not aware of any existing good material for such a course. That leads up to two questions:
- Could you recommend a forum for publishing educational material with an open-access license?
- Alternatively, are you aware of a good existing resource for Speech Processing 101, with an emphasis on the DSP side (i.e., not about speech recognition)?
Hi all, I am doing some research on continuous speech recognition. I want to implement the product HMM using HTK, and I have met some problems. For example, I don't know how to initialize the product HMM (i.e., using the single-stream HMMs or using a proto?), and how to tie the silence models "sil" and "sp" (because the number of states of "sil" has changed and is no longer 3). Can someone help me solve these problems? Thank you very much.
Offline data synchronization is an important technique when designing an app that should work without an Internet connection. Synchronization between a server and an Android device lets users work with applications effectively even when they are not connected to the Internet, saving data both locally and remotely.
Sethuraman, Raj, Roger A. Kerin, and William L. Cron. "A field study comparing online and offline data collection methods for identifying product attribute preferences using conjoint analysis." Journal of Business Research 58.5 (2005): 602-610.
I was wondering if I could ask about gesture and sports performance.
I think that gesturing some movements before the actual performance has some effect on the subsequent movement (accuracy, fluency, timing and so on).
In fact, in some sports, such as baseball, table tennis or boxing, swinging the bat or racket without a real ball, or moving alone, is a popular practice.
I found some gesture studies involving speech and recognition for classification, but I couldn't find studies investigating whether gestures, pantomimes or mimicking influence the subsequent action, using experimental psychological methods.
It would be great if anybody could tell me about this research field.
I'm currently working on G2P research. The experiments are nearly finished, and I'm in the process of writing up the paper. My question is the following: I have noticed from the papers I have read that word error rate (WER) is a common metric for reporting G2P performance. Normally WER counts errors in sentences; with a little adaptation, the same idea can be used to count errors in words. In my case, I have a dictionary for which I want to calculate the WER. How do you think I should tackle this? Should I calculate the WER for each word in the dictionary and then describe the model's performance on the test data by averaging all the per-word WERs, or is there some other way to get a more accurate value?
I am asking because in most of the papers I have read, the authors report the model's performance as a single WER value without indicating how that value was derived (averaging or some other method).
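One common convention, rather than averaging per-word percentages: report WER as the fraction of dictionary entries whose predicted pronunciation is not exactly right, and phoneme error rate (PER) as edit operations pooled over the whole test set divided by the total number of reference phonemes. A sketch (the toy phoneme strings are made up):

```python
def edit_distance(a, b):
    # standard Levenshtein distance over phoneme lists
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def g2p_scores(pairs):  # pairs: [(ref_phonemes, hyp_phonemes), ...]
    errors = sum(r != h for r, h in pairs)            # words with any error
    edits = sum(edit_distance(r, h) for r, h in pairs)
    ref_len = sum(len(r) for r, _ in pairs)
    return errors / len(pairs), edits / ref_len       # WER, PER

wer, per = g2p_scores([(["k", "ae", "t"], ["k", "ae", "t"]),
                       (["d", "ao", "g"], ["d", "ow", "g"])])
print(wer, per)  # 0.5 and 1/6: one wrong word, one edit over six phonemes
```

Pooling the totals this way avoids the bias that per-word averaging introduces when word lengths differ.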
Do you know some good papers/theses with experiments on neural network applications? The area (image/speech recognition, weather forecasting, etc.) is not important, but they should be good (or outstanding, in your opinion) in terms of:
- How the design/experiment is described (reproducible);
- Execution / implementation;
- Analyses of results (used metrics etc.);
- Threats to validity / discussion
I am also interested to hear about some papers which are (in your opinion) only good in a particular area (e.g. the analysis) but weak in other areas.
I see a lot of NN papers that are weak (e.g., not really reproducible, small datasets, analysis of only the error rate and nothing else).
Do you know some papers about (reporting/design) guidelines for NN-experiments? I found e.g.:
Zhang, G. Peter. "Avoiding pitfalls in neural network research." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 37.1 (2007): 3-16.
The background is that I want to avoid falling into the most common pitfalls in my thesis, or at least mention them if I cannot solve them given time constraints.
I want an explanation of the MFCC coefficients we get: only the first 12–13 coefficients are usually considered when evaluating the performance of the feature vector. What is the reason behind this, and what would the effect be if we took higher coefficients as well? And how do we know whether our feature vector is good or bad; e.g., for a sound signal, once we compute its feature vector, how can we analyze whether the features are good?
The other question is about the LPC feature extraction method. Since it depends on the order of the coefficients, an LPC order of 10–12 is mostly used in this scheme. What is the reason behind this, and what would the effect on performance be if we took a lower or higher order?
If we compare the MFCC and LPCC methods, one works in the mel-cepstrum domain and the other in the cepstrum domain. What is the benefit of the cepstrum, what is the main difference between the mel cepstrum and the cepstrum, and which one is better?
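In brief: the low-order cepstral coefficients describe the smooth spectral envelope (vocal-tract shape), while higher ones increasingly encode excitation and pitch detail, which tends to add speaker- and pitch-dependent variability rather than phonetic information. A quick librosa sketch to inspect both settings (the file name is a placeholder):

```python
import librosa

y, sr = librosa.load("speech.wav", sr=16000)
mfcc13 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # typical setting
mfcc40 = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40)   # for comparison
print(mfcc13.shape, mfcc40.shape)
```

A practical way to judge whether the features are "good" is extrinsic: train the same classifier on both variants and compare recognition accuracy.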
What types of features can help to recognize speakers/speech more accurately? Why do we need additional features for speaker recognition? What are the recent trends in this field? Is any literature available on recent trends in speech recognition?
An artificial neural network is a computational method based on multiple interconnected processing units. The network is composed of an arbitrary number of cells (nodes, units or neurons) that link the input to the output. An artificial neural network is a practical way to learn various kinds of functions, such as real-valued, discrete-valued and vector-valued functions. Neural network learning is robust to errors in the training data, and such networks have been applied successfully to problems such as speech recognition, image identification and interpretation, and robot learning.
What aspects of augmented/virtual reality applications could be improved by techniques of artificial intelligence? We can think of good speech recognition commands, virtual assistants like Siri or Cortana but what else?
Ideas, websites of research groups specializing in AR/VR and AI, state-of-the-art papers, etc. are welcome, as I want to cover as many possibilities as possible in the shortest time...
I need your valuable suggestions/comments for selection of research topic. I have two options as:
2- Speech Recognition
I am totally new to this domain. Therefore, I need guidance on which topic is best from a research point of view and comparatively easy to understand and develop, so that I can produce some good publications at the end.
Please mention some state-of-the-art techniques/algorithms used for the analysis of the above-mentioned topics. Moreover, tell me if there are new engineering applications of speech enhancement/recognition. Thanks in advance!
Can we optimize the means and variances using a GA in speech recognition? Can we call this a refinement of the HMM, and can we mark it with the term GA+HMM?
I have speech data annotated with moods. To recognize the moods, I want to implement an HMM in Matlab, but as I am new to HMMs, I am finding it difficult to determine the parameters. Can anyone please help me with some demo examples? Thank you!
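A hedged sketch of the usual per-class HMM recipe, using Python's hmmlearn for concreteness (the same ideas carry over to Matlab toolboxes; the random arrays below are stand-ins for MFCC frames):

```python
import numpy as np
from hmmlearn import hmm

# one GaussianHMM per mood, trained on that mood's feature frames
train = {"happy": np.random.rand(500, 13),
         "sad":   np.random.rand(500, 13)}

models = {}
for mood, X in train.items():
    m = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
    m.fit(X)                                  # X: (n_frames, n_features)
    models[mood] = m

# classify a test utterance by the highest log-likelihood model
test = np.random.rand(120, 13)
print(max(models, key=lambda mood: models[mood].score(test)))
```

The main parameters to tune are the number of states (n_components), the covariance type, and the feature set.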
I am in the process of conducting several structured interviews in a work environment. These will be around 5–10 minutes long, and I'm planning to record the audio. Can you recommend any software that may help evaluate the recordings by (1) aiding in the transcription (preferably with speech recognition) and (2) collating the statements made?
I have tried an automatic speech recognition module using an HMM classifier and obtained some fine output. Later, I surveyed state-of-the-art methods, which involve deep learning. Please suggest some Matlab links that would be helpful for our work.
Thank you in advance.
I would like to ask the experts here, in order to get a better view, about the use of cleaned signals from which the echo has already been removed using several types of adaptive algorithms for acoustic echo cancellation (AEC).
How significant are MSE and PSNR for improving the classification process? Normally we evaluate using WER, accuracy and perhaps EER. Is there any connection between MSE and PSNR values and improvements in those classification metrics?
I wish to have clarification on this.
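For concreteness, a sketch of waveform-level MSE and PSNR (here PSNR uses the reference peak, which is one of several conventions); note that these measure signal fidelity rather than recognisability, so any link to WER/EER is indirect:

```python
import numpy as np

def mse(s, s_hat):
    # s: clean reference; s_hat: echo-cancelled output (same length)
    return np.mean((s - s_hat) ** 2)

def psnr(s, s_hat):
    return 10 * np.log10(np.max(np.abs(s)) ** 2 / mse(s, s_hat))
```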
In most surveys of speech segregation, the hit rate and false-alarm rate are calculated as percentages after mask estimation. If anyone knows the procedure for computing them, please share it so it can be verified. Thank you in advance.
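The usual definitions compare the estimated binary mask against the ideal binary mask (IBM): HIT is the percentage of target-dominant time-frequency units correctly retained, and FA the percentage of noise-dominant units wrongly retained. A sketch with stand-in masks:

```python
import numpy as np

def hit_fa(ibm, est):
    # ibm, est: binary masks of shape (freq_channels, time_frames)
    hit = 100 * np.sum((ibm == 1) & (est == 1)) / max(np.sum(ibm == 1), 1)
    fa = 100 * np.sum((ibm == 0) & (est == 1)) / max(np.sum(ibm == 0), 1)
    return hit, fa

ibm = (np.random.rand(64, 100) > 0.5).astype(int)   # stand-in masks
est = (np.random.rand(64, 100) > 0.5).astype(int)
print(hit_fa(ibm, est))
```

HIT minus FA is the single summary figure commonly reported.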
What is the time/cost effectiveness of using voice recorders with Dragon speech recognition with NVivo software? Any problems?
I find it odd in this day and age of voice recognition software that it is so difficult to find software that can automatically transcribe interview recordings. With powerful speech recognition engines such as Siri, Cortana, and Google Now, it seems odd that researchers are still stuck traditional speech recognition software that requires voice training and can only reliably understand a single user.
I've experimented using Google's Keep app - which has a speech to text feature - and read some material I got from a government site (full of jargon). It did quiet well with me reading from the site speaking directly to the app, and it even did a reasonable job with my colleague, who called me over speakerphone and read back the same section of the website. And the app did this all with zero training! So the technology seems to exist already!
The problem with this sort of online transcribing is that I don't know what the software company does with the audio or text that it hears and transcrbes. I presume it is used to improve the software's accuracy. That doesn't work for me because my interview recordings and resulting transcription would contain confidential information that cannot be allowed leave our internal network.
Am I overlooking a product that would meet my requirments? What are other qualitative researchers doing to resolve this challenge?
I have segregated combined speech sources using a neural-network-based classifier in a speech segregation process. My doubt is whether we should use the outputs of the ideal binary mask for estimating the signal-to-noise ratio. Please guide me in doing the estimation.
Thank you in advance.
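A common convention is to use the IBM-resynthesised target as the reference when the raw clean target is unavailable, reporting it explicitly as such. A minimal SNR sketch:

```python
import numpy as np

def snr_db(reference, estimate):
    # reference: clean (or IBM-resynthesised) target; estimate: your output
    noise = reference - estimate
    return 10 * np.log10(np.sum(reference ** 2) / max(np.sum(noise ** 2), 1e-12))
```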
I'm working on a project and I need to demonstrate that one speech codec is better than another in terms of voice quality. Is there a way I can simulate them so I can compare the two?
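One standard approach is an objective quality metric such as PESQ (ITU-T P.862; POLQA is newer but not freely available): decode the same reference speech through both codecs and compare scores. A sketch using the pesq package (file names are placeholders; PESQ requires 8 or 16 kHz audio, and 'wb' mode needs 16 kHz):

```python
import soundfile as sf
from pesq import pesq

ref, fs = sf.read("reference.wav")        # original speech
deg_a, _ = sf.read("codec_a_output.wav")  # decoded by codec A
deg_b, _ = sf.read("codec_b_output.wav")  # decoded by codec B

print("codec A:", pesq(fs, ref, deg_a, "wb"))
print("codec B:", pesq(fs, ref, deg_b, "wb"))
```

For stronger claims, a small MOS listening test alongside the objective scores is the usual complement.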
What algorithms and methods are used for face recognition and speech recognition in today's software applications? What are their recognition accuracies and recognition rates?
We are looking for a French text in which the speech sounds are selected so as to obtain a fixed proportion of voiced and unvoiced sounds (or more degrees of sonority). This text would be used in a contrastive multilingual experiment on vocal load.
In addition, we are interested in phonetically balanced corpora for French.
I'm setting up auditory fear conditioning, and I wonder how I can measure the decibel level of the tone used as the conditioned stimulus. I want a 75-dB tone and have a decibel meter.
I am not sure where I need to place the decibel meter in the context to calibrate a 75-dB tone: near the speaker, on the floor, in the middle? The speaker is on the right wall of a square-shaped context, and if I want to run fear extinction in a different, octagon-shaped context, I need to calibrate the tone again for the new context, right? In that case, where do I place the decibel meter?
Thanks for reading and I'll be waiting for your tips.
I would like to ask you how do you compare a speech sample and a different kind of auditory sample (e.g., noise, sounds produced by animals...) when you are looking for similarities and differences between the two samples.
For instance, there are some times when people believe they are listening to words when hearing a noise, or the wind. If a participant reported having heard "mother" when he/she actually listened to a noise, how would you carry out the comparison between the two different sounds? Is there any way to do that?
Ideas and references are welcome!
Let's say that I was able to generate phoneme HMMs; now how can I use them to recognize a full sentence or even a word utterance?
Also, I've encountered many references to embedded training, where the whole utterance is used to train the phonemes. I've tracked it down, but I haven't found an explanation of the way it's implemented, so if anyone has good material on the matter, I'd be grateful.
P.S. Speech recognition is not my field; I'm trying to apply some of its techniques to gesture recognition.
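Roughly: recognition concatenates the phoneme HMMs of each word (per a pronunciation lexicon) into one composite left-to-right HMM and runs Viterbi over it, picking the best-scoring word model; embedded training builds the same composite per training utterance and re-estimates all phone parameters jointly, with no per-phone segmentation needed. A toy sketch of the concatenation step (not any toolkit's actual code):

```python
import numpy as np

def concat_phone_hmms(trans_mats, exit_probs):
    # chain phone HMMs: each phone's final-state exit probability
    # feeds the first state of the next phone
    n_total = sum(A.shape[0] for A in trans_mats)
    A = np.zeros((n_total, n_total))
    offset = 0
    for i, (Ai, p_exit) in enumerate(zip(trans_mats, exit_probs)):
        n = Ai.shape[0]
        A[offset:offset + n, offset:offset + n] = Ai
        if i < len(trans_mats) - 1:
            A[offset + n - 1, offset + n] = p_exit
        offset += n
    return A / A.sum(axis=1, keepdims=True).clip(min=1e-12)

# two hypothetical 3-state left-to-right phones make one word model
phone = np.array([[0.6, 0.4, 0.0], [0.0, 0.6, 0.4], [0.0, 0.0, 0.6]])
word_A = concat_phone_hmms([phone, phone], exit_probs=[0.4, 0.4])
print(word_A.shape)   # (6, 6) composite word HMM
```

The same chaining idea should carry over directly to gesture units in place of phonemes.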
1) The idea is to recognize context from a conversation about a topic. For instance, if two people are talking (assuming their voices do not overlap), the system should be able to differentiate between the two voices, either by simply telling the two voices apart or by recognizing the users (which would require voice enrolment, so I would work on that once the rest of the project is done).
2) After separating the contents of the conversation by who said what, I would further analyze the contents.
I want to implement an i-vector based speaker recognition system, to be tested against the NIST SRE 2008 dataset. I have the training files and .sph files; however, these files are not labeled, and no speaker ID is available with them.
Apparently, there are answer-key files that determine the identity of the speaker (among many other properties) of the test files.
The file names should look like: "NIST_SRE08_short2.model.key"
Can anyone provide me with these files or point me to a guide?
If we train our PLDA with microphone data only and test with telephone data, will it affect the system performance?
And if we train with a large amount of microphone data and only a little telephone data, how much will the accuracy be affected?
Or should there be a balance between them?
I would like to analyze vocal responses from a working-memory n-back task with two possible responses ("yes" vs. no response). The aim of the analysis is to get an automatically generated output file with two columns: (1) the subject's study code (1...n), or the file label, and (2) the vocal response (e.g., "yes" vs. no, or 1 vs. 0).
I already tried Inquisit Lab 5's "Analyze recorded responses" tool, but it did not work that well; i.e., after analyzing a few data sets, which were coded correctly, Inquisit was no longer able to distinguish between responses and non-responses.
Do you have experience with Inquisit Lab 5, or any other suggestions regarding speech recognition?
Thanks a lot!
What would be the effect of utterance length on speaker recognition? I.e.,
if T, the UBM, LDA and PLDA are trained on short utterances (3 to 15 seconds), but
the enrolled (modeled) speakers are trained on long utterances (30 to 60 seconds), would it affect the performance of the system?
Two systems (speaker recognition):
- UBM-GMM: optimal time for training and testing the system
- i-vector: optimal time for training and testing the system
I am currently working on my project module, which is a speech recognition system. For it, I chose CMU Sphinx (the PocketSphinx version), but I am stuck on how to use it: I want to run it in Microsoft Visual Studio or in the Unity MonoDevelop environment, and after that I want to write its grammar. I am aware of how the grammar works, but the point is that it is not running.
Any form of help will be appreciated...
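In case it helps while debugging the Unity route, here is a heavily hedged sketch of the same flow with the pocketsphinx Python package (option names vary across package versions; "digits.gram" is a placeholder JSGF grammar file):

```python
from pocketsphinx import LiveSpeech

# disable the default language model and constrain decoding to a grammar
for phrase in LiveSpeech(lm=False, jsgf="digits.gram"):
    print(phrase)
```

Getting the grammar working in a simple script first can isolate whether the problem is in the grammar or in the Visual Studio/Unity integration.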
For speaker recognition, we need development data to train T and the UBM.
Is it possible that a single speech sample contains more than one speaker, i.e.,
s = s1 + s2 + s3?
The following is one of the recent research reports on ASR built using a deep learning framework: Dario Amodei, ..., Andrew Ng, ..., Zhenyao Zhu, "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin," arXiv:1512.02595, Cornell University Library, Dec 2015. I want to know about similar recognition work on other languages using deep neural networks.
I am conducting a study to develop and validate a paediatric picture-identification test to assess speech recognition in the Sinhala language. It is somewhat similar to the Word Intelligibility by Picture Identification (WIPI) test developed by Mark Ross and Jay Lerman in 1970.
Now I am at the stage of consulting a professional artist to draw the pictures. I have heard that there are specific guidelines for drawing such pictures; if anyone is aware of them, please be kind enough to let me know.
Your assistance and help are highly appreciated. Thank you and best regards.