Science topic

Speech Recognition - Science topic

Speech Recognition Discussion Group
Questions related to Speech Recognition
  • asked a question related to Speech Recognition
Question
10 answers
I need to analyze some qualitative data, including both video and voice recordings, for my master's thesis. Before analysing these data, I will need transcripts (written forms) of them. I am therefore looking for software or websites that provide automated speech recognition (ASR). I would appreciate it if the tool were free of charge.
Looking forward to hearing some suggestions from senior researchers.
Relevant answer
Answer
Thank you for the information you have provided, Nishat Vasker. I will check them out.
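For readers with the same need, a minimal free-of-charge sketch using the open-source SpeechRecognition package might look like the following; the file name and language code are placeholders, and the free Google Web Speech endpoint it calls makes no guarantees about accuracy or quotas.
```python
# Minimal sketch: transcribe a WAV file with the free SpeechRecognition package.
# Assumes: `pip install SpeechRecognition`; "interview.wav" is a placeholder file name.
import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile("interview.wav") as source:
    audio = recognizer.record(source)  # read the entire file into memory

try:
    # Free Google Web Speech endpoint; quality and quotas are not guaranteed.
    text = recognizer.recognize_google(audio, language="en-US")
    print(text)
except sr.UnknownValueError:
    print("Speech was unintelligible to the recognizer.")
except sr.RequestError as e:
    print(f"Could not reach the recognition service: {e}")
```
Note that this sends audio to an external service, so it may not be suitable for confidential interview data.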
  • asked a question related to Speech Recognition
Question
3 answers
I have a project in which I am given a dataset (more than enough) of 10-20 second audio files (singers performing the "swaras"/"ragas": "sa re ga ma pa") without any labels or other annotation. I have to create a deep learning model that recognises which swara is being sung and for how long it is present in the audio clip (the time range of each particular swara: sa, re, ga, ma).
The questions I am looking to answer are:
1. How can I achieve this goal? Should I use an RNN, CNN, LSTM, a hidden Markov model, or something else such as unsupervised learning for speech recognition?
2. How can I capture the correct speech tone for an Indian language, given that most acoustic speech recognition models are tuned for English?
3. How can I find the time range, i.e. over which interval a particular swara is present in the music clip, and how do I add that time-range detection to the speech recognition model?
4. Are there any existing music recognition models that resemble my research topic? If yes, please tag them.
I am looking for a full guide for this project as it is completely new to me; people who are interested in working with me or guiding me are also welcome.
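As a rough starting point (not a full solution), one common pattern for question 3 is to slice the clip into short frames, extract features per frame, assign a label per frame, and then convert runs of identical labels back into time ranges. A minimal sketch with librosa is below; the k-means clustering step is only a stand-in for whatever classifier you eventually train, and all file names and numbers are placeholders.
```python
# Sketch: frame-wise features -> per-frame labels -> time ranges.
# Assumes: `pip install librosa scikit-learn`; "raga_clip.wav" is a placeholder.
import librosa
import numpy as np
from sklearn.cluster import KMeans

y, sr = librosa.load("raga_clip.wav", sr=16000)
hop = 512
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop)  # (13, n_frames)

# Unsupervised stand-in for a swara classifier: cluster frames into 5 groups.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(mfcc.T)

# Collapse consecutive identical labels into (start_time, end_time, label) segments.
times = librosa.frames_to_time(np.arange(len(labels)), sr=sr, hop_length=hop)
segments, start = [], 0
for i in range(1, len(labels) + 1):
    if i == len(labels) or labels[i] != labels[start]:
        segments.append((times[start], times[i - 1], int(labels[start])))
        start = i
print(segments[:10])
```
Once labels are available (even weak ones), the same framing logic works with a supervised frame classifier or a CTC-style sequence model.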
  • asked a question related to Speech Recognition
Question
5 answers
Hello, I am looking for papers about the pros and cons of CNNs and RNNs, and the advantages of a hybrid CNN-RNN model over the two separate models (if indeed there is an advantage) in speech recognition tasks, or in event detection tasks. Can anyone suggest relevant studies?
Relevant answer
Answer
Thank you for your reply, Aravinda C V. I will have a look at it.
  • asked a question related to Speech Recognition
Question
4 answers
Hi everyone,
My teammates and I want to find out whether there is a way to do (remote) scientific collaboration in the field of Machine Learning/Deep Learning on speech recognition and audio analysis. The goal is simply to learn; collaborators would become members of our project.
Thanks in advance.
Relevant answer
Answer
Please have a look at our (Eminent Biosciences, EMBS) collaborations and let me know if you are interested in associating with us.
Our recent publications are in collaboration with industry and academia in India and worldwide.
EMBS publication In association with Universidad Tecnológica Metropolitana, Santiago, Chile. Publication Link: https://pubmed.ncbi.nlm.nih.gov/33397265/
EMBS publication In association with Moscow State University , Russia. Publication Link: https://pubmed.ncbi.nlm.nih.gov/32967475/
EMBS publication In association with Icahn Institute of Genomics and Multiscale Biology,, Mount Sinai Health System, Manhattan, NY, USA. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/29199918
EMBS publication In association with University of Missouri, St. Louis, MO, USA. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30457050
EMBS publication In association with Virginia Commonwealth University, Richmond, Virginia, USA. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27852211
EMBS publication In association with ICMR- NIN(National Institute of Nutrition), Hyderabad Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/23030611
EMBS publication In association with University of Minnesota Duluth, Duluth MN 55811 USA. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27852211
EMBS publication In association with University of Yaounde I, PO Box 812, Yaoundé, Cameroon. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30950335
EMBS publication In association with Federal University of Paraíba, João Pessoa, PB, Brazil. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30693065
Eminent Biosciences(EMBS) and University of Yaoundé I, Yaoundé, Cameroon. Publication Link: https://pubmed.ncbi.nlm.nih.gov/31210847/
Eminent Biosciences(EMBS) and University of the Basque Country UPV/EHU, 48080, Leioa, Spain. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27852204
Eminent Biosciences(EMBS) and King Saud University, Riyadh, Saudi Arabia. Publication Link: http://www.eurekaselect.com/135585
Eminent Biosciences(EMBS) and NIPER , Hyderabad, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/29053759
Eminent Biosciences(EMBS) and Alagappa University, Tamil Nadu, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30950335
Eminent Biosciences(EMBS) and Jawaharlal Nehru Technological University, Hyderabad , India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/28472910
Eminent Biosciences(EMBS) and C.S.I.R – CRISAT, Karaikudi, Tamil Nadu, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30237676
Eminent Biosciences(EMBS) and Karpagam academy of higher education, Eachinary, Coimbatore , Tamil Nadu, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30237672
Eminent Biosciences(EMBS) and Ballets Olaeta Kalea, 4, 48014 Bilbao, Bizkaia, Spain. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/29199918
Eminent Biosciences(EMBS) and Hospital for Genetic Diseases, Osmania University, Hyderabad - 500 016, Telangana, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/28472910
Eminent Biosciences(EMBS) and School of Ocean Science and Technology, Kerala University of Fisheries and Ocean Studies, Panangad-682 506, Cochin, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27964704
Eminent Biosciences(EMBS) and CODEWEL Nireekshana-ACET, Hyderabad, Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/26770024
Eminent Biosciences(EMBS) and Bharathiyar University, Coimbatore-641046, Tamilnadu, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27919211
Eminent Biosciences(EMBS) and LPU University, Phagwara, Punjab, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/31030499
Eminent Biosciences(EMBS) and Department of Bioinformatics, Kerala University, Kerala. Publication Link: http://www.eurekaselect.com/135585
Eminent Biosciences(EMBS) and Gandhi Medical College and Osmania Medical College, Hyderabad 500 038, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27450915
Eminent Biosciences(EMBS) and National College (Affiliated to Bharathidasan University), Tiruchirapalli, 620 001 Tamil Nadu, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/27266485
Eminent Biosciences(EMBS) and University of Calicut - 673635, Kerala, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/23030611
Eminent Biosciences(EMBS) and NIPER, Hyderabad, India. ) Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/29053759
Eminent Biosciences(EMBS) and King George's Medical University, (Erstwhile C.S.M. Medical University), Lucknow-226 003, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/25579575
Eminent Biosciences(EMBS) and School of Chemical & Biotechnology, SASTRA University, Thanjavur, India Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/25579569
Eminent Biosciences(EMBS) and Safi center for scientific research, Malappuram, Kerala, India. Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/30237672
Eminent Biosciences(EMBS) and Dept of Genetics, Osmania University, Hyderabad Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/25248957
EMBS publication In association with Institute of Genetics and Hospital for Genetic Diseases, Osmania University, Hyderabad Publication Link: https://www.ncbi.nlm.nih.gov/pubmed/26229292
Sincerely,
Dr. Anuraj Nayarisseri
Principal Scientist & Director,
Eminent Biosciences.
Mob :+91 97522 95342
  • asked a question related to Speech Recognition
Question
2 answers
What are your considerations when detecting speech? For example: sex (M/F), age, BMI, emotion, pitch, etc.
Relevant answer
Answer
Actually, lip rounding decreases all formants, not only F3.
  • asked a question related to Speech Recognition
Question
3 answers
We know that the pre-processing for speech recognition includes echo cancellation, de-reverberation, audio enhancement, noise suppression, and so on. Is there a comprehensive toolbox or codebase for that purpose that could serve as a starting point for a beginner?
Relevant answer
Answer
Firstly, you should have some fundamentals of speech processing (speech signal basics and features, and DSP procedures such as the FFT, STFT, convolution, ...).
You should also know important signal processing techniques such as computing the energy and power spectrum of a speech signal, autocorrelation, silence-removal methods, etc.
I suggest this good book as a start:
"Fundamentals of Speech Recognition" by Lawrence Rabiner and Biing-Hwang Juang.
Regarding your question about a toolbox, MATLAB has the Deep Learning Toolbox to design, train, and analyze deep learning networks.
You can preprocess audio input with operations such as echo cancellation, audio enhancement, and noise suppression by using datastores and functions available in the Deep Learning Toolbox and in other MATLAB toolboxes such as the Signal Processing Toolbox, which offer functions, datastores, and apps for processing and augmenting deep learning data.
For this, see the linked example, which shows how to train a deep learning model that detects the presence of speech commands in audio.
Best regards
  • asked a question related to Speech Recognition
Question
2 answers
I have an audio dataset, and I have to determine whether each sound is normal or abnormal. I have preprocessed the audio data using MFCCs and got output of shape (audio files, MFCC coefficients, MFCC vector, 1). When I pass this input to a ConvLSTM, it raises an error saying that it requires 5-dimensional input, but I am passing 4-dimensional input. How can I increase the dimension, or is there a way to pass 4-dimensional input to a ConvLSTM?
Kindly guide me.
Regards
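If you are using Keras, ConvLSTM2D expects 5-D input of shape (samples, time steps, rows, cols, channels). A minimal sketch of reshaping a 4-D MFCC array and feeding it to ConvLSTM2D is below; the dimension names and sizes are assumptions about your data layout.
```python
# Sketch: turn a 4-D array (files, coeffs, frames, 1) into the 5-D input ConvLSTM2D expects.
# Assumes Keras/TensorFlow 2.x; each MFCC matrix is treated here as a one-step "video".
import numpy as np
import tensorflow as tf

n_files, n_coeffs, n_frames = 100, 13, 98
x = np.random.rand(n_files, n_coeffs, n_frames, 1).astype("float32")  # your 4-D data

# Add a time axis: (samples, time=1, rows, cols, channels).
x5d = np.expand_dims(x, axis=1)
print(x5d.shape)  # (100, 1, 13, 98, 1)

model = tf.keras.Sequential([
    tf.keras.layers.ConvLSTM2D(16, kernel_size=(3, 3),
                               input_shape=x5d.shape[1:], activation="tanh"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # normal vs. abnormal
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.summary()
```
Alternatively, instead of a single time step, you can split the frame axis into several chunks so the time dimension carries real temporal structure.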
  • asked a question related to Speech Recognition
Question
6 answers
Hello! I am hoping to find a speech recognition tool that can automatically determine the length of each speaker's speaking turn in seconds, based on an audio or video recording of a conversation. I'd like to know the average length of each person's speaking turn in a given recorded conversation. Also, I'm looking for an accurate measure of how much overlap there was between different speakers. Ideally, the tool would automatically detect that multiple speakers were talking at the same time, and give me either a percentage or number of seconds in the conversation sample that more than one person was talking.
Relevant answer
Answer
Yaakov J Stein That is very helpful, thank you very much!
Heidi
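In case it helps other readers, a speaker-diarization toolkit such as pyannote.audio can produce per-speaker time segments, from which turn lengths and overlap can be computed. A hedged sketch is below; the pretrained pipeline name, access requirements, and exact API may differ between pyannote versions, and the file name is a placeholder.
```python
# Sketch: speaker turns and overlap with pyannote.audio (API details vary by version).
# Assumes: `pip install pyannote.audio` and access to the pretrained pipeline.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")
diarization = pipeline("conversation.wav")  # placeholder file name

turns = {}
segments = []
for segment, _, speaker in diarization.itertracks(yield_label=True):
    turns.setdefault(speaker, []).append(segment.duration)
    segments.append((segment.start, segment.end, speaker))

for speaker, durations in turns.items():
    print(speaker, "average turn length:", sum(durations) / len(durations), "s")

# Overlap: time covered by more than one speaker (naive pairwise check,
# which may double-count spots where three or more speakers overlap).
overlap = 0.0
for i, (s1, e1, spk1) in enumerate(segments):
    for s2, e2, spk2 in segments[i + 1:]:
        if spk1 != spk2:
            overlap += max(0.0, min(e1, e2) - max(s1, s2))
print("Total overlapped speech:", overlap, "s")
```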
  • asked a question related to Speech Recognition
Question
12 answers
The standard approaches in speech recognition (and A.I. generally) rest on some very interesting assumptions that deserve to be carefully examined...
Relevant answer
Answer
Hello Marc Champagne , interesting question. Let me answer in detail, in a structured but pedestrian manner:
(1) Waveform
First of all, we have to be clear about what we define as waveform: the sound produced is captured by microphones (or equivalently decoded from a received coded signal) and it is analysed as a signal s(t), where t is usually discrete, hence the function s(t) (for continuous time) becomes a sequence also called a time series.
This is called a speech signal. It is analysed in the time domain, in the frequency domain or any transform domain (wavelet, or other time-frequency representation)
Quantization and pattern recognition on di-phones, which are transitions between phonemes, can be performed. Other methods are possible.
This corresponds to a cooking analogy: you are given a plate with a cooked dish, and you are trying to recover the cook's recipe, i.e. how the dish can be made.
(2) Analyse further speech production
Humans speak by shaping their mouth and vocal tract, and the airflow through them, etc. This is well described and understood, and used medically by speech therapists ("orthophonists") to help people overcome difficulties they may have in producing audible and understandable speech.
Another angle of analysis is the lip-reading methodology used by people with hearing disabilities. This assumes good visibility of the speaker: mouth, lips, eyes and other elements of body language which give cues. With this, you can develop an automated lip, face and body-language reading system.
The human speech production system can be modeled as flexibly deformable connected tubes, together with the vocal-tract airflow mechanisms such as glottal closure (for which I developed a detector, years ago).
This proactive modeling of speech production, versus the "waveform-based methods", corresponds to an intelligent automatic piano (or other instrument) that plays the score you feed into it, versus the recorded playing as a waveform.
(3) Conclusion
Speech coding made tremendous progress when it started to combine analysis and synthesis (the method is called analysis-by-synthesis and has been used in codecs since the era of CELP, code-excited linear prediction), as in the first globally standardised digital mobile communication system, GSM (the codec was RPE-LTP, by Kroon and Deprettere).
Now may be the time to revisit how combining production modeling with auditory recognition models could bring new systems of practical interest.
Does it help you?
Let me know
  • asked a question related to Speech Recognition
Question
1 answer
What are the next steps in speaker identification after we extract the MFCC features? Thank you so much.
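A classical next step is to model each enrolled speaker's MFCC distribution (e.g. with a GMM) and score test utterances against each model; i-vector or x-vector embeddings are the more modern route. A minimal GMM sketch with scikit-learn, using placeholder arrays, is below.
```python
# Sketch: classical GMM speaker identification on MFCC frames.
# Assumes MFCC matrices of shape (n_frames, n_coeffs) already extracted per utterance.
import numpy as np
from sklearn.mixture import GaussianMixture

# Placeholder enrollment data: {speaker_name: stacked MFCC frames}.
enrollment = {
    "alice": np.random.randn(2000, 13),
    "bob": np.random.randn(2000, 13),
}

models = {spk: GaussianMixture(n_components=16, covariance_type="diag",
                               random_state=0).fit(frames)
          for spk, frames in enrollment.items()}

test_frames = np.random.randn(300, 13)  # MFCC frames of an unknown utterance
scores = {spk: gmm.score(test_frames) for spk, gmm in models.items()}  # avg log-likelihood
print(max(scores, key=scores.get))      # most likely speaker
```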
  • asked a question related to Speech Recognition
Question
14 answers
I am pursuing my master's in data science. I need good research papers on speech recognition that I can refer to for my research on creating neural networks, e.g. recurrent neural networks for speech-to-text.
Relevant answer
Answer
A more practical machine learning course: https://www.fast.ai/
  • asked a question related to Speech Recognition
Question
3 answers
I need this on an urgent basis: how can I implement Arabic speech to Urdu text using a speech recognition MATLAB tool? If I use a neural network, is it possible to design it according to my requirements?
Speech to text (using MATLAB).
Arabic Quranic speech to Urdu text: I have audio datasets and recorded speech, and I want to convert them into Urdu text using MATLAB.
Answers please!
Thanks
Relevant answer
Answer
  • asked a question related to Speech Recognition
Question
3 answers
My daughter, student paramedic, is proposing a service improvement for UK ambulance service, to add dictation to electronic patient care reporting. If you have done or know of similar service improvements specifically using speech recognition in pre-hospital care, please reply. Thanks, Paul Kompfner
Relevant answer
Hi, implementing a dictation system seems to eliminate documentation errors; I recommend reading the related studies.
  • asked a question related to Speech Recognition
Question
11 answers
Affective technologies are the interfaces concerning the emotional artificial intelligence branch known as affective computing (Picard, 1997). Applications such as facial emotion recognition technologies, wearables that can measure your emotional and internal states, social robots interacting with the user by extracting and perhaps generating emotions, voice assistants that can detect your emotional states through modalities such as voice pitch and frequency and so on...
Since these technologies are relatively invasive to our private sphere (feelings), I am trying to find influencing factors that might enhance user acceptance of these types of technologies in everyday life (I am measuring the effects with the TAM). Factors such as trust and privacy might be very obvious, but moderating factors such as gender and age are also very interesting. Furthermore, I need relevant literature which I can ground my work on since I am writing a literature review on this topic.
I am thankful for any kind of help!
Relevant answer
Answer
Affective technologies such as social robots must respond appropriately according to context. For example, if the goal is to build empathy (towards human acceptance), the social robot must imitate the affective state of humans. In any case, affective technologies need to recognize human emotions first. In this context, we developed this paper:
I hope it will be useful
  • asked a question related to Speech Recognition
Question
4 answers
In Python, which method is best for extracting the pitch of speech signals?
I have extracted pitch via "piptrack" in "librosa" and "PitchDetection" in "upitch", but I'm not sure which of these is the most accurate.
Is there any simple, real-time (or at least near-real-time) approach instead of these?
Relevant answer
Answer
There are so many varieties of tools for extracting pitch, but none of the fully automatic algorithms I know can guarantee accuracy and consistency of extracted f0, especially in terms of continuous f0 trajectories in connected speech. An alternative is to allow human operators to intervene where automatic algorithms helplessly fail. ProsodyPro (http://www.homepages.ucl.ac.uk/~uclyyix/ProsodyPro/) provides such a function. It is a script based on Praat—A program already with some of the best pitch extraction algorithms. But ProsodyPro allows human users to intervene with difficult cases by rectifying raw vocal pulse markings. It thus maximizes our ability to observe continuous f0 trajectories.
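For a purely automatic Python option, librosa's implementation of the pYIN algorithm is a reasonable baseline to compare against piptrack; a minimal sketch is below (the file name and frame parameters are assumptions).
```python
# Sketch: f0 tracking with pYIN in librosa (librosa >= 0.8).
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=16000)  # placeholder file
f0, voiced_flag, voiced_prob = librosa.pyin(
    y,
    fmin=librosa.note_to_hz("C2"),   # ~65 Hz lower bound
    fmax=librosa.note_to_hz("C6"),   # ~1047 Hz upper bound
    sr=sr,
    frame_length=1024,
)
times = librosa.times_like(f0, sr=sr, hop_length=1024 // 4)
voiced_f0 = f0[voiced_flag]          # f0 is NaN in unvoiced frames
print("median f0 of voiced frames:", np.nanmedian(voiced_f0), "Hz")
```
As the answer above notes, fully automatic trackers will still fail on difficult stretches of connected speech, so manual correction (e.g. via ProsodyPro/Praat) remains valuable.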
  • asked a question related to Speech Recognition
Question
3 answers
Can anyone help me with how to build a phoneme embedding? The phonemes have different sizes in some features; how can I solve this problem?
thank you
Relevant answer
Answer
Yes, you can use an RNN encoder-decoder to produce the phoneme embeddings; that is, the RNN maps each phoneme into an embedding space.
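As a concrete illustration of that idea, a minimal PyTorch sketch is below: phonemes are mapped to fixed-size vectors with an embedding layer, and variable-length phoneme sequences are handled by padding. The vocabulary and dimensions are made up for the example.
```python
# Sketch: learnable phoneme embeddings with padding for variable-length sequences.
import torch
import torch.nn as nn

phone_vocab = {"<pad>": 0, "AA": 1, "B": 2, "K": 3, "T": 4}   # toy vocabulary
embed = nn.Embedding(num_embeddings=len(phone_vocab), embedding_dim=32, padding_idx=0)
encoder = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)

# Two phoneme sequences of different lengths, padded to the same length.
seqs = [["K", "AA", "T"], ["B", "AA"]]
max_len = max(len(s) for s in seqs)
ids = torch.tensor([[phone_vocab[p] for p in s] + [0] * (max_len - len(s)) for s in seqs])

vectors = embed(ids)              # (batch, max_len, 32): one vector per phoneme
_, (h_n, _) = encoder(vectors)    # h_n: sequence-level representation per example
print(vectors.shape, h_n.shape)
```
In a full encoder-decoder setup, the embedding and encoder would be trained jointly with a decoder (or a downstream task) so the phoneme vectors become meaningful.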
  • asked a question related to Speech Recognition
Question
12 answers
Hello, I work with convolutional neural networks and LSTMs in speech emotion recognition, and in my results I see that the CNN shows better performance than the traditional LSTM on my speech recognition task.
Why is this?
Normally the LSTM should be better for speech recognition, since I use sequential data.
Thanks
Relevant answer
Answer
The offset is the displacement from frame to frame. If you measure the sliding window from the beginning of one frame to the beginning of the next, a sliding-window offset of 25 ms with a 25 ms window means there is no overlap between frames. Suppose you slice a sequence of phonemes every 25 ms with a sliding-window offset of 25 ms: what guarantee do you have of capturing every phoneme cleanly within those frames? That is why you use some offset x < 25 ms.
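In other words, the usual recipe is a ~25 ms window with a smaller hop (offset), e.g. 10 ms, so adjacent frames overlap. A small sketch of how frame length and hop length interact is below; the sampling rate and values are assumptions.
```python
# Sketch: 25 ms windows with a 10 ms hop give overlapping frames.
import numpy as np
import librosa

sr = 16000
y = np.random.randn(sr)                 # 1 s of placeholder audio
frame_length = int(0.025 * sr)          # 400 samples = 25 ms
hop_length = int(0.010 * sr)            # 160 samples = 10 ms -> 15 ms overlap

frames = librosa.util.frame(y, frame_length=frame_length, hop_length=hop_length)
print(frames.shape)  # (400, 98): 98 overlapping frames from 1 s of audio

# With hop_length == frame_length there is no overlap, and a short event
# sitting on a frame boundary could be split across two frames.
```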
  • asked a question related to Speech Recognition
Question
6 answers
I have trained an isolated spoken-digit model for 0-9. My speech recognition system recognizes the isolated digits 0, 1, 2, ..., 9, but it fails to recognize continuous digit strings such as 11, 123, 11111, etc. Can anyone please help me move from isolated digits to connected digits?
Relevant answer
Answer
Segmentation of naturally spoken speech into words, even when there is a relatively small dictionary of words, is a harder problem than recognizing isolated digits.
People tend to think of spoken words as somehow isolated but "close" in time. This is not the case, unless you have a cooperating speaker (who helps the detection, or at least monitors it and repeats when it misdetects).
You can easily find in the literature the standard end-point detection mechanisms people use (mostly Viterbi based), and then run the isolated word detectors, but they are computationally expensive and don't really work very well for natural speech (the possible exception was flexible endpoint DTW, but I doubt that you are using DTW as a detector).
Y(J)S
  • asked a question related to Speech Recognition
Question
6 answers
The goal is to localize the starting time and the ending time of each phoneme in the waveform signal. If the code is written in Java, that would be better! Thanks in advance!
Relevant answer
Answer
If you look on the Carnegie Mellon website, the speech technology group has a lot of free tools available for research on language processing.
  • asked a question related to Speech Recognition
Question
5 answers
I have searched many sites for real speech recognition code, even a simple example, so if you know of any code please contact me here or at my email: kedjar421994@gmail.com
Relevant answer
Answer
Actually, there are quite a lot of codes on GitHub for automatic speech recognition. For example:
and
When you want to find some real codes, GitHub is always a good option. Good luck.
  • asked a question related to Speech Recognition
Question
1 answer
I am searching for RNN and CNN code for a speech recognition system where each speaker speaks sentences from a database, not single words.
Relevant answer
Answer
If you are OK with Python, you can try the TensorFlow implementation of DeepSpeech, which is based on RNNs:
  • asked a question related to Speech Recognition
Question
6 answers
Is there any available tool that allows me to train an ML model on my own voice and then use the result for speech recognition, just like Windows Cortana?
Relevant answer
Answer
I don't know of any ready-made tool that does this like Cortana. But if you can generate a good labeled dataset of your voice, you can train a model with scikit-learn or TensorFlow for speech recognition. There are also tools built specifically for speech recognition, such as CMU Sphinx and Kaldi ASR.
  • asked a question related to Speech Recognition
Question
6 answers
I have two speech signals coming from two different people. I want to find out whether or not both people are saying the same phrase. Is there anything that I can directly measure between the two signals to know how similar they are?
Relevant answer
Answer
It sounds simple, but unfortunately, it is not!
There are many confounding factors that make this process complicated. I give you some examples: consider you have a recording of your own voice recorded in a sound proof room saying "OPEN THE DOOR", and you would like to use that recording as the reference to which other voice commands are compared to take an action to open the door, for example.
  • Now, if you utter the same utterance but in a noisy environment, the two recordings are no longer the same.
  • If you change the room and record it in a reverberant room, the two signals are no longer the same.
  • If you say the same sentence but in different speed (speech rate) as you uttered the reference one, the two signals are no longer the same.
  • If you utter the same sentence but in different rhythm as you uttered the reference one, again, the two signals are no longer the same.
  • Now, consider that all or some of the above mentioned factors happen at the same time. Again, the two signals are no longer the same.
  • Now, imagine that you want to compare your reference signal with another person's recording of the same sentence. If both recordings are recorded in a similar environmental condition (same room, same equipment) and the same rhythm and rate, again, the two recordings are not the same.
  • Age, gender, health condition are other confounding factors that influence the signal.
Considering the formants of the two signals and comparing them using some similarity measure could be a very simple and quick solution. Unfortunately, it does not give good results, since, for example, the similarity score of two completely different sentences recorded in one particular acoustic environment can be higher than that of two roughly similar sentences recorded in different environments, or the second recording may contain the same words as the reference but in a different order.
To deal with these factors and variabilities, you might need a model (such as a hidden Markov model or Gaussian mixture model) to capture the acoustic characteristics of the signals (in some relevant feature space such as the cepstral or time-frequency domain) and to relate the segments of a signal to language units, and you also need a language model to link these units to recognize the sentence. All of these procedures are covered within the speech recognition field.
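That said, if a rough, speaker- and condition-dependent similarity score is enough (e.g. a single-user voice command matched against its own reference), comparing MFCC sequences with dynamic time warping is the classic lightweight approach. A hedged sketch is below; the file names are placeholders and any accept/reject threshold must be tuned on your own data.
```python
# Sketch: compare two utterances via DTW over MFCC sequences.
import librosa
import numpy as np

def mfcc_seq(path, sr=16000):
    y, sr = librosa.load(path, sr=sr)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # shape (13, n_frames)

a = mfcc_seq("reference_open_the_door.wav")   # placeholder files
b = mfcc_seq("test_utterance.wav")

# Accumulated-cost matrix D and optimal warping path wp.
D, wp = librosa.sequence.dtw(X=a, Y=b, metric="cosine")
cost = D[-1, -1] / len(wp)     # path-length-normalised alignment cost
print("normalised DTW cost:", cost)
# Lower cost = more similar; the threshold is data-dependent.
```
DTW absorbs some of the speech-rate differences mentioned above, but it does not solve the noise, reverberation, or cross-speaker issues; for those, the model-based route described in the answer is the proper solution.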
  • asked a question related to Speech Recognition
Question
4 answers
I want to add different feature vectors to improve the recognition rate using HTK. How can I do this?
Relevant answer
Answer
You can also use two separate models for different features, and then a sum of probabilities for each class.
  • asked a question related to Speech Recognition
Question
5 answers
I'm considering writing a "Speech Processing 101" compendium for the course I'm teaching, because I'm not aware of any existing good material for such a course. That leads up to two questions:
- Could you recommend a forum for publishing educational material with an open-access license?
- Alternatively, are you aware of a good resource for Speech Processing 101, with an emphasis on the DSP side (=it's not about speech recognition).
Relevant answer
Answer
Hi
There are several repositories and platforms to publish Open Educational Resources.
To get a good overview, I would recommend visiting the site «OER World Map» (https://oerworldmap.org).
Additionally, I think you could have a look at the following publishing houses:
- Meson Press: https://meson.press/
subjects: digital cultures and networked media
subjects: mainly humanities and social sciences
Best
Anne-Katharina
  • asked a question related to Speech Recognition
Question
13 answers
Apart from the PER metric, what existing performance metrics can be used to compare two different recognizers in speech recognition?
Relevant answer
Answer
word error rate (WER%)
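Besides PER, word error rate is the standard metric: (substitutions + deletions + insertions) divided by the number of reference words, computed via edit distance over word sequences. A small self-contained sketch is below.
```python
# Sketch: word error rate via Levenshtein distance over word sequences.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("show me the weather", "show the whether"))  # 0.5: one deletion, one substitution
```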
  • asked a question related to Speech Recognition
Question
4 answers
Hi all, I am doing some research on continuous speech recognition. I want to implement the product HMM using HTK, and I have run into some problems. For example, I don't know how to initialize the product HMM (i.e. using the single-stream HMM or using a proto?), or how to tie the silence models "sil" and "sp" (because the number of states in "sil" has changed and is no longer 3). Can someone help me solve these problems? Thank you very much.
Relevant answer
Answer
  • asked a question related to Speech Recognition
Question
5 answers
Hi everyone
I am studying some papers on speech recognition using CNNs. Does anyone know how to feed a speech signal into a CNN?
Relevant answer
Answer
The best features are log-mel features (mel-frequency spectral coefficients), i.e. MFCC features without the cosine transform.
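As a concrete illustration, a log-mel spectrogram can be computed and shaped as a single-channel "image" for a CNN as follows; the file name and parameter values are typical but arbitrary.
```python
# Sketch: log-mel spectrogram as CNN input of shape (n_mels, n_frames, 1).
import librosa
import numpy as np

y, sr = librosa.load("utterance.wav", sr=16000)  # placeholder file
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)
log_mel = librosa.power_to_db(mel)               # (40, n_frames)

cnn_input = log_mel[..., np.newaxis]             # add a channel axis for Conv2D
print(cnn_input.shape)
```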
  • asked a question related to Speech Recognition
Question
1 answer
Offline data synchronization is an important technique when designing an app that should work without an Internet connection. Synchronization between a server and an Android device helps users use applications more effectively even when they are not connected to the internet, by saving data both locally and remotely.
Paper:
Sethuraman, Raj, Roger A. Kerin, and William L. Cron. "A field study comparing online and offline data collection methods for identifying product attribute preferences using conjoint analysis." Journal of Business Research 58.5 (2005): 602-610.
Relevant answer
Answer
Generally, the performance quality attribute consists of response time, availability and, in some cases, reliability. But in the question, the real need for security and performance seems unrelated to the main use case. Also, the terms AI, NN and speech recognition are used rather vaguely. AI is usually the superset of a variety of algorithms/methods/techniques that mimic intelligent behavior by a computer, so speech recognition is an application of AI. An NN is an AI tool that can be used for a wide variety of problems, including speech recognition.
So, can you give us more details about the exact use case of the system, or the real requirements?
  • asked a question related to Speech Recognition
Question
5 answers
Hi everyone.
I was wondering if I could ask about gesture and sports performance.
I think that gesturing some movements before the actual performance has some effect on the subsequent actual movement performance (accuracy, fluency, timing and so on).
In fact, in some sports, such as baseball, table tennis or boxing, swinging the bat or racket without a real ball, or moving alone, is a popular practice.
I found some gesture studies involving speech and recognition for classification, but I couldn't find studies investigating whether gestures, pantomimes or mimicking influence the subsequent action, using experimental psychological methods.
It would be great if anybody could tell me about this research field.
Thank you
Takahiro Sugi
Relevant answer
Answer
Cunha, R. G., Torres, F. E., & Zângaro, R. A. (2018). Ground reaction force in the kinetic analysis of the sporting gesture shot in lower limbs. Revista Argentina de Bioingeniería, 22(1), 55-59.
Wakefield, E., Novack, M. A., Congdon, E. L., Franconeri, S., & Goldin‐Meadow, S. (2018). Gesture helps learners learn, but not merely by guiding their visual attention. Developmental science, e12664.
Ciman, M., & Wac, K. (2018). Individuals’ stress assessment using human-smartphone interaction analysis. IEEE Transactions on Affective Computing, 9(1), 51-65.
  • asked a question related to Speech Recognition
Question
1 answer
Can anyone tell me how to form the HMM model parameters, i.e. (A, B, pi), in a speech recognition system, which are then given as input to the forward algorithm to find the observation probability, using MATLAB?
Relevant answer
Answer
Well then, it has been five years lol.
You can see the book "Fundamentals of Speech Recognition" by L. Rabiner and B. H. Juang; there are some chapters about features and HMMs.
In MATLAB, you can check the documentation for the functions hmmtrain, hmmdecode, etc.
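If a Python equivalent is acceptable as a reference, the hmmlearn package exposes exactly those three parameter sets (the initial distribution pi, the transition matrix A, and the emission model B) and runs the forward algorithm via score(). A minimal sketch on placeholder MFCC data is below; this is only an illustration, not the MATLAB workflow itself.
```python
# Sketch: fitting a Gaussian HMM and inspecting (A, B, pi); score() runs the forward pass.
import numpy as np
from hmmlearn import hmm

X = np.random.randn(500, 13)          # placeholder MFCC frames for one word
model = hmm.GaussianHMM(n_components=5, covariance_type="diag", n_iter=20)
model.fit(X)

print(model.startprob_)               # pi: initial state distribution
print(model.transmat_)                # A: state transition matrix
print(model.means_.shape)             # B: Gaussian emission parameters (with covars_)

log_likelihood = model.score(X)       # forward algorithm: log P(observations | model)
print(log_likelihood)
```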
  • asked a question related to Speech Recognition
Question
4 answers
I am researching Spanish prosody and want to download a prosody-annotated corpus. Does anyone know of one?
Relevant answer
Answer
I don't have access to the corpus; the website for downloading it didn't work. If anyone has it, please let me know.
Thanks to Christina E. Valaki and Camilo Enrique Díaz Romero
  • asked a question related to Speech Recognition
Question
4 answers
Hello ,
I'm currently working on G2P research; the experiments are nearly finished and I'm in the process of writing up the paper. My question is the following: I have noticed from the papers I have read that word error rate (WER) is a common metric for reporting G2P performance. WER naturally measures errors over sentences; with a little adaptation it can also measure errors over words. In my case I have a dictionary for which I want to calculate the WER. How do you think I should tackle this? Should I calculate the WER for each word in the dictionary and then describe the model's performance on the test data by averaging the WER over those words, or is there some other way to get a more accurate value?
I'm asking this because in most of the papers I have read, the authors report the model's performance as a single WER value without indicating how that value was derived (averaging or some other method).
thank you
Relevant answer
Answer
Hello,
We did some research using WER (Word-Error Rate) metric, if this is helpful.
For WER and PER calculation we used Hjerson, a tool for automatic classification of translation errors. As input, the tool requires the reference translation(s) and the hypothesis, along with their corresponding base forms. It counts the minimum number of insertions, deletions and substitutions that have to be performed to convert the generated machine translation (hypothesis) into the reference text. Every word in the hypothesized sentences is compared with the reference sentence, and every word which does not match (inserted, deleted or substituted) is counted as an error and divided by the total number of words in the reference sentence.
The main disadvantage of WER is that it does not take permutations of words into consideration, i.e. the word order of the hypothesis translation cannot differ from the word order of the reference, even if the translation is correct. We computed correlations with human evaluations.
Here are some references:
- WER for ASR (automatic speech recognition) + MT
- WER for speech synthesis
- WER for MT
Best,
Sanja
  • asked a question related to Speech Recognition
Question
3 answers
  • Robots use speech recognition to capture messages from people (I think I'm correct).
  • Do they use TTS to speak?
Relevant answer
Answer
Yes of course they do. I can't see another way. Best wishes.
  • asked a question related to Speech Recognition
Question
6 answers
Do you know of good papers/theses with experiments on neural network applications? The area (image/speech recognition/weather forecasting/etc.) is not important, but which are good (or outstanding in your opinion) in terms of:
  • How the design/experiment is described (reproducible);
  • Execution / implementation;
  • Analyses of results (used metrics etc.);
  • Threats to validity / discussion
I am also interested to hear about some papers which are (in your opinion) only good in a particular area (e.g. the analysis) but weak in other areas.
I see a lot of NN papers which are weak (e.g. not really reproducible, small dataset, analysis on only error rate and that is it).
Do you know some papers about (reporting/design) guidelines for NN-experiments? I found e.g.:
Zhang, G. Peter. "Avoiding pitfalls in neural network research." Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on 37.1 (2007): 3-16.
The background is that I want to avoid falling into the most common pitfalls in my thesis, or at least mention them if I cannot resolve them given time constraints, etc.
Thanks
Relevant answer
Answer
agree with @Andrew Ekstrom
regards
  • asked a question related to Speech Recognition
Question
10 answers
Hello,
I would like an explanation of the MFCC coefficients we obtain: only the first 12-13 coefficients are considered when evaluating the performance of the feature vector. What is the reason behind this, and what would be the effect of also taking higher coefficients? And how do we know whether our feature vector is good or bad? For example, in the case of a sound signal, once we compute its feature vectors, how can we analyze whether the sound features are good?
The other question is about the LPC feature extraction method. Since it is based on the order of the coefficients, an LPC order of 10-12 is mostly used in this scheme; what is the reason behind this, and what would be the effect on performance of taking a lower or higher order?
If we compare the MFCC and LPCC methods, one works in the mel-cepstrum domain and the other in the cepstrum domain. What is the benefit of the cepstrum, what is the main difference between the mel cepstrum and the cepstrum, and which one is better?
Relevant answer
Answer
An intuition about the cepstral features can help to figure out what we should look for when we use them in a speech-based system.
- As cepstral features are computed by taking the Fourier transform of the warped logarithmic spectrum, they contain information about the rate changes in the different spectrum bands. Cepstral features are favorable due to their ability to separate the impact of source and filter in a speech signal. In other words, in the cepstral domain, the influence of the vocal cords (source) and the vocal tract (filter) in a signal can be separated since the low-frequency excitation and the formant filtering of the vocal tract are located in different regions in the cepstral domain.
- If a cepstral coefficient has a positive value, it represents a sonorant sound since the majority of the spectral energy in sonorant sounds are concentrated in the low-frequency regions.
- On the other hand, if a cepstral coefficient has a negative value, it represents a fricative sound since most of the spectral energies in fricative sounds are concentrated at high frequencies.
- The lower order coefficients contain most of the information about the overall spectral shape of the source-filter transfer function.
- The zero-order coefficient indicates the average power of the input signal.
- The first-order coefficient represents the distribution spectral energy between low and high frequencies.
- Even though higher order coefficients represent increasing levels of spectral details, depending on the sampling rate and estimation method, 12 to 20 cepstral coefficients are typically optimal for speech analysis. Selecting a large number of cepstral coefficients results in more complexity in the models. For example, if we intend to model a speech signal by a Gaussian mixture model (GMM), if a large number of cepstral coefficients is used, we typically need more data in order to accurately estimate the parameters of the GMM.
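As a small illustration of the usual choice, extracting 13 MFCCs plus delta and delta-delta coefficients (a common 39-dimensional frame feature vector) with librosa looks roughly like this; the file name is a placeholder.
```python
# Sketch: 13 MFCCs + deltas + delta-deltas -> 39-dimensional frame features.
import numpy as np
import librosa

y, sr = librosa.load("speech.wav", sr=16000)      # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
delta = librosa.feature.delta(mfcc)
delta2 = librosa.feature.delta(mfcc, order=2)

features = np.vstack([mfcc, delta, delta2])       # shape (39, n_frames)
print(features.shape)
```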
  • asked a question related to Speech Recognition
Question
3 answers
What types of features can help to recognize the speaker/speech more accurately? Why do we need additional features for speaker recognition? What are the recent trends in this field? Is any literature available on recent trends in speech recognition?
Relevant answer
Answer
For speaker recognition :
  • The DNN-free approach : MFCC/LFCC (+ Delta + DeltaDelta) features [1] are usually used. More robust alternatives such as MHEC [2] and PNCC [3] have also been developed in the past years. Another alternative is NMF (non-negative matrix factorisation) applied directly on the spectrogram [4].
  • The DNN-based approach : A deep neural network can be trained to either :
  1. learn a new unsupervised representation of the acoustic features : an auto-encoder/denoising auto-encoder/VAE [5] is trained using stacked mel-filterbank or mfcc features as input/output, then the activation of one of the hidden layers is used as a new representation (also called bottleneck features when the dimension of the new representation is very low compared to the original)
  2. learn a more sophisticated (and hopefully discriminative) representation by training a DNN as a phonetic classifier at the frame level (+context) [6] and using the activations of one of the hidden layers as a new representation. A detailed review can be found in [7].
This list doesn't exclude acoustic-linguistic features such as formants, rhythmic features, and high-level features such as prosody. It's important to know that the choice of features can be application-dependent; different features may be chosen for a general-purpose automatic speaker recognition system versus a forensic speaker recognition system.
---
References :
[1] Hansen, John HL, and Taufiq Hasan. "Speaker recognition by machines and humans: A tutorial review." IEEE Signal processing magazine 32.6 (2015): 74-99.
[2] Sadjadi, Seyed Omid, Taufiq Hasan, and John HL Hansen. "Mean Hilbert envelope coefficients (MHEC) for robust speaker recognition." Thirteenth Annual Conference of the International Speech Communication Association. 2012.
[3] McLaren, Mitchell, et al. "Improving speaker identification robustness to highly channel-degraded speech through multiple system fusion." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
[4] Joder, Cyril, and Björn Schuller. "Exploring nonnegative matrix factorization for audio classification: Application to speaker recognition." Speech Communication; 10. ITG Symposium; Proceedings of. VDE, 2012.
[5] Zhang, Zhaofeng, et al. "Deep neural network-based bottleneck feature and denoising autoencoder-based dereverberation for distant-talking speaker identification." EURASIP Journal on Audio, Speech, and Music Processing 2015.1 (2015): 12.
[6] Lei, Yun, et al. "A novel scheme for speaker recognition using a phonetically-aware deep neural network." Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on. IEEE, 2014.
[7] Matějka, Pavel, et al. "Analysis of DNN approaches to speaker identification." Acoustics, Speech and Signal Processing (ICASSP), 2016 IEEE International Conference on. IEEE, 2016.
  • asked a question related to Speech Recognition
Question
5 answers
A computational method is built from multiple interconnected processing units. The network is composed of an arbitrary number of cells (nodes, units, or neurons) that link the input to the output.
An artificial neural network is a practical way to learn various kinds of functions, such as real-valued functions, discrete-valued functions, and vector-valued functions.
Neural network learning is robust to errors in the training data, and such networks have been successfully applied to problems such as speech recognition, image identification and interpretation, and robot learning.
Relevant answer
Answer
Hi Maysam Toghraee,
The function code is available in the folder ..\toolbox\nnet\nnet\nnetwork. You must analyze it.
  • asked a question related to Speech Recognition
Question
4 answers
What aspects of augmented/virtual reality applications could be improved by techniques of artificial intelligence? We can think of good speech recognition commands, virtual assistants like Siri or Cortana but what else?
Ideas, research group websites specialized in AR/VR and AI, state-of-the-art papers, etc. are welcome, as I want to cover as many possibilities as I can in the shortest time...
Relevant answer
Answer
The question is interesting. Just some wild suggestions: AI allows the integration of sensory data outside the normal human range. So,
a few among many possibilities below:
1. We could, say, attach meaning to high frequencies (above 20 kHz) of dog howls and barks and translate them into human speech.
2. It is known, for instance, that animals and birds are able to sense earthquakes well before humans. Perhaps AI would be useful here.
3. It might help in recognizing human emotions (body-language interpretation, for example).
Cheers
  • asked a question related to Speech Recognition
Question
3 answers
Dear All,
I need your valuable suggestions/comments for selection of research topic. I have two options as:
1-Speech Enhancement
2- Speech Recognition
I am actually totally new to this domain. Therefore, I need guidance on which topic is best from a research point of view and comparatively easy to understand and develop, so that I can produce some good publications in the end.
Please mention some state-of-the-art techniques/algorithms being used for the analysis of the above-mentioned topics. Moreover, please tell me if there are any new engineering applications of speech enhancement/recognition. Thanks in advance!
Relevant answer
Answer
I'd recommend browsing through a classic to get an idea of what is what:
Lawrence R. Rabiner et al., Introduction to Digital Speech Processing.
The research focus (for either topic) is now shifting to applications via deep learning, so maybe also look here:
  • asked a question related to Speech Recognition
Question
3 answers
Can we optimize the means and variances using a GA in speech recognition? Can we call this a refinement of the HMM, and can we label it with the term GA+HMM?
Relevant answer
Answer
Just to plug a somewhat related work that used a GA to optimise GMM-HMMs for acoustic modelling in speech recognition:
In that paper, what we did was to select a portion of Gaussians from each GMM to construct a dynamic GMM model for each speech frame (as a replacement of the original GMM). The selection was made based on a k-nearest neighbour rule (i.e. select k Gaussian components nearest to the current frame). k was a HMM state dependent parameter (so each GMM has a k) and all integers "k" were trained based on a discriminative criterion using GA (as it is an integer programming problem).
  • asked a question related to Speech Recognition
Question
4 answers
I have speech data consisting of different moods, and to recognize the moods in it I want to implement an HMM in MATLAB. As I am new to HMMs, I am finding it difficult to determine the parameters used. Can anyone please help me with some demo examples? Thank you!
  • asked a question related to Speech Recognition
Question
5 answers
I am in the process of conducting several structured interviews in a work environment. These will be around 5-10 minutes long, and I am planning to record the audio. Can you recommend any software that may help evaluate the recordings by 1) aiding the transcription (preferably with speech recognition) and 2) collating the statements made?
Thanks a lot!
Relevant answer
Answer
I know Anagraf, Praat and Audacity.
Praat and Audacity are free.
  • asked a question related to Speech Recognition
Question
5 answers
Dear sir/madam,
I have tried building an automatic speech recognition module using an HMM classifier and obtained some fine output. Later, I did a survey of state-of-the-art methods, which involve incorporating deep learning. Please suggest some MATLAB links that would be helpful for our work.
Thankyou in advance
Relevant answer
Answer
Dear  Hazrat Ali,
thank you for your kind replies and I will read the reference provided by you which is very helpful.
Have a  nice day
  • asked a question related to Speech Recognition
Question
6 answers
Projects for speech recognition: prototypes and small applications, preferably in Delphi.
Relevant answer
Answer
Sphinx is a flexible framework for research in speech recognition; use that.
  • asked a question related to Speech Recognition
Question
2 answers
How can I calculate the PESQ (Perceptual Evaluation of Speech Quality) score of a noisy speech signal, especially for speech signals that have a sampling frequency of 12 kHz?
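One practical route is the third-party Python `pesq` package (an assumption about your tooling; MATLAB and ITU reference implementations also exist). PESQ is only defined for 8 kHz (narrowband) and 16 kHz (wideband) signals, so a 12 kHz signal has to be resampled first; a hedged sketch with placeholder file names is below.
```python
# Sketch: PESQ with the third-party `pesq` package (pip install pesq).
# PESQ is defined for 8 kHz (narrowband) and 16 kHz (wideband) only,
# so 12 kHz signals are resampled to 16 kHz first.
import librosa
from pesq import pesq

ref, _ = librosa.load("clean.wav", sr=12000)      # placeholder files
deg, _ = librosa.load("noisy.wav", sr=12000)

ref16 = librosa.resample(ref, orig_sr=12000, target_sr=16000)
deg16 = librosa.resample(deg, orig_sr=12000, target_sr=16000)

score = pesq(16000, ref16, deg16, "wb")           # wideband mode
print("PESQ:", score)
```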
  • asked a question related to Speech Recognition
Question
4 answers
Hi all, 
I would like to ask all the experts here, in order to get a better view of the use of cleaned signals from which the echo has already been removed using several types of adaptive algorithms for AEC (acoustic echo cancellation):
How can MSE and PSNR meaningfully improve the classification process? I mean, normally we evaluate using metrics such as WER, accuracy and perhaps EER too. Is there any connection between MSE and PSNR values and improving those classification metrics?
I would appreciate clarification on this.
Thanks much
Relevant answer
Answer
It is one of the old issues in a speech recognition research field. That is on the relationship between any speech enhancement technique and the classification accuracy.
As far as I know, both the MSE and PSNR are frequently used for improving the quality of the input. They are known as useful in reducing WER. However, the relationship with recognition accuracy is not directly proportional. 
Enhancing a noisy signal in terms of MSE or PSNR means that you may have a good quality of the input but there is a risk. Sometimes, unexpected artifacts are produced by the speech enhancement techniques and WER can be increased in the worst case.
So, in a phonemic classification task, a matched condition is more crucial. And in the case of a mismatched condition between training and test, MSE and PSNR are somewhat related to WER, but not directly; it is a case-by-case matter.
  • asked a question related to Speech Recognition
Question
1 answer
Dear sir/madam,
In most surveys of speech segregation, the hit rate and false alarm rate are calculated as percentages after mask estimation. If anyone knows the procedure to manipulate it, please share it so it can be verified. Thank you in advance.
Relevant answer
Answer
I'm not sure what you mean by "Procedure to manipulate"... what do you need to change for the HR and FAR? 
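In case it clarifies the usual definitions (an interpretation, not a fixed standard): with the ideal binary mask as ground truth and an estimated mask, the hit rate is the percentage of target-dominated (1) time-frequency units correctly kept, and the false-alarm rate is the percentage of interference-dominated (0) units wrongly kept; HIT-FA is often reported as a summary. A small numpy sketch is below.
```python
# Sketch: hit rate (HIT) and false-alarm rate (FA) from binary masks.
import numpy as np

ideal = np.random.rand(64, 100) > 0.5        # placeholder ideal binary mask (ground truth)
estimated = np.random.rand(64, 100) > 0.5    # placeholder estimated mask

hits = np.logical_and(ideal == 1, estimated == 1).sum()
false_alarms = np.logical_and(ideal == 0, estimated == 1).sum()

hit_rate = 100.0 * hits / max(ideal.sum(), 1)                         # % of 1s correctly kept
false_alarm_rate = 100.0 * false_alarms / max((ideal == 0).sum(), 1)  # % of 0s wrongly kept
print(f"HIT = {hit_rate:.1f}%, FA = {false_alarm_rate:.1f}%, HIT-FA = {hit_rate - false_alarm_rate:.1f}%")
```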
  • asked a question related to Speech Recognition
Question
3 answers
What is the time/cost effectiveness of using voice recorders with Dragon speech recognition with NVivo software? Any problems?
Relevant answer
Answer
Thank you very much for this information, this is very interesting and it is very helpful.
  • asked a question related to Speech Recognition
Question
9 answers
I find it odd, in this day and age of voice recognition software, that it is so difficult to find software that can automatically transcribe interview recordings. With powerful speech recognition engines such as Siri, Cortana, and Google Now, it seems odd that researchers are still stuck with traditional speech recognition software that requires voice training and can only reliably understand a single user.
I've experimented with Google's Keep app, which has a speech-to-text feature, and read some material I got from a government site (full of jargon). It did quite well with me reading from the site and speaking directly to the app, and it even did a reasonable job with my colleague, who called me over speakerphone and read back the same section of the website. And the app did all this with zero training! So the technology seems to exist already!
The problem with this sort of online transcription is that I don't know what the software company does with the audio or text that it hears and transcribes. I presume it is used to improve the software's accuracy. That doesn't work for me, because my interview recordings and the resulting transcriptions would contain confidential information that cannot be allowed to leave our internal network.
Am I overlooking a product that would meet my requirements? What are other qualitative researchers doing to resolve this challenge?
Relevant answer
Answer
https://en.wikipedia.org/wiki/List_of_speech_recognition_software may serve well as a starting point. If you 'only' need speech-to-text in English, a lot of options are available. When it comes to other languages: :(
Regards
  • asked a question related to Speech Recognition
Question
2 answers
Dear sir/madam,
I have segregated combined speech sources using a neural-network-based classifier in a speech segregation process. My doubt is whether we should use the outputs of the ideal binary mask for estimating the signal-to-noise ratio. Please guide me in doing the estimation.
Thankyou in advance
Relevant answer
Answer
Thank you for your kind response.
  • asked a question related to Speech Recognition
Question
1 answer
I'm working on a project and I need to demonstrate that one speech codec is better than another in terms of voice quality. Is there a way I can simulate them so I can compare both?
Relevant answer
Answer
Hello,
There's an implementation of some codecs in the ffmpeg and libav libraries. You can install them easily on Linux and use them to encode/decode your signals.
Greetings.
  • asked a question related to Speech Recognition
Question
7 answers
What algorithms and methods are being used for face recognition and speech recognition in today's software applications? What are their recognition accuracy and recognition rates?
Relevant answer
Answer
Deep learning is the best: it achieves high performance, but it needs thousands of examples to train, it takes a long time to train (days or weeks), it has no solid theoretical foundation for tuning the parameters (it is essentially a black box), and finally it does not help you understand what is going on as comprehensively as other well-known statistical approaches.
  • asked a question related to Speech Recognition
Question
6 answers
What features are useful for estimating age from human voices?
Relevant answer
Answer
Dear Manish,
In the following related papers you can find many relevant feature sets:
(1)
Sedaaghi, M. H. (2009). A comparative study of gender and age classification in speech signals. Iranian Journal of Electrical and Electronic Engineering, 5(1), 1-12.‏
(2)
Lingenfelser, F., Wagner, J., Vogt, T., Kim, J., & André, E. (2010). Age and gender classification from speech using decision level fusion and ensemble based techniques. In INTERSPEECH (Vol. 10, pp. 2798-2801).‏
(3)
Chaudhari, S., & Kagalkar, R. (2012). A Review of Automatic Speaker Age Classification, Recognition and Identifying Speaker Emotion Using Voice Signal. International Journal of Science and Research (IJSR).‏
(4)
Metze, F., Ajmera, J., Englert, R., Bub, U., Burkhardt, F., Stegmann, J., ... & Littel, B. (2007, April). Comparison of four approaches to age and gender recognition for telephone applications. In 2007 IEEE International Conference on Acoustics, Speech and Signal Processing-ICASSP'07 (Vol. 4, pp. IV-1089). IEEE.‏
(5)
Brown, W. S., Morris, R. J., Hollien, H., & Howell, E. (1991). Speaking fundamental frequency characteristics as a function of age and professional singing. Journal of Voice, 5(4), 310-315.‏
Best!
Yaakov
  • asked a question related to Speech Recognition
Question
3 answers
We are looking for a French text in which the speech sounds are selected so as to obtain a fixed proportion of voiced and unvoiced sounds (or several degrees of sonority). This text would be used in a contrastive multilingual experiment on vocal load.
In addition, we are interested in phonetically balanced corpora for French.
Thank you!
Relevant answer
Answer
You may find these books useful although quite old.
Lucile Charles & Annie-Claude Motron (2001). Phonetique Progressive du Francais avec 600 exercices. Paris: CLE International.
Lhote Elizabeth (1990). Le paysage sonore d'une langue, le francais. Hambourg: Buske Verlag.
Best wishes
  • asked a question related to Speech Recognition
Question
8 answers
Hello.
I'm setting up auditory fear conditioning, and I wonder how I can measure the decibel level of the tone used as the conditioned stimulus. I want a 75-dB tone and I have a decibel meter.
I am not sure where in the context I need to place the decibel meter
to adjust the tone to 75 dB: near the speaker? On the floor? In the middle? The speaker is on the right wall of a square-shaped context, and if I want to carry out fear extinction in a different, octagon-shaped context, I need to adjust the tone again for the new context, right? In that case, where do I place the decibel meter?
Thanks for reading and I'll be waiting for your tips.
Relevant answer
Answer
It would help to know more about your equipment, e.g., whether your microphone system has a probe tube on it. In essence you want to put the microphone at the position of the eardrum, which can be done with a probe tube and an anesthetized animal. Otherwise measure with the microphone at the position where the animal's eardrum(s) will be during the procedure. You also need to be sure that you are using a free-field calibrated microphone. If you change the setup you need to re-check the calibration. Make sure the space around the animal is free of hard, sound-reflecting surfaces or cover such things with soft cloth.
  • asked a question related to Speech Recognition
Question
11 answers
Hi there,
I would like to ask you how do you compare a speech sample and a different kind of auditory sample (e.g., noise, sounds produced by animals...) when you are looking for similarities and differences between the two samples.
For instance, there are some times when people believe they are listening to words when hearing a noise, or the wind. If a participant reported having heard "mother" when he/she actually listened to a noise, how would you carry out the comparison between the two different sounds? Is there any way to do that?
Ideas and references are welcome!
Thanks!
Relevant answer
Answer
You're looking at more of a psychological phenomenon than an acoustic one. It's similar to the "phonemic restoration" effect that has been studied in the past.
If you think of  the human auditory system as actively seeking evidence for a particular speech event and finding sufficient evidence for it in the sound then you get the observed phantom percept.  A different version of the effect can be observed in "babble", what people will hear in recordings of superimposed voices.
Actually, if you want to pursue this systematically it could get interesting. For example, can you find sounds that, across listeners, appear to be fertile sources of illusion? What are their characteristics?
  • asked a question related to Speech Recognition
Question
3 answers
Let's say that I was able to generate phoneme HMMs; now how can I use them to recognize a full sentence or even a word utterance?
Also, I've encountered many references to embedded training, where the whole utterance is used to train the phonemes. I've tracked it down, but I haven't found an explanation of the way it's implemented, so if anyone has any good material on the matter, I'd be grateful.
P.S. Speech recognition is not my field, but I'm trying to apply some of its techniques to gesture recognition.
Relevant answer
Answer
Hi Asmaa,
Like you, I am working on sequence classification problems such as gesture recognition, from the perspective of probabilistic graphical models.
I mainly study the CRF family for these tasks; as you mentioned, Bayesian networks such as HMMs are the common models used for such tasks, so I needed to apply HMMs in my experiments and use their results for comparison.
When I started using HMMs, I had exactly the same two questions. After a month of studying them, I completely agree with you that it is hard to find an explanation of the way it is implemented, but as a guide I suggest you look at the HTK Book: it is a good and complete document for the HTK toolkit, and you will find your answers there, with examples and implementation details.
Finally, to give you an overview, here are short answers to your questions:
[Q1] After training the HMMs separately on phonemes, they are combined into a single model, and for every test sequence (i.e. an utterance signal, or a word image in OCR) the Viterbi algorithm is run on the combined model to decode and find its label sequence (a minimal sketch of this idea follows after these notes).
[Q2] Note that when HMM training is performed on a training set consisting of utterances, we must have the phoneme segment positions for every utterance in the data.
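To make the decoding idea in [Q1] concrete, here is a minimal, self-contained sketch (plain numpy, not HTK): toy left-to-right phoneme HMMs are chained into word models, and a test sequence is assigned to the word whose model gives the best Viterbi score. The transition values, the number of states per phoneme and the observation log-likelihoods (random stand-ins here, where a real system would use GMM or neural-network scores) are all illustrative assumptions.

import numpy as np

def viterbi_score(log_pi, log_A, log_B):
    # log_pi: (S,) initial state log-probs, log_A: (S,S) transition log-probs,
    # log_B: (T,S) per-frame observation log-likelihoods under each state.
    delta = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        delta = np.max(delta[:, None] + log_A, axis=0) + log_B[t]
    return delta.max()

def phoneme_transitions(p_stay=0.7):
    # A toy 2-state left-to-right phoneme HMM.
    return np.array([[p_stay, 1.0 - p_stay],
                     [0.0,    1.0]])

def word_hmm(n_phonemes):
    # Chain the phoneme HMMs on the block diagonal; the last state of each
    # phoneme can either stay put or move to the first state of the next one.
    S = 2 * n_phonemes
    A = np.zeros((S, S))
    for i in range(n_phonemes):
        A[2*i:2*i+2, 2*i:2*i+2] = phoneme_transitions()
        if i + 1 < n_phonemes:
            A[2*i+1, 2*i+1] = 0.5
            A[2*i+1, 2*i+2] = 0.5
    pi = np.zeros(S); pi[0] = 1.0
    return np.log(pi + 1e-12), np.log(A + 1e-12)

rng = np.random.default_rng(0)
T = 40                                            # number of observation frames
scores = {}
for word, n_ph in {'yes': 3, 'no': 2}.items():    # word -> number of phonemes (illustrative)
    log_pi, log_A = word_hmm(n_ph)
    log_B = np.log(rng.random((T, 2 * n_ph)) + 1e-12)   # stand-in for real state likelihoods
    scores[word] = viterbi_score(log_pi, log_A, log_B)
print('recognized word:', max(scores, key=scores.get))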
  • asked a question related to Speech Recognition
Question
3 answers
1) The idea is to recognize context from a conversation about a topic. For instance, if two people are talking (assuming their voices do not overlap), the system should be able to differentiate between the two voices, either by simply telling the two voices apart or by recognizing the users (which would require training on their voices, so I would work on that once the rest of the project is done).
2) After separating the contents of the conversation based on who spoke what, I would further analyze the contents.
Relevant answer
Answer
I think your problem formulation is a little unclear...
1) It's a bit confusing when you say "and convert it into text". If you want to actually do speech-to-text, it is not quite related to speaker recognition...
2) In the title you say you want to recognize the user, but then you say that you want to find where 2 or more people speak. To find a target speaker in the utterance, you can start with conventional GMM or i-vector based approaches (have a look at https://pypi.python.org/pypi/bob.spear/1.1.2 to start with). However, classifying overlapping speech (from 2 or more users) needs different techniques and is in fact a bit more complicated.
Feel free to provide more details and reformulate your question to get more relevant feedback.
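As a very rough, unsupervised starting point for the "who spoke when" part (not a full diarization or speaker-identification system), you can summarize short segments with MFCCs and cluster them into two groups, hoping each group corresponds to one speaker. The sketch below assumes Python with librosa and scikit-learn; the file name, segment length and number of speakers are assumptions.

import numpy as np
import librosa
from sklearn.cluster import KMeans

y, sr = librosa.load('conversation.wav', sr=16000)       # assumed input file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)       # shape (13, n_frames)

frames_per_seg = int(sr / 512)                           # ~1 second of frames (default hop_length=512)
n_seg = mfcc.shape[1] // frames_per_seg
segs = (mfcc[:, :n_seg * frames_per_seg]
        .reshape(13, n_seg, frames_per_seg)
        .mean(axis=2).T)                                 # one 13-dim vector per segment

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(segs)
for i, lab in enumerate(labels):
    print(f'{i * frames_per_seg * 512 / sr:6.1f}s  speaker {lab}')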
  • asked a question related to Speech Recognition
Question
3 answers
Dear All,
I want to implement an i-vector based speaker recognition system. The system should be tested against the NIST SRE 2008 dataset. I have the training files and the sph files; however, these files are not labeled and no speaker ID is available with them.
Apparently, there are answer key files which determine the identity of the speaker (among many other properties) of the test files.
The file names should look like: "NIST_SRE08_short2.model.key"
Can anyone provide me with these files or give me some guidance?
Thanks.
Relevant answer
Answer
Dear Woo,
Thanks for your answer, but I don't have access to the full database. I have only a small portion of the test data and I want to use it in my speech processing course project. Would you mind sending me the key files for the test part?
Thanks in advance.
  • asked a question related to Speech Recognition
Question
1 answer
If we train our PLDA with microphone data only and test with telephone data, will it affect the system performance?
And if we train with a large amount of microphone data and only a little telephone data, how much will the accuracy be affected?
Or should there be a balance between them?
Relevant answer
Answer
1. Recognition always works on the correlation of data, or you could say the correlation of features in the data.
2. A larger number of samples will always help you increase accuracy.
3. Accuracy in your case will depend on what type of features you use.
4. If you want to improve accuracy, use both time-domain and frequency-domain features; it may slow down your algorithm, but accuracy will improve. The more features you include, the higher the accuracy tends to be (see the sketch below for one simple way to combine them).
5. My advice is to work with very simple logic: you need to increase correlation. That can be done by increasing the number of samples; if you cannot increase the samples, you have to add more features for recognition.
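As a small illustration of point 4, the sketch below builds one fixed-length feature vector per recording from both time-domain features (zero-crossing rate, RMS energy) and frequency-domain features (MFCCs). It assumes Python with librosa and numpy; the file name and parameters are placeholders.

import numpy as np
import librosa

y, sr = librosa.load('utterance.wav', sr=16000)            # assumed input file
zcr  = librosa.feature.zero_crossing_rate(y)               # time domain, shape (1, n_frames)
rms  = librosa.feature.rms(y=y)                            # time-domain energy, (1, n_frames)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)         # frequency domain, (13, n_frames)

# One fixed-length vector per utterance: frame-wise means of every feature.
features = np.concatenate([zcr.mean(axis=1), rms.mean(axis=1), mfcc.mean(axis=1)])
print(features.shape)                                      # (15,)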
  • asked a question related to Speech Recognition
Question
4 answers
I would like to analyze vocal responses from a working-memory n-back task with two possible responses ("yes" vs. no response). The aim of the analysis is to get an automatically generated output file with two columns: (1) the subject's study code (1...n), or rather the file label, and (2) the vocal response (e.g. "yes" vs. no response, or 1 vs. 0).
I already tried Inquisit Lab 5's tool "Analyze recorded responses", but it did not work that well, i.e. after analyzing a few data sets, which were coded correctly, Inquisit is no longer able to distinguish between responses and non-responses.
Do you have experience with Inquisit Lab 5, or any other suggestions regarding speech recognition?
Thanks a lot!
Relevant answer
Answer
Dear Volker! I hope these articles from my electronic library on the subject will be very useful to you.
Vladimir
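If no off-the-shelf tool works reliably, a very simple alternative (assuming the "yes" responses are clearly louder than silence) is to threshold the peak frame energy of each recording and write the two-column file described in the question. The sketch below assumes Python with soundfile and numpy; the folder layout, file naming and threshold are assumptions that would need tuning.

import csv, glob
import numpy as np
import soundfile as sf

rows = []
for path in sorted(glob.glob('recordings/subject*.wav')):  # assumed naming scheme
    y, sr = sf.read(path)
    if y.ndim > 1:
        y = y.mean(axis=1)                                  # force mono
    frame = int(0.025 * sr)                                 # 25 ms frames
    n = len(y) // frame
    rms = np.sqrt((y[:n * frame].reshape(n, frame) ** 2).mean(axis=1))
    responded = int(rms.max() > 0.02)                       # crude fixed threshold
    rows.append((path, responded))

with open('responses.csv', 'w', newline='') as f:
    csv.writer(f).writerows([('file', 'response')] + rows)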
  • asked a question related to Speech Recognition
Question
1 answer
What would be the effect of speech utterance length on speaker recognition? I.e.,
if T, UBM, LDA and PLDA are trained on short utterances (from 3 to 15 seconds), but
the enrolled (modeled) speakers are trained on long utterances, such as 30 to 60 seconds, would it affect the performance of the system?
Relevant answer
Answer
These models treat observations independently across time, so there should be no problem with training and testing on utterances of different lengths.
There are a few concerns that you should take into account:
1) It is better to make sure you have enough observations (per speaker) for training. If you have several short utterances per speaker, this should be fine.
2) UBM training can be severely affected by silences. When you have long utterances in test, they likely contain a lot of silence. A common practice is to use at least an energy-based voice activity detector and to score using only voiced frames (a minimal sketch follows below).
You may find the SRE10 Kaldi recipe useful, at least for getting some general ideas about data pre-processing.
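As a small illustration of the energy-based voice activity detection mentioned in point 2, the sketch below keeps only frames whose RMS energy exceeds a crude threshold, and only those frames would be passed on to feature extraction and scoring. It assumes Python with librosa; the file name, frame sizes and threshold are assumptions.

import numpy as np
import librosa

y, sr = librosa.load('long_test_utterance.wav', sr=16000)             # assumed input file
rms = librosa.feature.rms(y=y, frame_length=400, hop_length=160)[0]   # 25 ms frames, 10 ms hop
voiced = rms > 1.5 * np.median(rms)                                   # crude energy threshold

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20, n_fft=400, hop_length=160)
n = min(len(voiced), mfcc.shape[1])
mfcc_voiced = mfcc[:, :n][:, voiced[:n]]                               # score only these frames
print(mfcc.shape, '->', mfcc_voiced.shape)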
  • asked a question related to Speech Recognition
Question
6 answers
Two systems (speaker recognition):
1) UBM-GMM: optimal time for training and testing the system
2) i-vector: optimal time for training and testing the system
Relevant answer
Answer
Hi,
Responding to your last question,
It depends on what you're doing. Speaker recognition is a broad term, and in practice you're generally doing either verification (checking if two recordings correspond to the same speaker) or identification (trying to recognize the identity of the speaker present in the test segment by comparing it to a set of known speaker models).
In these two contexts, "enrollment data" generally means "known speakers" and "test data" means "suspected/unknown speakers". Enrollment data are some kind of "reference" to which you compare your test recordings (either for verification or identification) and "can" be used to train a scoring model, even though in general a different set, called the "training set", is used for this purpose.
For example, in NIST SRE 2010 (speaker recognition evaluation), the enrollment and test data (reference speakers and unknown/suspected speakers) are a subset of the NIST 2010 database, while the training data (used to train the UBM, the T matrix AND the PLDA model) belong to completely different datasets (e.g. NIST SRE 2004, 2005, 2006 / Switchboard II Phases 2 and 3 / Switchboard Cellular Parts 1 and 2 / Fisher English Parts 1 and 2 / ...). You can check the "Experiments and results" section of this paper for example [1].
Now what happens if you add a new speaker? It depends.
1 - If you are adding a new speaker class to your training data (a set of i-vectors corresponding to one particular speaker), you'll have to re-train your scoring model in order to take this new class into account (retrain your PLDA or re-compute your WCCN matrix, ...) and then use the new model to compare your enrollment/test segments. Generally, this does not happen, because you're supposed to use as much training data as possible from the start and stick with the same model for all your experiments (otherwise you would have to redo all your experiments every time you add new data in order to have comparable results, i.e. scores coming from the same model).
2 - If you're adding new enrollment data (new reference speakers), then nothing changes. The scoring model is supposed to act as a black box that provides scores for any new test/enrollment utterances. Once it is trained, it is used the same way for any test/enrollment data.
The important thing to understand is that the expression "speaker class" can have different interpretations when used to talk about enrollment and training data. The former refers to a "reference speaker" that will be compared against in the scoring phase (you can compare a test segment to one or many enrollment sessions [2]), while the latter refers to the speaker classes used to train the scoring model. It's like training a PCA or a regression model: the dataset used to train the model is generally independent from the one you're testing on, but if needed, you can transform your training data using the same model. A minimal sketch of the verification/identification distinction follows below.
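To illustrate only the verification/identification distinction described above (plain cosine scoring on random stand-in vectors, not the PLDA scoring of [1]), here is a minimal numpy sketch; the speaker names and i-vector dimensionality are assumptions.

import numpy as np

rng = np.random.default_rng(0)
enrolled = {f'spk{i}': rng.normal(size=400) for i in range(5)}   # enrollment i-vectors (stand-ins)
test = rng.normal(size=400)                                      # test i-vector (stand-in)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Verification: is the test segment from 'spk3'? Compare the score to a threshold.
print('verification score vs spk3:', cosine(test, enrolled['spk3']))

# Identification: which enrolled speaker matches best?
print('identified speaker:', max(enrolled, key=lambda s: cosine(test, enrolled[s])))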
---
References :
[1] Bousquet, Pierre-Michel, et al. "Variance-spectra based normalization for i-vector standard and probabilistic linear discriminant analysis." Odyssey. 2012.
[2] Liu, Gang, et al. "An investigation on back-end for speaker recognition in multi-session enrollment." Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013.
  • asked a question related to Speech Recognition
Question
1 answer
I am currently working on my project module, which is a speech recognition system. For that I chose CMU Sphinx (the PocketSphinx version), but I am stuck on how to use it: I want to run it in Microsoft Visual Studio or in the Unity MonoDevelop editor, and after that I want to write its grammar. I am aware of how its grammar works, but the point is that it is not running.
Any form of help will be appreciated...
Relevant answer
Answer
Did you check this tutorial?
If this does not help, I invite you to post your question, with the exact command and compilation errors, in the official CMU Sphinx community forum.
By the way, someone seems to have already had the same issue. Check here.
  • asked a question related to Speech Recognition
Question
3 answers
Can anyone tell me why, when I extract the audio signal from a video file, it gives a 2-D signal? When I use [y,fs] = audioread('audiofile.wav') it gives size of y = 131328x2 double.
Relevant answer
Answer
I assume you are speaking about MATLAB's audioread.
Your signal is stereo (left and right channels). To make it mono (if you do not need the information from the 2 channels), simply average the channels after reading:
[y,fs] = audioread('audiofile.wav'); y = mean(y,2);
  • asked a question related to Speech Recognition
Question
3 answers
For speaker recognition, we need development data to train T and the UBM.
Is it possible that a single speech sample contains more than one speaker, i.e.
s = s1 + s2 + s3?
Relevant answer
Answer
I believe that in the TIMIT database a single sentence is pronounced by a number of speakers separately, in different files; you may combine those samples for your application. Moreover, for conversational speech with two speakers, the LDC has several databases.
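If you only need development data with one speaker per file, the usual approach is simply to pool the separate files; if you deliberately want a single file containing several speakers, you can concatenate their signals (or, for overlap, sum them). A minimal sketch assuming Python with soundfile and numpy, mono files with the same sampling rate, and placeholder file names:

import numpy as np
import soundfile as sf

s1, sr = sf.read('speaker1.wav')          # placeholder files, mono, same sampling rate assumed
s2, _  = sf.read('speaker2.wav')

# Sequential multi-speaker file: speaker 1 followed by speaker 2.
sf.write('two_speakers_sequential.wav', np.concatenate([s1, s2]), sr)

# Overlapped mixture (s = s1 + s2): pad the shorter signal, then add.
n = max(len(s1), len(s2))
mix = np.pad(s1, (0, n - len(s1))) + np.pad(s2, (0, n - len(s2)))
sf.write('two_speakers_overlapped.wav', 0.5 * mix, sr)        # scale to avoid clipping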
  • asked a question related to Speech Recognition
Question
3 answers
Can you suggest good applications or software that can aid the transcription process?
Relevant answer
Answer
Hi, 
Here you will find some free software tools for transcribing interviews: http://www.adweek.com/socialtimes/best-free-interview-transcription-tools/186227
 More Free Interview Transcription Tools
Transcriptions (for Mac): “Media with tape behavior, Customizeable media-control shortcuts(->shortcut recorder), Timestamps, Text substitution and Footpedal-support”
Express Scribe Free (Windows or Mac): “The free version supports common audio formats, including wav, mp3, wma and dct. Download the free version of Express Scribe here. You can always upgrade to the professional version for proprietary format support, including ds2 and mvf.”
  • asked a question related to Speech Recognition
Question
3 answers
I need to do isolated-word speech recognition of 16 words using a backpropagation algorithm, with MFCCs as inputs.
Relevant answer
Answer
Hi Veena, to understand better and give an answer: do you mean catching some key words from a continuous audio stream? (So we can talk about phonetic word linkage, declination, and other pre-processing aspects.)
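One common way to set up the task described in the question is to turn each recording into a fixed-length MFCC summary and train a small multilayer perceptron on it; scikit-learn's MLPClassifier is trained with backpropagation, so it matches the requirement. The sketch below assumes Python with librosa and scikit-learn, a folder layout of data/<word>/<recording>.wav, and shows only two of the 16 word classes; all of these details are placeholders.

import glob
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier

WORDS = ['word01', 'word02']                 # extend to all 16 word classes

def mfcc_vector(path, sr=16000, n_mfcc=13):
    y, _ = librosa.load(path, sr=sr)
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])   # fixed-length summary

X, y = [], []
for label, word in enumerate(WORDS):
    for path in sorted(glob.glob(f'data/{word}/*.wav')):      # assumed folder layout
        X.append(mfcc_vector(path))
        y.append(label)

clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=0)
clf.fit(np.array(X), np.array(y))                             # trained via backpropagation
print('predicted word:', WORDS[clf.predict([mfcc_vector('test.wav')])[0]])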
  • asked a question related to Speech Recognition
Question
3 answers
The following is one of the recent research reports on ASR built using the deep learning framework: Dario Amodei, ..., Andrew Ng, .. Zhenyao Zhu, "Deep Speech 2: End-to-End Speech Recognition in English and Mandarin, arXiv:1512.02595, Cornell University Library, Dec 2015. I want to know about other similar recognition work on other languages using deep neural network.
Relevant answer
Answer
More recent German study, written in English, comparing traditional with DNN-based recognizers: https://www.lt.informatik.tu-darmstadt.de/fileadmin/user_upload/Group_LangTech/publications/Radeck-ArnethEtAl_TSD2015_SpeechCorpus.pdf
  • asked a question related to Speech Recognition
Question
3 answers
I am conducting a study to develop and validate a paediatric picture-identification test to assess speech recognition in the Sinhala language. It is somewhat similar to the Word Intelligibility by Picture Identification (WIPI) test developed by Mark Ross and Jay Lerman in 1970.
Now I am at the stage of consulting a professional artist to draw the pictures. I have heard that there are specific guidelines for drawing such pictures. If anyone is aware of such guidelines, please be kind enough to let me know.
Your assistance and help are highly appreciated. Thank you and best regards.
Yours sincerely,
Eranthi.
Relevant answer
Answer
Hi,
I think you can ask Madam Asha Yathiraj (link attached). She has prepared lots of picture books for speech identification in children. You may get very good inputs from her.
regards,
Nike
  • asked a question related to Speech Recognition
Question
3 answers