Science topic
Speech and Language Processing - Science topic
A group for researchers working in the area of speech and language processing.
Questions related to Speech and Language Processing
Hello,
We are working on a review regarding the relationship between language and the multiple-demand network. You will be responsible for addressing the reviewers' criticisms. Please leave your email address if you are interested.
Best,
W
Hi everyone. I have been conducting a few experiments with simultaneous speech, but I have been using recorded speech (.wav, .ogg or .mp3 files) in all of them. However, I would like to play the simultaneous speech using text-to-speech solutions directly, instead of saving to a file first (mainly to avoid the delay, but also so it can be used across OSes/devices).
All my attempts to play two simultaneous TTS voices (separate threads/processes, ...) have failed, as speech synthesis / TTS seems to use a single channel (resulting in sequential audio).
Do you know of any alternatives to make this work (independent of the OS/device, although Windows / Android are preferred)? Moreover, can you provide additional information / references on why it doesn't work, so I can try to find a workaround?
Thanks in advance.
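One possible workaround (a minimal sketch, not a tested solution): OS speech engines such as SAPI or the Android TTS service typically expose a single synthesis/playback channel, which is why parallel threads still come out sequentially. If you can get raw audio buffers from the TTS engine instead of letting it play them (many neural/cloud TTS APIs can return samples directly), you can mix and play them yourself. `synthesize()` below is a hypothetical placeholder for whatever engine you use; playback uses the sounddevice package.

```python
# Minimal sketch: mix two synthesized utterances and play them at the same time.
# synthesize() is a placeholder for any TTS call that returns raw mono float32
# samples at SAMPLE_RATE instead of playing them through the OS speech channel.
import numpy as np
import sounddevice as sd

SAMPLE_RATE = 22050  # assumed sample rate of the TTS output

def synthesize(text: str) -> np.ndarray:
    """Placeholder: replace with a TTS call that returns raw audio samples."""
    raise NotImplementedError

def play_simultaneously(text_a: str, text_b: str) -> None:
    a = synthesize(text_a)
    b = synthesize(text_b)
    n = max(len(a), len(b))
    a = np.pad(a, (0, n - len(a)))  # zero-pad the shorter utterance
    b = np.pad(b, (0, n - len(b)))
    # One voice per stereo channel; use (a + b) / 2 for a mono mix instead.
    stereo = np.stack([a, b], axis=1)
    sd.play(stereo, SAMPLE_RATE)
    sd.wait()
```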
I have trained an isolated spoken-digit model for 0-9. My speech recognition system recognizes isolated digits like 0, 1, 2, ..., 9, but it fails to recognize continuous digit strings like 11, 123, 11111, etc. Can anyone please help me in converting the isolated-digit recognizer into a connected-digit recognizer?
I am going to teach a Speech and Hearing Science class. We use Praat for acoustic analysis experiments. Recently, WaveSurfer was downloaded to the lab computers. I have never used it. I have read some research comparing the two software packages. What is the consensus? Is one better than the other? I do need directions for WaveSurfer, however.
I'm looking for papers on deep learning and machine learning applied to Arabic abstractive text summarization.
UPDATE (SEPTEMBER 2017)
TITLE: The Rate of Verbal Thought: An Hypothesis
AUTHOR: Ronald Netsell, PhD, Emeritus Professor, Communication Sciences and Disorders, Missouri State University, Springfield, MO
The purpose of this report is to develop the hypothesis that the rate of verbal thought is no faster than the rate of inner speech or speech aloud. There is prima facie evidence that inner speech and speech aloud are direct reflections of verbal thought. Why else would you say "That's not what I meant" after hearing what you said? Or, "I don't realize it until I hear it." This hypothesis was published in 1969 in an article entitled "Evidence that 'thinking aloud' constitutes an externalization of inner speech" (Benjafield, 1969). Others have discussed this hypothesis (Morin, 2009; Glass, 2013).
It's important to distinguish two types of inner speech: expanded and condensed (Fernyhough, 2004). Expanded inner speech refers to word-for-word production, while condensed inner speech is fragmentary, shifting rapidly across topics with single words. Interestingly, and in the context of the present report, these two types have also been referred to as "willful voluntary thought" and "verbal mind wandering", respectively (Perrone-Bertolotti et al., 2014). Apparently, these authors assumed that their types of inner speech represented verbal thought. The block diagram of Figure 1 distinguishes verbal from nonverbal thought. The idea that the rate of expanded inner speech (willful voluntary thought) is the same as the rate of verbal thought arose from our recent findings (Netsell et al., 2016). Participants were instructed to "say the first thing that comes to mind." Although this instruction was not intentionally designed to elicit verbal thought (thinking with words),
_____________________________________________________________________
Insert Figure 1 about here
_____________________________________________________________________
it appears to have done so. We found that expanded inner speech was 600 ms faster than speech aloud (p = .0002). We hypothesized that speech aloud was slower because of the time it takes to move the articulators (lips, tongue, etc.). This hypothesis has been criticized (e.g., Glass, 2013; Ghitza, 2016?).
These findings suggest that the rate of neural processing is the same for expanded inner speech and speech aloud. Why wouldn't the rate of neural processing of verbal thought be the same (~5.0 syllables/second)? We listen to our verbal thought on-line as we're talking aloud (speaking without 'thinking'). If what we say aloud doesn't match what we're thinking verbally, we'll say something like "That's not what I meant to say," and then revise what was said aloud. Obviously, the hypothesis that we think no faster than we talk will be very difficult to test empirically.
____________________________________________________
REFERENCES
Benjafield, J. (1969). Evidence that 'thinking aloud' constitutes an externalization of inner speech. Psychonomic Science, 15(2), 83-84.
Morin, A. (2009). Inner speech and consciousness. In W. P. Banks (Ed.), Encyclopedia of Consciousness (pp. 389-402). Oxford: Elsevier.
Glass, J. (2013). A neurobiological model of 'inner speech' for conscious thought. Journal of Consciousness Studies, 20, 7-14.
Fernyhough, C. (2004). Alien voices and inner dialogue: Towards a developmental account of auditory verbal hallucinations. New Ideas in Psychology, 22, 49-68.
Perrone-Bertolotti, M., Rapin, L., Lachaux, J.-P., Baciu, M., & Lœvenbruck, H. (2014). What is that little voice inside my head? Inner speech phenomenology, its role in cognitive performance, and its relation to self-monitoring. Behavioural Brain Research, 261, 220-239.
Netsell, R., Kleinsasser, S., & Daniel, T. (2016). The rate of expanded inner speech during spontaneous sentence productions. Perceptual & Motor Skills, 123(2), 383-393.
__________________________________________________________________________
Figure 1. A block diagram model representing the process of transforming thought into words. Thought can be verbal or nonverbal. Verbal thought can be expressed aloud without conscious thought (speaking without thinking). Alternatively, verbal thought can be expressed consciously as expanded inner speech, i.e., talking to yourself inside your head (Netsell et al., 2016).
For my bachelor thesis, I would like to analyse the audio stream of a few meetings of 5 to 10 people.
The goal is to validate some hypotheses linking the distribution of speaking time to workshop creativity. I am looking for a tool that can be implemented easily and without any extensive knowledge of signal processing.
Ideally, I would like to feed the tool an audio input and get the time segments of each speaker, either graphically or in matrix/array form.
- diarization does not need to be real-time
- the source can be single- or multi-stream (we could install microphones on each participant)
- the process can be (semi-)supervised if need be; we know the number of participants beforehand
- the tool can be a Matlab script, an .exe, a Java program, or similar; I am open to suggestions
Again, I am looking for the simplest, easiest-to-install solution (a sketch of one possible approach follows below).
Thank you in advance
Basile Verhulst
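One low-effort option, if Python is acceptable (a minimal sketch; the pretrained pipeline name and the need for a Hugging Face access token depend on the pyannote.audio version you install):

```python
# Minimal sketch: off-the-shelf speaker diarization with pyannote.audio.
from pyannote.audio import Pipeline

# Recent versions require use_auth_token=<your Hugging Face token>.
pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization")

# num_speakers is optional; pass it since the number of participants is known.
diarization = pipeline("meeting.wav", num_speakers=6)

# One (start, end, speaker) row per speech turn - easy to load into a matrix.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:.2f}\t{turn.end:.2f}\t{speaker}")
```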
I am currently looking at validating the ACE-III in the alcohol-related brain damage population. I would like to collapse the immediate and delayed memory domains of the RBANS to create a superordinate "memory" domain to allow a more direct comparison. Similarly, I would like to collapse the "language" and "fluency" domains of the ACE-III into a superordinate "language" domain for better comparison to the RBANS "language" domain. Is there a precedent for doing this?
Please, I need help finding papers relevant to extracting relations from conversational text, particularly work related to deep learning and machine-generated text.
What is your suggestion for obtaining reliability for a Visual Analogue Scale?
Dear colleagues,
I've almost completed intonation awareness-raising activities (English intonation training for Russian and Chilean EFL learners). I've got lots of recorded material that I'll now start to analyze. I'll be using Praat for displaying tones (falling, rising, fall-rising).
Is there anything specific about similarity measures for the Arabic language, or can we apply the measures developed in the literature, such as Levenshtein distance, directly?
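For what it is worth, Levenshtein distance itself is language-agnostic and works on Arabic strings directly, since it only compares Unicode characters; whether that is linguistically adequate (e.g. treating diacritics or alif/hamza variants as equivalent) is a separate normalization question. A minimal sketch:

```python
# Minimal sketch: plain Levenshtein distance over Unicode strings, applied to Arabic.
def levenshtein(s: str, t: str) -> int:
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("كتاب", "كتابة"))  # 1
```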
We talk about "boundaries of intonation units" and that language is a "code". And if we go in the direction of "categorial perceptions" and "motor theories," could it be expected that speech pauses (rather hesitations than breath pauses) can draw attention to the listener and thus promote the performance of remembrance? At which point could one expect a discriminating point at the pause length in relation to the rate of articulation? Imagine a Morse Code, e.g. SOS: We say three times short, three times long, three times short, but nobody talks about the silence between the individual units, right? Someone in distress at sea might have a different frequency of all units including pauses than someone on a deserted island who has been sending this code for weeks or months. How does the receiver discriminate between the individual units (in these cases, of course, we hope that there is a receiver at all ;-) ) and how does he know that it is an SOS signal? Can this model-like idea be applied to the language? And does it make any sense to think about the long-term memory? Or does it only concern the short-term memory and what is actually stored in the brain are generated emotions?
I am working on color categorization and terminology with bilingual speakers. The two languages follow different paths of categorization, and the systems that each language uses overlap in individual speakers' usage. I was wondering whether there are any other studies concerning a similar topic. Thanks!
Below is a partial abstract of our recent study. The inner speech sentences were self-timed by the subject. We're looking for a physical (EEG) measure of sentence onset and offset to calculate rate of inner speech.
ABSTRACT. The rates of expanded inner speech and outer speech were compared in 20 typical adults. Participants generated and timed spontaneous sentences with expanded inner speech and outer speech following the instruction to say "the first thing that comes to mind." The rate of expanded inner speech was slightly, but significantly, faster (by 0.6 seconds) than the rate of outer speech. The findings supported the hypothesis that expanded inner speech was faster than outer speech because of the time required to move the articulators in the latter. Physical measures of speaking rate are needed to validate self-timed measures.
Thanks for any input.
I want to see whether there are any relations between the consumption of digital media and speech development, or rather vocabulary acquisition.
The Arabic speech corpus developed by @Nawar Halabi @MicroLinkPc is a machine-generated voice synthesized from automatically diacritized texts, possibly with some human correction involved.
What diacritizer and which TTS tools and algorithms were used in the generation?
Furthermore, I would like to know whether there are any papers on the hours of speech and number of samples needed to obtain valid and reliable data using Automatic Speech Recognition.
We define automatic or fluent as "done without thinking". The question is asked using that definition.
I am trying to analyze frequency mean, range, and variability from a speaker reading a passage aloud. I am using Praat and a Matlab script I am writing to analyze these. The common pitch range in Praat is 75 Hz to 300 Hz for a male speaking voice and 100 Hz to 500 Hz for a female speaking voice. I want to make sure I am obtaining the most accurate fundamental frequencies of the voice, not higher frequencies from breaths or the ends of words. Does anyone with experience in these analyses have more accurate threshold criteria, or are these Praat settings suitable?
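If it helps, Parselmouth exposes Praat's own pitch algorithm from Python, so you can keep the analysis consistent with Praat while scripting it; a minimal sketch using the same floors/ceilings mentioned above (unvoiced frames come back as 0 and are dropped):

```python
# Minimal sketch: F0 mean, range, and variability with Parselmouth (Praat in Python).
import numpy as np
import parselmouth

snd = parselmouth.Sound("passage.wav")
pitch = snd.to_pitch(pitch_floor=75.0, pitch_ceiling=300.0)  # 100-500 Hz for female voices

f0 = pitch.selected_array["frequency"]
f0 = f0[f0 > 0]  # discard unvoiced frames

print("mean F0:", np.mean(f0))
print("range:", np.max(f0) - np.min(f0))
print("SD:", np.std(f0))
```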
During the process of questionnaire translation and validation, the original readability of the items should be maintained. How important is it to test the readability of a translated version of a questionnaire, and how should this readability be measured? The Gunning fog index? The Flesch-Kincaid readability tests? The Homan-Hewitt readability formula? Other suggestions, please? Should I compare each one-sentence original item with the corresponding translated item? Which statistical measure should I use? A repeated-measures t-test?
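If you go with standard indices, the textstat package computes several of them per item, and the per-item original vs. translated scores could then be compared with a paired test; note these formulas are calibrated for English, so the translated items may need a language-specific formula. A minimal sketch with hypothetical items:

```python
# Minimal sketch: per-item readability scores with textstat (English formulas).
import textstat

items_original = ["I often feel short of breath when climbing stairs."]  # hypothetical item
items_translated = ["..."]  # corresponding translated items

for orig in items_original:
    print(textstat.flesch_kincaid_grade(orig), textstat.gunning_fog(orig))
    # Score the translated items with a formula appropriate for the target language.
```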
I am doing speech recordings for an upcoming study to measure loudness and frequency of speech in people with motor disorders. We have a method for all of the recording already, but are having trouble playing a calibration tone at a known dB level. Would we do this by measuring with a sound level meter as close to the sound source as we can? We'll be using the same sound level meter to measure the calibration tone at a fixed distance from the microphone that is recording the participant.
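In case a generated tone is useful: a minimal sketch for producing a calibration tone at a fixed digital amplitude. The dB SPL value itself has to come from the sound level meter at the fixed distance; the file only guarantees that the tone is repeatable.

```python
# Minimal sketch: play a 1 kHz calibration tone at a fixed digital amplitude.
import numpy as np
import sounddevice as sd

fs = 44100
duration = 10.0    # seconds, long enough to read the meter
amplitude = 0.1    # fixed amplitude relative to full scale
t = np.arange(int(fs * duration)) / fs
tone = amplitude * np.sin(2 * np.pi * 1000 * t)

sd.play(tone, fs)
sd.wait()
```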
Hi, I would like to know what the specific language disabilities are. Is there any pattern or classification?
How will learning mnemonics improve language abilities? Is it more a matter of memory or of speech?
I am working with DMDX to record vocal responses. With four stimuli everything works fine, but with five the program just stops working.
I want to compare a model of speech synthesis to other current and well-known models (WaveNet, etc.).
Deep learning and generative models: what are the trends?
Anything in this area would be useful: surveys, recent articles, etc.
Hi all,
I would like to ask the experts here for a better view on the use of cleaned signals from which echo has already been removed using a few types of adaptive algorithms for AEC (acoustic echo cancellation).
How significant are MSE and PSNR for the classification process? Normally we evaluate using WER, accuracy, and perhaps EER as well. Is there any connection between MSE and PSNR values and improvements in those classification metrics?
I would appreciate clarification on this.
Thanks much
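For what it is worth, MSE and PSNR between the clean reference and the AEC output are easy to compute, but they are waveform-level measures and do not directly predict WER or accuracy; any link to the classification metrics would have to be established empirically. A minimal sketch:

```python
# Minimal sketch: MSE and PSNR between a clean reference and the processed signal.
import numpy as np

def mse(reference: np.ndarray, processed: np.ndarray) -> float:
    return float(np.mean((reference - processed) ** 2))

def psnr(reference: np.ndarray, processed: np.ndarray) -> float:
    peak = np.max(np.abs(reference))
    return float(10 * np.log10(peak ** 2 / mse(reference, processed)))
```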
A continuous sequence of images (for example, a deaf person signing in conversation with someone who does not understand the signs and would otherwise need an interpreter) would be converted to speech, so the system would serve as an image-to-speech converter.
I want to classify audio advertisements based on user preferences, so I need to extract features from the audio files that are pertinent to the users. I want to know whether there is a method for this that does not require processing the text form of the audio.
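A common text-free starting point (a minimal sketch, not a recommendation for any particular classifier): clip-level acoustic features such as MFCC statistics computed with librosa, which can then feed whatever model you use for the preference classification.

```python
# Minimal sketch: clip-level MFCC statistics as features for an audio ad.
import librosa
import numpy as np

def clip_features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    # Summarise the time axis with per-coefficient mean and standard deviation.
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])
```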
Hello. I would like to know if there is a readability formula for the Spanish language that uses SLMs and SVMs. Thank you in advance.
I had to create a word list for a speech intelligibility assessment I am completing. A previous, relatively large-scale study has analysed the phoneme distribution (in %) of the language. I need to compare the phoneme distribution (% for each sound) of my word list to the phoneme distribution of the large study. Which statistical test should I use to analyse whether they are similar to each other and hence ascertain that my list approximates the distribution? Thanks
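One common choice is a chi-square goodness-of-fit test, treating the large study's distribution as the expected proportions and your word list's phoneme counts (not percentages) as the observed frequencies; the usual caveat is that expected counts should not be too small. A minimal sketch with made-up numbers:

```python
# Minimal sketch: chi-square goodness of fit of word-list phoneme counts against
# the reference distribution from the large-scale study.
import numpy as np
from scipy.stats import chisquare

observed_counts = np.array([42, 30, 18, 10])         # counts per phoneme in the word list
expected_props = np.array([0.40, 0.30, 0.20, 0.10])  # proportions from the reference study

expected_counts = expected_props * observed_counts.sum()
stat, p = chisquare(f_obs=observed_counts, f_exp=expected_counts)
print(stat, p)  # a large p-value gives no evidence that the distributions differ
```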
Does anyone know of any sources to check the relative frequency of various consonant places of articulation in word-initial position in English (or any other language)?
In other words, what percentage of word-initial consonants in English are coronal, labial, dorsal, etc.?
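One way to get a rough answer yourself (a minimal sketch): count word-initial phones in the CMU Pronouncing Dictionary and map them to place of articulation. Note these are type counts over dictionary entries, not token counts over running text, and the place labels below are a simplification.

```python
# Minimal sketch: relative frequency of word-initial places of articulation in English,
# estimated from the CMU Pronouncing Dictionary (via NLTK).
from collections import Counter
import nltk

nltk.download("cmudict", quiet=True)
cmu = nltk.corpus.cmudict.dict()

PLACE = {
    "P": "labial", "B": "labial", "M": "labial", "W": "labial-velar",
    "F": "labiodental", "V": "labiodental",
    "TH": "dental", "DH": "dental",
    "T": "coronal", "D": "coronal", "N": "coronal", "S": "coronal", "Z": "coronal",
    "L": "coronal", "R": "coronal", "SH": "coronal", "ZH": "coronal",
    "CH": "coronal", "JH": "coronal", "Y": "palatal",
    "K": "dorsal", "G": "dorsal", "NG": "dorsal",
    "HH": "glottal",
}

counts = Counter()
for word, prons in cmu.items():
    first = prons[0][0].rstrip("012")  # strip stress digits (only vowels carry them)
    if first in PLACE:                 # skip vowel-initial words
        counts[PLACE[first]] += 1

total = sum(counts.values())
for place, n in counts.most_common():
    print(f"{place}: {100 * n / total:.1f}%")
```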
I want to do semi-supervised part-of-speech tagging. For this, I first want to cluster the unlabeled corpus based on word patterns. Which technique would be best for this?
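One simple starting point (a minimal sketch on toy data): represent each word by counts of its left and right neighbours and cluster with k-means. Brown clustering or HMM-based induction are the more standard choices for POS induction, so treat this only as a baseline.

```python
# Minimal sketch: distributional word clustering from an unlabeled corpus.
from collections import Counter, defaultdict
from sklearn.cluster import KMeans
from sklearn.feature_extraction import DictVectorizer

sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"], ["a", "dog", "sleeps"]]

contexts = defaultdict(Counter)
for sent in sentences:
    padded = ["<s>"] + sent + ["</s>"]
    for i in range(1, len(padded) - 1):
        w = padded[i]
        contexts[w]["L=" + padded[i - 1]] += 1   # left-neighbour feature
        contexts[w]["R=" + padded[i + 1]] += 1   # right-neighbour feature

words = sorted(contexts)
X = DictVectorizer().fit_transform(contexts[w] for w in words)

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
for w, c in zip(words, labels):
    print(w, c)
```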
I'm collecting data for my speech therapy degree. I'm building a three-test battery to gauge speed and accuracy in adults with former developmental dyslexia.
One of the tests is a lexical decision task, which needs to be made harder by presenting it tachistoscopically, and I therefore need to establish a baseline amount of time for the stimuli to be recognized (and not only detected).
In the literature one can find a rather large range of intervals for the minimum amount of time needed for stimulus recognition, from about 20 ms to about 200 ms, depending on word length and some other variables, but mostly in studies of visual analysis.
What I am seeking is very specific data on the baseline for word recognition, i.e., data on reading abilities in typical subjects.
I could find very few solutions, and those had a limited level of applicability and were not directed at speech. I am very curious to know about any methods relevant to speech.
As I'm new to the topic, I'm looking for information on benchmark corpora that can be obtained (not necessarily free) for audio event classification or computational auditory scene analysis.
I'm especially interested in house/street sounds.
Our project entails the evaluation of the "best" ASR software that runs in the cloud and, preferably, on embedded devices.
While we will start with grammar/command applications, we want to quickly migrate to applications that require NLU & NLP processing at a "state-of-the-art" level. This is a commercial platform, not a "toy".
There are differences between the sampling frequencies of audio, such as 8000 Hz, 16000 Hz, 44100 Hz, etc.
Why do researchers prefer the higher sampling frequencies?
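Briefly: a sampling rate of fs can only represent frequencies up to fs/2 (the Nyquist limit), so 8000 Hz keeps roughly the telephone band (up to ~4 kHz), 16000 Hz covers most of the energy of speech including fricatives, and 44100 Hz covers the full audible range; higher rates give more spectral detail at the cost of larger files. Converting between rates is straightforward; a minimal sketch:

```python
# Minimal sketch: downsample a 44.1 kHz recording to 16 kHz
# (resample_poly applies the necessary anti-aliasing filter internally).
from scipy.io import wavfile
from scipy.signal import resample_poly

sr, x = wavfile.read("recording_44100.wav")
assert sr == 44100
y = resample_poly(x, up=160, down=441)   # 44100 * 160 / 441 = 16000
wavfile.write("recording_16000.wav", 16000, y.astype(x.dtype))
```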
Dear sir/madam,
I have segregated combined speech sources using a neural-network-based classifier in a speech segregation process. My doubt is whether we should use the outputs of the ideal binary mask for the estimation of the signal-to-noise ratio. Please guide me in doing the estimation.
Thank you in advance
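For what it is worth, the SNR itself is usually computed against an explicitly chosen target: either the original clean source or the IBM-resynthesized source (the latter gives the ideal-mask upper bound), and both conventions appear in the segregation literature, so it is worth stating which one you use. A minimal sketch of the computation:

```python
# Minimal sketch: SNR (dB) of a separated signal against a chosen target signal.
import numpy as np

def snr_db(target: np.ndarray, estimate: np.ndarray) -> float:
    noise = target - estimate
    return float(10 * np.log10(np.sum(target ** 2) / np.sum(noise ** 2)))
```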
I am new to this tool. I got an in-house project to build a desktop application for translation (English - Hindi). I am having problems with POS tagging and parsing of Hindi sentences.
Can anyone help me out with this?
Hindi POS tagging shows 'UNK' for Hindi fonts.
I also need to represent words in Devanagari as phonemes or consonant-vowel patterns.
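For the consonant-vowel pattern, a rough first pass can be done from Unicode code-point ranges alone (a minimal sketch; conjuncts, the inherent schwa, and nasalisation signs need proper language-specific handling). Reading and writing everything as UTF-8 also avoids many 'UNK' problems caused by font/encoding mismatches.

```python
# Minimal sketch: map a Devanagari word to an approximate consonant/vowel pattern.
VOWELS = {chr(c) for c in range(0x0905, 0x0915)}      # independent vowels
MATRAS = {chr(c) for c in range(0x093E, 0x094D)}      # dependent vowel signs
CONSONANTS = {chr(c) for c in range(0x0915, 0x093A)}  # consonants ka..ha
VIRAMA = "\u094D"

def cv_pattern(word: str) -> str:
    out = []
    for ch in word:
        if ch in CONSONANTS:
            out.append("C")
        elif ch in VOWELS or ch in MATRAS:
            out.append("V")
        elif ch == VIRAMA:
            pass  # virama suppresses the inherent vowel; ignored in this rough pass
    return "".join(out)

print(cv_pattern("किताब"))  # prints "CVCVC"
```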
I am looking for literature discussing if some types of phonemes are more or less likely to undergo sound changes.
It seems intuitively the case that some sounds like /m/, /n/ and /a/ are less likely to change during the process of language change than sounds with more complex or "marked" articulation.
Listening skill has often been called the Cinderella skill of language teaching because it involves a number of variables that are too difficult to be operationalized within the allotted class time. A comprehensive model of L2 listening comprehension cannot be developed without a full account of the parameters dominating the process.
Does anyone know about published research (or other available resources) on scoring issues on sign language production tests? For example, development of scoring instruments, type of scales being used, inter-/intra-rater reliability, procedures to solve disagreement between raters, construct representation etc.
What features are useful for estimating a person's age from their voice?
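Commonly used cues include fundamental frequency, formants, MFCCs, jitter/shimmer, and speaking rate. A minimal sketch (hypothetical file names and age labels) using only MFCC statistics and a random-forest regressor, just to show the pipeline shape:

```python
# Minimal sketch: MFCC statistics as features for age regression from voice.
import librosa
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def features(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

paths, ages = ["spk1.wav", "spk2.wav"], [34, 67]   # placeholder labelled data
X = np.vstack([features(p) for p in paths])
model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, ages)
print(model.predict(X[:1]))
```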
We are looking for a French text in which speech sounds are selected so as to obtain a fixed proportion of voiced and unvoiced sounds (or more degrees of sonority). This text would be used in a contrastive multilingual experiment on vocal load.
In addition, we are interested in phonetically balanced corpora for French.
Thank you!
Hi there,
I would like to ask how you compare a speech sample and a different kind of auditory sample (e.g., noise, sounds produced by animals...) when you are looking for similarities and differences between the two samples.
For instance, there are times when people believe they are hearing words when listening to a noise, or the wind. If a participant reported having heard "mother" when he/she actually listened to a noise, how would you carry out the comparison between the two different sounds? Is there any way to do that?
Ideas and references are welcome!
Thanks!
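One way to make the comparison concrete on the acoustic (not perceptual) level is to compare the MFCC sequences of the two samples with dynamic time warping; a lower length-normalised cost means more similar spectral trajectories. A minimal sketch with hypothetical file names:

```python
# Minimal sketch: acoustic similarity between two audio samples via MFCC + DTW.
import librosa

def dtw_distance(path_a: str, path_b: str) -> float:
    ya, sra = librosa.load(path_a, sr=16000)
    yb, srb = librosa.load(path_b, sr=16000)
    A = librosa.feature.mfcc(y=ya, sr=sra, n_mfcc=13)
    B = librosa.feature.mfcc(y=yb, sr=srb, n_mfcc=13)
    D, _ = librosa.sequence.dtw(X=A, Y=B, metric="euclidean")
    return float(D[-1, -1] / (A.shape[1] + B.shape[1]))  # length-normalised cost

print(dtw_distance("word_mother.wav", "wind_noise.wav"))
```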
To set up a speaker recognition system using the NIST 2004 dataset, I found the speaker indices of the test files "x???.sph" at: http://www.itl.nist.gov/iad/mig/tests/spk/2006/
To train the total variability matrix I need the speaker indices of the training data "t???.sph". Where can I find them?
Please help me.
Thanks in advance
I would like to analyze vocal responses from a working-memory n-back task with two possible responses ("yes" vs. no response). The aim of the analysis is to get an automatically generated output file with two columns: (1) the subject's study code (1...n), or rather the file label, and (2) the vocal response (e.g. "yes" vs. no response, or 1 vs. 0).
I already tried Inquisit Lab 5's tool "Analyze recorded responses", but it did not work that well: after analyzing a few data sets, which were coded correctly, Inquisit was no longer able to distinguish between responses and non-responses.
Do you have experience with Inquisit Lab 5, or any other suggestions regarding speech recognition?
Thanks a lot!
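If Inquisit keeps failing, a simple fallback (a minimal sketch, with an arbitrary threshold that you would calibrate on a few hand-labelled files) is an energy-based response/no-response decision over each recording, written straight to a two-column CSV:

```python
# Minimal sketch: classify each recorded .wav as response (1) or no response (0).
import csv
import glob
import numpy as np
import soundfile as sf

THRESHOLD = 0.01  # RMS threshold, to be calibrated on hand-labelled recordings

with open("responses.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["file", "response"])
    for path in sorted(glob.glob("recordings/*.wav")):
        audio, sr = sf.read(path)
        if audio.ndim > 1:
            audio = audio.mean(axis=1)   # mix to mono
        rms = float(np.sqrt(np.mean(audio ** 2)))
        writer.writerow([path, 1 if rms > THRESHOLD else 0])
```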
What would be the effect of speech utterance length on speaker recognition? I.e.,
if T, UBM, LDA, and PLDA are trained on short utterances, i.e. from 3 to 15 seconds, but
enrollment of the speakers (the modeled speakers) is done on long utterances, such as 30 to 60 seconds, would that affect the performance of the system?
Is there any influence of a mismatch between the language used for training the system hyperparameters (TVS, LDA, and PLDA hyperparameters) and the system users' language on the performance of the speaker recognition system?
Thanks in advance .
I recently started to work on speaker/language recognition using i-vectors, and after consulting with researchers on ResearchGate, I came up with the following steps:
1) Database
i) Development dataset (for UBM and T training); if labeled, also for LDA and PLDA
ii) Training dataset (for speaker/language enrollment, the modeled speakers); if the development dataset is not labeled, I train LDA and PLDA on the training dataset (comments on this are welcome)
iii) Testing dataset (for testing the modeled speakers/languages)
About Language Detection:
If I have a lot of speech samples but no labels for the utterances, how can I train LDA/PLDA for languages? Or can I train them on the training languages' data?
What about gender? How much will the results be affected if we have different/same UBM and T? Is it OK to have a single UBM and T for both genders?
Is there any way to apply i-vector detection without applying LDA and PLDA, for example an SVM on i-vectors without dimensionality reduction?
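On the last point: yes, i-vectors are often scored without LDA/PLDA, either with length-normalised cosine scoring for verification or with an SVM trained directly on the raw i-vectors for closed-set identification. A minimal sketch with placeholder vectors:

```python
# Minimal sketch: cosine scoring and an SVM on raw i-vectors (no LDA/PLDA).
import numpy as np
from sklearn.svm import SVC

def cosine_score(w_enrol: np.ndarray, w_test: np.ndarray) -> float:
    a = w_enrol / np.linalg.norm(w_enrol)
    b = w_test / np.linalg.norm(w_test)
    return float(a @ b)

# Closed-set identification: X is (n_utterances x ivector_dim), y the class labels.
X_train = np.random.randn(100, 400)           # placeholder i-vectors
y_train = np.random.randint(0, 5, size=100)   # placeholder speaker/language labels
clf = SVC(kernel="linear").fit(X_train, y_train)
print(clf.predict(np.random.randn(3, 400)))
```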
Loss is inevitable during translation, but which level of language is more liable to loss (morphological, syntactic, or semantic)?
At the morphological level: what type/category of words?
At the syntactic level: what sentence pattern/structure?
At the semantic level: what type of meaning/domain?
I have samples of the phonemes of English. I want the best method of concatenative synthesis and also the best way to resolve the glitches observed when concatenating small units of speech (phonemes).
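A common first remedy for the join glitch (a minimal sketch, not a full unit-selection solution): overlap the units slightly and cross-fade them with a raised-cosine ramp, which removes the waveform discontinuity at the join; real systems additionally smooth F0 and the spectral envelope across joins.

```python
# Minimal sketch: concatenate speech units with a short raised-cosine cross-fade.
import numpy as np

def crossfade_concat(units, sr=16000, fade_ms=10):
    fade = int(sr * fade_ms / 1000)
    ramp = 0.5 * (1 - np.cos(np.linspace(0, np.pi, fade)))  # 0 -> 1 raised-cosine ramp
    out = units[0].astype(np.float64)
    for unit in units[1:]:
        unit = unit.astype(np.float64)
        overlap = out[-fade:] * (1 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out[:-fade], overlap, unit[fade:]])
    return out  # assumes every unit is longer than the fade length
```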
I have been trying to acquire EMG signals for subvocal feature extraction, but the signal contains a lot of noise, which makes it impossible to work on the subsequent steps.
Thanks in advance
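As a starting point (a minimal sketch with typical surface-EMG settings that may need adjusting for your electrodes and mains frequency; electrode placement and grounding usually matter more than the filtering):

```python
# Minimal sketch: band-pass (20-450 Hz) plus mains notch filtering for raw EMG.
# Requires a sampling rate comfortably above 900 Hz; use 60 Hz where that is the mains frequency.
from scipy.signal import butter, filtfilt, iirnotch

def clean_emg(x, fs):
    b, a = butter(4, [20, 450], btype="bandpass", fs=fs)
    x = filtfilt(b, a, x)
    b, a = iirnotch(w0=50, Q=30, fs=fs)
    return filtfilt(b, a, x)
```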
I have used the P2FA forced alignment system to annotate the .wav files. However, the results of the phonemic annotation are not good. Is there any open-source forced alignment software available for American English? If the software uses the CMU dictionary, that would be good for me.
Especially from the point of view of textual competence
Especially theory of constructivism
A lot of theories and studies seem to deal with how input is processed for meaning and form, but I am interested in looking at how the learner then takes this processed input and constructs some sort of representation of the L2 for later use. Does anyone know which, or any, SLA theories that specifically deal with this process?
I'm looking for free speech recognition (ASR) for the Arabic language. Can you help me find it?
I have looked at Steen G. (1999) and other works that have cited him; still, I find the notation a little difficult to understand and apply.
I'm interested in differences in the use of /r/-liaison between native speakers of non-rhotic English (e.g. RP) and the use of that phenomenon by EFL learners.
Hi, I have some papers on speaker recognition. Does anybody have an interesting subject to suggest?
I am facing some difficulties in defining indicators of whether someone has good stress, rhythm, and intonation while speaking and reading.
Brain functions in early child language acquisition.
Hi, I'm a PhD student interested in brain-computer interfaces and speech imagery, in particular vowel and syllable imagery. I'm at the beginning of my study, and my specific field is EEG signals related to spoken and unspoken (imagined) speech. I need scientific articles about this issue and about neural language processing. Can someone help me? Thanks.
I need to get in touch with someone who has already worked with Kaldi ASR for speech recognition.
I think not always. For example, in "to turn over a new leaf" (of life), the word leaf is used not in its first meaning (a leaf of a tree) but maybe as a blank sheet of paper? Then it's a metaphor.
I am working on language identification through i-vectors. But I am wondering: if speakers mix two languages, for example speaking Hindi but sometimes preferring English words, that type of data causes problems for the model built for the corresponding language and also decreases the performance of the model.
I am writing chapter 3 of my proposal, and I need an instrument to measure language development for low-functioning autistic children. I would appreciate it if any of you would allow me to use an instrument that you already have.
I am doing a quasi-experimental study using a small population of 5 autistic students (3 to 5 years old). My strategy includes photographs of each child's natural environment, which will allow me to initiate conversation with each one. I use each child's IEP as a pretest and will use the measure that I am looking for to verify progress in the post-test towards the end of the training.
I'd like to find a source for the population so the Ethnologue can cite it.
I didn't find any datasets of natural-language questions and their corresponding SQL statements, so I was thinking of creating one for myself and other researchers to work on. What is the best way to do that: collect the pairs from people, do manual reviews of the match between the NL question and the SQL, and run an automatic review of whether the SQL statements actually work?
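For the automatic part, one cheap check (a minimal sketch, assuming a SQLite copy of the target schema in a hypothetical schema_only.db) is to verify that every collected SQL statement at least compiles against the schema; whether it actually answers the NL question still needs human review.

```python
# Minimal sketch: validate collected SQL statements against a SQLite schema.
import sqlite3

conn = sqlite3.connect("schema_only.db")  # database containing the target schema

def sql_is_valid(sql: str) -> bool:
    try:
        conn.execute("EXPLAIN " + sql)  # compiles the query without running it
        return True
    except sqlite3.Error:
        return False

pairs = [("How many employees are there?", "SELECT COUNT(*) FROM employees")]
for question, sql in pairs:
    print(sql_is_valid(sql), question)
```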
Hi everybody,
I'm currently working on handwritten Arabic word recognition. I have built a feature matrix for each image of size 34x10, where there are 3 types of features (8 concavity, 11 distribution, and 15 gradient, for a total of 34), and the 10 is the length of the image divided by the sliding-window size, which is 3.
What I'm asking is: how can I input these feature vectors to HTK to start training HMM models with a given topology but unknown parameters? If there is an example, that would be great. I have read through the HTK guide, but I couldn't understand most of it, since it talks about speech recognition.
Please guide me.
Thank you.
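If it helps, HTK can read externally computed features if you write each observation sequence as an HTK parameter file with the USER parameter kind (code 9); HERest/HVite then treat it like any other feature stream. A minimal sketch (big-endian data as HTK expects; the 10 ms sample period is an arbitrary placeholder, since your frames are sliding-window positions rather than time frames):

```python
# Minimal sketch: write a (frames x dim) feature matrix as an HTK USER parameter file.
import struct
import numpy as np

def write_htk_user(path: str, frames: np.ndarray, samp_period: int = 100000) -> None:
    frames = np.asarray(frames, dtype=">f4")   # big-endian float32, shape (n_frames, dim)
    n_samples, dim = frames.shape              # e.g. 10 frames x 34 features
    samp_size = dim * 4                        # bytes per frame
    parm_kind = 9                              # USER parameter kind
    with open(path, "wb") as f:
        f.write(struct.pack(">iihh", n_samples, samp_period, samp_size, parm_kind))
        f.write(frames.tobytes())

features = np.random.randn(10, 34)  # placeholder: your 34x10 matrix transposed to frames x dim
write_htk_user("word001.usr", features)
```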