• Iman Esmaili added an answer:
    Does using speech samples with SNR < 0 make my recognition less accurate?

    I am doing isolated word recognition based on MFCCs. Some of my samples turned out to have SNR < 0; should I use them or simply delete them?

    Iman Esmaili · Shahed University


    Of course, using low-SNR data degrades your recognition accuracy. But whether or not to use the low-SNR data depends on your recognition plan: there is clean-speech recognition and there is noisy-speech recognition. If you have no restrictions on the environment, just use the clean speech; but if your system must work in different conditions, you must use all of your data and find some way to deal with the noise.

    For example, spectral subtraction is a simple and efficient way to deal with white noise.
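
    A minimal sketch of magnitude spectral subtraction, to make the idea concrete. It assumes the first few frames of the recording contain noise only; the function name, the frame sizes and the spectral floor are illustrative choices, not values from this thread.

```python
import numpy as np

def spectral_subtraction(noisy, noise_frames=5, frame_len=256, hop=128, floor=0.01):
    """Magnitude spectral subtraction with overlap-add (illustrative sketch)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(noisy) - frame_len) // hop
    # Short-time spectra of the windowed frames.
    frames = np.stack([noisy[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])
    spectra = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spectra), np.angle(spectra)
    # Assumption: the first `noise_frames` frames contain noise only.
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Subtract the noise estimate, keeping a small floor to avoid negative magnitudes.
    clean_mag = np.maximum(mag - noise_mag, floor * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add resynthesis using the noisy phase.
    out = np.zeros(len(noisy))
    for i in range(n_frames):
        out[i * hop:i * hop + frame_len] += clean[i]
    return out
```

    Reusing the noisy phase and a fixed floor keeps the sketch short; residual isolated spectral peaks (the "musical noise" artifact) are a known side effect of this kind of processing.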

  • Chilin Shih added an answer:
    How can I estimate a person's vocal tract length, using a recorded audio file?
    I'm performing some experiments that require a vocal tract length change, but I need to know the original one.
    I'm aware of the formula L = c / (4F), where "c" is the speed of sound (34029 cm/s) and "F" is the first formant frequency. I'm also aware that I should use vowels as close as possible to an unconstricted vocal tract.
    However, I made a few experiments with the software program Praat and I got rather different and difficult to interpret results. In a single vowel, I get a large range of frequencies (1st formant ones), so I thought I should focus on the average? Is that correct? Moreover, among different vowels I get very different results. Is that normal?

    Thanks in advance!
    Chilin Shih · University of Illinois, Urbana-Champaign

    Alternatively, we got reasonable measurements using two microphones, one placed at the mouth and one at the throat outside the glottis, and estimated the distance from the time it takes for the acoustic wave to travel from the glottis to the opening of the mouth. The technique is described in "A quasi-glottogram signal", JASA 2003.
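
    Both estimates discussed in this thread reduce to one-line formulas, sketched below. The speed of sound is the value quoted in the question. The generalisation to higher formants, F_n = (2n - 1) c / (4L) for a uniform tube closed at one end, and the example numbers (500 Hz, 0.5 ms) are assumptions added for illustration, not measurements from the thread.

```python
SPEED_OF_SOUND_CM_S = 34029.0  # value quoted in the question

def vtl_from_formant(formant_hz, n=1):
    """Length (cm) of a uniform tube whose n-th resonance is formant_hz.

    Quarter-wavelength model: F_n = (2n - 1) * c / (4 * L).
    """
    return (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * formant_hz)

def vtl_from_delay(delay_s):
    """Length (cm) from the glottis-to-lips travel time of the acoustic wave."""
    return SPEED_OF_SOUND_CM_S * delay_s

# Hypothetical example values for a neutral (schwa-like) vowel.
print(vtl_from_formant(500.0))       # first formant at 500 Hz
print(vtl_from_formant(1500.0, n=2)) # second formant at 1500 Hz
print(vtl_from_delay(0.0005))        # 0.5 ms travel time
```

    The tube model only holds for a roughly unconstricted tract, which is one reason different vowels give very different lengths when the formula is applied to F1 alone.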


  • Mihai Prunescu added an answer:
    Can somebody give me some examples of phonemic variations in a language and the probable reasons for such variations?

    In Ghana, I have observed that the phoneme /j/ is realized as /dz/, /y/ or /ʒ/ in the speech of individuals. I have also noticed that the difference in the realizations depends on either the absence or the presence of the target phoneme in the learners’ speech (i.e. transfer errors). Where it is present but the realizations are not the same, the learner tries to articulate the phoneme as a phoneme he/she already knows. Where the target phoneme does not exist in the languages the learner already knows, he or she tries to substitute another phoneme that exists in his or her linguistic repertoire. Can someone share with me some examples of phonemic variation that they have noticed in their students’ speech? Are the reasons for the variations different from what I have stated?

    Mihai Prunescu · Institute of Mathematics of the Romanian Academy

    Some such variations are determined by the speakers' region. The Romanian word "pe" (on) is pronounced "pă" around Bucharest. There are thousands of such region-dependent pronunciations in many languages. In German, the past participle prefix "ge" is pronounced like "ie" (written "je") in the Berlin area. The expression "Hast du jedient?" ("Have you served in the army?") instead of "Hast du gedient?" is a classic.

  • Peggy Katelhoen added an answer:
    How are verbs of communication used to introduce Direct Speech in different languages?

    I am exploring the use of verbs of communication (verba dicenda) across languages and genres. My main aim is to see whether typological differences across languages (as described by Talmy) are maintained in the domain of communication. I am particularly concerned with how different languages use VoCs to introduce (and reconstruct) Direct Speech in written narratives and the rhetorical implications of this use, but I am also interested in their use in oral contexts. Any research dealing with this will be much appreciated.

    Peggy Katelhoen · Università degli Studi di Torino

    There is an older publication (German and Spanish): Hernández, Eduardo Jorge (1993):
    Verba dicendi. Kontrastive Untersuchungen Deutsch-Spanisch. Series: Hispano-Americana. Frankfurt/M., Berlin, Bern, New York, Paris, Wien, 
    and my own book: Katelhön, Peggy (2005): Das fremde Wort im Gespräch: Rededarstellung und Redewiedergabe in italienischen und deutschen Gesprächen, Berlin: Weidler Verlag (discourse representation in spoken Italian and German).

    For German I found that there are verbs like "kommen" (come) that can introduce DS with an implicit negative evaluation. Also very interesting are forms without a verb, such as "e lui/e lei" and "ich so, er so", or with the verb "fare" (make) in Italian. You can find examples and a bibliography in the book.

    With best regards, PK

  • Johan Sundberg added an answer:
    What is the typical lung pressure for normal human phonation/speech?

    I need the value of lung pressure to set up the boundary condition for the inflow for a 2D vocal fold simulation for a normal phonation condition.

    Johan Sundberg · KTH Royal Institute of Technology

    Phonation threshold pressure is generally between 1 and 3 cm H2O, and it increases with fundamental frequency; see Titze in J Acoust Soc Amer. For conversational speech it is typically about 5 cm H2O, and for loud speech it can be 10 to 15 cm H2O. It is generally measured noninvasively as the oral pressure during /p/-occlusion.
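
    For a simulation inflow boundary these values usually have to be expressed in pascals. A small sketch, assuming the conventional conversion 1 cm H2O = 98.0665 Pa; the dictionary labels are illustrative names only.

```python
PA_PER_CMH2O = 98.0665  # conventional centimetre of water, in pascals

def cmh2o_to_pa(pressure_cmh2o):
    """Convert a pressure from cm H2O to pascals."""
    return pressure_cmh2o * PA_PER_CMH2O

# Values quoted in the answer above, in cm H2O.
quoted_cmh2o = {
    "phonation threshold (low)": 1.0,
    "phonation threshold (high)": 3.0,
    "conversational speech": 5.0,
    "loud speech (low)": 10.0,
    "loud speech (high)": 15.0,
}

for label, value in quoted_cmh2o.items():
    print(f"{label}: {value:g} cm H2O = {cmh2o_to_pa(value):.1f} Pa")
```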

  • Mirco Ravanelli added an answer:
    What is the best available toolbox for implementation of Deep Neural Networks (DNN)?

    There are plenty of toolboxes offering functions for this specific task, so it would be great if we could all contribute and conclude about the best available DNN toolbox to this date (mainly for speech applications). 

    It will be great if we can give the pros and cons of using any toolbox and at the end we will conclude from the top voted answers. 

    Mirco Ravanelli · Fondazione Bruno Kessler

    I think Theano is a very powerful and flexible toolbox for deep neural networks.
    Personally, I think that for speech recognition purposes, another toolkit you should consider is Kaldi.
    Kaldi is a complete speech recognition package, which also embeds a deep neural network toolkit.
    Pros: very easy to use
    Cons: not as flexible as Theano

    Mirco Ravanelli


  • Eddy B. Brixen added an answer:
    Acoustic analysis of speech - recommendations for lapel mics?

    I am planning a series of 'field recordings' of speech, with one individual speaker per recording, to be conducted in 'a quiet indoor space'. Planned analyses include formant tracking (F0-F3).

    Researchers in the field of acoustic/phonetic analysis of speech: What lapel mics do you use? What are your experiences with different models? Do you have recommendations for particular models currently available? Would prefer an economical solution (for multi-site testing), but open to suggestions.


    Suzy Styles

    Eddy B. Brixen · EBB-consult

    The distance and the axis are important. We have seen discussions on different LTASS profiles in different languages - and the discussion was really about the placement of microphones. A perfect microphone is the DPA 4060. Remove the grid and you have a mic with a flat frequency response, low noise, and low distortion.

    But placing the microphone on the chest makes you lose approximately 10 dB at 3-4 kHz! (Check this paper: Brixen, E.B.: Spectral degradation of speech captured by miniature microphones mounted on persons' heads and chests. AES Convention no. 100, Copenhagen, Denmark. Preprint 4284.)

    A headset microphone that works is the DPA d:fine 66. The level is 10 dB higher "at the edge of your smile" compared to the chest-mounted mic, and this gives you 10 dB less background noise.

    Good luck


  • T. Nagarajan added an answer:
    Why is it necessary to have a restriction of minimum-phase signal to use modified group delay?

    The group delay function can be effectively used for various speech processing tasks only when the signal under consideration is a minimum phase signal.

    T. Nagarajan · Sri Sivasubramaniya Nadar College of Engineering

    Yes, it is compulsory. The group delay function is, to a certain extent, similar to the magnitude spectrum of the signal. The spikes that appear in it are due to the wrapped phase and are not real, so they have to be avoided.
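
    As a point of reference, the plain group delay can be computed without any phase unwrapping through the standard identity tau(w) = Re{Y(w) / X(w)}, where Y is the transform of n·x[n]. The sketch below shows only that identity; it is not the modified group delay function itself, and the function name and the small denominator guard are illustrative assumptions.

```python
import numpy as np

def group_delay(x, n_fft=512):
    """Group delay (in samples) of x, computed without phase unwrapping."""
    x = np.asarray(x, dtype=float)
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)       # spectrum of x[n]
    Y = np.fft.rfft(n * x, n_fft)   # spectrum of n * x[n]
    power = np.abs(X) ** 2
    # Real part of Y / X; the guard avoids division by zero at spectral nulls.
    return (X.real * Y.real + X.imag * Y.imag) / np.maximum(power, 1e-12)

# A pure delay of 3 samples has a flat group delay of 3.
impulse = np.zeros(8)
impulse[3] = 1.0
print(group_delay(impulse, n_fft=64))
```

    The division by |X|² is where the trouble starts: wherever the spectrum has a null (zeros on or near the unit circle), the denominator becomes tiny and the result spikes.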

  • Is there a corpus with whistled speech tokens from Silbo Gomero?

    I am looking to do research with a learning experiment that requires whistling tokens. Specifically, my past research has focused on Silbo Gomero, so I am now in need of access to sound bites from that language. If there are Spanish translations to accompany the whistled sound bite, that would be ideal! Thank you!

    Gregorio Rodríguez Herrera · Universidad de Las Palmas de Gran Canaria

    Hello Pat!

    I think these works of M. Trapero may be of interest

  • Michael Clarke added an answer:
    What are the main differences between children and adult speech?

    I know that this question is too general, but I want to get opinions on the possible ways to split these differences into several groups, eg. “Acoustic and linguistic differences”.

    Thank you very much in advance.


    Michael Clarke · Cancer Treatment Centers of America

    I am sure I am not telling you anything you do not already know. Speech awareness and production change as the child's aerodigestive tract and articulatory structures grow neurologically and physically. The more complex articulatory motions are the last to develop. That is why so many SLPs in the school system work on remediating the pronunciation of /s/, /r/ and /l/, and why the general population still does not accurately pronounce /z/ at the end of words. Children in kindergarten and first grade in the Midwestern United States are not expected to have fully mastered /r/ production. Competence develops in broadly predictable patterns (/m n ng p f h w/, /b d g k r/, /t th L v/, /sh ch dg/). In addition, childhood illnesses compromise pronunciation. We have a lifestyle that aggravates the sinus mucosa, so velar valving for nasal/non-nasal production frequently differs between the young and the old (the older population has less of a problem with this). The same problems aggravate Eustachian tube function, so middle ear problems are common and hearing of low-frequency sounds is often poor in the younger population. Confusion of sounds is common.

    Gross linguistic factors have to do with the onset of various stages of linguistic development: referents (word approximations) based on frequency of usage; the contrast between nasals and stops, starting with highly visible labial sounds (e.g. mama, papa, baw/ball); gross differentiation of place/manner/voicing (gawgy/doggy); simplification of articulation (Is/Its);

    semantic (phrase) development; and

    syntactic development with semantic markers such as plural /s, z/, gerund/infinitive marking (-er) and verb modifiers (-ly). Early errors occur because of the complexity of the linguistic formulation the child is attempting or the communication load put upon them (e.g. my son's use of CRACKIE, confusing COOKIE and CRACKER; thinger/finger), as does early-onset dysfluency.

    I am sure there are early-education, preschool and school therapists who can amplify this explanation, if not give you better examples.

    Adults of various skill levels (which also vary with sinus problems) may have difficulty with polysyllabic coarticulation/sequencing and with maintaining voicing or developing enough intraoral pressure for voicing, and so simplify or revert to poorly learned patterns and phoneme sequences (Black dialect has formalized one of these in using AX for ASK). This also occurs as simplification or undershoot in the pronunciation of blends like [n/-nd], sibilants [s/-sts], or voiced sibilants [s/-z, -sh/-ch].

    The low-income population can have missing teeth or low-grade pain that disrupts oral feedback of pronunciation.

    Persons with GERD may have a loss of molars and restricted breathing from abdominal pain. Any recent change to the articulators will have an immediate, though usually temporary, effect on pronunciation. Just think of the last time you had novocaine at the dentist's office.

    The elderly have problems not so much from hearing loss (bone conduction for auditory feedback is often better than the acoustic signal in conductive loss for low-frequency sounds, and likely equivalent for high-frequency sounds: sibilants, fricatives, affricates). The more frequent problem is poorly fitting dentures. Articulatory accuracy suffers, especially for sibilants, which require a fine airstream to be broken against the teeth. Range of motion and rapid articulatory motion are also hampered by a restricted tongue, whose lateral edges are often used to hold the dentition in place. The least thought-of factor is age- or illness-related muscle weakness and wasting (sarcopenia), found most profoundly in the bedbound elderly. Muscle wasting can occur after 4 days of inactivity in bed, especially orally, with oral/throat soreness from cancer treatment while on alternate feeding.


  • Masoud Qanbari asked a question:
    How can I find source code for a 300 or 600 bps speech codec?

    I need help with a very low bit rate codec.

  • Speech to text software

    Dear all,
    I am looking for a free speech to text (STT) software for writing technical documents (BSc, MSc, PhD, etc). I've found a list here: http://en.wikipedia.org/wiki/List_of_speech_recognition_software

    Which STT software do you recommend and why?
    Thank you very much in advance.

    Kind regards,

    Fernando A. Marengo Rodriguez · Federal University of Santa Catarina

    Thank you! Then, which software package do you recommend Jan?

  • Tanja Golja added an answer:
    Do you know an expert in learning analytics?

    I would like to know who can give a great speech about learning analytics in the context of K-12 schools or higher education. I am preparing an international forum about learning analytics in Seoul, South Korea. Could you recommend an expert whom I can invite to the forum?

    Tanja Golja · University of Technology Sydney

    Professor Simon Buckingham Shum

    See http://newsroom.uts.edu.au/news/2014/08/smarter-data

  • Is there any way to detect anxiety or stress cues from language (e.g. a text corpus or a speech)?

    I'm talking about things like themes, but also co-occurrence counts etc. I can't seem to find any literature.

    Emilia Iglesias Fernández · University of Granada

    In an observational study of a simultaneous-interpreting corpus of original English speeches from the European Parliament, one Spanish interpreter's input displayed a high degree of anxiety. I measured clusters of speech rate, pitch range, pitch contour and disfluencies with Praat. The findings showed a high degree of coarticulation (morphemes and words uttered on top of each other), fewer pauses, a higher speech rate and higher intensity (Iglesias and Gaedeke 2012).

  • Xaver Koch added an answer:
    Can we measure the amount of stress required to produce speech?

    In particular, during voiced speech production? I am looking to understand the process of speech production in detail.

    Xaver Koch · Radboud University Nijmegen

    As a starting point:
    Effect of vocal effort on spectral properties of vowels by Jean-Sylvain Liénard & Maria-Gabriella Di Benedetto, JASA 1999

  • Claire Colombel added an answer:
    Who knows where databases for language recognition can be obtained?
    Also, I'm looking for the speech databases (or databases of speech features) of different languages: european, asian, etc.
    Claire Colombel · University of New Caledonia

    You may also try CHILDES


  • Juan-Manuel Lopez-Muñoz added an answer:
    When does a speech act begin?

    Physically and linguistically, a speech act begins when a speaker utters the words. But the planning and organization of the phonological, semantic, syntactic and pragmatic components of the utterance are always in process while speaking. Should they be seen as part of a speech act? And if so, should the linguistic planning be seen as the beginning of a speech act? Thanks

    Juan-Manuel Lopez-Muñoz · Universidad de Cádiz

    Usually a speech act precedes another; if not, a speech act is preceded by a gesture, especially a movement of approach towards someone. I think it is not a question of borders but of a continuum of communication.

  • Julien Lagarde added an answer:
    What about integrating Searle or Austin's view about speech acts?
    Julien Lagarde · Université de Montpellier 1

    Speech acts theory and "interpersonal synergy" have not been linked to the best of my (limited) knowledge.

  • Saeid Safavi added an answer:
    Is anyone aware of the existence of a speech corpus containing speech files from infants?


    Saeid Safavi · University of Birmingham

    Dear all

    Thank you very much for the comments.

  • Syaheerah Lebai Lutfi added an answer:
    Any research on adaptation to speaker for emotion recognition from speech?

    Is there any research on this topic?

    Speaker adaptation is widely known in the speech recognition field. I am wondering whether it has been applied to enhance performance of affective state recognition.

    Syaheerah Lebai Lutfi · University of Science Malaysia

    Yes, you can check out the work from Roberto Barra-Chicote or Jaime Lorenzo-Trueba.

  • Marwan M. Obeidat added an answer:
    What form of language is more reliable, valid, authoritative, and trustworthy to use: speech or writing?
    What are your language preferences in terms of authenticity: speech or writing?
    Marwan M. Obeidat · Hashemite University

    By interpreting their speech, dear Михаил

  • Jon Mason added an answer:
    What are the issues arising & important questions that need addressing in the deployment of Learning Analytics systems?

    In many implementations of Learning Analytics systems at universities the collection of learner-generated data is routine. But students are typically not advised that this is happening. In formal research contexts such data collection would constitute a breach of research ethics. This issue is one among many others as it becomes easier & easier to produce & collect data.
    What is the relationship between "free speech" and "open data"?

    Jon Mason · Charles Darwin University

    thanks Christine, that's helpful -- data privacy is certainly one of the big issues along with "informed consent"

  • Elaine Ostrach Chaika added an answer:
    Are there any neurolinguistic and psycholinguistic proofs on part of speech and syntactic position?

    I think part of speech has a close relation with syntactic position. But I don't have any proof on this issue, especially proofs from neurolinguistic and psycholinguistic study. Can anybody help me with this?

    Elaine Ostrach Chaika · Providence College

    Parts of speech and syntactic position depend totally on the language or even dialect you're examining.  In some languages, verbs are always at the end of sentences. In others, at the start. The same differences occur with other parts of speech.

  • Arthur Boothroyd added an answer:
    How can I manually segment speech data correctly? Which tool can be used for that purpose?

    Give the steps to segment the speech data and building acoustic models

    Arthur Boothroyd · San Diego State University

    It depends what you mean by "correctly". As I am sure you are aware, the relationship between the acoustic signal and the phoneme string into which it might be transcribed does not lend itself to the identification of clear temporal boundaries. So there is often no way to isolate an acoustic segment that represents all of the information contributing to identification of a given phoneme without including information contributing to identification of other phonemes. But if you simply want a tool to specify the beginning and end of an acoustic segment of interest, and/or extract it to a separate file, there are many - Praat, Sound Forge, and DADiSP among them.

  • Tanvina B Patel added an answer:
    Does anyone have information regarding the dis/advantages of the formant, HMM and unit selection synthesis approaches?

    Hi. I have searched for the advantages and disadvantages of the formant, HMM and unit selection approaches, but I have not been able to work them out completely, so please tell me. I am looking for details in the context of text-to-speech synthesis.

    For better speech accuracy, which approach is good?

    Tanvina B Patel · Dhirubhai Ambani Institute of Information and Communication Technology

    HTS voice is intelligible and if the F0 values of synthesized files are close to those of the original speaker then it is fine.

    Also the footprint is small.

  • Ans Schapendonk added an answer:
    Does the linear prediction mechanism help to track changes in the speech articulatory system?

    In particular, considering only voiced regions containing more than one phoneme, is it possible to identify the boundaries between these phonemes with the help of the pole plot obtained from LP or time-varying LP coefficients?

    Ans Schapendonk · Philipps-Universität Marburg

    No, prediction is not linear, but moves like a monad which is at the same time a pipeline. This model has a mathematical formula (theory of everything TOE > TOET > face) which is based on the model of Goropius Becanus P, T, K ánd W (Grimm is missing the W). The soundhelix of 'gestural' has a helix (i.c. soundhelix) that helixes in 'gestorven' (died) from 'starf' > 'erf' > 'orbs'. Look at 'Goropius Becanus' soundhelix laat God hangen'. Becanus was the son of a BAK (midwife) like Erasmus. You can find their theory of the BAK SEVEN (Sederschaal) in the stars (Hatschepnut): watersnake HYDRA (hydraulic).

  • Jugurta Montalvao added an answer:
    What is the best approach for pitch estimation (of speech signal) under (convolutive) room reverberation effect?

    For sure, there is a (well-known) theoretical background for pitch estimation, including many interesting academic papers with comparative studies of methods. On the other hand, one knows that reverberant room effects can be handled through signal pre-whitening methods. Nonetheless, my question is addressed to those who, like myself, feel frustrated by the almost erratic performance of pitch estimators on naturally spoken sentences (i.e. normal rhythm) in small reverberant rooms, even after signal pre-whitening. Thus, I would like to know if someone has successfully experimented with new pragmatic approaches, possibly unconventional ones.

    Jugurta Montalvao · Universidade Federal de Sergipe

    Dear Yi Xu, I appreciate a lot your taking some time to analyse my sound sample. Now, let me explain why I supposed that room reverberation was the main problem I'm fighting against. First, I implemented and used a very basic autocorrelation-based method, whose result is shown in the upper figure (appended pdf file). Although I can feel (with my ears) a lot of F0 signal (almost all the time) around 100 Hz, what the computer shows me is a lot of estimated F0 around 200 Hz. For sure, it could be the well-known doubling effect (first hypothesis). But the weird thing is that I can also feel (with my ears) something like a room reverberation mode being excited around 200 Hz (which may be compatible with some small room's dimension), which leads me to the second hypothesis.

    If it is just a room reverberation effect, I expect to reduce it through spectral subtraction (since it is expected to vary much less than the excitation spectrum). Therefore, I just subtracted the average correlation profile (thanks to the linearity property of the inverse Fourier transform of the average spectrum) and re-estimated the F0 with the subtracted window correlation profiles. The result is shown in the second figure, where a substantial improvement can be noticed, thus corroborating the room reverberation hypothesis.

    Unfortunately, it depends on a post-processing procedure (for average profile estimation) and, in spite of the noticed improvement, it does not yet yield all the F0 contour that I can detect with my ear (which is my ultimate goal).

    Obs.: Please notice that the strong room effect is also in agreement with your remark: 'The likely reason is that the microphone was too far away from the speaker.'. 
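
    The procedure described above can be sketched roughly as follows: a basic autocorrelation F0 tracker with optional subtraction of the utterance-average correlation profile. This is an illustrative reconstruction, not the code actually used; the function names, frame sizes and search range (60 to 400 Hz) are assumptions.

```python
import numpy as np

def frame_autocorr(frame):
    """Normalised autocorrelation of one frame, computed via the power spectrum."""
    frame = frame - frame.mean()
    spec = np.fft.rfft(frame, 2 * len(frame))  # zero-pad to avoid circular wrap
    ac = np.fft.irfft(np.abs(spec) ** 2)[:len(frame)]
    return ac / (ac[0] + 1e-12)

def f0_track(signal, fs, frame_len=1024, hop=256,
             fmin=60.0, fmax=400.0, subtract_mean=True):
    """Per-frame F0 (Hz) from the autocorrelation peak in [fmin, fmax]."""
    lo, hi = int(fs / fmax), int(fs / fmin)    # lag search range in samples
    n_frames = 1 + (len(signal) - frame_len) // hop
    acs = np.stack([frame_autocorr(signal[i * hop:i * hop + frame_len])
                    for i in range(n_frames)])
    if subtract_mean:
        # Remove the average profile, assumed to be dominated by slowly
        # varying room effects rather than by the excitation.
        acs = acs - acs.mean(axis=0)
    lags = lo + np.argmax(acs[:, lo:hi + 1], axis=1)
    return fs / lags
```

    One caveat of this design: if F0 barely changes over the utterance, the average profile also contains the true pitch peak, so subtracting it removes exactly what is being sought. It also needs the whole utterance before any frame can be corrected, which is the post-processing dependency mentioned above.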

  • Nicanor García added an answer:
    Is the "musical noise" generated by some Speech Enhancement algorithms uniformly distributed across the spectrum?

    I am trying to assess the degree of degradation that "musical noise" causes in the low frequency bands of the spectrum of speech signals. Perceptually (playing back the treated signal) this artifact is stronger in mid and high frequencies (over 700 Hz), however I need an objective way to confirm or disprove this.

    Does anyone have information on this subject or knows a way to evaluate the amount of musical noise present in a signal?

    Thank you very much.

    Nicanor García · University of Antioquia

    Thank you very much for your answers and sorry for the tardiness of my response.

    This question originated from some experiments I'm performing on contaminated voice signals, in which I'm using subspace Speech Enhancement techniques in the Wavelet domain, more precisely on the Wavelet Packet decomposition coefficients. After reconstructing the signal from the processed coefficients it had some musical noise.

    From your answers and further reading on several sources I can conclude that "musical noise" is related to frequency spectrum characteristics.

    I would like to know the relation between the Wavelet domain processing I applied and the occurrence of "musical noise".

  • Adrià Rofes added an answer:
    Does an audioVISUAL corpus of naturally occurring [non-elicited] speech errors exist? In which languages?

    Do you know if audiovisual corpora of speech errors in different languages exist? 

    Adrià Rofes · Università degli Studi di Trento

    Have a look at the Aphasia Bank (http://talkbank.org/AphasiaBank/).

    AphasiaBank is supported by NIH-NIDCD grant R01-DC008524 for 2007-2017. The immediate goal of AphasiaBank is construction of a shared database of multimedia interactions for the study of communication in aphasia. The ultimate goal of this work is the improvement of evidence-based therapy for aphasia.

    These are the languages I found in their browsable database:

About Speech

Communication through a system of conventional vocal symbols.
