    Are there any factors that affect pitch perception and pitch determination in speech?

    Is there any research on factors that affect pitch perception and pitch determination in speech? I'm trying to figure out whether F0 alone is sufficient for determining perceived pitch in speech.

    Chilin Shih · University of Illinois, Urbana-Champaign

     There is a lot of individual variation in pitch perception, even among listeners who speak the same language. Listeners' hearing and their ability to discriminate F0 (e.g. by measuring the just-noticeable F0 difference) should therefore be tested in order to evaluate experimental results.

  • Paolo Mairano added an answer:
    How can the energy contained in a speech signal be representative of the language in which it was spoken?

    I am doing my final-year project on "Classification of Tonal and Non-Tonal Languages" using neural networks. The system takes the pitch contour and energy as parameters. Using only the pitch contour as a parameter yields an accuracy of 66%, whereas adding short-term energy increases it to above 80%.

    Much of the standard literature also treats energy as a characteristic feature of a language, but provides no explanation.

    Paolo Mairano · The University of Warwick

    Hi Biplav,

    I know that there have been some studies claiming that languages belonging to different rhythm categories (syllable-timed, stress-timed, mora-timed, etc.) may differ in the way they use energy. I am not sure I am convinced by this, but here is the reference:

    Lee, C.S. & McAngus Todd, N. (2004). Towards an auditory account of speech rhythm: application of a model of the auditory ‘primal sketch’ to two multi-language corpora. Cognition, 93(3), 225-254.

    @Diwakar, I don't think tonal languages simply have 'more energy' in speech. If there is a difference (as Biplav's results suggest), it is probably a difference in how energy is used in the language rather than in how much energy is used, which may depend on too many factors, right? Possibly, as you mention, tonal languages have more constant energy peaks on vowels. But then again, I am not sure it is as simple or as general as that: some tonal languages have neutral tones, where vowels can be fairly reduced...
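    As a rough illustration of the short-term energy feature discussed in this thread, the sketch below computes a frame-wise log-energy contour and expresses it relative to the loudest frame, so that it reflects how energy is distributed over time rather than the absolute recording level (the distinction drawn in the answer above). The frame length and hop (400 and 160 samples, i.e. 25 ms and 10 ms at an assumed 16 kHz sampling rate) and the function name energy_contour_db are illustrative assumptions, not the setup used in Biplav's project.

```python
import math


def energy_contour_db(samples, frame_len=400, hop=160, eps=1e-10):
    """Short-term log energy per frame, in dB relative to the loudest frame.

    frame_len/hop default to 25 ms / 10 ms at an assumed 16 kHz sampling rate.
    Subtracting the maximum removes the overall recording level, so the
    contour describes how energy is distributed over time, not how much.
    """
    energies = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        # Sum of squared samples; eps avoids log10(0) on digital silence.
        energies.append(10.0 * math.log10(sum(x * x for x in frame) + eps))
    if not energies:
        return []
    peak = max(energies)
    return [e - peak for e in energies]


if __name__ == "__main__":
    quiet_then_loud = [0.1] * 400 + [1.0] * 400
    print(energy_contour_db(quiet_then_loud, frame_len=400, hop=400))
```

    Normalizing to the peak is only one design choice (per-utterance mean subtraction is another); either way, a classifier fed such contours cannot simply exploit differences in recording gain.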

  • Ali Ibrahim Aboloyoun added an answer:
    Is there a speech assessment for cleft palate children?

    What is the ideal age for speech assessment for the cleft children?

    What are the measures of speech assessment that can be done in day to day practice?

    How soon after cleft palate surgery should the speech assessment be done?

    Ali Ibrahim Aboloyoun · King Abdullah Medical City

    In our opinion, patients with cleft palate are team-work cases. Language and speech evaluation must be done as early as possible after surgical intervention, which is the main factor affecting speech output: if the surgery is done by a skilled surgeon and results in adequate palatal length and mobility, the speech output is expected to be very good, whereas with a short, immobile palate the patient will need hard work to improve their speech even slightly.

    Evaluation can be done subjectively, directly or indirectly, through listening to and examination of the patient by an experienced phoniatrician or SLP.

  • Hendrik Schade added an answer:
    Is there any paper that clearly states that our diction/register is a lot looser in speech than in writing?

    Dear all, I basically just need one citation on this (though more would be better) for a corpus analysis. I thought I would have an easy time finding one, but I did not, so I would appreciate any help.

    I hope it is okay to ask this kind of question. So far I have only used RG to publish and to follow researchers, this is my first time using Q&A.

    Hendrik Schade · Bielefeld University

    Thank you all for the help! It is amazing that there are possibilities like this because sometimes you just spend way too much time on insignificant things without any result and then feel like you have not done anything at all.

  • Robert Wahlstedt added an answer:
    How can I perform Cepstral analysis in CSL?

    I am trying to work out the procedure for cepstral analysis using the Computerized Speech Lab (CSL) from Kay Pentax. Are there any manuals or procedural guidelines? I am able to generate an FFT but unable to proceed further.


    Robert Wahlstedt · Whitworth University

    Hello, I think SLP graduate schools might have the best free material online; I found some tutorials at http://www.rit.edu/ntid/speechlang/slpros/instruction/segmental/segmental/tutorials/1

  • A.G. Ramakrishnan added an answer:
    What are the Spectral and Temporal Features in Speech signal?

    In speech signal processing, I keep coming across these two terms. What exactly are they?

    A.G. Ramakrishnan · Indian Institute of Science

    The most successful spectral features used in speech are (i) Mel-frequency cepstral coefficients (MFCC) and (ii) Perceptual Linear Prediction (PLP) features. It is well known that the basilar membrane in the inner ear analyzes the frequency content of the speech we hear, and this analysis can be modeled by a bank of constant-Q band-pass filters. There are also critical bands, which give rise to masking: one strong tone or burst can mask another, weaker tone within the same critical band. Both MFCC and PLP capture these characteristics of our auditory system in some way, so, strange as it may seem, the same features give reasonably good performance for speech recognition, speaker recognition, language identification and even accent identification! However, these spectral features are not very robust to noise.

    On the other hand, some time-domain (temporal) features, such as the plosion index and the maximum correlation coefficient, are relatively more robust to noise.
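    The auditory warping that MFCC relies on can be sketched in a few lines. The example below assumes the common HTK-style mel formula (other variants exist), places triangular filter edges equally spaced on the mel scale, and applies an unnormalized DCT-II to log filterbank energies to obtain cepstral coefficients. The function names are hypothetical, and a full MFCC front end would also need framing, windowing, an FFT and the triangular weighting itself, which are omitted here.

```python
import math


def hz_to_mel(f_hz):
    # HTK-style mel formula; other mel variants (e.g. Slaney's) differ slightly.
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filter_edges(n_filters, f_min_hz, f_max_hz):
    """n_filters + 2 frequencies (Hz): the edges/centres of triangular filters
    that are equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min_hz), hz_to_mel(f_max_hz)
    step = (hi - lo) / (n_filters + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_filters + 2)]


def dct_ii(log_energies, n_coeffs):
    """Unnormalised DCT-II of log filterbank energies -> cepstral coefficients."""
    n = len(log_energies)
    return [
        sum(e * math.cos(math.pi * k * (j + 0.5) / n) for j, e in enumerate(log_energies))
        for k in range(n_coeffs)
    ]


if __name__ == "__main__":
    edges = mel_filter_edges(20, 0.0, 8000.0)
    # Spacing is narrow at low frequencies and widens at high ones.
    print([round(f) for f in edges[:4]], [round(f) for f in edges[-3:]])
```

    Adding a constant to every log filterbank energy (e.g. a change of recording gain) only changes the 0th coefficient, because the higher DCT-II basis vectors sum to zero.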

  • John Crowley added an answer:
    Can you recommend readings on Bakhtin and genre theory?

    I am reading Bakhtin's "The Problem of Speech Genres" and his Philosophy of the Act in the hope of gaining a better understanding of his views on genre. Can anyone recommend additional readings, whether by Bakhtin or about his thought?

    John Crowley · Yale University

    The best books on Bakhtin I know are David Lodge's After Bakhtin, which is wide-ranging but fairly introductory and may not be useful to you, and The First Hundred Years of Mikhail Bakhtin by Caryl Emerson, which is much more in-depth and deals with philosophical as well as literary questions.

  • Jenya Iuzzini-Seigel added an answer:
    Is apraxia of speech (AOS) the same as language delay?

    Is apraxia of speech (AOS) the same as language delay? What are the most characteristic phonological patterns in people with apraxia?

    Is there a specific battery used to diagnose apraxia of speech? If there is not, what are its symptoms?

    Jenya Iuzzini-Seigel · Massachusetts General Hospital

    Greetings! Our new research on Childhood Apraxia of Speech (CAS) shows a very high rate of comorbid language impairment among children with CAS. In addition, children with CAS plus language impairment perform differently on speech perception tasks from those with CAS only (speech symptoms only). CAS on its own is considered a motor speech disorder; however, given the high rate of comorbid language impairment, it is essential that language is also well evaluated and treated appropriately.

  • Biswajit Satapathy added an answer:
    Can anyone recommend a platform for building (or know of an existing) listening span task that can be acoustically manipulated?

    I'm studying the effects of unfamiliar-accented speech on verbal working memory. I believe the task that best suits my experiment is a listening span task (LSPAN). I would like to know whether anyone has experience using these tasks and, in particular, whether anyone has manipulated the acoustic boundaries of the stimuli. If so, can you offer advice or models for creating such a task, or point me towards an existing one that I could use and adapt?

    Rather than relying on accented speakers to record the LSPAN stimuli, I'd like to control the exact acoustic features.

    Biswajit Satapathy · Ozonetel Systems Pvt. Ltd.

    Hi Lauren, apart from Praat, the following tools may help you:

    1. WaveSurfer: http://www.speech.kth.se/wavesurfer/

    2. Audacity: http://audacity.sourceforge.net/

    3. SoX and the Edinburgh Speech Tools

    For simulation you can use:

    1. Scilab

    2. Octave

    3. Matlab

    Matlab is not free, but the other two are free simulation toolkits.

    I hope these tools help you with your speech analysis.

  • Amaury Lendasse added an answer:
    What is the best available toolbox for implementation of Deep Neural Networks (DNN)?

    There are plenty of toolboxes offering functions for this specific task, so it would be great if we could all contribute and reach a conclusion about the best DNN toolbox available to date (mainly for speech applications).

    It would be great if we could give the pros and cons of each toolbox and, at the end, draw a conclusion from the top-voted answers.

    Amaury Lendasse · University of Iowa


  • Monika Połczyńska added an answer:
    Is there any neurolinguistic or psycholinguistic evidence on the relation between part of speech and syntactic position?

    I think part of speech is closely related to syntactic position, but I don't have any evidence on this issue, especially evidence from neurolinguistic and psycholinguistic studies. Can anybody help me with this?

    Monika Połczyńska · Adam Mickiewicz University

    Hi Chang,

    There have been a number of fMRI studies on parts of speech and syntax, including (but not limited to) canonical versus non-canonical word order. Here are a few articles that might be useful: Bornkessel et al. 2005, Ben-Shachar and Grodzinsky 2004, Mack et al. 2013, and Meltzer-Asscher et al. 2015.

    Best of luck!


  • Corinne Seals added an answer:
    How can I improve speech and language communication in children that have English as an additional language ?

    I hope you can help me.

    Corinne Seals · Victoria University of Wellington

    Recent research suggests that a flexible multilingual education policy may be best: allowing children to code-switch between their home language and the language they are learning (English) builds strength in all of their languages simultaneously. The new book on Flexible Multilingual Education for children by Jean-Jacques Weber (2014) does a fantastic job of explaining this.

  • beh zad Ghorbani added an answer:
    Is the "musical noise" generated by some Speech Enhancement algorithms uniformly distributed across the spectrum?

    I am trying to assess the degree of degradation that "musical noise" causes in the low frequency bands of the spectrum of speech signals. Perceptually (playing back the treated signal) this artifact is stronger in mid and high frequencies (over 700 Hz), however I need an objective way to confirm or disprove this.

    Does anyone have information on this subject or knows a way to evaluate the amount of musical noise present in a signal?

    Thank you very much.

    beh zad Ghorbani · Islamic Republic of Iran Broadcasting University

    I can reduce the musical noise with a perceptual frequency-masking filter.

  • Jonathan Arthur added an answer:
    How do I proceed in case of normal pure tone and speech audiometric results but complaint of difficulty hearing under background noise?

    An adult complains of difficulty hearing in background noise. Pure tone audiometry and speech audiometry reveal normal findings with good speech discrimination scores, and ABR and OAE results are normal. What further investigations are required, and what interventions are possible?

    Jonathan Arthur · Swansea University

    In my opinion, diagnosing APD is probably more straightforward than rehabilitating it. I agree with Alan about using specific rehabilitation for these patients; a hearing therapist or auditory rehabilitationist (depending on where you are based) would be useful. More recently I have advised lipreading classes with a qualified lipreading teacher. Another possible solution is a hearing aid set to zero or minimal gain and connected to a wireless lapel microphone, to improve the signal-to-noise ratio in certain one-to-one listening situations; ReSound can supply these. Often an explanation of the condition helps too.

  • Kuruvachan K George added an answer:
    Where can I find methods for detecting the silence intervals in speech?

    The result of filtering noisy speech strongly depends on how well the silence intervals are detected.

    Kuruvachan K George · Amrita Vishwa Vidyapeetham

    Such algorithms are part of voice activity detectors (VADs), which are used to detect the silence segments in speech data. Various features, such as signal energy, zero-crossing rate and spectral centroid, are used in these algorithms. One of our papers is also attached.
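    A minimal energy plus zero-crossing detector in the spirit of the features listed above might look like the sketch below. It is an illustration under stated assumptions, not the method of the attached paper: the noise floor is taken to be the quietest frame, a frame counts as speech if its energy exceeds that floor by margin_db, and moderately louder frames with a high zero-crossing rate are also kept, because weak fricatives such as /f/ and /s/ tend to look like that. The function names and all thresholds are hypothetical and would need tuning on real data.

```python
import math


def frame_features(samples, frame_len, hop):
    """Per-frame (mean energy in dB, zero-crossing rate) pairs."""
    feats = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len]
        energy_db = 10.0 * math.log10(sum(x * x for x in frame) / frame_len + 1e-12)
        # Count sign changes between consecutive samples.
        crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
        feats.append((energy_db, crossings / (frame_len - 1)))
    return feats


def simple_vad(samples, frame_len=400, hop=160, margin_db=10.0,
               weak_margin_db=3.0, zcr_fricative=0.3):
    """Return one boolean per frame: True = speech, False = silence.

    Assumptions (illustrative, untuned): the quietest frame approximates the
    background noise floor; clearly louder frames are speech; moderately louder
    frames with a high zero-crossing rate are kept as likely weak fricatives.
    """
    feats = frame_features(samples, frame_len, hop)
    if not feats:
        return []
    floor_db = min(e for e, _ in feats)
    labels = []
    for energy_db, zcr in feats:
        strong = energy_db > floor_db + margin_db
        weak_fricative = energy_db > floor_db + weak_margin_db and zcr > zcr_fricative
        labels.append(strong or weak_fricative)
    return labels


if __name__ == "__main__":
    fs = 16000
    silence = [0.0] * 800
    tone = [0.5 * math.sin(2 * math.pi * 200 * n / fs) for n in range(800)]
    print(simple_vad(silence + tone, frame_len=400, hop=400))
```

    Taking the minimum frame energy as the noise floor is fragile if a recording contains no pause at all; a more robust variant would track the floor over time or estimate it from a stretch known to be silent.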

  • At L Hof added an answer:
    What is the typical lung pressure for normal human phonation/speech?

    I need the value of lung pressure to set up the boundary condition for the inflow for a 2D vocal fold simulation for a normal phonation condition.

    At L Hof · University of Groningen

    You may try  to consult the thesis of Harm Schutte at


  • César Asensio added an answer:
    How can I use the posterior probability of a Gaussian mixture model in Matlab?

    In my work, I want to use a Gaussian mixture model for speaker identification. I use Mel-frequency cepstral coefficients (MFCCs) to extract features from the training and testing speech signals, and obj = fitgmdist(X,K) to estimate the parameters of the Gaussian mixture model for each training speech signal. I then use [p, nlogl] = posterior(obj, testdata) and take the minimum nlogl as indicating the maximum similarity between the reference and test models, as shown in the attached Matlab file.

    The problem is that the minimum nlogl changes from run to run, so the program recognizes a different speaker even if I use the same test speech signal. For example, when I run the program for the first time, it finds that the first test speaker has the maximum similarity with the training speech signals (I=1); if I run it again on the same test speech, it finds that the fifth speaker has the maximum similarity with the training model. I do not know what the problem in the program is, or why it gives a different speaker when I run it three times on the same test speech signal. Can anyone who specializes in speaker recognition systems and Gaussian mixture models answer my question?

    With best regards

    César Asensio · Universidad Politécnica de Madrid

    I would suggest trying the PRTools toolbox for Matlab.
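    One likely cause of the run-to-run differences described in the question is that fitgmdist starts from a randomized initialization, so each training run can yield a slightly different model unless the random seed is fixed beforehand (for example with rng, possibly combined with the 'Replicates' option). Independently of the toolbox, the scoring step itself should be deterministic. The Python sketch below, with hypothetical function names and an assumed diagonal-covariance GMM, shows that step: compute the average per-frame log-likelihood of the test MFCC frames under each speaker's model and pick the speaker with the highest value, which gives the same ranking as picking the lowest negative log-likelihood when all models are scored on the same test data.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of vector x under a diagonal-covariance Gaussian."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v
                      for xi, m, v in zip(x, mean, var))


def gmm_avg_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of frames under a diagonal GMM."""
    total = 0.0
    for x in frames:
        comps = [math.log(w) + log_gauss_diag(x, m, v)
                 for w, m, v in zip(weights, means, variances)]
        top = max(comps)
        # log-sum-exp over mixture components, stable for very negative values.
        total += top + math.log(sum(math.exp(c - top) for c in comps))
    return total / len(frames)


def identify_speaker(frames, models):
    """models: dict mapping speaker name -> (weights, means, variances).
    Returns the speaker whose model gives the highest average log-likelihood,
    i.e. the lowest negative log-likelihood on the same test frames."""
    return max(models, key=lambda name: gmm_avg_loglik(frames, *models[name]))


if __name__ == "__main__":
    # Toy 1-D "MFCC" frames and two small models with hypothetical values.
    models = {
        "spk_a": ([1.0], [[0.0]], [[1.0]]),
        "spk_b": ([0.5, 0.5], [[4.0], [6.0]], [[1.0], [1.0]]),
    }
    test_frames = [[0.1], [-0.3], [0.4]]
    print(identify_speaker(test_frames, models))
```

    Once the trained models are fixed, repeated scoring of the same test frames always returns the same speaker; any remaining variation therefore points at training, not scoring.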

  • Nikola Ilankovic added an answer:
    What is the etiological relationship between MMS immunization and lesions of the inner ear (laesio cochleae and n. cochlearis) in children?

    What are the consequences on the speech development? What is the connection with autistic development?

    Nikola Ilankovic · University of Belgrade

    Thank you, Vladimir. But in more than 90% of cases, morbilli (measles) is a very mild illness in small children! Complications occur most frequently in adults. Why, then, is immunisation necessary? The immune reaction after infection and/or immunisation can be delayed by 6 or more months.

  • Alexander I. Rudnicky added an answer:
    Does speech signal volume affect the performance of speaker recognition systems?

    Does the volume of the speech signals affect the performance of speaker recognition systems? For example, if the audio files used in the training stage are louder than those used in the test stage, will this difference in volume affect the performance of the speaker recognition system?

    Alexander I. Rudnicky · Carnegie Mellon University

    Well, if the data are too loud, they will be distorted, so first make sure there is no clipping. Also note that the source of the loudness makes a difference: was the gain too high, or were people shouting? Other things being equal, the training data ought to be reasonably similar to the test data; in the end, this is still a pattern-matching problem.

    Note that techniques such as CMN (cepstral mean normalization) are useful. In our own work we haven't observed much effect of normalization: the features are spectral, so as long as that information is reasonably preserved, things should work. If anything, we've noticed that attempts at normalization usually degrade performance.

    Of course, to find out for sure in your situation, simply train under different conditions and see what happens.
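    Cepstral mean normalization itself is simple. The sketch below (hypothetical function name, frames given as lists of MFCC vectors) illustrates why it helps with level and channel differences: a fixed gain or a smooth, fixed channel filter multiplies the spectrum, which becomes, approximately, an additive constant vector in the log/cepstral domain, and subtracting the per-utterance mean of each coefficient removes that constant.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-utterance mean of each coefficient.

    frames: list of equal-length feature vectors (e.g. MFCC frames) for one utterance.
    """
    if not frames:
        return []
    n = len(frames)
    dims = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dims)]
    return [[f[d] - means[d] for d in range(dims)] for f in frames]


if __name__ == "__main__":
    utt = [[1.0, 2.0], [3.0, 4.0]]
    # Same utterance with a constant offset per coefficient, as a gain or
    # fixed channel change would (approximately) add in the cepstral domain.
    shifted = [[c + off for c, off in zip(f, (10.0, -3.0))] for f in utt]
    print(cepstral_mean_normalize(utt) == cepstral_mean_normalize(shifted))
```

    Because the mean is taken per utterance, very short utterances lose some genuine speaker information along with the offset, which is one reason normalization can sometimes hurt.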

  • Iman Esmaili added an answer:
    Does using speech samples with SNR < 0 make my recognition less accurate?

    I am doing isolated word recognition based on MFCCs. Some of my samples turned out to have SNR < 0 dB; should I use them or simply delete them?

    Iman Esmaili · Shahed University


    Of course, using low-SNR data degrades recognition accuracy, but whether to use it depends on your recognition plan: there is clean-speech recognition and there is noisy-speech recognition. If your system does not have to cope with varied environments, just use the clean speech; but if it must work in different conditions, you should use all of your data and find some way to deal with the noise.

    For example, spectral subtraction is a simple and efficient way to deal with white noise.
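    A minimal sketch of power spectral subtraction for a single frame follows. It assumes the noise magnitude spectrum has already been estimated from non-speech frames, and it uses an over-subtraction factor alpha and a spectral floor beta, which is one common variant rather than the only formulation. The function name and default values are illustrative; reconstructing the waveform (combining the enhanced magnitude with the noisy phase and overlap-adding inverse FFT frames) is not shown.

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=1.0, beta=0.01):
    """Power spectral subtraction for one frame.

    noisy_mag: magnitude spectrum |Y(k)| of the noisy frame.
    noise_mag: estimated noise magnitude |N(k)| (e.g. averaged over non-speech frames).
    alpha: over-subtraction factor; beta: spectral floor relative to the noise power.
    Returns the enhanced magnitude spectrum.
    """
    enhanced = []
    for y, n in zip(noisy_mag, noise_mag):
        power = y * y - alpha * n * n      # subtract the estimated noise power
        floor = beta * n * n               # never go below a fraction of the noise
        enhanced.append(max(power, floor) ** 0.5)
    return enhanced


if __name__ == "__main__":
    print(spectral_subtract([5.0, 1.0, 3.0], [3.0, 2.0, 1.0]))
```

    The floor matters: clipping negative differences to zero instead leaves isolated surviving spectral peaks, a well-known source of musical noise.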

  • Chilin Shih added an answer:
    How can I estimate a person's vocal tract length, using a recorded audio file?
    I'm performing some experiments that require a vocal tract length change, but I need to know the original one.
    I'm aware of the formula L = c / 4F, where c is the speed of sound (34029 cm/s) and F is the first formant frequency. I'm also aware that I should use vowels as close as possible to an unconstricted vocal tract.
    However, I ran a few experiments in Praat and got rather varied results that are difficult to interpret. Within a single vowel I get a wide range of first-formant frequencies, so I thought I should use the average. Is that correct? Moreover, I get very different results across different vowels. Is that normal?

    Thanks in advance!
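    The formula in the question is the first resonance of a uniform tube closed at the glottis and open at the lips; the same model gives F_n = (2n-1)c/(4L) for the higher formants, so each formant of a neutral, schwa-like vowel yields its own length estimate. Averaging the estimates from several formants, as in the sketch below, is one simple way to reduce the scatter. The function names are hypothetical, and c = 34029 cm/s is taken from the question (many phonetics texts use a value near 35,000 cm/s for warm, humid vocal-tract air). Because the uniform-tube model only fits unconstricted vowels, large differences between vowels such as /i/ and /a/ are expected rather than a sign of measurement error. For example, F1 = 500 Hz gives L = 34029 / (4 × 500) ≈ 17.0 cm.

```python
SPEED_OF_SOUND_CM_S = 34029.0  # value from the question; ~35000 cm/s is also common


def vtl_from_formant(freq_hz, n, c=SPEED_OF_SOUND_CM_S):
    """Length (cm) of a uniform tube, closed at the glottis and open at the lips,
    whose n-th resonance is freq_hz: F_n = (2n - 1) c / (4 L)."""
    return (2 * n - 1) * c / (4.0 * freq_hz)


def vtl_estimate(formants_hz, c=SPEED_OF_SOUND_CM_S):
    """Average the per-formant length estimates from F1, F2, ... of a neutral vowel."""
    estimates = [vtl_from_formant(f, n, c) for n, f in enumerate(formants_hz, start=1)]
    return sum(estimates) / len(estimates)


if __name__ == "__main__":
    # Hypothetical formant values (Hz), e.g. medians over the steady part of a schwa.
    print(round(vtl_estimate([500.0, 1500.0, 2500.0]), 2))
```

    Using medians over the steady part of the vowel, rather than every frame, also limits the effect of the within-vowel F1 variation mentioned in the question.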
    Chilin Shih · University of Illinois, Urbana-Champaign

    Alternatively, we obtained reasonable measurements using two microphones, one placed at the mouth and one on the throat outside the glottis, and estimated the distance from the time it takes the acoustic wave to travel from the glottis to the opening of the mouth. The technique is described in "A quasi-glottogram signal", JASA 2003.
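    Under the two-microphone approach, the vocal tract length is roughly the speed of sound times the glottis-to-lips travel time. The sketch below estimates that delay by brute-force cross-correlation between the throat and mouth signals and converts it to centimetres. It is an illustrative assumption of how such a delay could be computed, not necessarily the procedure in the cited paper; the function names and the speed-of-sound value are likewise assumptions. At a 44.1 kHz sampling rate one sample corresponds to about 34029 / 44100 ≈ 0.77 cm, so sub-sample interpolation or a higher sampling rate helps.

```python
SPEED_OF_SOUND_CM_S = 34029.0  # assumed; use a value suited to warm vocal-tract air if preferred


def estimate_delay_samples(throat, mouth, max_lag):
    """Lag (samples) by which the mouth signal trails the throat signal,
    found by maximising the raw cross-correlation over lags 0..max_lag."""
    n = min(len(throat), len(mouth))
    best_lag, best_score = 0, float("-inf")
    for lag in range(0, min(max_lag, n - 1) + 1):
        score = sum(throat[i] * mouth[i + lag] for i in range(n - lag))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def length_from_delay(delay_samples, fs_hz, c=SPEED_OF_SOUND_CM_S):
    """Travel distance (cm) corresponding to a delay in samples."""
    return c * delay_samples / fs_hz


if __name__ == "__main__":
    fs = 44100
    throat = [0.0] * 200
    mouth = [0.0] * 200
    throat[50] = 1.0          # synthetic pulse at the glottis ...
    mouth[72] = 1.0           # ... arriving 22 samples later at the lips
    lag = estimate_delay_samples(throat, mouth, max_lag=60)
    print(lag, round(length_from_delay(lag, fs), 2))
```

    Real throat and mouth signals are filtered differently (tissue versus air path), so a normalized cross-correlation, or correlating glottal-pulse features rather than raw waveforms, is likely to be more reliable than the raw product used here.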


  • Mihai Prunescu added an answer:
    Can somebody give me some examples of phonemic variations in a language and the probable reasons for such variations?

    In Ghana, I have observed that the phoneme /j/ is realized as /dz/, /y/ or /Ʒ/ in the speech of different individuals. I have also noticed that the difference in realizations depends on whether the target phoneme is absent from or present in the learner's existing languages (i.e. transfer errors). Where it is present but the realizations differ, the learner articulates the phoneme as one he or she already knows. Where the target phoneme does not exist in the learner's known languages, he or she substitutes another phoneme from his or her linguistic repertoire. Can anyone share examples of phonemic variation they have noticed in their students' speech? Are the reasons for the variation different from those I have stated?

    Mihai Prunescu · Institute of Mathematics of the Romanian Academy

    Some such variations are determined by the speaker's region. The Romanian word "pe" (on) is pronounced "pă" around Bucharest. There are thousands of such region-dependent pronunciations in many languages. In German, the past participle prefix "ge-" is pronounced in the Berlin area like "ie" (written je). The expression "Hast du jedient?" ("Have you served in the army?") instead of "Hast du gedient?" is a classic example.

  • Peggy Katelhoen added an answer:
    How are verbs of communication used to introduce Direct Speech in different languages?

    I am exploring the use of verbs of communication (verba dicenda) across languages and genres. My main aim is to see whether typological differences across languages (as described by Talmy) are maintained in the domain of communication. I am particularly concerned with how different languages use VoCs to introduce (and reconstruct) Direct Speech in written narratives and the rhetorical implications of this use, but I am also interested in their use in oral contexts. Any research dealing with this will be much appreciated.

    Peggy Katelhoen · Università degli Studi di Torino

    There is an older publication (German and Spanish): Hernández, Eduardo Jorge (1993):
    Verba dicendi. Kontrastive Untersuchungen Deutsch-Spanisch. Series: Hispano-Americana. Frankfurt/M., Berlin, Bern, New York, Paris, Wien, 
    and my own book: Katelhön, Peggy (2005): Das fremde Wort im Gespräch: Rededarstellung und Redewiedergabe in italienischen und deutschen Gesprächen, Berlin: Weidler Verlag (discourse representation in spoken Italian and German).

    For German I found that there are verbs like "kommen" (come) that can introduce DS with an implicit negative evaluation... Also very interesting are forms without a verb, such as "e lui/e lei" and "ich so, er so", or with the verb "fare" (make) in Italian... You can find examples and a bibliography in the book.

    With best regards, PK

  • Eddy B. Brixen added an answer:
    Acoustic analysis of speech - recommendations for lapel mics?

    I am planning a series of 'field recordings' of speech, with one speaker per recording, to be conducted in 'a quiet indoor space'. Planned analyses include formant tracking (F0-F3).

    Researchers in the field of acoustic/phonetic analysis of speech: What lapel mics do you use? What are your experiences with different models? Do you have recommendations for particular models currently available? Would prefer an economical solution (for multi-site testing), but open to suggestions.


    Suzy Styles

    Eddy B. Brixen · EBB-consult

    The distance and the axis are important. We have seen discussions of different LTASS profiles in different languages, and the discussion was really about the placement of microphones. A perfect microphone is the DPA4060: remove the grid and you have a mic with a flat frequency response, low noise and low distortion.

    But placing the microphone on the chest makes you lose approximately 10 dB at 3-4 kHz! (Check this paper: Brixen, E.B.: Spectral degradation of speech captured by miniature microphones mounted on persons' heads and chests. AES Convention no. 100, Copenhagen, Denmark. Preprint 4284.)

    A headset microphone that works is the DPA d:fine 66. The level is 10 dB higher "at the edge of your smile" than with the chest-mounted mic, which gives you 10 dB less background noise.

    Good luck


  • T. Nagarajan added an answer:
    Why is it necessary to have a restriction of minimum-phase signal to use modified group delay?

    The group delay function can be effectively used for various speech processing tasks only when the signal under consideration is a minimum phase signal.

    T. Nagarajan · Sri Sivasubramaniya Nadar College of Engineering

    Yes, it is compulsory. The group delay is, to a certain extent, similar to the magnitude spectrum of the signal. The spikes are due to the wrapped phase rather than to real features of the signal, and they have to be avoided.
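    For reference, the group delay can be computed without explicit phase unwrapping using the identity tau(w) = (X_R Y_R + X_I Y_I) / |X(w)|^2, where Y is the DFT of n·x[n]. The pure-Python sketch below (a naive DFT, hypothetical function names) implements this plain group delay, not the modified group delay, which, as usually described, replaces |X|^2 in the denominator with a cepstrally smoothed spectrum raised to a power and compresses the result with a second exponent. In this formulation the large spikes appear at bins where |X|^2 is close to zero, i.e. near zeros of the signal's z-transform lying close to the unit circle.

```python
import cmath
import math


def dft(x, n_fft):
    """Naive DFT of x evaluated at n_fft equally spaced frequencies (len(x) <= n_fft)."""
    return [sum(v * cmath.exp(-2j * math.pi * k * n / n_fft) for n, v in enumerate(x))
            for k in range(n_fft)]


def group_delay(x, n_fft=64, eps=1e-12):
    """Group delay in samples at each DFT bin, without phase unwrapping:
    tau(k) = (Xr*Yr + Xi*Yi) / |X|^2, where Y is the DFT of n * x[n]."""
    X = dft(x, n_fft)
    Y = dft([n * v for n, v in enumerate(x)], n_fft)
    return [(a.real * b.real + a.imag * b.imag) / (abs(a) ** 2 + eps)
            for a, b in zip(X, Y)]


if __name__ == "__main__":
    # A pure delay of 3 samples has a constant group delay of 3.
    print([round(t, 6) for t in group_delay([0.0, 0.0, 0.0, 1.0], n_fft=8)])
    # A zero close to the unit circle (at z = -0.99) produces a large excursion near w = pi.
    print(max(abs(t) for t in group_delay([1.0, 0.99], n_fft=64)))
```

    The naive DFT keeps the example dependency-free; in practice an FFT routine would be used for both X and Y.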

  • Is there a corpus with whistled speech tokens from Silbo Gomero?

    I am looking to do research with a learning experiment that requires whistled tokens. Specifically, my past research has focused on Silbo Gomero, so I am now in need of access to sound bites from that language. If there are Spanish translations to accompany the whistled sound bites, that would be ideal! Thank you!

    Gregorio Rodríguez Herrera · Universidad de Las Palmas de Gran Canaria

    Hello Pat!

    I think these works by M. Trapero may be of interest.

  • Michael Clarke added an answer:
    What are the main differences between children and adult speech?

    I know that this question is very general, but I would like opinions on possible ways to split these differences into several groups, e.g. “acoustic and linguistic differences”.

    Thank you very much in advance.


    Michael Clarke · Cancer Treatment Centers of America

    I am sure I am not telling you anything you do not already know. Speech awareness and production change as the child's aerodigestive tract and articulatory structures grow neurologically and physically. The more complex articulatory motions are the last to develop in skill. That is why so many SLPs in the school system are working on remediating pronunciation of /s/, /r/ and /l/, and why the general population still does not accurately pronounce /z/ at the end of words. Children in kindergarten and first grade in the Midwestern United States are not fully expected to have mastered /r/ production. Competence develops in grossly predictable patterns (/m n ng p f h w/, /b d g k r/, /t th L v/, /sh ch dg/). In addition, childhood illnesses compromise pronunciation. Our lifestyle aggravates the sinus mucosa, so velar valving for non-nasal production is frequently a point of contrast between the young and the old (the older population has less of a problem with this). The same problems aggravate Eustachian tube function, so middle ear problems and poor hearing of low-frequency sounds are common in the younger population. Confusion of sounds is common.

    Gross linguistic factors have to do with the onset of various stages of linguistic development: referents (word approximations) based on frequency of usage, the contrast of nasals and stops, starting with highly visible labial sounds (e.g. mama, papa, baw/ball), gross differentiation of place/manner/voicing (gawgy/doggy), and simplification of articulation (Is/Its);

    semantic (phrase) development; and

    syntactic development with semantic markers of plural /s, z/, gerund/infinitive marking (-er) and verb modifiers (-ly). Early errors occur because of the complexity of the linguistic formulation the child is attempting or the communication load put upon them (e.g. my son's use of CRACKIE, confusing COOKIE and CRACKER; thinger/finger), along with early-onset dysfluency.

    I am sure there are early education, preschool and school therapists who can amplify this explanation or give you better examples.

    Adults of various skill levels (varying also with sinus problems) may have difficulty with polysyllabic coarticulation/sequencing and with maintaining voicing or developing enough intraoral pressure for voicing, and so simplify or revert to poorly learned patterns and phoneme sequences (Black dialect has formalized one of these into using AX/Ask). This also occurs as simplification or undershoot in the pronunciation of blends like [n/-nd], sibilants [s/-sts], or voiced sibilants s/-z, -sh/-ch.

    The low-income population can have missing teeth or low-grade pain that disrupts oral feedback for pronunciation.

    Persons with GERD may have lost molars and may have restricted breathing from abdominal pain. Any recent change to the articulators will have an immediate, though usually temporary, effect on pronunciation: just think of the last time you had Novocaine at the dentist's office.

    The elderly have problems not so much from hearing loss (bone conduction for auditory feedback is often better than the acoustic signal in conductive loss for low-frequency sounds, and likely equivalent for high-frequency sounds such as sibilants, fricatives and affricates). The more frequent problem is poorly fitting dentures. Articulatory accuracy suffers, especially for sibilants, which require a fine airstream to be broken against the teeth. Range of motion and rapid articulatory movement are also hampered by a restricted tongue that often uses its lateral edges to hold the dentures in place. The least considered factor is age- or illness-related muscle weakness (sarcopenia), with wasting most profound in bedbound elderly people. Muscle wasting can occur after 4 days of inactivity in bed, especially orally in patients on alternate feeding because of oral and throat soreness from cancer treatment.


  • Masoud Qanbari asked a question:
    How can I find source code for 300 or 600 bps speech codecs?

    I need help with very low bit rate codecs.

  • Speech to text software

    Dear all,
    I am looking for a free speech to text (STT) software for writing technical documents (BSc, MSc, PhD, etc). I've found a list here: http://en.wikipedia.org/wiki/List_of_speech_recognition_software

    Which STT software do you recommend and why?
    Thank you very much in advance.

    Kind regards,

    Fernando A. Marengo Rodriguez · Federal University of Santa Catarina

    Thank you! Then, which software package do you recommend, Jan?

