About
69 Publications · 12,274 Reads · 3,213 Citations
Publications (69)
The human brain faithfully tracks temporal regularities in acoustic signals. Recent neuroimaging studies have shown complex modulations of synchronized neural activity in response to the shape of stimulus envelopes. How neural responses to different envelope shapes connect with listeners' perceptual ability to synchronize to acoustic rhythms requires furt...
Speech comprehension requires the ability to temporally segment the acoustic input for higher-level linguistic analysis. Oscillation-based approaches suggest that low-frequency auditory cortex oscillations track syllable-sized acoustic information and therefore emphasize the relevance of syllabic-level acoustic processing for speech segmentation. H...
Oscillation-based models of speech perception postulate a cortical computational principle by which decoding is performed within a window structure derived by a segmentation process. Segmentation of syllable-size chunks is realized by a theta oscillator. We provide evidence for an analogous role of a delta oscillator in the segmentation of phrase-s...
Oscillation-based models of speech perception postulate a cortical computational principle by which decoding is performed within a window structure derived by a segmentation process. At the syllable level, segmentation is realized by a theta oscillator. We provide evidence for an analogous role of a delta oscillator at the phrasal level. We recorded...
This is a commentary on a review article by Meyer, Sun & Martin (2019), “Synchronous, but not entrained: exogenous and endogenous cortical rhythms of speech and language processing”, doi:10.1080/23273798.2019.1693050. At the heart of this review article is the language comprehension process. Anchored in a psycho- and neurolinguistic viewpoint, the...
Can neural activity reveal syntactic structure building processes and their violations? To verify this, we recorded electroencephalographic and behavioral data as participants discriminated concatenated isochronous sentence chains containing only grammatical sentences (regular trials) from those containing ungrammatical sentences (irregular trials)...
Speech comprehension requires the ability to temporally segment the acoustic input for higher-level linguistic analysis. Oscillation-based approaches suggest that low-frequency auditory cortex oscillations track syllable-sized acoustic information and therefore emphasize the relevance of syllabic-level processing for speech segmentation. Most lingu...
The rhythms of speech and the time scales of linguistic units (e.g., syllables) correspond remarkably to cortical oscillations. Previous research has demonstrated that in young adults, the intelligibility of time-compressed speech can be rescued by "repackaging" the speech signal through the regular insertion of silent gaps to restore correspondenc...
Human listeners understand spoken language across a variety of rates, but when speech is presented three times or more faster than its usual rate, it becomes unintelligible. How the brain achieves such tolerance and why speech becomes unintelligible above certain rates is still unclear. We addressed these questions using electrocorticography (ECoG)...
This psychoacoustic study provides behavioural evidence that neural entrainment in the theta range (3–9 Hz) causally shapes speech perception. Adopting the “rate normalization” paradigm (presenting compressed carrier sentences followed by uncompressed target words), we show that uniform compression of a speech carrier to syllable rates inside the t...
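As a rough illustration of the compression manipulation used in the entry above, the following is a minimal sketch of uniform time compression by FFT resampling. Note that this naive method also scales pitch by the same factor, whereas the studies summarized here rely on pitch-preserving compression (e.g., PSOLA/WSOLA); the function name, sampling rate, and factor are illustrative only.

```python
import numpy as np
from scipy.signal import resample

def compress_uniform(x: np.ndarray, factor: float) -> np.ndarray:
    """Uniformly time-compress a waveform by `factor` (e.g., 3.0 = 3x faster)
    via FFT resampling. Naive: pitch is scaled by the same factor;
    perceptual studies use pitch-preserving compression instead.
    """
    n_out = max(1, int(round(len(x) / factor)))
    return resample(x, n_out)

# Illustrative usage: a 1 s, 16 kHz tone compressed to one third its length.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 220.0 * t)
y = compress_uniform(x, 3.0)
print(len(x), len(y))  # 16000 5333
```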
Oscillation-based models of speech perception postulate a cortical computation principle by which decoding is performed within a time-varying window structure, synchronised with the input on multiple time scales. The windows are generated by a segmentation process, implemented by a cascade of oscillators. This paper tests the hypothesis that prosod...
At the core of oscillation-based models of speech perception is the notion that decoding is guided by parsing. In these models, parsing is executed by setting a time-varying, hierarchical window structure synchronized to the input. Syllabic parsing divides the input into speech fragments that are multi-phone in duration, and it is realized by a theta oscillator ca...
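For a concrete picture of window-based parsing, here is a deliberately crude, envelope-based stand-in (not the published oscillator model): it cuts the signal at smoothed-envelope minima so that windows land roughly in the syllabic range. All names and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks, hilbert

def syllabic_windows(x, fs, min_rate_hz=9.0):
    """Crude envelope-based stand-in for theta-driven syllabic parsing:
    smooth the Hilbert envelope and cut at envelope minima, keeping
    adjacent cuts at least 1/min_rate_hz apart so windows stay within
    the theta (syllabic) range. Not the published model.
    """
    env = np.abs(hilbert(x))
    b, a = butter(2, 10.0 / (fs / 2.0))          # ~10 Hz low-pass smoothing
    env = filtfilt(b, a, env)
    minima, _ = find_peaks(-env, distance=int(fs / min_rate_hz))
    bounds = np.concatenate(([0], minima, [len(x)]))
    return list(zip(bounds[:-1], bounds[1:]))    # (start, end) sample pairs
```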
This study examines the decoding times at which the brain processes structural information in music and compares them to timescales implicated in recent work on speech. Combining an experimental paradigm based on Ghitza and Greenberg (Phonetica, 66(1-2), 113-126, 2009) for speech with the approach of Farbood et al. (Journal of Experimental Psycholo...
Studies on the intelligibility of time-compressed speech have shown flawless performance for moderate compression factors, a sharp deterioration for compression factors above three, and an improved performance as a result of “repackaging”—a process of dividing the time-compressed waveform into fragments, called packets, and delivering the packets i...
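The repackaging operation described above lends itself to a short sketch: split the compressed waveform into fixed-length packets and re-insert silent gaps so packet delivery is slowed while each packet stays compressed. Packet and gap durations below are placeholders, not the values used in the cited experiments.

```python
import numpy as np

def repackage(x: np.ndarray, fs: int, packet_ms: float = 40.0,
              gap_ms: float = 80.0) -> np.ndarray:
    """Divide a (time-compressed) waveform into fixed-length packets and
    deliver them at a slower rate by inserting silent gaps between them.
    """
    packet = int(fs * packet_ms / 1000.0)
    gap = np.zeros(int(fs * gap_ms / 1000.0))
    pieces = []
    for start in range(0, len(x), packet):
        pieces.append(x[start:start + packet])  # one packet of compressed speech
        pieces.append(gap)                      # silence restoring the rhythm
    return np.concatenate(pieces)
```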
The premise of this study is that models of hearing, in general, and of individual hearing impairment, in particular, can be improved by using speech test results as an integral part of the modeling process. A conceptual iterative procedure is presented which, for an individual, considers measures of sensitivity, cochlear compression, and phonetic...
A recent commentary (Oscillators and syllables: a cautionary note. Cummins, 2012) questions the validity of a class of speech perception models inspired by the possible role of neuronal oscillations in decoding speech (e.g., Ghitza, 2011; Giraud and Poeppel, 2012). In arguing against the approach, Cummins raises a cautionary flag "from a phoneticia...
A recent opinion article (Neural oscillations in speech: do not be enslaved by the envelope. Obleser et al., 2012) questions the validity of a class of speech perception models inspired by the possible role of neuronal oscillations in decoding speech (e.g., Ghitza, 2011; Giraud and Poeppel, 2012). The authors criticize, in particular, what they see...
Recent hypotheses on the potential role of neuronal oscillations in speech perception propose that speech is processed on multi-scale temporal analysis windows formed by a cascade of neuronal oscillators locked to the input pseudo-rhythm. In particular, Ghitza (2011) proposed that the oscillators are in the theta, beta, and gamma frequency bands wi...
In this paper, we investigate a closed-loop auditory model and explore its potential as a feature representation for speech recognition. The closed-loop representation consists of an auditory-based, efferent-inspired feedback mechanism that regulates the operating point of a filter bank, thus enabling it to dynamically adapt to changing background...
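As a caricature of the closed-loop idea described above, the sketch below drives each filter-bank channel's gain down as its smoothed output level rises, shifting the channel's operating point. It is a toy automatic-gain-control loop under assumed center frequencies and time constants, not the model evaluated in the paper.

```python
import numpy as np
from scipy.signal import butter, sosfilt

def closed_loop_filterbank(x, fs, centers=(500.0, 1000.0, 2000.0, 4000.0),
                           tau_s=0.5):
    """Toy efferent-like feedback: each band's gain falls as its smoothed
    output level rises. Assumes fs is comfortably above twice the highest
    band edge.
    """
    alpha = np.exp(-1.0 / (tau_s * fs))            # slow level-estimator pole
    out = []
    for fc in centers:
        sos = butter(2, [fc / 1.3, fc * 1.3], btype='bandpass',
                     fs=fs, output='sos')
        band = sosfilt(sos, x)
        level, y = 0.0, np.empty_like(band)
        for i, s in enumerate(band):
            level = alpha * level + (1.0 - alpha) * abs(s)
            y[i] = s / (1.0 + 5.0 * level)         # feedback: more level, less gain
        out.append(y)
    return np.stack(out)                           # (n_channels, n_samples)
```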
The premise of this study is that current models of speech perception, which are driven by acoustic features alone, are incomplete, and that the role of decoding time during memory access must be incorporated to account for the patterns of observed recognition phenomena. It is postulated that decoding time is governed by a cascade of neuronal oscil...
This study was motivated by the hypothesis that low-frequency cortical oscillations help the brain decode the speech signal. The intelligibility (in terms of word error rate) of natural-sounding, synthetically-generated sentences was measured using a paradigm that alters speech-energy rhythm over a range of modulation frequencies. The material comp...
Current predictors of speech intelligibility are inadequate for understanding and predicting speech confusions caused by acoustic interference. We develop a model of auditory speech processing that includes a phenomenological representation of the action of the Medial Olivocochlear efferent pathway and that is capable of predicting consonant confus...
Sensory processing is associated with gamma frequency oscillations (30-80 Hz) in sensory cortices. This raises the question whether gamma oscillations can be directly involved in the representation of time-varying stimuli, including stimuli whose time scale is longer than a gamma cycle. We are interested in the ability of the system to reliably dis...
This study was motivated by the prospective role played by brain rhythms in speech perception. The intelligibility - in terms of word error rate - of natural-sounding, synthetically generated sentences was measured using a paradigm that alters speech-energy rhythm over a range of frequencies. The material comprised 96 semantically unpredictable sen...
We present a model of auditory speech processing capable of predicting consonant confusions by normal hearing listeners, based on a phenomenological model of the Medial Olivocochlear efferent pathway. We then use this model to predict human error patterns of initial consonants in consonant-vowel-consonant words. In the process we demonstrate its po...
We developed a computational model of diphone perception based on salient properties of peripheral and central auditory processing. The model comprises an efferent-inspired closed-loop model of the auditory periphery (PAM) connected to a template-matching circuit (TMC). Robustness against background noise is provided principally by the signal proce...
The work described here arose from the need to understand and predict speech confusions caused by acoustic interference and by hearing impairment. Current predictors of speech intelligibility are inadequate for making such predictions (even for normal-hearing listeners). The Articulation Index, and related measures, STI and SII, are geared to predi...
In the past few years, objective quality assessment models have become increasingly used for assessing or monitoring speech and audio quality. By measuring perceived quality on an easily-understood subjective scale, such as listening quality (excellent, good, fair, poor, bad), these methods provide a quick and repeatable way to estimate customer ex...
JNDs of interaural time delay (ITD) of selected frequency bands in the presence of other frequency bands have been reported for noiseband stimuli [Zurek (1985); Trahiotis and Bernstein (1990)]. Similar measurements will be reported for speech and music signals. When stimuli are synthesized with bandpass/band-stop operations, performance with comple...
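A minimal way to synthesize such stimuli is to impose the ITD on a single band only: band-pass that region, apply a fractional-sample delay to one channel, and leave the remainder diotic. The sketch below does this with linear interpolation; band edges and delay are illustrative values, not those of the reported measurements.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def band_itd_stereo(x, fs, band=(1000.0, 2000.0), itd_us=50.0):
    """Stereo stimulus in which only one frequency band carries an ITD."""
    sos = butter(4, band, btype='bandpass', fs=fs, output='sos')
    bp = sosfiltfilt(sos, x)          # the band that will carry the ITD
    rest = x - bp                     # everything else, presented diotically
    d = itd_us * 1e-6 * fs            # ITD in (fractional) samples
    n = np.arange(len(x), dtype=float)
    bp_left = np.interp(n - d, n, bp, left=0.0)
    return np.stack([rest + bp_left, rest + bp])   # rows: (left, right)
```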
Studies in neurophysiology and in psychophysics provide evidence for the existence of temporal integration mechanisms in the auditory system. These auditory mechanisms may be viewed as "detectors," parametrized by their cutoff frequencies. There is an interest in quantifying those cutoff frequencies by direct psychophysical measurement, in particul...
The hypothesis explored in this study is that the MOC efferent system plays an important role in speech reception in the presence of sustained background noise. This talk describes efforts to assess this hypothesis using a test of initial consonant reception (the Diagnostic Rhyme Test) performed by subjects with normal hearing. Activation of select...
A coding paradigm is proposed which is based solely on the properties of the human auditory system and does not assume any specific source properties. Hence, its performance is equally good for speech, noisy speech, and music signals. The signal decomposition in the proposed paradigm takes advantage of binaural properties of the human auditory syst...
A computational model to predict MOS of processed speech is proposed. The system measures the distortion of processed speech (compared to the source speech) using a peripheral model of the mammalian auditory system and a psychophysically-inspired measure, and maps the distortion value onto the MOS scale. This paper describes our attempt to derive a...
Neurophysiological and psychophysical studies provide evidence for the existence of temporal integration mechanisms in the auditory system. These may be viewed as low‐pass filters, parametrized by their cutoff frequencies. It is of interest to specify these cutoffs, particularly for tasks germane to the effect of temporal smoothing on speech qualit...
A computational model to predict MOS (mean opinion score) of processed speech is proposed. The system measures the distortion of processed speech (compared to the source speech) using a peripheral model of the mammalian auditory system and a psychophysically-inspired measure, and maps the distortion value onto the MOS scale. This paper describes ou...
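The final stage described above, mapping a distortion value onto the MOS scale, is commonly done with a fitted monotone function. The sketch below uses a decreasing logistic with placeholder coefficients that would, in practice, be fitted to subjective test scores; it is not the mapping derived in the paper.

```python
import numpy as np

def distortion_to_mos(d, span=4.0, floor=1.0, d0=1.0, slope=0.5):
    """Map a nonnegative distortion value onto the 1..5 MOS scale with a
    decreasing logistic. Coefficients here are placeholders.
    """
    return floor + span / (1.0 + np.exp((d - d0) / slope))

for d in (0.0, 1.0, 2.0, 4.0):
    print(d, round(float(distortion_to_mos(d)), 2))  # MOS falls as distortion grows
```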
For many tasks in speech signal processing it is of interest to develop an objective measure that correlates well with the perceptual distance between speech segments. (By speech segments the authors mean pieces of a speech signal of duration 50-150 milliseconds. For concreteness they consider a segment to mean a diphone.) Such a distance metric wo...
The performance of large-vocabulary automatic speech recognition (ASR) systems deteriorates severely in mismatched training and testing conditions. Signal processing techniques based on the human auditory system have been proposed to improve ASR performance, especially under adverse acoustic conditions. The paper compares one such scheme, the ensem...
The purpose of this special session is to call the attention of the hearing science community to the need for new knowledge on how speech segments of 50–150 ms duration (e.g., phonemes, diphones) are being represented in the auditory system. In this session, the need for such knowledge will be addressed in the context of two specific spee...
For many tasks in speech signal processing it is of interest to develop an objective measure that correlates well with the perceptual distance between speech segments. (Speech segments are defined as pieces of a speech signal of duration 50–150 ms. For concreteness, a segment is considered to mean a diphone, i.e., a segment from the midpoint of one...
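To make the notion of a segment-level objective distance concrete, here is a naive placeholder: sample each segment at a few evenly spaced analysis frames, compare log-magnitude spectra, and average. It is a stand-in under assumed frame counts and window lengths, not the perceptually validated metric sought in the paper.

```python
import numpy as np

def segment_distance(x, y, fs, n_frames=10, nfft=512):
    """Naive distance between two speech segments (~50-150 ms): average
    the frame-wise Euclidean distance between log-magnitude spectra.
    """
    def log_spectra(sig):
        win = int(0.020 * fs)                        # 20 ms analysis window
        starts = np.linspace(0, len(sig) - win, n_frames).astype(int)
        return np.array([
            np.log10(np.abs(np.fft.rfft(sig[s:s + win], nfft)) + 1e-8)
            for s in starts
        ])
    fx, fy = log_spectra(x), log_spectra(y)
    return float(np.mean(np.linalg.norm(fx - fy, axis=1)))
```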
At present, the performance of automatic speech recognition (ASR) systems is still limited by variabilities within and between speakers, by acoustic differences between training and application environments, and by the sensitivity of ASR systems against changing communication channels. This talk considers the conjecture that the use of speech‐produ...
Auditory models that are capable of achieving human performance in tasks related to speech perception would provide a basis for realizing effective speech processing systems. Saving bits in speech coders, for example, relies on a perceptual tolerance to acoustic deviations from the original speech. Perceptual invariance to adverse signal conditions...
This study provides a quantitative measure of the accuracy of the auditory periphery in representing prespecified time-frequency regions of initial and final diphones of spoken CVCs. The database comprised word pairs that span the speech space along Jakobson et al.'s binary phonemic features [Tech. Rep. No. 13, Acoustic Laboratory, MIT, Cambridge,...
A long-standing question that arises when studying a particular auditory model is how to evaluate its performance. More precisely, it is of interest to evaluate to what extent the model representation can describe the actual human internal representation. Here, this question is addressed in the context of speech perception. That is, given a speech...
In most implementations of hidden Markov models (HMMs) a state is assumed to be a stationary random sequence of observation vectors whose mean and covariance are estimated. Successive observations in a state are assumed to be independent and identically distributed. These assumptions are reasonable when each state represents a short segment of the...
Most speech processing systems (e.g., speech recognition systems or speech coding systems) contain a feature‐analysis stage that extracts the required task-specific information from the speech waveform. This study addresses the question of how to identify what part of the speech information is lost in this process. To answer this question, a diagnosti...
Traditional speech coding schemes are designed to produce synthesized speech with a waveform (or a spectrum) that is as close as possible to the original. With limits on the bit rate, however, it would be better to produce synthesized speech that matches the original speech at the auditory‐nerve level. Current models of the auditory periphery enabl...
In most implementations of hidden Markov models (HMM) a state is assumed to be a stationary random sequence of observation vectors whose mean and covariance are estimated. Successive observations in a state are assumed to be independent and identically distributed. These assumptions are reasonable when each state represents a short segment of the s...
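The assumption questioned in the entry above has a simple quantitative form: if frames within a state are i.i.d. Gaussian, the state's joint log-likelihood is just a per-frame sum. A minimal sketch of that computation, with illustrative names:

```python
import numpy as np

def state_loglik(obs, mean, cov):
    """Joint log-likelihood of a run of observation vectors under the
    standard HMM assumption: within a state, frames are i.i.d. Gaussian.
    obs: (n_frames, dim); mean: (dim,); cov: (dim, dim).
    """
    dim = mean.shape[0]
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    diff = obs - mean
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)   # per-frame Mahalanobis
    return float(-0.5 * np.sum(quad + logdet + dim * np.log(2.0 * np.pi)))
```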
The author describes the closed-loop ensemble-interval-histogram (EIH) model. It is constructed by adding a feedback system to the former, open-loop, EIH model (Ghitza, Computer Speech and Language, 1(2), pp. 109-130, Dec. 1986). While the open-loop EIH is a computational model based upon the ascending path of the auditory periphery, the feedback s...
Traditional speech analysis/synthesis techniques are designed to produce synthesized speech with a spectrum (or waveform) that is as close as possible to the original. It is suggested, instead, that representations of the synthetic and the original speech be matched at the auditory nerve level. This concept has been used in conjunction with the sin...
In a previous report (Ghitza, 1987, [1]) we described a computational model based upon the temporal characteristics of the information in the auditory nerve fiber firing patterns, which produced an "auditory" spectral representation (the EIH) of the input signal. We also demonstrated that for speech recognition purposes, the EIH is more robust agai...
We describe here a computational model based upon the temporal characteristics of the information in the auditory nerve-fiber firing patterns. The model produces a frequency domain representation of the input signal in terms of the ensemble histogram of the inverse of the interspike intervals, measured from firing patterns generated by a simulated...
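The core histogram computation described above can be sketched compactly: pool the inverse interspike intervals of all simulated fibers into frequency bins. The input format and bin edges below are assumptions for illustration; the full EIH is computed from the firing patterns of a simulated auditory periphery.

```python
import numpy as np

def ensemble_interval_histogram(spike_times, freq_edges):
    """Ensemble histogram of inverse interspike intervals: pool the 1/ISI
    'instantaneous frequencies' of all fibers into frequency bins.
    spike_times: list of per-fiber spike-time arrays (seconds);
    freq_edges: histogram bin edges (Hz).
    """
    freqs = []
    for t in spike_times:
        isi = np.diff(np.sort(np.asarray(t)))
        freqs.append(1.0 / isi[isi > 0])
    hist, _ = np.histogram(np.concatenate(freqs), bins=freq_edges)
    return hist
```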
Efficient scalar quantization tables for LPC k-parameters were developed using a distortion measure based on just-noticeable differences (JNDs) in formant parameters of the speech spectrum envelope. Forty percent fewer bits were required than the 41 bits/frame used in conventional approaches. An empirical technique was developed for relating perturbati...
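The link between a JND and a quantization table is easy to sketch: tie the quantizer step to the JND, so errors stay below roughly half a JND and should remain imperceptible. The uniform step below is a simplification; the paper derives nonuniform tables from formant-domain JNDs.

```python
import numpy as np

def jnd_quantizer(values, jnd):
    """Scalar quantization with the step tied to a just-noticeable
    difference: quantization error is bounded by ~jnd/2.
    """
    idx = np.round(np.asarray(values, dtype=float) / jnd).astype(int)
    return idx, idx * jnd   # (table indices, reconstructed values)
```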
Traditional speech analysis/synthesis techniques are designed to produce synthesized speech with a spectrum (or waveform) which is as close as possible to the original. It is suggested, instead, to match the in-synchrony-bands spectrum measures (Ghitza, ICASSP-85, Tampa FL., Vol.2, p. 505) of the synthetic and the original speech. This concept has...
A speech spectrum intensity measure based on temporal non-place modeling of the cat's auditory nerve firing patterns is introduced where the spectrum intensity values are estimated using timing-synchrony measurements only. The ability of this measure to serve as a speech information carrier was tested psychoacoustically, by integrating the proposed...
Flanagan in 1955 performed psychoacoustical experiments to measure the JNDs for the formant center-frequency (Flanagan, 1955a) and its intensity (Flanagan, 1955b), and thereby determine the precision required in formant vocoder speech synthesis (Flanagan, 1957). These experiments were performed on steady-state, synthetic speech vowels, yielding the...
Psychoacoustical discrimination limits describe the ability of the human observer to perceive changes in an acoustic stimulus; sounds differing by less than a “Just Noticeable Difference” (JND) are heard as the same. Flanagan and his associates [J. Acoust. Soc. Am. 27, 613 (1955); 27, 1223 (1955); 30, 435 (1958)] studied steady‐state synthetic speech...
We describe a computational model of diphone perception based on salient properties of peripheral and central auditory processing. The model comprises an efferent-inspired closed-loop model of the auditory periphery connected to a template-matching neuronal circuit with a gamma rhythm at its core. We show that by exploiting auditory feedback a plac...
This is a final report for a stand-alone grant supporting the first 9 months of a 4-year research program entitled "Auditory peripheral processing of degraded speech". The underlying thesis is that the auditory periphery contributes to the robust performance of humans in speech reception in noise through a concerted contribution of the efferent fee...