Stephen A. Zahorian
  • Binghamton University

About

133
Publications
23,087
Reads
1,748
Citations
Current institution
Binghamton University

Publications

Chapter
Full-text available
Based on extensive prior studies of speech science focused on the spectral-temporal properties of human speech perception, as well as a wide range of spectral-temporal speech features already in use, and motivated by the time-frequency resolution properties of human hearing, this chapter proposes and evaluates one general class of spectral-temporal...
Article
Full-text available
Automatic Speech Recognition (ASR) is widely used in many applications and tools. Smartphones, video games, and cars are a few examples where people use ASR routinely and often daily. A less commonly used, but potentially very important arena for using ASR, is the health domain. For some people, the impact on life could be enormous. The goal of thi...
Preprint
Full-text available
In this paper, we present a reverberation removal approach for speaker verification, utilizing dual-label deep neural networks (DNNs). The networks perform feature mapping between the spectral features of reverberant and clean speech. Long short term memory recurrent neural networks (LSTMs) are trained to map corrupted Mel filterbank (MFB) features...
Article
Convolutional neural networks have been used with great success for image processing tasks. The convolution filters can act as powerful feature detectors in images to find edges and shapes. Time domain speech signals are one dimensional, but frequency domain information can be viewed in the same way as an image, allowing one to use two-dimensional...
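The idea above, viewing frequency-domain speech as an image a 2-D CNN can convolve over, rests on framing the 1-D signal and stacking per-frame spectra. A minimal sketch of that construction (not the paper's implementation; a naive DFT is used for clarity, and all names are illustrative):

```python
import cmath, math

def spectrogram(x, frame_len=64, hop=32):
    """Slice a 1-D signal into overlapping frames, take the magnitude
    DFT of each, and stack the rows into a 2-D time-frequency array
    that a 2-D CNN could treat like an image. (Naive O(N^2) DFT for
    clarity; real code would use an FFT.)"""
    frames = []
    for start in range(0, len(x) - frame_len + 1, hop):
        seg = x[start:start + frame_len]
        row = [abs(sum(seg[t] * cmath.exp(-2j * math.pi * k * t / frame_len)
                       for t in range(frame_len)))
               for k in range(frame_len // 2 + 1)]   # bins 0..Nyquist
        frames.append(row)
    return frames                                    # (num_frames, num_bins)

fs = 800
sig = [math.sin(2 * math.pi * 100.0 * t / fs) for t in range(256)]
S = spectrogram(sig)
# a pure 100 Hz tone concentrates its energy in one frequency bin per frame
```

With a 64-sample frame at 800 Hz, the bins are 12.5 Hz apart, so the tone lands exactly in bin 8 of every row.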
Article
The clinical diagnosis of Alzheimer’s disease and other dementias is very challenging, especially in the early stages. Our hypothesis is that any disease that affects particular brain regions involved in speech production and processing will also leave detectable fingerprints in the speech. The goal of this work is an easy-to-use, non-invasive, in...
Article
For at least two decades, the primary acoustic features used for both automatic speech recognition (ASR) and automatic speaker identification (SID) have been Mel frequency cepstral coefficients (MFCCs) and their first and second order difference terms, referred to as Delta and double Delta terms. The MFCC’s capture static spectral information, wher...
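The Delta and double-Delta terms mentioned above are conventionally computed as regression slopes over a few neighboring frames of each static coefficient. A minimal sketch of that standard computation (the textbook formula, not this paper's code):

```python
def delta(coeffs, N=2):
    """Regression-based delta features, as commonly appended to MFCCs.
    `coeffs` is a list of per-frame values for one coefficient; edges
    are handled by repeating the boundary frame."""
    T = len(coeffs)
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    out = []
    for t in range(T):
        num = 0.0
        for n in range(1, N + 1):
            plus = coeffs[min(t + n, T - 1)]
            minus = coeffs[max(t - n, 0)]
            num += n * (plus - minus)
        out.append(num / denom)
    return out

track = [0.0, 1.0, 2.0, 3.0, 4.0]   # a linearly rising coefficient
d1 = delta(track)                   # slope = 1.0 at the center frame
d2 = delta(d1)                      # double Delta: 0 at the center of a ramp
```

Applying `delta` twice yields the double-Delta (acceleration) terms, so each static coefficient contributes three feature streams.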
Conference Paper
Full-text available
Early non-invasive diagnosis of Alzheimer’s disease (AD) and other forms of dementia is a challenging task. Early detection of the symptoms of the disorder could help families and medical professionals prepare for the difficulties ahead, as well as possibly provide a recruitment tool for clinical trials. One possible approach to a non-invasive diag...
Conference Paper
Full-text available
Speech delay is a childhood language problem that sometimes resolves on its own but sometimes leads to more serious language difficulties later. This motivates therapists to screen children at early ages so that problems can be addressed before they worsen. Using the Goldman-Fristoe Test of Articulation (GFTA) method, therapists listen to a child's...
Article
Measuring periodicity is an important task in speech processing. It can be used in many areas, especially for tracking fundamental frequency (F0), typically referred to as pitch. This seemingly easy measurement is made difficult since even voiced sections of speech are only semi-periodic, or periodic only over short intervals. In this paper, four...
Article
Ideally, respiratory masks, when used to make certain measurements of speech and singing, should not interfere with what is being measured. Unfortunately, this is not always the case. In this paper, two masks intended for speech measurements are experimentally compared. One is a hard-walled mask manufactured and marketed by Kay-Pentax. The other is...
Article
Speech delay is a childhood language problem that might resolve without intervention, but might alternatively presage continued speech and language deficits. Thus, early detection through screening might help to identify children for whom intervention is warranted. The goal of this work is to develop Automatic Speech Recognition (ASR) methods to pa...
Article
For applications such as tone modeling and automatic tone recognition, smoothed, all-voiced F0 (pitch) tracks are desirable. Three pitch trackers that have been shown to give good accuracy for pitch tracking are YAAPT, YIN, and PRAAT. On tests with English and Japanese databases, for which ground truth pitch tracks are available by other means...
Article
"Yet another Algorithm for Pitch Tracking - YAAPT" was published in a 2008 JASA paper (Zahorian and Hu), with additional experimental results presented at the fall 2012 ASA meeting in Kansas City. The results presented in both the journal paper and at the fall 2012 meeting indicated that YAAPT generally has lower error rates than other widely used p...
Conference Paper
Full-text available
In this paper, the ability of human listeners to recognize tones from continuous Mandarin Chinese is evaluated and compared to the accuracy of automatic systems for tone classification and recognition. All tones used for experimentation were extracted from the RASC863 continuous Mandarin Chinese database. The human listeners are native speakers of...
Article
In this paper, we evaluate the front-end of Automatic Speech Recognition (ASR) systems, with respect to different types of spectral processing methods that are extensively used. Experimentally, we show that direct use of FFT spectral values is just as effective as using either Mel or Gammatone filter banks, as an intermediate processing stage, if t...
Article
Tones are important characteristics of Mandarin Chinese for conveying lexical meaning. Thus tone recognition, either explicit or implicit, is required for automatic recognition of Mandarin. Most literature on machine recognition of tones is based on syllables spoken in isolation or even machine-synthesized voices. This is likely due to the difficul...
Article
This work is a continuation and extension of work presented at the fall 2011 meeting of the Acoustical Society of America (Wong and Zahorian). In that work, and also at work done at Carnegie Mellon University, auditory model derived spectral amplitude nonlinearities, with symmetric additional compression (after log amplitude scaling) were found to...
Article
Full-text available
"Yet another Algorithm for Pitch Tracking - YAAPT" was published in a 2010 JASA paper (Zahorian and Hu). Although demonstrated to provide high accuracy and noise robustness for fundamental frequency tracking for both studio quality speech and telephone speech, especially as compared to other well-known algorithms (YIN, Praat, RAPT), YAAPT has not be...
Article
There is nearly universal agreement among engineering educators that the ABET2000 rules, although very well intentioned, have unintentionally increased the workload required to document that all ABET outcomes (a through k) are met, and that a process of continuous improvement is in place. Although there is no magic wand to eliminate all of the docu...
Article
A dual transmission model of the fetal heart sounds is presented in which the properties of the signals received on a sensor, installed on the maternal abdominal surface, depend upon the position of the fetus. For a fetus in the occiput anterior position, the predominant spectral content lies in the frequency band 16-50 Hz ("impact" mode), but for a...
Article
Many speech segmentation techniques have been proposed to automate phonetic alignment. Most of the techniques, however, require labeled data for training and perform well only for read, high-quality speech. Automatic phonetic alignment for lower-quality, varied data with no labeled training data, the subject of this paper, is a much more challenging d...
Article
In a study presented at the fall 2010 meeting of the Acoustical Society of America (Zahorian et al., "Time/frequency resolution of acoustic features for automatic speech recognition"), we demonstrated that spectral/temporal evolution features which emphasize temporal aspects of acoustic features, with relatively low spectral resolution, are effectiv...
Article
Nine hundred video clips (approximately 30 h in each of English, Mandarin, and Russian) have been collected from Internet sources such as youtube.com and rutube.ru. This multi-language audio/video database has been orthographically transcribed by human listeners with time markers at the sentence level. However, the aim is to provide this database t...
Article
Auditory models for outer periphery processing include a sigmoid shaped nonlinearity that is even more compressed than standard logarithmic scaling at very low and very high amplitudes. In some studies done at Carnegie Mellon University, it has been shown that this compressive nonlinearity is the most important aspect of the Seneff auditory model i...
Conference Paper
Full-text available
Over the past few decades, research in automatic speech recognition and automatic speaker recognition has been greatly facilitated by the sharing of large annotated speech databases such as those distributed by the Linguistic Data Consortium (LDC). Open sources, particularly web sites such as YouTube, contain vast and varied speech recordings in a...
Article
The underlying assumption for spectral/temporal features for use in automatic speech recognition is that the frequency resolution should be emphasized in relation to temporal resolution. Accordingly, Mel frequency cepstral coefficients are typically computed using an approximately 25-ms frame length with a 10-ms frame spacing, and using 3-5 frames...
Conference Paper
Full-text available
This paper presents two nonlinear feature dimensionality reduction methods based on neural networks for a HMM-based phone recognition system. The neural networks are trained as feature classifiers to reduce feature dimensionality as well as maximize discrimination among speech features. The outputs of different network layers are used for obtaining...
Conference Paper
Full-text available
Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT...
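The DCT mentioned above is typically applied over the time trajectory of each feature, so that a few low-order terms summarize the longer-term (modulation) behavior of the segment. A minimal sketch of that idea (an unnormalized DCT-II, not this paper's exact feature set):

```python
import math

def dct2(traj, num_terms):
    """DCT-II of a per-frame feature trajectory. Keeping only the first
    few terms gives a smooth, compact summary of how the feature evolves
    over the segment: term 0 reflects the mean level, term 1 the overall
    tilt, and higher terms capture faster modulation."""
    T = len(traj)
    return [sum(traj[t] * math.cos(math.pi * k * (t + 0.5) / T)
                for t in range(T))
            for k in range(num_terms)]

rising = [float(t) for t in range(8)]   # a steadily increasing feature
terms = dct2(rising, 3)                 # term 0 equals the frame sum here
```

Because the cosine basis is orthogonal over the frame indices, a constant trajectory produces zero in every term beyond the first.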
Conference Paper
SOCRATES (Student Oriented Creative Resource for the Assessment and Teaching of Engineering Skills) will be an automated computer-based system for simulating the Socratic Method of learning in the subject area of Probability and Statistics. The goal is for the student to be an active partner in the learning process using directed self-reflections....
Article
Many studies from speech science have shown that the mel frequency scale more closely matches speech perception than the linear frequency scale. Automatic speech recognition engineers have empirically demonstrated that the use of the mel scale results in more accurate speech recognition than that obtainable with features computed with respect to a...
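The mel scale referred to above has a standard closed form that is roughly linear below about 1 kHz and logarithmic above it. A short sketch of that conventional mapping (the common O'Shaughnessy form, not specific to this paper):

```python
import math

def hz_to_mel(f_hz):
    """Standard mel-scale mapping: near-linear at low frequencies,
    logarithmic at high frequencies."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    """Inverse mapping back to linear frequency."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# Equal mel steps give finer resolution at low frequencies,
# mirroring the perceptual emphasis described above.
print(round(hz_to_mel(1000.0), 1))   # 999.9 (≈ 1000 mel at 1 kHz)
```

Mel filter banks are built by spacing triangular filters uniformly on this scale and mapping the edges back to Hz with `mel_to_hz`.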
Conference Paper
Full-text available
A neural network based feature dimensionality reduction for speech recognition is described for accurate phonetic speech recognition. In our previous work, a neural network based nonlinear principal component analysis (NLPCA) was proposed as a dimensionality reduction approach for speech features. It was shown that the reduced dimensionality featur...
Article
Full-text available
In this paper, a fundamental frequency (F(0)) tracking algorithm is presented that is extremely robust for both high quality and telephone speech, at signal to noise ratios ranging from clean speech to very noisy speech. The algorithm is named "YAAPT," for "yet another algorithm for pitch tracking." The algorithm is based on a combination of time d...
Conference Paper
Full-text available
One of the main practical difficulties for automatic speech recognition is the large dimensionality of acoustic feature spaces and the subsequent training problems collectively referred to as the "curse of dimensionality." Many linear techniques, most notably principal components analysis (PCA) and linear discriminant analysis (LDA) and several var...
Conference Paper
This paper presents speech signal modeling techniques that are well suited to robust recognition of connected digits in noisy environments. After several preprocessing steps speech is represented by a block-encoding of discrete cosine transform of its spectra. In this paper we combine linear predictive coding (LPC), morphological filtering, and lon...
Conference Paper
Full-text available
Automatic speech recognizers perform poorly when training and test data are systematically different in terms of noise and channel characteristics. One manifestation of such differences is variations in the probability density functions (pdfs) between training and test features. Consequently, both automatic speech recognition and automatic speaker...
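One standard remedy for the train/test pdf mismatch described above is to remap each test feature to the training distribution by empirical quantile. The sketch below is a generic illustration of that idea, not the method of this paper (real systems work per feature dimension, usually with smoothed histograms):

```python
import bisect

def quantile_map(test_feat, train_feat):
    """Map each test value to the training value at the same empirical
    quantile, so the normalized test features approximately follow the
    training pdf."""
    tr = sorted(train_feat)
    te = sorted(test_feat)
    n, m = len(te), len(tr)
    out = []
    for v in test_feat:
        q = bisect.bisect_left(te, v) / n          # empirical test CDF at v
        out.append(tr[min(m - 1, int(q * m))])     # same quantile in train
    return out

# The mapping preserves rank order while matching the target range:
print(quantile_map([30.0, 10.0, 20.0], [1.0, 2.0, 3.0]))  # [3.0, 1.0, 2.0]
```

Because only ranks matter, the transform is invariant to any monotonic channel distortion of the test features.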
Article
Computer‐based visual speech training aids are potentially useful feedback tools for hearing impaired people. In this paper, a training aid for the articulation of short Consonant‐Vowel‐Consonant (CVC) words is presented using an integrated real‐time display of phonetic content and loudness. Although not yet extensively tested with hearing‐impaired...
Conference Paper
This paper describes an algorithm for determining pitch period markers in a continuous speech signal using prior knowledge of pitch values. The algorithm uses dynamic programming to determine optimal markers from a set of probable markers. Local costs are assigned based on amplitudes of local peaks, and transition costs are based on the closeness...
Conference Paper
A visual speech training aid for persons with hearing impairments has been developed using a Windows-based multimedia computer. In previous papers, the signal processing steps and display options have been described for giving real-time feedback about the quality of pronunciation for 10 steady-state American English monophthong vowels (/aa/, /iy/,...
Conference Paper
Full-text available
A visual speech training aid for persons with hearing impairments has been developed using a Windows-based multimedia computer. The training aid provides real time visual feedback as to the quality of pronunciation for 10 steady-state American English monophthong vowels (/aa/, /iy/, /uw/, /ae/, /er/, /ih/, /eh/, /ao/, /ah/, and /uh/). This training...
Article
Full-text available
This paper describes the development of a question model to be used with an intelligent questioning system. The purpose of the intelligent questioning system is to improve the educational process in engineering courses by allowing students to learn more in less time, to understand more deeply, and to enjoy their learning experience. Key elements of...
Conference Paper
In this paper, we present a pitch detection algorithm that is extremely robust for both high quality and telephone speech. The kernel method for this algorithm is the "NCCF or normalized cross correlation" (Talkin (1995)). Major innovations include: processing of the original acoustic signal and a nonlinearly processed version of the signal to part...
Article
Although there has been considerable research on the development and use of assessment instruments to measure the effectiveness of various pedagogical approaches to teaching introductory physics classes (Hestenes et al. 1, Hestenes et al. 2, Hake 3, Saul et al. 4) and other science courses (for example, see Vosniadou 5), there is relatively little s...
Article
Full-text available
This paper describes speech signal modeling techniques which are well-suited to high performance and robust isolated word recognition. We present new techniques for incorporating spectral/temporal information as a function of the temporal position within each word. In particular, spectral/temporal parameters are computed using both variable length...
Conference Paper
Full-text available
This paper presents an investigation of non-uniform time sampling methods for spectral/temporal feature extraction for use in automatic speech recognition. In most current methods for signal modeling of speech information, "dynamic" features are determined from frame-based parameters using a fixed time sampling, i.e., fixed block length and fixed b...
Conference Paper
Full-text available
Spectral feature computations continue to be a very difficult problem for accurate machine recognition of vowels especially in the presence of noise or for otherwise degraded acoustic signals. In this work, a new peak envelope method for vowel classification is developed, based on a missing frequency components model of speech recognition. Accordin...
Article
Full-text available
This paper describes the approach taken to prepare Old Dominion University's undergraduate computer engineering curriculum for technology-based delivery. In order to improve on methods for student learning, technology is now being developed for use in both the classroom and for distance education. To accomplish this, the curriculum content is organ...
Article
This paper describes the approach taken to prepare Old Dominion University's undergraduate computer engineering curriculum for technology based delivery. Old Dominion University is one of the largest deliverers of engineering technology undergraduate programs via its TELETECHNET delivery system. Based on the positive experiences gained, it will now...
Article
A novel pattern classification technique and a new feature extraction method are described and tested for vowel classification. The pattern classification technique partitions an N-way classification task into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a neural network classifier that is trained to d...
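The partition described above, N*(N-1)/2 two-way tasks combined into one N-way decision, is essentially one-vs-one classification with majority voting. A minimal sketch of the voting scheme (the pairwise "networks" are replaced here by a toy nearest-prototype rule; all names are illustrative):

```python
from itertools import combinations

def pairwise_vote(classes, pair_scorer, x):
    """Binary-pair partitioning: split an N-way decision into
    N*(N-1)/2 two-way decisions, one per class pair, then pick the
    class with the most pairwise wins. `pair_scorer(a, b, x)` stands
    in for a small two-way classifier (a tiny neural network in the
    paper's setting) and returns the winning class of the pair."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):    # N*(N-1)/2 pairs
        votes[pair_scorer(a, b, x)] += 1
    return max(classes, key=lambda c: votes[c])

# Toy stand-in: each class is a 1-D prototype, and each pairwise
# "classifier" picks the prototype closer to the input.
protos = {"aa": 1.0, "iy": 3.0, "uw": 5.0}
closer = lambda a, b, x: a if abs(x - protos[a]) <= abs(x - protos[b]) else b
print(pairwise_vote(list(protos), closer, 2.8))   # iy
```

Each pairwise classifier only ever sees data from its two classes, which is what makes the individual networks small and easy to train.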
Conference Paper
Full-text available
This paper presents speech signal modeling techniques which are well suited to high performance and robust isolated word recognition. Speech is encoded by a discrete cosine transform of its spectra, after several preprocessing steps. Temporal information is then also explicitly encoded into the feature set. We present a new technique for incorporat...
Conference Paper
Full-text available
This paper describes the approach taken to prepare Old Dominion University's undergraduate computer engineering curriculum for technology based delivery. Old Dominion University is one of the largest deliverers of engineering technology undergraduate programs via its TELETECHNET delivery system. Based on the positive experiences gained, it will now...
Article
A neural network algorithm for speaker identification with large groups of speakers is described. This technique is derived from a technique in which an N-way speaker identification task is partitioned into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a small neural network which is a two-way, or pair-...
Conference Paper
Full-text available
Spectral/temporal segment features are adapted for isolated word recognition and tested with the entire English alphabet set using Hidden Markov Models. The ISOLET database from OGI and the HTK toolkit from Cambridge University were used to test our feature extraction technique. With our feature set we were able to achieve 97.3% recognition accurac...
Article
A method is described and tested for spectral analysis of vowels using window lengths that are integer multiples of the pitch period for voiced signals. For segments of vowels selected from steady-state vowels, the average fundamental frequency is first determined using "standard" autocorrelation methods. The acoustic signal is then analyzed agai...
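The arithmetic behind a pitch-synchronous window length is simple: one period spans fs/F0 samples, and the window is an integer multiple of that. A short sketch of the calculation (illustrative only, not this paper's analysis code):

```python
def pitch_synchronous_window(fs, f0, periods=3):
    """Window length, in samples, equal to an integer number of pitch
    periods. Aligning the window to the period makes harmonic spacing
    line up with analysis bins, so spectral estimates are less
    sensitive to where the window happens to start."""
    return periods * round(fs / f0)

# e.g. at 16 kHz and F0 = 125 Hz, one period is exactly 128 samples:
print(pitch_synchronous_window(16000, 125.0, periods=2))   # 256
```

In practice F0 is estimated first (e.g. by autocorrelation, as the abstract describes), so the window length varies token by token.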
Article
Improvements to the vowel articulation training aid described in a previous paper [Zimmer, Zahorian, and Auberg, J. Acoust. Soc. Am. 101, 3199(A) (1997)] have been made. The system uses a standard Windows 95/NT compatible sound card on a multimedia PC to provide continuous feedback about articulation for ten American English monophthong vowels in tw...
Conference Paper
A vowel training aid system for hearing impaired persons which uses a Windows-based multimedia computer has been developed. The system provides two main displays which give visual feedback for vowels spoken in isolation and short word contexts. Feature extraction methods and neural network processing techniques provide a high degree of accuracy for...
Conference Paper
This paper presents methods and experimental results for phonetic classification using 39 phone classes and the NIST recommended training and test sets for NTIMIT and TIMIT. Spectral/temporal features which represent the smoothed trajectory of FFT derived speech spectra over 300 ms intervals are used for the analysis. Classification tests are made...
Article
A computer‐based system which provides real‐time feedback for vowel articulation training for the hearing impaired is described. This system is a revised version of the training aid described in previous papers [Zahorian and Correal, J. Acoust. Soc. Am. 95, 3014(A) (1994); Beck and Zahorian, ICASSP‐92, II–241‐244]. Revised feature extraction and cl...
Article
In recent years, a significant amount of clinical research has been devoted to establishing the relationship between variations in fetal heart rate and abnormal prenatal conditions. For example, it has been shown that the progression of fetal asphyxia is closely linked to recognizable changes in fetal heart rate patterns. Currently, fetal heart rate monit...
Conference Paper
The authors present an approach for efficiently computing a compact temporal/spectral feature set for representing a segment of speech, with effective resolution depending on both frequency and time position within the segment. The goal is to mimic the resolution properties of the human auditory system, but using a computationally efficient FFT-bas...
Article
Full-text available
A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language system...
Conference Paper
Experiments in modeling speech signals for phoneme classification are described. Enhancements to standard speech processing methods include basis vector representations of dynamic feature trajectories, morphological smoothing (dilation) of spectral features, and the use of many closely spaced, short analysis windows. Results are reported from exper...
Article
Acoustic methods for monitoring fetal heart rate are potentially advantageous over ultrasound methods, since, eventually, long‐term low‐cost monitoring would be possible for high risk mothers, with no concern for fetal damage due to the monitoring device. However, sensor design and signal processing requirements are very demanding in this low signa...
Article
Full-text available
A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language system...
Article
Full-text available
Neural networks are often used as a powerful discriminating classifier for tasks in automatic speech recognition. They have several advantages over parametric classifiers such as those based on a Mahalanobis distance measure, since no (possibly invalid) assumptions are made about data distributions. However, there are disadvantages in terms of amoun...
Article
A wavelet packet transform is described to compute N spectral/temporal features for the 6 English stop consonants /b,p,d,t,g,k/. These features were used by a Binary Pair Partitioned neural network for speaker-independent classification of the stop consonants. The wavelet packet transform is generated by a pair of quadratic mirror filters which dec...
Article
A neural network technique is presented for the task of speaker verification. Using this technique each speaker is represented by a statistical profile derived from a vector of relative 'distances' to a set of other speakers. This profile uniquely represents each speaker in a multidimensional speaker space. The technique first uses binary-pair part...
Conference Paper
A novel representation for speech signals is proposed. The time-varying frequency content of a speech segment is represented as a weighted sum of two-dimensional basis vectors; these incorporate both frequency warping and frequency-dependent time warping. This is quite flexible; for example, any arbitrary time or frequency warping function can easi...
Conference Paper
A novel signal modeling technique is described to compute smoothed time-frequency features for encoding speech information. These time-frequency features compactly and accurately model phonetic information, while accounting for the main effects of contextual variations. These segment-level features are computed such that more emphasis is given to t...
Article
There were three primary objectives for this task: (1) The investigation of the feasibility of making the fetal heart rate monitor portable, using a laptop computer; (2) Improvements in the signal processing for the monitor; and (3) Implementation of a real-time hardware software system. These tasks have been completed as discussed in the following...
Article
In previous work, signal processing algorithms were presented for real-time mapping of vowels to a two-dimensional display for articulation training [Beck and Zahorian, ICASSP-92, II, 241-244]. In this paper, a series of tests designed to evaluate the effectiveness of this display for vowel training are reported. Experiments were conducted with five...
Article
In previous synthesis experiments with multi‐tone vowels it was found that vowel perception is more closely correlated with acoustic cues derived from the envelope of the magnitude spectrum than with cues derived from the overall magnitude spectrum [Zahorian et al., J. Acoust. Soc. Am. 93, 2298–2299 (1993)]. In the present study automatic vowel cla...
Article
A clustering algorithm for speaker identification based on neural networks is described. This technique is modeled after a previously developed technique in which an N-way speaker identification task is partitioned into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a small size neural network which is a...
Article
A method is presented for the application of binary-pair partitioned neural networks in the task of speaker verification. The binary-pair partitioned neural network is a previously developed technique used for speaker identification [1]. The training and evaluation procedures are discussed, as well as the selection of the verification thresholds. F...
Article
The goal of automatic pattern recognition is generally to minimize probability of error. Therefore, networks trained specifically to minimize classification errors (MME) have been advocated (Gish, 1990, 1992; Telfer and Szu, 1992) as more suitable for pattern classification than networks trained to minimize mean square error (LMS). In this paper we...
Article
The first three formants, i.e., the first three spectral prominences of the short-time magnitude spectra, have been the most commonly used acoustic cues for vowels ever since the work of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)]. However, spectral shape features, which encode the global smoothed spectrum, provide a more complete...
Article
In previous experiments for which multiple tone stimuli were synthesized such that either the formants or global spectral shape were matched to that of naturally spoken vowel tokens, it was found for both cases that vowel identity and quality was not well preserved [S. A. Zahorian and Z.‐J. Zhong, J. Acoust. Soc. Am. 92, 2414–2415 (1992)]. In the p...
Article
Vowel tokens were synthesized from sinusoids using two methods. For the first method (formant sinusoids), three variable frequency sinusoids were used, with frequencies adjusted to match the first three formants extracted from naturally spoken vowels. The amplitudes were filtered at −12 dB/oct, to approximate the roll‐off of the glottal source. For...
