Paul Mueller’s research while affiliated with University of Pennsylvania and other places


Publications (36)


Fig. 1. Block diagram outlining spectral conversion for a parallel and nonparallel corpus within the IWF framework. Nonparallel training is achieved by adaptation of the parameters derived from parallel training of a different speaker and noise conditions.
Fig. 2. Resulting ASSNR (dB) for different values of input SNR (white noise), for the five cases tested, i.e., perfect prediction (ideal error), the iterative Wiener filter (IWF), spectral conversion for IWF (SC-IWF, parallel corpus), spectral conversion by adaptation for IWF (SC-Adapt-IWF, nonparallel corpus), and spectral subtraction.  
Fig. 3. ASSNR (decibels) for different values of input SNR (car noise), for the five cases tested, i.e., perfect prediction (ideal error), KEMI, spectral conversion followed by KEMI (SC-KEMI, parallel corpus), spectral conversion by adaptation followed by KEMI (SC-KEMI-Adapt, nonparallel corpus), and LSAE.  
Fig. 4. Results from the DCR listening test, for an input SNR of 0 dB (car noise).
Fig. 5. Spectrograms of (a) the clean speech signal "The angry boy answered," (b) the noisy speech with 0 dB SNR, and the enhanced speech processed by (c) the IWF algorithm, (d) IWF preceded by perfect prediction (ideal case), (e) IWF preceded by parallel conversion, (f) IWF preceded by nonparallel conversion, (g) the KEMI algorithm, (h) KEMI preceded by perfect prediction (ideal case), (i) KEMI preceded by parallel conversion, (j) KEMI preceded by nonparallel conversion.
A Spectral Conversion Approach to Single-Channel Speech Enhancement
  • Article
  • Full-text available

June 2007 · 177 Reads · 19 Citations · IEEE Transactions on Audio Speech and Language Processing

Paul Mueller

In this paper, a novel method for single-channel speech enhancement is proposed, which is based on a spectral conversion feature denoising approach. Spectral conversion has been applied previously in the context of voice conversion, and has been shown to successfully transform spectral features with particular statistical properties into spectral features that best fit (under the constraint of a piecewise linear transformation) different target statistics. This spectral transformation is applied as an initialization step to two well-known single-channel enhancement methods, namely the iterative Wiener filter (IWF) and a particular iterative implementation of the Kalman filter. In both cases, spectral conversion is shown here to provide a significant improvement over initializations using the spectral features taken directly from the noisy speech. In essence, the proposed approach allows these two algorithms to be applied in a user-centric manner, when "clean" speech training data are available from a particular speaker. The extra step of spectral conversion is shown to offer a significant output signal-to-noise ratio (SNR) improvement over the conventional initializations, reaching 2 dB for the IWF and 6 dB for the Kalman filtering algorithm at low input SNRs, for white and colored noise, respectively.
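The piecewise linear transformation mentioned in the abstract is typically realized with a joint GMM over time-aligned source/target spectral features. The sketch below is a minimal illustration under that assumption; the function names, the use of scikit-learn, and the mixture size are illustrative choices, not the paper's implementation.

# Illustrative Python sketch: GMM-based piecewise linear spectral conversion.
# Assumes a parallel corpus of time-aligned source/target feature frames.
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def train_joint_gmm(src_feats, tgt_feats, n_components=8, seed=0):
    """Fit a GMM on stacked [source; target] feature vectors (frames x 2d)."""
    joint = np.hstack([src_feats, tgt_feats])
    return GaussianMixture(n_components=n_components,
                           covariance_type="full",
                           random_state=seed).fit(joint)

def convert_frame(gmm, x):
    """Map one source frame x (length d) to an estimated target frame."""
    d = len(x)
    # Posterior probability of each component given the source part only
    lik = np.array([gmm.weights_[i] *
                    multivariate_normal.pdf(x, gmm.means_[i][:d],
                                            gmm.covariances_[i][:d, :d])
                    for i in range(gmm.n_components)])
    post = lik / lik.sum()
    y_hat = np.zeros(d)
    for i in range(gmm.n_components):
        mu_x, mu_y = gmm.means_[i][:d], gmm.means_[i][d:]
        S_xx = gmm.covariances_[i][:d, :d]
        S_yx = gmm.covariances_[i][d:, :d]
        # Piecewise linear regression: each component contributes a linear map
        y_hat += post[i] * (mu_y + S_yx @ np.linalg.solve(S_xx, x - mu_x))
    return y_hat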


Fig. 1. Block diagram outlining spectral conversion for a parallel and nonparallel corpus. In the latter case, spectral conversion is preceded by adaptation of the derived parameters from the parallel corpus to the nonparallel corpus.
Fig. 2. Normalized error (a) when using different number of adaptation parameters (0 corresponds to no adaptation) and (b) for various choices of training dataset (see Table III). The dashed line corresponds to the error when a parallel corpus is used for training. The dashed–dotted line corresponds to no adaptation.  
Nonparallel Training for Voice Conversion Based on a Parameter Adaptation Approach

June 2006 · 140 Reads · 118 Citations · IEEE Transactions on Audio Speech and Language Processing

The objective of voice conversion algorithms is to modify the speech of a particular source speaker so that it sounds as if spoken by a different target speaker. Current conversion algorithms employ a training procedure during which the same utterances spoken by both the source and target speakers are needed for deriving the desired conversion parameters. Such a (parallel) corpus is often difficult or impossible to collect. Here, we propose an algorithm that relaxes this constraint, i.e., the training corpus need not contain the same utterances from both speakers. The proposed algorithm is based on speaker adaptation techniques, adapting the conversion parameters derived for a particular pair of speakers to a different pair, for which only a nonparallel corpus is available. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30%. A speaker identification measure is also employed that more insightfully portrays the importance of adaptation, while listening tests confirm the success of our method. Both the objective and subjective tests employed demonstrate that the proposed algorithm achieves results comparable to the ideal case in which a parallel corpus is available.
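The general mechanism of adapting pretrained conversion parameters with unaligned data can be illustrated with a much simpler stand-in: MAP-style re-estimation of the GMM means using data from the new speaker. The paper's constrained maximum-likelihood adaptation differs in detail; the sketch below, including the relevance factor and the decision to adapt means only, is an assumption made for illustration.

# Illustrative Python sketch: MAP-style adaptation of GMM means to a new
# speaker using unaligned (nonparallel) data. Not the paper's algorithm.
import numpy as np
from scipy.stats import multivariate_normal

def map_adapt_means(weights, means, covs, new_data, relevance=16.0, n_iter=5):
    """weights: (K,), means: (K, d), covs: (K, d, d); new_data: (N, d).

    Only the means move; weights and covariances stay fixed, a common
    simplification when little adaptation data is available.
    """
    means = means.copy()
    for _ in range(n_iter):
        # E-step: responsibility of each component for each new frame
        lik = np.stack([weights[k] *
                        multivariate_normal.pdf(new_data, means[k], covs[k])
                        for k in range(len(weights))], axis=1)      # (N, K)
        resp = lik / lik.sum(axis=1, keepdims=True)
        # M-step: interpolate data statistics with the prior (pretrained) means
        n_k = resp.sum(axis=0)                                       # (K,)
        x_bar = (resp.T @ new_data) / np.maximum(n_k[:, None], 1e-10)
        alpha = n_k / (n_k + relevance)
        means = alpha[:, None] * x_bar + (1 - alpha[:, None]) * means
    return means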


Combined Software/Hardware Implementation of a Filterbank Front-End for Speech Recognition

November 2005 · 7 Reads

Yuan Cao · Shehzad Khan · [...] · Paul Mueller

In this paper, a cost-effective implementation of a programmable filterbank front-end for speech recognition is presented. The objective has been to design a real-time bandpass filtering system with a filterbank of 16 filters, with analog audio input and analog output. The output consists of 16 analog signals, which are the envelopes of the filter outputs of the audio signal. These analog signals are then fed to an analog neural computer, which performs the feature-based recognition task. One of the main objectives has been to allow the user to easily change the filter specifications without affecting the rest of the system; thus, a software implementation of the filterbank was preferred. In addition, the neural computer requires analog input. Therefore, we implemented the filterbank on a PC, with the input A/D and the output D/A performed by the PC stereo soundcard. Since multiple analog outputs are necessary for the neural computer (one per filter), the soundcard output contains the multiplexed 16 filter outputs, and a hardware module is needed to demultiplex the soundcard output into the final 16 analog signals.
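A software filterbank of this kind can be prototyped in a few lines. The sketch below builds 16 roughly Bark-spaced bandpass filters and extracts their envelopes by rectification and lowpass smoothing; the center frequencies, bandwidths, and filter orders are assumptions for illustration, not the specifications of the reported system.

# Illustrative Python sketch: 16-band bandpass filterbank with envelope outputs.
import numpy as np
from scipy.signal import butter, lfilter

def bark_center_freqs(n_bands=16, f_lo=200.0, f_hi=5000.0):
    """Roughly Bark-spaced center frequencies between f_lo and f_hi (Hz)."""
    bark = lambda f: 6.0 * np.arcsinh(f / 600.0)
    inv = lambda b: 600.0 * np.sinh(b / 6.0)
    return inv(np.linspace(bark(f_lo), bark(f_hi), n_bands))

def filterbank_envelopes(x, fs, n_bands=16, bw_ratio=0.15, env_cut=50.0):
    """Bandpass each channel, rectify, and lowpass-smooth to get 16 envelopes."""
    envs = []
    b_lp, a_lp = butter(2, env_cut / (fs / 2))            # envelope smoother
    for fc in bark_center_freqs(n_bands):
        lo, hi = fc * (1 - bw_ratio), fc * (1 + bw_ratio)
        b, a = butter(2, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        band = lfilter(b, a, x)
        envs.append(lfilter(b_lp, a_lp, np.abs(band)))    # full-wave rectification
    return np.stack(envs)                                  # shape: (16, len(x))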


A Spectral Conversion Approach to Feature Denoising and Speech Enhancement

September 2005 · 44 Reads · 1 Citation

In this paper we demonstrate that spectral conversion can be successfully applied to the speech enhancement problem as a feature denoising method. The enhanced spectral features can be used in the context of the Kalman filter for estimating the clean speech signal. In essence, instead of estimating the clean speech features and the clean speech signal using the iterative Kalman filter, we show that it is more efficient to first estimate the clean speech features from the noisy speech features using spectral conversion (trained on a speech corpus) and then apply the standard Kalman filter. Our results show an average improvement over the iterative Kalman filter that can reach 6 dB in the average segmental output signal-to-noise ratio (SNR) at low input SNRs.
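The Kalman filtering step referred to above can be summarized as follows: the clean speech is modeled as an AR(p) process, the state collects the last p clean samples, and the noisy samples are scalar observations. The sketch below shows only this core recursion for given AR coefficients and noise variances; in the approach described in the abstract those quantities would come from the converted (denoised) spectral features rather than from iterating on the noisy signal, and the variable names here are illustrative.

# Illustrative Python sketch: Kalman filter with an AR(p) clean-speech model.
import numpy as np

def kalman_enhance(y, ar_coeffs, sigma_u2, sigma_v2):
    """y: noisy samples; ar_coeffs a[1..p] of x[n] = sum_k a[k] x[n-k] + u[n];
    sigma_u2: driving-noise variance; sigma_v2: additive-noise variance."""
    p = len(ar_coeffs)
    # State transition in companion form; the observation picks the newest sample
    F = np.zeros((p, p))
    F[0, :] = ar_coeffs
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p)); H[0, 0] = 1.0
    Q = np.zeros((p, p)); Q[0, 0] = sigma_u2
    x = np.zeros((p, 1))                    # initial state estimate
    P = np.eye(p) * sigma_v2                # initial state uncertainty
    out = np.zeros_like(y, dtype=float)
    for n, yn in enumerate(y):
        # Predict
        x = F @ x
        P = F @ P @ F.T + Q
        # Update with the noisy observation
        S = H @ P @ H.T + sigma_v2
        K = P @ H.T / S
        x = x + K * (yn - (H @ x).item())
        P = (np.eye(p) - K @ H) @ P
        out[n] = x[0, 0]                    # current clean-speech estimate
    return out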


A Spectral Conversion Approach to the Iterative Wiener Filter for Speech Enhancement

June 2004 · 10 Reads · 1 Citation

The iterative Wiener filter (IWF) for speech enhancement in additive noise is an effective algorithm that is simple to implement. One of its main disadvantages is the lack of a proper convergence criterion, which has been shown to introduce severe degradation into the estimated clean signal. Here, an improvement of the IWF algorithm is proposed for the case when additional information is available about the signal to be enhanced. If a small amount of clean speech data is available, spectral conversion techniques can be applied to estimate the clean short-term spectral envelope of the speech signal from the noisy signal, with significant noise reduction. Our results show an average improvement over the original IWF that can reach 2 dB in the segmental output signal-to-noise ratio (SNR) at low input SNRs, which is perceptually significant.
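For reference, one frame of a bare-bones iterative Wiener filter looks like the sketch below: an all-pole (LPC) model is fit to the current clean-speech estimate, a Wiener gain is formed from that spectrum plus an assumed-known noise PSD, and the frame is re-filtered. The proposed method would, in effect, replace the initial fit to the noisy frame with the spectrally converted envelope; the frame handling, filter order, and LPC solver here are illustrative simplifications.

# Illustrative Python sketch: one frame of an iterative Wiener filter.
import numpy as np

def lpc(x, order):
    """Autocorrelation-method LPC coefficients and prediction-error variance."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    g2 = r[0] - a @ r[1:order + 1]
    return a, max(g2, 1e-12)

def iterative_wiener(y, noise_psd, order=12, n_iter=3, nfft=512):
    """y: one noisy frame (1-D array); noise_psd: nfft-point noise spectrum
    (or a scalar for white noise), assumed known for this sketch."""
    x_hat = y.copy()
    for _ in range(n_iter):
        a, g2 = lpc(x_hat, order)
        # All-pole estimate of the clean-speech PSD from the current iterate
        w = np.exp(-2j * np.pi * np.outer(np.arange(1, order + 1),
                                          np.arange(nfft)) / nfft)
        A = 1.0 - a @ w
        speech_psd = g2 / np.abs(A) ** 2
        H = speech_psd / (speech_psd + noise_psd)          # Wiener gain
        Y = np.fft.fft(y, nfft)
        x_hat = np.real(np.fft.ifft(H * Y))[:len(y)]       # re-filtered frame
    return x_hat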


Fig. 1. Block diagram outlining spectral conversion for a parallel and non-parallel corpus. In the latter case, spectral conversion is preceded by adaptation of the derived parameters from the parallel corpus to the non-parallel corpus.  
Non-Parallel Training for Voice Conversion by Maximum Likelihood Constrained Adaptation

May 2004 · 30 Reads · 7 Citations

The objective of voice conversion methods is to modify the speech characteristics of a particular speaker in such a manner that it sounds like speech by a different target speaker. Current voice conversion algorithms derive a conversion function by estimating its parameters from a corpus that contains the same utterances spoken by both speakers. Such a corpus, usually referred to as a parallel corpus, has the disadvantage that it is often difficult or even impossible to collect. Here, we propose a voice conversion method that does not require a parallel corpus for training, i.e., the utterances spoken by the two speakers need not be the same; speaker adaptation techniques are employed to adapt the conversion parameters derived from a different pair of speakers to the particular pair of source and target speakers at hand. We show that adaptation reduces the error obtained when simply applying the conversion parameters of one pair of speakers to another by a factor that can reach 30% in many cases, with performance comparable to the ideal case in which a parallel corpus is available.


Fig. (1) Block diagram of the auditory-based front-end processing system.
Robust Auditory-Based Speech Processing Using the Average Localized Synchrony Detection

August 2002 · 78 Reads · 50 Citations · IEEE Transactions on Speech and Audio Processing

A new auditory-based speech processing system based on the biologically rooted property of the average localized synchrony detection (ALSD) is proposed. The system detects periodicity in the speech signal at Bark-scaled frequencies while reducing the response's spurious peaks and sensitivity to implementation mismatches, and hence presents a consistent and robust representation of the formants. The system is evaluated for its formant extraction ability while reducing spurious peaks. It is compared with other auditory-based and traditional systems in the tasks of vowel and consonant recognition on clean speech from the TIMIT database and in the presence of noise. The results illustrate the advantage of the ALSD system in extracting the formants and reducing the spurious peaks. They also indicate the superiority of the synchrony measures over the mean-rate in the presence of noise.
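As a rough intuition for synchrony-based measures (not the ALSD itself), the toy score below gauges how strongly one auditory channel's output repeats at its characteristic period by averaging a normalized autocorrelation over a few candidate delays. The actual ALSD combines localized synchrony detectors in a more principled way to suppress spurious peaks; everything in this sketch is an illustrative assumption.

# Illustrative Python sketch: toy periodicity (synchrony) score for one channel.
import numpy as np

def synchrony_at(channel_out, fs, fc, n_periods=4):
    """Average normalized autocorrelation of a channel's output at delays that
    are multiples of its characteristic period 1/fc (loose stand-in only)."""
    x = channel_out - channel_out.mean()
    denom = np.dot(x, x) + 1e-12
    scores = []
    for k in range(1, n_periods + 1):
        lag = int(round(k * fs / fc))
        if lag >= len(x):
            break
        scores.append(np.dot(x[lag:], x[:-lag]) / denom)
    return float(np.mean(scores)) if scores else 0.0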


Fig. 1. Block diagram of the stop recognition system used.  
Fig. 2. Block diagram of an auditory-based front-end system.  
Fig. 3. Algorithm for voicing detection of stop consonants.  
Fig. 4. Two-dimensional space preliminary classification regions for (a) unvoiced stops and (b) voiced stops. Zero VF2 corresponds to the absence of a following vowel's second formant. Alveolars (+), velars (*), and labials (o). It is clear that alveolars and velars show better clustering than labials.
Fig. 5. Hard-decision algorithm for the place of articulation detection of stops. Condition A in the figure is (LINP>LINP_THHI), condition B is (LINP<LINP_THLO), and condition C is [NOT (A OR B)]. MNSS_TH, DRHF_TH, LINP_THHI, and LINP_THLO are the threshold values.  
Acoustic-Phonetic Features for the Automatic Classification of Stop Consonants

December 2001 · 574 Reads · 73 Citations · IEEE Transactions on Speech and Audio Processing

In this paper, the acoustic-phonetic characteristics of the American English stop consonants are investigated. Features studied in the literature are evaluated for their information content and new features are proposed. A statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of stops in speaker-independent continuous speech is proposed. The system uses a new auditory-based front-end processing stage and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Recognition experiments are performed using hard-decision algorithms on stops extracted from TIMIT continuous speech of 60 speakers (not used in the design process) from seven different dialects of American English. An accuracy of 96% is obtained for voicing detection, 90% for place of articulation detection, and 86% for the overall classification of stops.
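The "hard decision" style of classification can be pictured as a small threshold tree over acoustic-phonetic measurements, in the spirit of the conditions A, B, and C on LINP described in the Fig. 5 caption above. The sketch below is purely illustrative: the mapping from conditions to places of articulation and every threshold value are placeholders, not the published rules.

# Illustrative Python sketch of a hard-decision (threshold-tree) classifier.
# linp, mnss, drhf stand in for the paper's acoustic-phonetic measurements;
# the branch-to-place mapping and all thresholds below are placeholders.
def classify_place(linp, mnss, drhf,
                   linp_thhi=0.6, linp_thlo=0.3, mnss_th=0.5, drhf_th=0.4):
    if linp > linp_thhi:                      # condition A
        return "labial"                       # placeholder outcome for this branch
    if linp < linp_thlo:                      # condition B
        return "velar" if mnss > mnss_th else "alveolar"   # placeholder rule
    # condition C: neither A nor B; fall back to another feature
    return "velar" if drhf > drhf_th else "alveolar"        # placeholder rule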


Fig. (2) Accuracy of the stop place detection in the presence of additive white Gaussian noise using either spectral slopes (o) or spectral peaks (x) to extract the BF.  
Confusion matrix for voicing detection on 1200 stops from 60 speakers. Accuracy is 96%.
Robust Classification of Stop Consonants Using Auditory-Based Speech Processing

October 2001 · 70 Reads · 12 Citations · Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing

In this work, a feature-based system for the automatic classification of stop consonants, in speaker independent continuous speech, is reported. The system uses a new auditory-based speech processing front-end that is based on the biologically rooted property of average localized synchrony detection (ALSD). It incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved, statistically, to be rich in their information content. The experiments are performed on stop consonants extracted from the TIMIT database with additive white Gaussian noise at various signal-to-noise ratios. The obtained classification accuracy compares favorably with previous work. The results also showed a consistent improvement of 3% in the place detection over the Generalized Synchrony Detector (GSD) system under identical circumstances on clean and noisy speech. This illustrates the superior ability of the ALSD to suppress the spurious peaks and produce a consistent and robust formant (peak) representation.


Acoustic-phonetic features for the automatic classification of fricatives

June 2001 · 122 Reads · 74 Citations · The Journal of the Acoustical Society of America

In this article, the acoustic-phonetic characteristics of the American English fricative consonants are investigated from the automatic classification standpoint. The features studied in the literature are evaluated and new features are proposed. To test the value of the extracted features, a statistically guided, knowledge-based, acoustic-phonetic system for the automatic classification of fricatives in speaker-independent continuous speech is proposed. The system uses an auditory-based front-end processing system and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features that proved to be rich in their information content. Classification experiments are performed using hard-decision algorithms on fricatives extracted from the TIMIT database continuous speech of 60 speakers (not used in the design/training process) from seven different dialects of American English. An accuracy of 93% is obtained for voicing detection, 91% for place of articulation detection, and 87% for the overall classification of fricatives.


Citations (21)


... A variety of foveated imaging systems have been developed, and they can be classified into two basic types depending on how the foveated images are obtained: optical system design-based and digital image processing-based (main methods of back-end data processing) [2]. Devices based on optical structures obtain multi-resolution images through special imaging detectors or optical elements, including non-uniform imaging detectors [3,4], spatial light modulators (SLMs) [5,6], liquid crystal lenses [7], and direct superposition of FOV with different focal lengths [8,9]. However, such systems are normally complicated, expensive, limited in FOV expansion, and have difficult fovea adjustment. ...

Reference:

Flexible foveated imaging using a single Risley-prism imaging system
A Foveated Silicon Retina for Two-Dimensional Tracking
  • Citing Article
  • June 2000

... Taking inspiration from both the theoretical properties and the empirically observed behavior of Seneff's Generalized Synchrony Detector (GSD) (Seneff, 1984, 1988), we propose a kind of local spectral energy normalization to compensate for variations in channel frequency response. We identify, and offer a solution for, a potential problem with the behavior of Seneff's model which may explain why some previous attempts to use the model directly in speech recognition applications (summarized in Section 2.2.1) have demonstrated only limited improvement in accuracy (Jankowski and Lippmann, 1992; Ohshima and Stern, 1994; Jankowski et al., 1995; Ali et al., 2000, 2002; Kim et al., 2006; Stern and Morgan, 2012a). ...

Auditory-based speech processing based on the average localized synchrony detection
  • Citing Article
  • June 2000

... Voice transformation refers to modifications of the non-linguistic characteristics (voice quality, voice individuality) of a given utterance without affecting its textual content. There is a substantial literature on voice de-identification for both the text-dependent [98] and text-independent [91] cases. De-identification in AV biometrics has received little attention compared to the individual cues. ...

Non-Parallel Training for Voice Conversion by Maximum Likelihood Constrained Adaptation

... However, all circuit components tested successfully when a light source is used as the target. Additional measured data can be found in (Etienne-Cummings, 1995). Further work will improve the contrast sensitivity, combat noise, and also consider two-dimensional implementations with target acquisition (saccade) capabilities. ...

Real-time visual target tracking: two implementations of velocity-based smooth pursuit
  • Citing Article
  • June 1995

Proceedings of SPIE - The International Society for Optical Engineering

... (Papananos, Georgantas, and Tsividis, 1997). This is not true, especially for frequencies below 1 kHz, as documented in the literature (Shah, 1993; Deguelle, 1988; Mueller et al., 1989; Steyaert et al., 1991). That literature, though, basically deals with the implementation of very large time constants in CMOS technologies. ...

Design and Fabrication of VLSI Components for a General Purpose Analog Neural Computer
  • Citing Article
  • January 1989

... Several methods were proposed [12] as motivated by auditory-based periodic and aperiodic signal analysis to increase the recognition efficiency. Kajita and Itakura [13] proposed a sub-band auto-correlation technique which can extract spectral information from the speech signal by using filter banks and the auto-correlation detectors. ...

Speech processing using the average localized synchrony detection
  • Citing Article
  • May 2000

The Journal of the Acoustical Society of America

... Furthermore, the current-mode neuromorphic circuits used in the TMS for designing the WTA network are ideal for hardware models of selective attention systems [3]. Several neuromorphic attention systems of this kind have been proposed in the past [4]–[9]. These systems typically contain photo-sensing elements and processing elements on the same focal plane, apply the competitive selection process to visual stimuli sensed and processed by the focal plane processor itself, and perform visual tracking operations. ...

VLSI Model of Primate Visual Smooth Pursuit.
  • Citing Conference Paper
  • January 1995

... And I approximate a temporal Hilbert transform by equalizing the highpass and lowpass output amplitudes, which have a phase difference of exactly 90°. Etienne-Cummings et al.'s implementation of Adelson and Bergen's [40] closely related spatiotemporal energy model is quite similar [41, 42], but it lacks the adaptive temporal dynamics provided by the Hilbert transform. ...

VLSI Implementation of Cortical Visual Motion Detection Using an Analog Neural Computer.
  • Citing Conference Paper
  • January 1996

... Our current endeavours are towards building a database of disordered speech data with the age of speakers ranging across different age groups. We plan to investigate techniques for consonant classification to distinguish between places of articulation more robustly [21]. The method proposed in this paper is language-dependent; we also aim to explore non-linguistic techniques for assessment of articulation errors. ...

Robust Classification of Stop Consonants Using Auditory-Based Speech Processing

Proceedings - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing