
Nelson Morgan, Ph.D.
University of California, Berkeley
About
254 Publications
28,631 Reads
8,319 Citations
Publications (254)
In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search...
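The ATWV metric named above can be illustrated with a small numeric sketch. This follows the general NIST term-weighted-value definition (a miss term plus a false-alarm term weighted by beta = 999.9, with one false-alarm "trial" per second of audio); the function name and input format are illustrative, not from the paper.

```python
# Sketch of "actual term weighted value" (ATWV) as used in keyword search.
# keywords: list of (n_true, n_miss, n_false_alarm) tuples, one per keyword.
# Assumption: one false-alarm trial per second of audio, beta = 999.9.

def atwv(keywords, total_seconds, beta=999.9):
    n_trials = total_seconds  # one trial per second of speech
    twvs = []
    for n_true, n_miss, n_fa in keywords:
        if n_true == 0:
            continue  # keywords with no true occurrences are excluded
        p_miss = n_miss / n_true
        p_fa = n_fa / (n_trials - n_true)
        twvs.append(1.0 - p_miss - beta * p_fa)
    return sum(twvs) / len(twvs)

# A system with no misses and no false alarms scores ATWV = 1.0;
# missing everything (with no false alarms) scores 0.0.
print(atwv([(10, 0, 0), (5, 0, 0)], total_seconds=3600.0))  # -> 1.0
print(atwv([(10, 10, 0)], total_seconds=3600.0))            # -> 0.0
```

Because beta is so large, even a handful of false alarms can push a keyword's TWV far below zero, which is why ATWV-tuned systems threshold detections aggressively.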
Previous work has demonstrated that spectro-temporal Gabor features reduced word error rates for automatic speech recognition under noisy conditions. However, the features based on mel spectra were easily corrupted in the presence of noise or channel distortion. We have exploited an algorithm for power normalized cepstral coefficients (PNCCs) to ge...
Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean, and it matches the condition used for training the models, then there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been littl...
Many feature extraction methods that have been used for automatic speech recognition (ASR) have either been inspired by analogy to biological mechanisms, or at least have similar functional properties to biological or psychoacoustic properties for humans or other mammals. These methods have in many cases provided significant reductions in errors, p...
Introduction; Some Attributes of Auditory Physiology and Perception; “Classic” Auditory Representations; Current Trends in Auditory Feature Analysis; Summary; Acknowledgments; References
When Speech and Audio Signal Processing was published in 1999, it stood out from its competition in its breadth of coverage and its accessible, intuition-based style. This book was aimed at individual students and engineers excited about the broad span of audio processing and curious to understand the available techniques. Since then, with the advent o...
This paper reviews a line of research carried out over the last decade in speech recognition assisted by discriminatively trained, feedforward networks. The particular focus is on the use of multiple layers of processing preceding the hidden Markov model based decoding of word sequences. Emphasis is placed on the use of multiple streams of highly d...
Current speech recognition systems, for example, typically use Gaussian mixture models (GMMs), to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables. Instead of developing more powerful models, most of the research effort has gone into find...
Spectro-temporal filtering has been shown to result in features that can help to increase the robustness of automatic speech recognition (ASR) in the past. We replace the spectro-temporal representation used in previous work with spectrograms that incorporate knowledge about the signal processing of the human auditory system and which are derived f...
We have incorporated spectrotemporal features in a speech activity detection (SAD) task for the Speech in Noisy Environments 2 (SPINE2) data set. The features were generated by applying 2D Gabor filters to the mel spectrogram in order to measure the strength of various spectral and temporal modulation frequencies in different patches of the spectro...
Introduction; The Frame-Fill Concept; Pattern Matching or Vector Quantization; The Kang–Coulter 600-bps Vocoder; Segmentation Methods for Bandwidth Reduction; Exercises
Transparent Audio Coding; Perceptual Masking; Noise Shaping; Some Example Coding Schemes; Summary; Exercises
Introduction; The Predictive Model; Properties of the Representation; Getting the Coefficients; Related Representations; Concluding Discussion; Exercises
Introduction; The Wave Equation for the Vibrating String; Discrete-Time Traveling Waves; Boundary Conditions and Discrete Traveling Waves; Standing Waves; Discrete-Time Models of Acoustic Tubes; Acoustic Tube Resonances; Relation of Tube Resonances to Formant Frequencies; Exercises
Introduction; Isolated Word Recognition; Connected Word Recognition; Segmental Approaches; Discussion; Exercises
Introduction; HMM Training; Forward–Backward Training; Optimal Parameters for Emission Probability Estimators; Viterbi Training; Local Acoustic Probability Estimators for ASR; Initialization; Smoothing; Conclusions; Exercises
Introduction; A Historical Note; The Real Cepstrum; The Complex Cepstrum; Application of Cepstral Analysis to Speech Signals; Concluding Thoughts; Exercises
Introduction; Sound-Pressure Level and Loudness; Frequency Analysis and Critical Bands; Masking; Summary; Exercises
Introduction; The z Transform; Inverse z Transform; Convolution; Sampling; Linear Difference Equations; First-Order Linear Difference Equations; Resonance; Concluding Comments; Exercises
Sources and Mixtures; Evaluating Source Separation; Multi-Channel Approaches; Beamforming with Microphone Arrays; Independent Component Analysis; Computational Auditory Scene Analysis; Model-Based Separation; Conclusions; Exercises
Introduction; Some Examples of Acoustically Generated Musical Sounds; Music Synthesis Concepts; Analysis-Based Synthesis; Other Techniques for Music Synthesis; Reverberation; Several Examples of Synthesis; Exercises
Introduction; Adaptation; Lattice-Based MMI and MPE; Conclusion; Exercises
Introduction; Review of Fletcher's Critical Band Experiments; Threshold Measurements and Filter Shapes; Gamma-Tone Filters, Roex Filters, and Auditory Models; Other Considerations in Filter-Bank Design; Speech Spectrum Analysis Using the FFT; Conclusions; Exercises
Introduction; Concatenative Methods; Statistical Parametric Methods; A Historical Perspective; Speculation; Tools and Evaluation; Exercises; Appendix: Synthesizer Examples
Introduction; Historical Review of Pitch-Perception Models; Physiological Exploration of Place Versus Periodicity; Results from Psychoacoustic Testing and Models; Summary; Exercises
Introduction; Time-Scale Modification; Transformation Without Explicit Pitch Detection; Transformations in Analysis–Synthesis Systems; Speech Modifications in the Phase Vocoder; Speech Transformations Without Pitch Extraction; The Sine Transform Coder as a Transformation Algorithm; Voice Modification to Emulate a Target Voice; Exercises
Introduction; Stating the Problem; Parameterization and Probability Estimation; Conclusion; Exercises
Introduction; A Few Definitions; Class-Related Probability Functions; Minimum Error Classification; Likelihood-Based MAP Classification; Approximating a Bayes Classifier; Statistically Based Linear Discriminants; Iterative Training: The EM Algorithm; Exercises
Introduction; Sequence of Steps in a Plucked or Bowed String Instrument; Vibrations of the Bowed String; Frequency-Response Measurements of the Bridge of a Violin; Vibrations of the Body of String Instruments; Radiation Pattern of Bowed String Instruments; Some Considerations in Piano Design; The Trumpet, Trombone, French Horn, and Tuba; Exercises
Introduction; Sound Waves; Sound Waves in Rooms; Room Acoustics as a Component in Speech Systems; Exercises
Background; Voice-Coding Concepts; Homer Dudley (1898–1981); Exercises; Appendix: Hearing of the Fall of Troy
Introduction; Discriminant Training; HMM–ANN Based ASR; Other Applications of ANNs to ASR; Exercises; Appendix: Posterior Probability Proof
Introduction; The Articulation Index and Human Recognition; Comparisons Between Human and Machine Speech Recognizers; Concluding Thoughts; Exercises
Introduction; General Design of a Speaker Recognition System; Example System Components; Evaluation; Modern Research Challenges; Exercises
The Information in Music Audio; Music Transcription; Note Transcription; Score Alignment; Chord Transcription; Structure Detection; Conclusion; Exercises
Introduction; Vowel Perception: Psychoacoustics and Physiology; The Confusion Matrix; Perceptual Cues for Plosives; Physiological Studies of Two Voiced Plosives; Motor Theories of Speech Perception; Neural Firing Patterns for Connected Speech Stimuli; Concluding Thoughts; Exercises
Introduction; Feature Extraction; Pattern-Classification Methods; Support Vector Machines; Unsupervised Clustering; Conclusions; Exercises; Appendix: Multilayer Perceptron Training
Introduction; A Note on Nomenclature; Pitch Detection, Perception and Articulation; The Voicing Decision; Some Difficulties in Pitch Detection; Signal Processing to Improve Pitch Detection; Pattern-Recognition Methods for Pitch Detection; Smoothing to Fix Errors in Pitch Estimation; Normalizing the Autocorrelation Function; Exercises
Radio Rex; Digit Recognition; Speech Recognition in the 1950s; The 1960s; 1971–1976 ARPA Project; Achieved by 1976; The 1980s in Automatic Speech Recognition; More Recent Work; Some Lessons; Exercises
Von Kempelen; The Voder; Teaching the Operator to Make the Voder “Talk”; Speech Synthesis After the Voder; Music Machines; Exercises
Introduction; Filtering Concepts; Transformations for Digital Filter Design; Digital Filter Design with Bilinear Transformation; The Discrete Fourier Transform; Fast Fourier Transform Methods; Relation Between the DFT and Digital Filters; Exercises
Introduction; General Design of a Speaker Diarization System; Example System Components; Research Challenges; Exercises
The Music Retrieval Problem; Music Fingerprinting; Query by Humming; Cover Song Matching; Music Classification and Autotagging; Music Similarity; Conclusions; Exercises
Introduction; Phones and Phonemes; Phonetic and Phonemic Alphabets; Articulatory Features; Subword Units as Categories for ASR; Phonological Models for ASR; Context-Dependent Phones; Other Subword Units; Phrases; Some Issues in Phonological Modeling; Exercises
Introduction; Acoustic Tube Models of English Phonemes; Excitation Mechanisms in Speech Production; Exercises
Introduction; Common Feature Vectors; Dynamic Features; Strategies for Robustness; Auditory Models; Multichannel Input; Discriminant Features; Discussion; Exercises
Introduction; Voice Excitation and Spectral Flattening; Voice-Excited Channel Vocoder; Voice-Excited and Error-Signal-Excited LPC Vocoders; Waveform Coding with Predictive Methods; Adaptive Predictive Coding of Speech; Subband Coding; Multipulse LPC Vocoders; Code-Excited Linear Predictive Coding; Reducing Codebook Search Time in CELP; Conclusions; Exercises
Introduction; Standards for Digital Speech Coding; Design Considerations in Channel Vocoder Filter Banks; Energy Measurements in a Channel Vocoder; A Vocoder Design for Spectral Envelope Estimation; Bit Saving in Channel Vocoders; Design of the Excitation Parameters for a Channel Vocoder; LPC Vocoders; Cepstral Vocoders; Design Comparisons; Vocoder Standardization; Exercises
Introduction; Phonological Models; Language Models; Decoding With Acoustic and Language Models; A Complete System; Accepting Realistic Input; Concluding Comments
Introduction; Anatomical Pathways From the Ear to the Perception of Sound; The Peripheral Auditory System; Hair Cell and Auditory Nerve Functions; Properties of the Auditory Nerve; Summary and Block Diagram of the Peripheral Auditory System; Exercises
In the last decade, several studies have shown that the robustness of ASR systems can be increased when 2D Gabor filters are used to extract specific modulation frequencies from the input pattern. This paper analyzes important design parameters for spectro-temporal features based on a Gabor filter bank: We perform experiments with filters that exhi...
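The core operation behind the spectro-temporal features discussed above can be sketched as 2D Gabor filtering of a mel spectrogram. The filter size and modulation frequencies below are illustrative, not the design parameters studied in the paper; NumPy and SciPy are assumed.

```python
# Minimal sketch of 2D Gabor filtering of a mel spectrogram: a Gaussian
# envelope multiplied by a complex plane wave with spectral (omega_k)
# and temporal (omega_n) modulation frequencies.
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_k, omega_n, size=(9, 9)):
    k = np.arange(size[0]) - size[0] // 2   # spectral (mel-band) axis
    n = np.arange(size[1]) - size[1] // 2   # temporal (frame) axis
    K, N = np.meshgrid(k, n, indexing="ij")
    envelope = np.exp(-(K**2 + N**2) / (2.0 * (size[0] / 4.0) ** 2))
    carrier = np.exp(1j * (omega_k * K + omega_n * N))
    return envelope * carrier

mel_spec = np.random.rand(40, 100)          # 40 mel bands x 100 frames
filt = gabor_filter(omega_k=0.5, omega_n=0.25)
response = np.abs(convolve2d(mel_spec, filt.real, mode="same"))
print(response.shape)  # -> (40, 100)
```

A real filter bank would tile many (omega_k, omega_n) pairs, each responding to a different spectro-temporal modulation pattern, and the responses would feed the recognizer as features.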
In this paper, we propose a discriminative extension to agglomerative hierarchical clustering, a typical technique for speaker diarization, that fits seamlessly with most state-of-the-art diarization algorithms. We propose to use maximum mutual information with bootstrapping, i.e., initial predictions are used as input for retraining of models in a...
Previous work has shown that spectro-temporal features reduce WER for automatic speech recognition under noisy conditions. The spectro-temporal framework, however, is not the only way to process features in order to reduce errors due to noise in the signal. The two-stage mel-warped Wiener filtering method used in the "Advanced Front End" (AFE), now...
This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy .
To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research t...
This article is the second part of an updated version of the "MINDS 2006-2007 Report of the Speech Understanding Working Group," one of five reports emanating from two workshops entitled "Meeting of the MINDS: Future Directions for Human Language Technology," sponsored by the U.S. Disruptive Technology Office (DTO). (MINDS is an acronym for "machin...
Automatic speech recognition enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This article provides a glimpse of the opportunities and challenges that parallelism provides for automatic speech recognition and related application research fr...
What is a Negative Result? In a sense, well-designed experiments never have a completely negative result, since there is always the opportunity to learn something. In fact, unexpected results by definition provide the most information. Conventionally, negative results refer to those that do not support the hypothesis that an experiment has been des...
Industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Jeopardy for the IT industry means opportunity for the research community. If researchers meet the paral...
We report progress in the use of multi-stream spectro-temporal features for both small and large vocabulary automatic speech recognition tasks. Features are divided into multiple streams for parallel processing and dynamic utilization in this approach. For small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-...
We performed automated feature selection for multi-stream (i.e., ensemble) automatic speech recognition, using a hill-climbing (HC) algorithm that changes one feature at a time if the change improves a performance score. For both clean and noisy data sets (using the OGI Numbers corpus), HC usually improved performance on held-out data compared to...
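The hill-climbing procedure described above (toggle one feature at a time, keep the change only if the score improves) can be sketched as follows. The scoring function here is a toy stand-in for held-out recognition accuracy, and all names are illustrative.

```python
# Illustrative hill-climbing feature selection: flip one feature's
# inclusion at a time; keep the flip only if the score improves.
import random

def hill_climb(n_features, score_fn, seed=0):
    rng = random.Random(seed)
    selected = [True] * n_features          # start from the full set
    best = score_fn(selected)
    order = list(range(n_features))
    rng.shuffle(order)                      # visit features in random order
    for i in order:                         # one pass; real runs iterate
        selected[i] = not selected[i]
        trial = score_fn(selected)
        if trial > best:
            best = trial                    # keep the improving change
        else:
            selected[i] = not selected[i]   # revert
    return selected, best

# Toy score: features 0 and 2 help, feature 1 hurts, so HC drops it.
weights = [1.0, -0.5, 2.0]
score = lambda sel: sum(w for w, s in zip(weights, sel) if s)
sel, best = hill_climb(3, score)
print(sel, best)  # -> [True, False, True] 3.0
```

In the multi-stream setting each "feature" would be a whole stream, and the score would be recognition accuracy on a development set rather than a weighted sum.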
The second part of the updated version of "MINDS 2006-2007 Report of the Speech Understanding Working Group" is presented which came from two workshops entitled "Meeting of the MINDS: Future Directions for Human Language Technology". The specific topics being discussed include: the fundamental science of human speech perception and production; tran...
Our goal in this work was to develop an accurate method to identify laughter segments, ultimately for the purpose of speaker recognition. Our previous work used MLPs to perform frame-level detection of laughter using short-term features, including MFCCs and pitch, and achieved a 7.9% EER on our test set. We improved upon our previous results by...
We describe the large vocabulary automatic speech recognition system developed for Modern Standard Arabic by the SRI/Nightingale team, and used for the 2007 GALE evaluation as part of the speech translation system. We show how system performance is affected by different development choices, ranging from text processing and lexicon to decoding syste...
A multi-stream approach to utilizing the inherently large number of spectro-temporal features for speech recognition is investigated in this study. Instead of reducing the feature-space dimension, this method divides the features into streams so that each represents a patch of information in the spectro-temporal response field. When used in combi...
This paper describes a simple method for significantly improving tandem features used to train acoustic models for large-vocabulary speech recognition. The linear activations at the outputs of an MLP classifier were modified according to known reference labels: where necessary, the activation of the output unit corresponding to the correct phone la...
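The label-based modification of MLP outputs described above can be sketched roughly as follows: where the reference phone's activation is not already the largest, raise it just above the current maximum. The function name and the `margin` parameter are our own illustrations, not details from the paper.

```python
# Sketch: force the reference label's linear activation to win wherever
# the MLP's top output disagrees with the known reference alignment.
import numpy as np

def force_correct(activations, ref_labels, margin=0.1):
    """activations: (frames, phones) linear outputs; ref_labels: per-frame
    reference phone indices. Returns modified activations."""
    a = activations.copy()
    for t, lab in enumerate(ref_labels):
        if np.argmax(a[t]) != lab:
            a[t, lab] = a[t].max() + margin  # lift the correct unit to the top
    return a

acts = np.array([[0.2, 0.9, 0.1],    # frame 0: wrong winner (unit 1)
                 [0.8, 0.1, 0.1]])   # frame 1: already correct
fixed = force_correct(acts, ref_labels=[0, 0])
print(np.argmax(fixed, axis=1))  # -> [0 0]
```

Features derived from such idealized activations give a lower bound on the error achievable with better phone classifiers, which is the kind of diagnostic the paper pursues.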
This paper explores Tandem feature extraction used in a large-vocabulary speech recognition system. In this framework, a multi-layer perceptron estimates phone probabilities which are treated as acoustic observations in a traditional HMM-GMM system. To determine a lower error bound, we simulated an idealized classifier based on alignment of refere...
Automatic speech recognition is the attempt to use a machine to derive the linguistic message from a speech signal.
This chapter describes the English-language SmartKom-Mobile system and related research. We explain the work required to support a second language in SmartKom and the design of the English speech recognizer. We then discuss research carried out on signal processing methods for robust speech recognition and on language analysis using the Embodied Co...
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLP). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the...
In this paper, we present our recent progress on multi-layer perceptron (MLP) based data-driven feature extraction using improved MLP structures. Four-layer MLPs are used in this study. Different signal processing methods are applied before the input layer of the MLP. We show that the first hidden layer of a four-layer MLP is able to detect some ba...
The use of huge databases in ASR has become an important source of ASR system improvements in recent years. However, their use demands an increase of the computational resources necessary to train the recognizers. Several techniques have been proposed in the literature with the purpose of making a better use of these enormous databases by sel...
We have been reducing word error rates (WERs) on conversational telephone speech (CTS) tasks by capturing long-term (~500 ms) temporal information using multi-layered perceptrons (MLPs). In this paper we experiment with an MLP architecture called Tonotopic MLP (TMLP), incorporating two hidden layers. The first of these is tonotopically organized: for...
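A tonotopically organized first hidden layer can be sketched in NumPy: each group of hidden units sees only one critical band's temporal trajectory, and a second, fully connected layer merges across bands. All dimensions and weights below are illustrative placeholders, not the configuration used in the paper.

```python
# Hypothetical sketch of a two-hidden-layer MLP whose first layer is
# band-local ("tonotopic"): units see one band's frames, not the whole
# spectrogram; the second layer is fully connected across bands.
import numpy as np

rng = np.random.default_rng(0)
n_bands, n_frames, units_per_band, n_hidden2, n_phones = 15, 51, 8, 32, 46

w1 = rng.standard_normal((n_bands, n_frames, units_per_band)) * 0.1
w2 = rng.standard_normal((n_bands * units_per_band, n_hidden2)) * 0.1
w3 = rng.standard_normal((n_hidden2, n_phones)) * 0.1

x = rng.standard_normal((n_bands, n_frames))   # log critical-band energies
h1 = np.tanh(np.einsum("bf,bfu->bu", x, w1))   # band-local (tonotopic) layer
h2 = np.tanh(h1.reshape(-1) @ w2)              # fully connected merge layer
logits = h2 @ w3
posteriors = np.exp(logits) / np.exp(logits).sum()   # softmax over phones
print(posteriors.shape)  # -> (46,)
```

The band-local constraint keeps the first layer's parameter count low while still letting it learn long-term temporal patterns within each band, which is the architectural idea behind TMLP.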
Incorporating long-term (500-1000 ms) temporal information using multi-layered perceptrons (MLPs) has improved performance on ASR tasks, especially when used to complement traditional short-term (25-100 ms) features. This paper further studies techniques for incorporating long-term temporal information in the acoustic model by presenting expe...
One of the major research thrusts in the speech group at ICSI is to use Multi-Layer Perceptron (MLP) based features in automatic speech recognition (ASR). This paper presents a study of three aspects of this effort: 1) the properties of the MLP features which make them useful, 2) incorporating MLP features together with PLP features in ASR, and 3)...
Multi-Layer Perceptrons (MLPs) can be used in automatic speech recognition in many ways. A particular application of this tool over the last few years has been the Tandem approach, as described in [7] and other more recent publications. Here we discuss the characteristics of the MLP-based features used for the Tandem approach, and conclude with a r...
The automatic transcription of conversational speech, both from telephone and in-person interactions, is still an extremely challenging task. Our efforts to recognize speech from meetings are likely to benefit from any advances we achieve with conversational telephone speech, a topic of considerable focus for our research. Towards both of these ends...
TempoRAl Patterns (TRAPs) and Tandem MLP/HMM approaches incorporate feature streams computed from longer time intervals than the conventional short-time analysis. These methods have been used for challenging small- and medium-vocabulary recognition tasks, such as Aurora and SPINE. Conversational telephone speech recognition is a difficult large-voc...
This paper provides a progress report on ICSI's Meeting Project, including both the data collected and annotated as part of the project, as well as the research lines such materials support. We include a general description of the official "ICSI Meeting Corpus", as currently available through the Linguistic Data Consortium, discuss some of...
Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., “Tandem”) towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posteri...
In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a signific...
For a connected digits speech recognition task, we have compared the performance of two inexpensive electret microphones with that of a single high quality PZM microphone. Recognition error rates were measured both with and without compensation techniques, where both single-channel and two-channel approaches were used. In all cases the task w...
Our feature extraction module for the Aurora task is based on a combination of a conventional noise suppression technique (Wiener filtering) with our temporal processing techniques (linear discriminant RASTA filtering and nonlinear TempoRAl Pattern (TRAP) classifier). We observe better than 58% relative error improvement on the prescribed Auro...