Nelson Morgan
  • Ph.D.
  • University of California, Berkeley

About

254
Publications
28,631
Reads
8,319
Citations
Current institution
University of California, Berkeley

Publications (254)
Conference Paper
In this paper we apply diagnostic analysis to gain a deeper understanding of the performance of the keyword search system that we have developed for conversational telephone speech in the IARPA Babel program. We summarize the Babel task, its primary performance metric, “actual term weighted value” (ATWV), and our recognition and keyword search...
Conference Paper
Full-text available
Previous work has demonstrated that spectro-temporal Gabor features reduced word error rates for automatic speech recognition under noisy conditions. However, the features based on mel spectra were easily corrupted in the presence of noise or channel distortion. We have exploited an algorithm for power normalized cepstral coefficients (PNCCs) to ge...
Conference Paper
Given a test waveform, state-of-the-art ASR systems extract a sequence of MFCC features and decode them with a set of trained HMMs. When this test data is clean, and it matches the condition used for training the models, then there are few errors. While it is known that ASR systems are brittle in noisy or mismatched conditions, there has been littl...
Article
Many feature extraction methods that have been used for automatic speech recognition (ASR) have either been inspired by analogy to biological mechanisms, or at least have similar functional properties to biological or psychoacoustic properties for humans or other mammals. These methods have in many cases provided significant reductions in errors, p...
Article
Introduction · Some Attributes of Auditory Physiology and Perception · “Classic” Auditory Representations · Current Trends in Auditory Feature Analysis · Summary · Acknowledgments · References
Article
When Speech and Audio Signal Processing was published in 1999, it stood out from its competition in its breadth of coverage and its accessible, intuition-based style. This book was aimed at individual students and engineers excited about the broad span of audio processing and curious to understand the available techniques. Since then, with the advent o...
Article
This paper reviews a line of research carried out over the last decade in speech recognition assisted by discriminatively trained, feedforward networks. The particular focus is on the use of multiple layers of processing preceding the hidden Markov model based decoding of word sequences. Emphasis is placed on the use of multiple streams of highly d...
Article
Full-text available
Current speech recognition systems typically use Gaussian mixture models (GMMs) to estimate the observation (or emission) probabilities of hidden Markov models (HMMs), and GMMs are generative models that have only one layer of latent variables. Instead of developing more powerful models, most of the research effort has gone into find...
Article
Full-text available
Spectro-temporal filtering has previously been shown to yield features that help increase the robustness of automatic speech recognition (ASR). We replace the spectro-temporal representation used in previous work with spectrograms that incorporate knowledge about the signal processing of the human auditory system and which are derived f...
Article
We have incorporated spectrotemporal features in a speech activity detection (SAD) task for the Speech in Noisy Environments 2 (SPINE2) data set. The features were generated by applying 2D Gabor filters to the mel spectrogram in order to measure the strength of various spectral and temporal modulation frequencies in different patches of the spectro...
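The 2D Gabor filtering described above can be sketched minimally: a real-valued Gabor filter (a cosine carrier with chosen spectral and temporal modulation frequencies under a Hann envelope) is convolved with a mel spectrogram patch. The filter size, modulation frequencies, and function names below are illustrative assumptions, not the parameters used in the paper.

```python
import numpy as np
from scipy.signal import convolve2d

def gabor_filter_2d(omega_s, omega_t, size=(9, 9)):
    """Real 2D Gabor filter: cosine carrier with spectral (omega_s) and
    temporal (omega_t) modulation frequencies under a Hann window envelope."""
    rows, cols = size
    f = np.arange(rows) - rows // 2   # spectral (channel) axis
    t = np.arange(cols) - cols // 2   # temporal (frame) axis
    env = np.outer(np.hanning(rows), np.hanning(cols))
    carrier = np.cos(omega_s * f[:, None] + omega_t * t[None, :])
    return env * carrier

def gabor_features(mel_spectrogram, filters):
    """Filter a (channels x frames) mel spectrogram with each Gabor filter
    and stack the responses into one feature tensor."""
    return np.stack([convolve2d(mel_spectrogram, g, mode="same")
                     for g in filters])

# Toy usage: 23 mel channels, 100 frames, two filters (one purely spectral,
# one purely temporal modulation).
S = np.random.rand(23, 100)
bank = [gabor_filter_2d(0.5, 0.0), gabor_filter_2d(0.0, 0.25)]
F = gabor_features(S, bank)
print(F.shape)  # → (2, 23, 100)
```

In a real front end the responses would be sampled at selected spectro-temporal patches rather than kept densely, but the core operation is this 2D convolution.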
Chapter
Introduction · The Frame-Fill Concept · Pattern Matching or Vector Quantization · The Kang–Coulter 600-bps Vocoder · Segmentation Methods for Bandwidth Reduction · Exercises
Chapter
Transparent Audio Coding · Perceptual Masking · Noise Shaping · Some Example Coding Schemes · Summary · Exercises
Chapter
Introduction · The Predictive Model · Properties of the Representation · Getting the Coefficients · Related Representations · Concluding Discussion · Exercises
Chapter
Introduction · The Wave Equation for the Vibrating String · Discrete-Time Traveling Waves · Boundary Conditions and Discrete Traveling Waves · Standing Waves · Discrete-Time Models of Acoustic Tubes · Acoustic Tube Resonances · Relation of Tube Resonances to Formant Frequencies · Exercises
Chapter
Introduction · Isolated Word Recognition · Connected Word Recognition · Segmental Approaches · Discussion · Exercises
Chapter
Introduction · HMM Training · Forward–Backward Training · Optimal Parameters for Emission Probability Estimators · Viterbi Training · Local Acoustic Probability Estimators for ASR · Initialization · Smoothing · Conclusions · Exercises
Chapter
Introduction · A Historical Note · The Real Cepstrum · The Complex Cepstrum · Application of Cepstral Analysis to Speech Signals · Concluding Thoughts · Exercises
Chapter
Introduction · Sound-Pressure Level and Loudness · Frequency Analysis and Critical Bands · Masking · Summary · Exercises
Chapter
Introduction · The z Transform · Inverse z Transform · Convolution · Sampling · Linear Difference Equations · First-Order Linear Difference Equations · Resonance · Concluding Comments · Exercises
Chapter
Sources and Mixtures · Evaluating Source Separation · Multi-Channel Approaches · Beamforming with Microphone Arrays · Independent Component Analysis · Computational Auditory Scene Analysis · Model-Based Separation · Conclusions · Exercises
Chapter
Introduction · Some Examples of Acoustically Generated Musical Sounds · Music Synthesis Concepts · Analysis-Based Synthesis · Other Techniques for Music Synthesis · Reverberation · Several Examples of Synthesis · Exercises
Chapter
Introduction · Adaptation · Lattice-Based MMI and MPE · Conclusion · Exercises
Chapter
Introduction · Review of Fletcher's Critical Band Experiments · Threshold Measurements and Filter Shapes · Gamma-Tone Filters, Roex Filters, and Auditory Models · Other Considerations in Filter-Bank Design · Speech Spectrum Analysis Using the FFT · Conclusions · Exercises
Chapter
Introduction · Concatenative Methods · Statistical Parametric Methods · A Historical Perspective · Speculation · Tools and Evaluation · Exercises · Appendix: Synthesizer Examples
Chapter
Introduction · Historical Review of Pitch-Perception Models · Physiological Exploration of Place Versus Periodicity · Results from Psychoacoustic Testing and Models · Summary · Exercises
Chapter
Introduction · Time-Scale Modification · Transformation Without Explicit Pitch Detection · Transformations in Analysis–Synthesis Systems · Speech Modifications in the Phase Vocoder · Speech Transformations Without Pitch Extraction · The Sine Transform Coder as a Transformation Algorithm · Voice Modification to Emulate a Target Voice · Exercises
Chapter
Introduction · Stating the Problem · Parameterization and Probability Estimation · Conclusion · Exercises
Chapter
Introduction · A Few Definitions · Class-Related Probability Functions · Minimum Error Classification · Likelihood-Based MAP Classification · Approximating a Bayes Classifier · Statistically Based Linear Discriminants · Iterative Training: The EM Algorithm · Exercises
Chapter
Introduction · Sequence of Steps in a Plucked or Bowed String Instrument · Vibrations of the Bowed String · Frequency-Response Measurements of the Bridge of a Violin · Vibrations of the Body of String Instruments · Radiation Pattern of Bowed String Instruments · Some Considerations in Piano Design · The Trumpet, Trombone, French Horn, and Tuba · Exercises
Chapter
Introduction · Sound Waves · Sound Waves in Rooms · Room Acoustics as a Component in Speech Systems · Exercises
Chapter
Background · Voice-coding concepts · Homer Dudley (1898–1981) · Exercises · Appendix: Hearing of the Fall of Troy
Chapter
Introduction · Discriminant Training · HMM–ANN Based ASR · Other Applications of ANNs to ASR · Exercises · Appendix: Posterior Probability Proof
Chapter
Introduction · The Articulation Index and Human Recognition · Comparisons Between Human and Machine Speech Recognizers · Concluding Thoughts · Exercises
Chapter
Introduction · General Design of a Speaker Recognition System · Example System Components · Evaluation · Modern Research Challenges · Exercises
Chapter
The Information in Music Audio · Music Transcription · Note Transcription · Score Alignment · Chord Transcription · Structure Detection · Conclusion · Exercises
Chapter
Introduction · Vowel Perception: Psychoacoustics and Physiology · The Confusion Matrix · Perceptual Cues for Plosives · Physiological Studies of Two Voiced Plosives · Motor Theories of Speech Perception · Neural Firing Patterns for Connected Speech Stimuli · Concluding Thoughts · Exercises
Chapter
Introduction · Feature Extraction · Pattern-Classification Methods · Support Vector Machines · Unsupervised Clustering · Conclusions · Exercises · Appendix: Multilayer Perceptron Training
Chapter
Introduction · A Note on Nomenclature · Pitch Detection, Perception and Articulation · The Voicing Decision · Some Difficulties in Pitch Detection · Signal Processing to Improve Pitch Detection · Pattern-Recognition Methods for Pitch Detection · Smoothing to Fix Errors in Pitch Estimation · Normalizing the Autocorrelation Function · Exercises
Chapter
Radio Rex · Digit Recognition · Speech Recognition in the 1950s · The 1960s · 1971–1976 ARPA Project · Achieved by 1976 · The 1980s in Automatic Speech Recognition · More Recent Work · Some Lessons · Exercises
Chapter
Von Kempelen · The Voder · Teaching the Operator to Make the Voder “Talk” · Speech Synthesis After the Voder · Music Machines · Exercises
Chapter
Introduction · Filtering Concepts · Transformations for Digital Filter Design · Digital Filter Design with Bilinear Transformation · The Discrete Fourier Transform · Fast Fourier Transform Methods · Relation Between the DFT and Digital Filters · Exercises
Chapter
Introduction · General Design of a Speaker Diarization System · Example System Components · Research Challenges · Exercises
Chapter
The Music Retrieval Problem · Music Fingerprinting · Query by Humming · Cover Song Matching · Music Classification and Autotagging · Music Similarity · Conclusions · Exercises
Chapter
Introduction · Phones and Phonemes · Phonetic and Phonemic Alphabets · Articulatory Features · Subword Units as Categories for ASR · Phonological Models for ASR · Context-Dependent Phones · Other Subword Units · Phrases · Some Issues in Phonological Modeling · Exercises
Chapter
Introduction · Acoustic Tube Models of English Phonemes · Excitation Mechanisms in Speech Production · Exercises
Chapter
Introduction · Common Feature Vectors · Dynamic Features · Strategies for Robustness · Auditory Models · Multichannel Input · Discriminant Features · Discussion · Exercises
Chapter
Introduction · Voice Excitation and Spectral Flattening · Voice-Excited Channel Vocoder · Voice-Excited and Error-Signal-Excited LPC Vocoders · Waveform Coding with Predictive Methods · Adaptive Predictive Coding of Speech · Subband Coding · Multipulse LPC Vocoders · Code-Excited Linear Predictive Coding · Reducing Codebook Search Time in CELP · Conclusions · Exercises
Chapter
Introduction · Standards for Digital Speech Coding · Design Considerations in Channel Vocoder Filter Banks · Energy Measurements in a Channel Vocoder · A Vocoder Design for Spectral Envelope Estimation · Bit Saving in Channel Vocoders · Design of the Excitation Parameters for a Channel Vocoder · LPC Vocoders · Cepstral Vocoders · Design Comparisons · Vocoder Standardization · Exer...
Chapter
Introduction · Phonological Models · Language Models · Decoding With Acoustic and Language Models · A Complete System · Accepting Realistic Input · Concluding Comments
Chapter
Introduction · Anatomical Pathways From the Ear to the Perception of Sound · The Peripheral Auditory System · Hair Cell and Auditory Nerve Functions · Properties of the Auditory Nerve · Summary and Block Diagram of the Peripheral Auditory System · Exercises
Conference Paper
Full-text available
In the last decade, several studies have shown that the robustness of ASR systems can be increased when 2D Gabor filters are used to extract specific modulation frequencies from the input pattern. This paper analyzes important design parameters for spectro-temporal features based on a Gabor filter bank: We perform experiments with filters that exhi...
Conference Paper
Full-text available
In this paper, we propose a discriminative extension to agglomerative hierarchical clustering, a typical technique for speaker diarization, that fits seamlessly with most state-of-the-art diarization algorithms. We propose to use maximum mutual information using bootstrapping, i.e., initial predictions are used as input for retraining of models in a...
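For context, plain agglomerative hierarchical clustering (the baseline this paper extends) repeatedly merges the two closest clusters until a stopping criterion is met. A minimal sketch follows; the Euclidean distance, stopping threshold, and toy one-dimensional "embeddings" are illustrative assumptions, not the paper's MMI-based method.

```python
import numpy as np

def ahc(segments, stop_dist):
    """Minimal agglomerative hierarchical clustering: repeatedly merge the
    two closest clusters (Euclidean distance between mean embeddings) until
    the closest pair is farther apart than stop_dist."""
    clusters = [[i] for i in range(len(segments))]
    X = [np.asarray(s, dtype=float) for s in segments]
    means = [x.copy() for x in X]
    while len(clusters) > 1:
        best, bi, bj = None, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = np.linalg.norm(means[i] - means[j])
                if best is None or d < best:
                    best, bi, bj = d, i, j
        if best > stop_dist:
            break                               # no sufficiently close pair left
        clusters[bi] += clusters.pop(bj)        # merge cluster bj into bi
        means.pop(bj)
        members = np.stack([X[k] for k in clusters[bi]])
        means[bi] = members.mean(axis=0)        # update merged centroid
    return clusters

# Two tight groups of 1-D "segment embeddings": indices 0-2 vs. 3-5.
segs = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
print(sorted(sorted(c) for c in ahc(segs, stop_dist=1.0)))
# → [[0, 1, 2], [3, 4, 5]]
```

Real diarization systems replace the Euclidean merge score with a model-based criterion (e.g., BIC or, as here, a discriminatively retrained score), but the merge loop has this shape.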
Conference Paper
Previous work has shown that spectro-temporal features reduce WER for automatic speech recognition under noisy conditions. The spectro-temporal framework, however, is not the only way to process features in order to reduce errors due to noise in the signal. The two-stage mel-warped Wiener filtering method used in the "Advanced Front End" (AFE), now...
Article
This article has been withdrawn at the request of the author(s) and/or editor. The Publisher apologizes for any inconvenience this may cause. The full Elsevier Policy on Article Withdrawal can be found at http://www.elsevier.com/locate/withdrawalpolicy .
Article
Full-text available
To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research t...
Article
Full-text available
This article is the second part of an updated version of the "MINDS 2006-2007 Report of the Speech Understanding Working Group," one of five reports emanating from two workshops entitled "Meeting of the MINDS: Future Directions for Human Language Technology," sponsored by the U.S. Disruptive Technology Office (DTO). (MINDS is an acronym for "machin...
Article
Full-text available
Automatic speech recognition enables a wide range of current and emerging applications such as automatic transcription, multimedia content analysis, and natural human-computer interfaces. This article provides a glimpse of the opportunities and challenges that parallelism provides for automatic speech recognition and related application research fr...
Article
Full-text available
What is a Negative Result? In a sense, well-designed experiments never have a completely negative result, since there is always the opportunity to learn something. In fact, unexpected results by definition provide the most information. Conventionally, negative results refer to those that do not support the hypothesis that an experiment has been des...
Article
Full-text available
Industry needs help from the research community to succeed in its recent dramatic shift to parallel computing. Failure could jeopardize both the IT industry and the portions of the economy that depend on rapidly improving information technology. Jeopardy for the IT industry means opportunity for the research community. If researchers meet the paral...
Conference Paper
Full-text available
We report progress in the use of multi-stream spectro-temporal features for both small and large vocabulary automatic speech recognition tasks. Features are divided into multiple streams for parallel processing and dynamic utilization in this approach. For small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-...
Conference Paper
Full-text available
We performed automated feature selection for multi-stream (i.e., ensemble) automatic speech recognition, using a hill-climbing (HC) algorithm that changes one feature at a time if the change improves a performance score. For both clean and noisy data sets (using the OGI Numbers corpus), HC usually improved performance on held out data compared to...
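The hill-climbing loop described here (flip one feature's inclusion, keep the flip only if the score improves) can be sketched as follows. The toy scoring function and names are illustrative, not the ASR performance score used in the paper.

```python
import random

def hill_climb(n_features, score, seed=0):
    """Greedy hill climbing over a binary feature-inclusion mask: flip one
    feature at a time and keep the flip only if the score improves."""
    rng = random.Random(seed)
    mask = [rng.random() < 0.5 for _ in range(n_features)]
    best = score(mask)
    improved = True
    while improved:                        # repeat passes until no flip helps
        improved = False
        for i in range(n_features):
            mask[i] = not mask[i]          # tentative flip
            s = score(mask)
            if s > best:
                best = s                   # keep the flip
                improved = True
            else:
                mask[i] = not mask[i]      # revert
    return mask, best

# Toy score: reward matching a hypothetical "ideal" feature subset.
target = [True, False, True, True, False]
score = lambda m: sum(a == b for a, b in zip(m, target))
mask, best = hill_climb(5, score)
print(mask, best)  # → [True, False, True, True, False] 5
```

Because each flip is evaluated on held-out data in the paper's setting, the same loop doubles as a guard against overfitting the selection to training data.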
Article
Full-text available
The second part of the updated version of "MINDS 2006-2007 Report of the Speech Understanding Working Group" is presented which came from two workshops entitled "Meeting of the MINDS: Future Directions for Human Language Technology". The specific topics being discussed include: the fundamental science of human speech perception and production; tran...
Article
Full-text available
To advance research, it is important to identify promising future research directions, especially those that have not been adequately pursued or funded in the past. The working group producing this article was charged to elicit from the human language technology (HLT) community a set of well-considered directions or rich areas for future research t...
Conference Paper
Full-text available
Our goal in this work was to develop an accurate method to identify laughter segments, ultimately for the purpose of speaker recognition. Our previous work used MLPs to perform frame level detection of laughter using short-term features, including MFCCs and pitch, and achieved a 7.9% EER on our test set. We improved upon our previous results by...
Conference Paper
We describe the large vocabulary automatic speech recognition system developed for Modern Standard Arabic by the SRI/Nightingale team, and used for the 2007 GALE evaluation as part of the speech translation system. We show how system performance is affected by different development choices, ranging from text processing and lexicon to decoding syste...
Conference Paper
A multi-stream approach to utilizing the inherently large number of spectro-temporal features for speech recognition is investigated in this study. Instead of reducing the feature-space dimension, this method divides the features into streams so that each represents a patch of information in the spectro-temporal response field. When used in combi...
Conference Paper
This paper describes a simple method for significantly improving tandem features used to train acoustic models for large-vocabulary speech recognition. The linear activations at the outputs of an MLP classifier were modified according to known reference labels: where necessary, the activation of the output unit corresponding to the correct phone la...
Conference Paper
This paper explores Tandem feature extraction used in a large-vocabulary speech recognition system. In this framework a multi-layer perceptron estimates phone probabilities which are treated as acoustic observations in a traditional HMM-GMM system. To determine a lower error bound, we simulated an idealized classifier based on alignment of refere...
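For context, Tandem post-processing of MLP phone posteriors is commonly described as log compression followed by decorrelation (PCA/KLT) before the vectors are used as HMM-GMM observations, often appended to conventional features. This sketch follows that generic recipe; the dimensions and names are illustrative, not the specific system evaluated in the paper.

```python
import numpy as np

def tandem_features(posteriors, base_feats, eps=1e-10):
    """Generic Tandem post-processing sketch: log-compress MLP phone
    posteriors, decorrelate them with PCA (KLT), and append the result
    to the base acoustic features."""
    logp = np.log(posteriors + eps)            # (frames x phones)
    centered = logp - logp.mean(axis=0)
    # PCA via eigendecomposition of the covariance matrix
    cov = np.cov(centered, rowvar=False)
    _, vecs = np.linalg.eigh(cov)
    decorrelated = centered @ vecs[:, ::-1]    # leading components first
    return np.hstack([base_feats, decorrelated])

# Toy usage: 50 frames, 8 phone classes, 13-dimensional base features.
T, P, D = 50, 8, 13
post = np.random.dirichlet(np.ones(P), size=T)  # rows sum to 1, like posteriors
plp = np.random.randn(T, D)
feats = tandem_features(post, plp)
print(feats.shape)  # → (50, 21)
```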
Chapter
Automatic speech recognition is the attempt to use a machine to derive the linguistic message from a speech signal.
Chapter
Full-text available
This chapter describes the English-language SmartKom-Mobile system and related research. We explain the work required to support a second language in SmartKom and the design of the English speech recognizer. We then discuss research carried out on signal processing methods for robust speech recognition and on language analysis using the Embodied Co...
Conference Paper
We describe the development of a speech recognition system for conversational telephone speech (CTS) that incorporates acoustic features estimated by multilayer perceptrons (MLP). The acoustic features are based on frame-level phone posterior probabilities, obtained by merging two different MLP estimators, one based on PLP-Tandem features, the...
Conference Paper
In this paper, we present our recent progress on multi-layer perceptron (MLP) based data-driven feature extraction using improved MLP structures. Four-layer MLPs are used in this study. Different signal processing methods are applied before the input layer of the MLP. We show that the first hidden layer of a four-layer MLP is able to detect some ba...
Conference Paper
Full-text available
The use of huge databases in ASR has become an important source of ASR system improvements in recent years. However, their use demands an increase of the computational resources necessary to train the recognizers. Several techniques have been proposed in the literature with the purpose of making better use of these enormous databases by sel...
Article
We have been reducing word error rates (WERs) on conversational telephone speech (CTS) tasks by capturing long-term (~500 ms) temporal information using multi-layered perceptrons (MLPs). In this paper we experiment with an MLP architecture called Tonotopic MLP (TMLP), incorporating two hidden layers. The first of these is tonotopically organized: for...
Conference Paper
Incorporating long-term (500-1000 ms) temporal information using multi-layered perceptrons (MLPs) has improved performance on ASR tasks, especially when used to complement traditional short-term (25-100 ms) features. This paper further studies techniques for incorporating long-term temporal information in the acoustic model by presenting expe...
Article
One of the major research thrusts in the speech group at ICSI is to use Multi-Layer Perceptron (MLP) based features in automatic speech recognition (ASR). This paper presents a study of three aspects of this effort: 1) the properties of the MLP features which make them useful, 2) incorporating MLP features together with PLP features in ASR, and 3)...
Conference Paper
Multi-Layer Perceptrons (MLPs) can be used in automatic speech recognition in many ways. A particular application of this tool over the last few years has been the Tandem approach, as described in [7] and other more recent publications. Here we discuss the characteristics of the MLP-based features used for the Tandem approach, and conclude with a r...
Conference Paper
The automatic transcription of conversational speech, both from telephone and in-person interactions, is still an extremely challenging task. Our efforts to recognize speech from meetings are likely to benefit from any advances we achieve with conversational telephone speech, a topic of considerable focus for our research. Towards both of these ends...
Article
TempoRAl Patterns (TRAPs) and Tandem MLP/HMM approaches incorporate feature streams computed from longer time intervals than the conventional short-time analysis. These methods have been used for challenging small- and medium-vocabulary recognition tasks, such as Aurora and SPINE. Conversational telephone speech recognition is a difficult large-voc...
Article
Full-text available
This paper provides a progress report on ICSI's Meeting Project, including both the data collected and annotated as part of the project, as well as the research lines such materials support. We include a general description of the official "ICSI Meeting Corpus", as currently available through the Linguistic Data Consortium, discuss some of...
Article
Local state (or phone) posterior probabilities are often investigated as local classifiers (e.g., hybrid HMM/ANN systems) or as transformed acoustic features (e.g., "Tandem") towards improved speech recognition systems. In this paper, we present initial results towards boosting these approaches by improving the local state, phone, or word posteri...
Article
Full-text available
In collaboration with colleagues at UW, OGI, IBM, and SRI, we are developing technology to process spoken language from informal meetings. The work includes a substantial data collection and transcription effort, and has required a nontrivial degree of infrastructure development. We are undertaking this because the new task area provides a signific...
Conference Paper
Full-text available
For a connected digits speech recognition task, we have compared the performance of two inexpensive electret microphones with that of a single high quality PZM microphone. Recognition error rates were measured both with and without compensation techniques, where both single-channel and two-channel approaches were used. In all cases the task w...
Conference Paper
Full-text available
Our feature extraction module for the Aurora task is based on a combination of a conventional noise suppression technique (Wiener filtering) with our temporal processing techniques (linear discriminant RASTA filtering and nonlinear TempoRAl Pattern (TRAP) classifier). We observe better than 58% relative error improvement on the prescribed Auro...
