Article

Isolated words recognition system based on hybrid approach DTW/GHMM


Abstract

In this paper, we present a new hybrid approach for isolated spoken word recognition using Hidden Markov Models (HMM) combined with Dynamic Time Warping (DTW). HMMs have been shown to be robust in speech recognition systems. We propose to extend the HMM method by combining it with the DTW algorithm in order to exploit the advantages of these two powerful pattern recognition techniques. In this work we carry out a comparative evaluation between traditional continuous Hidden Markov Models (GHMM) and the new DTW/GHMM approach. This approach integrates the prototype (word reference template) of each word into the training phase of the hybrid system. An iterative algorithm based on the conventional DTW algorithm and on an averaging technique is used to determine the best prototype during the training phase in order to increase model discrimination. The test phase is identical for the GHMM and DTW/GHMM methods. We evaluate the performance of each system on several different test sets and observe that the new approach gives the best results in all cases. Slovenian summary (Povzetek): This paper describes hybrid methods for word recognition.
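The abstract's central training idea is that, for each word, an iterative procedure uses DTW alignments plus frame averaging to build a reference prototype. The paper's exact procedure is not reproduced here; the following is a minimal Python/NumPy sketch of one plausible reading of that step, in which every training utterance is DTW-aligned to the current prototype and the aligned frames are averaged to form the next prototype. All function and variable names are illustrative, not taken from the paper.

```python
import numpy as np

def dtw_path(a, b):
    """Standard DTW between two feature sequences (frames x dims).
    Returns the warping path as a list of (i, j) index pairs."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

def average_prototype(utterances, n_iter=3):
    """Iteratively refine a word prototype by DTW-aligning each training
    utterance to the current prototype and averaging the aligned frames.
    (Illustrative reading of the averaging step described in the abstract.)"""
    proto = utterances[0].copy()          # start from an arbitrary utterance
    for _ in range(n_iter):
        acc = np.zeros_like(proto)
        counts = np.zeros(len(proto))
        for utt in utterances:
            for i, j in dtw_path(proto, utt):
                acc[i] += utt[j]          # frame j of utt is aligned to frame i of proto
                counts[i] += 1
        proto = acc / np.maximum(counts, 1)[:, None]
    return proto
```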


... The choice among them depends on factors such as the language, whether the system is speaker dependent or not, whether it deals with isolated words or continuous speech, and even on factors inherent to the technique itself. The Dynamic Time Warping (DTW) algorithm (Sakoe and Chiba, 1978) uses template comparison through the non-linear mapping of one signal during its comparison with another, which allows it to deliver good acceptance rates and has made it a well-known technique widely used by the community (Bourouba et al., 2005). ...
... In (Pandit and Kittler, 1998) a feature-set optimization technique based on DTW is presented, improving the acceptance rate in speaker-dependent experiments. (Bourouba et al., 2005) also seeks to improve isolated word recognition by proposing a hybrid of Hidden Markov Models (HMM) and DTW as the recognizer. (Borrero et al., 2009) proposes the use of LPCC as feature extractor and DTW for the recognition of isolated Spanish words spoken by a single speaker, applied to the navigation control of a mobile mini-robot. ...
Conference Paper
Full-text available
Abstract: This document presents the results obtained by implementing a system for the automatic recognition of isolated words, spoken in Spanish by a single speaker, using a hybrid feature extraction method. The method is based on combining MFCC (Mel Frequency Cepstral Coefficients) and MODGDF (Modified Group Delay Feature), which respectively provide information about the frequency and the phase of the speech signal; together they give a more complete treatment of the factors that human hearing takes into account, allowing a higher recognition rate. 1. Introduction: An Automatic Speech Recognition (ASR) system is one that is capable of processing the voice signal produced by an individual. To achieve this, the signal goes through a digitization process to obtain measurement elements (samples), which make it possible to characterize its behaviour and implement processes aimed at recognition. Under this scheme, the voice signal is involved in two important stages, training and recognition, the first of which is one of the most critical and is where a large part of the success of this kind of system rests (Oropeza and Suárez, 2006). ASR systems are constantly in search of methods that allow each of these stages to be implemented efficiently and better results to be obtained; for this reason, the authors present different feature extraction methods oriented towards speech recognition, mostly based on the magnitude spectrum of the Fourier Transform, notably LPCC (Linear Prediction Cepstral Coefficients), MFCC (Mel Frequency Cepstral Coefficients) and PLP (Perceptual Linear Prediction), among others. MFCC is one of the most popular methods because of the good results it provides thanks to its robustness to noise and its use of the windowed Fourier Transform to obtain the frequencies of the signal and, from them, the corresponding cepstral coefficients on the Mel scale; in this way the ASR system behaves approximately like the human auditory system (Huang et al., 2001). However, the exclusive use of the Fourier Transform causes ASR systems to ignore a very important factor that human hearing takes into account: the phase information of the speech signal. The Modified Group Delay Feature (MODGDF) is a feature extraction method presented in (Murthy and Rao-Gadde, 2003) that obtains information from the phase spectrum, making it an efficient complement to MFCC for achieving a higher recognition rate.
... There are many algorithms for speech detection, and researchers are still working on detecting voice activity within a signal effectively. Here we use a voice activity detection algorithm [6]. This algorithm calculates the energy and the zero-crossing rate in each frame and compares these values with thresholds to determine whether there is possible voice activity or not. ...
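The cited VAD procedure is described only as thresholding per-frame energy and zero-crossing counts. A minimal sketch of such a detector, assuming a mono signal as a NumPy array and purely illustrative threshold values, might look like this:

```python
import numpy as np

def voice_activity(signal, frame_len=256, energy_thresh=0.01, zcr_thresh=0.15):
    """Flag each frame as speech/non-speech from its energy and zero-crossing rate.
    Thresholds are illustrative and would normally be tuned on background noise."""
    flags = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = np.mean(frame ** 2)                          # short-time energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0  # zero-crossing rate
        # Voiced speech: high energy; unvoiced speech: high ZCR with some energy.
        flags.append(energy > energy_thresh or
                     (zcr > zcr_thresh and energy > energy_thresh / 10))
    return flags
```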
Article
Full-text available
Automatic recognition of spoken words is one of the most challenging tasks in the field of speech recognition. The difficulty of this task is due to the acoustic similarity of many words and their syllables. Accurate recognition requires the system to make fine phonetic distinctions. This paper presents techniques for recognizing spoken words in Bangla. We first derive features from the spoken words; in this work we use MFCC, LPC, GMM and DTW.
... E-Hocine Bourouba et al. [9] presented a hybrid approach for isolated spoken word recognition. The hybrid approach is built from two powerful pattern recognition techniques, HMM combined with DTW, in order to combine the advantages of both. ...
Conference Paper
Full-text available
Human interaction with machines is possible using speech recognition technology. This paper presents an isolated-word, speaker-independent speech recognition system capable of recognizing spoken Gujarati numerals from 0 to 9. The proposed system uses the Mel Frequency Cepstral Coefficient (MFCC) method to extract features from spoken Gujarati numerals and the Dynamic Time Warping (DTW) technique to compare the test patterns with the training patterns in order to recognize the spoken numerals. From an experimental point of view, the proposed model uses an equal number of training and test examples. We achieved an average accuracy of 71.17% over all Gujarati numerals from 0 to 9.
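The Gujarati numeral recognizer described above is a template matcher: MFCC features of a test utterance are compared against stored training templates with DTW and the closest template wins. A compact, hedged sketch of that decision rule is shown below; the feature extraction itself is assumed to be done elsewhere, and the names are illustrative rather than taken from the paper.

```python
import numpy as np

def dtw_distance(a, b):
    """Cumulative DTW cost between two MFCC sequences (frames x coefficients)."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def recognize(test_mfcc, templates):
    """templates: dict mapping a label ('0'..'9') to a list of MFCC templates.
    Returns the label of the nearest template under the DTW distance."""
    best_label, best_cost = None, np.inf
    for label, examples in templates.items():
        for template in examples:
            c = dtw_distance(test_mfcc, template)
            if c < best_cost:
                best_label, best_cost = label, c
    return best_label
```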
... Many researchers have used DBNs with supervised or unsupervised training algorithms [12][13][14]. The literature reports that systems based on DBNs outperform HMM, GMM and GMM-HMM techniques [15][16][17][18][19][20][21][22]. Jaitly et al. (2011) [23] applied a DBN built from Restricted Boltzmann Machines (RBM) trained with the contrastive divergence (CD) algorithm to speech signals. ...
Article
Full-text available
Speech is a natural way for humans to communicate information. The speech signal conveys the linguistic information (the message) and a lot of information about the speaker himself: gender, age, regional origin, health and emotional state. Speech recognition is the technology that lets a machine understand human speech. Speaker recognition is the technology by which a machine distinguishes different speakers from each other. In real life, speaker and speech recognition are used very frequently, in areas ranging from healthcare and the military to everyday applications. These may include, but are not limited to, commanding electronic devices through speech. Existing speech recognition systems perform efficiently in controlled environments, but in real-time applications the performance is affected by variability in speaking style and by background noise. In order to deal with these effects, enhanced Mel Frequency Cepstral Coefficients (MFCC) are calculated. A deep belief network (DBN) of stacked restricted Boltzmann machines (RBM) is used for training and testing. The proposed system is implemented using the standard TIDIGITS dataset, giving 97.29% accuracy.
... Six instructions are used in this work, namely 'START,' 'STOP,' 'FORWARD,' 'REVERSE,' 'LEFT,' and 'RIGHT,' uttered by a male and a female instructor in a room environment. A four-stage neural network model (Bourouba et al. 2006) with two hidden layers, as shown in Fig. 7, is used in the proposed work. ...
Article
Full-text available
Robotic vehicles have been actively researched in recent times to automate commercial applications and ease the daily life of consumers. Robotic automation has been an integral part of industrial concerns, drastically reducing the manpower and effort needed for various processes. Such vehicles are mainly based on RF communication, whose spectrum is quite scarce; this has necessitated alternative means of providing communication paths for robotic vehicles. Light fidelity (Li-Fi) is a rapidly emerging technology exploiting the optical properties of abundantly available light energy. Although limited to line-of-sight communication, unlike RF it poses no hazards and is quite suitable for small- to medium-scale indoor communication. This research article proposes a Li-Fi based robotic vehicle controlled by voice commands issued by instructors. Voice recognition is achieved using an MFCC-NN model, while the light energy is collected by a solar panel that replaces the conventional photodetectors. The efficiency of the proposed work is established by the superior performance observed across a wide range of experiments. The characteristics of light propagation through different glass media have also been investigated, justifying their probable utility in a small-scale business environment.
... Many empirical studies have been performed to compare various classifier combination methods [27][28][29]. In most of the literature [30][31][32][33][34], hybrid models were created by combining the characteristics of classifier algorithms such as Artificial Neural Networks (ANN), Support Vector Machines (SVM) and Hidden Markov Models (HMM), resulting in new hybrid classification algorithms. A. Henriques et al. [35] proposed a robust process combining a parallel and a serial architecture, where initial classification results obtained by an SVM were refined in a second SVM classifier and the final result was given by a linear combination of two ensembles of SVM classifiers and a minimum-distance classifier. ...
Article
Classifier combination methods have proved to be an effective tool for increasing the performance of classification techniques in pattern recognition applications. Despite a significant number of publications describing successful classifier combination implementations, the theoretical basis is still not mature enough and the achieved improvements are inconsistent. In this paper, we propose a novel statistical validation technique, a correlation-based classifier combination technique, for combining classifiers in any pattern recognition problem. This validation has a significant influence on the performance of combinations, and its use is necessary for a complete theoretical understanding of combination algorithms. The analysis presented is statistical in nature but promises to lead to a class of algorithms for rank-based decision combination. The theoretical and practical potential is illustrated by applying the technique to two standard pattern recognition datasets, the handwritten digit recognition and letter image recognition datasets taken from the UCI Machine Learning Database Repository (http://www.ics.uci.edu/_mlearn). An empirical evaluation using 8 well-known distinct classifiers confirms the validity of our approach compared with other combinations of multiple-classifier algorithms. Finally, we also suggest a methodology for determining the best mix of individual classifiers.
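The abstract alludes to rank-based decision combination without giving a formula. As a point of reference only, a simple Borda-count style rank combination (a generic technique, not the paper's correlation-based validation method) can be sketched as follows; classifier outputs are assumed to be per-class scores.

```python
import numpy as np

def borda_combine(score_lists):
    """Combine per-class scores from several classifiers by summing ranks.
    score_lists: array of shape (n_classifiers, n_classes), higher = better.
    Returns the index of the winning class.  Illustrates generic rank-based
    combination only, not the paper's correlation-based technique."""
    scores = np.asarray(score_lists, dtype=float)
    # Rank classes within each classifier: the best class gets the highest rank.
    ranks = scores.argsort(axis=1).argsort(axis=1)
    return int(ranks.sum(axis=0).argmax())

# Example: three classifiers scoring four classes.
print(borda_combine([[0.1, 0.7, 0.2, 0.0],
                     [0.3, 0.4, 0.2, 0.1],
                     [0.2, 0.5, 0.2, 0.1]]))  # prints 1: class 1 wins overall
```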
... In [22], a simple post-processor duration model or a more complex Hidden Semi-Markov Model based approach gives better performance depending upon the efficiency requirements. A hybrid Hidden Markov Model with conventional DTW determines the best prototype model during the training phase in order to increase model discrimination [2]. A hybrid endpoint detector is proposed in [13] which gives a rejection rate of less than 0.5 percent while providing recognition accuracy comparable to that obtained from hand-edited endpoints. ...
Article
The ability of a reader to recognize written words correctly, virtually and effortlessly, is defined as Word Recognition or Isolated Word Recognition; each word is recognized from its shape. Speech Recognition is the system that converts spoken words to written text, also called the Speech to Text (STT) method. The usual methods used in Speech Recognition (SR) are Neural Networks, Hidden Markov Models (HMM) and Dynamic Time Warping (DTW), with HMM being the most widely used. A Hidden Markov Model assumes that the successive acoustic features of a spoken word are conditionally independent given the state; here each single unit of a word is considered a state. Based upon the state probabilities, it generates possible word sequences for the spoken word. Instead of listening to the speech, the generated text sequence can simply be read. People with hearing impairments can make use of this kind of Speech Recognition.
... J. T. Chien et al. [12] worked on a Chinese speech corpus of around 32,000 words. They used 32 Gaussians with different MFCC feature sets and showed that the recognition score increases when feature derivatives are included. E. H. Bourouba et al. [13] used an MFCC feature set with a hybrid acoustic model to achieve a 90% recognition score, and showed that including derivatives of the feature vector improves the recognition rate. Z. Hachkar et al. [14] presented a comparative evaluation of MFCC and PLP in a noisy environment and showed that on noisy data PLP outperforms MFCC. ...
Article
Full-text available
State-of-the-art automatic speech recognition systems use Mel frequency cepstral coefficients as the feature extractor along with Gaussian mixture models for acoustic modeling, but there is no standard value for the number of mixture components in the speech recognition process. The current choice of mixture components is arbitrary, with little justification. Also, the standard settings for European languages cannot be used in Hindi speech recognition due to the mismatch in database size between the languages. Parameter estimation with too many or too few components may estimate the mixture model inappropriately; therefore, the number of mixtures is important for the initial estimation in the expectation-maximization process. In this research work, the authors estimate the number of Gaussian mixture components for a Hindi database based upon the size of the vocabulary. Mel frequency cepstral features and perceptual linear predictive features, along with their extended variations including delta-delta-delta features, have been used to evaluate this number based on the optimal recognition score of the system. A comparative analysis of recognition performance for both feature extraction methods on a medium-sized Hindi database is also presented in this paper. HLDA has been used as a feature reduction technique and its impact on the recognition score is highlighted.
... In the standard HMM system, each word is represented by a separate HMM [5]. In the learning stage, each utterance is converted to the spectral domain (MFCC, energy and second-order coefficients), constituting an observation sequence used to estimate the HMM parameters associated with the word. ...
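The context snippet states that each utterance is mapped to MFCC, energy and second-order coefficients to form the HMM observation sequence. A small sketch of how such delta and delta-delta (second-order) coefficients are commonly appended to a static feature matrix is shown below; the MFCC matrix itself is assumed to come from an existing front end, and this is a generic recipe rather than the cited system's exact configuration.

```python
import numpy as np

def add_dynamic_features(static, width=2):
    """Append first- and second-order (delta, delta-delta) coefficients to a
    static feature matrix of shape (frames, dims), a common way of building
    the observation vectors mentioned above."""
    def deltas(feat):
        # Regression-style deltas with edge padding of `width` frames.
        pad = np.pad(feat, ((width, width), (0, 0)), mode="edge")
        num = sum(t * (pad[width + t:len(feat) + width + t] -
                       pad[width - t:len(feat) + width - t])
                  for t in range(1, width + 1))
        return num / (2 * sum(t * t for t in range(1, width + 1)))

    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```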
Article
Full-text available
Thanks to Automatic Speech Recognition (ASR), many machines can nowadays emulate the human ability to understand and speak natural language. However, the ASR problem is as interesting as it is difficult. Its difficulty is precisely due to the complexity of speech processing, which involves many aspects: acoustic, phonetic, syntactic, etc. The most commonly used technology in speech recognition is based on statistical models, especially Hidden Markov Models, which are capable of simultaneously modeling the frequency and temporal characteristics of the speech signal. There is also the alternative of using Neural Networks. Another interesting framework applied in ASR is the hybrid Artificial Neural Network (ANN) and Hidden Markov Model (HMM) speech recognizer, which improves on the accuracy of the two individual models. In the present work, we propose an Arabic digit recognition system based on a hybrid Optimal Artificial Neural Network and Hidden Markov Model (OANN/HMM). The main innovation in this work is the use of an optimal neural network to determine the optimal groups, unlike the classical Kohonen approach. The numerical results are strong and show the practical interest of our approach.
... These can be mainly divided into three models: Dynamic Time Warping (DTW), Hidden Markov Models (HMM) and Artificial Neural Networks (ANN). Some hybrid approaches, such as the DTW/GHMM classifier, have also been implemented, showing an improvement of 2% to 10% in recognition [8][9]. ...
Conference Paper
Full-text available
This paper presents a time-domain feature extraction method for speaker identification using the Time Encoded Signal Processing and Recognition (TESPAR) approach. TESPAR matrices are generated not only for English words but also for Urdu and Pashto words. For classification, the standard Artificial Neural Network (ANN) classifier and its variants have been used. The recognition results obtained show that when the user speaks a word from the vocabulary in an isolated fashion, it is correctly recognized 99% of the time. The results of the TESPAR-based features are compared with features extracted using Mel-Frequency Cepstral Coefficients (MFCC) and Linear Predictive Coefficients (LPC). The MFCC and LPC features are obtained using the Hidden Markov Model toolkit, HTK. A feed-forward neural network with backpropagation has been used for the recognition. The results show that speaker recognition systems with TESPAR features give better performance, with a high recognition rate and low computational complexity, compared with MFCC and LPC based features.
... Fortunately, speech data contains information on both, hence the speaker and the uttered word can be recognised simultaneously in a single process. The algorithm used for speech recognition consists of pre-processing, feature extraction and recognition modules [2,7,8]. ...
Article
The speech recognition approach aims to recognize text from a speech utterance, which can be very helpful to people with hearing disabilities. Support Vector Machines (SVM) and Hidden Markov Models (HMM) are widely used techniques for speech recognition systems. Acoustic features, namely Linear Predictive Coding (LPC), Linear Prediction Cepstral Coefficients (LPCC) and Mel Frequency Cepstral Coefficients (MFCC), are extracted. Modeling techniques such as SVM and HMM were used to model each individual word, resulting in 620 models trained into the system. Each isolated word segment from the test sentence is matched against these models to find the semantic representation of the test input speech. The performance of the system is evaluated on words from the computer domain, and the system shows an accuracy of 91.46% for SVM and 98.92% for HMM. From the exhaustive analysis, it is evident that HMM performs better than other modeling techniques such as SVM.
Article
Dynamic Time Warping (DTW) is a template-based cost minimization technique. We propose a Hidden Markov Model (HMM) based enhanced DTW technique to efficiently recognize signals with varying speaking rates and to recognize closely similar utterances. We extend the derivation of the Viterbi and forward algorithms to find the optimized path alignment in the newly proposed technique, and extend the Baum-Welch algorithm to optimize the model parameters. The proposed technique is compared with the conventional DTW technique, and the comparative analysis shows that it improves the results from 84% with DTW to 94% for Hindi spoken words across various speech utterances, environmental conditions and speakers.
Conference Paper
Devnagari (Marathi) is an Indo-Aryan language with a large number of speakers around the world. The Marathi language has gained acceptance in media and communication and therefore deserves a place in the growing field of automatic speech recognition. This manuscript describes an automatic speech recognition system that recognizes Marathi phonemes using a Continuous Density Hidden Markov Model (CDHMM). It uses Mel-frequency cepstral coefficients (MFCC) for feature extraction, the Baum-Welch algorithm for re-estimating the parameters and, finally, the Viterbi algorithm to recognize the phoneme. Experimental results are also discussed in this paper.
Conference Paper
Full-text available
In this paper, we study the controllability of semilinear evolution differential systems with nonlocal initial conditions in a separable Banach space. The results are obtained by using Hausdorff measure of noncompactness and a new calculation method.
Conference Paper
The speech of non-native speakers of a language is usually characterized by the presence of acoustic-phonetic and phonological deviations from the norm. Research in this field has attempted to identify these deviations in order to characterize different accents. The English language has been studied for many years from the perspective of both native and non-native accents. Non-native speakers' speech may possess several acoustic-phonetic characteristics that are unusual in native speakers' speech. This paper is concerned with the effect of Bengali accent on English vowel recognition. It is seen that the spectral characteristics of different English vowel sounds are strongly influenced by Bengali-accented speech; the formant pattern of each vowel sound in Bengali accent differs from that of native accents. These findings are very useful for developing an English vowel recognizer capable of working successfully with Bengali-accented speakers. The recognition method applied here is very simple and the recognition accuracy is also very satisfactory.
Conference Paper
Full-text available
Speech is the primary form of communication between human beings. By talking to others we let them know our thoughts, feelings and needs, and come to know theirs. This form of human communication is still a dream if we compare it to man-machine communication. The motivating factor for research in this area was the possibility of developing a speech recognition system that does not require any kind of training from its users, allowing anyone to operate the system through speech, since speech is the most natural medium of human communication. This work aims to build a game for discovering colors using speech recognition; the system will serve as a basis for future work on speech recognition, as the more traditional techniques were applied in it. The recognition system developed was applied to speaker-dependent isolated word recognition and runs on mobile devices with the Android operating system.
Article
In this paper, artificial neural networks were used to accomplish isolated speech recognition. The topic was investigated in two steps, consisting of a pre-processing part using Digital Signal Processing (DSP) techniques and a post-processing part using Artificial Neural Networks (ANN). These two parts are briefly explained, and speech recognizers using different ANN architectures were implemented in Matlab. Three different neural network models were designed: multilayer backpropagation, Elman and probabilistic neural networks. Performance comparisons with similar studies found in the related literature indicate that our proposed ANN structures yield satisfactory results.
Conference Paper
Successful and effective man-machine communication is a major milestone for human beings in transferring their abilities to machines. As language is the most natural means of communication, speech recognition and synthesis are best means for communicating with the computer. Reliable speech recognition is a difficult problem, requiring a combination of many techniques. Recent progress in speech synthesis has produced synthesizers with very high intelligibility but the sound quality and naturalness still remain a major problem to be addressed. This paper aims to look at the problem in an altogether different perspective; speech recognition and synthesis for an individual is much simpler problem to tackle and yet presents a wide range of applications that serve the basic purpose of human computer interaction.
Conference Paper
Full-text available
Changes in spectral characteristics are important cues for discriminating and identifying speech sounds. These changes can occur over very short time intervals. Computing frames every 10 ms, as commonly done in recognition systems, is not sufficient to capture such dynamic changes. In this paper, we propose a variable frame rate (VFR) algorithm. The algorithm results in an increased number of frames for rapidly-changing segments with relatively high energy and fewer frames for steady-state segments. The current implementation used an average data rate of less than 100 frames per second. For an isolated word recognition task, and using an HMM-based speech recognition system, the proposed technique results in significant improvements in recognition accuracy, especially at low signal-to-noise ratios. The technique was evaluated with mel frequency cepstral coefficient (MFCC) vectors and MFCC vectors with enhanced peak isolation.
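The variable frame rate idea above keeps more frames where the spectrum changes quickly and fewer in steady-state regions. The exact selection rule of that paper is not reproduced here; a simple sketch that retains a frame when its distance to the last retained frame exceeds a threshold (a common VFR heuristic, with illustrative parameter values) is:

```python
import numpy as np

def variable_frame_rate(frames, threshold=1.0):
    """Down-sample a NumPy array of feature frames (frames x dims), keeping a
    frame only when it differs enough from the last retained one.  Rapidly
    changing segments therefore keep more frames than steady-state segments."""
    kept = [0]                       # always keep the first frame
    last = frames[0]
    for i in range(1, len(frames)):
        if np.linalg.norm(frames[i] - last) > threshold:
            kept.append(i)
            last = frames[i]
    return frames[kept]
```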
Conference Paper
A method for the automatic recognition of offline handwritten signatures using both global and local features is described. As global features, we use the envelope of the signature sequenced as polar coordinates; as local features we use points located inside the envelope that describe the density or distribution of signature strokes. Each feature is processed as a sequence by a Hidden Markov Model (HMM) classifier. The results of both classifiers are linearly combined, obtaining a recognition ratio of 95.15% with a database of 60 handwritten signatures.
Conference Paper
Signature recognition is a relevant area in secure applications referred to as biometric identification. The image of the signature to be recognized (in off-line systems) can be considered as a spatio-temporal signal due to the geometric and sequential character of the pencil drawing. The recognition and classification methods known to us are based on the extraction of geometric parameters and their classification by either a linear or nonlinear classifier. This procedure neglects the temporal information of the signature. In order to alleviate this, this paper proposes to use signature parameters with spatio-temporal information and their classification by a classifier capable of dealing with spatio-temporal problems, such as hidden Markov models (HMM). The proposed parameters are calculated in two stages: first, the preprocessing stage, which includes noise reduction and outline detection through a skeletonization or thinning algorithm; and second, a parameterization stage in which the signature is encoded following the signature line and recording the length and direction of the pencil drawing, obtaining a vector that includes the signature's spatio-temporal information. The classification of the above parameters is done by an HMM classifier working in the same way as isolated word recognition systems. To design (train and test) the HMM classifier we have built a database of 24 signatures from 60 different writers.
Book
In the period following World War II, it began to be recognized that there were a large number of interesting and significant activities which could be classified as multistage decision processes. It was soon seen that the mathematical problems that arose in their study stretched the conventional confines of analysis, and required new methods for their successful treatment. The classical techniques of calculus and the calculus of variations were occasionally valuable and useful in these new areas, but were clearly limited in range and versatility, and were definitely lacking as far as furnishing numerical answers was concerned. © 1962, by The RAND Corporation. Published, 1962, by Princeton University Press. All Rights Reserved.
Article
Little has been done in the study of these intriguing questions, and I do not wish to give the impression that any extensive set of ideas exists that could be called a "theory." What is quite surprising, as far as the histories of science and philosophy are concerned, is that the major impetus for the fantastic growth of interest in brain processes, both psychological and physiological, has come from a device, a machine, the digital computer. In dealing with a human being and a human society, we enjoy the luxury of being irrational, illogical, inconsistent, and incomplete, and yet of coping. In operating a computer, we must meet the rigorous requirements for detailed instructions and absolute precision. If we understood the ability of the human mind to make effective decisions when confronted by complexity, uncertainty, and irrationality then we could use computers a million times more effectively than we do. Recognition of this fact has been a motivation for the spurt of research in the field of neurophysiology.
Article
The authors present a large vocabulary, continuous speech recognition system based on linked predictive neural networks (LPNNs). The system uses neural networks as predictors of speech frames, yielding distortion measures which can be used by the one-stage DTW algorithm to perform continuous speech recognition. The system currently achieves 95%, 58%, and 39% word accuracy on tasks with perplexity 7, 111, and 402, respectively, outperforming several simple HMMs that have been tested. It was also found that the accuracy and speed of the LPNN can be slightly improved by the judicious use of hidden control inputs. The strengths and weaknesses of the predictive approach are discussed.
Article
A fundamental challenge in speech recognition is the discrimination of the acoustic vectors characterizing each speech time-slot into classes of sub-word units. The most popular technique relies on hidden Markov models (HMM) but alternative models could make use of linear or non-linear discriminant functions and also Multilayer Perceptrons (MLP). A common property of these techniques is their learning ability. However, the advantage of discriminant functions and multilayer perceptrons is their possibility to force simultaneously the acceptance of a vector into its own class and its rejection by the rival classes. Ideally, this classification relies on a logical decision with a score invariant for any vector in a given class. Generally, classes are not linearly separable. As linear discriminant functions are not flexible enough to attain both objectives, suitable non-linear discriminant functions could approximate logical decision and achieve non-linear separation. For that purpose, multilayer perceptrons are particularly powerful tools because of their possibility to approximate a very wide range of non-linear functions with a rather restricted set of parameters. The main properties of MLP are pointed out and illustrated by simple examples. The possibility of including discriminant principles in a Dynamic Time Warping process and in a Viterbi-like training is shown. The discriminant properties of HMM, linear discriminant functions and MLP are compared from the point of view of local labeling of the acoustic vectors and of the global recognition. Finally, a phoneme based real task application (50 phonemes, 1000 vocabulary words) is described. It makes use of a particular MLP, based on the NETtalk architecture, and shows how the non-linear discriminant functions and the consideration of the temporal context dependence of the acoustic vectors are useful for the phonetic speech labeling.
Article
Vector Quantisation (VQ) has been shown to be robust in speaker recognition systems which require a small amount of training data. However, the conventional VQ-based method only uses distortion measurements and discards the sequence of quantised codewords. In this paper we propose a method which extends the VQ distortion method by combining it with the likelihood of the sequence of VQ indices against a discrete hidden Markov model (DHMM). The method is particularly suitable for combined speech recognition and speaker recognition systems. Experiments on the TI46 database show that the combined score gives better performance than both the conventional VQ-based and DHMM-based methods.
Conference Paper
A hybrid method for continuous-speech recognition which combines hidden Markov models (HMMs) and a connectionist technique called connectionist Viterbi training (CVT) is presented. CVT can be run iteratively and can be applied to large-vocabulary recognition tasks. Successful completion of training the connectionist component of the system, despite the large network size and volume of training data, depends largely on several measures taken to reduce learning time. The system is trained and tested on the TI/NBS speaker-independent continuous-digits database. Performance on test data for unknown-length strings is 98.5% word accuracy and 95.0% string accuracy. Several improvements to the current system are expected to increase these accuracies significantly
Conference Paper
An architecture for a neural network that implements a hidden Markov model (HMM) is presented. This HMM net suggests integrating signal preprocessing (such as vector quantization) with the classifier. A minimum mean-squared-error training criterion for the HMM/neural net is presented and compared to maximum-likelihood and maximum-mutual-information criteria. The HMM forward-backward algorithm is shown to be the same as the neural net backpropagation algorithm. The implications of probability constraints on the HMM parameters are discussed. Relaxing these constraints allows negative probabilities, equivalent to inhibitory connections. A probabilistic interpretation is given for a network with negative, and even complex-valued, parameters
Conference Paper
A phoneme based, speaker-dependent continuous-speech recognition system embedding a multilayer perceptron (MLP) (i.e. a feedforward artificial neural network) into a hidden Markov model (HMM) approach is described. Contextual information from a sliding window on the input frames is used to improve frame or phoneme classification performance over the corresponding performance for simple maximum-likelihood probabilities, or even maximum a posteriori (MAP) probabilities which are estimated without the benefit of context. Performance for a simple discrete density HMM system appears to be somewhat better when MLP methods are used to estimate the probabilities
Conference Paper
The authors describe two systems in which neural network classifiers are merged with dynamic programming (DP) time alignment methods to produce high-performance continuous speech recognizers. One system uses the connectionist Viterbi-training (CVT) procedure, in which a neural network with frame-level outputs is trained using guidance from a time alignment procedure. The other system uses multi-state time-delay neural networks (MS-TDNNs), in which embedded DP time alignment allows network training with only word-level external supervision. The CVT results on the TI Digits are 99.1% word accuracy and 98.0% string accuracy. The MS-TDNNs are described in detail, with attention focused on their architecture, the training procedure, and results of applying the MS-TDNNs to continuous speaker-dependent alphabet recognition: on two speakers, word accuracy is respectively 97.5% and 89.7%.
Conference Paper
The dominant acoustic modeling methodology based on Hidden Markov Models is known to have certain weaknesses. Partial solutions to these flaws have been presented, but the fundamental problem remains: compression of the data to a compact HMM discards useful information such as time dependencies and speaker information. In this paper, we look at pure example based recognition as a solution to this problem. By replacing the HMM with the underlying examples, all information in the training data is retained. We show how information about speaker and environment can be used, introducing a new interpretation of adaptation. The basis for the recognizer is the well-known DTW algorithm, which has often been used for small tasks. However, large vocabulary speech recognition introduces new demands, resulting in an explosion of the search space. We show how this problem can be tackled using a data driven approach which selects appropriate speech examples as candidates for DTW-alignment.
Article
In this study, we propose an algorithm for Arabic isolated digit recognition. The algorithm is based on extracting acoustical features from the speech signal and using them as input to multilayer perceptron neural networks. Each word in the digit vocabulary (0 to 9) is associated with a network. The networks are implemented as predictors for the speech samples over a certain duration of time. The back-propagation algorithm is used to train the networks. A hidden Markov model (HMM) is implemented to extract temporal features (states) from the speech signal. The input vector to the networks consists of twelve mel frequency cepstral coefficients, the log energy, and five elements representing the state. Our results show that we are able to reduce the word error rate compared with an HMM word recognition system.
Article
This thesis studies the introduction of a priori structure into the design of learning systems based on artificial neural networks applied to sequence recognition, in particular to phoneme recognition in continuous speech. Because we are interested in sequence analysis, algorithms for training recurrent networks are studied and an original algorithm for constrained recurrent networks is proposed and test results are reported. We also discuss the integration of connectionist models with other analysis tools that have been shown to be useful for sequences, such as dynamic programming and hidden Markov models. We introduce an original algorithm to perform global optimization of a neural network/hidden Markov model hybrid, and show how to perform such a global optimization on all the parameters of the system. Finally, we consider some alternatives to sigmoid networks: Radial Basis Functions, and a method for searching for better learning rules using a priori knowledge and optimization algorithms.
Conference Paper
The authors compare dynamic programming, or DP, multilayer perceptron, time-delay neural network, or TDNN, shift-tolerant learning vector quantization, and K-means on a multispeaker isolated-word small vocabulary problem. A suboptimal cooperation between TDNN and other algorithms is proposed and successfully tested on the problem. The combination of TDNN and DP performs especially well. An optimal cooperation method between DP and some other algorithms is proposed
Article
A study is presented into the importance of two commonly overlooked factors influencing generalisation ability in the field of hidden Markov model (HMM) based recogniser training algorithms by means of a comparative study of four initialisation methods and three stop criteria in different applications. The results show that better results have been found with the equal-occupancy initialisation method and the fixed-threshold stop criterion
Article
The current state-of-the-art in large-vocabulary, continuous speech recognition is based on the use of hidden Markov models (HMM). In an attempt to improve over HMM performance, the authors developed a hybrid system that combines the advantages of neural networks and HMM using a multiple hypothesis (or N-best) paradigm. The connectionist component of the system, the segmental neural net (SNN), models all the frames of a phonetic segment simultaneously, thus overcoming the well-known conditional-independence limitation of the HMM. They describe the hybrid system and discuss various aspects of SNN modeling, including network architectures, training algorithms and context modeling. Finally, they evaluate the hybrid system by performing several speaker-independent experiments with the DARPA Resource Management (RM) corpus, and demonstrate that the hybrid system shows a consistent improvement in performance over the baseline HMM system
Article
The Principle of Optimality and one method for its application, dynamic programming, were popularized by Bellman in the early 1950's. Dynamic programming was soon proposed for speech recognition and applied to the problem as soon as digital computers with sufficient memory were available, around 1962. Today, most commercially available recognizers and many of the systems being developed in research laboratories use dynamic programming, typically to address the problem of the time alignment between a speech segment and some template or synthesized speech artifact. In this tutorial paper, the application of dynamic programming to connected-speech recognition is introduced and discussed. The deterministic form, used for template matching for connected speech, is described in detail. The stochastic form, ordinarily called the Viterbi algorithm, is also introduced.
Article
This tutorial provides an overview of the basic theory of hidden Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and gives practical details on methods of implementation of the theory along with a description of selected applications of the theory to distinct problems in speech recognition. Results from a number of original sources are combined to provide a single source of acquiring the background required to pursue further this area of research. The author first reviews the theory of discrete Markov chains and shows how the concept of hidden states, where the observation is a probabilistic function of the state, can be used effectively. The theory is illustrated with two simple examples, namely coin-tossing, and the classic balls-in-urns system. Three fundamental problems of HMMs are noted and several practical techniques for solving these problems are given. The various types of HMMs that have been studied, including ergodic as well as left-right models, are described
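Rabiner's tutorial cited above frames HMM use around three fundamental problems: evaluation, decoding and training. A minimal NumPy sketch of the first of these, the forward algorithm in log space for a discrete-observation HMM, is given below as a reference point; the pi/A/B notation follows the tutorial, while the log-space formulation stands in for the scaling details the tutorial discusses.

```python
import numpy as np

def forward_loglik(obs, log_pi, log_A, log_B):
    """Log-likelihood of a discrete observation sequence under an HMM.
    log_pi: (N,) initial state log-probs; log_A: (N, N) transition log-probs;
    log_B: (N, M) emission log-probs; obs: sequence of symbol indices."""
    alpha = log_pi + log_B[:, obs[0]]                 # initialisation
    for t in range(1, len(obs)):
        # Induction: log-sum-exp over predecessor states, then emit obs[t].
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[:, obs[t]]
    return np.logaddexp.reduce(alpha)                 # termination
```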
M. A. Ferrer, J. L. Camino, C. M. Travieso, C. Morales, Signature Classification by Hidden Markov Model, 33rd International Carnahan Conference on Security Technology (IEEE ICCST'99), Madrid, Spain, Oct. 1999, pp. 481-484.
Renals, S., Morgan, N., Bourlard, H., Cohen, M. & Franco, H. (1994), Connectionist probability estimators in HMM speech recognition, IEEE Transactions on Speech and Audio Processing 2(1), 161-174.