This tutorial provides an overview of the basic theory of hidden
Markov models (HMMs) as originated by L.E. Baum and T. Petrie (1966) and
gives practical details on methods of implementation of the theory along
with a description of selected applications of the theory to distinct
problems in speech recognition. Results from a number of original
sources are combined to provide a single source for acquiring the
background required to pursue this area of research further. The author
first reviews the theory of discrete Markov chains and shows how the
concept of hidden states, where the observation is a probabilistic
function of the state, can be used effectively. The theory is
illustrated with two simple examples, namely coin-tossing, and the
classic balls-in-urns system. Three fundamental problems of HMMs are
noted and several practical techniques for solving these problems are
given. The various types of HMMs that have been studied, including
ergodic as well as left-right models, are described.
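
The first of the three problems mentioned above, scoring an observation sequence against a model, is usually handled with the forward recursion. The following is a minimal sketch in plain Python, not code from the tutorial itself. The two-coin model (`pi`, `A`, `B`) and its numbers are hypothetical values chosen only for illustration.

```python
def forward_likelihood(pi, A, B, obs):
    """Return P(obs | model) for a discrete HMM using the forward recursion.

    pi[i]   : initial probability of state i
    A[i][j] : probability of moving from state i to state j
    B[i][k] : probability that state i emits symbol k
    obs     : list of observed symbol indices
    """
    n = len(pi)
    # Initialisation: alpha_1(i) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    # Induction: alpha_{t+1}(j) = (sum_i alpha_t(i) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
    # Termination: sum over the final states
    return sum(alpha)


# Hypothetical coin-tossing model: state 0 is a fair coin,
# state 1 is a coin biased toward heads. Symbols: 0 = heads, 1 = tails.
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.5, 0.5],
     [0.8, 0.2]]

likelihood = forward_likelihood(pi, A, B, [0, 0, 1, 0])
print(likelihood)
```

The recursion costs on the order of N²T operations for N states and T observations, whereas summing over every state path explicitly grows exponentially in T. For long sequences the unscaled `alpha` values underflow, so a practical version would rescale them or work with logarithms.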


... In parallel to the development of all the presented feature representations, several techniques were proposed to substitute the simple classifier commonly used. Given the success of Hidden Markov Models (HMM) [48] in speech recognition previously, the use of HMM as a classifier for face recognition was introduced in several works. In [49,50], strips of raw pixels covering areas such as forehead, eye, nose, mouth and chin were extracted from a face image and converted into a chronological sequence. ...

... As an alternative to template matching approaches, a second type of classifier based on parametric modelling methods was developed. The first of these classifiers was Hidden Markov Model (HMM) [48, 101-103], which was a stochastic process where a sequence of observations was produced by a Markov chain of a finite number of states. To model the observation generation process with HMM, a Probability Density Function (PDF) is employed and several alternatives can be selected. ...

... The Hidden Markov Model (HMM) [48] is a generative probabilistic model that represents a set of stochastic sequences (such as face regions, audio concepts or segments) as a Markov chain with a finite number of states, . Each state generates observations following a random process which we characterize by its Probability Density Function (PDF). ...

The increasing use of technological devices and biometric recognition systems in people's daily lives has motivated a great deal of research interest in the development of effective and robust systems. However, there are still some challenges to be solved in these systems when Deep Neural Networks (DNNs) are employed. For this reason, this thesis proposes different approaches to address these issues.
First of all, we have analyzed the effect of introducing the most widespread DNN architectures to develop systems for face and text-dependent speaker verification tasks. In this analysis, we observed that state-of-the-art DNNs established for many tasks, including face verification, did not perform efficiently for text-dependent speaker verification. Therefore, we have conducted a study to find the cause of this poor performance and we have noted that under certain circumstances this problem is due to the use of a global average layer as the pooling mechanism in DNN architectures. Since the order of the phonetic information is relevant in the text-dependent speaker verification task, this order is neglected whenever global average pooling is employed, and the values obtained for the verification performance metrics are too high. Hence, the first approach proposed in this thesis is an alignment mechanism that replaces the global average pooling. This alignment mechanism makes it possible to keep the temporal structure and to encode the utterance and speaker identity in a supervector. As the alignment mechanism, different types of approaches such as a Hidden Markov Model (HMM) or a Gaussian Mixture Model (GMM) can be used. Moreover, during the development of this mechanism, we also noted that the lack of larger training databases is another important issue in creating these systems. Therefore, we have also introduced a new architecture philosophy based on the Knowledge Distillation (KD) approach. This architecture is known as the teacher-student architecture and provides robustness to the systems during the training process and against possible overfitting due to the lack of data. In this part, another alternative approach is proposed to focus on the relevant frames of the sequence and maintain the phonetic information, which consists of Multi-head Self-Attention (MSA).
The architecture proposed to use the MSA layers also introduces phonetic embeddings and memory layers to improve the discrimination between speakers and utterances. Moreover, to complete the architecture with the previous techniques, another approach has been incorporated in which two learnable vectors, called class and distillation tokens, are introduced. Using these tokens during training, temporal information is kept and encoded into the tokens, so that, in the end, a global utterance descriptor similar to the supervector is obtained.
Apart from the above approaches to obtain robust representations, the other main part of this thesis has focused on introducing new loss functions to train DNN architectures. Traditional loss functions have provided reasonably good results for many tasks, but they are not usually designed to optimize the goal task. For this reason, we have proposed several new loss functions as objectives for training DNN architectures which are based on the final verification metrics. The first approach developed for this part is inspired by the Area Under the ROC Curve (AUC). Thus, we have presented a differentiable approximation of this metric, called aAUC loss, to successfully train a triplet neural network as back-end. However, the selection of the training data has to be done carefully to build this back-end, so it involves a high computational cost. Therefore, we have developed several approaches to take advantage of training with a loss function oriented to the goal task while keeping the efficiency and speed of multi-class training. To implement these approaches, differentiable approximations of the Detection Cost Function (aDCF) and Cost of Log-Likelihood Ratio (CLLR) verification metrics have been employed as training objective losses. By optimizing DNN architectures to minimize these loss functions, the system learns to reduce errors in the decisions and scores produced. The use of these approaches has also shown a better ability to learn more general representations than training with other traditional loss functions. Finally, we have also proposed a new straightforward back-end that employs the information learned by the matrix of the last layer of the DNN architecture during training with aDCF loss. Using the matrix of this last layer, an enrollment model with a learnable vector is trained for each enrollment identity to perform the verification process.

... To obtain a clear understanding of the forward algorithm, consider the joint probability $p(S_t, A_t)$. The forward algorithm is able to efficiently compute this joint probability in a recursive way as in [37]. Herein, the forward algorithm is described as follows ...

... However, the optimization problem relies on predicting the most likely hidden state $S_t^*$ from (17) using the forward algorithm, which uses the actual hyperparameter values $q_{nk}$ and . This problem can be solved iteratively using the Baum-Welch algorithm [37]. ...

... A more interested reader can refer to [37] for more details about the Baum-Welch expectation-maximization algorithm. ...

In this work, we present a novel traffic prediction and fast uplink framework for IoT networks controlled by binary Markovian events. First, we apply the forward algorithm with hidden Markov models (HMM) in order to schedule the available resources to the devices with maximum likelihood activation probabilities via fast uplink grant. In addition, we use the regret metric, defined as the number of wasted transmission slots, to evaluate the performance of the prediction. Next, we formulate a fairness optimization problem to minimize the age of information while keeping the regret as low as possible. Finally, we propose an iterative algorithm to estimate the model hyperparameters (activation probabilities) in a real-time application and apply an online-learning version of the proposed traffic prediction scheme. Simulation results show that the proposed algorithms outperform baseline models such as time division multiple access (TDMA) and grant-free (GF) random-access in terms of regret, the efficiency of system usage, and age of information.

... The delete state, start state, and end state do not emit any symbols. As a result, they are called silent states [45]. ...

... So each particle is a real number encoding of dimension (9l + 3 + (2l + 1)|A|). Since biological DNA contains four nucleotides {'A', 'T', 'C', 'G'}, the number of parameters of the DNA model is (17l + 7) [45]. According to the nature of transfer probability and symbol sending probability, the state transfer probability and symbol sending probability in the HMM need to be normalized to satisfy the (3l + 1) transfer probability normalization constraint equation and (2l + 1) symbol emission probability normalization equation before evaluating the corresponding HMM model of the particle. ...

... Based on the results of OTLEO training, the globally optimal Hidden Markov Model corresponding to the particles is obtained. This model is then used to compare the sequences using the Viterbi algorithm [45] to obtain the optimal pairing results, and the results are evaluated using an objective function based on the SOP (sum-of-pairs) [46] scoring system. The SOP scoring function is as follows: ...

Equilibrium Optimizer (EO) is a newly developed intelligent optimization algorithm inspired by control volume mass balance models. EO has been proven to perform well on some optimization problems, with the advantages of ease of implementation and strong adaptability. However, the original EO has some disadvantages when solving complex multimodal problems, including an immature balance between exploration and exploitation, a high probability of becoming trapped in local optima, and a slow rate of convergence. In order to address these shortcomings, a modified equilibrium optimizer (OTLEO) is proposed using teaching-learning-based optimization (TLBO) and opposition-based learning (OBL). These modifications aim to maintain the diversity of the solutions, expand the algorithm’s search range, improve exploration and exploitation capabilities, and avoid local optima. To verify and analyze the modified equilibrium optimizer algorithm’s performance, OTLEO was tested on 32 classical benchmark functions. All algorithms are independently run 30 times in the same environment. Thereafter, a comparative evaluation of OTLEO against six other representative algorithms is conducted. Four real-world engineering application problems, including multiple sequence alignment, are used for additional validation. The experimental results, statistical tests, qualitative analysis, and stability analysis demonstrate that the proposed OTLEO outperforms the original EO and other algorithms.

... The first work considers equalizers based on transversal filtering and traditional algorithms such as LMS and RLS [11] for channel estimation, where the focus is a low-complexity receiver. In [9] the focus is on approaching the channel capacity, and the author considers a blind alternative, based on MAP detection and iterative turbo decoding, in which the turbo principle is also applied to channel estimation, which is performed using the Baum-Welch algorithm [12]-[14]. ...

... Our focus in this work is on resolving the poor convergence of the channel estimation. This estimation is performed using the Baum-Welch algorithm [12]-[14] on the hidden Markov model defined by the trellis with Ë states corresponding to the channel with intersymbol interference (ISI). Applying the Baum-Welch algorithm in this case results in an iterative estimation of the channel parameters needed for detection and decoding. ...

... The recursions that define the Baum-Welch algorithm in our case can be written as [9], [14], [16]: ...

... Speech perception involves multiple hierarchical levels of processing, to go from the acoustic, continuous signal to speech unit identification, to, ultimately, the perceived meaning of utterances. Classical psycholinguistic models inspired by interaction-activation processes such as TRACE or SHORT-LIST [1,2,3,4], as well as automatic speech recognition (ASR) models such as Hidden Markov Models (HMMs) or Deep Neural Networks (DNNs) [5,6,7], directly decode the continuous speech stream from spectro-temporal information through a battery of computational processes associating phonetic-prosodic, lexical and syntactic-semantic knowledge. However, recent studies in speech neuroscience focus on cognitive processes that appear crucial for speech perception, and that perform temporal segmentation, that is, identifying in the speech signal temporally relevant events (e.g., syllabic boundaries). ...

... However, recent studies in speech neuroscience focus on cognitive processes that appear crucial for speech perception, and that perform temporal segmentation, that is, identifying in the speech signal temporally relevant events (e.g., syllabic boundaries). Through synchronization processes between different populations of neurons operating in different frequency bands, typically in the gamma band (40-100 Hz) for acoustic spectrotemporal analysis, in the theta band (3-8 Hz) for syllabic segmentation, and in the delta band (1-2 Hz) for rhythmic/syntactic binding, the human brain would exploit neuronal oscillations to perform this temporal segmentation of incoming acoustic signals [8,9,10]. ...

... To address these questions, various models of syllabic parsing based on neural oscillations in the cortex at the theta rhythm (3-8 Hz) have been proposed in the literature [18,19,20,21]. While some of these neuro-computational models exploit sophisticated realistic neuronal processing principles, sometimes even at the spike level of representation, we focus here on the model developed by Räsänen, Doyle and Frank [20] which stands out as the simplest one, operating on simple processes of envelope detection and linear second-order oscillators. ...

... The inference algorithm is based on one of the most extensively used mathematical models for analysis of time series with time-dependent states: the Hidden Markov Model (HMM) [51]. HMMs consider an observed discrete time-series as a collection of random variables $Y_{1:T} = \{Y_1, Y_2, \ldots$ ...

... We wish to estimate the value of a latent parameter $\theta_t = (a_t, q_t)$ given the entire observed time-series $v_{1:T}$. Therefore, the inference has a naturally Bayesian interpretation, and is implemented using the forward-backward algorithm for HMMs [51,52]. In the description of the forward-backward algorithm for simplicity we assume that the probability of a given observation depends only on the current value of the parameter, not on past data points. ...
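
Estimating a latent state at time t from the entire observed series is the smoothing task that the forward-backward algorithm performs. The sketch below is a generic discrete-state version in plain Python, given only as an illustration. It is not the cited paper's implementation, and the model values (`pi`, `A`, `B`) are hypothetical. It relies on the assumption stated in the snippet above: each observation depends only on the current hidden state.

```python
def forward_backward(pi, A, B, obs):
    """Return gamma[t][i] = P(state_t = i | obs[0..T-1]) for a discrete HMM.

    Uses per-step scaling so that long sequences do not underflow.
    Assumes every observation has non-zero probability under the model.
    """
    n, T = len(pi), len(obs)

    # Forward pass: alpha[t] is normalised to sum to 1, scale[t] is the normaliser.
    alpha, scale = [], []
    for t, o in enumerate(obs):
        if t == 0:
            raw = [pi[i] * B[i][o] for i in range(n)]
        else:
            raw = [sum(alpha[-1][i] * A[i][j] for i in range(n)) * B[j][o]
                   for j in range(n)]
        c = sum(raw)
        scale.append(c)
        alpha.append([a / c for a in raw])

    # Backward pass, reusing the forward scaling factors.
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                       for j in range(n)) / scale[t + 1]
                   for i in range(n)]

    # Smoothed posterior: product of the scaled forward and backward terms.
    return [[alpha[t][i] * beta[t][i] for i in range(n)] for t in range(T)]


# Hypothetical two-state model with two observation symbols.
pi = [0.5, 0.5]
A = [[0.9, 0.1],
     [0.2, 0.8]]
B = [[0.5, 0.5],
     [0.8, 0.2]]

gamma = forward_backward(pi, A, B, [0, 1, 0])
for t, row in enumerate(gamma):
    print(t, row)
```

Each row of `gamma` is a probability distribution over the hidden states at one time step. The last row coincides with the filtered estimate from the forward pass alone, since no future observations remain to condition on.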

Multi-potent progenitor (MPP) cells act as a key intermediary step between haematopoietic stem cells and the entirety of the mature blood cell system. Their eventual fate determination is thought to be achieved through migration in and out of spatially distinct niches. Here we first analyze statistically MPP cell trajectory data obtained from a series of long time-course 3D in vivo imaging experiments on irradiated mouse calvaria, and report that MPPs display transient super-diffusion with apparent non-Gaussian displacement distributions. Second, we explain these experimental findings using a run-and-tumble model of cell motion which incorporates the observed dynamical heterogeneity of the MPPs. Third, we use our model to extrapolate the dynamics to time-periods currently inaccessible experimentally, which enables us to quantitatively estimate the time and length scales at which super-diffusion transitions to Fickian diffusion. Our work sheds light on the potential importance of motility in early haematopoietic progenitor function.

... We train a set of HMMs, where each HMM represents one behaviour (e.g. we have one HMM to represent the 'toileting' behaviour, another to represent the 'doing laundry' behaviour, etc.), using the standard Expectation-Maximization (EM) algorithm (Rabiner 1989). We use the method described in (Chua, Marsland, and Guesgen 2009) to perform segmentation and behaviour recognition. ...

... , $O_{10}|\lambda$)). Since it is unlikely that all of the sequences in the window belong to one behaviour, a re-segmentation is performed by using the forward algorithm (Rabiner 1989) to calculate the likelihood of each observation in the window according to the winning HMM. The results are shown in Table 2. ...

Behaviour recognition is the process of inferring the behaviour of an individual from a series of observations acquired from sensors such as in a smart home. The majority of existing behaviour recognition systems are based on supervised learning algorithms, which means that training them requires a preprocessed, annotated dataset. Unfortunately, annotating a dataset is a rather tedious process and one that is prone to error. In this paper we suggest a way to identify structure in the data based on text compression and the edit distance between words, without any prior labelling. We demonstrate that by using this method we can automatically identify patterns and segment the data into patterns that correspond to human behaviours. To evaluate the effectiveness of our proposed method, we use a dataset from a smart home and compare the labels produced by our approach with the labels assigned by a human to the activities in the dataset. We find that the results are promising and show significant improvement in the recognition accuracy over Self-Organising Maps (SOMs).

... For the duration predictor and alignment step (dashed block in Fig. 1), we used the same implementation as stated in Glow-based [15] and DPMbase [14] systems. We use a monotonic alignment search based on the Viterbi method [25] to find alignment between phoneme sequence and Mel spectrogram. This step is also conditioned on speaker embedding and expressivity embedding. ...

... This effectively assigns smaller probabilities to large radiocarbon residuals and improves agreement between the core age models and the radiocarbon observations. ... Welch Expectation Maximization algorithm (Rabiner, 1989; Durbin et al., 1998). These parameters account for vital effects among different benthic foraminifera species (e.g., Marchitto et al., 2014) and different local water mass properties at different locations (e.g., temperature and δ18O of seawater). ...

Previously developed software packages that generate probabilistic age models for ocean sediment cores are designed to use either age proxies (e.g., radiocarbon or tephra layers) or stratigraphic alignment (e.g., of benthic δ18O) and cannot combine age inferences from both techniques. Furthermore, many radiocarbon dating packages are not specifically designed for marine sediment cores and default settings may not accurately reflect the probability of sedimentation rate variability in the deep ocean, requiring subjective tuning of parameter settings. Here we present a new technique for generating Bayesian age models and stacks using ocean sediment core radiocarbon and benthic δ18O data, implemented in a software package named BIGMACS (Bayesian Inference Gaussian Process regression and Multiproxy Alignment of Continuous Signals). BIGMACS constructs multiproxy age models by combining age inferences from both radiocarbon ages and benthic δ18O stratigraphic alignment and constrains sedimentation rates using an empirically derived prior model based on 37 14C-dated ocean sediment cores (Lin et al., 2014). BIGMACS also constructs continuous benthic δ18O stacks via a Gaussian process regression, which requires a smaller number of cores than previous stacking methods. This feature allows users to construct stacks for a region that shares a homogeneous deep water δ18O signal, while leveraging radiocarbon dates across multiple cores. Thus, BIGMACS efficiently generates local or regional stacks with smaller uncertainties in both age and δ18O than previously available techniques. We present two example regional benthic δ18O stacks and demonstrate that the multiproxy age models produced by BIGMACS are more precise than their single proxy counterparts.

... Probabilistic graphical models such as hidden Markov models (HMM) (Rabiner 1989) and conditional random fields (CRF) (Lafferty, McCallum, and Pereira 2001) can infer abstracted actions corresponding to system calls as these models allow a sequence to be tagged with labels based on context. While HMMs and linear-chain CRFs share many similarities, HMMs tend to be simpler to learn from data and do not require handcrafted feature functions characteristic of CRFs. ...
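
Tagging a sequence with labels using an HMM typically means finding the single most likely hidden-state path, which the Viterbi algorithm computes by dynamic programming. The following is a minimal plain-Python sketch of that idea, not the cited work's code. The two hidden labels, the three observation symbols and all probabilities are hypothetical toy values.

```python
def viterbi(pi, A, B, obs):
    """Return (best_path, probability) for a discrete HMM.

    best_path is the most likely sequence of hidden-state indices
    given the observed symbol indices in obs.
    """
    n = len(pi)
    # delta[i]: probability of the best path that ends in state i so far.
    delta = [pi[i] * B[i][obs[0]] for i in range(n)]
    backptr = []  # backptr[t][j]: best predecessor of state j at step t + 1

    for o in obs[1:]:
        prev = delta
        step_ptr, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] * A[i][j])
            step_ptr.append(best_i)
            delta.append(prev[best_i] * A[best_i][j] * B[j][o])
        backptr.append(step_ptr)

    # Backtrack from the most likely final state.
    last = max(range(n), key=lambda i: delta[i])
    path = [last]
    for step_ptr in reversed(backptr):
        path.append(step_ptr[path[-1]])
    path.reverse()
    return path, delta[last]


# Hypothetical toy model: two hidden labels (0 and 1), three observation symbols.
pi = [0.6, 0.4]
A = [[0.7, 0.3],
     [0.4, 0.6]]
B = [[0.5, 0.4, 0.1],
     [0.1, 0.3, 0.6]]

path, prob = viterbi(pi, A, B, [0, 1, 2])
print(path, prob)
```

Multiplying raw probabilities is fine for short sequences like this one. For long sequences a practical version would add log-probabilities instead to avoid underflow.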

We present a novel AI-based methodology that identifies phases of a host-level cyber attack simply from system call logs. System calls emanating from cyber attacks on hosts such as honey pots are often recorded in audit logs. Our methodology first involves efficiently loading, caching, processing, and querying system events contained in audit logs in support of computer forensics. Output of queries remains at the system call level and is difficult to process. The next step is to infer a sequence of abstracted actions, which we colloquially call a storyline, from the system calls given as observations to a latent-state probabilistic model. These storylines are then accurately identified with class labels using a learned classifier. We qualitatively and quantitatively evaluate methods and models for each step of the methodology using 114 different attack phases collected by logging the attacks of a red team on a server, on some likely benign sequences containing regular user activities, and on traces from a recent DARPA project. The resulting end-to-end system, which we call Cyberian, identifies the attack phases with a high level of accuracy illustrating the benefit that this machine learning-based methodology brings to security forensics.

... During the training stage, the corresponding parameters of the potential model are learned and the test samples are classified based on the likelihood. To name a few, the hidden Markov model (HMM) (Rabiner 1989) is widely used in TSC for speech recognition. The naive Bayes sequence classifier (Rish et al. 2001) is another typical model-based method, which relies on the feature independence assumption. ...

Multi-view time series classification (MVTSC) aims to improve the performance by fusing the distinctive temporal information from multiple views. Existing methods for MVTSC mainly aim to fuse multi-view information at an early stage, e.g., by extracting a common feature subspace among multiple views. However, these approaches may not fully explore the unique temporal patterns of each view in complicated time series. Additionally, the label correlations of multiple views, which are critical to boosting, are usually under-explored for the MVTSC problem. To address the aforementioned issues, we propose a Correlative Channel-Aware Fusion (C$^2$AF) network. First, C$^2$AF extracts comprehensive and robust temporal patterns by a two-stream structured encoder for each view, and derives the intra-view/inter-view label correlations with a concise correlation matrix. Second, a channel-aware learnable fusion mechanism is implemented through CNN to further explore the global correlative patterns. Our C$^2$AF is an end-to-end framework for MVTSC. Extensive experimental results on three real-world datasets demonstrate the superiority of our C$^2$AF over the state-of-the-art methods. A detailed ablation study is also provided to illustrate the indispensability of each model component.

... Given the hardness results mentioned above, most of the literature has concentrated on the second problem. Two dominant approaches for solving the second problem are the forward-backward algorithm for HMMs (Rabiner 1989) and the inside-outside algorithm for PCFGs (Baker 1979; Lari and Young 1990). ...

The problem of identifying a probabilistic context free grammar has two aspects: the first is determining the grammar's topology (the rules of the grammar) and the second is estimating probabilistic weights for each rule. Given the hardness results for learning context-free grammars in general, and probabilistic grammars in particular, most of the literature has concentrated on the second problem. In this work we address the first problem. We restrict attention to structurally unambiguous weighted context-free grammars (SUWCFG) and provide a query learning algorithm for structurally unambiguous probabilistic context-free grammars (SUPCFG). We show that SUWCFG can be represented using co-linear multiplicity tree automata (CMTA), and provide a polynomial learning algorithm that learns CMTAs. We show that the learned CMTA can be converted into a probabilistic grammar, thus providing a complete algorithm for learning a structurally unambiguous probabilistic context free grammar (both the grammar topology and the probabilistic weights) using structured membership queries and structured equivalence queries. We demonstrate the usefulness of our algorithm in learning PCFGs over genomic data.

... Gawali speech recognition approach. Three strategies have been implemented by the authors to recognize code-switching speech. The first option is to decode the speech using a global language model created from a multilingual text, which acts as a baseline system. ...

In the subject of pattern recognition, speech recognition is an important study topic. The authors give a detailed assessment of voice recognition strategies for several majority languages in this study. Over the last several decades, many researchers have contributed to the field of voice processing and recognition. Although there are several frameworks for speech processing and recognition, there are only a few ASR systems available for language recognition throughout the world. However, the data gathered for this research reveals that the bulk of the effort has been done to construct ASR systems for majority languages, whereas minority languages suffer from a lack of standard speech corpus. We also looked at some of the key issues for voice recognition in various languages in this research. We have explored various kinds of hybrid acoustic modeling methods required for efficient results. Because the success of a classifier is dependent on the removal of information during the feature separation phase, it is critical to carefully pick the value extraction techniques and classifiers.

... Many problems in machine learning and artificial intelligence involve discrete-time partially observable nonlinear dynamical systems. If the observations are discrete, then Hidden Markov Models (HMMs) (Rabiner 1989) or, in the controlled setting, Partially Observable Markov Decision Processes (POMDPs) (Sondik 1971) can be used to represent belief as a discrete distribution over latent states. Predictive State Representations (PSRs) (Littman, Sutton, and Singh 2002), Transformed Predictive State Representations (TPSRs) (Rosencrantz, Gordon, and Thrun 2004; Boots, Siddiqi, and Gordon 2010), and the closely related Observable Operator Models (OOMs) (Jaeger 2000) are generalizations of POMDPs that have attracted interest because they both have greater representational capacity than ...

Recently, a number of researchers have proposed spectral algorithms for learning models of dynamical systems — for example, Hidden Markov Models (HMMs), Partially Observable Markov Decision Processes (POMDPs), and Transformed Predictive State Representations (TPSRs). These algorithms are attractive since they are statistically consistent and not subject to local optima. However, they are batch methods: they need to store their entire training data set in memory at once and operate on it as a large matrix, and so they cannot scale to extremely large data sets (either many examples or many features per example). In turn, this restriction limits their ability to learn accurate models of complex systems. To overcome these limitations, we propose a new online spectral algorithm, which uses tricks such as incremental Singular Value Decomposition (SVD) and random projections to scale to much larger data sets and more complex systems than previous methods. We demonstrate the new method on an inertial measurement prediction task and a high-bandwidth video mapping task and we illustrate desirable behaviors such as "closing the loop," where the latent state representation changes suddenly as the learner recognizes that it has returned to a previously known place.

... best reachable directed path conditional on OCD $O_{uv}$ between the two nodes could be tracked. To do this, we used the idea of dynamic programming from the Viterbi algorithm [27], [28] and modified it to directed GC matrices. We consider state l − 1 is the set of all nodes before step l. ...

The directed brain functional network construction gives us new insights into the relationships between brain regions from the causality point of view. Granger causality analysis is one of the powerful methods for modeling the directed network. The complex brain network is also hierarchically constructed, which is particularly suited to facilitating segregated functions and the global integration of the segregated functions. Therefore, it is of great interest to explore new approaches to model the hierarchical architecture of the directed network. In the present study, we proposed a new approach, namely, stepwise multivariate Granger causality (SMGC), considering both the directed and hierarchical features of the brain functional network to explore the stepwise causal relationship in the network. The simulation study demonstrated that a diverse and complex hierarchical organization could be embedded in the apparently simple directed network. The proposed SMGC method could capture the multiple hierarchy of the directed network. When applied to real functional magnetic resonance imaging (fMRI) datasets, the core triple resting-state networks in the human brain showed within-network directed connections in the first-level directed network and rich and diverse between-network pathways in the second-level hierarchical network. The default mode network (DMN) had a prominent role in the resting state, acting as both the causal source and the important relay station. Further exploratory research on the adaptation of the directed hierarchical network in athletes suggested enhanced bidirectional communication between the DMN and the central executive network (CEN) and enhanced directed connections from the salience network (SN) to the CEN in the athlete group. The SMGC approach is capable of capturing the hierarchical architecture of the brain directed functional network, which refreshes the new stepwise causal relationship in the directed network. 
This might shed light on the potential application for exploring the altered hierarchical organization of brain directed network in neuropsychiatric disorders.

... The conditional probabilities of the output given the input can also be learnt by the generative methods, but an extra Bayesian step is required. While the generative class is represented in the literature only by the Hidden Markov Models (HMMs) (Rabiner 1989) and Restricted Boltzmann Machines (RBMs) (Salakhutdinov et al. 2007), the discriminative class includes methods like Conditional Random Fields (CRFs), Maximum Entropy (Berger et al. 1996), Support Vector Machine (SVM) (Boser et al. 1992), and the majority of neural networks. As the latter ones have received more attention recently, we decided to present the neural networks in a separate section from the machine learning one. ...
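The "extra Bayesian step" mentioned in this excerpt can be written out explicitly. The notation below is an assumption for illustration (x for the input, y for the output label); it is simply Bayes' rule applied to the quantities a generative model learns, not a formula quoted from the cited survey.

```latex
P(y \mid x) \;=\; \frac{P(x \mid y)\, P(y)}{\sum_{y'} P(x \mid y')\, P(y')}
```

A generative model estimates P(x | y) and P(y); a discriminative model estimates the left-hand side directly.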

Sentiment analysis is an important tool to automatically understand the user-generated content on the Web. The most fine-grained sentiment analysis is concerned with the extraction and sentiment classification of aspects and has been extensively studied in recent years. In this work, we provide an overview of the first step in aspect-based sentiment analysis that assumes the extraction of opinion targets or aspects. We define a taxonomy for the extraction of aspects and present the most relevant works accordingly, with a focus on the most recent state-of-the-art methods. The three main classes we use to classify the methods designed for the detection of aspects are pattern-based, machine learning, and deep learning methods. Despite their differences, only a small number of works belong to a unique class of methods. All the introduced methods are ranked in terms of effectiveness. In the end, we highlight the main ideas that have led the research on this topic. Regarding future work, we deemed that the most promising research directions are the domain flexibility and the end-to-end approaches.

... The current state-of-the-art imputation methods such as BEAGLE, Minimac, and IMPUTE suite make use of the hidden Markov model (HMM) [30][31][32][33][34] based approach that is developed by Li and Stephens [35][36][37][38][39][40]. HMM treats each haplotype as a "state" and analyzes the probabilities of all the "paths" that pass through the states to generate the alleles that are typed by the array [36]. ...

Background
The decreasing cost of DNA sequencing has led to a great increase in our knowledge about genetic variation. While population-scale projects bring important insight into genotype–phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with the large population-specific reference panels and estimate the genotypes of untyped variants by making use of the linkage disequilibrium patterns. Most accurate imputation methods are based on the Li–Stephens hidden Markov model, HMM, that treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel.
Results
Here we assess the accuracy of vicinity-based HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation has recently been used by machine learning-based genotype imputation approaches. We assess how the parameters of the vicinity-based HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that vicinity-based HMMs can accurately impute common and uncommon variants.
Conclusions
Our results indicate that locality-based imputation models can be effectively used for genotype imputation. The parameter settings that we identified can be used in future methods and vicinity-based HMMs can be used for re-structuring and parallelizing new imputation methods. The source code for the vicinity-based HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer .

... The following diagram shows the formulations that will be used to compute I P Y [j T 0 ⧵s] instead of the brute formula (see Fig. 1). It is known (see Rabiner 1989) that a direct computation of the brute formula is of cost \(O((2T-1)\,r^{T+1})\). To reduce this cost, generally, we use Forward and/or Backward iterative schemes. ...

In a Hidden Markov model (HMM), from hidden states, the model generates emissions that are visible. Generally, the problems to be solved by such models, are based on such emissions that are considered as observed data. In this work, we propose to study the case where some emissions are missing in a given emission sequence using different techniques, in particular a split technique which reduces the computational cost. Mainly we resolve the fundamental problems of an HMM with a lack of observations. The algorithms obtained following this approach are successfully tested through numerical examples.
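The cost gap between the brute formula and the forward scheme, and one common way of handling missing emissions, can be sketched as follows. This is a minimal illustration under stated assumptions, not the split technique of the paper: a missing emission (marked `None`) is marginalized out by giving it emission weight 1, and the names `forward_likelihood`, `brute_force_likelihood`, `pi`, `A`, `B` are hypothetical.

```python
from itertools import product


def forward_likelihood(pi, A, B, obs):
    """Return P(observed emissions) for a discrete HMM via the forward recursion.

    pi[i]   : initial probability of hidden state i
    A[i][j] : transition probability from state i to state j
    B[i][k] : probability that state i emits symbol k
    obs     : list of symbol indices; None marks a missing emission (assumption)
    """
    n = len(pi)

    def emit(i, o):
        # A missing emission is summed out: sum_k B[i][k] = 1.
        return 1.0 if o is None else B[i][o]

    alpha = [pi[i] * emit(i, obs[0]) for i in range(n)]
    for o in obs[1:]:
        alpha = [
            sum(alpha[i] * A[i][j] for i in range(n)) * emit(j, o)
            for j in range(n)
        ]
    return sum(alpha)


def brute_force_likelihood(pi, A, B, obs):
    """Same quantity, summing over every hidden path (exponential in len(obs))."""
    n = len(pi)
    total = 0.0
    for path in product(range(n), repeat=len(obs)):
        p = pi[path[0]]
        for t, s in enumerate(path):
            if t > 0:
                p *= A[path[t - 1]][s]
            if obs[t] is not None:
                p *= B[s][obs[t]]
        total += p
    return total
```

The forward version performs on the order of r²·T multiplications for r states and T time steps, while the brute-force version enumerates all r^T paths; both return the same value.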

... The Hidden Markov Model (HMM) [31,32] approach is used to reduce cortical activity at rest into sequences of transient, intermittently reoccurring events, known as brain states. At each brain state, large-scale networks are characterized by specific patterns of power and phase-coupling, and these are factorized as a function of frequency (i.e., spectrally resolved) [33][34][35]. ...

Neuronal populations in the brain are engaged in a temporally coordinated manner at rest. Here we show that spontaneous transitions between large-scale resting-state networks are altered in chronic neuropathic pain. We applied an approach based on the Hidden Markov Model to magnetoencephalography data to describe how the brain moves from one activity state to another. This identified 12 fast transient (~80 ms) brain states including the sensorimotor, ascending nociceptive pathway, salience, visual, and default mode networks. Compared to healthy controls, we found that people with neuropathic pain exhibited abnormal alpha power in the right ascending nociceptive pathway state, but higher power and coherence in the sensorimotor network state in the beta band, and shorter time intervals between visits of the sensorimotor network, indicating more active time in this state. Conversely, the neuropathic pain group showed lower coherence and spent less time in the frontal attentional state. Therefore, this study reveals a temporal imbalance and dysregulation of spectral frequency-specific brain microstates in patients with neuropathic pain. These findings can potentially impact the development of a mechanism-based therapeutic approach by identifying brain targets to stimulate using neuromodulation to modify abnormal activity and to restore effective neuronal synchrony between brain states.

... Speech is highly non-stationary and varies with time; however, HMMs model speech assuming two circumstances: quasi-stationary and conditional independence [19], [20]. Despite the fact that feature vectors are highly correlated, conditional independence assumes feature vectors are conditionally independent of ones that are before and after with the next state transition probability depending only on the current state. ...
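The two assumptions described in this excerpt correspond to the standard factorization of the HMM joint probability. The symbols are assumptions for illustration (q_t for the hidden state and o_t for the feature vector at frame t, with T frames); they are not taken from the cited work.

```latex
P(o_{1:T}, q_{1:T}) \;=\; P(q_1)\, P(o_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(o_t \mid q_t)
```

Each factor P(o_t | q_t) expresses conditional independence of a feature vector from its neighbours given the state, and each factor P(q_t | q_{t-1}) expresses that the next state depends only on the current one.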

Automatic Speech Recognition (ASR) applications have increased greatly during the last decade due to the emergence of new devices and home automation hardware that can benefit greatly from allowing users to interact hands free, such as smart watches, earbuds, portable translators, and home assistants. ASR implementation for these applications inevitably suffers from performance degradation in real life scenarios. Most ASR systems expect the working environment to be like the training environment, which is often not the case, especially for new applications with limited data availability. This study is concerned with experimentally showing the effect of variations in the environment on different ASR models and the capacity of different models to improve performance when provided with training data like the testing environment.
Taking a certain type of variability into account takes place by modifying or adapting one of the ASR system components, thus alleviating the effect of variability in real-life scenarios. However, this nominal approach does not account for all possible variabilities simultaneously, but on the contrary might result in deterioration in performance against other types of changes in the testing environment.
Most of the recent successes in ASR are mainly dependent on the abundance of data in a certain domain along with the increased capacity of the learning models. The performance of ASR then decreases with the decrease of either the amount of data or model capacity. Hence, this work proposes different data augmentation techniques and focuses on the capacity of the different models to improve with different types of augmented data.
Key words – Acoustic Modelling, Data Augmentation, Recurrent Autoencoder, Neural Style Transfer.

... So, in all cases discussed above, those in 1 , the movers, have the greatest value of ( ) , followed by those in 2 , the mediocres, who have the next higher value of ( ) and then those in 3 , the stayers, who have the smallest value of ( ) . algorithm, which further details can be found in Rabiner (1989) and MacDonald and Zucchini (1997), ...

In recent works in manpower planning interest has been awakened in modelling manpower systems in departmentalized framework. This, as a form of disaggregation, may solve the problem of observable heterogeneity but not latent heterogeneity; it rather opens up other aspects of latent heterogeneity hitherto unaccounted for in classical (non-departmentalized) manpower models. In this paper a multinomial Markov-switching model is formulated for investigating latent heterogeneity in intra-departmental and interdepartmental transitions in departmentalized manpower systems. The model incorporates the mover-mediocre-stayer principle. The use of EM algorithm for estimation of the model parameters and a validity test for assessing the model performance in comparison to the classical Markov manpower model are presented.

This paper presents a speech-based system for autism severity estimation combined with automatic speaker diarization. Speaker diarization was performed by two different methods. The first used acoustic features, which included Mel-Frequency Cepstral Coefficients (MFCC) and pitch, and the second used x-vectors, embeddings extracted from Deep Neural Networks (DNN). The speaker diarization was trained using a Fully Connected Deep Neural Network (FCDNN) in both methods. We then trained a Convolutional Neural Network (CNN) to estimate the severity of autism based on 48 acoustic and prosodic features of speech. One hundred thirty-two young children were recorded in the Autism Diagnostic Observation Schedule (ADOS) examination room, using a distant microphone. Between the two diarization methods, the MFCC and pitch method achieved a better Diarization Error Rate (DER) of 26.91%. Using this diarization method, the severity estimation system achieved a correlation of 0.606 (Pearson) between the predicted and the actual autism severity scores (i.e., ADOS scores). Clinical Relevance: The presented system identifies children's speech segments and estimates their autism severity score.

Hidden Markov models have been widely applied for data processing, mainly for image segmentation. However, when applied to non-stationary data, the results obtained may be poor due to a mismatch between the models and reality. To address this, such models have been extended to triplet Markov models, which exhibit better performance while keeping similar computational complexity. In this paper, we introduce two novel models, based on triplet chains and fields, able to handle noise and class non-stationarity simultaneously through a pairwise Markovian auxiliary process. The performance of our models was assessed against regular hidden Markov models on the segmentation of simulated and synthetic images. The results obtained confirm the interest of the proposed models.
Keywords: Segmentation, Markov models, Non-stationarity

In this paper, we present MuteIt, an ear-worn system for recognizing unvoiced human commands. MuteIt presents an intuitive alternative to voice-based interactions that can be unreliable in noisy environments, disruptive to those around us, and compromise our privacy. We propose a twin-IMU set up to track the user's jaw motion and cancel motion artifacts caused by head and body movements. MuteIt processes jaw motion during word articulation to break each word signal into its constituent syllables, and further each syllable into phonemes (vowels, visemes, and plosives). Recognizing unvoiced commands by only tracking jaw motion is challenging. As a secondary articulator, jaw motion is not distinctive enough for unvoiced speech recognition. MuteIt combines IMU data with the anatomy of jaw movement as well as principles from linguistics, to model the task of word recognition as an estimation problem. Rather than employing machine learning to train a word classifier, we reconstruct each word as a sequence of phonemes using a bi-directional particle filter, enabling the system to be easily scaled to a large set of words. We validate MuteIt for 20 subjects with diverse speech accents to recognize 100 common command words. MuteIt achieves a mean word recognition accuracy of 94.8% in noise-free conditions. When compared with common voice assistants, MuteIt outperforms them in noisy acoustic environments, achieving higher than 90% recognition accuracy. Even in the presence of motion artifacts, such as head movement, walking, and riding in a moving vehicle, MuteIt achieves mean word recognition accuracy of 91% over all scenarios.

With the phenomenal increase in image and video databases, there is an increase in the human-computer interaction that recognizes Sign Language. Exchanging information using different gestures between two people is sign language, known as non-verbal communication. Sign language recognition is already done in various languages; however, for Indian sign language, there is no adequate amount of work done. This paper presents a review on sign language recognition for multiple languages. Data acquisition methods have been over-viewed in four ways (a) Glove-based, (b) Kinect-based, (c) Leap motion controller and (d) Vision-based. Some of them have pros and cons that have also been discussed for every data acquisition method. Applications of sign language recognition are also discussed.
Furthermore, this review also creates a coherent taxonomy to represent the modern research divided into three levels: Level 1 Elementary level (Recognition of sign characters), Level 2 Advanced level (Recognition of sign words) and Level 3 Professional level (Sentence interpretation). The available challenges and issues for each level are also explored in this research to provide valuable perceptions into technological environments. Various publicly available data-sets for different sign languages are also discussed. An efficient review of this paper shows that the significant exploration of communication via sign acknowledgment has been performed on static, dynamic, isolated and continuous gestures using various acquisition methods. Comprehensively, the hope is, this study will enable readers to learn new pathways and gain knowledge to carry out further research work in the domain related to sign language recognition.

This paper presents work on forecasting the fuel consumption rate of a harbour craft vessel through the combined time-series and classification prediction modelling. This study utilizes the machine learning tool which is trained using the 5-month raw operational data, i.e., fuel rate, vessel position and wind data. The Haar wavelet transform filters the noisy readings in the fuel flow rate data. Wind data are transformed into wind effect (drag), and the vessel speed is acquired through transforming GPS coordinates of vessel location to vessel distance travelled over time. Subsequently, the k-means clustering groups the tugboat operational data from the same operations (i.e., cruising and towing) for the training of the classification model. Both the time-series (LSTM network) and classification models are executed in parallel to make prediction results. The comparison of empirical results is made to discuss the effect of different architectures and hyperparameters on the prediction performance. Finally, fuel usage optimization by hypothetical adjustment of vessel speed is presented as one direct application of the methods presented in this paper.
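The step that turns GPS coordinates into vessel speed can be sketched as below. The abstract does not say which distance formula was used; the haversine great-circle distance, the Earth radius constant, and the function names `haversine_m` and `speeds_mps` are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres (assumed constant)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def speeds_mps(fixes):
    """Average speed (m/s) between consecutive GPS fixes.

    fixes: list of (timestamp_seconds, lat_deg, lon_deg), sorted by time.
    """
    out = []
    for (t0, la0, lo0), (t1, la1, lo1) in zip(fixes, fixes[1:]):
        dt = t1 - t0
        if dt <= 0:
            raise ValueError("timestamps must be strictly increasing")
        out.append(haversine_m(la0, lo0, la1, lo1) / dt)
    return out
```

Each speed is distance travelled divided by elapsed time between two fixes, so noisy positions produce noisy speeds; a real pipeline would likely smooth the result.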

Intracellular membrane fusion is primarily driven by coupled folding and assembly of soluble N-ethylmaleimide-sensitive factor attachment protein receptors (SNAREs). SNARE assembly is intrinsically inefficient and must be chaperoned by a family of evolutionarily and structurally conserved Sec1/Munc-18 (SM) proteins. The physiological pathway of the chaperoned SNARE assembly has not been well understood, partly due to the difficulty in dissecting the many intermediates and pathways of SNARE assembly and measuring their associated energetics and kinetics. Optical tweezers have proven to be a powerful tool to characterize the intermediates involved in the chaperoned SNARE assembly. Here, we demonstrate the application of optical tweezers combined with a homemade microfluidic system to studies of synaptic SNARE assembly chaperoned by their cognate SM protein Munc18-1. Three synaptic SNAREs and Munc18-1 constitute the core machinery for synaptic vesicle fusion involved in neurotransmitter release. Many other proteins further regulate the core machinery to enable fusion at the right time and location. The methods described here can be applied to other proteins that regulate SNARE assembly to control membrane fusion involved in numerous biological and physiological processes.
Key words: Optical tweezers, SNAREs, SNARE assembly, Sec1/Munc-18 proteins, Template complexes

Comparison is a powerful tool for decision support. Many measures have been proposed for comparing two metric spaces, but they do not consider the data dispersion in each metric space. Furthermore, no dedicated measure has yet been proposed for comparing two metric spaces containing elements belonging to various subtypes of a global datatype. In order to attenuate the aforementioned limitations, this paper proposes a new technique for comparing two metric spaces. More formally, given a metric space X, histograms are used for performing a gradual analysis of the data dispersion inside the neighborhood of each element of X. This is a refinement of the neighborhoods’ analysis realized in an existing work. Then, another existing technique is used for associating one hidden Markov model \({\lambda }_X\) with X such that \({\lambda }_X\) learns the bin values and the visual shapes of the histograms derived from the instances in X. Meta-data derived from \({\lambda }_X\) are then saved as the components of a descriptor vector \(\overrightarrow{X}\) associated with X. Finally, the comparison between two metric spaces is performed through the comparison of their respective associated descriptor vectors using existing distance or similarity measures between two vectors. The proposed approach inherits the accuracy and the efficiency of the existing techniques on which it relies. Therefore, the experiments realized in this paper are only intended to show how it can be used for comparing particular metric spaces containing geolocations or stars in the celestial sphere.

Introduction
Sepsis is a heterogeneous syndrome that results in life-threatening organ dysfunction. Our goal was to determine the relevant variables and patient phenotypes to use in predicting sepsis outcomes.
Methods
We performed an ancillary study concerning 119 patients with septic shock at intensive care unit (ICU) admittance (T0). We defined clinical worsening as having an increased sequential organ failure assessment (SOFA) score of ≥ 1, 48 h after admission (ΔSOFA ≥ 1). We performed univariate and multivariate analyses based on the 28-day mortality rate and ΔSOFA ≥ 1 and determined three patient phenotypes: safe, intermediate and unsafe. The persistence of the intermediate and unsafe phenotypes after T0 was defined as a poor outcome.
Results
At T0, the multivariate analysis showed two variables associated with 28-day mortality rate: norepinephrine dose and serum lactate concentration. Regarding ΔSOFA ≥ 1, we identified three variables at T0: norepinephrine dose, lactate concentration and venous-to-arterial carbon dioxide difference (P(v-a)CO2). At T0, the three phenotypes (safe, intermediate and unsafe) were found in 28 (24%), 70 (59%) and 21 (18%) patients, respectively. We thus suggested using an algorithm featuring norepinephrine dose, lactate concentration and P(v-a)CO2 to predict patient outcomes and obtained an area under the curve (AUC) of 74% (63–85%).
Conclusion
Our findings highlight the fact that identifying relevant variables and phenotypes may help physicians predict patient outcomes.

This study describes a Natural Language Processing (NLP) toolkit, as the first contribution of a larger project, for an under-resourced language, Urdu. In previous studies, standard NLP toolkits have been developed for English and many other languages. There is also a dire need for standard text processing tools and methods for Urdu, given that it is widely spoken in different parts of the world and a large amount of digital text is readily available. This study presents the first version of the UNLT (Urdu Natural Language Toolkit), which contains three key text processing tools required for an Urdu NLP pipeline: a word tokenizer, a sentence tokenizer, and a part-of-speech (POS) tagger. The UNLT word tokenizer employs a morpheme matching algorithm coupled with a state-of-the-art stochastic n-gram language model with back-off and smoothing characteristics for the space omission problem. The space insertion problem for compound words is tackled using a dictionary look-up technique. The UNLT sentence tokenizer is a combination of various machine learning, rule-based, regular-expression, and dictionary look-up techniques. Finally, the UNLT POS taggers are based on Hidden Markov Model and Maximum Entropy-based stochastic techniques. In addition, we have developed large gold standard training and testing data sets to improve and evaluate the performance of new techniques for Urdu word tokenization, sentence tokenization, and POS tagging. For comparison purposes, we have compared the proposed approaches with several methods. Our proposed UNLT, the training and testing data sets, and supporting resources are all free and publicly available for academic use.

Software Defined Networking (SDN) is a networking architecture in which control is centralized through a software-based controller. As a single point of attack, the controller is therefore a preferred target. Multi-controller architectures have been considered to reinforce the control plane. However, the communication interface between the controllers is a security threat. We previously proposed a dual-controller architecture: a nominal controller in charge of the data plane computation, plus a second controller in charge of detecting anomalies in the decisions taken by the first. Previous work considered deterministic control; this paper extends it to the case of a non-deterministic algorithm. To this end, we introduce a multi-criteria detection approach and develop two methods: verifying the consistency of the performance of the decisions taken, and verifying the consistency of the sequence of decisions of the controller. We tested the proposal on a case study.

The remaining useful life of a system is unknown and uncertain due to the uncertainty of system failure. However, by monitoring the behaviour of the system, it is possible to predict the current health and also the near-future health states. To make a correct prognosis, we need to understand the degradation process of similar systems from historical data, which is often not easy to collect in large amounts because degradation progresses slowly. A complete sequence requires collecting data from the beginning of a system's operation until its death or failure. However, in reality, most deployments will have to deal with missing data, misreadings or sensor saturation. This paper works on handling the missing data to improve model training by extracting as much information as possible even from the incomplete sequences. In this paper, we propose an IOHMM-based missing data processing method, which is shown to provide better results compared to the list deletion method. A bootstrap method is developed that resamples using replacement sequences picked by the learning algorithms. Two well-known algorithms, the Baum-Welch and forward-backward algorithms, are adapted to handle the missing data. A numerical application is simulated to demonstrate the role of the proposed algorithm and the corresponding model performance in RUL prediction, which is the basis of RUL management.

Epilepsy is one of the most prevalent neurological diseases globally, which causes seizures in the patient. As per a survey done worldwide, it is found that approximately 70 million people are living with epilepsy (~1% of the total population of the world). Effective detection of these seizures requires specialized approaches such as video and electroencephalography monitoring, which are expensive and are mainly available at specialized hospitals and institutes. Hence, there is a need to develop simpler and affordable systems that can be made available to health care centers and patients for accurate detection of epileptic seizures. A wireless remote monitoring system based on a wrist-worn accelerometer is an optimum choice for the same. Sophisticated algorithms need to be developed for effectively detecting seizure events from this accelerometer data with minimal false alarms. This paper presents a Hidden Markov Model (HMM) based probabilistic approach applied to the reduced-dimension feature vector representation of time-series accelerometer data to detect epileptic seizures. The results obtained from the HMM were compared with three commonly used machine learning models viz. support vector machine (SVM), logistic regression, and random forest. The proposed approach was able to detect 95.7% of seizures with a low false alarm rate of 14.8% with a run time of just under 24 seconds.

A body activity grading strategy is proposed for computer-assisted cervical rehabilitation training, which employs hidden Markov model to partition an exercise into independently assessable phases and a scoring reference to rate respective kinematic features. Samples of 34 cervical rehabilitation exercises are evaluated by both manual and the proposed approaches, where the average phase segmentation difference is 93 ms, the phase scoring difference is 0.045, and the grading difference for overall samples is 5.5% between the approaches. It indicates that the proposed method has similar accuracy as physical therapists and is thus capable of performing online supervision for cervical rehabilitation training.
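One common way an HMM partitions a recording into phases is to decode the most likely hidden-state sequence and treat each run of identical states as one phase. The abstract does not say which decoding method was used; the Viterbi sketch below, its discrete emission symbols, and the names `viterbi` and `phase_boundaries` are assumptions for illustration only.

```python
import math


def viterbi(pi, A, B, obs):
    """Most likely hidden-state path for a discrete HMM (log domain).

    pi[i], A[i][j], B[i][k] are initial, transition and emission
    probabilities; obs is a non-empty list of symbol indices.
    """
    n = len(pi)

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    delta = [log(pi[i]) + log(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        ptr, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + log(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + log(A[best_i][j]) + log(B[j][o]))
        back.append(ptr)

    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def phase_boundaries(path):
    """Indices at which the decoded state changes, i.e. where a new phase starts."""
    return [t for t in range(1, len(path)) if path[t] != path[t - 1]]
```

With a left-to-right transition matrix (no transitions back to earlier states), the decoded path visits the phases in order, and `phase_boundaries` gives the frame index at which each new phase begins.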

In today's technological era, document images play an important and integral part in our day-to-day life, and specifically with the surge of Covid-19, digitally scanned documents have become a key source of communication, thus avoiding any sort of infection through physical contact. Storage and transmission of scanned document images is a very memory-intensive task, hence compression techniques are used to reduce the image size before archival and transmission. There are two ways to extract information from, or operate on, compressed images. The first is to decompress the image, operate on it, and subsequently compress it again for efficient storage and transmission. The other is to use the characteristics of the underlying compression algorithm to process the images directly in their compressed form, without decompression and re-compression. In this paper, we propose a novel idea of developing an OCR for CCITT (The International Telegraph and Telephone Consultative Committee) compressed machine-printed TIFF document images directly in the compressed domain. After segmenting text regions into lines and words, an HMM is applied for recognition using the three coding modes of CCITT: horizontal, vertical and pass mode. Experimental results show that OCR on pass modes gives promising results.

In the following report we propose pipelines for Goodness of Pronunciation (GoP) computation that solve the OOV problem at testing time using vocabulary/lexicon expansion techniques. The pipeline uses different components of an ASR system to quantify accent and automatically evaluate it as scores. We use the posteriors of an ASR model trained on native English speech, along with the phone-level boundaries, to obtain phone-level pronunciation scores. We used this as a baseline pipeline and implemented methods to remove UNK and SPN phonemes in the GoP output by building three pipelines: the Online, Offline and Hybrid pipelines, which return the scores but can also prevent unknown words in the final output. The Online method works per utterance, the Offline method pre-incorporates a set of OOV words for a given data set, and the Hybrid method combines the above two ideas to expand the lexicon as well as work per utterance. We further provide utilities such as the phoneme-to-posterior mappings, GoP scores of each utterance as a vector, and word boundaries used in the GoP pipeline for use in future research.

Labelled data for training sequence labelling models can be collected from multiple annotators or workers in crowdsourcing. However, these labels could be noisy because of the varying expertise and reliability of annotators. In order to ensure high quality of data, it is crucial to infer the correct labels by aggregating noisy labels. Although label aggregation is a well-studied topic, only a number of studies have investigated how to aggregate sequence labels. Recently, neural network models have attracted research attention for this task. In this paper, we explore two neural network-based methods. The first method combines Hidden Markov Models with networks while also learning distributed representations of annotators (i.e., annotator embedding); the second method combines BiLSTM with autoencoders. The experimental results on three real-world datasets demonstrate the effectiveness of using neural networks for sequence label aggregation. Moreover, our analysis shows that annotators’ embeddings not only make our model applicable to real-time applications, but also useful for studying the behaviour of annotators.

Patients with dysarthria generally produce distorted speech whose intelligibility is limited for both humans and machines. To enhance the intelligibility of dysarthric speech, we applied a deep learning-based speech enhancement (SE) system to this task. Conventional SE approaches are used to reduce the noise components of a noise-corrupted input, and thus improve sound quality and intelligibility simultaneously. In this study, we focus on reconstructing the severely distorted signal of dysarthric speech to improve intelligibility. The proposed SE system prepares a convolutional neural network (CNN) model in the training phase, which is then used to process the dysarthric speech in the testing phase. During training, paired dysarthric-normal speech utterances are required. We adopt a dynamic time warping technique to align the dysarthric-normal utterances. The resulting training data are used to train a CNN-based SE model. The proposed SE system is evaluated on the Google automatic speech recognition (ASR) system and a subjective listening test. The results showed that the proposed method could notably improve recognition performance by more than 10% for both ASR and human listeners relative to the unprocessed dysarthric speech. Clinical Relevance: This study enhances the intelligibility and ASR accuracy of dysarthric speech by more than 10%.
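The alignment step can be illustrated with a textbook dynamic time warping routine. This is a minimal sketch on one-dimensional sequences with an absolute-difference local cost; the abstract does not give the features, distance, or step pattern actually used, so the function name `dtw_align` and all of those choices are assumptions.

```python
def dtw_align(x, y, dist=lambda a, b: abs(a - b)):
    """Align two non-empty sequences with dynamic time warping.

    Returns (total_cost, path), where path is a list of (i, j) pairs
    meaning that x[i] is matched with y[j].
    """
    n, m = len(x), len(y)
    inf = float("inf")
    # cost[i][j] = cheapest alignment of x[:i] with y[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = dist(x[i - 1], y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # x advances alone
                                 cost[i][j - 1],      # y advances alone
                                 cost[i - 1][j - 1])  # both advance

    # Backtrack from the end to recover the warping path.
    i, j = n, m
    path = []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path
```

For paired utterances, `x` and `y` would be frame-level feature sequences and `dist` a distance between feature vectors; the returned path indicates which frame of one utterance corresponds to which frame of the other.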

Prognostics and Health Management (commonly called PHM) is a field that focuses on the degradation mechanisms of systems in order to estimate their health status, anticipate their failure and optimize their maintenance. PHM uses methods, tools and algorithms for monitoring, anomaly detection, cause diagnosis, prognosis of the remaining useful life (RUL) and maintenance optimization. It allows for permanently monitoring the health of the system and provides operators and managers with relevant information to decide on actions to be taken to maintain the system in optimal operational conditions. This paper aims to present the emergence of the PHM thematically to describe the subjacent processes, particularly prognosis, how it supplies the different maintenance strategies and to explain the benefits that can be anticipated. More specifically, this paper establishes a state of the art in prognostic methods used today in the PHM strategy. In addition, this paper shows the multitude of possible prognostic approaches and the choice of one among them that will help to provide a framework for industrial companies.

Automatic speech emotion recognition (SER) is a crucial task in communication-based systems, where feature extraction plays an important role. Recently, many SER models have been developed and implemented successfully for English and other western languages. However, SER performance for traditional Indian languages is not up to the mark. This paper addresses the problem of SER in low-resource Indian languages, mainly Bengali. In the first step, the relevant phase-based information from the speech signal is extracted in the form of phase-based cepstral features (PBCC) using cepstral and statistical analysis. Several pre-processing techniques are combined with feature extraction and a gradient boosting machine-based classifier in the proposed SER model. Finally, simulation results on speaker-dependent and speaker-independent tests are evaluated and compared using multiple language datasets and independent test sets. It is observed that the proposed PBCC feature-based model performs well, with an average emotion recognition efficiency of 96%, compared with standard methods.

Nowadays, the industrial scenario is driven by the need for cost and time reduction. In this context, system failure prediction plays a pivotal role in scheduling maintenance operations only in the last stages of the real operating life, avoiding unnecessary machine downtime. In the last decade, Hidden Markov Models have been widely exploited for machinery prognostic purposes. The probabilistic dependency between the measured observations and the real damaging stage of the system has usually been described as a mixture of Gaussian distributions. This paper aims to generalize the probabilistic function as a mixture of generalized Gaussian distributions in order to consider possible distribution variations during the different states. In this direction, this work proposes an algorithm for the estimation of the model parameters exploiting the observations measured on the real system. The prognostic effectiveness of the resulting model has been demonstrated through the analysis of several run-to-failure datasets concerning both rolling element bearings and more complex systems.

Action recognition is a major branch of computer vision research. As a widely used technology, action recognition has been applied to human-computer interaction, intelligent pension, and intelligent transportation system. Because of the explosive growth of action recognition related methods, the performance of action recognition on many difficult data sets has improved significantly. In terms of the different data sets used for action recognition, action recognition can mainly be divided into RGB-based action recognition method and skeleton-based action recognition method. The former method can take advantage of the prior knowledge of image recognition. However, it has high requirements for computing power and storage ability, and it is difficult to avoid the influence of irrelevant background and illumination. In contrast, the latter method’s calculation amount and required storage space are reduced significantly. However, it lacks context information that is useful for action recognition. This review provides a comprehensive description of these two methods, covering the milestone algorithms, the state-of-the-art algorithms, the commonly used data sets, evaluation metrics, challenges, and promising future directions. So far as we know, this work is the first survey covering traditional methods of action recognition, RGB-based end-to-end action recognition method, pose estimation, and skeleton-based action recognition in one review. This survey aims to help scholars who study action recognition technology to systematically learn action recognition technology, select data sets, understand current challenges, and choose promising future research directions.

Single-molecule FRET (smFRET) is a versatile technique to study the dynamics and function of biomolecules since it makes nanoscale movements detectable as fluorescence signals. The powerful ability to infer quantitative kinetic information from smFRET data is, however, complicated by experimental limitations. Diverse analysis tools have been developed to overcome these hurdles but a systematic comparison is lacking. Here, we report the results of a blind benchmark study assessing eleven analysis tools used to infer kinetic rate constants from smFRET trajectories. We test them against simulated and experimental data containing the most prominent difficulties encountered in analyzing smFRET experiments: different noise levels, varied model complexity, non-equilibrium dynamics, and kinetic heterogeneity. Our results highlight the current strengths and limitations in inferring kinetic information from smFRET trajectories. In addition, we formulate concrete recommendations and identify key targets for future developments, aimed to advance our understanding of biomolecular dynamics through quantitative experiment-derived models.

We present a hidden Markov model analysis for fluorescent time series of colloidal quantum dots. A fundamental quantity to measure optical performance of the quantum dots is a distribution function for the light-emission duration. So far, to estimate it, a threshold value for the fluorescent intensity was introduced, and the light-emission state was evaluated as a state above the threshold. With this definition, the light-emission duration was estimated, and its distribution function was derived as a blinking plot. Due to the noise in the fluorescent data, however, this treatment generates a large number of artificially short-lived emission states, thus leading to an erroneous blinking plot. In the present paper, we propose a hidden Markov model to eliminate these artifacts. The hidden Markov model introduces a hidden variable specifying the light-emission and quenching states behind the observed fluorescence. We found that it is possible to avoid the above artifacts by identifying the state from the hidden-variable time series. We found that, from the analysis of experimental and theoretical benchmark data, the accuracy of our hidden Markov model is beyond human cognitive ability.

Chromosomal translocations result from the joining of DNA double-strand breaks (DSBs) and frequently cause cancer. However, the steps linking DSB formation to DSB ligation remain undeciphered. We report that DNA replication timing (RT) directly regulates lymphomagenic Myc translocations during antibody maturation in B cells downstream of DSBs and independently of DSB frequency. Depletion of minichromosome maintenance complexes alters replication origin activity, decreases translocations, and deregulates global RT. Ablating a single origin at Myc causes an early-to-late RT switch, loss of translocations, and reduced proximity with the immunoglobulin heavy chain (Igh) gene, its major translocation partner. These phenotypes were reversed by restoring early RT. Disruption of early RT also reduced tumorigenic translocations in human leukemic cells. Thus, RT constitutes a general mechanism in translocation biogenesis linking DSB formation to DSB ligation.

Addressing the problems facing the elderly, whether living independently or in managed care facilities, is considered one of the most important applications for action recognition research. However, existing systems are not ready for automation, or for effective use in continuous operation. Therefore, we have developed theoretical and practical foundations for a new real-time action recognition system. This system is based on a Hidden Markov Model (HMM) along with colorized depth maps. The use of depth cameras provides privacy protection. Colorizing depth images in the hue color space enables compressing and visualizing depth data, and detecting persons. The specific detector used for person detection is You Only Look Once (YOLOv5). Appearance and motion features are extracted from depth map sequences and are represented with a Histogram of Oriented Gradients (HOG). These HOG feature vectors are transformed into observation sequences and then fed into the HMM. Finally, the Viterbi algorithm is applied to recognize the sequential actions. This system has been tested on real-world data featuring three participants in a care center. We tried out three combinations of HMM with classification algorithms and found that a fusion with a Support Vector Machine (SVM) had the best average results, achieving an accuracy rate of 84.04%.

Algorithms for recognizing strings of connected words from whole-word patterns have become highly efficient and accurate, although computation rates remain high. Even the most ambitious connected-word recognition task is practical with today's integrated circuit technology, but extracting reliable, robust whole-word reference patterns still is difficult. In the past, connected-word recognizers relied on isolated-word reference patterns or patterns derived from a limited context (e.g., the middle digit from strings of three digits). These whole-word patterns were adequate for slow rates of articulated speech, but not for strings of words spoken at high rates (e.g., about 200 to 300 words per minute). To alleviate this difficulty, a segmental k-means training procedure was used to extract whole-word patterns from naturally spoken word strings. The segmented words are then used to create a set of word reference patterns for recognition. Recognition string accuracies were 98 to 99 percent for digits in variable length strings and 90 to 98 percent for sentences from an airline reservation task. These performance scores represent significant improvements over previous connected-word recognizers.

Accurate detection of the boundaries of a speech utterance during a recording interval has been shown to be crucial for reliable and robust automatic speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals spoken in low-level stationary noise environments (e.g. signal-to-noise ratios greater than 30 dB). However, these ideal conditions do not always exist. One example, where reliable word detection is difficult, is speech spoken in a mobile environment. Because of road, tire, fan noises, etc. detection of speech often becomes problematic.

In this paper, we describe BYBLOS, the BBN continuous speech recognition system. The system, designed for large vocabulary applications, integrates acoustic, phonetic, lexical, and linguistic knowledge sources to achieve high recognition performance. The basic approach, as described in previous papers [1, 2], makes extensive use of robust context-dependent models of phonetic coarticulation using Hidden Markov Models (HMM). We describe the components of the BYBLOS system, including: signal processing frontend, dictionary, phonetic model training system, word model generator, grammar and decoder. In recognition experiments, we demonstrate consistently high word recognition performance on continuous speech across: speakers, task domains, and grammars of varying complexity. In speaker-dependent mode, where 15 minutes of speech is required for training to a speaker, 98.5% word accuracy has been achieved in continuous speech for a 350-word task, using grammars with perplexity ranging from 30 to 60. With only 15 seconds of training speech we demonstrate performance of 97% using a grammar.

In this paper, we extend the interpretation of distortion measures, based upon the observation that measurements of speech spectral envelopes (as normally obtained from analysis procedures) are prone to statistical variations due to window position fluctuations, excitation interference, measurement noise, etc. and may possess spurious characteristics because of analysis model constraints. We have found that these undesirable spectral measurement variations can be controlled (i.e. reduced in the level of variation) through proper cepstral processing and that a statistical model can be established to predict the variances of the cepstral coefficient measurements. The findings lead to the use of a bandpass "liftering" process aimed at reducing the variability of the statistical components of spectral measurements. We have applied this liftering process to various speech recognition problems; in particular, vowel recognition and isolated word recognition. With the liftering process, we have been able to achieve an average digit error rate of 1%, which is about half of the previously reported best results, with dynamic time warping in a speaker-independent isolated digit test.
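As a rough illustration of what a bandpass lifter does to a cepstral vector, the sketch below applies a raised-sine window that de-emphasizes both the lowest and the highest cepstral indices. The abstract does not state the window, so the formula `1 + (L/2) * sin(pi * k / L)`, the default `L = 12`, the zeroing of the 0th coefficient, and the function names are all assumptions for illustration.

```python
import math


def raised_sine_lifter(L):
    """Weights w[k] = 1 + (L/2) * sin(pi * k / L) for k = 1..L.

    Index 0 is given weight 0 so the energy-like 0th coefficient is dropped
    (an assumed convention for this sketch).
    """
    return [0.0] + [
        1.0 + (L / 2.0) * math.sin(math.pi * k / L) for k in range(1, L + 1)
    ]


def lifter(cepstrum, L=12):
    """Weight a cepstral vector; coefficients beyond index L are zeroed."""
    w = raised_sine_lifter(L)
    return [c * w[k] if k <= L else 0.0 for k, c in enumerate(cepstrum)]


if __name__ == "__main__":
    weights = raised_sine_lifter(12)
    print([round(v, 3) for v in weights])
```

The weights peak in the middle of the index range, so mid-order coefficients dominate any distance computed on the liftered vectors.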

A method for estimating the parameters of hidden Markov models of speech is described. Parameter values are chosen to maximize the mutual information between an acoustic observation sequence and the corresponding word sequence. Recognition results are presented, comparing this method with maximum-likelihood estimation. In the example given, estimating parameters by maximizing mutual information resulted in the training script having a probability 10^189 times greater than when parameters were estimated by maximum likelihood estimation. Moreover, training by maximizing mutual information resulted in 18% fewer recognition errors.

Speech recognition is formulated as a problem of maximum likelihood decoding. This formulation requires statistical models of the speech production process. In this paper, we describe a number of statistical models for use in speech recognition. We give special attention to determining the parameters for such models from sparse data. We also describe two decoding methods, one appropriate for constrained artificial languages and one appropriate for more realistic decoding tasks. To illustrate the usefulness of the methods described, we review a number of decoding results that have been obtained with them.

Many signals can be modeled as probabilistic functions of Markov chains in which the observed signal is a random vector whose probability density function (pdf) depends on the current state of an underlying Markov chain. Such models are called Hidden Markov Models (HMMs) and are useful representations for speech signals in terms of some convenient observations (e.g., cepstral coefficients or pseudolog area ratios). One method of estimating parameters of HMMs is the well-known Baum-Welch reestimation method. For continuous pdf's, the method was known to work only for elliptically symmetric densities. We have recently shown that the method can be generalized to handle mixtures of elliptically symmetric pdf's. Any continuous pdf can be approximated to any desired accuracy by such mixtures, in particular, by mixtures of multivariate Gaussian pdf's. To effectively make use of this method of parameter estimation, it is necessary to understand how it is affected by the amount of training data available, the number of states in the Markov chain, the dimensionality of the signal, etc. To study these issues, Markov chains and random vector generators were simulated to generate training sequences from “toy” models. The model parameters were estimated from these training sequences and compared to the “true” parameters by means of an appropriate distance measure. The results of several such experiments show the strong sensitivity of the method to some (but not all) of the model parameters. A procedure for getting good initial parameter estimates is, therefore, of considerable importance.

In this paper we present an approach to speaker-independent, isolated word recognition in which the well-known techniques of vector quantization and hidden Markov modeling are combined with a linear predictive coding analysis front end. This is done in the framework of a standard statistical pattern recognition model. Both the vector quantizer and the hidden Markov models need to be trained for the vocabulary being recognized. Such training results in a distinct hidden Markov model for each word of the vocabulary. Classification consists of computing the probability of generating the test word with each word model and choosing the word model that gives the highest probability. There are several factors, in both the vector quantizer and the hidden Markov modeling, that affect the performance of the overall word recognition system, including the size of the vector quantizer, the structure of the hidden Markov model, the ways of handling insufficient training data, etc. The effects, on recognition accuracy, of many of these factors are discussed in this paper. The entire recognizer (training and testing) has been evaluated on a 10-word digits vocabulary. For training, a set of 100 talkers spoke each of the digits one time. For testing, an independent set of 100 tokens of each of the digits was obtained. The overall recognition accuracy was found to be 96.5 percent for the 100-talker test set. These results are comparable to those obtained in earlier work, using a dynamic time-warping recognition algorithm with multiple templates per digit. It is also shown that the computation and storage requirements of the new recognizer were an order of magnitude less than that required for a conventional pattern recognition system using linear prediction with dynamic time warping.
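The classification rule described here (score the test word against every word model and pick the highest probability) can be sketched with a scaled forward recursion over discrete symbols. This is a minimal sketch, not the paper's recognizer: the two toy left-to-right models, the symbol alphabet of size 2, and the names `forward_log_likelihood` and `classify` are assumptions for illustration.

```python
import math


def forward_log_likelihood(obs, pi, A, B):
    """log P(obs | model) for a discrete HMM, using per-frame scaling."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # the model cannot generate this sequence
        alpha = [a / scale for a in alpha]
        log_p += math.log(scale)
    return log_p


def classify(obs, models):
    """Return the label of the model with the highest log-likelihood."""
    return max(models, key=lambda w: forward_log_likelihood(obs, *models[w]))


# Hypothetical two-state left-to-right word models (pi, A, B).
PI = [1.0, 0.0]
A_LR = [[0.6, 0.4], [0.0, 1.0]]
MODELS = {
    "a": (PI, A_LR, [[0.9, 0.1], [0.1, 0.9]]),
    "b": (PI, A_LR, [[0.1, 0.9], [0.9, 0.1]]),
}

if __name__ == "__main__":
    print(classify([0, 0, 1, 1], MODELS))
```

Scaling each frame and summing the log scale factors avoids the numerical underflow that a direct product of probabilities would hit on long utterances.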

Accurate location of the endpoints of spoken words and phrases is important for reliable and robust speech recognition. The endpoint detection problem is fairly straightforward for high-level speech signals in low-level stationary noise environments (e.g., signal-to-noise ratios greater than 30-dB rms). However, this problem becomes considerably more difficult when either the speech signals are too low in level (relative to the background noise), or when the background noise becomes highly nonstationary. Such conditions are often encountered in the switched telephone network when the limitation on using local dialed-up lines is removed. In such cases the background noise is often highly variable in both level and spectral content because of transmission line characteristics, transients and tones from the line and/or from signal generators, etc. Conventional speech endpoint detectors have been shown to perform very poorly (on the order of 50-percent word detection) under these conditions. In this paper we present an improved word-detection algorithm, which can incorporate both vocabulary (syntactic) and task (semantic) information, leading to word-detection accuracies close to 100 percent for isolated digit detection over a wide range of telephone transmission conditions.

We propose a probabilistic distance measure for measuring the dissimilarity between pairs of hidden Markov models with arbitrary observation densities. The measure is based on the Kullback-Leibler number and is consistent with the reestimation technique for hidden Markov models. Numerical examples that demonstrate the utility of the proposed distance measure are given for hidden Markov models with discrete densities. We also discuss the effects of various parameter deviations in the Markov models on the resulting distance, and study the relationships among parameter estimates (obtained from reestimation), initial guesses of parameter values, and observation duration through the use of the measure.
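A distance of this kind is often estimated by Monte Carlo: generate a long observation sequence from the first model and compare its per-symbol log-likelihood under both models. The sketch below follows that idea for discrete densities. The exact estimator, the sequence length, the seed, and all function names are assumptions for illustration and are not taken from the paper.

```python
import math
import random


def sample_observations(pi, A, B, T, rng):
    """Draw T discrete observations from the HMM (pi, A, B)."""
    def draw(p):
        return rng.choices(range(len(p)), weights=p)[0]

    state = draw(pi)
    obs = []
    for _ in range(T):
        obs.append(draw(B[state]))
        state = draw(A[state])
    return obs


def log_likelihood(obs, pi, A, B):
    """Scaled forward recursion returning log P(obs | model)."""
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            alpha = [
                sum(alpha[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
                for j in range(n)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")
        alpha = [a / scale for a in alpha]
        log_p += math.log(scale)
    return log_p


def hmm_distance(m1, m2, T=2000, seed=0):
    """Estimate (1/T) * [log P(O | m1) - log P(O | m2)] with O sampled from m1."""
    rng = random.Random(seed)
    obs = sample_observations(*m1, T, rng)
    return (log_likelihood(obs, *m1) - log_likelihood(obs, *m2)) / T


def symmetric_hmm_distance(m1, m2, T=2000, seed=0):
    """Average of the two directed estimates."""
    return 0.5 * (hmm_distance(m1, m2, T, seed) + hmm_distance(m2, m1, T, seed))


if __name__ == "__main__":
    sticky = ([0.5, 0.5], [[0.9, 0.1], [0.1, 0.9]], [[0.9, 0.1], [0.1, 0.9]])
    flat = ([0.5, 0.5], [[0.5, 0.5], [0.5, 0.5]], [[0.5, 0.5], [0.5, 0.5]])
    print(hmm_distance(sticky, flat), symmetric_hmm_distance(sticky, flat))
```

The directed estimate is not symmetric in its arguments, which is why an averaged version is included; longer sequences reduce the sampling noise of either estimate.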

Recent work at AT&T Bell Laboratories has shown how the theories of Vector Quantization (VQ) and Hidden Markov Modeling (HMM) can be applied to the recognition of isolated word vocabularies. The initial experiments with an HMM word recognizer were restricted to a vocabulary of 10 digits. For this simple vocabulary with dialed-up telephone recordings, we found that a high-performance, speaker-independent word recognizer could be implemented, and that the performance was, for the most part, insensitive to parameters of both the HMM and the VQ. In this paper we extend our investigations of the HMM recognizer to the recognition of isolated words from a medium-size vocabulary (129 words), as used in the AT&T Bell Laboratories airlines reservation and information system. For this moderately complex word vocabulary, we have found that recognition accuracy is indeed a function of the HMM parameters (i.e., the number of states in the model and the number of symbols per state). We have also found that a VQ that includes energy information gives better performance than a conventional VQ of the same size (i.e., same number of code-book entries).

A connected speech recognition method based on the Baum forward-backward algorithm is presented. The segmentation of the test sentence uses the probability that an acoustic vector lies at the boundary between two speech subunit models (hidden Markov models). The labelling rests on the highest probability that a vector has been emitted on the last state of a subunit model. Results are presented for word and phoneme recognition.

Speech recognition research can be divided into three areas: isolated word recognition, where words are separated by distinct pauses; continuous speech recognition, where sentences are produced continuously in a natural manner; and speech understanding, where the aim is not transcription but understanding, in the sense that the system responds correctly to a spoken instruction or request. This chapter focuses on Continuous Speech Recognition (CSR) and summarizes acoustic processing techniques. It introduces Markov models of speech processes and describes an elegant linguistic decoder based on dynamic programming that is practical under certain conditions. The practical aspects of the sentence hypothesis search conducted by the linguistic decoder are discussed, and algorithms for extracting model parameter values automatically from the data are introduced. The methods of assessing the performance of CSR systems and the relative difficulty of recognition tasks are discussed. The chapter illustrates the capabilities of present recognition systems by describing the results of certain recognition experiments.

In this paper we discuss parameter estimation by means of the reestimation algorithm for a class of multivariate mixture density functions of Markov chains. The scope of the original reestimation algorithm is expanded and the previous assumptions of log concavity or ellipsoidal symmetry are obviated, thereby enhancing the modeling capability of the technique. Reestimation formulas in terms of the well-known forward-backward inductive procedure are also derived.

During the past decade, the applicability of hidden Markov models (HMM) to various facets of speech analysis has been demonstrated in several different experiments. These investigations all rest on the assumption that speech is a quasi-stationary process whose stationary intervals can be identified with the occupancy of a single state of an appropriate HMM. In the traditional form of the HMM, the probability of duration of a state decreases exponentially with time. This behavior does not provide an adequate representation of the temporal structure of speech. The solution proposed here is to replace the probability distributions of duration with continuous probability density functions to form a continuously variable duration hidden Markov model (CVDHMM). The gamma distribution is ideally suited to specification of the durational density since it is one-sided and only has two parameters which, together, define both mean and variance. The main result is a derivation and proof of convergence of re-estimation formulae for all the parameters of the CVDHMM. It is interesting to note that if the state durations are gamma-distributed, one of the formulae is non-algebraic but, fortuitously, has properties such that it is easily and rapidly solved numerically to any desired degree of accuracy. Other results are presented including the performance of the formulae on simulated data.
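The duration behaviour described here follows directly from the self-transition probability: a state with self-transition `a_ii` is occupied for exactly `d` frames with probability `a_ii**(d-1) * (1 - a_ii)`, which always peaks at `d = 1`. The sketch below contrasts that with a gamma density, whose mode can sit away from zero. The function names and the shape/rate parameterization of the gamma density are assumptions for illustration.

```python
import math


def geometric_duration_pmf(a_ii, d):
    """P(state occupied for exactly d frames), d >= 1, in a standard HMM."""
    return a_ii ** (d - 1) * (1.0 - a_ii)


def expected_duration(a_ii):
    """Mean number of frames spent in a state with self-transition a_ii."""
    return 1.0 / (1.0 - a_ii)


def gamma_duration_pdf(d, shape, rate):
    """Gamma density with mean shape/rate and variance shape/rate**2 (d > 0)."""
    log_pdf = (
        shape * math.log(rate)
        + (shape - 1.0) * math.log(d)
        - rate * d
        - math.lgamma(shape)
    )
    return math.exp(log_pdf)


if __name__ == "__main__":
    for d in range(1, 8):
        print(d, geometric_duration_pmf(0.8, d), gamma_duration_pdf(d, 5.0, 1.0))
```

With `a_ii = 0.8` the implicit duration model has mean 5 frames but still assigns its highest probability to a single frame, whereas a gamma density with shape 5 and rate 1 has the same mean with its mode at 4.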

In this paper we extend previous work on isolated-word recognition based on hidden Markov models by replacing the discrete symbol representation of the speech signal with a continuous Gaussian mixture density. In this manner the inherent quantization error introduced by the discrete representation is essentially eliminated. The resulting recognizer was tested on a vocabulary of the ten digits across a wide range of talkers and test conditions and shown to have an error rate comparable to that of the best template recognizers and significantly lower than that of the discrete symbol hidden Markov model system. We discuss several issues involved in the training of the continuous density models and in the implementation of the recognizer.

Recent work at Bell Laboratories has shown how the theories of LPC Vector Quantization (VQ) and hidden Markov modeling (HMM) can be applied to the recognition of isolated word vocabularies. Our first experiments with HMM based recognizers were restricted to a vocabulary of the ten digits. For this simple vocabulary we found that a high performance recognizer (word accuracy on the order of 97%) could be implemented, and that the performance was, for the most part, insensitive to parameters of both the Markov model and the vector quantizer. In this talk we extend our investigations to the recognition of isolated words from a medium size vocabulary, (129 words), as used in the Bell Laboratories airline reservation and information system. For this moderately complex vocabulary we have found that recognition accuracy is indeed a function of the HMM parameter (i.e., the number of states and the number of symbols in the vector quantizer). We have also found that a vector quantizer which uses energy information gives better performance than a conventional LPC shape vector quantizer of the same size (i.e., number of codebook entries).

Continuous speech was treated as if produced by a finite‐state machine making a transition every centisecond. The observable output from state transitions was considered to be a power spectrum—a probabilistic function of the target state of each transition. Using this model, observed sequences of power spectra from real speech were decoded as sequences of acoustic states by means of the Viterbi trellis algorithm. The finite‐state machine used as a representation of the speech source was composed of machines representing words, combined according to a “language model.” When trained to the voice of a particular speaker, the decoder recognized seven‐digit telephone numbers correctly 96% of the time, with a better than 99% per‐digit accuracy. Results for other tests of the system, including syllable and phoneme recognition, will also be given.
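The decoding step can be illustrated with a small Viterbi sketch that recovers the most likely state sequence for a sequence of observations. For simplicity it uses discrete observation symbols rather than power spectra; the function name `viterbi`, the toy two-state model, and the log-domain formulation are assumptions for illustration rather than details from the abstract.

```python
import math


def viterbi(obs, pi, A, B):
    """Return (best_state_path, best_log_probability) for a discrete HMM."""
    n = len(pi)

    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    # delta[i] = best log-probability of any path ending in state i
    delta = [lg(pi[i]) + lg(B[i][obs[0]]) for i in range(n)]
    back = []  # back[t][j] = best predecessor of state j at step t + 1
    for o in obs[1:]:
        prev = delta
        ptr, delta = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: prev[i] + lg(A[i][j]))
            ptr.append(best_i)
            delta.append(prev[best_i] + lg(A[best_i][j]) + lg(B[j][o]))
        back.append(ptr)

    # Trace back from the best final state.
    state = max(range(n), key=lambda i: delta[i])
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path, max(delta)


if __name__ == "__main__":
    pi = [0.5, 0.5]
    A = [[0.9, 0.1], [0.1, 0.9]]
    B = [[0.9, 0.1], [0.1, 0.9]]
    print(viterbi([0, 0, 1, 1, 1], pi, A, B))
```

Working in the log domain turns the products along a path into sums, which keeps the recursion stable for long observation sequences.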

This paper gives a unified theoretical view of the Dynamic Time Warping (DTW) and the Hidden Markov Model (HMM) techniques for speech recognition problems. The application of hidden Markov models in speech recognition is discussed. We show that the conventional dynamic time-warping algorithm with Linear Predictive (LP) signal modeling and distortion measurements can be formulated in a strictly statistical framework. It is further shown that the DTW/LP method is implicitly associated with a specific class of Markov models and is equivalent to the probability maximization procedures for Gaussian autoregressive multivariate probabilistic functions of the underlying Markov model. This unified view offers insights into the effectiveness of the probabilistic models in speech recognition applications.

In this paper we present several of the salient theoretical and practical issues associated with modeling a speech signal as a probabilistic function of a (hidden) Markov chain. First we give a concise review of the literature with emphasis on the Baum-Welch algorithm. This is followed by a detailed discussion of three issues not treated in the literature: alternatives to the Baum-Welch algorithm; critical facets of the implementation of the algorithms, with emphasis on their numerical properties; and behavior of Markov models on certain artificial but realistic problems. Special attention is given to a particular class of Markov models, which we call “left-to-right” models. This class of models is especially appropriate for isolated word recognition. The results of the application of these methods to an isolated word, speaker-independent speech recognition experiment are given in a companion paper.

In this paper a new sequential decoding algorithm is introduced that uses stack storage at the receiver. It is much simpler to describe and analyze than the Fano algorithm, and is about six times faster than the latter at transmission rates equal to Rcomp, the rate below which the average number of decoding steps is bounded by a constant. Practical problems connected with implementing the stack algorithm are discussed and a scheme is described that facilitates satisfactory performance even with limited stack storage capacity. Preliminary simulation results estimating the decoding effort and the needed stack size are presented.

An isolated word recognizer based on vector quantization at the acoustic level and on stochastic modeling at the phonetic level is described. The power of this approach lies in its best utilization of the training data. The first experimental results obtained are encouraging and suggest that further optimization is possible.

A method for modelling time series is presented and then applied to the analysis of the speech signal. A time series is represented as a sample sequence generated by a finite state hidden Markov model with output densities parameterized by linear prediction polynomials and error variances. These objects are defined and their properties developed. The theory culminates in a theorem that provides a computationally efficient iterative scheme to improve the model. The theorem has been used to create models from speech signals of considerable length. One such model is examined with emphasis on the relationship between states of the model and traditional classes of speech events. A use of the method is illustrated by an application to the talker verification problem.

The Speech Recognition Group at IBM Research in Yorktown Heights has developed a real-time, isolated-utterance speech recognizer for natural language based on the IBM Personal Computer AT and IBM Signal Processors. The system has recently been enhanced by expanding the vocabulary from 5,000 words to 20,000 words and by the addition of a speech workstation to support usability studies on document creation by voice. The system supports spelling and interactive personalization to augment the vocabularies. This paper describes the implementation, user interface, and comparative performance of the recognizer.

This paper proposes a new strategy, Multi-Level Decoding (MLD), that allows a Very Large Size Dictionary (VLSD, more than 100,000 words) to be used in speech recognition. MLD proceeds in three steps: (1) a Syllable Match procedure uses an acoustic model to build a list of the most probable syllables that match the acoustic signal from a given time frame; (2) from this list, a Word Match procedure uses the dictionary to build partial word hypotheses; (3) a Sentence Match procedure then uses a probabilistic language model to build partial sentence hypotheses until complete sentences are found. An original matching algorithm is proposed for the Syllable Match procedure. This strategy is tested on a dictation task of French texts. Two different dictionaries are tested, one composed of the 10,000 most frequent words and the other composed of 200,000 words. The recognition results are given and compared. The word error rate with 10,000 words is 17.3%. If the errors due to the lack of coverage are not counted, the error rate with 10,000 words is reduced to 10.6%. The error rate with 200,000 words is 12.7%.

A new iterative approach for hidden Markov modeling of information sources which aims at minimizing the discrimination information (or the cross-entropy) between the source and the model is proposed. This approach does not require the commonly used assumption that the source to be modeled is a hidden Markov process. The algorithm is started from the model estimated by the traditional maximum likelihood (ML) approach and alternately decreases the discrimination information over all probability distributions of the source which agree with the given measurements and over all hidden Markov models. The proposed procedure generalizes the Baum algorithm for ML hidden Markov modeling. The procedure is shown to be a descent algorithm for the discrimination information measure and its local convergence is proved.

This paper describes an experimental continuous speech recognition system comprising procedures for acoustic/phonetic classification, lexical access and sentence retrieval. Speech is assumed to be composed of a small number of phonetic units which may be identified with the states of a hidden Markov model. The acoustic correlates of the phonetic units are then characterized by the observable Gaussian process associated with the corresponding state of the underlying Markov chain. Once the parameters of such a model are determined, a phonetic transcription of an utterance can be obtained by means of a Viterbi-like algorithm. Given a lexicon in which each entry is orthographically represented in terms of the chosen phonetic units, a word lattice is produced by a lexical access procedure. Lexical items whose orthography matches subsequences of the phonetic transcription are sought by means of a hash coding technique and their likelihoods are computed directly from the corresponding interval of acoustic measurements. The recognition process is completed by recovering from the word lattice, the string of words of maximum likelihood conditioned on the measurements. The desired string is derived by a best-first search algorithm. In an experimental evaluation of the system, the parameters of an acoustic/phonetic model were estimated from fluent utterances of 37 seven-digit numbers. A digit recognition rate of 96% was then observed on an independent test set of 59 utterances of the same form from the same speaker. Half of the observed errors resulted from insertions while deletions and substitutions accounted equally for the other half.
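
The abstract above obtains a phonetic transcription "by means of a Viterbi-like algorithm". As an illustration only, the following is a minimal sketch of textbook Viterbi decoding for a discrete-observation HMM. It is not the paper's Gaussian-output system; the function name `viterbi` and the toy parameters in the usage note are assumptions made for this example.

```python
import math

def _log(p):
    """Logarithm that maps zero probability to -inf instead of raising."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(obs, start_p, trans_p, emit_p):
    """Return (most likely state path, its log-probability).

    obs      : list of observation symbol indices
    start_p  : start_p[s]     = P(first state is s)
    trans_p  : trans_p[r][s]  = P(next state is s | current state is r)
    emit_p   : emit_p[s][o]   = P(symbol o | state s)
    """
    n_states = len(start_p)
    # delta[s]: best log-score of any state path that ends in state s
    delta = [_log(start_p[s]) + _log(emit_p[s][obs[0]]) for s in range(n_states)]
    backpointers = []
    for o in obs[1:]:
        new_delta, ptr = [], []
        for s in range(n_states):
            # best predecessor of state s at this time step
            best = max(range(n_states), key=lambda r: delta[r] + _log(trans_p[r][s]))
            ptr.append(best)
            new_delta.append(delta[best] + _log(trans_p[best][s]) + _log(emit_p[s][o]))
        delta = new_delta
        backpointers.append(ptr)
    # trace back from the best final state
    last = max(range(n_states), key=lambda s: delta[s])
    path = [last]
    for ptr in reversed(backpointers):
        path.append(ptr[path[-1]])
    path.reverse()
    return path, delta[last]
```

With the hypothetical two-state model `start_p = [0.6, 0.4]`, `trans_p = [[0.7, 0.3], [0.4, 0.6]]`, `emit_p = [[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]]` and observations `[0, 1, 2]`, the decoded path is `[0, 0, 1]`. Working in the log domain avoids underflow on long observation sequences.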

One approach to large vocabulary speech recognition is to build phonetic Markov models and to concatenate them to obtain word models. In previous work, we designed a recognizer based on 40 phonetic Markov machines, which accepts a 10,000-word vocabulary ([3]) and, more recently, a 200,000-word vocabulary ([5]). Since there is one machine per phoneme, these models do not account for coarticulatory effects, which may lead to recognition errors. In this paper, we improve the phonetic models by using general principles about coarticulation effects on automatic phoneme recognition. We show that both the analysis of the errors made by the recognizer and linguistic facts about the influence of phonetic context suggest a method for choosing context-dependent models. This method makes it possible to limit the growth in the number of phonemes while still accounting for the most important coarticulation effects. We present our experiments with a system applying these principles to a set of models for French. With this new system, which includes context-dependent machines, the phoneme recognition rate goes from 82.2% to 85.3%, and the word error rate with a 10,000-word dictionary decreases from 11.2% to 9.8%.

This paper proposes a new way of using vector quantization for improving recognition performance for a 60,000 word vocabulary speaker-trained isolated word recognizer using a phonemic Markov model approach to speech recognition. We show that we can effectively increase the codebook size by dividing the feature vector into two vectors of lower dimensionality, and then quantizing and training each vector separately. For a small codebook size, integration of the results of the two parameter vectors provides significant improvement in recognition performance as compared to the quantizing and training of the entire feature set together. Even for a codebook size as small as 64, the results obtained when using the new quantization procedure are quite close to those obtained when using Gaussian distribution of the parameter vectors.

Most current speech recognition systems are sensitive to variations in speaker style. This work is the result of an effort to make a Hidden Markov Model (HMM) Isolated Word Recognizer (IWR) tolerant to the speech changes caused by speaker stress. More than an order-of-magnitude reduction of the error rate was achieved for a 105-word simulated stress database, and a 0% error rate was achieved for the TI 20 isolated word database.

A new training procedure called multi-style training has been developed to improve performance when a recognizer is used under stress or in high noise but cannot be trained in these conditions. Instead of speaking normally during training, talkers use different, easily produced, talking styles. This technique was tested using a speech data base that included stress speech produced during a workload task and when intense noise was presented through earphones. A continuous-distribution talker-dependent Hidden Markov Model (HMM) recognizer was trained both normally (5 normally spoken tokens) and with multi-style training (one token each from normal, fast, clear, loud, and question-pitch talking styles). The average error rate under stress and normal conditions fell by more than a factor of two with multi-style training and the average error rate under conditions sampled during training fell by a factor of four.

Hidden Markov modeling has become an increasingly popular technique in automatic speech recognition. Recently, attention has been focused on the application of these models to talker-independent, isolated-word recognition. Initial results using models with discrete output densities for isolated-digit recognition were later improved using models based on continuous output densities. In a series of experiments on isolated-word recognition, we applied hidden Markov models with multivariate Gaussian output densities to the problem. Speech data was represented by feature vectors consisting of eight log area ratios and the log LPC error. A weak measure of vocal-tract dynamics was included in the observations by appending to the feature vector observed at time t, the vector observed at time t-δ, for some fixed offset δ. The best models were obtained with offsets of 75 or 90 msecs. When a comparison is made on a common data base, the resulting error rate of 0.2% for isolated-digit recognition improves on previous algorithms.
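
The abstract above adds a weak measure of vocal-tract dynamics by appending, to the feature vector at time t, the vector observed at time t-δ. A minimal sketch of that stacking step follows. The function name `stack_with_offset`, the choice to drop the first frames that have no earlier partner, and the 15 ms frame period in the usage note are assumptions for illustration; the abstract does not state how the edges or the frame rate were handled.

```python
def stack_with_offset(frames, offset):
    """Append to each frame the frame observed `offset` steps earlier.

    frames : list of equal-length feature vectors, one per time step
    offset : positive delay in frames (the delta of the abstract, in frame units)

    Frames with index < offset have no earlier partner and are dropped
    (an assumption of this sketch), so the result has len(frames) - offset rows.
    """
    if offset < 1:
        raise ValueError("offset must be a positive number of frames")
    return [list(frames[t]) + list(frames[t - offset])
            for t in range(offset, len(frames))]
```

For example, if frames were taken every 15 ms (a hypothetical value), a 75 ms offset would correspond to `offset = 5`, and each stacked vector would have twice the original dimension.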

The use of instantaneous and transitional spectral representations of spoken utterances for speaker recognition is investigated. LPC-derived cepstral coefficients are used to represent instantaneous spectral information, and best linear fits of each cepstral coefficient over a specified time window are used to represent transitional information. An evaluation has been carried out using a data base of isolated digit utterances recorded over dialed-up telephone lines from 10 talkers. Two vector quantization (VQ) codebooks, instantaneous and transitional, are constructed from training utterances for each speaker. The experimental results show that the instantaneous and transitional representations are relatively uncorrelated, thus providing complementary information for speaker recognition. A rectangular window of approximately 100-150 ms duration provides an effective estimate of spectral transitions for speaker recognition. Also, simple transmission channel variations are shown to affect the instantaneous spectral representations and the corresponding recognition performance significantly, while the transitional representations and performance are relatively resistant.
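
The transitional representation described above is a best linear fit of each cepstral coefficient over a rectangular time window. A minimal sketch of the least-squares slope for a single coefficient track is given below. The function name `regression_slopes` and the symmetric window of `half_width` frames on each side are assumptions for illustration, not details taken from the paper.

```python
def regression_slopes(track, half_width):
    """Least-squares slope of a coefficient track in a sliding rectangular window.

    track      : list of values of one cepstral coefficient, one per frame
    half_width : number of frames on each side of the centre frame

    Because the window offsets k = -K..K sum to zero, the least-squares slope
    reduces to sum(k * x[t+k]) / sum(k * k). Frames too close to either end
    to hold a full window are skipped (an assumption of this sketch).
    """
    offsets = range(-half_width, half_width + 1)
    denom = sum(k * k for k in offsets)
    slopes = []
    for t in range(half_width, len(track) - half_width):
        num = sum(k * track[t + k] for k in offsets)
        slopes.append(num / denom)
    return slopes
```

Applied to a track that rises linearly by 3 per frame, every returned slope is 3, which is a quick sanity check on the formula.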

The problem of modeling durational structure is addressed. Results of experiments on the use of temporal models to optimally change the duration and temporal structure of words are presented and used to show that a first-order Markov chain is inadequate for effectively modeling expected local duration. Semi-Markov models are proposed and shown to lead to improved performance. The implications of these results for automatic speech recognition are considered. Hidden semi-Markov models are introduced, and two alternative models of state duration are proposed. The problem of model parameter estimation is addressed. Finally, preliminary experimental results are presented that compare the recognition performance obtained using hidden Markov models with that obtained using a special class of hidden semi-Markov models.

This paper describes the results of our work in designing a system for phonetic recognition of unrestricted continuous speech. We describe several algorithms used to recognize phonemes using context-dependent Hidden Markov Models of the phonemes. We present results for several variations of the parameters of the algorithms. In addition, we propose a technique that makes it possible to integrate traditional acoustic-phonetic features into a hidden Markov process. The categorical decisions usually associated with heuristic acoustic-phonetic algorithms are replaced by automated training techniques and global search strategies. The combination of general spectral information and specific acoustic-phonetic features is shown to result in more accurate phonetic recognition than either representation by itself.

Although a great deal of effort has gone into studying large-vocabulary speech-recognition problems, there remain a number of interesting, and potentially exceedingly important, problems which do not require the complexity of these large systems. One such problem is connected-digit recognition, which has applications to telecommunications, order entry, credit-card entry, forms automation, and data-base management, among others. Connected-digit recognition is also an interesting problem for another reason, namely that it is one in which whole-word training patterns are applicable as the basic speech-recognition unit. Thus one can bring to bear all the fundamental speech recognition technology associated with whole-word recognition to solve this problem. As such, several connected-digit recognizers have been proposed in the past few years. The performance of these systems has steadily improved to the point where high digit-recognition accuracy is achievable in a speaker-trained mode.

In this paper, we describe sphinx, the world's first accurate large-vocabulary speaker-independent continuous speech recognition system. We will present current results of sphinx, compare its performance against similar systems, and account for its high accuracy.

A broadly applicable algorithm for computing maximum likelihood estimates from incomplete data is presented at various levels of generality. Theory showing the monotone behaviour of the likelihood and convergence of the algorithm is derived. Many examples are sketched, including missing value situations, applications to grouped, censored or truncated data, finite mixture models, variance component estimation, hyperparameter estimation, iteratively reweighted least squares and factor analysis.
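
One of the example families named in the abstract above is finite mixture models. As a small illustration of the monotone behaviour of the likelihood, the following sketch runs EM on a two-component one-dimensional Gaussian mixture and records the log-likelihood at every iteration. The function names, the initialisation, and the variance floor are assumptions of this sketch rather than anything prescribed by the paper.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_two_gaussians(data, n_iter=20):
    """EM for a two-component 1-D Gaussian mixture.

    Returns (weights, means, variances, history), where history[i] is the
    log-likelihood of the parameters in effect at the start of iteration i.
    """
    n = len(data)
    # Crude initialisation (assumption): means at the extremes, shared variance.
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    w = [0.5, 0.5]
    mu = [min(data), max(data)]
    var = [grand_var, grand_var]
    history = []
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp, loglik = [], 0.0
        for x in data:
            p = [w[j] * normal_pdf(x, mu[j], var[j]) for j in range(2)]
            total = p[0] + p[1]
            loglik += math.log(total)
            resp.append([p[0] / total, p[1] / total])
        history.append(loglik)
        # M-step: responsibility-weighted maximum likelihood estimates.
        for j in range(2):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            mu[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            spread = sum(r[j] * (x - mu[j]) ** 2 for r, x in zip(resp, data)) / nj
            # Floor guards against a collapsing component; it is not part of
            # plain EM and is not expected to trigger on well-spread data.
            var[j] = max(spread, 1e-6)
    return w, mu, var, history
```

On data drawn from two well-separated clusters, the recorded log-likelihood values never decrease from one iteration to the next (up to floating-point rounding), which is the property the abstract refers to.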

This paper gives an exposition of linear prediction in the analysis of discrete signals. The signal is modeled as a linear combination of its past values and present and past values of a hypothetical input to a system whose output is the given signal. In the frequency domain, this is equivalent to modeling the signal spectrum by a pole-zero spectrum. The major part of the paper is devoted to all-pole models. The model parameters are obtained by a least squares analysis in the time domain. Two methods result, depending on whether the signal is assumed to be stationary or nonstationary. The same results are then derived in the frequency domain. The resulting spectral matching formulation allows for the modeling of selected portions of a spectrum, for arbitrary spectral shaping in the frequency domain, and for the modeling of continuous as well as discrete spectra. This also leads to a discussion of the advantages and disadvantages of the least squares error criterion. A spectral interpretation is given t