Article
To read the full-text of this research, you can request a copy directly from the authors.

Abstract

Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, rather than the classical approach of classifying whole words and phrases, with a specific focus on both single- and multi-objective evolutionary optimisation of bioinspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as a reshaped MFCC timeseries format for forecast-based models. A deep neural network with an evolutionarily optimised topology achieves 90.77% phoneme classification accuracy, compared with 86.23% for the best HMM (150 hidden units), when only accuracy is considered in a single-objective optimisation approach. The obtained solutions are, however, far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM; a multi-objective approach is therefore explored. In the scalarisation-based multi-objective approach presented, in which real-time resource usage also contributes to solution fitness, far more efficient solutions are produced that train much more quickly than the forecast approach (69 seconds) while retaining classification ability (86.73%). Weightings between 0.1 and 0.9 towards either maximising accuracy or reducing resource usage are suggested depending on the resources available, since many future IoT devices and autonomous robots may have only limited or premium-priced access to cloud resources compared with the GPU used in this experiment.
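For concreteness, the following is a minimal sketch of the two quantitative ideas in the abstract: summarising MFCCs over a 200 ms sliding window into one static feature row per window, and a scalarised fitness that weights classification accuracy against training time. The librosa-based pre-processing, the choice of 13 coefficients, the mean/standard-deviation statistics, and the fitness normalisation are illustrative assumptions rather than the paper's exact pipeline.

```python
import numpy as np
import librosa

def mfcc_window_statistics(path, window_s=0.2, n_mfcc=13):
    """Turn one audio clip into rows of per-window MFCC statistics (assumed pipeline)."""
    y, sr = librosa.load(path, sr=None)
    win = int(window_s * sr)                      # 200 ms worth of samples
    rows = []
    for start in range(0, len(y) - win + 1, win):
        mfcc = librosa.feature.mfcc(y=y[start:start + win], sr=sr, n_mfcc=n_mfcc)
        # collapse the short coefficient time series into one static row of statistics
        rows.append(np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)]))
    return np.vstack(rows)

def scalarised_fitness(accuracy, train_seconds, weight, max_seconds=300.0):
    """Weighted-sum (scalarisation) fitness: a weight in [0.1, 0.9] trades accuracy
    against resource usage, in the spirit of the weightings suggested in the abstract."""
    return weight * accuracy + (1.0 - weight) * (1.0 - train_seconds / max_seconds)
```

A device with scarce resources would evaluate candidate models with a low weight (favouring cheap training), whereas a GPU-backed system could push the weight towards 0.9.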

... The relevance of evolutionary-based algorithms, like a genetic algorithm, that belong to a family of search algorithms inspired by the process of evolution in nature, was demonstrated in a recent study showing that optimizing the topology of an Artificial Neural Network may lead to a high classification rate of spoken utterances of both native and non-native English speakers [24]. ...
... It has potential applications in automatic speech recognition and related areas. For example, in [24], the authors used an evolutionary-algorithm-optimised deep neural network for recognition of diphthong vowel sounds in the English phonetic alphabet. In [36], the author combined a genetic algorithm with Manhattan distance to classify plain and emphatic vowels in continuous Arabic speech. ...
Article
Full-text available
Distinctive phonetic features have an important role in Arabic speech phoneme recognition. In a given language, distinctive phonetic features are extrapolated from acoustic features using different methods. However, exploiting a lengthy acoustic feature vector for phoneme recognition has a huge cost in terms of computational complexity, which in turn affects real-time applications. The aim of this work is to consider methods of reducing the size of the feature vector employed for distinctive phonetic feature and phoneme recognition. The objective is to select the relevant input features that contribute to the speech recognition process; this, in turn, leads to a reduced computational complexity of the recognition algorithm and improved recognition accuracy. In the proposed approach, a genetic algorithm is used to perform optimal feature selection. A baseline model based on feedforward neural networks is first built and used to benchmark the results of the proposed feature selection method against a method that employs all elements of the feature vector. Experimental results, utilizing the King Abdulaziz City for Science and Technology Arabic Phonetic Database, show that the average overall phoneme recognition accuracy of the genetic-algorithm-based method remains slightly higher than that of the method employing the full feature vector. The genetic-algorithm-based distinctive phonetic feature recognition method achieved a 50% reduction in the dimension of the input vector while obtaining a recognition accuracy of 90%. Moreover, the results of the proposed method are validated using the Wilcoxon signed-rank test.
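In the same spirit as the feature selection described above, the sketch below evolves a binary mask over feature indices with a simple genetic algorithm, scoring each mask by the cross-validated accuracy of a small feedforward network; the encoding, operators, and MLP settings are illustrative assumptions rather than the authors' configuration.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def fitness(mask, X, y):
    # accuracy of a small feedforward net trained on the selected columns only
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

def ga_select(X, y, pop=20, gens=15, p_mut=0.05, rng=np.random.default_rng(0)):
    n = X.shape[1]
    population = rng.integers(0, 2, size=(pop, n))       # binary masks over features
    for _ in range(gens):
        scores = np.array([fitness(ind, X, y) for ind in population])
        parents = population[np.argsort(scores)[::-1][:pop // 2]]   # truncation selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, n)
            child = np.concatenate([a[:cut], b[cut:]])    # one-point crossover
            flip = rng.random(n) < p_mut                  # bit-flip mutation
            child[flip] = 1 - child[flip]
            children.append(child)
        population = np.vstack([parents, np.array(children)])
    scores = np.array([fitness(ind, X, y) for ind in population])
    return population[int(np.argmax(scores))]             # best feature mask found
```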
... The training of an acoustic system is intended for solving two or more tasks simultaneously, which are different in nature. The purpose of the auxiliary task includes generating the features of clean speech through a regression loss [7]. ...
... For non-reverberant and clean acoustic environments, the amount of annotated data available is highly significant for a speech recognition system [3]. However, such ideal acoustic conditions are rarely realistic: in many diverse real-life conditions, the speech signal may be degraded by the acoustic properties of the room or by surrounding noise, which leads to reverberation of the speech [7]. These effects can cause the speech recognition model to attain significantly poorer results and make the task much more challenging. ...
... Apart from the Bengali language, MFCCs have been used in research on other languages as well. Most of these studies utilized 13 MFCC features in a deep neural network (DNN) classifier, for tasks such as recognition of speakers [20], English phonemes [21], emotion from English audio [6], five Malayalam vowel phonemes [22], twenty Dari speech tokens [23], three Arabic words [24], and ten English command words [25]. Some studies instead prefer convolutional neural network (CNN) models, such as the isolated English word recognition by Soliman et al. [26]. ...
Article
Full-text available
Speech-related research has a wide range of applications. Most speech-related studies employ Mel-frequency cepstral coefficients (MFCCs) as acoustic features. However, finding the optimum number of MFCCs is an active research question. MFCC-based speech classification was performed for both vowels and words in the Bengali language. As for the classification model, a deep neural network (DNN) with the Adam optimizer was used. Performance was measured with five different metrics, namely the confusion matrix, classification accuracy, area under the receiver operating characteristic curve (AUC-ROC), F1 score, and Cohen's Kappa, with four-fold cross-validation at different numbers of MFCCs. All performance metrics gave the best score for 24/25 MFCCs; hence it is suggested that the optimum number of MFCCs should be 25, although many existing studies use only 13 MFCCs. Furthermore, it is verified that increasing the number of MFCCs yields better classification metrics with a lower computational burden than increasing the number of hidden layers. Lastly, the optimum number of MFCCs obtained from this study was used in an improved DNN model, from which 99% and 90% accuracies were achieved for vowel and word classification, respectively, and the vowel classification score outperformed state-of-the-art results.
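The study's central experiment, sweeping the number of MFCCs and comparing classification accuracy, can be sketched as below; the clip representation, classifier settings, and cross-validation folds are placeholder assumptions, not the paper's exact setup.

```python
import numpy as np
import librosa
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def mfcc_features(y, sr, n_mfcc):
    # summarise one clip as per-coefficient means and standard deviations
    m = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([m.mean(axis=1), m.std(axis=1)])

def sweep_mfcc_counts(clips, labels, counts=range(13, 31)):
    """clips: list of (waveform, sample_rate) pairs; returns mean accuracy per n_mfcc."""
    results = {}
    for n in counts:
        X = np.vstack([mfcc_features(y, sr, n) for y, sr in clips])
        clf = MLPClassifier(hidden_layer_sizes=(128, 64), max_iter=500)
        results[n] = cross_val_score(clf, X, labels, cv=4).mean()
    return results   # choose the n_mfcc with the best mean accuracy
```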
... The importance of WAR suggests that, on a broad scale, the mechanism is not dependent on speakers. Bird et al. (2020) proposed Artificial Neural Network and Hidden Markov Model classification methods for Human Speech Recognition in the English Phonetic Alphabet. Subjects from the United Kingdom and Mexico recorded a collection of audio clips, and the recordings were converted into a static statistical dataset. ...
Article
Full-text available
In the last few decades, there has been a considerable amount of research on the use of Machine Learning (ML) for speech recognition based on Convolutional Neural Networks (CNNs). These studies are generally focused on using CNNs for applications related to speech recognition. Additionally, various works based on deep learning since its emergence in speech recognition applications are discussed. Compared to other approaches, deep-learning-based approaches are showing rather interesting outcomes in several applications, including speech recognition, and therefore attract a great deal of research. In this paper, a review is presented of the developments that have occurred in this field, along with a discussion of current research on the topic.
... For the Deep Neural Network that classifies the speaker, a topology of three hidden layers consisting of 30, 7, and 29 neurons respectively, with ReLU activation functions and an ADAM optimiser [63], is initialised. These hyperparameters are chosen based on a previous study that performed a genetic search of neural network topologies for the classification of phonetic sounds in the form of MFCCs [64]. The networks are given an unlimited number of epochs to train, ceasing only via an early stopping callback after 25 epochs with no improvement in ability. ...
Preprint
Full-text available
In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify the data against a large dataset of Flickr8k speakers and is then compared to a transfer learning network performing the same task but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. For all 7 subjects, the best results came from networks that had been exposed to synthetic data; the model pre-trained with LSTM-produced data achieved the best result 3 times and the GPT-2 equivalent 5 times (since one subject had their best result from both models at a draw). Through these results, we argue that speaker classification can be improved by utilising a small amount of user data together with exposure to synthetically generated MFCCs, which then allow the networks to achieve near-maximum classification scores.
Thesis
Full-text available
In modern Human-Robot Interaction, much thought has been given to accessibility regarding robotic locomotion, specifically the enhancement of awareness and lowering of cognitive load. On the other hand, with social Human-Robot Interaction considered, published research is far sparser given that the problem is less explored than pathfinding and locomotion. This thesis studies how one can endow a robot with affective perception for social awareness in verbal and non-verbal communication. This is possible by the creation of a Human-Robot Interaction framework which abstracts machine learning and artificial intelligence technologies which allow for further accessibility to non-technical users compared to the current State-of-the-Art in the field. These studies thus initially focus on individual robotic abilities in the verbal, non-verbal and multimodality domains. Multimodality studies show that late data fusion of image and sound can improve environment recognition, and similarly that late fusion of Leap Motion Controller and image data can improve sign language recognition ability. To alleviate several of the open issues currently faced by researchers in the field, guidelines are reviewed from the relevant literature and met by the design and structure of the framework that this thesis ultimately presents. The framework recognises a user's request for a task through a chatbot-like architecture. Through research in this thesis that recognises human data augmentation (paraphrasing) and subsequent classification via language transformers, the robot's more advanced Natural Language Processing abilities allow for a wider range of recognised inputs. That is, as examples show, phrases that could be expected to be uttered during a natural human-human interaction are easily recognised by the robot. This allows for accessibility to robotics without the need to physically interact with a computer or write any code, with only the ability of natural interaction (an ability which most humans have) required for access to all the modular machine learning and artificial intelligence technologies embedded within the architecture. Following the research on individual abilities, this thesis then unifies all of the technologies into a deliberative interaction framework, wherein abilities are accessed from long-term memory modules and short-term memory information such as the user's tasks, sensor data, retrieved models, and finally output information. In addition, algorithms for model improvement are also explored, such as through transfer learning and synthetic data augmentation and so the framework performs autonomous learning to these extents to constantly improve its learning abilities. It is found that transfer learning between electroencephalographic and electromyographic biological signals improves the classification of one another given their slight physical similarities. Transfer learning also aids in environment recognition, when transferring knowledge from virtual environments to the real world. In another example of non-verbal communication, it is found that learning from a scarce dataset of American Sign Language for recognition can be improved by multi-modality transfer learning from hand features and images taken from a larger British Sign Language dataset. 
Data augmentation is shown to aid in electroencephalographic signal classification by learning from synthetic signals generated by a GPT-2 transformer model, and, in addition, augmenting training with synthetic data also shows improvements when performing speaker recognition from human speech. Given the importance of platform independence due to the growing range of available consumer robots, four use cases are detailed, and examples of behaviour are given by the Pepper, Nao, and Romeo robots as well as a computer terminal. The first use case involves a user requesting their electroencephalographic brainwave data to be classified by simply asking the robot whether or not they are concentrating. In a subsequent use case, the user asks if a given text is positive or negative; the robot correctly recognises the natural language processing task at hand and then classifies the text, the result is output, and the physical robots react accordingly by showing emotion. The third use case involves a request for sign language recognition, which the robot recognises, switching from listening to watching the user communicate with them. The final use case focuses on a request for environment recognition, which has the robot perform multimodality recognition of its surroundings and note them accordingly. The results presented by this thesis show that several of the open issues in the field are alleviated through the technologies within, the structuring of, and the examples of interaction with the framework. The results also show the achievement of the three main goals set out by the research questions: the endowment of a robot with affective perception and social awareness for verbal and non-verbal communication; whether we can create a Human-Robot Interaction framework to abstract machine learning and artificial intelligence technologies which allow for accessibility to non-technical users; and, as previously noted, which current issues in the field can be alleviated by the framework presented and to what extent.
Article
Fusion strategies that utilize time-frequency features have achieved superior performance in acoustic scene classification tasks. However, the existing fusion schemes are mainly frameworks that involve different modules for feature learning, fusion, and modeling. These frameworks are prone to introduce artificial interference and thus make it challenging to obtain the system's best performance. In addition, the lack of adequate information interaction between different features in the existing fusion schemes prevents the learned features from achieving the optimal discriminative ability. To tackle these problems, we design a deep mutual attention network based on the principle of receptive field regularization and the mutual attention mechanism. The proposed network can realize the joint learning and complementary enhancement of multiple time-frequency features end-to-end, which improves features' learning efficiency and discriminative ability. Experimental results on six publicly available datasets show that the proposed network outperforms almost all state-of-the-art systems regarding classification accuracy.
Article
Human action recognition using wearable sensors plays a remarkable role in the province of Human-centric Computing. This paper presents a novel sparse-representation-based hierarchical evolutionary framework for classifying human activities. The main objective of this research is to propose a model for human action recognition that produces superior recognition results. This framework employs data from two inertial sensors, namely an accelerometer and a gyroscope. Both time- and frequency-domain features were utilized in this work. The framework operates at multiple levels in the hierarchy, wherein the output of one level is given as input to the next level. A novel algorithm for deducing the hierarchical structure from the input data, called the Hierarchical Architecture Design (HAD) algorithm, is presented. We also present a novel Sparse Dictionary Optimization (SDO) algorithm for generating dictionaries that aid efficacious sparse-representation-based classification. Finally, action recognition is performed using the proposed Sparse Representation based Hierarchical (SRH) classifier. The performance analysis of the proposed system was conducted using the University of Southern California Human Activity Dataset (USC-HAD) and the Human Activities and Postural Transitions (HAPT) dataset. The proposed classification framework attained very high F-score values of 98.01% and 93.51% for the USC-HAD and HAPT datasets, respectively.
Article
In this paper, the k-median of a graph is used to decompose the domain (mesh) of continuous two- and three-dimensional finite element models. The k-median problem is stated as an optimization problem and is solved by utilizing eight robust meta-heuristic algorithms. The Artificial Bee Colony algorithm (ABC), Cyclical Parthenogenesis algorithm (CPA), Cuckoo Search algorithm (CS), Teaching-Learning Based Optimization algorithm (TLBO), Tug of War Optimization algorithm (TWO), Water Evaporation Optimization algorithm (WEO), Ray Optimization algorithm (RO), and Vibrating Particles System algorithm (VPS) constitute the set of algorithms employed in the present study. In order to tune the parameters of the meta-heuristics, the Taguchi method is used. The efficiency and robustness of the algorithms are investigated through two- and three-dimensional finite element models.
Article
Full-text available
Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English. The problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to complicated language structure and scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language, which is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with attention mechanism. The performance of these models is evaluated in isolated speech recognition task (with an accuracy of 0.993) and long phrase recognition task (with an accuracy of 0.992).
Conference Paper
Full-text available
Phoneme awareness provides the path to high resolution speech recognition to overcome the difficulties of classical word recognition. Here we present the results of a preliminary study on Artificial Neural Network (ANN) and Hidden Markov Model (HMM) methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, with a specific focus on evolutionary optimisation of bio-inspired classification methods. A set of audio clips are recorded by subjects from the United Kingdom and Mexico. For each recording, the data were pre-processed, using Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window of 200ms per data object, as well as a further MFCC timeseries format for forecast-based models, to produce the dataset. We found that an evolutionary optimised deep neural network achieves 90.77% phoneme classification accuracy as opposed to the best HMM of 150 hidden units achieving 86.23% accuracy. Many of the evolutionary solutions take substantially longer to train than the HMM, however one solution scoring 87.5% (+1.27%) requires fewer resources than the HMM.
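As a rough illustration of the HMM baseline described in this abstract, the sketch below trains one Gaussian HMM per phoneme class on MFCC sequences and assigns a test clip to the class whose model gives the highest log-likelihood; the hmmlearn library, the number of states, and the covariance type are assumptions for illustration, not the authors' exact setup.

```python
import numpy as np
from hmmlearn import hmm

def train_hmms(sequences_by_class, n_states=10):
    """sequences_by_class: {phoneme_label: [MFCC array of shape (T_i, D), ...]}."""
    models = {}
    for label, seqs in sequences_by_class.items():
        X = np.vstack(seqs)                         # stack all sequences of this class
        lengths = [len(s) for s in seqs]            # so hmmlearn knows the boundaries
        m = hmm.GaussianHMM(n_components=n_states, covariance_type="diag", n_iter=50)
        m.fit(X, lengths)
        models[label] = m
    return models

def classify(models, seq):
    # pick the phoneme whose HMM assigns the highest log-likelihood to the sequence
    scores = {label: m.score(seq) for label, m in models.items()}
    return max(scores, key=scores.get)
```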
Article
Full-text available
Deep neural networks (DNNs) have been playing a significant role in acoustic modeling. Convolutional neural networks (CNNs) are an advanced version of DNNs that achieve a 4–12% relative gain in word error rate (WER) over DNNs. The existence of spectral variations and local correlations in the speech signal makes CNNs more capable of speech recognition. Recently, it has been demonstrated that bidirectional long short-term memory (BLSTM) networks produce a higher recognition rate in acoustic modeling because they are able to reinforce higher-level representations of acoustic data. Spatial and temporal properties of the speech signal are both essential for a high recognition rate, so the concept of combining the two different networks arises naturally. In this paper, a hybrid CNN-BLSTM architecture is proposed to appropriately use these properties and to improve continuous speech recognition. Further, we explore different methods such as weight sharing, the appropriate number of hidden units, and the ideal pooling strategy for the CNN to achieve a high recognition rate. Specifically, the focus is also on how many BLSTM layers are effective. This paper also attempts to overcome another shortcoming of CNNs, namely that speaker-adapted features cannot be directly modeled in a CNN. Next, various non-linearities with or without dropout are analyzed for speech tasks. Experiments indicate that the proposed hybrid architecture with speaker-adapted features and maxout non-linearity with dropout shows 5.8% and 10% relative decreases in WER over the CNN and DNN systems, respectively.
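To make the hybrid architecture concrete, the following is a small PyTorch sketch of a CNN front-end feeding a bidirectional LSTM over time; the layer sizes, frequency-only pooling, and output dimensionality are illustrative assumptions rather than the paper's exact CNN-BLSTM configuration.

```python
import torch
import torch.nn as nn

class CNNBLSTM(nn.Module):
    def __init__(self, n_mels=40, n_classes=1000):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),                        # pool along frequency only
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.blstm = nn.LSTM(input_size=32 * (n_mels // 2), hidden_size=256,
                             num_layers=2, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 256, n_classes)

    def forward(self, x):                 # x: (batch, 1, n_mels, time)
        f = self.conv(x)                  # (batch, 32, n_mels // 2, time)
        b, c, freq, t = f.shape
        f = f.permute(0, 3, 1, 2).reshape(b, t, c * freq)   # one vector per frame
        h, _ = self.blstm(f)              # temporal modelling over frames
        return self.out(h)                # per-frame class scores
```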
Conference Paper
Full-text available
We conducted an in situ study of six households in domestic and driving situations in order to better understand how voice assistants (VA) are used and evaluate the efficiency of vocal interactions in natural contexts. The filmed observations and interviews revealed activities of supervision, verification, diagnosis and problem-solving. These activities were not only costly in time, but they also interrupted the flow in the inhabitants’ other activities. Although the VAs were expected to facilitate the accomplishment of a second, simultaneous task, they in fact were a hindrance. Such failures can cause abandonment, but the results nevertheless revealed a paradox of use: the inhabitants forgave and accepted these errors, while continuing to appropriate the vocal system.
Conference Paper
Full-text available
This paper proposes an approach to selecting the number of layers and neurons contained within Multilayer Perceptron hidden layers through a single-objective evolutionary approach with the goal of maximising model accuracy. At each generation, a population of Neural Network architectures is created and ranked by accuracy. The generated solutions are combined in a breeding process to create a larger population, and at each generation the weakest solutions are removed to retain the population size, inspired by a Darwinian 'survival of the fittest'. Multiple datasets are tested, and results show that architectures can be successfully improved and derived through a hyper-heuristic evolutionary approach in less than 10% of the exhaustive search time. The evolutionary approach was further optimised by increasing the population density as well as gradually increasing the maximum solution complexity throughout the simulation.
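A minimal sketch of this kind of single-objective evolutionary topology search, where each genome is a list of hidden-layer sizes ranked by cross-validated accuracy; the population size, selection, crossover, and mutation settings here are illustrative assumptions, not the paper's exact hyper-heuristic.

```python
import random
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def random_genome(max_layers=3, max_neurons=128):
    # a genome is simply a list of hidden-layer widths
    return [random.randint(2, max_neurons) for _ in range(random.randint(1, max_layers))]

def accuracy(genome, X, y):
    clf = MLPClassifier(hidden_layer_sizes=tuple(genome), max_iter=300)
    return cross_val_score(clf, X, y, cv=3).mean()

def evolve(X, y, pop_size=10, generations=10):
    population = [random_genome() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=lambda g: accuracy(g, X, y), reverse=True)
        survivors = ranked[:pop_size // 2]               # survival of the fittest
        offspring = []
        while len(survivors) + len(offspring) < pop_size:
            a, b = random.sample(survivors, 2)
            child = a[:len(a) // 2] + b[len(b) // 2:]    # crossover of layer lists
            if random.random() < 0.3:                    # mutate one layer width
                child[random.randrange(len(child))] = random.randint(2, 128)
            offspring.append(child)
        population = survivors + offspring
    return max(population, key=lambda g: accuracy(g, X, y))
```

Because each fitness evaluation requires training a network, the search is expensive; the abstract's point is that it is still far cheaper than exhaustively enumerating topologies.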
Article
Full-text available
Deep Neural Networks (DNNs) have become a powerful and extremely popular mechanism, widely used to solve problems of varied complexity due to their ability to fit models to complex non-linear problems. Despite their well-known benefits, DNNs are complex learning models whose parametrization and architecture are usually set by hand. This paper proposes a new Evolutionary Algorithm, named EvoDeep, devoted to evolving the parameters and the architecture of a DNN in order to maximize its classification accuracy, while maintaining a valid sequence of layers. The model is tested against a widely used dataset of handwritten digit images. The experiments performed using this dataset show that the Evolutionary Algorithm is able to select the parameters and the DNN architecture appropriately, achieving 98.93% accuracy in the best run.
Article
Full-text available
Despite the success of the automatic speech recognition framework in its own application field, its adaptation to the problem of acoustic event detection has resulted in limited success. In this paper, instead of treating the problem similar to the segmentation and classification tasks in speech recognition, we pose it as a regression task and propose an approach based on random forest regression. Furthermore, event localization in time can be efficiently handled as a joint problem. We first decompose the training audio signals into multiple interleaved superframes which are annotated with the corresponding event class labels and their displacements to the temporal onsets and offsets of the events. For a specific event category, a random-forest regression model is learned using the displacement information. Given an unseen superframe, the learned regressor will output the continuous estimates of the onset and offset locations of the events. To deal with multiple event categories, prior to the category-specific regression phase, a superframe-wise recognition phase is performed to reject the background superframes and to classify the event superframes into different event categories. While jointly posing event detection and localization as a regression problem is novel, the superior performance on two databases ITC-Irst and UPC-TALP demonstrates the efficiency and potential of the proposed approach.
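A small sketch of the regression formulation described above: superframe features are mapped to their displacements from the event onset and offset with a random forest regressor, and per-superframe predictions are aggregated into a single onset/offset estimate. The feature extraction, the median aggregation, and the sklearn regressor settings are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def train_onset_offset_regressor(superframe_features, onset_disp, offset_disp):
    """Targets: for each superframe, its distance (seconds) back to the event onset
    and forward to the event offset."""
    targets = np.column_stack([onset_disp, offset_disp])
    model = RandomForestRegressor(n_estimators=200)
    model.fit(superframe_features, targets)              # multi-output regression
    return model

def localise(model, features, frame_times):
    # each superframe "votes" for an onset/offset; aggregate the votes by median
    disp = model.predict(features)                        # shape (n_frames, 2)
    onsets = frame_times - disp[:, 0]
    offsets = frame_times + disp[:, 1]
    return float(np.median(onsets)), float(np.median(offsets))
```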
Article
Full-text available
Phoneme classification is investigated for linear feature domains with the aim of improving robustness to additive noise. In linear feature domains noise adaptation is exact, potentially leading to more accurate classification than representations involving non-linear processing and dimensionality reduction. A generative framework is developed for isolated phoneme classification using linear features. Initial results are shown for representations consisting of concatenated frames from the centre of the phoneme, each containing f frames. As phonemes have variable duration, no single f is optimal for all phonemes, therefore an average is taken over models with a range of values of f. Results are further improved by including information from the entire phoneme and transitions. In the presence of additive noise, classification in this framework performs better than an analogous PLP classifier, adapted to noise using cepstral mean and variance normalisation, below 18 dB SNR. Finally we propose classification using a combination of acoustic waveform and PLP log-likelihoods. The combined classifier performs uniformly better than either of the individual classifiers across all noise levels.
Article
Full-text available
This paper presents the evolution, description and relevance of the IVTE (Intelligent Virtual Teaching Environment) project in terms of the Artificial Intelligence and Artificial Intelligence in Education fields. Furthermore, it describes the importance of the multi-agent modelling used in the IVTE software and also places emphasis on the Cognitive Agent Model represented by an Animated Pedagogical Agent. The purpose of the IVTE software is to educate children to preserve the environment. The IVTE software is implemented with Multi-agent System (MAS) and Intelligent Tutoring System (ITS) technology, which provides more adaptable information to the teaching process. The adaptable information is provided by the Tutor of the ITS or, in other words, by the Animated Pedagogical Agent. The Animated Pedagogical Agent monitors, guides and individualizes the learning process using a student model and teaching strategies.
Conference Paper
Full-text available
We introduce a new reinforcement learning benchmark based on the classic platform game Super Mario Bros. The benchmark has a high-dimensional input space, and achieving a good score requires sophisticated and varied strategies. However, it has tunable difficulty, and at the lowest difficulty setting a decent score can be achieved using rudimentary strategies and a small fraction of the input space. To investigate the properties of the benchmark, we evolve neural network-based controllers using different network architectures and input spaces. We show that it is relatively easy to learn basic strategies capable of clearing individual levels of low difficulty, but that these controllers have problems with generalization to unseen levels and with taking larger parts of the input space into account. A number of directions worth exploring for learning better-performing strategies are discussed.
Conference Paper
Full-text available
The random forest language model (RFLM) has shown encouraging results in several automatic speech recognition (ASR) tasks but has been hindered by practical limitations, notably the space-complexity of RFLM estimation from large amounts of data. This paper addresses large-scale training and testing of the RFLM via an efficient disk-swapping strategy that exploits the recursive structure of a binary decision tree and the local access property of the tree-growing algorithm, redeeming the full potential of the RFLM, and opening avenues of further research, including useful comparisons with n-gram models. Benefits of this strategy are demonstrated by perplexity reduction and lattice rescoring experiments using a state-of-the-art ASR system.
Article
Full-text available
To compare out-of-box performance of three commercially available continuous speech recognition software packages: IBM ViaVoice 98 with General Medicine Vocabulary; Dragon Systems NaturallySpeaking Medical Suite, version 3.0; and L&H Voice Xpress for Medicine, General Medicine Edition, version 1.2. Twelve physicians completed minimal training with each software package and then dictated a medical progress note and discharge summary drawn from actual records. Errors in recognition of medical vocabulary, medical abbreviations, and general English vocabulary were compared across packages using a rigorous, standardized approach to scoring. The IBM software was found to have the lowest mean error rate for vocabulary recognition (7.0 to 9.1 percent) followed by the L&H software (13.4 to 15.1 percent) and then Dragon software (14.1 to 15.2 percent). The IBM software was found to perform better than both the Dragon and the L&H software in the recognition of general English vocabulary and medical abbreviations. This study is one of a few attempts at a robust evaluation of the performance of continuous speech recognition software. Results of this study suggest that with minimal training, the IBM software outperforms the other products in the domain of general medicine; however, results may vary with domain. Additional training is likely to improve the out-of-box performance of all three products. Although the IBM software was found to have the lowest overall error rate, successive generations of speech recognition software are likely to surpass the accuracy rates found in this investigation.
Article
Full-text available
The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time and therefore not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct while the rate obtained by the best of the HMMs was only 93.7%
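A compact sketch of the time-delay idea in this abstract, using modern 1-D convolutions over time so that learned acoustic-phonetic features are independent of their position in time; the layer widths, context sizes, and use of PyTorch are illustrative choices rather than the original architecture or training recipe.

```python
import torch
import torch.nn as nn

class TDNN(nn.Module):
    def __init__(self, n_features=16, n_phonemes=3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(n_features, 8, kernel_size=3), nn.Sigmoid(),  # short time-delay window
            nn.Conv1d(8, 3, kernel_size=5), nn.Sigmoid(),           # wider temporal context
        )
        self.out = nn.Linear(3, n_phonemes)

    def forward(self, x):            # x: (batch, n_features, time)
        h = self.net(x)              # shift-invariant feature detectors over time
        h = h.mean(dim=2)            # integrate the evidence across the whole token
        return self.out(h)           # scores for e.g. the B/D/G classes
```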
Book
A fascinating and instructive guide to Markov chains for experienced users and newcomers alike. This unique guide to Markov chains approaches the subject along the four convergent lines of mathematics, implementation, simulation, and experimentation. It introduces readers to the art of stochastic modeling, shows how to design computer implementations, and provides extensive worked examples with case studies. Markov Chains: From Theory to Implementation and Experimentation begins with a general introduction to the history of probability theory in which the author uses quantifiable examples to illustrate how probability theory arrived at the concept of discrete-time and the Markov model from experiments involving independent variables. An introduction to simple stochastic matrices and transition probabilities is followed by a simulation of a two-state Markov chain. The notion of steady state is explored in connection with the long-run distribution behavior of the Markov chain. Predictions based on Markov chains with more than two states are examined, followed by a discussion of the notion of absorbing Markov chains. Also covered in detail are topics relating to the average time spent in a state, various chain configurations, and n-state Markov chain simulations used for verifying experiments involving various diagram configurations.
• Fascinating historical notes shed light on the key ideas that led to the development of the Markov model and its variants
• Various configurations of Markov Chains and their limitations are explored at length
• Numerous examples—from basic to complex—are presented in a comparative manner using a variety of color graphics
• All algorithms presented can be analyzed in either Visual Basic, Java Script, or PHP
• Designed to be useful to professional statisticians as well as readers without extensive knowledge of probability theory
Covering both the theory underlying the Markov model and an array of Markov chain implementations, within a common conceptual framework, Markov Chains: From Theory to Implementation and Experimentation is a stimulating introduction to and a valuable reference for those wishing to deepen their understanding of this extremely valuable statistical tool.
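The two-state simulation mentioned above can be reproduced in a few lines; the transition probabilities below are arbitrary example values.

```python
import numpy as np

P = np.array([[0.9, 0.1],      # transition matrix: row i gives next-state probabilities
              [0.4, 0.6]])

def simulate(steps=10_000, start=0, rng=np.random.default_rng(0)):
    state, counts = start, np.zeros(2)
    for _ in range(steps):
        state = rng.choice(2, p=P[state])
        counts[state] += 1
    return counts / steps      # empirical long-run (steady-state) distribution

# The analytic steady state solves pi = pi P; for this example P it is [0.8, 0.2],
# which the simulated frequencies approach as the number of steps grows.
```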
Article
We propose a Bayesian method to detect change points in a sequence of functional observations that are signal functions observed with noises. Since functions have unlimited features, it is natural to think that the sequence of signal functions driving the underlying functional observations change through an evolution process, that is, different features change over time but possibly at different times. A change-point is then viewed as the cumulative effect of changes in many features, so that the dissimilarities in the signal functions before and after the change-points are at the maximum level. In our setting, features are characterized by the wavelet coefficients in their expansion. We consider a Bayesian approach by putting priors independently on the wavelet coefficients of the underlying functions, allowing a change in their values over time. Then we compute the posterior distribution of change point for each sequence of wavelet coefficients, and obtain a measure of overall similarity between two signal functions in this sequence, which leads to the notion of an overall change-point by minimizing the similarity across the change point relative to the similarities within each segment. We study the performance of the proposed method through a simulation study and apply it to a dataset on climate change.
Article
The hybrid convolutional neural network and hidden Markov model (CNN-HMM) has recently achieved considerable performance in speech recognition because deep neural networks model complex correlations between features. Automatic speech recognition (ASR), as an input to many intelligent and expert systems, has impacts in various fields such as evolving search engines (inclusion of speech recognition in search engines), the healthcare industry (medical reporting by medical personnel, and disease diagnosis expert systems), service delivery, and communication with service providers (to establish callers' demands and then direct them to the appropriate operator for assistance). This paper introduces a method which further reduces the recognition error rate. We first propose an adaptive windows convolutional neural network (AWCNN) to analyze joint temporal-spectral feature variation. AWCNN makes the model more robust against both intra- and inter-speaker variations. We further propose a new residual learning scheme, which leads to better utilization of information in deep layers and provides better control over transferring input information. The proposed speech recognition system can be used as the vocal input for many artificial and expert systems. We evaluated the proposed method on the TIMIT, FARSDAT, Switchboard, and CallHome datasets and one image database, MNIST. The experimental results show that the proposed method reduces the absolute error rate by 7% compared with state-of-the-art methods in some speech recognition tasks.
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Conference Paper
Natural User Interfaces (NUIs) are supposed to be used by humans in a very natural, logical way. However, the industry's rush to deploy speech-based NUIs has had a large impact on the naturalness of such interfaces. This paper presents a usability test of the most prestigious and internationally used speech-based NUIs (i.e., Alexa, Siri, Cortana and Google's). A comparison of the services that each one provides was also performed, considering access to music services, agenda, news, weather, to-do lists and maps or directions, among others. The test was designed by two Human Computer Interaction experts and executed by eight persons. Results show that even though there are many services available, there is a lot to do to improve the usability of these systems, especially in moving away from the traditional use of computers (based on applications that require parameters to function) and getting closer to real NUIs.
Conference Paper
MuMMER (MultiModal Mall Entertainment Robot) is a four-year, EU-funded project with the overall goal of developing a humanoid robot (SoftBank Robotics’ Pepper robot being the primary robot platform) with the social intelligence to interact autonomously and naturally in the dynamic environments of a public shopping mall, providing an engaging and entertaining experience to the general public. Using co-design methods, we will work together with stakeholders including customers, retailers, and business managers to develop truly engaging robot behaviours. Crucially, our robot will exhibit behaviour that is socially appropriate and engaging by combining speech-based interaction with non-verbal communication and human-aware navigation. To support this behaviour, we will develop and integrate new methods from audiovisual scene processing, social-signal processing, high-level action selection, and human-aware robot navigation. Throughout the project, the robot will be regularly deployed in Ideapark, a large public shopping mall in Finland. This position paper describes the MuMMER project: its needs, the objectives, R&D challenges and our approach. It will serve as reference for the robotics community and stakeholders about this ambitious project, demonstrating how a co-design approach can address some of the barriers and help in building follow-up projects.
Conference Paper
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
Article
The intelligibility of foreign-accented English was investigated using minimal-pairs contrasts probing a number of different error types. Forty-four native English-speaking listeners were presented with English words, sentences, and a brief passage produced by one of eight native speakers of Mandarin Chinese or one native English speaker. The 190 words were presented to listeners in a minimal-pairs forced-choice task. For the sentences and passage, listeners were instructed to write down what they understood. A feature-based analysis of the minimal-pairs data was performed, with percent correct scores computed for each feature. The sentence and passage data, scored as percent of content words correctly transcribed by listeners, were transformed and used as the dependent variables in two multiple regression analyses, with seven feature scores from the minimal-pairs test (four consonant and three vowel features) used as the independent variables. The seven minimal-pairs variables accounted for approximately 72% of the variance in sentence intelligibility and 49% of that of the passages. Of these seven variables, vowel tenseness, diphthongization, and consonant voicing accounted for 70% of the sentence and 45% of the passage variance. These data suggest that specific segmental error types may have differential effects on intelligibility. [Work supported by NIH-NIDCD Grant No. 2R44DC02213.]
Article
Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis (1, 2), the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc. In this article, we review some major highlights in the research and development of automatic speech recognition during the last few decades so as to provide a technological perspective and an appreciation of the fundamental progress that has been made in this important area of information and communication technology.
Article
A subjective scale for the measurement of pitch was constructed from determinations of the half-value of pitches at various frequencies. This scale differs from both the musical scale and the frequency scale, neither of which is subjective. Five observers fractionated tones of 10 different frequencies and the values were used to construct a numerical scale which is proportional to the perceived magnitude of subjective pitch. The close agreement of this pitch scale with an integration of the DL's for pitch shows that, unlike the DL's for loudness, all DL's for pitch are of uniform subjective magnitude. The agreement further implies that pitch and differential sensitivity to pitch are both rectilinear functions of extent on the basilar membrane, and that in cutting a pitch in half, the observer adjusts the tone until it stimulates a position half-way from the original locus to the apical end of the membrane. Measurement of the subjective size of musical intervals (such as octaves) in terms of the pitch scale shows that the intervals become larger as the frequency of the midpoint of the interval increases (except for very high tones). (PsycINFO Database Record (c) 2012 APA, all rights reserved)
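A widely used analytic approximation to the mel scale later grew out of this perceptual work; the formula below is that common approximation, not an equation from the 1937 paper itself.

```python
import math

def hz_to_mel(f_hz):
    # common analytic approximation to the perceptual mel scale
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

# By construction of the scale, 1000 Hz maps to approximately 1000 mel.
```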
Book
With the proliferation of digital audio distribution over digital media, audio content analysis is fast becoming a requirement for designers of intelligent signal-adaptive audio processing systems. Written by a well-known expert in the field, this book provides quick access to different analysis algorithms and allows comparison between different approaches to the same task, making it useful for newcomers to audio signal processing and industry experts alike. A review of relevant fundamentals in audio signal processing, psychoacoustics, and music theory, as well as downloadable MATLAB files are also included. Please visit the companion website: www.AudioContentAnalysis.org
Conference Paper
Using dedicated hardware to do machine learning typically ends up in disaster because of cost, obsolescence, and poor software. The popularization of graphic processing units (GPUs), which are now available on every PC, provides an attractive alternative. We propose a generic 2-layer fully connected neural network GPU implementation which yields over 3× speedup for both training and testing with respect to a 3 GHz P4 CPU.
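A modern equivalent of the two-layer fully connected GPU network described above can be expressed with a framework such as PyTorch rather than hand-written GPU kernels; the layer sizes, optimiser, and loss below are arbitrary illustrative choices.

```python
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# two fully connected layers, trained and evaluated on the GPU when available
model = nn.Sequential(
    nn.Linear(784, 512), nn.ReLU(),   # hidden layer
    nn.Linear(512, 10),               # output layer
).to(device)

optimiser = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):                 # x: (batch, 784) floats, y: (batch,) class labels
    x, y = x.to(device), y.to(device)
    optimiser.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimiser.step()
    return float(loss)
```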
Article
Emotional expression and understanding are normal instincts of human beings, but automatic emotion recognition from speech, without reference to any language or linguistic information, remains an open problem. The limited size of existing emotional data samples, and their relatively high dimensionality, have outstripped many dimensionality reduction and feature selection algorithms. This paper focuses on data preprocessing techniques which aim to extract the most effective acoustic features to improve the performance of emotion recognition. A novel algorithm is presented which can be applied to a small-sized data set with a high number of features. The presented algorithm integrates the advantages of a decision tree method and the random forest ensemble. Experimental results on a series of Chinese emotional speech data sets indicate that the presented algorithm can achieve improved results on emotion recognition, outperforming the commonly used Principal Component Analysis (PCA)/Multi-Dimensional Scaling (MDS) methods and the more recently developed ISOMap dimensionality reduction method.
Conference Paper
Signal classification is an area of much interest in signal processing. Traditional classification methods designed for discrete variables are limited in their power. Here we propose a novel approach for some signal classification problems. It is a combination of three artificial intelligence approaches: a tree-based approach, ensemble voting and kernel learning. We call this approach the kernel-induced random forest (KIRF) for signal data. Its novelty lies in a new type of kernel, suitable for signal data, that is proposed and applied. We use two examples, a phoneme speech dataset and a waveform simulation dataset, to illustrate its usage and provide evidence of improvement over traditional methods such as neural networks and discriminant methods. Evidence from the data shows that our results are significantly better than those of traditional methods for signal classification.
The application of feed forward back propagation artificial neural networks with one hidden layer (ANN) to perform the equivalent of multiple linear regression (MLR) has been examined using artificial structured data sets and real literature data. The predictive ability of the networks has been estimated using a training/ test set protocol. The results have shown advantages of ANN over MLR analysis. The ANNs do not require high order terms or indicator variables to establish complex structure-activity relationships. Overfitting does not have any influence on network prediction ability when overtraining is avoided by cross-validation. Application of ANN ensembles has allowed the avoidance of chance correlations and satisfactory predictions of new data have been obtained for a wide range of numbers of neurons in the hidden layer.
Article
The standard Mel frequency cepstrum coefficient (MFCC) computation technique utilizes the discrete cosine transform (DCT) to decorrelate the log energies of the filter bank outputs. The use of the DCT is reasonable here as the covariance matrix of the Mel filter bank log energies (MFLE) can be compared with that of a highly correlated Markov-I process. This full-band MFCC computation technique, in which each filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, the full-band MFCC feature gets severely degraded when the speech signal is corrupted with narrow-band channel noise, even though a few filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block-wise transformation of the MFLE which effectively decorrelate the filter bank log energies and also capture speech information in an efficient manner. A thorough study has been carried out on the block-based transformation approach by investigating a new partitioning technique that highlights the associated advantages. This article also reports a novel feature extraction scheme which captures information complementary to the wide-band information that otherwise remains undetected by the standard MFCC and proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system. We have obtained significant performance improvements over baseline features for both matched and mismatched conditions, and for both standard and narrow-band noises. The proposed method achieves significant performance improvement in the presence of narrow-band noise when combined with a missing-feature-theory-based score computation scheme.
Article
Thesis (Ph. D.)--University of Edinburgh, 1989. Photocopy.
Article
Nearly perfect speech recognition was observed under conditions of greatly reduced spectral information. Temporal envelopes of speech were extracted from broad frequency bands and were used to modulate noises of the same bandwidths. This manipulation preserved temporal envelope cues in each band but restricted the listener to severely degraded information on the distribution of spectral energy. The identification of consonants, vowels, and words in simple sentences improved markedly as the number of bands increased; high speech recognition performance was obtained with only three bands of modulated noise. Thus, the presentation of a dynamic temporal pattern in only a few broad spectral regions is sufficient for the recognition of speech.
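A minimal sketch of the noise-vocoding manipulation described above: the signal is split into a few broad bands, each band's temporal envelope is extracted, and the envelope is used to modulate band-limited noise. The band edges, filter order, and envelope method (Hilbert magnitude) are assumptions, not the exact processing of the original study.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def noise_vocode(y, sr, band_edges=(100, 800, 1500, 4000)):
    """Replace spectral detail with envelope-modulated noise in a few broad bands."""
    rng = np.random.default_rng(0)
    out = np.zeros(len(y))
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        band = sosfiltfilt(sos, y)
        envelope = np.abs(hilbert(band))                 # temporal envelope of the band
        noise = sosfiltfilt(sos, rng.standard_normal(len(y)))
        out += envelope * noise                          # envelope-modulated noise band
    return out / (np.max(np.abs(out)) + 1e-9)            # normalise to avoid clipping
```

With three bands, the output discards nearly all fine spectral structure yet, as the abstract reports, remains largely intelligible to listeners.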
Article
An important question in neuroevolution is how to gain an advantage from evolving neural network topologies along with weights. We present a method, NeuroEvolution of Augmenting Topologies (NEAT), which outperforms the best fixed-topology method on a challenging benchmark reinforcement learning task. We claim that the increased efficiency is due to (1) employing a principled method of crossover of different topologies, (2) protecting structural innovation using speciation, and (3) incrementally growing from minimal structure. We test this claim through a series of ablation studies that demonstrate that each component is necessary to the system as a whole and to each other. What results is significantly faster learning. NEAT is also an important contribution to GAs because it shows how it is possible for evolution to both optimize and complexify solutions simultaneously, offering the possibility of evolving increasingly complex solutions over generations, and strengthening the analogy with biological evolution.
Article
With the aging population, the number of individuals requiring long-term care is expected to dramatically increase in the next twenty years, placing an increasing burden on healthcare. Many patients are admitted to assisted living facilities at a fairly early stage due to their inability to perform normal daily living activities. The purpose of this study is to determine if the use of technology for both monitoring and intervention can permit elderly patients to remain in their homes for longer periods of time with the benefit of the comfort of familiar surroundings while at the same time reducing the burden on caregivers. In addition, remote access to healthcare can improve monitoring of the patient's physical and mental condition and involve the patient in his or her own care. The home monitoring and intervention system is based on intelligent agent methodology developed by the authors.
Article
We suggest a classification and feature extraction method on functional data where the predictor variables are curves. The method, called functional segment discriminant analysis (FSDA), combines the classical linear discriminant analysis and support vector machine. FSDA is particularly useful for irregular functional data, characterized by spatial heterogeneity and local patterns like spikes. FSDA not only reduces the computation and storage burden by using a fraction of the spectrum, but also identifies important predictors and extracts features. FSDA is highly flexible, easy to incorporate information from other data sources and/or prior knowledge from the investigators. We apply FSDA to two public domain data sets and discuss the understanding developed from the study.
Conference Paper
The objective of this work is to test and verify a framework that aims to implement a set of resources for developing intelligent learning objects. For this purpose, a learning environment model was created. The model has an animated pedagogical agent (APA) that interacts with two specific Agents of the system. The framework architecture was developed by Gomes and proposes a learning environment model which has an animated character acting as a learning management system (LMS). The other agents of the environment have specific features of the intelligent learning objects (ILOs), as considered in this architecture. The results of this research prove that the learning environment, implemented according to this model, can be more flexible and adaptable. These results also show innumerable advantages in the proposed convergence between learning objects technology and multiagent systems paradigm.
Conference Paper
Non-professional athletes usually rely on the information about training methods and nutrition recommendations provided online. However, the quality of online information sources is extremely variable. iAPERAS is an expert system using Bayes networks and designed for athletes. It represents a better alternative to online resources, because it is based on scientific research findings and evaluated by domain experts.
Article
In this paper, we present a novel technique of constructing phonetic decision trees (PDTs) for acoustic modeling in conversational speech recognition. We use random forests (RFs) to train a set of PDTs for each phone state unit and obtain multiple acoustic models accordingly. We investigate several methods of combining acoustic scores from the multiple models, including maximum-likelihood estimation of the weights of different acoustic models from training data, as well as using confidence score of -value or relative entropy to obtain the weights dynamically from online data. Since computing acoustic scores from the multiple models slows down decoding search, we propose clustering methods to compact the RF-generated acoustic models. The conventional concept of PDT-based state tying is extended to RF-based state tying. On each RF tied state, we cluster the Gaussian density functions (GDFs) from multiple acoustic models into classes and compute a prototype for each class to represent the original GDFs. In this way, the number of GDFs in each RF tied state is decreased greatly, which significantly reduces the time for computing acoustic scores. Experimental results on a telemedicine automatic captioning task demonstrate that the proposed RF-PDT technique leads to significant improvements in word recognition accuracy.