Article

Abstract

Recent advances in the availability of computational resources allow for more sophisticated approaches to speech recognition than ever before. This study considers Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, rather than the classical approach of classifying whole words and phrases, with a specific focus on both single- and multi-objective evolutionary optimisation of bioinspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico, and the recordings are transformed into a static dataset of statistics by way of their Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window length of 200 ms, as well as a reshaped MFCC time-series format for forecast-based models. A deep neural network with an evolutionarily optimised topology achieves 90.77% phoneme classification accuracy, compared to 86.23% for the best HMM (150 hidden units), when only accuracy is considered in a single-objective optimisation approach. However, the obtained solutions are far more complex than the HMM, taking around 248 seconds to train on powerful hardware versus 160 seconds for the HMM; a multi-objective approach is therefore explored. In the multi-objective scalarisation approaches presented, in which real-time resource usage also contributes to solution fitness, far more efficient solutions are produced that train much faster than the forecast approach (69 seconds) while retaining classification ability (86.73%). Weightings between maximising accuracy and reducing resource usage, ranging from 0.1 to 0.9, are suggested depending on the resources available, since many future IoT devices and autonomous robots may have limited access to cloud resources at a premium compared to the GPU used in this experiment.
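To make the scalarisation concrete, here is a minimal sketch of how a weighted-sum fitness of this kind can be computed; the normalisation constant and the exact weighting scheme are assumptions for illustration, not the authors' published formulation.
```python
# Minimal sketch of scalarised fitness as described in the abstract:
# a weighting w in [0.1, 0.9] trades classification accuracy against
# training resource usage. The normalisation constant is an assumption.

def scalarised_fitness(accuracy: float, train_seconds: float,
                       w: float = 0.5, max_seconds: float = 300.0) -> float:
    """Combine two objectives into one score (higher is better).

    accuracy      -- classification accuracy in [0, 1]
    train_seconds -- wall-clock training time of the candidate model
    w             -- weight towards accuracy; (1 - w) penalises resources
    max_seconds   -- assumed upper bound used to normalise time to [0, 1]
    """
    resource_cost = min(train_seconds / max_seconds, 1.0)
    return w * accuracy - (1.0 - w) * resource_cost

# Example: a fast, slightly less accurate model can win under a low w.
print(scalarised_fitness(0.8673, 69, w=0.3))   # resource-conscious weighting
print(scalarised_fitness(0.9077, 248, w=0.9))  # accuracy-focused weighting
```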


... The training of an acoustic system is intended to solve two or more tasks simultaneously, which differ in nature. The purpose of the auxiliary task is to generate the features of clean speech through a regression loss [7]. ...
... Given the non-reverberant and clean nature of the acoustic environment, the amount of annotated data on offer is extremely significant for the Speech Recognition System [3]. However, these perfect acoustic conditions are not reasonable to assume in many diverse real-life conditions, where the speech signal may suffer degradations arising from the acoustic properties of the room or from noise in the surroundings, which leads to reverberation of the speech [7]. These occurrences can leave the speech recognition model attaining significantly degraded results and make the task much more challenging. ...
... The relevance of evolutionary algorithms such as the genetic algorithm, which belong to a family of search algorithms inspired by the process of evolution in nature, was demonstrated in a recent study showing that optimising the topology of an Artificial Neural Network may lead to a high classification rate for spoken utterances of both native and non-native English speakers [24]. ...
... It has potential applications in automatic speech recognition. For example, in [24], the authors used an evolutionary-algorithm-optimised deep neural network for the recognition of diphthong vowel sounds in the English phonetic alphabet. In [36], the author combined a genetic algorithm with Manhattan distance to classify plain and emphatic vowels in continuous Arabic speech. ...
Article
Full-text available
Distinctive phonetic features have an important role in Arabic speech phoneme recognition. In a given language, distinctive phonetic features are extrapolated from acoustic features using different methods. However, exploiting a lengthy acoustic feature vector for the sake of phoneme recognition has a huge cost in terms of computational complexity, which in turn affects real-time applications. The aim of this work is to consider methods to reduce the size of the feature vector employed for distinctive phonetic feature and phoneme recognition. The objective is to select the relevant input features that contribute to the speech recognition process. This, in turn, leads to a reduced computational complexity of the recognition algorithm and an improved recognition accuracy. In the proposed approach, a genetic algorithm is used to perform optimal feature selection. A baseline model based on feedforward neural networks is first built and used to benchmark the results of the proposed feature selection method against a method that employs all elements of the feature vector. Experimental results, utilizing the King Abdulaziz City for Science and Technology Arabic Phonetic Database, show that the average genetic-algorithm-based overall phoneme recognition accuracy is maintained slightly higher than that of the recognition method employing the full-fledged feature vector. The genetic-algorithm-based distinctive phonetic feature recognition method achieves a 50% reduction in the dimension of the input vector while obtaining a recognition accuracy of 90%. Moreover, the results of the proposed method are validated using the Wilcoxon signed rank test.
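As a rough illustration of the technique this abstract describes (not the authors' implementation), the sketch below evolves binary feature masks with one-point crossover and bit-flip mutation, scoring each mask by the cross-validated accuracy of a small feedforward network; the synthetic dataset, population size, and rates are illustrative assumptions.
```python
# Hedged sketch of genetic-algorithm feature selection: binary masks pick
# a subset of features, fitness is a small neural network's CV accuracy.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=40, random_state=0)
rng = np.random.default_rng(0)

def fitness(mask):
    if mask.sum() == 0:
        return 0.0
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
    return cross_val_score(clf, X[:, mask.astype(bool)], y, cv=3).mean()

pop = rng.integers(0, 2, size=(12, X.shape[1]))
for generation in range(5):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-6:]]             # keep the fittest half
    cut = rng.integers(1, X.shape[1], size=6)
    children = np.array([np.concatenate([parents[i % 6][:c],
                                         parents[(i + 1) % 6][c:]])
                         for i, c in enumerate(cut)])  # one-point crossover
    flips = rng.random(children.shape) < 0.05          # bit-flip mutation
    pop = np.vstack([parents, np.where(flips, 1 - children, children)])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected", int(best.sum()), "of", X.shape[1], "features")
```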
... We provide a brief introduction to neural networks and the speech recognition process in the previous sections. Early research into speech recognition was initiated in 1952 by three Bell Labs researchers, who developed the Audrey system [18,119]. It was designed to recognize single spoken digits. ...
Article
Full-text available
The continuous development of Automatic Speech Recognition has demonstrated its enormous potential in Human Interaction Communication systems. It is quite a challenging task to achieve high accuracy due to several parameters, such as different dialects, spontaneous speech, speaker enrolment, computation power, datasets, and noisy environments, that decrease the performance of a speech recognition system. This has motivated various researchers to make innovative contributions to the development of robust speech recognition systems. The study presents a systematic analysis of current state-of-the-art research work done in this field during 2015-2021. The prime focus of the study is to highlight the neural-network-based speech recognition techniques, datasets, toolkits, and evaluation metrics utilized in the past seven years. It also synthesizes the evidence from past studies to provide empirical solutions for accuracy improvement. This study highlights the current status of speech recognition systems using neural networks and provides brief background knowledge for new researchers.
... Based on the above understanding, this paper makes the evaluation of English course learning quality more scientific and quantitative through intelligent technology. The prior guide samples are obtained by the entropy method, then the neural network model is optimised to learn the prior sample knowledge using an adaptive mutation genetic algorithm, and an evaluation model is established, which reduces the subjectivity of the neural network's learning samples [4][5][6]. ...
Article
Full-text available
The rationality and timeliness of comprehensive results for English course learning quality are increasingly important in modern education. There are problems in the scientific evaluation of English course learning quality and in teachers' own English course learning, which require proper adjustment and improvement. Based on an improved genetic-algorithm network theory, this paper builds an online English course learning quality evaluation model and uses MATLAB 7.0 to write the graphical user interface of the neural network prediction model for English course learning quality. The model uses an adaptive mutation genetic algorithm to optimize the initial weights and thresholds of the neural network, addressing the prediction accuracy and convergence speed of the evaluation results. Simulation experiments show that the neural network has a strong dependence on its initial weights and thresholds. Using the improved genetic algorithm to optimize them reduced the time needed for the network to find weights and thresholds that meet the training termination conditions; the prediction precision increased to 0.897, the prediction accuracy was 78.85%, and the level prediction accuracy was 84.62%, which effectively promotes the development of online English course learning in colleges and the continuous improvement of teachers' English course learning level.
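A tiny sketch of the adaptive-mutation idea mentioned above, under the assumption that fitter-than-average individuals receive a lower mutation rate; the rates and the Gaussian perturbation are illustrative, not the paper's exact scheme.
```python
import numpy as np

def adaptive_mutation_rate(fitness, mean_fit, max_fit,
                           p_low=0.01, p_high=0.1):
    """Fitter-than-average individuals mutate less (illustrative rates)."""
    if fitness >= mean_fit and max_fit > mean_fit:
        # Scale linearly from p_high at the mean down to p_low at the best.
        return p_high - (p_high - p_low) * (fitness - mean_fit) / (max_fit - mean_fit)
    return p_high

weights = np.random.randn(20)                        # candidate initial NN weights
rate = adaptive_mutation_rate(0.85, mean_fit=0.7, max_fit=0.9)
mask = np.random.rand(weights.size) < rate           # which genes mutate
weights[mask] += np.random.randn(mask.sum()) * 0.1   # Gaussian perturbation
```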
... Over the past two decades, multi-objective evolutionary algorithms (MOEAs) have become very popular methods for identifying the Pareto optimal set of designs for a problem; recent examples include (Alcaraz et al. (2020); Avilés et al. (2020); Bird et al. (2020); Chen et al. (2021); Drake et al. (2020); Kuk et al. (2021)). There have also been attempts to use MOEAs to identify Pareto optimal sets of engine designs (Corre et al. (2019); D'Errico et al. (2011); Lotfan et al. (2016)). ...
Article
Full-text available
Engineering design optimization problems increasingly require computationally expensive high-fidelity simulation models to evaluate candidate designs. The evaluation budget may be small, limiting the effectiveness of conventional multi-objective evolutionary algorithms. Bayesian optimization algorithms (BOAs) are an alternative approach for expensive problems but are underdeveloped in terms of support for constraints and non-continuous design variables—both of which are prevalent features of real-world design problems. This study investigates two constraint handling strategies for BOAs and introduces the first BOA for mixed-integer problems, intended for use on a real-world engine design problem. The new BOAs are empirically compared to their closest competitor for this problem—the multi-objective evolutionary algorithm NSGA-II, itself equipped with constraint handling and mixed-integer components. Performance is also analysed on two benchmark problems which have similar features to the engine design problem, but are computationally cheaper to evaluate. The BOAs offer statistically significant convergence improvements of between 5.9% and 31.9% over NSGA-II across the problems on a budget of 500 design evaluations. Of the two constraint handling methods, constrained expected improvement offers better convergence than the penalty function approach. For the engine problem, the BOAs identify improved feasible designs offering 36.4% reductions in nitrogen oxide emissions and 2.0% reductions in fuel consumption when compared to a notional baseline design. The use of constrained mixed-integer BOAs is recommended for expensive engineering design optimization problems.
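Since both passages revolve around identifying Pareto optimal sets, a small self-contained sketch of non-dominated filtering for a two-objective minimisation problem may help; the (cost, emissions) values are synthetic stand-ins.
```python
import numpy as np

def non_dominated(points):
    """Return the Pareto front of 2-D objective vectors (both minimised):
    a point survives if no other point is at least as good in both
    objectives and strictly better in at least one."""
    keep = []
    for i, p in enumerate(points):
        dominated = any(np.all(q <= p) and np.any(q < p)
                        for j, q in enumerate(points) if j != i)
        if not dominated:
            keep.append(p)
    return np.array(keep)

# Synthetic (cost, emissions) pairs for candidate designs.
designs = np.array([[1.0, 5.0], [2.0, 3.0], [3.0, 4.0], [4.0, 1.0], [2.5, 2.5]])
print(non_dominated(designs))  # [[1. 5.] [2. 3.] [4. 1.] [2.5 2.5]]
```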
... For classification purposes, the most widely used classifiers for audio signals, namely KNN [23], SVM [17], and a DNN [7,8,17,27], have been used in the proposed method via the standard Python Anaconda library. The datasets are split into training and test sets of 80% and 20% respectively. ...
Article
Full-text available
In linguistics, phonemes are the atomic sounds, also called word segments, that play an important role in recognizing words properly. A novel approach to recognizing seven Bengali vowel and ten diphthong (a syllable formed by the pronunciation of two consecutive vowels) phonemes is proposed in this paper. In the proposed method, before extracting features, a novel pre-processing technique using an amplitude interpolation method is developed to align the starting points of all phonemes of the same class, which in turn boosts the recognition rate. Audio clips of seven Bengali vowels and ten diphthongs uttered by twenty persons (ten times each) of different age groups and sexes were recorded to create a dataset of 3400 audio samples for the experiment. For each class of phonemes and diphthongs, one sample (selected by a linguist) is considered a benchmark. Each recorded audio clip is then interpolated to match the benchmark clip of the corresponding phoneme by finding the valleys in the amplitude using the Lagrange interpolation technique. After that, 19 MFCC (Mel-Frequency Cepstral Coefficient) speech features are extracted from each phoneme of the interpolated audio clips and fed to Support Vector Machine (SVM), k-Nearest Neighbour (KNN), and Deep Neural Network (DNN) classifiers; the average classification accuracies obtained for vowels and diphthongs are 94.93% and 94.56% respectively. To check the effectiveness of the proposed pre-processing technique, the same MFCC features are extracted from the raw recorded phonemes and fed to the same classifiers, and the average accuracies obtained for vowels and diphthongs are 89.21% and 88.56% respectively, which shows the effectiveness of the proposed method. It is also worth noting that the best accuracy is obtained using the DNN classifier, with 98.16% for vowels and 97% for diphthongs.
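A hedged sketch of the feature pipeline this abstract outlines: 19 time-averaged MFCCs per clip fed to SVM and KNN classifiers (the interpolation pre-processing step is omitted); the file names and labels are hypothetical placeholders, not the authors' dataset.
```python
# Hedged sketch: 19 MFCCs per clip, averaged over time, then SVM/KNN.
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

def clip_features(path, n_mfcc=19):
    """One fixed-length vector per clip: time-averaged MFCCs."""
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc).mean(axis=1)

paths = ["a_01.wav", "a_02.wav", "oi_01.wav", "oi_02.wav"]  # placeholders
labels = ["a", "a", "oi", "oi"]                             # placeholders

X = np.array([clip_features(p) for p in paths])
X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.25,
                                          random_state=0)
for clf in (SVC(kernel="rbf"), KNeighborsClassifier(n_neighbors=1)):
    print(type(clf).__name__, clf.fit(X_tr, y_tr).score(X_te, y_te))
```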
... Apart from the Bengali language, MFCC has been used in research on other languages as well. Most of these studies utilized 13 MFCC features in a deep neural network (DNN) classifier, for tasks such as recognition of speakers [20], English phonemes [21], emotion from English audio [6], five Malayalam vowel phonemes [22], twenty Dari speech tokens [23], three Arabic words [24], and ten English command words [25]. Some studies also prefer convolutional neural network (CNN) models, such as the isolated English word recognition by Soliman et al. [26]. ...
Article
Full-text available
Speech-related research has a wide range of applications. Most speech-related research employs Mel-frequency cepstral coefficients (MFCCs) as acoustic features. However, finding the optimum number of MFCCs is an active research question. MFCC-based speech classification was performed for both vowels and words in the Bengali language. As the classification model, a deep neural network (DNN) with the Adam optimizer was used. Performance was measured with five different metrics, namely the confusion matrix, classification accuracy, area under the receiver operating characteristic curve (AUC-ROC), F1 score, and Cohen's Kappa, with four-fold cross-validation at different numbers of MFCCs. All performance metrics gave the best score for 24/25 MFCCs; hence it is suggested that the optimum number of MFCCs should be 25, although many existing studies use only 13 MFCCs. Furthermore, it is verified that increasing the number of MFCCs yields better classification metrics with a lower computational burden than incrementing the number of hidden layers. Lastly, the optimum number of MFCCs obtained from this study was used in a more improved DNN model, from which 99% and 90% accuracies were achieved for vowel and word classification, respectively, and the vowel classification score outperformed state-of-the-art results.
... The importance of WAR suggests that, on a broad scale, the mechanism is not speaker-dependent. Bird et al. (2020) proposed Artificial Neural Network and Hidden Markov Model methods of classification for Human Speech Recognition in the English Phonetic Alphabet. Subjects from the United Kingdom and Mexico recorded a collection of audio clips, and the recordings were converted into a static statistical dataset. ...
Article
Full-text available
In the last few decades, there has been a considerable amount of research on the use of Machine Learning (ML) for speech recognition based on Convolutional Neural Networks (CNN). These studies generally focus on using CNNs for applications related to speech recognition. Additionally, various works based on deep learning since its emergence in speech recognition applications are discussed. Compared to other approaches, those based on deep learning show rather interesting outcomes in several applications, including speech recognition, and therefore attract a great deal of research and study. In this paper, a review is presented of the developments that have occurred in this field, along with a discussion of current research on the topic.
... For the Deep Neural Network that classifies the speaker, a topology of three hidden layers consisting of 30, 7, and 29 neurons respectively, with ReLU activation functions and an Adam optimiser [63], is initialised. These hyperparameters were chosen due to a previous study that performed a genetic search of neural network topologies for the classification of phonetic sounds in the form of MFCCs [64]. The networks are given an unlimited number of epochs to train, ceasing only through an early stopping callback set to 25 epochs with no improvement in ability. ...
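The passage above gives concrete hyperparameters, so a hedged Keras sketch of that topology follows: hidden layers of 30, 7, and 29 ReLU neurons, an Adam optimiser, and early stopping after 25 epochs without improvement; the input width and number of speaker classes are placeholder assumptions.
```python
# Hedged Keras sketch of the described topology. Input width (number of
# MFCC statistics) and class count are placeholder assumptions.
from tensorflow import keras

n_features, n_classes = 26, 8        # assumed dataset dimensions

model = keras.Sequential([
    keras.Input(shape=(n_features,)),
    keras.layers.Dense(30, activation="relu"),
    keras.layers.Dense(7, activation="relu"),
    keras.layers.Dense(29, activation="relu"),
    keras.layers.Dense(n_classes, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Stop after 25 epochs with no validation improvement, as described.
early_stop = keras.callbacks.EarlyStopping(monitor="val_loss", patience=25,
                                           restore_best_weights=True)
# model.fit(X_train, y_train, validation_split=0.1,
#           epochs=10_000, callbacks=[early_stop])
```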
Preprint
Full-text available
In speech recognition problems, data scarcity often poses an issue due to the unwillingness of humans to provide large amounts of data for learning and classification. In this work, we take a set of 5 spoken Harvard sentences from 7 subjects and consider their MFCC attributes. Using character-level LSTMs (supervised learning) and OpenAI's attention-based GPT-2 models, synthetic MFCCs are generated by learning from the data provided on a per-subject basis. A neural network is trained to classify the data against a large dataset of Flickr8k speakers and is then compared to a transfer learning network performing the same task but with an initial weight distribution dictated by learning from the synthetic data generated by the two models. For all 7 subjects, the best results came from networks that had been exposed to synthetic data; the model pre-trained with LSTM-produced data achieved the best result 3 times and the GPT-2 equivalent 5 times (one subject's best result was a draw between both models). Through these results, we argue that speaker classification can be improved by utilising a small amount of user data together with exposure to synthetically-generated MFCCs, which then allows the networks to achieve near maximum classification scores.
Chapter
Full-text available
This book is a quick review of machine learning methods for engineering applications. It provides an introduction to the principles of machine learning and common algorithms in the first section. Proceeding chapters summarize and analyze the existing scholarly work and discuss some general issues in this field. Next, it offers some guidelines on applying machine learning methods to software engineering tasks. Finally, it gives an outlook into some of the future developments and possibly new research areas of machine learning and artificial intelligence in general. Techniques highlighted in the book include: Bayesian models, support vector machines, decision tree induction, regression analysis, and recurrent and convolutional neural networks. It is also intended to serve as a reference book. Key Features:
- Describes real-world problems that can be solved using machine learning
- Explains methods for directly applying machine learning techniques to concrete real-world problems
- Explains concepts used in Industry 4.0 platforms, including the use and integration of AI, ML, Big Data, NLP, and the Internet of Things (IoT)
- Does not require prior knowledge of machine learning
This book is meant to be an introduction to artificial intelligence (AI), machine learning, and its applications in Industry 4.0. It explains the basic mathematical principles but is intended to be understandable for readers who do not have a background in advanced mathematics.
Thesis
Full-text available
In modern Human-Robot Interaction, much thought has been given to accessibility regarding robotic locomotion, specifically the enhancement of awareness and lowering of cognitive load. On the other hand, with social Human-Robot Interaction considered, published research is far sparser given that the problem is less explored than pathfinding and locomotion. This thesis studies how one can endow a robot with affective perception for social awareness in verbal and non-verbal communication. This is possible by the creation of a Human-Robot Interaction framework which abstracts machine learning and artificial intelligence technologies which allow for further accessibility to non-technical users compared to the current State-of-the-Art in the field. These studies thus initially focus on individual robotic abilities in the verbal, non-verbal and multimodality domains. Multimodality studies show that late data fusion of image and sound can improve environment recognition, and similarly that late fusion of Leap Motion Controller and image data can improve sign language recognition ability. To alleviate several of the open issues currently faced by researchers in the field, guidelines are reviewed from the relevant literature and met by the design and structure of the framework that this thesis ultimately presents. The framework recognises a user's request for a task through a chatbot-like architecture. Through research in this thesis that recognises human data augmentation (paraphrasing) and subsequent classification via language transformers, the robot's more advanced Natural Language Processing abilities allow for a wider range of recognised inputs. That is, as examples show, phrases that could be expected to be uttered during a natural human-human interaction are easily recognised by the robot. This allows for accessibility to robotics without the need to physically interact with a computer or write any code, with only the ability of natural interaction (an ability which most humans have) required for access to all the modular machine learning and artificial intelligence technologies embedded within the architecture. Following the research on individual abilities, this thesis then unifies all of the technologies into a deliberative interaction framework, wherein abilities are accessed from long-term memory modules and short-term memory information such as the user's tasks, sensor data, retrieved models, and finally output information. In addition, algorithms for model improvement are also explored, such as through transfer learning and synthetic data augmentation and so the framework performs autonomous learning to these extents to constantly improve its learning abilities. It is found that transfer learning between electroencephalographic and electromyographic biological signals improves the classification of one another given their slight physical similarities. Transfer learning also aids in environment recognition, when transferring knowledge from virtual environments to the real world. In another example of non-verbal communication, it is found that learning from a scarce dataset of American Sign Language for recognition can be improved by multi-modality transfer learning from hand features and images taken from a larger British Sign Language dataset. 
Data augmentation is shown to aid in electroencephalographic signal classification by learning from synthetic signals generated by a GPT-2 transformer model, and, in addition, augmenting training with synthetic data also shows improvements when performing speaker recognition from human speech. Given the importance of platform independence due to the growing range of available consumer robots, four use cases are detailed, and examples of behaviour are given by the Pepper, Nao, and Romeo robots as well as a computer terminal. The use cases involve a user requesting their electroencephalographic brainwave data to be classified by simply asking the robot whether or not they are concentrating. In a subsequent use case, the user asks if a given text is positive or negative, to which the robot correctly recognises the task of natural language processing at hand and then classifies the text, this is output and the physical robots react accordingly by showing emotion. The third use case has a request for sign language recognition, to which the robot recognises and thus switches from listening to watching the user communicate with them. The final use case focuses on a request for environment recognition, which has the robot perform multimodality recognition of its surroundings and note them accordingly. The results presented by this thesis show that several of the open issues in the field are alleviated through the technologies within, structuring of, and examples of interaction with the framework. The results also show the achievement of the three main goals set out by the research questions; the endowment of a robot with affective perception and social awareness for verbal and non-verbal communication, whether we can create a Human-Robot Interaction framework to abstract machine learning and artificial intelligence technologies which allow for the accessibility of non-technical users, and, as previously noted, which current issues in the field can be alleviated by the framework presented and to what extent.
Article
Fusion strategies that utilize time-frequency features have achieved superior performance in acoustic scene classification tasks. However, the existing fusion schemes are mainly frameworks that involve different modules for feature learning, fusion, and modeling. These frameworks are prone to introduce artificial interference and thus make it challenging to obtain the system's best performance. In addition, the lack of adequate information interaction between different features in the existing fusion schemes prevents the learned features from achieving the optimal discriminative ability. To tackle these problems, we design a deep mutual attention network based on the principle of receptive field regularization and the mutual attention mechanism. The proposed network can realize the joint learning and complementary enhancement of multiple time-frequency features end-to-end, which improves features' learning efficiency and discriminative ability. Experimental results on six publicly available datasets show that the proposed network outperforms almost all state-of-the-art systems regarding classification accuracy.
Article
Human action recognition using wearable sensors plays a remarkable role in the province of Human-centric Computing. This paper presents a novel sparse-representation-based hierarchical evolutionary framework for classifying human activities. The main objective of this research is to propose a model for human action recognition that produces superior recognition results. The framework employs data from two inertial sensors, namely an accelerometer and a gyroscope. Time- and frequency-domain features were utilized in this work. The framework operates at multiple levels in the hierarchy, wherein the output of one level is given as input to the next level. A novel algorithm for deducing the hierarchical structure from the input data, called the Hierarchical Architecture Design (HAD) algorithm, is presented. We also present a novel Sparse Dictionary Optimization (SDO) algorithm for generating dictionaries that aid in efficacious sparse-representation-based classification. Finally, action recognition is done using the proposed Sparse Representation based Hierarchical (SRH) classifier. The performance analysis of the proposed system was conducted using the University of Southern California Human Activity Dataset (USC-HAD) and the Human Activities and Postural Transitions (HAPT) dataset. The proposed classification framework attained very high F-score values of 98.01% and 93.51% for the USC-HAD and HAPT datasets respectively.
Article
In this paper, the k-median of a graph is used to decompose the domain (mesh) of continuous two- and three-dimensional finite element models. The k-median problem is stated as an optimization problem and solved by utilizing eight robust meta-heuristic algorithms. The Artificial Bee Colony algorithm (ABC), Cyclical Parthenogenesis algorithm (CPA), Cuckoo Search algorithm (CS), Teaching-Learning Based Optimization algorithm (TLBO), Tug of War Optimization algorithm (TWO), Water Evaporation Optimization algorithm (WEO), Ray Optimization algorithm (RO), and Vibrating Particles System algorithm (VPS) constitute the set of algorithms employed in the present study. In order to tune the parameters of the meta-heuristics, the Taguchi method is used. The efficiency and robustness of the algorithms are investigated through two- and three-dimensional finite element models.
Conference Paper
Full-text available
Phoneme awareness provides the path to high resolution speech recognition to overcome the difficulties of classical word recognition. Here we present the results of a preliminary study on Artificial Neural Network (ANN) and Hidden Markov Model (HMM) methods of classification for Human Speech Recognition through Diphthong Vowel sounds in the English Phonetic Alphabet, with a specific focus on evolutionary optimisation of bio-inspired classification methods. A set of audio clips is recorded by subjects from the United Kingdom and Mexico. For each recording, the data were pre-processed using Mel-Frequency Cepstral Coefficients (MFCC) at a sliding window of 200 ms per data object, as well as a further MFCC time-series format for forecast-based models, to produce the dataset. We found that an evolutionarily optimised deep neural network achieves 90.77% phoneme classification accuracy, as opposed to the best HMM, with 150 hidden units, achieving 86.23% accuracy. Many of the evolutionary solutions take substantially longer to train than the HMM; however, one solution scoring 87.5% (+1.27%) requires fewer resources than the HMM.
Chapter
Full-text available
This paper proposes an approach to selecting the number of layers and neurons contained within Multilayer Perceptron hidden layers through a single-objective evolutionary approach with the goal of model accuracy. At each generation, a population of Neural Network architectures is created and ranked by accuracy. The generated solutions are combined in a breeding process to create a larger population, and at each generation the weakest solutions are removed to retain the population size, inspired by a Darwinian 'survival of the fittest'. Multiple datasets are tested, and results show that architectures can be successfully improved and derived through a hyper-heuristic evolutionary approach in less than 10% of the exhaustive search time. The evolutionary approach was further optimised through population density increases as well as gradual increases in maximum solution complexity throughout the simulation.
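A compact sketch of the 'survival of the fittest' loop this abstract describes: candidate hidden-layer tuples are scored by cross-validated accuracy, the weakest half is culled, and survivors breed and mutate; the dataset, population size, and mutation scheme are illustrative assumptions rather than the paper's exact configuration.
```python
# Hedged sketch of evolutionary MLP topology search on a toy dataset.
import random
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = load_digits(return_X_y=True)
random.seed(0)

def score(layers):
    """Fitness of an architecture: 3-fold CV accuracy."""
    clf = MLPClassifier(hidden_layer_sizes=layers, max_iter=200, random_state=0)
    return cross_val_score(clf, X, y, cv=3).mean()

def random_arch():
    return tuple(random.randint(4, 64) for _ in range(random.randint(1, 3)))

pop = [random_arch() for _ in range(8)]
for gen in range(5):
    ranked = sorted(pop, key=score, reverse=True)[:4]   # cull weakest half
    children = [a[:1] + b[1:]                            # naive crossover
                for a, b in zip(ranked, reversed(ranked))]
    pop = ranked + [tuple(max(4, n + random.randint(-8, 8)) for n in c)
                    for c in children]                   # mutate neuron counts
print("best architecture:", max(pop, key=score))
```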
Article
Full-text available
Deep neural networks (DNNs) have been playing a significant role in acoustic modeling. Convolutional neural networks (CNNs) are an advanced version of DNNs that achieve a 4–12% relative gain in word error rate (WER) over DNNs. The existence of spectral variations and local correlations in the speech signal makes CNNs more capable of speech recognition. Recently, it has been demonstrated that bidirectional long short-term memory (BLSTM) produces higher recognition rates in acoustic modeling because it is adequate to reinforce higher-level representations of acoustic data. Spatial and temporal properties of the speech signal are essential for a high recognition rate, which motivates combining the two different networks. In this paper, a hybrid CNN-BLSTM architecture is proposed to appropriately exploit these properties and to improve the continuous speech recognition task. Further, we explore different methods like weight sharing, the appropriate number of hidden units, and the ideal pooling strategy for CNN to achieve a high recognition rate. Specifically, the focus is also on how many BLSTM layers are effective. This paper also attempts to overcome another shortcoming of CNNs, i.e. speaker-adapted features, which cannot be directly modeled in a CNN. Next, various non-linearities with or without dropout are analyzed for speech tasks. Experiments indicate that the proposed hybrid architecture with speaker-adapted features and maxout non-linearity with dropout shows 5.8% and 10% relative decreases in WER over the CNN and DNN systems, respectively.
Conference Paper
Full-text available
We conducted an in situ study of six households in domestic and driving situations in order to better understand how voice assistants (VA) are used and evaluate the efficiency of vocal interactions in natural contexts. The filmed observations and interviews revealed activities of supervision, verification, diagnosis and problem-solving. These activities were not only costly in time, but they also interrupted the flow in the inhabitants’ other activities. Although the VAs were expected to facilitate the accomplishment of a second, simultaneous task, they in fact were a hindrance. Such failures can cause abandonment, but the results nevertheless revealed a paradox of use: the inhabitants forgave and accepted these errors, while continuing to appropriate the vocal system.
Article
Full-text available
Deep Neural Networks (DNN) have become a powerful and extremely popular mechanism, widely used to solve problems of varied complexity due to their ability to build models fitted to non-linear complex problems. Despite their well-known benefits, DNNs are complex learning models whose parametrization and architecture are usually made by hand. This paper proposes a new Evolutionary Algorithm, named EvoDeep, devoted to evolving the parameters and the architecture of a DNN in order to maximize its classification accuracy, as well as maintaining a valid sequence of layers. This model is tested against a widely used dataset of handwritten digit images. The experiments performed using this dataset show that the Evolutionary Algorithm is able to select the parameters and the DNN architecture appropriately, achieving 98.93% accuracy in the best run.
Article
Full-text available
Deep learning allows computational models that are composed of multiple processing layers to learn representations of data with multiple levels of abstraction. These methods have dramatically improved the state-of-the-art in speech recognition, visual object recognition, object detection and many other domains such as drug discovery and genomics. Deep learning discovers intricate structure in large data sets by using the backpropagation algorithm to indicate how a machine should change its internal parameters that are used to compute the representation in each layer from the representation in the previous layer. Deep convolutional nets have brought about breakthroughs in processing images, video, speech and audio, whereas recurrent nets have shone light on sequential data such as text and speech.
Article
Full-text available
Despite the success of the automatic speech recognition framework in its own application field, its adaptation to the problem of acoustic event detection has resulted in limited success. In this paper, instead of treating the problem similar to the segmentation and classification tasks in speech recognition, we pose it as a regression task and propose an approach based on random forest regression. Furthermore, event localization in time can be efficiently handled as a joint problem. We first decompose the training audio signals into multiple interleaved superframes which are annotated with the corresponding event class labels and their displacements to the temporal onsets and offsets of the events. For a specific event category, a random-forest regression model is learned using the displacement information. Given an unseen superframe, the learned regressor will output the continuous estimates of the onset and offset locations of the events. To deal with multiple event categories, prior to the category-specific regression phase, a superframe-wise recognition phase is performed to reject the background superframes and to classify the event superframes into different event categories. While jointly posing event detection and localization as a regression problem is novel, the superior performance on two databases ITC-Irst and UPC-TALP demonstrates the efficiency and potential of the proposed approach.
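A hedged sketch of the regression formulation in this abstract: each superframe's features are mapped to its displacements from the event onset and offset by a random forest regressor; the features and displacement labels below are synthetic stand-ins.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
# Synthetic stand-ins: 500 superframes x 20 acoustic features, each
# labelled with its (onset, offset) displacement in seconds.
features = rng.normal(size=(500, 20))
displacements = np.column_stack([
    np.abs(features[:, 0]) * 0.3,        # toy onset displacement
    np.abs(features[:, 1]) * 0.5,        # toy offset displacement
])

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(features, displacements)       # multi-output regression

new_superframe = rng.normal(size=(1, 20))
onset_disp, offset_disp = model.predict(new_superframe)[0]
print(f"estimated onset {onset_disp:.2f}s and offset {offset_disp:.2f}s away")
```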
Article
Full-text available
Phoneme classification is investigated for linear feature domains with the aim of improving robustness to additive noise. In linear feature domains noise adaptation is exact, potentially leading to more accurate classification than representations involving non-linear processing and dimensionality reduction. A generative framework is developed for isolated phoneme classification using linear features. Initial results are shown for representations consisting of concatenated frames from the centre of the phoneme, each containing f frames. As phonemes have variable duration, no single f is optimal for all phonemes, therefore an average is taken over models with a range of values of f. Results are further improved by including information from the entire phoneme and transitions. In the presence of additive noise, classification in this framework performs better than an analogous PLP classifier, adapted to noise using cepstral mean and variance normalisation, below 18dB SNR. Finally we propose classification using a combination of acoustic waveform and PLP log-likelihoods. The combined classifier performs uniformly better than either of the individual classifiers across all noise levels.
Article
Full-text available
This paper presents the evolution, description, and relevance of the IVTE (Intelligent Virtual Teaching Environment) project in terms of the Artificial Intelligence and Artificial Intelligence in Education fields. Furthermore, it describes the importance of the multi-agent modeling used in the IVTE software and also emphasizes the Cognitive Agent Model represented by an Animated Pedagogical Agent. The purpose of the IVTE software is to educate children to preserve the environment. The IVTE software is implemented with Multi-agent System (MAS) and Intelligent Tutoring System (ITS) technology, which provides more adaptable information for the teaching process. The adaptable information is provided by the Tutor of the ITS or, in other words, by the Animated Pedagogical Agent. The Animated Pedagogical Agent monitors, guides, and individualizes the learning process using a student model and teaching strategies.
Conference Paper
Full-text available
We introduce a new reinforcement learning benchmark based on the classic platform game Super Mario Bros. The benchmark has a high-dimensional input space, and achieving a good score requires sophisticated and varied strategies. However, it has tunable difficulty, and at the lowest difficulty setting a decent score can be achieved using rudimentary strategies and a small fraction of the input space. To investigate the properties of the benchmark, we evolve neural network-based controllers using different network architectures and input spaces. We show that it is relatively easy to learn basic strategies capable of clearing individual levels of low difficulty, but that these controllers have problems with generalization to unseen levels and with taking larger parts of the input space into account. A number of directions worth exploring for learning better-performing strategies are discussed.
Conference Paper
Full-text available
The random forest language model (RFLM) has shown encouraging results in several automatic speech recognition (ASR) tasks but has been hindered by practical limitations, notably the space-complexity of RFLM estimation from large amounts of data. This paper addresses large-scale training and testing of the RFLM via an efficient disk-swapping strategy that exploits the recursive structure of a binary decision tree and the local access property of the tree-growing algorithm, redeeming the full potential of the RFLM, and opening avenues of further research, including useful comparisons with n-gram models. Benefits of this strategy are demonstrated by perplexity reduction and lattice rescoring experiments using a state-of-the-art ASR system.
Article
Full-text available
Digital processing of speech signals and voice recognition algorithms are very important for fast and accurate automatic voice recognition technology. The voice is a signal of infinite information. Direct analysis and synthesis of the complex voice signal is difficult due to the sheer amount of information contained in it. Therefore, digital signal processes such as Feature Extraction and Feature Matching are introduced to represent the voice signal. Several methods such as Linear Predictive Coding (LPC), Hidden Markov Models (HMM), Artificial Neural Networks (ANN), etc. have been evaluated with a view to identifying a straightforward and effective method for the voice signal. The extraction and matching process is implemented right after pre-processing or filtering of the signal is performed. Mel-Frequency Cepstral Coefficients (MFCCs), a non-parametric method for modelling the human auditory perception system, are utilized as the extraction technique. The non-linear sequence alignment known as Dynamic Time Warping (DTW), introduced by Sakoe and Chiba, is used as the feature matching technique. Since the voice signal tends to have a varying temporal rate, the alignment is important to produce better performance. This paper presents the viability of MFCC for extracting features and DTW for comparing the test patterns.
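A minimal sketch of the MFCC-plus-DTW matching pipeline the abstract outlines, assuming librosa for both steps; the file names are placeholders.
```python
# Hedged sketch: extract MFCC matrices from a template and a test
# utterance, then align them with dynamic time warping.
import librosa
import numpy as np

def mfcc_matrix(path, n_mfcc=13):
    y, sr = librosa.load(path, sr=None)
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)

template = mfcc_matrix("reference_word.wav")   # placeholder file
test = mfcc_matrix("spoken_word.wav")          # placeholder file

# Cumulative cost matrix and optimal warping path; the bottom-right cell
# is the total alignment cost, so the template with the lowest cost wins.
D, wp = librosa.sequence.dtw(X=template, Y=test, metric="euclidean")
print("alignment cost:", D[-1, -1] / len(wp))  # normalised by path length
```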
Article
Full-text available
A model for a large network of "neurons" with a graded response (or sigmoid input-output relation) is studied. This deterministic system has collective properties in very close correspondence with the earlier stochastic model based on McCulloch-Pitts neurons. The content-addressable memory and other emergent collective properties of the original model are also present in the graded response model. The idea that such collective properties are used in biological systems is given added credence by the continued presence of such properties for more nearly biological "neurons." Collective analog electrical circuits of the kind described will certainly function. The collective states of the two models have a simple correspondence. The original model will continue to be useful for simulations, because its connection to graded response systems is established. Equations that include the effect of action potentials in the graded response system are also developed.
Book
A fascinating and instructive guide to Markov chains for experienced users and newcomers alike. This unique guide to Markov chains approaches the subject along the four convergent lines of mathematics, implementation, simulation, and experimentation. It introduces readers to the art of stochastic modeling, shows how to design computer implementations, and provides extensive worked examples with case studies. Markov Chains: From Theory to Implementation and Experimentation begins with a general introduction to the history of probability theory in which the author uses quantifiable examples to illustrate how probability theory arrived at the concept of discrete-time and the Markov model from experiments involving independent variables. An introduction to simple stochastic matrices and transition probabilities is followed by a simulation of a two-state Markov chain. The notion of steady state is explored in connection with the long-run distribution behavior of the Markov chain. Predictions based on Markov chains with more than two states are examined, followed by a discussion of the notion of absorbing Markov chains. Also covered in detail are topics relating to the average time spent in a state, various chain configurations, and n-state Markov chain simulations used for verifying experiments involving various diagram configurations.
• Fascinating historical notes shed light on the key ideas that led to the development of the Markov model and its variants
• Various configurations of Markov Chains and their limitations are explored at length
• Numerous examples, from basic to complex, are presented in a comparative manner using a variety of color graphics
• All algorithms presented can be analyzed in either Visual Basic, Java Script, or PHP
• Designed to be useful to professional statisticians as well as readers without extensive knowledge of probability theory
Covering both the theory underlying the Markov model and an array of Markov chain implementations, within a common conceptual framework, Markov Chains: From Theory to Implementation and Experimentation is a stimulating introduction to and a valuable reference for those wishing to deepen their understanding of this extremely valuable statistical tool.
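As a small illustration of the two-state material the book covers, the sketch below finds the stationary distribution of an arbitrary two-state transition matrix by power iteration and verifies it by simulation; the transition probabilities are arbitrary examples.
```python
import numpy as np

# Arbitrary two-state transition matrix: row i holds P(next state | state i).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Steady state by power iteration: repeated multiplication converges to
# the stationary distribution pi satisfying pi = pi @ P.
pi = np.array([1.0, 0.0])
for _ in range(100):
    pi = pi @ P
print("stationary distribution:", pi)        # approx [0.8, 0.2]

# Verify by simulating the chain and measuring time spent in each state.
rng = np.random.default_rng(0)
state, visits = 0, np.zeros(2)
for _ in range(100_000):
    visits[state] += 1
    state = rng.choice(2, p=P[state])
print("empirical occupancy:", visits / visits.sum())
```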
Article
We propose a Bayesian method to detect change points in a sequence of functional observations that are signal functions observed with noises. Since functions have unlimited features, it is natural to think that the sequence of signal functions driving the underlying functional observations change through an evolution process, that is, different features change over time but possibly at different times. A change-point is then viewed as the cumulative effect of changes in many features, so that the dissimilarities in the signal functions before and after the change-points are at the maximum level. In our setting, features are characterized by the wavelet coefficients in their expansion. We consider a Bayesian approach by putting priors independently on the wavelet coefficients of the underlying functions, allowing a change in their values over time. Then we compute the posterior distribution of change point for each sequence of wavelet coefficients, and obtain a measure of overall similarity between two signal functions in this sequence, which leads to the notion of an overall change-point by minimizing the similarity across the change point relative to the similarities within each segment. We study the performance of the proposed method through a simulation study and apply it to a dataset on climate change.
Article
Automatic speech recognition (ASR) has been one of the biggest and hardest challenges in the field. A large majority of research in this area focuses on widely spoken languages such as English. The problems of automatic Lithuanian speech recognition have attracted little attention so far. Due to complicated language structure and scarcity of data, models proposed for other languages such as English cannot be directly adopted for Lithuanian. In this paper we propose an ASR system for the Lithuanian language, which is based on deep learning methods and can identify spoken words purely from their phoneme sequences. Two encoder-decoder models are used to solve the ASR task: a traditional encoder-decoder model and a model with attention mechanism. The performance of these models is evaluated in isolated speech recognition task (with an accuracy of 0.993) and long phrase recognition task (with an accuracy of 0.992).
Article
The hybrid convolutional neural network and hidden Markov model (CNN-HMM) has recently achieved considerable performance in speech recognition because deep neural networks model complex correlations between features. Automatic speech recognition (ASR), as an input to many intelligent and expert systems, has impacts in various fields such as evolving search engines (inclusion of speech recognition in search engines), the healthcare industry (medical reporting by medical personnel, and disease diagnosis expert systems), service delivery, and communication in service providers (to establish the caller's demands and then direct them to the appropriate operator for assistance). This paper introduces a method which further reduces the recognition error rate. We first propose the adaptive windows convolutional neural network (AWCNN) to analyze joint temporal-spectral feature variation. AWCNN makes the model more robust against both intra- and inter-speaker variations. We further propose a new residual learning scheme, which leads to better utilization of information in deep layers and provides better control over the transfer of input information. The proposed speech recognition system can be used as the vocal input for many artificial and expert systems. We evaluated the proposed method on the TIMIT, FARSDAT, Switchboard, and CallHome datasets and one image database, i.e. MNIST. The experimental results show that the proposed method reduces the absolute error rate by 7% compared with the state-of-the-art methods in some speech recognition tasks.
Conference Paper
Recurrent neural networks (RNNs) are a powerful model for sequential data. End-to-end training methods such as Connectionist Temporal Classification make it possible to train RNNs for sequence labelling problems where the input-output alignment is unknown. The combination of these methods with the Long Short-term Memory RNN architecture has proved particularly fruitful, delivering state-of-the-art results in cursive handwriting recognition. However RNN performance in speech recognition has so far been disappointing, with better results returned by deep feedforward networks. This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs. When trained end-to-end with suitable regularisation, we find that deep Long Short-term Memory RNNs achieve a test set error of 17.7% on the TIMIT phoneme recognition benchmark, which to our knowledge is the best recorded score.
Conference Paper
Natural User Interfaces (NUI) are supposed to be used by humans in a very logical way. However, the rush to deploy speech-based NUIs by the industry has had a large impact on the naturality of such interfaces. This paper presents a usability test of the most prestigious and internationally used speech-based NUIs (i.e., Alexa, Siri, Cortana and Google's). A comparison of the services that each one provides was also performed, considering: access to music services, agenda, news, weather, To-Do lists and maps or directions, among others. The test was designed by two Human-Computer Interaction experts and executed by eight persons. Results show that even though there are many services available, there is a lot to do to improve the usability of these systems, especially in separating the traditional use of computers (based on applications that require parameters to function) from getting closer to real NUIs.
Conference Paper
MuMMER (MultiModal Mall Entertainment Robot) is a four-year, EU-funded project with the overall goal of developing a humanoid robot (SoftBank Robotics’ Pepper robot being the primary robot platform) with the social intelligence to interact autonomously and naturally in the dynamic environments of a public shopping mall, providing an engaging and entertaining experience to the general public. Using co-design methods, we will work together with stakeholders including customers, retailers, and business managers to develop truly engaging robot behaviours. Crucially, our robot will exhibit behaviour that is socially appropriate and engaging by combining speech-based interaction with non-verbal communication and human-aware navigation. To support this behaviour, we will develop and integrate new methods from audiovisual scene processing, social-signal processing, high-level action selection, and human-aware robot navigation. Throughout the project, the robot will be regularly deployed in Ideapark, a large public shopping mall in Finland. This position paper describes the MuMMER project: its needs, the objectives, R&D challenges and our approach. It will serve as reference for the robotics community and stakeholders about this ambitious project, demonstrating how a co-design approach can address some of the barriers and help in building follow-up projects.
Conference Paper
Deep Bidirectional LSTM (DBLSTM) recurrent neural networks have recently been shown to give state-of-the-art performance on the TIMIT speech database. However, the results in that work relied on recurrent-neural-network-specific objective functions, which are difficult to integrate with existing large vocabulary speech recognition systems. This paper investigates the use of DBLSTM as an acoustic model in a standard neural network-HMM hybrid system. We find that a DBLSTM-HMM hybrid gives equally good results on TIMIT as the previous work. It also outperforms both GMM and deep network benchmarks on a subset of the Wall Street Journal corpus. However the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy. We conclude that the hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates. Further investigation needs to be conducted to understand how to better leverage the improvements in frame-level accuracy towards better word error rates.
Article
The intelligibility of foreign-accented English was investigated using minimal-pairs contrasts probing a number of different error types. Forty-four native English-speaking listeners were presented with English words, sentences, and a brief passage produced by one of eight native speakers of Mandarin Chinese or one native English speaker. The 190 words were presented to listeners in a minimal-pairs forced-choice task. For the sentences and passage, listeners were instructed to write down what they understood. A feature-based analysis of the minimal-pairs data was performed, with percent correct scores computed for each feature. The sentence and passage data, scored as percent of content words correctly transcribed by listeners, were transformed and used as the dependent variables in two multiple regression analyses, with seven feature scores from the minimal-pairs test (four consonant and three vowel features) used as the independent variables. The seven minimal-pairs variables accounted for approximately 72% of the variance in sentence intelligibility and 49% of that of the passages. Of these seven variables, vowel tenseness, diphthongization, and consonant voicing accounted for 70% of the sentence and 45% of the passage variance. These data suggest that specific segmental error types may have differential effects on intelligibility. [Work supported by NIH-NIDCD Grant No. 2R44DC02213.]
Article
Designing a machine that mimics human behavior, particularly the capability of speaking naturally and responding properly to spoken language, has intrigued engineers and scientists for centuries. Since the 1930s, when Homer Dudley of Bell Laboratories proposed a system model for speech analysis and synthesis (1, 2), the problem of automatic speech recognition has been approached progressively, from a simple machine that responds to a small set of sounds to a sophisticated system that responds to fluently spoken natural language and takes into account the varying statistics of the language in which the speech is produced. Based on major advances in statistical modeling of speech in the 1980s, automatic speech recognition systems today find widespread application in tasks that require a human-machine interface, such as automatic call processing in the telephone network and query-based information systems that do things like provide updated travel information, stock price quotations, weather reports, etc. In this article, we review some major highlights in the research and development of automatic speech recognition during the last few decades so as to provide a technological perspective and an appreciation of the fundamental progress that has been made in this important area of information and communication technology.
Article
The Internet has opened up a range of new communication opportunities for people with special needs since it is an accessible communication medium that provides an opportunity to exchange practical information and support and to experience an accepting relationship with less prejudice. To date, few computer-mediated support intervention programs have been designed especially to support the socio-emotional needs of people with special needs. This paper presents the results of a study that evaluated an electronic mentoring intervention program designed to provide socio-emotional support for protégés with disabilities by mentors who also have disabilities. Using a qualitative research design, the study characterized the electronic mentoring process and its contributions from the mentors' point of view. The findings provided support for the potential of electronic mentoring for personal development and empowerment of young adults with special needs.
Article
A subjective scale for the measurement of pitch was constructed from determinations of the half-value of pitches at various frequencies. This scale differs from both the musical scale and the frequency scale, neither of which is subjective. Five observers fractionated tones of 10 different frequencies and the values were used to construct a numerical scale which is proportional to the perceived magnitude of subjective pitch. The close agreement of this pitch scale with an integration of the DL's for pitch shows that, unlike the DL's for loudness, all DL's for pitch are of uniform subjective magnitude. The agreement further implies that pitch and differential sensitivity to pitch are both rectilinear functions of extent on the basilar membrane, and that in cutting a pitch in half, the observer adjusts the tone until it stimulates a position half-way from the original locus to the apical end of the membrane. Measurement of the subjective size of musical intervals (such as octaves) in terms of the pitch scale shows that the intervals become larger as the frequency of the midpoint of the interval increases (except for very high tones).
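Modern audio feature pipelines approximate a perceptual pitch scale of this kind with an analytic curve fit; the widely used formula below is a later fit to such data, not an equation from this paper.

```python
# The common analytic approximation of the mel scale (a later curve fit to
# perceptual pitch data of this kind, not the 1937 paper's own formula).
import math

def hz_to_mel(f_hz: float) -> float:
    """Map frequency in Hz to mels via the widely used 2595*log10 fit."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def mel_to_hz(m: float) -> float:
    """Inverse mapping, useful for placing mel-spaced filter banks."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(hz_to_mel(1000.0))  # ~1000 mels: 1 kHz is the scale's anchor point
```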
Book
With the proliferation of digital audio distribution over digital media, audio content analysis is fast becoming a requirement for designers of intelligent signal-adaptive audio processing systems. Written by a well-known expert in the field, this book provides quick access to different analysis algorithms and allows comparison between different approaches to the same task, making it useful for newcomers to audio signal processing and industry experts alike. A review of relevant fundamentals in audio signal processing, psychoacoustics, and music theory, as well as downloadable MATLAB files are also included. Please visit the companion website: www.AudioContentAnalysis.org
Conference Paper
Using dedicated hardware to do machine learning typically ends up in disaster because of cost, obsolescence, and poor software. The popularization of graphics processing units (GPUs), which are now available on every PC, provides an attractive alternative. We propose a generic 2-layer fully connected neural network GPU implementation which yields over 3× speedup for both training and testing with respect to a 3 GHz P4 CPU.
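The model in question is essentially two chained matrix multiplications, which is what makes it amenable to GPU execution. The following CPU-side NumPy sketch, with illustrative layer sizes, shows the computation involved; it is not the paper's GPU implementation.

```python
# Minimal NumPy sketch of a 2-layer fully connected network's forward pass.
# The large matrix multiplies are the operations the paper maps to the GPU.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out = 256, 128, 10            # illustrative sizes
W1 = rng.standard_normal((n_in, n_hidden)) * 0.01
W2 = rng.standard_normal((n_hidden, n_out)) * 0.01

def forward(x):
    h = np.tanh(x @ W1)   # hidden layer: one big matrix multiply + nonlinearity
    return h @ W2         # output layer: another matrix multiply

x = rng.standard_normal((32, n_in))             # a batch of 32 inputs
print(forward(x).shape)                         # (32, 10)
```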
Article
Emotional expression and understanding are normal instincts of human beings, but automatic emotion recognition from speech, without reference to any language or linguistic information, remains an open problem. The limited size of existing emotional data samples and their relatively high dimensionality have outstripped many dimensionality reduction and feature selection algorithms. This paper focuses on data preprocessing techniques that aim to extract the most effective acoustic features to improve the performance of emotion recognition. A novel algorithm is presented which can be applied to a small data set with a high number of features. The presented algorithm integrates the advantages of a decision tree method and the random forest ensemble. Experimental results on a series of Chinese emotional speech data sets indicate that the presented algorithm achieves improved results on emotion recognition and outperforms the commonly used Principal Component Analysis (PCA)/Multi-Dimensional Scaling (MDS) methods, as well as the more recently developed Isomap dimensionality reduction method.
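Feature ranking with a random forest, one ingredient of the algorithm described above, can be sketched as follows; the synthetic data, class count, and plain RandomForestClassifier are assumptions standing in for the authors' tree/forest hybrid.

```python
# Sketch of random-forest-based feature selection for a small, wide data set.
# Synthetic data stand in for acoustic features of emotional speech.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.standard_normal((120, 300))   # few samples, many acoustic features
y = rng.integers(0, 4, size=120)      # four assumed emotion classes

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
top = np.argsort(forest.feature_importances_)[::-1][:20]
print("20 most informative feature indices:", top)
X_reduced = X[:, top]                 # keep only the selected features
```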
Conference Paper
Signal classification is an area of much interest in signal processing. Traditional classification methods designed for discrete variables are limited in their power. Here we propose a novel approach for some signal classification problems. It is a combination of three artificial intelligence approaches: a tree-based approach, ensemble voting, and kernel learning. We call this approach kernel-induced random forest (KIRF) for signal data. It is novel with respect to earlier KIRF work because a new type of kernel suitable for signal data is proposed and applied. We use two examples, a phoneme speech data set and a simulated waveform data set, to illustrate its usage and provide evidence of improvement over traditional methods such as neural networks and discriminant methods. Evidence from the data shows that our results are significantly better than those of traditional methods for signal classification.
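The general kernel-plus-forest idea can be illustrated as below: signals are mapped through a kernel against a set of reference signals, and a random forest is grown on the resulting kernel features. The RBF kernel and synthetic data here are placeholders; the paper proposes its own signal-specific kernel.

```python
# Illustrative sketch of a kernel-induced random forest: kernel evaluations
# against reference signals become the features the forest splits on.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
signals = rng.standard_normal((200, 64))  # 200 synthetic signals, 64 samples
labels = rng.integers(0, 2, size=200)
references = signals[:20]                 # reference signals inducing features

K = rbf_kernel(signals, references, gamma=0.05)  # kernel-induced feature matrix
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(K, labels)
print("training accuracy:", clf.score(K, labels))
```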
Article
The application of feed-forward back-propagation artificial neural networks with one hidden layer (ANNs) to perform the equivalent of multiple linear regression (MLR) has been examined using artificial structured data sets and real literature data. The predictive ability of the networks has been estimated using a training/test set protocol. The results have shown advantages of ANNs over MLR analysis. The ANNs do not require high-order terms or indicator variables to establish complex structure-activity relationships. Overfitting does not influence network prediction ability when overtraining is avoided by cross-validation. The application of ANN ensembles has allowed chance correlations to be avoided, and satisfactory predictions of new data have been obtained for a wide range of numbers of neurons in the hidden layer.
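The training/test protocol described above can be sketched as follows; the synthetic data set and hidden layer size are illustrative assumptions, with early stopping on a validation split standing in for the cross-validation used to avoid overtraining.

```python
# Sketch of a one-hidden-layer network versus multiple linear regression,
# scored on held-out data. The data set and layer size are illustrative.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.standard_normal((300, 8))
y = X[:, 0] * X[:, 1] + 0.1 * rng.standard_normal(300)  # nonlinear target

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
mlr = LinearRegression().fit(X_tr, y_tr)
ann = MLPRegressor(hidden_layer_sizes=(16,), early_stopping=True,
                   max_iter=2000, random_state=0).fit(X_tr, y_tr)
print("MLR R^2:", mlr.score(X_te, y_te))  # linear model misses the interaction
print("ANN R^2:", ann.score(X_te, y_te))  # the hidden layer can capture it
```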
Article
Standard Mel-frequency cepstral coefficient (MFCC) computation utilizes the discrete cosine transform (DCT) to decorrelate the log energies of the filter bank outputs. The use of the DCT is reasonable here, as the covariance matrix of the Mel filter bank log energies (MFLE) can be compared with that of a highly correlated Markov-I process. This full-band MFCC computation technique, in which each filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, full-band MFCC features are severely degraded when the speech signal is corrupted with narrow-band channel noise, even though a few filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block-wise transformation of the MFLE which effectively decorrelate the filter bank log energies and also capture speech information in an efficient manner. A thorough study has been carried out on the block-based transformation approach by investigating a new partitioning technique that highlights the associated advantages. This article also reports a novel feature extraction scheme which captures information complementary to wide-band information that otherwise remains undetected by standard MFCC and the proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system. We obtain significant performance improvements over the baseline features for both matched and mismatched conditions, and for standard and narrow-band noises. The proposed method achieves significant performance improvement in the presence of narrow-band noise when combined with a missing-feature-theory-based score computation scheme.
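The full-band baseline that the block transforms improve upon is simply a DCT applied across all Mel filter bank log energies, so that every filter contributes to every coefficient. A minimal sketch with synthetic filter energies:

```python
# Sketch of the standard full-band MFCC step: an orthonormal DCT-II applied
# across all mel filter bank log energies. Filter energies here are synthetic.
import numpy as np
from scipy.fft import dct

rng = np.random.default_rng(0)
mel_log_energies = np.log(rng.random(26) + 1e-6)  # 26 assumed mel filters

# The DCT decorrelates the log energies; keep the first 13 coefficients.
mfcc = dct(mel_log_energies, type=2, norm="ortho")[:13]
print(mfcc)
```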
Article
Ph.D. thesis, University of Edinburgh, 1989.