Thesis

Automated Feature Engineering for Deep Neural Networks with Genetic Programming

Abstract

Feature engineering is a process that augments the feature vector of a machine learning model with calculated values that are designed to enhance the accuracy of a model’s predictions. Research has shown that the accuracy of models such as deep neural networks, support vector machines, and tree/forest-based algorithms sometimes benefits from feature engineering. These engineered features are usually created by expressions that combine one or more of the original features. The choice of the exact structure of an engineered feature depends on the type of machine learning model in use. Previous research demonstrated that various model families benefit from different types of engineered features. Random forests, gradient-boosting machines, and other tree-based models might not see the same accuracy gain from an engineered feature that neural networks, generalized linear models, and other dot-product-based models achieve on the same data set. This dissertation presents a genetic programming-based algorithm that automatically engineers features that increase the accuracy of deep neural networks for some data sets. For a genetic programming algorithm to be effective, it must prioritize the search space and efficiently evaluate what it finds. The algorithm presented in this dissertation faced a potential search space composed of all possible mathematical combinations of the original feature vector. Five experiments were designed to guide the search process to efficiently evolve good engineered features. The result of this dissertation is an automated feature engineering (AFE) algorithm that is computationally efficient, even though a neural network is used to evaluate each candidate feature. This approach gave the algorithm a greater opportunity to specifically target deep neural networks in its search for engineered features that improve accuracy. Finally, a sixth experiment empirically demonstrated the degree to which this algorithm improved the accuracy of neural networks on data sets augmented by the algorithm’s engineered features.
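To make the idea concrete, the sketch below shows one way a genetic-programming loop could evolve a candidate engineered feature and score it with a small neural network, in the spirit of the abstract above. The representation, operators, and hyper-parameters are illustrative assumptions rather than the dissertation's actual algorithm; a real GP implementation would also use crossover and subtree mutation rather than the simple truncate-and-reseed loop shown here.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(42)

# Binary operators over feature columns; division is "protected" to avoid blow-ups.
OPS = {
    "add": lambda a, b: a + b,
    "sub": lambda a, b: a - b,
    "mul": lambda a, b: a * b,
    "div": lambda a, b: a / np.where(np.abs(b) < 1e-6, 1e-6, b),
}

def random_expr(n_features, depth=2):
    """Grow a small random expression tree over the original feature indices."""
    if depth == 0 or rng.random() < 0.3:
        return ("var", int(rng.integers(n_features)))
    op = str(rng.choice(list(OPS)))
    return (op, random_expr(n_features, depth - 1), random_expr(n_features, depth - 1))

def evaluate_expr(expr, X):
    if expr[0] == "var":
        return X[:, expr[1]]
    return OPS[expr[0]](evaluate_expr(expr[1], X), evaluate_expr(expr[2], X))

def fitness(expr, X_tr, y_tr, X_val, y_val):
    """Score a candidate feature by the validation error of a small neural network
    trained on the original features plus the engineered one (higher is better)."""
    f_tr = evaluate_expr(expr, X_tr).reshape(-1, 1)
    f_val = evaluate_expr(expr, X_val).reshape(-1, 1)
    net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=300, random_state=0)
    net.fit(np.hstack([X_tr, f_tr]), y_tr)
    pred = net.predict(np.hstack([X_val, f_val]))
    return -np.mean((pred - y_val) ** 2)

def evolve_feature(X, y, pop_size=20, generations=10):
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)
    pop = [random_expr(X.shape[1]) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda e: fitness(e, X_tr, y_tr, X_val, y_val), reverse=True)
        survivors = pop[: pop_size // 2]
        # Reseed the rest with fresh random trees (a stand-in for crossover/mutation).
        pop = survivors + [random_expr(X.shape[1]) for _ in range(pop_size - len(survivors))]
    return max(pop, key=lambda e: fitness(e, X_tr, y_tr, X_val, y_val))
```

Because every fitness evaluation trains a network, keeping that evaluation network small is what makes such a search tractable, which is the efficiency concern the abstract highlights.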
... They, too, do not estimate the contribution of such features to ML or DL models. [53] proposes applying AFE to automatically generate features that serve as an input layer to a neural network (NN). Leveraging genetic programming, the algorithm explores the mathematical expression space to identify effective features that enhance the accuracy of a given DNN model. ...
Article
Full-text available
In the realm of machine and deep learning (DL) regression tasks, the role of effective feature engineering (FE) is pivotal in enhancing model performance. Traditional approaches to FE often rely on domain expertise to manually design features for machine learning (ML) models. In DL models, the FE is embedded in the neural network’s architecture, making it difficult to interpret. In this study, we propose to integrate symbolic regression (SR) as an FE process before an ML model to improve its performance. We show, through extensive experimentation on synthetic and 21 real-world datasets, that the incorporation of SR-derived features significantly enhances the predictive capabilities of both machine and DL regression models, with 34%–86% root mean square error (RMSE) improvement on synthetic datasets and 4%–11.5% improvement on real-world datasets. In an additional realistic use case, we show the proposed method improves the ML performance in predicting superconducting critical temperatures based on Eliashberg theory by more than 20% in terms of RMSE. These results outline the potential of SR as an FE component in data-driven models, improving them in terms of performance and interpretability.
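A minimal sketch of the workflow this abstract describes, assuming the third-party gplearn library as the symbolic-regression engine and a gradient-boosting model as the downstream regressor; the study's own SR implementation, datasets, and settings may differ.

```python
import numpy as np
from gplearn.genetic import SymbolicTransformer
from sklearn.datasets import make_friedman1
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; the cited study uses synthetic and real-world datasets.
X, y = make_friedman1(n_samples=500, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: evolve a handful of symbolic features from the raw inputs.
sr = SymbolicTransformer(generations=10, population_size=500, n_components=5,
                         function_set=('add', 'sub', 'mul', 'div'), random_state=0)
sr.fit(X_train, y_train)

# Step 2: append the SR-derived features to the original feature vector.
X_train_aug = np.hstack([X_train, sr.transform(X_train)])
X_test_aug = np.hstack([X_test, sr.transform(X_test)])

# Step 3: compare the downstream regressor with and without the new features.
for name, (tr, te) in {"baseline": (X_train, X_test),
                       "with SR features": (X_train_aug, X_test_aug)}.items():
    model = GradientBoostingRegressor(random_state=0).fit(tr, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(te)))
    print(f"{name}: RMSE = {rmse:.4f}")
```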
... Artificial Neural Networks and other deep learning methods have also shown application potential, achieving high accuracy with only 3 hidden layers. While the computational requirements for deep learning hyperparameter tuning are significantly higher than those of a random forest regressor [15], an ANN model can easily be tuned and adjusted for other, similar problems (transfer learning), whereas the majority of machine learning algorithms must be completely retrained for new purposes. ...
Article
Full-text available
The article reviews traditional and modern methods for predicting gas turbine operating characteristics and potential failures. Moreover, a comparison of machine learning based prediction models, including Artificial Neural Networks (ANN), is presented. The research focuses on the High Pressure Compressor (HPC) recoup pressure level of a 4th generation LM2500 gas generator (LM2500+G4) coupled with a 2-stage High Speed Power Turbine Module. The researched parameter is adjustable and may be used to balance the net axial loads exerted on the thrust bearing to ensure stable gas turbine operation, but its direct measurement is technically difficult, implying the need for indirect measurement via a set of other gas turbine sensors. Input data for the research were obtained from BHGE manufactured and monitored gas turbines and consist of real-time data extracted from industrial installations. Machine learning models trained on these data achieve less than 1% Mean Absolute Percentage Error (MAPE) when Random Forest and Gradient Boosting Regression models are used. Multilayer Perceptron Artificial Neural Network (MLP ANN) models are reviewed, and their performance proves inferior to the Random Forest based model. The importance of hyperparameter tuning and feature engineering is discussed.
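The kind of model comparison the article reports can be reproduced in outline as follows; the regression data here is a synthetic stand-in for the proprietary turbine sensor data, and the models and settings are illustrative only.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the turbine sensor data (the real data is proprietary).
X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
y = y + 1000.0  # shift the target away from zero so MAPE is well defined
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
    # Three hidden layers, echoing the ANN configuration mentioned above.
    "MLP ANN": make_pipeline(StandardScaler(),
                             MLPRegressor(hidden_layer_sizes=(64, 64, 64),
                                          max_iter=1000, random_state=0)),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    mape = mean_absolute_percentage_error(y_test, model.predict(X_test))
    print(f"{name}: MAPE = {mape:.2%}")
```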
... An exhaustive literature search (Hodge, & Austin, 2004;Chandola, Banerjee, & Kumar, 2009;Chandola, Cheboli, Kumar 2009;Parmar, & Patel, 2017;Heaton, 2017;Fehst et al., 2018) reveals several types of approaches that use automated feature extraction, also known as automated feature engineering and automated feature learning: ...
Thesis
Full-text available
In order to ensure the validity of sensor data, it must be thoroughly analyzed for various types of anomalies. Traditional machine learning methods of anomaly detections in sensor data are based on domain-specific feature engineering. A typical approach is to use domain knowledge to analyze sensor data and manually create statistics-based features, which are then used to train the machine learning models to detect and classify the anomalies. Although this methodology is used in practice, it has a significant drawback due to the fact that feature extraction is usually labor intensive and requires considerable effort from domain experts. An alternative approach is to use deep learning algorithms. Research has shown that modern deep neural networks are very effective in automated extraction of abstract features from raw data in classification tasks. Long short-term memory networks, or LSTMs in short, are a special kind of recurrent neural networks that are capable of learning long-term dependencies. These networks have proved to be especially effective in the classification of raw time-series data in various domains. This dissertation systematically investigates the effectiveness of the LSTM model for anomaly detection and classification in raw time-series sensor data. As a proof of concept, this work used time-series data of sensors that measure blood glucose levels. A large number of time-series sequences was created based on a genuine medical diabetes dataset. Anomalous series were constructed by six methods that interspersed patterns of common anomaly types in the data. An LSTM network model was trained with k-fold cross-validation on both anomalous and valid series to classify raw time-series sequences into one of seven classes: non-anomalous, and classes corresponding to each of the six anomaly types. As a control, the accuracy of detection and classification of the LSTM was compared to that of four traditional machine learning classifiers: support vector machines, Random Forests, naive Bayes, and shallow neural networks. The performance of all the classifiers was evaluated based on nine metrics: precision, recall, and the F1-score, each measured in micro, macro and weighted perspective. While the traditional models were trained on vectors of features, derived from the raw data, that were based on knowledge of common sources of anomaly, the LSTM was trained on raw time-series data. Experimental results indicate that the performance of the LSTM was comparable to the best traditional classifiers by achieving 99% accuracy in all 9 metrics. The model requires no labor-intensive feature engineering, and the finetuning of its architecture and hyper-parameters can be made in a fully automated way. This study, therefore, finds LSTM networks an effective solution to anomaly detection and classification in sensor data.
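An outline of the classification setup this dissertation describes, sketched with Keras; the sequence length, network size, fold count, and placeholder data are assumptions, not the dissertation's actual configuration (which uses real glucose-sensor series with six injected anomaly types).

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.models import Sequential

SEQ_LEN, N_CLASSES = 96, 7   # raw sensor window length; non-anomalous class + 6 anomaly types

# Placeholder data standing in for labeled raw time-series sequences.
X = np.random.rand(1000, SEQ_LEN, 1).astype("float32")
y = np.random.randint(0, N_CLASSES, size=1000)

def build_model():
    model = Sequential([
        LSTM(64, input_shape=(SEQ_LEN, 1)),       # learns features directly from the raw series
        Dense(N_CLASSES, activation="softmax"),   # one class per anomaly type (plus "valid")
    ])
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# k-fold cross-validation, mirroring the evaluation protocol in the abstract.
scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
    model = build_model()
    model.fit(X[train_idx], y[train_idx], epochs=5, batch_size=64, verbose=0)
    _, acc = model.evaluate(X[test_idx], y[test_idx], verbose=0)
    scores.append(acc)
print("mean CV accuracy:", np.mean(scores))
```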
Thesis
Full-text available
There has been significant progress in the development of techniques to deliver effective technology enhanced learning systems in education, with substantial progress in the field of learning analytics. These analyses are able to support academics in the identification of students at risk of failure or withdrawal. The early identification of students at risk is critical to giving academic staff and institutions the opportunity to make timely interventions. This thesis considers established machine learning techniques, as well as a novel method, for the prediction of student outcomes and the support of interventions, including the presentation of a variety of predictive analyses and of a live experiment. It reviews the status of technology enhanced learning systems and the associated institutional obstacles to their implementation and deployment. Many courses are comprised of relatively small student cohorts, with institutional privacy protocols limiting the data readily available for analysis. It appears that very little research attention has been devoted to this area of analysis and prediction. I present an experiment conducted on a final year university module, with a student cohort of 23, where the data available for prediction is limited to lecture/tutorial attendance, virtual learning environment accesses and intermediate assessments. I apply and compare a variety of machine learning analyses to assess and predict student performance, applied at appropriate points during module delivery. Despite some mixed results, I found potential for predicting student performance in small student cohorts with very limited student attributes, with accuracies comparing favourably with published results using large cohorts and significantly more attributes. I propose that the analyses will be useful to support module leaders in identifying opportunities to make timely academic interventions. Student data may include a combination of nominal and numeric data. A large variety of techniques are available to analyse numeric data, however there are fewer techniques applicable to nominal data. I summarise the results of what I believe to be a novel technique to analyse nominal data by making a systematic comparison of data pairs. In this thesis I have surveyed existing intelligent learning/training systems and explored the contemporary AI techniques which appear to offer the most promising contributions to the prediction of student attainment. I have researched and catalogued the organisational and non-technological challenges to be addressed for successful system development and implementation and proposed a set of critical success criteria to apply. This dissertation is supported by published work.
Article
Full-text available
This paper proposes the design of two coordinated wide-area damping controllers (CWADCs) for damping low frequency oscillations (LFOs), while accounting for the uncertainties present in the power system. The controllers, based on a Deep Neural Network (DNN) and Deep Reinforcement Learning (DRL), respectively, coordinate the operation of different local damping controls such as power system stabilizers (PSSs), static VAr compensators (SVCs), and supplementary damping controllers for DC lines (DC-SDCs). The DNN-CWADC learns to make control decisions using supervised learning; the training dataset consists of polytopic controllers designed with the help of linear matrix inequality (LMI)-based mixed H2/H∞ optimization. The DRL-CWADC learns to adapt to the system uncertainties based on its continuous interaction with the power system environment by employing an advanced version of the state-of-the-art deep deterministic policy gradient (DDPG) algorithm referred to as bounded exploratory control-based DDPG (BEC-DDPG). The studies performed on a 33-machine, 127-bus equivalent model of the Western Electricity Coordinating Council (WECC) system, embedded with different types of damping controls, demonstrate the effectiveness of the proposed CWADCs.
Conference Paper
A great deal of research has focused on algorithms for learning features from unlabeled data. Indeed, much progress has been made on benchmark datasets like NORB and CIFAR by employing increasingly complex unsupervised learning algorithms and deep models. In this paper, however, we show that several very simple factors, such as the number of hidden nodes in the model, may be as important to achieving high performance as the choice of learning algorithm or the depth of the model. Specifically, we will apply several off-the-shelf feature learning algorithms (sparse auto-encoders, sparse RBMs, K-means clustering, and Gaussian mixtures) to the NORB and CIFAR datasets using only single-layer networks. We then present a detailed analysis of the effect of changes in the model setup: the receptive field size, number of hidden nodes (features), the step-size (stride) between extracted features, and the effect of whitening. Our results show that large numbers of hidden nodes and dense feature extraction are as critical to achieving high performance as the choice of algorithm itself; so critical, in fact, that when these parameters are pushed to their limits, we are able to achieve state-of-the-art performance on both CIFAR and NORB using only a single layer of features. More surprisingly, our best performance is based on K-means clustering, which is extremely fast, has no hyper-parameters to tune beyond the model structure itself, and is very easy to implement. Despite the simplicity of our system, we achieve performance beyond all previously published results on the CIFAR-10 and NORB datasets (79.6% and 97.0% accuracy respectively).
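The K-means variant of the pipeline this paper evaluates can be sketched roughly as follows; the patch size, number of centroids, and placeholder images are assumptions, PCA whitening stands in for ZCA whitening, and the final convolutional encoding and pooling steps are only described in a comment.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Placeholder image batch (N, 32, 32): stand-in for CIFAR/NORB grayscale images.
images = rng.random((500, 32, 32)).astype("float32")
PATCH, K = 6, 256   # receptive field size and number of hidden nodes (features)

# 1. Extract random patches and normalize each one (brightness/contrast).
def sample_patches(images, n_patches=20000):
    patches = np.empty((n_patches, PATCH * PATCH), dtype="float32")
    for i in range(n_patches):
        img = images[rng.integers(len(images))]
        r, c = rng.integers(32 - PATCH, size=2)
        patches[i] = img[r:r + PATCH, c:c + PATCH].ravel()
    patches -= patches.mean(axis=1, keepdims=True)
    patches /= patches.std(axis=1, keepdims=True) + 1e-8
    return patches

patches = sample_patches(images)

# 2. Whiten the patches (PCA whitening as a stand-in for ZCA whitening).
whitener = PCA(whiten=True).fit(patches)
patches_w = whitener.transform(patches)

# 3. Learn a dictionary of K centroids with (mini-batch) K-means.
kmeans = MiniBatchKMeans(n_clusters=K, random_state=0).fit(patches_w)

# 4. Encode a whitened patch with a "triangle" activation:
#    f_k(x) = max(0, mean_distance - distance_to_centroid_k)
def encode(patch_w):
    d = np.linalg.norm(kmeans.cluster_centers_ - patch_w, axis=1)
    return np.maximum(0.0, d.mean() - d)

# Per-image features would then be built by encoding every patch at a given
# stride and sum-pooling the codes over image quadrants.
print(encode(patches_w[0]).shape)   # (K,)
```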
Article
Machine learning models, such as neural networks, decision trees, random forests and gradient boosting machines accept a feature vector and provide a prediction. These models learn in a supervised fashion where a set of feature vectors with expected output is provided. It is very common practice to engineer new features from the provided feature set. Such engineered features will either augment, or replace portions of the existing feature vector. These engineered features are essentially calculated fields, based on the values of the other features. Engineering such features is primarily a manual, time-consuming task. Additionally, each type of model will respond differently to different types of engineered features. This paper reports on empirical research to demonstrate what types of engineered features are best suited to which machine learning model type. This is accomplished by generating several datasets that are designed to benefit from a particular type of engineered feature. The experiment demonstrates to what degree the machine learning model is capable of synthesizing the needed feature on its own. If a model is capable of synthesizing an engineered feature, it is not necessary to provide that feature. The research demonstrated that the studied models do indeed perform differently with various types of engineered features.
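The experimental idea is easy to illustrate: generate a target that depends on a ratio of two inputs, then check whether each model family improves when the ratio is provided explicitly. The setup below is a toy illustration under assumed data and model settings, not one of the paper's actual generated datasets.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Dataset designed to benefit from a ratio feature: y = x1 / x2 (plus noise).
X = rng.uniform(1.0, 10.0, size=(2000, 5))
y = X[:, 0] / X[:, 1] + rng.normal(0, 0.01, size=2000)

# Engineered ratio feature appended to the original feature vector.
X_aug = np.hstack([X, (X[:, 0] / X[:, 1]).reshape(-1, 1)])

for label, data in [("original features", X), ("with engineered ratio", X_aug)]:
    X_tr, X_te, y_tr, y_te = train_test_split(data, y, random_state=0)
    for name, model in [("random forest", RandomForestRegressor(random_state=0)),
                        ("neural network", MLPRegressor(hidden_layer_sizes=(32, 32),
                                                        max_iter=2000, random_state=0))]:
        model.fit(X_tr, y_tr)
        rmse = np.sqrt(mean_squared_error(y_te, model.predict(X_te)))
        print(f"{name}, {label}: RMSE = {rmse:.4f}")
```

If a model family can synthesize the ratio on its own, its two RMSE values will be close; a large gap indicates the engineered feature is genuinely needed for that family.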
Chapter
This article presents SimpleDS, a simple and publicly available dialogue system trained with deep reinforcement learning. In contrast to previous reinforcement learning dialogue systems, this system avoids manual feature engineering by performing action selection directly from raw text of the last system and (noisy) user responses. Our initial results, in the restaurant domain, report that it is indeed possible to induce reasonable behaviours with such an approach that aims for higher levels of automation in dialogue control for intelligent interactive systems and robots.
Chapter
We introduce a fully automated process for sentiment analysis of short texts in the Polish language. The process consists of (a) generating an emotion lexicon from annotated Twitter messages, (b) building a sentiment dataset using the annotated messages and the generated lexicon, (c) training a NEAT genetic algorithm on the previously prepared dataset, and (d) a final evaluation using 10-fold cross-validation. We show that this method provides good results and can be used to simplify sentiment analysis processes for Polish-language content.
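Steps (a), (b), and (d) of that pipeline can be sketched as follows; the tiny placeholder corpus, the lexicon construction, and the logistic-regression stand-in for the NEAT-evolved network in step (c) are all illustrative assumptions rather than the chapter's actual method.

```python
from collections import Counter

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Tiny placeholder corpus of annotated messages (label 1 = positive, 0 = negative).
tweets = [("swietny produkt polecam", 1), ("fatalna obsluga nie polecam", 0),
          ("bardzo dobra jakosc", 1), ("zepsuty po tygodniu", 0)] * 50

# (a) Generate an emotion lexicon: per-word polarity estimated from the annotations.
pos, neg = Counter(), Counter()
for text, label in tweets:
    (pos if label == 1 else neg).update(text.split())
lexicon = {w: (pos[w] - neg[w]) / (pos[w] + neg[w]) for w in set(pos) | set(neg)}

# (b) Build the sentiment dataset: simple lexicon-based features per message.
def featurize(text):
    scores = [lexicon.get(w, 0.0) for w in text.split()]
    return [np.sum(scores), np.mean(scores), np.min(scores), np.max(scores)]

X = np.array([featurize(t) for t, _ in tweets])
y = np.array([label for _, label in tweets])

# (c)/(d) The cited work evolves a NEAT network here; logistic regression is used
# as a stand-in, evaluated with 10-fold cross-validation as in step (d).
print(cross_val_score(LogisticRegression(), X, y, cv=10).mean())
```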